Re: [BUG nohz]: wrong user and system time accounting
On Mon, May 15, 2017 at 04:17:10PM +0800, Wanpeng Li wrote: > Ping, Sorry for the late answer, I was focused on some other bugs. So since my ideas weren't even clear on that issue yet, I took your patch and enhanced the code around. I just posted a new series with it, please have a look. Thanks.
Re: [BUG nohz]: wrong user and system time accounting
Ping, 2017-05-02 18:01 GMT+08:00 Wanpeng Li : > Cc Paolo, > 2017-04-13 21:32 GMT+08:00 Frederic Weisbecker : >> On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote: >>> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner : >>> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: >>> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: >>> >> > It's not different from the current jiffies based stuff at all. Same >>> >> > failure mode. >>> >> >>> >> Yes you're right, I got confused again. So to fix this we could do our >>> >> snapshots >>> >> at a frequency lower than HZ but still high enough to avoid overhead. >>> >> >>> >> Something like TICK_NSEC / 2 ? >>> > >>> > If you are using TSC anyway then you can do proper accumulation for both >>> > system and user and only account the data when the accumulation is more >>> > than a jiffie. >>> >>> So I implement it as below: >>> >>> - HZ=1000. >>> 1) two cpu hogs on cpu in nohz_full mode, 100% user time >>> 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected) >>> - HZ=250 >>>1) two cpu hogs on cpu in nohz_full mode, 100% user time >>>2) Luzi's testcase, 100% idle >>> >>> So the codes below still not work correctly for HZ=250, any suggestions? >> >> Right, so first lets reorder that code a bit so we can see clear inside :-) >> >>> >>> -->8- >>> >>> diff --git a/include/linux/sched.h b/include/linux/sched.h >>> index d67eee8..6a11771 100644 >>> --- a/include/linux/sched.h >>> +++ b/include/linux/sched.h >>> @@ -668,6 +668,8 @@ struct task_struct { >>> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >>> seqcount_tvtime_seqcount; >>> unsigned long longvtime_snap; >>> +u64vtime_acct_utime; >>> +u64vtime_acct_stime; >> >> You need to accumulate guest and steal time as well. >> > > Hi Frederic, > > Sorry for the delay since I'm too busy recently, I just add guest time > and idle time accumulations as below, the code work as we expected for > native kernel, however, the testcase fails when it runs in kvm guest. > Top shows ~99% sys for Luzi's testcase "./acct-bug 1 995" which we > expect 95% user and %5 idle. In addition, what's the design idea of > steal time accumluation in your mind? Pass the tsk parameter in the > function get_vtime_delta() down to the function > steal_account_process_time()? > > -->8- > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4cf9a59..56815cd 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -672,6 +672,10 @@ struct task_struct { > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > seqcount_tvtime_seqcount; > unsigned long longvtime_snap; > +u64vtime_acct_utime; > +u64vtime_acct_stime; > +u64vtime_acct_idle_time; > +u64vtime_acct_guest_time; > enum { > /* Task is sleeping or running in a CPU with VTIME inactive: */ > VTIME_INACTIVE = 0, > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..2d950c6 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -676,18 +676,19 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > static u64 vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > +unsigned long long clock; > > -if (time_before(now, (unsigned long)tsk->vtime_snap)) > +clock = sched_clock(); > +if (clock < tsk->vtime_snap) > return 0; > > -return jiffies_to_nsecs(now - tsk->vtime_snap); > +return clock - tsk->vtime_snap; > } > > static u64 get_vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > -u64 delta, other; > +u64 delta = vtime_delta(tsk); > +u64 other; > > /* > * Unlike tick based timing, vtime based timing never has lost > @@ -696,17 +697,16 @@ static u64 get_vtime_delta(struct task_struct *tsk) > * elapsed time. Limit account_other_time to prevent rounding > * errors from causing elapsed vtime to go negative. > */ > -delta = jiffies_to_nsecs(now - tsk->vtime_snap); > other = account_other_time(delta); > WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); > -tsk->vtime_snap = now; > +tsk->vtime_snap += delta; > > return delta - other; > } > > static void __vtime_account_system(struct task_struct *tsk) > { > -account_system_time(tsk, irq_count(), get_vtime_delta(tsk)); > +account_system_time(tsk, irq_count(), tsk->vtime_acct_stime); > } > > void vtime_account_system(struct task_struct *tsk) > @@ -715,7 +715,11 @@ void vtime_account_system(struct task_struct *tsk) > return; > > w
Re: [BUG nohz]: wrong user and system time accounting
Cc Paolo, 2017-04-13 21:32 GMT+08:00 Frederic Weisbecker : > On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote: >> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner : >> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: >> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: >> >> > It's not different from the current jiffies based stuff at all. Same >> >> > failure mode. >> >> >> >> Yes you're right, I got confused again. So to fix this we could do our >> >> snapshots >> >> at a frequency lower than HZ but still high enough to avoid overhead. >> >> >> >> Something like TICK_NSEC / 2 ? >> > >> > If you are using TSC anyway then you can do proper accumulation for both >> > system and user and only account the data when the accumulation is more >> > than a jiffie. >> >> So I implement it as below: >> >> - HZ=1000. >> 1) two cpu hogs on cpu in nohz_full mode, 100% user time >> 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected) >> - HZ=250 >>1) two cpu hogs on cpu in nohz_full mode, 100% user time >>2) Luzi's testcase, 100% idle >> >> So the codes below still not work correctly for HZ=250, any suggestions? > > Right, so first lets reorder that code a bit so we can see clear inside :-) > >> >> -->8- >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index d67eee8..6a11771 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -668,6 +668,8 @@ struct task_struct { >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >> seqcount_tvtime_seqcount; >> unsigned long longvtime_snap; >> +u64vtime_acct_utime; >> +u64vtime_acct_stime; > > You need to accumulate guest and steal time as well. > Hi Frederic, Sorry for the delay since I'm too busy recently, I just add guest time and idle time accumulations as below, the code work as we expected for native kernel, however, the testcase fails when it runs in kvm guest. Top shows ~99% sys for Luzi's testcase "./acct-bug 1 995" which we expect 95% user and %5 idle. In addition, what's the design idea of steal time accumluation in your mind? Pass the tsk parameter in the function get_vtime_delta() down to the function steal_account_process_time()? -->8- diff --git a/include/linux/sched.h b/include/linux/sched.h index 4cf9a59..56815cd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -672,6 +672,10 @@ struct task_struct { #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN seqcount_tvtime_seqcount; unsigned long longvtime_snap; +u64vtime_acct_utime; +u64vtime_acct_stime; +u64vtime_acct_idle_time; +u64vtime_acct_guest_time; enum { /* Task is sleeping or running in a CPU with VTIME inactive: */ VTIME_INACTIVE = 0, diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index f3778e2b..2d950c6 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -676,18 +676,19 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st) #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN static u64 vtime_delta(struct task_struct *tsk) { -unsigned long now = READ_ONCE(jiffies); +unsigned long long clock; -if (time_before(now, (unsigned long)tsk->vtime_snap)) +clock = sched_clock(); +if (clock < tsk->vtime_snap) return 0; -return jiffies_to_nsecs(now - tsk->vtime_snap); +return clock - tsk->vtime_snap; } static u64 get_vtime_delta(struct task_struct *tsk) { -unsigned long now = READ_ONCE(jiffies); -u64 delta, other; +u64 delta = vtime_delta(tsk); +u64 other; /* * Unlike tick based timing, vtime based timing never has lost @@ -696,17 +697,16 @@ static u64 get_vtime_delta(struct task_struct *tsk) * elapsed time. Limit account_other_time to prevent rounding * errors from causing elapsed vtime to go negative. */ -delta = jiffies_to_nsecs(now - tsk->vtime_snap); other = account_other_time(delta); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); -tsk->vtime_snap = now; +tsk->vtime_snap += delta; return delta - other; } static void __vtime_account_system(struct task_struct *tsk) { -account_system_time(tsk, irq_count(), get_vtime_delta(tsk)); +account_system_time(tsk, irq_count(), tsk->vtime_acct_stime); } void vtime_account_system(struct task_struct *tsk) @@ -715,7 +715,11 @@ void vtime_account_system(struct task_struct *tsk) return; write_seqcount_begin(&tsk->vtime_seqcount); -__vtime_account_system(tsk); +tsk->vtime_acct_stime += get_vtime_delta(tsk); +if (tsk->vtime_acct_stime >= TICK_NSEC) { +__vtime_account_system(tsk); +tsk->vtime_acct_stime = 0;
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote: > 2017-04-12 22:57 GMT+08:00 Thomas Gleixner : > > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: > >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: > >> > It's not different from the current jiffies based stuff at all. Same > >> > failure mode. > >> > >> Yes you're right, I got confused again. So to fix this we could do our > >> snapshots > >> at a frequency lower than HZ but still high enough to avoid overhead. > >> > >> Something like TICK_NSEC / 2 ? > > > > If you are using TSC anyway then you can do proper accumulation for both > > system and user and only account the data when the accumulation is more > > than a jiffie. > > So I implement it as below: > > - HZ=1000. > 1) two cpu hogs on cpu in nohz_full mode, 100% user time > 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected) > - HZ=250 >1) two cpu hogs on cpu in nohz_full mode, 100% user time >2) Luzi's testcase, 100% idle > > So the codes below still not work correctly for HZ=250, any suggestions? Right, so first lets reorder that code a bit so we can see clear inside :-) > > -->8- > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index d67eee8..6a11771 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -668,6 +668,8 @@ struct task_struct { > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > seqcount_tvtime_seqcount; > unsigned long longvtime_snap; > +u64vtime_acct_utime; > +u64vtime_acct_stime; You need to accumulate guest and steal time as well. > enum { > /* Task is sleeping or running in a CPU with VTIME inactive: */ > VTIME_INACTIVE = 0, > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..f8e54ba 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ > > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > -static u64 vtime_delta(struct task_struct *tsk) > +static u64 vtime_delta(struct task_struct *tsk, bool user) > { > -unsigned long now = READ_ONCE(jiffies); > +u64 delta, ret = 0; > > -if (time_before(now, (unsigned long)tsk->vtime_snap)) > -return 0; > +delta = sched_clock() - tsk->vtime_snap; > > -return jiffies_to_nsecs(now - tsk->vtime_snap); > +if (is_idle_task(tsk)) { > +if (delta >= TICK_NSEC) > +ret = delta; > +} else { > +if (user) { > +tsk->vtime_acct_utime += delta; > +if (tsk->vtime_acct_utime >= TICK_NSEC) > +ret = tsk->vtime_acct_utime; > +} else { > +tsk->vtime_acct_stime += delta; > +if (tsk->vtime_acct_utime >= TICK_NSEC) > +ret = tsk->vtime_acct_stime; > +} We already have vtime_account_idle, vtime_account_user, etc... The accumulation should be done by these functions that know what and where to account. vtime_delta() should really just return the difference against vtime_snap, it's too low level to care about these details. > +} > + > +return ret; > } > > -static u64 get_vtime_delta(struct task_struct *tsk) > +static u64 get_vtime_delta(struct task_struct *tsk, bool user) > { > -unsigned long now = READ_ONCE(jiffies); > -u64 delta, other; > +u64 delta = vtime_delta(tsk, user); > +u64 other; > + > +if (!is_idle_task(tsk)) { > +if (user) > +tsk->vtime_acct_utime = 0; > +else > +tsk->vtime_acct_stime = 0; > +} Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators. Reset and accounting really should be done by the upper level functions vtime_account_*() Thanks.
Re: [BUG nohz]: wrong user and system time accounting
2017-04-12 22:57 GMT+08:00 Thomas Gleixner : > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: >> > It's not different from the current jiffies based stuff at all. Same >> > failure mode. >> >> Yes you're right, I got confused again. So to fix this we could do our >> snapshots >> at a frequency lower than HZ but still high enough to avoid overhead. >> >> Something like TICK_NSEC / 2 ? > > If you are using TSC anyway then you can do proper accumulation for both > system and user and only account the data when the accumulation is more > than a jiffie. So I implement it as below: - HZ=1000. 1) two cpu hogs on cpu in nohz_full mode, 100% user time 2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected) - HZ=250 1) two cpu hogs on cpu in nohz_full mode, 100% user time 2) Luzi's testcase, 100% idle So the codes below still not work correctly for HZ=250, any suggestions? -->8- diff --git a/include/linux/sched.h b/include/linux/sched.h index d67eee8..6a11771 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -668,6 +668,8 @@ struct task_struct { #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN seqcount_tvtime_seqcount; unsigned long longvtime_snap; +u64vtime_acct_utime; +u64vtime_acct_stime; enum { /* Task is sleeping or running in a CPU with VTIME inactive: */ VTIME_INACTIVE = 0, diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index f3778e2b..f8e54ba 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st) #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN -static u64 vtime_delta(struct task_struct *tsk) +static u64 vtime_delta(struct task_struct *tsk, bool user) { -unsigned long now = READ_ONCE(jiffies); +u64 delta, ret = 0; -if (time_before(now, (unsigned long)tsk->vtime_snap)) -return 0; +delta = sched_clock() - tsk->vtime_snap; -return jiffies_to_nsecs(now - tsk->vtime_snap); +if (is_idle_task(tsk)) { +if (delta >= TICK_NSEC) +ret = delta; +} else { +if (user) { +tsk->vtime_acct_utime += delta; +if (tsk->vtime_acct_utime >= TICK_NSEC) +ret = tsk->vtime_acct_utime; +} else { +tsk->vtime_acct_stime += delta; +if (tsk->vtime_acct_utime >= TICK_NSEC) +ret = tsk->vtime_acct_stime; +} +} + +return ret; } -static u64 get_vtime_delta(struct task_struct *tsk) +static u64 get_vtime_delta(struct task_struct *tsk, bool user) { -unsigned long now = READ_ONCE(jiffies); -u64 delta, other; +u64 delta = vtime_delta(tsk, user); +u64 other; + +if (!is_idle_task(tsk)) { +if (user) +tsk->vtime_acct_utime = 0; +else +tsk->vtime_acct_stime = 0; +} /* * Unlike tick based timing, vtime based timing never has lost @@ -696,22 +717,21 @@ static u64 get_vtime_delta(struct task_struct *tsk) * elapsed time. Limit account_other_time to prevent rounding * errors from causing elapsed vtime to go negative. */ -delta = jiffies_to_nsecs(now - tsk->vtime_snap); other = account_other_time(delta); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); -tsk->vtime_snap = now; +tsk->vtime_snap += delta; return delta - other; } static void __vtime_account_system(struct task_struct *tsk) { -account_system_time(tsk, irq_count(), get_vtime_delta(tsk)); +account_system_time(tsk, irq_count(), get_vtime_delta(tsk, false)); } void vtime_account_system(struct task_struct *tsk) { -if (!vtime_delta(tsk)) +if (!vtime_delta(tsk, false)) return; write_seqcount_begin(&tsk->vtime_seqcount); @@ -723,15 +743,15 @@ void vtime_account_user(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); tsk->vtime_snap_whence = VTIME_SYS; -if (vtime_delta(tsk)) -account_user_time(tsk, get_vtime_delta(tsk)); +if (vtime_delta(tsk, true)) +account_user_time(tsk, get_vtime_delta(tsk, true)); write_seqcount_end(&tsk->vtime_seqcount); } void vtime_user_enter(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); -if (vtime_delta(tsk)) +if (vtime_delta(tsk, false)) __vtime_account_system(tsk); tsk->vtime_snap_whence = VTIME_USER; write_seqcount_end(&tsk->vtime_seqcount); @@ -747,7 +767,7 @@ void vtime_guest_enter(struct task_struct *tsk) * that can thus safely catch up with a tickless delta. */ write_seqcount_begin(&tsk->vtime_seqcount); -if (vtime_delta(tsk)) +if (vtime_delta(tsk, false))
Re: [BUG nohz]: wrong user and system time accounting
On Wed, Apr 12, 2017 at 04:57:58PM +0200, Thomas Gleixner wrote: > On Wed, 12 Apr 2017, Frederic Weisbecker wrote: > > On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: > > > It's not different from the current jiffies based stuff at all. Same > > > failure mode. > > > > Yes you're right, I got confused again. So to fix this we could do our > > snapshots > > at a frequency lower than HZ but still high enough to avoid overhead. > > > > Something like TICK_NSEC / 2 ? > > If you are using TSC anyway then you can do proper accumulation for both > system and user and only account the data when the accumulation is more > than a jiffie. Sounds nice, and accumulation shouldn't introduce too much overhead. Let's try that. Thanks. > > Thanks, > > tglx >
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 12 Apr 2017, Frederic Weisbecker wrote: > On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: > > It's not different from the current jiffies based stuff at all. Same > > failure mode. > > Yes you're right, I got confused again. So to fix this we could do our > snapshots > at a frequency lower than HZ but still high enough to avoid overhead. > > Something like TICK_NSEC / 2 ? If you are using TSC anyway then you can do proper accumulation for both system and user and only account the data when the accumulation is more than a jiffie. Thanks, tglx
Re: [BUG nohz]: wrong user and system time accounting
On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote: > On Thu, 30 Mar 2017, Wanpeng Li wrote: > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > > index f3778e2b..f1ee393 100644 > > --- a/kernel/sched/cputime.c > > +++ b/kernel/sched/cputime.c > > @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct > > task_struct *p, u64 *ut, u64 *st) > > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > > static u64 vtime_delta(struct task_struct *tsk) > > { > > -unsigned long now = READ_ONCE(jiffies); > > +u64 now = local_clock(); > > +u64 delta; > > + > > +delta = now - tsk->vtime_snap; > > > > -if (time_before(now, (unsigned long)tsk->vtime_snap)) > > +if (delta < TICK_NSEC) > > return 0; > > > > -return jiffies_to_nsecs(now - tsk->vtime_snap); > > +return jiffies_to_nsecs(delta / TICK_NSEC); > > So you replaced a jiffies based approach with a jiffies based approach. > > > } > > > > static u64 get_vtime_delta(struct task_struct *tsk) > > { > > -unsigned long now = READ_ONCE(jiffies); > > -u64 delta, other; > > +u64 delta = vtime_delta(tsk); > > +u64 other; > > > > /* > > * Unlike tick based timing, vtime based timing never has lost > > @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk) > > * elapsed time. Limit account_other_time to prevent rounding > > * errors from causing elapsed vtime to go negative. > > */ > > -delta = jiffies_to_nsecs(now - tsk->vtime_snap); > > other = account_other_time(delta); > > WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); > > -tsk->vtime_snap = now; > > +tsk->vtime_snap += delta; > > Here is how it works^Wfails > > For simplicity tsk->vtime_snap starts at 0 > HZ = 1000 > > CPU0 CPU1 > sysexit() >account_system() > now == 0 > delta = vtime_delta()<- 0ns > tsk->vtime_snap += delta;== 0ns > > busy_loop(995us) > > sysenter() >now == 996us >account_user() > delta = vtime_delta()<- 0ns > tsk->vtime_snap += delta == 0ns > > sysexit() >account_system() > now == 1001us > delta = vtime_delta()<- 1000ns > > Gets accounted to system > > tsk->vtime_snap += delta;== 1000ns > > It's not different from the current jiffies based stuff at all. Same > failure mode. Yes you're right, I got confused again. So to fix this we could do our snapshots at a frequency lower than HZ but still high enough to avoid overhead. Something like TICK_NSEC / 2 ?
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 30 Mar 2017, Wanpeng Li wrote: > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..f1ee393 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > static u64 vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > +u64 now = local_clock(); > +u64 delta; > + > +delta = now - tsk->vtime_snap; > > -if (time_before(now, (unsigned long)tsk->vtime_snap)) > +if (delta < TICK_NSEC) > return 0; > > -return jiffies_to_nsecs(now - tsk->vtime_snap); > +return jiffies_to_nsecs(delta / TICK_NSEC); So you replaced a jiffies based approach with a jiffies based approach. > } > > static u64 get_vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > -u64 delta, other; > +u64 delta = vtime_delta(tsk); > +u64 other; > > /* > * Unlike tick based timing, vtime based timing never has lost > @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk) > * elapsed time. Limit account_other_time to prevent rounding > * errors from causing elapsed vtime to go negative. > */ > -delta = jiffies_to_nsecs(now - tsk->vtime_snap); > other = account_other_time(delta); > WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); > -tsk->vtime_snap = now; > +tsk->vtime_snap += delta; Here is how it works^Wfails For simplicity tsk->vtime_snap starts at 0 HZ = 1000 CPU0 CPU1 sysexit() account_system() now == 0 delta = vtime_delta()<- 0ns tsk->vtime_snap += delta;== 0ns busy_loop(995us) sysenter() now == 996us account_user() delta = vtime_delta()<- 0ns tsk->vtime_snap += delta == 0ns sysexit() account_system() now == 1001us delta = vtime_delta()<- 1000ns Gets accounted to system tsk->vtime_snap += delta;== 1000ns It's not different from the current jiffies based stuff at all. Same failure mode. Thanks, tglx
Re: [BUG nohz]: wrong user and system time accounting
2017-04-11 19:36 GMT+08:00 Peter Zijlstra : > On Tue, Apr 11, 2017 at 07:03:17PM +0800, Wanpeng Li wrote: >> 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : >> > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: >> >> [...] >> >> > >> >> >> >> -->8- >> >> >> >> use nanosecond granularity to check deltas but only perform an actual >> >> cputime update when that delta >= TICK_NSEC. >> >> >> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> >> index f3778e2b..f1ee393 100644 >> >> --- a/kernel/sched/cputime.c >> >> +++ b/kernel/sched/cputime.c >> >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct >> >> task_struct *p, u64 *ut, u64 *st) >> >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >> >> static u64 vtime_delta(struct task_struct *tsk) >> >> { >> >> -unsigned long now = READ_ONCE(jiffies); >> >> +u64 now = local_clock(); >> > >> > I fear we need a global clock, because the reader (task_cputime()) needs >> > to compute the delta and therefore use the same clock from any CPU. >> > >> > Or we can use the local_clock() but the reader must access the same. >> > >> > So there would be vtime_delta_writer() which uses local_clock and stores >> > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then >> > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which >> > is protected by vtime_seqcount as well. >> > >> > Although those sched_clock_cpu() things seem to only matter when the >> > sched_clock() is unstable. And that stability is a condition for nohz_full >> > to work anyway. So probably sched_clock() alone would be enough. >> >> I observed ~60% user time and ~40% sys time when replace local_clock() >> above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In >> addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time. >> If keep local_clock() in vtime_delta(), cpu hogs testcase will >> success. However, Luiz's testcase still show 100% idle time. > > Assuming a stable TSC, there should be no difference between > local_clock() and sched_clock(). So it is weird. I did't see any unstable tsc dump in dmesg. Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
On Tue, Apr 11, 2017 at 07:03:17PM +0800, Wanpeng Li wrote: > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: > > [...] > > > > >> > >> -->8- > >> > >> use nanosecond granularity to check deltas but only perform an actual > >> cputime update when that delta >= TICK_NSEC. > >> > >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > >> index f3778e2b..f1ee393 100644 > >> --- a/kernel/sched/cputime.c > >> +++ b/kernel/sched/cputime.c > >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct > >> task_struct *p, u64 *ut, u64 *st) > >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > >> static u64 vtime_delta(struct task_struct *tsk) > >> { > >> -unsigned long now = READ_ONCE(jiffies); > >> +u64 now = local_clock(); > > > > I fear we need a global clock, because the reader (task_cputime()) needs > > to compute the delta and therefore use the same clock from any CPU. > > > > Or we can use the local_clock() but the reader must access the same. > > > > So there would be vtime_delta_writer() which uses local_clock and stores > > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then > > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which > > is protected by vtime_seqcount as well. > > > > Although those sched_clock_cpu() things seem to only matter when the > > sched_clock() is unstable. And that stability is a condition for nohz_full > > to work anyway. So probably sched_clock() alone would be enough. > > I observed ~60% user time and ~40% sys time when replace local_clock() > above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In > addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time. > If keep local_clock() in vtime_delta(), cpu hogs testcase will > success. However, Luiz's testcase still show 100% idle time. Assuming a stable TSC, there should be no difference between local_clock() and sched_clock().
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: [...] > >> >> -->8- >> >> use nanosecond granularity to check deltas but only perform an actual >> cputime update when that delta >= TICK_NSEC. >> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> index f3778e2b..f1ee393 100644 >> --- a/kernel/sched/cputime.c >> +++ b/kernel/sched/cputime.c >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct >> task_struct *p, u64 *ut, u64 *st) >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >> static u64 vtime_delta(struct task_struct *tsk) >> { >> -unsigned long now = READ_ONCE(jiffies); >> +u64 now = local_clock(); > > I fear we need a global clock, because the reader (task_cputime()) needs > to compute the delta and therefore use the same clock from any CPU. > > Or we can use the local_clock() but the reader must access the same. > > So there would be vtime_delta_writer() which uses local_clock and stores > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which > is protected by vtime_seqcount as well. > > Although those sched_clock_cpu() things seem to only matter when the > sched_clock() is unstable. And that stability is a condition for nohz_full > to work anyway. So probably sched_clock() alone would be enough. I observed ~60% user time and ~40% sys time when replace local_clock() above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time. If keep local_clock() in vtime_delta(), cpu hogs testcase will success. However, Luiz's testcase still show 100% idle time. Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 2017-04-04 at 13:36 -0400, Luiz Capitulino wrote: > > On further debugging this, I realized that I had overlooked > something: > the timer interrupt in this trace is not the tick, but cyclictest's > timer > (remember that the test-case consists of pinning cyclictest and a > task > hogging the CPU to the same CPU). > > I'm running cyclictest with -i 200. If I increase this to -i 1000, > then > I seem unable to reproduce the issue (caution: even with -i 200 it > doesn't always happen. But it does usually happen after I restart the > test-case a few times. However, I've never been able to reproduce > with -i 1000). > > Now, if it's really cyclictest that's causing the timer interrupts to > get aligned, I guess this might not have a solution? (note: I haven't > been able to reproduce this on bare-metal). With any sample (tick) based timekeeping, it is possible to construct workloads that avoid the sampling and result in skewed statistics as a result. However, given that local users can already DoS the system in all kinds of ways, skewed statistics are probably not that high up on the list of importance. If there were a way to do accurate accounting (true vtime accounting) without increasing the overhead of every syscall and interrupt noticeably, that might be worth it, but syscall overhead is likely to be a more important factor than the accuracy of statistics. I don't know if doing TSC reads and subtraction/addition only, and delaying the conversion to cputime until a later point would slow down system calls measurably, compared with reading jiffies and comparing it against a cached value of jiffies, nor do I know whether spending time implementing and testing that would be worthwhile :) -- All rights reversed signature.asc Description: This is a digitally signed message part
Re: [BUG nohz]: wrong user and system time accounting
On Mon, 3 Apr 2017 15:06:13 -0400 Luiz Capitulino wrote: > On Mon, 3 Apr 2017 17:23:17 +0200 > Frederic Weisbecker wrote: > > > Do you observe aligned ticks with trace events (hrtimer_expire_entry)? > > > > You might want to enforce the global clock to trace that: > > > > echo "global" > /sys/kernel/debug/tracing/trace_clock > > I've used the same trace points & debugging code I've been using to debug > this issue, and this what I'm seeing: > > stress-25757 [002] 2742.717507: function: > enter_from_user_mode <-- apic_timer_interrupt > stress-25757 [002] 2742.717508: function: > __context_tracking_exit <-- enter_from_user_mode > stress-25757 [002] 2742.717508: bprint: vtime_delta: > diff=0 (now=4297409970 vtime_snap=4297409970) > stress-25757 [002] 2742.717509: function: > smp_apic_timer_interrupt <-- apic_timer_interrupt > stress-25757 [002] 2742.717509: function: irq_enter <-- > smp_apic_timer_interrupt > stress-25757 [002] 2742.717510: hrtimer_expire_entry: > hrtimer=0xc900039fbe58 function=hrtimer_wakeup now=2742674000776 > stress-25757 [002] 2742.717514: function: irq_exit <-- > smp_apic_timer_interrupt > cyclictest-25760 [002] 2742.717518: function: > vtime_account_system <-- vtime_common_task_switch > cyclictest-25760 [002] 2742.717518: bprint: vtime_delta: > diff=100 (now=4297409971 vtime_snap=4297409970) > cyclictest-25760 [002] 2742.717519: function: > __vtime_account_system <-- vtime_account_system > cyclictest-25760 [002] 2742.717519: bprint: get_vtime_delta: > vtime_snap=4297409970 now=4297409971 > cyclictest-25760 [002] 2742.717520: function: > account_system_time <-- __vtime_account_system > cyclictest-25760 [002] 2742.717520: bprint: > account_system_time: cputime=961981 > cyclictest-25760 [002] 2742.717521: function: > __context_tracking_enter <-- do_syscall_64 > cyclictest-25760 [002] 2742.717522: function: vtime_user_enter > <-- __context_tracking_enter > cyclictest-25760 [002] 2742.717522: bprint: vtime_delta: > diff=0 (now=4297409971 vtime_snap=4297409971) > > CPU2 shows 98% system time while the other CPUs (from CPU3 to CPU7) > show 98% user time (they're all running the same workload). On further debugging this, I realized that I had overlooked something: the timer interrupt in this trace is not the tick, but cyclictest's timer (remember that the test-case consists of pinning cyclictest and a task hogging the CPU to the same CPU). I'm running cyclictest with -i 200. If I increase this to -i 1000, then I seem unable to reproduce the issue (caution: even with -i 200 it doesn't always happen. But it does usually happen after I restart the test-case a few times. However, I've never been able to reproduce with -i 1000). Now, if it's really cyclictest that's causing the timer interrupts to get aligned, I guess this might not have a solution? (note: I haven't been able to reproduce this on bare-metal).
Re: [BUG nohz]: wrong user and system time accounting
On Mon, 2017-04-03 at 16:40 +0200, Frederic Weisbecker wrote: > On Thu, Mar 30, 2017 at 03:35:22PM +0200, Mike Galbraith wrote: > Nohz_full is already bad for powersavings anyway. CPU 0 always ticks :-) OTOH, if a nohz_full set is doing what it was born to do, CPU0 tick spikes won't be noticeable on your (pegged/glowing) watt meter :)
Re: [BUG nohz]: wrong user and system time accounting
On Mon, 3 Apr 2017 17:23:17 +0200 Frederic Weisbecker wrote: > Do you observe aligned ticks with trace events (hrtimer_expire_entry)? > > You might want to enforce the global clock to trace that: > > echo "global" > /sys/kernel/debug/tracing/trace_clock I've used the same trace points & debugging code I've been using to debug this issue, and this what I'm seeing: stress-25757 [002] 2742.717507: function: enter_from_user_mode <-- apic_timer_interrupt stress-25757 [002] 2742.717508: function: __context_tracking_exit <-- enter_from_user_mode stress-25757 [002] 2742.717508: bprint: vtime_delta: diff=0 (now=4297409970 vtime_snap=4297409970) stress-25757 [002] 2742.717509: function: smp_apic_timer_interrupt <-- apic_timer_interrupt stress-25757 [002] 2742.717509: function: irq_enter <-- smp_apic_timer_interrupt stress-25757 [002] 2742.717510: hrtimer_expire_entry: hrtimer=0xc900039fbe58 function=hrtimer_wakeup now=2742674000776 stress-25757 [002] 2742.717514: function: irq_exit <-- smp_apic_timer_interrupt cyclictest-25760 [002] 2742.717518: function: vtime_account_system <-- vtime_common_task_switch cyclictest-25760 [002] 2742.717518: bprint: vtime_delta: diff=100 (now=4297409971 vtime_snap=4297409970) cyclictest-25760 [002] 2742.717519: function: __vtime_account_system <-- vtime_account_system cyclictest-25760 [002] 2742.717519: bprint: get_vtime_delta: vtime_snap=4297409970 now=4297409971 cyclictest-25760 [002] 2742.717520: function: account_system_time <-- __vtime_account_system cyclictest-25760 [002] 2742.717520: bprint: account_system_time: cputime=961981 cyclictest-25760 [002] 2742.717521: function: __context_tracking_enter <-- do_syscall_64 cyclictest-25760 [002] 2742.717522: function: vtime_user_enter <-- __context_tracking_enter cyclictest-25760 [002] 2742.717522: bprint: vtime_delta: diff=0 (now=4297409971 vtime_snap=4297409971) CPU2 shows 98% system time while the other CPUs (from CPU3 to CPU7) show 98% user time (they're all running the same workload). What's happening here is: 1. Timer interrupt 2. Transition from user-space to kernel-space, vtimer_delta() returns zero 3. Context switch from hog application to cyclictest 4. This time vtime_delta() returns != zero, which implies jiffies was updated between steps 2 and 3 This seems to be the pattern that accounts incorrectly, and seem to suggest that the ticks are aligned because this repeats over and over. Please, let me know if you want me to run a different trace-cmd command-line.
Re: [BUG nohz]: wrong user and system time accounting
On Fri, Mar 31, 2017 at 11:11:19PM -0400, Luiz Capitulino wrote: > On Sat, 1 Apr 2017 01:24:54 +0200 > Frederic Weisbecker wrote: > > > On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote: > > > On Thu, 30 Mar 2017 17:25:46 -0400 > > > Luiz Capitulino wrote: > > > > > > > On Thu, 30 Mar 2017 16:18:17 +0200 > > > > Frederic Weisbecker wrote: > > > > > > > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > > > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker > > > > > > : > > > > > > > If it works, we may want to take that solution, likely less > > > > > > > performance sensitive > > > > > > > than using sched_clock(). In fact sched_clock() is fast, > > > > > > > especially as we require it to > > > > > > > be stable for nohz_full, but using it involves costly conversion > > > > > > > back and forth to jiffies. > > > > > > > > > > > > So both Rik and you agree with the skew tick solution, I will try it > > > > > > tomorrow. Btw, if we should just add random offset to the cpu in the > > > > > > nohz_full mode or add random offset to all cpus like the codes > > > > > > above? > > > > > > > > > > Lets just keep it to all CPUs for simplicty. > > > > > Also please add a comment that explains why we need that skew_tick on > > > > > nohz_full. > > > > > > > > I've tried all the test-cases we discussed in this thread with > > > > skew_tick=1 > > > > and it worked as expected in bare-metal and KVM guests. > > > > > > > > However, I found a test-case that works in bare-metal but show problems > > > > in KVM guests. It could something that's KVM specific, or it could be > > > > something that's harder to reproduce in bare-metal. > > > > > > After discussing some findings on this issue with Rik, I realized that > > > we don't add the skew when restarting the tick in tick_nohz_restart(). > > > Adding the offset there seems to solve this problem. > > > > Are you sure? tick_nohz_restart() doesn't seem to override the initial > > skew. It > > always forwards the expiration time on top of the last tick. > > OK, I'll double check. Without my change the bug triggers almost > instantly with the described reproducer. With my change it didn't > trig for several minutes (but it does look wrong looking at it now). Do you observe aligned ticks with trace events (hrtimer_expire_entry)? You might want to enforce the global clock to trace that: echo "global" > /sys/kernel/debug/tracing/trace_clock
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 03:35:22PM +0200, Mike Galbraith wrote: > On Thu, 2017-03-30 at 09:02 -0400, Rik van Riel wrote: > > On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote: > > > > Also, why does it raise power consumption issues? > > > > On a system without either nohz_full or nohz idle > > mode, skewed ticks result in CPU cores waking up > > at different times, and keeping an idle system > > consuming power for more time than it would if all > > the ticks happened simultaneously. > > And if your server farm is mostly idle, that power savings may delay > your bankruptcy proceedings by a whole microsecond ;-) > > Or more seriously, what skew does do on boxen of size X today is > something for perf to say. At the time, removal was very bad for my 8 > socket box, and allegedly caused huge SGI beasts in horrific pain. I see. Nohz_full is already bad for powersavings anyway. CPU 0 always ticks :-)
Re: [BUG nohz]: wrong user and system time accounting
On Sat, 1 Apr 2017 01:24:54 +0200 Frederic Weisbecker wrote: > On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote: > > On Thu, 30 Mar 2017 17:25:46 -0400 > > Luiz Capitulino wrote: > > > > > On Thu, 30 Mar 2017 16:18:17 +0200 > > > Frederic Weisbecker wrote: > > > > > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > > > > > > > > > > If it works, we may want to take that solution, likely less > > > > > > performance sensitive > > > > > > than using sched_clock(). In fact sched_clock() is fast, especially > > > > > > as we require it to > > > > > > be stable for nohz_full, but using it involves costly conversion > > > > > > back and forth to jiffies. > > > > > > > > > > So both Rik and you agree with the skew tick solution, I will try it > > > > > tomorrow. Btw, if we should just add random offset to the cpu in the > > > > > nohz_full mode or add random offset to all cpus like the codes above? > > > > > > > > > > > > > Lets just keep it to all CPUs for simplicty. > > > > Also please add a comment that explains why we need that skew_tick on > > > > nohz_full. > > > > > > I've tried all the test-cases we discussed in this thread with skew_tick=1 > > > and it worked as expected in bare-metal and KVM guests. > > > > > > However, I found a test-case that works in bare-metal but show problems > > > in KVM guests. It could something that's KVM specific, or it could be > > > something that's harder to reproduce in bare-metal. > > > > After discussing some findings on this issue with Rik, I realized that > > we don't add the skew when restarting the tick in tick_nohz_restart(). > > Adding the offset there seems to solve this problem. > > Are you sure? tick_nohz_restart() doesn't seem to override the initial skew. > It > always forwards the expiration time on top of the last tick. OK, I'll double check. Without my change the bug triggers almost instantly with the described reproducer. With my change it didn't trig for several minutes (but it does look wrong looking at it now).
Re: [BUG nohz]: wrong user and system time accounting
On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote: > On Thu, 30 Mar 2017 17:25:46 -0400 > Luiz Capitulino wrote: > > > On Thu, 30 Mar 2017 16:18:17 +0200 > > Frederic Weisbecker wrote: > > > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > > > > If it works, we may want to take that solution, likely less > > > > > performance sensitive > > > > > than using sched_clock(). In fact sched_clock() is fast, especially > > > > > as we require it to > > > > > be stable for nohz_full, but using it involves costly conversion back > > > > > and forth to jiffies. > > > > > > > > So both Rik and you agree with the skew tick solution, I will try it > > > > tomorrow. Btw, if we should just add random offset to the cpu in the > > > > nohz_full mode or add random offset to all cpus like the codes above? > > > > > > > > > > Lets just keep it to all CPUs for simplicty. > > > Also please add a comment that explains why we need that skew_tick on > > > nohz_full. > > > > I've tried all the test-cases we discussed in this thread with skew_tick=1 > > and it worked as expected in bare-metal and KVM guests. > > > > However, I found a test-case that works in bare-metal but show problems > > in KVM guests. It could something that's KVM specific, or it could be > > something that's harder to reproduce in bare-metal. > > After discussing some findings on this issue with Rik, I realized that > we don't add the skew when restarting the tick in tick_nohz_restart(). > Adding the offset there seems to solve this problem. Are you sure? tick_nohz_restart() doesn't seem to override the initial skew. It always forwards the expiration time on top of the last tick.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 30 Mar 2017 17:25:46 -0400 Luiz Capitulino wrote: > On Thu, 30 Mar 2017 16:18:17 +0200 > Frederic Weisbecker wrote: > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > > > If it works, we may want to take that solution, likely less performance > > > > sensitive > > > > than using sched_clock(). In fact sched_clock() is fast, especially as > > > > we require it to > > > > be stable for nohz_full, but using it involves costly conversion back > > > > and forth to jiffies. > > > > > > So both Rik and you agree with the skew tick solution, I will try it > > > tomorrow. Btw, if we should just add random offset to the cpu in the > > > nohz_full mode or add random offset to all cpus like the codes above? > > > > Lets just keep it to all CPUs for simplicty. > > Also please add a comment that explains why we need that skew_tick on > > nohz_full. > > I've tried all the test-cases we discussed in this thread with skew_tick=1 > and it worked as expected in bare-metal and KVM guests. > > However, I found a test-case that works in bare-metal but show problems > in KVM guests. It could something that's KVM specific, or it could be > something that's harder to reproduce in bare-metal. After discussing some findings on this issue with Rik, I realized that we don't add the skew when restarting the tick in tick_nohz_restart(). Adding the offset there seems to solve this problem.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 30 Mar 2017 16:18:17 +0200 Frederic Weisbecker wrote: > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > > If it works, we may want to take that solution, likely less performance > > > sensitive > > > than using sched_clock(). In fact sched_clock() is fast, especially as we > > > require it to > > > be stable for nohz_full, but using it involves costly conversion back and > > > forth to jiffies. > > > > So both Rik and you agree with the skew tick solution, I will try it > > tomorrow. Btw, if we should just add random offset to the cpu in the > > nohz_full mode or add random offset to all cpus like the codes above? > > Lets just keep it to all CPUs for simplicty. > Also please add a comment that explains why we need that skew_tick on > nohz_full. I've tried all the test-cases we discussed in this thread with skew_tick=1 and it worked as expected in bare-metal and KVM guests. However, I found a test-case that works in bare-metal but show problems in KVM guests. It could something that's KVM specific, or it could be something that's harder to reproduce in bare-metal. The reproducer is (not sure all the steps are necessary): 1. Isolate 8 cores in the host with isolcpus= and nohz_full= (and skew_tick=1) 2. Create a KVM guest with 8 vCPUs and pin each vCPU to an isolated host core 3. Boot the guest with isolcpus=2,3,4,5,6,7 nohz_full=2,3,4,5,6,7 skew_tick=1 4. Once the guest is booted, run: # for i in $(seq 2 7); do taskset -c $i hog& ;done # taskset -c 2,3,4,5,6,7 \ cyclictest -m -n -q -p95 -D 1m -h60 -i 200 -t 6 -a 2,3,4,5,6,7 (where hog is a program taking 100% of the CPU, and cyclictest is RT's cyclictest) 5. Run top -d1 In a few minutes into this test-case, I see one isolated CPU in the guest reporting around 95% system time (where the expected is close to 100% user time, which the others isolated CPUs correctly report).
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote: > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > > If it works, we may want to take that solution, likely less performance > > sensitive > > than using sched_clock(). In fact sched_clock() is fast, especially as we > > require it to > > be stable for nohz_full, but using it involves costly conversion back and > > forth to jiffies. > > So both Rik and you agree with the skew tick solution, I will try it > tomorrow. Btw, if we should just add random offset to the cpu in the > nohz_full mode or add random offset to all cpus like the codes above? Lets just keep it to all CPUs for simplicty. Also please add a comment that explains why we need that skew_tick on nohz_full. Thanks!
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 21:38 GMT+08:00 Frederic Weisbecker : > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: >> Cc Peterz, Thomas, >> 2017-03-30 12:27 GMT+08:00 Mike Galbraith : >> > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: >> > >> >> In other words, the tick on cpu0 is aligned >> >> with the tick on the nohz_full cpus, and >> >> jiffies is advanced while the nohz_full cpus >> >> with an active tick happen to be in kernel >> >> mode? >> > >> > You really want skew_tick=1, especially on big boxen. >> > >> >> Frederic, can you think of any reason why >> >> the tick on nohz_full CPUs would end up aligned >> >> with the tick on cpu0, instead of running at some >> >> random offset? >> > >> > (I or low rq->clock bits as crude NOHZ collision avoidance) >> > >> >> A random offset, or better yet a somewhat randomized >> >> tick length to make sure that simultaneous ticks are >> >> fairly rare and the vtime sampling does not end up >> >> "in phase" with the jiffies incrementing, could make >> >> the accounting work right again. >> > >> > That improves jitter, especially on big boxen. I have an 8 socket box >> > that thinks it's an extra large PC, there, collision avoidance matters >> > hugely. I couldn't reproduce bean counting woes, no idea if collision >> > avoidance will help that. >> >> So I implement two methods, one is from Rik's random offset proposal >> through skew tick, the other one is from Frederic's proposal and it is >> the same as my original idea through use nanosecond granularity to >> check deltas but only perform an actual cputime update when that delta >> >= TICK_NSEC. Both methods can solve the bug which Luiz reported. >> Peterz, Thomas, any ideas? >> >> --->8- >> >> skew tick: >> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >> index 7fe53be..9981437 100644 >> --- a/kernel/time/tick-sched.c >> +++ b/kernel/time/tick-sched.c >> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void) >> hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); >> >> /* Offset the tick to avert jiffies_lock contention. */ >> +#ifdef CONFIG_NO_HZ_FULL >> +if (sched_skew_tick || tick_nohz_full_running) { >> +#else >> if (sched_skew_tick) { >> +#endif > > Please rather use tick_nohz_full_enabled() to avoid ifdeffery. > >> u64 offset = ktime_to_ns(tick_period) >> 1; >> do_div(offset, num_possible_cpus()); >> offset *= smp_processor_id(); > > If it works, we may want to take that solution, likely less performance > sensitive > than using sched_clock(). In fact sched_clock() is fast, especially as we > require it to > be stable for nohz_full, but using it involves costly conversion back and > forth to jiffies. So both Rik and you agree with the skew tick solution, I will try it tomorrow. Btw, if we should just add random offset to the cpu in the nohz_full mode or add random offset to all cpus like the codes above? Regards, Wanpeng Li > >> >> -->8- >> >> use nanosecond granularity to check deltas but only perform an actual >> cputime update when that delta >= TICK_NSEC. >> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c >> index f3778e2b..f1ee393 100644 >> --- a/kernel/sched/cputime.c >> +++ b/kernel/sched/cputime.c >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct >> task_struct *p, u64 *ut, u64 *st) >> #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN >> static u64 vtime_delta(struct task_struct *tsk) >> { >> -unsigned long now = READ_ONCE(jiffies); >> +u64 now = local_clock(); > > I fear we need a global clock, because the reader (task_cputime()) needs > to compute the delta and therefore use the same clock from any CPU. > > Or we can use the local_clock() but the reader must access the same. > > So there would be vtime_delta_writer() which uses local_clock and stores > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which > is protected by vtime_seqcount as well. > > Although those sched_clock_cpu() things seem to only matter when the > sched_clock() is unstable. And that stability is a condition for nohz_full > to work anyway. So probably sched_clock() alone would be enough. > > Thanks.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 09:02:31AM -0400, Rik van Riel wrote: > On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote: > > On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote: > > > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > > > > > > > A random offset, or better yet a somewhat randomized > > > > tick length to make sure that simultaneous ticks are > > > > fairly rare and the vtime sampling does not end up > > > > "in phase" with the jiffies incrementing, could make > > > > the accounting work right again. > > > > > > That improves jitter, especially on big boxen. I have an 8 socket > > > box > > > that thinks it's an extra large PC, there, collision avoidance > > > matters > > > hugely. I couldn't reproduce bean counting woes, no idea if > > > collision > > > avoidance will help that. > > > > Out of curiosity, where is the main contention between ticks? I > > indeed > > know some locks that can be taken on special cases, such as posix cpu > > timers. > > > > Also, why does it raise power consumption issues? > > On a system without either nohz_full or nohz idle > mode, skewed ticks result in CPU cores waking up > at different times, and keeping an idle system > consuming power for more time than it would if all > the ticks happened simultaneously. Ah fair point! > > This is not a factor at all on systems that switch > off the tick while idle, since the CPU will be busy > anyway while the tick is enabled. I see. Thanks!
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote: > Cc Peterz, Thomas, > 2017-03-30 12:27 GMT+08:00 Mike Galbraith : > > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > > > >> In other words, the tick on cpu0 is aligned > >> with the tick on the nohz_full cpus, and > >> jiffies is advanced while the nohz_full cpus > >> with an active tick happen to be in kernel > >> mode? > > > > You really want skew_tick=1, especially on big boxen. > > > >> Frederic, can you think of any reason why > >> the tick on nohz_full CPUs would end up aligned > >> with the tick on cpu0, instead of running at some > >> random offset? > > > > (I or low rq->clock bits as crude NOHZ collision avoidance) > > > >> A random offset, or better yet a somewhat randomized > >> tick length to make sure that simultaneous ticks are > >> fairly rare and the vtime sampling does not end up > >> "in phase" with the jiffies incrementing, could make > >> the accounting work right again. > > > > That improves jitter, especially on big boxen. I have an 8 socket box > > that thinks it's an extra large PC, there, collision avoidance matters > > hugely. I couldn't reproduce bean counting woes, no idea if collision > > avoidance will help that. > > So I implement two methods, one is from Rik's random offset proposal > through skew tick, the other one is from Frederic's proposal and it is > the same as my original idea through use nanosecond granularity to > check deltas but only perform an actual cputime update when that delta > >= TICK_NSEC. Both methods can solve the bug which Luiz reported. > Peterz, Thomas, any ideas? > > --->8- > > skew tick: > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..9981437 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void) > hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); > > /* Offset the tick to avert jiffies_lock contention. */ > +#ifdef CONFIG_NO_HZ_FULL > +if (sched_skew_tick || tick_nohz_full_running) { > +#else > if (sched_skew_tick) { > +#endif Please rather use tick_nohz_full_enabled() to avoid ifdeffery. > u64 offset = ktime_to_ns(tick_period) >> 1; > do_div(offset, num_possible_cpus()); > offset *= smp_processor_id(); If it works, we may want to take that solution, likely less performance sensitive than using sched_clock(). In fact sched_clock() is fast, especially as we require it to be stable for nohz_full, but using it involves costly conversion back and forth to jiffies. > > -->8- > > use nanosecond granularity to check deltas but only perform an actual > cputime update when that delta >= TICK_NSEC. > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..f1ee393 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > static u64 vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > +u64 now = local_clock(); I fear we need a global clock, because the reader (task_cputime()) needs to compute the delta and therefore use the same clock from any CPU. Or we can use the local_clock() but the reader must access the same. So there would be vtime_delta_writer() which uses local_clock and stores the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which is protected by vtime_seqcount as well. Although those sched_clock_cpu() things seem to only matter when the sched_clock() is unstable. And that stability is a condition for nohz_full to work anyway. So probably sched_clock() alone would be enough. Thanks.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-30 at 09:02 -0400, Rik van Riel wrote: > On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote: > > Also, why does it raise power consumption issues? > > On a system without either nohz_full or nohz idle > mode, skewed ticks result in CPU cores waking up > at different times, and keeping an idle system > consuming power for more time than it would if all > the ticks happened simultaneously. And if your server farm is mostly idle, that power savings may delay your bankruptcy proceedings by a whole microsecond ;-) Or more seriously, what skew does do on boxen of size X today is something for perf to say. At the time, removal was very bad for my 8 socket box, and allegedly caused huge SGI beasts in horrific pain. -Mike
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-30 at 14:40 +0200, Frederic Weisbecker wrote: > On Thu, Mar 30, 2017 at 09:58:44AM +0800, Wanpeng Li wrote: > > There is such a feature skew_tick currently, refer to commit > > 5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot > > parameter, the bug disappear, however, the commit also mentioned that > > it will hurt power consumption. > > Oh, I completely missed that! It suggests it'll harm power consumption because skew removal allegedly saved power. Recalling what removal did to my 8 socket box, I doubt adding it back costs large boxen anything at all, rather the opposite. -Mike
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote: > On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote: > > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > > > > > A random offset, or better yet a somewhat randomized > > > tick length to make sure that simultaneous ticks are > > > fairly rare and the vtime sampling does not end up > > > "in phase" with the jiffies incrementing, could make > > > the accounting work right again. > > > > That improves jitter, especially on big boxen. I have an 8 socket > > box > > that thinks it's an extra large PC, there, collision avoidance > > matters > > hugely. I couldn't reproduce bean counting woes, no idea if > > collision > > avoidance will help that. > > Out of curiosity, where is the main contention between ticks? I > indeed > know some locks that can be taken on special cases, such as posix cpu > timers. > > Also, why does it raise power consumption issues? On a system without either nohz_full or nohz idle mode, skewed ticks result in CPU cores waking up at different times, and keeping an idle system consuming power for more time than it would if all the ticks happened simultaneously. This is not a factor at all on systems that switch off the tick while idle, since the CPU will be busy anyway while the tick is enabled.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-30 at 00:54 +0200, Frederic Weisbecker wrote: > (Adding Thomas in Cc) > > On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote: > > > > Frederic, can you think of any reason why > > the tick on nohz_full CPUs would end up aligned > > with the tick on cpu0, instead of running at some > > random offset? > > tick_init_jiffy_update() takes that decision to align all ticks. > > I'm not sure why. I don't see why that would matter, either. > I'm not sure that randomizing the tick start per CPU would be a > right solution. Somewhere in the world you can be sure the tick > randomization of some nohz_full CPU will coincide with the tick > of CPU 0 :o) > > Or we could force that tick on nohz_full CPUs to be far from > CPU 0's tick... I'm not sure such a solution would be accepted > though. I am not sure we would have to force things. Simply getting rid of tick_init_jiffy_update and scheduling the next tick for "now + tick period" might have the same effect, when the tick gets stopped and restarted on nohz_full CPUs.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote: > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > > > A random offset, or better yet a somewhat randomized > > tick length to make sure that simultaneous ticks are > > fairly rare and the vtime sampling does not end up > > "in phase" with the jiffies incrementing, could make > > the accounting work right again. > > That improves jitter, especially on big boxen. I have an 8 socket box > that thinks it's an extra large PC, there, collision avoidance matters > hugely. I couldn't reproduce bean counting woes, no idea if collision > avoidance will help that. Out of curiosity, where is the main contention between ticks? I indeed know some locks that can be taken on special cases, such as posix cpu timers. Also, why does it raise power consumption issues? Thanks.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 30, 2017 at 09:58:44AM +0800, Wanpeng Li wrote: > 2017-03-30 4:08 GMT+08:00 Rik van Riel : > > > > In other words, the tick on cpu0 is aligned > > with the tick on the nohz_full cpus, and > > jiffies is advanced while the nohz_full cpus > > with an active tick happen to be in kernel > > mode? > > > > Frederic, can you think of any reason why > > the tick on nohz_full CPUs would end up aligned > > with the tick on cpu0, instead of running at some > > random offset? > > > > A random offset, or better yet a somewhat randomized > > tick length to make sure that simultaneous ticks are > > fairly rare and the vtime sampling does not end up > > "in phase" with the jiffies incrementing, could make > > the accounting work right again. > > > > Of course, that assumes the above hypothesis is correct :) > > There is such a feature skew_tick currently, refer to commit > 5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot > parameter, the bug disappear, however, the commit also mentioned that > it will hurt power consumption. Oh, I completely missed that! > I will try Frederic's proposal which > is similar to my original idea "how bad would it be to revert to > sched_clock() instead of jiffies in vtime_delta()? We could use > nanosecond granularity to check deltas but only perform an actual > cputime update when that delta >= TICK_NSEC." Thanks! I hope sched_clock() won't introduce too much overhead. Otherwise we may want to pick up the skew_tick solution.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-30 at 19:52 +0800, Wanpeng Li wrote: > If we should just add random offset to the cpu in the nohz_full mode? Up to you, whatever works best. I left the regular skew alone, just added some noise to scheduler_tick_max_deferment(). -Mike
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 10:14 GMT+08:00 Luiz Capitulino : > On Thu, 30 Mar 2017 06:46:30 +0800 > Wanpeng Li wrote: > >> > So! Now we need to find a proper fix :o) >> > >> > Hmm, how bad would it be to revert to sched_clock() instead of jiffies in >> > vtime_delta()? >> > We could use nanosecond granularity to check deltas but only perform an >> > actual cputime update >> > when that delta >= TICK_NSEC. That should keep the load ok. >> >> Yeah, I mentioned something similar before. >> https://lkml.org/lkml/2017/3/26/138 However, Rik's commit optimized >> syscalls by not utilize sched_clock(), so if we should distinguish >> between syscalls/exceptions and irqs? > > Why not use ktime_get()? I believe ktime_get() is more heavy than local_clock() when sched clock is stable. So we can cooperate to improve https://lkml.org/lkml/2017/3/30/456. Regards, Wanpeng Li > > Here's the solution I was thinking about, it's mostly untested. I'm > rate limiting below TICK_NSEC because I want to avoid syncing with > the tick. > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..a8b1e85 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -676,18 +676,20 @@ void thread_group_cputime_adjusted(struct task_struct > *p, u64 *ut, u64 *st) > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > static u64 vtime_delta(struct task_struct *tsk) > { > - unsigned long now = READ_ONCE(jiffies); > + return ktime_sub(ktime_get(), tsk->vtime_snap); > +} > > - if (time_before(now, (unsigned long)tsk->vtime_snap)) > - return 0; > +/* A little bit less than the tick period */ > +#define VTIME_RATE_LIMIT (TICK_NSEC - 20) > > - return jiffies_to_nsecs(now - tsk->vtime_snap); > +static bool vtime_should_account(struct task_struct *tsk) > +{ > + return vtime_delta(tsk) > VTIME_RATE_LIMIT; > } > > static u64 get_vtime_delta(struct task_struct *tsk) > { > - unsigned long now = READ_ONCE(jiffies); > - u64 delta, other; > + u64 delta, other, now = ktime_get(); > > /* > * Unlike tick based timing, vtime based timing never has lost > @@ -696,7 +698,7 @@ static u64 get_vtime_delta(struct task_struct *tsk) > * elapsed time. Limit account_other_time to prevent rounding > * errors from causing elapsed vtime to go negative. > */ > - delta = jiffies_to_nsecs(now - tsk->vtime_snap); > + delta = ktime_sub(now, tsk->vtime_snap); > other = account_other_time(delta); > WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); > tsk->vtime_snap = now; > @@ -711,7 +713,7 @@ static void __vtime_account_system(struct task_struct > *tsk) > > void vtime_account_system(struct task_struct *tsk) > { > - if (!vtime_delta(tsk)) > + if (!vtime_should_account(tsk)) > return; > > write_seqcount_begin(&tsk->vtime_seqcount); > @@ -723,7 +725,7 @@ void vtime_account_user(struct task_struct *tsk) > { > write_seqcount_begin(&tsk->vtime_seqcount); > tsk->vtime_snap_whence = VTIME_SYS; > - if (vtime_delta(tsk)) > + if (vtime_should_account(tsk)) > account_user_time(tsk, get_vtime_delta(tsk)); > write_seqcount_end(&tsk->vtime_seqcount); > } > @@ -731,7 +733,7 @@ void vtime_account_user(struct task_struct *tsk) > void vtime_user_enter(struct task_struct *tsk) > { > write_seqcount_begin(&tsk->vtime_seqcount); > - if (vtime_delta(tsk)) > + if (vtime_should_account(tsk)) > __vtime_account_system(tsk); > tsk->vtime_snap_whence = VTIME_USER; > write_seqcount_end(&tsk->vtime_seqcount); > @@ -747,7 +749,7 @@ void vtime_guest_enter(struct task_struct *tsk) > * that can thus safely catch up with a tickless delta. > */ > write_seqcount_begin(&tsk->vtime_seqcount); > - if (vtime_delta(tsk)) > + if (vtime_should_account(tsk)) > __vtime_account_system(tsk); > current->flags |= PF_VCPU; > write_seqcount_end(&tsk->vtime_seqcount); > @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev) > > write_seqcount_begin(¤t->vtime_seqcount); > current->vtime_snap_whence = VTIME_SYS; > - current->vtime_snap = jiffies; > + current->vtime_snap = ktime_get(); > write_seqcount_end(¤t->vtime_seqcount); > } >
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 14:47 GMT+08:00 Wanpeng Li : > Cc Peterz, Thomas, > 2017-03-30 12:27 GMT+08:00 Mike Galbraith : >> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: >> >>> In other words, the tick on cpu0 is aligned >>> with the tick on the nohz_full cpus, and >>> jiffies is advanced while the nohz_full cpus >>> with an active tick happen to be in kernel >>> mode? >> >> You really want skew_tick=1, especially on big boxen. >> >>> Frederic, can you think of any reason why >>> the tick on nohz_full CPUs would end up aligned >>> with the tick on cpu0, instead of running at some >>> random offset? >> >> (I or low rq->clock bits as crude NOHZ collision avoidance) >> >>> A random offset, or better yet a somewhat randomized >>> tick length to make sure that simultaneous ticks are >>> fairly rare and the vtime sampling does not end up >>> "in phase" with the jiffies incrementing, could make >>> the accounting work right again. >> >> That improves jitter, especially on big boxen. I have an 8 socket box >> that thinks it's an extra large PC, there, collision avoidance matters >> hugely. I couldn't reproduce bean counting woes, no idea if collision >> avoidance will help that. > > So I implement two methods, one is from Rik's random offset proposal If we should just add random offset to the cpu in the nohz_full mode? > through skew tick, the other one is from Frederic's proposal and it is > the same as my original idea through use nanosecond granularity to > check deltas but only perform an actual cputime update when that delta >>= TICK_NSEC. Both methods can solve the bug which Luiz reported. This can just solves two cpu hogs running on the cpu in nohz_full mode. However, Luiz's testcase w/ ./acct-bug 1 995 shows idle 100%. Regards, Wanpeng Li > Peterz, Thomas, any ideas? > > --->8- > > skew tick: > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..9981437 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void) > hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); > > /* Offset the tick to avert jiffies_lock contention. */ > +#ifdef CONFIG_NO_HZ_FULL > +if (sched_skew_tick || tick_nohz_full_running) { > +#else > if (sched_skew_tick) { > +#endif > u64 offset = ktime_to_ns(tick_period) >> 1; > do_div(offset, num_possible_cpus()); > offset *= smp_processor_id(); > > -->8- > > use nanosecond granularity to check deltas but only perform an actual > cputime update when that delta >= TICK_NSEC. > > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c > index f3778e2b..f1ee393 100644 > --- a/kernel/sched/cputime.c > +++ b/kernel/sched/cputime.c > @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct > task_struct *p, u64 *ut, u64 *st) > #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN > static u64 vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > +u64 now = local_clock(); > +u64 delta; > + > +delta = now - tsk->vtime_snap; > > -if (time_before(now, (unsigned long)tsk->vtime_snap)) > +if (delta < TICK_NSEC) > return 0; > > -return jiffies_to_nsecs(now - tsk->vtime_snap); > +return jiffies_to_nsecs(delta / TICK_NSEC); > } > > static u64 get_vtime_delta(struct task_struct *tsk) > { > -unsigned long now = READ_ONCE(jiffies); > -u64 delta, other; > +u64 delta = vtime_delta(tsk); > +u64 other; > > /* > * Unlike tick based timing, vtime based timing never has lost > @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk) > * elapsed time. Limit account_other_time to prevent rounding > * errors from causing elapsed vtime to go negative. > */ > -delta = jiffies_to_nsecs(now - tsk->vtime_snap); > other = account_other_time(delta); > WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); > -tsk->vtime_snap = now; > +tsk->vtime_snap += delta; > > return delta - other; > } > @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev) > > write_seqcount_begin(¤t->vtime_seqcount); > current->vtime_snap_whence = VTIME_SYS; > -current->vtime_snap = jiffies; > +current->vtime_snap = sched_clock_cpu(smp_processor_id()); > write_seqcount_end(¤t->vtime_seqcount); > } > > @@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu) > local_irq_save(flags); > write_seqcount_begin(&t->vtime_seqcount); > t->vtime_snap_whence = VTIME_SYS; > -t->vtime_snap = jiffies; > +t->vtime_snap = sched_clock_cpu(cpu); > write_seqcount_end(&t->vtime_seqcount); > local_irq_restore(flags); > } > > Regards, > Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
Cc Peterz, Thomas, 2017-03-30 12:27 GMT+08:00 Mike Galbraith : > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > >> In other words, the tick on cpu0 is aligned >> with the tick on the nohz_full cpus, and >> jiffies is advanced while the nohz_full cpus >> with an active tick happen to be in kernel >> mode? > > You really want skew_tick=1, especially on big boxen. > >> Frederic, can you think of any reason why >> the tick on nohz_full CPUs would end up aligned >> with the tick on cpu0, instead of running at some >> random offset? > > (I or low rq->clock bits as crude NOHZ collision avoidance) > >> A random offset, or better yet a somewhat randomized >> tick length to make sure that simultaneous ticks are >> fairly rare and the vtime sampling does not end up >> "in phase" with the jiffies incrementing, could make >> the accounting work right again. > > That improves jitter, especially on big boxen. I have an 8 socket box > that thinks it's an extra large PC, there, collision avoidance matters > hugely. I couldn't reproduce bean counting woes, no idea if collision > avoidance will help that. So I implement two methods, one is from Rik's random offset proposal through skew tick, the other one is from Frederic's proposal and it is the same as my original idea through use nanosecond granularity to check deltas but only perform an actual cputime update when that delta >= TICK_NSEC. Both methods can solve the bug which Luiz reported. Peterz, Thomas, any ideas? --->8- skew tick: diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 7fe53be..9981437 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void) hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); /* Offset the tick to avert jiffies_lock contention. */ +#ifdef CONFIG_NO_HZ_FULL +if (sched_skew_tick || tick_nohz_full_running) { +#else if (sched_skew_tick) { +#endif u64 offset = ktime_to_ns(tick_period) >> 1; do_div(offset, num_possible_cpus()); offset *= smp_processor_id(); -->8- use nanosecond granularity to check deltas but only perform an actual cputime update when that delta >= TICK_NSEC. diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index f3778e2b..f1ee393 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st) #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN static u64 vtime_delta(struct task_struct *tsk) { -unsigned long now = READ_ONCE(jiffies); +u64 now = local_clock(); +u64 delta; + +delta = now - tsk->vtime_snap; -if (time_before(now, (unsigned long)tsk->vtime_snap)) +if (delta < TICK_NSEC) return 0; -return jiffies_to_nsecs(now - tsk->vtime_snap); +return jiffies_to_nsecs(delta / TICK_NSEC); } static u64 get_vtime_delta(struct task_struct *tsk) { -unsigned long now = READ_ONCE(jiffies); -u64 delta, other; +u64 delta = vtime_delta(tsk); +u64 other; /* * Unlike tick based timing, vtime based timing never has lost @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk) * elapsed time. Limit account_other_time to prevent rounding * errors from causing elapsed vtime to go negative. */ -delta = jiffies_to_nsecs(now - tsk->vtime_snap); other = account_other_time(delta); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); -tsk->vtime_snap = now; +tsk->vtime_snap += delta; return delta - other; } @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev) write_seqcount_begin(¤t->vtime_seqcount); current->vtime_snap_whence = VTIME_SYS; -current->vtime_snap = jiffies; +current->vtime_snap = sched_clock_cpu(smp_processor_id()); write_seqcount_end(¤t->vtime_seqcount); } @@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu) local_irq_save(flags); write_seqcount_begin(&t->vtime_seqcount); t->vtime_snap_whence = VTIME_SYS; -t->vtime_snap = jiffies; +t->vtime_snap = sched_clock_cpu(cpu); write_seqcount_end(&t->vtime_seqcount); local_irq_restore(flags); } Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote: > In other words, the tick on cpu0 is aligned > with the tick on the nohz_full cpus, and > jiffies is advanced while the nohz_full cpus > with an active tick happen to be in kernel > mode? You really want skew_tick=1, especially on big boxen. > Frederic, can you think of any reason why > the tick on nohz_full CPUs would end up aligned > with the tick on cpu0, instead of running at some > random offset? (I or low rq->clock bits as crude NOHZ collision avoidance) > A random offset, or better yet a somewhat randomized > tick length to make sure that simultaneous ticks are > fairly rare and the vtime sampling does not end up > "in phase" with the jiffies incrementing, could make > the accounting work right again. That improves jitter, especially on big boxen. I have an 8 socket box that thinks it's an extra large PC, there, collision avoidance matters hugely. I couldn't reproduce bean counting woes, no idea if collision avoidance will help that. -Mike
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 30 Mar 2017 06:46:30 +0800 Wanpeng Li wrote: > > So! Now we need to find a proper fix :o) > > > > Hmm, how bad would it be to revert to sched_clock() instead of jiffies in > > vtime_delta()? > > We could use nanosecond granularity to check deltas but only perform an > > actual cputime update > > when that delta >= TICK_NSEC. That should keep the load ok. > > Yeah, I mentioned something similar before. > https://lkml.org/lkml/2017/3/26/138 However, Rik's commit optimized > syscalls by not utilize sched_clock(), so if we should distinguish > between syscalls/exceptions and irqs? Why not use ktime_get()? Here's the solution I was thinking about, it's mostly untested. I'm rate limiting below TICK_NSEC because I want to avoid syncing with the tick. diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index f3778e2b..a8b1e85 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -676,18 +676,20 @@ void thread_group_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st) #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN static u64 vtime_delta(struct task_struct *tsk) { - unsigned long now = READ_ONCE(jiffies); + return ktime_sub(ktime_get(), tsk->vtime_snap); +} - if (time_before(now, (unsigned long)tsk->vtime_snap)) - return 0; +/* A little bit less than the tick period */ +#define VTIME_RATE_LIMIT (TICK_NSEC - 20) - return jiffies_to_nsecs(now - tsk->vtime_snap); +static bool vtime_should_account(struct task_struct *tsk) +{ + return vtime_delta(tsk) > VTIME_RATE_LIMIT; } static u64 get_vtime_delta(struct task_struct *tsk) { - unsigned long now = READ_ONCE(jiffies); - u64 delta, other; + u64 delta, other, now = ktime_get(); /* * Unlike tick based timing, vtime based timing never has lost @@ -696,7 +698,7 @@ static u64 get_vtime_delta(struct task_struct *tsk) * elapsed time. Limit account_other_time to prevent rounding * errors from causing elapsed vtime to go negative. */ - delta = jiffies_to_nsecs(now - tsk->vtime_snap); + delta = ktime_sub(now, tsk->vtime_snap); other = account_other_time(delta); WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE); tsk->vtime_snap = now; @@ -711,7 +713,7 @@ static void __vtime_account_system(struct task_struct *tsk) void vtime_account_system(struct task_struct *tsk) { - if (!vtime_delta(tsk)) + if (!vtime_should_account(tsk)) return; write_seqcount_begin(&tsk->vtime_seqcount); @@ -723,7 +725,7 @@ void vtime_account_user(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); tsk->vtime_snap_whence = VTIME_SYS; - if (vtime_delta(tsk)) + if (vtime_should_account(tsk)) account_user_time(tsk, get_vtime_delta(tsk)); write_seqcount_end(&tsk->vtime_seqcount); } @@ -731,7 +733,7 @@ void vtime_account_user(struct task_struct *tsk) void vtime_user_enter(struct task_struct *tsk) { write_seqcount_begin(&tsk->vtime_seqcount); - if (vtime_delta(tsk)) + if (vtime_should_account(tsk)) __vtime_account_system(tsk); tsk->vtime_snap_whence = VTIME_USER; write_seqcount_end(&tsk->vtime_seqcount); @@ -747,7 +749,7 @@ void vtime_guest_enter(struct task_struct *tsk) * that can thus safely catch up with a tickless delta. */ write_seqcount_begin(&tsk->vtime_seqcount); - if (vtime_delta(tsk)) + if (vtime_should_account(tsk)) __vtime_account_system(tsk); current->flags |= PF_VCPU; write_seqcount_end(&tsk->vtime_seqcount); @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev) write_seqcount_begin(¤t->vtime_seqcount); current->vtime_snap_whence = VTIME_SYS; - current->vtime_snap = jiffies; + current->vtime_snap = ktime_get(); write_seqcount_end(¤t->vtime_seqcount); }
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 4:08 GMT+08:00 Rik van Riel : > On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote: >> On Tue, 28 Mar 2017 13:24:06 -0400 >> Luiz Capitulino wrote: >> >> > 1. In my tracing I'm seeing that sometimes (always?) the >> > time interval between two timer interrupts is less than 1ms >> >> I think that's the root cause. >> >> In this trace, we see the following: >> >> 1. On CPU15, we transition from user-space to kernel-space because >> of a timer interrupt (it's the tick) >> >> 2. vtimer_delta() returns 0, because jiffies didn't change since the >> last accounting >> >> 3. While CPU15 is executing in kernel-space, jiffies is updated >> by CPU0 >> >> 4. When going back to user-space, vtime_delta() returns non-zero >> and the whole time is accounted for system time (observe how >> the cputime parameter in account_system_time() is less than 1ms) > > In other words, the tick on cpu0 is aligned > with the tick on the nohz_full cpus, and > jiffies is advanced while the nohz_full cpus > with an active tick happen to be in kernel > mode? > > Frederic, can you think of any reason why > the tick on nohz_full CPUs would end up aligned > with the tick on cpu0, instead of running at some > random offset? > > A random offset, or better yet a somewhat randomized > tick length to make sure that simultaneous ticks are > fairly rare and the vtime sampling does not end up > "in phase" with the jiffies incrementing, could make > the accounting work right again. > > Of course, that assumes the above hypothesis is correct :) There is such a feature skew_tick currently, refer to commit 5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot parameter, the bug disappear, however, the commit also mentioned that it will hurt power consumption. I will try Frederic's proposal which is similar to my original idea "how bad would it be to revert to sched_clock() instead of jiffies in vtime_delta()? We could use nanosecond granularity to check deltas but only perform an actual cputime update when that delta >= TICK_NSEC." Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 29 Mar 2017 23:12:00 +0200 Frederic Weisbecker wrote: > On Wed, Mar 29, 2017 at 09:23:57AM -0400, Luiz Capitulino wrote: > > > > There are various reproducers actually. I started off with the simple > > loop above, then wrote the attach program and then wrote the one > > you're mentioning: > > > > http://people.redhat.com/~lcapitul/real-time/acct-bug.c > > > > All of them reproduce the issue 100% of the time for me. > > > #define _GNU_SOURCE > > #include > > #include > > #include > > #include > > #include > > > > static int move_to_cpu(int cpu) > > { > > cpu_set_t set; > > > > CPU_ZERO(&set); > > CPU_SET(cpu, &set); > > return sched_setaffinity(0, sizeof(set), &set); > > } > > > > static void loop(void) > > { > > for (;;) ; > > } > > > > static int fork_hog(int cpu) > > { > > int pid; > > > > pid = (int) fork(); > > if (pid == 0) { > > move_to_cpu(cpu); > > loop(); > > exit(0); > > } > > > > return pid; > > } > > > > int main(int argc, char *argv[]) > > { > > int i, pid, cpu, nr_procs; > > > > if (argc != 3) { > > printf("usage: hog < nr-procs > < CPU >\n"); > > exit(1); > > } > > > > cpu = atoi(argv[2]); > > nr_procs = atoi(argv[1]); > > > > for (i = 0; i < nr_procs; i++) { > > pid = fork_hog(cpu); > > fprintf(stderr, "created hog%d pid=%d\n", i, pid); > > } > > > > fprintf(stderr, "pausing...\n"); > > pause(); > > > > return 0; > > } > > I just tried both of these and none seem to show incorrect cputime :-/ > I'm wondering if that bug depends on some hardware. Are you running on x86? My CPU is: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz I wonder if this issue depends on the timer used by the hrtimer subsystem.
Re: [BUG nohz]: wrong user and system time accounting
(Adding Thomas in Cc) On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote: > On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote: > > On Tue, 28 Mar 2017 13:24:06 -0400 > > Luiz Capitulino wrote: > > > > > 1. In my tracing I'm seeing that sometimes (always?) the > > > time interval between two timer interrupts is less than 1ms > > > > I think that's the root cause. > > > > In this trace, we see the following: > > > > 1. On CPU15, we transition from user-space to kernel-space because > > of a timer interrupt (it's the tick) > > > > 2. vtimer_delta() returns 0, because jiffies didn't change since the > > last accounting > > > > 3. While CPU15 is executing in kernel-space, jiffies is updated > > by CPU0 > > > > 4. When going back to user-space, vtime_delta() returns non-zero > > and the whole time is accounted for system time (observe how > > the cputime parameter in account_system_time() is less than 1ms) > > In other words, the tick on cpu0 is aligned > with the tick on the nohz_full cpus, and > jiffies is advanced while the nohz_full cpus > with an active tick happen to be in kernel > mode? Ah you found out faster than me :-) > Frederic, can you think of any reason why > the tick on nohz_full CPUs would end up aligned > with the tick on cpu0, instead of running at some > random offset? tick_init_jiffy_update() takes that decision to align all ticks. I'm not sure why. I don't see anything that could depend on that wide tick synchronization. The jiffies update itself relies on ktime to check when to update it. So even if the tick fires a bit later on CPU 1 than on CPU 0, the jiffies updates should stay coherent and should never exceed 999us delay in the worst case (for HZ=1000) Now I might overlook something. > > A random offset, or better yet a somewhat randomized > tick length to make sure that simultaneous ticks are > fairly rare and the vtime sampling does not end up > "in phase" with the jiffies incrementing, could make > the accounting work right again. > > Of course, that assumes the above hypothesis is correct :) I'm not sure that randomizing the tick start per CPU would be a right solution. Somewhere in the world you can be sure the tick randomization of some nohz_full CPU will coincide with the tick of CPU 0 :o) Or we could force that tick on nohz_full CPUs to be far from CPU 0's tick... I'm not sure such a solution would be accepted though.
Re: [BUG nohz]: wrong user and system time accounting
2017-03-30 6:17 GMT+08:00 Frederic Weisbecker : > On Wed, Mar 29, 2017 at 01:16:56PM -0400, Luiz Capitulino wrote: >> On Tue, 28 Mar 2017 13:24:06 -0400 >> Luiz Capitulino wrote: >> >> > 1. In my tracing I'm seeing that sometimes (always?) the >> > time interval between two timer interrupts is less than 1ms >> >> I think that's the root cause. >> >> I'm getting traces like this: >> >>hog-11980 [015] 341.494491: function: enter_from_user_mode >> <-- apic_timer_interrupt >> -0 [000] 341.494492: function: >> smp_apic_timer_interrupt <-- apic_timer_interrupt >>hog-11980 [015] 341.494492: function: >> __context_tracking_exit <-- enter_from_user_mode >> -0 [000] 341.494492: function: irq_enter <-- >> smp_apic_timer_interrupt >>hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 >> (now=4295008339 vtime_snap=4295008339) >>hog-11980 [015] 341.494492: function: >> smp_apic_timer_interrupt <-- apic_timer_interrupt >>hog-11980 [015] 341.494492: function: irq_enter <-- >> smp_apic_timer_interrupt >>hog-11980 [015] 341.494493: function: tick_sched_timer <-- >> __hrtimer_run_queues >> -0 [000] 341.494493: function: tick_sched_timer <-- >> __hrtimer_run_queues >> -0 [000] 341.494493: function: >> tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer >> -0 [000] 341.494494: function: do_timer <-- >> tick_do_update_jiffies64.part.14 >>hog-11980 [015] 341.494494: function: irq_exit <-- >> smp_apic_timer_interrupt >> -0 [000] 341.494494: bprint: do_timer: updated >> jiffies_64=4295008340 ticks=1 >>hog-11980 [015] 341.494494: function: >> __context_tracking_enter <-- prepare_exit_to_usermode >>hog-11980 [015] 341.494494: function: vtime_user_enter <-- >> __context_tracking_enter >>hog-11980 [015] 341.494495: bprint: vtime_delta: >> diff=100 (now=4295008340 vtime_snap=4295008339) >>hog-11980 [015] 341.494495: function: >> __vtime_account_system <-- vtime_user_enter >>hog-11980 [015] 341.494495: bprint: get_vtime_delta: >> vtime_snap=4295008339 now=4295008340 >>hog-11980 [015] 341.494495: function: account_system_time >> <-- __vtime_account_system >>hog-11980 [015] 341.494495: bprint: account_system_time: >> cputime=995488 >> -0 [000] 341.494497: function: irq_exit <-- >> smp_apic_timer_interrupt >> >> In this trace, we see the following: >> >> 1. On CPU15, we transition from user-space to kernel-space because >> of a timer interrupt (it's the tick) >> >> 2. vtimer_delta() returns 0, because jiffies didn't change since the >> last accounting >> >> 3. While CPU15 is executing in kernel-space, jiffies is updated >> by CPU0 >> >> 4. When going back to user-space, vtime_delta() returns non-zero >> and the whole time is accounted for system time (observe how >> the cputime parameter in account_system_time() is less than 1ms) > > Aah, so the issue can indeed happen if all CPUs fire their ticks at the same > time: > > > CPU 0 CPU 1 > - - >exit_user() // no cputime > update > tick X update_jiffies >enter_user() // cputime update > > >exit_user() //no cputime update > tick X+1 update_jiffies >enter_user() // cputime update > >> >> That's why my patch from yesterday fixed the issue, it increased the >> tick period to more than 1ms. So vtime_delta() always evaluate to true >> when transitioning from user-space to kernel-space (because we spend >> more than 1ms in user-space between ticks). The patch below achieves >> the same result by adding 10us to the tick period. >> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >> index 7fe53be..00e46df 100644 >> --- a/kernel/time/tick-sched.c >> +++ b/kernel/time/tick-sched.c >> @@ -1165,7 +1165,7 @@ static enum hrtimer_restart tick_sched_timer(struct >> hrtimer *timer) >> if (unlikely(ts->tick_stopped)) >> return HRTIMER_NORESTART; >> >> - hrtimer_forward(timer, now, tick_period); >> + hrtimer_forward(timer, now, tick_period + 1); > > I'm surprised it works though. If the 10us shift was only applied to CPU 0 > and not the > others then yes, but if it is applied to all CPUs, the ticks stay > synchronized and the > problem should stay... > > Ah wait! It can work because the nohz_full CPUs have their ticks sometimes > scheduled > by tick_nohz_stop_sched_tick() or tick_nohz_restart_sched_tick() which
Re: [BUG nohz]: wrong user and system time accounting
On Wed, Mar 29, 2017 at 09:23:57AM -0400, Luiz Capitulino wrote: > > There are various reproducers actually. I started off with the simple > loop above, then wrote the attach program and then wrote the one > you're mentioning: > > http://people.redhat.com/~lcapitul/real-time/acct-bug.c > > All of them reproduce the issue 100% of the time for me. > #define _GNU_SOURCE > #include > #include > #include > #include > #include > > static int move_to_cpu(int cpu) > { > cpu_set_t set; > > CPU_ZERO(&set); > CPU_SET(cpu, &set); > return sched_setaffinity(0, sizeof(set), &set); > } > > static void loop(void) > { > for (;;) ; > } > > static int fork_hog(int cpu) > { > int pid; > > pid = (int) fork(); > if (pid == 0) { > move_to_cpu(cpu); > loop(); > exit(0); > } > > return pid; > } > > int main(int argc, char *argv[]) > { > int i, pid, cpu, nr_procs; > > if (argc != 3) { > printf("usage: hog < nr-procs > < CPU >\n"); > exit(1); > } > > cpu = atoi(argv[2]); > nr_procs = atoi(argv[1]); > > for (i = 0; i < nr_procs; i++) { > pid = fork_hog(cpu); > fprintf(stderr, "created hog%d pid=%d\n", i, pid); > } > > fprintf(stderr, "pausing...\n"); > pause(); > > return 0; > } I just tried both of these and none seem to show incorrect cputime :-/ I'm wondering if that bug depends on some hardware.
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote: > On Tue, 28 Mar 2017 13:24:06 -0400 > Luiz Capitulino wrote: > > > 1. In my tracing I'm seeing that sometimes (always?) the > > time interval between two timer interrupts is less than 1ms > > I think that's the root cause. > > In this trace, we see the following: > > 1. On CPU15, we transition from user-space to kernel-space because > of a timer interrupt (it's the tick) > > 2. vtimer_delta() returns 0, because jiffies didn't change since the > last accounting > > 3. While CPU15 is executing in kernel-space, jiffies is updated > by CPU0 > > 4. When going back to user-space, vtime_delta() returns non-zero > and the whole time is accounted for system time (observe how > the cputime parameter in account_system_time() is less than 1ms) In other words, the tick on cpu0 is aligned with the tick on the nohz_full cpus, and jiffies is advanced while the nohz_full cpus with an active tick happen to be in kernel mode? Frederic, can you think of any reason why the tick on nohz_full CPUs would end up aligned with the tick on cpu0, instead of running at some random offset? A random offset, or better yet a somewhat randomized tick length to make sure that simultaneous ticks are fairly rare and the vtime sampling does not end up "in phase" with the jiffies incrementing, could make the accounting work right again. Of course, that assumes the above hypothesis is correct :)
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 29 Mar 2017 09:14:32 -0400 Rik van Riel wrote: > > I failed to reproduce with your config. I'm still getting 99% > > userspace > > cputime. So I'm wondering if the hogging style plays a role. > > > > I run pure user loops: > > > > int main(int argc, char **argv) > > { > > for (;;); > > return 0 > > } > > > > Does your user program perform syscalls or IOs of some sort? > > Luiz's program makes a syscall every millisecond, > if started with the arguments he gave as his > reproducer. There are various reproducers actually. I started off with the simple loop above, then wrote the attach program and then wrote the one you're mentioning: http://people.redhat.com/~lcapitul/real-time/acct-bug.c All of them reproduce the issue 100% of the time for me. #define _GNU_SOURCE #include #include #include #include #include static int move_to_cpu(int cpu) { cpu_set_t set; CPU_ZERO(&set); CPU_SET(cpu, &set); return sched_setaffinity(0, sizeof(set), &set); } static void loop(void) { for (;;) ; } static int fork_hog(int cpu) { int pid; pid = (int) fork(); if (pid == 0) { move_to_cpu(cpu); loop(); exit(0); } return pid; } int main(int argc, char *argv[]) { int i, pid, cpu, nr_procs; if (argc != 3) { printf("usage: hog < nr-procs > < CPU >\n"); exit(1); } cpu = atoi(argv[2]); nr_procs = atoi(argv[1]); for (i = 0; i < nr_procs; i++) { pid = fork_hog(cpu); fprintf(stderr, "created hog%d pid=%d\n", i, pid); } fprintf(stderr, "pausing...\n"); pause(); return 0; }
Re: [BUG nohz]: wrong user and system time accounting
On Wed, 2017-03-29 at 15:04 +0200, Frederic Weisbecker wrote: > On Thu, Mar 23, 2017 at 04:55:12PM -0400, Luiz Capitulino wrote: > > > > When there are two or more tasks executing in user-space and > > taking 100% of a nohz_full CPU, top reports 70% system time > > and 30% user time utilization. Sometimes I'm even able to get > > 100% system time and 0% user time. > > > > This was reproduced with latest Linus tree (093b995), but I > > don't believe it's a regression (at least not a recent one) > > as I can reproduce it with older kernels. Also, I have > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > without it yet. > > > > Below you'll find the steps to reproduce and some initial > > analysis. > > > > Steps to reproduce > > -- > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > I failed to reproduce with your config. I'm still getting 99% > userspace > cputime. So I'm wondering if the hogging style plays a role. > > I run pure user loops: > > int main(int argc, char **argv) > { > for (;;); > return 0 > } > > Does your user program perform syscalls or IOs of some sort? Luiz's program makes a syscall every millisecond, if started with the arguments he gave as his reproducer.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 23, 2017 at 04:55:12PM -0400, Luiz Capitulino wrote: > > When there are two or more tasks executing in user-space and > taking 100% of a nohz_full CPU, top reports 70% system time > and 30% user time utilization. Sometimes I'm even able to get > 100% system time and 0% user time. > > This was reproduced with latest Linus tree (093b995), but I > don't believe it's a regression (at least not a recent one) > as I can reproduce it with older kernels. Also, I have > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > without it yet. > > Below you'll find the steps to reproduce and some initial > analysis. > > Steps to reproduce > -- > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU I failed to reproduce with your config. I'm still getting 99% userspace cputime. So I'm wondering if the hogging style plays a role. I run pure user loops: int main(int argc, char **argv) { for (;;); return 0 } Does your user program perform syscalls or IOs of some sort?
Re: [BUG nohz]: wrong user and system time accounting
On Wed, Mar 29, 2017 at 05:56:30PM +0800, Wanpeng Li wrote: > 2017-03-29 5:26 GMT+08:00 Luiz Capitulino : > > On Tue, 28 Mar 2017 17:01:52 -0400 > > Rik van Riel wrote: > > > >> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: > >> > On Tue, 28 Mar 2017 13:24:06 -0400 > >> > Luiz Capitulino wrote: > >> > > I'm starting to suspect that the nohz code may be programming > >> > > the tick period to be shorter than 1ms when it re-activates > >> > > the tick. > >> > > >> > And I think I was right, it looks like the nohz code is programming > >> > the tick period incorrectly when restarting the tick. The patch below > >> > fixes things for me, but I still have some homework todo and more > >> > testing before posting a patch for inclusion. Could you guys test it? > >> > >> Your patch seems to work. I don't claim to understand why > >> your patch makes a difference, but for this particular test > >> case, on this particular setup, it seems to work... > > > > I don't fully understand why either yet. I was looking for places > > where nohz might be programming the tick period incorrectly and > > The bug is still present when I config CONTEXT_TRACKING_FORCE and > nohz=off in the boot parameter. Indeed I saw something similar a few days ago with: !CONFIG_NO_HZ_FULL && CONFIG_VIRT_CPU_ACCOUNTING_GEN && CONTEXT_TRACKING_FORCE And it disappeared with CONFIG_NO_HZ_FULL=y so I didn't care much because that setting isn't used in production and in fact I intend to remove CONTEXT_TRACKING_FORCE. But it could be the sign of something important. It might be different than Luiz's bug because I can't reproduce his bug yet even with his config.
Re: [BUG nohz]: wrong user and system time accounting
2017-03-29 5:26 GMT+08:00 Luiz Capitulino : > On Tue, 28 Mar 2017 17:01:52 -0400 > Rik van Riel wrote: > >> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: >> > On Tue, 28 Mar 2017 13:24:06 -0400 >> > Luiz Capitulino wrote: >> > > I'm starting to suspect that the nohz code may be programming >> > > the tick period to be shorter than 1ms when it re-activates >> > > the tick. >> > >> > And I think I was right, it looks like the nohz code is programming >> > the tick period incorrectly when restarting the tick. The patch below >> > fixes things for me, but I still have some homework todo and more >> > testing before posting a patch for inclusion. Could you guys test it? >> >> Your patch seems to work. I don't claim to understand why >> your patch makes a difference, but for this particular test >> case, on this particular setup, it seems to work... > > I don't fully understand why either yet. I was looking for places > where nohz might be programming the tick period incorrectly and The bug is still present when I config CONTEXT_TRACKING_FORCE and nohz=off in the boot parameter. Regards, Wanpeng Li > I found that there's a case in tick_nohz_stop_sched_tick() where > tick_nohz_restart() is called only to reprogram the tick timer, > not cancel the tick. In this case, ts->last_tick seems to be out > of date. Fixing this fixed accounting for me. > >> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >> > index 7fe53be..9abe979 100644 >> > --- a/kernel/time/tick-sched.c >> > +++ b/kernel/time/tick-sched.c >> > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart >> > tick_sched_timer(struct hrtimer *timer) >> > struct pt_regs *regs = get_irq_regs(); >> > ktime_t now = ktime_get(); >> > >> > + ts->last_tick = now; >> > tick_sched_do_timer(now); >> > >> > /* >> >
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 28 Mar 2017 17:24:11 -0400 Rik van Riel wrote: > On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: > > > And I think I was right, it looks like the nohz code is programming > > the tick period incorrectly when restarting the tick. The patch below > > fixes things for me, but I still have some homework todo and more > > testing before posting a patch for inclusion. Could you guys test it? > > I spoke too soon. After half an hour of runtime, > things have gotten aligned to give me about 50/50 > user time and system time with your test case, > again. Hmmm, maybe it's incomplete. I still think that nohz might screwing something up when re-activating the tick. > > This is on an 8 VCPU virtual machine, with > nohz_full=2-7, and the test case running on one > of the nohz_full CPUs. > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index 7fe53be..9abe979 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart > > tick_sched_timer(struct hrtimer *timer) > > struct pt_regs *regs = get_irq_regs(); > > ktime_t now = ktime_get(); > > > > + ts->last_tick = now; > > tick_sched_do_timer(now); > > > > /* >
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 28 Mar 2017 17:01:52 -0400 Rik van Riel wrote: > On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: > > On Tue, 28 Mar 2017 13:24:06 -0400 > > Luiz Capitulino wrote: > > > I'm starting to suspect that the nohz code may be programming > > > the tick period to be shorter than 1ms when it re-activates > > > the tick. > > > > And I think I was right, it looks like the nohz code is programming > > the tick period incorrectly when restarting the tick. The patch below > > fixes things for me, but I still have some homework todo and more > > testing before posting a patch for inclusion. Could you guys test it? > > Your patch seems to work. I don't claim to understand why > your patch makes a difference, but for this particular test > case, on this particular setup, it seems to work... I don't fully understand why either yet. I was looking for places where nohz might be programming the tick period incorrectly and I found that there's a case in tick_nohz_stop_sched_tick() where tick_nohz_restart() is called only to reprogram the tick timer, not cancel the tick. In this case, ts->last_tick seems to be out of date. Fixing this fixed accounting for me. > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index 7fe53be..9abe979 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart > > tick_sched_timer(struct hrtimer *timer) > > struct pt_regs *regs = get_irq_regs(); > > ktime_t now = ktime_get(); > > > > + ts->last_tick = now; > > tick_sched_do_timer(now); > > > > /* >
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: > And I think I was right, it looks like the nohz code is programming > the tick period incorrectly when restarting the tick. The patch below > fixes things for me, but I still have some homework todo and more > testing before posting a patch for inclusion. Could you guys test it? I spoke too soon. After half an hour of runtime, things have gotten aligned to give me about 50/50 user time and system time with your test case, again. This is on an 8 VCPU virtual machine, with nohz_full=2-7, and the test case running on one of the nohz_full CPUs. > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..9abe979 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart > tick_sched_timer(struct hrtimer *timer) > struct pt_regs *regs = get_irq_regs(); > ktime_t now = ktime_get(); > > + ts->last_tick = now; > tick_sched_do_timer(now); > > /*
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote: > On Tue, 28 Mar 2017 13:24:06 -0400 > Luiz Capitulino wrote: > > I'm starting to suspect that the nohz code may be programming > > the tick period to be shorter than 1ms when it re-activates > > the tick. > > And I think I was right, it looks like the nohz code is programming > the tick period incorrectly when restarting the tick. The patch below > fixes things for me, but I still have some homework todo and more > testing before posting a patch for inclusion. Could you guys test it? Your patch seems to work. I don't claim to understand why your patch makes a difference, but for this particular test case, on this particular setup, it seems to work... > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..9abe979 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart > tick_sched_timer(struct hrtimer *timer) > struct pt_regs *regs = get_irq_regs(); > ktime_t now = ktime_get(); > > + ts->last_tick = now; > tick_sched_do_timer(now); > > /*
Re: [BUG nohz]: wrong user and system time accounting
On Tue, 28 Mar 2017 13:28:13 +0800 Wanpeng Li wrote: > 2017-03-28 2:38 GMT+08:00 Luiz Capitulino : > > On Mon, 27 Mar 2017 09:56:47 +0800 > > Wanpeng Li wrote: > > > >> Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched, > >> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug > >> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true, > >> then just stress all the online cpus or just one cpu and leave others > >> idle(so it stresses the global timekeeping one), top show 100% > >> sys-time. And another way to reproduce it is by nohz_full, and gives > >> the stress to the house keeping cpu, the top show 100% sys-time of the > >> house keeping cpu, and also the other cpus who have at least two tasks > >> running on and in full_nohz mode. > > > > We're not short on reproducers, I have a new one too: > > > > http://people.redhat.com/~lcapitul/real-time/acct-bug.c > > > > This is a single threaded task that reproduces the issue. If you > > run it as instructed, you'll get: > > > > - nohz_full CPU: 95% system time 5% idle time > > - non-nohz_full CPU: 95% user time 5% idle time (expected behavior) > > > > This reproduces the issue, but not for the reasons I expected. I was > > trying to mimic what I was seeing on my trace when tracing the two > > task problem. Which is: a task stays 995us in user-space and then > > enters the kernel. Time won't be accounted for user-space because > > we're not 1 jiffies yet, but if the task stays in the kernel for more > > than 5us, then time will be accounted for system time when going > > back to user-space. > > > > However, what really seems to be happening is: acct-bug is causing > > the tick to be re-activated (why? it shouldn't) and that causes the > > issue to appear. This is consistent with my other observations: I > > can only reproduce the issue if the nohz_full CPU re-activates the tick. > > I see there are other kthreads like migration, kworker, > torture_shuffle etc on the isolated CPU. Except for torture_shuffle (which is new to me, and I guess could be disabled in .config) the other threads should not be runnable for most of the time. > > Regards, > Wanpeng Li > > > > >> Let's consider the cpu which has responsibility for the global > >> timekeeping, as the tracing posted above, the vtime_account_user() is > >> called before tick_sched_timer() which will update jiffies, > > > > But the vtime_account_user() call and the jiffies update happen > > on different CPUs, no? So the ordering shouldn't matter. > > > >> so jiffies > >> is stale in vtime_account_user() and the run time in userspace is > >> skipped, the vtime_user_enter() is called after jiffies update, so > >> both the time in userspace and in kernel are accumulated to sys time. > >> > >> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is > >> fine. However, if you give stress to the housekeeping cpu, top will > >> show 100% sys-time of both the housekeeping cpu and the other cpus who > >> have at least two tasks running on and in full_nohz mode. > > > > The housekeeping CPUs are idle with my reproducers. > > > >> I think it > >> is because the stress delays the timer interrupt handling in some > >> degree, then the jiffies is not updated timely before other cpus > >> access it in vtime_account_user(). > >> > >> I think we can keep syscalls/exceptions context tracking still in > >> jiffies based sampling and utilize local_clock() in vtime_delta() > >> again for irqs which avoids jiffies stale influence. I can make a > >> patch if the idea is acceptable or there is any better proposal. :) > >> > >> Regards, > >> Wanpeng Li > >> > > >
Re: [BUG nohz]: wrong user and system time accounting
2017-03-28 1:35 GMT+08:00 Rik van Riel : > On Mon, 2017-03-27 at 09:56 +0800, Wanpeng Li wrote: >> >> Actually after I bisect, the first bad commit is ff9a9b4c4334 >> ("sched, >> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug >> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true > > At the time, we thought it was an "occasionally bad" / "unlucky" > kind of bug, not a systemic issue, like your observations seem > to suggest. > >> Let's consider the cpu which has responsibility for the global >> timekeeping, as the tracing posted above, the vtime_account_user() is >> called before tick_sched_timer() which will update jiffies, so >> jiffies >> is stale in vtime_account_user() and the run time in userspace is >> skipped, the vtime_user_enter() is called after jiffies update, so >> both the time in userspace and in kernel are accumulated to sys >> time. >> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is >> fine. However, if you give stress to the housekeeping cpu, top will >> show 100% sys-time of both the housekeeping cpu and the other cpus >> who >> have at least two tasks running on and in full_nohz mode. I think it >> is because the stress delays the timer interrupt handling in some >> degree, then the jiffies is not updated timely before other cpus >> access it in vtime_account_user(). >> >> I think we can keep syscalls/exceptions context tracking still in >> jiffies based sampling and utilize local_clock() in vtime_delta() >> again for irqs which avoids jiffies stale influence. I can make a >> patch if the idea is acceptable or there is any better proposal. :) > > Making that patch seems worthwhile, but I would like to > know what the root cause is of the issue that is being > observed. > > Is the problem due to the nohz_full CPU receiving an > interrupt at the same time the timer interrupt fires on > the housekeeping CPU? > > Is it due to a nohz_full CPU updating jiffies all by > itself from irq context? In that case, could it be > better to always have that be done by the housekeeping > CPU? I observed that the jiffies is always updated by housekeeping CPU as we expected. Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
2017-03-28 2:38 GMT+08:00 Luiz Capitulino : > On Mon, 27 Mar 2017 09:56:47 +0800 > Wanpeng Li wrote: > >> Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched, >> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug >> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true, >> then just stress all the online cpus or just one cpu and leave others >> idle(so it stresses the global timekeeping one), top show 100% >> sys-time. And another way to reproduce it is by nohz_full, and gives >> the stress to the house keeping cpu, the top show 100% sys-time of the >> house keeping cpu, and also the other cpus who have at least two tasks >> running on and in full_nohz mode. > > We're not short on reproducers, I have a new one too: > > http://people.redhat.com/~lcapitul/real-time/acct-bug.c > > This is a single threaded task that reproduces the issue. If you > run it as instructed, you'll get: > > - nohz_full CPU: 95% system time 5% idle time > - non-nohz_full CPU: 95% user time 5% idle time (expected behavior) > > This reproduces the issue, but not for the reasons I expected. I was > trying to mimic what I was seeing on my trace when tracing the two > task problem. Which is: a task stays 995us in user-space and then > enters the kernel. Time won't be accounted for user-space because > we're not 1 jiffies yet, but if the task stays in the kernel for more > than 5us, then time will be accounted for system time when going > back to user-space. > > However, what really seems to be happening is: acct-bug is causing > the tick to be re-activated (why? it shouldn't) and that causes the > issue to appear. This is consistent with my other observations: I > can only reproduce the issue if the nohz_full CPU re-activates the tick. I see there are other kthreads like migration, kworker, torture_shuffle etc on the isolated CPU. Regards, Wanpeng Li > >> Let's consider the cpu which has responsibility for the global >> timekeeping, as the tracing posted above, the vtime_account_user() is >> called before tick_sched_timer() which will update jiffies, > > But the vtime_account_user() call and the jiffies update happen > on different CPUs, no? So the ordering shouldn't matter. > >> so jiffies >> is stale in vtime_account_user() and the run time in userspace is >> skipped, the vtime_user_enter() is called after jiffies update, so >> both the time in userspace and in kernel are accumulated to sys time. >> >> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is >> fine. However, if you give stress to the housekeeping cpu, top will >> show 100% sys-time of both the housekeeping cpu and the other cpus who >> have at least two tasks running on and in full_nohz mode. > > The housekeeping CPUs are idle with my reproducers. > >> I think it >> is because the stress delays the timer interrupt handling in some >> degree, then the jiffies is not updated timely before other cpus >> access it in vtime_account_user(). >> >> I think we can keep syscalls/exceptions context tracking still in >> jiffies based sampling and utilize local_clock() in vtime_delta() >> again for irqs which avoids jiffies stale influence. I can make a >> patch if the idea is acceptable or there is any better proposal. :) >> >> Regards, >> Wanpeng Li >> >
Re: [BUG nohz]: wrong user and system time accounting
On Mon, 27 Mar 2017 09:56:47 +0800 Wanpeng Li wrote: > Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched, > time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug > can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true, > then just stress all the online cpus or just one cpu and leave others > idle(so it stresses the global timekeeping one), top show 100% > sys-time. And another way to reproduce it is by nohz_full, and gives > the stress to the house keeping cpu, the top show 100% sys-time of the > house keeping cpu, and also the other cpus who have at least two tasks > running on and in full_nohz mode. We're not short on reproducers, I have a new one too: http://people.redhat.com/~lcapitul/real-time/acct-bug.c This is a single threaded task that reproduces the issue. If you run it as instructed, you'll get: - nohz_full CPU: 95% system time 5% idle time - non-nohz_full CPU: 95% user time 5% idle time (expected behavior) This reproduces the issue, but not for the reasons I expected. I was trying to mimic what I was seeing on my trace when tracing the two task problem. Which is: a task stays 995us in user-space and then enters the kernel. Time won't be accounted for user-space because we're not 1 jiffies yet, but if the task stays in the kernel for more than 5us, then time will be accounted for system time when going back to user-space. However, what really seems to be happening is: acct-bug is causing the tick to be re-activated (why? it shouldn't) and that causes the issue to appear. This is consistent with my other observations: I can only reproduce the issue if the nohz_full CPU re-activates the tick. > Let's consider the cpu which has responsibility for the global > timekeeping, as the tracing posted above, the vtime_account_user() is > called before tick_sched_timer() which will update jiffies, But the vtime_account_user() call and the jiffies update happen on different CPUs, no? So the ordering shouldn't matter. > so jiffies > is stale in vtime_account_user() and the run time in userspace is > skipped, the vtime_user_enter() is called after jiffies update, so > both the time in userspace and in kernel are accumulated to sys time. > > If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is > fine. However, if you give stress to the housekeeping cpu, top will > show 100% sys-time of both the housekeeping cpu and the other cpus who > have at least two tasks running on and in full_nohz mode. The housekeeping CPUs are idle with my reproducers. > I think it > is because the stress delays the timer interrupt handling in some > degree, then the jiffies is not updated timely before other cpus > access it in vtime_account_user(). > > I think we can keep syscalls/exceptions context tracking still in > jiffies based sampling and utilize local_clock() in vtime_delta() > again for irqs which avoids jiffies stale influence. I can make a > patch if the idea is acceptable or there is any better proposal. :) > > Regards, > Wanpeng Li >
Re: [BUG nohz]: wrong user and system time accounting
On Mon, 2017-03-27 at 09:56 +0800, Wanpeng Li wrote: > > Actually after I bisect, the first bad commit is ff9a9b4c4334 > ("sched, > time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug > can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true At the time, we thought it was an "occasionally bad" / "unlucky" kind of bug, not a systemic issue, like your observations seem to suggest. > Let's consider the cpu which has responsibility for the global > timekeeping, as the tracing posted above, the vtime_account_user() is > called before tick_sched_timer() which will update jiffies, so > jiffies > is stale in vtime_account_user() and the run time in userspace is > skipped, the vtime_user_enter() is called after jiffies update, so > both the time in userspace and in kernel are accumulated to sys > time. > If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is > fine. However, if you give stress to the housekeeping cpu, top will > show 100% sys-time of both the housekeeping cpu and the other cpus > who > have at least two tasks running on and in full_nohz mode. I think it > is because the stress delays the timer interrupt handling in some > degree, then the jiffies is not updated timely before other cpus > access it in vtime_account_user(). > > I think we can keep syscalls/exceptions context tracking still in > jiffies based sampling and utilize local_clock() in vtime_delta() > again for irqs which avoids jiffies stale influence. I can make a > patch if the idea is acceptable or there is any better proposal. :) Making that patch seems worthwhile, but I would like to know what the root cause is of the issue that is being observed. Is the problem due to the nohz_full CPU receiving an interrupt at the same time the timer interrupt fires on the housekeeping CPU? Is it due to a nohz_full CPU updating jiffies all by itself from irq context? In that case, could it be better to always have that be done by the housekeeping CPU? What exactly is going on here?
Re: [BUG nohz]: wrong user and system time accounting
On Thu, Mar 23, 2017 at 08:56:02PM -0400, Rik van Riel wrote: > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote: > > When there are two or more tasks executing in user-space and > > taking 100% of a nohz_full CPU, top reports 70% system time > > and 30% user time utilization. Sometimes I'm even able to get > > 100% system time and 0% user time. > > > > This was reproduced with latest Linus tree (093b995), but I > > don't believe it's a regression (at least not a recent one) > > as I can reproduce it with older kernels. Also, I have > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > without it yet. > > > > Below you'll find the steps to reproduce and some initial > > analysis. > > > > Steps to reproduce > > -- > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > > > 3. Run top -d1 and check system time > > > > NOTE: When there's only one task hogging a nohz_full CPU, top > > shows 100% user-time, as expected > > > > Initial analysis > > > > > > When tracing vtime accounting functions and the user-space/kernel > > transitions when the issue is taking place, I see several of the > > following: > > > > hog-10552 [015] 1132.711104: > > function: enter_from_user_mode <-- apic_timer_interrupt > > hog-10552 [015] 1132.711105: > > function: __context_tracking_exit <-- > > enter_from_user_mode > > hog-10552 [015] 1132.711105: > > bprint: __context_tracking_exit.part.4: new state=1 cur > > state=1 active=1 > > hog-10552 [015] 1132.711105: > > function: vtime_account_user <-- > > __context_tracking_exit.part.4 > > hog-10552 [015] 1132.711105: > > function: smp_apic_timer_interrupt <-- > > apic_timer_interrupt > > hog-10552 [015] 1132.711106: function: irq_enter <-- > > smp_apic_timer_interrupt > > hog-10552 [015] 1132.711106: function: tick_sched_timer > > <-- __hrtimer_run_queues > > hog-10552 [015] 1132.711108: function: irq_exit <-- > > smp_apic_timer_interrupt > > hog-10552 [015] 1132.711108: > > function: __context_tracking_enter <-- > > prepare_exit_to_usermode > > hog-10552 [015] 1132.711108: > > bprint: __context_tracking_enter.part.2: new state=1 > > cur state=0 active=1 > > hog-10552 [015] 1132.711109: function: vtime_user_enter > > <-- __context_tracking_enter.part.2 > > hog-10552 [015] 1132.711109: > > function: __vtime_account_system <-- vtime_user_enter > > hog-10552 [015] 1132.711109: > > function: account_system_time <-- __vtime_account_system > > > > On entering the kernel due to a timer interrupt, vtime_account_user() > > skips user-time accounting. Then later on when returning to user- > > space, > > vtime_user_enter() is probably accounting the whole time (ie. user- > > space > > plus kernel-space) to system time. > > > > Now, when does vtime_account_user() skips accounting? Well, when the > > time delta is less then one jiffie. This would imply that > > vtime_account_user() > > is being called less than one jiffie since the last accounting, but I > > haven't > > confirmed any of this yet. > > Jiffies should be advanced by the timer interrupt, on the > housekeeping CPU, which is not doing context tracking. > > Why is the isolated/nohz_full CPU receiving timer interrupts > at all? > > I thought it would not, but obviously I am wrong. What is > going on here? This thread sounds awful familiar to me. With CONFIG_NO_HZ_FULL=y && CONFIG_VIRT_CPU_ACCOUNTING_GEN=y I observed process accounting anomalies with user CPU time being misaccounted as system time all the way back to 4.6.0. After switching to CONFIG_NO_HZ_IDLE=y && CONFIG_VIRT_CPU_ACCOUNTING_GEN=n the issues went away. The lkml thread I had seen at that time which compelled me to suspect these settings was this: http://lkml.iu.edu/hypermail/linux/kernel/1608.2/05860.html It sounds like this issue is finally beginning to be understood though, good work! Regards, Vito Caputo
Re: [BUG nohz]: wrong user and system time accounting
2017-03-24 4:55 GMT+08:00 Luiz Capitulino : > > When there are two or more tasks executing in user-space and > taking 100% of a nohz_full CPU, top reports 70% system time > and 30% user time utilization. Sometimes I'm even able to get > 100% system time and 0% user time. > > This was reproduced with latest Linus tree (093b995), but I > don't believe it's a regression (at least not a recent one) > as I can reproduce it with older kernels. Also, I have > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > without it yet. > > Below you'll find the steps to reproduce and some initial > analysis. > > Steps to reproduce > -- > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > 3. Run top -d1 and check system time > > NOTE: When there's only one task hogging a nohz_full CPU, top > shows 100% user-time, as expected > > Initial analysis > > > When tracing vtime accounting functions and the user-space/kernel > transitions when the issue is taking place, I see several of the > following: > > hog-10552 [015] 1132.711104: function: enter_from_user_mode <-- > apic_timer_interrupt > hog-10552 [015] 1132.711105: function: __context_tracking_exit > <-- enter_from_user_mode > hog-10552 [015] 1132.711105: bprint: > __context_tracking_exit.part.4: new state=1 cur state=1 active=1 > hog-10552 [015] 1132.711105: function: vtime_account_user <-- > __context_tracking_exit.part.4 > hog-10552 [015] 1132.711105: function: smp_apic_timer_interrupt > <-- apic_timer_interrupt > hog-10552 [015] 1132.711106: function: irq_enter <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711106: function: tick_sched_timer <-- > __hrtimer_run_queues > hog-10552 [015] 1132.711108: function: irq_exit <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711108: function: __context_tracking_enter > <-- prepare_exit_to_usermode > hog-10552 [015] 1132.711108: bprint: > __context_tracking_enter.part.2: new state=1 cur state=0 active=1 > hog-10552 [015] 1132.711109: function: vtime_user_enter <-- > __context_tracking_enter.part.2 > hog-10552 [015] 1132.711109: function: __vtime_account_system > <-- vtime_user_enter > hog-10552 [015] 1132.711109: function: account_system_time <-- > __vtime_account_system > > On entering the kernel due to a timer interrupt, vtime_account_user() > skips user-time accounting. Then later on when returning to user-space, > vtime_user_enter() is probably accounting the whole time (ie. user-space > plus kernel-space) to system time. Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true, then just stress all the online cpus or just one cpu and leave others idle(so it stresses the global timekeeping one), top show 100% sys-time. And another way to reproduce it is by nohz_full, and gives the stress to the house keeping cpu, the top show 100% sys-time of the house keeping cpu, and also the other cpus who have at least two tasks running on and in full_nohz mode. Let's consider the cpu which has responsibility for the global timekeeping, as the tracing posted above, the vtime_account_user() is called before tick_sched_timer() which will update jiffies, so jiffies is stale in vtime_account_user() and the run time in userspace is skipped, the vtime_user_enter() is called after jiffies update, so both the time in userspace and in kernel are accumulated to sys time. If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is fine. However, if you give stress to the housekeeping cpu, top will show 100% sys-time of both the housekeeping cpu and the other cpus who have at least two tasks running on and in full_nohz mode. I think it is because the stress delays the timer interrupt handling in some degree, then the jiffies is not updated timely before other cpus access it in vtime_account_user(). I think we can keep syscalls/exceptions context tracking still in jiffies based sampling and utilize local_clock() in vtime_delta() again for irqs which avoids jiffies stale influence. I can make a patch if the idea is acceptable or there is any better proposal. :) Regards, Wanpeng Li
Re: [BUG nohz]: wrong user and system time accounting
On Fri, 24 Mar 2017 09:52:11 +0800 Wanpeng Li wrote: > 2017-03-24 4:55 GMT+08:00 Luiz Capitulino : > > > > When there are two or more tasks executing in user-space and > > taking 100% of a nohz_full CPU, top reports 70% system time > > and 30% user time utilization. Sometimes I'm even able to get > > 100% system time and 0% user time. > > > > This was reproduced with latest Linus tree (093b995), but I > > don't believe it's a regression (at least not a recent one) > > as I can reproduce it with older kernels. Also, I have > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > without it yet. > > > > Below you'll find the steps to reproduce and some initial > > analysis. > > > > Steps to reproduce > > -- > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > > > 3. Run top -d1 and check system time > > > > NOTE: When there's only one task hogging a nohz_full CPU, top > > shows 100% user-time, as expected > > I just saw at most 12% system time instead of 30% or 100%. Could you > grep HZ /boot/config-`uname -r` and post here? I'm sending you the whole thing: # # Automatically generated file; DO NOT EDIT. # Linux/x86 4.11.0-rc3 Kernel Configuration # CONFIG_64BIT=y CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_MMU=y CONFIG_ARCH_MMAP_RND_BITS_MIN=28 CONFIG_ARCH_MMAP_RND_BITS_MAX=32 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8 CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16 CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_ARCH_HAS_CPU_RELAX=y CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y CONFIG_ARCH_HIBERNATION_POSSIBLE=y CONFIG_ARCH_SUSPEND_POSSIBLE=y CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y CONFIG_ARCH_WANT_GENERAL_HUGETLB=y CONFIG_ZONE_DMA32=y CONFIG_AUDIT_ARCH=y CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_HAVE_INTEL_TXT=y CONFIG_X86_64_SMP=y CONFIG_ARCH_SUPPORTS_UPROBES=y CONFIG_FIX_EARLYCON_MEM=y CONFIG_PGTABLE_LEVELS=4 CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_EXTABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_CROSS_COMPILE="" # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_KERNEL_GZIP=y # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set # CONFIG_KERNEL_XZ is not set # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set CONFIG_DEFAULT_HOSTNAME="(none)" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_CROSS_MEMORY_ATTACH=y CONFIG_FHANDLE=y CONFIG_USELIB=y CONFIG_AUDIT=y CONFIG_HAVE_ARCH_AUDITSYSCALL=y CONFIG_AUDITSYSCALL=y CONFIG_AUDIT_WATCH=y CONFIG_AUDIT_TREE=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y # CONFIG_IRQ_DOMAIN_DEBUG is not set CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_DATA=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y # CONFIG_NO_HZ_FULL_ALL is not set # CONFIG_NO_HZ_FULL_SYSIDLE is not set CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y # # CPU/Task time and stats accounting # CONFIG_VIRT_CPU_ACCOUNTING=y CONFIG_VIRT_CPU_ACCOUNTING_GEN=y CONFIG_IRQ_TIME_ACCOUNTING=y CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # # RCU Subsystem # CONFIG_PREEMPT_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y # CONFIG_TASKS_RCU is not set CONFIG_RCU_STALL_COMMON=y CONFIG_CONTEXT_TRACKING=y # CONFIG_CONTEXT_TRACKING_FORCE is not set # CONFIG_TREE_RCU_TRACE is not set CONFIG_RCU_NOCB_CPU=y # CONFIG_RCU_NOCB_CPU_NONE is not set # CONFIG_RCU_NOCB_CPU_ZERO is not set CONFIG_RCU_NOCB_CPU_ALL=y CONFIG_BUILD_BIN2C=y # CONFIG_IKCONFIG is not set CONFIG_LOG_BUF_SHIFT=20 CONFIG_
Re: [BUG nohz]: wrong user and system time accounting
2017-03-24 4:55 GMT+08:00 Luiz Capitulino : > > When there are two or more tasks executing in user-space and > taking 100% of a nohz_full CPU, top reports 70% system time > and 30% user time utilization. Sometimes I'm even able to get > 100% system time and 0% user time. > > This was reproduced with latest Linus tree (093b995), but I > don't believe it's a regression (at least not a recent one) > as I can reproduce it with older kernels. Also, I have > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > without it yet. > > Below you'll find the steps to reproduce and some initial > analysis. > > Steps to reproduce > -- > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > 3. Run top -d1 and check system time > > NOTE: When there's only one task hogging a nohz_full CPU, top > shows 100% user-time, as expected I just saw at most 12% system time instead of 30% or 100%. Could you grep HZ /boot/config-`uname -r` and post here? Regards, Wanpeng Li > > Initial analysis > > > When tracing vtime accounting functions and the user-space/kernel > transitions when the issue is taking place, I see several of the > following: > > hog-10552 [015] 1132.711104: function: enter_from_user_mode <-- > apic_timer_interrupt > hog-10552 [015] 1132.711105: function: __context_tracking_exit > <-- enter_from_user_mode > hog-10552 [015] 1132.711105: bprint: > __context_tracking_exit.part.4: new state=1 cur state=1 active=1 > hog-10552 [015] 1132.711105: function: vtime_account_user <-- > __context_tracking_exit.part.4 > hog-10552 [015] 1132.711105: function: smp_apic_timer_interrupt > <-- apic_timer_interrupt > hog-10552 [015] 1132.711106: function: irq_enter <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711106: function: tick_sched_timer <-- > __hrtimer_run_queues > hog-10552 [015] 1132.711108: function: irq_exit <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711108: function: __context_tracking_enter > <-- prepare_exit_to_usermode > hog-10552 [015] 1132.711108: bprint: > __context_tracking_enter.part.2: new state=1 cur state=0 active=1 > hog-10552 [015] 1132.711109: function: vtime_user_enter <-- > __context_tracking_enter.part.2 > hog-10552 [015] 1132.711109: function: __vtime_account_system > <-- vtime_user_enter > hog-10552 [015] 1132.711109: function: account_system_time <-- > __vtime_account_system > > On entering the kernel due to a timer interrupt, vtime_account_user() > skips user-time accounting. Then later on when returning to user-space, > vtime_user_enter() is probably accounting the whole time (ie. user-space > plus kernel-space) to system time. > > Now, when does vtime_account_user() skips accounting? Well, when the > time delta is less then one jiffie. This would imply that vtime_account_user() > is being called less than one jiffie since the last accounting, but I haven't > confirmed any of this yet.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 23 Mar 2017 21:08:38 -0400 Rik van Riel wrote: > On Thu, 2017-03-23 at 21:05 -0400, Luiz Capitulino wrote: > > On Thu, 23 Mar 2017 20:56:02 -0400 > > Rik van Riel wrote: > > > > > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote: > > > > When there are two or more tasks executing in user-space and > > > > taking 100% of a nohz_full CPU, top reports 70% system time > > > > and 30% user time utilization. Sometimes I'm even able to get > > > > 100% system time and 0% user time. > > > > > > > > This was reproduced with latest Linus tree (093b995), but I > > > > don't believe it's a regression (at least not a recent one) > > > > as I can reproduce it with older kernels. Also, I have > > > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > > > without it yet. > > > > > > > > Below you'll find the steps to reproduce and some initial > > > > analysis. > > > > > > > > Steps to reproduce > > > > -- > > > > > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > > > > > > > 3. Run top -d1 and check system time > > > > > > > > NOTE: When there's only one task hogging a nohz_full CPU, top > > > > shows 100% user-time, as expected > > > > > > > > Initial analysis > > > > > > > > > > > > When tracing vtime accounting functions and the user-space/kernel > > > > transitions when the issue is taking place, I see several of the > > > > following: > > > > > > > > hog-10552 [015] 1132.711104: > > > > function: enter_from_user_mode <-- > > > > apic_timer_interrupt > > > > hog-10552 [015] 1132.711105: > > > > function: __context_tracking_exit <-- > > > > enter_from_user_mode > > > > hog-10552 [015] 1132.711105: > > > > bprint: __context_tracking_exit.part.4: new state=1 > > > > cur > > > > state=1 active=1 > > > > hog-10552 [015] 1132.711105: > > > > function: vtime_account_user <-- > > > > __context_tracking_exit.part.4 > > > > hog-10552 [015] 1132.711105: > > > > function: smp_apic_timer_interrupt <-- > > > > apic_timer_interrupt > > > > hog-10552 [015] 1132.711106: function: irq_enter <-- > > > > smp_apic_timer_interrupt > > > > hog-10552 [015] 1132.711106: > > > > function: tick_sched_timer > > > > <-- __hrtimer_run_queues > > > > hog-10552 [015] 1132.711108: function: irq_exit <-- > > > > smp_apic_timer_interrupt > > > > hog-10552 [015] 1132.711108: > > > > function: __context_tracking_enter <-- > > > > prepare_exit_to_usermode > > > > hog-10552 [015] 1132.711108: > > > > bprint: __context_tracking_enter.part.2: new > > > > state=1 > > > > cur state=0 active=1 > > > > hog-10552 [015] 1132.711109: > > > > function: vtime_user_enter > > > > <-- __context_tracking_enter.part.2 > > > > hog-10552 [015] 1132.711109: > > > > function: __vtime_account_system <-- vtime_user_enter > > > > hog-10552 [015] 1132.711109: > > > > function: account_system_time <-- > > > > __vtime_account_system > > > > > > > > On entering the kernel due to a timer interrupt, > > > > vtime_account_user() > > > > skips user-time accounting. Then later on when returning to user- > > > > space, > > > > vtime_user_enter() is probably accounting the whole time (ie. > > > > user- > > > > space > > > > plus kernel-space) to system time. > > > > > > > > Now, when does vtime_account_user() skips accounting? Well, when > > > > the > > > > time delta is less then one jiffie. This would imply that > > > > vtime_account_user() > > > > is being called less than one jiffie since the last accounting, > > > > but I > > > > haven't > > > > confirmed any of this yet. > > > > > > Jiffies should be advanced by the timer interrupt, on the > > > housekeeping CPU, which is not doing context tracking. > > > > The hypothesis isn't that it wasn't advanced, but that we stayed in > > user-space less than 1ms. > > That is part of the hypothesis. The other part of the hypothesis > involves jiffies advancing on the nohz_full & isolated CPU while > that CPU is in kernel mode 30% of the time. OK. > I have no good explanation for the latter yet... > > > > Why is the isolated/nohz_full CPU receiving timer interrupts > > > at all? > > > > > > I thought it would not, but obviously I am wrong. What is > > > going on here? > > > > There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When > > that happens, the tick is re-activated. We're not nohz_full anymore, > > but accounting should still work. > > Isn't the scheduler tick distinct from the timer interrupt, > or am I confused? If you consider the scheduler tick to be the code that's run by scheduler_tick(), yes they are distinct. But I was referring to tick_sched_timer() the "main" tick handler. This one runs as a hrtimer handler. In the case described in this email
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-23 at 21:05 -0400, Luiz Capitulino wrote: > On Thu, 23 Mar 2017 20:56:02 -0400 > Rik van Riel wrote: > > > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote: > > > When there are two or more tasks executing in user-space and > > > taking 100% of a nohz_full CPU, top reports 70% system time > > > and 30% user time utilization. Sometimes I'm even able to get > > > 100% system time and 0% user time. > > > > > > This was reproduced with latest Linus tree (093b995), but I > > > don't believe it's a regression (at least not a recent one) > > > as I can reproduce it with older kernels. Also, I have > > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > > without it yet. > > > > > > Below you'll find the steps to reproduce and some initial > > > analysis. > > > > > > Steps to reproduce > > > -- > > > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > > > > > 3. Run top -d1 and check system time > > > > > > NOTE: When there's only one task hogging a nohz_full CPU, top > > > shows 100% user-time, as expected > > > > > > Initial analysis > > > > > > > > > When tracing vtime accounting functions and the user-space/kernel > > > transitions when the issue is taking place, I see several of the > > > following: > > > > > > hog-10552 [015] 1132.711104: > > > function: enter_from_user_mode <-- > > > apic_timer_interrupt > > > hog-10552 [015] 1132.711105: > > > function: __context_tracking_exit <-- > > > enter_from_user_mode > > > hog-10552 [015] 1132.711105: > > > bprint: __context_tracking_exit.part.4: new state=1 > > > cur > > > state=1 active=1 > > > hog-10552 [015] 1132.711105: > > > function: vtime_account_user <-- > > > __context_tracking_exit.part.4 > > > hog-10552 [015] 1132.711105: > > > function: smp_apic_timer_interrupt <-- > > > apic_timer_interrupt > > > hog-10552 [015] 1132.711106: function: irq_enter <-- > > > smp_apic_timer_interrupt > > > hog-10552 [015] 1132.711106: > > > function: tick_sched_timer > > > <-- __hrtimer_run_queues > > > hog-10552 [015] 1132.711108: function: irq_exit <-- > > > smp_apic_timer_interrupt > > > hog-10552 [015] 1132.711108: > > > function: __context_tracking_enter <-- > > > prepare_exit_to_usermode > > > hog-10552 [015] 1132.711108: > > > bprint: __context_tracking_enter.part.2: new > > > state=1 > > > cur state=0 active=1 > > > hog-10552 [015] 1132.711109: > > > function: vtime_user_enter > > > <-- __context_tracking_enter.part.2 > > > hog-10552 [015] 1132.711109: > > > function: __vtime_account_system <-- vtime_user_enter > > > hog-10552 [015] 1132.711109: > > > function: account_system_time <-- > > > __vtime_account_system > > > > > > On entering the kernel due to a timer interrupt, > > > vtime_account_user() > > > skips user-time accounting. Then later on when returning to user- > > > space, > > > vtime_user_enter() is probably accounting the whole time (ie. > > > user- > > > space > > > plus kernel-space) to system time. > > > > > > Now, when does vtime_account_user() skips accounting? Well, when > > > the > > > time delta is less then one jiffie. This would imply that > > > vtime_account_user() > > > is being called less than one jiffie since the last accounting, > > > but I > > > haven't > > > confirmed any of this yet. > > > > Jiffies should be advanced by the timer interrupt, on the > > housekeeping CPU, which is not doing context tracking. > > The hypothesis isn't that it wasn't advanced, but that we stayed in > user-space less than 1ms. That is part of the hypothesis. The other part of the hypothesis involves jiffies advancing on the nohz_full & isolated CPU while that CPU is in kernel mode 30% of the time. I have no good explanation for the latter yet... > > Why is the isolated/nohz_full CPU receiving timer interrupts > > at all? > > > > I thought it would not, but obviously I am wrong. What is > > going on here? > > There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When > that happens, the tick is re-activated. We're not nohz_full anymore, > but accounting should still work. Isn't the scheduler tick distinct from the timer interrupt, or am I confused?
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 23 Mar 2017 20:56:02 -0400 Rik van Riel wrote: > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote: > > When there are two or more tasks executing in user-space and > > taking 100% of a nohz_full CPU, top reports 70% system time > > and 30% user time utilization. Sometimes I'm even able to get > > 100% system time and 0% user time. > > > > This was reproduced with latest Linus tree (093b995), but I > > don't believe it's a regression (at least not a recent one) > > as I can reproduce it with older kernels. Also, I have > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > > without it yet. > > > > Below you'll find the steps to reproduce and some initial > > analysis. > > > > Steps to reproduce > > -- > > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > > > 3. Run top -d1 and check system time > > > > NOTE: When there's only one task hogging a nohz_full CPU, top > > shows 100% user-time, as expected > > > > Initial analysis > > > > > > When tracing vtime accounting functions and the user-space/kernel > > transitions when the issue is taking place, I see several of the > > following: > > > > hog-10552 [015] 1132.711104: > > function: enter_from_user_mode <-- apic_timer_interrupt > > hog-10552 [015] 1132.711105: > > function: __context_tracking_exit <-- > > enter_from_user_mode > > hog-10552 [015] 1132.711105: > > bprint: __context_tracking_exit.part.4: new state=1 cur > > state=1 active=1 > > hog-10552 [015] 1132.711105: > > function: vtime_account_user <-- > > __context_tracking_exit.part.4 > > hog-10552 [015] 1132.711105: > > function: smp_apic_timer_interrupt <-- > > apic_timer_interrupt > > hog-10552 [015] 1132.711106: function: irq_enter <-- > > smp_apic_timer_interrupt > > hog-10552 [015] 1132.711106: function: tick_sched_timer > > <-- __hrtimer_run_queues > > hog-10552 [015] 1132.711108: function: irq_exit <-- > > smp_apic_timer_interrupt > > hog-10552 [015] 1132.711108: > > function: __context_tracking_enter <-- > > prepare_exit_to_usermode > > hog-10552 [015] 1132.711108: > > bprint: __context_tracking_enter.part.2: new state=1 > > cur state=0 active=1 > > hog-10552 [015] 1132.711109: function: vtime_user_enter > > <-- __context_tracking_enter.part.2 > > hog-10552 [015] 1132.711109: > > function: __vtime_account_system <-- vtime_user_enter > > hog-10552 [015] 1132.711109: > > function: account_system_time <-- __vtime_account_system > > > > On entering the kernel due to a timer interrupt, vtime_account_user() > > skips user-time accounting. Then later on when returning to user- > > space, > > vtime_user_enter() is probably accounting the whole time (ie. user- > > space > > plus kernel-space) to system time. > > > > Now, when does vtime_account_user() skips accounting? Well, when the > > time delta is less then one jiffie. This would imply that > > vtime_account_user() > > is being called less than one jiffie since the last accounting, but I > > haven't > > confirmed any of this yet. > > Jiffies should be advanced by the timer interrupt, on the > housekeeping CPU, which is not doing context tracking. The hypothesis isn't that it wasn't advanced, but that we stayed in user-space less than 1ms. > Why is the isolated/nohz_full CPU receiving timer interrupts > at all? > > I thought it would not, but obviously I am wrong. What is > going on here? There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When that happens, the tick is re-activated. We're not nohz_full anymore, but accounting should still work.
Re: [BUG nohz]: wrong user and system time accounting
On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote: > When there are two or more tasks executing in user-space and > taking 100% of a nohz_full CPU, top reports 70% system time > and 30% user time utilization. Sometimes I'm even able to get > 100% system time and 0% user time. > > This was reproduced with latest Linus tree (093b995), but I > don't believe it's a regression (at least not a recent one) > as I can reproduce it with older kernels. Also, I have > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce > without it yet. > > Below you'll find the steps to reproduce and some initial > analysis. > > Steps to reproduce > -- > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full= > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU > > 3. Run top -d1 and check system time > > NOTE: When there's only one task hogging a nohz_full CPU, top > shows 100% user-time, as expected > > Initial analysis > > > When tracing vtime accounting functions and the user-space/kernel > transitions when the issue is taking place, I see several of the > following: > > hog-10552 [015] 1132.711104: > function: enter_from_user_mode <-- apic_timer_interrupt > hog-10552 [015] 1132.711105: > function: __context_tracking_exit <-- > enter_from_user_mode > hog-10552 [015] 1132.711105: > bprint: __context_tracking_exit.part.4: new state=1 cur > state=1 active=1 > hog-10552 [015] 1132.711105: > function: vtime_account_user <-- > __context_tracking_exit.part.4 > hog-10552 [015] 1132.711105: > function: smp_apic_timer_interrupt <-- > apic_timer_interrupt > hog-10552 [015] 1132.711106: function: irq_enter <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711106: function: tick_sched_timer > <-- __hrtimer_run_queues > hog-10552 [015] 1132.711108: function: irq_exit <-- > smp_apic_timer_interrupt > hog-10552 [015] 1132.711108: > function: __context_tracking_enter <-- > prepare_exit_to_usermode > hog-10552 [015] 1132.711108: > bprint: __context_tracking_enter.part.2: new state=1 > cur state=0 active=1 > hog-10552 [015] 1132.711109: function: vtime_user_enter > <-- __context_tracking_enter.part.2 > hog-10552 [015] 1132.711109: > function: __vtime_account_system <-- vtime_user_enter > hog-10552 [015] 1132.711109: > function: account_system_time <-- __vtime_account_system > > On entering the kernel due to a timer interrupt, vtime_account_user() > skips user-time accounting. Then later on when returning to user- > space, > vtime_user_enter() is probably accounting the whole time (ie. user- > space > plus kernel-space) to system time. > > Now, when does vtime_account_user() skips accounting? Well, when the > time delta is less then one jiffie. This would imply that > vtime_account_user() > is being called less than one jiffie since the last accounting, but I > haven't > confirmed any of this yet. Jiffies should be advanced by the timer interrupt, on the housekeeping CPU, which is not doing context tracking. Why is the isolated/nohz_full CPU receiving timer interrupts at all? I thought it would not, but obviously I am wrong. What is going on here?
[BUG nohz]: wrong user and system time accounting
When there are two or more tasks executing in user-space and taking 100% of a nohz_full CPU, top reports 70% system time and 30% user time utilization. Sometimes I'm even able to get 100% system time and 0% user time. This was reproduced with latest Linus tree (093b995), but I don't believe it's a regression (at least not a recent one) as I can reproduce it with older kernels. Also, I have CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce without it yet. Below you'll find the steps to reproduce and some initial analysis. Steps to reproduce -- 1. Set up a CPU for nohz_full with isolcpus= nohz_full= 2. Pin two tasks that hog the CPU 100% of the time to that CPU 3. Run top -d1 and check system time NOTE: When there's only one task hogging a nohz_full CPU, top shows 100% user-time, as expected Initial analysis When tracing vtime accounting functions and the user-space/kernel transitions when the issue is taking place, I see several of the following: hog-10552 [015] 1132.711104: function: enter_from_user_mode <-- apic_timer_interrupt hog-10552 [015] 1132.711105: function: __context_tracking_exit <-- enter_from_user_mode hog-10552 [015] 1132.711105: bprint: __context_tracking_exit.part.4: new state=1 cur state=1 active=1 hog-10552 [015] 1132.711105: function: vtime_account_user <-- __context_tracking_exit.part.4 hog-10552 [015] 1132.711105: function: smp_apic_timer_interrupt <-- apic_timer_interrupt hog-10552 [015] 1132.711106: function: irq_enter <-- smp_apic_timer_interrupt hog-10552 [015] 1132.711106: function: tick_sched_timer <-- __hrtimer_run_queues hog-10552 [015] 1132.711108: function: irq_exit <-- smp_apic_timer_interrupt hog-10552 [015] 1132.711108: function: __context_tracking_enter <-- prepare_exit_to_usermode hog-10552 [015] 1132.711108: bprint: __context_tracking_enter.part.2: new state=1 cur state=0 active=1 hog-10552 [015] 1132.711109: function: vtime_user_enter <-- __context_tracking_enter.part.2 hog-10552 [015] 1132.711109: function: __vtime_account_system <-- vtime_user_enter hog-10552 [015] 1132.711109: function: account_system_time <-- __vtime_account_system On entering the kernel due to a timer interrupt, vtime_account_user() skips user-time accounting. Then later on when returning to user-space, vtime_user_enter() is probably accounting the whole time (ie. user-space plus kernel-space) to system time. Now, when does vtime_account_user() skips accounting? Well, when the time delta is less then one jiffie. This would imply that vtime_account_user() is being called less than one jiffie since the last accounting, but I haven't confirmed any of this yet.