Re: [BUG nohz]: wrong user and system time accounting

2017-06-29 Thread Frederic Weisbecker
On Mon, May 15, 2017 at 04:17:10PM +0800, Wanpeng Li wrote:
> Ping,

Sorry for the late answer, I was focused on some other bugs.

So since my ideas weren't even clear on that issue yet, I took your
patch and enhanced the code around. I just posted a new series
with it, please have a look.

Thanks.


Re: [BUG nohz]: wrong user and system time accounting

2017-05-15 Thread Wanpeng Li
Ping,
2017-05-02 18:01 GMT+08:00 Wanpeng Li :
> Cc Paolo,
> 2017-04-13 21:32 GMT+08:00 Frederic Weisbecker :
>> On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
>>> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner :
>>> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
>>> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
>>> >> > It's not different from the current jiffies based stuff at all. Same
>>> >> > failure mode.
>>> >>
>>> >> Yes you're right, I got confused again. So to fix this we could do our 
>>> >> snapshots
>>> >> at a frequency lower than HZ but still high enough to avoid overhead.
>>> >>
>>> >> Something like TICK_NSEC / 2 ?
>>> >
>>> > If you are using TSC anyway then you can do proper accumulation for both
>>> > system and user and only account the data when the accumulation is more
>>> > than a jiffie.
>>>
>>> So I implement it as below:
>>>
>>> - HZ=1000.
>>>   1) two cpu hogs on cpu in nohz_full mode, 100% user time
>>>   2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
>>> - HZ=250
>>>1) two cpu hogs on cpu in nohz_full mode, 100% user time
>>>2) Luzi's testcase, 100% idle
>>>
>>> So the codes below still not work correctly for HZ=250, any suggestions?
>>
>> Right, so first lets reorder that code a bit so we can see clear inside :-)
>>
>>>
>>> -->8-
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index d67eee8..6a11771 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -668,6 +668,8 @@ struct task_struct {
>>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>>>  seqcount_tvtime_seqcount;
>>>  unsigned long longvtime_snap;
>>> +u64vtime_acct_utime;
>>> +u64vtime_acct_stime;
>>
>> You need to accumulate guest and steal time as well.
>>
>
> Hi Frederic,
>
> Sorry for the delay since I'm too busy recently, I just add guest time
> and idle time accumulations as below, the code work as we expected for
> native kernel, however, the testcase fails when it runs in kvm guest.
> Top shows ~99% sys for Luzi's testcase "./acct-bug 1 995" which we
> expect 95% user  and %5 idle. In addition, what's the design idea of
> steal time accumluation in your mind? Pass the tsk parameter in the
> function get_vtime_delta() down to the function
> steal_account_process_time()?
>
> -->8-
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4cf9a59..56815cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -672,6 +672,10 @@ struct task_struct {
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  seqcount_tvtime_seqcount;
>  unsigned long longvtime_snap;
> +u64vtime_acct_utime;
> +u64vtime_acct_stime;
> +u64vtime_acct_idle_time;
> +u64vtime_acct_guest_time;
>  enum {
>  /* Task is sleeping or running in a CPU with VTIME inactive: */
>  VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..2d950c6 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,19 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> +unsigned long long clock;
>
> -if (time_before(now, (unsigned long)tsk->vtime_snap))
> +clock = sched_clock();
> +if (clock < tsk->vtime_snap)
>  return 0;
>
> -return jiffies_to_nsecs(now - tsk->vtime_snap);
> +return clock - tsk->vtime_snap;
>  }
>
>  static u64 get_vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> -u64 delta, other;
> +u64 delta = vtime_delta(tsk);
> +u64 other;
>
>  /*
>   * Unlike tick based timing, vtime based timing never has lost
> @@ -696,17 +697,16 @@ static u64 get_vtime_delta(struct task_struct *tsk)
>   * elapsed time. Limit account_other_time to prevent rounding
>   * errors from causing elapsed vtime to go negative.
>   */
> -delta = jiffies_to_nsecs(now - tsk->vtime_snap);
>  other = account_other_time(delta);
>  WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> -tsk->vtime_snap = now;
> +tsk->vtime_snap += delta;
>
>  return delta - other;
>  }
>
>  static void __vtime_account_system(struct task_struct *tsk)
>  {
> -account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
> +account_system_time(tsk, irq_count(), tsk->vtime_acct_stime);
>  }
>
>  void vtime_account_system(struct task_struct *tsk)
> @@ -715,7 +715,11 @@ void vtime_account_system(struct task_struct *tsk)
>  return;
>
>  w

Re: [BUG nohz]: wrong user and system time accounting

2017-05-02 Thread Wanpeng Li
Cc Paolo,
2017-04-13 21:32 GMT+08:00 Frederic Weisbecker :
> On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
>> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner :
>> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
>> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
>> >> > It's not different from the current jiffies based stuff at all. Same
>> >> > failure mode.
>> >>
>> >> Yes you're right, I got confused again. So to fix this we could do our 
>> >> snapshots
>> >> at a frequency lower than HZ but still high enough to avoid overhead.
>> >>
>> >> Something like TICK_NSEC / 2 ?
>> >
>> > If you are using TSC anyway then you can do proper accumulation for both
>> > system and user and only account the data when the accumulation is more
>> > than a jiffie.
>>
>> So I implement it as below:
>>
>> - HZ=1000.
>>   1) two cpu hogs on cpu in nohz_full mode, 100% user time
>>   2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
>> - HZ=250
>>1) two cpu hogs on cpu in nohz_full mode, 100% user time
>>2) Luzi's testcase, 100% idle
>>
>> So the codes below still not work correctly for HZ=250, any suggestions?
>
> Right, so first lets reorder that code a bit so we can see clear inside :-)
>
>>
>> -->8-
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d67eee8..6a11771 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -668,6 +668,8 @@ struct task_struct {
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>>  seqcount_tvtime_seqcount;
>>  unsigned long longvtime_snap;
>> +u64vtime_acct_utime;
>> +u64vtime_acct_stime;
>
> You need to accumulate guest and steal time as well.
>

Hi Frederic,

Sorry for the delay since I'm too busy recently, I just add guest time
and idle time accumulations as below, the code work as we expected for
native kernel, however, the testcase fails when it runs in kvm guest.
Top shows ~99% sys for Luzi's testcase "./acct-bug 1 995" which we
expect 95% user  and %5 idle. In addition, what's the design idea of
steal time accumluation in your mind? Pass the tsk parameter in the
function get_vtime_delta() down to the function
steal_account_process_time()?

-->8-

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cf9a59..56815cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -672,6 +672,10 @@ struct task_struct {
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 seqcount_tvtime_seqcount;
 unsigned long longvtime_snap;
+u64vtime_acct_utime;
+u64vtime_acct_stime;
+u64vtime_acct_idle_time;
+u64vtime_acct_guest_time;
 enum {
 /* Task is sleeping or running in a CPU with VTIME inactive: */
 VTIME_INACTIVE = 0,
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..2d950c6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -676,18 +676,19 @@ void thread_group_cputime_adjusted(struct
task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct task_struct *tsk)
 {
-unsigned long now = READ_ONCE(jiffies);
+unsigned long long clock;

-if (time_before(now, (unsigned long)tsk->vtime_snap))
+clock = sched_clock();
+if (clock < tsk->vtime_snap)
 return 0;

-return jiffies_to_nsecs(now - tsk->vtime_snap);
+return clock - tsk->vtime_snap;
 }

 static u64 get_vtime_delta(struct task_struct *tsk)
 {
-unsigned long now = READ_ONCE(jiffies);
-u64 delta, other;
+u64 delta = vtime_delta(tsk);
+u64 other;

 /*
  * Unlike tick based timing, vtime based timing never has lost
@@ -696,17 +697,16 @@ static u64 get_vtime_delta(struct task_struct *tsk)
  * elapsed time. Limit account_other_time to prevent rounding
  * errors from causing elapsed vtime to go negative.
  */
-delta = jiffies_to_nsecs(now - tsk->vtime_snap);
 other = account_other_time(delta);
 WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-tsk->vtime_snap = now;
+tsk->vtime_snap += delta;

 return delta - other;
 }

 static void __vtime_account_system(struct task_struct *tsk)
 {
-account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
+account_system_time(tsk, irq_count(), tsk->vtime_acct_stime);
 }

 void vtime_account_system(struct task_struct *tsk)
@@ -715,7 +715,11 @@ void vtime_account_system(struct task_struct *tsk)
 return;

 write_seqcount_begin(&tsk->vtime_seqcount);
-__vtime_account_system(tsk);
+tsk->vtime_acct_stime += get_vtime_delta(tsk);
+if (tsk->vtime_acct_stime >= TICK_NSEC) {
+__vtime_account_system(tsk);
+tsk->vtime_acct_stime = 0;

Re: [BUG nohz]: wrong user and system time accounting

2017-04-13 Thread Frederic Weisbecker
On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner :
> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> >> > It's not different from the current jiffies based stuff at all. Same
> >> > failure mode.
> >>
> >> Yes you're right, I got confused again. So to fix this we could do our 
> >> snapshots
> >> at a frequency lower than HZ but still high enough to avoid overhead.
> >>
> >> Something like TICK_NSEC / 2 ?
> >
> > If you are using TSC anyway then you can do proper accumulation for both
> > system and user and only account the data when the accumulation is more
> > than a jiffie.
> 
> So I implement it as below:
> 
> - HZ=1000.
>   1) two cpu hogs on cpu in nohz_full mode, 100% user time
>   2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
> - HZ=250
>1) two cpu hogs on cpu in nohz_full mode, 100% user time
>2) Luzi's testcase, 100% idle
> 
> So the codes below still not work correctly for HZ=250, any suggestions?

Right, so first lets reorder that code a bit so we can see clear inside :-)

> 
> -->8-
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..6a11771 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -668,6 +668,8 @@ struct task_struct {
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  seqcount_tvtime_seqcount;
>  unsigned long longvtime_snap;
> +u64vtime_acct_utime;
> +u64vtime_acct_stime;

You need to accumulate guest and steal time as well.

>  enum {
>  /* Task is sleeping or running in a CPU with VTIME inactive: */
>  VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f8e54ba 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
> 
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static u64 vtime_delta(struct task_struct *tsk)
> +static u64 vtime_delta(struct task_struct *tsk, bool user)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> +u64 delta, ret = 0;
> 
> -if (time_before(now, (unsigned long)tsk->vtime_snap))
> -return 0;
> +delta = sched_clock() - tsk->vtime_snap;
> 
> -return jiffies_to_nsecs(now - tsk->vtime_snap);
> +if (is_idle_task(tsk)) {
> +if (delta >= TICK_NSEC)
> +ret = delta;
> +} else {
> +if (user) {
> +tsk->vtime_acct_utime += delta;
> +if (tsk->vtime_acct_utime >= TICK_NSEC)
> +ret = tsk->vtime_acct_utime;
> +} else {
> +tsk->vtime_acct_stime += delta;
> +if (tsk->vtime_acct_utime >= TICK_NSEC)
> +ret = tsk->vtime_acct_stime;
> +}

We already have vtime_account_idle, vtime_account_user, etc...
The accumulation should be done by these functions that know what and where
to account. vtime_delta() should really just return the difference against 
vtime_snap,
it's too low level to care about these details.

> +}
> +
> +return ret;
>  }
> 
> -static u64 get_vtime_delta(struct task_struct *tsk)
> +static u64 get_vtime_delta(struct task_struct *tsk, bool user)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> -u64 delta, other;
> +u64 delta = vtime_delta(tsk, user);
> +u64 other;
> +
> +if (!is_idle_task(tsk)) {
> +if (user)
> +tsk->vtime_acct_utime = 0;
> +else
> +tsk->vtime_acct_stime = 0;
> +}

Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators.
Reset and accounting really should be done by the upper level functions
vtime_account_*()

Thanks.


Re: [BUG nohz]: wrong user and system time accounting

2017-04-12 Thread Wanpeng Li
2017-04-12 22:57 GMT+08:00 Thomas Gleixner :
> On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
>> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
>> > It's not different from the current jiffies based stuff at all. Same
>> > failure mode.
>>
>> Yes you're right, I got confused again. So to fix this we could do our 
>> snapshots
>> at a frequency lower than HZ but still high enough to avoid overhead.
>>
>> Something like TICK_NSEC / 2 ?
>
> If you are using TSC anyway then you can do proper accumulation for both
> system and user and only account the data when the accumulation is more
> than a jiffie.

So I implement it as below:

- HZ=1000.
  1) two cpu hogs on cpu in nohz_full mode, 100% user time
  2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
- HZ=250
   1) two cpu hogs on cpu in nohz_full mode, 100% user time
   2) Luzi's testcase, 100% idle

So the codes below still not work correctly for HZ=250, any suggestions?

-->8-

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d67eee8..6a11771 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -668,6 +668,8 @@ struct task_struct {
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 seqcount_tvtime_seqcount;
 unsigned long longvtime_snap;
+u64vtime_acct_utime;
+u64vtime_acct_stime;
 enum {
 /* Task is sleeping or running in a CPU with VTIME inactive: */
 VTIME_INACTIVE = 0,
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..f8e54ba 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
task_struct *p, u64 *ut, u64 *st)
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */

 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static u64 vtime_delta(struct task_struct *tsk)
+static u64 vtime_delta(struct task_struct *tsk, bool user)
 {
-unsigned long now = READ_ONCE(jiffies);
+u64 delta, ret = 0;

-if (time_before(now, (unsigned long)tsk->vtime_snap))
-return 0;
+delta = sched_clock() - tsk->vtime_snap;

-return jiffies_to_nsecs(now - tsk->vtime_snap);
+if (is_idle_task(tsk)) {
+if (delta >= TICK_NSEC)
+ret = delta;
+} else {
+if (user) {
+tsk->vtime_acct_utime += delta;
+if (tsk->vtime_acct_utime >= TICK_NSEC)
+ret = tsk->vtime_acct_utime;
+} else {
+tsk->vtime_acct_stime += delta;
+if (tsk->vtime_acct_utime >= TICK_NSEC)
+ret = tsk->vtime_acct_stime;
+}
+}
+
+return ret;
 }

-static u64 get_vtime_delta(struct task_struct *tsk)
+static u64 get_vtime_delta(struct task_struct *tsk, bool user)
 {
-unsigned long now = READ_ONCE(jiffies);
-u64 delta, other;
+u64 delta = vtime_delta(tsk, user);
+u64 other;
+
+if (!is_idle_task(tsk)) {
+if (user)
+tsk->vtime_acct_utime = 0;
+else
+tsk->vtime_acct_stime = 0;
+}

 /*
  * Unlike tick based timing, vtime based timing never has lost
@@ -696,22 +717,21 @@ static u64 get_vtime_delta(struct task_struct *tsk)
  * elapsed time. Limit account_other_time to prevent rounding
  * errors from causing elapsed vtime to go negative.
  */
-delta = jiffies_to_nsecs(now - tsk->vtime_snap);
 other = account_other_time(delta);
 WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-tsk->vtime_snap = now;
+tsk->vtime_snap += delta;

 return delta - other;
 }

 static void __vtime_account_system(struct task_struct *tsk)
 {
-account_system_time(tsk, irq_count(), get_vtime_delta(tsk));
+account_system_time(tsk, irq_count(), get_vtime_delta(tsk, false));
 }

 void vtime_account_system(struct task_struct *tsk)
 {
-if (!vtime_delta(tsk))
+if (!vtime_delta(tsk, false))
 return;

 write_seqcount_begin(&tsk->vtime_seqcount);
@@ -723,15 +743,15 @@ void vtime_account_user(struct task_struct *tsk)
 {
 write_seqcount_begin(&tsk->vtime_seqcount);
 tsk->vtime_snap_whence = VTIME_SYS;
-if (vtime_delta(tsk))
-account_user_time(tsk, get_vtime_delta(tsk));
+if (vtime_delta(tsk, true))
+account_user_time(tsk, get_vtime_delta(tsk, true));
 write_seqcount_end(&tsk->vtime_seqcount);
 }

 void vtime_user_enter(struct task_struct *tsk)
 {
 write_seqcount_begin(&tsk->vtime_seqcount);
-if (vtime_delta(tsk))
+if (vtime_delta(tsk, false))
 __vtime_account_system(tsk);
 tsk->vtime_snap_whence = VTIME_USER;
 write_seqcount_end(&tsk->vtime_seqcount);
@@ -747,7 +767,7 @@ void vtime_guest_enter(struct task_struct *tsk)
  * that can thus safely catch up with a tickless delta.
  */
 write_seqcount_begin(&tsk->vtime_seqcount);
-if (vtime_delta(tsk))
+if (vtime_delta(tsk, false))
   

Re: [BUG nohz]: wrong user and system time accounting

2017-04-12 Thread Frederic Weisbecker
On Wed, Apr 12, 2017 at 04:57:58PM +0200, Thomas Gleixner wrote:
> On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> > On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> > > It's not different from the current jiffies based stuff at all. Same
> > > failure mode.
> > 
> > Yes you're right, I got confused again. So to fix this we could do our 
> > snapshots
> > at a frequency lower than HZ but still high enough to avoid overhead.
> > 
> > Something like TICK_NSEC / 2 ?
> 
> If you are using TSC anyway then you can do proper accumulation for both
> system and user and only account the data when the accumulation is more
> than a jiffie.

Sounds nice, and accumulation shouldn't introduce too much overhead. Let's try 
that.

Thanks.

> 
> Thanks,
> 
>   tglx
> 


Re: [BUG nohz]: wrong user and system time accounting

2017-04-12 Thread Thomas Gleixner
On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> > It's not different from the current jiffies based stuff at all. Same
> > failure mode.
> 
> Yes you're right, I got confused again. So to fix this we could do our 
> snapshots
> at a frequency lower than HZ but still high enough to avoid overhead.
> 
> Something like TICK_NSEC / 2 ?

If you are using TSC anyway then you can do proper accumulation for both
system and user and only account the data when the accumulation is more
than a jiffie.

Thanks,

tglx



Re: [BUG nohz]: wrong user and system time accounting

2017-04-12 Thread Frederic Weisbecker
On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> On Thu, 30 Mar 2017, Wanpeng Li wrote:
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index f3778e2b..f1ee393 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> > task_struct *p, u64 *ut, u64 *st)
> >  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> >  static u64 vtime_delta(struct task_struct *tsk)
> >  {
> > -unsigned long now = READ_ONCE(jiffies);
> > +u64 now = local_clock();
> > +u64 delta;
> > +
> > +delta = now - tsk->vtime_snap;
> > 
> > -if (time_before(now, (unsigned long)tsk->vtime_snap))
> > +if (delta < TICK_NSEC)
> >  return 0;
> > 
> > -return jiffies_to_nsecs(now - tsk->vtime_snap);
> > +return jiffies_to_nsecs(delta / TICK_NSEC);
> 
> So you replaced a jiffies based approach with a jiffies based approach.
> 
> >  }
> > 
> >  static u64 get_vtime_delta(struct task_struct *tsk)
> >  {
> > -unsigned long now = READ_ONCE(jiffies);
> > -u64 delta, other;
> > +u64 delta = vtime_delta(tsk);
> > +u64 other;
> > 
> >  /*
> >   * Unlike tick based timing, vtime based timing never has lost
> > @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
> >   * elapsed time. Limit account_other_time to prevent rounding
> >   * errors from causing elapsed vtime to go negative.
> >   */
> > -delta = jiffies_to_nsecs(now - tsk->vtime_snap);
> >  other = account_other_time(delta);
> >  WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> > -tsk->vtime_snap = now;
> > +tsk->vtime_snap += delta;
> 
> Here is how it works^Wfails
> 
>  For simplicity tsk->vtime_snap starts at 0
>  HZ = 1000
> 
> CPU0 CPU1
>  sysexit()
>account_system()
>  now == 0
>  delta = vtime_delta()<- 0ns
>  tsk->vtime_snap += delta;== 0ns
> 
>  busy_loop(995us)
> 
>  sysenter()
>now == 996us
>account_user()
>  delta = vtime_delta()<- 0ns
>  tsk->vtime_snap += delta == 0ns
>  
>  sysexit()
>account_system()
>  now == 1001us
>  delta = vtime_delta()<- 1000ns
> 
>   Gets accounted to system
> 
>  tsk->vtime_snap += delta;== 1000ns
> 
> It's not different from the current jiffies based stuff at all. Same
> failure mode.

Yes you're right, I got confused again. So to fix this we could do our snapshots
at a frequency lower than HZ but still high enough to avoid overhead.

Something like TICK_NSEC / 2 ?


Re: [BUG nohz]: wrong user and system time accounting

2017-04-11 Thread Thomas Gleixner
On Thu, 30 Mar 2017, Wanpeng Li wrote:
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f1ee393 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> +u64 now = local_clock();
> +u64 delta;
> +
> +delta = now - tsk->vtime_snap;
> 
> -if (time_before(now, (unsigned long)tsk->vtime_snap))
> +if (delta < TICK_NSEC)
>  return 0;
> 
> -return jiffies_to_nsecs(now - tsk->vtime_snap);
> +return jiffies_to_nsecs(delta / TICK_NSEC);

So you replaced a jiffies based approach with a jiffies based approach.

>  }
> 
>  static u64 get_vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> -u64 delta, other;
> +u64 delta = vtime_delta(tsk);
> +u64 other;
> 
>  /*
>   * Unlike tick based timing, vtime based timing never has lost
> @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
>   * elapsed time. Limit account_other_time to prevent rounding
>   * errors from causing elapsed vtime to go negative.
>   */
> -delta = jiffies_to_nsecs(now - tsk->vtime_snap);
>  other = account_other_time(delta);
>  WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> -tsk->vtime_snap = now;
> +tsk->vtime_snap += delta;

Here is how it works^Wfails

   For simplicity tsk->vtime_snap starts at 0
   HZ = 1000

CPU0   CPU1
   sysexit()
 account_system()
   now == 0
   delta = vtime_delta()<- 0ns
   tsk->vtime_snap += delta;== 0ns

   busy_loop(995us)

   sysenter()
 now == 996us
 account_user()
   delta = vtime_delta()<- 0ns
   tsk->vtime_snap += delta == 0ns
   
   sysexit()
 account_system()
   now == 1001us
   delta = vtime_delta()<- 1000ns

    Gets accounted to system

   tsk->vtime_snap += delta;== 1000ns

It's not different from the current jiffies based stuff at all. Same
failure mode.

Thanks,

tglx



Re: [BUG nohz]: wrong user and system time accounting

2017-04-11 Thread Wanpeng Li
2017-04-11 19:36 GMT+08:00 Peter Zijlstra :
> On Tue, Apr 11, 2017 at 07:03:17PM +0800, Wanpeng Li wrote:
>> 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
>> > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote:
>>
>> [...]
>>
>> >
>> >>
>> >> -->8-
>> >>
>> >> use nanosecond granularity to check deltas but only perform an actual
>> >> cputime update when that delta >= TICK_NSEC.
>> >>
>> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> >> index f3778e2b..f1ee393 100644
>> >> --- a/kernel/sched/cputime.c
>> >> +++ b/kernel/sched/cputime.c
>> >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
>> >> task_struct *p, u64 *ut, u64 *st)
>> >>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>> >>  static u64 vtime_delta(struct task_struct *tsk)
>> >>  {
>> >> -unsigned long now = READ_ONCE(jiffies);
>> >> +u64 now = local_clock();
>> >
>> > I fear we need a global clock, because the reader (task_cputime()) needs
>> > to compute the delta and therefore use the same clock from any CPU.
>> >
>> > Or we can use the local_clock() but the reader must access the same.
>> >
>> > So there would be vtime_delta_writer() which uses local_clock and stores
>> > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then
>> > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which
>> > is protected by vtime_seqcount as well.
>> >
>> > Although those sched_clock_cpu() things seem to only matter when the
>> > sched_clock() is unstable. And that stability is a condition for nohz_full
>> > to work anyway. So probably sched_clock() alone would be enough.
>>
>> I observed ~60% user time and ~40% sys time when replace local_clock()
>> above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In
>> addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time.
>> If keep local_clock() in vtime_delta(), cpu hogs testcase will
>> success. However, Luiz's testcase still show 100% idle time.
>
> Assuming a stable TSC, there should be no difference between
> local_clock() and sched_clock().

So it is weird. I did't see any unstable tsc dump in dmesg.

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-04-11 Thread Peter Zijlstra
On Tue, Apr 11, 2017 at 07:03:17PM +0800, Wanpeng Li wrote:
> 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> > On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote:
> 
> [...]
> 
> >
> >>
> >> -->8-
> >>
> >> use nanosecond granularity to check deltas but only perform an actual
> >> cputime update when that delta >= TICK_NSEC.
> >>
> >> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> >> index f3778e2b..f1ee393 100644
> >> --- a/kernel/sched/cputime.c
> >> +++ b/kernel/sched/cputime.c
> >> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> >> task_struct *p, u64 *ut, u64 *st)
> >>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> >>  static u64 vtime_delta(struct task_struct *tsk)
> >>  {
> >> -unsigned long now = READ_ONCE(jiffies);
> >> +u64 now = local_clock();
> >
> > I fear we need a global clock, because the reader (task_cputime()) needs
> > to compute the delta and therefore use the same clock from any CPU.
> >
> > Or we can use the local_clock() but the reader must access the same.
> >
> > So there would be vtime_delta_writer() which uses local_clock and stores
> > the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then
> > vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which
> > is protected by vtime_seqcount as well.
> >
> > Although those sched_clock_cpu() things seem to only matter when the
> > sched_clock() is unstable. And that stability is a condition for nohz_full
> > to work anyway. So probably sched_clock() alone would be enough.
> 
> I observed ~60% user time and ~40% sys time when replace local_clock()
> above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In
> addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time.
> If keep local_clock() in vtime_delta(), cpu hogs testcase will
> success. However, Luiz's testcase still show 100% idle time.

Assuming a stable TSC, there should be no difference between
local_clock() and sched_clock().


Re: [BUG nohz]: wrong user and system time accounting

2017-04-11 Thread Wanpeng Li
2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote:

[...]

>
>>
>> -->8-
>>
>> use nanosecond granularity to check deltas but only perform an actual
>> cputime update when that delta >= TICK_NSEC.
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index f3778e2b..f1ee393 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
>> task_struct *p, u64 *ut, u64 *st)
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>>  static u64 vtime_delta(struct task_struct *tsk)
>>  {
>> -unsigned long now = READ_ONCE(jiffies);
>> +u64 now = local_clock();
>
> I fear we need a global clock, because the reader (task_cputime()) needs
> to compute the delta and therefore use the same clock from any CPU.
>
> Or we can use the local_clock() but the reader must access the same.
>
> So there would be vtime_delta_writer() which uses local_clock and stores
> the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then
> vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which
> is protected by vtime_seqcount as well.
>
> Although those sched_clock_cpu() things seem to only matter when the
> sched_clock() is unstable. And that stability is a condition for nohz_full
> to work anyway. So probably sched_clock() alone would be enough.

I observed ~60% user time and ~40% sys time when replace local_clock()
above by sched_clock()(two cpu hogs on the cpu in nohz_full mode). In
addition, Luiz's testcast ./acct-bug 1 995 will show 100% idle time.
If keep local_clock() in vtime_delta(), cpu hogs testcase will
success. However, Luiz's testcase still show 100% idle time.

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-04-05 Thread Rik van Riel
On Tue, 2017-04-04 at 13:36 -0400, Luiz Capitulino wrote:
> 
> On further debugging this, I realized that I had overlooked
> something:
> the timer interrupt in this trace is not the tick, but cyclictest's
> timer
> (remember that the test-case consists of pinning cyclictest and a
> task
> hogging the CPU to the same CPU).
> 
> I'm running cyclictest with -i 200. If I increase this to -i 1000,
> then
> I seem unable to reproduce the issue (caution: even with -i 200 it
> doesn't always happen. But it does usually happen after I restart the
> test-case a few times. However, I've never been able to reproduce
> with -i 1000).
> 
> Now, if it's really cyclictest that's causing the timer interrupts to
> get aligned, I guess this might not have a solution? (note: I haven't
> been able to reproduce this on bare-metal).

With any sample (tick) based timekeeping, it is possible
to construct workloads that avoid the sampling and result
in skewed statistics as a result.

However, given that local users can already DoS the system
in all kinds of ways, skewed statistics are probably not
that high up on the list of importance.

If there were a way to do accurate accounting (true vtime
accounting) without increasing the overhead of every
syscall and interrupt noticeably, that might be worth it,
but syscall overhead is likely to be a more important
factor than the accuracy of statistics.

I don't know if doing TSC reads and subtraction/addition
only, and delaying the conversion to cputime until a later
point would slow down system calls measurably, compared
with reading jiffies and comparing it against a cached
value of jiffies, nor do I know whether spending time
implementing and testing that would be worthwhile :)

-- 
All rights reversed


signature.asc
Description: This is a digitally signed message part


Re: [BUG nohz]: wrong user and system time accounting

2017-04-04 Thread Luiz Capitulino
On Mon, 3 Apr 2017 15:06:13 -0400
Luiz Capitulino  wrote:

> On Mon, 3 Apr 2017 17:23:17 +0200
> Frederic Weisbecker  wrote:
> 
> > Do you observe aligned ticks with trace events (hrtimer_expire_entry)?
> > 
> > You might want to enforce the global clock to trace that:
> > 
> > echo "global" > /sys/kernel/debug/tracing/trace_clock  
> 
> I've used the same trace points & debugging code I've been using to debug
> this issue, and this what I'm seeing:
> 
> stress-25757 [002]  2742.717507: function: 
> enter_from_user_mode <-- apic_timer_interrupt
> stress-25757 [002]  2742.717508: function: 
> __context_tracking_exit <-- enter_from_user_mode
> stress-25757 [002]  2742.717508: bprint:   vtime_delta: 
> diff=0 (now=4297409970 vtime_snap=4297409970)
> stress-25757 [002]  2742.717509: function: 
> smp_apic_timer_interrupt <-- apic_timer_interrupt
> stress-25757 [002]  2742.717509: function: irq_enter <-- 
> smp_apic_timer_interrupt
> stress-25757 [002]  2742.717510: hrtimer_expire_entry: 
> hrtimer=0xc900039fbe58 function=hrtimer_wakeup now=2742674000776
> stress-25757 [002]  2742.717514: function: irq_exit <-- 
> smp_apic_timer_interrupt
> cyclictest-25760 [002]  2742.717518: function: 
> vtime_account_system <-- vtime_common_task_switch
> cyclictest-25760 [002]  2742.717518: bprint:   vtime_delta: 
> diff=100 (now=4297409971 vtime_snap=4297409970)
> cyclictest-25760 [002]  2742.717519: function: 
> __vtime_account_system <-- vtime_account_system
> cyclictest-25760 [002]  2742.717519: bprint:   get_vtime_delta: 
> vtime_snap=4297409970 now=4297409971
> cyclictest-25760 [002]  2742.717520: function: 
> account_system_time <-- __vtime_account_system
> cyclictest-25760 [002]  2742.717520: bprint:   
> account_system_time: cputime=961981
> cyclictest-25760 [002]  2742.717521: function: 
> __context_tracking_enter <-- do_syscall_64
> cyclictest-25760 [002]  2742.717522: function: vtime_user_enter 
> <-- __context_tracking_enter
> cyclictest-25760 [002]  2742.717522: bprint:   vtime_delta: 
> diff=0 (now=4297409971 vtime_snap=4297409971)
> 
> CPU2 shows 98% system time while the other CPUs (from CPU3 to CPU7)
> show 98% user time (they're all running the same workload).

On further debugging this, I realized that I had overlooked something:
the timer interrupt in this trace is not the tick, but cyclictest's timer
(remember that the test-case consists of pinning cyclictest and a task
hogging the CPU to the same CPU).

I'm running cyclictest with -i 200. If I increase this to -i 1000, then
I seem unable to reproduce the issue (caution: even with -i 200 it
doesn't always happen. But it does usually happen after I restart the
test-case a few times. However, I've never been able to reproduce
with -i 1000).

Now, if it's really cyclictest that's causing the timer interrupts to
get aligned, I guess this might not have a solution? (note: I haven't
been able to reproduce this on bare-metal).


Re: [BUG nohz]: wrong user and system time accounting

2017-04-04 Thread Mike Galbraith
On Mon, 2017-04-03 at 16:40 +0200, Frederic Weisbecker wrote:
> On Thu, Mar 30, 2017 at 03:35:22PM +0200, Mike Galbraith wrote:

> Nohz_full is already bad for powersavings anyway. CPU 0 always ticks :-)

OTOH, if a nohz_full set is doing what it was born to do, CPU0 tick
spikes won't be noticeable on your (pegged/glowing) watt meter :)


Re: [BUG nohz]: wrong user and system time accounting

2017-04-03 Thread Luiz Capitulino
On Mon, 3 Apr 2017 17:23:17 +0200
Frederic Weisbecker  wrote:

> Do you observe aligned ticks with trace events (hrtimer_expire_entry)?
> 
> You might want to enforce the global clock to trace that:
> 
> echo "global" > /sys/kernel/debug/tracing/trace_clock

I've used the same trace points & debugging code I've been using to debug
this issue, and this what I'm seeing:

stress-25757 [002]  2742.717507: function: enter_from_user_mode 
<-- apic_timer_interrupt
stress-25757 [002]  2742.717508: function: 
__context_tracking_exit <-- enter_from_user_mode
stress-25757 [002]  2742.717508: bprint:   vtime_delta: diff=0 
(now=4297409970 vtime_snap=4297409970)
stress-25757 [002]  2742.717509: function: 
smp_apic_timer_interrupt <-- apic_timer_interrupt
stress-25757 [002]  2742.717509: function: irq_enter <-- 
smp_apic_timer_interrupt
stress-25757 [002]  2742.717510: hrtimer_expire_entry: 
hrtimer=0xc900039fbe58 function=hrtimer_wakeup now=2742674000776
stress-25757 [002]  2742.717514: function: irq_exit <-- 
smp_apic_timer_interrupt
cyclictest-25760 [002]  2742.717518: function: vtime_account_system 
<-- vtime_common_task_switch
cyclictest-25760 [002]  2742.717518: bprint:   vtime_delta: 
diff=100 (now=4297409971 vtime_snap=4297409970)
cyclictest-25760 [002]  2742.717519: function: 
__vtime_account_system <-- vtime_account_system
cyclictest-25760 [002]  2742.717519: bprint:   get_vtime_delta: 
vtime_snap=4297409970 now=4297409971
cyclictest-25760 [002]  2742.717520: function: account_system_time 
<-- __vtime_account_system
cyclictest-25760 [002]  2742.717520: bprint:   account_system_time: 
cputime=961981
cyclictest-25760 [002]  2742.717521: function: 
__context_tracking_enter <-- do_syscall_64
cyclictest-25760 [002]  2742.717522: function: vtime_user_enter <-- 
__context_tracking_enter
cyclictest-25760 [002]  2742.717522: bprint:   vtime_delta: diff=0 
(now=4297409971 vtime_snap=4297409971)

CPU2 shows 98% system time while the other CPUs (from CPU3 to CPU7)
show 98% user time (they're all running the same workload).

What's happening here is:

1. Timer interrupt
2. Transition from user-space to kernel-space, vtimer_delta()
   returns zero
3. Context switch from hog application to cyclictest
4. This time vtime_delta() returns != zero, which implies
   jiffies was updated between steps 2 and 3

This seems to be the pattern that accounts incorrectly,
and seem to suggest that the ticks are aligned because
this repeats over and over.

Please, let me know if you want me to run a different
trace-cmd command-line.


Re: [BUG nohz]: wrong user and system time accounting

2017-04-03 Thread Frederic Weisbecker
On Fri, Mar 31, 2017 at 11:11:19PM -0400, Luiz Capitulino wrote:
> On Sat, 1 Apr 2017 01:24:54 +0200
> Frederic Weisbecker  wrote:
> 
> > On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote:
> > > On Thu, 30 Mar 2017 17:25:46 -0400
> > > Luiz Capitulino  wrote:
> > >   
> > > > On Thu, 30 Mar 2017 16:18:17 +0200
> > > > Frederic Weisbecker  wrote:
> > > >   
> > > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> > > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker 
> > > > > > :  
> > > > > > > If it works, we may want to take that solution, likely less 
> > > > > > > performance sensitive
> > > > > > > than using sched_clock(). In fact sched_clock() is fast, 
> > > > > > > especially as we require it to
> > > > > > > be stable for nohz_full, but using it involves costly conversion 
> > > > > > > back and forth to jiffies.  
> > > > > > 
> > > > > > So both Rik and you agree with the skew tick solution, I will try it
> > > > > > tomorrow. Btw, if we should just add random offset to the cpu in the
> > > > > > nohz_full mode or add random offset to all cpus like the codes 
> > > > > > above?  
> > > > > 
> > > > > Lets just keep it to all CPUs for simplicty.
> > > > > Also please add a comment that explains why we need that skew_tick on 
> > > > > nohz_full.
> > > > 
> > > > I've tried all the test-cases we discussed in this thread with 
> > > > skew_tick=1
> > > > and it worked as expected in bare-metal and KVM guests.
> > > > 
> > > > However, I found a test-case that works in bare-metal but show problems
> > > > in KVM guests. It could something that's KVM specific, or it could be
> > > > something that's harder to reproduce in bare-metal.  
> > > 
> > > After discussing some findings on this issue with Rik, I realized that
> > > we don't add the skew when restarting the tick in tick_nohz_restart().
> > > Adding the offset there seems to solve this problem.  
> > 
> > Are you sure? tick_nohz_restart() doesn't seem to override the initial 
> > skew. It
> > always forwards the expiration time on top of the last tick.
> 
> OK, I'll double check. Without my change the bug triggers almost
> instantly with the described reproducer. With my change it didn't
> trig for several minutes (but it does look wrong looking at it now).

Do you observe aligned ticks with trace events (hrtimer_expire_entry)?

You might want to enforce the global clock to trace that:

echo "global" > /sys/kernel/debug/tracing/trace_clock


Re: [BUG nohz]: wrong user and system time accounting

2017-04-03 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 03:35:22PM +0200, Mike Galbraith wrote:
> On Thu, 2017-03-30 at 09:02 -0400, Rik van Riel wrote:
> > On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote:
> 
> > > Also, why does it raise power consumption issues?
> > 
> > On a system without either nohz_full or nohz idle
> > mode, skewed ticks result in CPU cores waking up
> > at different times, and keeping an idle system
> > consuming power for more time than it would if all
> > the ticks happened simultaneously.
> 
> And if your server farm is mostly idle, that power savings may delay
> your bankruptcy proceedings by a whole microsecond ;-)
> 
> Or more seriously, what skew does do on boxen of size X today is
> something for perf to say.  At the time, removal was very bad for my 8
> socket box, and allegedly caused huge SGI beasts in horrific pain.

I see.
Nohz_full is already bad for powersavings anyway. CPU 0 always ticks :-)


Re: [BUG nohz]: wrong user and system time accounting

2017-03-31 Thread Luiz Capitulino
On Sat, 1 Apr 2017 01:24:54 +0200
Frederic Weisbecker  wrote:

> On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote:
> > On Thu, 30 Mar 2017 17:25:46 -0400
> > Luiz Capitulino  wrote:
> >   
> > > On Thu, 30 Mar 2017 16:18:17 +0200
> > > Frederic Weisbecker  wrote:
> > >   
> > > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> > > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :  
> > > > > 
> > > > > > If it works, we may want to take that solution, likely less 
> > > > > > performance sensitive
> > > > > > than using sched_clock(). In fact sched_clock() is fast, especially 
> > > > > > as we require it to
> > > > > > be stable for nohz_full, but using it involves costly conversion 
> > > > > > back and forth to jiffies.  
> > > > > 
> > > > > So both Rik and you agree with the skew tick solution, I will try it
> > > > > tomorrow. Btw, if we should just add random offset to the cpu in the
> > > > > nohz_full mode or add random offset to all cpus like the codes above? 
> > > > >  
> > > > 
> > > > Lets just keep it to all CPUs for simplicty.
> > > > Also please add a comment that explains why we need that skew_tick on 
> > > > nohz_full.
> > > 
> > > I've tried all the test-cases we discussed in this thread with skew_tick=1
> > > and it worked as expected in bare-metal and KVM guests.
> > > 
> > > However, I found a test-case that works in bare-metal but show problems
> > > in KVM guests. It could something that's KVM specific, or it could be
> > > something that's harder to reproduce in bare-metal.  
> > 
> > After discussing some findings on this issue with Rik, I realized that
> > we don't add the skew when restarting the tick in tick_nohz_restart().
> > Adding the offset there seems to solve this problem.  
> 
> Are you sure? tick_nohz_restart() doesn't seem to override the initial skew. 
> It
> always forwards the expiration time on top of the last tick.

OK, I'll double check. Without my change the bug triggers almost
instantly with the described reproducer. With my change it didn't
trig for several minutes (but it does look wrong looking at it now).


Re: [BUG nohz]: wrong user and system time accounting

2017-03-31 Thread Frederic Weisbecker
On Fri, Mar 31, 2017 at 04:09:10PM -0400, Luiz Capitulino wrote:
> On Thu, 30 Mar 2017 17:25:46 -0400
> Luiz Capitulino  wrote:
> 
> > On Thu, 30 Mar 2017 16:18:17 +0200
> > Frederic Weisbecker  wrote:
> > 
> > > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:  
> > > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> > > > > If it works, we may want to take that solution, likely less 
> > > > > performance sensitive
> > > > > than using sched_clock(). In fact sched_clock() is fast, especially 
> > > > > as we require it to
> > > > > be stable for nohz_full, but using it involves costly conversion back 
> > > > > and forth to jiffies.
> > > > 
> > > > So both Rik and you agree with the skew tick solution, I will try it
> > > > tomorrow. Btw, if we should just add random offset to the cpu in the
> > > > nohz_full mode or add random offset to all cpus like the codes above?   
> > > >  
> > > 
> > > Lets just keep it to all CPUs for simplicty.
> > > Also please add a comment that explains why we need that skew_tick on 
> > > nohz_full.  
> > 
> > I've tried all the test-cases we discussed in this thread with skew_tick=1
> > and it worked as expected in bare-metal and KVM guests.
> > 
> > However, I found a test-case that works in bare-metal but show problems
> > in KVM guests. It could something that's KVM specific, or it could be
> > something that's harder to reproduce in bare-metal.
> 
> After discussing some findings on this issue with Rik, I realized that
> we don't add the skew when restarting the tick in tick_nohz_restart().
> Adding the offset there seems to solve this problem.

Are you sure? tick_nohz_restart() doesn't seem to override the initial skew. It
always forwards the expiration time on top of the last tick.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-31 Thread Luiz Capitulino
On Thu, 30 Mar 2017 17:25:46 -0400
Luiz Capitulino  wrote:

> On Thu, 30 Mar 2017 16:18:17 +0200
> Frederic Weisbecker  wrote:
> 
> > On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:  
> > > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> > > > If it works, we may want to take that solution, likely less performance 
> > > > sensitive
> > > > than using sched_clock(). In fact sched_clock() is fast, especially as 
> > > > we require it to
> > > > be stable for nohz_full, but using it involves costly conversion back 
> > > > and forth to jiffies.
> > > 
> > > So both Rik and you agree with the skew tick solution, I will try it
> > > tomorrow. Btw, if we should just add random offset to the cpu in the
> > > nohz_full mode or add random offset to all cpus like the codes above?
> > 
> > Lets just keep it to all CPUs for simplicty.
> > Also please add a comment that explains why we need that skew_tick on 
> > nohz_full.  
> 
> I've tried all the test-cases we discussed in this thread with skew_tick=1
> and it worked as expected in bare-metal and KVM guests.
> 
> However, I found a test-case that works in bare-metal but show problems
> in KVM guests. It could something that's KVM specific, or it could be
> something that's harder to reproduce in bare-metal.

After discussing some findings on this issue with Rik, I realized that
we don't add the skew when restarting the tick in tick_nohz_restart().
Adding the offset there seems to solve this problem.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Luiz Capitulino
On Thu, 30 Mar 2017 16:18:17 +0200
Frederic Weisbecker  wrote:

> On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> > 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :  
> > > If it works, we may want to take that solution, likely less performance 
> > > sensitive
> > > than using sched_clock(). In fact sched_clock() is fast, especially as we 
> > > require it to
> > > be stable for nohz_full, but using it involves costly conversion back and 
> > > forth to jiffies.  
> > 
> > So both Rik and you agree with the skew tick solution, I will try it
> > tomorrow. Btw, if we should just add random offset to the cpu in the
> > nohz_full mode or add random offset to all cpus like the codes above?  
> 
> Lets just keep it to all CPUs for simplicty.
> Also please add a comment that explains why we need that skew_tick on 
> nohz_full.

I've tried all the test-cases we discussed in this thread with skew_tick=1
and it worked as expected in bare-metal and KVM guests.

However, I found a test-case that works in bare-metal but show problems
in KVM guests. It could something that's KVM specific, or it could be
something that's harder to reproduce in bare-metal.

The reproducer is (not sure all the steps are necessary):

1. Isolate 8 cores in the host with isolcpus= and nohz_full= (and skew_tick=1)

2. Create a KVM guest with 8 vCPUs and pin each vCPU to an isolated
   host core

3. Boot the guest with isolcpus=2,3,4,5,6,7 nohz_full=2,3,4,5,6,7 skew_tick=1

4. Once the guest is booted, run:

# for i in $(seq 2 7); do taskset -c $i hog& ;done
# taskset -c 2,3,4,5,6,7 \
  cyclictest -m -n -q -p95 -D 1m -h60 -i 200 -t 6 -a 2,3,4,5,6,7

  (where hog is a program taking 100% of the CPU, and cyclictest
   is RT's cyclictest)

5. Run top -d1

In a few minutes into this test-case, I see one isolated CPU in the
guest reporting around 95% system time (where the expected is close
to 100% user time, which the others isolated CPUs correctly report).


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 09:59:54PM +0800, Wanpeng Li wrote:
> 2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> > If it works, we may want to take that solution, likely less performance 
> > sensitive
> > than using sched_clock(). In fact sched_clock() is fast, especially as we 
> > require it to
> > be stable for nohz_full, but using it involves costly conversion back and 
> > forth to jiffies.
> 
> So both Rik and you agree with the skew tick solution, I will try it
> tomorrow. Btw, if we should just add random offset to the cpu in the
> nohz_full mode or add random offset to all cpus like the codes above?

Lets just keep it to all CPUs for simplicty.
Also please add a comment that explains why we need that skew_tick on nohz_full.

Thanks!


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Wanpeng Li
2017-03-30 21:38 GMT+08:00 Frederic Weisbecker :
> On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote:
>> Cc Peterz, Thomas,
>> 2017-03-30 12:27 GMT+08:00 Mike Galbraith :
>> > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>> >
>> >> In other words, the tick on cpu0 is aligned
>> >> with the tick on the nohz_full cpus, and
>> >> jiffies is advanced while the nohz_full cpus
>> >> with an active tick happen to be in kernel
>> >> mode?
>> >
>> > You really want skew_tick=1, especially on big boxen.
>> >
>> >> Frederic, can you think of any reason why
>> >> the tick on nohz_full CPUs would end up aligned
>> >> with the tick on cpu0, instead of running at some
>> >> random offset?
>> >
>> > (I or low rq->clock bits as crude NOHZ collision avoidance)
>> >
>> >> A random offset, or better yet a somewhat randomized
>> >> tick length to make sure that simultaneous ticks are
>> >> fairly rare and the vtime sampling does not end up
>> >> "in phase" with the jiffies incrementing, could make
>> >> the accounting work right again.
>> >
>> > That improves jitter, especially on big boxen.  I have an 8 socket box
>> > that thinks it's an extra large PC, there, collision avoidance matters
>> > hugely.  I couldn't reproduce bean counting woes, no idea if collision
>> > avoidance will help that.
>>
>> So I implement two methods, one is from Rik's random offset proposal
>> through skew tick, the other one is from Frederic's proposal and it is
>> the same as my original idea through use nanosecond granularity to
>> check deltas but only perform an actual cputime update when that delta
>> >= TICK_NSEC. Both methods can solve the bug which Luiz reported.
>> Peterz, Thomas, any ideas?
>>
>> --->8-
>>
>> skew tick:
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 7fe53be..9981437 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
>>  hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>>
>>  /* Offset the tick to avert jiffies_lock contention. */
>> +#ifdef CONFIG_NO_HZ_FULL
>> +if (sched_skew_tick || tick_nohz_full_running) {
>> +#else
>>  if (sched_skew_tick) {
>> +#endif
>
> Please rather use tick_nohz_full_enabled() to avoid ifdeffery.
>
>>  u64 offset = ktime_to_ns(tick_period) >> 1;
>>  do_div(offset, num_possible_cpus());
>>  offset *= smp_processor_id();
>
> If it works, we may want to take that solution, likely less performance 
> sensitive
> than using sched_clock(). In fact sched_clock() is fast, especially as we 
> require it to
> be stable for nohz_full, but using it involves costly conversion back and 
> forth to jiffies.

So both Rik and you agree with the skew tick solution, I will try it
tomorrow. Btw, if we should just add random offset to the cpu in the
nohz_full mode or add random offset to all cpus like the codes above?

Regards,
Wanpeng Li

>
>>
>> -->8-
>>
>> use nanosecond granularity to check deltas but only perform an actual
>> cputime update when that delta >= TICK_NSEC.
>>
>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> index f3778e2b..f1ee393 100644
>> --- a/kernel/sched/cputime.c
>> +++ b/kernel/sched/cputime.c
>> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
>> task_struct *p, u64 *ut, u64 *st)
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>>  static u64 vtime_delta(struct task_struct *tsk)
>>  {
>> -unsigned long now = READ_ONCE(jiffies);
>> +u64 now = local_clock();
>
> I fear we need a global clock, because the reader (task_cputime()) needs
> to compute the delta and therefore use the same clock from any CPU.
>
> Or we can use the local_clock() but the reader must access the same.
>
> So there would be vtime_delta_writer() which uses local_clock and stores
> the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then
> vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which
> is protected by vtime_seqcount as well.
>
> Although those sched_clock_cpu() things seem to only matter when the
> sched_clock() is unstable. And that stability is a condition for nohz_full
> to work anyway. So probably sched_clock() alone would be enough.
>
> Thanks.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 09:02:31AM -0400, Rik van Riel wrote:
> On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote:
> > On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote:
> > > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
> > > 
> > > > A random offset, or better yet a somewhat randomized
> > > > tick length to make sure that simultaneous ticks are
> > > > fairly rare and the vtime sampling does not end up
> > > > "in phase" with the jiffies incrementing, could make
> > > > the accounting work right again.
> > > 
> > > That improves jitter, especially on big boxen.  I have an 8 socket
> > > box
> > > that thinks it's an extra large PC, there, collision avoidance
> > > matters
> > > hugely.  I couldn't reproduce bean counting woes, no idea if
> > > collision
> > > avoidance will help that.
> > 
> > Out of curiosity, where is the main contention between ticks? I
> > indeed
> > know some locks that can be taken on special cases, such as posix cpu
> > timers.
> > 
> > Also, why does it raise power consumption issues?
> 
> On a system without either nohz_full or nohz idle
> mode, skewed ticks result in CPU cores waking up
> at different times, and keeping an idle system
> consuming power for more time than it would if all
> the ticks happened simultaneously.

Ah fair point!

> 
> This is not a factor at all on systems that switch
> off the tick while idle, since the CPU will be busy
> anyway while the tick is enabled.

I see. Thanks!


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 02:47:11PM +0800, Wanpeng Li wrote:
> Cc Peterz, Thomas,
> 2017-03-30 12:27 GMT+08:00 Mike Galbraith :
> > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
> >
> >> In other words, the tick on cpu0 is aligned
> >> with the tick on the nohz_full cpus, and
> >> jiffies is advanced while the nohz_full cpus
> >> with an active tick happen to be in kernel
> >> mode?
> >
> > You really want skew_tick=1, especially on big boxen.
> >
> >> Frederic, can you think of any reason why
> >> the tick on nohz_full CPUs would end up aligned
> >> with the tick on cpu0, instead of running at some
> >> random offset?
> >
> > (I or low rq->clock bits as crude NOHZ collision avoidance)
> >
> >> A random offset, or better yet a somewhat randomized
> >> tick length to make sure that simultaneous ticks are
> >> fairly rare and the vtime sampling does not end up
> >> "in phase" with the jiffies incrementing, could make
> >> the accounting work right again.
> >
> > That improves jitter, especially on big boxen.  I have an 8 socket box
> > that thinks it's an extra large PC, there, collision avoidance matters
> > hugely.  I couldn't reproduce bean counting woes, no idea if collision
> > avoidance will help that.
> 
> So I implement two methods, one is from Rik's random offset proposal
> through skew tick, the other one is from Frederic's proposal and it is
> the same as my original idea through use nanosecond granularity to
> check deltas but only perform an actual cputime update when that delta
> >= TICK_NSEC. Both methods can solve the bug which Luiz reported.
> Peterz, Thomas, any ideas?
> 
> --->8-
> 
> skew tick:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..9981437 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
>  hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
> 
>  /* Offset the tick to avert jiffies_lock contention. */
> +#ifdef CONFIG_NO_HZ_FULL
> +if (sched_skew_tick || tick_nohz_full_running) {
> +#else
>  if (sched_skew_tick) {
> +#endif

Please rather use tick_nohz_full_enabled() to avoid ifdeffery.

>  u64 offset = ktime_to_ns(tick_period) >> 1;
>  do_div(offset, num_possible_cpus());
>  offset *= smp_processor_id();

If it works, we may want to take that solution, likely less performance 
sensitive
than using sched_clock(). In fact sched_clock() is fast, especially as we 
require it to
be stable for nohz_full, but using it involves costly conversion back and forth 
to jiffies.

> 
> -->8-
> 
> use nanosecond granularity to check deltas but only perform an actual
> cputime update when that delta >= TICK_NSEC.
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f1ee393 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> +u64 now = local_clock();

I fear we need a global clock, because the reader (task_cputime()) needs
to compute the delta and therefore use the same clock from any CPU.

Or we can use the local_clock() but the reader must access the same.

So there would be vtime_delta_writer() which uses local_clock and stores
the current CPU to tsk->vtime_cpu (under the vtime_seqcount). And then
vtime_delta_reader() which calls sched_clock_cpu(tsk->vtime_cpu) which
is protected by vtime_seqcount as well.

Although those sched_clock_cpu() things seem to only matter when the
sched_clock() is unstable. And that stability is a condition for nohz_full
to work anyway. So probably sched_clock() alone would be enough.

Thanks.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Mike Galbraith
On Thu, 2017-03-30 at 09:02 -0400, Rik van Riel wrote:
> On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote:

> > Also, why does it raise power consumption issues?
> 
> On a system without either nohz_full or nohz idle
> mode, skewed ticks result in CPU cores waking up
> at different times, and keeping an idle system
> consuming power for more time than it would if all
> the ticks happened simultaneously.

And if your server farm is mostly idle, that power savings may delay
your bankruptcy proceedings by a whole microsecond ;-)

Or more seriously, what skew does do on boxen of size X today is
something for perf to say.  At the time, removal was very bad for my 8
socket box, and allegedly caused huge SGI beasts in horrific pain.

-Mike


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Mike Galbraith
On Thu, 2017-03-30 at 14:40 +0200, Frederic Weisbecker wrote:
> On Thu, Mar 30, 2017 at 09:58:44AM +0800, Wanpeng Li wrote:

> > There is such a feature skew_tick currently, refer to commit
> > 5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot
> > parameter, the bug disappear, however, the commit also mentioned that
> > it will hurt power consumption.
> 
> Oh, I completely missed that!

It suggests it'll harm power consumption because skew removal allegedly
saved power.  Recalling what removal did to my 8 socket box, I doubt
adding it back costs large boxen anything at all, rather the opposite.

-Mike


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Rik van Riel
On Thu, 2017-03-30 at 14:51 +0200, Frederic Weisbecker wrote:
> On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote:
> > On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
> > 
> > > A random offset, or better yet a somewhat randomized
> > > tick length to make sure that simultaneous ticks are
> > > fairly rare and the vtime sampling does not end up
> > > "in phase" with the jiffies incrementing, could make
> > > the accounting work right again.
> > 
> > That improves jitter, especially on big boxen.  I have an 8 socket
> > box
> > that thinks it's an extra large PC, there, collision avoidance
> > matters
> > hugely.  I couldn't reproduce bean counting woes, no idea if
> > collision
> > avoidance will help that.
> 
> Out of curiosity, where is the main contention between ticks? I
> indeed
> know some locks that can be taken on special cases, such as posix cpu
> timers.
> 
> Also, why does it raise power consumption issues?

On a system without either nohz_full or nohz idle
mode, skewed ticks result in CPU cores waking up
at different times, and keeping an idle system
consuming power for more time than it would if all
the ticks happened simultaneously.

This is not a factor at all on systems that switch
off the tick while idle, since the CPU will be busy
anyway while the tick is enabled.



Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Rik van Riel
On Thu, 2017-03-30 at 00:54 +0200, Frederic Weisbecker wrote:
> (Adding Thomas in Cc)
> 
> On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote:
> > 
> > Frederic, can you think of any reason why
> > the tick on nohz_full CPUs would end up aligned
> > with the tick on cpu0, instead of running at some
> > random offset?
> 
> tick_init_jiffy_update() takes that decision to align all ticks.
> 
> I'm not sure why. 

I don't see why that would matter, either.

> I'm not sure that randomizing the tick start per CPU would be a
> right solution. Somewhere in the world you can be sure the tick
> randomization of some nohz_full CPU will coincide with the tick
> of CPU 0 :o)
> 
> Or we could force that tick on nohz_full CPUs to be far from
> CPU 0's tick... I'm not sure such a solution would be accepted
> though.

I am not sure we would have to force things.

Simply getting rid of tick_init_jiffy_update
and scheduling the next tick for "now + tick
period" might have the same effect, when the
tick gets stopped and restarted on nohz_full
CPUs.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 06:27:31AM +0200, Mike Galbraith wrote:
> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
> 
> > A random offset, or better yet a somewhat randomized
> > tick length to make sure that simultaneous ticks are
> > fairly rare and the vtime sampling does not end up
> > "in phase" with the jiffies incrementing, could make
> > the accounting work right again.
> 
> That improves jitter, especially on big boxen.  I have an 8 socket box
> that thinks it's an extra large PC, there, collision avoidance matters
> hugely.  I couldn't reproduce bean counting woes, no idea if collision
> avoidance will help that.

Out of curiosity, where is the main contention between ticks? I indeed
know some locks that can be taken on special cases, such as posix cpu timers.

Also, why does it raise power consumption issues?

Thanks.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Frederic Weisbecker
On Thu, Mar 30, 2017 at 09:58:44AM +0800, Wanpeng Li wrote:
> 2017-03-30 4:08 GMT+08:00 Rik van Riel :
> >
> > In other words, the tick on cpu0 is aligned
> > with the tick on the nohz_full cpus, and
> > jiffies is advanced while the nohz_full cpus
> > with an active tick happen to be in kernel
> > mode?
> >
> > Frederic, can you think of any reason why
> > the tick on nohz_full CPUs would end up aligned
> > with the tick on cpu0, instead of running at some
> > random offset?
> >
> > A random offset, or better yet a somewhat randomized
> > tick length to make sure that simultaneous ticks are
> > fairly rare and the vtime sampling does not end up
> > "in phase" with the jiffies incrementing, could make
> > the accounting work right again.
> >
> > Of course, that assumes the above hypothesis is correct :)
> 
> There is such a feature skew_tick currently, refer to commit
> 5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot
> parameter, the bug disappear, however, the commit also mentioned that
> it will hurt power consumption.

Oh, I completely missed that!

> I will try Frederic's proposal which
> is similar to my original idea "how bad would it be to revert to
> sched_clock() instead of jiffies in vtime_delta()? We could use
> nanosecond granularity to check deltas but only perform an actual
> cputime update when that delta >= TICK_NSEC."

Thanks! I hope sched_clock() won't introduce too much overhead.
Otherwise we may want to pick up the skew_tick solution.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Mike Galbraith
On Thu, 2017-03-30 at 19:52 +0800, Wanpeng Li wrote:

> If we should just add random offset to the cpu in the nohz_full mode?

Up to you, whatever works best.  I left the regular skew alone, just
added some noise to scheduler_tick_max_deferment().

-Mike


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Wanpeng Li
2017-03-30 10:14 GMT+08:00 Luiz Capitulino :
> On Thu, 30 Mar 2017 06:46:30 +0800
> Wanpeng Li  wrote:
>
>> > So! Now we need to find a proper fix :o)
>> >
>> > Hmm, how bad would it be to revert to sched_clock() instead of jiffies in 
>> > vtime_delta()?
>> > We could use nanosecond granularity to check deltas but only perform an 
>> > actual cputime update
>> > when that delta >= TICK_NSEC. That should keep the load ok.
>>
>> Yeah, I mentioned something similar before.
>> https://lkml.org/lkml/2017/3/26/138 However, Rik's commit optimized
>> syscalls by not utilize sched_clock(), so if we should distinguish
>> between syscalls/exceptions and irqs?
>
> Why not use ktime_get()?

I believe ktime_get() is more heavy than local_clock() when sched
clock is stable. So we can cooperate to improve
https://lkml.org/lkml/2017/3/30/456.

Regards,
Wanpeng Li

>
> Here's the solution I was thinking about, it's mostly untested. I'm
> rate limiting below TICK_NSEC because I want to avoid syncing with
> the tick.
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..a8b1e85 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,20 @@ void thread_group_cputime_adjusted(struct task_struct 
> *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -   unsigned long now = READ_ONCE(jiffies);
> +   return ktime_sub(ktime_get(), tsk->vtime_snap);
> +}
>
> -   if (time_before(now, (unsigned long)tsk->vtime_snap))
> -   return 0;
> +/* A little bit less than the tick period */
> +#define VTIME_RATE_LIMIT (TICK_NSEC - 20)
>
> -   return jiffies_to_nsecs(now - tsk->vtime_snap);
> +static bool vtime_should_account(struct task_struct *tsk)
> +{
> +   return vtime_delta(tsk) > VTIME_RATE_LIMIT;
>  }
>
>  static u64 get_vtime_delta(struct task_struct *tsk)
>  {
> -   unsigned long now = READ_ONCE(jiffies);
> -   u64 delta, other;
> +   u64 delta, other, now = ktime_get();
>
> /*
>  * Unlike tick based timing, vtime based timing never has lost
> @@ -696,7 +698,7 @@ static u64 get_vtime_delta(struct task_struct *tsk)
>  * elapsed time. Limit account_other_time to prevent rounding
>  * errors from causing elapsed vtime to go negative.
>  */
> -   delta = jiffies_to_nsecs(now - tsk->vtime_snap);
> +   delta = ktime_sub(now, tsk->vtime_snap);
> other = account_other_time(delta);
> WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> tsk->vtime_snap = now;
> @@ -711,7 +713,7 @@ static void __vtime_account_system(struct task_struct 
> *tsk)
>
>  void vtime_account_system(struct task_struct *tsk)
>  {
> -   if (!vtime_delta(tsk))
> +   if (!vtime_should_account(tsk))
> return;
>
> write_seqcount_begin(&tsk->vtime_seqcount);
> @@ -723,7 +725,7 @@ void vtime_account_user(struct task_struct *tsk)
>  {
> write_seqcount_begin(&tsk->vtime_seqcount);
> tsk->vtime_snap_whence = VTIME_SYS;
> -   if (vtime_delta(tsk))
> +   if (vtime_should_account(tsk))
> account_user_time(tsk, get_vtime_delta(tsk));
> write_seqcount_end(&tsk->vtime_seqcount);
>  }
> @@ -731,7 +733,7 @@ void vtime_account_user(struct task_struct *tsk)
>  void vtime_user_enter(struct task_struct *tsk)
>  {
> write_seqcount_begin(&tsk->vtime_seqcount);
> -   if (vtime_delta(tsk))
> +   if (vtime_should_account(tsk))
> __vtime_account_system(tsk);
> tsk->vtime_snap_whence = VTIME_USER;
> write_seqcount_end(&tsk->vtime_seqcount);
> @@ -747,7 +749,7 @@ void vtime_guest_enter(struct task_struct *tsk)
>  * that can thus safely catch up with a tickless delta.
>  */
> write_seqcount_begin(&tsk->vtime_seqcount);
> -   if (vtime_delta(tsk))
> +   if (vtime_should_account(tsk))
> __vtime_account_system(tsk);
> current->flags |= PF_VCPU;
> write_seqcount_end(&tsk->vtime_seqcount);
> @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
>
> write_seqcount_begin(¤t->vtime_seqcount);
> current->vtime_snap_whence = VTIME_SYS;
> -   current->vtime_snap = jiffies;
> +   current->vtime_snap = ktime_get();
> write_seqcount_end(¤t->vtime_seqcount);
>  }
>


Re: [BUG nohz]: wrong user and system time accounting

2017-03-30 Thread Wanpeng Li
2017-03-30 14:47 GMT+08:00 Wanpeng Li :
> Cc Peterz, Thomas,
> 2017-03-30 12:27 GMT+08:00 Mike Galbraith :
>> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>>
>>> In other words, the tick on cpu0 is aligned
>>> with the tick on the nohz_full cpus, and
>>> jiffies is advanced while the nohz_full cpus
>>> with an active tick happen to be in kernel
>>> mode?
>>
>> You really want skew_tick=1, especially on big boxen.
>>
>>> Frederic, can you think of any reason why
>>> the tick on nohz_full CPUs would end up aligned
>>> with the tick on cpu0, instead of running at some
>>> random offset?
>>
>> (I or low rq->clock bits as crude NOHZ collision avoidance)
>>
>>> A random offset, or better yet a somewhat randomized
>>> tick length to make sure that simultaneous ticks are
>>> fairly rare and the vtime sampling does not end up
>>> "in phase" with the jiffies incrementing, could make
>>> the accounting work right again.
>>
>> That improves jitter, especially on big boxen.  I have an 8 socket box
>> that thinks it's an extra large PC, there, collision avoidance matters
>> hugely.  I couldn't reproduce bean counting woes, no idea if collision
>> avoidance will help that.
>
> So I implement two methods, one is from Rik's random offset proposal

If we should just add random offset to the cpu in the nohz_full mode?

> through skew tick, the other one is from Frederic's proposal and it is
> the same as my original idea through use nanosecond granularity to
> check deltas but only perform an actual cputime update when that delta
>>= TICK_NSEC. Both methods can solve the bug which Luiz reported.

This can just solves two cpu hogs running on the cpu in nohz_full
mode. However, Luiz's testcase w/ ./acct-bug 1 995 shows idle 100%.

Regards,
Wanpeng Li

> Peterz, Thomas, any ideas?
>
> --->8-
>
> skew tick:
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..9981437 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
>  hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>
>  /* Offset the tick to avert jiffies_lock contention. */
> +#ifdef CONFIG_NO_HZ_FULL
> +if (sched_skew_tick || tick_nohz_full_running) {
> +#else
>  if (sched_skew_tick) {
> +#endif
>  u64 offset = ktime_to_ns(tick_period) >> 1;
>  do_div(offset, num_possible_cpus());
>  offset *= smp_processor_id();
>
> -->8-
>
> use nanosecond granularity to check deltas but only perform an actual
> cputime update when that delta >= TICK_NSEC.
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f1ee393 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>  static u64 vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> +u64 now = local_clock();
> +u64 delta;
> +
> +delta = now - tsk->vtime_snap;
>
> -if (time_before(now, (unsigned long)tsk->vtime_snap))
> +if (delta < TICK_NSEC)
>  return 0;
>
> -return jiffies_to_nsecs(now - tsk->vtime_snap);
> +return jiffies_to_nsecs(delta / TICK_NSEC);
>  }
>
>  static u64 get_vtime_delta(struct task_struct *tsk)
>  {
> -unsigned long now = READ_ONCE(jiffies);
> -u64 delta, other;
> +u64 delta = vtime_delta(tsk);
> +u64 other;
>
>  /*
>   * Unlike tick based timing, vtime based timing never has lost
> @@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
>   * elapsed time. Limit account_other_time to prevent rounding
>   * errors from causing elapsed vtime to go negative.
>   */
> -delta = jiffies_to_nsecs(now - tsk->vtime_snap);
>  other = account_other_time(delta);
>  WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
> -tsk->vtime_snap = now;
> +tsk->vtime_snap += delta;
>
>  return delta - other;
>  }
> @@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
>
>  write_seqcount_begin(¤t->vtime_seqcount);
>  current->vtime_snap_whence = VTIME_SYS;
> -current->vtime_snap = jiffies;
> +current->vtime_snap = sched_clock_cpu(smp_processor_id());
>  write_seqcount_end(¤t->vtime_seqcount);
>  }
>
> @@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
>  local_irq_save(flags);
>  write_seqcount_begin(&t->vtime_seqcount);
>  t->vtime_snap_whence = VTIME_SYS;
> -t->vtime_snap = jiffies;
> +t->vtime_snap = sched_clock_cpu(cpu);
>  write_seqcount_end(&t->vtime_seqcount);
>  local_irq_restore(flags);
>  }
>
> Regards,
> Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Wanpeng Li
Cc Peterz, Thomas,
2017-03-30 12:27 GMT+08:00 Mike Galbraith :
> On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:
>
>> In other words, the tick on cpu0 is aligned
>> with the tick on the nohz_full cpus, and
>> jiffies is advanced while the nohz_full cpus
>> with an active tick happen to be in kernel
>> mode?
>
> You really want skew_tick=1, especially on big boxen.
>
>> Frederic, can you think of any reason why
>> the tick on nohz_full CPUs would end up aligned
>> with the tick on cpu0, instead of running at some
>> random offset?
>
> (I or low rq->clock bits as crude NOHZ collision avoidance)
>
>> A random offset, or better yet a somewhat randomized
>> tick length to make sure that simultaneous ticks are
>> fairly rare and the vtime sampling does not end up
>> "in phase" with the jiffies incrementing, could make
>> the accounting work right again.
>
> That improves jitter, especially on big boxen.  I have an 8 socket box
> that thinks it's an extra large PC, there, collision avoidance matters
> hugely.  I couldn't reproduce bean counting woes, no idea if collision
> avoidance will help that.

So I implement two methods, one is from Rik's random offset proposal
through skew tick, the other one is from Frederic's proposal and it is
the same as my original idea through use nanosecond granularity to
check deltas but only perform an actual cputime update when that delta
>= TICK_NSEC. Both methods can solve the bug which Luiz reported.
Peterz, Thomas, any ideas?

--->8-

skew tick:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 7fe53be..9981437 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1198,7 +1198,11 @@ void tick_setup_sched_timer(void)
 hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());

 /* Offset the tick to avert jiffies_lock contention. */
+#ifdef CONFIG_NO_HZ_FULL
+if (sched_skew_tick || tick_nohz_full_running) {
+#else
 if (sched_skew_tick) {
+#endif
 u64 offset = ktime_to_ns(tick_period) >> 1;
 do_div(offset, num_possible_cpus());
 offset *= smp_processor_id();

-->8-

use nanosecond granularity to check deltas but only perform an actual
cputime update when that delta >= TICK_NSEC.

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..f1ee393 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -676,18 +676,21 @@ void thread_group_cputime_adjusted(struct
task_struct *p, u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct task_struct *tsk)
 {
-unsigned long now = READ_ONCE(jiffies);
+u64 now = local_clock();
+u64 delta;
+
+delta = now - tsk->vtime_snap;

-if (time_before(now, (unsigned long)tsk->vtime_snap))
+if (delta < TICK_NSEC)
 return 0;

-return jiffies_to_nsecs(now - tsk->vtime_snap);
+return jiffies_to_nsecs(delta / TICK_NSEC);
 }

 static u64 get_vtime_delta(struct task_struct *tsk)
 {
-unsigned long now = READ_ONCE(jiffies);
-u64 delta, other;
+u64 delta = vtime_delta(tsk);
+u64 other;

 /*
  * Unlike tick based timing, vtime based timing never has lost
@@ -696,10 +699,9 @@ static u64 get_vtime_delta(struct task_struct *tsk)
  * elapsed time. Limit account_other_time to prevent rounding
  * errors from causing elapsed vtime to go negative.
  */
-delta = jiffies_to_nsecs(now - tsk->vtime_snap);
 other = account_other_time(delta);
 WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-tsk->vtime_snap = now;
+tsk->vtime_snap += delta;

 return delta - other;
 }
@@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)

 write_seqcount_begin(¤t->vtime_seqcount);
 current->vtime_snap_whence = VTIME_SYS;
-current->vtime_snap = jiffies;
+current->vtime_snap = sched_clock_cpu(smp_processor_id());
 write_seqcount_end(¤t->vtime_seqcount);
 }

@@ -787,7 +789,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 local_irq_save(flags);
 write_seqcount_begin(&t->vtime_seqcount);
 t->vtime_snap_whence = VTIME_SYS;
-t->vtime_snap = jiffies;
+t->vtime_snap = sched_clock_cpu(cpu);
 write_seqcount_end(&t->vtime_seqcount);
 local_irq_restore(flags);
 }

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Mike Galbraith
On Wed, 2017-03-29 at 16:08 -0400, Rik van Riel wrote:

> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?

You really want skew_tick=1, especially on big boxen.
 
> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?

(I or low rq->clock bits as crude NOHZ collision avoidance)

> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.

That improves jitter, especially on big boxen.  I have an 8 socket box
that thinks it's an extra large PC, there, collision avoidance matters
hugely.  I couldn't reproduce bean counting woes, no idea if collision
avoidance will help that.

-Mike


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Luiz Capitulino
On Thu, 30 Mar 2017 06:46:30 +0800
Wanpeng Li  wrote:

> > So! Now we need to find a proper fix :o)
> >
> > Hmm, how bad would it be to revert to sched_clock() instead of jiffies in 
> > vtime_delta()?
> > We could use nanosecond granularity to check deltas but only perform an 
> > actual cputime update
> > when that delta >= TICK_NSEC. That should keep the load ok.  
> 
> Yeah, I mentioned something similar before.
> https://lkml.org/lkml/2017/3/26/138 However, Rik's commit optimized
> syscalls by not utilize sched_clock(), so if we should distinguish
> between syscalls/exceptions and irqs?

Why not use ktime_get()?

Here's the solution I was thinking about, it's mostly untested. I'm
rate limiting below TICK_NSEC because I want to avoid syncing with
the tick.

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index f3778e2b..a8b1e85 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -676,18 +676,20 @@ void thread_group_cputime_adjusted(struct task_struct *p, 
u64 *ut, u64 *st)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static u64 vtime_delta(struct task_struct *tsk)
 {
-   unsigned long now = READ_ONCE(jiffies);
+   return ktime_sub(ktime_get(), tsk->vtime_snap);
+}
 
-   if (time_before(now, (unsigned long)tsk->vtime_snap))
-   return 0;
+/* A little bit less than the tick period */
+#define VTIME_RATE_LIMIT (TICK_NSEC - 20)
 
-   return jiffies_to_nsecs(now - tsk->vtime_snap);
+static bool vtime_should_account(struct task_struct *tsk)
+{
+   return vtime_delta(tsk) > VTIME_RATE_LIMIT;
 }
 
 static u64 get_vtime_delta(struct task_struct *tsk)
 {
-   unsigned long now = READ_ONCE(jiffies);
-   u64 delta, other;
+   u64 delta, other, now = ktime_get();
 
/*
 * Unlike tick based timing, vtime based timing never has lost
@@ -696,7 +698,7 @@ static u64 get_vtime_delta(struct task_struct *tsk)
 * elapsed time. Limit account_other_time to prevent rounding
 * errors from causing elapsed vtime to go negative.
 */
-   delta = jiffies_to_nsecs(now - tsk->vtime_snap);
+   delta = ktime_sub(now, tsk->vtime_snap);
other = account_other_time(delta);
WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
tsk->vtime_snap = now;
@@ -711,7 +713,7 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
-   if (!vtime_delta(tsk))
+   if (!vtime_should_account(tsk))
return;
 
write_seqcount_begin(&tsk->vtime_seqcount);
@@ -723,7 +725,7 @@ void vtime_account_user(struct task_struct *tsk)
 {
write_seqcount_begin(&tsk->vtime_seqcount);
tsk->vtime_snap_whence = VTIME_SYS;
-   if (vtime_delta(tsk))
+   if (vtime_should_account(tsk))
account_user_time(tsk, get_vtime_delta(tsk));
write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -731,7 +733,7 @@ void vtime_account_user(struct task_struct *tsk)
 void vtime_user_enter(struct task_struct *tsk)
 {
write_seqcount_begin(&tsk->vtime_seqcount);
-   if (vtime_delta(tsk))
+   if (vtime_should_account(tsk))
__vtime_account_system(tsk);
tsk->vtime_snap_whence = VTIME_USER;
write_seqcount_end(&tsk->vtime_seqcount);
@@ -747,7 +749,7 @@ void vtime_guest_enter(struct task_struct *tsk)
 * that can thus safely catch up with a tickless delta.
 */
write_seqcount_begin(&tsk->vtime_seqcount);
-   if (vtime_delta(tsk))
+   if (vtime_should_account(tsk))
__vtime_account_system(tsk);
current->flags |= PF_VCPU;
write_seqcount_end(&tsk->vtime_seqcount);
@@ -776,7 +778,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
write_seqcount_begin(¤t->vtime_seqcount);
current->vtime_snap_whence = VTIME_SYS;
-   current->vtime_snap = jiffies;
+   current->vtime_snap = ktime_get();
write_seqcount_end(¤t->vtime_seqcount);
 }
 


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Wanpeng Li
2017-03-30 4:08 GMT+08:00 Rik van Riel :
> On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
>> On Tue, 28 Mar 2017 13:24:06 -0400
>> Luiz Capitulino  wrote:
>>
>> >  1. In my tracing I'm seeing that sometimes (always?) the
>> > time interval between two timer interrupts is less than 1ms
>>
>> I think that's the root cause.
>>
>> In this trace, we see the following:
>>
>>  1. On CPU15, we transition from user-space to kernel-space because
>> of a timer interrupt (it's the tick)
>>
>>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>> last accounting
>>
>>  3. While CPU15 is executing in kernel-space, jiffies is updated
>> by CPU0
>>
>>  4. When going back to user-space, vtime_delta() returns non-zero
>> and the whole time is accounted for system time (observe how
>> the cputime parameter in account_system_time() is less than 1ms)
>
> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?
>
> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?
>
> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.
>
> Of course, that assumes the above hypothesis is correct :)

There is such a feature skew_tick currently, refer to commit
5307c9556bc (tick: add tick skew boot option), w/ skew_tick=1 boot
parameter, the bug disappear, however, the commit also mentioned that
it will hurt power consumption. I will try Frederic's proposal which
is similar to my original idea "how bad would it be to revert to
sched_clock() instead of jiffies in vtime_delta()? We could use
nanosecond granularity to check deltas but only perform an actual
cputime update when that delta >= TICK_NSEC."

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Luiz Capitulino
On Wed, 29 Mar 2017 23:12:00 +0200
Frederic Weisbecker  wrote:

> On Wed, Mar 29, 2017 at 09:23:57AM -0400, Luiz Capitulino wrote:
> > 
> > There are various reproducers actually. I started off with the simple
> > loop above, then wrote the attach program and then wrote the one
> > you're mentioning:
> > 
> >  http://people.redhat.com/~lcapitul/real-time/acct-bug.c
> > 
> > All of them reproduce the issue 100% of the time for me.  
> 
> > #define _GNU_SOURCE
> > #include 
> > #include 
> > #include 
> > #include 
> > #include 
> > 
> > static int move_to_cpu(int cpu)
> > {
> > cpu_set_t set;
> > 
> > CPU_ZERO(&set);
> > CPU_SET(cpu, &set);
> > return sched_setaffinity(0, sizeof(set), &set);
> > }
> > 
> > static void loop(void)
> > {
> > for (;;) ;
> > }
> > 
> > static int fork_hog(int cpu)
> > {
> > int pid;
> > 
> > pid = (int) fork();
> > if (pid == 0) {
> > move_to_cpu(cpu);
> > loop();
> > exit(0);
> > }
> > 
> > return pid;
> > }
> > 
> > int main(int argc, char *argv[])
> > {
> > int i, pid, cpu, nr_procs;
> > 
> > if (argc != 3) {
> > printf("usage: hog < nr-procs > < CPU >\n");
> > exit(1);
> > }
> > 
> > cpu = atoi(argv[2]);
> > nr_procs = atoi(argv[1]);
> > 
> > for (i = 0; i < nr_procs; i++) {
> > pid = fork_hog(cpu);
> > fprintf(stderr, "created hog%d pid=%d\n", i, pid);
> > }
> > 
> > fprintf(stderr, "pausing...\n");
> > pause();
> > 
> > return 0;
> > }  
> 
> I just tried both of these and none seem to show incorrect cputime :-/
> I'm wondering if that bug depends on some hardware.

Are you running on x86? My CPU is:

Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz

I wonder if this issue depends on the timer used by the hrtimer
subsystem.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Frederic Weisbecker
(Adding Thomas in Cc)

On Wed, Mar 29, 2017 at 04:08:45PM -0400, Rik van Riel wrote:
> On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
> > On Tue, 28 Mar 2017 13:24:06 -0400
> > Luiz Capitulino  wrote:
> > 
> > >  1. In my tracing I'm seeing that sometimes (always?) the
> > > time interval between two timer interrupts is less than 1ms
> > 
> > I think that's the root cause.
> > 
> > In this trace, we see the following:
> > 
> >  1. On CPU15, we transition from user-space to kernel-space because
> > of a timer interrupt (it's the tick)
> > 
> >  2. vtimer_delta() returns 0, because jiffies didn't change since the
> > last accounting
> > 
> >  3. While CPU15 is executing in kernel-space, jiffies is updated
> > by CPU0
> > 
> >  4. When going back to user-space, vtime_delta() returns non-zero
> > and the whole time is accounted for system time (observe how
> > the cputime parameter in account_system_time() is less than 1ms)
> 
> In other words, the tick on cpu0 is aligned
> with the tick on the nohz_full cpus, and
> jiffies is advanced while the nohz_full cpus
> with an active tick happen to be in kernel
> mode?

Ah you found out faster than me :-)

> Frederic, can you think of any reason why
> the tick on nohz_full CPUs would end up aligned
> with the tick on cpu0, instead of running at some
> random offset?

tick_init_jiffy_update() takes that decision to align all ticks.

I'm not sure why. I don't see anything that could depend on that
wide tick synchronization. The jiffies update itself relies on ktime
to check when to update it. So even if the tick fires a bit later
on CPU 1 than on CPU 0, the jiffies updates should stay coherent and
should never exceed 999us delay in the worst case (for HZ=1000)

Now I might overlook something.

> 
> A random offset, or better yet a somewhat randomized
> tick length to make sure that simultaneous ticks are
> fairly rare and the vtime sampling does not end up
> "in phase" with the jiffies incrementing, could make
> the accounting work right again.
> 
> Of course, that assumes the above hypothesis is correct :)

I'm not sure that randomizing the tick start per CPU would be a
right solution. Somewhere in the world you can be sure the tick
randomization of some nohz_full CPU will coincide with the tick
of CPU 0 :o)

Or we could force that tick on nohz_full CPUs to be far from
CPU 0's tick... I'm not sure such a solution would be accepted though.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Wanpeng Li
2017-03-30 6:17 GMT+08:00 Frederic Weisbecker :
> On Wed, Mar 29, 2017 at 01:16:56PM -0400, Luiz Capitulino wrote:
>> On Tue, 28 Mar 2017 13:24:06 -0400
>> Luiz Capitulino  wrote:
>>
>> >  1. In my tracing I'm seeing that sometimes (always?) the
>> > time interval between two timer interrupts is less than 1ms
>>
>> I think that's the root cause.
>>
>> I'm getting traces like this:
>>
>>hog-11980 [015]   341.494491: function: enter_from_user_mode 
>> <-- apic_timer_interrupt
>> -0 [000]   341.494492: function: 
>> smp_apic_timer_interrupt <-- apic_timer_interrupt
>>hog-11980 [015]   341.494492: function: 
>> __context_tracking_exit <-- enter_from_user_mode
>> -0 [000]   341.494492: function: irq_enter <-- 
>> smp_apic_timer_interrupt
>>hog-11980 [015]   341.494492: bprint:   vtime_delta: diff=0 
>> (now=4295008339 vtime_snap=4295008339)
>>hog-11980 [015]   341.494492: function: 
>> smp_apic_timer_interrupt <-- apic_timer_interrupt
>>hog-11980 [015]   341.494492: function: irq_enter <-- 
>> smp_apic_timer_interrupt
>>hog-11980 [015]   341.494493: function: tick_sched_timer <-- 
>> __hrtimer_run_queues
>> -0 [000]   341.494493: function: tick_sched_timer <-- 
>> __hrtimer_run_queues
>> -0 [000]   341.494493: function: 
>> tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
>> -0 [000]   341.494494: function: do_timer <-- 
>> tick_do_update_jiffies64.part.14
>>hog-11980 [015]   341.494494: function: irq_exit <-- 
>> smp_apic_timer_interrupt
>> -0 [000]   341.494494: bprint:   do_timer: updated 
>> jiffies_64=4295008340 ticks=1
>>hog-11980 [015]   341.494494: function: 
>> __context_tracking_enter <-- prepare_exit_to_usermode
>>hog-11980 [015]   341.494494: function: vtime_user_enter <-- 
>> __context_tracking_enter
>>hog-11980 [015]   341.494495: bprint:   vtime_delta: 
>> diff=100 (now=4295008340 vtime_snap=4295008339)
>>hog-11980 [015]   341.494495: function: 
>> __vtime_account_system <-- vtime_user_enter
>>hog-11980 [015]   341.494495: bprint:   get_vtime_delta: 
>> vtime_snap=4295008339 now=4295008340
>>hog-11980 [015]   341.494495: function: account_system_time 
>> <-- __vtime_account_system
>>hog-11980 [015]   341.494495: bprint:   account_system_time: 
>> cputime=995488
>> -0 [000]   341.494497: function: irq_exit <-- 
>> smp_apic_timer_interrupt
>>
>> In this trace, we see the following:
>>
>>  1. On CPU15, we transition from user-space to kernel-space because
>> of a timer interrupt (it's the tick)
>>
>>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>> last accounting
>>
>>  3. While CPU15 is executing in kernel-space, jiffies is updated
>> by CPU0
>>
>>  4. When going back to user-space, vtime_delta() returns non-zero
>> and the whole time is accounted for system time (observe how
>> the cputime parameter in account_system_time() is less than 1ms)
>
> Aah, so the issue can indeed happen if all CPUs fire their ticks at the same 
> time:
>
>
>  CPU 0 CPU 1
>  - -
>exit_user() // no cputime 
> update
> tick X   update_jiffies
>enter_user() // cputime update
>
>
>exit_user() //no cputime update
> tick X+1 update_jiffies
>enter_user() // cputime update
>
>>
>> That's why my patch from yesterday fixed the issue, it increased the
>> tick period to more than 1ms. So vtime_delta() always evaluate to true
>> when transitioning from user-space to kernel-space (because we spend
>> more than 1ms in user-space between ticks). The patch below achieves
>> the same result by adding 10us to the tick period.
>>
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 7fe53be..00e46df 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -1165,7 +1165,7 @@ static enum hrtimer_restart tick_sched_timer(struct 
>> hrtimer *timer)
>> if (unlikely(ts->tick_stopped))
>> return HRTIMER_NORESTART;
>>
>> -   hrtimer_forward(timer, now, tick_period);
>> +   hrtimer_forward(timer, now, tick_period + 1);
>
> I'm surprised it works though. If the 10us shift was only applied to CPU 0 
> and not the
> others then yes, but if it is applied to all CPUs, the ticks stay 
> synchronized and the
> problem should stay...
>
> Ah wait! It can work because the nohz_full CPUs have their ticks sometimes 
> scheduled
> by tick_nohz_stop_sched_tick() or tick_nohz_restart_sched_tick() which 

Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Frederic Weisbecker
On Wed, Mar 29, 2017 at 09:23:57AM -0400, Luiz Capitulino wrote:
> 
> There are various reproducers actually. I started off with the simple
> loop above, then wrote the attach program and then wrote the one
> you're mentioning:
> 
>  http://people.redhat.com/~lcapitul/real-time/acct-bug.c
> 
> All of them reproduce the issue 100% of the time for me.

> #define _GNU_SOURCE
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> static int move_to_cpu(int cpu)
> {
> cpu_set_t set;
> 
> CPU_ZERO(&set);
> CPU_SET(cpu, &set);
> return sched_setaffinity(0, sizeof(set), &set);
> }
> 
> static void loop(void)
> {
> for (;;) ;
> }
> 
> static int fork_hog(int cpu)
> {
> int pid;
> 
> pid = (int) fork();
> if (pid == 0) {
> move_to_cpu(cpu);
> loop();
> exit(0);
> }
> 
> return pid;
> }
> 
> int main(int argc, char *argv[])
> {
> int i, pid, cpu, nr_procs;
> 
>   if (argc != 3) {
>   printf("usage: hog < nr-procs > < CPU >\n");
>   exit(1);
>   }
> 
>   cpu = atoi(argv[2]);
>   nr_procs = atoi(argv[1]);
> 
> for (i = 0; i < nr_procs; i++) {
> pid = fork_hog(cpu);
> fprintf(stderr, "created hog%d pid=%d\n", i, pid);
> }
> 
> fprintf(stderr, "pausing...\n");
> pause();
> 
> return 0;
> }

I just tried both of these and none seem to show incorrect cputime :-/
I'm wondering if that bug depends on some hardware.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Rik van Riel
On Wed, 2017-03-29 at 13:16 -0400, Luiz Capitulino wrote:
> On Tue, 28 Mar 2017 13:24:06 -0400
> Luiz Capitulino  wrote:
> 
> >  1. In my tracing I'm seeing that sometimes (always?) the
> > time interval between two timer interrupts is less than 1ms
> 
> I think that's the root cause.
> 
> In this trace, we see the following:
> 
>  1. On CPU15, we transition from user-space to kernel-space because
> of a timer interrupt (it's the tick)
> 
>  2. vtimer_delta() returns 0, because jiffies didn't change since the
> last accounting
> 
>  3. While CPU15 is executing in kernel-space, jiffies is updated
> by CPU0
> 
>  4. When going back to user-space, vtime_delta() returns non-zero
> and the whole time is accounted for system time (observe how
> the cputime parameter in account_system_time() is less than 1ms)

In other words, the tick on cpu0 is aligned
with the tick on the nohz_full cpus, and
jiffies is advanced while the nohz_full cpus
with an active tick happen to be in kernel
mode?

Frederic, can you think of any reason why
the tick on nohz_full CPUs would end up aligned
with the tick on cpu0, instead of running at some
random offset?

A random offset, or better yet a somewhat randomized
tick length to make sure that simultaneous ticks are
fairly rare and the vtime sampling does not end up
"in phase" with the jiffies incrementing, could make
the accounting work right again.

Of course, that assumes the above hypothesis is correct :)


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Luiz Capitulino
On Wed, 29 Mar 2017 09:14:32 -0400
Rik van Riel  wrote:

> > I failed to reproduce with your config. I'm still getting 99%
> > userspace
> > cputime. So I'm wondering if the hogging style plays a role.
> > 
> > I run pure user loops:
> > 
> > int main(int argc, char **argv)
> > {
> > for (;;);
> > return 0
> > }
> > 
> > Does your user program perform syscalls or IOs of some sort?  
> 
> Luiz's program makes a syscall every millisecond,
> if started with the arguments he gave as his
> reproducer.

There are various reproducers actually. I started off with the simple
loop above, then wrote the attach program and then wrote the one
you're mentioning:

 http://people.redhat.com/~lcapitul/real-time/acct-bug.c

All of them reproduce the issue 100% of the time for me.
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 

static int move_to_cpu(int cpu)
{
cpu_set_t set;

CPU_ZERO(&set);
CPU_SET(cpu, &set);
return sched_setaffinity(0, sizeof(set), &set);
}

static void loop(void)
{
for (;;) ;
}

static int fork_hog(int cpu)
{
int pid;

pid = (int) fork();
if (pid == 0) {
move_to_cpu(cpu);
loop();
exit(0);
}

return pid;
}

int main(int argc, char *argv[])
{
int i, pid, cpu, nr_procs;

		if (argc != 3) {
			printf("usage: hog < nr-procs > < CPU >\n");
			exit(1);
		}

		cpu = atoi(argv[2]);
		nr_procs = atoi(argv[1]);

for (i = 0; i < nr_procs; i++) {
pid = fork_hog(cpu);
fprintf(stderr, "created hog%d pid=%d\n", i, pid);
}

fprintf(stderr, "pausing...\n");
pause();

return 0;
}


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Rik van Riel
On Wed, 2017-03-29 at 15:04 +0200, Frederic Weisbecker wrote:
> On Thu, Mar 23, 2017 at 04:55:12PM -0400, Luiz Capitulino wrote:
> > 
> > When there are two or more tasks executing in user-space and
> > taking 100% of a nohz_full CPU, top reports 70% system time
> > and 30% user time utilization. Sometimes I'm even able to get
> > 100% system time and 0% user time.
> > 
> > This was reproduced with latest Linus tree (093b995), but I
> > don't believe it's a regression (at least not a recent one)
> > as I can reproduce it with older kernels. Also, I have
> > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > without it yet.
> > 
> > Below you'll find the steps to reproduce and some initial
> > analysis.
> > 
> > Steps to reproduce
> > --
> > 
> > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > 
> > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> 
> I failed to reproduce with your config. I'm still getting 99%
> userspace
> cputime. So I'm wondering if the hogging style plays a role.
> 
> I run pure user loops:
> 
> int main(int argc, char **argv)
> {
> for (;;);
> return 0
> }
> 
> Does your user program perform syscalls or IOs of some sort?

Luiz's program makes a syscall every millisecond,
if started with the arguments he gave as his
reproducer.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Frederic Weisbecker
On Thu, Mar 23, 2017 at 04:55:12PM -0400, Luiz Capitulino wrote:
> 
> When there are two or more tasks executing in user-space and
> taking 100% of a nohz_full CPU, top reports 70% system time
> and 30% user time utilization. Sometimes I'm even able to get
> 100% system time and 0% user time.
> 
> This was reproduced with latest Linus tree (093b995), but I
> don't believe it's a regression (at least not a recent one)
> as I can reproduce it with older kernels. Also, I have
> CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> without it yet.
> 
> Below you'll find the steps to reproduce and some initial
> analysis.
> 
> Steps to reproduce
> --
> 
> 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> 
> 2. Pin two tasks that hog the CPU 100% of the time to that CPU

I failed to reproduce with your config. I'm still getting 99% userspace
cputime. So I'm wondering if the hogging style plays a role.

I run pure user loops:

int main(int argc, char **argv)
{
for (;;);
return 0
}

Does your user program perform syscalls or IOs of some sort?


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Frederic Weisbecker
On Wed, Mar 29, 2017 at 05:56:30PM +0800, Wanpeng Li wrote:
> 2017-03-29 5:26 GMT+08:00 Luiz Capitulino :
> > On Tue, 28 Mar 2017 17:01:52 -0400
> > Rik van Riel  wrote:
> >
> >> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:
> >> > On Tue, 28 Mar 2017 13:24:06 -0400
> >> > Luiz Capitulino  wrote:
> >> > > I'm starting to suspect that the nohz code may be programming
> >> > > the tick period to be shorter than 1ms when it re-activates
> >> > > the tick.
> >> >
> >> > And I think I was right, it looks like the nohz code is programming
> >> > the tick period incorrectly when restarting the tick. The patch below
> >> > fixes things for me, but I still have some homework todo and more
> >> > testing before posting a patch for inclusion. Could you guys test it?
> >>
> >> Your patch seems to work. I don't claim to understand why
> >> your patch makes a difference, but for this particular test
> >> case, on this particular setup, it seems to work...
> >
> > I don't fully understand why either yet. I was looking for places
> > where nohz might be programming the tick period incorrectly and
> 
> The bug is still present when I config CONTEXT_TRACKING_FORCE and
> nohz=off in the boot parameter.

Indeed I saw something similar a few days ago with:

!CONFIG_NO_HZ_FULL && CONFIG_VIRT_CPU_ACCOUNTING_GEN && 
CONTEXT_TRACKING_FORCE

And it disappeared with CONFIG_NO_HZ_FULL=y so I didn't care much because that 
setting
isn't used in production and in fact I intend to remove CONTEXT_TRACKING_FORCE. 
But
it could be the sign of something important.

It might be different than Luiz's bug because I can't reproduce his bug yet 
even with
his config.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-29 Thread Wanpeng Li
2017-03-29 5:26 GMT+08:00 Luiz Capitulino :
> On Tue, 28 Mar 2017 17:01:52 -0400
> Rik van Riel  wrote:
>
>> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:
>> > On Tue, 28 Mar 2017 13:24:06 -0400
>> > Luiz Capitulino  wrote:
>> > > I'm starting to suspect that the nohz code may be programming
>> > > the tick period to be shorter than 1ms when it re-activates
>> > > the tick.
>> >
>> > And I think I was right, it looks like the nohz code is programming
>> > the tick period incorrectly when restarting the tick. The patch below
>> > fixes things for me, but I still have some homework todo and more
>> > testing before posting a patch for inclusion. Could you guys test it?
>>
>> Your patch seems to work. I don't claim to understand why
>> your patch makes a difference, but for this particular test
>> case, on this particular setup, it seems to work...
>
> I don't fully understand why either yet. I was looking for places
> where nohz might be programming the tick period incorrectly and

The bug is still present when I config CONTEXT_TRACKING_FORCE and
nohz=off in the boot parameter.

Regards,
Wanpeng Li

> I found that there's a case in tick_nohz_stop_sched_tick() where
> tick_nohz_restart() is called only to reprogram the tick timer,
> not cancel the tick. In this case, ts->last_tick seems to be out
> of date. Fixing this fixed accounting for me.
>
>> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> > index 7fe53be..9abe979 100644
>> > --- a/kernel/time/tick-sched.c
>> > +++ b/kernel/time/tick-sched.c
>> > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart
>> > tick_sched_timer(struct hrtimer *timer)
>> > struct pt_regs *regs = get_irq_regs();
>> > ktime_t now = ktime_get();
>> >
>> > +   ts->last_tick = now;
>> > tick_sched_do_timer(now);
>> >
>> > /*
>>
>


Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Luiz Capitulino
On Tue, 28 Mar 2017 17:24:11 -0400
Rik van Riel  wrote:

> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:
> 
> > And I think I was right, it looks like the nohz code is programming
> > the tick period incorrectly when restarting the tick. The patch below
> > fixes things for me, but I still have some homework todo and more
> > testing before posting a patch for inclusion. Could you guys test it?  
> 
> I spoke too soon.  After half an hour of runtime,
> things have gotten aligned to give me about 50/50
> user time and system time with your test case,
> again.

Hmmm, maybe it's incomplete. I still think that nohz might screwing
something up when re-activating the tick.

> 
> This is on an 8 VCPU virtual machine, with
> nohz_full=2-7, and the test case running on one
> of the nohz_full CPUs.
> 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 7fe53be..9abe979 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart
> > tick_sched_timer(struct hrtimer *timer)
> > struct pt_regs *regs = get_irq_regs();
> > ktime_t now = ktime_get();
> >  
> > +   ts->last_tick = now;
> > tick_sched_do_timer(now);
> >  
> > /*  
> 



Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Luiz Capitulino
On Tue, 28 Mar 2017 17:01:52 -0400
Rik van Riel  wrote:

> On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:
> > On Tue, 28 Mar 2017 13:24:06 -0400
> > Luiz Capitulino  wrote:  
> > > I'm starting to suspect that the nohz code may be programming
> > > the tick period to be shorter than 1ms when it re-activates
> > > the tick.  
> > 
> > And I think I was right, it looks like the nohz code is programming
> > the tick period incorrectly when restarting the tick. The patch below
> > fixes things for me, but I still have some homework todo and more
> > testing before posting a patch for inclusion. Could you guys test it?  
> 
> Your patch seems to work. I don't claim to understand why
> your patch makes a difference, but for this particular test
> case, on this particular setup, it seems to work...

I don't fully understand why either yet. I was looking for places
where nohz might be programming the tick period incorrectly and
I found that there's a case in tick_nohz_stop_sched_tick() where
tick_nohz_restart() is called only to reprogram the tick timer,
not cancel the tick. In this case, ts->last_tick seems to be out
of date. Fixing this fixed accounting for me.

> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 7fe53be..9abe979 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1152,6 +1152,7 @@ static enum hrtimer_restart
> > tick_sched_timer(struct hrtimer *timer)
> > struct pt_regs *regs = get_irq_regs();
> > ktime_t now = ktime_get();
> >  
> > +   ts->last_tick = now;
> > tick_sched_do_timer(now);
> >  
> > /*  
> 



Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Rik van Riel
On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:

> And I think I was right, it looks like the nohz code is programming
> the tick period incorrectly when restarting the tick. The patch below
> fixes things for me, but I still have some homework todo and more
> testing before posting a patch for inclusion. Could you guys test it?

I spoke too soon.  After half an hour of runtime,
things have gotten aligned to give me about 50/50
user time and system time with your test case,
again.

This is on an 8 VCPU virtual machine, with
nohz_full=2-7, and the test case running on one
of the nohz_full CPUs.

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..9abe979 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1152,6 +1152,7 @@ static enum hrtimer_restart
> tick_sched_timer(struct hrtimer *timer)
> struct pt_regs *regs = get_irq_regs();
> ktime_t now = ktime_get();
>  
> +   ts->last_tick = now;
> tick_sched_do_timer(now);
>  
> /*


Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Rik van Riel
On Tue, 2017-03-28 at 16:14 -0400, Luiz Capitulino wrote:
> On Tue, 28 Mar 2017 13:24:06 -0400
> Luiz Capitulino  wrote:
> > I'm starting to suspect that the nohz code may be programming
> > the tick period to be shorter than 1ms when it re-activates
> > the tick.
> 
> And I think I was right, it looks like the nohz code is programming
> the tick period incorrectly when restarting the tick. The patch below
> fixes things for me, but I still have some homework todo and more
> testing before posting a patch for inclusion. Could you guys test it?

Your patch seems to work. I don't claim to understand why
your patch makes a difference, but for this particular test
case, on this particular setup, it seems to work...

> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..9abe979 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1152,6 +1152,7 @@ static enum hrtimer_restart
> tick_sched_timer(struct hrtimer *timer)
> struct pt_regs *regs = get_irq_regs();
> ktime_t now = ktime_get();
>  
> +   ts->last_tick = now;
> tick_sched_do_timer(now);
>  
> /*


Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Luiz Capitulino
On Tue, 28 Mar 2017 13:28:13 +0800
Wanpeng Li  wrote:

> 2017-03-28 2:38 GMT+08:00 Luiz Capitulino :
> > On Mon, 27 Mar 2017 09:56:47 +0800
> > Wanpeng Li  wrote:
> >  
> >> Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched,
> >> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
> >> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true,
> >> then just stress all the online cpus or just one cpu and leave others
> >> idle(so it stresses the global timekeeping one), top show 100%
> >> sys-time. And another way to reproduce it is by nohz_full, and gives
> >> the stress to the house keeping cpu, the top show 100% sys-time of the
> >> house keeping cpu, and also the other cpus who have at least two tasks
> >> running on and in full_nohz mode.  
> >
> > We're not short on reproducers, I have a new one too:
> >
> >  http://people.redhat.com/~lcapitul/real-time/acct-bug.c
> >
> > This is a single threaded task that reproduces the issue. If you
> > run it as instructed, you'll get:
> >
> >  - nohz_full CPU: 95% system time 5% idle time
> >  - non-nohz_full CPU: 95% user time 5% idle time (expected behavior)
> >
> > This reproduces the issue, but not for the reasons I expected. I was
> > trying to mimic what I was seeing on my trace when tracing the two
> > task problem. Which is: a task stays 995us in user-space and then
> > enters the kernel. Time won't be accounted for user-space because
> > we're not 1 jiffies yet, but if the task stays in the kernel for more
> > than 5us, then time will be accounted for system time when going
> > back to user-space.
> >
> > However, what really seems to be happening is: acct-bug is causing
> > the tick to be re-activated (why? it shouldn't) and that causes the
> > issue to appear. This is consistent with my other observations: I
> > can only reproduce the issue if the nohz_full CPU re-activates the tick.  
> 
> I see there are other kthreads like migration, kworker,
> torture_shuffle etc on the isolated CPU.

Except for torture_shuffle (which is new to me, and I guess could
be disabled in .config) the other threads should not be runnable
for most of the time.

> 
> Regards,
> Wanpeng Li
> 
> >  
> >> Let's consider the cpu which has responsibility for the global
> >> timekeeping, as the tracing posted above, the vtime_account_user() is
> >> called before tick_sched_timer() which will update jiffies,  
> >
> > But the vtime_account_user() call and the jiffies update happen
> > on different CPUs, no? So the ordering shouldn't matter.
> >  
> >> so jiffies
> >> is stale in vtime_account_user() and the run time in userspace is
> >> skipped, the vtime_user_enter() is called after jiffies update, so
> >> both the time in userspace and in  kernel are accumulated to sys time.
> >>
> >> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
> >> fine. However, if you give stress to the housekeeping cpu, top will
> >> show 100% sys-time of both the housekeeping cpu and the other cpus who
> >> have at least two tasks running on and in full_nohz mode.  
> >
> > The housekeeping CPUs are idle with my reproducers.
> >  
> >> I think it
> >> is because the stress delays the timer interrupt handling in some
> >> degree, then the jiffies is not updated timely before other cpus
> >> access it in vtime_account_user().
> >>
> >> I think we can keep syscalls/exceptions context tracking still in
> >> jiffies based sampling and utilize local_clock() in vtime_delta()
> >> again for irqs which avoids jiffies stale influence. I can make a
> >> patch if the idea is acceptable or there is any better proposal. :)
> >>
> >> Regards,
> >> Wanpeng Li
> >>  
> >  
> 



Re: [BUG nohz]: wrong user and system time accounting

2017-03-28 Thread Wanpeng Li
2017-03-28 1:35 GMT+08:00 Rik van Riel :
> On Mon, 2017-03-27 at 09:56 +0800, Wanpeng Li wrote:
>>
>> Actually after I bisect, the first bad commit is ff9a9b4c4334
>> ("sched,
>> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
>> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true
>
> At the time, we thought it was an "occasionally bad" / "unlucky"
> kind of bug, not a systemic issue, like your observations seem
> to suggest.
>
>> Let's consider the cpu which has responsibility for the global
>> timekeeping, as the tracing posted above, the vtime_account_user() is
>> called before tick_sched_timer() which will update jiffies, so
>> jiffies
>> is stale in vtime_account_user() and the run time in userspace is
>> skipped, the vtime_user_enter() is called after jiffies update, so
>> both the time in userspace and in  kernel are accumulated to sys
>> time.
>> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
>> fine. However, if you give stress to the housekeeping cpu, top will
>> show 100% sys-time of both the housekeeping cpu and the other cpus
>> who
>> have at least two tasks running on and in full_nohz mode. I think it
>> is because the stress delays the timer interrupt handling in some
>> degree, then the jiffies is not updated timely before other cpus
>> access it in vtime_account_user().
>>
>> I think we can keep syscalls/exceptions context tracking still in
>> jiffies based sampling and utilize local_clock() in vtime_delta()
>> again for irqs which avoids jiffies stale influence. I can make a
>> patch if the idea is acceptable or there is any better proposal. :)
>
> Making that patch seems worthwhile, but I would like to
> know what the root cause is of the issue that is being
> observed.
>
> Is the problem due to the nohz_full CPU receiving an
> interrupt at the same time the timer interrupt fires on
> the housekeeping CPU?
>
> Is it due to a nohz_full CPU updating jiffies all by
> itself from irq context?  In that case, could it be
> better to always have that be done by the housekeeping
> CPU?

I observed that the jiffies is always updated by housekeeping CPU as
we expected.

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-03-27 Thread Wanpeng Li
2017-03-28 2:38 GMT+08:00 Luiz Capitulino :
> On Mon, 27 Mar 2017 09:56:47 +0800
> Wanpeng Li  wrote:
>
>> Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched,
>> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
>> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true,
>> then just stress all the online cpus or just one cpu and leave others
>> idle(so it stresses the global timekeeping one), top show 100%
>> sys-time. And another way to reproduce it is by nohz_full, and gives
>> the stress to the house keeping cpu, the top show 100% sys-time of the
>> house keeping cpu, and also the other cpus who have at least two tasks
>> running on and in full_nohz mode.
>
> We're not short on reproducers, I have a new one too:
>
>  http://people.redhat.com/~lcapitul/real-time/acct-bug.c
>
> This is a single threaded task that reproduces the issue. If you
> run it as instructed, you'll get:
>
>  - nohz_full CPU: 95% system time 5% idle time
>  - non-nohz_full CPU: 95% user time 5% idle time (expected behavior)
>
> This reproduces the issue, but not for the reasons I expected. I was
> trying to mimic what I was seeing on my trace when tracing the two
> task problem. Which is: a task stays 995us in user-space and then
> enters the kernel. Time won't be accounted for user-space because
> we're not 1 jiffies yet, but if the task stays in the kernel for more
> than 5us, then time will be accounted for system time when going
> back to user-space.
>
> However, what really seems to be happening is: acct-bug is causing
> the tick to be re-activated (why? it shouldn't) and that causes the
> issue to appear. This is consistent with my other observations: I
> can only reproduce the issue if the nohz_full CPU re-activates the tick.

I see there are other kthreads like migration, kworker,
torture_shuffle etc on the isolated CPU.

Regards,
Wanpeng Li

>
>> Let's consider the cpu which has responsibility for the global
>> timekeeping, as the tracing posted above, the vtime_account_user() is
>> called before tick_sched_timer() which will update jiffies,
>
> But the vtime_account_user() call and the jiffies update happen
> on different CPUs, no? So the ordering shouldn't matter.
>
>> so jiffies
>> is stale in vtime_account_user() and the run time in userspace is
>> skipped, the vtime_user_enter() is called after jiffies update, so
>> both the time in userspace and in  kernel are accumulated to sys time.
>>
>> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
>> fine. However, if you give stress to the housekeeping cpu, top will
>> show 100% sys-time of both the housekeeping cpu and the other cpus who
>> have at least two tasks running on and in full_nohz mode.
>
> The housekeeping CPUs are idle with my reproducers.
>
>> I think it
>> is because the stress delays the timer interrupt handling in some
>> degree, then the jiffies is not updated timely before other cpus
>> access it in vtime_account_user().
>>
>> I think we can keep syscalls/exceptions context tracking still in
>> jiffies based sampling and utilize local_clock() in vtime_delta()
>> again for irqs which avoids jiffies stale influence. I can make a
>> patch if the idea is acceptable or there is any better proposal. :)
>>
>> Regards,
>> Wanpeng Li
>>
>


Re: [BUG nohz]: wrong user and system time accounting

2017-03-27 Thread Luiz Capitulino
On Mon, 27 Mar 2017 09:56:47 +0800
Wanpeng Li  wrote:

> Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched,
> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true,
> then just stress all the online cpus or just one cpu and leave others
> idle(so it stresses the global timekeeping one), top show 100%
> sys-time. And another way to reproduce it is by nohz_full, and gives
> the stress to the house keeping cpu, the top show 100% sys-time of the
> house keeping cpu, and also the other cpus who have at least two tasks
> running on and in full_nohz mode.

We're not short on reproducers, I have a new one too:

 http://people.redhat.com/~lcapitul/real-time/acct-bug.c

This is a single threaded task that reproduces the issue. If you
run it as instructed, you'll get:

 - nohz_full CPU: 95% system time 5% idle time
 - non-nohz_full CPU: 95% user time 5% idle time (expected behavior)

This reproduces the issue, but not for the reasons I expected. I was
trying to mimic what I was seeing on my trace when tracing the two
task problem. Which is: a task stays 995us in user-space and then
enters the kernel. Time won't be accounted for user-space because
we're not 1 jiffies yet, but if the task stays in the kernel for more
than 5us, then time will be accounted for system time when going
back to user-space.

However, what really seems to be happening is: acct-bug is causing
the tick to be re-activated (why? it shouldn't) and that causes the
issue to appear. This is consistent with my other observations: I
can only reproduce the issue if the nohz_full CPU re-activates the tick.

> Let's consider the cpu which has responsibility for the global
> timekeeping, as the tracing posted above, the vtime_account_user() is
> called before tick_sched_timer() which will update jiffies,

But the vtime_account_user() call and the jiffies update happen
on different CPUs, no? So the ordering shouldn't matter.

> so jiffies
> is stale in vtime_account_user() and the run time in userspace is
> skipped, the vtime_user_enter() is called after jiffies update, so
> both the time in userspace and in  kernel are accumulated to sys time.
>
> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
> fine. However, if you give stress to the housekeeping cpu, top will
> show 100% sys-time of both the housekeeping cpu and the other cpus who
> have at least two tasks running on and in full_nohz mode.

The housekeeping CPUs are idle with my reproducers.

> I think it
> is because the stress delays the timer interrupt handling in some
> degree, then the jiffies is not updated timely before other cpus
> access it in vtime_account_user().
> 
> I think we can keep syscalls/exceptions context tracking still in
> jiffies based sampling and utilize local_clock() in vtime_delta()
> again for irqs which avoids jiffies stale influence. I can make a
> patch if the idea is acceptable or there is any better proposal. :)
> 
> Regards,
> Wanpeng Li
> 



Re: [BUG nohz]: wrong user and system time accounting

2017-03-27 Thread Rik van Riel
On Mon, 2017-03-27 at 09:56 +0800, Wanpeng Li wrote:
> 
> Actually after I bisect, the first bad commit is ff9a9b4c4334
> ("sched,
> time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
> can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true

At the time, we thought it was an "occasionally bad" / "unlucky"
kind of bug, not a systemic issue, like your observations seem
to suggest.

> Let's consider the cpu which has responsibility for the global
> timekeeping, as the tracing posted above, the vtime_account_user() is
> called before tick_sched_timer() which will update jiffies, so
> jiffies
> is stale in vtime_account_user() and the run time in userspace is
> skipped, the vtime_user_enter() is called after jiffies update, so
> both the time in userspace and in  kernel are accumulated to sys
> time.
> If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
> fine. However, if you give stress to the housekeeping cpu, top will
> show 100% sys-time of both the housekeeping cpu and the other cpus
> who
> have at least two tasks running on and in full_nohz mode. I think it
> is because the stress delays the timer interrupt handling in some
> degree, then the jiffies is not updated timely before other cpus
> access it in vtime_account_user().
> 
> I think we can keep syscalls/exceptions context tracking still in
> jiffies based sampling and utilize local_clock() in vtime_delta()
> again for irqs which avoids jiffies stale influence. I can make a
> patch if the idea is acceptable or there is any better proposal. :)

Making that patch seems worthwhile, but I would like to
know what the root cause is of the issue that is being
observed.

Is the problem due to the nohz_full CPU receiving an
interrupt at the same time the timer interrupt fires on
the housekeeping CPU?

Is it due to a nohz_full CPU updating jiffies all by
itself from irq context?  In that case, could it be
better to always have that be done by the housekeeping
CPU?

What exactly is going on here?


Re: [BUG nohz]: wrong user and system time accounting

2017-03-26 Thread lkml
On Thu, Mar 23, 2017 at 08:56:02PM -0400, Rik van Riel wrote:
> On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:
> > When there are two or more tasks executing in user-space and
> > taking 100% of a nohz_full CPU, top reports 70% system time
> > and 30% user time utilization. Sometimes I'm even able to get
> > 100% system time and 0% user time.
> > 
> > This was reproduced with latest Linus tree (093b995), but I
> > don't believe it's a regression (at least not a recent one)
> > as I can reproduce it with older kernels. Also, I have
> > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > without it yet.
> > 
> > Below you'll find the steps to reproduce and some initial
> > analysis.
> > 
> > Steps to reproduce
> > --
> > 
> > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > 
> > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> > 
> > 3. Run top -d1 and check system time
> > 
> > NOTE: When there's only one task hogging a nohz_full CPU, top
> >   shows 100% user-time, as expected
> > 
> > Initial analysis
> > 
> > 
> > When tracing vtime accounting functions and the user-space/kernel
> > transitions when the issue is taking place, I see several of the
> > following:
> > 
> > hog-10552 [015]  1132.711104:
> > function: enter_from_user_mode <-- apic_timer_interrupt
> > hog-10552 [015]  1132.711105:
> > function: __context_tracking_exit <--
> > enter_from_user_mode
> > hog-10552 [015]  1132.711105:
> > bprint:   __context_tracking_exit.part.4: new state=1 cur
> > state=1 active=1
> > hog-10552 [015]  1132.711105:
> > function: vtime_account_user <--
> > __context_tracking_exit.part.4
> > hog-10552 [015]  1132.711105:
> > function: smp_apic_timer_interrupt <--
> > apic_timer_interrupt
> > hog-10552 [015]  1132.711106: function: irq_enter <--
> > smp_apic_timer_interrupt
> > hog-10552 [015]  1132.711106: function: tick_sched_timer
> > <-- __hrtimer_run_queues
> > hog-10552 [015]  1132.711108: function: irq_exit <--
> > smp_apic_timer_interrupt
> > hog-10552 [015]  1132.711108:
> > function: __context_tracking_enter <--
> > prepare_exit_to_usermode
> > hog-10552 [015]  1132.711108:
> > bprint:   __context_tracking_enter.part.2: new state=1
> > cur state=0 active=1
> > hog-10552 [015]  1132.711109: function: vtime_user_enter
> > <-- __context_tracking_enter.part.2
> > hog-10552 [015]  1132.711109:
> > function: __vtime_account_system <-- vtime_user_enter
> > hog-10552 [015]  1132.711109:
> > function: account_system_time <-- __vtime_account_system
> > 
> > On entering the kernel due to a timer interrupt, vtime_account_user()
> > skips user-time accounting. Then later on when returning to user-
> > space,
> > vtime_user_enter() is probably accounting the whole time (ie. user-
> > space
> > plus kernel-space) to system time.
> > 
> > Now, when does vtime_account_user() skips accounting? Well, when the
> > time delta is less then one jiffie. This would imply that
> > vtime_account_user()
> > is being called less than one jiffie since the last accounting, but I
> > haven't
> > confirmed any of this yet.
> 
> Jiffies should be advanced by the timer interrupt, on the
> housekeeping CPU, which is not doing context tracking.
> 
> Why is the isolated/nohz_full CPU receiving timer interrupts
> at all?
> 
> I thought it would not, but obviously I am wrong. What is
> going on here?

This thread sounds awful familiar to me.

With CONFIG_NO_HZ_FULL=y && CONFIG_VIRT_CPU_ACCOUNTING_GEN=y I observed process
accounting anomalies with user CPU time being misaccounted as system time all
the way back to 4.6.0.

After switching to CONFIG_NO_HZ_IDLE=y && CONFIG_VIRT_CPU_ACCOUNTING_GEN=n the
issues went away.

The lkml thread I had seen at that time which compelled me to suspect these
settings was this:
http://lkml.iu.edu/hypermail/linux/kernel/1608.2/05860.html

It sounds like this issue is finally beginning to be understood though, good
work!

Regards,
Vito Caputo


Re: [BUG nohz]: wrong user and system time accounting

2017-03-26 Thread Wanpeng Li
2017-03-24 4:55 GMT+08:00 Luiz Capitulino :
>
> When there are two or more tasks executing in user-space and
> taking 100% of a nohz_full CPU, top reports 70% system time
> and 30% user time utilization. Sometimes I'm even able to get
> 100% system time and 0% user time.
>
> This was reproduced with latest Linus tree (093b995), but I
> don't believe it's a regression (at least not a recent one)
> as I can reproduce it with older kernels. Also, I have
> CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> without it yet.
>
> Below you'll find the steps to reproduce and some initial
> analysis.
>
> Steps to reproduce
> --
>
> 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
>
> 2. Pin two tasks that hog the CPU 100% of the time to that CPU
>
> 3. Run top -d1 and check system time
>
> NOTE: When there's only one task hogging a nohz_full CPU, top
>   shows 100% user-time, as expected
>
> Initial analysis
> 
>
> When tracing vtime accounting functions and the user-space/kernel
> transitions when the issue is taking place, I see several of the
> following:
>
> hog-10552 [015]  1132.711104: function: enter_from_user_mode <-- 
> apic_timer_interrupt
> hog-10552 [015]  1132.711105: function: __context_tracking_exit 
> <-- enter_from_user_mode
> hog-10552 [015]  1132.711105: bprint:   
> __context_tracking_exit.part.4: new state=1 cur state=1 active=1
> hog-10552 [015]  1132.711105: function: vtime_account_user <-- 
> __context_tracking_exit.part.4
> hog-10552 [015]  1132.711105: function: smp_apic_timer_interrupt 
> <-- apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: irq_enter <-- 
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: tick_sched_timer <-- 
> __hrtimer_run_queues
> hog-10552 [015]  1132.711108: function: irq_exit <-- 
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711108: function: __context_tracking_enter 
> <-- prepare_exit_to_usermode
> hog-10552 [015]  1132.711108: bprint:   
> __context_tracking_enter.part.2: new state=1 cur state=0 active=1
> hog-10552 [015]  1132.711109: function: vtime_user_enter <-- 
> __context_tracking_enter.part.2
> hog-10552 [015]  1132.711109: function: __vtime_account_system 
> <-- vtime_user_enter
> hog-10552 [015]  1132.711109: function: account_system_time <-- 
> __vtime_account_system
>
> On entering the kernel due to a timer interrupt, vtime_account_user()
> skips user-time accounting. Then later on when returning to user-space,
> vtime_user_enter() is probably accounting the whole time (ie. user-space
> plus kernel-space) to system time.

Actually after I bisect, the first bad commit is ff9a9b4c4334 ("sched,
time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity"). The bug
can be reproduced readily if CONFIG_CONTEXT_TRACKING_FORCE is true,
then just stress all the online cpus or just one cpu and leave others
idle(so it stresses the global timekeeping one), top show 100%
sys-time. And another way to reproduce it is by nohz_full, and gives
the stress to the house keeping cpu, the top show 100% sys-time of the
house keeping cpu, and also the other cpus who have at least two tasks
running on and in full_nohz mode.

Let's consider the cpu which has responsibility for the global
timekeeping, as the tracing posted above, the vtime_account_user() is
called before tick_sched_timer() which will update jiffies, so jiffies
is stale in vtime_account_user() and the run time in userspace is
skipped, the vtime_user_enter() is called after jiffies update, so
both the time in userspace and in  kernel are accumulated to sys time.
If the housekeeping cpu is idle when CONFIG_NO_HZ_FULL, everything is
fine. However, if you give stress to the housekeeping cpu, top will
show 100% sys-time of both the housekeeping cpu and the other cpus who
have at least two tasks running on and in full_nohz mode. I think it
is because the stress delays the timer interrupt handling in some
degree, then the jiffies is not updated timely before other cpus
access it in vtime_account_user().

I think we can keep syscalls/exceptions context tracking still in
jiffies based sampling and utilize local_clock() in vtime_delta()
again for irqs which avoids jiffies stale influence. I can make a
patch if the idea is acceptable or there is any better proposal. :)

Regards,
Wanpeng Li


Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Luiz Capitulino
On Fri, 24 Mar 2017 09:52:11 +0800
Wanpeng Li  wrote:

> 2017-03-24 4:55 GMT+08:00 Luiz Capitulino :
> >
> > When there are two or more tasks executing in user-space and
> > taking 100% of a nohz_full CPU, top reports 70% system time
> > and 30% user time utilization. Sometimes I'm even able to get
> > 100% system time and 0% user time.
> >
> > This was reproduced with latest Linus tree (093b995), but I
> > don't believe it's a regression (at least not a recent one)
> > as I can reproduce it with older kernels. Also, I have
> > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > without it yet.
> >
> > Below you'll find the steps to reproduce and some initial
> > analysis.
> >
> > Steps to reproduce
> > --
> >
> > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> >
> > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> >
> > 3. Run top -d1 and check system time
> >
> > NOTE: When there's only one task hogging a nohz_full CPU, top
> >   shows 100% user-time, as expected  
> 
> I just saw at most 12% system time instead of 30% or 100%. Could you
> grep HZ /boot/config-`uname -r` and post here?

I'm sending you the whole thing:

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.11.0-rc3 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
# CONFIG_NO_HZ_FULL_ALL is not set
# CONFIG_NO_HZ_FULL_SYSIDLE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_IRQ_TIME_ACCOUNTING=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
# CONFIG_CONTEXT_TRACKING_FORCE is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_BUILD_BIN2C=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_

Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Wanpeng Li
2017-03-24 4:55 GMT+08:00 Luiz Capitulino :
>
> When there are two or more tasks executing in user-space and
> taking 100% of a nohz_full CPU, top reports 70% system time
> and 30% user time utilization. Sometimes I'm even able to get
> 100% system time and 0% user time.
>
> This was reproduced with latest Linus tree (093b995), but I
> don't believe it's a regression (at least not a recent one)
> as I can reproduce it with older kernels. Also, I have
> CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> without it yet.
>
> Below you'll find the steps to reproduce and some initial
> analysis.
>
> Steps to reproduce
> --
>
> 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
>
> 2. Pin two tasks that hog the CPU 100% of the time to that CPU
>
> 3. Run top -d1 and check system time
>
> NOTE: When there's only one task hogging a nohz_full CPU, top
>   shows 100% user-time, as expected

I just saw at most 12% system time instead of 30% or 100%. Could you
grep HZ /boot/config-`uname -r` and post here?

Regards,
Wanpeng Li

>
> Initial analysis
> 
>
> When tracing vtime accounting functions and the user-space/kernel
> transitions when the issue is taking place, I see several of the
> following:
>
> hog-10552 [015]  1132.711104: function: enter_from_user_mode <-- 
> apic_timer_interrupt
> hog-10552 [015]  1132.711105: function: __context_tracking_exit 
> <-- enter_from_user_mode
> hog-10552 [015]  1132.711105: bprint:   
> __context_tracking_exit.part.4: new state=1 cur state=1 active=1
> hog-10552 [015]  1132.711105: function: vtime_account_user <-- 
> __context_tracking_exit.part.4
> hog-10552 [015]  1132.711105: function: smp_apic_timer_interrupt 
> <-- apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: irq_enter <-- 
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: tick_sched_timer <-- 
> __hrtimer_run_queues
> hog-10552 [015]  1132.711108: function: irq_exit <-- 
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711108: function: __context_tracking_enter 
> <-- prepare_exit_to_usermode
> hog-10552 [015]  1132.711108: bprint:   
> __context_tracking_enter.part.2: new state=1 cur state=0 active=1
> hog-10552 [015]  1132.711109: function: vtime_user_enter <-- 
> __context_tracking_enter.part.2
> hog-10552 [015]  1132.711109: function: __vtime_account_system 
> <-- vtime_user_enter
> hog-10552 [015]  1132.711109: function: account_system_time <-- 
> __vtime_account_system
>
> On entering the kernel due to a timer interrupt, vtime_account_user()
> skips user-time accounting. Then later on when returning to user-space,
> vtime_user_enter() is probably accounting the whole time (ie. user-space
> plus kernel-space) to system time.
>
> Now, when does vtime_account_user() skips accounting? Well, when the
> time delta is less then one jiffie. This would imply that vtime_account_user()
> is being called less than one jiffie since the last accounting, but I haven't
> confirmed any of this yet.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Luiz Capitulino
On Thu, 23 Mar 2017 21:08:38 -0400
Rik van Riel  wrote:

> On Thu, 2017-03-23 at 21:05 -0400, Luiz Capitulino wrote:
> > On Thu, 23 Mar 2017 20:56:02 -0400
> > Rik van Riel  wrote:
> >   
> > > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:  
> > > > When there are two or more tasks executing in user-space and
> > > > taking 100% of a nohz_full CPU, top reports 70% system time
> > > > and 30% user time utilization. Sometimes I'm even able to get
> > > > 100% system time and 0% user time.
> > > > 
> > > > This was reproduced with latest Linus tree (093b995), but I
> > > > don't believe it's a regression (at least not a recent one)
> > > > as I can reproduce it with older kernels. Also, I have
> > > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > > > without it yet.
> > > > 
> > > > Below you'll find the steps to reproduce and some initial
> > > > analysis.
> > > > 
> > > > Steps to reproduce
> > > > --
> > > > 
> > > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > > > 
> > > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> > > > 
> > > > 3. Run top -d1 and check system time
> > > > 
> > > > NOTE: When there's only one task hogging a nohz_full CPU, top
> > > >   shows 100% user-time, as expected
> > > > 
> > > > Initial analysis
> > > > 
> > > > 
> > > > When tracing vtime accounting functions and the user-space/kernel
> > > > transitions when the issue is taking place, I see several of the
> > > > following:
> > > > 
> > > > hog-10552 [015]  1132.711104:
> > > > function: enter_from_user_mode <--
> > > > apic_timer_interrupt
> > > > hog-10552 [015]  1132.711105:
> > > > function: __context_tracking_exit <--
> > > > enter_from_user_mode
> > > > hog-10552 [015]  1132.711105:
> > > > bprint:   __context_tracking_exit.part.4: new state=1
> > > > cur
> > > > state=1 active=1
> > > > hog-10552 [015]  1132.711105:
> > > > function: vtime_account_user <--
> > > > __context_tracking_exit.part.4
> > > > hog-10552 [015]  1132.711105:
> > > > function: smp_apic_timer_interrupt <--
> > > > apic_timer_interrupt
> > > > hog-10552 [015]  1132.711106: function: irq_enter <--
> > > > smp_apic_timer_interrupt
> > > > hog-10552 [015]  1132.711106:
> > > > function: tick_sched_timer
> > > > <-- __hrtimer_run_queues
> > > > hog-10552 [015]  1132.711108: function: irq_exit <--
> > > > smp_apic_timer_interrupt
> > > > hog-10552 [015]  1132.711108:
> > > > function: __context_tracking_enter <--
> > > > prepare_exit_to_usermode
> > > > hog-10552 [015]  1132.711108:
> > > > bprint:   __context_tracking_enter.part.2: new
> > > > state=1
> > > > cur state=0 active=1
> > > > hog-10552 [015]  1132.711109:
> > > > function: vtime_user_enter
> > > > <-- __context_tracking_enter.part.2
> > > > hog-10552 [015]  1132.711109:
> > > > function: __vtime_account_system <-- vtime_user_enter
> > > > hog-10552 [015]  1132.711109:
> > > > function: account_system_time <--
> > > > __vtime_account_system
> > > > 
> > > > On entering the kernel due to a timer interrupt,
> > > > vtime_account_user()
> > > > skips user-time accounting. Then later on when returning to user-
> > > > space,
> > > > vtime_user_enter() is probably accounting the whole time (ie.
> > > > user-
> > > > space
> > > > plus kernel-space) to system time.
> > > > 
> > > > Now, when does vtime_account_user() skips accounting? Well, when
> > > > the
> > > > time delta is less then one jiffie. This would imply that
> > > > vtime_account_user()
> > > > is being called less than one jiffie since the last accounting,
> > > > but I
> > > > haven't
> > > > confirmed any of this yet.    
> > > 
> > > Jiffies should be advanced by the timer interrupt, on the
> > > housekeeping CPU, which is not doing context tracking.  
> > 
> > The hypothesis isn't that it wasn't advanced, but that we stayed in
> > user-space less than 1ms.  
> 
> That is part of the hypothesis. The other part of the hypothesis
> involves jiffies advancing on the nohz_full & isolated CPU while
> that CPU is in kernel mode 30% of the time.

OK.

> I have no good explanation for the latter yet...
> 
> > > Why is the isolated/nohz_full CPU receiving timer interrupts
> > > at all?
> > > 
> > > I thought it would not, but obviously I am wrong. What is
> > > going on here?  
> > 
> > There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When
> > that happens, the tick is re-activated. We're not nohz_full anymore,
> > but accounting should still work.  
> 
> Isn't the scheduler tick distinct from the timer interrupt,
> or am I confused?

If you consider the scheduler tick to be the code that's run
by scheduler_tick(), yes they are distinct. But I was referring
to tick_sched_timer() the "main" tick handler. This one runs
as a hrtimer handler. In the case described in this email

Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Rik van Riel
On Thu, 2017-03-23 at 21:05 -0400, Luiz Capitulino wrote:
> On Thu, 23 Mar 2017 20:56:02 -0400
> Rik van Riel  wrote:
> 
> > On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:
> > > When there are two or more tasks executing in user-space and
> > > taking 100% of a nohz_full CPU, top reports 70% system time
> > > and 30% user time utilization. Sometimes I'm even able to get
> > > 100% system time and 0% user time.
> > > 
> > > This was reproduced with latest Linus tree (093b995), but I
> > > don't believe it's a regression (at least not a recent one)
> > > as I can reproduce it with older kernels. Also, I have
> > > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > > without it yet.
> > > 
> > > Below you'll find the steps to reproduce and some initial
> > > analysis.
> > > 
> > > Steps to reproduce
> > > --
> > > 
> > > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > > 
> > > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> > > 
> > > 3. Run top -d1 and check system time
> > > 
> > > NOTE: When there's only one task hogging a nohz_full CPU, top
> > >   shows 100% user-time, as expected
> > > 
> > > Initial analysis
> > > 
> > > 
> > > When tracing vtime accounting functions and the user-space/kernel
> > > transitions when the issue is taking place, I see several of the
> > > following:
> > > 
> > > hog-10552 [015]  1132.711104:
> > > function: enter_from_user_mode <--
> > > apic_timer_interrupt
> > > hog-10552 [015]  1132.711105:
> > > function: __context_tracking_exit <--
> > > enter_from_user_mode
> > > hog-10552 [015]  1132.711105:
> > > bprint:   __context_tracking_exit.part.4: new state=1
> > > cur
> > > state=1 active=1
> > > hog-10552 [015]  1132.711105:
> > > function: vtime_account_user <--
> > > __context_tracking_exit.part.4
> > > hog-10552 [015]  1132.711105:
> > > function: smp_apic_timer_interrupt <--
> > > apic_timer_interrupt
> > > hog-10552 [015]  1132.711106: function: irq_enter <--
> > > smp_apic_timer_interrupt
> > > hog-10552 [015]  1132.711106:
> > > function: tick_sched_timer
> > > <-- __hrtimer_run_queues
> > > hog-10552 [015]  1132.711108: function: irq_exit <--
> > > smp_apic_timer_interrupt
> > > hog-10552 [015]  1132.711108:
> > > function: __context_tracking_enter <--
> > > prepare_exit_to_usermode
> > > hog-10552 [015]  1132.711108:
> > > bprint:   __context_tracking_enter.part.2: new
> > > state=1
> > > cur state=0 active=1
> > > hog-10552 [015]  1132.711109:
> > > function: vtime_user_enter
> > > <-- __context_tracking_enter.part.2
> > > hog-10552 [015]  1132.711109:
> > > function: __vtime_account_system <-- vtime_user_enter
> > > hog-10552 [015]  1132.711109:
> > > function: account_system_time <--
> > > __vtime_account_system
> > > 
> > > On entering the kernel due to a timer interrupt,
> > > vtime_account_user()
> > > skips user-time accounting. Then later on when returning to user-
> > > space,
> > > vtime_user_enter() is probably accounting the whole time (ie.
> > > user-
> > > space
> > > plus kernel-space) to system time.
> > > 
> > > Now, when does vtime_account_user() skips accounting? Well, when
> > > the
> > > time delta is less then one jiffie. This would imply that
> > > vtime_account_user()
> > > is being called less than one jiffie since the last accounting,
> > > but I
> > > haven't
> > > confirmed any of this yet.  
> > 
> > Jiffies should be advanced by the timer interrupt, on the
> > housekeeping CPU, which is not doing context tracking.
> 
> The hypothesis isn't that it wasn't advanced, but that we stayed in
> user-space less than 1ms.

That is part of the hypothesis. The other part of the hypothesis
involves jiffies advancing on the nohz_full & isolated CPU while
that CPU is in kernel mode 30% of the time.

I have no good explanation for the latter yet...

> > Why is the isolated/nohz_full CPU receiving timer interrupts
> > at all?
> > 
> > I thought it would not, but obviously I am wrong. What is
> > going on here?
> 
> There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When
> that happens, the tick is re-activated. We're not nohz_full anymore,
> but accounting should still work.

Isn't the scheduler tick distinct from the timer interrupt,
or am I confused?


Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Luiz Capitulino
On Thu, 23 Mar 2017 20:56:02 -0400
Rik van Riel  wrote:

> On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:
> > When there are two or more tasks executing in user-space and
> > taking 100% of a nohz_full CPU, top reports 70% system time
> > and 30% user time utilization. Sometimes I'm even able to get
> > 100% system time and 0% user time.
> > 
> > This was reproduced with latest Linus tree (093b995), but I
> > don't believe it's a regression (at least not a recent one)
> > as I can reproduce it with older kernels. Also, I have
> > CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> > without it yet.
> > 
> > Below you'll find the steps to reproduce and some initial
> > analysis.
> > 
> > Steps to reproduce
> > --
> > 
> > 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> > 
> > 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> > 
> > 3. Run top -d1 and check system time
> > 
> > NOTE: When there's only one task hogging a nohz_full CPU, top
> >   shows 100% user-time, as expected
> > 
> > Initial analysis
> > 
> > 
> > When tracing vtime accounting functions and the user-space/kernel
> > transitions when the issue is taking place, I see several of the
> > following:
> > 
> > hog-10552 [015]  1132.711104:
> > function: enter_from_user_mode <-- apic_timer_interrupt
> > hog-10552 [015]  1132.711105:
> > function: __context_tracking_exit <--
> > enter_from_user_mode
> > hog-10552 [015]  1132.711105:
> > bprint:   __context_tracking_exit.part.4: new state=1 cur
> > state=1 active=1
> > hog-10552 [015]  1132.711105:
> > function: vtime_account_user <--
> > __context_tracking_exit.part.4
> > hog-10552 [015]  1132.711105:
> > function: smp_apic_timer_interrupt <--
> > apic_timer_interrupt
> > hog-10552 [015]  1132.711106: function: irq_enter <--
> > smp_apic_timer_interrupt
> > hog-10552 [015]  1132.711106: function: tick_sched_timer
> > <-- __hrtimer_run_queues
> > hog-10552 [015]  1132.711108: function: irq_exit <--
> > smp_apic_timer_interrupt
> > hog-10552 [015]  1132.711108:
> > function: __context_tracking_enter <--
> > prepare_exit_to_usermode
> > hog-10552 [015]  1132.711108:
> > bprint:   __context_tracking_enter.part.2: new state=1
> > cur state=0 active=1
> > hog-10552 [015]  1132.711109: function: vtime_user_enter
> > <-- __context_tracking_enter.part.2
> > hog-10552 [015]  1132.711109:
> > function: __vtime_account_system <-- vtime_user_enter
> > hog-10552 [015]  1132.711109:
> > function: account_system_time <-- __vtime_account_system
> > 
> > On entering the kernel due to a timer interrupt, vtime_account_user()
> > skips user-time accounting. Then later on when returning to user-
> > space,
> > vtime_user_enter() is probably accounting the whole time (ie. user-
> > space
> > plus kernel-space) to system time.
> > 
> > Now, when does vtime_account_user() skips accounting? Well, when the
> > time delta is less then one jiffie. This would imply that
> > vtime_account_user()
> > is being called less than one jiffie since the last accounting, but I
> > haven't
> > confirmed any of this yet.  
> 
> Jiffies should be advanced by the timer interrupt, on the
> housekeeping CPU, which is not doing context tracking.

The hypothesis isn't that it wasn't advanced, but that we stayed in
user-space less than 1ms.

> Why is the isolated/nohz_full CPU receiving timer interrupts
> at all?
> 
> I thought it would not, but obviously I am wrong. What is
> going on here?

There are two runnable SCHED_OTHER tasks on the nohz_full CPU. When
that happens, the tick is re-activated. We're not nohz_full anymore,
but accounting should still work.


Re: [BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Rik van Riel
On Thu, 2017-03-23 at 16:55 -0400, Luiz Capitulino wrote:
> When there are two or more tasks executing in user-space and
> taking 100% of a nohz_full CPU, top reports 70% system time
> and 30% user time utilization. Sometimes I'm even able to get
> 100% system time and 0% user time.
> 
> This was reproduced with latest Linus tree (093b995), but I
> don't believe it's a regression (at least not a recent one)
> as I can reproduce it with older kernels. Also, I have
> CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
> without it yet.
> 
> Below you'll find the steps to reproduce and some initial
> analysis.
> 
> Steps to reproduce
> --
> 
> 1. Set up a CPU for nohz_full with isolcpus= nohz_full=
> 
> 2. Pin two tasks that hog the CPU 100% of the time to that CPU
> 
> 3. Run top -d1 and check system time
> 
> NOTE: When there's only one task hogging a nohz_full CPU, top
>   shows 100% user-time, as expected
> 
> Initial analysis
> 
> 
> When tracing vtime accounting functions and the user-space/kernel
> transitions when the issue is taking place, I see several of the
> following:
> 
> hog-10552 [015]  1132.711104:
> function: enter_from_user_mode <-- apic_timer_interrupt
> hog-10552 [015]  1132.711105:
> function: __context_tracking_exit <--
> enter_from_user_mode
> hog-10552 [015]  1132.711105:
> bprint:   __context_tracking_exit.part.4: new state=1 cur
> state=1 active=1
> hog-10552 [015]  1132.711105:
> function: vtime_account_user <--
> __context_tracking_exit.part.4
> hog-10552 [015]  1132.711105:
> function: smp_apic_timer_interrupt <--
> apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: irq_enter <--
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711106: function: tick_sched_timer
> <-- __hrtimer_run_queues
> hog-10552 [015]  1132.711108: function: irq_exit <--
> smp_apic_timer_interrupt
> hog-10552 [015]  1132.711108:
> function: __context_tracking_enter <--
> prepare_exit_to_usermode
> hog-10552 [015]  1132.711108:
> bprint:   __context_tracking_enter.part.2: new state=1
> cur state=0 active=1
> hog-10552 [015]  1132.711109: function: vtime_user_enter
> <-- __context_tracking_enter.part.2
> hog-10552 [015]  1132.711109:
> function: __vtime_account_system <-- vtime_user_enter
> hog-10552 [015]  1132.711109:
> function: account_system_time <-- __vtime_account_system
> 
> On entering the kernel due to a timer interrupt, vtime_account_user()
> skips user-time accounting. Then later on when returning to user-
> space,
> vtime_user_enter() is probably accounting the whole time (ie. user-
> space
> plus kernel-space) to system time.
> 
> Now, when does vtime_account_user() skips accounting? Well, when the
> time delta is less then one jiffie. This would imply that
> vtime_account_user()
> is being called less than one jiffie since the last accounting, but I
> haven't
> confirmed any of this yet.

Jiffies should be advanced by the timer interrupt, on the
housekeeping CPU, which is not doing context tracking.

Why is the isolated/nohz_full CPU receiving timer interrupts
at all?

I thought it would not, but obviously I am wrong. What is
going on here?


[BUG nohz]: wrong user and system time accounting

2017-03-23 Thread Luiz Capitulino

When there are two or more tasks executing in user-space and
taking 100% of a nohz_full CPU, top reports 70% system time
and 30% user time utilization. Sometimes I'm even able to get
100% system time and 0% user time.

This was reproduced with latest Linus tree (093b995), but I
don't believe it's a regression (at least not a recent one)
as I can reproduce it with older kernels. Also, I have
CONFIG_IRQ_TIME_ACCOUNTING=y and haven't tried to reproduce
without it yet.

Below you'll find the steps to reproduce and some initial
analysis.

Steps to reproduce
--

1. Set up a CPU for nohz_full with isolcpus= nohz_full=

2. Pin two tasks that hog the CPU 100% of the time to that CPU

3. Run top -d1 and check system time

NOTE: When there's only one task hogging a nohz_full CPU, top
  shows 100% user-time, as expected

Initial analysis


When tracing vtime accounting functions and the user-space/kernel
transitions when the issue is taking place, I see several of the
following:

hog-10552 [015]  1132.711104: function: enter_from_user_mode <-- 
apic_timer_interrupt
hog-10552 [015]  1132.711105: function: __context_tracking_exit <-- 
enter_from_user_mode
hog-10552 [015]  1132.711105: bprint:   
__context_tracking_exit.part.4: new state=1 cur state=1 active=1
hog-10552 [015]  1132.711105: function: vtime_account_user <-- 
__context_tracking_exit.part.4
hog-10552 [015]  1132.711105: function: smp_apic_timer_interrupt 
<-- apic_timer_interrupt
hog-10552 [015]  1132.711106: function: irq_enter <-- 
smp_apic_timer_interrupt
hog-10552 [015]  1132.711106: function: tick_sched_timer <-- 
__hrtimer_run_queues
hog-10552 [015]  1132.711108: function: irq_exit <-- 
smp_apic_timer_interrupt
hog-10552 [015]  1132.711108: function: __context_tracking_enter 
<-- prepare_exit_to_usermode
hog-10552 [015]  1132.711108: bprint:   
__context_tracking_enter.part.2: new state=1 cur state=0 active=1
hog-10552 [015]  1132.711109: function: vtime_user_enter <-- 
__context_tracking_enter.part.2
hog-10552 [015]  1132.711109: function: __vtime_account_system <-- 
vtime_user_enter
hog-10552 [015]  1132.711109: function: account_system_time <-- 
__vtime_account_system

On entering the kernel due to a timer interrupt, vtime_account_user()
skips user-time accounting. Then later on when returning to user-space,
vtime_user_enter() is probably accounting the whole time (ie. user-space
plus kernel-space) to system time.

Now, when does vtime_account_user() skips accounting? Well, when the
time delta is less then one jiffie. This would imply that vtime_account_user()
is being called less than one jiffie since the last accounting, but I haven't
confirmed any of this yet.