Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-11 Thread Dongli Zhang
Hi Rik,

On 10/10/2017 10:01 PM, Rik van Riel wrote:
> On Tue, 2017-10-10 at 14:48 +0200, Peter Zijlstra wrote:
>> On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
>>>>> +	u64 steal, steal_time;
>>>>> +	s64 steal_delta;
>>>>> +
>>>>> +	steal_time = paravirt_steal_clock(smp_processor_id());
>>>>> +	steal = steal_delta = steal_time - this_rq()->prev_steal_time;
>>>>> +
>>>>> +	if (unlikely(steal_delta < 0)) {
>>>>> +		this_rq()->prev_steal_time = steal_time;
>>>
>>> I don't think setting prev_steal_time to smaller value is right
>>> thing to do. 
>>>
>>> Beside, I don't think we need to check for overflow condition for
>>> cputime variables (it will happen after 279 years :-). So instead
>>> of introducing signed steal_delta variable I would just add
>>> below check, which should be sufficient to fix the problem:
>>>
>>> if (unlikely(steal <= this_rq()->prev_steal_time))
>>> return 0;
>>
>> How about you just fix up paravirt_steal_time() on migration and not
>> muck with the users ?
> 
> Not just migration, either. CPU hotplug is another time to fix up
> the steal time.

I think this issue might also be hit when we add and online a vcpu a very long
time after boot (or after the vcpu was last taken offline). Please correct me
if I am wrong.

Thank you very much!

Dongli Zhang

> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-11 Thread Dongli Zhang
Hi Stanislaw and Peter,

On 10/10/2017 08:42 PM, Stanislaw Gruszka wrote:
> On Tue, Oct 10, 2017 at 12:59:26PM +0200, Ingo Molnar wrote:
>>
>> (Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted
>> below.)
>>
>> * Dongli Zhang  wrote:
>>
>>> After guest live migration on xen, steal time in /proc/stat
>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
>>>
>>> For instance, steal time of each vcpu is 335 before live migration.
>>>
>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>
>>> After live migration, steal time is reduced to 312.
>>>
>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>
>>> The code in this patch is borrowed from do_stolen_accounting() which has
>>> already been removed from linux source code since commit ecb23dc6 ("xen:
>>> add steal_clock support on x86").
>>>
>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>> discussed by Michael Las at
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
>>> Unlike the issue discussed by Michael Las which would overflow steal time
>>> and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
>>> linux 4.11+ would only decrease but not overflow steal time after live
>>> migration.
>>>
>>> References: 
>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>> Signed-off-by: Dongli Zhang 
>>> ---
>>>  kernel/sched/cputime.c | 13 ++++++++++---
>>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>>> index 14d2dbf..57d09cab 100644
>>> --- a/kernel/sched/cputime.c
>>> +++ b/kernel/sched/cputime.c
>>> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>>>  {
>>>  #ifdef CONFIG_PARAVIRT
>>>  	if (static_key_false(&paravirt_steal_enabled)) {
>>> -		u64 steal;
>>> +		u64 steal, steal_time;
>>> +		s64 steal_delta;
>>> +
>>> +		steal_time = paravirt_steal_clock(smp_processor_id());
>>> +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
>>> +
>>> +		if (unlikely(steal_delta < 0)) {
>>> +			this_rq()->prev_steal_time = steal_time;
> 
> I don't think setting prev_steal_time to smaller value is right
> thing to do.

If we do not set prev_steal_time to the smaller steal value (obtained from
paravirt_steal_clock()), the kernel has to wait for the new steal clock to
catch up with this_rq()->prev_steal_time, and cpustat[CPUTIME_STEAL] will stay
unchanged until steal exceeds this_rq()->prev_steal_time again. Do you think
that is fine?
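To make the trade-off concrete, here is a minimal userspace sketch of the two
policies being discussed. It is not kernel code: struct rq_sim,
account_resync() and account_hold() are hypothetical names used only for
illustration, and the clock values are passed in by hand instead of being read
from paravirt_steal_clock().

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue state; not a kernel struct. */
struct rq_sim {
	uint64_t prev_steal_time;	/* last clock value accounted up to */
	uint64_t accounted;		/* what cpustat[CPUTIME_STEAL] would show */
};

/* Policy from the patch: on a negative delta, resync prev to the clock. */
static uint64_t account_resync(struct rq_sim *rq, uint64_t clock,
			       uint64_t maxtime)
{
	int64_t delta = (int64_t)(clock - rq->prev_steal_time);
	uint64_t steal;

	if (delta < 0) {
		rq->prev_steal_time = clock;
		return 0;
	}
	steal = (uint64_t)delta < maxtime ? (uint64_t)delta : maxtime;
	rq->accounted += steal;
	rq->prev_steal_time += steal;
	return steal;
}

/* Suggested alternative: account nothing until the clock passes prev. */
static uint64_t account_hold(struct rq_sim *rq, uint64_t clock,
			     uint64_t maxtime)
{
	uint64_t steal;

	if (clock <= rq->prev_steal_time)
		return 0;
	steal = clock - rq->prev_steal_time;
	if (steal > maxtime)
		steal = maxtime;
	rq->accounted += steal;
	rq->prev_steal_time += steal;
	return steal;
}
```

Starting both at 335 and feeding clock readings 312, 320, 340, the resync
policy ends with 363 accounted (it counts the 28 units stolen on the new host),
while the hold policy ends with 340 (it counts only what exceeds the old 335).
Neither value ever goes backwards.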

If it is fine, I will try to limit the fix to the xen-specific code in
drivers/xen/time.c so that we do not taint kernel/sched/cputime.c, as Peter
has asked why we do not just fix up paravirt_steal_clock() on migration.

Thank you very much!

Dongli Zhang

> 
> Beside, I don't think we need to check for overflow condition for
> cputime variables (it will happen after 279 years :-). So instead
> of introducing signed steal_delta variable I would just add
> below check, which should be sufficient to fix the problem:
> 
>   if (unlikely(steal <= this_rq()->prev_steal_time))
>   return 0;
> 
> Thanks
> Stanislaw
> 



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Rik van Riel
On Tue, 2017-10-10 at 14:48 +0200, Peter Zijlstra wrote:
> On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
> > > > +	u64 steal, steal_time;
> > > > +	s64 steal_delta;
> > > > +
> > > > +	steal_time = paravirt_steal_clock(smp_processor_id());
> > > > +	steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> > > > +
> > > > +	if (unlikely(steal_delta < 0)) {
> > > > +		this_rq()->prev_steal_time = steal_time;
> > 
> > I don't think setting prev_steal_time to smaller value is right
> > thing to do. 
> > 
> > Beside, I don't think we need to check for overflow condition for
> > cputime variables (it will happen after 279 years :-). So instead
> > of introducing signed steal_delta variable I would just add
> > below check, which should be sufficient to fix the problem:
> > 
> > if (unlikely(steal <= this_rq()->prev_steal_time))
> > return 0;
> 
> How about you just fix up paravirt_steal_time() on migration and not
> muck with the users ?

Not just migration, either. CPU hotplug is another time to fix up
the steal time.



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Peter Zijlstra
On Tue, Oct 10, 2017 at 02:42:01PM +0200, Stanislaw Gruszka wrote:
> > > +	u64 steal, steal_time;
> > > +	s64 steal_delta;
> > > +
> > > +	steal_time = paravirt_steal_clock(smp_processor_id());
> > > +	steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> > > +
> > > +	if (unlikely(steal_delta < 0)) {
> > > +		this_rq()->prev_steal_time = steal_time;
> 
> I don't think setting prev_steal_time to smaller value is right
> thing to do. 
> 
> Beside, I don't think we need to check for overflow condition for
> cputime variables (it will happen after 279 years :-). So instead
> of introducing signed steal_delta variable I would just add
> below check, which should be sufficient to fix the problem:
> 
>   if (unlikely(steal <= this_rq()->prev_steal_time))
>   return 0;

How about you just fix up paravirt_steal_clock() on migration and not
muck with the users?
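One way to picture such a fix-up, as a rough userspace sketch only: keep an
offset next to the raw hypervisor reading and recompute it whenever the raw
value is known to have been reset, so that callers never see the clock go
backwards. struct steal_clock_sim, steal_clock_read() and steal_clock_fixup()
are hypothetical names, not existing kernel or Xen interfaces, and the raw
readings are passed in by hand.

```c
#include <stdint.h>

/* Hypothetical per-vCPU state; not an existing kernel structure. */
struct steal_clock_sim {
	uint64_t offset;	/* added to raw readings to keep them continuous */
	uint64_t last;		/* last value handed to callers */
};

/* Return the adjusted clock: raw hypervisor value plus accumulated offset. */
static uint64_t steal_clock_read(struct steal_clock_sim *sc, uint64_t raw)
{
	sc->last = raw + sc->offset;
	return sc->last;
}

/*
 * Call once after a discontinuity (migration, or a vCPU coming back online),
 * passing the first raw reading taken afterwards. The offset is chosen so
 * that a read of raw_after returns the last value seen before the jump.
 */
static void steal_clock_fixup(struct steal_clock_sim *sc, uint64_t raw_after)
{
	sc->offset = sc->last - raw_after;	/* unsigned wrap is intended */
}
```

With a last reading of 335 and a first post-migration raw value of 312, the
offset becomes 23 and later raw readings of 312 and 320 are reported as 335 and
343. The same arithmetic handles a raw value that jumped forward, because the
unsigned subtraction wraps and the addition wraps back.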



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Stanislaw Gruszka
On Tue, Oct 10, 2017 at 12:59:26PM +0200, Ingo Molnar wrote:
> 
> (Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted 
> below.)
> 
> * Dongli Zhang  wrote:
> 
> > After guest live migration on xen, steal time in /proc/stat
> > (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> > paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
> > 
> > For instance, steal time of each vcpu is 335 before live migration.
> > 
> > cpu  198 0 368 200064 1962 0 0 1340 0 0
> > cpu0 38 0 81 50063 492 0 0 335 0 0
> > cpu1 65 0 97 49763 634 0 0 335 0 0
> > cpu2 38 0 81 50098 462 0 0 335 0 0
> > cpu3 56 0 107 50138 374 0 0 335 0 0
> > 
> > After live migration, steal time is reduced to 312.
> > 
> > cpu  200 0 370 200330 1971 0 0 1248 0 0
> > cpu0 38 0 82 50123 500 0 0 312 0 0
> > cpu1 65 0 97 49832 634 0 0 312 0 0
> > cpu2 39 0 82 50167 462 0 0 312 0 0
> > cpu3 56 0 107 50207 374 0 0 312 0 0
> > 
> > The code in this patch is borrowed from do_stolen_accounting() which has
> > already been removed from linux source code since commit ecb23dc6 ("xen:
> > add steal_clock support on x86").
> > 
> > Similar and more severe issue would impact prior linux 4.8-4.10 as
> > discussed by Michael Las at
> > https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
> > Unlike the issue discussed by Michael Las which would overflow steal time
> > and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
> > linux 4.11+ would only decrease but not overflow steal time after live
> > migration.
> > 
> > References: 
> > https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> > Signed-off-by: Dongli Zhang 
> > ---
> >  kernel/sched/cputime.c | 13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 14d2dbf..57d09cab 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
> >  {
> >  #ifdef CONFIG_PARAVIRT
> >  	if (static_key_false(&paravirt_steal_enabled)) {
> > -		u64 steal;
> > +		u64 steal, steal_time;
> > +		s64 steal_delta;
> > +
> > +		steal_time = paravirt_steal_clock(smp_processor_id());
> > +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> > +
> > +		if (unlikely(steal_delta < 0)) {
> > +			this_rq()->prev_steal_time = steal_time;

I don't think setting prev_steal_time to a smaller value is the right
thing to do.

Besides, I don't think we need to check for an overflow condition for
cputime variables (it will happen after 279 years :-). So instead
of introducing a signed steal_delta variable I would just add the
check below, which should be sufficient to fix the problem:

	if (unlikely(steal <= this_rq()->prev_steal_time))
		return 0;

Thanks
Stanislaw



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Peter Zijlstra
On Tue, Oct 10, 2017 at 05:14:08PM +0800, Dongli Zhang wrote:
> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.

So why not fix paravirt_steal_clock() to not be broken?



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Ingo Molnar

(Cc:-ed more gents involved in kernel/sched/cputime.c work. Full patch quoted 
below.)

* Dongli Zhang  wrote:

> After guest live migration on xen, steal time in /proc/stat
> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
> paravirt_steal_clock() might be less than this_rq()->prev_steal_time.
> 
> For instance, steal time of each vcpu is 335 before live migration.
> 
> cpu  198 0 368 200064 1962 0 0 1340 0 0
> cpu0 38 0 81 50063 492 0 0 335 0 0
> cpu1 65 0 97 49763 634 0 0 335 0 0
> cpu2 38 0 81 50098 462 0 0 335 0 0
> cpu3 56 0 107 50138 374 0 0 335 0 0
> 
> After live migration, steal time is reduced to 312.
> 
> cpu  200 0 370 200330 1971 0 0 1248 0 0
> cpu0 38 0 82 50123 500 0 0 312 0 0
> cpu1 65 0 97 49832 634 0 0 312 0 0
> cpu2 39 0 82 50167 462 0 0 312 0 0
> cpu3 56 0 107 50207 374 0 0 312 0 0
> 
> The code in this patch is borrowed from do_stolen_accounting() which has
> already been removed from linux source code since commit ecb23dc6 ("xen:
> add steal_clock support on x86").
> 
> Similar and more severe issue would impact prior linux 4.8-4.10 as
> discussed by Michael Las at
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
> Unlike the issue discussed by Michael Las which would overflow steal time
> and lead to 100% st usage in top command for linux 4.8-4.10, the issue for
> linux 4.11+ would only decrease but not overflow steal time after live
> migration.
> 
> References: 
> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
> Signed-off-by: Dongli Zhang 
> ---
>  kernel/sched/cputime.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 14d2dbf..57d09cab 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> -		u64 steal;
> +		u64 steal, steal_time;
> +		s64 steal_delta;
> +
> +		steal_time = paravirt_steal_clock(smp_processor_id());
> +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> +
> +		if (unlikely(steal_delta < 0)) {
> +			this_rq()->prev_steal_time = steal_time;
> +			return 0;
> +		}
>  
> -		steal = paravirt_steal_clock(smp_processor_id());
> -		steal -= this_rq()->prev_steal_time;
>  		steal = min(steal, maxtime);
>  		account_steal_time(steal);
>  		this_rq()->prev_steal_time += steal;
> -- 
> 2.7.4
> 



Re: [Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Jan Beulich
>>> On 10.10.17 at 11:14,  wrote:
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
>  {
>  #ifdef CONFIG_PARAVIRT
>  	if (static_key_false(&paravirt_steal_enabled)) {
> -		u64 steal;
> +		u64 steal, steal_time;
> +		s64 steal_delta;
> +
> +		steal_time = paravirt_steal_clock(smp_processor_id());
> +		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
> +
> +		if (unlikely(steal_delta < 0)) {
> +			this_rq()->prev_steal_time = steal_time;
> +			return 0;
> +		}
>  
> -		steal = paravirt_steal_clock(smp_processor_id());
> -		steal -= this_rq()->prev_steal_time;
>  		steal = min(steal, maxtime);
>  		account_steal_time(steal);
>  		this_rq()->prev_steal_time += steal;

While I can see this making the issue less pronounced, I don't see
how it fully addresses it: Why would only a negative delta represent
a discontinuity? In our old XenoLinux derived kernel we had the
change below (unlikely to be upstreamable as is, so just to give you
an idea).
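To illustrate the concern with a negative-only check, the following userspace
sketch replays the patched logic for a number of ticks against a fixed raw
clock value. simulate_ticks() is a hypothetical helper written for this
example; it is not part of the patch or of the kernel.

```c
#include <stdint.h>

/*
 * Replay `ticks` accounting ticks of the patched logic against a constant
 * raw clock value and return the total steal time that gets accounted.
 */
static uint64_t simulate_ticks(uint64_t prev, uint64_t clock,
			       uint64_t maxtime, unsigned int ticks)
{
	uint64_t total = 0;

	while (ticks--) {
		int64_t delta = (int64_t)(clock - prev);
		uint64_t steal;

		if (delta < 0) {	/* the only case treated as a reset */
			prev = clock;
			continue;
		}
		steal = (uint64_t)delta < maxtime ? (uint64_t)delta : maxtime;
		total += steal;
		prev += steal;
	}
	return total;
}
```

A backward jump from 335 to 312 is caught and accounts nothing. A forward jump
from 335 to 400 that is just as much a discontinuity passes the check and is
charged as 65 units of steal, spread over several ticks because of the maxtime
cap.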

Jan

--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -112,6 +112,47 @@ static inline void task_group_account_fi
cpuacct_account_field(p, index, tmp);
 }
 
+#if !defined(CONFIG_XEN) || defined(CONFIG_VIRT_CPU_ACCOUNTING)
+# define _cputime_adjust(t) (t)
+#else
+# include 
+# define NS_PER_TICK (1000000000 / HZ)
+
+static DEFINE_PER_CPU(u64, steal_snapshot);
+static DEFINE_PER_CPU(unsigned int, steal_residual);
+
+static u64 _cputime_adjust(u64 t)
+{
+	u64 s = this_vcpu_read(runstate.time[RUNSTATE_runnable]);
+	unsigned long adj = div_u64_rem(s - __this_cpu_read(steal_snapshot)
+					  + __this_cpu_read(steal_residual),
+					NS_PER_TICK,
+					this_cpu_ptr(&steal_residual));
+
+	__this_cpu_write(steal_snapshot, s);
+	if (t < jiffies_to_nsecs(adj))
+		return 0;
+
+	return t - jiffies_to_nsecs(adj);
+}
+
+static void steal_resume(void)
+{
+	_cputime_adjust((1ULL << 63) - 1);
+}
+
+static struct syscore_ops steal_syscore_ops = {
+	.resume = steal_resume,
+};
+
+static int __init steal_register(void)
+{
+	register_syscore_ops(&steal_syscore_ops);
+	return 0;
+}
+core_initcall(steal_register);
+#endif
+
 /*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -128,7 +169,7 @@ void account_user_time(struct task_struc
 	index = (task_nice(p) > 0) ? CPUTIME_NICE : CPUTIME_USER;
 
 	/* Add user time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, index, _cputime_adjust(cputime));
 
 	/* Account for user time used */
 	acct_account_cputime(p);
@@ -172,7 +213,7 @@ void account_system_index_time(struct ta
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	task_group_account_field(p, index, cputime);
+	task_group_account_field(p, index, _cputime_adjust(cputime));
 
 	/* Account for system time used */
 	acct_account_cputime(p);
@@ -224,9 +265,9 @@ void account_idle_time(u64 cputime)
 	struct rq *rq = this_rq();
 
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat[CPUTIME_IOWAIT] += cputime;
+		cpustat[CPUTIME_IOWAIT] += _cputime_adjust(cputime);
 	else
-		cpustat[CPUTIME_IDLE] += cputime;
+		cpustat[CPUTIME_IDLE] += _cputime_adjust(cputime);
 }
 
 /*
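The snapshot-plus-residual bookkeeping in _cputime_adjust() above can be
sketched in plain userspace C as follows. This only mirrors the
div_u64_rem() step; struct steal_state_sim, stolen_ticks() and NS_PER_TICK_SIM
are hypothetical names, and HZ == 100 is an assumption made for the example.

```c
#include <stdint.h>

#define NS_PER_TICK_SIM 10000000ULL	/* assumes HZ == 100; illustrative only */

/* Hypothetical per-CPU state standing in for steal_snapshot/steal_residual. */
struct steal_state_sim {
	uint64_t snapshot;	/* last runnable-time reading, in ns */
	uint32_t residual;	/* ns left over from the previous conversion */
};

/*
 * Convert the runnable time accrued since the last snapshot into whole
 * ticks, carrying the remainder forward so no nanoseconds are dropped.
 */
static uint64_t stolen_ticks(struct steal_state_sim *st, uint64_t now_ns)
{
	uint64_t total = now_ns - st->snapshot + st->residual;

	st->snapshot = now_ns;
	st->residual = (uint32_t)(total % NS_PER_TICK_SIM);
	return total / NS_PER_TICK_SIM;
}
```

A reading of 25 ms yields two ticks with 5 ms carried over; a following reading
of 30 ms adds another 5 ms, which together with the carry makes exactly one more
tick. Because the snapshot is rewritten on every call, a call made purely for
its side effect (as steal_resume() does) discards whatever jump happened in
between.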







[Xen-devel] [PATCH 1/1] sched/cputime: do not decrease steal time after live migration on xen

2017-10-10 Thread Dongli Zhang
After guest live migration on xen, steal time in /proc/stat
(cpustat[CPUTIME_STEAL]) might decrease because steal returned by
paravirt_steal_clock() might be less than this_rq()->prev_steal_time.

For instance, steal time of each vcpu is 335 before live migration.

cpu  198 0 368 200064 1962 0 0 1340 0 0
cpu0 38 0 81 50063 492 0 0 335 0 0
cpu1 65 0 97 49763 634 0 0 335 0 0
cpu2 38 0 81 50098 462 0 0 335 0 0
cpu3 56 0 107 50138 374 0 0 335 0 0

After live migration, steal time is reduced to 312.

cpu  200 0 370 200330 1971 0 0 1248 0 0
cpu0 38 0 82 50123 500 0 0 312 0 0
cpu1 65 0 97 49832 634 0 0 312 0 0
cpu2 39 0 82 50167 462 0 0 312 0 0
cpu3 56 0 107 50207 374 0 0 312 0 0
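For anyone who wants to check for this from userspace, here is a small sketch
that pulls the steal column (the eighth numeric field) out of a per-cpu line of
/proc/stat and compares two samples. parse_steal() and steal_went_backwards()
are hypothetical helper names, not existing tools.

```c
#include <stdio.h>

/*
 * Extract the steal column (8th numeric field) from one "cpuN ..." line of
 * /proc/stat. Returns 0 on success, -1 if the line does not parse.
 */
static int parse_steal(const char *line, unsigned long long *steal)
{
	char name[16];
	unsigned long long user, nice, sys, idle, iowait, irq, softirq;

	if (sscanf(line, "%15s %llu %llu %llu %llu %llu %llu %llu %llu",
		   name, &user, &nice, &sys, &idle, &iowait, &irq, &softirq,
		   steal) != 9)
		return -1;
	return 0;
}

/* Return 1 if the steal counter decreased between two samples, else 0. */
static int steal_went_backwards(const char *before, const char *after)
{
	unsigned long long a, b;

	if (parse_steal(before, &a) || parse_steal(after, &b))
		return 0;
	return b < a;
}
```

Fed the cpu0 lines from the two samples above, parse_steal() returns 335 and
312, and steal_went_backwards() reports the decrease.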

The code in this patch is borrowed from do_stolen_accounting(), which was
removed from the linux source code in commit ecb23dc6 ("xen: add steal_clock
support on x86").

A similar but more severe issue affects linux 4.8-4.10, as discussed by
Michael Las at
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest.
On linux 4.8-4.10 the steal time overflows and the top command reports 100%
st usage. On linux 4.11+ the steal time only decreases after live migration
and does not overflow.

References: 
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
Signed-off-by: Dongli Zhang 
---
 kernel/sched/cputime.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 14d2dbf..57d09cab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -238,10 +238,17 @@ static __always_inline u64 steal_account_process_time(u64 maxtime)
 {
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
-		u64 steal;
+		u64 steal, steal_time;
+		s64 steal_delta;
+
+		steal_time = paravirt_steal_clock(smp_processor_id());
+		steal = steal_delta = steal_time - this_rq()->prev_steal_time;
+
+		if (unlikely(steal_delta < 0)) {
+			this_rq()->prev_steal_time = steal_time;
+			return 0;
+		}
 
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
 		steal = min(steal, maxtime);
 		account_steal_time(steal);
 		this_rq()->prev_steal_time += steal;
-- 
2.7.4

