time: do not decrease steal time after live migration on xen

Dongli Zhang Mon, 30 Oct 2017 20:14:42 -0700

Hi Boris,

On 10/31/2017 08:58 AM, Boris Ostrovsky wrote:
> 
> 
> On 10/30/2017 08:14 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
>>> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>>>> After guest live migration on xen, steal time in /proc/stat
>>>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>>>> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
>>>> derived from previous return value of xen_steal_clock().
>>>>
>>>> For instance, steal time of each vcpu is 335 before live migration.
>>>>
>>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>>> cpu0 38 0 81 50063 492 0 0 335 0 0
>>>> cpu1 65 0 97 49763 634 0 0 335 0 0
>>>> cpu2 38 0 81 50098 462 0 0 335 0 0
>>>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>>>
>>>> After live migration, steal time is reduced to 312.
>>>>
>>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>>> cpu0 38 0 82 50123 500 0 0 312 0 0
>>>> cpu1 65 0 97 49832 634 0 0 312 0 0
>>>> cpu2 39 0 82 50167 462 0 0 312 0 0
>>>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>>>
>>>> Since runstate times are cumulative and cleared during xen live migration
>>>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>>>> to global percpu variables before live migration suspend. Once guest VM is
>>>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>>>> runstate times and previously accumulated times stored in global percpu
>>>> variables.
>>>>
>>>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>>>> discussed by Michael Las at
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>>>>
>>>> which would overflow steal time and lead to 100% st usage in top command
>>>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>>>
>>>> References:
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zh...@oracle.com>
>>>>
>>>> ---
>>>> Changed since v1:
>>>>    * relocate modification to xen_get_runstate_snapshot_cpu
>>>>
>>>> Changed since v2:
>>>>    * accumulate runstate times before live migration
>>>>
>>>> Changed since v3:
>>>>    * do not accumulate times in the case of guest checkpointing
>>>>
>>>> Changed since v4:
>>>>    * allocate array of vcpu_runstate_info to reduce number of memory 
>>>> allocation
>>>>
>>>> ---
>>>>   drivers/xen/manage.c         |  2 ++
>>>>   drivers/xen/time.c           | 68
>>>> ++++++++++++++++++++++++++++++++++++++++++--
>>>>   include/xen/interface/vcpu.h |  2 ++
>>>>   include/xen/xen-ops.h        |  1 +
>>>>   4 files changed, 71 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>>>> index c425d03..3dc085d 100644
>>>> --- a/drivers/xen/manage.c
>>>> +++ b/drivers/xen/manage.c
>>>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>>>       }
>>>>         gnttab_suspend();
>>>> +    xen_accumulate_runstate_time(-1);
>>>>       xen_arch_pre_suspend();
>>>>         /*
>>>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>>>>                                              : 0);
>>>>         xen_arch_post_suspend(si->cancelled);
>>>> +    xen_accumulate_runstate_time(si->cancelled);
>>>
>>> I am not convinced that the comment above HYPERVISOR_suspend() is
>>> correct. The call can return an error code and so if it returns -EPERM
>>> (which AFAICS it can't now but might in the future) then
>>> xen_accumulate_runstate_time() will do wrong thing.
>>
>> I would split xen_accumulate_runstate_time() into two functions to avoid the
>> -EPERM issue, as one is for saving and another is for accumulation, 
>> respectively.
>>
>> Otherwise, can you use xen_accumulate_runstate_time(2) for saving before 
>> suspend
>> and xen_accumulate_runstate_time(si->cancelled) after resume?
> 
> 
> I'd probably just say something like
> 
> si->cancelled = HYPERVISOR_suspend() ? 1 : 0;


As the call of HYPERVISOR_suspend() takes 3 lines, I would make it as below and
I think gcc would optimize it.

-   /*
-    * This hypercall returns 1 if suspend was cancelled
-    * or the domain was merely checkpointed, and 0 if it
-    * is resuming in a new domain.
-    */
    si->cancelled = HYPERVISOR_suspend(xen_pv_domain()
                                            ? virt_to_gfn(xen_start_info)
                                            : 0);
+   si->cancelled = si->cancelled ? 1 : 0;

> 
> and keep xen_accumulate_runstate_time() as is (maybe rename it to
> xen_manage_runstate_time()). And also remove the comment above the hypercall 
> as
> it is incorrect (but please mention the reason in the commit message)
> 
>>
>>>
>>>
>>>>       gnttab_resume();
>>>>         if (!si->cancelled) {
>>>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>>>> index ac5f23f..cf3afb9 100644
>>>> --- a/drivers/xen/time.c
>>>> +++ b/drivers/xen/time.c
>>>> @@ -19,6 +19,9 @@
>>>>   /* runstate info updated by Xen */
>>>>   static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>>>   +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>>>> +static struct vcpu_runstate_info *runstate_delta;
>>>
>>> I'd move this inside xen_accumulate_runstate_time() since that's the
>>
>> If we split xen_accumulate_runstate_time() into two functions, we would leave
>> runstate_delta as global static.
>>
>>> only function that uses it. And why does it need to be
>>> vcpu_runstate_info and not u64[4]?
>>
>> This was suggested by Juergen to avoid the allocation and reclaim of the 
>> second
>> dimensional array as in v4 of this patch?
>>
>> Or would you like to allocate sizeof(u64[4]) * num_possible_cpus() and 
>> emulate
>> the 2d array with this 1d array and move the pointer forward sizeof(u64[4]) 
>> in
>> each iteration?
> 
> 
> I was thinking of
> 
> u64 **runstate_delta = (u64 **)kmalloc(sizeof(xen_runstate.time) *
> num_possible_cpus())
> 
> and then you should be able to access runstate_delta[cpu][RUNSTATE_*].

Would the above code work? You are allocating data only for 1st dimension of a
2d array. How can you access the 2nd array with runstate_delta[cpu][RUNSTATE_*]?

Is there any option in gcc that support this?

I have actually tested this with below code in linux and kernel panic with page
fault when doing memcpy.

+void xen_manage_runstate_time(int action)
+{
+   static u64 **runstate_delta;
+   struct vcpu_runstate_info state;
+   int cpu, i;
+
+   switch (action) {
+   case -1: /* backup runstate time before suspend */
+       WARN_ON_ONCE(unlikely(runstate_delta));
+
+       runstate_delta = (u64 **)kmalloc(sizeof(xen_runstate.time) *
num_possible_cpus(), GFP_ATOMIC);
+       if (unlikely(!runstate_delta)) {
+           pr_alert("%s: failed to allocate runstate_delta\n",
+               __func__);
+           return;
+       }
+
+       for_each_possible_cpu(cpu) {
+           xen_get_runstate_snapshot_cpu_delta(&state, cpu);
+           memcpy(runstate_delta[cpu], state.time,
+                   sizeof(xen_runstate.time));
+       }
+
+       break;

> 
>>
>>>
>>>> +
>>>>   /* return an consistent snapshot of 64-bit time/counter value */
>>>>   static u64 get64(const u64 *p)
>>>>   {
>>>> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>>>>       return ret;
>>>>   }
>>>>   -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info 
>>>> *res,
>>>> -                      unsigned int cpu)
>>>> +static void xen_get_runstate_snapshot_cpu_delta(
>>>> +            struct vcpu_runstate_info *res, unsigned int cpu)
>>>>   {
>>>>       u64 state_time;
>>>>       struct vcpu_runstate_info *state;
>>>> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct
>>>> vcpu_runstate_info *res,
>>>>            (state_time & XEN_RUNSTATE_UPDATE));
>>>>   }
>>>>   +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info 
>>>> *res,
>>>> +                      unsigned int cpu)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    xen_get_runstate_snapshot_cpu_delta(res, cpu);
>>>> +
>>>> +    for (i = 0; i < RUNSTATE_max; i++)
>>>> +        res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>>>> +}
>>>> +
>>>> +void xen_accumulate_runstate_time(int action)
>>>> +{
>>>> +    struct vcpu_runstate_info state;
>>>> +    int cpu, i;
>>>> +
>>>> +    switch (action) {
>>>> +    case -1: /* backup runstate time before suspend */
>>>> +        WARN_ON_ONCE(unlikely(runstate_delta));
>>>
>>> pr_warn_once(), to be consistent with the rest of the file. And then
>>> should you return if this is true?
>>
>> I would prefer to not return if it is true but just warn the administrator 
>> that
>> there is memory leakage issue while leaving runstate accumulation works 
>> normally.
>>
>>>
>>>> +
>>>> +        runstate_delta = kcalloc(num_possible_cpus(),
>>>> +                     sizeof(*runstate_delta),
>>>> +                     GFP_KERNEL);
>>>> +        if (unlikely(!runstate_delta)) {
>>>> +            pr_alert("%s: failed to allocate runstate_delta\n",
>>>> +                    __func__);
>>>
>>> pr_warn() should be sufficient. Below too.
>>>
>>> Also, as a side question --- can we do kmalloc() at this point?
>>
>> Yes. kmalloc_array() is better than kcalloc, unless we have 2 dimensional 
>> array
>> and we need to guarantee the value of first dimension is always 0.
> 
> 
> That's not what was thinking about. GFP_KERNEL may sleep and I don't know how
> sleep is handled at this point. Everything is pretty much dead now. Perhaps
> GFP_ATOMIC might be a better choice.
> 
> -boris
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel

Thank you very much for your suggestion!

Dongli Zhang

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen

Reply via email to