Re: [Xen-devel] [osstest test] 110909: tolerable FAIL - PUSHED

2017-06-23 Thread Ian Jackson
Andrew Cooper writes ("Re: [Xen-devel] [osstest test] 110909: tolerable FAIL - PUSHED"):
> XEN_DOMINF_paused is a straight reflection of
> d->controller_pause_count.  A domain is created with 1 reference count,
> requiring the toolstack to call DOMCTL_unpause_domain once to cause it
> to start executing.

Thanks for the info.

> In this case, what condition causes the failure?  Is it simply seeing
> the domain as paused

Yes.

> (in which case, there will definitely be a
> low-probability false negative rate if anything else in dom0 uses domain
> pause), or is it some other failure which prompts for the paused state
> check?

But I don't think there should be anything in dom0 which uses domain
pause.  (I'm assuming that qemu doesn't... Hrm.)

Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [osstest test] 110909: tolerable FAIL - PUSHED

2017-06-21 Thread Andrew Cooper
On 21/06/2017 23:59, Ian Jackson wrote:
> osstest service owner writes ("[osstest test] 110909: tolerable FAIL - PUSHED"):
>> flight 110909 osstest real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/110909/
>>
>> Failures :-/ but no regressions.
> ...
>> Tests which did not succeed, but are not blocking:
> ...
>>  test-amd64-i386-xl-qemuu-win7-amd64 15 guest-localmigrate/x10 fail like 110373
> This guest had ~31G of disk and 1.5G of RAM.
>
> The logfile
>
>   
> http://logs.test-lab.xenproject.org/osstest/logs/110909/test-amd64-i386-xl-qemuu-win7-amd64/15.ts-guest-localmigrate.log
>
> seems to show that the guest is paused (state "p") following the 9th
> migration.  This is weird, given that xl seems to say earlier
> "migration target: Domain started successsfully", which message
> follows the call to libxl_domain_unpause.
>
> I wonder if it is possible that the domain still appears paused
> briefly after xl/libxl tries to unpause it.  That is, that
> XEN_DOMINF_paused might be set in the return from
> xc_domain_getinfolist even after the unpause domctl returns.
>
> By the time log collection runs, the domain seems unpaused.

XEN_DOMINF_paused is a straight reflection of
d->controller_pause_count.  A domain is created with 1 reference count,
requiring the toolstack to call DOMCTL_unpause_domain once to cause it
to start executing.

Other than that, it is strictly reference counted based on pause and
unpause hypercalls from toolstack components (in this case, all in dom0).

One issue which XenServer has found in combination with Introspection is
that any toolstack entity which can call pause/unpause (even for a short
period of time) can result in XEN_DOMINF_paused being sampled as being set.
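
As a toy illustration of the counting described above (this is a sketch
in Python, not Xen code; the class and method names are made up for the
example), a non-zero count is what an observer sees as "paused", so a
short-lived pause by any other component is visible to whoever samples
at that moment:

```python
class Domain:
    """Toy model of a domain's controller pause count (not Xen code)."""

    def __init__(self):
        # Created paused: one reference is held until the toolstack unpauses.
        self.controller_pause_count = 1

    def pause(self):
        self.controller_pause_count += 1

    def unpause(self):
        if self.controller_pause_count == 0:
            raise RuntimeError("unpause without matching pause")
        self.controller_pause_count -= 1

    @property
    def paused(self):
        # Stands in for XEN_DOMINF_paused: set whenever the count is non-zero.
        return self.controller_pause_count > 0


dom = Domain()
samples = [dom.paused]      # freshly created, not yet started
dom.unpause()               # toolstack starts the domain
samples.append(dom.paused)
dom.pause()                 # some other component pauses briefly
samples.append(dom.paused)  # an observer sampling here sees "paused"
dom.unpause()
samples.append(dom.paused)
print(samples)              # [True, False, True, False]
```

The third sample is the problematic one: the domain was started
correctly, yet it reads as paused because of the unrelated pause.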

The fix ^W utterly gross hack for XenServer's purposes is
https://github.com/xenserver/xen-4.7.pg/blob/master/master/xen-introspection-pause.patch
but I don't yet have a sensible plan for how to fix this in general. 
One option would be to introduce hypercall pairs per toolstack
component, but that doesn't scale sensibly.

In this case, what condition causes the failure?  Is it simply seeing
the domain as paused (in which case, there will definitely be a
low-probability false negative rate if anything else in dom0 uses domain
pause), or is it some other failure which prompts for the paused state
check?

~Andrew



Re: [Xen-devel] [osstest test] 110909: tolerable FAIL - PUSHED

2017-06-21 Thread Ian Jackson
osstest service owner writes ("[osstest test] 110909: tolerable FAIL - PUSHED"):
> flight 110909 osstest real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/110909/
> 
> Failures :-/ but no regressions.
...
> Tests which did not succeed, but are not blocking:
...
>  test-amd64-i386-xl-qemuu-win7-amd64 15 guest-localmigrate/x10 fail like 110373

This guest had ~31G of disk and 1.5G of RAM.

The logfile

  
http://logs.test-lab.xenproject.org/osstest/logs/110909/test-amd64-i386-xl-qemuu-win7-amd64/15.ts-guest-localmigrate.log

seems to show that the guest is paused (state "p") following the 9th
migration.  This is weird, given that xl seems to say earlier
"migration target: Domain started successsfully", which message
follows the call to libxl_domain_unpause.

I wonder if it is possible that the domain still appears paused
briefly after xl/libxl tries to unpause it.  That is, that
XEN_DOMINF_paused might be set in the return from
xc_domain_getinfolist even after the unpause domctl returns.

By the time log collection runs, the domain seems unpaused.
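
If the paused state really can be observed transiently, one way a
checker could tolerate it is to report "paused" only when several
consecutive samples agree.  The following is a hypothetical sketch in
Python, not what osstest does; `sample_paused` is an assumed callable
standing in for however the state is actually read (for example by
parsing the state column of `xl list`), and the attempt count and delay
are arbitrary:

```python
import time


def settled_paused(sample_paused, attempts=5, delay=0.2, sleep=time.sleep):
    """Return True only if every one of `attempts` samples reads as paused.

    `sample_paused` is a hypothetical zero-argument callable returning a
    bool.  A single "not paused" reading ends the check early, so a
    transient pause does not count as a persistent one.
    """
    for i in range(attempts):
        if not sample_paused():
            return False
        if i < attempts - 1:
            sleep(delay)
    return True


# Usage with canned readings instead of a real domain:
readings = iter([True, True, False])
print(settled_paused(lambda: next(readings), sleep=lambda _: None))
```

This only narrows the window; a domain that stays paused for longer
than attempts * delay would still be reported as paused, which is
presumably the case the test wants to catch.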

Ian.
