Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL [and 1 more messages]

2017-12-05 Thread Andrew Cooper
On 05/12/17 15:31, Jan Beulich wrote:
>>>> On 05.12.17 at 16:05, <ian.jack...@eu.citrix.com> wrote:
>> Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL"):
>>> This is a blue screen, recurring, and has first been reported in flight
>>> 116779, i.e. was likely introduced in the batch ending in commit
>>> 4cd0fad645. Among those the most likely candidates appear to be
>>> the SVM changes (the failures are all on AMD hardware). The logs
>>> there also have huge amounts of "Unexpected nested vmexit",
>>> albeit not directly connected with the failed test afaict.
>> Ian Jackson writes ("Re: [xen-unstable test] 116832: regressions - FAIL"):
>>> This is the expected Windows failure.  Force pushed.
>> Oops.  Sorry about that.
>>
>> I think this goes to show that (i) leaving known failures languishing
>> for months and expecting them to be force pushed results in human
>> error (ii) I should read the whole email thread first.
> Oh, that's pretty unfortunate. I think we'll then need a custom flight
> tied to the box that this failure occurred on, to have a way to tell
> whether the fix I'm about to prepare has actually helped, all the more
> so since the same issue is presumably also present on the 4.10 branch.
> The thing is that newer AMD hardware (with decode assist) doesn't
> appear to demonstrate the misbehavior, and for some reason it also
> doesn't show on Intel systems.
>
> I've spent quite a bit of time trying to repro this on my old AMD box, but the
> distro on there is just too old to be able to start a suitable Windows
> guest (part(?) of the reason being that scripts in /etc/xen/scripts
> appear to get invoked alongside the ones from the separate unstable
> install tree, and at some point I then decided to give up trying to hack
> things up so they would work together again).

If you've got a provisional patch, I can get some testing organised on
newer and older hardware.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL [and 1 more messages]

2017-12-05 Thread Ian Jackson
Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL [and 1 more messages]"):
> Oh, that's pretty unfortunate. I think we'll then need a custom flight
> tied to the box that this failure occurred on, to have a way to tell
> whether the fix I'm about to prepare has actually helped,

Even though it is no longer regarded as a regression by osstest, the
job will still be host-sticky.  So if you commit a patch to staging,
you should be able to see whether it has helped.

Ian.

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL [and 1 more messages]

2017-12-05 Thread Jan Beulich
>>> On 05.12.17 at 16:05, <ian.jack...@eu.citrix.com> wrote:
> Jan Beulich writes ("Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL"):
>> This is a blue screen, recurring, and has first been reported in flight
>> 116779, i.e. was likely introduced in the batch ending in commit
>> 4cd0fad645. Among those the most likely candidates appear to be
>> the SVM changes (the failures are all on AMD hardware). The logs
>> there also have huge amounts of "Unexpected nested vmexit",
>> albeit not directly connected with the failed test afaict.
> 
> Ian Jackson writes ("Re: [xen-unstable test] 116832: regressions - FAIL"):
>> This is the expected Windows failure.  Force pushed.
> 
> Oops.  Sorry about that.
> 
> I think this goes to show that (i) leaving known failures languishing
> for months and expecting them to be force pushed results in human
> error (ii) I should read the whole email thread first.

Oh, that's pretty unfortunate. I think we'll then need a custom flight
tied to the box that this failure occurred on, to have a way to tell
whether the fix I'm about to prepare has actually helped, all the more
so since the same issue is presumably also present on the 4.10 branch.
The thing is that newer AMD hardware (with decode assist) doesn't
appear to demonstrate the misbehavior, and for some reason it also
doesn't show on Intel systems.

I've spent quite a bit of time trying to repro this on my old AMD box, but the
distro on there is just too old to be able to start a suitable Windows
guest (part(?) of the reason being that scripts in /etc/xen/scripts
appear to get invoked alongside the ones from the separate unstable
install tree, and at some point I then decided to give up trying to hack
things up so they would work together again).

Jan


Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Ian Jackson
osstest service owner writes ("[xen-unstable test] 116832: regressions - FAIL"):
> flight 116832 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/116832/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
> version targeted for testing:
>  xen  553ac37137c2d1c03bf1b69cfb192ffbfe29daa4

This is the expected Windows failure.  Force pushed.

Ian.

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Andrew Cooper
On 05/12/17 11:16, Jan Beulich wrote:
>>>> On 05.12.17 at 11:03,  wrote:
>> On 05/12/2017 09:30, Jan Beulich wrote:
>>>>>> On 05.12.17 at 09:49,  wrote:
>>>> flight 116832 xen-unstable real [real]
>>>> http://logs.test-lab.xenproject.org/osstest/logs/116832/ 
>>>>
>>>> Regressions :-(
>>>>
>>>> Tests which did not succeed and are blocking,
>>>> including tests which could not be run:
>>>>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
>>> This is a blue screen, recurring, and has first been reported in flight
>>> 116779, i.e. was likely introduced in the batch ending in commit
>>> 4cd0fad645. Among those the most likely candidates appear to be
>>> the SVM changes (the failures are all on AMD hardware). The logs
>>> there also have huge amounts of "Unexpected nested vmexit",
>>> albeit not directly connected with the failed test afaict.
>> The unexpected nested vmexit is from a previous test (hopefully the
>> nested virt test, as that path shouldn't be reachable otherwise).
>>
>> The windows boot which actually failed has:
>>
>> Dec  5 04:20:08.735216 (XEN) CR access emulation failed (1): d1v0 64bit @ 
>> 0010:f8000ce9e4ab -> 66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc
> How did I not spot this? This is a REP INSW, which then makes it
> far more likely to be a result of Paul's 9c9384d6d8. Sadly the
> windows-install tests of the two 4.10 flights we've had so far all
> ran on Intel hardware, so we can't easily tell whether the
> problem is present there as well (in which case it would for sure
> be that commit).
>
> For the moment I'm struggling to understand how we can end up
> on the CR access emulation path here, but I'll take a closer look.

I am equally perplexed.  The SVM code definitely used to (ab)use the
MMIO path, so I expect there are still some remnants left.

The CR access in the message proves that we started this emulation from
a CR vmexit.  My best guess is that _hvm_emulate_one() reused the
instruction cache rather than starting fresh.  The unhandleable result is
probably from a ->validate() failure, and the bytes are probably stale
from the previous emulation.

~Andrew

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Jan Beulich
>>> On 05.12.17 at 11:03,  wrote:
> On 05/12/2017 09:30, Jan Beulich wrote:
>>>>> On 05.12.17 at 09:49,  wrote:
>>> flight 116832 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/116832/ 
>>>
>>> Regressions :-(
>>>
>>> Tests which did not succeed and are blocking,
>>> including tests which could not be run:
>>>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
>> This is a blue screen, recurring, and has first been reported in flight
>> 116779, i.e. was likely introduced in the batch ending in commit
>> 4cd0fad645. Among those the most likely candidates appear to be
>> the SVM changes (the failures are all on AMD hardware). The logs
>> there also have huge amounts of "Unexpected nested vmexit",
>> albeit not directly connected with the failed test afaict.
> 
> The unexpected nested vmexit is from a previous test (hopefully the
> nested virt test, as that path shouldn't be reachable otherwise).
> 
> The windows boot which actually failed has:
> 
> Dec  5 04:20:08.735216 (XEN) CR access emulation failed (1): d1v0 64bit @ 
> 0010:f8000ce9e4ab -> 66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc

How did I not spot this? This is a REP INSW, which then makes it
far more likely to be a result of Paul's 9c9384d6d8. Sadly the
windows-install tests of the two 4.10 flights we've had so far all
ran on Intel hardware, so we can't easily tell whether the
problem is present there as well (in which case it would for sure
be that commit).
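
For readers following along, the identification of the leading bytes as REP INSW can be reproduced with a minimal Python sketch. It covers only the two legacy prefixes and the string I/O opcodes relevant here and is in no way a general x86 decoder; the function and table names are made up for illustration.

```python
# Sketch only: classify the first instruction in the byte dump from the
# "CR access emulation failed" log line. Names here are hypothetical.

LEGACY_PREFIXES = {
    0x66: "operand-size override",
    0xF3: "REP",
}

# String I/O opcodes from the one-byte opcode map.
STRING_IO = {
    0x6C: "INSB",
    0x6D: "INS",   # INSW with a 0x66 prefix, INSD without
    0x6E: "OUTSB",
    0x6F: "OUTS",  # OUTSW with a 0x66 prefix, OUTSD without
}


def decode_first_insn(hex_bytes):
    """Return (mnemonic, length) for a string I/O instruction, or
    (None, length) if the opcode is not one this sketch knows."""
    data = bytes.fromhex(hex_bytes)
    prefixes = []
    i = 0
    # Consume the legacy prefixes that precede the opcode byte.
    while i < len(data) and data[i] in LEGACY_PREFIXES:
        prefixes.append(data[i])
        i += 1
    if i == len(data):
        raise ValueError("ran out of bytes before the opcode")
    opcode = data[i]
    name = STRING_IO.get(opcode)
    if name is None:
        return None, i + 1
    if opcode in (0x6D, 0x6F):
        # The operand-size prefix selects the 16-bit form.
        name += "W" if 0x66 in prefixes else "D"
    if 0xF3 in prefixes:
        name = "REP " + name
    return name, i + 1


insn, length = decode_first_insn(
    "66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc")
print(insn, length)
```

The remaining bytes are not decoded by the sketch; read by hand, 48 8b 7c 24 08 is a MOV of 8(%rsp) into %rdi, c3 is RET, and the cc bytes are INT3 padding.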

For the moment I'm struggling to understand how we can end up
on the CR access emulation path here, but I'll take a closer look.

Jan


Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Andrew Cooper
On 05/12/2017 10:03, Andrew Cooper wrote:
> On 05/12/2017 09:30, Jan Beulich wrote:
>>>>> On 05.12.17 at 09:49,  wrote:
>>> flight 116832 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/116832/ 
>>>
>>> Regressions :-(
>>>
>>> Tests which did not succeed and are blocking,
>>> including tests which could not be run:
>>>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
>> This is a blue screen, recurring, and has first been reported in flight
>> 116779, i.e. was likely introduced in the batch ending in commit
>> 4cd0fad645. Among those the most likely candidates appear to be
>> the SVM changes (the failures are all on AMD hardware). The logs
>> there also have huge amounts of "Unexpected nested vmexit",
>> albeit not directly connected with the failed test afaict.
> The unexpected nested vmexit is from a previous test (hopefully the
> nested virt test, as that path shouldn't be reachable otherwise).
>
> The windows boot which actually failed has:
>
> Dec  5 04:20:08.735216 (XEN) CR access emulation failed (1): d1v0 64bit @ 
> 0010:f8000ce9e4ab -> 66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc
> Dec  5 04:21:49.555130 (XEN) stdvga.c:173:d1v0 entering stdvga mode
>
> which I expect is the root of the problem.

Yes - that is the cause of the problem.  The BSOD is a 0x3D (exception
not handled) referencing the same %rip as the CR emulation failure.
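
To make this kind of cross-check against a bugcheck address less error-prone, the log line can be split into fields with a short Python sketch. The regular expression is inferred solely from the one log line quoted in this thread, not from Xen's source, so treat the field layout and all names below as assumptions.

```python
import re

# Pattern inferred from the single log line in this thread (assumption).
LOG_RE = re.compile(
    r"\(XEN\) (?P<kind>.+?) emulation failed \((?P<rc>\d+)\): "
    r"d(?P<domid>\d+)v(?P<vcpu>\d+) (?P<mode>\S+) @ "
    r"(?P<cs>[0-9a-f]{4}):(?P<rip>[0-9a-f]+) -> "
    r"(?P<insn>[0-9a-f]{2}(?: [0-9a-f]{2})*)"
)


def parse_emulation_failure(line):
    """Return the fields of an 'emulation failed' line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m["kind"],
        "rc": int(m["rc"]),
        "domid": int(m["domid"]),
        "vcpu": int(m["vcpu"]),
        "mode": m["mode"],
        "cs": int(m["cs"], 16),
        "rip": int(m["rip"], 16),
        "insn": bytes.fromhex(m["insn"]),
    }


# The log line from the failed Windows boot, with the mail wrap undone.
line = ("Dec  5 04:20:08.735216 (XEN) CR access emulation failed (1): "
        "d1v0 64bit @ 0010:f8000ce9e4ab -> "
        "66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc")
info = parse_emulation_failure(line)
print(hex(info["rip"]), info["insn"][:3].hex())
```

The parsed `rip` can then be compared numerically with the address shown on the blue screen instead of matching hex strings by eye.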

~Andrew

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Andrew Cooper
On 05/12/2017 09:30, Jan Beulich wrote:
>>>> On 05.12.17 at 09:49,  wrote:
>> flight 116832 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/116832/ 
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
> This is a blue screen, recurring, and has first been reported in flight
> 116779, i.e. was likely introduced in the batch ending in commit
> 4cd0fad645. Among those the most likely candidates appear to be
> the SVM changes (the failures are all on AMD hardware). The logs
> there also have huge amounts of "Unexpected nested vmexit",
> albeit not directly connected with the failed test afaict.

The unexpected nested vmexit is from a previous test (hopefully the
nested virt test, as that path shouldn't be reachable otherwise).

The windows boot which actually failed has:

Dec  5 04:20:08.735216 (XEN) CR access emulation failed (1): d1v0 64bit @ 
0010:f8000ce9e4ab -> 66 f3 6d 48 8b 7c 24 08 c3 cc cc cc cc cc cc cc
Dec  5 04:21:49.555130 (XEN) stdvga.c:173:d1v0 entering stdvga mode

which I expect is the root of the problem.

~Andrew

Re: [Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread Jan Beulich
>>> On 05.12.17 at 09:49,  wrote:
> flight 116832 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/116832/ 
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744

This is a blue screen, recurring, and has first been reported in flight
116779, i.e. was likely introduced in the batch ending in commit
4cd0fad645. Among those the most likely candidates appear to be
the SVM changes (the failures are all on AMD hardware). The logs
there also have huge amounts of "Unexpected nested vmexit",
albeit not directly connected with the failed test afaict.

Jan


[Xen-devel] [xen-unstable test] 116832: regressions - FAIL

2017-12-05 Thread osstest service owner
flight 116832 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/116832/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744
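
Result lines like the one above have a regular shape (job, step number, step name, status, verdict), so they can be picked out of a report mechanically. The following Python sketch assumes that shape based only on the lines in this report; the regular expression and function name are illustrative and are not part of osstest.

```python
import re

# Shape assumed from the result lines in this report:
#   <job> <step number> <step name> fail <verdict>
RESULT_RE = re.compile(
    r"^\s*(?P<job>test-\S+)\s+(?P<step_no>\d+)\s+(?P<step>\S+)\s+"
    r"fail\s+(?P<verdict>.+?)\s*$"
)


def parse_result(line):
    """Return a dict for a failing-step line, or None for other lines."""
    m = RESULT_RE.match(line)
    if m is None:
        return None
    return {
        "job": m["job"],
        "step_no": int(m["step_no"]),
        "step": m["step"],
        "verdict": m["verdict"],
    }


blocking = parse_result(
    " test-amd64-amd64-xl-qemut-win7-amd64 10 windows-install  fail REGR. vs. 116744")
print(blocking["job"], blocking["verdict"])
```

Filtering on a verdict that starts with "REGR." would then single out the blocking regressions from the "like" and "never pass" entries.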

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-xsm 14 saverestore-support-check  fail like 116744
 test-amd64-amd64-xl-qemuu-win7-amd64 16 guest-localmigrate/x10  fail like 116744
 test-armhf-armhf-libvirt-raw 13 saverestore-support-check  fail like 116744
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop  fail like 116744
 test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop  fail like 116744
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop  fail like 116744
 test-armhf-armhf-libvirt 14 saverestore-support-check  fail like 116744
 test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop  fail like 116744
 test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop  fail like 116744
 test-amd64-amd64-xl-pvhv2-intel 12 guest-start  fail never pass
 test-amd64-amd64-xl-pvhv2-amd 12 guest-start  fail never pass
 test-amd64-i386-libvirt 13 migrate-support-check  fail never pass
 test-amd64-i386-libvirt-xsm 13 migrate-support-check  fail never pass
 test-amd64-amd64-libvirt 13 migrate-support-check  fail never pass
 test-amd64-amd64-libvirt-xsm 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-arndale 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-arndale 14 saverestore-support-check  fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check  fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 11 migrate-support-check  fail never pass
 test-amd64-i386-libvirt-qcow2 12 migrate-support-check  fail never pass
 test-amd64-amd64-libvirt-vhd 12 migrate-support-check  fail never pass
 test-amd64-amd64-qemuu-nested-amd 17 debian-hvm-install/l1/l2  fail never pass
 test-armhf-armhf-xl-xsm 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-xsm 14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl 14 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-xsm 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-rtds 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-rtds 14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-credit2 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-credit2 14 saverestore-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-cubietruck 14 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt-raw 12 migrate-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-multivcpu 14 saverestore-support-check  fail never pass
 test-armhf-armhf-libvirt 13 migrate-support-check  fail never pass
 test-armhf-armhf-xl-vhd 12 migrate-support-check  fail never pass
 test-armhf-armhf-xl-vhd 13 saverestore-support-check  fail never pass
 test-amd64-i386-xl-qemut-ws16-amd64 17 guest-stop  fail never pass
 test-amd64-i386-xl-qemut-win10-i386 10 windows-install  fail never pass
 test-amd64-i386-xl-qemuu-win10-i386 10 windows-install  fail never pass
 test-amd64-amd64-xl-qemuu-win10-i386 10 windows-install  fail never pass
 test-amd64-amd64-xl-qemut-win10-i386 10 windows-install  fail never pass

version targeted for testing:
 xen  553ac37137c2d1c03bf1b69cfb192ffbfe29daa4
baseline version:
 xen  6da091d95dfcbe00daf91308d044ee5151b1ac9e

Last test of basis   116744  2017-12-01 13:53:15 Z    3 days
Failing since        116779  2017-12-02 17:06:23 Z    2 days    3 attempts
Testing same since   116832  2017-12-04 13:57:29 Z    0 days    1 attempts


People who touched revisions under test:
  Andrew Cooper 
  Boris Ostrovsky 
  Brian Woods 
  David Esler 
  Euan Harris 
  Gregory Herrero 
  Haozhong Zhang 
  Ian Jackson 
  Jan Beulich 
  Kevin Tian 
  Paul Durrant 
  Roger Pau Monné 
  Sergey Dyasli 
  Stefano Stabellini 
  Zhenzhong Duan