Re: [Xen-devel] Commit moratorium to staging

2017-11-06 Thread George Dunlap
On 11/03/2017 06:35 PM, Juergen Gross wrote:
> On 03/11/17 19:29, Roger Pau Monné wrote:
>> On Fri, Nov 03, 2017 at 05:57:52PM +, George Dunlap wrote:
>>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>>>>> On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
>>>>>> Hmm. I wonder whether the guest is actually healthy after the migrate. 
>>>>>> One could imagine a situation where the storage device model (IDE in our 
>>>>>> case I guess) gets stuck in some way but recovers after a timeout in the 
>>>>>> guest storage stack. Thus, if you happen to try to shut down while it is 
>>>>>> still stuck, Windows starts trying to shut down but can't. Try after the 
>>>>>> timeout though and it can.
>>>>>> In the past we did make attempts to support Windows without PV drivers 
>>>>>> in XenServer but xenrt would never reliably pass VM lifecycle tests 
>>>>>> using emulated devices. That was with qemu trad, but I wonder whether 
>>>>>> upstream qemu is actually any better particularly if using older device 
>>>>>> models such as IDE and RTL8139 (which are probably largely unmodified 
>>>>>> from trad).
>>>>>
>>>>> Since I've been looking into this for a couple of days, and found no
>>>>> solution, I'm going to write down what I've found so far:
>>>>>
>>>>>  - The issue only affects Windows guests.
>>>>>  - It only manifests itself when doing live migration; non-live
>>>>>    migration and save/resume work fine.
>>>>>  - It affects all x86 hardware; the number of migrations needed to
>>>>>    trigger it seems to depend on the hardware, but doing 20 migrations
>>>>>    reliably triggers it on all the hardware I've tested.
>>>>
>>>> Not good.
>>>>
>>>> You said that Windows reported that the login process failed somehow?
>>>>
>>>> Is it possible something bad is happening, like sending spurious page
>>>> faults to the guest in logdirty mode?
>>>>
>>>> I wonder if we could reproduce something like it on Linux -- set a build
>>>> going and start localhost migrating; a spurious page fault is likely to
>>>> cause the build to fail.
>>>
>>> Well, with a looping xen-build going on in the guest, I've done 40 local
>>> migrates with no problems yet.
>>>
>>> But Roger -- is this on emulated devices only, no PV drivers?
>>>
>>> That might be something worth looking at.
>>
>> Yes, Windows doesn't have PV devices. But save/restore and non-live
>> migration seem fine, so it doesn't look to be related to devices, but
>> rather to log-dirty or some other aspect of live migration.
> 
> log-dirty for read-I/Os of emulated devices?

FWIW I booted a Linux guest with "xen_nopv" on the command-line, gave it
256 MiB of RAM, and then ran a Xen build on it in a loop (see command
below).

Then I started migrating it in a loop.
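
Migration loop (roughly -- "debian" stands in for the guest name):

# while xl migrate debian localhost ; do sleep 5 ; done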

After an hour or two it had done 146 local migrations, and 46 builds of
Xen (swapping onto emulated disk is pretty slow), without any issues.

Build command:

# while make -j 3 xen ; do git clean -ffdx ; done

I'm shutting down the VM and I'll leave it running overnight.

 -George



Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread Ian Jackson
George Dunlap writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Well, with a looping xen-build going on in the guest, I've done 40 local
> migrates with no problems yet.
> 
> But Roger -- is this on emulated devices only, no PV drivers?

Yes.  None of our Windows tests have PV drivers.

Ian.



Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread Juergen Gross
On 03/11/17 19:29, Roger Pau Monné wrote:
> On Fri, Nov 03, 2017 at 05:57:52PM +, George Dunlap wrote:
>> On 11/03/2017 02:52 PM, George Dunlap wrote:
>>> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>>>> On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
>>>>> Hmm. I wonder whether the guest is actually healthy after the migrate. 
>>>>> One could imagine a situation where the storage device model (IDE in our 
>>>>> case I guess) gets stuck in some way but recovers after a timeout in the 
>>>>> guest storage stack. Thus, if you happen to try to shut down while it is 
>>>>> still stuck, Windows starts trying to shut down but can't. Try after the 
>>>>> timeout though and it can.
>>>>> In the past we did make attempts to support Windows without PV drivers in 
>>>>> XenServer but xenrt would never reliably pass VM lifecycle tests using 
>>>>> emulated devices. That was with qemu trad, but I wonder whether upstream 
>>>>> qemu is actually any better particularly if using older device models 
>>>>> such as IDE and RTL8139 (which are probably largely unmodified from trad).
>>>>
>>>> Since I've been looking into this for a couple of days, and found no
>>>> solution, I'm going to write down what I've found so far:
>>>>
>>>>  - The issue only affects Windows guests.
>>>>  - It only manifests itself when doing live migration; non-live
>>>>    migration and save/resume work fine.
>>>>  - It affects all x86 hardware; the number of migrations needed to
>>>>    trigger it seems to depend on the hardware, but doing 20 migrations
>>>>    reliably triggers it on all the hardware I've tested.
>>>
>>> Not good.
>>>
>>> You said that Windows reported that the login process failed somehow?
>>>
>>> Is it possible something bad is happening, like sending spurious page
>>> faults to the guest in logdirty mode?
>>>
>>> I wonder if we could reproduce something like it on Linux -- set a build
>>> going and start localhost migrating; a spurious page fault is likely to
>>> cause the build to fail.
>>
>> Well, with a looping xen-build going on in the guest, I've done 40 local
>> migrates with no problems yet.
>>
>> But Roger -- is this on emulated devices only, no PV drivers?
>>
>> That might be something worth looking at.
> 
> Yes, Windows doesn't have PV devices. But save/restore and non-live
> migration seem fine, so it doesn't look to be related to devices, but
> rather to log-dirty or some other aspect of live migration.

log-dirty for read-I/Os of emulated devices?


Juergen



Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread Roger Pau Monné
On Fri, Nov 03, 2017 at 05:57:52PM +, George Dunlap wrote:
> On 11/03/2017 02:52 PM, George Dunlap wrote:
> > On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
> >> On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
> >>> Hmm. I wonder whether the guest is actually healthy after the migrate. 
> >>> One could imagine a situation where the storage device model (IDE in our 
> >>> case I guess) gets stuck in some way but recovers after a timeout in the 
> >>> guest storage stack. Thus, if you happen to try to shut down while it is 
> >>> still stuck, Windows starts trying to shut down but can't. Try after the 
> >>> timeout though and it can.
> >>> In the past we did make attempts to support Windows without PV drivers in 
> >>> XenServer but xenrt would never reliably pass VM lifecycle tests using 
> >>> emulated devices. That was with qemu trad, but I wonder whether upstream 
> >>> qemu is actually any better particularly if using older device models 
> >>> such as IDE and RTL8139 (which are probably largely unmodified from trad).
> >>
> >> Since I've been looking into this for a couple of days, and found no
> >> solution, I'm going to write down what I've found so far:
> >>
> >>  - The issue only affects Windows guests.
> >>  - It only manifests itself when doing live migration; non-live
> >>migration and save/resume work fine.
> >>  - It affects all x86 hardware; the number of migrations needed to
> >>trigger it seems to depend on the hardware, but doing 20 migrations
> >>reliably triggers it on all the hardware I've tested.
> > 
> > Not good.
> > 
> > You said that Windows reported that the login process failed somehow?
> > 
> > Is it possible something bad is happening, like sending spurious page
> > faults to the guest in logdirty mode?
> > 
> > I wonder if we could reproduce something like it on Linux -- set a build
> > going and start localhost migrating; a spurious page fault is likely to
> > cause the build to fail.
> 
> Well, with a looping xen-build going on in the guest, I've done 40 local
> migrates with no problems yet.
> 
> But Roger -- is this on emulated devices only, no PV drivers?
> 
> That might be something worth looking at.

Yes, Windows doesn't have PV devices. But save/restore and non-live
migration seem fine, so it doesn't look to be related to devices, but
rather to log-dirty or some other aspect of live migration.

Or maybe it is indeed something related to emulated devices that's more
easily triggered by live migration.

I'm also thinking it would be helpful to do 20 save/restores followed by
a shutdown, then create the guest again, do 20 migrations and shut down.
That would help us tell problems related to save/restore apart from
problems related to live migration more easily.
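
Something like this, roughly ("win16" standing in for the actual guest
name and config file):

# xl create win16.cfg
# for i in $(seq 20) ; do xl save win16 /tmp/win16.chk ; xl restore /tmp/win16.chk ; done
# xl shutdown -wF win16
# xl create win16.cfg
# for i in $(seq 20) ; do xl migrate win16 localhost ; done
# xl shutdown -wF win16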

Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread George Dunlap
On 11/03/2017 02:52 PM, George Dunlap wrote:
> On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
>> On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
>>> Hmm. I wonder whether the guest is actually healthy after the migrate. One 
>>> could imagine a situation where the storage device model (IDE in our case I 
>>> guess) gets stuck in some way but recovers after a timeout in the guest 
>>> storage stack. Thus, if you happen to try to shut down while it is still 
>>> stuck, Windows starts trying to shut down but can't. Try after the timeout 
>>> though and it can.
>>> In the past we did make attempts to support Windows without PV drivers in 
>>> XenServer but xenrt would never reliably pass VM lifecycle tests using 
>>> emulated devices. That was with qemu trad, but I wonder whether upstream 
>>> qemu is actually any better particularly if using older device models such 
>>> as IDE and RTL8139 (which are probably largely unmodified from trad).
>>
>> Since I've been looking into this for a couple of days, and found no
>> solution, I'm going to write down what I've found so far:
>>
>>  - The issue only affects Windows guests.
>>  - It only manifests itself when doing live migration; non-live
>>migration and save/resume work fine.
>>  - It affects all x86 hardware; the number of migrations needed to
>>trigger it seems to depend on the hardware, but doing 20 migrations
>>reliably triggers it on all the hardware I've tested.
> 
> Not good.
> 
> You said that Windows reported that the login process failed somehow?
> 
> Is it possible something bad is happening, like sending spurious page
> faults to the guest in logdirty mode?
> 
> I wonder if we could reproduce something like it on Linux -- set a build
> going and start localhost migrating; a spurious page fault is likely to
> cause the build to fail.

Well, with a looping xen-build going on in the guest, I've done 40 local
migrates with no problems yet.

But Roger -- is this on emulated devices only, no PV drivers?

That might be something worth looking at.

 -George



Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread George Dunlap
On 11/03/2017 02:14 PM, Roger Pau Monné wrote:
> On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
>>> -Original Message-
>>> From: Roger Pau Monne
>>> Sent: 02 November 2017 09:42
>>> To: Paul Durrant <paul.durr...@citrix.com>
>>> Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
>>> <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
>>> <julien.gr...@linaro.org>; committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
>>> Subject: Re: [Xen-devel] Commit moratorium to staging
>>>
>>> On Thu, Nov 02, 2017 at 09:20:10AM +, Paul Durrant wrote:
>>>>> -Original Message-
>>>>> From: Roger Pau Monne
>>>>> Sent: 02 November 2017 09:15
>>>>> To: Roger Pau Monne <roger@citrix.com>
>>>>> Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
>>>>> <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
>>>>> <julien.gr...@linaro.org>; Paul Durrant <paul.durr...@citrix.com>;
>>>>> committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
>>>>> Subject: Re: [Xen-devel] Commit moratorium to staging
>>>>>
>>>>> On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
>>>>>> On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
>>>>>>> * Affected hosts differ from unaffected hosts according to cpuid.
>>>>>>>   Roger has repro'd the bug on an unaffected host by masking out
>>>>>>>   certain cpuid bits.  There are 6 implicated bits and he is working
>>>>>>>   to narrow that down.
>>>>>>
>>>>>> I'm currently trying to narrow this down and make sure the above is
>>>>>> accurate.
>>>>>
>>>>> So I was wrong about this; I guess I've run the tests on the wrong
>>>>> host. Even when masking the different cpuid bits in the guest the
>>>>> tests still succeed.
>>>>>
>>>>> AFAICT the tests fail or succeed reliably depending on the host
>>>>> hardware. I don't really have many ideas about what to do next, but I
>>>>> think it would be useful to create a manual osstest flight that runs
>>>>> the win16 job in all the different hosts in the colo. I would also
>>>>> capture the normal information that Xen collects after each test (xl
>>>>> info, /proc/cpuinfo, serial logs...).
>>>>>
>>>>> Is there anything else not captured by ts-logs-capture that would be
>>>>> interesting in order to help debug the issue?
>>>>
>>>> Does the shutdown reliably complete prior to migrate and then only fail
>>>> intermittently after a localhost migrate?
>>>
>>> AFAICT yes, but it can also be added to the test in order to be sure.
>>>
>>>> It might be useful to know what cpuid info is seen by the guest before and
>>>> after migrate.
>>>
>>> Is there any way to get that from Windows in an automatic way? If not I
>>> could test that with a Debian guest. In fact it might even be a good
>>> thing for a Linux-based guest to be added to the regular migration tests
>>> in order to make sure cpuid bits don't change across migrations.
>>>
>>
>> I found this for Windows:
>>
>> https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe
>>
>> It can generate a text or HTML report as well as being run interactively. 
>> But you may get more mileage from using a Debian HVM guest. I guess it may 
>> also be useful if we can get a scan of available MSRs and their content 
>> before and after migrate too.
>>
>>>> Another datapoint... does the shutdown fail if you insert a delay of a couple
>>>> of minutes between the migrate and the shutdown?
>>>
>>> Sometimes, after a variable number of calls to xl shutdown ... the
>>> guest usually ends up shutting down.
>>>
>>
>> Hmm. I wonder whether the guest is actually healthy after the migrate. One 
>> could imagine a situation where the storage device model (IDE in our case I 
>> guess) gets stuck in some way but recovers after a timeout in the guest 
>> storage stack. Thus, if you happen to try to shut down while it is still 
>> stuck, Windows starts trying to shut down but can't. Try after the timeout 
>> though and it can.
>> In 

Re: [Xen-devel] Commit moratorium to staging

2017-11-03 Thread Roger Pau Monné
On Thu, Nov 02, 2017 at 09:55:11AM +, Paul Durrant wrote:
> > -Original Message-
> > From: Roger Pau Monne
> > Sent: 02 November 2017 09:42
> > To: Paul Durrant <paul.durr...@citrix.com>
> > Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> > <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> > <julien.gr...@linaro.org>; committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> > Subject: Re: [Xen-devel] Commit moratorium to staging
> > 
> > On Thu, Nov 02, 2017 at 09:20:10AM +, Paul Durrant wrote:
> > > > -Original Message-
> > > > From: Roger Pau Monne
> > > > Sent: 02 November 2017 09:15
> > > > To: Roger Pau Monne <roger@citrix.com>
> > > > Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> > > > <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> > > > <julien.gr...@linaro.org>; Paul Durrant <paul.durr...@citrix.com>;
> > > > committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> > > > Subject: Re: [Xen-devel] Commit moratorium to staging
> > > >
> > > > On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
> > > > > On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> > > > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > > > >   to narrow that down.
> > > > >
> > > > > I'm currently trying to narrow this down and make sure the above is
> > > > > accurate.
> > > >
> > > > So I was wrong about this; I guess I've run the tests on the wrong
> > > > host. Even when masking the different cpuid bits in the guest the
> > > > tests still succeed.
> > > >
> > > > AFAICT the tests fail or succeed reliably depending on the host
> > > > hardware. I don't really have many ideas about what to do next, but I
> > > > think it would be useful to create a manual osstest flight that runs
> > > > the win16 job in all the different hosts in the colo. I would also
> > > > capture the normal information that Xen collects after each test (xl
> > > > info, /proc/cpuinfo, serial logs...).
> > > >
> > > > Is there anything else not captured by ts-logs-capture that would be
> > > > interesting in order to help debug the issue?
> > >
> > > Does the shutdown reliably complete prior to migrate and then only fail
> > > intermittently after a localhost migrate?
> > 
> > AFAICT yes, but it can also be added to the test in order to be sure.
> > 
> > > It might be useful to know what cpuid info is seen by the guest before and
> > > after migrate.
> > 
> > Is there any way to get that from Windows in an automatic way? If not I
> > could test that with a Debian guest. In fact it might even be a good
> > thing for a Linux-based guest to be added to the regular migration tests
> > in order to make sure cpuid bits don't change across migrations.
> > 
> 
> I found this for Windows:
> 
> https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe
> 
> It can generate a text or HTML report as well as being run interactively. But 
> you may get more mileage from using a Debian HVM guest. I guess it may also 
> be useful if we can get a scan of available MSRs and their content before and 
> after migrate too.
> 
> > > Another datapoint... does the shutdown fail if you insert a delay of a couple
> > > of minutes between the migrate and the shutdown?
> > 
> > Sometimes, after a variable number of calls to xl shutdown ... the
> > guest usually ends up shutting down.
> > 
> 
> Hmm. I wonder whether the guest is actually healthy after the migrate. One 
> could imagine a situation where the storage device model (IDE in our case I 
> guess) gets stuck in some way but recovers after a timeout in the guest 
> storage stack. Thus, if you happen to try to shut down while it is still 
> stuck, Windows starts trying to shut down but can't. Try after the timeout 
> though and it can.
> In the past we did make attempts to support Windows without PV drivers in 
> XenServer but xenrt would never reliably pass VM lifecycle tests using 
> emulated devices. That was with qemu trad,

Re: [Xen-devel] Commit moratorium to staging [and 1 more messages]

2017-11-02 Thread Julien Grall

Hi Ian,

On 02/11/17 13:27, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
>> Thank you for the explanation. I agree with the force push to unblock
>> master (and the other trees I mentioned).
> 
> I will force push all the affected trees, but in a reactive way
> because I base each force push on a test report - so it won't be right
> away for all of them.
> 
> osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
>> flight 115471 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/115471/
>> 
>> Regressions :-(
>> 
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop   fail REGR. vs. 114644
>>  test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
>>  test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
> 
> The above are justifiable as discussed, leaving no blockers.
> 
>> version targeted for testing:
>>  xen  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d
> 
> So I have force pushed that.

Thank you! With that, the tree is re-opened. I will go through my
backlog of Xen 4.10 patches and have a look at whether they are suitable.


Cheers,


--
Julien Grall



Re: [Xen-devel] Commit moratorium to staging [and 1 more messages]

2017-11-02 Thread Ian Jackson
Julien Grall writes ("Re: Commit moratorium to staging"):
> Thank you for the explanation. I agree with the force push to unblock 
> master (and other tree I mentioned).

I will force push all the affected trees, but in a reactive way
because I base each force push on a test report - so it won't be right
away for all of them.

osstest service owner writes ("[xen-unstable test] 115471: regressions - FAIL"):
> flight 115471 xen-unstable real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/115471/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-qemuu-ws16-amd64 17 guest-stop   fail REGR. vs. 114644
>  test-amd64-amd64-xl-qemuu-ws16-amd64 17 guest-stop  fail REGR. vs. 114644
>  test-amd64-amd64-xl-qemut-ws16-amd64 17 guest-stop  fail REGR. vs. 114644

The above are justifiable as discussed, leaving no blockers.

> version targeted for testing:
>  xen  bb2c1a1cc98a22e2d4c14b18421aa7be6c2adf0d

So I have force pushed that.

Ian.



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread Ian Jackson
Roger Pau Monné writes ("Re: [Xen-devel] Commit moratorium to staging"):
> Is there any way to get that from Windows in an automatic way? If not I
> could test that with a Debian guest. In fact it might even be a good
> thing for a Linux-based guest to be added to the regular migration tests
> in order to make sure cpuid bits don't change across migrations.

We do migrations of all the guests in osstest (apart from in ARM,
where the guests don't support it, and some special cases like
rumpkernel and xtf domains).

Ian.



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread George Dunlap
On 11/01/2017 02:07 PM, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down.  (This is from an IRL conversation with Roger and
>   may not be 100% accurate.  Roger, please correct me.)

I presume when you say 'migrate' you mean localhost migration?

Are the results different if you:
- only save/restore *or* migrate it?
- save/restore twice or migrate twice, rather than save/restore + migrate?
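
Concretely ("win16" being a placeholder guest name), the first case is
something like:

# xl save win16 /tmp/win16.chk && xl restore /tmp/win16.chk
# xl save win16 /tmp/win16.chk && xl restore /tmp/win16.chk
# xl shutdown -wF win16

and the second:

# xl migrate win16 localhost
# xl migrate win16 localhost
# xl shutdown -wF win16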

Going through the save/restore path suggests that there's something
about the domain that's being set up one way on initial creation and
another way on restoring/receiving from a migration: i.e., something
not being saved and restored properly.

An alternate explanation would be a 'hitch' somewhere in the 're-attach'
driver code.

 -George



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread Paul Durrant
> -Original Message-
> From: Roger Pau Monne
> Sent: 02 November 2017 09:42
> To: Paul Durrant <paul.durr...@citrix.com>
> Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> <julien.gr...@linaro.org>; committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> Subject: Re: [Xen-devel] Commit moratorium to staging
> 
> On Thu, Nov 02, 2017 at 09:20:10AM +, Paul Durrant wrote:
> > > -Original Message-
> > > From: Roger Pau Monne
> > > Sent: 02 November 2017 09:15
> > > To: Roger Pau Monne <roger@citrix.com>
> > > Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> > > <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> > > <julien.gr...@linaro.org>; Paul Durrant <paul.durr...@citrix.com>;
> > > committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> > > Subject: Re: [Xen-devel] Commit moratorium to staging
> > >
> > > On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
> > > > On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> > > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > > >   to narrow that down.
> > > >
> > > > I'm currently trying to narrow this down and make sure the above is
> > > > accurate.
> > >
> > > So I was wrong about this; I guess I've run the tests on the wrong
> > > host. Even when masking the different cpuid bits in the guest the
> > > tests still succeed.
> > >
> > > AFAICT the tests fail or succeed reliably depending on the host
> > > hardware. I don't really have many ideas about what to do next, but I
> > > think it would be useful to create a manual osstest flight that runs
> > > the win16 job in all the different hosts in the colo. I would also
> > > capture the normal information that Xen collects after each test (xl
> > > info, /proc/cpuinfo, serial logs...).
> > >
> > > Is there anything else not captured by ts-logs-capture that would be
> > > interesting in order to help debug the issue?
> >
> > Does the shutdown reliably complete prior to migrate and then only fail
> > intermittently after a localhost migrate?
> 
> AFAICT yes, but it can also be added to the test in order to be sure.
> 
> > It might be useful to know what cpuid info is seen by the guest before and
> > after migrate.
> 
> Is there any way to get that from Windows in an automatic way? If not I
> could test that with a Debian guest. In fact it might even be a good
> thing for a Linux-based guest to be added to the regular migration tests
> in order to make sure cpuid bits don't change across migrations.
> 

I found this for Windows:

https://www.cpuid.com/downloads/cpu-z/cpu-z_1.81-en.exe

It can generate a text or HTML report as well as being run interactively. But 
you may get more mileage from using a Debian HVM guest. I guess it may also be 
useful if we can get a scan of available MSRs and their content before and 
after migrate too.
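
For the Debian side, something like this would do (assuming the guest
has the "cpuid" package installed): run "cpuid -r" inside the guest
before and after the migrate, and diff the two dumps:

# cpuid -r > /tmp/cpuid-before
# ... migrate the guest from dom0 ...
# cpuid -r > /tmp/cpuid-after
# diff -u /tmp/cpuid-before /tmp/cpuid-after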

> > Another datapoint... does the shutdown fail if you insert a delay of a couple
> > of minutes between the migrate and the shutdown?
> 
> Sometimes, after a variable number of calls to xl shutdown ... the
> guest usually ends up shutting down.
> 

Hmm. I wonder whether the guest is actually healthy after the migrate. One 
could imagine a situation where the storage device model (IDE in our case I 
guess) gets stuck in some way but recovers after a timeout in the guest storage 
stack. Thus, if you happen to try to shut down while it is still stuck, Windows 
starts trying to shut down but can't. Try after the timeout though and it can.
In the past we did make attempts to support Windows without PV drivers in 
XenServer but xenrt would never reliably pass VM lifecycle tests using emulated 
devices. That was with qemu trad, but I wonder whether upstream qemu is 
actually any better particularly if using older device models such as IDE and 
RTL8139 (which are probably largely unmodified from trad).

  Paul

> Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread Roger Pau Monné
On Thu, Nov 02, 2017 at 09:20:10AM +, Paul Durrant wrote:
> > -Original Message-
> > From: Roger Pau Monne
> > Sent: 02 November 2017 09:15
> > To: Roger Pau Monne <roger@citrix.com>
> > Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> > <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> > <julien.gr...@linaro.org>; Paul Durrant <paul.durr...@citrix.com>;
> > committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> > Subject: Re: [Xen-devel] Commit moratorium to staging
> > 
> > On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
> > > On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> > > > * Affected hosts differ from unaffected hosts according to cpuid.
> > > >   Roger has repro'd the bug on an unaffected host by masking out
> > > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > > >   to narrow that down.
> > >
> > > I'm currently trying to narrow this down and make sure the above is
> > > accurate.
> > 
> > So I was wrong about this; I guess I've run the tests on the wrong
> > host. Even when masking the different cpuid bits in the guest the
> > tests still succeed.
> > 
> > AFAICT the tests fail or succeed reliably depending on the host
> > hardware. I don't really have many ideas about what to do next, but I
> > think it would be useful to create a manual osstest flight that runs
> > the win16 job in all the different hosts in the colo. I would also
> > capture the normal information that Xen collects after each test (xl
> > info, /proc/cpuinfo, serial logs...).
> > 
> > Is there anything else not captured by ts-logs-capture that would be
> > interesting in order to help debug the issue?
> 
> Does the shutdown reliably complete prior to migrate and then only fail 
> intermittently after a localhost migrate?

AFAICT yes, but it can also be added to the test in order to be sure.

> It might be useful to know what cpuid info is seen by the guest before and 
> after migrate.

Is there any way to get that from Windows in an automatic way? If not I
could test that with a Debian guest. In fact it might even be a good
thing for a Linux-based guest to be added to the regular migration tests
in order to make sure cpuid bits don't change across migrations.

> Another datapoint... does the shutdown fail if you insert a delay of a couple 
> of minutes between the migrate and the shutdown?

Sometimes, after a variable number of calls to xl shutdown ... the
guest usually ends up shutting down.

Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread Paul Durrant
> -Original Message-
> From: Roger Pau Monne
> Sent: 02 November 2017 09:15
> To: Roger Pau Monne <roger@citrix.com>
> Cc: Ian Jackson <ian.jack...@citrix.com>; Lars Kurth
> <lars.ku...@citrix.com>; Wei Liu <wei.l...@citrix.com>; Julien Grall
> <julien.gr...@linaro.org>; Paul Durrant <paul.durr...@citrix.com>;
> committ...@xenproject.org; xen-devel <xen-de...@lists.xenproject.org>
> Subject: Re: [Xen-devel] Commit moratorium to staging
> 
> On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
> > On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> > > * Affected hosts differ from unaffected hosts according to cpuid.
> > >   Roger has repro'd the bug on an unaffected host by masking out
> > >   certain cpuid bits.  There are 6 implicated bits and he is working
> > >   to narrow that down.
> >
> > I'm currently trying to narrow this down and make sure the above is
> > accurate.
> 
> So I was wrong about this; I guess I've run the tests on the wrong
> host. Even when masking the different cpuid bits in the guest the
> tests still succeed.
> 
> AFAICT the tests fail or succeed reliably depending on the host
> hardware. I don't really have many ideas about what to do next, but I
> think it would be useful to create a manual osstest flight that runs
> the win16 job in all the different hosts in the colo. I would also
> capture the normal information that Xen collects after each test (xl
> info, /proc/cpuinfo, serial logs...).
> 
> Is there anything else not captured by ts-logs-capture that would be
> interesting in order to help debug the issue?

Does the shutdown reliably complete prior to migrate and then only fail 
intermittently after a localhost migrate? It might be useful to know what cpuid 
info is seen by the guest before and after migrate. Another datapoint... does 
the shutdown fail if you insert a delay of a couple of minutes between the 
migrate and the shutdown?

  Paul

> 
> Regards, Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-02 Thread Roger Pau Monné
On Wed, Nov 01, 2017 at 04:17:10PM +, Roger Pau Monné wrote:
> On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> > * Affected hosts differ from unaffected hosts according to cpuid.
> >   Roger has repro'd the bug on an unaffected host by masking out
> >   certain cpuid bits.  There are 6 implicated bits and he is working
> >   to narrow that down.
> 
> I'm currently trying to narrow this down and make sure the above is
> accurate.

So I was wrong about this; I guess I've run the tests on the wrong
host. Even when masking the different cpuid bits in the guest the
tests still succeed.

AFAICT the tests fail or succeed reliably depending on the host
hardware. I don't really have many ideas about what to do next, but I
think it would be useful to create a manual osstest flight that runs
the win16 job in all the different hosts in the colo. I would also
capture the normal information that Xen collects after each test (xl
info, /proc/cpuinfo, serial logs...).

Is there anything else not captured by ts-logs-capture that would be
interesting in order to help debug the issue?

Regards, Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Julien Grall

Hi Ian,

On 11/01/2017 04:54 PM, Ian Jackson wrote:
> Julien Grall writes ("Re: Commit moratorium to staging"):
> > Hi Ian,
> > 
> > Thank you for the detailed e-mail.
> > 
> > On 11/01/2017 02:07 PM, Ian Jackson wrote:
> > > Furthermore, the test is not intermittent, so a force push will be
> > > effective in the following sense: we would only get a "spurious" pass,
> > > resulting in the relevant osstest branch becoming stuck again, if a
> > > future test was unlucky and got an unaffected host.  That will happen
> > > infrequently enough.
> ...
> > I am not entirely sure I understand this paragraph. Are you saying that
> > osstest will not get stuck if we get a "spurious" pass on some hardware
> > in the future? Or will we need another force push?
> 
> osstest *would* get stuck *if* we got such a spurious push.  However,
> because osstest likes to retest failing tests on the same host as they
> failed on previously, such spurious passes are fairly unlikely.
> 
> I say "likes to".  The allocation system uses a set of heuristics to
> calculate a score for each possible host.  The score takes into
> account both when the host will be available to this job, and
> information like "did the most recent run of this test, on this host,
> pass or fail".  So I can't make guarantees but the amount of manual
> work to force push stuck branches will be tolerable.

Thank you for the explanation. I agree with the force push to unblock
master (and the other trees I mentioned).

However, it would still be nice to find the root cause of this bug and
fix it.


Cheers,

--
Julien Grall



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Ian Jackson
Julien Grall writes ("Re: Commit moratorium to staging"):
> Hi Ian,
> 
> Thank you for the detailed e-mail.
> 
> On 11/01/2017 02:07 PM, Ian Jackson wrote:
> > Furthermore, the test is not intermittent, so a force push will be
> > effective in the following sense: we would only get a "spurious" pass,
> > resulting in the relevant osstest branch becoming stuck again, if a
> > future test was unlucky and got an unaffected host.  That will happen
> > infrequently enough.
...
> I am not entirely sure I understand this paragraph. Are you saying that 
> osstest will not get stuck if we get a "spurious" pass on some hardware
> in the future? Or will we need another force push?

osstest *would* get stuck *if* we got such a spurious push.  However,
because osstest likes to retest failing tests on the same host as they
failed on previously, such spurious passes are fairly unlikely.

I say "likes to".  The allocation system uses a set of heuristics to
calculate a score for each possible host.  The score takes into
account both when the host will be available to this job, and
information like "did the most recent run of this test, on this host,
pass or fail".  So I can't make guarantees but the amount of manual
work to force push stuck branches will be tolerable.

Ian.



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Roger Pau Monné
On Wed, Nov 01, 2017 at 02:07:48PM +, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down.  (This is from an IRL conversation with Roger and
>   may not be 100% accurate.  Roger, please correct me.)

Yes, that's correct AFAICT. The affected hosts work fine if Windows
is booted and then shut down (without save/restore or migrations
involved).

> * Affected hosts differ from unaffected hosts according to cpuid.
>   Roger has repro'd the bug on an unaffected host by masking out
>   certain cpuid bits.  There are 6 implicated bits and he is working
>   to narrow that down.

I'm currently trying to narrow this down and make sure the above is
accurate.

> * It seems likely that this is therefore a real bug.  Maybe in Xen and
>   perhaps one that should indeed be a release blocker.
> 
> * But this is not a regression between master and staging.  It affects
>   many osstest branches apparently equally.
> 
> * This test is, effectively, new: before the osstest change
>   "HostDiskRoot: bump to 20G", these jobs would always fail earlier
>   and the affected step would not be run.
> 
> * The passes we got on various osstest branches before were just
>   because those branches hadn't tested on an affected host yet.  As
>   branches test different hosts, they will stick on affected hosts.
> 
> ISTM that this situation would therefore justify a force push.  We
> have established that this bug is very unlikely to be anything to do
> with the commits currently blocked by the failing pushes.

I agree, this is a bug that's always been present (at least in the
tested branches). It's triggered now because the Windows tests
have made further progress.

> Furthermore, the test is not intermittent, so a force push will be
> effective in the following sense: we would only get a "spurious" pass,
> resulting in the relevant osstest branch becoming stuck again, if a
> future test was unlucky and got an unaffected host.  That will happen
> infrequently enough.
> 
> So unless anyone objects (and for xen.git#master, with Julien's
> permission), I intend to force push all affected osstest branches when
> the test report shows the only blockage is ws16 and/or win10 tests
> failing the "guest-stop" step.
> 
> Opinions ?

I agree that a force push is justified. This bug is going to be quite
annoying if osstest decides to test on non-affected hosts, because
then we will get sporadic successful flights.

Thanks, Roger.



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Julien Grall

Hi Ian,

Thank you for the detailed e-mail.

On 11/01/2017 02:07 PM, Ian Jackson wrote:
> So, investigations (mostly by Roger, and also a bit of archaeology in
> the osstest db by me) have determined:
> 
> * This bug is 100% reproducible on affected hosts.  The repro is
>   to boot the Windows guest, save/restore it, then migrate it,
>   then shut down.  (This is from an IRL conversation with Roger and
>   may not be 100% accurate.  Roger, please correct me.)
> 
> * Affected hosts differ from unaffected hosts according to cpuid.
>   Roger has repro'd the bug on an unaffected host by masking out
>   certain cpuid bits.  There are 6 implicated bits and he is working
>   to narrow that down.
> 
> * It seems likely that this is therefore a real bug.  Maybe in Xen and
>   perhaps one that should indeed be a release blocker.
> 
> * But this is not a regression between master and staging.  It affects
>   many osstest branches apparently equally.
> 
> * This test is, effectively, new: before the osstest change
>   "HostDiskRoot: bump to 20G", these jobs would always fail earlier
>   and the affected step would not be run.
> 
> * The passes we got on various osstest branches before were just
>   because those branches hadn't tested on an affected host yet.  As
>   branches test different hosts, they will stick on affected hosts.
> 
> ISTM that this situation would therefore justify a force push.  We
> have established that this bug is very unlikely to be anything to do
> with the commits currently blocked by the failing pushes.
> 
> Furthermore, the test is not intermittent, so a force push will be
> effective in the following sense: we would only get a "spurious" pass,
> resulting in the relevant osstest branch becoming stuck again, if a
> future test was unlucky and got an unaffected host.  That will happen
> infrequently enough.

I am not entirely sure I understand this paragraph. Are you saying that
osstest will not get stuck if we get a "spurious" pass on some hardware
in the future? Or will we need another force push?

> So unless anyone objects (and for xen.git#master, with Julien's
> permission), I intend to force push all affected osstest branches when
> the test report shows the only blockage is ws16 and/or win10 tests
> failing the "guest-stop" step.

This is not only blocking xen.git#master but also blocking other trees:
- linux-linus
- linux-4.9

Cheers,

--
Julien Grall



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Ian Jackson
So, investigations (mostly by Roger, and also a bit of archaeology in
the osstest db by me) have determined:

* This bug is 100% reproducible on affected hosts.  The repro is
  to boot the Windows guest, save/restore it, then migrate it,
  then shut down.  (This is from an IRL conversation with Roger and
  may not be 100% accurate.  Roger, please correct me.)

* Affected hosts differ from unaffected hosts according to cpuid.
  Roger has repro'd the bug on an unaffected host by masking out
  certain cpuid bits.  There are 6 implicated bits and he is working
  to narrow that down.

* It seems likely that this is therefore a real bug.  Maybe in Xen and
  perhaps one that should indeed be a release blocker.

* But this is not a regression between master and staging.  It affects
  many osstest branches apparently equally.

* This test is, effectively, new: before the osstest change
  "HostDiskRoot: bump to 20G", these jobs would always fail earlier
  and the affected step would not be run.

* The passes we got on various osstest branches before were just
  because those branches hadn't tested on an affected host yet.  As
  branches test different hosts, they will stick on affected hosts.

ISTM that this situation would therefore justify a force push.  We
have established that this bug is very unlikely to be anything to do
with the commits currently blocked by the failing pushes.

Furthermore, the test is not intermittent, so a force push will be
effective in the following sense: we would only get a "spurious" pass,
resulting in the relevant osstest branch becoming stuck again, if a
future test was unlucky and got an unaffected host.  That will happen
infrequently enough.

So unless anyone objects (and for xen.git#master, with Julien's
permission), I intend to force push all affected osstest branches when
the test report shows the only blockage is ws16 and/or win10 tests
failing the "guest-stop" step.

Opinions ?

Ian.



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Paul Durrant
> -Original Message-
> From: Wei Liu [mailto:wei.l...@citrix.com]
> Sent: 01 November 2017 10:48
> To: Roger Pau Monne <roger@citrix.com>
> Cc: Julien Grall <julien.gr...@linaro.org>; committ...@xenproject.org;
> xen-devel <xen-de...@lists.xenproject.org>; Lars Kurth <lars.ku...@citrix.com>;
> Paul Durrant <paul.durr...@citrix.com>; Wei Liu <wei.l...@citrix.com>
> Subject: Re: Commit moratorium to staging
> 
> On Tue, Oct 31, 2017 at 04:52:37PM +, Roger Pau Monné wrote:
> >
> > I have to admit I have no idea why Windows clears the STS power bit
> > and then completely ignores it on certain occasions.
> >
> > I'm also afraid I have no idea how to debug Windows in order to know
> > why this event is acknowledged but ignored.
> >
> > I've also tried to reproduce the same with a Debian guest, by doing
> > the same number of save/restores and migrations, and finally issuing a
> > xl trigger <domain> power, but Debian has always worked fine and
> > shut down.
> >
> > Any comments are welcome.
> 
> After googling around, some articles suggest Windows can ignore ACPI
> events under certain circumstances. Is it worth checking in the Windows
> event log to see if an event is received but ignored for reason X?

Dumping the event logs would definitely be a useful thing to do.
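
Something like this ought to pull out the relevant entries (a sketch, I
haven't tried it on ws16; 1074 is the "process has initiated shutdown"
event and 6008 the "unexpected shutdown" one):

wevtutil qe System /rd:true /f:text /c:20 /q:"*[System[(EventID=1074 or EventID=6008)]]"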

> 
> For Windows Server 2012:
> https://serverfault.com/questions/534042/windows-2012-how-to-make-power-button-work-in-every-cases
> 
> Can't find anything for Windows Server 2016.

No, I couldn't either. I did find 
https://ethertubes.com/unattended-acpi-shutdown-of-windows-server/ too, which 
seems to have some potentially useful suggestions.

  Paul



Re: [Xen-devel] Commit moratorium to staging

2017-11-01 Thread Wei Liu
On Tue, Oct 31, 2017 at 04:52:37PM +, Roger Pau Monné wrote:
> 
> I have to admit I have no idea why Windows clears the STS power bit
> and then completely ignores it on certain occasions.
> 
> I'm also afraid I have no idea how to debug Windows in order to know
> why this event is acknowledged but ignored.
> 
> I've also tried to reproduce the same with a Debian guest, by doing
> the same number of save/restores and migrations, and finally issuing a
> xl trigger <domain> power, but Debian has always worked fine and
> shut down.
> 
> Any comments are welcome.

After googling around, some articles suggest Windows can ignore ACPI
events under certain circumstances. Is it worth checking in the Windows
event log to see if an event is received but ignored for reason X?

For Windows Server 2012:
https://serverfault.com/questions/534042/windows-2012-how-to-make-power-button-work-in-every-cases

Can't find anything for Windows Server 2016.



Re: [Xen-devel] Commit moratorium to staging

2017-10-31 Thread Roger Pau Monné
On Tue, Oct 31, 2017 at 10:49:35AM +, Julien Grall wrote:
> Hi all,
> 
> Master lags 15 days behind staging due to tests failing reliably on some of
> the hardware in osstest (see [1]).
> 
> At the moment a force push is not feasible because the same tests passes on
> different hardware (see [2]).

I've been looking into this, and I'm afraid I don't yet have a cause
for those issues. I'm going to post what I've found so far, maybe
someone is able to spot something I'm missing.

Since I assumed this was somehow related to the ACPI PM1A_STS/EN
blocks (which is how the power button event gets notified to the OS),
I've added the following instrumentation to the pmtimer.c code:

diff --git a/xen/arch/x86/hvm/pmtimer.c b/xen/arch/x86/hvm/pmtimer.c
index 435647ff1e..051fc46df8 100644
--- a/xen/arch/x86/hvm/pmtimer.c
+++ b/xen/arch/x86/hvm/pmtimer.c
@@ -61,9 +61,15 @@ static void pmt_update_sci(PMTState *s)
 ASSERT(spin_is_locked(&s->lock));
 
 if ( acpi->pm1a_en & acpi->pm1a_sts & SCI_MASK )
+{
+printk("asserting SCI IRQ\n");
 hvm_isa_irq_assert(s->vcpu->domain, SCI_IRQ, NULL);
+}
 else
+{
+printk("de-asserting SCI IRQ\n");
 hvm_isa_irq_deassert(s->vcpu->domain, SCI_IRQ);
+}
 }
 
 void hvm_acpi_power_button(struct domain *d)
@@ -73,6 +79,7 @@ void hvm_acpi_power_button(struct domain *d)
 if ( !has_vpm(d) )
 return;
 
+printk("hvm_acpi_power_button for d%d\n", d->domain_id);
 spin_lock(&s->lock);
 d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
 pmt_update_sci(s);
@@ -86,6 +93,7 @@ void hvm_acpi_sleep_button(struct domain *d)
 if ( !has_vpm(d) )
 return;
 
+printk("hvm_acpi_sleep_button for d%d\n", d->domain_id);
 spin_lock(&s->lock);
 d->arch.hvm_domain.acpi.pm1a_sts |= PWRBTN_STS;
 pmt_update_sci(s);
@@ -170,6 +178,7 @@ static int handle_evt_io(
 
 if ( dir == IOREQ_WRITE )
 {
+printk("write PM1a addr: %#x val: %#x\n", addr, *val);
 /* Handle this I/O one byte at a time */
 for ( i = bytes, data = *val;
   i > 0;
@@ -197,6 +206,8 @@ static int handle_evt_io(
  bytes, *val, port);
 }
 }
+printk("result pm1a_sts: %#x pm1a_en: %#x\n",
+  acpi->pm1a_sts, acpi->pm1a_en);
 /* Fix up the SCI state to match the new register state */
 pmt_update_sci(s);
 }

I've then rerun the failing test, and this is what I got in the
failure case (ie: windows ignoring the power event):

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
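
To decode those values (using the standard ACPI PM1 fixed-event bit
positions, which is how I read pmtimer.c -- worth double-checking):

/* PM1a_STS (addr 0): */
#define PWRBTN_STS (1 << 8)   /* 0x100: power button event pending */
/* PM1a_EN (addr 0x2), so 0x320 is: */
#define GBL_EN     (1 << 5)   /* 0x020 */
#define PWRBTN_EN  (1 << 8)   /* 0x100 */
#define SLPBTN_EN  (1 << 9)   /* 0x200 */

So the trace shows Xen raising PWRBTN_STS, the guest clearing it by
writing 0x100 to the status register (write-1-to-clear), with the power
button event still enabled -- i.e. the event is acknowledged.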

Strangely enough, the second time I tried the same command (xl
shutdown -wF ...) on the same guest, it succeeded and Windows shut down
without issues; this is the log in that case:

(XEN) hvm_acpi_power_button for d14
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x1
(XEN) result pm1a_sts: 0x100 pm1a_en: 0x320
(XEN) asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x100
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0x2 val: 0x320
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ
(XEN) write PM1a addr: 0 val: 0x8000
(XEN) result pm1a_sts: 0 pm1a_en: 0x320
(XEN) de-asserting SCI IRQ

I have to admit I have no idea why Windows clears the STS power bit
and then completely ignores it on certain occasions.

I'm also afraid I have no idea how to debug Windows in order to know
why this event is acknowledged but ignored.

I've also tried to reproduce the same with a Debian guest, by doing
the same number of save/restores and migrations, and finally issuing a
xl trigger <domain> power, but Debian has always worked fine and
shut down.

Any comments are welcome.

Roger.



[Xen-devel] Commit moratorium to staging

2017-10-31 Thread Julien Grall

Hi all,

Master lags 15 days behind staging due to tests failing reliably on some 
of the hardware in osstest (see [1]).


At the moment a force push is not feasible because the same tests pass 
on different hardware (see [2]).


Please avoid committing any more patches unless it is fixing a test 
failure in osstest.


Tree will be re-opened once we get a push.

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg03351.html
[2] 
https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg02932.html


--
Julien Grall



Re: [Xen-devel] Commit moratorium to staging

2017-05-16 Thread Ian Jackson
Julien Grall writes ("Commit moratorium to staging"):
> It looks like osstest is a bit behind because of ARM64 boxes (they are 
> fully loaded) and XP testing (those tests have now been removed, see [1]).
> 
> I'd like to cut the next rc when staging == master, so please stop 
> committing today.

I force pushed xen#master earlier and there is no longer any need for
this moratorium.

Of course any commits to staging still need RM approval from Julien.

Thanks,
Ian.



[Xen-devel] Commit moratorium to staging

2017-05-15 Thread Julien Grall

Committers,

It looks like osstest is a bit behind because of ARM64 boxes (they are 
fully loaded) and XP testing (those tests have now been removed, see [1]).


I'd like to cut the next rc when staging == master, so please stop 
committing today.


Ian force pushed osstest today, so hopefully we can get a push tomorrow.

Cheers,

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2017-05/msg00425.html


--
Julien Grall



[Xen-devel] Commit moratorium to staging-4.6

2015-10-02 Thread Wei Liu
Committers,

All patches that I'm aware of that need to be in 4.6 have been
committed.  Please stop committing to staging-4.6 today.

Next week we will start making the release when OSSTest gets a push.

Wei.



[Xen-devel] Commit moratorium to staging-4.6 lifted

2015-09-28 Thread Wei Liu
Committers,

RC4 is tagged. You can now commit the rest of your 4.6 queue to staging-4.6.

Note that we expect to release in about two weeks (Oct 12). Preferably
all patches should be applied within this week so that we can sort out
any problem within next week.

Wei.
