Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-24 Thread Simon Gaiser
Jan Beulich:
>>>> On 23.05.18 at 00:21,  wrote:
>> I have done some more testing in the meantime. The issue also affect
>> 4.10.1, but not 4.10.0. That's useful since it makes the bisect shorter.
>> A bisect identifies 8462c575d9 "x86/xpti: Hide almost all of .text and
>> all .data/.rodata/.bss mappings" as the commit which breaks suspend.
>>
>> 8462c575d9 is a squashed backport of:
>>
>>   422588e885 x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
>>   d1d6fc97d6 x86/xpti: really hide almost all of Xen image
>>   044fedfaa2 x86/traps: Put idt_table[] back into .bss
>>
>> And indeed, reverting those on staging fixes suspend. (This also matches
>> the behavior that xpti=off fixes suspend as George already reported
>> earlier today).
> 
> Okay, that was quite helpful - I think I see now where I screwed up (i.e.
> the issue is in the middle of the three commits). Could you confirm that a
> Xen booted with "nosmp" suspends and resumes fine?

Yes, with nosmp suspend works.



signature.asc
Description: OpenPGP digital signature
___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-24 Thread Jan Beulich
>>> On 23.05.18 at 00:21,  wrote:
> I have done some more testing in the meantime. The issue also affect
> 4.10.1, but not 4.10.0. That's useful since it makes the bisect shorter.
> A bisect identifies 8462c575d9 "x86/xpti: Hide almost all of .text and
> all .data/.rodata/.bss mappings" as the commit which breaks suspend.
> 
> 8462c575d9 is a squashed backport of:
> 
>   422588e885 x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
>   d1d6fc97d6 x86/xpti: really hide almost all of Xen image
>   044fedfaa2 x86/traps: Put idt_table[] back into .bss
> 
> And indeed, reverting those on staging fixes suspend. (This also matches
> the behavior that xpti=off fixes suspend as George already reported
> earlier today).

Okay, that was quite helpful - I think I see now where I screwed up (i.e.
the issue is in the middle of the three commits). Could you confirm that a
Xen booted with "nosmp" suspends and resumes fine?

Jan




Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-22 Thread Simon Gaiser
George Dunlap:
> On Fri, May 18, 2018 at 5:19 PM, Marek Marczykowski
>  wrote:
>> On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
>>> >>> On 18.05.18 at 17:33,  wrote:
>>>> Yes, I'm happy to help with that. As I've said, the basic test is very
>>>> simple (rtcwake command) and already very useful. The fact that it is(?)
>>>> broken on staging doesn't make it easier,
>>>
>>> Details on the breakage would be appreciated (on a separate thread),
>>> unless you plan to address it yourself. I recall Simon(?) mentioning this as
>>> well, but also not providing sufficient data to consider looking into it
>>> (perhaps simply because it wasn't easy to obtain useful data, as
>>> frequently is the case with S3 resume). I think it would be nice if we could
>>> release 4.11 without a regression here.
>>
>> I only know that Simon have tested it and it fails. Cc'ing him.

I ran into the same problem as George below (see [1] for the initial
report).

> Well I tried it with a post-RC 4.11 and got the below.  I haven't done
> any investigation.
> 
>  -George
> 
[...]
> (XEN) *** DOUBLE FAULT ***
> (XEN) [ Xen-4.11-rc  x86_64  debug=y   Not tainted ]
> (XEN) CPU:0
> (XEN) RIP:e008:[] handle_exception+0x9c/0xf7
> (XEN) RFLAGS: 00010006   CONTEXT: hypervisor
> (XEN) rax: c900422480b8   rbx:    rcx: 0005
> (XEN) rdx:    rsi:    rdi: 
> (XEN) rbp: 36ffbddb7f27   rsp: c90042248000   r8:  
> (XEN) r9:     r10:    r11: 
> (XEN) r12:    r13:    r14: c9004224
> (XEN) r15:    cr0: 8005003b   cr4: 26e0
> (XEN) cr3: 00018a10   cr2: c90042247ff8
> (XEN) fsb: 7f6242d95700   gsb: 88003dc0   gss: 
> (XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
> (XEN) Current stack base c90042248000 differs from expected 8300dfa8
> (XEN) Valid stack range: c9004224e000-c9004225, sp=c90042248000, tss.rsp0=8300dfa87fa0
> (XEN) No stack overflow detected. Skipping stack trace.
> (XEN)
> (XEN) 
> (XEN) Panic on CPU 0:
> (XEN) DOUBLE FAULT -- system shutdown
> (XEN) 
> (XEN)
> (XEN) Reboot in five seconds...

I have done some more testing in the meantime. The issue also affects
4.10.1, but not 4.10.0. That's useful since it makes the bisect shorter.
A bisect identifies 8462c575d9 "x86/xpti: Hide almost all of .text and
all .data/.rodata/.bss mappings" as the commit which breaks suspend.

8462c575d9 is a squashed backport of:

  422588e885 x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
  d1d6fc97d6 x86/xpti: really hide almost all of Xen image
  044fedfaa2 x86/traps: Put idt_table[] back into .bss

And indeed, reverting those on staging fixes suspend. (This also matches
the behavior that xpti=off fixes suspend as George already reported
earlier today).
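
For readers unfamiliar with bisecting, the search described above can be sketched as a binary search over an ordered commit list. This is only an illustration, not the tooling actually used: the history below is hypothetical (only 8462c575d9 comes from this message, the other entries are placeholders), and the sketch assumes that once suspend breaks it stays broken in every later commit.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1   # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid               # first bad commit is at mid or earlier
        else:
            lo = mid + 1           # first bad commit is after mid
    return commits[lo]


# Hypothetical history: only 8462c575d9 is a real ID from the message,
# every other entry is a placeholder.
history = ["placeholder-4.10.0", "placeholder-a", "placeholder-b",
           "8462c575d9", "placeholder-c", "placeholder-4.10.1"]
broken_from = history.index("8462c575d9")

# Stand-in for "build this commit, boot it, try suspend/resume".
print(first_bad(history, lambda c: history.index(c) >= broken_from))
```

Each probe halves the remaining range, which is why narrowing the range to 4.10.0..4.10.1 shortens the bisect.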

[1]: https://lists.xenproject.org/archives/html/xen-devel/2018-04/msg01137.html




Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-22 Thread Dario Faggioli
On Mon, 2018-05-21 at 14:57 +0100, Ian Jackson wrote:
> > On Mon, 2018-05-21 at 12:04 +0100, George Dunlap wrote:
> > > What if we 1) have two versions of the test -- "Fake suspend" and
> > > "Real Suspend"; 2) only run "Real suspend" on hardware
> > > specifically
> > > marked as having a suspend that works reliably; 3) default all
> > > hardware to 'false' until we do some testing to find out how
> > > reliable
> > > it is?
> > > 
>
> OK, for starters, how about we add the fake suspend test to every
> flight.
> 
> Do we want or need to do that test with a guest running ?
> 
Doing it with a guest running would be more complete, I think.

I think the best would be to do both, i.e.:
- suspend without any guest
- (when resumed) start a guest
- suspend with a guest
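
The three-step sequence above could be sketched as follows. This is a minimal illustration with stubbed actions, not osstest code: `suspend_plan` and `run_steps` are hypothetical names, and stopping at the first failure is an assumption about how such a test would behave.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def suspend_plan(suspend: Callable[[], bool],
                 start_guest: Callable[[], bool]) -> List[Step]:
    """Order the steps as proposed: suspend, start a guest, suspend again."""
    return [
        ("suspend without any guest", suspend),
        ("start a guest", start_guest),
        ("suspend with a guest", suspend),
    ]


def run_steps(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run the steps in order and stop after the first one that fails."""
    results = []
    for name, action in steps:
        ok = action()
        results.append((name, ok))
        if not ok:
            break
    return results


# Stub actions standing in for the real suspend and guest-start operations.
for name, ok in run_steps(suspend_plan(lambda: True, lambda: True)):
    print(name, "ok" if ok else "FAILED")
```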

Dario
-- 
<> (Raistlin Majere)
-
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread George Dunlap
On Mon, May 21, 2018 at 5:28 PM, George Dunlap  wrote:
> On Mon, May 21, 2018 at 5:17 PM, Andrew Cooper
>  wrote:
>> On 21/05/18 16:48, George Dunlap wrote:
>>> On Fri, May 18, 2018 at 5:19 PM, Marek Marczykowski
>>>  wrote:
>>>> On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
>>>>> >>> On 18.05.18 at 17:33,  wrote:
>>>>>> Yes, I'm happy to help with that. As I've said, the basic test is very
>>>>>> simple (rtcwake command) and already very useful. The fact that it is(?)
>>>>>> broken on staging doesn't make it easier,
>>>>> Details on the breakage would be appreciated (on a separate thread),
>>>>> unless you plan to address it yourself. I recall Simon(?) mentioning this as
>>>>> well, but also not providing sufficient data to consider looking into it
>>>>> (perhaps simply because it wasn't easy to obtain useful data, as
>>>>> frequently is the case with S3 resume). I think it would be nice if we could
>>>>> release 4.11 without a regression here.
>>>> I only know that Simon have tested it and it fails. Cc'ing him.
>>> Well I tried it with a post-RC 4.11 and got the below.  I haven't done
>>> any investigation.
>>>
>>>  -George
>>>
>>> 
>>> (XEN) CPU3: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
>>> (XEN) *** DOUBLE FAULT ***
>>> (XEN) [ Xen-4.11-rc  x86_64  debug=y   Not tainted ]
>>> (XEN) CPU:0
>>> (XEN) RIP:e008:[] handle_exception+0x9c/0xf7
>>
>> Do you have xen-syms from this build?  That looks like its in the middle
>> of the Spectre alternative, but isn't the wrmsr instruction itself.
>
> Hmm, sorry, I've trashed it -- I was really trying to test my
> "acpi_sleep=s3_fake" test.
>
> I've never tried suspend on this particular box, so I'm not sure it
> works generally.  Let me get a reasonable baseline first.

OK, well suspend / resume works on this box in all the following configurations:

* 4.8.0 (real)
* 4.8.0 with s3_fake backported (fake)
* 4.8.3 (real)
* staging-4.8 with bti=false and xpti=false (real)

It fails in the following configuration:
* staging-4.8 with speculation mitigations at default.  (It is an
Intel box, so BTI and XPTI will both be on.)

I didn't get a stack trace unfortunately -- the box just stopped responding.

I'll do some more playing around on staging tomorrow.

 -George


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread George Dunlap
On Mon, May 21, 2018 at 5:17 PM, Andrew Cooper
 wrote:
> On 21/05/18 16:48, George Dunlap wrote:
>> On Fri, May 18, 2018 at 5:19 PM, Marek Marczykowski
>>  wrote:
>>> On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
>>>> >>> On 18.05.18 at 17:33,  wrote:
>>>>> Yes, I'm happy to help with that. As I've said, the basic test is very
>>>>> simple (rtcwake command) and already very useful. The fact that it is(?)
>>>>> broken on staging doesn't make it easier,
>>>> Details on the breakage would be appreciated (on a separate thread),
>>>> unless you plan to address it yourself. I recall Simon(?) mentioning this as
>>>> well, but also not providing sufficient data to consider looking into it
>>>> (perhaps simply because it wasn't easy to obtain useful data, as
>>>> frequently is the case with S3 resume). I think it would be nice if we could
>>>> release 4.11 without a regression here.
>>> I only know that Simon have tested it and it fails. Cc'ing him.
>> Well I tried it with a post-RC 4.11 and got the below.  I haven't done
>> any investigation.
>>
>>  -George
>>
>> 
>> (XEN) CPU3: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
>> (XEN) *** DOUBLE FAULT ***
>> (XEN) [ Xen-4.11-rc  x86_64  debug=y   Not tainted ]
>> (XEN) CPU:0
>> (XEN) RIP:e008:[] handle_exception+0x9c/0xf7
>
> Do you have xen-syms from this build?  That looks like its in the middle
> of the Spectre alternative, but isn't the wrmsr instruction itself.

Hmm, sorry, I've trashed it -- I was really trying to test my
"acpi_sleep=s3_fake" test.

I've never tried suspend on this particular box, so I'm not sure it
works generally.  Let me get a reasonable baseline first.

 -George


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread Andrew Cooper
On 21/05/18 16:48, George Dunlap wrote:
> On Fri, May 18, 2018 at 5:19 PM, Marek Marczykowski
>  wrote:
>> On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
>>> >>> On 18.05.18 at 17:33,  wrote:
>>>> Yes, I'm happy to help with that. As I've said, the basic test is very
>>>> simple (rtcwake command) and already very useful. The fact that it is(?)
>>>> broken on staging doesn't make it easier,
>>> Details on the breakage would be appreciated (on a separate thread),
>>> unless you plan to address it yourself. I recall Simon(?) mentioning this as
>>> well, but also not providing sufficient data to consider looking into it
>>> (perhaps simply because it wasn't easy to obtain useful data, as
>>> frequently is the case with S3 resume). I think it would be nice if we could
>>> release 4.11 without a regression here.
>> I only know that Simon have tested it and it fails. Cc'ing him.
> Well I tried it with a post-RC 4.11 and got the below.  I haven't done
> any investigation.
>
>  -George
>
> 
> (XEN) CPU3: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
> (XEN) *** DOUBLE FAULT ***
> (XEN) [ Xen-4.11-rc  x86_64  debug=y   Not tainted ]
> (XEN) CPU:0
> (XEN) RIP:e008:[] handle_exception+0x9c/0xf7

Do you have xen-syms from this build?  That looks like it's in the middle
of the Spectre alternative, but isn't the wrmsr instruction itself.

> (XEN) RFLAGS: 00010006   CONTEXT: hypervisor
> (XEN) rax: c900422480b8   rbx:    rcx: 0005
> (XEN) rdx:    rsi:    rdi: 
> (XEN) rbp: 36ffbddb7f27   rsp: c90042248000   r8:  
> (XEN) r9:     r10:    r11: 
> (XEN) r12:    r13:    r14: c9004224
> (XEN) r15:    cr0: 8005003b   cr4: 26e0
> (XEN) cr3: 00018a10   cr2: c90042247ff8
> (XEN) fsb: 7f6242d95700   gsb: 88003dc0   gss: 
> (XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
> (XEN) Current stack base c90042248000 differs from expected 8300dfa8
> (XEN) Valid stack range: c9004224e000-c9004225, sp=c90042248000, tss.rsp0=8300dfa87fa0
> (XEN) No stack overflow detected. Skipping stack trace.

I really need to wire up the code dump, irrespective of this particular
issue.

~Andrew

> (XEN)
> (XEN) 
> (XEN) Panic on CPU 0:
> (XEN) DOUBLE FAULT -- system shutdown
> (XEN) 
> (XEN)
> (XEN) Reboot in five seconds...
>



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread George Dunlap
On Fri, May 18, 2018 at 5:19 PM, Marek Marczykowski
 wrote:
> On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
>> >>> On 18.05.18 at 17:33,  wrote:
>> > Yes, I'm happy to help with that. As I've said, the basic test is very
>> > simple (rtcwake command) and already very useful. The fact that it is(?)
>> > broken on staging doesn't make it easier,
>>
>> Details on the breakage would be appreciated (on a separate thread),
>> unless you plan to address it yourself. I recall Simon(?) mentioning this as
>> well, but also not providing sufficient data to consider looking into it
>> (perhaps simply because it wasn't easy to obtain useful data, as
>> frequently is the case with S3 resume). I think it would be nice if we could
>> release 4.11 without a regression here.
>
> I only know that Simon have tested it and it fails. Cc'ing him.

Well I tried it with a post-RC 4.11 and got the below.  I haven't done
any investigation.

 -George

(XEN) CPU0 CMCI LVT vector (0xf2) already installed
(XEN) CPU0: Thermal monitoring enabled (TM1)
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for irq 16
(XEN) Broke affinity for irq 49
(XEN) CPU: Physical Processor ID: 0
(XEN) CPU: Processor Core ID: 0
(XEN) CPU: L1 I cache: 32K, L1 D cache: 32K
(XEN) CPU: L2 cache: 256K
(XEN) CPU: L3 cache: 12288K
(XEN) Enabling non-boot CPUs  ...
(XEN) Booting processor 1/2 eip 8e000
(XEN) Initializing CPU#1
(XEN) CPU: Physical Processor ID: 0
(XEN) CPU: Processor Core ID: 1
(XEN) CPU: L1 I cache: 32K, L1 D cache: 32K
(XEN) CPU: L2 cache: 256K
(XEN) CPU: L3 cache: 12288K
(XEN) CPU1: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
(XEN) Booting processor 2/18 eip 8e000
(XEN) Initializing CPU#2
(XEN) CPU: Physical Processor ID: 0
(XEN) CPU: Processor Core ID: 9
(XEN) CPU: L1 I cache: 32K, L1 D cache: 32K
(XEN) CPU: L2 cache: 256K
(XEN) CPU: L3 cache: 12288K
(XEN) CPU2: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
(XEN) Booting processor 3/20 eip 8e000
(XEN) Initializing CPU#3
(XEN) CPU: Physical Processor ID: 0
(XEN) CPU: Processor Core ID: 10
(XEN) CPU: L1 I cache: 32K, L1 D cache: 32K
(XEN) CPU: L2 cache: 256K
(XEN) CPU: L3 cache: 12288K
(XEN) CPU3: Intel(R) Xeon(R) CPU   E5630  @ 2.53GHz stepping 02
(XEN) *** DOUBLE FAULT ***
(XEN) [ Xen-4.11-rc  x86_64  debug=y   Not tainted ]
(XEN) CPU:0
(XEN) RIP:e008:[] handle_exception+0x9c/0xf7
(XEN) RFLAGS: 00010006   CONTEXT: hypervisor
(XEN) rax: c900422480b8   rbx:    rcx: 0005
(XEN) rdx:    rsi:    rdi: 
(XEN) rbp: 36ffbddb7f27   rsp: c90042248000   r8:  
(XEN) r9:     r10:    r11: 
(XEN) r12:    r13:    r14: c9004224
(XEN) r15:    cr0: 8005003b   cr4: 26e0
(XEN) cr3: 00018a10   cr2: c90042247ff8
(XEN) fsb: 7f6242d95700   gsb: 88003dc0   gss: 
(XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
(XEN) Current stack base c90042248000 differs from expected 8300dfa8
(XEN) Valid stack range: c9004224e000-c9004225, sp=c90042248000, tss.rsp0=8300dfa87fa0
(XEN) No stack overflow detected. Skipping stack trace.
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) DOUBLE FAULT -- system shutdown
(XEN) 
(XEN)
(XEN) Reboot in five seconds...


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread George Dunlap
On Mon, May 21, 2018 at 2:57 PM, Ian Jackson  wrote:
> Dario Faggioli writes ("Re: [Xen-devel] Test for osstest, features used in 
> Qubes OS"):
>> On Mon, 2018-05-21 at 12:04 +0100, George Dunlap wrote:
>> > What if we 1) have two versions of the test -- "Fake suspend" and
>> > "Real Suspend"; 2) only run "Real suspend" on hardware specifically
>> > marked as having a suspend that works reliably; 3) default all
>> > hardware to 'false' until we do some testing to find out how reliable
>> > it is?
>> >
>> > That way we get suspend testing 95% effective as quickly as possible,
>> > and we can complete it as we have time.
>>
>> That sounds a very good plan to me, FWIW.
>
> OK, for starters, how about we add the fake suspend test to every
> flight.
>
> What is the rune for that.
>
> Do we want or need to do that test with a guest running ?

Unfortunately the patch was never checked in.

I'll send an updated patch.

 -George


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread Ian Jackson
Dario Faggioli writes ("Re: [Xen-devel] Test for osstest, features used in 
Qubes OS"):
> On Mon, 2018-05-21 at 12:04 +0100, George Dunlap wrote:
> > What if we 1) have two versions of the test -- "Fake suspend" and
> > "Real Suspend"; 2) only run "Real suspend" on hardware specifically
> > marked as having a suspend that works reliably; 3) default all
> > hardware to 'false' until we do some testing to find out how reliable
> > it is?
> > 
> > That way we get suspend testing 95% effective as quickly as possible,
> > and we can complete it as we have time.
> 
> That sounds a very good plan to me, FWIW.

OK, for starters, how about we add the fake suspend test to every
flight.

What is the rune for that?

Do we want or need to do that test with a guest running ?

Ian.


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread Dario Faggioli
On Mon, 2018-05-21 at 12:04 +0100, George Dunlap wrote:
> On Thu, May 17, 2018 at 4:12 PM, Ian Jackson 
> wrote:
> > That's not entirely trivial then, especially for you, unless you
> > want
> > to set up your own osstest production instance.  However, I can
> > probably do the osstest-machinery work if you will help debug it,
> > review logs, tell me what to do next, etc. :-).
> 
> I'm pretty sure it would be possible to test the Xen "get ready for
> suspend" and "resume from suspend" functionality without actually
> needing to interact with ACPI -- we just get it to the point where it
> would start interacting with ACPI, and then have it return instead.
> From a "I'm positive this will continue to work" point of view it's
> not as satisfying as actually doing the suspend; but from a practical
> point of view, it will catch the vast majority of bugs in Xen (as
> opposed to hardware-specific quirks); and it will run on any hardware
> (which means not having to do reliability testing).
> 
> IIRC Dario actually had a patch for something like this for his own
> testing at some point -- Dario, anything to add?
> 
Indeed I had a patch (it's originally from Ben, actually). I sent it,
so it can be found in list archives. And, in any case, I still have it
around and can resend it.

I did catch quite a few bugs with it back then.

> What if we 1) have two versions of the test -- "Fake suspend" and
> "Real Suspend"; 2) only run "Real suspend" on hardware specifically
> marked as having a suspend that works reliably; 3) default all
> hardware to 'false' until we do some testing to find out how reliable
> it is?
> 
> That way we get suspend testing 95% effective as quickly as possible,
> and we can complete it as we have time.
> 
That sounds like a very good plan to me, FWIW.

Regards,
Dario
-- 
<> (Raistlin Majere)
-
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread Dario Faggioli
On Thu, 2018-05-17 at 16:12 +0100, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("Re: Test for osstest, features
> used in Qubes OS"):
> > On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
> > > Is it likely that this will depend on non-buggy host firmware
> > > ?  If so
> > > then we need to make arrangements to test it and only do it on
> > > hosts
> > > which are not buggy.  In practice this probably means wiring it
> > > up to
> > > the automatic host examiner.
> > 
> > Yes, probably.
> 
> That's not entirely trivial then, especially for you, unless you want
> to set up your own osstest production instance.  However, I can
> probably do the osstest-machinery work if you will help debug it,
> review logs, tell me what to do next, etc. :-).
> 
I'm not sure which firmware bugs we're talking about, but the
problem I had when trying to test S3 suspend/resume in osstest
was that most server-class hardware I could find did not support
it.

If that's the bug you're talking about, yes, I agree it's not trivial.
:-) (Although I did not actually check the boxes in the MA colo; these
were just servers from Citrix's lab.)

There's a (non-perfect) workaround, though, as George suggests, which
would allow us to run a "quasi-suspend" test at every flight on every
hardware.

Regards,
Dario
-- 
<> (Raistlin Majere)
-
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-21 Thread George Dunlap
On Thu, May 17, 2018 at 4:12 PM, Ian Jackson  wrote:
> Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
> Qubes OS"):
>> On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
>> > Is it likely that this will depend on non-buggy host firmware ?  If so
>> > then we need to make arrangements to test it and only do it on hosts
>> > which are not buggy.  In practice this probably means wiring it up to
>> > the automatic host examiner.
>>
>> Yes, probably.
>
> That's not entirely trivial then, especially for you, unless you want
> to set up your own osstest production instance.  However, I can
> probably do the osstest-machinery work if you will help debug it,
> review logs, tell me what to do next, etc. :-).

I'm pretty sure it would be possible to test the Xen "get ready for
suspend" and "resume from suspend" functionality without actually
needing to interact with ACPI -- we just get it to the point where it
would start interacting with ACPI, and then have it return instead.
From a "I'm positive this will continue to work" point of view it's
not as satisfying as actually doing the suspend; but from a practical
point of view, it will catch the vast majority of bugs in Xen (as
opposed to hardware-specific quirks); and it will run on any hardware
(which means not having to do reliability testing).

IIRC Dario actually had a patch for something like this for his own
testing at some point -- Dario, anything to add?

What if we 1) have two versions of the test -- "Fake suspend" and
"Real Suspend"; 2) only run "Real suspend" on hardware specifically
marked as having a suspend that works reliably; 3) default all
hardware to 'false' until we do some testing to find out how reliable
it is?

That way we get suspend testing 95% effective as quickly as possible,
and we can complete it as we have time.
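
The selection rule proposed above could be sketched like this. It is only an illustration of the idea: the function name, the test names and the host names are hypothetical, and osstest's real host-property mechanism is not modelled here.

```python
from typing import Dict, List


def suspend_tests_for(host: str, real_suspend_ok: Dict[str, bool]) -> List[str]:
    """Pick the suspend tests to run on a host.

    The fake suspend test runs everywhere; the real one runs only on hosts
    explicitly marked as having reliable suspend. Unknown hosts default to
    False, as in point 3 of the proposal.
    """
    tests = ["fake-suspend"]
    if real_suspend_ok.get(host, False):
        tests.append("real-suspend")
    return tests


# Hypothetical host names; only host-a has been marked as reliable.
real_suspend_ok = {"host-a": True}
print(suspend_tests_for("host-a", real_suspend_ok))  # ['fake-suspend', 'real-suspend']
print(suspend_tests_for("host-b", real_suspend_ok))  # ['fake-suspend']
```

Defaulting to False means a newly added machine never runs the real suspend test until someone has checked that its firmware resumes reliably.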

 -George


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-18 Thread Marek Marczykowski
On Fri, May 18, 2018 at 09:54:37AM -0600, Jan Beulich wrote:
> >>> On 18.05.18 at 17:33,  wrote:
> > Yes, I'm happy to help with that. As I've said, the basic test is very
> > simple (rtcwake command) and already very useful. The fact that it is(?)
> > broken on staging doesn't make it easier,
> 
> Details on the breakage would be appreciated (on a separate thread),
> unless you plan to address it yourself. I recall Simon(?) mentioning this as
> well, but also not providing sufficient data to consider looking into it
> (perhaps simply because it wasn't easy to obtain useful data, as
> frequently is the case with S3 resume). I think it would be nice if we could
> release 4.11 without a regression here.

I only know that Simon has tested it and it fails. Cc'ing him.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-18 Thread Jan Beulich
>>> On 18.05.18 at 17:33,  wrote:
> Yes, I'm happy to help with that. As I've said, the basic test is very
> simple (rtcwake command) and already very useful. The fact that it is(?)
> broken on staging doesn't make it easier,

Details on the breakage would be appreciated (on a separate thread),
unless you plan to address it yourself. I recall Simon(?) mentioning this as
well, but also not providing sufficient data to consider looking into it
(perhaps simply because it wasn't easy to obtain useful data, as
frequently is the case with S3 resume). I think it would be nice if we could
release 4.11 without a regression here.

Jan




Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-18 Thread Marek Marczykowski-Górecki
On Thu, May 17, 2018 at 08:00:38PM +0200, Sander Eikelenboom wrote:
> Marek / Ian,
> 
> Nice to see PCI-passthrough getting some attention again.
> 
> On 17/05/18 17:12, Ian Jackson wrote:
> > Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
> > Qubes OS"):
> >> On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
> >>> Is there some kind of cheap USB HID, that is interactable-with, which
> >>> we could plug into each machine's USB port ?  I'm slightly concerned
> >>> that plugging in a storage device, or connecting the other NIC, might
> >>> interfere with booting.
> >>
> >> I use mass storage for tests... But if you use network boot, it
> >> shouldn't really interfere, no?
> > 
> > We do both network boot and disk boot.  I think the BIOS disk boot has
> > to continue to work and boot the HDD.

In fact, any device should be enough for a start, a USB mouse for
example. Just reading the USB descriptor involves some communication with
the controller, so it should give some indication of its state.

> As a user of pci-passthrough for quite some time and reporting some
> pci-passthrough bugs in the past, I do have some comments:
> 
> - First of all it would be very nice to get some autotesting :).
> - But if you want to thoroughly test pci-passthrough, it will be far
>   from easy since there is quite a multi-dimensional support matrix
>   (I'm not implying that everything should be done or it won't be
>   valuable if any is missing, it's only meant for reference):
>   1) Guest side implementation:
>      - PV guest (pcifront)
>      - HVM (qemu-traditional)
>      - HVM (qemu-xen)
>      - HVM (qemu-upstream)
>      - perhaps PVH support for pci passthrough coming around the corner.
> 
>   2) (Un)Binding method to pciback:
>      - binding pci devices to pciback on host boot (command line)
>      - de/re/unbinding devices from dom0 while running.
> 
>   3) (Un)binding to guest:
>      - On guest start (guest.cfg pci=[...])
>      - After the guest has been started with 'xl pci-*' commands
>   3) Device interrupts: legacy versus MSI versus MSI-X
>   4) Other pci device features: roms, BAR sizes, etc.
>   5) AMD versus Intel IOMMU
> 
> From the past reports, I know (1) and (3) did matter (problems being
> isolated to one of these variants only).

Yes, that's right, my experience is similar. Point 3 (device interrupts)
is especially tricky, as some devices (or rather: drivers) don't
correctly fall back to legacy interrupts if MSI/MSI-X isn't available.
So the ideal test should check those things too, i.e. whether the guest
driver really uses what it's expected to use. But let's start with
something first. I don't know how osstest handles this yet, but I'd
expect that adding more guest configurations to run the same test on
should be easy.
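
To give a feel for how quickly the matrix quoted above grows, here is a minimal sketch that enumerates it. The dimension names and the `configurations` helper are hypothetical; the values are taken from Sander's list, leaving out PVH (not available yet) and the open-ended "other pci device features" item.

```python
from itertools import product
from typing import Dict, Iterator, List

# Dimensions from the quoted list; the key names are made up for this sketch.
DIMENSIONS: Dict[str, List[str]] = {
    "guest": ["PV (pcifront)", "HVM (qemu-traditional)",
              "HVM (qemu-xen)", "HVM (qemu-upstream)"],
    "pciback_binding": ["host boot (command line)", "rebind from dom0 while running"],
    "guest_attach": ["guest start (pci=[...])", "after start (xl pci-*)"],
    "interrupts": ["legacy", "MSI", "MSI-X"],
    "iommu": ["AMD", "Intel"],
}


def configurations(dimensions: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per combination of values across all dimensions."""
    keys = list(dimensions)
    for values in product(*(dimensions[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(DIMENSIONS))
print(len(configs))  # 4 * 2 * 2 * 3 * 2 = 96
```

Even this reduced matrix has 96 combinations, which supports starting with a single configuration and widening it later.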

> As for restarting guests and reassigning pci-devices again to other
> guests the current pciback reset support lacks the bus-reset patches at
> present in upstream linux kernels. Passthrough of AMD Radeon graphics
> adapters works only one time without it (if you stop and restart a guest
> it doesn't work anymore and you need to reboot the host).
> With the bus-reset patches (which have been posted to the list and seem
> to be in both Qubes and Xenserver in some form but not in upstream
> linux). Someone from Oracle had picked them up to get them upstream some
> time ago, but that effort seems to have stalled.

Can you point to the specific patches you are talking about? In Qubes,
device reset is handled by libvirt in most cases...

> The code in libxl seems to be quite messy for pci-passthrough especially
> for handling all the guest side implementations (1) and xenstore
> interactions that go with it (or don't for qemu).
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-18 Thread Marek Marczykowski-Górecki
On Thu, May 17, 2018 at 04:12:09PM +0100, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
> Qubes OS"):
> > On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
> > > Is it likely that this will depend on non-buggy host firmware ?  If so
> > > then we need to make arrangements to test it and only do it on hosts
> > > which are not buggy.  In practice this probably means wiring it up to
> > > the automatic host examiner.
> > 
> > Yes, probably.
> 
> That's not entirely trivial then, especially for you, unless you want
> to set up your own osstest production instance.  However, I can
> probably do the osstest-machinery work if you will help debug it,
> review logs, tell me what to do next, etc. :-).

Yes, I'm happy to help with that. As I've said, the basic test is very
simple (the rtcwake command) and already very useful. The fact that it
is(?) broken on staging doesn't make it easier, but I think setting up the
test using the 4.8 branch first should be fine.
If you want to talk on IRC about it, just ping me by email first; I
don't have my IRC client running all the time.

In the meantime, I'll try to familiarize myself with osstest...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-17 Thread Sander Eikelenboom
Marek / Ian,

Nice to see PCI-passthrough getting some attention again.

On 17/05/18 17:12, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
> Qubes OS"):
>> On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
>>> Is it likely that this will depend on non-buggy host firmware ?  If so
>>> then we need to make arrangements to test it and only do it on hosts
>>> which are not buggy.  In practice this probably means wiring it up to
>>> the automatic host examiner.
>>
>> Yes, probably.
> 
> That's not entirely trivial then, especially for you, unless you want
> to set up your own osstest production instance.  However, I can
> probably do the osstest-machinery work if you will help debug it,
> review logs, tell me what to do next, etc. :-).
> 
>>> Is there some kind of cheap USB HID, that is interactable-with, which
>>> we could plug into each machine's USB port ?  I'm slightly concerned
>>> that plugging in a storage device, or connecting the other NIC, might
>>> interfere with booting.
>>
>> I use mass storage for tests... But if you use network boot, it
>> shouldn't really interfere, no?
> 
> We do both network boot and disk boot.  I think the BIOS disk boot has
> to continue to work and boot the HDD.

As someone who has used pci-passthrough for quite some time and has
reported some pci-passthrough bugs in the past, I do have some comments:

- First of all it would be very nice to get some autotesting :).
- But if you want to thoroughly test pci-passthrough,
  it will be far from easy since there is quite a multi-dimensional
  support matrix
  (I'm not implying that everything should be done or it won't be valuable
   if any is missing, it's only meant for reference):
  1) Guest side implementation:
     - PV guest (pcifront)
     - HVM (qemu-traditional)
     - HVM (qemu-xen)
     - HVM (qemu-upstream)
     - perhaps PVH support for pci passthrough coming around the corner.

  2) (Un)Binding method to pciback:
     - binding pci devices to pciback on host boot (command line)
     - de/re/unbinding devices from dom0 while running.

  3) (Un)binding to guest:
     - On guest start (guest.cfg pci=[...])
     - After the guest has been started with 'xl pci-*' commands
  3) Device interrupts: legacy versus MSI versus MSI-X
  4) Other pci device features: roms, BAR sizes, etc.
  5) AMD versus Intel IOMMU

From the past reports, I know (1) and (3) did matter (problems being
isolated to one of these variants only).
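
To give a feel for how quickly this matrix grows, here is a minimal Python
sketch that enumerates the combinations listed above. The axis labels are
illustrative shorthand for the list items, not identifiers from Xen or
osstest. The open-ended "other pci device features" item is left out, and so
is PVH, since it is not available yet.

```python
from itertools import product

# Axes taken from the list above; the label strings are made up for
# illustration only.
AXES = {
    "guest": ["pv-pcifront", "hvm-qemu-traditional",
              "hvm-qemu-xen", "hvm-qemu-upstream"],
    "pciback_binding": ["boot-cmdline", "runtime"],
    "guest_binding": ["guest-cfg", "xl-hotplug"],
    "interrupts": ["legacy", "msi", "msi-x"],
    "iommu": ["amd", "intel"],
}


def passthrough_matrix(axes=AXES):
    """Return every combination of the axes as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]
```

Even with these five axes alone, len(passthrough_matrix()) is 96, so a
test flight would probably have to sample the matrix rather than cover it.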


As for restarting guests and reassigning PCI devices to other guests: the
pciback reset support in upstream Linux kernels currently lacks the
bus-reset patches. Without them, passthrough of AMD Radeon graphics
adapters works only once (if you stop and restart a guest it no longer
works and you need to reboot the host). The bus-reset patches have been
posted to the list and seem to be in both Qubes and XenServer in some
form, but not in upstream Linux. Someone from Oracle had picked them up to
get them upstream some time ago, but that effort seems to have stalled.

The code in libxl for pci-passthrough seems to be quite messy, especially
the handling of all the guest side implementations (1) and the xenstore
interactions that go with them (or don't, for qemu).

--
Sander

 
>>> If you want to get pci passthrough tests working I would suggest
>>> testing it with non-stubdom first.  I assume the config etc. is the
>>> same, so having got that working, osstest would be able to test it for
>>> the stubdom tests too.
>>
>> Oh, I thought there were already tests for that...
> 
> There are no PCI passthrough tests at all.  For a while we had some
> SRIOV NIC tests which were requested by Intel.  But they always failed
> giving kernel stack dumps.  We kept poking Intel to get them to fix
> them, or tell us how the tests were wrong, but to no avail.  So we
> dropped them.
> 
> So any work in this area would be greatly appreciated!
> 
> Ian.
> 
> 



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-17 Thread Ian Jackson
Marek Marczykowski-Górecki writes ("Re: Test for osstest, features used in 
Qubes OS"):
> On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
> > Is it likely that this will depend on non-buggy host firmware ?  If so
> > then we need to make arrangements to test it and only do it on hosts
> > which are not buggy.  In practice this probably means wiring it up to
> > the automatic host examiner.
> 
> Yes, probably.

That's not entirely trivial then, especially for you, unless you want
to set up your own osstest production instance.  However, I can
probably do the osstest-machinery work if you will help debug it,
review logs, tell me what to do next, etc. :-).

> > Is there some kind of cheap USB HID, that is interactable-with, which
> > we could plug into each machine's USB port ?  I'm slightly concerned
> > that plugging in a storage device, or connecting the other NIC, might
> > interfere with booting.
> 
> I use mass storage for tests... But if you use network boot, it
> shouldn't really interfere, no?

We do both network boot and disk boot.  I think the BIOS disk boot has
to continue to work and boot the HDD.

> > If you want to get pci passthrough tests working I would suggest
> > testing it with non-stubdom first.  I assume the config etc. is the
> > same, so having got that working, osstest would be able to test it for
> > the stubdom tests too.
> 
> Oh, I thought there were already tests for that...

There are no PCI passthrough tests at all.  For a while we had some
SRIOV NIC tests which were requested by Intel.  But they always failed
giving kernel stack dumps.  We kept poking Intel to get them to fix
them, or tell us how the tests were wrong, but to no avail.  So we
dropped them.

So any work in this area would be greatly appreciated!

Ian.


Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-17 Thread Marek Marczykowski-Górecki
On Thu, May 17, 2018 at 01:26:30PM +0100, Ian Jackson wrote:
> Marek Marczykowski-Górecki writes ("Test for osstest, features used in Qubes 
> OS"):
> > As discussed some time ago, I'd like to help with adding tests for some
> > features we use in Qubes OS.
> > 
> > IMO the easiest thing to test is host suspend. You just need to execute
> > "rtcwake -s 30 -m mem", and see if the host is back to life after ~30s.
> > Right now I know it works on Xen 4.8, but supposedly is broken on
> > staging (haven't tested the most recent version).
> > Next step would be the same while having some domains running.
> > 
> > How the test should look like (where to add this? etc)?
> 
> I guess this should be a new
>   ts-host-suspend-test
> script.
> 
> Is it likely that this will depend on non-buggy host firmware ?  If so
> then we need to make arrangements to test it and only do it on hosts
> which are not buggy.  In practice this probably means wiring it up to
> the automatic host examiner.

Yes, probably.

> > Next things would be mostly related to PCI passthrough:
> >  - PCI passthrough with qemu in stubdomain
> >  - the same as above, but with Linux-based stubdomain (we need cleanup
> >    and send patches for that first, probably 4.12 material)
> >  - guest suspend (recently added libxl_domain_suspend_only), for
> >    different guest types (PV, PVH, HVM), also with/without PCI device
> > 
> > For this, the machine obviously need to have IOMMU (I assume at least
> > some of the hardware used in test lab have it), and some spare PCI
> > device. I use sound card for some of such tests. But testing on USB
> > controllers would be more useful (from our experience, one of the most
> > problematic devices for suspend, sadly also lacking FLR or such...).
> 
> I doubt any of our x86 machines have sound cards. ...  Just looked at
> one and it says
>   00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core
>   Processor HD Audio Controller (rev 06)
> which is obviously mad.
> 
> I'm pretty sure they all have usb controllers.  Almost all of them
> have multiple NICs, often on different pci devices, although it is
> difficult to tell if a NIC not connected to anything is working.
> 
> Eg,
> 
>   02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
>   Connection (rev 03)
> 
>   03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
>   Connection (rev 03)
> 
> Is there some kind of cheap USB HID, that is interactable-with, which
> we could plug into each machine's USB port ?  I'm slightly concerned
> that plugging in a storage device, or connecting the other NIC, might
> interfere with booting.

I use mass storage for tests... But if you use network boot, it
shouldn't really interfere, no?

> If you want to get pci passthrough tests working I would suggest
> testing it with non-stubdom first.  I assume the config etc. is the
> same, so having got that working, osstest would be able to test it for
> the stubdom tests too.

Oh, I thought there were already tests for that...
Yes, good idea.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



Re: [Xen-devel] Test for osstest, features used in Qubes OS

2018-05-17 Thread Ian Jackson
Marek Marczykowski-Górecki writes ("Test for osstest, features used in Qubes 
OS"):
> As discussed some time ago, I'd like to help with adding tests for some
> features we use in Qubes OS.
> 
> IMO the easiest thing to test is host suspend. You just need to execute
> "rtcwake -s 30 -m mem", and see if the host is back to life after ~30s.
> Right now I know it works on Xen 4.8, but supposedly is broken on
> staging (haven't tested the most recent version).
> Next step would be the same while having some domains running.
> 
> How the test should look like (where to add this? etc)?

I guess this should be a new
  ts-host-suspend-test
script.
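
As a rough illustration of what such a check could do, here is a minimal
Python sketch built around the rtcwake invocation quoted above. It is not
osstest code (osstest has its own harness and conventions); the function
names and the 60 second slack are assumptions made for the example. It
times the suspend with the wall clock and treats the result as plausible
only if roughly the requested sleep time has passed.

```python
import subprocess
import time


def rtcwake_command(seconds=30, mode="mem"):
    """Build the rtcwake invocation quoted in the thread."""
    return ["rtcwake", "-s", str(seconds), "-m", mode]


def suspend_looks_ok(elapsed, requested, slack=60.0, tolerance=2.0):
    """Judge a suspend/resume cycle from the wall-clock time it took.

    Much less than `requested` suggests the host never really slept;
    much more than `requested + slack` suggests resume was very slow.
    `slack` and `tolerance` are arbitrary values chosen for this sketch.
    """
    return requested - tolerance <= elapsed <= requested + slack


def run_host_suspend_test(seconds=30):
    """Suspend the host via rtcwake and report whether timing looks sane.

    Must run as root on the host under test. Wall-clock time is used
    because it keeps advancing across the suspend.
    """
    start = time.time()
    subprocess.run(rtcwake_command(seconds), check=True)
    return suspend_looks_ok(time.time() - start, seconds)
```

A host that never resumes will not return from run_host_suspend_test() at
all, so in practice the check would have to be driven, with a timeout, from
a machine other than the one being suspended.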

Is it likely that this will depend on non-buggy host firmware ?  If so
then we need to make arrangements to test it and only do it on hosts
which are not buggy.  In practice this probably means wiring it up to
the automatic host examiner.

> Next things would be mostly related to PCI passthrough:
>  - PCI passthrough with qemu in stubdomain
>  - the same as above, but with Linux-based stubdomain (we need cleanup
>    and send patches for that first, probably 4.12 material)
>  - guest suspend (recently added libxl_domain_suspend_only), for
>    different guest types (PV, PVH, HVM), also with/without PCI device
> 
> For this, the machine obviously need to have IOMMU (I assume at least
> some of the hardware used in test lab have it), and some spare PCI
> device. I use sound card for some of such tests. But testing on USB
> controllers would be more useful (from our experience, one of the most
> problematic devices for suspend, sadly also lacking FLR or such...).

I doubt any of our x86 machines have sound cards. ...  Just looked at
one and it says
  00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core
  Processor HD Audio Controller (rev 06)
which is obviously mad.

I'm pretty sure they all have usb controllers.  Almost all of them
have multiple NICs, often on different pci devices, although it is
difficult to tell if a NIC not connected to anything is working.

Eg,

  02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
  Connection (rev 03)

  03:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
  Connection (rev 03)
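
For reference, a sketch of the xl command sequence a test might run to hand
one such device to a guest and take it back again. The xl subcommands are
the standard PCI ones; the guest name "testguest" and the choice of
03:00.0 as the spare device are assumptions for the example, and the
helper only builds the command lines rather than running them.

```python
def passthrough_cycle(domain, bdf):
    """Return the xl commands for one assign/attach/detach/release cycle."""
    return [
        # Make the device assignable (binds it to pciback in dom0).
        ["xl", "pci-assignable-add", bdf],
        # Hot-plug it into the running guest.
        ["xl", "pci-attach", domain, bdf],
        # List the devices the guest currently has.
        ["xl", "pci-list", domain],
        # Take it away from the guest again.
        ["xl", "pci-detach", domain, bdf],
        # Release it; -r asks xl to rebind the original dom0 driver.
        ["xl", "pci-assignable-remove", "-r", bdf],
    ]


def as_shell_lines(commands):
    """Render the command lists as plain shell lines for a log or script."""
    return [" ".join(cmd) for cmd in commands]
```

Calling as_shell_lines(passthrough_cycle("testguest", "03:00.0")) gives the
five lines in order, which could be run one at a time with a check of the
exit status after each.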

Is there some kind of cheap USB HID, that is interactable-with, which
we could plug into each machine's USB port ?  I'm slightly concerned
that plugging in a storage device, or connecting the other NIC, might
interfere with booting.

If you want to get pci passthrough tests working I would suggest
testing it with non-stubdom first.  I assume the config etc. is the
same, so having got that working, osstest would be able to test it for
the stubdom tests too.
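
To illustrate the "config is the same" assumption, here is a small Python
sketch that renders a guest config with a passed-through device, with and
without a stubdomain. The keys are xl.cfg options as I understand them
(type= is the 4.10 spelling; older releases use builder=); the guest name,
memory size and BDF are placeholders, and a real config would also need
disks, vifs and so on.

```python
def render_guest_cfg(name, bdf, stubdom=False):
    """Render a minimal HVM guest config passing through one PCI device."""
    lines = [
        f'name = "{name}"',
        'type = "hvm"',
        "memory = 1024",
        f'pci = [ "{bdf}" ]',
    ]
    if stubdom:
        # Assumption: the stubdomain case uses qemu-traditional as the
        # device model and differs only by these two extra lines.
        lines.append('device_model_version = "qemu-xen-traditional"')
        lines.append("device_model_stubdomain_override = 1")
    return "\n".join(lines) + "\n"
```

If the two variants really differ only in the device model lines, a test
written against the non-stubdom output could be reused for the stubdom
case by flipping the one flag.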

Ian.
