[Xen-devel] Xen 4.7.4 released

2017-11-22 Thread Jan Beulich
All,

I am pleased to announce the release of Xen 4.7.4. This is
available immediately from its git repository
http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.7 
(tag RELEASE-4.7.4) or from the XenProject download page
https://xenproject.org/downloads/xen-archives/xen-project-47-series/xen-474.html
(where a list of changes can also be found).

We recommend that all users of the 4.7 stable series update to this
latest point release.

Regards, Jan




[Xen-devel] Xen 4.9.1 released

2017-11-22 Thread Jan Beulich
All,

I am pleased to announce the release of Xen 4.9.1. This is
available immediately from its git repository
http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.9
(tag RELEASE-4.9.1) or from the XenProject download page
http://www.xenproject.org/downloads/xen-archives/xen-project-49-series/xen-491.html
(where a list of changes can also be found).

We recommend that all users of the 4.9 stable series update to this
first point release.

Regards, Jan





Re: [Xen-devel] Linux as 32-bit Dom0?

2017-11-22 Thread Jan Beulich
>>> On 22.11.17 at 15:40,  wrote:
> On 22/11/17 15:05, Jan Beulich wrote:
>> Jürgen, Boris,
>> 
>> am I trying something that's not allowed, but selectable via Kconfig?
>> On a system with multiple IO-APICs (I assume that's what triggers the
>> problem) I get
>> 
>> Kernel panic - not syncing: Max apic_id exceeded!
> 
> Generally I don't think 32-bit dom0 is forbidden, but it is rarely used. I
> wouldn't be too sad if we decided to drop that support. ;-)
> 
> Can you please be a little bit more specific?
> 
> How many IO-APICs? From the code I guess this is an Intel system with
> not-too-recent IO-APIC versions (<0x14)?
> 
> Having a little bit more of the boot log might help, too.

Full log attached, which should answer all questions. This is
a Haswell system, so not too old an IO-APIC flavor I would say.

Jan


Attachment: minicom.Linux-4.14 (binary data)


[Xen-devel] Linux as 32-bit Dom0?

2017-11-22 Thread Jan Beulich
Jürgen, Boris,

am I trying something that's not allowed, but selectable via Kconfig?
On a system with multiple IO-APICs (I assume that's what triggers the
problem) I get

Kernel panic - not syncing: Max apic_id exceeded!

CPU: 0 PID: 0 Comm: swapper Not tainted 4.14.1-2017-11-21-xen0 #6
Hardware name: ...
Call Trace:
 ? show_stack+0x20/0x50
 ? dump_stack+0x7e/0xc0
 ? panic+0x99/0x220
 ? io_apic_get_unique_id+0x207/0x210
 ? __raw_callee_save_xen_restore_fl+0x6/0x8
 ? xen_flush_tlb_single+0x6f/0x80
 ? set_pte_vaddr+0xef/0x110
 ? xen_io_apic_read+0x36/0x90
 ? mp_register_ioapic+0x2b7/0x410
 ? acpi_os_map_iomem+0x14d/0x210
 ? acpi_parse_ioapic+0x6b/0x6f
 ? acpi_parse_entries_array+0xa1/0x16d
 ? acpi_tb_get_table+0x95/0x9d
 ? acpi_ut_release_mutex+0x109/0x10e
 ? acpi_table_parse_entries_array+0x98/0xa8
 ? acpi_table_parse_entries+0x30/0x35
 ? acpi_boot_init+0x65/0x65
 ? acpi_table_parse_madt+0x1b/0x1f
 ? acpi_boot_init+0x65/0x65
 ? acpi_parse_madt_ioapic_entries+0x4a/0x119
 ? mutex_lock+0x8/0x30
 ? acpi_process_madt+0xbe/0x105
 ? acpi_boot_init+0x3d/0x65
 ? setup_arch+0x67a/0x83f
 ? 0xc100
 ? start_kernel+0x46/0x361
 ? x86_early_init_platform_quirks+0x4d/0x90
 ? i386_start_kernel+0x22/0x88
 ? xen_start_kernel+0x453/0x639
(XEN) Hardware Dom0 crashed: 'noreboot' set - not rebooting.

I admit I have a few custom patches in that tree, but I'm reasonably
certain that none of them comes even close to having such an effect.
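
For reference, the panic comes from io_apic_get_unique_id() in
arch/x86/kernel/apic/io_apic.c (32-bit only path). A heavily abridged
sketch, from memory of 4.14-era code and illustrative only, of why
several IO-APICs can exhaust the APIC ID space:

static int io_apic_get_unique_id(int ioapic, int apic_id)
{
	static physid_mask_t apic_id_map = PHYSID_MASK_NONE;
	int i;

	if (physids_empty(apic_id_map))
		apic->ioapic_phys_id_map(&phys_cpu_present_map, &apic_id_map);

	/* Each CPU and each IO-APIC consumes one ID; if the requested
	 * one is taken, search for a free one below the broadcast limit. */
	if (apic->check_apicid_used(&apic_id_map, apic_id)) {
		for (i = 0; i < get_physical_broadcast(); i++)
			if (!apic->check_apicid_used(&apic_id_map, i))
				break;

		if (i == get_physical_broadcast())
			panic("Max apic_id exceeded!\n");

		apic_id = i;
	}

	physid_set(apic_id, apic_id_map);	/* reserve the chosen ID */
	return apic_id;
}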

Jan



Re: [Xen-devel] [PATCH v2] sync CPU state upon final domain destruction

2017-11-22 Thread Jan Beulich
>>> On 22.11.17 at 13:39,  wrote:
> See the code comment being added for why we need this.
> 
> This is being placed here to balance between the desire to prevent
> future similar issues (the risk of which would grow if it was put
> further down the call stack, e.g. in vmx_vcpu_destroy()) and the
> intention to limit the performance impact (otherwise it could also go
> into rcu_do_batch(), paralleling the use in do_tasklet_work()).
> 
> Reported-by: Igor Druzhinin 
> Signed-off-by: Jan Beulich 

I'm sorry, Julien, I did forget to Cc you (for 4.10 inclusion).

> ---
> v2: Move from vmx_vcpu_destroy() to complete_domain_destroy().
> 
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -794,6 +794,14 @@ static void complete_domain_destroy(stru
>      struct vcpu *v;
>      int i;
>  
> +    /*
> +     * Flush all state for the vCPU previously having run on the current CPU.
> +     * This is in particular relevant for x86 HVM ones on VMX, so that this
> +     * flushing of state won't happen from the TLB flush IPI handler behind
> +     * the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
> +     */
> +    sync_local_execstate();
> +
>      for ( i = d->max_vcpus - 1; i >= 0; i-- )
>      {
>          if ( (v = d->vcpu[i]) == NULL )





[Xen-devel] [PATCH v2] sync CPU state upon final domain destruction

2017-11-22 Thread Jan Beulich
See the code comment being added for why we need this.

This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).

Reported-by: Igor Druzhinin 
Signed-off-by: Jan Beulich 
---
v2: Move from vmx_vcpu_destroy() to complete_domain_destroy().

--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -794,6 +794,14 @@ static void complete_domain_destroy(stru
     struct vcpu *v;
     int i;
 
+    /*
+     * Flush all state for the vCPU previously having run on the current CPU.
+     * This is in particular relevant for x86 HVM ones on VMX, so that this
+     * flushing of state won't happen from the TLB flush IPI handler behind
+     * the back of a vmx_vmcs_enter() / vmx_vmcs_exit() section.
+     */
+    sync_local_execstate();
+
     for ( i = d->max_vcpus - 1; i >= 0; i-- )
     {
         if ( (v = d->vcpu[i]) == NULL )






Re: [Xen-devel] [PATCH 10/16] SUPPORT.md: Add Debugging, analysis, crash post-mortem

2017-11-22 Thread Jan Beulich
>>> On 21.11.17 at 19:19,  wrote:
> xentrace I would argue for security support; I've asked customers to
> send me xentrace data as part of analysis before.  I also know enough
> about it that I'm reasonably confident the risk of an attack vector is
> pretty low.

Knowing pretty little about xentrace I will trust you here. What I
was afraid of is that generally anything adding overhead can have
unintended side effects, all the more so with the (aiui) huge amounts
of data this may produce.

> I don't have a strong opinion on gdbsx; I'd call it 'supported', but if
> you think we need to exclude it from security support I'm happy with
> that as well.

Looks like on another sub-thread it was meanwhile agreed
to mark it as not security supported.

Jan




Re: [Xen-devel] [PATCH 08/16] SUPPORT.md: Add x86-specific virtual hardware

2017-11-22 Thread Jan Beulich
>>> On 21.11.17 at 19:02,  wrote:
> On 11/21/2017 08:39 AM, Jan Beulich wrote:
>>>>> On 13.11.17 at 16:41,  wrote:
>>> +### x86/Nested PV
>>> +
>>> +Status, x86 HVM: Tech Preview
>>> +
>>> +This means running a Xen hypervisor inside an HVM domain,
>>> +with support for PV L2 guests only
>>> +(i.e., hardware virtualization extensions not provided
>>> +to the guest).
>>> +
>>> +This works, but has performance limitations
>>> +because the L1 dom0 can only access emulated L1 devices.
>> 
>> So is this explicitly meaning Xen-on-Xen? Xen-on-KVM, for example,
>> could be considered "nested PV", too. IOW I think it needs to be
>> spelled out whether this means the host side of things here, the
>> guest one, or both.
> 
> Yes, that's true.  But I forget: Can a Xen dom0 use virtio guest
> drivers?  I'm pretty sure Stefano tried it at some point but I don't
> remember what the result was.

I have no idea at all.

>>> +### x86/Nested HVM
>>> +
>>> +Status, x86 HVM: Experimental
>>> +
>>> +This means running a Xen hypervisor inside an HVM domain,
>>> +with support for running both PV and HVM L2 guests
>>> +(i.e., hardware virtualization extensions provided
>>> +to the guest).
>> 
>> "Nested HVM" generally means more than using Xen as the L1
>> hypervisor. If this is really to mean just L1 Xen, I think the title
>> should already say so, not just the description.
> 
> Yes, I mean any sort of nested guest support here.

In which case would you mind inserting "for example"?

>>> +### x86/Advanced Vector eXtension
>>> +
>>> +Status: Supported
>> 
>> As indicated before, I think this either needs to be dropped or
>> be extended by an entry for virtually every CPUID bit exposed
>> to guests. Furthermore, in this isolated fashion it is not clear
>> what derived features (e.g. FMA, FMA4, AVX2, or even AVX-512)
>> it is meant to imply. If any of them are implied, "with caveats"
>> would need to be added as long as the instruction emulator isn't
>> capable of handling the instructions, yet.
> 
> Adding a section for CPUID bits supported (and to what level) sounds
> like a useful thing to do, perhaps in the next release.

May I suggest that, until then, the section above be dropped?

>>> +### x86/HVM EFI
>>> +
>>> +Status: Supported
>>> +
>>> +Booting a guest via guest EFI firmware
>> 
>> Shouldn't this say OVMF, to avoid covering possible other
>> implementations?
> 
> I don't expect that we'll ever need more than one EFI implementation in
> the tree.  If a time comes when it makes sense to have two, we can
> adjust the entry accordingly.

But that's part of my point - you say "in the tree", but this is a
separate tree, and there could be any number of separate ones.

Jan




Re: [Xen-devel] [PATCH 07/16] SUPPORT.md: Add virtual devices common to ARM and x86

2017-11-22 Thread Jan Beulich
>>> On 21.11.17 at 18:35,  wrote:
> On 11/21/2017 08:29 AM, Jan Beulich wrote:
>>> +### QEMU backend hotplugging for xl
>>> +
>>> +Status: Supported
>> 
>> Wouldn't this more appropriately be
>> 
>> ### QEMU backend hotplugging
>> 
>> Status, xl: Supported
> 
> You mean, for this whole section (i.e., everything here that says 'for
> xl')?  If not, why this one in particular?

Well, I had commented on the other two entries separately, and
from my other reply just sent it would follow that I'd rather see
those other two entries go away (information moved elsewhere).

Jan




Re: [Xen-devel] [PATCH 07/16] SUPPORT.md: Add virtual devices common to ARM and x86

2017-11-22 Thread Jan Beulich
>>> On 21.11.17 at 18:20,  wrote:
> On 11/21/2017 11:41 AM, Jan Beulich wrote:
>>>>> On 21.11.17 at 11:56,  wrote:
>>> On 11/21/2017 08:29 AM, Jan Beulich wrote:
>>>>>>> On 13.11.17 at 16:41,  wrote:
>>>>> +### PV USB support for xl
>>>>> +
>>>>> +Status: Supported
>>>>> +
>>>>> +### PV 9pfs support for xl
>>>>> +
>>>>> +Status: Tech Preview
>>>>
>>>> Why are these two being called out, but xl support for other device
>>>> types isn't?
>>>
>>> Do you see how big this document is? :-)  If you think something else
>>> needs to be covered, don't ask why I didn't mention it, just say what
>>> you think I missed.
>> 
>> Well, (not very) implicitly here: The same for all other PV protocols.
> 
> Oh, I see -- you didn't read my comment below the `---` pointing this
> out.  :-)

Oops, sorry.

> Yes, I wasn't quite sure what to do here.  We already list all the PV
> protocols in at least 2 places (frontend and backend support); it seemed
> a bit redundant to list them all again in xl and/or libxl support.
> 
> Except, of course, that there are a number of protocols *not* plumbed
> through the toolstack yet -- PVSCSI being one example.
> 
> Any suggestions would be welcome.

How about putting that as a note to the respective frontend /
backend entries? And then, wouldn't lack of xl support anyway
mean "experimental" at best?

Jan




Re: [Xen-devel] [PATCH v2 3/5] xen: Provide XEN_DMOP_pin_memory_cacheattr

2017-11-22 Thread Jan Beulich
>>> On 23.10.17 at 11:05,  wrote:
> --- a/xen/arch/x86/hvm/dm.c
> +++ b/xen/arch/x86/hvm/dm.c
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 

With this addition moved up a line to result in a properly sorted set
Reviewed-by: Jan Beulich 

However, the series should be extended by a patch removing the
no longer needed domctl (wiring the libxc function through to the
libxendevicemodel one if necessary), at once converting the
XEN_DOMCTL_MEM_CACHEATTR_* values to ones with names
suitable for use with this new interface (compatibility defines
in domctl.h would need to be retained until both qemu-s have had
their patches applied).
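
Concretely, such a conversion might look like the sketch below (the
XEN_DMOP_* names are hypothetical, not committed code, and the values
are assumed to match the existing domctl ones):

/* Hypothetical new names for use with the DMOP interface: */
#define XEN_DMOP_MEM_CACHEATTR_UC   0
#define XEN_DMOP_MEM_CACHEATTR_WC   1
#define XEN_DMOP_MEM_CACHEATTR_WT   4
#define XEN_DMOP_MEM_CACHEATTR_WP   5
#define XEN_DMOP_MEM_CACHEATTR_WB   6
#define XEN_DMOP_MEM_CACHEATTR_UCM  7

/* Compatibility defines, retained in domctl.h until both qemu-s have
 * had their patches applied: */
#define XEN_DOMCTL_MEM_CACHEATTR_UC   XEN_DMOP_MEM_CACHEATTR_UC
#define XEN_DOMCTL_MEM_CACHEATTR_WC   XEN_DMOP_MEM_CACHEATTR_WC
#define XEN_DOMCTL_MEM_CACHEATTR_WT   XEN_DMOP_MEM_CACHEATTR_WT
#define XEN_DOMCTL_MEM_CACHEATTR_WP   XEN_DMOP_MEM_CACHEATTR_WP
#define XEN_DOMCTL_MEM_CACHEATTR_WB   XEN_DMOP_MEM_CACHEATTR_WB
#define XEN_DOMCTL_MEM_CACHEATTR_UCM  XEN_DMOP_MEM_CACHEATTR_UCM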

Jan




Re: [Xen-devel] Ping: [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 18:00,  wrote:
> On Tue, 2017-11-21 at 08:29 -0700, Jan Beulich wrote:
>> > > > On 21.11.17 at 15:07,  wrote:
>> > 
>> > On 21/11/17 13:22, Jan Beulich wrote:
>> > > > > > On 09.11.17 at 15:49,  wrote:
>> > > > 
>> > > > See the code comment being added for why we need this.
>> > > > 
>> > > > Reported-by: Igor Druzhinin 
>> > > > Signed-off-by: Jan Beulich 
>> > > 
>> > > I realize we aren't settled yet on where to put the sync call. The
>> > > discussion appears to have stalled, though. Just to recap,
>> > > alternatives to the placement below are
>> > > - at the top of complete_domain_destroy(), being the specific
>> > >   RCU callback exhibiting the problem (others are unlikely to
>> > >   touch guest state)
>> > > - in rcu_do_batch(), paralleling the similar call from
>> > >   do_tasklet_work()
>> > 
>> > rcu_do_batch() sounds better to me. As I said before I think that the
>> > problem is general for the hypervisor (not for VMX only) and might
>> > appear in other places as well.
>> 
>> The question here is: In what other cases do we expect an RCU
>> callback to possibly touch guest state? I think the common use is
>> to merely free some memory in a delayed fashion.
>> 
>> > Those choices that you outlined appear to be different in terms of
>> > whether we solve the general problem (and probably have some minor
>> > performance impact) or we solve the ad-hoc problem but make the system
>> > more entangled. Here I'm more inclined to the first choice because in
>> > this particular scenario the performance impact should be negligible.
>> 
>> For the problem at hand there's no question about a
>> performance effect. The question is whether doing this for _other_
>> RCU callbacks would introduce a performance drop in certain cases.
> 
> So what are performance implications of my original suggestion of
> removing !v->is_running check from vmx_ctxt_switch_from() ?
> From what I can see:
> 
> 1. Another field in struct vcpu will be checked instead (vmcs_pa)
> 2. Additionally this_cpu(current_vmcs) will be loaded, which shouldn't
>be terrible, given how heavy a context switch already is.

There are no performance implications afaict; I'm simply of the
opinion this is not the way the issue should be addressed. The
sync approach seems much more natural to me.

Jan




Re: [Xen-devel] [PATCH v2 2/5] xen: Provide XEN_DMOP_add_to_physmap

2017-11-21 Thread Jan Beulich
>>> On 23.10.17 at 11:05,  wrote:

First of all, instead of xen: please consider using something more
specific, like x86/hvm:.

> --- a/xen/include/public/hvm/dm_op.h
> +++ b/xen/include/public/hvm/dm_op.h
> @@ -368,6 +368,22 @@ struct xen_dm_op_remote_shutdown {
> /* (Other reason values are not blocked) */
>  };
>  
> +/*
> + * XEN_DMOP_add_to_physmap : Sets the GPFNs at which a page range appears in
> + *   the specified guest's pseudophysical address

Talking of "pseudophysical" is at least confusing for HVM guests. So
far it was my understanding that such exists for PV guests only.

> + *   space. Identical to XENMEM_add_to_physmap with
> + *   space == XENMAPSPACE_gmfn_range.
> + */
> +#define XEN_DMOP_add_to_physmap 17
> +
> +struct xen_dm_op_add_to_physmap {
> +    uint16_t size; /* Number of GMFNs to process. */

Why limit this to 16 bits?

> +    uint16_t pad0;
> +    uint32_t pad1;
> +    uint64_aligned_t idx;  /* Index into GMFN space. */

Why would you call this "idx"? The other interface and its naming
should have no significance here. So perhaps "src_gfn" and ...

> +    uint64_aligned_t gpfn; /* Starting GPFN where the GMFNs should appear. */

... "dst_gfn"?

Jan




Re: [Xen-devel] [PATCH v2 1/5] xen/mm: Make xenmem_add_to_physmap global

2017-11-21 Thread Jan Beulich
>>> On 23.10.17 at 11:05,  wrote:
> Make it global in preparation to be called by a new dmop.
> 
> Signed-off-by: Ross Lagerwall 
> 
> ---
> Reviewed-by: Paul Durrant 

Misplaced tag.

I'd prefer if the function was made non-static in the patch which
needs it so, but anyway
Acked-by: Jan Beulich 

Jan




Re: [Xen-devel] Ping: [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 15:07,  wrote:
> On 21/11/17 13:22, Jan Beulich wrote:
>>>>> On 09.11.17 at 15:49,  wrote:
>>> See the code comment being added for why we need this.
>>>
>>> Reported-by: Igor Druzhinin 
>>> Signed-off-by: Jan Beulich 
>> 
>> I realize we aren't settled yet on where to put the sync call. The
>> discussion appears to have stalled, though. Just to recap,
>> alternatives to the placement below are
>> - at the top of complete_domain_destroy(), being the specific
>>   RCU callback exhibiting the problem (others are unlikely to
>>   touch guest state)
>> - in rcu_do_batch(), paralleling the similar call from
>>   do_tasklet_work()
> 
> rcu_do_batch() sounds better to me. As I said before I think that the
> problem is general for the hypervisor (not for VMX only) and might
> appear in other places as well.

The question here is: In what other cases do we expect an RCU
callback to possibly touch guest state? I think the common use is
to merely free some memory in a delayed fashion.

> Those choices that you outlined appear to be different in terms of
> whether we solve the general problem (and probably have some minor
> performance impact) or we solve the ad-hoc problem but make the system
> more entangled. Here I'm more inclined to the first choice because in
> this particular scenario the performance impact should be negligible.

For the problem at hand there's no question about a
performance effect. The question is whether doing this for _other_
RCU callbacks would introduce a performance drop in certain cases.

Jan




[Xen-devel] Ping#2: [PATCH] x86emul: keep compiler from using {x, y, z}mm registers itself

2017-11-21 Thread Jan Beulich
>>> On 06.11.17 at 16:04,  wrote:
> On 11/06/2017 11:59 AM, Jan Beulich wrote:
>>>>> On 16.10.17 at 14:42,  wrote:
>>>>>> On 16.10.17 at 14:37,  wrote:
>>>> On 16/10/17 13:32, Jan Beulich wrote:
>>>>> Since the emulator acts on the live hardware registers, we need to
>>>>> prevent the compiler from using them e.g. for inlined memcpy() /
>>>>> memset() (as gcc7 does). We can't, however, set this from the command
>>>>> line, as otherwise the 64-bit build would face issues with functions
>>>>> returning floating point values and being declared in standard headers.
>>>>>
>>>>> As the pragma isn't available prior to gcc6, we need to invoke it
>>>>> conditionally. Luckily up to gcc6 we haven't seen generated code access
>>>>> SIMD registers beyond what our asm()s do.
>>>>>
>>>>> Reported-by: George Dunlap 
>>>>> Signed-off-by: Jan Beulich 
>>>>> ---
>>>>> While this doesn't affect core functionality, I think it would still be
>>>>> nice for it to be allowed in for 4.10.
>>>>
>>>> Agreed.
>>>>
>>>> Has this been tested with Clang?
>>>
>>> Sorry, no - still haven't got around to set up a suitable Clang
>>> locally.
>>>
>>>>  It stands a good chance of being
>>>> compatible, but we may need an && !defined(__clang__) included.
>>>
>>> Should non-gcc silently ignore "#pragma GCC ..." it doesn't
>>> recognize, or not define __GNUC__ in the first place if it isn't
>>> sufficiently compatible? I.e. if anything I'd expect we need
>>> "#elif defined(__clang__)" to achieve the same for Clang by
>>> some different pragma (if such exists).
>> 
>> Not having received any reply so far, I'm wondering whether
>> being able to build the test harness with clang is more
>> important than for it to work correctly when built with gcc. I
>> can't predict when I would get around to set up a suitable
>> clang on my dev systems.
> 
> I agree with the argument you make above.  On the unlikely chance
> there's a problem Travis should catch it, and someone who actually has a
> clang setup can help sort it out.

I'm still lacking an ack, before it being sensible to check with Julien
whether this is still fine to go in at this late stage.
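
For context, the guard being discussed has roughly the following shape
(a sketch only - the exact option strings may differ in the actual
patch, and the clang branch is the open question above):

#if defined(__clang__)
/* No equivalent pragma identified yet - see the discussion above. */
#elif defined(__GNUC__) && __GNUC__ >= 6
/* Keep gcc >= 6 from using {x,y,z}mm registers in code it generates
 * itself, e.g. for inlined memcpy() / memset(). */
# pragma GCC target("no-sse", "no-avx")
#endif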

Jan




[Xen-devel] Ping: [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-21 Thread Jan Beulich
>>> On 09.11.17 at 15:49,  wrote:
> See the code comment being added for why we need this.
> 
> Reported-by: Igor Druzhinin 
> Signed-off-by: Jan Beulich 

I realize we aren't settled yet on where to put the sync call. The
discussion appears to have stalled, though. Just to recap,
alternatives to the placement below are
- at the top of complete_domain_destroy(), being the specific
  RCU callback exhibiting the problem (others are unlikely to
  touch guest state)
- in rcu_do_batch(), paralleling the similar call from
  do_tasklet_work()
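
The latter alternative would look roughly like this (a sketch, not a
patch; the loop body is abridged from memory of xen/common/rcupdate.c):

static void rcu_do_batch(struct rcu_data *rdp)
{
    struct rcu_head *list = rdp->donelist, *next;

    /*
     * Hypothetical placement: flush lazily-saved vCPU state before any
     * callback (e.g. complete_domain_destroy()) can touch guest
     * context, paralleling the existing call in do_tasklet_work().
     */
    sync_local_execstate();

    while ( list )
    {
        next = list->next;
        list->func(list);
        list = next;
    }
    rdp->donelist = NULL;
}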

Jan

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -479,7 +479,13 @@ static void vmx_vcpu_destroy(struct vcpu
>   * we should disable PML manually here. Note that vmx_vcpu_destroy is called
>   * prior to vmx_domain_destroy so we need to disable PML for each vcpu
>   * separately here.
> + *
> + * Before doing that though, flush all state for the vCPU previously having
> + * run on the current CPU, so that this flushing of state won't happen from
> + * the TLB flush IPI handler behind the back of a vmx_vmcs_enter() /
> + * vmx_vmcs_exit() section.
> + * the TLB flush IPI handler behind the back of a vmx_vmcs_enter() /
> + * vmx_vmcs_exit() section.
>   */
> +    sync_local_execstate();
>      vmx_vcpu_disable_pml(v);
>      vmx_destroy_vmcs(v);
>      passive_domain_destroy(v);
> 
> 
> 
> 






Re: [Xen-devel] [PATCH 04/16] SUPPORT.md: Add core ARM features

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 13:39,  wrote:
> What about something like this?
> 
> ### IOMMU
> 
> Status, AMD IOMMU: Supported
> Status, Intel VT-d: Supported
> Status, ARM SMMUv1: Supported
> Status, ARM SMMUv2: Supported

Fine with me, as it makes things explicit.

Jan




Re: [Xen-devel] [PATCH 03/16] SUPPORT.md: Add some x86 features

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 13:24,  wrote:
>> On Nov 21, 2017, at 11:35 AM, Jan Beulich 
>> Much depends on whether you think "guest" == "DomU". To me
>> Dom0 is a guest, too.
> 
> That’s not how I’ve ever understood those terms.
> 
> A guest at a hotel is someone who is served, and who does not have (legal) 
> access to the internals of the system.  The maids who clean the room and the 
> janitors who sweep the floors are hosts, because they have (to various 
> degrees) extra access designed to help them serve the guests.
> 
> A “guest” is a virtual machine that does not have access to the internals of 
> the system; that is the “target” of virtualization.  As such, the dom0 kernel 
> and all the toolstack / emulation code running in domain 0 are part of the 
> “host”.
> 
> Domain 0 is a domain and a VM, but only domUs are guests.

Okay then; just FTR I've always been considering "domain" ==
"guest" == "VM".

Jan



Re: [Xen-devel] [PATCH v2] tools/libxl: mark special pages as reserved in e820 map for PVH

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 12:48,  wrote:
> On 21/11/17 12:27, Jan Beulich wrote:
>>>>> On 21.11.17 at 12:06,  wrote:
>>> The "special pages" for PVH guests include the frames for console and
>>> Xenstore ring buffers. Those have to be marked as "Reserved" in the
>>> guest's E820 map, as otherwise conflicts might arise later e.g. when
>>> hotplugging memory into the guest.
>> 
>> Afaict this differs from v1 only in no longer adding the extra entry
>> for HVM. How does this address the concerns raised on v1 wrt spec
>> compliance? v1 having caused problems with hvmloader should not
>> have resulted in simply excluding HVM here. That's even more so
>> because we mean HVM and PVH to converge in the long run - I'd
>> expect that to mean that no clear type distinction would exist
>> anymore on libxl.
> 
> The difference is for HVM the HVMloader is creating the additional
> "Reserved" entry.
> 
>> If you want to reserve Xenstore ring buffer and console page,
>> why don't you reserve just the two (provided of course they
>> live outside of any [fake] PCI device BAR), which then ought to
>> also be compatible with plain HVM?
> 
> For PVH the "mmio" area is starting with the LAPIC and extends up
> to 4GB.

Oh, I see - that's probably okay then (or at least as okay as
libxl having knowledge of the LAPIC base address in the first
place).

Jan




Re: [Xen-devel] [PATCH 07/16] SUPPORT.md: Add virtual devices common to ARM and x86

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 11:56,  wrote:
> On 11/21/2017 08:29 AM, Jan Beulich wrote:
>>>>> On 13.11.17 at 16:41,  wrote:
>>> +### PV USB support for xl
>>> +
>>> +Status: Supported
>>> +
>>> +### PV 9pfs support for xl
>>> +
>>> +Status: Tech Preview
>> 
>> Why are these two being called out, but xl support for other device
>> types isn't?
> 
> Do you see how big this document is? :-)  If you think something else
> needs to be covered, don't ask why I didn't mention it, just say what
> you think I missed.

Well, (not very) implicitly here: The same for all other PV protocols.

Jan




Re: [Xen-devel] [PATCH 04/16] SUPPORT.md: Add core ARM features

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 11:45,  wrote:
> On 11/21/2017 08:11 AM, Jan Beulich wrote:
>>>>> On 13.11.17 at 16:41,  wrote:
>>> +### ARM/SMMUv1
>>> +
>>> +Status: Supported
>>> +
>>> +### ARM/SMMUv2
>>> +
>>> +Status: Supported
>> 
>> Do these belong here, when IOMMU isn't part of the corresponding
>> x86 patch?
> 
> Since there was recently a time when these weren't supported, I think
> it's useful to have them in here.  (Julien, let me know if you think
> otherwise.)
> 
> Do you think it would be useful to include an IOMMU line for x86?

At this point of the series I would surely have said "yes". The
later PCI passthrough additions state this implicitly at least (by
requiring an IOMMU for passthrough to be supported at all).
But even then saying so explicitly may be better.

Jan




Re: [Xen-devel] [PATCH 03/16] SUPPORT.md: Add some x86 features

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 11:42,  wrote:
> On 11/21/2017 08:09 AM, Jan Beulich wrote:
>>>>> On 13.11.17 at 16:41,  wrote:
>>> +### x86/PVH guest
>>> +
>>> +Status: Supported
>>> +
>>> +PVH is a next-generation paravirtualized mode 
>>> +designed to take advantage of hardware virtualization support when 
>>> possible.
>>> +During development this was sometimes called HVMLite or PVHv2.
>>> +
>>> +Requires hardware virtualisation support (Intel VMX / AMD SVM)
>> 
>> I think it needs to be said that only DomU is considered supported.
>> Dom0 is perhaps not even experimental at this point, considering
>> the panic() in dom0_construct_pvh().
> 
> Indeed, that's why dom0 PVH isn't in the list, and why this says 'PVH
> guest', and is in the 'Guest Type' section.  We generally don't say,
> "Oh, and we don't have this feature at all".
> 
> If you think it's important we could add a sentence here explicitly
> stating that dom0 PVH isn't supported, but I sort of feel like it isn't
> necessary.

Much depends on whether you think "guest" == "DomU". To me
Dom0 is a guest, too.

Jan




Re: [Xen-devel] [PATCH 02/16] SUPPORT.md: Add core functionality

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 11:36,  wrote:
> On 11/21/2017 08:03 AM, Jan Beulich wrote:
>>>>> On 13.11.17 at 16:41,  wrote:
>>> --- a/SUPPORT.md
>>> +++ b/SUPPORT.md
>>> @@ -16,6 +16,65 @@ for the definitions of the support status levels etc.
>>>  
>>>  # Feature Support
>>>  
>>> +## Memory Management
>>> +
>>> +### Memory Ballooning
>>> +
>>> +Status: Supported
>> 
>> Is this a proper feature in the context we're talking about? To me
>> it's meaningful in guest OS context only. I also wouldn't really
>> consider it "core", but placement within the series clearly is a minor
>> aspect.
>> 
>> I'd prefer this to be dropped altogether as a feature, but
> 
> This doesn't make any sense to me.  Allowing a guest to modify its own
> memory requires a *lot* of support, spread throughout the hypervisor;
> and there are a huge number of recent security holes that would have
> been much more difficult to exploit if guests didn't have the ability to
> balloon up or down.
> 
> If what you mean is *specifically* the technique of making a "memory
> balloon" to trick the guest OS into handing back memory without knowing
> it, then it's just a matter of semantics.  We could call this "dynamic
> memory control" or something like that if you prefer (although we'd have
> to mention ballooning in the description to make sure people can find it).

Indeed I'd prefer the alternative naming: Outside of p2m-pod.c there's
no mention of the term "balloon" in any of the hypervisor source files.
Furthermore this "dynamic memory control" can be used for things other
than ballooning, all of which I think is (to be) supported.

Jan




Re: [Xen-devel] [PATCH v2] tools/libxl: mark special pages as reserved in e820 map for PVH

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 12:06,  wrote:
> The "special pages" for PVH guests include the frames for console and
> Xenstore ring buffers. Those have to be marked as "Reserved" in the
> guest's E820 map, as otherwise conflicts might arise later e.g. when
> hotplugging memory into the guest.

Afaict this differs from v1 only in no longer adding the extra entry
for HVM. How does this address the concerns raised on v1 wrt spec
compliance? v1 having caused problems with hvmloader should not
have resulted in simply excluding HVM here. That's even more so
because we mean HVM and PVH to converge in the long run - I'd
expect that to mean that no clear type distinction would exist
anymore on libxl.

If you want to reserve Xenstore ring buffer and console page,
why don't you reserve just the two (provided of course they
live outside of any [fake] PCI device BAR), which then ought to
also be compatible with plain HVM?
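
A minimal sketch of that suggestion (hypothetical helper, not the
submitted patch; struct e820entry and E820_RESERVED as in Xen's e820
definitions, with the frame numbers supplied by the caller):

#define SKETCH_PAGE_SHIFT 12 /* 4kB frames; assumption for the example */

static void reserve_ring_frames(struct e820entry *map, unsigned int *nr,
                                uint64_t console_gfn, uint64_t xenstore_gfn)
{
    /* One single-page "Reserved" entry per special frame, instead of
     * reserving the whole special-pages region. */
    map[(*nr)++] = (struct e820entry){
        .addr = console_gfn << SKETCH_PAGE_SHIFT,
        .size = 1ULL << SKETCH_PAGE_SHIFT,
        .type = E820_RESERVED,
    };
    map[(*nr)++] = (struct e820entry){
        .addr = xenstore_gfn << SKETCH_PAGE_SHIFT,
        .size = 1ULL << SKETCH_PAGE_SHIFT,
        .type = E820_RESERVED,
    };
}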

Jan




Re: [Xen-devel] [PATCH 16/16] SUPPORT.md: Add limits RFC

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### Virtual CPUs
> +
> +Limit, x86 PV: 8192
> +Limit-security, x86 PV: 32
> +Limit, x86 HVM: 128
> +Limit-security, x86 HVM: 32

Personally I consider the "Limit-security" numbers too low here, but
I have no proof that higher numbers will work _in all cases_.

> +### Virtual RAM
> +
> +Limit-security, x86 PV: 2047GiB

I think this needs splitting for 64- and 32-bit (the latter can go up
to 168Gb only on hosts with no memory past the 168Gb boundary,
and up to 128Gb only on larger ones, without this being a processor
architecture limitation).

> +### Event Channel FIFO ABI
> +
> +Limit: 131072

Are we certain this is a security supportable limit? There is at least
one loop (in get_free_port()) which can potentially have this number
of iterations.
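
The loop in question, heavily abridged (from memory of
xen/common/event_channel.c; illustrative only) - with the FIFO ABI this
can mean up to 131072 iterations per allocation:

    for ( port = 0; port_is_valid(d, port); port++ )
    {
        if ( port > d->max_evtchn_port )
            return -ENOSPC;
        if ( evtchn_from_port(d, port)->state == ECS_FREE )
            return port;
    }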

That's already leaving aside the one in the 'e' key handler. Speaking
of which - I think we should state somewhere that there's no security
support if any key whatsoever was sent to Xen via the console or
the sysctl interface.

And more generally - surely there are items that aren't present in
the series and no-one can realistically spot right away. What do we
mean to imply for functionality not covered in the doc? One thing
coming to mind here are certain command line options, an example
being "sync_console" - the description states "not suitable for
production environments", but I think this should be tightened to
exclude security support.

Jan




Re: [Xen-devel] [PATCH 14/16] SUPPORT.md: Add statement on PCI passthrough

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### x86/PCI Device Passthrough
> +
> +Status: Supported, with caveats

I think this wants to be

### PCI Device Passthrough

Status, x86 HVM: Supported, with caveats
Status, x86 PV: Supported, with caveats

to (a) allow later extending for ARM and (b) exclude PVH (assuming
that its absence means non-existing code).

> +Only systems using IOMMUs will be supported.
> +
> +Not compatible with migration, altp2m, introspection, memory sharing, or 
> memory paging.

And PoD, iirc.

With these adjustments (or substantially similar ones)
Acked-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH 13/16] SUPPORT.md: Add secondary memory management features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> Signed-off-by: George Dunlap 

Wouldn't PoD belong here too? With that added as supported on x86
HVM
Acked-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH 12/16] SUPPORT.md: Add Security-releated features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> With the exception of driver domains, which depend on PCI passthrough,
> and will be introduced later.
> 
> Signed-off-by: George Dunlap 

Shouldn't we also explicitly exclude tool stack disaggregation here,
with reference to XSA-77?

Jan




Re: [Xen-devel] [PATCH 11/16] SUPPORT.md: Add 'easy' HA / FT features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### x86/vMCE
> +
> +Status: Supported
> +
> +Forward Machine Check Exceptions to Appropriate guests

Acked-by: Jan Beulich 
perhaps with the A converted to lower case.

Jan




Re: [Xen-devel] [PATCH 10/16] SUPPORT.md: Add Debugging, analysis, crash post-mortem

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -152,6 +152,35 @@ Output of information in machine-parseable JSON format
>  
>  Status: Supported, Security support external
>  
> +## Debugging, analysis, and crash post-mortem
> +
> +### gdbsx
> +
> +Status, x86: Supported
> +
> +Debugger to debug ELF guests
> +
> +### Soft-reset for PV guests
> +
> +Status: Supported
> +
> +Soft-reset allows a new kernel to start 'from scratch' with a fresh VM 
> state, 
> +but with all the memory from the previous state of the VM intact.
> +This is primarily designed to allow "crash kernels", 
> +which can do core dumps of memory to help with debugging in the event of a 
> crash.
> +
> +### xentrace
> +
> +Status, x86: Supported
> +
> +Tool to capture Xen trace buffer data
> +
> +### gcov
> +
> +Status: Supported, Not security supported

I agree with excluding security support here, but why wouldn't the
same be the case for gdbsx and xentrace?

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-21 Thread Jan Beulich
>>> On 21.11.17 at 09:13,  wrote:
> On 21/11/17 08:50, Jan Beulich wrote:
>>>>> On 20.11.17 at 19:28,  wrote:
>>> On 20/11/17 17:14, Jan Beulich wrote:
>>>>>>> On 20.11.17 at 16:24,  wrote:
>>>>> So without my patch the RSDP table is loaded e.g. at about 6.5MB when
>>>>> I'm using grub2 (the loaded grub image is about 5.5MB in size and it
>>>>> is being loaded at 1MB).
>>>>>
>>>>> When I'm using the PVH Linux kernel directly RSDP is just below 1MB
>>>>> due to pure luck (the bzImage loader is still using the PV specific
>>>>> ELF notes and this results in the loader believing RSDP is loadable
>>>>> at this address, which is true, but the tests used to come to this
>>>>> conclusion are just not applicable for PVH).
>>>>>
>>>>> So in your opinion we should revoke the PVH support from Xen 4.10,
>>>>> Linux and maybe BSD because RSDP is loaded in middle of RAM of the
>>>>> guest?
>>>>
>>>> So what's wrong with it being put wherever the next free memory
>>>> location is being determined to be by the loader, just like is being
>>>> done for other information, including modules (if any)?
>>>
>>> The RSDP table is marked as "Reserved" in the memory map. So putting it
>>> somewhere in the middle of the guest's memory will force the guest to
>>> use 4kB pages instead of 2MB or even 1GB pages. I'd really like to avoid
>>> this problem, as we've been hit by the very same in HVM guests before
>>> causing quite measurable performance drops.
>> 
>> This is a valid point.
>> 
>>> So I'd rather put it in the first MB as most kernels have to deal with
>>> small pages at beginning of RAM today. An alternative would be to put
>>> it just below 4GB where e.g. the console and Xenstore page are located.
>> 
>> Putting it in the first Mb implies that mappings there will continue to
>> be 4k ones. I can't, however, see why for PVH that should be
>> necessary: There's no BIOS and nothing legacy that needs to live
>> there, so unlike HVM it could benefit from using a 1Gb mapping
>> even at address zero (even if this might be something that can't
>> be achieved right away). So yes, if anything, the allocation should
>> be made top down starting from 4Gb. Otoh, I don't see a strict
>> need for this area to live below 4Gb in the first place.
> 
> The physical RSDP address in the PVH start info block is 32 bits
> only. So it can't be above 4GB.

struct hvm_start_info {
    uint32_t magic;         /* Contains the magic value 0x336ec578       */
                            /* ("xEn3" with the 0x80 bit of the "E" set).*/
    uint32_t version;       /* Version of this structure.                */
    uint32_t flags;         /* SIF_xxx flags.                            */
    uint32_t nr_modules;    /* Number of modules passed to the kernel.   */
    uint64_t modlist_paddr; /* Physical address of an array of           */
                            /* hvm_modlist_entry.                        */
    uint64_t cmdline_paddr; /* Physical address of the command line.     */
    uint64_t rsdp_paddr;    /* Physical address of the RSDP ACPI data    */
                            /* structure.                                */
};

Granted a comment a few lines up in the public header says "NB: Xen
on x86 will always try to place all the data below the 4GiB boundary."

Jan




Re: [Xen-devel] [PATCH 08/16] SUPPORT.md: Add x86-specific virtual hardware

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### x86/Nested PV
> +
> +Status, x86 HVM: Tech Preview
> +
> +This means running a Xen hypervisor inside an HVM domain,
> +with support for PV L2 guests only
> +(i.e., hardware virtualization extensions not provided
> +to the guest).
> +
> +This works, but has performance limitations
> +because the L1 dom0 can only access emulated L1 devices.

So is this explicitly meaning Xen-on-Xen? Xen-on-KVM, for example,
could be considered "nested PV", too. IOW I think it needs to be
spelled out whether this means the host side of things here, the
guest one, or both.

> +### x86/Nested HVM
> +
> +Status, x86 HVM: Experimental
> +
> +This means running a Xen hypervisor inside an HVM domain,
> +with support for running both PV and HVM L2 guests
> +(i.e., hardware virtualization extensions provided
> +to the guest).

"Nested HVM" generally means more than using Xen as the L1
hypervisor. If this is really to mean just L1 Xen, I think the title
should already say so, not just the description.

> +### x86/Advanced Vector eXtension
> +
> +Status: Supported

As indicated before, I think this either needs to be dropped or
be extended by an entry for virtually every CPUID bit exposed
to guests. Furthermore, in this isolated fashion it is not clear
what derived features (e.g. FMA, FMA4, AVX2, or even AVX-512)
it is meant to imply. If any of them are implied, "with caveats"
would need to be added as long as the instruction emulator isn't
capable of handling the instructions, yet.

> +### x86/HVM EFI
> +
> +Status: Supported
> +
> +Booting a guest via guest EFI firmware

Shouldn't this say OVMF, to avoid covering possible other
implementations?

Jan




Re: [Xen-devel] [PATCH 07/16] SUPPORT.md: Add virtual devices common to ARM and x86

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### PV USB support for xl
> +
> +Status: Supported
> +
> +### PV 9pfs support for xl
> +
> +Status: Tech Preview

Why are these two being called out, but xl support for other device
types isn't?

> +### QEMU backend hotplugging for xl
> +
> +Status: Supported

Wouldn't this more appropriately be

### QEMU backend hotplugging

Status, xl: Supported

?

> +## Virtual driver support, guest side
> +
> +### Blkfront
> +
> +Status, Linux: Supported
> +Status, FreeBSD: Supported, Security support external
> +Status, NetBSD: Supported, Security support external
> +Status, Windows: Supported
> +
> +Guest-side driver capable of speaking the Xen PV block protocol
> +
> +### Netfront
> +
> +Status, Linux: Supported
> +Status, Windows: Supported
> +Status, FreeBSD: Supported, Security support external
> +Status, NetBSD: Supported, Security support external
> +Status, OpenBSD: Supported, Security support external

Seeing the difference in OSes between the two (with the variance
increasing in entries further down) - what does the absence of an
OS on one list, but its presence on another mean? While not
impossible, I would find it surprising if e.g. OpenBSD had netfront
but not even a basic blkfront.

> +Guest-side driver capable of speaking the Xen PV networking protocol
> +
> +### PV Framebuffer (frontend)
> +
> +Status, Linux (xen-fbfront): Supported
> +
> +Guest-side driver capable of speaking the Xen PV Framebuffer protocol
> +
> +### PV Console (frontend)
> +
> +Status, Linux (hvc_xen): Supported
> +Status, Windows: Supported
> +Status, FreeBSD: Supported, Security support external
> +Status, NetBSD: Supported, Security support external
> +
> +Guest-side driver capable of speaking the Xen PV console protocol
> +
> +### PV keyboard (frontend)
> +
> +Status, Linux (xen-kbdfront): Supported
> +Status, Windows: Supported
> +
> +Guest-side driver capable of speaking the Xen PV keyboard protocol

Are these three active/usable in guests regardless of whether the
guest is being run PV, PVH, or HVM? If not, wouldn't this need
spelling out?

> +## Virtual device support, host side
> +
> +### Blkback
> +
> +Status, Linux (blkback): Supported

Strictly speaking, if the driver name is to be spelled out here in
the first place, it's xen-blkback here and ...

> +Status, FreeBSD (blkback): Supported, Security support external
> +Status, NetBSD (xbdback): Supported, security support external
> +Status, QEMU (xen_disk): Supported
> +Status, Blktap2: Deprecated
> +
> +Host-side implementations of the Xen PV block protocol
> +
> +### Netback
> +
> +Status, Linux (netback): Supported

... xen-netback here for the upstream kernels.

> +### PV USB (backend)
> +
> +Status, Linux: Experimental

What existing/upstream code does this refer to?

Jan




Re: [Xen-devel] [PATCH 06/16] SUPPORT.md: Add scalability features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -195,6 +195,27 @@ on embedded platforms.
>  
>  Enables NUMA aware scheduling in Xen
>  
> +## Scalability
> +
> +### 1GB/2MB super page support
> +
> +Status, x86 HVM/PVH: : Supported

On top of what you and Julien have worked out here already: Don't
we need to clarify here that this is for HAP mode, while shadow mode
doesn't support 1Gb guest pages (and doesn't use 2Mb host pages)?

Jan




Re: [Xen-devel] [PATCH 04/16] SUPPORT.md: Add core ARM features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### ARM/SMMUv1
> +
> +Status: Supported
> +
> +### ARM/SMMUv2
> +
> +Status: Supported

Do these belong here, when IOMMU isn't part of the corresponding
x86 patch?

Jan




Re: [Xen-devel] [PATCH 03/16] SUPPORT.md: Add some x86 features

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> +### Host ACPI (via Domain 0)
> +
> +Status, x86 PV: Supported
> +Status, x86 PVH: Tech preview

Are we this far already? Preview implies functional completeness,
but I'm not sure about all ACPI-related parts actually having been
implemented (and see also below). But perhaps things like P and C
state handling come as individual features later on.

> +### x86/PVH guest
> +
> +Status: Supported
> +
> +PVH is a next-generation paravirtualized mode 
> +designed to take advantage of hardware virtualization support when possible.
> +During development this was sometimes called HVMLite or PVHv2.
> +
> +Requires hardware virtualisation support (Intel VMX / AMD SVM)

I think it needs to be said that only DomU is considered supported.
Dom0 is perhaps not even experimental at this point, considering
the panic() in dom0_construct_pvh().

Jan




Re: [Xen-devel] [PATCH 02/16] SUPPORT.md: Add core functionality

2017-11-21 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -16,6 +16,65 @@ for the definitions of the support status levels etc.
>  
>  # Feature Support
>  
> +## Memory Management
> +
> +### Memory Ballooning
> +
> +Status: Supported

Is this a proper feature in the context we're talking about? To me
it's meaningful in guest OS context only. I also wouldn't really
consider it "core", but placement within the series clearly is a minor
aspect.

I'd prefer this to be dropped altogether as a feature, but
Acked-by: Jan Beulich 
is independent of that.

> +### Credit2 Scheduler
> +
> +Status: Supported

Sort of unrelated, but with this having been the case since 4.8 as it
looks, is there a reason it still isn't the default scheduler?

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 19:28,  wrote:
> On 20/11/17 17:14, Jan Beulich wrote:
>>>>> On 20.11.17 at 16:24,  wrote:
>>> On 20/11/17 15:20, Jan Beulich wrote:
>>>>>>> On 20.11.17 at 15:14,  wrote:
>>>>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>>>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>>>>>>> On 20.11.17 at 12:20,  wrote:
>>>>>>>> Which restriction? I'm loading the RSDP table to its architecturally
>>>>>>>> correct address if possible, otherwise it will be loaded to the same
>>>>>>>> address as without my patch. So I'm not adding a restriction, but
>>>>>>>> removing one.
>>>>>>> What is "architecturally correct" in PVH can't be read out of
>>>>>>> specs other than what we write down. When there's no BIOS,
>>>>>>> placing anything right below the 1Mb boundary is at least
>>>>>>> bogus.
>>>>>>
>>>>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>>>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>>>>>
>>>>> I think Jan is right: for PVH it's _our_ job to define the correct
>>>>> placement. Which still can be the same as in the BIOS case, making
>>>>> it easier to adapt any guest systems.
>>>>>
>>>>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>>>>> info block to the guest. In case there is no conflict with the
>>>>> physical load address of the guest kernel the preferred address of
>>>>> the RSDP is right below the 1MB boundary.
>>>>>
>>>>> Would this wording be okay?
>>>>
>>>> To be honest (and in case it wasn't sufficiently clear from my
>>>> earlier replies) - I'm pretty much opposed to this below-1Mb thing.
>>>> There ought to be just plain RAM there for PVH.
>>>
>>> So without my patch the RSDP table is loaded e.g. at about 6.5MB when
>>> I'm using grub2 (the loaded grub image is about 5.5MB in size and it
>>> is being loaded at 1MB).
>>>
>>> When I'm using the PVH Linux kernel directly RSDP is just below 1MB
>>> due to pure luck (the bzImage loader is still using the PV specific
>>> ELF notes and this results in the loader believing RSDP is loadable
>>> at this address, which is true, but the tests used to come to this
>>> conclusion are just not applicable for PVH).
>>>
>>> So in your opinion we should revoke the PVH support from Xen 4.10,
>>> Linux and maybe BSD because RSDP is loaded in middle of RAM of the
>>> guest?
>> 
>> So what's wrong with it being put wherever the next free memory
>> location is being determined to be by the loader, just like is being
>> done for other information, including modules (if any)?
> 
> The RSDP table is marked as "Reserved" in the memory map. So putting it
> somewhere in the middle of the guest's memory will force the guest to
> use 4kB pages instead of 2MB or even 1GB pages. I'd really like to avoid
> this problem, as we've been hit by the very same in HVM guests before
> causing quite measurable performance drops.

This is a valid point.

> So I'd rather put it in the first MB as most kernels have to deal with
> small pages at beginning of RAM today. An alternative would be to put
> it just below 4GB where e.g. the console and Xenstore page are located.

Putting it in the first Mb implies that mappings there will continue to
be 4k ones. I can't, however, see why for PVH that should be
necessary: There's no BIOS and nothing legacy that needs to live
there, so unlike HVM it could benefit from using a 1Gb mapping
even at address zero (even if this might be something that can't
be achieved right away). So yes, if anything, the allocation should
be made top down starting from 4Gb. Otoh, I don't see a strict
need for this area to live below 4Gb in the first place.

>>> Doing it in a proper way you are outlining above would render
>>> current PVH guests unusable.
>> 
>> I'm afraid I don't understand which outline of mine you refer to.
>> Iirc all I said was that placing it below 1Mb is bogus. Top of RAM
>> (as I think I saw being mentioned elsewhere) probably isn't much
>> better. Special locations should be used only if there's really no
>> other way to convey information.

Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 17:59,  wrote:
> On 11/20/2017 11:43 AM, Jan Beulich wrote:
>>>>> On 20.11.17 at 17:28,  wrote:
>>> On 11/20/2017 11:26 AM, Jan Beulich wrote:
>>>>>>> On 20.11.17 at 17:14,  wrote:
>>>>> What could cause grub2 to fail to find space for the pointer in the
>>>>> first page? Will we ever have anything in EBDA (which is one of the
>>>>> possible RSDP locations)?
>>>> Well, the EBDA (see the B in its name) is again something that's
>>>> meaningless without there being a BIOS.
>>> Exactly. So it should always be available for grub to copy the pointer
>>> there.
>> But what use would it be if grub copied it there? It just shouldn't
>> be there, neither before nor after grub (just like grub doesn't
>> introduce firmware into the system).
> 
> So that the guest can find it using standard methods. If Xen can't
> guarantee ACPI-compliant placement of the pointer then someone has to
> help the guest find it in the expected place. We can do it with a
> dedicated entry point by setting the pointer explicitly (although
> admittedly this is not done correctly now) or we need to have firmware
> (grub2) place it in the "right" location.
> 
> (It does look a bit hacky though)

Indeed. Of course ACPI without any actual firmware is sort of odd,
too. As to dedicated entry point and its alternatives: Xen itself
tells grub (aiui we're talking about a flavor of it running PVH itself)
where the RSDP is. Why can't grub forward that information in a
suitable way (e.g. via a new tag, or - for Linux - as a new entry
in the Linux boot header)?

Jan




Re: [Xen-devel] [PATCH 01/16] Introduce skeleton SUPPORT.md

2017-11-20 Thread Jan Beulich
>>> On 13.11.17 at 16:41,  wrote:
> Add a machine-readable file to describe what features are in what
> state of being 'supported', as well as information about how long this
> release will be supported, and so on.
> 
> The document should be formatted using "semantic newlines" [1], to make
> changes easier.
> 
> Begin with the basic framework.
> 
> Signed-off-by: Ian Jackson 
> Signed-off-by: George Dunlap 

Acked-by: Jan Beulich 
despite ...

> +We also provide security support for Xen-related code in Linux,
> +which is an external project but doesn't have its own security process.

... not fully agreeing with this part. But at least this way the state
of things is properly spelled out in a sufficiently official place.

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 17:28,  wrote:
> On 11/20/2017 11:26 AM, Jan Beulich wrote:
>>>>> On 20.11.17 at 17:14,  wrote:
>>> What could cause grub2 to fail to find space for the pointer in the
>>> first page? Will we ever have anything in EBDA (which is one of the
>>> possible RSDP locations)?
>> Well, the EBDA (see the B in its name) is again something that's
>> meaningless without there being a BIOS.
> 
> Exactly. So it should always be available for grub to copy the pointer
> there.

But what use would it be if grub copied it there? It just shouldn't
be there, neither before nor after grub (just like grub doesn't
introduce firmware into the system).

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 17:14,  wrote:
> What could cause grub2 to fail to find space for the pointer in the
> first page? Will we ever have anything in EBDA (which is one of the
> possible RSDP locations)?

Well, the EBDA (see the B in its name) is again something that's
meaningless without there being a BIOS.

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 16:24,  wrote:
> On 20/11/17 15:20, Jan Beulich wrote:
>>>>> On 20.11.17 at 15:14,  wrote:
>>> On 20/11/17 14:56, Boris Ostrovsky wrote:
>>>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>>>>> On 20.11.17 at 12:20,  wrote:
>>>>>> Which restriction? I'm loading the RSDP table to its architecturally
>>>>>> correct address if possible, otherwise it will be loaded to the same
>>>>>> address as without my patch. So I'm not adding a restriction, but
>>>>>> removing one.
>>>>> What is "architecturally correct" in PVH can't be read out of
>>>>> specs other than what we write down. When there's no BIOS,
>>>>> placing anything right below the 1Mb boundary is at least
>>>>> bogus.
>>>>
>>>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>>>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
>>>
>>> I think Jan is right: for PVH its _our_ job to define the correct
>>> placement. Which still can be the same as in the BIOS case, making
>>> it easier to adapt any guest systems.
>>>
>>> So I'd say: The RSDP address in PVH case is passed in the PVH start
>>> info block to the guest. In case there is no conflict with the
>>> physical load address of the guest kernel the preferred address of
>>> the RSDP is right below the 1MB boundary.
>>>
>>> Would this wording be okay?
>> 
>> To be honest (and in case it wasn't sufficiently clear from my
>> earlier replies) - I'm pretty much opposed to this below-1Mb thing.
>> There ought to be just plain RAM there for PVH.
> 
> So without my patch the RSDP table is loaded e.g. at about 6.5MB when
> I'm using grub2 (the loaded grub image is about 5.5MB in size and it
> is being loaded at 1MB).
> 
> When I'm using the PVH Linux kernel directly RSDP is just below 1MB
> due to pure luck (the bzImage loader is still using the PV specific
> ELF notes and this results in the loader believing RSDP is loadable
> at this address, which is true, but the tests used to come to this
> conclusion are just not applicable for PVH).
> 
> So in your opinion we should revoke the PVH support from Xen 4.10,
> Linux and maybe BSD because RSDP is loaded in middle of RAM of the
> guest?

So what's wrong with it being put wherever the next free memory
location is being determined to be by the loader, just like is being
done for other information, including modules (if any)?

> Doing it in a proper way you are outlining above would render
> current PVH guests unusable.

I'm afraid I don't understand which outline of mine you refer to.
Iirc all I said was that placing it below 1Mb is bogus. Top of RAM
(as I think I saw being mentioned elsewhere) probably isn't much
better. Special locations should be used only if there's really no
other way to convey information.
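
As a hedged sketch of what "conveying it via the start info block"
looks like from the guest side (struct layout from memory - treat
details as assumptions):

#include <stdint.h>

#define HVM_START_MAGIC_VALUE 0x336ec578u

struct hvm_start_info {
    uint32_t magic;         /* == HVM_START_MAGIC_VALUE */
    uint32_t version;
    uint32_t flags;
    uint32_t nr_modules;
    uint64_t modlist_paddr; /* physical address of the module list */
    uint64_t cmdline_paddr; /* physical address of the command line */
    uint64_t rsdp_paddr;    /* physical address of the RSDP */
};

/* A PVH guest entered with %ebx pointing at this block can locate
 * ACPI without scanning the EBDA or the 0xE0000-0xFFFFF window: */
static inline uint64_t pvh_rsdp_paddr(const struct hvm_start_info *si)
{
    return si->magic == HVM_START_MAGIC_VALUE ? si->rsdp_paddr : 0;
}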

Jan




Re: [Xen-devel] [PATCH for-4.10] x86/hvm: Don't ignore unknown MSRs in the migration stream

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 15:10,  wrote:
> On 17/11/17 12:10, Jan Beulich wrote:
>>>>> On 16.11.17 at 20:15,  wrote:
>>> Doing so amounts to silent state corruption, and must be avoided.
>> I think a little more explanation is needed on why the current code
>> is insufficient. Note specifically this
>>
>> for ( i = 0; !err && i < ctxt->count; ++i )
>> {
>> switch ( ctxt->msr[i].index )
>> {
>> default:
>> if ( !ctxt->msr[i]._rsvd )
>> err = -ENXIO;
>> break;
>> }
>> }
>>
>> in hvm_load_cpu_msrs(), intended to give vendor code a first
>> shot, but allowing for vendor independent MSRs to be handled
>> here.
> 
> That is sufficiently subtle and non-obvious that I'm still having a hard
> time convincing myself that it's correct.  Also, this use of _rsvd really
> should be documented.

Well, from an abstract pov I agree. The field being defined in the
public interface though, I don't see a good place to document
that - its point of declaration certainly isn't the right one in such a
case, as the public interface should not document implementation
details.

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 15:14,  wrote:
> On 20/11/17 14:56, Boris Ostrovsky wrote:
>> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>>> On 20.11.17 at 12:20,  wrote:
>>>> Which restriction? I'm loading the RSDP table to its architecturally
>>>> correct address if possible, otherwise it will be loaded to the same
>>>> address as without my patch. So I'm not adding a restriction, but
>>>> removing one.
>>> What is "architecturally correct" in PVH can't be read out of
>>> specs other than what we write down. When there's no BIOS,
>>> placing anything right below the 1Mb boundary is at least
>>> bogus.
>> 
>> Unless it's a UEFI boot -- where else would you put it? Aren't these two
>> (UEFI and non-UEFI) the only two options that the ACPI spec provides?
> 
> I think Jan is right: for PVH its _our_ job to define the correct
> placement. Which still can be the same as in the BIOS case, making
> it easier to adapt any guest systems.
> 
> So I'd say: The RSDP address in PVH case is passed in the PVH start
> info block to the guest. In case there is no conflict with the
> physical load address of the guest kernel the preferred address of
> the RSDP is right below the 1MB boundary.
> 
> Would this wording be okay?

To be honest (and in case it wasn't sufficiently clear from my
earlier replies) - I'm pretty much opposed to this below-1Mb thing.
There ought to be just plain RAM there for PVH.

Jan




Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 14:56,  wrote:
> On 11/20/2017 06:50 AM, Jan Beulich wrote:
>>>>> On 20.11.17 at 12:20,  wrote:
>>> Which restriction? I'm loading the RSDP table to its architecturally
>>> correct address if possible, otherwise it will be loaded to the same
>>> address as without my patch. So I'm not adding a restriction, but
>>> removing one.
>> What is "architecturally correct" in PVH can't be read out of
>> specs other than what we write down. When there's no BIOS,
>> placing anything right below the 1Mb boundary is at least
>> bogus.
> 
> Unless it's a UEFI boot -- where else would you put it? Aren't these two
> (UEFI and non-UEFI) the only two options that the ACPI spec provides?

Of course - we can't really expect them to cater for something like
PVH. But this also means we can't always use the spec as reference
point here.

Jan




Re: [Xen-devel] [PATCH for-next] x86/vmx: Drop more PVHv1 remenants

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 14:19,  wrote:
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 





Re: [Xen-devel] [PATCH for-4.10] libxc: load acpi RSDP table at correct address

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 12:20,  wrote:
> Which restriction? I'm loading the RSDP table to its architecturally
> correct address if possible, otherwise it will be loaded to the same
> address as without my patch. So I'm not adding a restriction, but
> removing one.

What is "architecturally correct" in PVH can't be read out of
specs other than what we write down. When there's no BIOS,
placing anything right below the 1Mb boundary is at least
bogus.

As to the prior grub2 discussion you refer to - just like Andrew
I don't think I was really involved here. If it resulted in any
decisions affecting the PVH ABI, I think it would be a good idea
to summarize the outcome, and perhaps even submit a patch
adding respective documentation (e.g. by way of comment in
public headers). That'll then allow non-grub Xen folks (like
Andrew and me) to see what you're intending to do (and of
course there would be the risk of someone disagreeing with
what you had come up with while discussing this on the grub
side).

Jan




Re: [Xen-devel] [PATCH v8] x86/altp2m: support for setting restrictions for an array of pages

2017-11-20 Thread Jan Beulich
>>> On 20.11.17 at 10:35,  wrote:
> On Ma, 2017-10-24 at 13:19 +0300, Petre Pircalabu wrote:
>> From: Razvan Cojocaru 
>>
>> For the default EPT view we have xc_set_mem_access_multi(), which
>> is able to set an array of pages to an array of access rights with
>> a single hypercall. However, this functionality was lacking for the
>> altp2m subsystem, which could only set page restrictions for one
>> page at a time. This patch addresses the gap.
>>
>> HVMOP_altp2m_set_mem_access_multi has been added as a HVMOP (as
>> opposed to a
>> DOMCTL) for consistency with its HVMOP_altp2m_set_mem_access
>> counterpart (and
>> hence with the original altp2m design, where domains are allowed -
>> with the
>> proper altp2m access rights - to alter these settings), in the
>> absence of an
>> official position on the issue from the original altp2m designers.
>>
>> Signed-off-by: Razvan Cojocaru 
>> Signed-off-by: Petre Pircalabu 
> 
> Are there still any outstanding issues with this patch?

I for one don't know - I simply didn't get around to looking at it yet.
The tree's still frozen right now anyway.

Jan




Re: [Xen-devel] [PATCH for-4.10] tools/libxl: mark hvm mmio area as reserved in e820 map

2017-11-17 Thread Jan Beulich
>>> On 17.11.17 at 12:47,  wrote:
> Make sure the HVM mmio area (especially console and Xenstore pages) is
> marked as "reserved" in the guest's E820 map, as otherwise conflicts
> might arise later, e.g. when hotplugging memory into the guest.

This is very certainly wrong. Have you looked at a couple of physical
machines? Have you found an E820_RESERVED area on any of them for
the MMIO hole? Two examples I can present right away:

<6>BIOS-e820: [mem 0xc93f-0xc9f8cfff] reserved
<6>BIOS-e820: [mem 0xc9f8d000-0xc9fdefff] ACPI data
<6>BIOS-e820: [mem 0xc9fdf000-0xcac82fff] ACPI NVS
<6>BIOS-e820: [mem 0xcac83000-0xcb172fff] reserved
<6>BIOS-e820: [mem 0xcb173000-0xcb173fff] usable
<6>BIOS-e820: [mem 0xcb174000-0xcb181fff] reserved
<6>BIOS-e820: [mem 0xcb182000-0xccff] usable
<6>BIOS-e820: [mem 0xcd00-0xcdff] reserved
<6>BIOS-e820: [mem 0xd000-0xdfff] reserved
<6>BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
<6>BIOS-e820: [mem 0xff00-0x] reserved

and

(XEN)  cf4bd000 - cf4bf000 (reserved)
(XEN)  cf4bf000 - cf636000 (usable)
(XEN)  cf636000 - cf7bf000 (ACPI NVS)
(XEN)  cf7bf000 - cf7df000 (usable)
(XEN)  cf7df000 - cf7ff000 (ACPI data)
(XEN)  cf7ff000 - cf80 (usable)
(XEN)  cf80 - d000 (reserved)
(XEN)  f800 - fd00 (reserved)
(XEN)  ffe0 - 0001 (reserved)

Things covered by E820_RESERVED include the MCFG area, yes, but
not most other parts. The OS has to either be careful or consult
ACPI for further resource usage details. In particular, the ACPI spec
says

"The platform boot firmware does not return a range description for
 the memory mapping of PCI devices, ISA Option ROMs, and ISA Plug
 and Play cards because the OS has mechanisms available to detect
 them."

See the section "E820 Assumptions and Limitations" for further details.

Jan




Re: [Xen-devel] [PATCH for-4.10] x86/hvm: Don't corrupt the HVM context stream when writing the MSR record

2017-11-17 Thread Jan Beulich
>>> On 16.11.17 at 23:45,  wrote:
> Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
> bug whereby it corrupts the HVM context stream if some, but fewer than the
> maximum number of MSRs are written.
> 
> _hvm_init_entry() creates an hvm_save_descriptor with length for
> msr_count_max, but in the case that we write fewer than max, h->cur only moves
> forward by the amount of space used, causing the subsequent
> hvm_save_descriptor to be written within the bounds of the previous one.
> 
> To resolve this, reduce the length reported by the descriptor to match the
> actual number of bytes used.
> 
> A typical failure on the destination side looks like:
> 
> (XEN) HVM4 restore: CPU_MSR 0
> (XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
> (XEN) HVM4 restore: failed to load entry 20/0
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1330,6 +1330,7 @@ static int hvm_save_cpu_msrs(struct domain *d, hvm_domain_context_t *h)
>  
>  for_each_vcpu ( d, v )
>  {
> +struct hvm_save_descriptor *d = _p(&h->data[h->cur]);
>  struct hvm_msr *ctxt;
>  unsigned int i;
>  
> @@ -1348,8 +1349,13 @@ static int hvm_save_cpu_msrs(struct domain *d, hvm_domain_context_t *h)
>  ctxt->msr[i]._rsvd = 0;
>  
>  if ( ctxt->count )
> +{
> +/* Rewrite length to indicate how much space we actually used. */
> +d->length = HVM_CPU_MSR_SIZE(ctxt->count);

Would of course be nice if we had a function to do this, such that
the (sufficiently hidden) cast above also wouldn't be necessary to
open code in places like this one.
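
Something along these lines is what I have in mind - purely a sketch,
no such helper exists today:

/* Shrink the save record whose descriptor sits at 'desc_off' to the
 * number of payload bytes actually used, hiding the cast from callers. */
static void hvm_save_shrink_entry(hvm_domain_context_t *h,
                                  size_t desc_off, uint32_t used)
{
    struct hvm_save_descriptor *desc = (void *)&h->data[desc_off];

    ASSERT(used <= desc->length);
    desc->length = used;
}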

Jan




Re: [Xen-devel] [PATCH for-4.10] tools/libxc: Fix restoration of PV MSRs after migrate

2017-11-17 Thread Jan Beulich
>>> On 16.11.17 at 22:13,  wrote:
> There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
> didn't test this bit of Migration v2 very well when writing it...
> 
> vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
> records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
> rather than the actual number sent in the stream.
> 
> Passing 0 for the msr_count causes the hypercall to exit early, and hides the
> fact that the guest handle is inserted into the wrong field in the domctl
> union.

Oops.
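
For the record, the corrected setup presumably ends up along these
lines (hedged sketch - field names per the domctl interface as I
recall it, with the hypercall buffer handling elided):

DECLARE_DOMCTL;

/* msrsz is a byte count, so divide - don't take the remainder - to
 * get the record count, and fill the vcpu_msrs member of the union: */
domctl.cmd = XEN_DOMCTL_set_vcpu_msrs;
domctl.domain = ctx->domid;
domctl.u.vcpu_msrs.vcpu = vcpu_id;
domctl.u.vcpu_msrs.msr_count = vcpu->msrsz / sizeof(xen_domctl_vcpu_msr_t);
set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, msrs);

rc = do_domctl(xch, &domctl);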

> The reason that these bugs have gone unnoticed for so long is that the only
> MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
> in fairly modern hardware, and whose use doesn't appear to be implemented in
> any contemporary PV guests.
> 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Jan Beulich 



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH for-4.10] x86/hvm: Don't ignore unknown MSRs in the migration stream

2017-11-17 Thread Jan Beulich
>>> On 16.11.17 at 20:15,  wrote:
> Doing so amounts to silent state corruption, and must be avoided.

I think a little more explanation is needed on why the current code
is insufficient. Note specifically this

for ( i = 0; !err && i < ctxt->count; ++i )
{
switch ( ctxt->msr[i].index )
{
default:
if ( !ctxt->msr[i]._rsvd )
err = -ENXIO;
break;
}
}

in hvm_load_cpu_msrs(), intended to give vendor code a first
shot, but allowing for vendor independent MSRs to be handled
here.
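
To spell out the subtlety (sketched from memory): the vendor hook
marks every entry it consumed, so the loop above only rejects entries
nobody claimed. Roughly:

/* In e.g. vmx_load_msr(): */
for ( i = 0; i < ctxt->count; ++i )
{
    switch ( ctxt->msr[i].index )
    {
    case MSR_IA32_BNDCFGS:
        /* ... restore the value ... */
        break;

    default:
        continue;             /* leave _rsvd clear: not ours */
    }

    ctxt->msr[i]._rsvd = 1;   /* mark the entry as consumed */
}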

> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -450,7 +450,7 @@ static int svm_load_msr(struct vcpu *v, struct hvm_msr 
> *ctxt)
>  break;
>  
>  default:
> -continue;
> +err = -ENXIO;
>  }

If the change was nevertheless needed, please add break
statements here (and in VMX code as well), despite this
currently being the last label in the switch() block.

Jan




Re: [Xen-devel] XSA 243 v5 is missing the second patch for xen 4.8

2017-11-17 Thread Jan Beulich
>>> On 16.11.17 at 21:01,  wrote:
> Hello,
> Looking at
> https://xenbits.xen.org/xsa/advisory-243.html,
> I cannot find the second patch for xen 4.8, xsa243-4.8-2.patch.
> The text of the advisory leads me to believe that it should be there, so
> it seems to be missing.

The text has xsa243-{4.8-1,2}.patch, which expands to
xsa243-4.8-1.patch and xsa243-2.patch - i.e. for 4.8 the second patch
is the version-independent xsa243-2.patch, so nothing is missing.

Jan




Re: [Xen-devel] [BUG] Error applying XSA240 update 5 on 4.8 and 4.9 (patch 3 references CONFIG_PV_LINEAR_PT, 3285e75dea89, x86/mm: Make PV linear pagetables optional)

2017-11-16 Thread Jan Beulich
>>> On 16.11.17 at 13:30,  wrote:
> On Thursday, 16 November 2017 8:30:39 PM AEDT Jan Beulich wrote:
>> >>> On 15.11.17 at 23:48,  wrote:
>> > I am having trouble applying the patch 3 from XSA240 update 5 for xen
>> > stable 4.8 and 4.9
>> > xsa240 0003 contains:
>> > 
>> > CONFIG_PV_LINEAR_PT
>> > 
>> > from:
>> > 
>> > x86/mm: Make PV linear pagetables optional
>> > https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=3285e75dea89afb0ef5b3ee39bd15194bd7cc110
>> > 
>> > I cannot find this string in an XSA, nor is an XSA referenced in the
>> > commit.
>> > Am I missing a patch, or doing something wrong?
>> 
>> Well, you're expected to apply all patches which haven't been
>> applied so far. In particular, in the stable version trees, the 2nd
>> patch hasn't gone in yet (I'm intending to do this later today),
>> largely because it (a) wasn't ready at the time the first patch
>> went in and (b) it is more a courtesy patch than an actual part of
>> the security fix.
> 
> I'm not quite sure this is a great idea... They should work on the released 
> versions - hence xsa240 patchset should apply to the base tarball + current 
> XSA patches. If there is something in the git that *isn't* in the latest 
> release, it should be included in the XSA patchset - otherwise the set is 
> incomplete.

Well, I've been taking a different view: The only valid (or so to say
canonical) base to supply patches against is the current tip of the
respective staging branch. Anyone wanting to apply to anything
older will need to make adjustments, if need be. Otherwise what
would keep you or others from requesting, say, not only patches against
4.7.3, but also against 4.7.0, 4.7.1, and 4.7.2?

> I don't see mention of anywhere in the written XSA that mentions a separate 
> patch is required outside of the patches included with the XSA.

It isn't (afaict), it's just that the included patch 2 is stale. This is
certainly unfortunate, but correct patches can now easily be
taken from the respective git branches. I'm not convinced it is
worthwhile to re-issue the advisory yet another time, but I'm
also not going to stand in the way if others on the security team
want to do so.

> These should be included in 4.9.1 - which makes most things irrelevant - but 
> I'm not aware of what the release window is for 4.9.1.

It is in preparation; I was merely waiting for these regression fixes
to be publicly announced.

Jan




Re: [Xen-devel] [BUG] Error applying XSA240 update 5 on 4.8 and 4.9 (patch 3 references CONFIG_PV_LINEAR_PT, 3285e75dea89, x86/mm: Make PV linear pagetables optional)

2017-11-16 Thread Jan Beulich
>>> On 15.11.17 at 23:48,  wrote:
> Hi,
> 
> I am having trouble applying the patch 3 from XSA240 update 5 for xen
> stable 4.8 and 4.9 
> xsa240 0003 contains:
> 
> CONFIG_PV_LINEAR_PT
> 
> from:
> 
> x86/mm: Make PV linear pagetables optional
> https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=3285e75dea89afb0ef5b3ee39bd15194bd7cc110
> 
> I cannot find this string in an XSA, nor is an XSA referenced in the
> commit.
> Am I missing a patch, or doing something wrong?

Well, you're expected to apply all patches which haven't been
applied so far. In particular, in the stable version trees, the 2nd
patch hasn't gone in yet (I'm intending to do this later today),
largely because it (a) wasn't ready at the time the first patch
went in and (b) it is more a courtesy patch than an actual part of
the security fix.

> xsa240-4.9 0002 and  3285e75dea "x86/mm: Make PV linear pagetables
> optional" conflict as is.

Oh, that's rather unfortunate. The patch (all versions, including
the master one) is stale. A proper backport of 3285e75dea
("x86/mm: Make PV linear pagetables optional") is needed instead.
Please see the backports I'm intending to commit later today.

> xsa240-4.9 0003 applies after 3285e75dea "x86/mm: Make PV linear
> pagetables optional"
> 
> Could we also refer to the third patch in the XSA resolution section
> please?

Hmm, I see the text doesn't mention that new patch; the list of
patches (with their sha256 sums) includes it though, which I think
is good enough.

Jan




Re: [Xen-devel] [PATCH v2] x86/hvm: Fix altp2m_vcpu_enable_notify error handling

2017-11-15 Thread Jan Beulich
>>> On 15.11.17 at 14:47,  wrote:
> The altp2m_vcpu_enable_notify subop handler might skip calling
> rcu_unlock_domain() after rcu_lock_current_domain().  Albeit since both
> rcu functions are no-ops when run on the current domain, this doesn't
> really have repercussions.
> 
> The second change is adding a missing break that would have potentially
> enabled #VE for the current domain even if it had intended to enable it
> for another one (not a supported functionality).

Thanks, much better.

> Signed-off-by: Adrian Pop 
> Reviewed-by: Andrew Cooper 

Reviewed-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH] x86/hvm: Fix rcu_unlock_domain call bypass

2017-11-14 Thread Jan Beulich
>>> On 14.11.17 at 16:11,  wrote:
> rcu_lock_current_domain is called at the beginning of do_altp2m_op, but
> the altp2m_vcpu_enable_notify subop handler might skip calling
> rcu_unlock_domain, possibly hanging the domain altogether.

I fully agree with the change, but the description needs improvement.
For one, why would the domain be hanging with

static inline struct domain *rcu_lock_current_domain(void)
{
return /*rcu_lock_domain*/(current->domain);
}

? And even if the lock function invocation wasn't commented
out, all it does is preempt_disable(). That may cause an
assertion to trigger in debug builds, but that's not a domain
hang. Plus ...

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4534,12 +4534,18 @@ static int do_altp2m_op(
>  
>  if ( a.u.enable_notify.pad || a.domain != DOMID_SELF ||
>   a.u.enable_notify.vcpu_id != curr->vcpu_id )
> +{
>  rc = -EINVAL;
> +break;
> +}

... you also change flow here, which is a second bug you address,
but you fail to mention it.

Jan




Re: [Xen-devel] [PATCH v3 for-4.10 2/2] x86/mm: fix a potential race condition in modify_xen_mappings().

2017-11-14 Thread Jan Beulich
>>> On 14.11.17 at 07:53,  wrote:
> In modify_xen_mappings(), a L1/L2 page table shall be freed,
> if all entries of this page table are empty. Corresponding
> L2/L3 PTE will need be cleared in such scenario.
> 
> However, concurrent paging structure modifications on different
> CPUs may cause the L2/L3 PTEs to be already be cleared or set
> to reference a superpage.
> 
> Therefore the logic to enumerate the L1/L2 page table and to
> reset the corresponding L2/L3 PTE need to be protected with
> spinlock. And the _PAGE_PRESENT and _PAGE_PSE flags need be
> checked after the lock is obtained.
> 
> Signed-off-by: Yu Zhang 

Reviewed-by: Jan Beulich 





Re: [Xen-devel] [PATCH v3 for-4.10 1/2] x86/mm: fix potential race conditions in map_pages_to_xen().

2017-11-14 Thread Jan Beulich
>>> On 14.11.17 at 07:53,  wrote:
> From: Min He 
> 
> In map_pages_to_xen(), a L2 page table entry may be reset to point to
> a superpage, and its corresponding L1 page table need be freed in such
> scenario, when these L1 page table entries are mapping to consecutive
> page frames and having the same mapping flags.
> 
> However, variable `pl1e` is not protected by the lock before L1 page table
> is enumerated. A race condition may happen if this code path is invoked
> simultaneously on different CPUs.
> 
> For example, `pl1e` value on CPU0 may hold an obsolete value, pointing
> to a page which has just been freed on CPU1. Besides, before this page
> is reused, it will still be holding the old PTEs, referencing consecutive
> page frames. Consequently the `free_xen_pagetable(l2e_to_l1e(ol2e))` will
> be triggered on CPU0, resulting the unexpected free of a normal page.
> 
> This patch fixes the above problem by protecting the `pl1e` with the lock.
> 
> Also, there're other potential race conditions. For instance, the L2/L3
> entry may be modified concurrently on different CPUs, by routines such as
> map_pages_to_xen(), modify_xen_mappings() etc. To fix this, this patch will
> check the _PAGE_PRESENT and _PAGE_PSE flags, after the spinlock is obtained,
> for the corresponding L2/L3 entry.
> 
> Signed-off-by: Min He 
> Signed-off-by: Yi Zhang 
> Signed-off-by: Yu Zhang 

Reviewed-by: Jan Beulich 





Re: [Xen-devel] [PATCH v2 1/2] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-13 Thread Jan Beulich
>>> On 13.11.17 at 11:34,  wrote:
> Our debugging showed the concerned page->count_info was already (and
> unexpectedly) cleared in free_xenheap_pages(), with a call trace like
> this:
> 
> free_xenheap_pages()
>  ^
>  |
> free_xen_pagetable()
>  ^
>  |
> map_pages_to_xen()
>  ^
>  |
> update_xen_mappings()
>  ^
>  |
> get_page_from_l1e()
>  ^
>  |
> mod_l1_entry()
>  ^
>  |
> do_mmu_update()

This ...

> Is above description convincing enough? :-)

... is indeed enough for me to suggest to Julien that we take both
patches (once they're ready). But it's his decision in the end.

Jan




Re: [Xen-devel] [PATCH net-next v1] xen-netback: make copy batch size configurable

2017-11-13 Thread Jan Beulich
>>> On 13.11.17 at 11:33,  wrote:
>> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
>> Sent: 10 November 2017 19:35
>> --- a/drivers/net/xen-netback/netback.c
>> +++ b/drivers/net/xen-netback/netback.c
>> @@ -96,6 +96,11 @@ unsigned int xenvif_hash_cache_size =
>> XENVIF_HASH_CACHE_SIZE_DEFAULT;
>>  module_param_named(hash_cache_size, xenvif_hash_cache_size, uint,
>> 0644);

Isn't the "owner-write" permission here ...

>> --- a/drivers/net/xen-netback/rx.c
>> +++ b/drivers/net/xen-netback/rx.c
>> @@ -168,11 +168,14 @@ static void xenvif_rx_copy_add(struct
>> xenvif_queue *queue,
>> struct xen_netif_rx_request *req,
>> unsigned int offset, void *data, size_t len)
>>  {
>> +unsigned int batch_size;
>>  struct gnttab_copy *op;
>>  struct page *page;
>>  struct xen_page_foreign *foreign;
>> 
>> -if (queue->rx_copy.num == COPY_BATCH_SIZE)
>> +batch_size = min(xenvif_copy_batch_size, queue->rx_copy.size);
> 
> Surely queue->rx_copy.size and xenvif_copy_batch_size are always identical? 
> Why do you need this statement (and hence stack variable)?

... the answer to your question?

Jan




Re: [Xen-devel] [PATCH for-4.10] docs: update hvmlite.markdown

2017-11-13 Thread Jan Beulich
>>> On 12.11.17 at 12:03,  wrote:
> --- a/docs/misc/hvmlite.markdown
> +++ b/docs/misc/hvmlite.markdown
> @@ -1,6 +1,3 @@
> -**NOTE**: this document will be merged into `pvh.markdown` once PVH is 
> replaced
> -with the HVMlite implementation.
> -

This being stale, wouldn't it then be better to rename the doc to
pvh.markdown at the same time? Either way
Acked-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-13 Thread Jan Beulich
>>> On 10.11.17 at 15:46,  wrote:
> On 10/11/17 10:30, Jan Beulich wrote:
>>>>> On 10.11.17 at 09:41,  wrote:
>>>2. Drop v->is_running check inside vmx_ctxt_switch_from() making
>>>vmx_vmcs_reload() unconditional.
>> 
>> This is an option, indeed (and I don't think it would have a
>> meaningful performance impact, as vmx_vmcs_reload() does
>> nothing if the right VMCS is already in place). Iirc I had added the
>> conditional back then merely to introduce as little of a behavioral
>> change as was (appeared to be at that time) necessary. What I'm
>> not certain about, however, is the final state we'll end up in then.
>> Coming back to your flow scheme (altered to represent the
>> suggested new flow):
>> 
> 
> I was thinking of this approach for a while and I couldn't find anything
> dangerous that can be potentially done by vmcs_reload() since it looks
> like that it already has all the necessary checks inside.
> 
>> pCPU1   pCPU2
>> =   =
>> current == vCPU1
>> context_switch(next == idle)
>> !! __context_switch() is skipped
>> vcpu_migrate(vCPU1)
>> RCU callbacks
>> vmx_vcpu_destroy()
>> vmx_vcpu_disable_pml()
>> current_vmcs = 0
>> 
>> schedule(next == vCPU1)
>> vCPU1->is_running = 1;
>> context_switch(next == vCPU1)
>> flush_tlb_mask(&dirty_mask);
>> 
>> <--- IPI
>> 
>> __sync_local_execstate()
>> __context_switch(prev == vCPU1)
>> vmx_ctxt_switch_from(vCPU1)
>> vmx_vmcs_reload()
>> ...
>> 
>> We'd now leave the being destroyed vCPU's VMCS active in pCPU1
>> (at least I can't see where it would be deactivated again).
> 
> This would be VMCS of the migrated vCPU - not the destroyed one.

Oh, right. Nevertheless I favor the other approach (or some of the
possible variants of it). In particular, I'm increasingly in favor of
moving the sync up the call stack, at least into
complete_domain_destroy(), or even rcu_do_batch(), as mentioned
before. The former would be more clearly of no concern performance
wise (to please Dario), while the latter would be the obvious and
clean equivalent of the old tasklet commit I've pointed out last week.

Jan




Re: [Xen-devel] [PATCH v2 2/2] x86/mm: fix a potential race condition in modify_xen_mappings().

2017-11-13 Thread Jan Beulich
>>> On 10.11.17 at 15:02,  wrote:
> On 11/10/2017 5:57 PM, Jan Beulich wrote:
>>>>> On 10.11.17 at 08:18,  wrote:
>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -5097,6 +5097,17 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
>>>    */
>>>   if ( (nf & _PAGE_PRESENT) || ((v != e) && (l1_table_offset(v) != 0)) )
>>>   continue;
>>> +if ( locking )
>>> +spin_lock(&map_pgdir_lock);
>>> +
>>> +/* L2E may be cleared on another CPU. */
>>> +if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
>> I think you also need a PSE check here, or else the l2e_to_l1e() below
>> may be illegal.
> 
> Hmm, interesting point, and thanks! :-)
> I did not check the PSE, because modify_xen_mappings() will not do the 
> re-consolidation, and
> concurrent invokes of this routine will not change this flag. But now I 
> believe this presumption
> shall not be made, because the paging structures may be modified by 
> other routines, like
> map_pages_to_xen() on other CPUs.
> 
> So yes, I think a _PAGE_PSE check is necessary here. And I suggest we 
> also check the _PAGE_PRESENT
> flag as well, for the re-consolidation part in my first patch for 
> map_pages_to_xen(). Do you agree?

Oh, yes, definitely. I should have noticed this myself.

Jan




Re: [Xen-devel] [PATCH v2 1/2] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-13 Thread Jan Beulich
>>> On 10.11.17 at 15:05,  wrote:
> On 11/10/2017 5:49 PM, Jan Beulich wrote:
>> I'm not certain this is important enough a fix to consider for 4.10,
>> and you seem to think it's good enough if this gets applied only
>> after the tree has been branched, as you didn't Cc Julien. Please
>> indicate whether you simply weren't aware, or whether there's indeed
>> an important aspect to this that I'm overlooking.
> 
> Well, at first I have not expected this to be accepted for 4.10. But 
> since we have
> met this issue in practice, when running a graphic application which 
> consumes
> memory intensively in dom0, I think it also makes sense if we can fix it 
> in xen's
> release as early as possible. Do you think this is a reasonable 
> requirement? :-)

You'd need to provide further details for us to understand the
scenario. It obviously depends on whether you have other
patches to Xen which actually trigger this. If the problem can
be triggered from outside of a vanilla upstream Xen, then yes,
I think I would favor the fixes being included.

Jan




Re: [Xen-devel] [PATCH v3 4/4] x86/cpuid: Enable new SSE/AVX/AVX512 cpu features

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 11:36,  wrote:
> Intel IceLake cpu has added new cpu features: AVX512VBMI2/GFNI/
> VAES/AVX512VNNI/AVX512BITALG/VPCLMULQDQ. Those new cpu features
> need expose to guest.
> 
> The bit definition:
> CPUID.(EAX=7,ECX=0):ECX[bit 06] AVX512VBMI2
> CPUID.(EAX=7,ECX=0):ECX[bit 08] GFNI
> CPUID.(EAX=7,ECX=0):ECX[bit 09] VAES
> CPUID.(EAX=7,ECX=0):ECX[bit 10] VPCLMULQDQ
> CPUID.(EAX=7,ECX=0):ECX[bit 11] AVX512VNNI
> CPUID.(EAX=7,ECX=0):ECX[bit 12] AVX512_BITALG
> 
> The release document ref below link:
> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> 
> Signed-off-by: Yang Zhong 

Non-toolstack parts
Acked-by: Jan Beulich 
(which you could have picked up from v2 if you hadn't been rushing v3)

Jan




Re: [Xen-devel] [PATCH v3 0/4] x86/cpuid: enable new cpu features

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 11:36,  wrote:
> The new cpu features in intel icelake: AVX512VBMI2/GFNI/VAES/
> AVX512VNNI/AVX512BITALG/VPCLMULQDQ.
> 
> 
> v2: adjust the patches sequence from Jan

I'm sorry, but please be a little more patient with sending new versions.
Allow for at least a couple of days, preferably a week, for other
comments to be made. This is all the more so as your series can't go
in right away anyway, with the tree being frozen.

Jan




Re: [Xen-devel] [PATCH v2 1/4] x86/cpuid: Enable new SSE/AVX/AVX512 cpu features

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 10:36,  wrote:
> Intel IceLake cpu has added new cpu features: AVX512VBMI2/GFNI/
> VAES/AVX512VNNI/AVX512BITALG/VPCLMULQDQ. Those new cpu features
> need expose to guest.
> 
> The bit definition:
> CPUID.(EAX=7,ECX=0):ECX[bit 06] AVX512VBMI2
> CPUID.(EAX=7,ECX=0):ECX[bit 08] GFNI
> CPUID.(EAX=7,ECX=0):ECX[bit 09] VAES
> CPUID.(EAX=7,ECX=0):ECX[bit 10] VPCLMULQDQ
> CPUID.(EAX=7,ECX=0):ECX[bit 11] AVX512VNNI
> CPUID.(EAX=7,ECX=0):ECX[bit 12] AVX512_BITALG
> 
> The release document ref below link:
> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> 
> Signed-off-by: Yang Zhong 

Properly placed last in the series, the non-toolstack parts here
Acked-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 09:41,  wrote:
> On Thu, 2017-11-09 at 07:49 -0700, Jan Beulich wrote:
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -479,7 +479,13 @@ static void vmx_vcpu_destroy(struct vcpu
>>   * we should disable PML manually here. Note that vmx_vcpu_destroy is 
>> called
>>   * prior to vmx_domain_destroy so we need to disable PML for each vcpu
>>   * separately here.
>> + *
>> + * Before doing that though, flush all state for the vCPU previously 
>> having
>> + * run on the current CPU, so that this flushing of state won't happen 
>> from
>> + * the TLB flush IPI handler behind the back of a vmx_vmcs_enter() /
>> + * vmx_vmcs_exit() section.
>>   */
>> +sync_local_execstate();
>>  vmx_vcpu_disable_pml(v);
>>  vmx_destroy_vmcs(v);
>>  passive_domain_destroy(v);
> 
> This patch fixes only one particular issue and not the general problem.
> What if vmcs is cleared, possibly in some future code, at another place?

As indicated in the earlier discussion, if we go this route, other
future async accesses may need to do the same then.

> The original intent of vmx_vmcs_reload() is correct: it lazily loads
> the vmcs when it's needed. It's just the logic which checks for
> v->is_running inside vmx_ctxt_switch_from() is flawed: v might be
> "running" on another pCPU.
> 
> IMHO there are 2 possible solutions:
> 
> 1. Add additional pCPU check into vmx_ctxt_switch_from()

I agree with Dario in not seeing this as a possible solution.

>2. Drop v->is_running check inside vmx_ctxt_switch_from() making
>vmx_vmcs_reload() unconditional.

This is an option, indeed (and I don't think it would have a
meaningful performance impact, as vmx_vmcs_reload() does
nothing if the right VMCS is already in place). Iirc I had added the
conditional back then merely to introduce as little of a behavioral
change as was (appeared to be at that time) necessary. What I'm
not certain about, however, is the final state we'll end up in then.
Coming back to your flow scheme (altered to represent the
suggested new flow):

pCPU1   pCPU2
=   =
current == vCPU1
context_switch(next == idle)
!! __context_switch() is skipped
vcpu_migrate(vCPU1)
RCU callbacks
vmx_vcpu_destroy()
vmx_vcpu_disable_pml()
current_vmcs = 0

schedule(next == vCPU1)
vCPU1->is_running = 1;
context_switch(next == vCPU1)
flush_tlb_mask(&dirty_mask);

<--- IPI

__sync_local_execstate()
__context_switch(prev == vCPU1)
vmx_ctxt_switch_from(vCPU1)
vmx_vmcs_reload()
...

We'd now leave the being destroyed vCPU's VMCS active in pCPU1
(at least I can't see where it would be deactivated again).
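
For reference, vmx_vmcs_reload() as I recall it (treat details as
assumptions) - note that it only ever loads, never unloads:

void vmx_vmcs_reload(struct vcpu *v)
{
    /*
     * With interrupts disabled the VMCS can't be taken away from us
     * anymore, so v->arch.hvm_vmx.vmcs_lock isn't needed here.
     */
    ASSERT(v->is_running || !local_irq_is_enabled());

    if ( v->arch.hvm_vmx.vmcs_pa == this_cpu(current_vmcs) )
        return;                 /* already active: nothing to do */

    vmx_load_vmcs(v);
}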

Overall I think it is quite reasonable to terminate early a lazy
context switch of a vCPU under destruction. From that abstract
consideration, forcing this higher up the call stack of
vmx_vcpu_destroy() (as I had suggested as an alternative
previously, before actually moving it further down into VMX code,
perhaps even right in RCU handling) would continue to be an
option. In this context you may want to pay particular
attention to the description of 346da00456 ("Synchronise lazy
execstate before calling tasklet handlers").

Jan




Re: [Xen-devel] [xen-unstable test] 115555: regressions - FAIL

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 10:50,  wrote:
> On 10/11/17 10:33, Roger Pau Monné wrote:
>> On Sat, Nov 04, 2017 at 11:14:35PM +, osstest service owner wrote:
>>> flight 115555 xen-unstable real [real]
>>> http://logs.test-lab.xenproject.org/osstest/logs/115555/
>>>
>>> Regressions :-(
>>>
>>> Tests which did not succeed and are blocking,
>>> including tests which could not be run:
>>>  test-amd64-amd64-i386-pvgrub 11 guest-start  fail REGR. vs. 115526
>>>  test-amd64-amd64-amd64-pvgrub 11 guest-start fail REGR. vs. 115526
>> 
>> Both of the above trigger an assertion, it seems to be in pvgrub code
>> (because I doubt minios has anything kexec related):
>> 
>> root  (hd0,0)
>> 
>>  Filesystem type is ext2fs, partition type 0x83
>> 
>> kernel  /boot/vmlinuz-3.16.0-4-686-pae root=UUID=482efee7-e0f0-453e-bb8e-000bfcd5a822 ro
>> 
>> initrd  /boot/initrd.img-3.16.0-4-686-pae
>> 
>> = Init TPM Front 
>> Tpmfront:Error Unable to read device/vtpm/0/backend-id during tpmfront initialization! error = ENOENT
>> Tpmfront:Info Shutting down tpmfront
>> close blk: backend=/local/domain/0/backend/vbd/2/51712 node=device/vbd/51712
>> ASSERTION FAILED: 0 at kexec.c:418.
> 
> That's a failure of HYPERVISOR_mmu_update() just before the target
> kernel is being executed (see stubdom/grub/kexec.c).

Those failed updates, looking at the hypervisor logs, are attempts to
fiddle with MFN 0, which the domains don't own.

Jan



Re: [Xen-devel] [PATCH v2 2/2] x86/mm: fix a potential race condition in modify_xen_mappings().

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 08:18,  wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5097,6 +5097,17 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
>   */
>  if ( (nf & _PAGE_PRESENT) || ((v != e) && (l1_table_offset(v) != 0)) )
>  continue;
> +if ( locking )
> +spin_lock(&map_pgdir_lock);
> +
> +/* L2E may be cleared on another CPU. */
> +if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )

I think you also need a PSE check here, or else the l2e_to_l1e() below
may be illegal.
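
I.e. something like this (sketch only; the exact bail-out has to match
the surrounding flow):

if ( locking )
    spin_lock(&map_pgdir_lock);

/*
 * The L2E may meanwhile have been cleared, or re-consolidated into a
 * superpage, by another CPU - either way l2e_to_l1e() would be
 * illegal, so check both flags:
 */
if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ||
     (l2e_get_flags(*pl2e) & _PAGE_PSE) )
{
    if ( locking )
        spin_unlock(&map_pgdir_lock);
    continue;
}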

> @@ -5105,11 +5116,16 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
>  {
>  /* Empty: zap the L2E and free the L1 page. */
>  l2e_write_atomic(pl2e, l2e_empty());
> +if ( locking )
> +spin_unlock(&map_pgdir_lock);
>  flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */
>  free_xen_pagetable(pl1e);
>  }
> +else if ( locking )
> +spin_unlock(&map_pgdir_lock);
>  }
>  
> +check_l3:

Labels indented by at least one space please.

Jan




Re: [Xen-devel] [BUG] blkback reporting incorrect number of sectors, unable to boot

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 10:40,  wrote:
>> Anthony PERARD
>> Sent: 09 November 2017 17:50
>> The problem is that QEMU 4.10 have a lock on the disk image. When
>> booting an HVM guest with a qdisk backend, the disk is open twice, but
>> can only be locked once, so when the pv disk is been initialized, the
>> initialisation kind of fail.
>> Unfortunatly, OVMF will wait indefinitly until the PV disk is
>> initialized.
> 
> That's presumably because the OVMF frontend leaves the emulated disk plugged 
> in despite talking via PV?

Well, how could it not? It can't know whether the OS to be booted
is going to have PV drivers, and iirc the unplug is not reversible.
Shouldn't OVMF close the blkif connection, with the backend
responding to this by unlocking (and maybe closing) the image?

Jan




Re: [Xen-devel] [PATCH v2 1/2] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 08:18,  wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4844,9 +4844,19 @@ int map_pages_to_xen(
>  {
>  unsigned long base_mfn;
>  
> -pl1e = l2e_to_l1e(*pl2e);
>  if ( locking )
>  spin_lock(&map_pgdir_lock);
> +
> +/* Skip the re-consolidation if it's done on another CPU. */
> +if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
> +{
> +if ( locking )
> +spin_unlock(&map_pgdir_lock);
> +goto check_l3;
> +}
> +
> +ol2e = *pl2e;

This wants moving up ahead of the if(), so you can use the local
variable inside that if().
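
I.e. (sketch):

if ( locking )
    spin_lock(&map_pgdir_lock);

ol2e = *pl2e;

/* Skip the re-consolidation if it was done on another CPU. */
if ( l2e_get_flags(ol2e) & _PAGE_PSE )
{
    if ( locking )
        spin_unlock(&map_pgdir_lock);
    goto check_l3;
}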

> @@ -4880,6 +4889,15 @@ int map_pages_to_xen(
>  
>  if ( locking )
>  spin_lock(&map_pgdir_lock);
> +
> +/* Skip the re-consolidation if it's done on another CPU. */
> +if ( l3e_get_flags(*pl3e) & _PAGE_PSE )
> +{
> +if ( locking )
> +spin_unlock(&map_pgdir_lock);
> +continue;
> +    }
> +
>  ol3e = *pl3e;

Same here - move the if() below here and use ol3e in there.

With that
Reviewed-by: Jan Beulich 

I'm not certain this is important enough a fix to consider for 4.10,
and you seem to think it's good enough if this gets applied only
after the tree has been branched, as you didn't Cc Julien. Please
indicate whether you simply weren't aware, or whether there's indeed
an important aspect to this that I'm overlooking.

Jan




Re: [Xen-devel] [PATCH v2 0/4] x86/cpuid: Enable new SSE/AVX/AVX512 cpu features

2017-11-10 Thread Jan Beulich
>>> On 10.11.17 at 10:36,  wrote:
> Yang Zhong (4):
>   x86/cpuid: Enable new SSE/AVX/AVX512 cpu features

The ordering is wrong - as said before, these ...

>   x86emul: Support GFNI insns
>   x86emul: Support vpclmulqdq
>   x86emul: Support vaes insns

... are supposed to be prereqs of the actual enabling of the CPUID
bits.

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 15:16,  wrote:
> On Thu, 2017-11-09 at 06:08 -0700, Jan Beulich wrote:
>> Tasklets already take care of this by
>> calling sync_local_execstate() before calling the handler. But
>> for softirqs this isn't really an option; I'm surprised to see that
>> tasklet code does this independently of what kind of tasklet that
>> is. 
>>
> Good point. Weird indeed.
> 
>> Softirq tasklets aren't used very often, so I wonder if we have
>> a latent bug here. Otoh, if this was actually fine, adding a similar
>> call to rcu_do_batch() (or its caller) would be an option, I think.
>> 
> We can have a look.

I'm sorry for the noise here - I've once again forgotten that
sync_local_execstate() does nothing if a lazy switch hasn't
happened already. Hence leaving the potentially bad
performance effect aside, doing the same for RCU (or more
generally softirqs) would be an option.
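
I.e. (sketched from memory - treat details as assumptions):

static void __sync_local_execstate(void)
{
    unsigned long flags;

    local_irq_save(flags);

    /* Only if a lazy switch left foreign state behind ... */
    if ( this_cpu(curr_vcpu) != current )
    {
        ASSERT(is_idle_vcpu(current));
        __context_switch();    /* ... is there anything to flush. */
    }

    local_irq_restore(flags);
}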

Jan




Re: [Xen-devel] [PATCH v3 for-next 0/4] xen: Convert __page_to_mfn and _mfn_to_page to use typesafe MFN

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:48,  wrote:
> On 09/11/17 15:47, Jan Beulich wrote:
>>>>> On 09.11.17 at 16:39,  wrote:
>>> What I meant is you would replace the 4 occurrences by
>>> mfn_to_page(_mfn(...)). If you are happy with that, then fine.
>> 
>> Oh, sure, that's a fine intermediate state, which we have all over
>> the place right now. It's clear that it'll take a while to fully carry
>> through typesafeness to everywhere.
> 
> Would you be fine with other part of Xen too? If so, I can have a go at 
> removing completely __page_to_mfn/__mfn_to_page.

Oh, if you want to go that extra mile, that's certainly fine with me.

Jan




Re: [Xen-devel] [PATCH v2] x86/pvh: Do not add DSDT and FACS to PVH dom0 XSDT

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:37,  wrote:
> These tables are pointed to from FADT. Adding them will
> result in duplicate entries in the guest's tables.
> 
> Signed-off-by: Boris Ostrovsky 

Reviewed-by: Jan Beulich 





Re: [Xen-devel] [PATCH v3 for-next 0/4] xen: Convert __page_to_mfn and _mfn_to_page to use typesafe MFN

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:39,  wrote:
> On 09/11/17 15:36, Jan Beulich wrote:
>>>>> On 09.11.17 at 16:20,  wrote:
>>> I had a look at the files that needs to convert. It seems there are few
>>> files with page_to_mfn/mfn_to_page re-defined but no callers:
>>> - arch/x86/mm/hap/nested_hap.c
>>> - arch/x86/mm/p2m-pt.c
>>> - arch/x86/pv/traps.c
>>> - arch/x86/pv/mm.c
>>> - arch/x86/pv/iret.c
>>>
>>> Those can be fixed now. That leave the following files:
>>> - arch/x86/mm/p2m-ept.c
>>> In that file, the override prevents all the caller to use the
>>> construction mfn_to_page(_mfn(...)) as it mostly deals with hardware.
>> 
>> I'm not clear what you're trying to tell me here. The file has a total
>> of four mfn_to_page() uses - if overrides don't help (and perhaps
>> regardless of whether they do), I think it wouldn't be very difficult to simply
>> change the four places. And note that plain staging has no override
>> there right now.
> Because the plain staging still has page_to_mfn() not using typesafe... 
> You would need to override it even I follow your suggestion.
> 
> What I meant is you would replace the 4 occurrences by 
> mfn_to_page(_mfn(...)). If you are happy with that, then fine.

Oh, sure, that's a fine intermediate state, which we have all over
the place right now. It's clear that it'll take a while to fully carry
through typesafeness to everywhere.

Jan




Re: [Xen-devel] [PATCH v3 for-next 0/4] xen: Convert __page_to_mfn and _mfn_to_page to use typesafe MFN

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:20,  wrote:
> I had a look at the files that needs to convert. It seems there are few 
> files with page_to_mfn/mfn_to_page re-defined but no callers:
>   - arch/x86/mm/hap/nested_hap.c
>   - arch/x86/mm/p2m-pt.c
>   - arch/x86/pv/traps.c
>   - arch/x86/pv/mm.c
>   - arch/x86/pv/iret.c
> 
> Those can be fixed now. That leave the following files:
>   - arch/x86/mm/p2m-ept.c
>   In that file, the override prevents all the caller to use the 
> construction mfn_to_page(_mfn(...)) as it mostly deals with hardware.

I'm not clear what you're trying to tell me here. The file has a total
of four mfn_to_page() uses - if overrides don't help (and perhaps
regardless of whether they do), I think it wouldn't be very difficult to simply
change the four places. And note that plain staging has no override
there right now.

Jan




Re: [Xen-devel] [PATCH 1/2 v2] xen: Add support for initializing 16550 UART using ACPI

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:07,  wrote:
> On Thu, Nov 09, 2017 at 06:18:21AM -0700, Jan Beulich wrote:
>> >>> On 09.11.17 at 12:31,  wrote:
>> > On Thu, Nov 09, 2017 at 03:49:23PM +0530, Bhupinder Thakur wrote:
>> >> +static int ns16550_init_dt(struct ns16550 *uart,
>> >> +   const struct dt_device_node *dev)
>> >> +{
>> >> +return -EINVAL;
>> >> +}
>> >> +#endif
>> >> +
>> >> +#ifdef CONFIG_ACPI
>> >> +#include 
>> > 
>> > Please place the include at the top of the file, together with the
>> > other ones.
>> 
>> I disagree here, at least as long as that header isn't itself expanding
>> to nothing (or only stubs) when !CONFIG_ACPI.
> 
> The header already has a CONFIG_ACPI guard inside, but it doesn't
> cover the full content, so my suggestion was to move the include to
> the top of the file, but keeping the CONFIG_ACPI guards around it.

Ah, I see. I'd then still prefer one less #ifdef by keeping it where it
is, but that's a matter of taste I admit.

Jan




[Xen-devel] [PATCH] VMX: sync CPU state upon vCPU destruction

2017-11-09 Thread Jan Beulich
See the code comment being added for why we need this.

Reported-by: Igor Druzhinin 
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -479,7 +479,13 @@ static void vmx_vcpu_destroy(struct vcpu
  * we should disable PML manually here. Note that vmx_vcpu_destroy is 
called
  * prior to vmx_domain_destroy so we need to disable PML for each vcpu
  * separately here.
+ *
+ * Before doing that though, flush all state for the vCPU previously having
+ * run on the current CPU, so that this flushing of state won't happen from
+ * the TLB flush IPI handler behind the back of a vmx_vmcs_enter() /
+ * vmx_vmcs_exit() section.
  */
+sync_local_execstate();
 vmx_vcpu_disable_pml(v);
 vmx_destroy_vmcs(v);
 passive_domain_destroy(v);






Re: [Xen-devel] [PATCH v2] x86/hvm: do not register hpet mmio during s3 cycle

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 15:42,  wrote:
> Hi,
> 
> On 09/11/17 08:55, Jan Beulich wrote:
>>>>> On 08.11.17 at 20:46,  wrote:
>>> Do it once at domain creation (hpet_init).
>>>
>>> Sleep -> Resume cycles will end up crashing an HVM guest with hpet as
>>> the sequence during resume takes the path:
>>> -> hvm_s3_suspend
>>>-> hpet_reset
>>>  -> hpet_deinit
>>>  -> hpet_init
>>>-> register_mmio_handler
>>>  -> hvm_next_io_handler
>>>
>>> register_mmio_handler will use a new io handler each time, until
>>> eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
>>> domain_crash.
>>>
>>> Signed-off-by: Eric Chanudet 
>>>
>>> ---
>>> v2:
>>>* make hpet_reinit static inline (one call site in this file)
>> 
>> Perhaps my prior reply was ambiguous: By "inlining" I meant
>> literally inlining it (i.e. dropping the standalone function
>> altogether). Static functions outside of header files should not
>> normally be marked "inline" explicitly - that decision should be
>> left to the compiler.
>> 
>> As doing the adjustment is relatively simple, I wouldn't mind
>> doing so while committing, saving another round trip. With
>> that adjustment (or at the very least with the "inline" dropped)
>> Reviewed-by: Jan Beulich 
> 
> What would be the risk to get this patch in Xen 4.10?

Close to none, I would say. Of course, if there really was
something wrong with the code restructuring to fix the bug,
basically all HVM guests would be hosed HPET-wise.
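
As to the "literally inlining" remark quoted above, the shape being
asked for is roughly this (purely illustrative - names and layout are
assumptions):

void hpet_init(struct domain *d)
{
    HPETState *h = domain_vhpet(d);

    /* Former hpet_reinit() body, open-coded in its only caller: */
    memset(h, 0, sizeof(*h));
    /* ... remaining state setup ... */

    /* Registered once per domain, so S3 cycles can no longer
     * exhaust the NR_IO_HANDLERS pool: */
    register_mmio_handler(d, &hpet_mmio_ops);
}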

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 15:16,  wrote:
> Ah, yes, my bad! What if I take vcpu_migrate() out of the above exec-
> trace (which is what I wanted to do in my email already)?
> 
> pCPU1
> =
> current == vCPU1
> context_switch(next == idle)
> !! __context_switch() is skipped
> anything_that_uses_or_touches_context()
> 
> Point being, is the underlying and general "issue" here, really bound
> to migrations, or is it something intrinsic of lazy context switch? I'm
> saying it's the latter.

The general issue doesn't require vcpu_migrate(), I agree. The
specific VMX issue here does, though.

Thing is - I'm not convinced there's a general issue here in the first
place: Asynchronous code isn't supposed to modify state behind
the back of synchronous code. It just so happens that VMX code is
structured to violate that assumption when PML is in use.

> That being said, sure it makes sense to assume that, if we migrated the
> vCPU from pCPU1 to pCPU2, it's highly unlikely that it will resume the
> execution on pCPU1, and hence there is no point in leaving its context
> there, and we could very well sync. It's a reasonable best-effort
> measure, but can we rely on it for correctness? I don't think we can.

We can't right now, but code (from an abstract pov at least)
could be structured so we could rely on it.

> And generalizing the idea enough that we could then rely on it for
> correctness, tends to be close enough to not doing lazy context switch
> at all, I think.

I don't think so, no - we could still leave state in hardware in
anticipation that no other non-idle vCPU would be scheduled
on the local CPU. That's something that ought to help pinned
vCPU-s in particular.

>> The problem is with tasklet / softirq
>> (and hence also RCU) work. 
>>
> Yes.
> 
>> Tasklets already take care of this by
>> calling sync_local_execstate() before calling the handler. But
>> for softirqs this isn't really an option; I'm surprised to see that
>> tasklet code does this independently of what kind of tasklet it
>> is. 
>>
> Good point. Weird indeed.

I've added an item to my todo list to see whether I can figure out
whether this is an okay thing to do.

>> Softirq tasklets aren't used very often, so I wonder if we have
>> a latent bug here. Otoh, if this was actually fine, adding a similar
>> call to rcu_do_batch() (or its caller) would be an option, I think.
>> 
> We can have a look.
> 
> What about the effect on performance, though?
> 
> I mean, assuming that lazy context switch does a good job, with respect
> to that, by avoiding synching in enough case where it is actually not
> necessary, how do things change if we start to sync at any softirq,
> even when the handler would have not required that?

I wasn't suggesting this for softirqs, but only (if at all) for RCU. But
yes, there would be a performance hit from this; not sure how small
or large. But as you can see, for tasklets the hit is taken
unconditionally.

Jan




Re: [Xen-devel] [PATCH 1/2 v2] xen: Add support for initializing 16550 UART using ACPI

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 12:31,  wrote:
> On Thu, Nov 09, 2017 at 03:49:23PM +0530, Bhupinder Thakur wrote:
>> +static int ns16550_init_dt(struct ns16550 *uart,
>> +   const struct dt_device_node *dev)
>> +{
>> +return -EINVAL;
>> +}
>> +#endif
>> +
>> +#ifdef CONFIG_ACPI
>> +#include 
> 
> Please place the include at the top of the file, together with the
> other ones.

I disagree here, at least as long as that header isn't itself expanding
to nothing (or only stubs) when !CONFIG_ACPI.
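
For context, the kind of header that is fine to include at the top of
the file unconditionally is one shaped like the sketch below - all
names here are made up, since the actual header name was stripped by
the archive:

/* hypothetical acpi_uart.h - safe to include even when !CONFIG_ACPI */
#ifdef CONFIG_ACPI
int acpi_uart_probe(struct ns16550 *uart);
#else
static inline int acpi_uart_probe(struct ns16550 *uart)
{
    return -EINVAL;   /* harmless stub when ACPI is compiled out */
}
#endif

A header which instead declares things that only exist with ACPI
enabled is better kept next to its sole #ifdef CONFIG_ACPI user.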

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 12:01,  wrote:
> Anyway, as I was trying to explain replying to Jan, although in this
> situation the issue manifests as a consequence of vCPU migration, I
> think it is indeed more general, as in, without even the need to
> consider a second pCPU:
> 
> pCPU1
> =
> current == vCPU1
> context_switch(next == idle)
> !! __context_switch() is skipped
> vcpu_migrate(vCPU1)
> anything_that_uses_or_touches_context()
> 
> So, it must be anything_that_uses_or_touches_context() --knowing it
> will be reading or touching the pCPU context-- that syncs the state,
> before using or touching it (which is, e.g., what Jan's patch does).

Well, generally after the vcpu_migrate(vCPU1) above we expect
nothing to happen at all on the pCPU, until another (non-idle)
vCPU gets scheduled onto it. The problem is with tasklet / softirq
(and hence also RCU) work. Tasklets already take care of this by
calling sync_local_execstate() before calling the handler. But
for softirqs this isn't really an option; I'm surprised to see that
tasklet code does this independently of what kind of tasklet that
is. Softirq tasklets aren't used very often, so I wonder if we have
a latent bug here. Otoh, if this was actually fine, adding a similar
call to rcu_do_batch() (or its caller) would be an option, I think.
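
To make the parallel concrete, here is a minimal sketch (not the
actual Xen code) of the two deferred-work paths, with the proposed RCU
addition marked:

/* Tasklets: this flush already happens before the handler runs. */
static void run_tasklet_work(struct tasklet *t)
{
    sync_local_execstate();
    t->func(t->data);
}

/* RCU: the analogous, so far hypothetical, addition. */
static void run_rcu_batch(struct rcu_head *list)
{
    sync_local_execstate();   /* proposed; mirrors the tasklet path */
    for ( ; list; list = list->next )
        list->func(list);
}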

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 11:36,  wrote:
> Well, I'm afraid I only see two solutions:
> 1) we get rid of lazy context switch;
> 2) whatever it is that is happening at point c above, it needs to be 
>aware that we use lazy context switch, and make sure to sync the 
>context before playing with or altering it;

3) Better centralize the updating of v->processor, so that it becomes
reasonable to sync state there. Igor's idea of flushing state once it
is known (or at least pretty certain) that the vCPU won't run on the
prior pCPU next time it gets scheduled is certainly a reasonable one.
It just doesn't fit well with how the individual schedulers currently
behave.
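
As a sketch of what 3) might mean in practice (names and placement
hypothetical - this is not a worked-out patch):

/* Single choke point for updating a vCPU's processor assignment. */
static void vcpu_set_processor(struct vcpu *v, unsigned int new_cpu)
{
    if ( v->processor != new_cpu )
        sync_vcpu_execstate(v);   /* flush lazily-retained state */
    v->processor = new_cpu;
}

Each of the scheduler-specific sites would then funnel through such a
helper instead of writing v->processor directly.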

Jan




Re: [Xen-devel] [PATCH] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 11:24,  wrote:
> On 11/9/2017 5:19 PM, Jan Beulich wrote:
>> 2) Is your change actually enough to take care of all forms of the
>> race you describe? In particular, isn't it necessary to re-check PSE
>> after having taken the lock, in case another CPU has just finished
>> doing the re-consolidation?
> 
> Good question. :-)
> 
> I'd thought of checking the PSE for pl2e, and dropped that. My understanding
> was below:
> After the lock is taken, pl1e will be pointing either to an L1 page
> table in normal cases, or to a superpage if another CPU has just
> finished the re-consolidation and released the lock. And in the
> latter scenario, l1e_get_pfn(*pl1e) shall not be equal to
> (base_mfn + i), and will not jump out of the loop.
> 
> But on second thought, the above understanding is based on an
> assumption about the contents of the target superpage. No matter how
> small the chance is, we cannot make such an assumption.
> 
> So my suggestion is that we add a check of the PSE flag and, if it is
> set, "goto check_l3". Is this reasonable to you?

Yes; for the L3 case it'll be a simple "continue" afaict.
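
I.e. something along these lines for the L2 path (a sketch, not the
actual v2 patch):

    if ( locking )
        spin_lock(&map_pgdir_lock);

    ol2e = *pl2e;
    if ( l2e_get_flags(ol2e) & _PAGE_PSE )
    {
        /* Another CPU re-consolidated this range first. */
        if ( locking )
            spin_unlock(&map_pgdir_lock);
        goto check_l3;
    }

    pl1e = l2e_to_l1e(ol2e);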

>> 3) What about the empty&free checks in modify_xen_mappings()?
> 
> Oh. Thanks for the reminder.
> Just had a look. It seems pl1e or pl2e may be freed more than once for the
> empty & free checks, due to lack of protection.
> So we'd better add a lock too, right?

Yes, I think so.

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 10:54,  wrote:
> On Tue, 2017-11-07 at 14:24 +, Igor Druzhinin wrote:
>> Perhaps I should improve my diagram:
>> 
>> pCPU1: vCPUx of domain X -> migrate to pCPU2 -> switch to idle
>> context
>> -> RCU callbacks -> vcpu_destroy(vCPUy of domain Y) ->
>> vmx_vcpu_disable_pml() -> vmx_vmcs_clear() (VMCS is trashed at this
>> point on pCPU1)
>> 
>> pCPU2: context switch into vCPUx -> vCPUx.is_running = 1 -> TLB flush
>> from context switch to clean TLB on pCPU1
>> 
> Sorry, there must be something I'm missing (or misunderstanding).
> 
> What is this code that checks is_running and triggers the TLB flush?

I don't see where Igor said is_running is being checked around a
TLB flush. The TLB flush itself is what happens first thing in
context_switch() (and it's really using the TLB flush interface to
mainly effect the state flush, with the TLB flush being an implied
side effect; I've already got a series of further patches to make
this less implicit).

> But, more important, how come you are context switching to something
> that has is_running == 1 ? That should not be possible.

That's not what Igor's diagram says - it's indicating the fact that
is_running is being set to 1 in the process of context switching
into vCPUx.

Jan




Re: [Xen-devel] [PATCH v2 1/2] VMX: fix VMCS race on context-switch paths

2017-11-09 Thread Jan Beulich
>>> On 07.11.17 at 16:52,  wrote:
> There is one thing that I'm worrying about with this approach:
> 
> At this place we just sync the idle context because we know that we are
> going to deal with VMCS later. But what about other potential cases
> (perhaps some softirqs) in which we are accessing a vCPU data structure
> that is currently shared between different pCPUs. Maybe we'd better sync
> the context as soon as possible after we switched to idle from a
> migrated vCPU.

Short of feedback from the scheduler maintainers I've looked
into this some more. Calling sync_vcpu_execstate() out of
vcpu_move_locked() would seem feasible, but there are a number
of other places where ->processor of a vCPU is being updated,
and where the vCPU was not (obviously) put to sleep already:

- credit1's csched_runq_steal()
- credit2's migrate()
- csched2_schedule()
- null's vcpu_assign() when called out of null_schedule()
- rt_schedule()

I don't think it is reasonable to call sync_vcpu_execstate() in all of
those places, and hence VMX'es special VMCS management may
indeed better be taken care of by VMX-specific code (i.e. as
previously indicated the sync_local_execstate() invocation moved
from vcpu_destroy() - where my most recent patch draft had put
it - to vmx_vcpu_destroy()). And we'd have to live with other
VMX code paths having similar asynchronous behavior needing to
similarly take care of the requirement.
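
I.e. roughly (a sketch - the surrounding function body is reproduced
from memory and may not match the tree exactly):

static void vmx_vcpu_destroy(struct vcpu *v)
{
    /*
     * Flush any lazily-retained state of this vCPU first: PML
     * deactivation below touches the VMCS, which the context switch
     * code may otherwise still consider live on this pCPU.
     */
    sync_local_execstate();

    vmx_destroy_vmcs(v);
    vpmu_destroy(v);
    passive_domain_destroy(v);
}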

Jan




Re: [Xen-devel] [PATCH] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:29,  wrote:
> In map_pages_to_xen(), an L2 page table entry may be reset to point
> to a superpage, and its corresponding L1 page table needs to be freed
> in such a scenario, when its L1 page table entries map consecutive
> page frames and carry the same mapping flags.
> 
> However, the variable `pl1e` is not protected by the lock before the
> L1 page table is enumerated. A race condition may occur if this code
> path is invoked simultaneously on different CPUs.
> 
> For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
> page which has just been freed on CPU1. Besides, before this page is
> reused, it will still hold the old PTEs, referencing consecutive page
> frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
> triggered on CPU0, resulting in the unexpected freeing of a normal
> page.
> 
> Protecting `pl1e` with the lock fixes this race condition.
> 
> Signed-off-by: Min He 
> Signed-off-by: Yi Zhang 
> Signed-off-by: Yu Zhang 

Oh, one more thing: Is it really the case that all three of you
contributed to the patch? We don't use the Linux model of
everyone through whose hands a patch passes adding an
S-o-b of their own - that would rather be Reviewed-by then (if
applicable).

Also generally I would consider the first S-o-b to be that of the
original author, yet the absence of an explicit From: tag makes
authorship ambiguous here. Please clarify this in v2.

Jan




Re: [Xen-devel] [PATCH] x86/mm: fix a potential race condition in map_pages_to_xen().

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 16:29,  wrote:
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -4844,9 +4844,10 @@ int map_pages_to_xen(
>  {
>  unsigned long base_mfn;
>  
> -pl1e = l2e_to_l1e(*pl2e);
>  if ( locking )
>  spin_lock(&map_pgdir_lock);
> +
> +pl1e = l2e_to_l1e(*pl2e);
>  base_mfn = l1e_get_pfn(*pl1e) & ~(L1_PAGETABLE_ENTRIES - 1);
>  for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++, pl1e++ )
>  if ( (l1e_get_pfn(*pl1e) != (base_mfn + i)) ||

I agree with the general observation, but there are three things I'd
like to see considered:

1) Please extend the change slightly such that the L2E
re-consolidation code matches the L3E one (i.e. latch into ol2e
earlier and pass that one to l2e_to_l1e()). Personally I would even
prefer if the presence/absence of blank lines matched between
the two pieces of code.

2) Is your change actually enough to take care of all forms of the
race you describe? In particular, isn't it necessary to re-check PSE
after having taken the lock, in case another CPU has just finished
doing the re-consolidation?

3) What about the empty&free checks in modify_xen_mappings()?

Jan




Re: [Xen-devel] [PATCH] x86/pvh: Do not add DSDT and FACS to PVH dom0 XSDT

2017-11-09 Thread Jan Beulich
>>> On 08.11.17 at 21:19,  wrote:
> These tables are pointed to from FADT. Adding them will
> result in duplicate entries in the guest's tables.

Oh, indeed. Just one small adjustment request:

> +static bool __init pvh_acpi_table_in_xsdt(const char *sig)
> +{
> +/*
> + * DSDT and FACS are pointed to from FADT and thus don't belong
> + * in XSDT.
> + */
> +if ( !strncmp(sig, ACPI_SIG_DSDT, ACPI_NAME_SIZE) ||
> + !strncmp(sig, ACPI_SIG_FACS, ACPI_NAME_SIZE) )
> +return false;
> +
> +return true;
> +}

Please don't use two return statements in cases like this one -
all can be had with a single one and no explicit uses of true or
false.
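
I.e. (a sketch of what's being asked for):

static bool __init pvh_acpi_table_in_xsdt(const char *sig)
{
    /*
     * DSDT and FACS are pointed to from FADT and thus don't belong
     * in XSDT.
     */
    return strncmp(sig, ACPI_SIG_DSDT, ACPI_NAME_SIZE) &&
           strncmp(sig, ACPI_SIG_FACS, ACPI_NAME_SIZE);
}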

Jan




Re: [Xen-devel] [PATCH v2] x86/hvm: do not register hpet mmio during s3 cycle

2017-11-09 Thread Jan Beulich
>>> On 08.11.17 at 20:46,  wrote:
> Do it once at domain creation (hpet_init).
> 
> Sleep -> Resume cycles will end up crashing an HVM guest with hpet as
> the sequence during resume takes the path:
> -> hvm_s3_suspend
>   -> hpet_reset
> -> hpet_deinit
> -> hpet_init
>   -> register_mmio_handler
> -> hvm_next_io_handler
> 
> register_mmio_handler will use a new io handler each time, until
> eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
> domain_crash.
> 
> Signed-off-by: Eric Chanudet 
> 
> ---
> v2:
>   * make hpet_reinit static inline (one call site in this file)

Perhaps my prior reply was ambiguous: By "inlining" I meant
literally inlining it (i.e. dropping the standalone function
altogether). Static functions outside of header files should not
normally be marked "inline" explicitly - it should be the compiler
to make that decision.
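
To spell out the difference with entirely made-up names (this is not
the hpet code itself):

#include <stdbool.h>

struct timer_state { bool armed; };

static void timer_reset(struct timer_state *t) { t->armed = false; }

/* The v2 shape: a one-caller helper, explicitly marked inline. */
static inline void timer_reinit(struct timer_state *t)
{
    timer_reset(t);
    t->armed = true;
}

void timer_init(struct timer_state *t)
{
    timer_reinit(t);   /* sole caller */
}

"Literally inlining" means deleting timer_reinit() altogether and
moving its two statements into timer_init(); whether the compiler then
inlines anything is its own decision.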

As doing the adjustment is relatively simple, I wouldn't mind
doing so while committing, saving another round trip. With
that adjustment (or at the very least with the "inline" dropped)
Reviewed-by: Jan Beulich 

Jan




Re: [Xen-devel] [PATCH V2] Xen/pciback: Implement PCI slot or bus reset with 'do_flr' SysFS attribute

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 00:06,  wrote:
> --- a/drivers/xen/xen-pciback/pci_stub.c
> +++ b/drivers/xen/xen-pciback/pci_stub.c
> @@ -244,6 +244,91 @@ struct pci_dev *pcistub_get_pci_dev(struct 
> xen_pcibk_device *pdev,
>   return found_dev;
>  }
>  
> +struct pcistub_args {
> + struct pci_dev *dev;

Please don't ignore prior review comments: Either carry out what
was requested, or explain why the request can't be fulfilled. You
saying "This field will point to first device that is not owned by
pcistub" to Roger's request to make this a pointer to const is not a
valid reason to not do the adjustment; in fact your reply is entirely
unrelated to the request.

> +static int pcistub_search_dev(struct pci_dev *dev, void *data)
> +{
> + struct pcistub_device *psdev;
> + struct pcistub_args *arg = data;
> + bool found_dev = false;

Purely cosmetic, but anyway: Why not just "found"? What else
could be (not) found here other than the device in question?

> + unsigned long flags;
> +
> + spin_lock_irqsave(&pcistub_devices_lock, flags);
> +
> + list_for_each_entry(psdev, &pcistub_devices, dev_list) {
> + if (psdev->dev == dev) {
> + found_dev = true;
> + arg->dcount++;
> + break;
> + }
> + }
> +
> + spin_unlock_irqrestore(&pcistub_devices_lock, flags);
> +
> + /* Device not owned by pcistub, someone owns it. Abort the walk */
> + if (!found_dev)
> + arg->dev = dev;
> +
> + return found_dev ? 0 : 1;

Despite the function needing to return int, this can be simplified to
"return !found_dev". I'd also like to note that the part of the
earlier comment related to this is sort of disconnected. How about

/* Device not owned by pcistub, someone owns it. Abort the walk */
if (!found_dev) {
arg->dev = dev;
return 1;
}

return 0;

And finally - I don't think the comment is entirely correct - the
device not being owned by pciback doesn't necessarily mean it's
owned by another driver. It could as well be unowned.
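
Folding all of the above together, the callback might end up looking
like this (a sketch against the v2 code, not a tested patch):

static int pcistub_search_dev(struct pci_dev *dev, void *data)
{
	struct pcistub_device *psdev;
	struct pcistub_args *arg = data;
	bool found = false;
	unsigned long flags;

	spin_lock_irqsave(&pcistub_devices_lock, flags);

	list_for_each_entry(psdev, &pcistub_devices, dev_list) {
		if (psdev->dev == dev) {
			found = true;
			arg->dcount++;
			break;
		}
	}

	spin_unlock_irqrestore(&pcistub_devices_lock, flags);

	/*
	 * Device not owned by pciback; it may be owned by another
	 * driver or simply unowned. Abort the walk.
	 */
	if (!found) {
		arg->dev = dev;
		return 1;
	}

	return 0;
}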

> +static int pcistub_reset_dev(struct pci_dev *dev)
> +{
> + struct xen_pcibk_dev_data *dev_data;
> + bool slot = false, bus = false;
> + struct pcistub_args arg = {};
> +
> + if (!dev)
> + return -EINVAL;
> +
> + dev_dbg(&dev->dev, "[%s]\n", __func__);
> +
> + if (!pci_probe_reset_slot(dev->slot))
> + slot = true;
> + else if ((!pci_probe_reset_bus(dev->bus)) &&
> +  (!pci_is_root_bus(dev->bus)))
> + bus = true;
> +
> + if (!bus && !slot)
> + return -EOPNOTSUPP;
> +
> + /*
> +  * Make sure all devices on this bus are owned by the
> +  * PCI backend so that we can safely reset the whole bus.
> +  */

Is that really the case when you mean to do a slot reset? It was for
a reason that I had asked about a missing "else" in v1 review,
rather than questioning the conditional around the logic.

> + pci_walk_bus(dev->bus, pcistub_search_dev, &arg);
> +
> + /* All devices under the bus should be part of pcistub! */
> + if (arg.dev) {
> + dev_err(&dev->dev, "%s device on bus 0x%x is not owned by pcistub\n",

%#x

Yet then, thinking about what would be useful information should the
situation really arise, I'm not convinced printing a bare bus number
here is useful either. Especially for the case of multiple parallel
requests you want to make it possible to match each message to the
original request (guest start or whatever). Hence I think you want
something like

"%s on the same bus as %s is not owned by " DRV_NAME "\n"

> + pci_name(arg.dev), dev->bus->number);
> +
> + return -EBUSY;
> + }
> +
> + dev_dbg(&dev->dev, "pcistub owns %d devices on bus 0x%x\n",
> + arg.dcount, dev->bus->number);

While here the original device is perhaps not necessary to print,
the bare bus number doesn't carry enough information: You'll
want to prefix it by the segment number. Plus you'll want to use
canonical formatting (<seg>:bb), so one can get matches when
suitably grep-ing the log. Perhaps bus->name is what you're
after.
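
Taken together, the two messages might then read (a sketch; whether
bus->name carries the desired segment:bus form is an assumption):

	dev_err(&dev->dev,
		"%s on the same bus as %s is not owned by " DRV_NAME "\n",
		pci_name(arg.dev), pci_name(dev));

	dev_dbg(&dev->dev, "%s owns %d devices on bus %s\n",
		DRV_NAME, arg.dcount, dev->bus->name);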

Jan




Re: [Xen-devel] [PATCH] Xen/pciback: Implement PCI slot or bus reset with 'do_flr' SysFS attribute

2017-11-09 Thread Jan Beulich
>>> On 08.11.17 at 16:44,  wrote:
> On 11/7/2017 8:40 AM, Jan Beulich wrote:
>>>>> On 06.11.17 at 18:48,  wrote:
>>> --- a/Documentation/ABI/testing/sysfs-driver-pciback
>>> +++ b/Documentation/ABI/testing/sysfs-driver-pciback
>>> @@ -11,3 +11,15 @@ Description:
>>>   #echo 00:19.0-E0:2:FF > 
>>> /sys/bus/pci/drivers/pciback/quirks
>>>   will allow the guest to read and write to the 
>>> configuration
>>>   register 0x0E.
>>> +
>>> +What:   /sys/bus/pci/drivers/pciback/do_flr
>>> +Date:   Nov 2017
>>> +KernelVersion:  4.15
>>> +Contact:xen-de...@lists.xenproject.org 
>>> +Description:
>>> +An option to perform a slot or bus reset when a PCI device
>>> +   is owned by Xen PCI backend. Writing a string of :BB:DD.F
>>> +   will cause the pciback driver to perform a slot or bus reset
>>> +   if the device supports it. It also checks to make sure that
>>> +   all of the devices under the bridge are owned by Xen PCI
>>> +   backend.
>> Why do you name this "do_flr" when you don't even try FLR, but
>> go to slot or then bus reset right away.
> Yes, I agree, but the xen toolstack has already been modified to
> consume the "do_flr" attribute. Hence, we are using the function name
> that matches the sysfs attribute.

That's not a valid reason imo: Right now the driver doesn't expose
such an attribute, so if the tool stack was trying to use it, it would
not do what is intended at present anyway (i.e. the code could at
best be called dead). Furthermore, contrary to what you claim in
your reply to Pasi, I can't see where you try an actual FLR first -
you go straight to pci_probe_reset_{slot,bus}(). If you actually
tried FLR first, only falling back to the other methods as "emulation",
I could certainly agree with the file name chosen.

And btw - I don't consider it too good an idea to post the next
version of a patch when discussion of the prior one hasn't really
settled yet.

Jan




Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash

2017-11-09 Thread Jan Beulich
>>> On 09.11.17 at 02:44,  wrote:
> On 11/07/17 01:37 -0700, Jan Beulich wrote:
>> I don't believe a crash is the expected outcome here.
>>
> 
> This test case injects two errors to the same dom0 page. During the
> first injection, offline_page() is called to set PGC_broken flag of
> that page. During the second injection, offline_page() detects the
> same broken page is touched again, and then tries to shutdown the page
> owner, i.e. dom0 in this case:
> 
> /*
>  * NB. When broken page belong to guest, usually hypervisor will
>  * notify the guest to handle the broken page. However, hypervisor
>  * need to prevent malicious guest access the broken page again.
>  * Under such case, hypervisor shutdown guest, preventing recursive mce.
>  */
> if ( (pg->count_info & PGC_broken) && (owner = page_get_owner(pg)) )
> {
> *status = PG_OFFLINE_AGAIN;
> domain_shutdown(owner, SHUTDOWN_crash);
> return 0;
> }
> 
> So I think Dom0 crash and the following machine reboot are the
> expected behaviors here.

Oh, that's quite unfortunate - there should at least be a log
message associated with that; I can't see why this isn't
domain_crash() in the first place, which would leave one. I'm
queuing a patch (but for after 4.10 only).
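
The queued adjustment would presumably boil down to (a sketch; exact
shape and message still to be determined):

if ( (pg->count_info & PGC_broken) && (owner = page_get_owner(pg)) )
{
    *status = PG_OFFLINE_AGAIN;
    /* domain_crash() logs the event (incl. the call site), where
     * domain_shutdown(..., SHUTDOWN_crash) stays silent. */
    domain_crash(owner);
    return 0;
}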

Jan




Re: [Xen-devel] [PATCH 3/3] x86/xen: use guest_late_init to detect Xen PVH guest

2017-11-08 Thread Jan Beulich
>>> On 08.11.17 at 13:45,  wrote:
> On 08/11/17 13:31, Jan Beulich wrote:
>>>>> On 08.11.17 at 12:55,  wrote:
>>> On 08/11/17 12:18, Jan Beulich wrote:
>>>>>>> On 08.11.17 at 10:07,  wrote:
>>>>> In case we are booted via the default boot entry by a generic loader
>>>>> like grub or OVMF it is necessary to distinguish between a HVM guest
>>>>> with a device model supporting legacy devices and a PVH guest without
>>>>> device model.
>>>>>
>>>>> PVH guests will always have x86_platform.legacy.no_vga set and
>>>>> x86_platform.legacy.rtc cleared, while both won't be true for HVM
>>>>> guests.
>>>>>
>>>>> Test for both conditions in the guest_late_init hook and set xen_pvh
>>>>> to true if they are met.
>>>>
>>>> This sounds pretty fragile to me: I can't see a reason why a proper
>>>> HVM guest couldn't come without VGA and RTC. That's not possible
>>>> today, agreed, but certainly an option down the road if virtualization
>>>> follows bare metal's road towards being legacy free.
>>>
>>> From guest's perspective: what is the difference between a legacy free
>>> HVM domain and PVH? In the end the need for differentiating is to avoid
>>> access to legacy features in PVH as those would require a device model.
>> 
>> My point is that "legacy free" would likely be reached over time (and
>> even once fully reached, hybrid configurations would be possible).
>> I.e. there could be a setup with PIC, but with neither VGA nor RTC.
>> That's still not PVH then. Nor do all legacy features require a device
>> model in the first place - some of them are being emulated entirely
>> in the hypervisor.
>> 
>> Furthermore, PVH absolutely requires guest awareness afaict, while
>> legacy-free pure HVM guests (with an OS only aware of the possible
>> absence of legacy devices) would still be possible.
> 
> Hmm, where else do you expect PVH awareness to be required? Maybe for
> vcpu hotplugging, but this could easily be solved by adding a Xenstore
> entry containing the required information. Is there any other problem to
> be expected before Xenstore access is possible?

Let me ask the question the other way around: What's all the PVH
specific code for under arch/x86/xen/ if there's no difference? One
thing I seem to remember is that getting hold of the ACPI tables
is different between PVH and HVM. Iirc the distinct PVH entry point
is (in part) for that purpose. In the end - with that separate entry
point - it is not really clear to me why any "detection" needs to be
done in the first place: You'd know which mode you're in by knowing
which entry point path you've taken.
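
For reference, the heuristic being debated (per the patch description
quoted above) boils down to roughly the following guest-side check -
a sketch, not necessarily the final form of the patch:

static void __init xen_hvm_guest_late_init(void)
{
    /*
     * No legacy VGA and no CMOS RTC is taken to mean "no device
     * model", i.e. PVH rather than plain HVM.
     */
    if (x86_platform.legacy.no_vga && !x86_platform.legacy.rtc)
        xen_pvh = true;
}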

Jan




Re: [Xen-devel] [PATCH for-4.10] gcov: return EOPNOTSUPP for unimplemented gcov domctl

2017-11-08 Thread Jan Beulich
>>> On 07.11.17 at 13:31,  wrote:
> ENOSYS should only be used by unimplemented top-level syscalls. Use
> EOPNOTSUPP instead.
> 
> Signed-off-by: Roger Pau Monné 
> Reported-by: Jan Beulich 

Btw I've taken the liberty to make the title say "sysctl" instead of
"domctl".

Jan


