[PATCH v2] pci: fix handling of PCI bridges with subordinate bus number 0xff

2021-09-24 Thread Igor Druzhinin
Bus number 0xff is valid according to the PCI spec. With u8-typed sec_bus and
sub_bus, assigning 0xff to sub_bus results in the following loop getting stuck:
sec_bus wraps back to 0 after 0xff, so the condition never becomes false.

for ( ; sec_bus <= sub_bus; sec_bus++ ) {...}

Just change the types to unsigned int, similar to what is already done in
dmar_scope_add_buses().
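The wraparound can be demonstrated in isolation (a minimal sketch, not the
actual alloc_pdev() code; count_buses() is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the buses in [sec_bus, sub_bus] the way the fixed loop does, using
 * a type wide enough to represent 0x100 so the loop terminates.
 *
 * With uint8_t counters and sub_bus == 0xff, sec_bus++ would wrap from
 * 0xff back to 0 and "sec_bus <= sub_bus" would never become false.
 */
static unsigned int count_buses(unsigned int sec_bus, unsigned int sub_bus)
{
    unsigned int n = 0;

    for ( ; sec_bus <= sub_bus; sec_bus++ )
        n++;

    return n;
}
```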

Signed-off-by: Igor Druzhinin 
---
v2:
- fix free_pdev() as well
- style fixes
---
 xen/drivers/passthrough/pci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index fc4fa2e..d65cda8 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -363,8 +363,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 
bus, u8 devfn)
 /* update bus2bridge */
 switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
 {
-u16 cap;
-u8 sec_bus, sub_bus;
+unsigned int cap, sec_bus, sub_bus;
 
 case DEV_TYPE_PCIe2PCI_BRIDGE:
 case DEV_TYPE_LEGACY_PCI_BRIDGE:
@@ -431,7 +430,7 @@ static void free_pdev(struct pci_seg *pseg, struct pci_dev 
*pdev)
 /* update bus2bridge */
 switch ( pdev->type )
 {
-uint8_t sec_bus, sub_bus;
+unsigned int sec_bus, sub_bus;
 
 case DEV_TYPE_PCIe2PCI_BRIDGE:
 case DEV_TYPE_LEGACY_PCI_BRIDGE:
-- 
2.7.4




[PATCH] pci: fix handling of PCI bridges with subordinate bus number 0xff

2021-09-23 Thread Igor Druzhinin
Bus number 0xff is valid according to the PCI spec. With u8-typed sec_bus and
sub_bus, assigning 0xff to sub_bus results in the following loop getting stuck:
sec_bus wraps back to 0 after 0xff, so the condition never becomes false.

for ( ; sec_bus <= sub_bus; sec_bus++ ) {...}

Just change its type to u16, the same way it is already handled in
dmar_scope_add_buses().

Signed-off-by: Igor Druzhinin 
---
 xen/drivers/passthrough/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index fc4fa2e..48b415c 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -364,7 +364,7 @@ static struct pci_dev *alloc_pdev(struct pci_seg *pseg, u8 
bus, u8 devfn)
 switch ( pdev->type = pdev_type(pseg->nr, bus, devfn) )
 {
 u16 cap;
-u8 sec_bus, sub_bus;
+u16 sec_bus, sub_bus;
 
 case DEV_TYPE_PCIe2PCI_BRIDGE:
 case DEV_TYPE_LEGACY_PCI_BRIDGE:
-- 
2.7.4




[PATCH v2] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-12 Thread Igor Druzhinin
The current uint8_t for the pirq argument in this interface is too restrictive,
causing failures on modern hardware with lots of GSIs. That extends down to
the XEN_DOMCTL_irq_permission ABI structure, where it needs to be fixed up
as well.

Internal Xen structures appear to be fine. Existing users of the interface
in tree (libxl, ocaml and python bindings) are currently using signed int
for pirq representation, which should be wide enough. Converting them to
uint32_t now is desirable to avoid accidental passing of a negative
number (probably denoting an error code) by a caller as pirq, but is left for
a future cleanup.
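The truncation the first paragraph describes can be sketched as follows
(narrow_pirq() is a hypothetical helper for illustration, not libxc code):

```c
#include <assert.h>
#include <stdint.h>

/* What the old uint8_t prototype effectively did to a GSI-based pirq:
 * any value above 255 is silently reduced modulo 256, so a high GSI is
 * misreported as a small, unrelated pirq number. */
static uint8_t narrow_pirq(uint32_t pirq)
{
    return (uint8_t)pirq;
}
```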

The domctl interface version needs to be bumped with this change, but that
was already done by 918b8842a8 ("arm64: Change type of hsr, cpsr, spsr_el1
to uint64_t") in this release cycle.

Additionally, take the chance and convert the allow_access argument to bool.

Reviewed-by: Jan Beulich 
Signed-off-by: Igor Druzhinin 
Acked-by: Christian Lindig 
---

Changes in v2:
- extra wording for clarity in commit message (Julien)
- change allow_access to bool (Andrew)
- add padding (Jan)

---
 tools/include/xenctrl.h | 4 ++--
 tools/libs/ctrl/xc_domain.c | 4 ++--
 tools/ocaml/libs/xc/xenctrl_stubs.c | 4 ++--
 xen/include/public/domctl.h | 3 ++-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 2a7c836..14adaa0 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1385,8 +1385,8 @@ int xc_domain_ioport_permission(xc_interface *xch,
 
 int xc_domain_irq_permission(xc_interface *xch,
  uint32_t domid,
- uint8_t pirq,
- uint8_t allow_access);
+ uint32_t pirq,
+ bool allow_access);
 
 int xc_domain_iomem_permission(xc_interface *xch,
uint32_t domid,
diff --git a/tools/libs/ctrl/xc_domain.c b/tools/libs/ctrl/xc_domain.c
index 7d11884..1cdf3d1 100644
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -1384,8 +1384,8 @@ int xc_vcpu_setcontext(xc_interface *xch,
 
 int xc_domain_irq_permission(xc_interface *xch,
  uint32_t domid,
- uint8_t pirq,
- uint8_t allow_access)
+ uint32_t pirq,
+ bool allow_access)
 {
 DECLARE_DOMCTL;
 
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c 
b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 6e4bc56..266eb32 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1077,8 +1077,8 @@ CAMLprim value stub_xc_domain_irq_permission(value xch, 
value domid,
 value pirq, value allow)
 {
CAMLparam4(xch, domid, pirq, allow);
-   uint8_t c_pirq;
-   uint8_t c_allow;
+   uint32_t c_pirq;
+   bool c_allow;
int ret;
 
c_pirq = Int_val(pirq);
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 4dbf107..088c964 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -441,8 +441,9 @@ struct xen_domctl_setdebugging {
 
 /* XEN_DOMCTL_irq_permission */
 struct xen_domctl_irq_permission {
-uint8_t pirq;
+uint32_t pirq;
 uint8_t allow_access;/* flag to specify enable/disable of IRQ access */
+uint8_t pad[3];
 };
 
 
-- 
2.7.4




Re: [PATCH] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-07 Thread Igor Druzhinin

On 07/07/2021 14:21, Julien Grall wrote:

On 07/07/2021 14:14, Jan Beulich wrote:

On 07.07.2021 14:59, Julien Grall wrote:

On 07/07/2021 13:54, Jan Beulich wrote:

On 07.07.2021 14:51, Julien Grall wrote:

On 07/07/2021 02:02, Igor Druzhinin wrote:

The current uint8_t for the pirq argument in this interface is too restrictive,
causing failures on modern hardware with lots of GSIs. That extends down to
XEN_DOMCTL_irq_permission ABI structure where it needs to be fixed up
as well. Internal Xen structures appear to be fine. Existing users of
the interface in tree (libxl, ocaml and python bindings) are already using
int for pirq representation, which should be wide enough.


By "int", I am assuming you imply "signed int", is that correct?


Yes, just "int" in the meaning "signed int" - I can clarify that in the 
description.


If so, should the function xc_domain_irq_permission() interface take an
int parameter and check it is not negative?


Please let's not make things worse than they are, the more that


Well, what I am trying to prevent is a surprise where the caller
mistakenly passes a negative value that will be interpreted as a positive
one...


This happens all the time when converting from signed to unsigned,
perhaps just internally.


I am not sure what your point is... Yes, there are places in Xen that switch 
between signed and unsigned. We likely have some (latent) problems because of 
that...


Callers of the libxc interface shouldn't have been using signed int at all.
They just happen to do so, at least in-tree - that's what I found and mentioned
in the description. At the same time, the "int" type is for now wide enough, so
there is no immediate rush to fix them up.

That gets a little bit tricky with bindings - they themselves expose pirq
as int. So a negative value could be passed by the caller and, given that other
similar interface functions like xc_physdev_map_pirq() use "int pirq"
to signal an error as a negative value, that could be misinterpreted by lower
levels.

We can add extra checks in bindings to avoid passing all negative values to
libxc level. Would this be good enough?


Such issues are beyond annoying to debug...


No worse than any other out-of-bounds value, I would say.


  > ./CODING_STYLE is unambiguous in cases like this one.

Hmmm... The coding style mentions the fixed size but nothing about the
signedness of the type...


Oh, sorry, yes. The adjustment for this even pre-dates the two
patches to ./CODING_STYLE that I've on record as pending for
nearly two years.


The alternative suggestion is to keep an unsigned type but check that bit
31 is not set.


Why? Why not bit 30 or bit 27? There's nothing special about
bit 31 in an unsigned number.


Bit 31 is the sign bit for a signed number. The check would make sure that:
  1) The value will fit other hypercalls (the PIRQ is described as int in a few 
of the structures)
  2) We catch a caller that would use a number that could potentially 
be interpreted as negative by other parts of the hypervisor.

That said, I can live with the implicit signed -> unsigned conversion; however, 
the commit message should at least be clarified because it is misleading.
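The check described above could look like this (pirq_fits_int() is a
hypothetical name; a sketch of the idea, not proposed code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reject pirq values with bit 31 set: interfaces that carry the pirq as a
 * signed int would interpret such values as negative (e.g. as error codes),
 * which is exactly the surprise being discussed. */
static bool pirq_fits_int(uint32_t pirq)
{
    return !(pirq & (UINT32_C(1) << 31));
}
```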


Could you specify which statement exactly is misleading (or needs clarifying)
in the commit message?

Igor



Re: [PATCH] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-07 Thread Igor Druzhinin

On 08/07/2021 02:26, Andrew Cooper wrote:

On 08/07/2021 02:14, Igor Druzhinin wrote:

On 08/07/2021 02:11, Andrew Cooper wrote:

On 08/07/2021 02:08, Igor Druzhinin wrote:

On 07/07/2021 10:19, Andrew Cooper wrote:

On 07/07/2021 08:46, Jan Beulich wrote:

--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1385,7 +1385,7 @@ int xc_domain_ioport_permission(xc_interface
*xch,
      int xc_domain_irq_permission(xc_interface *xch,
     uint32_t domid,
- uint8_t pirq,
+ uint32_t pirq,
     uint8_t allow_access);

Take the opportunity and also change "allow_access" to bool? Or is
use of bool prohibited in external interfaces?


We've got bool's in the interface already.


Where exactly? I couldn't find a single "bool".


$ git grep -w bool -- :/tools/include/xen*.h
../tools/include/xenctrl.h:1844:  uint32_t
domid, bool restore,
../tools/include/xenctrl.h:1846:  unsigned int
nr_features, bool pae, bool itsc,
../tools/include/xenctrl.h:1847:  bool
nested_virt, const struct xc_xend_cpuid *xend);
../tools/include/xenctrl.h:1954:int
xc_altp2m_get_domain_state(xc_interface *handle, uint32_t dom, bool
*state);
../tools/include/xenctrl.h:1955:int
xc_altp2m_set_domain_state(xc_interface *handle, uint32_t dom, bool
state);

and loads more.


Are we ok to have different types in ABI interface and in libxc
function prototype then?


Yes.  Again, we've got plenty of examples like this.


Because I was referring to ABI structures.


The hypercall structs can't contain bool.  bool has implementation
defined width in C, just like enum, and there is no requirement for
sizeof(bool) to be 1.

The pre-existing uint8_t here is correct, although the hypercall handler
ideally wants a further adjustment to reject non-boolean values.  This
hypercall clearly predates our more careful review practices...
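Andrew's point about fixed-width types can be illustrated with the resulting
v2 layout (a sketch of the struct from the patch, with a size check the real
headers don't necessarily carry):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed-width fields plus explicit padding give the struct a layout that is
 * the same for every compiler and ABI - unlike bool, whose width is
 * implementation-defined and therefore unsuitable for hypercall structs. */
struct irq_permission_example {
    uint32_t pirq;
    uint8_t allow_access;   /* boolean flag, but fixed at one byte */
    uint8_t pad[3];         /* explicit padding keeps the size stable */
};
```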


Sure, I get what you want now. I'm just not a fan of type conversions
for the sake of it - I prefer a common type to be used pervasively.
But, of course, I'm happy to follow Xen practices.

Igor



Re: [PATCH] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-07 Thread Igor Druzhinin

On 08/07/2021 02:11, Andrew Cooper wrote:

On 08/07/2021 02:08, Igor Druzhinin wrote:

On 07/07/2021 10:19, Andrew Cooper wrote:

On 07/07/2021 08:46, Jan Beulich wrote:

--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1385,7 +1385,7 @@ int xc_domain_ioport_permission(xc_interface
*xch,
     int xc_domain_irq_permission(xc_interface *xch,
    uint32_t domid,
- uint8_t pirq,
+ uint32_t pirq,
    uint8_t allow_access);

Take the opportunity and also change "allow_access" to bool? Or is
use of bool prohibited in external interfaces?


We've got bool's in the interface already.


Where exactly? I couldn't find a single "bool".


$ git grep -w bool -- :/tools/include/xen*.h
../tools/include/xenctrl.h:1844:  uint32_t
domid, bool restore,
../tools/include/xenctrl.h:1846:  unsigned int
nr_features, bool pae, bool itsc,
../tools/include/xenctrl.h:1847:  bool
nested_virt, const struct xc_xend_cpuid *xend);
../tools/include/xenctrl.h:1954:int
xc_altp2m_get_domain_state(xc_interface *handle, uint32_t dom, bool *state);
../tools/include/xenctrl.h:1955:int
xc_altp2m_set_domain_state(xc_interface *handle, uint32_t dom, bool state);

and loads more.


Are we ok to have different types in ABI interface and in libxc
function prototype then? Because I was referring to ABI structures.

Igor




Re: [PATCH] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-07 Thread Igor Druzhinin

On 07/07/2021 10:19, Andrew Cooper wrote:

On 07/07/2021 08:46, Jan Beulich wrote:

--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1385,7 +1385,7 @@ int xc_domain_ioport_permission(xc_interface *xch,
  
  int xc_domain_irq_permission(xc_interface *xch,

   uint32_t domid,
- uint8_t pirq,
+ uint32_t pirq,
   uint8_t allow_access);

Take the opportunity and also change "allow_access" to bool? Or is
use of bool prohibited in external interfaces?


We've got bool's in the interface already.


Where exactly? I couldn't find a single "bool".

Igor



[PATCH] tools/libxc: use uint32_t for pirq in xc_domain_irq_permission

2021-07-06 Thread Igor Druzhinin
The current uint8_t for the pirq argument in this interface is too restrictive,
causing failures on modern hardware with lots of GSIs. That extends down to
the XEN_DOMCTL_irq_permission ABI structure, where it needs to be fixed up
as well. Internal Xen structures appear to be fine. Existing users of
the interface in tree (libxl, ocaml and python bindings) are already using
int for pirq representation, which should be wide enough.

The domctl interface version needs to be bumped with this change, but that
was already done by 918b8842a8 ("arm64: Change type of hsr, cpsr, spsr_el1
to uint64_t") in this release cycle.

Signed-off-by: Igor Druzhinin 
---
 tools/include/xenctrl.h | 2 +-
 tools/libs/ctrl/xc_domain.c | 2 +-
 tools/ocaml/libs/xc/xenctrl_stubs.c | 2 +-
 xen/include/public/domctl.h | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 2a7c836..8974747 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1385,7 +1385,7 @@ int xc_domain_ioport_permission(xc_interface *xch,
 
 int xc_domain_irq_permission(xc_interface *xch,
  uint32_t domid,
- uint8_t pirq,
+ uint32_t pirq,
  uint8_t allow_access);
 
 int xc_domain_iomem_permission(xc_interface *xch,
diff --git a/tools/libs/ctrl/xc_domain.c b/tools/libs/ctrl/xc_domain.c
index 7d11884..8e4ffd0 100644
--- a/tools/libs/ctrl/xc_domain.c
+++ b/tools/libs/ctrl/xc_domain.c
@@ -1384,7 +1384,7 @@ int xc_vcpu_setcontext(xc_interface *xch,
 
 int xc_domain_irq_permission(xc_interface *xch,
  uint32_t domid,
- uint8_t pirq,
+ uint32_t pirq,
  uint8_t allow_access)
 {
 DECLARE_DOMCTL;
diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c 
b/tools/ocaml/libs/xc/xenctrl_stubs.c
index 6e4bc56..e5837e6 100644
--- a/tools/ocaml/libs/xc/xenctrl_stubs.c
+++ b/tools/ocaml/libs/xc/xenctrl_stubs.c
@@ -1077,7 +1077,7 @@ CAMLprim value stub_xc_domain_irq_permission(value xch, 
value domid,
 value pirq, value allow)
 {
CAMLparam4(xch, domid, pirq, allow);
-   uint8_t c_pirq;
+   uint32_t c_pirq;
uint8_t c_allow;
int ret;
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 4dbf107..277478e 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -441,7 +441,7 @@ struct xen_domctl_setdebugging {
 
 /* XEN_DOMCTL_irq_permission */
 struct xen_domctl_irq_permission {
-uint8_t pirq;
+uint32_t pirq;
 uint8_t allow_access;/* flag to specify enable/disable of IRQ access */
 };
 
-- 
2.7.4




Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up

2021-06-18 Thread Igor Druzhinin

On 18/06/2021 18:15, Igor Druzhinin wrote:

On 18/06/2021 17:00, Jan Beulich wrote:

At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.


 From "Open-Source Register Reference for AMD Family 17h Processors (PUB)":
https://developer.amd.com/wp-content/resources/56255_3_03.PDF

"The processor defines a reserved memory address region starting at
FFFD__h and extending up to __h."

It still doesn't say that it's at the top of the physical address space,
although I understand that's how it's now implemented. The official
document doesn't confirm it will move along with a physical address space
extension.


To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.

While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.

Reported-by: Andrew Cooper 
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -349,13 +349,17 @@ void __init early_cpu_init(void)
  eax = cpuid_eax(0x8000);
  if ((eax >> 16) == 0x8000 && eax >= 0x8008) {
+    ebx = eax >= 0x801f ? cpuid_ebx(0x801f) : 0;
  eax = cpuid_eax(0x8008);
-    paddr_bits = eax & 0xff;
+


I understand Andrew has some concerns regarding changing paddr_bits but
some comment explaining what's located at 0x801f:ebx[11:6] and why
we're doing this might be useful.


+    paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
  if (paddr_bits > PADDR_BITS)
  paddr_bits = PADDR_BITS;
+
  vaddr_bits = (eax >> 8) & 0xff;
  if (vaddr_bits > VADDR_BITS)
  vaddr_bits = VADDR_BITS;
+
  hap_paddr_bits = ((eax >> 16) & 0xff) ?: paddr_bits;
  if (hap_paddr_bits > PADDR_BITS)
  hap_paddr_bits = PADDR_BITS;
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -524,8 +524,11 @@ int __init dom0_setup_permissions(struct
   MSI_ADDR_DEST_ID_MASK));
  /* HyperTransport range. */
  if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
-    rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
-    paddr_to_pfn((1ULL << 40) - 1));
+    {
+    mfn = paddr_to_pfn(1UL <<
+   (boot_cpu_data.x86 < 0x17 ? 40 : paddr_bits));


That doesn't really follow what Andrew gave us, namely:

1) On parts with <40 bits, it's fully hidden from software
2) Before Fam17h, it was always 12G just below 1T, even if there was more RAM 
above this location
3) On Fam17h and later, it is variable based on SME, and is either just below 
2^48 (no encryption) or 2^43 (encryption)

Do we need (1) to be coded here as well?


Ignore this last paragraph - I missed your statement in the commit description.

Igor



Re: [PATCH] x86/AMD: make HT range dynamic for Fam17 and up

2021-06-18 Thread Igor Druzhinin

On 18/06/2021 17:00, Jan Beulich wrote:

At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.


From "Open-Source Register Reference for AMD Family 17h Processors (PUB)":
https://developer.amd.com/wp-content/resources/56255_3_03.PDF

"The processor defines a reserved memory address region starting at
FFFD__h and extending up to __h."

It still doesn't say that it's at the top of the physical address space,
although I understand that's how it's now implemented. The official
document doesn't confirm it will move along with a physical address space
extension.


To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.

While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.

Reported-by: Andrew Cooper 
Signed-off-by: Jan Beulich 

--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -349,13 +349,17 @@ void __init early_cpu_init(void)
  
  	eax = cpuid_eax(0x8000);

if ((eax >> 16) == 0x8000 && eax >= 0x8008) {
+   ebx = eax >= 0x801f ? cpuid_ebx(0x801f) : 0;
eax = cpuid_eax(0x8008);
-   paddr_bits = eax & 0xff;
+


I understand Andrew has some concerns regarding changing paddr_bits but
some comment explaining what's located at 0x801f:ebx[11:6] and why
we're doing this might be useful.
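For reference (my understanding of the AMD documentation, not text from the
patch): CPUID leaf 0x8000001f EBX[11:6] reports the physical-address-width
reduction applied when memory encryption is in use, and the hunk subtracts it
from the raw width in leaf 0x80000008 EAX[7:0]. A sketch of just that
arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Effective physical address width = raw width (EAX[7:0] of leaf
 * 0x80000008) minus the reduction advertised for memory encryption
 * (EBX[11:6] of leaf 0x8000001f; pass 0 on parts without that leaf). */
static unsigned int effective_paddr_bits(uint32_t leaf_80000008_eax,
                                         uint32_t leaf_8000001f_ebx)
{
    return (leaf_80000008_eax & 0xff) - ((leaf_8000001f_ebx >> 6) & 0x3f);
}
```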


+   paddr_bits = (eax & 0xff) - ((ebx >> 6) & 0x3f);
if (paddr_bits > PADDR_BITS)
paddr_bits = PADDR_BITS;
+
vaddr_bits = (eax >> 8) & 0xff;
if (vaddr_bits > VADDR_BITS)
vaddr_bits = VADDR_BITS;
+
hap_paddr_bits = ((eax >> 16) & 0xff) ?: paddr_bits;
if (hap_paddr_bits > PADDR_BITS)
hap_paddr_bits = PADDR_BITS;
--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -524,8 +524,11 @@ int __init dom0_setup_permissions(struct
   MSI_ADDR_DEST_ID_MASK));
  /* HyperTransport range. */
  if ( boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON) )
-rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
-paddr_to_pfn((1ULL << 40) - 1));
+{
+mfn = paddr_to_pfn(1UL <<
+   (boot_cpu_data.x86 < 0x17 ? 40 : paddr_bits));


That doesn't really follow what Andrew gave us, namely:

1) On parts with <40 bits, it's fully hidden from software
2) Before Fam17h, it was always 12G just below 1T, even if there was more RAM 
above this location
3) On Fam17h and later, it is variable based on SME, and is either just below 
2^48 (no encryption) or 2^43 (encryption)

Do we need (1) to be coded here as well?

Igor   



BUG in 1f3d87c75129 ("x86/vpt: do not take pt_migrate rwlock in some cases")

2021-06-14 Thread Igor Druzhinin

Hi, Boris, Stephen, Roger,

We have stress tested recent changes on staging-4.13 which includes a
backport of the subject. Since the backport is identical to the
master branch and all of the pre-reqs are in place - we have no reason
to believe the issue is not the same on master.

Here is what we got by running heavy stress testing including multiple
repeated VM lifecycle operations with storage and network load:


Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287
[ Xen-4.13.3-10.7-d  x86_64  debug=y   Not tainted ]
CPU:17
RIP:e008:[] common/timer.c#active_timer+0xc/0x1b
RFLAGS: 00010046   CONTEXT: hypervisor (d675v0)
rax:    rbx: 83137a8ed300   rcx: 
rdx: 83303fff7fff   rsi: 83303fff2549   rdi: 83137a8ed300
rbp: 83303fff7cf8   rsp: 83303fff7cf8   r8:  0001
r9:     r10: 0011   r11: 168b0cc08083
r12:    r13: 82d0805cf300   r14: 82d0805cf300
r15: 0292   cr0: 80050033   cr4: 000426e0
cr3: 0013c1a32000   cr2: 
fsb:    gsb:    gss: 
ds:    es:    fs:    gs:    ss:    cs: e008
Xen code around  (common/timer.c#active_timer+0xc/0x1b):
 0f b6 47 2a 84 c0 75 02 <0f> 0b 3c 04 76 02 0f 0b 3c 02 0f 97 c0 5d c3 55
Xen stack trace from rsp=83303fff7cf8:
   83303fff7d48 82d0802479f1 168b0192b846 83137a8ed328
   1d0776eb 83137a8ed2c0 83133ee47568 83133ee47000
   83133ee47560 832b1a0cd000 83303fff7d78 82d08031e74e
   83102d898000 83133ee47000 83102db8d000 0011
   83303fff7dc8 82d08027df19  83133ee47060
   82d0805d0088 83102d898000 83133ee47000 0011
   0001 0011 83303fff7e08 82d0802414e0
   83303fff7df8 168b0192b846 83102d8a4660 168b0192b846
   83102d8a4720 0011 83303fff7ea8 82d080241d6c
   83133ee47000 831244137a50 83303fff7e48 82d08031b5b8
   83133ee47000 832b1a0cd000 83303fff7e68 82d080312b65
   83133ee47000  83303fff7ee8 83102d8a4678
   83303fff7ee8 82d0805d6380 82d0805d5b00 
   83303fff7fff  83303fff7ed8 82d0802431f5
   83133ee47000   
   83303fff7ee8 82d08024324a 7ccfc00080e7 82d08033930b
   b0ebd5a0 000d 0062 0097
   001e 001f b0ebd5ad 
   0005 0003d91d  
   03d5 001e  beefbeef
Xen call trace:
   [] R common/timer.c#active_timer+0xc/0x1b
   [] F stop_timer+0xf5/0x188
   [] F pt_save_timer+0x45/0x8a
   [] F context_switch+0xf9/0xee0
   [] F common/schedule.c#sched_context_switch+0x146/0x151
   [] F common/schedule.c#schedule+0x28a/0x299
   [] F common/softirq.c#__do_softirq+0x85/0x90
   [] F do_softirq+0x13/0x15
   [] F vmx_asm_do_vmentry+0x2b/0x30


Panic on CPU 17:
Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:287


That looks like a race opened by this change - we didn't see the problem
before while running with all of the pre-req patches.

Could you please analyse this assertion failure?

Igor



Re: [PATCH] xen-mapcache: avoid a race on memory map while using MAP_FIXED

2021-04-20 Thread Igor Druzhinin

On 20/04/2021 04:39, no-re...@patchew.org wrote:

Patchew URL: 
https://patchew.org/QEMU/1618889702-13104-1-git-send-email-igor.druzhi...@citrix.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1618889702-13104-1-git-send-email-igor.druzhi...@citrix.com
Subject: [PATCH] xen-mapcache: avoid a race on memory map while using MAP_FIXED

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
 From https://github.com/patchew-project/qemu
  * [new tag] 
patchew/1618889702-13104-1-git-send-email-igor.druzhi...@citrix.com -> 
patchew/1618889702-13104-1-git-send-email-igor.druzhi...@citrix.com
Switched to a new branch 'test'
3102519 xen-mapcache: avoid a race on memory map while using MAP_FIXED

=== OUTPUT BEGIN ===
ERROR: Author email address is mangled by the mailing list
#2:
Author: Igor Druzhinin via 

total: 1 errors, 0 warnings, 21 lines checked



Anthony,

Is there any action required from me here? I don't completely understand 
how that happened. But I found this:


"In some cases the Author: email address in patches submitted to the
list gets mangled such that it says

John Doe via Qemu-devel 

This change is a result of workarounds for DMARC policies.

Subsystem maintainers accepting patches need to catch these and fix
them before sending pull requests, so a checkpatch.pl test is highly
desirable."

Igor



Re: [PATCH] xen-mapcache: avoid a race on memory map while using MAP_FIXED

2021-04-20 Thread Igor Druzhinin

On 20/04/2021 09:53, Roger Pau Monné wrote:

On Tue, Apr 20, 2021 at 04:35:02AM +0100, Igor Druzhinin wrote:

When we're replacing the existing mapping, there is a possibility of a race
on the memory map with other threads doing mmap operations - the address being
unmapped/re-mapped could be occupied by another thread in between.

Linux mmap man page recommends keeping the existing mappings in place to
reserve the place and instead utilize the fact that the next mmap operation
with MAP_FIXED flag passed will implicitly destroy the existing mappings
behind the chosen address. This behavior is guaranteed by POSIX / BSD and
therefore is portable.

Note that it wouldn't make the replacement atomic for parallel accesses to
the replaced region - those might still fail with SIGBUS due to
xenforeignmemory_map not being atomic. So we're still not expecting those.

Tested-by: Anthony PERARD 
Signed-off-by: Igor Druzhinin 


Should we add a 'Fixes: 759235653de ...' or similar tag here?


I was thinking about it and decided - no. There wasn't a bug here until 
QEMU introduced usage of iothreads at the state loading phase. 
Originally this process was totally single-threaded. And it's hard to 
pinpoint the exact moment when that happened, which is also not directly 
related to the change here.


Igor



[PATCH] xen-mapcache: avoid a race on memory map while using MAP_FIXED

2021-04-19 Thread Igor Druzhinin
When we're replacing the existing mapping, there is a possibility of a race
on the memory map with other threads doing mmap operations - the address being
unmapped/re-mapped could be occupied by another thread in between.

Linux mmap man page recommends keeping the existing mappings in place to
reserve the place and instead utilize the fact that the next mmap operation
with MAP_FIXED flag passed will implicitly destroy the existing mappings
behind the chosen address. This behavior is guaranteed by POSIX / BSD and
therefore is portable.

Note that it wouldn't make the replacement atomic for parallel accesses to
the replaced region - those might still fail with SIGBUS due to
xenforeignmemory_map not being atomic. So we're still not expecting those.

Tested-by: Anthony PERARD 
Signed-off-by: Igor Druzhinin 
---
 hw/i386/xen/xen-mapcache.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index 5b120ed..e82b7dc 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -171,7 +171,20 @@ static void xen_remap_bucket(MapCacheEntry *entry,
 if (!(entry->flags & XEN_MAPCACHE_ENTRY_DUMMY)) {
 ram_block_notify_remove(entry->vaddr_base, entry->size);
 }
-if (munmap(entry->vaddr_base, entry->size) != 0) {
+
+/*
+ * If an entry is being replaced by another mapping and we're using
+ * MAP_FIXED flag for it - there is possibility of a race for vaddr
+ * address with another thread doing an mmap call itself
+ * (see man 2 mmap). To avoid that we skip explicit unmapping here
+ * and allow the kernel to destroy the previous mappings by replacing
+ * them in mmap call later.
+ *
+ * Non-identical replacements are not allowed therefore.
+ */
+assert(!vaddr || (entry->vaddr_base == vaddr && entry->size == size));
+
+if (!vaddr && munmap(entry->vaddr_base, entry->size) != 0) {
 perror("unmap fails");
 exit(-1);
 }
-- 
2.7.4




[PATCH v5 1/2] x86/vtx: add LBR_SELECT to the list of LBR MSRs

2021-04-15 Thread Igor Druzhinin
This MSR exists since Nehalem / Silvermont and is actively used by Linux,
for instance, to improve sampling efficiency.

Signed-off-by: Igor Druzhinin 
---
Changes in v5:
- added Silvermont+ LBR_SELECT support

New patch in v4 as suggested by Andrew.
---
 xen/arch/x86/hvm/vmx/vmx.c  | 20 
 xen/include/asm-x86/msr-index.h | 10 --
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 835b905..30c6a57 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2915,14 +2915,16 @@ static const struct lbr_info {
 }, nh_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_C2_LASTBRANCH_TOS,1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
 { MSR_P4_LASTBRANCH_0_FROM_LIP, NUM_MSR_P4_LASTBRANCH_FROM_TO },
 { MSR_P4_LASTBRANCH_0_TO_LIP,   NUM_MSR_P4_LASTBRANCH_FROM_TO },
 { 0, 0 }
 }, sk_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_SKL_LASTBRANCH_TOS,   1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
 { MSR_SKL_LASTBRANCH_0_FROM_IP, NUM_MSR_SKL_LASTBRANCH },
 { MSR_SKL_LASTBRANCH_0_TO_IP,   NUM_MSR_SKL_LASTBRANCH },
 { MSR_SKL_LASTBRANCH_0_INFO,NUM_MSR_SKL_LASTBRANCH },
@@ -2934,10 +2936,19 @@ static const struct lbr_info {
 { MSR_C2_LASTBRANCH_0_FROM_IP,  NUM_MSR_ATOM_LASTBRANCH_FROM_TO },
 { MSR_C2_LASTBRANCH_0_TO_IP,NUM_MSR_ATOM_LASTBRANCH_FROM_TO },
 { 0, 0 }
+}, sm_lbr[] = {
+{ MSR_IA32_LASTINTFROMIP,   1 },
+{ MSR_IA32_LASTINTTOIP, 1 },
+{ MSR_SM_LBR_SELECT,1 },
+{ MSR_SM_LASTBRANCH_TOS,1 },
+{ MSR_C2_LASTBRANCH_0_FROM_IP,  NUM_MSR_ATOM_LASTBRANCH_FROM_TO },
+{ MSR_C2_LASTBRANCH_0_TO_IP,NUM_MSR_ATOM_LASTBRANCH_FROM_TO },
+{ 0, 0 }
 }, gm_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_GM_LASTBRANCH_TOS,1 },
+{ MSR_SM_LBR_SELECT,1 },
+{ MSR_SM_LASTBRANCH_TOS,1 },
 { MSR_GM_LASTBRANCH_0_FROM_IP,  NUM_MSR_GM_LASTBRANCH_FROM_TO },
 { MSR_GM_LASTBRANCH_0_TO_IP,NUM_MSR_GM_LASTBRANCH_FROM_TO },
 { 0, 0 }
@@ -2991,6 +3002,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 return sk_lbr;
 /* Atom */
 case 0x1c: case 0x26: case 0x27: case 0x35: case 0x36:
+return at_lbr;
 /* Silvermont */
 case 0x37: case 0x4a: case 0x4d: case 0x5a: case 0x5d:
 /* Xeon Phi Knights Landing */
@@ -2999,7 +3011,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 case 0x85:
 /* Airmont */
 case 0x4c:
-return at_lbr;
+return sm_lbr;
 /* Goldmont */
 case 0x5c: case 0x5f:
 return gm_lbr;
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 43d26ef..020908f 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -606,15 +606,21 @@
 #define NUM_MSR_C2_LASTBRANCH_FROM_TO  4
 #define NUM_MSR_ATOM_LASTBRANCH_FROM_TO8
 
+/* Nehalem (and newer) last-branch recording */
+#define MSR_NHL_LBR_SELECT 0x01c8
+#define MSR_NHL_LASTBRANCH_TOS 0x01c9
+
 /* Skylake (and newer) last-branch recording */
-#define MSR_SKL_LASTBRANCH_TOS 0x01c9
 #define MSR_SKL_LASTBRANCH_0_FROM_IP   0x0680
 #define MSR_SKL_LASTBRANCH_0_TO_IP 0x06c0
 #define MSR_SKL_LASTBRANCH_0_INFO  0x0dc0
 #define NUM_MSR_SKL_LASTBRANCH 32
 
+/* Silvermont (and newer) last-branch recording */
+#define MSR_SM_LBR_SELECT  0x01c8
+#define MSR_SM_LASTBRANCH_TOS  0x01c9
+
 /* Goldmont last-branch recording */
-#define MSR_GM_LASTBRANCH_TOS  0x01c9
 #define MSR_GM_LASTBRANCH_0_FROM_IP0x0680
 #define MSR_GM_LASTBRANCH_0_TO_IP  0x06c0
 #define NUM_MSR_GM_LASTBRANCH_FROM_TO  32
-- 
2.7.4




[PATCH v5 2/2] x86/intel: insert Ice Lake-SP and Ice Lake-D model numbers

2021-04-15 Thread Igor Druzhinin
LBR and C-state MSRs should correspond to those of Ice Lake desktop according to
SDM rev. 74 for both models.

Ice Lake-SP is known to expose IF_PSCHANGE_MC_NO in the IA32_ARCH_CAPABILITIES MSR
(as the advisory states and the Whitley SDP confirms), which means the erratum is
fixed in hardware for that model and therefore it shouldn't be present in the
has_if_pschange_mc list. Provisionally assume the same to be the case
for Ice Lake-D.

Reviewed-by: Jan Beulich 
Signed-off-by: Igor Druzhinin 
---
No changes in v5.

Changes in v4:
- now based on SDM update
- new LBR (0x1e0) does not seem to be exposed in the docs

Changes in v3:
- Add Ice Lake-D model numbers
- Drop has_if_pschange_mc hunk following Tiger Lake related discussion
---
 xen/arch/x86/acpi/cpu_idle.c | 2 ++
 xen/arch/x86/hvm/vmx/vmx.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c092086..d788c8b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -181,6 +181,8 @@ static void do_get_hw_residencies(void *arg)
 case 0x55:
 case 0x5E:
 /* Ice Lake */
+case 0x6A:
+case 0x6C:
 case 0x7D:
 case 0x7E:
 /* Tiger Lake */
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 30c6a57..91cba19 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2990,7 +2990,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 /* Goldmont Plus */
 case 0x7a:
 /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x6a: case 0x6c: case 0x7d: case 0x7e:
 /* Tiger Lake */
 case 0x8c: case 0x8d:
 /* Tremont */
-- 
2.7.4




Re: [PATCH v4 1/2] x86/vtx: add LBR_SELECT to the list of LBR MSRs

2021-04-14 Thread Igor Druzhinin

On 14/04/2021 12:41, Jan Beulich wrote:

On 14.04.2021 06:40, Igor Druzhinin wrote:

--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2915,14 +2915,16 @@ static const struct lbr_info {
  }, nh_lbr[] = {
  { MSR_IA32_LASTINTFROMIP,   1 },
  { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_C2_LASTBRANCH_TOS,1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
  { MSR_P4_LASTBRANCH_0_FROM_LIP, NUM_MSR_P4_LASTBRANCH_FROM_TO },
  { MSR_P4_LASTBRANCH_0_TO_LIP,   NUM_MSR_P4_LASTBRANCH_FROM_TO },
  { 0, 0 }
  }, sk_lbr[] = {
  { MSR_IA32_LASTINTFROMIP,   1 },
  { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_SKL_LASTBRANCH_TOS,   1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
  { MSR_SKL_LASTBRANCH_0_FROM_IP, NUM_MSR_SKL_LASTBRANCH },
  { MSR_SKL_LASTBRANCH_0_TO_IP,   NUM_MSR_SKL_LASTBRANCH },
  { MSR_SKL_LASTBRANCH_0_INFO,NUM_MSR_SKL_LASTBRANCH },
@@ -2937,6 +2939,7 @@ static const struct lbr_info {
  }, gm_lbr[] = {
  { MSR_IA32_LASTINTFROMIP,   1 },
  { MSR_IA32_LASTINTTOIP, 1 },
+{ MSR_GM_LBR_SELECT,1 },


What about Xeon Phi, Silvermont, and Airmont?


Yes, you're right - forgot about those. Will need to shuffle arrays a 
little.



--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -606,14 +606,18 @@
  #define NUM_MSR_C2_LASTBRANCH_FROM_TO 4
  #define NUM_MSR_ATOM_LASTBRANCH_FROM_TO   8
  
+/* Nehalem (and newer) last-branch recording */

+#define MSR_NHL_LBR_SELECT 0x01c8
+#define MSR_NHL_LASTBRANCH_TOS 0x01c9
+
  /* Skylake (and newer) last-branch recording */
-#define MSR_SKL_LASTBRANCH_TOS 0x01c9
  #define MSR_SKL_LASTBRANCH_0_FROM_IP  0x0680
  #define MSR_SKL_LASTBRANCH_0_TO_IP0x06c0
  #define MSR_SKL_LASTBRANCH_0_INFO 0x0dc0
  #define NUM_MSR_SKL_LASTBRANCH32
  
  /* Goldmont last-branch recording */

+#define MSR_GM_LBR_SELECT  0x01c8
  #define MSR_GM_LASTBRANCH_TOS 0x01c9


Wouldn't it make sense to also re-use the NHL constants, like you
do for Skylake?


I didn't really see GM as derived from NHL, so I decided to split those. 
It looks cleaner to me that way; otherwise it might be a little confusing to 
use NHL constants in GM definitions. Given the change above, I will have to 
reshuffle those anyway in v5.


Igor




[PATCH v4 2/2] x86/intel: insert Ice Lake-SP and Ice Lake-D model numbers

2021-04-13 Thread Igor Druzhinin
LBR and C-state MSRs should correspond to those of Ice Lake desktop according to
SDM rev. 74 for both models.

Ice Lake-SP is known to expose IF_PSCHANGE_MC_NO in the IA32_ARCH_CAPABILITIES MSR
(as the advisory states and the Whitley SDP confirms), which means the erratum is
fixed in hardware for that model and therefore it shouldn't be present in the
has_if_pschange_mc list. Provisionally assume the same to be the case
for Ice Lake-D while the advisory is not yet updated.

Signed-off-by: Igor Druzhinin 
---
Changes in v4:
- now based on SDM update
- new LBR (0x1e0) does not seem to be exposed in the docs

Changes in v3:
- Add Ice Lake-D model numbers
- Drop has_if_pschange_mc hunk following Tiger Lake related discussion
---
 xen/arch/x86/acpi/cpu_idle.c | 2 ++
 xen/arch/x86/hvm/vmx/vmx.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c092086..d788c8b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -181,6 +181,8 @@ static void do_get_hw_residencies(void *arg)
 case 0x55:
 case 0x5E:
 /* Ice Lake */
+case 0x6A:
+case 0x6C:
 case 0x7D:
 case 0x7E:
 /* Tiger Lake */
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5a4ca35..52469ca 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2982,7 +2982,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 /* Goldmont Plus */
 case 0x7a:
 /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x6a: case 0x6c: case 0x7d: case 0x7e:
 /* Tiger Lake */
 case 0x8c: case 0x8d:
 /* Tremont */
-- 
2.7.4




[PATCH v4 1/2] x86/vtx: add LBR_SELECT to the list of LBR MSRs

2021-04-13 Thread Igor Druzhinin
This MSR has existed since Nehalem and is actively used by Linux, for instance,
to improve sampling efficiency.

Signed-off-by: Igor Druzhinin 
---
New patch in v4 as suggested by Andrew.
---
 xen/arch/x86/hvm/vmx/vmx.c  | 7 +--
 xen/include/asm-x86/msr-index.h | 6 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 835b905..5a4ca35 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2915,14 +2915,16 @@ static const struct lbr_info {
 }, nh_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_C2_LASTBRANCH_TOS,1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
 { MSR_P4_LASTBRANCH_0_FROM_LIP, NUM_MSR_P4_LASTBRANCH_FROM_TO },
 { MSR_P4_LASTBRANCH_0_TO_LIP,   NUM_MSR_P4_LASTBRANCH_FROM_TO },
 { 0, 0 }
 }, sk_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
-{ MSR_SKL_LASTBRANCH_TOS,   1 },
+{ MSR_NHL_LBR_SELECT,   1 },
+{ MSR_NHL_LASTBRANCH_TOS,   1 },
 { MSR_SKL_LASTBRANCH_0_FROM_IP, NUM_MSR_SKL_LASTBRANCH },
 { MSR_SKL_LASTBRANCH_0_TO_IP,   NUM_MSR_SKL_LASTBRANCH },
 { MSR_SKL_LASTBRANCH_0_INFO,NUM_MSR_SKL_LASTBRANCH },
@@ -2937,6 +2939,7 @@ static const struct lbr_info {
 }, gm_lbr[] = {
 { MSR_IA32_LASTINTFROMIP,   1 },
 { MSR_IA32_LASTINTTOIP, 1 },
+{ MSR_GM_LBR_SELECT,1 },
 { MSR_GM_LASTBRANCH_TOS,1 },
 { MSR_GM_LASTBRANCH_0_FROM_IP,  NUM_MSR_GM_LASTBRANCH_FROM_TO },
 { MSR_GM_LASTBRANCH_0_TO_IP,NUM_MSR_GM_LASTBRANCH_FROM_TO },
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 43d26ef..25c4308 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -606,14 +606,18 @@
 #define NUM_MSR_C2_LASTBRANCH_FROM_TO  4
 #define NUM_MSR_ATOM_LASTBRANCH_FROM_TO8
 
+/* Nehalem (and newer) last-branch recording */
+#define MSR_NHL_LBR_SELECT 0x01c8
+#define MSR_NHL_LASTBRANCH_TOS 0x01c9
+
 /* Skylake (and newer) last-branch recording */
-#define MSR_SKL_LASTBRANCH_TOS 0x01c9
 #define MSR_SKL_LASTBRANCH_0_FROM_IP   0x0680
 #define MSR_SKL_LASTBRANCH_0_TO_IP 0x06c0
 #define MSR_SKL_LASTBRANCH_0_INFO  0x0dc0
 #define NUM_MSR_SKL_LASTBRANCH 32
 
 /* Goldmont last-branch recording */
+#define MSR_GM_LBR_SELECT  0x01c8
 #define MSR_GM_LASTBRANCH_TOS  0x01c9
 #define MSR_GM_LASTBRANCH_0_FROM_IP0x0680
 #define MSR_GM_LASTBRANCH_0_TO_IP  0x06c0
-- 
2.7.4




[PATCH] x86/vPMU: Extend vPMU support to version 5

2021-04-13 Thread Igor Druzhinin
Version 5 is backwards compatible with version 3. This allows enabling
vPMU on Ice Lake CPUs.

Signed-off-by: Igor Druzhinin 
---
 xen/arch/x86/cpu/vpmu_intel.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/cpu/vpmu_intel.c b/xen/arch/x86/cpu/vpmu_intel.c
index 6e97ce7..0dfc256 100644
--- a/xen/arch/x86/cpu/vpmu_intel.c
+++ b/xen/arch/x86/cpu/vpmu_intel.c
@@ -839,7 +839,7 @@ int vmx_vpmu_initialise(struct vcpu *v)
 return 0;
 
 if ( v->domain->arch.cpuid->basic.pmu_version <= 1 ||
- v->domain->arch.cpuid->basic.pmu_version >= 5 )
+ v->domain->arch.cpuid->basic.pmu_version >= 6 )
 return -EINVAL;
 
 if ( (arch_pmc_cnt + fixed_pmc_cnt) == 0 )
@@ -909,8 +909,9 @@ int __init core2_vpmu_init(void)
 switch ( version )
 {
 case 4:
-printk(XENLOG_INFO "VPMU: PMU version 4 is not fully supported. "
-   "Emulating version 3\n");
+case 5:
+printk(XENLOG_INFO "VPMU: PMU version %u is not fully supported. "
+   "Emulating version 3\n", version);
 /* FALLTHROUGH */
 
 case 2:
-- 
2.7.4




Re: [PATCH v3] x86/intel: insert Ice Lake-X (server) and Ice Lake-D model numbers

2021-04-08 Thread Igor Druzhinin

On 27/01/2021 09:52, Andrew Cooper wrote:

On 23/12/2020 20:32, Igor Druzhinin wrote:

LBR, C-state MSRs should correspond to Ice Lake desktop according to
External Design Specification vol.2 for both models.

Ice Lake-X is known to expose IF_PSCHANGE_MC_NO in IA32_ARCH_CAPABILITIES MSR
(confirmed on Whitley SDP) which means the erratum is fixed in hardware for
that model and therefore it shouldn't be present in has_if_pschange_mc list.
Provisionally assume the same to be the case for Ice Lake-D.

Signed-off-by: Igor Druzhinin 
---
Changes in v3:
- Add Ice Lake-D model numbers
- Drop has_if_pschange_mc hunk following Tiger Lake related discussion -
   IF_PSCHANGE_MC_NO is confirmed to be exposed on Whitley SDP

---
  xen/arch/x86/acpi/cpu_idle.c | 2 ++
  xen/arch/x86/hvm/vmx/vmx.c   | 2 +-
  2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c092086..d788c8b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -181,6 +181,8 @@ static void do_get_hw_residencies(void *arg)
  case 0x55:
  case 0x5E:
  /* Ice Lake */
+case 0x6A:
+case 0x6C:
  case 0x7D:
  case 0x7E:
  /* Tiger Lake */

So I think these changes are correct.  TGL definitely has deeper
core/package states than this interface enumerates, but I can't locate
extra MSRs.


diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2d4475e..bff5979 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2775,7 +2775,7 @@ static const struct lbr_info *last_branch_msr_get(void)
  /* Goldmont Plus */
  case 0x7a:
  /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x6a: case 0x6c: case 0x7d: case 0x7e:

IceLake Server has what appear to be new aspects to LBR.  I can't find
LAST_INT_INFO (0x1e0) existing in any previous generation.
Strange. Updates to the SDM finally got published Apr 2 and contained 
Ice Lake SP details - this new LBR hasn't been mentioned. Does this mean 
we can skip it for now, as it appears to be undocumented functionality 
(at least publicly) and is unlikely to be used by software?

However, my investigation also found LBR_SELECT which has been around
since Nehalem, which we don't handle, and Linux *does* use.


Good point, I'll make an additional patch that adds LBR_SELECT to the 
LBR logic.


Igor




RE: Troubles analyzing crash dumps from xl dump-core

2021-03-10 Thread Igor Druzhinin
> On 30.01.21 19:53, Roman Shaposhnik wrote:
> > On Fri, Jan 29, 2021 at 11:28 PM Jürgen Groß  wrote:
> >>
> >> On 29.01.21 21:12, Roman Shaposhnik wrote:
> >>> Hi!
> >>>
> >>> I'm trying to see how much mileage I can get out of
> >>> crash(1) 7.2.8 (based on gdb 7.6) when it comes to analyzing crash
> >>> dumps taken via xl dump-core (this is all on x86_64 with stock Xen
> >>> v. 4.14).
> >>>
> >>> The good news is that the image actually does load up but it throws
> >>> the following WARNINGs in the process:
> >>>
> >>> WARNING: cannot access vmalloc'd module memory
> >>> crash: read error: kernel virtual address: 93613480  type:
> >>> "fill_task_struct"
> >>> WARNING: active task 93613480 on cpu 0 not found in PID hash
> >>> crash: read error: kernel virtual address: 93613480  type:
> >>> "fill_task_struct"
> >>> WARNING: cannot read log_buf contents
> >>>
> >>> And then the info that it gives me around basic things like ps, mod,
> >>> log, etc. is really super limited (and I am now suspecting may even
> >>> be wrong).
> >>>
> >>> Since I was primarily after dmesg/log initially, I tried:
> >>> crash> log
> >>> log: WARNING: cannot read log_buf contents
> >>>
> >>> Then I tried taking an xl dump-core from the domU that was still
> >>> very much alive and happy and got similar results -- so it clearly
> >>> doesn't seem to be related to the state domU is in.
> >>>
> >>> As matter of fact, I actually got to the desired dmesg output by
> >>> simply running strings(1) on the core file -- so the info is
> >>> definitely there -- but I guess some kind of index/reference maybe
> >>> broken.
> >>>
> >>> With all that in mind, if there's anyone on this ML who has recently
> >>> done Xen DomU crash dump analysis -- I would definitely appreciate
> >>> the pointers!
> >>
> >> For me it just works (openSUSE).
> >
> > Can you please run:
> >
> > crash --version and readelf -a  (on the xl dump-core output) and
> > post the results?
> 
> # crash --version
> 
> crash 7.2.1

I tried to build this version but I still get the following while trying to 
open a dump file
produced by "xl dump-core":

[root@lcy2-dt92 crash]# ./crash ../vmlinux-5.8.0-44-generic ../xxx.dmp

crash 7.2.1
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

crash: cannot determine base kernel version
crash: ../vmlinux-5.8.0-44-generic and ../xxx.dmp do not match!


Am I doing something wrong? How do I need to use crash for it to work?

Igor


Re: [PATCH for-4.15] vtd: make sure QI/IR are disabled before initialisation

2021-03-08 Thread Igor Druzhinin

On 08/03/2021 08:18, Jan Beulich wrote:

On 08.03.2021 08:00, Igor Druzhinin wrote:

BIOS might pass control to Xen leaving QI and/or IR in an enabled and/or
partially configured state. In the case of the x2APIC code path, where EIM
is enabled early in boot, those are correctly disabled by Xen before any
attempt to configure them. But for xAPIC that step is missing, which was
proven to cause QI initialization failures on some ICX-based platforms
where QI is left pre-enabled and partially configured by the BIOS.


And those systems then tell us to avoid use of x2APIC? I would have
expected that on modern systems we wouldn't see such quirky firmware
behavior anymore. Anyway, half a sentence to this effect might help
here, as without such firmware behavior the only way to run into
this ought to be use of "no-x2apic" on the command line. Which in
turn might require justification (and potentially a fix elsewhere in
the code to make use of that option unnecessary).


Unify the behaviour between the x2APIC and xAPIC code paths, keeping it in
line with what Linux does.

Signed-off-by: Igor Druzhinin 


Reviewed-by: Jan Beulich 
with some editing of the description. If no other need for a v2
arises, I suppose whatever you come up with could be folded in
while committing.


How about:

"... But for xAPIC that step is missing, which was proven to cause QI 
initialization failures on some ICX-based platforms where QI is left 
pre-enabled and partially configured by the BIOS. That problem becomes hard 
to avoid since those platforms ship with firmware advertising x2APIC opt-out 
by default.

..."

Igor



[PATCH for-4.15] vtd: make sure QI/IR are disabled before initialisation

2021-03-07 Thread Igor Druzhinin
BIOS might pass control to Xen leaving QI and/or IR in an enabled and/or
partially configured state. In the case of the x2APIC code path, where EIM
is enabled early in boot, those are correctly disabled by Xen before any
attempt to configure them. But for xAPIC that step is missing, which was
proven to cause QI initialization failures on some ICX-based platforms
where QI is left pre-enabled and partially configured by the BIOS.

Unify the behaviour between the x2APIC and xAPIC code paths, keeping it in
line with what Linux does.

Signed-off-by: Igor Druzhinin 
---
 xen/arch/x86/apic.c |  2 +-
 xen/drivers/passthrough/vtd/iommu.c | 12 +++-
 xen/include/asm-x86/apic.h  |  1 +
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
index 7497ddb..8ab8214 100644
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -47,7 +47,7 @@ static bool __read_mostly tdt_enabled;
 static bool __initdata tdt_enable = true;
 boolean_param("tdt", tdt_enable);
 
-static bool __read_mostly iommu_x2apic_enabled;
+bool __read_mostly iommu_x2apic_enabled;
 
 static struct {
 int active;
diff --git a/xen/drivers/passthrough/vtd/iommu.c 
b/xen/drivers/passthrough/vtd/iommu.c
index d136fe3..4aa7a31 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2080,7 +2080,7 @@ static int __must_check init_vtd_hw(void)
 u32 sts;
 
 /*
- * Basic VT-d HW init: set VT-d interrupt, clear VT-d faults.  
+ * Basic VT-d HW init: set VT-d interrupt, clear VT-d faults, etc.
  */
 for_each_drhd_unit ( drhd )
 {
@@ -2090,6 +2090,16 @@ static int __must_check init_vtd_hw(void)
 
 clear_fault_bits(iommu);
 
+/*
+ * Disable interrupt remapping and queued invalidation if
+ * already enabled by BIOS in case we've not initialized it yet.
+ */
+if ( !iommu_x2apic_enabled )
+{
+disable_intremap(iommu);
+disable_qinval(iommu);
+}
+
 spin_lock_irqsave(&iommu->register_lock, flags);
 sts = dmar_readl(iommu->reg, DMAR_FECTL_REG);
 sts &= ~DMA_FECTL_IM;
diff --git a/xen/include/asm-x86/apic.h b/xen/include/asm-x86/apic.h
index 8ddb896..2fe54bb 100644
--- a/xen/include/asm-x86/apic.h
+++ b/xen/include/asm-x86/apic.h
@@ -24,6 +24,7 @@ enum apic_mode {
 APIC_MODE_X2APIC/* x2APIC mode - common for large MP machines */
 };
 
+extern bool iommu_x2apic_enabled;
 extern u8 apic_verbosity;
 extern bool directed_eoi_enabled;
 
-- 
2.7.4




[PATCH v2 2/2] tools/libxl: only set viridian flags on new domains

2021-02-03 Thread Igor Druzhinin
Domains migrating or restoring should have the viridian HVM param key in
the migration stream already, and setting that twice results in Xen
returning -EEXIST on the second attempt later (during migration stream parsing)
in case the values don't match. That causes the migration/restore operation
to fail on the destination side.

That issue has now resurfaced with the latest commits (983524671 and 7e5cffcd1e)
extending the default viridian feature set, making the values from previous
migration streams differ from those set at domain construction.

Suggested-by: Andrew Cooper 
Signed-off-by: Igor Druzhinin 
---
 tools/libs/light/libxl_x86.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 91169d1..58187ed 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -468,7 +468,10 @@ int libxl__arch_domain_create(libxl__gc *gc,
 (ret = hvm_set_conf_params(gc, domid, info)) != 0)
 goto out;
 
-if (info->type == LIBXL_DOMAIN_TYPE_HVM &&
+/* Viridian flags are already a part of the migration stream so set
+ * them here only for brand new domains. */
+if (!state->restore &&
+info->type == LIBXL_DOMAIN_TYPE_HVM &&
 (ret = hvm_set_viridian_features(gc, domid, info)) != 0)
 goto out;
 
-- 
2.7.4




[PATCH v2 1/2] tools/libxl: pass libxl__domain_build_state to libxl__arch_domain_create

2021-02-03 Thread Igor Druzhinin
No functional change.

Signed-off-by: Igor Druzhinin 
---
New patch in v2 as requested.
---
 tools/libs/light/libxl_arch.h | 6 --
 tools/libs/light/libxl_arm.c  | 4 +++-
 tools/libs/light/libxl_dom.c  | 2 +-
 tools/libs/light/libxl_x86.c  | 6 --
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/libs/light/libxl_arch.h b/tools/libs/light/libxl_arch.h
index 6a91775..c305d70 100644
--- a/tools/libs/light/libxl_arch.h
+++ b/tools/libs/light/libxl_arch.h
@@ -30,8 +30,10 @@ int libxl__arch_domain_save_config(libxl__gc *gc,
 
 /* arch specific internal domain creation function */
 _hidden
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
-   uint32_t domid);
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
+  uint32_t domid);
 
 /* setup arch specific hardware description, i.e. DTB on ARM */
 _hidden
diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 66e8a06..8c4eda3 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -126,7 +126,9 @@ int libxl__arch_domain_save_config(libxl__gc *gc,
 return 0;
 }
 
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
   uint32_t domid)
 {
 return 0;
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 1916857..842a51c 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -378,7 +378,7 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
 state->store_port = xc_evtchn_alloc_unbound(ctx->xch, domid, 
state->store_domid);
 state->console_port = xc_evtchn_alloc_unbound(ctx->xch, domid, 
state->console_domid);
 
-rc = libxl__arch_domain_create(gc, d_config, domid);
+rc = libxl__arch_domain_create(gc, d_config, state, domid);
 
 /* Construct a CPUID policy, but only for brand new domains.  Domains
  * being migrated-in/restored have CPUID handled during the
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 91a9fc7..91169d1 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -453,8 +453,10 @@ static int hvm_set_conf_params(libxl__gc *gc, uint32_t 
domid,
 return ret;
 }
 
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
-uint32_t domid)
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
+  uint32_t domid)
 {
 const libxl_domain_build_info *info = &d_config->b_info;
 int ret = 0;
-- 
2.7.4




Re: [PATCH] tools/libxl: only set viridian flags on new domains

2021-02-02 Thread Igor Druzhinin
On 03/02/2021 04:01, Igor Druzhinin wrote:
> Domains migrating or restoring should have the viridian HVM param key in
> the migration stream already, and setting that twice results in Xen
> returning -EEXIST on the second attempt later (during migration stream parsing)
> in case the values don't match. That causes the migration/restore operation
> to fail on the destination side.
> 
> That issue has now resurfaced with the latest commits (983524671 and 7e5cffcd1e)
> extending the default viridian feature set, making the values from previous
> migration streams differ from those set at domain construction.
> 
> Signed-off-by: Igor Druzhinin 

Suggested-by: Andrew Cooper 

Igor



[PATCH] tools/libxl: only set viridian flags on new domains

2021-02-02 Thread Igor Druzhinin
Domains migrating or restoring should have the viridian HVM param key in
the migration stream already, and setting that twice results in Xen
returning -EEXIST on the second attempt later (during migration stream parsing)
in case the values don't match. That causes the migration/restore operation
to fail on the destination side.

That issue has now resurfaced with the latest commits (983524671 and 7e5cffcd1e)
extending the default viridian feature set, making the values from previous
migration streams differ from those set at domain construction.

Signed-off-by: Igor Druzhinin 
---
 tools/libs/light/libxl_arch.h |  6 --
 tools/libs/light/libxl_arm.c  |  4 +++-
 tools/libs/light/libxl_dom.c  |  2 +-
 tools/libs/light/libxl_x86.c  | 11 ---
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/tools/libs/light/libxl_arch.h b/tools/libs/light/libxl_arch.h
index 6a91775..c305d70 100644
--- a/tools/libs/light/libxl_arch.h
+++ b/tools/libs/light/libxl_arch.h
@@ -30,8 +30,10 @@ int libxl__arch_domain_save_config(libxl__gc *gc,
 
 /* arch specific internal domain creation function */
 _hidden
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
-   uint32_t domid);
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
+  uint32_t domid);
 
 /* setup arch specific hardware description, i.e. DTB on ARM */
 _hidden
diff --git a/tools/libs/light/libxl_arm.c b/tools/libs/light/libxl_arm.c
index 66e8a06..8c4eda3 100644
--- a/tools/libs/light/libxl_arm.c
+++ b/tools/libs/light/libxl_arm.c
@@ -126,7 +126,9 @@ int libxl__arch_domain_save_config(libxl__gc *gc,
 return 0;
 }
 
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
   uint32_t domid)
 {
 return 0;
diff --git a/tools/libs/light/libxl_dom.c b/tools/libs/light/libxl_dom.c
index 1916857..842a51c 100644
--- a/tools/libs/light/libxl_dom.c
+++ b/tools/libs/light/libxl_dom.c
@@ -378,7 +378,7 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
 state->store_port = xc_evtchn_alloc_unbound(ctx->xch, domid, 
state->store_domid);
 state->console_port = xc_evtchn_alloc_unbound(ctx->xch, domid, 
state->console_domid);
 
-rc = libxl__arch_domain_create(gc, d_config, domid);
+rc = libxl__arch_domain_create(gc, d_config, state, domid);
 
 /* Construct a CPUID policy, but only for brand new domains.  Domains
  * being migrated-in/restored have CPUID handled during the
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 91a9fc7..58187ed 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -453,8 +453,10 @@ static int hvm_set_conf_params(libxl__gc *gc, uint32_t 
domid,
 return ret;
 }
 
-int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config,
-uint32_t domid)
+int libxl__arch_domain_create(libxl__gc *gc,
+  libxl_domain_config *d_config,
+  libxl__domain_build_state *state,
+  uint32_t domid)
 {
 const libxl_domain_build_info *info = &d_config->b_info;
 int ret = 0;
@@ -466,7 +468,10 @@ int libxl__arch_domain_create(libxl__gc *gc, 
libxl_domain_config *d_config,
 (ret = hvm_set_conf_params(gc, domid, info)) != 0)
 goto out;
 
-if (info->type == LIBXL_DOMAIN_TYPE_HVM &&
+/* Viridian flags are already a part of the migration stream so set
+ * them here only for brand new domains. */
+if (!state->restore &&
+info->type == LIBXL_DOMAIN_TYPE_HVM &&
 (ret = hvm_set_viridian_features(gc, domid, info)) != 0)
 goto out;
 
-- 
2.7.4




Re: [PATCH] xen/netback: avoid race in xenvif_rx_ring_slots_available()

2021-02-02 Thread Igor Druzhinin
On 02/02/2021 07:09, Juergen Gross wrote:
> Since commit 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
> xenvif_rx_ring_slots_available() is no longer called only from the rx
> queue kernel thread, so it needs to access the rx queue with the
> associated queue lock held.
> 
> Reported-by: Igor Druzhinin 
> Fixes: 23025393dbeb3b8b3 ("xen/netback: use lateeoi irq binding")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Juergen Gross 

Appreciate a quick fix! Is this the only place that sort of race could
happen now?

Igor



Re: staging: unable to restore HVM with Viridian param set

2021-02-02 Thread Igor Druzhinin
On 02/02/2021 08:35, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 02 February 2021 00:20
>> To: Andrew Cooper ; Tamas K Lengyel 
>> ; Xen-devel
>> ; Wei Liu ; Ian Jackson 
>> ; Anthony
>> PERARD ; Paul Durrant 
>> Subject: Re: staging: unable to restore HVM with Viridian param set
>>
>> n 01/02/2021 22:57, Andrew Cooper wrote:
>>> On 01/02/2021 22:51, Tamas K Lengyel wrote:
>>>> Hi all,
>>>> trying to restore a Windows VM saved on Xen 4.14 with Xen staging results 
>>>> in:
>>>>
>>>> # xl restore -p /shared/cfg/windows10.save
>>>> Loading new save file /shared/cfg/windows10.save (new xl fmt info 
>>>> 0x3/0x0/1475)
>>>>  Savefile contains xl domain config in JSON format
>>>> Parsing config from 
>>>> xc: info: Found x86 HVM domain from Xen 4.14
>>>> xc: info: Restoring domain
>>>> xc: error: set HVM param 9 = 0x0065 (17 = File exists):
>>>> Internal error
>>>> xc: error: Restore failed (17 = File exists): Internal error
>>>> libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done:
>>>> restoring domain: File exists
>>>> libxl: error: libxl_create.c:1581:domcreate_rebuild_done: Domain
>>>> 8:cannot (re-)build domain: -3
>>>> libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain
>>>> 8:Non-existant domain
>>>> libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain
>>>> 8:Unable to destroy guest
>>>> libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain
>>>> 8:Destruction of domain failed
>>>>
>>>> Running on staging 419cd07895891c6642f29085aee07be72413f08c
>>>
>>> CC'ing maintainers and those who've edited the code recently.
>>>
>>> What is happening is xl/libxl is selecting some viridian settings,
>>> applying them to the domain, and then the migrations stream has a
>>> different set of viridian settings.
>>>
>>> For a migrating-in VM, nothing should be set during domain build.
>>> Viridian state has been part of the migrate stream since before mig-v2,
>>> so can be considered to be everywhere relevant now.
>>
>> The fallout is likely from my changes that modified the default set of
>> viridian settings. The relevant commits:
>> 983524671031fcfdb24a6c0da988203ebb47aebe
>> 7e5cffcd1e9300cab46a1816b5eb676caaeed2c1
>>
>> The same config for migrated domains now implies a different set of viridian
>> extensions than the one set on the source side. That creates an inconsistency
>> in libxl. I don't really know how to address it properly in libxl other than
>> never extending the default set.
>>
> 
> Surely it should be addressed properly in libxl by not messing with the 
> viridian settings on migrate-in/resume, as Andrew says? TBH I thought this 
> was already the case. There should be no problem with adding to the default 
> set as this is just an xl/libxl concept; the param flags in Xen always 
> define the *exact* set of enabled enlightenments.

If maintainers agree with this approach I can make a patch.

Igor



dom0 crash in xenvif_rx_ring_slots_available

2021-02-01 Thread Igor Druzhinin
Juergen,

We've got a crash report from one of our customers (see below) running a 4.4
kernel. The functions involved appear to be new ones that came with XSA-332,
and nothing like this has been reported in their cloud before. It appears
there is a use-after-free happening on the skb in the following code fragment:

static bool xenvif_rx_ring_slots_available(struct xenvif_queue *queue)
{
	RING_IDX prod, cons;
	struct sk_buff *skb;
	int needed;

	skb = skb_peek(&queue->rx_queue);
	if (!skb)
		return false;

	needed = DIV_ROUND_UP(skb->len, XEN_PAGE_SIZE);
	if (skb_is_gso(skb))  /* <== skb points to zeroed memory */
		needed++;

Has something similar been reported before? Any ideas?

Igor


[2721961.681180]  ALERT: BUG: unable to handle kernel NULL pointer dereference 
at 0002
[2721961.681222]  ALERT: IP: [] 
xenvif_rx_ring_slots_available+0x4a/0xa0
[2721961.681252]   WARN: PGD 121cf5067 PUD 10095a067 PMD 0 
[2721961.681274]   WARN: Oops:  [#1] SMP 
[2721961.681291]   WARN: Modules linked in: iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi vport_vxlan(O) vport_stt(O) tun sch_sfq sch_htb 8021q garp 
mrp stp llc openvswitch(O) tunnel6 nf_conntrack_ipv6 nf_nat_ipv6 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack 
libcrc32c xt_tcpudp iptable_filter dm_multipath ipmi_devintf dm_mod 
x86_pkg_temp_thermal coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper 
cryptd lrw gf128mul glue_helper psmouse sb_edac sg edac_core lpc_ich mfd_core 
hpilo shpchp wmi tpm_tis tpm nls_utf8 isofs ipmi_si ipmi_msghandler ip_tables 
x_tables sd_mod uhci_hcd serio_raw ehci_pci ehci_hcd ixgbe(O) hpsa(O) vxlan 
scsi_transport_sas ip6_udp_tunnel udp_tunnel ptp pps_core scsi_dh_rdac 
scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 autofs4
[2721961.681538]   WARN: CPU: 12 PID: 0 Comm: swapper/12 Tainted: G   O 
   4.4.0+2 #1
[2721961.681551]   WARN: Hardware name: HP ProLiant DL380p Gen8, BIOS P70 
05/21/2018
[2721961.681562]   WARN: task: 880149231c00 ti: 880149238000 task.ti: 
880149238000
[2721961.681574]   WARN: RIP: e030:[]  [] 
xenvif_rx_ring_slots_available+0x4a/0xa0
[2721961.681591]   WARN: RSP: e02b:880149b83db8  EFLAGS: 00010002
[2721961.681601]   WARN: RAX:  RBX: c90040a0e000 RCX: 

[2721961.681615]   WARN: RDX: 0001 RSI:  RDI: 
c90040a0e000
[2721961.681629]   WARN: RBP: 880149b83db8 R08:  R09: 
880057bcc000
[2721961.681647]   WARN: R10: 880057bcc0e0 R11:  R12: 

[2721961.681671]   WARN: R13: 0157 R14: 0001 R15: 

[2721961.681742]   WARN: FS:  7f5cc4c98780() GS:880149b8() 
knlGS:
[2721961.681762]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 80050033
[2721961.681777]   WARN: CR2: 0002 CR3: 000101006000 CR4: 
00040660
[2721961.681802]   WARN: Stack:
[2721961.681812]   WARN:  880149b83dd8 8146cdf7 c90040a0e000 
c90040a0e028
[2721961.681878]   WARN:  880149b83e08 8146aa71 88009ca7ca80 
0157
[2721961.681911]   WARN:  88006315c020  880149b83e58 
810c046f
[2721961.681971]   WARN: Call Trace:
[2721961.681981]   WARN:   
[2721961.681994]   WARN:  [] xenvif_have_rx_work+0x17/0x80
[2721961.682044]   WARN:  [] xenvif_interrupt+0x61/0xb0
[2721961.682065]   WARN:  [] 
handle_irq_event_percpu+0x7f/0x1e0
[2721961.682108]   WARN:  [] handle_irq_event+0x3b/0x60
[2721961.682125]   WARN:  [] handle_edge_irq+0xff/0x130
[2721961.682178]   WARN:  [] generic_handle_irq+0x22/0x30
[2721961.682196]   WARN:  [] handle_irq_for_port+0xbd/0xd0
[2721961.682248]   WARN:  [] 
__evtchn_fifo_handle_events+0x143/0x170
[2721961.682266]   WARN:  [] 
evtchn_fifo_handle_events+0xe/0x10
[2721961.682285]   WARN:  [] __xen_evtchn_do_upcall+0x50/0x90
[2721961.682339]   WARN:  [] xen_evtchn_do_upcall+0x30/0x50
[2721961.682359]   WARN:  [] 
xen_do_hypervisor_callback+0x1e/0x40
[2721961.682414]   WARN:   
[2721961.682425]   WARN:  [] ? xen_hypercall_sched_op+0xa/0x20
[2721961.682443]   WARN:  [] ? xen_hypercall_sched_op+0xa/0x20
[2721961.682492]   WARN:  [] ? xen_safe_halt+0x10/0x20
[2721961.682506]   WARN:  [] ? default_idle+0x57/0xf0
[2721961.682518]   WARN:  [] ? arch_cpu_idle+0xf/0x20
[2721961.682569]   WARN:  [] ? default_idle_call+0x32/0x40
[2721961.682582]   WARN:  [] ? cpu_startup_entry+0x1ec/0x330
[2721961.682633]   WARN:  [] ? cpu_bringup_and_idle+0x18/0x20
[2721961.682645]   WARN: Code: 73 48 85 c0 74 f7 8b 90 80 00 00 00 8b 88 cc 00 
00 00 4c 8b 80 d0 00 00 00 0f b6 80 91 00 00 00 48 81 c2 ff 0f 00 00 48 c1 ea 
0c <66> 41 83 7c 08 02 00 8d 72 01 0f 44 f2 48 8b 97 d8 a9 00 00 83 
[2721961.682944]  ALERT: RIP  [] 
xenvif_rx_ring_slots_available+0x4a/0xa0
[2721961.682998]   WARN:  RSP 
[2721961.683006]   WARN: CR2: 

Re: staging: unable to restore HVM with Viridian param set

2021-02-01 Thread Igor Druzhinin
On 01/02/2021 22:57, Andrew Cooper wrote:
> On 01/02/2021 22:51, Tamas K Lengyel wrote:
>> Hi all,
>> trying to restore a Windows VM saved on Xen 4.14 with Xen staging results in:
>>
>> # xl restore -p /shared/cfg/windows10.save
>> Loading new save file /shared/cfg/windows10.save (new xl fmt info 
>> 0x3/0x0/1475)
>>  Savefile contains xl domain config in JSON format
>> Parsing config from 
>> xc: info: Found x86 HVM domain from Xen 4.14
>> xc: info: Restoring domain
>> xc: error: set HVM param 9 = 0x0065 (17 = File exists):
>> Internal error
>> xc: error: Restore failed (17 = File exists): Internal error
>> libxl: error: libxl_stream_read.c:850:libxl__xc_domain_restore_done:
>> restoring domain: File exists
>> libxl: error: libxl_create.c:1581:domcreate_rebuild_done: Domain
>> 8:cannot (re-)build domain: -3
>> libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain
>> 8:Non-existant domain
>> libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain
>> 8:Unable to destroy guest
>> libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain
>> 8:Destruction of domain failed
>>
>> Running on staging 419cd07895891c6642f29085aee07be72413f08c
> 
> CC'ing maintainers and those who've edited the code recently.
> 
> What is happening is xl/libxl is selecting some viridian settings,
> applying them to the domain, and then the migration stream has a
> different set of viridian settings.
> 
> For a migrating-in VM, nothing should be set during domain build. 
> Viridian state has been part of the migrate stream since before mig-v2,
> so can be considered to be everywhere relevant now.

The fallout is likely from my changes that modified the default set of viridian
settings. The relevant commits:
983524671031fcfdb24a6c0da988203ebb47aebe
7e5cffcd1e9300cab46a1816b5eb676caaeed2c1

The same config for migrated domains now implies a different set of viridian
extensions than the one set on the source side. That creates an inconsistency
in libxl. I don't really know how to address it properly in libxl other than
never extending the default set.

Igor



Re: [PATCH v2 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-25 Thread Igor Druzhinin
On 12/01/2021 04:17, Igor Druzhinin wrote:
> TLFS 7.8.1 stipulates that "a virtual processor index must be less than
> the maximum number of virtual processors per partition" that "can be obtained
> through CPUID leaf 0x4005". Furthermore, "Requirements for Implementing
> the Microsoft Hypervisor Interface" defines that starting from Windows Server
> 2012, which allowed more than 64 CPUs to be brought up, this leaf can now
> contain a value -1 basically assuming the hypervisor has no restriction while
> 0 (that we currently expose) means the default restriction is still present.
> 
> Along with the previous changes exposing ExProcessorMasks this allows a recent
> Windows VM with Viridian extension enabled to have more than 64 vCPUs without
> going into BSOD in some cases.
> 
> Since we didn't expose the leaf before and to keep CPUID data consistent for
> incoming streams from previous Xen versions - let's keep it behind an option.
> 
> Signed-off-by: Igor Druzhinin 
> ---

ping? Paul?

Igor



Re: [PATCH] OvmfPkg/XenPlatformPei: Grab 64-bit PCI MMIO hole size from OVMF info table

2021-01-19 Thread Igor Druzhinin
On 19/01/2021 13:20, Anthony PERARD wrote:
> On Mon, Jan 11, 2021 at 03:45:18AM +0000, Igor Druzhinin wrote:
>> diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c 
>> b/OvmfPkg/XenPlatformPei/MemDetect.c
>> index 1f81eee..4175a2f 100644
>> --- a/OvmfPkg/XenPlatformPei/MemDetect.c
>> +++ b/OvmfPkg/XenPlatformPei/MemDetect.c
>> @@ -227,6 +227,7 @@ GetFirstNonAddress (
>>UINT64   FirstNonAddress;
>>UINT64   Pci64Base, Pci64Size;
>>RETURN_STATUSPcdStatus;
>> +  EFI_STATUS   Status;
>>  
>>FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
>>  
>> @@ -245,7 +246,10 @@ GetFirstNonAddress (
>>// Otherwise, in order to calculate the highest address plus one, we must
>>// consider the 64-bit PCI host aperture too. Fetch the default size.
>>//
>> -  Pci64Size = PcdGet64 (PcdPciMmio64Size);
>> +  Status = XenGetPciMmioInfo (NULL, NULL, , );
> 
> Pci64Base is overridden later (25 lines below) by the value from
> FirstNonAddress; shouldn't this be avoided?
> Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
> 
>> diff --git a/OvmfPkg/XenPlatformPei/Xen.h b/OvmfPkg/XenPlatformPei/Xen.h
>> index 2605481..c6e5fbb 100644
>> --- a/OvmfPkg/XenPlatformPei/Xen.h
>> +++ b/OvmfPkg/XenPlatformPei/Xen.h
>> @@ -34,6 +34,16 @@ typedef struct {
>>EFI_PHYSICAL_ADDRESS E820;
>>UINT32 E820EntriesCount;
>>  } EFI_XEN_OVMF_INFO;
>> +
>> +// This extra table gives layout of PCI apertures in a Xen guest
>> +#define OVMF_INFO_PCI_TABLE 0
>> +
>> +typedef struct {
>> +  EFI_PHYSICAL_ADDRESS LowStart;
>> +  EFI_PHYSICAL_ADDRESS LowEnd;
>> +  EFI_PHYSICAL_ADDRESS HiStart;
>> +  EFI_PHYSICAL_ADDRESS HiEnd;
> 
> In the hvmloader patch, these are uint64. It doesn't seem like a good
> idea to use the type EFI_PHYSICAL_ADDRESS here. Could you change it to
> UINT64 here?
> 
> (even if EFI_PHYSICAL_ADDRESS seems to always be UINT64, in the source
> code.)
> 

Anthony, this patch is obsolete now - see discussion between Jan and Laszlo.
The new patch is:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg00789.html

Igor




[PATCH] OvmfPkg/XenPlatformPei: Use CPUID to get physical address width on Xen

2021-01-12 Thread Igor Druzhinin
We faced a problem with passing through a PCI device with a 64GB BAR to a
UEFI guest. The BAR is expectedly programmed into the 64-bit PCI aperture at
the 64G address, which pushes the physical address space to 37 bits. That is
above the 36-bit width that OVMF currently exposes to a guest without
tweaking the PcdPciMmio64Size knob.

The reverse calculation using this knob was inherited from QEMU-KVM platform
code, where it serves the purpose of finding the max accessible physical
address without necessarily trusting the emulated CPUID physbits value (which
could differ from the host physbits). On Xen we expect to use CPUID policy
to level the data correctly, to prevent situations with guest physbits >
host physbits, e.g. across migrations.

The next concern is resource consumption: the DXE IPL page tables, and the
time required to map the whole address space if the CPUID bits are used
directly. That can be mitigated by enabling support for 1G pages in the
DXE IPL configuration. 1G pages are available on most CPUs produced in
the last 10 years, and those without them don't have many phys bits.

Remove all the redundant code now (including the PcdPciMmio64.. handling that's
not used on Xen anyway) and grab physbits directly from CPUID, which should
match what baremetal UEFI systems do.

Signed-off-by: Igor Druzhinin 
---
 OvmfPkg/OvmfXen.dsc|   3 +
 OvmfPkg/XenPlatformPei/MemDetect.c | 166 +++--
 2 files changed, 15 insertions(+), 154 deletions(-)

diff --git a/OvmfPkg/OvmfXen.dsc b/OvmfPkg/OvmfXen.dsc
index 7d31e88..8ae6ed0 100644
--- a/OvmfPkg/OvmfXen.dsc
+++ b/OvmfPkg/OvmfXen.dsc
@@ -444,6 +444,9 @@
   ## Xen vlapic's frequence is 100 MHz
   gEfiMdePkgTokenSpaceGuid.PcdFSBClock|1
 
+  # We populate DXE IPL tables with 1G pages preferably on Xen
+  gEfiMdeModulePkgTokenSpaceGuid.PcdUse1GPageTable|TRUE
+
 

 #
 # Pcd Dynamic Section - list of all EDK II PCD Entries defined by this Platform
diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c 
b/OvmfPkg/XenPlatformPei/MemDetect.c
index 1f81eee..1970b63 100644
--- a/OvmfPkg/XenPlatformPei/MemDetect.c
+++ b/OvmfPkg/XenPlatformPei/MemDetect.c
@@ -172,175 +172,33 @@ GetSystemMemorySizeBelow4gb (
   return (UINT32) (((UINTN)((Cmos0x35 << 8) + Cmos0x34) << 16) + SIZE_16MB);
 }
 
-
-STATIC
-UINT64
-GetSystemMemorySizeAbove4gb (
-  )
-{
-  UINT32 Size;
-  UINTN  CmosIndex;
-
-  //
-  // In PVH case, there is no CMOS, we have to calculate the memory size
-  // from parsing the E820
-  //
-  if (XenPvhDetected ()) {
-UINT64  HighestAddress;
-
-HighestAddress = GetHighestSystemMemoryAddress (FALSE);
-ASSERT (HighestAddress == 0 || HighestAddress >= BASE_4GB);
-
-if (HighestAddress >= BASE_4GB) {
-  HighestAddress -= BASE_4GB;
-}
-
-return HighestAddress;
-  }
-
-  //
-  // CMOS 0x5b-0x5d specifies the system memory above 4GB MB.
-  // * CMOS(0x5d) is the most significant size byte
-  // * CMOS(0x5c) is the middle size byte
-  // * CMOS(0x5b) is the least significant size byte
-  // * The size is specified in 64kb chunks
-  //
-
-  Size = 0;
-  for (CmosIndex = 0x5d; CmosIndex >= 0x5b; CmosIndex--) {
-Size = (UINT32) (Size << 8) + (UINT32) CmosRead8 (CmosIndex);
-  }
-
-  return LShiftU64 (Size, 16);
-}
-
-
-/**
-  Return the highest address that DXE could possibly use, plus one.
-**/
-STATIC
-UINT64
-GetFirstNonAddress (
-  VOID
-  )
-{
-  UINT64   FirstNonAddress;
-  UINT64   Pci64Base, Pci64Size;
-  RETURN_STATUSPcdStatus;
-
-  FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
-
-  //
-  // If DXE is 32-bit, then we're done; PciBusDxe will degrade 64-bit MMIO
-  // resources to 32-bit anyway. See DegradeResource() in
-  // "PciResourceSupport.c".
-  //
-#ifdef MDE_CPU_IA32
-  if (!FeaturePcdGet (PcdDxeIplSwitchToLongMode)) {
-return FirstNonAddress;
-  }
-#endif
-
-  //
-  // Otherwise, in order to calculate the highest address plus one, we must
-  // consider the 64-bit PCI host aperture too. Fetch the default size.
-  //
-  Pci64Size = PcdGet64 (PcdPciMmio64Size);
-
-  if (Pci64Size == 0) {
-if (mBootMode != BOOT_ON_S3_RESUME) {
-  DEBUG ((DEBUG_INFO, "%a: disabling 64-bit PCI host aperture\n",
-__FUNCTION__));
-  PcdStatus = PcdSet64S (PcdPciMmio64Size, 0);
-  ASSERT_RETURN_ERROR (PcdStatus);
-}
-
-//
-// There's nothing more to do; the amount of memory above 4GB fully
-// determines the highest address plus one. The memory hotplug area (see
-// below) plays no role for the firmware in this case.
-//
-return FirstNonAddress;
-  }
-
-  //
-  // SeaBIOS aligns both boundaries of the 64-bit PCI host aperture to 1GB, so
-  // that the host can map it with 1GB hugepages. Follow suit.
-  //
-  Pci64Base = ALIGN_VALUE (FirstNonAddress, (UINT64)SIZE_1GB);
-  Pci64Size = ALIGN_VALUE (Pci64Size, (UINT64)SIZ

[PATCH v2 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-11 Thread Igor Druzhinin
TLFS 7.8.1 stipulates that "a virtual processor index must be less than
the maximum number of virtual processors per partition" that "can be obtained
through CPUID leaf 0x4005". Furthermore, "Requirements for Implementing
the Microsoft Hypervisor Interface" defines that starting from Windows Server
2012, which allowed more than 64 CPUs to be brought up, this leaf can now
contain a value of -1, basically meaning the hypervisor has no restriction,
while 0 (what we currently expose) means the default restriction is still
present.

Along with the previous changes exposing ExProcessorMasks this allows a recent
Windows VM with Viridian extension enabled to have more than 64 vCPUs without
going into BSOD in some cases.

Since we didn't expose the leaf before, and to keep CPUID data consistent for
incoming streams from previous Xen versions, let's keep it behind an option.

Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- expose the option in libxl
---
 docs/man/xl.cfg.5.pod.in |  9 -
 tools/include/libxl.h|  6 ++
 tools/libs/light/libxl_types.idl |  1 +
 tools/libs/light/libxl_x86.c |  4 
 xen/arch/x86/hvm/viridian/viridian.c | 23 +++
 xen/include/public/hvm/params.h  |  7 ++-
 6 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index c8e017f..3467eae 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -2260,11 +2260,18 @@ mask. Hence this enlightenment must be specified for 
guests with more
 than 64 vCPUs if B and/or B are also
 specified.
 
+=item B
+
+This group when set indicates to a guest that the hypervisor does not
+explicitly have any limits on the number of Virtual processors a guest
+is allowed to bring up. It is strongly recommended to keep this enabled
+for guests with more than 64 vCPUs.
+
 =item B
 
 This is a special value that enables the default set of groups, which
 is currently the B, B, B, B,
-B and B groups.
+B, B and B groups.
 
 =item B
 
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index 3433c95..be1e288 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -452,6 +452,12 @@
 #define LIBXL_HAVE_VIRIDIAN_EX_PROCESSOR_MASKS 1
 
 /*
+ * LIBXL_HAVE_VIRIDIAN_NO_VP_LIMIT indicates that the 'no_vp_limit' value
+ * is present in the viridian enlightenment enumeration.
+ */
+#define LIBXL_HAVE_VIRIDIAN_NO_VP_LIMIT 1
+
+/*
  * LIBXL_HAVE_DEVICE_PCI_LIST_FREE indicates that the
  * libxl_device_pci_list_free() function is defined.
  */
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 0532473..8502b29 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -239,6 +239,7 @@ libxl_viridian_enlightenment = 
Enumeration("viridian_enlightenment", [
 (8, "stimer"),
 (9, "hcall_ipi"),
 (10, "ex_processor_masks"),
+(11, "no_vp_limit"),
 ])
 
 libxl_hdtype = Enumeration("hdtype", [
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 86d2729..5c4c194 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -309,6 +309,7 @@ static int hvm_set_viridian_features(libxl__gc *gc, 
uint32_t domid,
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_TIME_REF_COUNT);
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_APIC_ASSIST);
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_CRASH_CTL);
+libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_NO_VP_LIMIT);
 }
 
 libxl_for_each_set_bit(v, info->u.hvm.viridian_enable) {
@@ -369,6 +370,9 @@ static int hvm_set_viridian_features(libxl__gc *gc, 
uint32_t domid,
 if (libxl_bitmap_test(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_EX_PROCESSOR_MASKS))
 mask |= HVMPV_ex_processor_masks;
 
+if (libxl_bitmap_test(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_NO_VP_LIMIT))
+mask |= HVMPV_no_vp_limit;
+
 if (mask != 0 &&
 xc_hvm_param_set(CTX->xch,
  domid,
diff --git a/xen/arch/x86/hvm/viridian/viridian.c 
b/xen/arch/x86/hvm/viridian/viridian.c
index ed97804..ae1ea86 100644
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -209,6 +209,29 @@ void cpuid_viridian_leaves(const struct vcpu *v, uint32_t 
leaf,
 res->b = viridian_spinlock_retry_count;
 break;
 
+case 5:
+/*
+ * From "Requirements for Implementing the Microsoft Hypervisor
+ *  Interface":
+ *
+ * "On Windows operating systems versions through Windows Server
+ * 2008 R2, reporting the HV#1 hypervisor interface limits
+ * the Windows virtual machine to a maximum of 64 VPs, regardless of
+ * what is reported via CPUID.4005.EAX.
+ *
+ * Starting with Windows Server 20

[PATCH v2 2/2] viridian: allow vCPU hotplug for Windows VMs

2021-01-11 Thread Igor Druzhinin
If Viridian extensions are enabled, Windows wouldn't currently allow
a hotplugged vCPU to be brought up dynamically. We need to expose a special
bit to let the guest know we allow it. Hide it behind an option to stay
on the safe side regarding compatibility with existing guests but
nevertheless set the option on by default.

Signed-off-by: Igor Druzhinin 
---
Changes on v2:
- hide the bit under an option and expose it in libxl
---
 docs/man/xl.cfg.5.pod.in | 7 ++-
 tools/include/libxl.h| 6 ++
 tools/libs/light/libxl_types.idl | 1 +
 tools/libs/light/libxl_x86.c | 4 
 xen/arch/x86/hvm/viridian/viridian.c | 5 -
 xen/include/public/hvm/params.h  | 7 ++-
 6 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 3467eae..7cdb859 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -2267,11 +2267,16 @@ explicitly have any limits on the number of Virtual 
processors a guest
 is allowed to bring up. It is strongly recommended to keep this enabled
 for guests with more than 64 vCPUs.
 
+=item B
+
+This set enables dynamic changes to Virtual processor states in Windows
+guests effectively allowing vCPU hotplug.
+
 =item B
 
 This is a special value that enables the default set of groups, which
 is currently the B, B, B, B,
-B, B and B groups.
+B, B, B and B groups.
 
 =item B
 
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index be1e288..7c7c541 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -458,6 +458,12 @@
 #define LIBXL_HAVE_VIRIDIAN_NO_VP_LIMIT 1
 
 /*
+ * LIBXL_HAVE_VIRIDIAN_CPU_HOTPLUG indicates that the 'cpu_hotplug' value
+ * is present in the viridian enlightenment enumeration.
+ */
+#define LIBXL_HAVE_VIRIDIAN_CPU_HOTPLUG 1
+
+/*
  * LIBXL_HAVE_DEVICE_PCI_LIST_FREE indicates that the
  * libxl_device_pci_list_free() function is defined.
  */
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 8502b29..00a8e68 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -240,6 +240,7 @@ libxl_viridian_enlightenment = 
Enumeration("viridian_enlightenment", [
 (9, "hcall_ipi"),
 (10, "ex_processor_masks"),
 (11, "no_vp_limit"),
+(12, "cpu_hotplug"),
 ])
 
 libxl_hdtype = Enumeration("hdtype", [
diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 5c4c194..91a9fc7 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -310,6 +310,7 @@ static int hvm_set_viridian_features(libxl__gc *gc, 
uint32_t domid,
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_APIC_ASSIST);
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_CRASH_CTL);
 libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_NO_VP_LIMIT);
+libxl_bitmap_set(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_CPU_HOTPLUG);
 }
 
 libxl_for_each_set_bit(v, info->u.hvm.viridian_enable) {
@@ -373,6 +374,9 @@ static int hvm_set_viridian_features(libxl__gc *gc, 
uint32_t domid,
 if (libxl_bitmap_test(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_NO_VP_LIMIT))
 mask |= HVMPV_no_vp_limit;
 
+if (libxl_bitmap_test(, 
LIBXL_VIRIDIAN_ENLIGHTENMENT_CPU_HOTPLUG))
+mask |= HVMPV_cpu_hotplug;
+
 if (mask != 0 &&
 xc_hvm_param_set(CTX->xch,
  domid,
diff --git a/xen/arch/x86/hvm/viridian/viridian.c 
b/xen/arch/x86/hvm/viridian/viridian.c
index ae1ea86..b906f7b 100644
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -76,6 +76,7 @@ typedef union _HV_CRASH_CTL_REG_CONTENTS
 } HV_CRASH_CTL_REG_CONTENTS;
 
 /* Viridian CPUID leaf 3, Hypervisor Feature Indication */
+#define CPUID3D_CPU_DYNAMIC_PARTITIONING (1 << 3)
 #define CPUID3D_CRASH_MSRS (1 << 10)
 #define CPUID3D_SINT_POLLING (1 << 17)
 
@@ -179,8 +180,10 @@ void cpuid_viridian_leaves(const struct vcpu *v, uint32_t 
leaf,
 res->a = u.lo;
 res->b = u.hi;
 
+if ( viridian_feature_mask(d) & HVMPV_cpu_hotplug )
+   res->d = CPUID3D_CPU_DYNAMIC_PARTITIONING;
 if ( viridian_feature_mask(d) & HVMPV_crash_ctl )
-res->d = CPUID3D_CRASH_MSRS;
+res->d |= CPUID3D_CRASH_MSRS;
 if ( viridian_feature_mask(d) & HVMPV_synic )
 res->d |= CPUID3D_SINT_POLLING;
 
diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h
index 805f4ca..c9d6e70 100644
--- a/xen/include/public/hvm/params.h
+++ b/xen/include/public/hvm/params.h
@@ -172,6 +172,10 @@
 #define _HVMPV_no_vp_limit 11
 #define HVMPV_no_vp_limit (1 << _HVMPV_no_vp_limit)
 
+/* Enable vCPU hotplug */
+#define _HVMPV_cpu_hotplug 12
+#define HVMPV_cpu_hotplug (1 << _HVMPV_cpu_hotplug)
+
 #define HVMPV_feature_mas

Re: [PATCH] hvmloader: pass PCI MMIO layout to OVMF as an info table

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 15:35, Laszlo Ersek wrote:
> 
> On 01/11/21 16:26, Igor Druzhinin wrote:
>> On 11/01/2021 15:21, Jan Beulich wrote:
>>> On 11.01.2021 15:49, Laszlo Ersek wrote:
>>>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>>>> the whole addressable space so it needs to know the last address it 
>>>>>>> needs
>>>>>>> to cover but at the same time not overdo the mappings.
>>>>>>>
>>>>>>> As there is seemingly no other way to pass or get this information in
>>>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet 
>>>>>>> enumerated,
>>>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>>>> table. Since the structure was initially created to be extendable -
>>>>>>> the change is backward compatible.
>>>>>>
>>>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>>>> in even more trouble there, as it couldn't even read addresses from
>>>>>> BARs, but would first need to assign them (or at least calculate
>>>>>> their intended positions).
>>>>>
>>>>> Maybe Laszlo or Anthony could answer this question quickly while I'm 
>>>>> investigating?
>>>>
>>>> On the bare metal, the phys address width of the processor is known.
>>>
>>> From CPUID I suppose.
>>>
>>>> OVMF does the whole calculation in reverse because there's no way for it
>>>> to know the physical address width of the physical (= host) CPU.
>>>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>>>> with EPT -- access to a GPA that is inexpressible with the phys address
>>>> width of the host CPU (= not mappable successfully with the nested page
>>>> tables) will behave super bad. I don't recall the exact symptoms, but it
>>>> prevents booting the guest OS.
>>>>
>>>> This is why the most conservative 36-bit width is assumed by default.
>>>
>>> IOW you don't trust virtualized CPUID output?
>>
>> I'm discussing this with Andrew and it appears we're certainly more lax
>> than KVM in wiring the physical address width from hardware directly into
>> the guest.
>>
>> Another problem I faced while experimenting is that creating page tables
>> for 46 bits of address space (what CPUID returned in my case) takes
>> about a minute on a modern CPU.
> 
> Even if you enable 1GiB pages?
> 
> (In the libvirt domain XML, it's expressed as
> 
> 
> )
> 
> ... I'm not doubtful, just curious. I guess that, when the physical
> address width is so large, a physical UEFI platform firmware will limit
> itself to a lesser width -- it could even offer some knobs in the setup TUI.

So it wasn't the feature bit that we expose by default in Xen, but the OVMF
configuration with 1G pages disabled for that use case. I enabled it and now
get booting in reasonable time even with 46 bits.

Given that in Xen we're not that sensitive to the physical address width being
different, and prefer to control that at a different level, I'd like to abandon
the ABI-change approach (does anyone have any objections?) and instead take the
physical address width directly from CPUID, which we already do in hvmloader.
The change would be local to the Xen platform.

Igor



Re: [PATCH] hvmloader: pass PCI MMIO layout to OVMF as an info table

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 15:21, Jan Beulich wrote:
> On 11.01.2021 15:49, Laszlo Ersek wrote:
>> On 01/11/21 15:00, Igor Druzhinin wrote:
>>> On 11/01/2021 09:27, Jan Beulich wrote:
>>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>>> We faced a problem with passing through a PCI device with 64GB BAR to
>>>>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>>>>> 64G address which pushes physical address space to 37 bits. OVMF uses
>>>>> address width early in PEI phase to make DXE identity pages covering
>>>>> the whole addressable space so it needs to know the last address it needs
>>>>> to cover but at the same time not overdo the mappings.
>>>>>
>>>>> As there is seemingly no other way to pass or get this information in
>>>>> OVMF at this early phase (ACPI is not yet available, PCI is not yet 
>>>>> enumerated,
>>>>> xenstore is not yet initialized) - extend the info structure with a new
>>>>> table. Since the structure was initially created to be extendable -
>>>>> the change is backward compatible.
>>>>
>>>> How does UEFI handle the same situation on baremetal? I'd guess it is
>>>> in even more trouble there, as it couldn't even read addresses from
>>>> BARs, but would first need to assign them (or at least calculate
>>>> their intended positions).
>>>
>>> Maybe Laszlo or Anthony could answer this question quickly while I'm 
>>> investigating?
>>
>> On the bare metal, the phys address width of the processor is known.
> 
> From CPUID I suppose.
> 
>> OVMF does the whole calculation in reverse because there's no way for it
>> to know the physical address width of the physical (= host) CPU.
>> "Overdoing" the mappings doesn't only waste resources, it breaks hard
>> with EPT -- access to a GPA that is inexpressible with the phys address
>> width of the host CPU (= not mappable successfully with the nested page
>> tables) will behave super bad. I don't recall the exact symptoms, but it
>> prevents booting the guest OS.
>>
>> This is why the most conservative 36-bit width is assumed by default.
> 
> IOW you don't trust virtualized CPUID output?

I'm discussing this with Andrew and it appears we're certainly more lax
than KVM in wiring the physical address width from hardware directly into
the guest.

Another problem I faced while experimenting is that creating page tables
for 46 bits of address space (what CPUID returned in my case) takes
about a minute on a modern CPU.

Igor



Re: [PATCH] hvmloader: pass PCI MMIO layout to OVMF as an info table

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 14:14, Jan Beulich wrote:
> On 11.01.2021 15:00, Igor Druzhinin wrote:
>> On 11/01/2021 09:27, Jan Beulich wrote:
>>> On 11.01.2021 05:53, Igor Druzhinin wrote:
>>>> --- a/tools/firmware/hvmloader/ovmf.c
>>>> +++ b/tools/firmware/hvmloader/ovmf.c
>>>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>>>  uint32_t e820_nr;
>>>>  } __attribute__ ((packed));
>>>>  
>>>> +#define OVMF_INFO_PCI_TABLE 0
>>>> +struct ovmf_pci_info {
>>>> +uint64_t low_start;
>>>> +uint64_t low_end;
>>>> +uint64_t hi_start;
>>>> +uint64_t hi_end;
>>>> +} __attribute__ ((packed));
>>>
>>> Forming part of ABI, I believe this belongs in a public header,
>>> which consumers could at least in principle use verbatim if
>>> they wanted to.
>>
>> It probably does, but if we'd want to move all of hand-over structures
>> wholesale that would include seabios as well. I'd stick with the current
>> approach to avoid code churn in various repos. Besides the structures
>> are not the only bits of ABI that are implicitly shared with BIOS images.
> 
> Well, so be it then for the time being. I'm going to be
> hesitant though ack-ing such, no matter that there are (bad)
> precedents. What I'd like to ask for as a minimum is to have
> a comment here clarifying this struct can't be changed
> arbitrarily because of being part of an ABI.

Ok, I will improve the information in the comments in an additional commit.

>>>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>>>  static void ovmf_finish_bios_info(void)
>>>>  {
>>>>  struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>>>> +struct ovmf_pci_info *pci_info;
>>>> +uint64_t *tables = 
>>>> scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
>>>
>>> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
>>> xen/include/public/ or tools/firmware/. Where does it get
>>> defined?
>>
>> I expect it to be unlimited from OVMF side. It just expects an array of 
>> tables_nr elements.
> 
> That wasn't the (primary) question. Me not being able to locate
> the place where this constant gets #define-d means I wonder how
> this code builds.

It's right up there in the same file.

>>> Also (nit) missing blanks around * .
>>>
>>>>  uint32_t i;
>>>>  uint8_t checksum;
>>>>  
>>>> +pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
>>>
>>> Is "scratch" correct here and above? I guess intended usage /
>>> scope will want spelling out somewhere.
>>
>> Again, scratch_alloc is used universally for handing over info between 
>> hvmloader
>> and BIOS images. Where would you want it to be spelled out?
> 
> Next to where all the involved structures get declared.
> Consumers need to be aware they may need to take precautions to
> avoid clobbering the contents before consuming it. But as per
> above there doesn't look to be such a central place (yet).

I will duplicate the comments for now in all places involved.
The struct checksum, I believe, serves exactly the purpose you described -
to catch that sort of bug early.

>>>> +pci_info->low_start = pci_mem_start;
>>>> +pci_info->low_end = pci_mem_end;
>>>> +pci_info->hi_start = pci_hi_mem_start;
>>>> +pci_info->hi_end = pci_hi_mem_end;
>>>> +
>>>> +tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>>>> +info->tables = (uint32_t)tables;
>>>> +info->tables_nr = 1;
>>>
>>> In how far is this problem (and hence solution / workaround) OVMF
>>> specific? IOW don't we need a more generic approach here?
>>
>> I believe it's very OVMF specific given only OVMF constructs identity page
>> tables for the whole address space - that's how it was designed. Seabios to
>> the best of my knowledge only has access to lower 4G.
> 
> Quite likely, yet how would SeaBIOS access such a huge frame
> buffer then? They can't possibly place it below 4G. Do systems
> with such video cards get penalized by e.g. not surfacing VESA
> mode changing functionality?

Yes, the VESA FB pointer is 32-bit only.
The framebuffer itself, in my experience, is located in a separate, smaller BAR
on real cards. That usually makes it land below 4G, which masks the problem
in most scenarios.

Igor



Re: [PATCH] hvmloader: pass PCI MMIO layout to OVMF as an info table

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 09:27, Jan Beulich wrote:
> On 11.01.2021 05:53, Igor Druzhinin wrote:
>> We faced a problem with passing through a PCI device with 64GB BAR to
>> UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
>> 64G address which pushes physical address space to 37 bits. OVMF uses
>> address width early in PEI phase to make DXE identity pages covering
>> the whole addressable space so it needs to know the last address it needs
>> to cover but at the same time not overdo the mappings.
>>
>> As there is seemingly no other way to pass or get this information in
>> OVMF at this early phase (ACPI is not yet available, PCI is not yet 
>> enumerated,
>> xenstore is not yet initialized) - extend the info structure with a new
>> table. Since the structure was initially created to be extendable -
>> the change is backward compatible.
> 
> How does UEFI handle the same situation on baremetal? I'd guess it is
> in even more trouble there, as it couldn't even read addresses from
> BARs, but would first need to assign them (or at least calculate
> their intended positions).

Maybe Laszlo or Anthony could answer this question quickly while I'm 
investigating?

>> --- a/tools/firmware/hvmloader/ovmf.c
>> +++ b/tools/firmware/hvmloader/ovmf.c
>> @@ -61,6 +61,14 @@ struct ovmf_info {
>>  uint32_t e820_nr;
>>  } __attribute__ ((packed));
>>  
>> +#define OVMF_INFO_PCI_TABLE 0
>> +struct ovmf_pci_info {
>> +uint64_t low_start;
>> +uint64_t low_end;
>> +uint64_t hi_start;
>> +uint64_t hi_end;
>> +} __attribute__ ((packed));
> 
> Forming part of ABI, I believe this belongs in a public header,
> which consumers could at least in principle use verbatim if
> they wanted to.

It probably does, but if we wanted to move all of the hand-over structures
wholesale, that would include seabios as well. I'd stick with the current
approach to avoid code churn in various repos. Besides, the structures
are not the only bits of ABI that are implicitly shared with BIOS images.

>> @@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
>>  static void ovmf_finish_bios_info(void)
>>  {
>>  struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
>> +struct ovmf_pci_info *pci_info;
>> +uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 
>> 0);
> 
> I wasn't able to locate OVMF_INFO_MAX_TABLES in either
> xen/include/public/ or tools/firmware/. Where does it get
> defined?

I expect it to be unlimited from OVMF side. It just expects an array of 
tables_nr elements.

> Also (nit) missing blanks around * .
> 
>>  uint32_t i;
>>  uint8_t checksum;
>>  
>> +pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
> 
> Is "scratch" correct here and above? I guess intended usage /
> scope will want spelling out somewhere.

Again, scratch_alloc is used universally for handing over info between hvmloader
and BIOS images. Where would you want it to be spelled out?

>> +pci_info->low_start = pci_mem_start;
>> +pci_info->low_end = pci_mem_end;
>> +pci_info->hi_start = pci_hi_mem_start;
>> +pci_info->hi_end = pci_hi_mem_end;
>> +
>> +tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
>> +info->tables = (uint32_t)tables;
>> +info->tables_nr = 1;
> 
> In how far is this problem (and hence solution / workaround) OVMF
> specific? IOW don't we need a more generic approach here?

I believe it's very OVMF specific given only OVMF constructs identity page
tables for the whole address space - that's how it was designed. Seabios to
the best of my knowledge only has access to lower 4G.

Igor



Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 13:47, Paul Durrant wrote:
>> -Original Message-
>> From: Jan Beulich 
>> Sent: 11 January 2021 13:38
>> To: Igor Druzhinin ; p...@xen.org
>> Cc: w...@xen.org; i...@xenproject.org; anthony.per...@citrix.com; 
>> andrew.coop...@citrix.com;
>> george.dun...@citrix.com; jul...@xen.org; sstabell...@kernel.org; 
>> roger@citrix.com; xen-
>> de...@lists.xenproject.org
>> Subject: Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per 
>> partition
>>
>> On 11.01.2021 14:34, Igor Druzhinin wrote:
>>> On 11/01/2021 09:16, Jan Beulich wrote:
>>>> On 11.01.2021 10:12, Paul Durrant wrote:
>>>>>> From: Paul Durrant 
>>>>>> Sent: 11 January 2021 09:10
>>>>>>
>>>>>>> From: Jan Beulich 
>>>>>>> Sent: 11 January 2021 09:00
>>>>>>>
>>>>>>> On 11.01.2021 09:45, Paul Durrant wrote:
>>>>>>>> You can add my R-b to the patch.
>>>>>>>
>>>>>>> That's the unchanged patch then, including the libxl change that
>>>>>>> I had asked about and that I have to admit I don't fully follow
>>>>>>> Igor's responses? I'm hesitant to give an ack for that aspect of
>>>>>>> the change, yet I suppose the libxl maintainers will defer to
>>>>>>> x86 ones there. Alternatively Andrew or Roger could of course
>>>>>>> ack this ...
>>>>>>>
>>>>>>
>>>>>> I don't think we really need specific control in xl.cfg as this is a fix 
>>>>>> for some poorly
>> documented
>>>>>> semantics in the spec. The flag simply prevents the leaf magically 
>>>>>> appearing on migrate and I
>> think
>>>>>> that's enough.
>>>>>
>>>>> ... although adding an option in xl/libxl isn't that much work, I suppose.
>>>>>
>>>>> Igor, would you be ok plumbing it through?
>>>>
>>>> This back and forth leaves unclear to me what I should do. I
>>>> would have asked on irc, but you're not there as it seems.
>>>
>>> I don't see a scenario where somebody would want to opt out of unlimited
>>> VPs per domain given the leaf with -1 is supported on all Windows versions.
>>
>> So Paul - commit patch as is then?
>>
>>> I can make it configurable in the future if reports re-surface it causes
>>> troubles somewhere.
>>
>> This is the slight concern I have: Having to make it configurable
>> once someone has reported trouble would look a little late to me.
>> Otoh I agree it may end up being dead code if no problems get
>> ever encountered.
>>
> 
> I think I'm persuaded by your caution. Since it's not a massive amount of 
> code, let's have flags for both wired through to xl and default them to on, 
> so I withdraw my R-b for the libxl_x86.c hunk.

Ok, will re-work the patches.

Igor



Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-11 Thread Igor Druzhinin
On 11/01/2021 09:16, Jan Beulich wrote:
> On 11.01.2021 10:12, Paul Durrant wrote:
>>> From: Paul Durrant 
>>> Sent: 11 January 2021 09:10
>>>
 From: Jan Beulich 
 Sent: 11 January 2021 09:00

 On 11.01.2021 09:45, Paul Durrant wrote:
> You can add my R-b to the patch.

 That's the unchanged patch then, including the libxl change that
 I had asked about and that I have to admit I don't fully follow
 Igor's responses? I'm hesitant to give an ack for that aspect of
 the change, yet I suppose the libxl maintainers will defer to
 x86 ones there. Alternatively Andrew or Roger could of course
 ack this ...

>>>
>>> I don't think we really need specific control in xl.cfg as this is a fix 
>>> for some poorly documented
>>> semantics in the spec. The flag simply prevents the leaf magically 
>>> appearing on migrate and I think
>>> that's enough.
>>
>> ... although adding an option in xl/libxl isn't that much work, I suppose.
>>
>> Igor, would you be ok plumbing it through?
> 
> This back and forth leaves unclear to me what I should do. I
> would have asked on irc, but you're not there as it seems.

I don't see a scenario where somebody would want to opt out of unlimited
VPs per domain given the leaf with -1 is supported on all Windows versions.
I can make it configurable in the future if reports surface that it causes
trouble somewhere.

I'd like to do the same with the CPU hotplug bit (given it's not configurable
in QEMU either) but wanted to hear opinions here.

Igor
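For context, the enlightenment groups being discussed here are selected
per-domain via the viridian list in the guest config. A fragment along these
lines (group names as documented in xl.cfg(5) at the time of writing; the
vCPU count is just an example from the testing above):

```
# xl.cfg fragment: enable the default Viridian set plus the extended
# processor-mask interfaces documented as required for >64 vCPUs
vcpus    = 66
viridian = [ "defaults", "ex_processor_masks" ]
```

Disabling Viridian extensions altogether ("viridian = 0") remains the only
opt-out for anything placed in the default set, which is the crux of the
configurability question in this thread.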




[PATCH] hvmloader: pass PCI MMIO layout to OVMF as an info table

2021-01-10 Thread Igor Druzhinin
We faced a problem with passing through a PCI device with 64GB BAR to
UEFI guest. The BAR is expectedly programmed into 64-bit PCI aperture at
64G address which pushes physical address space to 37 bits. OVMF uses
address width early in PEI phase to make DXE identity pages covering
the whole addressable space so it needs to know the last address it needs
to cover but at the same time not overdo the mappings.

As there is seemingly no other way to pass or get this information in
OVMF at this early phase (ACPI is not yet available, PCI is not yet enumerated,
xenstore is not yet initialized) - extend the info structure with a new
table. Since the structure was initially created to be extendable -
the change is backward compatible.

Signed-off-by: Igor Druzhinin 
---

Companion change in OVMF:
https://lists.xenproject.org/archives/html/xen-devel/2021-01/msg00516.html

---
 tools/firmware/hvmloader/ovmf.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/tools/firmware/hvmloader/ovmf.c b/tools/firmware/hvmloader/ovmf.c
index 23610a0..9bfe274 100644
--- a/tools/firmware/hvmloader/ovmf.c
+++ b/tools/firmware/hvmloader/ovmf.c
@@ -61,6 +61,14 @@ struct ovmf_info {
 uint32_t e820_nr;
 } __attribute__ ((packed));
 
+#define OVMF_INFO_PCI_TABLE 0
+struct ovmf_pci_info {
+uint64_t low_start;
+uint64_t low_end;
+uint64_t hi_start;
+uint64_t hi_end;
+} __attribute__ ((packed));
+
 static void ovmf_setup_bios_info(void)
 {
 struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
@@ -74,9 +82,21 @@ static void ovmf_setup_bios_info(void)
 static void ovmf_finish_bios_info(void)
 {
 struct ovmf_info *info = (void *)OVMF_INFO_PHYSICAL_ADDRESS;
+struct ovmf_pci_info *pci_info;
+uint64_t *tables = scratch_alloc(sizeof(uint64_t)*OVMF_INFO_MAX_TABLES, 0);
 uint32_t i;
 uint8_t checksum;
 
+pci_info = scratch_alloc(sizeof(struct ovmf_pci_info), 0);
+pci_info->low_start = pci_mem_start;
+pci_info->low_end = pci_mem_end;
+pci_info->hi_start = pci_hi_mem_start;
+pci_info->hi_end = pci_hi_mem_end;
+
+tables[OVMF_INFO_PCI_TABLE] = (uint32_t)pci_info;
+info->tables = (uint32_t)tables;
+info->tables_nr = 1;
+
 checksum = 0;
 for ( i = 0; i < info->length; i++ )
 checksum += ((uint8_t *)(info))[i];
-- 
2.7.4




[PATCH] OvmfPkg/XenPlatformPei: Grab 64-bit PCI MMIO hole size from OVMF info table

2021-01-10 Thread Igor Druzhinin
We faced a problem with passing through a PCI device with 64GB BAR to UEFI
guest. The BAR is expectedly programmed into 64-bit PCI aperture at
64G address which pushes the physical address space to 37 bits. That is above
the 36-bit width that OVMF currently exposes to a guest without tweaking the
PcdPciMmio64Size knob.

We can't really set this knob to a very high value and expect that to work
on all CPUs as the max physical address width varies significantly between
models. We also can't simply take the CPU address width at runtime and use it
as we need to cover all of that with identity pages for the DXE phase.

The exact size of upper PCI MMIO hole is only known after hvmloader placed
all of the BARs and calculated the necessary aperture which is exposed
through ACPI. This information is not readily available to OVMF at PEI phase.
So let's expose it using the existing extensible binary interface between
OVMF and hvmloader preserving compatibility.

Signed-off-by: Igor Druzhinin 
---

The change is backward compatible and has a companion change for hvmloader.

Are we still waiting to clean up the Xen stuff from the QEMU platform? Or do I
need to duplicate my changes there (I hope not)?

---
 OvmfPkg/XenPlatformPei/MemDetect.c |  6 -
 OvmfPkg/XenPlatformPei/Platform.h  |  8 +++
 OvmfPkg/XenPlatformPei/Xen.c   | 47 ++
 OvmfPkg/XenPlatformPei/Xen.h   | 12 +-
 4 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/OvmfPkg/XenPlatformPei/MemDetect.c 
b/OvmfPkg/XenPlatformPei/MemDetect.c
index 1f81eee..4175a2f 100644
--- a/OvmfPkg/XenPlatformPei/MemDetect.c
+++ b/OvmfPkg/XenPlatformPei/MemDetect.c
@@ -227,6 +227,7 @@ GetFirstNonAddress (
   UINT64   FirstNonAddress;
   UINT64   Pci64Base, Pci64Size;
   RETURN_STATUSPcdStatus;
+  EFI_STATUS   Status;
 
   FirstNonAddress = BASE_4GB + GetSystemMemorySizeAbove4gb ();
 
@@ -245,7 +246,10 @@ GetFirstNonAddress (
   // Otherwise, in order to calculate the highest address plus one, we must
   // consider the 64-bit PCI host aperture too. Fetch the default size.
   //
-  Pci64Size = PcdGet64 (PcdPciMmio64Size);
+  Status = XenGetPciMmioInfo (NULL, NULL, &Pci64Base, &Pci64Size);
+  if (EFI_ERROR (Status)) {
+Pci64Size = PcdGet64 (PcdPciMmio64Size);
+  }
 
   if (Pci64Size == 0) {
 if (mBootMode != BOOT_ON_S3_RESUME) {
diff --git a/OvmfPkg/XenPlatformPei/Platform.h 
b/OvmfPkg/XenPlatformPei/Platform.h
index 7661f4a..6024cae 100644
--- a/OvmfPkg/XenPlatformPei/Platform.h
+++ b/OvmfPkg/XenPlatformPei/Platform.h
@@ -127,6 +127,14 @@ XenGetE820Map (
   UINT32 *Count
   );
 
+EFI_STATUS
+XenGetPciMmioInfo (
+  UINT64 *PciBase,
+  UINT64 *PciSize,
+  UINT64 *Pci64Base,
+  UINT64 *Pci64Size
+  );
+
 extern EFI_BOOT_MODE mBootMode;
 
 extern UINT8 mPhysMemAddressWidth;
diff --git a/OvmfPkg/XenPlatformPei/Xen.c b/OvmfPkg/XenPlatformPei/Xen.c
index c41fecd..3c1970d 100644
--- a/OvmfPkg/XenPlatformPei/Xen.c
+++ b/OvmfPkg/XenPlatformPei/Xen.c
@@ -114,6 +114,53 @@ XenGetE820Map (
 }
 
 /**
+  Returns layouts of PCI MMIO holes provided by hvmloader
+
+  @param PciBase  Pointer to MMIO hole base address
+  @param PciSize  Pointer to MMIO hole size
+  @param Pci64BasePointer to upper MMIO hole base address
+  @param Pci64SizePointer to upper MMIO hole size
+
+  @return EFI_STATUS
+**/
+EFI_STATUS
+XenGetPciMmioInfo (
+  UINT64 *PciBase,
+  UINT64 *PciSize,
+  UINT64 *Pci64Base,
+  UINT64 *Pci64Size
+  )
+{
+  UINT64 *Tables;
+  EFI_XEN_OVMF_PCI_INFO *PciInfo;
+
+  if (mXenHvmloaderInfo == NULL) {
+return EFI_NOT_FOUND;
+  }
+
+  if (mXenHvmloaderInfo->TablesCount < OVMF_INFO_PCI_TABLE + 1) {
+return EFI_UNSUPPORTED;
+  }
+
+  ASSERT (mXenHvmloaderInfo->Tables &&
+  mXenHvmloaderInfo->Tables < MAX_ADDRESS);
+  Tables = (UINT64 *) mXenHvmloaderInfo->Tables;
+  PciInfo = (EFI_XEN_OVMF_PCI_INFO *) Tables[OVMF_INFO_PCI_TABLE];
+
+  ASSERT (PciInfo && (EFI_PHYSICAL_ADDRESS) PciInfo < MAX_ADDRESS);
+  if (PciBase && PciSize) {
+*PciBase = (UINT64) PciInfo->LowStart;
+*PciSize = (UINT64) (PciInfo->LowEnd - PciInfo->LowStart);
+  }
+  if (Pci64Base && Pci64Size) {
+*Pci64Base = (UINT64) PciInfo->HiStart;
+*Pci64Size = (UINT64) (PciInfo->HiEnd - PciInfo->HiStart);
+  }
+
+  return EFI_SUCCESS;
+}
+
+/**
   Connects to the Hypervisor.
 
   @return EFI_STATUS
diff --git a/OvmfPkg/XenPlatformPei/Xen.h b/OvmfPkg/XenPlatformPei/Xen.h
index 2605481..c6e5fbb 100644
--- a/OvmfPkg/XenPlatformPei/Xen.h
+++ b/OvmfPkg/XenPlatformPei/Xen.h
@@ -15,7 +15,7 @@ SPDX-License-Identifier: BSD-2-Clause-Patent
 // Physical address of OVMF info
 #define OVMF_INFO_PHYSICAL_ADDRESS 0x1000
 
-// This structure must match the definition on Xen side
+// These structures must match the definition on Xen side
 #pragma pack(1)
 typedef struct {
   CHAR8 Signature[14]; // XenHVMOVMF\0
@@ -34,6 +34,16 @@ typedef struct {
   EFI_

Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 08:32, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 08 January 2021 00:47
>> To: xen-devel@lists.xenproject.org
>> Cc: p...@xen.org; w...@xen.org; i...@xenproject.org; 
>> anthony.per...@citrix.com;
>> andrew.coop...@citrix.com; george.dun...@citrix.com; jbeul...@suse.com; 
>> jul...@xen.org;
>> sstabell...@kernel.org; roger@citrix.com; Igor Druzhinin 
>> 
>> Subject: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition
>>
>> TLFS 7.8.1 stipulates that "a virtual processor index must be less than
>> the maximum number of virtual processors per partition" that "can be obtained
>> through CPUID leaf 0x4005". Furthermore, "Requirements for Implementing
>> the Microsoft Hypervisor Interface" defines that starting from Windows Server
>> 2012, which allowed more than 64 CPUs to be brought up, this leaf can now
>> contain a value -1 basically assuming the hypervisor has no restriction while
>> 0 (that we currently expose) means the default restriction is still present.
>>
>> Along with previous changes exposing ExProcessorMasks this allows a recent
>> Windows VM with Viridian extension enabled to have more than 64 vCPUs without
>> going into immediate BSOD.
>>
> 
> This is very odd as I was happily testing with a 128 vCPU VM once 
> ExProcessorMasks was implemented... no need for the extra leaf.
> The documentation for 0x4005 states " Describes the scale limits 
> supported in the current hypervisor implementation. If any
> value is zero, the hypervisor does not expose the corresponding information". 
> It does not say (in section 7.8.1 or elsewhere AFAICT)
> what implications that has for VP_INDEX.
> 
> In " Requirements for Implementing the Microsoft Hypervisor Interface" I 
> don't see anything saying what the semantics of not
> implementing leaf 0x4005 are, only that if implementing it then -1 must 
> be used to break the 64 VP limit. It also says that
> reporting -1 in 0x4005 means that 4004.EAX bits 1 and 2 *must* be 
> clear, which is clearly not true if ExProcessorMasks is
> enabled. That document is dated June 13th 2012 so I think it should be taken 
> with a pinch of salt.
> 
> Have you actually observed a BSOD with >64 vCPUs when ExProcessorMasks is 
> enabled? If so, which version of Windows? I'd like to get
> a repro myself.

I return with more testing that confirms both my and your results.

1) with ExProcessorMasks and 66 vCPUs assigned, both the current WS19 build and
pre-release 20270 of vNext boot correctly with no changes

and that would be fine but the existing documentation for ex_processor_masks
states that:
"Hence this enlightenment must be specified for guests with more
than 64 vCPUs if hcall_remote_tlb_flush and/or hcall_ipi are also
specified."

You would then expect a 64+ vCPU VM to boot without ex_processor_masks,
hcall_remote_tlb_flush and hcall_ipi.

So,
2) without ExProcessorMasks and 66 vCPUs assigned and with hcall_remote_tlb_flush
and hcall_ipi disabled: WS19 BSODs and vNext fails to initialize secondary CPUs

After applying my change,
3) without ExProcessorMasks and 66 vCPUs assigned and with hcall_remote_tlb_flush
and hcall_ipi disabled WS19 and vNext boot correctly

So we either need to change documentation and require ExProcessorMasks for all
VMs with 64+ vCPUs or go with my change.

Igor



Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 13:17, Jan Beulich wrote:
> On 08.01.2021 12:27, Igor Druzhinin wrote:
>> On 08/01/2021 09:19, Jan Beulich wrote:
>>> On 08.01.2021 01:46, Igor Druzhinin wrote:
>>>> --- a/tools/libs/light/libxl_x86.c
>>>> +++ b/tools/libs/light/libxl_x86.c
>>>> @@ -336,7 +336,7 @@ static int hvm_set_viridian_features(libxl__gc *gc, 
>>>> uint32_t domid,
>>>>  LOG(DETAIL, "%s group enabled", 
>>>> libxl_viridian_enlightenment_to_string(v));
>>>>  
>>>>  if (libxl_bitmap_test(&enlightenments, 
>>>> LIBXL_VIRIDIAN_ENLIGHTENMENT_BASE)) {
>>>> -mask |= HVMPV_base_freq;
>>>> +mask |= HVMPV_base_freq | HVMPV_no_vp_limit;
>>>
>>> Could you clarify the point of having the new control when at
>>> the guest config level there's no way to prevent it from
>>> getting enabled (short of disabling Viridian extensions
>>> altogether)?
>>
>> If you migrate in from a host without this code (previous version
>> of Xen)- you will keep the old behavior for the guest thus preserving
>> binary compatibility.
> 
> Keeping thing compatible like this is clearly a requirement. But
> is this enough? As soon as the guest reboots, it will see differing
> behavior, won't it?

Yes, that's what I was expecting - after reboot it should be fine.
Hence I put this option into the default set - I'd expect no limit to be
applied in the first place.

Igor 



Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 08:32, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 08 January 2021 00:47
>> To: xen-devel@lists.xenproject.org
>> Cc: p...@xen.org; w...@xen.org; i...@xenproject.org; 
>> anthony.per...@citrix.com;
>> andrew.coop...@citrix.com; george.dun...@citrix.com; jbeul...@suse.com; 
>> jul...@xen.org;
>> sstabell...@kernel.org; roger@citrix.com; Igor Druzhinin 
>> 
>> Subject: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition
>>
>> TLFS 7.8.1 stipulates that "a virtual processor index must be less than
>> the maximum number of virtual processors per partition" that "can be obtained
>> through CPUID leaf 0x4005". Furthermore, "Requirements for Implementing
>> the Microsoft Hypervisor Interface" defines that starting from Windows Server
>> 2012, which allowed more than 64 CPUs to be brought up, this leaf can now
>> contain a value -1 basically assuming the hypervisor has no restriction while
>> 0 (that we currently expose) means the default restriction is still present.
>>
>> Along with previous changes exposing ExProcessorMasks this allows a recent
>> Windows VM with Viridian extension enabled to have more than 64 vCPUs without
>> going into immediate BSOD.
>>
> 
> This is very odd as I was happily testing with a 128 vCPU VM once 
> ExProcessorMasks was implemented... no need for the extra leaf.
> The documentation for 0x4005 states " Describes the scale limits 
> supported in the current hypervisor implementation. If any
> value is zero, the hypervisor does not expose the corresponding information". 
> It does not say (in section 7.8.1 or elsewhere AFAICT)
> what implications that has for VP_INDEX.

The text says that VP_INDEX "must" be below that limit - that is clear enough
for me.

I admit I have not tested with ExProcessorMasks exposed but I disabled TLB flush
and IPI extensions before that - Windows is mysterious in its handling of
this logic. Let me align all of the changes and perform some clean-cut testing.

> In " Requirements for Implementing the Microsoft Hypervisor Interface" I 
> don't see anything saying what the semantics of not
> implementing leaf 0x4005 are, only that if implementing it then -1 must 
> be used to break the 64 VP limit.

Yes, that's exactly the info this change is based on.

> It also says that
> reporting -1 in 0x4005 means that 4004.EAX bits 1 and 2 *must* be 
> clear, which is clearly not true if ExProcessorMasks is
> enabled. That document is dated June 13th 2012 so I think it should be taken 
> with a pinch of salt.

True - I assumed then that ExProcessorMasks is necessary for them to work.

Igor



Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 11:40, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 08 January 2021 11:36
>> To: p...@xen.org; xen-devel@lists.xenproject.org
>> Cc: w...@xen.org; i...@xenproject.org; anthony.per...@citrix.com; 
>> andrew.coop...@citrix.com;
>> george.dun...@citrix.com; jbeul...@suse.com; jul...@xen.org; 
>> sstabell...@kernel.org;
>> roger@citrix.com
>> Subject: Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>
>> On 08/01/2021 11:33, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Igor Druzhinin 
>>>> Sent: 08 January 2021 11:30
>>>> To: p...@xen.org; xen-devel@lists.xenproject.org
>>>> Cc: w...@xen.org; i...@xenproject.org; anthony.per...@citrix.com; 
>>>> andrew.coop...@citrix.com;
>>>> george.dun...@citrix.com; jbeul...@suse.com; jul...@xen.org; 
>>>> sstabell...@kernel.org;
>>>> roger@citrix.com
>>>> Subject: Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>>>
>>>> On 08/01/2021 08:38, Paul Durrant wrote:
>>>>>> -Original Message-
>>>>>> From: Igor Druzhinin 
>>>>>> Sent: 08 January 2021 00:47
>>>>>> To: xen-devel@lists.xenproject.org
>>>>>> Cc: p...@xen.org; w...@xen.org; i...@xenproject.org; 
>>>>>> anthony.per...@citrix.com;
>>>>>> andrew.coop...@citrix.com; george.dun...@citrix.com; jbeul...@suse.com; 
>>>>>> jul...@xen.org;
>>>>>> sstabell...@kernel.org; roger@citrix.com; Igor Druzhinin 
>>>>>> 
>>>>>> Subject: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>>>>>
>>>>>> If Viridian extensions are enabled, Windows wouldn't currently allow
>>>>>> a hotplugged vCPU to be brought up dynamically. We need to expose a 
>>>>>> special
>>>>>> bit to let the guest know we allow it. It appears we can just start 
>>>>>> exposing
>>>>>> it without worrying too much about compatibility - see relevant QEMU
>>>>>> discussion here:
>>>>>>
>>>>>> https://patchwork.kernel.org/project/qemu-devel/patch/1455364815-19586-1-git-send-email-
>>>>>> d...@openvz.org/
>>>>>
>>>>> I don't think that discussion really confirmed it was safe... just that 
>>>>> empirically it appeared to
>>>> be so. I think we should err on
>>>>> the side of caution and have this behind a feature flag (but I'm happy 
>>>>> for it to default to on).
>>>>
>>>> QEMU was having this code since 2016 and nobody complained is good
>>>> enough for me - but if you insist we need an option - ok, I will add one.
>>>>
>>>
>>> Given that it has not yet been in a release, perhaps you could just guard 
>>> this and the
>> implementation of leaf 0x4005 using HVMPV_ex_processor_masks?
>>
>> That looks sloppy and confusing to me - let's have a separate option instead.
>>
> 
> Yes, for this I guess it is really a separate thing. Using 
> HVMPV_ex_processor_masks to control the presence of leaf 0x4005 seems 
> reasonable (since it's all about being able to use >64 vcpus). Perhaps a new 
> HVMPV_cpu_hotplug for this one?

Agree.

Igor



Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 11:33, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 08 January 2021 11:30
>> To: p...@xen.org; xen-devel@lists.xenproject.org
>> Cc: w...@xen.org; i...@xenproject.org; anthony.per...@citrix.com; 
>> andrew.coop...@citrix.com;
>> george.dun...@citrix.com; jbeul...@suse.com; jul...@xen.org; 
>> sstabell...@kernel.org;
>> roger@citrix.com
>> Subject: Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>
>> On 08/01/2021 08:38, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Igor Druzhinin 
>>>> Sent: 08 January 2021 00:47
>>>> To: xen-devel@lists.xenproject.org
>>>> Cc: p...@xen.org; w...@xen.org; i...@xenproject.org; 
>>>> anthony.per...@citrix.com;
>>>> andrew.coop...@citrix.com; george.dun...@citrix.com; jbeul...@suse.com; 
>>>> jul...@xen.org;
>>>> sstabell...@kernel.org; roger@citrix.com; Igor Druzhinin 
>>>> 
>>>> Subject: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>>>
>>>> If Viridian extensions are enabled, Windows wouldn't currently allow
>>>> a hotplugged vCPU to be brought up dynamically. We need to expose a special
>>>> bit to let the guest know we allow it. It appears we can just start 
>>>> exposing
>>>> it without worrying too much about compatibility - see relevant QEMU
>>>> discussion here:
>>>>
>>>> https://patchwork.kernel.org/project/qemu-devel/patch/1455364815-19586-1-git-send-email-
>>>> d...@openvz.org/
>>>
>>> I don't think that discussion really confirmed it was safe... just that 
>>> empirically it appeared to
>> be so. I think we should err on
>>> the side of caution and have this behind a feature flag (but I'm happy for 
>>> it to default to on).
>>
>> QEMU was having this code since 2016 and nobody complained is good
>> enough for me - but if you insist we need an option - ok, I will add one.
>>
> 
> Given that it has not yet been in a release, perhaps you could just guard 
> this and the implementation of leaf 0x4005 using HVMPV_ex_processor_masks?

That looks sloppy and confusing to me - let's have a separate option instead.

Igor



Re: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 08:38, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 08 January 2021 00:47
>> To: xen-devel@lists.xenproject.org
>> Cc: p...@xen.org; w...@xen.org; i...@xenproject.org; 
>> anthony.per...@citrix.com;
>> andrew.coop...@citrix.com; george.dun...@citrix.com; jbeul...@suse.com; 
>> jul...@xen.org;
>> sstabell...@kernel.org; roger@citrix.com; Igor Druzhinin 
>> 
>> Subject: [PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs
>>
>> If Viridian extensions are enabled, Windows wouldn't currently allow
>> a hotplugged vCPU to be brought up dynamically. We need to expose a special
>> bit to let the guest know we allow it. It appears we can just start exposing
>> it without worrying too much about compatibility - see relevant QEMU
>> discussion here:
>>
>> https://patchwork.kernel.org/project/qemu-devel/patch/1455364815-19586-1-git-send-email-d...@openvz.org/
> 
> I don't think that discussion really confirmed it was safe... just that 
> empirically it appeared to be so. I think we should err on
> the side of caution and have this behind a feature flag (but I'm happy for it 
> to default to on).

QEMU has had this code since 2016 and nobody has complained, which is good
enough for me - but if you insist we need an option - ok, I will add one.

Igor



Re: [PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-08 Thread Igor Druzhinin
On 08/01/2021 09:19, Jan Beulich wrote:
> On 08.01.2021 01:46, Igor Druzhinin wrote:
>> --- a/tools/libs/light/libxl_x86.c
>> +++ b/tools/libs/light/libxl_x86.c
>> @@ -336,7 +336,7 @@ static int hvm_set_viridian_features(libxl__gc *gc, uint32_t domid,
>>          LOG(DETAIL, "%s group enabled", libxl_viridian_enlightenment_to_string(v));
>>  
>>      if (libxl_bitmap_test(&enlightenments, LIBXL_VIRIDIAN_ENLIGHTENMENT_BASE)) {
>> -        mask |= HVMPV_base_freq;
>> +        mask |= HVMPV_base_freq | HVMPV_no_vp_limit;
> 
> Could you clarify the point of having the new control when at
> the guest config level there's no way to prevent it from
> getting enabled (short of disabling Viridian extensions
> altogether)?

If you migrate in from a host without this code (previous version
of Xen)- you will keep the old behavior for the guest thus preserving
binary compatibility.

Igor



[PATCH 1/2] viridian: remove implicit limit of 64 VPs per partition

2021-01-07 Thread Igor Druzhinin
TLFS 7.8.1 stipulates that "a virtual processor index must be less than
the maximum number of virtual processors per partition" that "can be obtained
through CPUID leaf 0x4005". Furthermore, "Requirements for Implementing
the Microsoft Hypervisor Interface" defines that starting from Windows Server
2012, which allowed more than 64 CPUs to be brought up, this leaf can now
contain a value of -1, effectively meaning the hypervisor imposes no
restriction, while 0 (the value we currently expose) means the default
restriction is still present.

Along with previous changes exposing ExProcessorMasks this allows a recent
Windows VM with Viridian extension enabled to have more than 64 vCPUs without
going into immediate BSOD.

Since we didn't expose the leaf before and to keep CPUID data consistent for
incoming streams from previous Xen versions - let's keep it behind an option.

Signed-off-by: Igor Druzhinin 
---
 tools/libs/light/libxl_x86.c |  2 +-
 xen/arch/x86/hvm/viridian/viridian.c | 23 +++
 xen/include/public/hvm/params.h  |  7 ++-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/tools/libs/light/libxl_x86.c b/tools/libs/light/libxl_x86.c
index 86d2729..96c8bf1 100644
--- a/tools/libs/light/libxl_x86.c
+++ b/tools/libs/light/libxl_x86.c
@@ -336,7 +336,7 @@ static int hvm_set_viridian_features(libxl__gc *gc, uint32_t domid,
         LOG(DETAIL, "%s group enabled", libxl_viridian_enlightenment_to_string(v));
 
     if (libxl_bitmap_test(&enlightenments, LIBXL_VIRIDIAN_ENLIGHTENMENT_BASE)) {
-        mask |= HVMPV_base_freq;
+        mask |= HVMPV_base_freq | HVMPV_no_vp_limit;
 
         if (!libxl_bitmap_test(&enlightenments, LIBXL_VIRIDIAN_ENLIGHTENMENT_FREQ))
             mask |= HVMPV_no_freq;
diff --git a/xen/arch/x86/hvm/viridian/viridian.c b/xen/arch/x86/hvm/viridian/viridian.c
index ed97804..ae1ea86 100644
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -209,6 +209,29 @@ void cpuid_viridian_leaves(const struct vcpu *v, uint32_t leaf,
 res->b = viridian_spinlock_retry_count;
 break;
 
+case 5:
+/*
+ * From "Requirements for Implementing the Microsoft Hypervisor
+ *  Interface":
+ *
+ * "On Windows operating systems versions through Windows Server
+ * 2008 R2, reporting the HV#1 hypervisor interface limits
+ * the Windows virtual machine to a maximum of 64 VPs, regardless of
+ * what is reported via CPUID.4005.EAX.
+ *
+ * Starting with Windows Server 2012 and Windows 8, if
+ * CPUID.4005.EAX contains a value of -1, Windows assumes that
+ * the hypervisor imposes no specific limit to the number of VPs.
+ * In this case, Windows Server 2012 guest VMs may use more than 64
+ * VPs, up to the maximum supported number of processors applicable
+ * to the specific Windows version being used."
+ *
+ * For compatibility we hide it behind an option.
+ */
+if ( viridian_feature_mask(d) & HVMPV_no_vp_limit )
+res->a = -1;
+break;
+
 case 6:
 /* Detected and in use hardware features. */
 if ( cpu_has_vmx_virtualize_apic_accesses )
diff --git a/xen/include/public/hvm/params.h b/xen/include/public/hvm/params.h
index 3b0a0f4..805f4ca 100644
--- a/xen/include/public/hvm/params.h
+++ b/xen/include/public/hvm/params.h
@@ -168,6 +168,10 @@
 #define _HVMPV_ex_processor_masks 10
 #define HVMPV_ex_processor_masks (1 << _HVMPV_ex_processor_masks)
 
+/* Allow more than 64 VPs */
+#define _HVMPV_no_vp_limit 11
+#define HVMPV_no_vp_limit (1 << _HVMPV_no_vp_limit)
+
 #define HVMPV_feature_mask \
 (HVMPV_base_freq | \
  HVMPV_no_freq | \
@@ -179,7 +183,8 @@
  HVMPV_synic | \
  HVMPV_stimer | \
  HVMPV_hcall_ipi | \
- HVMPV_ex_processor_masks)
+ HVMPV_ex_processor_masks | \
+ HVMPV_no_vp_limit)
 
 #endif
 
-- 
2.7.4
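For archive readers, the leaf semantics the patch implements boil down to a small decision the guest makes. The following is an illustrative sketch only; the helper name and return convention are assumptions, not Xen or Windows code:

```c
#include <stdint.h>

/*
 * Illustrative interpretation of CPUID leaf 0x4005 EAX per
 * "Requirements for Implementing the Microsoft Hypervisor Interface":
 *   all bits set (-1) -> the hypervisor imposes no specific VP limit
 *   0                 -> the legacy default limit of 64 VPs applies
 *   anything else     -> an explicit limit reported by the hypervisor
 */
static inline uint32_t viridian_effective_vp_limit(uint32_t leaf_eax)
{
    if (leaf_eax == UINT32_MAX)
        return UINT32_MAX;  /* effectively unlimited */
    if (leaf_eax == 0)
        return 64;          /* pre-Windows-Server-2012 default restriction */
    return leaf_eax;
}
```

This is why the patch reports -1 when HVMPV_no_vp_limit is set and leaves the leaf at 0 otherwise, preserving the old behavior for migrated-in guests.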




[PATCH 2/2] viridian: allow vCPU hotplug for Windows VMs

2021-01-07 Thread Igor Druzhinin
If Viridian extensions are enabled, Windows wouldn't currently allow
a hotplugged vCPU to be brought up dynamically. We need to expose a special
bit to let the guest know we allow it. It appears we can just start exposing
it without worrying too much about compatibility - see relevant QEMU
discussion here:

https://patchwork.kernel.org/project/qemu-devel/patch/1455364815-19586-1-git-send-email-...@openvz.org/

Signed-off-by: Igor Druzhinin 
---
 xen/arch/x86/hvm/viridian/viridian.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/viridian/viridian.c b/xen/arch/x86/hvm/viridian/viridian.c
index ae1ea86..76e8291 100644
--- a/xen/arch/x86/hvm/viridian/viridian.c
+++ b/xen/arch/x86/hvm/viridian/viridian.c
@@ -76,6 +76,7 @@ typedef union _HV_CRASH_CTL_REG_CONTENTS
 } HV_CRASH_CTL_REG_CONTENTS;
 
 /* Viridian CPUID leaf 3, Hypervisor Feature Indication */
+#define CPUID3D_CPU_DYNAMIC_PARTITIONING (1 << 3)
 #define CPUID3D_CRASH_MSRS (1 << 10)
 #define CPUID3D_SINT_POLLING (1 << 17)
 
@@ -179,8 +180,11 @@ void cpuid_viridian_leaves(const struct vcpu *v, uint32_t leaf,
 res->a = u.lo;
 res->b = u.hi;
 
+/* Expose ability to bring up VPs dynamically - allows vCPU hotplug */
+res->d = CPUID3D_CPU_DYNAMIC_PARTITIONING;
+
 if ( viridian_feature_mask(d) & HVMPV_crash_ctl )
-res->d = CPUID3D_CRASH_MSRS;
+res->d |= CPUID3D_CRASH_MSRS;
 if ( viridian_feature_mask(d) & HVMPV_synic )
 res->d |= CPUID3D_SINT_POLLING;
 
-- 
2.7.4




Re: [PATCH v3] x86/intel: insert Ice Lake-X (server) and Ice Lake-D model numbers

2021-01-06 Thread Igor Druzhinin
On 06/01/2021 11:04, Jan Beulich wrote:
> On 23.12.2020 21:32, Igor Druzhinin wrote:
>> LBR, C-state MSRs should correspond to Ice Lake desktop according to
>> External Design Specification vol.2 for both models.
>>
>> Ice Lake-X is known to expose IF_PSCHANGE_MC_NO in IA32_ARCH_CAPABILITIES MSR
>> (confirmed on Whitley SDP) which means the erratum is fixed in hardware for
>> that model and therefore it shouldn't be present in has_if_pschange_mc list.
>> Provisionally assume the same to be the case for Ice Lake-D.
> 
> I did find Ice Lake D EDS, and it confirms the respective additions.
> In the course I also found the "plain" Ice Lake EDS, and it seems to
> contradict SDM vol 4 in that it doesn't list CC3_RESIDENCY (0x3FC).
> For now I guess we can consider this a doc error.
> 
> I didn't find Ice Lake-X EDS, though.

You can search for "Ice Lake server EDS volume 2".

Igor



[PATCH v3] x86/intel: insert Ice Lake-X (server) and Ice Lake-D model numbers

2020-12-23 Thread Igor Druzhinin
LBR, C-state MSRs should correspond to Ice Lake desktop according to
External Design Specification vol.2 for both models.

Ice Lake-X is known to expose IF_PSCHANGE_MC_NO in IA32_ARCH_CAPABILITIES MSR
(confirmed on Whitley SDP) which means the erratum is fixed in hardware for
that model and therefore it shouldn't be present in has_if_pschange_mc list.
Provisionally assume the same to be the case for Ice Lake-D.

Signed-off-by: Igor Druzhinin 
---
Changes in v3:
- Add Ice Lake-D model numbers
- Drop has_if_pschange_mc hunk following Tiger Lake related discussion -
  IF_PSCHANGE_MC_NO is confirmed to be exposed on Whitley SDP

---
 xen/arch/x86/acpi/cpu_idle.c | 2 ++
 xen/arch/x86/hvm/vmx/vmx.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index c092086..d788c8b 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -181,6 +181,8 @@ static void do_get_hw_residencies(void *arg)
 case 0x55:
 case 0x5E:
 /* Ice Lake */
+case 0x6A:
+case 0x6C:
 case 0x7D:
 case 0x7E:
 /* Tiger Lake */
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2d4475e..bff5979 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2775,7 +2775,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 /* Goldmont Plus */
 case 0x7a:
 /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x6a: case 0x6c: case 0x7d: case 0x7e:
 /* Tiger Lake */
 case 0x8c: case 0x8d:
 /* Tremont */
-- 
2.7.4




Re: [PATCH v2] x86/intel: insert Ice Lake X (server) model numbers

2020-12-21 Thread Igor Druzhinin
On 21/12/2020 16:36, Jan Beulich wrote:
> On 19.10.2020 04:47, Igor Druzhinin wrote:
>> LBR, C-state MSRs and if_pschange_mc erratum applicability should correspond
>> to Ice Lake desktop according to External Design Specification vol.2.
>>
>> Signed-off-by: Igor Druzhinin 
>> ---
>> Changes in v2:
>> - keep partial sorting
>>
>> Andrew, since you have access to these documents, please review as you have 
>> time.
> 
> Coming back to this - the recent SDM update inserted at least the
> model numbers, but besides 6a it also lists 6c. Judging from the
> majority of additions happening in pairs, I wonder whether we
> couldn't (reasonably safely) add 6c then here as well. Of course
> I still can't ack the change either way with access to just the
> SDM...

I checked what 0x6c is and it appears to be Ice Lake-D (next gen Xeon D).
The information from EDS vol.2 on Ice Lake-D available to us corresponds to what
I got for Ice Lake X. So the numbers could be added here as soon as Andrew finds
time to review that one.

Igor



Re: [PATCH v3 1/2] x86/IRQ: make max number of guests for a shared IRQ configurable

2020-12-07 Thread Igor Druzhinin
On 07/12/2020 09:43, Jan Beulich wrote:
> On 06.12.2020 18:43, Igor Druzhinin wrote:
>> @@ -1633,11 +1640,12 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
>>  goto retry;
>>  }
>>  
>> -if ( action->nr_guests == IRQ_MAX_GUESTS )
>> +if ( action->nr_guests == irq_max_guests )
>>  {
>> -printk(XENLOG_G_INFO "Cannot bind IRQ%d to dom%d. "
>> -   "Already at max share.\n",
>> -   pirq->pirq, v->domain->domain_id);
>> +printk(XENLOG_G_INFO
>> +   "Cannot bind IRQ%d to dom%pd: already at max share %u ",

I noticed it just now, but could you also remove the stray "dom" left in this
line while committing?

>> +   pirq->pirq, v->domain, irq_max_guests);
>> +printk("(increase with irq-max-guests= option)\n");
> 
> Now two separate printk()s are definitely worse. Then putting the
> part of the format string inside the parentheses on a separate line
> would still be better (and perhaps a sensible compromise with the
> grep-ability desire).

Now I'm confused, because you asked me not to split the format string between
the lines, which wouldn't be possible without splitting the printk()s. I
didn't really want to drop anything informative.

> With suitable adjustments, which I'd be okay making while committing
> as long as you agree,

Yes, do with it whatever you see fit.

Igor



[PATCH v3 1/2] x86/IRQ: make max number of guests for a shared IRQ configurable

2020-12-06 Thread Igor Druzhinin
... and increase the default to 16.

Current limit of 7 is too restrictive for modern systems where one GSI
could be shared by potentially many PCI INTx sources where each of them
corresponds to a device passed through to its own guest. Some systems do not
apply due diligence in swizzling INTx links in case e.g. INTA is declared as
interrupt pin for the majority of PCI devices behind a single router,
resulting in overuse of a GSI.

Introduce a new command line option to configure that limit and dynamically
allocate an array of the necessary size. Set the default size now to 16 which
is higher than 7 but could later be increased even more if necessary.

Signed-off-by: Igor Druzhinin 
---

Changes in v2:
- introduced a command line option as suggested
- set initial default limit to 16

Changes in v3:
- changed option name to use - instead of _
- used uchar instead of uint to utilize integer_param overflow handling logic
- used xmalloc_flex_struct
- restructured printk as suggested

---
 docs/misc/xen-command-line.pandoc | 10 ++
 xen/arch/x86/irq.c| 22 +++---
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b4a0d60..53e676b 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1641,6 +1641,16 @@ This option is ignored in **pv-shim** mode.
 ### nr_irqs (x86)
 > `= `
 
+### irq-max-guests (x86)
+> `= `
+
+> Default: `16`
+
+Maximum number of guests any individual IRQ could be shared between,
+i.e. a limit on the number of guests it is possible to start each having
+assigned a device sharing a common interrupt line.  Accepts values between
+1 and 255.
+
 ### numa (x86)
 > `= on | off | fake= | noacpi`
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 8d1f9a9..4fd0578 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -42,6 +42,10 @@ integer_param("nr_irqs", nr_irqs);
 int __read_mostly opt_irq_vector_map = OPT_IRQ_VECTOR_MAP_DEFAULT;
 custom_param("irq_vector_map", parse_irq_vector_map_param);
 
+/* Max number of guests IRQ could be shared with */
+static unsigned char __read_mostly irq_max_guests;
+integer_param("irq-max-guests", irq_max_guests);
+
 vmask_t global_used_vector_map;
 
 struct irq_desc __read_mostly *irq_desc = NULL;
@@ -435,6 +439,9 @@ int __init init_irq_data(void)
 for ( ; irq < nr_irqs; irq++ )
 irq_to_desc(irq)->irq = irq;
 
+if ( !irq_max_guests )
+irq_max_guests = 16;
+
 #ifdef CONFIG_PV
 /* Never allocate the hypercall vector or Linux/BSD fast-trap vector. */
 set_bit(LEGACY_SYSCALL_VECTOR, used_vectors);
@@ -1028,7 +1035,6 @@ int __init setup_irq(unsigned int irq, unsigned int irqflags,
  * HANDLING OF GUEST-BOUND PHYSICAL IRQS
  */
 
-#define IRQ_MAX_GUESTS 7
 typedef struct {
 u8 nr_guests;
 u8 in_flight;
@@ -1039,7 +1045,7 @@ typedef struct {
 #define ACKTYPE_EOI2 /* EOI on the CPU that was interrupted  */
 cpumask_var_t cpu_eoi_map; /* CPUs that need to EOI this interrupt */
 struct timer eoi_timer;
-struct domain *guest[IRQ_MAX_GUESTS];
+struct domain *guest[];
 } irq_guest_action_t;
 
 /*
@@ -1564,7 +1570,8 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     if ( newaction == NULL )
     {
         spin_unlock_irq(&desc->lock);
-        if ( (newaction = xmalloc(irq_guest_action_t)) != NULL &&
+        if ( (newaction = xmalloc_flex_struct(irq_guest_action_t, guest,
+                                              irq_max_guests)) != NULL &&
              zalloc_cpumask_var(&newaction->cpu_eoi_map) )
             goto retry;
         xfree(newaction);
@@ -1633,11 +1640,12 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
 goto retry;
 }
 
-if ( action->nr_guests == IRQ_MAX_GUESTS )
+if ( action->nr_guests == irq_max_guests )
 {
-printk(XENLOG_G_INFO "Cannot bind IRQ%d to dom%d. "
-   "Already at max share.\n",
-   pirq->pirq, v->domain->domain_id);
+printk(XENLOG_G_INFO
+   "Cannot bind IRQ%d to dom%pd: already at max share %u ",
+   pirq->pirq, v->domain, irq_max_guests);
+printk("(increase with irq-max-guests= option)\n");
 rc = -EBUSY;
 goto unlock_out;
 }
-- 
2.7.4
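The xmalloc_flex_struct() call in the patch above sizes the trailing guest[] flexible array member at run time. A standalone sketch of the same allocation pattern, with plain malloc() standing in for Xen's allocator (the struct is a cut-down model, not the real irq_guest_action_t):

```c
#include <stdlib.h>
#include <string.h>

/* Cut-down model of irq_guest_action_t: guest[] is a C99 flexible
 * array member whose slot count is chosen when the struct is allocated. */
struct guest_action {
    unsigned char nr_guests;
    unsigned char in_flight;
    void *guest[];                  /* irq_max_guests slots, sized below */
};

/* Stand-in for xmalloc_flex_struct(): allocate the header plus
 * max_guests trailing pointer slots as one zeroed block. */
static struct guest_action *alloc_guest_action(unsigned int max_guests)
{
    size_t sz = sizeof(struct guest_action) + max_guests * sizeof(void *);
    struct guest_action *a = malloc(sz);

    if (a)
        memset(a, 0, sz);
    return a;
}
```

The point of the pattern is that non-shareable IRQs (patch 2/2) can be given a single-slot array while shareable ones get irq_max_guests slots, without two struct definitions.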




[PATCH v3 2/2] x86/IRQ: allocate guest array of max size only for shareable IRQs

2020-12-06 Thread Igor Druzhinin
... and increase default "irq-max-guests" to 32.

It's not necessary to have an array of a size more than 1 for non-shareable
IRQs and it might impact scalability in case of high "irq-max-guests"
values being used - every IRQ in the system including MSIs would be
supplied with an array of that size.

Since it's now less impactful to use higher "irq-max-guests" value - bump
the default to 32. That should give more headroom for future systems.

Signed-off-by: Igor Druzhinin 
---

New in v2.
Based on Jan's suggestion.

Changes in v3:
- almost none since I prefer the clarity of the code as is

---
 docs/misc/xen-command-line.pandoc | 2 +-
 xen/arch/x86/irq.c| 7 ---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 53e676b..f7db2b6 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1644,7 +1644,7 @@ This option is ignored in **pv-shim** mode.
 ### irq-max-guests (x86)
 > `= `
 
-> Default: `16`
+> Default: `32`
 
 Maximum number of guests any individual IRQ could be shared between,
 i.e. a limit on the number of guests it is possible to start each having
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 4fd0578..a088818 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -440,7 +440,7 @@ int __init init_irq_data(void)
 irq_to_desc(irq)->irq = irq;
 
 if ( !irq_max_guests )
-irq_max_guests = 16;
+irq_max_guests = 32;
 
 #ifdef CONFIG_PV
 /* Never allocate the hypercall vector or Linux/BSD fast-trap vector. */
@@ -1540,6 +1540,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     unsigned int        irq;
     struct irq_desc    *desc;
     irq_guest_action_t *action, *newaction = NULL;
+    unsigned char       max_nr_guests = will_share ? irq_max_guests : 1;
     int                 rc = 0;
 
     WARN_ON(!spin_is_locked(&v->domain->event_lock));
@@ -1571,7 +1572,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     {
         spin_unlock_irq(&desc->lock);
         if ( (newaction = xmalloc_flex_struct(irq_guest_action_t, guest,
-                                              irq_max_guests)) != NULL &&
+                                              max_nr_guests)) != NULL &&
              zalloc_cpumask_var(&newaction->cpu_eoi_map) )
             goto retry;
         xfree(newaction);
@@ -1640,7 +1641,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
 goto retry;
 }
 
-if ( action->nr_guests == irq_max_guests )
+if ( action->nr_guests >= max_nr_guests )
 {
 printk(XENLOG_G_INFO
"Cannot bind IRQ%d to dom%pd: already at max share %u ",
-- 
2.7.4




[PATCH v2 1/2] x86/IRQ: make max number of guests for a shared IRQ configurable

2020-12-02 Thread Igor Druzhinin
... and increase the default to 16.

Current limit of 7 is too restrictive for modern systems where one GSI
could be shared by potentially many PCI INTx sources where each of them
corresponds to a device passed through to its own guest. Some systems do not
apply due dilligence in swizzling INTx links in case e.g. INTA is declared as
interrupt pin for the majority of PCI devices behind a single router,
resulting in overuse of a GSI.

Introduce a new command line option to configure that limit and dynamically
allocate an array of the necessary size. Set the default size now to 16 which
is higher than 7 but could later be increased even more if necessary.

Signed-off-by: Igor Druzhinin 
---

Changes in v2:
- introduced a command line option as suggested
- set the default limit to 16 for now

---
 docs/misc/xen-command-line.pandoc |  9 +
 xen/arch/x86/irq.c| 19 +--
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index b4a0d60..f5f230c 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1641,6 +1641,15 @@ This option is ignored in **pv-shim** mode.
 ### nr_irqs (x86)
 > `= `
 
+### irq_max_guests (x86)
+> `= `
+
+> Default: `16`
+
+Maximum number of guests IRQ could be shared between, i.e. a limit on
+the number of guests it is possible to start each having assigned a device
+sharing a common interrupt line.  Accepts values between 1 and 255.
+
 ### numa (x86)
 > `= on | off | fake= | noacpi`
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 8d1f9a9..5ae9846 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -42,6 +42,10 @@ integer_param("nr_irqs", nr_irqs);
 int __read_mostly opt_irq_vector_map = OPT_IRQ_VECTOR_MAP_DEFAULT;
 custom_param("irq_vector_map", parse_irq_vector_map_param);
 
+/* Max number of guests IRQ could be shared with */
+static unsigned int __read_mostly irq_max_guests;
+integer_param("irq_max_guests", irq_max_guests);
+
 vmask_t global_used_vector_map;
 
 struct irq_desc __read_mostly *irq_desc = NULL;
@@ -435,6 +439,9 @@ int __init init_irq_data(void)
 for ( ; irq < nr_irqs; irq++ )
 irq_to_desc(irq)->irq = irq;
 
+if ( !irq_max_guests || irq_max_guests > 255)
+irq_max_guests = 16;
+
 #ifdef CONFIG_PV
 /* Never allocate the hypercall vector or Linux/BSD fast-trap vector. */
 set_bit(LEGACY_SYSCALL_VECTOR, used_vectors);
@@ -1028,7 +1035,6 @@ int __init setup_irq(unsigned int irq, unsigned int irqflags,
  * HANDLING OF GUEST-BOUND PHYSICAL IRQS
  */
 
-#define IRQ_MAX_GUESTS 7
 typedef struct {
 u8 nr_guests;
 u8 in_flight;
@@ -1039,7 +1045,7 @@ typedef struct {
 #define ACKTYPE_EOI2 /* EOI on the CPU that was interrupted  */
 cpumask_var_t cpu_eoi_map; /* CPUs that need to EOI this interrupt */
 struct timer eoi_timer;
-struct domain *guest[IRQ_MAX_GUESTS];
+struct domain *guest[];
 } irq_guest_action_t;
 
 /*
@@ -1564,7 +1570,8 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     if ( newaction == NULL )
     {
         spin_unlock_irq(&desc->lock);
-        if ( (newaction = xmalloc(irq_guest_action_t)) != NULL &&
+        if ( (newaction = xmalloc_bytes(sizeof(irq_guest_action_t) +
+              irq_max_guests * sizeof(action->guest[0]))) != NULL &&
              zalloc_cpumask_var(&newaction->cpu_eoi_map) )
             goto retry;
 xfree(newaction);
@@ -1633,11 +1640,11 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
         goto retry;
     }
 
-    if ( action->nr_guests == IRQ_MAX_GUESTS )
+    if ( action->nr_guests == irq_max_guests )
     {
         printk(XENLOG_G_INFO "Cannot bind IRQ%d to dom%d. "
-               "Already at max share.\n",
-               pirq->pirq, v->domain->domain_id);
+               "Already at max share %u, increase with irq_max_guests= option.\n",
+               pirq->pirq, v->domain->domain_id, irq_max_guests);
 rc = -EBUSY;
 goto unlock_out;
 }
-- 
2.7.4




[PATCH v2 2/2] x86/IRQ: allocate guest array of max size only for shareable IRQs

2020-12-02 Thread Igor Druzhinin
... and increase default "irq_max_guests" to 32.

It's not necessary to have an array of a size more than 1 for non-shareable
IRQs and it might impact scalability in case of high "irq_max_guests"
values being used - every IRQ in the system including MSIs would be
supplied with an array of that size.

Since it's now less impactful to use higher "irq_max_guests" value - bump
the default to 32. That should give more headroom for future systems.

Signed-off-by: Igor Druzhinin 
---

New in v2.
This is suggested by Jan and is optional for me.

---
 docs/misc/xen-command-line.pandoc | 2 +-
 xen/arch/x86/irq.c| 7 ---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index f5f230c..dea2a22 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1644,7 +1644,7 @@ This option is ignored in **pv-shim** mode.
 ### irq_max_guests (x86)
 > `= `
 
-> Default: `16`
+> Default: `32`
 
 Maximum number of guests IRQ could be shared between, i.e. a limit on
 the number of guests it is possible to start each having assigned a device
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 5ae9846..70b7a53 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -440,7 +440,7 @@ int __init init_irq_data(void)
 irq_to_desc(irq)->irq = irq;
 
 if ( !irq_max_guests || irq_max_guests > 255)
-irq_max_guests = 16;
+irq_max_guests = 32;
 
 #ifdef CONFIG_PV
 /* Never allocate the hypercall vector or Linux/BSD fast-trap vector. */
@@ -1540,6 +1540,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     unsigned int        irq;
     struct irq_desc    *desc;
    irq_guest_action_t *action, *newaction = NULL;
+    unsigned int        max_nr_guests = will_share ? irq_max_guests : 1;
     int                 rc = 0;
 
     WARN_ON(!spin_is_locked(&v->domain->event_lock));
@@ -1571,7 +1572,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
     {
         spin_unlock_irq(&desc->lock);
         if ( (newaction = xmalloc_bytes(sizeof(irq_guest_action_t) +
-              irq_max_guests * sizeof(action->guest[0]))) != NULL &&
+              max_nr_guests * sizeof(action->guest[0]))) != NULL &&
              zalloc_cpumask_var(&newaction->cpu_eoi_map) )
             goto retry;
         xfree(newaction);
@@ -1640,7 +1641,7 @@ int pirq_guest_bind(struct vcpu *v, struct pirq *pirq, int will_share)
         goto retry;
     }
 
-    if ( action->nr_guests == irq_max_guests )
+    if ( action->nr_guests == max_nr_guests )
     {
         printk(XENLOG_G_INFO "Cannot bind IRQ%d to dom%d. "
                "Already at max share %u, increase with irq_max_guests= option.\n",
-- 
2.7.4




Re: [PATCH] x86/IRQ: bump max number of guests for a shared IRQ to 31

2020-12-02 Thread Igor Druzhinin
On 02/12/2020 15:21, Jan Beulich wrote:
> On 02.12.2020 15:53, Igor Druzhinin wrote:
>> On 02/12/2020 09:25, Jan Beulich wrote:
>>> Instead I'm wondering whether this wouldn't better be a Kconfig
>>> setting (or even command line controllable). There don't look to be
>>> any restrictions on the precise value chosen (i.e. 2**n-1 like is
>>> the case for old and new values here, for whatever reason), so a
>>> simple permitted range of like 4...64 would seem fine to specify.
>>> Whether the default then would want to be 8 (close to the current
>>> 7) or higher (around the actually observed maximum) is a different
>>> question.
>>
>> I'm in favor of a command line argument here - it would be much less trouble
>> if a higher limit was suddenly necessary in the field. The default IMO
>> should definitely be higher than 8 - I'd stick with 32, which to me should
>> cover our real-world scenarios and provide some headroom for the future.
> 
> Well, I'm concerned of the extra memory overhead. Every IRQ,
> sharable or not, will get the extra slots allocated with the
> current scheme. Perhaps a prereq change then would be to only
> allocate multi-guest arrays for sharable IRQs, effectively
> shrinking the overhead in particular for all MSI ones?

That's one way to improve overall system scalability, but in that area
there is certainly much bigger fish to fry elsewhere. With 32 elements in the
array we get 200 bytes of overhead per structure; with 16 it's just 72 extra
bytes, which in the unattainable worst-case scenario of every single vector
being taken in a 512-CPU machine would only account for several MB of overhead.

I'd start with dynamic array allocation first, setting the limit to 16, which
should be enough for now. Then, if that default value needs to be raised,
we can consider further improvements.

Igor
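
The overhead figures above are plain pointer arithmetic relative to the old fixed IRQ_MAX_GUESTS of 7, assuming 8-byte struct domain pointers on a 64-bit build; a quick sanity check:

```c
/* Extra bytes per irq_guest_action_t when the guest[] array grows from
 * the old fixed 7 slots to new_slots, assuming 8-byte domain pointers
 * (64-bit build).  Illustrative helper, not Xen code. */
static unsigned int guest_array_overhead(unsigned int new_slots)
{
    const unsigned int old_slots = 7, ptr_bytes = 8;

    return (new_slots - old_slots) * ptr_bytes;
}
```

With 32 slots this gives (32 - 7) * 8 = 200 bytes per IRQ; with 16 slots, (16 - 7) * 8 = 72 bytes, matching the numbers quoted in the reply.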



Re: [PATCH] x86/IRQ: bump max number of guests for a shared IRQ to 31

2020-12-02 Thread Igor Druzhinin
On 02/12/2020 09:25, Jan Beulich wrote:
> On 01.12.2020 00:59, Igor Druzhinin wrote:
>> Current limit of 7 is too restrictive for modern systems where one GSI
>> could be shared by potentially many PCI INTx sources where each of them
>> corresponds to a device passed through to its own guest. Some systems do not
>> apply due diligence in swizzling INTx links in case e.g. INTA is declared as
>> interrupt pin for the majority of PCI devices behind a single router,
>> resulting in overuse of a GSI.
>>
>> Signed-off-by: Igor Druzhinin 
>> ---
>>
>> If people think that would make sense - I can rework the array to a list of
>> domain pointers to avoid the limit.
> 
> Not sure about this. What is the biggest number you've found on any
> system? (I assume the chosen value of 31 has some headroom.)

The value 31 was taken as a practical maximum for one specific HP system
if all of the PCI slots in all of its riser cards are occupied with GPUs.
The value was obtained by reverse engineering their ACPI tables. Currently
we're only concerned with the number 8 (our graphics vendors do not recommend
installing more cards than that), but it's only a matter of time before it
grows. I'm also not sure why this routing scheme was chosen in the first place:
is it dictated by hardware restrictions, or by firmware engineers being lazy?
We're working with HP to determine it.

> Instead I'm wondering whether this wouldn't better be a Kconfig
> setting (or even command line controllable). There don't look to be
> any restrictions on the precise value chosen (i.e. 2**n-1 like is
> the case for old and new values here, for whatever reason), so a
> simple permitted range of like 4...64 would seem fine to specify.
> Whether the default then would want to be 8 (close to the current
> 7) or higher (around the actually observed maximum) is a different
> question.

I'm in favor of a command line argument here - it would be much less trouble
if a higher limit was suddenly necessary in the field. The default IMO
should definitely be higher than 8 - I'd stick with 32, which to me should
cover our real-world scenarios and provide some headroom for the future.

Igor



[PATCH] x86/IRQ: bump max number of guests for a shared IRQ to 31

2020-11-30 Thread Igor Druzhinin
Current limit of 7 is too restrictive for modern systems where one GSI
could be shared by potentially many PCI INTx sources where each of them
corresponds to a device passed through to its own guest. Some systems do not
apply due diligence in swizzling INTx links in case e.g. INTA is declared as
interrupt pin for the majority of PCI devices behind a single router,
resulting in overuse of a GSI.

Signed-off-by: Igor Druzhinin 
---

If people think that would make sense - I can rework the array to a list of
domain pointers to avoid the limit.

---
 xen/arch/x86/irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 8d1f9a9..194f660 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1028,7 +1028,7 @@ int __init setup_irq(unsigned int irq, unsigned int 
irqflags,
  * HANDLING OF GUEST-BOUND PHYSICAL IRQS
  */
 
-#define IRQ_MAX_GUESTS 7
+#define IRQ_MAX_GUESTS 31
 typedef struct {
 u8 nr_guests;
 u8 in_flight;
-- 
2.7.4
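The INTx swizzling the commit message refers to follows the conventional PCI-to-PCI bridge rule: the pin seen upstream of a bridge is rotated by the device number. A sketch of that generic rule, with pins numbered 0..3 for INTA..INTD (this is the spec's convention, not this particular HP system's routing):

```c
/* Conventional PCI INTx swizzle across a PCI-to-PCI bridge:
 * upstream_pin = (device_number + downstream_pin) % 4.
 * Firmware that instead routes INTA for most devices straight through,
 * without swizzling, piles many interrupt sources onto a single GSI -
 * the overuse described in the commit message above. */
static unsigned int intx_swizzle(unsigned int device, unsigned int pin)
{
    return (device + pin) % 4;
}
```

With proper swizzling, four devices behind one router that all declare INTA spread across four different links instead of sharing one.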




[PATCH v2] x86/intel: insert Ice Lake X (server) model numbers

2020-10-18 Thread Igor Druzhinin
LBR, C-state MSRs and if_pschange_mc erratum applicability should correspond
to Ice Lake desktop according to External Design Specification vol.2.

Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- keep partial sorting

Andrew, since you have access to these documents, please review as you have 
time.
---
 xen/arch/x86/acpi/cpu_idle.c | 1 +
 xen/arch/x86/hvm/vmx/vmx.c   | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 27e0b52..eca423c 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -181,6 +181,7 @@ static void do_get_hw_residencies(void *arg)
 case 0x55:
 case 0x5E:
 /* Ice Lake */
+case 0x6A:
 case 0x7D:
 case 0x7E:
 /* Kaby Lake */
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 86b8916..8382917 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2427,6 +2427,7 @@ static bool __init has_if_pschange_mc(void)
 case 0x4e: /* Skylake M */
 case 0x5e: /* Skylake D */
 case 0x55: /* Skylake-X / Cascade Lake */
+case 0x6a: /* Ice Lake-X */
 case 0x7d: /* Ice Lake */
 case 0x7e: /* Ice Lake */
 case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
@@ -2775,7 +2776,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 /* Goldmont Plus */
 case 0x7a:
 /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x6a: case 0x7d: case 0x7e:
 /* Tremont */
 case 0x86:
 /* Kaby Lake */
-- 
2.7.4




Re: [PATCH v2] hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region

2020-10-16 Thread Igor Druzhinin
On 16/10/2020 14:34, Sander Eikelenboom wrote:
> On 16/10/2020 08:34, Jan Beulich wrote:
>> On 16.10.2020 02:39, Igor Druzhinin wrote:
>>> ACPI specification contains statements describing memory marked with regular
>>> "ACPI data" type as reclaimable by the guest. Although the guest shouldn't
>>> really do it if it wants kexec or similar functionality to work, there
>>> could still be ambiguities in treating these regions as potentially regular
>>> RAM.
>>>
>>> One such example is SeaBIOS which currently reports "ACPI data" regions as
>>> RAM to the guest in its e801 call. Which it might have the right to do as 
>>> any
>>> user of this is expected to be ACPI unaware. But a QEMU bootloader later 
>>> seems
>>> to ignore that fact and is instead using e801 to find a place for initrd 
>>> which
>>> causes the tables to be erased. While arguably QEMU bootloader or SeaBIOS 
>>> need
>>> to be fixed / improved here, that is just one example of the potential 
>>> problems
>>> from using a reclaimable memory type.
>>>
>>> Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
>>> described by the spec as non-reclaimable (so cannot ever be treated like 
>>> RAM).
>>>
>>> Signed-off-by: Igor Druzhinin 
>>
>> Acked-by: Jan Beulich 
>>
>>
> 
> I don't see any stable and or fixes tags, but I assume this will go to
> the stable trees (which have (a backport of)
> 8efa46516c5f4cf185c8df179812c185d3c27eb6 in their staging branches) ?

Yes, this should go to the stable branches as well. I don't usually see the
Fixes: tag used on xen-devel (not sure if that's intentional or simply not
customary). It's also questionable whether this is a fix or a workaround for
a compatibility issue.

Igor



[PATCH v2] hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region

2020-10-15 Thread Igor Druzhinin
ACPI specification contains statements describing memory marked with regular
"ACPI data" type as reclaimable by the guest. Although the guest shouldn't
really do it if it wants kexec or similar functionality to work, there
could still be ambiguities in treating these regions as potentially regular
RAM.

One such example is SeaBIOS, which currently reports "ACPI data" regions as
RAM to the guest in its e801 call. It arguably has the right to do so, as any
user of this call is expected to be ACPI unaware. But a QEMU bootloader later
seems to ignore that fact and instead uses e801 to find a place for the initrd,
which causes the tables to be erased. While arguably the QEMU bootloader or
SeaBIOS needs to be fixed / improved here, that is just one example of the
potential problems from using a reclaimable memory type.

Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
described by the spec as non-reclaimable (so cannot ever be treated like RAM).

Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- Put the exact reasoning into a comment
- Improved commit message
---
 tools/firmware/hvmloader/e820.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 38bcf18..c490a0b 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -202,16 +202,21 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
- * Mark populated reserved memory that contains ACPI tables as ACPI data.
+ * Mark populated reserved memory that contains ACPI tables as ACPI NVS.
  * That should help the guest to treat it correctly later: e.g. pass to
- * the next kernel on kexec or reclaim if necessary.
+ * the next kernel on kexec.
+ *
+ * Using NVS type instead of a regular one helps to prevent potential
+ * space reuse by an ACPI unaware / buggy bootloader, option ROM, etc.
+ * before an ACPI OS takes control. This is possible due to the fact that
+ * ACPI NVS memory is explicitly described as non-reclaimable in ACPI spec.
  */
 
 if ( acpi_enabled )
 {
 e820[nr].addr = RESERVED_MEMBASE;
 e820[nr].size = acpi_mem_end - RESERVED_MEMBASE;
-e820[nr].type = E820_ACPI;
+e820[nr].type = E820_NVS;
 nr++;
 }
 
-- 
2.7.4




Re: [PATCH 1/2] x86/intel: insert Ice Lake X (server) model numbers

2020-10-14 Thread Igor Druzhinin
On 14/10/2020 16:47, Jan Beulich wrote:
> On 13.10.2020 05:02, Igor Druzhinin wrote:
>> LBR, C-state MSRs and if_pschange_mc erratum applicability should correspond
>> to Ice Lake desktop according to External Design Specification vol.2.
> 
> Could you tell me where this is publicly available? Even after spending
> quite a bit of time on searching for it, I can't seem to be able to
> find it. And the SDM doesn't have enough information (yet).

True, the SDM doesn't have this data. As I mentioned, the data is taken from
the External Design Specification for Ice Lake server, which is accessed using
an Intel account. I'm not completely sure it is right to make changes in an
open source project like Linux or Xen based on information which is not
publicly available yet. But Intel is frequently doing this with Linux: even my
second patch uses data taken from one of these documents and was committed by
Intel to Linux first.

Do we need the information publicly available to commit these changes as well?
If not, we can run with these changes in our patchqueue until it gets out
properly.

Igor



Re: [PATCH] hvmloader: flip "ACPI data" to ACPI NVS type for ACPI table region

2020-10-13 Thread Igor Druzhinin
On 13/10/2020 16:54, Jan Beulich wrote:
> On 13.10.2020 17:47, Igor Druzhinin wrote:
>> On 13/10/2020 16:35, Jan Beulich wrote:
>>> On 13.10.2020 14:59, Igor Druzhinin wrote:
>>>> On 13/10/2020 13:51, Jan Beulich wrote:
>>>>> As a consequence I think we will also want to adjust Xen itself to
>>>>> automatically disable ACPI when it ends up consuming E801 data. Or
>>>>> alternatively we should consider dropping all E801-related code (as
>>>>> being inapplicable to 64-bit systems).
>>>>
>>>> I'm not following here. What Xen has to do with E801? It's a SeaBIOS 
>>>> implemented
>>>> call that happened to be used by QEMU option ROM. We cannot drop it from 
>>>> there
>>>> as it's part of BIOS spec.
>>>
>>> Any ACPI aware OS has to use E820 (and nothing else). Hence our
>>> own use of E801 should either be dropped, or lead to the
>>> disabling of ACPI. Otherwise real firmware using logic similar
>>> to SeaBIOS'es (but hopefully properly accounting for holes)
>>> could make us use ACPI table space as normal RAM.
>>
>> It's not us using it - it's a boot loader from QEMU in a form of option ROM
>> that works in 16bit pre-OS environment which is not OS and relies on e801 
>> BIOS call.
>> I'm sure any ACPI aware OS does indeed use E820 but the problem here is not 
>> an OS.
>>
>> The option ROM is loaded using fw_cfg from QEMU so it's not our code. 
>> Technically
>> it's one foreign code (QEMU boot loader) talking to another foreign code 
>> (SeaBIOS)
>> which provides information based on E820 that we gave them.
>>
>> So I'm afraid decision to dynamically disable ACPI (whatever you mean by 
>> this)
>> cannot be made by sole usage of this call by a pre-OS boot loader.
> 
> I guess this is simply a misunderstanding. I'm not talking about
> your change or hvmloader or the boot loader at all. I was merely
> noticing a consequence of your findings on the behavior of Xen
> itself: Use of ACPI and use of E801 are exclusive of one another.

Sorry, yes. I forgot e801 is also used by Xen as an alternative to e820.

Igor



Re: [PATCH] hvmloader: flip "ACPI data" to ACPI NVS type for ACPI table region

2020-10-13 Thread Igor Druzhinin
On 13/10/2020 16:35, Jan Beulich wrote:
> On 13.10.2020 14:59, Igor Druzhinin wrote:
>> On 13/10/2020 13:51, Jan Beulich wrote:
>>> As a consequence I think we will also want to adjust Xen itself to
>>> automatically disable ACPI when it ends up consuming E801 data. Or
>>> alternatively we should consider dropping all E801-related code (as
>>> being inapplicable to 64-bit systems).
>>
>> I'm not following here. What Xen has to do with E801? It's a SeaBIOS 
>> implemented
>> call that happened to be used by QEMU option ROM. We cannot drop it from 
>> there
>> as it's part of BIOS spec.
> 
> Any ACPI aware OS has to use E820 (and nothing else). Hence our
> own use of E801 should either be dropped, or lead to the
> disabling of ACPI. Otherwise real firmware using logic similar
> to SeaBIOS'es (but hopefully properly accounting for holes)
> could make us use ACPI table space as normal RAM.

It's not us using it - it's a boot loader from QEMU in the form of an option
ROM that works in a 16-bit pre-OS environment, which is not an OS and relies
on the e801 BIOS call. I'm sure any ACPI aware OS does indeed use E820, but
the problem here is not an OS.

The option ROM is loaded using fw_cfg from QEMU, so it's not our code.
Technically it's one foreign code (the QEMU boot loader) talking to another
foreign code (SeaBIOS), which provides information based on the E820 that we
gave them.

So I'm afraid the decision to dynamically disable ACPI (whatever you mean by
this) cannot be made on the sole usage of this call by a pre-OS boot loader.

Igor



Re: [PATCH] hvmloader: flip "ACPI data" to ACPI NVS type for ACPI table region

2020-10-13 Thread Igor Druzhinin
On 13/10/2020 13:51, Jan Beulich wrote:
> On 13.10.2020 12:50, Igor Druzhinin wrote:
>> ACPI specification contains statements describing memory marked with regular
>> "ACPI data" type as reclaimable by the guest. Although the guest shouldn't
>> really do it if it wants kexec or similar functionality to work, there
>> could still be ambiguities in treating these regions as potentially regular
>> RAM.
>>
>> One such an example is SeaBIOS which currently reports "ACPI data" regions as
>> RAM to the guest in its e801 call. The guest then tries to use this region
>> for initrd placement and gets stuck.
> 
> Any theory on why it would get stuck? Having read the thread rooting
> at Sander's report, it hasn't become clear to me where the collision
> there is. A consumer of E801 (rather than E820) intends to not use
> ACPI data, and hence I consider SeaBIOS right in this regard (the
> lack of considering holes is a problem, though).

QEMU's fw_cfg Linux boot loader (that is used by our direct kernel boot method)
is using E801 to find the top of RAM and places images below that address.
Since now it's 0xfc0 it gets located right in a PCI hole below, which causes
the loader to hang.

>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -202,16 +202,17 @@ int build_e820_table(struct e820entry *e820,
>>  nr++;
>>  
>>  /*
>> - * Mark populated reserved memory that contains ACPI tables as ACPI 
>> data.
>> + * Mark populated reserved memory that contains ACPI tables as ACPI NVS.
>>   * That should help the guest to treat it correctly later: e.g. pass to
>> - * the next kernel on kexec or reclaim if necessary.
>> + * the next kernel on kexec and prevent space reclaim which is possible
>> + * with regular ACPI data type accoring to ACPI spec v6.3.
> 
> Preventing space reclaim is not the business of hvmloader. As per above,
> an ACPI unaware OS ought to be permitted to use as ordinary RAM all the
> space the tables occupy. Therefore at the very least the comment needs
> to reflect that this preventing of space reclaim is a workaround, not
> correct behavior.

Agree to modify the comment.

> Also as a nit: "according".
> 
> As a consequence I think we will also want to adjust Xen itself to
> automatically disable ACPI when it ends up consuming E801 data. Or
> alternatively we should consider dropping all E801-related code (as
> being inapplicable to 64-bit systems).

I'm not following here. What does Xen have to do with E801? It's a SeaBIOS
implemented call that happened to be used by a QEMU option ROM. We cannot drop
it from there as it's part of the BIOS spec.

Igor



[PATCH] hvmloader: flip "ACPI data" to ACPI NVS type for ACPI table region

2020-10-13 Thread Igor Druzhinin
ACPI specification contains statements describing memory marked with regular
"ACPI data" type as reclaimable by the guest. Although the guest shouldn't
really do it if it wants kexec or similar functionality to work, there
could still be ambiguities in treating these regions as potentially regular
RAM.

One such example is SeaBIOS, which currently reports "ACPI data" regions as
RAM to the guest in its e801 call. The guest then tries to use this region
for initrd placement and gets stuck. While arguably SeaBIOS needs to be fixed
here, that is just one example of the potential problems from using a
reclaimable memory type.

Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
described by the spec as non-reclaimable (so cannot ever be treated like RAM).

Signed-off-by: Igor Druzhinin 
---
 tools/firmware/hvmloader/e820.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 38bcf18..8870099 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -202,16 +202,17 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
- * Mark populated reserved memory that contains ACPI tables as ACPI data.
+ * Mark populated reserved memory that contains ACPI tables as ACPI NVS.
  * That should help the guest to treat it correctly later: e.g. pass to
- * the next kernel on kexec or reclaim if necessary.
+ * the next kernel on kexec and prevent space reclaim which is possible
+ * with regular ACPI data type accoring to ACPI spec v6.3.
  */
 
 if ( acpi_enabled )
 {
 e820[nr].addr = RESERVED_MEMBASE;
 e820[nr].size = acpi_mem_end - RESERVED_MEMBASE;
-e820[nr].type = E820_ACPI;
+e820[nr].type = E820_NVS;
 nr++;
 }
 
-- 
2.7.4




[PATCH 1/2] x86/intel: insert Ice Lake X (server) model numbers

2020-10-12 Thread Igor Druzhinin
LBR, C-state MSRs and if_pschange_mc erratum applicability should match those
of Ice Lake desktop, according to the External Design Specification vol. 2.

Signed-off-by: Igor Druzhinin 
---
 xen/arch/x86/acpi/cpu_idle.c | 1 +
 xen/arch/x86/hvm/vmx/vmx.c   | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 27e0b52..7ad726a 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -183,6 +183,7 @@ static void do_get_hw_residencies(void *arg)
 /* Ice Lake */
 case 0x7D:
 case 0x7E:
+case 0x6A:
 /* Kaby Lake */
 case 0x8E:
 case 0x9E:
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 86b8916..bce8b99 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2429,6 +2429,7 @@ static bool __init has_if_pschange_mc(void)
 case 0x55: /* Skylake-X / Cascade Lake */
 case 0x7d: /* Ice Lake */
 case 0x7e: /* Ice Lake */
+case 0x6a: /* Ice Lake-X */
 case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
 case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
 case 0xa5: /* Comet Lake H/S */
@@ -2775,7 +2776,7 @@ static const struct lbr_info *last_branch_msr_get(void)
 /* Goldmont Plus */
 case 0x7a:
 /* Ice Lake */
-case 0x7d: case 0x7e:
+case 0x7d: case 0x7e: case 0x6a:
 /* Tremont */
 case 0x86:
 /* Kaby Lake */
-- 
2.7.4




[PATCH 2/2] x86/mwait-idle: Customize IceLake server support

2020-10-12 Thread Igor Druzhinin
From: Chen Yu 

On the ICX platform, C1E auto-promotion is enabled by default.
As a result, the CPU might fall into C1E more often than on previous
platforms. So disable C1E auto-promotion and expose C1E as a separate
idle state.

Besides C1 and C1E, the exit latency of C6 was measured
by a dedicated tool. However, the exit latency (41us) exposed
by _CST is much smaller than the one we measured (128us). This
is probably because _CST uses the exit latency when woken
up from PC0+C6, rather than PC6+C6, where C6 was measured. Choose
the latter as we need the longest latency in theory.

Signed-off-by: Chen Yu 
Signed-off-by: Rafael J. Wysocki 
[Linux commit a472ad2bcea479ba068880125d7273fc95c14b70]
Signed-off-by: Igor Druzhinin 
---
Applying this gives almost 100% boost in sysbench cpu test on Whitley SDP
---
 xen/arch/x86/cpu/mwait-idle.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 8add13d..f0c6ff9 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -554,6 +554,28 @@ static const struct cpuidle_state skx_cstates[] = {
{}
 };
 
+static const struct cpuidle_state icx_cstates[] = {
+   {
+   .name = "C1-ICX",
+   .flags = MWAIT2flg(0x00),
+   .exit_latency = 1,
+   .target_residency = 1,
+   },
+   {
+   .name = "C1E-ICX",
+   .flags = MWAIT2flg(0x01),
+   .exit_latency = 4,
+   .target_residency = 4,
+   },
+   {
+   .name = "C6-ICX",
+   .flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED,
+   .exit_latency = 128,
+   .target_residency = 384,
+   },
+   {}
+};
+
 static const struct cpuidle_state atom_cstates[] = {
{
.name = "C1E-ATM",
@@ -904,6 +926,11 @@ static const struct idle_cpu idle_cpu_skx = {
.disable_promotion_to_c1e = 1,
 };
 
+static const struct idle_cpu idle_cpu_icx = {
+   .state_table = icx_cstates,
+   .disable_promotion_to_c1e = 1,
+};
+
 static const struct idle_cpu idle_cpu_avn = {
.state_table = avn_cstates,
.disable_promotion_to_c1e = 1,
@@ -958,6 +985,7 @@ static const struct x86_cpu_id intel_idle_ids[] 
__initconstrel = {
ICPU(0x8e, skl),
ICPU(0x9e, skl),
ICPU(0x55, skx),
+   ICPU(0x6a, icx),
ICPU(0x57, knl),
ICPU(0x85, knl),
ICPU(0x5c, bxt),
-- 
2.7.4




Re: [SUSPECTED SPAM]Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-11 Thread Igor Druzhinin
On 11/10/2020 11:40, Igor Druzhinin wrote:
> On 11/10/2020 10:43, Sander Eikelenboom wrote:
>> On 11/10/2020 02:06, Igor Druzhinin wrote:
>>> On 10/10/2020 18:51, Sander Eikelenboom wrote:
>>>> Hi Igor/Jan,
>>>>
>>>> I tried to update my AMD machine to current xen-unstable, but
>>>> unfortunately the HVM guests don't boot after that. The guest keeps
>>>> using CPU-cycles but I don't get to a command prompt (or any output at
>>>> all). PVH guests run fine.
>>>>
>>>> Bisection leads to commit:
>>>>
>>>> 8efa46516c5f4cf185c8df179812c185d3c27eb6
>>>> hvmloader: indicate ACPI tables with "ACPI data" type in e820
>>>>
>>>> I tried xen-unstable with this commit reverted and with that everything
>>>> works fine.
>>>>
>>>> I attached the xl-dmesg output.
>>>
>>> What guests are you using? 
>> Not sure I understand what you ask for, but:
>> dom0 PV
>> guest HVM (qemu-xen)
>>
>>> Could you get serial output from the guest?
>> Not getting any, it seems to be stuck in very early boot.
>>
>>> Is it AMD specific?
>> Can't tell, this is the only machine I test xen-unstable on.
>> It's a AMD phenom X6.
>> Both dom0 and guest kernel are 5.9-rc8.
>>
>> Tested with guest config:
>> kernel  = '/boot/vmlinuz-xen-guest'
>> ramdisk = '/boot/initrd.img-xen-guest'
>>
>> cmdline = 'root=UUID=7cc4a90d-d6b0-4958-bb7d-50497aa29f18 ro
>> nomodeset console=tty1 console=ttyS0 console=hvc0 earlyprintk=xen'
>>
>> type='hvm'
>>
>> device_model_version = 'qemu-xen'
>>
>> cpus= "2-5"
>> vcpus = 2
>>
>> memory  = '512'
>>
>> disk= [
>>   'phy:/dev/xen_vms_ssd/media,xvda,w'
>>   ]
>>
>> name= 'guest'
>>
>> vif = [ 'bridge=xen_bridge,ip=192.168.1.10,mac=00:16:3E:DC:0A:F1' ]
>>
>> on_poweroff = 'destroy'
>> on_reboot   = 'restart'
>> on_crash= 'preserve'
>>
>> vnc=0
>>
>>
>>> If it's a Linux guest could you get a stacktrace from
>>> the guest using xenctx?
>>
>> It is, here are few subsequent runs:
>>
>> ~# /usr/local/lib/xen/bin/xenctx -s
>> /boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
>> vcpu0:
>> cs:eip: ca80:0256
> 
> Ok, it's stuck in linuxboot.bin option ROM. That's not something we test in 
> Citrix -
> we don't use fw_cfg. It could be something with caching (given it's moving 
> but slowly) or a
> bug uncovered by memory map changes. I'll try to get a repro on Monday.

Right, I think I know what will fix your problem - could you flip "ACPI data"
type to "ACPI NVS" in my commit.

Jan, this is what we've discussed on the list as an ambiguity in ACPI spec but
couldn't reach a clean resolution after all.
SeaBIOS thinks that "ACPI data" type is essentially RAM that could be reported
as RAM resource to the guest in E801.
https://wiki.osdev.org/Detecting_Memory_(x86)#BIOS_Function:_INT_0x15.2C_AX_.3D_0xE801

// Calculate the maximum ramsize (less than 4gig) from e820 map.
static void
calcRamSize(void)
{
    u32 rs = 0;
    int i;
    for (i=e820_count-1; i>=0; i--) {
        struct e820entry *en = &e820_list[i];
        u64 end = en->start + en->size;
        u32 type = en->type;
        if (end <= 0xffffffff && (type == E820_ACPI || type == E820_RAM)) {
            rs = end;
            break;
        }
    }
    LegacyRamSize = rs >= 1024*1024 ? rs : 1024*1024;
}

What is wrong here, I think, is that it clearly doesn't handle holes and worked
more by luck. So SeaBIOS needs to be fixed, but I think that using ACPI NVS in
hvmloader is still safer.

Igor



Re: [SUSPECTED SPAM]Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-11 Thread Igor Druzhinin
On 11/10/2020 10:43, Sander Eikelenboom wrote:
> On 11/10/2020 02:06, Igor Druzhinin wrote:
>> On 10/10/2020 18:51, Sander Eikelenboom wrote:
>>> Hi Igor/Jan,
>>>
>>> I tried to update my AMD machine to current xen-unstable, but
>>> unfortunately the HVM guests don't boot after that. The guest keeps
>>> using CPU-cycles but I don't get to a command prompt (or any output at
>>> all). PVH guests run fine.
>>>
>>> Bisection leads to commit:
>>>
>>> 8efa46516c5f4cf185c8df179812c185d3c27eb6
>>> hvmloader: indicate ACPI tables with "ACPI data" type in e820
>>>
>>> I tried xen-unstable with this commit reverted and with that everything
>>> works fine.
>>>
>>> I attached the xl-dmesg output.
>>
>> What guests are you using? 
> Not sure I understand what you ask for, but:
> dom0 PV
> guest HVM (qemu-xen)
> 
>> Could you get serial output from the guest?
> Not getting any, it seems to be stuck in very early boot.
> 
>> Is it AMD specific?
> Can't tell, this is the only machine I test xen-unstable on.
> It's a AMD phenom X6.
> Both dom0 and guest kernel are 5.9-rc8.
> 
> Tested with guest config:
> kernel  = '/boot/vmlinuz-xen-guest'
> ramdisk = '/boot/initrd.img-xen-guest'
> 
> cmdline = 'root=UUID=7cc4a90d-d6b0-4958-bb7d-50497aa29f18 ro
> nomodeset console=tty1 console=ttyS0 console=hvc0 earlyprintk=xen'
> 
> type='hvm'
> 
> device_model_version = 'qemu-xen'
> 
> cpus= "2-5"
> vcpus = 2
> 
> memory  = '512'
> 
> disk= [
>   'phy:/dev/xen_vms_ssd/media,xvda,w'
>   ]
> 
> name= 'guest'
> 
> vif = [ 'bridge=xen_bridge,ip=192.168.1.10,mac=00:16:3E:DC:0A:F1' ]
> 
> on_poweroff = 'destroy'
> on_reboot   = 'restart'
> on_crash= 'preserve'
> 
> vnc=0
> 
> 
>> If it's a Linux guest could you get a stacktrace from
>> the guest using xenctx?
> 
> It is, here are few subsequent runs:
> 
> ~# /usr/local/lib/xen/bin/xenctx -s
> /boot/System.map-5.9.0-rc8-20201010-doflr-mac80211debug+ -f -a -C 4
> vcpu0:
> cs:eip: ca80:0256

Ok, it's stuck in the linuxboot.bin option ROM. That's not something we test
in Citrix - we don't use fw_cfg. It could be something with caching (given
it's moving, but slowly) or a bug uncovered by memory map changes. I'll try
to get a repro on Monday.

It could be AMD specific if it's caching related, and that's why osstest
didn't pick it up.

Igor



Re: [SUSPECTED SPAM]Xen-unstable :can't boot HVM guests, bisected to commit: "hvmloader: indicate ACPI tables with "ACPI data" type in e820"

2020-10-10 Thread Igor Druzhinin
On 10/10/2020 18:51, Sander Eikelenboom wrote:
> Hi Igor/Jan,
> 
> I tried to update my AMD machine to current xen-unstable, but
> unfortunately the HVM guests don't boot after that. The guest keeps
> using CPU-cycles but I don't get to a command prompt (or any output at
> all). PVH guests run fine.
> 
> Bisection leads to commit:
> 
> 8efa46516c5f4cf185c8df179812c185d3c27eb6
> hvmloader: indicate ACPI tables with "ACPI data" type in e820
> 
> I tried xen-unstable with this commit reverted and with that everything
> works fine.
> 
> I attached the xl-dmesg output.

What guests are you using? Could you get serial output from the guest?
Is it AMD specific? If it's a Linux guest could you get a stacktrace from
the guest using xenctx?

We have tested the change on all modern guests in our Citrix lab and haven't
found any problem for several months. 

Igor



[PATCH v4] hvmloader: indicate ACPI tables with "ACPI data" type in e820

2020-09-08 Thread Igor Druzhinin
The guest kernel does need to know in some cases where the tables are located
to treat these regions properly. One example is the kexec process, where
the first kernel needs to pass ACPI region locations to the second
kernel which is now a requirement in Linux after 02a3e3cdb7f12 ("x86/boot:
Parse SRAT table and count immovable memory regions") in order for kexec
transition to actually work.

That commit introduced accesses to XSDT and SRAT while the second kernel
is still using the kexec transition tables. The transition tables do not have
e820 "reserved" regions mapped where those tables are currently located
in a Xen guest. Instead, "ACPI data" regions are mapped with the transition
tables, a behavior introduced by commit 6bbeb276b7 ("x86/kexec:
Add the EFI system tables and ACPI tables to the ident map").

Reserve 1MB (out of 16MB currently available) right after ACPI info page for
ACPI tables exclusively but populate this region on demand and only indicate
populated memory as "ACPI data" since according to ACPI spec that memory is
reclaimable by the guest if necessary. That is close to how we treat
the same ACPI data in PVH guests. 1MB should be enough for now but could be
later extended if required.

Signed-off-by: Igor Druzhinin 
---
Changes in v4:
- gated new region creation on acpi_enabled
- added a comment to explain reserved region start point

Changes in v3:
- switched from NVS to regular "ACPI data" type by separating ACPI allocations
  into their own region
- gave more information on particular kexec usecase that requires this change

Changes in v2:
- gave more information on NVS type selection and potential alternatives
  in the description
- minor type fixes suggested
---
 tools/firmware/hvmloader/config.h|  6 +-
 tools/firmware/hvmloader/e820.c  | 26 ++
 tools/firmware/hvmloader/hvmloader.c |  3 ++-
 tools/firmware/hvmloader/pci.c   |  1 -
 tools/firmware/hvmloader/util.c  | 29 -
 tools/firmware/hvmloader/util.h  |  2 ++
 6 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/tools/firmware/hvmloader/config.h 
b/tools/firmware/hvmloader/config.h
index d9b4713..844120b 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -2,6 +2,7 @@
 #define __HVMLOADER_CONFIG_H__
 
 #include 
+#include 
 
 enum virtual_vga { VGA_none, VGA_std, VGA_cirrus, VGA_pt };
 extern enum virtual_vga virtual_vga;
@@ -62,6 +63,8 @@ extern uint8_t ioapic_version;
 extern unsigned long pci_mem_start, pci_mem_end;
 extern uint64_t pci_hi_mem_start, pci_hi_mem_end;
 
+extern bool acpi_enabled;
+
 /* Memory map. */
 #define SCRATCH_PHYSICAL_ADDRESS  0x0001
 #define HYPERCALL_PHYSICAL_ADDRESS0x0008
@@ -71,7 +74,8 @@ extern uint64_t pci_hi_mem_start, pci_hi_mem_end;
 #define RESERVED_MEMBASE  0xFC00
 /* NB. ACPI_INFO_PHYSICAL_ADDRESS *MUST* match definition in acpi/dsdt.asl! */
 #define ACPI_INFO_PHYSICAL_ADDRESS0xFC00
-#define RESERVED_MEMORY_DYNAMIC_START 0xFC001000
+#define ACPI_MEMORY_DYNAMIC_START 0xFC001000
+#define RESERVED_MEMORY_DYNAMIC_START 0xFC10
 #define RESERVED_MEMORY_DYNAMIC_END   0xFE00
 /*
  * GUEST_RESERVED: Physical address space reserved for guest use.
diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..38bcf18 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,9 @@ int build_e820_table(struct e820entry *e820,
 {
 unsigned int nr = 0, i, j;
 uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+unsigned long acpi_mem_end = acpi_enabled ?
+ACPI_MEMORY_DYNAMIC_START + (acpi_pages_allocated() << PAGE_SHIFT) :
+RESERVED_MEMBASE;
 
 if ( !lowmem_reserved_base )
 lowmem_reserved_base = 0xA;
@@ -199,8 +202,23 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
+ * Mark populated reserved memory that contains ACPI tables as ACPI data.
+ * That should help the guest to treat it correctly later: e.g. pass to
+ * the next kernel on kexec or reclaim if necessary.
+ */
+
+if ( acpi_enabled )
+{
+e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].size = acpi_mem_end - RESERVED_MEMBASE;
+e820[nr].type = E820_ACPI;
+nr++;
+}
+
+/*
  * Explicitly reserve space for special pages.
- * This space starts at RESERVED_MEMBASE an extends to cover various
+ * This space starts right after ACPI region (to avoid creating a hole that
+ * might be accidentally occupied by MMIO) and extends to cover various
  * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
  *
  * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +228,8 @@ int build_e820_table(struct e820entry *e820,
 {
 uint32_

Re: [PATCH v3] hvmloader: indicate ACPI tables with "ACPI data" type in e820

2020-09-08 Thread Igor Druzhinin
On 08/09/2020 10:15, Jan Beulich wrote:
> On 08.09.2020 01:42, Igor Druzhinin wrote:
>> Changes in v3:
>> - switched from NVS to regular "ACPI data" type by separating ACPI 
>> allocations
>>   into their own region
>> - gave more information on particular kexec usecase that requires this change
> 
> Thanks a lot for doing both of these.
> 
>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  unsigned int nr = 0, i, j;
>>  uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
>> +unsigned long acpi_mem_end =
>> +ACPI_MEMORY_DYNAMIC_START + (acpi_pages_allocated() << PAGE_SHIFT);
>>  
>>  if ( !lowmem_reserved_base )
>>  lowmem_reserved_base = 0xA;
>> @@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
>>  nr++;
>>  
>>  /*
>> + * Mark populated reserved memory that contains ACPI tables as ACPI 
>> data.
>> + * That should help the guest to treat it correctly later: e.g. pass to
>> + * the next kernel on kexec or reclaim if necessary.
>> + */
>> +
>> +e820[nr].addr = RESERVED_MEMBASE;
>> +e820[nr].size = acpi_mem_end - RESERVED_MEMBASE;
>> +e820[nr].type = E820_ACPI;
>> +nr++;
> 
> This region may be empty (afaict), when !acpi_enabled. Imo we'd better
> avoid adding empty ranges.

Thanks, will gate creation of this region on acpi_enabled.

>> @@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
>>  
>> -e820[nr].addr = RESERVED_MEMBASE;
>> -e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
>> +e820[nr].addr = acpi_mem_end;
>> +e820[nr].size = igd_opregion_base - acpi_mem_end;
>>  e820[nr].type = E820_RESERVED;
>>  nr++;
>>  
>> @@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
>>  }
>>  else
>>  {
>> -e820[nr].addr = RESERVED_MEMBASE;
>> +e820[nr].addr = acpi_mem_end;
>>  e820[nr].size = (uint32_t)-e820[nr].addr;
>>  e820[nr].type = E820_RESERVED;
>>  nr++;
> 
> In both cases - why not RESERVED_MEMORY_DYNAMIC_START? I.e. why
> mark reserved space that isn't in use for anything?

I think it's better to reserve space that a) isn't supposed to be in use for
anything - we don't really want some MMIO being accidentally mapped there and
confusing whatever in our fragile model, and b) wasn't a hole before, so it'd
be dangerous to make it one here. Overall, I think it's better to keep this
space as reserved as possible, as before.

Igor



[PATCH v3] hvmloader: indicate ACPI tables with "ACPI data" type in e820

2020-09-07 Thread Igor Druzhinin
In some cases the guest kernel needs to know where the ACPI tables are located
in order to treat those regions properly. One example is the kexec process,
where the first kernel needs to pass the ACPI region locations to the second
kernel; this is now a requirement in Linux after 02a3e3cdb7f12 ("x86/boot:
Parse SRAT table and count immovable memory regions") in order for the kexec
transition to actually work.

That commit introduced accesses to the XSDT and SRAT while the second kernel
is still running on the kexec transition tables. In a Xen guest, the transition
tables do not map the e820 "reserved" regions where those tables currently
reside. "ACPI data" regions, by contrast, are mapped by the transition tables;
that was introduced by commit 6bbeb276b7 ("x86/kexec: Add the EFI system
tables and ACPI tables to the ident map").

Reserve 1MB (out of the 16MB currently available) right after the ACPI info
page exclusively for ACPI tables, but populate this region on demand and
indicate only the populated part as "ACPI data", since according to the ACPI
spec that memory is reclaimable by the guest if necessary. That is close to
how we treat the same ACPI data in PVH guests. 1MB should be enough for now
but could be extended later if required.

Signed-off-by: Igor Druzhinin 
---
Changes in v3:
- switched from NVS to regular "ACPI data" type by separating ACPI allocations
  into their own region
- gave more information on particular kexec usecase that requires this change

Changes in v2:
- gave more information on NVS type selection and potential alternatives
  in the description
- minor type fixes suggested
---
 tools/firmware/hvmloader/config.h |  3 ++-
 tools/firmware/hvmloader/e820.c   | 21 +
 tools/firmware/hvmloader/util.c   | 29 -
 tools/firmware/hvmloader/util.h   |  2 ++
 4 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/tools/firmware/hvmloader/config.h 
b/tools/firmware/hvmloader/config.h
index d9b4713..ff19b64 100644
--- a/tools/firmware/hvmloader/config.h
+++ b/tools/firmware/hvmloader/config.h
@@ -71,7 +71,8 @@ extern uint64_t pci_hi_mem_start, pci_hi_mem_end;
 #define RESERVED_MEMBASE  0xFC000000
 /* NB. ACPI_INFO_PHYSICAL_ADDRESS *MUST* match definition in acpi/dsdt.asl! */
 #define ACPI_INFO_PHYSICAL_ADDRESS    0xFC000000
-#define RESERVED_MEMORY_DYNAMIC_START 0xFC001000
+#define ACPI_MEMORY_DYNAMIC_START 0xFC001000
+#define RESERVED_MEMORY_DYNAMIC_START 0xFC100000
 #define RESERVED_MEMORY_DYNAMIC_END   0xFE000000
 /*
  * GUEST_RESERVED: Physical address space reserved for guest use.
diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..bc83808 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
 {
 unsigned int nr = 0, i, j;
 uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+unsigned long acpi_mem_end =
+ACPI_MEMORY_DYNAMIC_START + (acpi_pages_allocated() << PAGE_SHIFT);
 
 if ( !lowmem_reserved_base )
lowmem_reserved_base = 0xA0000;
@@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
+ * Mark populated reserved memory that contains ACPI tables as ACPI data.
+ * That should help the guest to treat it correctly later: e.g. pass to
+ * the next kernel on kexec or reclaim if necessary.
+ */
+
+e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].size = acpi_mem_end - RESERVED_MEMBASE;
+e820[nr].type = E820_ACPI;
+nr++;
+
+/*
  * Explicitly reserve space for special pages.
- * This space starts at RESERVED_MEMBASE an extends to cover various
+ * This space starts after ACPI region and extends to cover various
  * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
  *
  * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
 {
 uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
 
-e820[nr].addr = RESERVED_MEMBASE;
-e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
+e820[nr].addr = acpi_mem_end;
+e820[nr].size = igd_opregion_base - acpi_mem_end;
 e820[nr].type = E820_RESERVED;
 nr++;
 
@@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
 }
 else
 {
-e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].addr = acpi_mem_end;
 e820[nr].size = (uint32_t)-e820[nr].addr;
 e820[nr].type = E820_RESERVED;
 nr++;
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 0c3f2d2..7da144b 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -871,10 +871,37 @@ static unsigned long acpi_v2p(struct acpi_ctxt *ctxt

Re: [PATCH v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

2020-09-04 Thread Igor Druzhinin
On 04/09/2020 15:40, Jan Beulich wrote:
> On 04.09.2020 13:49, Igor Druzhinin wrote:
>> On 04/09/2020 09:33, Jan Beulich wrote:
>>> On 01.09.2020 04:50, Igor Druzhinin wrote:
>>>> Guest kernel does need to know in some cases where the tables are located
>>>> to treat these regions properly. One example is kexec process where
>>>> the first kernel needs to pass firmware region locations to the second
>>>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse 
>>>> SRAT
>>>> table and count immovable memory regions").
>>>
>>> I'm still struggling with the connection here: Reserved regions
>>> surely are "immovable" too, aren't they? 
>>
>> "Immovable" regions here are RAM that doesn't go away by hot-unplug. That 
>> change
>> was necessary in Linux to avoid image randomized placement to these regions.
>>
>>> Where's the connection to
>>> the E820 map in the first place - the change cited above is entirely
>>> about SRAT? And I can't imagine kexec getting away with passing on
>>> ACPI NVS regions, but not reserved ones.
>>>
>>
>> They got away with it for as long as kexec exists I think. The point was that
>> those reserved regions were not accessed during early boot as long as kexec 
>> kernel stays
>> at transition tables. Now ACPI portion of it is accessed which highlighted 
>> our
>> imprecise reporting of memory layout to the guest - which I think should be 
>> fixed
>> either way.
> 
> Is this to mean they map ACPI regions into the transition page tables,
> but not reserved regions?

Yes.

> If so, perhaps that's what the description
> wants to say (and then possibly with a reference to the commit
> introducing this into Linux, instead of the seemingly unrelated SRAT
> one)?

The referenced commit is not unrelated - it's the commit that introduced the
access causing the kexec transition to fail. But I can add another one pointing
to the mapping of ACPI tables that was supposed to fix the failure caused by
the first one.

Igor
 



Re: [PATCH v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

2020-09-04 Thread Igor Druzhinin
On 04/09/2020 09:33, Jan Beulich wrote:
> On 01.09.2020 04:50, Igor Druzhinin wrote:
>> Guest kernel does need to know in some cases where the tables are located
>> to treat these regions properly. One example is kexec process where
>> the first kernel needs to pass firmware region locations to the second
>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>> table and count immovable memory regions").
> 
> I'm still struggling with the connection here: Reserved regions
> surely are "immovable" too, aren't they? 

"Immovable" regions here are RAM that doesn't go away on hot-unplug. That change
was necessary in Linux to avoid randomized placement of the kernel image into
these regions.

> Where's the connection to
> the E820 map in the first place - the change cited above is entirely
> about SRAT? And I can't imagine kexec getting away with passing on
> ACPI NVS regions, but not reserved ones.
> 

They got away with it for as long as kexec has existed, I think. The point is
that those reserved regions were not accessed during early boot as long as the
kexec kernel stayed on the transition tables. Now the ACPI portion of them is
accessed, which highlighted our imprecise reporting of the memory layout to the
guest - which I think should be fixed either way.

I'm not going to argue whether reserved regions should be mapped into the
transition tables or not - I don't think it's important in the context of this
patch. There have already been several kernel releases without those mappings,
and those should also be able to invoke kdump.

Igor



Re: [PATCH v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

2020-09-01 Thread Igor Druzhinin
On 01/09/2020 10:28, Roger Pau Monné wrote:
> On Tue, Sep 01, 2020 at 03:50:34AM +0100, Igor Druzhinin wrote:
>> Guest kernel does need to know in some cases where the tables are located
>> to treat these regions properly. One example is kexec process where
>> the first kernel needs to pass firmware region locations to the second
>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>> table and count immovable memory regions").
> 
> Can you add a note that this is a Linux commit? Although there's a
> reference to kexec above, I don't think it's entirely clear that it's
> a Linux commit.

Ok.

>>
>> The memory that hvmloader allocates in the reserved region mostly contains
>> these useful tables and could be safely indicated as ACPI without the need
>> to designate a sub-region specially for that. Making it non-reclaimable
>> (ACPI NVS) in contrast with ACPI reclaim (ACPI table) memory would avoid
>> potential reuse of this memory by the guest taking into account this region
>> may contain runtime structures like VM86 TSS, etc. If necessary, those
>> can be moved away later and the region marked as reclaimable.
> 
> By looking at domain_construct_memmap from libxl I think the same
> problem is not present on that case as regions are properly marked as
> RESERVED or ACPI as required?

Uhh, I simply forgot that PVH also constructs an e820 - it seems it's doing it
properly though. I want to make the e820 map as close as possible between PVH
and HVM - let me see if I can improve my HVM version.

>>
>> Signed-off-by: Igor Druzhinin 
> 
> Just one question below and one nit.
> 
>> ---
>> Changes in v2.1:
>> - fixed previously missed uint32_t occurrence
>>
>> Changes in v2:
>> - gave more information on NVS type selection and potential alternatives
>>   in the description
>> - minor type fixes suggested
>>
>> ---
>>  tools/firmware/hvmloader/e820.c | 21 +
>>  tools/firmware/hvmloader/util.c |  6 ++
>>  tools/firmware/hvmloader/util.h |  3 +++
>>  3 files changed, 26 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/firmware/hvmloader/e820.c 
>> b/tools/firmware/hvmloader/e820.c
>> index 4d1c955..0ad2f05 100644
>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  unsigned int nr = 0, i, j;
>>  uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
>> +unsigned long firmware_mem_end =
>> +RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << 
>> PAGE_SHIFT);
>>  
>>  if ( !lowmem_reserved_base )
>>  lowmem_reserved_base = 0xA0000;
>> @@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
>>  nr++;
>>  
>>  /*
>> + * Mark populated reserved memory that contains ACPI and other tables as
>> + * ACPI NVS (non-reclaimable) space - that should help the guest to 
>> treat
>> + * it correctly later (e.g. pass to the next kernel on kexec).
>> + */
>> +
>> +e820[nr].addr = RESERVED_MEMBASE;
>> +e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
>> +e820[nr].type = E820_NVS;
>> +nr++;
>> +
>> +/*
>>   * Explicitly reserve space for special pages.
>> - * This space starts at RESERVED_MEMBASE an extends to cover various
>> + * This space starts after ACPI region and extends to cover various
>>   * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA 
>> framebuffer).
>>   *
>>   * If igd_opregion_pgbase we need to split the RESERVED region in two.
>> @@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
>>  
>> -e820[nr].addr = RESERVED_MEMBASE;
>> -e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
>> +e820[nr].addr = firmware_mem_end;
>> +e820[nr].size = igd_opregion_base - firmware_mem_end;
> 
> Is there anything between firmware_mem_end and igd_opregion_base now?
> You already account for RESERVED_MEMBASE to firmware_mem_end.

It's possible that there is something in between. The IGD opregion is allocated
dynamically from above and occupies a couple of pages - there is a gap between
the top of the populated region and it where other structures could be located.
I don't want to tie the e820 to the current allocation order.

>>  e820[nr].type = E820_RESERVED;
>>  nr++

[PATCH v2.1] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

2020-09-01 Thread Igor Druzhinin
In some cases the guest kernel needs to know where the tables are located in
order to treat those regions properly. One example is the kexec process, where
the first kernel needs to pass the firmware region locations to the second
kernel; this is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
table and count immovable memory regions").

The memory that hvmloader allocates in the reserved region mostly contains
these useful tables and could safely be indicated as ACPI without the need
to designate a sub-region specially for that. Making it non-reclaimable
(ACPI NVS), in contrast with ACPI reclaim ("ACPI table") memory, would avoid
potential reuse of this memory by the guest, taking into account that this
region may contain runtime structures like the VM86 TSS, etc. If necessary,
those can be moved away later and the region marked as reclaimable.

Signed-off-by: Igor Druzhinin 
---
Changes in v2.1:
- fixed previously missed uint32_t occurrence

Changes in v2:
- gave more information on NVS type selection and potential alternatives
  in the description
- minor type fixes suggested

---
 tools/firmware/hvmloader/e820.c | 21 +
 tools/firmware/hvmloader/util.c |  6 ++
 tools/firmware/hvmloader/util.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..0ad2f05 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
 {
 unsigned int nr = 0, i, j;
 uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+unsigned long firmware_mem_end =
+RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
 
 if ( !lowmem_reserved_base )
lowmem_reserved_base = 0xA0000;
@@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
+ * Mark populated reserved memory that contains ACPI and other tables as
+ * ACPI NVS (non-reclaimable) space - that should help the guest to treat
+ * it correctly later (e.g. pass to the next kernel on kexec).
+ */
+
+e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
+e820[nr].type = E820_NVS;
+nr++;
+
+/*
  * Explicitly reserve space for special pages.
- * This space starts at RESERVED_MEMBASE an extends to cover various
+ * This space starts after ACPI region and extends to cover various
  * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
  *
  * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
 {
 uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
 
-e820[nr].addr = RESERVED_MEMBASE;
-e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
+e820[nr].size = igd_opregion_base - firmware_mem_end;
 e820[nr].type = E820_RESERVED;
 nr++;
 
@@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
 }
 else
 {
-e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
 e820[nr].size = (uint32_t)-e820[nr].addr;
 e820[nr].type = E820_RESERVED;
 nr++;
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 0c3f2d2..59cde4a 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -444,6 +444,12 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
 static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
 static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
 
+unsigned long mem_mfns_allocated(void)
+{
+return (alloc_up >> PAGE_SHIFT) -
+((RESERVED_MEMORY_DYNAMIC_START - 1) >> PAGE_SHIFT);
+}
+
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns)
 {
 alloc_down -= nr_mfns << PAGE_SHIFT;
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 7bca641..acd673a 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -200,6 +200,9 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns);
 /* Allocate a memory hole below 4GB. */
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns);
 
+/* Return number of pages allocated */
+unsigned long mem_mfns_allocated(void);
+
 /* Allocate memory in a reserved region below 4GB. */
 void *mem_alloc(uint32_t size, uint32_t align);
 #define virt_to_phys(v) ((unsigned long)(v))
-- 
2.7.4




[PATCH v2] hvmloader: indicate dynamically allocated memory as ACPI NVS in e820

2020-09-01 Thread Igor Druzhinin
In some cases the guest kernel needs to know where the tables are located in
order to treat those regions properly. One example is the kexec process, where
the first kernel needs to pass the firmware region locations to the second
kernel; this is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
table and count immovable memory regions").

The memory that hvmloader allocates in the reserved region mostly contains
these useful tables and could safely be indicated as ACPI without the need
to designate a sub-region specially for that. Making it non-reclaimable
(ACPI NVS), in contrast with regular ACPI ("ACPI table") memory, would avoid
potential reuse of this memory by the guest, taking into account that this
region may contain runtime structures like the VM86 TSS, etc. If necessary,
those can be moved away later and the region marked as reclaimable.

Signed-off-by: Igor Druzhinin 
---
 tools/firmware/hvmloader/e820.c | 21 +
 tools/firmware/hvmloader/util.c |  6 ++
 tools/firmware/hvmloader/util.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..0ad2f05 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
 {
 unsigned int nr = 0, i, j;
 uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+unsigned long firmware_mem_end =
+RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
 
 if ( !lowmem_reserved_base )
lowmem_reserved_base = 0xA0000;
@@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
+ * Mark populated reserved memory that contains ACPI and other tables as
+ * ACPI NVS (non-reclaimable) space - that should help the guest to treat
+ * it correctly later (e.g. pass to the next kernel on kexec).
+ */
+
+e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
+e820[nr].type = E820_NVS;
+nr++;
+
+/*
  * Explicitly reserve space for special pages.
- * This space starts at RESERVED_MEMBASE an extends to cover various
+ * This space starts after ACPI region and extends to cover various
  * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
  *
  * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
 {
 uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
 
-e820[nr].addr = RESERVED_MEMBASE;
-e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
+e820[nr].size = igd_opregion_base - firmware_mem_end;
 e820[nr].type = E820_RESERVED;
 nr++;
 
@@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
 }
 else
 {
-e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
 e820[nr].size = (uint32_t)-e820[nr].addr;
 e820[nr].type = E820_RESERVED;
 nr++;
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 0c3f2d2..af68862 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -444,6 +444,12 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
 static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
 static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
 
+uint32_t mem_mfns_allocated(void)
+{
+return (alloc_up >> PAGE_SHIFT) -
+((RESERVED_MEMORY_DYNAMIC_START - 1) >> PAGE_SHIFT);
+}
+
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns)
 {
 alloc_down -= nr_mfns << PAGE_SHIFT;
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 7bca641..98d5e02 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -200,6 +200,9 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns);
 /* Allocate a memory hole below 4GB. */
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns);
 
+/* Return number of pages allocated */
+uint32_t mem_mfns_allocated(void);
+
 /* Allocate memory in a reserved region below 4GB. */
 void *mem_alloc(uint32_t size, uint32_t align);
 #define virt_to_phys(v) ((unsigned long)(v))
-- 
2.7.4




Re: [PATCH] hvmloader: indicate firmware tables as ACPI NVS in e820

2020-08-28 Thread Igor Druzhinin
On 28/08/2020 08:51, Jan Beulich wrote:
> On 28.08.2020 02:13, Igor Druzhinin wrote:
>> Guest kernel does need to know in some cases where the tables are located
>> to treat these regions properly. One example is kexec process where
>> the first kernel needs to pass firmware region locations to the second
>> kernel which is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
>> table and count immovable memory regions").
>>
>> The memory that hvmloader allocates in the reserved region mostly contains
>> these useful tables and could be safely indicated as ACPI without the need
>> to designate a sub-region specially for that. Making it non-reclaimable
>> (ACPI NVS) would avoid potential reuse of this memory by the guest.
>> Switching from Reserved to ACPI NVS type for this memory would also mean
>> its content is preserved across S4 (which is true for any ACPI type according
>> to the spec).
> 
> Marking the range as ACPI is certainly fine, but why did you choose NVS?
> There's nothing non-volatile there afaict. And the part of the last
> sentence in parentheses suggests to me that the "normal" ACPI type is as
> suitable for the purpose you're after.

The problem with the normal ACPI type is that according to the spec it's
reclaimable by the OS:
"This range is available RAM usable by the OS after it reads the ACPI tables."
while the NVS type is specifically designated as non-reclaimable. The spec also
denotes that both the normal and NVS types should be preserved across S4 - that
sounds ambiguous to me. But it might mean that non-NVS preservation is not an
OS responsibility, as that memory is assumed to be static.

Our region also contains the VM86 TSS, which we certainly need at runtime
(although that could be allocated from the reserved region above if desirable).

To stay on the safe side and not rely on perceived sensible OS behavior, the
NVS type was chosen, which the OS shouldn't touch under any circumstances.

In fact, while writing this response I found a document which confirms some of
my suspicions:
http://www.baldwin.cx/~phoenix/reference/docs/acpi_impguide.pdf

>> --- a/tools/firmware/hvmloader/e820.c
>> +++ b/tools/firmware/hvmloader/e820.c
>> @@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  unsigned int nr = 0, i, j;
>>  uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
>> +uint32_t firmware_mem_end =
>> +RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << 
>> PAGE_SHIFT);
> 
> Here and elsewhere - please avoid uint32_t and friends when a plain
> C type (unsigned int or unsigned long in this case) will do.

Ok, should I also take a chance and convert some of the surroundings?

>> @@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
>>  {
>>  uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
>>  
>> -e820[nr].addr = RESERVED_MEMBASE;
>> -e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
>> +e820[nr].addr = firmware_mem_end;
>> +e820[nr].size = (uint32_t) igd_opregion_base - firmware_mem_end;
> 
> Any chance I could talk you into also dropping the pointless cast
> at this occasion?

Sure.

Igor



[PATCH] hvmloader: indicate firmware tables as ACPI NVS in e820

2020-08-27 Thread Igor Druzhinin
In some cases the guest kernel needs to know where the tables are located in
order to treat those regions properly. One example is the kexec process, where
the first kernel needs to pass the firmware region locations to the second
kernel; this is now a requirement after 02a3e3cdb7f12 ("x86/boot: Parse SRAT
table and count immovable memory regions").

The memory that hvmloader allocates in the reserved region mostly contains
these useful tables and could safely be indicated as ACPI without the need
to designate a sub-region specially for that. Making it non-reclaimable
(ACPI NVS) would avoid potential reuse of this memory by the guest.
Switching from Reserved to ACPI NVS type for this memory would also mean
its content is preserved across S4 (which is true for any ACPI type according
to the spec).

Signed-off-by: Igor Druzhinin 
---
 tools/firmware/hvmloader/e820.c | 21 +
 tools/firmware/hvmloader/util.c |  6 ++
 tools/firmware/hvmloader/util.h |  3 +++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/firmware/hvmloader/e820.c b/tools/firmware/hvmloader/e820.c
index 4d1c955..ef60007 100644
--- a/tools/firmware/hvmloader/e820.c
+++ b/tools/firmware/hvmloader/e820.c
@@ -155,6 +155,8 @@ int build_e820_table(struct e820entry *e820,
 {
 unsigned int nr = 0, i, j;
 uint32_t low_mem_end = hvm_info->low_mem_pgend << PAGE_SHIFT;
+uint32_t firmware_mem_end =
+RESERVED_MEMORY_DYNAMIC_START + (mem_mfns_allocated() << PAGE_SHIFT);
 
 if ( !lowmem_reserved_base )
lowmem_reserved_base = 0xA0000;
@@ -199,8 +201,19 @@ int build_e820_table(struct e820entry *e820,
 nr++;
 
 /*
+ * Mark populated reserved memory that contains ACPI and other tables as
+ * ACPI NVS (non-reclaimable) space - that should help the guest to treat
+ * it correctly later (e.g. pass to the next kernel on kexec).
+ */
+
+e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].size = firmware_mem_end - RESERVED_MEMBASE;
+e820[nr].type = E820_NVS;
+nr++;
+
+/*
  * Explicitly reserve space for special pages.
- * This space starts at RESERVED_MEMBASE an extends to cover various
+ * This space starts after ACPI region and extends to cover various
  * fixed hardware mappings (e.g., LAPIC, IOAPIC, default SVGA framebuffer).
  *
  * If igd_opregion_pgbase we need to split the RESERVED region in two.
@@ -210,8 +223,8 @@ int build_e820_table(struct e820entry *e820,
 {
 uint32_t igd_opregion_base = igd_opregion_pgbase << PAGE_SHIFT;
 
-e820[nr].addr = RESERVED_MEMBASE;
-e820[nr].size = (uint32_t) igd_opregion_base - RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
+e820[nr].size = (uint32_t) igd_opregion_base - firmware_mem_end;
 e820[nr].type = E820_RESERVED;
 nr++;
 
@@ -227,7 +240,7 @@ int build_e820_table(struct e820entry *e820,
 }
 else
 {
-e820[nr].addr = RESERVED_MEMBASE;
+e820[nr].addr = firmware_mem_end;
 e820[nr].size = (uint32_t)-e820[nr].addr;
 e820[nr].type = E820_RESERVED;
 nr++;
diff --git a/tools/firmware/hvmloader/util.c b/tools/firmware/hvmloader/util.c
index 0c3f2d2..af68862 100644
--- a/tools/firmware/hvmloader/util.c
+++ b/tools/firmware/hvmloader/util.c
@@ -444,6 +444,12 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns)
 static uint32_t alloc_up = RESERVED_MEMORY_DYNAMIC_START - 1;
 static uint32_t alloc_down = RESERVED_MEMORY_DYNAMIC_END;
 
+uint32_t mem_mfns_allocated(void)
+{
+return (alloc_up >> PAGE_SHIFT) -
+((RESERVED_MEMORY_DYNAMIC_START - 1) >> PAGE_SHIFT);
+}
+
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns)
 {
 alloc_down -= nr_mfns << PAGE_SHIFT;
diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 7bca641..98d5e02 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -200,6 +200,9 @@ void mem_hole_populate_ram(xen_pfn_t mfn, uint32_t nr_mfns);
 /* Allocate a memory hole below 4GB. */
 xen_pfn_t mem_hole_alloc(uint32_t nr_mfns);
 
+/* Return number of pages allocated */
+uint32_t mem_mfns_allocated(void);
+
 /* Allocate memory in a reserved region below 4GB. */
 void *mem_alloc(uint32_t size, uint32_t align);
 #define virt_to_phys(v) ((unsigned long)(v))
-- 
2.7.4




Re: [PATCH] OvmfPkg: End timer interrupt later to avoid stack overflow under load

2020-06-16 Thread Igor Druzhinin
On 16/06/2020 19:42, Laszlo Ersek wrote:
> If I understand correctly, TimerInterruptHandler()
> [OvmfPkg/8254TimerDxe/Timer.c] currently does the following:
> 
> - RaiseTPL (TPL_HIGH_LEVEL) --> mask interrupts from being delivered
> 
> - mLegacy8259->EndOfInterrupt() --> permit the PIC to generate further
> interrupts (= make them pending)
> 
> - RestoreTPL() --> unmask interrupts (allow delivery)
> 
> RestoreTPL() is always expected to invoke handlers (on its own stack)
> that have just been unmasked, so that behavior is not unexpected, in my
> opinion.

Yes, this is where I'd like to have confirmation - opening a window for an
uncontrollable number of nested interrupts with a small stack looks dangerous.

> What seems unexpected is the queueing of a huge number of timer
> interrupts. I would think a timer interrupt is either pending or not
> pending (i.e. if it's already pending, then the next generated interrupt
> is coalesced, not queued). While there would still be a window between
> the EOI and the unmasking, I don't think it would normally allow for a
> *huge* number of queued interrupts (and consequently a stack overflow).

It's not the window between EOI and unmasking but the very fact that the vCPU
is descheduled for a considerable amount of time that causes a backlog of
timer interrupts to build up. This is Xen's default behavior and is
configurable (there are several timer modes, including the coalescing you
mention). That is done for compatibility with some guests that base their
time accounting on the number of periodic interrupts they receive.

> So I basically see the root of the problem in the interrupts being
> queued rather than coalesced. I'm pretty unfamiliar with this x86 area
> (= the 8259 PIC in general), but the following wiki article seems to
> agree with my suspicion:
> 
> https://wiki.osdev.org/8259_PIC#How_does_the_8259_PIC_chip_work.3F
> 
> [...] and whether there's an interrupt already pending. If the
> channel is unmasked and there's no interrupt pending, the PIC will
> raise the interrupt line [...]
> 
> Can we say that the interrupt queueing (as opposed to coalescing) is a
> Xen issue?

I can admit that the whole issue might be Xen-specific if that form of timer
mode is not used in QEMU-KVM. What mode is typical there? We might consider
switching Xen to a different mode if so, as I believe the guests that need it
have been out of support for many years.

> (Hmmm... maybe the hypervisor *has* to queue the timer interrupts,
> otherwise some of them would simply be lost, and the guest would lose
> track of time.)
> 
> Either way, I'm not sure what the best approach is. This driver was
> moved under OvmfPkg from PcAtChipsetPkg in commit 1a3ffdff82e6
> ("OvmfPkg: Copy 8254TimerDxe driver from PcAtChipsetPkg", 2019-04-11).
> HpetTimerDxe also lives under PcAtChipsetPkg.
> 
> So I think I'll have to rely on the expertise of Ray here (CC'd).

Also note that since the issue might be Xen specific we might want to
try to fix it in XenTimer only - I modified 8254Timer due to the
fact that Xen is still present in the general config (but that should soon
go away).

> Also, I recall a recent-ish QEMU commit that seems vaguely related
> (i.e., to timer interrupt coalescing -- see 7a3e29b12f5a, "mc146818rtc:
> fix timer interrupt reinjection again", 2019-11-19), so I'm CC'ing Paolo
> too.

Hmm, that looks more like an RTC-implementation-specific issue.

> Some more comments / questions below:
> 
>>
>> diff --git a/OvmfPkg/8254TimerDxe/Timer.c b/OvmfPkg/8254TimerDxe/Timer.c
>> index 67e22f5..fd1691b 100644
>> --- a/OvmfPkg/8254TimerDxe/Timer.c
>> +++ b/OvmfPkg/8254TimerDxe/Timer.c
>> @@ -79,8 +79,6 @@ TimerInterruptHandler (
>>  
>>OriginalTPL = gBS->RaiseTPL (TPL_HIGH_LEVEL);
>>  
>> -  mLegacy8259->EndOfInterrupt (mLegacy8259, Efi8259Irq0);
>> -
>>if (mTimerNotifyFunction != NULL) {
>>  //
>>  // @bug : This does not handle missed timer interrupts
>> @@ -89,6 +87,9 @@ TimerInterruptHandler (
>>}
>>  
>>gBS->RestoreTPL (OriginalTPL);
>> +
>> +  DisableInterrupts ();
>> +  mLegacy8259->EndOfInterrupt (mLegacy8259, Efi8259Irq0);
>>  }
> 
> So this briefly (temporarily) unmasks interrupt delivery (between
> RestoreTPL() and DisableInterrupts()) while the PIC is still blocked
> from generating more, and then unblocks the PIC.
> 
> It looks plausible for preventing the unbounded recursion per se, but
> why is it safe to leave the function with interrupts disabled? Before
> the patch, that didn't use to be the case.

Quickly looking through the code, it appears to me that the first thing the
caller does after the interrupt handler returns is clear the interrupt flag to
make sure interrupts are disabled. So I don't see any assumption that
interrupts should be enabled on exit. But I might not know about all of the
possible combinations here.

Igor



[PATCH for-4.14 v3] tools/xen-ucode: return correct exit code on failed microcode update

2020-06-16 Thread Igor Druzhinin
Otherwise it's difficult to know if the operation failed inside the automation.

While at it, also switch to returning 1 and 2 instead of errno to avoid
incompatibilities between errno and special exit code numbers.

Signed-off-by: Igor Druzhinin 
---
Changes in v3:
- conventionally return 1 and 2 instead of errno as exit code
---
 tools/misc/xen-ucode.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tools/misc/xen-ucode.c b/tools/misc/xen-ucode.c
index 0c257f4..ad32fac 100644
--- a/tools/misc/xen-ucode.c
+++ b/tools/misc/xen-ucode.c
@@ -25,7 +25,7 @@ int main(int argc, char *argv[])
 fprintf(stderr,
 "xen-ucode: Xen microcode updating tool\n"
 "Usage: %s \n", argv[0]);
-return 0;
+exit(2);
 }
 
 filename = argv[1];
@@ -34,14 +34,14 @@ int main(int argc, char *argv[])
 {
 fprintf(stderr, "Could not open %s. (err: %s)\n",
 filename, strerror(errno));
-return errno;
+exit(1);
 }
 
 if ( fstat(fd, &st) != 0 )
 {
 fprintf(stderr, "Could not get the size of %s. (err: %s)\n",
 filename, strerror(errno));
-return errno;
+exit(1);
 }
 
 len = st.st_size;
@@ -49,7 +49,7 @@ int main(int argc, char *argv[])
 if ( buf == MAP_FAILED )
 {
 fprintf(stderr, "mmap failed. (error: %s)\n", strerror(errno));
-return errno;
+exit(1);
 }
 
 xch = xc_interface_open(NULL, NULL, 0);
@@ -57,20 +57,23 @@ int main(int argc, char *argv[])
 {
 fprintf(stderr, "Error opening xc interface. (err: %s)\n",
 strerror(errno));
-return errno;
+exit(1);
 }
 
 ret = xc_microcode_update(xch, buf, len);
 if ( ret )
+{
 fprintf(stderr, "Failed to update microcode. (err: %s)\n",
 strerror(errno));
+exit(1);
+}
 
 xc_interface_close(xch);
 
 if ( munmap(buf, len) )
 {
 printf("Could not unmap: %d(%s)\n", errno, strerror(errno));
-return errno;
+exit(1);
 }
 close(fd);
 
-- 
2.7.4




Re: [PATCH for-4.14 v2] tools/xen-ucode: fix error code propagation of microcode load operation

2020-06-16 Thread Igor Druzhinin
On 16/06/2020 13:25, Jan Beulich wrote:
> [CAUTION - EXTERNAL EMAIL] DO NOT reply, click links, or open attachments 
> unless you have verified the sender and know the content is safe.
> 
> On 16.06.2020 13:42, Igor Druzhinin wrote:
>> @@ -62,8 +62,11 @@ int main(int argc, char *argv[])
>>  
>>  ret = xc_microcode_update(xch, buf, len);
>>  if ( ret )
>> +{
>>  fprintf(stderr, "Failed to update microcode. (err: %s)\n",
>>  strerror(errno));
>> +return errno;
> 
> I think you need to latch errno, as fprintf() may in principle run
> into another error.

Yes, I also noticed that but the whole file has this problem so I didn't
change it here specifically.

If fixing the whole file - I'd rather rewrite the error reporting completely:
return 1 on error, 0 on success, etc. From what I've read, returning errno
has many incompatibilities and might lead to surprising consequences.

I'll send v3 to clean this all up.

Igor



[PATCH for-4.14 v2] tools/xen-ucode: fix error code propagation of microcode load operation

2020-06-16 Thread Igor Druzhinin
Otherwise it's impossible to know the reason for a fault or blob rejection
inside the automation.

While at it, also change the return code of an incorrect invocation to EINVAL.

Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- simply call "return errno". On Linux that seems to be safe as values <= 255
  are correctly propagated; on non-Linux I couldn't find error codes > 127.
- return a positive value on incorrect invocation
---
 tools/misc/xen-ucode.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/misc/xen-ucode.c b/tools/misc/xen-ucode.c
index 0c257f4..409cace 100644
--- a/tools/misc/xen-ucode.c
+++ b/tools/misc/xen-ucode.c
@@ -25,7 +25,7 @@ int main(int argc, char *argv[])
 fprintf(stderr,
 "xen-ucode: Xen microcode updating tool\n"
 "Usage: %s \n", argv[0]);
-return 0;
+return EINVAL;
 }
 
 filename = argv[1];
@@ -62,8 +62,11 @@ int main(int argc, char *argv[])
 
 ret = xc_microcode_update(xch, buf, len);
 if ( ret )
+{
 fprintf(stderr, "Failed to update microcode. (err: %s)\n",
 strerror(errno));
+return errno;
+}
 
 xc_interface_close(xch);
 
-- 
2.7.4




Re: [XEN PATCH for-4.14] tools/xen-ucode: fix error code propagation of microcode load operation

2020-06-12 Thread Igor Druzhinin
On 12/06/2020 17:53, Ian Jackson wrote:
> Igor Druzhinin writes ("[PATCH] tools/xen-ucode: fix error code propagation 
> of microcode load operation"):
>> Otherwise it's impossible to know the reason for a fault or blob rejection
>> inside the automation.
> ...
>>  fprintf(stderr, "Failed to update microcode. (err: %s)\n",
>>  strerror(errno));
> 
> This part is fine.
> 
>> +ret = errno;
>>  xc_interface_close(xch);
> ...
>>  }
>>  close(fd);
>>  
>> -return 0;
>> +return ret;
> 
> Unfortunately I don't think this is right.  errno might not fit into a
> return value.

The errno codes that Xen returns are all lower than 127, so they fit perfectly fine.

> Returning nonzero on microcode loading error would
> definitely be right, but ...
> 
> ... oh I have just read the rest of this file.
> 
> I think what is missing here is simply `return errno' (and the braces)
> There is no need to call xc_interface_close, or munmap, if we are
> about to exit.

Probably, but that is identical to what was suggested.
Cleaning up resources is just a nice thing to do, although unnecessary.
I can change it to just "return errno" if that's what you'd prefer.

> I think fixing the lost error return is 4.14 material, so I have
> added that to the subject line.
> 
> Paul, would you Release-ack a patch that replaced every `return errno'
> with (say) exit(12) ? 

That would again conceal the real return code from the automation.

Igor



[PATCH] tools/xen-ucode: fix error code propagation of microcode load operation

2020-06-12 Thread Igor Druzhinin
Otherwise it's impossible to know the reason for a fault or blob rejection
inside the automation.

Signed-off-by: Igor Druzhinin 
---
 tools/misc/xen-ucode.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/misc/xen-ucode.c b/tools/misc/xen-ucode.c
index 0c257f4..2c907e1 100644
--- a/tools/misc/xen-ucode.c
+++ b/tools/misc/xen-ucode.c
@@ -62,8 +62,11 @@ int main(int argc, char *argv[])
 
 ret = xc_microcode_update(xch, buf, len);
 if ( ret )
+{
+ret = errno;
 fprintf(stderr, "Failed to update microcode. (err: %s)\n",
 strerror(errno));
+}
 
 xc_interface_close(xch);
 
@@ -74,5 +77,5 @@ int main(int argc, char *argv[])
 }
 close(fd);
 
-return 0;
+return ret;
 }
-- 
2.7.4




Re: [PATCH for-4.14 v3] x86/svm: do not try to handle recalc NPT faults immediately

2020-06-04 Thread Igor Druzhinin
On 04/06/2020 11:50, Paul Durrant wrote:
>> -Original Message-
>> From: Jan Beulich 
>> Sent: 04 June 2020 11:34
>> To: p...@xen.org
>> Cc: 'Igor Druzhinin' ; 
>> xen-devel@lists.xenproject.org;
>> andrew.coop...@citrix.com; w...@xen.org; roger@citrix.com; 
>> george.dun...@citrix.com
>> Subject: Re: [PATCH for-4.14 v3] x86/svm: do not try to handle recalc NPT 
>> faults immediately
>>
>> On 04.06.2020 09:49, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Igor Druzhinin 
>>>> Sent: 03 June 2020 23:42
>>>> To: xen-devel@lists.xenproject.org
>>>> Cc: jbeul...@suse.com; andrew.coop...@citrix.com; w...@xen.org; 
>>>> roger@citrix.com;
>>>> george.dun...@citrix.com; p...@xen.org; Igor Druzhinin 
>>>> 
>>>> Subject: [PATCH for-4.14 v3] x86/svm: do not try to handle recalc NPT 
>>>> faults immediately
>>>>
>>>> A recalculation NPT fault doesn't always require additional handling
>>>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>>>> explicit handling done there - the fault is wrongly considered fatal.
>>>>
>>>> This covers a specific case of migration with vGPU assigned which
>>>> uses direct MMIO mappings made by XEN_DOMCTL_memory_mapping hypercall:
>>>> at a moment log-dirty is enabled globally, recalculation is requested
>>>> for the whole guest memory including those mapped MMIO regions
>>>
>>> I still think it is odd to put this in the commit comment since, as I
>>> said before, Xen ensures that this situation cannot happen at
>>> the moment.
>>
>> Aiui Igor had replaced reference to passed-through devices by reference
>> to mere handing of an MMIO range to a guest. Are you saying we suppress
>> log-dirty enabling in this case as well? I didn't think we do:
> 
> No, but the comment says "migration with vGPU *assigned*" (my emphasis), 
> which surely means has_arch_pdevs() will be true.

You may replace it with 'associated' or something if you don't like this word.

>>
>> if ( has_arch_pdevs(d) && log_global )
>> {
>> /*
>>  * Refuse to turn on global log-dirty mode
>>  * if the domain is sharing the P2M with the IOMMU.
>>  */
>> return -EINVAL;
>> }
>>
>> Seeing this code I wonder about the non-sharing case: If what the
>> comment says was true, the condition would need to change, but I
>> think it's the comment which is wrong, and we don't want global
>> log-dirty as long as an IOMMU is in use at all for a domain.
> 
> I think is the comment that is correct, not the condition. It is only when 
> using shared EPT that enabling logdirty is clearly an unsafe thing to do. 
> Using sync-ed IOMMU mappings should be ok.

It seems that the case of simple MMIO mappings made without an IOMMU being
enabled for a domain is, in fact, irrelevant to this condition.
I take it as a separate discussion on a different topic.

Igor



[PATCH for-4.14 v3] x86/svm: do not try to handle recalc NPT faults immediately

2020-06-03 Thread Igor Druzhinin
A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(); moreover, in the general case, if there is no
explicit handling done there, the fault is wrongly considered fatal.

This covers a specific case of migration with vGPU assigned which
uses direct MMIO mappings made by the XEN_DOMCTL_memory_mapping hypercall:
at the moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory including those mapped MMIO regions,
which causes a page fault to be raised at the first access to them;
but due to the MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault(), the domain is erroneously crashed with an
unhandled SVM violation.

Instead of trying to be opportunistic - use safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.

Since there was no case previously where p2m_pt_handle_deferred_changes()
could return a positive value - it's safe to replace ">= 0" with just "== 0"
in VMEXIT_NPF handler. finish_type_change() is also not affected by the
change as being able to deal with >0 return value of p2m->recalc from
EPT implementation.

Reviewed-by: Jan Beulich 
Reviewed-by: Roger Pau Monné 
Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- replace rc with recalc_done bool
- updated comment in finish_type_change()
- significantly extended commit description
Changes in v3:
- convert bool to int implicitly
- a little bit more info of the usecase in the message
---
 xen/arch/x86/hvm/svm/svm.c | 5 +++--
 xen/arch/x86/mm/p2m-pt.c   | 7 ++-
 xen/arch/x86/mm/p2m.c  | 2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 46a1aac..7f6f578 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
 v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
 rc = vmcb->exitinfo1 & PFEC_page_present
  ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
-if ( rc >= 0 )
+if ( rc == 0 )
+/* If no recalc adjustments were being made - handle this fault */
 svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
-else
+else if ( rc < 0 )
 {
 printk(XENLOG_G_ERR
"%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 5c05017..070389e 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -341,6 +341,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long 
gfn)
 unsigned int level = 4;
 l1_pgentry_t *pent;
 int err = 0;
+bool recalc_done = false;
 
 table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
 while ( --level )
@@ -402,6 +403,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long 
gfn)
 clear_recalc(l1, e);
 err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
 ASSERT(!err);
+
+recalc_done = true;
 }
 }
 unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
@@ -448,12 +451,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned 
long gfn)
 clear_recalc(l1, e);
 err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
 ASSERT(!err);
+
+recalc_done = true;
 }
 
  out:
 unmap_domain_page(table);
 
-return err;
+return err ?: recalc_done;
 }
 
 int p2m_pt_handle_deferred_changes(uint64_t gpa)
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 17f320b..db7bde0 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1197,7 +1197,7 @@ static int finish_type_change(struct p2m_domain *p2m,
 rc = p2m->recalc(p2m, gfn);
 /*
  * ept->recalc could return 0/1/-ENOMEM. pt->recalc could return
- * 0/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
+ * 0/1/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
  * gfn here. If rc is 1 we need to have it 0 for success.
  */
 if ( rc == -ENOENT || rc > 0 )
-- 
2.7.4




Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults immediately

2020-06-03 Thread Igor Druzhinin
On 03/06/2020 12:48, Paul Durrant wrote:
>> -Original Message-
>> From: Igor Druzhinin 
>> Sent: 03 June 2020 12:45
>> To: p...@xen.org; 'Jan Beulich' 
>> Cc: xen-devel@lists.xenproject.org; andrew.coop...@citrix.com; w...@xen.org; 
>> roger@citrix.com;
>> george.dun...@citrix.com
>> Subject: Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults 
>> immediately
>>
>> On 03/06/2020 12:28, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Jan Beulich 
>>>> Sent: 03 June 2020 12:22
>>>> To: p...@xen.org
>>>> Cc: 'Igor Druzhinin' ; 
>>>> xen-devel@lists.xenproject.org;
>>>> andrew.coop...@citrix.com; w...@xen.org; roger@citrix.com; 
>>>> george.dun...@citrix.com
>>>> Subject: Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults 
>>>> immediately
>>>>
>>>> On 03.06.2020 12:26, Paul Durrant wrote:
>>>>>> -Original Message-
>>>>>> From: Jan Beulich 
>>>>>> Sent: 03 June 2020 11:03
>>>>>> To: Igor Druzhinin 
>>>>>> Cc: xen-devel@lists.xenproject.org; andrew.coop...@citrix.com; 
>>>>>> w...@xen.org; roger@citrix.com;
>>>>>> george.dun...@citrix.com; Paul Durrant 
>>>>>> Subject: Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults 
>>>>>> immediately
>>>>>>
>>>>>> On 02.06.2020 18:56, Igor Druzhinin wrote:
>>>>>>> A recalculation NPT fault doesn't always require additional handling
>>>>>>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>>>>>>> explicit handling done there - the fault is wrongly considered fatal.
>>>>>>>
>>>>>>> This covers a specific case of migration with vGPU assigned on AMD:
>>>>>>> at a moment log-dirty is enabled globally, recalculation is requested
>>>>>>> for the whole guest memory including directly mapped MMIO regions of 
>>>>>>> vGPU
>>>>>>> which causes a page fault being raised at the first access to those;
>>>>>>> but due to MMIO P2M type not having any explicit handling in
>>>>>>> hvm_hap_nested_page_fault() a domain is erroneously crashed with 
>>>>>>> unhandled
>>>>>>> SVM violation.
>>>>>>>
>>>>>>> Instead of trying to be opportunistic - use safer approach and handle
>>>>>>> P2M recalculation in a separate NPT fault by attempting to retry after
>>>>>>> making the necessary adjustments. This is aligned with Intel behavior
>>>>>>> where there are separate VMEXITs for recalculation and EPT violations
>>>>>>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>>>>>>> Do it by also unifying do_recalc return code with Intel implementation
>>>>>>> where returning 1 means P2M was actually changed.
>>>>>>>
>>>>>>> Since there was no case previously where 
>>>>>>> p2m_pt_handle_deferred_changes()
>>>>>>> could return a positive value - it's safe to replace ">= 0" with just 
>>>>>>> "== 0"
>>>>>>> in VMEXIT_NPF handler. finish_type_change() is also not affected by the
>>>>>>> change as being able to deal with >0 return value of p2m->recalc from
>>>>>>> EPT implementation.
>>>>>>>
>>>>>>> Reviewed-by: Roger Pau Monné 
>>>>>>> Signed-off-by: Igor Druzhinin 
>>>>>>
>>>>>> Reviewed-by: Jan Beulich 
>>>>>> albeit preferably with ...
>>>>>>
>>>>>>> @@ -448,12 +451,14 @@ static int do_recalc(struct p2m_domain *p2m, 
>>>>>>> unsigned long gfn)
>>>>>>>  clear_recalc(l1, e);
>>>>>>>  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>>>>>>  ASSERT(!err);
>>>>>>> +
>>>>>>> +recalc_done = true;
>>>>>>>  }
>>>>>>>
>>>>>>>   out:
>>>>>>>  unmap_domain_page(table);
>>>>>>>
>>>>>>> -return err;

Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults immediately

2020-06-03 Thread Igor Druzhinin
On 03/06/2020 12:28, Paul Durrant wrote:
>> -Original Message-
>> From: Jan Beulich 
>> Sent: 03 June 2020 12:22
>> To: p...@xen.org
>> Cc: 'Igor Druzhinin' ; 
>> xen-devel@lists.xenproject.org;
>> andrew.coop...@citrix.com; w...@xen.org; roger@citrix.com; 
>> george.dun...@citrix.com
>> Subject: Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults 
>> immediately
>>
>> On 03.06.2020 12:26, Paul Durrant wrote:
>>>> -Original Message-
>>>> From: Jan Beulich 
>>>> Sent: 03 June 2020 11:03
>>>> To: Igor Druzhinin 
>>>> Cc: xen-devel@lists.xenproject.org; andrew.coop...@citrix.com; 
>>>> w...@xen.org; roger@citrix.com;
>>>> george.dun...@citrix.com; Paul Durrant 
>>>> Subject: Re: [PATCH v2] x86/svm: do not try to handle recalc NPT faults 
>>>> immediately
>>>>
>>>> On 02.06.2020 18:56, Igor Druzhinin wrote:
>>>>> A recalculation NPT fault doesn't always require additional handling
>>>>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>>>>> explicit handling done there - the fault is wrongly considered fatal.
>>>>>
>>>>> This covers a specific case of migration with vGPU assigned on AMD:
>>>>> at a moment log-dirty is enabled globally, recalculation is requested
>>>>> for the whole guest memory including directly mapped MMIO regions of vGPU
>>>>> which causes a page fault being raised at the first access to those;
>>>>> but due to MMIO P2M type not having any explicit handling in
>>>>> hvm_hap_nested_page_fault() a domain is erroneously crashed with unhandled
>>>>> SVM violation.
>>>>>
>>>>> Instead of trying to be opportunistic - use safer approach and handle
>>>>> P2M recalculation in a separate NPT fault by attempting to retry after
>>>>> making the necessary adjustments. This is aligned with Intel behavior
>>>>> where there are separate VMEXITs for recalculation and EPT violations
>>>>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>>>>> Do it by also unifying do_recalc return code with Intel implementation
>>>>> where returning 1 means P2M was actually changed.
>>>>>
>>>>> Since there was no case previously where p2m_pt_handle_deferred_changes()
>>>>> could return a positive value - it's safe to replace ">= 0" with just "== 
>>>>> 0"
>>>>> in VMEXIT_NPF handler. finish_type_change() is also not affected by the
>>>>> change as being able to deal with >0 return value of p2m->recalc from
>>>>> EPT implementation.
>>>>>
>>>>> Reviewed-by: Roger Pau Monné 
>>>>> Signed-off-by: Igor Druzhinin 
>>>>
>>>> Reviewed-by: Jan Beulich 
>>>> albeit preferably with ...
>>>>
>>>>> @@ -448,12 +451,14 @@ static int do_recalc(struct p2m_domain *p2m, 
>>>>> unsigned long gfn)
>>>>>  clear_recalc(l1, e);
>>>>>  err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
>>>>>  ASSERT(!err);
>>>>> +
>>>>> +recalc_done = true;
>>>>>  }
>>>>>
>>>>>   out:
>>>>>  unmap_domain_page(table);
>>>>>
>>>>> -return err;
>>>>> +return err ?: (recalc_done ? 1 : 0);
>>>>
>>>> ... this shrunk to
>>>>
>>>> return err ?: recalc_done;
>>>>
>>>> (easily doable while committing).
>>>>
>>>> Also Cc Paul.
>>>>
>>>
>>> paging_log_dirty_enable() still fails global enable if has_arch_pdevs()
>>> is true, so presumably there's no desperate need for this to go in 4.14?
>>
>> The MMIO case is just the particular situation here. Aiui the same issue
>> could potentially surface with other p2m types. Also given I'd consider
>> this a backporting candidate, while it may not be desperately needed for
>> the release, I think it deserves considering beyond the specific aspect
>> you mention.
>>
> 
> In which case I think the commit message probably ought to be rephrased, 
> since it appears to focus on a case that cannot currently happen.

This can happen without has_arch_pdevs() being true. It's enough to just
directly map some physical memory into a guest to get p2m_direct_mmio
type present in the page tables.

Igor



[PATCH v2] x86/svm: do not try to handle recalc NPT faults immediately

2020-06-02 Thread Igor Druzhinin
A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(); moreover, in the general case, if there is no
explicit handling done there, the fault is wrongly considered fatal.

This covers a specific case of migration with vGPU assigned on AMD:
at the moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory including directly mapped MMIO regions of the vGPU,
which causes a page fault to be raised at the first access to those;
but due to the MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault(), the domain is erroneously crashed with an
unhandled SVM violation.

Instead of trying to be opportunistic - use safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.

Since there was no case previously where p2m_pt_handle_deferred_changes()
could return a positive value - it's safe to replace ">= 0" with just "== 0"
in VMEXIT_NPF handler. finish_type_change() is also not affected by the
change as being able to deal with >0 return value of p2m->recalc from
EPT implementation.

Reviewed-by: Roger Pau Monné 
Signed-off-by: Igor Druzhinin 
---
Changes in v2:
- replace rc with recalc_done bool
- updated comment in finish_type_change()
- significantly extended commit description
---
 xen/arch/x86/hvm/svm/svm.c | 5 +++--
 xen/arch/x86/mm/p2m-pt.c   | 7 ++-
 xen/arch/x86/mm/p2m.c  | 2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 46a1aac..7f6f578 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
 v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
 rc = vmcb->exitinfo1 & PFEC_page_present
  ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
-if ( rc >= 0 )
+if ( rc == 0 )
+/* If no recalc adjustments were being made - handle this fault */
 svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, vmcb->exitinfo2);
-else
+else if ( rc < 0 )
 {
 printk(XENLOG_G_ERR
"%pv: Error %d handling NPF (gpa=%08lx ec=%04lx)\n",
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 5c05017..070389e 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -341,6 +341,7 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long 
gfn)
 unsigned int level = 4;
 l1_pgentry_t *pent;
 int err = 0;
+bool recalc_done = false;
 
 table = map_domain_page(pagetable_get_mfn(p2m_get_pagetable(p2m)));
 while ( --level )
@@ -402,6 +403,8 @@ static int do_recalc(struct p2m_domain *p2m, unsigned long 
gfn)
 clear_recalc(l1, e);
 err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
 ASSERT(!err);
+
+recalc_done = true;
 }
 }
 unmap_domain_page((void *)((unsigned long)pent & PAGE_MASK));
@@ -448,12 +451,14 @@ static int do_recalc(struct p2m_domain *p2m, unsigned 
long gfn)
 clear_recalc(l1, e);
 err = p2m->write_p2m_entry(p2m, gfn, pent, e, level + 1);
 ASSERT(!err);
+
+recalc_done = true;
 }
 
  out:
 unmap_domain_page(table);
 
-return err;
+return err ?: (recalc_done ? 1 : 0);
 }
 
 int p2m_pt_handle_deferred_changes(uint64_t gpa)
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 17f320b..db7bde0 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1197,7 +1197,7 @@ static int finish_type_change(struct p2m_domain *p2m,
 rc = p2m->recalc(p2m, gfn);
 /*
  * ept->recalc could return 0/1/-ENOMEM. pt->recalc could return
- * 0/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
+ * 0/1/-ENOMEM/-ENOENT, -ENOENT isn't an error as we are looping
  * gfn here. If rc is 1 we need to have it 0 for success.
  */
 if ( rc == -ENOENT || rc > 0 )
-- 
2.7.4




Re: [PATCH] x86/svm: do not try to handle recalc NPT faults immediately

2020-05-29 Thread Igor Druzhinin
On 29/05/2020 16:17, Igor Druzhinin wrote:
> On 29/05/2020 15:34, Jan Beulich wrote:
>> On 29.05.2020 02:35, Igor Druzhinin wrote:
>>> A recalculation NPT fault doesn't always require additional handling
>>> in hvm_hap_nested_page_fault(), moreover in general case if there is no
>>> explicit handling done there - the fault is wrongly considered fatal.
>>>
>>> Instead of trying to be opportunistic - use safer approach and handle
>>> P2M recalculation in a separate NPT fault by attempting to retry after
>>> making the necessary adjustments. This is aligned with Intel behavior
>>> where there are separate VMEXITs for recalculation and EPT violations
>>> (faults) and only faults are handled in hvm_hap_nested_page_fault().
>>> Do it by also unifying do_recalc return code with Intel implementation
>>> where returning 1 means P2M was actually changed.
>>>
>>> This covers a specific case of migration with vGPU assigned on AMD:
>>> global log-dirty is enabled and causes immediate recalculation NPT
>>> fault in MMIO area upon access.
>>
>> To be honest, from this last paragraph I still can't really derive
>> what goes wrong exactly why, before this change.
> 
> I admit it might require some knowledge of how vGPU is implemented. I will try
> to give more info in this paragraph.
> 
>>> Signed-off-by: Igor Druzhinin 
>>> ---
>>> This is a safer alternative to:
>>> https://lists.xenproject.org/archives/html/xen-devel/2020-05/msg01662.html
>>> and more correct approach from my PoV.
>>
>> Indeed - I was about to reply there, but then I thought I'd first
>> look at this patch, in case it was a replacement.
>>
>>> --- a/xen/arch/x86/hvm/svm/svm.c
>>> +++ b/xen/arch/x86/hvm/svm/svm.c
>>> @@ -2923,9 +2923,10 @@ void svm_vmexit_handler(struct cpu_user_regs *regs)
>>>  v->arch.hvm.svm.cached_insn_len = vmcb->guest_ins_len & 0xf;
>>>  rc = vmcb->exitinfo1 & PFEC_page_present
>>>   ? p2m_pt_handle_deferred_changes(vmcb->exitinfo2) : 0;
>>> -if ( rc >= 0 )
>>> +if ( rc == 0 )
>>> +/* If no recal adjustments were being made - handle this fault 
>>> */
>>>  svm_do_nested_pgfault(v, regs, vmcb->exitinfo1, 
>>> vmcb->exitinfo2);
>>> -else
>>> +else if ( rc < 0 )
>>
>> So from going through the code and judging by the comment in
>> finish_type_change() (which btw you will need to update, to avoid
>> it becoming stale) the >= here was there just in case, without
>> there actually being any case where a positive value would be
>> returned. It that's also the conclusion you've drawn, then I
>> think it would help mentioning this in the description.
> 
> I re-read the comments in finish_type_change() and to me they look
> pretty much in line with the now common interface between EPT and NPT
> recalc calls. 

Sorry, upon closer examination there is indeed a new case I missed. Thanks
for pointing that out.

Igor


