[PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork

2024-04-26 Thread Marek Marczykowski-Górecki
This makes tests to use patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1



[PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-04-26 Thread Marek Marczykowski-Górecki
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as MSI-X table. Device model (especially one in
stubdomain) cannot really handle those, as direct writes to that page is
refused (page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this, requires correlating read/write location with guest
of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
for PV would need to be done separately.

This will be also used to read Pending Bit Array, if it lives on the same
page, making QEMU not needing /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If PBA lives on another page, QEMU will
map it to the guest directly.
If PBA lives on the same page, discard writes and log a message.
Technically, writes outside of PBA could be allowed, but at this moment
the precise location of PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops need the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as default ("not mapped")
value - which simplifies code a bit.

GCC 12.2.1 gets confused about 'desc' variable:

arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized 
[-Werror=maybe-uninitialized]
  553 | if ( desc )
  |^
arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
  537 | const struct msi_desc *desc;
  |^~~~

It's conditional initialization is actually correct (in the case where
it isn't initialized, function returns early), but to avoid
build failure initialize it explicitly to NULL anyway.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v6:
- use MSIX_CHECK_WARN macro
- extend assert on fixmap_idx
- add break in default label, after ASSERT_UNREACHABLE(), and move
  setting default there
- style fixes
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c| 200 --
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c |  41 +++-
 3 files changed, 236 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 17983789..230e3a5dee3f 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
 return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
 struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
 struct domain *d = v->domain;
 
 list_for_each_entry( entry, >arch.hvm.msixtbl_list, list )
-if ( addr >= entry->gtable &&
- addr < entry->gtable + entry->table_len )
+if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+ PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
 return entry;
 
 return NULL;
@@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc(
 return NULL;
 }
 
+/*
+ * Returns:
+ *  - ADJACENT_DONT_HANDLE if no handling should be done
+ *  - ADJACENT_DISCARD_WRITE if write should be discarded
+ *  - a fixmap idx to use for handling
+ */
+#define ADJACENT_DONT_HANDLE UINT_MAX
+#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
+static unsigned int adjacent_handle(
+const struct msixtbl_entry *entry, unsign

[PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes

2024-04-26 Thread Marek Marczykowski-Górecki
This series includes changes to make MSI-X working with Linux stubdomain and
especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for
QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting
some registers on the same page as the MSI-X table.
Besides the stubdomain case (of which I care more), this is also necessary for
PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in dom0).

See individual patches for details.

This series include also tests for MSI-X using new approach (by preventing QEMU
access to /dev/mem). But for it to work, it needs QEMU change that
makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1269664578

Marek Marczykowski-Górecki (7):
  x86/msi: passthrough all MSI-X vector ctrl writes to device model
  x86/msi: Extend per-domain/device warning mechanism
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk   |   4 +-
 automation/gitlab-ci/build.yaml |   4 +-
 automation/gitlab-ci/test.yaml  |   4 +-
 automation/scripts/qubes-x86-64.sh  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c | 215 -
 xen/arch/x86/include/asm/msi.h  |  22 +-
 xen/arch/x86/msi.c  |  46 ++-
 xen/common/kernel.c |   1 +-
 xen/include/public/features.h   |   8 +-
 11 files changed, 300 insertions(+), 22 deletions(-)

base-commit: 7846f7699fea25502061a05ea847e942ea624f12
-- 
git-series 0.9.1



[PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo

2024-04-26 Thread Marek Marczykowski-Górecki
For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index aac29ee13ab6..6957f06016b7 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism

2024-04-26 Thread Marek Marczykowski-Górecki
The arch_msix struct had a single "warned" field with a domid for which
warning was issued. Upcoming patch will need similar mechanism for few
more warnings, so change it to save a bit field of issued warnings.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v6:
- add MSIX_CHECK_WARN macro (Jan)
- drop struct name from warned_kind union (Jan)

New in v5
---
 xen/arch/x86/include/asm/msi.h | 17 -
 xen/arch/x86/msi.c |  5 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..bcfdfd35345d 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -208,6 +208,15 @@ struct msg_address {
PCI_MSIX_ENTRY_SIZE + \
(~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1)))
 
+#define MSIX_CHECK_WARN(msix, domid, which) ({ \
+if ( (msix)->warned_domid != (domid) ) \
+{ \
+(msix)->warned_domid = (domid); \
+(msix)->warned_kind.all = 0; \
+} \
+(msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \
+})
+
 struct arch_msix {
 unsigned int nr_entries, used_entries;
 struct {
@@ -217,7 +226,13 @@ struct arch_msix {
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
 bool host_maskall, guest_maskall;
-domid_t warned;
+domid_t warned_domid;
+union {
+uint8_t all;
+struct {
+bool maskall   : 1;
+};
+} warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..42c793426da3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool 
host, bool guest)
 domid_t domid = pdev->domain->domain_id;
 
 maskall = true;
-if ( pdev->msix->warned != domid )
-{
-pdev->msix->warned = domid;
+if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
 printk(XENLOG_G_WARNING
"cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
desc->irq, domid, >sbdf);
-}
 }
 pdev->msix->host_maskall = maskall;
 if ( maskall || pdev->msix->guest_maskall )
-- 
git-series 0.9.1



[PATCH v6 5/7] automation: switch to a wifi card on ADL system

2024-04-26 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on a MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for HVM test, because MSI-X adjacent write is
not supported on PV.

This requires also including drivers and firmware in system for tests.
Remove firmware unrelated to the test, to not increase initrd size too
much (all firmware takes over 100MB compressed).
And finally adjusts test script to handle not only eth0 as a test device,
but also wlan0 and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Stefano Stabellini 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml  | 4 
 automation/scripts/qubes-x86-64.sh  | 7 +++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 8b7b2e4da92d..23a183fad967 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile 
b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile 
b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO 
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1



[PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2024-04-26 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment with removing /dev/mem device node. Full test
for lockdown and stubdomain will come later, when all requirements will
be in place.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Stefano Stabellini 
---
This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU wont have access /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) 
/\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model

2024-04-26 Thread Marek Marczykowski-Górecki
QEMU needs to know whether clearing maskbit of a vector is really
clearing, or was already cleared before. Currently Xen sends only
clearing that bit to the device model, but not setting it, so QEMU
cannot detect it. Because of that, QEMU is working this around by
checking via /dev/mem, but that isn't the proper approach.

Give all necessary information to QEMU by passing all ctrl writes,
including masking a vector. Advertise the new behavior via
XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem
anymore.

While this commit doesn't move the whole maskbit handling to QEMU (as
discussed on xen-devel as one of the possibilities), it is a necessary
first step anyway. Including telling QEMU it will get all the required
information to do so. The actual implementation would need to include:
 - a hypercall for QEMU to control just maskbit (without (re)binding the
   interrupt again
 - a methor for QEMU to tell Xen it will actually do the work
Those are not part of this series.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
I did not added any control to enable/disable this new behavior (as
Roger have suggested for possible non-QEMU ioreqs). I don't see how the
new behavior could be problematic for some existing ioreq server (they
already received writes to those addresses, just not all of them),
but if that's really necessary, I can probably add a command line option
to restore previous behavior system-wide.

Changes in v5:
- announce the feature only on x86
- style fixes

Changes in v4:
- ignore unaligned writes with X86EMUL_OKAY
- restructure the code to forward all writes in _msixtbl_write() instead
  of manipulating return value of msixtbl_write() - this makes
  WRITE_LEN4_COMPLETION special case unnecessary
- advertise the changed behavior via XENVER_get_features instead of DMOP
v3:
 - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make
   "flags" parameter IN/OUT
 - move len check back to msixtbl_write() - will be needed there anyway
   in a later patch
v2:
 - passthrough quad writes to emulator too (Jan)
 - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
   #define for this magic value
---
 xen/arch/x86/hvm/vmsi.c   | 19 ++-
 xen/common/kernel.c   |  1 +
 xen/include/public/features.h |  8 
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index adbac965f9f7..17983789 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 unsigned long flags;
 struct irq_desc *desc;
 
-if ( (len != 4 && len != 8) || (address & (len - 1)) )
-return r;
+if ( !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
 
 rcu_read_lock(_rcu_lock);
 
@@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 
 unlock:
 spin_unlock_irqrestore(>lock, flags);
-if ( len == 4 )
-r = X86EMUL_OKAY;
+r = X86EMUL_OKAY;
 
 out:
 rcu_read_unlock(_rcu_lock);
@@ -357,7 +356,17 @@ static int cf_check _msixtbl_write(
 const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
 uint64_t val)
 {
-return msixtbl_write(current, address, len, val);
+/* Ignore invalid length or unaligned writes. */
+if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
+
+/*
+ * This function returns X86EMUL_UNHANDLEABLE even if write is properly
+ * handled, to propagate it to the device model (so it can keep its
+ * internal state in sync).
+ */
+msixtbl_write(current, address, len, val);
+return X86EMUL_UNHANDLEABLE;
 }
 
 static bool cf_check msixtbl_range(
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 08dbaa2a054c..b44b2439ca8e 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -637,6 +637,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
  (1U << XENFEAT_hvm_callback_vector) |
  (has_pirq(d) ? (1U << XENFEAT_hvm_pirqs) : 0);
+fi.submap |= (1U << XENFEAT_dm_msix_all_writes);
 #endif
 if ( !paging_mode_translate(d) || is_domain_direct_mapped(d) )
 fi.submap |= (1U << XENFEAT_direct_mapped);
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 4437f25d2532..880193094713 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -120,6 +120,14 @@
 #define XENFEAT_runstate_phys_area18
 #define XENFEAT_vcpu_time_phys_area   19
 
+/*
+ * If set, Xen will passthrough all MSI-X vector ctrl writes to device model,
+ * not only those unmasking an entry. This allows device model to properly k

Re: [PATCH v5 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-04-26 Thread Marek Marczykowski-Górecki
On Thu, Apr 25, 2024 at 01:15:34PM +0200, Jan Beulich wrote:
> On 13.03.2024 16:16, Marek Marczykowski-Górecki wrote:
> > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > on the same page as MSI-X table. Device model (especially one in
> > stubdomain) cannot really handle those, as direct writes to that page is
> > refused (page is on the mmio_ro_ranges list). Instead, extend
> > msixtbl_mmio_ops to handle such accesses too.
> > 
> > Doing this, requires correlating read/write location with guest
> > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > for PV would need to be done separately.
> > 
> > This will be also used to read Pending Bit Array, if it lives on the same
> > page, making QEMU not needing /dev/mem access at all (especially helpful
> > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > map it to the guest directly.
> > If PBA lives on the same page, discard writes and log a message.
> > Technically, writes outside of PBA could be allowed, but at this moment
> > the precise location of PBA isn't saved, and also no known device abuses
> > the spec in this way (at least yet).
> > 
> > To access those registers, msixtbl_mmio_ops need the relevant page
> > mapped. MSI handling already has infrastructure for that, using fixmap,
> > so try to map first/last page of the MSI-X table (if necessary) and save
> > their fixmap indexes. Note that msix_get_fixmap() does reference
> > counting and reuses existing mapping, so just call it directly, even if
> > the page was mapped before. Also, it uses a specific range of fixmap
> > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > value - which simplifies code a bit.
> > 
> > GCC 12.2.1 gets confused about 'desc' variable:
> > 
> > arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> > arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized 
> > [-Werror=maybe-uninitialized]
> >   553 | if ( desc )
> >   |^
> > arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> >   537 | const struct msi_desc *desc;
> >   |    ^~~~
> > 
> > It's conditional initialization is actually correct (in the case where
> > it isn't initialized, function returns early), but to avoid
> > build failure initialize it explicitly to NULL anyway.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> Sadly there are further more or less cosmetic issues. Plus, as indicated
> before, I'm not really happy for us to gain all of this extra code. In
> the end I may eventually give an R-b not including the usually implied
> A-b, to indicate the code (then) looks okay to me but I still want
> someone else to actually ack it to allow it going in.

I understand. Given similar code is committed for vPCI already, I hope
somebody will be comfortable with acking this one too (yes, I do realize
the vPCI one is much less exposed, but still).

> > +static int adjacent_read(
> > +unsigned int fixmap_idx,
> > +paddr_t address, unsigned int len, uint64_t *pval)
> > +{
> > +const void __iomem *hwaddr;
> > +
> > +*pval = ~0UL;
> > +
> > +ASSERT(fixmap_idx != ADJACENT_DISCARD_WRITE);
> 
> Why only one of the special values? And before you add the other here:
> Why not simply ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END)? (Could of
> course bound at the other end, too, i.e. against FIX_MSIX_IO_RESERV_BASE.)

That's the most likely bug that could happen, but indeed broader assert
would be better.

> > +hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address);
> > +
> > +switch ( len )
> > +{
> > +case 1:
> > +*pval = readb(hwaddr);
> > +break;
> > +
> > +case 2:
> > +*pval = readw(hwaddr);
> > +break;
> > +
> > +case 4:
> > +*pval = readl(hwaddr);
> > +break;
> > +
> > +case 8:
> > +*pval = readq(hwaddr);
> > +    break;
> > +
> > +default:
> > +ASSERT_UNREACHABLE();
> 
> Misra demands "break;" to be here for release builds. In fact I wonder
> why "*pval = ~0UL;" isn't put here, too. Question of course is whether
> in such a case a true error indicator wouldn't be yet better.

I don't think it possible for the msixtbl_read() (that calls
adjacent_read()) to be called with other sizes. The default label is
here exactly to make it obvious for the reader.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] x86/MTRR: avoid several indirect calls

2024-04-22 Thread Marek Marczykowski-Górecki
On Wed, Jan 17, 2024 at 10:32:53AM +0100, Jan Beulich wrote:
> --- a/xen/arch/x86/cpu/mtrr/main.c
> +++ b/xen/arch/x86/cpu/mtrr/main.c
> @@ -328,7 +316,7 @@ int mtrr_add_page(unsigned long base, un
>   }
>  
>   /*  If the type is WC, check that this processor supports it  */
> - if ((type == X86_MT_WC) && !have_wrcomb()) {
> + if ((type == X86_MT_WC) && mtrr_have_wrcomb()) {

Is reversing the condition intentional here? 

>   printk(KERN_WARNING
>  "mtrr: your processor doesn't support 
> write-combining\n");
>   return -EOPNOTSUPP;

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v5 0/7] MSI-X support with qemu in stubdomain, and other related changes

2024-04-18 Thread Marek Marczykowski-Górecki
On Wed, Mar 13, 2024 at 04:16:05PM +0100, Marek Marczykowski-Górecki wrote:
> This series includes changes to make MSI-X working with Linux stubdomain and
> especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for
> QEMU to access /dev/mem, but also the Intel Wifi card violating spec by 
> putting
> some registers on the same page as the MSI-X table.
> Besides the stubdomain case (of which I care more), this is also necessary for
> PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in 
> dom0).
> 
> See individual patches for details.
> 
> This series include also tests for MSI-X using new approach (by preventing 
> QEMU
> access to /dev/mem). But for it to work, it needs QEMU change that
> makes use of the changes introduced here. It can be seen at
> https://github.com/marmarek/qemu/commits/msix
> 
> Here is the pipeline that used the QEMU fork above:
> https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1211237368
> 
> Marek Marczykowski-Górecki (7):
>   x86/msi: passthrough all MSI-X vector ctrl writes to device model
>   x86/msi: Extend per-domain/device warning mechanism
>   x86/hvm: Allow access to registers on the same page as MSI-X table
>   automation: prevent QEMU access to /dev/mem in PCI passthrough tests
>   automation: switch to a wifi card on ADL system
>   [DO NOT APPLY] switch to qemu fork
>   [DO NOT APPLY] switch to alternative artifact repo
> 
>  Config.mk   |   4 +-
>  automation/gitlab-ci/build.yaml |   4 +-
>  automation/gitlab-ci/test.yaml  |   4 +-
>  automation/scripts/qubes-x86-64.sh  |   9 +-
>  automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
>  automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
>  xen/arch/x86/hvm/vmsi.c | 224 -
>  xen/arch/x86/include/asm/msi.h  |  15 +-
>  xen/arch/x86/msi.c  |  50 ++-
>  xen/common/kernel.c |   1 +-
>  xen/include/public/features.h   |   8 +-
>  11 files changed, 308 insertions(+), 20 deletions(-)
> 
> base-commit: 03cf7ca23e0e876075954c558485b267b7d02406
> -- 
> git-series 0.9.1

Ping, can I ask for a review?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map

2024-04-17 Thread Marek Marczykowski-Górecki
On Wed, Apr 17, 2024 at 04:17:48PM +0200, Jan Beulich wrote:
> On 14.04.2024 02:32, Marek Marczykowski-Górecki wrote:
> > On Wed, Apr 03, 2024 at 09:10:40AM +0200, Jan Beulich wrote:
> >> On 27.03.2024 03:53, Marek Marczykowski-Górecki wrote:
> >>> The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
> >>> map. This should be true for addresses coming from the firmware, but
> >>> when extra pages used by Xen itself are included in the mapping, those
> >>> are taken from usable RAM used. Mark those pages as reserved too.
> >>>
> >>> Not marking the pages as reserved didn't caused issues before due to
> >>> another a bug in IOMMU driver code, that was fixed in 83afa3135830
> >>> ("amd-vi: fix IVMD memory type checks").
> >>>
> >>> Failing to reserve memory will lead to panic in IOMMU setup code. And
> >>> not including the page in IOMMU mapping will lead to broken console (due
> >>> to IOMMU faults). The pages chosen by the XHCI console driver should
> >>> still be usable by the CPU though, and the console code already can deal
> >>> with too slow console by dropping characters (and console not printing
> >>> anything is a special case of "slow"). When reserving fails print an error
> >>> message showing which pages failed and who requested them. This should
> >>> be enough hint to find why XHCI console doesn't work.
> >>>
> >>> Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the 
> >>> XHCI"
> >>> Signed-off-by: Marek Marczykowski-Górecki 
> >>> 
> >>
> >> Acked-by: Jan Beulich 
> > 
> > Is any ack missing here, or has it just fallen through the cracks?
> 
> ??? (commit dd5101a6169f89b9e3f3b72f0b0fcdb38db2fb35)

Oh, sorry, somehow I missed it. All good then, thanks.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map

2024-04-13 Thread Marek Marczykowski-Górecki
On Wed, Apr 03, 2024 at 09:10:40AM +0200, Jan Beulich wrote:
> On 27.03.2024 03:53, Marek Marczykowski-Górecki wrote:
> > The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
> > map. This should be true for addresses coming from the firmware, but
> > when extra pages used by Xen itself are included in the mapping, those
> > are taken from usable RAM used. Mark those pages as reserved too.
> > 
> > Not marking the pages as reserved didn't caused issues before due to
> > another a bug in IOMMU driver code, that was fixed in 83afa3135830
> > ("amd-vi: fix IVMD memory type checks").
> > 
> > Failing to reserve memory will lead to panic in IOMMU setup code. And
> > not including the page in IOMMU mapping will lead to broken console (due
> > to IOMMU faults). The pages chosen by the XHCI console driver should
> > still be usable by the CPU though, and the console code already can deal
> > with too slow console by dropping characters (and console not printing
> > anything is a special case of "slow"). When reserving fails print an error
> > message showing which pages failed and who requested them. This should
> > be enough hint to find why XHCI console doesn't work.
> > 
> > Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the 
> > XHCI"
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> Acked-by: Jan Beulich 

Hi,

Is any ack missing here, or has it just fallen through the cracks?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [XEN PATCH v2 0/3] x86/iommu: Drop IOMMU support when cx16 isn't supported

2024-04-11 Thread Marek Marczykowski-Górecki
On Thu, Apr 11, 2024 at 09:05:08PM +0100, Andrew Cooper wrote:
> Sorry, but you've sent out two copies of each patch in this series, and
> it's not clear if they're identical or not.

FWIW I've got just one copy. 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v1 0/2] Starting AMD SEV work

2024-04-10 Thread Marek Marczykowski-Górecki
On Wed, Apr 10, 2024 at 05:36:34PM +0200, Andrei Semenov wrote:
> This patch series initiate work on AMD SEV technology implementation in Xen.
> SEV stands for "Secure Encrypted Virtualization" and allows the memory 
> contents
> of a VM to be encrypted with a key unique to this VM. In this way the neither
> other VMs nor hypervisor can't read the memory content of this "encrypted"
> VM.
> 
> In order to create and to run such a VM different layers of software must
> interact (bascally Xen hypevisor, Xen toolstack in dom0 and the encrypted VM
> itself).
> 
> In this work we start with discovering and enabling SEV feature on the 
> platform.
> The second patch ports AMD Secure Processor driver on Xen. This AMD Secure
> Processor device (a.k.a PSP) is the way the different software layers interact
> with AMD firmware/hardware to manage and run the encrypted VM.

How will that interact with the PSP driver in dom0? AFAIK amdgpu driver
uses PSP for loading the GPU firmware. Does it mean one need to choose
either GPU in dom0 or encrypted VMs, or is it going to work somehow
together?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch

2024-04-04 Thread Marek Marczykowski-Górecki
On Thu, Apr 04, 2024 at 11:41:22AM +0100, Andrew Cooper wrote:
> It turns out there is something wonky on some but not all CPUs with
> MSR_TSX_FORCE_ABORT.  The presence of RTM_ALWAYS_ABORT causes Xen to think
> it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
> genuinely #UD.
> 
> Spot this case and try to back out as cleanly as we can.
> 
> Signed-off-by: Andrew Cooper 

Thanks, this makes the test exit with 0, and print just "Got #UD" now in
the "Testing RTM behaviour" section.

Tested-by: Marek Marczykowski-Górecki 

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Marek Marczykowski-Górecki 
> 
> In the meantime, I'll see if anyone at Intel knows what's going on.  Because
> these parts are fully out of support now, it's very unlikely that we're going
> to get a fix.
> ---
>  xen/arch/x86/tsx.c | 55 +-
>  1 file changed, 45 insertions(+), 10 deletions(-)
> 
> diff --git a/xen/arch/x86/tsx.c b/xen/arch/x86/tsx.c
> index 50d8059f23a9..41bb39d10074 100644
> --- a/xen/arch/x86/tsx.c
> +++ b/xen/arch/x86/tsx.c
> @@ -1,5 +1,6 @@
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /*
> @@ -9,6 +10,7 @@
>   *  -1 => Default, altered to 0/1 (if unspecified) by:
>   * - TAA heuristics/settings for speculative safety
>   * - "TSX vs PCR3" select for TSX memory ordering safety
> + *  -2 => Implicit tsx=0 (from RTM_ALWAYS_ABORT vs RTM mismatch)
>   *  -3 => Implicit tsx=1 (feed-through from spec-ctrl=0)
>   *
>   * This is arranged such that the bottom bit encodes whether TSX is actually
> @@ -114,11 +116,50 @@ void tsx_init(void)
>  
>  if ( cpu_has_tsx_force_abort )
>  {
> +uint64_t val;
> +
>  /*
> - * On an early TSX-enable Skylake part subject to the memory
> + * On an early TSX-enabled Skylake part subject to the memory
>   * ordering erratum, with at least the March 2019 microcode.
>   */
>  
> +rdmsrl(MSR_TSX_FORCE_ABORT, val);
> +
> +/*
> + * At the time of writing (April 2024), it was discovered that
> + * some parts (e.g. CoffeeLake 8th Gen, 06-9e-0a, ucode 0xf6)
> + * advertise RTM_ALWAYS_ABORT, but XBEGIN instructions #UD.  
> Other
> + * similar parts (e.g. KabyLake Xeon-E3, 06-9e-09, ucode 0xf8)
> + * operate as expected.
> + *
> + * In this case:
> + *  - RTM_ALWAYS_ABORT and MSR_TSX_FORCE_ABORT are enumerated.
> + *  - XBEGIN instructions genuinely #UD.
> + *  - MSR_TSX_FORCE_ABORT is write-discard and fails to hold its
> + *value.
> + *  - HLE and RTM are not enumerated, despite
> + *MSR_TSX_FORCE_ABORT.TSX_CPUID_CLEAR being clear.
> + *
> + * Spot this case, and treat it as if no TSX is available at all.
> + * This will prevent Xen from thinking it's safe to offer HLE/RTM
> + * to VMs.
> + */
> +if ( val == 0 && cpu_has_rtm_always_abort && !cpu_has_rtm )
> +{
> +printk(XENLOG_ERR
> +   "FIRMWARE BUG: CPU %02x-%02x-%02x, ucode 0x%08x: 
> RTM_ALWAYS_ABORT vs RTM mismatch\n",
> +   boot_cpu_data.x86, boot_cpu_data.x86_model,
> +   boot_cpu_data.x86_mask, this_cpu(cpu_sig).rev);
> +
> +setup_clear_cpu_cap(X86_FEATURE_RTM_ALWAYS_ABORT);
> +setup_clear_cpu_cap(X86_FEATURE_TSX_FORCE_ABORT);
> +
> +if ( opt_tsx < 0 )
> +opt_tsx = -2;
> +
> +goto done_setup;
> +}
> +
>  /*
>   * Probe for the June 2021 microcode which de-features TSX on
>   * client parts.  (Note - this is a subset of parts impacted by
> @@ -128,15 +169,8 @@ void tsx_init(void)
>   * read as zero if TSX_FORCE_ABORT.ENABLE_RTM has been set before
>   * we run.
>   */
> -if ( !has_rtm_always_abort )
> -{
> -uint64_t val;
> -
> -rdmsrl(MSR_TSX_FORCE_ABORT, val);
> -
> -if ( val & TSX_ENABLE_RTM )
> -has_rtm_always_abort = true;
> -}
> +if ( val & TSX_ENABLE_RTM )
> +has_rtm_always_abort = true;
>  
>  /*
>   * If no explicit tsx= option is provided, pick a default.
> @@ -191,6 +225,7 @@ void tsx_init(void)
>  setup_force_cpu_cap(X86_FEATURE_RTM);
>  }
>  }
> + done_setup:
>  
>  /*
>   * Note: MSR_TSX_CTRL is enumerated on TSX-enabled MDS_NO and later 
> parts.
> 
> base-commit: 6117179dd99958e4ef2687617d12c9b15bdbae24
> -- 
> 2.30.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: text-tsx fails on Intel core 8th gen system

2024-04-03 Thread Marek Marczykowski-Górecki
On Wed, Apr 03, 2024 at 05:04:20PM +0200, Jan Beulich wrote:
> On 03.04.2024 16:50, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > I've noticed that tools/tests/tsx/test-tsx fails on a system with Intel
> > Core i7-8750H. Specific error I get:
> > 
> > [user@dom0 tsx]$ ./test-tsx 
> > TSX tests
> >   Got 16 CPUs
> > Testing MSR_TSX_FORCE_ABORT consistency
> >   CPU0 val 0
> > Testing MSR_TSX_CTRL consistency
> > Testing MSR_MCU_OPT_CTRL consistency
> >   CPU0 val 0
> > Testing RTM behaviour
> >   Got #UD
> >   Host reports RTM, but appears unavailable
> 
> Isn't this ...
> 
> > Testing PV default/max policies
> >   Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   HLE/RTM offered to guests despite not being available
> > Testing HVM default/max policies
> >   Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   HLE/RTM offered to guests despite not being available
> > Testing PV guest
> >   Created d8
> >   Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> > Testing HVM guest
> >   Created d9
> >   Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> >   Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
> > [user@dom0 tsx]$ echo $?
> > 1
> 
> ... the reason for this?

I think so, but the question is why it behaves this way. Could be an
issue with MSR/CPUID values presented by Xen, or values Xen gets from
the CPU.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


text-tsx fails on Intel core 8th gen system

2024-04-03 Thread Marek Marczykowski-Górecki
Hi,

I've noticed that tools/tests/tsx/test-tsx fails on a system with Intel
Core i7-8750H. Specific error I get:

[user@dom0 tsx]$ ./test-tsx 
TSX tests
  Got 16 CPUs
Testing MSR_TSX_FORCE_ABORT consistency
  CPU0 val 0
Testing MSR_TSX_CTRL consistency
Testing MSR_MCU_OPT_CTRL consistency
  CPU0 val 0
Testing RTM behaviour
  Got #UD
  Host reports RTM, but appears unavailable
Testing PV default/max policies
  Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  HLE/RTM offered to guests despite not being available
Testing HVM default/max policies
  Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  HLE/RTM offered to guests despite not being available
Testing PV guest
  Created d8
  Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
Testing HVM guest
  Created d9
  Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
  Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0
[user@dom0 tsx]$ echo $?
1


When I try it on a newer system (11th gen) then it works fine (exit code
0, just "Got #UD", no "Host reports RTM, but appears unavailable" line).


/proc/cpuinfo says:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 158
model name  : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
stepping: 10
microcode   : 0xf6
cpu MHz : 2207.990
cache size  : 9216 KB
physical id : 0
siblings: 6
core id : 0
cpu cores   : 1
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu de tsc msr pae mce cx8 apic sep mca cmov pat 
clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good 
nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16 
sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 
3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms 
rdseed adx clflushopt xsaveopt xsavec xgetbv1 md_clear arch_capabilities
bugs: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass 
l1tf mds swapgs itlb_multihit srbds mmio_stale_data retbleed
bogomips: 4415.98
clflush size: 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

...


Full `xen-cpuid detail` output attached.

Just in case, I'm attaching also full xl dmesg, but I don't see anything
related there.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
nr_features: 18

Static sets:
Known   
bfebfbff:fffef3ff:ee500800:2469bfff:000f:ffbf:3a405fdf:0780:779fd205:fc91ef1c:1c30:38000144:0001:0037::0004:1fbe:
  [00] CPUID 0x0001.edx fpu vme de pse tsc msr pae mce cx8 apic 
sysenter mtrr pge mca cmov pat pse36 clflush ds acpi mmx fxsr sse sse2 ss htt 
tm pbe
  [01] CPUID 0x0001.ecx sse3 pclmulqdq dtes64 monitor ds-cpl vmx smx 
est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse41 sse42 x2apic movebe popcnt 
tsc-dl aesni xsave osxsave avx f16c rdrnd hyper
  [02] CPUID 0x8001.edx syscall nx mmx+ fxsr+ pg1g rdtscp lm 3dnow+ 
3dnow
  [03] CPUID 0x8001.ecx lahf-lm cmp svm extapic cr8d lzcnt sse4a msse 
3dnowpf osvw ibs xop skinit wdt lwp fma4 nodeid tbm topoext dbx monitorx
  [04] CPUID 0x000d:1.eax   xsaveopt xsavec xgetbv1 xsaves
  [05] CPUID 0x0007:0.ebx   fsgsbase tsc-adj sgx bmi1 hle avx2 fdp-exn smep 
bmi2 erms invpcid rtm pqm depfpp mpx pqe avx512f avx512dq rdseed adx smap 
avx512-ifma clflushopt clwb proc-trace avx512pf avx512er avx512cd sha avx512bw 
avx512vl
  [06] CPUID 0x0007:0.ecx   prefetchwt1 avx512-vbmi umip pku ospke 
avx512-vbmi2 cet-ss gfni vaes vpclmulqdq avx512-vnni avx512-bitalg 
avx512-vpopcntdq rdpid cldemote movdiri movdir64b enqcmd
  [07] CPUID 0x8007.edx hw-pstate itsc cpb efro
  [08] CPUID 0x8008.ebx clzero rstr-fp-err-ptrs wbnoinvd ibpb ibrs 
amd-stibp ibrs-always stibp-always ibrs-fast ibrs-same-mode no-lmsl ppin 
amd-ssbd virt-ssbd ssb-no psfd btc-no ibpb-ret
  [09] CPUID 0x0007:0.edx   avx512-4vnniw avx512-4fmaps fsrm 
avx512-vp2intersect srbds-ctrl md-clear rtm-always-abort tsx-force-abort 
serialize hybrid tsxldtrk cet-ibt avx512-fp16 ibrsb stibp l1d-flush arch-caps 
core-caps ssbd
  [10] CPUID 0x0007:1.eax   avx-vnni avx512-bf16 fzrm fsrs fsrcs
  [11] CPUID 0x8021.eax lfence+ nscb auto-ibrs sbpb ibpb-brtype srso-no
  [12] CPUID 0x0007:1.ebx

Re: [PATCH net] xen-netfront: Add missing skb_mark_for_recycle

2024-03-28 Thread Marek Marczykowski-Górecki
On Wed, Mar 27, 2024 at 01:14:56PM +0100, Jesper Dangaard Brouer wrote:
> Notice that skb_mark_for_recycle() is introduced later than fixes tag in
> 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling").
> 
> It is believed that fixes tag were missing a call to page_pool_release_page()
> between v5.9 to v5.14, after which is should have used skb_mark_for_recycle().
> Since v6.6 the call page_pool_release_page() were removed (in 535b9c61bdef
> ("net: page_pool: hide page_pool_release_page()") and remaining callers
> converted (in commit 6bfef2ec0172 ("Merge branch
> 'net-page_pool-remove-page_pool_release_page'")).
> 
> This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: catch
> page_pool memory leaks").
> 
> Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront")
> Reported-by: Arthur Borsboom 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
> Compile tested only, can someone please test this

I've got a confirmation it fixes the issue:
https://github.com/QubesOS/qubes-linux-kernel/pull/926#issuecomment-2026226944

>  drivers/net/xen-netfront.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index ad29f370034e..8d2aee88526c 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -285,6 +285,7 @@ static struct sk_buff *xennet_alloc_one_rx_buffer(struct 
> netfront_queue *queue)
>   return NULL;
>   }
>   skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE);
> + skb_mark_for_recycle(skb);
>  
>   /* Align ip header to a 16 bytes boundary */
>   skb_reserve(skb, NET_IP_ALIGN);
> 
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v3 1/2] hw/xen: detect when running inside stubdomain

2024-03-26 Thread Marek Marczykowski-Górecki
Introduce global xen_is_stubdomain variable when qemu is running inside
a stubdomain instead of dom0. This will be relevant for subsequent
patches, as few things like accessing PCI config space need to be done
differently.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v3:
- move to xen_hvm_init_pc()
- coding style
Changes in v2:
- use sigend int for domid to match xenstore_read_int() types
- fix code style
---
 hw/i386/xen/xen-hvm.c | 22 ++
 include/hw/xen/xen.h  |  1 +
 system/globals.c  |  1 +
 3 files changed, 24 insertions(+)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index 7745cb3..3291c17 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -583,6 +583,26 @@ static void xen_wakeup_notifier(Notifier *notifier, void 
*data)
 xc_set_hvm_param(xen_xc, xen_domid, HVM_PARAM_ACPI_S_STATE, 0);
 }
 
+static bool xen_check_stubdomain(struct xs_handle *xsh)
+{
+char *dm_path = g_strdup_printf(
+"/local/domain/%d/image/device-model-domid", xen_domid);
+char *val;
+int32_t dm_domid;
+bool is_stubdom = false;
+
+val = xs_read(xsh, 0, dm_path, NULL);
+if (val) {
+if (sscanf(val, "%d", _domid) == 1) {
+is_stubdom = dm_domid != 0;
+}
+free(val);
+}
+
+g_free(dm_path);
+return is_stubdom;
+}
+
 void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion **ram_memory)
 {
 MachineState *ms = MACHINE(pcms);
@@ -595,6 +615,8 @@ void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion 
**ram_memory)
 
 xen_register_ioreq(state, max_cpus, _memory_listener);
 
+xen_is_stubdomain = xen_check_stubdomain(state->xenstore);
+
 QLIST_INIT(_physmap);
 xen_read_physmap(state);
 
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 37ecc91..ecb89ec 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -36,6 +36,7 @@ enum xen_mode {
 extern uint32_t xen_domid;
 extern enum xen_mode xen_mode;
 extern bool xen_domid_restrict;
+extern bool xen_is_stubdomain;
 
 int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 int xen_set_pci_link_route(uint8_t link, uint8_t irq);
diff --git a/system/globals.c b/system/globals.c
index e353584..d602a04 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -60,6 +60,7 @@ bool qemu_uuid_set;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_DISABLED;
 bool xen_domid_restrict;
+bool xen_is_stubdomain;
 struct evtchn_backend_ops *xen_evtchn_ops;
 struct gnttab_backend_ops *xen_gnttab_ops;
 struct foreignmem_backend_ops *xen_foreignmem_ops;
-- 
git-series 0.9.1



[PATCH v3 2/2] xen: fix stubdom PCI addr

2024-03-26 Thread Marek Marczykowski-Górecki
When running in a stubdomain, the config space access via sysfs needs to
use BDF as seen inside stubdomain (connected via xen-pcifront), which is
different from the real BDF. For other purposes (hypercall parameters
etc), the real BDF needs to be used.
Get the in-stubdomain BDF by looking up relevant PV PCI xenstore
entries.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v3:
- reduce 'path' size
- add two missing error_setg() calls
- coding style
Changes in v2:
- use xs_node_scanf
- use %d instead of %u to read values written as %d
- add a comment from another iteration of this patch by Jason Andryuk
---
 hw/xen/xen-host-pci-device.c | 76 -
 hw/xen/xen-host-pci-device.h |  6 +++-
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 8c6e9a1..eaf32f2 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -9,6 +9,8 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qemu/cutils.h"
+#include "hw/xen/xen-legacy-backend.h"
+#include "hw/xen/xen-bus-helper.h"
 #include "xen-host-pci-device.h"
 
 #define XEN_HOST_PCI_MAX_EXT_CAP \
@@ -33,13 +35,73 @@
 #define IORESOURCE_PREFETCH 0x1000  /* No side effects */
 #define IORESOURCE_MEM_64   0x0010
 
+/*
+ * Non-passthrough (dom0) accesses are local PCI devices and use the given BDF
+ * Passthough (stubdom) accesses are through PV frontend PCI device.  Those
+ * either have a BDF identical to the backend's BDF (xen-backend.passthrough=1)
+ * or a local virtual BDF (xen-backend.passthrough=0)
+ *
+ * We are always given the backend's BDF and need to lookup the appropriate
+ * local BDF for sysfs access.
+ */
+static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp)
+{
+unsigned int num_devs, len, i;
+unsigned int domain, bus, dev, func;
+char *be_path = NULL;
+char path[16];
+
+be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", );
+if (!be_path) {
+error_setg(errp, "Failed to read device/pci/0/backend");
+goto out;
+}
+
+if (xs_node_scanf(xenstore, 0, be_path, "num_devs", NULL,
+  "%d", _devs) != 1) {
+error_setg(errp, "Failed to read or parse %s/num_devs", be_path);
+goto out;
+}
+
+for (i = 0; i < num_devs; i++) {
+snprintf(path, sizeof(path), "dev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s", be_path, path);
+goto out;
+}
+if (domain != d->domain ||
+bus != d->bus ||
+dev != d->dev ||
+func != d->func)
+continue;
+snprintf(path, sizeof(path), "vdev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s", be_path, path);
+goto out;
+}
+d->local_domain = domain;
+d->local_bus = bus;
+d->local_dev = dev;
+d->local_func = func;
+goto out;
+}
+error_setg(errp, "Failed to find PCI device %x:%x:%x.%x in xenstore",
+   d->domain, d->bus, d->dev, d->func);
+
+out:
+free(be_path);
+}
+
 static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d,
 const char *name, char *buf, ssize_t size)
 {
 int rc;
 
 rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s",
-  d->domain, d->bus, d->dev, d->func, name);
+  d->local_domain, d->local_bus, d->local_dev, d->local_func,
+  name);
 assert(rc >= 0 && rc < size);
 }
 
@@ -342,6 +404,18 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t 
domain,
 d->dev = dev;
 d->func = func;
 
+if (xen_is_stubdomain) {
+xen_host_pci_fill_local_addr(d, errp);
+if (*errp) {
+goto error;
+}
+} else {
+d->local_domain = d->domain;
+d->local_bus = d->bus;
+d->local_dev = d->dev;
+d->local_func = d->func;
+}
+
 xen_host_pci_config_open(d, errp);
 if (*errp) {
 goto error;
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34e..270dcb2 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice {
 uint8_t dev;
 uint8_t func;
 
+/* different from the above in case of stubdomain */
+uint16_t local_domain;
+uint8_t local_bus;
+uint8_t local_dev;
+uint8_t local_func;
+
 uint16_t vendor_id;
 uint16_t device_id;
 uint32_t class_code;
-- 
git-series 0.9.1



Re: [PATCH v3 1/2] IOMMU: store name for extra reserved device memory

2024-03-26 Thread Marek Marczykowski-Górecki
On Wed, Mar 27, 2024 at 03:53:10AM +0100, Marek Marczykowski-Górecki wrote:
> It will be useful for error reporting in a subsequent patch.
> 
> Signed-off-by: Marek Marczykowski-Górecki 
> Acked-by: Jan Beulich 

This one is already applied, sorry for re-send.

> ---
> New in v2
> ---
>  xen/drivers/char/xhci-dbc.c | 3 ++-
>  xen/drivers/passthrough/iommu.c | 5 -
>  xen/include/xen/iommu.h | 3 ++-
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
> index 3bf389be7d0b..8e2037f1a5f7 100644
> --- a/xen/drivers/char/xhci-dbc.c
> +++ b/xen/drivers/char/xhci-dbc.c
> @@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void)
>  iommu_add_extra_reserved_device_memory(
>  PFN_DOWN(virt_to_maddr(_dma_bufs)),
>  PFN_UP(sizeof(dbc_dma_bufs)),
> -uart->dbc.sbdf);
> +uart->dbc.sbdf,
> +"XHCI console");
>  serial_register_uart(SERHND_XHCI, _uart_driver, _uart);
>  }
>  }
> diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
> index 996c31be1284..03587c0cd680 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -682,6 +682,7 @@ struct extra_reserved_range {
>  unsigned long start;
>  unsigned long nr;
>  pci_sbdf_t sbdf;
> +const char *name;
>  };
>  static unsigned int __initdata nr_extra_reserved_ranges;
>  static struct extra_reserved_range __initdata
> @@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata
>  
>  int __init iommu_add_extra_reserved_device_memory(unsigned long start,
>unsigned long nr,
> -  pci_sbdf_t sbdf)
> +  pci_sbdf_t sbdf,
> +  const char *name)
>  {
>  unsigned int idx;
>  
> @@ -700,6 +702,7 @@ int __init 
> iommu_add_extra_reserved_device_memory(unsigned long start,
>  extra_reserved_ranges[idx].start = start;
>  extra_reserved_ranges[idx].nr = nr;
>  extra_reserved_ranges[idx].sbdf = sbdf;
> +extra_reserved_ranges[idx].name = name;
>  
>  return 0;
>  }
> diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
> index 9621459c63ee..b7829dff4588 100644
> --- a/xen/include/xen/iommu.h
> +++ b/xen/include/xen/iommu.h
> @@ -329,7 +329,8 @@ struct iommu_ops {
>   */
>  extern int iommu_add_extra_reserved_device_memory(unsigned long start,
>unsigned long nr,
> -  pci_sbdf_t sbdf);
> +  pci_sbdf_t sbdf,
> +      const char *name);
>  /*
>   * To be called by specific IOMMU driver during initialization,
>   * to fetch ranges registered with iommu_add_extra_reserved_device_memory().
> -- 
> 2.43.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v3 1/2] IOMMU: store name for extra reserved device memory

2024-03-26 Thread Marek Marczykowski-Górecki
It will be useful for error reporting in a subsequent patch.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Jan Beulich 
---
New in v2
---
 xen/drivers/char/xhci-dbc.c | 3 ++-
 xen/drivers/passthrough/iommu.c | 5 -
 xen/include/xen/iommu.h | 3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 3bf389be7d0b..8e2037f1a5f7 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void)
 iommu_add_extra_reserved_device_memory(
 PFN_DOWN(virt_to_maddr(_dma_bufs)),
 PFN_UP(sizeof(dbc_dma_bufs)),
-uart->dbc.sbdf);
+uart->dbc.sbdf,
+"XHCI console");
 serial_register_uart(SERHND_XHCI, _uart_driver, _uart);
 }
 }
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 996c31be1284..03587c0cd680 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -682,6 +682,7 @@ struct extra_reserved_range {
 unsigned long start;
 unsigned long nr;
 pci_sbdf_t sbdf;
+const char *name;
 };
 static unsigned int __initdata nr_extra_reserved_ranges;
 static struct extra_reserved_range __initdata
@@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata
 
 int __init iommu_add_extra_reserved_device_memory(unsigned long start,
   unsigned long nr,
-  pci_sbdf_t sbdf)
+  pci_sbdf_t sbdf,
+  const char *name)
 {
 unsigned int idx;
 
@@ -700,6 +702,7 @@ int __init iommu_add_extra_reserved_device_memory(unsigned 
long start,
 extra_reserved_ranges[idx].start = start;
 extra_reserved_ranges[idx].nr = nr;
 extra_reserved_ranges[idx].sbdf = sbdf;
+extra_reserved_ranges[idx].name = name;
 
 return 0;
 }
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 9621459c63ee..b7829dff4588 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -329,7 +329,8 @@ struct iommu_ops {
  */
 extern int iommu_add_extra_reserved_device_memory(unsigned long start,
   unsigned long nr,
-  pci_sbdf_t sbdf);
+  pci_sbdf_t sbdf,
+  const char *name);
 /*
  * To be called by specific IOMMU driver during initialization,
  * to fetch ranges registered with iommu_add_extra_reserved_device_memory().
-- 
2.43.0




[PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map

2024-03-26 Thread Marek Marczykowski-Górecki
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
map. This should be true for addresses coming from the firmware, but
when extra pages used by Xen itself are included in the mapping, those
are taken from usable RAM used. Mark those pages as reserved too.

Not marking the pages as reserved didn't caused issues before due to
another a bug in IOMMU driver code, that was fixed in 83afa3135830
("amd-vi: fix IVMD memory type checks").

Failing to reserve memory will lead to panic in IOMMU setup code. And
not including the page in IOMMU mapping will lead to broken console (due
to IOMMU faults). The pages chosen by the XHCI console driver should
still be usable by the CPU though, and the console code already can deal
with too slow console by dropping characters (and console not printing
anything is a special case of "slow"). When reserving fails print an error
message showing which pages failed and who requested them. This should
be enough hint to find why XHCI console doesn't work.

Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI"
Signed-off-by: Marek Marczykowski-Górecki 
---
Alternative error handling could be a panic, but with this version I
think it can be avoided. And not panicing gives a better chance to
actually see the error message (from the hopefully started dom0),
especially as the affected driver is the console one.

The reserve_e820_ram() is x86-specific. Is there some equivalent API for
ARM, or maybe even some abstract one? That said, I have no way to test
XHCI console on ARM, I don't know if such hardware even exists...

Changes in v2:
- move reserving to iommu_get_extra_reserved_device_memory() to cover
  all users of iommu_add_extra_reserved_device_memory()
- change error handling to not panic, as in this code layout it can skip
  sending the pages to the IOMMU driver
Changes in v3:
- code style, typo
---
 xen/drivers/passthrough/iommu.c | 17 +
 xen/include/xen/iommu.h |  5 -
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 03587c0cd680..ba18136c461c 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -22,6 +22,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_X86
+#include 
+#endif
+
 unsigned int __read_mostly iommu_dev_iotlb_timeout = 1000;
 integer_param("iommu_dev_iotlb_timeout", iommu_dev_iotlb_timeout);
 
@@ -715,6 +719,19 @@ int __init 
iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
 
 for ( idx = 0; idx < nr_extra_reserved_ranges; idx++ )
 {
+#ifdef CONFIG_X86
+paddr_t start = pfn_to_paddr(extra_reserved_ranges[idx].start);
+paddr_t end = pfn_to_paddr(extra_reserved_ranges[idx].start +
+   extra_reserved_ranges[idx].nr);
+
+if ( !reserve_e820_ram(, start, end) )
+{
+printk(XENLOG_ERR "Failed to reserve [%"PRIx64"-%"PRIx64") for %s, 
"
+   "skipping IOMMU mapping for it, some functionality may be 
broken\n",
+   start, end, extra_reserved_ranges[idx].name);
+continue;
+}
+#endif
 ret = func(extra_reserved_ranges[idx].start,
extra_reserved_ranges[idx].nr,
extra_reserved_ranges[idx].sbdf.sbdf,
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index b7829dff4588..1f56a6cf456a 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -324,7 +324,8 @@ struct iommu_ops {
 };
 
 /*
- * To be called by Xen internally, to register extra RMRR/IVMD ranges.
+ * To be called by Xen internally, to register extra RMRR/IVMD ranges for RAM
+ * pages.
  * Needs to be called before IOMMU initialization.
  */
 extern int iommu_add_extra_reserved_device_memory(unsigned long start,
@@ -334,6 +335,8 @@ extern int iommu_add_extra_reserved_device_memory(unsigned 
long start,
 /*
  * To be called by specific IOMMU driver during initialization,
  * to fetch ranges registered with iommu_add_extra_reserved_device_memory().
+ * This has a side effect of marking requested ranges as "reserved" in the
+ * memory map.
  */
 extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
   void *ctxt);
-- 
2.43.0




Re: [PATCH v2 1/2] IOMMU: store name for extra reserved device memory

2024-03-26 Thread Marek Marczykowski-Górecki
On Mon, Mar 18, 2024 at 04:52:42PM +0100, Roger Pau Monné wrote:
> On Mon, Mar 18, 2024 at 02:40:21PM +0100, Jan Beulich wrote:
> > On 12.03.2024 17:25, Marek Marczykowski-Górecki wrote:
> > > It will be useful for error reporting in a subsequent patch.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > 
> > In principle
> > Acked-by: Jan Beulich 
> > However, ...
> > 
> > > --- a/xen/drivers/passthrough/iommu.c
> > > +++ b/xen/drivers/passthrough/iommu.c
> > > @@ -682,6 +682,7 @@ struct extra_reserved_range {
> > >  unsigned long start;
> > >  unsigned long nr;
> > >  pci_sbdf_t sbdf;
> > > +const char *name;
> > >  };
> > 
> > ... to me "descr" (or the longer "description") would seem more suitable.
> > Thoughts?
> 
> I'm happy either way, but I don't find 'name' odd.

"descr" sounds a bit weird (even though it's clear what it means),
"name" is a full word. And also, quick grep suggest other places use
"name" for similar purpose (I haven't found a single "descr" nor
"description" struct field).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v2 2/2] drivers/char: mark extra reserved device memory in memory map

2024-03-26 Thread Marek Marczykowski-Górecki
On Mon, Mar 18, 2024 at 02:48:09PM +0100, Jan Beulich wrote:
> On 12.03.2024 17:25, Marek Marczykowski-Górecki wrote:
> > The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
> > map. This should be true for addresses coming from the firmware, but
> > when extra pages used by Xen itself are included in the mapping, those
> > are taken from usable RAM used. Mark those pages as reserved too.
> > 
> > Not marking the pages as reserved didn't caused issues before due to
> > another a bug in IOMMU driver code, that was fixed in 83afa3135830
> > ("amd-vi: fix IVMD memory type checks").
> > 
> > Failing to reserve memory will lead to panic in IOMMU setup code. And
> > not including the page in IOMMU mapping will lead to broken console (due
> > to IOMMU faults). The pages chosen by the XHCI console driver should
> > still be usable by the CPU though, and the console code already can deal
> > with too slow console by dropping characters (and console not printing
> > anything is a special case of "slow"). When reserving fails print an error
> > message showing which pages failed and who requested them. This should
> > be enough hint to find why XHCI console doesn't work.
> > 
> > Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the 
> > XHCI"
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Alternative error handling could be a panic, but with this version I
> > think it can be avoided. And not panicing gives a better chance to
> > actually see the error message (from the hopefully started dom0),
> > especially as the affected driver is the console one.
> > 
> > The reserve_e820_ram() is x86-specific. Is there some equivalent API for
> > ARM, or maybe even some abstract one? That said, I have no way to test
> > XHCI console on ARM, I don't know if such hardware even exists...
> 
> These are normal PCI devices, so I don't see why they shouldn't be usable
> on non-x86 systems. But this is all okay as long as XHCI depends on X86
> in Kconfig.

That's why I'm asking for similar API for ARM. The x86-specific part
here is only about IOMMU, other parts should work just fine (but as I
said, I have no way to test).

Anyway, I'll leave it to whoever will need this driver on ARM (or other
arch).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v2 1/2] hw/xen: detect when running inside stubdomain

2024-03-26 Thread Marek Marczykowski-Górecki
On Tue, Mar 26, 2024 at 05:06:50PM +, Anthony PERARD wrote:
> On Tue, Mar 05, 2024 at 08:12:29PM +0100, Marek Marczykowski-Górecki wrote:
> > diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
> > index 124dd5f3d6..6bd4e6eb2f 100644
> > --- a/hw/xen/xen-legacy-backend.c
> > +++ b/hw/xen/xen-legacy-backend.c
> > @@ -603,6 +603,20 @@ static void xen_set_dynamic_sysbus(void)
> >  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV);
> >  }
> >  
> > +static bool xen_check_stubdomain(void)
> > +{
> > +char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid);
> > +int32_t dm_domid;
> > +bool is_stubdom = false;
> > +
> > +if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) {
> > +is_stubdom = dm_domid != 0;
> > +}
> > +
> > +g_free(dm_path);
> > +return is_stubdom;
> > +}
> > +
> >  void xen_be_init(void)
> >  {
> >  xenstore = qemu_xen_xs_open();
> > @@ -616,6 +630,8 @@ void xen_be_init(void)
> >  exit(1);
> >  }
> >  
> > +xen_is_stubdomain = xen_check_stubdomain();
> 
> This isn't really a backend specific information, and xen_be_init() is
> all about old backend implementation support. (qdisk which have been the
> first to be rewritten doesn't need xen_be_init(), or shouldn't). Could
> we move the initialisation elsewhere?

I can try to move it, sure.

> Is this relevant PV guests? If not, we could move the initialisation to
> xen_hvm_init_pc().
> 
> Also, avoid having xen_check_stubdomain() depending on
> "xen-legacy-backend", if possible.
> 
> (In xen_hvm_init_pc(), a call to xen_register_ioreq() opens another
> xenstore, as `state->xenstore`.)

And xen_register_ioreq() calls xen_be_init() anyway, so it wouldn't
change much in practice (at least for now)...

> (There's already been effort to build QEMU without legacy backends, that
> stubdom check would break in this scenario.)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: NULL pointer dereference in xenbus_thread->...

2024-03-25 Thread Marek Marczykowski-Górecki
On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > I've noticed in Qubes's CI failure like this:
> > 
> > [  871.271292] BUG: kernel NULL pointer dereference, address: 
> > 
> > [  871.275290] #PF: supervisor read access in kernel mode
> > [  871.277282] #PF: error_code(0x) - not-present page
> > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 
> > [  871.281071] Oops:  [#1] PREEMPT SMP NOPTI
> > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 
> > 6.1.43-1.qubes.fc37.x86_64 #1
> > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > [  871.23] RIP: e030:__wake_up_common+0x4c/0x180
> > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 
> > 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b 
> > <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > [  871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> > [  871.301656] RAX:  RBX: 88810541ce98 RCX: 
> > 
> > [  871.304255] RDX: 0001 RSI: 0003 RDI: 
> > 88810541ce90
> > [  871.306714] RBP: c900400f0280 R08: ffe8 R09: 
> > c900400f7e68
> > [  871.309937] R10: 7ff0 R11: 888100ad3000 R12: 
> > c900400f7e68
> > [  871.312326] R13:  R14:  R15: 
> > 
> > [  871.314647] FS:  () GS:88813ff0() 
> > knlGS:
> > [  871.317677] CS:  1e030 DS:  ES:  CR0: 80050033
> > [  871.319644] CR2:  CR3: 0001067fe000 CR4: 
> > 00040660
> > [  871.321973] Call Trace:
> > [  871.322782]  
> > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > [  871.329147]  ? __die_body.cold+0x8/0xd
> > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > [  871.331691]  ? exc_page_fault+0x70/0x170
> > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > [  871.337183]  ? process_writes+0x240/0x240
> > [  871.338461]  process_msg+0x18e/0x2f0
> > [  871.339627]  xenbus_thread+0x165/0x1c0
> > [  871.340830]  ? cpuusage_read+0x10/0x10
> > [  871.342032]  kthread+0xe9/0x110
> > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > [  871.345020]  ret_from_fork+0x22/0x30
> > [  871.346239]  
> > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio 
> > snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec 
> > snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr 
> > ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 
> > parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul 
> > crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel 
> > sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci 
> > xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm 
> > ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc 
> > xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput 
> > dm_multipath
> > [  871.368892] CR2: 
> > [  871.370160] ---[ end trace  ]---
> > [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> > [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 
> > 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b 
> > <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > [  871.379866] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> > [  871.381689] RAX:  RBX: 88810541ce98 RCX: 
> > 
> > [  871.383971] RDX: 0001 RSI: 0003 RDI: 
> > 88810541ce90
> > [  871.386235] RBP: c900400f0280 R08: ffe8 R09: 
> > c900400f7e68
> > [  871.388521] R10: 7ff0 R11: 888100ad3000 R12: 
> > c900400f7e68
> > [  871.390789] R13:  R14:  R15: 
> > 
> > [  871.393101] FS:  () GS:ff

Status of S0ix with Xen

2024-03-15 Thread Marek Marczykowski-Górecki
Hi,

S0ix came up in a recent discussion, so let me post a small status
update. We have patches that makes Qubes suspend with ~90% S0ix
residency. It's significantly worse than native Linux on the same system
- that gets 99+%, but still a significant progress.
We do have a tricky case though, because we use PCI passthrough, with
stubdomains.

A summary of changes required is posted by Simon in a tracking issue[1],
but let me post it here too:

So currently on my test system (NUC11TNHi5) I'm being able to reach
S0ix residency with only dom0 running for nearly 90 % of the suspend to
idle time. (That's still worse than native Linux with > 99 %.)

Used changes:

 - Xen
   - Disable HPET legacy mode after it has been userd for IOAPIC
 test during boot ([upstream 
discussion](https://lore.kernel.org/xen-devel/cb408368-077d-edb5-b4ad-f80086db4...@invisiblethingslab.com/),
 including minimally required diff)
   - Tiger Lake support in Xen's mwait-idle driver. Unlike Linux'
 version of that driver Xen's only supports the CPUs for which
 it contains a hardcoded config. For the moment I just manualy added the
 config Linux is reading from ACPI. Needs to be tested if this is
 actually required (see also
 [Jan's 
comment](https://lore.kernel.org/xen-devel/f6c27788-bdd9-e5b1-a874-7f48a27c6...@suse.com/))
   - Allow reading of S0ix related MSRs. Not sure if any of them is
 strictly required to get into S0ix states, but at the very
 least you want them to be able to debug things (for example via
 `/sys/kernel/debug/pmc_core/...`)
   - Teach `cpu_idle.c` that Tiger Lake also has PC{8..10} (for proper 
debug output)
   - Disable `ondemand` cpufreq governor (both `powersave` as well as 
`perfromance` work)
 - Linux in dom0
   - Add wakup support to `xen-pirq`
 
([patch](https://lore.kernel.org/xen-devel/20230313134102.3157-1-si...@invisiblethingslab.com/)
 sent upstream)
   - If using a SATA device you need to enable low power mode for
 link management (`echo min_power > 
/sys/class/scsi_host/host0/link_power_management_policy`, tried to get
 it [enabled by 
default](https://lore.kernel.org/linux-ide/20230213102450.1604-1-si...@invisiblethingslab.com/)
 but that's
 [not so 
easy](https://lore.kernel.org/linux-ide/ad02467d-d623-5933-67e0-09925c185...@leemhuis.info))

First tests with a running guest and investigating the difference to
native Linux are progress (the later is probably due to some timer).

Since the above update, few more things came up:
1. Thunderbolt driver in Linux does not support suspending when kernel
is built with CONFIG_HOTPLUG_PCI=n (which is the case in Qubes). We have
a hacky workaround at [2] (the first patch), together with more details
about the issue in the patch description.

2. When toolstack issues "suspend" command (via xenstore
control/shutdown), Linux uses "freeze" path, that doesn't put devices
into low power state. Changing that to "suspend" path helps (the second
patch in [2]), and for our case seems to have no negative consequences.
But we do not use live migration nor save/restore - which is another use
of this feature, and we have not tested if this still works.

3. QEMU emulates (instead of passthrough) power management related
config space. This is fixed at [3]

The relevant patches are all linked in pull requests connected to the issue.

This is still work in progress. Some of the patches were already posted
to relevant mailing lists, other still require some more work and will
be posted later. I'm posting it here mostly to share that we are working
on it. But also, if anybody wants to try it out and maybe help
improving, the patches are out there.

[1] https://github.com/QubesOS/qubes-issues/issues/6411#issuecomment-1538089344
[2] https://github.com/QubesOS/qubes-linux-kernel/pull/910/files
[3] https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/63/files
-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v5 0/7] MSI-X support with qemu in stubdomain, and other related changes

2024-03-13 Thread Marek Marczykowski-Górecki
This series includes changes to make MSI-X working with Linux stubdomain and
especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for
QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting
some registers on the same page as the MSI-X table.
Besides the stubdomain case (of which I care more), this is also necessary for
PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in dom0).

See individual patches for details.

This series include also tests for MSI-X using new approach (by preventing QEMU
access to /dev/mem). But for it to work, it needs QEMU change that
makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1211237368

Marek Marczykowski-Górecki (7):
  x86/msi: passthrough all MSI-X vector ctrl writes to device model
  x86/msi: Extend per-domain/device warning mechanism
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk   |   4 +-
 automation/gitlab-ci/build.yaml |   4 +-
 automation/gitlab-ci/test.yaml  |   4 +-
 automation/scripts/qubes-x86-64.sh  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c | 224 -
 xen/arch/x86/include/asm/msi.h  |  15 +-
 xen/arch/x86/msi.c  |  50 ++-
 xen/common/kernel.c |   1 +-
 xen/include/public/features.h   |   8 +-
 11 files changed, 308 insertions(+), 20 deletions(-)

base-commit: 03cf7ca23e0e876075954c558485b267b7d02406
-- 
git-series 0.9.1



[PATCH v5 7/7] [DO NOT APPLY] switch to alternative artifact repo

2024-03-13 Thread Marek Marczykowski-Górecki
For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 6d2cb18b8883..7dafbf144494 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v5 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-03-13 Thread Marek Marczykowski-Górecki
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as MSI-X table. Device model (especially one in
stubdomain) cannot really handle those, as direct writes to that page is
refused (page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this, requires correlating read/write location with guest
of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
for PV would need to be done separately.

This will be also used to read Pending Bit Array, if it lives on the same
page, making QEMU not needing /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If PBA lives on another page, QEMU will
map it to the guest directly.
If PBA lives on the same page, discard writes and log a message.
Technically, writes outside of PBA could be allowed, but at this moment
the precise location of PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops need the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as default ("not mapped")
value - which simplifies code a bit.

GCC 12.2.1 gets confused about 'desc' variable:

arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized 
[-Werror=maybe-uninitialized]
  553 | if ( desc )
  |^
arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
  537 | const struct msi_desc *desc;
  |^~~~

It's conditional initialization is actually correct (in the case where
it isn't initialized, function returns early), but to avoid
build failure initialize it explicitly to NULL anyway.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c| 209 --
 xen/arch/x86/include/asm/msi.h |   7 +-
 xen/arch/x86/msi.c |  41 +++-
 3 files changed, 247 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 17983789..5ac3d9d9b4cf 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
 return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
 struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
 struct domain *d = v->domain;
 
 list_for_each_entry( entry, >arch.hvm.msixtbl_list, list )
-if ( addr >= entry->gtable &&
- addr < entry->gtable + entry->table_len )
+if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+ PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
 return entry;
 
 return NULL;
@@ -213,6 +217,148 @@ static struct msi_desc *msixtbl_addr_to_desc(
 return NULL;
 }
 
+/*
+ * Returns:
+ *  - ADJACENT_DONT_HANDLE if no handling should be done
+ *  - ADJACENT_DISCARD_WRITE if write should be discarded
+ *  - a fixmap idx to use for handling
+ */
+#define ADJACENT_DONT_HANDLE UINT_MAX
+#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
+static unsigned int adjacent_handle(
+const struct msixtbl_entry *entry, unsigned long addr, bool write)
+{
+unsigned int adj_type;
+struct arch_msix *msix;
+
+if ( !entry || !entry->pdev )
+return ADJACENT_DONT_HANDLE;
+
+   

[PATCH v5 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model

2024-03-13 Thread Marek Marczykowski-Górecki
QEMU needs to know whether clearing maskbit of a vector is really
clearing, or was already cleared before. Currently Xen sends only
clearing that bit to the device model, but not setting it, so QEMU
cannot detect it. Because of that, QEMU is working this around by
checking via /dev/mem, but that isn't the proper approach.

Give all necessary information to QEMU by passing all ctrl writes,
including masking a vector. Advertise the new behavior via
XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem
anymore.

While this commit doesn't move the whole maskbit handling to QEMU (as
discussed on xen-devel as one of the possibilities), it is a necessary
first step anyway. Including telling QEMU it will get all the required
information to do so. The actual implementation would need to include:
 - a hypercall for QEMU to control just maskbit (without (re)binding the
   interrupt again
 - a methor for QEMU to tell Xen it will actually do the work
Those are not part of this series.

Signed-off-by: Marek Marczykowski-Górecki 
---
I did not added any control to enable/disable this new behavior (as
Roger have suggested for possible non-QEMU ioreqs). I don't see how the
new behavior could be problematic for some existing ioreq server (they
already received writes to those addresses, just not all of them),
but if that's really necessary, I can probably add a command line option
to restore previous behavior system-wide.

Changes in v5:
- announce the feature only on x86
- style fixes

Changes in v4:
- ignore unaligned writes with X86EMUL_OKAY
- restructure the code to forward all writes in _msixtbl_write() instead
  of manipulating return value of msixtbl_write() - this makes
  WRITE_LEN4_COMPLETION special case unnecessary
- advertise the changed behavior via XENVER_get_features instead of DMOP
v3:
 - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make
   "flags" parameter IN/OUT
 - move len check back to msixtbl_write() - will be needed there anyway
   in a later patch
v2:
 - passthrough quad writes to emulator too (Jan)
 - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
   #define for this magic value
---
 xen/arch/x86/hvm/vmsi.c   | 19 ++-
 xen/common/kernel.c   |  1 +
 xen/include/public/features.h |  8 
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index adbac965f9f7..17983789 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 unsigned long flags;
 struct irq_desc *desc;
 
-if ( (len != 4 && len != 8) || (address & (len - 1)) )
-return r;
+if ( !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
 
 rcu_read_lock(_rcu_lock);
 
@@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 
 unlock:
 spin_unlock_irqrestore(>lock, flags);
-if ( len == 4 )
-r = X86EMUL_OKAY;
+r = X86EMUL_OKAY;
 
 out:
 rcu_read_unlock(_rcu_lock);
@@ -357,7 +356,17 @@ static int cf_check _msixtbl_write(
 const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
 uint64_t val)
 {
-return msixtbl_write(current, address, len, val);
+/* Ignore invalid length or unaligned writes. */
+if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
+
+/*
+ * This function returns X86EMUL_UNHANDLEABLE even if write is properly
+ * handled, to propagate it to the device model (so it can keep its
+ * internal state in sync).
+ */
+msixtbl_write(current, address, len, val);
+return X86EMUL_UNHANDLEABLE;
 }
 
 static bool cf_check msixtbl_range(
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 08dbaa2a054c..b44b2439ca8e 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -637,6 +637,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
  (1U << XENFEAT_hvm_callback_vector) |
  (has_pirq(d) ? (1U << XENFEAT_hvm_pirqs) : 0);
+fi.submap |= (1U << XENFEAT_dm_msix_all_writes);
 #endif
 if ( !paging_mode_translate(d) || is_domain_direct_mapped(d) )
 fi.submap |= (1U << XENFEAT_direct_mapped);
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 4437f25d2532..880193094713 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -120,6 +120,14 @@
 #define XENFEAT_runstate_phys_area18
 #define XENFEAT_vcpu_time_phys_area   19
 
+/*
+ * If set, Xen will passthrough all MSI-X vector ctrl writes to device model,
+ * not only those unmasking an entry. This allows device model to properly keep
+ * track of the MSI-X t

[PATCH v5 5/7] automation: switch to a wifi card on ADL system

2024-03-13 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on a MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for HVM test, because MSI-X adjacent write is
not supported on PV.

This requires also including drivers and firmware in system for tests.
Remove firmware unrelated to the test, to not increase initrd size too
much (all firmware takes over 100MB compressed).
And finally adjusts test script to handle not only eth0 as a test device,
but also wlan0 and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Stefano Stabellini 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml  | 4 
 automation/scripts/qubes-x86-64.sh  | 7 +++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 8b7b2e4da92d..23a183fad967 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile 
b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile 
b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO 
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1



[PATCH v5 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2024-03-13 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment with removing /dev/mem device node. Full test
for lockdown and stubdomain will come later, when all requirements will
be in place.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Stefano Stabellini 
---
This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU wont have access /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) 
/\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v5 2/7] x86/msi: Extend per-domain/device warning mechanism

2024-03-13 Thread Marek Marczykowski-Górecki
The arch_msix struct had a single "warned" field with a domid for which
warning was issued. Upcoming patch will need similar mechanism for few
more warnings, so change it to save a bit field of issued warnings.

Signed-off-by: Marek Marczykowski-Górecki 
---
Should I add also some helper for the boilerplate the handling requires
now? I tried some macro that also sets the bit field, but I couldn't get
it working. I guess I could make it working with a bitmask in a single
uint8_t - would that be preferred?

New in v5
---
 xen/arch/x86/include/asm/msi.h |  8 +++-
 xen/arch/x86/msi.c |  9 +++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..19b1e6631fdf 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -217,7 +217,13 @@ struct arch_msix {
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
 bool host_maskall, guest_maskall;
-domid_t warned;
+domid_t warned_domid;
+union {
+uint8_t all;
+struct {
+bool maskall   : 1;
+} u;
+} warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..6433df30bd60 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,9 +364,14 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool 
host, bool guest)
 domid_t domid = pdev->domain->domain_id;
 
 maskall = true;
-if ( pdev->msix->warned != domid )
+if ( pdev->msix->warned_domid != domid )
 {
-pdev->msix->warned = domid;
+pdev->msix->warned_domid = domid;
+pdev->msix->warned_kind.all = 0;
+}
+if ( !pdev->msix->warned_kind.u.maskall )
+{
+pdev->msix->warned_kind.u.maskall = true;
 printk(XENLOG_G_WARNING
"cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
desc->irq, domid, >sbdf);
-- 
git-series 0.9.1



[PATCH v5 6/7] [DO NOT APPLY] switch to qemu fork

2024-03-13 Thread Marek Marczykowski-Górecki
This makes tests to use patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1



Re: E820 memory allocation issue on Threadripper platforms

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 05:07:12PM -0400, Jason Andryuk wrote:
> On 2024-03-10 10:06, Marek Marczykowski-Górecki wrote:
> > On Fri, Jan 19, 2024 at 02:40:06PM +0100, Marek Marczykowski-Górecki wrote:
> > > On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote:
> > > > On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich  wrote:
> > > > > On 17.01.2024 07:12, Patrick Plenefisch wrote:
> > > > > > As someone who hasn't built a kernel in over a decade, should I 
> > > > > > figure
> > > > > out
> > > > > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and 
> > > > > > report
> > > > > > back?
> > > > > 
> > > > > That was largely a suggestion to perhaps allow you to gain some
> > > > > workable setup. It would be of interest to us largely for 
> > > > > completeness.
> > > > > 
> > > > 
> > > > Typo aside, setting the boot to 2MiB works! It works better for PV
> > > 
> > > Are there any downsides of running kernel with
> > > CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on
> > > another affected system, and if there aren't any practical downsides,
> > > I'm tempted to change it the default kernel in Qubes OS.
> > 
> > I have the answer here: CONFIG_PHYSICAL_START=0x20 breaks booting
> > Xen in KVM with OVMF. There, the memory map has:
> > (XEN)  00010-0007f type=7 attr=000f
> > (XEN)  00080-000807fff type=10 attr=000f
> > (XEN)  000808000-00080afff type=7 attr=000f
> > (XEN)  00080b000-00080bfff type=10 attr=000f
> > (XEN)  00080c000-00080 type=7 attr=000f
> > (XEN)  00081-0008f type=10 attr=000f
> > (XEN)  00090-0015f type=4 attr=000f
> > 
> > So, starting at 0x100 worked since type=4 (boot service data) is
> > available at that time already, but with 0x20 it conflicts with
> > those AcpiNvs areas around 0x80.
> > 
> > I'm cc-ing Jason since I see he claimed relevant gitlab issue. This
> > conflict at least gives easy test environment with console logged to a
> > file.
> 
> Thanks.  I actually hacked Xen to reserve memory ranges in the e820 to
> repro.
> 
> I claimed the *PVH* Dom0 gitlab issue.  PV is outside of my scope :(

That's not bad either, it means we're getting closer to usable PVH dom0
:)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v2 1/2] IOMMU: store name for extra reserved device memory

2024-03-12 Thread Marek Marczykowski-Górecki
It will be useful for error reporting in a subsequent patch.

Signed-off-by: Marek Marczykowski-Górecki 
---
New in v2
---
 xen/drivers/char/xhci-dbc.c | 3 ++-
 xen/drivers/passthrough/iommu.c | 5 -
 xen/include/xen/iommu.h | 3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 3bf389be7d0b..8e2037f1a5f7 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void)
 iommu_add_extra_reserved_device_memory(
 PFN_DOWN(virt_to_maddr(_dma_bufs)),
 PFN_UP(sizeof(dbc_dma_bufs)),
-uart->dbc.sbdf);
+uart->dbc.sbdf,
+"XHCI console");
 serial_register_uart(SERHND_XHCI, _uart_driver, _uart);
 }
 }
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 996c31be1284..03587c0cd680 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -682,6 +682,7 @@ struct extra_reserved_range {
 unsigned long start;
 unsigned long nr;
 pci_sbdf_t sbdf;
+const char *name;
 };
 static unsigned int __initdata nr_extra_reserved_ranges;
 static struct extra_reserved_range __initdata
@@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata
 
 int __init iommu_add_extra_reserved_device_memory(unsigned long start,
   unsigned long nr,
-  pci_sbdf_t sbdf)
+  pci_sbdf_t sbdf,
+  const char *name)
 {
 unsigned int idx;
 
@@ -700,6 +702,7 @@ int __init iommu_add_extra_reserved_device_memory(unsigned 
long start,
 extra_reserved_ranges[idx].start = start;
 extra_reserved_ranges[idx].nr = nr;
 extra_reserved_ranges[idx].sbdf = sbdf;
+extra_reserved_ranges[idx].name = name;
 
 return 0;
 }
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 9621459c63ee..b7829dff4588 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -329,7 +329,8 @@ struct iommu_ops {
  */
 extern int iommu_add_extra_reserved_device_memory(unsigned long start,
   unsigned long nr,
-  pci_sbdf_t sbdf);
+  pci_sbdf_t sbdf,
+  const char *name);
 /*
  * To be called by specific IOMMU driver during initialization,
  * to fetch ranges registered with iommu_add_extra_reserved_device_memory().
-- 
2.43.0




[PATCH v2 2/2] drivers/char: mark extra reserved device memory in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
map. This should be true for addresses coming from the firmware, but
when extra pages used by Xen itself are included in the mapping, those
are taken from usable RAM used. Mark those pages as reserved too.

Not marking the pages as reserved didn't caused issues before due to
another a bug in IOMMU driver code, that was fixed in 83afa3135830
("amd-vi: fix IVMD memory type checks").

Failing to reserve memory will lead to panic in IOMMU setup code. And
not including the page in IOMMU mapping will lead to broken console (due
to IOMMU faults). The pages chosen by the XHCI console driver should
still be usable by the CPU though, and the console code already can deal
with too slow console by dropping characters (and console not printing
anything is a special case of "slow"). When reserving fails print an error
message showing which pages failed and who requested them. This should
be enough hint to find why XHCI console doesn't work.

Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI"
Signed-off-by: Marek Marczykowski-Górecki 
---
Alternative error handling could be a panic, but with this version I
think it can be avoided. And not panicing gives a better chance to
actually see the error message (from the hopefully started dom0),
especially as the affected driver is the console one.

The reserve_e820_ram() is x86-specific. Is there some equivalent API for
ARM, or maybe even some abstract one? That said, I have no way to test
XHCI console on ARM, I don't know if such hardware even exists...

Changes in v2:
- move reserving to iommu_get_extra_reserved_device_memory() to cover
  all users of iommu_add_extra_reserved_device_memory()
- change error handling to not panic, as in this code layout it can skip
  sending the pages to the IOMMU driver
---
 xen/drivers/passthrough/iommu.c | 19 +++
 xen/include/xen/iommu.h |  5 -
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 03587c0cd680..a311a37a2a03 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -21,6 +21,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_X86
+#include 
+#endif
 
 unsigned int __read_mostly iommu_dev_iotlb_timeout = 1000;
 integer_param("iommu_dev_iotlb_timeout", iommu_dev_iotlb_timeout);
@@ -715,6 +718,22 @@ int __init 
iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
 
 for ( idx = 0; idx < nr_extra_reserved_ranges; idx++ )
 {
+#ifdef CONFIG_X86
+if ( !reserve_e820_ram(
+,
+pfn_to_paddr(extra_reserved_ranges[idx].start),
+pfn_to_paddr(extra_reserved_ranges[idx].start +
+ extra_reserved_ranges[idx].nr)) )
+{
+printk(XENLOG_ERR "Failed to reserve [%"PRIx64"-%"PRIx64") for %s, 
"
+   "skipping IOMMU mapping for it, some functionality may be 
broken\n",
+   pfn_to_paddr(extra_reserved_ranges[idx].start),
+   pfn_to_paddr(extra_reserved_ranges[idx].start +
+extra_reserved_ranges[idx].nr),
+   extra_reserved_ranges[idx].name);
+continue;
+}
+#endif
 ret = func(extra_reserved_ranges[idx].start,
extra_reserved_ranges[idx].nr,
extra_reserved_ranges[idx].sbdf.sbdf,
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index b7829dff4588..875eaeb90167 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -324,7 +324,8 @@ struct iommu_ops {
 };
 
 /*
- * To be called by Xen internally, to register extra RMRR/IVMD ranges.
+ * To be called by Xen internally, to register extra RMRR/IVMD ranges for RAM
+ * pages.
  * Needs to be called before IOMMU initialization.
  */
 extern int iommu_add_extra_reserved_device_memory(unsigned long start,
@@ -334,6 +335,8 @@ extern int iommu_add_extra_reserved_device_memory(unsigned 
long start,
 /*
  * To be called by specific IOMMU driver during initialization,
  * to fetch ranges registered with iommu_add_extra_reserved_device_memory().
+ * This has a side effect of marking requested ranges as "reserverd" in the
+ * memory map.
  */
 extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func,
   void *ctxt);
-- 
2.43.0




Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 03:37:15PM +0100, Jan Beulich wrote:
> On 12.03.2024 15:24, Marek Marczykowski-Górecki wrote:
> > On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote:
> >> On 12.03.2024 11:24, Roger Pau Monné wrote:
> >>>> --- a/xen/arch/x86/setup.c
> >>>> +++ b/xen/arch/x86/setup.c
> >>>> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn 
> >>>> __start_xen(unsigned long mbi_p)
> >>>>  mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> >>>>RANGESETF_prettyprint_hex);
> >>>>  
> >>>> +/* Needs to happen after E820 processing but before IOMMU init */
> >>>> +xhci_dbc_uart_reserve_ram();
> >>>
> >>> Overall it might be better if some generic solution for all users of
> >>> iommu_add_extra_reserved_device_memory() could be implemented,
> >>
> >> +1
> >>
> >>> but I'm
> >>> unsure whether the intention is for the interface to always be used
> >>> against RAM.
> >>
> >> I think we can work from that assumption for now.
> > 
> > One more question - what should be the error handling in this case?
> 
> My first reaction here is - please first propose something that's
> sensible from your perspective, and then we can go from there. That's
> generally easier that discussion without seeing involved code.
> However, ...
> 
> > At
> > this stage, if reserving fails, I can still skip giving this range to
> > the IOMMU driver, which (most likely) will result in IOMMU faults and
> > in-operational device (xhci console). Since I don't know (theoretically)
> > what driver requested the range, the error message can only contain an
> > address and device, so will be a bit less actionable for the user
> > (although it should be quite easy to notice the BDF being the XHCI one).
> > 
> > Panic surely is safer option, but less user friendly, especially since
> > (due to the above) I cannot give explicit hint to disable XHCI console.
> 
> ... reading this I was meaning to ...
> 
> > And kinda independently - I'm tempted to add another field to `struct
> > extra_reserved_range` (and an argument to
> > `iommu_add_extra_reserved_device_memory()`) - textual description, for
> > the error reporting purpose.
> 
> ... suggest minimally this. We may want to go farther, though: The party
> registering the range could also supply a callback, to be invoked in
> case registration fails. That callback could then undo whatever is
> necessary in order to not use the memory range in question.
> 
> That said - isn't all of this over-engineering, as the allocated memory
> range must have come from a valid RAM region? In which case a simple
> BUG_ON() may be all that's needed (and will never trigger in practice,
> unless we truly screwed up somewhere)?

In this case (with a single use of
iommu_add_extra_reserved_device_memory()), it will be valid RAM. But
reserving may fail for other reasons too, for example overflow of the
E820 map.

Undoing things certainly is possible, but quite complicated (none of the
involved serial console APIs support disabling/unregistering a console).
Given the simplicity of the workaround user can do (not enabling xhci
console), I don't think it's worth going this route.

Anyway, I'll post v2 with some version of the above and we can continue
discussion there.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote:
> On 12.03.2024 11:24, Roger Pau Monné wrote:
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> >> long mbi_p)
> >>  mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> >>RANGESETF_prettyprint_hex);
> >>  
> >> +/* Needs to happen after E820 processing but before IOMMU init */
> >> +xhci_dbc_uart_reserve_ram();
> > 
> > Overall it might be better if some generic solution for all users of
> > iommu_add_extra_reserved_device_memory() could be implemented,
> 
> +1
> 
> > but I'm
> > unsure whether the intention is for the interface to always be used
> > against RAM.
> 
> I think we can work from that assumption for now.

One more question - what should be the error handling in this case? At
this stage, if reserving fails, I can still skip giving this range to
the IOMMU driver, which (most likely) will result in IOMMU faults and
in-operational device (xhci console). Since I don't know (theoretically)
what driver requested the range, the error message can only contain an
address and device, so will be a bit less actionable for the user
(although it should be quite easy to notice the BDF being the XHCI one).

Panic surely is safer option, but less user friendly, especially since
(due to the above) I cannot give explicit hint to disable XHCI console.

And kinda independently - I'm tempted to add another field to `struct
extra_reserved_range` (and an argument to
`iommu_add_extra_reserved_device_memory()`) - textual description, for
the error reporting purpose.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 02:09:14PM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Mar 12, 2024 at 01:38:53PM +0100, Jan Beulich wrote:
> > On 12.03.2024 13:02, Marek Marczykowski-Górecki wrote:
> > > BTW should e820_change_range_type() return 1 in case of mapping already
> > > having the right type? Otherwise, if one wants to use
> > > iommu_add_extra_reserved_device_memory() on already reserved memory, the
> > > e820_change_range_type() would fail.
> > 
> > You raised that question on Matrix yesterday, iirc, and I left it
> > unanswered there because it takes archeology to find the answer (or at
> > least get closer to one). And, please don't get me wrong, you could as
> > well do that yourself. (My vague recollection from having dealt with
> > similar code in Linux is that yes, in the example given the function
> > ought to indeed have failed back then. Depending on present uses etc
> > it may well be that we could reconsider, though.)
> 
> I sure can do some archaeology there, I was just hoping any of you would
> know the answer already.

None of the commit messages touching that code give the answer. But
looking around, most callers of reserve_e820_ram() ignore its return
value. One exception is reserving memory for kexec. I guess in that case
it may be intentional to fail if the area is reserved already, as it may
indicate it cannot be used for kexec. Is that correct?

There are also a couple of calls to e820_change_range_type() in tboot
code where it changes E820_RESERVED to E820_UNUSABLE. There, I guess
changing e820_change_range_type() behavior would be okay.

Anyway, since we agree to focus on RAM in this API for now, it shouldn't
matter anymore.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 01:38:53PM +0100, Jan Beulich wrote:
> On 12.03.2024 13:02, Marek Marczykowski-Górecki wrote:
> > On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote:
> >> On 12.03.2024 11:24, Roger Pau Monné wrote:
> >>>> --- a/xen/arch/x86/setup.c
> >>>> +++ b/xen/arch/x86/setup.c
> >>>> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn 
> >>>> __start_xen(unsigned long mbi_p)
> >>>>  mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> >>>>RANGESETF_prettyprint_hex);
> >>>>  
> >>>> +/* Needs to happen after E820 processing but before IOMMU init */
> >>>> +xhci_dbc_uart_reserve_ram();
> >>>
> >>> Overall it might be better if some generic solution for all users of
> >>> iommu_add_extra_reserved_device_memory() could be implemented,
> >>
> >> +1
> > 
> > In that case, is the approach with
> > iommu_get_extra_reserved_device_memory() okay (and a comment that it
> > will have a side effect...) ?
> 
> Counter question: You not having gone that route despite our talking
> about it on Matrix must have a reason. When seeing this approach I
> concluded something got in the way. What's the deal?

I added a note about that in the patch (below commit message):
> As an alternative implementation I have considered changing
> iommu_get_extra_reserved_device_memory() to modify memory map. But that
> may hide (or cause) some other issues when this API will gain some more
> users in the future.

More specifically, if some future use would be on a non-RAM area, or
already reserved area. But if we can ignore those cases for now, I'm
fine with that approach.

> >>> but I'm
> >>> unsure whether the intention is for the interface to always be used
> >>> against RAM.
> >>
> >> I think we can work from that assumption for now.
> > 
> > Ok, I'll add a comment about that. I guess, if needed in the future,
> > iommu_add_extra_reserved_device_memory() can gain an extra parameter to
> > distinguish RAM from non-RAM mappings.
> > 
> > BTW should e820_change_range_type() return 1 in case of mapping already
> > having the right type? Otherwise, if one wants to use
> > iommu_add_extra_reserved_device_memory() on already reserved memory, the
> > e820_change_range_type() would fail.
> 
> You raised that question on Matrix yesterday, iirc, and I left it
> unanswered there because it takes archeology to find the answer (or at
> least get closer to one). And, please don't get me wrong, you could as
> well do that yourself. (My vague recollection from having dealt with
> similar code in Linux is that yes, in the example given the function
> ought to indeed have failed back then. Depending on present uses etc
> it may well be that we could reconsider, though.)

I sure can do some archaeology there, I was just hoping any of you would
know the answer already.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-12 Thread Marek Marczykowski-Górecki
On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote:
> On 12.03.2024 11:24, Roger Pau Monné wrote:
> >> --- a/xen/arch/x86/setup.c
> >> +++ b/xen/arch/x86/setup.c
> >> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned 
> >> long mbi_p)
> >>  mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
> >>RANGESETF_prettyprint_hex);
> >>  
> >> +/* Needs to happen after E820 processing but before IOMMU init */
> >> +xhci_dbc_uart_reserve_ram();
> > 
> > Overall it might be better if some generic solution for all users of
> > iommu_add_extra_reserved_device_memory() could be implemented,
> 
> +1

In that case, is the approach with
iommu_get_extra_reserved_device_memory() okay (and a comment that it
will have a side effect...) ?

> > but I'm
> > unsure whether the intention is for the interface to always be used
> > against RAM.
> 
> I think we can work from that assumption for now.

Ok, I'll add a comment about that. I guess, if needed in the future,
iommu_add_extra_reserved_device_memory() can gain an extra parameter to
distinguish RAM from non-RAM mappings.

BTW should e820_change_range_type() return 1 in case of mapping already
having the right type? Otherwise, if one wants to use
iommu_add_extra_reserved_device_memory() on already reserved memory, the
e820_change_range_type() would fail.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Release signing key still uses SHA1

2024-03-11 Thread Marek Marczykowski-Górecki
Hi,

The key used to sign release tarballs and git tags still uses SHA1 for
its self-signature. Is updated key somewhere already?

SHA1 is starting to be rejected by some tools already, for example
sequoia-sq:

$ sq inspect xen.pub
xen.pub: OpenPGP Certificate.

Fingerprint: 23E3222C145F4475FA8060A783FE14C957E82BD9
 Invalid: No binding signature at time 2024-03-12T02:37:29Z
Public-key algo: RSA
Public-key size: 2048 bits
  Creation time: 2010-04-06 13:55:33 UTC

 UserID: Xen.org Xen tree code signing (signatures on the xen 
hypervisor and tools) 
 Invalid: Policy rejected non-revocation signature 
(PositiveCertification) requiring second pre-image resistance
 because: SHA1 is not considered secure
 Certifications: 7, use --certifications to list


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


4.17.3 download is missing on the website

2024-03-11 Thread Marek Marczykowski-Górecki
Hi,

https://xenproject.org/xen-project-archives/ doesn't include 4.17.3.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map

2024-03-11 Thread Marek Marczykowski-Górecki
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory
map. This should be true for addresses coming from the firmware, but
when extra pages used by Xen itself are included in the mapping, those
are taken from usable RAM used. Mark those pages as reserved too.

Not marking the pages as reserved didn't caused issues before due to
another a bug in IOMMU driver code, that was fixed in 83afa3135830
("amd-vi: fix IVMD memory type checks").

Failing to reserve memory will lead to panic in IOMMU setup code. And
not including the page in IOMMU mapping will lead to broken console (on
due to IOMMU faults). Handling failure with panic() isn't the most user
friendly thing, but at this stage the alternative would require undoing
a lot of console init. Since the user can do it much easier - by simply
not enabling xhci console next time, say that and panic.

Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI"
Signed-off-by: Marek Marczykowski-Górecki 
---
As an alternative implementation I have considered changing
iommu_get_extra_reserved_device_memory() to modify memory map. But that
may hide (or cause) some other issues when this API will gain some more
users in the future.

The reserve_e820_ram() is x86-specific. Is there some equivalent API for
ARM, or maybe even some abstract one? That said, I have no way to test
XHCI console on ARM, I don't know if such hardware even exists...
---
 xen/arch/x86/setup.c|  3 +++
 xen/drivers/char/xhci-dbc.c | 22 ++
 xen/include/xen/serial.h|  2 ++
 3 files changed, 27 insertions(+)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index a21984b1ccd8..8ab89b3710ed 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned long 
mbi_p)
 mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges",
   RANGESETF_prettyprint_hex);
 
+/* Needs to happen after E820 processing but before IOMMU init */
+xhci_dbc_uart_reserve_ram();
+
 xsm_multiboot_init(module_map, mbi);
 
 /*
diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 3bf389be7d0b..e31f3cba7838 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -31,6 +31,9 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_X86
+#include 
+#endif
 
 /* uncomment to have dbc_uart_dump() debug function */
 /* #define DBC_DEBUG 1 */
@@ -1426,6 +1429,25 @@ void __init xhci_dbc_uart_init(void)
 }
 }
 
+void __init xhci_dbc_uart_reserve_ram(void)
+{
+struct dbc *dbc = _uart.dbc;
+
+if ( !dbc->enable )
+return;
+
+#ifdef CONFIG_X86
+if ( !reserve_e820_ram(
+,
+virt_to_maddr(_dma_bufs),
+virt_to_maddr(_dma_bufs) + sizeof(dbc_dma_bufs) - 1) )
+panic("Failed to reserve XHCI DMA buffers (%"PRIx64"-%"PRIx64"), "
+  "disable xhci console to work around\n",
+  virt_to_maddr(_dma_bufs),
+  virt_to_maddr(_dma_bufs) + sizeof(dbc_dma_bufs) - 1);
+#endif
+}
+
 #ifdef DBC_DEBUG
 static void dbc_dump(struct dbc *dbc)
 {
diff --git a/xen/include/xen/serial.h b/xen/include/xen/serial.h
index 3d21207a3d7a..bb48eb8e8bd9 100644
--- a/xen/include/xen/serial.h
+++ b/xen/include/xen/serial.h
@@ -162,8 +162,10 @@ void ns16550_init(int index, struct ns16550_defaults 
*defaults);
 void ehci_dbgp_init(void);
 #ifdef CONFIG_XHCI
 void xhci_dbc_uart_init(void);
+void xhci_dbc_uart_reserve_ram(void);
 #else
 static void inline xhci_dbc_uart_init(void) {}
+static void inline xhci_dbc_uart_reserve_ram(void) {}
 #endif
 
 void arm_uart_init(void);
-- 
2.43.0




Regression with xhci console (was: [PATCH 1/4] amd-vi: fix IVMD memory type checks)

2024-03-10 Thread Marek Marczykowski-Górecki
On Thu, Feb 01, 2024 at 06:01:56PM +0100, Roger Pau Monne wrote:
> The current code that parses the IVMD blocks is relaxed with regard to the
> restriction that such unity regions should always fall into memory ranges
> marked as reserved in the memory map.
> 
> However the type checks for the IVMD addresses are inverted, and as a result
> IVMD ranges falling into RAM areas are accepted.  Note that having such ranges
> in the first place is a firmware bug, as IVMD should always fall into reserved
> ranges.
> 
> Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to 
> be reserved')
> Signed-off-by: Roger Pau Monné 

FYI Xen 4.17.3 with this patch (but not others in the series) causes
panic on boot on Framework 14 AMD laptop:

(XEN)  [4457, 5077efff] (usable)
...
(XEN) AMD-Vi: Warning: IVMD: [4f83f000,4f88) is not (entirely) in 
reserved memory
(XEN) AMD-Vi: Error: IVMD: page at 4f83f000 can't be converted
...
(XEN) Xen BUG at drivers/passthrough/amd/iommu_init.c:1386

Full boot log at https://github.com/QubesOS/qubes-issues/issues/9030
4.17.2 worked fine.
I'll try the whole series (and the follow up one), but I don't expect
any difference.

This looks to be related to XHCI console, which does use the same API to
allow device to DMA into relevant buffers even when the rest of XHCI is
used by dom0 (or even other domain if 'share=yes' is used):
https://gitlab.com/xen-project/xen/-/blob/staging/xen/drivers/char/xhci-dbc.c?ref_type=heads#L1421-1424

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: E820 memory allocation issue on Threadripper platforms

2024-03-10 Thread Marek Marczykowski-Górecki
On Fri, Jan 19, 2024 at 02:40:06PM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote:
> > On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich  wrote:
> > > On 17.01.2024 07:12, Patrick Plenefisch wrote:
> > > > As someone who hasn't built a kernel in over a decade, should I figure
> > > out
> > > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report
> > > > back?
> > >
> > > That was largely a suggestion to perhaps allow you to gain some
> > > workable setup. It would be of interest to us largely for completeness.
> > >
> > 
> > Typo aside, setting the boot to 2MiB works! It works better for PV
> 
> Are there any downsides of running kernel with
> CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on
> another affected system, and if there aren't any practical downsides,
> I'm tempted to change it the default kernel in Qubes OS.

I have the answer here: CONFIG_PHYSICAL_START=0x20 breaks booting
Xen in KVM with OVMF. There, the memory map has:
(XEN)  00010-0007f type=7 attr=000f
(XEN)  00080-000807fff type=10 attr=000f
(XEN)  000808000-00080afff type=7 attr=000f
(XEN)  00080b000-00080bfff type=10 attr=000f
(XEN)  00080c000-00080 type=7 attr=000f
(XEN)  00081-0008f type=10 attr=000f
(XEN)  00090-0015f type=4 attr=000f

So, starting at 0x100 worked since type=4 (boot service data) is
available at that time already, but with 0x20 it conflicts with
those AcpiNvs areas around 0x80.

I'm cc-ing Jason since I see he claimed relevant gitlab issue. This
conflict at least gives easy test environment with console logged to a
file.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [BUG]i2c_hid_acpi broken with 4.17.2 on Framework Laptop 13 AMD

2024-03-06 Thread Marek Marczykowski-Górecki
On Wed, Mar 06, 2024 at 02:37:20PM -0600, Mario Limonciello wrote:
> On 3/6/2024 14:34, Sébastien Chaumat wrote:
> > 
> > 
> > Le mer. 6 mars 2024 à 19:51, Mario Limonciello
> > mailto:mario.limoncie...@amd.com>> a écrit :
> > 
> > On 3/6/2024 12:49, Sébastien Chaumat wrote:
> >  >
> >  >
> >  > Le mer. 6 mars 2024 à 19:08, Mario Limonciello
> >  > mailto:mario.limoncie...@amd.com>
> > <mailto:mario.limoncie...@amd.com
> > <mailto:mario.limoncie...@amd.com>>> a écrit :
> >  >
> >  >     On 3/6/2024 12:05, Sébastien Chaumat wrote:
> >  >      >
> >  >      >
> >  >      > Le mer. 6 mars 2024 à 18:33, Mario Limonciello
> >  >      >  >  >      >     Also; I'd be really interested to hear what happens with
> >  >     s2idle with
> >  >      >     Xen
> >  >      >     now (if it works).
> >  >      >
> >  >      >
> >  >      > suspend to RAM now works :)
> >  >
> >  >     Do you see /sys/power/suspend_stats/last_hw_sleep increasing
> > with your
> >  >     suspend cycle?
> >  >
> >  >
> >  > No,  it does not increase (0).
> >  >
> > 
> > OK, then in that case I suggest you run
> > 
> > https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py 
> > <https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py>
> > in the hypervisor to find out what's wrong.
> > 
> > 
> > It fails with xen (not bare metal) during the prerequisite tests :
> > 
> 
> The script might need some modifications for tests that don't make sense in
> the context of Xen.
> 
> > ✅ GPIO driver `pinctrl_amd` available
> > Traceback (most recent call last):
> >    File "/home/sch/Downloads/amd_s2idle.py", line 2447, in 
> >      test = app.prerequisites()
> >             ^^^
> >    File "/home/sch/Downloads/amd_s2idle.py", line 1938, in prerequisites
> >      if not check():
> >             ^^^
> >    File "/home/sch/Downloads/amd_s2idle.py", line 826, in check_msr
> >      val = read_msr(reg, 0)
> >            
> >    File "/home/sch/Downloads/amd_s2idle.py", line 813, in read_msr
> >      val = struct.unpack("Q", os.read(f, 8))[0]
> >                               ^
> > OSError: [Errno 5] Input/output error
> > 
> > Last line in the log is :
> > 2024-03-06 21:29:33,146 DEBUG: Lockdown: [none] integrity confidentiality
> > 
> 
> Do you have userspace MSR support compiled in?  If not; that could explain
> that problem.  But it's very unlikely you changed the MSRs.
> For the purpose of finding where the suspend problem is, I'd comment out
> that specific check for now.

The s2idle isn't supported in Xen yet. Only S3 is. In fact, I have some
work-in-progress patches for that, I may try testing them on this
system.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v2 2/2] xen: fix stubdom PCI addr

2024-03-05 Thread Marek Marczykowski-Górecki
From: Frédéric Pierret (fepitre) 

When running in a stubdomain, the config space access via sysfs needs to
use BDF as seen inside stubdomain (connected via xen-pcifront), which is
different from the real BDF. For other purposes (hypercall parameters
etc), the real BDF needs to be used.
Get the in-stubdomain BDF by looking up relevant PV PCI xenstore
entries.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v2:
- use xs_node_scanf
- use %d instead of %u to read values written as %d
- add a comment from another iteration of this patch by Jason Andryuk
---
 hw/xen/xen-host-pci-device.c | 69 +++-
 hw/xen/xen-host-pci-device.h |  6 
 2 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 8c6e9a1716..8ea2a5a4af 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -9,6 +9,8 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qemu/cutils.h"
+#include "hw/xen/xen-legacy-backend.h"
+#include "hw/xen/xen-bus-helper.h"
 #include "xen-host-pci-device.h"
 
 #define XEN_HOST_PCI_MAX_EXT_CAP \
@@ -33,13 +35,67 @@
 #define IORESOURCE_PREFETCH 0x1000  /* No side effects */
 #define IORESOURCE_MEM_64   0x0010
 
+/*
+ * Non-passthrough (dom0) accesses are local PCI devices and use the given BDF
+ * Passthough (stubdom) accesses are through PV frontend PCI device.  Those
+ * either have a BDF identical to the backend's BDF (xen-backend.passthrough=1)
+ * or a local virtual BDF (xen-backend.passthrough=0)
+ *
+ * We are always given the backend's BDF and need to lookup the appropriate
+ * local BDF for sysfs access.
+ */
+static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp)
+{
+unsigned int num_devs, len, i;
+unsigned int domain, bus, dev, func;
+char *be_path = NULL;
+char path[80];
+
+be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", );
+if (!be_path)
+goto out;
+
+if (xs_node_scanf(xenstore, 0, be_path, "num_devs", NULL, "%d", _devs) 
!= 1) {
+error_setg(errp, "Failed to read or parse %s/num_devs\n", be_path);
+goto out;
+}
+
+for (i = 0; i < num_devs; i++) {
+snprintf(path, sizeof(path), "dev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s\n", be_path, path);
+goto out;
+}
+if (domain != d->domain ||
+bus != d->bus ||
+dev != d->dev ||
+func!= d->func)
+continue;
+snprintf(path, sizeof(path), "vdev-%d", i);
+if (xs_node_scanf(xenstore, 0, be_path, path, NULL,
+  "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to read or parse %s/%s\n", be_path, path);
+goto out;
+}
+d->local_domain = domain;
+d->local_bus = bus;
+d->local_dev = dev;
+d->local_func = func;
+goto out;
+}
+
+out:
+free(be_path);
+}
+
 static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d,
 const char *name, char *buf, ssize_t size)
 {
 int rc;
 
 rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s",
-  d->domain, d->bus, d->dev, d->func, name);
+  d->local_domain, d->local_bus, d->local_dev, d->local_func, 
name);
 assert(rc >= 0 && rc < size);
 }
 
@@ -342,6 +398,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t 
domain,
 d->dev = dev;
 d->func = func;
 
+if (xen_is_stubdomain) {
+xen_host_pci_fill_local_addr(d, errp);
+if (*errp)
+goto error;
+} else {
+d->local_domain = d->domain;
+d->local_bus = d->bus;
+d->local_dev = d->dev;
+d->local_func = d->func;
+}
+
 xen_host_pci_config_open(d, errp);
 if (*errp) {
 goto error;
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34ecb0..270dcb27f7 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice {
 uint8_t dev;
 uint8_t func;
 
+/* different from the above in case of stubdomain */
+uint16_t local_domain;
+uint8_t local_bus;
+uint8_t local_dev;
+uint8_t local_func;
+
 uint16_t vendor_id;
 uint16_t device_id;
 uint32_t class_code;
-- 
2.43.0




[PATCH v2 1/2] hw/xen: detect when running inside stubdomain

2024-03-05 Thread Marek Marczykowski-Górecki
Introduce global xen_is_stubdomain variable when qemu is running inside
a stubdomain instead of dom0. This will be relevant for subsequent
patches, as few things like accessing PCI config space need to be done
differently.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v2:
- use sigend int for domid to match xenstore_read_int() types, domid is
  in a signed range anyway
- fix code style
---
 hw/xen/xen-legacy-backend.c | 16 
 include/hw/xen/xen.h|  1 +
 system/globals.c|  1 +
 3 files changed, 18 insertions(+)

diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
index 124dd5f3d6..6bd4e6eb2f 100644
--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -603,6 +603,20 @@ static void xen_set_dynamic_sysbus(void)
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV);
 }
 
+static bool xen_check_stubdomain(void)
+{
+char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid);
+int32_t dm_domid;
+bool is_stubdom = false;
+
+if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) {
+is_stubdom = dm_domid != 0;
+}
+
+g_free(dm_path);
+return is_stubdom;
+}
+
 void xen_be_init(void)
 {
 xenstore = qemu_xen_xs_open();
@@ -616,6 +630,8 @@ void xen_be_init(void)
 exit(1);
 }
 
+xen_is_stubdomain = xen_check_stubdomain();
+
 xen_sysdev = qdev_new(TYPE_XENSYSDEV);
 sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), _fatal);
 xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus");
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 37ecc91fc3..ecb89ecfc1 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -36,6 +36,7 @@ enum xen_mode {
 extern uint32_t xen_domid;
 extern enum xen_mode xen_mode;
 extern bool xen_domid_restrict;
+extern bool xen_is_stubdomain;
 
 int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 int xen_set_pci_link_route(uint8_t link, uint8_t irq);
diff --git a/system/globals.c b/system/globals.c
index b6d4e72530..ac27d88bd4 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -62,6 +62,7 @@ bool qemu_uuid_set;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_DISABLED;
 bool xen_domid_restrict;
+bool xen_is_stubdomain;
 struct evtchn_backend_ops *xen_evtchn_ops;
 struct gnttab_backend_ops *xen_gnttab_ops;
 struct foreignmem_backend_ops *xen_foreignmem_ops;
-- 
2.43.0




Link to the 2023 xen summit schedule is broken

2024-03-02 Thread Marek Marczykowski-Górecki
Hi,

The link to the last year schedule at
https://events.linuxfoundation.org/archive/2023/xen-project-summit/ is
broken, it opens a page for upcoming "ONE summit" event that doesn't
look to be related to Xen Summit.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] ns16550: add Asix AX99100 serial card

2024-02-19 Thread Marek Marczykowski-Górecki
On Mon, Feb 19, 2024 at 12:13:18PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Feb 19, 2024 at 09:57:49AM +0100, Jan Beulich wrote:
> > On 18.02.2024 02:34, Marek Marczykowski-Górecki wrote:
> > > @@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst 
> > > uart_config[] =
> > >  .dev_id = 0x7adc,
> > >  .param = param_intel_lpss
> > >  },
> > > +{
> > > +.vendor_id = PCI_VENDOR_ID_ASIX,
> > > +.dev_id = 9100,
> > 
> > As per Linux this is 0x9100.
> 
> Right... but then, maybe the patch isn't needed at all, as it does work
> for me. Maybe what's needed instead is some other patch already in
> staging. Initial attempt that did not work was with 4.17.something.
> I guess setting the fifo size isn't that important.

Indeed, the patch is not needed after all, plain "master" from today
works.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH 2/2] xen: fix stubdom PCI addr

2024-02-19 Thread Marek Marczykowski-Górecki
On Mon, Feb 19, 2024 at 07:16:06PM +0100, Marek Marczykowski-Górecki wrote:
> From: Frédéric Pierret (fepitre) 

This shouldn't be here, it's my patch.

> When running in a stubdomain, the config space access via sysfs needs to
> use BDF as seen inside stubdomain (connected via xen-pcifront), which is
> different from the real BDF. For other purposes (hypercall parameters
> etc), the real BDF needs to be used.
> Get the in-stubdomain BDF by looking up relevant PV PCI xenstore
> entries.
> 
> Signed-off-by: Marek Marczykowski-Górecki 
> ---
>  hw/xen/xen-host-pci-device.c | 77 +++-
>  hw/xen/xen-host-pci-device.h |  6 +++
>  2 files changed, 82 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
> index 8c6e9a1716..3f8a6f84a8 100644
> --- a/hw/xen/xen-host-pci-device.c
> +++ b/hw/xen/xen-host-pci-device.c
> @@ -9,6 +9,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/error.h"
>  #include "qemu/cutils.h"
> +#include "hw/xen/xen-legacy-backend.h"
>  #include "xen-host-pci-device.h"
>  
>  #define XEN_HOST_PCI_MAX_EXT_CAP \
> @@ -33,13 +34,76 @@
>  #define IORESOURCE_PREFETCH 0x1000  /* No side effects */
>  #define IORESOURCE_MEM_64   0x0010
>  
> +static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp)
> +{
> +unsigned int num_devs, len, i;
> +unsigned int domain, bus, dev, func;
> +char *be_path = NULL;
> +char path[80];
> +char *msg = NULL;
> +
> +be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", );
> +if (!be_path)
> +goto err_out;
> +snprintf(path, sizeof(path), "%s/num_devs", be_path);
> +msg = qemu_xen_xs_read(xenstore, 0, path, );
> +if (!msg)
> +goto err_out;
> +
> +if (sscanf(msg, "%u", _devs) != 1) {
> +error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
> +goto err_out;
> +}
> +free(msg);
> +
> +for (i = 0; i < num_devs; i++) {
> +snprintf(path, sizeof(path), "%s/dev-%u", be_path, i);
> +msg = qemu_xen_xs_read(xenstore, 0, path, );
> +if (!msg) {
> +error_setg(errp, "Failed to read %s\n", path);
> +goto err_out;
> +}
> +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) {
> +error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
> +goto err_out;
> +}
> +free(msg);
> +if (domain != d->domain ||
> +bus != d->bus ||
> +dev != d->dev ||
> +func!= d->func)
> +continue;
> +snprintf(path, sizeof(path), "%s/vdev-%u", be_path, i);
> +msg = qemu_xen_xs_read(xenstore, 0, path, );
> +if (!msg) {
> +error_setg(errp, "Failed to read %s\n", path);
> +goto out;
> +}
> +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) {
> +error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
> +goto err_out;
> +}
> +free(msg);
> +d->local_domain = domain;
> +d->local_bus = bus;
> +d->local_dev = dev;
> +d->local_func = func;
> +goto out;
> +}
> +
> +err_out:
> +free(msg);
> +out:
> +free(be_path);
> +}
> +
>  static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d,
>  const char *name, char *buf, ssize_t 
> size)
>  {
>  int rc;
>  
>  rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s",
> -  d->domain, d->bus, d->dev, d->func, name);
> +  d->local_domain, d->local_bus, d->local_dev, 
> d->local_func, name);
>  assert(rc >= 0 && rc < size);
>  }
>  
> @@ -342,6 +406,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, 
> uint16_t domain,
>  d->dev = dev;
>  d->func = func;
>  
> +if (xen_is_stubdomain) {
> +xen_host_pci_fill_local_addr(d, errp);
> +if (*errp)
> +goto error;
> +} else {
> +d->local_domain = d->domain;
> +d->local_bus = d->bus;
> +d->local_dev = d->dev;
> +d->local_func = d->func;
> +}
> +
>  xen_host_pci_config_open(d, errp);
>  if (*errp) {
>  goto error;
> diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
> index 4d8d34ecb0..270dcb27f7 100644
> --- a/hw/xen/xen-host-pci-device.h
> +++ b/hw/xen/xen-host-pci-device.h
> @@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice {
>  uint8_t dev;
>  uint8_t func;
>  
> +/* different from the above in case of stubdomain */
> +uint16_t local_domain;
> +uint8_t local_bus;
> +uint8_t local_dev;
> +uint8_t local_func;
> +
>  uint16_t vendor_id;
>  uint16_t device_id;
>  uint32_t class_code;
> -- 
> 2.43.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH 2/2] xen: fix stubdom PCI addr

2024-02-19 Thread Marek Marczykowski-Górecki
From: Frédéric Pierret (fepitre) 

When running in a stubdomain, the config space access via sysfs needs to
use BDF as seen inside stubdomain (connected via xen-pcifront), which is
different from the real BDF. For other purposes (hypercall parameters
etc), the real BDF needs to be used.
Get the in-stubdomain BDF by looking up relevant PV PCI xenstore
entries.

Signed-off-by: Marek Marczykowski-Górecki 
---
 hw/xen/xen-host-pci-device.c | 77 +++-
 hw/xen/xen-host-pci-device.h |  6 +++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c
index 8c6e9a1716..3f8a6f84a8 100644
--- a/hw/xen/xen-host-pci-device.c
+++ b/hw/xen/xen-host-pci-device.c
@@ -9,6 +9,7 @@
 #include "qemu/osdep.h"
 #include "qapi/error.h"
 #include "qemu/cutils.h"
+#include "hw/xen/xen-legacy-backend.h"
 #include "xen-host-pci-device.h"
 
 #define XEN_HOST_PCI_MAX_EXT_CAP \
@@ -33,13 +34,76 @@
 #define IORESOURCE_PREFETCH 0x1000  /* No side effects */
 #define IORESOURCE_MEM_64   0x0010
 
+static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp)
+{
+unsigned int num_devs, len, i;
+unsigned int domain, bus, dev, func;
+char *be_path = NULL;
+char path[80];
+char *msg = NULL;
+
+be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", );
+if (!be_path)
+goto err_out;
+snprintf(path, sizeof(path), "%s/num_devs", be_path);
+msg = qemu_xen_xs_read(xenstore, 0, path, );
+if (!msg)
+goto err_out;
+
+if (sscanf(msg, "%u", _devs) != 1) {
+error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
+goto err_out;
+}
+free(msg);
+
+for (i = 0; i < num_devs; i++) {
+snprintf(path, sizeof(path), "%s/dev-%u", be_path, i);
+msg = qemu_xen_xs_read(xenstore, 0, path, );
+if (!msg) {
+error_setg(errp, "Failed to read %s\n", path);
+goto err_out;
+}
+if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
+goto err_out;
+}
+free(msg);
+if (domain != d->domain ||
+bus != d->bus ||
+dev != d->dev ||
+func!= d->func)
+continue;
+snprintf(path, sizeof(path), "%s/vdev-%u", be_path, i);
+msg = qemu_xen_xs_read(xenstore, 0, path, );
+if (!msg) {
+error_setg(errp, "Failed to read %s\n", path);
+goto out;
+}
+if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) {
+error_setg(errp, "Failed to parse %s (%s)\n", msg, path);
+goto err_out;
+}
+free(msg);
+d->local_domain = domain;
+d->local_bus = bus;
+d->local_dev = dev;
+d->local_func = func;
+goto out;
+}
+
+err_out:
+free(msg);
+out:
+free(be_path);
+}
+
 static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d,
 const char *name, char *buf, ssize_t size)
 {
 int rc;
 
 rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s",
-  d->domain, d->bus, d->dev, d->func, name);
+  d->local_domain, d->local_bus, d->local_dev, d->local_func, 
name);
 assert(rc >= 0 && rc < size);
 }
 
@@ -342,6 +406,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t 
domain,
 d->dev = dev;
 d->func = func;
 
+if (xen_is_stubdomain) {
+xen_host_pci_fill_local_addr(d, errp);
+if (*errp)
+goto error;
+} else {
+d->local_domain = d->domain;
+d->local_bus = d->bus;
+d->local_dev = d->dev;
+d->local_func = d->func;
+}
+
 xen_host_pci_config_open(d, errp);
 if (*errp) {
 goto error;
diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h
index 4d8d34ecb0..270dcb27f7 100644
--- a/hw/xen/xen-host-pci-device.h
+++ b/hw/xen/xen-host-pci-device.h
@@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice {
 uint8_t dev;
 uint8_t func;
 
+/* different from the above in case of stubdomain */
+uint16_t local_domain;
+uint8_t local_bus;
+uint8_t local_dev;
+uint8_t local_func;
+
 uint16_t vendor_id;
 uint16_t device_id;
 uint32_t class_code;
-- 
2.43.0




[PATCH 1/2] hw/xen: detect when running inside stubdomain

2024-02-19 Thread Marek Marczykowski-Górecki
Introduce global xen_is_stubdomain variable when qemu is running inside
a stubdomain instead of dom0. This will be relevant for subsequent
patches, as few things like accessing PCI config space need to be done
differently.

Signed-off-by: Marek Marczykowski-Górecki 
---
 hw/xen/xen-legacy-backend.c | 15 +++
 include/hw/xen/xen.h|  1 +
 system/globals.c|  1 +
 3 files changed, 17 insertions(+)

diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
index 124dd5f3d6..4a09ea2561 100644
--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -603,6 +603,19 @@ static void xen_set_dynamic_sysbus(void)
 machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV);
 }
 
+static bool xen_check_stubdomain(void)
+{
+char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid);
+uint32_t dm_domid;
+bool is_stubdom = false;
+
+if (!xenstore_read_int(dm_path, "device-model-domid", _domid))
+is_stubdom = dm_domid != 0;
+
+g_free(dm_path);
+return is_stubdom;
+}
+
 void xen_be_init(void)
 {
 xenstore = qemu_xen_xs_open();
@@ -616,6 +629,8 @@ void xen_be_init(void)
 exit(1);
 }
 
+xen_is_stubdomain = xen_check_stubdomain();
+
 xen_sysdev = qdev_new(TYPE_XENSYSDEV);
 sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), _fatal);
 xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus");
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index 37ecc91fc3..ecb89ecfc1 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -36,6 +36,7 @@ enum xen_mode {
 extern uint32_t xen_domid;
 extern enum xen_mode xen_mode;
 extern bool xen_domid_restrict;
+extern bool xen_is_stubdomain;
 
 int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num);
 int xen_set_pci_link_route(uint8_t link, uint8_t irq);
diff --git a/system/globals.c b/system/globals.c
index b6d4e72530..ac27d88bd4 100644
--- a/system/globals.c
+++ b/system/globals.c
@@ -62,6 +62,7 @@ bool qemu_uuid_set;
 uint32_t xen_domid;
 enum xen_mode xen_mode = XEN_DISABLED;
 bool xen_domid_restrict;
+bool xen_is_stubdomain;
 struct evtchn_backend_ops *xen_evtchn_ops;
 struct gnttab_backend_ops *xen_gnttab_ops;
 struct foreignmem_backend_ops *xen_foreignmem_ops;
-- 
2.43.0




Re: [PATCH] ns16550: add Asix AX99100 serial card

2024-02-19 Thread Marek Marczykowski-Górecki
On Mon, Feb 19, 2024 at 09:57:49AM +0100, Jan Beulich wrote:
> On 18.02.2024 02:34, Marek Marczykowski-Górecki wrote:
> > @@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst 
> > uart_config[] =
> >  .dev_id = 0x7adc,
> >  .param = param_intel_lpss
> >  },
> > +{
> > +.vendor_id = PCI_VENDOR_ID_ASIX,
> > +.dev_id = 9100,
> 
> As per Linux this is 0x9100.

Right... but then, maybe the patch isn't needed at all, as it does work
for me. Maybe what's needed instead is some other patch already in
staging. Initial attempt that did not work was with 4.17.something.
I guess setting the fifo size isn't that important.

> > +.param = param_asix_ax99100
> > +},
> >  };
> >  
> >  static int __init
> > diff --git a/xen/include/xen/pci_ids.h b/xen/include/xen/pci_ids.h
> > index e798477a7e23..2a19f4ab7872 100644
> > --- a/xen/include/xen/pci_ids.h
> > +++ b/xen/include/xen/pci_ids.h
> > @@ -11,3 +11,5 @@
> >  #define PCI_VENDOR_ID_BROADCOM   0x14e4
> >  
> >  #define PCI_VENDOR_ID_INTEL  0x8086
> > +
> > +#define PCI_VENDOR_ID_ASIX   0x125b
> 
> Please insert such that numeric sorting is retained.
> 
> Jan

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH] ns16550: add Asix AX99100 serial card

2024-02-17 Thread Marek Marczykowski-Górecki
It's 4-port serial card, each port is presented as a separate function.
According to the specification, it features 256-byte TX FIFO buffer.

Signed-off-by: Marek Marczykowski-Górecki 
---
It's a card plugged into a box that can function as yet another gitlab runner.
---
 xen/drivers/char/ns16550.c | 12 
 xen/include/xen/pci_ids.h  |  2 ++
 2 files changed, 14 insertions(+)

diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index 97bf0985344a..00c0da3f373c 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -92,6 +92,7 @@ struct ns16550_config {
 param_exar_xr17v354,
 param_exar_xr17v358,
 param_intel_lpss,
+param_asix_ax99100,
 } param;
 };
 
@@ -842,6 +843,12 @@ static const struct ns16550_config_param __initconst 
uart_param[] = {
 .mmio = 1,
 .max_ports = 1,
 },
+[param_asix_ax99100] = {
+.reg_width = 1,
+.lsr_mask = UART_LSR_THRE,
+.max_ports = 1,
+.fifo_size = 256,
+},
 };
 
 static const struct ns16550_config __initconst uart_config[] =
@@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst 
uart_config[] =
 .dev_id = 0x7adc,
 .param = param_intel_lpss
 },
+{
+.vendor_id = PCI_VENDOR_ID_ASIX,
+.dev_id = 9100,
+.param = param_asix_ax99100
+},
 };
 
 static int __init
diff --git a/xen/include/xen/pci_ids.h b/xen/include/xen/pci_ids.h
index e798477a7e23..2a19f4ab7872 100644
--- a/xen/include/xen/pci_ids.h
+++ b/xen/include/xen/pci_ids.h
@@ -11,3 +11,5 @@
 #define PCI_VENDOR_ID_BROADCOM   0x14e4
 
 #define PCI_VENDOR_ID_INTEL  0x8086
+
+#define PCI_VENDOR_ID_ASIX   0x125b
-- 
2.43.0




Re: [PATCH] i386: load kernel on xen using DMA

2024-02-16 Thread Marek Marczykowski-Górecki
On Fri, Jun 18, 2021 at 09:54:14AM +0100, Alex Bennée wrote:
> 
> Marek Marczykowski-Górecki  writes:
> 
> > Kernel on Xen is loaded via fw_cfg. Previously it used non-DMA version,
> > which loaded the kernel (and initramfs) byte by byte. Change this
> > to DMA, to load in bigger chunks.
> > This change alone reduces load time of a (big) kernel+initramfs from
> > ~10s down to below 1s.
> >
> > This change was suggested initially here:
> > https://lore.kernel.org/xen-devel/20180216204031.5...@gmail.com/
> > Apparently this alone is already enough to get massive speedup.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> >  hw/i386/pc.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index 8a84b25a03..14e43d4da4 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -839,7 +839,8 @@ void xen_load_linux(PCMachineState *pcms)
> >  
> >  assert(MACHINE(pcms)->kernel_filename != NULL);
> >  
> > -fw_cfg = fw_cfg_init_io(FW_CFG_IO_BASE);
> > +fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4,
> > +_space_memory);
> >  fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus);
> >  rom_set_fw(fw_cfg);
> 
> Gentle ping. The fix looks perfectly sane to me but I don't have any x86
> Xen HW to test this one. Are the x86 maintainers happy to take this on?

Ping...

> 
> FWIW:
> 
> Reviewed-by: Alex Bennée 
> 
> -- 
> Alex Bennée
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers

2024-02-15 Thread Marek Marczykowski
On Thu, Feb 15, 2024 at 09:39:41AM +0100, Jan Beulich wrote:
> On 15.02.2024 03:19, Marek Marczykowski wrote:
> > On Tue, Feb 13, 2024 at 04:44:04PM +0100, Jan Beulich wrote:
> >> On 13.02.2024 16:11, Marek Marczykowski wrote:
> >>> On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote:
> >>>> On 13.02.2024 15:53, Marek Marczykowski wrote:
> >>>>> Generally provides sensible output, for example during boot (it is using
> >>>>> idle vCPU then, right?).
> >>>>
> >>>> Before Dom0 is started: Yes. With the exception of the phase where PV
> >>>> Dom0's page tables are constructed, albeit in that time window
> >>>> guest_cpu_user_regs() shouldn't yield sensible data either. I can only
> >>>> say I'm surprised; since I have no way to properly test with an XHCI
> >>>> debug port, I'd have to see about faking something to convince myself
> >>>> (unless you were to supply example output).
> >>>
> >>> Would you like me to test this series with xhci console?
> >>
> >> The behavior shouldn't really be connected to this series. But yes, 'd'
> >> debug key output (just the part for the CPU the key handling was
> >> actually invoked from) with the xhci debug console would be of
> >> interest, for the case where that CPU at that time runs an idle vCPU.
> > 
> > I managed to press 'd' before dom0 started. Full output at
> > https://gist.github.com/marmarek/a495cd666f4aafed3a5cfcb8393f515b (it's
> > Alder Lake, and smt=off, so CPU numbering is weird).
> > Interestingly, I do _not_ see output for CPU0, where I'd expect the
> > key handler to run... I see all the idle ones, plus one doing memory
> > scrubbing.
> 
> Which is precisely the problem, just in not exactly the manifestation
> I expected. In dump_execstate() we dump host state only if the
> incoming regs don't indicate guest state. Yet for the idle vCPU they
> (wrongly) do here - see how guest_mode() calculates the delta to what
> guest_cpu_user_regs() returns, i.e. 0 when what guest_cpu_user_regs()
> returned is passed in.
> 
> Guest state dumping is suppressed for idle vCPU-s. Hence no output
> at all for the CPU where the key processing was actually invoked
> from.
> 
> > But also, I don't see info about the handling CPU when doing `xl
> > debug-key d`.
> 
> I'm afraid I'm confused, ...
> 
> > At one time, with `xl debug-key d` I got this:
> > 
> > (XEN) *** Dumping CPU6 guest state (d0v7): ***
> > (XEN) [ Xen-4.18-unstable  x86_64  debug=y  Tainted:  M ]
> > (XEN) CPU:6
> > (XEN) RIP:e033:[]
> > (XEN) RFLAGS: 0286   EM: 0   CONTEXT: pv guest (d0v7)
> > (XEN) rax: 0023   rbx: 0005   rcx: 81e1546a
> > (XEN) rdx:    rsi:    rdi: 79147611e010
> > (XEN) rbp: 88810db53200   rsp: c90041c6bde0   r8:  
> > (XEN) r9:     r10:    r11: 0286
> > (XEN) r12: 00305000   r13: 7ffc61097f40   r14: 88810db53200
> > (XEN) r15:    cr0: 80050033   cr4: 00b526e0
> > (XEN) cr3: 0004ae2a7000   cr2: 006d3118
> > (XEN) fsb: 791475b8a380   gsb: 8881897c   gss: 
> > (XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033
> > (XEN) Guest stack trace from rsp=c90041c6bde0:
> > (XEN)0001  c02905a6 0023
> > (XEN)79147611e010   
> > (XEN) 8064129fc747f100 c0291568 00305000
> > (XEN)8064129fc747f100 0005 813f7d4d c90041c6bf58
> > (XEN)c90041c6bf48   
> > (XEN)81e16158 006d3118 c90041c6bf58 0040
> > (XEN)8132f6bb 0006 c90041c6bf58 006d3118
> > (XEN)0255 888102cf8880 888102cf88f0 8108746f
> > (XEN) 0002 c90041c6bf58 c90041c6bf58
> > (XEN)006d3118  0006 
> > (XEN) 81e1a975  
> > (XEN)  829b 0043d9b0
> > (XEN)0043d990 7ffc61097f90 7ffc61097fc0 7ffc61099d16
> > (XEN)006cab40 0246 0001 
> 

Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers

2024-02-14 Thread Marek Marczykowski
On Tue, Feb 13, 2024 at 04:44:04PM +0100, Jan Beulich wrote:
> On 13.02.2024 16:11, Marek Marczykowski wrote:
> > On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote:
> >> On 13.02.2024 15:53, Marek Marczykowski wrote:
> >>> On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote:
> >>>> On 13.02.2024 04:43, Marek Marczykowski wrote:
> >>>>> On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote:
> >>>>>> On 08.02.2024 23:00, Julien Grall wrote:
> >>>>>>> On 05/02/2024 13:27, Jan Beulich wrote:
> >>>>>>>> In preparation of dropping the register parameters from
> >>>>>>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions,
> >>>>>>>> register state needs making available another way for the few key
> >>>>>>>> handlers which need it. Fake IRQ-like state.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Jan Beulich 
> >>>>>>>> ---
> >>>>>>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent 
> >>>>>>>> with
> >>>>>>>> other console poll functions we have, and it's unclear whether that's
> >>>>>>>> actually generally correct.
> >>>>>>>
> >>>>>>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if 
> >>>>>>> run_in_exception() doesn't exist. But looking at the caller, no-on 
> >>>>>>> seems 
> >>>>>>> to care about the 'regs'. So is this just a latent bug?
> >>>>>>
> >>>>>> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists.
> >>>>>> And I can spot any use of guest_user_regs() on the respective generic
> >>>>>> or Arm-specific bug.c paths.
> >>>>>>
> >>>>>>> BTW, do you have an idea why the poll function is not run in an 
> >>>>>>> exception handler?
> >>>>>>
> >>>>>> "The poll function" being which one? If you mean the one in xhci-dbc.c
> >>>>>> then that's why I had Cc-ed Marek. Moving him to To: - maybe that
> >>>>>> manages to finally catch his attention.
> >>>>>
> >>>>> TBH, I don't know. That's part of the original xue patch at
> >>>>> https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch
> >>>>> and it works for me as it is.
> >>>>
> >>>> "Works" meaning what? Doesn't crash on you? Or does also provide
> >>>> sensible output in _all_ cases (i.e. including when e.g. the poll
> >>>> happens to run on an idle vCPU)?
> >>>
> >>> Generally provides sensible output, for example during boot (it is using
> >>> idle vCPU then, right?).
> >>
> >> Before Dom0 is started: Yes. With the exception of the phase where PV
> >> Dom0's page tables are constructed, albeit in that time window
> >> guest_cpu_user_regs() shouldn't yield sensible data either. I can only
> >> say I'm surprised; since I have no way to properly test with an XHCI
> >> debug port, I'd have to see about faking something to convince myself
> >> (unless you were to supply example output).
> > 
> > Would you like me to test this series with xhci console?
> 
> The behavior shouldn't really be connected to this series. But yes, 'd'
> debug key output (just the part for the CPU the key handling was
> actually invoked from) with the xhci debug console would be of
> interest, for the case where that CPU at that time runs an idle vCPU.

I managed to press 'd' before dom0 started. Full output at
https://gist.github.com/marmarek/a495cd666f4aafed3a5cfcb8393f515b (it's
Alder Lake, and smt=off, so CPU numbering is weird).
Interestingly, I do _not_ see output for CPU0, where I'd expect the
key handler to run... I see all the idle ones, plus one doing memory
scrubbing.
But also, I don't see info about the handling CPU when doing `xl
debug-key d`. At one time, with `xl debug-key d` I got this:

(XEN) *** Dumping CPU6 guest state (d0v7): ***
(XEN) [ Xen-4.18-unstable  x86_64  debug=y  Tainted:  M ]
(XEN) CPU:6
(XEN) RIP:e033:[]
(XEN) RFLAGS: 0286   EM: 0   CONTEXT: pv guest (d0v7)
(XEN) rax: 0023   rbx: 0005   rcx: 81e1546a
(XEN) rdx:    rsi:    rdi: 791

Re: [XEN PATCH] build: Replace `which` with `command -v`

2024-02-14 Thread Marek Marczykowski-Górecki
On Wed, Feb 14, 2024 at 02:34:11PM +, Anthony PERARD wrote:
> The `which` command is not standard, may not exist on the build host,
> or may not behave as expected. It is recommanded to use `command -v`
> to find out if a command exist and have it's path, and it's part of a
> POSIX shell standard.
> 
> Signed-off-by: Anthony PERARD 

This fixes build on fedora 39:

Tested-by: Marek Marczykowski-Górecki 

> ---
>  xen/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/Makefile b/xen/Makefile
> index 21832d6402..767e47d6c7 100644
> --- a/xen/Makefile
> +++ b/xen/Makefile
> @@ -25,8 +25,8 @@ export XEN_BUILD_HOST   := $(shell hostname)
>  endif
>  
>  # Best effort attempt to find a python interpreter, defaulting to Python 3 if
> -# available.  Fall back to just `python` if `which` is nowhere to be found.
> -PYTHON_INTERPRETER   := $(word 1,$(shell which python3 python python2 
> 2>/dev/null) python)
> +# available.  Fall back to just `python`.
> +PYTHON_INTERPRETER   := $(word 1,$(shell command -v python3 || command -v 
> python || command -v python2) python)
>  export PYTHON?= $(PYTHON_INTERPRETER)
>  
>  export CHECKPOLICY   ?= checkpolicy
> -- 
> Anthony PERARD
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers

2024-02-13 Thread Marek Marczykowski
On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote:
> On 13.02.2024 15:53, Marek Marczykowski wrote:
> > On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote:
> >> On 13.02.2024 04:43, Marek Marczykowski wrote:
> >>> On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote:
> >>>> On 08.02.2024 23:00, Julien Grall wrote:
> >>>>> On 05/02/2024 13:27, Jan Beulich wrote:
> >>>>>> In preparation of dropping the register parameters from
> >>>>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions,
> >>>>>> register state needs making available another way for the few key
> >>>>>> handlers which need it. Fake IRQ-like state.
> >>>>>>
> >>>>>> Signed-off-by: Jan Beulich 
> >>>>>> ---
> >>>>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent 
> >>>>>> with
> >>>>>> other console poll functions we have, and it's unclear whether that's
> >>>>>> actually generally correct.
> >>>>>
> >>>>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if 
> >>>>> run_in_exception() doesn't exist. But looking at the caller, no-on 
> >>>>> seems 
> >>>>> to care about the 'regs'. So is this just a latent bug?
> >>>>
> >>>> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists.
> >>>> And I can spot any use of guest_user_regs() on the respective generic
> >>>> or Arm-specific bug.c paths.
> >>>>
> >>>>> BTW, do you have an idea why the poll function is not run in an 
> >>>>> exception handler?
> >>>>
> >>>> "The poll function" being which one? If you mean the one in xhci-dbc.c
> >>>> then that's why I had Cc-ed Marek. Moving him to To: - maybe that
> >>>> manages to finally catch his attention.
> >>>
> >>> TBH, I don't know. That's part of the original xue patch at
> >>> https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch
> >>> and it works for me as it is.
> >>
> >> "Works" meaning what? Doesn't crash on you? Or does also provide
> >> sensible output in _all_ cases (i.e. including when e.g. the poll
> >> happens to run on an idle vCPU)?
> > 
> > Generally provides sensible output, for example during boot (it is using
> > idle vCPU then, right?).
> 
> Before Dom0 is started: Yes. With the exception of the phase where PV
> Dom0's page tables are constructed, albeit in that time window
> guest_cpu_user_regs() shouldn't yield sensible data either. I can only
> say I'm surprised; since I have no way to properly test with an XHCI
> debug port, I'd have to see about faking something to convince myself
> (unless you were to supply example output).

Would you like me to test this series with xhci console? Or maybe add
some extra debug prints and include their output? But note, printk from
inside console code generally leads to deadlocks. What I did for some
debugging was to log into some separate buffer and dump it later.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers

2024-02-13 Thread Marek Marczykowski
On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote:
> On 13.02.2024 04:43, Marek Marczykowski wrote:
> > On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote:
> >> On 08.02.2024 23:00, Julien Grall wrote:
> >>> On 05/02/2024 13:27, Jan Beulich wrote:
> >>>> In preparation of dropping the register parameters from
> >>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions,
> >>>> register state needs making available another way for the few key
> >>>> handlers which need it. Fake IRQ-like state.
> >>>>
> >>>> Signed-off-by: Jan Beulich 
> >>>> ---
> >>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent with
> >>>> other console poll functions we have, and it's unclear whether that's
> >>>> actually generally correct.
> >>>
> >>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if 
> >>> run_in_exception() doesn't exist. But looking at the caller, no-on seems 
> >>> to care about the 'regs'. So is this just a latent bug?
> >>
> >> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists.
> >> And I can spot any use of guest_user_regs() on the respective generic
> >> or Arm-specific bug.c paths.
> >>
> >>> BTW, do you have an idea why the poll function is not run in an 
> >>> exception handler?
> >>
> >> "The poll function" being which one? If you mean the one in xhci-dbc.c
> >> then that's why I had Cc-ed Marek. Moving him to To: - maybe that
> >> manages to finally catch his attention.
> > 
> > TBH, I don't know. That's part of the original xue patch at
> > https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch
> > and it works for me as it is.
> 
> "Works" meaning what? Doesn't crash on you? Or does also provide
> sensible output in _all_ cases (i.e. including when e.g. the poll
> happens to run on an idle vCPU)?

Generally provides sensible output, for example during boot (it is using
idle vCPU then, right?).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers

2024-02-12 Thread Marek Marczykowski
On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote:
> On 08.02.2024 23:00, Julien Grall wrote:
> > On 05/02/2024 13:27, Jan Beulich wrote:
> >> In preparation of dropping the register parameters from
> >> serial_[rt]x_interrupt() and in turn from IRQ handler functions,
> >> register state needs making available another way for the few key
> >> handlers which need it. Fake IRQ-like state.
> >>
> >> Signed-off-by: Jan Beulich 
> >> ---
> >> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent with
> >> other console poll functions we have, and it's unclear whether that's
> >> actually generally correct.
> > 
> > Is it? Looking at ns16550_poll() we would pass guest_user_regs() if 
> > run_in_exception() doesn't exist. But looking at the caller, no-on seems 
> > to care about the 'regs'. So is this just a latent bug?
> 
> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists.
> And I can spot any use of guest_user_regs() on the respective generic
> or Arm-specific bug.c paths.
> 
> > BTW, do you have an idea why the poll function is not run in an 
> > exception handler?
> 
> "The poll function" being which one? If you mean the one in xhci-dbc.c
> then that's why I had Cc-ed Marek. Moving him to To: - maybe that
> manages to finally catch his attention.

TBH, I don't know. That's part of the original xue patch at
https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch
and it works for me as it is.

> >> Andrew suggested to move set_irq_regs() to BUGFRAME_run_fn handling;
> >> it's not clear to me whether that would be (a) correct from an abstract
> >> pov (that's exception, not interrupt context after all) 
> > 
> > I agree with that.
> > 
> >> and (b) really beneficial.
> > 
> > I guess this could help to reduce the amount of churn. I can't really 
> > make my mind whether this is worth it or not. So I would keep it as you did.
> 
> Good, thanks.
> 
> Jan

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt

2024-01-29 Thread Marek Marczykowski-Górecki
On Sat, Dec 02, 2023 at 10:46:38PM +0100, Marek Marczykowski-Górecki wrote:
> On Fri, Dec 01, 2023 at 09:29:30AM +0100, Juergen Gross wrote:
> > Hi Marek,
> > 
> > On 30.11.23 03:34, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > > 
> > > While testing 6.7-rc3 on Qubes OS I found several warning like in the
> > > subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure
> > > what exactly triggers the issue, but my guess would be unbinding an
> > > event channel from userspace (closing vchan connection).
> > > 
> > > Specific message:
> > > 
> > > [   83.973377] [ cut here ]
> > > [   83.975523] Interrupt for port 77, but apparently not enabled; 
> > > per-user a0e9f1d1
> > 
> > Just a guess, but I think this might happen when the vchan connection
> > is closed while there is still some traffic. This could result in events
> > triggering in parallel to unbinding the event channel.
> > 
> > Could you please test the attached patch (only compile tested)?
> 
> Unfortunately that doesn't help, I get exactly the same traceback.

Hi, this is still an issue on 6.7.2, any other ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v4 2/2] tools: don't expose XENFEAT_hvm_pirqs by default

2024-01-25 Thread Marek Marczykowski-Górecki
On Thu, Jan 25, 2024 at 10:30:41AM +0100, Roger Pau Monne wrote:
> The HVM pirq feature allows routing interrupts from both physical and emulated
> devices over event channels, this was done a performance improvement.  However
> its usage is fully undocumented, and the only reference implementation is in
> Linux.  It defeats the purpose of local APIC hardware virtualization, because
> when using it interrupts avoid the usage of the local APIC altogether.
> 
> It has also been reported to not work properly with certain devices, at least
> when using some AMD GPUs Linux attempts to route interrupts over event
> channels, but Xen doesn't correctly detect such routing, which leads to the
> hypervisor complaining with:
> 
> (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
> 
> When MSIs are attempted to be routed over event channels the entry delivery
> mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
> inject the interrupt following the native MSI path, and the ExtINT delivery
> mode is not supported.
> 
> Disable HVM PIRQs by default and provide a per-domain option in xl.cfg to
> enable such feature.  Also for backwards compatibility keep the feature 
> enabled
> for any resumed domains that don't have an explicit selection.
> 
> Note that the only user of the feature (Linux) is also able to handle native
> interrupts fine, as the feature was already not used if Xen reported local 
> APIC
> hardware virtualization active.
> 
> Link: https://github.com/QubesOS/qubes-issues/issues/7971
> Signed-off-by: Roger Pau Monné 
> Reviewed-by: Anthony PERARD 

For the python part - it's a bit unfortunate there is no knob to control
the value, but in practice I doubt it would be useful (and especially
for python bindings users), so:

Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v2:
>  - Add changelog entry.
> 
> Changes since v1:
>  - Fix libxl for PV guests.
> ---
>  CHANGELOG.md  |  2 ++
>  docs/man/xl.cfg.5.pod.in  |  7 +++
>  tools/include/libxl.h |  7 +++
>  tools/libs/light/libxl_create.c   |  7 +--
>  tools/libs/light/libxl_types.idl  |  1 +
>  tools/libs/light/libxl_x86.c  | 12 +---
>  tools/python/xen/lowlevel/xc/xc.c |  4 +++-
>  tools/xl/xl_parse.c   |  1 +
>  8 files changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/CHANGELOG.md b/CHANGELOG.md
> index 723d06425431..ddb3ab8db4e7 100644
> --- a/CHANGELOG.md
> +++ b/CHANGELOG.md
> @@ -9,6 +9,8 @@ The format is based on [Keep a 
> Changelog](https://keepachangelog.com/en/1.0.0/)
>  ### Changed
>   - Changed flexible array definitions in public I/O interface headers to not
> use "1" as the number of array elements.
> + - On x86:
> +   - HVM PIRQs are disabled by default.
>  
>  ### Added
>   - On x86:
> diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
> index 2e234b450efb..ea8d41727d8e 100644
> --- a/docs/man/xl.cfg.5.pod.in
> +++ b/docs/man/xl.cfg.5.pod.in
> @@ -2460,6 +2460,13 @@ The viridian option can be specified as a boolean. A 
> value of true (1)
>  is equivalent to the list [ "defaults" ], and a value of false (0) is
>  equivalent to an empty list.
>  
> +=item B
> +
> +Select whether the guest is allowed to route interrupts from devices (either
> +emulated or passed through) over event channels.
> +
> +This option is disabled by default.
> +
>  =back
>  
>  =head3 Emulated VGA Graphics Device
> diff --git a/tools/include/libxl.h b/tools/include/libxl.h
> index 907aa0a3303a..f1652b1664f0 100644
> --- a/tools/include/libxl.h
> +++ b/tools/include/libxl.h
> @@ -608,6 +608,13 @@
>   * executable in order to not run it as the same user as libxl.
>   */
>  
> +/*
> + * LIBXL_HAVE_HVM_PIRQ indicates the presence of the u.hvm.pirq filed in
> + * libxl_domain_build_info that signals whether an HVM guest has accesses to
> + * the XENFEAT_hvm_pirqs feature.
> + */
> +#define LIBXL_HAVE_HVM_PIRQ 1
> +
>  /*
>   * libxl memory management
>   *
> diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
> index ce1d43110336..0008fac607e3 100644
> --- a/tools/libs/light/libxl_create.c
> +++ b/tools/libs/light/libxl_create.c
> @@ -376,6 +376,7 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc,
>  libxl_defbool_setdefault(_info->u.hvm.usb,false);
>  libxl_defbool_setdefault(_info->u.hvm.vkb_device, true);
>  libxl_defbool_setdefault(_info->u.hvm.xen_platform_pci,   true);
> +libxl_defbool_setdefault(_info->u.hvm.pirq,   false);
>  
>  libxl_defbool_setdefault(

Re: Sketch of an idea for handling the "mixed workload" problem

2024-01-22 Thread Marek Marczykowski-Górecki
On Mon, Jan 22, 2024 at 12:25:58PM +, George Dunlap wrote:
> On Mon, Jan 22, 2024 at 12:17 PM Marek Marczykowski-Górecki
>  wrote:
> >
> > On Mon, Jan 22, 2024 at 11:54:14AM +, George Dunlap wrote:
> > > The other issue I have with this (and essentially where I got stuck
> > > developing credit2 in the first place) is testing: how do you ensure
> > > that it has the properties that you expect?
> >
> > Audio is actually quite nice use case at this, since it's quite
> > sensitive for scheduling jitter. I think even a simple "PCI passthrough a
> > sound card and play/record something" should show results. Especially
> > you can measure how hard you can push the system (for example artificial
> > load in other domains) until it breaks.
> 
> Are we going have a gitlab runner which says, "Marek sits in front of
> his test machine and listens to audio for pops"? :-)

Kinda ;)
We have already audio tests in qubes CI. They do more or less the above,
but using our audio virtualization. Play something, record in another
domain, and compare. Running the very same thing in gitlab-ci may be too
complicated (require bringing in some qubes infrastructure to make PV
audio work), but maybe similar test can be done based on qemu-emulated
audio or other pv audio solution?

> > > How do you develop a
> > > "regression test" to make sure that server-based workloads don't have
> > > issues in this sort of case?
> >
> > For this I believe there are several benchmarking methods already,
> > starting with old trusty "Linux kernel build time".
> 
> First of all, AFAICT "Linux kernel bulid time" is not representative
> of almost any actual server workload; and the end-to-end throughput
> completely misses what most server workloads will actually care about,
> like latency.
> 
> Secondly, what you're testing isn't the performance of a single
> workload on an empty system; you're testing how workloads *interact*.
> If you want ideal throughput for a single workload on an empty system,
> use the null scheduler; more complex schedulers are only necessary
> when multiple different workloads interact.

I should have clarified I meant `make -jNN`. But still, that's the same
workload on multiple vCPUs.

> FWIW this was my first stab at trying to be systematic about testing
> the scheduler:
> 
> https://github.com/gwd/schedbench
> 
> The rump kernel project has basically died AFAIK, so anyone trying to
> resurrect this would probably have to try to rebase that bit of it
> against something like XTF or unikernels.
> 
>  -George

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: Sketch of an idea for handling the "mixed workload" problem

2024-01-22 Thread Marek Marczykowski-Górecki
On Mon, Jan 22, 2024 at 11:54:14AM +, George Dunlap wrote:
> The other issue I have with this (and essentially where I got stuck
> developing credit2 in the first place) is testing: how do you ensure
> that it has the properties that you expect?  

Audio is actually quite nice use case at this, since it's quite
sensitive for scheduling jitter. I think even a simple "PCI passthrough a
sound card and play/record something" should show results. Especially
you can measure how hard you can push the system (for example artificial
load in other domains) until it breaks.

> How do you develop a
> "regression test" to make sure that server-based workloads don't have
> issues in this sort of case?

For this I believe there are several benchmarking methods already,
starting with old trusty "Linux kernel build time".

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: E820 memory allocation issue on Threadripper platforms

2024-01-19 Thread Marek Marczykowski-Górecki
On Fri, Jan 19, 2024 at 02:50:38PM +0100, Jan Beulich wrote:
> On 19.01.2024 14:40, Marek Marczykowski-Górecki wrote:
> > On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote:
> >> On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich  wrote:
> >>> On 17.01.2024 07:12, Patrick Plenefisch wrote:
> >>>> As someone who hasn't built a kernel in over a decade, should I figure
> >>> out
> >>>> how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report
> >>>> back?
> >>>
> >>> That was largely a suggestion to perhaps allow you to gain some
> >>> workable setup. It would be of interest to us largely for completeness.
> >>>
> >>
> >> Typo aside, setting the boot to 2MiB works! It works better for PV
> > 
> > Are there any downsides of running kernel with
> > CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on
> > another affected system, and if there aren't any practical downsides,
> > I'm tempted to change it the default kernel in Qubes OS.
> 
> There must have been a reason to make the default 16Mb. You may want
> to fish out the commit doing so ... 

https://git.kernel.org/torvalds/c/ceefccc93932b920

Default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN each to 16 MB,
so that both non-relocatable and relocatable kernels are loaded at
16 MB by a non-relocating bootloader.  This is somewhat hacky, but it
appears to be the only way to do this that does not break some some
set of existing bootloaders.

We want to avoid the bottom 16 MB because of large page breakup,
memory holes, and ZONE_DMA.  Embedded systems may need to reduce this,
or update their bootloaders to be aware of the new min_alignment field.

Large pages (in practice) do not apply to PV dom0, but other points
could in theory. That said, I checked few other systems and I don't see
any reserved regions there (there is large usable region at 0x10,
other reserved regions are near the 4GB boundary).
This isn't very representative sample, though...

> In Qubes, though, I understand
> you're always running with Xen underneath, so unless this same kernel
> is also needed to run in HVM guests, some of whatever the reasons may
> have been may go away.

The same kernel is used for PVH/HVM guests too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: E820 memory allocation issue on Threadripper platforms

2024-01-19 Thread Marek Marczykowski-Górecki
On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote:
> On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich  wrote:
> > On 17.01.2024 07:12, Patrick Plenefisch wrote:
> > > As someone who hasn't built a kernel in over a decade, should I figure
> > out
> > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report
> > > back?
> >
> > That was largely a suggestion to perhaps allow you to gain some
> > workable setup. It would be of interest to us largely for completeness.
> >
> 
> Typo aside, setting the boot to 2MiB works! It works better for PV

Are there any downsides of running kernel with
CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on
another affected system, and if there aren't any practical downsides,
I'm tempted to change it the default kernel in Qubes OS.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: E820 memory allocation issue on Threadripper platforms

2024-01-17 Thread Marek Marczykowski-Górecki
On Tue, Jan 16, 2024 at 10:33:26AM +0100, Jan Beulich wrote:
> ... as per
> 
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x4a0
> 
> there's an overlap with not exactly a hole, but with an
> EfiACPIMemoryNVS region:
> 
> (XEN)  00010-003159fff type=2 attr=000f
> (XEN)  00315a000-003ff type=7 attr=000f
> (XEN)  00400-004045fff type=10 attr=000f
> (XEN)  004046000-009afefff type=7 attr=000f
> 
> (the 3rd of the 4 lines). Considering there's another region higher
> up:
> 
> (XEN)  0a747f000-0a947efff type=10 attr=000f
> 
> I'm inclined to say it is poor firmware (or, far less likely, boot
> loader) behavior to clobber a rather low and entirely arbitrary RAM
> range, rather than consolidating all such regions near the top of
> RAM below 4Gb.

FWIW, we have two more similar reports, with different motherboards and
firmware versions, but the common factor is Threadripper CPU. It doesn't
exclude firmware issue (it can be an issue in some common template, like
edk2?), but makes it a bit less likely.

> There are further such odd regions, btw:
> 
> (XEN)  009aff000-009ff type=0 attr=000f
> ...
> (XEN)  00b00-00b020fff type=0 attr=000f
> 
> If the kernel image was sufficiently much larger, these could become
> a problem as well. Otoh if the kernel wasn't built with
> CONFIG_PHYSICAL_START=0x100, i.e. to start at 16Mb, but at, say,
> 2Mb, things should apparently work even with this unusual memory
> layout (until the kernel would grow enough to again run into that
> very region).

Shouldn't CONFIG_RELOCATABLE=y take care of this? At least in the case
of Qubes OS, it's enabled and the issue still happens.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: Thunderbolt (and other PCI hotplug) isolation

2024-01-16 Thread Marek Marczykowski-Górecki
On Tue, Jan 16, 2024 at 10:58:05AM +0100, Jan Beulich wrote:
> On 16.01.2024 03:20, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > A little background:
> > In Qubes OS we try to isolate external (especially hot pluggable)
> > devices as much as possible. For PCI devices, we do PCI passthrough to
> > dedicated domains (sys-net, sys-usb - mostly the latter). The goal is to
> > prevent unauthorized device to compromise the whole system, especially
> > using DMA (either initiated by a malicious device itself, or by a
> > compromised driver). For the discussion here, lets ignore what happens
> > before Xen starts.
> > 
> > The matter becomes much more complicated for hot plugged devices. I did
> > some test recently, enabled PCI hoplug in dom0 kernel (we have it
> > disabled by default), and this is what I got:
> > 1. Hot plugged devices were properly detected, and dom0 told Xen about
> > them. In my case, it was two PCI bridges and an NVMe disk.
> > 2. New devices were assigned to dom0 automatically.
> > 3. New leaf device (the disk) can be assigned to a HVM domU and seems to 
> > work.
> > 4. The bridges cannot be assigned to a domU.
> > 
> > Now, there are (at least) two problems with the above:
> > i) The second point above: new device automatically gain ability to DMA (at
> > least) into dom0 memory. I guess this should be easy-ish solvable for
> > leaf devices by assigning them to a quarantine domain by default. There
> > is an issue how to decide what devices to handle this way (for example,
> > what about external devices present during Xen/dom0 startup already),
> > but it feels like a problem solvable with some configuration. And of
> > course dom0 will need to be adjusted to not talk to such devices
> > automatically (via drivers blacklisting or similar approach). But for
> > the bridge devices, it's more complicated, basically the second point
> > below.
> > 
> > ii) The fourth point above: an external PCI device remains in dom0
> > (including being able to dom0 into dom0's memory) just because it happen
> > to have some specific bits in its config space set. When considering
> > malicious device, it doesn't even need to function as a bridge - it's
> > just enough to present itself as a bridge, wait for dom0's thunderbolt
> > driver to authorize the device so it gets assigned dom0's IOMMU context,
> > and boom. On the other hand, a bridge has privileged function by
> > design, for example IIUC takes part in discovering devices behind it
> > (which then needs to be properly registered in Xen, assigned IOMMU
> > context etc).
> 
> I may not be following the underlying concept here: If you consider a
> device potentially malicious, why would you even connect it to your
> system? 

The thing is I don't know whether I can trust the device or not. It may
be a device I got from somebody, but also it may be a benign but buggy
device that later got its firmware compromised. Take the example of an
external nvme disk - I got some data on it to process, and the data
processing itself can easily be isolated in a separate domU (so, I don't
need to assume too much trust in that data), but also I'd like to not
assume trust in the disk itself.

Besides that, there is also an attack where somebody plugs in a device
without my knowledge (when working in open space, some conference, etc).
I'd like to avoid situation where screenlocker can be very quickly bypassed
with a "screen unlock via DMA" thunderbolt device.

> And if you mean Dom0 to not drive devices, why would you even
> build the respective drivers for such a Dom0 kernel?

I could exclude a lot of drivers indeed, but there are practical issues
with that: for example I do want to use the _internal_ nvme disk in
dom0.

> > iii) Untested, but it feels like there is a lot of room for various race
> > conditions in the hot plug handling. For example, device must be
> > allowed any DMA only after its IOMMU context is properly configured.
> 
> Isn't that the case already? Any attempt to DMA without respective
> device / context table entry (AMD / Intel terminology) ought to result
> in IOMMU faults.

Ok, so in theory it should be matter of "just" assigning device to
appropriate domain directly, instead of assigning to dom0 first and only
later re-assigning.

> > I
> > believe thunderbolt technically allows that (plain PCIe hotplug most
> > likely not), but my guess is it's not the case currently.
> > 
> > My question is mostly: what can be done about the "ii" problem above?
> > 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Thunderbolt (and other PCI hotplug) isolation

2024-01-15 Thread Marek Marczykowski-Górecki
Hi,

A little background:
In Qubes OS we try to isolate external (especially hot pluggable)
devices as much as possible. For PCI devices, we do PCI passthrough to
dedicated domains (sys-net, sys-usb - mostly the latter). The goal is to
prevent unauthorized device to compromise the whole system, especially
using DMA (either initiated by a malicious device itself, or by a
compromised driver). For the discussion here, lets ignore what happens
before Xen starts.

The matter becomes much more complicated for hot plugged devices. I did
some test recently, enabled PCI hoplug in dom0 kernel (we have it
disabled by default), and this is what I got:
1. Hot plugged devices were properly detected, and dom0 told Xen about
them. In my case, it was two PCI bridges and an NVMe disk.
2. New devices were assigned to dom0 automatically.
3. New leaf device (the disk) can be assigned to a HVM domU and seems to work.
4. The bridges cannot be assigned to a domU.

Now, there are (at least) two problems with the above:
i) The second point above: new device automatically gain ability to DMA (at
least) into dom0 memory. I guess this should be easy-ish solvable for
leaf devices by assigning them to a quarantine domain by default. There
is an issue how to decide what devices to handle this way (for example,
what about external devices present during Xen/dom0 startup already),
but it feels like a problem solvable with some configuration. And of
course dom0 will need to be adjusted to not talk to such devices
automatically (via drivers blacklisting or similar approach). But for
the bridge devices, it's more complicated, basically the second point
below.

ii) The fourth point above: an external PCI device remains in dom0
(including being able to dom0 into dom0's memory) just because it happen
to have some specific bits in its config space set. When considering
malicious device, it doesn't even need to function as a bridge - it's
just enough to present itself as a bridge, wait for dom0's thunderbolt
driver to authorize the device so it gets assigned dom0's IOMMU context,
and boom. On the other hand, a bridge has privileged function by
design, for example IIUC takes part in discovering devices behind it
(which then needs to be properly registered in Xen, assigned IOMMU
context etc).

iii) Untested, but it feels like there is a lot of room for various race
conditions in the hot plug handling. For example, device must be
allowed any DMA only after its IOMMU context is properly configured. I
believe thunderbolt technically allows that (plain PCIe hotplug most
likely not), but my guess is it's not the case currently.

My question is mostly: what can be done about the "ii" problem above?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH 1/2] libxl: Remove cdrom forced QDISK w/ stubdom

2024-01-10 Thread Marek Marczykowski-Górecki
On Tue, Jan 09, 2024 at 03:46:54PM -0500, Jason Andryuk wrote:
> A Linux HVM domain ignores PV block devices with type cdrom.  The
> Windows PV drivers also ignore device-type != "disk".  Therefore QEMU's
> emulated CD-ROM support is used.  This allows ejection and other CD-ROM
> features to work.
> 
> With a stubdom, QEMU is running in the stubdom.  A PV disk is still
> connected into the stubdom, and then QEMU can emulate the CD-ROM into
> the guest.  This removes the need for forcing to a QDISK.  Relax the
> checks to support this.
> 
> Signed-off-by: Jason Andryuk 
> ---
>  tools/libs/light/libxl_disk.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/libs/light/libxl_disk.c b/tools/libs/light/libxl_disk.c
> index b65cad33cc..d1f84ef404 100644
> --- a/tools/libs/light/libxl_disk.c
> +++ b/tools/libs/light/libxl_disk.c
> @@ -192,7 +192,8 @@ static int libxl__device_disk_setdefault(libxl__gc *gc, 
> uint32_t domid,
>  
>  /* Force Qdisk backend for CDROM devices of guests with a device model. 
> */
>  if (disk->is_cdrom != 0 &&
> -libxl__domain_type(gc, domid) == LIBXL_DOMAIN_TYPE_HVM) {
> +libxl__domain_type(gc, domid) == LIBXL_DOMAIN_TYPE_HVM &&
> +!libxl_get_stubdom_id(CTX, domid)) {

Should this check for stubdomain flavor too? I guess it won't really
work with qemu-traditional.
Similar check also wants to be in the next patch, instead of completely
dropping stubdomain check.

>  if (!(disk->backend == LIBXL_DISK_BACKEND_QDISK ||
>disk->backend == LIBXL_DISK_BACKEND_UNKNOWN)) {
>  LOGD(ERROR, domid, "Backend for CD devices on HVM guests must be 
> Qdisk");
> -- 
> 2.43.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: Serial console stuck during boot, unblocked with xl debug-key

2024-01-04 Thread Marek Marczykowski-Górecki
On Thu, Jan 04, 2024 at 12:59:28PM +0100, Jan Beulich wrote:
> On 29.12.2023 10:50, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > This is continuation from matrix chat. There is an occasional failure on
> > qubes-hw2 gitlab runner that console become stuck during boot. I can now
> > reproduce it _much_ more often on another system, and the serial console 
> > output
> > ends with:
> > 
> > (XEN) Allocated console ring of 256 KiB.
> > (XEN) Using HWP for cpufreq
> > (XEN) mwait-idle: does not run on family 6
> > 
> > It should be:
> > 
> > (XEN) Allocated console ring of 256 KiB.
> > (XEN) Using HWP for cpufreq
> > (XEN) mwait-idle: does not run on family 6 model 183
> > (XEN) VMX: Supported advanced features:
> > (XEN)  - APIC MMIO access virtualisation
> > (XEN)  - APIC TPR shadow
> > ...
> > 
> > 
> > Otherwise the system works perfectly fine, the logs are available in
> > full via `xl dmesg` etc. Doing (any?) `xl debug-key` unblocks the
> > console and missing logs gets dumped there too. I narrowed it down to
> > the serial console tx buffer and collected some info with the attacked
> > patch (it collects info still during boot, after the place where it
> > usually breaks). When it works, I get:
> > 
> > (XEN) SERIAL DEBUG: txbufc: 0x5b5, txbufp: 0x9f7, uart intr_works: 1, 
> > serial_txbufsz: 0x4000, tx_ready: 0, lsr_mask: 0x20, msi: 0, io_size: 8, 
> > skipped_interrupts: 0
> > 
> > And when it breaks, I get:
> > 
> > (XEN) SERIAL DEBUG: txbufc: 0x70, txbufp: 0x9fd, uart intr_works: 1, 
> > serial_txbufsz: 0x4000, tx_ready: 16, lsr_mask: 0x20, msi: 0, io_size: 8, 
> > skipped_interrupts: 0
> 
> The only meaningful difference is tx_ready then. Looking at
> ns16550_tx_ready() I wonder whether the LSR reports inconsistent
> values on successive reads (there are at least three separate calls
> to the function out of serial_tx_interrupt() alone). What you didn't
> log is the LSR value itself; from the tx_ready value one can conclude
> though that in the bad case fifo_size was returned, while in the good
> case 0 was passed back. At the first glance this looks backwards, or
> in other words I can't explain why it would be this way round. (I
> assume you've had each case multiple times, and the output was
> sufficiently consistent; that doesn't go without saying as your
> invocation of serial_debug() is competing with the asynchronous
> transmitting of data [if any].) It being this way round might suggest
> that we lost an interrupt.

That is my current hypothesis too. Either at the hw level (being masked
for some reason at some point?) or on sw level (somehow not calling the
handler - that's why adding skipped_interrupts).

> Is this a real serial port, or one mimicked
> by a BMC (SoL or alike)?

This one is a real serial port. It isn't fully reproducible, but
happened sufficiently often that I'm quite sure of the above info.
Yes, my serial_debug() can interfere with data transfer, but I
intentionally added it significantly later than the issue happens (I
realize that console output end may not directly coincide with actual
time of the problem due to async sending, but still IMO should
be good enough). I later moved it to keyhandler, but that didn't give
any more info.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Serial console stuck during boot, unblocked with xl debug-key

2023-12-29 Thread Marek Marczykowski-Górecki
Hi,

This is continuation from matrix chat. There is an occasional failure on
qubes-hw2 gitlab runner that console become stuck during boot. I can now
reproduce it _much_ more often on another system, and the serial console output
ends with:

(XEN) Allocated console ring of 256 KiB.
(XEN) Using HWP for cpufreq
(XEN) mwait-idle: does not run on family 6

It should be:

(XEN) Allocated console ring of 256 KiB.
(XEN) Using HWP for cpufreq
(XEN) mwait-idle: does not run on family 6 model 183
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
...


Otherwise the system works perfectly fine, the logs are available in
full via `xl dmesg` etc. Doing (any?) `xl debug-key` unblocks the
console and missing logs gets dumped there too. I narrowed it down to
the serial console tx buffer and collected some info with the attacked
patch (it collects info still during boot, after the place where it
usually breaks). When it works, I get:

(XEN) SERIAL DEBUG: txbufc: 0x5b5, txbufp: 0x9f7, uart intr_works: 1, 
serial_txbufsz: 0x4000, tx_ready: 0, lsr_mask: 0x20, msi: 0, io_size: 8, 
skipped_interrupts: 0

And when it breaks, I get:

(XEN) SERIAL DEBUG: txbufc: 0x70, txbufp: 0x9fd, uart intr_works: 1, 
serial_txbufsz: 0x4000, tx_ready: 16, lsr_mask: 0x20, msi: 0, io_size: 8, 
skipped_interrupts: 0

So, I haven't found yet why it stops sending data. I'll continue adding
some debug prints etc, but if anyone has some ideas what is going on,
I'd appreciate hints.


Full `xl dmesg` of a broken case can be seen here:
https://openqa.qubes-os.org/tests/89100/logfile?filename=serial_terminal.txt
Similarly, example working case is here:
https://openqa.qubes-os.org/tests/89099/logfile?filename=serial0.txt

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 5bbed3a36a..39ad935266 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -354,6 +354,8 @@ static struct page_info * __init alloc_chunk(struct domain 
*d,
 return page;
 }
 
+extern void serial_debug(void);
+
 int __init dom0_construct_pv(struct domain *d,
  const module_t *image,
  unsigned long image_headroom,
@@ -412,6 +414,7 @@ int __init dom0_construct_pv(struct domain *d,
 /* Machine address of next candidate page-table page. */
 paddr_t mpt_alloc;
 
+serial_debug();
 printk(XENLOG_INFO "*** Building a PV Dom%d ***\n", d->domain_id);
 
 d->max_pages = ~0U;
diff --git a/xen/drivers/char/serial.c b/xen/drivers/char/serial.c
index 324024c29a..d6a4d62a07 100644
--- a/xen/drivers/char/serial.c
+++ b/xen/drivers/char/serial.c
@@ -71,6 +71,80 @@ void serial_rx_interrupt(struct serial_port *port, struct 
cpu_user_regs *regs)
 (*fn)(c & 0x7f, regs);
 }
 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#ifdef NS16550_PCI
+#include 
+#include 
+#include 
+#endif
+#include 
+#include 
+#include 
+#ifdef CONFIG_HAS_DEVICE_TREE
+#include 
+#endif
+#ifdef CONFIG_X86
+#include 
+#endif
+
+#define NS16550_PCI
+
+struct ns16550 {
+int baud, clock_hz, data_bits, parity, stop_bits, fifo_size, irq;
+u64 io_base;   /* I/O port or memory-mapped I/O address. */
+u64 io_size;
+int reg_shift; /* Bits to shift register offset by */
+int reg_width; /* Size of access to use, the registers
+* themselves are still bytes */
+char __iomem *remapped_io_base;  /* Remapped virtual address of MMIO. */
+/* UART with IRQ line: interrupt-driven I/O. */
+struct irqaction irqaction;
+u8 lsr_mask;
+#ifdef CONFIG_ARM
+struct vuart_info vuart;
+#endif
+/* UART with no IRQ line: periodically-polled I/O. */
+struct timer timer;
+struct timer resume_timer;
+unsigned int timeout_ms;
+bool intr_works;
+bool dw_usr_bsy;
+#ifdef NS16550_PCI
+/* PCI card parameters. */
+bool pb_bdf_enable; /* if =1, pb-bdf effective, port behind bridge */
+bool ps_bdf_enable; /* if =1, ps_bdf effective, port on pci card */
+unsigned int pb_bdf[3]; /* pci bridge BDF */
+unsigned int ps_bdf[3]; /* pci serial port BDF */
+u32 bar;
+u32 bar64;
+u16 cr;
+u8 bar_idx;
+bool msi;
+const struct ns16550_config_param *param; /* Points into .init.*! */
+#endif
+};
+
+int skipped_interrupts = 0;
+
+void serial_debug(void)
+{
+struct serial_port *port = [SERHND_COM1];
+struct ns16550 *uart = port->uart;
+printk("SERIAL DEBUG: txbufc: %#x, txbufp: %#x, uart intr_works: %d, 
serial_txbufsz: %#x, tx_ready: %d, "
+"lsr_mask: %#x, msi: %d, io_size: %ld, skipped_interrupts: %d\n",
+port->txbufc, port->txbufp, uart->intr_works,
+serial_txbufsz,
+port->driver->tx_ready(po

[PATCH] libxl: Disable relocating memory for qemu-xen in stubdomain too

2023-12-26 Thread Marek Marczykowski-Górecki
According to comments (and experiments) qemu-xen cannot handle memory
reolcation done by hvmloader. The code was already disabled when running
qemu-xen in dom0 (see libxl__spawn_local_dm()), but it was missed when
adding qemu-xen support to stubdomain. Adjust libxl__spawn_stub_dm() to
be consistent in this regard.

Reported-by: Neowutran 
Signed-off-by: Marek Marczykowski-Górecki 
---
 tools/libs/light/libxl_dm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c
index 14b593110f7c..ed620a9d8e14 100644
--- a/tools/libs/light/libxl_dm.c
+++ b/tools/libs/light/libxl_dm.c
@@ -2432,6 +2432,16 @@ void libxl__spawn_stub_dm(libxl__egc *egc, 
libxl__stub_dm_spawn_state *sdss)
 "%s",
 
libxl_bios_type_to_string(guest_config->b_info.u.hvm.bios));
 }
+/* Disable relocating memory to make the MMIO hole larger
+ * unless we're running qemu-traditional and vNUMA is not
+ * configured. */
+libxl__xs_printf(gc, XBT_NULL,
+ libxl__sprintf(gc, "%s/hvmloader/allow-memory-relocate",
+libxl__xs_get_dompath(gc, guest_domid)),
+ "%d",
+ guest_config->b_info.device_model_version
+== LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL &&
+ !libxl__vnuma_configured(_config->b_info));
 ret = xc_domain_set_target(ctx->xch, dm_domid, guest_domid);
 if (ret<0) {
 LOGED(ERROR, guest_domid, "setting target domain %d -> %d",
-- 
2.41.0




Re: hvmloader - allow_memory_relocate overlaps

2023-12-22 Thread Marek Marczykowski-Górecki
On Fri, Dec 22, 2023 at 12:35:57PM +0100, Jan Beulich wrote:
> On 21.12.2023 19:08, Neowutran wrote:
> > On 2023-12-19 17:12, Jan Beulich wrote:
> >> On 16.12.2023 08:01, Neowutran wrote:
> >>> I am wondering if the variable "allow_memory_relocate" is still
> >>> relevant today and if its default value is still relevant. 
> >>> Should it be defined to 0 by default instead of 1 (it seems to be a
> >>> workaround for qemu-traditional, so maybe an outdated default value ? ) ? 
> >>
> >> So are you saying you use qemu-trad?
> > No, I am using "qemu_xen" ( from  
> > xenstore-read -> 'device-model =
> > "qemu_xen"' 
> > 
> >> Otherwise isn't libxl suppressing this behavior anyway?
> > If by "isn't libxl suppressing this behavior" you means if libxl is setting
> > the value of "allow_memory_relocate", then the answer is no. 
> > 
> > Following this lead, I checked in what code path
> > "allow_memory_relocate" could be defined. 
> > 
> > It is only defined in one code path, 
> > 
> > In the file "tools/libs/light/libxl_dm.c",
> > in the function "void libxl__spawn_local_dm(libxl__egc *egc, 
> > libxl__dm_spawn_state *dmss)": 
> > 
> > '''
> >  // ...
> > if (b_info->type == LIBXL_DOMAIN_TYPE_HVM) {
> >  // ...
> > 
> > libxl__xs_printf(gc, XBT_NULL,
> >  GCSPRINTF("%s/hvmloader/allow-memory-relocate", 
> > path),
> >  "%d",
> >  
> > b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL
> >  &&
> >  !libxl__vnuma_configured(b_info));
> > // ...
> > '''
> > 
> > However, on QubesOS (my system), "local_dm" is never used, "stub_dm"
> > is always used. 
> > 
> > In the function "void lib 
> > xl__spawn_stub_dm(libxl__egc *egc, libxl__stub_dm_spawn_state *sdss)", 
> > the value of "allow_memory_relocate" is never defined. 
> > 
> > I tried to patch the code to define "allow_memory_relocate" in
> > "libxl__spawn_stub_dm": 
> > 
> > '''
> > --- a/tools/libs/light/libxl_dm.c
> > +++ b/tools/libs/light/libxl_dm.c
> > @@ -2431,6 +2431,10 @@ void libxl__spawn_stub_dm(libxl__egc *egc, 
> > libxl__stub_dm_spawn_state *sdss)
> > libxl__xs_get_dompath(gc, 
> > guest_domid)),
> >  "%s",
> >  
> > libxl_bios_type_to_string(guest_config->b_info.u.hvm.bios));
> > +libxl__xs_printf(gc, XBT_NULL,
> > + libxl__sprintf(gc, 
> > "%s/hvmloader/allow-memory-relocate", libxl__xs_get_dompath(gc, 
> > guest_domid)),
> > + "%d",
> > + 0);
> >  }
> >  ret = xc_domain_set_target(ctx->xch, dm_domid, guest_domid);
> >  if (ret<0) {
> > '''
> > 
> > And it is indeed another way to solve my issue. 
> > However I do not understand the xen hypervisor enough to known i 
> > f
> > "allow-memory-relocate" should have been defined in
> > "libxl__spawn_stub_dm" or if the real issue is somewhere else. 
> 
> I think it should be done the same no matter where qemu runs. Back at the
> time iirc only qemu-trad could run in a stubdom, which may explain the
> omission. Cc-ing the two guys who are likely in the best position to tell
> for sure.

This indeed looks like a missing piece of qemu-xen support in
stubdomain.
But also, As Jan pointed out, this would be better fixed at the qemu
side so memory relocation could be re-enabled.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] docs: Document a policy for when to deviate from specifications

2023-12-13 Thread Marek Marczykowski-Górecki
On Mon, Sep 18, 2023 at 01:28:16PM +0100, George Dunlap wrote:
> There is an ongoing disagreement among maintainers for how Xen should
> handle deviations to specifications such as ACPI or EFI.
> 
> Write up an explicit policy, and include two worked-out examples from
> recent discussions.

Looks very reasonable to me. While it would be nice for every hardware
(or thing in general) to follow specifications, sadly it isn't reality
and Xen doesn't have enough market share to influence vendors in a
meaningful way. So, yes, the policy described below sounds like a
reasonable approach to make things working for end users.

Reviewed-by: Marek Marczykowski-Górecki 

PS As a downstream distributions, we do ship several workarounds that were
rejected upstream on the grounds that "specification says otherwise"
before...

> 
> Signed-off-by: George Dunlap 
> ---
> NB that the technical descriptions of the costs of the accommodations
> or lack thereof I've just gathered from reading the discussions; I'm
> not familiar enough with the details to assert things about them.  So
> please correct any technical issues.
> ---
>  docs/policy/FollowingSpecifications.md | 219 +
>  1 file changed, 219 insertions(+)
>  create mode 100644 docs/policy/FollowingSpecifications.md
> 
> diff --git a/docs/policy/FollowingSpecifications.md 
> b/docs/policy/FollowingSpecifications.md
> new file mode 100644
> index 00..a197f01f65
> --- /dev/null
> +++ b/docs/policy/FollowingSpecifications.md
> @@ -0,0 +1,219 @@
> +# Guidelines for following specifications
> +
> +## In general, follow specifications
> +
> +In general, specifications such as ACPI and EFI should be followed.
> +
> +## Accommodate non-compliant systems if it doesn't affect compliant systems
> +
> +Sometimes, however, there occur situations where real systems "in the
> +wild" violate these specifications, or at least our interpretation of
> +them (henceforth called "non-compliant").  If we can accommodate
> +non-compliant systems without affecting any compliant systems, then we
> +should do so.
> +
> +## If accommodation would affect theoretical compliant systems that are
> +   not known to exist, and Linux and/or Windows takes the
> +   accommodation, take the accommodation unless there's a
> +   reason not to.
> +
> +Sometimes, however, there occur situations where real, non-compliant
> +systems "in the wild" cannot be accommodated without affecting
> +theoretical compliant systems; but there are no known theoretical
> +compliant systems which exist.  If Linux and/or Windows take the
> +accommodation, then from a cost/benefits perspective it's probably best
> +for us to take the accommodation as well.
> +
> +This is really a generalization of the next principle; the "reason not
> +to" would be in the form of a cost-benefits analysis as described in
> +the next section showing why the "special case" doesn't apply to the
> +accommodation in question.
> +
> +## If things aren't clear, do a cost-benefits analysis
> +
> +Sometimes, however, things are more complicated or less clear.  In
> +that case, we should do a cost-benefits analysis for a particular
> +accommodation.  Things which should be factored into the analysis:
> +
> +N-1: The number of non-compliant systems that require the accommodation
> + N-1a: The number of known current systems
> + N-1b: The probable number of unknown current systems
> + N-1c: The probable number of unknown future systems
> +
> +N-2 The severity of the effect of non-accommodation on these systems
> +
> +C-1: The number of compliant systems that would be affected by the 
> accommodation
> + C-1a: The number of known current systems
> + C-1b: The probable number of unknown current systems
> + C-1c: The probable number of unknown future systems
> +
> +C-2 The severity of the effect of accommodation on these systems
> +
> +Intuitively, N-1 * N-2 gives us N, the cost of not making the
> +accommodation, and C-1 * C-2 gives us C, the cost of taking the
> +accommodation.  If N > C, then we should take the accommodation; if C >
> +N, then we shouldn't.
> +
> +The idea isn't to come up with actual numbers to plug in here
> +(although that's certainly an option if someone wants to), but to
> +explain the general idea we're trying to get at.
> +
> +A couple of other principles to factor in:
> +
> +Vendors tend to copy themselves and other vendors.  If one or two
> +major vendors are known to create compliant or non-compliant systems
> +in a particular way, then there are likely to be more unknown and
> +future systems which will be affected by / need a similar accommodation
> +respectively;

Re: [PATCH] security-process.pandoc: Statement on issuing XSAs for older versions of Xen

2023-12-11 Thread Marek Marczykowski-Górecki
On Fri, Oct 27, 2023 at 03:26:02PM +0100, George Dunlap wrote:
> We recently had a situation where a security issue was discovered
> which only affected versions of Xen out of security support from an
> upstream perspective.  However, many downstreams (including XenServer
> and SUSE) still had supported products based on the versions affected.
> 
> Specify what the security team will do in this situation in the
> future.  As always, the goal here is to be fair and helpful, without
> adding to the workload of the security team.  Inviting downstreams to
> list versions and ranges, as well as expecting them to be involved in
> the patch, gives organizations without representation in the security
> team the opportunity to decide to engage in the security process.  At
> the same time, it puts he onus of determining which products and which
> versions might be affected, as well as the core work of creating and
> testing a patch, on downstreams.
> 
> Signed-off-by: George Dunlap 

Hi George,

This is interesting proposal, indeed it looks fair, given XenServer and
SUSE basically have this option already. In practice, I'm not sure how
useful that would be for Qubes OS, given we don't consider DoS-only bugs
security issues needing coordinated disclosure. It feels like infoleak
or privesc bugs are either found earlier or affect newer versions too
and in both cases they fall into standard security support anyway. But
that very well might be just an impression due to no such policy
earlier. 

In any case, in Qubes OS we support Xen 4.17 and 4.14 - the latter only
for about 6 months more.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt

2023-12-02 Thread Marek Marczykowski-Górecki
On Fri, Dec 01, 2023 at 09:29:30AM +0100, Juergen Gross wrote:
> Hi Marek,
> 
> On 30.11.23 03:34, Marek Marczykowski-Górecki wrote:
> > Hi,
> > 
> > While testing 6.7-rc3 on Qubes OS I found several warning like in the
> > subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure
> > what exactly triggers the issue, but my guess would be unbinding an
> > event channel from userspace (closing vchan connection).
> > 
> > Specific message:
> > 
> > [   83.973377] [ cut here ]
> > [   83.975523] Interrupt for port 77, but apparently not enabled; per-user 
> > a0e9f1d1
> 
> Just a guess, but I think this might happen when the vchan connection
> is closed while there is still some traffic. This could result in events
> triggering in parallel to unbinding the event channel.
> 
> Could you please test the attached patch (only compile tested)?

Unfortunately that doesn't help, I get exactly the same traceback.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v4 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table

2023-12-01 Thread Marek Marczykowski-Górecki
On Mon, Nov 27, 2023 at 06:00:57PM +0100, Jan Beulich wrote:
> On 24.11.2023 02:47, Marek Marczykowski-Górecki wrote:
> > GCC gets confused about 'desc' variable:
> > 
> > arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
> > arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized 
> > [-Werror=maybe-uninitialized]
> >   553 | if ( desc )
> >   |^
> > arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
> >   537 | const struct msi_desc *desc;
> >   |^~~~
> 
> This could do with also indicating the gcc version. Issues like this
> tend to get fixed over time.

Sure, I'll add it's GCC 12.2.1.
And indeed, GCC 13.2.1 does not complain anymore.

> > +
> > +if ( !msix->adj_access_idx[adj_type] )
> > +{
> > +gprintk(XENLOG_WARNING,
> > +"Page for adjacent(%d) MSI-X table access not initialized 
> > for %pp (addr %#lx, gtable %#lx\n",
> > +adj_type, >pdev->sbdf, addr, entry->gtable);
> > +
> > +return ADJACENT_DONT_HANDLE;
> > +}
> > +
> > +/* If PBA lives on the same page too, discard writes. */
> > +if ( write &&
> > + ((adj_type == ADJ_IDX_LAST &&
> > +   msix->table.last == msix->pba.first) ||
> > +  (adj_type == ADJ_IDX_FIRST &&
> > +   msix->table.first == msix->pba.last)) )
> > +{
> > +gprintk(XENLOG_WARNING,
> > +"MSI-X table and PBA of %pp live on the same page, "
> > +"writing to other registers there is not implemented\n",
> > +>pdev->sbdf);
> 
> Here and above I think verbosity needs limiting to the first instance per
> device per domain.

Is there some clever API for that already, or do I need to introduce
extra variable in some of those structures (msixtbl_entry? pci_dev?) ?

(other requested changes ok)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt

2023-11-29 Thread Marek Marczykowski-Górecki
Hi,

While testing 6.7-rc3 on Qubes OS I found several warning like in the
subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure
what exactly triggers the issue, but my guess would be unbinding an
event channel from userspace (closing vchan connection).

Specific message:

[   83.973377] [ cut here ]
[   83.975523] Interrupt for port 77, but apparently not enabled; per-user 
a0e9f1d1
[   83.979083] WARNING: CPU: 1 PID: 2432 at drivers/xen/evtchn.c:167 
evtchn_interrupt+0xb6/0xc0 [xen_evtchn]
[   83.982869] Modules linked in: joydev intel_rapl_msr intel_rapl_common 
snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg 
snd_intel_sdw_acpi ppdev snd_hda_codec snd_hda_core snd_hwdep snd_seq 
snd_seq_device snd_pcm parport_pc parport pcspkr snd_timer snd e1000e i2c_piix4 
soundcore fuse loop xenfs dm_thin_pool dm_persistent_data dm_bio_prison 
crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic 
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_scsi bochs 
serio_raw drm_vram_helper xhci_pci xhci_pci_renesas floppy drm_ttm_helper 
xhci_hcd ttm ata_generic pata_acpi qemu_fw_cfg virtio_console xen_privcmd 
xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua uinput dm_multipath
[   84.019753] CPU: 1 PID: 2432 Comm: qrexec-client Not tainted 
6.7.0-0.rc3.1.qubes.1.fc37.x86_64 #1
[   84.027045] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[   84.033828] RIP: e030:evtchn_interrupt+0xb6/0xc0 [xen_evtchn]
[   84.037814] Code: ba 01 00 00 00 be 1d 00 00 00 48 8d bb 88 00 00 00 e8 ce 
93 27 c1 eb b4 8b 76 20 48 89 da 48 c7 c7 70 52 21 c0 e8 0a 5c f0 c0 <0f> 0b e9 
61 ff ff ff 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90
[   84.048932] RSP: e02b:c90041e7fd60 EFLAGS: 00010082
[   84.051252] RAX:  RBX: 888104fc9a80 RCX: 0027
[   84.054569] RDX: 88813ff21608 RSI: 0001 RDI: 88813ff21600
[   84.057606] RBP: 88810522a280 R08:  R09: c90041e7fbf8
[   84.060864] R10: 0003 R11: 82f460c8 R12: 888105d5b6a4
[   84.064043] R13: 888105d5b760 R14: 88810522a280 R15: 888105d5b600
[   84.066598] FS:  75ee3eb8ed40() GS:88813ff0() 
knlGS:
[   84.069251] CS:  1e030 DS:  ES:  CR0: 80050033
[   84.071293] CR2: 7b86fa0f0f5c CR3: 000106e18000 CR4: 00040660
[   84.073911] Call Trace:
[   84.074746]  
[   84.075471]  ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn]
[   84.077448]  ? __warn+0x81/0x130
[   84.078671]  ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn]
[   84.080507]  ? report_bug+0x171/0x1a0
[   84.081830]  ? prb_read_valid+0x1b/0x30
[   84.083682]  ? handle_bug+0x41/0x70
[   84.084952]  ? exc_invalid_op+0x17/0x70
[   84.086314]  ? asm_exc_invalid_op+0x1a/0x20
[   84.088520]  ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn]
[   84.090300]  __free_irq+0x114/0x2b0
[   84.091478]  free_irq+0x32/0x70
[   84.092522]  unbind_from_irqhandler+0x31/0xb0
[   84.094035]  evtchn_release+0x2b/0xa0 [xen_evtchn]
[   84.095643]  __fput+0x9a/0x2c0
[   84.096734]  __x64_sys_close+0x3d/0x80
[   84.097995]  do_syscall_64+0x63/0xe0
[   84.099188]  ? __vm_munmap+0xbb/0x150
[   84.100431]  ? syscall_exit_work+0x103/0x130
[   84.101854]  ? syscall_exit_to_user_mode+0x2b/0x40
[   84.103509]  ? do_syscall_64+0x6f/0xe0
[   84.104757]  ? syscall_exit_to_user_mode+0x2b/0x40
[   84.106350]  ? do_syscall_64+0x6f/0xe0
[   84.107587]  ? exit_to_user_mode_prepare+0xbc/0xd0
[   84.109180]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[   84.110863] RIP: 0033:0x75ee3ed15a0a
[   84.111999] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 44 c3 0f 1f 00 48 83 ec 
18 89 7c 24 0c e8 33 bc f8 ff 8b 7c 24 0c 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 36 89 d7 89 44 24 0c e8 93 bc f8 ff 8b 44 24
[   84.118267] RSP: 002b:70ad0f70 EFLAGS: 0293 ORIG_RAX: 
0003
[   84.120756] RAX: ffda RBX: 55d3d8f70970 RCX: 75ee3ed15a0a
[   84.123675] RDX:  RSI:  RDI: 000f
[   84.126012] RBP:  R08:  R09: 
[   84.128366] R10: 75ee3ec31468 R11: 0293 R12: 
[   84.130699] R13:  R14: 70ad2eaa R15: 55d3d8f6f2a0
[   84.133050]  
[   84.133810] ---[ end trace  ]---

Full log: https://openqa.qubes-os.org/tests/86647/logfile?filename=serial0.txt

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [Xen-devel] PV guest with PCI passthrough crash on Xen 4.8.3 inside KVM when booted through OVMF

2023-11-27 Thread Marek Marczykowski-Górecki
On Mon, Nov 27, 2023 at 11:20:36AM +, Frediano Ziglio wrote:
> On Sun, Nov 26, 2023 at 2:51 PM Marek Marczykowski-Górecki
>  wrote:
> >
> > On Mon, Feb 19, 2018 at 06:30:14PM +0100, Juergen Gross wrote:
> > > On 16/02/18 20:02, Andrew Cooper wrote:
> > > > On 16/02/18 18:51, Marek Marczykowski-Górecki wrote:
> > > >> On Fri, Feb 16, 2018 at 05:52:50PM +0000, Andrew Cooper wrote:
> > > >>> On 16/02/18 17:48, Marek Marczykowski-Górecki wrote:
> > > >>>> Hi,
> > > >>>>
> > > >>>> As in the subject, the guest crashes on boot, before kernel output
> > > >>>> anything. I've isolated this to the conditions below:
> > > >>>>  - PV guest have PCI device assigned (e1000e emulated by QEMU in 
> > > >>>> this case),
> > > >>>>without PCI device it works
> > > >>>>  - Xen (in KVM) is started through OVMF; with seabios it works
> > > >>>>  - nested HVM is disabled in KVM
> > > >>>>  - AMD IOMMU emulation is disabled in KVM; when enabled qemu crashes 
> > > >>>> on
> > > >>>>boot (looks like qemu bug, unrelated to this one)
> > > >>>>
> > > >>>> Version info:
> > > >>>>  - KVM host: OpenSUSE 42.3, qemu 2.9.1, 
> > > >>>> ovmf-2017+git1492060560.b6d11d7c46-4.1, AMD
> > > >>>>  - Xen host: Xen 4.8.3, dom0: Linux 4.14.13
> > > >>>>  - Xen domU: Linux 4.14.13, direct boot
> > > >>>>
> > > >>>> Not sure if relevant, but initially I've tried booting xen.efi /mapbs
> > > >>>> /noexitboot and then dom0 kernel crashed saying something about 
> > > >>>> conflict
> > > >>>> between e820 and kernel mapping. But now those options are disabled.
> > > >>>>
> > > >>>> The crash message:
> > > >>>> (XEN) d1v0 Unhandled invalid opcode fault/trap [#6, ec=]
> > > >>>> (XEN) domain_crash_sync called from entry.S: fault at 
> > > >>>> 82d080218720 entry.o#create_bounce_frame+0x137/0x146
> > > >>>> (XEN) Domain 1 (vcpu#0) crashed on cpu#1:
> > > >>>> (XEN) [ Xen-4.8.3  x86_64  debug=n   Not tainted ]
> > > >>>> (XEN) CPU:1
> > > >>>> (XEN) RIP:e033:[]
> > > >>> This is #UD, which is most probably hitting a BUG().  addr2line this ^
> > > >>> to find some code to look at.
> > > >> addr2line failed me
> > > >
> > > > By default, vmlinux is stripped and compressed.  Ideally you want to
> > > > addr2line the vmlinux artefact in the root of your kernel build, which
> > > > is the plain elf with debugging symbols.
> > > >
> > > > Alternatively, use scripts/extract-vmlinux on the binary you actually
> > > > booted, which might get you somewhere.
> > > >
> > > >> , but System.map says its xen_memory_setup. And it
> > > >> looks like the BUG() is the same as I had in dom0 before:
> > > >> "Xen hypervisor allocated kernel memory conflicts with E820 map".
> > > >
> > > > Juergen: Is there anything we can do to try and insert some dummy
> > > > exception handlers right at PV start, so we could at least print out a
> > > > oneliner to the host console which is a little more helpful than Xen
> > > > saying "something unknown went wrong" ?
> > >
> > > You mean something like commit 42b3a4cb5609de757f5445fcad18945ba9239a07
> > > added to kernel 4.15?
> > >
> > > >
> > > >>
> > > >> Disabling e820_host in guest config solved the problem. Thanks!
> > > >>
> > > >> Is this some bug in Xen or OVMF, or is it expected behavior and 
> > > >> e820_host
> > > >> should be avoided?
> > > >
> > > > I don't really know.  e820_host is a gross hack which shouldn't really
> > > > be present.  The actually problem is that Linux can't cope with the
> > > > memory layout it was given (and I can't recall if there is anything
> > > > Linux could potentially to do cope).  OTOH, the toolstack, which knew
> > > > about e820_host and chose to lay the guest out in an overlapping way is
> > > > probably also at fault.
> > &g

Re: [Xen-devel] PV guest with PCI passthrough crash on Xen 4.8.3 inside KVM when booted through OVMF

2023-11-26 Thread Marek Marczykowski-Górecki
On Mon, Feb 19, 2018 at 06:30:14PM +0100, Juergen Gross wrote:
> On 16/02/18 20:02, Andrew Cooper wrote:
> > On 16/02/18 18:51, Marek Marczykowski-Górecki wrote:
> >> On Fri, Feb 16, 2018 at 05:52:50PM +, Andrew Cooper wrote:
> >>> On 16/02/18 17:48, Marek Marczykowski-Górecki wrote:
> >>>> Hi,
> >>>>
> >>>> As in the subject, the guest crashes on boot, before kernel output
> >>>> anything. I've isolated this to the conditions below:
> >>>>  - PV guest have PCI device assigned (e1000e emulated by QEMU in this 
> >>>> case),
> >>>>without PCI device it works
> >>>>  - Xen (in KVM) is started through OVMF; with seabios it works
> >>>>  - nested HVM is disabled in KVM
> >>>>  - AMD IOMMU emulation is disabled in KVM; when enabled qemu crashes on
> >>>>boot (looks like qemu bug, unrelated to this one)
> >>>>
> >>>> Version info:
> >>>>  - KVM host: OpenSUSE 42.3, qemu 2.9.1, 
> >>>> ovmf-2017+git1492060560.b6d11d7c46-4.1, AMD
> >>>>  - Xen host: Xen 4.8.3, dom0: Linux 4.14.13
> >>>>  - Xen domU: Linux 4.14.13, direct boot
> >>>>
> >>>> Not sure if relevant, but initially I've tried booting xen.efi /mapbs
> >>>> /noexitboot and then dom0 kernel crashed saying something about conflict
> >>>> between e820 and kernel mapping. But now those options are disabled.
> >>>>
> >>>> The crash message:
> >>>> (XEN) d1v0 Unhandled invalid opcode fault/trap [#6, ec=]
> >>>> (XEN) domain_crash_sync called from entry.S: fault at 82d080218720 
> >>>> entry.o#create_bounce_frame+0x137/0x146
> >>>> (XEN) Domain 1 (vcpu#0) crashed on cpu#1:
> >>>> (XEN) [ Xen-4.8.3  x86_64  debug=n   Not tainted ]
> >>>> (XEN) CPU:1
> >>>> (XEN) RIP:e033:[]
> >>> This is #UD, which is most probably hitting a BUG().  addr2line this ^
> >>> to find some code to look at.
> >> addr2line failed me
> > 
> > By default, vmlinux is stripped and compressed.  Ideally you want to
> > addr2line the vmlinux artefact in the root of your kernel build, which
> > is the plain elf with debugging symbols.
> > 
> > Alternatively, use scripts/extract-vmlinux on the binary you actually
> > booted, which might get you somewhere.
> > 
> >> , but System.map says its xen_memory_setup. And it
> >> looks like the BUG() is the same as I had in dom0 before:
> >> "Xen hypervisor allocated kernel memory conflicts with E820 map".
> > 
> > Juergen: Is there anything we can do to try and insert some dummy
> > exception handlers right at PV start, so we could at least print out a
> > oneliner to the host console which is a little more helpful than Xen
> > saying "something unknown went wrong" ?
> 
> You mean something like commit 42b3a4cb5609de757f5445fcad18945ba9239a07
> added to kernel 4.15?
> 
> > 
> >>
> >> Disabling e820_host in guest config solved the problem. Thanks!
> >>
> >> Is this some bug in Xen or OVMF, or is it expected behavior and e820_host
> >> should be avoided?
> > 
> > I don't really know.  e820_host is a gross hack which shouldn't really
> > be present.  The actually problem is that Linux can't cope with the
> > memory layout it was given (and I can't recall if there is anything
> > Linux could potentially to do cope).  OTOH, the toolstack, which knew
> > about e820_host and chose to lay the guest out in an overlapping way is
> > probably also at fault.
> 
> The kernel can cope with lots of E820 scenarios (e.g. by relocating
> initrd or the p2m map), but moving itself out of the way is not
> possible.

I'm afraid I need to resurrect this thread...

With recent kernel (6.6+), the host_e820=0 workaround is not an option
anymore. It makes Linux not initialize xen-swiotlb (due to
f9a38ea5172a3365f4594335ed5d63e15af2fd18), so PCI passthrough doesn't
work at all. While I can add yet another layer of workaround (force
xen-swiotlb with iommu=soft), that's getting unwieldy.

Furthermore, I don't get the crash message anymore, even with debug
hypervisor and guest_loglvl=all. Not even "Domain X crashed" in `xl
dmesg`. It looks like the "crash" shutdown reason doesn't reach Xen, and
it's considered clean shutdown (I can confirm it by changing various
`on_*` settings (via libvirt) and observing which gets applied).

Most tests I've done with 6.7-rc1, but the issue I observed on 6.6.1
already.

This is on Xen 4.17.2. And the L0 is running Linux 6.6.1, and then uses
QEMU 8.1.2 + OVMF 202308 to run Xen as L1.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v4 3/6] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2023-11-23 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment with removing /dev/mem device node. Full test
for lockdown and stubdomain will come later, when all requirements will
be in place.

Signed-off-by: Marek Marczykowski-Górecki 
---
This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU wont have access /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) 
/\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v4 6/6] [DO NOT APPLY] switch to alternative artifact repo

2023-11-23 Thread Marek Marczykowski-Górecki
For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 32af30ccedc9..52abb12bce48 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -324,7 +324,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -335,7 +335,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: 
registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v4 0/6] MSI-X support with qemu in stubdomain, and other related changes

2023-11-23 Thread Marek Marczykowski-Górecki
This series includes changes to make MSI-X working with Linux stubdomain and
especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for
QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting
some registers on the same page as the MSI-X table.

See individual patches for details.

This series include also tests for MSI-X using new approach (by preventing QEMU
access to /dev/mem). But for it to work, it needs QEMU change that
makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1083468508

Marek Marczykowski-Górecki (6):
  x86/msi: passthrough all MSI-X vector ctrl writes to device model
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk   |   4 +-
 automation/gitlab-ci/build.yaml |   4 +-
 automation/gitlab-ci/test.yaml  |   4 +-
 automation/scripts/qubes-x86-64.sh  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c | 206 -
 xen/arch/x86/include/asm/msi.h  |   5 +-
 xen/arch/x86/msi.c  |  40 +++-
 xen/common/kernel.c |   1 +-
 xen/include/public/features.h   |   8 +-
 11 files changed, 272 insertions(+), 18 deletions(-)

base-commit: f96e2f64576cdbb147391c7cb399d393385719a9
-- 
git-series 0.9.1



[PATCH v4 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table

2023-11-23 Thread Marek Marczykowski-Górecki
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as MSI-X table. Device model (especially one in
stubdomain) cannot really handle those, as direct writes to that page is
refused (page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this, requires correlating read/write location with guest
of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
for PV would need to be done separately.

This will be also used to read Pending Bit Array, if it lives on the same
page, making QEMU not needing /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If PBA lives on another page, QEMU will
map it to the guest directly.
If PBA lives on the same page, discard writes and log a message.
Technically, writes outside of PBA could be allowed, but at this moment
the precise location of PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops need the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as default ("not mapped")
value - which simplifies code a bit.

GCC gets confused about 'desc' variable:

arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’:
arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized 
[-Werror=maybe-uninitialized]
  553 | if ( desc )
  |^
arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here
  537 | const struct msi_desc *desc;
  |^~~~

It's conditional initialization is actually correct (in the case where
it isn't initialized, function returns early), but to avoid
build failure initialize it explicitly to NULL anyway.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c| 191 --
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c |  40 +++-
 3 files changed, 225 insertions(+), 11 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 2436154c40b6..d1e8cb1e30f6 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
 return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
 struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
 struct domain *d = v->domain;
 
 list_for_each_entry( entry, >arch.hvm.msixtbl_list, list )
-if ( addr >= entry->gtable &&
- addr < entry->gtable + entry->table_len )
+if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+ PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
 return entry;
 
 return NULL;
@@ -213,6 +217,131 @@ static struct msi_desc *msixtbl_addr_to_desc(
 return NULL;
 }
 
+/*
+ * Returns:
+ *  - UINT_MAX if no handling should be done
+ *  - UINT_MAX-1 if write should be discarded
+ *  - a fixmap idx to use for handling
+ */
+#define ADJACENT_DONT_HANDLE UINT_MAX
+#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1)
+static unsigned int adjacent_handle(
+const struct msixtbl_entry *entry, unsigned long addr, bool write)
+{
+unsigned int adj_type;
+const struct arch_msix *msix;
+
+if ( !entry || !entry->pdev )
+return ADJACENT_DONT_HANDLE;
+
+if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable )
+adj_type = ADJ_IDX_FIRST;
+else if ( PFN_DOWN(addr) == PFN_DOWN(entry->

[PATCH v4 5/6] [DO NOT APPLY] switch to qemu fork

2023-11-23 Thread Marek Marczykowski-Górecki
This makes tests to use patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index 2c43702958eb..dd2687c0e9e7 100644
--- a/Config.mk
+++ b/Config.mk
@@ -222,8 +222,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b08019f0b2fbc30c75169a160acb9fd9af5d68f4
-- 
git-series 0.9.1



[PATCH v4 4/6] automation: switch to a wifi card on ADL system

2023-11-23 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on a MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for HVM test, because MSI-X adjacent write is
not supported on PV.

This requires also including drivers and firmware in system for tests.
Remove firmware unrelated to the test, to not increase initrd size too
much (all firmware takes over 100MB compressed).
And finally adjusts test script to handle not only eth0 as a test device,
but also wlan0 and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml  | 4 
 automation/scripts/qubes-x86-64.sh  | 7 +++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 6aabdb9d156f..931a8fb28e1d 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -195,6 +195,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile 
b/automation/tests-artifacts/alpine/3.18.dockerfile
index f1b4a8b7a191..b821a291fed3 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add curl && \
   apk add udev && \
   apk add pciutils && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile 
b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO 
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1



[PATCH v4 1/6] x86/msi: passthrough all MSI-X vector ctrl writes to device model

2023-11-23 Thread Marek Marczykowski-Górecki
QEMU needs to know whether clearing maskbit of a vector is really
clearing, or was already cleared before. Currently Xen sends only
clearing that bit to the device model, but not setting it, so QEMU
cannot detect it. Because of that, QEMU is working this around by
checking via /dev/mem, but that isn't the proper approach.

Give all necessary information to QEMU by passing all ctrl writes,
including masking a vector. Advertise the new behavior via
XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem
anymore.

While this commit doesn't move the whole maskbit handling to QEMU (as
discussed on xen-devel as one of the possibilities), it is a necessary
first step anyway. Including telling QEMU it will get all the required
information to do so. The actual implementation would need to include:
 - a hypercall for QEMU to control just maskbit (without (re)binding the
   interrupt again
 - a methor for QEMU to tell Xen it will actually do the work
Those are not part of this series.

Signed-off-by: Marek Marczykowski-Górecki 
---
I did not added any control to enable/disable this new behavior (as
Roger have suggested for possible non-QEMU ioreqs). I don't see how the
new behavior could be problematic for some existing ioreq server (they
already received writes to those addresses, just not all of them),
but if that's really necessary, I can probably add a command line option
to restore previous behavior system-wide.

Changes in v4:
- ignore unaligned writes with X86EMUL_OKAY
- restructure the code to forward all writes in _msixtbl_write() instead
  of manipulating return value of msixtbl_write() - this makes
  WRITE_LEN4_COMPLETION special case unnecessary
- advertise the changed behavior via XENVER_get_features instead of DMOP
v3:
 - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make
   "flags" parameter IN/OUT
 - move len check back to msixtbl_write() - will be needed there anyway
   in a later patch
v2:
 - passthrough quad writes to emulator too (Jan)
 - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
   #define for this magic value
---
 xen/arch/x86/hvm/vmsi.c   | 19 ++-
 xen/common/kernel.c   |  1 +
 xen/include/public/features.h |  8 
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 128f23636279..2436154c40b6 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 unsigned long flags;
 struct irq_desc *desc;
 
-if ( (len != 4 && len != 8) || (address & (len - 1)) )
-return r;
+if ( !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
 
 rcu_read_lock(_rcu_lock);
 
@@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long 
address,
 
 unlock:
 spin_unlock_irqrestore(>lock, flags);
-if ( len == 4 )
-r = X86EMUL_OKAY;
+r = X86EMUL_OKAY;
 
 out:
 rcu_read_unlock(_rcu_lock);
@@ -357,7 +356,17 @@ static int cf_check _msixtbl_write(
 const struct hvm_io_handler *handler, uint64_t address, uint32_t len,
 uint64_t val)
 {
-return msixtbl_write(current, address, len, val);
+/* ignore invalid length or unaligned writes */
+if ( len != 4 && len != 8 || !IS_ALIGNED(address, len) )
+return X86EMUL_OKAY;
+
+/*
+ * This function returns X86EMUL_UNHANDLEABLE even if write is properly
+ * handled, to propagate it to the device model (so it can keep its
+ * internal state in sync).
+ */
+msixtbl_write(current, address, len, val);
+return X86EMUL_UNHANDLEABLE;
 }
 
 static bool cf_check msixtbl_range(
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 08dbaa2a054c..229784c6ce52 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -642,6 +642,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) 
arg)
 fi.submap |= (1U << XENFEAT_direct_mapped);
 else
 fi.submap |= (1U << XENFEAT_not_direct_mapped);
+fi.submap |= (1U << XENFEAT_dm_msix_all_writes);
 break;
 default:
 return -EINVAL;
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 36936f6a4ee0..634534827d43 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -120,6 +120,14 @@
 #define XENFEAT_runstate_phys_area   18
 #define XENFEAT_vcpu_time_phys_area  19
 
+/*
+ * If set, Xen will passthrough all MSI-X vector ctrl writes to device model,
+ * not only those unmasking an entry. This allows device model to properly keep
+ * track of the MSI-X table without having to read it from the device behind
+ * Xen's backs. This information is relevant only for device models.
+ */
+#define XENFEAT_dm_msix_all_writes20
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
-- 
git-series 0.9.1



[PATCH for-4.18 v2] automation: fix race condition in adl-suspend test

2023-10-30 Thread Marek Marczykowski-Górecki
If system suspends too quickly, the message for the test controller to
wake up the system may be not sent to the console before suspending.
This will cause the test to timeout.

Fix this by calling sync on the console and waiting a bit after printing
the message. The test controller then resumes the system 30s after the
message, so as long as the delay + suspending takes less time it is
okay.

Signed-off-by: Marek Marczykowski-Górecki 
---
This is consistent with the observation that sync_console "fixes" the
issue.

Changes in v2:
- add sync /dev/stdout too (Roger)
---
 automation/scripts/qubes-x86-64.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 26131b082671..0f00bebdd8c8 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -54,11 +54,12 @@ until grep 'domU started' 
/var/log/xen/console/guest-domU.log; do
 sleep 1
 done
 echo \"${wait_and_wakeup}\"
+# let the above message flow to console, then suspend
+sync /dev/stdout
+sleep 5
 set -x
 echo deep > /sys/power/mem_sleep
 echo mem > /sys/power/state
-# now wait for resume
-sleep 5
 xl list
 xl dmesg | grep 'Finishing wakeup from ACPI S3 state' || exit 1
 # check if domU is still alive
-- 
2.41.0




Re: [PATCH] automation: fix race condition in adl-suspend test

2023-10-30 Thread Marek Marczykowski-Górecki
On Mon, Oct 30, 2023 at 05:32:08PM +0100, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 30, 2023 at 12:42:52PM +0100, Roger Pau Monné wrote:
> > On Sat, Oct 28, 2023 at 05:33:57AM +0200, Marek Marczykowski-Górecki wrote:
> > > If system suspends too quickly, the message for the test controller to
> > > wake up the system may be not sent to the console before suspending.
> > > This will cause the test to timeout.
> > > 
> > > Fix this by waiting a bit after printing the message. The test
> > > controller then resumes the system 30s after the message, so as long as
> > > the delay + suspending takes less time it is okay.
> > > 
> > > Signed-off-by: Marek Marczykowski-Górecki 
> > > 
> > > ---
> > > This is consistent with the observation that sync_console "fixes" the
> > > issue.
> > > ---
> > >  automation/scripts/qubes-x86-64.sh | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/automation/scripts/qubes-x86-64.sh 
> > > b/automation/scripts/qubes-x86-64.sh
> > > index 26131b082671..a34db96e4585 100755
> > > --- a/automation/scripts/qubes-x86-64.sh
> > > +++ b/automation/scripts/qubes-x86-64.sh
> > > @@ -54,11 +54,11 @@ until grep 'domU started' 
> > > /var/log/xen/console/guest-domU.log; do
> > >  sleep 1
> > >  done
> > >  echo \"${wait_and_wakeup}\"
> > > +# let the above message flow to console, then suspend
> > > +sleep 5
> > 
> > Could you use `sync /dev/stdout`?  I guess that might not be enough,
> > since the sync won't be propagated to the hypervisor, and hence even
> > if flushed from Linux, we have no guarantee that the hypervisor has
> > also flushed it.
> 
> It seems `sync /dev/stdout` helps too, at least in a limited sample of
> two.

... and the third attempt (with sync instead of sleep) failed.

> > Xen should flush the buffer when a newline character is found, but I
> > have no idea whether context could return to guest while the buffer is
> > still in the process of being fully flushed.
> 
> IIC Xen should flush the console buffer on the suspend path (there is
> console_start_sync() in enter_state()). So, if linux manages to send it
> to Xen in time, all should be good (in theory at least).
> 
> > Anyway, adding the extra sync might be good regardless, and keeping
> > the sleep.
> 
> Good idea, I'll send v2 with it included.


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH] automation: fix race condition in adl-suspend test

2023-10-30 Thread Marek Marczykowski-Górecki
On Mon, Oct 30, 2023 at 12:42:52PM +0100, Roger Pau Monné wrote:
> On Sat, Oct 28, 2023 at 05:33:57AM +0200, Marek Marczykowski-Górecki wrote:
> > If system suspends too quickly, the message for the test controller to
> > wake up the system may be not sent to the console before suspending.
> > This will cause the test to timeout.
> > 
> > Fix this by waiting a bit after printing the message. The test
> > controller then resumes the system 30s after the message, so as long as
> > the delay + suspending takes less time it is okay.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > This is consistent with the observation that sync_console "fixes" the
> > issue.
> > ---
> >  automation/scripts/qubes-x86-64.sh | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/automation/scripts/qubes-x86-64.sh 
> > b/automation/scripts/qubes-x86-64.sh
> > index 26131b082671..a34db96e4585 100755
> > --- a/automation/scripts/qubes-x86-64.sh
> > +++ b/automation/scripts/qubes-x86-64.sh
> > @@ -54,11 +54,11 @@ until grep 'domU started' 
> > /var/log/xen/console/guest-domU.log; do
> >  sleep 1
> >  done
> >  echo \"${wait_and_wakeup}\"
> > +# let the above message flow to console, then suspend
> > +sleep 5
> 
> Could you use `sync /dev/stdout`?  I guess that might not be enough,
> since the sync won't be propagated to the hypervisor, and hence even
> if flushed from Linux, we have no guarantee that the hypervisor has
> also flushed it.

It seems `sync /dev/stdout` helps too, at least in a limited sample of
two.

> Xen should flush the buffer when a newline character is found, but I
> have no idea whether context could return to guest while the buffer is
> still in the process of being fully flushed.

IIC Xen should flush the console buffer on the suspend path (there is
console_start_sync() in enter_state()). So, if linux manages to send it
to Xen in time, all should be good (in theory at least).

> Anyway, adding the extra sync might be good regardless, and keeping
> the sleep.

Good idea, I'll send v2 with it included.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH] automation: fix race condition in adl-suspend test

2023-10-27 Thread Marek Marczykowski-Górecki
If system suspends too quickly, the message for the test controller to
wake up the system may be not sent to the console before suspending.
This will cause the test to timeout.

Fix this by waiting a bit after printing the message. The test
controller then resumes the system 30s after the message, so as long as
the delay + suspending takes less time it is okay.

Signed-off-by: Marek Marczykowski-Górecki 
---
This is consistent with the observation that sync_console "fixes" the
issue.
---
 automation/scripts/qubes-x86-64.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/scripts/qubes-x86-64.sh 
b/automation/scripts/qubes-x86-64.sh
index 26131b082671..a34db96e4585 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -54,11 +54,11 @@ until grep 'domU started' 
/var/log/xen/console/guest-domU.log; do
 sleep 1
 done
 echo \"${wait_and_wakeup}\"
+# let the above message flow to console, then suspend
+sleep 5
 set -x
 echo deep > /sys/power/mem_sleep
 echo mem > /sys/power/state
-# now wait for resume
-sleep 5
 xl list
 xl dmesg | grep 'Finishing wakeup from ACPI S3 state' || exit 1
 # check if domU is still alive
-- 
2.41.0




Re: NULL pointer dereference in xenbus_thread->...

2023-10-22 Thread Marek Marczykowski-Górecki
On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> I've noticed in Qubes's CI failure like this:
> 
> [  871.271292] BUG: kernel NULL pointer dereference, address: 
> [  871.275290] #PF: supervisor read access in kernel mode
> [  871.277282] #PF: error_code(0x) - not-present page
> [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 
> [  871.281071] Oops:  [#1] PREEMPT SMP NOPTI
> [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 
> 6.1.43-1.qubes.fc37.x86_64 #1
> [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> [  871.23] RIP: e030:__wake_up_common+0x4c/0x180
> [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 
> 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 
> 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> [  871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> [  871.301656] RAX:  RBX: 88810541ce98 RCX: 
> 
> [  871.304255] RDX: 0001 RSI: 0003 RDI: 
> 88810541ce90
> [  871.306714] RBP: c900400f0280 R08: ffe8 R09: 
> c900400f7e68
> [  871.309937] R10: 7ff0 R11: 888100ad3000 R12: 
> c900400f7e68
> [  871.312326] R13:  R14:  R15: 
> 
> [  871.314647] FS:  () GS:88813ff0() 
> knlGS:
> [  871.317677] CS:  1e030 DS:  ES:  CR0: 80050033
> [  871.319644] CR2:  CR3: 0001067fe000 CR4: 
> 00040660
> [  871.321973] Call Trace:
> [  871.322782]  
> [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> [  871.329147]  ? __die_body.cold+0x8/0xd
> [  871.330378]  ? page_fault_oops+0x163/0x1a0
> [  871.331691]  ? exc_page_fault+0x70/0x170
> [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> [  871.334454]  ? __wake_up_common+0x4c/0x180
> [  871.335777]  __wake_up_common_lock+0x82/0xd0
> [  871.337183]  ? process_writes+0x240/0x240
> [  871.338461]  process_msg+0x18e/0x2f0
> [  871.339627]  xenbus_thread+0x165/0x1c0
> [  871.340830]  ? cpuusage_read+0x10/0x10
> [  871.342032]  kthread+0xe9/0x110
> [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> [  871.345020]  ret_from_fork+0x22/0x30
> [  871.346239]  
> [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio 
> snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core 
> snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev 
> intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc 
> parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel 
> polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 
> serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd 
> qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi 
> xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn 
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath
> [  871.368892] CR2: 
> [  871.370160] ---[ end trace  ]---
> [  871.371719] RIP: e030:__wake_up_common+0x4c/0x180
> [  871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 
> 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b 
> 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> [  871.379866] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> [  871.381689] RAX:  RBX: 88810541ce98 RCX: 
> 
> [  871.383971] RDX: 0001 RSI: 0003 RDI: 
> 88810541ce90
> [  871.386235] RBP: c900400f0280 R08: ffe8 R09: 
> c900400f7e68
> [  871.388521] R10: 7ff0 R11: 888100ad3000 R12: 
> c900400f7e68
> [  871.390789] R13:  R14:  R15: 
> 
> [  871.393101] FS:  () GS:88813ff0() 
> knlGS:
> [  871.395671] CS:  1e030 DS:  ES:  CR0: 80050033
> [  871.397863] CR2:  CR3: 0001067fe000 CR4: 
> 00040660
> [  871.400441] Kernel panic - not syncing: Fatal exception
> [  871.402171] Kernel Offset: disabled
> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
> 
> It isn't the first time I see similar crash, but I can't really
> reproduce it reliably. Restarted test u

  1   2   3   4   5   6   7   8   9   10   >