[PATCH v6 6/7] [DO NOT APPLY] switch to qemu fork
This makes tests to use patched QEMU, to actually test the new behavior. --- Config.mk | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Config.mk b/Config.mk index a962f095ca16..5e220a1284e4 100644 --- a/Config.mk +++ b/Config.mk @@ -220,8 +220,8 @@ endif OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16 -QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git -QEMU_UPSTREAM_REVISION ?= master +QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu +QEMU_UPSTREAM_REVISION ?= origin/msix MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc -- git-series 0.9.1
[PATCH v6 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers on the same page as MSI-X table. Device model (especially one in stubdomain) cannot really handle those, as direct writes to that page is refused (page is on the mmio_ro_ranges list). Instead, extend msixtbl_mmio_ops to handle such accesses too. Doing this, requires correlating read/write location with guest of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest, it requires msixtbl_entry->gtable, which is HVM-only. Similar feature for PV would need to be done separately. This will be also used to read Pending Bit Array, if it lives on the same page, making QEMU not needing /dev/mem access at all (especially helpful with lockdown enabled in dom0). If PBA lives on another page, QEMU will map it to the guest directly. If PBA lives on the same page, discard writes and log a message. Technically, writes outside of PBA could be allowed, but at this moment the precise location of PBA isn't saved, and also no known device abuses the spec in this way (at least yet). To access those registers, msixtbl_mmio_ops need the relevant page mapped. MSI handling already has infrastructure for that, using fixmap, so try to map first/last page of the MSI-X table (if necessary) and save their fixmap indexes. Note that msix_get_fixmap() does reference counting and reuses existing mapping, so just call it directly, even if the page was mapped before. Also, it uses a specific range of fixmap indexes which doesn't include 0, so use 0 as default ("not mapped") value - which simplifies code a bit. GCC 12.2.1 gets confused about 'desc' variable: arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’: arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized] 553 | if ( desc ) |^ arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here 537 | const struct msi_desc *desc; |^~~~ It's conditional initialization is actually correct (in the case where it isn't initialized, function returns early), but to avoid build failure initialize it explicitly to NULL anyway. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v6: - use MSIX_CHECK_WARN macro - extend assert on fixmap_idx - add break in default label, after ASSERT_UNREACHABLE(), and move setting default there - style fixes Changes in v5: - style fixes - include GCC version in the commit message - warn only once (per domain, per device) about failed adjacent access Changes in v4: - drop same_page parameter of msixtbl_find_entry(), distinguish two cases in relevant callers - rename adj_access_table_idx to adj_access_idx - code style fixes - drop alignment check in adjacent_{read,write}() - all callers already have it earlier - delay mapping first/last MSI-X pages until preparing device for a passthrough v3: - merge handling into msixtbl_mmio_ops - extend commit message v2: - adjust commit message - pass struct domain to msixtbl_page_handler_get_hwaddr() - reduce local variables used only once - log a warning if write is forbidden if MSI-X and PBA lives on the same page - do not passthrough unaligned accesses - handle accesses both before and after MSI-X table --- xen/arch/x86/hvm/vmsi.c| 200 -- xen/arch/x86/include/asm/msi.h | 5 +- xen/arch/x86/msi.c | 41 +++- 3 files changed, 236 insertions(+), 10 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 17983789..230e3a5dee3f 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d) return d->arch.hvm.msixtbl_list.next; } +/* + * Lookup an msixtbl_entry on the same page as given addr. It's up to the + * caller to check if address is strictly part of the table - if relevant. + */ static struct msixtbl_entry *msixtbl_find_entry( struct vcpu *v, unsigned long addr) { @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry( struct domain *d = v->domain; list_for_each_entry( entry, >arch.hvm.msixtbl_list, list ) -if ( addr >= entry->gtable && - addr < entry->gtable + entry->table_len ) +if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) && + PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) ) return entry; return NULL; @@ -213,6 +217,138 @@ static struct msi_desc *msixtbl_addr_to_desc( return NULL; } +/* + * Returns: + * - ADJACENT_DONT_HANDLE if no handling should be done + * - ADJACENT_DISCARD_WRITE if write should be discarded + * - a fixmap idx to use for handling + */ +#define ADJACENT_DONT_HANDLE UINT_MAX +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1) +static unsigned int adjacent_handle( +const struct msixtbl_entry *entry, unsign
[PATCH v6 0/7] MSI-X support with qemu in stubdomain, and other related changes
This series includes changes to make MSI-X working with Linux stubdomain and especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting some registers on the same page as the MSI-X table. Besides the stubdomain case (of which I care more), this is also necessary for PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in dom0). See individual patches for details. This series include also tests for MSI-X using new approach (by preventing QEMU access to /dev/mem). But for it to work, it needs QEMU change that makes use of the changes introduced here. It can be seen at https://github.com/marmarek/qemu/commits/msix Here is the pipeline that used the QEMU fork above: https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1269664578 Marek Marczykowski-Górecki (7): x86/msi: passthrough all MSI-X vector ctrl writes to device model x86/msi: Extend per-domain/device warning mechanism x86/hvm: Allow access to registers on the same page as MSI-X table automation: prevent QEMU access to /dev/mem in PCI passthrough tests automation: switch to a wifi card on ADL system [DO NOT APPLY] switch to qemu fork [DO NOT APPLY] switch to alternative artifact repo Config.mk | 4 +- automation/gitlab-ci/build.yaml | 4 +- automation/gitlab-ci/test.yaml | 4 +- automation/scripts/qubes-x86-64.sh | 9 +- automation/tests-artifacts/alpine/3.18.dockerfile | 7 +- automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 +- xen/arch/x86/hvm/vmsi.c | 215 - xen/arch/x86/include/asm/msi.h | 22 +- xen/arch/x86/msi.c | 46 ++- xen/common/kernel.c | 1 +- xen/include/public/features.h | 8 +- 11 files changed, 300 insertions(+), 22 deletions(-) base-commit: 7846f7699fea25502061a05ea847e942ea624f12 -- git-series 0.9.1
[PATCH v6 7/7] [DO NOT APPLY] switch to alternative artifact repo
For testing, switch to my containers registry that includes containers rebuilt with changes in this series. --- automation/gitlab-ci/build.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml index aac29ee13ab6..6957f06016b7 100644 --- a/automation/gitlab-ci/build.yaml +++ b/automation/gitlab-ci/build.yaml @@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export: alpine-3.18-rootfs-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18 script: - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz artifacts: @@ -331,7 +331,7 @@ alpine-3.18-rootfs-export: kernel-6.1.19-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19 script: - mkdir binaries && cp /bzImage binaries/bzImage artifacts: -- git-series 0.9.1
[PATCH v6 2/7] x86/msi: Extend per-domain/device warning mechanism
The arch_msix struct had a single "warned" field with a domid for which warning was issued. Upcoming patch will need similar mechanism for few more warnings, so change it to save a bit field of issued warnings. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v6: - add MSIX_CHECK_WARN macro (Jan) - drop struct name from warned_kind union (Jan) New in v5 --- xen/arch/x86/include/asm/msi.h | 17 - xen/arch/x86/msi.c | 5 + 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h index 997ccb87be0c..bcfdfd35345d 100644 --- a/xen/arch/x86/include/asm/msi.h +++ b/xen/arch/x86/include/asm/msi.h @@ -208,6 +208,15 @@ struct msg_address { PCI_MSIX_ENTRY_SIZE + \ (~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1))) +#define MSIX_CHECK_WARN(msix, domid, which) ({ \ +if ( (msix)->warned_domid != (domid) ) \ +{ \ +(msix)->warned_domid = (domid); \ +(msix)->warned_kind.all = 0; \ +} \ +(msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \ +}) + struct arch_msix { unsigned int nr_entries, used_entries; struct { @@ -217,7 +226,13 @@ struct arch_msix { int table_idx[MAX_MSIX_TABLE_PAGES]; spinlock_t table_lock; bool host_maskall, guest_maskall; -domid_t warned; +domid_t warned_domid; +union { +uint8_t all; +struct { +bool maskall : 1; +}; +} warned_kind; }; void early_msi_init(void); diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c index e721aaf5c001..42c793426da3 100644 --- a/xen/arch/x86/msi.c +++ b/xen/arch/x86/msi.c @@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest) domid_t domid = pdev->domain->domain_id; maskall = true; -if ( pdev->msix->warned != domid ) -{ -pdev->msix->warned = domid; +if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) ) printk(XENLOG_G_WARNING "cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n", desc->irq, domid, >sbdf); -} } pdev->msix->host_maskall = maskall; if ( maskall || pdev->msix->guest_maskall ) -- git-series 0.9.1
[PATCH v6 5/7] automation: switch to a wifi card on ADL system
Switch to a wifi card that has registers on a MSI-X page. This tests the "x86/hvm: Allow writes to registers on the same page as MSI-X table" feature. Switch it only for HVM test, because MSI-X adjacent write is not supported on PV. This requires also including drivers and firmware in system for tests. Remove firmware unrelated to the test, to not increase initrd size too much (all firmware takes over 100MB compressed). And finally adjusts test script to handle not only eth0 as a test device, but also wlan0 and connect it to the wifi network. Signed-off-by: Marek Marczykowski-Górecki Reviewed-by: Stefano Stabellini --- This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll provide them in private. This change requires rebuilding test containers. This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/gitlab-ci/test.yaml | 4 automation/scripts/qubes-x86-64.sh | 7 +++ automation/tests-artifacts/alpine/3.18.dockerfile | 7 +++ automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++ 4 files changed, 20 insertions(+) diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml index 8b7b2e4da92d..23a183fad967 100644 --- a/automation/gitlab-ci/test.yaml +++ b/automation/gitlab-ci/test.yaml @@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug: adl-pci-hvm-x86-64-gcc-debug: extends: .adl-x86-64 + variables: +PCIDEV: "00:14.3" +WIFI_SSID: "$WIFI_HW2_SSID" +WIFI_PSK: "$WIFI_HW2_PSK" script: - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE} needs: diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index 7eabc1bd6ad4..60498ef1e89a 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -94,6 +94,13 @@ on_reboot = "destroy" domU_check=" set -x -e interface=eth0 +if [ -e /sys/class/net/wlan0 ]; then +interface=wlan0 +set +x +wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf +set -x +wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf +fi ip link set \"\$interface\" up timeout 30s udhcpc -i \"\$interface\" pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ') diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile index 9cde6c9ad4da..c323e266c7da 100644 --- a/automation/tests-artifacts/alpine/3.18.dockerfile +++ b/automation/tests-artifacts/alpine/3.18.dockerfile @@ -34,6 +34,13 @@ RUN \ apk add udev && \ apk add pciutils && \ apk add libelf && \ + apk add wpa_supplicant && \ + # Select firmware for hardware tests + apk add linux-firmware-other && \ + mkdir /lib/firmware-preserve && \ + mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \ + rm -rf /lib/firmware && \ + mv /lib/firmware-preserve /lib/firmware && \ \ # Xen cd / && \ diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile index 3a4096780d20..84ed5dff23ae 100644 --- a/automation/tests-artifacts/kernel/6.1.19.dockerfile +++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile @@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI make xen.config && \ scripts/config --enable BRIDGE && \ scripts/config --enable IGC && \ +scripts/config --enable IWLWIFI && \ +scripts/config --enable IWLMVM && \ cp .config .config.orig && \ cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \ make -j$(nproc) bzImage && \ -- git-series 0.9.1
[PATCH v6 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain. Simulate this environment with removing /dev/mem device node. Full test for lockdown and stubdomain will come later, when all requirements will be in place. Signed-off-by: Marek Marczykowski-Górecki Acked-by: Stefano Stabellini --- This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/scripts/qubes-x86-64.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index d81ed7b931cf..7eabc1bd6ad4 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -163,6 +163,8 @@ ifconfig eth0 up ifconfig xenbr0 up ifconfig xenbr0 192.168.0.1 +# ensure QEMU wont have access /dev/mem +rm -f /dev/mem # get domU console content into test log tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" & xl create /etc/xen/domU.cfg -- git-series 0.9.1
[PATCH v6 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model
QEMU needs to know whether clearing maskbit of a vector is really clearing, or was already cleared before. Currently Xen sends only clearing that bit to the device model, but not setting it, so QEMU cannot detect it. Because of that, QEMU is working this around by checking via /dev/mem, but that isn't the proper approach. Give all necessary information to QEMU by passing all ctrl writes, including masking a vector. Advertise the new behavior via XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem anymore. While this commit doesn't move the whole maskbit handling to QEMU (as discussed on xen-devel as one of the possibilities), it is a necessary first step anyway. Including telling QEMU it will get all the required information to do so. The actual implementation would need to include: - a hypercall for QEMU to control just maskbit (without (re)binding the interrupt again - a methor for QEMU to tell Xen it will actually do the work Those are not part of this series. Signed-off-by: Marek Marczykowski-Górecki Reviewed-by: Jan Beulich --- I did not added any control to enable/disable this new behavior (as Roger have suggested for possible non-QEMU ioreqs). I don't see how the new behavior could be problematic for some existing ioreq server (they already received writes to those addresses, just not all of them), but if that's really necessary, I can probably add a command line option to restore previous behavior system-wide. Changes in v5: - announce the feature only on x86 - style fixes Changes in v4: - ignore unaligned writes with X86EMUL_OKAY - restructure the code to forward all writes in _msixtbl_write() instead of manipulating return value of msixtbl_write() - this makes WRITE_LEN4_COMPLETION special case unnecessary - advertise the changed behavior via XENVER_get_features instead of DMOP v3: - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make "flags" parameter IN/OUT - move len check back to msixtbl_write() - will be needed there anyway in a later patch v2: - passthrough quad writes to emulator too (Jan) - (ab)use len==0 for write len=4 completion (Jan), but add descriptive #define for this magic value --- xen/arch/x86/hvm/vmsi.c | 19 ++- xen/common/kernel.c | 1 + xen/include/public/features.h | 8 3 files changed, 23 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index adbac965f9f7..17983789 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unsigned long flags; struct irq_desc *desc; -if ( (len != 4 && len != 8) || (address & (len - 1)) ) -return r; +if ( !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; rcu_read_lock(_rcu_lock); @@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unlock: spin_unlock_irqrestore(>lock, flags); -if ( len == 4 ) -r = X86EMUL_OKAY; +r = X86EMUL_OKAY; out: rcu_read_unlock(_rcu_lock); @@ -357,7 +356,17 @@ static int cf_check _msixtbl_write( const struct hvm_io_handler *handler, uint64_t address, uint32_t len, uint64_t val) { -return msixtbl_write(current, address, len, val); +/* Ignore invalid length or unaligned writes. */ +if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; + +/* + * This function returns X86EMUL_UNHANDLEABLE even if write is properly + * handled, to propagate it to the device model (so it can keep its + * internal state in sync). + */ +msixtbl_write(current, address, len, val); +return X86EMUL_UNHANDLEABLE; } static bool cf_check msixtbl_range( diff --git a/xen/common/kernel.c b/xen/common/kernel.c index 08dbaa2a054c..b44b2439ca8e 100644 --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -637,6 +637,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) | (1U << XENFEAT_hvm_callback_vector) | (has_pirq(d) ? (1U << XENFEAT_hvm_pirqs) : 0); +fi.submap |= (1U << XENFEAT_dm_msix_all_writes); #endif if ( !paging_mode_translate(d) || is_domain_direct_mapped(d) ) fi.submap |= (1U << XENFEAT_direct_mapped); diff --git a/xen/include/public/features.h b/xen/include/public/features.h index 4437f25d2532..880193094713 100644 --- a/xen/include/public/features.h +++ b/xen/include/public/features.h @@ -120,6 +120,14 @@ #define XENFEAT_runstate_phys_area18 #define XENFEAT_vcpu_time_phys_area 19 +/* + * If set, Xen will passthrough all MSI-X vector ctrl writes to device model, + * not only those unmasking an entry. This allows device model to properly k
Re: [PATCH v5 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
On Thu, Apr 25, 2024 at 01:15:34PM +0200, Jan Beulich wrote: > On 13.03.2024 16:16, Marek Marczykowski-Górecki wrote: > > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers > > on the same page as MSI-X table. Device model (especially one in > > stubdomain) cannot really handle those, as direct writes to that page is > > refused (page is on the mmio_ro_ranges list). Instead, extend > > msixtbl_mmio_ops to handle such accesses too. > > > > Doing this, requires correlating read/write location with guest > > of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest, > > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature > > for PV would need to be done separately. > > > > This will be also used to read Pending Bit Array, if it lives on the same > > page, making QEMU not needing /dev/mem access at all (especially helpful > > with lockdown enabled in dom0). If PBA lives on another page, QEMU will > > map it to the guest directly. > > If PBA lives on the same page, discard writes and log a message. > > Technically, writes outside of PBA could be allowed, but at this moment > > the precise location of PBA isn't saved, and also no known device abuses > > the spec in this way (at least yet). > > > > To access those registers, msixtbl_mmio_ops need the relevant page > > mapped. MSI handling already has infrastructure for that, using fixmap, > > so try to map first/last page of the MSI-X table (if necessary) and save > > their fixmap indexes. Note that msix_get_fixmap() does reference > > counting and reuses existing mapping, so just call it directly, even if > > the page was mapped before. Also, it uses a specific range of fixmap > > indexes which doesn't include 0, so use 0 as default ("not mapped") > > value - which simplifies code a bit. > > > > GCC 12.2.1 gets confused about 'desc' variable: > > > > arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’: > > arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized > > [-Werror=maybe-uninitialized] > > 553 | if ( desc ) > > |^ > > arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here > > 537 | const struct msi_desc *desc; > > | ^~~~ > > > > It's conditional initialization is actually correct (in the case where > > it isn't initialized, function returns early), but to avoid > > build failure initialize it explicitly to NULL anyway. > > > > Signed-off-by: Marek Marczykowski-Górecki > > Sadly there are further more or less cosmetic issues. Plus, as indicated > before, I'm not really happy for us to gain all of this extra code. In > the end I may eventually give an R-b not including the usually implied > A-b, to indicate the code (then) looks okay to me but I still want > someone else to actually ack it to allow it going in. I understand. Given similar code is committed for vPCI already, I hope somebody will be comfortable with acking this one too (yes, I do realize the vPCI one is much less exposed, but still). > > +static int adjacent_read( > > +unsigned int fixmap_idx, > > +paddr_t address, unsigned int len, uint64_t *pval) > > +{ > > +const void __iomem *hwaddr; > > + > > +*pval = ~0UL; > > + > > +ASSERT(fixmap_idx != ADJACENT_DISCARD_WRITE); > > Why only one of the special values? And before you add the other here: > Why not simply ASSERT(fixmap_idx <= FIX_MSIX_IO_RESERV_END)? (Could of > course bound at the other end, too, i.e. against FIX_MSIX_IO_RESERV_BASE.) That's the most likely bug that could happen, but indeed broader assert would be better. > > +hwaddr = fix_to_virt(fixmap_idx) + PAGE_OFFSET(address); > > + > > +switch ( len ) > > +{ > > +case 1: > > +*pval = readb(hwaddr); > > +break; > > + > > +case 2: > > +*pval = readw(hwaddr); > > +break; > > + > > +case 4: > > +*pval = readl(hwaddr); > > +break; > > + > > +case 8: > > +*pval = readq(hwaddr); > > + break; > > + > > +default: > > +ASSERT_UNREACHABLE(); > > Misra demands "break;" to be here for release builds. In fact I wonder > why "*pval = ~0UL;" isn't put here, too. Question of course is whether > in such a case a true error indicator wouldn't be yet better. I don't think it possible for the msixtbl_read() (that calls adjacent_read()) to be called with other sizes. The default label is here exactly to make it obvious for the reader. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] x86/MTRR: avoid several indirect calls
On Wed, Jan 17, 2024 at 10:32:53AM +0100, Jan Beulich wrote: > --- a/xen/arch/x86/cpu/mtrr/main.c > +++ b/xen/arch/x86/cpu/mtrr/main.c > @@ -328,7 +316,7 @@ int mtrr_add_page(unsigned long base, un > } > > /* If the type is WC, check that this processor supports it */ > - if ((type == X86_MT_WC) && !have_wrcomb()) { > + if ((type == X86_MT_WC) && mtrr_have_wrcomb()) { Is reversing the condition intentional here? > printk(KERN_WARNING > "mtrr: your processor doesn't support > write-combining\n"); > return -EOPNOTSUPP; -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v5 0/7] MSI-X support with qemu in stubdomain, and other related changes
On Wed, Mar 13, 2024 at 04:16:05PM +0100, Marek Marczykowski-Górecki wrote: > This series includes changes to make MSI-X working with Linux stubdomain and > especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for > QEMU to access /dev/mem, but also the Intel Wifi card violating spec by > putting > some registers on the same page as the MSI-X table. > Besides the stubdomain case (of which I care more), this is also necessary for > PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in > dom0). > > See individual patches for details. > > This series include also tests for MSI-X using new approach (by preventing > QEMU > access to /dev/mem). But for it to work, it needs QEMU change that > makes use of the changes introduced here. It can be seen at > https://github.com/marmarek/qemu/commits/msix > > Here is the pipeline that used the QEMU fork above: > https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1211237368 > > Marek Marczykowski-Górecki (7): > x86/msi: passthrough all MSI-X vector ctrl writes to device model > x86/msi: Extend per-domain/device warning mechanism > x86/hvm: Allow access to registers on the same page as MSI-X table > automation: prevent QEMU access to /dev/mem in PCI passthrough tests > automation: switch to a wifi card on ADL system > [DO NOT APPLY] switch to qemu fork > [DO NOT APPLY] switch to alternative artifact repo > > Config.mk | 4 +- > automation/gitlab-ci/build.yaml | 4 +- > automation/gitlab-ci/test.yaml | 4 +- > automation/scripts/qubes-x86-64.sh | 9 +- > automation/tests-artifacts/alpine/3.18.dockerfile | 7 +- > automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 +- > xen/arch/x86/hvm/vmsi.c | 224 - > xen/arch/x86/include/asm/msi.h | 15 +- > xen/arch/x86/msi.c | 50 ++- > xen/common/kernel.c | 1 +- > xen/include/public/features.h | 8 +- > 11 files changed, 308 insertions(+), 20 deletions(-) > > base-commit: 03cf7ca23e0e876075954c558485b267b7d02406 > -- > git-series 0.9.1 Ping, can I ask for a review? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map
On Wed, Apr 17, 2024 at 04:17:48PM +0200, Jan Beulich wrote: > On 14.04.2024 02:32, Marek Marczykowski-Górecki wrote: > > On Wed, Apr 03, 2024 at 09:10:40AM +0200, Jan Beulich wrote: > >> On 27.03.2024 03:53, Marek Marczykowski-Górecki wrote: > >>> The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory > >>> map. This should be true for addresses coming from the firmware, but > >>> when extra pages used by Xen itself are included in the mapping, those > >>> are taken from usable RAM used. Mark those pages as reserved too. > >>> > >>> Not marking the pages as reserved didn't caused issues before due to > >>> another a bug in IOMMU driver code, that was fixed in 83afa3135830 > >>> ("amd-vi: fix IVMD memory type checks"). > >>> > >>> Failing to reserve memory will lead to panic in IOMMU setup code. And > >>> not including the page in IOMMU mapping will lead to broken console (due > >>> to IOMMU faults). The pages chosen by the XHCI console driver should > >>> still be usable by the CPU though, and the console code already can deal > >>> with too slow console by dropping characters (and console not printing > >>> anything is a special case of "slow"). When reserving fails print an error > >>> message showing which pages failed and who requested them. This should > >>> be enough hint to find why XHCI console doesn't work. > >>> > >>> Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the > >>> XHCI" > >>> Signed-off-by: Marek Marczykowski-Górecki > >>> > >> > >> Acked-by: Jan Beulich > > > > Is any ack missing here, or has it just fallen through the cracks? > > ??? (commit dd5101a6169f89b9e3f3b72f0b0fcdb38db2fb35) Oh, sorry, somehow I missed it. All good then, thanks. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map
On Wed, Apr 03, 2024 at 09:10:40AM +0200, Jan Beulich wrote: > On 27.03.2024 03:53, Marek Marczykowski-Górecki wrote: > > The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory > > map. This should be true for addresses coming from the firmware, but > > when extra pages used by Xen itself are included in the mapping, those > > are taken from usable RAM used. Mark those pages as reserved too. > > > > Not marking the pages as reserved didn't caused issues before due to > > another a bug in IOMMU driver code, that was fixed in 83afa3135830 > > ("amd-vi: fix IVMD memory type checks"). > > > > Failing to reserve memory will lead to panic in IOMMU setup code. And > > not including the page in IOMMU mapping will lead to broken console (due > > to IOMMU faults). The pages chosen by the XHCI console driver should > > still be usable by the CPU though, and the console code already can deal > > with too slow console by dropping characters (and console not printing > > anything is a special case of "slow"). When reserving fails print an error > > message showing which pages failed and who requested them. This should > > be enough hint to find why XHCI console doesn't work. > > > > Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the > > XHCI" > > Signed-off-by: Marek Marczykowski-Górecki > > Acked-by: Jan Beulich Hi, Is any ack missing here, or has it just fallen through the cracks? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [XEN PATCH v2 0/3] x86/iommu: Drop IOMMU support when cx16 isn't supported
On Thu, Apr 11, 2024 at 09:05:08PM +0100, Andrew Cooper wrote: > Sorry, but you've sent out two copies of each patch in this series, and > it's not clear if they're identical or not. FWIW I've got just one copy. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v1 0/2] Starting AMD SEV work
On Wed, Apr 10, 2024 at 05:36:34PM +0200, Andrei Semenov wrote: > This patch series initiate work on AMD SEV technology implementation in Xen. > SEV stands for "Secure Encrypted Virtualization" and allows the memory > contents > of a VM to be encrypted with a key unique to this VM. In this way the neither > other VMs nor hypervisor can't read the memory content of this "encrypted" > VM. > > In order to create and to run such a VM different layers of software must > interact (bascally Xen hypevisor, Xen toolstack in dom0 and the encrypted VM > itself). > > In this work we start with discovering and enabling SEV feature on the > platform. > The second patch ports AMD Secure Processor driver on Xen. This AMD Secure > Processor device (a.k.a PSP) is the way the different software layers interact > with AMD firmware/hardware to manage and run the encrypted VM. How will that interact with the PSP driver in dom0? AFAIK amdgpu driver uses PSP for loading the GPU firmware. Does it mean one need to choose either GPU in dom0 or encrypted VMs, or is it going to work somehow together? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
On Thu, Apr 04, 2024 at 11:41:22AM +0100, Andrew Cooper wrote: > It turns out there is something wonky on some but not all CPUs with > MSR_TSX_FORCE_ABORT. The presence of RTM_ALWAYS_ABORT causes Xen to think > it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions > genuinely #UD. > > Spot this case and try to back out as cleanly as we can. > > Signed-off-by: Andrew Cooper Thanks, this makes the test exit with 0, and print just "Got #UD" now in the "Testing RTM behaviour" section. Tested-by: Marek Marczykowski-Górecki > --- > CC: Jan Beulich > CC: Roger Pau Monné > CC: Marek Marczykowski-Górecki > > In the meantime, I'll see if anyone at Intel knows what's going on. Because > these parts are fully out of support now, it's very unlikely that we're going > to get a fix. > --- > xen/arch/x86/tsx.c | 55 +- > 1 file changed, 45 insertions(+), 10 deletions(-) > > diff --git a/xen/arch/x86/tsx.c b/xen/arch/x86/tsx.c > index 50d8059f23a9..41bb39d10074 100644 > --- a/xen/arch/x86/tsx.c > +++ b/xen/arch/x86/tsx.c > @@ -1,5 +1,6 @@ > #include > #include > +#include > #include > > /* > @@ -9,6 +10,7 @@ > * -1 => Default, altered to 0/1 (if unspecified) by: > * - TAA heuristics/settings for speculative safety > * - "TSX vs PCR3" select for TSX memory ordering safety > + * -2 => Implicit tsx=0 (from RTM_ALWAYS_ABORT vs RTM mismatch) > * -3 => Implicit tsx=1 (feed-through from spec-ctrl=0) > * > * This is arranged such that the bottom bit encodes whether TSX is actually > @@ -114,11 +116,50 @@ void tsx_init(void) > > if ( cpu_has_tsx_force_abort ) > { > +uint64_t val; > + > /* > - * On an early TSX-enable Skylake part subject to the memory > + * On an early TSX-enabled Skylake part subject to the memory > * ordering erratum, with at least the March 2019 microcode. > */ > > +rdmsrl(MSR_TSX_FORCE_ABORT, val); > + > +/* > + * At the time of writing (April 2024), it was discovered that > + * some parts (e.g. CoffeeLake 8th Gen, 06-9e-0a, ucode 0xf6) > + * advertise RTM_ALWAYS_ABORT, but XBEGIN instructions #UD. > Other > + * similar parts (e.g. KabyLake Xeon-E3, 06-9e-09, ucode 0xf8) > + * operate as expected. > + * > + * In this case: > + * - RTM_ALWAYS_ABORT and MSR_TSX_FORCE_ABORT are enumerated. > + * - XBEGIN instructions genuinely #UD. > + * - MSR_TSX_FORCE_ABORT is write-discard and fails to hold its > + *value. > + * - HLE and RTM are not enumerated, despite > + *MSR_TSX_FORCE_ABORT.TSX_CPUID_CLEAR being clear. > + * > + * Spot this case, and treat it as if no TSX is available at all. > + * This will prevent Xen from thinking it's safe to offer HLE/RTM > + * to VMs. > + */ > +if ( val == 0 && cpu_has_rtm_always_abort && !cpu_has_rtm ) > +{ > +printk(XENLOG_ERR > + "FIRMWARE BUG: CPU %02x-%02x-%02x, ucode 0x%08x: > RTM_ALWAYS_ABORT vs RTM mismatch\n", > + boot_cpu_data.x86, boot_cpu_data.x86_model, > + boot_cpu_data.x86_mask, this_cpu(cpu_sig).rev); > + > +setup_clear_cpu_cap(X86_FEATURE_RTM_ALWAYS_ABORT); > +setup_clear_cpu_cap(X86_FEATURE_TSX_FORCE_ABORT); > + > +if ( opt_tsx < 0 ) > +opt_tsx = -2; > + > +goto done_setup; > +} > + > /* > * Probe for the June 2021 microcode which de-features TSX on > * client parts. (Note - this is a subset of parts impacted by > @@ -128,15 +169,8 @@ void tsx_init(void) > * read as zero if TSX_FORCE_ABORT.ENABLE_RTM has been set before > * we run. > */ > -if ( !has_rtm_always_abort ) > -{ > -uint64_t val; > - > -rdmsrl(MSR_TSX_FORCE_ABORT, val); > - > -if ( val & TSX_ENABLE_RTM ) > -has_rtm_always_abort = true; > -} > +if ( val & TSX_ENABLE_RTM ) > +has_rtm_always_abort = true; > > /* > * If no explicit tsx= option is provided, pick a default. > @@ -191,6 +225,7 @@ void tsx_init(void) > setup_force_cpu_cap(X86_FEATURE_RTM); > } > } > + done_setup: > > /* > * Note: MSR_TSX_CTRL is enumerated on TSX-enabled MDS_NO and later > parts. > > base-commit: 6117179dd99958e4ef2687617d12c9b15bdbae24 > -- > 2.30.2 > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: text-tsx fails on Intel core 8th gen system
On Wed, Apr 03, 2024 at 05:04:20PM +0200, Jan Beulich wrote: > On 03.04.2024 16:50, Marek Marczykowski-Górecki wrote: > > Hi, > > > > I've noticed that tools/tests/tsx/test-tsx fails on a system with Intel > > Core i7-8750H. Specific error I get: > > > > [user@dom0 tsx]$ ./test-tsx > > TSX tests > > Got 16 CPUs > > Testing MSR_TSX_FORCE_ABORT consistency > > CPU0 val 0 > > Testing MSR_TSX_CTRL consistency > > Testing MSR_MCU_OPT_CTRL consistency > > CPU0 val 0 > > Testing RTM behaviour > > Got #UD > > Host reports RTM, but appears unavailable > > Isn't this ... > > > Testing PV default/max policies > > Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > HLE/RTM offered to guests despite not being available > > Testing HVM default/max policies > > Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > HLE/RTM offered to guests despite not being available > > Testing PV guest > > Created d8 > > Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > Testing HVM guest > > Created d9 > > Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 > > [user@dom0 tsx]$ echo $? > > 1 > > ... the reason for this? I think so, but the question is why it behaves this way. Could be an issue with MSR/CPUID values presented by Xen, or values Xen gets from the CPU. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
text-tsx fails on Intel core 8th gen system
Hi, I've noticed that tools/tests/tsx/test-tsx fails on a system with Intel Core i7-8750H. Specific error I get: [user@dom0 tsx]$ ./test-tsx TSX tests Got 16 CPUs Testing MSR_TSX_FORCE_ABORT consistency CPU0 val 0 Testing MSR_TSX_CTRL consistency Testing MSR_MCU_OPT_CTRL consistency CPU0 val 0 Testing RTM behaviour Got #UD Host reports RTM, but appears unavailable Testing PV default/max policies Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 HLE/RTM offered to guests despite not being available Testing HVM default/max policies Max: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 Def: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 HLE/RTM offered to guests despite not being available Testing PV guest Created d8 Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 Testing HVM guest Created d9 Cur: RTM 0, HLE 0, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 Cur: RTM 1, HLE 1, TSX_FORCE_ABORT 0, RTM_ALWAYS_ABORT 0, TSX_CTRL 0 [user@dom0 tsx]$ echo $? 1 When I try it on a newer system (11th gen) then it works fine (exit code 0, just "Got #UD", no "Host reports RTM, but appears unavailable" line). /proc/cpuinfo says: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 158 model name : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz stepping: 10 microcode : 0xf6 cpu MHz : 2207.990 cache size : 9216 KB physical id : 0 siblings: 6 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx clflushopt xsaveopt xsavec xgetbv1 md_clear arch_capabilities bugs: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds mmio_stale_data retbleed bogomips: 4415.98 clflush size: 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: ... Full `xen-cpuid detail` output attached. Just in case, I'm attaching also full xl dmesg, but I don't see anything related there. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab nr_features: 18 Static sets: Known bfebfbff:fffef3ff:ee500800:2469bfff:000f:ffbf:3a405fdf:0780:779fd205:fc91ef1c:1c30:38000144:0001:0037::0004:1fbe: [00] CPUID 0x0001.edx fpu vme de pse tsc msr pae mce cx8 apic sysenter mtrr pge mca cmov pat pse36 clflush ds acpi mmx fxsr sse sse2 ss htt tm pbe [01] CPUID 0x0001.ecx sse3 pclmulqdq dtes64 monitor ds-cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse41 sse42 x2apic movebe popcnt tsc-dl aesni xsave osxsave avx f16c rdrnd hyper [02] CPUID 0x8001.edx syscall nx mmx+ fxsr+ pg1g rdtscp lm 3dnow+ 3dnow [03] CPUID 0x8001.ecx lahf-lm cmp svm extapic cr8d lzcnt sse4a msse 3dnowpf osvw ibs xop skinit wdt lwp fma4 nodeid tbm topoext dbx monitorx [04] CPUID 0x000d:1.eax xsaveopt xsavec xgetbv1 xsaves [05] CPUID 0x0007:0.ebx fsgsbase tsc-adj sgx bmi1 hle avx2 fdp-exn smep bmi2 erms invpcid rtm pqm depfpp mpx pqe avx512f avx512dq rdseed adx smap avx512-ifma clflushopt clwb proc-trace avx512pf avx512er avx512cd sha avx512bw avx512vl [06] CPUID 0x0007:0.ecx prefetchwt1 avx512-vbmi umip pku ospke avx512-vbmi2 cet-ss gfni vaes vpclmulqdq avx512-vnni avx512-bitalg avx512-vpopcntdq rdpid cldemote movdiri movdir64b enqcmd [07] CPUID 0x8007.edx hw-pstate itsc cpb efro [08] CPUID 0x8008.ebx clzero rstr-fp-err-ptrs wbnoinvd ibpb ibrs amd-stibp ibrs-always stibp-always ibrs-fast ibrs-same-mode no-lmsl ppin amd-ssbd virt-ssbd ssb-no psfd btc-no ibpb-ret [09] CPUID 0x0007:0.edx avx512-4vnniw avx512-4fmaps fsrm avx512-vp2intersect srbds-ctrl md-clear rtm-always-abort tsx-force-abort serialize hybrid tsxldtrk cet-ibt avx512-fp16 ibrsb stibp l1d-flush arch-caps core-caps ssbd [10] CPUID 0x0007:1.eax avx-vnni avx512-bf16 fzrm fsrs fsrcs [11] CPUID 0x8021.eax lfence+ nscb auto-ibrs sbpb ibpb-brtype srso-no [12] CPUID 0x0007:1.ebx
Re: [PATCH net] xen-netfront: Add missing skb_mark_for_recycle
On Wed, Mar 27, 2024 at 01:14:56PM +0100, Jesper Dangaard Brouer wrote: > Notice that skb_mark_for_recycle() is introduced later than fixes tag in > 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling"). > > It is believed that fixes tag were missing a call to page_pool_release_page() > between v5.9 to v5.14, after which is should have used skb_mark_for_recycle(). > Since v6.6 the call page_pool_release_page() were removed (in 535b9c61bdef > ("net: page_pool: hide page_pool_release_page()") and remaining callers > converted (in commit 6bfef2ec0172 ("Merge branch > 'net-page_pool-remove-page_pool_release_page'")). > > This leak became visible in v6.8 via commit dba1b8a7ab68 ("mm/page_pool: catch > page_pool memory leaks"). > > Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront") > Reported-by: Arthur Borsboom > Signed-off-by: Jesper Dangaard Brouer > --- > Compile tested only, can someone please test this I've got a confirmation it fixes the issue: https://github.com/QubesOS/qubes-linux-kernel/pull/926#issuecomment-2026226944 > drivers/net/xen-netfront.c |1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c > index ad29f370034e..8d2aee88526c 100644 > --- a/drivers/net/xen-netfront.c > +++ b/drivers/net/xen-netfront.c > @@ -285,6 +285,7 @@ static struct sk_buff *xennet_alloc_one_rx_buffer(struct > netfront_queue *queue) > return NULL; > } > skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE); > + skb_mark_for_recycle(skb); > > /* Align ip header to a 16 bytes boundary */ > skb_reserve(skb, NET_IP_ALIGN); > > > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v3 1/2] hw/xen: detect when running inside stubdomain
Introduce global xen_is_stubdomain variable when qemu is running inside a stubdomain instead of dom0. This will be relevant for subsequent patches, as few things like accessing PCI config space need to be done differently. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v3: - move to xen_hvm_init_pc() - coding style Changes in v2: - use sigend int for domid to match xenstore_read_int() types - fix code style --- hw/i386/xen/xen-hvm.c | 22 ++ include/hw/xen/xen.h | 1 + system/globals.c | 1 + 3 files changed, 24 insertions(+) diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c index 7745cb3..3291c17 100644 --- a/hw/i386/xen/xen-hvm.c +++ b/hw/i386/xen/xen-hvm.c @@ -583,6 +583,26 @@ static void xen_wakeup_notifier(Notifier *notifier, void *data) xc_set_hvm_param(xen_xc, xen_domid, HVM_PARAM_ACPI_S_STATE, 0); } +static bool xen_check_stubdomain(struct xs_handle *xsh) +{ +char *dm_path = g_strdup_printf( +"/local/domain/%d/image/device-model-domid", xen_domid); +char *val; +int32_t dm_domid; +bool is_stubdom = false; + +val = xs_read(xsh, 0, dm_path, NULL); +if (val) { +if (sscanf(val, "%d", _domid) == 1) { +is_stubdom = dm_domid != 0; +} +free(val); +} + +g_free(dm_path); +return is_stubdom; +} + void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion **ram_memory) { MachineState *ms = MACHINE(pcms); @@ -595,6 +615,8 @@ void xen_hvm_init_pc(PCMachineState *pcms, MemoryRegion **ram_memory) xen_register_ioreq(state, max_cpus, _memory_listener); +xen_is_stubdomain = xen_check_stubdomain(state->xenstore); + QLIST_INIT(_physmap); xen_read_physmap(state); diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index 37ecc91..ecb89ec 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -36,6 +36,7 @@ enum xen_mode { extern uint32_t xen_domid; extern enum xen_mode xen_mode; extern bool xen_domid_restrict; +extern bool xen_is_stubdomain; int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); int xen_set_pci_link_route(uint8_t link, uint8_t irq); diff --git a/system/globals.c b/system/globals.c index e353584..d602a04 100644 --- a/system/globals.c +++ b/system/globals.c @@ -60,6 +60,7 @@ bool qemu_uuid_set; uint32_t xen_domid; enum xen_mode xen_mode = XEN_DISABLED; bool xen_domid_restrict; +bool xen_is_stubdomain; struct evtchn_backend_ops *xen_evtchn_ops; struct gnttab_backend_ops *xen_gnttab_ops; struct foreignmem_backend_ops *xen_foreignmem_ops; -- git-series 0.9.1
[PATCH v3 2/2] xen: fix stubdom PCI addr
When running in a stubdomain, the config space access via sysfs needs to use BDF as seen inside stubdomain (connected via xen-pcifront), which is different from the real BDF. For other purposes (hypercall parameters etc), the real BDF needs to be used. Get the in-stubdomain BDF by looking up relevant PV PCI xenstore entries. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v3: - reduce 'path' size - add two missing error_setg() calls - coding style Changes in v2: - use xs_node_scanf - use %d instead of %u to read values written as %d - add a comment from another iteration of this patch by Jason Andryuk --- hw/xen/xen-host-pci-device.c | 76 - hw/xen/xen-host-pci-device.h | 6 +++- 2 files changed, 81 insertions(+), 1 deletion(-) diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c index 8c6e9a1..eaf32f2 100644 --- a/hw/xen/xen-host-pci-device.c +++ b/hw/xen/xen-host-pci-device.c @@ -9,6 +9,8 @@ #include "qemu/osdep.h" #include "qapi/error.h" #include "qemu/cutils.h" +#include "hw/xen/xen-legacy-backend.h" +#include "hw/xen/xen-bus-helper.h" #include "xen-host-pci-device.h" #define XEN_HOST_PCI_MAX_EXT_CAP \ @@ -33,13 +35,73 @@ #define IORESOURCE_PREFETCH 0x1000 /* No side effects */ #define IORESOURCE_MEM_64 0x0010 +/* + * Non-passthrough (dom0) accesses are local PCI devices and use the given BDF + * Passthough (stubdom) accesses are through PV frontend PCI device. Those + * either have a BDF identical to the backend's BDF (xen-backend.passthrough=1) + * or a local virtual BDF (xen-backend.passthrough=0) + * + * We are always given the backend's BDF and need to lookup the appropriate + * local BDF for sysfs access. + */ +static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp) +{ +unsigned int num_devs, len, i; +unsigned int domain, bus, dev, func; +char *be_path = NULL; +char path[16]; + +be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", ); +if (!be_path) { +error_setg(errp, "Failed to read device/pci/0/backend"); +goto out; +} + +if (xs_node_scanf(xenstore, 0, be_path, "num_devs", NULL, + "%d", _devs) != 1) { +error_setg(errp, "Failed to read or parse %s/num_devs", be_path); +goto out; +} + +for (i = 0; i < num_devs; i++) { +snprintf(path, sizeof(path), "dev-%d", i); +if (xs_node_scanf(xenstore, 0, be_path, path, NULL, + "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to read or parse %s/%s", be_path, path); +goto out; +} +if (domain != d->domain || +bus != d->bus || +dev != d->dev || +func != d->func) +continue; +snprintf(path, sizeof(path), "vdev-%d", i); +if (xs_node_scanf(xenstore, 0, be_path, path, NULL, + "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to read or parse %s/%s", be_path, path); +goto out; +} +d->local_domain = domain; +d->local_bus = bus; +d->local_dev = dev; +d->local_func = func; +goto out; +} +error_setg(errp, "Failed to find PCI device %x:%x:%x.%x in xenstore", + d->domain, d->bus, d->dev, d->func); + +out: +free(be_path); +} + static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d, const char *name, char *buf, ssize_t size) { int rc; rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s", - d->domain, d->bus, d->dev, d->func, name); + d->local_domain, d->local_bus, d->local_dev, d->local_func, + name); assert(rc >= 0 && rc < size); } @@ -342,6 +404,18 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain, d->dev = dev; d->func = func; +if (xen_is_stubdomain) { +xen_host_pci_fill_local_addr(d, errp); +if (*errp) { +goto error; +} +} else { +d->local_domain = d->domain; +d->local_bus = d->bus; +d->local_dev = d->dev; +d->local_func = d->func; +} + xen_host_pci_config_open(d, errp); if (*errp) { goto error; diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h index 4d8d34e..270dcb2 100644 --- a/hw/xen/xen-host-pci-device.h +++ b/hw/xen/xen-host-pci-device.h @@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice { uint8_t dev; uint8_t func; +/* different from the above in case of stubdomain */ +uint16_t local_domain; +uint8_t local_bus; +uint8_t local_dev; +uint8_t local_func; + uint16_t vendor_id; uint16_t device_id; uint32_t class_code; -- git-series 0.9.1
Re: [PATCH v3 1/2] IOMMU: store name for extra reserved device memory
On Wed, Mar 27, 2024 at 03:53:10AM +0100, Marek Marczykowski-Górecki wrote: > It will be useful for error reporting in a subsequent patch. > > Signed-off-by: Marek Marczykowski-Górecki > Acked-by: Jan Beulich This one is already applied, sorry for re-send. > --- > New in v2 > --- > xen/drivers/char/xhci-dbc.c | 3 ++- > xen/drivers/passthrough/iommu.c | 5 - > xen/include/xen/iommu.h | 3 ++- > 3 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c > index 3bf389be7d0b..8e2037f1a5f7 100644 > --- a/xen/drivers/char/xhci-dbc.c > +++ b/xen/drivers/char/xhci-dbc.c > @@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void) > iommu_add_extra_reserved_device_memory( > PFN_DOWN(virt_to_maddr(_dma_bufs)), > PFN_UP(sizeof(dbc_dma_bufs)), > -uart->dbc.sbdf); > +uart->dbc.sbdf, > +"XHCI console"); > serial_register_uart(SERHND_XHCI, _uart_driver, _uart); > } > } > diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c > index 996c31be1284..03587c0cd680 100644 > --- a/xen/drivers/passthrough/iommu.c > +++ b/xen/drivers/passthrough/iommu.c > @@ -682,6 +682,7 @@ struct extra_reserved_range { > unsigned long start; > unsigned long nr; > pci_sbdf_t sbdf; > +const char *name; > }; > static unsigned int __initdata nr_extra_reserved_ranges; > static struct extra_reserved_range __initdata > @@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata > > int __init iommu_add_extra_reserved_device_memory(unsigned long start, >unsigned long nr, > - pci_sbdf_t sbdf) > + pci_sbdf_t sbdf, > + const char *name) > { > unsigned int idx; > > @@ -700,6 +702,7 @@ int __init > iommu_add_extra_reserved_device_memory(unsigned long start, > extra_reserved_ranges[idx].start = start; > extra_reserved_ranges[idx].nr = nr; > extra_reserved_ranges[idx].sbdf = sbdf; > +extra_reserved_ranges[idx].name = name; > > return 0; > } > diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h > index 9621459c63ee..b7829dff4588 100644 > --- a/xen/include/xen/iommu.h > +++ b/xen/include/xen/iommu.h > @@ -329,7 +329,8 @@ struct iommu_ops { > */ > extern int iommu_add_extra_reserved_device_memory(unsigned long start, >unsigned long nr, > - pci_sbdf_t sbdf); > + pci_sbdf_t sbdf, > + const char *name); > /* > * To be called by specific IOMMU driver during initialization, > * to fetch ranges registered with iommu_add_extra_reserved_device_memory(). > -- > 2.43.0 > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v3 1/2] IOMMU: store name for extra reserved device memory
It will be useful for error reporting in a subsequent patch. Signed-off-by: Marek Marczykowski-Górecki Acked-by: Jan Beulich --- New in v2 --- xen/drivers/char/xhci-dbc.c | 3 ++- xen/drivers/passthrough/iommu.c | 5 - xen/include/xen/iommu.h | 3 ++- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c index 3bf389be7d0b..8e2037f1a5f7 100644 --- a/xen/drivers/char/xhci-dbc.c +++ b/xen/drivers/char/xhci-dbc.c @@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void) iommu_add_extra_reserved_device_memory( PFN_DOWN(virt_to_maddr(_dma_bufs)), PFN_UP(sizeof(dbc_dma_bufs)), -uart->dbc.sbdf); +uart->dbc.sbdf, +"XHCI console"); serial_register_uart(SERHND_XHCI, _uart_driver, _uart); } } diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c index 996c31be1284..03587c0cd680 100644 --- a/xen/drivers/passthrough/iommu.c +++ b/xen/drivers/passthrough/iommu.c @@ -682,6 +682,7 @@ struct extra_reserved_range { unsigned long start; unsigned long nr; pci_sbdf_t sbdf; +const char *name; }; static unsigned int __initdata nr_extra_reserved_ranges; static struct extra_reserved_range __initdata @@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata int __init iommu_add_extra_reserved_device_memory(unsigned long start, unsigned long nr, - pci_sbdf_t sbdf) + pci_sbdf_t sbdf, + const char *name) { unsigned int idx; @@ -700,6 +702,7 @@ int __init iommu_add_extra_reserved_device_memory(unsigned long start, extra_reserved_ranges[idx].start = start; extra_reserved_ranges[idx].nr = nr; extra_reserved_ranges[idx].sbdf = sbdf; +extra_reserved_ranges[idx].name = name; return 0; } diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h index 9621459c63ee..b7829dff4588 100644 --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -329,7 +329,8 @@ struct iommu_ops { */ extern int iommu_add_extra_reserved_device_memory(unsigned long start, unsigned long nr, - pci_sbdf_t sbdf); + pci_sbdf_t sbdf, + const char *name); /* * To be called by specific IOMMU driver during initialization, * to fetch ranges registered with iommu_add_extra_reserved_device_memory(). -- 2.43.0
[PATCH v3 2/2] drivers/char: mark extra reserved device memory in memory map
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory map. This should be true for addresses coming from the firmware, but when extra pages used by Xen itself are included in the mapping, those are taken from usable RAM used. Mark those pages as reserved too. Not marking the pages as reserved didn't caused issues before due to another a bug in IOMMU driver code, that was fixed in 83afa3135830 ("amd-vi: fix IVMD memory type checks"). Failing to reserve memory will lead to panic in IOMMU setup code. And not including the page in IOMMU mapping will lead to broken console (due to IOMMU faults). The pages chosen by the XHCI console driver should still be usable by the CPU though, and the console code already can deal with too slow console by dropping characters (and console not printing anything is a special case of "slow"). When reserving fails print an error message showing which pages failed and who requested them. This should be enough hint to find why XHCI console doesn't work. Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI" Signed-off-by: Marek Marczykowski-Górecki --- Alternative error handling could be a panic, but with this version I think it can be avoided. And not panicing gives a better chance to actually see the error message (from the hopefully started dom0), especially as the affected driver is the console one. The reserve_e820_ram() is x86-specific. Is there some equivalent API for ARM, or maybe even some abstract one? That said, I have no way to test XHCI console on ARM, I don't know if such hardware even exists... Changes in v2: - move reserving to iommu_get_extra_reserved_device_memory() to cover all users of iommu_add_extra_reserved_device_memory() - change error handling to not panic, as in this code layout it can skip sending the pages to the IOMMU driver Changes in v3: - code style, typo --- xen/drivers/passthrough/iommu.c | 17 + xen/include/xen/iommu.h | 5 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c index 03587c0cd680..ba18136c461c 100644 --- a/xen/drivers/passthrough/iommu.c +++ b/xen/drivers/passthrough/iommu.c @@ -22,6 +22,10 @@ #include #include +#ifdef CONFIG_X86 +#include +#endif + unsigned int __read_mostly iommu_dev_iotlb_timeout = 1000; integer_param("iommu_dev_iotlb_timeout", iommu_dev_iotlb_timeout); @@ -715,6 +719,19 @@ int __init iommu_get_extra_reserved_device_memory(iommu_grdm_t *func, for ( idx = 0; idx < nr_extra_reserved_ranges; idx++ ) { +#ifdef CONFIG_X86 +paddr_t start = pfn_to_paddr(extra_reserved_ranges[idx].start); +paddr_t end = pfn_to_paddr(extra_reserved_ranges[idx].start + + extra_reserved_ranges[idx].nr); + +if ( !reserve_e820_ram(, start, end) ) +{ +printk(XENLOG_ERR "Failed to reserve [%"PRIx64"-%"PRIx64") for %s, " + "skipping IOMMU mapping for it, some functionality may be broken\n", + start, end, extra_reserved_ranges[idx].name); +continue; +} +#endif ret = func(extra_reserved_ranges[idx].start, extra_reserved_ranges[idx].nr, extra_reserved_ranges[idx].sbdf.sbdf, diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h index b7829dff4588..1f56a6cf456a 100644 --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -324,7 +324,8 @@ struct iommu_ops { }; /* - * To be called by Xen internally, to register extra RMRR/IVMD ranges. + * To be called by Xen internally, to register extra RMRR/IVMD ranges for RAM + * pages. * Needs to be called before IOMMU initialization. */ extern int iommu_add_extra_reserved_device_memory(unsigned long start, @@ -334,6 +335,8 @@ extern int iommu_add_extra_reserved_device_memory(unsigned long start, /* * To be called by specific IOMMU driver during initialization, * to fetch ranges registered with iommu_add_extra_reserved_device_memory(). + * This has a side effect of marking requested ranges as "reserved" in the + * memory map. */ extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func, void *ctxt); -- 2.43.0
Re: [PATCH v2 1/2] IOMMU: store name for extra reserved device memory
On Mon, Mar 18, 2024 at 04:52:42PM +0100, Roger Pau Monné wrote: > On Mon, Mar 18, 2024 at 02:40:21PM +0100, Jan Beulich wrote: > > On 12.03.2024 17:25, Marek Marczykowski-Górecki wrote: > > > It will be useful for error reporting in a subsequent patch. > > > > > > Signed-off-by: Marek Marczykowski-Górecki > > > > > > > In principle > > Acked-by: Jan Beulich > > However, ... > > > > > --- a/xen/drivers/passthrough/iommu.c > > > +++ b/xen/drivers/passthrough/iommu.c > > > @@ -682,6 +682,7 @@ struct extra_reserved_range { > > > unsigned long start; > > > unsigned long nr; > > > pci_sbdf_t sbdf; > > > +const char *name; > > > }; > > > > ... to me "descr" (or the longer "description") would seem more suitable. > > Thoughts? > > I'm happy either way, but I don't find 'name' odd. "descr" sounds a bit weird (even though it's clear what it means), "name" is a full word. And also, quick grep suggest other places use "name" for similar purpose (I haven't found a single "descr" nor "description" struct field). -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v2 2/2] drivers/char: mark extra reserved device memory in memory map
On Mon, Mar 18, 2024 at 02:48:09PM +0100, Jan Beulich wrote: > On 12.03.2024 17:25, Marek Marczykowski-Górecki wrote: > > The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory > > map. This should be true for addresses coming from the firmware, but > > when extra pages used by Xen itself are included in the mapping, those > > are taken from usable RAM used. Mark those pages as reserved too. > > > > Not marking the pages as reserved didn't caused issues before due to > > another a bug in IOMMU driver code, that was fixed in 83afa3135830 > > ("amd-vi: fix IVMD memory type checks"). > > > > Failing to reserve memory will lead to panic in IOMMU setup code. And > > not including the page in IOMMU mapping will lead to broken console (due > > to IOMMU faults). The pages chosen by the XHCI console driver should > > still be usable by the CPU though, and the console code already can deal > > with too slow console by dropping characters (and console not printing > > anything is a special case of "slow"). When reserving fails print an error > > message showing which pages failed and who requested them. This should > > be enough hint to find why XHCI console doesn't work. > > > > Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the > > XHCI" > > Signed-off-by: Marek Marczykowski-Górecki > > --- > > Alternative error handling could be a panic, but with this version I > > think it can be avoided. And not panicing gives a better chance to > > actually see the error message (from the hopefully started dom0), > > especially as the affected driver is the console one. > > > > The reserve_e820_ram() is x86-specific. Is there some equivalent API for > > ARM, or maybe even some abstract one? That said, I have no way to test > > XHCI console on ARM, I don't know if such hardware even exists... > > These are normal PCI devices, so I don't see why they shouldn't be usable > on non-x86 systems. But this is all okay as long as XHCI depends on X86 > in Kconfig. That's why I'm asking for similar API for ARM. The x86-specific part here is only about IOMMU, other parts should work just fine (but as I said, I have no way to test). Anyway, I'll leave it to whoever will need this driver on ARM (or other arch). -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v2 1/2] hw/xen: detect when running inside stubdomain
On Tue, Mar 26, 2024 at 05:06:50PM +, Anthony PERARD wrote: > On Tue, Mar 05, 2024 at 08:12:29PM +0100, Marek Marczykowski-Górecki wrote: > > diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c > > index 124dd5f3d6..6bd4e6eb2f 100644 > > --- a/hw/xen/xen-legacy-backend.c > > +++ b/hw/xen/xen-legacy-backend.c > > @@ -603,6 +603,20 @@ static void xen_set_dynamic_sysbus(void) > > machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV); > > } > > > > +static bool xen_check_stubdomain(void) > > +{ > > +char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid); > > +int32_t dm_domid; > > +bool is_stubdom = false; > > + > > +if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) { > > +is_stubdom = dm_domid != 0; > > +} > > + > > +g_free(dm_path); > > +return is_stubdom; > > +} > > + > > void xen_be_init(void) > > { > > xenstore = qemu_xen_xs_open(); > > @@ -616,6 +630,8 @@ void xen_be_init(void) > > exit(1); > > } > > > > +xen_is_stubdomain = xen_check_stubdomain(); > > This isn't really a backend specific information, and xen_be_init() is > all about old backend implementation support. (qdisk which have been the > first to be rewritten doesn't need xen_be_init(), or shouldn't). Could > we move the initialisation elsewhere? I can try to move it, sure. > Is this relevant PV guests? If not, we could move the initialisation to > xen_hvm_init_pc(). > > Also, avoid having xen_check_stubdomain() depending on > "xen-legacy-backend", if possible. > > (In xen_hvm_init_pc(), a call to xen_register_ioreq() opens another > xenstore, as `state->xenstore`.) And xen_register_ioreq() calls xen_be_init() anyway, so it wouldn't change much in practice (at least for now)... > (There's already been effort to build QEMU without legacy backends, that > stubdom check would break in this scenario.) -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: NULL pointer dereference in xenbus_thread->...
On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote: > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote: > > Hi, > > > > I've noticed in Qubes's CI failure like this: > > > > [ 871.271292] BUG: kernel NULL pointer dereference, address: > > > > [ 871.275290] #PF: supervisor read access in kernel mode > > [ 871.277282] #PF: error_code(0x) - not-present page > > [ 871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 > > [ 871.281071] Oops: [#1] PREEMPT SMP NOPTI > > [ 871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted > > 6.1.43-1.qubes.fc37.x86_64 #1 > > [ 871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 > > [ 871.23] RIP: e030:__wake_up_common+0x4c/0x180 > > [ 871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 > > 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b > > <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40 > > [ 871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082 > > [ 871.301656] RAX: RBX: 88810541ce98 RCX: > > > > [ 871.304255] RDX: 0001 RSI: 0003 RDI: > > 88810541ce90 > > [ 871.306714] RBP: c900400f0280 R08: ffe8 R09: > > c900400f7e68 > > [ 871.309937] R10: 7ff0 R11: 888100ad3000 R12: > > c900400f7e68 > > [ 871.312326] R13: R14: R15: > > > > [ 871.314647] FS: () GS:88813ff0() > > knlGS: > > [ 871.317677] CS: 1e030 DS: ES: CR0: 80050033 > > [ 871.319644] CR2: CR3: 0001067fe000 CR4: > > 00040660 > > [ 871.321973] Call Trace: > > [ 871.322782] > > [ 871.323494] ? show_trace_log_lvl+0x1d3/0x2ef > > [ 871.324901] ? show_trace_log_lvl+0x1d3/0x2ef > > [ 871.326310] ? show_trace_log_lvl+0x1d3/0x2ef > > [ 871.327721] ? __wake_up_common_lock+0x82/0xd0 > > [ 871.329147] ? __die_body.cold+0x8/0xd > > [ 871.330378] ? page_fault_oops+0x163/0x1a0 > > [ 871.331691] ? exc_page_fault+0x70/0x170 > > [ 871.332946] ? asm_exc_page_fault+0x22/0x30 > > [ 871.334454] ? __wake_up_common+0x4c/0x180 > > [ 871.335777] __wake_up_common_lock+0x82/0xd0 > > [ 871.337183] ? process_writes+0x240/0x240 > > [ 871.338461] process_msg+0x18e/0x2f0 > > [ 871.339627] xenbus_thread+0x165/0x1c0 > > [ 871.340830] ? cpuusage_read+0x10/0x10 > > [ 871.342032] kthread+0xe9/0x110 > > [ 871.343317] ? kthread_complete_and_exit+0x20/0x20 > > [ 871.345020] ret_from_fork+0x22/0x30 > > [ 871.346239] > > [ 871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio > > snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec > > snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr > > ppdev intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 > > parport_pc parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul > > crc32c_intel polyval_clmulni polyval_generic floppy ghash_clmulni_intel > > sha512_ssse3 serio_raw virtio_scsi virtio_console bochs xhci_pci > > xhci_pci_renesas xhci_hcd qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm > > ata_generic pata_acpi xen_privcmd xen_pciback xen_blkback xen_gntalloc > > xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput > > dm_multipath > > [ 871.368892] CR2: > > [ 871.370160] ---[ end trace ]--- > > [ 871.371719] RIP: e030:__wake_up_common+0x4c/0x180 > > [ 871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 > > 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b > > <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40 > > [ 871.379866] RSP: e02b:c900400f7e10 EFLAGS: 00010082 > > [ 871.381689] RAX: RBX: 88810541ce98 RCX: > > > > [ 871.383971] RDX: 0001 RSI: 0003 RDI: > > 88810541ce90 > > [ 871.386235] RBP: c900400f0280 R08: ffe8 R09: > > c900400f7e68 > > [ 871.388521] R10: 7ff0 R11: 888100ad3000 R12: > > c900400f7e68 > > [ 871.390789] R13: R14: R15: > > > > [ 871.393101] FS: () GS:ff
Status of S0ix with Xen
Hi, S0ix came up in a recent discussion, so let me post a small status update. We have patches that makes Qubes suspend with ~90% S0ix residency. It's significantly worse than native Linux on the same system - that gets 99+%, but still a significant progress. We do have a tricky case though, because we use PCI passthrough, with stubdomains. A summary of changes required is posted by Simon in a tracking issue[1], but let me post it here too: So currently on my test system (NUC11TNHi5) I'm being able to reach S0ix residency with only dom0 running for nearly 90 % of the suspend to idle time. (That's still worse than native Linux with > 99 %.) Used changes: - Xen - Disable HPET legacy mode after it has been userd for IOAPIC test during boot ([upstream discussion](https://lore.kernel.org/xen-devel/cb408368-077d-edb5-b4ad-f80086db4...@invisiblethingslab.com/), including minimally required diff) - Tiger Lake support in Xen's mwait-idle driver. Unlike Linux' version of that driver Xen's only supports the CPUs for which it contains a hardcoded config. For the moment I just manualy added the config Linux is reading from ACPI. Needs to be tested if this is actually required (see also [Jan's comment](https://lore.kernel.org/xen-devel/f6c27788-bdd9-e5b1-a874-7f48a27c6...@suse.com/)) - Allow reading of S0ix related MSRs. Not sure if any of them is strictly required to get into S0ix states, but at the very least you want them to be able to debug things (for example via `/sys/kernel/debug/pmc_core/...`) - Teach `cpu_idle.c` that Tiger Lake also has PC{8..10} (for proper debug output) - Disable `ondemand` cpufreq governor (both `powersave` as well as `perfromance` work) - Linux in dom0 - Add wakup support to `xen-pirq` ([patch](https://lore.kernel.org/xen-devel/20230313134102.3157-1-si...@invisiblethingslab.com/) sent upstream) - If using a SATA device you need to enable low power mode for link management (`echo min_power > /sys/class/scsi_host/host0/link_power_management_policy`, tried to get it [enabled by default](https://lore.kernel.org/linux-ide/20230213102450.1604-1-si...@invisiblethingslab.com/) but that's [not so easy](https://lore.kernel.org/linux-ide/ad02467d-d623-5933-67e0-09925c185...@leemhuis.info)) First tests with a running guest and investigating the difference to native Linux are progress (the later is probably due to some timer). Since the above update, few more things came up: 1. Thunderbolt driver in Linux does not support suspending when kernel is built with CONFIG_HOTPLUG_PCI=n (which is the case in Qubes). We have a hacky workaround at [2] (the first patch), together with more details about the issue in the patch description. 2. When toolstack issues "suspend" command (via xenstore control/shutdown), Linux uses "freeze" path, that doesn't put devices into low power state. Changing that to "suspend" path helps (the second patch in [2]), and for our case seems to have no negative consequences. But we do not use live migration nor save/restore - which is another use of this feature, and we have not tested if this still works. 3. QEMU emulates (instead of passthrough) power management related config space. This is fixed at [3] The relevant patches are all linked in pull requests connected to the issue. This is still work in progress. Some of the patches were already posted to relevant mailing lists, other still require some more work and will be posted later. I'm posting it here mostly to share that we are working on it. But also, if anybody wants to try it out and maybe help improving, the patches are out there. [1] https://github.com/QubesOS/qubes-issues/issues/6411#issuecomment-1538089344 [2] https://github.com/QubesOS/qubes-linux-kernel/pull/910/files [3] https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/63/files -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v5 0/7] MSI-X support with qemu in stubdomain, and other related changes
This series includes changes to make MSI-X working with Linux stubdomain and especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting some registers on the same page as the MSI-X table. Besides the stubdomain case (of which I care more), this is also necessary for PCI-passthrough to work with lockdown enabled in dom0 (when QEMU runs in dom0). See individual patches for details. This series include also tests for MSI-X using new approach (by preventing QEMU access to /dev/mem). But for it to work, it needs QEMU change that makes use of the changes introduced here. It can be seen at https://github.com/marmarek/qemu/commits/msix Here is the pipeline that used the QEMU fork above: https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1211237368 Marek Marczykowski-Górecki (7): x86/msi: passthrough all MSI-X vector ctrl writes to device model x86/msi: Extend per-domain/device warning mechanism x86/hvm: Allow access to registers on the same page as MSI-X table automation: prevent QEMU access to /dev/mem in PCI passthrough tests automation: switch to a wifi card on ADL system [DO NOT APPLY] switch to qemu fork [DO NOT APPLY] switch to alternative artifact repo Config.mk | 4 +- automation/gitlab-ci/build.yaml | 4 +- automation/gitlab-ci/test.yaml | 4 +- automation/scripts/qubes-x86-64.sh | 9 +- automation/tests-artifacts/alpine/3.18.dockerfile | 7 +- automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 +- xen/arch/x86/hvm/vmsi.c | 224 - xen/arch/x86/include/asm/msi.h | 15 +- xen/arch/x86/msi.c | 50 ++- xen/common/kernel.c | 1 +- xen/include/public/features.h | 8 +- 11 files changed, 308 insertions(+), 20 deletions(-) base-commit: 03cf7ca23e0e876075954c558485b267b7d02406 -- git-series 0.9.1
[PATCH v5 7/7] [DO NOT APPLY] switch to alternative artifact repo
For testing, switch to my containers registry that includes containers rebuilt with changes in this series. --- automation/gitlab-ci/build.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml index 6d2cb18b8883..7dafbf144494 100644 --- a/automation/gitlab-ci/build.yaml +++ b/automation/gitlab-ci/build.yaml @@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export: alpine-3.18-rootfs-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18 script: - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz artifacts: @@ -331,7 +331,7 @@ alpine-3.18-rootfs-export: kernel-6.1.19-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19 script: - mkdir binaries && cp /bzImage binaries/bzImage artifacts: -- git-series 0.9.1
[PATCH v5 3/7] x86/hvm: Allow access to registers on the same page as MSI-X table
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers on the same page as MSI-X table. Device model (especially one in stubdomain) cannot really handle those, as direct writes to that page is refused (page is on the mmio_ro_ranges list). Instead, extend msixtbl_mmio_ops to handle such accesses too. Doing this, requires correlating read/write location with guest of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest, it requires msixtbl_entry->gtable, which is HVM-only. Similar feature for PV would need to be done separately. This will be also used to read Pending Bit Array, if it lives on the same page, making QEMU not needing /dev/mem access at all (especially helpful with lockdown enabled in dom0). If PBA lives on another page, QEMU will map it to the guest directly. If PBA lives on the same page, discard writes and log a message. Technically, writes outside of PBA could be allowed, but at this moment the precise location of PBA isn't saved, and also no known device abuses the spec in this way (at least yet). To access those registers, msixtbl_mmio_ops need the relevant page mapped. MSI handling already has infrastructure for that, using fixmap, so try to map first/last page of the MSI-X table (if necessary) and save their fixmap indexes. Note that msix_get_fixmap() does reference counting and reuses existing mapping, so just call it directly, even if the page was mapped before. Also, it uses a specific range of fixmap indexes which doesn't include 0, so use 0 as default ("not mapped") value - which simplifies code a bit. GCC 12.2.1 gets confused about 'desc' variable: arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’: arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized] 553 | if ( desc ) |^ arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here 537 | const struct msi_desc *desc; |^~~~ It's conditional initialization is actually correct (in the case where it isn't initialized, function returns early), but to avoid build failure initialize it explicitly to NULL anyway. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v5: - style fixes - include GCC version in the commit message - warn only once (per domain, per device) about failed adjacent access Changes in v4: - drop same_page parameter of msixtbl_find_entry(), distinguish two cases in relevant callers - rename adj_access_table_idx to adj_access_idx - code style fixes - drop alignment check in adjacent_{read,write}() - all callers already have it earlier - delay mapping first/last MSI-X pages until preparing device for a passthrough v3: - merge handling into msixtbl_mmio_ops - extend commit message v2: - adjust commit message - pass struct domain to msixtbl_page_handler_get_hwaddr() - reduce local variables used only once - log a warning if write is forbidden if MSI-X and PBA lives on the same page - do not passthrough unaligned accesses - handle accesses both before and after MSI-X table --- xen/arch/x86/hvm/vmsi.c| 209 -- xen/arch/x86/include/asm/msi.h | 7 +- xen/arch/x86/msi.c | 41 +++- 3 files changed, 247 insertions(+), 10 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 17983789..5ac3d9d9b4cf 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d) return d->arch.hvm.msixtbl_list.next; } +/* + * Lookup an msixtbl_entry on the same page as given addr. It's up to the + * caller to check if address is strictly part of the table - if relevant. + */ static struct msixtbl_entry *msixtbl_find_entry( struct vcpu *v, unsigned long addr) { @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry( struct domain *d = v->domain; list_for_each_entry( entry, >arch.hvm.msixtbl_list, list ) -if ( addr >= entry->gtable && - addr < entry->gtable + entry->table_len ) +if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) && + PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) ) return entry; return NULL; @@ -213,6 +217,148 @@ static struct msi_desc *msixtbl_addr_to_desc( return NULL; } +/* + * Returns: + * - ADJACENT_DONT_HANDLE if no handling should be done + * - ADJACENT_DISCARD_WRITE if write should be discarded + * - a fixmap idx to use for handling + */ +#define ADJACENT_DONT_HANDLE UINT_MAX +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1) +static unsigned int adjacent_handle( +const struct msixtbl_entry *entry, unsigned long addr, bool write) +{ +unsigned int adj_type; +struct arch_msix *msix; + +if ( !entry || !entry->pdev ) +return ADJACENT_DONT_HANDLE; + +
[PATCH v5 1/7] x86/msi: passthrough all MSI-X vector ctrl writes to device model
QEMU needs to know whether clearing maskbit of a vector is really clearing, or was already cleared before. Currently Xen sends only clearing that bit to the device model, but not setting it, so QEMU cannot detect it. Because of that, QEMU is working this around by checking via /dev/mem, but that isn't the proper approach. Give all necessary information to QEMU by passing all ctrl writes, including masking a vector. Advertise the new behavior via XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem anymore. While this commit doesn't move the whole maskbit handling to QEMU (as discussed on xen-devel as one of the possibilities), it is a necessary first step anyway. Including telling QEMU it will get all the required information to do so. The actual implementation would need to include: - a hypercall for QEMU to control just maskbit (without (re)binding the interrupt again - a methor for QEMU to tell Xen it will actually do the work Those are not part of this series. Signed-off-by: Marek Marczykowski-Górecki --- I did not added any control to enable/disable this new behavior (as Roger have suggested for possible non-QEMU ioreqs). I don't see how the new behavior could be problematic for some existing ioreq server (they already received writes to those addresses, just not all of them), but if that's really necessary, I can probably add a command line option to restore previous behavior system-wide. Changes in v5: - announce the feature only on x86 - style fixes Changes in v4: - ignore unaligned writes with X86EMUL_OKAY - restructure the code to forward all writes in _msixtbl_write() instead of manipulating return value of msixtbl_write() - this makes WRITE_LEN4_COMPLETION special case unnecessary - advertise the changed behavior via XENVER_get_features instead of DMOP v3: - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make "flags" parameter IN/OUT - move len check back to msixtbl_write() - will be needed there anyway in a later patch v2: - passthrough quad writes to emulator too (Jan) - (ab)use len==0 for write len=4 completion (Jan), but add descriptive #define for this magic value --- xen/arch/x86/hvm/vmsi.c | 19 ++- xen/common/kernel.c | 1 + xen/include/public/features.h | 8 3 files changed, 23 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index adbac965f9f7..17983789 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unsigned long flags; struct irq_desc *desc; -if ( (len != 4 && len != 8) || (address & (len - 1)) ) -return r; +if ( !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; rcu_read_lock(_rcu_lock); @@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unlock: spin_unlock_irqrestore(>lock, flags); -if ( len == 4 ) -r = X86EMUL_OKAY; +r = X86EMUL_OKAY; out: rcu_read_unlock(_rcu_lock); @@ -357,7 +356,17 @@ static int cf_check _msixtbl_write( const struct hvm_io_handler *handler, uint64_t address, uint32_t len, uint64_t val) { -return msixtbl_write(current, address, len, val); +/* Ignore invalid length or unaligned writes. */ +if ( (len != 4 && len != 8) || !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; + +/* + * This function returns X86EMUL_UNHANDLEABLE even if write is properly + * handled, to propagate it to the device model (so it can keep its + * internal state in sync). + */ +msixtbl_write(current, address, len, val); +return X86EMUL_UNHANDLEABLE; } static bool cf_check msixtbl_range( diff --git a/xen/common/kernel.c b/xen/common/kernel.c index 08dbaa2a054c..b44b2439ca8e 100644 --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -637,6 +637,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) | (1U << XENFEAT_hvm_callback_vector) | (has_pirq(d) ? (1U << XENFEAT_hvm_pirqs) : 0); +fi.submap |= (1U << XENFEAT_dm_msix_all_writes); #endif if ( !paging_mode_translate(d) || is_domain_direct_mapped(d) ) fi.submap |= (1U << XENFEAT_direct_mapped); diff --git a/xen/include/public/features.h b/xen/include/public/features.h index 4437f25d2532..880193094713 100644 --- a/xen/include/public/features.h +++ b/xen/include/public/features.h @@ -120,6 +120,14 @@ #define XENFEAT_runstate_phys_area18 #define XENFEAT_vcpu_time_phys_area 19 +/* + * If set, Xen will passthrough all MSI-X vector ctrl writes to device model, + * not only those unmasking an entry. This allows device model to properly keep + * track of the MSI-X t
[PATCH v5 5/7] automation: switch to a wifi card on ADL system
Switch to a wifi card that has registers on a MSI-X page. This tests the "x86/hvm: Allow writes to registers on the same page as MSI-X table" feature. Switch it only for HVM test, because MSI-X adjacent write is not supported on PV. This requires also including drivers and firmware in system for tests. Remove firmware unrelated to the test, to not increase initrd size too much (all firmware takes over 100MB compressed). And finally adjusts test script to handle not only eth0 as a test device, but also wlan0 and connect it to the wifi network. Signed-off-by: Marek Marczykowski-Górecki Reviewed-by: Stefano Stabellini --- This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll provide them in private. This change requires rebuilding test containers. This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/gitlab-ci/test.yaml | 4 automation/scripts/qubes-x86-64.sh | 7 +++ automation/tests-artifacts/alpine/3.18.dockerfile | 7 +++ automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++ 4 files changed, 20 insertions(+) diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml index 8b7b2e4da92d..23a183fad967 100644 --- a/automation/gitlab-ci/test.yaml +++ b/automation/gitlab-ci/test.yaml @@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug: adl-pci-hvm-x86-64-gcc-debug: extends: .adl-x86-64 + variables: +PCIDEV: "00:14.3" +WIFI_SSID: "$WIFI_HW2_SSID" +WIFI_PSK: "$WIFI_HW2_PSK" script: - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE} needs: diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index 7eabc1bd6ad4..60498ef1e89a 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -94,6 +94,13 @@ on_reboot = "destroy" domU_check=" set -x -e interface=eth0 +if [ -e /sys/class/net/wlan0 ]; then +interface=wlan0 +set +x +wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf +set -x +wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf +fi ip link set \"\$interface\" up timeout 30s udhcpc -i \"\$interface\" pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ') diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile index 9cde6c9ad4da..c323e266c7da 100644 --- a/automation/tests-artifacts/alpine/3.18.dockerfile +++ b/automation/tests-artifacts/alpine/3.18.dockerfile @@ -34,6 +34,13 @@ RUN \ apk add udev && \ apk add pciutils && \ apk add libelf && \ + apk add wpa_supplicant && \ + # Select firmware for hardware tests + apk add linux-firmware-other && \ + mkdir /lib/firmware-preserve && \ + mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \ + rm -rf /lib/firmware && \ + mv /lib/firmware-preserve /lib/firmware && \ \ # Xen cd / && \ diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile index 3a4096780d20..84ed5dff23ae 100644 --- a/automation/tests-artifacts/kernel/6.1.19.dockerfile +++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile @@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI make xen.config && \ scripts/config --enable BRIDGE && \ scripts/config --enable IGC && \ +scripts/config --enable IWLWIFI && \ +scripts/config --enable IWLMVM && \ cp .config .config.orig && \ cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \ make -j$(nproc) bzImage && \ -- git-series 0.9.1
[PATCH v5 4/7] automation: prevent QEMU access to /dev/mem in PCI passthrough tests
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain. Simulate this environment with removing /dev/mem device node. Full test for lockdown and stubdomain will come later, when all requirements will be in place. Signed-off-by: Marek Marczykowski-Górecki Acked-by: Stefano Stabellini --- This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/scripts/qubes-x86-64.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index d81ed7b931cf..7eabc1bd6ad4 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -163,6 +163,8 @@ ifconfig eth0 up ifconfig xenbr0 up ifconfig xenbr0 192.168.0.1 +# ensure QEMU wont have access /dev/mem +rm -f /dev/mem # get domU console content into test log tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" & xl create /etc/xen/domU.cfg -- git-series 0.9.1
[PATCH v5 2/7] x86/msi: Extend per-domain/device warning mechanism
The arch_msix struct had a single "warned" field with a domid for which warning was issued. Upcoming patch will need similar mechanism for few more warnings, so change it to save a bit field of issued warnings. Signed-off-by: Marek Marczykowski-Górecki --- Should I add also some helper for the boilerplate the handling requires now? I tried some macro that also sets the bit field, but I couldn't get it working. I guess I could make it working with a bitmask in a single uint8_t - would that be preferred? New in v5 --- xen/arch/x86/include/asm/msi.h | 8 +++- xen/arch/x86/msi.c | 9 +++-- 2 files changed, 14 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h index 997ccb87be0c..19b1e6631fdf 100644 --- a/xen/arch/x86/include/asm/msi.h +++ b/xen/arch/x86/include/asm/msi.h @@ -217,7 +217,13 @@ struct arch_msix { int table_idx[MAX_MSIX_TABLE_PAGES]; spinlock_t table_lock; bool host_maskall, guest_maskall; -domid_t warned; +domid_t warned_domid; +union { +uint8_t all; +struct { +bool maskall : 1; +} u; +} warned_kind; }; void early_msi_init(void); diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c index e721aaf5c001..6433df30bd60 100644 --- a/xen/arch/x86/msi.c +++ b/xen/arch/x86/msi.c @@ -364,9 +364,14 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest) domid_t domid = pdev->domain->domain_id; maskall = true; -if ( pdev->msix->warned != domid ) +if ( pdev->msix->warned_domid != domid ) { -pdev->msix->warned = domid; +pdev->msix->warned_domid = domid; +pdev->msix->warned_kind.all = 0; +} +if ( !pdev->msix->warned_kind.u.maskall ) +{ +pdev->msix->warned_kind.u.maskall = true; printk(XENLOG_G_WARNING "cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n", desc->irq, domid, >sbdf); -- git-series 0.9.1
[PATCH v5 6/7] [DO NOT APPLY] switch to qemu fork
This makes tests to use patched QEMU, to actually test the new behavior. --- Config.mk | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Config.mk b/Config.mk index a962f095ca16..5e220a1284e4 100644 --- a/Config.mk +++ b/Config.mk @@ -220,8 +220,8 @@ endif OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16 -QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git -QEMU_UPSTREAM_REVISION ?= master +QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu +QEMU_UPSTREAM_REVISION ?= origin/msix MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc -- git-series 0.9.1
Re: E820 memory allocation issue on Threadripper platforms
On Tue, Mar 12, 2024 at 05:07:12PM -0400, Jason Andryuk wrote: > On 2024-03-10 10:06, Marek Marczykowski-Górecki wrote: > > On Fri, Jan 19, 2024 at 02:40:06PM +0100, Marek Marczykowski-Górecki wrote: > > > On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote: > > > > On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich wrote: > > > > > On 17.01.2024 07:12, Patrick Plenefisch wrote: > > > > > > As someone who hasn't built a kernel in over a decade, should I > > > > > > figure > > > > > out > > > > > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and > > > > > > report > > > > > > back? > > > > > > > > > > That was largely a suggestion to perhaps allow you to gain some > > > > > workable setup. It would be of interest to us largely for > > > > > completeness. > > > > > > > > > > > > > Typo aside, setting the boot to 2MiB works! It works better for PV > > > > > > Are there any downsides of running kernel with > > > CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on > > > another affected system, and if there aren't any practical downsides, > > > I'm tempted to change it the default kernel in Qubes OS. > > > > I have the answer here: CONFIG_PHYSICAL_START=0x20 breaks booting > > Xen in KVM with OVMF. There, the memory map has: > > (XEN) 00010-0007f type=7 attr=000f > > (XEN) 00080-000807fff type=10 attr=000f > > (XEN) 000808000-00080afff type=7 attr=000f > > (XEN) 00080b000-00080bfff type=10 attr=000f > > (XEN) 00080c000-00080 type=7 attr=000f > > (XEN) 00081-0008f type=10 attr=000f > > (XEN) 00090-0015f type=4 attr=000f > > > > So, starting at 0x100 worked since type=4 (boot service data) is > > available at that time already, but with 0x20 it conflicts with > > those AcpiNvs areas around 0x80. > > > > I'm cc-ing Jason since I see he claimed relevant gitlab issue. This > > conflict at least gives easy test environment with console logged to a > > file. > > Thanks. I actually hacked Xen to reserve memory ranges in the e820 to > repro. > > I claimed the *PVH* Dom0 gitlab issue. PV is outside of my scope :( That's not bad either, it means we're getting closer to usable PVH dom0 :) -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v2 1/2] IOMMU: store name for extra reserved device memory
It will be useful for error reporting in a subsequent patch. Signed-off-by: Marek Marczykowski-Górecki --- New in v2 --- xen/drivers/char/xhci-dbc.c | 3 ++- xen/drivers/passthrough/iommu.c | 5 - xen/include/xen/iommu.h | 3 ++- 3 files changed, 8 insertions(+), 3 deletions(-) diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c index 3bf389be7d0b..8e2037f1a5f7 100644 --- a/xen/drivers/char/xhci-dbc.c +++ b/xen/drivers/char/xhci-dbc.c @@ -1421,7 +1421,8 @@ void __init xhci_dbc_uart_init(void) iommu_add_extra_reserved_device_memory( PFN_DOWN(virt_to_maddr(_dma_bufs)), PFN_UP(sizeof(dbc_dma_bufs)), -uart->dbc.sbdf); +uart->dbc.sbdf, +"XHCI console"); serial_register_uart(SERHND_XHCI, _uart_driver, _uart); } } diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c index 996c31be1284..03587c0cd680 100644 --- a/xen/drivers/passthrough/iommu.c +++ b/xen/drivers/passthrough/iommu.c @@ -682,6 +682,7 @@ struct extra_reserved_range { unsigned long start; unsigned long nr; pci_sbdf_t sbdf; +const char *name; }; static unsigned int __initdata nr_extra_reserved_ranges; static struct extra_reserved_range __initdata @@ -689,7 +690,8 @@ static struct extra_reserved_range __initdata int __init iommu_add_extra_reserved_device_memory(unsigned long start, unsigned long nr, - pci_sbdf_t sbdf) + pci_sbdf_t sbdf, + const char *name) { unsigned int idx; @@ -700,6 +702,7 @@ int __init iommu_add_extra_reserved_device_memory(unsigned long start, extra_reserved_ranges[idx].start = start; extra_reserved_ranges[idx].nr = nr; extra_reserved_ranges[idx].sbdf = sbdf; +extra_reserved_ranges[idx].name = name; return 0; } diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h index 9621459c63ee..b7829dff4588 100644 --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -329,7 +329,8 @@ struct iommu_ops { */ extern int iommu_add_extra_reserved_device_memory(unsigned long start, unsigned long nr, - pci_sbdf_t sbdf); + pci_sbdf_t sbdf, + const char *name); /* * To be called by specific IOMMU driver during initialization, * to fetch ranges registered with iommu_add_extra_reserved_device_memory(). -- 2.43.0
[PATCH v2 2/2] drivers/char: mark extra reserved device memory in memory map
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory map. This should be true for addresses coming from the firmware, but when extra pages used by Xen itself are included in the mapping, those are taken from usable RAM used. Mark those pages as reserved too. Not marking the pages as reserved didn't caused issues before due to another a bug in IOMMU driver code, that was fixed in 83afa3135830 ("amd-vi: fix IVMD memory type checks"). Failing to reserve memory will lead to panic in IOMMU setup code. And not including the page in IOMMU mapping will lead to broken console (due to IOMMU faults). The pages chosen by the XHCI console driver should still be usable by the CPU though, and the console code already can deal with too slow console by dropping characters (and console not printing anything is a special case of "slow"). When reserving fails print an error message showing which pages failed and who requested them. This should be enough hint to find why XHCI console doesn't work. Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI" Signed-off-by: Marek Marczykowski-Górecki --- Alternative error handling could be a panic, but with this version I think it can be avoided. And not panicing gives a better chance to actually see the error message (from the hopefully started dom0), especially as the affected driver is the console one. The reserve_e820_ram() is x86-specific. Is there some equivalent API for ARM, or maybe even some abstract one? That said, I have no way to test XHCI console on ARM, I don't know if such hardware even exists... Changes in v2: - move reserving to iommu_get_extra_reserved_device_memory() to cover all users of iommu_add_extra_reserved_device_memory() - change error handling to not panic, as in this code layout it can skip sending the pages to the IOMMU driver --- xen/drivers/passthrough/iommu.c | 19 +++ xen/include/xen/iommu.h | 5 - 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c index 03587c0cd680..a311a37a2a03 100644 --- a/xen/drivers/passthrough/iommu.c +++ b/xen/drivers/passthrough/iommu.c @@ -21,6 +21,9 @@ #include #include #include +#ifdef CONFIG_X86 +#include +#endif unsigned int __read_mostly iommu_dev_iotlb_timeout = 1000; integer_param("iommu_dev_iotlb_timeout", iommu_dev_iotlb_timeout); @@ -715,6 +718,22 @@ int __init iommu_get_extra_reserved_device_memory(iommu_grdm_t *func, for ( idx = 0; idx < nr_extra_reserved_ranges; idx++ ) { +#ifdef CONFIG_X86 +if ( !reserve_e820_ram( +, +pfn_to_paddr(extra_reserved_ranges[idx].start), +pfn_to_paddr(extra_reserved_ranges[idx].start + + extra_reserved_ranges[idx].nr)) ) +{ +printk(XENLOG_ERR "Failed to reserve [%"PRIx64"-%"PRIx64") for %s, " + "skipping IOMMU mapping for it, some functionality may be broken\n", + pfn_to_paddr(extra_reserved_ranges[idx].start), + pfn_to_paddr(extra_reserved_ranges[idx].start + +extra_reserved_ranges[idx].nr), + extra_reserved_ranges[idx].name); +continue; +} +#endif ret = func(extra_reserved_ranges[idx].start, extra_reserved_ranges[idx].nr, extra_reserved_ranges[idx].sbdf.sbdf, diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h index b7829dff4588..875eaeb90167 100644 --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -324,7 +324,8 @@ struct iommu_ops { }; /* - * To be called by Xen internally, to register extra RMRR/IVMD ranges. + * To be called by Xen internally, to register extra RMRR/IVMD ranges for RAM + * pages. * Needs to be called before IOMMU initialization. */ extern int iommu_add_extra_reserved_device_memory(unsigned long start, @@ -334,6 +335,8 @@ extern int iommu_add_extra_reserved_device_memory(unsigned long start, /* * To be called by specific IOMMU driver during initialization, * to fetch ranges registered with iommu_add_extra_reserved_device_memory(). + * This has a side effect of marking requested ranges as "reserverd" in the + * memory map. */ extern int iommu_get_extra_reserved_device_memory(iommu_grdm_t *func, void *ctxt); -- 2.43.0
Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
On Tue, Mar 12, 2024 at 03:37:15PM +0100, Jan Beulich wrote: > On 12.03.2024 15:24, Marek Marczykowski-Górecki wrote: > > On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote: > >> On 12.03.2024 11:24, Roger Pau Monné wrote: > >>>> --- a/xen/arch/x86/setup.c > >>>> +++ b/xen/arch/x86/setup.c > >>>> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn > >>>> __start_xen(unsigned long mbi_p) > >>>> mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges", > >>>>RANGESETF_prettyprint_hex); > >>>> > >>>> +/* Needs to happen after E820 processing but before IOMMU init */ > >>>> +xhci_dbc_uart_reserve_ram(); > >>> > >>> Overall it might be better if some generic solution for all users of > >>> iommu_add_extra_reserved_device_memory() could be implemented, > >> > >> +1 > >> > >>> but I'm > >>> unsure whether the intention is for the interface to always be used > >>> against RAM. > >> > >> I think we can work from that assumption for now. > > > > One more question - what should be the error handling in this case? > > My first reaction here is - please first propose something that's > sensible from your perspective, and then we can go from there. That's > generally easier that discussion without seeing involved code. > However, ... > > > At > > this stage, if reserving fails, I can still skip giving this range to > > the IOMMU driver, which (most likely) will result in IOMMU faults and > > in-operational device (xhci console). Since I don't know (theoretically) > > what driver requested the range, the error message can only contain an > > address and device, so will be a bit less actionable for the user > > (although it should be quite easy to notice the BDF being the XHCI one). > > > > Panic surely is safer option, but less user friendly, especially since > > (due to the above) I cannot give explicit hint to disable XHCI console. > > ... reading this I was meaning to ... > > > And kinda independently - I'm tempted to add another field to `struct > > extra_reserved_range` (and an argument to > > `iommu_add_extra_reserved_device_memory()`) - textual description, for > > the error reporting purpose. > > ... suggest minimally this. We may want to go farther, though: The party > registering the range could also supply a callback, to be invoked in > case registration fails. That callback could then undo whatever is > necessary in order to not use the memory range in question. > > That said - isn't all of this over-engineering, as the allocated memory > range must have come from a valid RAM region? In which case a simple > BUG_ON() may be all that's needed (and will never trigger in practice, > unless we truly screwed up somewhere)? In this case (with a single use of iommu_add_extra_reserved_device_memory()), it will be valid RAM. But reserving may fail for other reasons too, for example overflow of the E820 map. Undoing things certainly is possible, but quite complicated (none of the involved serial console APIs support disabling/unregistering a console). Given the simplicity of the workaround user can do (not enabling xhci console), I don't think it's worth going this route. Anyway, I'll post v2 with some version of the above and we can continue discussion there. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote: > On 12.03.2024 11:24, Roger Pau Monné wrote: > >> --- a/xen/arch/x86/setup.c > >> +++ b/xen/arch/x86/setup.c > >> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned > >> long mbi_p) > >> mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges", > >>RANGESETF_prettyprint_hex); > >> > >> +/* Needs to happen after E820 processing but before IOMMU init */ > >> +xhci_dbc_uart_reserve_ram(); > > > > Overall it might be better if some generic solution for all users of > > iommu_add_extra_reserved_device_memory() could be implemented, > > +1 > > > but I'm > > unsure whether the intention is for the interface to always be used > > against RAM. > > I think we can work from that assumption for now. One more question - what should be the error handling in this case? At this stage, if reserving fails, I can still skip giving this range to the IOMMU driver, which (most likely) will result in IOMMU faults and in-operational device (xhci console). Since I don't know (theoretically) what driver requested the range, the error message can only contain an address and device, so will be a bit less actionable for the user (although it should be quite easy to notice the BDF being the XHCI one). Panic surely is safer option, but less user friendly, especially since (due to the above) I cannot give explicit hint to disable XHCI console. And kinda independently - I'm tempted to add another field to `struct extra_reserved_range` (and an argument to `iommu_add_extra_reserved_device_memory()`) - textual description, for the error reporting purpose. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
On Tue, Mar 12, 2024 at 02:09:14PM +0100, Marek Marczykowski-Górecki wrote: > On Tue, Mar 12, 2024 at 01:38:53PM +0100, Jan Beulich wrote: > > On 12.03.2024 13:02, Marek Marczykowski-Górecki wrote: > > > BTW should e820_change_range_type() return 1 in case of mapping already > > > having the right type? Otherwise, if one wants to use > > > iommu_add_extra_reserved_device_memory() on already reserved memory, the > > > e820_change_range_type() would fail. > > > > You raised that question on Matrix yesterday, iirc, and I left it > > unanswered there because it takes archeology to find the answer (or at > > least get closer to one). And, please don't get me wrong, you could as > > well do that yourself. (My vague recollection from having dealt with > > similar code in Linux is that yes, in the example given the function > > ought to indeed have failed back then. Depending on present uses etc > > it may well be that we could reconsider, though.) > > I sure can do some archaeology there, I was just hoping any of you would > know the answer already. None of the commit messages touching that code give the answer. But looking around, most callers of reserve_e820_ram() ignore its return value. One exception is reserving memory for kexec. I guess in that case it may be intentional to fail if the area is reserved already, as it may indicate it cannot be used for kexec. Is that correct? There are also a couple of calls to e820_change_range_type() in tboot code where it changes E820_RESERVED to E820_UNUSABLE. There, I guess changing e820_change_range_type() behavior would be okay. Anyway, since we agree to focus on RAM in this API for now, it shouldn't matter anymore. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
On Tue, Mar 12, 2024 at 01:38:53PM +0100, Jan Beulich wrote: > On 12.03.2024 13:02, Marek Marczykowski-Górecki wrote: > > On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote: > >> On 12.03.2024 11:24, Roger Pau Monné wrote: > >>>> --- a/xen/arch/x86/setup.c > >>>> +++ b/xen/arch/x86/setup.c > >>>> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn > >>>> __start_xen(unsigned long mbi_p) > >>>> mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges", > >>>>RANGESETF_prettyprint_hex); > >>>> > >>>> +/* Needs to happen after E820 processing but before IOMMU init */ > >>>> +xhci_dbc_uart_reserve_ram(); > >>> > >>> Overall it might be better if some generic solution for all users of > >>> iommu_add_extra_reserved_device_memory() could be implemented, > >> > >> +1 > > > > In that case, is the approach with > > iommu_get_extra_reserved_device_memory() okay (and a comment that it > > will have a side effect...) ? > > Counter question: You not having gone that route despite our talking > about it on Matrix must have a reason. When seeing this approach I > concluded something got in the way. What's the deal? I added a note about that in the patch (below commit message): > As an alternative implementation I have considered changing > iommu_get_extra_reserved_device_memory() to modify memory map. But that > may hide (or cause) some other issues when this API will gain some more > users in the future. More specifically, if some future use would be on a non-RAM area, or already reserved area. But if we can ignore those cases for now, I'm fine with that approach. > >>> but I'm > >>> unsure whether the intention is for the interface to always be used > >>> against RAM. > >> > >> I think we can work from that assumption for now. > > > > Ok, I'll add a comment about that. I guess, if needed in the future, > > iommu_add_extra_reserved_device_memory() can gain an extra parameter to > > distinguish RAM from non-RAM mappings. > > > > BTW should e820_change_range_type() return 1 in case of mapping already > > having the right type? Otherwise, if one wants to use > > iommu_add_extra_reserved_device_memory() on already reserved memory, the > > e820_change_range_type() would fail. > > You raised that question on Matrix yesterday, iirc, and I left it > unanswered there because it takes archeology to find the answer (or at > least get closer to one). And, please don't get me wrong, you could as > well do that yourself. (My vague recollection from having dealt with > similar code in Linux is that yes, in the example given the function > ought to indeed have failed back then. Depending on present uses etc > it may well be that we could reconsider, though.) I sure can do some archaeology there, I was just hoping any of you would know the answer already. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
On Tue, Mar 12, 2024 at 11:53:46AM +0100, Jan Beulich wrote: > On 12.03.2024 11:24, Roger Pau Monné wrote: > >> --- a/xen/arch/x86/setup.c > >> +++ b/xen/arch/x86/setup.c > >> @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned > >> long mbi_p) > >> mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges", > >>RANGESETF_prettyprint_hex); > >> > >> +/* Needs to happen after E820 processing but before IOMMU init */ > >> +xhci_dbc_uart_reserve_ram(); > > > > Overall it might be better if some generic solution for all users of > > iommu_add_extra_reserved_device_memory() could be implemented, > > +1 In that case, is the approach with iommu_get_extra_reserved_device_memory() okay (and a comment that it will have a side effect...) ? > > but I'm > > unsure whether the intention is for the interface to always be used > > against RAM. > > I think we can work from that assumption for now. Ok, I'll add a comment about that. I guess, if needed in the future, iommu_add_extra_reserved_device_memory() can gain an extra parameter to distinguish RAM from non-RAM mappings. BTW should e820_change_range_type() return 1 in case of mapping already having the right type? Otherwise, if one wants to use iommu_add_extra_reserved_device_memory() on already reserved memory, the e820_change_range_type() would fail. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Release signing key still uses SHA1
Hi, The key used to sign release tarballs and git tags still uses SHA1 for its self-signature. Is updated key somewhere already? SHA1 is starting to be rejected by some tools already, for example sequoia-sq: $ sq inspect xen.pub xen.pub: OpenPGP Certificate. Fingerprint: 23E3222C145F4475FA8060A783FE14C957E82BD9 Invalid: No binding signature at time 2024-03-12T02:37:29Z Public-key algo: RSA Public-key size: 2048 bits Creation time: 2010-04-06 13:55:33 UTC UserID: Xen.org Xen tree code signing (signatures on the xen hypervisor and tools) Invalid: Policy rejected non-revocation signature (PositiveCertification) requiring second pre-image resistance because: SHA1 is not considered secure Certifications: 7, use --certifications to list -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
4.17.3 download is missing on the website
Hi, https://xenproject.org/xen-project-archives/ doesn't include 4.17.3. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH] drivers/char: mark XHCI DMA buffers reserved in memory map
The IOMMU driver checks if RMRR/IVMD are marked as reserved in memory map. This should be true for addresses coming from the firmware, but when extra pages used by Xen itself are included in the mapping, those are taken from usable RAM used. Mark those pages as reserved too. Not marking the pages as reserved didn't caused issues before due to another a bug in IOMMU driver code, that was fixed in 83afa3135830 ("amd-vi: fix IVMD memory type checks"). Failing to reserve memory will lead to panic in IOMMU setup code. And not including the page in IOMMU mapping will lead to broken console (on due to IOMMU faults). Handling failure with panic() isn't the most user friendly thing, but at this stage the alternative would require undoing a lot of console init. Since the user can do it much easier - by simply not enabling xhci console next time, say that and panic. Fixes: 3a1a7b809ffa "drivers/char: mark DMA buffers as reserved for the XHCI" Signed-off-by: Marek Marczykowski-Górecki --- As an alternative implementation I have considered changing iommu_get_extra_reserved_device_memory() to modify memory map. But that may hide (or cause) some other issues when this API will gain some more users in the future. The reserve_e820_ram() is x86-specific. Is there some equivalent API for ARM, or maybe even some abstract one? That said, I have no way to test XHCI console on ARM, I don't know if such hardware even exists... --- xen/arch/x86/setup.c| 3 +++ xen/drivers/char/xhci-dbc.c | 22 ++ xen/include/xen/serial.h| 2 ++ 3 files changed, 27 insertions(+) diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index a21984b1ccd8..8ab89b3710ed 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -1806,6 +1806,9 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p) mmio_ro_ranges = rangeset_new(NULL, "r/o mmio ranges", RANGESETF_prettyprint_hex); +/* Needs to happen after E820 processing but before IOMMU init */ +xhci_dbc_uart_reserve_ram(); + xsm_multiboot_init(module_map, mbi); /* diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c index 3bf389be7d0b..e31f3cba7838 100644 --- a/xen/drivers/char/xhci-dbc.c +++ b/xen/drivers/char/xhci-dbc.c @@ -31,6 +31,9 @@ #include #include #include +#ifdef CONFIG_X86 +#include +#endif /* uncomment to have dbc_uart_dump() debug function */ /* #define DBC_DEBUG 1 */ @@ -1426,6 +1429,25 @@ void __init xhci_dbc_uart_init(void) } } +void __init xhci_dbc_uart_reserve_ram(void) +{ +struct dbc *dbc = _uart.dbc; + +if ( !dbc->enable ) +return; + +#ifdef CONFIG_X86 +if ( !reserve_e820_ram( +, +virt_to_maddr(_dma_bufs), +virt_to_maddr(_dma_bufs) + sizeof(dbc_dma_bufs) - 1) ) +panic("Failed to reserve XHCI DMA buffers (%"PRIx64"-%"PRIx64"), " + "disable xhci console to work around\n", + virt_to_maddr(_dma_bufs), + virt_to_maddr(_dma_bufs) + sizeof(dbc_dma_bufs) - 1); +#endif +} + #ifdef DBC_DEBUG static void dbc_dump(struct dbc *dbc) { diff --git a/xen/include/xen/serial.h b/xen/include/xen/serial.h index 3d21207a3d7a..bb48eb8e8bd9 100644 --- a/xen/include/xen/serial.h +++ b/xen/include/xen/serial.h @@ -162,8 +162,10 @@ void ns16550_init(int index, struct ns16550_defaults *defaults); void ehci_dbgp_init(void); #ifdef CONFIG_XHCI void xhci_dbc_uart_init(void); +void xhci_dbc_uart_reserve_ram(void); #else static void inline xhci_dbc_uart_init(void) {} +static void inline xhci_dbc_uart_reserve_ram(void) {} #endif void arm_uart_init(void); -- 2.43.0
Regression with xhci console (was: [PATCH 1/4] amd-vi: fix IVMD memory type checks)
On Thu, Feb 01, 2024 at 06:01:56PM +0100, Roger Pau Monne wrote: > The current code that parses the IVMD blocks is relaxed with regard to the > restriction that such unity regions should always fall into memory ranges > marked as reserved in the memory map. > > However the type checks for the IVMD addresses are inverted, and as a result > IVMD ranges falling into RAM areas are accepted. Note that having such ranges > in the first place is a firmware bug, as IVMD should always fall into reserved > ranges. > > Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to > be reserved') > Signed-off-by: Roger Pau Monné FYI Xen 4.17.3 with this patch (but not others in the series) causes panic on boot on Framework 14 AMD laptop: (XEN) [4457, 5077efff] (usable) ... (XEN) AMD-Vi: Warning: IVMD: [4f83f000,4f88) is not (entirely) in reserved memory (XEN) AMD-Vi: Error: IVMD: page at 4f83f000 can't be converted ... (XEN) Xen BUG at drivers/passthrough/amd/iommu_init.c:1386 Full boot log at https://github.com/QubesOS/qubes-issues/issues/9030 4.17.2 worked fine. I'll try the whole series (and the follow up one), but I don't expect any difference. This looks to be related to XHCI console, which does use the same API to allow device to DMA into relevant buffers even when the rest of XHCI is used by dom0 (or even other domain if 'share=yes' is used): https://gitlab.com/xen-project/xen/-/blob/staging/xen/drivers/char/xhci-dbc.c?ref_type=heads#L1421-1424 -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: E820 memory allocation issue on Threadripper platforms
On Fri, Jan 19, 2024 at 02:40:06PM +0100, Marek Marczykowski-Górecki wrote: > On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote: > > On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich wrote: > > > On 17.01.2024 07:12, Patrick Plenefisch wrote: > > > > As someone who hasn't built a kernel in over a decade, should I figure > > > out > > > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report > > > > back? > > > > > > That was largely a suggestion to perhaps allow you to gain some > > > workable setup. It would be of interest to us largely for completeness. > > > > > > > Typo aside, setting the boot to 2MiB works! It works better for PV > > Are there any downsides of running kernel with > CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on > another affected system, and if there aren't any practical downsides, > I'm tempted to change it the default kernel in Qubes OS. I have the answer here: CONFIG_PHYSICAL_START=0x20 breaks booting Xen in KVM with OVMF. There, the memory map has: (XEN) 00010-0007f type=7 attr=000f (XEN) 00080-000807fff type=10 attr=000f (XEN) 000808000-00080afff type=7 attr=000f (XEN) 00080b000-00080bfff type=10 attr=000f (XEN) 00080c000-00080 type=7 attr=000f (XEN) 00081-0008f type=10 attr=000f (XEN) 00090-0015f type=4 attr=000f So, starting at 0x100 worked since type=4 (boot service data) is available at that time already, but with 0x20 it conflicts with those AcpiNvs areas around 0x80. I'm cc-ing Jason since I see he claimed relevant gitlab issue. This conflict at least gives easy test environment with console logged to a file. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [BUG]i2c_hid_acpi broken with 4.17.2 on Framework Laptop 13 AMD
On Wed, Mar 06, 2024 at 02:37:20PM -0600, Mario Limonciello wrote: > On 3/6/2024 14:34, Sébastien Chaumat wrote: > > > > > > Le mer. 6 mars 2024 à 19:51, Mario Limonciello > > mailto:mario.limoncie...@amd.com>> a écrit : > > > > On 3/6/2024 12:49, Sébastien Chaumat wrote: > > > > > > > > > Le mer. 6 mars 2024 à 19:08, Mario Limonciello > > > mailto:mario.limoncie...@amd.com> > > <mailto:mario.limoncie...@amd.com > > <mailto:mario.limoncie...@amd.com>>> a écrit : > > > > > > On 3/6/2024 12:05, Sébastien Chaumat wrote: > > > > > > > > > > > > Le mer. 6 mars 2024 à 18:33, Mario Limonciello > > > > > > > Also; I'd be really interested to hear what happens with > > > s2idle with > > > > Xen > > > > now (if it works). > > > > > > > > > > > > suspend to RAM now works :) > > > > > > Do you see /sys/power/suspend_stats/last_hw_sleep increasing > > with your > > > suspend cycle? > > > > > > > > > No, it does not increase (0). > > > > > > > OK, then in that case I suggest you run > > > > https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py > > <https://gitlab.freedesktop.org/drm/amd/-/blob/master/scripts/amd_s2idle.py> > > in the hypervisor to find out what's wrong. > > > > > > It fails with xen (not bare metal) during the prerequisite tests : > > > > The script might need some modifications for tests that don't make sense in > the context of Xen. > > > ✅ GPIO driver `pinctrl_amd` available > > Traceback (most recent call last): > > File "/home/sch/Downloads/amd_s2idle.py", line 2447, in > > test = app.prerequisites() > > ^^^ > > File "/home/sch/Downloads/amd_s2idle.py", line 1938, in prerequisites > > if not check(): > > ^^^ > > File "/home/sch/Downloads/amd_s2idle.py", line 826, in check_msr > > val = read_msr(reg, 0) > > > > File "/home/sch/Downloads/amd_s2idle.py", line 813, in read_msr > > val = struct.unpack("Q", os.read(f, 8))[0] > > ^ > > OSError: [Errno 5] Input/output error > > > > Last line in the log is : > > 2024-03-06 21:29:33,146 DEBUG: Lockdown: [none] integrity confidentiality > > > > Do you have userspace MSR support compiled in? If not; that could explain > that problem. But it's very unlikely you changed the MSRs. > For the purpose of finding where the suspend problem is, I'd comment out > that specific check for now. The s2idle isn't supported in Xen yet. Only S3 is. In fact, I have some work-in-progress patches for that, I may try testing them on this system. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v2 2/2] xen: fix stubdom PCI addr
From: Frédéric Pierret (fepitre) When running in a stubdomain, the config space access via sysfs needs to use BDF as seen inside stubdomain (connected via xen-pcifront), which is different from the real BDF. For other purposes (hypercall parameters etc), the real BDF needs to be used. Get the in-stubdomain BDF by looking up relevant PV PCI xenstore entries. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v2: - use xs_node_scanf - use %d instead of %u to read values written as %d - add a comment from another iteration of this patch by Jason Andryuk --- hw/xen/xen-host-pci-device.c | 69 +++- hw/xen/xen-host-pci-device.h | 6 2 files changed, 74 insertions(+), 1 deletion(-) diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c index 8c6e9a1716..8ea2a5a4af 100644 --- a/hw/xen/xen-host-pci-device.c +++ b/hw/xen/xen-host-pci-device.c @@ -9,6 +9,8 @@ #include "qemu/osdep.h" #include "qapi/error.h" #include "qemu/cutils.h" +#include "hw/xen/xen-legacy-backend.h" +#include "hw/xen/xen-bus-helper.h" #include "xen-host-pci-device.h" #define XEN_HOST_PCI_MAX_EXT_CAP \ @@ -33,13 +35,67 @@ #define IORESOURCE_PREFETCH 0x1000 /* No side effects */ #define IORESOURCE_MEM_64 0x0010 +/* + * Non-passthrough (dom0) accesses are local PCI devices and use the given BDF + * Passthough (stubdom) accesses are through PV frontend PCI device. Those + * either have a BDF identical to the backend's BDF (xen-backend.passthrough=1) + * or a local virtual BDF (xen-backend.passthrough=0) + * + * We are always given the backend's BDF and need to lookup the appropriate + * local BDF for sysfs access. + */ +static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp) +{ +unsigned int num_devs, len, i; +unsigned int domain, bus, dev, func; +char *be_path = NULL; +char path[80]; + +be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", ); +if (!be_path) +goto out; + +if (xs_node_scanf(xenstore, 0, be_path, "num_devs", NULL, "%d", _devs) != 1) { +error_setg(errp, "Failed to read or parse %s/num_devs\n", be_path); +goto out; +} + +for (i = 0; i < num_devs; i++) { +snprintf(path, sizeof(path), "dev-%d", i); +if (xs_node_scanf(xenstore, 0, be_path, path, NULL, + "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to read or parse %s/%s\n", be_path, path); +goto out; +} +if (domain != d->domain || +bus != d->bus || +dev != d->dev || +func!= d->func) +continue; +snprintf(path, sizeof(path), "vdev-%d", i); +if (xs_node_scanf(xenstore, 0, be_path, path, NULL, + "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to read or parse %s/%s\n", be_path, path); +goto out; +} +d->local_domain = domain; +d->local_bus = bus; +d->local_dev = dev; +d->local_func = func; +goto out; +} + +out: +free(be_path); +} + static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d, const char *name, char *buf, ssize_t size) { int rc; rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s", - d->domain, d->bus, d->dev, d->func, name); + d->local_domain, d->local_bus, d->local_dev, d->local_func, name); assert(rc >= 0 && rc < size); } @@ -342,6 +398,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain, d->dev = dev; d->func = func; +if (xen_is_stubdomain) { +xen_host_pci_fill_local_addr(d, errp); +if (*errp) +goto error; +} else { +d->local_domain = d->domain; +d->local_bus = d->bus; +d->local_dev = d->dev; +d->local_func = d->func; +} + xen_host_pci_config_open(d, errp); if (*errp) { goto error; diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h index 4d8d34ecb0..270dcb27f7 100644 --- a/hw/xen/xen-host-pci-device.h +++ b/hw/xen/xen-host-pci-device.h @@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice { uint8_t dev; uint8_t func; +/* different from the above in case of stubdomain */ +uint16_t local_domain; +uint8_t local_bus; +uint8_t local_dev; +uint8_t local_func; + uint16_t vendor_id; uint16_t device_id; uint32_t class_code; -- 2.43.0
[PATCH v2 1/2] hw/xen: detect when running inside stubdomain
Introduce global xen_is_stubdomain variable when qemu is running inside a stubdomain instead of dom0. This will be relevant for subsequent patches, as few things like accessing PCI config space need to be done differently. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v2: - use sigend int for domid to match xenstore_read_int() types, domid is in a signed range anyway - fix code style --- hw/xen/xen-legacy-backend.c | 16 include/hw/xen/xen.h| 1 + system/globals.c| 1 + 3 files changed, 18 insertions(+) diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c index 124dd5f3d6..6bd4e6eb2f 100644 --- a/hw/xen/xen-legacy-backend.c +++ b/hw/xen/xen-legacy-backend.c @@ -603,6 +603,20 @@ static void xen_set_dynamic_sysbus(void) machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV); } +static bool xen_check_stubdomain(void) +{ +char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid); +int32_t dm_domid; +bool is_stubdom = false; + +if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) { +is_stubdom = dm_domid != 0; +} + +g_free(dm_path); +return is_stubdom; +} + void xen_be_init(void) { xenstore = qemu_xen_xs_open(); @@ -616,6 +630,8 @@ void xen_be_init(void) exit(1); } +xen_is_stubdomain = xen_check_stubdomain(); + xen_sysdev = qdev_new(TYPE_XENSYSDEV); sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), _fatal); xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus"); diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index 37ecc91fc3..ecb89ecfc1 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -36,6 +36,7 @@ enum xen_mode { extern uint32_t xen_domid; extern enum xen_mode xen_mode; extern bool xen_domid_restrict; +extern bool xen_is_stubdomain; int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); int xen_set_pci_link_route(uint8_t link, uint8_t irq); diff --git a/system/globals.c b/system/globals.c index b6d4e72530..ac27d88bd4 100644 --- a/system/globals.c +++ b/system/globals.c @@ -62,6 +62,7 @@ bool qemu_uuid_set; uint32_t xen_domid; enum xen_mode xen_mode = XEN_DISABLED; bool xen_domid_restrict; +bool xen_is_stubdomain; struct evtchn_backend_ops *xen_evtchn_ops; struct gnttab_backend_ops *xen_gnttab_ops; struct foreignmem_backend_ops *xen_foreignmem_ops; -- 2.43.0
Link to the 2023 xen summit schedule is broken
Hi, The link to the last year schedule at https://events.linuxfoundation.org/archive/2023/xen-project-summit/ is broken, it opens a page for upcoming "ONE summit" event that doesn't look to be related to Xen Summit. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] ns16550: add Asix AX99100 serial card
On Mon, Feb 19, 2024 at 12:13:18PM +0100, Marek Marczykowski-Górecki wrote: > On Mon, Feb 19, 2024 at 09:57:49AM +0100, Jan Beulich wrote: > > On 18.02.2024 02:34, Marek Marczykowski-Górecki wrote: > > > @@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst > > > uart_config[] = > > > .dev_id = 0x7adc, > > > .param = param_intel_lpss > > > }, > > > +{ > > > +.vendor_id = PCI_VENDOR_ID_ASIX, > > > +.dev_id = 9100, > > > > As per Linux this is 0x9100. > > Right... but then, maybe the patch isn't needed at all, as it does work > for me. Maybe what's needed instead is some other patch already in > staging. Initial attempt that did not work was with 4.17.something. > I guess setting the fifo size isn't that important. Indeed, the patch is not needed after all, plain "master" from today works. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH 2/2] xen: fix stubdom PCI addr
On Mon, Feb 19, 2024 at 07:16:06PM +0100, Marek Marczykowski-Górecki wrote: > From: Frédéric Pierret (fepitre) This shouldn't be here, it's my patch. > When running in a stubdomain, the config space access via sysfs needs to > use BDF as seen inside stubdomain (connected via xen-pcifront), which is > different from the real BDF. For other purposes (hypercall parameters > etc), the real BDF needs to be used. > Get the in-stubdomain BDF by looking up relevant PV PCI xenstore > entries. > > Signed-off-by: Marek Marczykowski-Górecki > --- > hw/xen/xen-host-pci-device.c | 77 +++- > hw/xen/xen-host-pci-device.h | 6 +++ > 2 files changed, 82 insertions(+), 1 deletion(-) > > diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c > index 8c6e9a1716..3f8a6f84a8 100644 > --- a/hw/xen/xen-host-pci-device.c > +++ b/hw/xen/xen-host-pci-device.c > @@ -9,6 +9,7 @@ > #include "qemu/osdep.h" > #include "qapi/error.h" > #include "qemu/cutils.h" > +#include "hw/xen/xen-legacy-backend.h" > #include "xen-host-pci-device.h" > > #define XEN_HOST_PCI_MAX_EXT_CAP \ > @@ -33,13 +34,76 @@ > #define IORESOURCE_PREFETCH 0x1000 /* No side effects */ > #define IORESOURCE_MEM_64 0x0010 > > +static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp) > +{ > +unsigned int num_devs, len, i; > +unsigned int domain, bus, dev, func; > +char *be_path = NULL; > +char path[80]; > +char *msg = NULL; > + > +be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", ); > +if (!be_path) > +goto err_out; > +snprintf(path, sizeof(path), "%s/num_devs", be_path); > +msg = qemu_xen_xs_read(xenstore, 0, path, ); > +if (!msg) > +goto err_out; > + > +if (sscanf(msg, "%u", _devs) != 1) { > +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); > +goto err_out; > +} > +free(msg); > + > +for (i = 0; i < num_devs; i++) { > +snprintf(path, sizeof(path), "%s/dev-%u", be_path, i); > +msg = qemu_xen_xs_read(xenstore, 0, path, ); > +if (!msg) { > +error_setg(errp, "Failed to read %s\n", path); > +goto err_out; > +} > +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) { > +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); > +goto err_out; > +} > +free(msg); > +if (domain != d->domain || > +bus != d->bus || > +dev != d->dev || > +func!= d->func) > +continue; > +snprintf(path, sizeof(path), "%s/vdev-%u", be_path, i); > +msg = qemu_xen_xs_read(xenstore, 0, path, ); > +if (!msg) { > +error_setg(errp, "Failed to read %s\n", path); > +goto out; > +} > +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) { > +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); > +goto err_out; > +} > +free(msg); > +d->local_domain = domain; > +d->local_bus = bus; > +d->local_dev = dev; > +d->local_func = func; > +goto out; > +} > + > +err_out: > +free(msg); > +out: > +free(be_path); > +} > + > static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d, > const char *name, char *buf, ssize_t > size) > { > int rc; > > rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s", > - d->domain, d->bus, d->dev, d->func, name); > + d->local_domain, d->local_bus, d->local_dev, > d->local_func, name); > assert(rc >= 0 && rc < size); > } > > @@ -342,6 +406,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, > uint16_t domain, > d->dev = dev; > d->func = func; > > +if (xen_is_stubdomain) { > +xen_host_pci_fill_local_addr(d, errp); > +if (*errp) > +goto error; > +} else { > +d->local_domain = d->domain; > +d->local_bus = d->bus; > +d->local_dev = d->dev; > +d->local_func = d->func; > +} > + > xen_host_pci_config_open(d, errp); > if (*errp) { > goto error; > diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h > index 4d8d34ecb0..270dcb27f7 100644 > --- a/hw/xen/xen-host-pci-device.h > +++ b/hw/xen/xen-host-pci-device.h > @@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice { > uint8_t dev; > uint8_t func; > > +/* different from the above in case of stubdomain */ > +uint16_t local_domain; > +uint8_t local_bus; > +uint8_t local_dev; > +uint8_t local_func; > + > uint16_t vendor_id; > uint16_t device_id; > uint32_t class_code; > -- > 2.43.0 > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH 2/2] xen: fix stubdom PCI addr
From: Frédéric Pierret (fepitre) When running in a stubdomain, the config space access via sysfs needs to use BDF as seen inside stubdomain (connected via xen-pcifront), which is different from the real BDF. For other purposes (hypercall parameters etc), the real BDF needs to be used. Get the in-stubdomain BDF by looking up relevant PV PCI xenstore entries. Signed-off-by: Marek Marczykowski-Górecki --- hw/xen/xen-host-pci-device.c | 77 +++- hw/xen/xen-host-pci-device.h | 6 +++ 2 files changed, 82 insertions(+), 1 deletion(-) diff --git a/hw/xen/xen-host-pci-device.c b/hw/xen/xen-host-pci-device.c index 8c6e9a1716..3f8a6f84a8 100644 --- a/hw/xen/xen-host-pci-device.c +++ b/hw/xen/xen-host-pci-device.c @@ -9,6 +9,7 @@ #include "qemu/osdep.h" #include "qapi/error.h" #include "qemu/cutils.h" +#include "hw/xen/xen-legacy-backend.h" #include "xen-host-pci-device.h" #define XEN_HOST_PCI_MAX_EXT_CAP \ @@ -33,13 +34,76 @@ #define IORESOURCE_PREFETCH 0x1000 /* No side effects */ #define IORESOURCE_MEM_64 0x0010 +static void xen_host_pci_fill_local_addr(XenHostPCIDevice *d, Error **errp) +{ +unsigned int num_devs, len, i; +unsigned int domain, bus, dev, func; +char *be_path = NULL; +char path[80]; +char *msg = NULL; + +be_path = qemu_xen_xs_read(xenstore, 0, "device/pci/0/backend", ); +if (!be_path) +goto err_out; +snprintf(path, sizeof(path), "%s/num_devs", be_path); +msg = qemu_xen_xs_read(xenstore, 0, path, ); +if (!msg) +goto err_out; + +if (sscanf(msg, "%u", _devs) != 1) { +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); +goto err_out; +} +free(msg); + +for (i = 0; i < num_devs; i++) { +snprintf(path, sizeof(path), "%s/dev-%u", be_path, i); +msg = qemu_xen_xs_read(xenstore, 0, path, ); +if (!msg) { +error_setg(errp, "Failed to read %s\n", path); +goto err_out; +} +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); +goto err_out; +} +free(msg); +if (domain != d->domain || +bus != d->bus || +dev != d->dev || +func!= d->func) +continue; +snprintf(path, sizeof(path), "%s/vdev-%u", be_path, i); +msg = qemu_xen_xs_read(xenstore, 0, path, ); +if (!msg) { +error_setg(errp, "Failed to read %s\n", path); +goto out; +} +if (sscanf(msg, "%x:%x:%x.%x", , , , ) != 4) { +error_setg(errp, "Failed to parse %s (%s)\n", msg, path); +goto err_out; +} +free(msg); +d->local_domain = domain; +d->local_bus = bus; +d->local_dev = dev; +d->local_func = func; +goto out; +} + +err_out: +free(msg); +out: +free(be_path); +} + static void xen_host_pci_sysfs_path(const XenHostPCIDevice *d, const char *name, char *buf, ssize_t size) { int rc; rc = snprintf(buf, size, "/sys/bus/pci/devices/%04x:%02x:%02x.%d/%s", - d->domain, d->bus, d->dev, d->func, name); + d->local_domain, d->local_bus, d->local_dev, d->local_func, name); assert(rc >= 0 && rc < size); } @@ -342,6 +406,17 @@ void xen_host_pci_device_get(XenHostPCIDevice *d, uint16_t domain, d->dev = dev; d->func = func; +if (xen_is_stubdomain) { +xen_host_pci_fill_local_addr(d, errp); +if (*errp) +goto error; +} else { +d->local_domain = d->domain; +d->local_bus = d->bus; +d->local_dev = d->dev; +d->local_func = d->func; +} + xen_host_pci_config_open(d, errp); if (*errp) { goto error; diff --git a/hw/xen/xen-host-pci-device.h b/hw/xen/xen-host-pci-device.h index 4d8d34ecb0..270dcb27f7 100644 --- a/hw/xen/xen-host-pci-device.h +++ b/hw/xen/xen-host-pci-device.h @@ -23,6 +23,12 @@ typedef struct XenHostPCIDevice { uint8_t dev; uint8_t func; +/* different from the above in case of stubdomain */ +uint16_t local_domain; +uint8_t local_bus; +uint8_t local_dev; +uint8_t local_func; + uint16_t vendor_id; uint16_t device_id; uint32_t class_code; -- 2.43.0
[PATCH 1/2] hw/xen: detect when running inside stubdomain
Introduce global xen_is_stubdomain variable when qemu is running inside a stubdomain instead of dom0. This will be relevant for subsequent patches, as few things like accessing PCI config space need to be done differently. Signed-off-by: Marek Marczykowski-Górecki --- hw/xen/xen-legacy-backend.c | 15 +++ include/hw/xen/xen.h| 1 + system/globals.c| 1 + 3 files changed, 17 insertions(+) diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c index 124dd5f3d6..4a09ea2561 100644 --- a/hw/xen/xen-legacy-backend.c +++ b/hw/xen/xen-legacy-backend.c @@ -603,6 +603,19 @@ static void xen_set_dynamic_sysbus(void) machine_class_allow_dynamic_sysbus_dev(mc, TYPE_XENSYSDEV); } +static bool xen_check_stubdomain(void) +{ +char *dm_path = g_strdup_printf("/local/domain/%d/image", xen_domid); +uint32_t dm_domid; +bool is_stubdom = false; + +if (!xenstore_read_int(dm_path, "device-model-domid", _domid)) +is_stubdom = dm_domid != 0; + +g_free(dm_path); +return is_stubdom; +} + void xen_be_init(void) { xenstore = qemu_xen_xs_open(); @@ -616,6 +629,8 @@ void xen_be_init(void) exit(1); } +xen_is_stubdomain = xen_check_stubdomain(); + xen_sysdev = qdev_new(TYPE_XENSYSDEV); sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), _fatal); xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus"); diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h index 37ecc91fc3..ecb89ecfc1 100644 --- a/include/hw/xen/xen.h +++ b/include/hw/xen/xen.h @@ -36,6 +36,7 @@ enum xen_mode { extern uint32_t xen_domid; extern enum xen_mode xen_mode; extern bool xen_domid_restrict; +extern bool xen_is_stubdomain; int xen_pci_slot_get_pirq(PCIDevice *pci_dev, int irq_num); int xen_set_pci_link_route(uint8_t link, uint8_t irq); diff --git a/system/globals.c b/system/globals.c index b6d4e72530..ac27d88bd4 100644 --- a/system/globals.c +++ b/system/globals.c @@ -62,6 +62,7 @@ bool qemu_uuid_set; uint32_t xen_domid; enum xen_mode xen_mode = XEN_DISABLED; bool xen_domid_restrict; +bool xen_is_stubdomain; struct evtchn_backend_ops *xen_evtchn_ops; struct gnttab_backend_ops *xen_gnttab_ops; struct foreignmem_backend_ops *xen_foreignmem_ops; -- 2.43.0
Re: [PATCH] ns16550: add Asix AX99100 serial card
On Mon, Feb 19, 2024 at 09:57:49AM +0100, Jan Beulich wrote: > On 18.02.2024 02:34, Marek Marczykowski-Górecki wrote: > > @@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst > > uart_config[] = > > .dev_id = 0x7adc, > > .param = param_intel_lpss > > }, > > +{ > > +.vendor_id = PCI_VENDOR_ID_ASIX, > > +.dev_id = 9100, > > As per Linux this is 0x9100. Right... but then, maybe the patch isn't needed at all, as it does work for me. Maybe what's needed instead is some other patch already in staging. Initial attempt that did not work was with 4.17.something. I guess setting the fifo size isn't that important. > > +.param = param_asix_ax99100 > > +}, > > }; > > > > static int __init > > diff --git a/xen/include/xen/pci_ids.h b/xen/include/xen/pci_ids.h > > index e798477a7e23..2a19f4ab7872 100644 > > --- a/xen/include/xen/pci_ids.h > > +++ b/xen/include/xen/pci_ids.h > > @@ -11,3 +11,5 @@ > > #define PCI_VENDOR_ID_BROADCOM 0x14e4 > > > > #define PCI_VENDOR_ID_INTEL 0x8086 > > + > > +#define PCI_VENDOR_ID_ASIX 0x125b > > Please insert such that numeric sorting is retained. > > Jan -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH] ns16550: add Asix AX99100 serial card
It's 4-port serial card, each port is presented as a separate function. According to the specification, it features 256-byte TX FIFO buffer. Signed-off-by: Marek Marczykowski-Górecki --- It's a card plugged into a box that can function as yet another gitlab runner. --- xen/drivers/char/ns16550.c | 12 xen/include/xen/pci_ids.h | 2 ++ 2 files changed, 14 insertions(+) diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c index 97bf0985344a..00c0da3f373c 100644 --- a/xen/drivers/char/ns16550.c +++ b/xen/drivers/char/ns16550.c @@ -92,6 +92,7 @@ struct ns16550_config { param_exar_xr17v354, param_exar_xr17v358, param_intel_lpss, +param_asix_ax99100, } param; }; @@ -842,6 +843,12 @@ static const struct ns16550_config_param __initconst uart_param[] = { .mmio = 1, .max_ports = 1, }, +[param_asix_ax99100] = { +.reg_width = 1, +.lsr_mask = UART_LSR_THRE, +.max_ports = 1, +.fifo_size = 256, +}, }; static const struct ns16550_config __initconst uart_config[] = @@ -1170,6 +1177,11 @@ static const struct ns16550_config __initconst uart_config[] = .dev_id = 0x7adc, .param = param_intel_lpss }, +{ +.vendor_id = PCI_VENDOR_ID_ASIX, +.dev_id = 9100, +.param = param_asix_ax99100 +}, }; static int __init diff --git a/xen/include/xen/pci_ids.h b/xen/include/xen/pci_ids.h index e798477a7e23..2a19f4ab7872 100644 --- a/xen/include/xen/pci_ids.h +++ b/xen/include/xen/pci_ids.h @@ -11,3 +11,5 @@ #define PCI_VENDOR_ID_BROADCOM 0x14e4 #define PCI_VENDOR_ID_INTEL 0x8086 + +#define PCI_VENDOR_ID_ASIX 0x125b -- 2.43.0
Re: [PATCH] i386: load kernel on xen using DMA
On Fri, Jun 18, 2021 at 09:54:14AM +0100, Alex Bennée wrote: > > Marek Marczykowski-Górecki writes: > > > Kernel on Xen is loaded via fw_cfg. Previously it used non-DMA version, > > which loaded the kernel (and initramfs) byte by byte. Change this > > to DMA, to load in bigger chunks. > > This change alone reduces load time of a (big) kernel+initramfs from > > ~10s down to below 1s. > > > > This change was suggested initially here: > > https://lore.kernel.org/xen-devel/20180216204031.5...@gmail.com/ > > Apparently this alone is already enough to get massive speedup. > > > > Signed-off-by: Marek Marczykowski-Górecki > > --- > > hw/i386/pc.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > > index 8a84b25a03..14e43d4da4 100644 > > --- a/hw/i386/pc.c > > +++ b/hw/i386/pc.c > > @@ -839,7 +839,8 @@ void xen_load_linux(PCMachineState *pcms) > > > > assert(MACHINE(pcms)->kernel_filename != NULL); > > > > -fw_cfg = fw_cfg_init_io(FW_CFG_IO_BASE); > > +fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4, > > +_space_memory); > > fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus); > > rom_set_fw(fw_cfg); > > Gentle ping. The fix looks perfectly sane to me but I don't have any x86 > Xen HW to test this one. Are the x86 maintainers happy to take this on? Ping... > > FWIW: > > Reviewed-by: Alex Bennée > > -- > Alex Bennée > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers
On Thu, Feb 15, 2024 at 09:39:41AM +0100, Jan Beulich wrote: > On 15.02.2024 03:19, Marek Marczykowski wrote: > > On Tue, Feb 13, 2024 at 04:44:04PM +0100, Jan Beulich wrote: > >> On 13.02.2024 16:11, Marek Marczykowski wrote: > >>> On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote: > >>>> On 13.02.2024 15:53, Marek Marczykowski wrote: > >>>>> Generally provides sensible output, for example during boot (it is using > >>>>> idle vCPU then, right?). > >>>> > >>>> Before Dom0 is started: Yes. With the exception of the phase where PV > >>>> Dom0's page tables are constructed, albeit in that time window > >>>> guest_cpu_user_regs() shouldn't yield sensible data either. I can only > >>>> say I'm surprised; since I have no way to properly test with an XHCI > >>>> debug port, I'd have to see about faking something to convince myself > >>>> (unless you were to supply example output). > >>> > >>> Would you like me to test this series with xhci console? > >> > >> The behavior shouldn't really be connected to this series. But yes, 'd' > >> debug key output (just the part for the CPU the key handling was > >> actually invoked from) with the xhci debug console would be of > >> interest, for the case where that CPU at that time runs an idle vCPU. > > > > I managed to press 'd' before dom0 started. Full output at > > https://gist.github.com/marmarek/a495cd666f4aafed3a5cfcb8393f515b (it's > > Alder Lake, and smt=off, so CPU numbering is weird). > > Interestingly, I do _not_ see output for CPU0, where I'd expect the > > key handler to run... I see all the idle ones, plus one doing memory > > scrubbing. > > Which is precisely the problem, just in not exactly the manifestation > I expected. In dump_execstate() we dump host state only if the > incoming regs don't indicate guest state. Yet for the idle vCPU they > (wrongly) do here - see how guest_mode() calculates the delta to what > guest_cpu_user_regs() returns, i.e. 0 when what guest_cpu_user_regs() > returned is passed in. > > Guest state dumping is suppressed for idle vCPU-s. Hence no output > at all for the CPU where the key processing was actually invoked > from. > > > But also, I don't see info about the handling CPU when doing `xl > > debug-key d`. > > I'm afraid I'm confused, ... > > > At one time, with `xl debug-key d` I got this: > > > > (XEN) *** Dumping CPU6 guest state (d0v7): *** > > (XEN) [ Xen-4.18-unstable x86_64 debug=y Tainted: M ] > > (XEN) CPU:6 > > (XEN) RIP:e033:[] > > (XEN) RFLAGS: 0286 EM: 0 CONTEXT: pv guest (d0v7) > > (XEN) rax: 0023 rbx: 0005 rcx: 81e1546a > > (XEN) rdx: rsi: rdi: 79147611e010 > > (XEN) rbp: 88810db53200 rsp: c90041c6bde0 r8: > > (XEN) r9: r10: r11: 0286 > > (XEN) r12: 00305000 r13: 7ffc61097f40 r14: 88810db53200 > > (XEN) r15: cr0: 80050033 cr4: 00b526e0 > > (XEN) cr3: 0004ae2a7000 cr2: 006d3118 > > (XEN) fsb: 791475b8a380 gsb: 8881897c gss: > > (XEN) ds: es: fs: gs: ss: e02b cs: e033 > > (XEN) Guest stack trace from rsp=c90041c6bde0: > > (XEN)0001 c02905a6 0023 > > (XEN)79147611e010 > > (XEN) 8064129fc747f100 c0291568 00305000 > > (XEN)8064129fc747f100 0005 813f7d4d c90041c6bf58 > > (XEN)c90041c6bf48 > > (XEN)81e16158 006d3118 c90041c6bf58 0040 > > (XEN)8132f6bb 0006 c90041c6bf58 006d3118 > > (XEN)0255 888102cf8880 888102cf88f0 8108746f > > (XEN) 0002 c90041c6bf58 c90041c6bf58 > > (XEN)006d3118 0006 > > (XEN) 81e1a975 > > (XEN) 829b 0043d9b0 > > (XEN)0043d990 7ffc61097f90 7ffc61097fc0 7ffc61099d16 > > (XEN)006cab40 0246 0001 >
Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers
On Tue, Feb 13, 2024 at 04:44:04PM +0100, Jan Beulich wrote: > On 13.02.2024 16:11, Marek Marczykowski wrote: > > On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote: > >> On 13.02.2024 15:53, Marek Marczykowski wrote: > >>> On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote: > >>>> On 13.02.2024 04:43, Marek Marczykowski wrote: > >>>>> On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote: > >>>>>> On 08.02.2024 23:00, Julien Grall wrote: > >>>>>>> On 05/02/2024 13:27, Jan Beulich wrote: > >>>>>>>> In preparation of dropping the register parameters from > >>>>>>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions, > >>>>>>>> register state needs making available another way for the few key > >>>>>>>> handlers which need it. Fake IRQ-like state. > >>>>>>>> > >>>>>>>> Signed-off-by: Jan Beulich > >>>>>>>> --- > >>>>>>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent > >>>>>>>> with > >>>>>>>> other console poll functions we have, and it's unclear whether that's > >>>>>>>> actually generally correct. > >>>>>>> > >>>>>>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if > >>>>>>> run_in_exception() doesn't exist. But looking at the caller, no-on > >>>>>>> seems > >>>>>>> to care about the 'regs'. So is this just a latent bug? > >>>>>> > >>>>>> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists. > >>>>>> And I can spot any use of guest_user_regs() on the respective generic > >>>>>> or Arm-specific bug.c paths. > >>>>>> > >>>>>>> BTW, do you have an idea why the poll function is not run in an > >>>>>>> exception handler? > >>>>>> > >>>>>> "The poll function" being which one? If you mean the one in xhci-dbc.c > >>>>>> then that's why I had Cc-ed Marek. Moving him to To: - maybe that > >>>>>> manages to finally catch his attention. > >>>>> > >>>>> TBH, I don't know. That's part of the original xue patch at > >>>>> https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch > >>>>> and it works for me as it is. > >>>> > >>>> "Works" meaning what? Doesn't crash on you? Or does also provide > >>>> sensible output in _all_ cases (i.e. including when e.g. the poll > >>>> happens to run on an idle vCPU)? > >>> > >>> Generally provides sensible output, for example during boot (it is using > >>> idle vCPU then, right?). > >> > >> Before Dom0 is started: Yes. With the exception of the phase where PV > >> Dom0's page tables are constructed, albeit in that time window > >> guest_cpu_user_regs() shouldn't yield sensible data either. I can only > >> say I'm surprised; since I have no way to properly test with an XHCI > >> debug port, I'd have to see about faking something to convince myself > >> (unless you were to supply example output). > > > > Would you like me to test this series with xhci console? > > The behavior shouldn't really be connected to this series. But yes, 'd' > debug key output (just the part for the CPU the key handling was > actually invoked from) with the xhci debug console would be of > interest, for the case where that CPU at that time runs an idle vCPU. I managed to press 'd' before dom0 started. Full output at https://gist.github.com/marmarek/a495cd666f4aafed3a5cfcb8393f515b (it's Alder Lake, and smt=off, so CPU numbering is weird). Interestingly, I do _not_ see output for CPU0, where I'd expect the key handler to run... I see all the idle ones, plus one doing memory scrubbing. But also, I don't see info about the handling CPU when doing `xl debug-key d`. At one time, with `xl debug-key d` I got this: (XEN) *** Dumping CPU6 guest state (d0v7): *** (XEN) [ Xen-4.18-unstable x86_64 debug=y Tainted: M ] (XEN) CPU:6 (XEN) RIP:e033:[] (XEN) RFLAGS: 0286 EM: 0 CONTEXT: pv guest (d0v7) (XEN) rax: 0023 rbx: 0005 rcx: 81e1546a (XEN) rdx: rsi: rdi: 791
Re: [XEN PATCH] build: Replace `which` with `command -v`
On Wed, Feb 14, 2024 at 02:34:11PM +, Anthony PERARD wrote: > The `which` command is not standard, may not exist on the build host, > or may not behave as expected. It is recommanded to use `command -v` > to find out if a command exist and have it's path, and it's part of a > POSIX shell standard. > > Signed-off-by: Anthony PERARD This fixes build on fedora 39: Tested-by: Marek Marczykowski-Górecki > --- > xen/Makefile | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/xen/Makefile b/xen/Makefile > index 21832d6402..767e47d6c7 100644 > --- a/xen/Makefile > +++ b/xen/Makefile > @@ -25,8 +25,8 @@ export XEN_BUILD_HOST := $(shell hostname) > endif > > # Best effort attempt to find a python interpreter, defaulting to Python 3 if > -# available. Fall back to just `python` if `which` is nowhere to be found. > -PYTHON_INTERPRETER := $(word 1,$(shell which python3 python python2 > 2>/dev/null) python) > +# available. Fall back to just `python`. > +PYTHON_INTERPRETER := $(word 1,$(shell command -v python3 || command -v > python || command -v python2) python) > export PYTHON?= $(PYTHON_INTERPRETER) > > export CHECKPOLICY ?= checkpolicy > -- > Anthony PERARD > > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers
On Tue, Feb 13, 2024 at 04:00:32PM +0100, Jan Beulich wrote: > On 13.02.2024 15:53, Marek Marczykowski wrote: > > On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote: > >> On 13.02.2024 04:43, Marek Marczykowski wrote: > >>> On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote: > >>>> On 08.02.2024 23:00, Julien Grall wrote: > >>>>> On 05/02/2024 13:27, Jan Beulich wrote: > >>>>>> In preparation of dropping the register parameters from > >>>>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions, > >>>>>> register state needs making available another way for the few key > >>>>>> handlers which need it. Fake IRQ-like state. > >>>>>> > >>>>>> Signed-off-by: Jan Beulich > >>>>>> --- > >>>>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent > >>>>>> with > >>>>>> other console poll functions we have, and it's unclear whether that's > >>>>>> actually generally correct. > >>>>> > >>>>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if > >>>>> run_in_exception() doesn't exist. But looking at the caller, no-on > >>>>> seems > >>>>> to care about the 'regs'. So is this just a latent bug? > >>>> > >>>> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists. > >>>> And I can spot any use of guest_user_regs() on the respective generic > >>>> or Arm-specific bug.c paths. > >>>> > >>>>> BTW, do you have an idea why the poll function is not run in an > >>>>> exception handler? > >>>> > >>>> "The poll function" being which one? If you mean the one in xhci-dbc.c > >>>> then that's why I had Cc-ed Marek. Moving him to To: - maybe that > >>>> manages to finally catch his attention. > >>> > >>> TBH, I don't know. That's part of the original xue patch at > >>> https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch > >>> and it works for me as it is. > >> > >> "Works" meaning what? Doesn't crash on you? Or does also provide > >> sensible output in _all_ cases (i.e. including when e.g. the poll > >> happens to run on an idle vCPU)? > > > > Generally provides sensible output, for example during boot (it is using > > idle vCPU then, right?). > > Before Dom0 is started: Yes. With the exception of the phase where PV > Dom0's page tables are constructed, albeit in that time window > guest_cpu_user_regs() shouldn't yield sensible data either. I can only > say I'm surprised; since I have no way to properly test with an XHCI > debug port, I'd have to see about faking something to convince myself > (unless you were to supply example output). Would you like me to test this series with xhci console? Or maybe add some extra debug prints and include their output? But note, printk from inside console code generally leads to deadlocks. What I did for some debugging was to log into some separate buffer and dump it later. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers
On Tue, Feb 13, 2024 at 08:45:54AM +0100, Jan Beulich wrote: > On 13.02.2024 04:43, Marek Marczykowski wrote: > > On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote: > >> On 08.02.2024 23:00, Julien Grall wrote: > >>> On 05/02/2024 13:27, Jan Beulich wrote: > >>>> In preparation of dropping the register parameters from > >>>> serial_[rt]x_interrupt() and in turn from IRQ handler functions, > >>>> register state needs making available another way for the few key > >>>> handlers which need it. Fake IRQ-like state. > >>>> > >>>> Signed-off-by: Jan Beulich > >>>> --- > >>>> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent with > >>>> other console poll functions we have, and it's unclear whether that's > >>>> actually generally correct. > >>> > >>> Is it? Looking at ns16550_poll() we would pass guest_user_regs() if > >>> run_in_exception() doesn't exist. But looking at the caller, no-on seems > >>> to care about the 'regs'. So is this just a latent bug? > >> > >> What do you mean by "doesn't exist"? ns16550_poll() assumes it exists. > >> And I can spot any use of guest_user_regs() on the respective generic > >> or Arm-specific bug.c paths. > >> > >>> BTW, do you have an idea why the poll function is not run in an > >>> exception handler? > >> > >> "The poll function" being which one? If you mean the one in xhci-dbc.c > >> then that's why I had Cc-ed Marek. Moving him to To: - maybe that > >> manages to finally catch his attention. > > > > TBH, I don't know. That's part of the original xue patch at > > https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch > > and it works for me as it is. > > "Works" meaning what? Doesn't crash on you? Or does also provide > sensible output in _all_ cases (i.e. including when e.g. the poll > happens to run on an idle vCPU)? Generally provides sensible output, for example during boot (it is using idle vCPU then, right?). -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v3 1/8] serial: fake IRQ-regs context in poll handlers
On Mon, Feb 12, 2024 at 10:04:38AM +0100, Jan Beulich wrote: > On 08.02.2024 23:00, Julien Grall wrote: > > On 05/02/2024 13:27, Jan Beulich wrote: > >> In preparation of dropping the register parameters from > >> serial_[rt]x_interrupt() and in turn from IRQ handler functions, > >> register state needs making available another way for the few key > >> handlers which need it. Fake IRQ-like state. > >> > >> Signed-off-by: Jan Beulich > >> --- > >> The use of guest_cpu_user_regs() in dbc_uart_poll() is inconsistent with > >> other console poll functions we have, and it's unclear whether that's > >> actually generally correct. > > > > Is it? Looking at ns16550_poll() we would pass guest_user_regs() if > > run_in_exception() doesn't exist. But looking at the caller, no-on seems > > to care about the 'regs'. So is this just a latent bug? > > What do you mean by "doesn't exist"? ns16550_poll() assumes it exists. > And I can spot any use of guest_user_regs() on the respective generic > or Arm-specific bug.c paths. > > > BTW, do you have an idea why the poll function is not run in an > > exception handler? > > "The poll function" being which one? If you mean the one in xhci-dbc.c > then that's why I had Cc-ed Marek. Moving him to To: - maybe that > manages to finally catch his attention. TBH, I don't know. That's part of the original xue patch at https://github.com/connojd/xue/blob/master/patches/xen-xue-dbgp.patch and it works for me as it is. > >> Andrew suggested to move set_irq_regs() to BUGFRAME_run_fn handling; > >> it's not clear to me whether that would be (a) correct from an abstract > >> pov (that's exception, not interrupt context after all) > > > > I agree with that. > > > >> and (b) really beneficial. > > > > I guess this could help to reduce the amount of churn. I can't really > > make my mind whether this is worth it or not. So I would keep it as you did. > > Good, thanks. > > Jan -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt
On Sat, Dec 02, 2023 at 10:46:38PM +0100, Marek Marczykowski-Górecki wrote: > On Fri, Dec 01, 2023 at 09:29:30AM +0100, Juergen Gross wrote: > > Hi Marek, > > > > On 30.11.23 03:34, Marek Marczykowski-Górecki wrote: > > > Hi, > > > > > > While testing 6.7-rc3 on Qubes OS I found several warning like in the > > > subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure > > > what exactly triggers the issue, but my guess would be unbinding an > > > event channel from userspace (closing vchan connection). > > > > > > Specific message: > > > > > > [ 83.973377] [ cut here ] > > > [ 83.975523] Interrupt for port 77, but apparently not enabled; > > > per-user a0e9f1d1 > > > > Just a guess, but I think this might happen when the vchan connection > > is closed while there is still some traffic. This could result in events > > triggering in parallel to unbinding the event channel. > > > > Could you please test the attached patch (only compile tested)? > > Unfortunately that doesn't help, I get exactly the same traceback. Hi, this is still an issue on 6.7.2, any other ideas? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v4 2/2] tools: don't expose XENFEAT_hvm_pirqs by default
On Thu, Jan 25, 2024 at 10:30:41AM +0100, Roger Pau Monne wrote: > The HVM pirq feature allows routing interrupts from both physical and emulated > devices over event channels, this was done a performance improvement. However > its usage is fully undocumented, and the only reference implementation is in > Linux. It defeats the purpose of local APIC hardware virtualization, because > when using it interrupts avoid the usage of the local APIC altogether. > > It has also been reported to not work properly with certain devices, at least > when using some AMD GPUs Linux attempts to route interrupts over event > channels, but Xen doesn't correctly detect such routing, which leads to the > hypervisor complaining with: > > (XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15 > > When MSIs are attempted to be routed over event channels the entry delivery > mode is set to ExtINT, but Xen doesn't detect such routing and attempts to > inject the interrupt following the native MSI path, and the ExtINT delivery > mode is not supported. > > Disable HVM PIRQs by default and provide a per-domain option in xl.cfg to > enable such feature. Also for backwards compatibility keep the feature > enabled > for any resumed domains that don't have an explicit selection. > > Note that the only user of the feature (Linux) is also able to handle native > interrupts fine, as the feature was already not used if Xen reported local > APIC > hardware virtualization active. > > Link: https://github.com/QubesOS/qubes-issues/issues/7971 > Signed-off-by: Roger Pau Monné > Reviewed-by: Anthony PERARD For the python part - it's a bit unfortunate there is no knob to control the value, but in practice I doubt it would be useful (and especially for python bindings users), so: Acked-by: Marek Marczykowski-Górecki > --- > Changes since v2: > - Add changelog entry. > > Changes since v1: > - Fix libxl for PV guests. > --- > CHANGELOG.md | 2 ++ > docs/man/xl.cfg.5.pod.in | 7 +++ > tools/include/libxl.h | 7 +++ > tools/libs/light/libxl_create.c | 7 +-- > tools/libs/light/libxl_types.idl | 1 + > tools/libs/light/libxl_x86.c | 12 +--- > tools/python/xen/lowlevel/xc/xc.c | 4 +++- > tools/xl/xl_parse.c | 1 + > 8 files changed, 35 insertions(+), 6 deletions(-) > > diff --git a/CHANGELOG.md b/CHANGELOG.md > index 723d06425431..ddb3ab8db4e7 100644 > --- a/CHANGELOG.md > +++ b/CHANGELOG.md > @@ -9,6 +9,8 @@ The format is based on [Keep a > Changelog](https://keepachangelog.com/en/1.0.0/) > ### Changed > - Changed flexible array definitions in public I/O interface headers to not > use "1" as the number of array elements. > + - On x86: > + - HVM PIRQs are disabled by default. > > ### Added > - On x86: > diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in > index 2e234b450efb..ea8d41727d8e 100644 > --- a/docs/man/xl.cfg.5.pod.in > +++ b/docs/man/xl.cfg.5.pod.in > @@ -2460,6 +2460,13 @@ The viridian option can be specified as a boolean. A > value of true (1) > is equivalent to the list [ "defaults" ], and a value of false (0) is > equivalent to an empty list. > > +=item B > + > +Select whether the guest is allowed to route interrupts from devices (either > +emulated or passed through) over event channels. > + > +This option is disabled by default. > + > =back > > =head3 Emulated VGA Graphics Device > diff --git a/tools/include/libxl.h b/tools/include/libxl.h > index 907aa0a3303a..f1652b1664f0 100644 > --- a/tools/include/libxl.h > +++ b/tools/include/libxl.h > @@ -608,6 +608,13 @@ > * executable in order to not run it as the same user as libxl. > */ > > +/* > + * LIBXL_HAVE_HVM_PIRQ indicates the presence of the u.hvm.pirq filed in > + * libxl_domain_build_info that signals whether an HVM guest has accesses to > + * the XENFEAT_hvm_pirqs feature. > + */ > +#define LIBXL_HAVE_HVM_PIRQ 1 > + > /* > * libxl memory management > * > diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c > index ce1d43110336..0008fac607e3 100644 > --- a/tools/libs/light/libxl_create.c > +++ b/tools/libs/light/libxl_create.c > @@ -376,6 +376,7 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, > libxl_defbool_setdefault(_info->u.hvm.usb,false); > libxl_defbool_setdefault(_info->u.hvm.vkb_device, true); > libxl_defbool_setdefault(_info->u.hvm.xen_platform_pci, true); > +libxl_defbool_setdefault(_info->u.hvm.pirq, false); > > libxl_defbool_setdefault(
Re: Sketch of an idea for handling the "mixed workload" problem
On Mon, Jan 22, 2024 at 12:25:58PM +, George Dunlap wrote: > On Mon, Jan 22, 2024 at 12:17 PM Marek Marczykowski-Górecki > wrote: > > > > On Mon, Jan 22, 2024 at 11:54:14AM +, George Dunlap wrote: > > > The other issue I have with this (and essentially where I got stuck > > > developing credit2 in the first place) is testing: how do you ensure > > > that it has the properties that you expect? > > > > Audio is actually quite nice use case at this, since it's quite > > sensitive for scheduling jitter. I think even a simple "PCI passthrough a > > sound card and play/record something" should show results. Especially > > you can measure how hard you can push the system (for example artificial > > load in other domains) until it breaks. > > Are we going have a gitlab runner which says, "Marek sits in front of > his test machine and listens to audio for pops"? :-) Kinda ;) We have already audio tests in qubes CI. They do more or less the above, but using our audio virtualization. Play something, record in another domain, and compare. Running the very same thing in gitlab-ci may be too complicated (require bringing in some qubes infrastructure to make PV audio work), but maybe similar test can be done based on qemu-emulated audio or other pv audio solution? > > > How do you develop a > > > "regression test" to make sure that server-based workloads don't have > > > issues in this sort of case? > > > > For this I believe there are several benchmarking methods already, > > starting with old trusty "Linux kernel build time". > > First of all, AFAICT "Linux kernel bulid time" is not representative > of almost any actual server workload; and the end-to-end throughput > completely misses what most server workloads will actually care about, > like latency. > > Secondly, what you're testing isn't the performance of a single > workload on an empty system; you're testing how workloads *interact*. > If you want ideal throughput for a single workload on an empty system, > use the null scheduler; more complex schedulers are only necessary > when multiple different workloads interact. I should have clarified I meant `make -jNN`. But still, that's the same workload on multiple vCPUs. > FWIW this was my first stab at trying to be systematic about testing > the scheduler: > > https://github.com/gwd/schedbench > > The rump kernel project has basically died AFAIK, so anyone trying to > resurrect this would probably have to try to rebase that bit of it > against something like XTF or unikernels. > > -George -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: Sketch of an idea for handling the "mixed workload" problem
On Mon, Jan 22, 2024 at 11:54:14AM +, George Dunlap wrote: > The other issue I have with this (and essentially where I got stuck > developing credit2 in the first place) is testing: how do you ensure > that it has the properties that you expect? Audio is actually quite nice use case at this, since it's quite sensitive for scheduling jitter. I think even a simple "PCI passthrough a sound card and play/record something" should show results. Especially you can measure how hard you can push the system (for example artificial load in other domains) until it breaks. > How do you develop a > "regression test" to make sure that server-based workloads don't have > issues in this sort of case? For this I believe there are several benchmarking methods already, starting with old trusty "Linux kernel build time". -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: E820 memory allocation issue on Threadripper platforms
On Fri, Jan 19, 2024 at 02:50:38PM +0100, Jan Beulich wrote: > On 19.01.2024 14:40, Marek Marczykowski-Górecki wrote: > > On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote: > >> On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich wrote: > >>> On 17.01.2024 07:12, Patrick Plenefisch wrote: > >>>> As someone who hasn't built a kernel in over a decade, should I figure > >>> out > >>>> how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report > >>>> back? > >>> > >>> That was largely a suggestion to perhaps allow you to gain some > >>> workable setup. It would be of interest to us largely for completeness. > >>> > >> > >> Typo aside, setting the boot to 2MiB works! It works better for PV > > > > Are there any downsides of running kernel with > > CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on > > another affected system, and if there aren't any practical downsides, > > I'm tempted to change it the default kernel in Qubes OS. > > There must have been a reason to make the default 16Mb. You may want > to fish out the commit doing so ... https://git.kernel.org/torvalds/c/ceefccc93932b920 Default CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN each to 16 MB, so that both non-relocatable and relocatable kernels are loaded at 16 MB by a non-relocating bootloader. This is somewhat hacky, but it appears to be the only way to do this that does not break some some set of existing bootloaders. We want to avoid the bottom 16 MB because of large page breakup, memory holes, and ZONE_DMA. Embedded systems may need to reduce this, or update their bootloaders to be aware of the new min_alignment field. Large pages (in practice) do not apply to PV dom0, but other points could in theory. That said, I checked few other systems and I don't see any reserved regions there (there is large usable region at 0x10, other reserved regions are near the 4GB boundary). This isn't very representative sample, though... > In Qubes, though, I understand > you're always running with Xen underneath, so unless this same kernel > is also needed to run in HVM guests, some of whatever the reasons may > have been may go away. The same kernel is used for PVH/HVM guests too. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: E820 memory allocation issue on Threadripper platforms
On Thu, Jan 18, 2024 at 01:23:56AM -0500, Patrick Plenefisch wrote: > On Wed, Jan 17, 2024 at 3:46 AM Jan Beulich wrote: > > On 17.01.2024 07:12, Patrick Plenefisch wrote: > > > As someone who hasn't built a kernel in over a decade, should I figure > > out > > > how to do a kernel build with CONFIG_PHYSICAL_START=0x200 and report > > > back? > > > > That was largely a suggestion to perhaps allow you to gain some > > workable setup. It would be of interest to us largely for completeness. > > > > Typo aside, setting the boot to 2MiB works! It works better for PV Are there any downsides of running kernel with CONFIG_PHYSICAL_START=0x20? I can confirm it fixes the issue on another affected system, and if there aren't any practical downsides, I'm tempted to change it the default kernel in Qubes OS. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: E820 memory allocation issue on Threadripper platforms
On Tue, Jan 16, 2024 at 10:33:26AM +0100, Jan Beulich wrote: > ... as per > > (XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x4a0 > > there's an overlap with not exactly a hole, but with an > EfiACPIMemoryNVS region: > > (XEN) 00010-003159fff type=2 attr=000f > (XEN) 00315a000-003ff type=7 attr=000f > (XEN) 00400-004045fff type=10 attr=000f > (XEN) 004046000-009afefff type=7 attr=000f > > (the 3rd of the 4 lines). Considering there's another region higher > up: > > (XEN) 0a747f000-0a947efff type=10 attr=000f > > I'm inclined to say it is poor firmware (or, far less likely, boot > loader) behavior to clobber a rather low and entirely arbitrary RAM > range, rather than consolidating all such regions near the top of > RAM below 4Gb. FWIW, we have two more similar reports, with different motherboards and firmware versions, but the common factor is Threadripper CPU. It doesn't exclude firmware issue (it can be an issue in some common template, like edk2?), but makes it a bit less likely. > There are further such odd regions, btw: > > (XEN) 009aff000-009ff type=0 attr=000f > ... > (XEN) 00b00-00b020fff type=0 attr=000f > > If the kernel image was sufficiently much larger, these could become > a problem as well. Otoh if the kernel wasn't built with > CONFIG_PHYSICAL_START=0x100, i.e. to start at 16Mb, but at, say, > 2Mb, things should apparently work even with this unusual memory > layout (until the kernel would grow enough to again run into that > very region). Shouldn't CONFIG_RELOCATABLE=y take care of this? At least in the case of Qubes OS, it's enabled and the issue still happens. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: Thunderbolt (and other PCI hotplug) isolation
On Tue, Jan 16, 2024 at 10:58:05AM +0100, Jan Beulich wrote: > On 16.01.2024 03:20, Marek Marczykowski-Górecki wrote: > > Hi, > > > > A little background: > > In Qubes OS we try to isolate external (especially hot pluggable) > > devices as much as possible. For PCI devices, we do PCI passthrough to > > dedicated domains (sys-net, sys-usb - mostly the latter). The goal is to > > prevent unauthorized device to compromise the whole system, especially > > using DMA (either initiated by a malicious device itself, or by a > > compromised driver). For the discussion here, lets ignore what happens > > before Xen starts. > > > > The matter becomes much more complicated for hot plugged devices. I did > > some test recently, enabled PCI hoplug in dom0 kernel (we have it > > disabled by default), and this is what I got: > > 1. Hot plugged devices were properly detected, and dom0 told Xen about > > them. In my case, it was two PCI bridges and an NVMe disk. > > 2. New devices were assigned to dom0 automatically. > > 3. New leaf device (the disk) can be assigned to a HVM domU and seems to > > work. > > 4. The bridges cannot be assigned to a domU. > > > > Now, there are (at least) two problems with the above: > > i) The second point above: new device automatically gain ability to DMA (at > > least) into dom0 memory. I guess this should be easy-ish solvable for > > leaf devices by assigning them to a quarantine domain by default. There > > is an issue how to decide what devices to handle this way (for example, > > what about external devices present during Xen/dom0 startup already), > > but it feels like a problem solvable with some configuration. And of > > course dom0 will need to be adjusted to not talk to such devices > > automatically (via drivers blacklisting or similar approach). But for > > the bridge devices, it's more complicated, basically the second point > > below. > > > > ii) The fourth point above: an external PCI device remains in dom0 > > (including being able to dom0 into dom0's memory) just because it happen > > to have some specific bits in its config space set. When considering > > malicious device, it doesn't even need to function as a bridge - it's > > just enough to present itself as a bridge, wait for dom0's thunderbolt > > driver to authorize the device so it gets assigned dom0's IOMMU context, > > and boom. On the other hand, a bridge has privileged function by > > design, for example IIUC takes part in discovering devices behind it > > (which then needs to be properly registered in Xen, assigned IOMMU > > context etc). > > I may not be following the underlying concept here: If you consider a > device potentially malicious, why would you even connect it to your > system? The thing is I don't know whether I can trust the device or not. It may be a device I got from somebody, but also it may be a benign but buggy device that later got its firmware compromised. Take the example of an external nvme disk - I got some data on it to process, and the data processing itself can easily be isolated in a separate domU (so, I don't need to assume too much trust in that data), but also I'd like to not assume trust in the disk itself. Besides that, there is also an attack where somebody plugs in a device without my knowledge (when working in open space, some conference, etc). I'd like to avoid situation where screenlocker can be very quickly bypassed with a "screen unlock via DMA" thunderbolt device. > And if you mean Dom0 to not drive devices, why would you even > build the respective drivers for such a Dom0 kernel? I could exclude a lot of drivers indeed, but there are practical issues with that: for example I do want to use the _internal_ nvme disk in dom0. > > iii) Untested, but it feels like there is a lot of room for various race > > conditions in the hot plug handling. For example, device must be > > allowed any DMA only after its IOMMU context is properly configured. > > Isn't that the case already? Any attempt to DMA without respective > device / context table entry (AMD / Intel terminology) ought to result > in IOMMU faults. Ok, so in theory it should be matter of "just" assigning device to appropriate domain directly, instead of assigning to dom0 first and only later re-assigning. > > I > > believe thunderbolt technically allows that (plain PCIe hotplug most > > likely not), but my guess is it's not the case currently. > > > > My question is mostly: what can be done about the "ii" problem above? > > > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Thunderbolt (and other PCI hotplug) isolation
Hi, A little background: In Qubes OS we try to isolate external (especially hot pluggable) devices as much as possible. For PCI devices, we do PCI passthrough to dedicated domains (sys-net, sys-usb - mostly the latter). The goal is to prevent unauthorized device to compromise the whole system, especially using DMA (either initiated by a malicious device itself, or by a compromised driver). For the discussion here, lets ignore what happens before Xen starts. The matter becomes much more complicated for hot plugged devices. I did some test recently, enabled PCI hoplug in dom0 kernel (we have it disabled by default), and this is what I got: 1. Hot plugged devices were properly detected, and dom0 told Xen about them. In my case, it was two PCI bridges and an NVMe disk. 2. New devices were assigned to dom0 automatically. 3. New leaf device (the disk) can be assigned to a HVM domU and seems to work. 4. The bridges cannot be assigned to a domU. Now, there are (at least) two problems with the above: i) The second point above: new device automatically gain ability to DMA (at least) into dom0 memory. I guess this should be easy-ish solvable for leaf devices by assigning them to a quarantine domain by default. There is an issue how to decide what devices to handle this way (for example, what about external devices present during Xen/dom0 startup already), but it feels like a problem solvable with some configuration. And of course dom0 will need to be adjusted to not talk to such devices automatically (via drivers blacklisting or similar approach). But for the bridge devices, it's more complicated, basically the second point below. ii) The fourth point above: an external PCI device remains in dom0 (including being able to dom0 into dom0's memory) just because it happen to have some specific bits in its config space set. When considering malicious device, it doesn't even need to function as a bridge - it's just enough to present itself as a bridge, wait for dom0's thunderbolt driver to authorize the device so it gets assigned dom0's IOMMU context, and boom. On the other hand, a bridge has privileged function by design, for example IIUC takes part in discovering devices behind it (which then needs to be properly registered in Xen, assigned IOMMU context etc). iii) Untested, but it feels like there is a lot of room for various race conditions in the hot plug handling. For example, device must be allowed any DMA only after its IOMMU context is properly configured. I believe thunderbolt technically allows that (plain PCIe hotplug most likely not), but my guess is it's not the case currently. My question is mostly: what can be done about the "ii" problem above? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH 1/2] libxl: Remove cdrom forced QDISK w/ stubdom
On Tue, Jan 09, 2024 at 03:46:54PM -0500, Jason Andryuk wrote: > A Linux HVM domain ignores PV block devices with type cdrom. The > Windows PV drivers also ignore device-type != "disk". Therefore QEMU's > emulated CD-ROM support is used. This allows ejection and other CD-ROM > features to work. > > With a stubdom, QEMU is running in the stubdom. A PV disk is still > connected into the stubdom, and then QEMU can emulate the CD-ROM into > the guest. This removes the need for forcing to a QDISK. Relax the > checks to support this. > > Signed-off-by: Jason Andryuk > --- > tools/libs/light/libxl_disk.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/tools/libs/light/libxl_disk.c b/tools/libs/light/libxl_disk.c > index b65cad33cc..d1f84ef404 100644 > --- a/tools/libs/light/libxl_disk.c > +++ b/tools/libs/light/libxl_disk.c > @@ -192,7 +192,8 @@ static int libxl__device_disk_setdefault(libxl__gc *gc, > uint32_t domid, > > /* Force Qdisk backend for CDROM devices of guests with a device model. > */ > if (disk->is_cdrom != 0 && > -libxl__domain_type(gc, domid) == LIBXL_DOMAIN_TYPE_HVM) { > +libxl__domain_type(gc, domid) == LIBXL_DOMAIN_TYPE_HVM && > +!libxl_get_stubdom_id(CTX, domid)) { Should this check for stubdomain flavor too? I guess it won't really work with qemu-traditional. Similar check also wants to be in the next patch, instead of completely dropping stubdomain check. > if (!(disk->backend == LIBXL_DISK_BACKEND_QDISK || >disk->backend == LIBXL_DISK_BACKEND_UNKNOWN)) { > LOGD(ERROR, domid, "Backend for CD devices on HVM guests must be > Qdisk"); > -- > 2.43.0 > -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: Serial console stuck during boot, unblocked with xl debug-key
On Thu, Jan 04, 2024 at 12:59:28PM +0100, Jan Beulich wrote: > On 29.12.2023 10:50, Marek Marczykowski-Górecki wrote: > > Hi, > > > > This is continuation from matrix chat. There is an occasional failure on > > qubes-hw2 gitlab runner that console become stuck during boot. I can now > > reproduce it _much_ more often on another system, and the serial console > > output > > ends with: > > > > (XEN) Allocated console ring of 256 KiB. > > (XEN) Using HWP for cpufreq > > (XEN) mwait-idle: does not run on family 6 > > > > It should be: > > > > (XEN) Allocated console ring of 256 KiB. > > (XEN) Using HWP for cpufreq > > (XEN) mwait-idle: does not run on family 6 model 183 > > (XEN) VMX: Supported advanced features: > > (XEN) - APIC MMIO access virtualisation > > (XEN) - APIC TPR shadow > > ... > > > > > > Otherwise the system works perfectly fine, the logs are available in > > full via `xl dmesg` etc. Doing (any?) `xl debug-key` unblocks the > > console and missing logs gets dumped there too. I narrowed it down to > > the serial console tx buffer and collected some info with the attacked > > patch (it collects info still during boot, after the place where it > > usually breaks). When it works, I get: > > > > (XEN) SERIAL DEBUG: txbufc: 0x5b5, txbufp: 0x9f7, uart intr_works: 1, > > serial_txbufsz: 0x4000, tx_ready: 0, lsr_mask: 0x20, msi: 0, io_size: 8, > > skipped_interrupts: 0 > > > > And when it breaks, I get: > > > > (XEN) SERIAL DEBUG: txbufc: 0x70, txbufp: 0x9fd, uart intr_works: 1, > > serial_txbufsz: 0x4000, tx_ready: 16, lsr_mask: 0x20, msi: 0, io_size: 8, > > skipped_interrupts: 0 > > The only meaningful difference is tx_ready then. Looking at > ns16550_tx_ready() I wonder whether the LSR reports inconsistent > values on successive reads (there are at least three separate calls > to the function out of serial_tx_interrupt() alone). What you didn't > log is the LSR value itself; from the tx_ready value one can conclude > though that in the bad case fifo_size was returned, while in the good > case 0 was passed back. At the first glance this looks backwards, or > in other words I can't explain why it would be this way round. (I > assume you've had each case multiple times, and the output was > sufficiently consistent; that doesn't go without saying as your > invocation of serial_debug() is competing with the asynchronous > transmitting of data [if any].) It being this way round might suggest > that we lost an interrupt. That is my current hypothesis too. Either at the hw level (being masked for some reason at some point?) or on sw level (somehow not calling the handler - that's why adding skipped_interrupts). > Is this a real serial port, or one mimicked > by a BMC (SoL or alike)? This one is a real serial port. It isn't fully reproducible, but happened sufficiently often that I'm quite sure of the above info. Yes, my serial_debug() can interfere with data transfer, but I intentionally added it significantly later than the issue happens (I realize that console output end may not directly coincide with actual time of the problem due to async sending, but still IMO should be good enough). I later moved it to keyhandler, but that didn't give any more info. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Serial console stuck during boot, unblocked with xl debug-key
Hi, This is continuation from matrix chat. There is an occasional failure on qubes-hw2 gitlab runner that console become stuck during boot. I can now reproduce it _much_ more often on another system, and the serial console output ends with: (XEN) Allocated console ring of 256 KiB. (XEN) Using HWP for cpufreq (XEN) mwait-idle: does not run on family 6 It should be: (XEN) Allocated console ring of 256 KiB. (XEN) Using HWP for cpufreq (XEN) mwait-idle: does not run on family 6 model 183 (XEN) VMX: Supported advanced features: (XEN) - APIC MMIO access virtualisation (XEN) - APIC TPR shadow ... Otherwise the system works perfectly fine, the logs are available in full via `xl dmesg` etc. Doing (any?) `xl debug-key` unblocks the console and missing logs gets dumped there too. I narrowed it down to the serial console tx buffer and collected some info with the attacked patch (it collects info still during boot, after the place where it usually breaks). When it works, I get: (XEN) SERIAL DEBUG: txbufc: 0x5b5, txbufp: 0x9f7, uart intr_works: 1, serial_txbufsz: 0x4000, tx_ready: 0, lsr_mask: 0x20, msi: 0, io_size: 8, skipped_interrupts: 0 And when it breaks, I get: (XEN) SERIAL DEBUG: txbufc: 0x70, txbufp: 0x9fd, uart intr_works: 1, serial_txbufsz: 0x4000, tx_ready: 16, lsr_mask: 0x20, msi: 0, io_size: 8, skipped_interrupts: 0 So, I haven't found yet why it stops sending data. I'll continue adding some debug prints etc, but if anyone has some ideas what is going on, I'd appreciate hints. Full `xl dmesg` of a broken case can be seen here: https://openqa.qubes-os.org/tests/89100/logfile?filename=serial_terminal.txt Similarly, example working case is here: https://openqa.qubes-os.org/tests/89099/logfile?filename=serial0.txt -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index 5bbed3a36a..39ad935266 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -354,6 +354,8 @@ static struct page_info * __init alloc_chunk(struct domain *d, return page; } +extern void serial_debug(void); + int __init dom0_construct_pv(struct domain *d, const module_t *image, unsigned long image_headroom, @@ -412,6 +414,7 @@ int __init dom0_construct_pv(struct domain *d, /* Machine address of next candidate page-table page. */ paddr_t mpt_alloc; +serial_debug(); printk(XENLOG_INFO "*** Building a PV Dom%d ***\n", d->domain_id); d->max_pages = ~0U; diff --git a/xen/drivers/char/serial.c b/xen/drivers/char/serial.c index 324024c29a..d6a4d62a07 100644 --- a/xen/drivers/char/serial.c +++ b/xen/drivers/char/serial.c @@ -71,6 +71,80 @@ void serial_rx_interrupt(struct serial_port *port, struct cpu_user_regs *regs) (*fn)(c & 0x7f, regs); } +#include +#include +#include +#include +#include +#include +#include +#include +#ifdef NS16550_PCI +#include +#include +#include +#endif +#include +#include +#include +#ifdef CONFIG_HAS_DEVICE_TREE +#include +#endif +#ifdef CONFIG_X86 +#include +#endif + +#define NS16550_PCI + +struct ns16550 { +int baud, clock_hz, data_bits, parity, stop_bits, fifo_size, irq; +u64 io_base; /* I/O port or memory-mapped I/O address. */ +u64 io_size; +int reg_shift; /* Bits to shift register offset by */ +int reg_width; /* Size of access to use, the registers +* themselves are still bytes */ +char __iomem *remapped_io_base; /* Remapped virtual address of MMIO. */ +/* UART with IRQ line: interrupt-driven I/O. */ +struct irqaction irqaction; +u8 lsr_mask; +#ifdef CONFIG_ARM +struct vuart_info vuart; +#endif +/* UART with no IRQ line: periodically-polled I/O. */ +struct timer timer; +struct timer resume_timer; +unsigned int timeout_ms; +bool intr_works; +bool dw_usr_bsy; +#ifdef NS16550_PCI +/* PCI card parameters. */ +bool pb_bdf_enable; /* if =1, pb-bdf effective, port behind bridge */ +bool ps_bdf_enable; /* if =1, ps_bdf effective, port on pci card */ +unsigned int pb_bdf[3]; /* pci bridge BDF */ +unsigned int ps_bdf[3]; /* pci serial port BDF */ +u32 bar; +u32 bar64; +u16 cr; +u8 bar_idx; +bool msi; +const struct ns16550_config_param *param; /* Points into .init.*! */ +#endif +}; + +int skipped_interrupts = 0; + +void serial_debug(void) +{ +struct serial_port *port = [SERHND_COM1]; +struct ns16550 *uart = port->uart; +printk("SERIAL DEBUG: txbufc: %#x, txbufp: %#x, uart intr_works: %d, serial_txbufsz: %#x, tx_ready: %d, " +"lsr_mask: %#x, msi: %d, io_size: %ld, skipped_interrupts: %d\n", +port->txbufc, port->txbufp, uart->intr_works, +serial_txbufsz, +port->driver->tx_ready(po
[PATCH] libxl: Disable relocating memory for qemu-xen in stubdomain too
According to comments (and experiments) qemu-xen cannot handle memory reolcation done by hvmloader. The code was already disabled when running qemu-xen in dom0 (see libxl__spawn_local_dm()), but it was missed when adding qemu-xen support to stubdomain. Adjust libxl__spawn_stub_dm() to be consistent in this regard. Reported-by: Neowutran Signed-off-by: Marek Marczykowski-Górecki --- tools/libs/light/libxl_dm.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/tools/libs/light/libxl_dm.c b/tools/libs/light/libxl_dm.c index 14b593110f7c..ed620a9d8e14 100644 --- a/tools/libs/light/libxl_dm.c +++ b/tools/libs/light/libxl_dm.c @@ -2432,6 +2432,16 @@ void libxl__spawn_stub_dm(libxl__egc *egc, libxl__stub_dm_spawn_state *sdss) "%s", libxl_bios_type_to_string(guest_config->b_info.u.hvm.bios)); } +/* Disable relocating memory to make the MMIO hole larger + * unless we're running qemu-traditional and vNUMA is not + * configured. */ +libxl__xs_printf(gc, XBT_NULL, + libxl__sprintf(gc, "%s/hvmloader/allow-memory-relocate", +libxl__xs_get_dompath(gc, guest_domid)), + "%d", + guest_config->b_info.device_model_version +== LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL && + !libxl__vnuma_configured(_config->b_info)); ret = xc_domain_set_target(ctx->xch, dm_domid, guest_domid); if (ret<0) { LOGED(ERROR, guest_domid, "setting target domain %d -> %d", -- 2.41.0
Re: hvmloader - allow_memory_relocate overlaps
On Fri, Dec 22, 2023 at 12:35:57PM +0100, Jan Beulich wrote: > On 21.12.2023 19:08, Neowutran wrote: > > On 2023-12-19 17:12, Jan Beulich wrote: > >> On 16.12.2023 08:01, Neowutran wrote: > >>> I am wondering if the variable "allow_memory_relocate" is still > >>> relevant today and if its default value is still relevant. > >>> Should it be defined to 0 by default instead of 1 (it seems to be a > >>> workaround for qemu-traditional, so maybe an outdated default value ? ) ? > >> > >> So are you saying you use qemu-trad? > > No, I am using "qemu_xen" ( from > > xenstore-read -> 'device-model = > > "qemu_xen"' > > > >> Otherwise isn't libxl suppressing this behavior anyway? > > If by "isn't libxl suppressing this behavior" you means if libxl is setting > > the value of "allow_memory_relocate", then the answer is no. > > > > Following this lead, I checked in what code path > > "allow_memory_relocate" could be defined. > > > > It is only defined in one code path, > > > > In the file "tools/libs/light/libxl_dm.c", > > in the function "void libxl__spawn_local_dm(libxl__egc *egc, > > libxl__dm_spawn_state *dmss)": > > > > ''' > > // ... > > if (b_info->type == LIBXL_DOMAIN_TYPE_HVM) { > > // ... > > > > libxl__xs_printf(gc, XBT_NULL, > > GCSPRINTF("%s/hvmloader/allow-memory-relocate", > > path), > > "%d", > > > > b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL > > && > > !libxl__vnuma_configured(b_info)); > > // ... > > ''' > > > > However, on QubesOS (my system), "local_dm" is never used, "stub_dm" > > is always used. > > > > In the function "void lib > > xl__spawn_stub_dm(libxl__egc *egc, libxl__stub_dm_spawn_state *sdss)", > > the value of "allow_memory_relocate" is never defined. > > > > I tried to patch the code to define "allow_memory_relocate" in > > "libxl__spawn_stub_dm": > > > > ''' > > --- a/tools/libs/light/libxl_dm.c > > +++ b/tools/libs/light/libxl_dm.c > > @@ -2431,6 +2431,10 @@ void libxl__spawn_stub_dm(libxl__egc *egc, > > libxl__stub_dm_spawn_state *sdss) > > libxl__xs_get_dompath(gc, > > guest_domid)), > > "%s", > > > > libxl_bios_type_to_string(guest_config->b_info.u.hvm.bios)); > > +libxl__xs_printf(gc, XBT_NULL, > > + libxl__sprintf(gc, > > "%s/hvmloader/allow-memory-relocate", libxl__xs_get_dompath(gc, > > guest_domid)), > > + "%d", > > + 0); > > } > > ret = xc_domain_set_target(ctx->xch, dm_domid, guest_domid); > > if (ret<0) { > > ''' > > > > And it is indeed another way to solve my issue. > > However I do not understand the xen hypervisor enough to known i > > f > > "allow-memory-relocate" should have been defined in > > "libxl__spawn_stub_dm" or if the real issue is somewhere else. > > I think it should be done the same no matter where qemu runs. Back at the > time iirc only qemu-trad could run in a stubdom, which may explain the > omission. Cc-ing the two guys who are likely in the best position to tell > for sure. This indeed looks like a missing piece of qemu-xen support in stubdomain. But also, As Jan pointed out, this would be better fixed at the qemu side so memory relocation could be re-enabled. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] docs: Document a policy for when to deviate from specifications
On Mon, Sep 18, 2023 at 01:28:16PM +0100, George Dunlap wrote: > There is an ongoing disagreement among maintainers for how Xen should > handle deviations to specifications such as ACPI or EFI. > > Write up an explicit policy, and include two worked-out examples from > recent discussions. Looks very reasonable to me. While it would be nice for every hardware (or thing in general) to follow specifications, sadly it isn't reality and Xen doesn't have enough market share to influence vendors in a meaningful way. So, yes, the policy described below sounds like a reasonable approach to make things working for end users. Reviewed-by: Marek Marczykowski-Górecki PS As a downstream distributions, we do ship several workarounds that were rejected upstream on the grounds that "specification says otherwise" before... > > Signed-off-by: George Dunlap > --- > NB that the technical descriptions of the costs of the accommodations > or lack thereof I've just gathered from reading the discussions; I'm > not familiar enough with the details to assert things about them. So > please correct any technical issues. > --- > docs/policy/FollowingSpecifications.md | 219 + > 1 file changed, 219 insertions(+) > create mode 100644 docs/policy/FollowingSpecifications.md > > diff --git a/docs/policy/FollowingSpecifications.md > b/docs/policy/FollowingSpecifications.md > new file mode 100644 > index 00..a197f01f65 > --- /dev/null > +++ b/docs/policy/FollowingSpecifications.md > @@ -0,0 +1,219 @@ > +# Guidelines for following specifications > + > +## In general, follow specifications > + > +In general, specifications such as ACPI and EFI should be followed. > + > +## Accommodate non-compliant systems if it doesn't affect compliant systems > + > +Sometimes, however, there occur situations where real systems "in the > +wild" violate these specifications, or at least our interpretation of > +them (henceforth called "non-compliant"). If we can accommodate > +non-compliant systems without affecting any compliant systems, then we > +should do so. > + > +## If accommodation would affect theoretical compliant systems that are > + not known to exist, and Linux and/or Windows takes the > + accommodation, take the accommodation unless there's a > + reason not to. > + > +Sometimes, however, there occur situations where real, non-compliant > +systems "in the wild" cannot be accommodated without affecting > +theoretical compliant systems; but there are no known theoretical > +compliant systems which exist. If Linux and/or Windows take the > +accommodation, then from a cost/benefits perspective it's probably best > +for us to take the accommodation as well. > + > +This is really a generalization of the next principle; the "reason not > +to" would be in the form of a cost-benefits analysis as described in > +the next section showing why the "special case" doesn't apply to the > +accommodation in question. > + > +## If things aren't clear, do a cost-benefits analysis > + > +Sometimes, however, things are more complicated or less clear. In > +that case, we should do a cost-benefits analysis for a particular > +accommodation. Things which should be factored into the analysis: > + > +N-1: The number of non-compliant systems that require the accommodation > + N-1a: The number of known current systems > + N-1b: The probable number of unknown current systems > + N-1c: The probable number of unknown future systems > + > +N-2 The severity of the effect of non-accommodation on these systems > + > +C-1: The number of compliant systems that would be affected by the > accommodation > + C-1a: The number of known current systems > + C-1b: The probable number of unknown current systems > + C-1c: The probable number of unknown future systems > + > +C-2 The severity of the effect of accommodation on these systems > + > +Intuitively, N-1 * N-2 gives us N, the cost of not making the > +accommodation, and C-1 * C-2 gives us C, the cost of taking the > +accommodation. If N > C, then we should take the accommodation; if C > > +N, then we shouldn't. > + > +The idea isn't to come up with actual numbers to plug in here > +(although that's certainly an option if someone wants to), but to > +explain the general idea we're trying to get at. > + > +A couple of other principles to factor in: > + > +Vendors tend to copy themselves and other vendors. If one or two > +major vendors are known to create compliant or non-compliant systems > +in a particular way, then there are likely to be more unknown and > +future systems which will be affected by / need a similar accommodation > +respectively;
Re: [PATCH] security-process.pandoc: Statement on issuing XSAs for older versions of Xen
On Fri, Oct 27, 2023 at 03:26:02PM +0100, George Dunlap wrote: > We recently had a situation where a security issue was discovered > which only affected versions of Xen out of security support from an > upstream perspective. However, many downstreams (including XenServer > and SUSE) still had supported products based on the versions affected. > > Specify what the security team will do in this situation in the > future. As always, the goal here is to be fair and helpful, without > adding to the workload of the security team. Inviting downstreams to > list versions and ranges, as well as expecting them to be involved in > the patch, gives organizations without representation in the security > team the opportunity to decide to engage in the security process. At > the same time, it puts he onus of determining which products and which > versions might be affected, as well as the core work of creating and > testing a patch, on downstreams. > > Signed-off-by: George Dunlap Hi George, This is interesting proposal, indeed it looks fair, given XenServer and SUSE basically have this option already. In practice, I'm not sure how useful that would be for Qubes OS, given we don't consider DoS-only bugs security issues needing coordinated disclosure. It feels like infoleak or privesc bugs are either found earlier or affect newer versions too and in both cases they fall into standard security support anyway. But that very well might be just an impression due to no such policy earlier. In any case, in Qubes OS we support Xen 4.17 and 4.14 - the latter only for about 6 months more. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt
On Fri, Dec 01, 2023 at 09:29:30AM +0100, Juergen Gross wrote: > Hi Marek, > > On 30.11.23 03:34, Marek Marczykowski-Górecki wrote: > > Hi, > > > > While testing 6.7-rc3 on Qubes OS I found several warning like in the > > subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure > > what exactly triggers the issue, but my guess would be unbinding an > > event channel from userspace (closing vchan connection). > > > > Specific message: > > > > [ 83.973377] [ cut here ] > > [ 83.975523] Interrupt for port 77, but apparently not enabled; per-user > > a0e9f1d1 > > Just a guess, but I think this might happen when the vchan connection > is closed while there is still some traffic. This could result in events > triggering in parallel to unbinding the event channel. > > Could you please test the attached patch (only compile tested)? Unfortunately that doesn't help, I get exactly the same traceback. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH v4 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table
On Mon, Nov 27, 2023 at 06:00:57PM +0100, Jan Beulich wrote: > On 24.11.2023 02:47, Marek Marczykowski-Górecki wrote: > > GCC gets confused about 'desc' variable: > > > > arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’: > > arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized > > [-Werror=maybe-uninitialized] > > 553 | if ( desc ) > > |^ > > arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here > > 537 | const struct msi_desc *desc; > > |^~~~ > > This could do with also indicating the gcc version. Issues like this > tend to get fixed over time. Sure, I'll add it's GCC 12.2.1. And indeed, GCC 13.2.1 does not complain anymore. > > + > > +if ( !msix->adj_access_idx[adj_type] ) > > +{ > > +gprintk(XENLOG_WARNING, > > +"Page for adjacent(%d) MSI-X table access not initialized > > for %pp (addr %#lx, gtable %#lx\n", > > +adj_type, >pdev->sbdf, addr, entry->gtable); > > + > > +return ADJACENT_DONT_HANDLE; > > +} > > + > > +/* If PBA lives on the same page too, discard writes. */ > > +if ( write && > > + ((adj_type == ADJ_IDX_LAST && > > + msix->table.last == msix->pba.first) || > > + (adj_type == ADJ_IDX_FIRST && > > + msix->table.first == msix->pba.last)) ) > > +{ > > +gprintk(XENLOG_WARNING, > > +"MSI-X table and PBA of %pp live on the same page, " > > +"writing to other registers there is not implemented\n", > > +>pdev->sbdf); > > Here and above I think verbosity needs limiting to the first instance per > device per domain. Is there some clever API for that already, or do I need to introduce extra variable in some of those structures (msixtbl_entry? pci_dev?) ? (other requested changes ok) -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt
Hi, While testing 6.7-rc3 on Qubes OS I found several warning like in the subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure what exactly triggers the issue, but my guess would be unbinding an event channel from userspace (closing vchan connection). Specific message: [ 83.973377] [ cut here ] [ 83.975523] Interrupt for port 77, but apparently not enabled; per-user a0e9f1d1 [ 83.979083] WARNING: CPU: 1 PID: 2432 at drivers/xen/evtchn.c:167 evtchn_interrupt+0xb6/0xc0 [xen_evtchn] [ 83.982869] Modules linked in: joydev intel_rapl_msr intel_rapl_common snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ppdev snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm parport_pc parport pcspkr snd_timer snd e1000e i2c_piix4 soundcore fuse loop xenfs dm_thin_pool dm_persistent_data dm_bio_prison crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_scsi bochs serio_raw drm_vram_helper xhci_pci xhci_pci_renesas floppy drm_ttm_helper xhci_hcd ttm ata_generic pata_acpi qemu_fw_cfg virtio_console xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath [ 84.019753] CPU: 1 PID: 2432 Comm: qrexec-client Not tainted 6.7.0-0.rc3.1.qubes.1.fc37.x86_64 #1 [ 84.027045] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [ 84.033828] RIP: e030:evtchn_interrupt+0xb6/0xc0 [xen_evtchn] [ 84.037814] Code: ba 01 00 00 00 be 1d 00 00 00 48 8d bb 88 00 00 00 e8 ce 93 27 c1 eb b4 8b 76 20 48 89 da 48 c7 c7 70 52 21 c0 e8 0a 5c f0 c0 <0f> 0b e9 61 ff ff ff 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 [ 84.048932] RSP: e02b:c90041e7fd60 EFLAGS: 00010082 [ 84.051252] RAX: RBX: 888104fc9a80 RCX: 0027 [ 84.054569] RDX: 88813ff21608 RSI: 0001 RDI: 88813ff21600 [ 84.057606] RBP: 88810522a280 R08: R09: c90041e7fbf8 [ 84.060864] R10: 0003 R11: 82f460c8 R12: 888105d5b6a4 [ 84.064043] R13: 888105d5b760 R14: 88810522a280 R15: 888105d5b600 [ 84.066598] FS: 75ee3eb8ed40() GS:88813ff0() knlGS: [ 84.069251] CS: 1e030 DS: ES: CR0: 80050033 [ 84.071293] CR2: 7b86fa0f0f5c CR3: 000106e18000 CR4: 00040660 [ 84.073911] Call Trace: [ 84.074746] [ 84.075471] ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn] [ 84.077448] ? __warn+0x81/0x130 [ 84.078671] ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn] [ 84.080507] ? report_bug+0x171/0x1a0 [ 84.081830] ? prb_read_valid+0x1b/0x30 [ 84.083682] ? handle_bug+0x41/0x70 [ 84.084952] ? exc_invalid_op+0x17/0x70 [ 84.086314] ? asm_exc_invalid_op+0x1a/0x20 [ 84.088520] ? evtchn_interrupt+0xb6/0xc0 [xen_evtchn] [ 84.090300] __free_irq+0x114/0x2b0 [ 84.091478] free_irq+0x32/0x70 [ 84.092522] unbind_from_irqhandler+0x31/0xb0 [ 84.094035] evtchn_release+0x2b/0xa0 [xen_evtchn] [ 84.095643] __fput+0x9a/0x2c0 [ 84.096734] __x64_sys_close+0x3d/0x80 [ 84.097995] do_syscall_64+0x63/0xe0 [ 84.099188] ? __vm_munmap+0xbb/0x150 [ 84.100431] ? syscall_exit_work+0x103/0x130 [ 84.101854] ? syscall_exit_to_user_mode+0x2b/0x40 [ 84.103509] ? do_syscall_64+0x6f/0xe0 [ 84.104757] ? syscall_exit_to_user_mode+0x2b/0x40 [ 84.106350] ? do_syscall_64+0x6f/0xe0 [ 84.107587] ? exit_to_user_mode_prepare+0xbc/0xd0 [ 84.109180] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 84.110863] RIP: 0033:0x75ee3ed15a0a [ 84.111999] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 44 c3 0f 1f 00 48 83 ec 18 89 7c 24 0c e8 33 bc f8 ff 8b 7c 24 0c 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 36 89 d7 89 44 24 0c e8 93 bc f8 ff 8b 44 24 [ 84.118267] RSP: 002b:70ad0f70 EFLAGS: 0293 ORIG_RAX: 0003 [ 84.120756] RAX: ffda RBX: 55d3d8f70970 RCX: 75ee3ed15a0a [ 84.123675] RDX: RSI: RDI: 000f [ 84.126012] RBP: R08: R09: [ 84.128366] R10: 75ee3ec31468 R11: 0293 R12: [ 84.130699] R13: R14: 70ad2eaa R15: 55d3d8f6f2a0 [ 84.133050] [ 84.133810] ---[ end trace ]--- Full log: https://openqa.qubes-os.org/tests/86647/logfile?filename=serial0.txt -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [Xen-devel] PV guest with PCI passthrough crash on Xen 4.8.3 inside KVM when booted through OVMF
On Mon, Nov 27, 2023 at 11:20:36AM +, Frediano Ziglio wrote: > On Sun, Nov 26, 2023 at 2:51 PM Marek Marczykowski-Górecki > wrote: > > > > On Mon, Feb 19, 2018 at 06:30:14PM +0100, Juergen Gross wrote: > > > On 16/02/18 20:02, Andrew Cooper wrote: > > > > On 16/02/18 18:51, Marek Marczykowski-Górecki wrote: > > > >> On Fri, Feb 16, 2018 at 05:52:50PM +0000, Andrew Cooper wrote: > > > >>> On 16/02/18 17:48, Marek Marczykowski-Górecki wrote: > > > >>>> Hi, > > > >>>> > > > >>>> As in the subject, the guest crashes on boot, before kernel output > > > >>>> anything. I've isolated this to the conditions below: > > > >>>> - PV guest have PCI device assigned (e1000e emulated by QEMU in > > > >>>> this case), > > > >>>>without PCI device it works > > > >>>> - Xen (in KVM) is started through OVMF; with seabios it works > > > >>>> - nested HVM is disabled in KVM > > > >>>> - AMD IOMMU emulation is disabled in KVM; when enabled qemu crashes > > > >>>> on > > > >>>>boot (looks like qemu bug, unrelated to this one) > > > >>>> > > > >>>> Version info: > > > >>>> - KVM host: OpenSUSE 42.3, qemu 2.9.1, > > > >>>> ovmf-2017+git1492060560.b6d11d7c46-4.1, AMD > > > >>>> - Xen host: Xen 4.8.3, dom0: Linux 4.14.13 > > > >>>> - Xen domU: Linux 4.14.13, direct boot > > > >>>> > > > >>>> Not sure if relevant, but initially I've tried booting xen.efi /mapbs > > > >>>> /noexitboot and then dom0 kernel crashed saying something about > > > >>>> conflict > > > >>>> between e820 and kernel mapping. But now those options are disabled. > > > >>>> > > > >>>> The crash message: > > > >>>> (XEN) d1v0 Unhandled invalid opcode fault/trap [#6, ec=] > > > >>>> (XEN) domain_crash_sync called from entry.S: fault at > > > >>>> 82d080218720 entry.o#create_bounce_frame+0x137/0x146 > > > >>>> (XEN) Domain 1 (vcpu#0) crashed on cpu#1: > > > >>>> (XEN) [ Xen-4.8.3 x86_64 debug=n Not tainted ] > > > >>>> (XEN) CPU:1 > > > >>>> (XEN) RIP:e033:[] > > > >>> This is #UD, which is most probably hitting a BUG(). addr2line this ^ > > > >>> to find some code to look at. > > > >> addr2line failed me > > > > > > > > By default, vmlinux is stripped and compressed. Ideally you want to > > > > addr2line the vmlinux artefact in the root of your kernel build, which > > > > is the plain elf with debugging symbols. > > > > > > > > Alternatively, use scripts/extract-vmlinux on the binary you actually > > > > booted, which might get you somewhere. > > > > > > > >> , but System.map says its xen_memory_setup. And it > > > >> looks like the BUG() is the same as I had in dom0 before: > > > >> "Xen hypervisor allocated kernel memory conflicts with E820 map". > > > > > > > > Juergen: Is there anything we can do to try and insert some dummy > > > > exception handlers right at PV start, so we could at least print out a > > > > oneliner to the host console which is a little more helpful than Xen > > > > saying "something unknown went wrong" ? > > > > > > You mean something like commit 42b3a4cb5609de757f5445fcad18945ba9239a07 > > > added to kernel 4.15? > > > > > > > > > > >> > > > >> Disabling e820_host in guest config solved the problem. Thanks! > > > >> > > > >> Is this some bug in Xen or OVMF, or is it expected behavior and > > > >> e820_host > > > >> should be avoided? > > > > > > > > I don't really know. e820_host is a gross hack which shouldn't really > > > > be present. The actually problem is that Linux can't cope with the > > > > memory layout it was given (and I can't recall if there is anything > > > > Linux could potentially to do cope). OTOH, the toolstack, which knew > > > > about e820_host and chose to lay the guest out in an overlapping way is > > > > probably also at fault. > > &g
Re: [Xen-devel] PV guest with PCI passthrough crash on Xen 4.8.3 inside KVM when booted through OVMF
On Mon, Feb 19, 2018 at 06:30:14PM +0100, Juergen Gross wrote: > On 16/02/18 20:02, Andrew Cooper wrote: > > On 16/02/18 18:51, Marek Marczykowski-Górecki wrote: > >> On Fri, Feb 16, 2018 at 05:52:50PM +, Andrew Cooper wrote: > >>> On 16/02/18 17:48, Marek Marczykowski-Górecki wrote: > >>>> Hi, > >>>> > >>>> As in the subject, the guest crashes on boot, before kernel output > >>>> anything. I've isolated this to the conditions below: > >>>> - PV guest have PCI device assigned (e1000e emulated by QEMU in this > >>>> case), > >>>>without PCI device it works > >>>> - Xen (in KVM) is started through OVMF; with seabios it works > >>>> - nested HVM is disabled in KVM > >>>> - AMD IOMMU emulation is disabled in KVM; when enabled qemu crashes on > >>>>boot (looks like qemu bug, unrelated to this one) > >>>> > >>>> Version info: > >>>> - KVM host: OpenSUSE 42.3, qemu 2.9.1, > >>>> ovmf-2017+git1492060560.b6d11d7c46-4.1, AMD > >>>> - Xen host: Xen 4.8.3, dom0: Linux 4.14.13 > >>>> - Xen domU: Linux 4.14.13, direct boot > >>>> > >>>> Not sure if relevant, but initially I've tried booting xen.efi /mapbs > >>>> /noexitboot and then dom0 kernel crashed saying something about conflict > >>>> between e820 and kernel mapping. But now those options are disabled. > >>>> > >>>> The crash message: > >>>> (XEN) d1v0 Unhandled invalid opcode fault/trap [#6, ec=] > >>>> (XEN) domain_crash_sync called from entry.S: fault at 82d080218720 > >>>> entry.o#create_bounce_frame+0x137/0x146 > >>>> (XEN) Domain 1 (vcpu#0) crashed on cpu#1: > >>>> (XEN) [ Xen-4.8.3 x86_64 debug=n Not tainted ] > >>>> (XEN) CPU:1 > >>>> (XEN) RIP:e033:[] > >>> This is #UD, which is most probably hitting a BUG(). addr2line this ^ > >>> to find some code to look at. > >> addr2line failed me > > > > By default, vmlinux is stripped and compressed. Ideally you want to > > addr2line the vmlinux artefact in the root of your kernel build, which > > is the plain elf with debugging symbols. > > > > Alternatively, use scripts/extract-vmlinux on the binary you actually > > booted, which might get you somewhere. > > > >> , but System.map says its xen_memory_setup. And it > >> looks like the BUG() is the same as I had in dom0 before: > >> "Xen hypervisor allocated kernel memory conflicts with E820 map". > > > > Juergen: Is there anything we can do to try and insert some dummy > > exception handlers right at PV start, so we could at least print out a > > oneliner to the host console which is a little more helpful than Xen > > saying "something unknown went wrong" ? > > You mean something like commit 42b3a4cb5609de757f5445fcad18945ba9239a07 > added to kernel 4.15? > > > > >> > >> Disabling e820_host in guest config solved the problem. Thanks! > >> > >> Is this some bug in Xen or OVMF, or is it expected behavior and e820_host > >> should be avoided? > > > > I don't really know. e820_host is a gross hack which shouldn't really > > be present. The actually problem is that Linux can't cope with the > > memory layout it was given (and I can't recall if there is anything > > Linux could potentially to do cope). OTOH, the toolstack, which knew > > about e820_host and chose to lay the guest out in an overlapping way is > > probably also at fault. > > The kernel can cope with lots of E820 scenarios (e.g. by relocating > initrd or the p2m map), but moving itself out of the way is not > possible. I'm afraid I need to resurrect this thread... With recent kernel (6.6+), the host_e820=0 workaround is not an option anymore. It makes Linux not initialize xen-swiotlb (due to f9a38ea5172a3365f4594335ed5d63e15af2fd18), so PCI passthrough doesn't work at all. While I can add yet another layer of workaround (force xen-swiotlb with iommu=soft), that's getting unwieldy. Furthermore, I don't get the crash message anymore, even with debug hypervisor and guest_loglvl=all. Not even "Domain X crashed" in `xl dmesg`. It looks like the "crash" shutdown reason doesn't reach Xen, and it's considered clean shutdown (I can confirm it by changing various `on_*` settings (via libvirt) and observing which gets applied). Most tests I've done with 6.7-rc1, but the issue I observed on 6.6.1 already. This is on Xen 4.17.2. And the L0 is running Linux 6.6.1, and then uses QEMU 8.1.2 + OVMF 202308 to run Xen as L1. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH v4 3/6] automation: prevent QEMU access to /dev/mem in PCI passthrough tests
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain. Simulate this environment with removing /dev/mem device node. Full test for lockdown and stubdomain will come later, when all requirements will be in place. Signed-off-by: Marek Marczykowski-Górecki --- This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/scripts/qubes-x86-64.sh | 2 ++ 1 file changed, 2 insertions(+) diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index d81ed7b931cf..7eabc1bd6ad4 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -163,6 +163,8 @@ ifconfig eth0 up ifconfig xenbr0 up ifconfig xenbr0 192.168.0.1 +# ensure QEMU wont have access /dev/mem +rm -f /dev/mem # get domU console content into test log tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" & xl create /etc/xen/domU.cfg -- git-series 0.9.1
[PATCH v4 6/6] [DO NOT APPLY] switch to alternative artifact repo
For testing, switch to my containers registry that includes containers rebuilt with changes in this series. --- automation/gitlab-ci/build.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml index 32af30ccedc9..52abb12bce48 100644 --- a/automation/gitlab-ci/build.yaml +++ b/automation/gitlab-ci/build.yaml @@ -324,7 +324,7 @@ qemu-system-ppc64-8.1.0-ppc64-export: alpine-3.18-rootfs-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18 script: - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz artifacts: @@ -335,7 +335,7 @@ alpine-3.18-rootfs-export: kernel-6.1.19-export: extends: .test-jobs-artifact-common - image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19 + image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19 script: - mkdir binaries && cp /bzImage binaries/bzImage artifacts: -- git-series 0.9.1
[PATCH v4 0/6] MSI-X support with qemu in stubdomain, and other related changes
This series includes changes to make MSI-X working with Linux stubdomain and especially Intel Wifi 6 AX210 card. This takes care of remaining reasons for QEMU to access /dev/mem, but also the Intel Wifi card violating spec by putting some registers on the same page as the MSI-X table. See individual patches for details. This series include also tests for MSI-X using new approach (by preventing QEMU access to /dev/mem). But for it to work, it needs QEMU change that makes use of the changes introduced here. It can be seen at https://github.com/marmarek/qemu/commits/msix Here is the pipeline that used the QEMU fork above: https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1083468508 Marek Marczykowski-Górecki (6): x86/msi: passthrough all MSI-X vector ctrl writes to device model x86/hvm: Allow access to registers on the same page as MSI-X table automation: prevent QEMU access to /dev/mem in PCI passthrough tests automation: switch to a wifi card on ADL system [DO NOT APPLY] switch to qemu fork [DO NOT APPLY] switch to alternative artifact repo Config.mk | 4 +- automation/gitlab-ci/build.yaml | 4 +- automation/gitlab-ci/test.yaml | 4 +- automation/scripts/qubes-x86-64.sh | 9 +- automation/tests-artifacts/alpine/3.18.dockerfile | 7 +- automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 +- xen/arch/x86/hvm/vmsi.c | 206 - xen/arch/x86/include/asm/msi.h | 5 +- xen/arch/x86/msi.c | 40 +++- xen/common/kernel.c | 1 +- xen/include/public/features.h | 8 +- 11 files changed, 272 insertions(+), 18 deletions(-) base-commit: f96e2f64576cdbb147391c7cb399d393385719a9 -- git-series 0.9.1
[PATCH v4 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table
Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers on the same page as MSI-X table. Device model (especially one in stubdomain) cannot really handle those, as direct writes to that page is refused (page is on the mmio_ro_ranges list). Instead, extend msixtbl_mmio_ops to handle such accesses too. Doing this, requires correlating read/write location with guest of MSI-X table address. Since QEMU doesn't map MSI-X table to the guest, it requires msixtbl_entry->gtable, which is HVM-only. Similar feature for PV would need to be done separately. This will be also used to read Pending Bit Array, if it lives on the same page, making QEMU not needing /dev/mem access at all (especially helpful with lockdown enabled in dom0). If PBA lives on another page, QEMU will map it to the guest directly. If PBA lives on the same page, discard writes and log a message. Technically, writes outside of PBA could be allowed, but at this moment the precise location of PBA isn't saved, and also no known device abuses the spec in this way (at least yet). To access those registers, msixtbl_mmio_ops need the relevant page mapped. MSI handling already has infrastructure for that, using fixmap, so try to map first/last page of the MSI-X table (if necessary) and save their fixmap indexes. Note that msix_get_fixmap() does reference counting and reuses existing mapping, so just call it directly, even if the page was mapped before. Also, it uses a specific range of fixmap indexes which doesn't include 0, so use 0 as default ("not mapped") value - which simplifies code a bit. GCC gets confused about 'desc' variable: arch/x86/hvm/vmsi.c: In function ‘msixtbl_range’: arch/x86/hvm/vmsi.c:553:8: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized] 553 | if ( desc ) |^ arch/x86/hvm/vmsi.c:537:28: note: ‘desc’ was declared here 537 | const struct msi_desc *desc; |^~~~ It's conditional initialization is actually correct (in the case where it isn't initialized, function returns early), but to avoid build failure initialize it explicitly to NULL anyway. Signed-off-by: Marek Marczykowski-Górecki --- Changes in v4: - drop same_page parameter of msixtbl_find_entry(), distinguish two cases in relevant callers - rename adj_access_table_idx to adj_access_idx - code style fixes - drop alignment check in adjacent_{read,write}() - all callers already have it earlier - delay mapping first/last MSI-X pages until preparing device for a passthrough v3: - merge handling into msixtbl_mmio_ops - extend commit message v2: - adjust commit message - pass struct domain to msixtbl_page_handler_get_hwaddr() - reduce local variables used only once - log a warning if write is forbidden if MSI-X and PBA lives on the same page - do not passthrough unaligned accesses - handle accesses both before and after MSI-X table --- xen/arch/x86/hvm/vmsi.c| 191 -- xen/arch/x86/include/asm/msi.h | 5 +- xen/arch/x86/msi.c | 40 +++- 3 files changed, 225 insertions(+), 11 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 2436154c40b6..d1e8cb1e30f6 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d) return d->arch.hvm.msixtbl_list.next; } +/* + * Lookup an msixtbl_entry on the same page as given addr. It's up to the + * caller to check if address is strictly part of the table - if relevant. + */ static struct msixtbl_entry *msixtbl_find_entry( struct vcpu *v, unsigned long addr) { @@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry( struct domain *d = v->domain; list_for_each_entry( entry, >arch.hvm.msixtbl_list, list ) -if ( addr >= entry->gtable && - addr < entry->gtable + entry->table_len ) +if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) && + PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) ) return entry; return NULL; @@ -213,6 +217,131 @@ static struct msi_desc *msixtbl_addr_to_desc( return NULL; } +/* + * Returns: + * - UINT_MAX if no handling should be done + * - UINT_MAX-1 if write should be discarded + * - a fixmap idx to use for handling + */ +#define ADJACENT_DONT_HANDLE UINT_MAX +#define ADJACENT_DISCARD_WRITE (UINT_MAX - 1) +static unsigned int adjacent_handle( +const struct msixtbl_entry *entry, unsigned long addr, bool write) +{ +unsigned int adj_type; +const struct arch_msix *msix; + +if ( !entry || !entry->pdev ) +return ADJACENT_DONT_HANDLE; + +if ( PFN_DOWN(addr) == PFN_DOWN(entry->gtable) && addr < entry->gtable ) +adj_type = ADJ_IDX_FIRST; +else if ( PFN_DOWN(addr) == PFN_DOWN(entry->
[PATCH v4 5/6] [DO NOT APPLY] switch to qemu fork
This makes tests to use patched QEMU, to actually test the new behavior. --- Config.mk | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Config.mk b/Config.mk index 2c43702958eb..dd2687c0e9e7 100644 --- a/Config.mk +++ b/Config.mk @@ -222,8 +222,8 @@ endif OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16 -QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git -QEMU_UPSTREAM_REVISION ?= master +QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu +QEMU_UPSTREAM_REVISION ?= origin/msix MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git MINIOS_UPSTREAM_REVISION ?= b08019f0b2fbc30c75169a160acb9fd9af5d68f4 -- git-series 0.9.1
[PATCH v4 4/6] automation: switch to a wifi card on ADL system
Switch to a wifi card that has registers on a MSI-X page. This tests the "x86/hvm: Allow writes to registers on the same page as MSI-X table" feature. Switch it only for HVM test, because MSI-X adjacent write is not supported on PV. This requires also including drivers and firmware in system for tests. Remove firmware unrelated to the test, to not increase initrd size too much (all firmware takes over 100MB compressed). And finally adjusts test script to handle not only eth0 as a test device, but also wlan0 and connect it to the wifi network. Signed-off-by: Marek Marczykowski-Górecki --- This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll provide them in private. This change requires rebuilding test containers. This can be applied only after QEMU change is committed. Otherwise the test will fail. --- automation/gitlab-ci/test.yaml | 4 automation/scripts/qubes-x86-64.sh | 7 +++ automation/tests-artifacts/alpine/3.18.dockerfile | 7 +++ automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++ 4 files changed, 20 insertions(+) diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml index 6aabdb9d156f..931a8fb28e1d 100644 --- a/automation/gitlab-ci/test.yaml +++ b/automation/gitlab-ci/test.yaml @@ -195,6 +195,10 @@ adl-pci-pv-x86-64-gcc-debug: adl-pci-hvm-x86-64-gcc-debug: extends: .adl-x86-64 + variables: +PCIDEV: "00:14.3" +WIFI_SSID: "$WIFI_HW2_SSID" +WIFI_PSK: "$WIFI_HW2_PSK" script: - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE} needs: diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index 7eabc1bd6ad4..60498ef1e89a 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -94,6 +94,13 @@ on_reboot = "destroy" domU_check=" set -x -e interface=eth0 +if [ -e /sys/class/net/wlan0 ]; then +interface=wlan0 +set +x +wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf +set -x +wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf +fi ip link set \"\$interface\" up timeout 30s udhcpc -i \"\$interface\" pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ') diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile index f1b4a8b7a191..b821a291fed3 100644 --- a/automation/tests-artifacts/alpine/3.18.dockerfile +++ b/automation/tests-artifacts/alpine/3.18.dockerfile @@ -34,6 +34,13 @@ RUN \ apk add curl && \ apk add udev && \ apk add pciutils && \ + apk add wpa_supplicant && \ + # Select firmware for hardware tests + apk add linux-firmware-other && \ + mkdir /lib/firmware-preserve && \ + mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \ + rm -rf /lib/firmware && \ + mv /lib/firmware-preserve /lib/firmware && \ \ # Xen cd / && \ diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile index 3a4096780d20..84ed5dff23ae 100644 --- a/automation/tests-artifacts/kernel/6.1.19.dockerfile +++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile @@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI make xen.config && \ scripts/config --enable BRIDGE && \ scripts/config --enable IGC && \ +scripts/config --enable IWLWIFI && \ +scripts/config --enable IWLMVM && \ cp .config .config.orig && \ cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \ make -j$(nproc) bzImage && \ -- git-series 0.9.1
[PATCH v4 1/6] x86/msi: passthrough all MSI-X vector ctrl writes to device model
QEMU needs to know whether clearing maskbit of a vector is really clearing, or was already cleared before. Currently Xen sends only clearing that bit to the device model, but not setting it, so QEMU cannot detect it. Because of that, QEMU is working this around by checking via /dev/mem, but that isn't the proper approach. Give all necessary information to QEMU by passing all ctrl writes, including masking a vector. Advertise the new behavior via XENVER_get_features, so QEMU can know it doesn't need to access /dev/mem anymore. While this commit doesn't move the whole maskbit handling to QEMU (as discussed on xen-devel as one of the possibilities), it is a necessary first step anyway. Including telling QEMU it will get all the required information to do so. The actual implementation would need to include: - a hypercall for QEMU to control just maskbit (without (re)binding the interrupt again - a methor for QEMU to tell Xen it will actually do the work Those are not part of this series. Signed-off-by: Marek Marczykowski-Górecki --- I did not added any control to enable/disable this new behavior (as Roger have suggested for possible non-QEMU ioreqs). I don't see how the new behavior could be problematic for some existing ioreq server (they already received writes to those addresses, just not all of them), but if that's really necessary, I can probably add a command line option to restore previous behavior system-wide. Changes in v4: - ignore unaligned writes with X86EMUL_OKAY - restructure the code to forward all writes in _msixtbl_write() instead of manipulating return value of msixtbl_write() - this makes WRITE_LEN4_COMPLETION special case unnecessary - advertise the changed behavior via XENVER_get_features instead of DMOP v3: - advertise changed behavior in XEN_DMOP_get_ioreq_server_info - make "flags" parameter IN/OUT - move len check back to msixtbl_write() - will be needed there anyway in a later patch v2: - passthrough quad writes to emulator too (Jan) - (ab)use len==0 for write len=4 completion (Jan), but add descriptive #define for this magic value --- xen/arch/x86/hvm/vmsi.c | 19 ++- xen/common/kernel.c | 1 + xen/include/public/features.h | 8 3 files changed, 23 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c index 128f23636279..2436154c40b6 100644 --- a/xen/arch/x86/hvm/vmsi.c +++ b/xen/arch/x86/hvm/vmsi.c @@ -283,8 +283,8 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unsigned long flags; struct irq_desc *desc; -if ( (len != 4 && len != 8) || (address & (len - 1)) ) -return r; +if ( !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; rcu_read_lock(_rcu_lock); @@ -345,8 +345,7 @@ static int msixtbl_write(struct vcpu *v, unsigned long address, unlock: spin_unlock_irqrestore(>lock, flags); -if ( len == 4 ) -r = X86EMUL_OKAY; +r = X86EMUL_OKAY; out: rcu_read_unlock(_rcu_lock); @@ -357,7 +356,17 @@ static int cf_check _msixtbl_write( const struct hvm_io_handler *handler, uint64_t address, uint32_t len, uint64_t val) { -return msixtbl_write(current, address, len, val); +/* ignore invalid length or unaligned writes */ +if ( len != 4 && len != 8 || !IS_ALIGNED(address, len) ) +return X86EMUL_OKAY; + +/* + * This function returns X86EMUL_UNHANDLEABLE even if write is properly + * handled, to propagate it to the device model (so it can keep its + * internal state in sync). + */ +msixtbl_write(current, address, len, val); +return X86EMUL_UNHANDLEABLE; } static bool cf_check msixtbl_range( diff --git a/xen/common/kernel.c b/xen/common/kernel.c index 08dbaa2a054c..229784c6ce52 100644 --- a/xen/common/kernel.c +++ b/xen/common/kernel.c @@ -642,6 +642,7 @@ long do_xen_version(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg) fi.submap |= (1U << XENFEAT_direct_mapped); else fi.submap |= (1U << XENFEAT_not_direct_mapped); +fi.submap |= (1U << XENFEAT_dm_msix_all_writes); break; default: return -EINVAL; diff --git a/xen/include/public/features.h b/xen/include/public/features.h index 36936f6a4ee0..634534827d43 100644 --- a/xen/include/public/features.h +++ b/xen/include/public/features.h @@ -120,6 +120,14 @@ #define XENFEAT_runstate_phys_area 18 #define XENFEAT_vcpu_time_phys_area 19 +/* + * If set, Xen will passthrough all MSI-X vector ctrl writes to device model, + * not only those unmasking an entry. This allows device model to properly keep + * track of the MSI-X table without having to read it from the device behind + * Xen's backs. This information is relevant only for device models. + */ +#define XENFEAT_dm_msix_all_writes20 + #define XENFEAT_NR_SUBMAPS 1 #endif /* __XEN_PUBLIC_FEATURES_H__ */ -- git-series 0.9.1
[PATCH for-4.18 v2] automation: fix race condition in adl-suspend test
If system suspends too quickly, the message for the test controller to wake up the system may be not sent to the console before suspending. This will cause the test to timeout. Fix this by calling sync on the console and waiting a bit after printing the message. The test controller then resumes the system 30s after the message, so as long as the delay + suspending takes less time it is okay. Signed-off-by: Marek Marczykowski-Górecki --- This is consistent with the observation that sync_console "fixes" the issue. Changes in v2: - add sync /dev/stdout too (Roger) --- automation/scripts/qubes-x86-64.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index 26131b082671..0f00bebdd8c8 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -54,11 +54,12 @@ until grep 'domU started' /var/log/xen/console/guest-domU.log; do sleep 1 done echo \"${wait_and_wakeup}\" +# let the above message flow to console, then suspend +sync /dev/stdout +sleep 5 set -x echo deep > /sys/power/mem_sleep echo mem > /sys/power/state -# now wait for resume -sleep 5 xl list xl dmesg | grep 'Finishing wakeup from ACPI S3 state' || exit 1 # check if domU is still alive -- 2.41.0
Re: [PATCH] automation: fix race condition in adl-suspend test
On Mon, Oct 30, 2023 at 05:32:08PM +0100, Marek Marczykowski-Górecki wrote: > On Mon, Oct 30, 2023 at 12:42:52PM +0100, Roger Pau Monné wrote: > > On Sat, Oct 28, 2023 at 05:33:57AM +0200, Marek Marczykowski-Górecki wrote: > > > If system suspends too quickly, the message for the test controller to > > > wake up the system may be not sent to the console before suspending. > > > This will cause the test to timeout. > > > > > > Fix this by waiting a bit after printing the message. The test > > > controller then resumes the system 30s after the message, so as long as > > > the delay + suspending takes less time it is okay. > > > > > > Signed-off-by: Marek Marczykowski-Górecki > > > > > > --- > > > This is consistent with the observation that sync_console "fixes" the > > > issue. > > > --- > > > automation/scripts/qubes-x86-64.sh | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/automation/scripts/qubes-x86-64.sh > > > b/automation/scripts/qubes-x86-64.sh > > > index 26131b082671..a34db96e4585 100755 > > > --- a/automation/scripts/qubes-x86-64.sh > > > +++ b/automation/scripts/qubes-x86-64.sh > > > @@ -54,11 +54,11 @@ until grep 'domU started' > > > /var/log/xen/console/guest-domU.log; do > > > sleep 1 > > > done > > > echo \"${wait_and_wakeup}\" > > > +# let the above message flow to console, then suspend > > > +sleep 5 > > > > Could you use `sync /dev/stdout`? I guess that might not be enough, > > since the sync won't be propagated to the hypervisor, and hence even > > if flushed from Linux, we have no guarantee that the hypervisor has > > also flushed it. > > It seems `sync /dev/stdout` helps too, at least in a limited sample of > two. ... and the third attempt (with sync instead of sleep) failed. > > Xen should flush the buffer when a newline character is found, but I > > have no idea whether context could return to guest while the buffer is > > still in the process of being fully flushed. > > IIC Xen should flush the console buffer on the suspend path (there is > console_start_sync() in enter_state()). So, if linux manages to send it > to Xen in time, all should be good (in theory at least). > > > Anyway, adding the extra sync might be good regardless, and keeping > > the sleep. > > Good idea, I'll send v2 with it included. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
Re: [PATCH] automation: fix race condition in adl-suspend test
On Mon, Oct 30, 2023 at 12:42:52PM +0100, Roger Pau Monné wrote: > On Sat, Oct 28, 2023 at 05:33:57AM +0200, Marek Marczykowski-Górecki wrote: > > If system suspends too quickly, the message for the test controller to > > wake up the system may be not sent to the console before suspending. > > This will cause the test to timeout. > > > > Fix this by waiting a bit after printing the message. The test > > controller then resumes the system 30s after the message, so as long as > > the delay + suspending takes less time it is okay. > > > > Signed-off-by: Marek Marczykowski-Górecki > > --- > > This is consistent with the observation that sync_console "fixes" the > > issue. > > --- > > automation/scripts/qubes-x86-64.sh | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/automation/scripts/qubes-x86-64.sh > > b/automation/scripts/qubes-x86-64.sh > > index 26131b082671..a34db96e4585 100755 > > --- a/automation/scripts/qubes-x86-64.sh > > +++ b/automation/scripts/qubes-x86-64.sh > > @@ -54,11 +54,11 @@ until grep 'domU started' > > /var/log/xen/console/guest-domU.log; do > > sleep 1 > > done > > echo \"${wait_and_wakeup}\" > > +# let the above message flow to console, then suspend > > +sleep 5 > > Could you use `sync /dev/stdout`? I guess that might not be enough, > since the sync won't be propagated to the hypervisor, and hence even > if flushed from Linux, we have no guarantee that the hypervisor has > also flushed it. It seems `sync /dev/stdout` helps too, at least in a limited sample of two. > Xen should flush the buffer when a newline character is found, but I > have no idea whether context could return to guest while the buffer is > still in the process of being fully flushed. IIC Xen should flush the console buffer on the suspend path (there is console_start_sync() in enter_state()). So, if linux manages to send it to Xen in time, all should be good (in theory at least). > Anyway, adding the extra sync might be good regardless, and keeping > the sleep. Good idea, I'll send v2 with it included. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab signature.asc Description: PGP signature
[PATCH] automation: fix race condition in adl-suspend test
If system suspends too quickly, the message for the test controller to wake up the system may be not sent to the console before suspending. This will cause the test to timeout. Fix this by waiting a bit after printing the message. The test controller then resumes the system 30s after the message, so as long as the delay + suspending takes less time it is okay. Signed-off-by: Marek Marczykowski-Górecki --- This is consistent with the observation that sync_console "fixes" the issue. --- automation/scripts/qubes-x86-64.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh index 26131b082671..a34db96e4585 100755 --- a/automation/scripts/qubes-x86-64.sh +++ b/automation/scripts/qubes-x86-64.sh @@ -54,11 +54,11 @@ until grep 'domU started' /var/log/xen/console/guest-domU.log; do sleep 1 done echo \"${wait_and_wakeup}\" +# let the above message flow to console, then suspend +sleep 5 set -x echo deep > /sys/power/mem_sleep echo mem > /sys/power/state -# now wait for resume -sleep 5 xl list xl dmesg | grep 'Finishing wakeup from ACPI S3 state' || exit 1 # check if domU is still alive -- 2.41.0
Re: NULL pointer dereference in xenbus_thread->...
On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote: > Hi, > > I've noticed in Qubes's CI failure like this: > > [ 871.271292] BUG: kernel NULL pointer dereference, address: > [ 871.275290] #PF: supervisor read access in kernel mode > [ 871.277282] #PF: error_code(0x) - not-present page > [ 871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0 > [ 871.281071] Oops: [#1] PREEMPT SMP NOPTI > [ 871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted > 6.1.43-1.qubes.fc37.x86_64 #1 > [ 871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 > [ 871.23] RIP: e030:__wake_up_common+0x4c/0x180 > [ 871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 > 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b > 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40 > [ 871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082 > [ 871.301656] RAX: RBX: 88810541ce98 RCX: > > [ 871.304255] RDX: 0001 RSI: 0003 RDI: > 88810541ce90 > [ 871.306714] RBP: c900400f0280 R08: ffe8 R09: > c900400f7e68 > [ 871.309937] R10: 7ff0 R11: 888100ad3000 R12: > c900400f7e68 > [ 871.312326] R13: R14: R15: > > [ 871.314647] FS: () GS:88813ff0() > knlGS: > [ 871.317677] CS: 1e030 DS: ES: CR0: 80050033 > [ 871.319644] CR2: CR3: 0001067fe000 CR4: > 00040660 > [ 871.321973] Call Trace: > [ 871.322782] > [ 871.323494] ? show_trace_log_lvl+0x1d3/0x2ef > [ 871.324901] ? show_trace_log_lvl+0x1d3/0x2ef > [ 871.326310] ? show_trace_log_lvl+0x1d3/0x2ef > [ 871.327721] ? __wake_up_common_lock+0x82/0xd0 > [ 871.329147] ? __die_body.cold+0x8/0xd > [ 871.330378] ? page_fault_oops+0x163/0x1a0 > [ 871.331691] ? exc_page_fault+0x70/0x170 > [ 871.332946] ? asm_exc_page_fault+0x22/0x30 > [ 871.334454] ? __wake_up_common+0x4c/0x180 > [ 871.335777] __wake_up_common_lock+0x82/0xd0 > [ 871.337183] ? process_writes+0x240/0x240 > [ 871.338461] process_msg+0x18e/0x2f0 > [ 871.339627] xenbus_thread+0x165/0x1c0 > [ 871.340830] ? cpuusage_read+0x10/0x10 > [ 871.342032] kthread+0xe9/0x110 > [ 871.343317] ? kthread_complete_and_exit+0x20/0x20 > [ 871.345020] ret_from_fork+0x22/0x30 > [ 871.346239] > [ 871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio > snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core > snd_hwdep snd_seq snd_seq_device joydev snd_pcm intel_rapl_msr ppdev > intel_rapl_common snd_timer pcspkr e1000e snd soundcore i2c_piix4 parport_pc > parport loop fuse xenfs dm_crypt crct10dif_pclmul crc32_pclmul crc32c_intel > polyval_clmulni polyval_generic floppy ghash_clmulni_intel sha512_ssse3 > serio_raw virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd > qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi > xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn > scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinput dm_multipath > [ 871.368892] CR2: > [ 871.370160] ---[ end trace ]--- > [ 871.371719] RIP: e030:__wake_up_common+0x4c/0x180 > [ 871.373273] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 a3 00 > 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 5b <49> 8b > 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40 > [ 871.379866] RSP: e02b:c900400f7e10 EFLAGS: 00010082 > [ 871.381689] RAX: RBX: 88810541ce98 RCX: > > [ 871.383971] RDX: 0001 RSI: 0003 RDI: > 88810541ce90 > [ 871.386235] RBP: c900400f0280 R08: ffe8 R09: > c900400f7e68 > [ 871.388521] R10: 7ff0 R11: 888100ad3000 R12: > c900400f7e68 > [ 871.390789] R13: R14: R15: > > [ 871.393101] FS: () GS:88813ff0() > knlGS: > [ 871.395671] CS: 1e030 DS: ES: CR0: 80050033 > [ 871.397863] CR2: CR3: 0001067fe000 CR4: > 00040660 > [ 871.400441] Kernel panic - not syncing: Fatal exception > [ 871.402171] Kernel Offset: disabled > (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds. > > It isn't the first time I see similar crash, but I can't really > reproduce it reliably. Restarted test u