Re: [PATCH] include/hw/xen: Use more inclusive language in comment
On 09.11.2023 18:40, Thomas Huth wrote:
> --- a/include/hw/xen/interface/hvm/params.h
> +++ b/include/hw/xen/interface/hvm/params.h
> @@ -255,7 +255,7 @@
>   * Note that 'mixed' mode has not been evaluated for safety from a
>   * security perspective. Before using this mode in a
>   * security-critical environment, each subop should be evaluated for
> - * safety, with unsafe subops blacklisted in XSM.
> + * safety, with unsafe subops blocked in XSM.

To avoid another round trip when you send the patch against xen.git, as already asked for by others, I'd like to point out that the wording change isn't describing things sufficiently similarly: "blocked" reads as if XSM would do so all by itself, whereas "blacklisted" has an indication that something needs to be done for XSM to behave in the intended way. Minimally I'd suggest "suitably blocked via", but perhaps yet better wording can be thought of.

Jan

PS: Personally I'm against such avoiding of certain words. Them being misused is not really a justification. New wording (perhaps not specifically here, but considering the underlying wider theme) is going to be misused as well, leading to the need to come up with yet different wording, and so on.
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 17.11.2022 04:34, Marek Marczykowski-Górecki wrote:
> Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
> disabled at this point yet. And apparently I was testing this with
> permissive=1...
>
> Linux does this:
> https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
> In short:
> 1. Enable MSI-X with MASKALL=1
> 2. Setup MSI-X table
> 3. Disable INTx
> 4. Set MASKALL=0
>
> This patch on top should fix this:
> 8<
> diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
> index 097316a74126..f4c4381de76e 100644
> --- a/drivers/xen/xen-pciback/conf_space_capability.c
> +++ b/drivers/xen/xen-pciback/conf_space_capability.c
> @@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
>  	    (new_value ^ old_value) & ~field_config->allowed_bits)
>  		return PCIBIOS_SET_FAILED;
>
> -	if (new_value & field_config->enable_bit) {
> +	if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
>  		/* don't allow enabling together with other interrupt types */
>  		int int_type = xen_pcibk_get_interrupt_type(dev);
>
> 8<
>
> Jan, is the above safe? It should prevent clearing MASKALL if INTx isn't
> disabled, unless I missed something? But also, it will allow enabling
> MSI-X with MASKALL=1 together with MSI, which I'm not sure if should be
> allowed.

While it would probably be okay with what we have now (after your earlier patch introducing allowed_bits), it's likely not going to be correct once further bits would be added to allowed_bits (which clearly is going to be wanted sooner or later, e.g. for multi-vector MSI). Hence I think ...

> Alternatively to the above patch, I could allow specifically setting
> MSIX_FLAGS_ENABLE + MSIX_FLAGS_MASKALL while INTx isn't disabled as a
> single exception,

... this is the way to go, and ...
> but at this point I'm not sure if some other driver or
> OS wouldn't approach this in yet another way.

... I guess we need to further add exceptions as needed - the one further approach I could see is to keep all entries' mask bits set until setting "INTx disable", without using MASKALL.

I'd like to note though that the PCI spec has no such exception. It, however, also doesn't mandate setting "INTx disable" - that's merely a workaround for flawed hardware afaik. (In Xen we also only cross check MSI and MSI-X. IOW we rely on Dom0 anyway when it comes to driving "INTx disable".) Therefore in the end pciback looks to be going too far in enforcing such dependencies.

Thinking of this - what about making the change in xen_pcibk_get_interrupt_type() instead, setting INTERRUPT_TYPE_MSIX only when MASKALL is clear (alongside ENABLE being set)? This would then also cover command_write().

Jan
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 15.11.2022 12:38, Marek Marczykowski-Górecki wrote:
> On Tue, Nov 15, 2022 at 09:14:07AM +0100, Jan Beulich wrote:
>> On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
>>> The /dev/mem is used for two purposes:
>>> - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>>> - reading Pending Bit Array (PBA)
>>>
>>> The first one was originally done because when Xen did not send all
>>> vector ctrl writes to the device model, so QEMU might have outdated old
>>> register value. This has been changed in Xen, so QEMU can now use its
>>> cached value of the register instead.
>>>
>>> The Pending Bit Array (PBA) handling is for the case where it lives on
>>> the same page as the MSI-X table itself. Xen has been extended to handle
>>> this case too (as well as other registers that may live on those pages),
>>> so QEMU handling is not necessary anymore.
>>
>> Don't you need to check for new enough Xen for both aspects?
>
> Yes, see my response to Andrew in the thread. I'm open for suggestions
> what to check specifically (Xen version directly?).

I guess we should first see what changes we (you) end up doing in Xen itself, and then decide whether to surface the new functionality in some kind of feature indicator. Generally I'd prefer to avoid version checks, because they don't fit very well with backports nor the (rare) need to revert something. But sometimes a feature can be probed easily without requiring an explicit feature indicator.

Jan
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
> The /dev/mem is used for two purposes:
> - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> - reading Pending Bit Array (PBA)
>
> The first one was originally done because when Xen did not send all
> vector ctrl writes to the device model, so QEMU might have outdated old
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
>
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.

Don't you need to check for new enough Xen for both aspects?

Jan
Re: [BUG] Xen build error - undefined reference to bpf_program__set_socket_filter
On 17.10.2022 10:12, Arthur Borsboom wrote:
> Xen 4.16.1, 4.16.2 and 4.17.0-rc1 don't build anymore in Arch Linux.

That is, qemu doesn't build. That's something to be taken care of there, not in Xen, I think.

Jan

> I believe it is caused by the missing function
> bpf_program__set_socket_filter provided by libbpf.
> This function has been deprecated in v0.8 and has been removed in v1.0.
>
> Arch Linux uses libbpf v1.0.1 since October 2022.
> A downgrade to libbpf v0.8.1 fixes the Xen build problem.
>
> Source about the deprecation:
> https://libbpf-test.readthedocs.io/en/latest/api.html
>
> Build error:
>
> /bin/ld: libcommon.fa.p/ebpf_ebpf_rss.c.o: in function `ebpf_rss_load':
> /home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/qemu-xen-build/../qemu-xen/ebpf/ebpf_rss.c:52: undefined reference to `bpf_program__set_socket_filter'
> collect2: error: ld returned 1 exit status
> ...
> ...
> ninja: build stopped: subcommand failed.
> make: *** [Makefile:156: run-ninja] Error 1
> make: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/qemu-xen-build'
> make[3]: *** [Makefile:212: subdir-all-qemu-xen-dir] Error 2
> make[3]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make[2]: *** [/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/../tools/Rules.mk:161: subdirs-install] Error 2
> make[2]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make[1]: *** [Makefile:66: install] Error 2
> make[1]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make: *** [Makefile:140: install-tools] Error 2
> ==> ERROR: A failure occurred in build().
> Aborting...
> -> error making: xen
Re: SecureBoot and PCI passthrough with kernel lockdown in place (on Xen)
On 14.02.2022 16:02, Dario Faggioli wrote:
> We have run into an issue when trying to use PCI passthrough for a Xen
> VM running on an host where dom0 kernel is 5.14.21 (but we think it
> could be any kernel > 5.4) and SecureBoot is enabled.
>
> The error we get, when (for instance) trying to attach a device to an
> (HVM) VM, on such system is:
>
> # xl pci-attach 2-fv-sles15sp4beta2 :58:03.0
> libxl: error: libxl_qmp.c:1838:qmp_ev_parse_error_messages: Domain 12:Failed to initialize 12/15, type = 0x1, rc: -1
> libxl: error: libxl_pci.c:1777:device_pci_add_done: Domain 12:libxl__device_pci_add failed for PCI device 0:58:3.0 (rc -28)
> libxl: error: libxl_device.c:1420:device_addrm_aocomplete: unable to add device
>
> QEMU, is telling us the following:
>
> [00:04.0] xen_pt_msix_init: Error: Can't open /dev/mem: Operation not permitted
> [00:04.0] xen_pt_msix_size_init: Error: Internal error: Invalid xen_pt_msix_init.
>
> And the kernel reports this:
>
> Jan 27 16:20:53 narvi-sr860v2-bps-sles15sp4b2 kernel: Lockdown: qemu-system-i38: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
>
> So, it's related to lockdown. Which AFAIUI is consistent with the
> fact that the problem only shows up when SecureBoot is enabled, as
> that implies lockdown. It's also consistent with the fact that we
> don't seem to have any problems doing the same with a 5.3.x dom0
> kernel... As there's no lockdown there!
>
> Some digging revealed that QEMU tries to open /dev/mem in
> xen_pt_msix_init():
>
>     fd = open("/dev/mem", O_RDWR);
>     ...
>     msix->phys_iomem_base =
>         mmap(NULL,
>              total_entries * PCI_MSIX_ENTRY_SIZE + msix->table_offset_adjust,
>              PROT_READ,
>              MAP_SHARED | MAP_LOCKED,
>              fd,
>              msix->table_base + table_off - msix->table_offset_adjust);
>     close(fd);

I think this is finally a clear indication that it has always been wrong for qemu to access hardware directly like this. I see no way around replacing this by something which isn't a bodge / layering violation.

Jan

> This comes from commit:
>
>     commit 3854ca577dad92c4fe97b4a6ebce360e25407af7
>     Author: Jiang Yunhong
>     Date:   Thu Jun 21 15:42:35 2012 +
>
>         Introduce Xen PCI Passthrough, MSI
>
> A more complete history can be found here:
> git://xenbits.xensource.com/qemu-xen-unstable.git
>
> Signed-off-by: Jiang Yunhong
> Signed-off-by: Shan Haitao
> Signed-off-by: Anthony PERARD
> Acked-by: Stefano Stabellini
>
> Now, the questions:
> - is this (i.e., PCI-Passthrough with a locked-down dom0 kernel)
>   working for anyone? I've Cc-ed Marek, because I think I've read
>   that it does on QubesOS, but I'm not sure if the situation
>   is the same...
> - if it's working, how?
>
> Thanks and Regards
Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability
On 29.08.2019 11:02, Chao Gao wrote:
> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
> perspective, assigned devices cannot be reset. But some devices rely on PCI
> reset to recover from hardware hangs. When being assigned to a VM, those
> devices cannot be reset and won't work any longer if a hardware hang occurs.
> We have to reboot VM to trigger PCI reset on host to recover the device.

Did you consider a hot-unplug, reset (by host), hot-plug cycle instead?

> +static int xen_pt_devctl_reg_write(XenPCIPassthroughState *s,
> +                                   XenPTReg *cfg_entry, uint16_t *val,
> +                                   uint16_t dev_value, uint16_t valid_mask)
> +{
> +    if (s->real_device.is_resetable && (*val & PCI_EXP_DEVCTL_BCR_FLR)) {
> +        xen_pt_reset(s);
> +    }
> +    return xen_pt_word_reg_write(s, cfg_entry, val, dev_value, valid_mask);

I think you also need to clear the bit before handing on the request, such that reads will always observe it clear.

Jan
Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure
>>> On 23.11.18 at 11:19, wrote:
> Adding Jan in case he has an opinion on my reply below.

I agree, fwiw.

Jan

> On Fri, Nov 23, 2018 at 12:04:51AM -0500, Zhao Yan wrote:
>> On Thu, Nov 22, 2018 at 03:18:05PM +0100, Roger Pau Monné wrote:
>> > On Thu, Nov 22, 2018 at 08:11:20AM -0500, Zhao Yan wrote:
>> > > On Thu, Oct 18, 2018 at 03:56:36PM +0100, Roger Pau Monné wrote:
>> > > > On Thu, Oct 18, 2018 at 08:22:41AM +, Zhao, Yan Y wrote:
>> > > > > Hi
>> > > > > The background for this patch is that: for some pci device, even if
>> > > > > its PCI_INTERRUPT_PIN is not 0, it actually does not support INTx
>> > > > > mode, so we should just report the error, disable INTx mode and
>> > > > > continue the passthrough.
>> > > > > However, the commit 5a11d0f7 regards this as an error condition and
>> > > > > lets qemu quit passthrough, which is too rigorous.
>> > > > >
>> > > > > Error message is below:
>> > > > > libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain
>> > > > > 2:received an error message from QMP server: Mapping machine irq 0 to pirq -1 failed: Operation not permitted
>> > > >
>> > > > I'm having issues figuring out what's happening here.
>> > > > s->real_device.irq is 0, yet the PCI config space read of
>> > > > PCI_INTERRUPT_PIN returns something different than 0.
>> > > >
>> > > > AFAICT this is due to some kind of error in Linux, so that even when
>> > > > the device is supposed to have a valid IRQ the sysfs node it is set to
>> > > > 0, do you know the actual underlying cause of this?
>> > > >
>> > > > Thanks, Roger.
>> > > Hi Roger
>> > > Sorry for the later reply, I just missed this mail...
>> > > On my side, it's because the hardware actually does not support INTx mode,
>> > > but its configuration space does not report PCI_INTERRUPT_PIN as 0. It's a
>> > > hardware bug, but previous versions of qemu can tolerate it, switch to MSI
>> > > and make passthrough work.
>> >
>> > Then I think it would be better to check both PCI_INTERRUPT_PIN and
>> > s->real_device.irq before attempting to map the IRQ.
>> >
>> > Making the error non-fatal would mean that a device with a valid IRQ
>> > could fail to be set up correctly but the guest will still be created,
>> > and things won't go as expected when the guest attempts to use it.
>> >
>> > Thanks, Roger.
>> hi roger
>> thanks for your suggestion. it's right that "s->real_device.irq" is needed
>> to be checked before mapping, like if it's 0.
>> but on the other hand, maybe xc_physdev_map_pirq() itself can serve as a
>> checking of "s->real_device.irq"?
>> like in our case, it will fail and return -EPERM.
>> then error handling is still conducted ==> set INTX_DISABLE flag, even though
>> the error is not fatal.
>>
>>     machine_irq = s->real_device.irq;
>>     rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
>>     if (rc < 0) {
>>         error_setg_errno(errp, errno, "Mapping machine irq %u to"
>>                          " pirq %i failed", machine_irq, pirq);
>>
>>         /* Disable PCI intx assertion (turn on bit10 of devctl) */
>>         cmd |= PCI_COMMAND_INTX_DISABLE;
>>         machine_irq = 0;
>>         s->machine_irq = 0;
>>
>> So, do you think it's all right just converting the fatal error to non-fatal?
>
> As I said above, I think it would be better to leave the error as
> fatal and avoid attempting a xc_physdev_map_pirq with a machine_irq ==
> 0, which will fail.
>
> If we really want to go down the route of making the error non-fatal,
> I think you will also have to report PCI_INTERRUPT_PIN as 0 to the
> guest, so that it's clear to the guest that the device doesn't have
> legacy interrupt support.
>
> Exposing a device with PCI_INTERRUPT_PIN != 0 but then not allowing
> the guest to clear PCI_COMMAND_INTX_DISABLE is likely bogus.
>
> Thanks, Roger.
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen-hvm: stop faking I/O to access PCI config space
>>> On 18.05.18 at 15:51, wrote:
>> Sent: 18 May 2018 14:34
>> To: Paul Durrant
>> >>> On 18.05.18 at 15:00, wrote:
>> > +    QLIST_FOREACH(xendev, &state->dev_list, entry) {
>> > +        unsigned int i;
>> > +        uint32_t tmp;
>> > +
>> > +        if (xendev->sbdf != sbdf) {
>> > +            continue;
>> > +        }
>> > +
>> > +        if (!req->data_is_ptr) {
>> > +            if (req->dir == IOREQ_READ) {
>> > +                for (i = 0; i < req->count; i++) {
>> > +                    rw_config_req_item(xendev, req, i, &tmp);
>> > +                    req->data = tmp;
>> > +                }
>> > +            } else if (req->dir == IOREQ_WRITE) {
>> > +                for (i = 0; i < req->count; i++) {
>> > +                    tmp = req->data;
>> > +                    rw_config_req_item(xendev, req, i, &tmp);
>> > +                }
>> > +            }
>>
>> Wouldn't it be more sensible to fail req->count != 1 requests here?
>
> I'm wondering whether we'd want to handle count > 1 once we allow MMCONFIG
> accesses though. I guess it would be easier just to defer that.

For the data_is_ptr case - sure. But here? Or wait - are you thinking about REP STOS (and the relatively useless REP LODS)?

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen-hvm: stop faking I/O to access PCI config space
>>> On 18.05.18 at 15:00, wrote:
> @@ -903,6 +926,80 @@ static void cpu_ioreq_move(ioreq_t *req)
>     }
> }
>
> +static void rw_config_req_item(XenPciDevice *xendev, ioreq_t *req,

It looks to me as if both parameters could be constified.

> +                               uint32_t i, uint32_t *val)
> +{
> +    int32_t reg = req->addr;
> +    uint32_t offset = req->size * i;
> +
> +    reg += (req->df ? -1 : 1) * offset;
> +    if (reg < 0 || reg > PCI_CONFIG_SPACE_SIZE) {

Having fought a number of issues in this area in the hypervisor a couple of years back I wonder
- why reg is of signed type,
- whether overflow of the first multiplication really doesn't matter,
- whether wrapping when adding in the offset is not an issue.

I take it that the rather lax upper bound check (should imo really be reg + size > PCI_CONFIG_SPACE_SIZE [implying reg + size doesn't itself wrap], or at least reg >= PCI_CONFIG_SPACE_SIZE) is not a problem because ...

> +        if (req->dir == IOREQ_READ) {
> +            *val = ~0u;
> +        }
> +        return;
> +    }
> +
> +    if (req->dir == IOREQ_READ) {
> +        *val = pci_host_config_read_common(xendev->pci_dev, reg,
> +                                           PCI_CONFIG_SPACE_SIZE,
> +                                           req->size);
> +        trace_cpu_ioreq_config_read(req, xendev->sbdf, reg,
> +                                    req->size, *val);
> +    } else {
> +        trace_cpu_ioreq_config_write(req, xendev->sbdf, reg, req->size,
> +                                     *val);
> +        pci_host_config_write_common(xendev->pci_dev, reg,
> +                                     PCI_CONFIG_SPACE_SIZE, *val,
> +                                     req->size);
> +    }

... these called functions do full checking anyway?

> +static void cpu_ioreq_config(XenIOState *state, ioreq_t *req)
> +{
> +    uint32_t sbdf = req->addr >> 32;
> +    XenPciDevice *xendev;
> +
> +    if (req->size > sizeof(uint32_t)) {
> +        hw_error("PCI config access: bad size (%u)", req->size);

What about size 0 or 3?

> +    }
> +
> +    QLIST_FOREACH(xendev, &state->dev_list, entry) {
> +        unsigned int i;
> +        uint32_t tmp;
> +
> +        if (xendev->sbdf != sbdf) {
> +            continue;
> +        }
> +
> +        if (!req->data_is_ptr) {
> +            if (req->dir == IOREQ_READ) {
> +                for (i = 0; i < req->count; i++) {
> +                    rw_config_req_item(xendev, req, i, &tmp);
> +                    req->data = tmp;
> +                }
> +            } else if (req->dir == IOREQ_WRITE) {
> +                for (i = 0; i < req->count; i++) {
> +                    tmp = req->data;
> +                    rw_config_req_item(xendev, req, i, &tmp);
> +                }
> +            }

Wouldn't it be more sensible to fail req->count != 1 requests here?

Jan
[Qemu-devel] [PATCH] allow to build with older sed
sed's -E option may not be supported by older distros. As there's no
point using sed here at all, use just shell mechanisms to establish the
variable values, starting from the stem instead of the full target.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/Makefile
+++ b/Makefile
@@ -242,8 +242,7 @@ GENERATED_FILES += $(KEYCODEMAP_FILES)
 
 ui/input-keymap-%.c: $(KEYCODEMAP_GEN) $(KEYCODEMAP_CSV) $(SRC_PATH)/ui/Makefile.objs
 	$(call quiet-command,\
-	    src=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\1,") && \
-	    dst=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\2,") && \
+	    stem=$* && src=$${stem%-to-*} dst=$${stem#*-to-} && \
 	    test -e $(KEYCODEMAP_GEN) && \
 	    $(PYTHON) $(KEYCODEMAP_GEN) \
 		  --lang glib2 \
Re: [Qemu-devel] [RFC QEMU PATCH v3 00/10] Implement vNVDIMM for Xen HVM guest
>>> On 13.10.17 at 13:13, wrote:
> To Jan, Andrew, Stefano and Anthony,
>
> what do you think about allowing QEMU to build the entire guest ACPI
> and letting SeaBIOS load it? ACPI builder code in hvmloader is
> still there and just bypassed in this case.

Well, if that can be made to work in a non-quirky way and without loss of functionality, I'd probably be fine. I do think, however, that there's a reason this is being handled in hvmloader right now.

Jan
[Qemu-devel] xen-pci-passthrough PCI Express support? (Re: [Xen-devel] [PATCH v2 4/5] pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devices)
>>> On 03.10.17 at 02:12, <ehabk...@redhat.com> wrote:
> On Thu, Sep 28, 2017 at 10:12:34AM -0300, Eduardo Habkost wrote:
>> On Thu, Sep 28, 2017 at 02:33:57AM -0600, Jan Beulich wrote:
>>> >>> On 27.09.17 at 21:56, <ehabk...@redhat.com> wrote:
>>> > --- a/hw/xen/xen_pt.c
>>> > +++ b/hw/xen/xen_pt.c
>>> > @@ -964,6 +964,10 @@ static const TypeInfo xen_pci_passthrough_info = {
>>> >     .instance_size = sizeof(XenPCIPassthroughState),
>>> >     .instance_finalize = xen_pci_passthrough_finalize,
>>> >     .class_init = xen_pci_passthrough_class_init,
>>> > +    .interfaces = (InterfaceInfo[]) {
>>> > +        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
>>> > +        { },
>>> > +    },
>>> > };
>>>
>>> Passed through devices can be both PCI and PCIe, so following
>>> the description of the patch I don't think these can be statically
>>> given either property. Granted quite a bit of PCIe specific
>>> functionality may be missing in the Xen code ...
>>
>> This is just static data about what the device type supports, not
>> about what a given device instance really is. Deciding if the
>> device is PCIe or Conventional at runtime is out of the scope of
>> this series.
>>
>> That said, if passed through PCI Express devices are really
>> supported, it looks like this should be marked as hybrid.
>
> Can anybody confirm if PCI Express devices are really supported
> by xen-pci-passthrough?

I think I've clearly said they're supported, with some limitations.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2 4/5] pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devices
>>> On 27.09.17 at 21:56, wrote:
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -964,6 +964,10 @@ static const TypeInfo xen_pci_passthrough_info = {
>     .instance_size = sizeof(XenPCIPassthroughState),
>     .instance_finalize = xen_pci_passthrough_finalize,
>     .class_init = xen_pci_passthrough_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
> +        { },
> +    },
> };

Passed through devices can be both PCI and PCIe, so following the description of the patch I don't think these can be statically given either property. Granted quite a bit of PCIe specific functionality may be missing in the Xen code ...

Jan
Re: [Qemu-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 21.09.17 at 03:12, <sstabell...@kernel.org> wrote:
> On Fri, 1 Sep 2017, Jan Beulich wrote:
>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -18,6 +18,11 @@
>>
>>  #define XEN_PT_AUTO_ASSIGN -1
>>
>> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>> +#error vMSI defines missing from domctl.h
>> +#endif
>
> All the version compatibility stuff goes to
> include/hw/xen/xen_common.h. Please move it there.

I know there's a central place, but moving there stuff that's needed only in this file seemed rather counterproductive to me - why would you want all files including that shared one to have to see these definitions? If there was a remote chance that some other file may need to make use of it, I might agree, but I don't see any such chance at all.

> We usually assume that the Xen version we are building against is
> "sane", so we don't do #error's typically.

Hmm, I can drop the #error, but to be honest I'm hesitant to do so - I've put it there intentionally.

Jan
[Qemu-devel] [PATCH] xen_disk: avoid use of g_malloc0_n()
Prefer g_new() / g_new0() to be further backwards compatible with older
glib versions. As there's no point in zeroing the allocation here (the
loop right afterwards fully initializes the memory), use the former.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/block/xen_disk.c
+++ b/hw/block/xen_disk.c
@@ -1232,7 +1232,7 @@ static int blk_connect(struct XenDevice
         return -1;
     }
 
-    domids = g_malloc0_n(blkdev->nr_ring_ref, sizeof(uint32_t));
+    domids = g_new(uint32_t, blkdev->nr_ring_ref);
     for (i = 0; i < blkdev->nr_ring_ref; i++) {
         domids[i] = blkdev->xendev.dom;
     }
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 04.09.17 at 11:23, <roger@citrix.com> wrote:
> On Mon, Sep 04, 2017 at 03:04:41AM -0600, Jan Beulich wrote:
>> >>> On 01.09.17 at 20:20, <roger@citrix.com> wrote:
>> > On Fri, Sep 01, 2017 at 10:25:42AM -0600, Jan Beulich wrote:
>> >> Xen and qemu having identical #define-s (with different names) is a
>> >> strong hint that these should have been part of the public interface
>> >> from the very start. Use them if they're available, falling back to
>> >> privately defined values only when using older headers.
>> >>
>> >> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>> >
>> > Reviewed-by: Roger Pau Monné <roger@citrix.com>
>>
>> Thanks.
>>
>> >> --- a/hw/xen/xen_pt_msi.c
>> >> +++ b/hw/xen/xen_pt_msi.c
>> >> @@ -18,6 +18,11 @@
>> >>
>> >>  #define XEN_PT_AUTO_ASSIGN -1
>> >>
>> >> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> >> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>> >
>> > XEN_DOMCTL_INTERFACE_VERSION is already 0xe (without you added
>> > defines), I guess it doesn't matter much because we only care for
>> > stable releases.
>>
>> Well, that's if you build qemu against master Xen. What about
>> people building against older Xen versions/headers?
>
> Sorry, I think I haven't explained myself clearly. What I mean is
> that if this change gets committed before the Xen side one, QEMU would
> not compile against current Xen. As said, this is only a transitory
> issue, and it's never going to be a problem in stable branches. This
> is because of the "#error" that you add.

Oh, yes, that's the case. I can't see an easy way around it, but I'm pretty sure the qemu side of it will see some further delay anyway.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 01.09.17 at 20:20, <roger@citrix.com> wrote:
> On Fri, Sep 01, 2017 at 10:25:42AM -0600, Jan Beulich wrote:
>> Xen and qemu having identical #define-s (with different names) is a
>> strong hint that these should have been part of the public interface
>> from the very start. Use them if they're available, falling back to
>> privately defined values only when using older headers.
>>
>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>
> Reviewed-by: Roger Pau Monné <roger@citrix.com>

Thanks.

>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -18,6 +18,11 @@
>>
>>  #define XEN_PT_AUTO_ASSIGN -1
>>
>> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>
> XEN_DOMCTL_INTERFACE_VERSION is already 0xe (without you added
> defines), I guess it doesn't matter much because we only care for
> stable releases.

Well, that's if you build qemu against master Xen. What about people building against older Xen versions/headers?

Jan
[Qemu-devel] [PATCH] xen: use vMSI related #define-s from public interface
Xen and qemu having identical #define-s (with different names) is a
strong hint that these should have been part of the public interface
from the very start. Use them if they're available, falling back to
privately defined values only when using older headers.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -18,6 +18,11 @@
 
 #define XEN_PT_AUTO_ASSIGN -1
 
+#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
+#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
+#error vMSI defines missing from domctl.h
+#endif
+
 /* shift count for gflags */
 #define XEN_PT_GFLAGS_SHIFT_DEST_ID  0
 #define XEN_PT_GFLAGS_SHIFT_RH       8
@@ -26,6 +31,16 @@
 #define XEN_PT_GFLAGSSHIFT_TRG_MODE  15
 #define XEN_PT_GFLAGSSHIFT_UNMASKED  16
 
+#define XEN_DOMCTL_VMSI_X86_DEST_ID_MASK (0xffU << XEN_PT_GFLAGS_SHIFT_DEST_ID)
+#define XEN_DOMCTL_VMSI_X86_RH_MASK      (1U << XEN_PT_GFLAGS_SHIFT_RH)
+#define XEN_DOMCTL_VMSI_X86_DM_MASK      (1U << XEN_PT_GFLAGS_SHIFT_DM)
+#define XEN_DOMCTL_VMSI_X86_DELIV_MASK   (7U << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
+#define XEN_DOMCTL_VMSI_X86_TRIG_MASK    (1U << XEN_PT_GFLAGSSHIFT_TRG_MODE)
+#define XEN_DOMCTL_VMSI_X86_UNMASKED     (1U << XEN_PT_GFLAGSSHIFT_UNMASKED)
+#endif
+
+#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))
+
 #define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
 
 /*
@@ -49,21 +64,18 @@ static inline uint32_t msi_ext_dest_id(u
 static uint32_t msi_gflags(uint32_t data, uint64_t addr)
 {
-    uint32_t result = 0;
-    int rh, dm, dest_id, deliv_mode, trig_mode;
+    int rh, dm, deliv_mode, trig_mode;
 
     rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
     dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
-    dest_id = msi_dest_id(addr);
     deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
     trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
 
-    result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH)
-        | (dm << XEN_PT_GFLAGS_SHIFT_DM)
-        | (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
-        | (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE);
-
-    return result;
+    return MASK_INSR(msi_dest_id(addr), XEN_DOMCTL_VMSI_X86_DEST_ID_MASK) |
+           MASK_INSR(rh, XEN_DOMCTL_VMSI_X86_RH_MASK) |
+           MASK_INSR(dm, XEN_DOMCTL_VMSI_X86_DM_MASK) |
+           MASK_INSR(deliv_mode, XEN_DOMCTL_VMSI_X86_DELIV_MASK) |
+           MASK_INSR(trig_mode, XEN_DOMCTL_VMSI_X86_TRIG_MASK);
 }
 
 static inline uint64_t msi_addr64(XenPTMSI *msi)
@@ -173,7 +185,7 @@ static int msi_msix_update(XenPCIPassthr
         table_addr = s->msix->mmio_base_addr;
     }
 
-    gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
+    gflags |= masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED;
 
     rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq,
                                   gflags, table_addr);
Re: [Qemu-devel] [PATCH QEMU v2] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 14:19, <roger@citrix.com> wrote:
> When a MSI interrupt is bound to a guest using
> xc_domain_update_msi_irq (XEN_DOMCTL_bind_pt_irq) the interrupt is
> left masked by default.
>
> This causes problems with guests that first configure interrupts and
> clean the per-entry MSIX table mask bit and afterwards enable MSIX
> globally. In such scenario the Xen internal msixtbl handlers would not
> detect the unmasking of MSIX entries because vectors are not yet
> registered since MSIX is not enabled, and vectors would be left
> masked.
>
> Introduce a new flag in the gflags field to signal Xen whether a MSI
> interrupt should be unmasked after being bound.
>
> This also requires to track the mask register for MSI interrupts, so
> QEMU can also notify to Xen whether the MSI interrupt should be bound
> masked or unmasked.
>
> Signed-off-by: Roger Pau Monné <roger@citrix.com>
> Reported-by: Andreas Kinzler <h...@posteo.de>

Reviewed-by: Jan Beulich <jbeul...@suse.com>
Re: [Qemu-devel] [PATCH QEMU] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 12:06, <roger@citrix.com> wrote:
> On Thu, Aug 24, 2017 at 03:54:21AM -0600, Jan Beulich wrote:
>> >>> On 24.08.17 at 11:47, <roger@citrix.com> wrote:
>> > @@ -274,7 +278,7 @@ int xen_pt_msi_update(XenPCIPassthroughState *s)
>> > {
>> >     XenPTMSI *msi = s->msi;
>> >     return msi_msix_update(s, msi_addr64(msi), msi->data, msi->pirq,
>> > -                           false, 0, &msi->pirq);
>> > +                           false, 0, &msi->pirq, false);
>> > }
>>
>> I don't think this is correct when the device has mask bits.
>
> Right, I thought I modified this. I've already changed
> pt_irq_create_bind so that the original behavior is kept if the unmask
> bit is not set. In this case this should be 'true' in order to keep
> the previous behavior, which was correct for MSI.

Wouldn't you want to pass the state of the mask bit here, rather than uniformly hard coding true or false?

> It also makes me wonder whether it would be better to change the name
> of the parameter to "unmask", so that false -> no change to mask, true
> -> unconditionally unmask.

I don't care much about this aspect, but perhaps the qemu maintainers do.

Jan
Re: [Qemu-devel] [PATCH QEMU] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 11:47, <roger@citrix.com> wrote:
> @@ -274,7 +278,7 @@ int xen_pt_msi_update(XenPCIPassthroughState *s)
> {
>     XenPTMSI *msi = s->msi;
>     return msi_msix_update(s, msi_addr64(msi), msi->data, msi->pirq,
> -                           false, 0, &msi->pirq);
> +                           false, 0, &msi->pirq, false);
> }

I don't think this is correct when the device has mask bits.

Jan
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> Stefano Stabellini <sstabell...@kernel.org> 06/23/17 8:43 PM >>>
>On Fri, 23 Jun 2017, Jan Beulich wrote:
>> >>> On 22.06.17 at 20:52, <sstabell...@kernel.org> wrote:
>> > I am happy to write the code and/or the commit message. Would a simple
>> > cast like below work to fix the security issue?
>>
>> I suppose so, but you do realize that this is _exactly_ what
>> my patch does, except you use dangerous casts while I play
>> type-safe?
>
>Why is the cast dangerous?

Casts (and especially pointer ones) are always dangerous, as they have
the potential of type changes elsewhere going unnoticed. You may have
noticed that in reviews I'm often picking at casts, because in a
majority of cases where people use them they're not needed, and hence
their use is a (latent) risk.

> Both your patch and my version of it rely on
>inner knowledge of the way the rings are laid out in memory, but at
>least my patch doesn't introduce the risk of instantiating broken
>structs. Besides, type safety checks don't add much value, given that we
>rely on the way the ring.h macros have been written, which weren't even
>imported in the QEMU project until March this year.

That's sad, as it seems to imply backporting of the change to older
qemu may then be problematic. Otoh I don't recall having had problems
with using the patch with only minor adjustments on older trees of
ours.

>These are the reasons why I prefer my version, but I would consider your
>version with clear in-code comments that warn users to avoid
>instantiating the structs because they are not ABI compliant.
>
>How do you want to proceed?

I can provide a version with comments added, but only next week. If you
feel like you want to go with your variant, I won't stand in the way,
but I also wouldn't give it my ack or alike (which you don't depend on
anyway).

Jan
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 22.06.17 at 20:52, <sstabell...@kernel.org> wrote: > On Thu, 22 Jun 2017, Jan Beulich wrote: >> >>> On 21.06.17 at 20:46, <sstabell...@kernel.org> wrote: >> > On Wed, 21 Jun 2017, Jan Beulich wrote: >> >> >>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: >> >> > On Tue, 20 Jun 2017, Jan Beulich wrote: >> >> >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w >> > only) */ >> >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> >> discard >> > */ >> >> >> }; >> >> >> -struct blkif_x86_32_response { >> >> >> -uint64_tid; /* copied from request */ >> >> >> -uint8_t operation; /* copied from request */ >> >> >> -int16_t status; /* BLKIF_RSP_??? */ >> >> >> -}; >> >> >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> >> >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> >> >> #pragma pack(pop) >> >> >> >> >> >> /* x86_64 protocol version */ >> >> >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w >> > only) */ >> >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> >> discard >> > */ >> >> >> }; >> >> >> -struct blkif_x86_64_response { >> >> >> -uint64_t __attribute__((__aligned__(8))) id; >> >> >> -uint8_t operation; /* copied from request */ >> >> >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> >> >> -}; >> >> >> >> >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> >> >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> >> >> >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> >> >> - struct blkif_common_response); >> >> >> + struct blkif_response); >> >> >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> >> >> - struct blkif_x86_32_response); >> >> >> + struct blkif_response QEMU_PACKED); >> >> > >> >> > In my test, the previous sizes and alignments of the response structs >> >> > were (on both x86_32 and x86_64): >> >> > >> >> > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 >> >> > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 >> >> > >> >> > While with these changes are now, when compiled on x86_64: >> >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 >> >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 >> >> > >> >> > when compiled on x86_32: >> >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 >> >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 >> >> > >> >> > Did I do my tests wrong? >> >> > >> >> > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the >> >> > same as #pragma pack(push, 1), causing the struct to be densely packed, >> >> > leaving no padding whatsever. >> >> > >> >> > In addition, without __attribute__((__aligned__(8))), >> >> > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. >> >> > >> >> > Am I missing something? >> >> >> >> Well, you're mixing attribute application upon structure >> >> declaration with attribute application upon structure use. It's >> >> the latter here, and hence the attribute doesn't affect >> >> structure layout at all. All it does is avoid the _containing_ >> >> 32-bit union to become 8-byte aligned (and tail padding to be >> >> inserted). >> > >> > Thanks for the explanation. 
I admit it's the first time I see the
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 21.06.17 at 20:46, <sstabell...@kernel.org> wrote: > On Wed, 21 Jun 2017, Jan Beulich wrote: >> >>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: >> > On Tue, 20 Jun 2017, Jan Beulich wrote: >> >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w > only) */ >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> discard > */ >> >> }; >> >> -struct blkif_x86_32_response { >> >> -uint64_tid; /* copied from request */ >> >> -uint8_t operation; /* copied from request */ >> >> -int16_t status; /* BLKIF_RSP_??? */ >> >> -}; >> >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> >> #pragma pack(pop) >> >> >> >> /* x86_64 protocol version */ >> >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w > only) */ >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> discard > */ >> >> }; >> >> -struct blkif_x86_64_response { >> >> -uint64_t __attribute__((__aligned__(8))) id; >> >> -uint8_t operation; /* copied from request */ >> >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> >> -}; >> >> >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> >> - struct blkif_common_response); >> >> + struct blkif_response); >> >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> >> - struct blkif_x86_32_response); >> >> + struct blkif_response QEMU_PACKED); >> > >> > In my test, the previous sizes and alignments of the response structs >> > were (on both x86_32 and x86_64): >> > >> > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 >> > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 >> > >> > While with these changes are now, when compiled on x86_64: >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 >> > >> > when compiled on x86_32: >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 >> > >> > Did I do my tests wrong? >> > >> > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the >> > same as #pragma pack(push, 1), causing the struct to be densely packed, >> > leaving no padding whatsever. >> > >> > In addition, without __attribute__((__aligned__(8))), >> > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. >> > >> > Am I missing something? >> >> Well, you're mixing attribute application upon structure >> declaration with attribute application upon structure use. It's >> the latter here, and hence the attribute doesn't affect >> structure layout at all. All it does is avoid the _containing_ >> 32-bit union to become 8-byte aligned (and tail padding to be >> inserted). > > Thanks for the explanation. I admit it's the first time I see the > aligned attribute being used at structure usage only. 
I think it's the > first time QEMU_PACKED is used this way in QEMU too. > > Anyway, even taking that into account, things are still not completely > right: the alignment of struct blkif_x86_32_response QEMU_PACKED is 4 > bytes as you wrote, but the size of struct blkif_x86_32_response is > still 16 bytes instead of 12 bytes in my test. I suspect it worked for > you because the other member of the union (blkif_x86_32_request) is > larger than that. However, I think is not a good idea to rely on this > implementation detail. The implementation of DEFINE_RING_TYPES should be > opaque from our point of view. We shouldn't have to know that there is a > union there. I don't follow - why should we not rely on this? It i
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: > On Tue, 20 Jun 2017, Jan Beulich wrote: >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> blkif_sector_t sector_number;/* start sector idx on disk (r/w only) >> */ >> uint64_t nr_sectors; /* # of contiguous sectors to discard >> */ >> }; >> -struct blkif_x86_32_response { >> -uint64_tid; /* copied from request */ >> -uint8_t operation; /* copied from request */ >> -int16_t status; /* BLKIF_RSP_??? */ >> -}; >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> #pragma pack(pop) >> >> /* x86_64 protocol version */ >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> blkif_sector_t sector_number;/* start sector idx on disk (r/w only) >> */ >> uint64_t nr_sectors; /* # of contiguous sectors to discard >> */ >> }; >> -struct blkif_x86_64_response { >> -uint64_t __attribute__((__aligned__(8))) id; >> -uint8_t operation; /* copied from request */ >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> -}; >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> - struct blkif_common_response); >> + struct blkif_response); >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> - struct blkif_x86_32_response); >> + struct blkif_response QEMU_PACKED); > > In my test, the previous sizes and alignments of the response structs > were (on both x86_32 and x86_64): > > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 > > While with these changes are now, when compiled on x86_64: > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 > > when compiled on x86_32: > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 > > Did I do my tests wrong? > > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the > same as #pragma pack(push, 1), causing the struct to be densely packed, > leaving no padding whatsever. > > In addition, without __attribute__((__aligned__(8))), > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. > > Am I missing something? Well, you're mixing attribute application upon structure declaration with attribute application upon structure use. It's the latter here, and hence the attribute doesn't affect structure layout at all. All it does is avoid the _containing_ 32-bit union to become 8-byte aligned (and tail padding to be inserted). Jan
[Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other (Linux)
backends do. Build on the fact that all response structure flavors are
actually identical (the old code did make this assumption too).

This is XSA-216.

Reported-by: Anthony Perard <anthony.per...@citrix.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Anthony PERARD <anthony.per...@citrix.com>
---
v2: Add QEMU_PACKED to fix handling 32-bit guests by 64-bit qemu.

--- a/hw/block/xen_blkif.h
+++ b/hw/block/xen_blkif.h
@@ -14,9 +14,6 @@
 struct blkif_common_request {
     char dummy;
 };
-struct blkif_common_response {
-    char dummy;
-};

 /* i386 protocol version */
 #pragma pack(push, 4)
@@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard {
     blkif_sector_t sector_number;    /* start sector idx on disk (r/w only) */
     uint64_t       nr_sectors;       /* # of contiguous sectors to discard  */
 };
-struct blkif_x86_32_response {
-    uint64_t        id;              /* copied from request */
-    uint8_t         operation;       /* copied from request */
-    int16_t         status;          /* BLKIF_RSP_??? */
-};
 typedef struct blkif_x86_32_request blkif_x86_32_request_t;
-typedef struct blkif_x86_32_response blkif_x86_32_response_t;
 #pragma pack(pop)

 /* x86_64 protocol version */
@@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard {
     blkif_sector_t sector_number;    /* start sector idx on disk (r/w only) */
     uint64_t       nr_sectors;       /* # of contiguous sectors to discard  */
 };
-struct blkif_x86_64_response {
-    uint64_t       __attribute__((__aligned__(8))) id;
-    uint8_t         operation;       /* copied from request */
-    int16_t         status;          /* BLKIF_RSP_??? */
-};

 typedef struct blkif_x86_64_request blkif_x86_64_request_t;
-typedef struct blkif_x86_64_response blkif_x86_64_response_t;

 DEFINE_RING_TYPES(blkif_common, struct blkif_common_request,
-                  struct blkif_common_response);
+                  struct blkif_response);
 DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request,
-                  struct blkif_x86_32_response);
+                  struct blkif_response QEMU_PACKED);
 DEFINE_RING_TYPES(blkif_x86_64, struct blkif_x86_64_request,
-                  struct blkif_x86_64_response);
+                  struct blkif_response);

 union blkif_back_rings {
     blkif_back_ring_t        native;
--- a/hw/block/xen_disk.c
+++ b/hw/block/xen_disk.c
@@ -769,31 +769,30 @@ static int blk_send_response_one(struct
     struct XenBlkDev  *blkdev = ioreq->blkdev;
     int               send_notify   = 0;
     int               have_requests = 0;
-    blkif_response_t  resp;
-    void              *dst;
-
-    resp.id        = ioreq->req.id;
-    resp.operation = ioreq->req.operation;
-    resp.status    = ioreq->status;
+    blkif_response_t  *resp;

     /* Place on the response ring for the relevant domain. */
     switch (blkdev->protocol) {
     case BLKIF_PROTOCOL_NATIVE:
-        dst = RING_GET_RESPONSE(&blkdev->rings.native, blkdev->rings.native.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.native,
+                                 blkdev->rings.native.rsp_prod_pvt);
         break;
     case BLKIF_PROTOCOL_X86_32:
-        dst = RING_GET_RESPONSE(&blkdev->rings.x86_32_part,
-                                blkdev->rings.x86_32_part.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.x86_32_part,
+                                 blkdev->rings.x86_32_part.rsp_prod_pvt);
         break;
     case BLKIF_PROTOCOL_X86_64:
-        dst = RING_GET_RESPONSE(&blkdev->rings.x86_64_part,
-                                blkdev->rings.x86_64_part.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.x86_64_part,
+                                 blkdev->rings.x86_64_part.rsp_prod_pvt);
         break;
     default:
-        dst = NULL;
         return 0;
     }
-    memcpy(dst, &resp, sizeof(resp));
+
+    resp->id        = ioreq->req.id;
+    resp->operation = ioreq->req.operation;
+    resp->status    = ioreq->status;

     blkdev->rings.common.rsp_prod_pvt++;
     RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blkdev->rings.common, send_notify);
Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI
>>> On 19.05.17 at 13:16,wrote: > On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote: >> --- a/include/hw/i386/apic-msidef.h >> +++ b/include/hw/i386/apic-msidef.h >> @@ -26,6 +26,7 @@ >> >> #define MSI_ADDR_DEST_ID_SHIFT 12 >> #define MSI_ADDR_DEST_IDX_SHIFT 4 >> -#define MSI_ADDR_DEST_ID_MASK 0x000 >> +#define MSI_ADDR_DEST_ID_MASK 0x000fff00 > > The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch > should be: > +#define MSI_ADDR_DEST_ID_MASK 0x0000 Judging from other sources, rather the other way around - the mask needs to have further bits removed (should be 0x000ff000 afaict). Xen sources confirm this, and while Linux has the value you suggest, that contradicts #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_ID(dest) (((dest) << MSI_ADDR_DEST_ID_SHIFT) & \ MSI_ADDR_DEST_ID_MASK) as well as #define MSI_ADDR_EXT_DEST_ID(dest) ((dest) & 0xff00) chopping off just the low 8 bits. Jan
[Qemu-devel] [PATCH v2 3/3] xen: ignore direction in bufioreq handling
There's no way to communicate back read data, so only writes can ever
be usefully specified. Ignore the field, paving the road for eventually
re-using the bit for something else in a few (many?) years time.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Paul Durrant <paul.durr...@citrix.com>
Acked-by: Stefano Stabellini <sstabell...@kernel.org>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -997,6 +997,7 @@ static int handle_buffered_iopage(XenIOS
     memset(&req, 0x00, sizeof(req));
     req.state = STATE_IOREQ_READY;
     req.count = 1;
+    req.dir = IOREQ_WRITE;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1014,7 +1015,6 @@ static int handle_buffered_iopage(XenIOS
         req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.dir = buf_req->dir;
         req.type = buf_req->type;
         xen_rmb();
         qw = (req.size == 8);
@@ -1031,10 +1031,12 @@ static int handle_buffered_iopage(XenIOS
         handle_ioreq(state, &req);

         /* Only req.data may get updated by handle_ioreq(), albeit even that
-         * should not happen as such data would never make it to the guest.
+         * should not happen as such data would never make it to the guest (we
+         * can only usefully see writes here after all).
          */
         assert(req.state == STATE_IOREQ_READY);
         assert(req.count == 1);
+        assert(req.dir == IOREQ_WRITE);
         assert(!req.data_is_ptr);

         atomic_add(&buf_page->read_pointer, qw + 1);
[Qemu-devel] [PATCH v2 2/3] xen: slightly simplify bufioreq handling
There's no point setting fields always receiving the same value on each
iteration, as handle_ioreq() doesn't alter them anyway. Set state and
count once ahead of the loop, drop the redundant clearing of
data_is_ptr, and avoid the meaningless setting of df altogether. Also
avoid doing an unsigned long calculation of size when the field to be
initialized is only 32 bits wide (and the shift value in the range
0...3).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Paul Durrant <paul.durr...@citrix.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -995,6 +995,8 @@ static int handle_buffered_iopage(XenIOS
     }

     memset(&req, 0x00, sizeof(req));
+    req.state = STATE_IOREQ_READY;
+    req.count = 1;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1009,15 +1011,11 @@ static int handle_buffered_iopage(XenIOS
             break;
         }
         buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
-        req.size = 1UL << buf_req->size;
-        req.count = 1;
+        req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.state = STATE_IOREQ_READY;
         req.dir = buf_req->dir;
-        req.df = 1;
         req.type = buf_req->type;
-        req.data_is_ptr = 0;
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
@@ -1032,6 +1030,13 @@ static int handle_buffered_iopage(XenIOS

         handle_ioreq(state, &req);

+        /* Only req.data may get updated by handle_ioreq(), albeit even that
+         * should not happen as such data would never make it to the guest.
+         */
+        assert(req.state == STATE_IOREQ_READY);
+        assert(req.count == 1);
+        assert(!req.data_is_ptr);
+
         atomic_add(&buf_page->read_pointer, qw + 1);
     }
[Qemu-devel] [PATCH v2 1/3] xen: fix quad word bufioreq handling
We should not consume the second slot if it didn't get written yet.
Normal writers - i.e. Xen - would not update write_pointer between the
two writes, but the page may get fiddled with by the guest itself, and
we're better off avoiding to enter an infinite loop in that case.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Bail (using hw_error()) instead of just breaking the loop.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -1021,6 +1021,9 @@ static int handle_buffered_iopage(XenIOS
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
+            if (rdptr + 1 == wrptr) {
+                hw_error("Incomplete quad word buffered ioreq");
+            }
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
[Qemu-devel] [PATCH v2 0/3] xen: XSA-197 follow-ups
1: fix quad word bufioreq handling 2: slightly simplify bufioreq handling 3: ignore direction in bufioreq handling Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v2: Only patch 1 changed; see there.
Re: [Qemu-devel] [PATCH 2/3] xen: slightly simplify bufioreq handling
>>> On 23.11.16 at 19:13, <sstabell...@kernel.org> wrote:
> On Wed, 23 Nov 2016, Jan Beulich wrote:
>> There's no point setting fields always receiving the same value on each
>> iteration, as handle_ioreq() doesn't alter them anyway. Set state and
>> count once ahead of the loop, drop the redundant clearing of
>> data_is_ptr, and avoid the meaningless setting of df altogether.
>
> Why setting df is meaningless?

With count being fixed to one there's no need to update addresses, and
hence no use for knowing which direction the updates should go.

Jan
Re: [Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
>>> On 23.11.16 at 11:45, wrote:
> No, if QEMU is using a default ioreq server (i.e. the legacy way of
> doing things) then it's vulnerable to the guest messing with the rings
> and I'd forgotten that migrated-in guests from old QEMUs also end up
> using the default server, so I guess this is a worthy check to make...
> although maybe it's best to just bail if the check fails, since it
> would indicate a malicious guest.

Okay, that's basically the TBD note I have in the patch; I'll wait for
at least one of the qemu maintainers to voice their preference.

Jan
Re: [Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
>>> On 23.11.16 at 10:48, <paul.durr...@citrix.com> wrote:
>> From: Jan Beulich [mailto:jbeul...@suse.com]
>> Sent: 23 November 2016 09:24
>>
>> We should not consume the second slot if it didn't get written yet.
>> Normal writers - i.e. Xen - would not update write_pointer between the
>> two writes, but the page may get fiddled with by the guest itself, and
>> we're better off avoiding to enter an infinite loop in that case.
>>
>
> Xen would never put QEMU in this situation and the guest can't actually
> modify the page whilst it's in use, since activation of the IOREQ server
> removes the page from the guest's p2m so the premise of the patch is not
> correct.

Is that the case even for pre-ioreq-server Xen versions? The issue here
was reported together with what became XSA-197, and it's not been
assigned its own XSA just because there are other ways for a guest to
place high load on its qemu process (and there are ways to deal with
such high load situations).

Jan
[Qemu-devel] [PATCH 3/3] xen: ignore direction in bufioreq handling
There's no way to communicate back read data, so only writes can ever
be usefully specified. Ignore the field, paving the road for eventually
re-using the bit for something else in a few (many?) years time.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -997,6 +997,7 @@ static int handle_buffered_iopage(XenIOS
     memset(&req, 0x00, sizeof(req));
     req.state = STATE_IOREQ_READY;
     req.count = 1;
+    req.dir = IOREQ_WRITE;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1014,7 +1015,6 @@ static int handle_buffered_iopage(XenIOS
         req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.dir = buf_req->dir;
         req.type = buf_req->type;
         xen_rmb();
         qw = (req.size == 8);
@@ -1031,10 +1031,12 @@ static int handle_buffered_iopage(XenIOS
         handle_ioreq(state, &req);

         /* Only req.data may get updated by handle_ioreq(), albeit even that
-         * should not happen as such data would never make it to the guest.
+         * should not happen as such data would never make it to the guest (we
+         * can only usefully see writes here after all).
          */
         assert(req.state == STATE_IOREQ_READY);
         assert(req.count == 1);
+        assert(req.dir == IOREQ_WRITE);
         assert(!req.data_is_ptr);

         atomic_add(&buf_page->read_pointer, qw + 1);
[Qemu-devel] [PATCH 2/3] xen: slightly simplify bufioreq handling
There's no point setting fields always receiving the same value on each
iteration, as handle_ioreq() doesn't alter them anyway. Set state and
count once ahead of the loop, drop the redundant clearing of
data_is_ptr, and avoid the meaningless setting of df altogether. Also
avoid doing an unsigned long calculation of size when the field to be
initialized is only 32 bits wide (and the shift value in the range
0...3).

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -995,6 +995,8 @@ static int handle_buffered_iopage(XenIOS
     }

     memset(&req, 0x00, sizeof(req));
+    req.state = STATE_IOREQ_READY;
+    req.count = 1;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1009,15 +1011,11 @@ static int handle_buffered_iopage(XenIOS
             break;
         }
         buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
-        req.size = 1UL << buf_req->size;
-        req.count = 1;
+        req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.state = STATE_IOREQ_READY;
         req.dir = buf_req->dir;
-        req.df = 1;
         req.type = buf_req->type;
-        req.data_is_ptr = 0;
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
@@ -1032,6 +1030,13 @@ static int handle_buffered_iopage(XenIOS

         handle_ioreq(state, &req);

+        /* Only req.data may get updated by handle_ioreq(), albeit even that
+         * should not happen as such data would never make it to the guest.
+         */
+        assert(req.state == STATE_IOREQ_READY);
+        assert(req.count == 1);
+        assert(!req.data_is_ptr);
+
         atomic_add(&buf_page->read_pointer, qw + 1);
     }
[Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
We should not consume the second slot if it didn't get written yet.
Normal writers - i.e. Xen - would not update write_pointer between the
two writes, but the page may get fiddled with by the guest itself, and
we're better off avoiding to enter an infinite loop in that case.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
TBD: Alternatively we could call e.g. hw_error() instead.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -1021,6 +1021,9 @@ static int handle_buffered_iopage(XenIOS
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
+            if (rdptr + 1 == wrptr) {
+                break;
+            }
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
[Qemu-devel] [PATCH 0/3] xen: XSA-197 follow-ups
1: fix quad word bufioreq handling 2: slightly simplify bufioreq handling 3: ignore direction in bufioreq handling Signed-off-by: Jan Beulich <jbeul...@suse.com>
[Qemu-devel] [PATCH] xen: fix ioreq handling
Avoid double fetches and bounds check size to avoid overflowing
internal variables.

This is CVE-2016-9381 / XSA-197.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Stefano Stabellini <sstabell...@kernel.org>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -810,6 +810,10 @@ static void cpu_ioreq_pio(ioreq_t *req)
     trace_cpu_ioreq_pio(req, req->dir, req->df, req->data_is_ptr, req->addr,
                         req->data, req->count, req->size);

+    if (req->size > sizeof(uint32_t)) {
+        hw_error("PIO: bad size (%u)", req->size);
+    }
+
     if (req->dir == IOREQ_READ) {
         if (!req->data_is_ptr) {
             req->data = do_inp(req->addr, req->size);
@@ -846,6 +850,10 @@ static void cpu_ioreq_move(ioreq_t *req)
     trace_cpu_ioreq_move(req, req->dir, req->df, req->data_is_ptr, req->addr,
                          req->data, req->count, req->size);

+    if (req->size > sizeof(req->data)) {
+        hw_error("MMIO: bad size (%u)", req->size);
+    }
+
     if (!req->data_is_ptr) {
         if (req->dir == IOREQ_READ) {
             for (i = 0; i < req->count; i++) {
@@ -1010,11 +1018,13 @@ static int handle_buffered_iopage(XenIOS
         req.df = 1;
         req.type = buf_req->type;
         req.data_is_ptr = 0;
+        xen_rmb();
         qw = (req.size == 8);
         if (qw) {
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
+            xen_rmb();
         }

         handle_ioreq(state, &req);
@@ -1045,7 +1055,11 @@ static void cpu_handle_ioreq(void *opaqu
     handle_buffered_iopage(state);
     if (req) {
-        handle_ioreq(state, req);
+        ioreq_t copy = *req;
+
+        xen_rmb();
+        handle_ioreq(state, &copy);
+        req->data = copy.data;

         if (req->state != STATE_IOREQ_INPROCESS) {
             fprintf(stderr, "Badness in I/O request ... not in service?!: "
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 12:08, <jgr...@suse.com> wrote: > On 17/06/16 11:50, Paul Durrant wrote: >>> -Original Message- >>> From: Juergen Gross [mailto:jgr...@suse.com] >>> Sent: 17 June 2016 10:46 >>> To: Paul Durrant; Jan Beulich >>> Cc: Anthony Perard; xen-devel; sstabell...@kernel.org; qemu- >>> de...@nongnu.org; kra...@redhat.com >>> Subject: Re: [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for >>> 32/64 word size mix >>> >>> On 17/06/16 11:37, Paul Durrant wrote: >>>>> -Original Message- >>>>> From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of >>> Jan >>>>> Beulich >>>>> Sent: 17 June 2016 10:26 >>>>> To: Juergen Gross >>>>> Cc: Anthony Perard; xen-devel; sstabell...@kernel.org; qemu- >>>>> de...@nongnu.org; kra...@redhat.com >>>>> Subject: Re: [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for >>>>> 32/64 word size mix >>>>> >>>>>>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: >>>>>> In case the word size of the domU and qemu running the qdisk backend >>>>>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>>>>> structure in the ring have different layouts for different word size. >>>>>> >>>>>> Correct this by copying the request structure in case of different >>>>>> word size element by element in the BLKIF_OP_DISCARD case, too. >>>>>> >>>>>> The easiest way to achieve this is to resync hw/block/xen_blkif.h with >>>>>> its original source from the Linux kernel. >>>>>> >>>>>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>>>>> --- >>>>>> V2: resync with Linux kernel version of hw/block/xen_blkif.h as >>>>>> suggested by Paul Durrant >>>>> >>>>> Oh, I didn't realize he suggested syncing with the Linux variant. >>>>> Why not with the canonical one? I have to admit that I particularly >>>>> dislike Linux'es strange union-izng, mainly because of it requiring >>>>> this myriad of __attribute__((__packed__)). 
>>>>> >>>> >>>> Yes, it's truly grotesque and such things should be blown away with >>> extreme prejudice. >>> >>> Sorry, I'm confused now. >>> >>> Do you still mandate for the resync or not? >>> >>> Resyncing with elimination of all the __packed__ stuff seems not to be >>> a proper alternative as this would require a major rework. >> >> Why? Replacing the existing horribleness with the canonical header (fixed > for style) might mean a large diff but it should be functionally the same or > something has gone very seriously wrong. If the extra part you need is not in > the canonical header then adding this as a second patch seems like a > reasonable plan. > > I think you don't realize that qemu is built using the public headers > from the Xen build environment. So there is no way to resync with the > canonical header as this isn't part of the qemu tree. Oh, right, these are just the 32- and 64-bit variants which get declared here. > The header in question is originating from the Linux one which is an > add-on of the canonical header containing the explicit 32- and 64-bit > variants of the xenbus protocol and the conversion routines between > those. > > It would be possible to add these parts to the canonical header, but > do we really want that? Definitely not, I would say. But syncing with Linux'es strange code then isn't a good idea either. (Still waiting for a qemu maintainer opinion, though.) Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 11:31, <jgr...@suse.com> wrote: > On 17/06/16 11:26, Jan Beulich wrote: >>>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: >>> In case the word size of the domU and qemu running the qdisk backend >>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>> structure in the ring have different layouts for different word size. >>> >>> Correct this by copying the request structure in case of different >>> word size element by element in the BLKIF_OP_DISCARD case, too. >>> >>> The easiest way to achieve this is to resync hw/block/xen_blkif.h with >>> its original source from the Linux kernel. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>> --- >>> V2: resync with Linux kernel version of hw/block/xen_blkif.h as >>> suggested by Paul Durrant >> >> Oh, I didn't realize he suggested syncing with the Linux variant. >> Why not with the canonical one? I have to admit that I particularly >> dislike Linux'es strange union-izng, mainly because of it requiring >> this myriad of __attribute__((__packed__)). > > What would be gained by syncing with the canonical one? The part to be > modified is available in the Linux variant only. In, as said, a rather disgusting (in my opinion) way. I'm not the maintainer anyway, so you don't have to convince me. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: > In case the word size of the domU and qemu running the qdisk backend > differ BLKIF_OP_DISCARD will not work reliably, as the request > structure in the ring have different layouts for different word size. > > Correct this by copying the request structure in case of different > word size element by element in the BLKIF_OP_DISCARD case, too. > > The easiest way to achieve this is to resync hw/block/xen_blkif.h with > its original source from the Linux kernel. > > Signed-off-by: Juergen Gross > --- > V2: resync with Linux kernel version of hw/block/xen_blkif.h as > suggested by Paul Durrant Oh, I didn't realize he suggested syncing with the Linux variant. Why not with the canonical one? I have to admit that I particularly dislike Linux'es strange union-izing, mainly because of it requiring this myriad of __attribute__((__packed__)). Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 16.06.16 at 13:04, <jgr...@suse.com> wrote: > On 16/06/16 12:54, Jan Beulich wrote: >>>>> On 16.06.16 at 12:02, <jgr...@suse.com> wrote: >>> In case the word size of the domU and qemu running the qdisk backend >>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>> structure in the ring have different layouts for different word size. >>> >>> Correct this by copying the request structure in case of different >>> word size element by element in the BLKIF_OP_DISCARD case, too. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >> >> With the indentation (tabs vs blanks) fixed > > Hmm, qemu coding style is to use blanks. I could: > a) leave the patch as is (changed lines indented with blanks) > b) use tabs to indent (style of the modified file up to now) > c) change the style of the file in this patch > d) change the style of the file in a separate patch > > Any preferences? I'd say a) is not an option, but beyond that I think this needs to be answered by the qemu maintainers. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 16.06.16 at 12:02, <jgr...@suse.com> wrote: > In case the word size of the domU and qemu running the qdisk backend > differ BLKIF_OP_DISCARD will not work reliably, as the request > structure in the ring have different layouts for different word size. > > Correct this by copying the request structure in case of different > word size element by element in the BLKIF_OP_DISCARD case, too. > > Signed-off-by: Juergen Gross <jgr...@suse.com> With the indentation (tabs vs blanks) fixed Reviewed-by: Jan Beulich <jbeul...@suse.com> And maybe ... > @@ -82,7 +98,7 @@ static inline void blkif_get_x86_32_req(blkif_request_t > *dst, blkif_x86_32_reque > /* Prevent the compiler from using src->... instead. */ > barrier(); > if (dst->operation == BLKIF_OP_DISCARD) { > - struct blkif_request_discard *s = (void *)src; > +struct blkif_x86_32_request_discard *s = (void *)src; ... it would also be worth adding const here and ... > @@ -105,7 +121,7 @@ static inline void blkif_get_x86_64_req(blkif_request_t > *dst, blkif_x86_64_reque > /* Prevent the compiler from using src->... instead. */ > barrier(); > if (dst->operation == BLKIF_OP_DISCARD) { > - struct blkif_request_discard *s = (void *)src; > +struct blkif_x86_64_request_discard *s = (void *)src; ... here. Jan
Re: [Qemu-devel] [Xen-devel] [PULL 0/2] xen-20160613-tag
>>> On 13.06.16 at 12:54, wrote: > The following changes since commit a93c1bdf0bd4689287094ddb2aae3dc907da3535: > > Merge remote-tracking branch 'remotes/kraxel/tags/pull-ui-20160610-1' into > staging (2016-06-10 15:47:17 +0100) > > are available in the git repository at: > > > git://xenbits.xen.org/people/sstabellini/qemu-dm.git tags/xen-20160613-tag > > for you to fetch changes up to 88c16567d2cd23da328787187910b013ee43ebca: > > Introduce "xen-load-devices-state" (2016-06-13 11:50:53 +0100) > > > Xen 2016/06/13 > > > Anthony PERARD (1): > exec: Fix qemu_ram_block_from_host for Xen > > Wen Congyang (1): > Introduce "xen-load-devices-state" May I ask what the disposition of "xen/blkif: avoid double access to any shared ring request fields" is? I don't think I've seen a pull req, and it doesn't appear to be in master. Jan
[Qemu-devel] [PATCH] xen/blkif: avoid double access to any shared ring request fields
Commit f9e98e5d7a ("xen/blkif: Avoid double access to src->nr_segments") didn't go far enough: src->operation is also being used twice. And nothing was done to prevent the compiler from using the source side of the copy done by blk_get_request() (granted that's very unlikely). Move the barrier()s up, and add another one to blk_get_request(). Note that for completing XSA-155, the barrier() getting added to blk_get_request() would suffice, and hence the changes to xen_blkif.h are more like just cleanup. And since, as said, the unpatched code getting compiled to something vulnerable is very unlikely (and not observed in practice), this isn't being viewed as a new security issue. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- a/hw/block/xen_blkif.h +++ b/hw/block/xen_blkif.h @@ -79,14 +79,14 @@ static inline void blkif_get_x86_32_req( dst->handle = src->handle; dst->id = src->id; dst->sector_number = src->sector_number; - if (src->operation == BLKIF_OP_DISCARD) { + /* Prevent the compiler from using src->... instead. */ + barrier(); + if (dst->operation == BLKIF_OP_DISCARD) { struct blkif_request_discard *s = (void *)src; struct blkif_request_discard *d = (void *)dst; d->nr_sectors = s->nr_sectors; return; } - /* prevent the compiler from optimizing the code and using src->nr_segments instead */ - barrier(); if (n > dst->nr_segments) n = dst->nr_segments; for (i = 0; i < n; i++) @@ -102,14 +102,14 @@ static inline void blkif_get_x86_64_req( dst->handle = src->handle; dst->id = src->id; dst->sector_number = src->sector_number; - if (src->operation == BLKIF_OP_DISCARD) { + /* Prevent the compiler from using src->... instead. 
*/ + barrier(); + if (dst->operation == BLKIF_OP_DISCARD) { struct blkif_request_discard *s = (void *)src; struct blkif_request_discard *d = (void *)dst; d->nr_sectors = s->nr_sectors; return; } - /* prevent the compiler from optimizing the code and using src->nr_segments instead */ - barrier(); if (n > dst->nr_segments) n = dst->nr_segments; for (i = 0; i < n; i++) --- a/hw/block/xen_disk.c +++ b/hw/block/xen_disk.c @@ -679,6 +679,8 @@ static int blk_get_request(struct XenBlk RING_GET_REQUEST(&blkdev->rings.x86_64_part, rc)); break; } +/* Prevent the compiler from accessing the on-ring fields instead. */ +barrier(); return 0; }
Re: [Qemu-devel] [Xen-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 17:56, wrote: > For the original issue here, could the flag be exposed as a > XEN_SYSCTL_PHYSCAP_ Yes, I think it could, albeit calling this a "capability" or "feature" seems odd (since really the original behavior was bogus/buggy). But - with sysctl not being a stable interface, is making qemu use this actually a good idea? I.e. won't we paint ourselves into the corner of needing to write compatibility wrappers in qemu whenever XEN_SYSCTL_physinfo (or its libxc wrapper) changes? Jan
Re: [Qemu-devel] [Xen-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 17:56, wrote: > On Fri, 2015-12-11 at 16:44 +, Stefano Stabellini wrote: >> >> It is not possible to do this at runtime. I think we should do this at >> compile time because in any case it is not supported to run a QEMU built >> for a given Xen version on a different Xen version. > > I am currently working pretty hard to make this possible in the future, it > would be a shame to add another reason it wasn't possible at this stage. And I don't think it's not possible - if anything, the infrastructure to do so is just missing. I'm definitely not going to make this a build time check, since I deem it very wrong namely when considering --with-system-qemu (as in that case there shouldn't be any dependency on the precise Xen tools versions in use - using plural intentionally here to point out the possibility of multiple ones being present). Jan
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 15:33, <stefano.stabell...@eu.citrix.com> wrote: > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >>> On 07.12.15 at 15:56, <stefano.stabell...@eu.citrix.com> wrote: >> > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >> >>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: >> >> > On Tue, 24 Nov 2015, Jan Beulich wrote: >> >> >> TBD: We probably need to deal with running on an older hypervisor. I >> >> >> can't, however, immediately see a way for qemu to find out. >> >> > >> >> > Actually QEMU has already an infrastructure to detect the hypervisor >> >> > version at compile time, see include/hw/xen/xen_common.h. You could >> >> > #define the right emu_mask depending on the hypervisor. >> >> >> >> We don't want compile time detection here, but runtime one. >> > >> > I guess the issue is that a fix was backported to Xen that changed its >> > behaviour in past releases, right? >> >> No, we shouldn't try to guess whether this is present in any pre-4.6 >> hypervisors; we should simply accept that maskable MSI is not >> available for guests there, because ... >> >> > Is there a way to detect the presence of this fix in Xen, by invoking an >> > hypercall and checking the returned values and error numbers? >> >> ... there's nothing to (reliably) probe here. This really is just an >> implementation detail of the hypervisor, and hence a version check >> is all we have available. > > In that case, I think we should stay on the safe side, and only expose > the masking capability (only take into effects the changes that this > patch makes) for Xen >= 4.7. > > What do you think? That's what I suggested, with the hope of getting a hint where the necessary infrastructure is (somehow I didn't find any run time version dependent code to clone from). Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v3 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 09.12.15 at 16:55, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 8 Dec 2015, Jan Beulich wrote: >> The remaining log message in pci_msix_write() is wrong, as there guest >> behavior may only appear to be wrong: For one, the old logic didn't >> take the mask-all bit into account. And then this shouldn't depend on >> host device state (i.e. the host may have masked the entry without the >> guest having done so). Plus these writes shouldn't be dropped even when >> an entry gets unmasked. Instead, if they can't be made take effect >> right away, they should take effect on the next unmasking or enabling >> operation - the specification explicitly describes such caching >> behavior. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> > > I have applied this patch and the first 2 of the series to my "next" > branch. I have to think a bit more about the fourth. Thanks (I guess you mean this one and the _next_ 2 of this series). Jan
[Qemu-devel] [PATCH v3 1/4] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as there guest behavior may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v3: Add comment to xen_pt_msix_update_one(). v2: Pass original vec_ctrl to xen_pt_msix_update_one() instead of (ab)using latch[]. --- a/hw/xen/xen_pt_config_init.c +++ b/hw/xen/xen_pt_config_init.c @@ -1499,6 +1499,8 @@ static int xen_pt_msixctrl_reg_write(Xen xen_pt_msix_disable(s); } +s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL; + debug_msix_enabled_old = s->msix->enabled; s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE); if (s->msix->enabled != debug_msix_enabled_old) { --- a/hw/xen/xen_pt.h +++ b/hw/xen/xen_pt.h @@ -187,13 +187,13 @@ typedef struct XenPTMSIXEntry { int pirq; uint64_t addr; uint32_t data; -uint32_t vector_ctrl; +uint32_t latch[4]; bool updated; /* indicate whether MSI ADDR or DATA is updated */ -bool warned; /* avoid issuing (bogus) warning more than once */ } XenPTMSIXEntry; typedef struct XenPTMSIX { uint32_t ctrl_offset; bool enabled; +bool maskall; int total_entries; int bar_index; uint64_t table_base; --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -25,6 +25,7 @@ #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15 +#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)] /* * Helpers @@ -314,7 +315,8 @@ static int msix_set_enable(XenPCIPassthr enabled); } -static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr) +static int xen_pt_msix_update_one(XenPCIPassthroughState 
*s, int entry_nr, + uint32_t vec_ctrl) { XenPTMSIXEntry *entry = NULL; int pirq; @@ -332,6 +334,19 @@ static int xen_pt_msix_update_one(XenPCI pirq = entry->pirq; +/* + * Update the entry addr and data to the latest values only when the + * entry is masked or they are all masked, as required by the spec. + * Addr and data changes while the MSI-X entry is unmasked get deferred + * until the next masked -> unmasked transition. + */ +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +entry->addr = entry->latch(LOWER_ADDR) | + ((uint64_t)entry->latch(UPPER_ADDR) << 32); +entry->data = entry->latch(DATA); +} + rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr, entry->pirq == XEN_PT_UNASSIGNED_PIRQ); if (rc) { @@ -357,7 +372,7 @@ int xen_pt_msix_update(XenPCIPassthrough int i; for (i = 0; i < msix->total_entries; i++) { -xen_pt_msix_update_one(s, i); +xen_pt_msix_update_one(s, i, msix->msix_entry[i].latch(VECTOR_CTRL)); } return 0; @@ -406,35 +421,15 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -return e->addr & UINT32_MAX; -case PCI_MSIX_ENTRY_UPPER_ADDR: -return e->addr >> 32; -case PCI_MSIX_ENTRY_DATA: -return e->data; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -return e->vector_ctrl; -default: -return 0; -} +return !(offset % sizeof(*e->latch)) + ? 
e->latch[offset / sizeof(*e->latch)] : 0; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val; -break; -case PCI_MSIX_ENTRY_UPPER_ADDR: -e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX); -break; -case PCI_MSIX_ENTRY_DATA: -e->data = val; -break; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -e->vector_ctrl = val; -break; +if (!(offset % sizeof(*e->latch))) +{ +e->latch[offset / sizeof(*e->latch)] = val; } } @@ -454,39 +449,26 @@ static void pci_msix_write(void *opaque, offset = addr % PCI_MSIX_ENTRY_SIZE; if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) { -const volatile uint32_t *vec_ctrl; - if (get_entry_value(entry, offset) == val &&
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 07.12.15 at 15:56, <stefano.stabell...@eu.citrix.com> wrote: > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: >> > On Tue, 24 Nov 2015, Jan Beulich wrote: >> >> Now that the hypervisor intercepts all config space writes and monitors >> >> changes to the masking flags, this undoes the main effect of the >> >> XSA-129 fix, exposing the masking capability again to guests. > > Could you please mention the relevant commit ids in Xen? It's just one (which I've now added a reference to), unless you want all the prereqs listed. > What happens if QEMU, with this change, is running on top of an older > Xen that doesn't intercepts all config space writes? The security issue would resurface. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> >> >> --- >> >> TBD: We probably need to deal with running on an older hypervisor. I >> >> can't, however, immediately see a way for qemu to find out. >> > >> > Actually QEMU has already an infrastructure to detect the hypervisor >> > version at compile time, see include/hw/xen/xen_common.h. You could >> > #define the right emu_mask depending on the hypervisor. >> >> We don't want compile time detection here, but runtime one. > > I guess the issue is that a fix was backported to Xen that changed its > behaviour in past releases, right? No, we shouldn't try to guess whether this is present in any pre-4.6 hypervisors; we should simply accept that maskable MSI is not available for guests there, because ... > Is there a way to detect the presence of this fix in Xen, by invoking an > hypercall and checking the returned values and error numbers? ... there's nothing to (reliably) probe here. This really is just an implementation detail of the hypervisor, and hence a version check is all we have available. Jan
Re: [Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 07.12.15 at 13:41, wrote: > I know that in your opinion is superfluous, nonetheless could you please > add 2-3 lines of in-code comment right here, to explain what you are > doing with the check? Something like: > > /* > * Update the entry addr and data to the latest values only when the > * entry is masked or they are all masked, as required by the spec. > * Addr and data changes while the MSI-X entry is unmasked will be > * delayed until the next masking->unmasking. > */ Btw, will it be okay to just resend this one patch as v3, or do I need to resend the whole series (the rest of which didn't change)? Jan
Re: [Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 07.12.15 at 13:41, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 24 Nov 2015, Jan Beulich wrote: >> @@ -332,6 +334,13 @@ static int xen_pt_msix_update_one(XenPCI >> >> pirq = entry->pirq; > > I know that in your opinion is superfluous, nonetheless could you please > add 2-3 lines of in-code comment right here, to explain what you are > doing with the check? Something like: > > /* > * Update the entry addr and data to the latest values only when the > * entry is masked or they are all masked, as required by the spec. > * Addr and data changes while the MSI-X entry is unmasked will be > * delayed until the next masking->unmasking. > */ > > >> +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || >> +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { >> +entry->addr = entry->latch(LOWER_ADDR) | >> + ((uint64_t)entry->latch(UPPER_ADDR) << 32); >> +entry->data = entry->latch(DATA); >> +} Adding a comment like this is fine of course, namely if it helps acceptance. Jan
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 24 Nov 2015, Jan Beulich wrote: >> Now that the hypervisor intercepts all config space writes and monitors >> changes to the masking flags, this undoes the main effect of the >> XSA-129 fix, exposing the masking capability again to guests. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> >> --- >> TBD: We probably need to deal with running on an older hypervisor. I >> can't, however, immediately see a way for qemu to find out. > > Actually QEMU has already an infrastructure to detect the hypervisor > version at compile time, see include/hw/xen/xen_common.h. You could > #define the right emu_mask depending on the hypervisor. We don't want compile time detection here, but runtime one. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
>>> Tue, 16 Jun 2015 15:48:16 +0100 Stefano Stabellini wrote: >On Tue, 16 Jun 2015, Jan Beulich wrote: >> >>> On 16.06.15 at 15:35, Stefano Stabellini wrote: >> > On Fri, 5 Jun 2015, Jan Beulich wrote: I'm sorry for getting back to this only now. >> >> @@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI >> >> >> >> pirq = entry->pirq; >> >> >> >> +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || >> >> +(entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { >> > >> > I admit I am having difficulties understanding the full purpose of these >> > checks. Please add a comment on them. >> >> The comment would (pointlessly imo) re-state what the code already >> says: >> >> > I guess the intention is only to make changes using the latest values, >> > the ones in entry->latch, when the right conditions are met, otherwise >> > keep using the old values. Is that right? >> > >> > In that case, don't we want to use the latest values on MASKBIT -> >> > !MASKBIT transitions? In general when unmasking? >> >> This is what we want. And with that, the questions you ask further >> down should be answered too: The function gets invoked with the >> pre-change mask flag state in ->latch[], and updates the values >> used for actually setting up when that one has the entry masked >> (or mask-all is set). The actual new value gets written to ->latch[] >> after the call. > >I think this logic is counter-intuitive and prone to confuse the reader. >This change doesn't make sense on its own: when one will read >xen_pt_msix_update_one, won't be able to understand the function without >checking the call sites. > >Could we turn it around to be more obvious? Here check if >!(entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT) and below only >call xen_pt_msix_update_one on MASKBIT -> !MASKBIT transactions? > >Or something like that? 
That would maybe be doable with just pci_msix_write() as a caller, but not with the one in xen_pt_msix_update() (where we have no transition of the mask bit from set to clear). Jan
[Qemu-devel] [PATCH v2 0/4] xen/pass-through: XSA-120, 128...131 follow-up
1: MSI-X: latch MSI-X table writes 2: MSI-X: really enforce alignment 3: pass-through: correctly deal with RW1C bits 4: [RFC] MSI: re-expose masking capability Signed-off-by: Jan Beulich <jbeul...@suse.com>
[Qemu-devel] [PATCH v2 2/4] xen/MSI-X: really enforce alignment
The way the generic infrastructure works the intention of not allowing unaligned accesses can't be achieved by simply setting .unaligned to false. The benefit is that we can now replace the conditionals in {get,set}_entry_value() by assert()-s. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -415,16 +415,14 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -return !(offset % sizeof(*e->latch)) - ? e->latch[offset / sizeof(*e->latch)] : 0; +assert(!(offset % sizeof(*e->latch))); +return e->latch[offset / sizeof(*e->latch)]; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -if (!(offset % sizeof(*e->latch))) -{ -e->latch[offset / sizeof(*e->latch)] = val; -} +assert(!(offset % sizeof(*e->latch))); +e->latch[offset / sizeof(*e->latch)] = val; } static void pci_msix_write(void *opaque, hwaddr addr, @@ -488,6 +486,12 @@ static uint64_t pci_msix_read(void *opaq } } +static bool pci_msix_accepts(void *opaque, hwaddr addr, + unsigned size, bool is_write) +{ +return !(addr & (size - 1)); +} + static const MemoryRegionOps pci_msix_ops = { .read = pci_msix_read, .write = pci_msix_write, @@ -496,7 +500,13 @@ static const MemoryRegionOps pci_msix_op .min_access_size = 4, .max_access_size = 4, .unaligned = false, +.accepts = pci_msix_accepts }, +.impl = { +.min_access_size = 4, +.max_access_size = 4, +.unaligned = false +} }; int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
[Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as there guest behavior may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v2: Pass original vec_ctrl to xen_pt_msix_update_one() instead of (ab)using latch[]. --- a/hw/xen/xen_pt_config_init.c +++ b/hw/xen/xen_pt_config_init.c @@ -1499,6 +1499,8 @@ static int xen_pt_msixctrl_reg_write(Xen xen_pt_msix_disable(s); } +s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL; + debug_msix_enabled_old = s->msix->enabled; s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE); if (s->msix->enabled != debug_msix_enabled_old) { --- a/hw/xen/xen_pt.h +++ b/hw/xen/xen_pt.h @@ -187,13 +187,13 @@ typedef struct XenPTMSIXEntry { int pirq; uint64_t addr; uint32_t data; -uint32_t vector_ctrl; +uint32_t latch[4]; bool updated; /* indicate whether MSI ADDR or DATA is updated */ -bool warned; /* avoid issuing (bogus) warning more than once */ } XenPTMSIXEntry; typedef struct XenPTMSIX { uint32_t ctrl_offset; bool enabled; +bool maskall; int total_entries; int bar_index; uint64_t table_base; --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -25,6 +25,7 @@ #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15 +#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)] /* * Helpers @@ -314,7 +315,8 @@ static int msix_set_enable(XenPCIPassthr enabled); } -static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr) +static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr, + uint32_t vec_ctrl) { 
XenPTMSIXEntry *entry = NULL; int pirq; @@ -332,6 +334,13 @@ static int xen_pt_msix_update_one(XenPCI pirq = entry->pirq; +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +entry->addr = entry->latch(LOWER_ADDR) | + ((uint64_t)entry->latch(UPPER_ADDR) << 32); +entry->data = entry->latch(DATA); +} + rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr, entry->pirq == XEN_PT_UNASSIGNED_PIRQ); if (rc) { @@ -357,7 +366,7 @@ int xen_pt_msix_update(XenPCIPassthrough int i; for (i = 0; i < msix->total_entries; i++) { -xen_pt_msix_update_one(s, i); +xen_pt_msix_update_one(s, i, msix->msix_entry[i].latch(VECTOR_CTRL)); } return 0; @@ -406,35 +415,15 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -return e->addr & UINT32_MAX; -case PCI_MSIX_ENTRY_UPPER_ADDR: -return e->addr >> 32; -case PCI_MSIX_ENTRY_DATA: -return e->data; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -return e->vector_ctrl; -default: -return 0; -} +return !(offset % sizeof(*e->latch)) + ? 
e->latch[offset / sizeof(*e->latch)] : 0; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val; -break; -case PCI_MSIX_ENTRY_UPPER_ADDR: -e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX); -break; -case PCI_MSIX_ENTRY_DATA: -e->data = val; -break; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -e->vector_ctrl = val; -break; +if (!(offset % sizeof(*e->latch))) +{ +e->latch[offset / sizeof(*e->latch)] = val; } } @@ -454,39 +443,26 @@ static void pci_msix_write(void *opaque, offset = addr % PCI_MSIX_ENTRY_SIZE; if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) { -const volatile uint32_t *vec_ctrl; - if (get_entry_value(entry, offset) == val && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) { return; } +entry->updated = true; +} else if (msix->enabled && entry->updated && + !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +const volatile uint32_t *vec_ctrl; + /* * I
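The latch[] indexing used by the patch above replaces a four-way switch
with plain offset arithmetic: every dword-aligned offset within the
16-byte MSI-X table entry maps straight to one latch slot. A minimal
standalone sketch of that behavior (simplified types and offset names,
not the actual QEMU structures):

```c
#include <assert.h>
#include <stdint.h>

/* Dword offsets within a 16-byte MSI-X table entry (names illustrative,
 * mirroring PCI_MSIX_ENTRY_* in the patch). */
enum {
    ENTRY_LOWER_ADDR  = 0,
    ENTRY_UPPER_ADDR  = 4,
    ENTRY_DATA        = 8,
    ENTRY_VECTOR_CTRL = 12,
};

typedef struct {
    uint32_t latch[4];   /* one slot per dword of the entry */
} entry_t;

/* Any dword-aligned offset indexes latch[] directly; misaligned
 * offsets read as 0 and are ignored on write, as in the patch. */
static uint32_t get_entry_value(entry_t *e, int offset)
{
    return !(offset % sizeof(*e->latch))
           ? e->latch[offset / sizeof(*e->latch)] : 0;
}

static void set_entry_value(entry_t *e, int offset, uint32_t val)
{
    if (!(offset % sizeof(*e->latch))) {
        e->latch[offset / sizeof(*e->latch)] = val;
    }
}
```

The `latch(fld)` macro in the patch performs the same
`offset / sizeof(uint32_t)` computation at compile time for the named
offsets.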
[Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
Now that the hypervisor intercepts all config space writes and monitors
changes to the masking flags, this undoes the main effect of the XSA-129
fix, exposing the masking capability again to guests.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
TBD: We probably need to deal with running on an older hypervisor. I
can't, however, immediately see a way for qemu to find out.

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -1333,7 +1333,7 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .init_val  = 0x,
         .res_mask  = 0xFE00,
         .ro_mask   = 0x018E,
-        .emu_mask  = 0x017E,
+        .emu_mask  = 0x007E,
         .init      = xen_pt_msgctrl_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_msgctrl_reg_write,
@@ -1387,8 +1387,8 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .offset    = PCI_MSI_MASK_32,
         .size      = 4,
         .init_val  = 0x,
-        .ro_mask   = 0x,
-        .emu_mask  = 0x,
+        .ro_mask   = 0x,
+        .emu_mask  = 0x,
         .init      = xen_pt_mask_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_long_reg_write,
@@ -1398,8 +1398,8 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .offset    = PCI_MSI_MASK_64,
         .size      = 4,
         .init_val  = 0x,
-        .ro_mask   = 0x,
-        .emu_mask  = 0x,
+        .ro_mask   = 0x,
+        .emu_mask  = 0x,
         .init      = xen_pt_mask_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_long_reg_write,
[Qemu-devel] [PATCH v2 3/4] xen/pass-through: correctly deal with RW1C bits
Introduce yet another mask for them, so that the generic routine can
handle them, at once rendering xen_pt_pmcsr_reg_write() superfluous.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/xen/xen_pt.h
+++ b/hw/xen/xen_pt.h
@@ -113,6 +113,8 @@ struct XenPTRegInfo {
     uint32_t res_mask;
     /* reg read only field mask (ON:RO/ROS, OFF:other) */
     uint32_t ro_mask;
+    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
+    uint32_t rw1c_mask;
     /* reg emulate field mask (ON:emu, OFF:passthrough) */
     uint32_t emu_mask;
     xen_pt_conf_reg_init init;
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -179,7 +179,8 @@ static int xen_pt_byte_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -197,7 +198,8 @@ static int xen_pt_word_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -215,7 +217,8 @@ static int xen_pt_long_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -633,6 +636,7 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .init_val  = 0x,
         .res_mask  = 0x0007,
         .ro_mask   = 0x06F8,
+        .rw1c_mask = 0xF900,
         .emu_mask  = 0x0010,
         .init      = xen_pt_status_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
@@ -944,6 +948,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .size      = 2,
         .res_mask  = 0xFFC0,
         .ro_mask   = 0x0030,
+        .rw1c_mask = 0x000F,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_word_reg_write,
@@ -964,6 +969,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .offset    = PCI_EXP_LNKSTA,
         .size      = 2,
         .ro_mask   = 0x3FFF,
+        .rw1c_mask = 0xC000,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_word_reg_write,
@@ -1000,27 +1006,6 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
  * Power Management Capability
  */
 
-/* write Power Management Control/Status register */
-static int xen_pt_pmcsr_reg_write(XenPCIPassthroughState *s,
-                                  XenPTReg *cfg_entry, uint16_t *val,
-                                  uint16_t dev_value, uint16_t valid_mask)
-{
-    XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
-    uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
-    uint16_t *data = cfg_entry->ptr.half_word;
-
-    /* modify emulate register */
-    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
-
-    /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~PCI_PM_CTRL_PME_STATUS,
-                              throughable_mask);
-
-    return 0;
-}
-
 /* Power Management Capability reg static information table */
 static XenPTRegInfo xen_pt_emu_reg_pm[] = {
     /* Next Pointer reg */
@@ -1051,11 +1036,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size      = 2,
         .init_val  = 0x0008,
         .res_mask  = 0x00F0,
-        .ro_mask   = 0xE10C,
+        .ro_mask   = 0x610C,
+        .rw1c_mask = 0x8000,
         .emu_mask  = 0x810B,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
-        .u.w.write = xen_pt_pmcsr_reg_write,
+        .u.w.write = xen_pt_word_reg_write,
     },
     {
         .size = 0,
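The reason the generic write routines mask the RW1C bits out of the
device value can be shown with a small model: when a guest write is
merged with the value read back from hardware, any latched
write-1-to-clear status bit that is carried through as 1 would be
cleared by the subsequent write. A hedged sketch (the helper name and
the simplified merge macro are illustrative, not QEMU's code; the mask
values mimic the PCI status register entry above):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified version of XEN_PT_MERGE_VALUE: take bits from 'val' where
 * 'mask' is set, from 'data' elsewhere. */
#define MERGE_VALUE(val, data, mask) (((val) & (mask)) | ((data) & ~(mask)))

/* Build the value written back to the device: bits the guest didn't
 * target come from dev_value, but RW1C bits must be forced to 0 so a
 * latched status bit isn't accidentally cleared by writing it as 1. */
static uint16_t make_dev_write(uint16_t guest_val, uint16_t dev_value,
                               uint16_t throughable_mask, uint16_t rw1c_mask)
{
    return MERGE_VALUE(guest_val, dev_value & ~rw1c_mask, throughable_mask);
}
```

With rw1c_mask = 0xF900 (the status register's RW1C bits) a latched
0x8000 in dev_value is written back as 0 and therefore preserved;
without the mask it would be written back as 1 and cleared by the
device.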
Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle 64-bit bar with more than 4G size
On 12.08.15 at 09:10, feng...@intel.com wrote:
>> -----Original Message-----
>> From: qemu-devel-bounces+feng.wu=intel@nongnu.org
>> [mailto:qemu-devel-bounces+feng.wu=intel@nongnu.org] On Behalf Of
>> Jan Beulich
>> Sent: Wednesday, August 12, 2015 2:59 PM
>> To: Wu, Feng
>> Cc: xen-de...@lists.xensource.com; qemu-devel@nongnu.org;
>> stefano.stabell...@eu.citrix.com
>> Subject: Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle
>> 64-bit bar with more than 4G size
>>
>> On 05.08.15 at 04:02, feng...@intel.com wrote:
>>> @@ -491,8 +474,9 @@ static int xen_pt_bar_reg_write(XenPCIPassthroughState
>>> *s, XenPTReg *cfg_entry,
>>>          bar_ro_mask = XEN_PT_BAR_IO_RO_MASK | (r_size - 1);
>>>          break;
>>>      case XEN_PT_BAR_FLAG_UPPER:
>>> +        r = &d->io_regions[index - 1];
>>
>> Perhaps worth an assert(index > 0)?
>
> No problem, I will add it. BTW, do you have any other comments about
> this patch? If no, I am going to send out the new version with these
> changes.

No - everything else looks to make sense (but continues to need testing).

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle 64-bit bar with more than 4G size
On 05.08.15 at 04:02, feng...@intel.com wrote:
> @@ -491,8 +474,9 @@ static int xen_pt_bar_reg_write(XenPCIPassthroughState
> *s, XenPTReg *cfg_entry,
>          bar_ro_mask = XEN_PT_BAR_IO_RO_MASK | (r_size - 1);
>          break;
>      case XEN_PT_BAR_FLAG_UPPER:
> +        r = &d->io_regions[index - 1];

Perhaps worth an assert(index > 0)?

Jan
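For context on the upper-half handling under review: with a
power-of-two BAR size, which bits of each 32-bit half are read-only
falls out of (size - 1), and for BARs larger than 4G the size boundary
reaches into the upper dword. A minimal sketch under that assumption
(the helper name is illustrative, not QEMU's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* For a 64-bit memory BAR of power-of-two size, the address bits below
 * the size boundary are read-only.  For sizes > 4G that boundary
 * extends into the upper dword, so its RO mask is the high half of
 * (size - 1); for smaller BARs the upper dword is fully writable. */
static uint32_t upper_bar_ro_mask(uint64_t size)
{
    return (uint32_t)((size - 1) >> 32);
}
```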
[Qemu-devel] [PATCH v3] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed,
allowing it to atomically canonicalize both pointers when both have
gone through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) went in already (commit b7007bc6f9).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v3: Check for Xen 4.6 (configure and build time).
v2: Adjust description.

--- a/configure
+++ b/configure
@@ -1882,6 +1882,33 @@ int main(void) {
   xc_gnttab_open(NULL, 0);
   xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
   xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
+  xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL);
+  return 0;
+}
+EOF
+      compile_prog $xen_libs
+    then
+    xen_ctrl_version=460
+    xen=yes
+
+  # Xen 4.5
+  elif
+      cat > $TMPC <<EOF
+#include <xenctrl.h>
+#include <xenstore.h>
+#include <stdint.h>
+#include <xen/hvm/hvm_info_table.h>
+#if !defined(HVM_MAX_VCPUS)
+# error HVM_MAX_VCPUS not defined
+#endif
+int main(void) {
+  xc_interface *xc;
+  xs_daemon_open();
+  xc = xc_interface_open(0, 0, 0);
+  xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0);
+  xc_gnttab_open(NULL, 0);
+  xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
+  xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
   xc_hvm_create_ioreq_server(xc, 0, 0, NULL);
   return 0;
 }
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -963,19 +963,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -987,15 +998,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- a/include/hw/xen/xen_common.h
+++ b/include/hw/xen/xen_common.h
@@ -186,6 +186,15 @@ static inline int xen_get_vmport_regs_pf
 }
 #endif
 
+/* Xen before 4.6 */
+#if CONFIG_XEN_CTRL_INTERFACE_VERSION < 460
+
+#ifndef HVM_IOREQSRV_BUFIOREQ_ATOMIC
+#define HVM_IOREQSRV_BUFIOREQ_ATOMIC 2
+#endif
+
+#endif
+
 /* Xen before 4.5 */
 #if CONFIG_XEN_CTRL_INTERFACE_VERSION < 450
 
@@ -370,7 +379,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
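Why a 511-slot ring breaks on 32-bit pointer wrap can be checked with
plain arithmetic: since 511 does not divide 2^32 (2^32 mod 511 = 32),
the slot index computed from the free-running pointer jumps when the
pointer wraps, so requests straddling the wrap no longer land in
consecutive slots - hence the need for the hypervisor to canonicalize
both pointers before that can happen. A small sketch:

```c
#include <assert.h>
#include <stdint.h>

#define IOREQ_BUFFER_SLOT_NUM 511   /* slots per page; not a power of two */

/* Slot lookup as done in handle_buffered_iopage(): the free-running
 * 32-bit pointer is reduced modulo the (non-power-of-two) slot count. */
static uint32_t slot(uint32_t pointer)
{
    return pointer % IOREQ_BUFFER_SLOT_NUM;
}
```

At the wrap, pointer 0xFFFFFFFF maps to slot 31, but its successor
(pointer 0) maps to slot 0 instead of the expected slot 32.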
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 23.07.15 at 12:04, stefano.stabell...@eu.citrix.com wrote:
> On Thu, 23 Jul 2015, Jan Beulich wrote:
>> On 22.07.15 at 19:24, stefano.stabell...@eu.citrix.com wrote:
>>> I'll queue this change up for the next QEMU release cycle.
>>
>> Thanks - v2 (with the adjusted description) just sent. It would
>> however be nice for our variant in 4.6 to also gain this, perhaps
>> independent of upstream's schedule.
>
> Is the hypervisor side going to go in 4.6? Do we need a freeze
> exception or do we consider this a bug fix?

As said in the updated description in v2, the hypervisor side went in
already some time ago.

Jan
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 22.07.15 at 19:24, stefano.stabell...@eu.citrix.com wrote:
> I'll queue this change up for the next QEMU release cycle.

Thanks - v2 (with the adjusted description) just sent. It would however
be nice for our variant in 4.6 to also gain this, perhaps independent
of upstream's schedule.

Jan
[Qemu-devel] [PATCH v2] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed,
allowing it to atomically canonicalize both pointers when both have
gone through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) went in already (commit b7007bc6f9).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Adjust description.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -957,19 +957,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -981,15 +992,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- unstable.orig/qemu/upstream/include/hw/xen/xen_common.h 2015-03-31 15:58:04.0 +0200
+++ unstable/qemu/upstream/include/hw/xen/xen_common.h 2015-06-15 08:24:28.0 +0200
@@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 22.07.15 at 16:50, stefano.stabell...@eu.citrix.com wrote:
> On Wed, 22 Jul 2015, Jan Beulich wrote:
>>>> --- a/xen-hvm.c
>>>> +++ b/xen-hvm.c
>>>> @@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
>>>>
>>>>  static int handle_buffered_iopage(XenIOState *state)
>>>>  {
>>>> +    buffered_iopage_t *buf_page = state->buffered_io_page;
>>>>      buf_ioreq_t *buf_req = NULL;
>>>>      ioreq_t req;
>>>>      int qw;
>>>>
>>>> -    if (!state->buffered_io_page) {
>>>> +    if (!buf_page) {
>>>>          return 0;
>>>>      }
>>>>
>>>>      memset(&req, 0x00, sizeof(req));
>>>>
>>>> -    while (state->buffered_io_page->read_pointer !=
>>>> -           state->buffered_io_page->write_pointer) {
>>>> -        buf_req = &state->buffered_io_page->buf_ioreq[
>>>> -            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
>>>> +    for (;;) {
>>>> +        uint32_t rdptr = buf_page->read_pointer, wrptr;
>>>> +
>>>> +        xen_rmb();
>>>
>>> We don't need this barrier.
>>
>> How would we not? We need to make sure we read in this order
>> read_pointer, write_pointer, and read_pointer again (in the
>> comparison). Only that way we can be certain to hold a matching pair
>> in hands at the end.
>
> See below
>
>>>> +        wrptr = buf_page->write_pointer;
>>>> +        xen_rmb();
>>>> +        if (rdptr != buf_page->read_pointer) {
>>>
>>> I think you have to use atomic_read to be sure that the second read
>>> to buf_page->read_pointer is up to date and not optimized away.
>>
>> No, suppressing such an optimization is an intended (side) effect of
>> the barriers used.
>
> I understand what you are saying but I am not sure if your assumption
> is correct. Can the compiler optimize the second read anyway?

No, it can't, due to the barrier.

>>> But I think that it would be best to simply use atomic_read to read
>>> both pointers at once using uint64_t as type, so you are sure to get
>>> a consistent view and there is no need for this check.
>>
>> But I'm specifically trying to avoid e.g. a locked cmpxchg8b here on
>> ix86.
>
> OK, but we don't need cmpxchg8b, just:
>
> #define atomic_read(ptr) (*(__typeof__(*ptr) volatile*) (ptr))

This only gives the impression of being atomic when the type is wider
than a machine word. There's no ix86 (i.e. 32-bit) instruction other
than LOCK CMPXCHG8B (and possibly MMX/SSE/AVX ones) allowing to
atomically read a 64-bit quantity.

> something like:
>
> for (;;) {
>     uint64_t ptrs;
>     uint32_t rdptr, wrptr;
>
>     ptrs = atomic_read((uint64_t*)&state->buffered_io_page->read_pointer);
>     rdptr = (uint32_t)ptrs;
>     wrptr = *(((uint32_t*)&ptrs) + 1);
>
>     if (rdptr == wrptr) {
>         continue;
>     }
>
>     [work]
>
>     atomic_add(&buf_page->read_pointer, qw + 1);
> }
>
> it would work, wouldn't it?

Looks like so, but the amount of casts alone makes me wish for no-one
to consider this (but I agree that the casts could be taken care of).
Still I think (as btw done elsewhere) the lock free access is
preferable.

Jan
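The lock-free alternative Jan defends can be isolated into a small,
self-contained sketch: read the consumer index, then the producer
index, then re-read the consumer index, and retry if it moved in
between, so a matching pair is obtained without any wide atomic read.
This is a simplified model (the barrier macro stands in for xen_rmb(),
and the single-threaded test below only exercises the happy path):

```c
#include <stdint.h>

/* Compiler barrier standing in for xen_rmb() in this sketch; on x86 a
 * compiler barrier is what the read ordering here needs. */
#define rmb() __asm__ __volatile__("" ::: "memory")

struct ring {
    volatile uint32_t read_pointer;
    volatile uint32_t write_pointer;
};

/* Obtain a consistent (read, write) pointer pair without a 64-bit
 * atomic: re-read read_pointer after write_pointer and retry if it
 * changed, so *rd and *wr are guaranteed to form a matching snapshot. */
static void read_ptr_pair(struct ring *r, uint32_t *rd, uint32_t *wr)
{
    for (;;) {
        uint32_t rdptr = r->read_pointer, wrptr;

        rmb();
        wrptr = r->write_pointer;
        rmb();
        if (rdptr != r->read_pointer) {
            continue;   /* consumer index moved under us; retry */
        }
        *rd = rdptr;
        *wr = wrptr;
        return;
    }
}
```

The retry only matters when another thread advances read_pointer
between the two reads; the barriers keep the compiler from reordering
or merging the three loads.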
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 21.07.15 at 15:54, stefano.stabell...@eu.citrix.com wrote:
> On Thu, 18 Jun 2015, Jan Beulich wrote:
>> The number of slots per page being 511 (i.e. not a power of two)
>> means that the (32-bit) read and write indexes going beyond 2^32
>> will likely disturb operation.
>>
>> The hypervisor side gets I/O req server creation extended so we can
>> indicate that we're using suitable atomic accesses where needed (not
>> all accesses to the two pointers really need to be atomic), allowing
>> it to atomically canonicalize both pointers when both have gone
>> through at least one cycle.
>
> The description is a bit terse: which accesses don't really need to
> be atomic?

Perhaps I should drop this part - I more or less copied the hypervisor
side's commit message, and the above really applies to e.g.

    if ( (pg->ptrs.write_pointer - pg->ptrs.read_pointer) >=
         (IOREQ_BUFFER_SLOT_NUM - qw) )

in hypervisor code.

>> --- a/xen-hvm.c
>> +++ b/xen-hvm.c
>> @@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
>>
>>  static int handle_buffered_iopage(XenIOState *state)
>>  {
>> +    buffered_iopage_t *buf_page = state->buffered_io_page;
>>      buf_ioreq_t *buf_req = NULL;
>>      ioreq_t req;
>>      int qw;
>>
>> -    if (!state->buffered_io_page) {
>> +    if (!buf_page) {
>>          return 0;
>>      }
>>
>>      memset(&req, 0x00, sizeof(req));
>>
>> -    while (state->buffered_io_page->read_pointer !=
>> -           state->buffered_io_page->write_pointer) {
>> -        buf_req = &state->buffered_io_page->buf_ioreq[
>> -            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
>> +    for (;;) {
>> +        uint32_t rdptr = buf_page->read_pointer, wrptr;
>> +
>> +        xen_rmb();
>
> We don't need this barrier.

How would we not? We need to make sure we read in this order
read_pointer, write_pointer, and read_pointer again (in the
comparison). Only that way we can be certain to hold a matching pair
in hands at the end.

>> +        wrptr = buf_page->write_pointer;
>> +        xen_rmb();
>> +        if (rdptr != buf_page->read_pointer) {
>
> I think you have to use atomic_read to be sure that the second read
> to buf_page->read_pointer is up to date and not optimized away.

No, suppressing such an optimization is an intended (side) effect of
the barriers used.

> But I think that it would be best to simply use atomic_read to read
> both pointers at once using uint64_t as type, so you are sure to get
> a consistent view and there is no need for this check.

But I'm specifically trying to avoid e.g. a locked cmpxchg8b here on
ix86.

>>          handle_ioreq(state, &req);
>>
>> -        xen_mb();
>> -        state->buffered_io_page->read_pointer += qw ? 2 : 1;
>> +        atomic_add(&buf_page->read_pointer, qw + 1);
>
> I couldn't get specific info on the type of barrier implemented by
> __sync_fetch_and_add, so I cannot tell for sure whether removing
> xen_mb() is appropriate. Do you have a link? I suspect that given the
> strong guarantees of the x86 architecture we'll be fine. I would be
> less confident if this code was used on other archs.

gcc.pdf, in the section covering them, says "In most cases, these
built-in functions are considered a full barrier. [...] Further,
instructions are issued as necessary to prevent the processor from
speculating loads across the operation and from queuing stores after
the operation." Details on individual builtins subsequently tell the
exceptions from this general rule, but the one used here is not among
the exceptions.

>> --- a/include/hw/xen/xen_common.h
>> +++ b/include/hw/xen/xen_common.h
>> @@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
>>  static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
>>                                            ioservid_t *ioservid)
>>  {
>> -    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
>> +    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
>> +                                        ioservid);
>
> I am concerned that passing 2 instead of 1 could break older
> hypervisors. However handle_bufioreq was never defined as a true
> boolean, so maybe it is OK?

Indeed I'm building on it only having done == 0 or != 0 checks.

> The alternative would be to create a xen_xc_hvm_create_ioreq_server
> versioned wrapper in include/hw/xen/xen_common.h.

Which is what I aimed at avoiding.

Jan
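The full-barrier property of `__sync_fetch_and_add` being discussed is
what lets the patch drop the explicit xen_mb() before advancing the
consumer index. A minimal sketch of that replacement (the helper name
is illustrative; the patch's `atomic_add` wraps the same builtin):

```c
#include <stdint.h>

/* Advance the consumer index with a single full-barrier RMW; a
 * quad-word (8-byte) request occupies two slots.  Per the GCC manual,
 * __sync_fetch_and_add is a full barrier, so no separate fence is
 * needed before the store becomes visible. */
static uint32_t consume(volatile uint32_t *read_pointer, int qw)
{
    return __sync_fetch_and_add(read_pointer, qw ? 2 : 1);
}
```

The return value is the pre-increment index, matching fetch-and-add
semantics.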
[Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed (not
all accesses to the two pointers really need to be atomic), allowing
it to atomically canonicalize both pointers when both have gone
through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) can be found at e.g.
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02996.html

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -1005,15 +1016,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- a/include/hw/xen/xen_common.h
+++ b/include/hw/xen/xen_common.h
@@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
Re: [Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
On 16.06.15 at 16:03, stefano.stabell...@eu.citrix.com wrote:
> On Fri, 5 Jun 2015, Jan Beulich wrote:
>> +    } else if (s->msix->enabled) {
>> +        if (!(value & PCI_MSIX_FLAGS_ENABLE)) {
>> +            xen_pt_msix_disable(s);
>> +            s->msix->enabled = false;
>> +        } else if (!s->msix->maskall) {
>
> Why are you changing the state of s->msix->maskall here? This is the
> value & PCI_MSIX_FLAGS_ENABLE case, nothing to do with maskall, right?

We're at an "else if" inside an "else if" here. The only case left
after the if() still seen above is that value has
PCI_MSIX_FLAGS_MASKALL set.

>> +            s->msix->maskall = true;
>> +            xen_pt_msix_maskall(s, true);
>> +        }
>>      }
>>
>> -    debug_msix_enabled_old = s->msix->enabled;
>> -    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
>>      if (s->msix->enabled != debug_msix_enabled_old) {
>>          XEN_PT_LOG(s->dev, "%s MSI-X\n",
>>                     s->msix->enabled ? "enable" : "disable");
>>      }
>>
>> +    xen_host_pci_get_word(s->real_device, s->msix->ctrl_offset, &dev_value);
>
> I have to say that I don't like the asymmetry between reading and
> writing PCI config registers. If writes go via hypercalls, reads
> should go via hypercalls too.

We're not doing any cfg register write via hypercalls (not here, and
not elsewhere). What is being replaced by the patch are writes to two
bits which happen to live in PCI config space. Plus, reading directly,
and doing writes via hypercall only when really needed, would still be
the right thing from a performance pov.

>> --- a/qemu/upstream/hw/xen/xen_pt_msi.c
>> +++ b/qemu/upstream/hw/xen/xen_pt_msi.c
>> @@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
>>          return -1;
>>      }
>>
>> -    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
>> -                           enabled);
>
> Would it make sense to remove msi_msix_enable completely to avoid any
> further mistakes?

Perhaps, yes. I think I actually had suggested so quite a while back.
But I don't see myself wasting much more time on this, ehm, code.

Jan
Re: [Qemu-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
On 16.06.15 at 15:35, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

@@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI
 
     pirq = entry->pirq;
 
+    if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall ||
+        (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {

I admit I am having difficulties understanding the full purpose of these checks. Please add a comment on them.

The comment would (pointlessly imo) re-state what the code already says:

I guess the intention is only to make changes using the latest values, the ones in entry->latch, when the right conditions are met, otherwise keep using the old values. Is that right? In that case, don't we want to use the latest values on MASKBIT -> !MASKBIT transitions? In general when unmasking?

This is what we want. And with that, the questions you ask further down should be answered too: The function gets invoked with the pre-change mask flag state in ->latch[], and updates the values used for actually setting up when that one has the entry masked (or mask-all is set). The actual new value gets written to ->latch[] after the call.

@@ -444,39 +432,28 @@ static void pci_msix_write(void *opaque,
 
     offset = addr % PCI_MSIX_ENTRY_SIZE;
 
     if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        const volatile uint32_t *vec_ctrl;
-
         if (get_entry_value(entry, offset) == val
             && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) {
             return;
         }
 
+        entry->updated = true;
+    } else if (msix->enabled && entry->updated &&
+               !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        const volatile uint32_t *vec_ctrl;
+
         /*
          * If Xen intercepts the mask bit access, entry->vec_ctrl may not be
          * up-to-date. Read from hardware directly.
          */
         vec_ctrl = s->msix->phys_iomem_base + entry_nr * PCI_MSIX_ENTRY_SIZE
             + PCI_MSIX_ENTRY_VECTOR_CTRL;
+        set_entry_value(entry, offset, *vec_ctrl);

Why are you calling set_entry_value with the hardware vec_ctrl value? It doesn't look correct to me.

In any case, if you wanted to do it, shouldn't you just set/unset PCI_MSIX_ENTRY_CTRL_MASKBIT instead of the whole *vec_ctrl?

The comment above the code explains it: What we have stored locally may not reflect reality, as we may not have seen all writes (and this indeed isn't just a "may"). And if our cached value isn't valid anymore, why would we not want to update all of it, rather than just the mask bit?

-        if (msix->enabled && !(*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            if (!entry->warned) {
-                entry->warned = true;
-                XEN_PT_ERR(&s->dev, "Can't update msix entry %d since MSI-X is"
-                           " already enabled.\n", entry_nr);
-            }
-            return;
-        }
-
-        entry->updated = true;
+        xen_pt_msix_update_one(s, entry_nr);

Shouldn't we call xen_pt_msix_update_one only if (*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)? In other words, only when we see a MASKBIT -> !MASKBIT transition?

The combination of the !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT) check in the if() surrounding this call and the (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT) check inside the function guarantee just that (i.e. the function invocation is benign in the other case, as entry->addr/entry->data would remain unchanged).

Jan
Re: [Qemu-devel] [PATCH 4/6] xen/pass-through: correctly deal with RW1C bits
On 16.06.15 at 16:19, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

@@ -1016,11 +1002,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size       = 2,
         .init_val   = 0x0008,
         .res_mask   = 0x00F0,
-        .ro_mask    = 0xE10C,
+        .ro_mask    = 0x610C,
+        .rw1c_mask  = 0x8000,
         .emu_mask   = 0x810B,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
-        .u.w.write  = xen_pt_pmcsr_reg_write,
+        .u.w.write  = xen_pt_word_reg_write,
     },
     {
         .size = 0,

I can see that the code change doesn't cause a change in behaviour for PCI_PM_CTRL, but it does for PCI_STATUS, PCI_EXP_DEVSTA and PCI_EXP_LNKSTA. Please explain why in the commit message.

I'm not sure what you're after in a patch titled "correctly deal with RW1C bits".

Jan
Re: [Qemu-devel] [PATCH 6/6] xen/pass-through: constify some static data
On 16.06.15 at 16:27, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote: This is done indirectly by adjusting two typedefs and helps emphasizing that the respective tables aren't supposed to be modified at runtime (as they may be shared between devices). Signed-off-by: Jan Beulich jbeul...@suse.com Reviewed-by: Stefano Stabellini stefano.stabell...@eu.citrix.com Perhaps patch 5 and 6 could go in right away, without waiting for the issues on the other ones addressed? After all, it may well be that patch 1 will be dropped following the discussion on the hypervisor side patch series... Jan
Re: [Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
On 16.06.15 at 16:56, stefano.stabell...@eu.citrix.com wrote: On Tue, 16 Jun 2015, Jan Beulich wrote: On 16.06.15 at 16:03, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
         return -1;
     }
 
-    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
-                           enabled);

Would it make sense to remove msi_msix_enable completely to avoid any further mistakes?

Perhaps, yes. I think I actually had suggested so quite a while back. But I don't see myself wasting much more time on this, ehm, code.

Isn't it just a matter of removing msi_msix_enable?

It has another caller, xen_pt_msi_set_enable(). If we went down the route of what this patch does, then MSI's enable bit should ultimately also be driven through a hypercall, and that would then be the point where the function would naturally disappear. But as said, it looks like we're intending to go a different route anyway.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 10.06.15 at 13:43, m...@redhat.com wrote: On Wed, Jun 10, 2015 at 08:00:55AM +0100, Jan Beulich wrote: On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote: while function 0 has 0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable) and function 1 0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable) Does the sibling device have a BAR overlapping the address? No, its BARs are fully separate. Judging from the above, it's actually function 1's BAR 2 that is accessed? Are you saying disabling memory on function 0 breaks function 2 somehow? Oops, just noticed I didn't reply to this. Not sure how you come to that conclusion - the ITP log says that the bad write is to 0xca25004c. Look at the bridge configuration though - looks like it will only forward transactions to 0xca21. Anything else will be terminated by the bridge itself. Right, that's what I had pointed out before, but then again things work prior to the guest shutting down (and in the absence of any guest), even if I can't explain why or how. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 10.06.15 at 13:46, m...@redhat.com wrote: I don't really know. The idea would be that device is not designed for memory to be disabled when it's active, and starts behaving in broken ways. But you recall that we know the origin of the offending write is in the hypervisor, and we also know that the device isn't active at the point the problem occurs? Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote: while function 0 has 0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable) and function 1 0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable) Does the sibling device have a BAR overlapping the address? No, its BARs are fully separate. Judging from the above, it's actually function 1's BAR 2 that is accessed? Are you saying disabling memory on function 0 breaks function 2 somehow? Oops, just noticed I didn't reply to this. Not sure how you come to that conclusion - the ITP log says that the bad write is to 0xca25004c. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:30, m...@redhat.com wrote: What happens if you disable SERR# in the command register of 83:00.1? We've just been told that with SERR not enabled in any of the sibling endpoints the NMI still occurs. Not really surprising with us now assuming that it's the root port that generates the SERR in response to the UR coming back from an endpoint. But otoh in conflict with what we see in the ITP log (where SERR clearly is enabled on the endpoints, and the information we got says that it _is_ disabled, not that they had to do anything to disable it). 2. Has a driver initialized this endpoint? What if you don't load a driver before sending the transaction resulting in the UR? They now tried at least without loading a driver in Dom0, which didn't make a difference. Did you mean to also not load any driver in the guest? Otoh I can't really see what difference this makes, as the cleanup after the guest inside the hypervisor doesn't really depend much on whether it actively used any of the MSI-X entries. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 07.06.15 at 08:23, m...@redhat.com wrote: On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote: On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote: On 20.04.15 at 15:43, m...@redhat.com wrote: On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote: On 13.04.15 at 14:47, m...@redhat.com wrote:

Can you check device capabilities register, offset 0x4 within pci express capability structure? Bit 15 is "Role-Based Error Reporting". Is it set? The spec says:

15 On platforms where robust error handling and PC-compatible Configuration Space probing is required, it is suggested that software or firmware have the Unsupported Request Reporting Enable bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting bit in the Device Capabilities register.

Yes, that bit is set.

Curiouser and curiouser. So with functions that do support Role-Based Error Reporting, we have this:

With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request Reporting Enable bit will enable the Function to report UR errors detected with posted Requests, helping avoid this case for potential silent data corruption.

I still don't see what the PC-compatible config space probing has to do with our issue.

I'm not sure, but I think it's listed here because it causes a ton of URs when device scan probes unimplemented functions. Did firmware reconfigure this device to report URs as fatal errors then?

No, the Unsupported Request Error Severity flag is zero.

OK, that's the correct configuration, so how come the box crashes when there's a UR then?

Ping - any update on this?

Not really.
All we concluded so far is that _maybe_ the bridge, upon seeing the UR, generates a Master Abort, rendering the whole thing fatal. Otoh the respective root port also has
- Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much),
- Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set).

So can we chalk this up to hardware bugs on a specific box?

I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug.

Jan
Re: [Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
On 05.06.15 at 18:41, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote: On 05.06.15 at 13:32, stefano.stabell...@eu.citrix.com wrote:

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&

The change looks good and in line with the commit message. But this if statement looks like acrobatic circus to me now.

I think the alternative of splitting it up into multiple if()-s would not be any better readable.

Would you be OK if I rewrote the statement as follows?

if ((index >= 0) && (val != 0)) {
    uint32_t vu;

    if (index == PCI_ROM_SLOT) {
        vu = val | (uint32_t)~PCI_ROM_ADDRESS_MASK;
    } else {
        vu = val;
    }

    if ((vu != XEN_PT_BAR_ALLF) &&
        (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
        XEN_PT_WARN(d, "Guest attempt to set address to unused "
                    "Base Address Register. (addr: 0x%02x, len: %d)\n",
                    addr, len);
    }
}

Actually I agree that this indeed looks better. If I were to make the change, I'd do

if ((index >= 0) && (val != 0)) {
    uint32_t vu = val;

    if (index == PCI_ROM_SLOT) {
        vu |= (uint32_t)~PCI_ROM_ADDRESS_MASK;
    }

    if ((vu != XEN_PT_BAR_ALLF) && ...

though. But if you're going to do the edit without wanting me to re-submit, I'll certainly leave this to you. Just let me know which way you prefer it to be handled.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 10:09, malcolm.cross...@citrix.com wrote: On 08/06/15 08:42, Jan Beulich wrote:

Not really. All we concluded so far is that _maybe_ the bridge, upon seeing the UR, generates a Master Abort, rendering the whole thing fatal. Otoh the respective root port also has
- Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much),
- Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set).

Disabling the Memory decode in the command register could also result in a completion timeout on the root port issuing a transaction towards the PCI device in question. PCIE completion timeouts can be escalated to Fatal AER errors which trigger system firmware to inject NMI's into the host.

And how does all that play with PC compatibility (where writes into no-where get dropped, and reads from no-where get all ones returned)? Remember - we're talking about CPU side accesses here.

Here is an example AER setup for a PCIE root port. You can see UnsupReq errors are masked and so do not trigger errors. CmpltTO (completion timeout) errors are not masked and the errors are treated as Fatal because the corresponding bit in the Uncorrectable Severity register is set.

Capabilities: [148 v1] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
    UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
    CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
    AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

A root port completion timeout will also result in the master abort bit being set. Typically system firmware clears the error in the AER registers after it's processed it. So the operating system may not be able to determine what error triggered the NMI in the first place.

Right, but in the case at hand we have an ITP log available, which increases the hope that we see a reasonably complete picture.

So can we chalk this up to hardware bugs on a specific box?

I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug.

I believe the correct behaviour is happening, but a PCIE completion timeout is occurring instead of an unsupported request.

Might it be that with the supposedly correct device returning UR the root port reissues the request to the sibling device, which then fails it in a more dramatic way (albeit the sibling's Uncorrectable Error Status Register also has only Unsupported Request Error Status set)?

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:36, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 10:03:18AM +0100, Jan Beulich wrote: On 08.06.15 at 10:09, malcolm.cross...@citrix.com wrote: I believe the correct behaviour is happening but a PCIE completion timeout is occurring instead of a unsupported request. Might it be that with the supposedly correct device returning UR the root port reissues the request to the sibling device, which then fails it in a more dramatic way (albeit the sibling's Uncorrectable Error Status Register also has only Unsupported Request Error Status set)? Isn't the sibling a function on the same device? Yes. And is the request causing the UR a memory read? No, it's a write according to the ITP log. If so doesn't this use address routing? What does it mean that the request is to the sibling device then? The way I expressed things above may simply be due to my limited knowledge of PCIe terminology: I simply don't know (and can't find this spelled out in the spec) what the root port's behavior ought to be when a transaction comes in with an address that's within one of its base/limit regions, but none of the endpoints can fulfill the request. 
But you asking this made me look more closely at the memory ranges dumped out to the ITP log: The root port has

0x20: Memory Base = 0xca40
0x22: Memory Limit = 0xca40
0x24: Prefetchable Mem Base = 0xca21
0x26: Prefetchable Mem Limit = 0xca21

while function 0 has

0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable)

and function 1

0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable)

Isn't it bogus for all but one of the BARs being outside the root port's prefetchable memory window (the MSI-X tables being inside the BAR 4 region in both cases)?

Does the sibling device have a BAR overlapping the address?

No, its BARs are fully separate.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:30, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 08:42:57AM +0100, Jan Beulich wrote: Otoh the respective root port also has - Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much), - Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set). It's another function of the same physical device, correct? Yes (obviously with their BDF only differing in the function). So is this sibling the only function sending SERR? Yes, albeit I can't tell whether the root port generated SERR on its own or in response to the endpoint doing so. What happens if you disable SERR# in the command register of 83:00.1? Any experiments with that system will be quite difficult, as they're only accessible by partners or ours. But I'll ask to try this if you think this can provide useful insight. Do we can chalk this up to hardware bugs on a specific box? I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug. Questions: 1. Does this only happen with a specific endpoint? What if you add another endpoint to the same system? We've got reports of this for two systems (two different vendors) using a Broadcomm NIC in one case and an Intel one in the other. 2. Has a driver initialized this endpoint? What if you don't load a driver before sending the transaction resulting in the UR? I'd have to ask for this to be tried too, and getting an answer may take some time. Jan
[Qemu-devel] [PATCH v2] xen/pass-through: ROM BAR handling adjustments
Expecting the ROM BAR to be written with an all ones value when sizing the region is wrong - the low bit has another meaning (enable/disable) and bits 1..10 are reserved. The PCI spec also mandates writing all ones to just the address portion of the register.

Use suitable constants also for initializing the ROM BAR register field description.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Convert complex if() into simpler ones.

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -249,10 +249,18 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
-        (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
-        XEN_PT_WARN(d, "Guest attempt to set address to unused Base Address "
-                    "Register. (addr: 0x%02x, len: %d)\n", addr, len);
+    if ((index >= 0) && (val != 0)) {
+        uint32_t chk = val;
+
+        if (index == PCI_ROM_SLOT)
+            chk |= (uint32_t)~PCI_ROM_ADDRESS_MASK;
+
+        if ((chk != XEN_PT_BAR_ALLF) &&
+            (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
+            XEN_PT_WARN(d, "Guest attempt to set address to unused "
+                        "Base Address Register. (addr: 0x%02x, len: %d)\n",
+                        addr, len);
+        }
     }
 
     /* find register group entry */
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -729,8 +729,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
         .init       = xen_pt_bar_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_exp_rom_bar_reg_write,
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:

But you asking this made me look more closely at the memory ranges dumped out to the ITP log: The root port has

0x20: Memory Base = 0xca40
0x22: Memory Limit = 0xca40
0x24: Prefetchable Mem Base = 0xca21
0x26: Prefetchable Mem Limit = 0xca21

while function 0 has

0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable)

and function 1

0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable)

And what is the size of this BAR?

Sadly I don't seem to have a matching pair of kernel and ITP logs, so I can only guess that it's 64k (or less).

Jan
Re: [Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
On 05.06.15 at 13:32, stefano.stabell...@eu.citrix.com wrote:

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&

The change looks good and in line with the commit message. But this if statement looks like acrobatic circus to me now.

I think the alternative of splitting it up into multiple if()-s would not be any better readable.

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,

There is no need for the explicit cast, is there?

There is - PCI_ROM_ADDRESS_MASK being wider than 32 bits, the assignment could cause a compiler warning otherwise (which I think I actually ran into).

Jan
[Qemu-devel] [PATCH 5/6] xen/pass-through: log errno values rather than function return ones
Functions setting errno commonly return just -1, which is of no particular use in the log file.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/qemu/upstream/hw/xen/xen_pt.c
+++ b/qemu/upstream/hw/xen/xen_pt.c
@@ -609,8 +609,8 @@ static void xen_pt_region_update(XenPCIP
                                        guest_port, machine_port, size, op);
         if (rc) {
-            XEN_PT_ERR(d, "%s ioport mapping failed! (rc: %i)\n",
-                       adding ? "create new" : "remove old", rc);
+            XEN_PT_ERR(d, "%s ioport mapping failed! (err: %i)\n",
+                       adding ? "create new" : "remove old", errno);
         }
     } else {
         pcibus_t guest_addr = sec->offset_within_address_space;
@@ -623,8 +623,8 @@ static void xen_pt_region_update(XenPCIP
                                        XEN_PFN(size + XC_PAGE_SIZE - 1), op);
         if (rc) {
-            XEN_PT_ERR(d, "%s mem mapping failed! (rc: %i)\n",
-                       adding ? "create new" : "remove old", rc);
+            XEN_PT_ERR(d, "%s mem mapping failed! (err: %i)\n",
+                       adding ? "create new" : "remove old", errno);
         }
     }
 }
@@ -738,8 +738,8 @@ static int xen_pt_initfn(PCIDevice *d)
     rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
     if (rc < 0) {
-        XEN_PT_ERR(d, "Mapping machine irq %u to pirq %i failed, (rc: %d)\n",
-                   machine_irq, pirq, rc);
+        XEN_PT_ERR(d, "Mapping machine irq %u to pirq %i failed, (err: %d)\n",
+                   machine_irq, pirq, errno);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
         cmd |= PCI_COMMAND_INTX_DISABLE;
@@ -760,8 +760,8 @@ static int xen_pt_initfn(PCIDevice *d)
                                      PCI_SLOT(d->devfn), e_intx);
         if (rc < 0) {
-            XEN_PT_ERR(d, "Binding of interrupt %i failed! (rc: %d)\n",
-                       e_intx, rc);
+            XEN_PT_ERR(d, "Binding of interrupt %i failed! (err: %d)\n",
+                       e_intx, errno);
 
             /* Disable PCI intx assertion (turn on bit10 of devctl) */
             cmd |= PCI_COMMAND_INTX_DISABLE;
@@ -770,7 +770,7 @@ static int xen_pt_initfn(PCIDevice *d)
             if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
                 if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
                     XEN_PT_ERR(d, "Unmapping of machine interrupt %i failed!"
-                               " (rc: %d)\n", machine_irq, rc);
+                               " (err: %d)\n", machine_irq, errno);
                 }
             }
             s->machine_irq = 0;
@@ -808,9 +808,9 @@ static void xen_pt_unregister_device(PCI
                                          0 /* isa_irq */);
         if (rc < 0) {
             XEN_PT_ERR(d, "unbinding of interrupt INT%c failed."
-                       " (machine irq: %i, rc: %d)"
+                       " (machine irq: %i, err: %d)"
                        " But bravely continuing on..\n",
-                       'a' + intx, machine_irq, rc);
+                       'a' + intx, machine_irq, errno);
         }
     }
 
@@ -828,9 +828,9 @@ static void xen_pt_unregister_device(PCI
         rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
         if (rc < 0) {
-            XEN_PT_ERR(d, "unmapping of interrupt %i failed. (rc: %d)"
+            XEN_PT_ERR(d, "unmapping of interrupt %i failed. (err: %d)"
                        " But bravely continuing on..\n",
-                       machine_irq, rc);
+                       machine_irq, errno);
         }
     }
 }
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -133,8 +133,8 @@ static int msi_msix_setup(XenPCIPassthro
                                       msix_entry, table_base);
         if (rc) {
             XEN_PT_ERR(&s->dev,
-                       "Mapping of MSI%s (rc: %i, vec: %#x, entry %#x)\n",
-                       is_msix ? "-X" : "", rc, gvec, msix_entry);
+                       "Mapping of MSI%s (err: %i, vec: %#x, entry %#x)\n",
+                       is_msix ? "-X" : "", errno, gvec, msix_entry);
             return rc;
         }
     }
@@ -167,12 +167,12 @@ static int msi_msix_update(XenPCIPassthr
                                       pirq, gflags, table_addr);
 
     if (rc) {
-        XEN_PT_ERR(d, "Updating of MSI%s failed. (rc: %d)\n",
-                   is_msix ? "-X" : "", rc);
+        XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
+                   is_msix ? "-X" : "", errno);
 
         if (xc_physdev_unmap_pirq(xen_xc, xen_domid, *old_pirq)) {
-            XEN_PT_ERR(d, "Unmapping of MSI%s pirq %d failed.\n",
-                       is_msix ? "-X" : "", *old_pirq);
+            XEN_PT_ERR(d, "Unmapping of MSI%s pirq %d failed. (err: %d)\n",
+                       is_msix ? "-X" : "", *old_pirq, errno
[Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
Particularly the maskall bit has to be under exclusive hypervisor control (and since they live in the same config space field, the enable bit has to follow suit). Use the replacement hypercall interfaces.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -181,6 +181,7 @@ typedef struct XenPTMSIXEntry {
 typedef struct XenPTMSIX {
     uint32_t ctrl_offset;
     bool enabled;
+    bool maskall;
     int total_entries;
     int bar_index;
     uint64_t table_base;
@@ -293,7 +294,9 @@ int xen_pt_msix_init(XenPCIPassthroughSt
 void xen_pt_msix_delete(XenPCIPassthroughState *s);
 int xen_pt_msix_update(XenPCIPassthroughState *s);
 int xen_pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
+void xen_pt_msix_enable(XenPCIPassthroughState *s);
 void xen_pt_msix_disable(XenPCIPassthroughState *s);
+int xen_pt_msix_maskall(XenPCIPassthroughState *s, bool mask);
 
 static inline bool xen_pt_has_msix_mapping(XenPCIPassthroughState *s, int bar)
 {
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -1436,32 +1436,58 @@ static int xen_pt_msixctrl_reg_write(Xen
                                      uint16_t dev_value, uint16_t valid_mask)
 {
     XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
+    uint16_t writable_mask, value;
     uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
     int debug_msix_enabled_old;
 
     /* modify emulate register */
     writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
+    value = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
+    cfg_entry->data = value;
 
     /* create value for writing to I/O device register */
     *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
 
+    debug_msix_enabled_old = s->msix->enabled;
+
     /* update MSI-X */
-    if ((*val & PCI_MSIX_FLAGS_ENABLE)
-        && !(*val & PCI_MSIX_FLAGS_MASKALL)) {
+    if ((value & PCI_MSIX_FLAGS_ENABLE)
+        && !(value & PCI_MSIX_FLAGS_MASKALL)) {
+        if (!s->msix->enabled) {
+            if (!s->msix->maskall) {
+                xen_pt_msix_maskall(s, true);
+            }
+            xen_pt_msix_enable(s);
+        }
         xen_pt_msix_update(s);
-    } else if (!(*val & PCI_MSIX_FLAGS_ENABLE) && s->msix->enabled) {
-        xen_pt_msix_disable(s);
+        s->msix->enabled = true;
+        s->msix->maskall = false;
+        xen_pt_msix_maskall(s, false);
+    } else if (s->msix->enabled) {
+        if (!(value & PCI_MSIX_FLAGS_ENABLE)) {
+            xen_pt_msix_disable(s);
+            s->msix->enabled = false;
+        } else if (!s->msix->maskall) {
+            s->msix->maskall = true;
+            xen_pt_msix_maskall(s, true);
+        }
     }
 
-    debug_msix_enabled_old = s->msix->enabled;
-    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
     if (s->msix->enabled != debug_msix_enabled_old) {
         XEN_PT_LOG(&s->dev, "%s MSI-X\n",
                    s->msix->enabled ? "enable" : "disable");
     }
 
+    xen_host_pci_get_word(&s->real_device, s->msix->ctrl_offset, &dev_value);
+
+    if (s->msix->enabled && !(dev_value & PCI_MSIX_FLAGS_ENABLE)) {
+        XEN_PT_ERR(&s->dev, "MSI-X unexpectedly disabled\n");
+    } else if ((dev_value & PCI_MSIX_FLAGS_ENABLE)
+               && s->msix->maskall
+               && !(dev_value & PCI_MSIX_FLAGS_MASKALL)) {
+        XEN_PT_ERR(&s->dev, "MSI-X unexpectedly unmasked\n");
+    }
+
     return 0;
 }
 
@@ -1483,9 +1509,12 @@ static XenPTRegInfo xen_pt_emu_reg_msix[
         .offset     = PCI_MSI_FLAGS,
         .size       = 2,
         .init_val   = 0x0000,
-        .res_mask   = 0x3800,
-        .ro_mask    = 0x07FF,
-        .emu_mask   = 0x0000,
+        /* This must not be split into res_mask (0x3800) and ro_mask (0x07FF)
+         * because even in permissive mode there must not be any write back
+         * to this register.
+         */
+        .ro_mask    = 0x3FFF,
+        .emu_mask   = 0xC000,
         .init       = xen_pt_msixctrl_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_msixctrl_reg_write,
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
         return -1;
     }
 
-    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
-                           enabled);
+    return xc_physdev_msix_enable(xen_xc, s->real_device.domain,
+                                  s->real_device.bus,
+                                  PCI_DEVFN(s->real_device.dev,
+                                            s->real_device.func),
+                                  enabled);
 }
 
 static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
@@ -361,6 +364,11 @@ int xen_pt_msix_update(XenPCIPassthrough
     return 0;
 }
 
+void xen_pt_msix_enable(XenPCIPassthroughState *s
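The control flow above is easier to see in isolation. Below is a minimal sketch (not qemu code — names and the action enum are illustrative) of the decision logic the rewritten xen_pt_msixctrl_reg_write() implements: enabling always goes through a masked intermediate state, and clearing mask-all is an explicit final step, so vectors can never fire before the table is in a consistent state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_ENABLE  0x8000u  /* PCI_MSIX_FLAGS_ENABLE */
#define MSIX_MASKALL 0x4000u  /* PCI_MSIX_FLAGS_MASKALL */

/* Hypothetical summary of the state transition a control-register
 * write triggers in the patched handler. */
typedef enum {
    ACT_NONE,
    ACT_ENABLE_MASKED_THEN_UNMASK, /* enable with mask-all set, then clear it */
    ACT_DISABLE,
    ACT_MASKALL                    /* guest set mask-all while enabled */
} msix_action;

static msix_action msix_ctrl_transition(uint16_t value, bool enabled,
                                        bool maskall)
{
    if ((value & MSIX_ENABLE) && !(value & MSIX_MASKALL)) {
        /* Enable path: never enable with vectors live. */
        return enabled ? ACT_NONE : ACT_ENABLE_MASKED_THEN_UNMASK;
    }
    if (enabled) {
        if (!(value & MSIX_ENABLE)) {
            return ACT_DISABLE;
        }
        if (!maskall) {
            return ACT_MASKALL;
        }
    }
    return ACT_NONE;
}
```

This mirrors the sequence Linux uses as well (enable with MASKALL=1, program the table, then clear MASKALL), as discussed in the related xen-pciback thread.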
[Qemu-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as the guest behavior it complains about may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -175,9 +175,8 @@ typedef struct XenPTMSIXEntry {
     int pirq;
     uint64_t addr;
     uint32_t data;
-    uint32_t vector_ctrl;
+    uint32_t latch[4];
     bool updated; /* indicate whether MSI ADDR or DATA is updated */
-    bool warned;  /* avoid issuing (bogus) warning more than once */
 } XenPTMSIXEntry;
 typedef struct XenPTMSIX {
     uint32_t ctrl_offset;
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -25,6 +25,7 @@
 #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12
 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15
 
+#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
 
 /*
  * Helpers
@@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI
 
     pirq = entry->pirq;
 
+    if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall ||
+        (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        entry->addr = entry->latch(LOWER_ADDR) |
+                      ((uint64_t)entry->latch(UPPER_ADDR) << 32);
+        entry->data = entry->latch(DATA);
+    }
+
     rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr,
                         entry->pirq == XEN_PT_UNASSIGNED_PIRQ);
     if (rc) {
@@ -396,35 +404,15 @@ int xen_pt_msix_update_remap(XenPCIPasst
 
 static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset)
 {
-    switch (offset) {
-    case PCI_MSIX_ENTRY_LOWER_ADDR:
-        return e->addr & UINT32_MAX;
-    case PCI_MSIX_ENTRY_UPPER_ADDR:
-        return e->addr >> 32;
-    case PCI_MSIX_ENTRY_DATA:
-        return e->data;
-    case PCI_MSIX_ENTRY_VECTOR_CTRL:
-        return e->vector_ctrl;
-    default:
-        return 0;
-    }
+    return !(offset % sizeof(*e->latch))
+           ? e->latch[offset / sizeof(*e->latch)] : 0;
 }
 
 static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val)
 {
-    switch (offset) {
-    case PCI_MSIX_ENTRY_LOWER_ADDR:
-        e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val;
-        break;
-    case PCI_MSIX_ENTRY_UPPER_ADDR:
-        e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX);
-        break;
-    case PCI_MSIX_ENTRY_DATA:
-        e->data = val;
-        break;
-    case PCI_MSIX_ENTRY_VECTOR_CTRL:
-        e->vector_ctrl = val;
-        break;
+    if (!(offset % sizeof(*e->latch)))
+    {
+        e->latch[offset / sizeof(*e->latch)] = val;
     }
 }
 
@@ -444,39 +432,28 @@ static void pci_msix_write(void *opaque,
     offset = addr % PCI_MSIX_ENTRY_SIZE;
 
     if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        const volatile uint32_t *vec_ctrl;
-
         if (get_entry_value(entry, offset) == val
             && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) {
             return;
         }
 
+        entry->updated = true;
+    } else if (msix->enabled && entry->updated &&
+               !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        const volatile uint32_t *vec_ctrl;
+
         /*
          * If Xen intercepts the mask bit access, entry->vec_ctrl may not be
          * up-to-date. Read from hardware directly.
          */
         vec_ctrl = s->msix->phys_iomem_base + entry_nr * PCI_MSIX_ENTRY_SIZE
                    + PCI_MSIX_ENTRY_VECTOR_CTRL;
+        set_entry_value(entry, offset, *vec_ctrl);
 
-        if (msix->enabled && !(*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            if (!entry->warned) {
-                entry->warned = true;
-                XEN_PT_ERR(&s->dev, "Can't update msix entry %d since MSI-X is"
-                           " already enabled.\n", entry_nr);
-            }
-            return;
-        }
-
-        entry->updated = true;
+        xen_pt_msix_update_one(s, entry_nr);
     }
 
     set_entry_value(entry, offset, val);
-
-    if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        if (msix->enabled && !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            xen_pt_msix_update_one(s, entry_nr);
-        }
-    }
 }
 
 static uint64_t pci_msix_read(void *opaque, hwaddr addr,
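The latch replacement for the old switch-based accessors is compact because each MSI-X table entry is exactly four dwords, so an aligned dword offset indexes the latch array directly. A standalone sketch (types and constant names simplified from the patch; not the qemu originals):

```c
#include <assert.h>
#include <stdint.h>

/* MSI-X table entry dword offsets, per the PCI spec layout. */
#define ENTRY_LOWER_ADDR  0x0u
#define ENTRY_UPPER_ADDR  0x4u
#define ENTRY_DATA        0x8u
#define ENTRY_VECTOR_CTRL 0xcu

typedef struct {
    uint32_t latch[4]; /* one dword per MSI-X table field */
} msix_entry;

/* Equivalent of the patched get_entry_value(): aligned offsets index
 * the latch array; unaligned offsets read as 0. */
static uint32_t get_entry_value(const msix_entry *e, unsigned offset)
{
    return !(offset % sizeof(e->latch[0]))
           ? e->latch[offset / sizeof(e->latch[0])] : 0;
}

/* Equivalent of the patched set_entry_value(): unaligned writes are
 * silently dropped. */
static void set_entry_value(msix_entry *e, unsigned offset, uint32_t val)
{
    if (!(offset % sizeof(e->latch[0]))) {
        e->latch[offset / sizeof(e->latch[0])] = val;
    }
}

/* Usage example: write one dword, read it back. */
static uint32_t write_then_read(unsigned offset, uint32_t val)
{
    msix_entry e = {{0, 0, 0, 0}};
    set_entry_value(&e, offset, val);
    return get_entry_value(&e, offset);
}
```

The real patch additionally keeps the guest-visible `latch[]` separate from the `addr`/`data` values handed to the hypervisor, copying the latched values over only when the entry is masked, mask-all is set, or no pirq is bound yet — which is exactly the caching behavior the spec permits.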
[Qemu-devel] [PATCH 6/6] xen/pass-through: constify some static data
This is done indirectly by adjusting two typedefs and helps emphasizing that the respective tables aren't supposed to be modified at runtime (as they may be shared between devices).

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -31,7 +31,7 @@ void xen_pt_log(const PCIDevice *d, cons
 /* Helper */
 #define XEN_PFN(x) ((x) >> XC_PAGE_SHIFT)
 
-typedef struct XenPTRegInfo XenPTRegInfo;
+typedef const struct XenPTRegInfo XenPTRegInfo;
 typedef struct XenPTReg XenPTReg;
 typedef struct XenPCIPassthroughState XenPCIPassthroughState;
@@ -135,11 +135,11 @@ struct XenPTReg {
     uint32_t data; /* emulated value */
 };
 
-typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
+typedef const struct XenPTRegGroupInfo XenPTRegGroupInfo;
 
 /* emul reg group size initialize method */
 typedef int (*xen_pt_reg_size_init_fn)
-    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
+    (XenPCIPassthroughState *, XenPTRegGroupInfo *,
      uint32_t base_offset, uint8_t *size);
 
 /* emulated register group information */
@@ -154,7 +154,7 @@ struct XenPTRegGroupInfo {
 /* emul register group management table */
 typedef struct XenPTRegGroup {
     QLIST_ENTRY(XenPTRegGroup) entries;
-    const XenPTRegGroupInfo *reg_grp;
+    XenPTRegGroupInfo *reg_grp;
     uint32_t base_offset;
     uint8_t size;
     QLIST_HEAD(, XenPTReg) reg_tbl_list;
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -96,8 +96,7 @@ XenPTReg *xen_pt_find_reg(XenPTRegGroup
 }
 
 static uint32_t get_throughable_mask(const XenPCIPassthroughState *s,
-                                     const XenPTRegInfo *reg,
-                                     uint32_t valid_mask)
+                                     XenPTRegInfo *reg, uint32_t valid_mask)
 {
     uint32_t throughable_mask = ~(reg->emu_mask | reg->ro_mask);
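The typedef trick is worth seeing on its own: folding `const` into the typedef makes every declaration using that name const, so the static tables land in read-only storage and any accidental runtime modification is a compile error. A minimal sketch with illustrative names (not the actual qemu structures):

```c
#include <assert.h>
#include <stdint.h>

/* The const lives in the typedef, so "RegInfo x;" is itself const and
 * tables of RegInfo cannot be written to at runtime. */
typedef const struct RegInfo {
    uint32_t offset;
    uint32_t ro_mask;
} RegInfo;

/* Example static table; values are placeholders, not real register masks. */
static RegInfo reg_table[] = {
    { 0x04u, 0x06F8u },
    { 0x30u, 0x07FEu },
};

static uint32_t lookup_ro_mask(uint32_t offset)
{
    for (unsigned i = 0; i < sizeof(reg_table) / sizeof(reg_table[0]); i++) {
        if (reg_table[i].offset == offset) {
            return reg_table[i].ro_mask;
        }
    }
    return 0;
}
```

Writing `reg_table[0].ro_mask = 0;` now fails to compile, which is precisely the sharing-safety property the patch description calls out.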
[Qemu-devel] [PATCH 3/6] xen/MSI-X: really enforce alignment
The way the generic infrastructure works, the intention of not allowing unaligned accesses can't be achieved by simply setting .unaligned to false. The benefit is that we can now replace the conditionals in {get,set}_entry_value() by assert()-s.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -421,16 +421,14 @@ int xen_pt_msix_update_remap(XenPCIPasst
 
 static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset)
 {
-    return !(offset % sizeof(*e->latch))
-           ? e->latch[offset / sizeof(*e->latch)] : 0;
+    assert(!(offset % sizeof(*e->latch)));
+    return e->latch[offset / sizeof(*e->latch)];
 }
 
 static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val)
 {
-    if (!(offset % sizeof(*e->latch)))
-    {
-        e->latch[offset / sizeof(*e->latch)] = val;
-    }
+    assert(!(offset % sizeof(*e->latch)));
+    e->latch[offset / sizeof(*e->latch)] = val;
 }
 
 static void pci_msix_write(void *opaque, hwaddr addr,
@@ -496,6 +494,12 @@ static uint64_t pci_msix_read(void *opaq
     }
 }
 
+static bool pci_msix_accepts(void *opaque, hwaddr addr,
+                             unsigned size, bool is_write)
+{
+    return !(addr & (size - 1));
+}
+
 static const MemoryRegionOps pci_msix_ops = {
     .read = pci_msix_read,
     .write = pci_msix_write,
@@ -504,7 +508,13 @@ static const MemoryRegionOps pci_msix_op
         .min_access_size = 4,
         .max_access_size = 4,
         .unaligned = false,
+        .accepts = pci_msix_accepts
     },
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+        .unaligned = false
+    }
 };
 
 int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
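The alignment test itself is a one-liner that relies on access sizes being powers of two, so `size - 1` is a mask of the low address bits that must be zero. Extracted into a standalone sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same check as the pci_msix_accepts() callback added by the patch:
 * an access is accepted only when its address is naturally aligned to
 * its size (size being a power of two: 1, 2, 4, 8). */
static bool accepts(uint64_t addr, unsigned size)
{
    return !(addr & (size - 1));
}
```

With this callback wired into `.valid`, the memory core rejects unaligned guest accesses before they ever reach pci_msix_read()/pci_msix_write(), which is why the offset checks in {get,set}_entry_value() can safely become assert()-s.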
[Qemu-devel] [PATCH 4/6] xen/pass-through: correctly deal with RW1C bits
Introduce yet another mask for them, so that the generic routine can handle them, at once rendering xen_pt_pmcsr_reg_write() superfluous.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -105,6 +105,8 @@ struct XenPTRegInfo {
     uint32_t res_mask;
     /* reg read only field mask (ON:RO/ROS, OFF:other) */
     uint32_t ro_mask;
+    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
+    uint32_t rw1c_mask;
     /* reg emulate field mask (ON:emu, OFF:passthrough) */
     uint32_t emu_mask;
     xen_pt_conf_reg_init init;
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -176,7 +176,8 @@ static int xen_pt_byte_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -193,7 +194,8 @@ static int xen_pt_word_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -210,7 +212,8 @@ static int xen_pt_long_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -611,6 +614,7 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .init_val   = 0x0000,
         .res_mask   = 0x0007,
         .ro_mask    = 0x06F8,
+        .rw1c_mask  = 0xF900,
         .emu_mask   = 0x0010,
         .init       = xen_pt_status_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
@@ -910,6 +914,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .size       = 2,
         .res_mask   = 0xFFC0,
         .ro_mask    = 0x0030,
+        .rw1c_mask  = 0x000F,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_word_reg_write,
@@ -930,6 +935,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .offset     = PCI_EXP_LNKSTA,
         .size       = 2,
         .ro_mask    = 0x3FFF,
+        .rw1c_mask  = 0xC000,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_word_reg_write,
@@ -966,26 +972,6 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
  * Power Management Capability
  */
 
-/* write Power Management Control/Status register */
-static int xen_pt_pmcsr_reg_write(XenPCIPassthroughState *s,
-                                  XenPTReg *cfg_entry, uint16_t *val,
-                                  uint16_t dev_value, uint16_t valid_mask)
-{
-    XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
-    uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
-
-    /* modify emulate register */
-    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
-
-    /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~PCI_PM_CTRL_PME_STATUS,
-                              throughable_mask);
-
-    return 0;
-}
-
 /* Power Management Capability reg static information table */
 static XenPTRegInfo xen_pt_emu_reg_pm[] = {
     /* Next Pointer reg */
@@ -1016,11 +1002,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size       = 2,
         .init_val   = 0x0008,
         .res_mask   = 0x00F0,
-        .ro_mask    = 0xE10C,
+        .ro_mask    = 0x610C,
+        .rw1c_mask  = 0x8000,
         .emu_mask   = 0x810B,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
-        .u.w.write  = xen_pt_pmcsr_reg_write,
+        .u.w.write  = xen_pt_word_reg_write,
    },
    {
        .size = 0,
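The point of masking `dev_value` with `~rw1c_mask` before the merge is subtle: for a write-1-to-clear bit that is *not* passed through to the guest, the merge would otherwise echo the device's current 1 back to the hardware and thereby clear the status bit behind the guest's back. A standalone sketch of the merge (macro semantics match qemu's XEN_PT_MERGE_VALUE; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Guest-controlled bits come from val, everything else from the device. */
#define MERGE_VALUE(val, dev, mask) (((val) & (mask)) | ((dev) & ~(mask)))

/* Sketch of the generic write-back computation with RW1C handling, as
 * in the patched xen_pt_word_reg_write(): device bits in rw1c_mask are
 * suppressed so they can only reach the device when the guest itself
 * wrote 1 through a throughable bit. */
static uint16_t write_back_value(uint16_t guest_val, uint16_t dev_value,
                                 uint16_t throughable_mask, uint16_t rw1c_mask)
{
    return MERGE_VALUE(guest_val, (uint16_t)(dev_value & ~rw1c_mask),
                       throughable_mask);
}
```

Without the `& ~rw1c_mask`, the first assertion below would compute 0x8000 and the write-back would clear a status bit (e.g. PME_STATUS) the guest never touched — which is exactly what the now-removed xen_pt_pmcsr_reg_write() special-cased for a single bit.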
[Qemu-devel] [PATCH 0/6] xen/pass-through: XSA-120, 128...131 follow-up
1: MSI-X: latch MSI-X table writes
2: MSI-X: drive maskall and enable bits through hypercalls
3: MSI-X: really enforce alignment
4: pass-through: correctly deal with RW1C bits
5: pass-through: log errno values rather than function return ones
6: pass-through: constify some static data

Signed-off-by: Jan Beulich jbeul...@suse.com
[Qemu-devel] Ping: [PATCH] xen/pass-through: ROM BAR handling adjustments
Ping?

On 15.05.15 at 14:41, jbeul...@suse.com wrote:
> Expecting the ROM BAR to be written with an all ones value when sizing
> the region is wrong - the low bit has another meaning (enable/disable)
> and bits 1..10 are reserved. The PCI spec also mandates writing all ones
> to just the address portion of the register.
>
> Use suitable constants also for initializing the ROM BAR register field
> description.
>
> Signed-off-by: Jan Beulich jbeul...@suse.com
>
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
>      /* check unused BAR register */
>      index = xen_pt_bar_offset_to_index(addr);
> -    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
> +    if ((index >= 0) && (val != 0) &&
> +        (((index != PCI_ROM_SLOT) ?
> +          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&
>          (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
>          XEN_PT_WARN(d, "Guest attempt to set address to unused "
>                      "Base Address Register. (addr: 0x%02x, len: %d)\n",
>                      addr, len);
> --- a/hw/xen/xen_pt_config_init.c
> +++ b/hw/xen/xen_pt_config_init.c
> @@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
>          .offset     = PCI_ROM_ADDRESS,
>          .size       = 4,
>          .init_val   = 0x00000000,
> -        .ro_mask    = 0x000007FE,
> -        .emu_mask   = 0xFFFFF800,
> +        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
> +        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
>          .init       = xen_pt_bar_reg_init,
>          .u.dw.read  = xen_pt_long_reg_read,
>          .u.dw.write = xen_pt_exp_rom_bar_reg_write,

___
Xen-devel mailing list
xen-de...@lists.xen.org
http://lists.xen.org/xen-devel
[Qemu-devel] Ping: [PATCH] xen/pass-through: fold host PCI command register writes
Ping?

On 15.05.15 at 14:46, jbeul...@suse.com wrote:
> The code introduced to address XSA-126 allows simplification of other
> code in xen_pt_initfn(): All we need to do is update cmd suitably, as
> it'll be written back to the host register near the end of the function
> anyway.
>
> Signed-off-by: Jan Beulich jbeul...@suse.com
>
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -698,10 +698,7 @@ static int xen_pt_initfn(PCIDevice *d)
>                     machine_irq, pirq, rc);
>
>          /* Disable PCI intx assertion (turn on bit10 of devctl) */
> -        xen_host_pci_set_word(&s->real_device,
> -                              PCI_COMMAND,
> -                              pci_get_word(s->dev.config + PCI_COMMAND)
> -                              | PCI_COMMAND_INTX_DISABLE);
> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>
>          machine_irq = 0;
>          s->machine_irq = 0;
>      } else {
> @@ -723,9 +720,7 @@ static int xen_pt_initfn(PCIDevice *d)
>                     e_intx, rc);
>
>          /* Disable PCI intx assertion (turn on bit10 of devctl) */
> -        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
> -                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> -                              | PCI_COMMAND_INTX_DISABLE);
> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>
>          xen_pt_mapped_machine_irq[machine_irq]--;
>          if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
[Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
Expecting the ROM BAR to be written with an all ones value when sizing the region is wrong - the low bit has another meaning (enable/disable) and bits 1..10 are reserved. The PCI spec also mandates writing all ones to just the address portion of the register.

Use suitable constants also for initializing the ROM BAR register field description.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&
         (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
         XEN_PT_WARN(d, "Guest attempt to set address to unused "
                     "Base Address Register. (addr: 0x%02x, len: %d)\n",
                     addr, len);
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
         .init       = xen_pt_bar_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_exp_rom_bar_reg_write,
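The adjusted check can be illustrated in isolation: a ROM BAR sizing probe only needs the address bits (31:11) to be all ones, because bit 0 is the enable bit and bits 1..10 are reserved. A sketch (the helper name is illustrative; constants follow the standard pci_regs.h values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_ROM_ADDRESS_ENABLE 0x01u
#define PCI_ROM_ADDRESS_MASK   (~0x7FFu) /* address bits 31:11 -> 0xFFFFF800 */

/* Hypothetical condensed form of the patched condition: for the ROM
 * BAR, force the non-address bits to 1 before comparing against the
 * all-ones sizing value; ordinary BARs must match all 32 bits. */
static bool is_sizing_probe(uint32_t val, bool is_rom_bar)
{
    uint32_t v = is_rom_bar ? (val | (uint32_t)~PCI_ROM_ADDRESS_MASK) : val;
    return v == 0xFFFFFFFFu;
}
```

So a guest writing 0xFFFFF800 (all address bits, enable clear) is correctly recognized as sizing the ROM, which the old `val < XEN_PT_BAR_ALLF` comparison missed.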
[Qemu-devel] [PATCH] xen/pass-through: fold host PCI command register writes
The code introduced to address XSA-126 allows simplification of other code in xen_pt_initfn(): All we need to do is update cmd suitably, as it'll be written back to the host register near the end of the function anyway.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -698,10 +698,7 @@ static int xen_pt_initfn(PCIDevice *d)
                    machine_irq, pirq, rc);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
-        xen_host_pci_set_word(&s->real_device,
-                              PCI_COMMAND,
-                              pci_get_word(s->dev.config + PCI_COMMAND)
-                              | PCI_COMMAND_INTX_DISABLE);
+        cmd |= PCI_COMMAND_INTX_DISABLE;
 
         machine_irq = 0;
         s->machine_irq = 0;
     } else {
@@ -723,9 +720,7 @@ static int xen_pt_initfn(PCIDevice *d)
                    e_intx, rc);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
-        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
-                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
-                              | PCI_COMMAND_INTX_DISABLE);
+        cmd |= PCI_COMMAND_INTX_DISABLE;
 
         xen_pt_mapped_machine_irq[machine_irq]--;
         if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
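The pattern behind the simplification: rather than issuing one host config-space write per condition, the error paths merely OR bits into a local `cmd` accumulator, and a single write-back at the end of the function applies them. A minimal sketch (the failure flag stands in for the real IRQ-binding error handling; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400u /* bit 10 of the PCI command reg */

/* Hypothetical condensed form of the xen_pt_initfn() flow after the
 * patch: accumulate command-register changes in cmd; the caller does
 * one xen_host_pci_set_word(..., PCI_COMMAND, cmd) at the end. */
static uint16_t compute_final_cmd(uint16_t cmd, int intx_setup_failed)
{
    if (intx_setup_failed) {
        cmd |= PCI_COMMAND_INTX_DISABLE; /* replaces the immediate host write */
    }
    return cmd;
}
```

Besides being shorter, this removes a window in which the host register and the about-to-be-written `cmd` value could disagree.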
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 20.04.15 at 16:32, m...@redhat.com wrote:
> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>> On 20.04.15 at 15:43, m...@redhat.com wrote:
>>> did firmware reconfigure this device to report URs as fatal errors then?
>> No, the Unsupported Request Error Severity flag is zero.
> OK, that's the correct configuration, so how come the box crashes when
> there's a UR then?

Apparently something marks it fatal in the process of delivering it through APEI. I'll raise the question with the reporters.

Jan