Re: [PATCH] include/hw/xen: Use more inclusive language in comment
On 09.11.2023 18:40, Thomas Huth wrote:
> --- a/include/hw/xen/interface/hvm/params.h
> +++ b/include/hw/xen/interface/hvm/params.h
> @@ -255,7 +255,7 @@
>   * Note that 'mixed' mode has not been evaluated for safety from a
>   * security perspective. Before using this mode in a
>   * security-critical environment, each subop should be evaluated for
> - * safety, with unsafe subops blacklisted in XSM.
> + * safety, with unsafe subops blocked in XSM.

To avoid another round trip when you send the patch against xen.git, as already asked for by others, I'd like to point out that the wording change isn't describing things sufficiently similarly: "blocked" reads as if XSM would do so all by itself, whereas "blacklisted" has an indication that something needs to be done for XSM to behave in the intended way. Minimally I'd suggest "suitably blocked via", but perhaps yet better wording can be thought of.

Jan

PS: Personally I'm against such avoiding of certain words. Them being misused is not really a justification. New wording (perhaps not specifically here, but considering the underlying wider theme) is going to be misused as well, leading to the need to come up with yet different wording, and so on.
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 17.11.2022 04:34, Marek Marczykowski-Górecki wrote:
> Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
> disabled at this point yet. And apparently I was testing this with
> permissive=1...
>
> Linux does this:
> https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
> In short:
> 1. Enable MSI-X with MASKALL=1
> 2. Setup MSI-X table
> 3. Disable INTx
> 4. Set MASKALL=0
>
> This patch on top should fix this:
> 8<
> diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
> index 097316a74126..f4c4381de76e 100644
> --- a/drivers/xen/xen-pciback/conf_space_capability.c
> +++ b/drivers/xen/xen-pciback/conf_space_capability.c
> @@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
>  	    (new_value ^ old_value) & ~field_config->allowed_bits)
>  		return PCIBIOS_SET_FAILED;
>
> -	if (new_value & field_config->enable_bit) {
> +	if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
>  		/* don't allow enabling together with other interrupt types */
>  		int int_type = xen_pcibk_get_interrupt_type(dev);
>
> 8<
>
> Jan, is the above safe? It should prevent clearing MASKALL if INTx isn't
> disabled, unless I missed something? But also, it will allow enabling
> MSI-X with MASKALL=1 together with MSI, which I'm not sure if should be
> allowed.

While it would probably be okay with what we have now (after your earlier patch introducing allowed_bits), it's likely not going to be correct once further bits would be added to allowed_bits (which clearly is going to be wanted sooner or later, e.g. for multi-vector MSI). Hence I think ...

> Alternatively to the above patch, I could allow specifically setting
> MSIX_FLAGS_ENABLE + MSIX_FLAGS_MASKALL while INTx isn't disabled as a
> single exception,

... this is the way to go, and ...
> but at this point I'm not sure if some other driver or
> OS wouldn't approach this in yet another way.

... I guess we need to further add exceptions as needed - the one further approach I could see is to keep all entries' mask bits set until setting "INTx disable", without using MASKALL.

I'd like to note though that the PCI spec has no such exception. It, however, also doesn't mandate setting "INTx disable" - that's merely a workaround for flawed hardware afaik. (In Xen we also only cross check MSI and MSI-X. IOW we rely on Dom0 anyway when it comes to driving "INTx disable".) Therefore in the end pciback looks to be going too far in enforcing such dependencies.

Thinking of this - what about making the change in xen_pcibk_get_interrupt_type() instead, setting INTERRUPT_TYPE_MSIX only when MASKALL is clear (alongside ENABLE being set)? This would then also cover command_write().

Jan
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 15.11.2022 12:38, Marek Marczykowski-Górecki wrote:
> On Tue, Nov 15, 2022 at 09:14:07AM +0100, Jan Beulich wrote:
>> On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
>>> The /dev/mem is used for two purposes:
>>> - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>>> - reading Pending Bit Array (PBA)
>>>
>>> The first one was originally done because when Xen did not send all
>>> vector ctrl writes to the device model, so QEMU might have outdated old
>>> register value. This has been changed in Xen, so QEMU can now use its
>>> cached value of the register instead.
>>>
>>> The Pending Bit Array (PBA) handling is for the case where it lives on
>>> the same page as the MSI-X table itself. Xen has been extended to handle
>>> this case too (as well as other registers that may live on those pages),
>>> so QEMU handling is not necessary anymore.
>>
>> Don't you need to check for new enough Xen for both aspects?
>
> Yes, see my response to Andrew in the thread. I'm open for suggestions
> what to check specifically (Xen version directly?).

I guess we should first see what changes we (you) end up doing in Xen itself, and then decide whether to surface the new functionality in some kind of feature indicator. Generally I'd prefer to avoid version checks, because they don't fit very well with backports nor the (rare) need to revert something. But sometimes a feature can be probed easily without requiring an explicit feature indicator.

Jan
Re: [PATCH 2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen
On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
> The /dev/mem is used for two purposes:
> - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> - reading Pending Bit Array (PBA)
>
> The first one was originally done because when Xen did not send all
> vector ctrl writes to the device model, so QEMU might have outdated old
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
>
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.

Don't you need to check for new enough Xen for both aspects?

Jan
Re: [BUG] Xen build error - undefined reference to bpf_program__set_socket_filter
On 17.10.2022 10:12, Arthur Borsboom wrote:
> Xen 4.16.1, 4.16.2 and 4.17.0-rc1 don't build anymore in Arch Linux.

That is, qemu doesn't build. That's something to be taken care of there, not in Xen, I think.

Jan

> I believe it is caused by the missing function
> bpf_program__set_socket_filter provided by libbpf.
> This function has been deprecated in v0.8 and has been removed in v1.0.
>
> Arch Linux uses libbpf v1.0.1 since October 2022.
> A downgrade to libbpf v0.8.1 fixes the Xen build problem.
>
> Source about the deprecation:
> https://libbpf-test.readthedocs.io/en/latest/api.html
>
> Build error:
>
> /bin/ld: libcommon.fa.p/ebpf_ebpf_rss.c.o: in function `ebpf_rss_load':
> /home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/qemu-xen-build/../qemu-xen/ebpf/ebpf_rss.c:52: undefined reference to `bpf_program__set_socket_filter'
> collect2: error: ld returned 1 exit status
> ...
> ...
> ninja: build stopped: subcommand failed.
> make: *** [Makefile:156: run-ninja] Error 1
> make: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/qemu-xen-build'
> make[3]: *** [Makefile:212: subdir-all-qemu-xen-dir] Error 2
> make[3]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make[2]: *** [/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools/../tools/Rules.mk:161: subdirs-install] Error 2
> make[2]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make[1]: *** [Makefile:66: install] Error 2
> make[1]: Leaving directory '/home/arthur/.cache/yay/xen/src/xen-4.16.1/tools'
> make: *** [Makefile:140: install-tools] Error 2
> ==> ERROR: A failure occurred in build().
> Aborting...
> -> error making: xen
Re: SecureBoot and PCI passthrough with kernel lockdown in place (on Xen)
On 14.02.2022 16:02, Dario Faggioli wrote:
> We have run into an issue when trying to use PCI passthrough for a Xen
> VM running on an host where dom0 kernel is 5.14.21 (but we think it
> could be any kernel > 5.4) and SecureBoot is enabled.
>
> The error we get, when (for instance) trying to attach a device to an
> (HVM) VM, on such system is:
>
> # xl pci-attach 2-fv-sles15sp4beta2 :58:03.0
> libxl: error: libxl_qmp.c:1838:qmp_ev_parse_error_messages: Domain 12:Failed to initialize 12/15, type = 0x1, rc: -1
> libxl: error: libxl_pci.c:1777:device_pci_add_done: Domain 12:libxl__device_pci_add failed for PCI device 0:58:3.0 (rc -28)
> libxl: error: libxl_device.c:1420:device_addrm_aocomplete: unable to add device
>
> QEMU, is telling us the following:
>
> [00:04.0] xen_pt_msix_init: Error: Can't open /dev/mem: Operation not permitted
> [00:04.0] xen_pt_msix_size_init: Error: Internal error: Invalid xen_pt_msix_init.
>
> And the kernel reports this:
>
> Jan 27 16:20:53 narvi-sr860v2-bps-sles15sp4b2 kernel: Lockdown: qemu-system-i38: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
>
> So, it's related to lockdown. Which AFAIUI is consistent with the
> fact that the problem only shows up when SecureBoot is enabled, as
> that implies lockdown. It's also consistent with the fact that we
> don't seem to have any problems doing the same with a 5.3.x dom0
> kernel... As there's no lockdown there!
>
> Some digging revealed that QEMU tries to open /dev/mem in
> xen_pt_msix_init():
>
>     fd = open("/dev/mem", O_RDWR);
>     ...
>     msix->phys_iomem_base =
>         mmap(NULL,
>              total_entries * PCI_MSIX_ENTRY_SIZE + msix->table_offset_adjust,
>              PROT_READ,
>              MAP_SHARED | MAP_LOCKED,
>              fd,
>              msix->table_base + table_off - msix->table_offset_adjust);
>     close(fd);

I think this is finally a clear indication that it has always been wrong for qemu to access hardware directly like this. I see no way around replacing this by something which isn't a bodge / layering violation.

Jan

> This comes from commit:
>
>     commit 3854ca577dad92c4fe97b4a6ebce360e25407af7
>     Author: Jiang Yunhong
>     Date:   Thu Jun 21 15:42:35 2012 +
>
>         Introduce Xen PCI Passthrough, MSI
>
> A more complete history can be found here:
> git://xenbits.xensource.com/qemu-xen-unstable.git
>
> Signed-off-by: Jiang Yunhong
> Signed-off-by: Shan Haitao
> Signed-off-by: Anthony PERARD
> Acked-by: Stefano Stabellini
>
> Now, the questions:
> - is this (i.e., PCI-Passthrough with a locked-down dom0 kernel)
>   working for anyone? I've Cc-ed Marek, because I think I've read
>   that it does on QubesOS, but I'm not sure if the situation
>   is the same...
> - if it's working, how?
>
> Thanks and Regards
Re: [Qemu-devel] [RFC Patch] xen/pt: Emulate FLR capability
On 29.08.2019 11:02, Chao Gao wrote:
> Currently, for a HVM on Xen, no reset method is virtualized. So in a VM's
> perspective, assigned devices cannot be reset. But some devices rely on PCI
> reset to recover from hardware hangs. When being assigned to a VM, those
> devices cannot be reset and won't work any longer if a hardware hang occurs.
> We have to reboot VM to trigger PCI reset on host to recover the device.

Did you consider a hot-unplug, reset (by host), hot-plug cycle instead?

> +static int xen_pt_devctl_reg_write(XenPCIPassthroughState *s,
> +                                   XenPTReg *cfg_entry, uint16_t *val,
> +                                   uint16_t dev_value, uint16_t valid_mask)
> +{
> +    if (s->real_device.is_resetable && (*val & PCI_EXP_DEVCTL_BCR_FLR)) {
> +        xen_pt_reset(s);
> +    }
> +    return xen_pt_word_reg_write(s, cfg_entry, val, dev_value, valid_mask);

I think you also need to clear the bit before handing on the request, such that reads will always observe it clear.

Jan
Re: [Qemu-devel] [PATCH] Xen PCI passthrough: fix passthrough failure when irq map failure
>>> On 23.11.18 at 11:19, wrote:
> Adding Jan in case he has an opinion on my reply below.

I agree, fwiw.

Jan

> On Fri, Nov 23, 2018 at 12:04:51AM -0500, Zhao Yan wrote:
>> On Thu, Nov 22, 2018 at 03:18:05PM +0100, Roger Pau Monné wrote:
>> > On Thu, Nov 22, 2018 at 08:11:20AM -0500, Zhao Yan wrote:
>> > > On Thu, Oct 18, 2018 at 03:56:36PM +0100, Roger Pau Monné wrote:
>> > > > On Thu, Oct 18, 2018 at 08:22:41AM +, Zhao, Yan Y wrote:
>> > > > > Hi
>> > > > > The background for this patch is that: for some pci device, even if
>> > > > > its PCI_INTERRUPT_PIN is not 0, it actually does not support INTx
>> > > > > mode, so we should just report the error, disable INTx mode and
>> > > > > continue the passthrough.
>> > > > > However, the commit 5a11d0f7 regards this as an error condition and
>> > > > > lets qemu quit passthrough, which is too rigorous.
>> > > > >
>> > > > > Error message is below:
>> > > > > libxl: error: libxl_qmp.c:287:qmp_handle_error_response: Domain
>> > > > > 2:received an error message from QMP server: Mapping machine irq 0 to pirq -1 failed: Operation not permitted
>> > > >
>> > > > I'm having issues figuring out what's happening here.
>> > > > s->real_device.irq is 0, yet the PCI config space read of
>> > > > PCI_INTERRUPT_PIN returns something different than 0.
>> > > >
>> > > > AFAICT this is due to some kind of error in Linux, so that even when
>> > > > the device is supposed to have a valid IRQ the sysfs node it is set to
>> > > > 0, do you know the actual underlying cause of this?
>> > > >
>> > > > Thanks, Roger.
>> > > Hi Roger
>> > > Sorry for the later reply, I just missed this mail...
>> > > On my side, it's because the hardware actually does not support INTx mode,
>> > > but its configuration space does not report PCI_INTERRUPT_PIN as 0. It's a
>> > > hardware bug, but previous versions of qemu can tolerate it, switch to MSI
>> > > and make passthrough work.
>> >
>> > Then I think it would be better to check both PCI_INTERRUPT_PIN and
>> > s->real_device.irq before attempting to map the IRQ.
>> >
>> > Making the error non-fatal would mean that a device with a valid IRQ
>> > could fail to be set up correctly but the guest will still be created,
>> > and things won't go as expected when the guest attempts to use it.
>> >
>> > Thanks, Roger.
>> hi roger
>> thanks for your suggestion. it's right that "s->real_device.irq" is needed
>> to be checked before mapping, like if it's 0.
>> but on the other hand, maybe xc_physdev_map_pirq() itself can serve as a
>> checking of "s->real_device.irq"?
>> like in our case, it will fail and return -EPERM.
>> then error handling is still conducted ==> set INTX_DISABLE flag, even though
>> the error is not fatal.
>>
>>     machine_irq = s->real_device.irq;
>>     rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
>>     if (rc < 0) {
>>         error_setg_errno(errp, errno, "Mapping machine irq %u to"
>>                          " pirq %i failed", machine_irq, pirq);
>>
>>         /* Disable PCI intx assertion (turn on bit10 of devctl) */
>>         cmd |= PCI_COMMAND_INTX_DISABLE;
>>         machine_irq = 0;
>>         s->machine_irq = 0;
>>
>> So, do you think it's all right just converting the fatal error to non-fatal?
>
> As I said above, I think it would be better to leave the error as
> fatal and avoid attempting a xc_physdev_map_pirq with a machine_irq ==
> 0, which will fail.
>
> If we really want to go down the route of making the error non-fatal,
> I think you will also have to report PCI_INTERRUPT_PIN as 0 to the
> guest, so that it's clear to the guest that the device doesn't have
> legacy interrupt support.
>
> Exposing a device with PCI_INTERRUPT_PIN != 0 but then not allowing
> the guest to clear PCI_COMMAND_INTX_DISABLE is likely bogus.
>
> Thanks, Roger.
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen-hvm: stop faking I/O to access PCI config space
>>> On 18.05.18 at 15:51, wrote:
>> Sent: 18 May 2018 14:34
>> To: Paul Durrant
>> >>> On 18.05.18 at 15:00, wrote:
>> > +    QLIST_FOREACH(xendev, &state->dev_list, entry) {
>> > +        unsigned int i;
>> > +        uint32_t tmp;
>> > +
>> > +        if (xendev->sbdf != sbdf) {
>> > +            continue;
>> > +        }
>> > +
>> > +        if (!req->data_is_ptr) {
>> > +            if (req->dir == IOREQ_READ) {
>> > +                for (i = 0; i < req->count; i++) {
>> > +                    rw_config_req_item(xendev, req, i, &tmp);
>> > +                    req->data = tmp;
>> > +                }
>> > +            } else if (req->dir == IOREQ_WRITE) {
>> > +                for (i = 0; i < req->count; i++) {
>> > +                    tmp = req->data;
>> > +                    rw_config_req_item(xendev, req, i, &tmp);
>> > +                }
>> > +            }
>>
>> Wouldn't it be more sensible to fail req->count != 1 requests here?
>
> I'm wondering whether we'd want to handle count > 1 once we allow MMCONFIG
> accesses though. I guess it would be easier just to defer that.

For the data_is_ptr case - sure. But here? Or wait - are you thinking about REP STOS (and the relatively useless REP LODS)?

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen-hvm: stop faking I/O to access PCI config space
>>> On 18.05.18 at 15:00, wrote:
> @@ -903,6 +926,80 @@ static void cpu_ioreq_move(ioreq_t *req)
>     }
> }
>
> +static void rw_config_req_item(XenPciDevice *xendev, ioreq_t *req,

It looks to me as if both parameters could be constified.

> +                               uint32_t i, uint32_t *val)
> +{
> +    int32_t reg = req->addr;
> +    uint32_t offset = req->size * i;
> +
> +    reg += (req->df ? -1 : 1) * offset;
> +    if (reg < 0 || reg > PCI_CONFIG_SPACE_SIZE) {

Having fought a number of issues in this area in the hypervisor a couple of years back I wonder
- why reg is of signed type,
- whether overflow of the first multiplication really doesn't matter,
- whether wrapping when adding in the offset is not an issue.

I take it that the rather lax upper bound check (should imo really be reg + size > PCI_CONFIG_SPACE_SIZE [implying reg + size doesn't itself wrap], or at least reg >= PCI_CONFIG_SPACE_SIZE) is not a problem because ...

> +        if (req->dir == IOREQ_READ) {
> +            *val = ~0u;
> +        }
> +        return;
> +    }
> +
> +    if (req->dir == IOREQ_READ) {
> +        *val = pci_host_config_read_common(xendev->pci_dev, reg,
> +                                           PCI_CONFIG_SPACE_SIZE,
> +                                           req->size);
> +        trace_cpu_ioreq_config_read(req, xendev->sbdf, reg,
> +                                    req->size, *val);
> +    } else {
> +        trace_cpu_ioreq_config_write(req, xendev->sbdf, reg, req->size,
> +                                     *val);
> +        pci_host_config_write_common(xendev->pci_dev, reg,
> +                                     PCI_CONFIG_SPACE_SIZE, *val,
> +                                     req->size);
> +    }

... these called functions do full checking anyway?

> +static void cpu_ioreq_config(XenIOState *state, ioreq_t *req)
> +{
> +    uint32_t sbdf = req->addr >> 32;
> +    XenPciDevice *xendev;
> +
> +    if (req->size > sizeof(uint32_t)) {
> +        hw_error("PCI config access: bad size (%u)", req->size);

What about size 0 or 3?

> +    }
> +
> +    QLIST_FOREACH(xendev, &state->dev_list, entry) {
> +        unsigned int i;
> +        uint32_t tmp;
> +
> +        if (xendev->sbdf != sbdf) {
> +            continue;
> +        }
> +
> +        if (!req->data_is_ptr) {
> +            if (req->dir == IOREQ_READ) {
> +                for (i = 0; i < req->count; i++) {
> +                    rw_config_req_item(xendev, req, i, &tmp);
> +                    req->data = tmp;
> +                }
> +            } else if (req->dir == IOREQ_WRITE) {
> +                for (i = 0; i < req->count; i++) {
> +                    tmp = req->data;
> +                    rw_config_req_item(xendev, req, i, &tmp);
> +                }
> +            }

Wouldn't it be more sensible to fail req->count != 1 requests here?

Jan
[Qemu-devel] [PATCH] allow to build with older sed
sed's -E option may not be supported by older distros. As there's no
point using sed here at all, use just shell mechanisms to establish the
variable values, starting from the stem instead of the full target.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/Makefile
+++ b/Makefile
@@ -242,8 +242,7 @@ GENERATED_FILES += $(KEYCODEMAP_FILES)
 
 ui/input-keymap-%.c: $(KEYCODEMAP_GEN) $(KEYCODEMAP_CSV) $(SRC_PATH)/ui/Makefile.objs
 	$(call quiet-command,\
-	    src=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\1,") && \
-	    dst=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\2,") && \
+	    stem=$* && src=$${stem%-to-*} dst=$${stem#*-to-} && \
 	    test -e $(KEYCODEMAP_GEN) && \
 	    $(PYTHON) $(KEYCODEMAP_GEN) \
 		  --lang glib2 \
Re: [Qemu-devel] [RFC QEMU PATCH v3 00/10] Implement vNVDIMM for Xen HVM guest
>>> On 13.10.17 at 13:13, wrote:
> To Jan, Andrew, Stefano and Anthony,
>
> what do you think about allowing QEMU to build the entire guest ACPI
> and letting SeaBIOS load it? ACPI builder code in hvmloader is
> still there and just bypassed in this case.

Well, if that can be made to work in a non-quirky way and without loss of functionality, I'd probably be fine. I do think, however, that there's a reason this is being handled in hvmloader right now.

Jan
[Qemu-devel] xen-pci-passthrough PCI Express support? (Re: [Xen-devel] [PATCH v2 4/5] pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devices)
>>> On 03.10.17 at 02:12, <ehabk...@redhat.com> wrote:
> On Thu, Sep 28, 2017 at 10:12:34AM -0300, Eduardo Habkost wrote:
>> On Thu, Sep 28, 2017 at 02:33:57AM -0600, Jan Beulich wrote:
>>> >>> On 27.09.17 at 21:56, <ehabk...@redhat.com> wrote:
>>> > --- a/hw/xen/xen_pt.c
>>> > +++ b/hw/xen/xen_pt.c
>>> > @@ -964,6 +964,10 @@ static const TypeInfo xen_pci_passthrough_info = {
>>> >     .instance_size = sizeof(XenPCIPassthroughState),
>>> >     .instance_finalize = xen_pci_passthrough_finalize,
>>> >     .class_init = xen_pci_passthrough_class_init,
>>> > +    .interfaces = (InterfaceInfo[]) {
>>> > +        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
>>> > +        { },
>>> > +    },
>>> > };
>>>
>>> Passed through devices can be both PCI and PCIe, so following
>>> the description of the patch I don't think these can be statically
>>> given either property. Granted quite a bit of PCIe specific
>>> functionality may be missing in the Xen code ...
>>
>> This is just static data about what the device type supports, not
>> about what a given device instance really is. Deciding if the
>> device is PCIe or Conventional at runtime is out of the scope of
>> this series.
>>
>> That said, if passed through PCI Express devices are really
>> supported, it looks like this should be marked as hybrid.
>
> Can anybody confirm if PCI Express devices are really supported
> by xen-pci-passthrough?

I think I've clearly said they're supported, with some limitations.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2 4/5] pci: Add INTERFACE_CONVENTIONAL_PCI_DEVICE to Conventional PCI devices
>>> On 27.09.17 at 21:56, wrote:
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -964,6 +964,10 @@ static const TypeInfo xen_pci_passthrough_info = {
>     .instance_size = sizeof(XenPCIPassthroughState),
>     .instance_finalize = xen_pci_passthrough_finalize,
>     .class_init = xen_pci_passthrough_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +        { INTERFACE_CONVENTIONAL_PCI_DEVICE },
> +        { },
> +    },
> };

Passed through devices can be both PCI and PCIe, so following the description of the patch I don't think these can be statically given either property. Granted quite a bit of PCIe specific functionality may be missing in the Xen code ...

Jan
Re: [Qemu-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 21.09.17 at 03:12, <sstabell...@kernel.org> wrote:
> On Fri, 1 Sep 2017, Jan Beulich wrote:
>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -18,6 +18,11 @@
>>
>>  #define XEN_PT_AUTO_ASSIGN -1
>>
>> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>> +#error vMSI defines missing from domctl.h
>> +#endif
>
> All the version compatibility stuff goes to
> include/hw/xen/xen_common.h. Please move it there.

I know there's a central place, but moving there stuff that's needed only in this file seemed rather counterproductive to me - why would you want all files including that shared one to have to see these definitions? If there was a remote chance that some other file may need to make use of it, I might agree, but I don't see any such chance at all.

> We usually assume that the Xen version we are building against is
> "sane", so we don't do #error's typically.

Hmm, I can drop the #error, but to be honest I'm hesitant to do so - I've put it there intentionally.

Jan
[Qemu-devel] [PATCH] xen_disk: avoid use of g_malloc0_n()
Prefer g_new() / g_new0() to be further backwards compatible with older
glib versions. As there's no point in zeroing the allocation here (the
loop right afterwards fully initializes the memory), use the former.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/block/xen_disk.c
+++ b/hw/block/xen_disk.c
@@ -1232,7 +1232,7 @@ static int blk_connect(struct XenDevice
         return -1;
     }
 
-    domids = g_malloc0_n(blkdev->nr_ring_ref, sizeof(uint32_t));
+    domids = g_new(uint32_t, blkdev->nr_ring_ref);
     for (i = 0; i < blkdev->nr_ring_ref; i++) {
         domids[i] = blkdev->xendev.dom;
     }
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 04.09.17 at 11:23, <roger@citrix.com> wrote:
> On Mon, Sep 04, 2017 at 03:04:41AM -0600, Jan Beulich wrote:
>> >>> On 01.09.17 at 20:20, <roger@citrix.com> wrote:
>> > On Fri, Sep 01, 2017 at 10:25:42AM -0600, Jan Beulich wrote:
>> >> Xen and qemu having identical #define-s (with different names) is a
>> >> strong hint that these should have been part of the public interface
>> >> from the very start. Use them if they're available, falling back to
>> >> privately defined values only when using older headers.
>> >>
>> >> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>> >
>> > Reviewed-by: Roger Pau Monné <roger@citrix.com>
>>
>> Thanks.
>>
>> >> --- a/hw/xen/xen_pt_msi.c
>> >> +++ b/hw/xen/xen_pt_msi.c
>> >> @@ -18,6 +18,11 @@
>> >>
>> >>  #define XEN_PT_AUTO_ASSIGN -1
>> >>
>> >> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> >> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>> >
>> > XEN_DOMCTL_INTERFACE_VERSION is already 0xe (without you added
>> > defines), I guess it doesn't matter much because we only care for
>> > stable releases.
>>
>> Well, that's if you build qemu against master Xen. What about
>> people building against older Xen versions/headers?
>
> Sorry, I think I haven't explained myself clearly. What I mean is
> that if this change gets committed before the Xen side one, QEMU would
> not compile against current Xen. As said, this is only a transitory
> issue, and it's never going to be a problem in stable branches. This
> is because of the "#error" that you add.

Oh, yes, that's the case. I can't see an easy way around it, but I'm pretty sure the qemu side of it will see some further delay anyway.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: use vMSI related #define-s from public interface
>>> On 01.09.17 at 20:20, <roger@citrix.com> wrote:
> On Fri, Sep 01, 2017 at 10:25:42AM -0600, Jan Beulich wrote:
>> Xen and qemu having identical #define-s (with different names) is a
>> strong hint that these should have been part of the public interface
>> from the very start. Use them if they're available, falling back to
>> privately defined values only when using older headers.
>>
>> Signed-off-by: Jan Beulich <jbeul...@suse.com>
>
> Reviewed-by: Roger Pau Monné <roger@citrix.com>

Thanks.

>> --- a/hw/xen/xen_pt_msi.c
>> +++ b/hw/xen/xen_pt_msi.c
>> @@ -18,6 +18,11 @@
>>
>>  #define XEN_PT_AUTO_ASSIGN -1
>>
>> +#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
>> +#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
>
> XEN_DOMCTL_INTERFACE_VERSION is already 0xe (without you added
> defines), I guess it doesn't matter much because we only care for
> stable releases.

Well, that's if you build qemu against master Xen. What about people building against older Xen versions/headers?

Jan
[Qemu-devel] [PATCH] xen: use vMSI related #define-s from public interface
Xen and qemu having identical #define-s (with different names) is a
strong hint that these should have been part of the public interface
from the very start. Use them if they're available, falling back to
privately defined values only when using older headers.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -18,6 +18,11 @@
 
 #define XEN_PT_AUTO_ASSIGN -1
 
+#ifndef XEN_DOMCTL_VMSI_X86_DEST_ID_MASK
+#if XEN_DOMCTL_INTERFACE_VERSION >= 0x000e
+#error vMSI defines missing from domctl.h
+#endif
+
 /* shift count for gflags */
 #define XEN_PT_GFLAGS_SHIFT_DEST_ID  0
 #define XEN_PT_GFLAGS_SHIFT_RH       8
@@ -26,6 +31,16 @@
 #define XEN_PT_GFLAGSSHIFT_TRG_MODE  15
 #define XEN_PT_GFLAGSSHIFT_UNMASKED  16
 
+#define XEN_DOMCTL_VMSI_X86_DEST_ID_MASK (0xffU << XEN_PT_GFLAGS_SHIFT_DEST_ID)
+#define XEN_DOMCTL_VMSI_X86_RH_MASK      (1U << XEN_PT_GFLAGS_SHIFT_RH)
+#define XEN_DOMCTL_VMSI_X86_DM_MASK      (1U << XEN_PT_GFLAGS_SHIFT_DM)
+#define XEN_DOMCTL_VMSI_X86_DELIV_MASK   (7U << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
+#define XEN_DOMCTL_VMSI_X86_TRIG_MASK    (1U << XEN_PT_GFLAGSSHIFT_TRG_MODE)
+#define XEN_DOMCTL_VMSI_X86_UNMASKED     (1U << XEN_PT_GFLAGSSHIFT_UNMASKED)
+#endif
+
+#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))
+
 #define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
 
 /*
@@ -49,21 +64,18 @@ static inline uint32_t msi_ext_dest_id(u
 static uint32_t msi_gflags(uint32_t data, uint64_t addr)
 {
-    uint32_t result = 0;
-    int rh, dm, dest_id, deliv_mode, trig_mode;
+    int rh, dm, deliv_mode, trig_mode;
 
     rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
     dm = (addr >> MSI_ADDR_DEST_MODE_SHIFT) & 0x1;
-    dest_id = msi_dest_id(addr);
     deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
     trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
 
-    result = dest_id | (rh << XEN_PT_GFLAGS_SHIFT_RH)
-        | (dm << XEN_PT_GFLAGS_SHIFT_DM)
-        | (deliv_mode << XEN_PT_GFLAGSSHIFT_DELIV_MODE)
-        | (trig_mode << XEN_PT_GFLAGSSHIFT_TRG_MODE);
-
-    return result;
+    return MASK_INSR(msi_dest_id(addr), XEN_DOMCTL_VMSI_X86_DEST_ID_MASK) |
+           MASK_INSR(rh, XEN_DOMCTL_VMSI_X86_RH_MASK) |
+           MASK_INSR(dm, XEN_DOMCTL_VMSI_X86_DM_MASK) |
+           MASK_INSR(deliv_mode, XEN_DOMCTL_VMSI_X86_DELIV_MASK) |
+           MASK_INSR(trig_mode, XEN_DOMCTL_VMSI_X86_TRIG_MASK);
 }
 
 static inline uint64_t msi_addr64(XenPTMSI *msi)
@@ -173,7 +185,7 @@ static int msi_msix_update(XenPCIPassthr
         table_addr = s->msix->mmio_base_addr;
     }
 
-    gflags |= masked ? 0 : (1u << XEN_PT_GFLAGSSHIFT_UNMASKED);
+    gflags |= masked ? 0 : XEN_DOMCTL_VMSI_X86_UNMASKED;
 
     rc = xc_domain_update_msi_irq(xen_xc, xen_domid, gvec, pirq,
                                   gflags, table_addr);
Re: [Qemu-devel] [PATCH QEMU v2] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 14:19, <roger@citrix.com> wrote:
> When a MSI interrupt is bound to a guest using
> xc_domain_update_msi_irq (XEN_DOMCTL_bind_pt_irq) the interrupt is
> left masked by default.
>
> This causes problems with guests that first configure interrupts and
> clean the per-entry MSIX table mask bit and afterwards enable MSIX
> globally. In such scenario the Xen internal msixtbl handlers would not
> detect the unmasking of MSIX entries because vectors are not yet
> registered since MSIX is not enabled, and vectors would be left
> masked.
>
> Introduce a new flag in the gflags field to signal Xen whether a MSI
> interrupt should be unmasked after being bound.
>
> This also requires to track the mask register for MSI interrupts, so
> QEMU can also notify to Xen whether the MSI interrupt should be bound
> masked or unmasked.
>
> Signed-off-by: Roger Pau Monné <roger@citrix.com>
> Reported-by: Andreas Kinzler <h...@posteo.de>

Reviewed-by: Jan Beulich <jbeul...@suse.com>
Re: [Qemu-devel] [PATCH QEMU] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 12:06, <roger@citrix.com> wrote:
> On Thu, Aug 24, 2017 at 03:54:21AM -0600, Jan Beulich wrote:
>> >>> On 24.08.17 at 11:47, <roger@citrix.com> wrote:
>> > @@ -274,7 +278,7 @@ int xen_pt_msi_update(XenPCIPassthroughState *s)
>> > {
>> >     XenPTMSI *msi = s->msi;
>> >     return msi_msix_update(s, msi_addr64(msi), msi->data, msi->pirq,
>> > -                           false, 0, &msi->pirq);
>> > +                           false, 0, &msi->pirq, false);
>> > }
>>
>> I don't think this is correct when the device has mask bits.
>
> Right, I thought I modified this. I've already changed
> pt_irq_create_bind so that the original behavior is kept if the unmask
> bit is not set. In this case this should be 'true' in order to keep
> the previous behavior, which was correct for MSI.

Wouldn't you want to pass the state of the mask bit here, rather than uniformly hard coding true or false?

> It also makes me wonder whether it would be better to change the name
> of the parameter to "unmask", so that false -> no change to mask, true
> -> unconditionally unmask.

I don't care much about this aspect, but perhaps the qemu maintainers do.

Jan
Re: [Qemu-devel] [PATCH QEMU] xen/pt: allow QEMU to request MSI unmasking at bind time
>>> On 24.08.17 at 11:47, <roger@citrix.com> wrote:
> @@ -274,7 +278,7 @@ int xen_pt_msi_update(XenPCIPassthroughState *s)
> {
>     XenPTMSI *msi = s->msi;
>     return msi_msix_update(s, msi_addr64(msi), msi->data, msi->pirq,
> -                           false, 0, &msi->pirq);
> +                           false, 0, &msi->pirq, false);
> }

I don't think this is correct when the device has mask bits.

Jan
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> Stefano Stabellini <sstabell...@kernel.org> 06/23/17 8:43 PM >>>
>On Fri, 23 Jun 2017, Jan Beulich wrote:
>> >>> On 22.06.17 at 20:52, <sstabell...@kernel.org> wrote:
>> > I am happy to write the code and/or the commit message. Would a simple
>> > cast like below work to fix the security issue?
>>
>> I suppose so, but you do realize that this is _exactly_ what
>> my patch does, except you use dangerous casts while I play
>> type-safe?
>
>Why is the cast dangerous?

Casts (and especially pointer ones) are always dangerous, as they have
the potential of type changes elsewhere going unnoticed. You may have
noticed that in reviews I'm often picking at casts, because in a
majority of cases where people use them they're not needed, and hence
their use is a (latent) risk.

> Both your patch and my version of it rely on
>inner knowledge of the way the rings are laid out in memory, but at
>least my patch doesn't introduce the risk of instantiating broken
>structs. Besides, type safety checks don't add much value, given that we
>rely on the way the ring.h macros have been written, which weren't even
>imported in the QEMU project until March this year.

That's sad, as it seems to imply backporting of the change to older
qemu may then be problematic. Otoh I don't recall having had problems
with using the patch with only minor adjustments on older trees of
ours.

>These are the reasons why I prefer my version, but I would consider your
>version with clear in-code comments that warn users to avoid
>instantiating the structs because they are not ABI compliant.
>
>How do you want to proceed?

I can provide a version with comments added, but only next week. If you
feel like you want to go with your variant, I won't stand in the way,
but I also wouldn't give it my ack or alike (which you don't depend on
anyway).

Jan
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 22.06.17 at 20:52, <sstabell...@kernel.org> wrote: > On Thu, 22 Jun 2017, Jan Beulich wrote: >> >>> On 21.06.17 at 20:46, <sstabell...@kernel.org> wrote: >> > On Wed, 21 Jun 2017, Jan Beulich wrote: >> >> >>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: >> >> > On Tue, 20 Jun 2017, Jan Beulich wrote: >> >> >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w >> > only) */ >> >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> >> discard >> > */ >> >> >> }; >> >> >> -struct blkif_x86_32_response { >> >> >> -uint64_tid; /* copied from request */ >> >> >> -uint8_t operation; /* copied from request */ >> >> >> -int16_t status; /* BLKIF_RSP_??? */ >> >> >> -}; >> >> >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> >> >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> >> >> #pragma pack(pop) >> >> >> >> >> >> /* x86_64 protocol version */ >> >> >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w >> > only) */ >> >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> >> discard >> > */ >> >> >> }; >> >> >> -struct blkif_x86_64_response { >> >> >> -uint64_t __attribute__((__aligned__(8))) id; >> >> >> -uint8_t operation; /* copied from request */ >> >> >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> >> >> -}; >> >> >> >> >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> >> >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> >> >> >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> >> >> - struct blkif_common_response); >> >> >> + struct blkif_response); >> >> >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> >> >> - struct blkif_x86_32_response); >> >> >> + struct blkif_response QEMU_PACKED); >> >> > >> >> > In my test, the previous sizes and alignments of the response structs >> >> > were (on both x86_32 and x86_64): >> >> > >> >> > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 >> >> > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 >> >> > >> >> > While with these changes are now, when compiled on x86_64: >> >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 >> >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 >> >> > >> >> > when compiled on x86_32: >> >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 >> >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 >> >> > >> >> > Did I do my tests wrong? >> >> > >> >> > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the >> >> > same as #pragma pack(push, 1), causing the struct to be densely packed, >> >> > leaving no padding whatsever. >> >> > >> >> > In addition, without __attribute__((__aligned__(8))), >> >> > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. >> >> > >> >> > Am I missing something? >> >> >> >> Well, you're mixing attribute application upon structure >> >> declaration with attribute application upon structure use. It's >> >> the latter here, and hence the attribute doesn't affect >> >> structure layout at all. All it does is avoid the _containing_ >> >> 32-bit union to become 8-byte aligned (and tail padding to be >> >> inserted). >> > >> > Thanks for the explanation. 
I admit it's the first time I see the
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 21.06.17 at 20:46, <sstabell...@kernel.org> wrote: > On Wed, 21 Jun 2017, Jan Beulich wrote: >> >>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: >> > On Tue, 20 Jun 2017, Jan Beulich wrote: >> >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w > only) */ >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> discard > */ >> >> }; >> >> -struct blkif_x86_32_response { >> >> -uint64_tid; /* copied from request */ >> >> -uint8_t operation; /* copied from request */ >> >> -int16_t status; /* BLKIF_RSP_??? */ >> >> -}; >> >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> >> #pragma pack(pop) >> >> >> >> /* x86_64 protocol version */ >> >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> >> blkif_sector_t sector_number;/* start sector idx on disk (r/w > only) */ >> >> uint64_t nr_sectors; /* # of contiguous sectors to >> >> discard > */ >> >> }; >> >> -struct blkif_x86_64_response { >> >> -uint64_t __attribute__((__aligned__(8))) id; >> >> -uint8_t operation; /* copied from request */ >> >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> >> -}; >> >> >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> >> - struct blkif_common_response); >> >> + struct blkif_response); >> >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> >> - struct blkif_x86_32_response); >> >> + struct blkif_response QEMU_PACKED); >> > >> > In my test, the previous sizes and alignments of the response structs >> > were (on both x86_32 and x86_64): >> > >> > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 >> > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 >> > >> > While with these changes are now, when compiled on x86_64: >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 >> > >> > when compiled on x86_32: >> > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 >> > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 >> > >> > Did I do my tests wrong? >> > >> > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the >> > same as #pragma pack(push, 1), causing the struct to be densely packed, >> > leaving no padding whatsever. >> > >> > In addition, without __attribute__((__aligned__(8))), >> > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. >> > >> > Am I missing something? >> >> Well, you're mixing attribute application upon structure >> declaration with attribute application upon structure use. It's >> the latter here, and hence the attribute doesn't affect >> structure layout at all. All it does is avoid the _containing_ >> 32-bit union to become 8-byte aligned (and tail padding to be >> inserted). > > Thanks for the explanation. I admit it's the first time I see the > aligned attribute being used at structure usage only. 
I think it's the > first time QEMU_PACKED is used this way in QEMU too. > > Anyway, even taking that into account, things are still not completely > right: the alignment of struct blkif_x86_32_response QEMU_PACKED is 4 > bytes as you wrote, but the size of struct blkif_x86_32_response is > still 16 bytes instead of 12 bytes in my test. I suspect it worked for > you because the other member of the union (blkif_x86_32_request) is > larger than that. However, I think is not a good idea to rely on this > implementation detail. The implementation of DEFINE_RING_TYPES should be > opaque from our point of view. We shouldn't have to know that there is a > union there. I don't follow - why should we not rely on this? It i
Re: [Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
>>> On 20.06.17 at 23:48, <sstabell...@kernel.org> wrote: > On Tue, 20 Jun 2017, Jan Beulich wrote: >> @@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard { >> blkif_sector_t sector_number;/* start sector idx on disk (r/w only) >> */ >> uint64_t nr_sectors; /* # of contiguous sectors to discard >> */ >> }; >> -struct blkif_x86_32_response { >> -uint64_tid; /* copied from request */ >> -uint8_t operation; /* copied from request */ >> -int16_t status; /* BLKIF_RSP_??? */ >> -}; >> typedef struct blkif_x86_32_request blkif_x86_32_request_t; >> -typedef struct blkif_x86_32_response blkif_x86_32_response_t; >> #pragma pack(pop) >> >> /* x86_64 protocol version */ >> @@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard { >> blkif_sector_t sector_number;/* start sector idx on disk (r/w only) >> */ >> uint64_t nr_sectors; /* # of contiguous sectors to discard >> */ >> }; >> -struct blkif_x86_64_response { >> -uint64_t __attribute__((__aligned__(8))) id; >> -uint8_t operation; /* copied from request */ >> -int16_t status; /* BLKIF_RSP_??? 
*/ >> -}; >> >> typedef struct blkif_x86_64_request blkif_x86_64_request_t; >> -typedef struct blkif_x86_64_response blkif_x86_64_response_t; >> >> DEFINE_RING_TYPES(blkif_common, struct blkif_common_request, >> - struct blkif_common_response); >> + struct blkif_response); >> DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request, >> - struct blkif_x86_32_response); >> + struct blkif_response QEMU_PACKED); > > In my test, the previous sizes and alignments of the response structs > were (on both x86_32 and x86_64): > > sizeof(blkif_x86_32_response)=12 sizeof(blkif_x86_64_response)=16 > align(blkif_x86_32_response)=4 align(blkif_x86_64_response)=8 > > While with these changes are now, when compiled on x86_64: > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=16 > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=8 > > when compiled on x86_32: > sizeof(blkif_x86_32_response)=11 sizeof(blkif_x86_64_response)=12 > align(blkif_x86_32_response)=1 align(blkif_x86_64_response)=4 > > Did I do my tests wrong? > > QEMU_PACKED is not the same as #pragma pack(push, 4). In fact, it is the > same as #pragma pack(push, 1), causing the struct to be densely packed, > leaving no padding whatsever. > > In addition, without __attribute__((__aligned__(8))), > blkif_x86_64_response won't be 8 bytes aligned when built on x86_32. > > Am I missing something? Well, you're mixing attribute application upon structure declaration with attribute application upon structure use. It's the latter here, and hence the attribute doesn't affect structure layout at all. All it does is avoid the _containing_ 32-bit union to become 8-byte aligned (and tail padding to be inserted). Jan
[Qemu-devel] [PATCH] xen/disk: don't leak stack data via response ring
Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other (Linux)
backends do. Build on the fact that all response structure flavors are
actually identical (the old code did make this assumption too).

This is XSA-216.

Reported-by: Anthony Perard <anthony.per...@citrix.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Acked-by: Anthony PERARD <anthony.per...@citrix.com>
---
v2: Add QEMU_PACKED to fix handling 32-bit guests by 64-bit qemu.

--- a/hw/block/xen_blkif.h
+++ b/hw/block/xen_blkif.h
@@ -14,9 +14,6 @@
 struct blkif_common_request {
     char dummy;
 };
-struct blkif_common_response {
-    char dummy;
-};

 /* i386 protocol version */
 #pragma pack(push, 4)
@@ -36,13 +33,7 @@ struct blkif_x86_32_request_discard {
     blkif_sector_t sector_number;    /* start sector idx on disk (r/w only) */
     uint64_t       nr_sectors;       /* # of contiguous sectors to discard  */
 };
-struct blkif_x86_32_response {
-    uint64_t        id;              /* copied from request */
-    uint8_t         operation;       /* copied from request */
-    int16_t         status;          /* BLKIF_RSP_??? */
-};
 typedef struct blkif_x86_32_request blkif_x86_32_request_t;
-typedef struct blkif_x86_32_response blkif_x86_32_response_t;
 #pragma pack(pop)

 /* x86_64 protocol version */
@@ -62,20 +53,14 @@ struct blkif_x86_64_request_discard {
     blkif_sector_t sector_number;    /* start sector idx on disk (r/w only) */
     uint64_t       nr_sectors;       /* # of contiguous sectors to discard  */
 };
-struct blkif_x86_64_response {
-    uint64_t       __attribute__((__aligned__(8))) id;
-    uint8_t         operation;       /* copied from request */
-    int16_t         status;          /* BLKIF_RSP_??? */
-};

 typedef struct blkif_x86_64_request blkif_x86_64_request_t;
-typedef struct blkif_x86_64_response blkif_x86_64_response_t;

 DEFINE_RING_TYPES(blkif_common, struct blkif_common_request,
-                  struct blkif_common_response);
+                  struct blkif_response);
 DEFINE_RING_TYPES(blkif_x86_32, struct blkif_x86_32_request,
-                  struct blkif_x86_32_response);
+                  struct blkif_response QEMU_PACKED);
 DEFINE_RING_TYPES(blkif_x86_64, struct blkif_x86_64_request,
-                  struct blkif_x86_64_response);
+                  struct blkif_response);

 union blkif_back_rings {
     blkif_back_ring_t        native;
--- a/hw/block/xen_disk.c
+++ b/hw/block/xen_disk.c
@@ -769,31 +769,30 @@ static int blk_send_response_one(struct
     struct XenBlkDev  *blkdev = ioreq->blkdev;
     int               send_notify   = 0;
     int               have_requests = 0;
-    blkif_response_t  resp;
-    void              *dst;
-
-    resp.id        = ioreq->req.id;
-    resp.operation = ioreq->req.operation;
-    resp.status    = ioreq->status;
+    blkif_response_t  *resp;

     /* Place on the response ring for the relevant domain. */
     switch (blkdev->protocol) {
     case BLKIF_PROTOCOL_NATIVE:
-        dst = RING_GET_RESPONSE(&blkdev->rings.native, blkdev->rings.native.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.native,
+                                 blkdev->rings.native.rsp_prod_pvt);
         break;
     case BLKIF_PROTOCOL_X86_32:
-        dst = RING_GET_RESPONSE(&blkdev->rings.x86_32_part,
-                                blkdev->rings.x86_32_part.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.x86_32_part,
+                                 blkdev->rings.x86_32_part.rsp_prod_pvt);
         break;
     case BLKIF_PROTOCOL_X86_64:
-        dst = RING_GET_RESPONSE(&blkdev->rings.x86_64_part,
-                                blkdev->rings.x86_64_part.rsp_prod_pvt);
+        resp = RING_GET_RESPONSE(&blkdev->rings.x86_64_part,
+                                 blkdev->rings.x86_64_part.rsp_prod_pvt);
         break;
     default:
-        dst = NULL;
         return 0;
     }
-    memcpy(dst, &resp, sizeof(resp));
+
+    resp->id        = ioreq->req.id;
+    resp->operation = ioreq->req.operation;
+    resp->status    = ioreq->status;

     blkdev->rings.common.rsp_prod_pvt++;
     RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blkdev->rings.common, send_notify);
Re: [Qemu-devel] [Xen-devel] [RFC PATCH V2 1/2] xen-pt: bind/unbind interrupt remapping format MSI
>>> On 19.05.17 at 13:16,wrote: > On Thu, May 18, 2017 at 01:32:59AM -0400, Lan Tianyu wrote: >> --- a/include/hw/i386/apic-msidef.h >> +++ b/include/hw/i386/apic-msidef.h >> @@ -26,6 +26,7 @@ >> >> #define MSI_ADDR_DEST_ID_SHIFT 12 >> #define MSI_ADDR_DEST_IDX_SHIFT 4 >> -#define MSI_ADDR_DEST_ID_MASK 0x000 >> +#define MSI_ADDR_DEST_ID_MASK 0x000fff00 > > The value of MSI_ADDR_DEST_ID_MASK is changed here. I think the patch > should be: > +#define MSI_ADDR_DEST_ID_MASK 0x0000 Judging from other sources, rather the other way around - the mask needs to have further bits removed (should be 0x000ff000 afaict). Xen sources confirm this, and while Linux has the value you suggest, that contradicts #define MSI_ADDR_DEST_ID_SHIFT 12 #define MSI_ADDR_DEST_ID(dest) (((dest) << MSI_ADDR_DEST_ID_SHIFT) & \ MSI_ADDR_DEST_ID_MASK) as well as #define MSI_ADDR_EXT_DEST_ID(dest) ((dest) & 0xff00) chopping off just the low 8 bits. Jan
[Qemu-devel] [PATCH v2 3/3] xen: ignore direction in bufioreq handling
There's no way to communicate back read data, so only writes can ever
be usefully specified. Ignore the field, paving the road for eventually
re-using the bit for something else in a few (many?) years time.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Paul Durrant <paul.durr...@citrix.com>
Acked-by: Stefano Stabellini <sstabell...@kernel.org>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -997,6 +997,7 @@ static int handle_buffered_iopage(XenIOS
     memset(&req, 0x00, sizeof(req));
     req.state = STATE_IOREQ_READY;
     req.count = 1;
+    req.dir = IOREQ_WRITE;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1014,7 +1015,6 @@ static int handle_buffered_iopage(XenIOS
         req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.dir = buf_req->dir;
         req.type = buf_req->type;
         xen_rmb();
         qw = (req.size == 8);
@@ -1031,10 +1031,12 @@ static int handle_buffered_iopage(XenIOS
         handle_ioreq(state, &req);

         /* Only req.data may get updated by handle_ioreq(), albeit even that
-         * should not happen as such data would never make it to the guest.
+         * should not happen as such data would never make it to the guest (we
+         * can only usefully see writes here after all).
          */
         assert(req.state == STATE_IOREQ_READY);
         assert(req.count == 1);
+        assert(req.dir == IOREQ_WRITE);
         assert(!req.data_is_ptr);

         atomic_add(&buf_page->read_pointer, qw + 1);
[Qemu-devel] [PATCH v2 2/3] xen: slightly simplify bufioreq handling
There's no point setting fields always receiving the same value on each
iteration, as handle_ioreq() doesn't alter them anyway. Set state and
count once ahead of the loop, drop the redundant clearing of
data_is_ptr, and avoid the meaningless setting of df altogether. Also
avoid doing an unsigned long calculation of size when the field to be
initialized is only 32 bits wide (and the shift value in the range
0...3).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Paul Durrant <paul.durr...@citrix.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -995,6 +995,8 @@ static int handle_buffered_iopage(XenIOS
     }

     memset(&req, 0x00, sizeof(req));
+    req.state = STATE_IOREQ_READY;
+    req.count = 1;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1009,15 +1011,11 @@ static int handle_buffered_iopage(XenIOS
             break;
         }
         buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
-        req.size = 1UL << buf_req->size;
-        req.count = 1;
+        req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.state = STATE_IOREQ_READY;
         req.dir = buf_req->dir;
-        req.df = 1;
         req.type = buf_req->type;
-        req.data_is_ptr = 0;
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
@@ -1032,6 +1030,13 @@ static int handle_buffered_iopage(XenIOS

         handle_ioreq(state, &req);

+        /* Only req.data may get updated by handle_ioreq(), albeit even that
+         * should not happen as such data would never make it to the guest.
+         */
+        assert(req.state == STATE_IOREQ_READY);
+        assert(req.count == 1);
+        assert(!req.data_is_ptr);
+
         atomic_add(&buf_page->read_pointer, qw + 1);
     }
[Qemu-devel] [PATCH v2 1/3] xen: fix quad word bufioreq handling
We should not consume the second slot if it didn't get written yet.
Normal writers - i.e. Xen - would not update write_pointer between the
two writes, but the page may get fiddled with by the guest itself, and
we're better off avoiding to enter an infinite loop in that case.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Bail (using hw_error()) instead of just breaking the loop.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -1021,6 +1021,9 @@ static int handle_buffered_iopage(XenIOS
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
+            if (rdptr + 1 == wrptr) {
+                hw_error("Incomplete quad word buffered ioreq");
+            }
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
[Qemu-devel] [PATCH v2 0/3] xen: XSA-197 follow-ups
1: fix quad word bufioreq handling 2: slightly simplify bufioreq handling 3: ignore direction in bufioreq handling Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v2: Only patch 1 changed; see there.
Re: [Qemu-devel] [PATCH 2/3] xen: slightly simplify bufioreq handling
>>> On 23.11.16 at 19:13, <sstabell...@kernel.org> wrote:
> On Wed, 23 Nov 2016, Jan Beulich wrote:
>> There's no point setting fields always receiving the same value on each
>> iteration, as handle_ioreq() doesn't alter them anyway. Set state and
>> count once ahead of the loop, drop the redundant clearing of
>> data_is_ptr, and avoid the meaningless setting of df altogether.
>
> Why setting df is meaningless?

With count being fixed to one there's no need to update addresses, and
hence no use for knowing which direction the updates should go.

Jan
Re: [Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
>>> On 23.11.16 at 11:45, wrote:
> No, if QEMU is using a default ioreq server (i.e. the legacy way of
> doing things) then it's vulnerable to the guest messing with the rings
> and I'd forgotten that migrated-in guests from old QEMUs also end up
> using the default server, so I guess this is a worthy check to make...
> although maybe it's best to just bail if the check fails, since it
> would indicate a malicious guest.

Okay, that's basically the TBD note I have in the patch; I'll wait for
at least one of the qemu maintainers to voice their preference.

Jan
Re: [Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
>>> On 23.11.16 at 10:48, <paul.durr...@citrix.com> wrote:
>> From: Jan Beulich [mailto:jbeul...@suse.com]
>> Sent: 23 November 2016 09:24
>>
>> We should not consume the second slot if it didn't get written yet.
>> Normal writers - i.e. Xen - would not update write_pointer between the
>> two writes, but the page may get fiddled with by the guest itself, and
>> we're better off avoiding to enter an infinite loop in that case.
>>
>
> Xen would never put QEMU in this situation and the guest can't actually
> modify the page whilst it's in use, since activation of the IOREQ server
> removes the page from the guest's p2m so the premise of the patch is not
> correct.

Is that the case even for pre-ioreq-server Xen versions? The issue here
was reported together with what became XSA-197, and it's not been
assigned its own XSA just because there are other ways for a guest to
place high load on its qemu process (and there are ways to deal with
such high load situations).

Jan
[Qemu-devel] [PATCH 3/3] xen: ignore direction in bufioreq handling
There's no way to communicate back read data, so only writes can ever
be usefully specified. Ignore the field, paving the road for eventually
re-using the bit for something else in a few (many?) years time.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -997,6 +997,7 @@ static int handle_buffered_iopage(XenIOS
     memset(&req, 0x00, sizeof(req));
     req.state = STATE_IOREQ_READY;
     req.count = 1;
+    req.dir = IOREQ_WRITE;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1014,7 +1015,6 @@ static int handle_buffered_iopage(XenIOS
         req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.dir = buf_req->dir;
         req.type = buf_req->type;
         xen_rmb();
         qw = (req.size == 8);
@@ -1031,10 +1031,12 @@ static int handle_buffered_iopage(XenIOS
         handle_ioreq(state, &req);

         /* Only req.data may get updated by handle_ioreq(), albeit even that
-         * should not happen as such data would never make it to the guest.
+         * should not happen as such data would never make it to the guest (we
+         * can only usefully see writes here after all).
          */
         assert(req.state == STATE_IOREQ_READY);
         assert(req.count == 1);
+        assert(req.dir == IOREQ_WRITE);
         assert(!req.data_is_ptr);

         atomic_add(&buf_page->read_pointer, qw + 1);
[Qemu-devel] [PATCH 2/3] xen: slightly simplify bufioreq handling
There's no point setting fields always receiving the same value on each
iteration, as handle_ioreq() doesn't alter them anyway. Set state and
count once ahead of the loop, drop the redundant clearing of
data_is_ptr, and avoid the meaningless setting of df altogether. Also
avoid doing an unsigned long calculation of size when the field to be
initialized is only 32 bits wide (and the shift value in the range
0...3).

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -995,6 +995,8 @@ static int handle_buffered_iopage(XenIOS
     }

     memset(&req, 0x00, sizeof(req));
+    req.state = STATE_IOREQ_READY;
+    req.count = 1;

     for (;;) {
         uint32_t rdptr = buf_page->read_pointer, wrptr;
@@ -1009,15 +1011,11 @@ static int handle_buffered_iopage(XenIOS
             break;
         }
         buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
-        req.size = 1UL << buf_req->size;
-        req.count = 1;
+        req.size = 1U << buf_req->size;
         req.addr = buf_req->addr;
         req.data = buf_req->data;
-        req.state = STATE_IOREQ_READY;
         req.dir = buf_req->dir;
-        req.df = 1;
         req.type = buf_req->type;
-        req.data_is_ptr = 0;
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
@@ -1032,6 +1030,13 @@ static int handle_buffered_iopage(XenIOS

         handle_ioreq(state, &req);

+        /* Only req.data may get updated by handle_ioreq(), albeit even that
+         * should not happen as such data would never make it to the guest.
+         */
+        assert(req.state == STATE_IOREQ_READY);
+        assert(req.count == 1);
+        assert(!req.data_is_ptr);
+
         atomic_add(&buf_page->read_pointer, qw + 1);
     }
[Qemu-devel] [PATCH 1/3] xen: fix quad word bufioreq handling
We should not consume the second slot if it didn't get written yet.
Normal writers - i.e. Xen - would not update write_pointer between the
two writes, but the page may get fiddled with by the guest itself, and
we're better off avoiding to enter an infinite loop in that case.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
TBD: Alternatively we could call e.g. hw_error() instead.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -1021,6 +1021,9 @@ static int handle_buffered_iopage(XenIOS
         xen_rmb();
         qw = (req.size == 8);
         if (qw) {
+            if (rdptr + 1 == wrptr) {
+                break;
+            }
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
[Qemu-devel] [PATCH 0/3] xen: XSA-197 follow-ups
1: fix quad word bufioreq handling 2: slightly simplify bufioreq handling 3: ignore direction in bufioreq handling Signed-off-by: Jan Beulich <jbeul...@suse.com>
[Qemu-devel] [PATCH] xen: fix ioreq handling
Avoid double fetches and bounds check size to avoid overflowing
internal variables.

This is CVE-2016-9381 / XSA-197.

Reported-by: yanghongke <yanghon...@huawei.com>
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Stefano Stabellini <sstabell...@kernel.org>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -810,6 +810,10 @@ static void cpu_ioreq_pio(ioreq_t *req)
     trace_cpu_ioreq_pio(req, req->dir, req->df, req->data_is_ptr, req->addr,
                         req->data, req->count, req->size);

+    if (req->size > sizeof(uint32_t)) {
+        hw_error("PIO: bad size (%u)", req->size);
+    }
+
     if (req->dir == IOREQ_READ) {
         if (!req->data_is_ptr) {
             req->data = do_inp(req->addr, req->size);
@@ -846,6 +850,10 @@ static void cpu_ioreq_move(ioreq_t *req)
     trace_cpu_ioreq_move(req, req->dir, req->df, req->data_is_ptr, req->addr,
                          req->data, req->count, req->size);

+    if (req->size > sizeof(req->data)) {
+        hw_error("MMIO: bad size (%u)", req->size);
+    }
+
     if (!req->data_is_ptr) {
         if (req->dir == IOREQ_READ) {
             for (i = 0; i < req->count; i++) {
@@ -1010,11 +1018,13 @@ static int handle_buffered_iopage(XenIOS
         req.df = 1;
         req.type = buf_req->type;
         req.data_is_ptr = 0;
+        xen_rmb();
         qw = (req.size == 8);
         if (qw) {
             buf_req = &buf_page->buf_ioreq[(rdptr + 1) % IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
+            xen_rmb();
         }

         handle_ioreq(state, &req);
@@ -1045,7 +1055,11 @@ static void cpu_handle_ioreq(void *opaqu
     handle_buffered_iopage(state);
     if (req) {
-        handle_ioreq(state, req);
+        ioreq_t copy = *req;
+
+        xen_rmb();
+        handle_ioreq(state, &copy);
+        req->data = copy.data;

         if (req->state != STATE_IOREQ_INPROCESS) {
             fprintf(stderr, "Badness in I/O request ... not in service?!: "
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 12:08, <jgr...@suse.com> wrote: > On 17/06/16 11:50, Paul Durrant wrote: >>> -Original Message- >>> From: Juergen Gross [mailto:jgr...@suse.com] >>> Sent: 17 June 2016 10:46 >>> To: Paul Durrant; Jan Beulich >>> Cc: Anthony Perard; xen-devel; sstabell...@kernel.org; qemu- >>> de...@nongnu.org; kra...@redhat.com >>> Subject: Re: [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for >>> 32/64 word size mix >>> >>> On 17/06/16 11:37, Paul Durrant wrote: >>>>> -Original Message- >>>>> From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of >>> Jan >>>>> Beulich >>>>> Sent: 17 June 2016 10:26 >>>>> To: Juergen Gross >>>>> Cc: Anthony Perard; xen-devel; sstabell...@kernel.org; qemu- >>>>> de...@nongnu.org; kra...@redhat.com >>>>> Subject: Re: [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for >>>>> 32/64 word size mix >>>>> >>>>>>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: >>>>>> In case the word size of the domU and qemu running the qdisk backend >>>>>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>>>>> structure in the ring have different layouts for different word size. >>>>>> >>>>>> Correct this by copying the request structure in case of different >>>>>> word size element by element in the BLKIF_OP_DISCARD case, too. >>>>>> >>>>>> The easiest way to achieve this is to resync hw/block/xen_blkif.h with >>>>>> its original source from the Linux kernel. >>>>>> >>>>>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>>>>> --- >>>>>> V2: resync with Linux kernel version of hw/block/xen_blkif.h as >>>>>> suggested by Paul Durrant >>>>> >>>>> Oh, I didn't realize he suggested syncing with the Linux variant. >>>>> Why not with the canonical one? I have to admit that I particularly >>>>> dislike Linux'es strange union-izng, mainly because of it requiring >>>>> this myriad of __attribute__((__packed__)). 
>>>>> >>>> >>>> Yes, it's truly grotesque and such things should be blown away with >>> extreme prejudice. >>> >>> Sorry, I'm confused now. >>> >>> Do you still mandate for the resync or not? >>> >>> Resyncing with elimination of all the __packed__ stuff seems not to be >>> a proper alternative as this would require a major rework. >> >> Why? Replacing the existing horribleness with the canonical header (fixed > for style) might mean a large diff but it should be functionally the same or > something has gone very seriously wrong. If the extra part you need is not in > the canonical header then adding this as a second patch seems like a > reasonable plan. > > I think you don't realize that qemu is built using the public headers > from the Xen build environment. So there is no way to resync with the > canonical header as this isn't part of the qemu tree. Oh, right, these are just the 32- and 64-bit variants which get declared here. > The header in question is originating from the Linux one which is an > add-on of the canonical header containing the explicit 32- and 64-bit > variants of the xenbus protocol and the conversion routines between > those. > > It would be possible to add these parts to the canonical header, but > do we really want that? Definitely not, I would say. But syncing with Linux'es strange code then isn't a good idea either. (Still waiting for a qemu maintainer opinion, though.) Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 11:31, <jgr...@suse.com> wrote: > On 17/06/16 11:26, Jan Beulich wrote: >>>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: >>> In case the word size of the domU and qemu running the qdisk backend >>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>> structure in the ring have different layouts for different word size. >>> >>> Correct this by copying the request structure in case of different >>> word size element by element in the BLKIF_OP_DISCARD case, too. >>> >>> The easiest way to achieve this is to resync hw/block/xen_blkif.h with >>> its original source from the Linux kernel. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>> --- >>> V2: resync with Linux kernel version of hw/block/xen_blkif.h as >>> suggested by Paul Durrant >> >> Oh, I didn't realize he suggested syncing with the Linux variant. >> Why not with the canonical one? I have to admit that I particularly >> dislike Linux'es strange union-izng, mainly because of it requiring >> this myriad of __attribute__((__packed__)). > > What would be gained by syncing with the canonical one? The part to be > modified is available in the Linux variant only. In, as said, a rather disgusting (in my opinion) way. I'm not the maintainer anyway, so you don't have to convince me. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v2] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 17.06.16 at 11:14, <jgr...@suse.com> wrote: > In case the word size of the domU and qemu running the qdisk backend > differ BLKIF_OP_DISCARD will not work reliably, as the request > structure in the ring have different layouts for different word size. > > Correct this by copying the request structure in case of different > word size element by element in the BLKIF_OP_DISCARD case, too. > > The easiest way to achieve this is to resync hw/block/xen_blkif.h with > its original source from the Linux kernel. > > Signed-off-by: Juergen Gross > --- > V2: resync with Linux kernel version of hw/block/xen_blkif.h as > suggested by Paul Durrant Oh, I didn't realize he suggested syncing with the Linux variant. Why not with the canonical one? I have to admit that I particularly dislike Linux'es strange union-izing, mainly because of it requiring this myriad of __attribute__((__packed__)). Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 16.06.16 at 13:04, <jgr...@suse.com> wrote: > On 16/06/16 12:54, Jan Beulich wrote: >>>>> On 16.06.16 at 12:02, <jgr...@suse.com> wrote: >>> In case the word size of the domU and qemu running the qdisk backend >>> differ BLKIF_OP_DISCARD will not work reliably, as the request >>> structure in the ring have different layouts for different word size. >>> >>> Correct this by copying the request structure in case of different >>> word size element by element in the BLKIF_OP_DISCARD case, too. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >> >> With the indentation (tabs vs blanks) fixed > > Hmm, qemu coding style is to use blanks. I could: > a) leave the patch as is (changed lines indented with blanks) > b) use tabs to indent (style of the modified file up to now) > c) change the style of the file in this patch > d) change the style of the file in a separate patch > > Any preferences? I'd say a) is not an option, but beyond that I think this needs to be answered by the qemu maintainers. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH] xen: fix qdisk BLKIF_OP_DISCARD for 32/64 word size mix
>>> On 16.06.16 at 12:02, <jgr...@suse.com> wrote: > In case the word size of the domU and qemu running the qdisk backend > differ BLKIF_OP_DISCARD will not work reliably, as the request > structure in the ring have different layouts for different word size. > > Correct this by copying the request structure in case of different > word size element by element in the BLKIF_OP_DISCARD case, too. > > Signed-off-by: Juergen Gross <jgr...@suse.com> With the indentation (tabs vs blanks) fixed Reviewed-by: Jan Beulich <jbeul...@suse.com> And maybe ... > @@ -82,7 +98,7 @@ static inline void blkif_get_x86_32_req(blkif_request_t > *dst, blkif_x86_32_reque > /* Prevent the compiler from using src->... instead. */ > barrier(); > if (dst->operation == BLKIF_OP_DISCARD) { > - struct blkif_request_discard *s = (void *)src; > +struct blkif_x86_32_request_discard *s = (void *)src; ... it would also be worth adding const here and ... > @@ -105,7 +121,7 @@ static inline void blkif_get_x86_64_req(blkif_request_t > *dst, blkif_x86_64_reque > /* Prevent the compiler from using src->... instead. */ > barrier(); > if (dst->operation == BLKIF_OP_DISCARD) { > - struct blkif_request_discard *s = (void *)src; > +struct blkif_x86_64_request_discard *s = (void *)src; ... here. Jan
Re: [Qemu-devel] [Xen-devel] [PULL 0/2] xen-20160613-tag
>>> On 13.06.16 at 12:54, wrote: > The following changes since commit a93c1bdf0bd4689287094ddb2aae3dc907da3535: > > Merge remote-tracking branch 'remotes/kraxel/tags/pull-ui-20160610-1' into > staging (2016-06-10 15:47:17 +0100) > > are available in the git repository at: > > > git://xenbits.xen.org/people/sstabellini/qemu-dm.git tags/xen-20160613-tag > > for you to fetch changes up to 88c16567d2cd23da328787187910b013ee43ebca: > > Introduce "xen-load-devices-state" (2016-06-13 11:50:53 +0100) > > > Xen 2016/06/13 > > > Anthony PERARD (1): > exec: Fix qemu_ram_block_from_host for Xen > > Wen Congyang (1): > Introduce "xen-load-devices-state" May I ask what the disposition of "xen/blkif: avoid double access to any shared ring request fields" is? I don't think I've seen a pull req, and it doesn't appear to be in master. Jan
[Qemu-devel] [PATCH] xen/blkif: avoid double access to any shared ring request fields
Commit f9e98e5d7a ("xen/blkif: Avoid double access to src->nr_segments") didn't go far enough: src->operation is also being used twice. And nothing was done to prevent the compiler from using the source side of the copy done by blk_get_request() (granted that's very unlikely). Move the barrier()s up, and add another one to blk_get_request(). Note that for completing XSA-155, the barrier() getting added to blk_get_request() would suffice, and hence the changes to xen_blkif.h are more like just cleanup. And since, as said, the unpatched code getting compiled to something vulnerable is very unlikely (and not observed in practice), this isn't being viewed as a new security issue. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- a/hw/block/xen_blkif.h +++ b/hw/block/xen_blkif.h @@ -79,14 +79,14 @@ static inline void blkif_get_x86_32_req( dst->handle = src->handle; dst->id = src->id; dst->sector_number = src->sector_number; - if (src->operation == BLKIF_OP_DISCARD) { + /* Prevent the compiler from using src->... instead. */ + barrier(); + if (dst->operation == BLKIF_OP_DISCARD) { struct blkif_request_discard *s = (void *)src; struct blkif_request_discard *d = (void *)dst; d->nr_sectors = s->nr_sectors; return; } - /* prevent the compiler from optimizing the code and using src->nr_segments instead */ - barrier(); if (n > dst->nr_segments) n = dst->nr_segments; for (i = 0; i < n; i++) @@ -102,14 +102,14 @@ static inline void blkif_get_x86_64_req( dst->handle = src->handle; dst->id = src->id; dst->sector_number = src->sector_number; - if (src->operation == BLKIF_OP_DISCARD) { + /* Prevent the compiler from using src->... instead. 
*/ + barrier(); + if (dst->operation == BLKIF_OP_DISCARD) { struct blkif_request_discard *s = (void *)src; struct blkif_request_discard *d = (void *)dst; d->nr_sectors = s->nr_sectors; return; } - /* prevent the compiler from optimizing the code and using src->nr_segments instead */ - barrier(); if (n > dst->nr_segments) n = dst->nr_segments; for (i = 0; i < n; i++) --- a/hw/block/xen_disk.c +++ b/hw/block/xen_disk.c @@ -679,6 +679,8 @@ static int blk_get_request(struct XenBlk RING_GET_REQUEST(&blkdev->rings.x86_64_part, rc)); break; } +/* Prevent the compiler from accessing the on-ring fields instead. */ +barrier(); return 0; }
Re: [Qemu-devel] [Xen-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 17:56, wrote: > For the original issue here, could the flag be exposed as a > XEN_SYSCTL_PHYSCAP_ Yes, I think it could, albeit calling this a "capability" or "feature" seems odd (since really the original behavior was bogus/buggy). But - with sysctl not being a stable interface, is making qemu use this actually a good idea? I.e. won't we paint ourselves into the corner of needing to write compatibility wrappers in qemu whenever XEN_SYSCTL_physinfo (or its libxc wrapper) changes? Jan
Re: [Qemu-devel] [Xen-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 17:56, wrote: > On Fri, 2015-12-11 at 16:44 +, Stefano Stabellini wrote: >> >> It is not possible to do this at runtime. I think we should do this at >> compile time because in any case it is not supported to run a QEMU built >> for a given Xen version on a different Xen version. > > I am currently working pretty hard to make this possible in the future, it > would be a shame to add another reason it wasn't possible at this stage. And I don't think it's not possible - if anything, the infrastructure to do so is just missing. I'm definitely not going to make this a build time check, since I deem it very wrong namely when considering --with-system-qemu (as in that case there shouldn't be any dependency on the precise Xen tools versions in use - using plural intentionally here to point out the possibility of multiple ones being present). Jan
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 11.12.15 at 15:33, <stefano.stabell...@eu.citrix.com> wrote: > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >>> On 07.12.15 at 15:56, <stefano.stabell...@eu.citrix.com> wrote: >> > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >> >>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: >> >> > On Tue, 24 Nov 2015, Jan Beulich wrote: >> >> >> TBD: We probably need to deal with running on an older hypervisor. I >> >> >> can't, however, immediately see a way for qemu to find out. >> >> > >> >> > Actually QEMU has already an infrastructure to detect the hypervisor >> >> > version at compile time, see include/hw/xen/xen_common.h. You could >> >> > #define the right emu_mask depending on the hypervisor. >> >> >> >> We don't want compile time detection here, but runtime one. >> > >> > I guess the issue is that a fix was backported to Xen that changed its >> > behaviour in past releases, right? >> >> No, we shouldn't try to guess whether this is present in any pre-4.6 >> hypervisors; we should simply accept that maskable MSI is not >> available for guests there, because ... >> >> > Is there a way to detect the presence of this fix in Xen, by invoking an >> > hypercall and checking the returned values and error numbers? >> >> ... there's nothing to (reliably) probe here. This really is just an >> implementation detail of the hypervisor, and hence a version check >> is all we have available. > > In that case, I think we should stay on the safe side, and only expose > the masking capability (only take into effects the changes that this > patch makes) for Xen >= 4.7. > > What do you think? That's what I suggested, with the hope of getting a hint where the necessary infrastructure is (somehow I didn't find any run time version dependent code to clone from). Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v3 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 09.12.15 at 16:55, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 8 Dec 2015, Jan Beulich wrote: >> The remaining log message in pci_msix_write() is wrong, as there guest >> behavior may only appear to be wrong: For one, the old logic didn't >> take the mask-all bit into account. And then this shouldn't depend on >> host device state (i.e. the host may have masked the entry without the >> guest having done so). Plus these writes shouldn't be dropped even when >> an entry gets unmasked. Instead, if they can't be made take effect >> right away, they should take effect on the next unmasking or enabling >> operation - the specification explicitly describes such caching >> behavior. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> > > I have applied this patch and the first 2 of the series to my "next" > branch. I have to think a bit more about the fourth. Thanks (I guess you mean this one and the _next_ 2 of this series). Jan
[Qemu-devel] [PATCH v3 1/4] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as there guest behavior may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v3: Add comment to xen_pt_msix_update_one(). v2: Pass original vec_ctrl to xen_pt_msix_update_one() instead of (ab)using latch[]. --- a/hw/xen/xen_pt_config_init.c +++ b/hw/xen/xen_pt_config_init.c @@ -1499,6 +1499,8 @@ static int xen_pt_msixctrl_reg_write(Xen xen_pt_msix_disable(s); } +s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL; + debug_msix_enabled_old = s->msix->enabled; s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE); if (s->msix->enabled != debug_msix_enabled_old) { --- a/hw/xen/xen_pt.h +++ b/hw/xen/xen_pt.h @@ -187,13 +187,13 @@ typedef struct XenPTMSIXEntry { int pirq; uint64_t addr; uint32_t data; -uint32_t vector_ctrl; +uint32_t latch[4]; bool updated; /* indicate whether MSI ADDR or DATA is updated */ -bool warned; /* avoid issuing (bogus) warning more than once */ } XenPTMSIXEntry; typedef struct XenPTMSIX { uint32_t ctrl_offset; bool enabled; +bool maskall; int total_entries; int bar_index; uint64_t table_base; --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -25,6 +25,7 @@ #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15 +#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)] /* * Helpers @@ -314,7 +315,8 @@ static int msix_set_enable(XenPCIPassthr enabled); } -static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr) +static int xen_pt_msix_update_one(XenPCIPassthroughState 
*s, int entry_nr, + uint32_t vec_ctrl) { XenPTMSIXEntry *entry = NULL; int pirq; @@ -332,6 +334,19 @@ static int xen_pt_msix_update_one(XenPCI pirq = entry->pirq; +/* + * Update the entry addr and data to the latest values only when the + * entry is masked or they are all masked, as required by the spec. + * Addr and data changes while the MSI-X entry is unmasked get deferred + * until the next masked -> unmasked transition. + */ +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +entry->addr = entry->latch(LOWER_ADDR) | + ((uint64_t)entry->latch(UPPER_ADDR) << 32); +entry->data = entry->latch(DATA); +} + rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr, entry->pirq == XEN_PT_UNASSIGNED_PIRQ); if (rc) { @@ -357,7 +372,7 @@ int xen_pt_msix_update(XenPCIPassthrough int i; for (i = 0; i < msix->total_entries; i++) { -xen_pt_msix_update_one(s, i); +xen_pt_msix_update_one(s, i, msix->msix_entry[i].latch(VECTOR_CTRL)); } return 0; @@ -406,35 +421,15 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -return e->addr & UINT32_MAX; -case PCI_MSIX_ENTRY_UPPER_ADDR: -return e->addr >> 32; -case PCI_MSIX_ENTRY_DATA: -return e->data; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -return e->vector_ctrl; -default: -return 0; -} +return !(offset % sizeof(*e->latch)) + ? 
e->latch[offset / sizeof(*e->latch)] : 0; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val; -break; -case PCI_MSIX_ENTRY_UPPER_ADDR: -e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX); -break; -case PCI_MSIX_ENTRY_DATA: -e->data = val; -break; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -e->vector_ctrl = val; -break; +if (!(offset % sizeof(*e->latch))) +{ +e->latch[offset / sizeof(*e->latch)] = val; } } @@ -454,39 +449,26 @@ static void pci_msix_write(void *opaque, offset = addr % PCI_MSIX_ENTRY_SIZE; if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) { -const volatile uint32_t *vec_ctrl; - if (get_entry_value(entry, offset) == val &&
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 07.12.15 at 15:56, <stefano.stabell...@eu.citrix.com> wrote: > On Mon, 7 Dec 2015, Jan Beulich wrote: >> >>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: >> > On Tue, 24 Nov 2015, Jan Beulich wrote: >> >> Now that the hypervisor intercepts all config space writes and monitors >> >> changes to the masking flags, this undoes the main effect of the >> >> XSA-129 fix, exposing the masking capability again to guests. > > Could you please mention the relevant commit ids in Xen? It's just one (which I've now added a reference to), unless you want all the prereqs listed. > What happens if QEMU, with this change, is running on top of an older > Xen that doesn't intercepts all config space writes? The security issue would resurface. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> >> >> --- >> >> TBD: We probably need to deal with running on an older hypervisor. I >> >> can't, however, immediately see a way for qemu to find out. >> > >> > Actually QEMU has already an infrastructure to detect the hypervisor >> > version at compile time, see include/hw/xen/xen_common.h. You could >> > #define the right emu_mask depending on the hypervisor. >> >> We don't want compile time detection here, but runtime one. > > I guess the issue is that a fix was backported to Xen that changed its > behaviour in past releases, right? No, we shouldn't try to guess whether this is present in any pre-4.6 hypervisors; we should simply accept that maskable MSI is not available for guests there, because ... > Is there a way to detect the presence of this fix in Xen, by invoking an > hypercall and checking the returned values and error numbers? ... there's nothing to (reliably) probe here. This really is just an implementation detail of the hypervisor, and hence a version check is all we have available. Jan
Re: [Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 07.12.15 at 13:41, wrote: > I know that in your opinion is superfluous, nonetheless could you please > add 2-3 lines of in-code comment right here, to explain what you are > doing with the check? Something like: > > /* > * Update the entry addr and data to the latest values only when the > * entry is masked or they are all masked, as required by the spec. > * Addr and data changes while the MSI-X entry is unmasked will be > * delayed until the next masking->unmasking. > */ Btw, will it be okay to just resend this one patch as v3, or do I need to resend the whole series (the rest of which didn't change)? Jan
Re: [Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
>>> On 07.12.15 at 13:41, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 24 Nov 2015, Jan Beulich wrote: >> @@ -332,6 +334,13 @@ static int xen_pt_msix_update_one(XenPCI >> >> pirq = entry->pirq; > > I know that in your opinion is superfluous, nonetheless could you please > add 2-3 lines of in-code comment right here, to explain what you are > doing with the check? Something like: > > /* > * Update the entry addr and data to the latest values only when the > * entry is masked or they are all masked, as required by the spec. > * Addr and data changes while the MSI-X entry is unmasked will be > * delayed until the next masking->unmasking. > */ > > >> +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || >> +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { >> +entry->addr = entry->latch(LOWER_ADDR) | >> + ((uint64_t)entry->latch(UPPER_ADDR) << 32); >> +entry->data = entry->latch(DATA); >> +} Adding a comment like this is fine of course, namely if it helps acceptance. Jan
Re: [Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
>>> On 07.12.15 at 13:45, <stefano.stabell...@eu.citrix.com> wrote: > On Tue, 24 Nov 2015, Jan Beulich wrote: >> Now that the hypervisor intercepts all config space writes and monitors >> changes to the masking flags, this undoes the main effect of the >> XSA-129 fix, exposing the masking capability again to guests. >> >> Signed-off-by: Jan Beulich <jbeul...@suse.com> >> --- >> TBD: We probably need to deal with running on an older hypervisor. I >> can't, however, immediately see a way for qemu to find out. > > Actually QEMU has already an infrastructure to detect the hypervisor > version at compile time, see include/hw/xen/xen_common.h. You could > #define the right emu_mask depending on the hypervisor. We don't want compile time detection here, but runtime one. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
>>> Tue, 16 Jun 2015 15:48:16 +0100 Stefano Stabellini wrote: >On Tue, 16 Jun 2015, Jan Beulich wrote: >> >>> On 16.06.15 at 15:35, Stefano Stabellini wrote: >> > On Fri, 5 Jun 2015, Jan Beulich wrote: I'm sorry for getting back to this only now. >> >> @@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI >> >> >> >> pirq = entry->pirq; >> >> >> >> +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || >> >> +(entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { >> > >> > I admit I am having difficulties understanding the full purpose of these >> > checks. Please add a comment on them. >> >> The comment would (pointlessly imo) re-state what the code already >> says: >> >> > I guess the intention is only to make changes using the latest values, >> > the ones in entry->latch, when the right conditions are met, otherwise >> > keep using the old values. Is that right? >> > >> > In that case, don't we want to use the latest values on MASKBIT -> >> > !MASKBIT transitions? In general when unmasking? >> >> This is what we want. And with that, the questions you ask further >> down should be answered too: The function gets invoked with the >> pre-change mask flag state in ->latch[], and updates the values >> used for actually setting up when that one has the entry masked >> (or mask-all is set). The actual new value gets written to ->latch[] >> after the call. > >I think this logic is counter-intuitive and prone to confuse the reader. >This change doesn't make sense on its own: when one will read >xen_pt_msix_update_one, won't be able to understand the function without >checking the call sites. > >Could we turn it around to be more obvious? Here check if >!(entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT) and below only >call xen_pt_msix_update_one on MASKBIT -> !MASKBIT transactions? > >Or something like that? 
That would maybe be doable with just pci_msix_write() as a caller, but not with the one in xen_pt_msix_update() (where we have no transition of the mask bit from set to clear). Jan
[Qemu-devel] [PATCH v2 0/4] xen/pass-through: XSA-120, 128...131 follow-up
1: MSI-X: latch MSI-X table writes 2: MSI-X: really enforce alignment 3: pass-through: correctly deal with RW1C bits 4: [RFC] MSI: re-expose masking capability Signed-off-by: Jan Beulich <jbeul...@suse.com>
[Qemu-devel] [PATCH v2 2/4] xen/MSI-X: really enforce alignment
The way the generic infrastructure works the intention of not allowing unaligned accesses can't be achieved by simply setting .unaligned to false. The benefit is that we can now replace the conditionals in {get,set}_entry_value() by assert()-s. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Stefano Stabellini <stefano.stabell...@eu.citrix.com> --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -415,16 +415,14 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -return !(offset % sizeof(*e->latch)) - ? e->latch[offset / sizeof(*e->latch)] : 0; +assert(!(offset % sizeof(*e->latch))); +return e->latch[offset / sizeof(*e->latch)]; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -if (!(offset % sizeof(*e->latch))) -{ -e->latch[offset / sizeof(*e->latch)] = val; -} +assert(!(offset % sizeof(*e->latch))); +e->latch[offset / sizeof(*e->latch)] = val; } static void pci_msix_write(void *opaque, hwaddr addr, @@ -488,6 +486,12 @@ static uint64_t pci_msix_read(void *opaq } } +static bool pci_msix_accepts(void *opaque, hwaddr addr, + unsigned size, bool is_write) +{ +return !(addr & (size - 1)); +} + static const MemoryRegionOps pci_msix_ops = { .read = pci_msix_read, .write = pci_msix_write, @@ -496,7 +500,13 @@ static const MemoryRegionOps pci_msix_op .min_access_size = 4, .max_access_size = 4, .unaligned = false, +.accepts = pci_msix_accepts }, +.impl = { +.min_access_size = 4, +.max_access_size = 4, +.unaligned = false +} }; int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
[Qemu-devel] [PATCH v2 1/4] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as there guest behavior may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior. Signed-off-by: Jan Beulich <jbeul...@suse.com> --- v2: Pass original vec_ctrl to xen_pt_msix_update_one() instead of (ab)using latch[]. --- a/hw/xen/xen_pt_config_init.c +++ b/hw/xen/xen_pt_config_init.c @@ -1499,6 +1499,8 @@ static int xen_pt_msixctrl_reg_write(Xen xen_pt_msix_disable(s); } +s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL; + debug_msix_enabled_old = s->msix->enabled; s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE); if (s->msix->enabled != debug_msix_enabled_old) { --- a/hw/xen/xen_pt.h +++ b/hw/xen/xen_pt.h @@ -187,13 +187,13 @@ typedef struct XenPTMSIXEntry { int pirq; uint64_t addr; uint32_t data; -uint32_t vector_ctrl; +uint32_t latch[4]; bool updated; /* indicate whether MSI ADDR or DATA is updated */ -bool warned; /* avoid issuing (bogus) warning more than once */ } XenPTMSIXEntry; typedef struct XenPTMSIX { uint32_t ctrl_offset; bool enabled; +bool maskall; int total_entries; int bar_index; uint64_t table_base; --- a/hw/xen/xen_pt_msi.c +++ b/hw/xen/xen_pt_msi.c @@ -25,6 +25,7 @@ #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15 +#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)] /* * Helpers @@ -314,7 +315,8 @@ static int msix_set_enable(XenPCIPassthr enabled); } -static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr) +static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr, + uint32_t vec_ctrl) { 
XenPTMSIXEntry *entry = NULL; int pirq; @@ -332,6 +334,13 @@ static int xen_pt_msix_update_one(XenPCI pirq = entry->pirq; +if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall || +(vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +entry->addr = entry->latch(LOWER_ADDR) | + ((uint64_t)entry->latch(UPPER_ADDR) << 32); +entry->data = entry->latch(DATA); +} + rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr, entry->pirq == XEN_PT_UNASSIGNED_PIRQ); if (rc) { @@ -357,7 +366,7 @@ int xen_pt_msix_update(XenPCIPassthrough int i; for (i = 0; i < msix->total_entries; i++) { -xen_pt_msix_update_one(s, i); +xen_pt_msix_update_one(s, i, msix->msix_entry[i].latch(VECTOR_CTRL)); } return 0; @@ -406,35 +415,15 @@ int xen_pt_msix_update_remap(XenPCIPasst static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -return e->addr & UINT32_MAX; -case PCI_MSIX_ENTRY_UPPER_ADDR: -return e->addr >> 32; -case PCI_MSIX_ENTRY_DATA: -return e->data; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -return e->vector_ctrl; -default: -return 0; -} +return !(offset % sizeof(*e->latch)) + ? 
e->latch[offset / sizeof(*e->latch)] : 0; } static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val) { -switch (offset) { -case PCI_MSIX_ENTRY_LOWER_ADDR: -e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val; -break; -case PCI_MSIX_ENTRY_UPPER_ADDR: -e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX); -break; -case PCI_MSIX_ENTRY_DATA: -e->data = val; -break; -case PCI_MSIX_ENTRY_VECTOR_CTRL: -e->vector_ctrl = val; -break; +if (!(offset % sizeof(*e->latch))) +{ +e->latch[offset / sizeof(*e->latch)] = val; } } @@ -454,39 +443,26 @@ static void pci_msix_write(void *opaque, offset = addr % PCI_MSIX_ENTRY_SIZE; if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) { -const volatile uint32_t *vec_ctrl; - if (get_entry_value(entry, offset) == val && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) { return; } +entry->updated = true; +} else if (msix->enabled && entry->updated && + !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) { +const volatile uint32_t *vec_ctrl; + /* * I
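The latch[] indexing used by the patch above replaces a four-way switch
with plain offset arithmetic: every dword-aligned offset within the
16-byte MSI-X table entry maps straight to one latch slot. A minimal
standalone sketch of that behavior (simplified types and offset names,
not the actual QEMU structures):

```c
#include <assert.h>
#include <stdint.h>

/* Dword offsets within a 16-byte MSI-X table entry (names illustrative,
 * mirroring PCI_MSIX_ENTRY_* in the patch). */
enum {
    ENTRY_LOWER_ADDR  = 0,
    ENTRY_UPPER_ADDR  = 4,
    ENTRY_DATA        = 8,
    ENTRY_VECTOR_CTRL = 12,
};

typedef struct {
    uint32_t latch[4];   /* one slot per dword of the entry */
} entry_t;

/* Any dword-aligned offset indexes latch[] directly; misaligned
 * offsets read as 0 and are ignored on write, as in the patch. */
static uint32_t get_entry_value(entry_t *e, int offset)
{
    return !(offset % sizeof(*e->latch))
           ? e->latch[offset / sizeof(*e->latch)] : 0;
}

static void set_entry_value(entry_t *e, int offset, uint32_t val)
{
    if (!(offset % sizeof(*e->latch))) {
        e->latch[offset / sizeof(*e->latch)] = val;
    }
}
```

The `latch(fld)` macro in the patch performs the same
`offset / sizeof(uint32_t)` computation at compile time for the named
offsets.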
[Qemu-devel] [PATCH RFC v2 4/4] xen/MSI: re-expose masking capability
Now that the hypervisor intercepts all config space writes and monitors
changes to the masking flags, this undoes the main effect of the XSA-129
fix, exposing the masking capability again to guests.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
TBD: We probably need to deal with running on an older hypervisor. I
can't, however, immediately see a way for qemu to find out.

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -1333,7 +1333,7 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .init_val  = 0x,
         .res_mask  = 0xFE00,
         .ro_mask   = 0x018E,
-        .emu_mask  = 0x017E,
+        .emu_mask  = 0x007E,
         .init      = xen_pt_msgctrl_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_msgctrl_reg_write,
@@ -1387,8 +1387,8 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .offset    = PCI_MSI_MASK_32,
         .size      = 4,
         .init_val  = 0x,
-        .ro_mask   = 0x,
-        .emu_mask  = 0x,
+        .ro_mask   = 0x,
+        .emu_mask  = 0x,
         .init      = xen_pt_mask_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_long_reg_write,
@@ -1398,8 +1398,8 @@ static XenPTRegInfo xen_pt_emu_reg_msi[]
         .offset    = PCI_MSI_MASK_64,
         .size      = 4,
         .init_val  = 0x,
-        .ro_mask   = 0x,
-        .emu_mask  = 0x,
+        .ro_mask   = 0x,
+        .emu_mask  = 0x,
         .init      = xen_pt_mask_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_long_reg_write,
[Qemu-devel] [PATCH v2 3/4] xen/pass-through: correctly deal with RW1C bits
Introduce yet another mask for them, so that the generic routine can
handle them, at once rendering xen_pt_pmcsr_reg_write() superfluous.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/hw/xen/xen_pt.h
+++ b/hw/xen/xen_pt.h
@@ -113,6 +113,8 @@ struct XenPTRegInfo {
     uint32_t res_mask;
     /* reg read only field mask (ON:RO/ROS, OFF:other) */
     uint32_t ro_mask;
+    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
+    uint32_t rw1c_mask;
     /* reg emulate field mask (ON:emu, OFF:passthrough) */
     uint32_t emu_mask;
     xen_pt_conf_reg_init init;
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -179,7 +179,8 @@ static int xen_pt_byte_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -197,7 +198,8 @@ static int xen_pt_word_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -215,7 +217,8 @@ static int xen_pt_long_reg_write(XenPCIP
     *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -633,6 +636,7 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .init_val  = 0x,
         .res_mask  = 0x0007,
         .ro_mask   = 0x06F8,
+        .rw1c_mask = 0xF900,
         .emu_mask  = 0x0010,
         .init      = xen_pt_status_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
@@ -944,6 +948,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .size      = 2,
         .res_mask  = 0xFFC0,
         .ro_mask   = 0x0030,
+        .rw1c_mask = 0x000F,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_word_reg_write,
@@ -964,6 +969,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .offset    = PCI_EXP_LNKSTA,
         .size      = 2,
         .ro_mask   = 0x3FFF,
+        .rw1c_mask = 0xC000,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
         .u.w.write = xen_pt_word_reg_write,
@@ -1000,27 +1006,6 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
  * Power Management Capability
  */
 
-/* write Power Management Control/Status register */
-static int xen_pt_pmcsr_reg_write(XenPCIPassthroughState *s,
-                                  XenPTReg *cfg_entry, uint16_t *val,
-                                  uint16_t dev_value, uint16_t valid_mask)
-{
-    XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
-    uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
-    uint16_t *data = cfg_entry->ptr.half_word;
-
-    /* modify emulate register */
-    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    *data = XEN_PT_MERGE_VALUE(*val, *data, writable_mask);
-
-    /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~PCI_PM_CTRL_PME_STATUS,
-                              throughable_mask);
-
-    return 0;
-}
-
 /* Power Management Capability reg static information table */
 static XenPTRegInfo xen_pt_emu_reg_pm[] = {
     /* Next Pointer reg */
@@ -1051,11 +1036,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size      = 2,
         .init_val  = 0x0008,
         .res_mask  = 0x00F0,
-        .ro_mask   = 0xE10C,
+        .ro_mask   = 0x610C,
+        .rw1c_mask = 0x8000,
         .emu_mask  = 0x810B,
         .init      = xen_pt_common_reg_init,
         .u.w.read  = xen_pt_word_reg_read,
-        .u.w.write = xen_pt_pmcsr_reg_write,
+        .u.w.write = xen_pt_word_reg_write,
     },
     {
         .size = 0,
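The reason the generic write routines mask the RW1C bits out of the
device value can be shown with a small model: when a guest write is
merged with the value read back from hardware, any latched
write-1-to-clear status bit that is carried through as 1 would be
cleared by the subsequent write. A hedged sketch (the helper name and
the simplified merge macro are illustrative, not QEMU's code; the mask
values mimic the PCI status register entry above):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified version of XEN_PT_MERGE_VALUE: take bits from 'val' where
 * 'mask' is set, from 'data' elsewhere. */
#define MERGE_VALUE(val, data, mask) (((val) & (mask)) | ((data) & ~(mask)))

/* Build the value written back to the device: bits the guest didn't
 * target come from dev_value, but RW1C bits must be forced to 0 so a
 * latched status bit isn't accidentally cleared by writing it as 1. */
static uint16_t make_dev_write(uint16_t guest_val, uint16_t dev_value,
                               uint16_t throughable_mask, uint16_t rw1c_mask)
{
    return MERGE_VALUE(guest_val, dev_value & ~rw1c_mask, throughable_mask);
}
```

With rw1c_mask = 0xF900 (the status register's RW1C bits) a latched
0x8000 in dev_value is written back as 0 and therefore preserved;
without the mask it would be written back as 1 and cleared by the
device.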
Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle 64-bit bar with more than 4G size
On 12.08.15 at 09:10, feng...@intel.com wrote:
>> -----Original Message-----
>> From: qemu-devel-bounces+feng.wu=intel@nongnu.org
>> [mailto:qemu-devel-bounces+feng.wu=intel@nongnu.org] On Behalf Of
>> Jan Beulich
>> Sent: Wednesday, August 12, 2015 2:59 PM
>> To: Wu, Feng
>> Cc: xen-de...@lists.xensource.com; qemu-devel@nongnu.org;
>> stefano.stabell...@eu.citrix.com
>> Subject: Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle
>> 64-bit bar with more than 4G size
>>
>> On 05.08.15 at 04:02, feng...@intel.com wrote:
>>> @@ -491,8 +474,9 @@ static int xen_pt_bar_reg_write(XenPCIPassthroughState
>>> *s, XenPTReg *cfg_entry,
>>>          bar_ro_mask = XEN_PT_BAR_IO_RO_MASK | (r_size - 1);
>>>          break;
>>>      case XEN_PT_BAR_FLAG_UPPER:
>>> +        r = &d->io_regions[index - 1];
>>
>> Perhaps worth an assert(index > 0)?
>
> No problem, I will add it. BTW, do you have any other comments about
> this patch? If no, I am going to send out the new version with these
> changes.

No - everything else looks to make sense (but continues to need testing).

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH v1] xenpt: Properly handle 64-bit bar with more than 4G size
On 05.08.15 at 04:02, feng...@intel.com wrote:
> @@ -491,8 +474,9 @@ static int xen_pt_bar_reg_write(XenPCIPassthroughState
> *s, XenPTReg *cfg_entry,
>          bar_ro_mask = XEN_PT_BAR_IO_RO_MASK | (r_size - 1);
>          break;
>      case XEN_PT_BAR_FLAG_UPPER:
> +        r = &d->io_regions[index - 1];

Perhaps worth an assert(index > 0)?

Jan
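For context on the upper-half handling under review: with a
power-of-two BAR size, which bits of each 32-bit half are read-only
falls out of (size - 1), and for BARs larger than 4G the size boundary
reaches into the upper dword. A minimal sketch under that assumption
(the helper name is illustrative, not QEMU's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* For a 64-bit memory BAR of power-of-two size, the address bits below
 * the size boundary are read-only.  For sizes > 4G that boundary
 * extends into the upper dword, so its RO mask is the high half of
 * (size - 1); for smaller BARs the upper dword is fully writable. */
static uint32_t upper_bar_ro_mask(uint64_t size)
{
    return (uint32_t)((size - 1) >> 32);
}
```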
[Qemu-devel] [PATCH v3] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed,
allowing it to atomically canonicalize both pointers when both have
gone through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) went in already (commit b7007bc6f9).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v3: Check for Xen 4.6 (configure and build time).
v2: Adjust description.

--- a/configure
+++ b/configure
@@ -1882,6 +1882,33 @@ int main(void) {
   xc_gnttab_open(NULL, 0);
   xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
   xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
+  xc_hvm_create_ioreq_server(xc, 0, HVM_IOREQSRV_BUFIOREQ_ATOMIC, NULL);
+  return 0;
+}
+EOF
+      compile_prog $xen_libs
+    then
+    xen_ctrl_version=460
+    xen=yes
+
+  # Xen 4.5
+  elif
+      cat > $TMPC <<EOF
+#include <xenctrl.h>
+#include <xenstore.h>
+#include <stdint.h>
+#include <xen/hvm/hvm_info_table.h>
+#if !defined(HVM_MAX_VCPUS)
+# error HVM_MAX_VCPUS not defined
+#endif
+int main(void) {
+  xc_interface *xc;
+  xs_daemon_open();
+  xc = xc_interface_open(0, 0, 0);
+  xc_hvm_set_mem_type(0, 0, HVMMEM_ram_ro, 0, 0);
+  xc_gnttab_open(NULL, 0);
+  xc_domain_add_to_physmap(0, 0, XENMAPSPACE_gmfn, 0, 0);
+  xc_hvm_inject_msi(xc, 0, 0xf000, 0x);
   xc_hvm_create_ioreq_server(xc, 0, 0, NULL);
   return 0;
 }
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -963,19 +963,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -987,15 +998,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- a/include/hw/xen/xen_common.h
+++ b/include/hw/xen/xen_common.h
@@ -186,6 +186,15 @@ static inline int xen_get_vmport_regs_pf
 }
 #endif
 
+/* Xen before 4.6 */
+#if CONFIG_XEN_CTRL_INTERFACE_VERSION < 460
+
+#ifndef HVM_IOREQSRV_BUFIOREQ_ATOMIC
+#define HVM_IOREQSRV_BUFIOREQ_ATOMIC 2
+#endif
+
+#endif
+
 /* Xen before 4.5 */
 #if CONFIG_XEN_CTRL_INTERFACE_VERSION < 450
 
@@ -370,7 +379,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
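Why a 511-slot ring breaks on 32-bit pointer wrap can be checked with
plain arithmetic: since 511 does not divide 2^32 (2^32 mod 511 = 32),
the slot index computed from the free-running pointer jumps when the
pointer wraps, so requests straddling the wrap no longer land in
consecutive slots - hence the need for the hypervisor to canonicalize
both pointers before that can happen. A small sketch:

```c
#include <assert.h>
#include <stdint.h>

#define IOREQ_BUFFER_SLOT_NUM 511   /* slots per page; not a power of two */

/* Slot lookup as done in handle_buffered_iopage(): the free-running
 * 32-bit pointer is reduced modulo the (non-power-of-two) slot count. */
static uint32_t slot(uint32_t pointer)
{
    return pointer % IOREQ_BUFFER_SLOT_NUM;
}
```

At the wrap, pointer 0xFFFFFFFF maps to slot 31, but its successor
(pointer 0) maps to slot 0 instead of the expected slot 32.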
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 23.07.15 at 12:04, stefano.stabell...@eu.citrix.com wrote:
> On Thu, 23 Jul 2015, Jan Beulich wrote:
>> On 22.07.15 at 19:24, stefano.stabell...@eu.citrix.com wrote:
>>> I'll queue this change up for the next QEMU release cycle.
>>
>> Thanks - v2 (with the adjusted description) just sent. It would
>> however be nice for our variant in 4.6 to also gain this, perhaps
>> independent of upstream's schedule.
>
> Is the hypervisor side going to go in 4.6? Do we need a freeze
> exception or do we consider this a bug fix?

As said in the updated description in v2, the hypervisor side went in
already some time ago.

Jan
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 22.07.15 at 19:24, stefano.stabell...@eu.citrix.com wrote:
> I'll queue this change up for the next QEMU release cycle.

Thanks - v2 (with the adjusted description) just sent. It would however
be nice for our variant in 4.6 to also gain this, perhaps independent
of upstream's schedule.

Jan
[Qemu-devel] [PATCH v2] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed,
allowing it to atomically canonicalize both pointers when both have
gone through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) went in already (commit b7007bc6f9).

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Adjust description.

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -957,19 +957,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -981,15 +992,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- unstable.orig/qemu/upstream/include/hw/xen/xen_common.h 2015-03-31 15:58:04.0 +0200
+++ unstable/qemu/upstream/include/hw/xen/xen_common.h 2015-06-15 08:24:28.0 +0200
@@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 22.07.15 at 16:50, stefano.stabell...@eu.citrix.com wrote:
> On Wed, 22 Jul 2015, Jan Beulich wrote:
>>>> --- a/xen-hvm.c
>>>> +++ b/xen-hvm.c
>>>> @@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
>>>>
>>>>  static int handle_buffered_iopage(XenIOState *state)
>>>>  {
>>>> +    buffered_iopage_t *buf_page = state->buffered_io_page;
>>>>      buf_ioreq_t *buf_req = NULL;
>>>>      ioreq_t req;
>>>>      int qw;
>>>>
>>>> -    if (!state->buffered_io_page) {
>>>> +    if (!buf_page) {
>>>>          return 0;
>>>>      }
>>>>
>>>>      memset(&req, 0x00, sizeof(req));
>>>>
>>>> -    while (state->buffered_io_page->read_pointer !=
>>>> -           state->buffered_io_page->write_pointer) {
>>>> -        buf_req = &state->buffered_io_page->buf_ioreq[
>>>> -            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
>>>> +    for (;;) {
>>>> +        uint32_t rdptr = buf_page->read_pointer, wrptr;
>>>> +
>>>> +        xen_rmb();
>>>
>>> We don't need this barrier.
>>
>> How would we not? We need to make sure we read in this order
>> read_pointer, write_pointer, and read_pointer again (in the
>> comparison). Only that way we can be certain to hold a matching pair
>> in hands at the end.
>
> See below
>
>>>> +        wrptr = buf_page->write_pointer;
>>>> +        xen_rmb();
>>>> +        if (rdptr != buf_page->read_pointer) {
>>>
>>> I think you have to use atomic_read to be sure that the second read
>>> to buf_page->read_pointer is up to date and not optimized away.
>>
>> No, suppressing such an optimization is an intended (side) effect of
>> the barriers used.
>
> I understand what you are saying but I am not sure if your assumption
> is correct. Can the compiler optimize the second read anyway?

No, it can't, due to the barrier.

>>> But I think that it would be best to simply use atomic_read to read
>>> both pointers at once using uint64_t as type, so you are sure to get
>>> a consistent view and there is no need for this check.
>>
>> But I'm specifically trying to avoid e.g. a locked cmpxchg8b here on
>> ix86.
>
> OK, but we don't need cmpxchg8b, just:
>
> #define atomic_read(ptr) (*(__typeof__(*ptr) volatile*) (ptr))

This only gives the impression of being atomic when the type is wider
than a machine word. There's no ix86 (i.e. 32-bit) instruction other
than LOCK CMPXCHG8B (and possibly MMX/SSE/AVX ones) allowing to
atomically read a 64-bit quantity.

> something like:
>
> for (;;) {
>     uint64_t ptrs;
>     uint32_t rdptr, wrptr;
>
>     ptrs = atomic_read((uint64_t*)&state->buffered_io_page->read_pointer);
>     rdptr = (uint32_t)ptrs;
>     wrptr = *(((uint32_t*)&ptrs) + 1);
>
>     if (rdptr == wrptr) {
>         continue;
>     }
>
>     [work]
>
>     atomic_add(&buf_page->read_pointer, qw + 1);
> }
>
> it would work, wouldn't it?

Looks like so, but the amount of casts alone makes me wish for no-one
to consider this (but I agree that the casts could be taken care of).
Still I think (as btw done elsewhere) the lock free access is
preferable.

Jan
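The lock-free alternative Jan defends can be isolated into a small,
self-contained sketch: read the consumer index, then the producer
index, then re-read the consumer index, and retry if it moved in
between, so a matching pair is obtained without any wide atomic read.
This is a simplified model (the barrier macro stands in for xen_rmb(),
and the single-threaded test below only exercises the happy path):

```c
#include <stdint.h>

/* Compiler barrier standing in for xen_rmb() in this sketch; on x86 a
 * compiler barrier is what the read ordering here needs. */
#define rmb() __asm__ __volatile__("" ::: "memory")

struct ring {
    volatile uint32_t read_pointer;
    volatile uint32_t write_pointer;
};

/* Obtain a consistent (read, write) pointer pair without a 64-bit
 * atomic: re-read read_pointer after write_pointer and retry if it
 * changed, so *rd and *wr are guaranteed to form a matching snapshot. */
static void read_ptr_pair(struct ring *r, uint32_t *rd, uint32_t *wr)
{
    for (;;) {
        uint32_t rdptr = r->read_pointer, wrptr;

        rmb();
        wrptr = r->write_pointer;
        rmb();
        if (rdptr != r->read_pointer) {
            continue;   /* consumer index moved under us; retry */
        }
        *rd = rdptr;
        *wr = wrptr;
        return;
    }
}
```

The retry only matters when another thread advances read_pointer
between the two reads; the barriers keep the compiler from reordering
or merging the three loads.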
Re: [Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
On 21.07.15 at 15:54, stefano.stabell...@eu.citrix.com wrote:
> On Thu, 18 Jun 2015, Jan Beulich wrote:
>> The number of slots per page being 511 (i.e. not a power of two)
>> means that the (32-bit) read and write indexes going beyond 2^32
>> will likely disturb operation.
>>
>> The hypervisor side gets I/O req server creation extended so we can
>> indicate that we're using suitable atomic accesses where needed (not
>> all accesses to the two pointers really need to be atomic), allowing
>> it to atomically canonicalize both pointers when both have gone
>> through at least one cycle.
>
> The description is a bit terse: which accesses don't really need to
> be atomic?

Perhaps I should drop this part - I more or less copied the hypervisor
side's commit message, and the above really applies to e.g.

    if ( (pg->ptrs.write_pointer - pg->ptrs.read_pointer) >=
         (IOREQ_BUFFER_SLOT_NUM - qw) )

in hypervisor code.

>> --- a/xen-hvm.c
>> +++ b/xen-hvm.c
>> @@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
>>
>>  static int handle_buffered_iopage(XenIOState *state)
>>  {
>> +    buffered_iopage_t *buf_page = state->buffered_io_page;
>>      buf_ioreq_t *buf_req = NULL;
>>      ioreq_t req;
>>      int qw;
>>
>> -    if (!state->buffered_io_page) {
>> +    if (!buf_page) {
>>          return 0;
>>      }
>>
>>      memset(&req, 0x00, sizeof(req));
>>
>> -    while (state->buffered_io_page->read_pointer !=
>> -           state->buffered_io_page->write_pointer) {
>> -        buf_req = &state->buffered_io_page->buf_ioreq[
>> -            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
>> +    for (;;) {
>> +        uint32_t rdptr = buf_page->read_pointer, wrptr;
>> +
>> +        xen_rmb();
>
> We don't need this barrier.

How would we not? We need to make sure we read in this order
read_pointer, write_pointer, and read_pointer again (in the
comparison). Only that way we can be certain to hold a matching pair
in hands at the end.

>> +        wrptr = buf_page->write_pointer;
>> +        xen_rmb();
>> +        if (rdptr != buf_page->read_pointer) {
>
> I think you have to use atomic_read to be sure that the second read
> to buf_page->read_pointer is up to date and not optimized away.

No, suppressing such an optimization is an intended (side) effect of
the barriers used.

> But I think that it would be best to simply use atomic_read to read
> both pointers at once using uint64_t as type, so you are sure to get
> a consistent view and there is no need for this check.

But I'm specifically trying to avoid e.g. a locked cmpxchg8b here on
ix86.

>>          handle_ioreq(state, &req);
>>
>> -        xen_mb();
>> -        state->buffered_io_page->read_pointer += qw ? 2 : 1;
>> +        atomic_add(&buf_page->read_pointer, qw + 1);
>
> I couldn't get specific info on the type of barrier implemented by
> __sync_fetch_and_add, so I cannot tell for sure whether removing
> xen_mb() is appropriate. Do you have a link? I suspect that given the
> strong guarantees of the x86 architecture we'll be fine. I would be
> less confident if this code was used on other archs.

gcc.pdf, in the section covering them, says "In most cases, these
built-in functions are considered a full barrier. [...] Further,
instructions are issued as necessary to prevent the processor from
speculating loads across the operation and from queuing stores after
the operation." Details on individual builtins subsequently tell the
exceptions from this general rule, but the one used here is not among
the exceptions.

>> --- a/include/hw/xen/xen_common.h
>> +++ b/include/hw/xen/xen_common.h
>> @@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
>>  static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
>>                                            ioservid_t *ioservid)
>>  {
>> -    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
>> +    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
>> +                                        ioservid);
>
> I am concerned that passing 2 instead of 1 could break older
> hypervisors. However handle_bufioreq was never defined as a true
> boolean, so maybe it is OK?

Indeed I'm building on it only having done == 0 or != 0 checks.

> The alternative would be to create a xen_xc_hvm_create_ioreq_server
> versioned wrapper in include/hw/xen/xen_common.h.

Which is what I aimed at avoiding.

Jan
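The full-barrier property of `__sync_fetch_and_add` being discussed is
what lets the patch drop the explicit xen_mb() before advancing the
consumer index. A minimal sketch of that replacement (the helper name
is illustrative; the patch's `atomic_add` wraps the same builtin):

```c
#include <stdint.h>

/* Advance the consumer index with a single full-barrier RMW; a
 * quad-word (8-byte) request occupies two slots.  Per the GCC manual,
 * __sync_fetch_and_add is a full barrier, so no separate fence is
 * needed before the store becomes visible. */
static uint32_t consume(volatile uint32_t *read_pointer, int qw)
{
    return __sync_fetch_and_add(read_pointer, qw ? 2 : 1);
}
```

The return value is the pre-increment index, matching fetch-and-add
semantics.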
[Qemu-devel] [PATCH] xen/HVM: atomically access pointers in bufioreq handling
The number of slots per page being 511 (i.e. not a power of two) means
that the (32-bit) read and write indexes going beyond 2^32 will likely
disturb operation.

The hypervisor side gets I/O req server creation extended so we can
indicate that we're using suitable atomic accesses where needed (not
all accesses to the two pointers really need to be atomic), allowing
it to atomically canonicalize both pointers when both have gone
through at least one cycle.

The Xen side counterpart (which is not a functional prereq to this
change, albeit a build one) can be found at e.g.
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02996.html

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -981,19 +981,30 @@ static void handle_ioreq(XenIOState *sta
 
 static int handle_buffered_iopage(XenIOState *state)
 {
+    buffered_iopage_t *buf_page = state->buffered_io_page;
     buf_ioreq_t *buf_req = NULL;
     ioreq_t req;
     int qw;
 
-    if (!state->buffered_io_page) {
+    if (!buf_page) {
         return 0;
     }
 
     memset(&req, 0x00, sizeof(req));
 
-    while (state->buffered_io_page->read_pointer !=
-           state->buffered_io_page->write_pointer) {
-        buf_req = &state->buffered_io_page->buf_ioreq[
-            state->buffered_io_page->read_pointer % IOREQ_BUFFER_SLOT_NUM];
+    for (;;) {
+        uint32_t rdptr = buf_page->read_pointer, wrptr;
+
+        xen_rmb();
+        wrptr = buf_page->write_pointer;
+        xen_rmb();
+        if (rdptr != buf_page->read_pointer) {
+            continue;
+        }
+        if (rdptr == wrptr) {
+            break;
+        }
+        buf_req = &buf_page->buf_ioreq[rdptr % IOREQ_BUFFER_SLOT_NUM];
         req.size = 1UL << buf_req->size;
         req.count = 1;
         req.addr = buf_req->addr;
@@ -1005,15 +1016,14 @@ static int handle_buffered_iopage(XenIOS
         req.data_is_ptr = 0;
         qw = (req.size == 8);
         if (qw) {
-            buf_req = &state->buffered_io_page->buf_ioreq[
-                (state->buffered_io_page->read_pointer + 1) % IOREQ_BUFFER_SLOT_NUM];
+            buf_req = &buf_page->buf_ioreq[(rdptr + 1) %
+                                           IOREQ_BUFFER_SLOT_NUM];
             req.data |= ((uint64_t)buf_req->data) << 32;
         }
 
         handle_ioreq(state, &req);
 
-        xen_mb();
-        state->buffered_io_page->read_pointer += qw ? 2 : 1;
+        atomic_add(&buf_page->read_pointer, qw + 1);
     }
 
     return req.count;
--- a/include/hw/xen/xen_common.h
+++ b/include/hw/xen/xen_common.h
@@ -370,7 +370,8 @@ static inline void xen_unmap_pcidev(XenX
 static inline int xen_create_ioreq_server(XenXC xc, domid_t dom,
                                           ioservid_t *ioservid)
 {
-    int rc = xc_hvm_create_ioreq_server(xc, dom, 1, ioservid);
+    int rc = xc_hvm_create_ioreq_server(xc, dom, HVM_IOREQSRV_BUFIOREQ_ATOMIC,
+                                        ioservid);
 
     if (rc == 0) {
         trace_xen_ioreq_server_create(*ioservid);
Re: [Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
On 16.06.15 at 16:03, stefano.stabell...@eu.citrix.com wrote:
> On Fri, 5 Jun 2015, Jan Beulich wrote:
>> +    } else if (s->msix->enabled) {
>> +        if (!(value & PCI_MSIX_FLAGS_ENABLE)) {
>> +            xen_pt_msix_disable(s);
>> +            s->msix->enabled = false;
>> +        } else if (!s->msix->maskall) {
>
> Why are you changing the state of s->msix->maskall here? This is the
> value & PCI_MSIX_FLAGS_ENABLE case, nothing to do with maskall, right?

We're at an "else if" inside an "else if" here. The only case left
after the if() still seen above is that value has
PCI_MSIX_FLAGS_MASKALL set.

>> +            s->msix->maskall = true;
>> +            xen_pt_msix_maskall(s, true);
>> +        }
>>      }
>>
>> -    debug_msix_enabled_old = s->msix->enabled;
>> -    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
>>      if (s->msix->enabled != debug_msix_enabled_old) {
>>          XEN_PT_LOG(s->dev, "%s MSI-X\n",
>>                     s->msix->enabled ? "enable" : "disable");
>>      }
>>
>> +    xen_host_pci_get_word(s->real_device, s->msix->ctrl_offset, &dev_value);
>
> I have to say that I don't like the asymmetry between reading and
> writing PCI config registers. If writes go via hypercalls, reads
> should go via hypercalls too.

We're not doing any cfg register write via hypercalls (not here, and
not elsewhere). What is being replaced by the patch are writes to two
bits which happen to live in PCI config space. Plus, reading directly,
and doing writes via hypercall only when really needed, would still be
the right thing from a performance pov.

>> --- a/qemu/upstream/hw/xen/xen_pt_msi.c
>> +++ b/qemu/upstream/hw/xen/xen_pt_msi.c
>> @@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
>>          return -1;
>>      }
>>
>> -    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
>> -                           enabled);
>
> Would it make sense to remove msi_msix_enable completely to avoid any
> further mistakes?

Perhaps, yes. I think I actually had suggested so quite a while back.
But I don't see myself wasting much more time on this, ehm, code.

Jan
Re: [Qemu-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
On 16.06.15 at 15:35, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

@@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI
 
     pirq = entry->pirq;
 
+    if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall ||
+        (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {

I admit I am having difficulties understanding the full purpose of these checks. Please add a comment on them.

The comment would (pointlessly imo) re-state what the code already says:

I guess the intention is only to make changes using the latest values, the ones in entry->latch, when the right conditions are met, otherwise keep using the old values. Is that right? In that case, don't we want to use the latest values on MASKBIT -> !MASKBIT transitions? In general when unmasking?

This is what we want. And with that, the questions you ask further down should be answered too: The function gets invoked with the pre-change mask flag state in ->latch[], and updates the values used for actually setting up when that one has the entry masked (or mask-all is set). The actual new value gets written to ->latch[] after the call.

@@ -444,39 +432,28 @@ static void pci_msix_write(void *opaque,
 
     offset = addr % PCI_MSIX_ENTRY_SIZE;
 
     if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        const volatile uint32_t *vec_ctrl;
-
         if (get_entry_value(entry, offset) == val
             && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) {
             return;
         }
 
+        entry->updated = true;
+    } else if (msix->enabled && entry->updated &&
+               !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        const volatile uint32_t *vec_ctrl;
+
         /*
          * If Xen intercepts the mask bit access, entry->vec_ctrl may not be
          * up-to-date. Read from hardware directly.
          */
         vec_ctrl = s->msix->phys_iomem_base + entry_nr * PCI_MSIX_ENTRY_SIZE
             + PCI_MSIX_ENTRY_VECTOR_CTRL;
+        set_entry_value(entry, offset, *vec_ctrl);

Why are you calling set_entry_value with the hardware vec_ctrl value? It doesn't look correct to me.

In any case, if you wanted to do it, shouldn't you just set/unset PCI_MSIX_ENTRY_CTRL_MASKBIT instead of the whole *vec_ctrl?

The comment above the code explains it: What we have stored locally may not reflect reality, as we may not have seen all writes (and this indeed isn't just a "may"). And if our cached value isn't valid anymore, why would we not want to update all of it, rather than just the mask bit?

-        if (msix->enabled && !(*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            if (!entry->warned) {
-                entry->warned = true;
-                XEN_PT_ERR(&s->dev, "Can't update msix entry %d since MSI-X is"
-                           " already enabled.\n", entry_nr);
-            }
-            return;
-        }
-
-        entry->updated = true;
+        xen_pt_msix_update_one(s, entry_nr);

Shouldn't we call xen_pt_msix_update_one only if (*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)? In other words, only when we see a MASKBIT -> !MASKBIT transition?

The combination of the !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT) check in the if() surrounding this call and the (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT) check inside the function guarantee just that (i.e. the function invocation is benign in the other case, as entry->addr/entry->data would remain unchanged).

Jan
Re: [Qemu-devel] [PATCH 4/6] xen/pass-through: correctly deal with RW1C bits
On 16.06.15 at 16:19, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

@@ -1016,11 +1002,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size       = 2,
         .init_val   = 0x0008,
         .res_mask   = 0x00F0,
-        .ro_mask    = 0xE10C,
+        .ro_mask    = 0x610C,
+        .rw1c_mask  = 0x8000,
         .emu_mask   = 0x810B,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
-        .u.w.write  = xen_pt_pmcsr_reg_write,
+        .u.w.write  = xen_pt_word_reg_write,
     },
     {
         .size = 0,

I can see that the code change doesn't cause a change in behaviour for PCI_PM_CTRL, but it does for PCI_STATUS, PCI_EXP_DEVSTA and PCI_EXP_LNKSTA. Please explain why in the commit message.

I'm not sure what you're after in a patch titled "correctly deal with RW1C bits".

Jan
Re: [Qemu-devel] [PATCH 6/6] xen/pass-through: constify some static data
On 16.06.15 at 16:27, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote: This is done indirectly by adjusting two typedefs and helps emphasizing that the respective tables aren't supposed to be modified at runtime (as they may be shared between devices). Signed-off-by: Jan Beulich jbeul...@suse.com Reviewed-by: Stefano Stabellini stefano.stabell...@eu.citrix.com Perhaps patch 5 and 6 could go in right away, without waiting for the issues on the other ones addressed? After all, it may well be that patch 1 will be dropped following the discussion on the hypervisor side patch series... Jan
Re: [Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
On 16.06.15 at 16:56, stefano.stabell...@eu.citrix.com wrote: On Tue, 16 Jun 2015, Jan Beulich wrote: On 16.06.15 at 16:03, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote:

--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
         return -1;
     }
 
-    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
-                           enabled);

Would it make sense to remove msi_msix_enable completely to avoid any further mistakes?

Perhaps, yes. I think I actually had suggested so quite a while back. But I don't see myself wasting much more time on this, ehm, code.

Isn't it just a matter of removing msi_msix_enable?

It has another caller, xen_pt_msi_set_enable(). If we went down the route of what this patch does, then MSI's enable bit should ultimately also be driven through a hypercall, and that would then be the point where the function would naturally disappear. But as said, it looks like we're intending to go a different route anyway.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 10.06.15 at 13:43, m...@redhat.com wrote: On Wed, Jun 10, 2015 at 08:00:55AM +0100, Jan Beulich wrote: On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote: while function 0 has 0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable) and function 1 0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable) Does the sibling device have a BAR overlapping the address? No, its BARs are fully separate. Judging from the above, it's actually function 1's BAR 2 that is accessed? Are you saying disabling memory on function 0 breaks function 2 somehow? Oops, just noticed I didn't reply to this. Not sure how you come to that conclusion - the ITP log says that the bad write is to 0xca25004c. Look at the bridge configuration though - looks like it will only forward transactions to 0xca21. Anything else will be terminated by the bridge itself. Right, that's what I had pointed out before, but then again things work prior to the guest shutting down (and in the absence of any guest), even if I can't explain why or how. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 10.06.15 at 13:46, m...@redhat.com wrote: I don't really know. The idea would be that device is not designed for memory to be disabled when it's active, and starts behaving in broken ways. But you recall that we know the origin of the offending write is in the hypervisor, and we also know that the device isn't active at the point the problem occurs? Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote: while function 0 has 0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable) and function 1 0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable) 0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable) 0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable) Does the sibling device have a BAR overlapping the address? No, its BARs are fully separate. Judging from the above, it's actually function 1's BAR 2 that is accessed? Are you saying disabling memory on function 0 breaks function 2 somehow? Oops, just noticed I didn't reply to this. Not sure how you come to that conclusion - the ITP log says that the bad write is to 0xca25004c. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:30, m...@redhat.com wrote: What happens if you disable SERR# in the command register of 83:00.1? We've just been told that with SERR not enabled in any of the sibling endpoints the NMI still occurs. Not really surprising with us now assuming that it's the root port that generates the SERR in response to the UR coming back from an endpoint. But otoh in conflict with what we see in the ITP log (where SERR clearly is enabled on the endpoints, and the information we got says that it _is_ disabled, not that they had to do anything to disable it). 2. Has a driver initialized this endpoint? What if you don't load a driver before sending the transaction resulting in the UR? They now tried at least without loading a driver in Dom0, which didn't make a difference. Did you mean to also not load any driver in the guest? Otoh I can't really see what difference this makes, as the cleanup after the guest inside the hypervisor doesn't really depend much on whether it actively used any of the MSI-X entries. Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 07.06.15 at 08:23, m...@redhat.com wrote: On Mon, Apr 20, 2015 at 04:32:12PM +0200, Michael S. Tsirkin wrote: On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote: On 20.04.15 at 15:43, m...@redhat.com wrote: On Mon, Apr 13, 2015 at 01:51:06PM +0100, Jan Beulich wrote: On 13.04.15 at 14:47, m...@redhat.com wrote:

Can you check device capabilities register, offset 0x4 within pci express capability structure? Bit 15 is "Role-Based Error Reporting". Is it set? The spec says:

15 On platforms where robust error handling and PC-compatible Configuration Space probing is required, it is suggested that software or firmware have the Unsupported Request Reporting Enable bit Set for Role-Based Error Reporting Functions, but clear for 1.0a Functions. Software or firmware can distinguish the two classes of Functions by examining the Role-Based Error Reporting bit in the Device Capabilities register.

Yes, that bit is set.

Curiouser and curiouser. So with functions that do support Role-Based Error Reporting, we have this:

With device Functions implementing Role-Based Error Reporting, setting the Unsupported Request Reporting Enable bit will not interfere with PC-compatible Configuration Space probing, assuming that the severity for UR is left at its default of non-fatal. However, setting the Unsupported Request Reporting Enable bit will enable the Function to report UR errors detected with posted Requests, helping avoid this case for potential silent data corruption.

I still don't see what the PC-compatible config space probing has to do with our issue.

I'm not sure, but I think it's listed here because it causes a ton of URs when device scan probes unimplemented functions. Did firmware reconfigure this device to report URs as fatal errors then?

No, the Unsupported Request Error Severity flag is zero.

OK, that's the correct configuration, so how come the box crashes when there's a UR then?

Ping - any update on this?

Not really.
All we concluded so far is that _maybe_ the bridge, upon seeing the UR, generates a Master Abort, rendering the whole thing fatal. Otoh the respective root port also has
- Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much),
- Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set).

So can we chalk this up to hardware bugs on a specific box?

I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug.

Jan
Re: [Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
On 05.06.15 at 18:41, stefano.stabell...@eu.citrix.com wrote: On Fri, 5 Jun 2015, Jan Beulich wrote: On 05.06.15 at 13:32, stefano.stabell...@eu.citrix.com wrote:

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&

The change looks good and in line with the commit message. But this if statement looks like acrobatic circus to me now.

I think the alternative of splitting it up into multiple if()-s would not be any better readable.

Would you be OK if I rewrote the statement as follows?

if ((index >= 0) && (val != 0)) {
    uint32_t vu;

    if (index == PCI_ROM_SLOT) {
        vu = val | (uint32_t)~PCI_ROM_ADDRESS_MASK;
    } else {
        vu = val;
    }

    if ((vu != XEN_PT_BAR_ALLF) &&
        (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
        XEN_PT_WARN(d, "Guest attempt to set address to unused "
                    "Base Address Register. (addr: 0x%02x, len: %d)\n",
                    addr, len);
    }
}

Actually I agree that this indeed looks better. If I were to make the change, I'd do

if ((index >= 0) && (val != 0)) {
    uint32_t vu = val;

    if (index == PCI_ROM_SLOT) {
        vu |= (uint32_t)~PCI_ROM_ADDRESS_MASK;
    }

    if ((vu != XEN_PT_BAR_ALLF) && ...

though. But if you're going to do the edit without wanting me to re-submit, I'll certainly leave this to you. Just let me know which way you prefer it to be handled.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 10:09, malcolm.cross...@citrix.com wrote: On 08/06/15 08:42, Jan Beulich wrote:

Not really. All we concluded so far is that _maybe_ the bridge, upon seeing the UR, generates a Master Abort, rendering the whole thing fatal. Otoh the respective root port also has
- Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much),
- Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set).

Disabling the Memory decode in the command register could also result in a completion timeout on the root port issuing a transaction towards the PCI device in question. PCIE completion timeouts can be escalated to Fatal AER errors which trigger system firmware to inject NMI's into the host.

And how does all that play with PC compatibility (where writes into no-where get dropped, and reads from no-where get all ones returned)? Remember - we're talking about CPU side accesses here.

Here is an example AER setup for a PCIE root port. You can see UnsupReq errors are masked and so do not trigger errors. CmpltTO (completion timeout) errors are not masked and the errors are treated as Fatal because the corresponding bit in the Uncorrectable Severity register is set.

Capabilities: [148 v1] Advanced Error Reporting
    UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
    UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol+
    UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
    CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
    CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
    AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-

A root port completion timeout will also result in the master abort bit being set. Typically system firmware clears the error in the AER registers after it's processed it. So the operating system may not be able to determine what error triggered the NMI in the first place.

Right, but in the case at hand we have an ITP log available, which increases the hope that we see a reasonably complete picture.

So can we chalk this up to hardware bugs on a specific box?

I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug.

I believe the correct behaviour is happening, but a PCIE completion timeout is occurring instead of an unsupported request.

Might it be that with the supposedly correct device returning UR the root port reissues the request to the sibling device, which then fails it in a more dramatic way (albeit the sibling's Uncorrectable Error Status Register also has only Unsupported Request Error Status set)?

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:36, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 10:03:18AM +0100, Jan Beulich wrote: On 08.06.15 at 10:09, malcolm.cross...@citrix.com wrote: I believe the correct behaviour is happening but a PCIE completion timeout is occurring instead of a unsupported request. Might it be that with the supposedly correct device returning UR the root port reissues the request to the sibling device, which then fails it in a more dramatic way (albeit the sibling's Uncorrectable Error Status Register also has only Unsupported Request Error Status set)? Isn't the sibling a function on the same device? Yes. And is the request causing the UR a memory read? No, it's a write according to the ITP log. If so doesn't this use address routing? What does it mean that the request is to the sibling device then? The way I expressed things above may simply be due to my limited knowledge of PCIe terminology: I simply don't know (and can't find this spelled out in the spec) what the root port's behavior ought to be when a transaction comes in with an address that's within one of its base/limit regions, but none of the endpoints can fulfill the request. 
But you asking this made me look more closely at the memory ranges dumped out to the ITP log: The root port has

0x20: Memory Base = 0xca40
0x22: Memory Limit = 0xca40
0x24: Prefetchable Mem Base = 0xca21
0x26: Prefetchable Mem Limit = 0xca21

while function 0 has

0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable)

and function 1

0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca22000c (Memory space, 64-bit access, prefetchable)

Isn't it bogus for all but one of the BARs being outside the root port's prefetchable memory window (the MSI-X tables being inside the BAR 4 region in both cases)?

Does the sibling device have a BAR overlapping the address?

No, its BARs are fully separate.

Jan
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 11:30, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 08:42:57AM +0100, Jan Beulich wrote: Otoh the respective root port also has - Received Master Abort set in its Secondary Status register (but that's also already the case in the log that we have before the UR occurs, i.e. that doesn't mean all that much), - Received System Error set in its Secondary Status register (and after the UR the sibling endpoint [UR originating from 83:00.0, sibling being 83:00.1] also shows Signaled System Error set). It's another function of the same physical device, correct? Yes (obviously with their BDF only differing in the function). So is this sibling the only function sending SERR? Yes, albeit I can't tell whether the root port generated SERR on its own or in response to the endpoint doing so. What happens if you disable SERR# in the command register of 83:00.1? Any experiments with that system will be quite difficult, as they're only accessible by partners or ours. But I'll ask to try this if you think this can provide useful insight. Do we can chalk this up to hardware bugs on a specific box? I have to admit that I'm still very uncertain whether to consider all this correct behavior, a firmware flaw, or a hardware bug. Questions: 1. Does this only happen with a specific endpoint? What if you add another endpoint to the same system? We've got reports of this for two systems (two different vendors) using a Broadcomm NIC in one case and an Intel one in the other. 2. Has a driver initialized this endpoint? What if you don't load a driver before sending the transaction resulting in the UR? I'd have to ask for this to be tried too, and getting an answer may take some time. Jan
[Qemu-devel] [PATCH v2] xen/pass-through: ROM BAR handling adjustments
Expecting the ROM BAR to be written with an all ones value when sizing the region is wrong - the low bit has another meaning (enable/disable) and bits 1..10 are reserved. The PCI spec also mandates writing all ones to just the address portion of the register.

Use suitable constants also for initializing the ROM BAR register field description.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
---
v2: Convert complex if() into simpler ones.

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -249,10 +249,18 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
-        (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
-        XEN_PT_WARN(d, "Guest attempt to set address to unused Base Address "
-                    "Register. (addr: 0x%02x, len: %d)\n", addr, len);
+    if ((index >= 0) && (val != 0)) {
+        uint32_t chk = val;
+
+        if (index == PCI_ROM_SLOT)
+            chk |= (uint32_t)~PCI_ROM_ADDRESS_MASK;
+
+        if ((chk != XEN_PT_BAR_ALLF) &&
+            (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
+            XEN_PT_WARN(d, "Guest attempt to set address to unused "
+                        "Base Address Register. (addr: 0x%02x, len: %d)\n",
+                        addr, len);
+        }
     }
 
     /* find register group entry */
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -729,8 +729,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
         .init       = xen_pt_bar_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_exp_rom_bar_reg_write,
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 08.06.15 at 13:28, m...@redhat.com wrote: On Mon, Jun 08, 2015 at 11:55:22AM +0100, Jan Beulich wrote:

But you asking this made me look more closely at the memory ranges dumped out to the ITP log: The root port has

0x20: Memory Base = 0xca40
0x22: Memory Limit = 0xca40
0x24: Prefetchable Mem Base = 0xca21
0x26: Prefetchable Mem Limit = 0xca21

while function 0 has

0x10: Base Address Register 0 = 0xca23000c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca24000c (Memory space, 64-bit access, prefetchable)
0x20: Base Address Register 4 = 0xca25000c (Memory space, 64-bit access, prefetchable)

and function 1

0x10: Base Address Register 0 = 0xca2c (Memory space, 64-bit access, prefetchable)
0x18: Base Address Register 2 = 0xca21000c (Memory space, 64-bit access, prefetchable)

And what is the size of this BAR?

Sadly I don't seem to have a matching pair of kernel and ITP logs, so I can only guess that it's 64k (or less).

Jan
Re: [Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
On 05.06.15 at 13:32, stefano.stabell...@eu.citrix.com wrote:

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
 
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&

The change looks good and in line with the commit message. But this if statement looks like acrobatic circus to me now.

I think the alternative of splitting it up into multiple if()-s would not be any better readable.

--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,

There is no need for the explicit cast, is there?

There is - PCI_ROM_ADDRESS_MASK being wider than 32 bits, the assignment could cause a compiler warning otherwise (which I think I actually ran into).

Jan
[Qemu-devel] [PATCH 5/6] xen/pass-through: log errno values rather than function return ones
Functions setting errno commonly return just -1, which is of no particular use in the log file.

Signed-off-by: Jan Beulich <jbeul...@suse.com>

--- a/qemu/upstream/hw/xen/xen_pt.c
+++ b/qemu/upstream/hw/xen/xen_pt.c
@@ -609,8 +609,8 @@ static void xen_pt_region_update(XenPCIP
                                        guest_port, machine_port, size, op);
         if (rc) {
-            XEN_PT_ERR(d, "%s ioport mapping failed! (rc: %i)\n",
-                       adding ? "create new" : "remove old", rc);
+            XEN_PT_ERR(d, "%s ioport mapping failed! (err: %i)\n",
+                       adding ? "create new" : "remove old", errno);
         }
     } else {
         pcibus_t guest_addr = sec->offset_within_address_space;
@@ -623,8 +623,8 @@ static void xen_pt_region_update(XenPCIP
                                        XEN_PFN(size + XC_PAGE_SIZE - 1), op);
         if (rc) {
-            XEN_PT_ERR(d, "%s mem mapping failed! (rc: %i)\n",
-                       adding ? "create new" : "remove old", rc);
+            XEN_PT_ERR(d, "%s mem mapping failed! (err: %i)\n",
+                       adding ? "create new" : "remove old", errno);
         }
     }
 }
@@ -738,8 +738,8 @@ static int xen_pt_initfn(PCIDevice *d)
     rc = xc_physdev_map_pirq(xen_xc, xen_domid, machine_irq, &pirq);
     if (rc < 0) {
-        XEN_PT_ERR(d, "Mapping machine irq %u to pirq %i failed, (rc: %d)\n",
-                   machine_irq, pirq, rc);
+        XEN_PT_ERR(d, "Mapping machine irq %u to pirq %i failed, (err: %d)\n",
+                   machine_irq, pirq, errno);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
         cmd |= PCI_COMMAND_INTX_DISABLE;
@@ -760,8 +760,8 @@ static int xen_pt_initfn(PCIDevice *d)
                                      PCI_SLOT(d->devfn), e_intx);
         if (rc < 0) {
-            XEN_PT_ERR(d, "Binding of interrupt %i failed! (rc: %d)\n",
-                       e_intx, rc);
+            XEN_PT_ERR(d, "Binding of interrupt %i failed! (err: %d)\n",
+                       e_intx, errno);
 
             /* Disable PCI intx assertion (turn on bit10 of devctl) */
             cmd |= PCI_COMMAND_INTX_DISABLE;
@@ -770,7 +770,7 @@ static int xen_pt_initfn(PCIDevice *d)
             if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
                 if (xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq)) {
                     XEN_PT_ERR(d, "Unmapping of machine interrupt %i failed!"
-                               " (rc: %d)\n", machine_irq, rc);
+                               " (err: %d)\n", machine_irq, errno);
                 }
             }
             s->machine_irq = 0;
@@ -808,9 +808,9 @@ static void xen_pt_unregister_device(PCI
                                          0 /* isa_irq */);
         if (rc < 0) {
             XEN_PT_ERR(d, "unbinding of interrupt INT%c failed."
-                       " (machine irq: %i, rc: %d)"
+                       " (machine irq: %i, err: %d)"
                        " But bravely continuing on..\n",
-                       'a' + intx, machine_irq, rc);
+                       'a' + intx, machine_irq, errno);
         }
     }
 
@@ -828,9 +828,9 @@ static void xen_pt_unregister_device(PCI
         rc = xc_physdev_unmap_pirq(xen_xc, xen_domid, machine_irq);
         if (rc < 0) {
-            XEN_PT_ERR(d, "unmapping of interrupt %i failed. (rc: %d)"
+            XEN_PT_ERR(d, "unmapping of interrupt %i failed. (err: %d)"
                        " But bravely continuing on..\n",
-                       machine_irq, rc);
+                       machine_irq, errno);
         }
     }
 }
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -133,8 +133,8 @@ static int msi_msix_setup(XenPCIPassthro
                                       msix_entry, table_base);
         if (rc) {
             XEN_PT_ERR(&s->dev,
-                       "Mapping of MSI%s (rc: %i, vec: %#x, entry %#x)\n",
-                       is_msix ? "-X" : "", rc, gvec, msix_entry);
+                       "Mapping of MSI%s (err: %i, vec: %#x, entry %#x)\n",
+                       is_msix ? "-X" : "", errno, gvec, msix_entry);
             return rc;
         }
     }
@@ -167,12 +167,12 @@ static int msi_msix_update(XenPCIPassthr
                                       pirq, gflags, table_addr);
 
     if (rc) {
-        XEN_PT_ERR(d, "Updating of MSI%s failed. (rc: %d)\n",
-                   is_msix ? "-X" : "", rc);
+        XEN_PT_ERR(d, "Updating of MSI%s failed. (err: %d)\n",
+                   is_msix ? "-X" : "", errno);
 
         if (xc_physdev_unmap_pirq(xen_xc, xen_domid, *old_pirq)) {
-            XEN_PT_ERR(d, "Unmapping of MSI%s pirq %d failed.\n",
-                       is_msix ? "-X" : "", *old_pirq);
+            XEN_PT_ERR(d, "Unmapping of MSI%s pirq %d failed. (err: %d)\n",
+                       is_msix ? "-X" : "", *old_pirq, errno
[Qemu-devel] [PATCH 2/6] xen/MSI-X: drive maskall and enable bits through hypercalls
Particularly the maskall bit has to be under exclusive hypervisor control (and since they live in the same config space field, the enable bit has to follow suit). Use the replacement hypercall interfaces.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -181,6 +181,7 @@ typedef struct XenPTMSIXEntry {
 typedef struct XenPTMSIX {
     uint32_t ctrl_offset;
     bool enabled;
+    bool maskall;
     int total_entries;
     int bar_index;
     uint64_t table_base;
@@ -293,7 +294,9 @@ int xen_pt_msix_init(XenPCIPassthroughSt
 void xen_pt_msix_delete(XenPCIPassthroughState *s);
 int xen_pt_msix_update(XenPCIPassthroughState *s);
 int xen_pt_msix_update_remap(XenPCIPassthroughState *s, int bar_index);
+void xen_pt_msix_enable(XenPCIPassthroughState *s);
 void xen_pt_msix_disable(XenPCIPassthroughState *s);
+int xen_pt_msix_maskall(XenPCIPassthroughState *s, bool mask);
 
 static inline bool xen_pt_has_msix_mapping(XenPCIPassthroughState *s, int bar)
 {
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -1436,32 +1436,58 @@ static int xen_pt_msixctrl_reg_write(Xen
                                      uint16_t dev_value, uint16_t valid_mask)
 {
     XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
+    uint16_t writable_mask, value;
     uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
     int debug_msix_enabled_old;
 
     /* modify emulate register */
     writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
+    value = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
+    cfg_entry->data = value;
 
     /* create value for writing to I/O device register */
     *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
 
+    debug_msix_enabled_old = s->msix->enabled;
+
     /* update MSI-X */
-    if ((*val & PCI_MSIX_FLAGS_ENABLE)
-        && !(*val & PCI_MSIX_FLAGS_MASKALL)) {
+    if ((value & PCI_MSIX_FLAGS_ENABLE)
+        && !(value & PCI_MSIX_FLAGS_MASKALL)) {
+        if (!s->msix->enabled) {
+            if (!s->msix->maskall) {
+                xen_pt_msix_maskall(s, true);
+            }
+            xen_pt_msix_enable(s);
+        }
         xen_pt_msix_update(s);
-    } else if (!(*val & PCI_MSIX_FLAGS_ENABLE) && s->msix->enabled) {
-        xen_pt_msix_disable(s);
+        s->msix->enabled = true;
+        s->msix->maskall = false;
+        xen_pt_msix_maskall(s, false);
+    } else if (s->msix->enabled) {
+        if (!(value & PCI_MSIX_FLAGS_ENABLE)) {
+            xen_pt_msix_disable(s);
+            s->msix->enabled = false;
+        } else if (!s->msix->maskall) {
+            s->msix->maskall = true;
+            xen_pt_msix_maskall(s, true);
+        }
     }
 
-    debug_msix_enabled_old = s->msix->enabled;
-    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
     if (s->msix->enabled != debug_msix_enabled_old) {
         XEN_PT_LOG(&s->dev, "%s MSI-X\n",
                    s->msix->enabled ? "enable" : "disable");
     }
 
+    xen_host_pci_get_word(&s->real_device, s->msix->ctrl_offset, &dev_value);
+
+    if (s->msix->enabled && !(dev_value & PCI_MSIX_FLAGS_ENABLE)) {
+        XEN_PT_ERR(&s->dev, "MSI-X unexpectedly disabled\n");
+    } else if ((dev_value & PCI_MSIX_FLAGS_ENABLE)
+               && s->msix->maskall
+               && !(dev_value & PCI_MSIX_FLAGS_MASKALL)) {
+        XEN_PT_ERR(&s->dev, "MSI-X unexpectedly unmasked\n");
+    }
+
     return 0;
 }
 
@@ -1483,9 +1509,12 @@ static XenPTRegInfo xen_pt_emu_reg_msix[
         .offset     = PCI_MSI_FLAGS,
         .size       = 2,
         .init_val   = 0x0000,
-        .res_mask   = 0x3800,
-        .ro_mask    = 0x07FF,
-        .emu_mask   = 0x0000,
+        /* This must not be split into res_mask (0x3800) and ro_mask (0x07FF)
+         * because even in permissive mode there must not be any write back
+         * to this register.
+         */
+        .ro_mask    = 0x3FFF,
+        .emu_mask   = 0xC000,
         .init       = xen_pt_msixctrl_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_msixctrl_reg_write,
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -301,8 +301,11 @@ static int msix_set_enable(XenPCIPassthr
         return -1;
     }
 
-    return msi_msix_enable(s, s->msix->ctrl_offset, PCI_MSIX_FLAGS_ENABLE,
-                           enabled);
+    return xc_physdev_msix_enable(xen_xc, s->real_device.domain,
+                                  s->real_device.bus,
+                                  PCI_DEVFN(s->real_device.dev,
+                                            s->real_device.func),
+                                  enabled);
 }
 
 static int xen_pt_msix_update_one(XenPCIPassthroughState *s, int entry_nr)
@@ -361,6 +364,11 @@ int xen_pt_msix_update(XenPCIPassthrough
     return 0;
 }
 
+void xen_pt_msix_enable(XenPCIPassthroughState *s
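The control flow above is easier to see in isolation. Below is a minimal sketch (not qemu code — names and the action enum are illustrative) of the decision logic the rewritten xen_pt_msixctrl_reg_write() implements: enabling always goes through a masked intermediate state, and clearing mask-all is an explicit final step, so vectors can never fire before the table is in a consistent state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_ENABLE  0x8000u  /* PCI_MSIX_FLAGS_ENABLE */
#define MSIX_MASKALL 0x4000u  /* PCI_MSIX_FLAGS_MASKALL */

/* Hypothetical summary of the state transition a control-register
 * write triggers in the patched handler. */
typedef enum {
    ACT_NONE,
    ACT_ENABLE_MASKED_THEN_UNMASK, /* enable with mask-all set, then clear it */
    ACT_DISABLE,
    ACT_MASKALL                    /* guest set mask-all while enabled */
} msix_action;

static msix_action msix_ctrl_transition(uint16_t value, bool enabled,
                                        bool maskall)
{
    if ((value & MSIX_ENABLE) && !(value & MSIX_MASKALL)) {
        /* Enable path: never enable with vectors live. */
        return enabled ? ACT_NONE : ACT_ENABLE_MASKED_THEN_UNMASK;
    }
    if (enabled) {
        if (!(value & MSIX_ENABLE)) {
            return ACT_DISABLE;
        }
        if (!maskall) {
            return ACT_MASKALL;
        }
    }
    return ACT_NONE;
}
```

This mirrors the sequence Linux uses as well (enable with MASKALL=1, program the table, then clear MASKALL), as discussed in the related xen-pciback thread.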
[Qemu-devel] [PATCH 1/6] xen/MSI-X: latch MSI-X table writes
The remaining log message in pci_msix_write() is wrong, as the guest behavior it complains about may only appear to be wrong: For one, the old logic didn't take the mask-all bit into account. And then this shouldn't depend on host device state (i.e. the host may have masked the entry without the guest having done so). Plus these writes shouldn't be dropped even when an entry gets unmasked. Instead, if they can't be made take effect right away, they should take effect on the next unmasking or enabling operation - the specification explicitly describes such caching behavior.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -175,9 +175,8 @@ typedef struct XenPTMSIXEntry {
     int pirq;
     uint64_t addr;
     uint32_t data;
-    uint32_t vector_ctrl;
+    uint32_t latch[4];
     bool updated; /* indicate whether MSI ADDR or DATA is updated */
-    bool warned;  /* avoid issuing (bogus) warning more than once */
 } XenPTMSIXEntry;
 typedef struct XenPTMSIX {
     uint32_t ctrl_offset;
--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -25,6 +25,7 @@
 #define XEN_PT_GFLAGSSHIFT_DELIV_MODE 12
 #define XEN_PT_GFLAGSSHIFT_TRG_MODE 15
 
+#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
 
 /*
  * Helpers
@@ -322,6 +323,13 @@ static int xen_pt_msix_update_one(XenPCI
 
     pirq = entry->pirq;
 
+    if (pirq == XEN_PT_UNASSIGNED_PIRQ || s->msix->maskall ||
+        (entry->latch(VECTOR_CTRL) & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        entry->addr = entry->latch(LOWER_ADDR) |
+                      ((uint64_t)entry->latch(UPPER_ADDR) << 32);
+        entry->data = entry->latch(DATA);
+    }
+
     rc = msi_msix_setup(s, entry->addr, entry->data, &pirq, true, entry_nr,
                         entry->pirq == XEN_PT_UNASSIGNED_PIRQ);
     if (rc) {
@@ -396,35 +404,15 @@ int xen_pt_msix_update_remap(XenPCIPasst
 
 static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset)
 {
-    switch (offset) {
-    case PCI_MSIX_ENTRY_LOWER_ADDR:
-        return e->addr & UINT32_MAX;
-    case PCI_MSIX_ENTRY_UPPER_ADDR:
-        return e->addr >> 32;
-    case PCI_MSIX_ENTRY_DATA:
-        return e->data;
-    case PCI_MSIX_ENTRY_VECTOR_CTRL:
-        return e->vector_ctrl;
-    default:
-        return 0;
-    }
+    return !(offset % sizeof(*e->latch))
+           ? e->latch[offset / sizeof(*e->latch)] : 0;
 }
 
 static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val)
 {
-    switch (offset) {
-    case PCI_MSIX_ENTRY_LOWER_ADDR:
-        e->addr = (e->addr & ((uint64_t)UINT32_MAX << 32)) | val;
-        break;
-    case PCI_MSIX_ENTRY_UPPER_ADDR:
-        e->addr = (uint64_t)val << 32 | (e->addr & UINT32_MAX);
-        break;
-    case PCI_MSIX_ENTRY_DATA:
-        e->data = val;
-        break;
-    case PCI_MSIX_ENTRY_VECTOR_CTRL:
-        e->vector_ctrl = val;
-        break;
+    if (!(offset % sizeof(*e->latch)))
+    {
+        e->latch[offset / sizeof(*e->latch)] = val;
     }
 }
 
@@ -444,39 +432,28 @@ static void pci_msix_write(void *opaque,
     offset = addr % PCI_MSIX_ENTRY_SIZE;
 
     if (offset != PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        const volatile uint32_t *vec_ctrl;
-
         if (get_entry_value(entry, offset) == val
             && entry->pirq != XEN_PT_UNASSIGNED_PIRQ) {
             return;
         }
 
+        entry->updated = true;
+    } else if (msix->enabled && entry->updated &&
+               !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
+        const volatile uint32_t *vec_ctrl;
+
         /*
          * If Xen intercepts the mask bit access, entry->vec_ctrl may not be
          * up-to-date. Read from hardware directly.
          */
         vec_ctrl = s->msix->phys_iomem_base + entry_nr * PCI_MSIX_ENTRY_SIZE
                    + PCI_MSIX_ENTRY_VECTOR_CTRL;
+        set_entry_value(entry, offset, *vec_ctrl);
 
-        if (msix->enabled && !(*vec_ctrl & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            if (!entry->warned) {
-                entry->warned = true;
-                XEN_PT_ERR(&s->dev, "Can't update msix entry %d since MSI-X is"
-                           " already enabled.\n", entry_nr);
-            }
-            return;
-        }
-
-        entry->updated = true;
+        xen_pt_msix_update_one(s, entry_nr);
     }
 
     set_entry_value(entry, offset, val);
-
-    if (offset == PCI_MSIX_ENTRY_VECTOR_CTRL) {
-        if (msix->enabled && !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-            xen_pt_msix_update_one(s, entry_nr);
-        }
-    }
 }
 
 static uint64_t pci_msix_read(void *opaque, hwaddr addr,
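The latch replacement for the old switch-based accessors is compact because each MSI-X table entry is exactly four dwords, so an aligned dword offset indexes the latch array directly. A standalone sketch (types and constant names simplified from the patch; not the qemu originals):

```c
#include <assert.h>
#include <stdint.h>

/* MSI-X table entry dword offsets, per the PCI spec layout. */
#define ENTRY_LOWER_ADDR  0x0u
#define ENTRY_UPPER_ADDR  0x4u
#define ENTRY_DATA        0x8u
#define ENTRY_VECTOR_CTRL 0xcu

typedef struct {
    uint32_t latch[4]; /* one dword per MSI-X table field */
} msix_entry;

/* Equivalent of the patched get_entry_value(): aligned offsets index
 * the latch array; unaligned offsets read as 0. */
static uint32_t get_entry_value(const msix_entry *e, unsigned offset)
{
    return !(offset % sizeof(e->latch[0]))
           ? e->latch[offset / sizeof(e->latch[0])] : 0;
}

/* Equivalent of the patched set_entry_value(): unaligned writes are
 * silently dropped. */
static void set_entry_value(msix_entry *e, unsigned offset, uint32_t val)
{
    if (!(offset % sizeof(e->latch[0]))) {
        e->latch[offset / sizeof(e->latch[0])] = val;
    }
}

/* Usage example: write one dword, read it back. */
static uint32_t write_then_read(unsigned offset, uint32_t val)
{
    msix_entry e = {{0, 0, 0, 0}};
    set_entry_value(&e, offset, val);
    return get_entry_value(&e, offset);
}
```

The real patch additionally keeps the guest-visible `latch[]` separate from the `addr`/`data` values handed to the hypervisor, copying the latched values over only when the entry is masked, mask-all is set, or no pirq is bound yet — which is exactly the caching behavior the spec permits.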
[Qemu-devel] [PATCH 6/6] xen/pass-through: constify some static data
This is done indirectly by adjusting two typedefs and helps emphasizing that the respective tables aren't supposed to be modified at runtime (as they may be shared between devices).

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -31,7 +31,7 @@ void xen_pt_log(const PCIDevice *d, cons
 /* Helper */
 #define XEN_PFN(x) ((x) >> XC_PAGE_SHIFT)
 
-typedef struct XenPTRegInfo XenPTRegInfo;
+typedef const struct XenPTRegInfo XenPTRegInfo;
 typedef struct XenPTReg XenPTReg;
 typedef struct XenPCIPassthroughState XenPCIPassthroughState;
@@ -135,11 +135,11 @@ struct XenPTReg {
     uint32_t data; /* emulated value */
 };
 
-typedef struct XenPTRegGroupInfo XenPTRegGroupInfo;
+typedef const struct XenPTRegGroupInfo XenPTRegGroupInfo;
 
 /* emul reg group size initialize method */
 typedef int (*xen_pt_reg_size_init_fn)
-    (XenPCIPassthroughState *, const XenPTRegGroupInfo *,
+    (XenPCIPassthroughState *, XenPTRegGroupInfo *,
      uint32_t base_offset, uint8_t *size);
 
 /* emulated register group information */
@@ -154,7 +154,7 @@ struct XenPTRegGroupInfo {
 /* emul register group management table */
 typedef struct XenPTRegGroup {
     QLIST_ENTRY(XenPTRegGroup) entries;
-    const XenPTRegGroupInfo *reg_grp;
+    XenPTRegGroupInfo *reg_grp;
     uint32_t base_offset;
     uint8_t size;
     QLIST_HEAD(, XenPTReg) reg_tbl_list;
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -96,8 +96,7 @@ XenPTReg *xen_pt_find_reg(XenPTRegGroup
 }
 
 static uint32_t get_throughable_mask(const XenPCIPassthroughState *s,
-                                     const XenPTRegInfo *reg,
-                                     uint32_t valid_mask)
+                                     XenPTRegInfo *reg, uint32_t valid_mask)
 {
     uint32_t throughable_mask = ~(reg->emu_mask | reg->ro_mask);
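The typedef trick is worth seeing on its own: folding `const` into the typedef makes every declaration using that name const, so the static tables land in read-only storage and any accidental runtime modification is a compile error. A minimal sketch with illustrative names (not the actual qemu structures):

```c
#include <assert.h>
#include <stdint.h>

/* The const lives in the typedef, so "RegInfo x;" is itself const and
 * tables of RegInfo cannot be written to at runtime. */
typedef const struct RegInfo {
    uint32_t offset;
    uint32_t ro_mask;
} RegInfo;

/* Example static table; values are placeholders, not real register masks. */
static RegInfo reg_table[] = {
    { 0x04u, 0x06F8u },
    { 0x30u, 0x07FEu },
};

static uint32_t lookup_ro_mask(uint32_t offset)
{
    for (unsigned i = 0; i < sizeof(reg_table) / sizeof(reg_table[0]); i++) {
        if (reg_table[i].offset == offset) {
            return reg_table[i].ro_mask;
        }
    }
    return 0;
}
```

Writing `reg_table[0].ro_mask = 0;` now fails to compile, which is precisely the sharing-safety property the patch description calls out.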
[Qemu-devel] [PATCH 3/6] xen/MSI-X: really enforce alignment
The way the generic infrastructure works, the intention of not allowing unaligned accesses can't be achieved by simply setting .unaligned to false. The benefit is that we can now replace the conditionals in {get,set}_entry_value() by assert()-s.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt_msi.c
+++ b/qemu/upstream/hw/xen/xen_pt_msi.c
@@ -421,16 +421,14 @@ int xen_pt_msix_update_remap(XenPCIPasst
 
 static uint32_t get_entry_value(XenPTMSIXEntry *e, int offset)
 {
-    return !(offset % sizeof(*e->latch))
-           ? e->latch[offset / sizeof(*e->latch)] : 0;
+    assert(!(offset % sizeof(*e->latch)));
+    return e->latch[offset / sizeof(*e->latch)];
 }
 
 static void set_entry_value(XenPTMSIXEntry *e, int offset, uint32_t val)
 {
-    if (!(offset % sizeof(*e->latch)))
-    {
-        e->latch[offset / sizeof(*e->latch)] = val;
-    }
+    assert(!(offset % sizeof(*e->latch)));
+    e->latch[offset / sizeof(*e->latch)] = val;
 }
 
 static void pci_msix_write(void *opaque, hwaddr addr,
@@ -496,6 +494,12 @@ static uint64_t pci_msix_read(void *opaq
     }
 }
 
+static bool pci_msix_accepts(void *opaque, hwaddr addr,
+                             unsigned size, bool is_write)
+{
+    return !(addr & (size - 1));
+}
+
 static const MemoryRegionOps pci_msix_ops = {
     .read = pci_msix_read,
     .write = pci_msix_write,
@@ -504,7 +508,13 @@ static const MemoryRegionOps pci_msix_op
         .min_access_size = 4,
         .max_access_size = 4,
         .unaligned = false,
+        .accepts = pci_msix_accepts
     },
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+        .unaligned = false
+    }
 };
 
 int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
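The alignment test itself is a one-liner that relies on access sizes being powers of two, so `size - 1` is a mask of the low address bits that must be zero. Extracted into a standalone sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same check as the pci_msix_accepts() callback added by the patch:
 * an access is accepted only when its address is naturally aligned to
 * its size (size being a power of two: 1, 2, 4, 8). */
static bool accepts(uint64_t addr, unsigned size)
{
    return !(addr & (size - 1));
}
```

With this callback wired into `.valid`, the memory core rejects unaligned guest accesses before they ever reach pci_msix_read()/pci_msix_write(), which is why the offset checks in {get,set}_entry_value() can safely become assert()-s.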
[Qemu-devel] [PATCH 4/6] xen/pass-through: correctly deal with RW1C bits
Introduce yet another mask for them, so that the generic routine can handle them, at once rendering xen_pt_pmcsr_reg_write() superfluous.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/qemu/upstream/hw/xen/xen_pt.h
+++ b/qemu/upstream/hw/xen/xen_pt.h
@@ -105,6 +105,8 @@ struct XenPTRegInfo {
     uint32_t res_mask;
     /* reg read only field mask (ON:RO/ROS, OFF:other) */
     uint32_t ro_mask;
+    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
+    uint32_t rw1c_mask;
     /* reg emulate field mask (ON:emu, OFF:passthrough) */
     uint32_t emu_mask;
     xen_pt_conf_reg_init init;
--- a/qemu/upstream/hw/xen/xen_pt_config_init.c
+++ b/qemu/upstream/hw/xen/xen_pt_config_init.c
@@ -176,7 +176,8 @@ static int xen_pt_byte_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -193,7 +194,8 @@ static int xen_pt_word_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -210,7 +212,8 @@ static int xen_pt_long_reg_write(XenPCIP
     cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
 
     /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~reg->rw1c_mask,
+                              throughable_mask);
 
     return 0;
 }
@@ -611,6 +614,7 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .init_val   = 0x0000,
         .res_mask   = 0x0007,
         .ro_mask    = 0x06F8,
+        .rw1c_mask  = 0xF900,
         .emu_mask   = 0x0010,
         .init       = xen_pt_status_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
@@ -910,6 +914,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .size       = 2,
         .res_mask   = 0xFFC0,
         .ro_mask    = 0x0030,
+        .rw1c_mask  = 0x000F,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_word_reg_write,
@@ -930,6 +935,7 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
         .offset     = PCI_EXP_LNKSTA,
         .size       = 2,
         .ro_mask    = 0x3FFF,
+        .rw1c_mask  = 0xC000,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
         .u.w.write  = xen_pt_word_reg_write,
@@ -966,26 +972,6 @@ static XenPTRegInfo xen_pt_emu_reg_pcie[
  * Power Management Capability
  */
 
-/* write Power Management Control/Status register */
-static int xen_pt_pmcsr_reg_write(XenPCIPassthroughState *s,
-                                  XenPTReg *cfg_entry, uint16_t *val,
-                                  uint16_t dev_value, uint16_t valid_mask)
-{
-    XenPTRegInfo *reg = cfg_entry->reg;
-    uint16_t writable_mask = 0;
-    uint16_t throughable_mask = get_throughable_mask(s, reg, valid_mask);
-
-    /* modify emulate register */
-    writable_mask = reg->emu_mask & ~reg->ro_mask & valid_mask;
-    cfg_entry->data = XEN_PT_MERGE_VALUE(*val, cfg_entry->data, writable_mask);
-
-    /* create value for writing to I/O device register */
-    *val = XEN_PT_MERGE_VALUE(*val, dev_value & ~PCI_PM_CTRL_PME_STATUS,
-                              throughable_mask);
-
-    return 0;
-}
-
 /* Power Management Capability reg static information table */
 static XenPTRegInfo xen_pt_emu_reg_pm[] = {
     /* Next Pointer reg */
@@ -1016,11 +1002,12 @@ static XenPTRegInfo xen_pt_emu_reg_pm[]
         .size       = 2,
         .init_val   = 0x0008,
         .res_mask   = 0x00F0,
-        .ro_mask    = 0xE10C,
+        .ro_mask    = 0x610C,
+        .rw1c_mask  = 0x8000,
         .emu_mask   = 0x810B,
         .init       = xen_pt_common_reg_init,
         .u.w.read   = xen_pt_word_reg_read,
-        .u.w.write  = xen_pt_pmcsr_reg_write,
+        .u.w.write  = xen_pt_word_reg_write,
    },
    {
        .size = 0,
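The point of masking `dev_value` with `~rw1c_mask` before the merge is subtle: for a write-1-to-clear bit that is *not* passed through to the guest, the merge would otherwise echo the device's current 1 back to the hardware and thereby clear the status bit behind the guest's back. A standalone sketch of the merge (macro semantics match qemu's XEN_PT_MERGE_VALUE; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Guest-controlled bits come from val, everything else from the device. */
#define MERGE_VALUE(val, dev, mask) (((val) & (mask)) | ((dev) & ~(mask)))

/* Sketch of the generic write-back computation with RW1C handling, as
 * in the patched xen_pt_word_reg_write(): device bits in rw1c_mask are
 * suppressed so they can only reach the device when the guest itself
 * wrote 1 through a throughable bit. */
static uint16_t write_back_value(uint16_t guest_val, uint16_t dev_value,
                                 uint16_t throughable_mask, uint16_t rw1c_mask)
{
    return MERGE_VALUE(guest_val, (uint16_t)(dev_value & ~rw1c_mask),
                       throughable_mask);
}
```

Without the `& ~rw1c_mask`, the first assertion below would compute 0x8000 and the write-back would clear a status bit (e.g. PME_STATUS) the guest never touched — which is exactly what the now-removed xen_pt_pmcsr_reg_write() special-cased for a single bit.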
[Qemu-devel] [PATCH 0/6] xen/pass-through: XSA-120, 128...131 follow-up
1: MSI-X: latch MSI-X table writes
2: MSI-X: drive maskall and enable bits through hypercalls
3: MSI-X: really enforce alignment
4: pass-through: correctly deal with RW1C bits
5: pass-through: log errno values rather than function return ones
6: pass-through: constify some static data

Signed-off-by: Jan Beulich jbeul...@suse.com
[Qemu-devel] Ping: [PATCH] xen/pass-through: ROM BAR handling adjustments
Ping?

On 15.05.15 at 14:41, jbeul...@suse.com wrote:
> Expecting the ROM BAR to be written with an all ones value when sizing
> the region is wrong - the low bit has another meaning (enable/disable)
> and bits 1..10 are reserved. The PCI spec also mandates writing all ones
> to just the address portion of the register.
>
> Use suitable constants also for initializing the ROM BAR register field
> description.
>
> Signed-off-by: Jan Beulich jbeul...@suse.com
>
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
>      /* check unused BAR register */
>      index = xen_pt_bar_offset_to_index(addr);
> -    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
> +    if ((index >= 0) && (val != 0) &&
> +        (((index != PCI_ROM_SLOT) ?
> +          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&
>          (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
>          XEN_PT_WARN(d, "Guest attempt to set address to unused "
>                      "Base Address Register. (addr: 0x%02x, len: %d)\n",
>                      addr, len);
> --- a/hw/xen/xen_pt_config_init.c
> +++ b/hw/xen/xen_pt_config_init.c
> @@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
>          .offset     = PCI_ROM_ADDRESS,
>          .size       = 4,
>          .init_val   = 0x00000000,
> -        .ro_mask    = 0x000007FE,
> -        .emu_mask   = 0xFFFFF800,
> +        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
> +        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
>          .init       = xen_pt_bar_reg_init,
>          .u.dw.read  = xen_pt_long_reg_read,
>          .u.dw.write = xen_pt_exp_rom_bar_reg_write,

___
Xen-devel mailing list
xen-de...@lists.xen.org
http://lists.xen.org/xen-devel
[Qemu-devel] Ping: [PATCH] xen/pass-through: fold host PCI command register writes
Ping?

On 15.05.15 at 14:46, jbeul...@suse.com wrote:
> The code introduced to address XSA-126 allows simplification of other
> code in xen_pt_initfn(): All we need to do is update cmd suitably, as
> it'll be written back to the host register near the end of the function
> anyway.
>
> Signed-off-by: Jan Beulich jbeul...@suse.com
>
> --- a/hw/xen/xen_pt.c
> +++ b/hw/xen/xen_pt.c
> @@ -698,10 +698,7 @@ static int xen_pt_initfn(PCIDevice *d)
>                     machine_irq, pirq, rc);
>
>          /* Disable PCI intx assertion (turn on bit10 of devctl) */
> -        xen_host_pci_set_word(&s->real_device,
> -                              PCI_COMMAND,
> -                              pci_get_word(s->dev.config + PCI_COMMAND)
> -                              | PCI_COMMAND_INTX_DISABLE);
> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>
>          machine_irq = 0;
>          s->machine_irq = 0;
>      } else {
> @@ -723,9 +720,7 @@ static int xen_pt_initfn(PCIDevice *d)
>                     e_intx, rc);
>
>          /* Disable PCI intx assertion (turn on bit10 of devctl) */
> -        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
> -                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
> -                              | PCI_COMMAND_INTX_DISABLE);
> +        cmd |= PCI_COMMAND_INTX_DISABLE;
>
>          xen_pt_mapped_machine_irq[machine_irq]--;
>          if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
[Qemu-devel] [PATCH] xen/pass-through: ROM BAR handling adjustments
Expecting the ROM BAR to be written with an all ones value when sizing the region is wrong - the low bit has another meaning (enable/disable) and bits 1..10 are reserved. The PCI spec also mandates writing all ones to just the address portion of the register.

Use suitable constants also for initializing the ROM BAR register field description.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -248,7 +248,9 @@ static void xen_pt_pci_write_config(PCID
     /* check unused BAR register */
     index = xen_pt_bar_offset_to_index(addr);
-    if ((index >= 0) && (val > 0 && val < XEN_PT_BAR_ALLF) &&
+    if ((index >= 0) && (val != 0) &&
+        (((index != PCI_ROM_SLOT) ?
+          val : (val | (uint32_t)~PCI_ROM_ADDRESS_MASK)) != XEN_PT_BAR_ALLF) &&
         (s->bases[index].bar_flag == XEN_PT_BAR_FLAG_UNUSED)) {
         XEN_PT_WARN(d, "Guest attempt to set address to unused "
                     "Base Address Register. (addr: 0x%02x, len: %d)\n",
                     addr, len);
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -726,8 +726,8 @@ static XenPTRegInfo xen_pt_emu_reg_heade
         .offset     = PCI_ROM_ADDRESS,
         .size       = 4,
         .init_val   = 0x00000000,
-        .ro_mask    = 0x000007FE,
-        .emu_mask   = 0xFFFFF800,
+        .ro_mask    = ~PCI_ROM_ADDRESS_MASK & ~PCI_ROM_ADDRESS_ENABLE,
+        .emu_mask   = (uint32_t)PCI_ROM_ADDRESS_MASK,
         .init       = xen_pt_bar_reg_init,
         .u.dw.read  = xen_pt_long_reg_read,
         .u.dw.write = xen_pt_exp_rom_bar_reg_write,
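The adjusted check can be illustrated in isolation: a ROM BAR sizing probe only needs the address bits (31:11) to be all ones, because bit 0 is the enable bit and bits 1..10 are reserved. A sketch (the helper name is illustrative; constants follow the standard pci_regs.h values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_ROM_ADDRESS_ENABLE 0x01u
#define PCI_ROM_ADDRESS_MASK   (~0x7FFu) /* address bits 31:11 -> 0xFFFFF800 */

/* Hypothetical condensed form of the patched condition: for the ROM
 * BAR, force the non-address bits to 1 before comparing against the
 * all-ones sizing value; ordinary BARs must match all 32 bits. */
static bool is_sizing_probe(uint32_t val, bool is_rom_bar)
{
    uint32_t v = is_rom_bar ? (val | (uint32_t)~PCI_ROM_ADDRESS_MASK) : val;
    return v == 0xFFFFFFFFu;
}
```

So a guest writing 0xFFFFF800 (all address bits, enable clear) is correctly recognized as sizing the ROM, which the old `val < XEN_PT_BAR_ALLF` comparison missed.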
[Qemu-devel] [PATCH] xen/pass-through: fold host PCI command register writes
The code introduced to address XSA-126 allows simplification of other code in xen_pt_initfn(): All we need to do is update cmd suitably, as it'll be written back to the host register near the end of the function anyway.

Signed-off-by: Jan Beulich jbeul...@suse.com

--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -698,10 +698,7 @@ static int xen_pt_initfn(PCIDevice *d)
                    machine_irq, pirq, rc);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
-        xen_host_pci_set_word(&s->real_device,
-                              PCI_COMMAND,
-                              pci_get_word(s->dev.config + PCI_COMMAND)
-                              | PCI_COMMAND_INTX_DISABLE);
+        cmd |= PCI_COMMAND_INTX_DISABLE;
 
         machine_irq = 0;
         s->machine_irq = 0;
     } else {
@@ -723,9 +720,7 @@ static int xen_pt_initfn(PCIDevice *d)
                    e_intx, rc);
 
         /* Disable PCI intx assertion (turn on bit10 of devctl) */
-        xen_host_pci_set_word(&s->real_device, PCI_COMMAND,
-                              *(uint16_t *)(&s->dev.config[PCI_COMMAND])
-                              | PCI_COMMAND_INTX_DISABLE);
+        cmd |= PCI_COMMAND_INTX_DISABLE;
 
         xen_pt_mapped_machine_irq[machine_irq]--;
         if (xen_pt_mapped_machine_irq[machine_irq] == 0) {
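The pattern behind the simplification: rather than issuing one host config-space write per condition, the error paths merely OR bits into a local `cmd` accumulator, and a single write-back at the end of the function applies them. A minimal sketch (the failure flag stands in for the real IRQ-binding error handling; names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE 0x400u /* bit 10 of the PCI command reg */

/* Hypothetical condensed form of the xen_pt_initfn() flow after the
 * patch: accumulate command-register changes in cmd; the caller does
 * one xen_host_pci_set_word(..., PCI_COMMAND, cmd) at the end. */
static uint16_t compute_final_cmd(uint16_t cmd, int intx_setup_failed)
{
    if (intx_setup_failed) {
        cmd |= PCI_COMMAND_INTX_DISABLE; /* replaces the immediate host write */
    }
    return cmd;
}
```

Besides being shorter, this removes a window in which the host register and the about-to-be-written `cmd` value could disagree.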
Re: [Qemu-devel] [Xen-devel] [PATCH][XSA-126] xen: limit guest control of PCI command register
On 20.04.15 at 16:32, m...@redhat.com wrote:
> On Mon, Apr 20, 2015 at 03:08:09PM +0100, Jan Beulich wrote:
>> On 20.04.15 at 15:43, m...@redhat.com wrote:
>>> did firmware reconfigure this device to report URs as fatal errors then?
>> No, the Unsupported Request Error Severity flag is zero.
> OK, that's the correct configuration, so how come the box crashes when
> there's a UR then?

Apparently something marks it fatal in the process of delivering it through APEI. I'll raise the question with the reporters.

Jan