
Re: [PATCH] docs: Replace Qemu -> QEMU

2022-04-22 Thread Knut Omang

On Fri, 2022-04-22 at 10:30 +0200, Stefan Weil wrote:
> Signed-off-by: Stefan Weil 
> ---
>  docs/pcie_sriov.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
> index f5e891e1d4..11158dbf88 100644
> --- a/docs/pcie_sriov.txt
> +++ b/docs/pcie_sriov.txt
> @@ -8,8 +8,8 @@ of a PCI Express device. It allows a single physical function 
> (PF) to
> appear as
>  virtual functions (VFs) for the main purpose of eliminating software
>  overhead in I/O from virtual machines.
>  
> -Qemu now implements the basic common functionality to enable an emulated 
> device
> -to support SR/IOV. Yet no fully implemented devices exists in Qemu, but a
> +QEMU now implements the basic common functionality to enable an emulated 
> device
> +to support SR/IOV. Yet no fully implemented devices exists in QEMU, but a
>  proof-of-concept hack of the Intel igb can be found here:
>  
>  git://github.com/knuto/qemu.git sriov_patches_v5
> @@ -18,7 +18,7 @@ Implementation
>  ==
>  Implementing emulation of an SR/IOV capable device typically consists of
>  implementing support for two types of device classes; the "normal" physical 
> device
> -(PF) and the virtual device (VF). From Qemu's perspective, the VFs are just
> +(PF) and the virtual device (VF). From QEMU's perspective, the VFs are just
>  like other devices, except that some of their properties are derived from
>  the PF.
>  

Reviewed-by: Knut Omang

Re: [PATCH v4 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2022-01-26 Thread Knut Omang

On Wed, 2022-01-26 at 18:11 +0100, Lukasz Maniak wrote:
> From: Knut Omang 
> 
> This patch provides the building blocks for creating an SR/IOV
> PCIe Extended Capability header and register/unregister
> SR/IOV Virtual Functions.
> 
> Signed-off-by: Knut Omang 
> ---
>  hw/pci/meson.build  |   1 +
>  hw/pci/pci.c    | 100 +---
>  hw/pci/pcie.c   |   5 +
>  hw/pci/pcie_sriov.c | 294 
>  hw/pci/trace-events |   5 +
>  include/hw/pci/pci.h    |  12 +-
>  include/hw/pci/pcie.h   |   6 +
>  include/hw/pci/pcie_sriov.h |  71 +
>  include/qemu/typedefs.h |   2 +
>  9 files changed, 470 insertions(+), 26 deletions(-)
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> diff --git a/hw/pci/meson.build b/hw/pci/meson.build
> index 5c4bbac817..bcc9c75919 100644
> --- a/hw/pci/meson.build
> +++ b/hw/pci/meson.build
> @@ -5,6 +5,7 @@ pci_ss.add(files(
>    'pci.c',
>    'pci_bridge.c',
>    'pci_host.c',
> +  'pcie_sriov.c',
>    'shpc.c',
>    'slotid_cap.c'
>  ))
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 5d30f9ca60..ba8fb92efc 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -239,6 +239,9 @@ int pci_bar(PCIDevice *d, int reg)
>  {
>  uint8_t type;
>  
> +    /* PCIe virtual functions do not have their own BARs */
> +    assert(!pci_is_vf(d));
> +
>  if (reg != PCI_ROM_SLOT)
>  return PCI_BASE_ADDRESS_0 + reg * 4;
>  
> @@ -304,10 +307,30 @@ void pci_device_deassert_intx(PCIDevice *dev)
>  }
>  }
>  
> -static void pci_do_device_reset(PCIDevice *dev)
> +static void pci_reset_regions(PCIDevice *dev)
>  {
>  int r;
> +    if (pci_is_vf(dev)) {
> +    return;
> +    }
> +
> +    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> +    PCIIORegion *region = &dev->io_regions[r];
> +    if (!region->size) {
> +    continue;
> +    }
>  
> +    if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +    region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +    pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> +    } else {
> +    pci_set_long(dev->config + pci_bar(dev, r), region->type);
> +    }
> +    }
> +}
> +
> +static void pci_do_device_reset(PCIDevice *dev)
> +{
>  pci_device_deassert_intx(dev);
>  assert(dev->irq_state == 0);
>  
> @@ -323,19 +346,7 @@ static void pci_do_device_reset(PCIDevice *dev)
>    pci_get_word(dev->wmask + PCI_INTERRUPT_LINE) |
>    pci_get_word(dev->w1cmask + 
> PCI_INTERRUPT_LINE));
>  dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
> -    for (r = 0; r < PCI_NUM_REGIONS; ++r) {
> -    PCIIORegion *region = &dev->io_regions[r];
> -    if (!region->size) {
> -    continue;
> -    }
> -
> -    if (!(region->type & PCI_BASE_ADDRESS_SPACE_IO) &&
> -    region->type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> -    pci_set_quad(dev->config + pci_bar(dev, r), region->type);
> -    } else {
> -    pci_set_long(dev->config + pci_bar(dev, r), region->type);
> -    }
> -    }
> +    pci_reset_regions(dev);
>  pci_update_mappings(dev);
>  
>  msi_reset(dev);
> @@ -884,6 +895,16 @@ static void pci_init_multifunction(PCIBus *bus, 
> PCIDevice *dev,
> Error **errp)
>  dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
>  }
>  
> +    /*
> + * With SR/IOV and ARI, a device at function 0 need not be a 
> multifunction
> + * device, as it may just be a VF that ended up with function 0 in
> + * the legacy PCI interpretation. Avoid failing in such cases:
> + */
> +    if (pci_is_vf(dev) &&
> +    dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +    return;
> +    }
> +
>  /*
>   * multifunction bit is interpreted in two ways as follows.
>   *   - all functions must set the bit to 1.
> @@ -1083,6 +1104,7 @@ static PCIDevice *do_pci_register_device(PCIDevice 
> *pci_dev,
>     bus->devices[devfn]->name);
>  return NULL;
>  } else if (dev->hotplugged &&
> +   !pci_is_vf(pci_dev) &&
>     pci_get_function_0(pci_dev)) {
>  error_setg(errp, "PCI: slot %d function 0 already occupied by %s,"
>     " new func %s cannot be exposed to guest.",
> @@

Re: [PATCH v3 01/15] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2022-01-26 Thread Knut Omang

On Wed, 2022-01-26 at 14:23 +0100, Łukasz Gieryk wrote:
> I'm sorry for the delayed response. We (I and the other Lukasz) somehow
> had hoped that Knut, the original author of this patch, would have
> responded.

Yes, sorry - this one slipped past me for some reason.

> 
> Let me address your questions, up to my best knowledge.
>   
> > > -static pcibus_t pci_bar_address(PCIDevice *d,
> > > -    int reg, uint8_t type, pcibus_t size)
> > > +static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
> > > +    uint8_t type, pcibus_t size)
> > > +{
> > > +    pcibus_t new_addr;
> > > +    if (!pci_is_vf(d)) {
> > > +    int bar = pci_bar(d, reg);
> > > +    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +    new_addr = pci_get_quad(d->config + bar);
> > > +    } else {
> > > +    new_addr = pci_get_long(d->config + bar);
> > > +    }
> > > +    } else {
> > > +    PCIDevice *pf = d->exp.sriov_vf.pf;
> > > +    uint16_t sriov_cap = pf->exp.sriov_cap;
> > > +    int bar = sriov_cap + PCI_SRIOV_BAR + reg * 4;
> > > +    uint16_t vf_offset = pci_get_word(pf->config + sriov_cap +
> > > PCI_SRIOV_VF_OFFSET);
> > > +    uint16_t vf_stride = pci_get_word(pf->config + sriov_cap +
> > > PCI_SRIOV_VF_STRIDE);
> > > +    uint32_t vf_num = (d->devfn - (pf->devfn + vf_offset)) / 
> > > vf_stride;
> > > +
> > > +    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > +    new_addr = pci_get_quad(pf->config + bar);
> > > +    } else {
> > > +    new_addr = pci_get_long(pf->config + bar);
> > > +    }
> > > +    new_addr += vf_num * size;
> > > +    }
> > > +    if (reg != PCI_ROM_SLOT) {
> > > +    /* Preserve the rom enable bit */
> > > +    new_addr &= ~(size - 1);
> > 
> > This comment puzzles me. How does clearing low bits preserve
> > any bits? Looks like this will clear low bits if any.
> > 
> 
> I think the comment applies to (reg != PCI_ROM_SLOT), i.e., the bits are
> cleared for BARs, but not for expansion ROM. I agree the placement of this
> comment is slightly misleading. We will move it up and rephrase slightly.

I agree - it is probably better to just put the comment above the if (...).
Other than that, I believe the code is correct.
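
A minimal sketch of the address computation under discussion, written as
standalone C rather than against QEMU's headers. The function and parameter
names are hypothetical; only the arithmetic follows the quoted hunk. It shows
why the mask clears the low flag bits for ordinary BARs while the ROM slot
keeps its enable bit (bit 0):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Logical VF index derived from routing IDs, following the vf_num
 * expression in the quoted patch. All names here are illustrative and
 * are not part of the QEMU API.
 */
static uint32_t sketch_vf_index(uint16_t vf_devfn, uint16_t pf_devfn,
                                uint16_t vf_offset, uint16_t vf_stride)
{
    assert(vf_stride != 0);
    return (uint32_t)(vf_devfn - (pf_devfn + vf_offset)) / vf_stride;
}

/*
 * raw_bar is the value read from config space, including the low
 * type/flag bits. size is assumed to be a power of two. For a PF or a
 * non-SR/IOV device, pass vf_index = 0.
 */
static uint64_t sketch_bar_addr(uint64_t raw_bar, uint32_t vf_index,
                                uint64_t size, int is_rom_slot)
{
    uint64_t addr = raw_bar + (uint64_t)vf_index * size;

    if (!is_rom_slot) {
        /* Ordinary BAR: drop the type/flag bits below the BAR size. */
        addr &= ~(size - 1);
    }
    /* ROM slot: leave the low bits alone so the enable bit survives. */
    return addr;
}
```

The caller can then test the ROM enable bit on the unmasked value, which is
what the later `reg == PCI_ROM_SLOT` check in the quoted code relies on.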

Knut

>  
> > > +pcibus_t pci_bar_address(PCIDevice *d,
> > > + int reg, uint8_t type, pcibus_t size)
> > >   {
> > >   pcibus_t new_addr, last_addr;
> > > -    int bar = pci_bar(d, reg);
> > >   uint16_t cmd = pci_get_word(d->config + PCI_COMMAND);
> > >   Object *machine = qdev_get_machine();
> > >   ObjectClass *oc = object_get_class(machine);
> > > @@ -1309,7 +1363,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> > >   if (!(cmd & PCI_COMMAND_IO)) {
> > >   return PCI_BAR_UNMAPPED;
> > >   }
> > > -    new_addr = pci_get_long(d->config + bar) & ~(size - 1);
> > > +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
> > >   last_addr = new_addr + size - 1;
> > >   /* Check if 32 bit BAR wraps around explicitly.
> > >    * TODO: make priorities correct and remove this work around.
> > > @@ -1324,11 +1378,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> > >   if (!(cmd & PCI_COMMAND_MEMORY)) {
> > >   return PCI_BAR_UNMAPPED;
> > >   }
> > > -    if (type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > > -    new_addr = pci_get_quad(d->config + bar);
> > > -    } else {
> > > -    new_addr = pci_get_long(d->config + bar);
> > > -    }
> > > +    new_addr = pci_config_get_bar_addr(d, reg, type, size);
> > >   /* the ROM slot has a specific enable bit */
> > >   if (reg == PCI_ROM_SLOT && !(new_addr & PCI_ROM_ADDRESS_ENABLE)) {
> > 
> > And in fact here we check the low bit and handle it specially.
> 
> The code seems correct for me. The bit is preserved for ROM case.
> 
> > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > index d7d73a31e4..182a225054 100644
> > > --- a/hw/pci/pcie.c
> > > +++ b/hw/pci/pcie.c
> > > @@ -446,6 +446,11 @@ void pcie_cap_slot_plug_cb(HotplugHandler 
> > > *hotplug_dev,
> > > DeviceState *dev,
> > >   PCIDevice *pci_dev = PCI_DEVICE(dev);
> > >   uint32_t lnkcap = pci_get_long(exp_cap + PCI_EXP_LNKCAP);
> > >   
> > > +    if(pci_is_vf(pci_dev)) {
> > > +    /* We don't want to change any state in hotplug_dev for SR/IOV 
> > > virtual
> > > functions */
> > > +    return;
> > > +    }
> > > +
> > 
> > Coding style violation here.  And pls document the why not the what.
> > E.g. IIRC the reason is that VFs don't have an express capability,
> > right?
> 
> I think the reason is that virtual functions don’t exist physically, so
> they cannot be individually disconnected. Only PF should respond to
> hotplug events, implicitly disconnecting (thus: destroying) all child
> VFs.
> 
> Anyway, we will update this comment to state *why* and add the missing
>

Re: [PATCH 13/15] pcie: Add helpers to the SR/IOV API

2021-10-26 Thread Knut Omang

On Thu, 2021-10-07 at 18:24 +0200, Lukasz Maniak wrote:
> From: Łukasz Gieryk 
> 
> Two convenience functions for retrieving:
>  - the total number of VFs,
>  - the PCIDevice object of the N-th VF.
> 
> Signed-off-by: Łukasz Gieryk 
> ---
>  hw/pci/pcie_sriov.c | 14 ++
>  include/hw/pci/pcie_sriov.h |  8 
>  2 files changed, 22 insertions(+)
> 
> diff --git a/hw/pci/pcie_sriov.c b/hw/pci/pcie_sriov.c
> index cac2aee061..5a8e92d5ab 100644
> --- a/hw/pci/pcie_sriov.c
> +++ b/hw/pci/pcie_sriov.c
> @@ -292,8 +292,22 @@ uint16_t pcie_sriov_vf_number(PCIDevice *dev)
>  return dev->exp.sriov_vf.vf_number;
>  }
>  
> +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev)
> +{
> +    assert(!pci_is_vf(dev));
> +    return dev->exp.sriov_pf.num_vfs;
> +}
>  
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev)
>  {
>  return dev->exp.sriov_vf.pf;
>  }
> +
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n)
> +{
> +    assert(!pci_is_vf(dev));
> +    if (n < dev->exp.sriov_pf.num_vfs) {
> +    return dev->exp.sriov_pf.vf[n];
> +    }
> +    return NULL;
> +}
> diff --git a/include/hw/pci/pcie_sriov.h b/include/hw/pci/pcie_sriov.h
> index 9ab48b79c0..d1f39b7223 100644
> --- a/include/hw/pci/pcie_sriov.h
> +++ b/include/hw/pci/pcie_sriov.h
> @@ -65,9 +65,17 @@ void pcie_sriov_pf_disable_vfs(PCIDevice *dev);
>  /* Get logical VF number of a VF - only valid for VFs */
>  uint16_t pcie_sriov_vf_number(PCIDevice *dev);
>  
> +/* Get the total number of VFs - only valid for PF */
> +uint16_t pcie_sriov_vf_number_total(PCIDevice *dev);
> +
>  /* Get the physical function that owns this VF.
>   * Returns NULL if dev is not a virtual function
>   */
>  PCIDevice *pcie_sriov_get_pf(PCIDevice *dev);
>  
> +/* Get the n-th VF of this physical function - only valid for PF.
> + * Returns NULL if index is invalid
> + */
> +PCIDevice *pcie_sriov_get_vf_at_index(PCIDevice *dev, int n);
> +
>  #endif /* QEMU_PCIE_SRIOV_H */


These look like natural improvements to me, thanks!

Reviewed-by: Knut Omang

Re: [PATCH 01/15] pcie: Set default and supported MaxReadReq to 512

2021-10-26 Thread Knut Omang

On Tue, 2021-10-26 at 16:36 +0200, Lukasz Maniak wrote:
> On Thu, Oct 07, 2021 at 06:12:41PM -0400, Michael S. Tsirkin wrote:
> > On Thu, Oct 07, 2021 at 06:23:52PM +0200, Lukasz Maniak wrote:
> > > From: Knut Omang 
> > > 
> > > Make the default PCI Express Capability for PCIe devices set
> > > MaxReadReq to 512.
> > 
> > code says 256
> > 
> > > Tyipcal modern devices people would want to
> > 
> > 
> > typo
> > 
> > > emulate or simulate would want this. The previous value would
> > > cause warnings from the root port driver on some kernels.
> > 
> > 
> > which specifically?
> > 
> > > 
> > > Signed-off-by: Knut Omang 
> > 
> > we can't make changes like this unconditionally, this will
> > break migration across versions.
> > Pls tie this to a machine version.
> > 
> > Thanks!
> > > ---
> > >   hw/pci/pcie.c | 5 -
> > >   1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > index 6e95d82903..c1a12f3744 100644
> > > --- a/hw/pci/pcie.c
> > > +++ b/hw/pci/pcie.c
> > > @@ -62,8 +62,9 @@ pcie_cap_v1_fill(PCIDevice *dev, uint8_t port, uint8_t 
> > > type, uint8_t
> > > version)
> > >    * Functions conforming to the ECN, PCI Express Base
> > >    * Specification, Revision 1.1., or subsequent PCI Express Base
> > >    * Specification revisions.
> > > + *  + set max payload size to 256, which seems to be a common value
> > >    */
> > > -    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER);
> > > +    pci_set_long(exp_cap + PCI_EXP_DEVCAP, PCI_EXP_DEVCAP_RBER | (0x1 &
> > > PCI_EXP_DEVCAP_PAYLOAD));
> > >   
> > >   pci_set_long(exp_cap + PCI_EXP_LNKCAP,
> > >    (port << PCI_EXP_LNKCAP_PN_SHIFT) |
> > > @@ -179,6 +180,8 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset,
> > >   pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
> > >    PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
> > >   
> > > +    pci_set_word(exp_cap + PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_READRQ_256B);
> > > +
> > >   pci_set_word(dev->wmask + pos + PCI_EXP_DEVCTL2, 
> > > PCI_EXP_DEVCTL2_EETLPPB);
> > >   
> > >   if (dev->cap_present & QEMU_PCIE_EXTCAP_INIT) {
> > > -- 
> > > 2.25.1
> > 
> 
> Hi Michael,
> 
> The reason Knut keeps rebasing this fix along with SR-IOV patch is not
> clear for us.

Sorry for the slow response - I seem to have messed up my mail filters so this
thread slipped past my attention.

> Since we have tested the NVMe device without this fix and did not notice
> any issues mentioned by Knut on kernel 5.4.0, we decided to drop it for
> v2.

I agree, let's just drop it.

The warnings most likely came from the 3.x kernels I was working with back
then, probably on Oracle Linux, given that I did not point to a specific
kernel range at the time.
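
For reference on the 256-versus-512 mismatch raised in the quoted review, here
is a minimal sketch of how the Max_Read_Request_Size field in the Device
Control register is encoded: bits 14:12 hold n, where the size is 128 << n
bytes. The macro and function names are hypothetical; the mask value 0x7000 is
an assumption taken from the Linux pci_regs.h definitions (where
PCI_EXP_DEVCTL_READRQ_256B is 0x1000 and the 512-byte variant is 0x2000).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the read request size field in PCI_EXP_DEVCTL. */
#define SKETCH_DEVCTL_READRQ_MASK  0x7000u
#define SKETCH_DEVCTL_READRQ_SHIFT 12

/*
 * Encode a read request size in bytes into its DEVCTL field value.
 * Returns -1 for sizes that are not 128, 256, 512, 1024, 2048 or 4096.
 */
static int sketch_readrq_field(unsigned bytes)
{
    unsigned code;

    for (code = 0; code <= 5; code++) {
        if ((128u << code) == bytes) {
            return (int)(code << SKETCH_DEVCTL_READRQ_SHIFT);
        }
    }
    return -1;
}

/*
 * Decode the field back to bytes. Codes 6 and 7 are reserved, so the
 * result is only meaningful for values produced by the encoder above.
 */
static unsigned sketch_readrq_bytes(uint16_t devctl)
{
    return 128u << ((devctl & SKETCH_DEVCTL_READRQ_MASK)
                    >> SKETCH_DEVCTL_READRQ_SHIFT);
}
```

With this encoding, the constant used in the quoted hunk decodes to 256 bytes,
while a 512-byte default as described in the commit message would need the
field value 0x2000.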

> However, I have posted your comments to this patch on Knut's github so
> they can be addressed in case Knut decides to resubmit it later though.

Thanks for that ping, Lukasz, and great to see the patch finally being used in a
functional device!

Knut

> Thanks,
> Lukasz

Re: [Qemu-devel] QEMU and vIOMMU support for emulated VF passthrough to nested (L2) VM

2019-04-01 Thread Knut Omang

On Mon, 2019-04-01 at 14:01 +, Elijah Shakkour wrote:
> 
> > -Original Message-
> > From: Peter Xu 
> > Sent: Monday, April 1, 2019 1:25 PM
> > To: Elijah Shakkour 
> > Cc: Knut Omang ; Michael S. Tsirkin
> > ; Alex Williamson ;
> > Marcel Apfelbaum ; Stefan Hajnoczi
> > ; qemu-devel@nongnu.org
> > Subject: Re: QEMU and vIOMMU support for emulated VF passthrough to
> > nested (L2) VM
> > 
> > On Mon, Apr 01, 2019 at 09:12:38AM +, Elijah Shakkour wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Peter Xu 
> > > > Sent: Monday, April 1, 2019 5:47 AM
> > > > To: Elijah Shakkour 
> > > > Cc: Knut Omang ; Michael S. Tsirkin
> > > > ; Alex Williamson ;
> > > > Marcel Apfelbaum ; Stefan Hajnoczi
> > > > ; qemu-devel@nongnu.org
> > > > Subject: Re: QEMU and vIOMMU support for emulated VF passthrough
> > to
> > > > nested (L2) VM
> > > >
> > > > On Sun, Mar 31, 2019 at 11:15:00AM +, Elijah Shakkour wrote:
> > > >
> > > > [...]
> > > >
> > > > > I didn't have DMA nor MMIO read/write working with my old command
> > > > line.
> > > > > But, when I removed all CPU flags and only provided "-cpu host", I
> > > > > see that
> > > > MMIO works.
> > > > > Still, DMA read/write from emulated device doesn't work for VF.
> > > > > For
> > > > example:
> > > > > Driver provides me a buffer pointer through MMIO write, this
> > > > > address
> > > > (pointer) is GPA of L2, and when I try to call pci_dma_read() with
> > > > this address I get:
> > > > > "
> > > > > Unassigned mem read  "
> > > >
> > > > I don't know where this error log was dumped but if it's during DMA
> > > > then I agree it can probably be related to vIOMMU.
> > > >
> > >
> > > This log is dumped from:
> > > memory.c: unassigned_mem_read()
> > >
> > > > > As I said, my problem now is in translation of L2 GPA provided by
> > > > > driver,
> > > > when I call DMA read/write for this address from VF.
> > > > > Any insights?
> > > >
> > > > I just noticed that you were using QEMU 2.12 [1].  If that's the
> > > > case, please rebase to the latest QEMU, at least >=3.0 because
> > > > there's major refactor of the shadow logic during 3.0 devel cycle 
> > > > AFAICT.
> > > >
> > >
> > > Rebased to QEMU 3.1
> > > Now I see the address I'm trying to read from in log but still same error:
> > > "
> > > Unassigned mem read f0481000
> > > "
> > > What do you suggest?
> > 
> > Would you please answer the questions that Knut asked?  Is it working for L1
> > guest?  How about PF?
> 
> Both VF and PF are working for L1 guest.
> I don't know how to passthrough PF to nested VM in hyper-v.

On Linux, passing through a VF works the same way as passing through a PF.
Maybe you can try passthrough with Linux at every level first (first the PF, then a VF)?

> I don't invoke VF manually in hyper-v and pass it through to nested VM. I use 
> hyper-v
> manager to configure and provide a VF for nested VM (I can see the VF only in 
> the nested
> VM).
> 
> Did someone try to run emulated device in linux RH as nested L2 where L1 is 
> windows
> hyper-v? Does DMA read/write work for this emulated device in this case?

I have never tried that; I have only used Linux as L2. Windows might be pickier
about what it expects, so starting with Linux to rule that out is probably a
good idea.

> > 
> > You can also try to enable VT-d device log by appending:
> > 
> >   -trace enable="vtd_*"
> > 
> > In case it dumps anything useful for you.
> 
> Is there a way to open those traces to be dumped to stdout/stderr on the fly, 
> instead of
> dtrace?

It's up to you which tracer(s) to configure when you build QEMU - check out
docs/devel/tracing.txt. There are a few trace events defined in the SR/IOV
patch set; you might want to enable them as well.

Knut

> > --
> > Peter Xu

Re: [Qemu-devel] QEMU and vIOMMU support for emulated VF passthrough to nested (L2) VM

2019-03-27 Thread Knut Omang

On Wed, 2019-03-27 at 14:41 +0800, Peter Xu wrote:
> On Tue, Mar 26, 2019 at 01:23:12PM +, Elijah Shakkour wrote:
> > Adding QEMU-devel
> 
> Hi, Elijah,
> 
> > 
> > -Original Message-
> > From: Michael S. Tsirkin  
> > Sent: Tuesday, March 26, 2019 2:53 PM
> > To: Elijah Shakkour 
> > Cc: Knut Omang ; Alex Williamson 
> > ;
> Marcel Apfelbaum ; Stefan Hajnoczi 
> ; 
> pet...@redhat.com
> > Subject: Re: QEMU and vIOMMU support for emulated VF passthrough to nested 
> > (L2) VM
> > 
> > I think you forgot to copy the qemu mailing list.
> > 
> > On Tue, Mar 26, 2019 at 10:08:17AM +, Elijah Shakkour wrote:
> > > My questions are:
> > > 
> > > - Suppose that there is an emulated NIC that supports SRIOV (I 
> > > implemented such a
> NIC), now does QEMU support a scenario of an emulated NIC that supports SRIOV 
> in Hyper-V 
> L1 guest, that invokes VF and pass it to nested linux L2 guest?
> 
> I am not an expert of SR-IOV but I can't see a limitation to not allow
> that to happen.
> 
> > > - I'm using vIOMMU in L1, so what is needed to be done in QEMU or maybe 
> > > in emulated
> NIC PF/VF to allow DMA remapping and INT remapping work as expected?
> 
> Your below command line should work, and even it seems to be an
> overkill.
> 
> If your device is completely emulated, IIUC you only simply need this
> on the latest QEMU:
> 
>   -M q35 -device intel-iommu
> 
> Split-irqchip and IR is on by default now, so you'll naturally gain
> x2apic if it's supported.  You can use x-aw-bits but only if you
> really need address space beyond 39 bits (which I suspect).  The rest
> parameters are optional too.
> 
> > > - Does the command line below -that I use to run QEMU- seem ok for the 
> > > scenario I
> described to work?
> 
> Before I look into details of the cmdline - I'd say MMIO in L2 should
> have nothing to do with IOMMU...  

The addresses used in L2 are L2 GPAs, which would typically differ from the
L2 HPAs (== L1 GPAs), so I think the IOMMU mappings must work.

You would need something like 'intel_iommu=on iommu=pt' as boot parameters for L1.

> Are you sure the MMIO traps are
> setup correctly?  Can the VF do IO properly even without L2?

I agree with Peter that just running the VF as another function in L1 
would be good to test before trying to get L2 passthrough to work.

I recommend you also verify that passing the PF through works as expected, 
unless you already have done so.

And do you see correct BAR address values in the lspci -vvv output in the L2 
instance?

From QEMU's perspective, a VF is just another device instance
apart from the differences in the BAR setup code, so if passing through a
non-virtual device works and the VF BAR addresses look right,
I believe VFs should work as well.

> Also I don't know whether there can be some tricks when you boot L2
> with vfio-pci when the device to assign is a VF.

A lot has happened since I was actively using the SR/IOV patch set myself, so
that is entirely possible as far as I can tell.

Thanks,
Knut 

> > > 
> > > -Original Message-
> > > From: Michael S. Tsirkin 
> > > Sent: Monday, March 25, 2019 4:14 AM
> > > To: Elijah Shakkour 
> > > Cc: Knut Omang ; Alex Williamson 
> > > ; Marcel Apfelbaum 
> > > ; Stefan Hajnoczi 
> > > Subject: Re: QEMU and vIOMMU support for emulated VF passthrough to 
> > > nested (L2) VM
> > > 
> > > Pls post all questions on list.
> > > I have a policy against answering off-list mail.
> > > Cc Pter Xu might be a good idea, too.
> > > 
> > > On Sun, Mar 24, 2019 at 09:56:26PM +, Elijah Shakkour wrote:
> > > > Hey,
> > > > 
> > > > I'm emulating Mellanox ConnectX-4 in QEMU and right now, I'm adding 
> > > > SRIOV
> capability.
> > > > I'm using Knut Omang SRIOV patches rebased to QEMU v2.12.
> > > > My server (L0) is Linux. L1 guest is Windows2016 Hyper-V and L2 guest 
> > > > is Linux
> RH7.2.
> > > > I can see my device in L1 VM and I see the invocation of the VF via 
> > > > SRIOV
> capability.
> > > > Inside L2 guest I see the virtual function in "lspci' command.
> > > > But when driver of L2 guest issues MMIO read/write, my MMIO ops don't 
> > > > get called.
> > > > I implemented my VF basically like Omang SRIOV example patch.
> > > > 
> > > > Could you please shed some light on what you think I might be missing?
> > > > 
> > > > Here

Re: [Qemu-devel] [PATCH v5 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-21 Thread Knut Omang

On Sat, 2019-02-16 at 11:17 -0700, Alex Williamson wrote:
> On Sat, 16 Feb 2019 17:51:11 +0100
> Knut Omang  wrote:
> 
> > Implementing an ACS capability on downstream ports and multifunction
> > endpoints indicates isolation and IOMMU visibility to a finer
> > granularity. This creates smaller IOMMU groups in the guest and thus
> > more flexibility in assigning endpoints to guest userspace or an L2
> > guest.
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 39 +++-
> >  include/hw/pci/pcie.h  |  6 ++-
> >  include/hw/pci/pcie_regs.h |  4 -
> >  3 files changed, 49 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..6afc37a 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -906,3 +906,42 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* ACS (Access Control Services) */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > +{
> > +bool is_downstream = pci_is_express_downstream_port(dev);
> > +uint16_t cap_bits = 0;
> > +
> > +/* For endpoints, only multifunction devs may have an ACS capability: 
> > */
> > +assert(is_downstream ||
> > +   (dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) ||
> > +   PCI_FUNC(dev->devfn));
> > +
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
> > +PCI_ACS_SIZEOF);
> > +dev->exp.acs_cap = offset;
> > +
> > +if (is_downstream) {
> > +/* Downstream ports must implement SV, TB, RR, CR, and UF (with
> > + * caveats on the latter three that we ignore for simplicity).
> > + * Endpoints may also implement a subset of ACS capabilities,
> > + * but these are optional if the endpoint does not support
> > + * peer-to-peer between functions and thus omitted here.
> > + * Downstream switch ports must also implement DT, while this
> > + * is optional for root ports, so we set that as well:
> > + */
> 
> Again...
> https://git.qemu.org/?p=qemu.git;a=blob;f=CODING_STYLE;hb=HEAD#l127
> 
> Personally I'd add DT in the original list rather than mention is
> separately, the caveats are pretty similar.

Sorry, I know - just overlooked it and forgot to re-check the style again.
Just sent v6.

Knut

> > +cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
> > +PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT;
> > +}
> > +
> > +pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits);
> > +pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);
> > +}
> > +
> > +void pcie_acs_reset(PCIDevice *dev)
> > +{
> > +if (dev->exp.acs_cap) {
> > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
> > +}
> > +}
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index 5b82a0d..e30334d 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -79,6 +79,9 @@ struct PCIExpressDevice {
> >  
> >  /* Offset of ATS capability in config space */
> >  uint16_t ats_cap;
> > +
> > +/* ACS */
> > +uint16_t acs_cap;
> >  };
> >  
> >  #define COMPAT_PROP_PCP "power_controller_present"
> > @@ -128,6 +131,9 @@ void pcie_add_capability(PCIDevice *dev,
> >   uint16_t offset, uint16_t size);
> >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> >  
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset);
> > +void pcie_acs_reset(PCIDevice *dev);
> > +
> >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t 
> > ser_num);
> >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > index ad4e780..1db86b0 100644
> > --- a/include/hw/pci/pcie_regs.h
> > +++ b/include/hw/pci/pcie_regs.h
> > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> >   PCI_ERR_COR_INTERNAL | \
> >   PCI_ERR_COR_HL_OVERFLOW)
> >  
> > +/* ACS */
> > +#define PCI_ACS_VER 0x1
> > +#define PCI_ACS_SIZEOF  8
> > +
> >  #endif /* QEMU_PCIE_REGS_H */
>

[Qemu-devel] [PATCH v6 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-02-21 Thread Knut Omang

Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virtualization setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.

In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.

Signed-off-by: Knut Omang 
Reviewed-by: Marcel Apfelbaum 
---
 hw/pci-bridge/gen_pcie_root_port.c | 4 
 hw/pci-bridge/pcie_root_port.c | 4 
 include/hw/pci/pcie_port.h | 1 +
 3 files changed, 9 insertions(+)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c 
b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..26bda73 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,9 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
+(GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
+
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +152,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void 
*data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..e94d918 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
 pcie_cap_deverr_reset(d);
 pcie_cap_slot_reset(d);
 pcie_cap_arifwd_reset(d);
+pcie_acs_reset(d);
 pcie_aer_root_reset(d);
 pci_bridge_reset(qdev);
 pci_bridge_disable_base_limit(d);
@@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset;/* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v6 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-02-21 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is needed for passing
individual functions of a device populating the port through to
an L2 guest from an unmodified kernel.

Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

ACS is thus necessary to support SR/IOV where the primary
purpose is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V
equivalent unless ACS is supported by the root port.
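
As a rough illustration of what the helper in patch 1 sets up, the sketch below
computes the ACS capability bits advertised for a downstream port and models a
guest write to the ACS control register through a write mask equal to those
bits. It is standalone C, not the QEMU code: the SKETCH_ names and functions
are hypothetical, and the bit values are an assumption based on the
PCI_ACS_* definitions in the Linux pci_regs.h header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ACS capability/control bit positions (as in Linux pci_regs.h). */
#define SKETCH_ACS_SV 0x01 /* Source Validation */
#define SKETCH_ACS_TB 0x02 /* Translation Blocking */
#define SKETCH_ACS_RR 0x04 /* P2P Request Redirect */
#define SKETCH_ACS_CR 0x08 /* P2P Completion Redirect */
#define SKETCH_ACS_UF 0x10 /* Upstream Forwarding */
#define SKETCH_ACS_DT 0x40 /* Direct Translated P2P */

/*
 * Downstream ports advertise SV, TB, RR, CR, UF and DT; endpoints in
 * this sketch advertise nothing, matching the simplification described
 * in the patch comments.
 */
static uint16_t sketch_acs_cap_bits(bool is_downstream_port)
{
    if (!is_downstream_port) {
        return 0;
    }
    return SKETCH_ACS_SV | SKETCH_ACS_TB | SKETCH_ACS_RR |
           SKETCH_ACS_CR | SKETCH_ACS_UF | SKETCH_ACS_DT;
}

/*
 * Model a config-space write: only bits set in wmask can change, so
 * control bits for capabilities that are not advertised stay read-only.
 */
static uint16_t sketch_acs_ctrl_write(uint16_t cur, uint16_t val,
                                      uint16_t wmask)
{
    return (uint16_t)((cur & ~wmask) | (val & wmask));
}
```

Using the capability bits as the write mask means a guest can only enable the
controls the port claims to support, and a reset simply writes 0 to the
control register.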

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v12

The patches' basic operation with VFIO and IOMMU groups has
been tested with the above patch set and a rebased version of an
in-progress igb Ethernet device, which needs some more care
before I can let it go out.

Changes from v5:

- Fix comment on cap.bits

Changes from v4:

- Set the DT bit for downstream ports
- Fix the assertion guard against uses violating the spec
- Use pci_is_express_downstream_port() instead of type casts
  for discrimination.

Changes from v3:

- rebased to the latest qemu master
- Revised commit message and comments for patch #1 to make it
  clearer that VFIO works for single function devices even without ACS.
- Improved checking for valid endpoints for ACS.
- Fixed comment style issue
- Co-locate the pcie_acs_init and _reset functions and
  rename pcie_cap_acs_reset to pcie_acs_reset to adhere to the naming
  convention that _cap_ functions in pcie are for changing state
  in the main pcie capability and not the individual extended
  capabilities.
- Added Marcel's r-b to patch 2, which did not change

Changes from v2:

- rebased to the latest qemu master

Incorporated further feedback from Alex:
- Make sure slot/downstream capability bits are only set for slots.
- Make acs reset callback do nothing if no acs capability exists
- Set correct acs version
- div simplification

Changes from v1:

Incorporated feedback from Alex Williamson:
- Make commit messages reflect a more correct understanding of how this
  affects VFIO operation.
- Implemented the CTRL register properly (reset callback + making
  non-implemented capabilities RO, default value 0)
- removed the egress ctrl vector parameter to the init function
- Fixed some whitespace issues

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  4 +++-
 hw/pci-bridge/pcie_root_port.c |  4 +++-
 hw/pci/pcie.c  | 38 +++-
 include/hw/pci/pcie.h  |  6 +-
 include/hw/pci/pcie_port.h |  1 +-
 include/hw/pci/pcie_regs.h |  4 +++-
 6 files changed, 57 insertions(+)

base-commit: 0b5e750bea635b167eb03d86c3d9a09bbd43bc06
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v6 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-21 Thread Knut Omang

Implementing an ACS capability on downstream ports and multifunction
endpoints indicates isolation and IOMMU visibility to a finer
granularity. This creates smaller IOMMU groups in the guest and thus
more flexibility in assigning endpoints to guest userspace or an L2
guest.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 38 ++
 include/hw/pci/pcie.h  |  6 ++
 include/hw/pci/pcie_regs.h |  4 
 3 files changed, 48 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..09ebf11 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -906,3 +906,41 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* ACS (Access Control Services) */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset)
+{
+bool is_downstream = pci_is_express_downstream_port(dev);
+uint16_t cap_bits = 0;
+
+/* For endpoints, only multifunction devs may have an ACS capability: */
+assert(is_downstream ||
+   (dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) ||
+   PCI_FUNC(dev->devfn));
+
+pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
+PCI_ACS_SIZEOF);
+dev->exp.acs_cap = offset;
+
+if (is_downstream) {
+/*
+ * Downstream ports must implement SV, TB, RR, CR, UF, and DT (with
+ * caveats on the latter four that we ignore for simplicity).
+ * Endpoints may also implement a subset of ACS capabilities,
+ * but these are optional if the endpoint does not support
+ * peer-to-peer between functions and thus omitted here.
+ */
+cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT;
+}
+
+pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits);
+pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);
+}
+
+void pcie_acs_reset(PCIDevice *dev)
+{
+if (dev->exp.acs_cap) {
+pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
+}
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..e30334d 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -79,6 +79,9 @@ struct PCIExpressDevice {
 
 /* Offset of ATS capability in config space */
 uint16_t ats_cap;
+
+/* ACS */
+uint16_t acs_cap;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
@@ -128,6 +131,9 @@ void pcie_add_capability(PCIDevice *dev,
  uint16_t offset, uint16_t size);
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
+void pcie_acs_init(PCIDevice *dev, uint16_t offset);
+void pcie_acs_reset(PCIDevice *dev);
+
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..1db86b0 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER 0x1
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1
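
The assertion guard in pcie_acs_init() above, and the "Fix the assertion guard" item in the changes from v4, can be compared side by side in a standalone sketch. The helper names `acs_allowed` and `acs_allowed_v4` and the `FUNC_OF` macro are made up for illustration; the only assumption about PCI itself is that the function number is the low three bits of devfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Function number: low three bits of devfn (illustrative macro). */
#define FUNC_OF(devfn) ((devfn) & 0x07)

/*
 * v4 guard: for endpoints, required the multifunction bit AND a
 * nonzero function number, which wrongly rejects function 0.
 */
static bool acs_allowed_v4(bool is_downstream, bool multifunction_bit,
                           uint8_t devfn)
{
    return is_downstream || (multifunction_bit && FUNC_OF(devfn) != 0);
}

/*
 * v5/v6 guard: an endpoint qualifies if it carries the multifunction
 * bit OR has a nonzero function number (which already implies that it
 * is part of a multifunction device).
 */
static bool acs_allowed(bool is_downstream, bool multifunction_bit,
                        uint8_t devfn)
{
    return is_downstream || multifunction_bit || FUNC_OF(devfn) != 0;
}
```

The difference shows up for function 0 of a multifunction endpoint and for nonzero functions that do not carry the multifunction bit themselves: both are accepted by the later guard and rejected by the v4 one.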

[Qemu-devel] [PATCH v5 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-02-16 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is needed for passing
individual functions of a device populating the port through to
an L2 guest from an unmodified kernel.

Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

ACS is thus necessary to support SR/IOV where the primary
purpose is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V
equivalent unless ACS is supported by the root port.

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v12

The patches' basic operation with VFIO and IOMMU groups has
been tested with the above patch set and a rebased version of an
in-progress igb Ethernet device, which needs some more care
before I can let it go out.

Changes from v4:

- Set the DT bit for downstream ports
- Fix the assertion guard against uses violating the spec
- Use pci_is_express_downstream_port() instead of type casts
  for discrimination.

Changes from v3:

- rebased to the latest qemu master
- Revised commit message and comments for patch #1 to make it
  clearer that VFIO works for single function devices even without ACS.
- Improved checking for valid endpoints for ACS.
- Fixed comment style issue
- Co-locate the pcie_acs_init and _reset functions and
  rename pcie_cap_acs_reset to pcie_acs_reset to adhere to the naming
  convention that _cap_ functions in pcie are for changing state
  in the main pcie capability and not the individual extended
  capabilities.
- Added Marcel's r-b to patch 2, which did not change

Changes from v2:

- rebased to the latest qemu master

Incorporated further feedback from Alex:
- Make sure slot/downstream capability bits are only set for slots.
- Make acs reset callback do nothing if no acs capability exists
- Set correct acs version
- div simplification

Changes from v1:

Incorporated feedback from Alex Williamson:
- Make commit messages reflect a more correct understanding of how this
  affects VFIO operation.
- Implemented the CTRL register properly (reset callback + making
  non-implemented capabilities RO, default value 0)
- removed the egress ctrl vector parameter to the init function
- Fixed some whitespace issues

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  4 +++-
 hw/pci-bridge/pcie_root_port.c |  4 +++-
 hw/pci/pcie.c  | 39 +++-
 include/hw/pci/pcie.h  |  6 +-
 include/hw/pci/pcie_port.h |  1 +-
 include/hw/pci/pcie_regs.h |  4 +++-
 6 files changed, 58 insertions(+)

base-commit: 0b5e750bea635b167eb03d86c3d9a09bbd43bc06
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v5 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-16 Thread Knut Omang

Implementing an ACS capability on downstream ports and multifunction
endpoints indicates isolation and IOMMU visibility to a finer
granularity. This creates smaller IOMMU groups in the guest and thus
more flexibility in assigning endpoints to guest userspace or an L2
guest.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 39 +++-
 include/hw/pci/pcie.h  |  6 ++-
 include/hw/pci/pcie_regs.h |  4 -
 3 files changed, 49 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..6afc37a 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -906,3 +906,42 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* ACS (Access Control Services) */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset)
+{
+bool is_downstream = pci_is_express_downstream_port(dev);
+uint16_t cap_bits = 0;
+
+/* For endpoints, only multifunction devs may have an ACS capability: */
+assert(is_downstream ||
+   (dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) ||
+   PCI_FUNC(dev->devfn));
+
+pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
+PCI_ACS_SIZEOF);
+dev->exp.acs_cap = offset;
+
+if (is_downstream) {
+/* Downstream ports must implement SV, TB, RR, CR, and UF (with
+ * caveats on the latter three that we ignore for simplicity).
+ * Endpoints may also implement a subset of ACS capabilities,
+ * but these are optional if the endpoint does not support
+ * peer-to-peer between functions and thus omitted here.
+ * Downstream switch ports must also implement DT, while this
+ * is optional for root ports, so we set that as well:
+ */
+cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR |
+PCI_ACS_CR | PCI_ACS_UF | PCI_ACS_DT;
+}
+
+pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits);
+pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);
+}
+
+void pcie_acs_reset(PCIDevice *dev)
+{
+if (dev->exp.acs_cap) {
+pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
+}
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..e30334d 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -79,6 +79,9 @@ struct PCIExpressDevice {
 
 /* Offset of ATS capability in config space */
 uint16_t ats_cap;
+
+/* ACS */
+uint16_t acs_cap;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
@@ -128,6 +131,9 @@ void pcie_add_capability(PCIDevice *dev,
  uint16_t offset, uint16_t size);
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
+void pcie_acs_init(PCIDevice *dev, uint16_t offset);
+void pcie_acs_reset(PCIDevice *dev);
+
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..1db86b0 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER 0x1
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v5 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-02-16 Thread Knut Omang

Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virtualization setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.

In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.

Signed-off-by: Knut Omang 
Reviewed-by: Marcel Apfelbaum 
---
 hw/pci-bridge/gen_pcie_root_port.c | 4 
 hw/pci-bridge/pcie_root_port.c | 4 
 include/hw/pci/pcie_port.h | 1 +
 3 files changed, 9 insertions(+)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..26bda73 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,9 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
+(GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
+
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +152,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..e94d918 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
 pcie_cap_deverr_reset(d);
 pcie_cap_slot_reset(d);
 pcie_cap_arifwd_reset(d);
+pcie_acs_reset(d);
 pcie_aer_root_reset(d);
 pci_bridge_reset(qdev);
 pci_bridge_disable_base_limit(d);
@@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset;/* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1
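
The GEN_PCIE_ROOT_PORT_ACS_OFFSET definition above places the ACS capability directly after the AER capability in extended config space. The following sketch works through that arithmetic and the layout of a PCIe extended capability header (16-bit ID, 4-bit version, 12-bit next pointer). The value 0x48 for PCI_ERR_SIZEOF is an assumption based on QEMU's headers at the time, and the `ext_cap_*` helpers are illustrative names, not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_CAP_ID_ERR 0x0001 /* Advanced Error Reporting */
#define EXT_CAP_ID_ACS 0x000d /* Access Control Services */

#define AER_OFFSET 0x100                      /* first ext. capability */
#define ERR_SIZEOF 0x48                       /* assumed PCI_ERR_SIZEOF */
#define ACS_OFFSET (AER_OFFSET + ERR_SIZEOF)  /* 0x148 */

/* Build an extended capability header: ID[15:0] ver[19:16] next[31:20]. */
static uint32_t ext_cap_header(uint16_t id, uint8_t ver, uint16_t next)
{
    /* The next pointer must be dword aligned and inside config space. */
    assert((next & 0x3) == 0 && next < 0x1000);
    return (uint32_t)id |
           ((uint32_t)(ver & 0xf) << 16) |
           ((uint32_t)next << 20);
}

static uint16_t ext_cap_id(uint32_t header)
{
    return (uint16_t)(header & 0xffff);
}

static uint8_t ext_cap_ver(uint32_t header)
{
    return (uint8_t)((header >> 16) & 0xf);
}

static uint16_t ext_cap_next(uint32_t header)
{
    return (uint16_t)((header >> 20) & 0xffc);
}
```

With these assumptions, an AER header at 0x100 whose next pointer is ACS_OFFSET leads a guest walking the capability list to the ACS header at 0x148, and a next pointer of 0 in the ACS header terminates the list.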

Re: [Qemu-devel] [PATCH v4 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-14 Thread Knut Omang

On Thu, 2019-02-14 at 08:51 -0700, Alex Williamson wrote:
> On Thu, 14 Feb 2019 08:07:33 +0100
> Knut Omang  wrote:
> 
> > On Wed, 2019-02-13 at 12:13 -0700, Alex Williamson wrote:
> > > On Wed, 13 Feb 2019 10:29:58 +0100
> > > Knut Omang  wrote:
> > >   
> > > > Add a helper function to add PCIe capability for Access Control Services
> > > > (ACS)  
> > > 
> > > This is redundant to the commit title.
> > >   
> > > > ACS support in the associated root port is needed to pass
> > > > through individual functions of a device to different VMs with VFIO
> > > > without Alex Williamson's pcie_acs_override kernel patch or similar
> > > > in the guest.  
> > > 
> > > This is overly subtle, to work backwards that individual functions
> > > (plural!) of a device (singular!) must imply a multifunction endpoint
> > > in the same hierarchy split to different L2 VMs.  Perhaps I only
> > > finally realized this subtlety on v4.
> > >   
> > > > Single function devices, or multifunction devices
> > > > that all goes to the same VM works fine even without ACS, as VFIO
> > > > will avoid putting the root port itself into the IOMMU group
> > > > even without ACS support in the port.  
> > > 
> > > Also confusing and incorrectly states that a) VFIO is responsible for
> > > IOMMU grouping, it's not, and b) that the root port would not be
> > > included in such a group, it would.  The latter was I thought the
> > > impetus for this series.  
> > 
> > that wasn't the intention but I can see that it looks that way now
> > 
> > > > Multifunction endpoints may also implement an ACS capability,
> > > > only on function 0, and with more limited features.  
> > > 
> > > "only on function 0" is incorrect, each function of a multifunction
> > > device should (not must) implement an ACS capability if any of them do.
> > > 
> > > Can't we just say something like:
> > > 
> > > "Implementing an ACS capability on downstream ports and multifunction
> > > endpoints indicates isolation and IOMMU visibility to a finer
> > > granularity thereby creating smaller IOMMU groups in the guest and thus
> > > more flexibility in assigning endpoints to guest userspace or an L2
> > > guest."  
> > 
> > sure - will use this - and remove my confusing attempt to
> > give credit to your override patch and VFIO :)
> > 
> > > (I avoided including SR-IOV with multifunction since that's not
> > > implemented here)  
> > 
> > I agree
> > 
> > > > Signed-off-by: Knut Omang 
> > > > ---
> > > >  hw/pci/pcie.c  | 39 +++-
> > > >  include/hw/pci/pcie.h  |  6 ++-
> > > >  include/hw/pci/pcie_regs.h |  4 -
> > > >  3 files changed, 49 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > > index 230478f..6e87994 100644
> > > > --- a/hw/pci/pcie.c
> > > > +++ b/hw/pci/pcie.c
> > > > @@ -906,3 +906,42 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> > > >  
> > > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> > > >  }
> > > > +
> > > > +/* ACS (Access Control Services) */
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > > > +{
> > > > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT);
> > > 
> > > Perhaps we should be using pci_is_express_downstream_port().  
> > 
> > oh - yes - I forgot that we need to look in pci.h for those kind of 
> > helpers..
> > 
> > > > +uint16_t cap_bits = 0;
> > > > +
> > > > +/*
> > > > + * For endpoints, only multifunction devices may have an
> > > > + * ACS capability, and only on function 0:  
> > > 
> > > Incorrect
> > >   
> > > > + */
> > > > +assert(is_pcie_slot ||
> > > > +   ((dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) &&
> > > > +PCI_FUNC(dev->devfn)));  
> > > 
> > > The second test should be:
> > > 
> > > ((dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) ||
> > >  PCI_FUNC(dev->devfn))
>

Re: [Qemu-devel] [PATCH v4 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-13 Thread Knut Omang

On Wed, 2019-02-13 at 12:13 -0700, Alex Williamson wrote:
> On Wed, 13 Feb 2019 10:29:58 +0100
> Knut Omang  wrote:
> 
> > Add a helper function to add PCIe capability for Access Control Services
> > (ACS)
> 
> This is redundant to the commit title.
> 
> > ACS support in the associated root port is needed to pass
> > through individual functions of a device to different VMs with VFIO
> > without Alex Williamson's pcie_acs_override kernel patch or similar
> > in the guest.
> 
> This is overly subtle, to work backwards that individual functions
> (plural!) of a device (singular!) must imply a multifunction endpoint
> in the same hierarchy split to different L2 VMs.  Perhaps I only
> finally realized this subtlety on v4.
> 
> > Single function devices, or multifunction devices
> > that all goes to the same VM works fine even without ACS, as VFIO
> > will avoid putting the root port itself into the IOMMU group
> > even without ACS support in the port.
> 
> Also confusing and incorrectly states that a) VFIO is responsible for
> IOMMU grouping, it's not, and b) that the root port would not be
> included in such a group, it would.  The latter was I thought the
> impetus for this series.

that wasn't the intention but I can see that it looks that way now

> > Multifunction endpoints may also implement an ACS capability,
> > only on function 0, and with more limited features.
> 
> "only on function 0" is incorrect, each function of a multifunction
> device should (not must) implement an ACS capability if any of them do.
> 
> Can't we just say something like:
> 
> "Implementing an ACS capability on downstream ports and multifunction
> endpoints indicates isolation and IOMMU visibility to a finer
> granularity thereby creating smaller IOMMU groups in the guest and thus
> more flexibility in assigning endpoints to guest userspace or an L2
> guest."

sure - will use this - and remove my confusing attempt to
give credit to your override patch and VFIO :)

> (I avoided including SR-IOV with multifunction since that's not
> implemented here)

I agree

> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 39 +++-
> >  include/hw/pci/pcie.h  |  6 ++-
> >  include/hw/pci/pcie_regs.h |  4 -
> >  3 files changed, 49 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..6e87994 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -906,3 +906,42 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* ACS (Access Control Services) */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > +{
> > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT);
> 
> Perhaps we should be using pci_is_express_downstream_port().

oh - yes - I forgot that we need to look in pci.h for those kinds of
helpers.

> > +uint16_t cap_bits = 0;
> > +
> > +/*
> > + * For endpoints, only multifunction devices may have an
> > + * ACS capability, and only on function 0:
> 
> Incorrect
> 
> > + */
> > +assert(is_pcie_slot ||
> > +   ((dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) &&
> > +PCI_FUNC(dev->devfn)));
> 
> The second test should be:
> 
> ((dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) ||
>  PCI_FUNC(dev->devfn))
> 
> If the function number is non-zero, then it's clearly a multifunction
> device, the multifunction capability is only required on function
> zero.  Just as in my previous example, an ACS capability can only
> describe/control the DMA flow of the function implementing it, nothing
> in the spec that I can see imposes function zero's DMA flow on the
> other functions.

Ah - of course - that makes sense -
I was overcomplicating this, and my comment didn't match
the code at all either.

> There's also a gap here that function zero can set the multifunction
> capability, but there may be no secondary devices defined.  Not that
> we necessarily need to resolve this, but it's a nuance of allowing
> arbitrary multifunction configurations as QEMU does.

Yes, in the SR/IOV case, at least as I have implemented it in QEMU,
with one PF that would be the default - as no VFs are defined at reset,
there's only one function, but it still needs to be multifunction
for QEMU to accept more functions appearing later.

> > +
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
> > +

Re: [Qemu-devel] [PATCH v3 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-02-13 Thread Knut Omang

On Mon, 2019-02-11 at 10:07 +0200, Marcel Apfelbaum wrote:
> 
> On 2/10/19 8:53 AM, Knut Omang wrote:
> > Claim ACS support in the generic PCIe root port to allow
> > passthrough of individual functions of a device to different
> > guests (in a nested virtualization setting) with VFIO.
> > Without this patch, all functions of a device, such as all VFs of
> > an SR/IOV device, will end up in the same IOMMU group.
> > A similar situation occurs on Windows with Hyper-V.
> > 
> > In the single function device case, it also has a small cosmetic
> > benefit in that the root port itself is not grouped with
> > the device. VFIO handles that situation in that binding rules
> > only apply to endpoints, so it does not limit passthrough in
> > those cases.
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >   hw/pci-bridge/gen_pcie_root_port.c | 4 
> >   hw/pci-bridge/pcie_root_port.c | 4 
> >   include/hw/pci/pcie_port.h | 1 +
> >   3 files changed, 9 insertions(+)
> > 
> > diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
> > index 9766edb..26bda73 100644
> > --- a/hw/pci-bridge/gen_pcie_root_port.c
> > +++ b/hw/pci-bridge/gen_pcie_root_port.c
> > @@ -20,6 +20,9 @@
> >   OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
> >   
> >   #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
> > +#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
> > +(GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
> > +
> >   #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
> >   
> >   typedef struct GenPCIERootPort {
> > @@ -149,6 +152,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
> >   rpc->interrupts_init = gen_rp_interrupts_init;
> >   rpc->interrupts_uninit = gen_rp_interrupts_uninit;
> >   rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
> > +rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
> >   }
> >   
> >   static const TypeInfo gen_rp_dev_info = {
> > diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
> > index 34ad767..a0b4cf7 100644
> > --- a/hw/pci-bridge/pcie_root_port.c
> > +++ b/hw/pci-bridge/pcie_root_port.c
> > @@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
> >   pcie_cap_deverr_reset(d);
> >   pcie_cap_slot_reset(d);
> >   pcie_cap_arifwd_reset(d);
> > +pcie_cap_acs_reset(d);
> >   pcie_aer_root_reset(d);
> >   pci_bridge_reset(qdev);
> >   pci_bridge_disable_base_limit(d);
> > @@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
> >   pcie_aer_root_init(d);
> >   rp_aer_vector_update(d);
> >   
> > +if (rpc->acs_offset) {
> > +pcie_acs_init(d, rpc->acs_offset);
> > +}
> >   return;
> >   
> >   err:
> > diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
> > index df242a0..09586f4 100644
> > --- a/include/hw/pci/pcie_port.h
> > +++ b/include/hw/pci/pcie_port.h
> > @@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
> >   int exp_offset;
> >   int aer_offset;
> >   int ssvid_offset;
> > +int acs_offset;/* If nonzero, optional ACS capability offset */
> >   int ssid;
> >   } PCIERootPortClass;
> >   
> 
> Reviewed-by: Marcel Apfelbaum 

Thanks! I added this r-b to v4, but left out the one for patch 1 as it got some
more changes based on Alex's feedback and our fruitful discussion afterwards,
so you might want to recheck it.

Knut
> 
> Thanks,
> Marcel

[Qemu-devel] [PATCH v4 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-02-13 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is a prerequisite to be able to pass
individual functions of a device populating the port through to
an L2 guest from an unmodified kernel.

Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

It is necessary to support SR/IOV where the primary
purpose is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V equivalent
unless ACS is supported by the root port.

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v11

The patches' basic operation with VFIO and IOMMU groups has
been tested with the above patch set and a rebased version of an
in-progress igb Ethernet device, which needs some more care
before I can let it go out.

Changes from v3:

- rebased to the latest qemu master
- Revised commit message and comments for patch #1 to make it
  clearer that VFIO works for single function devices even without ACS.
- Improved checking for valid endpoints for ACS.
- Fixed comment style issue
- Co-locate the pcie_acs_init and _reset functions and
  rename pcie_cap_acs_reset to pcie_acs_reset to adhere to the naming
  convention that _cap_ functions in pcie are for changing state
  in the main pcie capability and not the individual extended
  capabilities.
- Added Marcel's r-b to patch 2, which did not change

Changes from v2:

- rebased to the latest qemu master

Incorporated further feedback from Alex:
- Make sure slot/downstream capability bits are only set for slots.
- Make acs reset callback do nothing if no acs capability exists
- Set correct acs version
- div simplification

Changes from v1:

Incorporated feedback from Alex Williamson:
- Make commit messages reflect a more correct understanding of how this
  affects VFIO operation.
- Implemented the CTRL register properly (reset callback + making
  non-implemented capabilities RO, default value 0)
- removed the egress ctrl vector parameter to the init function
- Fixed some whitespace issues

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  4 +++-
 hw/pci-bridge/pcie_root_port.c |  4 +++-
 hw/pci/pcie.c  | 39 +++-
 include/hw/pci/pcie.h  |  6 +-
 include/hw/pci/pcie_port.h |  1 +-
 include/hw/pci/pcie_regs.h |  4 +++-
 6 files changed, 58 insertions(+)

base-commit: 0b5e750bea635b167eb03d86c3d9a09bbd43bc06
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-02-13 Thread Knut Omang

Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virtualization setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.

In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.

Signed-off-by: Knut Omang 
Reviewed-by: Marcel Apfelbaum 
---
 hw/pci-bridge/gen_pcie_root_port.c | 4 
 hw/pci-bridge/pcie_root_port.c | 4 
 include/hw/pci/pcie_port.h | 1 +
 3 files changed, 9 insertions(+)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..26bda73 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,9 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
+(GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
+
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +152,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..e94d918 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
 pcie_cap_deverr_reset(d);
 pcie_cap_slot_reset(d);
 pcie_cap_arifwd_reset(d);
+pcie_acs_reset(d);
 pcie_aer_root_reset(d);
 pci_bridge_reset(qdev);
 pci_bridge_disable_base_limit(d);
@@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset;/* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-13 Thread Knut Omang

Add a helper function to add a PCIe capability for Access Control Services (ACS).
ACS support in the associated root port is needed to pass
through individual functions of a device to different VMs with VFIO
without Alex Williamson's pcie_acs_override kernel patch or similar
in the guest.

Single function devices, or multifunction devices whose functions
all go to the same VM, work fine even without ACS, as VFIO
will avoid putting the root port itself into the IOMMU group
even without ACS support in the port.

Multifunction endpoints may also implement an ACS capability,
only on function 0, and with more limited features.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 39 +++-
 include/hw/pci/pcie.h  |  6 ++-
 include/hw/pci/pcie_regs.h |  4 -
 3 files changed, 49 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..6e87994 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -906,3 +906,42 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* ACS (Access Control Services) */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset)
+{
+    bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT);
+    uint16_t cap_bits = 0;
+
+    /*
+     * For endpoints, only multifunction devices may have an
+     * ACS capability, and only on function 0:
+     */
+    assert(is_pcie_slot ||
+           ((dev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) &&
+            PCI_FUNC(dev->devfn)));
+
+    pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
+                        PCI_ACS_SIZEOF);
+    dev->exp.acs_cap = offset;
+
+    if (is_pcie_slot) {
+        /*
+         * Endpoints may also implement ACS, and optionally RR and CR,
+         * if they want to support p2p, but only slots may
+         * implement SV, TB or UF:
+         */
+        cap_bits =
+            PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
+    }
+
+    pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits);
+    pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);
+}
+
+void pcie_acs_reset(PCIDevice *dev)
+{
+    if (dev->exp.acs_cap) {
+        pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
+    }
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..e30334d 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -79,6 +79,9 @@ struct PCIExpressDevice {
 
 /* Offset of ATS capability in config space */
 uint16_t ats_cap;
+
+/* ACS */
+uint16_t acs_cap;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
@@ -128,6 +131,9 @@ void pcie_add_capability(PCIDevice *dev,
  uint16_t offset, uint16_t size);
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
+void pcie_acs_init(PCIDevice *dev, uint16_t offset);
+void pcie_acs_reset(PCIDevice *dev);
+
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..1db86b0 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER 0x1
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1
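
The relationship between the capability register and the control register
write mask set up by pcie_acs_init() can be tried out in isolation. The
following is only a minimal sketch, not QEMU code: config space and the write
mask are modelled as plain byte arrays, and struct sketch_dev, set_word(),
get_word(), acs_sketch_init() and guest_write_ctrl() are hypothetical names
made up for the illustration. The register offsets and bit values are assumed
to match the PCI_ACS_* definitions in Linux's pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACS register layout relative to the capability offset (assumed values). */
#define PCI_ACS_CAP   0x04  /* ACS Capability Register */
#define PCI_ACS_CTRL  0x06  /* ACS Control Register */
#define PCI_ACS_SV    0x01  /* Source Validation */
#define PCI_ACS_TB    0x02  /* Translation Blocking */
#define PCI_ACS_RR    0x04  /* P2P Request Redirect */
#define PCI_ACS_CR    0x08  /* P2P Completion Redirect */
#define PCI_ACS_UF    0x10  /* Upstream Forwarding */

#define CFG_SIZE 4096       /* PCIe extended configuration space */

/* Hypothetical stand-in for the config/wmask arrays of a PCIDevice. */
struct sketch_dev {
    uint8_t config[CFG_SIZE];
    uint8_t wmask[CFG_SIZE];
};

/* Little-endian 16-bit store, as config space is little-endian. */
static void set_word(uint8_t *p, uint16_t v)
{
    p[0] = v & 0xff;
    p[1] = v >> 8;
}

static uint16_t get_word(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Advertise ACS features; the same bits become guest-writable in CTRL. */
static void acs_sketch_init(struct sketch_dev *d, uint16_t offset, bool is_slot)
{
    uint16_t cap_bits = 0;

    /* Extended capabilities live at 0x100 and above; ACS here is 8 bytes. */
    assert(offset >= 0x100 && offset + 8 <= CFG_SIZE);

    if (is_slot) {
        cap_bits = PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR |
                   PCI_ACS_UF;
    }
    set_word(d->config + offset + PCI_ACS_CAP, cap_bits);
    set_word(d->wmask + offset + PCI_ACS_CTRL, cap_bits);
}

/* Model a guest write to CTRL: only bits set in wmask can change. */
static void guest_write_ctrl(struct sketch_dev *d, uint16_t offset,
                             uint16_t val)
{
    uint8_t *ctrl = d->config + offset + PCI_ACS_CTRL;
    uint16_t mask = get_word(d->wmask + offset + PCI_ACS_CTRL);

    set_word(ctrl, (uint16_t)((get_word(ctrl) & ~mask) | (val & mask)));
}
```

With is_slot false the mask stays 0, so in this model a guest write to the
control register changes nothing, which is how non-implemented capabilities
end up read-only.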

Re: [Qemu-devel] [PATCH v3 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-12 Thread Knut Omang

On Tue, 2019-02-12 at 10:14 -0700, Alex Williamson wrote:
> On Tue, 12 Feb 2019 17:25:46 +0100
> Knut Omang  wrote:
> 
> > On Tue, 2019-02-12 at 08:59 -0700, Alex Williamson wrote:
> > > On Tue, 12 Feb 2019 09:07:43 +0100
> > > Knut Omang  wrote:
> > >   
> > > > On Mon, 2019-02-11 at 16:09 -0700, Alex Williamson wrote:  
> > > > > On Sun, 10 Feb 2019 07:52:59 +0100
> > > > > Knut Omang  wrote:
> > > > > 
> > > > > > Add a helper function to add PCIe capability for Access Control
> > > > > > Services
> > > > > > (ACS)
> > > > > > ACS support in the associated root port is a prerequisite to be able
> > > > > > to
> > > > > > do
> > > > > > passthrough of individual functions of a device with VFIO
> > > > > > without Alex Williamson's pcie_acs_override kernel patch or similar
> > > > > > in the guest.
> > > > > 
> > > > > This is still incorrect, the ACS override patch is only required for
> > > > > separating multifunction endpoints or multifunction root
> > > > > ports.  Single
> > > > > function endpoints are assignable without ACS simply by placing them
> > > > > downstream of a single function root port or directly on the root
> > > > > complex.
> > > > 
> > > > Hmm - that was the intended meaning of the comment, but I'll see if I
> > > > can
> > > > make
> > > > it more clear by saying it explicitly.  
> > > 
> > > "ACS support... is a prerequisite".  Prerequisite: a thing that is
> > > required as a prior condition for something else to happen or exist.
> > > 
> > > Assignment of individual functions exists today, as is, by using QEMU
> > > to define a PCIe topology that allows the desired grouping.  The code
> > > here enables specific topologies, it is clearly not a prerequisite.
> > >   
> > > > > > Endpoints may also implement an ACS capability, but with
> > > > > > limited features.
> > > > > > 
> > > > > > Signed-off-by: Knut Omang 
> > > > > > ---
> > > > > >  hw/pci/pcie.c  | 29 +
> > > > > >  include/hw/pci/pcie.h  |  6 ++
> > > > > >  include/hw/pci/pcie_regs.h |  4 
> > > > > >  3 files changed, 39 insertions(+)
> > > > > > 
> > > > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > > > > index 230478f..509632f 100644
> > > > > > --- a/hw/pci/pcie.c
> > > > > > +++ b/hw/pci/pcie.c
> > > > > > @@ -742,6 +742,14 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice
> > > > > > *dev)
> > > > > >  PCI_EXP_DEVCTL2_ARI;
> > > > > >  }
> > > > > >  
> > > > > > +/* Access Control Services (ACS) */
> > > > > > +void pcie_cap_acs_reset(PCIDevice *dev)
> > > > > > +{
> > > > > > +if (dev->exp.acs_cap) {
> > > > > > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL,
> > > > > > 0);
> > > > > > +}
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > > 
> > > > > > 
> > > > > >   * pci express extended capability list management functions
> > > > > >   * uint16_t ext_cap_id (16 bit)
> > > > > > @@ -906,3 +914,24 @@ void pcie_ats_init(PCIDevice *dev, uint16_t
> > > > > > offset)
> > > > > >  
> > > > > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL,
> > > > > > 0x800f);
> > > > > >  }
> > > > > > +
> > > > > > +/* ACS (Access Control Services) */
> > > > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > > > > > +{
> > > > > > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev),
> > > > > > TYPE_PCIE_SLOT);
> > > > > > +uint16_t cap_bits = 0;
> > > > > > +
> > > > > > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
>

Re: [Qemu-devel] [PATCH v3 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-12 Thread Knut Omang

On Tue, 2019-02-12 at 08:59 -0700, Alex Williamson wrote:
> On Tue, 12 Feb 2019 09:07:43 +0100
> Knut Omang  wrote:
> 
> > On Mon, 2019-02-11 at 16:09 -0700, Alex Williamson wrote:
> > > On Sun, 10 Feb 2019 07:52:59 +0100
> > > Knut Omang  wrote:
> > >   
> > > > Add a helper function to add PCIe capability for Access Control Services
> > > > (ACS)
> > > > ACS support in the associated root port is a prerequisite to be able to
> > > > do
> > > > passthrough of individual functions of a device with VFIO
> > > > without Alex Williamson's pcie_acs_override kernel patch or similar
> > > > in the guest.  
> > > 
> > > This is still incorrect, the ACS override patch is only required for
> > > separating multifunction endpoints or multifunction root ports.  Single
> > > function endpoints are assignable without ACS simply by placing them
> > > downstream of a single function root port or directly on the root
> > > complex.  
> > 
> > Hmm - that was the intended meaning of the comment, but I'll see if I can
> > make
> > it more clear by saying it explicitly.
> 
> "ACS support... is a prerequisite".  Prerequisite: a thing that is
> required as a prior condition for something else to happen or exist.
> 
> Assignment of individual functions exists today, as is, by using QEMU
> to define a PCIe topology that allows the desired grouping.  The code
> here enables specific topologies, it is clearly not a prerequisite.
> 
> > > > Endpoints may also implement an ACS capability, but with
> > > > limited features.
> > > > 
> > > > Signed-off-by: Knut Omang 
> > > > ---
> > > >  hw/pci/pcie.c  | 29 +
> > > >  include/hw/pci/pcie.h  |  6 ++
> > > >  include/hw/pci/pcie_regs.h |  4 
> > > >  3 files changed, 39 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > > index 230478f..509632f 100644
> > > > --- a/hw/pci/pcie.c
> > > > +++ b/hw/pci/pcie.c
> > > > @@ -742,6 +742,14 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice
> > > > *dev)
> > > >  PCI_EXP_DEVCTL2_ARI;
> > > >  }
> > > >  
> > > > +/* Access Control Services (ACS) */
> > > > +void pcie_cap_acs_reset(PCIDevice *dev)
> > > > +{
> > > > +if (dev->exp.acs_cap) {
> > > > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
> > > > +}
> > > > +}
> > > > +
> > > >  /**
> > > > 
> > > >   * pci express extended capability list management functions
> > > >   * uint16_t ext_cap_id (16 bit)
> > > > @@ -906,3 +914,24 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> > > >  
> > > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> > > >  }
> > > > +
> > > > +/* ACS (Access Control Services) */
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > > > +{
> > > > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev),
> > > > TYPE_PCIE_SLOT);
> > > > +uint16_t cap_bits = 0;
> > > > +
> > > > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
> > > > +PCI_ACS_SIZEOF);
> > > > +dev->exp.acs_cap = offset;
> > > > +
> > > > +if (is_pcie_slot) {
> > > > +/* Endpoints may also implement ACS, but these capabilities are
> > > > */
> > > > +/* only valid for slots: */  
> > > 
> > > Not quite, SV, TB, and UF must not be implemented by endpoints, but RR
> > > and CR must be implemented by multifunction endpoints that support p2p
> > > if they provide an ACS capability.
> > 
> > Hmm - are you ok with setting 0 here as I have done, just amending your 
> > description to the comment? Then any future emulation that do support p2p 
> > would have to set the needed bits after calling the init function.
> 
> The comment definitely needs work, but I don't know what to do about
> single function, non-SR-IOV capable devices calling into this.  

I agree - I have only been thinking about multifunction devices, I should 
probably assert on not multifunction (or

Re: [Qemu-devel] [PATCH v3 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-12 Thread Knut Omang

On Tue, 2019-02-12 at 09:07 +0100, Knut Omang wrote:
> On Mon, 2019-02-11 at 16:09 -0700, Alex Williamson wrote:
> > On Sun, 10 Feb 2019 07:52:59 +0100
> > Knut Omang  wrote:
> > 
> > > Add a helper function to add PCIe capability for Access Control Services
> > > (ACS)
> > > ACS support in the associated root port is a prerequisite to be able to do
> > > passthrough of individual functions of a device with VFIO
> > > without Alex Williamson's pcie_acs_override kernel patch or similar
> > > in the guest.
> > 
> > This is still incorrect, the ACS override patch is only required for
> > separating multifunction endpoints or multifunction root ports.  Single
> > function endpoints are assignable without ACS simply by placing them
> > downstream of a single function root port or directly on the root
> > complex.
> 
> Hmm - that was the intended meaning of the comment, but I'll see if I can make
> it more clear by saying it explicitly.
> 
> > > Endpoints may also implement an ACS capability, but with
> > > limited features.
> > > 
> > > Signed-off-by: Knut Omang 
> > > ---
> > >  hw/pci/pcie.c  | 29 +
> > >  include/hw/pci/pcie.h  |  6 ++
> > >  include/hw/pci/pcie_regs.h |  4 
> > >  3 files changed, 39 insertions(+)
> > > 
> > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > index 230478f..509632f 100644
> > > --- a/hw/pci/pcie.c
> > > +++ b/hw/pci/pcie.c
> > > @@ -742,6 +742,14 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
> > >  PCI_EXP_DEVCTL2_ARI;
> > >  }
> > >  
> > > +/* Access Control Services (ACS) */
> > > +void pcie_cap_acs_reset(PCIDevice *dev)
> > > +{
> > > +if (dev->exp.acs_cap) {
> > > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
> > > +}
> > > +}
> > > +
> > >  /
> > > **
> > >   * pci express extended capability list management functions
> > >   * uint16_t ext_cap_id (16 bit)
> > > @@ -906,3 +914,24 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> > >  
> > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> > >  }
> > > +
> > > +/* ACS (Access Control Services) */
> > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > > +{
> > > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev),
> > > TYPE_PCIE_SLOT);
> > > +uint16_t cap_bits = 0;
> > > +
> > > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
> > > +PCI_ACS_SIZEOF);
> > > +dev->exp.acs_cap = offset;
> > > +
> > > +if (is_pcie_slot) {
> > > +/* Endpoints may also implement ACS, but these capabilities are
> > > */
> > > +/* only valid for slots: */
> > 
> > Not quite, SV, TB, and UF must not be implemented by endpoints, but RR
> > and CR must be implemented by multifunction endpoints that support p2p
> > if they provide an ACS capability.  
> 
> Hmm - are you ok with setting 0 here as I have done, just amending your 
> description to the comment? Then any future emulation that do support p2p 
> would have to set the needed bits after calling the init function.
> 
> After your previous comments on this, I had a look at Mellanox CX4 and CX5
> which
> are the only devices I could find in the lab that are endpoints and implement
> an
> ACS capability, and neither seems to implement any extra capabilities:
> 
> Capabilities: [230 v1] Access Control Services
> ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd-
> EgressCtrl- DirectTrans-
> ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd-
> EgressCtrl- DirectTrans-
> 
> that was the reason for my choice of value here - after skimming through the
> spec (with my still very limited understanding of the details)
> 
> > Linux therefore infers that if ACS
> > is supported by an endpoint and RR and CR are not implemented, the
> > device does not support p2p.  
> 
> Interesting - I thought the CX5 supported p2p, but I have not kept up with
> what
> happens on the RDMA list on that front.
> 
> > We could just say that nothing supports
> > p2p yet, but single function endpoints (except those implementing
> > SR-IOV) mu

Re: [Qemu-devel] [PATCH v3 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-12 Thread Knut Omang

On Mon, 2019-02-11 at 16:09 -0700, Alex Williamson wrote:
> On Sun, 10 Feb 2019 07:52:59 +0100
> Knut Omang  wrote:
> 
> > Add a helper function to add PCIe capability for Access Control Services
> > (ACS)
> > ACS support in the associated root port is a prerequisite to be able to do
> > passthrough of individual functions of a device with VFIO
> > without Alex Williamson's pcie_acs_override kernel patch or similar
> > in the guest.
> 
> This is still incorrect, the ACS override patch is only required for
> separating multifunction endpoints or multifunction root ports.  Single
> function endpoints are assignable without ACS simply by placing them
> downstream of a single function root port or directly on the root
> complex.

Hmm - that was the intended meaning of the comment, but I'll see if I can make
it more clear by saying it explicitly.

> > Endpoints may also implement an ACS capability, but with
> > limited features.
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 29 +
> >  include/hw/pci/pcie.h  |  6 ++
> >  include/hw/pci/pcie_regs.h |  4 
> >  3 files changed, 39 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..509632f 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -742,6 +742,14 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
> >  PCI_EXP_DEVCTL2_ARI;
> >  }
> >  
> > +/* Access Control Services (ACS) */
> > +void pcie_cap_acs_reset(PCIDevice *dev)
> > +{
> > +if (dev->exp.acs_cap) {
> > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
> > +}
> > +}
> > +
> >  /**
> >   * pci express extended capability list management functions
> >   * uint16_t ext_cap_id (16 bit)
> > @@ -906,3 +914,24 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* ACS (Access Control Services) */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > +{
> > +bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT);
> > +uint16_t cap_bits = 0;
> > +
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
> > +PCI_ACS_SIZEOF);
> > +dev->exp.acs_cap = offset;
> > +
> > +if (is_pcie_slot) {
> > +/* Endpoints may also implement ACS, but these capabilities are */
> > +/* only valid for slots: */
> 
> Not quite, SV, TB, and UF must not be implemented by endpoints, but RR
> and CR must be implemented by multifunction endpoints that support p2p
> if they provide an ACS capability.  

Hmm - are you ok with setting 0 here as I have done, and just adding your
description to the comment? Then any future emulation that does support p2p
would have to set the needed bits after calling the init function.

After your previous comments on this, I had a look at Mellanox CX4 and CX5 which
are the only devices I could find in the lab that are endpoints and implement an
ACS capability, and neither seems to implement any extra capabilities:

Capabilities: [230 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- 
EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- 
EgressCtrl- DirectTrans-

that was the reason for my choice of value here - after skimming through the
spec (with my still very limited understanding of the details)
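
For readers who want to map a raw register value to a flag listing like the
one above, here is a minimal sketch of a decoder. It is not taken from lspci
or QEMU; acs_flags[] and acs_format_flags() are hypothetical names, and the
bit positions are assumed to match the PCI_ACS_* definitions in Linux's
pci_regs.h (SV, TB, RR, CR, UF, EC, DT from bit 0 upwards).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Flag names as they appear in the listing above; bit values are assumed. */
static const struct {
    uint16_t bit;
    const char *name;
} acs_flags[] = {
    { 0x01, "SrcValid" },
    { 0x02, "TransBlk" },
    { 0x04, "ReqRedir" },
    { 0x08, "CmpltRedir" },
    { 0x10, "UpstreamFwd" },
    { 0x20, "EgressCtrl" },
    { 0x40, "DirectTrans" },
};

/*
 * Format an ACS capability or control register value as "Name+ Name- ...".
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int acs_format_flags(uint16_t reg, char *buf, size_t len)
{
    size_t used = 0;
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (i = 0; i < sizeof(acs_flags) / sizeof(acs_flags[0]); i++) {
        int n = snprintf(buf + used, len - used, "%s%s%c",
                         i ? " " : "", acs_flags[i].name,
                         (reg & acs_flags[i].bit) ? '+' : '-');

        if (n < 0 || (size_t)n >= len - used) {
            return -1;      /* output would have been truncated */
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

A register value of 0, as read from the two endpoints above, yields every
flag with a trailing '-'.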

> Linux therefore infers that if ACS
> is supported by an endpoint and RR and CR are not implemented, the
> device does not support p2p.  

Interesting - I thought the CX5 supported p2p, but I have not kept up with what
happens on the RDMA list on that front.

> We could just say that nothing supports
> p2p yet, but single function endpoints (except those implementing
> SR-IOV) must not implement an ACS capability per the spec, which could
> be difficult to exclude since the multifunction bit is handled
> separately from the device model.

Hmm - the older SR/IOV devices I know of, some Intel Ethernet devices and any of
the older Mellanox devices, and our own cancelled Infiniband device, all seem
not to implement ACS?

> Also comment style:
> https://git.qemu.org/?p=qemu.git;a=blob;f=CODING_STYLE;#l127

I see - will fix,

> > +cap_bits =
> > +PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
> > +}
> > +
> > +

[Qemu-devel] [PATCH v3 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-02-09 Thread Knut Omang

Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virtualization setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.

In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.

Signed-off-by: Knut Omang 
---
 hw/pci-bridge/gen_pcie_root_port.c | 4 
 hw/pci-bridge/pcie_root_port.c | 4 
 include/hw/pci/pcie_port.h | 1 +
 3 files changed, 9 insertions(+)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..26bda73 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,9 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
+(GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
+
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +152,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..a0b4cf7 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
 pcie_cap_deverr_reset(d);
 pcie_cap_slot_reset(d);
 pcie_cap_arifwd_reset(d);
+pcie_cap_acs_reset(d);
 pcie_aer_root_reset(d);
 pci_bridge_reset(qdev);
 pci_bridge_disable_base_limit(d);
@@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset;/* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v3 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-02-09 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is a prerequisite to be able to pass
individual functions of a device populating the port through to
an L2 guest from an unmodified kernel.

Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

It is necessary to support SR/IOV where the primary
purpose is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V equivalent
unless ACS is supported by the root port.

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v11

The patches' basic operation with VFIO and IOMMU groups has
been tested with the above patch set and a rebased version of an
in-progress igb Ethernet device, which needs some more care
before I can let it go out.

Changes from v2:

- rebased to the latest qemu master

Incorporated further feedback from Alex:
- Make sure slot/downstream capability bits are only set for slots.
- Make acs reset callback do nothing if no acs capability exists
- Set correct acs version
- div simplification

Changes from v1:

Incorporated feedback from Alex Williamson:
- Make commit messages reflect a more correct understanding of how this
  affects VFIO operation.
- Implemented the CTRL register properly (reset callback + making
  non-implemented capabilities RO, default value 0)
- removed the egress ctrl vector parameter to the init function
- Fixed some whitespace issues

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  4 
 hw/pci-bridge/pcie_root_port.c |  4 
 hw/pci/pcie.c  | 29 +
 include/hw/pci/pcie.h  |  6 ++
 include/hw/pci/pcie_port.h |  1 +
 include/hw/pci/pcie_regs.h |  4 
 6 files changed, 48 insertions(+)

base-commit: e47f81b617684c4546af286d307b69014a83538a
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v3 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-02-09 Thread Knut Omang

Add a helper function to add a PCIe capability for Access Control Services (ACS).
ACS support in the associated root port is a prerequisite to be able to do
passthrough of individual functions of a device with VFIO
without Alex Williamson's pcie_acs_override kernel patch or similar
in the guest.

Endpoints may also implement an ACS capability, but with
limited features.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 29 +
 include/hw/pci/pcie.h  |  6 ++
 include/hw/pci/pcie_regs.h |  4 
 3 files changed, 39 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..509632f 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -742,6 +742,14 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
 PCI_EXP_DEVCTL2_ARI;
 }
 
+/* Access Control Services (ACS) */
+void pcie_cap_acs_reset(PCIDevice *dev)
+{
+if (dev->exp.acs_cap) {
+pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
+}
+}
+
 /**
  * pci express extended capability list management functions
  * uint16_t ext_cap_id (16 bit)
@@ -906,3 +914,24 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* ACS (Access Control Services) */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset)
+{
+    bool is_pcie_slot = !!object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT);
+    uint16_t cap_bits = 0;
+
+    pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER, offset,
+                        PCI_ACS_SIZEOF);
+    dev->exp.acs_cap = offset;
+
+    if (is_pcie_slot) {
+        /* Endpoints may also implement ACS, but these capabilities are */
+        /* only valid for slots: */
+        cap_bits =
+            PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
+    }
+
+    pci_set_word(dev->config + offset + PCI_ACS_CAP, cap_bits);
+    pci_set_word(dev->wmask + offset + PCI_ACS_CTRL, cap_bits);
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..4c40711 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -79,6 +79,9 @@ struct PCIExpressDevice {
 
 /* Offset of ATS capability in config space */
 uint16_t ats_cap;
+
+/* ACS */
+uint16_t acs_cap;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
@@ -116,6 +119,8 @@ void pcie_cap_flr_init(PCIDevice *dev);
 void pcie_cap_flr_write_config(PCIDevice *dev,
uint32_t addr, uint32_t val, int len);
 
+void pcie_cap_acs_reset(PCIDevice *dev);
+
 /* ARI forwarding capability and control */
 void pcie_cap_arifwd_init(PCIDevice *dev);
 void pcie_cap_arifwd_reset(PCIDevice *dev);
@@ -129,6 +134,7 @@ void pcie_add_capability(PCIDevice *dev,
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
+void pcie_acs_init(PCIDevice *dev, uint16_t offset);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
 
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..1db86b0 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER 0x1
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1

Re: [Qemu-devel] [PATCH v2 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-01-24 Thread Knut Omang

On Thu, 2019-01-24 at 10:33 -0700, Alex Williamson wrote:
> On Thu, 24 Jan 2019 11:12:53 +0100
> Knut Omang  wrote:
> 
> > Claim ACS support in the generic PCIe root port to allow
> > passthrough of individual functions of a device to different
> > guests (in a nested virt.setting) with VFIO.
> > Without this patch, all functions of a device, such as all VFs of
> > an SR/IOV device, will end up in the same IOMMU group.
> > A similar situation occurs on Windows with Hyper-V.
> > 
> > In the single function device case, it also has a small cosmetic
> > benefit in that the root port itself is not grouped with
> > the device. VFIO handles that situation in that binding rules
> > only apply to endpoints, so it does not limit passthrough in
> > those cases.
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci-bridge/gen_pcie_root_port.c | 2 ++
> >  hw/pci-bridge/pcie_root_port.c | 4 
> >  include/hw/pci/pcie_port.h | 1 +
> >  3 files changed, 7 insertions(+)
> > 
> > diff --git a/hw/pci-bridge/gen_pcie_root_port.c 
> > b/hw/pci-bridge/gen_pcie_root_port.c
> > index 9766edb..b5a5ecc 100644
> > --- a/hw/pci-bridge/gen_pcie_root_port.c
> > +++ b/hw/pci-bridge/gen_pcie_root_port.c
> > @@ -20,6 +20,7 @@
> >  OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
> >  
> >  #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
> > +#define GEN_PCIE_ROOT_PORT_ACS_OFFSET   0x148
> 
> So you prefer that everyone passing through here decode these to figure
> out that ACS_OFFSET is (AER_OFFSET + ERR_SIZEOF) since my comment on v1
> was ignored?

Sorry, not at all - I managed to overlook your comment - will fix it,
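
To make the arithmetic behind that remark explicit, a small sketch of the
derived offset follows. The value 0x48 for PCI_ERR_SIZEOF is an assumption
inferred from the numbers in this thread (0x148 - 0x100), and
next_ext_cap_offset() is a hypothetical helper added only for the example.

```c
#include <assert.h>
#include <stdint.h>

#define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
/* Assumed size of the AER capability structure (0x148 - 0x100). */
#define PCI_ERR_SIZEOF                  0x48

/* Derived form: places ACS directly after the AER capability. */
#define GEN_PCIE_ROOT_PORT_ACS_OFFSET \
    (GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)

/* Under the assumption above, this matches the hard-coded 0x148. */
_Static_assert(GEN_PCIE_ROOT_PORT_ACS_OFFSET == 0x148,
               "ACS capability should directly follow AER");

/* Hypothetical helper: offset of the capability following a previous one. */
static uint16_t next_ext_cap_offset(uint16_t prev_offset, uint16_t prev_size)
{
    /* Extended capabilities start at 0x100 and are DWORD aligned. */
    assert(prev_offset >= 0x100);
    assert((prev_offset & 0x3) == 0 && (prev_size & 0x3) == 0);

    return (uint16_t)(prev_offset + prev_size);
}
```

Writing the offset as a sum keeps the layout intent visible to the reader,
where a bare 0x148 has to be decoded by hand.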

> >  #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
> >  
> >  typedef struct GenPCIERootPort {
> > @@ -149,6 +150,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, 
> > void *data)
> >  rpc->interrupts_init = gen_rp_interrupts_init;
> >  rpc->interrupts_uninit = gen_rp_interrupts_uninit;
> >  rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
> > +rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
> >  }
> >  
> >  static const TypeInfo gen_rp_dev_info = {
> > diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
> > index 34ad767..a0b4cf7 100644
> > --- a/hw/pci-bridge/pcie_root_port.c
> > +++ b/hw/pci-bridge/pcie_root_port.c
> > @@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
> >  pcie_cap_deverr_reset(d);
> >  pcie_cap_slot_reset(d);
> >  pcie_cap_arifwd_reset(d);
> > +pcie_cap_acs_reset(d);
> 
> Only the generic root port initializes acs_offset to enable an ACS
> capability, but all members of the device class call the reset function
> which does no checking that an ACS capability exists.  We've just
> corrupted config space for the device.

Ouch! Not good at all, sorry!
Will look at it (after a good night's sleep this time..)

Thanks!
Knut
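
The failure mode can be reproduced with a small standalone model. This is
only a sketch, not QEMU code: struct sketch_dev and the acs_reset_*() helpers
are hypothetical, and the register offsets (PCI_STATUS at 0x06 in the standard
header, PCI_ACS_CTRL at 0x06 within the ACS capability) are assumed to match
Linux's pci_regs.h. With acs_cap left at 0, the unguarded write lands on the
Status register.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS    0x06  /* Status register in the standard header */
#define PCI_ACS_CTRL  0x06  /* ACS Control Register, relative to the cap */
#define CFG_SIZE      4096  /* PCIe extended configuration space */

/* Hypothetical stand-in for a device without (acs_cap == 0) or with ACS. */
struct sketch_dev {
    uint8_t config[CFG_SIZE];
    uint16_t acs_cap;
};

static void set_word(uint8_t *p, uint16_t v)
{
    p[0] = v & 0xff;
    p[1] = v >> 8;
}

static uint16_t get_word(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Reset without a check: writes to offset 0x06 when no ACS cap exists. */
static void acs_reset_unguarded(struct sketch_dev *d)
{
    assert(d->acs_cap + PCI_ACS_CTRL + 2 <= CFG_SIZE);
    set_word(d->config + d->acs_cap + PCI_ACS_CTRL, 0);
}

/* Reset that does nothing unless an ACS capability was registered. */
static void acs_reset_guarded(struct sketch_dev *d)
{
    if (d->acs_cap) {
        assert(d->acs_cap + PCI_ACS_CTRL + 2 <= CFG_SIZE);
        set_word(d->config + d->acs_cap + PCI_ACS_CTRL, 0);
    }
}
```

In this model, calling acs_reset_unguarded() on a device whose acs_cap is 0
clears whatever was stored at PCI_STATUS, while acs_reset_guarded() leaves the
header untouched.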

> >  pcie_aer_root_reset(d);
> >  pci_bridge_reset(qdev);
> >  pci_bridge_disable_base_limit(d);
> > @@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
> >  pcie_aer_root_init(d);
> >  rp_aer_vector_update(d);
> >  
> > +if (rpc->acs_offset) {
> > +pcie_acs_init(d, rpc->acs_offset);
> > +}
> >  return;
> >  
> >  err:
> > diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
> > index df242a0..09586f4 100644
> > --- a/include/hw/pci/pcie_port.h
> > +++ b/include/hw/pci/pcie_port.h
> > @@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
> >  int exp_offset;
> >  int aer_offset;
> >  int ssvid_offset;
> > +int acs_offset;/* If nonzero, optional ACS capability offset */
> >  int ssid;
> >  } PCIERootPortClass;
> >  
>

Re: [Qemu-devel] [PATCH v2 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-24 Thread Knut Omang

On Thu, 2019-01-24 at 10:22 -0700, Alex Williamson wrote:
> On Thu, 24 Jan 2019 11:12:52 +0100
> Knut Omang  wrote:
> 
> > Add a helper function to add PCIe capability for Access Control Services 
> > (ACS)
> > ACS support in the associated root port is a prerequisite to be able to do
> > passthrough of individual functions of a device with VFIO
> > without Alex Williamson's pcie_acs_override kernel patch or similar
> > in the guest.
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 21 +
> >  include/hw/pci/pcie.h  |  6 ++
> >  include/hw/pci/pcie_regs.h |  4 
> >  3 files changed, 31 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..5ab3d1d 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -742,6 +742,13 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
> >  PCI_EXP_DEVCTL2_ARI;
> >  }
> >  
> > +/* Access Control Services (ACS)
> > + */
> 
> Comment style
>
> > +void pcie_cap_acs_reset(PCIDevice *dev)
> > +{
> > +pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
> > +}
> > +
> >  /**
> >   * pci express extended capability list management functions
> >   * uint16_t ext_cap_id (16 bit)
> > @@ -906,3 +913,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* ACS (Access Control Services) */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset)
> > +{
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
> > +offset, PCI_ACS_SIZEOF);
> > +dev->exp.acs_cap = offset;
> > +pci_set_word(dev->config + offset + PCI_ACS_CAP,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> 
> > This is still only valid for downstream ports, yet it is neither restricted
> > nor commented to indicate that.  You could use an object_dynamic_cast
> > to trigger an assert should someone use it for an invalid type of
> > device, e.g.:
> 
> assert(object_dynamic_cast(OBJECT(dev), TYPE_PCIE_SLOT));

Sorry, I didn't realize what you meant with v1 - this evolved from 
just a fix in the implementation of ioh3420 to a fix in the generic code, 
which I now realize of course is also used for downstream ports...

> > +
> > +pci_set_word(dev->config + offset + PCI_ACS_CTRL, 0);
> 
> Suspect this is unnecessary given the reset callback.

ok

> > +pci_set_word(dev->wmask + offset + PCI_ACS_CTRL,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > +}
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index 5b82a0d..4c40711 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -79,6 +79,9 @@ struct PCIExpressDevice {
> >  
> >  /* Offset of ATS capability in config space */
> >  uint16_t ats_cap;
> > +
> > +/* ACS */
> > +uint16_t acs_cap;
> >  };
> >  
> >  #define COMPAT_PROP_PCP "power_controller_present"
> > @@ -116,6 +119,8 @@ void pcie_cap_flr_init(PCIDevice *dev);
> >  void pcie_cap_flr_write_config(PCIDevice *dev,
> > uint32_t addr, uint32_t val, int len);
> >  
> > +void pcie_cap_acs_reset(PCIDevice *dev);
> > +
> >  /* ARI forwarding capability and control */
> >  void pcie_cap_arifwd_init(PCIDevice *dev);
> >  void pcie_cap_arifwd_reset(PCIDevice *dev);
> > @@ -129,6 +134,7 @@ void pcie_add_capability(PCIDevice *dev,
> >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> >  
> >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset);
> >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
> >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> >  
> > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > index ad4e780..3fc9aca 100644
> > --- a/include/hw/pci/pcie_regs.h
> > +++ b/include/hw/pci/pcie_regs.h
> > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> >   PCI_ERR_COR_INTERNAL | \
> >   PCI_ERR_COR_HL_OVERFLOW)
> >  
> > +/* ACS */
> > +#define PCI_ACS_VER 0x2
> 
> There's no such version, even the PCIe 5.0 drafts only define version 1.

Hmm - I have no idea how it ended up as 2 in the first place - my model device
is of course also v1 - will fix it.

Thanks!
Knut
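
For context on where the version ends up: it is stored in bits 19:16 of the
32-bit extended capability header, next to the capability ID (bits 15:0) and
the offset of the next capability (bits 31:20). A small sketch of that
encoding with the ACS extended capability ID (0x000d) and version 1 follows.
The macro and function names are made up for illustration and are not the
ones QEMU uses.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_CAP_ID_ACS  0x000d  /* ACS extended capability ID */
#define ACS_CAP_VERSION 0x1     /* the only version defined, per the review */

/* Pack a PCIe extended capability header:
 *   bits 15:0  capability ID
 *   bits 19:16 capability version
 *   bits 31:20 offset of the next capability (0 terminates the list) */
static uint32_t ext_cap_header(uint16_t id, uint8_t ver, uint16_t next)
{
    return (uint32_t)id |
           ((uint32_t)(ver & 0xf) << 16) |
           ((uint32_t)(next & 0xfff) << 20);
}

/* Extract the version field from a packed header. */
static uint8_t ext_cap_version(uint32_t header)
{
    return (header >> 16) & 0xf;
}
```

A guest that parses the header would see whatever version is packed here, so
an undefined value such as 2 is visible to software probing the capability.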

> > +#define PCI_ACS_SIZEOF  8
> > +
> >  #endif /* QEMU_PCIE_REGS_H */
>

[Qemu-devel] [PATCH v2 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-24 Thread Knut Omang

Add a helper function to add a PCIe capability for Access Control Services (ACS).
ACS support in the associated root port is a prerequisite to be able to do
passthrough of individual functions of a device with VFIO
without Alex Williamson's pcie_acs_override kernel patch or similar
in the guest.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 21 +
 include/hw/pci/pcie.h  |  6 ++
 include/hw/pci/pcie_regs.h |  4 
 3 files changed, 31 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..5ab3d1d 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -742,6 +742,13 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
 PCI_EXP_DEVCTL2_ARI;
 }
 
+/* Access Control Services (ACS)
+ */
+void pcie_cap_acs_reset(PCIDevice *dev)
+{
+pci_set_word(dev->config + dev->exp.acs_cap + PCI_ACS_CTRL, 0);
+}
+
 /**
  * pci express extended capability list management functions
  * uint16_t ext_cap_id (16 bit)
@@ -906,3 +913,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* ACS (Access Control Services) */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset)
+{
+pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
+offset, PCI_ACS_SIZEOF);
+dev->exp.acs_cap = offset;
+pci_set_word(dev->config + offset + PCI_ACS_CAP,
+ PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+
+pci_set_word(dev->config + offset + PCI_ACS_CTRL, 0);
+pci_set_word(dev->wmask + offset + PCI_ACS_CTRL,
+ PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..4c40711 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -79,6 +79,9 @@ struct PCIExpressDevice {
 
 /* Offset of ATS capability in config space */
 uint16_t ats_cap;
+
+/* ACS */
+uint16_t acs_cap;
 };
 
 #define COMPAT_PROP_PCP "power_controller_present"
@@ -116,6 +119,8 @@ void pcie_cap_flr_init(PCIDevice *dev);
 void pcie_cap_flr_write_config(PCIDevice *dev,
uint32_t addr, uint32_t val, int len);
 
+void pcie_cap_acs_reset(PCIDevice *dev);
+
 /* ARI forwarding capability and control */
 void pcie_cap_arifwd_init(PCIDevice *dev);
 void pcie_cap_arifwd_reset(PCIDevice *dev);
@@ -129,6 +134,7 @@ void pcie_add_capability(PCIDevice *dev,
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
+void pcie_acs_init(PCIDevice *dev, uint16_t offset);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
 
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..3fc9aca 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER 0x2
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v2 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-01-24 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is a prerequisite to be able to pass
individual functions of a device populating the port through to
an L2 guest from an unmodified kernel.

Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

ACS support is needed for SR/IOV, where the primary
purpose is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V equivalent
unless ACS is supported by the root port.

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v10

Changes from v1:

Incorporated feedback from Alex Williamson:

- Make commit messages reflect a more correct understanding of how this
  affects VFIO operation.
- Implemented the CTRL register properly (reset callback + making
  non-implemented capabilities RO, default value 0)
- Removed the egress ctrl vector parameter to the init function
- Fixed some whitespace issues
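
The "non-implemented capabilities RO" item can be sketched as follows: a guest
write only changes the bits that are set in the write mask, so limiting the
mask for the control register to the bits advertised in the capability
register leaves every other bit read-only. This is a simplified stand-alone
model, not the QEMU write path; the function name is invented, and the bit
values are taken from the Linux PCI_ACS_* definitions on the assumption that
the patch uses the same ones.

```c
#include <assert.h>
#include <stdint.h>

#define ACS_SV 0x01  /* Source Validation */
#define ACS_TB 0x02  /* Translation Blocking */
#define ACS_RR 0x04  /* P2P Request Redirect */
#define ACS_CR 0x08  /* P2P Completion Redirect */
#define ACS_UF 0x10  /* Upstream Forwarding */
#define ACS_EC 0x20  /* P2P Egress Control (not advertised in this sketch) */

#define ACS_IMPLEMENTED (ACS_SV | ACS_TB | ACS_RR | ACS_CR | ACS_UF)

/* Apply a guest write to a register: bits set in wmask take the new value,
 * all other bits keep their old value and are therefore read-only. */
static uint16_t masked_write(uint16_t old, uint16_t val, uint16_t wmask)
{
    return (uint16_t)((old & ~wmask) | (val & wmask));
}
```

Writing 0xffff through the ACS_IMPLEMENTED mask only sets the five advertised
bits, whereas the v1 memset of 0xff made all 16 bits writable.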

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  2 ++
 hw/pci-bridge/pcie_root_port.c |  4 
 hw/pci/pcie.c  | 21 +
 include/hw/pci/pcie.h  |  6 ++
 include/hw/pci/pcie_port.h |  1 +
 include/hw/pci/pcie_regs.h |  4 
 6 files changed, 38 insertions(+)

base-commit: a8d2b0685681e2f291faaa501efbbd76875f8ec8
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v2 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-01-24 Thread Knut Omang

Claim ACS support in the generic PCIe root port to allow
passthrough of individual functions of a device to different
guests (in a nested virtualization setting) with VFIO.
Without this patch, all functions of a device, such as all VFs of
an SR/IOV device, will end up in the same IOMMU group.
A similar situation occurs on Windows with Hyper-V.

In the single function device case, it also has a small cosmetic
benefit in that the root port itself is not grouped with
the device. VFIO handles that situation in that binding rules
only apply to endpoints, so it does not limit passthrough in
those cases.

Signed-off-by: Knut Omang 
---
 hw/pci-bridge/gen_pcie_root_port.c | 2 ++
 hw/pci-bridge/pcie_root_port.c | 4 
 include/hw/pci/pcie_port.h | 1 +
 3 files changed, 7 insertions(+)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..b5a5ecc 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,7 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET   0x148
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +150,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..a0b4cf7 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -47,6 +47,7 @@ static void rp_reset(DeviceState *qdev)
 pcie_cap_deverr_reset(d);
 pcie_cap_slot_reset(d);
 pcie_cap_arifwd_reset(d);
+pcie_cap_acs_reset(d);
 pcie_aer_root_reset(d);
 pci_bridge_reset(qdev);
 pci_bridge_disable_base_limit(d);
@@ -106,6 +107,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset;/* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1

Re: [Qemu-devel] [PATCH 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-23 Thread Knut Omang

On Wed, 2019-01-23 at 12:56 -0700, Alex Williamson wrote:
> On Wed, 23 Jan 2019 20:46:14 +0100
> Knut Omang  wrote:
> 
> > On Wed, 2019-01-23 at 12:04 -0700, Alex Williamson wrote:
> > > On Wed, 23 Jan 2019 19:27:59 +0100
> > > Knut Omang  wrote:
> > >   
> > > > Add a helper function to add PCIe capability for Access Control 
> > > > Services (ACS)
> > > > ACS support in the associated root port is a prerequisite to be able to 
> > > > do useful
> > > > passthrough with VFIO without Alex Williamson's pcie_acs_override 
> > > > kernel patch.  
> > > 
> > > Define "useful".  We can certainly still assign single function PFs to
> > > an L2 guest, or multi-function so long as all the functions are
> > > assigned.  I won't deny that it's problematic, but it's a virtual
> > > topology that can be adjusted, so I think this is overstating things a
> > > bit.
> > >   
> > > > Signed-off-by: Knut Omang 
> > > > ---
> > > >  hw/pci/pcie.c  | 14 ++
> > > >  include/hw/pci/pcie.h  |  1 +
> > > >  include/hw/pci/pcie_regs.h |  4 
> > > >  3 files changed, 19 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > > index 230478f..18feff5 100644
> > > > --- a/hw/pci/pcie.c
> > > > +++ b/hw/pci/pcie.c
> > > > @@ -906,3 +906,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> > > >  
> > > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> > > >  }
> > > > +
> > > > +/* Add an ACS (Access Control Services) capability */
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > > > egress_ctrl_vec_sz)
> > > > +{
> > > > +int ectrl_words = (egress_ctrl_vec_sz + 31) & ~31;
> > > > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
> > > > +offset, PCI_ACS_SIZEOF + ectrl_words);  
> > > 
> > > The egress control vector is only valid if the egress control
> > > capability is enabled, which is not set below, so this just seems to
> > > waste config space and introduces a meaningless function arg.  
> > 
> > I think my intention way back when I implemented this was to provide it as 
> > a skeleton
> > for further detailing later, I use it with ectrl_words = 0 in the root port.
> > 
> > > > +pci_set_word(dev->config + offset + PCI_ACS_CAP,
> > > > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > > 
> > > Some of these bits are only valid for downstream ports, it would
> > > violate the spec to set them on an endpoint.
> > 
> > I must admit I haven't really dived deep here - I just set the cap bits
> > that one of my servers sets for its root ports - this is the one I used
> > as model:
> > 
> > 00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
> > 00:02.0 0604: 8086:3409 (rev 22)
> > 
> > > > +pci_set_word(dev->config + offset + PCI_ACS_CTRL,
> > > > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > > 
> > > The default values of the control register bits is zero, so we
> > > shouldn't be setting it here and we should have a reset hook to clear
> > > it.  
> > 
> > I agree - I'll implement it,
> > thanks!
> > 
> > Knut
> > 
> > >   
> > > > +/* Make CTRL register writable */
> > > > +memset(dev->wmask + offset + PCI_ACS_CTRL, 0xff, 2);
> 
> While you're at it, it doesn't make sense to set unimplemented control
> bits as writable, this should match the bits set in the capabilities
> register.  Thanks,

Good point, I'll fix that as well,

Thanks!
Knut

> 
> Alex
> 
> > > > +}
> > > > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > > > index 5b82a0d..c2da148 100644
> > > > --- a/include/hw/pci/pcie.h
> > > > +++ b/include/hw/pci/pcie.h
> > > > @@ -129,6 +129,7 @@ void pcie_add_capability(PCIDevice *dev,
> > > >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> > > >  
> > > >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > > > egress_ctrl_vec_sz);
> > > >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t 
> > > > ser_num);
> > > >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> > > >  
> > > > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > > > index ad4e780..5e7409c 100644
> > > > --- a/include/hw/pci/pcie_regs.h
> > > > +++ b/include/hw/pci/pcie_regs.h
> > > > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> > > >   PCI_ERR_COR_INTERNAL |
> > > >  \
> > > >   PCI_ERR_COR_HL_OVERFLOW)
> > > >  
> > > > +/* ACS */
> > > > +#define PCI_ACS_VER0x2
> > > > +#define PCI_ACS_SIZEOF  8
> > > > +
> > > >  #endif /* QEMU_PCIE_REGS_H */  
> > >   
> > 
> > 
>

Re: [Qemu-devel] [PATCH 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-23 Thread Knut Omang

On Wed, 2019-01-23 at 12:04 -0700, Alex Williamson wrote:
> On Wed, 23 Jan 2019 19:27:59 +0100
> Knut Omang  wrote:
> 
> > Add a helper function to add PCIe capability for Access Control Services 
> > (ACS)
> > ACS support in the associated root port is a prerequisite to be able to do 
> > useful
> > passthrough with VFIO without Alex Williamson's pcie_acs_override kernel 
> > patch.
> 
> Define "useful".  We can certainly still assign single function PFs to
> an L2 guest, or multi-function so long as all the functions are
> assigned.  I won't deny that it's problematic, but it's a virtual
> topology that can be adjusted, so I think this is overstating things a
> bit.
> 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 14 ++
> >  include/hw/pci/pcie.h  |  1 +
> >  include/hw/pci/pcie_regs.h |  4 
> >  3 files changed, 19 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..18feff5 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -906,3 +906,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* Add an ACS (Access Control Services) capability */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > egress_ctrl_vec_sz)
> > +{
> > +int ectrl_words = (egress_ctrl_vec_sz + 31) & ~31;
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
> > +offset, PCI_ACS_SIZEOF + ectrl_words);
> 
> The egress control vector is only valid if the egress control
> capability is enabled, which is not set below, so this just seems to
> waste config space and introduces a meaningless function arg.

I think my intention way back when I implemented this was to provide it as a
skeleton for further detailing later; I use it with ectrl_words = 0 in the
root port.
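
As a side note on the sizing: `(egress_ctrl_vec_sz + 31) & ~31` rounds the bit
count up to a multiple of 32 instead of producing a count of words or bytes,
so it only gives the right amount of extra config space for the value 0 used
here. A sketch of computing the extra bytes follows, assuming the vector holds
one bit per entry and is stored in whole 32-bit DWORDs. That layout is an
assumption, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for an egress control vector of `bits` entries,
 * assuming one bit per entry packed into whole 32-bit DWORDs. */
static unsigned egress_ctrl_vec_bytes(uint8_t bits)
{
    unsigned dwords = ((unsigned)bits + 31) / 32;
    return dwords * 4;
}

/* The expression from the v1 patch, kept for comparison: it yields the
 * bit count rounded up to a multiple of 32, not a byte or word count. */
static int v1_ectrl_words(uint8_t bits)
{
    return (bits + 31) & ~31;
}
```

For a 33-entry vector the first helper returns 8 bytes, while the v1
expression returns 64, which would then be added to PCI_ACS_SIZEOF as if it
were a byte count.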

> > +pci_set_word(dev->config + offset + PCI_ACS_CAP,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> 
> Some of these bits are only valid for downstream ports, it would
> violate the spec to set them on an endpoint.

I must admit I haven't really dived deep here - I just set the cap bits that
one of my servers sets for its root ports - this is the one I used as model:

00:02.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22) (prog-if 00 [Normal decode])
00:02.0 0604: 8086:3409 (rev 22)

> > +pci_set_word(dev->config + offset + PCI_ACS_CTRL,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> 
> The default values of the control register bits is zero, so we
> shouldn't be setting it here and we should have a reset hook to clear
> it.

I agree - I'll implement it,
thanks!

Knut

> 
> > +/* Make CTRL register writable */
> > +memset(dev->wmask + offset + PCI_ACS_CTRL, 0xff, 2);
> > +}
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index 5b82a0d..c2da148 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -129,6 +129,7 @@ void pcie_add_capability(PCIDevice *dev,
> >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> >  
> >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > egress_ctrl_vec_sz);
> >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t 
> > ser_num);
> >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> >  
> > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > index ad4e780..5e7409c 100644
> > --- a/include/hw/pci/pcie_regs.h
> > +++ b/include/hw/pci/pcie_regs.h
> > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> >   PCI_ERR_COR_INTERNAL | \
> >   PCI_ERR_COR_HL_OVERFLOW)
> >  
> > +/* ACS */
> > +#define PCI_ACS_VER0x2
> > +#define PCI_ACS_SIZEOF  8
> > +
> >  #endif /* QEMU_PCIE_REGS_H */
>

Re: [Qemu-devel] [PATCH 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-23 Thread Knut Omang

On Wed, 2019-01-23 at 12:32 -0700, Alex Williamson wrote:
> On Wed, 23 Jan 2019 20:14:07 +0100
> Knut Omang  wrote:
> 
> > On Wed, 2019-01-23 at 12:04 -0700, Alex Williamson wrote:
> > > On Wed, 23 Jan 2019 19:27:59 +0100
> > > Knut Omang  wrote:
> > >   
> > > > Add a helper function to add PCIe capability for Access Control 
> > > > Services (ACS)
> > > > ACS support in the associated root port is a prerequisite to be able to 
> > > > do useful
> > > > passthrough with VFIO without Alex Williamson's pcie_acs_override 
> > > > kernel patch.  
> > > 
> > > Define "useful".
> > 
> > Hmm - just that without the patches, the root port itself 
> > also gets assigned to the same group, which seemed problematic to me
> > (without any further testing than just binding/unbinding to VFIO)
> 
> vfio-pci binding rules only apply to endpoints.  A root port lacking
> ACS will include all devices downstream of it in the IOMMU group, and
> potentially sibling functions, and devices downstream of those, but it
> doesn't absolutely preclude L2 assignment, or L1 userspace usage,
> which is already widely used.  It simply means that all the endpoints
> within that group need to be bound to vfio-pci and can only have a
> single owner.  Thanks,

I see, that makes sense - I'll moderate my language!

Thanks,
Knut
> 
> Alex
> 
> > > We can certainly still assign single function PFs to
> > > an L2 guest, or multi-function so long as all the functions are
> > > assigned.  I won't deny that it's problematic, but it's a virtual
> > > topology that can be adjusted, so I think this is overstating things a
> > > bit.
> > >   
> > > > Signed-off-by: Knut Omang 
> > > > ---
> > > >  hw/pci/pcie.c  | 14 ++
> > > >  include/hw/pci/pcie.h  |  1 +
> > > >  include/hw/pci/pcie_regs.h |  4 
> > > >  3 files changed, 19 insertions(+)
> > > > 
> > > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > > > index 230478f..18feff5 100644
> > > > --- a/hw/pci/pcie.c
> > > > +++ b/hw/pci/pcie.c
> > > > @@ -906,3 +906,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> > > >  
> > > >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> > > >  }
> > > > +
> > > > +/* Add an ACS (Access Control Services) capability */
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > > > egress_ctrl_vec_sz)
> > > > +{
> > > > +int ectrl_words = (egress_ctrl_vec_sz + 31) & ~31;
> > > > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
> > > > +offset, PCI_ACS_SIZEOF + ectrl_words);  
> > > 
> > > The egress control vector is only valid if the egress control
> > > capability is enabled, which is not set below, so this just seems to
> > > waste config space and introduces a meaningless function arg.
> > >   
> > > > +pci_set_word(dev->config + offset + PCI_ACS_CAP,
> > > > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > > 
> > > Some of these bits are only valid for downstream ports, it would
> > > violate the spec to set them on an endpoint.
> > >   
> > > > +pci_set_word(dev->config + offset + PCI_ACS_CTRL,
> > > > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> > > 
> > > The default values of the control register bits is zero, so we
> > > shouldn't be setting it here and we should have a reset hook to clear
> > > it.
> > >   
> > > > +/* Make CTRL register writable */
> > > > +memset(dev->wmask + offset + PCI_ACS_CTRL, 0xff, 2);
> > > > +}
> > > > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > > > index 5b82a0d..c2da148 100644
> > > > --- a/include/hw/pci/pcie.h
> > > > +++ b/include/hw/pci/pcie.h
> > > > @@ -129,6 +129,7 @@ void pcie_add_capability(PCIDevice *dev,
> > > >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> > > >  
> > > >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > > > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > > > egress_ctrl_vec_sz);
> > > >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t 
> > > > ser_num);
> > > >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> > > >  
> > > > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > > > index ad4e780..5e7409c 100644
> > > > --- a/include/hw/pci/pcie_regs.h
> > > > +++ b/include/hw/pci/pcie_regs.h
> > > > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> > > >   PCI_ERR_COR_INTERNAL |
> > > >  \
> > > >   PCI_ERR_COR_HL_OVERFLOW)
> > > >  
> > > > +/* ACS */
> > > > +#define PCI_ACS_VER0x2
> > > > +#define PCI_ACS_SIZEOF  8
> > > > +
> > > >  #endif /* QEMU_PCIE_REGS_H */  
> > >   
> > 
>

Re: [Qemu-devel] [PATCH 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-01-23 Thread Knut Omang

On Wed, 2019-01-23 at 12:04 -0700, Alex Williamson wrote:
> On Wed, 23 Jan 2019 19:28:00 +0100
> Knut Omang  wrote:
> 
> > Claiming ACS support allows passthrough of an emulated device
> > (in a nested virtualization setting) with VFIO without Alex Williamson's patch
> > for the pcie_acs_override kernel parameter.
> > A similar need appears on Windows with Hyper-V
> > 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci-bridge/gen_pcie_root_port.c | 2 ++
> >  hw/pci-bridge/ioh3420.c| 1 -
> >  hw/pci-bridge/pcie_root_port.c | 3 +++
> >  include/hw/pci/pcie_port.h | 1 +
> >  4 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
> > index 9766edb..b5a5ecc 100644
> > --- a/hw/pci-bridge/gen_pcie_root_port.c
> > +++ b/hw/pci-bridge/gen_pcie_root_port.c
> > @@ -20,6 +20,7 @@
> >  OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
> >  
> >  #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
> > +#define GEN_PCIE_ROOT_PORT_ACS_OFFSET   0x148
> 
> Perhaps (GEN_PCIE_ROOT_PORT_AER_OFFSET + PCI_ERR_SIZEOF)
> 
> >  #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
> >  
> >  typedef struct GenPCIERootPort {
> > @@ -149,6 +150,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
> >  rpc->interrupts_init = gen_rp_interrupts_init;
> >  rpc->interrupts_uninit = gen_rp_interrupts_uninit;
> >  rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
> > +rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
> >  }
> >  
> >  static const TypeInfo gen_rp_dev_info = {
> > diff --git a/hw/pci-bridge/ioh3420.c b/hw/pci-bridge/ioh3420.c
> > index 81f2de6..2064939 100644
> > --- a/hw/pci-bridge/ioh3420.c
> > +++ b/hw/pci-bridge/ioh3420.c
> > @@ -71,7 +71,6 @@ static int ioh3420_interrupts_init(PCIDevice *d, Error **errp)
> >  if (rc < 0) {
> >  assert(rc == -ENOTSUP);
> >  }
> > -
> 
> Unrelated

Oops, will fix, thanks.

Knut 

> 
> >  return rc;
> >  }
> >  
> > diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
> > index 34ad767..c33a493 100644
> > --- a/hw/pci-bridge/pcie_root_port.c
> > +++ b/hw/pci-bridge/pcie_root_port.c
> > @@ -106,6 +106,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
> >  pcie_aer_root_init(d);
> >  rp_aer_vector_update(d);
> >  
> > +if (rpc->acs_offset) {
> > +pcie_acs_init(d, rpc->acs_offset, 0);
> > +}
> >  return;
> >  
> >  err:
> > diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
> > index df242a0..09586f4 100644
> > --- a/include/hw/pci/pcie_port.h
> > +++ b/include/hw/pci/pcie_port.h
> > @@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
> >  int exp_offset;
> >  int aer_offset;
> >  int ssvid_offset;
> > +int acs_offset;/* If nonzero, optional ACS capability offset */
> >  int ssid;
> >  } PCIERootPortClass;
> >  
>

Re: [Qemu-devel] [PATCH 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-23 Thread Knut Omang

On Wed, 2019-01-23 at 12:04 -0700, Alex Williamson wrote:
> On Wed, 23 Jan 2019 19:27:59 +0100
> Knut Omang  wrote:
> 
> > Add a helper function to add PCIe capability for Access Control Services 
> > (ACS)
> > ACS support in the associated root port is a prerequisite to be able to do 
> > useful
> > passthrough with VFIO without Alex Williamson's pcie_acs_override kernel 
> > patch.
> 
> Define "useful".  

Hmm - just that without the patches, the root port itself 
also gets assigned to the same group, which seemed problematic to me
(without any further testing than just binding/unbinding to VFIO)

Knut

> We can certainly still assign single function PFs to
> an L2 guest, or multi-function so long as all the functions are
> assigned.  I won't deny that it's problematic, but it's a virtual
> topology that can be adjusted, so I think this is overstating things a
> bit.
> 
> > Signed-off-by: Knut Omang 
> > ---
> >  hw/pci/pcie.c  | 14 ++
> >  include/hw/pci/pcie.h  |  1 +
> >  include/hw/pci/pcie_regs.h |  4 
> >  3 files changed, 19 insertions(+)
> > 
> > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> > index 230478f..18feff5 100644
> > --- a/hw/pci/pcie.c
> > +++ b/hw/pci/pcie.c
> > @@ -906,3 +906,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
> >  
> >  pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
> >  }
> > +
> > +/* Add an ACS (Access Control Services) capability */
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > egress_ctrl_vec_sz)
> > +{
> > +int ectrl_words = (egress_ctrl_vec_sz + 31) & ~31;
> > +pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
> > +offset, PCI_ACS_SIZEOF + ectrl_words);
> 
> The egress control vector is only valid if the egress control
> capability is enabled, which is not set below, so this just seems to
> waste config space and introduces a meaningless function arg.
> 
> > +pci_set_word(dev->config + offset + PCI_ACS_CAP,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> 
> Some of these bits are only valid for downstream ports, it would
> violate the spec to set them on an endpoint.
> 
> > +pci_set_word(dev->config + offset + PCI_ACS_CTRL,
> > + PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
> 
> The default values of the control register bits is zero, so we
> shouldn't be setting it here and we should have a reset hook to clear
> it.
> 
> > +/* Make CTRL register writable */
> > +memset(dev->wmask + offset + PCI_ACS_CTRL, 0xff, 2);
> > +}
> > diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> > index 5b82a0d..c2da148 100644
> > --- a/include/hw/pci/pcie.h
> > +++ b/include/hw/pci/pcie.h
> > @@ -129,6 +129,7 @@ void pcie_add_capability(PCIDevice *dev,
> >  void pcie_sync_bridge_lnk(PCIDevice *dev);
> >  
> >  void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
> > +void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t 
> > egress_ctrl_vec_sz);
> >  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t 
> > ser_num);
> >  void pcie_ats_init(PCIDevice *dev, uint16_t offset);
> >  
> > diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
> > index ad4e780..5e7409c 100644
> > --- a/include/hw/pci/pcie_regs.h
> > +++ b/include/hw/pci/pcie_regs.h
> > @@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
> >   PCI_ERR_COR_INTERNAL | \
> >   PCI_ERR_COR_HL_OVERFLOW)
> >  
> > +/* ACS */
> > +#define PCI_ACS_VER0x2
> > +#define PCI_ACS_SIZEOF  8
> > +
> >  #endif /* QEMU_PCIE_REGS_H */
>

[Qemu-devel] [PATCH 2/2] gen_pcie_root_port: Add ACS (Access Control Services) capability

2019-01-23 Thread Knut Omang

Claiming ACS support allows passthrough of an emulated device
(in a nested virtualization setting) with VFIO without Alex Williamson's patch
for the pcie_acs_override kernel parameter.
A similar need appears on Windows with Hyper-V.

Signed-off-by: Knut Omang 
---
 hw/pci-bridge/gen_pcie_root_port.c | 2 ++
 hw/pci-bridge/ioh3420.c| 1 -
 hw/pci-bridge/pcie_root_port.c | 3 +++
 include/hw/pci/pcie_port.h | 1 +
 4 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/pci-bridge/gen_pcie_root_port.c b/hw/pci-bridge/gen_pcie_root_port.c
index 9766edb..b5a5ecc 100644
--- a/hw/pci-bridge/gen_pcie_root_port.c
+++ b/hw/pci-bridge/gen_pcie_root_port.c
@@ -20,6 +20,7 @@
 OBJECT_CHECK(GenPCIERootPort, (obj), TYPE_GEN_PCIE_ROOT_PORT)
 
 #define GEN_PCIE_ROOT_PORT_AER_OFFSET   0x100
+#define GEN_PCIE_ROOT_PORT_ACS_OFFSET   0x148
 #define GEN_PCIE_ROOT_PORT_MSIX_NR_VECTOR   1
 
 typedef struct GenPCIERootPort {
@@ -149,6 +150,7 @@ static void gen_rp_dev_class_init(ObjectClass *klass, void *data)
 rpc->interrupts_init = gen_rp_interrupts_init;
 rpc->interrupts_uninit = gen_rp_interrupts_uninit;
 rpc->aer_offset = GEN_PCIE_ROOT_PORT_AER_OFFSET;
+rpc->acs_offset = GEN_PCIE_ROOT_PORT_ACS_OFFSET;
 }
 
 static const TypeInfo gen_rp_dev_info = {
diff --git a/hw/pci-bridge/ioh3420.c b/hw/pci-bridge/ioh3420.c
index 81f2de6..2064939 100644
--- a/hw/pci-bridge/ioh3420.c
+++ b/hw/pci-bridge/ioh3420.c
@@ -71,7 +71,6 @@ static int ioh3420_interrupts_init(PCIDevice *d, Error **errp)
 if (rc < 0) {
 assert(rc == -ENOTSUP);
 }
-
 return rc;
 }
 
diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 34ad767..c33a493 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -106,6 +106,9 @@ static void rp_realize(PCIDevice *d, Error **errp)
 pcie_aer_root_init(d);
 rp_aer_vector_update(d);
 
+if (rpc->acs_offset) {
+pcie_acs_init(d, rpc->acs_offset, 0);
+}
 return;
 
 err:
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
index df242a0..09586f4 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -78,6 +78,7 @@ typedef struct PCIERootPortClass {
 int exp_offset;
 int aer_offset;
 int ssvid_offset;
+int acs_offset; /* If nonzero, optional ACS capability offset */
 int ssid;
 } PCIERootPortClass;
 
-- 
git-series 0.9.1

[Qemu-devel] [PATCH 0/2] pcie: Add simple ACS "support" to the generic PCIe root port

2019-01-23 Thread Knut Omang

These two patches together implement a PCIe capability
config space header for Access Control Services (ACS) for the
new QEMU-specific generic root port. ACS support in the
associated root port is a prerequisite for passing a function of
the device populating the port through to an L2 guest from an unmodified kernel.
Without this, the IOMMU group the device belongs to will also
include the root port itself, and all functions the device provides.

With an SR/IOV device this becomes even more important, as the whole
purpose of SR/IOV is to be able to share out individual VFs to different
guests, which will not be permitted by VFIO or the Windows Hyper-V equivalent
unless ACS is supported by the root port.

These patches can also be found as part of an updated version of
my SR/IOV emulation patch set at

  https://github.com/knuto/qemu/tree/sriov_patches_v9

Knut Omang (2):
  pcie: Add a simple PCIe ACS (Access Control Services) helper function
  gen_pcie_root_port: Add ACS (Access Control Services) capability

 hw/pci-bridge/gen_pcie_root_port.c |  2 ++
 hw/pci-bridge/ioh3420.c|  1 -
 hw/pci-bridge/pcie_root_port.c |  3 +++
 hw/pci/pcie.c  | 14 ++
 include/hw/pci/pcie.h  |  1 +
 include/hw/pci/pcie_port.h |  1 +
 include/hw/pci/pcie_regs.h |  4 
 7 files changed, 25 insertions(+), 1 deletion(-)

base-commit: a8d2b0685681e2f291faaa501efbbd76875f8ec8
-- 
git-series 0.9.1

[Qemu-devel] [PATCH 1/2] pcie: Add a simple PCIe ACS (Access Control Services) helper function

2019-01-23 Thread Knut Omang

Add a helper function to add a PCIe capability for Access Control Services (ACS).
ACS support in the associated root port is a prerequisite to be able to do useful
passthrough with VFIO without Alex Williamson's pcie_acs_override kernel patch.

Signed-off-by: Knut Omang 
---
 hw/pci/pcie.c  | 14 ++
 include/hw/pci/pcie.h  |  1 +
 include/hw/pci/pcie_regs.h |  4 
 3 files changed, 19 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 230478f..18feff5 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -906,3 +906,17 @@ void pcie_ats_init(PCIDevice *dev, uint16_t offset)
 
 pci_set_word(dev->wmask + dev->exp.ats_cap + PCI_ATS_CTRL, 0x800f);
 }
+
+/* Add an ACS (Access Control Services) capability */
+void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t egress_ctrl_vec_sz)
+{
+int ectrl_words = (egress_ctrl_vec_sz + 31) & ~31;
+pcie_add_capability(dev, PCI_EXT_CAP_ID_ACS, PCI_ACS_VER,
+offset, PCI_ACS_SIZEOF + ectrl_words);
+pci_set_word(dev->config + offset + PCI_ACS_CAP,
+ PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+pci_set_word(dev->config + offset + PCI_ACS_CTRL,
+ PCI_ACS_SV | PCI_ACS_TB | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF);
+/* Make CTRL register writable */
+memset(dev->wmask + offset + PCI_ACS_CTRL, 0xff, 2);
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 5b82a0d..c2da148 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -129,6 +129,7 @@ void pcie_add_capability(PCIDevice *dev,
 void pcie_sync_bridge_lnk(PCIDevice *dev);
 
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
+void pcie_acs_init(PCIDevice *dev, uint16_t offset, uint8_t egress_ctrl_vec_sz);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset);
 
diff --git a/include/hw/pci/pcie_regs.h b/include/hw/pci/pcie_regs.h
index ad4e780..5e7409c 100644
--- a/include/hw/pci/pcie_regs.h
+++ b/include/hw/pci/pcie_regs.h
@@ -175,4 +175,8 @@ typedef enum PCIExpLinkWidth {
  PCI_ERR_COR_INTERNAL | \
  PCI_ERR_COR_HL_OVERFLOW)
 
+/* ACS */
+#define PCI_ACS_VER     0x2
+#define PCI_ACS_SIZEOF  8
+
 #endif /* QEMU_PCIE_REGS_H */
-- 
git-series 0.9.1
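
A note on the size computation in pcie_acs_init() above: the expression
(egress_ctrl_vec_sz + 31) & ~31 rounds the number of vector bits up to a
multiple of 32, and the result then appears to be passed on as a byte count.
With the only caller in this series passing 0 this has no effect, but for a
nonzero vector the capability would be larger than needed. A hedged sketch of
a byte-accurate alternative follows; the helper names are hypothetical, and it
assumes one bit per function or port, stored in whole 32-bit registers, with 0
meaning "no vector" as in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed part of the ACS capability in bytes, as defined in the patch */
#define PCI_ACS_SIZEOF 8

/*
 * Hypothetical helper: bytes needed for an egress control vector of
 * 'nbits' bits, rounded up to whole 32-bit registers.
 */
static uint16_t acs_egress_vec_bytes(uint8_t nbits)
{
    return (uint16_t)(((nbits + 31u) / 32u) * 4u);
}

/* Hypothetical helper: total size in bytes of the ACS capability. */
static uint16_t acs_cap_size(uint8_t nbits)
{
    uint16_t size = (uint16_t)(PCI_ACS_SIZEOF + acs_egress_vec_bytes(nbits));

    /* Extended capabilities are made up of whole 32-bit registers. */
    assert(size % 4 == 0);
    return size;
}
```

For example, a 33-bit vector needs two 32-bit registers, so the capability
would span 16 bytes instead of the 72 that the expression in the patch yields.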

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang

On Thu, 2018-07-26 at 20:05 +0300, Marcel Apfelbaum wrote:
> 
> On 07/26/2018 07:42 PM, Knut Omang wrote:
> > On Thu, 2018-07-26 at 19:41 +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jul 26, 2018 at 06:38:32PM +0200, Knut Omang wrote:
> > > > On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> > > > > Hi
> > > > > 
> > > > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > > > > Hey,
> > > > > > > 
> > > > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > > > 
> > > > > > > Here is the error message I get when checking SRIOV support of our
> > > > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > > > "
> > > > > > > SR-IOV cannot be used on this system as the PCI Express hardware does not
> > > > > > > support Access Control Services (ACS) at any root port.
> > > > > 
> > > > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I
> > > > > am not sure ACS is a prerequisite
> > > > > for SR-IOV. We would need ARI support for allowing more than 8 VFs, but
> > > > > QEMU doesn't support that either (yet).
> > > > > 
> > > > > Knut Omang has some working patches, he successfully implemented SR-IOV
> > > > > with QEMU, see:
> > > > >https://github.com/knuto/qemu/tree/sriov_patches_v7
> > > > > 
> > > > > The code was not merged since we need at least a device with SR-IOV
> > > > > support to justify the addition.
> > > > 
> > > > FYI, I recently rebased these to latest master but just didn't get to push
> > > > them out. I did just now - they are available here:
> > > > 
> > > > https://github.com/knuto/qemu/tree/sriov_patches_v8
> > > > 
> > > > As far as I know ARI support with my patch set works just fine - I have
> > > > tested it with lots of VFs.
> > > > 
> > > > One of the patches in the series (pci: Make use of the devfn property when
> > > > registering new devices) is necessary to make ARI work as it should with
> > > > SR/IOV.
> > > > 
> > > > For the hardware model I developed the SR/IOV patches for, I also added
> > > > enough ACS support in the root port (PCIe capability helper patch + usage in
> > > > ioh3420) to make VFIO "happy". I haven't submitted them because they
> > > > are "questionable" since they likely do not reflect the actual features of
> > > > the ioh3420.
> > > 
> > > In that the actual ioh3420 doesn't support ACS?
> > 
> > yes.. I don't have one so I don't know but that was my assumption..
> 
> Hi Knut,
> 
> We have now a generic PCIe Root Port we can add whatever we want to it.
> See please hw/pci-bridge/gen_pcie_root_port.c.

Ok, I see! - I wasn't aware of that. I was thinking more in terms of just 
emulating a slightly newer Intel root port but never got around to that.

> So your patches add both ARI and ACS support, nice!
> Maybe it worth merging at least these features.

I'll have a look at the generic root port, it is probably a natural extension 
for v9 of the sriov patch set :-)

Thanks,
Knut

> 
> Thanks,
> Marcel
> 
> > 
> > Knut
> > 
> > > > I can make those available if interesting.
> > > > 
> > > > Thanks,
> > > > Knut
> > > > 
> > > > > > > Contact your system vendor for further information.
> > > > > > > "
> > > > > > 
> > > > > > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > > > > > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> > > > > 
> > > > > Thanks,
> > > > > Marcel
> > > > > 
> > > > > > > Could you please advise about what could be the issue here?
> > > >

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang

On Thu, 2018-07-26 at 20:11 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 26, 2018 at 10:42:52AM -0600, Alex Williamson wrote:
> > On Thu, 26 Jul 2018 19:14:45 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Thu, Jul 26, 2018 at 06:51:13PM +0300, Marcel Apfelbaum wrote:
> > > > Hi
> > > > 
> > > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:  
> > > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:  
> > > > > > Hey,
> > > > > > 
> > > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > > 
> > > > > > Here is the error message I get when checking SRIOV support of  our
> > > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > > "
> > > > > > SR-IOV cannot be used on this system as the PCI Express hardware
> > > > > > does not support Access Control Services (ACS) at any root port.  
> > > > 
> > > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I
> > > > am not sure ACS is a prerequisite for SR-IOV.
> > 
> > ACS is certainly not a prerequisite for SR-IOV.
> >  
> > > Looks like windows blocks dev assignment in nested VMs without it.
> > > Thinking about it, doesn't vfio do the same by default? I think vfio has
> > > a flag to override this though.
> > 
> > IOMMU grouping in Linux takes isolation via ACS and device specific
> > mechanisms into account, limiting the granularity with which a
> > userspace driver can claim ownership of devices,
> 
> In that you must assign all devices behind this port as a group?

Yes, that's why I needed to add PCIe config space ACS support in ioh3420:
to be able to use VFIO to pass through individual VFs to different L2 VMs with
"plain" kernels on the L1 host which owns the SR/IOV device.

Knut

> > but it doesn't
> > actually prevent enabling SR-IOV on the endpoint.  It just makes it
> > less useful if your intention is to use SR-IOV for device assignment.
> > The only overrides for this are out-of-tree.  Thanks,
> >
> > 
> > Alex
> 
>

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang

On Thu, 2018-07-26 at 19:41 +0300, Michael S. Tsirkin wrote:
> On Thu, Jul 26, 2018 at 06:38:32PM +0200, Knut Omang wrote:
> > On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> > > Hi
> > > 
> > > On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > > > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > > > Hey,
> > > > > 
> > > > > Our team is adding a NIC functional  emulation to QEMU.
> > > > > One of the features we are adding to this NIC is SRIOV.
> > > > > 
> > > > > Here is the error message I get when checking SRIOV support of  our
> > > > > emulated NIC on Win2016 server (the hyper-v VM).
> > > > > "
> > > > > SR-IOV cannot be used on this system as the PCI Express hardware does not
> > > > > support Access Control Services (ACS) at any root port.
> > > 
> > > QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> > > am not sure ACS is a prerequisite
> > > for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
> > > QEMU doesn't support that either (yet).
> > > 
> > > Knut Omang has some working patches, he successfully implemented SR-IOV 
> > > with QEMU, see:
> > >   https://github.com/knuto/qemu/tree/sriov_patches_v7
> > > 
> > > The code was not merged since we need at least a device with SR-IOV 
> > > support to justify the addition.
> > 
> > FYI, I recently rebased these to latest master but just didn't get to push
> > them out. I did just now - they are available here:
> > 
> >    https://github.com/knuto/qemu/tree/sriov_patches_v8
> > 
> > As far as I know ARI support with my patch set works just fine - I have
> > tested it with lots of VFs.
> > 
> > One of the patches in the series (pci: Make use of the devfn property when
> > registering new devices) is necessary to make ARI work as it should with
> > SR/IOV.
> > 
> > For the hardware model I developed the SR/IOV patches for, I also added
> > enough ACS support in the root port (PCIe capability helper patch + usage in
> > ioh3420) to make VFIO "happy". I haven't submitted them because they
> > are "questionable" since they likely do not reflect the actual features of
> > the ioh3420.
> 
> In that the actual ioh3420 doesn't support ACS?

yes.. I don't have one so I don't know but that was my assumption..

Knut

> > I can make those available if interesting.
> > 
> > Thanks,
> > Knut
> > 
> > > > > Contact your system vendor for further information.
> > > > > "
> > > > 
> > > > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > > > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> > > 
> > > Thanks,
> > > Marcel
> > > 
> > > > > Could you please advise about what could be the issue here?
> > > > > 
> > > > > BTW: I use same configuration (VM XML file attached) when running
> > > > > linux VM
> > > > > (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
> > > > > working for me there.
> > > > > 
> > > > > Here the XML file I use to define the VM (our emulated NIC is added at
> > > > > the
> > > > > end of XML):
> > > > > "
> > > > > [libvirt domain XML; the markup did not survive archiving. Surviving
> > > > > values: nst105, 0249a525-2ee2-432b-a1f5-a6db83b089a3, 8388608, 8388608,
> > > > > 8, /machine, hvm, SandyBridge]
>

Re: [Qemu-devel] Question: SRIOV support over Win Hyper-V VM running in QEMU process on Linux host

2018-07-26 Thread Knut Omang

On Thu, 2018-07-26 at 18:51 +0300, Marcel Apfelbaum wrote:
> Hi
> 
> On 07/26/2018 05:52 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 12, 2018 at 07:33:14AM +, Elijah Shakkour wrote:
> > > Hey,
> > > 
> > > Our team is adding a NIC functional  emulation to QEMU.
> > > One of the features we are adding to this NIC is SRIOV.
> > > 
> > > Here is the error message I get when checking SRIOV support of  our
> > > emulated NIC on Win2016 server (the hyper-v VM).
> > > "
> > > SR-IOV cannot be used on this system as the PCI Express hardware does not
> > > support Access Control Services (ACS) at any root port.
> 
> QEMU's emulated PCI Express Root Ports do not support ACS yet, however I 
> am not sure ACS is a prerequisite
> for SR-IOV. We would need ARI support for allowing more than 8VFs, but 
> QEMU doesn't support that either (yet).
> 
> Knut Omang has some working patches, he successfully implemented SR-IOV 
> with QEMU, see:
>   https://github.com/knuto/qemu/tree/sriov_patches_v7
> 
> The code was not merged since we need at least a device with SR-IOV 
> support to justify the addition.

FYI, I recently rebased these to latest master but just didn't get to push them
out. I did just now - they are available here:

   https://github.com/knuto/qemu/tree/sriov_patches_v8

As far as I know ARI support with my patch set works just fine - I have tested
it with lots of VFs.

One of the patches in the series (pci: Make use of the devfn property when
registering new devices) is necessary to make ARI work as it should with SR/IOV.

For the hardware model I developed the SR/IOV patches for, I also added enough
ACS support in the root port (PCIe capability helper patch + usage in
ioh3420) to make VFIO "happy". I haven't submitted them because they
are "questionable" since they likely do not reflect the actual features of the
ioh3420. I can make those available if interesting.

Thanks,
Knut

> > > Contact your system vendor for further information.
> > > "
> > 
> > I'm not sure what the status of emulated SR-IOV is so I have CCed
> > Michael Tsirkin and Marcel Apfelbaum, the PCI maintainers in QEMU.
> 
> Thanks,
> Marcel
> 
> > > Could you please advise about what could be the issue here?
> > > 
> > > BTW: I use same configuration (VM XML file attached) when running linux VM
> > > (RH7.2) image (instead of Win Hyper-V) over the same host and SRIOV is
> > > working for me there.
> > > 
> > > Here the XML file I use to define the VM (our emulated NIC is added at the
> > > end of XML):
> > > "
> > > [libvirt domain XML; the markup did not survive archiving. Surviving
> > > values: nst105, 0249a525-2ee2-432b-a1f5-a6db83b089a3, 8388608, 8388608, 8,
> > > /machine, hvm, SandyBridge, destroy, restart, destroy,
> > > /opt/qemu/bin/qemu-system-x86_64, keymap='en-us']
> > > "
> > > __
> > > General info:
> > > Host OS: RH7.0 (Kernel: 4.14.13)
> > > QEMU version: 2.11
> > > libvirt version: 3.2.0
> > > Running the following on the host shows that both nested and IOMMU are
> > > enabled:
> > > ~]#: cat /sys/module/kvm_intel/parameters/nested
> > > Y
> > > ~]# dmesg | grep -e DMAR -e IOMMU
> > > [0.00] DMAR: IOMMU enabled
> > > 
> > > Thanks,
>

Re: [Qemu-devel] [PULL v1 2/2] tests: Add test-listen - a stress test for QEMU socket listen

2018-05-01 Thread Knut Omang

On Tue, 2018-05-01 at 15:07 +0100, Daniel P. Berrangé wrote:
> On Tue, May 01, 2018 at 04:00:35PM +0200, Knut Omang wrote:
> > Hi Peter,
> > 
> > Seems this test was lost along the way?
> > 
> > I thought it got merged with Daniel's fix to the potential leak 
> > that the test exposed in your build, but it seems not?
> 
> I've never been able to get this test, nor another one of my own
> to successfully pass automated testing in all the various test
> envs Peter and our many CI systems use.

Ok, that explains it - no big deal from my side, I just thought 
it was "value lost" against future errors.

Would it make sense to let the test check for environment and just 
really run in cases where it reliably passes? 

That will also serve as documentation as to where there might be 
additional issues hidden?

Knut

Re: [Qemu-devel] [PULL v1 2/2] tests: Add test-listen - a stress test for QEMU socket listen

2018-05-01 Thread Knut Omang

Hi Peter,

Seems this test was lost along the way?

I thought it got merged with Daniel's fix to the potential leak 
that the test exposed in your build, but it seems not?

Thanks,
Knut

On Mon, 2017-11-06 at 15:33 +, Daniel P. Berrange wrote:
> From: Knut Omang <knut.om...@oracle.com>
> 
> There's a potential race condition between multiple bind()'s
> attempting to bind to the same port, which occasionally
> allows more than one bind to succeed against the same port.
> 
> When a subsequent listen() call is made with the same socket
> only one will succeed.
> 
> The current QEMU code does however not take this situation into account
> and the listen will cause the code to break out and fail even
> when there are actually available ports to use.
> 
> This test exposes two subtests:
> 
> /socket/listen-serial
> /socket/listen-compete
> 
> The "compete" subtest creates a number of threads and has them all try to
> bind to the same port, with a large enough offset input to
> allow each thread to get its own port.
> The "serial" subtest just does the same, except in series in a
> single thread.
> 
> The serial version passes, probably in most versions of QEMU.
> 
> The parallel version exposes the problem in a relatively reliable way,
> eg. it fails a majority of times, but not with a 100% rate, occasional
> passes can be seen. Nevertheless this is quite good given that
> the bug was tricky to reproduce and has been left undetected for
> a while.
> 
> The problem seems to be present in all versions of QEMU.
> 
> The original failure scenario occurred with VNC port allocation
> in a traditional Xen based build, in different code
> but with similar functionality.
> 
> Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> Signed-off-by: Knut Omang <knut.om...@oracle.com>
> Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> Signed-off-by: Daniel P. Berrange <berra...@redhat.com>
> ---
>  tests/Makefile.include |   2 +
>  tests/test-listen.c    | 253 +
>  2 files changed, 255 insertions(+)
>  create mode 100644 tests/test-listen.c
> 
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index 434a2ce868..e4bb88bd3d 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -154,6 +154,7 @@ gcov-files-check-bufferiszero-y = util/bufferiszero.c
>  check-unit-y += tests/test-uuid$(EXESUF)
>  check-unit-y += tests/ptimer-test$(EXESUF)
>  gcov-files-ptimer-test-y = hw/core/ptimer.c
> +check-unit-y += tests/test-listen$(EXESUF)
>  check-unit-y += tests/test-qapi-util$(EXESUF)
>  gcov-files-test-qapi-util-y = qapi/qapi-util.c
>  
> @@ -804,6 +805,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
>  tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
>  tests/numa-test$(EXESUF): tests/numa-test.o
>  tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
> +tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
>  
>  tests/migration/stress$(EXESUF): tests/migration/stress.o
>   $(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $<
> ,"LINK","$(TARGET_DIR)$@")
> diff --git a/tests/test-listen.c b/tests/test-listen.c
> new file mode 100644
> index 00..03c4c8f03b
> --- /dev/null
> +++ b/tests/test-listen.c
> @@ -0,0 +1,253 @@
> +/*
> + * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
> + *Author: Knut Omang <knut.om...@oracle.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 or later
> + * as published by the Free Software Foundation.
> + *
> + * Test parallel port listen configuration with
> + * dynamic port allocation
> + */
> +
> +#include "qemu/osdep.h"
> +#include "libqtest.h"
> +#include "qemu-common.h"
> +#include "qemu/thread.h"
> +#include "qemu/sockets.h"
> +#include "qapi/error.h"
> +
> +#define NAME_LEN 1024
> +#define PORT_LEN 16
> +
> +struct thr_info {
> +QemuThread thread;
> +int to_port;
> +bool ipv4;
> +bool ipv6;
> +int got_port;
> +int eno;
> +int fd;
> +const char *errstr;
> +char hostname[NAME_LEN + 1];
> +char port[PORT_LEN + 1];
> +};
> +
> +
> +/* These two functions taken from test-io-channel-socket.c */
> +s
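
The bind()/listen() race described in the quoted commit message can be
tolerated by treating EADDRINUSE from listen() the same way as EADDRINUSE from
bind(): give up on that port and move on to the next one in the range. The
following is only a sketch of that idea using plain POSIX sockets on the IPv4
loopback address; listen_in_range() is a hypothetical name and this is not the
code in util/qemu-sockets.c.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: try each port in [port_min, port_max] and return
 * a listening socket on the first one where both bind() and listen()
 * succeed. Returns -1 with errno set if no port could be used.
 */
static int listen_in_range(int port_min, int port_max, int *got_port)
{
    assert(got_port != NULL);
    assert(port_min > 0 && port_min <= port_max && port_max <= 65535);

    for (int port = port_min; port <= port_max; port++) {
        struct sockaddr_in sa;
        int saved_errno;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0) {
            return -1;
        }
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = htons((uint16_t)port);

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
            listen(fd, 1) == 0) {
            *got_port = port;
            return fd;
        }

        /* Either call may report that somebody else won the port. */
        saved_errno = errno;
        close(fd);
        if (saved_errno != EADDRINUSE) {
            errno = saved_errno;
            return -1;
        }
    }
    errno = EADDRINUSE;
    return -1;
}
```

A caller that asks for a range starting at a port that is already being
listened on simply ends up on a later port instead of failing outright.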

Re: [Qemu-devel] FW: Are there any qemu emulated SR-IOV devices?

2017-10-11 Thread Knut Omang

On Thu, 2017-10-05 at 16:31 +0200, Kashyap Chamarthy wrote:
> [Sorry, I'm a bit late in responding, as I missed this e-mail.]
> 
> On Thu, Jun 08, 2017 at 07:35:52AM -0700, mwoodpatr...@gmail.com wrote:
> > I wanted to play around with SR-IOV using qemu and was wondering if there
> > are any qemu emulated SR-IOV devices I could experiment with?
> 
> Near as I know there isn't any existing emulated SR-IOV device in QEMU.
> But yes, an emulated SR-IOV would also be useful for testing purposes
> for projects like OpenStack, without having the need for real hardware.
> 
> > If not I plan on creating one and would appreciate any pointers to documents
> > describing how to add an emulated device that supports more than 8 functions
> > and has ARI enabled
> 
> I'm far from an expert on this area, but just wanted to point out that
> there's some existing support for SR-IOV in QEMU, added via:
> 
> [PATCH v6 0/4] pcie: Add support for Single Root I/O Virtualization --
> https://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05155.html
> 
> Knut Omang (Cced) who added the above support also had some notes here
> on how to implement an SR-IOV capable device.:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05157.html
> -- pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> 
> "Add a small intro + minimal documentation for how to implement
> SR/IOV support for an emulated device.
> 
> 
> [...]

These are not (yet) in QEMU but can still be found here:

https://github.com/knuto/qemu

I try to rebase them from time to time.
The reason why Michael has not accepted them (yet) is, to my understanding, the
lack of working example devices - correct me if I am wrong.

I did provide the igb device, which works as a proof of concept of a device with
32 VFs and use of stride and ARI, in that the SR/IOV logic itself can be
verified, but it is not a working device, just a quick hack on the e1000 code to
make it look like an igb with SR/IOV support.

But as far as I know, several people have used the SR/IOV patches for their own
device models.

We have a fairly elaborate and complex model (of an InfiniBand HCA) that uses
this, and that has been used quite extensively. The problem is that the main
part of the code implementing the device is inside a simulation model outside of
QEMU, just communicating PCIe transactions with QEMU over a TCP connection. So it
is not that easy to make use of inside QEMU. I do however want to pursue ways to
provide the QEMU side of that implementation to QEMU, as a tool for other device
developers. If there are more people interested in this domain, maybe we should
have a BoF session about it at KVM Forum.

Thanks,
Knut

[Qemu-devel] [PATCH v7 2/4] sockets: factor out a new try_bind() function

2017-08-07 Thread Knut Omang

A refactoring step to prepare for the problem
exposed by the test-listen test in the previous commit.

Simplify and reorganize the IPv6 specific extra
measures and move it out of the for loop to increase
code readability. No semantic changes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Daniel P. Berrange <berra...@redhat.com>
---
 util/qemu-sockets.c | 69 ++
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 1358c81..b4a2f24 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,44 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
+{
+#ifndef IPV6_V6ONLY
+return bind(socket, e->ai_addr, e->ai_addrlen);
+#else
+/*
+ * Deals with first & last cases in matrix in comment
+ * for inet_ai_family_from_address().
+ */
+int v6only =
+((!saddr->has_ipv4 && !saddr->has_ipv6) ||
+ (saddr->has_ipv4 && saddr->ipv4 &&
+  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
+int stat;
+
+ rebind:
+if (e->ai_family == PF_INET6) {
+qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
+sizeof(v6only));
+}
+
+stat = bind(socket, e->ai_addr, e->ai_addrlen);
+if (!stat) {
+return 0;
+}
+
+/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
+ * it could be that the IPv4 port is already claimed, so retry
+ * with v6only set
+ */
+if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
+v6only = 1;
+goto rebind;
+}
+return stat;
+#endif
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -228,39 +266,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-#ifdef IPV6_V6ONLY
-/*
- * Deals with first & last cases in matrix in comment
- * for inet_ai_family_from_address().
- */
-int v6only =
-((!saddr->has_ipv4 && !saddr->has_ipv6) ||
- (saddr->has_ipv4 && saddr->ipv4 &&
-  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
-#endif
 inet_setport(e, p);
-#ifdef IPV6_V6ONLY
-rebind:
-if (e->ai_family == PF_INET6) {
-qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
-sizeof(v6only));
-}
-#endif
-if (bind(slisten, e->ai_addr, e->ai_addrlen) == 0) {
+if (try_bind(slisten, saddr, e) >= 0) {
 goto listen;
 }
-
-#ifdef IPV6_V6ONLY
-/* If we got EADDRINUSE from an IPv6 bind & V6ONLY is unset,
- * it could be that the IPv4 port is already claimed, so retry
- * with V6ONLY set
- */
-if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
-v6only = 1;
-goto rebind;
-}
-#endif
-
 if (p == port_max) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to bind socket");
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v7 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-08-07 Thread Knut Omang

There's a potential race condition between multiple bind()'s
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When a subsequent listen() call is made with the same socket
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to bind
to the same port with a large enough offset input to
allow all threads to get their own port.
The "serial" subtest just does the same, except in series in a
single thread.

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of times, but not at a 100% rate; occasional
passes can be seen. Nevertheless this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 2 files changed, 255 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7af278d..b37c0c8 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,6 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -769,6 +770,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
 tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..03c4c8f
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,253 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *Author: Knut Omang <knut.om...@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 or later
+ * as published by the Free Software Foundation.
+ *
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+bool ipv4;
+bool ipv6;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+char hostname[NAME_LEN + 1];
+char port[PORT_LEN + 1];
+};
+
+
+/* These two functions taken from test-io-channel-socket.c */
+static int check_bind(const char *hostname, bool *has_proto)
+{
+int fd = -1;
+struct addrinfo ai, *res = NULL;
+int rc;
+int ret = -1;
+
+memset(&ai, 0, sizeof(ai));
+ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
+ai.ai_family = AF_UNSPEC;
+ai.ai_socktype = SOCK_STREAM;
+
+/* lookup */
+rc = getaddrinfo(hostname, NULL, &ai, &res);
+if (rc != 0) {
+if (rc == EAI_ADDRFAMILY ||
+rc == EAI_FAMILY) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+fd = qemu_socket(res->ai_family, res->ai_socktype, res->ai_protocol);
+if (fd < 0) {
+goto cleanup;
+}
+
+if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) {
+if (errno == EADDRNOTAVAIL) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+*has_proto = true;
+ done:

[Qemu-devel] [PATCH v7 4/4] sockets: Handle race condition between binds to the same port

2017-08-07 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen(), but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.

Also clean up some issues with error handling to allow more
accurate reporting of the cause of an error.

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Daniel P. Berrange <berra...@redhat.com>
---
 tests/Makefile.include |  2 +-
 util/qemu-sockets.c| 58 ++-
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index b37c0c8..9d2131d 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,7 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index d0d2047..6a511fb 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -206,7 +206,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, port_max, p;
+int slisten = 0;
+int saved_errno = 0;
+bool socket_created = false;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -258,7 +261,7 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
@@ -266,37 +269,58 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 
 slisten = create_fast_reuse_socket(e);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}
 continue;
 }
 
+socket_created = true;
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
 inet_setport(e, p);
-if (try_bind(slisten, saddr, e) >= 0) {
-goto listen;
-}
-if (p == port_max) {
-if (!e->ai_next) {
+rc = try_bind(slisten, saddr, e);
+if (rc) {
+if (errno == EADDRINUSE) {
+continue;
+} else {
 error_setg_errno(errp, errno, "Failed to bind socket");
+goto listen_failed;
 }
 }
+if (!listen(slisten, 1)) {
+goto listen_ok;
+}
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to listen on socket");
+goto listen_failed;
+}
+/* Someone else managed to bind to the same port and beat us
+ * to listen on it! Socket semantics does not allow us to
+ * recover from this situation, so we need to recreate the
+ * socket to allow bind attempts for subsequent ports:
+ */
+closesocket(slisten);
+slisten = create_fast_reuse_socket(e);
+if (slisten < 0) {
+error_setg_errno(errp, errno,
+ "Failed to recreate failed listening socket");
+goto listen_failed;
+}
 }
+}
+error_setg_errno(errp, errno,
+ socket_created ?
+ "Failed to find an available port" :
+ "Failed to create a socket");
+listen_failed:
+saved_errno = errno;
+if (sliste

[Qemu-devel] [PATCH v7 3/4] sockets: factor out create_fast_reuse_socket

2017-08-07 Thread Knut Omang

Another refactoring step to prepare for fixing the problem
exposed with the test-listen test in the previous commit.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Daniel P. Berrange <berra...@redhat.com>
---
 util/qemu-sockets.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index b4a2f24..d0d2047 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,16 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+return -1;
+}
+socket_set_fast_reuse(slisten);
+return slisten;
+}
+
 static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
 {
 #ifndef IPV6_V6ONLY
@@ -253,7 +263,8 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e);
 if (slisten < 0) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to create socket");
@@ -261,8 +272,6 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 continue;
 }
 
-socket_set_fast_reuse(slisten);
-
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-- 
git-series 0.9.1
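
As a rough standalone illustration of the helper factored out here, the sketch
below creates a socket from an addrinfo entry and enables address reuse before
returning it. It assumes that fast reuse amounts to SO_REUSEADDR on POSIX hosts
and uses plain socket() rather than qemu_socket(). create_reuse_socket() is a
hypothetical name, not the QEMU function.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for struct addrinfo under strict -std= modes */
#endif
#include <assert.h>
#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a socket for the given addrinfo entry with SO_REUSEADDR set.
 * Returns the descriptor, or -1 on failure with errno set.
 * Unlike the helper in the patch, this sketch treats a failing
 * setsockopt() as fatal and closes the socket again. */
static int create_reuse_socket(const struct addrinfo *e)
{
    assert(e != NULL);
    int fd = socket(e->ai_family, e->ai_socktype, e->ai_protocol);
    if (fd < 0) {
        return -1;
    }
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Keeping creation and option setting in one place matters for the later fix,
where the listening socket has to be recreated in the middle of the port loop.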

[Qemu-devel] [PATCH v7 0/4] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-08-07 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.

Changes from v6:
* Changed license of the (new) test-listen.c source file from
  GNU v2 to GNU v2 and later, according to QEMU standards.

Changes from v5:
* Also move setting of error from socket creation out
  into the main double for loop.
* Simplify and enhance reporting if socket creation fails:
  Only report failure to create sockets if none of the
  provided addrinfo list elements allows creation of a
  socket, otherwise report failure to find an available port.
* Further simplify if's within the for loops and make sure
  failure to recreate a socket gets distinctively reported.
* Rebased to current master.

Changes from v4:
* Move the complexity of recreating a socket and setting the error pointer
  into the main for loop, eliminating the try_bind_listen() function
  again. Cleaning up and improving error handling in the process.

Changes from v3:
* Test changes: Add missing license
  Add subtests for ipv4, ipv6 and both
  Various g_* usage improvements
* Split patch into 3 patches with two refactoring patches ahead
  of the actual fix.

Changes from v2:
* Non-trivial rebase + further abstraction
  on top of 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
  'tests: add functional test validating ipv4/ipv6 address flag handling'

Changes from v1:
* Fix potential uninitialized variable only detected in optimized builds.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (4):
  tests: Add test-listen - a stress test for QEMU socket listen
  sockets: factor out a new try_bind() function
  sockets: factor out create_fast_reuse_socket
  sockets: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 util/qemu-sockets.c| 138 +++
 3 files changed, 345 insertions(+), 48 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: a588c4985eff363154d65aee8607d0a4601655f7
-- 
git-series 0.9.1
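
The recovery strategy described in this series can be sketched outside QEMU
roughly as follows. This is a simplified IPv4-only illustration under stated
assumptions, not the patch itself: listen_in_range() and new_reuse_socket() are
hypothetical names, the address is fixed to loopback, and error reporting is
reduced to errno.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP socket with SO_REUSEADDR requested. */
static int new_reuse_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0) {
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    }
    return fd;
}

/* Try ports port_min..port_max on 127.0.0.1 until bind() and listen()
 * both succeed. Returns the listening descriptor and stores the port
 * in *got_port, or returns -1 with errno set. */
static int listen_in_range(int port_min, int port_max, int *got_port)
{
    assert(got_port != NULL);
    assert(port_min > 0 && port_min <= port_max && port_max <= 65535);

    int fd = new_reuse_socket();
    if (fd < 0) {
        return -1;
    }
    for (int p = port_min; p <= port_max; p++) {
        struct sockaddr_in sa;
        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = htons((in_port_t)p);

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            if (errno == EADDRINUSE) {
                continue;           /* port taken, try the next one */
            }
            break;                  /* unexpected bind error */
        }
        if (listen(fd, 1) == 0) {
            *got_port = p;
            return fd;
        }
        if (errno != EADDRINUSE) {
            break;                  /* unexpected listen error */
        }
        /* Someone else bound the same port and won the race to listen.
         * The bound socket cannot be bound again, so recreate it
         * before trying the next port. */
        close(fd);
        fd = new_reuse_socket();
        if (fd < 0) {
            return -1;
        }
    }
    int saved_errno = errno;        /* close() may overwrite errno */
    close(fd);
    errno = saved_errno;
    return -1;
}
```

Calling listen_in_range() twice with the same range should hand out two
different ports, since the second caller sees EADDRINUSE on the first port and
moves on to the next one.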

Re: [Qemu-devel] [PATCH v6 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-08-07 Thread Knut Omang

On Mon, 2017-08-07 at 11:27 +0100, Daniel P. Berrange wrote:
> On Mon, Aug 07, 2017 at 11:21:25AM +0200, Knut Omang wrote:
> > On Mon, 2017-08-07 at 09:45 +0100, Daniel P. Berrange wrote:
> > > On Sat, Jul 29, 2017 at 11:18:15PM +0200, Knut Omang wrote:
> > > > There's a potential race condition between multiple bind()'s
> > > > attempting to bind to the same port, which occasionally
> > > > allows more than one bind to succeed against the same port.
> > > > 
> > > > When a subsequent listen() call is made with the same socket
> > > > only one will succeed.
> > > > 
> > > > The current QEMU code does however not take this situation into account
> > > > and the listen will cause the code to break out and fail even
> > > > when there are actually available ports to use.
> > > > 
> > > > This test exposes two subtests:
> > > > 
> > > > /socket/listen-serial
> > > > /socket/listen-compete
> > > > 
> > > > The "compete" subtest creates a number of threads and have them all trying to bind
> > > > to the same port with a large enough offset input to
> > > > allow all threads to get it's own port.
> > > > The "serial" subtest just does the same, except in series in a
> > > > single thread.
> > > > 
> > > > The serial version passes, probably in most versions of QEMU.
> > > > 
> > > > The parallel version exposes the problem in a relatively reliable way,
> > > > eg. it fails a majority of times, but not with a 100% rate, occasional
> > > > passes can be seen. Nevertheless this is quite good given that
> > > > the bug was tricky to reproduce and has been left undetected for
> > > > a while.
> > > > 
> > > > The problem seems to be present in all versions of QEMU.
> > > > 
> > > > The original failure scenario occurred with VNC port allocation
> > > > in a traditional Xen based build, in different code
> > > > but with similar functionality.
> > > > 
> > > > Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > > > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > > > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > > > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > > > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > > > ---
> > > >  tests/Makefile.include |   2 +-
> > > >  tests/test-listen.c| 253 ++-
> > > >  2 files changed, 255 insertions(+)
> > > >  create mode 100644 tests/test-listen.c
> > > > 
> > > > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > > > index 7af278d..b37c0c8 100644
> > > > --- a/tests/Makefile.include
> > > > +++ b/tests/Makefile.include
> > > > @@ -128,6 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> > > >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> > > >  check-unit-y += tests/test-uuid$(EXESUF)
> > > >  check-unit-y += tests/ptimer-test$(EXESUF)
> > > > +#check-unit-y += tests/test-listen$(EXESUF)
> > > >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> > > >  check-unit-y += tests/test-qapi-util$(EXESUF)
> > > >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > > > @@ -769,6 +770,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
> > > >  tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
> > > >  tests/numa-test$(EXESUF): tests/numa-test.o
> > > >  tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
> > > > +tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
> > > >  
> > > >  tests/migration/stress$(EXESUF): tests/migration/stress.o
> > > >     $(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
> > > > diff --git a/tests/test-listen.c b/tests/test-listen.c
> > > > new file mode 100644
> > > > index 000..5c07537
> > > > --- /dev/null
> > > > +++ b/tests/test-listen.c
> > > > @@ -0,0 +1,253 @@
> > > > +/*
> > > > + * Copyr

Re: [Qemu-devel] [PATCH v6 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-08-07 Thread Knut Omang

On Mon, 2017-08-07 at 09:45 +0100, Daniel P. Berrange wrote:
> On Sat, Jul 29, 2017 at 11:18:15PM +0200, Knut Omang wrote:
> > There's a potential race condition between multiple bind()'s
> > attempting to bind to the same port, which occasionally
> > allows more than one bind to succeed against the same port.
> > 
> > When a subsequent listen() call is made with the same socket
> > only one will succeed.
> > 
> > The current QEMU code does however not take this situation into account
> > and the listen will cause the code to break out and fail even
> > when there are actually available ports to use.
> > 
> > This test exposes two subtests:
> > 
> > /socket/listen-serial
> > /socket/listen-compete
> > 
> > The "compete" subtest creates a number of threads and have them all trying to bind
> > to the same port with a large enough offset input to
> > allow all threads to get it's own port.
> > The "serial" subtest just does the same, except in series in a
> > single thread.
> > 
> > The serial version passes, probably in most versions of QEMU.
> > 
> > The parallel version exposes the problem in a relatively reliable way,
> > eg. it fails a majority of times, but not with a 100% rate, occasional
> > passes can be seen. Nevertheless this is quite good given that
> > the bug was tricky to reproduce and has been left undetected for
> > a while.
> > 
> > The problem seems to be present in all versions of QEMU.
> > 
> > The original failure scenario occurred with VNC port allocation
> > in a traditional Xen based build, in different code
> > but with similar functionality.
> > 
> > Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |   2 +-
> >  tests/test-listen.c| 253 ++-
> >  2 files changed, 255 insertions(+)
> >  create mode 100644 tests/test-listen.c
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 7af278d..b37c0c8 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -128,6 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > +#check-unit-y += tests/test-listen$(EXESUF)
> >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> >  check-unit-y += tests/test-qapi-util$(EXESUF)
> >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > @@ -769,6 +770,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
> >  tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
> >  tests/numa-test$(EXESUF): tests/numa-test.o
> >  tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
> > +tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
> >  
> >  tests/migration/stress$(EXESUF): tests/migration/stress.o
> >     $(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
> > diff --git a/tests/test-listen.c b/tests/test-listen.c
> > new file mode 100644
> > index 000..5c07537
> > --- /dev/null
> > +++ b/tests/test-listen.c
> > @@ -0,0 +1,253 @@
> > +/*
> > + * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
> > + *Author: Knut Omang <knut.om...@oracle.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2
> > + * as published by the Free Software Foundation.
> 
> Can you change that to "version 2 or later" - per the LICENSE file, we don't
> accept contributions under "version 2 only" except for 4 specific subdirs:
> 
> 
>   "As of July 2013, contributions under version 2 of the GNU General Public
>    License (and no later version) are only accepted for the following files
>    or directories: bsd-user/, linux-user/, hw/vfio/, hw/xen/xen_pt*."

Oh, sorry - I wasn't aware of this, +"...or later" is fine with me.
Would you like me to send a v7 of the set with only that change, or can you
amend it as part of the merge?

Thanks,
Knut

> 
> > + *
> > + * Test parallel port listen configuration with
> > + * dynamic port allocation
> > + */
> 
> 
> Regards,
> Daniel

[Qemu-devel] [PATCH v6 2/4] sockets: factor out a new try_bind() function

2017-07-29 Thread Knut Omang

A refactoring step to prepare for the problem
exposed by the test-listen test in the previous commit.

Simplify and reorganize the IPv6 specific extra
measures and move it out of the for loop to increase
code readability. No semantic changes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Daniel P. Berrange <berra...@redhat.com>
---
 util/qemu-sockets.c | 69 ++
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 1358c81..b4a2f24 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,44 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
+{
+#ifndef IPV6_V6ONLY
+return bind(socket, e->ai_addr, e->ai_addrlen);
+#else
+/*
+ * Deals with first & last cases in matrix in comment
+ * for inet_ai_family_from_address().
+ */
+int v6only =
+((!saddr->has_ipv4 && !saddr->has_ipv6) ||
+ (saddr->has_ipv4 && saddr->ipv4 &&
+  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
+int stat;
+
+ rebind:
+if (e->ai_family == PF_INET6) {
+qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
+sizeof(v6only));
+}
+
+stat = bind(socket, e->ai_addr, e->ai_addrlen);
+if (!stat) {
+return 0;
+}
+
+/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
+ * it could be that the IPv4 port is already claimed, so retry
+ * with v6only set
+ */
+if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
+v6only = 1;
+goto rebind;
+}
+return stat;
+#endif
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -228,39 +266,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-#ifdef IPV6_V6ONLY
-/*
- * Deals with first & last cases in matrix in comment
- * for inet_ai_family_from_address().
- */
-int v6only =
-((!saddr->has_ipv4 && !saddr->has_ipv6) ||
- (saddr->has_ipv4 && saddr->ipv4 &&
-  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
-#endif
 inet_setport(e, p);
-#ifdef IPV6_V6ONLY
-rebind:
-if (e->ai_family == PF_INET6) {
-qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
-sizeof(v6only));
-}
-#endif
-if (bind(slisten, e->ai_addr, e->ai_addrlen) == 0) {
+if (try_bind(slisten, saddr, e) >= 0) {
 goto listen;
 }
-
-#ifdef IPV6_V6ONLY
-/* If we got EADDRINUSE from an IPv6 bind & V6ONLY is unset,
- * it could be that the IPv4 port is already claimed, so retry
- * with V6ONLY set
- */
-if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
-v6only = 1;
-goto rebind;
-}
-#endif
-
 if (p == port_max) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to bind socket");
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v6 3/4] sockets: factor out create_fast_reuse_socket

2017-07-29 Thread Knut Omang

Another refactoring step to prepare for fixing the problem
exposed with the test-listen test in the previous commit

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 util/qemu-sockets.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index b4a2f24..d0d2047 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,16 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+return -1;
+}
+socket_set_fast_reuse(slisten);
+return slisten;
+}
+
 static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
 {
 #ifndef IPV6_V6ONLY
@@ -253,7 +263,8 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e);
 if (slisten < 0) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to create socket");
@@ -261,8 +272,6 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 continue;
 }
 
-socket_set_fast_reuse(slisten);
-
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v6 4/4] sockets: Handle race condition between binds to the same port

2017-07-29 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen(), but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.

Also clean up some issues with error handling to allow more
accurate reporting of the cause of an error.

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 tests/Makefile.include |  2 +-
 util/qemu-sockets.c| 58 ++-
 2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index b37c0c8..9d2131d 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,7 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index d0d2047..6a511fb 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -206,7 +206,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, port_max, p;
+int slisten = 0;
+int saved_errno = 0;
+bool socket_created = false;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -258,7 +261,7 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
@@ -266,37 +269,58 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 
 slisten = create_fast_reuse_socket(e);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}
 continue;
 }
 
+socket_created = true;
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
 inet_setport(e, p);
-if (try_bind(slisten, saddr, e) >= 0) {
-goto listen;
-}
-if (p == port_max) {
-if (!e->ai_next) {
+rc = try_bind(slisten, saddr, e);
+if (rc) {
+if (errno == EADDRINUSE) {
+continue;
+} else {
 error_setg_errno(errp, errno, "Failed to bind socket");
+goto listen_failed;
 }
 }
+if (!listen(slisten, 1)) {
+goto listen_ok;
+}
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to listen on socket");
+goto listen_failed;
+}
+/* Someone else managed to bind to the same port and beat us
+ * to listen on it! Socket semantics does not allow us to
+ * recover from this situation, so we need to recreate the
+ * socket to allow bind attempts for subsequent ports:
+ */
+closesocket(slisten);
+slisten = create_fast_reuse_socket(e);
+if (slisten < 0) {
+error_setg_errno(errp, errno,
+ "Failed to recreate failed listening socket");
+goto listen_failed;
+}
 }
+}
+error_setg_errno(errp, errno,
+ socket_created ?
+ "Failed to find an available port" :
+ "Failed to create a socket");
+listen_failed:
+saved_errno = errno;
+if (slisten >= 0) {
 closesocket(slist

[Qemu-devel] [PATCH v6 0/4] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-07-29 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.
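
To illustrate the second point, here is a minimal standalone sketch (not the
code in this series) of a bind/listen loop that tolerates losing the race.
It assumes plain POSIX sockets on the IPv4 loopback address. The names
make_reuse_socket() and listen_in_range() are hypothetical, and SO_REUSEADDR
stands in for what socket_set_fast_reuse() is assumed to set:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: create a TCP socket with SO_REUSEADDR set. */
static int make_reuse_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;

    if (fd < 0) {
        return -1;
    }
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    return fd;
}

/*
 * Hypothetical helper: try each port in [port_min, port_max] on loopback.
 * Returns a listening fd and stores the port in *got_port, or returns -1.
 */
static int listen_in_range(int port_min, int port_max, int *got_port)
{
    struct sockaddr_in sa;
    int fd, p;

    assert(port_min <= port_max);
    fd = make_reuse_socket();
    if (fd < 0) {
        return -1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (p = port_min; p <= port_max; p++) {
        sa.sin_port = htons(p);
        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            if (errno == EADDRINUSE) {
                continue;           /* port taken, try the next one */
            }
            break;                  /* unexpected bind error */
        }
        if (listen(fd, 1) == 0) {
            *got_port = p;
            return fd;
        }
        if (errno != EADDRINUSE) {
            break;                  /* unexpected listen error */
        }
        /* Lost the race after a successful bind: the socket is already
         * bound and cannot be bound again, so recreate it before
         * trying the next port. */
        close(fd);
        fd = make_reuse_socket();
        if (fd < 0) {
            return -1;
        }
    }
    close(fd);
    return -1;
}
```

Calling listen_in_range() twice with the same range should hand out two
different ports, which is the behavior the test in patch 1 checks for.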

Changes from v5:
* Also move setting of error from socket creation out
  into the main double for loop.
* Simplify and enhance reporting if socket creation fails:
  Only report failure to create sockets if none of the
  provided addrinfo list elements allows creation of a
  socket, otherwise report failure to find an available port.
* Further simplify if's within the for loops and make sure
  failure to recreate a socket gets distinctively reported.
* Rebased to current master.

Changes from v4:
* Move the complexity of recreating a socket and setting the error pointer
  into the main for loop, eliminating the try_bind_listen() function
  again. Cleaning up and improving error handling in the process.

Changes from v3:
* Test changes: Add missing license
  Add subtests for ipv4, ipv6 and both
  Various g_* usage improvements
* Split patch into 3 patches with two refactoring patches ahead
  of the actual fix.

Changes from v2:
* Non-trivial rebase + further abstraction
  on top of 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
  'tests: add functional test validating ipv4/ipv6 address flag handling'

Changes from v1:
* Fix potential uninitialized variable only detected by optimize.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (4):
  tests: Add test-listen - a stress test for QEMU socket listen
  sockets: factor out a new try_bind() function
  sockets: factor out create_fast_reuse_socket
  sockets: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 util/qemu-sockets.c| 138 +++
 3 files changed, 345 insertions(+), 48 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: a588c4985eff363154d65aee8607d0a4601655f7
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v6 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-07-29 Thread Knut Omang

There's a potential race condition between multiple bind()'s
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When a subsequent listen() call is made with the same socket
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to
bind to the same port, with a large enough offset input to
allow each thread to get its own port.
The "serial" subtest just does the same, except in series in a
single thread.

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of the time, though not at a 100% rate; occasional
passes can be seen. Nevertheless this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 2 files changed, 255 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7af278d..b37c0c8 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,6 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -769,6 +770,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
 tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..5c07537
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,253 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *Author: Knut Omang <knut.om...@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+bool ipv4;
+bool ipv6;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+char hostname[NAME_LEN + 1];
+char port[PORT_LEN + 1];
+};
+
+
+/* These two functions taken from test-io-channel-socket.c */
+static int check_bind(const char *hostname, bool *has_proto)
+{
+int fd = -1;
+struct addrinfo ai, *res = NULL;
+int rc;
+int ret = -1;
+
+memset(&ai, 0, sizeof(ai));
+ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
+ai.ai_family = AF_UNSPEC;
+ai.ai_socktype = SOCK_STREAM;
+
+/* lookup */
+rc = getaddrinfo(hostname, NULL, &ai, &res);
+if (rc != 0) {
+if (rc == EAI_ADDRFAMILY ||
+rc == EAI_FAMILY) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+fd = qemu_socket(res->ai_family, res->ai_socktype, res->ai_protocol);
+if (fd < 0) {
+goto cleanup;
+}
+
+if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) {
+if (errno == EADDRNOTAVAIL) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+*has_proto = true;
+ done:

Re: [Qemu-devel] [PATCH v5 2/4] sockets: factor out create_fast_reuse_socket

2017-07-29 Thread Knut Omang

On Tue, 2017-07-25 at 10:38 +0100, Daniel P. Berrange wrote:
> On Sat, Jul 22, 2017 at 09:49:31AM +0200, Knut Omang wrote:
> > First refactoring step to prepare for fixing the problem
> > exposed with the test-listen test in the previous commit
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > ---
> >  util/qemu-sockets.c | 24 +---
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 1358c81..578f25b 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -149,6 +149,20 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
> >  return PF_UNSPEC;
> >  }
> >  
> > +static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
> > +{
> > +int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
> > +if (slisten < 0) {
> > +if (!e->ai_next) {
> > +error_setg_errno(errp, errno, "Failed to create socket");
> > +}
> > +return -1;
> > +}
> > +
> > +socket_set_fast_reuse(slisten);
> > +return slisten;
> > +}
> 
> As mentioned in the previous review, I don't think we should have
> methods like this which sometimes report an error on failure, and
> sometimes don't report an error. It makes it hard to review the
> callers for correctness of error handling. 

Sorry, somehow this one slipped through..

> 
> > +
> >  static int inet_listen_saddr(InetSocketAddress *saddr,
> >   int port_offset,
> >   bool update_addr,
> > @@ -210,21 +224,17 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
> >  return -1;
> >  }
> >  
> > -/* create socket + bind */
> > +/* create socket + bind/listen */
> >  for (e = res; e != NULL; e = e->ai_next) {
> >  getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
> >     uaddr,INET6_ADDRSTRLEN,uport,32,
> >     NI_NUMERICHOST | NI_NUMERICSERV);
> > -slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
> > +
> > +slisten = create_fast_reuse_socket(e, &err);
> >  if (slisten < 0) {
> > -if (!e->ai_next) {
> > -error_setg_errno(errp, errno, "Failed to create socket");
> > -}
> 
> So please leave this here.

Since we are already outside the scope of the original patch,
let me suggest:

As far as I can see, there is no good reason to behave differently
when socket creation fails on the last 'struct addrinfo' in the list
than when it fails on an intermediate addrinfo and the next
addrinfo in the list just happens to have no free ports.

As far as I can see, the 'Failed to create socket' message is
most useful to the user if creating a socket fails on all
elements in the list, as it helps detect issues with the network setup.

I have implemented these semantics in v6. It should be an enhancement that
also simplifies the code inside the for loops: it gets rid
of the conditional error setting and avoids code duplication from
handling the same error twice in close proximity.
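
A minimal sketch of that reporting rule, written as standalone C rather than
as the actual patch. The struct, enum and function names are made up for
illustration; only the two message strings are taken from v6:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-addrinfo outcome, for illustration only. */
struct addr_outcome {
    bool socket_ok;   /* socket creation succeeded for this element */
    bool port_found;  /* some port in the range could be bound + listened on */
};

enum listen_result { LISTEN_OK, LISTEN_NO_PORT, LISTEN_NO_SOCKET };

/* Walk the list the way the outer for loop does and decide what to report. */
static enum listen_result classify_listen(const struct addr_outcome *r,
                                          size_t n)
{
    bool socket_created = false;
    size_t i;

    assert(r != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!r[i].socket_ok) {
            continue;               /* no error set here, just move on */
        }
        socket_created = true;
        if (r[i].port_found) {
            return LISTEN_OK;
        }
    }
    /* Only blame socket creation if it failed for every element. */
    return socket_created ? LISTEN_NO_PORT : LISTEN_NO_SOCKET;
}

/* Map the result to the message reported to the user (NULL on success). */
static const char *listen_result_message(enum listen_result res)
{
    switch (res) {
    case LISTEN_NO_PORT:
        return "Failed to find an available port";
    case LISTEN_NO_SOCKET:
        return "Failed to create a socket";
    default:
        return NULL;
    }
}
```

With this rule, a list where only some elements fail socket creation reports
the port problem, since that is the more likely cause of the overall failure.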

Thanks,
Knut

> 
> >  continue;
> >  }
> >  
> > -socket_set_fast_reuse(slisten);
> > -
> >  port_min = inet_getport(e);
> >  port_max = saddr->has_to ? saddr->to + port_offset : port_min;
> >  for (p = port_min; p <= port_max; p++) {
> 
> 
> Regards,
> Daniel

Re: [Qemu-devel] [PATCH v5 4/4] sockets: Handle race condition between binds to the same port

2017-07-29 Thread Knut Omang

On Tue, 2017-07-25 at 10:54 +0100, Daniel P. Berrange wrote:
> On Sat, Jul 22, 2017 at 09:49:33AM +0200, Knut Omang wrote:
> > If an offset of ports is specified to the inet_listen_saddr() function,
> > and two or more processes try to bind from these ports at the same time,
> > occasionally more than one process may be able to bind to the same
> > port. The condition is detected by listen() but too late to avoid a failure.
> > 
> > This function is called by socket_listen() and used
> > by all socket listening code in QEMU, so all cases where any form of dynamic
> > port selection is used should be subject to this issue.
> > 
> > Add code to close and re-establish the socket when this
> > condition is observed, hiding the race condition from the user.
> > 
> > Also clean up some issues with error handling to allow more
> > accurate reporting of the cause of an error.
> > 
> > This has been developed and tested by means of the
> > test-listen unit test in the previous commit.
> > Enable the test for make check now that it passes.
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > ---
> >  tests/Makefile.include |  2 +-
> >  util/qemu-sockets.c| 51 ---
> >  2 files changed, 39 insertions(+), 14 deletions(-)
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index b37c0c8..9d2131d 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -128,7 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > -#check-unit-y += tests/test-listen$(EXESUF)
> > +check-unit-y += tests/test-listen$(EXESUF)
> >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> >  check-unit-y += tests/test-qapi-util$(EXESUF)
> >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 39f207c..8ca3ba6 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -210,7 +210,9 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
> >  char port[33];
> >  char uaddr[INET6_ADDRSTRLEN+1];
> >  char uport[33];
> > -int slisten, rc, port_min, port_max, p;
> > +int rc, port_min, port_max, p;
> > +int slisten = 0;
> > +int saved_errno = 0;
> >  Error *err = NULL;
> >  
> >  memset(&ai,0, sizeof(ai));
> > @@ -277,27 +279,50 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
> >  port_max = saddr->has_to ? saddr->to + port_offset : port_min;
> >  for (p = port_min; p <= port_max; p++) {
> >  inet_setport(e, p);
> > -if (try_bind(slisten, saddr, e) >= 0) {
> > -goto listen;
> > -}
> > -if (p == port_max) {
> > -if (!e->ai_next) {
> > +rc = try_bind(slisten, saddr, e);
> > +if (rc) {
> > +if (errno == EADDRINUSE) {
> 
> Add in  '&& e->ai_next' here, so we trigger the error handling
> on the else branch if this was the last address we had.

See my comment on patch 2 - with that reporting semantics, inability to
create a socket will only be reported if none of the addrinfo structs
allowed socket creation so this reporting case goes away.

> > +continue;
> > +} else {
> >  error_setg_errno(errp, errno, "Failed to bind socket");
> > +goto listen_failed;
> > +}
> > +}
> > +rc = listen(slisten, 1);
> > +if (!rc) {
> 
> Assigning to rc seems redundant here.

removed.

> > +goto listen_ok;
> > +} else if (errno == EADDRINUSE) {
> > +/* Someone else managed to bind to the same port and beat us
> > + * to listen on it! Socket semantics does not allow us to
> > + * recover from this situation, so we need to recreate the
> > + * socket to allow bind attempts for subsequent ports:
> > + */
> > +

[Qemu-devel] [PATCH v5 4/4] sockets: Handle race condition between binds to the same port

2017-07-22 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen() but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.

Also clean up some issues with error handling to allow more
accurate reporting of the cause of an error.

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 tests/Makefile.include |  2 +-
 util/qemu-sockets.c| 51 ---
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index b37c0c8..9d2131d 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,7 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 39f207c..8ca3ba6 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -210,7 +210,9 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, port_max, p;
+int slisten = 0;
+int saved_errno = 0;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -277,27 +279,50 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
 inet_setport(e, p);
-if (try_bind(slisten, saddr, e) >= 0) {
-goto listen;
-}
-if (p == port_max) {
-if (!e->ai_next) {
+rc = try_bind(slisten, saddr, e);
+if (rc) {
+if (errno == EADDRINUSE) {
+continue;
+} else {
 error_setg_errno(errp, errno, "Failed to bind socket");
+goto listen_failed;
+}
+}
+rc = listen(slisten, 1);
+if (!rc) {
+goto listen_ok;
+} else if (errno == EADDRINUSE) {
+/* Someone else managed to bind to the same port and beat us
+ * to listen on it! Socket semantics does not allow us to
+ * recover from this situation, so we need to recreate the
+ * socket to allow bind attempts for subsequent ports:
+ */
+closesocket(slisten);
+slisten = create_fast_reuse_socket(e, errp);
+if (slisten >= 0) {
+continue;
 }
 }
+error_setg_errno(errp, errno, "Failed to listen on socket");
+goto listen_failed;
 }
+}
+if (err) {
+error_propagate(errp, err);
+} else {
+error_setg_errno(errp, errno, "Failed to find an available port");
+}
+
+listen_failed:
+saved_errno = errno;
+if (slisten >= 0) {
 closesocket(slisten);
 }
 freeaddrinfo(res);
+errno = saved_errno;
 return -1;
 
-listen:
-if (listen(slisten,1) != 0) {
-error_setg_errno(errp, errno, "Failed to listen on socket");
-closesocket(slisten);
-freeaddrinfo(res);
-return -1;
-}
+listen_ok:
 if (update_addr) {
 g_free(saddr->host);
 saddr->host = g_strdup(uaddr);
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v5 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-07-22 Thread Knut Omang

There's a potential race condition between multiple bind()'s
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When a subsequent listen() call is made with the same socket
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to
bind to the same port, with a large enough offset input to
allow each thread to get its own port.
The "serial" subtest just does the same, except in series in a
single thread.

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of the time, though not at a 100% rate; occasional
passes can be seen. Nevertheless this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 2 files changed, 255 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7af278d..b37c0c8 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -128,6 +128,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -769,6 +770,7 @@ tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
 tests/vmgenid-test$(EXESUF): tests/vmgenid-test.o tests/boot-sector.o tests/acpi-utils.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..5c07537
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,253 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *Author: Knut Omang <knut.om...@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+bool ipv4;
+bool ipv6;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+char hostname[NAME_LEN + 1];
+char port[PORT_LEN + 1];
+};
+
+
+/* These two functions taken from test-io-channel-socket.c */
+static int check_bind(const char *hostname, bool *has_proto)
+{
+int fd = -1;
+struct addrinfo ai, *res = NULL;
+int rc;
+int ret = -1;
+
+memset(&ai, 0, sizeof(ai));
+ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
+ai.ai_family = AF_UNSPEC;
+ai.ai_socktype = SOCK_STREAM;
+
+/* lookup */
+rc = getaddrinfo(hostname, NULL, &ai, &res);
+if (rc != 0) {
+if (rc == EAI_ADDRFAMILY ||
+rc == EAI_FAMILY) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+fd = qemu_socket(res->ai_family, res->ai_socktype, res->ai_protocol);
+if (fd < 0) {
+goto cleanup;
+}
+
+if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) {
+if (errno == EADDRNOTAVAIL) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+*has_proto = true;
+ done:

[Qemu-devel] [PATCH v5 3/4] sockets: factor out a new try_bind() function

2017-07-22 Thread Knut Omang

Another refactoring step to prepare for fixing the problem
exposed by the test-listen test.

This time, simplify and reorganize the IPv6-specific extra
measures and move them out of the for loop to increase
code readability. No semantic changes.
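
For readers following along, the IPv6-specific decisions can be sketched as
two small pure functions. This is an illustration only, not part of the
patch: struct ip_flags is a hypothetical stand-in for the four
InetSocketAddress fields used here, and both function names are invented:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the InetSocketAddress flags consulted here. */
struct ip_flags {
    bool has_ipv4, ipv4;
    bool has_ipv6, ipv6;
};

/*
 * Initial IPV6_V6ONLY value: 0 (dual stack) only when neither flag is
 * given, or when both are given and true; 1 in every other case.
 */
static int initial_v6only(const struct ip_flags *f)
{
    assert(f != NULL);
    if (!f->has_ipv4 && !f->has_ipv6) {
        return 0;
    }
    if (f->has_ipv4 && f->ipv4 && f->has_ipv6 && f->ipv6) {
        return 0;
    }
    return 1;
}

/*
 * After a failed bind: retry once with v6only set if this was an IPv6
 * bind that hit EADDRINUSE while v6only was still unset, since the IPv4
 * port may be the one that is already claimed.
 */
static bool should_retry_v6only(bool is_inet6, int bind_errno, int v6only)
{
    return is_inet6 && bind_errno == EADDRINUSE && !v6only;
}
```

Keeping the two decisions separate mirrors the structure of the factored-out
function: compute the starting value once, then retry at most once per port.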

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 util/qemu-sockets.c | 69 ++
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 578f25b..39f207c 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -163,6 +163,44 @@ static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
 return slisten;
 }
 
+static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
+{
+#ifndef IPV6_V6ONLY
+return bind(socket, e->ai_addr, e->ai_addrlen);
+#else
+/*
+ * Deals with first & last cases in matrix in comment
+ * for inet_ai_family_from_address().
+ */
+int v6only =
+((!saddr->has_ipv4 && !saddr->has_ipv6) ||
+ (saddr->has_ipv4 && saddr->ipv4 &&
+  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
+int stat;
+
+ rebind:
+if (e->ai_family == PF_INET6) {
+qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
+sizeof(v6only));
+}
+
+stat = bind(socket, e->ai_addr, e->ai_addrlen);
+if (!stat) {
+return 0;
+}
+
+/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
+ * it could be that the IPv4 port is already claimed, so retry
+ * with v6only set
+ */
+if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
+v6only = 1;
+goto rebind;
+}
+return stat;
+#endif
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -238,39 +276,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-#ifdef IPV6_V6ONLY
-/*
- * Deals with first & last cases in matrix in comment
- * for inet_ai_family_from_address().
- */
-int v6only =
-((!saddr->has_ipv4 && !saddr->has_ipv6) ||
- (saddr->has_ipv4 && saddr->ipv4 &&
-  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
-#endif
 inet_setport(e, p);
-#ifdef IPV6_V6ONLY
-rebind:
-if (e->ai_family == PF_INET6) {
-qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
-sizeof(v6only));
-}
-#endif
-if (bind(slisten, e->ai_addr, e->ai_addrlen) == 0) {
+if (try_bind(slisten, saddr, e) >= 0) {
 goto listen;
 }
-
-#ifdef IPV6_V6ONLY
-/* If we got EADDRINUSE from an IPv6 bind & V6ONLY is unset,
- * it could be that the IPv4 port is already claimed, so retry
- * with V6ONLY set
- */
-if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
-v6only = 1;
-goto rebind;
-}
-#endif
-
 if (p == port_max) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to bind socket");
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v5 2/4] sockets: factor out create_fast_reuse_socket

2017-07-22 Thread Knut Omang

First refactoring step to prepare for fixing the problem
exposed by the test-listen test in the previous commit.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 util/qemu-sockets.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 1358c81..578f25b 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,20 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+if (!e->ai_next) {
+error_setg_errno(errp, errno, "Failed to create socket");
+}
+return -1;
+}
+
+socket_set_fast_reuse(slisten);
+return slisten;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -210,21 +224,17 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e, &err);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}
 continue;
 }
 
-socket_set_fast_reuse(slisten);
-
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v5 0/4] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-07-22 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.

Changes from v4:
* Move the complexity of recreating a socket and setting the error pointer
  into the main for loop, eliminating the try_bind_listen() function
  again. Cleaning up and improving error handling in the process.

Changes from v3:
* Test changes: Add missing license
  Add subtests for ipv4, ipv6 and both
  Various g_* usage improvements
* Split patch into 3 patches with two refactoring patches ahead
  of the actual fix.

Changes from v2:
* Non-trivial rebase + further abstraction
  on top of 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
  'tests: add functional test validating ipv4/ipv6 address flag handling'

Changes from v1:
* Fix potential uninitialized variable only detected by optimize.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (4):
  tests: Add test-listen - a stress test for QEMU socket listen
  sockets: factor out create_fast_reuse_socket
  sockets: factor out a new try_bind() function
  sockets: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 util/qemu-sockets.c| 142 +++-
 3 files changed, 348 insertions(+), 49 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: 91939262ffcd3c85ea6a4793d3029326eea1d649
-- 
git-series 0.9.1

Re: [Qemu-devel] [PATCH v4 4/4] sockets: Handle race condition between binds to the same port

2017-07-02 Thread Knut Omang

On Mon, 2017-06-26 at 13:49 +0100, Daniel P. Berrange wrote:
> On Mon, Jun 26, 2017 at 02:32:48PM +0200, Knut Omang wrote:
> > 
> > On Mon, 2017-06-26 at 11:22 +0100, Daniel P. Berrange wrote:
> > > 
> > > On Fri, Jun 23, 2017 at 12:31:08PM +0200, Knut Omang wrote:
> > > > 
> > > > If an offset of ports is specified to the inet_listen_saddr() function,
> > > > and two or more processes try to bind from these ports at the same time,
> > > > occasionally more than one process may be able to bind to the same
> > > > port. The condition is detected by listen() but too late to avoid a failure.
> > > >  
> > > > This function is called by socket_listen() and used
> > > > by all socket listening code in QEMU, so all cases where any form of dynamic
> > > > port selection is used should be subject to this issue.
> > > >  
> > > > Add code to close and re-establish the socket when this
> > > > condition is observed, hiding the race condition from the user.
> > > >  
> > > > This has been developed and tested by means of the
> > > > test-listen unit test in the previous commit.
> > > > Enable the test for make check now that it passes.
> > > >  
> > > > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > > > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > > > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > > > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > > > ---
> > > >   tests/Makefile.include |  2 +-
> > > >   util/qemu-sockets.c| 68 
> > > > ---
> > > >   2 files changed, 53 insertions(+), 17 deletions(-)
> > > >  
> > > > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > > > index 22bb97e..c38f94e 100644
> > > > --- a/tests/Makefile.include
> > > > +++ b/tests/Makefile.include
> > > > @@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> > > >   gcov-files-check-bufferiszero-y = util/bufferiszero.c
> > > >   check-unit-y += tests/test-uuid$(EXESUF)
> > > >   check-unit-y += tests/ptimer-test$(EXESUF)
> > > > -#check-unit-y += tests/test-listen$(EXESUF)
> > > > +check-unit-y += tests/test-listen$(EXESUF)
> > > >   gcov-files-ptimer-test-y = hw/core/ptimer.c
> > > >   check-unit-y += tests/test-qapi-util$(EXESUF)
> > > >   gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > > > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > > > index 48b9319..7b118b4 100644
> > > > --- a/util/qemu-sockets.c
> > > > +++ b/util/qemu-sockets.c
> > > > @@ -201,6 +201,42 @@ static int try_bind(int socket, InetSocketAddress
> > > > *saddr, struct
> > > addrinfo *e)
> > > > 
> > > >   #endif
> > > >   }
> > > >   
> > > > +static int try_bind_listen(int *socket, InetSocketAddress *saddr,
> > > > +   struct addrinfo *e, int port, Error **errp)
> > > > +{
> > > > +int s = *socket;
> > > > +int ret;
> > > > +
> > > > +inet_setport(e, port);
> > > > +ret = try_bind(s, saddr, e);
> > > > +if (ret) {
> > > > +if (errno != EADDRINUSE) {
> > > > +error_setg_errno(errp, errno, "Failed to bind socket");
> > > > +}
> > > > +return errno;
> > > > +}
> > > > +if (listen(s, 1) == 0) {
> > > > +return 0;
> > > > +}
> > > > +if (errno == EADDRINUSE) {
> > > > +/* We got to bind the socket to a port but someone else managed
> > > > + * to bind to the same port and beat us to listen on it!
> > > > + * Recreate the socket and return EADDRINUSE to preserve the
> > > > + * expected state by the caller:
> > > > + */
> > > > +closesocket(s);
> > > > +s = create_fast_reuse_socket(e, errp);
> > > > +if (s < 0) {
> > > > +return errno;
> > > > +}
> > > > +*socket = s;
> > > 
> > > I don't really like this at all - if we need to close + recreate the

Re: [Qemu-devel] [PATCH v4 4/4] sockets: Handle race condition between binds to the same port

2017-07-02 Thread Knut Omang

On Mon, 2017-06-26 at 11:34 +0100, Daniel P. Berrange wrote:
> On Fri, Jun 23, 2017 at 12:31:08PM +0200, Knut Omang wrote:
> > 
> > If an offset of ports is specified to the inet_listen_saddr function(),
> > and two or more processes tries to bind from these ports at the same time,
> > occasionally more than one process may be able to bind to the same
> > port. The condition is detected by listen() but too late to avoid a failure.
> > 
> > This function is called by socket_listen() and used
> > by all socket listening code in QEMU, so all cases where any form of dynamic
> > port selection is used should be subject to this issue.
> > 
> > Add code to close and re-establish the socket when this
> > condition is observed, hiding the race condition from the user.
> > 
> > This has been developed and tested by means of the
> > test-listen unit test in the previous commit.
> > Enable the test for make check now that it passes.
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |  2 +-
> >  util/qemu-sockets.c| 68 ---
> >  2 files changed, 53 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 22bb97e..c38f94e 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > -#check-unit-y += tests/test-listen$(EXESUF)
> > +check-unit-y += tests/test-listen$(EXESUF)
> >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> >  check-unit-y += tests/test-qapi-util$(EXESUF)
> >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 48b9319..7b118b4 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -201,6 +201,42 @@ static int try_bind(int socket, InetSocketAddress
> > *saddr, struct addrinfo *e)
> >  #endif
> >  }
> >  
> > +static int try_bind_listen(int *socket, InetSocketAddress *saddr,
> > +   struct addrinfo *e, int port, Error **errp)
> > +{
> > +int s = *socket;
> > +int ret;
> > +
> > +inet_setport(e, port);
> > +ret = try_bind(s, saddr, e);
> > +if (ret) {
> > +if (errno != EADDRINUSE) {
> > +error_setg_errno(errp, errno, "Failed to bind socket");
> > +}
> > +return errno;
> > +}
> > +if (listen(s, 1) == 0) {
> > +return 0;
> > +}
> > +if (errno == EADDRINUSE) {
> > +/* We got to bind the socket to a port but someone else managed
> > + * to bind to the same port and beat us to listen on it!
> > + * Recreate the socket and return EADDRINUSE to preserve the
> > + * expected state by the caller:
> > + */
> > +closesocket(s);
> > +s = create_fast_reuse_socket(e, errp);
> 
> This usage scenario for create_fast_reuse_socket() makes its error
> reporting behaviour even more wrong. Recall that create_fast_reuse_socket
> is reporting an error if e->ai_next is NULL, which is a way of determining
> this is the last call to create_fast_reuse_socket in the loop. That
> assumption is violated though now that we're calling the method from
> inside the inner loop. Even when e->ai_next is NULL, we may be calling
> create_fast_reuse_socket many many times due to the port  'to' range.

I agree that the error reporting should go out of create_fast_reuse_socket().
Note however that this code will only be called when the race condition occurs,
which I think is very unlikely to happen more than once for each call to
inet_listen_saddr (except in my test of course..)

> 
> > 
> > +if (s < 0) {
> > +return errno;
> > +}
> > +*socket = s;
> > +errno = EADDRINUSE;
> > +return errno;
> > +}
> > +error_setg_errno(errp, errno, "Failed to listen on socket");
> > +return errno;
> > +}
> 
> This method is both preserving the global errno, and returning the
> global errno. The caller expects glo

Re: [Qemu-devel] [PATCH v4 2/4] sockets: factor out create_fast_reuse_socket

2017-07-02 Thread Knut Omang

On Mon, 2017-06-26 at 11:28 +0100, Daniel P. Berrange wrote:
> On Fri, Jun 23, 2017 at 12:31:06PM +0200, Knut Omang wrote:
> > 
> > First refactoring step to prepare for fixing the problem
> > exposed with the test-listen test in the previous commit
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > ---
> >  util/qemu-sockets.c | 24 +---
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 852773d..699e36c 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -149,6 +149,20 @@ int inet_ai_family_from_address(InetSocketAddress
> > *addr,
> >  return PF_UNSPEC;
> >  }
> >  
> > +static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
> > +{
> > +int slisten = qemu_socket(e->ai_family, e->ai_socktype, e-
> > >ai_protocol);
> > +if (slisten < 0) {
> > +if (!e->ai_next) {
> > +error_setg_errno(errp, errno, "Failed to create socket");
> > +}
> 
> I think that having this method sometimes report an error message, and
> sometimes not report an error message, depending on state of a variable
> used by the caller is rather unpleasant. I'd much rather see this
> error message reporting remain in the caller.
>
> > 
> > +return -1;
> > +}
> > +
> > +socket_set_fast_reuse(slisten);
> > +return slisten;
> > +}
> > +
> >  static int inet_listen_saddr(InetSocketAddress *saddr,
> >   int port_offset,
> >   bool update_addr,
> > @@ -210,21 +224,17 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
> >  return -1;
> >  }
> >  
> > -/* create socket + bind */
> > +/* create socket + bind/listen */
> >  for (e = res; e != NULL; e = e->ai_next) {
> >  getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
> >     uaddr,INET6_ADDRSTRLEN,uport,32,
> >     NI_NUMERICHOST | NI_NUMERICSERV);
> > -slisten = qemu_socket(e->ai_family, e->ai_socktype, e-
> > >ai_protocol);
> > +
> > +slisten = create_fast_reuse_socket(e, );
> >  if (slisten < 0) {
> > -if (!e->ai_next) {
> > -error_setg_errno(errp, errno, "Failed to create socket");
> > -}
> >  continue;
> 
> It isn't shown in this diff context, but at the end of the outer
> loop we have
> 
>    error_setg_errno(errp, errno, "Failed to find available port");
> 
> so IIUC, even this pre-existing code is wrong. If 'e->ai_next' is
> NULL, we report an error message here. Then, we continue to the
> next loop iteration, which causes use to terminate the loop
> entirely. At which point we'll report another error message
> over the top of the one we already have. So I think the error
> reporting does still need refactoring, but not in the way it
> is done here.

Yes, I did scratch my head about this but I tried to keep the original semantics
to avoid mixing unrelated changes.

With the split into separate refactoring commits we are beyond that anyway.

I'll have a second look at it..

Thanks,
Knut

> 
> > 
> >  }
> >  
> > -socket_set_fast_reuse(slisten);
> > -
> >  port_min = inet_getport(e);
> >  port_max = saddr->has_to ? saddr->to + port_offset : port_min;
> >  for (p = port_min; p <= port_max; p++) {
> 
> Regards,
> Daniel

Re: [Qemu-devel] [PATCH v4 4/4] sockets: Handle race condition between binds to the same port

2017-06-26 Thread Knut Omang

On Mon, 2017-06-26 at 11:22 +0100, Daniel P. Berrange wrote:
> On Fri, Jun 23, 2017 at 12:31:08PM +0200, Knut Omang wrote:
> > If an offset of ports is specified to the inet_listen_saddr function(),
> > and two or more processes tries to bind from these ports at the same time,
> > occasionally more than one process may be able to bind to the same
> > port. The condition is detected by listen() but too late to avoid a failure.
> > 
> > This function is called by socket_listen() and used
> > by all socket listening code in QEMU, so all cases where any form of dynamic
> > port selection is used should be subject to this issue.
> > 
> > Add code to close and re-establish the socket when this
> > condition is observed, hiding the race condition from the user.
> > 
> > This has been developed and tested by means of the
> > test-listen unit test in the previous commit.
> > Enable the test for make check now that it passes.
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |  2 +-
> >  util/qemu-sockets.c| 68 ---
> >  2 files changed, 53 insertions(+), 17 deletions(-)
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 22bb97e..c38f94e 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > -#check-unit-y += tests/test-listen$(EXESUF)
> > +check-unit-y += tests/test-listen$(EXESUF)
> >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> >  check-unit-y += tests/test-qapi-util$(EXESUF)
> >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 48b9319..7b118b4 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -201,6 +201,42 @@ static int try_bind(int socket, InetSocketAddress 
> > *saddr, struct
> addrinfo *e)
> >  #endif
> >  }
> >  
> > +static int try_bind_listen(int *socket, InetSocketAddress *saddr,
> > +   struct addrinfo *e, int port, Error **errp)
> > +{
> > +int s = *socket;
> > +int ret;
> > +
> > +inet_setport(e, port);
> > +ret = try_bind(s, saddr, e);
> > +if (ret) {
> > +if (errno != EADDRINUSE) {
> > +error_setg_errno(errp, errno, "Failed to bind socket");
> > +}
> > +return errno;
> > +}
> > +if (listen(s, 1) == 0) {
> > +return 0;
> > +}
> > +if (errno == EADDRINUSE) {
> > +/* We got to bind the socket to a port but someone else managed
> > + * to bind to the same port and beat us to listen on it!
> > + * Recreate the socket and return EADDRINUSE to preserve the
> > + * expected state by the caller:
> > + */
> > +closesocket(s);
> > +s = create_fast_reuse_socket(e, errp);
> > +if (s < 0) {
> > +return errno;
> > +}
> > +*socket = s;
> 
> I don't really like this at all - if we need to close + recreate the
> socket, IMHO that should remain the job of the caller, since it owns
> the socket FD ultimately.

Normally I would agree, but this is a very unlikely situation. I considered
moving the complexity out to the caller, even recreating the socket for every
call, but found those solutions to be inferior: they do not confine the problem
in any way, and they make the handling of the common cases much less readable.
There are going to be some trade-offs here.

As long as the caller is aware (through the pass-by-reference argument) that
the socket in use may change, this is in my view a clean (as clean as possible)
abstraction that simplifies the logic at the next level. My intention is to make
the common, good case as readable as possible and hide some of the complexity of
these unlikely error scenarios inside the new functions - divide and conquer..

> 
> > +errno = EADDRINUSE;
> > +return errno;
> > +}
> > +error_setg_errno(errp, errno, "Failed to listen on socket");
> > +return errno;
> > +}
> > +
> >  sta

Re: [Qemu-devel] [PATCH v4 2/4] sockets: factor out create_fast_reuse_socket

2017-06-26 Thread Knut Omang

On Mon, 2017-06-26 at 11:28 +0100, Daniel P. Berrange wrote:
> On Fri, Jun 23, 2017 at 12:31:06PM +0200, Knut Omang wrote:
> > First refactoring step to prepare for fixing the problem
> > exposed with the test-listen test in the previous commit
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > ---
> >  util/qemu-sockets.c | 24 +---
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 852773d..699e36c 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -149,6 +149,20 @@ int inet_ai_family_from_address(InetSocketAddress 
> > *addr,
> >  return PF_UNSPEC;
> >  }
> >  
> > +static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
> > +{
> > +int slisten = qemu_socket(e->ai_family, e->ai_socktype, 
> > e->ai_protocol);
> > +if (slisten < 0) {
> > +if (!e->ai_next) {
> > +error_setg_errno(errp, errno, "Failed to create socket");
> > +}
> 
> I think that having this method sometimes report an error message, and
> sometimes not report an error message, depending on state of a variable
> used by the caller is rather unpleasant. I'd much rather see this
> error message reporting remain in the caller.

In principle I agree with you, but I think we do want to keep the details 
of what the failure cause was by also propagating information about the system 
call that failed.

I considered this an acceptable trade-off in the name of performance as well as
readability at the next level. This is a fairly unlikely case that one really
does not have to worry too much about at the next level. Setting an error that
does not get used for that special, unlikely case is not that bad. Doing it for
all failures would be a lot more unnecessary work.

> 
> > +return -1;
> > +}
> > +
> > +socket_set_fast_reuse(slisten);
> > +return slisten;
> > +}
> > +
> >  static int inet_listen_saddr(InetSocketAddress *saddr,
> >   int port_offset,
> >   bool update_addr,
> > @@ -210,21 +224,17 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
> >  return -1;
> >  }
> >  
> > -/* create socket + bind */
> > +/* create socket + bind/listen */
> >  for (e = res; e != NULL; e = e->ai_next) {
> >  getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
> >     uaddr,INET6_ADDRSTRLEN,uport,32,
> >     NI_NUMERICHOST | NI_NUMERICSERV);
> > -slisten = qemu_socket(e->ai_family, e->ai_socktype, 
> > e->ai_protocol);
> > +
> > +slisten = create_fast_reuse_socket(e, );
> >  if (slisten < 0) {
> > -if (!e->ai_next) {
> > -error_setg_errno(errp, errno, "Failed to create socket");
> > -}
> >  continue;
> 
> It isn't shown in this diff context, but at the end of the outer
> loop we have
> 
>    error_setg_errno(errp, errno, "Failed to find available port");
> 
> so IIUC, even this pre-existing code is wrong. If 'e->ai_next' is
> NULL, we report an error message here. Then, we continue to the
> next loop iteration, which causes use to terminate the loop
> entirely. At which point we'll report another error message
> over the top of the one we already have. 
>
> So I think the error
> reporting does still need refactoring, but not in the way it
> is done here.

I agree, a simple way to solve it would be to only set errp if no error has
already been set.
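
A minimal sketch of that "first error wins" idea, for illustration only. It
deliberately uses a hypothetical stand-in type (SketchError) and helper
(sketch_error_set_once) rather than QEMU's Error API, so nothing here should be
read as the actual fix:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an error object; not QEMU's Error type. */
typedef struct SketchError {
    int set;            /* non-zero once an error has been recorded */
    char msg[128];      /* description of the first failure */
} SketchError;

/*
 * Record an error only when none has been recorded yet, so that a later,
 * more generic failure cannot overwrite the original cause.
 */
static void sketch_error_set_once(SketchError *err, int eno, const char *what)
{
    if (err == NULL || err->set) {
        return;
    }
    snprintf(err->msg, sizeof(err->msg), "%s: %s", what, strerror(eno));
    err->set = 1;
}
```

With something like this, the generic "Failed to find available port" report
at the end of the outer loop would become a no-op whenever a more specific
failure had already been recorded.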

Thanks,
Knut

> >  }
> >  
> > -socket_set_fast_reuse(slisten);
> > -
> >  port_min = inet_getport(e);
> >  port_max = saddr->has_to ? saddr->to + port_offset : port_min;
> >  for (p = port_min; p <= port_max; p++) {
> 
> Regards,
> Daniel

[Qemu-devel] [PATCH v4 4/4] sockets: Handle race condition between binds to the same port

2017-06-23 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen(), but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |  2 +-
 util/qemu-sockets.c| 68 ---
 2 files changed, 53 insertions(+), 17 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 22bb97e..c38f94e 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 48b9319..7b118b4 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -201,6 +201,42 @@ static int try_bind(int socket, InetSocketAddress *saddr, 
struct addrinfo *e)
 #endif
 }
 
+static int try_bind_listen(int *socket, InetSocketAddress *saddr,
+   struct addrinfo *e, int port, Error **errp)
+{
+int s = *socket;
+int ret;
+
+inet_setport(e, port);
+ret = try_bind(s, saddr, e);
+if (ret) {
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to bind socket");
+}
+return errno;
+}
+if (listen(s, 1) == 0) {
+return 0;
+}
+if (errno == EADDRINUSE) {
+/* We got to bind the socket to a port but someone else managed
+ * to bind to the same port and beat us to listen on it!
+ * Recreate the socket and return EADDRINUSE to preserve the
+ * expected state by the caller:
+ */
+closesocket(s);
+s = create_fast_reuse_socket(e, errp);
+if (s < 0) {
+return errno;
+}
+*socket = s;
+errno = EADDRINUSE;
+return errno;
+}
+error_setg_errno(errp, errno, "Failed to listen on socket");
+return errno;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -210,7 +246,9 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, port_max, p;
+int slisten = 0;
+int saved_errno = 0;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -276,28 +314,26 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-inet_setport(e, p);
-if (try_bind(slisten, saddr, e) >= 0) {
-goto listen;
-}
-if (p == port_max) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to bind socket");
-}
+int eno = try_bind_listen(&slisten, saddr, e, p, &err);
+if (!eno) {
+goto listen_ok;
+} else if (eno != EADDRINUSE) {
+goto listen_failed;
 }
 }
+}
+error_setg_errno(errp, errno, "Failed to find available port");
+
+listen_failed:
+saved_errno = errno;
+if (slisten >= 0) {
 closesocket(slisten);
 }
 freeaddrinfo(res);
+errno = saved_errno;
 return -1;
 
-listen:
-if (listen(slisten,1) != 0) {
-error_setg_errno(errp, errno, "Failed to listen on socket");
-closesocket(slisten);
-freeaddrinfo(res);
-return -1;
-}
+listen_ok:
 if (update_addr) {
 g_free(saddr->host);
 saddr->host = g_strdup(uaddr);
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 0/4] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-23 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.
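
For readers who want the idea without the QEMU plumbing, here is a rough,
standalone sketch of the approach. It assumes plain POSIX sockets on IPv4
loopback, and the names make_reuse_socket() and listen_in_range() are
hypothetical; it is not the code in this series, only an illustration of
treating EADDRINUSE from listen() like EADDRINUSE from bind():

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: TCP socket with SO_REUSEADDR set, or -1 on failure. */
static int make_reuse_socket(void)
{
    int one = 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0) {
        return -1;
    }
    if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) != 0) {
        close(s);
        return -1;
    }
    return s;
}

/*
 * Try ports port_min..port_max on 127.0.0.1. Returns a listening fd and
 * stores the chosen port in *port_out, or returns -1 with errno set.
 */
static int listen_in_range(int port_min, int port_max, int *port_out)
{
    int err = EADDRINUSE;   /* reported if every port in the range is busy */
    int s = make_reuse_socket();

    if (s < 0) {
        return -1;
    }
    for (int p = port_min; p <= port_max; p++) {
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = htons((unsigned short)p);

        if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            if (errno == EADDRINUSE) {
                continue;   /* port taken at bind() time: try the next one */
            }
            err = errno;
            break;
        }
        if (listen(s, 1) == 0) {
            *port_out = p;
            return s;
        }
        err = errno;
        if (err != EADDRINUSE) {
            break;
        }
        /*
         * Someone else bound the same port and got to listen() first.
         * The socket is stuck on that port, so replace it and move on.
         */
        close(s);
        s = make_reuse_socket();
        if (s < 0) {
            return -1;
        }
    }
    close(s);
    errno = err;
    return -1;
}
```

The design point is the same as in the series: the port scan continues after a
lost race instead of failing the whole call.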

Changes from v3:
* Test changes: Add missing license
  Add subtests for ipv4, ipv6 and both
  Various g_* usage improvements
* Split patch into 3 patches with two refactoring patches ahead
  of the actual fix.

Changes from v2:
* Non-trivial rebase + further abstraction
  on top of 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
  'tests: add functional test validating ipv4/ipv6 address flag handling'

Changes from v1:
* Fix potential uninitialized variable only detected when compiling with optimization.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (4):
  tests: Add test-listen - a stress test for QEMU socket listen
  sockets: factor out create_fast_reuse_socket
  sockets: factor out a new try_bind() function
  sockets: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 util/qemu-sockets.c| 159 +-
 3 files changed, 362 insertions(+), 52 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 2/4] sockets: factor out create_fast_reuse_socket

2017-06-23 Thread Knut Omang

First refactoring step to prepare for fixing the problem
exposed with the test-listen test in the previous commit

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 util/qemu-sockets.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 852773d..699e36c 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,20 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+if (!e->ai_next) {
+error_setg_errno(errp, errno, "Failed to create socket");
+}
+return -1;
+}
+
+socket_set_fast_reuse(slisten);
+return slisten;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -210,21 +224,17 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e, &err);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}
 continue;
 }
 
-socket_set_fast_reuse(slisten);
-
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 3/4] sockets: factor out a new try_bind() function

2017-06-23 Thread Knut Omang

Another refactoring step to prepare for fixing the problem
exposed by the test-listen test.

This time simplify and reorganize the IPv6 specific extra
measures and move them out of the for loop to increase
code readability. No semantic changes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 util/qemu-sockets.c | 69 ++
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 699e36c..48b9319 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -163,6 +163,44 @@ static int create_fast_reuse_socket(struct addrinfo *e, 
Error **errp)
 return slisten;
 }
 
+static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
+{
+#ifndef IPV6_V6ONLY
+return bind(socket, e->ai_addr, e->ai_addrlen);
+#else
+/*
+ * Deals with first & last cases in matrix in comment
+ * for inet_ai_family_from_address().
+ */
+int v6only =
+((!saddr->has_ipv4 && !saddr->has_ipv6) ||
+ (saddr->has_ipv4 && saddr->ipv4 &&
+  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
+int stat;
+
+ rebind:
+if (e->ai_family == PF_INET6) {
+qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
+sizeof(v6only));
+}
+
+stat = bind(socket, e->ai_addr, e->ai_addrlen);
+if (!stat) {
+return 0;
+}
+
+/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
+ * it could be that the IPv4 port is already claimed, so retry
+ * with v6only set
+ */
+if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
+v6only = 1;
+goto rebind;
+}
+return stat;
+#endif
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -238,39 +276,10 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 port_min = inet_getport(e);
 port_max = saddr->has_to ? saddr->to + port_offset : port_min;
 for (p = port_min; p <= port_max; p++) {
-#ifdef IPV6_V6ONLY
-/*
- * Deals with first & last cases in matrix in comment
- * for inet_ai_family_from_address().
- */
-int v6only =
-((!saddr->has_ipv4 && !saddr->has_ipv6) ||
- (saddr->has_ipv4 && saddr->ipv4 &&
-  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
-#endif
 inet_setport(e, p);
-#ifdef IPV6_V6ONLY
-rebind:
-if (e->ai_family == PF_INET6) {
-qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
-sizeof(v6only));
-}
-#endif
-if (bind(slisten, e->ai_addr, e->ai_addrlen) == 0) {
+if (try_bind(slisten, saddr, e) >= 0) {
 goto listen;
 }
-
-#ifdef IPV6_V6ONLY
-/* If we got EADDRINUSE from an IPv6 bind & V6ONLY is unset,
- * it could be that the IPv4 port is already claimed, so retry
- * with V6ONLY set
- */
-if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
-v6only = 1;
-goto rebind;
-}
-#endif
-
 if (p == port_max) {
 if (!e->ai_next) {
 error_setg_errno(errp, errno, "Failed to bind socket");
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v4 1/4] tests: Add test-listen - a stress test for QEMU socket listen

2017-06-23 Thread Knut Omang

There's a potential race condition between multiple bind() calls
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When listen() is subsequently called on each of these sockets,
only one call will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.
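
For illustration only (not part of this patch), the situation can be provoked
deterministically in a single thread. The sketch below assumes plain POSIX
sockets on IPv4 loopback with SO_REUSEADDR set, and the names reuse_socket()
and probe_double_bind() are hypothetical. On Linux I would expect the conflict
to show up at listen(); other systems may already reject the second bind():

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

enum conflict_point {
    NO_CONFLICT,        /* both sockets ended up listening */
    CONFLICT_AT_BIND,   /* second bind() returned EADDRINUSE */
    CONFLICT_AT_LISTEN, /* second listen() returned EADDRINUSE */
    SETUP_FAILED        /* some other, unrelated failure */
};

/* Hypothetical helper: TCP socket with SO_REUSEADDR set, or -1 on failure. */
static int reuse_socket(void)
{
    int one = 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s >= 0 &&
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) != 0) {
        close(s);
        s = -1;
    }
    return s;
}

/* Bind two sockets to one port, then listen on both; report where it fails. */
static enum conflict_point probe_double_bind(void)
{
    enum conflict_point result = SETUP_FAILED;
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int a = reuse_socket();
    int b = reuse_socket();

    if (a < 0 || b < 0) {
        goto out;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port for socket a */
    if (bind(a, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        getsockname(a, (struct sockaddr *)&addr, &len) != 0) {
        goto out;
    }
    /* Second socket asks for the very same port before anyone listens. */
    if (bind(b, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        result = (errno == EADDRINUSE) ? CONFLICT_AT_BIND : SETUP_FAILED;
        goto out;
    }
    if (listen(a, 1) != 0) {
        goto out;
    }
    if (listen(b, 1) != 0) {
        result = (errno == EADDRINUSE) ? CONFLICT_AT_LISTEN : SETUP_FAILED;
        goto out;
    }
    result = NO_CONFLICT;
out:
    if (a >= 0) {
        close(a);
    }
    if (b >= 0) {
        close(b);
    }
    return result;
}
```

A CONFLICT_AT_LISTEN result corresponds to the case described above: bind()
gave no hint of trouble, and the error only surfaces once listen() is called.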

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to bind
to the same port, with a large enough offset input to
allow every thread to get its own port.
The "serial" subtest does the same, except in series in a
single thread.

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of times, but not with a 100% rate; occasional
passes can be seen. Nevertheless this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 253 ++-
 2 files changed, 255 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7180fe4..22bb97e 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,6 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -764,6 +765,7 @@ tests/test-uuid$(EXESUF): tests/test-uuid.o 
$(test-util-obj-y)
 tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< 
,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..5c07537
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,253 @@
+/*
+ * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
+ *Author: Knut Omang <knut.om...@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ *
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+bool ipv4;
+bool ipv6;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+char hostname[NAME_LEN + 1];
+char port[PORT_LEN + 1];
+};
+
+
+/* These two functions taken from test-io-channel-socket.c */
+static int check_bind(const char *hostname, bool *has_proto)
+{
+int fd = -1;
+struct addrinfo ai, *res = NULL;
+int rc;
+int ret = -1;
+
+memset(&ai, 0, sizeof(ai));
+ai.ai_flags = AI_CANONNAME | AI_ADDRCONFIG;
+ai.ai_family = AF_UNSPEC;
+ai.ai_socktype = SOCK_STREAM;
+
+/* lookup */
+rc = getaddrinfo(hostname, NULL, &ai, &res);
+if (rc != 0) {
+if (rc == EAI_ADDRFAMILY ||
+rc == EAI_FAMILY) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+fd = qemu_socket(res->ai_family, res->ai_socktype, res->ai_protocol);
+if (fd < 0) {
+goto cleanup;
+}
+
+if (bind(fd, res->ai_addr, res->ai_addrlen) < 0) {
+if (errno == EADDRNOTAVAIL) {
+*has_proto = false;
+goto done;
+}
+goto cleanup;
+}
+
+*has_proto = true;
+ done:
+ret = 0;
+
+ cleanup:

Re: [Qemu-devel] [PATCH v3 2/2] sockets: Handle race condition between binds to the same port

2017-06-20 Thread Knut Omang

On Fri, 2017-06-16 at 15:45 +0100, Daniel P. Berrange wrote:
> On Wed, Jun 14, 2017 at 06:53:52PM +0200, Knut Omang wrote:
> > If an offset of ports is specified to the inet_listen_saddr() function,
> > and two or more processes try to bind from these ports at the same time,
> > occasionally more than one process may be able to bind to the same
> > port. The condition is detected by listen() but too late to avoid a failure.
> > 
> > This function is called by socket_listen() and used
> > by all socket listening code in QEMU, so all cases where any form of dynamic
> > port selection is used should be subject to this issue.
> > 
> > Add code to close and re-establish the socket when this
> > condition is observed, hiding the race condition from the user.
> > 
> > This has been developed and tested by means of the
> > test-listen unit test in the previous commit.
> > Enable the test for make check now that it passes.
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |   2 +-
> >  util/qemu-sockets.c| 159 --
> >  2 files changed, 108 insertions(+), 53 deletions(-)
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 22bb97e..c38f94e 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > -#check-unit-y += tests/test-listen$(EXESUF)
> > +check-unit-y += tests/test-listen$(EXESUF)
> >  gcov-files-ptimer-test-y = hw/core/ptimer.c
> >  check-unit-y += tests/test-qapi-util$(EXESUF)
> >  gcov-files-test-qapi-util-y = qapi/qapi-util.c
> > diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> > index 852773d..7b118b4 100644
> > --- a/util/qemu-sockets.c
> > +++ b/util/qemu-sockets.c
> > @@ -149,6 +149,94 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
> >  return PF_UNSPEC;
> >  }
> >  
> > +static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
> > +{
> > +int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
> > +if (slisten < 0) {
> > +if (!e->ai_next) {
> > +error_setg_errno(errp, errno, "Failed to create socket");
> > +}
> > +return -1;
> > +}
> > +
> > +socket_set_fast_reuse(slisten);
> > +return slisten;
> > +}
> > +
> > +static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
> > +{
> > +#ifndef IPV6_V6ONLY
> > +return bind(socket, e->ai_addr, e->ai_addrlen);
> > +#else
> > +/*
> > + * Deals with first & last cases in matrix in comment
> > + * for inet_ai_family_from_address().
> > + */
> > +int v6only =
> > +((!saddr->has_ipv4 && !saddr->has_ipv6) ||
> > + (saddr->has_ipv4 && saddr->ipv4 &&
> > +  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
> > +int stat;
> > +
> > + rebind:
> > +if (e->ai_family == PF_INET6) {
> > +qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
> > +sizeof(v6only));
> > +}
> > +
> > +stat = bind(socket, e->ai_addr, e->ai_addrlen);
> > +if (!stat) {
> > +return 0;
> > +}
> > +
> > +/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
> > + * it could be that the IPv4 port is already claimed, so retry
> > + * with v6only set
> > + */
> > +if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
> > +v6only = 1;
> > +goto rebind;
> > +}
> > +return stat;
> > +#endif
> > +}
> > +
> > +static int try_bind_listen(int *socket, InetSocketAddress *saddr,
> > +   struct addrinfo *e, int port, Error **errp)
> > +{
> > +int s = *socket;
> > +int ret;
> > +
> > +inet_setport(e, port);
> > +ret = try_bind(s, saddr, e);
> > +

Re: [Qemu-devel] [PATCH v3 1/2] tests: Add test-listen - a stress test for QEMU socket listen

2017-06-20 Thread Knut Omang

On Fri, 2017-06-16 at 15:41 +0100, Daniel P. Berrange wrote:
> On Wed, Jun 14, 2017 at 06:53:51PM +0200, Knut Omang wrote:
> > There's a potential race condition between multiple bind()'s
> > attempting to bind to the same port, which occasionally
> > allows more than one bind to succeed against the same port.
> > 
> > When a subsequent listen() call is made with the same socket
> > only one will succeed.
> > 
> > The current QEMU code does however not take this situation into account
> > and the listen will cause the code to break out and fail even
> > when there are actually available ports to use.
> > 
> > This test exposes two subtests:
> > 
> > /socket/listen-serial
> > /socket/listen-compete
> > 
> > The "compete" subtest creates a number of threads and has them all try to
> > bind to the same port, with a large enough offset input to
> > allow each thread to get its own port.
> > The "serial" subtest just does the same, except in series in a
> > single thread.
> > 
> > The serial version passes, probably in most versions of QEMU.
> > 
> > The parallel version exposes the problem in a relatively reliable way,
> > i.e. it fails a majority of the time, but not at a 100% rate; occasional
> > passes can be seen. Nevertheless, this is quite good given that
> > the bug was tricky to reproduce and has been left undetected for
> > a while.
> > 
> > The problem seems to be present in all versions of QEMU.
> > 
> > The original failure scenario occurred with VNC port allocation
> > in a traditional Xen based build, in different code
> > but with similar functionality.
> > 
> > Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |   2 +-
> >  tests/test-listen.c| 141 ++-
> >  2 files changed, 143 insertions(+)
> >  create mode 100644 tests/test-listen.c
> > 
> > diff --git a/tests/Makefile.include b/tests/Makefile.include
> > index 7180fe4..22bb97e 100644
> > --- a/tests/Makefile.include
> > +++ b/tests/Makefile.include
> > @@ -127,6 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
> >  gcov-files-check-bufferiszero-y = util/bufferiszero.c
> >  check-unit-y += tests/test-uuid$(EXESUF)
> >  check-unit-y += tests/ptimer-test$(EXESUF)
> > +#check-unit-y += tests/test-listen$(EXESUF)
> 
> Did you really mean to leave this commented out ?

Yes, it is enabled by the next commit with the fix - otherwise it would
fail 'make check'. Just wanted to make it convenient
for you to reproduce the issue (I appreciate that myself.. :-) )

> > diff --git a/tests/test-listen.c b/tests/test-listen.c
> > new file mode 100644
> > index 000..45fe9a8
> > --- /dev/null
> > +++ b/tests/test-listen.c
> > @@ -0,0 +1,141 @@
> > +/*
> > + * Test parallel port listen configuration with
> > + * dynamic port allocation
> > + */
> 
> Should stick a standard license header on this new file

Oh - just forgot it, will add, thanks..
> 
> > +
> > +#include "qemu/osdep.h"
> > +#include "libqtest.h"
> > +#include "qemu-common.h"
> > +#include "qemu/thread.h"
> > +#include "qemu/sockets.h"
> > +#include "qapi/error.h"
> > +
> > +#define NAME_LEN 1024
> > +#define PORT_LEN 16
> > +
> > +struct thr_info {
> > +QemuThread thread;
> > +int to_port;
> > +int got_port;
> > +int eno;
> > +int fd;
> > +const char *errstr;
> > +};
> > +
> > +static char hostname[NAME_LEN + 1];
> > +static char port[PORT_LEN + 1];
> > +
> > +static void *listener_thread(void *arg)
> > +{
> > +struct thr_info *thr = (struct thr_info *)arg;
> > +SocketAddress addr = {
> > +.type = SOCKET_ADDRESS_TYPE_INET,
> > +.u = {
> > +.inet = {
> > +.host = hostname,
> > +.port = port,
> > +.ipv4 = true,
> 
> .ipv4 is ignored unless you set  .has_ipv4 too.

I see..
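
A minimal sketch of why the missing .has_ipv4 matters. The struct below is
a hypothetical stand-in holding only the four flag fields (it is not the
real InetSocketAddress), and want_v6only() is an illustrative name; the
expression itself is copied from try_bind() in patch 2/2:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the four flag fields of InetSocketAddress. */
struct inet_flags {
    bool has_ipv4;
    bool ipv4;
    bool has_ipv6;
    bool ipv6;
};

/*
 * Same expression as in try_bind(): V6ONLY stays off only when neither
 * flag was given, or when both were given and both are true.
 */
static int want_v6only(const struct inet_flags *f)
{
    return ((!f->has_ipv4 && !f->has_ipv6) ||
            (f->has_ipv4 && f->ipv4 && f->has_ipv6 && f->ipv6)) ? 0 : 1;
}

/* Setting .ipv4 without .has_ipv4 behaves exactly like setting nothing. */
static void example_flags(void)
{
    struct inet_flags nothing = { 0 };
    struct inet_flags value_only = { .ipv4 = true };
    struct inet_flags v4_requested = { .has_ipv4 = true, .ipv4 = true };

    assert(want_v6only(&value_only) == want_v6only(&nothing));
    assert(want_v6only(&v4_requested) == 1);
}
```

So with only .ipv4 = true the test takes the default path, which is
probably not what the initializer was meant to express.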

> I'd be inclined to allow ipv6 too though, since that's normal
> usage scenario out of the box.
> 
> Or repeat the test multiple times, with ipv4 only, ipv6
>

[Qemu-devel] [PATCH v3 1/2] tests: Add test-listen - a stress test for QEMU socket listen

2017-06-14 Thread Knut Omang

There's a potential race condition between multiple bind()'s
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When a subsequent listen() call is made with the same socket
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to
bind to the same port, with a large enough offset input to
allow each thread to get its own port.
The "serial" subtest just does the same, except in series in a
single thread.
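
The per-port bookkeeping both subtests rely on can be sketched roughly as
follows. This is an illustration only, not the code in test-listen.c:
count_bad_ports() and its parameters are hypothetical names, and a port
value of 0 is assumed to mean "this listener failed":

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: given the ports obtained by n listeners, count
 * how many are duplicates or fall outside
 * [start_port, start_port + max_offset].
 * Entries equal to 0 (failed listeners) are skipped.
 * Returns -1 if the bookkeeping array cannot be allocated.
 */
static int count_bad_ports(const int *ports, int n,
                           int start_port, int max_offset)
{
    int bad = 0;
    int *used = calloc(max_offset + 1, sizeof(*used));

    if (!used) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        int off = ports[i] - start_port;

        if (ports[i] == 0) {
            continue;
        }
        /* used[off]++ is only evaluated when off is inside the range */
        if (off < 0 || off > max_offset || used[off]++) {
            bad++;
        }
    }
    free(used);
    return bad;
}

static void example_check(void)
{
    int ok[]  = { 5400, 5401, 5403 };
    int dup[] = { 5400, 5401, 5400 };

    assert(count_bad_ports(ok, 3, 5400, 10) == 0);
    assert(count_bad_ports(dup, 3, 5400, 10) == 1);
}
```

A duplicate here would mean two listeners were handed the same port,
which is the symptom the compete subtest is looking for.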

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of the time, but not at a 100% rate; occasional
passes can be seen. Nevertheless, this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 141 ++-
 2 files changed, 143 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7180fe4..22bb97e 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,6 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -764,6 +765,7 @@ tests/test-uuid$(EXESUF): tests/test-uuid.o $(test-util-obj-y)
 tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..45fe9a8
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,141 @@
+/*
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+};
+
+static char hostname[NAME_LEN + 1];
+static char port[PORT_LEN + 1];
+
+static void *listener_thread(void *arg)
+{
+struct thr_info *thr = (struct thr_info *)arg;
+SocketAddress addr = {
+.type = SOCKET_ADDRESS_TYPE_INET,
+.u = {
+.inet = {
+.host = hostname,
+.port = port,
+.ipv4 = true,
+.has_to = true,
+.to = thr->to_port,
+},
+},
+};
+Error *err = NULL;
+int fd;
+
+fd = socket_listen(&addr, &err);
+if (fd < 0) {
+thr->eno = errno;
+thr->errstr = error_get_pretty(err);
+} else {
+struct sockaddr_in a;
+socklen_t a_len = sizeof(a);
+g_assert_cmpint(getsockname(fd, (struct sockaddr *)&a, &a_len), ==, 0);
+thr->got_port = ntohs(a.sin_port);
+thr->fd = fd;
+}
+return arg;
+}
+
+
+static void listen_compete_nthr(bool threaded, int nthreads,
+int start_port, int max_offset)
+{
+int i;
+int failed_listens = 0;
+size_t alloc_sz = sizeof(struct thr_info) * nthreads;
+struct thr_info *thr = g_malloc(alloc_sz);
+int used[max_offset + 1];
+memset(used, 0, sizeof(used));
+g_assert_nonnull(thr);
+g_assert_cmpint(gethostname(hostname, NAME_LEN), == , 0);
+snprintf(port, PORT_LEN, "%d", start_port);
+memset(thr, 0, alloc_sz);
+
+f

[Qemu-devel] [PATCH v3 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-14 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.

Changes from v2:
* Non-trivial rebase + further abstraction
  on top of 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
  'tests: add functional test validating ipv4/ipv6 address flag handling'

Changes from v1:
* Fix potential uninitialized variable only detected when optimizing.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (2):
  tests: Add test-listen - a stress test for QEMU socket listen
  sockets: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 141 +-
 util/qemu-sockets.c| 159 --
 3 files changed, 250 insertions(+), 52 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: 7ad9af343c7f1c70c8015c7c519c312d8c5f9fa1
-- 
git-series 0.9.1

[Qemu-devel] [PATCH v3 2/2] sockets: Handle race condition between binds to the same port

2017-06-14 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen() but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.
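
The idea can be sketched outside QEMU with plain POSIX sockets. This is a
simplified illustration under stated assumptions, not the code in this
patch: listen_in_range() is a hypothetical name, it only handles the IPv4
loopback address, and it creates a fresh socket for every port instead of
recreating one only after a failed listen(). The port range 47000-47050
in the example is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: listen on the first usable TCP port in
 * [port_min, port_max] on 127.0.0.1. Returns the listening fd and
 * stores the port in *got_port, or returns -1 with errno set.
 */
static int listen_in_range(int port_min, int port_max, int *got_port)
{
    for (int p = port_min; p <= port_max; p++) {
        struct sockaddr_in sa;
        int on = 1;
        int saved_errno;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0) {
            return -1;
        }
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = htons(p);

        if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0 &&
            listen(fd, 1) == 0) {
            *got_port = p;
            return fd;
        }

        /*
         * bind() or listen() failed. EADDRINUSE from either call means
         * the port is taken (for listen(), that somebody else won the
         * race), so drop this socket and move on to the next port.
         */
        saved_errno = errno;
        close(fd);
        if (saved_errno != EADDRINUSE) {
            errno = saved_errno;
            return -1;
        }
    }
    errno = EADDRINUSE;
    return -1;
}

/* Two listeners over the same range must end up on different ports. */
static void example_two_listeners(void)
{
    int p1 = 0, p2 = 0;
    int fd1 = listen_in_range(47000, 47050, &p1);
    int fd2 = listen_in_range(47000, 47050, &p2);

    assert(fd1 >= 0 && fd2 >= 0);
    assert(p1 != p2);
    close(fd1);
    close(fd2);
}
```

The example runs the two listeners one after the other, so it does not
reproduce the race itself; it only shows that EADDRINUSE from listen() is
treated the same way as EADDRINUSE from bind().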

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 util/qemu-sockets.c| 159 --
 2 files changed, 108 insertions(+), 53 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 22bb97e..c38f94e 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 852773d..7b118b4 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -149,6 +149,94 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+if (!e->ai_next) {
+error_setg_errno(errp, errno, "Failed to create socket");
+}
+return -1;
+}
+
+socket_set_fast_reuse(slisten);
+return slisten;
+}
+
+static int try_bind(int socket, InetSocketAddress *saddr, struct addrinfo *e)
+{
+#ifndef IPV6_V6ONLY
+return bind(socket, e->ai_addr, e->ai_addrlen);
+#else
+/*
+ * Deals with first & last cases in matrix in comment
+ * for inet_ai_family_from_address().
+ */
+int v6only =
+((!saddr->has_ipv4 && !saddr->has_ipv6) ||
+ (saddr->has_ipv4 && saddr->ipv4 &&
+  saddr->has_ipv6 && saddr->ipv6)) ? 0 : 1;
+int stat;
+
+ rebind:
+if (e->ai_family == PF_INET6) {
+qemu_setsockopt(socket, IPPROTO_IPV6, IPV6_V6ONLY, &v6only,
+sizeof(v6only));
+}
+
+stat = bind(socket, e->ai_addr, e->ai_addrlen);
+if (!stat) {
+return 0;
+}
+
+/* If we got EADDRINUSE from an IPv6 bind & v6only is unset,
+ * it could be that the IPv4 port is already claimed, so retry
+ * with v6only set
+ */
+if (e->ai_family == PF_INET6 && errno == EADDRINUSE && !v6only) {
+v6only = 1;
+goto rebind;
+}
+return stat;
+#endif
+}
+
+static int try_bind_listen(int *socket, InetSocketAddress *saddr,
+   struct addrinfo *e, int port, Error **errp)
+{
+int s = *socket;
+int ret;
+
+inet_setport(e, port);
+ret = try_bind(s, saddr, e);
+if (ret) {
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to bind socket");
+}
+return errno;
+}
+if (listen(s, 1) == 0) {
+return 0;
+}
+if (errno == EADDRINUSE) {
+/* We got to bind the socket to a port but someone else managed
+ * to bind to the same port and beat us to listen on it!
+ * Recreate the socket and return EADDRINUSE to preserve the
+ * expected state by the caller:
+ */
+closesocket(s);
+s = create_fast_reuse_socket(e, errp);
+if (s < 0) {
+return errno;
+}
+*socket = s;
+errno = EADDRINUSE;
+return errno;
+}
+error_setg_errno(errp, errno, "Failed to listen on socket");
+return errno;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -158,7 +246,9 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, po

Re: [Qemu-devel] [PATCH 2/2] socket: Handle race condition between binds to the same port

2017-06-14 Thread Knut Omang

On Wed, 2017-06-14 at 09:17 +0100, Daniel P. Berrange wrote:
> On Fri, Jun 09, 2017 at 09:19:49PM +0200, Knut Omang wrote:
> > If an offset of ports is specified to the inet_listen_saddr() function,
> > and two or more processes try to bind from these ports at the same time,
> > occasionally more than one process may be able to bind to the same
> > port. The condition is detected by listen() but too late to avoid a failure.
> > 
> > This function is called by socket_listen() and used
> > by all socket listening code in QEMU, so all cases where any form of dynamic
> > port selection is used should be subject to this issue.
> > 
> > Add code to close and re-establish the socket when this
> > condition is observed, hiding the race condition from the user.
> > 
> > This has been developed and tested by means of the
> > test-listen unit test in the previous commit.
> > Enable the test for make check now that it passes.
> > 
> > Signed-off-by: Knut Omang <knut.om...@oracle.com>
> > Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
> > Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
> > ---
> >  tests/Makefile.include |   2 +-
> >  util/qemu-sockets.c| 106 +-
> >  2 files changed, 76 insertions(+), 32 deletions(-)
> 
> FYI, the changes here will conflict with a pull request that I have
> pending, so please rebase against this PR
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01940.html

Ok, v3 will be from this base,

Knut

> 
> Regards,
> Daniel

[Qemu-devel] [PATCH v2 2/2] socket: Handle race condition between binds to the same port

2017-06-13 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen() but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 tests/Makefile.include |   2 +-
 util/qemu-sockets.c| 109 +-
 2 files changed, 78 insertions(+), 33 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index a492285..d8f3bde 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index b39ae74..e6ac743 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -133,6 +133,64 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+if (!e->ai_next) {
+error_setg_errno(errp, errno, "Failed to create socket");
+}
+return -1;
+}
+
+socket_set_fast_reuse(slisten);
+#ifdef IPV6_V6ONLY
+if (e->ai_family == PF_INET6) {
+/* listen on both ipv4 and ipv6 */
+const int off = 0;
+qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &off,
+sizeof(off));
+}
+#endif
+return slisten;
+}
+
+static int try_bind_listen(int *socket, struct addrinfo *e,
+   int port, Error **errp)
+{
+int s = *socket;
+int ret;
+
+inet_setport(e, port);
+ret = bind(s, e->ai_addr, e->ai_addrlen);
+if (ret) {
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to bind socket");
+}
+return errno;
+}
+if (listen(s, 1) == 0) {
+return 0;
+}
+if (errno == EADDRINUSE) {
+/* We got to bind the socket to a port but someone else managed
+ * to bind to the same port and beat us to listen on it!
+ * Recreate the socket and return EADDRINUSE to preserve the
+ * expected state by the caller:
+ */
+closesocket(s);
+s = create_fast_reuse_socket(e, errp);
+if (s < 0) {
+return errno;
+}
+*socket = s;
+errno = EADDRINUSE;
+return errno;
+}
+error_setg_errno(errp, errno, "Failed to listen on socket");
+return errno;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -142,7 +200,9 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char port[33];
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
-int slisten, rc, port_min, port_max, p;
+int rc, port_min, port_max, p;
+int slisten = 0;
+int saved_errno = 0;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -194,54 +254,39 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e, &err);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}

[Qemu-devel] [PATCH v2 1/2] Add test-listen - a stress test for QEMU socket listen

2017-06-13 Thread Knut Omang

There's a potential race condition between multiple bind()'s
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When a subsequent listen() call is made with the same socket
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to
bind to the same port, with a large enough offset input to
allow each thread to get its own port.
The "serial" subtest just does the same, except in series in a
single thread.

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way,
i.e. it fails a majority of the time, but not at a 100% rate; occasional
passes can be seen. Nevertheless, this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 141 ++-
 2 files changed, 143 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index f42f3df..a492285 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,6 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -760,6 +761,7 @@ tests/test-uuid$(EXESUF): tests/test-uuid.o $(test-util-obj-y)
 tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< ,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..45fe9a8
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,141 @@
+/*
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+};
+
+static char hostname[NAME_LEN + 1];
+static char port[PORT_LEN + 1];
+
+static void *listener_thread(void *arg)
+{
+struct thr_info *thr = (struct thr_info *)arg;
+SocketAddress addr = {
+.type = SOCKET_ADDRESS_TYPE_INET,
+.u = {
+.inet = {
+.host = hostname,
+.port = port,
+.ipv4 = true,
+.has_to = true,
+.to = thr->to_port,
+},
+},
+};
+Error *err = NULL;
+int fd;
+
+fd = socket_listen(&addr, &err);
+if (fd < 0) {
+thr->eno = errno;
+thr->errstr = error_get_pretty(err);
+} else {
+struct sockaddr_in a;
+socklen_t a_len = sizeof(a);
+g_assert_cmpint(getsockname(fd, (struct sockaddr *)&a, &a_len), ==, 0);
+thr->got_port = ntohs(a.sin_port);
+thr->fd = fd;
+}
+return arg;
+}
+
+
+static void listen_compete_nthr(bool threaded, int nthreads,
+int start_port, int max_offset)
+{
+int i;
+int failed_listens = 0;
+size_t alloc_sz = sizeof(struct thr_info) * nthreads;
+struct thr_info *thr = g_malloc(alloc_sz);
+int used[max_offset + 1];
+memset(used, 0, sizeof(used));
+g_assert_nonnull(thr);
+g_assert_cmpint(gethostname(hostname, NAME_LEN), == , 0);
+snprintf(port, PORT_LEN, "%d", start_port);
+memset(thr, 0, alloc_sz);
+
+f

[Qemu-devel] [PATCH v2 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-13 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.

Changes from v1:
* Fix potential uninitialized variable only detected when optimizing.
* Improve unexpected error detection in test-listen to give more
  details about why the test fails unexpectedly.
* Fix some line length style issues.

Thanks,
Knut

Knut Omang (2):
  Add test-listen - a stress test for QEMU socket listen
  socket: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 141 ++-
 util/qemu-sockets.c| 109 ++--
 3 files changed, 220 insertions(+), 32 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: 64175afc695c0672876fbbfc31b299c86d562cb4
-- 
git-series 0.9.1

Re: [Qemu-devel] [PATCH 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-13 Thread Knut Omang

On Fri, 2017-06-09 at 14:04 -0700, no-re...@patchew.org wrote:
> Hi,
> 
> This series failed automatic build test. Please find the testing commands and
> their output below. If you have docker installed, you can probably reproduce
> it locally.
> 
> Message-id: cover.e0862272629975a2a7296103cf5d5f8de70abc01.1497035841.git-series.knut.oma...@oracle.com
> Subject: [Qemu-devel] [PATCH 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port
> Type: series
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> set -e
> git submodule update --init dtc
> # Let docker tests dump environment info
> export SHOW_ENV=1
> export J=8
> time make docker-test-quick@centos6
> time make docker-test-mingw@fedora
> time make docker-test-build@min-glib
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> Switched to a new branch 'test'
> 016f28c socket: Handle race condition between binds to the same port
> b4ff650 Add test-listen - a stress test for QEMU socket listen
> 
> === OUTPUT BEGIN ===
> Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
> Cloning into '/var/tmp/patchew-tester-tmp-x238ee0k/src/dtc'...
> Submodule path 'dtc': checked out '558cd81bdd432769b59bff01240c44f82cfb1a9d'
>   BUILD   centos6
> make[1]: Entering directory '/var/tmp/patchew-tester-tmp-x238ee0k/src'
>   ARCHIVE qemu.tgz
>   ARCHIVE dtc.tgz
>   COPYRUNNER
> RUN test-quick in qemu:centos6 
> Packages installed:
> SDL-devel-1.2.14-7.el6_7.1.x86_64
> ccache-3.1.6-2.el6.x86_64
> epel-release-6-8.noarch
> gcc-4.4.7-17.el6.x86_64
> git-1.7.1-4.el6_7.1.x86_64
> glib2-devel-2.28.8-5.el6.x86_64
> libfdt-devel-1.4.0-1.el6.x86_64
> make-3.81-23.el6.x86_64
> package g++ is not installed
> pixman-devel-0.32.8-1.el6.x86_64
> tar-1.23-15.el6_8.x86_64
> zlib-devel-1.2.3-29.el6.x86_64
> 
> Environment variables:
> PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel glib2-devel SDL-devel pixman-devel epel-release
> HOSTNAME=fdd63eaaa7b5
> TERM=xterm
> MAKEFLAGS= -j8
> HISTSIZE=1000
> J=8
> USER=root
> CCACHE_DIR=/var/tmp/ccache
> EXTRA_CONFIGURE_OPTS=
> V=
> SHOW_ENV=1
> MAIL=/var/spool/mail/root
> PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
> PWD=/
> LANG=en_US.UTF-8
> TARGET_LIST=
> HISTCONTROL=ignoredups
> SHLVL=1
> HOME=/root
> TEST_DIR=/tmp/qemu-test
> LOGNAME=root
> LESSOPEN=||/usr/bin/lesspipe.sh %s
> FEATURES= dtc
> DEBUG=
> G_BROKEN_FILENAMES=1
> CCACHE_HASHDIR=
> _=/usr/bin/env
> 
> Configure options:
> --enable-werror --target-list=x86_64-softmmu,aarch64-softmmu
> --prefix=/var/tmp/qemu-build/install
> No C++ compiler available; disabling C++ specific optional code
> Install prefix    /var/tmp/qemu-build/install
> BIOS directory    /var/tmp/qemu-build/install/share/qemu
> binary directory  /var/tmp/qemu-build/install/bin
> library directory /var/tmp/qemu-build/install/lib
> module directory  /var/tmp/qemu-build/install/lib/qemu
> libexec directory /var/tmp/qemu-build/install/libexec
> include directory /var/tmp/qemu-build/install/include
> config directory  /var/tmp/qemu-build/install/etc
> local state directory   /var/tmp/qemu-build/install/var
> Manual directory  /var/tmp/qemu-build/install/share/man
> ELF interp prefix /usr/gnemul/qemu-%M
> Source path       /tmp/qemu-test/src
> C compiler        cc
> Host C compiler   cc
> C++ compiler      
> Objective-C compiler cc
> ARFLAGS           rv
> CFLAGS            -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
> QEMU_CFLAGS       -I/usr/include/pixman-1 -I$(SRC_PATH)/dtc/libfdt -pthread
> -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -fPIE -DPIE -m64 -mcx16
> -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes
> -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes
> -fno-strict-aliasing -fno-common -fwrapv -Wendif-labels
> -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security
> -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration
> -Wold-style-definition -Wtype-limits -fstack-protector-all
> LDFLAGS           -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
> make              make
> install           install
> python            python -B
> smbd              /usr/sbin/smbd
> module support    no
> host CPU          x86_64
> host big endian   no
> target list       x86_64-softmmu aarch64-softmmu
> tcg debug enabled no
> gprof enabled     no
> sparse enabled    no
> strip binaries    yes
> profiler          no
> static build      no
> pixman            system
> SDL support       yes (1.2.14)
> GTK support       no
> GTK GL support    no
> VTE support       no
> TLS priority      NORMAL
> GNUTLS support    no
> GNUTLS rnd        no
> libgcrypt         no
> libgcrypt kdf     no
> nettle            no
> nettle kdf        no
> libtasn1          no
> curses support    no
> virgl support     no
> curl support

Re: [Qemu-devel] [PATCH 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-12 Thread Knut Omang

On Fri, 2017-06-09 at 14:39 -0500, Eric Blake wrote:
> On 06/09/2017 02:19 PM, Knut Omang wrote:
> > This series contains:
> > * a unit test that exposes a race condition which causes QEMU to fail
> >   to find a port even when there is plenty of available ports.
> > * a refactor of the qemu-sockets inet_listen_saddr() function
> >   to better handle this situation.
> > 
> > Thanks,
> > Knut
> > 
> > Knut Omang (2):
> >   Add test-listen - a stress test for QEMU socket listen
> >   socket: Handle race condition between binds to the same port
> > 
> 
> I'd reorder the series, to put the fix first and the test second, rather
> than the (crippled) test first.  

Are there good reasons not to have the test first (as long as it does not
break the build)? IMHO the logical test-driven approach
is to have the test first to highlight/reproduce the issue.

> Someone that wants to prove that the
> test works can easily apply the patches out of order.

Yes, but in my view that's both less logical and less convenient?

Thanks,
Knut

[Qemu-devel] [PATCH 2/2] socket: Handle race condition between binds to the same port

2017-06-09 Thread Knut Omang

If an offset of ports is specified to the inet_listen_saddr() function,
and two or more processes try to bind from these ports at the same time,
occasionally more than one process may be able to bind to the same
port. The condition is detected by listen() but too late to avoid a failure.

This function is called by socket_listen() and used
by all socket listening code in QEMU, so all cases where any form of dynamic
port selection is used should be subject to this issue.

Add code to close and re-establish the socket when this
condition is observed, hiding the race condition from the user.
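
As a rough illustration of the approach described above (not the patch
itself), the sketch below scans a port range and, when listen() reports
EADDRINUSE on a socket that was already bound, closes that socket, creates a
fresh one and moves on to the next port. It uses plain POSIX sockets on the
loopback address; make_reuse_socket() and listen_in_range() are hypothetical
names, and using SO_REUSEADDR as the "fast reuse" option is an assumption
made for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP/IPv4 socket with SO_REUSEADDR set
 * (assumed stand-in for a "fast reuse" socket). */
static int make_reuse_socket(void)
{
    int on = 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s >= 0) {
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    }
    return s;
}

/* Listen on the first usable loopback port in [port_min, port_max].
 * Returns the listening fd and stores the port in *got_port,
 * or returns -1 with errno set. */
static int listen_in_range(int port_min, int port_max, int *got_port)
{
    int s = make_reuse_socket();
    int p, saved_errno;

    for (p = port_min; s >= 0 && p <= port_max; p++) {
        struct sockaddr_in sa;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = htons((unsigned short)p);

        if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            if (errno == EADDRINUSE) {
                continue;           /* port taken: try the next one */
            }
            break;                  /* unexpected bind error */
        }
        if (listen(s, 1) == 0) {
            *got_port = p;
            return s;
        }
        if (errno != EADDRINUSE) {
            break;                  /* unexpected listen error */
        }
        /* Lost the race: we are bound, but someone else listens here.
         * A bound socket cannot be bound again, so start afresh. */
        close(s);
        s = make_reuse_socket();
    }
    saved_errno = errno;
    if (s >= 0) {
        close(s);
    }
    errno = saved_errno;
    return -1;
}
```

Two back-to-back calls such as listen_in_range(47000, 47050, &port) should
then end up on different ports, since the first socket is still listening
when the second call scans the range.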

This has been developed and tested by means of the
test-listen unit test in the previous commit.
Enable the test for make check now that it passes.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 util/qemu-sockets.c| 106 +-
 2 files changed, 76 insertions(+), 32 deletions(-)

diff --git a/tests/Makefile.include b/tests/Makefile.include
index a492285..d8f3bde 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,7 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
-#check-unit-y += tests/test-listen$(EXESUF)
+check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index b39ae74..693c1ed 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -133,6 +133,64 @@ int inet_ai_family_from_address(InetSocketAddress *addr,
 return PF_UNSPEC;
 }
 
+static int create_fast_reuse_socket(struct addrinfo *e, Error **errp)
+{
+int slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+if (slisten < 0) {
+if (!e->ai_next) {
+error_setg_errno(errp, errno, "Failed to create socket");
+}
+return -1;
+}
+
+socket_set_fast_reuse(slisten);
+#ifdef IPV6_V6ONLY
+if (e->ai_family == PF_INET6) {
+/* listen on both ipv4 and ipv6 */
+const int off = 0;
+qemu_setsockopt(slisten, IPPROTO_IPV6, IPV6_V6ONLY, &off,
+sizeof(off));
+}
+#endif
+return slisten;
+}
+
+static int try_bind_listen(int *socket, struct addrinfo *e,
+   int port, Error **errp)
+{
+int s = *socket;
+int ret;
+
+inet_setport(e, port);
+ret = bind(s, e->ai_addr, e->ai_addrlen);
+if (ret) {
+if (errno != EADDRINUSE) {
+error_setg_errno(errp, errno, "Failed to bind socket");
+}
+return errno;
+}
+if (listen(s, 1) == 0) {
+return 0;
+}
+if (errno == EADDRINUSE) {
+/* We got to bind the socket to a port but someone else managed
+ * to bind to the same port and beat us to listen on it!
+ * Recreate the socket and return EADDRINUSE to preserve the
+ * expected state by the caller:
+ */
+closesocket(s);
+s = create_fast_reuse_socket(e, errp);
+if (s < 0) {
+return errno;
+}
+*socket = s;
+errno = EADDRINUSE;
+return errno;
+}
+error_setg_errno(errp, errno, "Failed to listen on socket");
+return errno;
+}
+
 static int inet_listen_saddr(InetSocketAddress *saddr,
  int port_offset,
  bool update_addr,
@@ -143,6 +201,7 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 char uaddr[INET6_ADDRSTRLEN+1];
 char uport[33];
 int slisten, rc, port_min, port_max, p;
+int saved_errno = 0;
 Error *err = NULL;
 
 memset(&ai,0, sizeof(ai));
@@ -194,54 +253,39 @@ static int inet_listen_saddr(InetSocketAddress *saddr,
 return -1;
 }
 
-/* create socket + bind */
+/* create socket + bind/listen */
 for (e = res; e != NULL; e = e->ai_next) {
 getnameinfo((struct sockaddr*)e->ai_addr,e->ai_addrlen,
uaddr,INET6_ADDRSTRLEN,uport,32,
NI_NUMERICHOST | NI_NUMERICSERV);
-slisten = qemu_socket(e->ai_family, e->ai_socktype, e->ai_protocol);
+
+slisten = create_fast_reuse_socket(e, &err);
 if (slisten < 0) {
-if (!e->ai_next) {
-error_setg_errno(errp, errno, "Failed to create socket");
-}
 continue;
 }
-
-socket_set_fast_reuse(slisten);
-#ifdef IPV6_V6ONLY
-if (e->ai_family == PF_INET6) {
-

[Qemu-devel] [PATCH 1/2] Add test-listen - a stress test for QEMU socket listen

2017-06-09 Thread Knut Omang

There's a potential race condition between multiple bind() calls
attempting to bind to the same port, which occasionally
allows more than one bind to succeed against the same port.

When listen() is subsequently called on each of these sockets,
only one will succeed.

The current QEMU code does however not take this situation into account
and the listen will cause the code to break out and fail even
when there are actually available ports to use.

This test exposes two subtests:

/socket/listen-serial
/socket/listen-compete

The "compete" subtest creates a number of threads and has them all try to
bind to the same port, with a large enough offset input to
allow each thread to get its own port.
The "serial" subtest just does the same, except in series in a
single thread.
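
The shape of the two subtests can be sketched outside the QEMU tree roughly
as follows. This is a standalone approximation using POSIX sockets and
pthreads, not the contents of tests/test-listen.c: struct listener,
listener_thread() and run_listeners() are names chosen for the example, and
the listener here already retries when listen() reports EADDRINUSE, so both
modes are expected to succeed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_LISTENERS 16

struct listener {
    pthread_t thread;
    int start_port;     /* first port to try */
    int max_offset;     /* last port tried is start_port + max_offset */
    int got_port;       /* port we listen on, or -1 on failure */
    int fd;             /* listening socket, or -1 */
};

static int reuse_socket(void)
{
    int on = 1;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s >= 0) {
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    }
    return s;
}

/* Scan the port range; on a lost bind/listen race, replace the socket. */
static void *listener_thread(void *arg)
{
    struct listener *l = arg;
    int s = reuse_socket();
    int p;

    l->got_port = -1;
    l->fd = -1;
    for (p = l->start_port; s >= 0 && p <= l->start_port + l->max_offset; p++) {
        struct sockaddr_in sa;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family = AF_INET;
        sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sa.sin_port = htons((unsigned short)p);

        if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
            if (listen(s, 1) == 0) {
                l->got_port = p;
                l->fd = s;
                return arg;
            }
            if (errno != EADDRINUSE) {
                break;
            }
            close(s);               /* bound, but someone else listens */
            s = reuse_socket();
        } else if (errno != EADDRINUSE) {
            break;
        }
    }
    if (s >= 0) {
        close(s);
    }
    return arg;
}

/* Run n listeners in parallel (threaded != 0) or in series.
 * Returns how many of them ended up listening on a port of their own. */
static int run_listeners(int threaded, int n, int start_port, int max_offset)
{
    struct listener l[MAX_LISTENERS];
    int started[MAX_LISTENERS];
    int i, j, unique = 0;

    if (n > MAX_LISTENERS) {
        n = MAX_LISTENERS;
    }
    for (i = 0; i < n; i++) {
        l[i].start_port = start_port;
        l[i].max_offset = max_offset;
        started[i] = threaded &&
            pthread_create(&l[i].thread, NULL, listener_thread, &l[i]) == 0;
        if (!started[i]) {
            listener_thread(&l[i]); /* serial mode, or no thread available */
        }
    }
    for (i = 0; i < n; i++) {
        if (started[i]) {
            pthread_join(l[i].thread, NULL);
        }
    }
    for (i = 0; i < n; i++) {
        int dup = l[i].got_port < 0;

        for (j = 0; j < i; j++) {
            dup |= l[j].got_port == l[i].got_port;
        }
        unique += !dup;
    }
    for (i = 0; i < n; i++) {
        if (l[i].fd >= 0) {
            close(l[i].fd);
        }
    }
    return unique;
}
```

Calling run_listeners(0, 8, 48100, 63) corresponds to the "serial" case and
run_listeners(1, 8, 48100, 63) to the "compete" case; the port numbers are
arbitrary values picked for the example.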

The serial version passes, probably in most versions of QEMU.

The parallel version exposes the problem in a relatively reliable way:
it fails a majority of the time, though not 100% of the time, as occasional
passes can be seen. Nevertheless this is quite good given that
the bug was tricky to reproduce and has been left undetected for
a while.

The problem seems to be present in all versions of QEMU.

The original failure scenario occurred with VNC port allocation
in a traditional Xen based build, in different code
but with similar functionality.

Reported-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
Reviewed-by: Yuval Shaia <yuval.sh...@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.da...@oracle.com>
Reviewed-by: Girish Moodalbail <girish.moodalb...@oracle.com>
---
 tests/Makefile.include |   2 +-
 tests/test-listen.c| 135 ++-
 2 files changed, 137 insertions(+)
 create mode 100644 tests/test-listen.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index f42f3df..a492285 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -127,6 +127,7 @@ check-unit-y += tests/test-bufferiszero$(EXESUF)
 gcov-files-check-bufferiszero-y = util/bufferiszero.c
 check-unit-y += tests/test-uuid$(EXESUF)
 check-unit-y += tests/ptimer-test$(EXESUF)
+#check-unit-y += tests/test-listen$(EXESUF)
 gcov-files-ptimer-test-y = hw/core/ptimer.c
 check-unit-y += tests/test-qapi-util$(EXESUF)
 gcov-files-test-qapi-util-y = qapi/qapi-util.c
@@ -760,6 +761,7 @@ tests/test-uuid$(EXESUF): tests/test-uuid.o 
$(test-util-obj-y)
 tests/test-arm-mptimer$(EXESUF): tests/test-arm-mptimer.o
 tests/test-qapi-util$(EXESUF): tests/test-qapi-util.o $(test-util-obj-y)
 tests/numa-test$(EXESUF): tests/numa-test.o
+tests/test-listen$(EXESUF): tests/test-listen.o $(test-util-obj-y)
 
 tests/migration/stress$(EXESUF): tests/migration/stress.o
$(call quiet-command, $(LINKPROG) -static -O3 $(PTHREAD_LIB) -o $@ $< 
,"LINK","$(TARGET_DIR)$@")
diff --git a/tests/test-listen.c b/tests/test-listen.c
new file mode 100644
index 000..517b6ed
--- /dev/null
+++ b/tests/test-listen.c
@@ -0,0 +1,135 @@
+/*
+ * Test parallel port listen configuration with
+ * dynamic port allocation
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "qemu/thread.h"
+#include "qemu/sockets.h"
+#include "qapi/error.h"
+
+#define NAME_LEN 1024
+#define PORT_LEN 16
+
+struct thr_info {
+QemuThread thread;
+int to_port;
+int got_port;
+int eno;
+int fd;
+const char *errstr;
+};
+
+static char hostname[NAME_LEN + 1];
+static char port[PORT_LEN + 1];
+
+static void *listener_thread(void *arg)
+{
+struct thr_info *thr = (struct thr_info *)arg;
+SocketAddress addr = {
+.type = SOCKET_ADDRESS_TYPE_INET,
+.u = {
+.inet = {
+.host = hostname,
+.port = port,
+.ipv4 = true,
+.has_to = true,
+.to = thr->to_port,
+},
+},
+};
+Error *err = NULL;
+int fd;
+
+fd = socket_listen(&addr, &err);
+if (fd < 0) {
+thr->eno = errno;
+thr->errstr = error_get_pretty(err);
+} else {
+struct sockaddr_in a;
+socklen_t a_len = sizeof(a);
+g_assert_cmpint(getsockname(fd, (struct sockaddr *)&a, &a_len), ==, 0);
+thr->got_port = ntohs(a.sin_port);
+thr->fd = fd;
+}
+return arg;
+}
+
+
+static void listen_compete_nthr(bool threaded, int nthreads,
+int start_port, int max_offset)
+{
+int i;
+int failed_listens = 0;
+size_t alloc_sz = sizeof(struct thr_info) * nthreads;
+struct thr_info *thr = g_malloc(alloc_sz);
+int used[max_offset + 1];
+memset(used, 0, sizeof(used));
+g_assert_nonnull(thr);
+g_assert_cmpint(gethostname(hostname, NAME_LEN), == , 0);
+snprintf(port, PORT_LEN, "%d", start_port);
+memset(thr, 0, alloc_sz);
+
+f

[Qemu-devel] [PATCH 0/2] Unit test+fix for problem with QEMU handling of multiple bind()s to the same port

2017-06-09 Thread Knut Omang

This series contains:
* a unit test that exposes a race condition which causes QEMU to fail
  to find a port even when there are plenty of available ports.
* a refactor of the qemu-sockets inet_listen_saddr() function
  to better handle this situation.

Thanks,
Knut

Knut Omang (2):
  Add test-listen - a stress test for QEMU socket listen
  socket: Handle race condition between binds to the same port

 tests/Makefile.include |   2 +-
 tests/test-listen.c| 135 ++-
 util/qemu-sockets.c| 106 +++--
 3 files changed, 212 insertions(+), 31 deletions(-)
 create mode 100644 tests/test-listen.c

base-commit: 64175afc695c0672876fbbfc31b299c86d562cb4
-- 
git-series 0.9.1

Re: [Qemu-devel] Proposal for deprecating unsupported host OSes & architecutures

2017-03-28 Thread Knut Omang

On Mon, 2017-03-27 at 17:32 +0100, Peter Maydell wrote:
> On 26 March 2017 at 10:16, Knut Omang <knut.om...@oracle.com> wrote:
> > On Sat, 2017-03-25 at 21:15 +, Peter Maydell wrote:
> >> On 25 March 2017 at 20:49, Knut Omang <knut.om...@oracle.com> wrote:
> >> >
> >> > Can we please keep the Sparc support in for a while still?
> >>
> >> Yes, John Paul Adrian Glaubitz and the Debian Project have
> >> kindly provided me with access to a Sparc box. I'm planning
> >> to send a patch that puts sparc into the 'supported'
> >> category before 2.9 release.
> >
> > good to hear!
> 
> It occurred to me that it might be helpful to point out that
> support of Solaris as a host OS is a separate thing and is
> still in the 'unsupported' category. 

Yes, that's my understanding too, but thanks for highlighting it.

> (It's also quite high
> on the list of stuff to drop because I suspect it's broken
> and it's one of the things making a mess of our configure code.)

I agree, and from what I have been able to figure out, 
it seems there are no imminent problems with removing it,

Thanks,
Knut

> 
> thanks
> -- PMM

Re: [Qemu-devel] Proposal for deprecating unsupported host OSes & architecutures

2017-03-26 Thread Knut Omang

On Sat, 2017-03-25 at 21:15 +, Peter Maydell wrote:
> On 25 March 2017 at 20:49, Knut Omang <knut.om...@oracle.com> wrote:
> > 
> > Can we please keep the Sparc support in for a while still?
> 
> Yes, John Paul Adrian Glaubitz and the Debian Project have
> kindly provided me with access to a Sparc box. I'm planning
> to send a patch that puts sparc into the 'supported'
> category before 2.9 release.

good to hear!

> I would note that so far I've found a couple of TCG
> bugs of the "this just doesn't work at all" level, so
> it doesn't look like anybody using sparc has been
> doing any testing against git master...

No, I tried a while ago myself without much success so there's awareness
about this sad state. Hopefully we can get some momentum on it soon.

> > 
> > When it comes to build platforms, a legitimate need to be able to keep
> > anything running, I don't have any authority to promise away hardware or
> > other forms of Sparc access, but I have been told that that part can be
> > worked out in some way if we get enough support for this internally.
> 
> I'd recommend trying to get a machine into the gcc compile
> farm (assuming they'd be willing to take it) -- that way
> it's accessible to developers for a range of open
> source projects.

I'll bring that suggestion forward,

Thanks,
Knut

> 
> thanks
> -- PMM

Re: [Qemu-devel] Proposal for deprecating unsupported host OSes & architecutures

2017-03-25 Thread Knut Omang

On Thu, 2017-03-16 at 15:23 +, Peter Maydell wrote:
> OK, here's a concrete proposal for deprecating/dropping out of
> date host OS and architecture support.
> 
> We'll put this in the ChangeLog 'Future incompatible changes'
> section:
> -
> * Removal of support for untested host OS and architectures:
> 
> The QEMU Project intends to drop support in a future release for any
> host OS or architecture which we do not have access to a build and test
> machine for. This affects the following host OSes:
>  * Native CYGWIN building
>  * GNU/kFreeBSD
>  * FreeBSD
>  * DragonFly BSD
>  * NetBSD
>  * OpenBSD
>  * Solaris
>  * AIX
>  * Haiku
> and the following host CPU architectures:
>  * ia64
>  * sparc

Can we please keep the Sparc support in for a while still?

There is an increasing recognition of the value of better support for 
QEMU within Oracle and hopefully we can get some traction on this in not too
long.

I have colleagues currently looking at various ways forward
both in the direction of implementing support for newer Sparc architectures in
QEMU and also investigating support for native KVM support for Linux on Sparc.

When it comes to build platforms, a legitimate need to be able to keep anything
running, I don't have any authority to promise away hardware or other forms of
Sparc access, but I have been told that that part can be worked out in some way
if we get enough support for this internally. 

Thanks,
Knut

> Specifically, if we do not have a build and test system available
> to us by the time we release QEMU 2.10, we will remove support in the
> release that follows 2.10.
> -
> 
> I'm not sure here if we want to just have this as a bald list,
> or to have some kind of two tier setup with OSes we expect to
> dump in one tier and OSes where we're really trolling for a build
> machine in the other tier (the "unlikely to dump" category would
> get most of the BSD variants in it). Putting out a changelog
> that says "we're gonna drop all the BSDs" seems like it might
> produce a lot of yelling?
> 
> Should "native CYGWIN" be in the drop list? I only test
> mingw cross compile, but configure has a separate section for
> CYGWIN in its $targetos case statement.
> 
> It would also not be too difficult to make configure warn when it
> is run on the deprecated OS or architecture, so we should probably
> sneak that into 2.9.
> 
> (Technically right this instant 'mips' and 's390' would be in the
> 'dump' list, since I don't personally have access yet. But we have
> a plan for s390, and it turns out there is a mips machine in the
> gcc compile farm which I'm just checking out.)
> 
> thanks
> -- PMM
>

Re: [Qemu-devel] Possible reference leak in device_set_realized(...)

2016-01-12 Thread Knut Omang

On Fri, 2016-01-01 at 23:37 +0100, Paolo Bonzini wrote:
> 
> On 31/12/2015 19:13, Ilya Lesokhin wrote:
> > I was able to overcome this issue by calling object_unparent on my
> > device but I’m not sure that the correct way of fixing it.
> 
> Yes, it's definitely the right way to fix it.

Sorry for the late follow-up on this one, but I had to find some more
time to spend with the code (and with valgrind too) to understand
better/verify what was going on in the qdev/qom layers.

In the SR/IOV patch the object is created by pci_create.
Since there is no corresponding pci_delete, I assume this means that
the correct way to clean up from pci_create is simply a call to
object_unparent() as you indicate, and this is what is missing from the
patch set.

So the full setup/teardown sequence per VF then becomes:

pci_create(...)

object_unparent(...)
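
To make the reasoning concrete, here is a toy reference-counting model of
why the create call needs a matching unparent. It is only a sketch under
simplifying assumptions: struct toy_obj and the toy_*() functions are
invented for this example and are not QEMU's Object type or QOM API. The
assumption modelled is that, once creation has completed, the only remaining
reference to the child is the one held through its parent link.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted object (not QEMU's Object). */
struct toy_obj {
    int refcount;
    int finalized;
    struct toy_obj *parent;
};

static void toy_init(struct toy_obj *o)
{
    o->refcount = 1;            /* reference held by the creator */
    o->finalized = 0;
    o->parent = NULL;
}

static void toy_unref(struct toy_obj *o)
{
    if (--o->refcount == 0) {
        o->finalized = 1;       /* this is where memory would be freed */
    }
}

/* Create a child attached to a parent: the parent link takes its own
 * reference and the creator's reference is dropped again. */
static void toy_create_child(struct toy_obj *parent, struct toy_obj *child)
{
    toy_init(child);
    child->parent = parent;
    child->refcount++;          /* reference held via the parent link */
    toy_unref(child);           /* creator lets go; parent keeps it alive */
}

/* Detach from the parent, dropping the reference the link was holding. */
static void toy_unparent(struct toy_obj *child)
{
    if (child->parent) {
        child->parent = NULL;
        toy_unref(child);
    }
}
```

In this model a child that is created but never unparented keeps a refcount
of 1 forever, which is the leak; calling toy_unparent() brings it to 0 and
lets it be finalized.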

Thanks,
Knut

Re: [Qemu-devel] [PATCH v6 0/4] pcie: Add support for Single Root I/O Virtualization

2015-10-22 Thread Knut Omang

Michael,

I just realized that this now went out without 

Reviewed-by: Marcel Apfelbaum <mar...@redhat.com>

to patches 2-4, 

Sorry about that - can you add it for me?

Thanks,
Knut

On Thu, 2015-10-22 at 10:01 +0200, Knut Omang wrote:
> This patch set implements generic support for SR/IOV as an extension
> to the core PCIe functionality, similar to the way other capabilities
> such as AER is implemented.
> There is no implementation of any device that provides
> SR/IOV support included, but I have implemented a test
> example which can be found together with this patch set here:
> 
>   git://github.com/knuto/qemu.git sriov_patches_v6
> 
> Testing with the example device was documented here:
> 
>   
> http://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg05110.html
> 
> Changes since v5:
>   - Fix reset logic that got broken in v5. Reset logic is now equal
>     to that of v4 except that two ambiguous initialization statements
>     (introduced during rebase) have been removed
>   - From private feedback, added observer functions for SR/IOV values
>     in pcie_sriov.h. To ease access to the vf number, the SR/IOV VF
>     device struct extension now caches this value.
> 
> Changes since v4:
>   - Mostly based on feedback in Marcel Apfelbaum's review:
>   - The patch with changes to pci_regs.h got eliminated by rebase
>   - Added some documentation as an additional patch
>   - Some trivial fixes moved to separate patch
>   - Modified code to use error and trace functions instead of printfs
> 
> Changes since v3:
>   - Reworked 'pci: Update pci_regs header' to merge kernel version
>     improvements with the current qemu version instead of copying
>     from the kernel version.
> 
> Changes since v2:
>   - Rebased onto 090d0bfd
>   - Un-qdev'ified - avoids issues when resetting NUM_VFS
>   - Fixed handling of vf_offset/vf_stride
> 
> Changes since v1:
>   - Rebased on top of latest master, eliminating prereqs.
>   - Implement proper support for VF_STRIDE, VF_OFFSET and SUP_PGSIZE
> Time better spent fixing it than explaining what the previous
> limitations were.
> - Added new first patch to fix pci bug related to this
>   - Split out patch to pci_default_config_write to a separate patch 2
> to highlight bug fix.
>   - Refactored out logic into new source files
> hw/pci/pcie_sriov.c include/hw/pci/pcie_sriov.h
> similar to pcie_aer.c/h.
>   - Rename functions and introduce structs to better separate
> pf and vf functionality.
>   - Replaced is_vf member with pci_is_vf() function abstraction
>   - Fix numerous syntax, whitespace and comment issues
> according to Michael's review.
>   - Fix memory leaks.
>   - Removed igb example device - a rebased version available
> on github instead.
> 
> Knut Omang (4):
>   pci: Make use of the devfn property when registering new devices
>   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
>   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
>   pcie: A few minor fixes (type+code simplify)
> 
>  docs/pcie_sriov.txt | 115 ++
>  hw/pci/Makefile.objs|   2 +-
>  hw/pci/pci.c|  97 
>  hw/pci/pcie.c   |   9 +-
>  hw/pci/pcie_sriov.c | 277
> 
>  include/hw/pci/pci.h|  11 +-
>  include/hw/pci/pcie.h   |   6 +
>  include/hw/pci/pcie_sriov.h |  67 +++
>  include/qemu/typedefs.h |   2 +
>  trace-events|   5 +
>  10 files changed, 561 insertions(+), 30 deletions(-)
>  create mode 100644 docs/pcie_sriov.txt
>  create mode 100644 hw/pci/pcie_sriov.c
>  create mode 100644 include/hw/pci/pcie_sriov.h
> 
> --
> 2.4.3

[Qemu-devel] [PATCH v6 4/4] pcie: A few minor fixes (type+code simplify)

2015-10-22 Thread Knut Omang

- Fix comment typo in pcie_cap_slot_write_config
- Simplify code in pcie_cap_slot_hot_unplug_request_cb.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 hw/pci/pcie.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 774b9ed..ba49c0f 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -265,10 +265,11 @@ void pcie_cap_slot_hot_unplug_request_cb(HotplugHandler 
*hotplug_dev,
  DeviceState *dev, Error **errp)
 {
 uint8_t *exp_cap;
+PCIDevice *pdev = PCI_DEVICE(hotplug_dev);
 
-pcie_cap_slot_hotplug_common(PCI_DEVICE(hotplug_dev), dev, &exp_cap, errp);
+pcie_cap_slot_hotplug_common(pdev, dev, &exp_cap, errp);
 
-pcie_cap_slot_push_attention_button(PCI_DEVICE(hotplug_dev));
+pcie_cap_slot_push_attention_button(pdev);
 }
 
 /* pci express slot for pci express root/downstream port
@@ -408,7 +409,7 @@ void pcie_cap_slot_write_config(PCIDevice *dev,
 }
 
 /*
- * If the slot is polulated, power indicator is off and power
+ * If the slot is populated, power indicator is off and power
  * controller is off, it is safe to detach the devices.
  */
 if ((sltsta & PCI_EXP_SLTSTA_PDS) && (val & PCI_EXP_SLTCTL_PCC) &&
-- 
2.4.3

[Qemu-devel] [PATCH v6 2/4] pcie: Add support for Single Root I/O Virtualization (SR/IOV)

2015-10-22 Thread Knut Omang

This patch provides the building blocks for creating an SR/IOV
PCIe Extended Capability header and register/unregister
SR/IOV Virtual Functions.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 hw/pci/Makefile.objs|   2 +-
 hw/pci/pci.c|  95 +++
 hw/pci/pcie.c   |   2 +-
 hw/pci/pcie_sriov.c | 277 
 include/hw/pci/pci.h|  11 +-
 include/hw/pci/pcie.h   |   6 +
 include/hw/pci/pcie_sriov.h |  67 +++
 include/qemu/typedefs.h |   2 +
 trace-events|   5 +
 9 files changed, 441 insertions(+), 26 deletions(-)
 create mode 100644 hw/pci/pcie_sriov.c
 create mode 100644 include/hw/pci/pcie_sriov.h

diff --git a/hw/pci/Makefile.objs b/hw/pci/Makefile.objs
index 9f905e6..2226980 100644
--- a/hw/pci/Makefile.objs
+++ b/hw/pci/Makefile.objs
@@ -3,7 +3,7 @@ common-obj-$(CONFIG_PCI) += msix.o msi.o
 common-obj-$(CONFIG_PCI) += shpc.o
 common-obj-$(CONFIG_PCI) += slotid_cap.o
 common-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
-common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o
+common-obj-$(CONFIG_PCI) += pcie.o pcie_aer.o pcie_port.o pcie_sriov.o
 
 common-obj-$(call lnot,$(CONFIG_PCI)) += pci-stub.o
 common-obj-$(CONFIG_ALL) += pci-stub.o
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index b095cfe..3a6cce3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -153,6 +153,9 @@ int pci_bar(PCIDevice *d, int reg)
 {
 uint8_t type;
 
+/* PCIe virtual functions do not have their own BARs */
+assert(!pci_is_vf(d));
+
 if (reg != PCI_ROM_SLOT)
 return PCI_BASE_ADDRESS_0 + reg * 4;
 
@@ -211,22 +214,13 @@ void pci_device_deassert_intx(PCIDevice *dev)
 }
 }
 
-static void pci_do_device_reset(PCIDevice *dev)
+static void pci_reset_regions(PCIDevice *dev)
 {
 int r;
+if (pci_is_vf(dev)) {
+return;
+}
 
-pci_device_deassert_intx(dev);
-assert(dev->irq_state == 0);
-
-/* Clear all writable bits */
-pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
- pci_get_word(dev->wmask + PCI_COMMAND) |
- pci_get_word(dev->w1cmask + PCI_COMMAND));
-pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
- pci_get_word(dev->wmask + PCI_STATUS) |
- pci_get_word(dev->w1cmask + PCI_STATUS));
-dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
-dev->config[PCI_INTERRUPT_LINE] = 0x0;
 for (r = 0; r < PCI_NUM_REGIONS; ++r) {
 PCIIORegion *region = &dev->io_regions[r];
 if (!region->size) {
@@ -240,6 +234,23 @@ static void pci_do_device_reset(PCIDevice *dev)
 pci_set_long(dev->config + pci_bar(dev, r), region->type);
 }
 }
+}
+
+static void pci_do_device_reset(PCIDevice *dev)
+{
+pci_device_deassert_intx(dev);
+assert(dev->irq_state == 0);
+
+/* Clear all writable bits */
+pci_word_test_and_clear_mask(dev->config + PCI_COMMAND,
+ pci_get_word(dev->wmask + PCI_COMMAND) |
+ pci_get_word(dev->w1cmask + PCI_COMMAND));
+pci_word_test_and_clear_mask(dev->config + PCI_STATUS,
+ pci_get_word(dev->wmask + PCI_STATUS) |
+ pci_get_word(dev->w1cmask + PCI_STATUS));
+dev->config[PCI_CACHE_LINE_SIZE] = 0x0;
+dev->config[PCI_INTERRUPT_LINE] = 0x0;
+pci_reset_regions(dev);
 pci_update_mappings(dev);
 
 msi_reset(dev);
@@ -771,6 +782,15 @@ static void pci_init_multifunction(PCIBus *bus, PCIDevice 
*dev, Error **errp)
 dev->config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
 }
 
+/* With SR/IOV and ARI, a device at function 0 need not be a multifunction
+ * device, as it may just be a VF that ended up with function 0 in
+ * the legacy PCI interpretation. Avoid failing in such cases:
+ */
+if (pci_is_vf(dev) &&
+dev->exp.sriov_vf.pf->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+return;
+}
+
 /*
  * multifunction bit is interpreted in two ways as follows.
  *   - all functions must set the bit to 1.
@@ -962,6 +982,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 uint64_t wmask;
 pcibus_t size = memory_region_size(memory);
 
+assert(!pci_is_vf(pci_dev)); /* VFs must use pcie_sriov_vf_register_bar */
 assert(region_num >= 0);
 assert(region_num < PCI_NUM_REGIONS);
 if (size & (size-1)) {
@@ -1060,11 +1081,44 @@ pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int 
region_num)
 return pci_dev->io_regions[region_num].addr;
 }
 
-static pcibus_t pci_bar_address(PCIDevice *d,
-   int reg, uint8_t type, pcibus_t size)
+
+static pcibus_t pci_config_get_bar_addr(PCIDevice *d, int reg,
+

[Qemu-devel] [PATCH v6 1/4] pci: Make use of the devfn property when registering new devices

2015-10-22 Thread Knut Omang

Without this, the devfn argument to pci_create_*()
does not affect the assigned devfn.

Needed to support (VF_STRIDE,VF_OFFSET) values other than (1,1)
for SR/IOV.
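
For context, the devfn a VF ends up with follows from the PF's devfn, the
first-VF offset and the VF stride. The sketch below shows that arithmetic
under the assumption that all VFs stay on the PF's bus, so only the low
8 bits matter; vf_devfn() and the DEVFN_*() macros are names made up for
this example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Legacy (non-ARI) split of an 8-bit devfn: 5-bit device, 3-bit function. */
#define DEVFN_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn) ((devfn) & 0x07)

/* devfn of VF number vf_index (0-based), assuming the VF stays on the
 * same bus as the PF so that the result fits in 8 bits. */
static uint8_t vf_devfn(uint8_t pf_devfn, uint16_t vf_offset,
                        uint16_t vf_stride, unsigned vf_index)
{
    return (uint8_t)(pf_devfn + vf_offset + vf_index * vf_stride);
}
```

With (VF_STRIDE,VF_OFFSET) = (1,1) and the PF at devfn 0, the VFs land on
devfn 1, 2, 3 and so on. With an offset of 8 and a stride of 2 the first VF
lands on devfn 8, which a legacy interpretation would read as device 1,
function 0.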

Reviewed-by: Marcel Apfelbaum <mar...@redhat.com>
Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 hw/pci/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index b0bf540..b095cfe 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1840,7 +1840,7 @@ static void pci_qdev_realize(DeviceState *qdev, Error 
**errp)
 bus = PCI_BUS(qdev_get_parent_bus(qdev));
 pci_dev = do_pci_register_device(pci_dev, bus,
  object_get_typename(OBJECT(qdev)),
- pci_dev->devfn, errp);
+ object_property_get_int(OBJECT(qdev), 
"addr", NULL), errp);
 if (pci_dev == NULL)
 return;
 
-- 
2.4.3

[Qemu-devel] [PATCH v6 3/4] pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt

2015-10-22 Thread Knut Omang

Add a small intro + minimal documentation for how to
implement SR/IOV support for an emulated device.

Signed-off-by: Knut Omang <knut.om...@oracle.com>
---
 docs/pcie_sriov.txt | 115 
 1 file changed, 115 insertions(+)
 create mode 100644 docs/pcie_sriov.txt

diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
new file mode 100644
index 000..b343671
--- /dev/null
+++ b/docs/pcie_sriov.txt
@@ -0,0 +1,115 @@
+PCI SR/IOV EMULATION SUPPORT
+
+
+Description
+===
+SR/IOV (Single Root I/O Virtualization) is an optional extended capability
+of a PCI Express device. It allows a single physical function (PF) to appear 
as multiple
+virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+QEMU now implements the basic common functionality to enable an emulated device
+to support SR/IOV. No fully implemented device exists in QEMU yet, but a
+proof-of-concept hack of the Intel igb can be found here:
+
+git://github.com/knuto/qemu.git sriov_patches_v6
+
+Implementation
+==============
+Implementing emulation of an SR/IOV capable device typically consists of
+implementing support for two types of device classes: the "normal" physical device
+(PF) and the virtual device (VF). From QEMU's perspective, the VFs are just
+like other devices, except that some of their properties are derived from
+the PF.
+
+A virtual function is different from a physical function in that the BAR
+space for all VFs is defined by the BAR registers in the PF's SR/IOV
+capability. All VFs have the same BARs and BAR sizes.
+
+The address of an access to one of these virtual BARs is then computed as
+
++   <VF BAR start> + <VF number> * <VF BAR size> + <offset>
+
+From our emulation perspective this means that there is a separate call for
+setting up a BAR for a VF.
+
+1) To enable SR/IOV support in the PF, it must be a PCI Express device so
+   you would need to add a PCI Express capability in the normal PCI
+   capability list. You might also want to add an ARI (Alternative
+   Routing-ID Interpretation) capability to indicate that your device
+   supports functions beyond its "own" function space (0-7),
+   which is necessary to support more than 7 functions, or
+   if functions extend beyond offset 7 because they are placed at an
+   offset > 1 or have stride > 1.
+
+   ...
+   #include "hw/pci/pcie.h"
+   #include "hw/pci/pcie_sriov.h"
+
+   pci_your_pf_dev_realize( ... )
+   {
+  ...
+  int ret = pcie_endpoint_cap_init(d, 0x70);
+  ...
+  pcie_ari_init(d, 0x100, 1);
+  ...
+
+  /* Add and initialize the SR/IOV capability */
+  pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+   vf_devid, initial_vfs, total_vfs,
+   fun_offset, stride);
+
+  /* Set up individual VF BARs (parameters as for normal BARs) */
+  pcie_sriov_pf_init_vf_bar( ... )
+  ...
+   }
+
+   For cleanup, you simply call:
+
+  pcie_sriov_pf_exit(device);
+
+   which will delete all the virtual functions and associated resources.
+
+2) Similarly in the implementation of the virtual function, you need to
+   make it a PCI Express device and add a similar set of capabilities
+   except for the SR/IOV capability. Then you need to set up the VF BARs as
+   subregions of the PF's SR/IOV VF BARs by calling
+   pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
+
+   pci_your_vf_dev_realize( ... )
+   {
+  ...
+  int ret = pcie_endpoint_cap_init(d, 0x60);
+  ...
+  pcie_ari_init(d, 0x100, 1);
+  ...
+  memory_region_init(mr, ... )
+  pcie_sriov_vf_register_bar(d, bar_nr, mr);
+  ...
+   }
+
+Testing on Linux guest
+======================
+Testing is easiest if your device driver supports sysfs-based SR/IOV
+enabling. Support for this was added in kernel v3.8, so not all drivers
+support it yet.
+
+To enable 4 VFs for a device at 01:00.0:
+
+   modprobe yourdriver
+   echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+You should now see 4 VFs with lspci.
+To turn SR/IOV off again (the standard requires you to turn it off before you can enable
+another VF count, and the emulation enforces this):
+
+   echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+Older drivers typically provide a max_vfs module parameter
+to enable it at load time:
+
+   modprobe yourdriver max_vfs=4
+
+To disable the VFs again, simply unload the driver:
+
+   rmmod yourdriver
-- 
2.4.3
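
The VF BAR address computation described in the Implementation section of the document above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this series: the function name vf_bar_access_addr is hypothetical, the VF number is assumed to be zero-based, and the offset is assumed to lie within a single VF's BAR.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU function): address of an access at
 * `offset` into the BAR of VF number `vf_number` (assumed zero-based).
 * All VFs share the same BAR size, so the per-VF BARs are laid out
 * back to back, starting at the address programmed into the VF BAR
 * register of the PF's SR/IOV capability.
 */
static uint64_t vf_bar_access_addr(uint64_t vf_bar_start,
                                   uint64_t vf_bar_size,
                                   unsigned vf_number,
                                   uint64_t offset)
{
    /* An offset outside the VF's own BAR would hit a neighbouring VF. */
    assert(offset < vf_bar_size);

    return vf_bar_start + (uint64_t)vf_number * vf_bar_size + offset;
}
```

For example, with a VF BAR starting at 0xfe000000 and a per-VF size of 0x4000, an access at offset 0x10 in VF number 3 resolves to 0xfe00c010.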
