[Qemu-devel] [PATCH for-4.0] spapr: Simplify handling of host-serial and host-model values

2019-03-27 Thread David Gibson
27461d69a0f "ppc: add host-serial and host-model machine attributes
(CVE-2019-8934)" introduced 'host-serial' and 'host-model' machine
properties for spapr to explicitly control the values advertised to the
guest in device tree properties with the same names.

The previous behaviour on KVM was to unconditionally populate the device
tree with the real host serial number and model, which leaks possibly
sensitive information about the host to the guest.

To maintain compatibility for old machine types, we allowed those props
to be set to "passthrough" to take the value from the host as before.  Or
they could be set to "none" to explicitly omit the device tree items.

Special-casing specific values of what's otherwise a user-supplied string
is very ugly.  So, this patch simplifies things by implementing the
backwards compatibility in a different way: we have a machine class flag
set for the older machines, and we only load the host values into the
device tree if A) they're not set by the user and B) we have that flag set.

This does mean that the "passthrough" functionality is no longer available
with the current machine type.  That's ok though: if a user or management
layer really wants the information passed through, they can read it
themselves (OpenStack Nova already does something similar for x86).
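
(For reference, on a Linux/POWER host those strings come from the host
device tree as exposed by the kernel, so a management layer can read them
with something as trivial as the sketch below.  The /proc/device-tree
paths are the ones the passthrough code reads today, to the best of my
knowledge; treat them as an assumption rather than a spec.)

    /* Hedged sketch: read the host model and serial the way a management
     * layer could, straight from the Linux-exposed device tree. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *read_dt_prop(const char *path)
    {
        FILE *f = fopen(path, "r");
        char buf[256];
        size_t n;

        if (!f) {
            return NULL;
        }
        n = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[n] = '\0';
        return strdup(buf);
    }

    int main(void)
    {
        char *model  = read_dt_prop("/proc/device-tree/model");
        char *serial = read_dt_prop("/proc/device-tree/system-id");

        printf("host-model: %s\nhost-serial: %s\n",
               model  ? model  : "(unavailable)",
               serial ? serial : "(unavailable)");
        free(model);
        free(serial);
        return 0;
    }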

It also means the user can't explicitly ask for the values to be omitted
on the old machine types.  I think that's an acceptable trade-off: if you
care enough about not leaking the host information you can either move to
the new machine type, or use a dummy value for the properties.

This also removes an odd inconsistency between running on POWER and on
non-POWER (or non-Linux) hosts: if the host information couldn't be read
from where we expect it (in the host's device tree as exposed by Linux),
we'd fall back to omitting the guest device tree items.

While we're at it, improve some poorly worded comments and the help text
for the properties.

Signed-off-by: David Gibson 
---

I've (tentatively) put this into my ppc-for-4.0 tree already, which I
hope to push in the next few days.  I realize it's very late to make
such a cleanup in 4.0, however I'd like to clean up the interface
before it goes into a released version which we have to support for
ages.

 hw/ppc/spapr.c | 57 ++
 include/hw/ppc/spapr.h |  1 +
 2 files changed, 20 insertions(+), 38 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 6c16d6cfaf..c46c6e2670 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1252,38 +1252,8 @@ static void *spapr_build_fdt(SpaprMachineState *spapr)
 _FDT(fdt_setprop_string(fdt, 0, "model", "IBM pSeries (emulated by 
qemu)"));
 _FDT(fdt_setprop_string(fdt, 0, "compatible", "qemu,pseries"));
 
-/*
- * Add info to guest to indentify which host is it being run on
- * and what is the uuid of the guest
- */
-if (spapr->host_model && !g_str_equal(spapr->host_model, "none")) {
-if (g_str_equal(spapr->host_model, "passthrough")) {
-/* -M host-model=passthrough */
-if (kvmppc_get_host_model(&buf)) {
-_FDT(fdt_setprop_string(fdt, 0, "host-model", buf));
-g_free(buf);
-}
-} else {
-/* -M host-model= */
-_FDT(fdt_setprop_string(fdt, 0, "host-model", spapr->host_model));
-}
-}
-
-if (spapr->host_serial && !g_str_equal(spapr->host_serial, "none")) {
-if (g_str_equal(spapr->host_serial, "passthrough")) {
-/* -M host-serial=passthrough */
-if (kvmppc_get_host_serial(&buf)) {
-_FDT(fdt_setprop_string(fdt, 0, "host-serial", buf));
-g_free(buf);
-}
-} else {
-/* -M host-serial= */
-_FDT(fdt_setprop_string(fdt, 0, "host-serial", 
spapr->host_serial));
-}
-}
-
+/* Guest UUID & Name*/
 buf = qemu_uuid_unparse_strdup(&qemu_uuid);
-
 _FDT(fdt_setprop_string(fdt, 0, "vm,uuid", buf));
 if (qemu_uuid_set) {
 _FDT(fdt_setprop_string(fdt, 0, "system-id", buf));
@@ -1295,6 +1265,21 @@ static void *spapr_build_fdt(SpaprMachineState *spapr)
 qemu_get_vm_name()));
 }
 
+/* Host Model & Serial Number */
+if (spapr->host_model) {
+_FDT(fdt_setprop_string(fdt, 0, "host-model", spapr->host_model));
+} else if (smc->broken_host_serial_model && kvmppc_get_host_model(&buf)) {
+_FDT(fdt_setprop_string(fdt, 0, "host-model", buf));
+g_free(buf);
+}
+
+if (spapr->host_serial) {
+_FDT(fdt_setprop_string(fdt, 0, "host-serial", spapr->host_serial));
+} else if (smc->broken_host_serial_model && kvmppc_get_host_serial(&buf)) {
+_FDT(fdt_setprop_string(fdt, 0, "host-serial", buf));
+g_free(buf);
+}
+
 _FDT(fdt_setprop_cell(fdt, 0, "#address-cells", 2));
 _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
@@ -3352,12 

Re: [Qemu-devel] [PATCH for 4.0 v1 1/5] riscv: plic: Fix incorrect irq calculation

2019-03-27 Thread Palmer Dabbelt

On Wed, 27 Mar 2019 09:29:56 PDT (-0700), alistai...@gmail.com wrote:

On Wed, Mar 27, 2019 at 3:29 AM Palmer Dabbelt  wrote:


On Wed, 20 Mar 2019 17:46:09 PDT (-0700), Alistair Francis wrote:
> The irq is incorrectly calculated to be off by one. It has worked in the
> past as the priority_base offset has also been set incorrectly. We are
> about to fix the priority_base offset, so first fix the irq
> calculation.
>
> Signed-off-by: Alistair Francis 
> ---
>  hw/riscv/sifive_plic.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
> index 1c703e1a37..70a85cd075 100644
> --- a/hw/riscv/sifive_plic.c
> +++ b/hw/riscv/sifive_plic.c
> @@ -206,7 +206,7 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
>  if (addr >= plic->priority_base && /* 4 bytes per source */
>  addr < plic->priority_base + (plic->num_sources << 2))
>  {
> -uint32_t irq = (addr - plic->priority_base) >> 2;
> +uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
>  if (RISCV_DEBUG_PLIC) {
>  qemu_log("plic: read priority: irq=%d priority=%d\n",
>  irq, plic->source_priority[irq]);
> @@ -279,7 +279,7 @@ static void sifive_plic_write(void *opaque, hwaddr addr, uint64_t value,
>  if (addr >= plic->priority_base && /* 4 bytes per source */
>  addr < plic->priority_base + (plic->num_sources << 2))
>  {
> -uint32_t irq = (addr - plic->priority_base) >> 2;
> +uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
>  plic->source_priority[irq] = value & 7;
>  if (RISCV_DEBUG_PLIC) {
>  qemu_log("plic: write priority: irq=%d priority=%d\n",

I think this breaks bisectability and should be merged with the
*_PLIC_PRIORITY_BASE modifications, as otherwise we'll end up with the
priority being set for the wrong IRQ.


Good point, I can merge them together.



I'm also not sure how this fixes the bug listed in the OpenSBI PR.  As far as I
can understand it, all this does is ensure that source 0 is actually treated as
illegal (and shrinks the number of sources to match what's actually there, but
that shouldn't fix the bug).  That looks more like a cleanup than an actual fix.


The problem is that before we were off by one. We supported 0 - (n -
1) and this patch set changes QEMU to support 1 - n. This is because
of the "addr < plic->priority_base + (plic->num_sources << 2)"
comparison. As priority_base is now 0x04 instead of 0x00, the
comparison will work correctly.
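
Purely to illustrate that arithmetic (not part of the patch set), with
num_sources = 4 the accepted offset window and the IRQ numbers it maps to
shift like this:

    /* Illustration only: old vs. new mapping of priority-register offsets
     * to IRQ numbers, for num_sources = 4.  Mirrors the comparison and
     * shift quoted above, nothing more. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t n = 4;                         /* num_sources */
        uint32_t old_base = 0x00, new_base = 0x04;

        /* old: offsets [old_base, old_base + 4n) -> irq = off >> 2 */
        printf("old: offsets 0x%02x..0x%02x -> irq %u..%u\n",
               old_base, old_base + (n << 2) - 4,
               0u, ((n << 2) - 4) >> 2);

        /* new: offsets [new_base, new_base + 4n) -> irq = ((off - base) >> 2) + 1 */
        printf("new: offsets 0x%02x..0x%02x -> irq %u..%u\n",
               new_base, new_base + (n << 2) - 4,
               1u, (((n << 2) - 4) >> 2) + 1);
        return 0;
    }

i.e. exactly the move from supporting sources 0..(n-1) to 1..n described above.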


So something in OpenSBI is going through the entire set of sources and 
initializing the priorities?  That makes sense for the off-by-one, I was just 
wondering because the array shrank so the specific offset in that error would 
still be invalid with the new patch set.




Alistair



Am I missing something?




Re: [Qemu-devel] [PATCH for 4.0 v2 0/2] Update the QEMU PLIC addresses

2019-03-27 Thread Palmer Dabbelt

On Wed, 27 Mar 2019 11:51:15 PDT (-0700), Alistair Francis wrote:

This series updates the PLIC address to match the documentation.

This fixes: https://github.com/riscv/opensbi/issues/97

V2:
 - Squash patches to ensure bisectability

Alistair Francis (2):
  riscv: plic: Fix incorrect irq calculation
  riscv: plic: Log guest errors

 hw/riscv/sifive_plic.c  | 16 +++-
 hw/riscv/sifive_u.c |  2 +-
 include/hw/riscv/sifive_e.h |  2 +-
 include/hw/riscv/sifive_u.h |  4 ++--
 include/hw/riscv/virt.h |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

--
2.21.0


Thanks, I've got these on for-master.  I'll let them sit for a bit to see if 
there are any other comments, but


Reviewed-by: Palmer Dabbelt 



Re: [Qemu-devel] [PATCH for 4.0 v2 2/2] riscv: plic: Log guest errors

2019-03-27 Thread Palmer Dabbelt

On Wed, 27 Mar 2019 11:52:53 PDT (-0700), alistai...@gmail.com wrote:

On Wed, Mar 27, 2019 at 11:51 AM Alistair Francis
 wrote:


Instead of using error_report() to print guest errors let's use
qemu_log_mask(LOG_GUEST_ERROR,...) to log the error.

Signed-off-by: Alistair Francis 


I dropped Philippe's review in this series.
This should be included from v1:

Reviewed-by: Philippe Mathieu-Daudé 


I got it.



Alistair


---
 hw/riscv/sifive_plic.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
index 70a85cd075..7f373d6c9d 100644
--- a/hw/riscv/sifive_plic.c
+++ b/hw/riscv/sifive_plic.c
@@ -262,7 +262,9 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
 }

 err:
-error_report("plic: invalid register read: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid register read 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 return 0;
 }

@@ -289,7 +291,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, uint64_t value,
 } else if (addr >= plic->pending_base && /* 1 bit per source */
addr < plic->pending_base + (plic->num_sources >> 3))
 {
-error_report("plic: invalid pending write: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid pending write: 0x%" HWADDR_PRIx "",
+  __func__, addr);
 return;
 } else if (addr >= plic->enable_base && /* 1 bit per source */
 addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
@@ -339,7 +343,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, uint64_t value,
 }

 err:
-error_report("plic: invalid register write: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 }

 static const MemoryRegionOps sifive_plic_ops = {
--
2.21.0





Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Peter Xu
On Wed, Mar 27, 2019 at 11:35:35AM -0400, Michael S. Tsirkin wrote:
> On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > > distinguish by devfn & bus between devices in a conventional PCI
> > > topology and therefore we cannot assign them separate AddressSpaces.
> > > By taking this requester ID aliasing into account, QEMU better matches
> > > the bare metal behavior and restrictions, and enables shared
> > > AddressSpace configurations that are otherwise not possible with
> > > guest IOMMU support.
> > > 
> > > For the latter case, given any example where an IOMMU group on the
> > > host includes multiple devices:
> > > 
> > >   $ ls  /sys/kernel/iommu_groups/1/devices/
> > >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > [1]
> > 
> > > 
> > > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > > that we can only assign one of the endpoints to the guest because a
> > > second endpoint will attempt to use a different AddressSpace.  VFIO
> > > only supports IOMMU group level granularity at the container level,
> > > preventing this second endpoint from being assigned:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > > 
> > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: 
> > > vfio \
> > > 0000:01:00.1: group 1 used in multiple address spaces
> > > 
> > > However, when QEMU incorporates proper aliasing, we can make use of a
> > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > > provides the downstream devices with the same AddressSpace, ex:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > > 
> > > While the utility of this hack may be limited, this AddressSpace
> > > aliasing is the correct behavior for QEMU to emulate bare metal.
> > > 
> > > Signed-off-by: Alex Williamson 
> > 
> > The patch looks sane to me even as a bug fix since otherwise the DMA
> > address spaces used under misc kinds of PCI bridges can be wrong, so:
> > 
> > Reviewed-by: Peter Xu 
> > 
> > Though I have a question that confused me even before: Alex, do you
> > know why all the context entry of the devices in the IOMMU root table
> > will be programmed even if the devices are under a pcie-to-pci bridge?
> > I'm giving an example with above [1] to be clear: in that case IIUC
> > we'll program context entries for all the three devices (00:01.0,
> > 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> > devices 01:00.0 and 01:00.1 should always been tagged with 01:00.0 on
> > bare metal
> 
> What makes you think so?
> 
> PCI Express spec says:
> 
> Requester ID
> 
> The combination of a Requester's Bus Number, Device Number, and Function
> Number that uniquely identifies the Requester. With an ARI Requester ID, bits
> traditionally used for the Device Number field are used instead to expand the
> Function Number field, and the Device Number is implied to be 0.

It can be something special when there are bridges like a pcie-to-pci
bridge, because the bridge can take ownership of the transaction
and modify the requester ID (which should be part of the transaction
ID).  I believe it's clearer now, since I saw a lot of discussion
happening in the other thread about this too...
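
For what it's worth, the aliasing rule in question (as I understand it, and
as Linux's pci_for_each_dma_alias() seems to assume) is that a request
forwarded upstream by a PCIe-to-PCI bridge carries the bridge's secondary
bus number with devfn 0, so everything behind the bridge shares one
requester ID.  A toy sketch, not QEMU code:

    /* Toy sketch of the PCIe-to-PCI bridge alias being discussed: every
     * conventional-PCI device behind the bridge is seen upstream as
     * (secondary bus, device 0, function 0). */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t bridge_alias(uint8_t secondary_bus)
    {
        return (uint16_t)secondary_bus << 8;    /* devfn forced to 00.0 */
    }

    int main(void)
    {
        /* e.g. 01:00.0 and 01:00.1 behind a bridge whose secondary bus is 1 */
        printf("requester ID seen by the IOMMU: %04x\n", bridge_alias(0x01));
        return 0;
    }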

Thanks,

-- 
Peter Xu



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Peter Xu
On Wed, Mar 27, 2019 at 10:37:09AM -0600, Alex Williamson wrote:
> On Wed, 27 Mar 2019 14:25:00 +0800
> Peter Xu  wrote:
> 
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > > distinguish by devfn & bus between devices in a conventional PCI
> > > topology and therefore we cannot assign them separate AddressSpaces.
> > > By taking this requester ID aliasing into account, QEMU better matches
> > > the bare metal behavior and restrictions, and enables shared
> > > AddressSpace configurations that are otherwise not possible with
> > > guest IOMMU support.
> > > 
> > > For the latter case, given any example where an IOMMU group on the
> > > host includes multiple devices:
> > > 
> > >   $ ls  /sys/kernel/iommu_groups/1/devices/
> > >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > [1]
> > 
> > > 
> > > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > > that we can only assign one of the endpoints to the guest because a
> > > second endpoint will attempt to use a different AddressSpace.  VFIO
> > > only supports IOMMU group level granularity at the container level,
> > > preventing this second endpoint from being assigned:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > > 
> > > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: 
> > > vfio \
> > > 0000:01:00.1: group 1 used in multiple address spaces
> > > 
> > > However, when QEMU incorporates proper aliasing, we can make use of a
> > > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > > provides the downstream devices with the same AddressSpace, ex:
> > > 
> > > qemu-system-x86_64 -machine q35... \
> > >   -device intel-iommu,intremap=on \
> > >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> > >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> > >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > > 
> > > While the utility of this hack may be limited, this AddressSpace
> > > aliasing is the correct behavior for QEMU to emulate bare metal.
> > > 
> > > Signed-off-by: Alex Williamson   
> > 
> > The patch looks sane to me even as a bug fix since otherwise the DMA
> > address spaces used under misc kinds of PCI bridges can be wrong, so:
> 
> I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
> I'd be cautious about this if so.  Eric Auger noted that he's seen an
> SMMU VM hit a guest kernel bug-on, which needs further
> investigation.  It's not clear if it's just an untested or
> unimplemented scenario for SMMU to see a conventional PCI bus or if
> there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
> only VT-d to a very limited degree, thus RFC.

Sorry for being unclear.  I didn't mean to target this for 4.0, and I
completely agree that it should wait until after the release, since it is
still a relatively invasive change to QEMU's PCI subsystem, not to
mention that the system mostly works well even without this patch.

(except things like assignment of multi-functions with IOMMU but it is
 rare, after all)

>  
> > Reviewed-by: Peter Xu 
> > 
> > Though I have a question that confused me even before: Alex, do you
> > know why all the context entry of the devices in the IOMMU root table
> > will be programmed even if the devices are under a pcie-to-pci bridge?
> > I'm giving an example with above [1] to be clear: in that case IIUC
> > we'll program context entries for all the three devices (00:01.0,
> > 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> > > devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> > bare metal and then why we bother to program the context entry of
> > 01:00.1?  It seems never used.
> > 
> > (It should be used for current QEMU to work with pcie-to-pci bridges
> >  if without this patch, but I feel like I don't know the real answer
> >  behind)
> 
> We actually have two different scenarios that could be represented by
> [1], the group can be formed by lack of isolation or by lack of
> visibility.  In the group above, it's the former, isolation.  The PCIe
> root port does not support ACS, so while the IOMMU has visibility of
> the individual devices, peer-to-peer between devices may also be
> possible.  Native, trusted, in-kernel drivers for these devices could
> still make use of separate IOMMU domains per device, but in order to
> expose the devices to a userspace driver we need to consider them a
> non-isolated set to prevent side-channel attacks between devices.  We
> therefore consider them as a group within the IOMMU API and it's
> required that each context entry maps to the same domain as the IOMMU
> will see transactions for each 

Re: [Qemu-devel] [PATCH v2] ui/cocoa: Fix absolute input device grabbing issues on Mojave

2019-03-27 Thread Chen Zhang via Qemu-devel



> On Mar 27, 2019, at 7:37 PM, Peter Maydell  wrote:
> 
> On Wed, 27 Mar 2019 at 01:09, Chen Zhang  wrote:
>> Just double-checked on a MacMini mid 2011 with macOS 10.13.6, Xcode 10.1. 
>> This NSWindow API did exist, and the patch could be built without error.
>> Searching convertPointFromScreen in Xcode.app’s platform SDK showed that it 
>> was defined at line 440 in NSWindow.h of AppKit.framework.
>> 
>> May I ask what build environment did you use (as printed by `xcode-select 
>> -p`)?
> 
> manooth$ xcode-select -p
> /Applications/Xcode.app/Contents/Developer
> 
> But that doesn't seem to be what the C compiler is actually
> using, because the error messages quote the filename
> /System/Library/Frameworks/AppKit.framework/Headers/NSWindow.h
> There is no convertPointFromScreen in that header; but there
> is one in both of
> 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/AppKit.framework/Versions/C/Headers/NSWindow.h
> 
> and
> /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/AppKit.framework/Versions/C/Headers/NSWindow.h
> 
> which are the other two copies of NSWindow.h this machine has.
> 
> thanks
> -- PMM
This was really odd.

I tried to figure out what was wrong with the toolchain and ran `xcode-select
--install` and `xcode-select --switch`. After this, clang failed in the way
you mentioned.

I had to re-configure qemu with --extra-cflags="-isysroot `xcrun
--show-sdk-path`", forcing clang to use the correct SDK root, and then it
would compile.

Thanks and best regards,


[Qemu-devel] [PATCH for-4.0?] nbd/client: Deal with unaligned size from server

2019-03-27 Thread Eric Blake
When a server advertises an unaligned size but no block sizes, the
code was rounding up to a sector-aligned size (a known limitation of
bdrv_getlength(); maybe 4.1 will fix it to be byte-accurate), then
assuming a request_alignment of 512 (the recommendation of the NBD
spec for maximum portability).  However, this means that qemu will
actually attempt to access the padding bytes of the trailing partial
sector, without realizing that it is risky.

An easy demonstration, using nbdkit as the server:
$ nbdkit -fv random size=1023
$ qemu-io -r -f raw -c 'r -v 0 1023' nbd://localhost:10809
read failed: Invalid argument

because the client rounded the request up to 1024 bytes, which nbdkit
then rejected as beyond the advertised size of 1023.
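
To spell out the arithmetic (illustration only, using the roundings the real
QEMU_ALIGN_UP/QEMU_ALIGN_DOWN macros perform):

    /* 1023-byte export, 512-byte sectors: rounding up reaches past EOF,
     * rounding down (what the patch below does) stays inside the image. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t size = 1023, sector = 512;

        uint64_t up   = (size + sector - 1) / sector * sector;  /* 1024 */
        uint64_t down = size / sector * sector;                 /* 512  */

        printf("align up: %llu (beyond EOF), align down: %llu\n",
               (unsigned long long)up, (unsigned long long)down);
        return 0;
    }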

Note that qemu as the server refuses to send an unaligned size, as it
has already rounded the unaligned image up to sector size, and then
happily resizes the image on access (at least when serving a POSIX
file over NBD). But since third-party servers may decide to kill the
connection when we access out of bounds, it's better to just ignore
the unaligned tail than it is to risk problems. We can undo this hack
later once the block layer quits rounding sizes inappropriately.

Reported-by: Richard W.M. Jones 
Signed-off-by: Eric Blake 
---

This is the minimal patch to avoid our client behaving out-of-spec,
but I don't like that it makes the tail of the server's file
unreadable. A better solution of auditing the block layer to use a
proper byte length everywhere instead of rounding up may be 4.1
material, but is certainly not appropriate for 4.0-rc2. I could also
teach the NBD code to compare every request from the block layer
against client.info.size, and manually handle the tail itself, but
that feels like a lot of pain (for something the block layer should be
able to do generically, and where any NBD-specific tail-handling code
just slows down the common case and would be ripped out again once the
block layer quits rounding up). I'm willing to include this with my
other NBD patches slated for -rc2 as part of my suite of improved
third-party interoperability, but only if we agree that the truncation
is appropriate.

Note that nbdkit includes '--filter=truncate round-up=512' as a way to
expose the unaligned tail without making qemu trigger out-of-bounds
operations: reads of the tail now succeed with contents of 0; writes
fail with ENOSPC unless the contents requested to be written into the
tail are all 0s.  So not fixing the bug for 4.0, and telling people to
use nbdkit's truncate filter instead, is also a viable option.

 block/nbd.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 2e72df528ac..a2cd365f646 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -463,7 +463,17 @@ static int64_t nbd_getlength(BlockDriverState *bs)
 {
 BDRVNBDState *s = bs->opaque;

-return s->client.info.size;
+/*
+ * HACK: Round down to sector alignment. qemu as server always
+ * advertises a sector-aligned image, so we only ever see
+ * unaligned sizes with third-party servers. The block layer
+ * defaults to rounding up, but that can result in us attempting
+ * to read beyond EOF from the remote server; better is to
+ * forcefully round down here and just treat the last few bytes
+ * from the server as unreadable.  This can all go away once the
+ * block layer uses actual byte lengths instead of rounding up.
+ */
+return QEMU_ALIGN_DOWN(s->client.info.size, BDRV_SECTOR_SIZE);
 }

 static void nbd_detach_aio_context(BlockDriverState *bs)
-- 
2.20.1




Re: [Qemu-devel] [Qemu-ppc] [PATCH v7 4/6] target/ppc: Build rtas error log

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 12:18:31PM +0530, Aravinda Prasad wrote:
> 
> 
> On Tuesday 26 March 2019 03:35 PM, Greg Kurz wrote:
> > On Tue, 26 Mar 2019 14:45:57 +0530
> > Aravinda Prasad  wrote:
> > 
> >> On Tuesday 26 March 2019 02:00 PM, Greg Kurz wrote:
> >>> On Tue, 26 Mar 2019 10:32:35 +1100
> >>> David Gibson  wrote:
> >>>   
>  On Mon, Mar 25, 2019 at 01:56:50PM +0530, Aravinda Prasad wrote:  
> >
> >
> > On Monday 25 March 2019 12:00 PM, David Gibson wrote:
> >> On Fri, Mar 22, 2019 at 12:04:07PM +0530, Aravinda Prasad wrote:
> >>> This patch builds the rtas error log, copies it to the
> >>> rtas_addr and then invokes the guest registered machine
> >>> check handler.
> >>
> >> This commit message needs more context.  When is this occurring, why
> >> do we need this?
> >>
> >> [I can answer those questions now, but whether I - or anyone else -
> >>  will be able to looking back at this commit from years in the future
> >>  is a different question]
> >
> > will add more info.
> 
>  Thanks.
> 
>  [snip]  
> >>> +static uint64_t spapr_get_rtas_addr(void)
> >>> +{
> >>> +SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> >>> +int rtas_node;
> >>> +const struct fdt_property *rtas_addr_prop;
> >>> +void *fdt = spapr->fdt_blob;
> >>> +uint32_t rtas_addr;
> >>> +
> >>> +/* fetch rtas addr from fdt */
> >>> +rtas_node = fdt_path_offset(fdt, "/rtas");
> >>> +g_assert(rtas_node >= 0);
> >>> +
> >>> +rtas_addr_prop = fdt_get_property(fdt, rtas_node, "linux,rtas-base", NULL);
> >>> +g_assert(rtas_addr_prop);
> >>> +
> >>> +rtas_addr = fdt32_to_cpu(*(uint32_t *)rtas_addr_prop->data);
> >>> +return (uint64_t)rtas_addr;
> >>
> >> It seems a bit roundabout to pull the rtas address out of the device
> >> tree, since it was us that put it in there in the first place.
> >
> > Slof can change the rtas address. So we need to get the updated rtas
> > address.
> 
>  Ah, ok.
>   
> >>>
> >>> Yeah, and knowing that the DT is guest originated makes me a bit
> >>> nervous when I see the g_assert()... a misbehaving guest could
> >>> possibly abort QEMU. Either there should be some sanity checks
> >>> performed earlier or a non-fatal error path should be added in
> >>> this function IMHO.  
> >>
> >> Is it not the QEMU that builds the DT and provides it to the guest?
> >>
> > 
> > Yes, but then the guest can push a new DT with the KVMPPC_H_UPDATE_DT hcall.
> > We only do some minimalist sanity checks in h_update_dt(). I don't think
> > we want to abort QEMU because the guest sent a DT where "linux,rtas-base"
> > is missing for example.
> > 
> >> Also, spapr_get_rtas_addr() is called during physical memory corruption
> >> which is a fatal error.
> > 
> > Not that fatal since we care to report it to the guest.
> 
> True, but if the guest does not provide rtas_addr then I don't see why
> terminating the QEMU instance (which actually terminates the guest) is a
> problem. Am I missing something?
> 
> > 
> >> So, if we cannot fetch rtas_addr (required to
> >> build and pass the error info to the guest) then I think we should abort.
> >>
> > 
> > Maybe we cannot do anything better at this point, but then we should
> > do some earlier checks and switch to the old machine check behaviour
> > if what we need is missing from the updated DT for example.
> 
> We can do some checks earlier, maybe during fwnmi registration to see
> if rtas entry is missing. Again, the guest can possibly update DT after
> fwnmi registration.
> 
> But I think we cannot switch to old machine check behavior if we cannot
> fetch the rtas_addr, because according to PAPR the guest would have
> relinquished 0x200 vector to the firmware when fwnmi is registered. So
> we cannot expect the guest to handle 0x200 interrupt.

I think that's fair.

I think we want to verify that we can get the rtas_addr and store it
at nmi-register time.  If we can't, we can fail the RTAS call.  If the
guest changes the RTAS address after that point (which we don't
expect), then it has shot itself in the foot and that's fine.
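
Something along these lines, perhaps (hypothetical sketch only, assuming a
non-asserting spapr_get_rtas_addr() variant and a field to cache the
address; neither name exists yet, only rtas_st() and the RTAS_OUT_* codes
are existing QEMU interfaces):

    /* Hypothetical sketch, not a patch: validate and cache the RTAS address
     * when the guest calls ibm,nmi-register, and fail the call otherwise.
     * spapr_get_rtas_addr_checked() and guest_machine_check_addr are
     * assumed/made-up names. */
    static void rtas_ibm_nmi_register(PowerPCCPU *cpu, SpaprMachineState *spapr,
                                      uint32_t token, uint32_t nargs,
                                      target_ulong args, uint32_t nret,
                                      target_ulong rets)
    {
        uint64_t rtas_addr;

        if (!spapr_get_rtas_addr_checked(spapr, &rtas_addr)) {
            rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);  /* fail the RTAS call */
            return;
        }

        spapr->guest_machine_check_addr = rtas_addr; /* remember for later */
        rtas_st(rets, 0, RTAS_OUT_SUCCESS);
    }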

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [PATCH v1] s390x/kvm: Configure page size after memory has actually been initialized

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 09:06:53PM +0100, David Hildenbrand wrote:
> On 27.03.19 17:45, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 14:59:44 +0100
> > David Hildenbrand  wrote:
> > 
> >> Right now we configure the pagesize quite early, when initializing KVM.
> >> This is long before system memory is actually allocated via
> >> memory_region_allocate_system_memory(), and therefore memory backends
> >> marked as mapped.
> >>
> >> Instead, let's configure the maximum page size after initializing
> >> memory in s390_memory_init(). cap_hpage_1m is still properly
> >> configured before creating any CPUs, and therefore before configuring
> >> the CPU model and eventually enabling CMMA.
> >>
> >> We might later want to replace qemu_getrampagesize() by another
> >> detection mechanism, e.g. only looking at mapped, initial memory.
> >> We don't support any memory devices yet, and once we do, we can always reject
> >> devices with a page size bigger than the initial page size when
> >> hotplugging. qemu_getrampagesize() should work for now, especially when
> >> converting it to only look at mapped backends.
> >>
> >> Signed-off-by: David Hildenbrand 
> > 
> > Acked-by: Igor Mammedov 
> 
> BTW, do we want
> 
> qemu_getmaxrampagesize()
> qemu_getminrampagesize()

That could work.

> or similar. qemu_getrampagesize() in its current form is really far from
> beautiful.

Yeah, and AFAICT the way it's used here remains incorrect.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [qemu-s390x] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 02:19:41PM +0100, David Hildenbrand wrote:
> On 27.03.19 10:03, David Gibson wrote:
> > On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>  On 26.03.19 15:08, Igor Mammedov wrote:
> > On Tue, 26 Mar 2019 14:50:58 +1100
> > David Gibson  wrote:
> >
> >> qemu_getrampagesize() works out the minimum host page size backing any 
> >> of
> >> guest RAM.  This is required in a few places, such as for POWER8 PAPR 
> >> KVM
> >> guests, because limitations of the hardware virtualization mean the 
> >> guest
> >> can't use pagesizes larger than the host pages backing its memory.
> >>
> >> However, it currently checks against *every* memory backend, whether 
> >> or not
> >> it is actually mapped into guest memory at the moment.  This is 
> >> incorrect.
> >>
> >> This can cause a problem attempting to add memory to a POWER8 pseries 
> >> KVM
> >> guest which is configured to allow hugepages in the guest (e.g.
> >> -machine cap-hpt-max-page-size=16m).  If you attempt to add 
> >> non-hugepage,
> >> you can (correctly) create a memory backend, however it (correctly) 
> >> will
> >> throw an error when you attempt to map that memory into the guest by
> >> 'device_add'ing a pc-dimm.
> >>
> >> What's not correct is that if you then reset the guest a startup check
> >> against qemu_getrampagesize() will cause a fatal error because of the 
> >> new
> >> memory object, even though it's not mapped into the guest.
> > I'd say that backend should be remove by mgmt app since device_add 
> > failed
> > instead of leaving it to hang around. (but fatal error either not a nice
> > behavior on QEMU part)
> 
>  Indeed, it should be removed. Depending on the options (huge pages with
>  prealloc?) memory might be consumed for unused memory. Undesired.
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> >> ...
> >> machine_run_board_init()
> >>
> >> So memory is indeed not mapped before calling qemu_getrampagesize().
> >>
> >>
> >> We *could* move the call to kvm_s390_configure_mempath_backing() to
> >> s390_memory_init().
> >>
> >> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> >>
> >> We could then eventually make qemu_getrampagesize() assert if no
> >> backends are mapped at all, to catch other users that rely on this being
> >> correct.
> > 
> > So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> > and I'm pretty sure it's broken.  It will work in the case where
> > there's only one backend.  And if that's the default -mem-path rather
> > than an explicit memory backend then my patch won't break it any
> > further.
> 
> It works for the current scenarios, where you only have one (maximum
> two) backings of the same kind. Your patch would break that.

Actually it wouldn't.  My patch only affects checking of explicit
backend objects - checking of the base -mem-path implicit backend
remains the same.

> > qemu_getrampagesize() returns the smallest host page size for any
> > memory backend.  That's what matters for pcc KVM (in several cases)
> > because we need certain things to be host-contiguous, not just
> > guest-contiguous.  Bigger host page sizes are fine for that purpose,
> > clearly.
> > 
> > AFAICT on s390 you're looking to determine if any backend is using
> > hugepages, because KVM may not support that.  The minimum host page
> > size isn't adequate to determine that, so qemu_getrampagesize() won't
> > tell you what you need.
> 
> Well, as long as we don't support DIMMS or anything like that it works
> perfectly fine. But yes it is far from beautiful.
> 
> First of all, I'll prepare a patch to do the call from a different
> context. Then we can fine-tune it to use something other than
> qemu_getrampagesize().
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [PATCH for-4.0] target/ppc: Fix QEMU crash with stxsdx

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 07:59:12PM +, Mark Cave-Ayland wrote:
> On 27/03/2019 18:12, Greg Kurz wrote:
> 
> > I've been hitting several QEMU crashes while running a fedora29 ppc64le
> > guest under TCG. Each time, this would occur several minutes after the
> > guest reached login:
> > 
> > Fedora 29 (Twenty Nine)
> > Kernel 4.20.6-200.fc29.ppc64le on an ppc64le (hvc0)
> > 
> > Web console: https://localhost:9090/
> > 
> > localhost login:
> > tcg/tcg.c:3211: tcg fatal error
> > 
> > This happens because a bug crept into the gen_stxsdx() helper when it
> > was converted to use VSR register accessors by commit 8b3b2d75c7c04
> > "target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}() helpers
> > for VSR register access".
> > 
> > The code creates a temporary, passes it directly to gen_qemu_st64_i64()
> > and then to set_cpu_vsrh()... which suggests this was mistakenly
> > coded as a load instead of a store.
> > 
> > Reverse the logic: read the VSR to the temporary first and then store
> > it to memory.
> > 
> > Fixes: 8b3b2d75c7c0481544e277dad226223245e058eb
> > Signed-off-by: Greg Kurz 
> > ---
> >  target/ppc/translate/vsx-impl.inc.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/target/ppc/translate/vsx-impl.inc.c b/target/ppc/translate/vsx-impl.inc.c
> > index 508e9199c8df..489b2436e49f 100644
> > --- a/target/ppc/translate/vsx-impl.inc.c
> > +++ b/target/ppc/translate/vsx-impl.inc.c
> > @@ -356,8 +356,8 @@ static void gen_##name(DisasContext *ctx)   \
> >  gen_set_access_type(ctx, ACCESS_INT); \
> >  EA = tcg_temp_new();  \
> >  gen_addr_reg_index(ctx, EA);  \
> > +get_cpu_vsrh(t0, xS(ctx->opcode));\
> >  gen_qemu_##operation(ctx, t0, EA);\
> > -set_cpu_vsrh(xS(ctx->opcode), t0);\
> >  tcg_temp_free(EA);\
> >  tcg_temp_free_i64(t0);\
> >  }
> 
> Gah, yes - I remember at one point having to double check the get/set 
> direction but
> obviously my brain and fingers got the better of me here. Apologies for the
> inconvenience.
> 
> Reviewed-by: Mark Cave-Ayland 

Applied, thanks.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [PULL 02/29] ppc/spapr: Receive and store device tree blob from SLOF

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 01:36:59PM +0100, Greg Kurz wrote:
> On Wed, 27 Mar 2019 23:06:42 +1100
> David Gibson  wrote:
> 
> > On Wed, Mar 27, 2019 at 10:44:03AM +0100, Greg Kurz wrote:
> > > On Wed, 27 Mar 2019 19:38:27 +1100
> > > David Gibson  wrote:
> > >   
> > > > On Wed, Mar 27, 2019 at 08:09:26AM +0100, Greg Kurz wrote:  
> > > > > On Wed, 27 Mar 2019 11:56:10 +1100
> > > > > David Gibson  wrote:
> > > > > 
> > > > > > On Tue, Mar 26, 2019 at 08:09:53AM +0100, Greg Kurz wrote:
> > > > > > > On Tue, 26 Mar 2019 10:47:15 +1100
> > > > > > > David Gibson  wrote:
> > > > > > >   
> > > > > > > > On Mon, Mar 25, 2019 at 05:33:21PM +0100, Greg Kurz wrote:  
> > > > > > > > > On Mon, 25 Mar 2019 11:53:47 +1100
> > > > > > > > > David Gibson  wrote:
> > > > > > > > > 
> > > > > > > > > > On Sun, Mar 24, 2019 at 12:03:54AM -0400, Brad Smith wrote: 
> > > > > > > > > >
> > > > > > > > > > > Now that I am checking out 4.0.0 rc's I see this diff is 
> > > > > > > > > > > broken and
> > > > > > > > > > > depends on a function libfdt does not expose. The 
> > > > > > > > > > > breakage is
> > > > > > > > > > > hidden by the fallback check in the configure script. 
> > > > > > > > > > >  
> > > > > > > > > > 
> > > > > > > > > > Ah, bother.  That keeps happening, unfortunately.  I think 
> > > > > > > > > > it's
> > > > > > > > > > because so many people use libfdt embedded, rather than as 
> > > > > > > > > > a shared
> > > > > > > > > > library that we tend not to notice.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It's a bit more complicated. I do have latest libfdt packages 
> > > > > > > > > on my laptop:
> > > > > > > > > 
> > > > > > > > > libfdt-1.4.7-2.fc28.x86_64
> > > > > > > > > libfdt-devel-1.4.7-2.fc28.x86_64
> > > > > > > > > 
> > > > > > > > > but I still end up using the embedded one and the build 
> > > > > > > > > doesn't spot
> > > > > > > > > the missing symbols.
> > > > > > > > 
> > > > > > > > Sorry, I wasn't clear.  I wasn't meaning in the context of 
> > > > > > > > qemu, but
> > > > > > > > for dtc generally.  A large portion of the users are things like
> > > > > > > > u-boot that have to use dtc embedded, rather than as a shared
> > > > > > > > library.  That's why we tend not to notice missing symbols from 
> > > > > > > > the
> > > > > > > > version script upstream.
> > > > > > > >   
> > > > > > > 
> > > > > > > Ok, I get it.
> > > > > > >   
> > > > > > > > > This happens because of several reasons:
> > > > > > > > > 
> > > > > > > > > 1) configure unconditionally falls back to embedded if an 
> > > > > > > > > error occurs with
> > > > > > > > >the system packages. And, as reported by Brad, the current 
> > > > > > > > > 1.4.7 packages
> > > > > > > > >are broken indeed:
> > > > > > > > > 
> > > > > > > > > $ objdump -T /usr/lib64/libfdt-1.4.7.so | grep fdt_check_full
> > > > > > > > > $ 
> > > > > > > > > 
> > > > > > > > > 2) when building embedded, we only build the archive, not the 
> > > > > > > > > shared lib.
> > > > > > > > > 
> > > > > > > > > > I guess we should figure out how to force the testsuite to 
> > > > > > > > > > link
> > > > > > > > > > against the shared library rather than static to test for 
> > > > > > > > > > this.  I'll
> > > > > > > > > > look into it if I have time (which is a big if).
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I think we should rather build the embedded shared library 
> > > > > > > > > using
> > > > > > > > > the 'libfdt' rule of the top-level makefile of the dtc 
> > > > > > > > > sub-module
> > > > > > > > > and have QEMU linked against this shared library instead 
> > > > > > > > > of
> > > > > > > > > the static one. AFAIK, this is what gcc does when it finds 
> > > > > > > > > both
> > > > > > > > > .a and .so.
> > > > > > > > 
> > > > > > > > Actually, I don't think this is a good idea.  It means the 
> > > > > > > > resulting
> > > > > > > > qemu build would have to be installed with the libfdt image as 
> > > > > > > > well.
> > > > > > > > As well as complicating the install path, that means that the 
> > > > > > > > qemu
> > > > > > > > build will now actively conflict with a packaged libfdt, rather 
> > > > > > > > than
> > > > > > > > merely suboptimally failing to use it.  
> > > > > > > 
> > > > > > > Yes you're right: the resulting QEMU shouldn't be installed, 
> > > > > > > especially
> > > > > > > if it has an RPATH pointing to the build directory.
> > > > > > > 
> > > > > > > This being said, if someone wants to build AND install QEMU, 
> > > > > > > shouldn't
> > > > > > > she rely exclusively on installed libfdt packages? In other 
> > > > > > > words,
> > > > > > > shouldn't the embedded libfdt be a QEMU developer-only thing? 
> > > > > > > What
> > > > > > > are the real life use cases for embedded libfdt ?  
> > > > > > 
> > > > > > I don't think we should insist on that, 

Re: [Qemu-devel] [qemu-s390x] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 03:22:58PM +0100, David Hildenbrand wrote:
> On 27.03.19 10:03, David Gibson wrote:
> > On Wed, Mar 27, 2019 at 09:10:01AM +0100, David Hildenbrand wrote:
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:
>  On 26.03.19 15:08, Igor Mammedov wrote:
> > On Tue, 26 Mar 2019 14:50:58 +1100
> > David Gibson  wrote:
> >
> >> qemu_getrampagesize() works out the minimum host page size backing any 
> >> of
> >> guest RAM.  This is required in a few places, such as for POWER8 PAPR 
> >> KVM
> >> guests, because limitations of the hardware virtualization mean the 
> >> guest
> >> can't use pagesizes larger than the host pages backing its memory.
> >>
> >> However, it currently checks against *every* memory backend, whether 
> >> or not
> >> it is actually mapped into guest memory at the moment.  This is 
> >> incorrect.
> >>
> >> This can cause a problem attempting to add memory to a POWER8 pseries 
> >> KVM
> >> guest which is configured to allow hugepages in the guest (e.g.
> >> -machine cap-hpt-max-page-size=16m).  If you attempt to add 
> >> non-hugepage,
> >> you can (correctly) create a memory backend, however it (correctly) 
> >> will
> >> throw an error when you attempt to map that memory into the guest by
> >> 'device_add'ing a pc-dimm.
> >>
> >> What's not correct is that if you then reset the guest a startup check
> >> against qemu_getrampagesize() will cause a fatal error because of the 
> >> new
> >> memory object, even though it's not mapped into the guest.
> > I'd say that backend should be remove by mgmt app since device_add 
> > failed
> > instead of leaving it to hang around. (but fatal error either not a nice
> > behavior on QEMU part)
> 
>  Indeed, it should be removed. Depending on the options (huge pages with
>  prealloc?) memory might be consumed for unused memory. Undesired.
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> >> ...
> >> machine_run_board_init()
> >>
> >> So memory is indeed not mapped before calling qemu_getrampagesize().
> >>
> >>
> >> We *could* move the call to kvm_s390_configure_mempath_backing() to
> >> s390_memory_init().
> >>
> >> cap_hpage_1m is not needed before we create VCPUs, so this would work fine.
> >>
> >> We could then eventually make qemu_getrampagesize() assert if no
> >> backends are mapped at all, to catch other users that rely on this being
> >> correct.
> > 
> > So.. I had a look at the usage in kvm_s390_configure_mempath_backing()
> > and I'm pretty sure it's broken.  It will work in the case where
> > there's only one backend.  And if that's the default -mem-path rather
> > than an explicit memory backend then my patch won't break it any
> > further.
> 
> On the second look, I think I get your point.
> 
> 1. Why on earth does "find_max_supported_pagesize" find the "minimum
> page size". What kind of nasty stuff is this.

Ah, yeah, the naming is bad because of history.

The original usecase of this was because on POWER (before POWER9) the
way MMU virtualization works, pages inserted into the guest MMU view
have to be host-contiguous: there's no 2nd level translation that lets
them be broken into smaller host pages.

The upshot is that a KVM guest can only use large pages if it's backed
by large pages on the host.  We have to advertise the availability of
large pages to the guest at boot time though, and there's no way to
restrict it to certain parts of guest RAM.

So, this code path was finding the _maximum_ page size the guest could
use... which depends on the _minimum_ page size used on the host.
When this was moved to (partly) generic code we didn't think to
improve all the names.
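
For readers without the code in front of them, the generic helper is (very
roughly) just a fold that keeps the smallest page size across the
memory-backend objects, falling back to the main RAM page size; a
simplified sketch of the idea, not the literal exec.c code (the QOM helper
names are real, the flow is condensed):

    /* Simplified sketch of what qemu_getrampagesize() boils down to. */
    static int min_backend_pagesize(Object *obj, void *opaque)
    {
        long *min = opaque;

        if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
            long psize = host_memory_backend_pagesize(MEMORY_BACKEND(obj));

            if (psize < *min) {
                *min = psize;
            }
        }
        return 0;                    /* keep walking the /objects children */
    }

    static long ram_min_pagesize_sketch(void)
    {
        long min = LONG_MAX;
        Object *memdev_root = object_resolve_path("/objects", NULL);

        if (memdev_root) {
            object_child_foreach(memdev_root, min_backend_pagesize, &min);
        }
        /* every backend is considered, mapped into the guest or not --
         * which is exactly the problem discussed in this thread */
        return min == LONG_MAX ? getpagesize() : min;
    }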

> 2. qemu_mempath_getpagesize() is not affected by your patch

Correct.

> and that
> seems to be the only thing used on s390x for now.

Uh.. what?

> I sent a patch to move the call on s390x. But we really have to detect
> the maximum page size (what find_max_supported_pagesize promises), not
> the minimum page size.

Well.. sort of.  In the ppc case it really is the minimum page size we
care about, in the sense that if some part of RAM has a larger page
size, that's fine - even if it's a weird size that we didn't expect.

IIUC for s390 the problem is that KVM doesn't necessarily support
putting large pages into the guest at all, and what large page 

Re: [Qemu-devel] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread David Gibson
On Wed, Mar 27, 2019 at 06:24:17PM +0100, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 23:03:58 +1100
> David Gibson  wrote:
> 
> > On Wed, Mar 27, 2019 at 10:38:38AM +0100, Igor Mammedov wrote:
> > > On Wed, 27 Mar 2019 20:07:57 +1100
> > > David Gibson  wrote:
> > > 
> > > > On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > > > > On Wed, 27 Mar 2019 11:11:46 +1100
> > > > > David Gibson  wrote:
> > > > >   
> > > > > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > > > > David Gibson  wrote:
> > > > > > > 
> > > > > > > > qemu_getrampagesize() works out the minimum host page size 
> > > > > > > > backing any of
> > > > > > > > guest RAM.  This is required in a few places, such as for 
> > > > > > > > POWER8 PAPR KVM
> > > > > > > > guests, because limitations of the hardware virtualization mean 
> > > > > > > > the guest
> > > > > > > > can't use pagesizes larger than the host pages backing its 
> > > > > > > > memory.
> > > > > > > > 
> > > > > > > > However, it currently checks against *every* memory backend, 
> > > > > > > > whether or not
> > > > > > > > it is actually mapped into guest memory at the moment.  This is 
> > > > > > > > incorrect.
> > > > > > > > 
> > > > > > > > This can cause a problem attempting to add memory to a POWER8 
> > > > > > > > pseries KVM
> > > > > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add 
> > > > > > > > non-hugepage,
> > > > > > > > you can (correctly) create a memory backend, however it 
> > > > > > > > (correctly) will
> > > > > > > > throw an error when you attempt to map that memory into the 
> > > > > > > > guest by
> > > > > > > > 'device_add'ing a pc-dimm.
> > > > > > > > 
> > > > > > > > What's not correct is that if you then reset the guest a 
> > > > > > > > startup check
> > > > > > > > against qemu_getrampagesize() will cause a fatal error because 
> > > > > > > > of the new
> > > > > > > > memory object, even though it's not mapped into the guest.
> > > > > > > I'd say that backend should be remove by mgmt app since 
> > > > > > > device_add failed
> > > > > > > instead of leaving it to hang around. (but fatal error either not 
> > > > > > > a nice
> > > > > > > behavior on QEMU part)
> > > > > > 
> > > > > > I agree, but reset could be guest initiated, so even if management 
> > > > > > is
> > > > > > doing the right thing there's a window where it could break.  
> > > > > 
> > > > > We could (long term) get rid of qemu_getrampagesize() (it's not really 
> > > > > good
> > > > > to push machine/target specific condition into generic function) and 
> > > > > move
> > > > > pagesize calculation closer to machine. In this case machine (spapr) 
> > > > > knows
> > > > > exactly when and what is mapped and can enumerate NOT backends but 
> > > > > mapped
> > > > > memory regions and/or pc-dimms (lets say we have 
> > > > > memory_region_page_size()
> > > > > and apply whatever policy to the results.  
> > > > 
> > > > So.. it used to be in the machine specific code, but was made generic
> > > > because it's used in the vfio code.  Where it is ppc specific, but not
> > > > keyed into the machine specific code in a way that we can really
> > > > re-use it there.
> > > It probably was the easiest way to hack things together, but then the API
> > > gets generalized and misused and then tweaked to a specific machine use
> > > case, and it only gets worse over time.
> > 
> > I don't really know what you mean by that.  In both the (main) ppc and
> > VFIO cases the semantics we want from getrampagesize() really are the
> > same: what's the smallest chunk of guest-contiguous memory we can rely
> > on to also be host-contiguous.
> "smallest chunk of guest-contiguous memory" is frontend abstraction while
> the later "host-contiguous" is backend's one. But qemu_getrampagesize()
> operates though on backend side (which doesn't have to represent guest RAM).
> Shouldn't we look for guest RAM from frontend side (to be sure we are dealing 
> with guest RAM)
> and then find out corresponding host side attributes from there?

I'm not really sure what you have in mind there.

In cases where we have a specific guest address, we do try to go from
that to backend information.  Unfortunately in some cases we can't
know a guest address beforehand, so we have to calculate for the worst
case, which is what this is about.

> btw:
> above doesn't mean that we should block your simpler fix for 4.0.
> So for this patch with David's s390 patch as precondition included
> 
> Reviewed-by: Igor Mammedov 
> 
> 
> > 
> > > > > > > > This patch corrects the problem by adjusting 
> > > > > > > > find_max_supported_pagesize()
> > > > > > > > (called from qemu_getrampagesize() via object_child_foreach) to 
> > > > > > > > exclude
> > > > > > > > non-mapped memory backends.
> > > > > > > I'm not sure 

Re: [Qemu-devel] [PATCH v7 6/6] migration: Block migration while handling machine check

2019-03-27 Thread David Gibson
On Tue, Mar 26, 2019 at 12:22:50PM +0530, Aravinda Prasad wrote:
> 
> 
> On Tuesday 26 March 2019 09:49 AM, David Gibson wrote:
> > On Mon, Mar 25, 2019 at 02:31:10PM +0530, Aravinda Prasad wrote:
> >>
> >>
> >> On Monday 25 March 2019 12:03 PM, David Gibson wrote:
> >>> On Fri, Mar 22, 2019 at 12:04:25PM +0530, Aravinda Prasad wrote:
>  Block VM migration requests until the machine check
>  error handling is complete as (i) these errors are
>  specific to the source hardware and is irrelevant on
>  the target hardware, (ii) these errors cause data
>  corruption and should be handled before migration.
> 
>  Signed-off-by: Aravinda Prasad 
>  ---
>   hw/ppc/spapr_events.c  |   17 +
>   hw/ppc/spapr_rtas.c|4 
>   include/hw/ppc/spapr.h |3 +++
>   3 files changed, 24 insertions(+)
> 
>  diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
>  index d7cc0a4..6356a38 100644
>  --- a/hw/ppc/spapr_events.c
>  +++ b/hw/ppc/spapr_events.c
>  @@ -41,6 +41,7 @@
>   #include "qemu/bcd.h"
>   #include "hw/ppc/spapr_ovec.h"
>   #include 
>  +#include "migration/blocker.h"
>   
>   #define RTAS_LOG_VERSION_MASK   0xff00
>   #define   RTAS_LOG_VERSION_60x0600
>  @@ -866,6 +867,22 @@ static void spapr_mce_build_elog(PowerPCCPU *cpu, bool recovered)
>   void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
>   {
>   SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>  +int ret;
>  +Error *local_err = NULL;
>  +
>  +error_setg(&spapr->migration_blocker,
>  +"Live migration not supported during machine check handling");
>  +ret = migrate_add_blocker(spapr->migration_blocker, &local_err);
>  +if (ret < 0) {
>  +/*
>  + * We don't want to abort and let the migration to continue. In a
>  + * rare case, the machine check handler will run on the target
>  + * hardware. Though this is not preferable, it is better than aborting
>  + * the migration or killing the VM.
> >>>
> >>> Can't you just discard the error in that case?
> >>
> >> I am actually discarding the error. Do you mean I don't have to print
> >> the below warning or am I missing something?
> > 
> > Um.. I guess we should keep the warning.  The tricky thing here is
> > that the information on the machine check is no longer relevant after
> > the migration, however presumably the event which triggered it might
> > have already corrupted guest memory, which *would* be relevant to the
> > guest after migration.  Not sure what to do about that.
> 
> I think the information is still relevant after the migration, because
> it includes the effective address in error and the context. If the
> corruption is in the guest kernel context then the machine check handler
> inside the guest kernel will call panic(). If the corruption is in user
> space, then the application is sent a SIGBUS.

Good point.

> But the problem is when the machine check handler hardware poisons the
> page frame backing the corrupt memory to avoid reuse of that page frame.
> After migration, the page frame backing the corrupt memory is clean, but
> we poison the clean page frame.

This is the guest side handler?  It sounds like we need two
variants of the notification that say whether or not the area is still
bad.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] megasas: Unexpected response from lun 1 while scanning, scan aborted

2019-03-27 Thread Dongli Zhang



On 3/27/19 7:31 PM, Hannes Reinecke wrote:
> On 3/26/19 5:47 PM, Dongli Zhang wrote:
>> I am reporting an error where the scsi lun cannot be initialized
>> successfully when I am emulating the megasas scsi controller with qemu.
>>
>> I am not sure if this is an issue in qemu or the linux kernel.
>>
>> When 'lun=1' is specified, there is "Unexpected response from lun 1 while
>> scanning, scan aborted".
>>
>> Everything works well if 'lun=0' is specified.
>>
>>
>> Below is the qemu cmdline involved:
>>
>> -device megasas,id=scsi0 \
>> -device scsi-hd,drive=drive0,bus=scsi0.0,lun=1 \
>> -drive file=/home/zhang/img/test.img,if=none,id=drive0,format=raw
>>
>>
>> Below is the syslog related to 'scsi|SCSI'
>>
>> # dmesg | grep SCSI
>> [    0.392494] SCSI subsystem initialized
>> [    0.460666] Block layer SCSI generic (bsg) driver version 0.4 loaded 
>> (major
>> 251)
>> [    0.706788] sd 1:0:0:0: [sda] Attached SCSI disk
>> # dmesg | grep scsi
>> [    0.511643] scsi host0: Avago SAS based MegaRAID driver
>> [    0.523302] scsi 0:2:0:0: Unexpected response from lun 1 while scanning,
>> scan aborted
>> [    0.540364] scsi host1: ata_piix
>> [    0.540780] scsi host2: ata_piix
>> [    0.702396] scsi 1:0:0:0: Direct-Access ATA  QEMU HARDDISK    2.5+
>> PQ: 0 ANSI: 5
>>
>> When 'lun=1' is changed to 'lun=0', there is no issue.
>>
>> Thank you very much!
>>
> That's an artifact from the megasas emulation in qemu.
> Megasas (internally) can't handle LUN numbers (the RAID part only knows about
> 'disks'), so I took the decision to not expose devices with LUN != 0.
> Please use a different SCSI target number, not a non-zero LUN number.

The guest kernel is able to detect the disks if the LUN is always 0 and only
the target number is changed:

-device scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=0
-device scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=1,lun=0

# dmesg | grep scsi
[0.935999] scsi host0: ata_piix
[0.936401] scsi host1: ata_piix
[1.100945] scsi 0:0:0:0: Direct-Access ATA  QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.102409] sd 0:0:0:0: Attached scsi generic sg0 type 0
[1.672952] scsi host2: Avago SAS based MegaRAID driver
[1.683886] scsi 2:2:0:0: Direct-Access QEMU QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.684915] scsi 2:2:1:0: Direct-Access QEMU QEMU HARDDISK2.5+
PQ: 0 ANSI: 5
[1.701529] sd 2:2:0:0: Attached scsi generic sg1 type 0
[1.704795] sd 2:2:1:0: Attached scsi generic sg2 type 0
# dmesg | grep SCSI
[0.111015] SCSI subsystem initialized
[0.904712] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 
246)
[1.121174] sd 0:0:0:0: [sda] Attached SCSI disk
[1.703739] sd 2:2:0:0: [sdb] Attached SCSI disk
[1.706964] sd 2:2:1:0: [sdc] Attached SCSI disk



If devices with LUN != 0 will not be exposed, why not set max_lun = 0, as the
qemu lsi emulation does?

diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index a56317e..c966ee0 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -2298,7 +2298,7 @@ static void megasas_scsi_uninit(PCIDevice *d)
 static const struct SCSIBusInfo megasas_scsi_info = {
 .tcq = true,
 .max_target = MFI_MAX_LD,
-.max_lun = 255,
+.max_lun = 0,

 .transfer_data = megasas_xfer_complete,
 .get_sg_list = megasas_get_sg_list,


Thank you very much!

Dongli Zhang



[Qemu-devel] [Bug 1822012] [NEW] powernv machine complains of missing skiboot files

2019-03-27 Thread mcandre
Public bug reported:

Hi, I want to use the powernv machine from the qemu-system-ppcle
application. However, when I specify this machine, qemu complains that
it can't find the skiboot files.

I noticed that skiboot is available for Ubuntu, but only for the PPC[64]
hosts. Well, I just need skiboot files for qemu on amd64 hosts.

Hmm, looks like Debian has a package for these missing qemu files:

https://packages.debian.org/sid/qemu-skiboot

Could we promote these to Ubuntu repositories, and fix the qemu packages
so that they automatically depend on the necessary BIOS packages? For
example, openbios-ppc should also be installed when qemu-system-
ppc[64[le]] are installed.

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1822012

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1822012/+subscriptions



[Qemu-devel] [PATCH v2 1/3] iotests: Add 241 to test NBD on unaligned images

2019-03-27 Thread Eric Blake
Add a test for the NBD client workaround in the previous patch.  It's
not really feasible for an iotest to assume a specific tracing engine,
so we can't really probe for the new
trace_nbd_parse_blockstatus_compliance to see if the server was fixed
vs. whether the client just worked around the server (other than by
rearranging order between code patches and this test). But having a
successful exchange sure beats the previous state of an error message.

Not tested yet, but worth adding to this test in future patches: an
NBD server that can advertise a non-sector-aligned size (such as
nbdkit) causes qemu as the NBD client to misbehave when it rounds the
size up and accesses beyond the advertised size. Qemu as NBD server
never advertises a non-sector-aligned size (since bdrv_getlength()
currently rounds up to sector boundaries); until qemu can act as such
a server, testing that flaw will have to rely on external binaries.

Signed-off-by: Eric Blake 
---
 tests/qemu-iotests/241 | 71 ++
 tests/qemu-iotests/241.out |  7 
 tests/qemu-iotests/group   |  1 +
 3 files changed, 79 insertions(+)
 create mode 100755 tests/qemu-iotests/241
 create mode 100644 tests/qemu-iotests/241.out

diff --git a/tests/qemu-iotests/241 b/tests/qemu-iotests/241
new file mode 100755
index 000..b1f6e6452de
--- /dev/null
+++ b/tests/qemu-iotests/241
@@ -0,0 +1,71 @@
+#!/bin/bash
+#
+# Test qemu-nbd vs. unaligned images
+#
+# Copyright (C) 2018-2019 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+seq="$(basename $0)"
+echo "QA output created by $seq"
+
+status=1 # failure is the default!
+
+nbd_unix_socket=$TEST_DIR/test_qemu_nbd_socket
+rm -f "${TEST_DIR}/qemu-nbd.pid"
+
+_cleanup()
+{
+_cleanup_test_img
+nbd_server_stop
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+. ./common.nbd
+
+_supported_fmt raw
+_supported_proto nbd
+_supported_os Linux
+_require_command QEMU_NBD
+
+echo
+echo "=== Exporting unaligned raw image ==="
+echo
+
+# can't use _make_test_img, because qemu-img rounds image size up,
+# and because we want to use Unix socket rather than TCP port. Likewise,
+# we have to redirect TEST_IMG to our server.
+# This tests that we can deal with the hole at the end of an unaligned
+# raw file (either because the server doesn't advertise alignment too
+# large, or because the client ignores the server's noncompliance).
+printf %01000d 0 > "$TEST_IMG_FILE"
+nbd_server_start_unix_socket -f $IMGFMT -e 42 -x '' "$TEST_IMG_FILE"
+TEST_IMG="nbd:unix:$nbd_unix_socket"
+
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+$QEMU_IO -c map "$TEST_IMG"
+
+# Not tested yet: we also want to ensure that qemu as NBD client does
+# not access beyond the end of a server's advertised unaligned size.
+# However, since qemu as server always rounds up to a sector alignment,
+# we would have to use nbdkit to provoke the current client failures.
+
+# success, all done
+echo '*** done'
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
new file mode 100644
index 000..044afc0c6f8
--- /dev/null
+++ b/tests/qemu-iotests/241.out
@@ -0,0 +1,7 @@
+QA output created by 241
+
+=== Exporting unaligned raw image ===
+
+[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]
+1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 41da10c6cf5..bae77183809 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -240,6 +240,7 @@
 238 auto quick
 239 rw auto quick
 240 auto quick
+241 rw auto quick
 242 rw auto quick
 243 rw auto quick
 244 rw auto quick
-- 
2.20.1




[Qemu-devel] [PATCH v2 2/3] block: Add bdrv_get_request_alignment()

2019-03-27 Thread Eric Blake
The next patch needs access to a device's minimum permitted
alignment, since NBD wants to advertise this to clients. Add
an accessor function, borrowing from blk_get_max_transfer()
for accessing a backend's block limits.

Signed-off-by: Eric Blake 
Message-Id: <20180802144834.520904-2-ebl...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/sysemu/block-backend.h | 1 +
 block/block-backend.c  | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index e2066eb06b2..3be05c2d689 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -177,6 +177,7 @@ bool blk_is_available(BlockBackend *blk);
 void blk_lock_medium(BlockBackend *blk, bool locked);
 void blk_eject(BlockBackend *blk, bool eject_flag);
 int blk_get_flags(BlockBackend *blk);
+uint32_t blk_get_request_alignment(BlockBackend *blk);
 uint32_t blk_get_max_transfer(BlockBackend *blk);
 int blk_get_max_iov(BlockBackend *blk);
 void blk_set_guest_block_size(BlockBackend *blk, int align);
diff --git a/block/block-backend.c b/block/block-backend.c
index edad02a0f2a..f78e82a707f 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1764,6 +1764,13 @@ int blk_get_flags(BlockBackend *blk)
 }
 }

+/* Returns the minimum request alignment, in bytes; guaranteed nonzero */
+uint32_t blk_get_request_alignment(BlockBackend *blk)
+{
+BlockDriverState *bs = blk_bs(blk);
+return bs ? bs->bl.request_alignment : BDRV_SECTOR_SIZE;
+}
+
 /* Returns the maximum transfer length, in bytes; guaranteed nonzero */
 uint32_t blk_get_max_transfer(BlockBackend *blk)
 {
-- 
2.20.1




[Qemu-devel] [PATCH for-4.0 v2 0/3] NBD server alignment improvement

2019-03-27 Thread Eric Blake
v1 was here (yes, blast from the past):
https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg00305.html

Since then:
- retitle the iotest (228 is now 241)
- rebase on top of other alignment fixes/workarounds that have
landed in meantime
- hoist test earlier, to demonstrate (according to which order you
compile things) whether the server fix, client workaround, or both
are in play
- enhance commit messages
- drop patch 4 from the earlier series (I still need to come up with
a more robust solution for avoiding the tail of an unaligned image)

Eric Blake (3):
  iotests: Add 241 to test NBD on unaligned images
  block: Add bdrv_get_request_alignment()
  nbd/server: Advertise actual minimum block size

 include/sysemu/block-backend.h |  1 +
 block/block-backend.c  |  7 
 nbd/server.c   | 12 +++---
 tests/qemu-iotests/241 | 71 ++
 tests/qemu-iotests/241.out |  8 
 tests/qemu-iotests/group   |  1 +
 6 files changed, 95 insertions(+), 5 deletions(-)
 create mode 100755 tests/qemu-iotests/241
 create mode 100644 tests/qemu-iotests/241.out

-- 
2.20.1




[Qemu-devel] [PATCH v2 3/3] nbd/server: Advertise actual minimum block size

2019-03-27 Thread Eric Blake
Both NBD_CMD_BLOCK_STATUS and structured NBD_CMD_READ will split their
reply according to bdrv_block_status() boundaries. If the block device
has a request_alignment smaller than 512, but we advertise a block
alignment of 512 to the client, then this can result in the server
reply violating client expectations by reporting a smaller region of
the export than what the client is permitted to address (although this
is less of an issue for qemu 4.0 clients, given recent client patches
to overlook our non-compliance).  Since it's always better to be
strict in what we send, it is worth advertising the actual minimum
block limit rather than blindly rounding it up to 512.

Note that this patch is not foolproof - it is still possible to
provoke non-compliant server behavior using:

$ qemu-nbd --image-opts 
driver=blkdebug,align=512,image.driver=file,image.filename=/path/to/non-aligned-file

But as blkdebug is not normally used, and as this patch makes our
server more interoperable with qemu 3.1 clients, it is worth applying
now, even while we still work on a larger patch for the 4.1 timeframe
to improve the block layer to prevent the mid-block status changes
that can be observed now with blkdebug or with a backing layer with
smaller alignment than the active layer.

Note that the iotests output changes - both pre- and post-patch, the
server is reporting a mid-sector hole; but pre-patch, the client was
then rounding that up to a sector boundary as a workaround, while
post-patch the client doesn't have to round because it sees the
server's smaller advertised block size.

Signed-off-by: Eric Blake 
---
 nbd/server.c   | 12 +++-
 tests/qemu-iotests/241.out |  3 ++-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/nbd/server.c b/nbd/server.c
index fd013a2817a..c76f32dbb50 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -607,13 +607,15 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
uint16_t myflags,
 /* Send NBD_INFO_BLOCK_SIZE always, but tweak the minimum size
  * according to whether the client requested it, and according to
  * whether this is OPT_INFO or OPT_GO. */
-/* minimum - 1 for back-compat, or 512 if client is new enough.
- * TODO: consult blk_bs(blk)->bl.request_alignment? */
-sizes[0] =
-(client->opt == NBD_OPT_INFO || blocksize) ? BDRV_SECTOR_SIZE : 1;
+/* minimum - 1 for back-compat, or actual if client will obey it. */
+if (client->opt == NBD_OPT_INFO || blocksize) {
+sizes[0] = blk_get_request_alignment(exp->blk);
+} else {
+sizes[0] = 1;
+}
 /* preferred - Hard-code to 4096 for now.
  * TODO: is blk_bs(blk)->bl.opt_transfer appropriate? */
-sizes[1] = 4096;
+sizes[1] = MAX(4096, sizes[0]);
 /* maximum - At most 32M, but smaller as appropriate. */
 sizes[2] = MIN(blk_get_max_transfer(exp->blk), NBD_MAX_BUFFER_SIZE);
 trace_nbd_negotiate_handle_info_block_size(sizes[0], sizes[1], sizes[2]);
diff --git a/tests/qemu-iotests/241.out b/tests/qemu-iotests/241.out
index 044afc0c6f8..a7f1e665a9a 100644
--- a/tests/qemu-iotests/241.out
+++ b/tests/qemu-iotests/241.out
@@ -2,6 +2,7 @@ QA output created by 241

 === Exporting unaligned raw image ===

-[{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]
+[{ "start": 0, "length": 1000, "depth": 0, "zero": false, "data": true},
+{ "start": 1000, "length": 24, "depth": 0, "zero": true, "data": true}]
 1 KiB (0x400) bytes allocated at offset 0 bytes (0x0)
 *** done
-- 
2.20.1




Re: [Qemu-devel] intel_iommu migration compatibility for scalable mode emulation

2019-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2019 at 08:02:24PM +, Dr. David Alan Gilbert wrote:
> (Oops, originally forgot to cc the list)
> 
> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > Hi,
> >   Commit fb43cf739 introduced a new iommu migration field called
> > 'root_scalable' - but it doesn't do any machine type checks, which means
> > it breaks migration between 3.1<->4.0 -  we should fix this for 4.0.
> > 
> > Can we make this conditional on the 4.0 machine type?
> > 
> > Dave
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Yes we have to. Peter?

-- 
MST



Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 20:18:54 +
"Dr. David Alan Gilbert"  wrote:

> * Zhao Yan (yan.y.z...@intel.com) wrote:
> > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:  
> > > > > > >   b) How do we detect if we're migrating from/to the wrong device 
> > > > > > > or
> > > > > > > version of device?  Or say to a device with older firmware or 
> > > > > > > perhaps
> > > > > > > a device that has less device memory ?
> > > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > > whether it's better to check that in libvirt or qemu (like a device 
> > > > > > magic
> > > > > > along with verion ?).
> > > > 
> > > > We must keep the hardware generation the same within one POD of public
> > > > cloud providers. But we are still thinking about live migration from a
> > > > lower generation of hardware to a higher generation.
> > > 
> > > Agreed, lower->higher is the one direction that might make sense to
> > > support.
> > > 
> > > But regardless of that, I think we need to make sure that incompatible
> > > devices/versions fail directly instead of failing in a subtle, hard to
> > > debug way. Might be useful to do some initial sanity checks in libvirt
> > > as well.
> > > 
> > > How easy is it to obtain that information in a form that can be
> > > consumed by higher layers? Can we find out the device type at least?
> > > What about some kind of revision?  
> > hi Alex and Cornelia
> > for device compatibility, do you think it's a good idea to use "version"
> > and "device version" fields?
> > 
> > version field: identify live migration interface's version. it can have a
> > sort of backward compatibility, like target machine's version >= source
> > machine's version. something like that.

Don't we essentially already have this via the device specific region?
The struct vfio_info_cap_header includes id and version fields, so we
can declare a migration id and increment the version for any
incompatible changes to the protocol.

> > 
> > device_version field consists two parts:
> > 1. vendor id : it takes 32 bits. e.g. 0x8086.

Who allocates IDs?  If we're going to use PCI vendor IDs, then I'd
suggest we use a bit to flag it as such so we can reserve that portion
of the 32bit address space.  See for example:

#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE    (1 << 31)
#define VFIO_REGION_TYPE_PCI_VENDOR_MASK    (0xffff)

For vendor specific regions.

> > 2. vendor proprietary string: it can be any string that a vendor driver
> > thinks can identify a source device. e.g. pciid + mdev type.
> > "vendor id" is to avoid overlap of "vendor proprietary string".
> > 
> > 
> > struct vfio_device_state_ctl {
> >  __u32 version;/* ro */
> >  __u8 device_version[MAX_DEVICE_VERSION_LEN];/* ro */
> >  struct {
> > __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
> > ...
> >  }data;
> >  ...
> >  };

We have a buffer area where we can read and write data from the vendor
driver, why would we have this IS_COMPATIBLE protocol to write the
device version string but use a static fixed length version string in
the control header to read it?  IOW, let's use GET_VERSION,
CHECK_VERSION actions that make use of the buffer area and allow vendor
specific version information length.
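
A rough standalone model of that GET_VERSION/CHECK_VERSION flow, to make the
suggestion concrete (everything here, names and sizes included, is invented
for illustration and is not an existing vfio interface):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    enum ctl_action {
        CTL_GET_VERSION,    /* vendor driver fills data[] with its version blob */
        CTL_CHECK_VERSION,  /* vendor driver judges the blob placed in data[] */
    };

    struct device_state_ctl {
        uint32_t action;
        uint32_t data_len;
        uint8_t  data[4096];  /* staging buffer shared with the vendor driver */
    };

    /* Source side: fetch the vendor's version blob, whatever its length. */
    static uint32_t get_version(struct device_state_ctl *ctl,
                                uint8_t *out, uint32_t out_len)
    {
        ctl->action = CTL_GET_VERSION;      /* driver fills data[]/data_len */
        uint32_t n = ctl->data_len < out_len ? ctl->data_len : out_len;
        memcpy(out, ctl->data, n);
        return n;
    }

    /* Target side: hand the source's blob back and let the driver decide. */
    static bool check_version(struct device_state_ctl *ctl,
                              const uint8_t *src, uint32_t len)
    {
        if (len > sizeof(ctl->data)) {
            return false;                   /* blob too large for this model */
        }
        memcpy(ctl->data, src, len);
        ctl->data_len = len;
        ctl->action = CTL_CHECK_VERSION;    /* vendor driver returns its verdict */
        return true;                        /* placeholder for that verdict */
    }
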

> > 
> > Then, an action IS_COMPATIBLE is added to check device compatibility.
> > 
> > The flow to figure out whether a source device is migratable to target 
> > device
> > is like that:
> > 1. in source side's .save_setup, save source device's device_version string
> > 2. in target side's .load_state, load source device's device version string
> > and write it to data region, and call IS_COMPATIBLE action to ask vendor 
> > driver
> > to check whether the source device is compatible to it.
> > 
> > The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> > maintain a compatibility table and decide whether source device is 
> > compatible
> > to target device according to its proprietary table.
> > In device_version string, vendor driver only has to describe the source
> > device as elaborately as possible and resorts to vendor driver in target 
> > side
> > to figure out whether they are compatible.  

I agree, it's too complicated and restrictive to try to create an
interface for the user to determine compatibility, let the driver
declare it compatible or not.

> It would also be good if the 'IS_COMPATIBLE' was somehow callable
> externally - so we could be able to answer a question like 'can we
> migrate this VM to this host' - from the management layer before it
> actually starts the migration.

I think we'd need to mirror this capability in sysfs to support that,
or create a qmp interface through QEMU that the device owner could make
the request on behalf of the management layer.  Getting access to the
vfio device requires an iommu context that's already in use by the
device 

Re: [Qemu-devel] [PATCH for 4.0 v2 2/2] riscv: plic: Log guest errors

2019-03-27 Thread Philippe Mathieu-Daudé
On Wed, Mar 27, 2019 at 19:57, Alistair Francis wrote:

> Instead of using error_report() to print guest errors let's use
> qemu_log_mask(LOG_GUEST_ERROR,...) to log the error.
>
> Signed-off-by: Alistair Francis 
> ---
>  hw/riscv/sifive_plic.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
> index 70a85cd075..7f373d6c9d 100644
> --- a/hw/riscv/sifive_plic.c
> +++ b/hw/riscv/sifive_plic.c
> @@ -262,7 +262,9 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr
> addr, unsigned size)
>  }
>
>  err:
> -error_report("plic: invalid register read: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: Invalid register read 0x%" HWADDR_PRIx "\n",
> +  __func__, addr);
>  return 0;
>  }
>
> @@ -289,7 +291,9 @@ static void sifive_plic_write(void *opaque, hwaddr
> addr, uint64_t value,
>  } else if (addr >= plic->pending_base && /* 1 bit per source */
> addr < plic->pending_base + (plic->num_sources >> 3))
>  {
> -error_report("plic: invalid pending write: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: invalid pending write: 0x%" HWADDR_PRIx "",
> +  __func__, addr);
>  return;
>  } else if (addr >= plic->enable_base && /* 1 bit per source */
>  addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
> @@ -339,7 +343,9 @@ static void sifive_plic_write(void *opaque, hwaddr
> addr, uint64_t value,
>  }
>
>  err:
> -error_report("plic: invalid register write: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
> +  __func__, addr);
>  }
>
>  static const MemoryRegionOps sifive_plic_ops = {
> --
> 2.21.0
>

I'm using the embedded Gmail app and wonder whether its deduplication feature
is sometimes too aggressive: when the same patch is sent in increasing
versions, I reply to the last version I see, but my reply looks as if I
replied to the first one.
I might also be misusing the app :)
Anyway:
Reviewed-by: Philippe Mathieu-Daudé 

>


Re: [Qemu-devel] [PATCH] migration: remove not used field xfer_limit

2019-03-27 Thread Wei Yang
On Wed, Mar 27, 2019 at 08:24:21PM +, Dr. David Alan Gilbert wrote:
>* Wei Yang (richardw.y...@linux.intel.com) wrote:
>> MigrationState->xfer_limit is only set to 0 in migrate_init().
>> 
>> Remove this unnecessary field.
>> 
>> Signed-off-by: Wei Yang 
>
>Nice; I think that field has been unneeded since 1964a397063967
>just over 5 years ago :-)
>

Wow, such a long time...

>
>Reviewed-by: Dr. David Alan Gilbert 
>

Thanks

>> ---
>>  migration/migration.c | 1 -
>>  migration/migration.h | 1 -
>>  2 files changed, 2 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index e88acab53b..533c2102c7 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1682,7 +1682,6 @@ void migrate_init(MigrationState *s)
>>   * locks.
>>   */
>>  s->bytes_xfer = 0;
>> -s->xfer_limit = 0;
>>  s->cleanup_bh = 0;
>>  s->to_dst_file = NULL;
>>  s->rp_state.from_dst_file = NULL;
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 99e99e56bd..852eb3c4e9 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -117,7 +117,6 @@ struct MigrationState
>>  
>>  /*< public >*/
>>  size_t bytes_xfer;
>> -size_t xfer_limit;
>>  QemuThread thread;
>>  QEMUBH *cleanup_bh;
>>  QEMUFile *to_dst_file;
>> -- 
>> 2.19.1
>> 
>--
>Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

-- 
Wei Yang
Help you, Help me



Re: [Qemu-devel] [Qemu-block] [PATCH] block/file-posix: ignore fail on unlock bytes

2019-03-27 Thread Eric Blake
On 3/27/19 3:33 PM, John Snow wrote:
> 
> 
> On 3/27/19 8:49 AM, Vladimir Sementsov-Ogievskiy wrote:
>> bdrv_replace_child() calls bdrv_check_perm() with error_abort on
>> loosening permissions. However file-locking operations may fail even
>> in this case, for example on NFS. And this leads to a QEMU crash.
>>
>> Let's ignore such errors, as we do already on permission update commit
>> and abort.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>> ---
>>  block/file-posix.c | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index db4cccbe51..403e67fe90 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -815,6 +815,20 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
>>  
>>  switch (op) {
>>  case RAW_PL_PREPARE:
>> +if ((s->perm | new_perm) == s->perm &&

This says if new_perm does not add any bits beyond what s->perm had.

>> +(~s->shared_perm | ~new_perm) == ~s->shared_perm)
> 
> Little strange to read, but ultimately "If we aren't changing anything"
> based on the call below.

'(~a | ~b)' is equivalent to '~(a & b)'.

'~(a & b) == ~a' is equivalent to '(a & b) == a'

That expression is much easier to read, as new_perm does not remove any
bits beyond what s->shared_perm already had.

But rewriting it in an easier form would indeed make the patch easier to
swallow.
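
Concretely, the rewritten condition being suggested would read something
like this (my rendering of the algebra above, not a posted patch):

    /* Same short-circuit test, written positively: new_perm adds no bits to
     * s->perm, and every bit of s->shared_perm is still present in new_perm. */
    if ((s->perm | new_perm) == s->perm &&
        (s->shared_perm & new_perm) == s->shared_perm) {
        /* only unlocking bytes: check, report, and ignore any failure */
    }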

> 
>> +{
>> +/*
>> + * We are going to unlock bytes, it should not fail. If fail,
>> + * just report it and ignore, like we do for ABORT and COMMIT
>> + * anyway.
>> + */
>> +ret = raw_check_lock_bytes(s->fd, new_perm, new_shared, &local_err);
>> +if (local_err) {
>> +error_report_err(local_err);
>> +}
>> +return 0;
>> +}
>>  ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
>> ~s->shared_perm | ~new_shared,
>> false, errp);
>>
> 
> I think this makes sense, but hopefully someone else can give it the
> once-over too.
> 
> Reviewed-by: John Snow 
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





[Qemu-devel] [PATCH for-4.0] nbd-client: Work around server BLOCK_STATUS misalignment at EOF

2019-03-27 Thread Eric Blake
On 3/27/19 11:56 AM, Vladimir Sementsov-Ogievskiy wrote:
> 26.03.2019 20:13, Eric Blake wrote:
>> The NBD spec is clear that a server that advertises a minimum block
>> size should reply to NBD_CMD_BLOCK_STATUS with extents aligned
>> accordingly. However, we know that the qemu NBD server implementation
>> has had a corner-case bug where it is not compliant with the spec,
>> present since the introduction of NBD_CMD_BLOCK_STATUS in qemu 2.12
>> (and unlikely to be patched in time for 4.0). Namely, when qemu is
>> serving a file that is not a multiple of 512 bytes, it rounds the size
>> advertised over NBD up to the next sector boundary (someday, I'd like
>> to fix that to be byte-accurate, but it's a much bigger audit not
>> appropriate for this release); yet if the final sector contains data
>> prior to EOF, lseek(SEEK_HOLE) will point to the implicit hole
>> mid-sector which qemu then reported over NBD.
>>

>>
>> Easy reproduction:
>> $ printf %1000d 1 > file
>> $ qemu-nbd -f raw -t file & pid=$!
>> $ qemu-img map --output=json -f raw nbd://localhost:10809
>> qemu-img: Could not read file metadata: Invalid argument
>> $ kill $pid
> 
> would be good to add it to iotests
> 

Will do in a followup; probably by reviving a v2 of this series:
https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg00305.html

>>
>> where the patched version instead succeeds with:
>> [{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]
>>
>> Signed-off-by: Eric Blake 
> 
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 

Thanks; will queue through my NBD tree for -rc2.


>> +++ b/block/nbd-client.c
>> @@ -269,14 +269,36 @@ static int 
>> nbd_parse_blockstatus_payload(NBDClientSession *client,
>>   extent->length = payload_advance32(&payload);
>>   extent->flags = payload_advance32(&payload);
>>
>> -if (extent->length == 0 ||
>> -(client->info.min_block && !QEMU_IS_ALIGNED(extent->length,
>> -
>> client->info.min_block))) {
>> +if (extent->length == 0) {
>>   error_setg(errp, "Protocol error: server sent status chunk with "
>>  "invalid length");
> 
> may be improved to s/invalid/zero/

Will do.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





[Qemu-devel] [PATCH v7 2/2] spapr-rtas: add ibm,get-vpd RTAS interface

2019-03-27 Thread Maxiwell S. Garcia
This adds a handler for ibm,get-vpd RTAS calls, allowing a pseries guest
to collect host information. It is disabled by default to avoid unwanted
information leakage. To enable it, use:

-M pseries,host-serial={passthrough|string},host-model={passthrough|string}

Only the SE and TM keywords are returned at the moment.
SE for Machine or Cabinet Serial Number and
TM for Machine Type and Model

The SE and TM keywords are controlled by 'host-serial' and 'host-model'
options, respectively. In the case of 'passthrough', the SE shows the
host 'system-id' information and the TM shows the host 'model' information.

Powerpc-utils tools can dispatch RTAS calls to retrieve host
information using this ibm,get-vpd interface. The 'host-serial'
and 'host-model' nodes of device-tree hold the same information but
in a static manner, which is useless after a migration operation.

Signed-off-by: Maxiwell S. Garcia 
---
 hw/ppc/spapr_rtas.c| 96 ++
 include/hw/ppc/spapr.h | 10 -
 2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index 24c45b12d4..c42630e95d 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -287,6 +287,98 @@ static void rtas_ibm_set_system_parameter(PowerPCCPU *cpu,
 rtas_st(rets, 0, ret);
 }
 
+static int vpd_st(target_ulong addr, target_ulong len,
+  const void *val, uint16_t val_len)
+{
+hwaddr phys = ppc64_phys_to_real(addr);
+
+if (len < val_len) {
+return RTAS_OUT_PARAM_ERROR;
+}
+
+cpu_physical_memory_write(phys, val, val_len);
+return RTAS_OUT_SUCCESS;
+}
+
+static void vpd_ret(target_ulong rets, const int status,
+const int next_seq_number, const int bytes_returned)
+{
+rtas_st(rets, 0, status);
+rtas_st(rets, 1, next_seq_number);
+rtas_st(rets, 2, bytes_returned);
+}
+
+static void rtas_ibm_get_vpd_fields_register(SpaprMachineState *spapr)
+{
+int i = 0;
+char *buf = NULL;
+
+memset(spapr->rtas_get_vpd_fields, 0,
+   sizeof(spapr->rtas_get_vpd_fields));
+
+buf = spapr_get_valid_host_serial(spapr);
+if (buf) {
+spapr->rtas_get_vpd_fields[i++] = g_strdup_printf("SE %s", buf);
+g_free(buf);
+}
+
+buf = spapr_get_valid_host_model(spapr);
+if (buf) {
+spapr->rtas_get_vpd_fields[i++] = g_strdup_printf("TM %s", buf);
+g_free(buf);
+}
+}
+
+static void rtas_ibm_get_vpd(PowerPCCPU *cpu,
+ SpaprMachineState *spapr,
+ uint32_t token, uint32_t nargs,
+ target_ulong args,
+ uint32_t nret, target_ulong rets)
+{
+target_ulong loc_code_addr = rtas_ld(args, 0);
+target_ulong work_area_addr = rtas_ld(args, 1);
+target_ulong work_area_size = rtas_ld(args, 2);
+target_ulong seq_number = rtas_ld(args, 3);
+unsigned int next_seq_number = 1;
+int status = RTAS_OUT_PARAM_ERROR;
+unsigned char loc_code;
+unsigned int vpd_field_len;
+char *vpd_field;
+int ret;
+
+/* Specific Location Code is not supported and seq_number */
+/* must be checked to avoid out of bound index error */
+cpu_physical_memory_read(loc_code_addr, &loc_code, 1);
+if ((loc_code != 0) || (seq_number <= 0) ||
+(seq_number > RTAS_IBM_GET_VPD_MAX)) {
+vpd_ret(rets, RTAS_OUT_PARAM_ERROR, 1, 0);
+return;
+}
+
+/* RTAS not authorized if no keywords have been registered */
+if (!spapr->rtas_get_vpd_fields[seq_number - 1]) {
+vpd_ret(rets, RTAS_OUT_NOT_AUTHORIZED, 1, 0);
+return;
+}
+
+vpd_field = spapr->rtas_get_vpd_fields[seq_number - 1];
+vpd_field_len = strlen(vpd_field);
+ret = vpd_st(work_area_addr, work_area_size,
+ vpd_field, vpd_field_len + 1);
+
+if (ret == 0) {
+next_seq_number = seq_number + 1;
+if (spapr->rtas_get_vpd_fields[next_seq_number - 1]) {
+status = RTAS_IBM_GET_VPD_CONTINUE;
+} else {
+status = RTAS_OUT_SUCCESS;
+next_seq_number = 1;
+}
+}
+
+vpd_ret(ret, status, next_seq_number, vpd_field_len);
+}
+
 static void rtas_ibm_os_term(PowerPCCPU *cpu,
 SpaprMachineState *spapr,
 uint32_t token, uint32_t nargs,
@@ -464,6 +556,8 @@ void spapr_load_rtas(SpaprMachineState *spapr, void *fdt, 
hwaddr addr)
  fdt_strerror(ret));
 exit(1);
 }
+
+rtas_ibm_get_vpd_fields_register(spapr);
 }
 
 static void core_rtas_register_types(void)
@@ -485,6 +579,8 @@ static void core_rtas_register_types(void)
 rtas_ibm_set_system_parameter);
 spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
 rtas_ibm_os_term);
+spapr_rtas_register(RTAS_IBM_GET_VPD, "ibm,get-vpd",
+rtas_ibm_get_vpd);
 

Re: [Qemu-devel] [Qemu-block] [PATCH for-4.0] qemu-img: Gracefully shutdown when map can't finish

2019-03-27 Thread John Snow



On 3/26/19 2:40 PM, Eric Blake wrote:
> Trying 'qemu-img map -f raw nbd://localhost:10809' causes the
> NBD server to output a scary message:
> 
> qemu-nbd: Disconnect client, due to: Failed to read request: Unexpected 
> end-of-file before all bytes were read
> 
> This is because the NBD client, being remote, has no way to expose a
> human-readable map (the --output=json data is fine, however). But
> because we exit(1) right after the message, causing the client to
> bypass all block cleanup, the server sees the abrupt exit and warns,
> whereas it would be silent had the client had a chance to send
> NBD_CMD_DISC. Other protocols may have similar cleanup issues, where
> failure to blk_unref() could cause unintended effects.
> 
> Signed-off-by: Eric Blake 

Seems good; taking it through the NBD tree helps me keep track of them :)

Reviewed-by: John Snow 



[Qemu-devel] [PATCH v7 1/2] spapr: helper functions to get valid host fields

2019-03-27 Thread Maxiwell S. Garcia
The pseries options 'host-serial' and 'host-model' accept
'none', 'passthrough', or a user-supplied string. The helper
functions in this commit return a valid host field based on
the user options.

Signed-off-by: Maxiwell S. Garcia 
Reviewed-by: Greg Kurz 
---
 hw/ppc/spapr.c | 48 +++---
 include/hw/ppc/spapr.h |  4 
 2 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 6c16d6cfaf..3e2d5fe438 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1211,6 +1211,24 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, 
void *fdt)
 g_free(bootlist);
 }
 
+#define SPAPR_GET_VALID_HOST(attr)\
+char *spapr_get_valid_host_##attr(SpaprMachineState *spapr)   \
+{ \
+char *buf = NULL; \
+if (spapr->host_##attr && \
+!g_str_equal(spapr->host_##attr, "none")) {   \
+if (g_str_equal(spapr->host_##attr, "passthrough")) { \
+kvmppc_get_host_##attr(&buf); \
+} else {  \
+buf = g_strdup(spapr->host_##attr);   \
+} \
+} \
+return buf;   \
+}
+
+SPAPR_GET_VALID_HOST(serial);
+SPAPR_GET_VALID_HOST(model);
+
 static void spapr_dt_hypervisor(SpaprMachineState *spapr, void *fdt)
 {
 /* The /hypervisor node isn't in PAPR - this is a hack to allow PR
@@ -1256,30 +1274,16 @@ static void *spapr_build_fdt(SpaprMachineState *spapr)
  * Add info to guest to indentify which host is it being run on
  * and what is the uuid of the guest
  */
-if (spapr->host_model && !g_str_equal(spapr->host_model, "none")) {
-if (g_str_equal(spapr->host_model, "passthrough")) {
-/* -M host-model=passthrough */
-if (kvmppc_get_host_model(&buf)) {
-_FDT(fdt_setprop_string(fdt, 0, "host-model", buf));
-g_free(buf);
-}
-} else {
-/* -M host-model= */
-_FDT(fdt_setprop_string(fdt, 0, "host-model", spapr->host_model));
-}
+buf = spapr_get_valid_host_model(spapr);
+if (buf) {
+_FDT(fdt_setprop_string(fdt, 0, "host-model", buf));
+g_free(buf);
 }
 
-if (spapr->host_serial && !g_str_equal(spapr->host_serial, "none")) {
-if (g_str_equal(spapr->host_serial, "passthrough")) {
-/* -M host-serial=passthrough */
-if (kvmppc_get_host_serial(&buf)) {
-_FDT(fdt_setprop_string(fdt, 0, "host-serial", buf));
-g_free(buf);
-}
-} else {
-/* -M host-serial= */
-_FDT(fdt_setprop_string(fdt, 0, "host-serial", 
spapr->host_serial));
-}
+buf = spapr_get_valid_host_serial(spapr);
+if (buf) {
+_FDT(fdt_setprop_string(fdt, 0, "host-serial", buf));
+g_free(buf);
 }
 
buf = qemu_uuid_unparse_strdup(&qemu_uuid);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 2b4c05a2ec..5e0a1b2a8a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -856,6 +856,10 @@ int spapr_caps_post_migration(SpaprMachineState *spapr);
 
 void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
   Error **errp);
+
+char *spapr_get_valid_host_serial(SpaprMachineState *spapr);
+char *spapr_get_valid_host_model(SpaprMachineState *spapr);
+
 /*
  * XIVE definitions
  */
-- 
2.20.1




[Qemu-devel] [PATCH v7 0/2] spapr-rtas: add ibm,get-vpd RTAS interface

2019-03-27 Thread Maxiwell S. Garcia
Here are two patches to add a handler for ibm,get-vpd RTAS calls.
This RTAS call exposes host information when the QEMU options
'host-serial' and 'host-model' are set to 'passthrough'.

Patch 1 creates helper functions to get valid 'host-serial'
and 'host-model' parameters, guided by the QEMU command line. These
parameters are used to build the guest device tree and to answer
get-vpd RTAS calls. Patch 2 adds the ibm,get-vpd call itself.

Update v7:
* rtas_get_vpd_fields as a static array in spapr machine state

Maxiwell S. Garcia (2):
  spapr: helper functions to get valid host fields
  spapr-rtas: add ibm,get-vpd RTAS interface

 hw/ppc/spapr.c | 48 +++--
 hw/ppc/spapr_rtas.c| 96 ++
 include/hw/ppc/spapr.h | 14 +-
 3 files changed, 135 insertions(+), 23 deletions(-)

-- 
2.20.1




Re: [Qemu-devel] [Qemu-block] [PATCH] block/file-posix: ignore fail on unlock bytes

2019-03-27 Thread John Snow



On 3/27/19 8:49 AM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_replace_child() calls bdrv_check_perm() with error_abort on
> loosening permissions. However file-locking operations may fail even
> in this case, for example on NFS. And this leads to a QEMU crash.
> 
> Let's ignore such errors, as we do already on permission update commit
> and abort.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/file-posix.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index db4cccbe51..403e67fe90 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -815,6 +815,20 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
>  
>  switch (op) {
>  case RAW_PL_PREPARE:
> +if ((s->perm | new_perm) == s->perm &&
> +(~s->shared_perm | ~new_perm) == ~s->shared_perm)

Little strange to read, but ultimately "If we aren't changing anything"
based on the call below.

> +{
> +/*
> + * We are going to unlock bytes, it should not fail. If fail,
> + * just report it and ignore, like we do for ABORT and COMMIT
> + * anyway.
> + */
> +ret = raw_check_lock_bytes(s->fd, new_perm, new_shared, &local_err);
> +if (local_err) {
> +error_report_err(local_err);
> +}
> +return 0;
> +}
>  ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
> ~s->shared_perm | ~new_shared,
> false, errp);
> 

I think this makes sense, but hopefully someone else can give it the
once-over too.

Reviewed-by: John Snow 



Re: [Qemu-devel] [Xen-devel] [PATCH v2 0/2] xen-block: fix sector size confusion

2019-03-27 Thread Paul Durrant
> -Original Message-
> From: Andrew Cooper
> Sent: 27 March 2019 18:20
> To: Paul Durrant ; xen-de...@lists.xenproject.org; 
> qemu-bl...@nongnu.org;
> qemu-devel@nongnu.org
> Cc: Kevin Wolf ; Stefano Stabellini 
> ; Max Reitz
> ; Stefan Hajnoczi ; Anthony Perard 
> 
> Subject: Re: [Xen-devel] [PATCH v2 0/2] xen-block: fix sector size confusion
> 
> On 27/03/2019 17:32, Paul Durrant wrote:
> > The Xen blkif protocol is confusing but discussion with the maintainer
> > has clarified that sector based quantities in requests and the 'sectors'
> > value advertized in xenstore should always be in terms of 512-byte
> > units and not the advertised logical 'sector-size' value.
> >
> > This series fixes xen-block to adhere to the spec.
> 
> I thought we agreed that hardcoding things to 512 bytes was the wrong
> thing to do.

To some extent we decided it was the *only* thing to do.

> 
> I was expecting something like:
> 
> 1) Clarify the spec with the intended meaning (which is what some
> implementations actually use already and won't cripple 4k datapaths).
> 2) Introduce a compatibility key for "I don't rely on sector-size being
> 512", which fixed implementations should advertise.
> 3) Specify that because of bugs in the spec which got out into the wild,
> drivers which don't find the key being advertised by the other end
> should emulate sector-size=512 for compatibility with broken
> implementations.

Yes, that's how we are going to fix things.

> 
> Whatever the eventual way out, the first thing which needs to happen is
> an update to the spec, before actions are taken to alter existing
> implementations.

Well the implementation is currently wrong w.r.t. the spec and these patches 
fix that. As long as sector-size remains at 512 then no existing frontend 
should break, so I guess you could argue that patch #2 should also make sure 
that sector-size is also 512... but that is not yet in the spec.
I guess I'm ok to defer patch #2 until a revised spec is agreed, but the ship 
has already sailed as far as patch #1 goes.

Anthony, thoughts?

  Paul 

> 
> ~Andrew


Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Philippe Mathieu-Daudé
On Wed, Mar 27, 2019 at 17:06, Daniel P. Berrangé wrote:

> On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
> > On 27/03/19 16:30, Daniel P. Berrangé wrote:
> > > Perhaps the VM test scripts should do a "HEAD" request for the image
> > > every time to discover if it has been changed on the server, before
> > > honouring the local cache.
> >
> > Another possibility is to first download the shasum from
> > download.patchew.org, and compare _that_ against the one that is stored
> > locally, instead of hardcoding it in QEMU's repository.
>
> Personally I prefer the idea of having the shasum stored in the repo.
>
> That means that if we update git master to point to a newer image,
> previous stable branches will stick with their original image, rather
> than using a new image that may be incompatible with the stable branch
>
> Storing hash in git also means that if someone compromised the patchew
> server, they can't cause developer to run compromised images, without
> first also compromising git to change the hash.
>

And we can git bisect too, right?


> Regards,
> Daniel
> --
> |: https://berrange.com  -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-
> https://www.instagram.com/dberrange :|
>


Re: [Qemu-devel] [PATCH] migration: remove not used field xfer_limit

2019-03-27 Thread Dr. David Alan Gilbert
* Wei Yang (richardw.y...@linux.intel.com) wrote:
> MigrationState->xfer_limit is only set to 0 in migrate_init().
> 
> Remove this unnecessary field.
> 
> Signed-off-by: Wei Yang 

Nice; I think that field has been unneeded since 1964a397063967
just over 5 years ago :-)


Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/migration.c | 1 -
>  migration/migration.h | 1 -
>  2 files changed, 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index e88acab53b..533c2102c7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1682,7 +1682,6 @@ void migrate_init(MigrationState *s)
>   * locks.
>   */
>  s->bytes_xfer = 0;
> -s->xfer_limit = 0;
>  s->cleanup_bh = 0;
>  s->to_dst_file = NULL;
>  s->rp_state.from_dst_file = NULL;
> diff --git a/migration/migration.h b/migration/migration.h
> index 99e99e56bd..852eb3c4e9 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -117,7 +117,6 @@ struct MigrationState
>  
>  /*< public >*/
>  size_t bytes_xfer;
> -size_t xfer_limit;
>  QemuThread thread;
>  QEMUBH *cleanup_bh;
>  QEMUFile *to_dst_file;
> -- 
> 2.19.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v2 1/2] block/vhdx: Remove redundant IEC binary prefixes definition

2019-03-27 Thread Philippe Mathieu-Daudé
On Wed, Mar 27, 2019 at 18:18, John Snow wrote:

>
>
> On 3/27/19 5:56 AM, Stefano Garzarella wrote:
> > IEC binary prefixes are already defined in "qemu/units.h",
> > so we can remove redundant definitions in "block/vhdx.h".
> >
> > Signed-off-by: Stefano Garzarella 
> > ---
> >  block/vhdx.c | 3 ++-
> >  block/vhdx.h | 6 +-
> >  2 files changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/block/vhdx.c b/block/vhdx.c
> > index b785aef4b7..7cd1fc3731 100644
> > --- a/block/vhdx.c
> > +++ b/block/vhdx.c
> > @@ -1889,7 +1889,8 @@ static int coroutine_fn
> vhdx_co_create(BlockdevCreateOptions *opts,
> >  return -EINVAL;
> >  }
> >  if (block_size > VHDX_BLOCK_SIZE_MAX) {
> > -error_setg(errp, "Block size must not exceed %d",
> VHDX_BLOCK_SIZE_MAX);
> > +error_setg(errp, "Block size must not exceed %" PRId64,
> > +   VHDX_BLOCK_SIZE_MAX);
> >  return -EINVAL;
> >  }
> >
> > diff --git a/block/vhdx.h b/block/vhdx.h
> > index 1bfb4e4f73..bf72090c8f 100644
> > --- a/block/vhdx.h
> > +++ b/block/vhdx.h
> > @@ -17,11 +17,7 @@
> >
> >  #ifndef BLOCK_VHDX_H
> >  #define BLOCK_VHDX_H
> > -
> > -#define KiB  (1 * 1024)
> > -#define MiB(KiB * 1024)
> > -#define GiB(MiB * 1024)
> > -#define TiB ((uint64_t) GiB * 1024)
> > +#include "qemu/units.h"
> >
> >  #define DEFAULT_LOG_SIZE 1048576 /* 1MiB */
> >  /* Structures and fields present in the VHDX file */
> >
>
> Looks good. By the way, you may want to consider adding to your git
> config rules that put the .h file diff changes above the .c file.
>
> Look at qemu/scripts/git.orderfile
>
> (I know I'm one to talk, I still haven't set mine up...)
>

=)


> Reviewed-by: John Snow 
>
>


Re: [Qemu-devel] [PATCH 0/5] QEMU VFIO live migration

2019-03-27 Thread Dr. David Alan Gilbert
* Zhao Yan (yan.y.z...@intel.com) wrote:
> On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck wrote:
> > > > > >   b) How do we detect if we're migrating from/to the wrong device or
> > > > > > version of device?  Or say to a device with older firmware or 
> > > > > > perhaps
> > > > > > a device that has less device memory ?  
> > > > > Actually it's still an open for VFIO migration. Need to think about
> > > > > whether it's better to check that in libvirt or qemu (like a device 
> > > > > magic
> > > > > along with verion ?).  
> > > 
> > > We must keep the hardware generation the same within one POD of public
> > > cloud providers. But we are still thinking about live migration from a
> > > lower generation of hardware to a higher generation.
> > 
> > Agreed, lower->higher is the one direction that might make sense to
> > support.
> > 
> > But regardless of that, I think we need to make sure that incompatible
> > devices/versions fail directly instead of failing in a subtle, hard to
> > debug way. Might be useful to do some initial sanity checks in libvirt
> > as well.
> > 
> > How easy is it to obtain that information in a form that can be
> > consumed by higher layers? Can we find out the device type at least?
> > What about some kind of revision?
> hi Alex and Cornelia
> for device compatibility, do you think it's a good idea to use "version"
> and "device version" fields?
> 
> version field: identify live migration interface's version. it can have a
> sort of backward compatibility, like target machine's version >= source
> machine's version. something like that.
> 
> device_version field consists two parts:
> 1. vendor id : it takes 32 bits. e.g. 0x8086.
> 2. vendor proprietary string: it can be any string that a vendor driver
> thinks can identify a source device. e.g. pciid + mdev type.
> "vendor id" is to avoid overlap of "vendor proprietary string".
> 
> 
> struct vfio_device_state_ctl {
>  __u32 version;/* ro */
>  __u8 device_version[MAX_DEVICE_VERSION_LEN];/* ro */
>  struct {
>   __u32 action; /* GET_BUFFER, SET_BUFFER, IS_COMPATIBLE*/
>   ...
>  }data;
>  ...
>  };
> 
> Then, an action IS_COMPATIBLE is added to check device compatibility.
> 
> The flow to figure out whether a source device is migratable to target device
> is like that:
> 1. in source side's .save_setup, save source device's device_version string
> 2. in target side's .load_state, load source device's device version string
> and write it to data region, and call IS_COMPATIBLE action to ask vendor 
> driver
> to check whether the source device is compatible to it.
> 
> The advantage of adding an IS_COMPATIBLE action is that, vendor driver can
> maintain a compatibility table and decide whether source device is compatible
> to target device according to its proprietary table.
> In device_version string, vendor driver only has to describe the source
> device as elaborately as possible and resorts to vendor driver in target side
> to figure out whether they are compatible.

It would also be good if the 'IS_COMPATIBLE' was somehow callable
externally - so we could be able to answer a question like 'can we
migrate this VM to this host' - from the management layer before it
actually starts the migration.

Dave

> Thanks
> Yan
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] Question about hostfwd IPv6 parameter

2019-03-27 Thread Hikmat Jafarli
Appears that this was already done in 2018 October:
https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg05734.html
Disregard this thread, I should have searched better.

On Wed, Mar 27, 2019 at 4:02 PM Hikmat Jafarli  wrote:

> So I'm trying to add IPv6 support to hostfwd, as requested in
> https://bugs.launchpad.net/qemu/+bug/1724590
>
> I'm not entirely sure what would be the right parameter format in case of
> IPv6 support.
>
> Current IPv4-only format is
> "hostfwd=[tcp|udp]:[hostaddr]:hostport-[guestaddr]:guestport", example
> usage: "qemu-system-i386 -nic user,hostfwd=tcp:127.0.0.1:6001-:6000".
>
> I have searched through the manpages but couldn't find any other QEMU
> option of the same "kind" that also supported IPv6, so there's nothing to
> copy. What would be the right approach here? Should I introduce "hostfwd6"
> to be used separately or should I somehow extend the current "hostfwd"
> format?
>
> Any suggestions?
>
>


Re: [Qemu-devel] intel_iommu migration compatibility for scalable mode emulation

2019-03-27 Thread Dr. David Alan Gilbert
(Oops, originally forgot to cc the list)

* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> Hi,
>   Commit fb43cf739 introduced a new iommu migration field called
> 'root_scalable' - but it doesn't do any machine type checks, which means
> it breaks migration between 3.1<->4.0 -  we should fix this for 4.0.
> 
> Can we make this conditional on the 4.0 machine type?
> 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] 4.0->3.1 audio compat

2019-03-27 Thread Dr. David Alan Gilbert
Oops, originally didn't cc the list.

* Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> Hi Gerd,
>   I bisected a migration compatibility issue down to:
> bd56d378842c238c audio: fix pc speaker init
> 
> migrating a simple VM from 4.0rc1->3.1 gives me:
>   ./x86_64-softmmu/qemu-system-x86_64 -M pc-q35-3.1,accel=kvm -m 4G -smp 1 
> -nographic
> 
>Unknown savevm section or instance 'audio' 0.
> 
> and indeed there's an audio section in the stream now that wasn't
> there before.
> 
> I'm not sure what influences whether it exists or not though.
> 
> Can we fix that for 4.0?
> 
> Dave
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v1] s390x/kvm: Configure page size after memory has actually been initialized

2019-03-27 Thread David Hildenbrand
On 27.03.19 17:45, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 14:59:44 +0100
> David Hildenbrand  wrote:
> 
>> Right now we configure the pagesize quite early, when initializing KVM.
>> This is long before system memory is actually allocated via
>> memory_region_allocate_system_memory(), and therefore memory backends
>> marked as mapped.
>>
>> Instead, let's configure the maximum page size after initializing
>> memory in s390_memory_init(). cap_hpage_1m is still properly
>> configured before creating any CPUs, and therefore before configuring
>> the CPU model and eventually enabling CMMA.
>>
>> We might later want to replace qemu_getrampagesize() by another
>> detection mechanism, e.g. only looking at mapped, initial memory.
>> We don't support any memory devices yet, and if so, we can always reject
>> devices with a page size bigger than the initial page size when
>> hotplugging. qemu_getrampagesize() should work for now, especially when
>> converting it to only look at mapped backends.
>>
>> Signed-off-by: David Hildenbrand 
> 
> Acked-by: Igor Mammedov 

BTW, do we want

qemu_getmaxrampagesize()
qemu_getminrampagesize()

or similar. qemu_getrampagesize() in its current form is really far from
beautiful.
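
A toy model of that split, just to show the intended shape (illustrative
only, not QEMU code; the real helpers would walk the mapped memory backends
rather than take an array):

    #include <stddef.h>
    #include <unistd.h>

    /* Report the largest/smallest backend page size, falling back to the
     * host page size when no backend is mapped. */
    static long max_ram_pagesize(const long *backend_pagesizes, size_t n)
    {
        long max = sysconf(_SC_PAGESIZE);
        for (size_t i = 0; i < n; i++) {
            if (backend_pagesizes[i] > max) {
                max = backend_pagesizes[i];
            }
        }
        return max;
    }

    static long min_ram_pagesize(const long *backend_pagesizes, size_t n)
    {
        long min = sysconf(_SC_PAGESIZE);
        for (size_t i = 0; i < n; i++) {
            if (backend_pagesizes[i] > 0 && backend_pagesizes[i] < min) {
                min = backend_pagesizes[i];
            }
        }
        return min;
    }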

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [PATCH for-4.0] target/ppc: Fix QEMU crash with stxsdx

2019-03-27 Thread Mark Cave-Ayland
On 27/03/2019 18:12, Greg Kurz wrote:

> I've been hitting several QEMU crashes while running a fedora29 ppc64le
> guest under TCG. Each time, this would occur several minutes after the
> guest reached login:
> 
> Fedora 29 (Twenty Nine)
> Kernel 4.20.6-200.fc29.ppc64le on an ppc64le (hvc0)
> 
> Web console: https://localhost:9090/
> 
> localhost login:
> tcg/tcg.c:3211: tcg fatal error
> 
> This happens because a bug crept up in the gen_stxsdx() helper when it
> was converted to use VSR register accessors by commit 8b3b2d75c7c04
> "target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}() helpers
> for VSR register access".
> 
> The code creates a temporary, passes it directly to gen_qemu_st64_i64()
> and then to set_cpu_vrsh()... which looks like this was mistakenly
> coded as a load instead of a store.
> 
> Reverse the logic: read the VSR to the temporary first and then store
> it to memory.
> 
> Fixes: 8b3b2d75c7c0481544e277dad226223245e058eb
> Signed-off-by: Greg Kurz 
> ---
>  target/ppc/translate/vsx-impl.inc.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target/ppc/translate/vsx-impl.inc.c 
> b/target/ppc/translate/vsx-impl.inc.c
> index 508e9199c8df..489b2436e49f 100644
> --- a/target/ppc/translate/vsx-impl.inc.c
> +++ b/target/ppc/translate/vsx-impl.inc.c
> @@ -356,8 +356,8 @@ static void gen_##name(DisasContext *ctx) 
> \
>  gen_set_access_type(ctx, ACCESS_INT); \
>  EA = tcg_temp_new();  \
>  gen_addr_reg_index(ctx, EA);  \
> +get_cpu_vsrh(t0, xS(ctx->opcode));\
>  gen_qemu_##operation(ctx, t0, EA);\
> -set_cpu_vsrh(xS(ctx->opcode), t0);\
>  tcg_temp_free(EA);\
>  tcg_temp_free_i64(t0);\
>  }

Gah, yes - I remember at one point having to double check the get/set direction 
but
obviously my brain and fingers got the better of me here. Apologies for the
inconvenience.

Reviewed-by: Mark Cave-Ayland 


ATB,

Mark.



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 19:07:44 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 3/27/19 7:02 PM, Alex Williamson wrote:
> > On Wed, 27 Mar 2019 12:46:41 +0100
> > Auger Eric  wrote:
> >   
> >> Hi Alex,
> >>
> >> On 3/26/19 11:55 PM, Alex Williamson wrote:  
> >>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >>> distinguish by devfn & bus between devices in a conventional PCI
> >>> topology and therefore we cannot assign them separate AddressSpaces.
> >>> By taking this requester ID aliasing into account, QEMU better matches
> >>> the bare metal behavior and restrictions, and enables shared
> >>> AddressSpace configurations that are otherwise not possible with
> >>> guest IOMMU support.
> >>>
> >>> For the latter case, given any example where an IOMMU group on the
> >>> host includes multiple devices:
> >>>
> >>>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>>   :00:01.0  :01:00.0  :01:00.1
> >>>
> >>> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >>> that we can only assign one of the endpoints to the guest because a
> >>> second endpoint will attempt to use a different AddressSpace.  VFIO
> >>> only supports IOMMU group level granularity at the container level,
> >>> preventing this second endpoint from being assigned:
> >>>
> >>> qemu-system-x86_64 -machine q35... \
> >>>   -device intel-iommu,intremap=on \
> >>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>>
> >>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: 
> >>> vfio \
> >>> :01:00.1: group 1 used in multiple address spaces
> >>>
> >>> However, when QEMU incorporates proper aliasing, we can make use of a
> >>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >>> provides the downstream devices with the same AddressSpace, ex:
> >>>
> >>> qemu-system-x86_64 -machine q35... \
> >>>   -device intel-iommu,intremap=on \
> >>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> >>>
> >>> While the utility of this hack may be limited, this AddressSpace
> >>> aliasing is the correct behavior for QEMU to emulate bare metal.
> >>>
> >>> Signed-off-by: Alex Williamson 
> >>> ---
> >>>  hw/pci/pci.c |   33 +++--
> >>>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>> index 35451c1e9987..38467e676f1f 100644
> >>> --- a/hw/pci/pci.c
> >>> +++ b/hw/pci/pci.c
> >>> @@ -2594,12 +2594,41 @@ AddressSpace 
> >>> *pci_device_iommu_address_space(PCIDevice *dev)
> >>>  {
> >>>  PCIBus *bus = pci_get_bus(dev);
> >>>  PCIBus *iommu_bus = bus;
> >>> +uint8_t devfn = dev->devfn;
> >>>  
> >>>  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >>> -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >>> +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >>> +
> >>> +/*
> >>> + * Determine which requester ID alias should be used for the 
> >>> device
> >>> + * based on the PCI topology.  There are no requester IDs on 
> >>> convetional
> >> conventional  
> > 
> > Oops, fixed
> >   
> >>> + * PCI buses, therefore we push the alias up to the parent on 
> >>> each non-
> >>> + * express bus.  Which alias we use depends on whether this is a 
> >>> legacy
> >>> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> >>> PCIe-to-
> >>> + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> >>> here
> >>> + * because the resulting BDF depends on the secondary bridge 
> >>> register
> >> Didn't you mean secondary bus number register?  
> > 
> > Yes, fixed
> >   
> >>> + * programming.  We also cannot lookup the PCIBus from the bus 
> >>> number
> >>> + * at this point for the iommu_fn.  Also, requester_id_cache is 
> >>> the
> >>> + * alias to the root bus, which is usually, but not necessarily 
> >>> always
> >>> + * where we'll find our iommu_fn.
> >>> + */
> >>> +if (!pci_bus_is_express(iommu_bus)) {
> >>> +PCIDevice *parent = iommu_bus->parent_dev;
> >>> +
> >>> +if (pci_is_express(parent) &&
> >>> +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >>> +devfn = PCI_DEVFN(0, 0);
> >>> +bus = iommu_bus;
> >>> +} else {
> >>> +devfn = parent->devfn;
> >>> +bus = parent_bus;
> >>> +}
> >>> +}
> >>> +
> >>> +iommu_bus = parent_bus;
> >>>  }
> >>>  if (iommu_bus && iommu_bus->iommu_fn) {
> >>> -return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, 
> >>> dev->devfn);

[Qemu-devel] [PATCH for 4.0 v2 1/2] riscv: plic: Fix incorrect irq calculation

2019-03-27 Thread Alistair Francis
This patch fixes four different things, to maintain bisectability they
have been merged into a single patch. The following fixes are below:

sifive_plic: Fix incorrect irq calculation
The irq is incorrectly calculated to be off by one. It has worked in the
past as the priority_base offset has also been set incorrectly. We are
about to fix the priority_base offset, so first fix the irq
calculation.

sifive_u: Fix PLIC priority base offset and numbering
According to the FU540 manual the PLIC source priority address starts at
an offset of 0x04 and not 0x00. The same manual also specifies that the
PLIC only has 53 source priorities. Fix these two incorrect values in the
header file.

We also need to over extend the plic_gpios[] array as the PLIC sources
count from 1 and not 0.

riscv: sifive_e: Fix PLIC priority base offset
According to the FE31 manual the PLIC source priority address starts at
an offset of 0x04 and not 0x00.

riscv: virt: Fix PLIC priority base offset
Update the virt offsets based on the newly updated SiFive U and SiFive E
offsets.
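
For illustration, the corrected offset-to-source mapping with the new
priority_base of 0x04 (one 32-bit priority register per source, sources
numbered from 1; the numbers are only examples):

    uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
    /* addr 0x04 -> irq 1 (previously mis-decoded as the non-existent
     * source 0), addr 0x08 -> irq 2, and so on. */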

Signed-off-by: Alistair Francis 
---
 hw/riscv/sifive_plic.c  | 4 ++--
 hw/riscv/sifive_u.c | 2 +-
 include/hw/riscv/sifive_e.h | 2 +-
 include/hw/riscv/sifive_u.h | 4 ++--
 include/hw/riscv/virt.h | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
index 1c703e1a37..70a85cd075 100644
--- a/hw/riscv/sifive_plic.c
+++ b/hw/riscv/sifive_plic.c
@@ -206,7 +206,7 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, 
unsigned size)
 if (addr >= plic->priority_base && /* 4 bytes per source */
 addr < plic->priority_base + (plic->num_sources << 2))
 {
-uint32_t irq = (addr - plic->priority_base) >> 2;
+uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
 if (RISCV_DEBUG_PLIC) {
 qemu_log("plic: read priority: irq=%d priority=%d\n",
 irq, plic->source_priority[irq]);
@@ -279,7 +279,7 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
uint64_t value,
 if (addr >= plic->priority_base && /* 4 bytes per source */
 addr < plic->priority_base + (plic->num_sources << 2))
 {
-uint32_t irq = (addr - plic->priority_base) >> 2;
+uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
 plic->source_priority[irq] = value & 7;
 if (RISCV_DEBUG_PLIC) {
 qemu_log("plic: write priority: irq=%d priority=%d\n",
diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
index 5ecc47cea3..88381a7507 100644
--- a/hw/riscv/sifive_u.c
+++ b/hw/riscv/sifive_u.c
@@ -340,7 +340,7 @@ static void riscv_sifive_u_soc_realize(DeviceState *dev, 
Error **errp)
 const struct MemmapEntry *memmap = sifive_u_memmap;
 MemoryRegion *system_memory = get_system_memory();
 MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
-qemu_irq plic_gpios[SIFIVE_U_PLIC_NUM_SOURCES];
+qemu_irq plic_gpios[SIFIVE_U_PLIC_NUM_SOURCES + 1];
 int i;
 Error *err = NULL;
 NICInfo *nd = &nd_table[0];
diff --git a/include/hw/riscv/sifive_e.h b/include/hw/riscv/sifive_e.h
index 7b6d8aed96..f715f8606f 100644
--- a/include/hw/riscv/sifive_e.h
+++ b/include/hw/riscv/sifive_e.h
@@ -70,7 +70,7 @@ enum {
 #define SIFIVE_E_PLIC_HART_CONFIG "M"
 #define SIFIVE_E_PLIC_NUM_SOURCES 127
 #define SIFIVE_E_PLIC_NUM_PRIORITIES 7
-#define SIFIVE_E_PLIC_PRIORITY_BASE 0x0
+#define SIFIVE_E_PLIC_PRIORITY_BASE 0x04
 #define SIFIVE_E_PLIC_PENDING_BASE 0x1000
 #define SIFIVE_E_PLIC_ENABLE_BASE 0x2000
 #define SIFIVE_E_PLIC_ENABLE_STRIDE 0x80
diff --git a/include/hw/riscv/sifive_u.h b/include/hw/riscv/sifive_u.h
index be13cc1304..d859ea20f6 100644
--- a/include/hw/riscv/sifive_u.h
+++ b/include/hw/riscv/sifive_u.h
@@ -68,9 +68,9 @@ enum {
 };
 
 #define SIFIVE_U_PLIC_HART_CONFIG "MS"
-#define SIFIVE_U_PLIC_NUM_SOURCES 127
+#define SIFIVE_U_PLIC_NUM_SOURCES 53
 #define SIFIVE_U_PLIC_NUM_PRIORITIES 7
-#define SIFIVE_U_PLIC_PRIORITY_BASE 0x0
+#define SIFIVE_U_PLIC_PRIORITY_BASE 0x04
 #define SIFIVE_U_PLIC_PENDING_BASE 0x1000
 #define SIFIVE_U_PLIC_ENABLE_BASE 0x2000
 #define SIFIVE_U_PLIC_ENABLE_STRIDE 0x80
diff --git a/include/hw/riscv/virt.h b/include/hw/riscv/virt.h
index f12deaebd6..568764b570 100644
--- a/include/hw/riscv/virt.h
+++ b/include/hw/riscv/virt.h
@@ -59,7 +59,7 @@ enum {
 #define VIRT_PLIC_HART_CONFIG "MS"
 #define VIRT_PLIC_NUM_SOURCES 127
 #define VIRT_PLIC_NUM_PRIORITIES 7
-#define VIRT_PLIC_PRIORITY_BASE 0x0
+#define VIRT_PLIC_PRIORITY_BASE 0x04
 #define VIRT_PLIC_PENDING_BASE 0x1000
 #define VIRT_PLIC_ENABLE_BASE 0x2000
 #define VIRT_PLIC_ENABLE_STRIDE 0x80
-- 
2.21.0




[Qemu-devel] [PATCH for 4.0 v2 0/2] Update the QEMU PLIC addresses

2019-03-27 Thread Alistair Francis
This series updates the PLIC address to match the documentation.

This fixes: https://github.com/riscv/opensbi/issues/97

V2:
 - Squash patches to ensure bisectability

Alistair Francis (2):
  riscv: plic: Fix incorrect irq calculation
  riscv: plic: Log guest errors

 hw/riscv/sifive_plic.c  | 16 +++-
 hw/riscv/sifive_u.c |  2 +-
 include/hw/riscv/sifive_e.h |  2 +-
 include/hw/riscv/sifive_u.h |  4 ++--
 include/hw/riscv/virt.h |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

-- 
2.21.0




[Qemu-devel] [PATCH for 4.0 v2 2/2] riscv: plic: Log guest errors

2019-03-27 Thread Alistair Francis
Instead of using error_report() to print guest errors let's use
qemu_log_mask(LOG_GUEST_ERROR,...) to log the error.
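
Unlike error_report(), LOG_GUEST_ERROR output only appears when the
corresponding log category is enabled, e.g. with an invocation along these
lines (illustrative only):

    qemu-system-riscv64 -M sifive_u -d guest_errors ...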

Signed-off-by: Alistair Francis 
---
 hw/riscv/sifive_plic.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
index 70a85cd075..7f373d6c9d 100644
--- a/hw/riscv/sifive_plic.c
+++ b/hw/riscv/sifive_plic.c
@@ -262,7 +262,9 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, 
unsigned size)
 }
 
 err:
-error_report("plic: invalid register read: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid register read 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 return 0;
 }
 
@@ -289,7 +291,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
uint64_t value,
 } else if (addr >= plic->pending_base && /* 1 bit per source */
addr < plic->pending_base + (plic->num_sources >> 3))
 {
-error_report("plic: invalid pending write: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: invalid pending write: 0x%" HWADDR_PRIx "",
+  __func__, addr);
 return;
 } else if (addr >= plic->enable_base && /* 1 bit per source */
 addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
@@ -339,7 +343,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
uint64_t value,
 }
 
 err:
-error_report("plic: invalid register write: %08x", (uint32_t)addr);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 }
 
 static const MemoryRegionOps sifive_plic_ops = {
-- 
2.21.0




Re: [Qemu-devel] [PATCH for 4.0 v2 2/2] riscv: plic: Log guest errors

2019-03-27 Thread Alistair Francis
On Wed, Mar 27, 2019 at 11:51 AM Alistair Francis
 wrote:
>
> Instead of using error_report() to print guest errors let's use
> qemu_log_mask(LOG_GUEST_ERROR,...) to log the error.
>
> Signed-off-by: Alistair Francis 

I dropped Philippe's review in this series.
This should be included from v1:

Reviewed-by: Philippe Mathieu-Daudé 

Alistair

> ---
>  hw/riscv/sifive_plic.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
> index 70a85cd075..7f373d6c9d 100644
> --- a/hw/riscv/sifive_plic.c
> +++ b/hw/riscv/sifive_plic.c
> @@ -262,7 +262,9 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> addr, unsigned size)
>  }
>
>  err:
> -error_report("plic: invalid register read: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: Invalid register read 0x%" HWADDR_PRIx "\n",
> +  __func__, addr);
>  return 0;
>  }
>
> @@ -289,7 +291,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
> uint64_t value,
>  } else if (addr >= plic->pending_base && /* 1 bit per source */
> addr < plic->pending_base + (plic->num_sources >> 3))
>  {
> -error_report("plic: invalid pending write: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: invalid pending write: 0x%" HWADDR_PRIx "",
> +  __func__, addr);
>  return;
>  } else if (addr >= plic->enable_base && /* 1 bit per source */
>  addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
> @@ -339,7 +343,9 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
> uint64_t value,
>  }
>
>  err:
> -error_report("plic: invalid register write: %08x", (uint32_t)addr);
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
> +  __func__, addr);
>  }
>
>  static const MemoryRegionOps sifive_plic_ops = {
> --
> 2.21.0
>



Re: [Qemu-devel] [PATCH v2 3/6] Categorize devices: DIMM

2019-03-27 Thread Ernest Esene
On Wed, Mar 27, 2019 at 01:47:48PM +0100, Philippe Mathieu-Daudé wrote:
> On Wed, 27 Mar 2019 13:32, Ernest Esene  wrote:
> 
> > Set category and describe DIMM devices
> >
> > Signed-off-by: Ernest Esene 
> >
> > ---
> > v2:
> >   * split into separate patches
> > ---
> >  hw/mem/nvdimm.c  | 1 +
> >  hw/mem/pc-dimm.c | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
> > index bf2adf5e16..a334dbe1f5 100644
> > --- a/hw/mem/nvdimm.c
> > +++ b/hw/mem/nvdimm.c
> > @@ -200,6 +200,7 @@ static void nvdimm_class_init(ObjectClass *oc, void
> > *data)
> >  ddc->realize = nvdimm_realize;
> >  mdc->get_memory_region = nvdimm_md_get_memory_region;
> >  dc->props = nvdimm_properties;
> > +dc->desc = "NVDIMM memory module";
> >
> >  nvc->read_label_data = nvdimm_read_label_data;
> >  nvc->write_label_data = nvdimm_write_label_data;
> > diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> > index 152400b1fc..19b9c0f406 100644
> > --- a/hw/mem/pc-dimm.c
> > +++ b/hw/mem/pc-dimm.c
> > @@ -259,6 +259,7 @@ static void pc_dimm_class_init(ObjectClass *oc, void
> > *data)
> >  dc->unrealize = pc_dimm_unrealize;
> >  dc->props = pc_dimm_properties;
> >  dc->desc = "DIMM memory module";
> > +set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> >
> 
> Hmm should we add a MEMORY category?
Exactly what I was thinking. I shall send an update for the MEMORY
category.
> 
> 
> >  ddc->get_vmstate_memory_region = pc_dimm_get_memory_region;
> >
> > --
> > 2.14.2
> >
> >




Re: [Qemu-devel] [Xen-devel] [PATCH v2 0/2] xen-block: fix sector size confusion

2019-03-27 Thread Andrew Cooper
On 27/03/2019 17:32, Paul Durrant wrote:
> The Xen blkif protocol is confusing but discussion with the maintainer
> has clarified that sector based quantities in requests and the 'sectors'
> value advertized in xenstore should always be in terms of 512-byte
> units and not the advertised logical 'sector-size' value.
>
> This series fixes xen-block to adhere to the spec.

I thought we agreed that hardcoding things to 512 bytes was the wrong
thing to do.

I was expecting something like:

1) Clarify the spec with the intended meaning, (which is what some
implementations actually use already) and wont cripple 4k datapaths.
2) Introduce a compatibility key for "I don't rely on sector-size being
512", which fixed implementations should advertise.
3) Specify that because of bugs in the spec which got out into the wild,
drivers which don't find the key being advertised by the other end
should emulate sector-size=512 for compatibility with broken
implementations.

Whatever the eventual way out, the first thing which needs to happen is
an update to the spec, before actions are taken to alter existing
implementations.

~Andrew



Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Wainer dos Santos Moschetta




On 03/27/2019 01:15 PM, Paolo Bonzini wrote:

On 27/03/19 17:05, Daniel P. Berrangé wrote:

On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:

On 27/03/19 16:30, Daniel P. Berrangé wrote:

Perhaps the VM test scripts should do a "HEAD" request for the image
every time to discover if it has been changed on the server, before
honouring the local cache.

Another possibility is to first download the shasum from
download.patchew.org, and compare _that_ against the one that is stored
locally, instead of hardcoding it in QEMU's repository.

Personally I prefer the idea of having the shasum stored in the repo.

That means that if we update git master to point to a newer image,
previous stable branches will stick with their original image, rather
than using a new image that may be incompatible with the stable branch

Storing hash in git also means that if someone compromised the patchew
server, they can't cause developer to run compromised images, without
first also compromising git to change the hash.

The two are not mutually exclusive.  We can warn if the hash doesn't
match against the one in QEMU, add a --force option, or whatever.


I'm about to send a patch to make vm-test work with Python3. I can work 
on that image checking mechanism you folks have discussed, unless 
someone is already working on it.


- Wainer



Also, I have now created symlinks by hash at
http://download.patchew.org/by-sha256sum in case someone finds them useful.

Paolo







[Qemu-devel] [PATCH v2] target/mips: add * to comments and realign them to fix checkpatch warnings

2019-03-27 Thread Jules Irenge
Add * to comments and realign them to fix warnings issued by the checkpatch.pl
tool "WARNING: Block comments use a leading /* on a separate line"
within the file "target/mips/cp0_timer.c".

Signed-off-by: Jules Irenge 
---
This v2 improves the comment style and gets rid of the warnings
completely.
 target/mips/cp0_timer.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/target/mips/cp0_timer.c b/target/mips/cp0_timer.c
index 742f4b2890..d80b0076d3 100644
--- a/target/mips/cp0_timer.c
+++ b/target/mips/cp0_timer.c
@@ -42,8 +42,10 @@ uint32_t cpu_mips_get_random(CPUMIPSState *env)
 
 /* Don't return same value twice, so get another value */
 do {
-/* Use a simple algorithm of Linear Congruential Generator
- * from ISO/IEC 9899 standard. */
+/*
+ * Use a simple algorithm of Linear Congruential Generator
+ * from ISO/IEC 9899 standard.
+ */
 seed = 1103515245 * seed + 12345;
 idx = (seed >> 16) % nb_rand_tlb + env->CP0_Wired;
 } while (idx == prev_idx);
@@ -143,9 +145,11 @@ static void mips_timer_cb(void *opaque)
 if (env->CP0_Cause & (1 << CP0Ca_DC))
 return;
 
-/* ??? This callback should occur when the counter is exactly equal to
-   the comparator value.  Offset the count by one to avoid immediately
-   retriggering the callback before any virtual time has passed.  */
+/*
+ * ??? This callback should occur when the counter is exactly equal to
+ * the comparator value.  Offset the count by one to avoid immediately
+ * retriggering the callback before any virtual time has passed.
+ */
 env->CP0_Count++;
 cpu_mips_timer_expire(env);
 env->CP0_Count--;
-- 
2.20.1




Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 12:46:41 +0100
Auger Eric  wrote:

> Hi Alex,
> 
> On 3/26/19 11:55 PM, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   :00:01.0  :01:00.0  :01:00.1
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > :01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> >  hw/pci/pci.c |   33 +++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 35451c1e9987..38467e676f1f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2594,12 +2594,41 @@ AddressSpace 
> > *pci_device_iommu_address_space(PCIDevice *dev)
> >  {
> >  PCIBus *bus = pci_get_bus(dev);
> >  PCIBus *iommu_bus = bus;
> > +uint8_t devfn = dev->devfn;
> >  
> >  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > +
> > +/*
> > + * Determine which requester ID alias should be used for the device
> > + * based on the PCI topology.  There are no requester IDs on 
> > convetional  
> conventional

Oops, fixed

> > + * PCI buses, therefore we push the alias up to the parent on each 
> > non-
> > + * express bus.  Which alias we use depends on whether this is a 
> > legacy
> > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> > PCIe-to-
> > + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> > here
> > + * because the resulting BDF depends on the secondary bridge 
> > register  
> Didn't you mean secondary bus number register?

Yes, fixed

> > + * programming.  We also cannot lookup the PCIBus from the bus 
> > number
> > + * at this point for the iommu_fn.  Also, requester_id_cache is the
> > + * alias to the root bus, which is usually, but not necessarily 
> > always
> > + * where we'll find our iommu_fn.
> > + */
> > +if (!pci_bus_is_express(iommu_bus)) {
> > +PCIDevice *parent = iommu_bus->parent_dev;
> > +
> > +if (pci_is_express(parent) &&
> > +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > +devfn = PCI_DEVFN(0, 0);
> > +bus = iommu_bus;
> > +} else {
> > +devfn = parent->devfn;
> > +bus = parent_bus;
> > +}
> > +}
> > +
> > +iommu_bus = parent_bus;
> >  }
> >  if (iommu_bus && iommu_bus->iommu_fn) {
> > -return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, 
> > dev->devfn);
> > +return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);  
> I think it would make sense to comment this iommu_fn() callback's role
> somewhere.

While I agree, it seems a bit out of scope of this patch.  Or are you
suggesting that this patch is fundamentally changing the role rather than
trying to make it work as intended?  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Auger Eric
Hi Alex,

On 3/27/19 7:02 PM, Alex Williamson wrote:
> On Wed, 27 Mar 2019 12:46:41 +0100
> Auger Eric  wrote:
> 
>> Hi Alex,
>>
>> On 3/26/19 11:55 PM, Alex Williamson wrote:
>>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>>> distinguish by devfn & bus between devices in a conventional PCI
>>> topology and therefore we cannot assign them separate AddressSpaces.
>>> By taking this requester ID aliasing into account, QEMU better matches
>>> the bare metal behavior and restrictions, and enables shared
>>> AddressSpace configurations that are otherwise not possible with
>>> guest IOMMU support.
>>>
>>> For the latter case, given any example where an IOMMU group on the
>>> host includes multiple devices:
>>>
>>>   $ ls  /sys/kernel/iommu_groups/1/devices/
>>>   :00:01.0  :01:00.0  :01:00.1
>>>
>>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>>> that we can only assign one of the endpoints to the guest because a
>>> second endpoint will attempt to use a different AddressSpace.  VFIO
>>> only supports IOMMU group level granularity at the container level,
>>> preventing this second endpoint from being assigned:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>>
>>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>>> :01:00.1: group 1 used in multiple address spaces
>>>
>>> However, when QEMU incorporates proper aliasing, we can make use of a
>>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>>> provides the downstream devices with the same AddressSpace, ex:
>>>
>>> qemu-system-x86_64 -machine q35... \
>>>   -device intel-iommu,intremap=on \
>>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>>
>>> While the utility of this hack may be limited, this AddressSpace
>>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>>
>>> Signed-off-by: Alex Williamson 
>>> ---
>>>  hw/pci/pci.c |   33 +++--
>>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index 35451c1e9987..38467e676f1f 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -2594,12 +2594,41 @@ AddressSpace 
>>> *pci_device_iommu_address_space(PCIDevice *dev)
>>>  {
>>>  PCIBus *bus = pci_get_bus(dev);
>>>  PCIBus *iommu_bus = bus;
>>> +uint8_t devfn = dev->devfn;
>>>  
>>>  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>>> -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
>>> +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>>> +
>>> +/*
>>> + * Determine which requester ID alias should be used for the device
>>> + * based on the PCI topology.  There are no requester IDs on 
>>> convetional  
>> conventional
> 
> Oops, fixed
> 
>>> + * PCI buses, therefore we push the alias up to the parent on each 
>>> non-
>>> + * express bus.  Which alias we use depends on whether this is a 
>>> legacy
>>> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
>>> PCIe-to-
>>> + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
>>> here
>>> + * because the resulting BDF depends on the secondary bridge 
>>> register  
>> Didn't you mean secondary bus number register?
> 
> Yes, fixed
> 
>>> + * programming.  We also cannot lookup the PCIBus from the bus 
>>> number
>>> + * at this point for the iommu_fn.  Also, requester_id_cache is the
>>> + * alias to the root bus, which is usually, but not necessarily 
>>> always
>>> + * where we'll find our iommu_fn.
>>> + */
>>> +if (!pci_bus_is_express(iommu_bus)) {
>>> +PCIDevice *parent = iommu_bus->parent_dev;
>>> +
>>> +if (pci_is_express(parent) &&
>>> +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
>>> +devfn = PCI_DEVFN(0, 0);
>>> +bus = iommu_bus;
>>> +} else {
>>> +devfn = parent->devfn;
>>> +bus = parent_bus;
>>> +}
>>> +}
>>> +
>>> +iommu_bus = parent_bus;
>>>  }
>>>  if (iommu_bus && iommu_bus->iommu_fn) {
>>> -return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, 
>>> dev->devfn);
>>> +return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);  
>> I think it would make sense to comment this iommu_fn() callback's role
>> somewhere.
> 
> While I agree, it seems a bit out of scope of this patch.  Or are you
> suggesting that this patch is fundamentally changing the role rather than
> trying to make it work as intended?  Thanks,

[Qemu-devel] [PATCH for-4.0] target/ppc: Fix QEMU crash with stxsdx

2019-03-27 Thread Greg Kurz
I've been hitting several QEMU crashes while running a fedora29 ppc64le
guest under TCG. Each time, this would occur several minutes after the
guest reached login:

Fedora 29 (Twenty Nine)
Kernel 4.20.6-200.fc29.ppc64le on an ppc64le (hvc0)

Web console: https://localhost:9090/

localhost login:
tcg/tcg.c:3211: tcg fatal error

This happens because a bug crept up in the gen_stxsdx() helper when it
was converted to use VSR register accessors by commit 8b3b2d75c7c04
"target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}() helpers
for VSR register access".

The code creates a temporary, passes it directly to gen_qemu_st64_i64()
and then to set_cpu_vsrh()... which looks like this was mistakenly
coded as a load instead of a store.

Reverse the logic: read the VSR to the temporary first and then store
it to memory.
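
A sketch of the two directions for contrast (not the actual macro bodies;
the load-side helper name follows the existing scalar-load path in the same
file and is an assumption here):

    /* store: VSR -> temporary -> memory */
    get_cpu_vsrh(t0, xS(ctx->opcode));
    gen_qemu_st64_i64(ctx, t0, EA);

    /* load: memory -> temporary -> VSR (the ordering the buggy store
     * helper had effectively ended up with) */
    gen_qemu_ld64_i64(ctx, t0, EA);
    set_cpu_vsrh(xS(ctx->opcode), t0);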

Fixes: 8b3b2d75c7c0481544e277dad226223245e058eb
Signed-off-by: Greg Kurz 
---
 target/ppc/translate/vsx-impl.inc.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/ppc/translate/vsx-impl.inc.c 
b/target/ppc/translate/vsx-impl.inc.c
index 508e9199c8df..489b2436e49f 100644
--- a/target/ppc/translate/vsx-impl.inc.c
+++ b/target/ppc/translate/vsx-impl.inc.c
@@ -356,8 +356,8 @@ static void gen_##name(DisasContext *ctx)   
  \
 gen_set_access_type(ctx, ACCESS_INT); \
 EA = tcg_temp_new();  \
 gen_addr_reg_index(ctx, EA);  \
+get_cpu_vsrh(t0, xS(ctx->opcode));\
 gen_qemu_##operation(ctx, t0, EA);\
-set_cpu_vsrh(xS(ctx->opcode), t0);\
 tcg_temp_free(EA);\
 tcg_temp_free_i64(t0);\
 }




Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Paolo Bonzini
On 27/03/19 19:08, Wainer dos Santos Moschetta wrote:
> 
> 
> On 03/27/2019 01:15 PM, Paolo Bonzini wrote:
>> On 27/03/19 17:05, Daniel P. Berrangé wrote:
>>> On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
 On 27/03/19 16:30, Daniel P. Berrangé wrote:
> Perhaps the VM test scripts should do a "HEAD" request for the image
> every time to discover if it has been changed on the server, before
> honouring the local cache.
 Another possibility is to first download the shasum from
 download.patchew.org, and compare _that_ against the one that is stored
 locally, instead of hardcoding it in QEMU's repository.
>>> Personally I prefer the idea of having the shasum stored in the repo.
>>>
>>> That means that if we update git master to point to a newer image,
>>> previous stable branches will stick with their original image, rather
>>> than using a new image that may be incompatible with the stable branch
>>>
>>> Storing hash in git also means that if someone compromised the patchew
>>> server, they can't cause developer to run compromised images, without
>>> first also compromising git to change the hash.
>> The two are not mutually exclusive.  We can warn if the hash doesn't
>> match against the one in QEMU, add a --force option, or whatever.
> 
> I'm about to send a patch to make vm-test work with Python3. I can work
> on that image checking mechanism you folks have discussed, unless
> someone is already working on it.

Absolutely not, go for it!

Paolo




Re: [Qemu-devel] [PATCH 1/2] vl: Fix to create accel compat props before migration object again

2019-03-27 Thread Anthony PERARD
On Wed, Mar 27, 2019 at 04:03:46PM +0100, Markus Armbruster wrote:
> Recent commit cda4aa9a5a0 moved block backend creation before machine
> property evaluation.  This broke block backends registering migration
> blockers.  Commit e60483f2f84 fixed it by moving migration object
> creation before block backend creation.  This broke migration with
> Xen.  Turns out we need to configure the accelerator before we create
> the migration object so that Xen's accelerator compat properties get
> applied.  Fix by calling configure_accelerator() earlier, right before
> migration_object_init().
> 
> Fixes: e60483f2f8498ae08ae79ca4c6fb03a3317f5e1e
> Reported-by: Anthony PERARD 
> Signed-off-by: Markus Armbruster 

Tested-by: Anthony PERARD 

Thanks!

-- 
Anthony PERARD



[Qemu-devel] [PATCH v2 1/2] xen-block: scale sector based quantities correctly

2019-03-27 Thread Paul Durrant
The Xen blkif protocol requires that sector based quantities should be
interpreted strictly as multiples of 512 bytes. Specifically:

"first_sect and last_sect in blkif_request_segment, as well as
sector_number in blkif_request, are always expressed in 512-byte units."

This patch modifies the xen-block code accordingly.
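
The xen_blkif.h hunk is not visible in this excerpt; the new constant is
presumably defined along these lines (assumption), and every sector-based
request field is then scaled by that fixed unit rather than by the backend's
logical_block_size:

    #define XEN_BLKIF_SECTOR_SIZE 512

    request->start = request->req.sector_number * XEN_BLKIF_SECTOR_SIZE;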

Signed-off-by: Paul Durrant 
---
Cc: Stefan Hajnoczi 
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Kevin Wolf 
Cc: Max Reitz 
---
 hw/block/dataplane/xen-block.c | 28 +---
 hw/block/xen_blkif.h   |  2 ++
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/hw/block/dataplane/xen-block.c b/hw/block/dataplane/xen-block.c
index f1523c5b45..bb8f1186e4 100644
--- a/hw/block/dataplane/xen-block.c
+++ b/hw/block/dataplane/xen-block.c
@@ -49,7 +49,6 @@ struct XenBlockDataPlane {
 unsigned int *ring_ref;
 unsigned int nr_ring_ref;
 void *sring;
-int64_t file_blk;
 int protocol;
 blkif_back_rings_t rings;
 int more_work;
@@ -168,7 +167,7 @@ static int xen_block_parse_request(XenBlockRequest *request)
 goto err;
 }
 
-request->start = request->req.sector_number * dataplane->file_blk;
+request->start = request->req.sector_number * XEN_BLKIF_SECTOR_SIZE;
 for (i = 0; i < request->req.nr_segments; i++) {
 if (i == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
 error_report("error: nr_segments too big");
@@ -178,14 +177,14 @@ static int xen_block_parse_request(XenBlockRequest 
*request)
 error_report("error: first > last sector");
 goto err;
 }
-if (request->req.seg[i].last_sect * dataplane->file_blk >=
+if (request->req.seg[i].last_sect * XEN_BLKIF_SECTOR_SIZE >=
 XC_PAGE_SIZE) {
 error_report("error: page crossing");
 goto err;
 }
 
 len = (request->req.seg[i].last_sect -
-   request->req.seg[i].first_sect + 1) * dataplane->file_blk;
+   request->req.seg[i].first_sect + 1) * XEN_BLKIF_SECTOR_SIZE;
 request->size += len;
 }
 if (request->start + request->size > blk_getlength(dataplane->blk)) {
@@ -205,7 +204,6 @@ static int xen_block_copy_request(XenBlockRequest *request)
 XenDevice *xendev = dataplane->xendev;
 XenDeviceGrantCopySegment segs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
 int i, count;
-int64_t file_blk = dataplane->file_blk;
 bool to_domain = (request->req.operation == BLKIF_OP_READ);
 void *virt = request->buf;
 Error *local_err = NULL;
@@ -220,16 +218,17 @@ static int xen_block_copy_request(XenBlockRequest 
*request)
 if (to_domain) {
 segs[i].dest.foreign.ref = request->req.seg[i].gref;
 segs[i].dest.foreign.offset = request->req.seg[i].first_sect *
-file_blk;
+XEN_BLKIF_SECTOR_SIZE;
 segs[i].source.virt = virt;
 } else {
 segs[i].source.foreign.ref = request->req.seg[i].gref;
 segs[i].source.foreign.offset = request->req.seg[i].first_sect *
-file_blk;
+XEN_BLKIF_SECTOR_SIZE;
 segs[i].dest.virt = virt;
 }
 segs[i].len = (request->req.seg[i].last_sect -
-   request->req.seg[i].first_sect + 1) * file_blk;
+   request->req.seg[i].first_sect + 1) *
+  XEN_BLKIF_SECTOR_SIZE;
 virt += segs[i].len;
 }
 
@@ -331,22 +330,22 @@ static bool xen_block_split_discard(XenBlockRequest 
*request,
 XenBlockDataPlane *dataplane = request->dataplane;
 int64_t byte_offset;
 int byte_chunk;
-uint64_t byte_remaining, limit;
+uint64_t byte_remaining;
 uint64_t sec_start = sector_number;
 uint64_t sec_count = nr_sectors;
 
 /* Wrap around, or overflowing byte limit? */
 if (sec_start + sec_count < sec_count ||
-sec_start + sec_count > INT64_MAX / dataplane->file_blk) {
+sec_start + sec_count > INT64_MAX / XEN_BLKIF_SECTOR_SIZE) {
 return false;
 }
 
-limit = BDRV_REQUEST_MAX_SECTORS * dataplane->file_blk;
-byte_offset = sec_start * dataplane->file_blk;
-byte_remaining = sec_count * dataplane->file_blk;
+byte_offset = sec_start * XEN_BLKIF_SECTOR_SIZE;
+byte_remaining = sec_count * XEN_BLKIF_SECTOR_SIZE;
 
 do {
-byte_chunk = byte_remaining > limit ? limit : byte_remaining;
+byte_chunk = byte_remaining > BDRV_REQUEST_MAX_BYTES ?
+BDRV_REQUEST_MAX_BYTES : byte_remaining;
 request->aio_inflight++;
 blk_aio_pdiscard(dataplane->blk, byte_offset, byte_chunk,
  xen_block_complete_aio, request);
@@ -632,7 +631,6 @@ XenBlockDataPlane *xen_block_dataplane_create(XenDevice 
*xendev,
 XenBlockDataPlane *dataplane = g_new0(XenBlockDataPlane, 1);
 
 dataplane->xendev = xendev;
-dataplane->file_blk = conf->logical_block_size;
 dataplane->blk = conf->blk;
 
  

[Qemu-devel] [PATCH v2 2/2] xen-block: always report 'sectors' in terms of 512-byte units

2019-03-27 Thread Paul Durrant
The mail thread at [1] clarifies that the Xen blkif protocol requires that
'sectors' value reported in xenstore is strictly in terms of 512-byte
units and is not dependent on the logical sector size reported in
'sector-size'.

[1] https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg01600.html
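
As a worked example of what this means for the reported value (illustrative
numbers): for a 4 GiB image with a 4096-byte logical 'sector-size',

    int64_t sectors = blk_getlength(conf->blk) / XEN_BLKIF_SECTOR_SIZE;
    /* 4294967296 / 512 = 8388608 sectors, not 4294967296 / 4096 = 1048576 */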

Signed-off-by: Paul Durrant 
---
Cc: Stefano Stabellini 
Cc: Anthony Perard 
Cc: Kevin Wolf 
Cc: Max Reitz 
---
 hw/block/xen-block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/xen-block.c b/hw/block/xen-block.c
index a848849f48..57e9da7e1c 100644
--- a/hw/block/xen-block.c
+++ b/hw/block/xen-block.c
@@ -149,7 +149,7 @@ static void xen_block_set_size(XenBlockDevice *blockdev)
 const char *type = object_get_typename(OBJECT(blockdev));
XenBlockVdev *vdev = &blockdev->props.vdev;
BlockConf *conf = &blockdev->props.conf;
-int64_t sectors = blk_getlength(conf->blk) / conf->logical_block_size;
+int64_t sectors = blk_getlength(conf->blk) / XEN_BLKIF_SECTOR_SIZE;
 XenDevice *xendev = XEN_DEVICE(blockdev);
 
 trace_xen_block_size(type, vdev->disk, vdev->partition, sectors);
-- 
2.20.1.2.gb21ebb6




[Qemu-devel] [PATCH v2 0/2] xen-block: fix sector size confusion

2019-03-27 Thread Paul Durrant
The Xen blkif protocol is confusing but discussion with the maintainer
has clarified that sector based quantities in requests and the 'sectors'
value advertised in xenstore should always be in terms of 512-byte
units and not the advertised logical 'sector-size' value.

This series fixes xen-block to adhere to the spec.

Paul Durrant (2):
  xen-block: scale sector based quantities correctly
  xen-block: always report 'sectors' in terms of 512-byte units

 hw/block/dataplane/xen-block.c | 28 +---
 hw/block/xen-block.c   |  2 +-
 hw/block/xen_blkif.h   |  2 ++
 3 files changed, 16 insertions(+), 16 deletions(-)
---
v2:
 - Split up previous single patch

Cc: Anthony Perard 
Cc: Kevin Wolf 
Cc: Max Reitz 
Cc: Stefan Hajnoczi 
Cc: Stefano Stabellini 
-- 
2.20.1.2.gb21ebb6




Re: [Qemu-devel] [PATCH] block/file-posix: ignore fail on unlock bytes

2019-03-27 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/20190327124915.22265-1-vsement...@virtuozzo.com/



Hi,

This series failed the asan build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1
=== TEST SCRIPT END ===

PASS 1 fdc-test /x86_64/fdc/cmos
PASS 2 fdc-test /x86_64/fdc/no_media_on_start
PASS 3 fdc-test /x86_64/fdc/read_without_media
==7509==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 4 fdc-test /x86_64/fdc/media_change
PASS 5 fdc-test /x86_64/fdc/sense_interrupt
PASS 6 fdc-test /x86_64/fdc/relative_seek
---
PASS 33 test-opts-visitor /visitor/opts/dict/unvisited
PASS 11 fdc-test /x86_64/fdc/read_no_dma_18
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-coroutine -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-coroutine" 
==7572==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==7572==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 
0x7ffd9579e000; bottom 0x7efe3dbf8000; size: 0x00ff57ba6000 (1096688492544)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
PASS 1 test-coroutine /basic/no-dangling-access
---
PASS 13 test-aio /aio/event/wait/no-flush-cb
PASS 12 fdc-test /x86_64/fdc/read_no_dma_19
PASS 14 test-aio /aio/timer/schedule
==7587==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 15 test-aio /aio/coroutine/queue-chaining
PASS 16 test-aio /aio-gsource/flush
PASS 17 test-aio /aio-gsource/bh/schedule
---
PASS 27 test-aio /aio-gsource/event/wait/no-flush-cb
PASS 13 fdc-test /x86_64/fdc/fuzz-registers
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64 QTEST_QEMU_IMG=qemu-img 
tests/ide-test -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="ide-test" 
==7596==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 28 test-aio /aio-gsource/timer/schedule
PASS 1 ide-test /x86_64/ide/identify
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-aio-multithread -m=quick -k --tap < /dev/null | 
./scripts/tap-driver.pl --test-name="test-aio-multithread" 
==7605==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 1 test-aio-multithread /aio/multi/lifecycle
==7603==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 2 ide-test /x86_64/ide/flush
PASS 2 test-aio-multithread /aio/multi/schedule
==7623==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 3 ide-test /x86_64/ide/bmdma/setup
PASS 4 ide-test /x86_64/ide/bmdma/simple_rw
PASS 5 ide-test /x86_64/ide/bmdma/trim
PASS 6 ide-test /x86_64/ide/bmdma/short_prdt
PASS 7 ide-test /x86_64/ide/bmdma/one_sector_short_prdt
PASS 8 ide-test /x86_64/ide/bmdma/long_prdt
==7623==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 
0x7ffc541df000; bottom 0x7f7cd28f9000; size: 0x007f818e6000 (547634438144)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
PASS 9 ide-test /x86_64/ide/bmdma/no_busmaster
PASS 10 ide-test /x86_64/ide/bmdma/teardown
PASS 3 test-aio-multithread /aio/multi/mutex/contended
PASS 11 ide-test /x86_64/ide/flush/nodev
==7644==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 12 ide-test /x86_64/ide/flush/empty_drive
==7649==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 13 ide-test /x86_64/ide/flush/retry_pci
PASS 4 test-aio-multithread /aio/multi/mutex/handoff
==7655==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 14 ide-test /x86_64/ide/flush/retry_isa
PASS 5 test-aio-multithread /aio/multi/mutex/mcs
==7666==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
PASS 15 ide-test /x86_64/ide/cdrom/pio
PASS 6 test-aio-multithread /aio/multi/mutex/pthread
MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))}  
tests/test-throttle -m=quick -k --tap < /dev/null | ./scripts/tap-driver.pl 
--test-name="test-throttle" 
==7677==WARNING: ASan doesn't fully support makecontext/swapcontext functions 
and may produce false positives in some cases!
==7683==WARNING: ASan 

Re: [Qemu-devel] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread Igor Mammedov
On Wed, 27 Mar 2019 23:03:58 +1100
David Gibson  wrote:

> On Wed, Mar 27, 2019 at 10:38:38AM +0100, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 20:07:57 +1100
> > David Gibson  wrote:
> > 
> > > On Wed, Mar 27, 2019 at 09:57:45AM +0100, Igor Mammedov wrote:
> > > > On Wed, 27 Mar 2019 11:11:46 +1100
> > > > David Gibson  wrote:
> > > >   
> > > > > On Tue, Mar 26, 2019 at 03:08:20PM +0100, Igor Mammedov wrote:  
> > > > > > On Tue, 26 Mar 2019 14:50:58 +1100
> > > > > > David Gibson  wrote:
> > > > > > 
> > > > > > > qemu_getrampagesize() works out the minimum host page size 
> > > > > > > backing any of
> > > > > > > guest RAM.  This is required in a few places, such as for POWER8 
> > > > > > > PAPR KVM
> > > > > > > guests, because limitations of the hardware virtualization mean 
> > > > > > > the guest
> > > > > > > can't use pagesizes larger than the host pages backing its memory.
> > > > > > > 
> > > > > > > However, it currently checks against *every* memory backend, 
> > > > > > > whether or not
> > > > > > > it is actually mapped into guest memory at the moment.  This is 
> > > > > > > incorrect.
> > > > > > > 
> > > > > > > This can cause a problem attempting to add memory to a POWER8 
> > > > > > > pseries KVM
> > > > > > > guest which is configured to allow hugepages in the guest (e.g.
> > > > > > > -machine cap-hpt-max-page-size=16m).  If you attempt to add 
> > > > > > > non-hugepage,
> > > > > > > you can (correctly) create a memory backend, however it 
> > > > > > > (correctly) will
> > > > > > > throw an error when you attempt to map that memory into the guest 
> > > > > > > by
> > > > > > > 'device_add'ing a pc-dimm.
> > > > > > > 
> > > > > > > What's not correct is that if you then reset the guest a startup 
> > > > > > > check
> > > > > > > against qemu_getrampagesize() will cause a fatal error because of 
> > > > > > > the new
> > > > > > > memory object, even though it's not mapped into the guest.
> > > > > > I'd say that backend should be remove by mgmt app since device_add 
> > > > > > failed
> > > > > > instead of leaving it to hang around. (but fatal error either not a 
> > > > > > nice
> > > > > > behavior on QEMU part)
> > > > > 
> > > > > I agree, but reset could be guest initiated, so even if management is
> > > > > doing the right thing there's a window where it could break.  
> > > > 
> > > > We could (log term) get rid of qemu_getrampagesize() (it's not really 
> > > > good
> > > > to push machine/target specific condition into generic function) and 
> > > > move
> > > > pagesize calculation closer to machine. In this case machine (spapr) 
> > > > knows
> > > > exactly when and what is mapped and can enumerate NOT backends but 
> > > > mapped
> > > > memory regions and/or pc-dimms (lets say we have 
> > > > memory_region_page_size()
> > > > and apply whatever policy to the results.  
> > > 
> > > So.. it used to be in the machine specific code, but was made generic
> > > because it's used in the vfio code.  Where it is ppc specific, but not
> > > keyed into the machine specific code in a way that we can really
> > > re-use it there.
> > It probably was the easiest way to hack things together, but then API gets
> > generalized and misused and then tweaked to a specific machine use case,
> > and it only gets worse over time.
> 
> I don't really know what you mean by that.  In both the (main) ppc and
> VFIO cases the semantics we want from getrampagesize() really are the
> same: what's the smallest chunk of guest-contiguous memory we can rely
> on to also be host-contiguous.
"smallest chunk of guest-contiguous memory" is frontend abstraction while
the later "host-contiguous" is backend's one. But qemu_getrampagesize()
operates though on backend side (which doesn't have to represent guest RAM).
Shouldn't we look for guest RAM from frontend side (to be sure we are dealing 
with guest RAM)
and then find out corresponding host side attributes from there?

btw:
the above doesn't mean that we should block your simpler fix for 4.0.
So for this patch, with David's s390 patch included as a precondition:

Reviewed-by: Igor Mammedov 


> 
> > > > > > > This patch corrects the problem by adjusting 
> > > > > > > find_max_supported_pagesize()
> > > > > > > (called from qemu_getrampagesize() via object_child_foreach) to 
> > > > > > > exclude
> > > > > > > non-mapped memory backends.
> > > > > > I'm not sure if it's ok to do so. It depends on where
> > > > > > qemu_getrampagesize() is called from. For example s390 calls it rather 
> > > > > > early
> > > > > > when initializing KVM, so there isn't anything mapped yet.
> > > > > 
> > > > > Oh.  Drat.
> > > > >   
> > > > > > And once I replace -mem-path with hostmem backend and drop
> > > > > > qemu_mempath_getpagesize(mem_path) /which btw aren't guarantied to 
> > > > > > be mapped either/
> > > > > > this patch might lead to incorrect results for initial memory as
> > > > > > well.
> > > > > 
> > > > > Uh.. I don't 

Re: [Qemu-devel] [PATCH v2 1/2] block/vhdx: Remove redundant IEC binary prefixes definition

2019-03-27 Thread Stefano Garzarella
On Wed, Mar 27, 2019 at 01:16:49PM -0400, John Snow wrote:
> 
> 
> On 3/27/19 5:56 AM, Stefano Garzarella wrote:
> > IEC binary prefixes are already defined in "qemu/units.h",
> > so we can remove redundant definitions in "block/vhdx.h".
> > 
> > Signed-off-by: Stefano Garzarella 
> > ---
> >  block/vhdx.c | 3 ++-
> >  block/vhdx.h | 6 +-
> >  2 files changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/vhdx.c b/block/vhdx.c
> > index b785aef4b7..7cd1fc3731 100644
> > --- a/block/vhdx.c
> > +++ b/block/vhdx.c
> > @@ -1889,7 +1889,8 @@ static int coroutine_fn 
> > vhdx_co_create(BlockdevCreateOptions *opts,
> >  return -EINVAL;
> >  }
> >  if (block_size > VHDX_BLOCK_SIZE_MAX) {
> > -error_setg(errp, "Block size must not exceed %d", 
> > VHDX_BLOCK_SIZE_MAX);
> > +error_setg(errp, "Block size must not exceed %" PRId64,
> > +   VHDX_BLOCK_SIZE_MAX);
> >  return -EINVAL;
> >  }
> >  
> > diff --git a/block/vhdx.h b/block/vhdx.h
> > index 1bfb4e4f73..bf72090c8f 100644
> > --- a/block/vhdx.h
> > +++ b/block/vhdx.h
> > @@ -17,11 +17,7 @@
> >  
> >  #ifndef BLOCK_VHDX_H
> >  #define BLOCK_VHDX_H
> > -
> > -#define KiB  (1 * 1024)
> > -#define MiB(KiB * 1024)
> > -#define GiB(MiB * 1024)
> > -#define TiB ((uint64_t) GiB * 1024)
> > +#include "qemu/units.h"
> >  
> >  #define DEFAULT_LOG_SIZE 1048576 /* 1MiB */
> >  /* Structures and fields present in the VHDX file */
> > 
> 
> Looks good. By the way, you may want to consider adding to your git
> config rules that put the .h file diff changes above the .c file.
> 
> Look at qemu/scripts/git.orderfile

Thank you very much! It is very useful!

> 
> (I know I'm one to talk, I still haven't set mine up...)

:)

> 
> Reviewed-by: John Snow 

Thanks,
Stefano



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 17:43:45 +0100
Auger Eric  wrote:

> Hi,
> 
> On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:  
> >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >> distinguish by devfn & bus between devices in a conventional PCI
> >> topology and therefore we cannot assign them separate AddressSpaces.
> >> By taking this requester ID aliasing into account, QEMU better matches
> >> the bare metal behavior and restrictions, and enables shared
> >> AddressSpace configurations that are otherwise not possible with
> >> guest IOMMU support.
> >>
> >> For the latter case, given any example where an IOMMU group on the
> >> host includes multiple devices:
> >>
> >>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>   :00:01.0  :01:00.0  :01:00.1
> >>
> >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >> that we can only assign one of the endpoints to the guest because a
> >> second endpoint will attempt to use a different AddressSpace.  VFIO
> >> only supports IOMMU group level granularity at the container level,
> >> preventing this second endpoint from being assigned:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio 
> >> \
> >> :01:00.1: group 1 used in multiple address spaces
> >>
> >> However, when QEMU incorporates proper aliasing, we can make use of a
> >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >> provides the downstream devices with the same AddressSpace, ex:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1  
> > 
> > 
> > This will break extended address space access though, won't it?
> > Things like attempts to take PCI Express bandwidth into
> > consideration by drivers for E2E credit math will also
> > not work ...
> >   
> >>
> >> While the utility of this hack may be limited, this AddressSpace
> >> aliasing is the correct behavior for QEMU to emulate bare metal.  
> > 
> > 
> > This one's a valid point.
> >   
> >> Signed-off-by: Alex Williamson 
> >> ---
> >>  hw/pci/pci.c |   33 +++--
> >>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 35451c1e9987..38467e676f1f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2594,12 +2594,41 @@ AddressSpace 
> >> *pci_device_iommu_address_space(PCIDevice *dev)
> >>  {
> >>  PCIBus *bus = pci_get_bus(dev);
> >>  PCIBus *iommu_bus = bus;
> >> +uint8_t devfn = dev->devfn;
> >>  
> >>  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >> -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +
> >> +/*
> >> + * Determine which requester ID alias should be used for the 
> >> device
> >> + * based on the PCI topology.  There are no requester IDs on 
> >> convetional
> >> + * PCI buses, therefore we push the alias up to the parent on 
> >> each non-
> >> + * express bus.  Which alias we use depends on whether this is a 
> >> legacy
> >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> >> PCIe-to-
> >> + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> >> here
> >> + * because the resulting BDF depends on the secondary bridge 
> >> register
> >> + * programming.  
> > 
> > Could you clarify this part a bit pls?
> >   
> >>  We also cannot lookup the PCIBus from the bus number
> >> + * at this point for the iommu_fn.  Also, requester_id_cache is 
> >> the
> >> + * alias to the root bus, which is usually, but not necessarily 
> >> always
> >> + * where we'll find our iommu_fn.
> >> + */
> >> +if (!pci_bus_is_express(iommu_bus)) {
> >> +PCIDevice *parent = iommu_bus->parent_dev;
> >> +
> >> +if (pci_is_express(parent) &&
> >> +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >> +devfn = PCI_DEVFN(0, 0);  
> > 
> > Why 0,0?  
> 
> The spec says: "If the bridge generates a new Requester ID for a
> transaction forwarded from the secondary interface to the primary
> interface, the bridge assigns the PCI Express Requester ID using its
> secondary interface’s Bus Number and sets both the Device Number and
> Function Number fields to zero.". I guess it is the reason?

Yup, 
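
As a self-contained illustration of that rule (the secondary bus number 0x01
below is only an assumed example value, and the two macros are local stand-ins
for QEMU's PCI_DEVFN()/PCI_BUILD_BDF() helpers rather than the real headers):

#include <stdint.h>
#include <stdio.h>

/* Local stand-ins, for illustration only */
#define EX_PCI_DEVFN(slot, func)      ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define EX_PCI_BUILD_BDF(bus, devfn)  (((bus) << 8) | (devfn))

int main(void)
{
    uint8_t secondary_bus = 0x01;   /* assumed bridge secondary bus number */

    /* Per the quoted spec text: secondary bus number, device 0, function 0 */
    uint16_t req_id = EX_PCI_BUILD_BDF(secondary_bus, EX_PCI_DEVFN(0, 0));

    printf("bridge-generated requester ID: %04x\n", req_id);   /* prints 0100 */
    return 0;
}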

Re: [Qemu-devel] [PATCH] iotests: Fix test 200 on s390x without virtio-pci

2019-03-27 Thread John Snow



On 3/27/19 2:43 AM, Thomas Huth wrote:
> virtio-pci is optional on s390x, e.g. in downstream RHEL builds, it
> is disabled. On s390x, virtio-ccw should be used instead. Other tests
> like 051 or 240 already use virtio-scsi-ccw instead of virtio-scsi-pci
> on s390x, so let's do the same here and always use virtio-scsi-ccw on
> s390x.
> 
> Signed-off-by: Thomas Huth 
> ---
>  tests/qemu-iotests/200 | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/qemu-iotests/200 b/tests/qemu-iotests/200
> index 12d25f4..1c0f8ca 100755
> --- a/tests/qemu-iotests/200
> +++ b/tests/qemu-iotests/200
> @@ -52,13 +52,21 @@ ${QEMU_IMG} create -f $IMGFMT -F $IMGFMT "${TEST_IMG}" -b 
> "${BACKING_IMG}" 512M
>  
>  ${QEMU_IO} -c "write -P 0xa5 512 300M" "${BACKING_IMG}" | _filter_qemu_io
>  
> +case "$QEMU_DEFAULT_MACHINE" in
> +  s390-ccw-virtio)
> +  virtio_scsi="-device virtio-scsi-ccw,id=scsi0,iothread=iothread0"
> +  ;;
> +  *)
> +  virtio_scsi="-device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0
> +  -device 
> virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0"
> +  ;;
> +esac
> +
>  echo
>  echo === Starting QEMU VM ===
>  echo
>  qemu_comm_method="qmp"
> -_launch_qemu -device pci-bridge,id=bridge1,chassis_nr=1,bus=pci.0 \
> - -object iothread,id=iothread0 \
> - -device 
> virtio-scsi-pci,bus=bridge1,addr=0x1f,id=scsi0,iothread=iothread0 \
> +_launch_qemu -object iothread,id=iothread0 $virtio_scsi \
>   -drive 
> file="${TEST_IMG}",media=disk,if=none,cache=$CACHEMODE,id=drive_sysdisk,format=$IMGFMT
>  \
>   -device 
> scsi-hd,drive=drive_sysdisk,bus=scsi0.0,id=sysdisk,bootindex=0
>  h1=$QEMU_HANDLE
> 

Nice, thank you :)

Reviewed-by: John Snow 



Re: [Qemu-devel] [qemu-s390x] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread David Hildenbrand
On 27.03.19 18:08, Igor Mammedov wrote:
> On Wed, 27 Mar 2019 15:01:37 +0100
> David Hildenbrand  wrote:
> 
>> On 27.03.19 10:09, Igor Mammedov wrote:
>>> On Wed, 27 Mar 2019 09:10:01 +0100
>>> David Hildenbrand  wrote:
>>>
 On 27.03.19 01:12, David Gibson wrote:
> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
>> On 26.03.19 15:08, Igor Mammedov wrote:  
>>> On Tue, 26 Mar 2019 14:50:58 +1100
>>> David Gibson  wrote:
>>>  
 qemu_getrampagesize() works out the minimum host page size backing any 
 of
 guest RAM.  This is required in a few places, such as for POWER8 PAPR 
 KVM
 guests, because limitations of the hardware virtualization mean the 
 guest
 can't use pagesizes larger than the host pages backing its memory.

 However, it currently checks against *every* memory backend, whether 
 or not
 it is actually mapped into guest memory at the moment.  This is 
 incorrect.

 This can cause a problem attempting to add memory to a POWER8 pseries 
 KVM
 guest which is configured to allow hugepages in the guest (e.g.
 -machine cap-hpt-max-page-size=16m).  If you attempt to add 
 non-hugepage,
 you can (correctly) create a memory backend, however it (correctly) 
 will
 throw an error when you attempt to map that memory into the guest by
 'device_add'ing a pc-dimm.

 What's not correct is that if you then reset the guest a startup check
 against qemu_getrampagesize() will cause a fatal error because of the 
 new
 memory object, even though it's not mapped into the guest.  
> >>> I'd say that backend should be removed by mgmt app since device_add failed
> >>> instead of leaving it to hang around. (but a fatal error is not a nice
> >>> behavior on QEMU's part either)
>>
>> Indeed, it should be removed. Depending on the options (huge pages with
>> prealloc?) memory might be consumed for unused memory. Undesired.  
>
> Right, but if the guest initiates a reboot before the management gets
> to that, we'll have a crash.
>   

 Yes, I agree.

 At least on s390x (extending on what Igor said):

 mc->init() -> s390_memory_init() ->
 memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()


 ac->init_machine() -> kvm_arch_init() ->
 kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()


 And in vl.c

 configure_accelerator(current_machine, argv[0]);
>>> Looking more at it, it seems s390 is 'broken' anyways.
>>> We call qemu_getrampagesize() here with huge page backends on CLI
>>> but memory-backends are initialized later
>>>  qemu_opts_foreach(..., object_create_delayed, ...)
>>> so s390 doesn't take into account memory backends currently
>>
>> BTW that might indeed be true, we only check against --mem-path.
> 
> It's possible to break it with '-numa node,memdev=...' since we don't really
> have anything to block that call chain for s390 (but I'd argue it's an invalid
> use of the CLI for s390; it's effectively a '-mem-path on steroids' alternative)
> 

I remember that -numa on s390x is completely blocked, but my mind might
play tricks with me. Anyhow, detecting the biggest page size also has to
be fixed.
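
For the qemu_getrampagesize() side, here is a rough, hedged sketch of what
"only look at mapped backends" could mean in code. This is not the actual
patch; it assumes QEMU-internal helpers such as host_memory_backend_is_mapped()
and host_memory_backend_pagesize(), and it keeps the commit message's
"minimum page size backing guest RAM" semantics:

/* Sketch only: scan /objects but skip memory backends that are not mapped. */
#include "qemu/osdep.h"
#include "qom/object.h"
#include "sysemu/hostmem.h"

static int find_min_mapped_backend_pagesize(Object *obj, void *opaque)
{
    long *hpsize_min = opaque;

    if (object_dynamic_cast(obj, TYPE_MEMORY_BACKEND)) {
        HostMemoryBackend *backend = MEMORY_BACKEND(obj);
        long hpsize;

        /* Ignore backends that were created but never mapped into the guest */
        if (!host_memory_backend_is_mapped(backend)) {
            return 0;
        }
        hpsize = host_memory_backend_pagesize(backend);
        if (hpsize < *hpsize_min) {
            *hpsize_min = hpsize;
        }
    }
    return 0;
}

static long qemu_minrampagesize_mapped_only(void)
{
    long hpsize = LONG_MAX;

    object_child_foreach(object_resolve_path("/objects", NULL),
                         find_min_mapped_backend_pagesize, &hpsize);
    /* Fall back to the normal host page size if nothing is mapped yet */
    return hpsize == LONG_MAX ? getpagesize() : hpsize;
}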

-- 

Thanks,

David / dhildenb



Re: [Qemu-devel] [PATCH v2 2/2] block/vhdx: Use IEC binary prefixes for size constants

2019-03-27 Thread John Snow



On 3/27/19 5:56 AM, Stefano Garzarella wrote:
> Using IEC binary prefixes in order to make the code more readable,
> with the exception of DEFAULT_LOG_SIZE because it's passed to
> stringify().
> 
> Signed-off-by: Stefano Garzarella 

Looks good to me;

Reviewed-by: John Snow 



Re: [Qemu-devel] [PATCH v2 1/2] block/vhdx: Remove redundant IEC binary prefixes definition

2019-03-27 Thread John Snow



On 3/27/19 5:56 AM, Stefano Garzarella wrote:
> IEC binary prefixes are already defined in "qemu/units.h",
> so we can remove redundant definitions in "block/vhdx.h".
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  block/vhdx.c | 3 ++-
>  block/vhdx.h | 6 +-
>  2 files changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/block/vhdx.c b/block/vhdx.c
> index b785aef4b7..7cd1fc3731 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -1889,7 +1889,8 @@ static int coroutine_fn 
> vhdx_co_create(BlockdevCreateOptions *opts,
>  return -EINVAL;
>  }
>  if (block_size > VHDX_BLOCK_SIZE_MAX) {
> -error_setg(errp, "Block size must not exceed %d", 
> VHDX_BLOCK_SIZE_MAX);
> +error_setg(errp, "Block size must not exceed %" PRId64,
> +   VHDX_BLOCK_SIZE_MAX);
>  return -EINVAL;
>  }
>  
> diff --git a/block/vhdx.h b/block/vhdx.h
> index 1bfb4e4f73..bf72090c8f 100644
> --- a/block/vhdx.h
> +++ b/block/vhdx.h
> @@ -17,11 +17,7 @@
>  
>  #ifndef BLOCK_VHDX_H
>  #define BLOCK_VHDX_H
> -
> -#define KiB  (1 * 1024)
> -#define MiB(KiB * 1024)
> -#define GiB(MiB * 1024)
> -#define TiB ((uint64_t) GiB * 1024)
> +#include "qemu/units.h"
>  
>  #define DEFAULT_LOG_SIZE 1048576 /* 1MiB */
>  /* Structures and fields present in the VHDX file */
> 

Looks good. By the way, you may want to consider adding rules to your git
config that put the .h file diff changes above the .c file changes.

Look at qemu/scripts/git.orderfile

(I know I'm one to talk, I still haven't set mine up...)

Reviewed-by: John Snow 



Re: [Qemu-devel] [PATCH v3] configure: automatically pick python3 if available

2019-03-27 Thread Eduardo Habkost
On Wed, Mar 27, 2019 at 05:07:01PM +, Daniel P. Berrangé wrote:
> Unless overridden via an env var or configure arg, QEMU will only look
> for the 'python' binary in $PATH. This is unhelpful on distros which
> are only shipping Python 3.x (eg Fedora) in their default install as,
> if they comply with PEP 394, the bare 'python' binary won't exist.
> 
> This changes configure so that by default it will search for all three
> common python binaries, preferring to find Python 3.x versions.
> 
> Signed-off-by: Daniel P. Berrangé 

Queued for 4.1.  Thanks, and sorry for taking so long to handle it.

-- 
Eduardo



Re: [Qemu-devel] [qemu-s390x] [PATCH for-4.0?] exec: Only count mapped memory backends for qemu_getrampagesize()

2019-03-27 Thread Igor Mammedov
On Wed, 27 Mar 2019 15:01:37 +0100
David Hildenbrand  wrote:

> On 27.03.19 10:09, Igor Mammedov wrote:
> > On Wed, 27 Mar 2019 09:10:01 +0100
> > David Hildenbrand  wrote:
> > 
> >> On 27.03.19 01:12, David Gibson wrote:
> >>> On Tue, Mar 26, 2019 at 06:02:51PM +0100, David Hildenbrand wrote:  
>  On 26.03.19 15:08, Igor Mammedov wrote:  
> > On Tue, 26 Mar 2019 14:50:58 +1100
> > David Gibson  wrote:
> >  
> >> qemu_getrampagesize() works out the minimum host page size backing any 
> >> of
> >> guest RAM.  This is required in a few places, such as for POWER8 PAPR 
> >> KVM
> >> guests, because limitations of the hardware virtualization mean the 
> >> guest
> >> can't use pagesizes larger than the host pages backing its memory.
> >>
> >> However, it currently checks against *every* memory backend, whether 
> >> or not
> >> it is actually mapped into guest memory at the moment.  This is 
> >> incorrect.
> >>
> >> This can cause a problem attempting to add memory to a POWER8 pseries 
> >> KVM
> >> guest which is configured to allow hugepages in the guest (e.g.
> >> -machine cap-hpt-max-page-size=16m).  If you attempt to add 
> >> non-hugepage,
> >> you can (correctly) create a memory backend, however it (correctly) 
> >> will
> >> throw an error when you attempt to map that memory into the guest by
> >> 'device_add'ing a pc-dimm.
> >>
> >> What's not correct is that if you then reset the guest a startup check
> >> against qemu_getrampagesize() will cause a fatal error because of the 
> >> new
> >> memory object, even though it's not mapped into the guest.  
> > I'd say that backend should be removed by mgmt app since device_add failed
> > instead of leaving it to hang around. (but a fatal error is not a nice
> > behavior on QEMU's part either)
> 
>  Indeed, it should be removed. Depending on the options (huge pages with
>  prealloc?) memory might be consumed for unused memory. Undesired.  
> >>>
> >>> Right, but if the guest initiates a reboot before the management gets
> >>> to that, we'll have a crash.
> >>>   
> >>
> >> Yes, I agree.
> >>
> >> At least on s390x (extending on what Igor said):
> >>
> >> mc->init() -> s390_memory_init() ->
> >> memory_region_allocate_system_memory() -> host_memory_backend_set_mapped()
> >>
> >>
> >> ac->init_machine() -> kvm_arch_init() ->
> >> kvm_s390_configure_mempath_backing() -> qemu_getrampagesize()
> >>
> >>
> >> And in vl.c
> >>
> >> configure_accelerator(current_machine, argv[0]);
> > Looking more at it, it seems s390 is 'broken' anyways.
> > We call qemu_getrampagesize() here with huge page backends on CLI
> > but memory-backends are initialized later
> >  qemu_opts_foreach(..., object_create_delayed, ...)
> > so s390 doesn't take into account memory backends currently
> 
> BTW that might indeed be true, we only check against --mem-path.

It's possible to break it with '-numa node,memdev=...' since we don't really
have anything to block that call chain for s390 (but I'd argue it's an invalid
use of the CLI for s390; it's effectively a '-mem-path on steroids' alternative)




[Qemu-devel] [PATCH v3] configure: automatically pick python3 if available

2019-03-27 Thread Daniel P . Berrangé
Unless overridden via an env var or configure arg, QEMU will only look
for the 'python' binary in $PATH. This is unhelpful on distros which
are only shipping Python 3.x (eg Fedora) in their default install as,
if they comply with PEP 394, the bare 'python' binary won't exist.

This changes configure so that by default it will search for all three
common python binaries, preferring to find Python 3.x versions.

Signed-off-by: Daniel P. Berrangé 
---

Changed in v3:

 - 100% fewer tabs

 configure | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index 1c563a7027..8fb323c320 100755
--- a/configure
+++ b/configure
@@ -902,7 +902,18 @@ fi
 
 : ${make=${MAKE-make}}
 : ${install=${INSTALL-install}}
-: ${python=${PYTHON-python}}
+# We prefer python 3.x. A bare 'python' is traditionally
+# python 2.x, but some distros have it as python 3.x, so
+# we check that before python2
+python=
+for binary in "${PYTHON-python3}" python python2
+do
+if has "$binary"
+then
+python="$binary"
+break
+fi
+done
 : ${smbd=${SMBD-/usr/sbin/smbd}}
 
 # Default objcc to clang if available, otherwise use CC
@@ -1824,8 +1835,9 @@ EOF
 exit 0
 fi
 
-if ! has $python; then
-  error_exit "Python not found. Use --python=/path/to/python"
+if test -z "$python"
+then
+error_exit "Python not found. Use --python=/path/to/python"
 fi
 
 # Note that if the Python conditional here evaluates True we will exit
-- 
2.20.1




[Qemu-devel] [PATCH v2 6/6] Categorize devices: iommu

2019-03-27 Thread Ernest Esene
Set category and description for iommu devices
Signed-off-by: Ernest Esene 

---
v2:
  * split into separate patches
---
 hw/i386/amd_iommu.c   | 2 ++
 hw/i386/intel_iommu.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index 6eabdf9917..4a4e2c7fd4 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1601,6 +1601,8 @@ static void amdvi_class_init(ObjectClass *klass, void* 
data)
 dc_class->int_remap = amdvi_int_remap;
 /* Supported by the pc-q35-* machine types */
 dc->user_creatable = true;
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+dc->desc = "AMD IOMMU (AMD-Vi) DMA Remapping device";
 }
 
 static const TypeInfo amdvi = {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b90de6c664..4d0e60423c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3702,6 +3702,8 @@ static void vtd_class_init(ObjectClass *klass, void *data)
 x86_class->int_remap = vtd_int_remap;
 /* Supported by the pc-q35-* machine types */
 dc->user_creatable = true;
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+dc->desc = "Intel IOMMU (VT-d) DMA Remapping device";
 }
 
 static const TypeInfo vtd_info = {
-- 
2.14.2



signature.asc
Description: PGP signature


[Qemu-devel] [PULL 0/1] device-tree queue

2019-03-27 Thread Alistair Francis
The following changes since commit 49fc899f8d673dd9e73f3db0d9e9ea60b77c331b:

  Update version for v4.0.0-rc1 release (2019-03-26 17:02:29 +)

are available in the Git repository at:

  g...@github.com:alistair23/qemu.git tags/pull-device-tree-20190327

for you to fetch changes up to c3c962c12cca0ebba6ef12f6474618eb38aed9a3:

  MAINTAINERS: Update the device tree maintainers (2019-03-27 09:20:35 -0700)


Device Tree Pull Request for 4.0

A single patch updating the MAINTAINERS file for 4.0.


Alistair Francis (1):
  MAINTAINERS: Update the device tree maintainers

 MAINTAINERS | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


[Qemu-devel] [PULL 1/1] MAINTAINERS: Update the device tree maintainers

2019-03-27 Thread Alistair Francis
Remove Alex as a Device Tree maintainer as requested by him. Add myself
as a maintainer to avoid it being orphaned. Also add David as a
Reviewer (R) as he is the libfdt and DTC maintainer.

Signed-off-by: Alistair Francis 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Alexander Graf 
Acked-by: David Gibson 
---
 MAINTAINERS | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 85d7d764e5..c2ad5062f6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1813,7 +1813,8 @@ F: qom/cpu.c
 F: include/qom/cpu.h
 
 Device Tree
-M: Alexander Graf 
+M: Alistair Francis 
+R: David Gibson 
 S: Maintained
 F: device_tree.c
 F: include/sysemu/device_tree.h
-- 
2.21.0



Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 11:32:55 -0400
"Michael S. Tsirkin"  wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1  
> 
> 
> This will break extended address space access though, won't it?
> things like attempts to take PCI Express bandwidth into
> consideration by drivers for E2E credit math will also
> not work ...

Correct.  As I explain in my reply to Peter, we're forcing the IOMMU to
use a common address space for these devices, and the only hack we have
to do that is to introduce a conventional PCI bus.  PCIe-to-PCI bridges
are required to respond with an unsupported request to extended config
space on the downstream devices.  This is part of the loss of utility I
mention below.  Perhaps we can add a couple more layers of hacks if
those sorts of things are important.  PCI-X supports extended config
space, so (without digging through the spec) I assume a PCIe-to-PCIX
bridge could pass extended config space.  Additionally, if we had the
same bridge in reverse mode, for something like this:

[RC]-[PCIe-to-PCIX]-[PCIX-to-PCIe]-[PCIe endpoint]

Then we would have access to link and extended config space while
maintaining the single address space.  Hack^3

> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.  
> 
> 
> This one's a valid point.
> 
> > Signed-off-by: Alex Williamson 
> > ---
> >  hw/pci/pci.c |   33 +++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 35451c1e9987..38467e676f1f 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2594,12 +2594,41 @@ AddressSpace 
> > *pci_device_iommu_address_space(PCIDevice *dev)
> >  {
> >  PCIBus *bus = pci_get_bus(dev);
> >  PCIBus *iommu_bus = bus;
> > +uint8_t devfn = dev->devfn;
> >  
> >  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > +
> > +/*
> > + * Determine which requester ID alias should be used for the device
> > + * based on the PCI topology.  There are no requester IDs on 
> > conventional
> > + * PCI buses, therefore we push the alias up to the parent on each 
> > non-
> > + * express bus.  Which alias we use depends on whether this is a 
> > legacy
> > + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> > PCIe-to-
> > + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> > here
> > + * because the resulting BDF depends on the secondary bridge 
> > register
> > + * programming.  
> 
> Could you clarify this part a bit pls?

We use the same algorithm here as we do to come up with
requester_id_cache on the PCIDevice (modulo stopping at the iommu_fn
bus), but pci_req_id_cache_extract() is only useful once the bridge
secondary bus number register is programmed.  Until then, the bus part
of the BDF is always 

Re: [Qemu-devel] [PATCH for-4.0] nbd-client: Work around server BLOCK_STATUS misalignment at EOF

2019-03-27 Thread Vladimir Sementsov-Ogievskiy
26.03.2019 20:13, Eric Blake wrote:
> The NBD spec is clear that a server that advertises a minimum block
> size should reply to NBD_CMD_BLOCK_STATUS with extents aligned
> accordingly. However, we know that the qemu NBD server implementation
> has had a corner-case bug where it is not compliant with the spec,
> present since the introduction of NBD_CMD_BLOCK_STATUS in qemu 2.12
> (and unlikely to be patched in time for 4.0). Namely, when qemu is
> serving a file that is not a multiple of 512 bytes, it rounds the size
> advertised over NBD up to the next sector boundary (someday, I'd like
> to fix that to be byte-accurate, but it's a much bigger audit not
> appropriate for this release); yet if the final sector contains data
> prior to EOF, lseek(SEEK_HOLE) will point to the implicit hole
> mid-sector which qemu then reported over NBD.
> 
> We are well within our rights to hang up on a server that can't follow
> the spec, but it is more useful to try and keep the connection alive
> in spite of the problem. Do so by tracing a message about the problem,
> and then either truncating the request back to an aligned boundary (if
> it covered more than the final sector) or widening it out to the full
> boundary with a forced status of data (since truncating would result
> in 0 bytes, but we have to make progress, and valid since data is a
> default-safe answer). And in practice, since the problem only happens
> on a sector that starts with data and ends with a hole, we are going
> to want to read that full sector anyway (where qemu as the server
> fills in the tail beyond EOF with appropriate NUL bytes).
> 
> Easy reproduction:
> $ printf %1000d 1 > file
> $ qemu-nbd -f raw -t file & pid=$!
> $ qemu-img map --output=json -f raw nbd://localhost:10809
> qemu-img: Could not read file metadata: Invalid argument
> $ kill $pid

would be good to add it to iotests

> 
> where the patched version instead succeeds with:
> [{ "start": 0, "length": 1024, "depth": 0, "zero": false, "data": true}]
> 
> Signed-off-by: Eric Blake 


Reviewed-by: Vladimir Sementsov-Ogievskiy 

> ---
>   block/nbd-client.c | 28 +---
>   1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index a3b70d14004..241cc555246 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -269,14 +269,36 @@ static int 
> nbd_parse_blockstatus_payload(NBDClientSession *client,
>   extent->length = payload_advance32(&payload);
>   extent->flags = payload_advance32(&payload);
> 
> -if (extent->length == 0 ||
> -(client->info.min_block && !QEMU_IS_ALIGNED(extent->length,
> -
> client->info.min_block))) {
> +if (extent->length == 0) {
>   error_setg(errp, "Protocol error: server sent status chunk with "
>  "invalid length");

may be improved to s/invalid/zero/

>   return -EINVAL;
>   }
> 
> +/*
> + * A server sending unaligned block status is in violation of the
> + * protocol, but as qemu-nbd 3.1 is such a server (at least for
> + * POSIX files that are not a multiple of 512 bytes, since qemu
> + * rounds files up to 512-byte multiples but lseek(SEEK_HOLE)
> + * still sees an implicit hole beyond the real EOF), it's nicer to
> + * work around the misbehaving server. If the request included
> + * more than the final unaligned block, truncate it back to an
> + * aligned result; if the request was only the final block, round
> + * up to the full block and change the status to fully-allocated
> + * (always a safe status, even if it loses information).
> + */
> +if (client->info.min_block && !QEMU_IS_ALIGNED(extent->length,
> +   client->info.min_block)) {
> +trace_nbd_parse_blockstatus_compliance("extent length is unaligned");
> +if (extent->length > client->info.min_block) {
> +extent->length = QEMU_ALIGN_DOWN(extent->length,
> + client->info.min_block);
> +} else {
> +extent->length = client->info.min_block;
> +extent->flags = 0;
> +}
> +}
> +
>   /*
>* We used NBD_CMD_FLAG_REQ_ONE, so the server should not have
>* sent us any more than one extent, nor should it have included
> 


-- 
Best regards,
Vladimir
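
As a quick, standalone sanity check of the truncate-vs-widen rule described in
the commit message above, here is a small sketch with assumed values
(min_block = 512; ALIGN_DOWN is a local stand-in for QEMU_ALIGN_DOWN):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ALIGN_DOWN(n, m)  ((n) / (m) * (m))   /* stand-in for QEMU_ALIGN_DOWN */

int main(void)
{
    uint32_t min_block = 512;
    uint32_t lengths[] = { 1000, 488 };   /* longer and shorter than min_block */
    int i;

    for (i = 0; i < 2; i++) {
        uint32_t len = lengths[i];
        uint32_t flags = 0x3;             /* pretend the server sent hole|zero */

        if (len % min_block) {
            if (len > min_block) {
                len = ALIGN_DOWN(len, min_block);   /* 1000 -> 512, flags kept */
            } else {
                len = min_block;                    /* 488 -> 512 */
                flags = 0;                          /* forced to "data" status */
            }
        }
        printf("in=%" PRIu32 " out=%" PRIu32 " flags=0x%" PRIx32 "\n",
               lengths[i], len, flags);
    }
    return 0;
}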


[Qemu-devel] [PATCH v2 5/6] Categorize devices: IGD passthrough ISA bridge

2019-03-27 Thread Ernest Esene
Set category for the device
Signed-off-by: Ernest Esene 

---
v2:
  * split into separate patches
---
 hw/i386/pc_piix.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 8ad8e885c6..03a9cb8af3 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -911,6 +911,7 @@ static void isa_bridge_class_init(ObjectClass *klass, void 
*data)
 PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
 
 dc->desc= "ISA bridge faked to support IGD PT";
+set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
 k->vendor_id= PCI_VENDOR_ID_INTEL;
 k->class_id = PCI_CLASS_BRIDGE_ISA;
 };
-- 
2.14.2



signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2019 at 05:43:45PM +0100, Auger Eric wrote:
> Hi,
> 
> On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> > On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >> distinguish by devfn & bus between devices in a conventional PCI
> >> topology and therefore we cannot assign them separate AddressSpaces.
> >> By taking this requester ID aliasing into account, QEMU better matches
> >> the bare metal behavior and restrictions, and enables shared
> >> AddressSpace configurations that are otherwise not possible with
> >> guest IOMMU support.
> >>
> >> For the latter case, given any example where an IOMMU group on the
> >> host includes multiple devices:
> >>
> >>   $ ls  /sys/kernel/iommu_groups/1/devices/
> >>   0000:00:01.0  0000:01:00.0  0000:01:00.1
> >>
> >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >> that we can only assign one of the endpoints to the guest because a
> >> second endpoint will attempt to use a different AddressSpace.  VFIO
> >> only supports IOMMU group level granularity at the container level,
> >> preventing this second endpoint from being assigned:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio 
> >> \
> >> 0000:01:00.1: group 1 used in multiple address spaces
> >>
> >> However, when QEMU incorporates proper aliasing, we can make use of a
> >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >> provides the downstream devices with the same AddressSpace, ex:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>   -device intel-iommu,intremap=on \
> >>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > 
> > This will break extended address space access though, won't it?
> > things like attempts to take PCI Express bandwidth into
> > consideration by drivers for E2E credit math will also
> > not work ...
> > 
> >>
> >> While the utility of this hack may be limited, this AddressSpace
> >> aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > 
> > This one's a valid point.
> > 
> >> Signed-off-by: Alex Williamson 
> >> ---
> >>  hw/pci/pci.c |   33 +++--
> >>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 35451c1e9987..38467e676f1f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2594,12 +2594,41 @@ AddressSpace 
> >> *pci_device_iommu_address_space(PCIDevice *dev)
> >>  {
> >>  PCIBus *bus = pci_get_bus(dev);
> >>  PCIBus *iommu_bus = bus;
> >> +uint8_t devfn = dev->devfn;
> >>  
> >>  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >> -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +
> >> +/*
> >> + * Determine which requester ID alias should be used for the 
> >> device
> >> + * based on the PCI topology.  There are no requester IDs on 
> >> conventional
> >> + * PCI buses, therefore we push the alias up to the parent on 
> >> each non-
> >> + * express bus.  Which alias we use depends on whether this is a 
> >> legacy
> >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
> >> PCIe-to-
> >> + * PCI bridge spec.  Note that we cannot use pci_requester_id() 
> >> here
> >> + * because the resulting BDF depends on the secondary bridge 
> >> register
> >> + * programming.
> > 
> > Could you clarify this part a bit pls?
> > 
> >>  We also cannot lookup the PCIBus from the bus number
> >> + * at this point for the iommu_fn.  Also, requester_id_cache is 
> >> the
> >> + * alias to the root bus, which is usually, but not necessarily 
> >> always
> >> + * where we'll find our iommu_fn.
> >> + */
> >> +if (!pci_bus_is_express(iommu_bus)) {
> >> +PCIDevice *parent = iommu_bus->parent_dev;
> >> +
> >> +if (pci_is_express(parent) &&
> >> +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >> +devfn = PCI_DEVFN(0, 0);
> > 
> > Why 0,0?
> 
> The spec says: "If the bridge generates a new Requester ID for a
> transaction forwarded from the secondary interface to the primary
> interface, the bridge assigns the PCI Express Requester ID using its
> secondary interface’s Bus Number and sets both the Device Number and
> Function Number fields to zero.". I guess it is the reason?

Ah.

> > 
> >> +

Re: [Qemu-devel] [PATCH 2/2] accel: Unbreak accelerator fallback

2019-03-27 Thread Igor Mammedov
On Wed, 27 Mar 2019 16:03:47 +0100
Markus Armbruster  wrote:

> When the user specifies a list of accelerators, we pick the first one
> that initializes successfully.  Recent commit 1a3ec8c1564 broke that.
> Reproducer:
> 
> $ qemu-system-x86_64 --machine accel=xen:tcg
> xencall: error: Could not obtain handle on privileged command interface: 
> No such file or directory
> xen be core: xen be core: can't open xen interface
> can't open xen interface
> qemu-system-x86_64: failed to initialize Xen: Operation not permitted
> qemu-system-x86_64: /home/armbru/work/qemu/qom/object.c:436: 
> object_set_accelerator_compat_props: Assertion `!object_compat_props[0]' 
> failed.
> 
> Root cause: we register accelerator compat properties even when the
> accelerator fails.  The failed assertion is
> object_set_accelerator_compat_props() telling us off.  Fix by calling
> it only for the accelerator that succeeded.
> 
> Fixes: 1a3ec8c1564f51628cce10d435a2e22559ea29fd
> Signed-off-by: Markus Armbruster 

Reviewed-by: Igor Mammedov 

> ---
>  accel/accel.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/accel.c b/accel/accel.c
> index 8deb475b5d..454fef9d92 100644
> --- a/accel/accel.c
> +++ b/accel/accel.c
> @@ -65,8 +65,9 @@ static int accel_init_machine(AccelClass *acc, MachineState 
> *ms)
>  ms->accelerator = NULL;
>  *(acc->allowed) = false;
>  object_unref(OBJECT(accel));
> +} else {
> +object_set_accelerator_compat_props(acc->compat_props);
>  }
> -object_set_accelerator_compat_props(acc->compat_props);
>  return ret;
>  }
>  




Re: [Qemu-devel] [PATCH v2] configure: automatically pick python3 if available

2019-03-27 Thread Daniel P . Berrangé
On Wed, Mar 27, 2019 at 01:38:07PM -0300, Eduardo Habkost wrote:
> On Tue, Mar 19, 2019 at 11:02:08AM +, Daniel P. Berrangé wrote:
> > Unless overridden via an env var or configure arg, QEMU will only look
> > for the 'python' binary in $PATH. This is unhelpful on distros which
> > are only shipping Python 3.x (eg Fedora) in their default install as,
> > if they comply with PEP 394, the bare 'python' binary won't exist.
> > 
> > This changes configure so that by default it will search for all three
> > common python binaries, preferring to find Python 3.x versions.
> > 
> > Signed-off-by: Daniel P. Berrangé 
> > ---
> > 
> > Changed in v2:
> > 
> >  - Rewrote to follow Eric's suggested approach
> > 
> >  configure | 18 +++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/configure b/configure
> > index 7071f52584..028453a5a6 100755
> > --- a/configure
> > +++ b/configure
> > @@ -901,7 +901,18 @@ fi
> >  
> >  : ${make=${MAKE-make}}
> >  : ${install=${INSTALL-install}}
> > -: ${python=${PYTHON-python}}
> > +# We prefer python 3.x. A bare 'python' is traditionally
> > +# python 2.x, but some distros have it as python 3.x, so
> > +# we check that before python2
> > +python=
> > +for binary in "${PYTHON-python3}" python python2
> 
> I wouldn't like configure to try any other python binary if one
> was explicitly specified using "--python=".  It might make
> packaging mistakes go completely undetected.

This code runs long before the --python arg is parsed. It is just
setting up the default value for the $python variable in configure.

Later code will replace this default if --python is given, so
we still honour --python to the same extent as before this patch.

> 
> 
> > +do
> > +if has "$binary"
> > +then
> > +   python="$binary"
> > +   break
> 
> Tabs/spaces mix up here.  (I thought Patchew/checkpatch.pl would
> detect this?)

Oops. Perhaps not for the configure script.

> 
> > +fi
> > +done
> >  : ${smbd=${SMBD-/usr/sbin/smbd}}
> >  
> >  # Default objcc to clang if available, otherwise use CC
> > @@ -1797,8 +1808,9 @@ EOF
> >  exit 0
> >  fi
> >  
> > -if ! has $python; then
> > -  error_exit "Python not found. Use --python=/path/to/python"
> > +if test -z "$python"
> > +then
> > +error_exit "Python not found. Use --python=/path/to/python"
> >  fi
> >  
> >  # Note that if the Python conditional here evaluates True we will exit
> > -- 
> > 2.20.1
> > 
> > 
> 
> -- 
> Eduardo

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-devel] [PATCH v1] s390x/kvm: Configure page size after memory has actually been initialized

2019-03-27 Thread Igor Mammedov
On Wed, 27 Mar 2019 14:59:44 +0100
David Hildenbrand  wrote:

> Right now we configure the pagesize quite early, when initializing KVM.
> This is long before system memory is actually allocated via
> memory_region_allocate_system_memory(), and therefore memory backends
> marked as mapped.
> 
> Instead, let's configure the maximum page size after initializing
> memory in s390_memory_init(). cap_hpage_1m is still properly
> configured before creating any CPUs, and therefore before configuring
> the CPU model and eventually enabling CMMA.
> 
> We might later want to replace qemu_getrampagesize() by another
> detection mechanism, e.g. only looking at mapped, initial memory.
> We don't support any memory devices yet, and if so, we can always reject
> devices with a page size bigger than the initial page size when
> hotplugging. qemu_getrampagesize() should work for now, especially when
> converting it to only look at mapped backends.
> 
> Signed-off-by: David Hildenbrand 

Acked-by: Igor Mammedov 

> ---
> 
> I only did a quick sanity test - I don't have a system with huge page
> support around to test. I did a compile for both KVM and TCG.
> 
> As Conny is on vacation and Christian is also not around, I guess it is
> okay that David G. will pick this up for now, to eventually make
> qemu_getrampagesize() only check mapped backends.
> 
>  hw/s390x/s390-virtio-ccw.c |  9 +
>  target/s390x/cpu.c |  7 +++
>  target/s390x/cpu.h |  1 +
>  target/s390x/kvm-stub.c|  4 
>  target/s390x/kvm.c | 35 ++-
>  target/s390x/kvm_s390x.h   |  1 +
>  6 files changed, 36 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/s390x/s390-virtio-ccw.c b/hw/s390x/s390-virtio-ccw.c
> index d11069b860..b968bccc87 100644
> --- a/hw/s390x/s390-virtio-ccw.c
> +++ b/hw/s390x/s390-virtio-ccw.c
> @@ -15,6 +15,7 @@
>  #include "cpu.h"
>  #include "hw/boards.h"
>  #include "exec/address-spaces.h"
> +#include "exec/ram_addr.h"
>  #include "hw/s390x/s390-virtio-hcall.h"
>  #include "hw/s390x/sclp.h"
>  #include "hw/s390x/s390_flic.h"
> @@ -163,6 +164,7 @@ static void s390_memory_init(ram_addr_t mem_size)
>  MemoryRegion *sysmem = get_system_memory();
>  ram_addr_t chunk, offset = 0;
>  unsigned int number = 0;
> +Error *local_err = NULL;
>  gchar *name;
>  
>  /* allocate RAM for core */
> @@ -182,6 +184,12 @@ static void s390_memory_init(ram_addr_t mem_size)
>  }
>  g_free(name);
>  
> +/* Configure the maximum page size */
> +s390_set_max_pagesize(qemu_getrampagesize(), _err);
> +if (local_err) {
> +error_report_err(local_err);
> +exit(EXIT_FAILURE);
> +}
>  /* Initialize storage key device */
>  s390_skeys_init();
>  /* Initialize storage attributes device */
> @@ -253,6 +261,7 @@ static void ccw_init(MachineState *machine)
>  DeviceState *dev;
>  
>  s390_sclp_init();
> +/* init memory + setup max page size. Required for the CPU model */
>  s390_memory_init(machine->ram_size);
>  
>  /* init CPUs (incl. CPU model) early so s390_has_feature() works */
> diff --git a/target/s390x/cpu.c b/target/s390x/cpu.c
> index 698dd9cb82..b58ef0a8ef 100644
> --- a/target/s390x/cpu.c
> +++ b/target/s390x/cpu.c
> @@ -399,6 +399,13 @@ int s390_set_memory_limit(uint64_t new_limit, uint64_t 
> *hw_limit)
>  return 0;
>  }
>  
> +void s390_set_max_pagesize(uint64_t pagesize, Error **errp)
> +{
> +if (kvm_enabled()) {
> +kvm_s390_set_max_pagesize(pagesize, errp);
> +}
> +}
> +
>  void s390_cmma_reset(void)
>  {
>  if (kvm_enabled()) {
> diff --git a/target/s390x/cpu.h b/target/s390x/cpu.h
> index cb6d77053a..c14be2b5ba 100644
> --- a/target/s390x/cpu.h
> +++ b/target/s390x/cpu.h
> @@ -734,6 +734,7 @@ static inline void s390_do_cpu_load_normal(CPUState *cs, 
> run_on_cpu_data arg)
>  /* cpu.c */
>  void s390_crypto_reset(void);
>  int s390_set_memory_limit(uint64_t new_limit, uint64_t *hw_limit);
> +void s390_set_max_pagesize(uint64_t pagesize, Error **errp);
>  void s390_cmma_reset(void);
>  void s390_enable_css_support(S390CPU *cpu);
>  int s390_assign_subch_ioeventfd(EventNotifier *notifier, uint32_t sch_id,
> diff --git a/target/s390x/kvm-stub.c b/target/s390x/kvm-stub.c
> index bf7795e47a..22b4514ca6 100644
> --- a/target/s390x/kvm-stub.c
> +++ b/target/s390x/kvm-stub.c
> @@ -93,6 +93,10 @@ int kvm_s390_set_mem_limit(uint64_t new_limit, uint64_t 
> *hw_limit)
>  return 0;
>  }
>  
> +void kvm_s390_set_max_pagesize(uint64_t pagesize, Error **errp)
> +{
> +}
> +
>  void kvm_s390_crypto_reset(void)
>  {
>  }
> diff --git a/target/s390x/kvm.c b/target/s390x/kvm.c
> index 19530fb94e..bee73dc1a4 100644
> --- a/target/s390x/kvm.c
> +++ b/target/s390x/kvm.c
> @@ -283,44 +283,37 @@ void kvm_s390_crypto_reset(void)
>  }
>  }
>  
> -static int kvm_s390_configure_mempath_backing(KVMState *s)
> +void kvm_s390_set_max_pagesize(uint64_t pagesize, Error **errp)
>  {
> -

Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Auger Eric
Hi,

On 3/27/19 4:32 PM, Michael S. Tsirkin wrote:
> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>> distinguish by devfn & bus between devices in a conventional PCI
>> topology and therefore we cannot assign them separate AddressSpaces.
>> By taking this requester ID aliasing into account, QEMU better matches
>> the bare metal behavior and restrictions, and enables shared
>> AddressSpace configurations that are otherwise not possible with
>> guest IOMMU support.
>>
>> For the latter case, given any example where an IOMMU group on the
>> host includes multiple devices:
>>
>>   $ ls  /sys/kernel/iommu_groups/1/devices/
>>   0000:00:01.0  0000:01:00.0  0000:01:00.1
>>
>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>> that we can only assign one of the endpoints to the guest because a
>> second endpoint will attempt to use a different AddressSpace.  VFIO
>> only supports IOMMU group level granularity at the container level,
>> preventing this second endpoint from being assigned:
>>
>> qemu-system-x86_64 -machine q35... \
>>   -device intel-iommu,intremap=on \
>>   -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>
>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>> 0000:01:00.1: group 1 used in multiple address spaces
>>
>> However, when QEMU incorporates proper aliasing, we can make use of a
>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>> provides the downstream devices with the same AddressSpace, ex:
>>
>> qemu-system-x86_64 -machine q35... \
>>   -device intel-iommu,intremap=on \
>>   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> 
> 
> This will break extended address space access though, won't it?
> things like attempts to take PCI Express bandwidth into
> consideration by drivers for E2E credit math will also
> not work ...
> 
>>
>> While the utility of this hack may be limited, this AddressSpace
>> aliasing is the correct behavior for QEMU to emulate bare metal.
> 
> 
> This one's a valid point.
> 
>> Signed-off-by: Alex Williamson 
>> ---
>>  hw/pci/pci.c |   33 +++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 35451c1e9987..38467e676f1f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2594,12 +2594,41 @@ AddressSpace 
>> *pci_device_iommu_address_space(PCIDevice *dev)
>>  {
>>  PCIBus *bus = pci_get_bus(dev);
>>  PCIBus *iommu_bus = bus;
>> +uint8_t devfn = dev->devfn;
>>  
>>  while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>> -iommu_bus = pci_get_bus(iommu_bus->parent_dev);
>> +PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>> +
>> +/*
>> + * Determine which requester ID alias should be used for the device
>> + * based on the PCI topology.  There are no requester IDs on 
>> conventional
>> + * PCI buses, therefore we push the alias up to the parent on each 
>> non-
>> + * express bus.  Which alias we use depends on whether this is a 
>> legacy
>> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the 
>> PCIe-to-
>> + * PCI bridge spec.  Note that we cannot use pci_requester_id() here
>> + * because the resulting BDF depends on the secondary bridge 
>> register
>> + * programming.
> 
> Could you clarify this part a bit pls?
> 
>>  We also cannot lookup the PCIBus from the bus number
>> + * at this point for the iommu_fn.  Also, requester_id_cache is the
>> + * alias to the root bus, which is usually, but not necessarily 
>> always
>> + * where we'll find our iommu_fn.
>> + */
>> +if (!pci_bus_is_express(iommu_bus)) {
>> +PCIDevice *parent = iommu_bus->parent_dev;
>> +
>> +if (pci_is_express(parent) &&
>> +pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
>> +devfn = PCI_DEVFN(0, 0);
> 
> Why 0,0?

The spec says: "If the bridge generates a new Requester ID for a
transaction forwarded from the secondary interface to the primary
interface, the bridge assigns the PCI Express Requester ID using its
secondary interface’s Bus Number and sets both the Device Number and
Function Number fields to zero.". I guess it is the reason?

> 
>> +bus = iommu_bus;
>> +} else {
>> +devfn = parent->devfn;
>> +bus = parent_bus;
Alex, please could you clarify this case as well, with a comment?

Thanks

Eric
>> +}
>> +}
>> +
>> +iommu_bus = parent_bus;
>>  }
>>  if (iommu_bus && 

Re: [Qemu-devel] QEMU event loop optimizations

2019-03-27 Thread Stefan Hajnoczi
On Tue, Mar 26, 2019 at 02:37:35PM +0100, Paolo Bonzini wrote:
> On 26/03/19 14:18, Stefan Hajnoczi wrote:
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> > 
> >   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> > 
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> > 
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> > 
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> > 
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> Another thing to consider is IOCB_CMD_POLL, and replacing poll() with
> io_getevents(min_nr=1, nr=1).  If more than one event is ready, it can
> be read directly from the ring buffer.

Hey, cool.  io_uring has an equivalent operation too.  We'll add it as a
stretch goal to the GSoC/Outreachy project.

> Maybe we can even remove the little-tested epoll() path in favor of this
> one.

That would be nice if the performance results are favorable.

Stefan
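
For whoever picks this up, a minimal, hedged sketch of the IOCB_CMD_POLL idea
(assumptions: Linux 4.18+ headers that define IOCB_CMD_POLL in
<linux/aio_abi.h>, and raw syscall wrappers rather than libaio). It arms a
one-shot poll on stdin through the AIO context and reaps it with
io_getevents(min_nr=1, nr=1):

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <poll.h>
#include <time.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}
static long sys_io_submit(aio_context_t ctx, long nr, struct iocb **iocbs)
{
    return syscall(SYS_io_submit, ctx, nr, iocbs);
}
static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout);
}

int main(void)
{
    aio_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;

    if (sys_io_setup(128, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = STDIN_FILENO;         /* fd to watch */
    cb.aio_lio_opcode = IOCB_CMD_POLL;    /* one-shot poll instead of poll(2) */
    cb.aio_buf = POLLIN;                  /* requested events go in aio_buf */

    if (sys_io_submit(ctx, 1, cbs) != 1) {
        perror("io_submit");
        return 1;
    }

    /* Blocks until at least one completion; ev.res carries the revents mask */
    if (sys_io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
        printf("revents=0x%llx\n", (unsigned long long)ev.res);
    }
    return 0;
}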


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Paolo Bonzini
On 27/03/19 17:24, Laszlo Ersek wrote:
> On 03/27/19 17:15, Paolo Bonzini wrote:
>> On 27/03/19 17:05, Daniel P. Berrangé wrote:
>>> On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
 On 27/03/19 16:30, Daniel P. Berrangé wrote:
> Perhaps the VM test scripts should do a "HEAD" request for the image
> every time to discover if it has been changed on the server, before
> honouring the local cache.

 Another possibility is to first download the shasum from
 download.patchew.org, and compare _that_ against the one that is stored
 locally, instead of hardcoding it in QEMU's repository.
>>>
>>> Personally I prefer the idea of having the shasum stored in the repo.
>>>
>>> That means that if we update git master to point to a newer image,
>>> previous stable branches will stick with their original image, rather
>>> than using a new image that may be incompatible with the stable branch
>>>
>>> Storing hash in git also means that if someone compromised the patchew
>>> server, they can't cause developer to run compromised images, without
>>> first also compromising git to change the hash.
>>
>> The two are not mutually exclusive.  We can warn if the hash doesn't
>> match against the one in QEMU, add a --force option, or whatever.
>>
>> Also, I have now created symlinks by hash at
>> http://download.patchew.org/by-sha256sum in case someone finds them useful.
> 
> Isn't this risky? If someone replaces an image file (keeping its name),
> the old symlink will continue "working", but the hash stated by the
> symlink's name will not match the pointed-to image file.

Well, the idea is that if you use them you also double-check what you
downloaded. :)

Paolo
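
To make the "double-check what you downloaded" step concrete, here is a small
hedged sketch (assumptions: GLib is available, as it already is for QEMU
builds; the image path and the expected SHA-256 hex digest recorded in the
repo are passed on the command line):

#include <glib.h>
#include <stdio.h>

/* Returns TRUE when the file's SHA-256 matches the expected hex digest. */
static gboolean image_matches_hash(const char *path, const char *expected_hex)
{
    gchar *data = NULL;
    gsize len = 0;
    gchar *actual;
    gboolean ok;

    if (!g_file_get_contents(path, &data, &len, NULL)) {
        return FALSE;
    }
    actual = g_compute_checksum_for_data(G_CHECKSUM_SHA256,
                                         (const guchar *)data, len);
    ok = (g_strcmp0(actual, expected_hex) == 0);
    g_free(actual);
    g_free(data);
    return ok;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <image> <sha256-hex>\n", argv[0]);
        return 2;
    }
    return image_matches_hash(argv[1], argv[2]) ? 0 : 1;
}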



Re: [Qemu-devel] [PATCH v2] configure: automatically pick python3 if available

2019-03-27 Thread Eduardo Habkost
On Tue, Mar 19, 2019 at 11:02:08AM +, Daniel P. Berrangé wrote:
> Unless overridden via an env var or configure arg, QEMU will only look
> for the 'python' binary in $PATH. This is unhelpful on distros which
> are only shipping Python 3.x (eg Fedora) in their default install as,
> if they comply with PEP 394, the bare 'python' binary won't exist.
> 
> This changes configure so that by default it will search for all three
> common python binaries, preferring to find Python 3.x versions.
> 
> Signed-off-by: Daniel P. Berrangé 
> ---
> 
> Changed in v2:
> 
>  - Rewrote to follow Eric's suggested approach
> 
>  configure | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/configure b/configure
> index 7071f52584..028453a5a6 100755
> --- a/configure
> +++ b/configure
> @@ -901,7 +901,18 @@ fi
>  
>  : ${make=${MAKE-make}}
>  : ${install=${INSTALL-install}}
> -: ${python=${PYTHON-python}}
> +# We prefer python 3.x. A bare 'python' is traditionally
> +# python 2.x, but some distros have it as python 3.x, so
> +# we check that before python2
> +python=
> +for binary in "${PYTHON-python3}" python python2

I wouldn't like configure to try any other python binary if one
was explicitly specified using "--python=".  It might make
packaging mistakes go completely undetected.


> +do
> +if has "$binary"
> +then
> + python="$binary"
> + break

Tabs/spaces mix up here.  (I thought Patchew/checkpatch.pl would
detect this?)

> +fi
> +done
>  : ${smbd=${SMBD-/usr/sbin/smbd}}
>  
>  # Default objcc to clang if available, otherwise use CC
> @@ -1797,8 +1808,9 @@ EOF
>  exit 0
>  fi
>  
> -if ! has $python; then
> -  error_exit "Python not found. Use --python=/path/to/python"
> +if test -z "$python"
> +then
> +error_exit "Python not found. Use --python=/path/to/python"
>  fi
>  
>  # Note that if the Python conditional here evaluates True we will exit
> -- 
> 2.20.1
> 
> 

-- 
Eduardo



Re: [Qemu-devel] [multiprocess RFC PATCH 36/37] multi-process: add the concept description to docs/devel/qemu-multiprocess

2019-03-27 Thread Stefan Hajnoczi
On Tue, Mar 26, 2019 at 10:31:53AM -0400, Jag Raman wrote:
> 
> 
> On 3/26/2019 4:08 AM, Stefan Hajnoczi wrote:
> > On Fri, Mar 08, 2019 at 09:50:36AM +, Stefan Hajnoczi wrote:
> > > On Thu, Mar 07, 2019 at 03:29:41PM -0800, John G Johnson wrote:
> > > > > On Mar 7, 2019, at 11:27 AM, Stefan Hajnoczi  
> > > > > wrote:
> > > > > On Thu, Mar 07, 2019 at 02:51:20PM +, Daniel P. Berrangé wrote:
> > > > > > On Thu, Mar 07, 2019 at 02:26:09PM +, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Mar 06, 2019 at 11:22:53PM -0800, 
> > > > > > > elena.ufimts...@oracle.com wrote:
> > > > > > > > diff --git a/docs/devel/qemu-multiprocess.txt 
> > > > > > > > b/docs/devel/qemu-multiprocess.txt
> > > > > > > > new file mode 100644
> > > > > > > > index 000..e29c6c8
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/docs/devel/qemu-multiprocess.txt
> > > > > > > 
> > > > > > > Thanks for this document and the interesting work that you are 
> > > > > > > doing.
> > > > > > > I'd like to discuss the security advantages gained by 
> > > > > > > disaggregating
> > > > > > > QEMU in more detail.
> > > > > > > 
> > > > > > > The security model for VMs managed by libvirt (most production 
> > > > > > > x86, ppc,
> > > > > > > s390 guests) is that the QEMU process is untrusted and only has 
> > > > > > > access
> > > > > > > to resources belonging to the guest.  SELinux is used to restrict 
> > > > > > > the
> > > > > > > process from accessing other files, processes, etc on the host.
> > > > > > 
> > > > > > NB it doesn't have to be SELinux. Libvirt also supports AppArmor and
> > > > > > can even do isolation with traditional DAC by putting each QEMU 
> > > > > > under
> > > > > > a distinct UID/GID and having libvirtd set ownership on resources 
> > > > > > each
> > > > > > VM is permitted to use.
> > > > > > 
> > > > > > > QEMU does not hold privileged resources that must be kept away 
> > > > > > > from the
> > > > > > > guest.  An escaped guest can access its image file, tap file 
> > > > > > > descriptor,
> > > > > > > etc but they are the same resources it could already access via 
> > > > > > > device
> > > > > > > emulation.
> > > > > > > 
> > > > > > > Can you give specific examples of how disaggregation improves 
> > > > > > > security?
> > > > > 
> > > > > Elena & collaborators: Dan has posted some ideas but please share 
> > > > > yours
> > > > > so the security benefits of this patch series can be better 
> > > > > understood.
> > > > > 
> > > > 
> > > > Dan covered the main point.  The security regime we use 
> > > > (selinux)
> > > > constrains the actions of processes on objects, so having multiple 
> > > > processes
> > > > allows us to apply more fine-grained policies.
> > > 
> > > Please share the SELinux policy files, containerization scripts, etc.
> > > There is probably a home for them in qemu.git, libvirt.git, or elsewhere
> > > upstream.
> > > 
> > > We need to find a way to make the sandboxing improvements available to
> > > users besides yourself and easily reusable for developers who wish to
> > > convert additional device models.
> > 
> > Ping?
> > 
> > Without the scripts/policies there is no security benefit from this
> > patch series.
> 
> Hi Stefan,
> 
> We are working on this. We'll get back to you once we have this
> available.

Great, thanks!

Stefan


signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Alex Williamson
On Wed, 27 Mar 2019 14:25:00 +0800
Peter Xu  wrote:

> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> 
> [1]
> 
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson   
> 
> The patch looks sane to me even as a bug fix since otherwise the DMA
> address spaces used under misc kinds of PCI bridges can be wrong, so:

I'm not sure if "as a bug fix" here is encouraging a 4.0 target, but
I'd be cautious about this if so.  Eric Auger noted that he's seen an
SMMU VM hit a guest kernel bug-on, which needs further
investigation.  It's not clear if it's just an untested or
unimplemented scenario for SMMU to see a conventional PCI bus or if
there's something wrong in QEMU.  I also haven't tested AMD IOMMU and
only VT-d to a very limited degree, thus RFC.
 
> Reviewed-by: Peter Xu 
> 
> Though I have a question that confused me even before: Alex, do you
> know why all the context entries of the devices in the IOMMU root table
> will be programmed even if the devices are under a pcie-to-pci bridge?
> I'm giving an example with above [1] to be clear: in that case IIUC
> we'll program context entries for all the three devices (00:01.0,
> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> bare metal and then why we bother to program the context entry of
> 01:00.1?  It seems never used.
> 
> (It should be used for current QEMU to work with pcie-to-pci bridges
>  if without this patch, but I feel like I don't know the real answer
>  behind)

We actually have two different scenarios that could be represented by
[1], the group can be formed by lack of isolation or by lack of
visibility.  In the group above, it's the former, isolation.  The PCIe
root port does not support ACS, so while the IOMMU has visibility of
the individual devices, peer-to-peer between devices may also be
possible.  Native, trusted, in-kernel drivers for these devices could
still make use of separate IOMMU domains per device, but in order to
expose the devices to a userspace driver we need to consider them a
non-isolated set to prevent side-channel attacks between devices.  We
therefore consider them as a group within the IOMMU API and it's
required that each context entry maps to the same domain as the IOMMU
will see transactions for each requester ID.

If we had the visibility case, such as if [1] represented a PCIe-to-PCI
bridge subgroup, then the IOMMU really does only see the bridge
requester ID and there may not be a reason to populate the context
entries for the downstream aliased devices.  Perhaps the IOMMU might
still choose to do so, particularly if the bridge is actually a PCI-X
bridge as PCI-X does incorporate a requester ID, but also has strange
rules about the bridge being able to claim ownership of the
transaction.  So it might be paranoia or simplification that causes all
the context entries to be programmed, or for alias quirks, 
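
To make the aliasing concrete, here is a small standalone sketch (not QEMU
code; the topology, the bus numbers and the "a conventional bus means the
upstream bridge owns the transaction" rule are simplified assumptions, and
as noted above real bridges such as PCI-X have extra quirks):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pci_bus;

struct pci_dev {
    struct pci_bus *bus;        /* bus the device sits on */
    uint8_t devfn;
};

struct pci_bus {
    uint8_t num;
    bool express;
    struct pci_dev *bridge;     /* upstream bridge, NULL for the root bus */
};

static uint16_t rid(const struct pci_dev *d)
{
    return ((uint16_t)d->bus->num << 8) | d->devfn;
}

/* Walk towards the root; every conventional bus on the way makes its
 * upstream bridge the effective requester the IOMMU will see. */
static uint16_t dma_alias_rid(const struct pci_dev *d)
{
    const struct pci_dev *req = d;
    const struct pci_bus *b = d->bus;

    while (b->bridge) {
        if (!b->express) {
            req = b->bridge;
        }
        b = b->bridge->bus;
    }
    return rid(req);
}

int main(void)
{
    struct pci_bus root = { .num = 0x00, .express = true,  .bridge = NULL };
    struct pci_dev bridge = { .bus = &root, .devfn = (0x1e << 3) | 0 };
    struct pci_bus pci1 = { .num = 0x01, .express = false, .bridge = &bridge };
    struct pci_dev ep0 = { .bus = &pci1, .devfn = (0x01 << 3) | 0 };
    struct pci_dev ep1 = { .bus = &pci1, .devfn = (0x01 << 3) | 1 };

    /* Both endpoints behind the PCIe-to-PCI bridge alias to 00:1e.0, so
     * they share one address space, as in the pcie-pci-bridge example
     * in the commit message above. */
    printf("01:01.0 aliases to requester ID 0x%04x\n",
           (unsigned)dma_alias_rid(&ep0));
    printf("01:01.1 aliases to requester ID 0x%04x\n",
           (unsigned)dma_alias_rid(&ep1));
    return 0;
}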

Re: [Qemu-devel] [PATCH for 4.0 v1 1/5] riscv: plic: Fix incorrect irq calculation

2019-03-27 Thread Alistair Francis
On Wed, Mar 27, 2019 at 3:29 AM Palmer Dabbelt  wrote:
>
> On Wed, 20 Mar 2019 17:46:09 PDT (-0700), Alistair Francis wrote:
> > The irq is incorrectly calculated to be off by one. It has worked in the
> > past as the priority_base offset has also been set incorrectly. We are
> > about to fix the priority_base offset so first fix the irq
> > calculation.
> >
> > Signed-off-by: Alistair Francis 
> > ---
> >  hw/riscv/sifive_plic.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/riscv/sifive_plic.c b/hw/riscv/sifive_plic.c
> > index 1c703e1a37..70a85cd075 100644
> > --- a/hw/riscv/sifive_plic.c
> > +++ b/hw/riscv/sifive_plic.c
> > @@ -206,7 +206,7 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> > addr, unsigned size)
> >  if (addr >= plic->priority_base && /* 4 bytes per source */
> >  addr < plic->priority_base + (plic->num_sources << 2))
> >  {
> > -uint32_t irq = (addr - plic->priority_base) >> 2;
> > +uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
> >  if (RISCV_DEBUG_PLIC) {
> >  qemu_log("plic: read priority: irq=%d priority=%d\n",
> >  irq, plic->source_priority[irq]);
> > @@ -279,7 +279,7 @@ static void sifive_plic_write(void *opaque, hwaddr 
> > addr, uint64_t value,
> >  if (addr >= plic->priority_base && /* 4 bytes per source */
> >  addr < plic->priority_base + (plic->num_sources << 2))
> >  {
> > -uint32_t irq = (addr - plic->priority_base) >> 2;
> > +uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
> >  plic->source_priority[irq] = value & 7;
> >  if (RISCV_DEBUG_PLIC) {
> >  qemu_log("plic: write priority: irq=%d priority=%d\n",
>
> I think this breaks bisectability and should be merged with the
> *_PLIC_PRIORITY_BASE modifications, as otherwise we'll end up with the
> priority being set for the wrong IRQ.

Good point, I can merge them together.

>
> I'm also not sure how this fixes the bug listed in the OpenSBI PR.  As far as 
> I
> can understand it, all this does is ensure that source 0 is actually treated 
> as
> illegal (and shrinks the number of sources to match what's actually there, but
> that shouldn't fix the bug).  That looks more like a cleanup than an actual 
> fix.

The problem is that before we were off by one. We supported 0 - (n - 1)
and this patch set changes QEMU to support 1 - n. This is because of the
"addr < plic->priority_base + (plic->num_sources << 2)" comparison. As
priority_base is now 0x04 instead of 0x00, the comparison will work
correctly.
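
For what it's worth, the arithmetic is easy to see in isolation. A
standalone illustration, assuming the 4-byte register stride and the new
0x04 priority_base from this series (source 1's priority register sits at
offset 0x04, source 2's at 0x08, and so on):

#include <stdio.h>

int main(void)
{
    const unsigned int priority_base = 0x04;    /* new base */
    unsigned int addr;

    for (addr = 0x04; addr <= 0x0c; addr += 4) {
        unsigned int old_irq = (addr - priority_base) >> 2;        /* before */
        unsigned int new_irq = ((addr - priority_base) >> 2) + 1;  /* after  */
        printf("offset 0x%02x -> irq %u (old formula) vs irq %u (fixed)\n",
               addr, old_irq, new_irq);
    }
    return 0;
}

With the old formula and the new base, offset 0x04 would be treated as
source 0, which is exactly the off-by-one above.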

Alistair

>
> Am I missing something?



Re: [Qemu-devel] [PATCH v7 0/2] log: Make glib logging go through QEMU

2019-03-27 Thread Markus Armbruster
Uh, I totally missed this one...  my apologies!

Christophe Fergeau  writes:

> The main goal of this patch series is to make logs output by
> libspice-server.so go through qemu logging infrastructure so that their
> format is the same as the rest of QEMU messages (in particular,
> timestamps). This is done in the second patch, the first one is a
> preparation patch.

Reviewed-by: Markus Armbruster 

Queued for 4.1, thanks!



Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Laszlo Ersek
On 03/27/19 17:15, Paolo Bonzini wrote:
> On 27/03/19 17:05, Daniel P. Berrangé wrote:
>> On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
>>> On 27/03/19 16:30, Daniel P. Berrangé wrote:
 Perhaps the VM test scripts should do a "HEAD" request for the image
 every time to discover if it has been changed on the server, before
 honouring the local cache.
>>>
>>> Another possibility is to first download the shasum from
>>> download.patchew.org, and compare _that_ against the one that is stored
>>> locally, instead of hardcoding it in QEMU's repository.
>>
>> Personally I prefer the idea of having the shasum stored in the repo.
>>
>> That means that if we update git master to point to a newer image,
>> previous stable branches will stick with their original image, rather
>> than using a new image that may be incompatible with the stable branch
>>
>> Storing hash in git also means that if someone compromised the patchew
>> server, they can't cause developer to run compromised images, without
>> first also compromising git to change the hash.
> 
> The two are not mutually exclusive.  We can warn if the hash doesn't
> match against the one in QEMU, add a --force option, or whatever.
> 
> Also, I have now created symlinks by hash at
> http://download.patchew.org/by-sha256sum in case someone finds them useful.

Isn't this risky? If someone replaces an image file (keeping its name),
the old symlink will continue "working", but the hash stated by the
symlink's name will not match the pointed-to image file.

It would be nice to enforce a hash update on upload, or to address files
by content.

... Are we inventing a git submodule? :)

Laszlo



Re: [Qemu-devel] [PATCH v2 0/4] pvrdma: Add support for SRQ

2019-03-27 Thread Yuval Shaia
On Tue, Mar 26, 2019 at 02:54:29PM +0200, Kamal Heib wrote:
> This series implements the SRQ (Shared Receive Queue) for the pvrdma
> device, It also includes all the needed functions and definitions for
> support SRQ in the backend and resource management layers.
> 
> Changes from v1->v2:
> - Handle checkpatch.pl warnings. 

Hi Kamal,
I'm done with the review, please check my comments.

Thanks for doing it; this is an advanced feature that I have wanted to do
for a long time.

> 
> Kamal Heib (4):
>   hw/rdma: Add SRQ support to backend layer
>   hw/rdma: Add support for managing SRQ resource
>   hw/rdma: Modify create/destroy QP to support SRQ
>   hw/pvrdma: Add support for SRQ
> 
>  hw/rdma/rdma_backend.c  | 125 +-
>  hw/rdma/rdma_backend.h  |  18 +++-
>  hw/rdma/rdma_backend_defs.h |   5 +
>  hw/rdma/rdma_rm.c   | 106 +-
>  hw/rdma/rdma_rm.h   |  13 ++-
>  hw/rdma/rdma_rm_defs.h  |   9 ++
>  hw/rdma/vmw/pvrdma_cmd.c| 208 
>  hw/rdma/vmw/pvrdma_main.c   |  16 +++
>  hw/rdma/vmw/pvrdma_qp_ops.c |  46 +++-
>  hw/rdma/vmw/pvrdma_qp_ops.h |   1 +
>  10 files changed, 513 insertions(+), 34 deletions(-)
> 
> -- 
> 2.20.1
> 
> 



Re: [Qemu-devel] [PATCH v2 4/4] hw/pvrdma: Add support for SRQ

2019-03-27 Thread Yuval Shaia
On Tue, Mar 26, 2019 at 02:54:33PM +0200, Kamal Heib wrote:
> Implement the pvrdma device commands for supporting SRQ
> 
> Signed-off-by: Kamal Heib 
> ---
>  hw/rdma/vmw/pvrdma_cmd.c| 147 
>  hw/rdma/vmw/pvrdma_main.c   |  16 
>  hw/rdma/vmw/pvrdma_qp_ops.c |  46 ++-
>  hw/rdma/vmw/pvrdma_qp_ops.h |   1 +
>  4 files changed, 209 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
> index 510062f05476..fb075048c90c 100644
> --- a/hw/rdma/vmw/pvrdma_cmd.c
> +++ b/hw/rdma/vmw/pvrdma_cmd.c
> @@ -615,6 +615,149 @@ static int destroy_uc(PVRDMADev *dev, union 
> pvrdma_cmd_req *req,
>  return 0;
>  }
>  
> +static int create_srq_ring(PCIDevice *pci_dev, PvrdmaRing **ring,
> +   uint64_t pdir_dma, uint32_t max_wr,
> +   uint32_t max_sge, uint32_t nchunks)
> +{
> +uint64_t *dir = NULL, *tbl = NULL;
> +PvrdmaRing *r;
> +int rc = -EINVAL;
> +char ring_name[MAX_RING_NAME_SZ];
> +uint32_t wqe_sz;
> +
> +if (!nchunks || nchunks > PVRDMA_MAX_FAST_REG_PAGES) {
> +rdma_error_report("Got invalid page count for SRQ ring: %d",
> +  nchunks);
> +return rc;
> +}
> +
> +dir = rdma_pci_dma_map(pci_dev, pdir_dma, TARGET_PAGE_SIZE);
> +if (!dir) {
> +rdma_error_report("Failed to map to SRQ page directory");
> +goto out;
> +}
> +
> +tbl = rdma_pci_dma_map(pci_dev, dir[0], TARGET_PAGE_SIZE);
> +if (!tbl) {
> +rdma_error_report("Failed to map to SRQ page table");
> +goto out;
> +}
> +
> +r = g_malloc(sizeof(*r));
> +*ring = r;
> +
> +r->ring_state = (struct pvrdma_ring *)
> +rdma_pci_dma_map(pci_dev, tbl[0], TARGET_PAGE_SIZE);
> +if (!r->ring_state) {
> +rdma_error_report("Failed to map tp SRQ ring state");
> +goto out_free_ring_mem;
> +}
> +
> +wqe_sz = pow2ceil(sizeof(struct pvrdma_rq_wqe_hdr) +
> +  sizeof(struct pvrdma_sge) * max_sge - 1);
> +sprintf(ring_name, "srq_ring_%" PRIx64, pdir_dma);
> +rc = pvrdma_ring_init(r, ring_name, pci_dev, &r->ring_state[1], max_wr,
> +  wqe_sz, (dma_addr_t *)&tbl[1], nchunks - 1);
> +if (rc) {
> +goto out_unmap_ring_state;
> +}
> +
> +goto out;
> +
> +out_unmap_ring_state:
> +rdma_pci_dma_unmap(pci_dev, r->ring_state, TARGET_PAGE_SIZE);
> +
> +out_free_ring_mem:
> +g_free(r);
> +
> +out:
> +rdma_pci_dma_unmap(pci_dev, tbl, TARGET_PAGE_SIZE);
> +rdma_pci_dma_unmap(pci_dev, dir, TARGET_PAGE_SIZE);
> +
> +return rc;
> +}
> +
> +static void destroy_srq_ring(PvrdmaRing *ring)
> +{
> +pvrdma_ring_free(ring);
> +rdma_pci_dma_unmap(ring->dev, ring->ring_state, TARGET_PAGE_SIZE);
> +g_free(ring);
> +}
> +
> +static int create_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +  union pvrdma_cmd_resp *rsp)
> +{
> +struct pvrdma_cmd_create_srq *cmd = &req->create_srq;
> +struct pvrdma_cmd_create_srq_resp *resp = &rsp->create_srq_resp;
> +PvrdmaRing *ring = NULL;
> +int rc;
> +
> +memset(resp, 0, sizeof(*resp));
> +
> +rc = create_srq_ring(PCI_DEVICE(dev), &ring, cmd->pdir_dma,
> + cmd->attrs.max_wr, cmd->attrs.max_sge,
> + cmd->nchunks);
> +if (rc) {
> +return rc;
> +}
> +
> +rc = rdma_rm_alloc_srq(&dev->rdma_dev_res, cmd->pd_handle,
> +   cmd->attrs.max_wr, cmd->attrs.max_sge,
> +   cmd->attrs.srq_limit, &resp->srqn, ring);
> +if (rc) {
> +destroy_srq_ring(ring);
> +return rc;
> +}
> +
> +return 0;
> +}
> +
> +static int query_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> + union pvrdma_cmd_resp *rsp)
> +{
> +struct pvrdma_cmd_query_srq *cmd = &req->query_srq;
> +struct pvrdma_cmd_query_srq_resp *resp = &rsp->query_srq_resp;
> +
> +memset(resp, 0, sizeof(*resp));
> +
> +return rdma_rm_query_srq(&dev->rdma_dev_res, cmd->srq_handle,
> + (struct ibv_srq_attr *)&resp->attrs);
> +}
> +
> +static int modify_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +  union pvrdma_cmd_resp *rsp)
> +{
> +struct pvrdma_cmd_modify_srq *cmd = &req->modify_srq;
> +
> +/* Only support SRQ limit */
> +if (!(cmd->attr_mask & IBV_SRQ_LIMIT) ||
> +(cmd->attr_mask & IBV_SRQ_MAX_WR))
> +return -EINVAL;
> +
> +return rdma_rm_modify_srq(&dev->rdma_dev_res, cmd->srq_handle,
> +  (struct ibv_srq_attr *)&cmd->attrs,
> +  cmd->attr_mask);
> +}
> +
> +static int destroy_srq(PVRDMADev *dev, union pvrdma_cmd_req *req,
> +   union pvrdma_cmd_resp *rsp)
> +{
> +struct pvrdma_cmd_destroy_srq *cmd = &req->destroy_srq;
> +RdmaRmSRQ *srq;
> +PvrdmaRing *ring;
> +
> +srq = rdma_rm_get_srq(&dev->rdma_dev_res, 

Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Paolo Bonzini
On 27/03/19 17:05, Daniel P. Berrangé wrote:
> On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
>> On 27/03/19 16:30, Daniel P. Berrangé wrote:
>>> Perhaps the VM test scripts should do a "HEAD" request for the image
>>> every time to discover if it has been changed on the server, before
>>> honouring the local cache.
>>
>> Another possibility is to first download the shasum from
>> download.patchew.org, and compare _that_ against the one that is stored
>> locally, instead of hardcoding it in QEMU's repository.
> 
> Personally I prefer the idea of having the shasum stored in the repo.
> 
> That means that if we update git master to point to a newer image,
> previous stable branches will stick with their original image, rather
> than using a new image that may be incompatible with the stable branch
> 
> Storing hash in git also means that if someone compromised the patchew
> server, they can't cause developer to run compromised images, without
> first also compromising git to change the hash.

The two are not mutually exclusive.  We can warn if the hash doesn't
match against the one in QEMU, add a --force option, or whatever.

Also, I have now created symlinks by hash at
http://download.patchew.org/by-sha256sum in case someone finds them useful.
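
As an aside, the comparison itself is tiny. A rough sketch with GLib,
purely to illustrate checking a cached image against a hash kept in the
repository (the image name and expected value below are placeholders; the
real test scripts are Python):

#include <glib.h>
#include <stdio.h>

int main(void)
{
    const char *image = "fedora-29.img";                /* cached download */
    const char *expected = "sha256-recorded-in-the-repo-goes-here";
    gchar *data = NULL;
    gsize len = 0;
    gchar *sum;
    gboolean ok;

    if (!g_file_get_contents(image, &data, &len, NULL)) {
        fprintf(stderr, "cannot read %s\n", image);
        return 1;
    }
    sum = g_compute_checksum_for_data(G_CHECKSUM_SHA256,
                                      (const guchar *)data, len);
    ok = g_str_equal(sum, expected);
    printf("%s: %s\n", image, ok ? "hash matches" : "hash mismatch");
    g_free(sum);
    g_free(data);
    return ok ? 0 : 1;
}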

Paolo




Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Daniel P . Berrangé
On Wed, Mar 27, 2019 at 04:58:23PM +0100, Paolo Bonzini wrote:
> On 27/03/19 16:30, Daniel P. Berrangé wrote:
> > Perhaps the VM test scripts should do a "HEAD" request for the image
> > every time to discover if it has been changed on the server, before
> > honouring the local cache.
> 
> Another possibility is to first download the shasum from
> download.patchew.org, and compare _that_ against the one that is stored
> locally, instead of hardcoding it in QEMU's repository.

Personally I prefer the idea of having the shasum stored in the repo.

That means that if we update git master to point to a newer image,
previous stable branches will stick with their original image, rather
than using a new image that may be incompatible with the stable branch.

Storing the hash in git also means that if someone compromised the patchew
server, they can't cause developers to run compromised images without
first also compromising git to change the hash.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-devel] [PATCH v2 2/4] hw/rdma: Add support for managing SRQ resource

2019-03-27 Thread Yuval Shaia
On Tue, Mar 26, 2019 at 02:54:31PM +0200, Kamal Heib wrote:
> Adding the required functions and definitions for support managing the
> shared receive queues (SRQs).
> 
> Signed-off-by: Kamal Heib 
> ---
>  hw/rdma/rdma_rm.c  | 83 ++
>  hw/rdma/rdma_rm.h  | 10 +
>  hw/rdma/rdma_rm_defs.h |  8 
>  3 files changed, 101 insertions(+)
> 
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index bac3b2f4a6c3..bc5873cb4c14 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -542,6 +542,86 @@ void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, 
> uint32_t qp_handle)
>  rdma_res_tbl_dealloc(&dev_res->qp_tbl, qp->qpn);
>  }
>  
> +RdmaRmSRQ *rdma_rm_get_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle)
> +{
> +return rdma_res_tbl_get(&dev_res->srq_tbl, srq_handle);
> +}
> +
> +int rdma_rm_alloc_srq(RdmaDeviceResources *dev_res, uint32_t pd_handle,
> +  uint32_t max_wr, uint32_t max_sge, uint32_t srq_limit,
> +  uint32_t *srq_handle, void *opaque)
> +{
> +RdmaRmSRQ *srq;
> +RdmaRmPD *pd;
> +int rc;
> +
> +pd = rdma_rm_get_pd(dev_res, pd_handle);
> +if (!pd) {
> +return -EINVAL;
> +}
> +
> +srq = rdma_res_tbl_alloc(&dev_res->srq_tbl, srq_handle);
> +if (!srq) {
> +return -ENOMEM;
> +}
> +
> +rc = rdma_backend_create_srq(&srq->backend_srq, &pd->backend_pd,
> + max_wr, max_sge, srq_limit);
> +if (rc) {
> +rc = -EIO;
> +goto out_dealloc_srq;
> +}
> +
> +srq->opaque = opaque;
> +
> +return 0;
> +
> +out_dealloc_srq:
> +rdma_res_tbl_dealloc(&dev_res->srq_tbl, *srq_handle);
> +
> +return rc;
> +}
> +
> +int rdma_rm_query_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
> +  struct ibv_srq_attr *srq_attr)
> +{
> +RdmaRmSRQ *srq;
> +
> +srq = rdma_rm_get_srq(dev_res, srq_handle);
> +if (!srq) {
> +return -EINVAL;
> +}
> +
> +return rdma_backend_query_srq(&srq->backend_srq, srq_attr);
> +}
> +
> +int rdma_rm_modify_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
> +   struct ibv_srq_attr *srq_attr, int srq_attr_mask)
> +{
> +RdmaRmSRQ *srq;
> +
> +srq = rdma_rm_get_srq(dev_res, srq_handle);
> +if (!srq) {
> +return -EINVAL;
> +}
> +
> +return rdma_backend_modify_srq(&srq->backend_srq, srq_attr,
> +   srq_attr_mask);

Such a blind pass-through? Don't you want to make sure that, for example,
max_sge is valid? I mean, just for the sake of being fair to the caller?
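
Something like the following is what I have in mind (only a sketch; the
field names are those of struct ibv_srq_attr, and which limits to enforce
is up to you):

    if (srq_attr_mask & ~(IBV_SRQ_MAX_WR | IBV_SRQ_LIMIT)) {
        return -EINVAL;    /* unsupported attribute requested */
    }
    if ((srq_attr_mask & IBV_SRQ_MAX_WR) && !srq_attr->max_wr) {
        return -EINVAL;    /* a zero-sized queue makes no sense */
    }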

> +}
> +
> +void rdma_rm_dealloc_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle)
> +{
> +RdmaRmSRQ *srq;
> +
> +srq = rdma_rm_get_srq(dev_res, srq_handle);
> +if (!srq) {
> +return;
> +}
> +
> +rdma_backend_destroy_srq(&srq->backend_srq, dev_res);
> +rdma_res_tbl_dealloc(&dev_res->srq_tbl, srq_handle);
> +}
> +
>  void *rdma_rm_get_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t cqe_ctx_id)
>  {
>  void **cqe_ctx;
> @@ -671,6 +751,8 @@ int rdma_rm_init(RdmaDeviceResources *dev_res, struct 
> ibv_device_attr *dev_attr)
>  res_tbl_init("CQE_CTX", _res->cqe_ctx_tbl, dev_attr->max_qp *
> dev_attr->max_qp_wr, sizeof(void *));
>  res_tbl_init("UC", _res->uc_tbl, MAX_UCS, sizeof(RdmaRmUC));
> +res_tbl_init("SRQ", _res->srq_tbl, dev_attr->max_srq,
> + sizeof(RdmaRmSRQ));
>  
>  init_ports(dev_res);
>  
> @@ -689,6 +771,7 @@ void rdma_rm_fini(RdmaDeviceResources *dev_res, 
> RdmaBackendDev *backend_dev,
>  
>  fini_ports(dev_res, backend_dev, ifname);
>  
> +res_tbl_free(&dev_res->srq_tbl);
>  res_tbl_free(&dev_res->uc_tbl);
>  res_tbl_free(&dev_res->cqe_ctx_tbl);
>  res_tbl_free(&dev_res->qp_tbl);
> diff --git a/hw/rdma/rdma_rm.h b/hw/rdma/rdma_rm.h
> index 4f03f9b8c5f1..e88ab95e264b 100644
> --- a/hw/rdma/rdma_rm.h
> +++ b/hw/rdma/rdma_rm.h
> @@ -65,6 +65,16 @@ int rdma_rm_query_qp(RdmaDeviceResources *dev_res, 
> RdmaBackendDev *backend_dev,
>   int attr_mask, struct ibv_qp_init_attr *init_attr);
>  void rdma_rm_dealloc_qp(RdmaDeviceResources *dev_res, uint32_t qp_handle);
>  
> +RdmaRmSRQ *rdma_rm_get_srq(RdmaDeviceResources *dev_res, uint32_t 
> srq_handle);
> +int rdma_rm_alloc_srq(RdmaDeviceResources *dev_res, uint32_t pd_handle,
> +  uint32_t max_wr, uint32_t max_sge, uint32_t srq_limit,
> +  uint32_t *srq_handle, void *opaque);
> +int rdma_rm_query_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
> +  struct ibv_srq_attr *srq_attr);
> +int rdma_rm_modify_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle,
> +   struct ibv_srq_attr *srq_attr, int srq_attr_mask);
> +void rdma_rm_dealloc_srq(RdmaDeviceResources *dev_res, uint32_t srq_handle);
> +
>  int rdma_rm_alloc_cqe_ctx(RdmaDeviceResources *dev_res, uint32_t *cqe_ctx_id,

Re: [Qemu-devel] [PATCH for-4.1 v3 00/12] bundle edk2 platform firmware with QEMU

2019-03-27 Thread Paolo Bonzini
On 27/03/19 16:30, Daniel P. Berrangé wrote:
> Perhaps the VM test scripts should do a "HEAD" request for the image
> every time to discover if it has been changed on the server, before
> honouring the local cache.

Another possibility is to first download the shasum from
download.patchew.org, and compare _that_ against the one that is stored
locally, instead of hardcoding it in QEMU's repository.

Paolo



Re: [Qemu-devel] [PATCH v2 3/4] hw/rdma: Modify create/destroy QP to support SRQ

2019-03-27 Thread Yuval Shaia
On Tue, Mar 26, 2019 at 02:54:32PM +0200, Kamal Heib wrote:
> Modify create/destroy QP to support shared receive queue.
> 
> Signed-off-by: Kamal Heib 
> ---
>  hw/rdma/rdma_backend.c   |  9 --
>  hw/rdma/rdma_backend.h   |  6 ++--
>  hw/rdma/rdma_rm.c| 23 +--
>  hw/rdma/rdma_rm.h|  3 +-
>  hw/rdma/rdma_rm_defs.h   |  1 +
>  hw/rdma/vmw/pvrdma_cmd.c | 61 ++--
>  6 files changed, 72 insertions(+), 31 deletions(-)
> 
> diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
> index 54419c8c58dd..8f1349c64dda 100644
> --- a/hw/rdma/rdma_backend.c
> +++ b/hw/rdma/rdma_backend.c
> @@ -794,9 +794,9 @@ void rdma_backend_destroy_cq(RdmaBackendCQ *cq)
>  
>  int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> RdmaBackendPD *pd, RdmaBackendCQ *scq,
> -   RdmaBackendCQ *rcq, uint32_t max_send_wr,
> -   uint32_t max_recv_wr, uint32_t max_send_sge,
> -   uint32_t max_recv_sge)
> +   RdmaBackendCQ *rcq, RdmaBackendSRQ *srq,
> +   uint32_t max_send_wr, uint32_t max_recv_wr,
> +   uint32_t max_send_sge, uint32_t max_recv_sge)
>  {
>  struct ibv_qp_init_attr attr = {};
>  
> @@ -824,6 +824,9 @@ int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t 
> qp_type,
>  attr.cap.max_recv_wr = max_recv_wr;
>  attr.cap.max_send_sge = max_send_sge;
>  attr.cap.max_recv_sge = max_recv_sge;
> +if (srq) {
> +attr.srq = srq->ibsrq;
> +}
>  
>  qp->ibqp = ibv_create_qp(pd->ibpd, &attr);
>  if (!qp->ibqp) {
> diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
> index cad7956d98e8..7c1a19a2b5ff 100644
> --- a/hw/rdma/rdma_backend.h
> +++ b/hw/rdma/rdma_backend.h
> @@ -89,9 +89,9 @@ void rdma_backend_poll_cq(RdmaDeviceResources 
> *rdma_dev_res, RdmaBackendCQ *cq);
>  
>  int rdma_backend_create_qp(RdmaBackendQP *qp, uint8_t qp_type,
> RdmaBackendPD *pd, RdmaBackendCQ *scq,
> -   RdmaBackendCQ *rcq, uint32_t max_send_wr,
> -   uint32_t max_recv_wr, uint32_t max_send_sge,
> -   uint32_t max_recv_sge);
> +   RdmaBackendCQ *rcq, RdmaBackendSRQ *srq,
> +   uint32_t max_send_wr, uint32_t max_recv_wr,
> +   uint32_t max_send_sge, uint32_t max_recv_sge);
>  int rdma_backend_qp_state_init(RdmaBackendDev *backend_dev, RdmaBackendQP 
> *qp,
> uint8_t qp_type, uint32_t qkey);
>  int rdma_backend_qp_state_rtr(RdmaBackendDev *backend_dev, RdmaBackendQP *qp,
> diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
> index bc5873cb4c14..90870ee0e15e 100644
> --- a/hw/rdma/rdma_rm.c
> +++ b/hw/rdma/rdma_rm.c
> @@ -384,12 +384,14 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, 
> uint32_t pd_handle,
>   uint8_t qp_type, uint32_t max_send_wr,
>   uint32_t max_send_sge, uint32_t send_cq_handle,
>   uint32_t max_recv_wr, uint32_t max_recv_sge,
> - uint32_t recv_cq_handle, void *opaque, uint32_t *qpn)
> + uint32_t recv_cq_handle, void *opaque, uint32_t *qpn,
> + uint8_t is_srq, uint32_t srq_handle)
>  {
>  int rc;
>  RdmaRmQP *qp;
>  RdmaRmCQ *scq, *rcq;
>  RdmaRmPD *pd;
> +RdmaRmSRQ *srq = NULL;
>  uint32_t rm_qpn;
>  
>  pd = rdma_rm_get_pd(dev_res, pd_handle);
> @@ -406,6 +408,14 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, 
> uint32_t pd_handle,
>  return -EINVAL;
>  }
>  
> +if (is_srq) {
> +srq = rdma_rm_get_srq(dev_res, srq_handle);
> +if (!srq) {
> +rdma_error_report("Invalid srqn %d", srq_handle);
> +return -EINVAL;
> +}
> +}
> +

[1]

>  if (qp_type == IBV_QPT_GSI) {
>  scq->notify = CNT_SET;
>  rcq->notify = CNT_SET;
> @@ -422,10 +432,17 @@ int rdma_rm_alloc_qp(RdmaDeviceResources *dev_res, 
> uint32_t pd_handle,
>  qp->send_cq_handle = send_cq_handle;
>  qp->recv_cq_handle = recv_cq_handle;
>  qp->opaque = opaque;
> +if (is_srq) {
> +qp->is_srq = is_srq;
> +srq->recv_cq_handle = recv_cq_handle;
> +}

Does it make sense to join this section with [1]?

>  
>  rc = rdma_backend_create_qp(&qp->backend_qp, qp_type, &pd->backend_pd,
> -&scq->backend_cq, &rcq->backend_cq, max_send_wr,
> -max_recv_wr, max_send_sge, max_recv_sge);
> +&scq->backend_cq, &rcq->backend_cq,
> +is_srq ? &srq->backend_srq : NULL,
> +max_send_wr, max_recv_wr, max_send_sge,
> +max_recv_sge);
> +
>  if (rc) {
>  rc = -EIO;
>  

Re: [Qemu-devel] [PATCH] target: mips: Add /* comments to fix checkpatch error

2019-03-27 Thread Aleksandar Markovic
> From: Stefano Garzarella 
> Subject: Re: [Qemu-devel] [PATCH] target: mips: Add /* comments to fix 
> checkpatch error
> 
> On Wed, Mar 27, 2019 at 10:38:59AM +, Jules Irenge wrote:
> > Add /* comments to fix checkpatch warning
> > "WARNING: Block comments use a leading /* on a separate line".
> >
> > Signed-off-by: Jules Irenge 
> > ---
> >  target/mips/cp0_timer.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/target/mips/cp0_timer.c b/target/mips/cp0_timer.c
> > index ffa460be25..d6c9bb8169 100644
> > --- a/target/mips/cp0_timer.c
> > +++ b/target/mips/cp0_timer.c
> > @@ -42,8 +42,8 @@ uint32_t cpu_mips_get_random(CPUMIPSState *env)
> >
> >  /* Don't return same value twice, so get another value */
> >  do {
> > -/* Use a simple algorithm of Linear Congruential Generator
> > - * from ISO/IEC 9899 standard. */
> > +/* Use a simple algorithm of Linear Congruential Generator */
> > +/* from ISO/IEC 9899 standard. */
> 
> Thanks for this patch, but this is a multiline comment. Looking at
> CODING_STYLE, section "7. Comment style", we should use this form:
> /*
>  * like
>  * this
>  */
> 
> Thanks,
> Stefano

Jules, Stefano is right. I would like to add that it would be nice to fix
another comment in the same file - the one that causes these warnings:

WARNING: Block comments use a leading /* on a separate line
#146: FILE: target/mips/cp0_timer.c:146:
+/* ??? This callback should occur when the counter is exactly equal to

WARNING: Block comments use * on subsequent lines
#147: FILE: target/mips/cp0_timer.c:147:
+/* ??? This callback should occur when the counter is exactly equal to
+   the comparator value.  Offset the count by one to avoid immediately

WARNING: Block comments use a trailing */ on a separate line
#148: FILE: target/mips/cp0_timer.c:148:
+   retriggering the callback before any virtual time has passed.  */
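
For reference, in the preferred style that comment would read:

/*
 * ??? This callback should occur when the counter is exactly equal to
 * the comparator value.  Offset the count by one to avoid immediately
 * retriggering the callback before any virtual time has passed.
 */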

Regards,
Aleksandar


Re: [Qemu-devel] [PATCH v3] target/mips: remove a space before open parenthesis to fix checkpatch errors

2019-03-27 Thread Aleksandar Markovic
> From: Jules Irenge 
> Subject: [PATCH v3] target/mips: remove a space before open parenthesis to 
> fix checkpatch errors
> 
> Remove a space before open parenthesis to fix errors reported by 
> checkpatch.pl tool
> "ERROR: space prohibited between function name and open parenthesis"
> within the file "target/mips/cp0_timer.c".
> 
> Signed-off-by: Jules Irenge 
> ---
> v3: corrects log grammar errors and clarifies the scope of the patch
>  target/mips/cp0_timer.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 

Thanks!

Reviewed-by: Aleksandar Markovic 

Jules, you don't need to do anything else regarding this patch,
but, in your future work, please include the complete history in the
commit messages of all v3, v4, v5... versions. This means v3 (as this
patch is) should contain the history of what was done in both v3 and
v2, etc.

Yours,
Aleksandar


> diff --git a/target/mips/cp0_timer.c b/target/mips/cp0_timer.c
> index f4716395df..742f4b2890 100644
> --- a/target/mips/cp0_timer.c
> +++ b/target/mips/cp0_timer.c
> @@ -29,7 +29,7 @@
>  #define TIMER_PERIOD 10 /* 10 ns period for 100 Mhz frequency */
> 
>  /* XXX: do not use a global */
> -uint32_t cpu_mips_get_random (CPUMIPSState *env)
> +uint32_t cpu_mips_get_random(CPUMIPSState *env)
>  {
>  static uint32_t seed = 1;
>  static uint32_t prev_idx = 0;
> @@ -73,7 +73,7 @@ static void cpu_mips_timer_expire(CPUMIPSState *env)
>  qemu_irq_raise(env->irq[(env->CP0_IntCtl >> CP0IntCtl_IPTI) & 0x7]);
>  }
> 
> -uint32_t cpu_mips_get_count (CPUMIPSState *env)
> +uint32_t cpu_mips_get_count(CPUMIPSState *env)
>  {
>  if (env->CP0_Cause & (1 << CP0Ca_DC)) {
>  return env->CP0_Count;
> @@ -91,7 +91,7 @@ uint32_t cpu_mips_get_count (CPUMIPSState *env)
>  }
>  }
> 
> -void cpu_mips_store_count (CPUMIPSState *env, uint32_t count)
> +void cpu_mips_store_count(CPUMIPSState *env, uint32_t count)
>  {
>  /*
>   * This gets called from cpu_state_reset(), potentially before timer 
> init.
> @@ -109,7 +109,7 @@ void cpu_mips_store_count (CPUMIPSState *env, uint32_t 
> count)
>  }
>  }
> 
> -void cpu_mips_store_compare (CPUMIPSState *env, uint32_t value)
> +void cpu_mips_store_compare(CPUMIPSState *env, uint32_t value)
>  {
>  env->CP0_Compare = value;
>  if (!(env->CP0_Cause & (1 << CP0Ca_DC)))
> @@ -131,7 +131,7 @@ void cpu_mips_stop_count(CPUMIPSState *env)
>   TIMER_PERIOD);
>  }
> 
> -static void mips_timer_cb (void *opaque)
> +static void mips_timer_cb(void *opaque)
>  {
>  CPUMIPSState *env;
> 
> @@ -151,7 +151,7 @@ static void mips_timer_cb (void *opaque)
>  env->CP0_Count--;
>  }
> 
> -void cpu_mips_clock_init (MIPSCPU *cpu)
> +void cpu_mips_clock_init(MIPSCPU *cpu)
>  {
>  CPUMIPSState *env = &cpu->env;
> 
> --
> 2.20.1
> 
> 


Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

2019-03-27 Thread Michael S. Tsirkin
On Wed, Mar 27, 2019 at 02:25:00PM +0800, Peter Xu wrote:
> On Tue, Mar 26, 2019 at 04:55:19PM -0600, Alex Williamson wrote:
> > Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > distinguish by devfn & bus between devices in a conventional PCI
> > topology and therefore we cannot assign them separate AddressSpaces.
> > By taking this requester ID aliasing into account, QEMU better matches
> > the bare metal behavior and restrictions, and enables shared
> > AddressSpace configurations that are otherwise not possible with
> > guest IOMMU support.
> > 
> > For the latter case, given any example where an IOMMU group on the
> > host includes multiple devices:
> > 
> >   $ ls  /sys/kernel/iommu_groups/1/devices/
> >   0000:00:01.0  0000:01:00.0  0000:01:00.1
> 
> [1]
> 
> > 
> > If we incorporate a vIOMMU into the VM configuration, we're restricted
> > that we can only assign one of the endpoints to the guest because a
> > second endpoint will attempt to use a different AddressSpace.  VFIO
> > only supports IOMMU group level granularity at the container level,
> > preventing this second endpoint from being assigned:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >   -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > 
> > qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > 0000:01:00.1: group 1 used in multiple address spaces
> > 
> > However, when QEMU incorporates proper aliasing, we can make use of a
> > PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > provides the downstream devices with the same AddressSpace, ex:
> > 
> > qemu-system-x86_64 -machine q35... \
> >   -device intel-iommu,intremap=on \
> >   -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >   -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >   -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > 
> > While the utility of this hack may be limited, this AddressSpace
> > aliasing is the correct behavior for QEMU to emulate bare metal.
> > 
> > Signed-off-by: Alex Williamson 
> 
> The patch looks sane to me even as a bug fix since otherwise the DMA
> address spaces used under misc kinds of PCI bridges can be wrong, so:
> 
> Reviewed-by: Peter Xu 
> 
> Though I have a question that confused me even before: Alex, do you
> know why all the context entry of the devices in the IOMMU root table
> will be programmed even if the devices are under a pcie-to-pci bridge?
> I'm giving an example with above [1] to be clear: in that case IIUC
> we'll program context entries for all the three devices (00:01.0,
> 01:00.0, 01:00.1) but they'll point to the same IOMMU table.  DMAs of
> devices 01:00.0 and 01:00.1 should always be tagged with 01:00.0 on
> bare metal

What makes you think so?

PCI Express spec says:

Requester ID

The combination of a Requester's Bus Number, Device Number, and Function
Number that uniquely identifies the Requester. With an ARI Requester ID, bits
traditionally used for the Device Number field are used instead to expand the
Function Number field, and the Device Number is implied to be 0.
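
To spell out the encoding described above (a standalone illustration; the
bus and function numbers are made up, and PCI_DEVFN mirrors the usual
Linux/QEMU macro):

#include <stdint.h>
#include <stdio.h>

#define PCI_DEVFN(slot, fn)  ((((slot) & 0x1f) << 3) | ((fn) & 0x07))

static uint16_t requester_id(uint8_t bus, uint8_t devfn)
{
    return ((uint16_t)bus << 8) | devfn;
}

int main(void)
{
    /* Conventional encoding: bus 01, device 00, function 1 -> 0x0101 */
    printf("01:00.1      -> 0x%04x\n",
           (unsigned)requester_id(0x01, PCI_DEVFN(0x00, 1)));
    /* With ARI the device number is implied to be 0 and the low eight
     * bits are all function number, e.g. function 33 on bus 01. */
    printf("ARI 01, fn33 -> 0x%04x\n",
           (unsigned)requester_id(0x01, 33));
    return 0;
}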



> and then why we bother to program the context entry of
> 01:00.1?  It seems never used.
> 
> (It should be used for current QEMU to work with pcie-to-pci bridges
>  if without this patch, but I feel like I don't know the real answer
>  behind)
> 
> Thanks,
> 
> -- 
> Peter Xu


