Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state

2018-11-12 Thread Liran Alon



> On 12 Nov 2018, at 18:50, Dr. David Alan Gilbert  wrote:
> 
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>> We have usually followed a rule that new machine types must not
>>>> affect runnability of a VM on a host. IOW new machine types should
>>>> not introduce dependencies on specific kernels, or hardware features
>>>> such as CPU flags.
>>> 
>>>> Anything that requires a new kernel feature thus ought to be an
>>>> opt-in config tunable on the CLI, separate from machine type
>>>> choice.
>>> 
>>> Unless someone tinkered with the module parameters, they could not even
>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>> requires 4.20.
>>> 
>>> For those that did tinker with module parameters, we can grandfather in
>>> the old machine types, so that they can use nested virtualization with
>>> no live migration support.  For those that did not, however, I don't
>>> think it makes sense to say "oh by the way I really want to be able to
>>> migrate this VM" on the command line, or even worse on the monitor.
>> 
>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>> possible to just register a migration blocker if QEMU is launched
>> on a host with kernel < 4.20.
>> 
>> Migration has always been busted historically, so those people using
>> nested VMX already won't be hurt by not having ability to live migrate
>> their VM, but could otherwise continue using them without being forced
>> to upgrade their kernel to fix a feature they're not even using.
> 
> Yes, although I am a bit worried we might have a population of users
> that:
>   a) Have enabled nesting
>   b) Run VMs with vmx enabled
>   c) Don't normally actually run nested guests
>   d) Currently happily migrate.
> 
> Dave

First of all, we should put the entire kvm_nested_state in a VMState
subsection that has a .needed() method which checks
((format == VMX) && (vmx.vmxon_pa != -1ull)).
This will allow a VM that has nested virtualization exposed, but never
actually entered VMX operation, to still migrate to old hosts without
any issues.
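
As an illustration, the .needed() predicate could look roughly like the
sketch below. The struct definitions are mocked-up stand-ins for the
kernel's kvm_nested_state uAPI types (the field names follow the uAPI;
the rest is illustrative, not QEMU's actual VMState code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mocked-up stand-ins for the KVM uAPI types; illustrative only. */
#define KVM_STATE_NESTED_FORMAT_VMX 0

struct kvm_vmx_nested_state {
    uint64_t vmxon_pa;
};

struct kvm_nested_state {
    uint32_t format;
    struct kvm_vmx_nested_state vmx;
};

/*
 * .needed() callback for the proposed subsection: only emit nested
 * state when the guest has actually entered VMX operation, so VMs that
 * merely expose VMX can still migrate to older hosts.
 */
static bool nested_state_needed(const struct kvm_nested_state *s)
{
    return s->format == KVM_STATE_NESTED_FORMAT_VMX &&
           s->vmx.vmxon_pa != -1ull;
}
```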

The only remaining problem discussed here is what happens if the user
runs a modern QEMU (including my to-be-written v3 patches discussed
here) on a host whose kernel lacks KVM_CAP_NESTED_STATE. On such a
kernel, it is impossible for userspace (e.g. QEMU) to know whether a
guest that was exposed with VMX actually used it. So for those cases, I
see no option but to modify QEMU to treat a guest exposed with VMX,
running on a kernel without KVM_CAP_NESTED_STATE, as a migration
blocker.
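
A sketch of that decision logic, with the capability query reduced to a
boolean parameter (standing in for a
kvm_check_extension(s, KVM_CAP_NESTED_STATE) call; illustrative only,
not QEMU's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative decision logic only: without KVM_CAP_NESTED_STATE there
 * is no way to tell whether a VMX-exposed guest ever entered VMX
 * operation, so migration has to be blocked pre-emptively. The second
 * parameter stands in for a kvm_check_extension(s, KVM_CAP_NESTED_STATE)
 * query against the running kernel.
 */
static bool must_block_migration(bool guest_exposed_vmx,
                                 bool kvm_has_nested_state_cap)
{
    return guest_exposed_vmx && !kvm_has_nested_state_cap;
}
```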

-Liran

> 
>> Regards,
>> Daniel
>> -- 
>> |: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__berrange.com=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=rs9ZkUUS37SHrs_oZJ9uIiXtpXvUkJBpfVEe9OSiQzk=
>>   -o-
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.flickr.com_photos_dberrange=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=u_hnAFyY8BVwto0FsomTcZ3dmLKPYb1hwI_jRXI6EZg=
>>  :|
>> |: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__libvirt.org=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=SEIw983ixpJUUySxUwrxtIbnjvHTB9ff3MaqULaulQw=
>>  -o-
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__fstop138.berrange.com=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=yvxjeOrwjXjf08RBhPdX53lJN1W-8WSXT25ZeMSA06k=
>>  :|
>> |: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__entangle-2Dphoto.org=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=9TIjtmf6AVFYWbyzI5vl-zXTaNCSCMAxyck92pc8yvY=
>> -o-
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.instagram.com_dberrange=DwIDAw=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=ximE6x4FrQ5vo3D4wF3w-cVsKgtNs85wCTL5GiP_B5Q=Wapdtm0yT4j-9EjMoxwo9QvRZ3h9Fk_CvpQq8J4TDpg=
>>  :|
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state

2018-11-12 Thread Liran Alon



> On 12 Nov 2018, at 18:54, Daniel P. Berrangé  wrote:
> 
> On Mon, Nov 12, 2018 at 04:50:54PM +, Dr. David Alan Gilbert wrote:
>> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>>> We have usually followed a rule that new machine types must not
>>>>> affect runnability of a VM on a host. IOW new machine types should
>>>>> not introduce dependencies on specific kernels, or hardware features
>>>>> such as CPU flags.
>>>> 
>>>>> Anything that requires a new kernel feature thus ought to be an
>>>>> opt-in config tunable on the CLI, separate from machine type
>>>>> choice.
>>>> 
>>>> Unless someone tinkered with the module parameters, they could not even
>>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>>> requires 4.20.
>>>> 
>>>> For those that did tinker with module parameters, we can grandfather in
>>>> the old machine types, so that they can use nested virtualization with
>>>> no live migration support.  For those that did not, however, I don't
>>>> think it makes sense to say "oh by the way I really want to be able to
>>>> migrate this VM" on the command line, or even worse on the monitor.
>>> 
>>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>>> possible to just register a migration blocker if QEMU is launched
>>> on a host with kernel < 4.20.
>>> 
>>> Migration has always been busted historically, so those people using
>>> nested VMX already won't be hurt by not having ability to live migrate
>>> their VM, but could otherwise continue using them without being forced
>>> to upgrade their kernel to fix a feature they're not even using.
>> 
>> Yes, although I am a bit worried we might have a population of users
>> that:
>>   a) Have enabled nesting
>>   b) Run VMs with vmx enabled
> 
> 
>>   c) Don't normally actually run nested guests
>>   d) Currently happily migrate.
> 
> True, and (b) would include anyone using libvirt's  host-model CPU. So if
> you enabled nesting, have host-model for all guests, but only use nesting
> in one of the guests, you'd be doomed.
> 
> Is it possible for QEMU to determine if there are nested guests running or
> not and conditionally block migration appropriately to ensure safety ?


Only if kernel supports KVM_CAP_NESTED_STATE.
See my reply to Dave in this thread.

-Liran

> 
> 
> Regards,
> Daniel
> -- 
> |: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__berrange.com=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=DIzWfmRGWO1b6hzL9NRbIt41fiFcnPt0MC8917u4Qv0=
>   -o-
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.flickr.com_photos_dberrange=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=CjA-joyt2Y9t5B4YzIiupfY8EEO58m4vbmnd45adzFI=
>  :|
> |: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__libvirt.org=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=tD05tikOHMJhh_EeZ2Esoxb0oku3MPFmj-S2YHdUGm0=
>  -o-
> https://urldefense.proofpoint.com/v2/url?u=https-3A__fstop138.berrange.com=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=YAh1WAoXQKEB6hkMmG6ZnJQETOFnq6eqQLmJokME80A=
>  :|
> |: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__entangle-2Dphoto.org=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=90Mm1Qb-SHe8P63xwGp6gzMU1I5DEW6YX0ttG6TL_7g=
> -o-
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.instagram.com_dberrange=DwIDaQ=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0=eMOrT-7t7-tfRtTw2da9c1YTU0_tOFfkVIhj9mWv-Pc=l4NrrDdRzPClvYQdxQdfIW0geHPWcukeyOGX8QapwYA=
>  :|




Re: [Qemu-devel] How to emulate block I/O timeout on qemu side?

2018-11-12 Thread Dongli Zhang



On 11/13/2018 06:52 AM, Marc Olson via Qemu-devel wrote:
> On 11/11/18 11:36 PM, Dongli Zhang wrote:
>> On 11/12/2018 03:13 PM, Marc Olson via Qemu-devel wrote:
>>>> On 11/3/18 10:24 AM, Dongli Zhang wrote:
>>>> The 'write' latency of sector=40960 is set to a very large value. When
>>>> the I/O is stalled in the guest because sector=40960 is accessed, I do
>>>> see the below messages in the guest log:
>>>>
>>>> [   80.807755] nvme nvme0: I/O 11 QID 2 timeout, aborting
>>>> [   80.808095] nvme nvme0: Abort status: 0x4001
>>>>
>>>> However, then nothing happens further. nvme I/O hangs in the guest. I am
>>>> not able to kill the qemu process with Ctrl+C. Both vnc and qemu user
>>>> net do not work. I need to kill qemu with "kill -9".
>>>>
>>>> The same result for virtio-scsi: qemu is stuck as well.
>>> While I didn't try virtio-scsi, I wasn't able to reproduce this
>>> behavior using nvme on Ubuntu 18.04 (4.15). What image and kernel
>>> version are you trying against?
>> Would you like to reproduce the "aborting" message or the qemu hang?
> I could not reproduce IO hanging in the guest, but I can reproduce qemu
> hanging.
>> guest image: ubuntu 16.04
>> guest kernel: mainline linux kernel (and default kernel in ubuntu 16.04)
>> qemu: qemu-3.0.0 (with the blkdebug delay patch)
>>
>> Would you be able to see the nvme abort message (aborting is indeed not
>> supported by qemu) in the guest kernel?
> Yes.
>> Once I see that message, I would not be able to kill the
>> qemu-system-x86_64 command line with Ctrl+C.
> 
> I missed this part. I wasn't expecting to handle very long timeouts,
> but what appears to be happening is that the sleep doesn't get
> interrupted on shutdown. I suspect something like this, on top of the
> series I sent last night, should help:
> 
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 6b1f2d6..0bfb91b 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -557,8 +557,11 @@ static int rule_check(BlockDriverState *bs, uint64_t offset, uint64_t bytes)
>          remove_active_rule(s, delay_rule);
>      }
> 
> -    if (latency != 0) {
> -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, latency);
> +    while (latency > 0 && !aio_external_disabled(bdrv_get_aio_context(bs))) {
> +        int64_t cur_latency = MIN(latency, 10ULL);
> +
> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, cur_latency);
> +        latency -= cur_latency;
>      }
>  }
> 
> 
> /marc
> 
> 

With the above patch, qemu periodically wakes up and sleeps again, and I
am now able to interrupt it.
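
The shape of that fix (one long sleep split into bounded chunks, with a
stop condition re-checked between chunks) can be sketched in a
self-contained form. The 10 ms chunk size, the stop callback, and the
helper names below are illustrative stand-ins, not QEMU's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_NS (10 * 1000 * 1000LL)        /* 10 ms per chunk; illustrative */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int64_t total_slept;                  /* test instrumentation */

static void fake_sleep(int64_t ns)           /* stands in for qemu_co_sleep_ns() */
{
    total_slept += ns;
}

static bool never_stop(void)                 /* stands in for the shutdown check */
{
    return false;
}

/*
 * Sleep for 'latency' nanoseconds in bounded chunks, re-evaluating an
 * external stop condition between chunks so shutdown can interrupt the
 * delay. Returns the number of chunks slept.
 */
static int chunked_sleep(int64_t latency, bool (*stop)(void),
                         void (*sleep_ns)(int64_t))
{
    int chunks = 0;

    while (latency > 0 && !stop()) {
        int64_t cur = MIN(latency, CHUNK_NS);

        sleep_ns(cur);
        latency -= cur;
        chunks++;
    }
    return chunks;
}
```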

Dongli Zhang



Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Yu Zhang
On Tue, Nov 13, 2018 at 11:37:07AM +0800, Peter Xu wrote:
> On Mon, Nov 12, 2018 at 05:42:01PM +0800, Yu Zhang wrote:
> > On Mon, Nov 12, 2018 at 04:36:34PM +0800, Peter Xu wrote:
> > > On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > E.g. guest applications like DPDK prefer to use its VA as IOVA when
> > > > performing VFIO map/unmap operations, to avoid the burden of managing
> > > > the IOVA space.
> > > 
> > > Since you mentioned about DPDK... I'm just curious that whether have
> > > you tested the patchset with the 57bit-enabled machines with DPDK VA
> > > mode running in the guest? That would be something nice to mention in
> > > the cover letter if you have.
> > > 
> > 
> > Hah. Maybe I shall not mention DPDK here. 
> > 
> > The story is that we heard the requirement, saying applications like DPDK
> > would need 5-level paging in IOMMU side. And I was convinced after checked
> > DPDK code, seeing it may use VA as IOVA directly. But I did not test this
> > patch with DPDK.
> > 
> > Instead, I used kvm-unit-test to verify this patch series. And of course, I
> > also did some modification to the test case. Patch for the test also sent
> > out at https://www.spinics.net/lists/kvm/msg177425.html.
> 
> Yeah that's perfectly fine for me.  So instead maybe you can also
> mention the kvm-unit-test in the cover letter if you gonna repost.

Got it. Thanks!

> 
> > 
> > > [...]
> > > 
> > > > @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >          }
> > > >      }
> > > >  
> > > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > +        !(host_has_la57() && guest_has_la57())) {
> > > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > +                   "host and guest are capable of 5-level paging.\n");
> > > 
> > > Is there any context (or pointer to previous discussions would work
> > > too) on explaining why we don't support some scenarios like
> > > host_paw=48,guest_paw=48,guest_gaw=57?
> > > 
> > 
> > Well, above check is only to make sure both the host and the guest can
> > use 57bit linear address, which requires 5-level paging. So I believe
> > we do support scenarios like host_paw=48,guest_paw=48,guest_gaw=57.
> > The guest_has_la57() means the guest can use 57-bit linear address,
> > regardless of its physical address width.
> 
> Sorry for my incorrect wording.  I mean when host/guest CPU only
> support 4-level LA then would/should we allow the guest IOMMU to
> support 5-level IOVA?  Asked since I'm thinking whether I can run the
> series a bit with my laptop/servers.

Well, by "only support", I guess you mean the hardware capability, instead
of its paging mode. So I do not think hardware will support 5-level IOVA for
platforms without 5-level VA. Therefore a 5-level vIOMMU is disallowed here. :)

> 
> Since at it, another thing I thought about is making sure the IOMMU
> capabilities will match between host and guest IOMMU, which I think
> this series has ignored so far.  E.g., when we're having assigned
> devices in the guest and with 5-level IOVA, we should make sure the
> host IOMMU supports 5-level as well before the guest starts since
> otherwise the shadow page synchronization could potentially fail when
> the requested IOVA address goes beyond 4-level.  One simple solution
> is just to disable device assignment for now when we're with 57bits
> vIOMMU but I'm not sure whether that's what you want, especially you
> mentioned the DPDK case (who may use assigned devices).
> 

Thanks, Peter. Replied in the following up mail. :)

> (sorry to have mentioned the dpdk case again :)
> 
> Regards,
> 
> -- 
> Peter Xu
> 

B.R.
Yu



Re: [Qemu-devel] [PATCH 0/2] linux-user/mips: Support the n32 ABI for the R5900

2018-11-12 Thread Maciej W. Rozycki
On Fri, 9 Nov 2018, Maciej W. Rozycki wrote:

> > Some readelf results:
> > 
> > mips64el/stretch
> > 
> >   Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
> >   Class:   ELF64
> >   Flags:   0x8007, noreorder, pic, cpic, mips64r2
> 
>  Hmm, that's weird -- what executable did you check?  There may be some 
> that are n64, or maybe they've switched (which I would applaud, FWIW).  I 
> remember seeing mostly n32, with minimal support for n64, but that was a 
> while ago -- jessie or suchlike, I believe.  Using MIPS64r2 as the base 
> ISA also looks new to me, that used to be plain MIPS III, and some of 
> Debian's MIPS build systems used to be either MIPS III (Lemote Loongson) 
> or MIPS64r1 (Broadcom SiByte).

 OK, I definitely got this confused.  I did some checking and jessie 
didn't even have a 64-bit MIPS port.  I got their build systems right 
though, and the kernel is 64-bit for systems that support it.

> > Any binaries that need qemu-mipsn32 or qemu-mipsn32el?
> 
>  I'd expect at least the n32 dynamic loader (along with libc and some 
> other essential DSOs) to be present with MIPS64 Debian.  Traditionally, 
> under the FHS rules, it would be installed as /lib32/ld.so.1 (with the o32 
> one as /lib/ld.so.1 and the n64 as /lib64/ld.so.1), but Debian uses their 
> own multiarch filesystem layout standard, and offhand I don't remember 
> what the paths are defined to there.

 So with jessie you can install the `libc6-dev-mipsn32' package, which 
will get you n32 glibc development libraries and will pull the 
complementing n32 dynamic loader (at /lib32/ld.so.1 actually) and n32 
glibc shared libraries as well.

 Unfortunately multilib support files, such as the CRT files, seem to be 
missing from GCC for n32 or I cannot find them.  Otherwise you would be 
able to compile and link n32 binaries just by calling `gcc -mabi=n32'.  
Still the dynamic loader is directly runnable, as I noted above.

 HTH,

  Maciej



Re: [Qemu-devel] [PATCH] 9p: write lock path in v9fs_co_open2()

2018-11-12 Thread zhibin hu
Sorry, I have not had time to make a PoC recently.

IMHO, the implementation of v9fs_path_copy is not secure: it first frees
the original value and then copies the new value, so there is a race
window in between.

So each caller must ensure proper synchronization; maybe more locks are
needed.

thanks.


On Mon, Nov 12, 2018 at 10:39 PM Greg Kurz  wrote:

> On Mon, 12 Nov 2018 12:19:29 +0100
> Greg Kurz  wrote:
>
> > On Mon, 12 Nov 2018 19:05:59 +0800
> > zhibin hu  wrote:
> >
> > > yes, and this :
> > >
> >
> > Yeah, all call sites of v9fs_path_copy() in v9fs_create() are called
> > in the context of the main thread. They may race with any other
> > access to the fid path performed by some other command in the context
> > of a worker thread. My first guess is that v9fs_create() should take
> > the write lock before writing to the fid path.
> >
>
> I think this call to v9fs_path_copy() in v9fs_walk() can also race:
>
>     if (fid == newfid) {
>         if (fidp->fid_type != P9_FID_NONE) {
>             err = -EINVAL;
>             goto out;
>         }
>         v9fs_path_copy(&fidp->path, &path);
>     } else {
>
>
> Worse, since v9fs_co_open2() may overwrite the fid path from a worker
> thread, it seems that some more code might require to run with the read
> lock taken...
>
> > BTW, if you could share all the reproducers you already have for these
> > heap-use-after-free issues, it would be appreciated, and probably speed
> > up the fixing.
> >
> > > ==6094==ERROR: AddressSanitizer: heap-use-after-free on address
> > > 0x602e6751 at pc 0x562a8dc492b8 bp 0x7f6805d2fa10 sp 0x7f6805d2fa00
> > > READ of size 1 at 0x602e6751 thread T21
> > > #0 0x562a8dc492b7 in local_open_nofollow hw/9pfs/9p-local.c:59
> > > #1 0x562a8dc49361 in local_opendir_nofollow hw/9pfs/9p-local.c:92
> > > #2 0x562a8dc4bd6e in local_mknod hw/9pfs/9p-local.c:662
> > > #3 0x562a8dc521de in v9fs_co_mknod hw/9pfs/cofs.c:200
> > > #4 0x562a8dc4413e in v9fs_mknod hw/9pfs/9p.c:3044
> > > #5 0x562a8e600976 in coroutine_trampoline util/coroutine-ucontext.c:116
> > > #6 0x7f68713635ff in __correctly_grouped_prefixwc (/lib64/libc.so.6+0x4c5ff)
> > >
> > > 0x602e6751 is located 1 bytes inside of 2-byte region
> > > [0x602e6750,0x602e6752)
> > > freed by thread T0 here:
> > > #0 0x7f687cdb9880 in __interceptor_free (/lib64/libasan.so.5+0xee880)
> > > #1 0x7f687c1494d1 in g_free (/lib64/libglib-2.0.so.0+0x524d1)
> > > #2 0x562a8dc30ce4 in v9fs_path_copy hw/9pfs/9p.c:195
> > > #3 0x562a8dc3f0f3 in v9fs_create hw/9pfs/9p.c:2286
> > > #4 0x562a8e600976 in coroutine_trampoline util/coroutine-ucontext.c:116
> > > #5 0x7f68713635ff in __correctly_grouped_prefixwc (/lib64/libc.so.6+0x4c5ff)
> > >
> > > previously allocated by thread T5 here:
> > > #0 0x7f687cdb9c48 in malloc (/lib64/libasan.so.5+0xeec48)
> > > #1 0x7f6871421c37 in __GI___vasprintf_chk (/lib64/libc.so.6+0x10ac37)
> > >
> > > Thread T21 created by T0 here:
> > > #0 0x7f687cd16443 in pthread_create (/lib64/libasan.so.5+0x4b443)
> > > #1 0x562a8e5bf61e in qemu_thread_create util/qemu-thread-posix.c:534
> > > #2 0x562a8e5adbe4 in do_spawn_thread util/thread-pool.c:135
> > > #3 0x562a8e5adc73 in spawn_thread_bh_fn util/thread-pool.c:143
> > > #4 0x562a8e5aa4d0 in aio_bh_call util/async.c:90
> > > #5 0x562a8e5aa787 in aio_bh_poll util/async.c:118
> > > #6 0x562a8e5b65bd in aio_dispatch util/aio-posix.c:436
> > > #7 0x562a8e5ab26f in aio_ctx_dispatch util/async.c:261
> > > #8 0x7f687c1438ac in g_main_context_dispatch (/lib64/libglib-2.0.so.0+0x4c8ac)
> > >
> > > Thread T5 created by T0 here:
> > > #0 0x7f687cd16443 in pthread_create (/lib64/libasan.so.5+0x4b443)
> > > #1 0x562a8e5bf61e in qemu_thread_create util/qemu-thread-posix.c:534
> > > #2 0x562a8d7a2258 in qemu_kvm_start_vcpu /root/qemu-3.0.0/cpus.c:1935
> > > #3 0x562a8d7a2a0b in qemu_init_vcpu /root/qemu-3.0.0/cpus.c:2001
> > > #4 0x562a8da3ef0c in x86_cpu_realizefn /root/qemu-3.0.0/target/i386/cpu.c:4996
> > > #5 0x562a8dd3a1e8 in device_set_realized hw/core/qdev.c:826
> > > #6 0x562a8e2a1e62 in property_set_bool qom/object.c:1984
> > > #7 0x562a8e29db0e in object_property_set qom/object.c:1176
> > > #8 0x562a8e2a4b00 in object_property_set_qobject qom/qom-qobject.c:27
> > > #9 0x562a8e29de2b in object_property_set_bool qom/object.c:1242
> > > #10 0x562a8d9ad452 in pc_new_cpu /root/qemu-3.0.0/hw/i386/pc.c:1107
> > > #11 0x562a8d9ad993 in pc_cpus_init /root/qemu-3.0.0/hw/i386/pc.c:1155
> > > #12 0x562a8d9b79cb in pc_init1 /root/qemu-3.0.0/hw/i386/pc_piix.c:153
> > > #13 0x562a8d9b981d in pc_init_v3_0 /root/qemu-3.0.0/hw/i386/pc_piix.c:438
> > > #14 0x562a8dd4bf29 in machine_run_board_init hw/core/machine.c:830
> > > #15 0x562a8db8d46e in main /root/qemu-3.0.0/vl.c:4516
> > > #16 0x7f687133a11a in __libc_start_main (/lib64/libc.so.6+0x2311a)
> > 

Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Peter Xu
On Mon, Nov 12, 2018 at 05:42:01PM +0800, Yu Zhang wrote:
> On Mon, Nov 12, 2018 at 04:36:34PM +0800, Peter Xu wrote:
> > On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > E.g. guest applications like DPDK prefer to use its VA as IOVA when
> > > performing VFIO map/unmap operations, to avoid the burden of managing the
> > > IOVA space.
> > 
> > Since you mentioned about DPDK... I'm just curious that whether have
> > you tested the patchset with the 57bit-enabled machines with DPDK VA
> > mode running in the guest? That would be something nice to mention in
> > the cover letter if you have.
> > 
> 
> Hah. Maybe I shall not mention DPDK here. 
> 
> The story is that we heard the requirement, saying applications like DPDK
> would need 5-level paging in IOMMU side. And I was convinced after checked
> DPDK code, seeing it may use VA as IOVA directly. But I did not test this
> patch with DPDK.
> 
> Instead, I used kvm-unit-test to verify this patch series. And of course, I
> also did some modification to the test case. Patch for the test also sent out
> at https://www.spinics.net/lists/kvm/msg177425.html.

Yeah that's perfectly fine for me.  So instead maybe you can also
mention the kvm-unit-test in the cover letter if you gonna repost.

> 
> > [...]
> > 
> > > @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > >          }
> > >      }
> > >  
> > > -    /* Currently only address widths supported are 39 and 48 bits */
> > > +    /* Currently address widths supported are 39, 48, and 57 bits */
> > >      if ((s->aw_bits != VTD_AW_39BIT) &&
> > > -        (s->aw_bits != VTD_AW_48BIT)) {
> > > -        error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > -                   VTD_AW_39BIT, VTD_AW_48BIT);
> > > +        (s->aw_bits != VTD_AW_48BIT) &&
> > > +        (s->aw_bits != VTD_AW_57BIT)) {
> > > +        error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > +                   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > +        return false;
> > > +    }
> > > +
> > > +    if ((s->aw_bits == VTD_AW_57BIT) &&
> > > +        !(host_has_la57() && guest_has_la57())) {
> > > +        error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > +                   "host and guest are capable of 5-level paging.\n");
> > 
> > Is there any context (or pointer to previous discussions would work
> > too) on explaining why we don't support some scenarios like
> > host_paw=48,guest_paw=48,guest_gaw=57?
> > 
> 
> Well, above check is only to make sure both the host and the guest can
> use 57bit linear address, which requires 5-level paging. So I believe
> we do support scenarios like host_paw=48,guest_paw=48,guest_gaw=57.
> The guest_has_la57() means the guest can use 57-bit linear address,
> regardless of its physical address width.

Sorry for my incorrect wording.  I mean when host/guest CPU only
support 4-level LA then would/should we allow the guest IOMMU to
support 5-level IOVA?  Asked since I'm thinking whether I can run the
series a bit with my laptop/servers.

Since at it, another thing I thought about is making sure the IOMMU
capabilities will match between host and guest IOMMU, which I think
this series has ignored so far.  E.g., when we're having assigned
devices in the guest and with 5-level IOVA, we should make sure the
host IOMMU supports 5-level as well before the guest starts since
otherwise the shadow page synchronization could potentially fail when
the requested IOVA address goes beyond 4-level.  One simple solution
is just to disable device assignment for now when we're with 57bits
vIOMMU but I'm not sure whether that's what you want, especially you
mentioned the DPDK case (who may use assigned devices).

(sorry to have mentioned the dpdk case again :)

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH] nvme: fix oob access issue(CVE-2018-16847)

2018-11-12 Thread Li Qiang
Ping. What's the status of this patch?

I see Kevin's new PR doesn't contain this patch.

Thanks,
Li Qiang

Li Qiang wrote on Fri, 2 Nov 2018 at 9:22 AM:

> Currently, the nvme_cmb_ops mr doesn't check the addr and size.
> This can lead an oob access issue. This is triggerable in the guest.
> Add check to avoid this issue.
>
> Fixes CVE-2018-16847.
>
> Reported-by: Li Qiang 
> Reviewed-by: Paolo Bonzini 
> Signed-off-by: Li Qiang 
> ---
>  hw/block/nvme.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index fc7dacb..d097add 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1175,6 +1175,10 @@ static void nvme_cmb_write(void *opaque, hwaddr addr, uint64_t data,
>      unsigned size)
>  {
>      NvmeCtrl *n = (NvmeCtrl *)opaque;
> +
> +    if (addr + size > NVME_CMBSZ_GETSIZE(n->bar.cmbsz)) {
> +        return;
> +    }
>      memcpy(&n->cmbuf[addr], &data, size);
>  }
> 
> @@ -1183,6 +1187,9 @@ static uint64_t nvme_cmb_read(void *opaque, hwaddr addr, unsigned size)
>      uint64_t val;
>      NvmeCtrl *n = (NvmeCtrl *)opaque;
> 
> +    if (addr + size > NVME_CMBSZ_GETSIZE(n->bar.cmbsz)) {
> +        return 0;
> +    }
>      memcpy(&val, &n->cmbuf[addr], size);
>      return val;
>  }
> --
> 1.8.3.1
>
>


[Qemu-devel] [PULL for-3.1 2/2] qga: Add multiple include guard to guest-agent-core.h

2018-11-12 Thread Michael Roth
From: Peter Maydell 

The guest-agent-core.h header was missing the usual guards
against multiple inclusion; add them.

(Spotted by lgtm.com's static analyzer.)

Signed-off-by: Peter Maydell 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Philippe Mathieu-Daudé 
Tested-by: Philippe Mathieu-Daudé 
Signed-off-by: Michael Roth 
---
 qga/guest-agent-core.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/qga/guest-agent-core.h b/qga/guest-agent-core.h
index 6f4d214cb9..60eae16f27 100644
--- a/qga/guest-agent-core.h
+++ b/qga/guest-agent-core.h
@@ -10,6 +10,9 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
+#ifndef GUEST_AGENT_CORE_H
+#define GUEST_AGENT_CORE_H
+
 #include "qapi/qmp/dispatch.h"
 #include "qemu-common.h"
 #include "qga-qapi-types.h"
@@ -46,3 +49,5 @@ int ga_parse_whence(GuestFileWhence *whence, Error **errp);
 #ifndef _WIN32
 void reopen_fd_to_null(int fd);
 #endif
+
+#endif /* GUEST_AGENT_CORE_H */
-- 
2.17.1




[Qemu-devel] [PULL for-3.1 1/2] qga-win: fix leaks of build_guest_disk_info()

2018-11-12 Thread Michael Roth
From: Marc-André Lureau 

Introduced in commit b1ba8890e63ce9432c41c5c3fc229f54c87c9c99, vol_h
handle should be closed, and "out" cleanup should be done after
DeviceIoControl() fails.

Signed-off-by: Marc-André Lureau 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Michael Roth 
---
 qga/commands-win32.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/qga/commands-win32.c b/qga/commands-win32.c
index ef1d7d48d2..62e1b51dfe 100644
--- a/qga/commands-win32.c
+++ b/qga/commands-win32.c
@@ -797,7 +797,7 @@ static GuestDiskAddressList *build_guest_disk_info(char *guid, Error **errp)
                 0, extents, size, NULL, NULL)) {
             error_setg_win32(errp, GetLastError(),
                 "failed to get disk extents");
-            return NULL;
+            goto out;
         }
     } else if (last_err == ERROR_INVALID_FUNCTION) {
         /* Possibly CD-ROM or a shared drive. Try to pass the volume */
@@ -855,6 +855,9 @@ static GuestDiskAddressList *build_guest_disk_info(char *guid, Error **errp)
 
 
 out:
+    if (vol_h != INVALID_HANDLE_VALUE) {
+        CloseHandle(vol_h);
+    }
     qapi_free_GuestDiskAddress(disk);
     g_free(extents);
     g_free(name);
-- 
2.17.1




Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state

2018-11-12 Thread Liran Alon



> On 13 Nov 2018, at 2:07, Jim Mattson  wrote:
> 
> On Mon, Nov 12, 2018 at 4:00 PM, Liran Alon  wrote:
>> 
>> 
>>> On 12 Nov 2018, at 18:54, Daniel P. Berrangé  wrote:
>>> 
>>> On Mon, Nov 12, 2018 at 04:50:54PM +, Dr. David Alan Gilbert wrote:
>>>> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>>>>> On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
>>>>>> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>>>>>>> We have usually followed a rule that new machine types must not
>>>>>>> affect runnability of a VM on a host. IOW new machine types should
>>>>>>> not introduce dependencies on specific kernels, or hardware features
>>>>>>> such as CPU flags.
>>>>>> 
>>>>>>> Anything that requires a new kernel feature thus ought to be an
>>>>>>> opt-in config tunable on the CLI, separate from machine type
>>>>>>> choice.
>>>>>> 
>>>>>> Unless someone tinkered with the module parameters, they could not even
>>>>>> use nested virtualization before 4.20.  So for everyone else, "-cpu
>>>>>> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
>>>>>> requires 4.20.
>>>>>> 
>>>>>> For those that did tinker with module parameters, we can grandfather in
>>>>>> the old machine types, so that they can use nested virtualization with
>>>>>> no live migration support.  For those that did not, however, I don't
>>>>>> think it makes sense to say "oh by the way I really want to be able to
>>>>>> migrate this VM" on the command line, or even worse on the monitor.
>>>>> 
>>>>> IIUC, 4.20 is only required from POV of migration state. Is it thus
>>>>> possible to just register a migration blocker if QEMU is launched
>>>>> on a host with kernel < 4.20.
>>>>> 
>>>>> Migration has always been busted historically, so those people using
>>>>> nested VMX already won't be hurt by not having ability to live migrate
>>>>> their VM, but could otherwise continue using them without being forced
>>>>> to upgrade their kernel to fix a feature they're not even using.
>>>> 
>>>> Yes, although I am a bit worried we might have a population of users
>>>> that:
>>>>  a) Have enabled nesting
>>>>  b) Run VMs with vmx enabled
>>> 
>>> 
>>>>  c) Don't normally actually run nested guests
>>>>  d) Currently happily migrate.
>>> 
>>> True, and (b) would include anyone using libvirt's host-model CPU. So if
>>> you enabled nesting, have host-model for all guests, but only use nesting
>>> in one of the guests, you'd be doomed.
>>> 
>>> Is it possible for QEMU to determine if there are nested guests running or
>>> not and conditionally block migration appropriately to ensure safety ?
>> 
>> 
>> Only if kernel supports KVM_CAP_NESTED_STATE.
>> See my reply to Dave in this thread.
> 
> You could still allow migration if CR4.VMXE is clear.

Agreed. Nice addition :)

Thanks,
-Liran





Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Yu Zhang
On Tue, Nov 13, 2018 at 01:04:51PM +0800, Peter Xu wrote:
> On Tue, Nov 13, 2018 at 11:37:07AM +0800, Peter Xu wrote:
> > On Mon, Nov 12, 2018 at 05:42:01PM +0800, Yu Zhang wrote:
> > > On Mon, Nov 12, 2018 at 04:36:34PM +0800, Peter Xu wrote:
> > > > On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> > > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > > E.g. guest applications like DPDK prefer to use its VA as IOVA when
> > > > > performing VFIO map/unmap operations, to avoid the burden of managing the
> > > > > IOVA space.
> > > > 
> > > > Since you mentioned about DPDK... I'm just curious that whether have
> > > > you tested the patchset with the 57bit-enabled machines with DPDK VA
> > > > mode running in the guest? That would be something nice to mention in
> > > > the cover letter if you have.
> > > > 
> > > 
> > > Hah. Maybe I shall not mention DPDK here. 
> > > 
> > > The story is that we heard the requirement, saying applications like DPDK
> > > would need 5-level paging on the IOMMU side. And I was convinced after
> > > checking the DPDK code, seeing it may use VA as IOVA directly. But I did not
> > > test this patch with DPDK.
> > > 
> > > Instead, I used kvm-unit-test to verify this patch series. And of course, I
> > > also did some modification to the test case. Patch for the test also sent out
> > > at https://www.spinics.net/lists/kvm/msg177425.html.
> > 
> > Yeah that's perfectly fine for me.  So instead maybe you can also
> > mention the kvm-unit-test in the cover letter if you gonna repost.
> > 
> > > 
> > > > [...]
> > > > 
> > > > > @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > > >  }
> > > > >  }
> > > > >  
> > > > > -/* Currently only address widths supported are 39 and 48 bits */
> > > > > +/* Currently address widths supported are 39, 48, and 57 bits */
> > > > >  if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > > -(s->aw_bits != VTD_AW_48BIT)) {
> > > > > -error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > > -   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > > +(s->aw_bits != VTD_AW_48BIT) &&
> > > > > +(s->aw_bits != VTD_AW_57BIT)) {
> > > > > +error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > > +   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > > +return false;
> > > > > +}
> > > > > +
> > > > > +if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > > +!(host_has_la57() && guest_has_la57())) {
> > > > > +error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > > + "host and guest are capable of 5-level paging.\n");
> > > > 
> > > > Is there any context (or pointer to previous discussions would work
> > > > too) on explaining why we don't support some scenarios like
> > > > host_paw=48,guest_paw=48,guest_gaw=57?
> > > > 
> > > 
> > > Well, above check is only to make sure both the host and the guest can
> > > use 57bit linear address, which requires 5-level paging. So I believe
> > > we do support scenarios like host_paw=48,guest_paw=48,guest_gaw=57.
> > > The guest_has_la57() means the guest can use 57-bit linear address,
> > > regardless of its physical address width.
> > 
> > Sorry for my incorrect wording.  I mean when host/guest CPU only
> > support 4-level LA then would/should we allow the guest IOMMU to
> > support 5-level IOVA?  Asked since I'm thinking whether I can run the
> > series a bit with my laptop/servers.
> 
> [...]
> 
> > 
> > Since at it, another thing I thought about is making sure the IOMMU
> > capabilities will match between host and guest IOMMU, which I think
> > this series has ignored so far.  E.g., when we're having assigned
> > devices in the guest and with 5-level IOVA, we should make sure the
> > host IOMMU supports 5-level as well before the guest starts since
> > otherwise the shadow page synchronization could potentially fail when
> > the requested IOVA address goes beyond 4-level.  One simple solution
> > is just to disable device assignment for now when we're with 57bits
> > vIOMMU but I'm not sure whether that's what you want, especially you
> > mentioned the DPDK case (who may use assigned devices).
> 
> Ok I totally forgot that we don't even support any kind of check like
> this before... So feel free to skip this comment if you want, or it
> would be even nicer if you want to fix it as a whole. :)
> 

Indeed. We have talked about this before. How about we focus on the 5-level
extension for now, and solve the check issue in the future? I still do not
have any clean solutions in mind. BTW, any suggestions on this issue? :)

> Regards,
> 
> -- 
> Peter Xu
> 

B.R.
Yu



Re: [Qemu-devel] [RFC PATCH 02/11] decodetree: Add multiple include guard

2018-11-12 Thread Philippe Mathieu-Daudé

On 12/11/18 23:30, Eduardo Habkost wrote:

On Mon, Nov 12, 2018 at 12:36:13AM +0100, Philippe Mathieu-Daudé wrote:

It is necessary when splitting an ISA, or when using multiple ISAs.

Signed-off-by: Philippe Mathieu-Daudé 
---
TODO: explain why, use case
TODO: escape full path?
---
  scripts/decodetree.py | 5 +
  1 file changed, 5 insertions(+)

diff --git a/scripts/decodetree.py b/scripts/decodetree.py
index 0bc73b5990..5dea15e7a5 100755
--- a/scripts/decodetree.py
+++ b/scripts/decodetree.py
@@ -1030,7 +1030,11 @@ def main():
  else:
  output_fd = sys.stdout
  
+hdr_guard = filename.split(os.path.sep)[-1].upper().replace('.', '_') + "_H"


Isn't
   filename.split(os.path.sep)[-1]
equivalent to
   os.path.basename(filename)
?


Yes, thanks :)



Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Peter Xu
On Tue, Nov 13, 2018 at 11:37:07AM +0800, Peter Xu wrote:
> On Mon, Nov 12, 2018 at 05:42:01PM +0800, Yu Zhang wrote:
> > On Mon, Nov 12, 2018 at 04:36:34PM +0800, Peter Xu wrote:
> > > On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> > > > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > > > E.g. guest applications like DPDK prefer to use its VA as IOVA when
> > > > performing VFIO map/unmap operations, to avoid the burden of managing the
> > > > IOVA space.
> > > 
> > > Since you mentioned about DPDK... I'm just curious that whether have
> > > you tested the patchset with the 57bit-enabled machines with DPDK VA
> > > mode running in the guest? That would be something nice to mention in
> > > the cover letter if you have.
> > > 
> > 
> > Hah. Maybe I shall not mention DPDK here. 
> > 
> > The story is that we heard the requirement, saying applications like DPDK
> > would need 5-level paging on the IOMMU side. And I was convinced after
> > checking the DPDK code, seeing it may use VA as IOVA directly. But I did not
> > test this patch with DPDK.
> > 
> > Instead, I used kvm-unit-test to verify this patch series. And of course, I
> > also did some modification to the test case. Patch for the test also sent out
> > at https://www.spinics.net/lists/kvm/msg177425.html.
> 
> Yeah that's perfectly fine for me.  So instead maybe you can also
> mention the kvm-unit-test in the cover letter if you gonna repost.
> 
> > 
> > > [...]
> > > 
> > > > @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
> > > >  }
> > > >  }
> > > >  
> > > > -/* Currently only address widths supported are 39 and 48 bits */
> > > > +/* Currently address widths supported are 39, 48, and 57 bits */
> > > >  if ((s->aw_bits != VTD_AW_39BIT) &&
> > > > -(s->aw_bits != VTD_AW_48BIT)) {
> > > > -error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > > > -   VTD_AW_39BIT, VTD_AW_48BIT);
> > > > +(s->aw_bits != VTD_AW_48BIT) &&
> > > > +(s->aw_bits != VTD_AW_57BIT)) {
> > > > +error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > > > +   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > > > +return false;
> > > > +}
> > > > +
> > > > +if ((s->aw_bits == VTD_AW_57BIT) &&
> > > > +!(host_has_la57() && guest_has_la57())) {
> > > > +error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > > > + "host and guest are capable of 5-level paging.\n");
> > > 
> > > Is there any context (or pointer to previous discussions would work
> > > too) on explaining why we don't support some scenarios like
> > > host_paw=48,guest_paw=48,guest_gaw=57?
> > > 
> > 
> > Well, above check is only to make sure both the host and the guest can
> > use 57bit linear address, which requires 5-level paging. So I believe
> > we do support scenarios like host_paw=48,guest_paw=48,guest_gaw=57.
> > The guest_has_la57() means the guest can use 57-bit linear address,
> > regardless of its physical address width.
> 
> Sorry for my incorrect wording.  I mean when host/guest CPU only
> support 4-level LA then would/should we allow the guest IOMMU to
> support 5-level IOVA?  Asked since I'm thinking whether I can run the
> series a bit with my laptop/servers.

[...]

> 
> Since at it, another thing I thought about is making sure the IOMMU
> capabilities will match between host and guest IOMMU, which I think
> this series has ignored so far.  E.g., when we're having assigned
> devices in the guest and with 5-level IOVA, we should make sure the
> host IOMMU supports 5-level as well before the guest starts since
> otherwise the shadow page synchronization could potentially fail when
> the requested IOVA address goes beyond 4-level.  One simple solution
> is just to disable device assignment for now when we're with 57bits
> vIOMMU but I'm not sure whether that's what you want, especially you
> mentioned the DPDK case (who may use assigned devices).

Ok I totally forgot that we don't even support any kind of check like
this before... So feel free to skip this comment if you want, or it
would be even nicer if you want to fix it as a whole. :)

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu] ppc/spapr: Receive and store device tree blob from SLOF

2018-11-12 Thread Alexey Kardashevskiy



On 12/11/2018 20:05, Greg Kurz wrote:
> On Mon, 12 Nov 2018 15:12:26 +1100
> Alexey Kardashevskiy  wrote:
> 
>> On 12/11/2018 05:10, Greg Kurz wrote:
>>> Hi Alexey,
>>>
>>> Just a few remarks. See below.
>>>
>>> On Thu,  8 Nov 2018 12:44:06 +1100
>>> Alexey Kardashevskiy  wrote:
>>>   
 SLOF receives a device tree and updates it with various properties
 before switching to the guest kernel and QEMU is not aware of any changes
 made by SLOF. Since there is no real RTAS (QEMU implements it), it makes
 sense to pass the SLOF final device tree to QEMU to let it implement
 RTAS related tasks better, such as PCI host bus adapter hotplug.

 Specifically, now QEMU can find out the actual XICS phandle (for PHB
 hotplug) and the RTAS linux,rtas-entry/base properties (for firmware
 assisted NMI - FWNMI).

 This stores the initial DT blob in the sPAPR machine and replaces it
 in the KVMPPC_H_UPDATE_DT (new private hypercall) handler.

 This adds an @update_dt_enabled machine property to allow backward
 migration.

 SLOF already has a hypercall since
 https://github.com/aik/SLOF/commit/e6fc84652c9c0073f9183

 Signed-off-by: Alexey Kardashevskiy 
 ---
  include/hw/ppc/spapr.h |  7 ++-
  hw/ppc/spapr.c | 29 -
  hw/ppc/spapr_hcall.c   | 32 
  hw/ppc/trace-events|  2 ++
  4 files changed, 68 insertions(+), 2 deletions(-)

 diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
 index ad4d7cfd97..f5dcaf44cb 100644
 --- a/include/hw/ppc/spapr.h
 +++ b/include/hw/ppc/spapr.h
 @@ -100,6 +100,7 @@ struct sPAPRMachineClass {
  
  /*< public >*/
   bool dr_lmb_enabled;   /* enable dynamic-reconfig/hotplug of LMBs */
 +bool update_dt_enabled;/* enable KVMPPC_H_UPDATE_DT */
  bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
  bool pre_2_10_has_unused_icps;
  bool legacy_irq_allocation;
 @@ -136,6 +137,9 @@ struct sPAPRMachineState {
  int vrma_adjust;
  ssize_t rtas_size;
  void *rtas_blob;
 +uint32_t fdt_size;
 +uint32_t fdt_initial_size;  
>>>
>>> I don't quite see the purpose of fdt_initial_size... it seems to be only
>>> used to print a trace.  
>>
>>
>> Ah, lost in rebase. The purpose was to test if the new device tree has
>> not grown too much.
>>
> 
> Ok, makes sense during development.
> 
>>
>>
>>>   
 +void *fdt_blob;
  long kernel_size;
  bool kernel_le;
  uint32_t initrd_base;
 @@ -462,7 +466,8 @@ struct sPAPRMachineState {
  #define KVMPPC_H_LOGICAL_MEMOP  (KVMPPC_HCALL_BASE + 0x1)
  /* Client Architecture support */
  #define KVMPPC_H_CAS(KVMPPC_HCALL_BASE + 0x2)
 -#define KVMPPC_HCALL_MAXKVMPPC_H_CAS
 +#define KVMPPC_H_UPDATE_DT  (KVMPPC_HCALL_BASE + 0x3)
 +#define KVMPPC_HCALL_MAXKVMPPC_H_UPDATE_DT
  
  typedef struct sPAPRDeviceTreeUpdateHeader {
  uint32_t version_id;
 diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
 index c08130facb..5e2d4d211c 100644
 --- a/hw/ppc/spapr.c
 +++ b/hw/ppc/spapr.c
 @@ -1633,7 +1633,10 @@ static void spapr_machine_reset(void)
  /* Load the fdt */
  qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
  cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
 -g_free(fdt);
 +g_free(spapr->fdt_blob);
 +spapr->fdt_size = fdt_totalsize(fdt);
 +spapr->fdt_initial_size = spapr->fdt_size;
 +spapr->fdt_blob = fdt;  
>>>
>>> Hmm... It looks weird to store state in a reset handler. I'd rather zero
>>> both fdt_blob and fdt_size here.  
>>
>>
>> The device tree is built from the reset handler and the idea is that we
>> want to always have some tree in the machine.
>>
> 
> Yes of course, I forgot that we need the fdt to be kept
> somewhere so that we can use it :). My remark has more to do
> with migration actually: the fdt built at reset time is supposed
> to derive from the command line and hot-(un)plugged devices, ie,
> identical in source and destination. This isn't state we should
> migrate IIUC.

Having some device tree all the time seems more convenient than managing
the state when we do have one and when we do not.

It is not a big deal though, I'd wait and see what David thinks. Thanks,



> Maybe add a boolean field that tells that the fdt was updated, use
> it in spapr_dtb_needed() and reset it in spapr_machine_reset() ?
> 
>>
>>
>>>   
  
  /* Set up the entry state */
  spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
 @@ -1887,6 +1890,27 @@ static const VMStateDescription vmstate_spapr_irq_map = {
  },
  };
  
 +static bool spapr_dtb_needed(void *opaque)
 +{
 +sPAPRMachineClass 

Re: [Qemu-devel] [PATCH for-3.2 1/7] tests/pvpanic: Make the pvpanic test independent of global_qtest

2018-11-12 Thread Eric Blake

On 11/12/18 1:08 PM, Thomas Huth wrote:

We want to get rid of global_qtest in the long run, thus do not
use the wrappers like inb() and outb() here anymore.

Signed-off-by: Thomas Huth 
---
  tests/pvpanic-test.c | 14 --
  1 file changed, 8 insertions(+), 6 deletions(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



[Qemu-devel] [PATCH] memory: check write/read_with_attrs in memory dispatch

2018-11-12 Thread Li Qiang
This avoids a NULL pointer dereference when the MemoryRegion has neither
a read/write nor a read_with_attrs/write_with_attrs callback.

Signed-off-by: Li Qiang 
---
 memory.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/memory.c b/memory.c
index d14c6dec1d..3baf5857b9 100644
--- a/memory.c
+++ b/memory.c
@@ -1377,13 +1377,15 @@ static MemTxResult memory_region_dispatch_read1(MemoryRegion *mr,
  mr->ops->impl.max_access_size,
  memory_region_read_accessor,
  mr, attrs);
-} else {
+} else if (mr->ops->read_with_attrs) {
 return access_with_adjusted_size(addr, pval, size,
  mr->ops->impl.min_access_size,
                                 mr->ops->impl.max_access_size,
                                 memory_region_read_with_attrs_accessor,
  mr, attrs);
 }
+
+return MEMTX_DECODE_ERROR;
 }
 
 MemTxResult memory_region_dispatch_read(MemoryRegion *mr,
@@ -1454,7 +1456,7 @@ MemTxResult memory_region_dispatch_write(MemoryRegion *mr,
  mr->ops->impl.max_access_size,
  memory_region_write_accessor, mr,
  attrs);
-} else {
+} else if (mr->ops->write_with_attrs) {
 return
 access_with_adjusted_size(addr, &data, size,
   mr->ops->impl.min_access_size,
@@ -1462,6 +1464,8 @@ MemTxResult memory_region_dispatch_write(MemoryRegion *mr,
   memory_region_write_with_attrs_accessor,
   mr, attrs);
 }
+
+return MEMTX_DECODE_ERROR;
 }
 
 void memory_region_init_io(MemoryRegion *mr,
-- 
2.11.0




Re: [Qemu-devel] [PATCH v2] Acceptance tests: add Linux initrd checking test

2018-11-12 Thread Eduardo Habkost
On Fri, Nov 09, 2018 at 01:21:53PM -0500, Wainer dos Santos Moschetta wrote:
> QEMU used to exit with an inaccurate error message when
> an initrd > 2GiB was passed. That was fixed by patch:
> 
>   commit f3839fda5771596152b75dd1e1a6d050e6e6e380
>   Author: Li Zhijian 
>   Date:   Thu Sep 13 18:07:13 2018 +0800
> 
>   change get_image_size return type to int64_t
> 
> This change adds a regression test for that fix. It starts
> QEMU with a 2GiB dummy initrd, and checks that it evaluates the file
> size correctly and prints an accurate message.
> 
> Signed-off-by: Wainer dos Santos Moschetta 
> Reviewed-by: Caio Carrara 
> Reviewed-by: Cleber Rosa 
> Reviewed-by: Eduardo Habkost 
> Reviewed-by: Philippe Mathieu-Daudé 
> Tested-by: Philippe Mathieu-Daudé 

Queued for 4.0, thanks!

-- 
Eduardo



[Qemu-devel] [Bug 1802915] Re: GTK display refresh rate is throttled

2018-11-12 Thread Chen Zhang
** Description changed:

  Guest OS running with GL enabled GTK display shows a reduced refresh
  rate, e.g. moving cursor around with iGVT-g DMA Buf.
  
- It seems that a default refresh interval GUI_REFRESH_INTERVAL_DEFAULT
+ Apparently, a default refresh interval GUI_REFRESH_INTERVAL_DEFAULT
  (30ms) is defined in include/ui/console.h, throttling the display
  refresh rate at 33Hz.
  
- To correct this throttle issue, a shorter interval should be applied to
- display change listener or the default value should be used.
+ To correct this throttle issue, a shorter interval (16 or 17
+ milliseconds) should be applied to display change listener or the
+ default value should be modified.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1802915

Title:
  GTK display refresh rate is throttled

Status in QEMU:
  New

Bug description:
  Guest OS running with GL enabled GTK display shows a reduced refresh
  rate, e.g. moving cursor around with iGVT-g DMA Buf.

  Apparently, a default refresh interval GUI_REFRESH_INTERVAL_DEFAULT
  (30ms) is defined in include/ui/console.h, throttling the display
  refresh rate at 33Hz.

  To correct this throttle issue, a shorter interval (16 or 17
  milliseconds) should be applied to display change listener or the
  default value should be modified.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1802915/+subscriptions



Re: [Qemu-devel] [PATCH] virtio-net: support RSC v4/v6 tcp traffic for Windows HCK

2018-11-12 Thread Wei Xu
Looks good. I can't recall the status of the last version well, but
I remember Jason gave some comments about the sanity checks being quite
essential; have you addressed them?

Reviewed-by: Wei Xu 

On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote:
> This commit adds implementation of RX packets
> coalescing, compatible with requirements of Windows
> Hardware compatibility kit.
> 
> The device enables feature VIRTIO_NET_F_RSC_EXT in
> host features if it supports extended RSC functionality
> as defined in the specification.
> This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
> VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
> this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> is also present.
> 
> In case vhost is enabled the feature bit is cleared in
> host_features during device initialization.
> 
> If the guest driver acks VIRTIO_NET_F_RSC_EXT feature,
> the device coalesces TCPv4 and TCPv6 packets (if
> respective VIRTIO_NET_F_GUEST_TSO feature is on,
> populates extended RSC information in virtio header
> and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags.
> The device does not recalculate checksums in the coalesced
> packet, so they are not valid.
> 
> In this case:
> All the data packets in a tcp connection are cached
> to a single buffer in every receive interval, and will
> be sent out via a timer, the 'virtio_net_rsc_timeout'
> controls the interval, this value may impact the
> performance and response time of tcp connection,
> 5(50us) is an experience value to gain a performance
> improvement, since the whql test sends packets every 100us,
> so '30(300us)' passes the test case, it is the default
> value as well, tune it via the command line parameter
> 'rsc_interval' within 'virtio-net-pci' device, for example,
> to launch a guest with interval set as '50':
> 
> 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00,
> guest_rsc_ext=on,rsc_interval=50'
> 
> The timer will only be triggered if the packets pool is not empty,
> and it'll drain off all the cached packets.
> 
> 'NetRscChain' is used to save the segments of IPv4/6 in a
> VirtIONet device.
> 
> A new segment becomes a 'Candidate' once it passes the sanity check;
> the main handler of TCP includes TCP window update, duplicated
> ACK check and the real data coalescing.
> 
> A 'Candidate' segment means:
> 1. Segment is within current window and the sequence is the expected one.
> 2. 'ACK' of the segment is in the valid window.
> 
> Sanity check includes:
> 1. Incorrect version in IP header
> 2. An IP options or IP fragment
> 3. Not a TCP packet
> 4. Sanity size check to prevent buffer overflow attack.
> 5. An ECN packet
> 
> Even so, there might be more cases that should be considered, such as
> the IP identification and other flags, but checking them breaks the test
> because Windows sets the identification to the same value even when the
> packet is not a fragment.
> 
> Normally there are 2 typical ways to handle a TCP control flag,
> 'bypass' and 'finalize'. 'bypass' means the packet should be sent out
> directly, while 'finalize' means the packet should also be bypassed, but
> only after searching the pool for packets of the same connection and
> draining all of them out; this is to avoid out-of-order fragments.
> 
> All the 'SYN' packets will be bypassed since this always begins a new
> connection; other flags such as 'URG/FIN/RST/CWR/ECE' will trigger a
> finalization, because this normally happens when a connection is going
> to be closed; an 'URG' packet also finalizes the current coalescing unit.
> 
> Statistics can be used to monitor the basic coalescing status, the
> 'out of order' and 'out of window' counters show how many packets were
> retransmitted, thus describing the performance intuitively.
> 
> Difference between ip v4 and v6 processing:
>  The fragment length in the IPv4 header includes the header itself, while
>  it does not for IPv6, which means IPv6 can carry a real 65535-byte payload.
> 
> Signed-off-by: Wei Xu 
> Signed-off-by: Yuri Benditovich 
> ---
>  hw/net/virtio-net.c | 648 +++-
>  include/hw/virtio/virtio-net.h  |  81 +++
>  include/net/eth.h   |   2 +
>  include/standard-headers/linux/virtio_net.h |   8 +
>  4 files changed, 734 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 385b1a03e9..43a7021409 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -41,6 +41,28 @@
>  #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
>  #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
>  
> +#define VIRTIO_NET_IP4_ADDR_SIZE   8/* ipv4 saddr + daddr */
> +
> +#define VIRTIO_NET_TCP_FLAG 0x3F
> +#define VIRTIO_NET_TCP_HDR_LENGTH   0xF000
> +
> +/* IPv4 max payload, 16 bits in the header */
> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header))
> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535
> +
> +/* header length value in ip header without option */
> +#define 

Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state

2018-11-12 Thread Jim Mattson via Qemu-devel
On Mon, Nov 12, 2018 at 4:00 PM, Liran Alon  wrote:
>
>
>> On 12 Nov 2018, at 18:54, Daniel P. Berrangé  wrote:
>>
>> On Mon, Nov 12, 2018 at 04:50:54PM +, Dr. David Alan Gilbert wrote:
>>> * Daniel P. Berrangé (berra...@redhat.com) wrote:
 On Sun, Nov 04, 2018 at 11:19:57PM +0100, Paolo Bonzini wrote:
> On 02/11/2018 17:54, Daniel P. Berrangé wrote:
>> We have usually followed a rule that new machine types must not
>> affect runnability of a VM on a host. IOW new machine types should
>> not introduce dependencies on specific kernels, or hardware features
>> such as CPU flags.
>
>> Anything that requires a new kernel feature thus ought to be an
>> opt-in config tunable on the CLI, separate from machine type
>> choice.
>
> Unless someone tinkered with the module parameters, they could not even
> use nested virtualization before 4.20.  So for everyone else, "-cpu
> ...,+vmx" does count as an "opt-in config tunable on the CLI" that
> requires 4.20.
>
> For those that did tinker with module parameters, we can grandfather in
> the old machine types, so that they can use nested virtualization with
> no live migration support.  For those that did not, however, I don't
> think it makes sense to say "oh by the way I really want to be able to
> migrate this VM" on the command line, or even worse on the monitor.

 IIUC, 4.20 is only required from POV of migration state. Is it thus
 possible to just register a migration blocker if QEMU is launched
 on a host with kernel < 4.20.

 Migration has always been busted historically, so those people using
 nested VMX already won't be hurt by not having ability to live migrate
 their VM, but could otherwise continue using them without being forced
 to upgrade their kernel to fix a feature they're not even using.
>>>
>>> Yes, although I am a bit worried we might have a population of users
>>> that:
>>>   a) Have enabled nesting
>>>   b) Run VMs with vmx enabled
>>
>>
>>>   c) Don't normally actually run nested guests
>>>   d) Currently happily migrate.
>>
>> True, and (b) would include anyone using libvirt's host-model CPU. So if
>> you enabled nesting, have host-model for all guests, but only use nesting
>> in one of the guests, you'd be doomed.
>>
>> Is it possible for QEMU to determine if there are nested guests running or
>> not and conditionally block migration appropriately to ensure safety ?
>
>
> Only if kernel supports KVM_CAP_NESTED_STATE.
> See my reply to Dave in this thread.

You could still allow migration if CR4.VMXE is clear.



[Qemu-devel] [PULL for-3.1 0/2] qemu-ga patch queue for 3.1.0

2018-11-12 Thread Michael Roth
The following changes since commit 160e5c22e55b3f775c2003dfc626fa872ee4a7a1:

  Merge remote-tracking branch 'remotes/gkurz/tags/for-upstream' into staging 
(2018-11-09 10:54:10 +)

are available in the Git repository at:

  git://github.com/mdroth/qemu.git tags/qga-pull-2018-11-12-tag

for you to fetch changes up to 61baac2fdb7ad3891fb00bbd3c9e8b8ca87f0f62:

  qga: Add multiple include guard to guest-agent-core.h (2018-11-09 07:55:13 
-0600)


qemu-ga patch queue for 3.1.0

* add missing #include guards for guest-agent-core.h
* fix leaks introduced with recent win32 enablement of disk info in
  guest-get-fsinfo


Marc-André Lureau (1):
  qga-win: fix leaks of build_guest_disk_info()

Peter Maydell (1):
  qga: Add multiple include guard to guest-agent-core.h

 qga/commands-win32.c   | 5 -
 qga/guest-agent-core.h | 5 +
 2 files changed, 9 insertions(+), 1 deletion(-)





Re: [Qemu-devel] [PATCH v1 3/3] intel-iommu: search iotlb for levels supported by the address width.

2018-11-12 Thread Yu Zhang
On Tue, Nov 13, 2018 at 01:18:54PM +0800, Peter Xu wrote:
> On Mon, Nov 12, 2018 at 08:38:30PM +0800, Yu Zhang wrote:
> > On Mon, Nov 12, 2018 at 05:36:38PM +0800, Peter Xu wrote:
> > > On Mon, Nov 12, 2018 at 05:25:48PM +0800, Yu Zhang wrote:
> > > > On Mon, Nov 12, 2018 at 04:51:22PM +0800, Peter Xu wrote:
> > > > > On Fri, Nov 09, 2018 at 07:49:47PM +0800, Yu Zhang wrote:
> > > > > > This patch updates vtd_lookup_iotlb() to search cached mappings only
> > > > > > for all page levels supported by address width of current vIOMMU. Also,
> > > > > > to cover 57-bit width, the shift of source id (VTD_IOTLB_SID_SHIFT) and
> > > > > > of page level (VTD_IOTLB_LVL_SHIFT) are enlarged by 9 - the stride of
> > > > > > one paging structure level.
> > > > > > 
> > > > > > Signed-off-by: Yu Zhang 
> > > > > > ---
> > > > > > Cc: "Michael S. Tsirkin" 
> > > > > > Cc: Marcel Apfelbaum 
> > > > > > Cc: Paolo Bonzini  
> > > > > > Cc: Richard Henderson  
> > > > > > Cc: Eduardo Habkost 
> > > > > > Cc: Peter Xu 
> > > > > > ---
> > > > > >  hw/i386/intel_iommu.c  | 5 +++--
> > > > > >  hw/i386/intel_iommu_internal.h | 7 ++-
> > > > > >  2 files changed, 5 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > index 9cdf755..ce7e17e 100644
> > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > @@ -254,11 +254,12 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
> > > > > >  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t source_id,
> > > > > > hwaddr addr)
> > > > > >  {
> > > > > > -VTDIOTLBEntry *entry;
> > > > > > +VTDIOTLBEntry *entry = NULL;
> > > > > >  uint64_t key;
> > > > > >  int level;
> > > > > > +int max_level = (s->aw_bits - VTD_PAGE_SHIFT_4K) / VTD_SL_LEVEL_BITS;
> > > > > >  
> > > > > > -for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
> > > > > > +for (level = VTD_SL_PT_LEVEL; level < max_level; level++) {
> > > > > 
> > > > > My understanding of current IOTLB is that it only caches the last
> > > > > level of mapping, say:
> > > > > 
> > > > >   - level 1: 4K page
> > > > >   - level 2: 2M page
> > > > >   - level 3: 1G page
> > > > > 
> > > > > So we don't check against level=4 even if x-aw-bits=48 is specified.
> > > > > 
> > > > > Here does it mean that we're going to have... 512G iommu huge pages?
> > > > > 
> > > > 
> > > > No. My bad, I misunderstood this routine. And now I believe we do not
> > > > need this patch. :-)
> > > 
> > > Yeah good to confirm that :-)
> > 
> > Sorry, Peter. I still have question about this part. I agree we do not need
> > to do the extra loop - therefore no need for the max_level part introduced
> > in this patch.
> > 
> > But as to modification of VTD_IOTLB_SID_SHIFT/VTD_IOTLB_LVL_SHIFT, we may
> > still need to do it due to the enlarged gfn, to search an IOTLB entry for
> > a 4K mapping, the pfn itself could be as large as 45-bit.
> 
> Agreed.

Thanks~

> 
> > 
> > Besides, currently vtd_get_iotlb_gfn() is just shifting 12 bits for all
> > different levels, is this necessary? I mean, how about we do the shift
> > based on current level?
> > 
> >  static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
> >  {
> > -return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
> > +uint32_t shift = vtd_slpt_level_shift(level);
> > +return (addr & vtd_slpt_level_page_mask(level)) >> shift;
> >  }
> 
> IMHO we can, but I don't see much gain from it.
> 
> If we shift, we still need to use the maximum possible bits that a PFN
> can hold, which is 45bits (when with 4k pages), so we can't gain
> > anything out of it (no saved bits on iotlb key).  Instead, we'll need
> to call more vtd_slpt_level_shift() for each vtd_get_iotlb_gfn() which
> even seems a bit slower.

Yep, we still need to use 45 bits for 4K mappings. The only benefit I can think
of is it's more intuitive - more aligned to the vtd spec of iotlb tags. But just
like you said, I do not see any runtime gain in it. So I'm fine to drop this. :)

> 
> Regards,
> 
> -- 
> Peter Xu
> 

B.R.
Yu



Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Peter Xu
On Tue, Nov 13, 2018 at 01:45:44PM +0800, Yu Zhang wrote:

[...]

> > > Since at it, another thing I thought about is making sure the IOMMU
> > > capabilities will match between host and guest IOMMU, which I think
> > > this series has ignored so far.  E.g., when we're having assigned
> > > devices in the guest and with 5-level IOVA, we should make sure the
> > > host IOMMU supports 5-level as well before the guest starts since
> > > otherwise the shadow page synchronization could potentially fail when
> > > the requested IOVA address goes beyond 4-level.  One simple solution
> > > is just to disable device assignment for now when we're with 57bits
> > > vIOMMU but I'm not sure whether that's what you want, especially you
> > > mentioned the DPDK case (who may use assigned devices).
> > 
> > Ok I totally forgot that we don't even support any kind of check like
> > this before... So feel free to skip this comment if you want, or it
> > would be even nicer if you want to fix it as a whole. :)
> > 
> 
> Indeed. We have talked about this before. How about we focus on the 5-level
> extension for now, and solve the check issue in the future? I still do not
> have any clean solutions in mind. BTW, any suggestions on this issue? :)

I started to remember our discussions; sorry, I should have remembered them
earlier... :)

The only thing in my mind (I think I also suggested the same thing
during that discussion, but I don't trust my memory any more...) is to
use sysfs.  Say:

  1. Scan /sys/class/iommu/dmarN for all the host IOMMUs, read cap of
 each IOMMU from /sys/class/iommu/dmar0/intel-iommu/cap,

  2. For each host iommu, scan /sys/class/iommu/dmarN/devices for all
 the devices under each host IOMMU, then we can know which IOMMU
 owns which device,

  3. For each assigned device to the guest, we lookup the previous
 information to know the mgaw for each host device, raise error
 and stop QEMU from booting if any of the host device has less
 level supported than the guest vIOMMU (possibly some more checks
 in vtd_iommu_notify_flag_changed)

(we still have some issue on vtd_iommu_notify_flag_changed since it's
 only run until the first enablement of vIOMMU, so we'll only raise
 the error during guest Linux boots with vIOMMU on. But that's another
 issue)

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH v1 3/3] intel-iommu: search iotlb for levels supported by the address width.

2018-11-12 Thread Peter Xu
On Mon, Nov 12, 2018 at 08:38:30PM +0800, Yu Zhang wrote:
> On Mon, Nov 12, 2018 at 05:36:38PM +0800, Peter Xu wrote:
> > On Mon, Nov 12, 2018 at 05:25:48PM +0800, Yu Zhang wrote:
> > > On Mon, Nov 12, 2018 at 04:51:22PM +0800, Peter Xu wrote:
> > > > On Fri, Nov 09, 2018 at 07:49:47PM +0800, Yu Zhang wrote:
> > > > > This patch updates vtd_lookup_iotlb() to search cached mappings only
> > > > > for all page levels supported by address width of current vIOMMU. 
> > > > > Also,
> > > > > to cover 57-bit width, the shift of source id(VTD_IOTLB_SID_SHIFT) and
> > > > > of page level(VTD_IOTLB_LVL_SHIFT) are enlarged by 9 - the stride of
> > > > > one paging structure level.
> > > > > 
> > > > > Signed-off-by: Yu Zhang 
> > > > > ---
> > > > > Cc: "Michael S. Tsirkin" 
> > > > > Cc: Marcel Apfelbaum 
> > > > > Cc: Paolo Bonzini  
> > > > > Cc: Richard Henderson  
> > > > > Cc: Eduardo Habkost 
> > > > > Cc: Peter Xu 
> > > > > ---
> > > > >  hw/i386/intel_iommu.c  | 5 +++--
> > > > >  hw/i386/intel_iommu_internal.h | 7 ++-
> > > > >  2 files changed, 5 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > index 9cdf755..ce7e17e 100644
> > > > > --- a/hw/i386/intel_iommu.c
> > > > > +++ b/hw/i386/intel_iommu.c
> > > > > @@ -254,11 +254,12 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, 
> > > > > uint32_t level)
> > > > >  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t 
> > > > > source_id,
> > > > > hwaddr addr)
> > > > >  {
> > > > > -VTDIOTLBEntry *entry;
> > > > > +VTDIOTLBEntry *entry = NULL;
> > > > >  uint64_t key;
> > > > >  int level;
> > > > > +int max_level = (s->aw_bits - VTD_PAGE_SHIFT_4K) / 
> > > > > VTD_SL_LEVEL_BITS;
> > > > >  
> > > > > -for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; 
> > > > > level++) {
> > > > > +for (level = VTD_SL_PT_LEVEL; level < max_level; level++) {
> > > > 
> > > > My understanding of current IOTLB is that it only caches the last
> > > > level of mapping, say:
> > > > 
> > > >   - level 1: 4K page
> > > >   - level 2: 2M page
> > > >   - level 3: 1G page
> > > > 
> > > > So we don't check against level=4 even if x-aw-bits=48 is specified.
> > > > 
> > > > Here does it mean that we're going to have... 512G iommu huge pages?
> > > > 
> > > 
> > > No. My bad, I misunderstood this routine. And now I believe we do not
> > > need this patch. :-)
> > 
> > Yeah good to confirm that :-)
> 
> Sorry, Peter. I still have question about this part. I agree we do not need
> to do the extra loop - therefore no need for the max_level part introduced
> in this patch.
> 
> But as to modification of VTD_IOTLB_SID_SHIFT/VTD_IOTLB_LVL_SHIFT, we may
> still need to do it due to the enlarged gfn, to search an IOTLB entry for
> a 4K mapping, the pfn itself could be as large as 45-bit.

Agreed.

> 
> Besides, currently vtd_get_iotlb_gfn() is just shifting 12 bits for all
> different levels, is this necessary? I mean, how about we do the shift
> based on current level?
> 
>  static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t level)
>  {
> -return (addr & vtd_slpt_level_page_mask(level)) >> VTD_PAGE_SHIFT_4K;
> +uint32_t shift = vtd_slpt_level_shift(level);
> +return (addr & vtd_slpt_level_page_mask(level)) >> shift;
>  }

IMHO we can, but I don't see much gain from it.

If we shift, we still need to use the maximum possible bits that a PFN
can hold, which is 45bits (when with 4k pages), so we can't gain
anything out of it (no saved bits on iotlb key).  Instead, we'll need
to call more vtd_slpt_level_shift() for each vtd_get_iotlb_gfn() which
even seems a bit slower.

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [RFC PATCH 07/11] scripts/decodetree: Add add_func_check()

2018-11-12 Thread Richard Henderson
On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
>  for n, f in self.fields.items():
>  output(ind, 'u.f_', arg, '.', n, ' = ', f.str_extract(), ';\n')
> +for f, a in self.check_funcs:
> +output(ind, 'check_', f, '(ctx, ', a, ');\n')
>  output(ind, 'return ', translate_prefix, '_', self.name,
> '(ctx, _', arg, ');\n')

I don't see the value in asserting the isa here.
If this were a conditional check, which might fall through to either another
pattern or to return false, maybe.


r~



Re: [Qemu-devel] [RFC PATCH 08/11] target/mips: Add a decodetree stub

2018-11-12 Thread Richard Henderson
On 11/12/18 6:37 AM, Aleksandar Markovic wrote:
>> Subject: [RFC PATCH 08/11] target/mips: Add a decodetree stub
> 
> There is no plan to use decodetree for MIPS target. MIPS decoding engine is
> mostly stable mature code that was well tested over many years, and there is 
> no
> point in introducing such drastic change to the code that works.

This attitude is unnecessarily obstructionist.

The reorganization of the instruction implementation that is implied by a
transition to decodetree is exactly what you and I talked about some months ago.

The ability to use multiple decodetree parsers to vector directly to the
instruction implementations would help unify code across multiple encoding
schemes.  (There are 4 now: mips, mips16, micromips, nanomips; and mipsr6 is
different enough that it could be considered a 5th encoding.)

In a sample transition of 10 insns that Philippe sent me via private email this
weekend, I noticed an existing bug in MULT{,U} that fails to enforce rd == 0 on
cpus that have only one accumulator -- any bets there are no more to be found?
 You have just this year added yet another encoding scheme for nanomips.  Your
statement "well tested over many years" is a bit of a stretch.

If you do not wish to work on this, that's your prerogative.  But I don't think
you should be arbitrarily shutting down experimentation on this line before it
has a chance to show results.


r~



Re: [Qemu-devel] [RFC] [PATCH] kvm: arm: Introduce error code KVM_EINVARIANT

2018-11-12 Thread Christoffer Dall
Hi Manish,

On Sat, Nov 10, 2018 at 10:18:47PM +, Manish Jaggi wrote:
> 
> CCing a larger audience.
> Please review.
> 
> On 10/23/2018 03:51 PM, Jaggi, Manish wrote:
> > From: Manish Jaggi 
> >
> > This patch introduces an error code KVM_EINVARIANT which is returned
> > by KVM when userland tries to set an invariant register.
> >
> > The need for this error code is in VM Migration for arm64.
> > ARM64 systems use mainly -machine virt -cpu host as parameter to qemu.
> > Migration requires both Source and destination machines to have same
> > physical cpu. There are cases where the overall architecture of CPU is
> > same but the next version of the chip with some bug fixes which have no
> > effect on qemu operation. In such cases invariant registers like MIDR
> > have a different value.
> > Currently Migration fails in such cases.
> >
> > Rather than sending an EINVAL, a specific error code will help
> > userland program the guest invariant register by querying the migrated
> > host machines invariant registers.
> >
> > Qemu will have a parameter -hostinvariant along with checking of this
> > error code. So it can be safely assumed that the feature is opt-in
> >
> > Corresponding Qemu patchset can be found at :
> > https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg05048.html
> >
> >
> > Signed-off-by: Manish Jaggi 
> > ---
> >   arch/arm64/kvm/sys_regs.c | 9 -
> >   include/uapi/linux/kvm_para.h | 1 +
> >   2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > index 22fbbdb..78ffc02 100644
> > --- a/arch/arm64/kvm/sys_regs.c
> > +++ b/arch/arm64/kvm/sys_regs.c
> > @@ -,7 +,7 @@ static int __set_id_reg(const struct sys_reg_desc 
> > *rd, void __user *uaddr,
> >   
> > /* This is what we mean by invariant: you can't change it. */
> > if (val != read_id_reg(rd, raz))
> > -   return -EINVAL;
> > +   return -KVM_EINVARIANT;
> >   
> > return 0;
> >   }
> > @@ -2254,9 +2254,8 @@ static int set_invariant_sys_reg(u64 id, void __user 
> > *uaddr)
> > return err;
> >   
> > /* This is what we mean by invariant: you can't change it. */
> > -   if (r->val != val)
> > -   return -EINVAL;
> > -
> > +   if (r->val != val)  
> > +   return -KVM_EINVARIANT;
> > return 0;
> >   }
> >   
> > @@ -2335,7 +2334,7 @@ static int demux_c15_set(u64 id, void __user *uaddr)
> >   
> > /* This is also invariant: you can't change it. */
> > if (newval != get_ccsidr(val))
> > -   return -EINVAL;
> > +   return -KVM_EINVARIANT;
> > return 0;
> > default:
> > return -ENOENT;
> > diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> > index 6c0ce49..4358669 100644
> > --- a/include/uapi/linux/kvm_para.h
> > +++ b/include/uapi/linux/kvm_para.h
> > @@ -17,6 +17,7 @@
> >   #define KVM_E2BIG E2BIG
> >   #define KVM_EPERM EPERM
> >   #define KVM_EOPNOTSUPP      95
> > +#define KVM_EINVARIANT  96
> >   
> >   #define KVM_HC_VAPIC_POLL_IRQ 1
> >   #define KVM_HC_MMU_OP 2
> 

Eh, unless I'm severely missing something, I think you're confusing how
these error codes are used.

The error codes in include/uapi/linux/kvm_para.h are for hypercalls from
guests, not return values to userspace.  For the latter, we don't
have a habit of inventing new constants for something subsystem
specific, but instead we use the existing error codes which can be
understood by userspace software.

What's more, changing an error code is an ABI change and not something
you can just do at will.  If we are overloading EINVAL with the
KVM_SET_ONE_REG syscall (are we?), then you need to find some way to
communicate the additional information you need to userspace without
interfering with legacy userspace software's interaction with the
kernel.


Thanks,

Christoffer



Re: [Qemu-devel] [Qemu-block] [PATCH] tests: Fix Python 3 detection on older GNU make versions

2018-11-12 Thread Markus Armbruster
Max Reitz  writes:

> On 08.11.18 13:43, Markus Armbruster wrote:
>> Philippe Mathieu-Daudé  writes:
>> 
>>> Hi Markus,
>>>
>>>
>>> Le jeu. 8 nov. 2018 09:46, Markus Armbruster  a écrit :
>>>
 Cleber Rosa  writes:

> On 11/7/18 1:05 AM, Markus Armbruster wrote:
>> [...]
>> PEP 394 recommends software distributions install Python 3 into the
>> default path as python3, and users use that instead of python, except
>> for programs that are source compatible with both 2 and 3.  So, is
>> finding out whether python is a Python 3 really appropriate?  Why can't
>> we just use python3 and be done with it?
>>
>
> I mentioned that before, when you pointed out the issue you fix here,
> that configure may be the best place to get the Python version (not only
> the major version) and make it available elsewhere.  Even if not used
> for other purposes, it is IMO important information to show on the
> resulting "configure" output.
>
> I'm sending patches to do that in a few.
>
>> If we can't: isn't this a configure problem?
>>
>
> I believe adhering to PEP394 is a good thing, but even that document
> recognizes that the real world is not a perfect place: "however, end
> users should be aware that python refers to python3 on at least Arch
> Linux".  And it only covers *nix systems, so if we hope to care for
> non-*nix systems, we have to check the Python version manually.
>
> So, I guess the safest approach from QEMU's side is to check for the
> version indeed.

 If somebody can point to a system people still use where python3 doesn't
 get you a Python 3, but python does, catering for such (crappy) systems
 in configure makes sense.  Until then, it's a waste of brain waves and
 configure run time.

 PEP 394 mentions Arch Linux.  It's been seven years.  What's the most
 recent version of Arch Linux that's still crappy in this regard?

>>>
>>> Arch doesn't provide python2 by default, thus python points to python3.
>> 
>> That's non-crappy as long as python3 also exists, as PEP 394 recommends.
>> Does it?
>
> Sure it does.
>
> Arch is just problematic in how it handles "python" itself.  I don't
> think there is any system that has Python 3 but no "python3".

Bottom line: we can safely assume that a host that has Python 3 provides
it as python3.

Corollary:

* When python3 doesn't exist, checking whether python is Python 3 is
  futile.

* Checking whether python3 really is Python 3 is silly.



Re: [Qemu-devel] [PATCH] edid: silence a stringop-overflow warning

2018-11-12 Thread Markus Armbruster
Marc-André Lureau  writes:

> Simplify the code that doesn't need strncpy() since length of string
> is already computed.
>
> /home/elmarco/src/qemu/hw/display/edid-generate.c: In function 
> 'edid_desc_text':
> /home/elmarco/src/qemu/hw/display/edid-generate.c:168:5: error: 'strncpy' 
> specified bound depends on the length of the source argument 
> [-Werror=stringop-overflow=]
>  strncpy((char *)(desc + 5), text, len);
>  ^~
> /home/elmarco/src/qemu/hw/display/edid-generate.c:164:11: note: length 
> computed here
>  len = strlen(text);
>^~~~
> cc1: all warnings being treated as errors
>
> Signed-off-by: Marc-André Lureau 
> ---
>  hw/display/edid-generate.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/display/edid-generate.c b/hw/display/edid-generate.c
> index bdf5e1d4d4..77d9127344 100644
> --- a/hw/display/edid-generate.c
> +++ b/hw/display/edid-generate.c
> @@ -165,7 +165,7 @@ static void edid_desc_text(uint8_t *desc, uint8_t type,
>  if (len > 12) {
>  len = 12;
>  }
> -strncpy((char *)(desc + 5), text, len);
> +memcpy(desc + 5, text, len);
>  desc[5 + len] = '\n';
>  }

Reviewed-by: Markus Armbruster 



Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Yu Zhang
On Mon, Nov 12, 2018 at 04:36:34PM +0800, Peter Xu wrote:
> On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> > A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> > E.g. guest applications like DPDK prefer to use its VA as IOVA when
> > performing VFIO map/unmap operations, to avoid the burden of managing the
> > IOVA space.
> 
> Since you mentioned about DPDK... I'm just curious that whether have
> you tested the patchset with the 57bit-enabled machines with DPDK VA
> mode running in the guest? That would be something nice to mention in
> the cover letter if you have.
> 

Hah. Maybe I shall not mention DPDK here. 

The story is that we heard the requirement, saying applications like DPDK
would need 5-level paging in IOMMU side. And I was convinced after checked
DPDK code, seeing it may use VA as IOVA directly. But I did not test this
patch with DPDK.

Instead, I used kvm-unit-test to verify this patch series. And of course, I
also did some modification to the test case. Patch for the test also sent out
at https://www.spinics.net/lists/kvm/msg177425.html.

> [...]
> 
> > @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, 
> > Error **errp)
> >  }
> >  }
> >  
> > -/* Currently only address widths supported are 39 and 48 bits */
> > +/* Currently address widths supported are 39, 48, and 57 bits */
> >  if ((s->aw_bits != VTD_AW_39BIT) &&
> > -(s->aw_bits != VTD_AW_48BIT)) {
> > -error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> > -   VTD_AW_39BIT, VTD_AW_48BIT);
> > +(s->aw_bits != VTD_AW_48BIT) &&
> > +(s->aw_bits != VTD_AW_57BIT)) {
> > +error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> > +   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> > +return false;
> > +}
> > +
> > +if ((s->aw_bits == VTD_AW_57BIT) &&
> > +!(host_has_la57() && guest_has_la57())) {
> > +error_setg(errp, "Do not support 57-bit DMA address, unless both "
> > + "host and guest are capable of 5-level 
> > paging.\n");
> 
> Is there any context (or pointer to previous discussions would work
> too) on explaining why we don't support some scenarios like
> host_paw=48,guest_paw=48,guest_gaw=57?
> 

Well, above check is only to make sure both the host and the guest can
use 57bit linear address, which requires 5-level paging. So I believe
we do support scenarios like host_paw=48,guest_paw=48,guest_gaw=57.
The guest_has_la57() means the guest can use 57-bit linear address,
regardless of its physical address width.

> Thanks,
> 
> -- 
> Peter Xu
> 

B.R.
Yu



Re: [Qemu-devel] [PATCH] hw/bt: drop bluetooth keyboard emulation.

2018-11-12 Thread Thomas Huth
On 2018-11-09 15:14, Gerd Hoffmann wrote:
> Broken (segfaults on first keypress) and apparently unused.

Broken since QEMU v2.1 (!) already, if I've got that right - please
mention that version number, so that it is clear that it is broken since
more than two releases (i.e. it has been "deprecated" by being broken
for so long).

There is also one more occurrence of "-bt device:keyboard" in
qemu-doc.texi which you should now remove, too, I think.

 Thomas



Re: [Qemu-devel] [PATCH v1 3/3] intel-iommu: search iotlb for levels supported by the address width.

2018-11-12 Thread Peter Xu
On Fri, Nov 09, 2018 at 07:49:47PM +0800, Yu Zhang wrote:
> This patch updates vtd_lookup_iotlb() to search cached mappings only
> for all page levels supported by address width of current vIOMMU. Also,
> to cover 57-bit width, the shift of source id(VTD_IOTLB_SID_SHIFT) and
> of page level(VTD_IOTLB_LVL_SHIFT) are enlarged by 9 - the stride of
> one paging structure level.
> 
> Signed-off-by: Yu Zhang 
> ---
> Cc: "Michael S. Tsirkin" 
> Cc: Marcel Apfelbaum 
> Cc: Paolo Bonzini  
> Cc: Richard Henderson  
> Cc: Eduardo Habkost 
> Cc: Peter Xu 
> ---
>  hw/i386/intel_iommu.c  | 5 +++--
>  hw/i386/intel_iommu_internal.h | 7 ++-
>  2 files changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 9cdf755..ce7e17e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -254,11 +254,12 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, uint32_t 
> level)
>  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t 
> source_id,
> hwaddr addr)
>  {
> -VTDIOTLBEntry *entry;
> +VTDIOTLBEntry *entry = NULL;
>  uint64_t key;
>  int level;
> +int max_level = (s->aw_bits - VTD_PAGE_SHIFT_4K) / VTD_SL_LEVEL_BITS;
>  
> -for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
> +for (level = VTD_SL_PT_LEVEL; level < max_level; level++) {

My understanding of current IOTLB is that it only caches the last
level of mapping, say:

  - level 1: 4K page
  - level 2: 2M page
  - level 3: 1G page

So we don't check against level=4 even if x-aw-bits=48 is specified.

Here does it mean that we're going to have... 512G iommu huge pages?

>  key = vtd_get_iotlb_key(vtd_get_iotlb_gfn(addr, level),
>  source_id, level);
>  entry = g_hash_table_lookup(s->iotlb, &key);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index a7ef24b..bdf2b7c 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -114,8 +114,8 @@
>   VTD_INTERRUPT_ADDR_FIRST + 1)
>  
>  /* The shift of source_id in the key of IOTLB hash table */
> -#define VTD_IOTLB_SID_SHIFT 36
> -#define VTD_IOTLB_LVL_SHIFT 52
> +#define VTD_IOTLB_SID_SHIFT 45
> +#define VTD_IOTLB_LVL_SHIFT 61
>  #define VTD_IOTLB_MAX_SIZE  1024/* Max size of the hash table */
>  
>  /* IOTLB_REG */
> @@ -450,9 +450,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SL_LEVEL_BITS   9
>  
>  /* Second Level Paging Structure */
> -#define VTD_SL_PML4_LEVEL   4
> -#define VTD_SL_PDP_LEVEL3
> -#define VTD_SL_PD_LEVEL 2
>  #define VTD_SL_PT_LEVEL 1
>  #define VTD_SL_PT_ENTRY_NR  512
>  
> -- 
> 1.9.1
> 

Regards,

-- 
Peter Xu



[Qemu-devel] qdev has no maintainer (was: [PATCH] hw: set_netdev: remove useless code)

2018-11-12 Thread Markus Armbruster
Thomas Huth  writes:

> On 2018-11-09 09:21, Laurent Vivier wrote:
>> On 09/11/2018 09:13, Li Qiang wrote:
>>> In set_netdev(), peers[i] is initialized by
>>> qemu_find_net_clients_except() when i is between
>>> 0 and 'queues', so it can't be NULL.
>>>
>>> Signed-off-by: Li Qiang 
>>> ---
>>>  hw/core/qdev-properties-system.c | 4 
>>>  1 file changed, 4 deletions(-)
>>>
>>> diff --git a/hw/core/qdev-properties-system.c 
>>> b/hw/core/qdev-properties-system.c
>>> index 8b22fb5..b45a7ef 100644
>>> --- a/hw/core/qdev-properties-system.c
>>> +++ b/hw/core/qdev-properties-system.c
>>> @@ -288,10 +288,6 @@ static void set_netdev(Object *obj, Visitor *v, const 
>>> char *name,
>>>  }
>>>  
>>>  for (i = 0; i < queues; i++) {
>>> -if (peers[i] == NULL) {
>>> -err = -ENOENT;
>>> -goto out;
>>> -}
>>>  
>>>  if (peers[i]->peer) {
>>>  err = -EEXIST;
>>>
>> 
>> Reviewed-by: Laurent Vivier 
>
> So who can pick up such patches? We don't have a maintainer for the QDEV
> subsystem anymore... should it go via qemu-trivial?

We've never had a formal qdev maintainer.

"qdev is too important to continue much longer without a maintainer."
Message-ID: 
https://lists.gnu.org/archive/html/qemu-devel/2011-06/msg01053.html

The closest we got to appointing one was probably here:
Message-ID: <56b33c2a.8030...@redhat.com>
https://lists.gnu.org/archive/html/qemu-devel/2016-02/msg00949.html

In my opinion, not having a maintainer push or at least direct the cart
is one of the reasons of the relatively sad state of qdev.

Same for QOM, where we had an active maintainer for a while, but no
more.



Re: [Qemu-devel] [RFC PATCH 08/11] target/mips: Add a decodetree stub

2018-11-12 Thread Aleksandar Markovic
Hello, Richard.

I am a little taken aback by your tone. I hope we can communicate in a much 
friendlier manner, as we used to do.

>You have just this year added yet another encoding scheme for nanomips. Your 
>statement "well tested over many years" is a bit of a stretch.

I wrote "mostly stable mature", not "all stable mature".

>If you do not wish to work on this, that's your prerogative.  But I don't think
you should be arbitrarily shutting down experimentation on this line before it
has a chance to show results.

I am not preventing anyone from experimenting. I just want to warn Philippe 
about the high-level view that the code in question, although not the nicest, 
works, and is planned to be maintained with minimal changes. The focus of MIPS 
target is on adding new architectures and ASEs, and (I correct myself) it could 
be that decodetree would kick in in such cases - but not for mature code 
driving older architectures. It just doesn't make enough sense.

Thanks,
Aleksandar


From: Richard Henderson 
Sent: Monday, November 12, 2018 9:58:05 AM
To: Aleksandar Markovic; Philippe Mathieu-Daudé; Bastian Koppelmann; Peer 
Adelt; Richard Henderson
Cc: qemu-devel@nongnu.org; Aurelien Jarno
Subject: Re: [Qemu-devel] [RFC PATCH 08/11] target/mips: Add a decodetree stub

On 11/12/18 6:37 AM, Aleksandar Markovic wrote:
>> Subject: [RFC PATCH 08/11] target/mips: Add a decodetree stub
>
> There is no plan to use decodetree for MIPS target. MIPS decoding engine is
> mostly stable mature code that was well tested over many years, and there is 
> no
> point in introducing such drastic change to the code that works.

This attitude is unnecessarily obstructionist.

The reorganization of the instruction implementation that is implied by a
transition to decodetree is exactly what you and I talked about some months ago.

The ability to use multiple decodetree parsers to vector directly to the
instruction implementations would help unify code across multiple encoding
schemes.  (There are 4 now: mips, mips16, micromips, nanomips; and mipsr6 is
different enough that it could be considered a 5th encoding.)

In a sample transition of 10 insns that Philippe sent me via private email this
weekend, I noticed an existing bug in MULT{,U} that fails to enforce rd == 0 on
cpus that have only one accumulator -- any bets there are no more to be found?
 You have just this year added yet another encoding scheme for nanomips.  Your
statement "well tested over many years" is a bit of a stretch.

If you do not wish to work on this, that's your prerogative.  But I don't think
you should be arbitrarily shutting down experimentation on this line before it
has a chance to show results.


r~



Re: [Qemu-devel] [PATCH 0/2] virtio-9p: qmp interface for set/query io throttle for fsdev devices

2018-11-12 Thread Greg Kurz
On Mon, 12 Nov 2018 01:14:55 +
xiezhide  wrote:

> These patches provide the qmp interface, to set/query the io throttle
> status of the all fsdev devices that are present in a vm.
> Some of the patches also remove the
> duplicate code that was present in block and fsdev files.
> 
> Zhide Xie (2):
>   fsdev-qmp: qmp interface for set/query io throttle for fsdev devices.
>   fsdev-qmp: fix coding style issue
> 

Hi Zhide Xie,

This series seems to have some formatting issues, like the broken
indention in the diffstat below for example. Also, no message
threading, weird Sob lines:

Signed-off-by: x00390961 <xiezh...@huawei.com>

and finally I couldn't even apply patch 1:

$ pwclient git-am 996225 
Applying patch #996225 using 'git am'
Description: [1/2] virtio-9p: qmp interface to set/query io throttle for fsdev 
devices
Applying: virtio-9p: qmp interface to set/query io throttle for fsdev devices

error: corrupt patch at line 36 <== this confirms that the patch has formatting 
issues.

Patch failed at 0001 virtio-9p: qmp interface to set/query io throttle for 
fsdev devices
Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
'git am' failed with exit status 128


Not sure what mail client was used to send this (I don't see any indication
in the mail headers), but if it's not the case already, I suggest you use
git format-patch and git send-email.

I don't have much time, so I'll wait for the well-formatted v2 before
starting review :-)

> Makefile|  20 +++-
> Makefile.objs   |   8 ++
> block/throttle.c   |   6 +-
> blockdev.c |  96 +
> fsdev/qemu-fsdev-dummy.c|  11 ++
> fsdev/qemu-fsdev-throttle.c| 144 +-
> fsdev/qemu-fsdev-throttle.h   |   6 +-
> fsdev/qemu-fsdev.c|  29 ++
> hmp-commands-info.hx |  15 +++
> hmp-commands.hx |  15 +++
> hmp.c |  83 
> +--
> hmp.h |   4 +
> include/qemu/throttle-options.h|   3 +-
> include/qemu/throttle.h   |   4 +-
> include/qemu/typedefs.h  |   1 +
> monitor.c   |   4 +
> qapi/block-core.json   | 122 +-
> qapi/fsdev.json |  96 +
> qapi/qapi-schema.json   |   1 +
> qapi/tlimits.json   |  89 
> qmp.c  |  12 +++
> util/throttle.c| 224 
> ++--
> 22 files changed, 639 insertions(+), 354 deletions(-)
> create mode 100644 qapi/fsdev.json
> create mode 100644 qapi/tlimits.json
> 


> --
> 1.8.3.1




Re: [Qemu-devel] [PATCH v4 14/15] block: Remove assertions from update_flags_from_options()

2018-11-12 Thread Alberto Garcia
On Sun 11 Nov 2018 10:01:05 PM CET, Max Reitz wrote:
> On 07.11.18 13:59, Alberto Garcia wrote:
>> This function takes three options (cache.direct, cache.no-flush and
>> read-only) from a QemuOpts object and updates the flags accordingly.
>
> and auto-read-only now

Oops, will update.

> Hm, seems like one way to solve it and I can't really find issue with
> it.  So, let's first give a
>
> Reviewed-by: Max Reitz 
>
> However, I wonder why you dropped your patch from v1 for this.  It
> seemed more reasonable to me.  You're basically trading half-updating
> the flags for just not touching them at all (and the latter seems
> better, even though it's all an error in the end anyway).

The main reason why I'm doing this is because if we keep the assertions
then we're forced to have these four options always set, and I don't see
any reason why they would need to be.

It's not a problem now but it will be later on. Have a look at this
early implementation of qmp_x_blockdev_reopen():

   https://lists.gnu.org/archive/html/qemu-block/2018-06/msg00795.html

Here we need to explicitly set those options to false if they're
unset. 'false' is already the default value of all of them, so this
shouldn't be necessary, but if we don't do it we'd hit the assertions
that I'm removing in this patch.

Berto



Re: [Qemu-devel] [PATCH v4 14/15] block: Remove assertions from update_flags_from_options()

2018-11-12 Thread Alberto Garcia
On Sun 11 Nov 2018 10:01:05 PM CET, Max Reitz wrote:
>> -assert(qemu_opt_find(opts, BDRV_OPT_CACHE_DIRECT));
>>  if (qemu_opt_get_bool_del(opts, BDRV_OPT_CACHE_DIRECT, false)) {
>>  *flags |= BDRV_O_NOCACHE;
>>  }
>>  
>>  *flags &= ~BDRV_O_RDWR;
>
> Unrelated to this patch, but isn't BDRV_O_AUTO_RDONLY missing here?

I forgot to mention, but I think you're right here. I'll include this
fix in the next version of the series.

Berto



Re: [Qemu-devel] [PATCH] decodetree: Force Python to print unsigned values

2018-11-12 Thread Richard Henderson
On 11/12/18 12:06 AM, Philippe Mathieu-Daudé wrote:
>> On 11/11/18 1:02 AM, Philippe Mathieu-Daudé wrote:
>>> Python internal representation is signed, so unsigned values
>>> bigger than 31-bit are interpreted as signed (and printed with
>>> a '-' sign).
>>> Mask out to force unsigned values.
>>>
>>> Signed-off-by: Philippe Mathieu-Daudé 
>>> ---
>>>  scripts/decodetree.py | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Queued, thanks.
> 
> Can you drop this from your queue? I'll send a cleaner approach (as
> RFC, and I also want to log the error I got in the commit msg).

Done.


r~



Re: [Qemu-devel] [PATCH v2] pulseaudio: process audio data in smaller chunks

2018-11-12 Thread Gerd Hoffmann
On Sat, Nov 10, 2018 at 08:36:39PM +0100, Philippe Mathieu-Daudé wrote:
> Hi Gerd,
> 
> On 11/9/18 3:20 PM, Gerd Hoffmann wrote:
> > The rate of pulseaudio absorbing the audio stream is used to control
> > the rate of the guest's audio stream.  When the emulated hardware uses
> > small chunks (like intel-hda does) we need small chunks on the audio
> > backend side too, otherwise that feedback loop doesn't work very well.
> 
> Shouldn't this be user-configurable?

Why?

cheers,
  Gerd




Re: [Qemu-devel] [PATCH] virtio-net: support RSC v4/v6 tcp traffic for Windows HCK

2018-11-12 Thread Jason Wang



On 2018/11/12 4:57 PM, Yuri Benditovich wrote:
> On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin  wrote:
> > On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
> > > > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> > > >      VirtIONetQueue *vqs;
> > > >      VirtQueue *ctrl_vq;
> > > >      NICState *nic;
> > > > +    QTAILQ_HEAD(, NetRscChain) rsc_chains;
> > >
> > > what exactly happens with these chains on migration?
> > >
> > > This feature (software implementation of RSC in QEMU) is intended to be
> > > used in the environment of certification tests which never uses migration.
> >
> > Should this feature disable migration then?
>
> IMO, this should not. But if you find it mandatory, please respond and
> I will add the migration blocker.

So if my understanding is correct, it's safe to do nothing even if we
allow migration for RSC?


Thanks





Re: [Qemu-devel] [RFC PATCH 0/2] Acceptance tests for qemu-img

2018-11-12 Thread Kevin Wolf
Am 09.11.2018 um 23:12 hat Cleber Rosa geschrieben:
> The initial goal of this RFC is to get feedback on tests not specific
> to the QEMU main binary, but specific to other components such as
> qemu-img.
> 
> For this experiment, a small issue with the zero and negative number
> of I/O operations given to the bench command was chosen.

Any reason why this shouldn't be in qemu-iotests?

Kevin



Re: [Qemu-devel] [Qemu-arm] [PATCH for-3.1] target/arm: Fix typo in tlbi_aa64_vmalle1_write

2018-11-12 Thread Alex Bennée


Richard Henderson  writes:

> This would cause an infinite recursion or loop.
>
> Signed-off-by: Richard Henderson 

I feel the title undersells the importance of the fix ;-)

Reviewed-by: Alex Bennée 

> ---
>  target/arm/helper.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index 96301930cc..a327988c66 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -3155,7 +3155,7 @@ static void tlbi_aa64_vmalle1_write(CPUARMState *env, 
> const ARMCPRegInfo *ri,
>  CPUState *cs = ENV_GET_CPU(env);
>
>  if (tlb_force_broadcast(env)) {
> -tlbi_aa64_vmalle1_write(env, NULL, value);
> +tlbi_aa64_vmalle1is_write(env, NULL, value);
>  return;
>  }


--
Alex Bennée



Re: [Qemu-devel] [Qemu-arm] [PATCH 1/2] target/arm: Hyp mode R14 is shared with User and System

2018-11-12 Thread Edgar E. Iglesias
On Fri, Nov 09, 2018 at 06:15:20PM +, Peter Maydell wrote:
> On 9 November 2018 at 17:35, Peter Maydell  wrote:
> > Hyp mode is an exception to the general rule that each AArch32
> > mode has its own r13, r14 and SPSR -- it has a banked r13 and
> > SPSR but shares its r14 with User and System mode. We were
> > incorrectly implementing it as banked, which meant that on
> > entry to Hyp mode r14 was 0 rather than the USR/SYS r14.
> >
> > We provide a new function r14_bank_number() which is like
> > the existing bank_number() but provides the index into
> > env->banked_r14[]; bank_number() provides the index to use
> > for env->banked_r13[] and env->banked_cpsr[].
> >
> > All the points in the code that were using bank_number()
> > to index into env->banked_r14[] are updated for consistency:
> >  * switch_mode() -- this is the only place where we fix
> >an actual bug
> >  * aarch64_sync_32_to_64() and aarch64_sync_64_to_32():
> >no behavioural change as we already special-cased Hyp R14
> >  * kvm32.c: no behavioural change since the guest can't ever
> >be in Hyp mode, but conceptually the right thing to do
> >  * msr_banked()/mrs_banked(): we can never get to the case
> >that accesses banked_r14[] with tgtmode == ARM_CPU_MODE_HYP,
> >so no behavioural change
> >
> > Signed-off-by: Peter Maydell 
> > ---
> >  target/arm/internals.h | 16 
> >  target/arm/helper.c| 29 +++--
> >  target/arm/kvm32.c |  4 ++--
> >  target/arm/op_helper.c |  2 +-
> >  4 files changed, 34 insertions(+), 17 deletions(-)
> 
> Rats, this bit accidentally didn't make it into this patch:
> 
> diff --git a/target/arm/op_helper.c b/target/arm/op_helper.c
> index 2b62c53f5b5..eb6fb82fb81 100644
> --- a/target/arm/op_helper.c
> +++ b/target/arm/op_helper.c
> @@ -725,7 +725,7 @@ uint32_t HELPER(mrs_banked)(CPUARMState *env,
> uint32_t tgtmode, uint32_t regno)
>  case 13:
>  return env->banked_r13[bank_number(tgtmode)];
>  case 14:
> -return env->banked_r14[bank_number(tgtmode)];
> +return env->banked_r14[r14_bank_number(tgtmode)];
>  case 8 ... 12:
>  switch (tgtmode) {
>  case ARM_CPU_MODE_USR:
> 
> 
> (it's one of the "no behavioural change" bits).
> 


Reviewed-by: Edgar E. Iglesias 
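
The sharing rule the patch implements can be sketched with toy index functions (illustrative mode indices, not the real AArch32 encodings or the actual QEMU helpers):

```c
#include <assert.h>

/* Illustrative bank indices, not the real AArch32 mode encodings. */
enum { BANK_USRSYS, BANK_HYP, BANK_IRQ, BANK_SVC };

/* Sketch of the fix: Hyp gets its own slot for r13/SPSR, but its r14 is
 * shared with User/System, so the r14 index folds HYP back onto the
 * USR/SYS bank while the r13/SPSR index does not. */
static int bank_number(int bank)
{
    return bank;                     /* every mode has its own r13/SPSR slot */
}

static int r14_bank_number(int bank)
{
    return bank == BANK_HYP ? BANK_USRSYS : bank;   /* Hyp shares USR/SYS r14 */
}
```

Indexing banked_r14[] through the second function is what makes r14 visible from Hyp equal to the USR/SYS r14 instead of 0.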




Re: [Qemu-devel] [PATCH v1 1/3] intel-iommu: differentiate host address width from IOVA address width.

2018-11-12 Thread Yu Zhang
On Mon, Nov 12, 2018 at 04:15:46PM +0800, Peter Xu wrote:
> On Fri, Nov 09, 2018 at 07:49:45PM +0800, Yu Zhang wrote:
> > Currently, vIOMMU is using the value of IOVA address width, instead of
> > the host address width(HAW) to calculate the number of reserved bits in
> > data structures such as root entries, context entries, and entries of
> > DMA paging structures etc.
> > 
> > However, the IOVA address width and the HAW may not be equal. For
> > example, a 48-bit IOVA can only be mapped to host addresses no wider than
> > 46 bits. Using 48 instead of 46 to calculate the reserved bits may result
> > in an invalid IOVA being accepted.
> > 
> > To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> > whose value is initialized based on the maximum physical address set to
> > guest CPU. Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> > to clarify.
> 
> IIRC I raised this question some time ago somewhere but no one
> remembered to follow that up.  Thanks for fixing it.
> 
> It looks mostly good to me, only one tiny comment below...
> 
> [...]
> 
> > @@ -887,6 +887,7 @@ static int vtd_page_walk_level(dma_addr_t addr, 
> > uint64_t start,
> >  uint64_t iova = start;
> >  uint64_t iova_next;
> >  int ret = 0;
> > +uint8_t haw = info->as->iommu_state->haw_bits;
> 
> For now vtd_page_walk_info->aw_bits caches the GAW information and we
> use a single vtd_page_walk_info during one page walk, maybe we can
> also do the same for HAW instead of fetching it every time here from
> info->as->iommu_state->haw_bits?

Thank you, Peter. And yes, using haw_bits in vtd_page_walk_info is better. :)

> 
> Regards,
> 
> -- 
> Peter Xu
> 

B.R.
Yu
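
The 48-vs-46 example from the commit message can be illustrated with a simplified standalone validity check (not the actual vIOMMU reserved-bit code):

```c
#include <assert.h>
#include <stdint.h>

/* If an entry may reference addresses only below 2^haw, then any bit at
 * or above position haw is reserved and must be zero. */
static int addr_within_haw(uint64_t addr, unsigned haw)
{
    return (addr & ~((1ULL << haw) - 1)) == 0;
}
```

An address with bit 47 set passes a check computed from the 48-bit IOVA width but is invalid under a 46-bit HAW, which is exactly the bug the new haw_bits field closes.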



Re: [Qemu-devel] [PULL 0/4] slirp updates

2018-11-12 Thread Peter Maydell
On 10 November 2018 at 14:09, Samuel Thibault
 wrote:
> The following changes since commit 160e5c22e55b3f775c2003dfc626fa872ee4a7a1:
>
>   Merge remote-tracking branch 'remotes/gkurz/tags/for-upstream' into staging 
> (2018-11-09 10:54:10 +)
>
> are available in the Git repository at:
>
>   https://people.debian.org/~sthibault/qemu.git tags/samuel-thibault
>
> for you to fetch changes up to 5c75f3adbbfcdf8fae6e74875b44efb8d928974a:
>
>   slirp: fork_exec(): create and connect child socket before fork() 
> (2018-11-10 15:07:53 +0100)
>
> 
> slirp updates
>
> Peter Maydell (4):
>   slirp: Don't pass possibly -1 fd to send()
>   slirp: Use g_new() to allocate sockets in socreate()
>   slirp: Remove code that handles socreate() failure
>   slirp: fork_exec(): create and connect child socket before fork()
>

Applied, thanks.

-- PMM



Re: [Qemu-devel] List of files containing devices which have not been QOMified

2018-11-12 Thread Peter Maydell
On 10 November 2018 at 15:20, Mark Cave-Ayland
 wrote:
> On 09/11/2018 10:31, Peter Maydell wrote:
>> I think the code I saw that looked like a non-QOMified
>> device was cpu_timer_create().

> Ah okay. The above timers are certainly internal CPU timers rather than 
> external, so
> should they still be QOMified? How are timers modelled for other CPUs?

So, a couple of caveats first:
 * I haven't really looked at the sparc code
 * as I mentioned elsewhere in this thread, I'm not sure this is
   necessarily our most-important cleanup/refactoring

There are a couple of "QOM-style" ways I know of to do cpu timers:

(1) the Arm Cortex-A9/A15 have some timers that are part of the cpu
in the sense that they're always there in hardware, but they're just
memory-mapped devices at a known address. We model these as
the usual sort of standalone QOM device, with a convenience
container object in hw/cpu/ that instantiates the various devices
and wires things up.

(2) the newer Arm "generic timers" are more closely coupled to the
CPU, because they're programmed via system registers. These we
actually model as part of the CPU object itself, with the code all
in target/arm. The CPU object then exposes outbound GPIO lines
which are the interrupt signals for the timers, and which the board
or SoC code wires up to the interrupt controller.

So it depends on how closely coupled the sparc cpu timers are to
the cpu, I think.

thanks
-- PMM
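
A toy model of approach (2), with invented names rather than QEMU's actual QOM/qdev API, just to show the wiring of an outbound IRQ line from a CPU-internal timer:

```c
#include <assert.h>
#include <stddef.h>

/* Toy IRQ line: the board decides which handler it is wired to. */
typedef void (*IrqHandler)(void *opaque, int level);

typedef struct {
    IrqHandler handler;   /* wired up by board/SoC code */
    void *opaque;
    long deadline;        /* when the timer should fire */
} ToyCpuTimer;

/* CPU side: the timer lives inside the CPU object and only drives the
 * exposed line; it knows nothing about the interrupt controller. */
static void toy_cpu_timer_run(ToyCpuTimer *t, long now)
{
    if (t->handler && now >= t->deadline) {
        t->handler(t->opaque, 1);   /* raise the interrupt line */
    }
}

/* Board side: a trivial "interrupt controller" latch. */
static void toy_intc_set_irq(void *opaque, int level)
{
    *(int *)opaque = level;
}
```

The board wires `toy_intc_set_irq` into the CPU's timer line, mirroring how the Arm generic timers expose GPIO lines that the SoC code connects to the GIC.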



Re: [Qemu-devel] [PATCH] 9p: write lock path in v9fs_co_open2()

2018-11-12 Thread Greg Kurz
On Mon, 12 Nov 2018 19:05:59 +0800
zhibin hu  wrote:

> yes, and this :
> 

Yeah, all the calls to v9fs_path_copy() in v9fs_create() happen in the
context of the main thread. They may race with any other access to the fid
path performed by some other command in the context of a worker thread. My
first guess is that v9fs_create() should take the write lock before writing
to the fid path.

BTW, if you could share all the reproducers you already have for these
heap-use-after-free issues, it would be appreciated, and probably speed
up the fixing.

> ==6094==ERROR: AddressSanitizer: heap-use-after-free on address
> 0x602e6751 at pc 0x562a8dc492b8 bp 0x7f6805d2fa10 sp 0x7f6805d2fa00
> READ of size 1 at 0x602e6751 thread T21
> #0 0x562a8dc492b7 in local_open_nofollow hw/9pfs/9p-local.c:59
> #1 0x562a8dc49361 in local_opendir_nofollow hw/9pfs/9p-local.c:92
> #2 0x562a8dc4bd6e in local_mknod hw/9pfs/9p-local.c:662
> #3 0x562a8dc521de in v9fs_co_mknod hw/9pfs/cofs.c:200
> #4 0x562a8dc4413e in v9fs_mknod hw/9pfs/9p.c:3044
> #5 0x562a8e600976 in coroutine_trampoline util/coroutine-ucontext.c:116
> #6 0x7f68713635ff in __correctly_grouped_prefixwc
> (/lib64/libc.so.6+0x4c5ff)
> 
> 0x602e6751 is located 1 bytes inside of 2-byte region
> [0x602e6750,0x602e6752)
> freed by thread T0 here:
> #0 0x7f687cdb9880 in __interceptor_free (/lib64/libasan.so.5+0xee880)
> #1 0x7f687c1494d1 in g_free (/lib64/libglib-2.0.so.0+0x524d1)
> #2 0x562a8dc30ce4 in v9fs_path_copy hw/9pfs/9p.c:195
> #3 0x562a8dc3f0f3 in v9fs_create hw/9pfs/9p.c:2286
> #4 0x562a8e600976 in coroutine_trampoline util/coroutine-ucontext.c:116
> #5 0x7f68713635ff in __correctly_grouped_prefixwc
> (/lib64/libc.so.6+0x4c5ff)
> 
> previously allocated by thread T5 here:
> #0 0x7f687cdb9c48 in malloc (/lib64/libasan.so.5+0xeec48)
> #1 0x7f6871421c37 in __GI___vasprintf_chk (/lib64/libc.so.6+0x10ac37)
> 
> Thread T21 created by T0 here:
> #0 0x7f687cd16443 in pthread_create (/lib64/libasan.so.5+0x4b443)
> #1 0x562a8e5bf61e in qemu_thread_create util/qemu-thread-posix.c:534
> #2 0x562a8e5adbe4 in do_spawn_thread util/thread-pool.c:135
> #3 0x562a8e5adc73 in spawn_thread_bh_fn util/thread-pool.c:143
> #4 0x562a8e5aa4d0 in aio_bh_call util/async.c:90
> #5 0x562a8e5aa787 in aio_bh_poll util/async.c:118
> #6 0x562a8e5b65bd in aio_dispatch util/aio-posix.c:436
> #7 0x562a8e5ab26f in aio_ctx_dispatch util/async.c:261
> #8 0x7f687c1438ac in g_main_context_dispatch
> (/lib64/libglib-2.0.so.0+0x4c8ac)
> 
> Thread T5 created by T0 here:
> #0 0x7f687cd16443 in pthread_create (/lib64/libasan.so.5+0x4b443)
> #1 0x562a8e5bf61e in qemu_thread_create util/qemu-thread-posix.c:534
> #2 0x562a8d7a2258 in qemu_kvm_start_vcpu /root/qemu-3.0.0/cpus.c:1935
> #3 0x562a8d7a2a0b in qemu_init_vcpu /root/qemu-3.0.0/cpus.c:2001
> #4 0x562a8da3ef0c in x86_cpu_realizefn
> /root/qemu-3.0.0/target/i386/cpu.c:4996
> #5 0x562a8dd3a1e8 in device_set_realized hw/core/qdev.c:826
> #6 0x562a8e2a1e62 in property_set_bool qom/object.c:1984
> #7 0x562a8e29db0e in object_property_set qom/object.c:1176
> #8 0x562a8e2a4b00 in object_property_set_qobject qom/qom-qobject.c:27
> #9 0x562a8e29de2b in object_property_set_bool qom/object.c:1242
> #10 0x562a8d9ad452 in pc_new_cpu /root/qemu-3.0.0/hw/i386/pc.c:1107
> #11 0x562a8d9ad993 in pc_cpus_init /root/qemu-3.0.0/hw/i386/pc.c:1155
> #12 0x562a8d9b79cb in pc_init1 /root/qemu-3.0.0/hw/i386/pc_piix.c:153
> #13 0x562a8d9b981d in pc_init_v3_0
> /root/qemu-3.0.0/hw/i386/pc_piix.c:438
> #14 0x562a8dd4bf29 in machine_run_board_init hw/core/machine.c:830
> #15 0x562a8db8d46e in main /root/qemu-3.0.0/vl.c:4516
> #16 0x7f687133a11a in __libc_start_main (/lib64/libc.so.6+0x2311a)
> 
> SUMMARY: AddressSanitizer: heap-use-after-free hw/9pfs/9p-local.c:59 in
> local_open_nofollow
> Shadow bytes around the buggy address:
>   0x0c0480014c90: fa fa fd fa fa fa fa fa fa fa fa fa fa fa fa fa
>   0x0c0480014ca0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
>   0x0c0480014cb0: fa fa fa fa fa fa fd fd fa fa fd fa fa fa fd fa
>   0x0c0480014cc0: fa fa fa fa fa fa fa fa fa fa fd fd fa fa fd fa
>   0x0c0480014cd0: fa fa fa fa fa fa fa fa fa fa fd fa fa fa fd fd
> =>0x0c0480014ce0: fa fa fa fa fa fa fd fa fa fa[fd]fa fa fa fa fa
>   0x0c0480014cf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fd fd
>   0x0c0480014d00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
>   0x0c0480014d10: fa fa fd fa fa fa fd fd fa fa fa fa fa fa fa fa
>   0x0c0480014d20: fa fa fa fa fa fa fd fa fa fa fa fa fa fa fd fa
>   0x0c0480014d30: fa fa fa fa fa fa fd fd fa fa fa fa fa fa fa fa
> Shadow byte legend (one shadow byte represents 8 application bytes):
>   Addressable:   00
>   Partially addressable: 01 02 03 04 05 06 07
>   Heap left redzone:   fa
>   Freed heap region:   fd
>   Stack left redzone:  f1
>  

Re: [Qemu-devel] [RFC PATCH 11/11] target/mips: Port MIPS64 DCL[Z/O] to decodetree

2018-11-12 Thread Richard Henderson
On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
> +dclz011100 . . . . 100100   @rs_rt_rd   
> ?ctx->insn_flags_MIPS64
> +dclo011100 . . . . 100101   @rs_rt_rd   
> ?ctx->insn_flags_MIPS64

This syntax I do not like.  I preferred your other form,

  >isa(ISA_MIPS64)

which, with a change to return false as I suggested, would
have the same effect.

As an aside, you could do

#define check_isa(ctx, which)  ((ctx)->insn_flags & ISA_##which)

to reduce the decode markup to

  >isa(MIPS64)


r~
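
The suggested macro in a self-contained form (flag values invented for illustration; the real ISA_* constants live in target/mips):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits, not QEMU's actual values. */
#define ISA_MIPS32R6 (1u << 0)
#define ISA_MIPS64   (1u << 1)

typedef struct { uint32_t insn_flags; } DisasContext;

/* Token pasting turns check_isa(ctx, MIPS64) into a test of ISA_MIPS64,
 * which is what keeps the decodetree markup down to ">isa(MIPS64)". */
#define check_isa(ctx, which)  ((ctx)->insn_flags & ISA_##which)
```
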



Re: [Qemu-devel] [PATCH v1 3/3] intel-iommu: search iotlb for levels supported by the address width.

2018-11-12 Thread Peter Xu
On Mon, Nov 12, 2018 at 05:25:48PM +0800, Yu Zhang wrote:
> On Mon, Nov 12, 2018 at 04:51:22PM +0800, Peter Xu wrote:
> > On Fri, Nov 09, 2018 at 07:49:47PM +0800, Yu Zhang wrote:
> > > This patch updates vtd_lookup_iotlb() to search cached mappings only
> > > for all page levels supported by address width of current vIOMMU. Also,
> > > to cover 57-bit width, the shift of source id(VTD_IOTLB_SID_SHIFT) and
> > > of page level(VTD_IOTLB_LVL_SHIFT) are enlarged by 9 - the stride of
> > > one paging structure level.
> > > 
> > > Signed-off-by: Yu Zhang 
> > > ---
> > > Cc: "Michael S. Tsirkin" 
> > > Cc: Marcel Apfelbaum 
> > > Cc: Paolo Bonzini  
> > > Cc: Richard Henderson  
> > > Cc: Eduardo Habkost 
> > > Cc: Peter Xu 
> > > ---
> > >  hw/i386/intel_iommu.c  | 5 +++--
> > >  hw/i386/intel_iommu_internal.h | 7 ++-
> > >  2 files changed, 5 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 9cdf755..ce7e17e 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -254,11 +254,12 @@ static uint64_t vtd_get_iotlb_gfn(hwaddr addr, 
> > > uint32_t level)
> > >  static VTDIOTLBEntry *vtd_lookup_iotlb(IntelIOMMUState *s, uint16_t 
> > > source_id,
> > > hwaddr addr)
> > >  {
> > > -VTDIOTLBEntry *entry;
> > > +VTDIOTLBEntry *entry = NULL;
> > >  uint64_t key;
> > >  int level;
> > > +int max_level = (s->aw_bits - VTD_PAGE_SHIFT_4K) / VTD_SL_LEVEL_BITS;
> > >  
> > > -for (level = VTD_SL_PT_LEVEL; level < VTD_SL_PML4_LEVEL; level++) {
> > > +for (level = VTD_SL_PT_LEVEL; level < max_level; level++) {
> > 
> > My understanding of current IOTLB is that it only caches the last
> > level of mapping, say:
> > 
> >   - level 1: 4K page
> >   - level 2: 2M page
> >   - level 3: 1G page
> > 
> > So we don't check against level=4 even if x-aw-bits=48 is specified.
> > 
> > Here does it mean that we're going to have... 512G iommu huge pages?
> > 
> 
> No. My bad, I misunderstood this routine. And now I believe we do not
> need this patch. :-)

Yeah good to confirm that :-)

Regards,

-- 
Peter Xu
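
For reference, the level/page-size relationship behind the "512G huge pages" remark, as a standalone sketch using the constants named in the patch:

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT_4K 12
#define VTD_SL_LEVEL_BITS 9

/* Leaf page size at a given second-level paging level: each level adds
 * one 9-bit stride, so level 1 -> 4K, level 2 -> 2M, level 3 -> 1G, and
 * a hypothetical level-4 leaf would mean 512G "huge pages". */
static uint64_t vtd_level_page_size(int level)
{
    return 1ULL << (VTD_PAGE_SHIFT_4K + VTD_SL_LEVEL_BITS * (level - 1));
}

/* The max_level expression from the patch, for comparison. */
static int vtd_max_level(int aw_bits)
{
    return (aw_bits - VTD_PAGE_SHIFT_4K) / VTD_SL_LEVEL_BITS;
}
```

Since the IOTLB only caches last-level mappings up to 1G pages, there is no level-4 entry to look up, which is why the patch turned out to be unnecessary.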



Re: [Qemu-devel] [PATCH v2] pulseaudio: process audio data in smaller chunks

2018-11-12 Thread Philippe Mathieu-Daudé

On 12/11/18 10:12, Gerd Hoffmann wrote:
> On Sat, Nov 10, 2018 at 08:36:39PM +0100, Philippe Mathieu-Daudé wrote:
> > Hi Gerd,
> > 
> > On 11/9/18 3:20 PM, Gerd Hoffmann wrote:
> > > The rate of pulseaudio absorbing the audio stream is used to control
> > > the rate of the guest's audio stream.  When the emulated hardware uses
> > > small chunks (like intel-hda does) we need small chunks on the audio
> > > backend side too, otherwise that feedback loop doesn't work very well.
> > 
> > Shouldn't this be user-configurable?
> 
> Why?

When emulated hardware is not intel-hda?



Re: [Qemu-devel] [PATCH] hw/bt: drop bluetooth keyboard emulation.

2018-11-12 Thread Peter Maydell
On 12 November 2018 at 09:16, Markus Armbruster  wrote:
> Peter Maydell  writes:
>
>> On 9 November 2018 at 14:14, Gerd Hoffmann  wrote:
>>> Broken (segfaults on first keypress) and apparently unused.
>>>
>>> Signed-off-by: Gerd Hoffmann 
>
> Please show the reproducer in the commit message.  Stack backtrace
> wouldn't hurt.
>
>>> ---
>>>  include/hw/bt.h |   3 -
>>>  hw/bt/hid.c | 554 
>>> 
>>>  vl.c|  34 +---
>>>  hw/bt/Makefile.objs |   3 +-
>>>  qemu-options.hx |   9 -
>>>  5 files changed, 2 insertions(+), 601 deletions(-)
>>>  delete mode 100644 hw/bt/hid.c
>>
>> Are we definitely happy that all the use cases for
>> this code segfault, not just the one you tested ?
>
> Are there any others?
>
>> Does it cost us much to mark this deprecated in 3.1
>> and drop it in 4.0 ?
>
> The cost is offering a known-to-be-broken option to users.
>
> The other half of the tradeoff: what would it gain us?

My point is that we do not know it to be broken. We know that
it has one bug: the command-line segfault that Gerd
found. I haven't seen anybody say they've done a check of
all the use cases of this code and confirmed that all of them
must pass through the codepath/situation that segfaults.
What it gains us is that we get to be sure that we are keeping
our promise with our users that we don't just randomly remove
features they were using without notice.

thanks
-- PMM



Re: [Qemu-devel] [libvirt] [PATCH] bt: Mark the bluetooth subsystem as deprecated

2018-11-12 Thread Peter Krempa
On Mon, Nov 12, 2018 at 11:00:30 +0100, Thomas Huth wrote:
> It has been unmaintained for years, and there have been only trivial or
> tree-wide changes to the related files for many years, so the
> code is likely very bitrotten and broken. For example, the following
> segfaults as soon as you press a key:
> 
>  qemu-system-x86_64 -usb -device usb-bt-dongle -bt hci -bt device:keyboard
> 
> Since we are not aware of anybody using bluetooth with the current
> version of QEMU, let's mark the subsystem as deprecated, with a special
> request for the users to write to the qemu-devel mailing list in case
> they still use it (so we could revert the deprecation status in that
> case).
> 
> Signed-off-by: Thomas Huth 
> ---
>  qemu-deprecated.texi | 7 +++
>  qemu-options.hx  | 4 
>  vl.c | 4 
>  3 files changed, 15 insertions(+)

libvirt never implemented any configuration option for bluetooth


signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH 03/11] target/mips: Move the !ISA_MIPS32R6 check out of decode_opc_special2_legacy

2018-11-12 Thread Richard Henderson
On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  target/mips/translate.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] [PATCH v1 1/3] intel-iommu: differentiate host address width from IOVA address width.

2018-11-12 Thread Peter Xu
On Fri, Nov 09, 2018 at 07:49:45PM +0800, Yu Zhang wrote:
> Currently, vIOMMU is using the value of IOVA address width, instead of
> the host address width(HAW) to calculate the number of reserved bits in
> data structures such as root entries, context entries, and entries of
> DMA paging structures etc.
> 
> However, the IOVA address width and the HAW may not be equal. For
> example, a 48-bit IOVA can only be mapped to host addresses no wider than
> 46 bits. Using 48 instead of 46 to calculate the reserved bits may result
> in an invalid IOVA being accepted.
> 
> To fix this, a new field - haw_bits is introduced in struct IntelIOMMUState,
> whose value is initialized based on the maximum physical address set to
> guest CPU. Also, definitions such as VTD_HOST_AW_39/48BIT etc. are renamed
> to clarify.

IIRC I raised this question some time ago somewhere but no one
remembered to follow that up.  Thanks for fixing it.

It looks mostly good to me, only one tiny comment below...

[...]

> @@ -887,6 +887,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t 
> start,
>  uint64_t iova = start;
>  uint64_t iova_next;
>  int ret = 0;
> +uint8_t haw = info->as->iommu_state->haw_bits;

For now vtd_page_walk_info->aw_bits caches the GAW information and we
use a single vtd_page_walk_info during one page walk, maybe we can
also do the same for HAW instead of fetching it every time here from
info->as->iommu_state->haw_bits?

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [RFC PATCH 06/11] scripts/decodetree: Allow empty specifications

2018-11-12 Thread Richard Henderson
On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
> Allowing empty specifications makes it possible to write stubs/templates,
> useful when testing/rebasing.
> 
> This fixes:
> 
>   decode.inc.c: In function ‘decode’:
>   decode.inc.c:9:7: error: unused variable ‘u’ [-Werror=unused-variable]
>  } u;
>^
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  scripts/decodetree.py | 13 +++--
>  1 file changed, 7 insertions(+), 6 deletions(-)

Reviewed-by: Richard Henderson 


r~




Re: [Qemu-devel] [PATCH] virtio-net: support RSC v4/v6 tcp traffic for Windows HCK

2018-11-12 Thread Yuri Benditovich
On Mon, Nov 12, 2018 at 4:54 AM Michael S. Tsirkin  wrote:

> On Sun, Nov 11, 2018 at 12:18:54PM +0200, Yuri Benditovich wrote:
> > > @@ -66,12 +143,16 @@ typedef struct VirtIONet {
> > >  VirtIONetQueue *vqs;
> > >  VirtQueue *ctrl_vq;
> > >  NICState *nic;
> > > +QTAILQ_HEAD(, NetRscChain) rsc_chains;
> >
> > what exactly happens with these chains on migration?
> >
> >
> > This feature (software implementation of RSC in QEMU) is intended to be
> > used in the environment of certification tests which never uses migration.
>
> Should this feature disable migration then?
>

IMO, this should not. But if you find it mandatory, please respond and I
will add the migration blocker.
In this case I'd prefer to disable the feature in case of vhost as in this
case the RX normally will not be
processed by QEMU and even if the feature by mistake enabled in command
line this will not have
any effect.


>
> > These chains
> > and accumulated segments (if any) are lost in case of migration. I'll
> add the
> > note about it
> > in commit's message.
> >
>
> If it's a functional limitation it belongs in a code comment
> not in the commit log.
>
> >
> >
> > >  uint32_t tx_timeout;
> > >  int32_t tx_burst;
> > >  uint32_t has_vnet_hdr;
> > >  size_t host_hdr_len;
> > >  size_t guest_hdr_len;
> > >  uint64_t host_features;
> > > +uint32_t rsc_timeout;
> > > +uint8_t rsc4_enabled;
> > > +uint8_t rsc6_enabled;
> > >  uint8_t has_ufo;
> > >  uint32_t mergeable_rx_bufs;
> > >  uint8_t promisc;
> > > diff --git a/include/net/eth.h b/include/net/eth.h
> > > index e6dc8a7ba0..7f45c678e7 100644
> > > --- a/include/net/eth.h
> > > +++ b/include/net/eth.h
> > > @@ -177,6 +177,8 @@ struct tcp_hdr {
> > >  #define TH_PUSH 0x08
> > >  #define TH_ACK  0x10
> > >  #define TH_URG  0x20
> > > +#define TH_ECE  0x40
> > > +#define TH_CWR  0x80
> > >  u_short th_win;  /* window */
> > >  u_short th_sum;  /* checksum */
> > >  u_short th_urp;  /* urgent pointer */
> > > diff --git a/include/standard-headers/linux/virtio_net.h b/include/
> > standard-headers/linux/virtio_net.h
> > > index 260c3681d7..0d8658c06a 100644
> > > --- a/include/standard-headers/linux/virtio_net.h
> > > +++ b/include/standard-headers/linux/virtio_net.h
> > > @@ -57,6 +57,10 @@
> > >* Steering */
> > >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
> > >
> > > +#define VIRTIO_NET_F_RSC_EXT 38
> >
> > Should it be VIRTIO_NET_F_GUEST_RSC_EXT ?
> >
> >
> > IMO, not. In the spec the name of the feature is VIRTIO_NET_F_RSC_EXT
> and it is
> > actually host feature
> > and its effect is how the host sets the fields in the header.
>
> Isn't the same true for GUEST_GSO?
>
>
I think GUEST_GSO is really a guest feature (it requires larger buffers on
the guest side), but this is a question of terminology. If you find it
necessary to rename the feature, please respond and I will do that in v3
of the patch. I just would like to collect all the requests.



> >
> >
> > > +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE 41  /* reserved
> */
> > > +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE 42  /* reserved
> */
> > > +
> > >  #define VIRTIO_NET_F_STANDBY   62/* Act as standby for another
> > device
> > >* with the same MAC.
> > >*/
> > > @@ -104,6 +108,7 @@ struct virtio_net_config {
> > >  struct virtio_net_hdr_v1 {
> > >  #define VIRTIO_NET_HDR_F_NEEDS_CSUM  1   /* Use csum_start,
> > csum_offset */
> > >  #define VIRTIO_NET_HDR_F_DATA_VALID  2   /* Csum is valid */
> > > +#define VIRTIO_NET_HDR_F_RSC_INFO4   /* rsc_ext data in
> csum_
> > fields */
> > >   uint8_t flags;
> > >  #define VIRTIO_NET_HDR_GSO_NONE  0   /* Not a GSO
> frame
> > */
> > >  #define VIRTIO_NET_HDR_GSO_TCPV4 1   /* GSO frame, IPv4
> TCP
> > (TSO) */
> > > @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 {
> > >   __virtio16 num_buffers; /* Number of merged rx buffers */
> > >  };
> > >
> > > +#define rsc_ext_num_packets  csum_start
> > > +#define rsc_ext_num_dupacks  csum_offset
> >
> > I would prefer an inline function to set the field, or a union.
> >
> >
> > > +
> > >  #ifndef VIRTIO_NET_NO_LEGACY
> > >  /* This header comes first in the scatter-gather list.
> > >   * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated,
> it must
> >
> > This part needs to get into the Linux header. Pls post there.
> > Until it does you can put it in virtio-net.c
> >
> >
> > > --
> > > 2.17.1
> >
>


Re: [Qemu-devel] [RFC PATCH 05/11] decodetree: Force Python to print unsigned values

2018-11-12 Thread Philippe Mathieu-Daudé
On Mon, Nov 12, 2018 at 9:20 AM Richard Henderson
 wrote:
> On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
> > Python internal representation is signed, so unsigned values
> > bigger than 31-bit are interpreted as signed (and printed with
> > a '-' sign).
> > Mask out to force unsigned values.
> >
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> > TODO: display error encountered:
> >
> >case 0x-1:
> >
> > ---
> >  scripts/decodetree.py | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
>
> This must have been from a trivial input file containing only fixed field
> insns?  Not saying it's wrong, but just so we know.  Any chance you can build 
> a
> tests/decode/ test for this?  I know they're all currently related to parsing,
> but...

Yes, it is in my TODO to add this test, I have that failure in an early branch.
There is no rush with this series for now, this is not 3.2 material.

> Reviewed-by: Richard Henderson 

Thanks!

Phil.
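
The masking trick can be demonstrated standalone; note how formatting an unmasked -1 reproduces the "case 0x-1:" output mentioned in the TODO:

```python
def mask_unsigned(value, bits=32):
    """Force a Python int into its unsigned, bits-wide representation."""
    return value & ((1 << bits) - 1)

# Python ints are arbitrary-precision and signed, so '%x' on a negative
# value keeps the sign -- which is how "case 0x-1:" ends up in the output.
print('0x%x' % -1)                   # -> 0x-1
print('0x%x' % mask_unsigned(-1))    # -> 0xffffffff
```
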



Re: [Qemu-devel] [PATCH v1 0/3] intel-iommu: add support for 5-level virtual IOMMU.

2018-11-12 Thread Peter Xu
On Fri, Nov 09, 2018 at 07:49:44PM +0800, Yu Zhang wrote:
> Intel's upcoming processors will extend maximum linear address width to
> 57 bits, and introduce 5-level paging for CPU. Meanwhile, the platform
> will also extend the maximum guest address width for IOMMU to 57 bits,
> thus introducing the 5-level paging for 2nd level translation(See chapter 3
> in Intel Virtualization Technology for Directed I/O). 
> 
> This patch set extends the current logic to support a wider address width.
> A 5-level paging capable IOMMU(for 2nd level translation) can be rendered
> with configuration "device intel-iommu,x-aw-bits=57".

Along with this series, I'm not sure whether it'll be good we start to
consider removing the "x-" prefix for "x-aw-bits".

Michael?

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu] ppc/spapr: Receive and store device tree blob from SLOF

2018-11-12 Thread Greg Kurz
On Mon, 12 Nov 2018 15:12:26 +1100
Alexey Kardashevskiy  wrote:

> On 12/11/2018 05:10, Greg Kurz wrote:
> > Hi Alexey,
> > 
> > Just a few remarks. See below.
> > 
> > On Thu,  8 Nov 2018 12:44:06 +1100
> > Alexey Kardashevskiy  wrote:
> >   
> >> SLOF receives a device tree and updates it with various properties
> >> before switching to the guest kernel and QEMU is not aware of any changes
> >> made by SLOF. Since there is no real RTAS (QEMU implements it), it makes
> >> sense to pass the SLOF final device tree to QEMU to let it implement
> >> RTAS related tasks better, such as PCI host bus adapter hotplug.
> >>
> >> Specifially, now QEMU can find out the actual XICS phandle (for PHB
> >> hotplug) and the RTAS linux,rtas-entry/base properties (for firmware
> >> assisted NMI - FWNMI).
> >>
> >> This stores the initial DT blob in the sPAPR machine and replaces it
> >> in the KVMPPC_H_UPDATE_DT (new private hypercall) handler.
> >>
> >> This adds an @update_dt_enabled machine property to allow backward
> >> migration.
> >>
> >> SLOF already has a hypercall since
> >> https://github.com/aik/SLOF/commit/e6fc84652c9c0073f9183
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>  include/hw/ppc/spapr.h |  7 ++-
> >>  hw/ppc/spapr.c | 29 -
> >>  hw/ppc/spapr_hcall.c   | 32 
> >>  hw/ppc/trace-events|  2 ++
> >>  4 files changed, 68 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index ad4d7cfd97..f5dcaf44cb 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -100,6 +100,7 @@ struct sPAPRMachineClass {
> >>  
> >>  /*< public >*/
> >>  bool dr_lmb_enabled;   /* enable dynamic-reconfig/hotplug of LMBs 
> >> */
> >> +bool update_dt_enabled;/* enable KVMPPC_H_UPDATE_DT */
> >>  bool use_ohci_by_default;  /* use USB-OHCI instead of XHCI */
> >>  bool pre_2_10_has_unused_icps;
> >>  bool legacy_irq_allocation;
> >> @@ -136,6 +137,9 @@ struct sPAPRMachineState {
> >>  int vrma_adjust;
> >>  ssize_t rtas_size;
> >>  void *rtas_blob;
> >> +uint32_t fdt_size;
> >> +uint32_t fdt_initial_size;  
> > 
> > I don't quite see the purpose of fdt_initial_size... it seems to be only
> > used to print a trace.  
> 
> 
> Ah, lost in rebase. The purpose was to check that the new device tree had
> not grown too much.
> 

Ok, makes sense during development.

> 
> 
> >   
> >> +void *fdt_blob;
> >>  long kernel_size;
> >>  bool kernel_le;
> >>  uint32_t initrd_base;
> >> @@ -462,7 +466,8 @@ struct sPAPRMachineState {
> >>  #define KVMPPC_H_LOGICAL_MEMOP  (KVMPPC_HCALL_BASE + 0x1)
> >>  /* Client Architecture support */
> >>  #define KVMPPC_H_CAS(KVMPPC_HCALL_BASE + 0x2)
> >> -#define KVMPPC_HCALL_MAXKVMPPC_H_CAS
> >> +#define KVMPPC_H_UPDATE_DT  (KVMPPC_HCALL_BASE + 0x3)
> >> +#define KVMPPC_HCALL_MAXKVMPPC_H_UPDATE_DT
> >>  
> >>  typedef struct sPAPRDeviceTreeUpdateHeader {
> >>  uint32_t version_id;
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index c08130facb..5e2d4d211c 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -1633,7 +1633,10 @@ static void spapr_machine_reset(void)
> >>  /* Load the fdt */
> >>  qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
> >>  cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
> >> -g_free(fdt);
> >> +g_free(spapr->fdt_blob);
> >> +spapr->fdt_size = fdt_totalsize(fdt);
> >> +spapr->fdt_initial_size = spapr->fdt_size;
> >> +spapr->fdt_blob = fdt;  
> > 
> > Hmm... It looks weird to store state in a reset handler. I'd rather zero
> > both fdt_blob and fdt_size here.  
> 
> 
> The device tree is built from the reset handler and the idea is that we
> want to always have some tree in the machine.
> 

Yes of course, I forgot that the fdt needs to be kept
somewhere so that we can use it :). My remark has more to do
with migration actually: the fdt built at reset time is supposed
to derive from the command line and hot-(un)plugged devices, ie,
identical in source and destination. This isn't state we should
migrate IIUC.

Maybe add a boolean field that tells that the fdt was updated, use
it in spapr_dtb_needed() and reset it in spapr_machine_reset() ?
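That suggestion could look roughly like the following sketch (field and function shapes hypothetical, not the actual sPAPR code): only migrate the blob once SLOF has actually replaced it, and clear the flag on machine reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Rough shape of the suggestion: track whether SLOF actually replaced the
 * fdt, and only then include the blob in the migration stream. */
typedef struct SketchSpaprState {
    void *fdt_blob;
    unsigned fdt_size;
    bool fdt_updated;          /* would be set by the KVMPPC_H_UPDATE_DT handler */
} SketchSpaprState;

static bool sketch_dtb_needed(const SketchSpaprState *spapr)
{
    return spapr->fdt_updated; /* false until the guest calls the hcall */
}

static void sketch_machine_reset(SketchSpaprState *spapr)
{
    spapr->fdt_updated = false; /* back to the QEMU-built tree */
}
```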

> 
> 
> >   
> >>  
> >>  /* Set up the entry state */
> >>  spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> >> @@ -1887,6 +1890,27 @@ static const VMStateDescription 
> >> vmstate_spapr_irq_map = {
> >>  },
> >>  };
> >>  
> >> +static bool spapr_dtb_needed(void *opaque)
> >> +{
> >> +sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(opaque);
> >> +
> >> +return smc->update_dt_enabled;  
> > 
> > This means we always migrate the fdt, even if migration occurs before
> > SLOF could call KVMPPC_H_UPDATE_DT.
> > 
> > With spapr->fdt_blob set to NULL on reset, a better check 

Re: [Qemu-devel] [PATCH 0/2] Fix the last Hyp mode bug and turn it on for A7, A15

2018-11-12 Thread Richard Henderson
On 11/9/18 6:35 PM, Peter Maydell wrote:
> Not entirely sure what to do about this for 3.1 -- maybe
> put in the bugfix patch but hold off on actually setting
> the feature bit til 4.0?

I would put them both in for 3.1.


r~



Re: [Qemu-devel] [RFC] [PATCH] kvm: arm: Introduce error code KVM_EINVARIANT

2018-11-12 Thread Andrew Jones
On Sun, Nov 11, 2018 at 11:36:44AM +, Marc Zyngier wrote:
> On Sat, 10 Nov 2018 22:18:47 +,
> Manish Jaggi  wrote:
> > 
> > 
> > CCing a larger audience.
> > Please review.
> > 
> > On 10/23/2018 03:51 PM, Jaggi, Manish wrote:
> > > From: Manish Jaggi 
> > >
> > > This patch introduces an error code KVM_EINVARIANT which is returned
> > > by KVM when userland tries to set an invariant register.
> > >
> > > The need for this error code is in VM Migration for arm64.
> > > ARM64 systems mainly use -machine virt -cpu host as parameters to qemu.
> > > Migration requires both source and destination machines to have the same
> > > physical cpu. There are cases where the overall CPU architecture is the
> > > same but a newer revision of the chip has some bug fixes that have no
> > > effect on qemu operation. In such cases invariant registers like MIDR
> > > have a different value.
> > > Currently migration fails in such cases.
> > >
> > > Rather than sending an EINVAL, a specific error code will help
> > > userland program the guest invariant registers by querying the migrated
> > > host machine's invariant registers.
> 
> But failing migration is a good thing, right? How do you expect that
> the guest will be happy to see a new CPU revision right in the middle
> of its execution? Do you also propose that QEMU starts doing that for
> big-little systems? After all, if ignoring the differences in some
> registers is harmless for migration, surely that should be the case in
> a single system, right?
> 
> > >
> > > Qemu will have a parameter -hostinvariant along with checking of this
> > > error code. So it can be safely assumed that the feature is opt-in
> 
> You're changing the ABI without any buy in from userspace, which is
> not acceptable.
> 
> As it stands, this patch creates a number of issues without solving
> any. Things to think about:
> 
> - How does errata management works when migrating to a different
>   system?
> - big-little, as mentioned above
> - Are all invariant registers equal? A different MIDR has the same
>   effect as a different MMFR0?
> 
> Instead of papering over architectural constants in a system, how about
> allowing the relevant ID registers to be overloaded when not
> incompatible?
> 

In QEMU, I think we need nail down how we define '-cpu host' for kvmarm.
IMO, '-cpu host' is what libvirt calls "host-passthrough"[*], and indeed
migrating a guest using host-passthrough is super risky. You need to
know what you're doing, and likely what you'll want to do is migrate to
an _identical_ host. We should start looking at building cpu models to
better support migration, but errata complicates that. However, for the
sake of argument, let's assume we solved those problems and completed the
implementation of cpu models - so we would no longer be using '-cpu host'
for a "typical" config. So what would '-cpu host' mean? It appears the
current semantics are "source host passthrough", similar to
"host-model"[*]. Rather than "whatever host I'm running on host
passthrough", which is what I think we want and what this patch and the
QEMU counterpart changes seem to be aiming at implementing. Anyway,
whichever of those two semantics we choose for '-cpu host', the user will
still need to know what they're doing and to assume the risks. Without
cpu models, I'm not even sure discussing migration safety makes sense.

(Yeah, I know I ignored big-little. IMO, big-little is an upper layer
problem, as it should just be a vcpu thread to cpu set affinity
assignment issue.)

Thanks,
drew

[*] https://wiki.openstack.org/wiki/LibvirtXMLCPUModel



Re: [Qemu-devel] [RFC PATCH 04/11] target/mips: Avoid access to CPUMIPSState from decode* functions

2018-11-12 Thread Richard Henderson
On 11/12/18 12:36 AM, Philippe Mathieu-Daudé wrote:
> The DisasContext is already initialized from the CPUMIPSState in
> mips_tr_init_disas_context().
> 
> Suggested-by: Richard Henderson 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  target/mips/translate.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson 


r~




Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Peter Xu
On Fri, Nov 09, 2018 at 07:49:46PM +0800, Yu Zhang wrote:
> A 5-level paging capable VM may choose to use 57-bit IOVA address width.
> E.g. guest applications like DPDK prefer to use its VA as IOVA when
> performing VFIO map/unmap operations, to avoid the burden of managing the
> IOVA space.

Since you mentioned DPDK... I'm just curious whether you have
tested the patchset on 57-bit-enabled machines with DPDK VA
mode running in the guest? That would be something nice to mention in
the cover letter if you have.

[...]

> @@ -3264,11 +3286,19 @@ static bool vtd_decide_config(IntelIOMMUState *s, 
> Error **errp)
>  }
>  }
>  
> -/* Currently only address widths supported are 39 and 48 bits */
> +/* Currently address widths supported are 39, 48, and 57 bits */
>  if ((s->aw_bits != VTD_AW_39BIT) &&
> -(s->aw_bits != VTD_AW_48BIT)) {
> -error_setg(errp, "Supported values for x-aw-bits are: %d, %d",
> -   VTD_AW_39BIT, VTD_AW_48BIT);
> +(s->aw_bits != VTD_AW_48BIT) &&
> +(s->aw_bits != VTD_AW_57BIT)) {
> +error_setg(errp, "Supported values for x-aw-bits are: %d, %d, %d",
> +   VTD_AW_39BIT, VTD_AW_48BIT, VTD_AW_57BIT);
> +return false;
> +}
> +
> +if ((s->aw_bits == VTD_AW_57BIT) &&
> +!(host_has_la57() && guest_has_la57())) {
> +error_setg(errp, "Do not support 57-bit DMA address, unless both "
> + "host and guest are capable of 5-level paging.\n");

Is there any context (or a pointer to previous discussions would work
too) explaining why we don't support some scenarios like
host_paw=48,guest_paw=48,guest_gaw=57?

Thanks,

-- 
Peter Xu
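For context, the validation being discussed in the quoted hunk reduces to a small predicate (constants from the patch; the la57 flags stand in for the host_has_la57()/guest_has_la57() checks):

```c
#include <assert.h>

enum { VTD_AW_39BIT = 39, VTD_AW_48BIT = 48, VTD_AW_57BIT = 57 };

/* Sketch of the checks in vtd_decide_config(): reject unknown widths,
 * and allow 57-bit only when both host and guest do 5-level paging. */
static int sketch_aw_bits_valid(int aw_bits, int host_la57, int guest_la57)
{
    if (aw_bits != VTD_AW_39BIT && aw_bits != VTD_AW_48BIT &&
        aw_bits != VTD_AW_57BIT) {
        return 0;                /* unsupported width */
    }
    if (aw_bits == VTD_AW_57BIT && !(host_la57 && guest_la57)) {
        return 0;                /* 57-bit needs 5-level paging on both */
    }
    return 1;
}
```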



Re: [Qemu-devel] List of files containing devices which have not been QOMified

2018-11-12 Thread Thomas Huth
On 2018-11-09 14:16, Paolo Bonzini wrote:
> On 09/11/2018 13:39, Thomas Huth wrote:
>> On 2018-11-09 12:29, Gerd Hoffmann wrote:
>>> On Fri, Nov 09, 2018 at 12:17:31PM +0100, Gerd Hoffmann wrote:
   Hi,

> I am also suspicious about hw/bt/ but don't know enough
> about that subsystem to say if it could benefit from
> using QOM objects more.

 I'm wondering whether anyone would even notice if we just rm -rf hw/bt

 Looking through the changelog for the last five years (after hw/ split)
 the only thing I see is fixing warnings from compiler or coverity,
 adapting to changes in other systems (chardev for example) and treewide
 changes.  Not a *single* patch specific to bluetooth ...
>>>
>>> Tried this after studying docs:
>>>
>>>   qemu -usb -device usb-bt-dongle -bt hci,vlan=0 -bt device:keyboard
>>>
>>> Segfaults right away on the first keypress.
>>> I guess that qualifies as "broken and obviously unused".
>>
>> Thanks for checking! I guess that means we could even get rid of it
>> without deprecating it first if it is broken already for more than two
>> releases...?
> 
> I think what others were using was bluetooth passthrough.  But it's
> certainly possible that it's broken.

Since there haven't been any non-trivial changes to the files in
*years*, and apparently none of the current QEMU developers is
interested in maintaining this subsystem, I think it is really very
likely that it's broken. So unless someone has got some hardware that
could be used for testing it, I think we really should mark this as
deprecated - we could keep it in that state a little bit longer to see
whether a user speaks up and says that it is still useful, but in case
nobody uses this anymore, there is really no need that we carry this
code with us forever.

 Thomas



Re: [Qemu-devel] [PATCH 1/2] target/arm: Hyp mode R14 is shared with User and System

2018-11-12 Thread Edgar E. Iglesias
On Fri, Nov 09, 2018 at 05:35:52PM +, Peter Maydell wrote:
> Hyp mode is an exception to the general rule that each AArch32
> mode has its own r13, r14 and SPSR -- it has a banked r13 and
> SPSR but shares its r14 with User and System mode. We were
> incorrectly implementing it as banked, which meant that on
> entry to Hyp mode r14 was 0 rather than the USR/SYS r14.
> 
> We provide a new function r14_bank_number() which is like
> the existing bank_number() but provides the index into
> env->banked_r14[]; bank_number() provides the index to use
> for env->banked_r13[] and env->banked_cpsr[].
> 
> All the points in the code that were using bank_number()
> to index into env->banked_r14[] are updated for consistency:
>  * switch_mode() -- this is the only place where we fix
>an actual bug
>  * aarch64_sync_32_to_64() and aarch64_sync_64_to_32():
>no behavioural change as we already special-cased Hyp R14
>  * kvm32.c: no behavioural change since the guest can't ever
>be in Hyp mode, but conceptually the right thing to do
>  * msr_banked()/mrs_banked(): we can never get to the case
>that accesses banked_r14[] with tgtmode == ARM_CPU_MODE_HYP,
>so no behavioural change
> 
> Signed-off-by: Peter Maydell 

Reviewed-by: Edgar E. Iglesias 



> ---
>  target/arm/internals.h | 16 
>  target/arm/helper.c| 29 +++--
>  target/arm/kvm32.c |  4 ++--
>  target/arm/op_helper.c |  2 +-
>  4 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/target/arm/internals.h b/target/arm/internals.h
> index 6c2bb2deebd..e5341f21f6f 100644
> --- a/target/arm/internals.h
> +++ b/target/arm/internals.h
> @@ -145,6 +145,22 @@ static inline int bank_number(int mode)
>  g_assert_not_reached();
>  }
>  
> +/**
> + * r14_bank_number: Map CPU mode onto register bank for r14
> + *
> + * Given an AArch32 CPU mode, return the index into the saved register
> + * banks to use for the R14 (LR) in that mode. This is the same as
> + * bank_number(), except for the special case of Hyp mode, where
> + * R14 is shared with USR and SYS, unlike its R13 and SPSR.
> + * This should be used as the index into env->banked_r14[], and
> + * bank_number() used for the index into env->banked_r13[] and
> + * env->banked_spsr[].
> + */
> +static inline int r14_bank_number(int mode)
> +{
> +return (mode == ARM_CPU_MODE_HYP) ? BANK_USRSYS : bank_number(mode);
> +}
> +
>  void arm_cpu_register_gdb_regs_for_features(ARMCPU *cpu);
>  void arm_translate_init(void);
>  
> diff --git a/target/arm/helper.c b/target/arm/helper.c
> index 96301930cc8..6fb1ddc5506 100644
> --- a/target/arm/helper.c
> +++ b/target/arm/helper.c
> @@ -6455,13 +6455,14 @@ static void switch_mode(CPUARMState *env, int mode)
>  
>  i = bank_number(old_mode);
>  env->banked_r13[i] = env->regs[13];
> -env->banked_r14[i] = env->regs[14];
>  env->banked_spsr[i] = env->spsr;
>  
>  i = bank_number(mode);
>  env->regs[13] = env->banked_r13[i];
> -env->regs[14] = env->banked_r14[i];
>  env->spsr = env->banked_spsr[i];
> +
> +env->banked_r14[r14_bank_number(old_mode)] = env->regs[14];
> +env->regs[14] = env->banked_r14[r14_bank_number(mode)];
>  }
>  
>  /* Physical Interrupt Target EL Lookup Table
> @@ -8040,7 +8041,7 @@ void aarch64_sync_32_to_64(CPUARMState *env)
>  if (mode == ARM_CPU_MODE_HYP) {
>  env->xregs[14] = env->regs[14];
>  } else {
> -env->xregs[14] = env->banked_r14[bank_number(ARM_CPU_MODE_USR)];
> +env->xregs[14] = 
> env->banked_r14[r14_bank_number(ARM_CPU_MODE_USR)];
>  }
>  }
>  
> @@ -8054,7 +8055,7 @@ void aarch64_sync_32_to_64(CPUARMState *env)
>  env->xregs[16] = env->regs[14];
>  env->xregs[17] = env->regs[13];
>  } else {
> -env->xregs[16] = env->banked_r14[bank_number(ARM_CPU_MODE_IRQ)];
> +env->xregs[16] = env->banked_r14[r14_bank_number(ARM_CPU_MODE_IRQ)];
>  env->xregs[17] = env->banked_r13[bank_number(ARM_CPU_MODE_IRQ)];
>  }
>  
> @@ -8062,7 +8063,7 @@ void aarch64_sync_32_to_64(CPUARMState *env)
>  env->xregs[18] = env->regs[14];
>  env->xregs[19] = env->regs[13];
>  } else {
> -env->xregs[18] = env->banked_r14[bank_number(ARM_CPU_MODE_SVC)];
> +env->xregs[18] = env->banked_r14[r14_bank_number(ARM_CPU_MODE_SVC)];
>  env->xregs[19] = env->banked_r13[bank_number(ARM_CPU_MODE_SVC)];
>  }
>  
> @@ -8070,7 +8071,7 @@ void aarch64_sync_32_to_64(CPUARMState *env)
>  env->xregs[20] = env->regs[14];
>  env->xregs[21] = env->regs[13];
>  } else {
> -env->xregs[20] = env->banked_r14[bank_number(ARM_CPU_MODE_ABT)];
> +env->xregs[20] = env->banked_r14[r14_bank_number(ARM_CPU_MODE_ABT)];
>  env->xregs[21] = env->banked_r13[bank_number(ARM_CPU_MODE_ABT)];
>  }
>  
> @@ -8078,7 +8079,7 @@ void aarch64_sync_32_to_64(CPUARMState *env)
>  
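The banking rule this patch implements can be reduced to a tiny model. The mode encodings below are the architectural CPSR.M values; the bank indices are illustrative only, not QEMU's actual constants:

```c
#include <assert.h>

enum {
    ARM_CPU_MODE_USR = 0x10,
    ARM_CPU_MODE_SVC = 0x13,
    ARM_CPU_MODE_HYP = 0x1a,
};
enum { BANK_USRSYS = 0, BANK_SVC = 1, BANK_HYP = 6 }; /* illustrative indices */

static int sketch_bank_number(int mode)
{
    switch (mode) {
    case ARM_CPU_MODE_USR: return BANK_USRSYS;
    case ARM_CPU_MODE_SVC: return BANK_SVC;
    case ARM_CPU_MODE_HYP: return BANK_HYP; /* Hyp has its own r13/SPSR bank */
    default:               return -1;       /* other modes elided in this sketch */
    }
}

/* The fix: Hyp shares R14 with User/System, so R14 accesses in Hyp mode
 * must index the USR/SYS slot instead of the Hyp bank. */
static int sketch_r14_bank_number(int mode)
{
    return (mode == ARM_CPU_MODE_HYP) ? BANK_USRSYS
                                      : sketch_bank_number(mode);
}
```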

Re: [Qemu-devel] [PATCH 2/2] target/arm/cpu: Give Cortex-A15 and -A7 the EL2 feature

2018-11-12 Thread Edgar E. Iglesias
On Fri, Nov 09, 2018 at 05:35:53PM +, Peter Maydell wrote:
> The Cortex-A15 and Cortex-A7 both have EL2; now we've implemented
> it properly we can enable the feature bit.
> 
> Signed-off-by: Peter Maydell 

Reviewed-by: Edgar E. Iglesias 


> ---
>  target/arm/cpu.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/target/arm/cpu.c b/target/arm/cpu.c
> index 784a4c2dfcc..b7185234d85 100644
> --- a/target/arm/cpu.c
> +++ b/target/arm/cpu.c
> @@ -1587,6 +1587,7 @@ static void cortex_a7_initfn(Object *obj)
>  set_feature(&cpu->env, ARM_FEATURE_GENERIC_TIMER);
>  set_feature(&cpu->env, ARM_FEATURE_DUMMY_C15_REGS);
>  set_feature(&cpu->env, ARM_FEATURE_CBAR_RO);
> +set_feature(&cpu->env, ARM_FEATURE_EL2);
>  set_feature(&cpu->env, ARM_FEATURE_EL3);
>  cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A7;
>  cpu->midr = 0x410fc075;
> @@ -1633,6 +1634,7 @@ static void cortex_a15_initfn(Object *obj)
>  set_feature(&cpu->env, ARM_FEATURE_GENERIC_TIMER);
>  set_feature(&cpu->env, ARM_FEATURE_DUMMY_C15_REGS);
>  set_feature(&cpu->env, ARM_FEATURE_CBAR_RO);
> +set_feature(&cpu->env, ARM_FEATURE_EL2);
>  set_feature(&cpu->env, ARM_FEATURE_EL3);
>  cpu->kvm_target = QEMU_KVM_ARM_TARGET_CORTEX_A15;
>  cpu->midr = 0x412fc0f1;
> -- 
> 2.19.1
> 



[Qemu-devel] [PATCH v3 03/23] hw/rdma: Return qpn 1 if ibqp is NULL

2018-11-12 Thread Yuval Shaia
The device does not support QP0, only QP1.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_backend.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_backend.h b/hw/rdma/rdma_backend.h
index 86e8fe8ab6..3ccc9a2494 100644
--- a/hw/rdma/rdma_backend.h
+++ b/hw/rdma/rdma_backend.h
@@ -33,7 +33,7 @@ static inline union ibv_gid *rdma_backend_gid(RdmaBackendDev 
*dev)
 
 static inline uint32_t rdma_backend_qpn(const RdmaBackendQP *qp)
 {
-return qp->ibqp ? qp->ibqp->qp_num : 0;
+return qp->ibqp ? qp->ibqp->qp_num : 1;
 }
 
 static inline uint32_t rdma_backend_mr_lkey(const RdmaBackendMR *mr)
-- 
2.17.2
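For context, InfiniBand reserves QP0 for subnet management (SMI) and QP1 for general services (GSI); since the emulated device only implements QP1, an uninitialized backend ibqp is reported as qpn 1 rather than 0. A stand-alone sketch of the accessor (stand-in types, not the real RdmaBackendQP):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for RdmaBackendQP: only the fields this sketch needs. */
struct sketch_qp {
    const void *ibqp;   /* NULL for the device-internal QP1 */
    unsigned qp_num;
};

static unsigned sketch_backend_qpn(const struct sketch_qp *qp)
{
    return qp->ibqp ? qp->qp_num : 1; /* QP1 (GSI) when not backed */
}
```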




[Qemu-devel] [PATCH v3 06/23] hw/pvrdma: Make function reset_device return void

2018-11-12 Thread Yuval Shaia
This function cannot fail, so change it to return void.

Signed-off-by: Yuval Shaia 
Reviewed-by: Marcel Apfelbaum
---
 hw/rdma/vmw/pvrdma_main.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 6c8c0154fa..fc2abd34af 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -369,13 +369,11 @@ static int unquiesce_device(PVRDMADev *dev)
 return 0;
 }
 
-static int reset_device(PVRDMADev *dev)
+static void reset_device(PVRDMADev *dev)
 {
 pvrdma_stop(dev);
 
 pr_dbg("Device reset complete\n");
-
-return 0;
 }
 
 static uint64_t regs_read(void *opaque, hwaddr addr, unsigned size)
-- 
2.17.2




[Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion

2018-11-12 Thread Yuval Shaia
The opcode for the WC should be set by the device, not taken from the work
element.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
 comp_ctx->cq_handle = qp->send_cq_handle;
 comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
 comp_ctx->cqe.qp = qp_handle;
-comp_ctx->cqe.opcode = wqe->hdr.opcode;
+comp_ctx->cqe.opcode = IBV_WC_SEND;
 
 rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
(struct ibv_sge *)&wqe->sge[0], wqe->hdr.num_sge,
-- 
2.17.2




[Qemu-devel] [PATCH v3 16/23] hw/pvrdma: Fill all CQE fields

2018-11-12 Thread Yuval Shaia
Add the ability to pass specific WC attributes to the CQE, such as the GRH_BIT flag.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_backend.c  | 59 +++--
 hw/rdma/rdma_backend.h  |  4 +--
 hw/rdma/vmw/pvrdma_qp_ops.c | 31 +++
 3 files changed, 58 insertions(+), 36 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 5675504165..e453bda8f9 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -59,13 +59,24 @@ struct backend_umad {
 char mad[RDMA_MAX_PRIVATE_DATA];
 };
 
-static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
+static void (*comp_handler)(void *ctx, struct ibv_wc *wc);
 
-static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
+static void dummy_comp_handler(void *ctx, struct ibv_wc *wc)
 {
 pr_err("No completion handler is registered\n");
 }
 
+static inline void complete_work(enum ibv_wc_status status, uint32_t vendor_err,
+ void *ctx)
+{
+struct ibv_wc wc = {0};
+
+wc.status = status;
+wc.vendor_err = vendor_err;
+
+comp_handler(ctx, &wc);
+}
+
 static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 {
 int i, ne;
@@ -90,7 +101,7 @@ static void poll_cq(RdmaDeviceResources *rdma_dev_res, struct ibv_cq *ibcq)
 }
 pr_dbg("Processing %s CQE\n", bctx->is_tx_req ? "send" : "recv");
 
-comp_handler(wc[i].status, wc[i].vendor_err, bctx->up_ctx);
+comp_handler(bctx->up_ctx, &wc[i]);
 
 rdma_rm_dealloc_cqe_ctx(rdma_dev_res, wc[i].wr_id);
 g_free(bctx);
@@ -184,8 +195,8 @@ static void start_comp_thread(RdmaBackendDev *backend_dev)
comp_handler_thread, backend_dev, QEMU_THREAD_DETACHED);
 }
 
-void rdma_backend_register_comp_handler(void (*handler)(int status,
-unsigned int vendor_err, void *ctx))
+void rdma_backend_register_comp_handler(void (*handler)(void *ctx,
+ struct ibv_wc *wc))
 {
 comp_handler = handler;
 }
@@ -369,14 +380,14 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
 if (qp_type == IBV_QPT_SMI) {
 pr_dbg("QP0 unsupported\n");
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
 } else if (qp_type == IBV_QPT_GSI) {
 pr_dbg("QP1\n");
 rc = mad_send(backend_dev, sgid_idx, sgid, sge, num_sge);
 if (rc) {
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
 } else {
-comp_handler(IBV_WC_SUCCESS, 0, ctx);
+complete_work(IBV_WC_SUCCESS, 0, ctx);
 }
 }
 return;
@@ -385,7 +396,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 pr_dbg("num_sge=%d\n", num_sge);
 if (!num_sge) {
 pr_dbg("num_sge=0\n");
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NO_SGE, ctx);
 return;
 }
 
@@ -396,21 +407,21 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
 if (unlikely(rc)) {
 pr_dbg("Failed to allocate cqe_ctx\n");
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_NOMEM, ctx);
 goto out_free_bctx;
 }
 
 rc = build_host_sge_array(backend_dev->rdma_dev_res, new_sge, sge, 
num_sge);
 if (rc) {
 pr_dbg("Error: Failed to build host SGE array\n");
-comp_handler(IBV_WC_GENERAL_ERR, rc, ctx);
+complete_work(IBV_WC_GENERAL_ERR, rc, ctx);
 goto out_dealloc_cqe_ctx;
 }
 
 if (qp_type == IBV_QPT_UD) {
 wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd, sgid_idx, dgid);
 if (!wr.wr.ud.ah) {
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
 goto out_dealloc_cqe_ctx;
 }
 wr.wr.ud.remote_qpn = dqpn;
@@ -428,7 +439,7 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 if (rc) {
 pr_dbg("Fail (%d, %d) to post send WQE to qpn %d\n", rc, errno,
 qp->ibqp->qp_num);
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+complete_work(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
 goto out_dealloc_cqe_ctx;
 }
 
@@ -497,13 +508,13 @@ void rdma_backend_post_recv(RdmaBackendDev *backend_dev,
 if (!qp->ibqp) { /* This field does not get initialized for QP0 and QP1 */
   
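The refactor in this patch — handing the registered handler a whole work-completion record instead of two scalars — can be modeled minimally as below (sketch types, not the real libibverbs definitions):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct ibv_wc: just the two fields the old API passed. */
struct sketch_wc {
    int status;
    unsigned vendor_err;
};

static struct sketch_wc last_wc; /* recorded by the handler below */

static void sketch_handler(void *ctx, struct sketch_wc *wc)
{
    (void)ctx;
    last_wc = *wc;
}

/* Like complete_work() in the patch: build a zeroed WC, fill in status
 * and vendor_err, and pass the whole record to the handler. */
static void sketch_complete_work(int status, unsigned vendor_err, void *ctx)
{
    struct sketch_wc wc;
    memset(&wc, 0, sizeof(wc));
    wc.status = status;
    wc.vendor_err = vendor_err;
    sketch_handler(ctx, &wc);
}
```

The benefit, as the diff shows, is that call sites can later add more WC fields (e.g. the GRH bit) without changing every handler signature again.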

[Qemu-devel] [PATCH v3 07/23] hw/pvrdma: Make default pkey 0xFFFF

2018-11-12 Thread Yuval Shaia
Commit 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF") exports the
default pkey as an external definition but omits the change from 0x7FFF to
0xFFFF.

Fixes: 6e7dba23af ("hw/pvrdma: Make default pkey 0xFFFF")

Signed-off-by: Yuval Shaia 
Reviewed-by: Marcel Apfelbaum
---
 hw/rdma/vmw/pvrdma.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index e3742d893a..15c3f28b86 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -52,7 +52,7 @@
 #define PVRDMA_FW_VERSION14
 
 /* Some defaults */
-#define PVRDMA_PKEY  0x7FFF
+#define PVRDMA_PKEY  0xFFFF
 
 typedef struct DSRInfo {
 dma_addr_t dma;
-- 
2.17.2




[Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response

2018-11-12 Thread Yuval Shaia
The driver checks the error code, so let's set it.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 
 if (rdma_backend_query_port(&dev->backend_dev,
 (struct ibv_port_attr *)&attrs)) {
-return -ENOMEM;
+resp->hdr.err = -ENOMEM;
+goto out;
 }
 
 memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 resp->attrs.active_width = 1;
 resp->attrs.active_speed = 1;
 
-return 0;
+out:
+pr_dbg("ret=%d\n", resp->hdr.err);
+return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 resp->pkey = PVRDMA_PKEY;
 pr_dbg("pkey=0x%x\n", resp->pkey);
 
-return 0;
+return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 
 rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 
 rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
 if (!cq) {
 pr_dbg("Invalid CQ handle\n");
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 
 rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
 if (!qp) {
 pr_dbg("Invalid QP handle\n");
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
 g_free(ring);
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 pr_dbg("index=%d\n", cmd->index);
 
 if (cmd->index >= MAX_PORT_GIDS) {
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union 
pvrdma_cmd_req *req,
 rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
  dev->backend_eth_device_name, gid, cmd->index);
 if (rc < 0) {
-return -EINVAL;
+rsp->hdr.err = rc;
+goto out;
 }
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union 
pvrdma_cmd_req *req,
 pr_dbg("index=%d\n", cmd->index);
 
 if (cmd->index >= MAX_PORT_GIDS) {
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union 
pvrdma_cmd_req *req,
 goto out;
 }
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req 
*req,
 resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
  >ctx_handle);
 
-pr_dbg("ret=%d\n", resp->hdr.err);
-
-return 0;
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union 
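The pattern this patch applies to every command handler — record the error in the response header and return that same value — reduces to a sketch like the following (names hypothetical):

```c
#include <assert.h>

struct sketch_rsp {
    struct { int err; } hdr;
};

/* Shape of each command handler after this patch: every exit path first
 * stores the result in rsp->hdr.err (so the guest driver sees it), then
 * returns the same value to the command dispatcher. */
static int sketch_destroy_cmd(int handle_valid, struct sketch_rsp *rsp)
{
    if (!handle_valid) {
        rsp->hdr.err = -22;  /* -EINVAL */
        goto out;
    }
    /* ... actual teardown would go here ... */
    rsp->hdr.err = 0;
out:
    return rsp->hdr.err;
}
```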

[Qemu-devel] [PATCH v3 04/23] hw/rdma: Abort send-op if fail to create addr handler

2018-11-12 Thread Yuval Shaia
The function create_ah might return NULL; let's exit with an error in that case.

Signed-off-by: Yuval Shaia 
Reviewed-by: Marcel Apfelbaum
---
 hw/rdma/rdma_backend.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index d7a4bbd91f..1e148398a2 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -338,6 +338,10 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 if (qp_type == IBV_QPT_UD) {
 wr.wr.ud.ah = create_ah(backend_dev, qp->ibpd,
 backend_dev->backend_gid_idx, dgid);
+if (!wr.wr.ud.ah) {
+comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_FAIL_BACKEND, ctx);
+goto out_dealloc_cqe_ctx;
+}
 wr.wr.ud.remote_qpn = dqpn;
 wr.wr.ud.remote_qkey = dqkey;
 }
-- 
2.17.2








[Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table

2018-11-12 Thread Yuval Shaia
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second
by the first IPv6 address and the third by the IPv4 address. Other
entries can be added by adding more IP addresses. The opposite also
holds, i.e. whenever an address is removed, the corresponding GID entry
is removed.

The process is done by the network and RDMA stacks. Whenever an address
is added the ib_core driver is notified and calls the device driver's
add_gid function which in turn updates the device.

To support this in the pvrdma device we need to hook into the create_bind
and destroy_bind HW commands triggered by the pvrdma driver in the guest.
Whenever a change is made to the pvrdma device's GID table a special
QMP message is sent to be processed by libvirt to update the address of
the backend Ethernet device.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_backend.c  | 243 +++-
 hw/rdma/rdma_backend.h  |  22 ++--
 hw/rdma/rdma_backend_defs.h |   3 +-
 hw/rdma/rdma_rm.c   | 104 ++-
 hw/rdma/rdma_rm.h   |  17 ++-
 hw/rdma/rdma_rm_defs.h  |   9 +-
 hw/rdma/rdma_utils.h|  15 +++
 hw/rdma/vmw/pvrdma.h|   2 +-
 hw/rdma/vmw/pvrdma_cmd.c|  55 
 hw/rdma/vmw/pvrdma_main.c   |  25 +---
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 370 insertions(+), 145 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 3eb0099f8d..5675504165 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -18,12 +18,14 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include 
 #include 
 #include 
 #include 
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources 
*rdma_dev_res,
 return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-struct backend_umad umad = {0};
-char *hdr, *msg;
+RdmaCmMuxMsg msg = {0};
+char *hdr, *data;
 int ret;
 
 pr_dbg("num_sge=%d\n", num_sge);
@@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct 
ibv_sge *sge,
 return -EINVAL;
 }
 
-umad.hdr.length = sge[0].length + sge[1].length;
-pr_dbg("msg_len=%d\n", umad.hdr.length);
+msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
+memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-if (umad.hdr.length > sizeof(umad.mad)) {
+msg.umad_len = sge[0].length + sge[1].length;
+pr_dbg("umad_len=%d\n", msg.umad_len);
+
+if (msg.umad_len > sizeof(msg.umad.mad)) {
 return -ENOMEM;
 }
 
-umad.hdr.addr.qpn = htobe32(1);
-umad.hdr.addr.grh_present = 1;
-umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-umad.hdr.addr.hop_limit = 1;
+msg.umad.hdr.addr.qpn = htobe32(1);
+msg.umad.hdr.addr.grh_present = 1;
+pr_dbg("sgid_idx=%d\n", sgid_idx);
+pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+msg.umad.hdr.addr.gid_index = sgid_idx;
+memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+msg.umad.hdr.addr.hop_limit = 1;
 
 hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+pr_dbg_buf("mad_data", data, sge[1].length);
 
-memcpy(&umad.mad[0], hdr, sge[0].length);
-memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
 rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-sizeof(umad));
+ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+sizeof(msg));
 
 pr_dbg("qemu_chr_fe_write=%d\n", ret);
 
-return (ret != sizeof(umad));
+return (ret != sizeof(msg));
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 RdmaBackendQP *qp, uint8_t qp_type,
 struct ibv_sge *sge, uint32_t num_sge,
-union ibv_gid *dgid, uint32_t dqpn,
-uint32_t dqkey, void *ctx)
+   

[Qemu-devel] [PATCH v3 05/23] hw/rdma: Add support for MAD packets

2018-11-12 Thread Yuval Shaia
MAD (Management Datagram) packets are widely used by various modules
both in kernel and in user space; for example the rdma_* API, which is
used to create and maintain a "connection" layer on top of RDMA, uses
several types of MAD packets.
To support MAD packets the device uses an external utility
(contrib/rdmacm-mux) to relay packets from and to the guest driver.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_backend.c  | 263 +++-
 hw/rdma/rdma_backend.h  |   4 +-
 hw/rdma/rdma_backend_defs.h |  10 +-
 hw/rdma/vmw/pvrdma.h|   2 +
 hw/rdma/vmw/pvrdma_main.c   |   4 +-
 5 files changed, 273 insertions(+), 10 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 1e148398a2..3eb0099f8d 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -16,8 +16,13 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qapi/error.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qnum.h"
 
 #include 
+#include 
+#include 
+#include 
 
 #include "trace.h"
 #include "rdma_utils.h"
@@ -33,16 +38,25 @@
 #define VENDOR_ERR_MAD_SEND 0x206
 #define VENDOR_ERR_INVLKEY  0x207
 #define VENDOR_ERR_MR_SMALL 0x208
+#define VENDOR_ERR_INV_MAD_BUFF 0x209
+#define VENDOR_ERR_INV_NUM_SGE  0x210
 
 #define THR_NAME_LEN 16
 #define THR_POLL_TO  5000
 
+#define MAD_HDR_SIZE sizeof(struct ibv_grh)
+
 typedef struct BackendCtx {
-uint64_t req_id;
 void *up_ctx;
 bool is_tx_req;
+struct ibv_sge sge; /* Used to save MAD recv buffer */
 } BackendCtx;
 
+struct backend_umad {
+struct ib_user_mad hdr;
+char mad[RDMA_MAX_PRIVATE_DATA];
+};
+
 static void (*comp_handler)(int status, unsigned int vendor_err, void *ctx);
 
 static void dummy_comp_handler(int status, unsigned int vendor_err, void *ctx)
@@ -286,6 +300,49 @@ static int build_host_sge_array(RdmaDeviceResources 
*rdma_dev_res,
 return 0;
 }
 
+static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
+uint32_t num_sge)
+{
+struct backend_umad umad = {0};
+char *hdr, *msg;
+int ret;
+
+pr_dbg("num_sge=%d\n", num_sge);
+
+if (num_sge != 2) {
+return -EINVAL;
+}
+
+umad.hdr.length = sge[0].length + sge[1].length;
+pr_dbg("msg_len=%d\n", umad.hdr.length);
+
+if (umad.hdr.length > sizeof(umad.mad)) {
+return -ENOMEM;
+}
+
+umad.hdr.addr.qpn = htobe32(1);
+umad.hdr.addr.grh_present = 1;
+umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
+memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
+umad.hdr.addr.hop_limit = 1;
+
+hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
+msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+memcpy(&umad.mad[0], hdr, sge[0].length);
+memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+
+rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
+
+ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
+sizeof(umad));
+
+pr_dbg("qemu_chr_fe_write=%d\n", ret);
+
+return (ret != sizeof(umad));
+}
+
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 RdmaBackendQP *qp, uint8_t qp_type,
 struct ibv_sge *sge, uint32_t num_sge,
@@ -304,9 +361,13 @@ void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_QP0, ctx);
 } else if (qp_type == IBV_QPT_GSI) {
 pr_dbg("QP1\n");
-comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+rc = mad_send(backend_dev, sge, num_sge);
+if (rc) {
+comp_handler(IBV_WC_GENERAL_ERR, VENDOR_ERR_MAD_SEND, ctx);
+} else {
+comp_handler(IBV_WC_SUCCESS, 0, ctx);
+}
 }
-pr_dbg("qp->ibqp is NULL for qp_type %d!!!\n", qp_type);
 return;
 }
 
@@ -370,6 +431,48 @@ out_free_bctx:
 g_free(bctx);
 }
 
+static unsigned int save_mad_recv_buffer(RdmaBackendDev *backend_dev,
+ struct ibv_sge *sge, uint32_t num_sge,
+ void *ctx)
+{
+BackendCtx *bctx;
+int rc;
+uint32_t bctx_id;
+
+if (num_sge != 1) {
+pr_dbg("Invalid num_sge (%d), expecting 1\n", num_sge);
+return VENDOR_ERR_INV_NUM_SGE;
+}
+
+if (sge[0].length < RDMA_MAX_PRIVATE_DATA + sizeof(struct ibv_grh)) {
+pr_dbg("Too small buffer for MAD\n");
+return VENDOR_ERR_INV_MAD_BUFF;
+}
+
+pr_dbg("addr=0x%" PRIx64"\n", sge[0].addr);
+pr_dbg("length=%d\n", sge[0].length);
+pr_dbg("lkey=%d\n", sge[0].lkey);
+
+bctx = g_malloc0(sizeof(*bctx));
+
+rc = rdma_rm_alloc_cqe_ctx(backend_dev->rdma_dev_res, &bctx_id, bctx);
+if (unlikely(rc)) {
+

[Qemu-devel] [PATCH v3 08/23] hw/pvrdma: Set the correct opcode for recv completion

2018-11-12 Thread Yuval Shaia
The function pvrdma_post_cqe populates the CQE entry with the opcode from
the given completion element. For receive operations the value was not
set. Fix it by setting it to IBV_WC_RECV.

Signed-off-by: Yuval Shaia 
Reviewed-by: Marcel Apfelbaum
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 762700a205..7b0f440fda 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -196,8 +196,9 @@ int pvrdma_qp_recv(PVRDMADev *dev, uint32_t qp_handle)
 comp_ctx = g_malloc(sizeof(CompHandlerCtx));
 comp_ctx->dev = dev;
 comp_ctx->cq_handle = qp->recv_cq_handle;
-comp_ctx->cqe.qp = qp_handle;
 comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
+comp_ctx->cqe.qp = qp_handle;
+comp_ctx->cqe.opcode = IBV_WC_RECV;
 
rdma_backend_post_recv(&dev->backend_dev, &dev->rdma_dev_res,
&qp->backend_qp, qp->qp_type,
-- 
2.17.2




[Qemu-devel] [PATCH v3 14/23] hw/rdma: Initialize node_guid from vmxnet3 mac address

2018-11-12 Thread Yuval Shaia
node_guid should be set once the device is loaded.
Make node_guid the GID format (EUI-64) of the PCI function 0 vmxnet3
device's MAC.

A new function was added to do the conversion.
So for example the MAC 56:b6:44:e9:62:dc will be converted to GID
54b6:44ff:fee9:62dc.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_utils.h  |  9 +
 hw/rdma/vmw/pvrdma_cmd.c  | 10 --
 hw/rdma/vmw/pvrdma_main.c |  5 -
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/hw/rdma/rdma_utils.h b/hw/rdma/rdma_utils.h
index 989db249ef..202abb3366 100644
--- a/hw/rdma/rdma_utils.h
+++ b/hw/rdma/rdma_utils.h
@@ -63,4 +63,13 @@ extern unsigned long pr_dbg_cnt;
 void *rdma_pci_dma_map(PCIDevice *dev, dma_addr_t addr, dma_addr_t plen);
 void rdma_pci_dma_unmap(PCIDevice *dev, void *buffer, dma_addr_t len);
 
+static inline void addrconf_addr_eui48(uint8_t *eui, const char *addr)
+{
+memcpy(eui, addr, 3);
+eui[3] = 0xFF;
+eui[4] = 0xFE;
+memcpy(eui + 5, addr + 3, 3);
+eui[0] ^= 2;
+}
+
 #endif
diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index a334f6205e..2979582fac 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -592,16 +592,6 @@ static int create_bind(PVRDMADev *dev, union 
pvrdma_cmd_req *req,
 return -EINVAL;
 }
 
-/* TODO: Since drivers stores node_guid at load_dsr phase then this
- * assignment is not relevant, i need to figure out a way how to
- * retrieve MAC of our netdev */
-if (!cmd->index) {
-dev->node_guid =
-dev->rdma_dev_res.ports[0].gid_tbl[0].gid.global.interface_id;
-pr_dbg("dev->node_guid=0x%llx\n",
-   (long long unsigned int)be64_to_cpu(dev->node_guid));
-}
-
 return 0;
 }
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index fa6468d221..95e9322b7c 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -264,7 +264,7 @@ static void init_dsr_dev_caps(PVRDMADev *dev)
 dsr->caps.sys_image_guid = 0;
 pr_dbg("sys_image_guid=%" PRIx64 "\n", dsr->caps.sys_image_guid);
 
-dsr->caps.node_guid = cpu_to_be64(dev->node_guid);
+dsr->caps.node_guid = dev->node_guid;
 pr_dbg("node_guid=%" PRIx64 "\n", be64_to_cpu(dsr->caps.node_guid));
 
 dsr->caps.phys_port_cnt = MAX_PORTS;
@@ -579,6 +579,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 /* Break if not vmxnet3 device in slot 0 */
 dev->func0 = VMXNET3(pci_get_function_0(pdev));
 
+addrconf_addr_eui48((unsigned char *)&dev->node_guid,
+(const char *)&dev->func0->conf.macaddr.a);
+
 memdev_root = object_resolve_path("/objects", NULL);
 if (memdev_root) {
 object_child_foreach(memdev_root, pvrdma_check_ram_shared, 
&ram_shared);
-- 
2.17.2




[Qemu-devel] [PATCH v3 13/23] hw/pvrdma: Make sure PCI function 0 is vmxnet3

2018-11-12 Thread Yuval Shaia
The guest driver enforces this; we should too.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma.h  | 2 ++
 hw/rdma/vmw/pvrdma_main.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index b019cb843a..10a3c4fb7c 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -20,6 +20,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
+#include "hw/net/vmxnet3_defs.h"
 
 #include "../rdma_backend_defs.h"
 #include "../rdma_rm_defs.h"
@@ -85,6 +86,7 @@ typedef struct PVRDMADev {
 RdmaBackendDev backend_dev;
 RdmaDeviceResources rdma_dev_res;
 CharBackend mad_chr;
+VMXNET3State *func0;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index ac8c092db0..fa6468d221 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -576,6 +576,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 return;
 }
 
+/* Break if not vmxnet3 device in slot 0 */
+dev->func0 = VMXNET3(pci_get_function_0(pdev));
+
 memdev_root = object_resolve_path("/objects", NULL);
 if (memdev_root) {
 object_child_foreach(memdev_root, pvrdma_check_ram_shared, 
&ram_shared);
-- 
2.17.2




[Qemu-devel] [PATCH v3 20/23] hw/pvrdma: Clean device's resource when system is shutdown

2018-11-12 Thread Yuval Shaia
In order to clean up some external resources such as GIDs, QPs etc.,
register to receive a notification when the VM is shut down.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma.h  |  2 ++
 hw/rdma/vmw/pvrdma_main.c | 12 
 2 files changed, 14 insertions(+)

diff --git a/hw/rdma/vmw/pvrdma.h b/hw/rdma/vmw/pvrdma.h
index 10a3c4fb7c..ffae36986e 100644
--- a/hw/rdma/vmw/pvrdma.h
+++ b/hw/rdma/vmw/pvrdma.h
@@ -17,6 +17,7 @@
 #define PVRDMA_PVRDMA_H
 
 #include "qemu/units.h"
+#include "qemu/notify.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msix.h"
 #include "chardev/char-fe.h"
@@ -87,6 +88,7 @@ typedef struct PVRDMADev {
 RdmaDeviceResources rdma_dev_res;
 CharBackend mad_chr;
 VMXNET3State *func0;
+Notifier shutdown_notifier;
 } PVRDMADev;
 #define PVRDMA_DEV(dev) OBJECT_CHECK(PVRDMADev, (dev), PVRDMA_HW_NAME)
 
diff --git a/hw/rdma/vmw/pvrdma_main.c b/hw/rdma/vmw/pvrdma_main.c
index 95e9322b7c..45a59cddf9 100644
--- a/hw/rdma/vmw/pvrdma_main.c
+++ b/hw/rdma/vmw/pvrdma_main.c
@@ -24,6 +24,7 @@
 #include "hw/qdev-properties.h"
 #include "cpu.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 #include "../rdma_rm.h"
 #include "../rdma_backend.h"
@@ -559,6 +560,14 @@ static int pvrdma_check_ram_shared(Object *obj, void 
*opaque)
 return 0;
 }
 
+static void pvrdma_shutdown_notifier(Notifier *n, void *opaque)
+{
+PVRDMADev *dev = container_of(n, PVRDMADev, shutdown_notifier);
+PCIDevice *pci_dev = PCI_DEVICE(dev);
+
+pvrdma_fini(pci_dev);
+}
+
 static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 {
 int rc;
@@ -623,6 +632,9 @@ static void pvrdma_realize(PCIDevice *pdev, Error **errp)
 goto out;
 }
 
+dev->shutdown_notifier.notify = pvrdma_shutdown_notifier;
+qemu_register_shutdown_notifier(&dev->shutdown_notifier);
+
 out:
 if (rc) {
 error_append_hint(errp, "Device fail to load\n");
-- 
2.17.2




[Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers

2018-11-12 Thread Yuval Shaia
The notifier will be used to signal the shutdown event, informing that
the system is being shut down. This allows devices and other components
to run cleanup code needed before the VM is shut down.

Signed-off-by: Yuval Shaia 
---
 include/sysemu/sysemu.h |  1 +
 vl.c| 15 ++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index 1fcacc5caa..d33d52522c 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
 NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
 NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(ShutdownCause cause)
+{
+qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
+notifier_list_notify(&shutdown_notifiers, &cause);
+}
+
 void qemu_system_powerdown_request(void)
 {
 trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
 debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
 request = qemu_shutdown_requested();
 if (request) {
 qemu_kill_report();
-qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+qemu_system_shutdown(request);
 if (no_shutdown) {
 vm_stop(RUN_STATE_SHUTDOWN);
 } else {
-- 
2.17.2




[Qemu-devel] [PATCH v3 21/23] hw/rdma: Do not use bitmap_zero_extend to free bitmap

2018-11-12 Thread Yuval Shaia
bitmap_zero_extend is designed for extending a bitmap, not for
shrinking it.
Use g_free instead.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_rm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/rdma_rm.c b/hw/rdma/rdma_rm.c
index 0a5ab8935a..35a96d9a64 100644
--- a/hw/rdma/rdma_rm.c
+++ b/hw/rdma/rdma_rm.c
@@ -43,7 +43,7 @@ static inline void res_tbl_free(RdmaRmResTbl *tbl)
 {
 qemu_mutex_destroy(>lock);
 g_free(tbl->tbl);
-bitmap_zero_extend(tbl->bitmap, tbl->tbl_sz, 0);
+g_free(tbl->bitmap);
 }
 
 static inline void *res_tbl_get(RdmaRmResTbl *tbl, uint32_t handle)
-- 
2.17.2




[Qemu-devel] [PATCH v3 09/23] hw/pvrdma: Set the correct opcode for send completion

2018-11-12 Thread Yuval Shaia
The WC opcode should be set by the device and not taken from the work
element.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma_qp_ops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_qp_ops.c b/hw/rdma/vmw/pvrdma_qp_ops.c
index 7b0f440fda..3388be1926 100644
--- a/hw/rdma/vmw/pvrdma_qp_ops.c
+++ b/hw/rdma/vmw/pvrdma_qp_ops.c
@@ -154,7 +154,7 @@ int pvrdma_qp_send(PVRDMADev *dev, uint32_t qp_handle)
 comp_ctx->cq_handle = qp->send_cq_handle;
 comp_ctx->cqe.wr_id = wqe->hdr.wr_id;
 comp_ctx->cqe.qp = qp_handle;
-comp_ctx->cqe.opcode = wqe->hdr.opcode;
+comp_ctx->cqe.opcode = IBV_WC_SEND;
 
rdma_backend_post_send(&dev->backend_dev, &qp->backend_qp, qp->qp_type,
(struct ibv_sge *)&wqe->sge[0],
wqe->hdr.num_sge,
-- 
2.17.2




[Qemu-devel] [PATCH v3 01/23] contrib/rdmacm-mux: Add implementation of RDMA User MAD multiplexer

2018-11-12 Thread Yuval Shaia
The RDMA MAD kernel module (ibcm) disallows more than one MAD agent for
a given MAD class.
This does not go hand in hand with the qemu pvrdma device's
requirements, where each VM is a MAD agent.
Fix it by adding an implementation of an RDMA MAD multiplexer service
which on one hand registers as the sole MAD agent with the kernel module
and on the other hand serves more than one VM.

Design Overview:

A server process is registered to UMAD framework (for this to work the
rdma_cm kernel module needs to be unloaded) and creates a unix socket to
listen to incoming request from clients.
A client process (such as QEMU) connects to this unix socket and
registers with its own GID.

TX:
---
When a client needs to send an rdma_cm MAD message it constructs it the
same way as without this multiplexer, i.e. creates a umad packet, but
this time writes its content to the socket instead of calling umad_send().
The server, upon receiving such a message, fetches local_comm_id from it
so a context for this session can be maintained, and relays the message
to the UMAD layer by calling umad_send().

RX:
---
The server creates a worker thread to process incoming rdma_cm MAD
messages. When an incoming message arrives (umad_recv()) the server,
depending on the message type (attr_id), looks for the target client by
searching either the gid->fd table or the local_comm_id->fd table. With
the extracted fd the server relays the incoming message to the client.

Signed-off-by: Yuval Shaia 
---
 MAINTAINERS  |   1 +
 Makefile |   3 +
 Makefile.objs|   1 +
 contrib/rdmacm-mux/Makefile.objs |   4 +
 contrib/rdmacm-mux/main.c| 771 +++
 contrib/rdmacm-mux/rdmacm-mux.h  |  56 +++
 6 files changed, 836 insertions(+)
 create mode 100644 contrib/rdmacm-mux/Makefile.objs
 create mode 100644 contrib/rdmacm-mux/main.c
 create mode 100644 contrib/rdmacm-mux/rdmacm-mux.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 98a1856afc..e087d58ac6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2231,6 +2231,7 @@ S: Maintained
 F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
+F: contrib/rdmacm-mux/*
 
 Build and test automation
 -
diff --git a/Makefile b/Makefile
index f2947186a4..94072776ff 100644
--- a/Makefile
+++ b/Makefile
@@ -418,6 +418,7 @@ dummy := $(call unnest-vars,, \
 elf2dmp-obj-y \
 ivshmem-client-obj-y \
 ivshmem-server-obj-y \
+rdmacm-mux-obj-y \
 libvhost-user-obj-y \
 vhost-user-scsi-obj-y \
 vhost-user-blk-obj-y \
@@ -725,6 +726,8 @@ vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) 
libvhost-user.a
$(call LINK, $^)
 vhost-user-blk$(EXESUF): $(vhost-user-blk-obj-y) libvhost-user.a
$(call LINK, $^)
+rdmacm-mux$(EXESUF): $(rdmacm-mux-obj-y) $(COMMON_LDADDS)
+   $(call LINK, $^)
 
 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
$(call quiet-command,$(PYTHON) $< $@ \
diff --git a/Makefile.objs b/Makefile.objs
index 1e1ff387d7..cc7df3ad80 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -194,6 +194,7 @@ vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
 vhost-user-blk-obj-y = contrib/vhost-user-blk/
+rdmacm-mux-obj-y = contrib/rdmacm-mux/
 
 ##
 trace-events-subdirs =
diff --git a/contrib/rdmacm-mux/Makefile.objs b/contrib/rdmacm-mux/Makefile.objs
new file mode 100644
index 00..be3eacb6f7
--- /dev/null
+++ b/contrib/rdmacm-mux/Makefile.objs
@@ -0,0 +1,4 @@
+ifdef CONFIG_PVRDMA
+CFLAGS += -libumad -Wno-format-truncation
+rdmacm-mux-obj-y = main.o
+endif
diff --git a/contrib/rdmacm-mux/main.c b/contrib/rdmacm-mux/main.c
new file mode 100644
index 00..47cf0ac7bc
--- /dev/null
+++ b/contrib/rdmacm-mux/main.c
@@ -0,0 +1,771 @@
+/*
+ * QEMU paravirtual RDMA - rdmacm-mux implementation
+ *
+ * Copyright (C) 2018 Oracle
+ * Copyright (C) 2018 Red Hat Inc
+ *
+ * Authors:
+ * Yuval Shaia 
+ * Marcel Apfelbaum 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <sys/poll.h>
+#include <sys/ioctl.h>
+#include <pthread.h>
+#include <syslog.h>
+
+#include <infiniband/verbs.h>
+#include <infiniband/umad.h>
+#include <infiniband/umad_types.h>
+#include <infiniband/umad_sa.h>
+#include <infiniband/umad_cm.h>
+
+#include "rdmacm-mux.h"
+
+#define SCALE_US 1000
+#define COMMID_TTL 2 /* How many SCALE_US a context of MAD session is saved */
+#define SLEEP_SECS 5 /* This is used both in poll() and thread */
+#define SERVER_LISTEN_BACKLOG 10
+#define MAX_CLIENTS 4096
+#define MAD_RMPP_VERSION 0
+#define MAD_METHOD_MASK0 0x8
+
+#define IB_USER_MAD_LONGS_PER_METHOD_MASK (128 / (8 * 

Re: [Qemu-devel] [PATCH v5 01/14] qtest: Add set_irq_in command to set IRQ/GPIO level

2018-11-12 Thread Thomas Huth
On 2018-11-12 22:42, Steffen Görtz wrote:
> Adds a new qtest command "set_irq_in" which allows
> to set qemu gpio lines to a given level.
> 
> Based on https://lists.gnu.org/archive/html/qemu-devel/2012-12/msg02363.html
> which never got merged.
> 
> Signed-off-by: Steffen Görtz 
> Originally-by: Matthew Ogilvie 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  qtest.c  | 43 +++
>  tests/libqtest.c | 10 ++
>  tests/libqtest.h | 13 +
>  3 files changed, 66 insertions(+)

Reviewed-by: Thomas Huth 



[Qemu-devel] [PATCH v3 12/23] vmxnet3: Move some definitions to header file

2018-11-12 Thread Yuval Shaia
pvrdma setup requires a vmxnet3 device on PCI function 0 and a PVRDMA
device on PCI function 1.
The pvrdma device needs to access the vmxnet3 device object for several
reasons:
1. To make sure PCI function 0 is a vmxnet3 device.
2. To monitor the vmxnet3 device state.
3. To configure node_guid according to the vmxnet3 device's MAC address.

To be able to access the vmxnet3 device, the definition of VMXNET3State
is moved to a new header file.

Signed-off-by: Yuval Shaia 
Reviewed-by: Dmitry Fleytman 
---
 hw/net/vmxnet3.c  | 116 +---
 hw/net/vmxnet3_defs.h | 133 ++
 2 files changed, 134 insertions(+), 115 deletions(-)
 create mode 100644 hw/net/vmxnet3_defs.h

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 3648630386..54746a4030 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -18,7 +18,6 @@
 #include "qemu/osdep.h"
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
-#include "net/net.h"
 #include "net/tap.h"
 #include "net/checksum.h"
 #include "sysemu/sysemu.h"
@@ -29,6 +28,7 @@
 #include "migration/register.h"
 
 #include "vmxnet3.h"
+#include "vmxnet3_defs.h"
 #include "vmxnet_debug.h"
 #include "vmware_utils.h"
 #include "net_tx_pkt.h"
@@ -131,23 +131,11 @@ typedef struct VMXNET3Class {
 DeviceRealize parent_dc_realize;
 } VMXNET3Class;
 
-#define TYPE_VMXNET3 "vmxnet3"
-#define VMXNET3(obj) OBJECT_CHECK(VMXNET3State, (obj), TYPE_VMXNET3)
-
 #define VMXNET3_DEVICE_CLASS(klass) \
 OBJECT_CLASS_CHECK(VMXNET3Class, (klass), TYPE_VMXNET3)
 #define VMXNET3_DEVICE_GET_CLASS(obj) \
 OBJECT_GET_CLASS(VMXNET3Class, (obj), TYPE_VMXNET3)
 
-/* Cyclic ring abstraction */
-typedef struct {
-hwaddr pa;
-uint32_t size;
-uint32_t cell_size;
-uint32_t next;
-uint8_t gen;
-} Vmxnet3Ring;
-
 static inline void vmxnet3_ring_init(PCIDevice *d,
 Vmxnet3Ring *ring,
  hwaddr pa,
@@ -245,108 +233,6 @@ vmxnet3_dump_rx_descr(struct Vmxnet3_RxDesc *descr)
   descr->rsvd, descr->dtype, descr->ext1, descr->btype);
 }
 
-/* Device state and helper functions */
-#define VMXNET3_RX_RINGS_PER_QUEUE (2)
-
-typedef struct {
-Vmxnet3Ring tx_ring;
-Vmxnet3Ring comp_ring;
-
-uint8_t intr_idx;
-hwaddr tx_stats_pa;
-struct UPT1_TxStats txq_stats;
-} Vmxnet3TxqDescr;
-
-typedef struct {
-Vmxnet3Ring rx_ring[VMXNET3_RX_RINGS_PER_QUEUE];
-Vmxnet3Ring comp_ring;
-uint8_t intr_idx;
-hwaddr rx_stats_pa;
-struct UPT1_RxStats rxq_stats;
-} Vmxnet3RxqDescr;
-
-typedef struct {
-bool is_masked;
-bool is_pending;
-bool is_asserted;
-} Vmxnet3IntState;
-
-typedef struct {
-PCIDevice parent_obj;
-NICState *nic;
-NICConf conf;
-MemoryRegion bar0;
-MemoryRegion bar1;
-MemoryRegion msix_bar;
-
-Vmxnet3RxqDescr rxq_descr[VMXNET3_DEVICE_MAX_RX_QUEUES];
-Vmxnet3TxqDescr txq_descr[VMXNET3_DEVICE_MAX_TX_QUEUES];
-
-/* Whether MSI-X support was installed successfully */
-bool msix_used;
-hwaddr drv_shmem;
-hwaddr temp_shared_guest_driver_memory;
-
-uint8_t txq_num;
-
-/* This boolean tells whether RX packet being indicated has to */
-/* be split into head and body chunks from different RX rings  */
-bool rx_packets_compound;
-
-bool rx_vlan_stripping;
-bool lro_supported;
-
-uint8_t rxq_num;
-
-/* Network MTU */
-uint32_t mtu;
-
-/* Maximum number of fragments for indicated TX packets */
-uint32_t max_tx_frags;
-
-/* Maximum number of fragments for indicated RX packets */
-uint16_t max_rx_frags;
-
-/* Index for events interrupt */
-uint8_t event_int_idx;
-
-/* Whether automatic interrupts masking enabled */
-bool auto_int_masking;
-
-bool peer_has_vhdr;
-
-/* TX packets to QEMU interface */
-struct NetTxPkt *tx_pkt;
-uint32_t offload_mode;
-uint32_t cso_or_gso_size;
-uint16_t tci;
-bool needs_vlan;
-
-struct NetRxPkt *rx_pkt;
-
-bool tx_sop;
-bool skip_current_tx_pkt;
-
-uint32_t device_active;
-uint32_t last_command;
-
-uint32_t link_status_and_speed;
-
-Vmxnet3IntState interrupt_states[VMXNET3_MAX_INTRS];
-
-uint32_t temp_mac;   /* To store the low part first */
-
-MACAddr perm_mac;
-uint32_t vlan_table[VMXNET3_VFT_SIZE];
-uint32_t rx_mode;
-MACAddr *mcast_list;
-uint32_t mcast_list_len;
-uint32_t mcast_list_buff_size; /* needed for live migration. */
-
-/* Compatibility flags for migration */
-uint32_t compat_flags;
-} VMXNET3State;
-
 /* Interrupt management */
 
 /*
diff --git a/hw/net/vmxnet3_defs.h b/hw/net/vmxnet3_defs.h
new file mode 100644
index 00..6c19d29b12
--- /dev/null
+++ b/hw/net/vmxnet3_defs.h
@@ -0,0 +1,133 @@
+/*
+ * 

[Qemu-devel] [PATCH v3 15/23] hw/pvrdma: Make device state depend on Ethernet function state

2018-11-12 Thread Yuval Shaia
The user should be able to control the device by changing the Ethernet
function state, so if the user runs 'ifconfig ens3 down' the PVRDMA
function should be down as well.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma_cmd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 2979582fac..0d3c818c20 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -139,7 +139,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 resp->hdr.ack = PVRDMA_CMD_QUERY_PORT_RESP;
 resp->hdr.err = 0;
 
-resp->attrs.state = attrs.state;
+resp->attrs.state = dev->func0->device_active ? attrs.state :
+PVRDMA_PORT_DOWN;
 resp->attrs.max_mtu = attrs.max_mtu;
 resp->attrs.active_mtu = attrs.active_mtu;
 resp->attrs.phys_state = attrs.phys_state;
-- 
2.17.2




Re: [Qemu-devel] [PATCH] slirp: add tftp tracing

2018-11-12 Thread Liam Merwick



On 13/11/2018 07:03, Gerd Hoffmann wrote:

Useful when debugging pxeboot, to see what the guest tries to do.

Signed-off-by: Gerd Hoffmann 


Reviewed-by: Liam Merwick 



---
  Makefile.objs  | 1 +
  slirp/tftp.c   | 3 +++
  slirp/trace-events | 5 +
  3 files changed, 9 insertions(+)
  create mode 100644 slirp/trace-events

diff --git a/Makefile.objs b/Makefile.objs
index 1e1ff387d7..31852eaf8f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -251,6 +251,7 @@ trace-events-subdirs += net
  trace-events-subdirs += qapi
  trace-events-subdirs += qom
  trace-events-subdirs += scsi
+trace-events-subdirs += slirp
  trace-events-subdirs += target/arm
  trace-events-subdirs += target/i386
  trace-events-subdirs += target/mips
diff --git a/slirp/tftp.c b/slirp/tftp.c
index a9bc4bb1b6..735b57aa55 100644
--- a/slirp/tftp.c
+++ b/slirp/tftp.c
@@ -26,6 +26,7 @@
  #include "slirp.h"
  #include "qemu-common.h"
  #include "qemu/cutils.h"
+#include "trace.h"
  
  static inline int tftp_session_in_use(struct tftp_session *spt)

  {
@@ -204,6 +205,7 @@ static void tftp_send_error(struct tftp_session *spt,
struct mbuf *m;
struct tftp_t *tp;
  
+  trace_slirp_tftp_error(msg);

m = m_get(spt->slirp);
  
if (!m) {

@@ -323,6 +325,7 @@ static void tftp_handle_rrq(Slirp *slirp, struct sockaddr_storage *srcsas,
break;
  }
}
+  trace_slirp_tftp_rrq(req_fname);
  
/* check mode */

if ((pktlen - k) < 6) {
diff --git a/slirp/trace-events b/slirp/trace-events
new file mode 100644
index 00..ff8f656e8c
--- /dev/null
+++ b/slirp/trace-events
@@ -0,0 +1,5 @@
+# See docs/devel/tracing.txt for syntax documentation.
+
+# slirp/tftp.c
+slirp_tftp_rrq(const char *file) "file: %s"
+slirp_tftp_error(const char *file) "msg: %s"





[Qemu-devel] [PATCH v3 11/23] hw/pvrdma: Add support to allow guest to configure GID table

2018-11-12 Thread Yuval Shaia
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second
by the first IPv6 address and the third by the IPv4 address. Other
entries can be added by adding more IP addresses. The opposite also
holds, i.e. whenever an address is removed, the corresponding GID entry
is removed.

The process is done by the network and RDMA stacks. Whenever an address
is added, the ib_core driver is notified and calls the device driver's
add_gid function, which in turn updates the device.

To support this in the pvrdma device we need to hook into the
create_bind and destroy_bind HW commands triggered by the pvrdma driver
in the guest.
Whenever a change is made to the pvrdma device's GID table, a special
QMP message is sent to be processed by libvirt to update the address of
the backend Ethernet device.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/rdma_backend.c  | 243 +++-
 hw/rdma/rdma_backend.h  |  22 ++--
 hw/rdma/rdma_backend_defs.h |   3 +-
 hw/rdma/rdma_rm.c   | 104 ++-
 hw/rdma/rdma_rm.h   |  17 ++-
 hw/rdma/rdma_rm_defs.h  |   9 +-
 hw/rdma/rdma_utils.h|  15 +++
 hw/rdma/vmw/pvrdma.h|   2 +-
 hw/rdma/vmw/pvrdma_cmd.c|  55 
 hw/rdma/vmw/pvrdma_main.c   |  25 +---
 hw/rdma/vmw/pvrdma_qp_ops.c |  20 +++
 11 files changed, 370 insertions(+), 145 deletions(-)

diff --git a/hw/rdma/rdma_backend.c b/hw/rdma/rdma_backend.c
index 3eb0099f8d..5675504165 100644
--- a/hw/rdma/rdma_backend.c
+++ b/hw/rdma/rdma_backend.c
@@ -18,12 +18,14 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qnum.h"
+#include "qapi/qapi-events-rdma.h"
 
 #include 
 #include 
 #include 
 #include 
 
+#include "contrib/rdmacm-mux/rdmacm-mux.h"
 #include "trace.h"
 #include "rdma_utils.h"
 #include "rdma_rm.h"
@@ -300,11 +302,11 @@ static int build_host_sge_array(RdmaDeviceResources *rdma_dev_res,
 return 0;
 }
 
-static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
-uint32_t num_sge)
+static int mad_send(RdmaBackendDev *backend_dev, uint8_t sgid_idx,
+union ibv_gid *sgid, struct ibv_sge *sge, uint32_t num_sge)
 {
-struct backend_umad umad = {0};
-char *hdr, *msg;
+RdmaCmMuxMsg msg = {0};
+char *hdr, *data;
 int ret;
 
 pr_dbg("num_sge=%d\n", num_sge);
@@ -313,41 +315,50 @@ static int mad_send(RdmaBackendDev *backend_dev, struct ibv_sge *sge,
 return -EINVAL;
 }
 
-umad.hdr.length = sge[0].length + sge[1].length;
-pr_dbg("msg_len=%d\n", umad.hdr.length);
+msg.hdr.msg_type = RDMACM_MUX_MSG_TYPE_MAD;
+memcpy(msg.hdr.sgid.raw, sgid->raw, sizeof(msg.hdr.sgid));
 
-if (umad.hdr.length > sizeof(umad.mad)) {
+msg.umad_len = sge[0].length + sge[1].length;
+pr_dbg("umad_len=%d\n", msg.umad_len);
+
+if (msg.umad_len > sizeof(msg.umad.mad)) {
 return -ENOMEM;
 }
 
-umad.hdr.addr.qpn = htobe32(1);
-umad.hdr.addr.grh_present = 1;
-umad.hdr.addr.gid_index = backend_dev->backend_gid_idx;
-memcpy(umad.hdr.addr.gid, backend_dev->gid.raw, sizeof(umad.hdr.addr.gid));
-umad.hdr.addr.hop_limit = 1;
+msg.umad.hdr.addr.qpn = htobe32(1);
+msg.umad.hdr.addr.grh_present = 1;
+pr_dbg("sgid_idx=%d\n", sgid_idx);
+pr_dbg("sgid=0x%llx\n", sgid->global.interface_id);
+msg.umad.hdr.addr.gid_index = sgid_idx;
+memcpy(msg.umad.hdr.addr.gid, sgid->raw, sizeof(msg.umad.hdr.addr.gid));
+msg.umad.hdr.addr.hop_limit = 1;
 
 hdr = rdma_pci_dma_map(backend_dev->dev, sge[0].addr, sge[0].length);
-msg = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+data = rdma_pci_dma_map(backend_dev->dev, sge[1].addr, sge[1].length);
+
+pr_dbg_buf("mad_hdr", hdr, sge[0].length);
+pr_dbg_buf("mad_data", data, sge[1].length);
 
-memcpy(&umad.mad[0], hdr, sge[0].length);
-memcpy(&umad.mad[sge[0].length], msg, sge[1].length);
+memcpy(&msg.umad.mad[0], hdr, sge[0].length);
+memcpy(&msg.umad.mad[sge[0].length], data, sge[1].length);
 
-rdma_pci_dma_unmap(backend_dev->dev, msg, sge[1].length);
+rdma_pci_dma_unmap(backend_dev->dev, data, sge[1].length);
 rdma_pci_dma_unmap(backend_dev->dev, hdr, sge[0].length);
 
-ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&umad,
-sizeof(umad));
+ret = qemu_chr_fe_write(backend_dev->mad_chr_be, (const uint8_t *)&msg,
+sizeof(msg));
 
 pr_dbg("qemu_chr_fe_write=%d\n", ret);
 
-return (ret != sizeof(umad));
+return (ret != sizeof(msg));
 }
 
 void rdma_backend_post_send(RdmaBackendDev *backend_dev,
 RdmaBackendQP *qp, uint8_t qp_type,
 struct ibv_sge *sge, uint32_t num_sge,
-union ibv_gid *dgid, uint32_t dqpn,
-uint32_t dqkey, void *ctx)
+   

[Qemu-devel] [PATCH v3 23/23] docs: Update pvrdma device documentation

2018-11-12 Thread Yuval Shaia
The interface with the device is changed with the addition of support
for MAD packets.
Adjust the documentation accordingly.

While there, fix a minor mistake which may lead one to think that there
is a relation between using RXE on the host and compatibility with
bare-metal peers.

Signed-off-by: Yuval Shaia 
---
 docs/pvrdma.txt | 103 +++-
 1 file changed, 84 insertions(+), 19 deletions(-)

diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
index 5599318159..9e8d1674b7 100644
--- a/docs/pvrdma.txt
+++ b/docs/pvrdma.txt
@@ -9,8 +9,9 @@ It works with its Linux Kernel driver AS IS, no need for any special guest
 modifications.
 
 While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines and does not require an RDMA HCA in the host, it
-can work with Soft-RoCE (rxe).
+metal RDMA-enabled machines as peers.
+
+It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
 
 It does not require the whole guest RAM to be pinned allowing memory
 over-commit and, even if not implemented yet, migration support will be
@@ -78,29 +79,93 @@ the required RDMA libraries.
 
 3. Usage
 
+
+
+3.1 VM Memory settings
+==
 Currently the device is working only with memory backed RAM
 and it must be mark as "shared":
-m 1G \
-object memory-backend-ram,id=mb1,size=1G,share \
-numa node,memdev=mb1 \
 
-The pvrdma device is composed of two functions:
- - Function 0 is a vmxnet Ethernet Device which is redundant in Guest
-   but is required to pass the ibdevice GID using its MAC.
-   Examples:
- For an rxe backend using eth0 interface it will use its mac:
-   -device vmxnet3,addr=.0,multifunction=on,mac=
- For an SRIOV VF, we take the Ethernet Interface exposed by it:
-   -device vmxnet3,multifunction=on,mac=
- - Function 1 is the actual device:
-   -device pvrdma,addr=.1,backend-dev=,backend-gid-idx=,backend-port=
-   where the ibdevice can be rxe or RDMA VF (e.g. mlx5_4)
- Note: Pay special attention that the GID at backend-gid-idx matches vmxnet's MAC.
- The rules of conversion are part of the RoCE spec, but since manual conversion
- is not required, spotting problems is not hard:
-Example: GID: fe80::::7efe:90ff:fecb:743a
- MAC: 7c:fe:90:cb:74:3a
-Note the difference between the first byte of the MAC and the GID.
+
+3.2 MAD Multiplexer
+===
+MAD Multiplexer is a service that exposes MAD-like interface for VMs in
+order to overcome the limitation where only single entity can register with
+MAD layer to send and receive RDMA-CM MAD packets.
+
+To build rdmacm-mux run
+# make rdmacm-mux
+
+The program accepts 3 command line arguments and exposes a UNIX socket to
+be used to relay control and data messages to and from the service.
+-s unix-socket-path   Path to unix socket to listen on
+  (default /var/run/rdmacm-mux)
+-d rdma-device-name   Name of RDMA device to register with
+  (default rxe0)
+-p rdma-device-port   Port number of RDMA device to register with
+  (default 1)
+The final UNIX socket file name is a concatenation of the 3 arguments so
+for example for device name mlx5_0 and port 2 the file
+/var/run/rdmacm-mux-mlx5_0-2 will be created.
+
+Please refer to contrib/rdmacm-mux for more details.
+
+
+3.3 PCI devices settings
+
+RoCE device exposes two functions - Ethernet and RDMA.
+To support it, pvrdma device is composed of two PCI functions, an Ethernet
+device of type vmxnet3 on PCI slot 0 and a pvrdma device on PCI slot 1. The
+Ethernet function can be used for other Ethernet purposes such as IP.
+
+
+3.4 Device parameters
+=
+- netdev: Specifies the Ethernet device on host. For Soft-RoCE (rxe) this
+  would be the Ethernet device used to create it. For any other physical
+  RoCE device this would be the netdev name of the device.
+- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
+- mad-chardev: The name of the MAD multiplexer char device.
+- ibport: In case of multi-port device (such as Mellanox's HCA) this
+  specify the port to use. If not set 1 will be used.
+- dev-caps-max-mr-size: The maximum size of MR.
+- dev-caps-max-qp: Maximum number of QPs.
+- dev-caps-max-sge: Maximum number of SGE elements in WR.
+- dev-caps-max-cq: Maximum number of CQs.
+- dev-caps-max-mr: Maximum number of MRs.
+- dev-caps-max-pd: Maximum number of PDs.
+- dev-caps-max-ah: Maximum number of AHs.
+
+Notes:
+- The first 3 parameters are mandatory settings, the rest have their
+  defaults.
+- The last 8 parameters (the ones prefixed by dev-caps) define the top
+  limits, but the final values are adjusted by the backend device limitations.
+
+3.5 Example
+===
+Define bridge device with vmxnet3 network backend:
+
+  
+  
+  
+  
+
+
+Define pvrdma device:
+
+  
+  
+  
+  
+  
+  
+  
+  
+
 
 
 
-- 
2.17.2




Re: [Qemu-devel] [PATCH v1 2/3] intel-iommu: extend VTD emulation to allow 57-bit IOVA address width.

2018-11-12 Thread Yu Zhang
On Tue, Nov 13, 2018 at 02:12:17PM +0800, Peter Xu wrote:
> On Tue, Nov 13, 2018 at 01:45:44PM +0800, Yu Zhang wrote:
> 
> [...]
> 
> > > > Since at it, another thing I thought about is making sure the IOMMU
> > > > capabilities will match between host and guest IOMMU, which I think
> > > > this series has ignorred so far.  E.g., when we're having assigned
> > > > devices in the guest and with 5-level IOVA, we should make sure the
> > > > host IOMMU supports 5-level as well before the guest starts since
> > > > otherwise the shadow page synchronization could potentially fail when
> > > > the requested IOVA address goes beyond 4-level.  One simple solution
> > > > is just to disable device assignment for now when we're with 57bits
> > > > vIOMMU but I'm not sure whether that's what you want, especially you
> > > > mentioned the DPDK case (who may use assigned devices).
> > > 
> > > Ok I totally forgot that we don't even support any kind of check like
> > > this before... So feel free to skip this comment if you want, or it
> > > would be even nicer if you want to fix it as a whole. :)
> > > 
> > 
> > Indeed. We have talked about this before. How about we focus on the 5-level
> > extension for now, and solve the check issue in the future? I still do not
> > have any clean solutions in mind. BTW, any suggestions on this issue? :)
> 
> I started to remember our discussions, sorry I should remember them
> earlier... :)
> 
> The only thing in my mind (I think I also suggested the same thing
> during that discussion, but I don't trust my memory any more...) is to
> use sysfs.  Say:
> 
>   1. Scan /sys/class/iommu/dmarN for all the host IOMMUs, read cap of
>  each IOMMU from /sys/class/iommu/dmar0/intel-iommu/cap,
> 
>   2. For each host iommu, scan /sys/class/iommu/dmarN/devices for all
>  the devices under each host IOMMU, then we can know which IOMMU
>  owns which device,
> 
>   3. For each assigned device to the guest, we lookup the previous
>  information to know the mgaw for each host device, raise error
>  and stop QEMU from booting if any of the host device has less
>  level supported than the guest vIOMMU (possibly some more checks
>  in vtd_iommu_notify_flag_changed)
> 
> (we still have some issue on vtd_iommu_notify_flag_changed since it's
>  only run until the first enablement of vIOMMU, so we'll only raise
>  the error during guest Linux boots with vIOMMU on. But that's another
>  issue)

Thanks for the explanation, Peter. You do have a better memory than I do. :)


> 
> Regards,
> 
> -- 
> Peter Xu
> 

B.R.
Yu



[Qemu-devel] [PATCH v3 19/23] vl: Introduce shutdown_notifiers

2018-11-12 Thread Yuval Shaia
A notifier will be used to signal the shutdown event, i.e. to inform
listeners that the system is being shut down. This will allow devices
and other components to run cleanup code needed before the VM shuts
down.

Signed-off-by: Yuval Shaia 
---
 include/sysemu/sysemu.h |  1 +
 vl.c| 15 ++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8d6095d98b..0d15f16492 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -80,6 +80,7 @@ void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_system_shutdown_request(ShutdownCause reason);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
+void qemu_register_shutdown_notifier(Notifier *notifier);
 void qemu_system_debug_request(void);
 void qemu_system_vmstop_request(RunState reason);
 void qemu_system_vmstop_request_prepare(void);
diff --git a/vl.c b/vl.c
index 1fcacc5caa..d33d52522c 100644
--- a/vl.c
+++ b/vl.c
@@ -1578,6 +1578,8 @@ static NotifierList suspend_notifiers =
 NOTIFIER_LIST_INITIALIZER(suspend_notifiers);
 static NotifierList wakeup_notifiers =
 NOTIFIER_LIST_INITIALIZER(wakeup_notifiers);
+static NotifierList shutdown_notifiers =
+NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
 
 ShutdownCause qemu_shutdown_requested_get(void)
@@ -1809,6 +1811,12 @@ static void qemu_system_powerdown(void)
 notifier_list_notify(&powerdown_notifiers, NULL);
 }
 
+static void qemu_system_shutdown(ShutdownCause cause)
+{
+qapi_event_send_shutdown(shutdown_caused_by_guest(cause));
+notifier_list_notify(&shutdown_notifiers, &cause);
+}
+
 void qemu_system_powerdown_request(void)
 {
 trace_qemu_system_powerdown_request();
@@ -1821,6 +1829,11 @@ void qemu_register_powerdown_notifier(Notifier *notifier)
 notifier_list_add(&powerdown_notifiers, notifier);
 }
 
+void qemu_register_shutdown_notifier(Notifier *notifier)
+{
+notifier_list_add(&shutdown_notifiers, notifier);
+}
+
 void qemu_system_debug_request(void)
 {
 debug_requested = 1;
@@ -1848,7 +1861,7 @@ static bool main_loop_should_exit(void)
 request = qemu_shutdown_requested();
 if (request) {
 qemu_kill_report();
-qapi_event_send_shutdown(shutdown_caused_by_guest(request));
+qemu_system_shutdown(request);
 if (no_shutdown) {
 vm_stop(RUN_STATE_SHUTDOWN);
 } else {
-- 
2.17.2







[Qemu-devel] [PATCH v3 10/23] json: Define new QMP message for pvrdma

2018-11-12 Thread Yuval Shaia
pvrdma requires that any GID attached to it is also attached to the
backend device in the host.

A new QMP message is defined so the pvrdma device can broadcast any
change made to its GID table. This event is captured by libvirt, which
in turn will update the GID table in the backend device.

Signed-off-by: Yuval Shaia 
Reviewed-by: Marcel Apfelbaum
---
 MAINTAINERS   |  1 +
 Makefile  |  3 ++-
 Makefile.objs |  4 
 qapi/qapi-schema.json |  1 +
 qapi/rdma.json| 38 ++
 5 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 qapi/rdma.json

diff --git a/MAINTAINERS b/MAINTAINERS
index e087d58ac6..a149f68a8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2232,6 +2232,7 @@ F: hw/rdma/*
 F: hw/rdma/vmw/*
 F: docs/pvrdma.txt
 F: contrib/rdmacm-mux/*
+F: qapi/rdma.json
 
 Build and test automation
 -
diff --git a/Makefile b/Makefile
index 94072776ff..db4ce60ee5 100644
--- a/Makefile
+++ b/Makefile
qapi-modules = $(SRC_PATH)/qapi/qapi-schema.json $(SRC_PATH)/qapi/common.json \
$(SRC_PATH)/qapi/tpm.json \
$(SRC_PATH)/qapi/trace.json \
$(SRC_PATH)/qapi/transaction.json \
-   $(SRC_PATH)/qapi/ui.json
+   $(SRC_PATH)/qapi/ui.json \
+   $(SRC_PATH)/qapi/rdma.json
 
 qapi/qapi-builtin-types.c qapi/qapi-builtin-types.h \
 qapi/qapi-types.c qapi/qapi-types.h \
diff --git a/Makefile.objs b/Makefile.objs
index cc7df3ad80..76d8028f2f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -21,6 +21,7 @@ util-obj-y += qapi/qapi-types-tpm.o
 util-obj-y += qapi/qapi-types-trace.o
 util-obj-y += qapi/qapi-types-transaction.o
 util-obj-y += qapi/qapi-types-ui.o
+util-obj-y += qapi/qapi-types-rdma.o
 util-obj-y += qapi/qapi-builtin-visit.o
 util-obj-y += qapi/qapi-visit.o
 util-obj-y += qapi/qapi-visit-block-core.o
@@ -40,6 +41,7 @@ util-obj-y += qapi/qapi-visit-tpm.o
 util-obj-y += qapi/qapi-visit-trace.o
 util-obj-y += qapi/qapi-visit-transaction.o
 util-obj-y += qapi/qapi-visit-ui.o
+util-obj-y += qapi/qapi-visit-rdma.o
 util-obj-y += qapi/qapi-events.o
 util-obj-y += qapi/qapi-events-block-core.o
 util-obj-y += qapi/qapi-events-block.o
@@ -58,6 +60,7 @@ util-obj-y += qapi/qapi-events-tpm.o
 util-obj-y += qapi/qapi-events-trace.o
 util-obj-y += qapi/qapi-events-transaction.o
 util-obj-y += qapi/qapi-events-ui.o
+util-obj-y += qapi/qapi-events-rdma.o
 util-obj-y += qapi/qapi-introspect.o
 
 chardev-obj-y = chardev/
@@ -155,6 +158,7 @@ common-obj-y += qapi/qapi-commands-tpm.o
 common-obj-y += qapi/qapi-commands-trace.o
 common-obj-y += qapi/qapi-commands-transaction.o
 common-obj-y += qapi/qapi-commands-ui.o
+common-obj-y += qapi/qapi-commands-rdma.o
 common-obj-y += qapi/qapi-introspect.o
 common-obj-y += qmp.o hmp.o
 endif
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 65b6dc2f6f..a650d80f83 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -94,3 +94,4 @@
 { 'include': 'trace.json' }
 { 'include': 'introspect.json' }
 { 'include': 'misc.json' }
+{ 'include': 'rdma.json' }
diff --git a/qapi/rdma.json b/qapi/rdma.json
new file mode 100644
index 00..804c68ab36
--- /dev/null
+++ b/qapi/rdma.json
@@ -0,0 +1,38 @@
+# -*- Mode: Python -*-
+#
+
+##
+# = RDMA device
+##
+
+##
+# @RDMA_GID_STATUS_CHANGED:
+#
+# Emitted when guest driver adds/deletes GID to/from device
+#
+# @netdev: RoCE Network Device name - char *
+#
+# @gid-status: Add or delete indication - bool
+#
+# @subnet-prefix: Subnet Prefix - uint64
+#
+# @interface-id : Interface ID - uint64
+#
+# Since: 3.2
+#
+# Example:
+#
# <- {"timestamp": {"seconds": 1541579657, "microseconds": 986760},
#     "event": "RDMA_GID_STATUS_CHANGED",
#     "data": {"netdev": "bridge0",
#              "interface-id": 15880512517475447892,
#              "gid-status": true,
#              "subnet-prefix": 33022}}
+#
+##
+{ 'event': 'RDMA_GID_STATUS_CHANGED',
+  'data': { 'netdev': 'str',
+            'gid-status': 'bool',
+            'subnet-prefix' : 'uint64',
+            'interface-id'  : 'uint64' } }
-- 
2.17.2
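For readers wiring this event up on the consumer side, the schema example above shows the payload as it appears on the QMP socket. A small C sketch (a hypothetical helper for illustration, not QEMU code) that renders the `data` member in that wire format, with the members in schema order:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of QEMU: renders the "data" member of an
 * RDMA_GID_STATUS_CHANGED event as QMP wire JSON, mirroring the schema
 * above (netdev: str, gid-status: bool, subnet-prefix/interface-id: uint64). */
static int format_gid_event(char *buf, size_t len, const char *netdev,
                            bool gid_status, uint64_t subnet_prefix,
                            uint64_t interface_id)
{
    return snprintf(buf, len,
                    "{\"netdev\": \"%s\", \"gid-status\": %s, "
                    "\"subnet-prefix\": %" PRIu64 ", "
                    "\"interface-id\": %" PRIu64 "}",
                    netdev, gid_status ? "true" : "false",
                    subnet_prefix, interface_id);
}
```

Fed the values from the doc-comment example (`bridge0`, `true`, `33022`, `15880512517475447892`), this reproduces the same `data` object the event example shows.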




[Qemu-devel] [PATCH v3 17/23] hw/pvrdma: Fill error code in command's response

2018-11-12 Thread Yuval Shaia
The driver checks the error code, so let's set it.

Signed-off-by: Yuval Shaia 
---
 hw/rdma/vmw/pvrdma_cmd.c | 67 
 1 file changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/rdma/vmw/pvrdma_cmd.c b/hw/rdma/vmw/pvrdma_cmd.c
index 0d3c818c20..a326c5d470 100644
--- a/hw/rdma/vmw/pvrdma_cmd.c
+++ b/hw/rdma/vmw/pvrdma_cmd.c
@@ -131,7 +131,8 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
 if (rdma_backend_query_port(&dev->backend_dev,
 (struct ibv_port_attr *)&attrs)) {
-return -ENOMEM;
+resp->hdr.err = -ENOMEM;
+goto out;
 }
 
 memset(resp, 0, sizeof(*resp));
@@ -150,7 +151,9 @@ static int query_port(PVRDMADev *dev, union pvrdma_cmd_req *req,
 resp->attrs.active_width = 1;
 resp->attrs.active_speed = 1;
 
-return 0;
+out:
+pr_dbg("ret=%d\n", resp->hdr.err);
+return resp->hdr.err;
 }
 
 static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -170,7 +173,7 @@ static int query_pkey(PVRDMADev *dev, union pvrdma_cmd_req *req,
 resp->pkey = PVRDMA_PKEY;
 pr_dbg("pkey=0x%x\n", resp->pkey);
 
-return 0;
+return resp->hdr.err;
 }
 
 static int create_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -200,7 +203,9 @@ static int destroy_pd(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
 rdma_rm_dealloc_pd(&dev->rdma_dev_res, cmd->pd_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+return rsp->hdr.err;
 }
 
 static int create_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -251,7 +256,9 @@ static int destroy_mr(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
 rdma_rm_dealloc_mr(&dev->rdma_dev_res, cmd->mr_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+return rsp->hdr.err;
 }
 
 static int create_cq_ring(PCIDevice *pci_dev , PvrdmaRing **ring,
@@ -353,7 +360,8 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 cq = rdma_rm_get_cq(&dev->rdma_dev_res, cmd->cq_handle);
 if (!cq) {
 pr_dbg("Invalid CQ handle\n");
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 ring = (PvrdmaRing *)cq->opaque;
@@ -364,7 +372,11 @@ static int destroy_cq(PVRDMADev *dev, union pvrdma_cmd_req *req,
 
 rdma_rm_dealloc_cq(&dev->rdma_dev_res, cmd->cq_handle);
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_qp_rings(PCIDevice *pci_dev, uint64_t pdir_dma,
@@ -553,7 +565,8 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
 qp = rdma_rm_get_qp(&dev->rdma_dev_res, cmd->qp_handle);
 if (!qp) {
 pr_dbg("Invalid QP handle\n");
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 rdma_rm_dealloc_qp(&dev->rdma_dev_res, cmd->qp_handle);
@@ -567,7 +580,11 @@ static int destroy_qp(PVRDMADev *dev, union pvrdma_cmd_req *req,
 rdma_pci_dma_unmap(PCI_DEVICE(dev), ring->ring_state, TARGET_PAGE_SIZE);
 g_free(ring);
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -580,7 +597,8 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 pr_dbg("index=%d\n", cmd->index);
 
 if (cmd->index >= MAX_PORT_GIDS) {
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 pr_dbg("gid[%d]=0x%llx,0x%llx\n", cmd->index,
@@ -590,10 +608,15 @@ static int create_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 rc = rdma_rm_add_gid(&dev->rdma_dev_res, &dev->backend_dev,
  dev->backend_eth_device_name, gid, cmd->index);
 if (rc < 0) {
-return -EINVAL;
+rsp->hdr.err = rc;
+goto out;
 }
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -606,7 +629,8 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 pr_dbg("index=%d\n", cmd->index);
 
 if (cmd->index >= MAX_PORT_GIDS) {
-return -EINVAL;
+rsp->hdr.err = -EINVAL;
+goto out;
 }
 
 rc = rdma_rm_del_gid(&dev->rdma_dev_res, &dev->backend_dev,
@@ -617,7 +641,11 @@ static int destroy_bind(PVRDMADev *dev, union pvrdma_cmd_req *req,
 goto out;
 }
 
-return 0;
+rsp->hdr.err = 0;
+
+out:
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
@@ -634,9 +662,8 @@ static int create_uc(PVRDMADev *dev, union pvrdma_cmd_req *req,
 resp->hdr.err = rdma_rm_alloc_uc(&dev->rdma_dev_res, cmd->pfn,
 &resp->ctx_handle);
 
-pr_dbg("ret=%d\n", resp->hdr.err);
-
-return 0;
+pr_dbg("ret=%d\n", rsp->hdr.err);
+return rsp->hdr.err;
 }
 
 static int destroy_uc(PVRDMADev *dev, union 

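The transformation this patch applies in every command handler is the same: instead of returning an error code that never reaches the response ring, record it in the response header (`resp->hdr.err`/`rsp->hdr.err`) and return that field. A compilable, simplified sketch of the before/after behavior (the structs are stand-ins for illustration, not the QEMU pvrdma definitions):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Simplified stand-ins for the pvrdma command/response types. */
struct resp_hdr { int err; };
struct cmd_resp { struct resp_hdr hdr; };

/* Before the patch: on failure the handler returned an error code, but the
 * response header -- the part the guest driver actually reads -- kept
 * whatever stale value it already held. */
static int destroy_cq_before(struct cmd_resp *rsp, int handle_valid)
{
    if (!handle_valid) {
        return -EINVAL;             /* rsp->hdr.err never written */
    }
    return 0;
}

/* After the patch: the error lands in the response header, and the same
 * field is returned, so device and driver always agree. */
static int destroy_cq_after(struct cmd_resp *rsp, int handle_valid)
{
    if (!handle_valid) {
        rsp->hdr.err = -EINVAL;
        goto out;
    }
    rsp->hdr.err = 0;
out:
    printf("ret=%d\n", rsp->hdr.err);
    return rsp->hdr.err;
}
```

The `goto out` tail mirrors the patch's structure: every exit path funnels through one place that logs and returns `hdr.err`, which is what lets the guest driver's error check observe the same value the handler computed.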