date:20161108

Re: [Qemu-devel] [PATCH for-2.8] migration: Fix return code of ram_save_iterate()

2016-11-08 Thread Thomas Huth

On 09.11.2016 08:18, Amit Shah wrote:
> On (Fri) 04 Nov 2016 [14:10:17], Thomas Huth wrote:
>> qemu_savevm_state_iterate() expects the iterators to return 1
>> when they are done, and 0 if there is still something left to do.
>> However, ram_save_iterate() does not obey this rule and returns
>> the number of saved pages instead. This causes a fatal hang with
>> ppc64 guests when you run QEMU like this (also works with TCG):
> 
> "works with" -- does that mean reproduces with?

Yes, that's what I meant: you can reproduce it with TCG, too (e.g. running
on an x86 system), so there's no need for a real POWER machine with KVM
here.

>>  qemu-img create -f qcow2  /tmp/test.qcow2 1M
>>  qemu-system-ppc64 -nographic -nodefaults -m 256 \
>>    -hda /tmp/test.qcow2 -serial mon:stdio
>>
>> ... then switch to the monitor by pressing CTRL-a c and try to
>> save a snapshot with "savevm test1" for example.
>>
>> After the first iteration, ram_save_iterate() always returns 0 here,
>> so that qemu_savevm_state_iterate() hangs in an endless loop and you
>> can only "kill -9" the QEMU process.
>> Fix it by using proper return values in ram_save_iterate().
>>
>> Signed-off-by: Thomas Huth 
>> ---
>>  migration/ram.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index fb9252d..a1c8089 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1987,7 +1987,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>  int ret;
>>  int i;
>>  int64_t t0;
>> -int pages_sent = 0;
>> +int done = 0;
>>  
>>  rcu_read_lock();
>>  if (ram_list.version != last_version) {
>> @@ -2007,9 +2007,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>  pages = ram_find_and_save_block(f, false, &bytes_transferred);
>>  /* no more pages to sent */
>>  if (pages == 0) {
>> +done = 1;
>>  break;
>>  }
>> -pages_sent += pages;
>>  acct_info.iterations++;
>>  
>>  /* we want to check in the 1st loop, just in case it was the 1st 
>> time
>> @@ -2044,7 +2044,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>  return ret;
>>  }
>>  
>> -return pages_sent;
>> +return done;
>>  }
> 
> I agree with David, we can just remove the return value.  The first
> patch of the series can do that; and this one could become the 2nd
> patch.  Should be OK for the soft freeze.

Sorry, I still don't quite get it: if I changed the return type of
ram_save_iterate() and the other iterate functions to "void", how is
qemu_savevm_state_iterate() supposed to know whether all iterators are
done or not? Other iterators also use negative return values to
signal errors. Should that then be handled via an "Error **" parameter
instead? My gut feeling still says that such a big rework (we'd have
to touch all iterators for this!) should rather not be done right in
the middle of the freeze period...

 Thomas
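
A minimal sketch of the return-value contract discussed above, in plain C
rather than QEMU code. The names run_iterators(), good_iterate() and
page_count_iterate() are hypothetical and only illustrate the convention
"1 = done, 0 = more to do, negative = error"; the real caller in QEMU is
structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator type: returns 1 when done, 0 when more work
 * remains, and a negative value on error. */
typedef int (*iterate_fn)(void *opaque);

/* One pass over all iterators.  Returns 1 only if every iterator
 * reported "done", 0 if at least one still has work left, or the first
 * negative error code. */
int run_iterators(iterate_fn *fns, void **opaques, size_t n)
{
    int all_done = 1;

    assert(fns != NULL && opaques != NULL);
    for (size_t i = 0; i < n; i++) {
        int ret = fns[i](opaques[i]);

        if (ret < 0) {
            return ret;          /* propagate the error */
        }
        if (ret == 0) {
            all_done = 0;        /* this iterator wants another pass */
        }
    }
    return all_done;
}

/* Toy iterator that obeys the contract: "saves" one page per call. */
int good_iterate(void *opaque)
{
    int *pages_left = opaque;

    if (*pages_left > 0) {
        (*pages_left)--;
    }
    return *pages_left == 0;     /* 1 = done, 0 = more to do */
}

/* Toy iterator that returns the number of pages saved in this call
 * instead of a done flag. */
int page_count_iterate(void *opaque)
{
    int *pages_left = opaque;
    int saved = *pages_left > 0 ? 1 : 0;

    *pages_left -= saved;
    return saved;                /* 0 once nothing is left: never "done" */
}
```

With the page-count variant, a caller that loops until it sees 1 keeps
getting 0 once all pages are saved, which matches the endless loop
described in the commit message.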

Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Wen Congyang

On 11/09/2016 01:02 PM, Dave Young wrote:
> On 11/09/16 at 11:58am, Wen Congyang wrote:
>> On 11/09/2016 11:17 AM, Dave Young wrote:
>>> Drop qiaonuohan, seems the mail address is wrong..
>>>
>>> On 11/09/16 at 11:01am, Dave Young wrote:
>>>> Hi,
>>>>
>>>> The latest Linux kernel enables KASLR to randomize physical/virtual memory
>>>> addresses. We have put some effort into supporting kexec/kdump so that the
>>>> crash utility still works when the crashed kernel has KASLR enabled.
>>>>
>>>> But according to Dave Anderson, virsh dump does not work; his message is
>>>> quoted below:
>>>>
>>>> """
>>>> with virsh dump, there's no way of even knowing that KASLR
>>>> has randomized the kernel __START_KERNEL_map region, because there is no
>>>> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
>>>> vmcoreinfo data to compare against the vmlinux file symbol value.
>>>> Unless virsh dump can export some basic virtual memory data, which
>>>> they say it can't, I don't see how KASLR can ever be supported.
>>>> """
>>>>
>>>> I assume virsh dump uses the QEMU guest memory dump facility, so it
>>>> should be addressed in QEMU first. Hence this query to the qemu-devel
>>>> list. If this is not correct, please let me know.
>>
>> IIRC, 'virsh dump --memory-only' uses dump-guest-memory, and 'virsh dump'
>> uses migration to dump.
> 
> Do they need different fixes? Dave, I guess you mean --memory-only, but
> could you clarify and confirm it?
> 
>>
>> I think I should study kaslr first...
> 
> Thanks for taking care of it.

Can you point me to the patch for kexec/kdump? I want to know what I need to do
for dump-guest-memory.

Thanks
Wen Congyang

> 
>>
>> Thanks
>> Wen Congyang
>>

>>>> Could you QEMU dump people make it work? Otherwise we cannot support virt
>>>> dump as long as KASLR is enabled. The latest Fedora kernel has enabled it
>>>> on x86_64.
>>>>
>>>> Thanks
>>>> Dave
>>>
>>>
>>>
>>
>>
>>
> 
> 
> .
>
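
A minimal sketch of the comparison Dave Anderson describes: take the
runtime address of _stext from vmcoreinfo-style text and subtract the
link-time value from vmlinux. The function name kaslr_offset() and the
assumption that the entry looks like "SYMBOL(_stext)=<hex>" are
illustrative only; this is not code from crash or QEMU.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Search a vmcoreinfo-style text blob for "SYMBOL(_stext)=<hex>" and
 * return (runtime address - link-time address).  *found is set to 0 and
 * 0 is returned when the entry is missing, which is the situation
 * described for a dump without virtual address information. */
uint64_t kaslr_offset(const char *vmcoreinfo, uint64_t vmlinux_stext,
                      int *found)
{
    static const char key[] = "SYMBOL(_stext)=";
    const char *p;

    assert(vmcoreinfo != NULL && found != NULL);
    p = strstr(vmcoreinfo, key);
    if (p == NULL) {
        *found = 0;
        return 0;
    }
    *found = 1;
    return (uint64_t)strtoull(p + strlen(key), NULL, 16) - vmlinux_stext;
}
```

If the dump carries no such entry, the caller has nothing to compare the
vmlinux symbol against, which is the gap the thread is about.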

Re: [Qemu-devel] [PATCH v6 1/3] IOMMU: add option to enable VTD_CAP_CM to vIOMMU capility exposoed to guest

2016-11-08 Thread Jason Wang




On 2016-11-08 19:04, Aviv B.D wrote:
> From: "Aviv Ben-David" 
>
> This capability asks the guest to invalidate cache before each map operation.
> We can use this invalidation to trap map operations in the hypervisor.


Hi,

As I've asked twice in the past, I want to know why you don't cache 
translation faults as the spec requires (especially since this is 
guest-visible behavior).


Btw, please cc me when posting future versions.

Thanks



> Signed-off-by: Aviv Ben-David
> ---
>   hw/i386/intel_iommu.c  | 5 +
>   hw/i386/intel_iommu_internal.h | 1 +
>   include/hw/i386/intel_iommu.h  | 2 ++
>   3 files changed, 8 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1655a65..834887f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2017,6 +2017,7 @@ static Property vtd_properties[] = {
>   DEFINE_PROP_ON_OFF_AUTO("eim", IntelIOMMUState, intr_eim,
>   ON_OFF_AUTO_AUTO),
>   DEFINE_PROP_BOOL("x-buggy-eim", IntelIOMMUState, buggy_eim, false),
> +DEFINE_PROP_BOOL("cache-mode", IntelIOMMUState, cache_mode_enabled, FALSE),
>   DEFINE_PROP_END_OF_LIST(),
>   };
>
> @@ -2391,6 +2392,10 @@ static void vtd_init(IntelIOMMUState *s)
>   assert(s->intr_eim != ON_OFF_AUTO_AUTO);
>   }
>
> +if (s->cache_mode_enabled) {
> +s->cap |= VTD_CAP_CM;
> +}
> +
>   vtd_reset_context_cache(s);
>   vtd_reset_iotlb(s);
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 0829a50..35d9f3a 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -201,6 +201,7 @@
>   #define VTD_CAP_MAMV(VTD_MAMV << 48)
>   #define VTD_CAP_PSI (1ULL << 39)
>   #define VTD_CAP_SLLPS   ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM  (1ULL << 7)
>
>   /* Supported Adjusted Guest Address Widths */
>   #define VTD_CAP_SAGAW_SHIFT 8
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 1989c1e..42d293f 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -258,6 +258,8 @@ struct IntelIOMMUState {
>   uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>   uint32_t version;
>
> +bool cache_mode_enabled;/* RO - is cap CM enabled? */
> +
>   dma_addr_t root;/* Current root table pointer */
>   bool root_extended; /* Type of root table (extended or not) */
>   bool dmar_enabled;  /* Set if DMA remapping is enabled */

Re: [Qemu-devel] [PATCH for-2.8] migration: Fix return code of ram_save_iterate()

2016-11-08 Thread Amit Shah

On (Fri) 04 Nov 2016 [14:10:17], Thomas Huth wrote:
> qemu_savevm_state_iterate() expects the iterators to return 1
> when they are done, and 0 if there is still something left to do.
> However, ram_save_iterate() does not obey this rule and returns
> the number of saved pages instead. This causes a fatal hang with
> ppc64 guests when you run QEMU like this (also works with TCG):

"works with" -- does that mean reproduces with?

>  qemu-img create -f qcow2  /tmp/test.qcow2 1M
>  qemu-system-ppc64 -nographic -nodefaults -m 256 \
>    -hda /tmp/test.qcow2 -serial mon:stdio
> 
> ... then switch to the monitor by pressing CTRL-a c and try to
> save a snapshot with "savevm test1" for example.
> 
> After the first iteration, ram_save_iterate() always returns 0 here,
> so that qemu_savevm_state_iterate() hangs in an endless loop and you
> can only "kill -9" the QEMU process.
> Fix it by using proper return values in ram_save_iterate().
> 
> Signed-off-by: Thomas Huth 
> ---
>  migration/ram.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index fb9252d..a1c8089 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1987,7 +1987,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  int ret;
>  int i;
>  int64_t t0;
> -int pages_sent = 0;
> +int done = 0;
>  
>  rcu_read_lock();
>  if (ram_list.version != last_version) {
> @@ -2007,9 +2007,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  pages = ram_find_and_save_block(f, false, &bytes_transferred);
>  /* no more pages to sent */
>  if (pages == 0) {
> +done = 1;
>  break;
>  }
> -pages_sent += pages;
>  acct_info.iterations++;
>  
>  /* we want to check in the 1st loop, just in case it was the 1st time
> @@ -2044,7 +2044,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>  return ret;
>  }
>  
> -return pages_sent;
> +return done;
>  }

I agree with David, we can just remove the return value.  The first
patch of the series can do that; and this one could become the 2nd
patch.  Should be OK for the soft freeze.

Amit

Re: [Qemu-devel] Concerning " [PULL 6/6] curses: Use cursesw instead of curses"

2016-11-08 Thread Sergey Smolov



On 08.11.2016 20:28, Cornelia Huck wrote:

> On Tue, 8 Nov 2016 16:49:51 +
> Stefan Hajnoczi  wrote:
>
>> On Tue, Nov 08, 2016 at 10:40:20AM +0300, Sergey Smolov wrote:
>>> Dear List!
>>>
>>> I've encountered the same problem as was discussed in this thread:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg07898.html
>>>
>>> Has anybody succeeded in solving the problem?
>>>
>>> On my side, the problem appears when I run the 'configure' script with the
>>> '--target-list=aarch64-softmmu' option. The script returns the following
>>> message to me:
>>>
>>> ERROR: configure test passed without -Werror but failed with -Werror.
>>> This is probably a bug in the configure script. The failing command
>>> will be at the bottom of config.log.
>>> You can run configure with --disable-werror to bypass this check.
>>>
>>> I've attached a config.log to this e-mail.
>>
>> [...]
>>
>>> cc -Werror -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
>>> -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef 
>>> -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv 
>>> -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs 
>>> -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers 
>>> -Wold-style-declaration -Wold-style-definition -Wtype-limits 
>>> -fstack-protector-all -I/usr/include/libpng14 -o config-temp/qemu-conf.exe 
>>> config-temp/qemu-conf.c -Wl,-z,relro -Wl,-z,now -pie -m64 -g -lncursesw
>>> config-temp/qemu-conf.c: In function ‘main’:
>>> config-temp/qemu-conf.c:9:3: error: implicit declaration of function ‘addwstr’ 
>>> [-Werror=implicit-function-declaration]
>>> config-temp/qemu-conf.c:9:3: error: nested extern declaration of ‘addwstr’ 
>>> [-Werror=nested-externs]
>>> config-temp/qemu-conf.c:10:3: error: implicit declaration of function 
>>> ‘addnwstr’ [-Werror=implicit-function-declaration]
>>> config-temp/qemu-conf.c:10:3: error: nested extern declaration of ‘addnwstr’ 
>>> [-Werror=nested-externs]
>>> cc1: all warnings being treated as errors
>>
>> http://pdcurses.sourceforge.net/doc/PDCurses.txt:
>>
>>    Wide-character functions from the X/Open standard -- these are only
>>    available when PDCurses is built with PDC_WIDE defined, and the
>>    prototypes are only available from curses.h when PDC_WIDE is defined
>>    before its inclusion in your app:
>>
>>    addnwstr    addstr
>>    addwstr     addstr
>>
>> QEMU does not define PDC_WIDE.  Try adding ./configure
>> --extra-flags=-DPDC_WIDE.
>
> I think the problem is rather the incorrect include detection in
> configure -- see <20161107133833.3681-1-msucha...@suse.de> ("[PATCH]
> Fix legacy ncurses detection.") and the following thread.
>
> Sergey: Are you running on SLES?




Dear Cornelia,

I'm running on OpenSUSE 12.2 x86_64.

I've tried this patch, but the situation remains the same.

--
Thanks,
Sergey Smolov

Re: [Qemu-devel] Concerning " [PULL 6/6] curses: Use cursesw instead of curses"

2016-11-08 Thread Sergey Smolov



On 08.11.2016 19:49, Stefan Hajnoczi wrote:

> On Tue, Nov 08, 2016 at 10:40:20AM +0300, Sergey Smolov wrote:
>> Dear List!
>>
>> I've encountered the same problem as was discussed in this thread:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg07898.html
>>
>> Has anybody succeeded in solving the problem?
>>
>> On my side, the problem appears when I run the 'configure' script with the
>> '--target-list=aarch64-softmmu' option. The script returns the following
>> message to me:
>>
>> ERROR: configure test passed without -Werror but failed with -Werror.
>> This is probably a bug in the configure script. The failing command
>> will be at the bottom of config.log.
>> You can run configure with --disable-werror to bypass this check.
>>
>> I've attached a config.log to this e-mail.
>
> [...]
>
>> cc -Werror -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
>> -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef 
>> -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv 
>> -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs 
>> -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers 
>> -Wold-style-declaration -Wold-style-definition -Wtype-limits 
>> -fstack-protector-all -I/usr/include/libpng14 -o config-temp/qemu-conf.exe 
>> config-temp/qemu-conf.c -Wl,-z,relro -Wl,-z,now -pie -m64 -g -lncursesw
>> config-temp/qemu-conf.c: In function ‘main’:
>> config-temp/qemu-conf.c:9:3: error: implicit declaration of function ‘addwstr’ 
>> [-Werror=implicit-function-declaration]
>> config-temp/qemu-conf.c:9:3: error: nested extern declaration of ‘addwstr’ 
>> [-Werror=nested-externs]
>> config-temp/qemu-conf.c:10:3: error: implicit declaration of function 
>> ‘addnwstr’ [-Werror=implicit-function-declaration]
>> config-temp/qemu-conf.c:10:3: error: nested extern declaration of ‘addnwstr’ 
>> [-Werror=nested-externs]
>> cc1: all warnings being treated as errors
>
> http://pdcurses.sourceforge.net/doc/PDCurses.txt:
>
>    Wide-character functions from the X/Open standard -- these are only
>    available when PDCurses is built with PDC_WIDE defined, and the
>    prototypes are only available from curses.h when PDC_WIDE is defined
>    before its inclusion in your app:
>
>    addnwstr    addstr
>    addwstr     addstr
>
> QEMU does not define PDC_WIDE.  Try adding ./configure
> --extra-flags=-DPDC_WIDE.


Dear Stefan,

QEMU from the master branch does not have an "--extra-flags" option. 
I've tried "--extra-cflags" instead, but I get the same issue.


--
Thanks,
Sergey Smolov

Re: [Qemu-devel] alpha platform is missing files after initrd load

2016-11-08 Thread Dennis Luehring


On 07.11.2016 at 18:21, Laszlo Ersek wrote:

It's not alpha-related; the same error happens with ppc64, sparc64, mips64...


So I also tried to reproduce the errors on x86_64 in QEMU,

building a tiny Linux kernel 4.8.6 (~1MB) and a simple init (~1MB) based on 
http://mgalgs.github.io/2015/05/16/how-to-build-a-custom-linux-kernel-for-qemu-2015-edition.html

So far (using the same cpio command line) I'm unable to get the same errors 
as on alpha, ppc64, sparc64, mips64.

I'm using the cross-linux-from-scratch manual/script for building the non-x86_64 
platforms.

Re: [Qemu-devel] [RFC 15/17] ppc: Check that CPU model stays consistent across migration

2016-11-08 Thread Alexey Kardashevskiy

On 09/11/16 15:24, David Gibson wrote:
> On Tue, Nov 08, 2016 at 05:03:49PM +1100, Alexey Kardashevskiy wrote:
>> On 08/11/16 16:29, David Gibson wrote:
>>> On Fri, Nov 04, 2016 at 06:54:48PM +1100, Alexey Kardashevskiy wrote:
>>>> On 30/10/16 22:12, David Gibson wrote:
>>>>> When a vmstate for the ppc cpu was first introduced (a90db15 "target-ppc:
>>>>> Convert ppc cpu savevm to VMStateDescription"), a VMSTATE_EQUAL was used
>>>>> to ensure that identical CPU models were used at source and destination
>>>>> as based on the PVR (Processor Version Register).
>>>>>
>>>>> However this was a problem for HV KVM, where due to hardware limitations
>>>>> we always need to use the real PVR of the host CPU.  So, to allow
>>>>> migration between hosts with "similar enough" CPUs, the PVR check was
>>>>> removed in 569be9f0 "target-ppc: Remove PVR check from migration".  This
>>>>> left the onus on user / management to only attempt migration between
>>>>> compatible CPUs.
>>>>>
>>>>> Now that we've reworked the handling of compatibility modes, we have the
>>>>> information to actually determine if we're making a compatible migration.
>>>>> So this patch partially restores the PVR check.  If the source was running
>>>>> in a compatibility mode, we just make sure that the destination cpu can
>>>>> also run in that compatibility mode.  However, if the source was running
>>>>> in "raw" mode, we verify that the destination has the same PVR value.
>>>>>
>>>>> Signed-off-by: David Gibson 
>>>>> ---
>>>>>  target-ppc/machine.c | 15 +++
>>>>>  1 file changed, 11 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
>>>>> index 5d87ff6..62b9e94 100644
>>>>> --- a/target-ppc/machine.c
>>>>> +++ b/target-ppc/machine.c
>>>>> @@ -173,10 +173,12 @@ static int cpu_post_load(void *opaque, int 
>>>>> version_id)
>>>>>  target_ulong msr;
>>>>>  
>>>>>  /*
>>>>> - * We always ignore the source PVR. The user or management
>>>>> - * software has to take care of running QEMU in a compatible mode.
>>>>> + * If we're operating in compat mode, we should be ok as long as
>>>>> + * the destination supports the same compatiblity mode.
>>>>> + *
>>>>> + * Otherwise, however, we require that the destination has exactly
>>>>> + * the same CPU model as the source.
>>>>>   */
>>>>> -env->spr[SPR_PVR] = env->spr_cb[SPR_PVR].default_value;
>>>>>  
>>>>>  #if defined(TARGET_PPC64)
>>>>>  if (cpu->compat_pvr) {
>>>>> @@ -188,8 +190,13 @@ static int cpu_post_load(void *opaque, int 
>>>>> version_id)
>>>>>  error_free(local_err);
>>>>>  return -1;
>>>>>  }
>>>>> -}
>>>>> +} else
>>>>>  #endif
>>>>> +{
>>>>> +if (env->spr[SPR_PVR] != env->spr_cb[SPR_PVR].default_value) {
>>>>> +return -1;
>>>>> +}
>>>>> +}
>>>>
>>>> This should break migration from a host with PVR=004d0200 to a host with
>>>> PVR=004d0201; what is the benefit of such a limitation?
>>>
>>> There probably isn't one.  But the point is it also blocks migration
>>> from a host with PVR=004B0201 (POWER8) to one with PVR=00201400
>>> (403GCX) and *that* has a clear benefit.  I don't see a way to block
>>> the second without the first, except by creating a huge compatibility
>>> matrix table, which would require inordinate amounts of time to
>>> research carefully.
>>
>>
>> This is pcc->pvr_match() for this purpose.
> 
> Hmm.. thinking about this.  Obviously requiring an exactly matching
> PVR is the architecturally "safest" approach.  For TCG and PR KVM, it
> really should be sufficient - if you can select "close" PVRs at each
> end, you should be able to select exactly matching ones just as well.
> 
> For HV KVM, we should generally be using compatibility modes to allow
> migration between a relatively wide range of CPUs.  My intention was
> basically to require moving to that model, rather than "approximate
> matching" real PVRs.

So the management stack (libvirt) will need to know that if it is HV KVM,
then -cpu host,compat=; if it is PR KVM, then -cpu  and no compat.
That was really annoying when we had exact PVR matching.


> I'm still convinced using compat modes is the right way to go medium
> to long term.  However, allowing the approximate matches could make
> for a more forgiving transition, if people have existing hosts in
> "raw" mode.

Within the family, CPUs behave exactly (not slightly but exactly) the same
even though 3 of 4 bytes of the PVR value are different, so enforcing PVR to
match or enforcing compatibility (which as a feature was not a great idea
from day one) does not sound compelling.

Does x86 have anything like this compatibility thingy?


> Ok, I'll add pvr_match checking to this.
> 


-- 
Alexey
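
A minimal sketch of the check debated in this thread, as I read the patch
description: a source running in a compatibility mode only needs a
destination that supports that mode, while a source in "raw" mode needs an
exactly matching PVR. The function check_migration() and its parameters
are hypothetical and are not the QEMU cpu_post_load() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns 0 if the migration would be accepted, -1 if it is rejected.
 * src_compat_pvr == 0 means the source ran in "raw" mode. */
int check_migration(uint32_t src_pvr, uint32_t src_compat_pvr,
                    uint32_t dst_pvr, bool dst_supports_compat)
{
    assert(src_pvr != 0);        /* 0 is not a meaningful PVR in this sketch */

    if (src_compat_pvr != 0) {
        /* Compat mode: the destination only has to support that mode. */
        return dst_supports_compat ? 0 : -1;
    }
    /* Raw mode: require exactly the same PVR on both sides. */
    return src_pvr == dst_pvr ? 0 : -1;
}
```

With exact matching, the two cases from the discussion are both rejected
in raw mode: 004d0200 to 004d0201 (the objection raised above) and
004B0201 to 00201400 (the case the check is meant to catch). Telling them
apart is what the suggested pvr_match hook would be for.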




Re: [Qemu-devel] [PATCH] boot-serial-test: Add a test for the powernv machine

2016-11-08 Thread Cédric Le Goater

On 11/09/2016 02:02 AM, David Gibson wrote:
> On Tue, Nov 08, 2016 at 02:05:35PM +0100, Cédric Le Goater wrote:
>> On 11/08/2016 01:36 PM, Thomas Huth wrote:
>>> The new powernv machine ships with a firmware that outputs
>>> some text to the serial console, so we can automatically
>>> test this machine type in the boot-serial tester, too.
>>> And to get some (very limited) test coverage for the new
>>> POWER9 CPU emulation, too, this test is also started with
>>> "-cpu POWER9".
>>
>> and we see the minimum :
>>
>>   [8450016,6] CPU: P9 generation processor(max 4 threads/core)
>>
>> Reviewed-by: Cédric Le Goater 
>>
>>
>>
>> With very minimal changes (definition of some SPRs and the use 
>> of the SHV mode), the guest would load the kernel.
> 
> Applied to ppc-for-2.8.  Good to have this basic smoke test for
> powernv.

yes. qom-test is also starting a powernv guest.

skiboot has a cool little program called hello_kernel that can be 
run in place of the real kernel, but that's beyond the qemu layer 
I guess 

For qemu, maybe we could do xscom accesses to test some devices.

C.

Re: [Qemu-devel] [RFC 11/17] ppc: Add ppc_set_compat_all()

2016-11-08 Thread Alexey Kardashevskiy

On 09/11/16 14:52, David Gibson wrote:
> On Wed, Nov 09, 2016 at 12:27:47PM +1100, Alexey Kardashevskiy wrote:
>> On 08/11/16 16:18, David Gibson wrote:
>>> On Fri, Nov 04, 2016 at 03:01:40PM +1100, Alexey Kardashevskiy wrote:
>>>> On 30/10/16 22:12, David Gibson wrote:
>>>>> Once a compatiblity mode is negotiated with the guest,
>>>>> h_client_architecture_support() uses run_on_cpu() to update each CPU to
>>>>> the new mode.  We're going to want this logic somewhere else shortly,
>>>>> so make a helper function to do this global update.
>>>>>
>>>>> We put it in target-ppc/compat.c - it makes as much sense at the CPU level
>>>>> as it does at the machine level.  We also move the cpu_synchronize_state()
>>>>> into ppc_set_compat(), since it doesn't really make any sense to call that
>>>>> without synchronizing state.
>>>>>
>>>>> Signed-off-by: David Gibson 
>>>>> ---
>>>>>  hw/ppc/spapr_hcall.c | 31 +--
>>>>>  target-ppc/compat.c  | 36 
>>>>>  target-ppc/cpu.h |  3 +++
>>>>>  3 files changed, 44 insertions(+), 26 deletions(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>>> index 3bd6d06..4eaf9a6 100644
>>>>> --- a/hw/ppc/spapr_hcall.c
>>>>> +++ b/hw/ppc/spapr_hcall.c
>>>>> @@ -881,20 +881,6 @@ static target_ulong h_set_mode(PowerPCCPU *cpu, 
>>>>> sPAPRMachineState *spapr,
>>>>>  return ret;
>>>>>  }
>>>>>  
>>>>> -typedef struct {
>>>>> -uint32_t compat_pvr;
>>>>> -Error *err;
>>>>> -} SetCompatState;
>>>>> -
>>>>> -static void do_set_compat(CPUState *cs, void *arg)
>>>>> -{
>>>>> -PowerPCCPU *cpu = POWERPC_CPU(cs);
>>>>> -SetCompatState *s = arg;
>>>>> -
>>>>> -cpu_synchronize_state(cs);
>>>>> -ppc_set_compat(cpu, s->compat_pvr, &s->err);
>>>>> -}
>>>>> -
>>>>>  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>>>sPAPRMachineState 
>>>>> *spapr,
>>>>>target_ulong opcode,
>>>>> @@ -902,7 +888,6 @@ static target_ulong 
>>>>> h_client_architecture_support(PowerPCCPU *cpu,
>>>>>  {
>>>>>  target_ulong list = ppc64_phys_to_real(args[0]);
>>>>>  target_ulong ov_table;
>>>>> -CPUState *cs;
>>>>>  bool explicit_match = false; /* Matched the CPU's real PVR */
>>>>>  uint32_t max_compat = cpu->max_compat;
>>>>>  uint32_t best_compat = 0;
>>>>> @@ -949,18 +934,12 @@ static target_ulong 
>>>>> h_client_architecture_support(PowerPCCPU *cpu,
>>>>>  
>>>>>  /* Update CPUs */
>>>>>  if (cpu->compat_pvr != best_compat) {
>>>>> -CPU_FOREACH(cs) {
>>>>> -SetCompatState s = {
>>>>> -.compat_pvr = best_compat,
>>>>> -.err = NULL,
>>>>> -};
>>>>> +Error *local_err = NULL;
>>>>>  
>>>>> -run_on_cpu(cs, do_set_compat, &s);
>>>>> -
>>>>> -if (s.err) {
>>>>> -error_report_err(s.err);
>>>>> -return H_HARDWARE;
>>>>> -}
>>>>> +ppc_set_compat_all(best_compat, &local_err);
>>>>> +if (local_err) {
>>>>> +error_report_err(local_err);
>>>>> +return H_HARDWARE;
>>>>>  }
>>>>>  }
>>>>>  
>>>>> diff --git a/target-ppc/compat.c b/target-ppc/compat.c
>>>>> index 1059555..0b12b58 100644
>>>>> --- a/target-ppc/compat.c
>>>>> +++ b/target-ppc/compat.c
>>>>> @@ -124,6 +124,8 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
>>>>> compat_pvr, Error **errp)
>>>>>  pcr = compat->pcr;
>>>>>  }
>>>>>  
>>>>> +cpu_synchronize_state(CPU(cpu));
>>>>> +
>>>>>  cpu->compat_pvr = compat_pvr;
>>>>>  env->spr[SPR_PCR] = pcr & pcc->pcr_mask;
>>>>>  
>>>>> @@ -136,6 +138,40 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
>>>>> compat_pvr, Error **errp)
>>>>>  }
>>>>>  }
>>>>>  
>>>>> +#if !defined(CONFIG_USER_ONLY)
>>>>> +typedef struct {
>>>>> +uint32_t compat_pvr;
>>>>> +Error *err;
>>>>> +} SetCompatState;
>>>>> +
>>>>> +static void do_set_compat(CPUState *cs, void *arg)
>>>>> +{
>>>>> +PowerPCCPU *cpu = POWERPC_CPU(cs);
>>>>> +SetCompatState *s = arg;
>>>>> +
>>>>> +ppc_set_compat(cpu, s->compat_pvr, &s->err);
>>>>> +}
>>>>> +
>>>>> +void ppc_set_compat_all(uint32_t compat_pvr, Error **errp)
>>>>> +{
>>>>> +CPUState *cs;
>>>>> +
>>>>> +CPU_FOREACH(cs) {
>>>>> +SetCompatState s = {
>>>>> +.compat_pvr = compat_pvr,
>>>>> +.err = NULL,
>>>>> +};
>>>>> +
>>>>> +run_on_cpu(cs, do_set_compat, &s);
>>>>> +
>>>>> +if (s.err) {
>>>>> +error_propagate(errp, s.err);
>>>>> +return;
>>>>> +}
>>>>> +}
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>>  int ppc_compat_max_threads(PowerPCCPU *cpu)
>>>>>  {
>>>>>  const CompatInfo *compat = compat_by_pvr(cpu->compat_pvr);
>>>>> diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
>>>>> index 91e8be8..201a655 100644

Re: [Qemu-devel] [PATCH] spapr: Fix migration of PCI host bridges from qemu-2.7

2016-11-08 Thread Alexey Kardashevskiy

On 09/11/16 14:45, David Gibson wrote:
> daa2369 "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from
> qemu-2.7 to the current version.  It split the device's MMIO window into
> two pieces for 32-bit and 64-bit MMIO.
> 
> The patch included backwards compatibility code to convert the old property
> into the new format.  However, the property value was also transferred in
> the migration stream and compared with a (probably unwise) VMSTATE_EQUAL.
> So, the "raw" value from 2.7 is compared to the new style converted value
> from (pre-)2.8 giving a mismatch and migration failure.
> 
> Although it would be technically possible to fix this in a way allowing
> backwards migration, that would leave an ugly legacy around indefinitely.
> This patch takes the simpler approach of bumping the migration version,
> dropping the unwise VMSTATE_EQUAL (and some equally unwise ones around it)
> and ignoring them on an incoming migration.
> 
> Signed-off-by: David Gibson 
> ---
>  hw/ppc/spapr_pci.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 7cde30e..7f1cc29 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1658,19 +1658,24 @@ static int spapr_pci_post_load(void *opaque, int 
> version_id)
>  return 0;
>  }
>  
> +static bool version_before_3(void *opaque, int version_id)
> +{
> +return version_id < 3;
> +}
> +
>  static const VMStateDescription vmstate_spapr_pci = {
>  .name = "spapr_pci",
> -.version_id = 2,
> +.version_id = 3,
>  .minimum_version_id = 2,
>  .pre_save = spapr_pci_pre_save,
>  .post_load = spapr_pci_post_load,
>  .fields = (VMStateField[]) {
>  VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),


You could probably go one step further and get rid of @buid as well.

Nevertheless, this works,

Reviewed-by: Alexey Kardashevskiy 




> -VMSTATE_UINT32_EQUAL(dma_liobn[0], sPAPRPHBState),
> -VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
> -VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
> -VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
> -VMSTATE_UINT64_EQUAL(io_win_size, sPAPRPHBState),
> +VMSTATE_UNUSED_TEST(version_before_3, sizeof(uint32_t) /* 
> dma_liobn[0] */
> ++ sizeof(uint64_t) /* mem_win_addr */
> ++ sizeof(uint64_t) /* mem_win_size */
> ++ sizeof(uint64_t) /* io_win_addr */
> ++ sizeof(uint64_t) /* io_win_size */),
>  VMSTATE_STRUCT_ARRAY(lsi_table, sPAPRPHBState, PCI_NUM_PINS, 0,
>   vmstate_spapr_pci_lsi, struct spapr_pci_lsi),
>  VMSTATE_INT32(msi_devs_num, sPAPRPHBState),
> 


-- 
Alexey

Re: [Qemu-devel] [RFC 13/17] pseries: Move CPU compatibility property to machine

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 04:56:10PM +1100, Alexey Kardashevskiy wrote:
> On 08/11/16 16:26, David Gibson wrote:
> > On Fri, Nov 04, 2016 at 06:43:52PM +1100, Alexey Kardashevskiy wrote:
> >> On 30/10/16 22:12, David Gibson wrote:
> >>> Server class POWER CPUs have a "compat" property, which is used to set the
> >>> backwards compatibility mode for the processor.  However, this only makes
> >>> sense for machine types which don't give the guest access to hypervisor
> >>> privilege - otherwise the compatibility level is under the guest's 
> >>> control.
> >>>
> >>> To reflect this, this removes the CPU 'compat' property and instead
> >>> creates a 'max-cpu-compat' property on the pseries machine.  Strictly
> >>> speaking this breaks compatibility, but AFAIK the 'compat' option was
> >>> never (directly) used with -device or device_add.
> >>>
> >>> The option was used with -cpu.  So, to maintain compatibility, this patch
> >>> adds a hack to the cpu option parsing to strip out any compat options
> >>> supplied with -cpu and set them on the machine property instead of the new
> >>> removed cpu property.
> >>>
> >>> Signed-off-by: David Gibson 
> >>> ---
> >>>  hw/ppc/spapr.c  |  6 +++-
> >>>  hw/ppc/spapr_cpu_core.c | 47 +++--
> >>>  hw/ppc/spapr_hcall.c|  2 +-
> >>>  include/hw/ppc/spapr.h  | 10 +--
> >>>  target-ppc/compat.c | 65 
> >>>  target-ppc/cpu.h|  6 ++--
> >>>  target-ppc/translate_init.c | 73 
> >>> -
> >>>  7 files changed, 127 insertions(+), 82 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 6c78889..b983faa 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -1849,7 +1849,7 @@ static void ppc_spapr_init(MachineState *machine)
> >>>  machine->cpu_model = kvm_enabled() ? "host" : 
> >>> smc->tcg_default_cpu;
> >>>  }
> >>>  
> >>> -ppc_cpu_parse_features(machine->cpu_model);
> >>> +spapr_cpu_parse_features(spapr);
> >>>  
> >>>  spapr_init_cpus(spapr);
> >>>  
> >>> @@ -2191,6 +2191,10 @@ static void spapr_machine_initfn(Object *obj)
> >>>  " place of standard EPOW events when 
> >>> possible"
> >>>  " (required for memory hot-unplug 
> >>> support)",
> >>>  NULL);
> >>> +
> >>> +object_property_add(obj, "max-cpu-compat", "str",
> >>> +ppc_compat_prop_get, ppc_compat_prop_set,
> > >>> +NULL, &spapr->max_compat_pvr, &error_fatal);
> >>>  }
> >>>  
> >>>  static void spapr_machine_finalizefn(Object *obj)
> >>> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> >>> index ee5cd14..0319516 100644
> >>> --- a/hw/ppc/spapr_cpu_core.c
> >>> +++ b/hw/ppc/spapr_cpu_core.c
> >>> @@ -18,6 +18,49 @@
> >>>  #include "target-ppc/mmu-hash64.h"
> >>>  #include "sysemu/numa.h"
> >>>  
> >>> +void spapr_cpu_parse_features(sPAPRMachineState *spapr)
> >>> +{
> >>> +/*
> >>> + * Backwards compatibility hack:
> >>> +
> >>> + *   CPUs had a "compat=" property which didn't make sense for
> >>> + *   anything except pseries.  It was replaced by "max-cpu-compat"
> >>> + *   machine option.  This supports old command lines like
> >>> + *   -cpu POWER8,compat=power7
> >>> + *   By stripping the compat option and applying it to the machine
> >>> + *   before passing it on to the cpu level parser.
> >>> + */
> >>> +gchar **inpieces, **outpieces;
> >>> +int n, i, j;
> >>> +gchar *compat_str = NULL;
> >>> +gchar *filtered_model;
> >>> +
> >>> +inpieces = g_strsplit(MACHINE(spapr)->cpu_model, ",", 0);
> >>> +n = g_strv_length(inpieces);
> >>> +outpieces = g_new0(gchar *, g_strv_length(inpieces));
> >>> +
> >>> +/* inpieces[0] is the actual model string */
> >>> +for (i = 0, j = 0; i < n; i++) {
> >>> +if (g_str_has_prefix(inpieces[i], "compat=")) {
> >>> +compat_str = inpieces[i];
> >>> +} else {
> >>> +outpieces[j++] = g_strdup(inpieces[i]);
> >>> +}
> >>> +}
> >>> +
> >>> +if (compat_str) {
> >>> +char *val = compat_str + strlen("compat=");
> >>> +object_property_set_str(OBJECT(spapr), val, "max-cpu-compat",
> >>> +&error_fatal);
> >>
> >> This part is ok.
> >>
> >>> +}
> >>> +
> >>> +filtered_model = g_strjoinv(",", outpieces);
> >>> +ppc_cpu_parse_features(filtered_model);
> >>
> >>
> >> Rather than reducing the CPU parameters string from the command line, I'd
> >> keep "dc->props = powerpc_servercpu_properties" and make them noop + warn
> >> to use the machine option instead. One day QEMU may start calling the CPU
> >> features parser itself and somebody will have to hack this thing
> >> again.
> > 
> > Hrm.  A deprecation

Re: [Qemu-devel] [RFC 11/17] ppc: Add ppc_set_compat_all()

2016-11-08 Thread David Gibson

On Wed, Nov 09, 2016 at 12:27:47PM +1100, Alexey Kardashevskiy wrote:
> On 08/11/16 16:18, David Gibson wrote:
> > On Fri, Nov 04, 2016 at 03:01:40PM +1100, Alexey Kardashevskiy wrote:
> >> On 30/10/16 22:12, David Gibson wrote:
> >>> Once a compatiblity mode is negotiated with the guest,
> >>> h_client_architecture_support() uses run_on_cpu() to update each CPU to
> >>> the new mode.  We're going to want this logic somewhere else shortly,
> >>> so make a helper function to do this global update.
> >>>
> >>> We put it in target-ppc/compat.c - it makes as much sense at the CPU level
> >>> as it does at the machine level.  We also move the cpu_synchronize_state()
> >>> into ppc_set_compat(), since it doesn't really make any sense to call that
> >>> without synchronizing state.
> >>>
> >>> Signed-off-by: David Gibson 
> >>> ---
> >>>  hw/ppc/spapr_hcall.c | 31 +--
> >>>  target-ppc/compat.c  | 36 
> >>>  target-ppc/cpu.h |  3 +++
> >>>  3 files changed, 44 insertions(+), 26 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> >>> index 3bd6d06..4eaf9a6 100644
> >>> --- a/hw/ppc/spapr_hcall.c
> >>> +++ b/hw/ppc/spapr_hcall.c
> >>> @@ -881,20 +881,6 @@ static target_ulong h_set_mode(PowerPCCPU *cpu, 
> >>> sPAPRMachineState *spapr,
> >>>  return ret;
> >>>  }
> >>>  
> >>> -typedef struct {
> >>> -uint32_t compat_pvr;
> >>> -Error *err;
> >>> -} SetCompatState;
> >>> -
> >>> -static void do_set_compat(CPUState *cs, void *arg)
> >>> -{
> >>> -PowerPCCPU *cpu = POWERPC_CPU(cs);
> >>> -SetCompatState *s = arg;
> >>> -
> >>> -cpu_synchronize_state(cs);
> >>> -ppc_set_compat(cpu, s->compat_pvr, &s->err);
> >>> -}
> >>> -
> >>>  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>>sPAPRMachineState 
> >>> *spapr,
> >>>target_ulong opcode,
> >>> @@ -902,7 +888,6 @@ static target_ulong 
> >>> h_client_architecture_support(PowerPCCPU *cpu,
> >>>  {
> >>>  target_ulong list = ppc64_phys_to_real(args[0]);
> >>>  target_ulong ov_table;
> >>> -CPUState *cs;
> >>>  bool explicit_match = false; /* Matched the CPU's real PVR */
> >>>  uint32_t max_compat = cpu->max_compat;
> >>>  uint32_t best_compat = 0;
> >>> @@ -949,18 +934,12 @@ static target_ulong 
> >>> h_client_architecture_support(PowerPCCPU *cpu,
> >>>  
> >>>  /* Update CPUs */
> >>>  if (cpu->compat_pvr != best_compat) {
> >>> -CPU_FOREACH(cs) {
> >>> -SetCompatState s = {
> >>> -.compat_pvr = best_compat,
> >>> -.err = NULL,
> >>> -};
> >>> +Error *local_err = NULL;
> >>>  
> >>> -run_on_cpu(cs, do_set_compat, &s);
> >>> -
> >>> -if (s.err) {
> >>> -error_report_err(s.err);
> >>> -return H_HARDWARE;
> >>> -}
> >>> +ppc_set_compat_all(best_compat, &local_err);
> >>> +if (local_err) {
> >>> +error_report_err(local_err);
> >>> +return H_HARDWARE;
> >>>  }
> >>>  }
> >>>  
> >>> diff --git a/target-ppc/compat.c b/target-ppc/compat.c
> >>> index 1059555..0b12b58 100644
> >>> --- a/target-ppc/compat.c
> >>> +++ b/target-ppc/compat.c
> >>> @@ -124,6 +124,8 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
> >>> compat_pvr, Error **errp)
> >>>  pcr = compat->pcr;
> >>>  }
> >>>  
> >>> +cpu_synchronize_state(CPU(cpu));
> >>> +
> >>>  cpu->compat_pvr = compat_pvr;
> >>>  env->spr[SPR_PCR] = pcr & pcc->pcr_mask;
> >>>  
> >>> @@ -136,6 +138,40 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
> >>> compat_pvr, Error **errp)
> >>>  }
> >>>  }
> >>>  
> >>> +#if !defined(CONFIG_USER_ONLY)
> >>> +typedef struct {
> >>> +uint32_t compat_pvr;
> >>> +Error *err;
> >>> +} SetCompatState;
> >>> +
> >>> +static void do_set_compat(CPUState *cs, void *arg)
> >>> +{
> >>> +PowerPCCPU *cpu = POWERPC_CPU(cs);
> >>> +SetCompatState *s = arg;
> >>> +
> >>> +ppc_set_compat(cpu, s->compat_pvr, &s->err);
> >>> +}
> >>> +
> >>> +void ppc_set_compat_all(uint32_t compat_pvr, Error **errp)
> >>> +{
> >>> +CPUState *cs;
> >>> +
> >>> +CPU_FOREACH(cs) {
> >>> +SetCompatState s = {
> >>> +.compat_pvr = compat_pvr,
> >>> +.err = NULL,
> >>> +};
> >>> +
> >>> +run_on_cpu(cs, do_set_compat, &s);
> >>> +
> >>> +if (s.err) {
> >>> +error_propagate(errp, s.err);
> >>> +return;
> >>> +}
> >>> +}
> >>> +}
> >>> +#endif
> >>> +
> >>>  int ppc_compat_max_threads(PowerPCCPU *cpu)
> >>>  {
> >>>  const CompatInfo *compat = compat_by_pvr(cpu->compat_pvr);
> >>> diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
> >>> index 91e8be8..201a655 100644
> >>> --- a/target-ppc/cpu.h
> >>> +++

Re: [Qemu-devel] [RFC 15/17] ppc: Check that CPU model stays consistent across migration

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 05:03:49PM +1100, Alexey Kardashevskiy wrote:
> On 08/11/16 16:29, David Gibson wrote:
> > On Fri, Nov 04, 2016 at 06:54:48PM +1100, Alexey Kardashevskiy wrote:
> >> On 30/10/16 22:12, David Gibson wrote:
> >>> When a vmstate for the ppc cpu was first introduced (a90db15 "target-ppc:
> >>> Convert ppc cpu savevm to VMStateDescription"), a VMSTATE_EQUAL was used
> >>> to ensure that identical CPU models were used at source and destination
> >>> as based on the PVR (Processor Version Register).
> >>>
> >>> However this was a problem for HV KVM, where due to hardware limitations
> >>> we always need to use the real PVR of the host CPU.  So, to allow
> >>> migration between hosts with "similar enough" CPUs, the PVR check was
> >>> removed in 569be9f0 "target-ppc: Remove PVR check from migration".  This
> >>> left the onus on user / management to only attempt migration between
> >>> compatible CPUs.
> >>>
> >>> Now that we've reworked the handling of compatiblity modes, we have the
> >>> information to actually determine if we're making a compatible migration.
> >>> So this patch partially restores the PVR check.  If the source was running
> >>> in a compatibility mode, we just make sure that the destination cpu can
> >>> also run in that compatibility mode.  However, if the source was running
> >>> in "raw" mode, we verify that the destination has the same PVR value.
> >>>
> >>> Signed-off-by: David Gibson 
> >>> ---
> >>>  target-ppc/machine.c | 15 +++
> >>>  1 file changed, 11 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> >>> index 5d87ff6..62b9e94 100644
> >>> --- a/target-ppc/machine.c
> >>> +++ b/target-ppc/machine.c
> >>> @@ -173,10 +173,12 @@ static int cpu_post_load(void *opaque, int 
> >>> version_id)
> >>>  target_ulong msr;
> >>>  
> >>>  /*
> >>> - * We always ignore the source PVR. The user or management
> >>> - * software has to take care of running QEMU in a compatible mode.
> >>> + * If we're operating in compat mode, we should be ok as long as
> >>> + * the destination supports the same compatiblity mode.
> >>> + *
> >>> + * Otherwise, however, we require that the destination has exactly
> >>> + * the same CPU model as the source.
> >>>   */
> >>> -env->spr[SPR_PVR] = env->spr_cb[SPR_PVR].default_value;
> >>>  
> >>>  #if defined(TARGET_PPC64)
> >>>  if (cpu->compat_pvr) {
> >>> @@ -188,8 +190,13 @@ static int cpu_post_load(void *opaque, int 
> >>> version_id)
> >>>  error_free(local_err);
> >>>  return -1;
> >>>  }
> >>> -}
> >>> +} else
> >>>  #endif
> >>> +{
> >>> +if (env->spr[SPR_PVR] != env->spr_cb[SPR_PVR].default_value) {
> >>> +return -1;
> >>> +}
> >>> +}
> >>
> >> This should break migration from host with PVR=004d0200 to host with
> >> PVR=004d0201, what is the benefit of such limitation?
> > 
> > There probably isn't one.  But the point is it also blocks migration
> > from a host with PVR=004B0201 (POWER8) to one with PVR=00201400
> > (403GCX) and *that* has a clear benefit.  I don't see a way to block
> > the second without the first, except by creating a huge compatibility
> > matrix table, which would require inordinate amounts of time to
> > research carefully.
> 
> 
> This is pcc->pvr_match() for this purpose.

Hmm.. thinking about this.  Obviously requiring an exactly matching
PVR is the architecturally "safest" approach.  For TCG and PR KVM, it
really should be sufficient - if you can select "close" PVRs at each
end, you should be able to select exactly matching ones just as well.

For HV KVM, we should generally be using compatibility modes to allow
migration between a relatively wide range of CPUs.  My intention was
basically to require moving to that model, rather than "approximate
matching" real PVRs.

I'm still convinced using compat modes is the right way to go medium
to long term.  However, allowing the approximate matches could make
for a more forgiving transition, if people have existing hosts in
"raw" mode.

Ok, I'll add pvr_match checking to this.
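
A rough sketch of the combined check described above, for illustration
only.  It is standalone C, not the QEMU code: pvr_family_match() is a
hypothetical stand-in for pcc->pvr_match() and assumes that "close
enough" means the same upper 16 bits of the PVR, and
dest_supports_compat() is a made-up placeholder for the real
compatibility mode test.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pcc->pvr_match(): two PVRs are treated as
 * "close enough" when their upper 16 bits agree.  The real per-class
 * callbacks may apply different rules.
 */
static bool pvr_family_match(uint32_t dest_pvr, uint32_t src_pvr)
{
    return (dest_pvr & 0xffff0000u) == (src_pvr & 0xffff0000u);
}

/*
 * Made-up placeholder: pretend the destination supports every
 * compatibility mode up to and including dest_max_compat.
 */
static bool dest_supports_compat(uint32_t compat_pvr, uint32_t dest_max_compat)
{
    return compat_pvr <= dest_max_compat;
}

/* Returns 0 if the incoming CPU state is acceptable, -1 otherwise. */
static int check_incoming_cpu(uint32_t src_pvr, uint32_t src_compat_pvr,
                              uint32_t dest_pvr, uint32_t dest_max_compat)
{
    if (src_compat_pvr) {
        /* Source ran in a compat mode: the destination must support it. */
        return dest_supports_compat(src_compat_pvr, dest_max_compat) ? 0 : -1;
    }

    /* Source ran in "raw" mode: require an approximately matching PVR. */
    return pvr_family_match(dest_pvr, src_pvr) ? 0 : -1;
}
```

Under these assumptions, a raw mode migration from PVR=004d0200 to
PVR=004d0201 is accepted, while 004B0201 to 00201400 is still rejected.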

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Dave Young

On 11/09/16 at 11:58am, Wen Congyang wrote:
> On 11/09/2016 11:17 AM, Dave Young wrote:
> > Drop qiaonuohan, seems the mail address is wrong..
> > 
> > On 11/09/16 at 11:01am, Dave Young wrote:
> >> Hi,
> >>
> >> Latest linux kernel enabled kaslr to randomiz phys/virt memory
> >> addresses, we had some effort to support kexec/kdump so that crash
> >> utility can still works in case crashed kernel has kaslr enabled.
> >>
> >> But according to Dave Anderson virsh dump does not work, quoted messages
> >> from Dave below:
> >>
> >> """
> >> with virsh dump, there's no way of even knowing that KASLR
> >> has randomized the kernel __START_KERNEL_map region, because there is no
> >> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
> >> vmcoreinfo data to compare against the vmlinux file symbol value.
> >> Unless virsh dump can export some basic virtual memory data, which
> >> they say it can't, I don't see how KASLR can ever be supported.
> >> """
> >>
> >> I assume virsh dump is using qemu guest memory dump facility so it
> >> should be first addressed in qemu. Thus post this query to qemu devel
> >> list. If this is not correct please let me know.
> 
> IIRC, 'virsh dump --memory-only' uses dump-guest-memory, and 'virsh dump'
> uses migration to dump.

Do they need different fixes? Dave, I guess you mean --memory-only, but
could you clarify and confirm it?

> 
> I think I should study kaslr first...

Thanks for taking care of it.

> 
> Thanks
> Wen Congyang
> 
> >>
> >> Could you qemu dump people make it work? Or we can not support virt dump
> >> as long as KASLR being enabled. Latest Fedora kernel has enabled it in 
> >> x86_64.
> >>
> >> Thanks
> >> Dave
> > 
> > 
> > 
> 
> 
>

Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Wen Congyang

On 11/09/2016 11:17 AM, Dave Young wrote:
> Drop qiaonuohan, seems the mail address is wrong..
> 
> On 11/09/16 at 11:01am, Dave Young wrote:
>> Hi,
>>
>> Latest linux kernel enabled kaslr to randomiz phys/virt memory
>> addresses, we had some effort to support kexec/kdump so that crash
>> utility can still works in case crashed kernel has kaslr enabled.
>>
>> But according to Dave Anderson virsh dump does not work, quoted messages
>> from Dave below:
>>
>> """
>> with virsh dump, there's no way of even knowing that KASLR
>> has randomized the kernel __START_KERNEL_map region, because there is no
>> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
>> vmcoreinfo data to compare against the vmlinux file symbol value.
>> Unless virsh dump can export some basic virtual memory data, which
>> they say it can't, I don't see how KASLR can ever be supported.
>> """
>>
>> I assume virsh dump is using qemu guest memory dump facility so it
>> should be first addressed in qemu. Thus post this query to qemu devel
>> list. If this is not correct please let me know.

IIRC, 'virsh dump --memory-only' uses dump-guest-memory, and 'virsh dump'
uses migration to dump.

I think I should study kaslr first...

Thanks
Wen Congyang

>>
>> Could you qemu dump people make it work? Or we can not support virt dump
>> as long as KASLR being enabled. Latest Fedora kernel has enabled it in 
>> x86_64.
>>
>> Thanks
>> Dave
> 
> 
>

Re: [Qemu-devel] [PATCH] docs: fix COLO architecture diagram

2016-11-08 Thread Zhang Chen


Does anyone have any comments?

Ping


Thanks

Zhang Chen


On 11/01/2016 03:06 PM, Zhang Chen wrote:



On 11/01/2016 02:25 PM, Hailiang Zhang wrote:

Hmm, other content in this file needs to be updated too. For example,
we now support the blockdev-add command for nbd, so we can convert the
related hmp commands to their qmp equivalents.
But since we haven't integrated the COLO frame with block replication
and the proxy yet, it is OK to fix them later.

For COLO, the basic capability is still incomplete,
but we have now entered the 'Soft feature freeze' stage.
I'm wondering whether it is still possible at this point to combine
the COLO frame with the proxy and block replication, which only needs
three or four patches that only touch colo related files ...
Otherwise users still can't test the COLO feature in qemu 2.8.

Any ideas ? Thanks.


I will send some patches that combine COLO-Proxy with the COLO frame in the
future.




On 2016/11/1 11:38, Zhang Chen wrote:

Fix COLO-Proxy part of COLO architecture diagram

Signed-off-by: Zhang Chen 


Reviewed-by: zhanghailiang 

All such patches can go through trivial branch.

Cc: qemu-triv...@nongnu.org


I think this patch is about the COLO architecture,
so I didn't cc qemu-trivial.


Thanks
Zhang Chen




---
  docs/COLO-FT.txt | 72 
+---

  1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt
index 6282938..e289be2 100644
--- a/docs/COLO-FT.txt
+++ b/docs/COLO-FT.txt
@@ -41,41 +41,43 @@ identical responses to all client requests. Once 
the differences in the outputs
  are detected between the PVM and SVM, COLO withholds transmission 
of the
  outbound packets until it has successfully synchronized the PVM 
state to the SVM.


[ASCII diagram hunk: the box layout did not survive in the archive. The patch replaces the COLO architecture diagram, which shows a Primary Node and a Secondary Node, each with HeartBeat, QEMU (Failover, VM Checkpoint, COLO disk Manager), COLO Proxy (compare packet on the primary, adjust sequence on the secondary), VM Monitor, Kernel, Storage and External Network. The message is truncated at this point.]

Re: [Qemu-devel] [PULL 15/16] spapr_pci: Add a 64-bit MMIO window

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 02:59:30PM +1100, Alexey Kardashevskiy wrote:
> On 08/11/16 12:16, David Gibson wrote:
> > On Fri, Nov 04, 2016 at 04:03:31PM +1100, Alexey Kardashevskiy wrote:
> >> On 17/10/16 13:43, David Gibson wrote:
> >>> On real hardware, and under pHyp, the PCI host bridges on Power machines
> >>> typically advertise two outbound MMIO windows from the guest's physical
> >>> memory space to PCI memory space:
> >>>   - A 32-bit window which maps onto 2GiB..4GiB in the PCI address space
> >>>   - A 64-bit window which maps onto a large region somewhere high in PCI
> >>> address space (traditionally this used an identity mapping from guest
> >>> physical address to PCI address, but that's not always the case)
> >>>
> >>> The qemu implementation in spapr-pci-host-bridge, however, only supports a
> >>> single outbound MMIO window, however.  At least some Linux versions expect
> >>> the two windows however, so we arranged this window to map onto the PCI
> >>> memory space from 2 GiB..~64 GiB, then advertised it as two contiguous
> >>> windows, the "32-bit" window from 2G..4G and the "64-bit" window from
> >>> 4G..~64G.
> >>>
> >>> This approach means, however, that the 64G window is not naturally 
> >>> aligned.
> >>> In turn this limits the size of the largest BAR we can map (which does 
> >>> have
> >>> to be naturally aligned) to roughly half of the total window.  With some
> >>> large nVidia GPGPU cards which have huge memory BARs, this is starting to
> >>> be a problem.
> >>>
> >>> This patch adds true support for separate 32-bit and 64-bit outbound MMIO
> >>> windows to the spapr-pci-host-bridge implementation, each of which can
> >>> be independently configured.  The 32-bit window always maps to 2G.. in PCI
> >>> space, but the PCI address of the 64-bit window can be configured (it
> >>> defaults to the same as the guest physical address).
> >>>
> >>> So as not to break possible existing configurations, as long as a 64-bit
> >>> window is not specified, a large single window can be specified.  This
> >>> will appear the same way to the guest as the old approach, although it's
> >>> now implemented by two contiguous memory regions rather than a single one.
> >>>
> >>> For now, this only adds the possibility of 64-bit windows.  The default
> >>> configuration still uses the legacy mode.
> >>
> >>
> >> This breaks migration to QEMU v2.7, the destination reports:
> >>
> >> 22901@1478235261.799031:vmstate_load spapr_pci, spapr_pci
> >> 22901@1478235261.799040:vmstate_load_field_error field "mem_win_size" load
> >> failed, ret = -22
> >> qemu-hostos1: error while loading state for instance 0x0 of device 
> >> 'spapr_pci'
> >> 22901@1478235261.801324:migrate_set_state new state 7
> >> qemu-hostos1: load of migration failed: Invalid argument
> >>
> >>
> >> mem_win_size decreased from 0xf8000 to 0x8000.
> >>
> >> I'd think it should be allowed to migrate like this.
> > 
> > AIUI, we don't generally care (upstream) about migration from newer to
> > older qemu, only from older to newer. 
> 
> Older (v2.7.0) to newer (current upstream with -machine pseries-2.7) does
> not work either with the exact same symptom.

Drat.  Ok.. I see why.  I was converting the old style property into
new-style meaning during property parsing, but the "raw" property
value was still being sent and compared in the migration stream.

It would be possible to fix it both ways, by keeping around the "raw"
mem_window_size parameter and having some pre_save / post_load logic
to shuffle the various possibilities.  But that's going to leave ugly
cruft around indefinitely.  I think it's preferable to just bump the
version number and drop those more-trouble-than-they're-worth
VMSTATE_EQUAL fields.

Patch coming shortly.
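
To illustrate the approach (a simplified standalone sketch, not the
patch itself): on load, a version 2 stream still carries the five old
fields, so the loader skips that many bytes, while a version 3 stream
omits them.  The Stream type and the helper names below are
hypothetical, and the single 32-bit value read afterwards merely stands
in for whatever field follows in the real description.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes occupied by the fields that version 3 no longer sends. */
enum {
    LEGACY_BYTES = sizeof(uint32_t)       /* dma_liobn[0] */
                 + 4 * sizeof(uint64_t)   /* mem/io window addr and size */
};

/* Hypothetical minimal input stream over a memory buffer. */
typedef struct {
    const uint8_t *buf;
    size_t len;
    size_t pos;
} Stream;

/* Advance past n bytes; fails if the stream is too short. */
static int stream_skip(Stream *s, size_t n)
{
    if (s->len - s->pos < n) {
        return -1;
    }
    s->pos += n;
    return 0;
}

/* Read one big-endian 32-bit value. */
static int stream_get_be32(Stream *s, uint32_t *out)
{
    if (s->len - s->pos < 4) {
        return -1;
    }
    *out = ((uint32_t)s->buf[s->pos] << 24) |
           ((uint32_t)s->buf[s->pos + 1] << 16) |
           ((uint32_t)s->buf[s->pos + 2] << 8) |
           (uint32_t)s->buf[s->pos + 3];
    s->pos += 4;
    return 0;
}

/* Ignore the legacy block for old streams, then load the next field. */
static int load_phb_state(Stream *s, int version_id, uint32_t *next_field)
{
    if (version_id < 3 && stream_skip(s, LEGACY_BYTES) < 0) {
        return -1;
    }
    return stream_get_be32(s, next_field);
}
```

The old values are read past and never compared, so a 2.7 source with a
"raw" window size no longer trips an equality check on the destination.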

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



[Qemu-devel] [PATCH] spapr: Fix migration of PCI host bridges from qemu-2.7

2016-11-08 Thread David Gibson

daa2369 "spapr_pci: Add a 64-bit MMIO window" subtly broke migration from
qemu-2.7 to the current version.  It split the device's MMIO window into
two pieces for 32-bit and 64-bit MMIO.

The patch included backwards compatibility code to convert the old property
into the new format.  However, the property value was also transferred in
the migration stream and compared with a (probably unwise) VMSTATE_EQUAL.
So, the "raw" value from 2.7 is compared to the new style converted value
from (pre-)2.8 giving a mismatch and migration failure.

Although it would be technically possible to fix this in a way allowing
backwards migration, that would leave an ugly legacy around indefinitely.
This patch takes the simpler approach of bumping the migration version,
dropping the unwise VMSTATE_EQUAL (and some equally unwise ones around it)
and ignoring them on an incoming migration.

Signed-off-by: David Gibson 
---
 hw/ppc/spapr_pci.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 7cde30e..7f1cc29 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1658,19 +1658,24 @@ static int spapr_pci_post_load(void *opaque, int 
version_id)
 return 0;
 }
 
+static bool version_before_3(void *opaque, int version_id)
+{
+return version_id < 3;
+}
+
 static const VMStateDescription vmstate_spapr_pci = {
 .name = "spapr_pci",
-.version_id = 2,
+.version_id = 3,
 .minimum_version_id = 2,
 .pre_save = spapr_pci_pre_save,
 .post_load = spapr_pci_post_load,
 .fields = (VMStateField[]) {
 VMSTATE_UINT64_EQUAL(buid, sPAPRPHBState),
-VMSTATE_UINT32_EQUAL(dma_liobn[0], sPAPRPHBState),
-VMSTATE_UINT64_EQUAL(mem_win_addr, sPAPRPHBState),
-VMSTATE_UINT64_EQUAL(mem_win_size, sPAPRPHBState),
-VMSTATE_UINT64_EQUAL(io_win_addr, sPAPRPHBState),
-VMSTATE_UINT64_EQUAL(io_win_size, sPAPRPHBState),
+VMSTATE_UNUSED_TEST(version_before_3, sizeof(uint32_t) /* dma_liobn[0] 
*/
++ sizeof(uint64_t) /* mem_win_addr */
++ sizeof(uint64_t) /* mem_win_size */
++ sizeof(uint64_t) /* io_win_addr */
++ sizeof(uint64_t) /* io_win_size */),
 VMSTATE_STRUCT_ARRAY(lsi_table, sPAPRPHBState, PCI_NUM_PINS, 0,
  vmstate_spapr_pci_lsi, struct spapr_pci_lsi),
 VMSTATE_INT32(msi_devs_num, sPAPRPHBState),
-- 
2.7.4

Re: [Qemu-devel] [PATCH v11 15/22] vfio: Introduce vfio_set_irqs_validate_and_prepare()

2016-11-08 Thread Alex Williamson

On Wed, 9 Nov 2016 14:07:58 +1100
Alexey Kardashevskiy  wrote:
> On 09/11/16 07:22, Kirti Wankhede wrote:
> > On 11/8/2016 2:16 PM, Alexey Kardashevskiy wrote:  
> >> On 05/11/16 08:10, Kirti Wankhede wrote:  
> >>> Vendor driver using mediated device framework would use same mechnism to
> >>> validate and prepare IRQs. Introducing this function to reduce code
> >>> replication in multiple drivers.
> >>>
> >>> Signed-off-by: Kirti Wankhede 
> >>> Signed-off-by: Neo Jia 
> >>> Change-Id: Ie201f269dda0713ca18a07dc4852500bd8b48309
> >>> ---
> >>>  drivers/vfio/vfio.c  | 48 
> >>> 
> >>>  include/linux/vfio.h |  4 
> >>>  2 files changed, 52 insertions(+)
> >>>
> >>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>> index 9a03be0942a1..ed2361e4b904 100644
> >>> --- a/drivers/vfio/vfio.c
> >>> +++ b/drivers/vfio/vfio.c
> >>> @@ -1858,6 +1858,54 @@ int vfio_info_add_capability(struct vfio_info_cap 
> >>> *caps, int cap_type_id,
> >>>  }
> >>>  EXPORT_SYMBOL(vfio_info_add_capability);
> >>>  
> >>> +int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int 
> >>> num_irqs,
> >>> +int max_irq_type, size_t *data_size)
> >>> +{
> >>> + unsigned long minsz;
> >>> + size_t size;
> >>> +
> >>> + minsz = offsetofend(struct vfio_irq_set, count);
> >>> +
> >>> + if ((hdr->argsz < minsz) || (hdr->index >= max_irq_type) ||
> >>> + (hdr->count >= (U32_MAX - hdr->start)) ||
> >>> + (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> >>> + VFIO_IRQ_SET_ACTION_TYPE_MASK)))
> >>> + return -EINVAL;
> >>> +
> >>> + if (data_size)  
> >>
> >> Pointless check, the callers will pass non null pointer with value
> >> initialized to 0 anyway.
> >>  
> > 
> > Not always, When VFIO_IRQ_SET_DATA_NONE flag is set, caller can pass
> > data_size = NULL.  
> 
> 
> Today data_size is not NULL in all cases and the way it is used now (ioctl
> VFIO_DEVICE_SET_IRQS) gives me an idea that this is not going to change.
> 
> >   
> >>  
> >>> + *data_size = 0;
> >>> +
> >>> + if (hdr->start >= num_irqs || hdr->start + hdr->count > num_irqs)
> >>> + return -EINVAL;
> >>> +
> >>> + switch (hdr->flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
> >>> + case VFIO_IRQ_SET_DATA_NONE:
> >>> + size = 0;
> >>> + break;
> >>> + case VFIO_IRQ_SET_DATA_BOOL:
> >>> + size = sizeof(uint8_t);
> >>> + break;
> >>> + case VFIO_IRQ_SET_DATA_EVENTFD:
> >>> + size = sizeof(int32_t);
> >>> + break;
> >>> + default:
> >>> + return -EINVAL;
> >>> + }
> >>> +
> >>> + if (size) {  
> >>
> >> The whole branch would even work for size == 0.
> >>  
> > 
> > In that case below check (!data_size) might result in error if data_size
> > == NULL, whereas its not error case when size == 0, i.e.
> > VFIO_IRQ_SET_DATA_NONE flag set.
> >   
> >>> + if (hdr->argsz - minsz < hdr->count * size)
> >>> + return -EINVAL;
> >>> +
> >>> + if (!data_size)
> >>> + return -EINVAL;  
> >>
> >> Redundant check as well.
> >>  
> > 
> > This is not redundant. If you see above check, it sets its init value to
> > 0 but doesn't fail.
> >   
> >>> +
> >>> + *data_size = hdr->count * size;
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}  
> >>
> >> It does not really prepare anything as the name suggests. It looks like
> >> this is 2 different helpers actually:
> >>
> >> int vfio_set_irqs_validate()
> >> and
> >> size_t vfio_set_irqs_hdr_to_data_size()
> >>  
> > 
> > Later one is the prepare.  
> 
> 
> Does not like it prepares anything, just a simple converter.
> 
> 
> >> And it would make it easier to review/bisect if 16/22 and 17/22 were merged
> >> into this one as this patch alone adds new code which it does not use and
> >> all 3 patches are fairly small.
> >>  
> > 
> > I do had all 3 patch merged in one in earlier version of patchset. This
> > is split as per Alex's suggestion.  
> 
> I got this from another mail from Alex. Which I find strange but whatever,
> this is his realm anyway :)

Maybe you haven't noticed, but your patch series are often difficult to
deal with, they almost always split across functional areas and
maintainers.  Splitting out code to common functions and _then_
updating the callers to make use of it is a common way to deal with
that.  We're in the same functional area here, but it's still
good practice.  Thanks,

Alex

Re: [Qemu-devel] [PATCH v11 02/22] vfio: VFIO based driver for Mediated devices

2016-11-08 Thread Dong Jia Shi

* Kirti Wankhede  [2016-11-05 02:40:36 +0530]:

Hi Kirti,

> vfio_mdev driver registers with mdev core driver.
> mdev core driver creates mediated device and calls probe routine of
> vfio_mdev driver for each device.
> Probe routine of vfio_mdev driver adds mediated device to VFIO core module
> 
> This driver forms a shim layer that pass through VFIO devices operations
> to vendor driver for mediated devices.
> 
> Signed-off-by: Kirti Wankhede 
> Signed-off-by: Neo Jia 
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig |   9 ++-
>  drivers/vfio/mdev/Makefile|   1 +
>  drivers/vfio/mdev/vfio_mdev.c | 148 
> ++
>  3 files changed, 157 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 303c14ce2847..79c9cface7b1 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -5,6 +5,13 @@ config VFIO_MDEV
>  default n
>  help
>  Provides a framework to virtualize devices.
> - See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> + See Documentation/vfio-mediated-device.txt for more details.
So patch #01 has a wrong doc path.

> 
>  If you don't know what do here, say N.
> +
> +config VFIO_MDEV_DEVICE
> +tristate "VFIO support for Mediated devices"
> +depends on VFIO && VFIO_MDEV
> +default n
> +help
> +VFIO based driver for mediated devices.
I just think the names of the config entries here are a bit strange, but
I'm not sure if there is a better way. Maybe (?):
s/VFIO_MDEV/VFIO_MDEV_SUPPORT/
s/VFIO_MDEV_DEVICE/VFIO_MDEV/

> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 31bc04801d94..fa2d5ea466ee 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,3 +2,4 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> 
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
> diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
[...]

> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
Should this be "GPL V2"?

> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> -- 
> 2.7.0
> 

-- 
Dong Jia

Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Dave Young

Dropping qiaonuohan, the mail address seems to be wrong.

On 11/09/16 at 11:01am, Dave Young wrote:
> Hi,
> 
> Latest linux kernel enabled kaslr to randomiz phys/virt memory
> addresses, we had some effort to support kexec/kdump so that crash
> utility can still works in case crashed kernel has kaslr enabled.
> 
> But according to Dave Anderson virsh dump does not work, quoted messages
> from Dave below:
> 
> """
> with virsh dump, there's no way of even knowing that KASLR
> has randomized the kernel __START_KERNEL_map region, because there is no
> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
> vmcoreinfo data to compare against the vmlinux file symbol value.
> Unless virsh dump can export some basic virtual memory data, which
> they say it can't, I don't see how KASLR can ever be supported.
> """
> 
> I assume virsh dump is using qemu guest memory dump facility so it
> should be first addressed in qemu. Thus post this query to qemu devel
> list. If this is not correct please let me know.
> 
> Could you qemu dump people make it work? Or we can not support virt dump
> as long as KASLR being enabled. Latest Fedora kernel has enabled it in x86_64.
> 
> Thanks
> Dave

Re: [Qemu-devel] [PATCH v11 15/22] vfio: Introduce vfio_set_irqs_validate_and_prepare()

2016-11-08 Thread Alexey Kardashevskiy

On 09/11/16 07:22, Kirti Wankhede wrote:
> 
> 
> On 11/8/2016 2:16 PM, Alexey Kardashevskiy wrote:
>> On 05/11/16 08:10, Kirti Wankhede wrote:
>>> Vendor driver using mediated device framework would use same mechnism to
>>> validate and prepare IRQs. Introducing this function to reduce code
>>> replication in multiple drivers.
>>>
>>> Signed-off-by: Kirti Wankhede 
>>> Signed-off-by: Neo Jia 
>>> Change-Id: Ie201f269dda0713ca18a07dc4852500bd8b48309
>>> ---
>>>  drivers/vfio/vfio.c  | 48 
>>>  include/linux/vfio.h |  4 
>>>  2 files changed, 52 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>> index 9a03be0942a1..ed2361e4b904 100644
>>> --- a/drivers/vfio/vfio.c
>>> +++ b/drivers/vfio/vfio.c
>>> @@ -1858,6 +1858,54 @@ int vfio_info_add_capability(struct vfio_info_cap 
>>> *caps, int cap_type_id,
>>>  }
>>>  EXPORT_SYMBOL(vfio_info_add_capability);
>>>  
>>> +int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int 
>>> num_irqs,
>>> +  int max_irq_type, size_t *data_size)
>>> +{
>>> +   unsigned long minsz;
>>> +   size_t size;
>>> +
>>> +   minsz = offsetofend(struct vfio_irq_set, count);
>>> +
>>> +   if ((hdr->argsz < minsz) || (hdr->index >= max_irq_type) ||
>>> +   (hdr->count >= (U32_MAX - hdr->start)) ||
>>> +   (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
>>> +   VFIO_IRQ_SET_ACTION_TYPE_MASK)))
>>> +   return -EINVAL;
>>> +
>>> +   if (data_size)
>>
>> Pointless check, the callers will pass non null pointer with value
>> initialized to 0 anyway.
>>
> 
> Not always, When VFIO_IRQ_SET_DATA_NONE flag is set, caller can pass
> data_size = NULL.


Today data_size is never NULL, and the way it is used now (ioctl
VFIO_DEVICE_SET_IRQS) suggests to me that this is not going to change.

> 
>>
>>> +   *data_size = 0;
>>> +
>>> +   if (hdr->start >= num_irqs || hdr->start + hdr->count > num_irqs)
>>> +   return -EINVAL;
>>> +
>>> +   switch (hdr->flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
>>> +   case VFIO_IRQ_SET_DATA_NONE:
>>> +   size = 0;
>>> +   break;
>>> +   case VFIO_IRQ_SET_DATA_BOOL:
>>> +   size = sizeof(uint8_t);
>>> +   break;
>>> +   case VFIO_IRQ_SET_DATA_EVENTFD:
>>> +   size = sizeof(int32_t);
>>> +   break;
>>> +   default:
>>> +   return -EINVAL;
>>> +   }
>>> +
>>> +   if (size) {
>>
>> The whole branch would even work for size == 0.
>>
> 
> In that case below check (!data_size) might result in error if data_size
> == NULL, whereas its not error case when size == 0, i.e.
> VFIO_IRQ_SET_DATA_NONE flag set.
> 
>>> +   if (hdr->argsz - minsz < hdr->count * size)
>>> +   return -EINVAL;
>>> +
>>> +   if (!data_size)
>>> +   return -EINVAL;
>>
>> Redundant check as well.
>>
> 
> This is not redundant. If you see above check, it sets its init value to
> 0 but doesn't fail.
> 
>>> +
>>> +   *data_size = hdr->count * size;
>>> +   }
>>> +
>>> +   return 0;
>>> +}
>>
>> It does not really prepare anything as the name suggests. It looks like
>> this is 2 different helpers actually:
>>
>> int vfio_set_irqs_validate()
>> and
>> size_t vfio_set_irqs_hdr_to_data_size()
>>
> 
> Later one is the prepare.


It does not look like it prepares anything; it is just a simple converter.
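
For illustration only, the split suggested above could look roughly like the
userspace sketch below. This is not the code from the patch: the struct and
the flag constants are local stand-ins that mirror the VFIO uapi layout so the
example builds outside the kernel, and the names irq_set_validate(),
irq_set_hdr_to_data_size() and irq_data_elem_size() are hypothetical. The
checks themselves are the ones quoted from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the fixed part of struct vfio_irq_set (assumption:
 * same field order as the uapi header, without the trailing data[]). */
struct irq_set_hdr {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t start;
    uint32_t count;
};

/* Local stand-ins for the VFIO_IRQ_SET_* flag bits. */
#define IRQ_SET_DATA_NONE        (1u << 0)
#define IRQ_SET_DATA_BOOL        (1u << 1)
#define IRQ_SET_DATA_EVENTFD     (1u << 2)
#define IRQ_SET_DATA_TYPE_MASK   (IRQ_SET_DATA_NONE | IRQ_SET_DATA_BOOL | \
                                  IRQ_SET_DATA_EVENTFD)
#define IRQ_SET_ACTION_TYPE_MASK ((1u << 3) | (1u << 4) | (1u << 5))

/* Per-element payload size for the data type selected in flags. */
int irq_data_elem_size(uint32_t flags, size_t *size)
{
    switch (flags & IRQ_SET_DATA_TYPE_MASK) {
    case IRQ_SET_DATA_NONE:
        *size = 0;
        return 0;
    case IRQ_SET_DATA_BOOL:
        *size = sizeof(uint8_t);
        return 0;
    case IRQ_SET_DATA_EVENTFD:
        *size = sizeof(int32_t);
        return 0;
    default:
        return -EINVAL; /* zero or several data type bits set */
    }
}

/* Hypothetical helper 1: validation only, no output parameter. */
int irq_set_validate(const struct irq_set_hdr *hdr, uint32_t num_irqs,
                     uint32_t max_irq_type)
{
    size_t minsz = sizeof(*hdr);
    size_t size;

    if (hdr->argsz < minsz || hdr->index >= max_irq_type ||
        hdr->count >= UINT32_MAX - hdr->start ||
        (hdr->flags & ~(IRQ_SET_DATA_TYPE_MASK | IRQ_SET_ACTION_TYPE_MASK)))
        return -EINVAL;

    if (hdr->start >= num_irqs || hdr->start + hdr->count > num_irqs)
        return -EINVAL;

    if (irq_data_elem_size(hdr->flags, &size))
        return -EINVAL;

    /* The payload announced by count must fit into argsz. */
    if (hdr->argsz - minsz < (size_t)hdr->count * size)
        return -EINVAL;

    return 0;
}

/* Hypothetical helper 2: plain conversion, meant to be called only on a
 * header that already passed irq_set_validate(). */
size_t irq_set_hdr_to_data_size(const struct irq_set_hdr *hdr)
{
    size_t size = 0;

    (void)irq_data_elem_size(hdr->flags, &size);
    return (size_t)hdr->count * size;
}
```

With this shape the NULL question for data_size goes away, because the
converter returns the size instead of writing through a pointer. Whether that
is worth two exported symbols instead of one is a separate question.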


>> And it would make it easier to review/bisect if 16/22 and 17/22 were merged
>> into this one as this patch alone adds new code which it does not use and
>> all 3 patches are fairly small.
>>
> 
> I did have all 3 patches merged into one in an earlier version of the
> patchset. This was split as per Alex's suggestion.

I got this from another mail from Alex. Which I find strange but whatever,
this is his realm anyway :)


> 
>>
>>> +EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
>>
>> Everything you export in this patchset is EXPORT_SYMBOL() while the
>> existing code uses EXPORT_SYMBOL_GPL(), is this for a reason?
>>
>>
> 
> We want these symbols to be available to all drivers.


Right, got it from another mail from Alex as well. Ok, seems all right so
far. A note in the commit log would be useful though.



> 
> Thanks,
> Kirti
> 
>>> +
>>>  /*
>>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>>>   * domain only.
>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>> index cf90393a11e2..87c9afecd822 100644
>>> --- a/include/linux/vfio.h
>>> +++ b/include/linux/vfio.h
>>> @@ -116,6 +116,10 @@ extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
>>>  extern int vfio_info_add_capability(struct vfio_info_cap *caps,
>>> int cap_type_id, void *cap_type);
>>>  
>>> +extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>>> + int num_irqs, int max_irq_type,
>>> +

[Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Dave Young

Hi,

The latest Linux kernel enables KASLR to randomize physical/virtual memory
addresses. We have made some effort to support kexec/kdump so that the crash
utility still works when the crashed kernel has KASLR enabled.

But according to Dave Anderson, virsh dump does not work. His message is
quoted below:

"""
with virsh dump, there's no way of even knowing that KASLR
has randomized the kernel __START_KERNEL_map region, because there is no
virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
vmcoreinfo data to compare against the vmlinux file symbol value.
Unless virsh dump can export some basic virtual memory data, which
they say it can't, I don't see how KASLR can ever be supported.
"""

I assume virsh dump uses the QEMU guest memory dump facility, so this
should be addressed in QEMU first. Hence I am posting this query to the
qemu-devel list. If this is not correct, please let me know.

Could the QEMU dump people make it work? Otherwise we cannot support virt
dump as long as KASLR is enabled. The latest Fedora kernel has enabled it
on x86_64.

Thanks
Dave

Re: [Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer

2016-11-08 Thread Fam Zheng

On Tue, 11/08 16:52, Eric Blake wrote:
> Commit 443668ca rewrote the write_zeroes logic to guarantee that
> an unaligned request never crosses a cluster boundary.  But
> in the rewrite, the new code assumed that at most one iteration
> would be needed to get to an alignment boundary.
> 
> However, it is easy to trigger an assertion failure: the Linux
> kernel limits loopback devices to advertise a max_transfer of
> only 64k.  Any operation that requires falling back to writes
> rather than more efficient zeroing must obey max_transfer during
> that fallback, which means an unaligned head may require multiple
> iterations of the write fallbacks before reaching the aligned
> boundaries, when layering a format with clusters larger than 64k
> atop the protocol of file access to a loopback device.
> 
> Test case:
> 
> $ qemu-img create -f qcow2 -o cluster_size=1M file 10M
> $ losetup /dev/loop2 /path/to/file
> $ qemu-io -f qcow2 /dev/loop2
> qemu-io> w 7m 1k
> qemu-io> w -z 8003584 2093056
> 
> In fairness to Denis (as the original listed author of the culprit
> commit), the faulty logic for at most one iteration is probably all
> my fault in reworking his idea.  But the solution is to restore what
> was in place prior to that commit: when dealing with an unaligned
> head or tail, iterate as many times as necessary while fragmenting
> the operation at max_transfer boundaries.
> 
> CC: qemu-sta...@nongnu.org
> CC: Ed Swierk 
> CC: Denis V. Lunev 
> Signed-off-by: Eric Blake 
> ---
>  block/io.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index aa532a5..085ac34 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1214,6 +1214,8 @@ static int coroutine_fn 
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>  int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
>  int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
>  bs->bl.request_alignment);
> +int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> +MAX_WRITE_ZEROES_BOUNCE_BUFFER);
> 
>  assert(alignment % bs->bl.request_alignment == 0);
>  head = offset % alignment;
> @@ -1229,9 +1231,12 @@ static int coroutine_fn 
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>   * boundaries.
>   */
>  if (head) {
> -/* Make a small request up to the first aligned sector.  */
> -num = MIN(count, alignment - head);
> -head = 0;
> +/* Make a small request up to the first aligned sector. For
> + * convenience, limit this request to max_transfer even if
> + * we don't need to fall back to writes.  */
> +num = MIN(MIN(count, max_transfer), alignment - head);
> +head = (head + num) % alignment;
> +assert(num < max_write_zeroes);
>  } else if (tail && num > alignment) {
>  /* Shorten the request to the last aligned sector.  */
>  num -= tail;
> @@ -1257,8 +1262,6 @@ static int coroutine_fn 
> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> 
>  if (ret == -ENOTSUP) {
>  /* Fall back to bounce buffer if write zeroes is unsupported */
> -int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> -MAX_WRITE_ZEROES_BOUNCE_BUFFER);
>  BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
> 
>  if ((flags & BDRV_REQ_FUA) &&
> -- 
> 2.7.4
> 

Reviewed-by: Fam Zheng
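
For readers following along, here is a small standalone sketch of the head
fragmentation described in the commit message. It is not the QEMU code: the
function name count_head_requests() and its parameters are made up for this
example, and it only reproduces the num/head arithmetic from the hunk above.
It assumes the zeroing alignment equals the 1M cluster size from the test
case and that max_transfer is the 64k advertised by the loop device.

```c
#include <stdint.h>

static int64_t min64(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/*
 * Count how many requests are needed to cover the unaligned head of a
 * zeroing request when every request is capped at max_transfer.
 * The number of bytes covered by those requests is stored in *head_bytes.
 */
int count_head_requests(int64_t offset, int64_t count, int64_t alignment,
                        int64_t max_transfer, int64_t *head_bytes)
{
    int64_t head = offset % alignment;
    int requests = 0;

    *head_bytes = 0;
    while (head && count > 0) {
        /* Same computation as the patched code: never cross the next
         * alignment boundary and never exceed max_transfer. */
        int64_t num = min64(min64(count, max_transfer), alignment - head);

        head = (head + num) % alignment; /* becomes 0 at the boundary */
        count -= num;
        *head_bytes += num;
        requests++;
    }
    return requests;
}
```

Feeding in the values from the test case (offset 8003584, count 2093056,
alignment 1M, max_transfer 64k), the loop runs six times and covers 385024
bytes before offset 8M is reached, which is the situation the single-iteration
assumption did not account for.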

Re: [Qemu-devel] [PATCH v11 13/22] vfio: Introduce common function to add capabilities

2016-11-08 Thread Alexey Kardashevskiy

On 09/11/16 08:42, Alex Williamson wrote:
> On Wed, 9 Nov 2016 02:16:17 +0530
> Kirti Wankhede  wrote:
> 
>> On 11/8/2016 12:59 PM, Alexey Kardashevskiy wrote:
>>> On 05/11/16 08:10, Kirti Wankhede wrote:  
>>>> Vendor driver using mediated device framework should use
>>>> vfio_info_add_capability() to add capabilities.
>>>> Introduced this function to reduce code duplication in vendor drivers.
>>>>
>>>> Signed-off-by: Kirti Wankhede
>>>> Signed-off-by: Neo Jia
>>>> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
>>>> ---
>>>>  drivers/vfio/vfio.c  | 60
>>>> +++-
>>>>  include/linux/vfio.h |  3 +++
>>>>  2 files changed, 62 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index 4ed1a6a247c6..9a03be0942a1 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -1797,8 +1797,66 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>>>>  	for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next - offset)
>>>>  		tmp->next += offset;
>>>>  }
>>>> -EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>>>> +EXPORT_SYMBOL(vfio_info_cap_shift);
>>>
>>>
>>> Why this change?
>>>
>>>   
>>
>> We want this symbol to be available to all drivers.
> 
> IOW, from proprietary drivers.  It makes me uncomfortable how many
> non-GPL symbols we're adding (or converting) in this effort, but I'm
> trying to look objectively at every export as to whether a non-GPL
> caller of the function is legitimately separate from in-kernel code.
> For instance are they making use of data structures intrinsic to GPL'd
> code.  In this case we're converting a symbol that's just manipulating
> a data buffer to add an offset to each element in a chain.  The entries
> are documented in a uapi header.  Kirti asked me about this one, and I
> couldn't find any basis to raise an objection.  If you spot any reason
> that any of the export symbols in these series really should be GPL,
> please raise the issue.
> 
  
>>>> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
>>>> +{
>>>> +	struct vfio_info_cap_header *header;
>>>> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
>>>> +	size_t size;
>>>> +
>>>> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
>>>> +	header = vfio_info_cap_add(caps, size,
>>>> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
>>>> +	if (IS_ERR(header))
>>>> +		return PTR_ERR(header);
>>>> +
>>>> +	sparse_cap = container_of(header,
>>>> +			struct vfio_region_info_cap_sparse_mmap, header);
>>>> +	sparse_cap->nr_areas = sparse->nr_areas;
>>>> +	memcpy(sparse_cap->areas, sparse->areas,
>>>> +	       sparse->nr_areas * sizeof(*sparse->areas));
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
>>>> +{
>>>> +	struct vfio_info_cap_header *header;
>>>> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
>>>> +
>>>> +	header = vfio_info_cap_add(caps, sizeof(*cap),
>>>> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
>>>> +	if (IS_ERR(header))
>>>> +		return PTR_ERR(header);
>>>> +
>>>> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
>>>> +				header);
>>>> +	type_cap->type = cap->type;
>>>> +	type_cap->subtype = cap->subtype;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
>>>> +			     void *cap_type)
>>>> +{
>>>> +	int ret = -EINVAL;
>>>> +
>>>> +	if (!cap_type)
>>>> +		return 0;
>>>> +
>>>> +	switch (cap_type_id) {
>>>> +	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
>>>> +		ret = sparse_mmap_cap(caps, cap_type);
>>>> +		break;
>>>> +
>>>> +	case VFIO_REGION_INFO_CAP_TYPE:
>>>> +		ret = region_type_cap(caps, cap_type);
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(vfio_info_add_capability);
>>>>
>>>>  /*
>>>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index dcda8fccefab..cf90393a11e2 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -113,6 +113,9 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
>>>>  		struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
>>>>  extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
>>>>
>>>> +extern int vfio_info_add_capability(struct vfio_info_cap *caps,
>>>> +				    int cap_type_id, void *cap_type);
>>>> +
>>>
>>>
>>> It would make it easier to review and bisect if 14/22 was squashed into
>>> this one.   
>>
>> This was split based on Alex's suggestion on earlier

Re: [Qemu-devel] [PATCH] boot-serial-test: Add a test for the powernv machine

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 02:05:35PM +0100, Cédric Le Goater wrote:
> On 11/08/2016 01:36 PM, Thomas Huth wrote:
> > The new powernv machine ships with a firmware that outputs
> > some text to the serial console, so we can automatically
> > test this machine type in the boot-serial tester, too.
> > And to get some (very limited) test coverage for the new
> > POWER9 CPU emulation, too, this test is also started with
> > "-cpu POWER9".
> 
> and we see the minimum :
> 
>   [8450016,6] CPU: P9 generation processor(max 4 threads/core)
> 
> Reviewed-by: Cédric Le Goater 
> 
> 
> 
> With very minimal changes (definition of some SPRs and the use 
> of the SHV mode), the guest would load the kernel.

Applied to ppc-for-2.8.  Good to have this basic smoke test for
powernv.


> 
> Thanks,
> 
> C.
> 
> 
> > Signed-off-by: Thomas Huth 
> > ---
> >  tests/boot-serial-test.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/tests/boot-serial-test.c b/tests/boot-serial-test.c
> > index d98c564..44c82e5 100644
> > --- a/tests/boot-serial-test.c
> > +++ b/tests/boot-serial-test.c
> > @@ -29,6 +29,7 @@ static testdef_t tests[] = {
> >  { "ppc64", "ppce500", "", "U-Boot" },
> >  { "ppc64", "prep", "", "Open Hack'Ware BIOS" },
> >  { "ppc64", "pseries", "", "Open Firmware" },
> > +{ "ppc64", "powernv", "-cpu POWER9", "SkiBoot" },
> >  { "i386", "isapc", "-cpu qemu32 -device sga", "SGABIOS" },
> >  { "i386", "pc", "-device sga", "SGABIOS" },
> >  { "i386", "q35", "-device sga", "SGABIOS" },
> > 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH v5 0/4] POWER9 TCG enablements - BCD functions part I

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 02:50:21PM -0200, Jose Ricardo Ziviani wrote:
> v5:
>  - reuse bcd_cmp_zero function
>  - improve zoned loop by using one index
> 
> v4:
>  - throws invalid for any instruction not implemented by default
>  - creates a function to compare bcd value to zero
> 
> v3:
>  - generates invalid instruction exception when opc4 is not handled/invalid
>  - changes get_national ret. type to uint16_t to handle invalid encoding
>  - small improvements
> 
> v2:
>  - implements all fixes and suggestions
> 
> This series contains 4 new instructions for POWER9 ISA3.0

Applied to ppc-for-2.8.  Note that there was a conflict in
vmx-ops.inc.c.  It was pretty simple to resolve, but in future please
try to make sure your series are rebased on top of the latest
ppc-for-2.8.

> 
>  bcdcfn.: Decimal Convert From National
>  bcdctn.: Decimal Convert To National
>  bcdcfz.: Decimal Convert From Zoned
>  bcdctz.: Decimal Convert to Zoned
> 
> 
> Jose Ricardo Ziviani (4):
>   target-ppc: Implement bcdcfn. instruction
>   target-ppc: Implement bcdctn. instruction
>   target-ppc: Implement bcdcfz. instruction
>   target-ppc: Implement bcdctz. instruction
> 
>  target-ppc/helper.h |   4 +
>  target-ppc/int_helper.c | 186 
> 
>  target-ppc/translate/vmx-impl.inc.c |  73 ++
>  target-ppc/translate/vmx-ops.inc.c  |   4 +-
>  4 files changed, 265 insertions(+), 2 deletions(-)
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC 11/17] ppc: Add ppc_set_compat_all()

2016-11-08 Thread Alexey Kardashevskiy

On 08/11/16 16:18, David Gibson wrote:
> On Fri, Nov 04, 2016 at 03:01:40PM +1100, Alexey Kardashevskiy wrote:
>> On 30/10/16 22:12, David Gibson wrote:
>>> Once a compatibility mode is negotiated with the guest,
>>> h_client_architecture_support() uses run_on_cpu() to update each CPU to
>>> the new mode.  We're going to want this logic somewhere else shortly,
>>> so make a helper function to do this global update.
>>>
>>> We put it in target-ppc/compat.c - it makes as much sense at the CPU level
>>> as it does at the machine level.  We also move the cpu_synchronize_state()
>>> into ppc_set_compat(), since it doesn't really make any sense to call that
>>> without synchronizing state.
>>>
>>> Signed-off-by: David Gibson 
>>> ---
>>>  hw/ppc/spapr_hcall.c | 31 +--
>>>  target-ppc/compat.c  | 36 
>>>  target-ppc/cpu.h |  3 +++
>>>  3 files changed, 44 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>> index 3bd6d06..4eaf9a6 100644
>>> --- a/hw/ppc/spapr_hcall.c
>>> +++ b/hw/ppc/spapr_hcall.c
>>> @@ -881,20 +881,6 @@ static target_ulong h_set_mode(PowerPCCPU *cpu, 
>>> sPAPRMachineState *spapr,
>>>  return ret;
>>>  }
>>>  
>>> -typedef struct {
>>> -uint32_t compat_pvr;
>>> -Error *err;
>>> -} SetCompatState;
>>> -
>>> -static void do_set_compat(CPUState *cs, void *arg)
>>> -{
>>> -PowerPCCPU *cpu = POWERPC_CPU(cs);
>>> -SetCompatState *s = arg;
>>> -
>>> -cpu_synchronize_state(cs);
>>> -ppc_set_compat(cpu, s->compat_pvr, &s->err);
>>> -}
>>> -
>>>  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>sPAPRMachineState *spapr,
>>>target_ulong opcode,
>>> @@ -902,7 +888,6 @@ static target_ulong 
>>> h_client_architecture_support(PowerPCCPU *cpu,
>>>  {
>>>  target_ulong list = ppc64_phys_to_real(args[0]);
>>>  target_ulong ov_table;
>>> -CPUState *cs;
>>>  bool explicit_match = false; /* Matched the CPU's real PVR */
>>>  uint32_t max_compat = cpu->max_compat;
>>>  uint32_t best_compat = 0;
>>> @@ -949,18 +934,12 @@ static target_ulong 
>>> h_client_architecture_support(PowerPCCPU *cpu,
>>>  
>>>  /* Update CPUs */
>>>  if (cpu->compat_pvr != best_compat) {
>>> -CPU_FOREACH(cs) {
>>> -SetCompatState s = {
>>> -.compat_pvr = best_compat,
>>> -.err = NULL,
>>> -};
>>> +Error *local_err = NULL;
>>>  
>>> -run_on_cpu(cs, do_set_compat, &s);
>>> -
>>> -if (s.err) {
>>> -error_report_err(s.err);
>>> -return H_HARDWARE;
>>> -}
>>> +ppc_set_compat_all(best_compat, &local_err);
>>> +if (local_err) {
>>> +error_report_err(local_err);
>>> +return H_HARDWARE;
>>>  }
>>>  }
>>>  
>>> diff --git a/target-ppc/compat.c b/target-ppc/compat.c
>>> index 1059555..0b12b58 100644
>>> --- a/target-ppc/compat.c
>>> +++ b/target-ppc/compat.c
>>> @@ -124,6 +124,8 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
>>> compat_pvr, Error **errp)
>>>  pcr = compat->pcr;
>>>  }
>>>  
>>> +cpu_synchronize_state(CPU(cpu));
>>> +
>>>  cpu->compat_pvr = compat_pvr;
>>>  env->spr[SPR_PCR] = pcr & pcc->pcr_mask;
>>>  
>>> @@ -136,6 +138,40 @@ void ppc_set_compat(PowerPCCPU *cpu, uint32_t 
>>> compat_pvr, Error **errp)
>>>  }
>>>  }
>>>  
>>> +#if !defined(CONFIG_USER_ONLY)
>>> +typedef struct {
>>> +uint32_t compat_pvr;
>>> +Error *err;
>>> +} SetCompatState;
>>> +
>>> +static void do_set_compat(CPUState *cs, void *arg)
>>> +{
>>> +PowerPCCPU *cpu = POWERPC_CPU(cs);
>>> +SetCompatState *s = arg;
>>> +
>>> +ppc_set_compat(cpu, s->compat_pvr, &s->err);
>>> +}
>>> +
>>> +void ppc_set_compat_all(uint32_t compat_pvr, Error **errp)
>>> +{
>>> +CPUState *cs;
>>> +
>>> +CPU_FOREACH(cs) {
>>> +SetCompatState s = {
>>> +.compat_pvr = compat_pvr,
>>> +.err = NULL,
>>> +};
>>> +
>>> +run_on_cpu(cs, do_set_compat, &s);
>>> +
>>> +if (s.err) {
>>> +error_propagate(errp, s.err);
>>> +return;
>>> +}
>>> +}
>>> +}
>>> +#endif
>>> +
>>>  int ppc_compat_max_threads(PowerPCCPU *cpu)
>>>  {
>>>  const CompatInfo *compat = compat_by_pvr(cpu->compat_pvr);
>>> diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
>>> index 91e8be8..201a655 100644
>>> --- a/target-ppc/cpu.h
>>> +++ b/target-ppc/cpu.h
>>> @@ -1317,6 +1317,9 @@ static inline int cpu_mmu_index (CPUPPCState *env, 
>>> bool ifetch)
>>>  bool ppc_check_compat(PowerPCCPU *cpu, uint32_t compat_pvr,
>>>uint32_t min_compat_pvr, uint32_t max_compat_pvr);
>>>  void ppc_set_compat(PowerPCCPU *cpu, uint32_t compat_pvr, Error **errp);
>>> +#if

Re: [Qemu-devel] [PATCH v13 1/2] virtio-crypto: Add virtio crypto device specification

2016-11-08 Thread Gonglei (Arei)

Hi,

> 
> [..]
> > +The header is the general header and the union is of the algorithm-specific
> type,
> > +which is set by the driver. All properties in the union are shown as 
> > follows.
> > +
> > +There is a unified idata structure for all symmetric algorithms, including
> CIPHER, HASH, MAC, and AEAD.
> > +
> > +The structure is defined as follows:
> > +
> > +\begin{lstlisting}
> > +struct virtio_crypto_sym_input {
> > +/* Destination data guest address, it's useless for plain HASH and MAC
> */
> > +le64 dst_data_addr;
> > +/* Digest result guest address, it's useless for plain cipher algos */
> > +le64 digest_result_addr;
> > +
> > +le32 status;
> > +le32 padding;
> > +};
> > +
> 
> This seems to be out of sync regarding the code (e.g. can't find it in
> virtio-crypto.h of the series linked in the cover-letter). It seems to me
> this reflects v4 since the stuff is gone since v5 (qemu code).
> 
> > +\end{lstlisting}
> > +
> > +\subsubsection{HASH Service Operation}\label{sec:Device Types / Crypto
> Device / Device Operation / HASH Service Operation}
> > +
> > +\begin{lstlisting}
> > +struct virtio_crypto_hash_para {
> > +/* length of source data */
> > +le32 src_data_len;
> > +/* hash result length */
> > +le32 hash_result_len;
> > +};
> > +
> > +struct virtio_crypto_hash_input {
> > +struct virtio_crypto_sym_input input;
> > +};
> > +
> > +struct virtio_crypto_hash_output {
> > +/* source data guest address */
> > +le64 src_data_addr;
> > +};
> > +
> > +struct virtio_crypto_hash_data_req {
> > +/* Device-readable part */
> > +struct virtio_crypto_hash_para para;
> > +struct virtio_crypto_hash_output odata;
> > +/* Device-writable part */
> > +struct virtio_crypto_hash_input idata;
> > +};
> > +\end{lstlisting}
> > +
> > +Each data request uses virtio_crypto_hash_data_req structure to store
> information
> > +used to run the HASH operations. The request only occupies one entry
> > +in the Vring Descriptor Table in the virtio crypto device's dataq, which
> improves
> > +the throughput of data transmitted for the HASH service, so that the virtio
> crypto
> > +device can be better accelerated.
> > +
> > +The information includes the source data guest physical address stored by
> \field{odata}.\field{src_data_addr},
> > +length of source data stored by \field{para}.\field{src_data_len}, and the
> digest result guest physical address
> > +stored by \field{digest_result_addr} used to save the results of the HASH
> operations.
> > +The address and length can determine exclusive content in the guest
> memory.
> > +
> 
> Thus this does not make any sense to me. Furthermore the problem seems to
> persist across the specification. Thus in my opinion there is no point in
> reviewing this version. Or am I missing something here? In case I'm not
> missing anything and the spec describes something quite outdated when
> should we expect a new version of the spec?
> 
Nope. Actually, I kept those descriptions here because I wanted to represent
each packet intuitively; otherwise I don't know how to explain that they
occupy only one entry in the descriptor table via the indirect table method.
So I changed the code completely as per Stefan's suggestion and revised the
spec only a little.

This is just a representation of the implementation so that people can easily
understand the components of a virtio crypto request. It isn't completely the
same as the code. For the virtio-scsi device, struct virtio_scsi_req_cmd is
also used this way, IIUR.

Thanks,
-Gonglei

Re: [Qemu-devel] [PATCH v11 01/22] vfio: Mediated device Core driver

2016-11-08 Thread Dong Jia Shi

* Kirti Wankhede  [2016-11-09 02:36:12 +0530]:

[...]
> >> +/*
> >> + * mdev_register_device : Register a device
> >> + * @dev: device structure representing parent device.
> >> + * @ops: Parent device operation structure to be registered.
> >> + *
> >> + * Add device to list of registered parent devices.
> >> + * Returns a negative value on error, otherwise 0.
> >> + */
> >> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> >> +{
> >> +  int ret;
> >> +  struct parent_device *parent;
> >> +
> >> +  /* check for mandatory ops */
> >> +  if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> >> +  return -EINVAL;
> >> +
> >> +  dev = get_device(dev);
> >> +  if (!dev)
> >> +  return -EINVAL;
> >> +
> >> +  mutex_lock(&parent_list_lock);
> >> +
> >> +  /* Check for duplicate */
> >> +  parent = __find_parent_device(dev);
> >> +  if (parent) {
> >> +  ret = -EEXIST;
> >> +  goto add_dev_err;
> >> +  }
> >> +
> >> +  parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> >> +  if (!parent) {
> >> +  ret = -ENOMEM;
> >> +  goto add_dev_err;
> >> +  }
> >> +
> >> +  kref_init(&parent->ref);
> >> +  mutex_init(&parent->lock);
> >> +
> >> +  parent->dev = dev;
> >> +  parent->ops = ops;
> >> +
> >> +  if (!mdev_bus_compat_class) {
> >> +  mdev_bus_compat_class = class_compat_register("mdev_bus");
> >> +  if (!mdev_bus_compat_class) {
> >> +  ret = -ENOMEM;
> >> +  goto add_dev_err;
> >> +  }
> >> +  }
> >> +
> >> +  ret = parent_create_sysfs_files(parent);
> >> +  if (ret)
> >> +  goto add_dev_err;
> >> +
> >> +  ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
> >> +  if (ret)
> >> +  dev_warn(dev, "Failed to create compatibility class link\n");
> >> +
> >> +  list_add(&parent->next, &parent_list);
> >> +  mutex_unlock(&parent_list_lock);
> >> +
> >> +  dev_info(dev, "MDEV: Registered\n");
> >> +  return 0;
> >> +
> >> +add_dev_err:
> >> +  mutex_unlock(&parent_list_lock);
> >> +  if (parent)
> >> +  mdev_put_parent(parent);
> > Why do this? I don't find the place that you call mdev_get_parent above.
> > 
> 
>   kref_init(&parent->ref);
> Above increments the ref_count, so mdev_put_parent() should be called if
> anything fails.
> 
> >> +  else
> >> +  put_device(dev);
> > Shouldn't we always do this?
> > 
> 
> When mdev_put_parent() is called, its release function do this. So if
> mdev_put_parent() is called, we don't need this.
Sorry for missing that. Thanks for the explanation!
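
To spell out the cleanup rule from the explanation above, here is a toy
userspace sketch. It is an assumption-laden model, not the mdev code: all
toy_* names are invented, the counters are plain ints instead of struct kref,
and the two fail_* arguments only simulate failures. What it keeps is the
shape of the error path: once the parent exists with a count of one, dropping
that reference runs the release function, which puts the device; before the
parent exists, the device has to be put directly.

```c
#include <errno.h>
#include <stdlib.h>

/* Toy stand-ins; none of these are the kernel's types or functions. */
struct toy_device {
    int refcount;
};

struct toy_parent {
    int refcount;
    struct toy_device *dev;
};

void toy_get_device(struct toy_device *dev) { dev->refcount++; }
void toy_put_device(struct toy_device *dev) { dev->refcount--; }

/* Release callback: runs when the last parent reference goes away and is
 * the one place that drops the device reference held by the parent. */
static void toy_parent_release(struct toy_parent *parent)
{
    toy_put_device(parent->dev);
    free(parent);
}

void toy_parent_put(struct toy_parent *parent)
{
    if (--parent->refcount == 0)
        toy_parent_release(parent);
}

/* Registration with the same error-path shape as in the quoted patch. */
int toy_register(struct toy_device *dev, int fail_alloc, int fail_late,
                 struct toy_parent **out)
{
    struct toy_parent *parent = NULL;
    int ret;

    toy_get_device(dev);

    if (!fail_alloc)
        parent = calloc(1, sizeof(*parent));
    if (!parent) {
        ret = -ENOMEM;
        goto err;
    }

    parent->refcount = 1;   /* same starting count that kref_init() sets */
    parent->dev = dev;

    if (fail_late) {        /* simulated failure after the parent exists */
        ret = -EIO;
        goto err;
    }

    *out = parent;
    return 0;

err:
    if (parent)
        toy_parent_put(parent);   /* release also puts the device */
    else
        toy_put_device(dev);      /* no parent yet, so put it directly */
    return ret;
}
```

In both failure cases the device count ends up where it started, and calling
toy_put_device() unconditionally would drop one reference too many in the
second case.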

> 
> >> +  return ret;
> >> +}
> >> +EXPORT_SYMBOL(mdev_register_device);
> >> +

[...]

-- 
Dong Jia

[Qemu-devel] [PATCH v2] docs: add document to explain the usage of vNVDIMM

2016-11-08 Thread Haozhong Zhang

Signed-off-by: Haozhong Zhang 
Reviewed-by: Xiao Guangrong 
Reviewed-by: Stefan Hajnoczi 
---
Changes since v1:
* explicitly state the block window mode is not supported (Stefan Hajnoczi)
* typo fix: label_size ==> label-size (David Alan Gilbert)
---
 docs/nvdimm.txt | 124 
 1 file changed, 124 insertions(+)
 create mode 100644 docs/nvdimm.txt

diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
new file mode 100644
index 0000000..2d9f8c0
--- /dev/null
+++ b/docs/nvdimm.txt
@@ -0,0 +1,124 @@
+QEMU Virtual NVDIMM
+===================
+
+This document explains the usage of virtual NVDIMM (vNVDIMM) feature
+which is available since QEMU v2.6.0.
+
+The current QEMU only implements the persistent memory mode of vNVDIMM
+device and not the block window mode.
+
+Basic Usage
+-----------
+
+The storage of a vNVDIMM device in QEMU is provided by the memory
+backend (i.e. memory-backend-file and memory-backend-ram). A simple
+way to create a vNVDIMM device at startup time is done via the
+following command line options:
+
+ -machine pc,nvdimm
+ -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
+ -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
+ -device nvdimm,id=nvdimm1,memdev=mem1
+
+Where,
+
+ - the "nvdimm" machine option enables vNVDIMM feature.
+
+ - "slots=$N" should be equal to or larger than the total amount of
+   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.
+
+ - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
+   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
+   >= $RAM_SIZE + $NVDIMM_SIZE here.
+
+ - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
+   creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
+   accesses to the virtual NVDIMM device go to the file $PATH.
+
+   "share=on/off" controls the visibility of guest writes. If
+   "share=on", then guest writes will be applied to the backend
+   file. If another guest uses the same backend file with option
+   "share=on", then above writes will be visible to it as well. If
+   "share=off", then guest writes won't be applied to the backend
+   file and thus will be invisible to other guests.
+
+ - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
+   device whose storage is provided by above memory backend device.
+
+Multiple vNVDIMM devices can be created if multiple pairs of "-object"
+and "-device" are provided.
+
+For above command line options, if the guest OS has the proper NVDIMM
+driver, it should be able to detect a NVDIMM device which is in the
+persistent memory mode and whose size is $NVDIMM_SIZE.
+
+Note:
+
+1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
+   backend file size is not equal to the size given by "size" option,
+   QEMU will truncate the backend file by ftruncate(2), which will
+   corrupt the existing data in the backend file, especially for the
+   shrink case.
+
+   QEMU v2.8.0 and later check the backend file size and the "size"
+   option. If they do not match, QEMU will report errors and abort in
+   order to avoid the data corruption.
+
+2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
+   option of memory-backend-file, e.g. 4KB alignment on x86.  However,
+   QEMU v2.7.0 puts an additional alignment requirement, which may
+   require a larger value than the basic one, e.g. 2MB on x86. This
+   change breaks the usage of memory-backend-file that only satisfies
+   the basic alignment.
+
+   QEMU v2.8.0 and later remove the additional alignment on non-s390x
+   architectures, so the broken memory-backend-file can work again.
+
+Label
+-----
+
+QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
+To enable label on vNVDIMM devices, users can simply add
+"label-size=$SZ" option to "-device nvdimm", e.g.
+
+ -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
+
+Note:
+
+1. The minimal label size is 128KB.
+
+2. QEMU v2.7.0 and later store labels at the end of backend storage.
+   If a memory backend file, which was previously used as the backend
+   of a vNVDIMM device without labels, is now used for a vNVDIMM
+   device with label, the data in the label area at the end of file
+   will be inaccessible to the guest. If any useful data (e.g. the
+   meta-data of the file system) was stored there, the latter usage
+   may result in guest data corruption (e.g. breakage of guest file
+   system).
+
+Hotplug
+-------
+
+QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
+devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
+accomplished by two monitor commands "object_add" and "device_add".
+
+For example, the following commands add another 4GB vNVDIMM device to
+the guest:
+
+ (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
+ (qemu)

Re: [Qemu-devel] [PATCH] docs: add document to explain the usage of vNVDIMM

2016-11-08 Thread Haozhong Zhang


On 11/08/16 17:08 +0000, Stefan Hajnoczi wrote:
> On Tue, Nov 08, 2016 at 08:46:14PM +0800, Haozhong Zhang wrote:
> > Signed-off-by: Haozhong Zhang
> > Reviewed-by: Xiao Guangrong
> > ---
> >  docs/nvdimm.txt | 124
> >  1 file changed, 124 insertions(+)
> >  create mode 100644 docs/nvdimm.txt
> >
> > diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
> > new file mode 100644
> > index 0000000..fafca39
> > --- /dev/null
> > +++ b/docs/nvdimm.txt
> > @@ -0,0 +1,124 @@
> > +QEMU Virtual NVDIMM
> > +===================
> > +
> > +This document explains the usage of virtual NVDIMM (vNVDIMM) feature
> > +which is available since QEMU v2.6.0.
> > +
> > +The current QEMU only implements the persistent memory mode of vNVDIMM
> > +device.
>
> "and not the block window mode."
>
> Explicitly naming block window mode would be useful for anyone looking
> through the docs to find out whether this mode is supported or not.

will add in the next version.

Thanks,
Haozhong

[..]

Re: [Qemu-devel] [PATCH] docs: add document to explain the usage of vNVDIMM

2016-11-08 Thread Haozhong Zhang


On 11/08/16 16:50 +, Dr. David Alan Gilbert wrote:

* Haozhong Zhang (haozhong.zh...@intel.com) wrote:

[..]

+Label
+-----
+
+QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
+To enable label on vNVDIMM devices, users can simply add
+"label-size=$SZ" option to "-device nvdimm", e.g.
+
+ -device nvdimm,id=nvdimm1,memdev=mem1,label_size=128K


Is that label-size  rather than label_size ?



label-size. I'll update in the next version.

Thanks,
Haozhong

[..]

Re: [Qemu-devel] [PATCH 4/5] MAINTAINERS: Add an entry for the CHRP NVRAM files

2016-11-08 Thread David Gibson

On Tue, Nov 08, 2016 at 01:17:52PM +0100, Thomas Huth wrote:
> I recently added new files to the source tree that are not
> covered by any maintainer yet -- and since every new source
> file should have a maintainer nowadays, I volunteer to look
> after these files now, too.
> 
> Signed-off-by: Thomas Huth 

Reviewed-by: David Gibson 

> ---
>  MAINTAINERS | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5d8b584..05b1c97 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1093,6 +1093,13 @@ S: Maintained
>  F: hw/core/generic-loader.c
>  F: include/hw/core/generic-loader.h
>  
> +CHRP NVRAM
> +M: Thomas Huth 
> +S: Maintained
> +F: hw/nvram/chrp_nvram.c
> +F: include/hw/nvram/chrp_nvram.h
> +F: tests/prom-env-test.c
> +
>  Subsystems
>  ----------
>  Audio

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



[Qemu-devel] [Bug 1639322] Re: pasting into ppc64 serial console kills qemu

2016-11-08 Thread Michal Suchanek

This is gtk interface.

However, the function on line 40 of spapr_vty.c looks really insane.

It asserts that it is not given more data to put into a ring buffer than
the size of the buffer, and then stuffs all the data in regardless of the
amount of data already present.

It, or one of its callers, should probably loop, but I did not find a
decent comparable piece of code from which to cut and paste whatever
callbacks are needed for the other side to consume the bytes.
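
To make the complaint concrete, here is a minimal sketch of a 16-byte ring
buffer that takes the bytes already queued into account and accepts only
what fits. It is not the spapr_vty.c code: the struct and the vty_sketch_*
names are made up for illustration, and the size of 16 is simply taken from
the assertion in the bug description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VTY_BUF_SIZE 16u   /* taken from the "< 16" in the assertion */

/* Hypothetical stand-in for the device state; not the real QEMU struct. */
struct vty_sketch {
    uint32_t in;                  /* total number of bytes ever queued */
    uint32_t out;                 /* total number of bytes ever consumed */
    uint8_t buf[VTY_BUF_SIZE];
};

/* Free space left in the ring, counting the bytes still queued. */
static size_t vty_sketch_free(const struct vty_sketch *dev)
{
    return VTY_BUF_SIZE - (dev->in - dev->out);
}

/* Queue only as many bytes as fit; return how many were accepted. */
static size_t vty_sketch_put(struct vty_sketch *dev,
                             const uint8_t *data, size_t len)
{
    size_t room, n, i;

    /* Invariant: the ring never holds more than its size. */
    assert(dev->in - dev->out <= VTY_BUF_SIZE);

    room = vty_sketch_free(dev);
    n = len < room ? len : room;
    for (i = 0; i < n; i++) {
        dev->buf[dev->in++ % VTY_BUF_SIZE] = data[i];
    }
    return n;
}

/* Consume up to len queued bytes; return how many were copied out. */
static size_t vty_sketch_get(struct vty_sketch *dev,
                             uint8_t *data, size_t len)
{
    size_t avail = dev->in - dev->out;
    size_t n = len < avail ? len : avail;
    size_t i;

    for (i = 0; i < n; i++) {
        data[i] = dev->buf[dev->out++ % VTY_BUF_SIZE];
    }
    return n;
}
```

A caller that gets back fewer bytes than it offered would have to keep the
rest and retry once the other side has consumed something, which is the
loop mentioned above.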

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1639322

Title:
  pasting into ppc64 serial console kills qemu

Status in QEMU:
  New

Bug description:
  - run qemu-system-ppc64
  - when X window appears press Ctrl+Alt+3
  - paste any text longer than 16 characters

  
  qemu-system-ppc64: /home/abuild/rpmbuild/BUILD/qemu-2.6.1/hw/char/spapr_vty.c:40: vty_receive: Assertion `(dev->in - dev->out) < 16' failed.
  Aborted (core dumped)

  Broken in SUSE Leap 42.2 and git
  4eb28abd52d48657cff6ff45e8dbbbefe4dbb414

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1639322/+subscriptions

Re: [Qemu-devel] Sphinx for QEMU docs? (and a doc-comment format question)

2016-11-08 Thread Paolo Bonzini

On 07/11/2016 14:30, Stefan Hajnoczi wrote:
> On Sat, Nov 05, 2016 at 06:42:23PM +, Peter Maydell wrote:
>> In particular I think we could:
>>  * set up a framework for our in-tree docs/ which gives us a
>>place to put new docs (both for-users and for-developers) --
>>I think having someplace to put things will reduce the barrier
>>to people writing useful new docs
>>  * gradually convert the existing docs to rst
>>  * use the sphinx extension features to pull in the doc-comments
>>we have been fairly consistently writing over the last few years
>>(for instance a converted version of docs/memory.txt could pull
>>in doc comments from memory.h; or we can just write simple
>>wrapper files like a "Bitmap operations" document that
>>displays the doc comments from bitops.h)
> 
> You are suggesting Sphinx for two different purposes:
> 
> 1. Formatting docs/ in HTML, PDF, etc.
> 
> 2. API documentation from doc comments.
> 
> It's a good idea for #1 since we can then publish automated builds of
> the docs.  They will be easy to view and link to in a web browser.
> 
> I'm not a fan of #2.  QEMU is not a C library that people develop
> against and our APIs are not stable.  There is no incentive for pretty
> doc comments.  It might be cool to set it up once but things will
> deteriorate again quickly because we don't actually need external API
> docs.

I don't think pretty doc comments matter, but accurate doc comments do.
If we cannot have accurate doc comments, we might not have them at all,
but this is actually not the case.  There are some areas where we
actually go to great(er) lengths to have up-to-date documentation and
up-to-date doc comments, and it's a pity to only provide half of the
information in an easily consumable format.

It doesn't really have to be perfect, but it's a nice thing to have.
I'm not entirely sure that it's interesting to format bitops.h's doc
comments, but for memory.h or aio.h I'm pretty sure it's worth it.

Paolo

[Qemu-devel] [Bug 1639394] Re: Unable to boot Solaris 8/9 x86 under Fedora 24

2016-11-08 Thread John Snow

So, if I'm reading you right, Solaris10/11 work just fine, but 8/9 don't
-- and have not since qemu version 0.6.0!? From 2004?

I don't have a copy of Solaris9 to test with, so I doubt I can work on
trying to reproduce this. Is there any possibility to reproduce a
problem on an older, freely available BSD?

-- 
https://bugs.launchpad.net/bugs/1639394

Title:
  Unable to boot Solaris 8/9 x86 under Fedora 24

Status in QEMU:
  New

Bug description:
  qemu-system-x86_64 -version
  QEMU emulator version 2.6.2 (qemu-2.6.2-4.fc24), Copyright (c) 2003-2008 Fabrice Bellard

  Tried several ways without success. I think it is a regression, because the
  problem seems to be related to the IDE fixes from 0.6.0:
  - int13 CDROM BIOS fix (aka Solaris x86 install CD fix)
  - int15, ah=86 BIOS fix (aka Solaris x86 hardware probe hang up fix)

  Solaris 10/11 works without a problem. Booting with "scsi" also
  circumvents the initial problem, but I later found problems related to
  "scsi" cdrom boot, and the "ide" disk device is then not found.

  
  qemu-system-i386 -m 712 -drive file=/dev/Virtual_hdd/beryllium0,format=raw -cdrom /repo/Isos/sol-9_905_x86.iso

  SunOS Secondary Boot version 3.00

  prom_panic: Could not mount filesystem.
  Entering boot debugger:
  [136419]

  
  Regards,
  \\CA,

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1639394/+subscriptions

[Qemu-devel] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer

2016-11-08 Thread Eric Blake

Commit 443668ca rewrote the write_zeroes logic to guarantee that
an unaligned request never crosses a cluster boundary.  But
in the rewrite, the new code assumed that at most one iteration
would be needed to get to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux
kernel limits loopback devices to advertise a max_transfer of
only 64k.  Any operation that requires falling back to writes
rather than more efficient zeroing must obey max_transfer during
that fallback, which means an unaligned head may require multiple
iterations of the write fallbacks before reaching the aligned
boundaries, when layering a format with clusters larger than 64k
atop the protocol of file access to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all
my fault in reworking his idea.  But the solution is to restore what
was in place prior to that commit: when dealing with an unaligned
head or tail, iterate as many times as necessary while fragmenting
the operation at max_transfer boundaries.
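
As an illustration only (not part of the patch), the head handling can be
modelled by a small standalone helper. The name head_iterations() is made
up, and the sketch assumes the test case above ends up with an alignment
of 1M (the cluster size) and a max_transfer of 64k; the values QEMU
actually computes for that setup may differ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count the requests needed to bring an unaligned request up to the next
 * alignment boundary when each request is capped at max_transfer.
 * Hypothetical helper for illustration; not a function in block/io.c.
 */
static unsigned head_iterations(uint64_t offset, uint64_t count,
                                uint64_t alignment, uint64_t max_transfer)
{
    unsigned iterations = 0;
    uint64_t head;

    assert(alignment > 0 && max_transfer > 0);
    head = offset % alignment;

    while (head && count) {
        uint64_t num = alignment - head;   /* distance to the boundary */

        if (num > count) {
            num = count;
        }
        if (num > max_transfer) {
            num = max_transfer;
        }
        head = (head + num) % alignment;   /* 0 once the boundary is hit */
        count -= num;
        iterations++;
    }
    return iterations;
}
```

Under those assumptions, "w -z 8003584 2093056" starts 663552 bytes into a
cluster, so 385024 bytes remain before the boundary: five requests of
65536 bytes and one of 57344, six iterations rather than one.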

CC: qemu-sta...@nongnu.org
CC: Ed Swierk 
CC: Denis V. Lunev 
Signed-off-by: Eric Blake 
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
 int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
 bs->bl.request_alignment);
+int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+MAX_WRITE_ZEROES_BOUNCE_BUFFER);

 assert(alignment % bs->bl.request_alignment == 0);
 head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
  * boundaries.
  */
 if (head) {
-/* Make a small request up to the first aligned sector.  */
-num = MIN(count, alignment - head);
-head = 0;
+/* Make a small request up to the first aligned sector. For
+ * convenience, limit this request to max_transfer even if
+ * we don't need to fall back to writes.  */
+num = MIN(MIN(count, max_transfer), alignment - head);
+head = (head + num) % alignment;
+assert(num < max_write_zeroes);
 } else if (tail && num > alignment) {
 /* Shorten the request to the last aligned sector.  */
 num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,

 if (ret == -ENOTSUP) {
 /* Fall back to bounce buffer if write zeroes is unsupported */
-int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-MAX_WRITE_ZEROES_BOUNCE_BUFFER);
 BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;

 if ((flags & BDRV_REQ_FUA) &&
-- 
2.7.4

Re: [Qemu-devel] [PATCH v11 13/22] vfio: Introduce common function to add capabilities

2016-11-08 Thread Alex Williamson

On Wed, 9 Nov 2016 02:16:17 +0530
Kirti Wankhede  wrote:

> On 11/8/2016 12:59 PM, Alexey Kardashevskiy wrote:
> > On 05/11/16 08:10, Kirti Wankhede wrote:  
> >> Vendor driver using mediated device framework should use
> >> vfio_info_add_capability() to add capabilities.
> >> Introduced this function to reduce code duplication in vendor drivers.
> >>
> >> Signed-off-by: Kirti Wankhede 
> >> Signed-off-by: Neo Jia 
> >> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
> >> ---
> >>  drivers/vfio/vfio.c  | 60 +++-
> >>  include/linux/vfio.h |  3 +++
> >>  2 files changed, 62 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index 4ed1a6a247c6..9a03be0942a1 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1797,8 +1797,66 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
> >>for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next - offset)
> >>tmp->next += offset;
> >>  }
> >> -EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
> >> +EXPORT_SYMBOL(vfio_info_cap_shift);  
> > 
> > 
> > Why this change?
> > 
> >   
> 
> We want this symbol to be available to all drivers.

IOW, from proprietary drivers.  It makes me uncomfortable how many
non-GPL symbols we're adding (or converting) in this effort, but I'm
trying to look objectively at every export as to whether a non-GPL
caller of the function is legitimately separate from in-kernel code.
For instance are they making use of data structures intrinsic to GPL'd
code.  In this case we're converting a symbol that's just manipulating
a data buffer to add an offset to each element in a chain.  The entries
are documented in a uapi header.  Kirti asked me about this one, and I
couldn't find any basis to raise an objection.  If you spot any reason
that any of the export symbols in these series really should be GPL,
please raise the issue.
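
For anyone who wants to see how little that function does, here is a
userspace sketch of the same idea. The struct layout and the names are
hypothetical, and the sketch assumes every 'next' link is an offset
measured from the start of the buffer. The quoted loop steps from tmp
rather than from the buffer start, which gives the same result only for
the first link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout, modelled on the loop in the quoted patch. */
struct cap_header_sketch {
    uint16_t id;
    uint16_t version;
    uint32_t next;    /* offset of the next header; 0 ends the chain */
};

/*
 * Add 'offset' to every non-zero 'next' link in the chain that starts
 * at buf. The walk itself uses the original (unshifted) link values.
 */
static void cap_chain_shift(void *buf, uint32_t offset)
{
    struct cap_header_sketch *tmp = buf;

    assert(buf != NULL);
    while (tmp->next) {
        tmp->next += offset;
        /* Undo the shift to find the next header inside this buffer. */
        tmp = (struct cap_header_sketch *)
              ((char *)buf + (tmp->next - offset));
    }
}
```

Nothing here touches a kernel data structure; it only rewrites offsets in
a flat buffer.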

> >>  
> >> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
> >> +{
> >> +  struct vfio_info_cap_header *header;
> >> +  struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
> >> +  size_t size;
> >> +
> >> +  size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
> >> +  header = vfio_info_cap_add(caps, size,
> >> + VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> >> +  if (IS_ERR(header))
> >> +  return PTR_ERR(header);
> >> +
> >> +  sparse_cap = container_of(header,
> >> +  struct vfio_region_info_cap_sparse_mmap, header);
> >> +  sparse_cap->nr_areas = sparse->nr_areas;
> >> +  memcpy(sparse_cap->areas, sparse->areas,
> >> + sparse->nr_areas * sizeof(*sparse->areas));
> >> +  return 0;
> >> +}
> >> +
> >> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
> >> +{
> >> +  struct vfio_info_cap_header *header;
> >> +  struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
> >> +
> >> +  header = vfio_info_cap_add(caps, sizeof(*cap),
> >> + VFIO_REGION_INFO_CAP_TYPE, 1);
> >> +  if (IS_ERR(header))
> >> +  return PTR_ERR(header);
> >> +
> >> +  type_cap = container_of(header, struct vfio_region_info_cap_type,
> >> +  header);
> >> +  type_cap->type = cap->type;
> >> +  type_cap->subtype = cap->subtype;
> >> +  return 0;
> >> +}
> >> +
> >> +int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
> >> +   void *cap_type)
> >> +{
> >> +  int ret = -EINVAL;
> >> +
> >> +  if (!cap_type)
> >> +  return 0;
> >> +
> >> +  switch (cap_type_id) {
> >> +  case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
> >> +  ret = sparse_mmap_cap(caps, cap_type);
> >> +  break;
> >> +
> >> +  case VFIO_REGION_INFO_CAP_TYPE:
> >> +  ret = region_type_cap(caps, cap_type);
> >> +  break;
> >> +  }
> >> +
> >> +  return ret;
> >> +}
> >> +EXPORT_SYMBOL(vfio_info_add_capability);
> >>  
> >>  /*
> >>   * Pin a set of guest PFNs and return their associated host PFNs for local
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index dcda8fccefab..cf90393a11e2 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -113,6 +113,9 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
> >>struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
> >>  extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
> >>  
> >> +extern int vfio_info_add_capability(struct vfio_info_cap *caps,
> >> +  int cap_type_id, void *cap_type);
> >> +  
> > 
> > 
> > It would make it easier to review and bisect if 14/22 was squashed into
> > this one.   
> 
> This was split based on Alex's suggestion on earlier version of this
> patchset.

Yeah, generally squashing patches together is

Re: [Qemu-devel] [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

2016-11-08 Thread Alex Williamson

On Wed, 9 Nov 2016 01:29:19 +0530
Kirti Wankhede  wrote:

> On 11/8/2016 11:16 PM, Alex Williamson wrote:
> > On Tue, 8 Nov 2016 21:56:29 +0530
> > Kirti Wankhede  wrote:
> >   
> >> On 11/8/2016 5:15 AM, Alex Williamson wrote:  
> >>> On Sat, 5 Nov 2016 02:40:45 +0530
> >>> Kirti Wankhede  wrote:
> >>> 
> >> ...  
>   
>  +int vfio_register_notifier(struct device *dev, struct notifier_block *nb)
> >>>
> >>> Is the expectation here that this is a generic notifier for all
> >>> vfio->mdev signaling?  That should probably be made clear in the mdev
> >>> API to avoid vendor drivers assuming their notifier callback only
> >>> occurs for unmaps, even if that's currently the case.
> >>> 
> >>
> >> Ok. Adding comment about notifier callback in mdev_device which is part
> >> of next patch.
> >>
> >> ...
> >>  
>   mutex_lock(&iommu->lock);
>   
>  -if (!iommu->external_domain) {
>  +/* Fail if notifier list is empty */
>  +if ((!iommu->external_domain) || (!iommu->notifier.head)) {
>   ret = -EINVAL;
>   goto pin_done;
>   }
>  @@ -867,6 +870,11 @@ unlock:
>   /* Report how much was unmapped */
>   unmap->size = unmapped;
>   
>  +if (unmapped && iommu->external_domain)
>  +blocking_notifier_call_chain(&iommu->notifier,
>  + VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>  + unmap);
> >>>
> >>> This is after the fact, there's already a gap here where pages are
> >>> unpinned and the mdev device is still running.
> >>
> >> Oh, there is a bug here, now unpin_pages() take user_pfn as argument and
> >> find vfio_dma. If its not found, it doesn't unpin pages. We have to call
> >> this notifier before vfio_remove_dma(). But if we call this before
> >> vfio_remove_dma() there will be deadlock since iommu->lock is already
> >> held here and vfio_iommu_type1_unpin_pages() will also try to hold
> >> iommu->lock.
> >> If we want to call blocking_notifier_call_chain() before
> >> vfio_remove_dma(), sequence should be:
> >>
> >> unmapped += dma->size;
> >> mutex_unlock(&iommu->lock);
> >> if (iommu->external_domain) {
> >>     struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>
> >>     nb_unmap.iova = dma->iova;
> >>     nb_unmap.size = dma->size;
> >>     blocking_notifier_call_chain(&iommu->notifier,
> >>                                  VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >>                                  &nb_unmap);
> >> }
> >> mutex_lock(&iommu->lock);
> >> vfio_remove_dma(iommu, dma);
> > 
> > It seems like it would be worthwhile to have the rb-tree rooted in the
> > vfio-dma, then we only need to call the notifier if there are pages
> > pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
> > then release the lock call the notifier, re-acquire the lock, and
> > BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
> > between separate vfio_dma structs, but as I mentioned in other replies,
> > that seems like an exception that we don't need to optimize for.
> >   
> 
> If we don't optimize for the case where iova from different vfio_dma are
> mapped to same pfn and we would not consider this case for page
> accounting then:

Just to clarify, the current code (not handling mdevs) will pin and do
page accounting per iova, regardless of whether the iova translates to a
unique pfn.  As long as we do no worse than that, I'm ok.

> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
> structure.
> - iova tracking structure would have iova and ref_count only.
> - page accounting would only count number of iova's in rb_tree, case
> where different iova could map to same pfn would not be considered in
> this implementation for now.
> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
> validate that iova exist in rb tree and trust vendor driver that
> corresponding pfn is correct, there is no validation of pfn. If want
> validate pfn, call GUP, verify pfn and call put_pfn().
> - In .release() or .detach_group() path, if there are entries in this rb
> tree, call GUP again using that iova, get pfn and then call
> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
> pfn in our tracking logic.

Wait a sec, if we detach a group from the container and it's not the
last group in the container (which would trigger a release), we can't
assume anything about which vfio_dma entries were associated with that
device.  The vendor driver, through the release of the device(s) within
that group, needs to unpin.  In a container release, we need to send a
notifier to the vendor driver(s) to cause an unpin.  This is the only
mechanism we have to ensure that vendor drivers are not leaking
references.  If during the release, after the notifier, if any

Re: [Qemu-devel] Crashing in tcp_close

2016-11-08 Thread Brian Candler


On 07/11/2016 10:42, Stefan Hajnoczi wrote:

On Mon, Nov 07, 2016 at 08:42:17AM +, Brian Candler wrote:

>On 06/11/2016 18:04, Samuel Thibault wrote:

> >Brian, could you run it with
> >
> >export MALLOC_CHECK_=2
> >
> >and also this could be useful:
> >
> >export MALLOC_PERTURB_=1234
> >
> >Also, to rule out the double-free scenario, and try to catch a buffer
> >overflow coming from the socket structure itself, I have attached a
> >patch which adds some debugging.

>
>Thanks. I've added the patch, and re-run the stress test.


Back to the original setup, I can still get dumps. I notice I'm now 
getting "malloc_printerr" in the backtrace, but unfortunately I don't 
get to see the actual error message. It would seem that the malloc_check 
is being done and finding an issue.  I haven't been able to get one in 
tcp_close again though :-(


Regards,

Brian.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/qemu-system-x86_64 -m 4G -machine 
type=pc,accel=kvm -device virt'.

Program terminated with signal SIGABRT, Aborted.
#0  0x7eff4f3df428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54

54../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7eff50dffa80 (LWP 13616))]
(gdb) bt
#0  0x7eff4f3df428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54

#1  0x7eff4f3e102a in __GI_abort () at abort.c:89
#2  0x7eff4f42bc1f in malloc_printerr (ar_ptr=, 
ptr=, str=,

action=) at malloc.c:5008
#3  _int_malloc (av=av@entry=0x7eff4f76db20 , 
bytes=bytes@entry=89) at malloc.c:3384
#4  0x7eff4f42c409 in malloc_check (sz=88, caller=) 
at hooks.c:295
#5  0x7eff50106729 in g_malloc () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x563ca16930cf in qemu_aio_get 
(aiocb_info=aiocb_info@entry=0x563ca1841190 ,
bs=0x563ca4132f40, cb=cb@entry=0x563ca14d85e0 , 
opaque=opaque@entry=0x563ca5b78910)

at /home/nsrc/qemu-2.7.0/block/io.c:2231
#7  0x563ca1687aa8 in blk_aio_get (opaque=0x563ca5b78910, 
cb=0x563ca14d85e0 , blk=0x563ca4132d70,
aiocb_info=0x563ca1841190 ) at 
/home/nsrc/qemu-2.7.0/block/block-backend.c:1477
#8  blk_aio_prwv (blk=0x563ca4132d70, offset=5278244864, bytes=4096, 
qiov=0x563ca5b78968,
co_entry=co_entry@entry=0x563ca16872b0 , 
flags=0, cb=0x563ca14d85e0 ,
opaque=0x563ca5b78910) at 
/home/nsrc/qemu-2.7.0/block/block-backend.c:941
#9  0x563ca1687bc0 in blk_aio_pwritev (blk=, 
offset=, qiov=,

flags=, cb=, opaque=)
at /home/nsrc/qemu-2.7.0/block/block-backend.c:1054
#10 0x563ca14d8718 in dma_blk_cb (opaque=0x563ca5b78910, 
ret=)

at /home/nsrc/qemu-2.7.0/dma-helpers.c:167
#11 0x563ca14d8bf8 in dma_blk_io (ctx=0x563ca41184a0, 
sg=sg@entry=0x563ca59a08f0,
offset=offset@entry=5278244864, 
io_func=io_func@entry=0x563ca15a58e0 ,
io_func_opaque=io_func_opaque@entry=0x563ca5c8a350, 
cb=cb@entry=0x563ca15a7250 ,
opaque=0x563ca5c8a350, dir=DMA_DIRECTION_TO_DEVICE) at 
/home/nsrc/qemu-2.7.0/dma-helpers.c:222
#12 0x563ca15a764e in scsi_write_data (req=0x563ca5c8a350) at 
/home/nsrc/qemu-2.7.0/hw/scsi/scsi-disk.c:540

#13 0x563ca15ac743 in scsi_req_continue (req=req@entry=0x563ca5c8a350)
at /home/nsrc/qemu-2.7.0/hw/scsi/scsi-bus.c:1680
#14 0x563ca14381a2 in virtio_scsi_handle_cmd_req_submit 
(s=0x563ca5abc1d0, req=)

at /home/nsrc/qemu-2.7.0/hw/scsi/virtio-scsi.c:565
#15 virtio_scsi_handle_cmd_vq (s=0x563ca5abc1d0, vq=0x7eff4963f110)
at /home/nsrc/qemu-2.7.0/hw/scsi/virtio-scsi.c:583
#16 0x563ca144a0d6 in virtio_queue_notify_vq (vq=0x7eff4963f110)
at /home/nsrc/qemu-2.7.0/hw/virtio/virtio.c:1113
#17 0x563ca1654965 in aio_dispatch (ctx=0x563ca41184a0) at 
/home/nsrc/qemu-2.7.0/aio-posix.c:330
#18 0x563ca164a3ae in aio_ctx_dispatch (source=, 
callback=,

user_data=) at /home/nsrc/qemu-2.7.0/async.c:234
#19 0x7eff501011a7 in g_main_context_dispatch () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#20 0x563ca16531db in glib_pollfds_poll () at 
/home/nsrc/qemu-2.7.0/main-loop.c:213
#21 os_host_main_loop_wait (timeout=) at 
/home/nsrc/qemu-2.7.0/main-loop.c:258
#22 main_loop_wait (nonblocking=) at 
/home/nsrc/qemu-2.7.0/main-loop.c:506

#23 0x563ca13be431 in main_loop () at /home/nsrc/qemu-2.7.0/vl.c:1908
#24 main (argc=, argv=, envp=out>) at /home/nsrc/qemu-2.7.0/vl.c:4604

(gdb)

Re: [Qemu-devel] [PATCH v11 01/22] vfio: Mediated device Core driver

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 2:55 PM, Dong Jia Shi wrote:
> * Kirti Wankhede  [2016-11-05 02:40:35 +0530]:
> 
> Hi Kirti,
> 
> [...]
>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>> index da6e2ce77495..23eced02aaf6 100644
>> --- a/drivers/vfio/Kconfig
>> +++ b/drivers/vfio/Kconfig
>> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>>
>>  source "drivers/vfio/pci/Kconfig"
>>  source "drivers/vfio/platform/Kconfig"
>> +source "drivers/vfio/mdev/Kconfig"
>>  source "virt/lib/Kconfig"
>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>> index 7b8a31f63fea..4a23c13b6be4 100644
>> --- a/drivers/vfio/Makefile
>> +++ b/drivers/vfio/Makefile
>> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>>  obj-$(CONFIG_VFIO_PCI) += pci/
>>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
>> +obj-$(CONFIG_VFIO_MDEV) += mdev/
>> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
>> new file mode 100644
>> index ..303c14ce2847
>> --- /dev/null
>> +++ b/drivers/vfio/mdev/Kconfig
>> @@ -0,0 +1,10 @@
>> +
>> +config VFIO_MDEV
>> +tristate "Mediated device driver framework"
>> +depends on VFIO
>> +default n
>> +help
>> +Provides a framework to virtualize devices.
>> +See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> We don't have this doc at this point of time.
> 

Yes, but I have this doc in this patch series.

>> +
>> +If you don't know what to do here, say N.
>> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
>> new file mode 100644
>> index ..31bc04801d94
>> --- /dev/null
>> +++ b/drivers/vfio/mdev/Makefile
>> @@ -0,0 +1,4 @@
>> +
>> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>> +
>> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>> new file mode 100644
>> index ..54c59f325336
>> --- /dev/null
>> +++ b/drivers/vfio/mdev/mdev_core.c
> [...]
> 
>> +
>> +/*
>> + * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
>> + * device is being unregistered from mdev device framework.
>> + * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
>> + *   indicates that if the mdev device is active, used by VMM or userspace
>> + *   application, vendor driver could return error then don't remove the device.
>> + * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
>> + *   which indicate that parent device is being removed from mdev device
>> + *   framework so remove mdev device forcefully.
>> + */
>> +static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> ?
> s/force_remove/force/
> 
>> +{
>> +struct parent_device *parent = mdev->parent;
>> +int ret;
>> +
>> +/*
>> + * Vendor driver can return error if VMM or userspace application is
>> + * using this mdev device.
>> + */
>> +ret = parent->ops->remove(mdev);
>> +if (ret && !force_remove)
>> +return -EBUSY;
>> +
>> +sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
>> +return 0;
>> +}
>> +
>> +static int mdev_device_remove_cb(struct device *dev, void *data)
>> +{
>> +if (!dev_is_mdev(dev))
>> +return 0;
>> +
>> +return mdev_device_remove(dev, data ? *(bool *)data : true);
>> +}
>> +
>> +/*
>> + * mdev_register_device : Register a device
>> + * @dev: device structure representing parent device.
>> + * @ops: Parent device operation structure to be registered.
>> + *
>> + * Add device to list of registered parent devices.
>> + * Returns a negative value on error, otherwise 0.
>> + */
>> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
>> +{
>> +int ret;
>> +struct parent_device *parent;
>> +
>> +/* check for mandatory ops */
>> +if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>> +return -EINVAL;
>> +
>> +dev = get_device(dev);
>> +if (!dev)
>> +return -EINVAL;
>> +
>> +mutex_lock(_list_lock);
>> +
>> +/* Check for duplicate */
>> +parent = __find_parent_device(dev);
>> +if (parent) {
>> +ret = -EEXIST;
>> +goto add_dev_err;
>> +}
>> +
>> +parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>> +if (!parent) {
>> +ret = -ENOMEM;
>> +goto add_dev_err;
>> +}
>> +
>> +kref_init(&parent->ref);
>> +mutex_init(&parent->lock);
>> +
>> +parent->dev = dev;
>> +parent->ops = ops;
>> +
>> +if (!mdev_bus_compat_class) {
>> +mdev_bus_compat_class = class_compat_register("mdev_bus");
>> +if (!mdev_bus_compat_class) {
>> +ret = -ENOMEM;
>> +goto add_dev_err;
>> +}
>> +}
>> +
>> +ret = parent_create_sysfs_files(parent);
>> +if (ret)
>> +goto

Re: [Qemu-devel] [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-08 Thread Michael S. Tsirkin

On Mon, Nov 07, 2016 at 09:23:38AM -0800, Dave Hansen wrote:
> On 11/06/2016 07:37 PM, Li, Liang Z wrote:
> >> Let's say we do a 32k bitmap that can hold ~1M pages.  That's 4GB of RAM.
> >> On a 1TB system, that's 256 passes through the top-level loop.
> >> The bottom-level lists have tens of thousands of pages in them, even on my
> >> laptop.  Only 1/256 of these pages will get consumed in a given pass.
> >>
> > Your description is not exact.
> > A 32k bitmap is used only when there is little free memory left in the
> > system and when extend_page_bitmap() fails to allocate more memory for
> > the bitmap. Otherwise dozens of 32k split bitmaps will be used; this
> > version limits the bitmap count to 32, which means we can use at most
> > 32*32 kB for the bitmap, which can cover 128GB of RAM. We can increase
> > the bitmap count limit to a larger value if 32 is not big enough.
> 
> OK, so it tries to allocate a large bitmap.  But, if it fails, it will
> try to work with a smaller bitmap.  Correct?
> 
> So, what's the _worst_ case?  It sounds like it is even worse than I was
> positing.
> 
> >> That's an awfully inefficient way of doing it.  This patch essentially 
> >> changed
> >> the data structure without changing the algorithm to populate it.
> >>
> >> Please change the *algorithm* to use the new data structure efficiently.
> >>  Such a change would only do a single pass through each freelist, and would
> >> choose whether to use the extent-based (pfn -> range) or bitmap-based
> >> approach based on the contents of the free lists.
> > 
> > Save the free page info to a raw bitmap first and then process the raw 
> > bitmap to
> > get the proper ' extent-based ' and  'bitmap-based' is the most efficient 
> > way I can 
> > come up with to save the virtio data transmission.  Do you have some better 
> > idea?
> 
> That's kinda my point.  This patch *does* processing to try to pack the
> bitmaps full of pages from the various pfn ranges.  It's a form of
> processing that gets *REALLY*, *REALLY* bad in some (admittedly obscure)
> cases.
> 
> Let's not pretend that making an essentially unlimited number of passes
> over the free lists is not processing.
> 
> 1. Allocate as large of a bitmap as you can. (what you already do)
> 2. Iterate from the largest freelist order.  Store those pages in the
>bitmap.
> 3. If you can no longer fit pages in the bitmap, return the list that
>you have.
> 4. Make an approximation about where the bitmap does not make sense any more,
>and fall back to listing individual PFNs.  This would make sense, for
>instance in a large zone with very few free order-0 pages left.

In practice, a single PFN using the bitmap format
only takes up twice the size: I think it's 128 instead of 64 bit
per entry.

So it's not a given that point 4 is worth it at any point,
just packing multiple bitmaps might be good enough.

> 
> > It seems the benefit we get for this feature is not as big as that in fast 
> > balloon inflating/deflating.
> >>
> >> You should not be using get_max_pfn().  Any patch set that continues to use
> >> it is not likely to be using a proper algorithm.
> > 
> > Do you have any suggestion about how to avoid it?
> 
> Yes: get the pfns from the page free lists alone.  Don't derive them
> from the pfn limits of the system or zones.

Re: [Qemu-devel] [PATCH v11 13/22] vfio: Introduce common function to add capabilities

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 12:59 PM, Alexey Kardashevskiy wrote:
> On 05/11/16 08:10, Kirti Wankhede wrote:
>> Vendor driver using mediated device framework should use
>> vfio_info_add_capability() to add capabilities.
>> Introduced this function to reduce code duplication in vendor drivers.
>>
>> Signed-off-by: Kirti Wankhede 
>> Signed-off-by: Neo Jia 
>> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
>> ---
>>  drivers/vfio/vfio.c  | 60 
>> +++-
>>  include/linux/vfio.h |  3 +++
>>  2 files changed, 62 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 4ed1a6a247c6..9a03be0942a1 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1797,8 +1797,66 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, 
>> size_t offset)
>>  for (tmp = caps->buf; tmp->next; tmp = (void *)tmp + tmp->next - offset)
>>  tmp->next += offset;
>>  }
>> -EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>> +EXPORT_SYMBOL(vfio_info_cap_shift);
> 
> 
> Why this change?
> 
> 

We want this symbol to be available to all drivers.


>>  
>> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +struct vfio_info_cap_header *header;
>> +struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
>> +size_t size;
>> +
>> +size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
>> +header = vfio_info_cap_add(caps, size,
>> +   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
>> +if (IS_ERR(header))
>> +return PTR_ERR(header);
>> +
>> +sparse_cap = container_of(header,
>> +struct vfio_region_info_cap_sparse_mmap, header);
>> +sparse_cap->nr_areas = sparse->nr_areas;
>> +memcpy(sparse_cap->areas, sparse->areas,
>> +   sparse->nr_areas * sizeof(*sparse->areas));
>> +return 0;
>> +}
>> +
>> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +struct vfio_info_cap_header *header;
>> +struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
>> +
>> +header = vfio_info_cap_add(caps, sizeof(*cap),
>> +   VFIO_REGION_INFO_CAP_TYPE, 1);
>> +if (IS_ERR(header))
>> +return PTR_ERR(header);
>> +
>> +type_cap = container_of(header, struct vfio_region_info_cap_type,
>> +header);
>> +type_cap->type = cap->type;
>> +type_cap->subtype = cap->subtype;
>> +return 0;
>> +}
>> +
>> +int vfio_info_add_capability(struct vfio_info_cap *caps, int cap_type_id,
>> + void *cap_type)
>> +{
>> +int ret = -EINVAL;
>> +
>> +if (!cap_type)
>> +return 0;
>> +
>> +switch (cap_type_id) {
>> +case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
>> +ret = sparse_mmap_cap(caps, cap_type);
>> +break;
>> +
>> +case VFIO_REGION_INFO_CAP_TYPE:
>> +ret = region_type_cap(caps, cap_type);
>> +break;
>> +}
>> +
>> +return ret;
>> +}
>> +EXPORT_SYMBOL(vfio_info_add_capability);
>>  
>>  /*
>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index dcda8fccefab..cf90393a11e2 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -113,6 +113,9 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
>>  struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
>>  extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
>>  
>> +extern int vfio_info_add_capability(struct vfio_info_cap *caps,
>> +int cap_type_id, void *cap_type);
>> +
> 
> 
> It would make it easier to review and bisect if 14/22 was squashed into
> this one. 

This was split based on Alex's suggestion on an earlier version of this
patchset.

> In the resulting patch, vfio_info_cap_add() can be made static as
> it will only be used in drivers/vfio/vfio.c from now.
> 

I'm not sure that vfio_info_cap_add() should be made static. If any
drivers outside the kernel tree use this symbol, they might break with
that change.

Thanks,
Kirti

> 
> 
> 
>>  struct pci_dev;
>>  #ifdef CONFIG_EEH
>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
>>
> 
>

Re: [Qemu-devel] [PATCH v11 17/22] vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 2:22 PM, Alexey Kardashevskiy wrote:
> On 05/11/16 08:10, Kirti Wankhede wrote:
>> Updated vfio_platform_common.c file to use
>> vfio_set_irqs_validate_and_prepare()
>>
>> Signed-off-by: Kirti Wankhede 
>> Signed-off-by: Neo Jia 
>> Change-Id: Id87cd6b78ae901610b39bf957974baa6f40cd7b0
>> ---
>>  drivers/vfio/platform/vfio_platform_common.c | 31 
>> +++-
>>  1 file changed, 8 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vfio/platform/vfio_platform_common.c 
>> b/drivers/vfio/platform/vfio_platform_common.c
>> index d78142830754..4c27f4be3c3d 100644
>> --- a/drivers/vfio/platform/vfio_platform_common.c
>> +++ b/drivers/vfio/platform/vfio_platform_common.c
>> @@ -364,36 +364,21 @@ static long vfio_platform_ioctl(void *device_data,
>>  struct vfio_irq_set hdr;
>>  u8 *data = NULL;
>>  int ret = 0;
>> +size_t data_size = 0;
>>  
>>  minsz = offsetofend(struct vfio_irq_set, count);
>>  
>>  if (copy_from_user(&hdr, (void __user *)arg, minsz))
>>  return -EFAULT;
>>  
>> -if (hdr.argsz < minsz)
>> -return -EINVAL;
>> -
>> -if (hdr.index >= vdev->num_irqs)
>> -return -EINVAL;
>> -
>> -if (hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
>> -  VFIO_IRQ_SET_ACTION_TYPE_MASK))
>> -return -EINVAL;
>> -
>> -if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
>> -size_t size;
>> -
>> -if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
>> -size = sizeof(uint8_t);
>> -else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
>> -size = sizeof(int32_t);
>> -else
>> -return -EINVAL;
>> -
>> -if (hdr.argsz - minsz < size)
>> -return -EINVAL;
>> +ret = vfio_set_irqs_validate_and_prepare(&hdr, vdev->num_irqs,
>> + vdev->num_irqs, &data_size);
> 
> The patch does not change this but I am still curious:
> 
> is not the second vdev->num_irqs supposed to be one of
> VFIO_PCI_INTX_IRQ_INDEX..VFIO_PCI_NUM_IRQS, not the actual number of
> interrupt vectors (as in vfio-pci)?
> 
> 

Those are PCI specific. I don't think those counts are applicable here.

If you look at the prototype, the second and third arguments have
different meanings.

int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int
num_irqs, int max_irq_type, size_t *data_size)

- num_irqs is the number of IRQs the caller wants to set up, and
- max_irq_type is the value that is returned to the user in the
VFIO_DEVICE_GET_INFO ioctl's info.num_irqs.

For platform devices these two are the same.

Thanks,
Kirti

> 
> 
>> +if (ret)
>> +return ret;
>>  
>> -data = memdup_user((void __user *)(arg + minsz), size);
>> +if (data_size) {
>> +data = memdup_user((void __user *)(arg + minsz),
>> +data_size);
>>  if (IS_ERR(data))
>>  return PTR_ERR(data);
>>  }
>>
> 
>

Re: [Qemu-devel] [PATCH v6 2/2] block/vxhs.c: Add qemu-iotests for new block device type "vxhs"

2016-11-08 Thread Jeff Cody

On Mon, Nov 07, 2016 at 04:59:45PM -0800, Ashish Mittal wrote:
> These changes use a vxhs test server that is a part of the following
> repository:
> https://github.com/MittalAshish/libqnio.git
> 
> Signed-off-by: Ashish Mittal 
> ---
> v6 changelog:
> (1) Added iotests for VxHS block device.
> 
>  tests/qemu-iotests/common|  6 ++
>  tests/qemu-iotests/common.config | 13 +
>  tests/qemu-iotests/common.filter |  1 +
>  tests/qemu-iotests/common.rc | 19 +++
>  4 files changed, 39 insertions(+)
> 
> diff --git a/tests/qemu-iotests/common b/tests/qemu-iotests/common
> index d60ea2c..41430d8 100644
> --- a/tests/qemu-iotests/common
> +++ b/tests/qemu-iotests/common

When using raw format, I was able to run the test successfully for all
supported test cases (26 of them).

With qcow2, they fail. I don't think that is the fault of this patch,
though, but rather of the test server.  Can qnio_server be modified so that
it works on more than just raw files?



> @@ -158,6 +158,7 @@ check options
>  -nfstest nfs
>  -archipelagotest archipelago
>  -luks   test luks
> +-vxhs   test vxhs
>  -xdiff  graphical mode diff
>  -nocacheuse O_DIRECT on backing file
>  -misalign   misalign memory allocations
> @@ -261,6 +262,11 @@ testlist options
>  xpand=false
>  ;;
>  
> +-vxhs)
> +IMGPROTO=vxhs
> +xpand=false
> +;;
> +
>  -ssh)
>  IMGPROTO=ssh
>  xpand=false
> diff --git a/tests/qemu-iotests/common.config 
> b/tests/qemu-iotests/common.config
> index f6384fb..c7a80c0 100644
> --- a/tests/qemu-iotests/common.config
> +++ b/tests/qemu-iotests/common.config
> @@ -105,6 +105,10 @@ if [ -z "$QEMU_NBD_PROG" ]; then
>  export QEMU_NBD_PROG="`set_prog_path qemu-nbd`"
>  fi
>  
> +if [ -z "$QEMU_VXHS_PROG" ]; then
> +export QEMU_VXHS_PROG="`set_prog_path qnio_server /usr/local/bin`"
> +fi
> +
>  _qemu_wrapper()
>  {
>  (
> @@ -156,10 +160,19 @@ _qemu_nbd_wrapper()
>  )
>  }
>  
> +_qemu_vxhs_wrapper()
> +{
> +(
> +echo $BASHPID > "${TEST_DIR}/qemu-vxhs.pid"
> +exec "$QEMU_VXHS_PROG" $QEMU_VXHS_OPTIONS "$@"
> +)
> +}
> +
>  export QEMU=_qemu_wrapper
>  export QEMU_IMG=_qemu_img_wrapper
>  export QEMU_IO=_qemu_io_wrapper
>  export QEMU_NBD=_qemu_nbd_wrapper
> +export QEMU_VXHS=_qemu_vxhs_wrapper
>  
>  QEMU_IMG_EXTRA_ARGS=
>  if [ "$IMGOPTSSYNTAX" = "true" ]; then
> diff --git a/tests/qemu-iotests/common.filter 
> b/tests/qemu-iotests/common.filter
> index 240ed06..a8a4d0e 100644
> --- a/tests/qemu-iotests/common.filter
> +++ b/tests/qemu-iotests/common.filter
> @@ -123,6 +123,7 @@ _filter_img_info()
>  -e "s#$TEST_DIR#TEST_DIR#g" \
>  -e "s#$IMGFMT#IMGFMT#g" \
>  -e 's#nbd://127.0.0.1:10810$#TEST_DIR/t.IMGFMT#g' \
> +-e 's#json.*vdisk-id.*vxhs"}}#TEST_DIR/t.IMGFMT#' \
>  -e "/encrypted: yes/d" \
>  -e "/cluster_size: [0-9]\\+/d" \
>  -e "/table_size: [0-9]\\+/d" \
> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> index 3213765..06a3164 100644
> --- a/tests/qemu-iotests/common.rc
> +++ b/tests/qemu-iotests/common.rc
> @@ -89,6 +89,9 @@ else
>  TEST_IMG=$TEST_DIR/t.$IMGFMT
>  elif [ "$IMGPROTO" = "archipelago" ]; then
>  TEST_IMG="archipelago:at.$IMGFMT"
> +elif [ "$IMGPROTO" = "vxhs" ]; then
> +TEST_IMG_FILE=$TEST_DIR/t.$IMGFMT
> +TEST_IMG="vxhs://127.0.0.1:/t.$IMGFMT"
>  else
>  TEST_IMG=$IMGPROTO:$TEST_DIR/t.$IMGFMT
>  fi
> @@ -175,6 +178,12 @@ _make_test_img()
>  eval "$QEMU_NBD -v -t -b 127.0.0.1 -p 10810 -f $IMGFMT  
> $TEST_IMG_FILE &"
>  sleep 1 # FIXME: qemu-nbd needs to be listening before we continue
>  fi
> +
> +# Start QNIO server on image directory for vxhs protocol
> +if [ $IMGPROTO = "vxhs" ]; then
> +eval "$QEMU_VXHS -d  $TEST_DIR &"
> +sleep 1 # Wait for server to come up.
> +fi
>  }
>  
>  _rm_test_img()
> @@ -201,6 +210,16 @@ _cleanup_test_img()
>  fi
>  rm -f "$TEST_IMG_FILE"
>  ;;
> +vxhs)
> +if [ -f "${TEST_DIR}/qemu-vxhs.pid" ]; then
> +local QEMU_VXHS_PID
> +read QEMU_VXHS_PID < "${TEST_DIR}/qemu-vxhs.pid"
> +kill ${QEMU_VXHS_PID} >/dev/null 2>&1
> +rm -f "${TEST_DIR}/qemu-vxhs.pid"
> +fi
> +rm -f "$TEST_IMG_FILE"
> +;;
> +
>  file)
>  _rm_test_img "$TEST_DIR/t.$IMGFMT"
>  _rm_test_img "$TEST_DIR/t.$IMGFMT.orig"
> -- 
> 1.8.3.1
>

Re: [Qemu-devel] [PATCH 0/3] [RFC] Add HAX support

2016-11-08 Thread Paolo Bonzini

On 08/11/2016 20:41, Vincent Palatin wrote:
> >   If so, I think we should only support those
> > processors and slash all the part related to HAX_EMULATE_STATE_INITIAL
> > and HAX_EMULATE_STATE_REAL.  This would probably let us make patch 3
> > much less intrusive.
> 
> Sure the whole patchset would be lighter, not sure which proportion of
> user have VT machines without UG support though.

All Intel machines sold after ~2010 should have unrestricted guest
support.  (HAX doesn't support AMD, but anyway AMD's virt extensions
have always had the equivalent feature).

I'm not sure we want !UG support at all but, if we do, QEMU 2.9 will
have multithreaded TCG so it would be possible to make it less
intrusive.  So it's worth starting with the minimum patchset and see
what happens.

Thanks for working on this!

Paolo

Re: [Qemu-devel] [PATCH V3 05/10] intel_iommu: support device iotlb descriptor

2016-11-08 Thread Michael S. Tsirkin

On Tue, Nov 08, 2016 at 02:54:19PM +0800, Jason Wang wrote:
> 
> 
> On 2016年11月08日 07:35, Peter Xu wrote:
> > On Mon, Nov 07, 2016 at 03:09:50PM +0800, Jason Wang wrote:
> > 
> > [...]
> > 
> > > +static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
> > > +  VTDInvDesc *inv_desc)
> > > +{
> > > +VTDAddressSpace *vtd_dev_as;
> > > +IOMMUTLBEntry entry;
> > Since "entry" is allocated on the stack...
> > 
> > [...]
> > 
> > > > +entry.target_as = &vtd_dev_as->as;
> > > +entry.addr_mask = sz - 1;
> > > +entry.iova = addr;
> > > +memory_region_notify_iommu(entry.target_as->root, entry);
> > ... here we need to assign entry.perm explicitly to IOMMU_NONE, right?
> > 
> > Also I think it'll be nice that we set all the fields even not used,
> > to avoid rubbish from the stack passed down to notifier handlers.
> > 
> > [...]
> 
> This is better, if no other comments on the series I will post a patch on
> top to fix this.

If you do, pls remember to use the fixup! prefix.
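
For reference, the point about the stack-allocated entry could be
addressed along these lines. This is only a sketch with stand-in types
(TlbEntry, Perm and make_invalidate_entry() are hypothetical, not the
QEMU definitions): a designated initializer sets the fields we care
about and zeroes the rest, so no stack rubbish reaches the notifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the QEMU types discussed above. */
typedef enum { PERM_NONE = 0, PERM_RO = 1, PERM_WO = 2, PERM_RW = 3 } Perm;

typedef struct {
    void    *target_as;
    uint64_t iova;
    uint64_t translated_addr;
    uint64_t addr_mask;
    Perm     perm;
} TlbEntry;

static TlbEntry make_invalidate_entry(void *as, uint64_t addr, uint64_t sz)
{
    /* This sketch assumes sz is a nonzero power of two. */
    assert(sz != 0 && (sz & (sz - 1)) == 0);

    TlbEntry entry = {
        .target_as = as,
        .iova      = addr,
        .addr_mask = sz - 1,
        .perm      = PERM_NONE,  /* explicit: this is an invalidation */
        /* translated_addr is not named, so it is zero-initialized */
    };
    return entry;
}
```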

> > 
> > > +static bool x86_iommu_device_iotlb_prop_get(Object *o, Error **errp)
> > > +{
> > > +X86IOMMUState *s = X86_IOMMU_DEVICE(o);
> > > +return s->dt_supported;
> > > +}
> > > +
> > > +static void x86_iommu_device_iotlb_prop_set(Object *o, bool value, Error 
> > > **errp)
> > > +{
> > > +X86IOMMUState *s = X86_IOMMU_DEVICE(o);
> > > +s->dt_supported = value;
> > > +}
> > > +
> > >   static void x86_iommu_instance_init(Object *o)
> > >   {
> > >   X86IOMMUState *s = X86_IOMMU_DEVICE(o);
> > > @@ -114,6 +126,11 @@ static void x86_iommu_instance_init(Object *o)
> > >   s->intr_supported = false;
> > >   object_property_add_bool(o, "intremap", x86_iommu_intremap_prop_get,
> > >x86_iommu_intremap_prop_set, NULL);
> > > +s->dt_supported = false;
> > > +object_property_add_bool(o, "device-iotlb",
> > > + x86_iommu_device_iotlb_prop_get,
> > > + x86_iommu_device_iotlb_prop_set,
> > > + NULL);
> > Again, a nit-pick here is to use Property for "device-iotlb":
> > 
> >  static Property vtd_properties[] = {
> >  DEFINE_PROP_UINT32("device-iotlb", X86IOMMUState, dt_supported, 
> > false),
> >  DEFINE_PROP_END_OF_LIST(),
> >  };
> > 
> > However not worth a repost.
> > 
> > Thanks,
> > 
> > -- peterx
> > 
> 
> We may want to share this with AMD IOMMU. (Looking at AMD IOMMU codes, its
> device-iotlb support is buggy).
> 
> Thanks

Re: [Qemu-devel] [PATCH v11 15/22] vfio: Introduce vfio_set_irqs_validate_and_prepare()

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 2:16 PM, Alexey Kardashevskiy wrote:
> On 05/11/16 08:10, Kirti Wankhede wrote:
>> Vendor driver using mediated device framework would use same mechanism to
>> validate and prepare IRQs. Introducing this function to reduce code
>> replication in multiple drivers.
>>
>> Signed-off-by: Kirti Wankhede 
>> Signed-off-by: Neo Jia 
>> Change-Id: Ie201f269dda0713ca18a07dc4852500bd8b48309
>> ---
>>  drivers/vfio/vfio.c  | 48 
>>  include/linux/vfio.h |  4 
>>  2 files changed, 52 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 9a03be0942a1..ed2361e4b904 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1858,6 +1858,54 @@ int vfio_info_add_capability(struct vfio_info_cap 
>> *caps, int cap_type_id,
>>  }
>>  EXPORT_SYMBOL(vfio_info_add_capability);
>>  
>> +int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int 
>> num_irqs,
>> +   int max_irq_type, size_t *data_size)
>> +{
>> +unsigned long minsz;
>> +size_t size;
>> +
>> +minsz = offsetofend(struct vfio_irq_set, count);
>> +
>> +if ((hdr->argsz < minsz) || (hdr->index >= max_irq_type) ||
>> +(hdr->count >= (U32_MAX - hdr->start)) ||
>> +(hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
>> +VFIO_IRQ_SET_ACTION_TYPE_MASK)))
>> +return -EINVAL;
>> +
>> +if (data_size)
> 
> Pointless check, the callers will pass non null pointer with value
> initialized to 0 anyway.
> 

Not always. When the VFIO_IRQ_SET_DATA_NONE flag is set, the caller can
pass data_size = NULL.

> 
>> +*data_size = 0;
>> +
>> +if (hdr->start >= num_irqs || hdr->start + hdr->count > num_irqs)
>> +return -EINVAL;
>> +
>> +switch (hdr->flags & VFIO_IRQ_SET_DATA_TYPE_MASK) {
>> +case VFIO_IRQ_SET_DATA_NONE:
>> +size = 0;
>> +break;
>> +case VFIO_IRQ_SET_DATA_BOOL:
>> +size = sizeof(uint8_t);
>> +break;
>> +case VFIO_IRQ_SET_DATA_EVENTFD:
>> +size = sizeof(int32_t);
>> +break;
>> +default:
>> +return -EINVAL;
>> +}
>> +
>> +if (size) {
> 
> The whole branch would even work for size == 0.
> 

In that case the check below (!data_size) would return an error if
data_size == NULL, whereas that is not an error case when size == 0, i.e.
when the VFIO_IRQ_SET_DATA_NONE flag is set.

>> +if (hdr->argsz - minsz < hdr->count * size)
>> +return -EINVAL;
>> +
>> +if (!data_size)
>> +return -EINVAL;
> 
> Redundant check as well.
> 

This is not redundant. The check above only sets the initial value to 0
when data_size is non-NULL; it doesn't fail when data_size is NULL.
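
To make the intended NULL handling concrete, here is a user-space
sketch of just that part (not the kernel code: enum data_type,
prepare_data_size() and the avail parameter are made up for
illustration, with avail standing in for hdr->argsz - minsz):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data-type codes; the real flags are VFIO_IRQ_SET_DATA_*. */
enum data_type { DATA_NONE, DATA_BOOL, DATA_EVENTFD };

static int prepare_data_size(enum data_type type, uint32_t count,
                             size_t avail, size_t *data_size)
{
    size_t size;

    /* Sketch-only guard so the multiplication below cannot overflow. */
    assert(count <= SIZE_MAX / sizeof(int32_t));

    /* A NULL data_size is tolerated here; only initialize if present. */
    if (data_size)
        *data_size = 0;

    switch (type) {
    case DATA_NONE:    size = 0;               break;
    case DATA_BOOL:    size = sizeof(uint8_t); break;
    case DATA_EVENTFD: size = sizeof(int32_t); break;
    default:           return -EINVAL;
    }

    if (size) {
        if (avail < count * size)
            return -EINVAL;
        /* Data is expected, so the caller must be able to receive it. */
        if (!data_size)
            return -EINVAL;
        *data_size = count * size;
    }
    return 0;
}
```

With DATA_NONE a NULL data_size succeeds; with DATA_BOOL or
DATA_EVENTFD it is rejected.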

>> +
>> +*data_size = hdr->count * size;
>> +}
>> +
>> +return 0;
>> +}
> 
> It does not really prepare anything as the name suggests. It looks like
> this is 2 different helpers actually:
> 
> int vfio_set_irqs_validate()
> and
> size_t vfio_set_irqs_hdr_to_data_size()
> 

The latter one is the prepare step.

> 
> And it would make it easier to review/bisect if 16/22 and 17/22 were merged
> into this one as this patch alone adds new code which it does not use and
> all 3 patches are fairly small.
>

I did have all 3 patches merged into one in an earlier version of the
patchset. They were split as per Alex's suggestion.

> 
>> +EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
> 
> Everything you export in this patchset is EXPORT_SYMBOL() while the
> existing code uses EXPORT_SYMBOL_GPL(), is this for a reason?
> 
> 

We want these symbols to be available to all drivers.

Thanks,
Kirti

>> +
>>  /*
>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>>   * domain only.
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index cf90393a11e2..87c9afecd822 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -116,6 +116,10 @@ extern void vfio_info_cap_shift(struct vfio_info_cap 
>> *caps, size_t offset);
>>  extern int vfio_info_add_capability(struct vfio_info_cap *caps,
>>  int cap_type_id, void *cap_type);
>>  
>> +extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>> +  int num_irqs, int max_irq_type,
>> +  size_t *data_size);
>> +
>>  struct pci_dev;
>>  #ifdef CONFIG_EEH
>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
>>
> 
>

Re: [Qemu-devel] [PATCH 3/3] Plumb the HAXM-based hardware acceleration support

2016-11-08 Thread Paolo Bonzini


> diff --git a/cpu-exec.c b/cpu-exec.c
> index 4188fed..4bd238b 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c

All this should not be needed anymore with unrestricted guest support.

> diff --git a/cpus.c b/cpus.c
> index fc78502..6e0f572 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -35,6 +35,7 @@
>  #include "sysemu/dma.h"
>  #include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
> +#include "sysemu/hax.h"
>  #include "qmp-commands.h"
>  #include "exec/exec-all.h"
>  
> @@ -1221,6 +1222,52 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
>  return NULL;
>  }
>  
> +static void *qemu_hax_cpu_thread_fn(void *arg)
> +{
> +CPUState *cpu = arg;
> +int r;
> +qemu_thread_get_self(cpu->thread);
> +qemu_mutex_lock(&qemu_global_mutex);
> +
> +cpu->thread_id = qemu_get_thread_id();
> +cpu->created = true;
> +cpu->halted = 0;
> +current_cpu = cpu;
> +
> +hax_init_vcpu(cpu);
> +qemu_cond_signal(&qemu_cpu_cond);
> +
> +while (1) {
> +if (cpu_can_run(cpu)) {
> +r = hax_smp_cpu_exec(cpu);
> +if (r == EXCP_DEBUG) {
> +cpu_handle_guest_debug(cpu);
> +}
> +}
> +
> +while (cpu_thread_is_idle(cpu)) {
> +qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);
> +}
> +
> +qemu_wait_io_event_common(cpu);
> +}
> +return NULL;
> +}
> +
> +
> +static void qemu_cpu_kick_no_halt(void)
> +{
> +CPUState *cpu;
> +/* Ensure whatever caused the exit has reached the CPU threads before
> + * writing exit_request.
> + */
> +atomic_mb_set(&exit_request, 1);
> +cpu = atomic_mb_read(&tcg_current_cpu);
> +if (cpu) {
> +cpu_exit(cpu);
> +}
> +}
> +
>  static void qemu_cpu_kick_thread(CPUState *cpu)
>  {
>  #ifndef _WIN32
> @@ -1235,28 +1282,52 @@ static void qemu_cpu_kick_thread(CPUState *cpu)
>  fprintf(stderr, "qemu:%s: %s", __func__, strerror(err));
>  exit(1);
>  }
> -#else /* _WIN32 */
> -abort();
> -#endif
> -}
>  
> -static void qemu_cpu_kick_no_halt(void)
> -{
> -CPUState *cpu;
> -/* Ensure whatever caused the exit has reached the CPU threads before
> - * writing exit_request.
> +#ifdef CONFIG_DARWIN
> +/* The cpu thread cannot catch it reliably when shutdown the guest on 
> Mac.
> + * We can double check it and resend it
>   */
> -atomic_mb_set(&exit_request, 1);
> -cpu = atomic_mb_read(&tcg_current_cpu);
> -if (cpu) {
> -cpu_exit(cpu);
> +if (!exit_request)
> +qemu_cpu_kick_no_halt();

This must be a conflict resolved wrong.  exit_request is never read by
the HAX code.

> +if (hax_enabled() && hax_ug_platform())
> +cpu->exit_request = 1;
> +#endif
> +#else /* _WIN32 */
> +if (!qemu_cpu_is_self(cpu)) {
> +CONTEXT tcgContext;
> +
> +if (SuspendThread(cpu->hThread) == (DWORD)-1) {
> +fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
> +GetLastError());
> +exit(1);
> +}
> +
> +/* On multi-core systems, we are not sure that the thread is actually
> + * suspended until we can get the context.
> + */
> +tcgContext.ContextFlags = CONTEXT_CONTROL;
> +while (GetThreadContext(cpu->hThread, &tcgContext) != 0) {
> +continue;
> +}
> +
> +qemu_cpu_kick_no_halt();
> +if (hax_enabled() && hax_ug_platform())
> +cpu->exit_request = 1;
> +
> +if (ResumeThread(cpu->hThread) == (DWORD)-1) {
> +fprintf(stderr, "qemu:%s: GetLastError:%lu\n", __func__,
> +GetLastError());
> +exit(1);
> +}

This is weird too.  The SuspendThread/ResumeThread dance comes from an
old version of QEMU.  It is not needed anymore and, again,
qemu_cpu_kick_no_halt would only be useful if hax_ug_platform() is false.

Here, Linux/KVM uses a signal and pthread_kill.  It's probably good for
HAX on Darwin too, but not on Windows.  It's possible that
SuspendThread/ResumeThread just does the right thing (sort of by
chance), in which case you can just keep it (removing
qemu_cpu_kick_no_halt).  However, there is a hax_raise_event in patch 2
that is unused.  If you can figure out how to use it, that would be better.
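
For illustration, the signal-based kick looks roughly like this in
plain POSIX terms. This is a generic sketch, not QEMU's code:
exit_requested, vcpu_thread() and kick_demo() are made-up names, and
sigsuspend() stands in for the blocking "run the guest" call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

static atomic_int exit_requested;  /* hypothetical exit-request flag */

static void kick_handler(int sig)
{
    (void)sig;  /* the handler only has to interrupt the blocking call */
}

static void *vcpu_thread(void *arg)
{
    sigset_t wait_mask;

    (void)arg;
    /* SIGUSR1 is blocked here (inherited from the creating thread). */
    pthread_sigmask(SIG_SETMASK, NULL, &wait_mask);
    assert(sigismember(&wait_mask, SIGUSR1) == 1);
    sigdelset(&wait_mask, SIGUSR1);

    /* Unblock the kick signal only while waiting, so no kick is lost. */
    while (!atomic_load(&exit_requested))
        sigsuspend(&wait_mask);
    return NULL;
}

/* Returns 0 once the worker has been kicked and joined. */
static int kick_demo(void)
{
    struct sigaction sa;
    sigset_t block;
    pthread_t tid;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = kick_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (pthread_sigmask(SIG_BLOCK, &block, NULL) != 0)
        return -1;

    if (pthread_create(&tid, NULL, vcpu_thread, NULL) != 0)
        return -1;

    /* Publish the request first, then kick the thread. */
    atomic_store(&exit_requested, 1);
    rc = pthread_kill(tid, SIGUSR1);
    if (rc != 0 && rc != ESRCH)  /* ESRCH: worker already finished */
        return -1;

    return pthread_join(tid, NULL);
}
```

Keeping the signal blocked except inside the wait is what avoids the
window where a kick arrives just before the thread blocks.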



> @@ -1617,6 +1618,21 @@ static void ram_block_add(RAMBlock *new_block, Error 
> **errp)
>  } else {
>  new_block->host = phys_mem_alloc(new_block->max_length,
>   &new_block->mr->align);
> +/*
> + * In Hax, the qemu allocate the virtual address, and HAX kernel
> + * populate the memory with physical memory. Currently we have no
> + * paging, so user should make sure enough free memory in advance
> + */
> +if (hax_enabled()) {
> +int ret;
> +ret = hax_populate_ram((uint64_t)(uintptr_t)new_block->host,
> +

[Qemu-devel] [kvm-unit-tests PATCH v4 07/11] arm/arm64: gicv2: add an IPI test

2016-11-08 Thread Andrew Jones

Reviewed-by: Eric Auger 
Signed-off-by: Andrew Jones 
---
v4: properly mask irqnr in ipi_handler
v2: add more details in the output if a test fails,
report spurious interrupts if we get them
---
 arm/Makefile.common |   6 +-
 arm/gic.c   | 195 
 arm/unittests.cfg   |   7 ++
 lib/arm/asm/gic.h   |   6 ++
 4 files changed, 211 insertions(+), 3 deletions(-)
 create mode 100644 arm/gic.c

diff --git a/arm/Makefile.common b/arm/Makefile.common
index 41239c37e092..bc38183ab86e 100644
--- a/arm/Makefile.common
+++ b/arm/Makefile.common
@@ -9,9 +9,9 @@ ifeq ($(LOADADDR),)
LOADADDR = 0x40000000
 endif
 
-tests-common = \
-   $(TEST_DIR)/selftest.flat \
-   $(TEST_DIR)/spinlock-test.flat
+tests-common  = $(TEST_DIR)/selftest.flat
+tests-common += $(TEST_DIR)/spinlock-test.flat
+tests-common += $(TEST_DIR)/gic.flat
 
 all: test_cases
 
diff --git a/arm/gic.c b/arm/gic.c
new file mode 100644
index ..efefab7296d4
--- /dev/null
+++ b/arm/gic.c
@@ -0,0 +1,195 @@
+/*
+ * GIC tests
+ *
+ * GICv2
+ *   + test sending/receiving IPIs
+ *
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int gic_version;
+static int acked[NR_CPUS], spurious[NR_CPUS];
+static cpumask_t ready;
+
+static void nr_cpu_check(int nr)
+{
+   if (nr_cpus < nr)
+   report_abort("At least %d cpus required", nr);
+}
+
+static void wait_on_ready(void)
+{
+   cpumask_set_cpu(smp_processor_id(), &ready);
+   while (!cpumask_full(&ready))
+   cpu_relax();
+}
+
+static void check_acked(cpumask_t *mask)
+{
+   int missing = 0, extra = 0, unexpected = 0;
+   int nr_pass, cpu, i;
+
+   /* Wait up to 5s for all interrupts to be delivered */
+   for (i = 0; i < 50; ++i) {
+   mdelay(100);
+   nr_pass = 0;
+   for_each_present_cpu(cpu) {
+   smp_rmb();
+   nr_pass += cpumask_test_cpu(cpu, mask) ?
+   acked[cpu] == 1 : acked[cpu] == 0;
+   }
+   if (nr_pass == nr_cpus) {
+   report("Completed in %d ms", true, ++i * 100);
+   return;
+   }
+   }
+
+   for_each_present_cpu(cpu) {
+   if (cpumask_test_cpu(cpu, mask)) {
+   if (!acked[cpu])
+   ++missing;
+   else if (acked[cpu] > 1)
+   ++extra;
+   } else {
+   if (acked[cpu])
+   ++unexpected;
+   }
+   }
+
+   report("Timed-out (5s). ACKS: missing=%d extra=%d unexpected=%d",
+  false, missing, extra, unexpected);
+}
+
+static void ipi_handler(struct pt_regs *regs __unused)
+{
+   u32 irqstat = readl(gicv2_cpu_base() + GIC_CPU_INTACK);
+   u32 irqnr = irqstat & GICC_IAR_INT_ID_MASK;
+
+   if (irqnr != GICC_INT_SPURIOUS) {
+   writel(irqstat, gicv2_cpu_base() + GIC_CPU_EOI);
+   smp_rmb(); /* pairs with wmb in ipi_test functions */
+   ++acked[smp_processor_id()];
+   smp_wmb(); /* pairs with rmb in check_acked */
+   } else {
+   ++spurious[smp_processor_id()];
+   smp_wmb();
+   }
+}
+
+static void ipi_test_self(void)
+{
+   cpumask_t mask;
+
+   report_prefix_push("self");
+   memset(acked, 0, sizeof(acked));
+   smp_wmb();
+   cpumask_clear(&mask);
+   cpumask_set_cpu(0, &mask);
+   writel(2 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   check_acked(&mask);
+   report_prefix_pop();
+}
+
+static void ipi_test_smp(void)
+{
+   cpumask_t mask;
+   unsigned long tlist;
+
+   report_prefix_push("target-list");
+   memset(acked, 0, sizeof(acked));
+   smp_wmb();
+   tlist = cpumask_bits(&cpu_present_mask)[0] & 0xaa;
+   cpumask_bits(&mask)[0] = tlist;
+   writel((u8)tlist << 16, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   check_acked(&mask);
+   report_prefix_pop();
+
+   report_prefix_push("broadcast");
+   memset(acked, 0, sizeof(acked));
+   smp_wmb();
+   cpumask_copy(&mask, &cpu_present_mask);
+   cpumask_clear_cpu(0, &mask);
+   writel(1 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   check_acked(&mask);
+   report_prefix_pop();
+}
+
+static void ipi_enable(void)
+{
+   gicv2_enable_defaults();
+#ifdef __arm__
+   install_exception_handler(EXCPTN_IRQ, ipi_handler);
+#else
+   install_irq_handler(EL1H_IRQ, ipi_handler);
+#endif
+   local_irq_enable();
+}
+
+static void ipi_recv(void)
+{
+   ipi_enable();
+   cpumask_set_cpu(smp_processor_id(), &ready);
+   while (1)
+   wfi();
+}
+
+int

[Qemu-devel] [kvm-unit-tests PATCH v4 09/11] arm/arm64: add initial gicv3 support

2016-11-08 Thread Andrew Jones

Signed-off-by: Andrew Jones 

---
v4:
 - only take defines from kernel we need now [Andre]
 - simplify enable by not caring if we reinit the distributor [drew]
v2:
 - configure irqs as NS GRP1
---
 lib/arm/asm/arch_gicv3.h   | 42 +
 lib/arm/asm/gic-v3.h   | 92 ++
 lib/arm/asm/gic.h  |  1 +
 lib/arm/gic.c  | 56 
 lib/arm64/asm/arch_gicv3.h | 44 ++
 lib/arm64/asm/gic-v3.h |  1 +
 lib/arm64/asm/sysreg.h | 44 ++
 7 files changed, 280 insertions(+)
 create mode 100644 lib/arm/asm/arch_gicv3.h
 create mode 100644 lib/arm/asm/gic-v3.h
 create mode 100644 lib/arm64/asm/arch_gicv3.h
 create mode 100644 lib/arm64/asm/gic-v3.h
 create mode 100644 lib/arm64/asm/sysreg.h

diff --git a/lib/arm/asm/arch_gicv3.h b/lib/arm/asm/arch_gicv3.h
new file mode 100644
index ..81a1e5f6c29c
--- /dev/null
+++ b/lib/arm/asm/arch_gicv3.h
@@ -0,0 +1,42 @@
+/*
+ * All ripped off from arch/arm/include/asm/arch_gicv3.h
+ *
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#ifndef _ASMARM_ARCH_GICV3_H_
+#define _ASMARM_ARCH_GICV3_H_
+
+#ifndef __ASSEMBLY__
+#include 
+#include 
+#include 
+
+#define __stringify xstr
+
+#define __ACCESS_CP15(CRn, Op1, CRm, Op2)  p15, Op1, %0, CRn, CRm, Op2
+
+#define ICC_PMR            __ACCESS_CP15(c4, 0, c6, 0)
+#define ICC_IGRPEN1        __ACCESS_CP15(c12, 0, c12, 7)
+
+static inline void gicv3_write_pmr(u32 val)
+{
+   asm volatile("mcr " __stringify(ICC_PMR) : : "r" (val));
+}
+
+static inline void gicv3_write_grpen1(u32 val)
+{
+   asm volatile("mcr " __stringify(ICC_IGRPEN1) : : "r" (val));
+   isb();
+}
+
+static inline u64 gicv3_read_typer(const volatile void __iomem *addr)
+{
+   u64 val = readl(addr);
+   val |= (u64)readl(addr + 4) << 32;
+   return val;
+}
+
+#endif /* !__ASSEMBLY__ */
+#endif /* _ASMARM_ARCH_GICV3_H_ */
diff --git a/lib/arm/asm/gic-v3.h b/lib/arm/asm/gic-v3.h
new file mode 100644
index ..03321f8c860f
--- /dev/null
+++ b/lib/arm/asm/gic-v3.h
@@ -0,0 +1,92 @@
+/*
+ * All GIC* defines are lifted from include/linux/irqchip/arm-gic-v3.h
+ *
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#ifndef _ASMARM_GIC_V3_H_
+#define _ASMARM_GIC_V3_H_
+
+#ifndef _ASMARM_GIC_H_
+#error Do not directly include . Include 
+#endif
+
+#define GICD_CTLR  0x0000
+#define GICD_TYPER 0x0004
+#define GICD_IGROUPR   0x0080
+
+#define GICD_CTLR_RWP  (1U << 31)
+#define GICD_CTLR_ARE_NS   (1U << 4)
+#define GICD_CTLR_ENABLE_G1A   (1U << 1)
+#define GICD_CTLR_ENABLE_G1    (1U << 0)
+
+#define GICR_TYPER 0x0008
+#define GICR_IGROUPR0  GICD_IGROUPR
+#define GICR_TYPER_LAST    (1U << 4)
+
+
+#include 
+
+#ifndef __ASSEMBLY__
+#include 
+#include 
+#include 
+#include 
+
+struct gicv3_data {
+   void *dist_base;
+   void *redist_base[NR_CPUS];
+   unsigned int irq_nr;
+};
+extern struct gicv3_data gicv3_data;
+
+#define gicv3_dist_base()  (gicv3_data.dist_base)
+#define gicv3_redist_base()    (gicv3_data.redist_base[smp_processor_id()])
+#define gicv3_sgi_base()       (gicv3_data.redist_base[smp_processor_id()] + SZ_64K)
+
+extern int gicv3_init(void);
+extern void gicv3_enable_defaults(void);
+
+static inline void gicv3_do_wait_for_rwp(void *base)
+{
+   int count = 100000; /* 1s */
+
+   while (readl(base + GICD_CTLR) & GICD_CTLR_RWP) {
+   if (!--count) {
+   printf("GICv3: RWP timeout!\n");
+   abort();
+   }
+   cpu_relax();
+   udelay(10);
+   };
+}
+
+static inline void gicv3_dist_wait_for_rwp(void)
+{
+   gicv3_do_wait_for_rwp(gicv3_dist_base());
+}
+
+static inline void gicv3_redist_wait_for_rwp(void)
+{
+   gicv3_do_wait_for_rwp(gicv3_redist_base());
+}
+
+static inline u32 mpidr_compress(u64 mpidr)
+{
+   u64 compressed = mpidr & MPIDR_HWID_BITMASK;
+
+   compressed = (((compressed >> 32) & 0xff) << 24) | compressed;
+   return compressed;
+}
+
+static inline u64 mpidr_uncompress(u32 compressed)
+{
+   u64 mpidr = ((u64)compressed >> 24) << 32;
+
+   mpidr |= compressed & MPIDR_HWID_BITMASK;
+   return mpidr;
+}
+
+#endif /* !__ASSEMBLY__ */
+#endif /* _ASMARM_GIC_V3_H_ */
diff --git a/lib/arm/asm/gic.h b/lib/arm/asm/gic.h
index 328e078a9ae1..4897bc592cdd 100644
--- a/lib/arm/asm/gic.h
+++ b/lib/arm/asm/gic.h
@@ -7,6 +7,7 @@
 #define _ASMARM_GIC_H_
 
 #include 
+#include 
 
 #define GIC_CPU_CTRL

[Qemu-devel] [kvm-unit-tests PATCH v4 08/11] libcflat: add IS_ALIGNED() macro, and page sizes

2016-11-08 Thread Andrew Jones

From: Peter Xu 

These macros will be useful to do page alignment checks.

Signed-off-by: Peter Xu 
[drew: also added SZ_64K]
Signed-off-by: Andrew Jones 
---
 lib/libcflat.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 82005f5d014f..143fc53061fe 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -33,6 +33,12 @@
 #define __ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
 #define __ALIGN(x, a)  __ALIGN_MASK(x, (typeof(x))(a) - 1)
 #define ALIGN(x, a)            __ALIGN((x), (a))
+#define IS_ALIGNED(x, a)       (((x) & ((typeof(x))(a) - 1)) == 0)
+
+#define SZ_4K                  (0x1000)
+#define SZ_64K                 (0x10000)
+#define SZ_2M                  (0x200000)
+#define SZ_1G                  (0x40000000)
 
 typedef uint8_t                u8;
 typedef int8_t                 s8;
-- 
2.7.4
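
For illustration only, a minimal sketch of the kind of page alignment
check these macros allow. The helper names is_page_aligned() and
is_64k_region() are made up for this example and are not part of the
patch; __typeof__ is used in place of typeof so that the snippet also
builds in strict ISO modes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the patch's macro; __typeof__ is the GNU C spelling. */
#define IS_ALIGNED(x, a)	(((x) & ((__typeof__(x))(a) - 1)) == 0)

#define SZ_4K	(0x1000)
#define SZ_64K	(0x10000)

/* Hypothetical helper: is addr on a 4K page boundary? */
static bool is_page_aligned(uintptr_t addr)
{
	return IS_ALIGNED(addr, SZ_4K);
}

/* Hypothetical helper: does [addr, addr + len) consist of whole 64K pages? */
static bool is_64k_region(uintptr_t addr, uintptr_t len)
{
	return IS_ALIGNED(addr, SZ_64K) && IS_ALIGNED(len, SZ_64K);
}
```

Note that the mask trick only works when the alignment is a power of
two, which holds for all of the SZ_* constants.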

[Qemu-devel] [kvm-unit-tests PATCH v4 06/11] arm/arm64: add initial gicv2 support

2016-11-08 Thread Andrew Jones

Add some gicv2 support. This just adds init and enable
functions, allowing unit tests to start messing with it.

Signed-off-by: Andrew Jones 

---
v4:
 - only take defines from kernel we need now [Andre]
 - moved defines to asm/gic.h so they'll be shared with v3 [drew]
 - simplify enable by not caring if we reinit the distributor [drew]
 - init all GICD_INT_DEF_PRI_X4 registers [Eric]
---
 arm/Makefile.common|  1 +
 lib/arm/asm/gic-v2.h   | 28 +++
 lib/arm/asm/gic.h  | 44 +
 lib/arm/gic.c  | 75 ++
 lib/arm64/asm/gic-v2.h |  1 +
 lib/arm64/asm/gic.h|  1 +
 6 files changed, 150 insertions(+)
 create mode 100644 lib/arm/asm/gic-v2.h
 create mode 100644 lib/arm/asm/gic.h
 create mode 100644 lib/arm/gic.c
 create mode 100644 lib/arm64/asm/gic-v2.h
 create mode 100644 lib/arm64/asm/gic.h

diff --git a/arm/Makefile.common b/arm/Makefile.common
index ccb554d9251a..41239c37e092 100644
--- a/arm/Makefile.common
+++ b/arm/Makefile.common
@@ -42,6 +42,7 @@ cflatobjs += lib/arm/mmu.o
 cflatobjs += lib/arm/bitops.o
 cflatobjs += lib/arm/psci.o
 cflatobjs += lib/arm/smp.o
+cflatobjs += lib/arm/gic.o
 
 libeabi = lib/arm/libeabi.a
 eabiobjs = lib/arm/eabi_compat.o
diff --git a/lib/arm/asm/gic-v2.h b/lib/arm/asm/gic-v2.h
new file mode 100644
index ..f91530f88355
--- /dev/null
+++ b/lib/arm/asm/gic-v2.h
@@ -0,0 +1,28 @@
+/*
+ * All GIC* defines are lifted from include/linux/irqchip/arm-gic.h
+ *
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#ifndef _ASMARM_GIC_V2_H_
+#define _ASMARM_GIC_V2_H_
+
+#ifndef _ASMARM_GIC_H_
+#error Do not directly include . Include 
+#endif
+
+struct gicv2_data {
+   void *dist_base;
+   void *cpu_base;
+   unsigned int irq_nr;
+};
+extern struct gicv2_data gicv2_data;
+
+#define gicv2_dist_base()  (gicv2_data.dist_base)
+#define gicv2_cpu_base()   (gicv2_data.cpu_base)
+
+extern int gicv2_init(void);
+extern void gicv2_enable_defaults(void);
+
+#endif /* _ASMARM_GIC_V2_H_ */
diff --git a/lib/arm/asm/gic.h b/lib/arm/asm/gic.h
new file mode 100644
index ..ec92f1064dc0
--- /dev/null
+++ b/lib/arm/asm/gic.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#ifndef _ASMARM_GIC_H_
+#define _ASMARM_GIC_H_
+
+#include 
+
+#define GIC_CPU_CTRL           0x00
+#define GIC_CPU_PRIMASK        0x04
+
+#define GICC_ENABLE            0x1
+#define GICC_INT_PRI_THRESHOLD 0xf0
+
+#define GIC_DIST_CTRL          0x000
+#define GIC_DIST_CTR           0x004
+#define GIC_DIST_ENABLE_SET    0x100
+#define GIC_DIST_PRI           0x400
+
+#define GICD_ENABLE            0x1
+#define GICD_INT_EN_SET_SGI    0x0000ffff
+#define GICD_INT_DEF_PRI       0xa0
+#define GICD_INT_DEF_PRI_X4    ((GICD_INT_DEF_PRI << 24) |\
+                               (GICD_INT_DEF_PRI << 16) |\
+                               (GICD_INT_DEF_PRI << 8) |\
+                               GICD_INT_DEF_PRI)
+
+#define GICD_TYPER_IRQS(typer) ((((typer) & 0x1f) + 1) * 32)
+
+#ifndef __ASSEMBLY__
+
+/*
+ * gic_init will try to find all known gics, and then
+ * initialize the gic data for the one found.
+ * returns
+ *  0   : no gic was found
+ *  > 0 : the gic version of the gic found
+ */
+extern int gic_init(void);
+
+#endif /* !__ASSEMBLY__ */
+#endif /* _ASMARM_GIC_H_ */
diff --git a/lib/arm/gic.c b/lib/arm/gic.c
new file mode 100644
index ..91d78c9a0cc2
--- /dev/null
+++ b/lib/arm/gic.c
@@ -0,0 +1,75 @@
+/*
+ * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include 
+#include 
+#include 
+
+struct gicv2_data gicv2_data;
+
+/*
+ * Documentation/devicetree/bindings/interrupt-controller/arm,gic.txt
+ */
+static bool
+gic_get_dt_bases(const char *compatible, void **base1, void **base2)
+{
+   struct dt_pbus_reg reg;
+   struct dt_device gic;
+   struct dt_bus bus;
+   int node, ret;
+
+   dt_bus_init_defaults(&bus);
+   dt_device_init(&gic, &bus, NULL);
+
+   node = dt_device_find_compatible(&gic, compatible);
+   assert(node >= 0 || node == -FDT_ERR_NOTFOUND);
+
+   if (node == -FDT_ERR_NOTFOUND)
+       return false;
+
+   dt_device_bind_node(&gic, node);
+
+   ret = dt_pbus_translate(&gic, 0, &reg);
+   assert(ret == 0);
+   *base1 = ioremap(reg.addr, reg.size);
+
+   ret = dt_pbus_translate(&gic, 1, &reg);
+   assert(ret == 0);
+   *base2 = ioremap(reg.addr, reg.size);
+
+   return true;
+}
+
+int gicv2_init(void)
+{
+   return

[Qemu-devel] [kvm-unit-tests PATCH v4 11/11] arm/arm64: gic: don't just use zero

2016-11-08 Thread Andrew Jones

Allow user to select who sends ipis and with which irq,
rather than just always sending irq=0 from cpu0.

Signed-off-by: Andrew Jones 

---
v4: improve structure and make sure spurious checking is
done even when the sender isn't cpu0
v2: actually check that the irq received was the irq sent,
and (for gicv2) that the sender is the expected one.
---
 arm/gic.c | 99 ++-
 1 file changed, 73 insertions(+), 26 deletions(-)

diff --git a/arm/gic.c b/arm/gic.c
index d98ca6b9efd5..8e50bc1b35e0 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -11,6 +11,7 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,8 @@ static struct gic *gic;
 static int gic_version;
 static int acked[NR_CPUS], spurious[NR_CPUS];
 static cpumask_t ready;
+static int sender;
+static u32 irq;
 
 static void nr_cpu_check(int nr)
 {
@@ -86,7 +89,16 @@ static void check_acked(cpumask_t *mask)
 
 static u32 gicv2_read_iar(void)
 {
-   return readl(gicv2_cpu_base() + GIC_CPU_INTACK);
+   u32 iar = readl(gicv2_cpu_base() + GIC_CPU_INTACK);
+   int src = (iar >> 10) & 7;
+
+   if (src != sender) {
+   report("cpu%d received IPI from unexpected source cpu%d "
+  "(expected cpu%d)",
+  false, smp_processor_id(), src, sender);
+   }
+
+   return iar;
 }
 
 static u32 gicv2_irqnr(u32 iar)
@@ -111,9 +123,15 @@ static void ipi_handler(struct pt_regs *regs __unused)
 
if (irqnr != GICC_INT_SPURIOUS) {
gic->write_eoi(irqstat);
-   smp_rmb(); /* pairs with wmb in ipi_test functions */
-   ++acked[smp_processor_id()];
-   smp_wmb(); /* pairs with rmb in check_acked */
+   if (irqnr == irq) {
+   smp_rmb(); /* pairs with wmb in ipi_test functions */
+   ++acked[smp_processor_id()];
+   smp_wmb(); /* pairs with rmb in check_acked */
+   } else {
+   report("cpu%d received unexpected irq %u "
+  "(expected %u)",
+  false, smp_processor_id(), irqnr, irq);
+   }
} else {
++spurious[smp_processor_id()];
smp_wmb();
@@ -122,19 +140,19 @@ static void ipi_handler(struct pt_regs *regs __unused)
 
 static void gicv2_ipi_send_self(void)
 {
-   writel(2 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   writel(2 << 24 | irq, gicv2_dist_base() + GIC_DIST_SOFTINT);
 }
 
 static void gicv2_ipi_send_tlist(cpumask_t *mask)
 {
u8 tlist = (u8)cpumask_bits(mask)[0];
 
-   writel(tlist << 16, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   writel(tlist << 16 | irq, gicv2_dist_base() + GIC_DIST_SOFTINT);
 }
 
 static void gicv2_ipi_send_broadcast(void)
 {
-   writel(1 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+   writel(1 << 24 | irq, gicv2_dist_base() + GIC_DIST_SOFTINT);
 }
 
 #define ICC_SGI1R_AFFINITY_1_SHIFT 16
@@ -200,7 +218,7 @@ static void gicv3_ipi_send_tlist(cpumask_t *mask)
/* Send the IPIs for the target list of this cluster */
sgi1r = (MPIDR_TO_SGI_AFFINITY(cluster_id, 3)   |
 MPIDR_TO_SGI_AFFINITY(cluster_id, 2)   |
-/* irq << 24   | */
+irq << 24  |
 MPIDR_TO_SGI_AFFINITY(cluster_id, 1)   |
 tlist);
 
@@ -222,7 +240,7 @@ static void gicv3_ipi_send_self(void)
 
 static void gicv3_ipi_send_broadcast(void)
 {
-   gicv3_write_sgi1r(1ULL << 40);
+   gicv3_write_sgi1r(1ULL << 40 | irq << 24);
isb();
 }
 
@@ -234,7 +252,7 @@ static void ipi_test_self(void)
memset(acked, 0, sizeof(acked));
smp_wmb();
cpumask_clear();
-   cpumask_set_cpu(0, );
+   cpumask_set_cpu(smp_processor_id(), );
gic->ipi.send_self();
check_acked();
report_prefix_pop();
@@ -249,7 +267,7 @@ static void ipi_test_smp(void)
memset(acked, 0, sizeof(acked));
smp_wmb();
cpumask_copy(, _present_mask);
-   for (i = 0; i < nr_cpus; i += 2)
+   for (i = smp_processor_id() & 1; i < nr_cpus; i += 2)
cpumask_clear_cpu(i, );
gic->ipi.send_tlist();
check_acked();
@@ -259,7 +277,7 @@ static void ipi_test_smp(void)
memset(acked, 0, sizeof(acked));
smp_wmb();
cpumask_copy(, _present_mask);
-   cpumask_clear_cpu(0, );
+   cpumask_clear_cpu(smp_processor_id(), );
gic->ipi.send_broadcast();
check_acked();
report_prefix_pop();
@@ -276,6 +294,27 @@ static void ipi_enable(void)
local_irq_enable();
 }
 
+static void ipi_send(void)
+{
+   int cpu;
+
+   ipi_enable();
+
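
As a rough sketch of what the writel() calls in this patch encode:
GICD_SGIR takes the target list filter in bits [25:24], the cpu target
list in bits [23:16] and the SGI number in bits [3:0], while GICC_IAR
reports the source cpu in bits [12:10] and the interrupt ID in bits
[9:0]. The enum and the helper names below are hypothetical, and masking
irq to four bits is an assumption of this example (the patch writes irq
unmasked).

```c
#include <assert.h>
#include <stdint.h>

/* GICD_SGIR target list filter values, matching the patch's shifts by 24. */
enum sgi_filter {
	SGI_FILTER_TLIST  = 0,	/* send to the cpus named in the target list */
	SGI_FILTER_OTHERS = 1,	/* send to every cpu except the sender */
	SGI_FILTER_SELF   = 2,	/* send to the sender only */
};

/* Hypothetical helper: compose the value written to GIC_DIST_SOFTINT. */
static uint32_t gicv2_sgir(enum sgi_filter filter, uint8_t tlist, uint32_t irq)
{
	return (uint32_t)filter << 24 | (uint32_t)tlist << 16 | (irq & 0xf);
}

/* Source cpu of a received SGI, as extracted in gicv2_read_iar(). */
static int gicv2_iar_src(uint32_t iar)
{
	return (iar >> 10) & 7;
}

/* Interrupt ID of a received interrupt (low ten bits of the IAR). */
static uint32_t gicv2_iar_irqnr(uint32_t iar)
{
	return iar & 0x3ff;
}
```

With this layout, "self" with irq=1 is 0x02000001, and an IAR value of
0x805 decodes to irq 5 sent by cpu2.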

[Qemu-devel] [kvm-unit-tests PATCH v4 05/11] arm/arm64: irq enable/disable

2016-11-08 Thread Andrew Jones

Reviewed-by: Alex Bennée 
Reviewed-by: Eric Auger 
Signed-off-by: Andrew Jones 
---
 lib/arm/asm/processor.h   | 10 ++
 lib/arm64/asm/processor.h | 10 ++
 2 files changed, 20 insertions(+)

diff --git a/lib/arm/asm/processor.h b/lib/arm/asm/processor.h
index afc903ca7d4a..75a8d08b8933 100644
--- a/lib/arm/asm/processor.h
+++ b/lib/arm/asm/processor.h
@@ -35,6 +35,16 @@ static inline unsigned long current_cpsr(void)
 
 #define current_mode() (current_cpsr() & MODE_MASK)
 
+static inline void local_irq_enable(void)
+{
+   asm volatile("cpsie i" : : : "memory", "cc");
+}
+
+static inline void local_irq_disable(void)
+{
+   asm volatile("cpsid i" : : : "memory", "cc");
+}
+
 static inline unsigned int get_mpidr(void)
 {
unsigned int mpidr;
diff --git a/lib/arm64/asm/processor.h b/lib/arm64/asm/processor.h
index 94f7ce35b65c..d54a4ed1c187 100644
--- a/lib/arm64/asm/processor.h
+++ b/lib/arm64/asm/processor.h
@@ -68,6 +68,16 @@ static inline unsigned long current_level(void)
return el & 0xc;
 }
 
+static inline void local_irq_enable(void)
+{
+   asm volatile("msr daifclr, #2" : : : "memory");
+}
+
+static inline void local_irq_disable(void)
+{
+   asm volatile("msr daifset, #2" : : : "memory");
+}
+
 #define DEFINE_GET_SYSREG(reg, type)   \
 static inline type get_##reg(void) \
 {  \
-- 
2.7.4

[Qemu-devel] [kvm-unit-tests PATCH v4 03/11] arm/arm64: smp: support more than 8 cpus

2016-11-08 Thread Andrew Jones

By adding support for launching with gicv3 we can break the 8 vcpu
limit. This patch adds support to smp code and also selects the
vgic model corresponding to the host. The vgic model may also be
manually selected by adding e.g. -machine gic-version=3 to
extra_params.

Reviewed-by: Alex Bennée 
Signed-off-by: Andrew Jones 

---
v4: improved commit message
---
 arm/run   | 19 ---
 arm/selftest.c|  5 -
 lib/arm/asm/processor.h   |  9 +++--
 lib/arm/asm/setup.h   |  4 ++--
 lib/arm/setup.c   | 12 +++-
 lib/arm64/asm/processor.h |  9 +++--
 6 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/arm/run b/arm/run
index a2f35ef6a7e6..2d0698619606 100755
--- a/arm/run
+++ b/arm/run
@@ -31,13 +31,6 @@ if [ -z "$ACCEL" ]; then
fi
 fi
 
-if [ "$HOST" = "aarch64" ] && [ "$ACCEL" = "kvm" ]; then
-   processor="host"
-   if [ "$ARCH" = "arm" ]; then
-   processor+=",aarch64=off"
-   fi
-fi
-
 qemu="${QEMU:-qemu-system-$ARCH_NAME}"
 qpath=$(which $qemu 2>/dev/null)
 
@@ -53,6 +46,18 @@ fi
 
 M='-machine virt'
 
+if [ "$ACCEL" = "kvm" ]; then
+   if $qemu $M,\? 2>&1 | grep gic-version > /dev/null; then
+   M+=',gic-version=host'
+   fi
+   if [ "$HOST" = "aarch64" ]; then
+   processor="host"
+   if [ "$ARCH" = "arm" ]; then
+   processor+=",aarch64=off"
+   fi
+   fi
+fi
+
 if ! $qemu $M -device '?' 2>&1 | grep virtconsole > /dev/null; then
echo "$qpath doesn't support virtio-console for chr-testdev. Exiting."
exit 2
diff --git a/arm/selftest.c b/arm/selftest.c
index 196164f5313d..2f117f795d2d 100644
--- a/arm/selftest.c
+++ b/arm/selftest.c
@@ -312,9 +312,10 @@ static bool psci_check(void)
 static cpumask_t smp_reported;
 static void cpu_report(void)
 {
+   unsigned long mpidr = get_mpidr();
int cpu = smp_processor_id();
 
-   report("CPU%d online", true, cpu);
+   report("CPU(%3d) mpidr=%lx", mpidr_to_cpu(mpidr) == cpu, cpu, mpidr);
    cpumask_set_cpu(cpu, &smp_reported);
halt();
 }
@@ -343,6 +344,7 @@ int main(int argc, char **argv)
 
} else if (strcmp(argv[1], "smp") == 0) {
 
+   unsigned long mpidr = get_mpidr();
int cpu;
 
report("PSCI version", psci_check());
@@ -353,6 +355,7 @@ int main(int argc, char **argv)
smp_boot_secondary(cpu, cpu_report);
}
 
+   report("CPU(%3d) mpidr=%lx", mpidr_to_cpu(mpidr) == 0, 0, mpidr);
    cpumask_set_cpu(0, &smp_reported);
    while (!cpumask_full(&smp_reported))
cpu_relax();
diff --git a/lib/arm/asm/processor.h b/lib/arm/asm/processor.h
index f25e7eee3666..d2048f5f5f7e 100644
--- a/lib/arm/asm/processor.h
+++ b/lib/arm/asm/processor.h
@@ -40,8 +40,13 @@ static inline unsigned int get_mpidr(void)
return mpidr;
 }
 
-/* Only support Aff0 for now, up to 4 cpus */
-#define mpidr_to_cpu(mpidr) ((int)((mpidr) & 0xff))
+#define MPIDR_HWID_BITMASK 0xff
+extern int mpidr_to_cpu(unsigned long mpidr);
+
+#define MPIDR_LEVEL_SHIFT(level) \
+   (((1 << level) >> 1) << 3)
+#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
+   ((mpidr >> MPIDR_LEVEL_SHIFT(level)) & 0xff)
 
 extern void start_usr(void (*func)(void *arg), void *arg, unsigned long sp_usr);
 extern bool is_user(void);
diff --git a/lib/arm/asm/setup.h b/lib/arm/asm/setup.h
index cb8fdbd38dd5..b0d51f5f0721 100644
--- a/lib/arm/asm/setup.h
+++ b/lib/arm/asm/setup.h
@@ -10,8 +10,8 @@
 #include 
 #include 
 
-#define NR_CPUS                8
-extern u32 cpus[NR_CPUS];
+#define NR_CPUS                255
+extern u64 cpus[NR_CPUS];      /* per-cpu IDs (MPIDRs) */
 extern int nr_cpus;
 
 #define NR_MEM_REGIONS 8
diff --git a/lib/arm/setup.c b/lib/arm/setup.c
index 7e7b39f11dde..b6e2d5815e72 100644
--- a/lib/arm/setup.c
+++ b/lib/arm/setup.c
@@ -24,12 +24,22 @@ extern unsigned long stacktop;
 extern void io_init(void);
 extern void setup_args_progname(const char *args);
 
-u32 cpus[NR_CPUS] = { [0 ... NR_CPUS-1] = (~0U) };
+u64 cpus[NR_CPUS] = { [0 ... NR_CPUS-1] = (~0U) };
 int nr_cpus;
 
 struct mem_region mem_regions[NR_MEM_REGIONS];
 phys_addr_t __phys_offset, __phys_end;
 
+int mpidr_to_cpu(unsigned long mpidr)
+{
+   int i;
+
+   for (i = 0; i < nr_cpus; ++i)
+   if (cpus[i] == (mpidr & MPIDR_HWID_BITMASK))
+   return i;
+   return -1;
+}
+
 static void cpu_set(int fdtnode __unused, u32 regval, void *info __unused)
 {
int cpu = nr_cpus++;
diff --git a/lib/arm64/asm/processor.h b/lib/arm64/asm/processor.h
index 9a208ff729b7..7e448dc81a6a 100644
--- a/lib/arm64/asm/processor.h
+++ b/lib/arm64/asm/processor.h
@@ -78,8 +78,13 @@ static inline type get_##reg(void)   
\
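
A small sketch, for illustration, of how the affinity macros and the
table lookup introduced by this patch behave. The two macros are copied
from the lib/arm/asm/processor.h hunk; example_cpus[],
example_mpidr_to_cpu() and the hardware ID mask (Aff3 in bits [39:32],
Aff2..Aff0 in bits [23:0]) are assumptions made for the example, not
code from the series.

```c
#include <assert.h>
#include <stdint.h>

/* As added by the patch: level 0..3 maps to bit offset 0, 8, 16, 32. */
#define MPIDR_LEVEL_SHIFT(level) \
	(((1 << level) >> 1) << 3)
#define MPIDR_AFFINITY_LEVEL(mpidr, level) \
	((mpidr >> MPIDR_LEVEL_SHIFT(level)) & 0xff)

/* Hypothetical stand-in for the cpus[] table filled from the device tree. */
static const uint64_t example_cpus[] = { 0x000, 0x001, 0x100, 0x101 };

/* Same linear search as mpidr_to_cpu(), over the example table. */
static int example_mpidr_to_cpu(uint64_t mpidr)
{
	/* Assumed mask: keep only the affinity fields of the MPIDR. */
	const uint64_t hwid_mask = 0xff00ffffffULL;
	int i;

	for (i = 0; i < 4; ++i)
		if (example_cpus[i] == (mpidr & hwid_mask))
			return i;
	return -1;
}
```

The lookup replaces the old "cpu index == Aff0" shortcut, which stops
working as soon as cpus are spread over more than one cluster.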

[Qemu-devel] [kvm-unit-tests PATCH v4 04/11] arm/arm64: add some delay routines

2016-11-08 Thread Andrew Jones

Allow a thread to wait some specified amount of time. Can
specify in cycles, usecs, and msecs.

Reviewed-by: Alex Bennée 
Reviewed-by: Eric Auger 
Signed-off-by: Andrew Jones 
---
 lib/arm/asm/processor.h   | 19 +++
 lib/arm/processor.c   | 15 +++
 lib/arm64/asm/processor.h | 19 +++
 lib/arm64/processor.c | 15 +++
 4 files changed, 68 insertions(+)

diff --git a/lib/arm/asm/processor.h b/lib/arm/asm/processor.h
index d2048f5f5f7e..afc903ca7d4a 100644
--- a/lib/arm/asm/processor.h
+++ b/lib/arm/asm/processor.h
@@ -5,7 +5,9 @@
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
+#include 
 #include 
+#include 
 
 enum vector {
EXCPTN_RST,
@@ -51,4 +53,21 @@ extern int mpidr_to_cpu(unsigned long mpidr);
 extern void start_usr(void (*func)(void *arg), void *arg, unsigned long sp_usr);
 extern bool is_user(void);
 
+static inline u64 get_cntvct(void)
+{
+   u64 vct;
+   isb();
+   asm volatile("mrrc p15, 1, %Q0, %R0, c14" : "=r" (vct));
+   return vct;
+}
+
+extern void delay(u64 cycles);
+extern void udelay(unsigned long usecs);
+
+static inline void mdelay(unsigned long msecs)
+{
+   while (msecs--)
+   udelay(1000);
+}
+
 #endif /* _ASMARM_PROCESSOR_H_ */
diff --git a/lib/arm/processor.c b/lib/arm/processor.c
index 54fdb87ef019..c2ee360df688 100644
--- a/lib/arm/processor.c
+++ b/lib/arm/processor.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static const char *processor_modes[] = {
"USER_26", "FIQ_26" , "IRQ_26" , "SVC_26" ,
@@ -141,3 +142,17 @@ bool is_user(void)
 {
return current_thread_info()->flags & TIF_USER_MODE;
 }
+
+void delay(u64 cycles)
+{
+   u64 start = get_cntvct();
+   while ((get_cntvct() - start) < cycles)
+   cpu_relax();
+}
+
+void udelay(unsigned long usec)
+{
+   unsigned int frq;
+   asm volatile("mrc p15, 0, %0, c14, c0, 0" : "=r" (frq));
+   delay((u64)usec * frq / 1000000);
+}
diff --git a/lib/arm64/asm/processor.h b/lib/arm64/asm/processor.h
index 7e448dc81a6a..94f7ce35b65c 100644
--- a/lib/arm64/asm/processor.h
+++ b/lib/arm64/asm/processor.h
@@ -17,8 +17,10 @@
 #define SCTLR_EL1_M    (1 << 0)
 
 #ifndef __ASSEMBLY__
+#include 
 #include 
 #include 
+#include 
 
 enum vector {
EL1T_SYNC,
@@ -89,5 +91,22 @@ extern int mpidr_to_cpu(unsigned long mpidr);
 extern void start_usr(void (*func)(void *arg), void *arg, unsigned long sp_usr);
 extern bool is_user(void);
 
+static inline u64 get_cntvct(void)
+{
+   u64 vct;
+   isb();
+   asm volatile("mrs %0, cntvct_el0" : "=r" (vct));
+   return vct;
+}
+
+extern void delay(u64 cycles);
+extern void udelay(unsigned long usecs);
+
+static inline void mdelay(unsigned long msecs)
+{
+   while (msecs--)
+   udelay(1000);
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASMARM64_PROCESSOR_H_ */
diff --git a/lib/arm64/processor.c b/lib/arm64/processor.c
index deeab4ec9c8a..50fa835c6f1e 100644
--- a/lib/arm64/processor.c
+++ b/lib/arm64/processor.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static const char *vector_names[] = {
"el1t_sync",
@@ -253,3 +254,17 @@ bool is_user(void)
 {
return current_thread_info()->flags & TIF_USER_MODE;
 }
+
+void delay(u64 cycles)
+{
+   u64 start = get_cntvct();
+   while ((get_cntvct() - start) < cycles)
+   cpu_relax();
+}
+
+void udelay(unsigned long usec)
+{
+   unsigned int frq;
+   asm volatile("mrs %0, cntfrq_el0" : "=r" (frq));
+   delay((u64)usec * frq / 1000000);
+}
-- 
2.7.4
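
To spell out the arithmetic behind udelay() and delay(), here is a
minimal, host-runnable sketch. It assumes the counter frequency is given
in Hz (as CNTFRQ reports it); usecs_to_cycles() and delay_expired() are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert microseconds to counter ticks for a counter running at frq Hz. */
static uint64_t usecs_to_cycles(unsigned long usecs, uint32_t frq)
{
	/* Widen before multiplying so the product cannot overflow 32 bits. */
	return (uint64_t)usecs * frq / 1000000;
}

/*
 * The loop condition of delay(), inverted: unsigned subtraction keeps
 * the comparison correct even if the counter wraps between the reads.
 */
static int delay_expired(uint64_t start, uint64_t now, uint64_t cycles)
{
	return (now - start) >= cycles;
}
```

For example, with a 62.5 MHz counter a 10 usec delay is 625 ticks.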

[Qemu-devel] [kvm-unit-tests PATCH v4 10/11] arm/arm64: gicv3: add an IPI test

2016-11-08 Thread Andrew Jones

Signed-off-by: Andrew Jones 

---
v4:
 - heavily comment gicv3_ipi_send_tlist() [Eric]
 - changes needed for gicv2 iar/irqstat fix to other patch
v2:
 - use IRM for gicv3 broadcast
---
 arm/gic.c  | 195 ++---
 arm/unittests.cfg  |   6 ++
 lib/arm/asm/arch_gicv3.h   |  23 ++
 lib/arm64/asm/arch_gicv3.h |  22 +
 4 files changed, 236 insertions(+), 10 deletions(-)

diff --git a/arm/gic.c b/arm/gic.c
index efefab7296d4..d98ca6b9efd5 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -3,6 +3,8 @@
  *
  * GICv2
  *   + test sending/receiving IPIs
+ * GICv3
+ *   + test sending/receiving IPIs
  *
  * Copyright (C) 2016, Red Hat Inc, Andrew Jones 
  *
@@ -16,6 +18,19 @@
 #include 
 #include 
 
+struct gic {
+   struct {
+   void (*enable)(void);
+   void (*send_self)(void);
+   void (*send_tlist)(cpumask_t *);
+   void (*send_broadcast)(void);
+   } ipi;
+   u32 (*read_iar)(void);
+   u32 (*irqnr)(u32 iar);
+   void (*write_eoi)(u32);
+};
+
+static struct gic *gic;
 static int gic_version;
 static int acked[NR_CPUS], spurious[NR_CPUS];
 static cpumask_t ready;
@@ -69,13 +84,33 @@ static void check_acked(cpumask_t *mask)
   false, missing, extra, unexpected);
 }
 
+static u32 gicv2_read_iar(void)
+{
+   return readl(gicv2_cpu_base() + GIC_CPU_INTACK);
+}
+
+static u32 gicv2_irqnr(u32 iar)
+{
+   return iar & GICC_IAR_INT_ID_MASK;
+}
+
+static void gicv2_write_eoi(u32 irqstat)
+{
+   writel(irqstat, gicv2_cpu_base() + GIC_CPU_EOI);
+}
+
+static u32 gicv3_irqnr(u32 iar)
+{
+   return iar;
+}
+
 static void ipi_handler(struct pt_regs *regs __unused)
 {
-   u32 irqstat = readl(gicv2_cpu_base() + GIC_CPU_INTACK);
-   u32 irqnr = irqstat & GICC_IAR_INT_ID_MASK;
+   u32 irqstat = gic->read_iar();
+   u32 irqnr = gic->irqnr(irqstat);
 
if (irqnr != GICC_INT_SPURIOUS) {
-   writel(irqstat, gicv2_cpu_base() + GIC_CPU_EOI);
+   gic->write_eoi(irqstat);
smp_rmb(); /* pairs with wmb in ipi_test functions */
++acked[smp_processor_id()];
smp_wmb(); /* pairs with rmb in check_acked */
@@ -85,6 +120,112 @@ static void ipi_handler(struct pt_regs *regs __unused)
}
 }
 
+static void gicv2_ipi_send_self(void)
+{
+   writel(2 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+}
+
+static void gicv2_ipi_send_tlist(cpumask_t *mask)
+{
+   u8 tlist = (u8)cpumask_bits(mask)[0];
+
+   writel(tlist << 16, gicv2_dist_base() + GIC_DIST_SOFTINT);
+}
+
+static void gicv2_ipi_send_broadcast(void)
+{
+   writel(1 << 24, gicv2_dist_base() + GIC_DIST_SOFTINT);
+}
+
+#define ICC_SGI1R_AFFINITY_1_SHIFT 16
+#define ICC_SGI1R_AFFINITY_2_SHIFT 32
+#define ICC_SGI1R_AFFINITY_3_SHIFT 48
+#define MPIDR_TO_SGI_AFFINITY(cluster_id, level) \
+   (MPIDR_AFFINITY_LEVEL(cluster_id, level) << ICC_SGI1R_AFFINITY_## level ## _SHIFT)
+
+static void gicv3_ipi_send_tlist(cpumask_t *mask)
+{
+   u16 tlist;
+   int cpu;
+
+   /*
+* For each cpu in the mask collect its peers, which are also in
+* the mask, in order to form target lists.
+*/
+   for_each_cpu(cpu, mask) {
+   u64 mpidr = cpus[cpu], sgi1r;
+   u64 cluster_id;
+
+   /*
+* GICv3 can send IPIs to up to 16 peer cpus with a single
+* write to ICC_SGI1R_EL1 (using the target list). Peers
+* are cpus that have nearly identical MPIDRs, the only
+* difference being Aff0. The matching upper affinity
+* levels form the cluster ID.
+*/
+   cluster_id = mpidr & ~0xffUL;
+   tlist = 0;
+
+   /*
+* Sort of open code for_each_cpu in order to have a
+* nested for_each_cpu loop.
+*/
+   while (cpu < nr_cpus) {
+   if ((mpidr & 0xff) >= 16) {
+   printf("cpu%d MPIDR:aff0 is %d (>= 16)!\n",
+   cpu, (int)(mpidr & 0xff));
+   break;
+   }
+
+   tlist |= 1 << (mpidr & 0xf);
+
+   cpu = cpumask_next(cpu, mask);
+   if (cpu >= nr_cpus)
+   break;
+
+   mpidr = cpus[cpu];
+
+   if (cluster_id != (mpidr & ~0xffUL)) {
+   /*
+* The next cpu isn't in our cluster. Roll
+* back the cpu index allowing the outer
+* for_each_cpu to find it again with
+* cpumask_next
+*/
+   --cpu;
+
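
The grouping described in the comments of gicv3_ipi_send_tlist() can be
sketched on its own, outside the test. This is only an illustration
under a simplifying assumption: the MPIDRs are passed in an order where
peers of the same cluster are adjacent. struct sgi_target and
build_target_lists() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

struct sgi_target {
	uint64_t cluster_id;	/* MPIDR with Aff0 cleared */
	uint16_t tlist;		/* one bit per Aff0 value 0..15 */
};

/*
 * Collapse n MPIDRs into one target list per cluster. Returns the number
 * of entries written to out (which must have room for n entries), or -1
 * if some Aff0 does not fit into a 16-bit target list.
 */
static int build_target_lists(const uint64_t *mpidrs, int n,
			      struct sgi_target *out)
{
	int nr = 0, i;

	for (i = 0; i < n; ++i) {
		uint64_t cluster_id = mpidrs[i] & ~0xffULL;
		unsigned int aff0 = mpidrs[i] & 0xff;

		if (aff0 >= 16)
			return -1;

		/* Start a new entry whenever the cluster changes. */
		if (nr == 0 || out[nr - 1].cluster_id != cluster_id) {
			out[nr].cluster_id = cluster_id;
			out[nr].tlist = 0;
			++nr;
		}
		out[nr - 1].tlist |= (uint16_t)(1u << aff0);
	}
	return nr;
}
```

Each resulting entry corresponds to one write of ICC_SGI1R_EL1, so the
five cpus 0x000, 0x001, 0x003, 0x100 and 0x102 need two writes.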

[Qemu-devel] [kvm-unit-tests PATCH v4 02/11] arm64: fix get_"sysreg32" and make MPIDR 64bit

2016-11-08 Thread Andrew Jones

mrs is always 64bit, so we should always use a 64bit register.
Sometimes we'll only want to return the lower 32, but not for
MPIDR, as that does define fields in the upper 32.

Reviewed-by: Alex Bennée 
Reviewed-by: Eric Auger 
Signed-off-by: Andrew Jones 
---
 lib/arm64/asm/processor.h | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/arm64/asm/processor.h b/lib/arm64/asm/processor.h
index 84d5c7ce752b..9a208ff729b7 100644
--- a/lib/arm64/asm/processor.h
+++ b/lib/arm64/asm/processor.h
@@ -66,14 +66,17 @@ static inline unsigned long current_level(void)
return el & 0xc;
 }
 
-#define DEFINE_GET_SYSREG32(reg)   \
-static inline unsigned int get_##reg(void) \
+#define DEFINE_GET_SYSREG(reg, type)   \
+static inline type get_##reg(void) \
 {  \
-   unsigned int reg;   \
-   asm volatile("mrs %0, " #reg "_el1" : "=r" (reg));  \
-   return reg; \
+   unsigned long r;\
+   asm volatile("mrs %0, " #reg "_el1" : "=r" (r));\
+   return (type)r; \
 }
-DEFINE_GET_SYSREG32(mpidr)
+#define DEFINE_GET_SYSREG32(reg) DEFINE_GET_SYSREG(reg, unsigned int)
+#define DEFINE_GET_SYSREG64(reg) DEFINE_GET_SYSREG(reg, unsigned long)
+
+DEFINE_GET_SYSREG64(mpidr)
 
 /* Only support Aff0 for now, gicv2 only */
 #define mpidr_to_cpu(mpidr) ((int)((mpidr) & 0xff))
-- 
2.7.4
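
To make the motivation concrete, a minimal sketch of what is lost when
an MPIDR value goes through a 32-bit getter: Aff3 sits in bits [39:32]
and is silently dropped. mpidr_aff3() and truncate_sysreg32() are
hypothetical helpers for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Aff3 lives in MPIDR_EL1 bits [39:32]. */
static unsigned int mpidr_aff3(uint64_t mpidr)
{
	return (mpidr >> 32) & 0xff;
}

/* What a 32-bit getter would return for the same register value. */
static uint32_t truncate_sysreg32(uint64_t val)
{
	return (uint32_t)val;
}
```

For an MPIDR of 0x0000000200000001, the 64-bit getter still sees Aff3=2,
while the truncated value only carries Aff0=1.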

[Qemu-devel] [kvm-unit-tests PATCH v4 00/11] arm/arm64: add gic framework

2016-11-08 Thread Andrew Jones

v4:
 - Eric's r-b's
 - Andre's suggestion to only take defines we need
 - several other changes listed in individual patches

v3:
 - Rebased on latest master
 - Added Alex's r-b's

v2:
 Rebased on latest master + my "populate argv[0]" series (will
 send a REPOST for that shortly). Additionally, a few patches got
 fixes/features:
 07/10 got same fix as kernel 7c9b973061 "irqchip/gic-v3: Configure
   all interrupts as non-secure Group-1" in order to continue
   working over TCG, as the gicv3 code for TCG removed a hack
   it had there to make Linux happy.
 08/10 added more output for when things fail (if they fail)
 09/10 switched gicv3 broadcast implementation to using IRM. This
   found a bug in a recent (but not tip) kernel, which I was
   about to fix, but then I saw MarcZ beat me to it.
 10/10 actually check that the input irq is the received irq


Import defines, and steal enough helper functions, from Linux to
enable programming of the gic (v2 and v3). Then use the framework
to add an initial test (an ipi test; self, target-list, broadcast).

It's my hope that this framework will be a suitable base on which
more tests may be easily added, particularly because we have
vgic-new and tcg gicv3 emulation getting close to merge. (v3 UPDATE:
vgic-new and tcg gicv3 are merged now)

To run it, along with other tests, just do

 ./configure [ --arch=[arm|arm64] --cross-prefix=$PREFIX ]
 make
 export QEMU=$PATH_TO_QEMU
 ./run_tests.sh

To run it separately do, e.g.

$QEMU -machine virt,accel=tcg -cpu cortex-a57 \
 -device virtio-serial-device \
 -device virtconsole,chardev=ctd -chardev testdev,id=ctd \
 -display none -serial stdio \
 -kernel arm/gic.flat \
 -smp 123 -machine gic-version=3 -append ipi
  ^^ note, we can go nuts with nr-cpus on TCG :-)

Or, a KVM example using a different "sender" cpu and irq (other than zero)

$QEMU -machine virt,accel=kvm -cpu host \
 -device virtio-serial-device \
 -device virtconsole,chardev=ctd -chardev testdev,id=ctd \
 -display none -serial stdio \
 -kernel arm/gic.flat \
 -smp 48 -machine gic-version=3 -append 'ipi sender=42 irq=1'


Patches:
01-05: fixes and functionality needed by the later gic patches
06-07: enable gicv2 and gicv2 IPI test
08-10: enable gicv3 and gicv3 IPI test
   11: extend the IPI tests to take variable sender and irq

Available here: https://github.com/rhdrjones/kvm-unit-tests/commits/arm/gic-v4


Andrew Jones (10):
  lib: xstr: allow multiple args
  arm64: fix get_"sysreg32" and make MPIDR 64bit
  arm/arm64: smp: support more than 8 cpus
  arm/arm64: add some delay routines
  arm/arm64: irq enable/disable
  arm/arm64: add initial gicv2 support
  arm/arm64: gicv2: add an IPI test
  arm/arm64: add initial gicv3 support
  arm/arm64: gicv3: add an IPI test
  arm/arm64: gic: don't just use zero

Peter Xu (1):
  libcflat: add IS_ALIGNED() macro, and page sizes

 arm/Makefile.common|   7 +-
 arm/gic.c  | 417 +
 arm/run|  19 ++-
 arm/selftest.c |   5 +-
 arm/unittests.cfg  |  13 ++
 lib/arm/asm/arch_gicv3.h   |  65 +++
 lib/arm/asm/gic-v2.h   |  28 +++
 lib/arm/asm/gic-v3.h   |  92 ++
 lib/arm/asm/gic.h  |  51 ++
 lib/arm/asm/processor.h|  38 -
 lib/arm/asm/setup.h|   4 +-
 lib/arm/gic.c  | 131 ++
 lib/arm/processor.c|  15 ++
 lib/arm/setup.c|  12 +-
 lib/arm64/asm/arch_gicv3.h |  66 +++
 lib/arm64/asm/gic-v2.h |   1 +
 lib/arm64/asm/gic-v3.h |   1 +
 lib/arm64/asm/gic.h|   1 +
 lib/arm64/asm/processor.h  |  53 +-
 lib/arm64/asm/sysreg.h |  44 +
 lib/arm64/processor.c  |  15 ++
 lib/libcflat.h |  10 +-
 22 files changed, 1062 insertions(+), 26 deletions(-)
 create mode 100644 arm/gic.c
 create mode 100644 lib/arm/asm/arch_gicv3.h
 create mode 100644 lib/arm/asm/gic-v2.h
 create mode 100644 lib/arm/asm/gic-v3.h
 create mode 100644 lib/arm/asm/gic.h
 create mode 100644 lib/arm/gic.c
 create mode 100644 lib/arm64/asm/arch_gicv3.h
 create mode 100644 lib/arm64/asm/gic-v2.h
 create mode 100644 lib/arm64/asm/gic-v3.h
 create mode 100644 lib/arm64/asm/gic.h
 create mode 100644 lib/arm64/asm/sysreg.h

-- 
2.7.4

[Qemu-devel] [kvm-unit-tests PATCH v4 01/11] lib: xstr: allow multiple args

2016-11-08 Thread Andrew Jones

Make implementation equivalent to Linux's include/linux/stringify.h

Reviewed-by: Eric Auger 
Signed-off-by: Andrew Jones 
---
 lib/libcflat.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/libcflat.h b/lib/libcflat.h
index 72b1bf9668ef..82005f5d014f 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -27,8 +27,8 @@
 
 #define __unused __attribute__((__unused__))
 
-#define xstr(s) xxstr(s)
-#define xxstr(s) #s
+#define xstr(s...) xxstr(s)
+#define xxstr(s...) #s
 
 #define __ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
 #define __ALIGN(x, a)  __ALIGN_MASK(x, (typeof(x))(a) - 1)
-- 
2.7.4
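
A short sketch of why the variadic form matters: as soon as the macro
argument expands to something containing a comma, the single-parameter
xxstr() is handed two arguments and fails to compile. EXAMPLE_ARGS and
example_args_string() are made-up names for this illustration; the two
macros are the ones from the patch and rely on GNU named variadic
parameters.

```c
#include <assert.h>
#include <string.h>

/* Variadic form from the patch (GNU named variadic macro parameters). */
#define xstr(s...) xxstr(s)
#define xxstr(s...) #s

/* Hypothetical macro whose expansion contains a comma. */
#define EXAMPLE_ARGS 3, 0

static const char *example_args_string(void)
{
	/* Expands to xxstr(3, 0), which only the variadic form accepts. */
	return xstr(EXAMPLE_ARGS);
}
```

With the old definitions, xstr(EXAMPLE_ARGS) would be rejected by the
preprocessor because xxstr() takes exactly one argument.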

Re: [Qemu-devel] [PATCH] Fix legacy ncurses detection.

2016-11-08 Thread Samuel Thibault

Cornelia Huck, on Tue 08 Nov 2016 12:34:49 +0100, wrote:
> > diff --git a/configure b/configure
> > index fd6f898..e200aa8 100755
> > --- a/configure
> > +++ b/configure
> > @@ -2926,7 +2926,7 @@ if test "$curses" != "no" ; then
> >  curses_inc_list="$($pkg_config --cflags ncurses 2>/dev/null):"
> >  curses_lib_list="$($pkg_config --libs ncurses 2>/dev/null):-lpdcurses"
> >else
> > -curses_inc_list="$($pkg_config --cflags ncursesw 2>/dev/null):"
> > +curses_inc_list="$($pkg_config --cflags ncursesw 
> > 2>/dev/null):-I/usr/include/ncursesw:"
> 
> This arrives at
> 
> curses_inc_list=":-I/usr/include/ncursesw:"
> 
> which causes the parser below to start with an empty curses_inc (with :
> as separator).

Yes, this is expected.

> configure fails as before (with -Werror; passes without).

Ah!
So are you getting the following message?

“
configure test passed without -Werror but failed with -Werror.
This is probably a bug in the configure script. The failing command
will be at the bottom of config.log.
You can run configure with --disable-werror to bypass this check.
”

If so, you should really have said so; I was wondering how
configure could just stop in your case.  That does explain things
indeed.

Could you try the attached patch?  It should be able to really fail
without Werror too.

> > @@ -2955,6 +2955,9 @@ EOF
> >  break
> >fi
> >  done
> > +if test "$curses_found" = yes ; then
> > +  break
> > +fi
> 
> Breaking out as soon as we've found a working config seems like a good
> idea, but I don't think it's related to this issue.

Actually it is: in your case it's the second config which will work, the
third one will fail. Not breaking would mean we keep the failing one.

Samuel
commit 4c5e78e8843fa919f2601d8e44029ed0e711c6d9
Author: Samuel Thibault 
Date:   Tue Nov 8 20:57:27 2016 +0100

Fix cursesw detection

On systems which do not provide ncursesw.pc and whose /usr/include/curses.h
does not include wide support, we should not only try with no -I, i.e.
/usr/include, but also with -I/usr/include/ncursesw.

To properly detect for wide support with and without -Werror, we need to
check for the presence of e.g. the WACS_DEGREE macro.

We also want to stop at the first curses_inc_list configuration which works.

Signed-off-by: Samuel Thibault 

diff --git a/configure b/configure
index fd6f898..f35edf8 100755
--- a/configure
+++ b/configure
@@ -2926,7 +2926,7 @@ if test "$curses" != "no" ; then
 curses_inc_list="$($pkg_config --cflags ncurses 2>/dev/null):"
 curses_lib_list="$($pkg_config --libs ncurses 2>/dev/null):-lpdcurses"
   else
-curses_inc_list="$($pkg_config --cflags ncursesw 2>/dev/null):"
+curses_inc_list="$($pkg_config --cflags ncursesw 2>/dev/null):-I/usr/include/ncursesw:"
 curses_lib_list="$($pkg_config --libs ncursesw 2>/dev/null):-lncursesw:-lcursesw"
   fi
   curses_found=no
@@ -2941,6 +2941,7 @@ int main(void) {
   resize_term(0, 0);
   addwstr(L"wide chars\n");
   addnwstr(&wch, 1);
+  add_wch(WACS_DEGREE);
   return s != 0;
 }
 EOF
@@ -2955,6 +2956,9 @@ EOF
 break
   fi
 done
+if test "$curses_found" = yes ; then
+  break
+fi
   done
   unset IFS
   if test "$curses_found" = "yes" ; then

[Qemu-devel] [PULL] xen: Fix xenpv machine initialisation

2016-11-08 Thread Stefano Stabellini

From: Anthony PERARD 

When using QEMU for a Xen PV guest, QEMU aborts with:
xen-common.c:118:xen_init: Object 0x7f2b8325dcb0 is not an instance of type generic-pc-machine

This is because the machine 'xenpv' also uses accel=xen. Moving the code
to xen_hvm_init() fixes the issue.

This fixes 021746c131cdfeab9d82ff918795a9f18d20d7ae.

Signed-off-by: Anthony PERARD 
Signed-off-by: Stefano Stabellini 
Reviewed-by: Eduardo Habkost 
Reviewed-by: Stefano Stabellini 
---
 xen-common.c | 6 --
 xen-hvm.c| 4 
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/xen-common.c b/xen-common.c
index bacf962..9099760 100644
--- a/xen-common.c
+++ b/xen-common.c
@@ -9,7 +9,6 @@
  */
 
 #include "qemu/osdep.h"
-#include "hw/i386/pc.h"
 #include "hw/xen/xen_backend.h"
 #include "qmp-commands.h"
 #include "sysemu/char.h"
@@ -115,11 +114,6 @@ static void xen_change_state_handler(void *opaque, int running,
 
 static int xen_init(MachineState *ms)
 {
-PCMachineState *pcms = PC_MACHINE(ms);
-
-/* Disable ACPI build because Xen handles it */
-pcms->acpi_build_enabled = false;
-
 xen_xc = xc_interface_open(0, 0, 0);
 if (xen_xc == NULL) {
 xen_pv_printf(NULL, 0, "can't open xen interface\n");
diff --git a/xen-hvm.c b/xen-hvm.c
index 2f348ed..150c7e7 100644
--- a/xen-hvm.c
+++ b/xen-hvm.c
@@ -1316,6 +1316,10 @@ void xen_hvm_init(PCMachineState *pcms, MemoryRegion **ram_memory)
 }
 xen_be_register_common();
 xen_read_physmap(state);
+
+/* Disable ACPI build because Xen handles it */
+pcms->acpi_build_enabled = false;
+
 return;
 
 err:
-- 
1.9.1
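
The failure mode described in the commit message can be modelled with a miniature object system. This is a hedged sketch only: Machine, PCMachine, pc_machine_cast(), common_init() and hvm_init() are hypothetical names, and the checked cast returns NULL where QEMU's real cast macro aborts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature object model; QEMU's real QOM is far richer. */
typedef struct Machine {
    const char *type;
} Machine;

typedef struct PCMachine {
    Machine parent;              /* must stay the first member */
    int acpi_build_enabled;
} PCMachine;

/* Checked downcast: returns NULL where the real macro would abort. */
static PCMachine *pc_machine_cast(Machine *m)
{
    if (strcmp(m->type, "generic-pc-machine") != 0) {
        return NULL;
    }
    return (PCMachine *)m;
}

/* Code shared by every Xen machine type must not assume a PC machine. */
static int common_init(Machine *m)
{
    (void)m;
    return 0;
}

/* The HVM-only path already receives a PCMachine, so no cast is needed. */
static void hvm_init(PCMachine *pcms)
{
    assert(pcms != NULL);
    pcms->acpi_build_enabled = 0;   /* Xen builds the ACPI tables itself */
}
```

The point of the sketch is the placement: the PC-specific field is only touched in the function whose signature already guarantees a PC machine.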

[Qemu-devel] [Bug 1623276] Re: qemu 2.7 / iPXE crash

2016-11-08 Thread Laszlo Ersek (Red Hat)

The iPXE patches are now upstream (a big "thank you" to the iPXE
maintainer!); QEMU 2.8 -- with Gerd willing -- should bundle iPXE
binaries containing that fix.

http://lists.ipxe.org/pipermail/ipxe-devel/2016-November/005244.html

** Changed in: qemu
   Status: New => Confirmed

** Changed in: qemu
   Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1623276

Title:
  qemu 2.7 / iPXE crash

Status in QEMU:
  In Progress

Bug description:
  I am running Arch linux

  vanilla 4.7.2 kernel
  qemu 2.7
  libvirt 2.2.0
  virt-manager 1.4.0

  
  Since the upgrade from qemu 2.6.1 to 2.7 a few days ago, I'm no longer
  able to PXE boot at all. Everything else appears to function normally;
  non-PXE booting works perfectly. Obviously I have restarted everything
  etc., and have tried the various network drivers as well.

  This occurs on domains created with 2.6.1 or with 2.7

  When I choose PXE boot, the machine moves to a paused state (crashed)
  immediately after the 'starting PXE rom execution...' message appears.

  Reverting to qemu 2.6.1 package corrects the issue.

  The qemu.log snippet follows below.

  I'm not sure how to troubleshoot this problem to determine if it's a
  packaging error by the distribution or a problem with qemu/kvm/kernel?

  Any help would be much appreciated - Thanks,
  Greg

  --- qemu.log:

  
  2016-09-12 16:36:33.867+: starting up libvirt version: 2.2.0, qemu
  version: 2.7.0, hostname: seneca
  LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
  QEMU_AUDIO_DRV=spice /usr/sbin/qemu-system-x86_64 -name guest=c,debug-
  threads=on -S -object
  secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-6-
  c/master-key.aes -machine pc-i440fx-2.7,accel=kvm,usb=off,vmport=off
  -cpu Nehalem -m 2048 -realtime mlock=off -smp
  1,sockets=1,cores=1,threads=1 -uuid 348009be-26d5-4dc7-b515-
  e8b45f5117ac -no-user-config -nodefaults -chardev
  socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-6-
  c/monitor.sock,server,nowait -mon
  chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew
  -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global
  PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot
  menu=on,strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7
  -device ich9-usb-
  uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6
  -device ich9-usb-
  uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-
  usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device
  virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive
  file=/var/lib/libvirt/images/c.qcow2,format=qcow2,if=none,id=drive-
  virtio-disk0 -device virtio-blk-
  pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-
  disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0 -device
  rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:a0:95:7c,bus=pci.0,addr=0x
  3 -chardev pty,id=charserial0 -device isa-
  serial,chardev=charserial0,id=serial0 -chardev
  socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain
  -6-c/org.qemu.guest_agent.0,server,nowait -device
  virtserialport,bus=virtio-
  serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_age
  nt.0 -chardev spicevmc,id=charchannel1,name=vdagent -device
  virtserialport,bus=virtio-
  serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0
  -device usb-tablet,id=input0,bus=usb.0,port=1 -spice
  port=5901,addr=127.0.0.1,disable-ticketing,image-
  compression=off,seamless-migration=on -device qxl-
  vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vga
  mem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-
  hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-
  codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir
  -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=2
  -chardev spicevmc,id=charredir1,name=usbredir -device usb-
  redir,chardev=charredir1,id=redir1,bus=usb.0,port=3 -device virtio-
  balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
  char device redirected to /dev/pts/0 (label charserial0)
  main_channel_link: add main channel client
  red_dispatcher_set_cursor_peer: 
  inputs_connect: inputs channel client create
  KVM internal error. Suberror: 1
  emulation failure
  EAX=801a8d00 EBX=00a0 ECX=2e20 EDX=0009d5e8
  ESI=7ffa3c00 EDI=7fef4000 EBP= ESP=7b92
  EIP=06ab EFL=0087 [--S--PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
  ES =   00c09300
  CS =9c4c 0009c4c0  00809b00
  SS =   00809300
  DS =9cd0 0009cd00  00c09300
  FS =   00c09300
  GS =   00c09300
  LDT=   8200
  TR =   8b00
  GDT=  
  IDT=

Re: [Qemu-devel] [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 11:16 PM, Alex Williamson wrote:
> On Tue, 8 Nov 2016 21:56:29 +0530
> Kirti Wankhede  wrote:
> 
>> On 11/8/2016 5:15 AM, Alex Williamson wrote:
>>> On Sat, 5 Nov 2016 02:40:45 +0530
>>> Kirti Wankhede  wrote:
>>>   
>> ...
  
 +int vfio_register_notifier(struct device *dev, struct notifier_block *nb) 
  
>>>
>>> Is the expectation here that this is a generic notifier for all
>>> vfio->mdev signaling?  That should probably be made clear in the mdev
>>> API to avoid vendor drivers assuming their notifier callback only
>>> occurs for unmaps, even if that's currently the case.
>>>   
>>
>> Ok. Adding comment about notifier callback in mdev_device which is part
>> of next patch.
>>
>> ...
>>
mutex_lock(&iommu->lock);
  
 -  if (!iommu->external_domain) {
 +  /* Fail if notifier list is empty */
 +  if ((!iommu->external_domain) || (!iommu->notifier.head)) {
ret = -EINVAL;
goto pin_done;
}
 @@ -867,6 +870,11 @@ unlock:
/* Report how much was unmapped */
unmap->size = unmapped;
  
 +  if (unmapped && iommu->external_domain)
 +  blocking_notifier_call_chain(&iommu->notifier,
 +   VFIO_IOMMU_NOTIFY_DMA_UNMAP,
 +   unmap);  
>>>
>>> This is after the fact, there's already a gap here where pages are
>>> unpinned and the mdev device is still running.  
>>
>> Oh, there is a bug here: unpin_pages() now takes user_pfn as an argument
>> and looks up the vfio_dma. If it's not found, it doesn't unpin pages. We
>> have to call this notifier before vfio_remove_dma(). But if we call it
>> before vfio_remove_dma() there will be a deadlock, since iommu->lock is
>> already held here and vfio_iommu_type1_unpin_pages() will also try to
>> take iommu->lock.
>> If we want to call blocking_notifier_call_chain() before
>> vfio_remove_dma(), the sequence should be:
>>
>> unmapped += dma->size;
>> mutex_unlock(&iommu->lock);
>> if (iommu->external_domain) {
>>  struct vfio_iommu_type1_dma_unmap nb_unmap;
>>
>>  nb_unmap.iova = dma->iova;
>>  nb_unmap.size = dma->size;
>>  blocking_notifier_call_chain(&iommu->notifier,
>>   VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>>   &nb_unmap);
>> }
>> mutex_lock(&iommu->lock);
>> vfio_remove_dma(iommu, dma);
> 
> It seems like it would be worthwhile to have the rb-tree rooted in the
> vfio-dma, then we only need to call the notifier if there are pages
> pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
> then release the lock, call the notifier, re-acquire the lock, and
> BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
> between separate vfio_dma structs, but as I mentioned in other replies,
> that seems like an exception that we don't need to optimize for.
> 

If we don't optimize for the case where iovas from different vfio_dma
structures are mapped to the same pfn, and we don't consider this case for
page accounting, then:
- Have an rb-tree of pinned iovas, keyed by iova, in each vfio_dma
structure.
- The iova tracking structure would hold the iova and ref_count only.
- Page accounting would only count the number of iovas in the rb-tree; the
case where different iovas map to the same pfn would not be considered in
this implementation for now.
- vfio_unpin_pages() would take user_pfn and pfn as input. We would
validate that the iova exists in the rb-tree and trust the vendor driver
that the corresponding pfn is correct; there is no validation of the pfn.
If we want to validate the pfn, call GUP, verify the pfn and call put_pfn().
- In the .release() or .detach_group() path, if there are entries in this
rb-tree, call GUP again using that iova, get the pfn and then call
put_pfn(pfn) ref_count+1 times. This is because we are not keeping the
pfn in our tracking logic.

Does this sound reasonable?
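
A minimal sketch of the per-vfio_dma tracking listed above, under stated assumptions: a small fixed array stands in for the rb-tree, ref_count starts at 1 on the first pin (the mail leaves the base value open), and all names (dma_track, track_pin, track_unpin, accounted_pages) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 16               /* arbitrary bound for the sketch */

/* Tracking entry: iova and reference count only, no pfn. */
struct iova_ref {
    unsigned long iova;
    int ref_count;
};

/* Stand-in for the tree rooted in one vfio_dma. */
struct dma_track {
    struct iova_ref ent[MAX_TRACKED];
    size_t n;
};

static struct iova_ref *find_iova(struct dma_track *t, unsigned long iova)
{
    for (size_t k = 0; k < t->n; k++) {
        if (t->ent[k].iova == iova) {
            return &t->ent[k];
        }
    }
    return NULL;
}

/* Pin: bump the count of a known iova, otherwise add a new entry. */
static int track_pin(struct dma_track *t, unsigned long iova)
{
    struct iova_ref *r = find_iova(t, iova);

    if (r) {
        r->ref_count++;
        return 0;
    }
    if (t->n == MAX_TRACKED) {
        return -1;
    }
    t->ent[t->n].iova = iova;
    t->ent[t->n].ref_count = 1;
    t->n++;
    return 0;
}

/* Unpin: reject an iova that was never pinned, drop the entry at zero. */
static int track_unpin(struct dma_track *t, unsigned long iova)
{
    struct iova_ref *r = find_iova(t, iova);

    if (!r) {
        return -1;
    }
    assert(r->ref_count > 0);
    if (--r->ref_count == 0) {
        *r = t->ent[t->n - 1];       /* swap-remove the entry */
        t->n--;
    }
    return 0;
}

/* Accounting counts tracked iovas only; shared pfns are not detected. */
static size_t accounted_pages(const struct dma_track *t)
{
    return t->n;
}
```

An empty tracker after the unmap notification corresponds to the "rb-tree is empty" condition discussed in this thread.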

Thanks,
Kirti
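
The unlock / notify / re-lock ordering discussed in this message can be shown with a toy, single-threaded model. It is a sketch under assumptions: toy_iommu and its helpers are hypothetical, and the "lock" is just a flag whose assert fires where a real mutex would deadlock on a recursive acquisition.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the iommu state; "locked" models iommu->lock. */
struct toy_iommu {
    bool locked;
    int pinned;      /* pages still pinned by the vendor driver */
    int mappings;    /* number of live vfio_dma entries */
};

static void toy_lock(struct toy_iommu *i)
{
    assert(!i->locked);          /* recursive locking would deadlock */
    i->locked = true;
}

static void toy_unlock(struct toy_iommu *i)
{
    assert(i->locked);
    i->locked = false;
}

/* Plays the vendor driver's notifier callback: it unpins its pages and,
 * like the unpin path described above, takes the lock itself. */
static void toy_notifier(struct toy_iommu *i)
{
    toy_lock(i);
    i->pinned = 0;
    toy_unlock(i);
}

/* Unmap path: drop the lock around the callback, then re-acquire it. */
static void toy_unmap(struct toy_iommu *i)
{
    toy_lock(i);
    if (i->pinned > 0) {
        toy_unlock(i);
        toy_notifier(i);
        toy_lock(i);
    }
    assert(i->pinned == 0);      /* the BUG_ON suggested in the thread */
    i->mappings--;               /* stands in for vfio_remove_dma() */
    toy_unlock(i);
}
```

In real code the state looked up before dropping the lock would presumably have to be revalidated after re-acquiring it, since another thread may have changed it in between; the toy model does not cover that.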


>>>  The notifier needs to
>>> happen prior to that and I suspect that we need to validate that we
>>> have no remaining external pfn references within this vfio_dma block.
>>> It seems like we need to root our pfn tracking in the vfio_dma so that
>>> we can see that it's empty after the notifier chain and BUG_ON if not.  
>>
>> There is no way to find pfns from that iova range with the current
>> implementation. We can have this validation if we go with a linear array
>> of iovas to track pfns.
> 
> Right, I was still hoping to avoid storing the pfn even with the
> array/page-table approach though, ask the mm layer for the mapping
> again.  Is that too much overhead?  Maybe the page table could store
> the phys addr and we could use PAGE_MASK to store the reference count
> so that each entry is still only 8bytes(?)
>  
>>> I would also add some enforcement that external pinning is only enabled
>>> when vfio_iommu_type1 is configured for v2 semantics (ie. we only
>>> support unmaps exactly matching previous maps).
>>>   
>>
>> Ok I'll add that check.
>>
>> Thanks,
>>

[Qemu-devel] [PULL 0/1] tags/xen-20161108-tag

2016-11-08 Thread Stefano Stabellini

The following changes since commit 207faf24c58859f5240f66bf6decc33b87a1776e:

  Merge remote-tracking branch 'pm215/tags/pull-target-arm-20161107' into 
staging (2016-11-07 14:02:15 +)

are available in the git repository at:


  git://xenbits.xen.org/people/sstabellini/qemu-dm.git tags/xen-20161108-tag

for you to fetch changes up to 804ba7c10bbc66bb8a8aa73ecc60f620da7423d5:

  xen: Fix xenpv machine initialisation (2016-11-08 11:17:30 -0800)


Xen 2016/11/08


Anthony PERARD (1):
  xen: Fix xenpv machine initialisation

 xen-common.c | 6 --
 xen-hvm.c| 4 
 2 files changed, 4 insertions(+), 6 deletions(-)

Re: [Qemu-devel] [PATCH 2/3] target-i386: Add Intel HAX files

2016-11-08 Thread Vincent Palatin

On Tue, Nov 8, 2016 at 6:46 PM, Paolo Bonzini  wrote:
>
>
> On 08/11/2016 16:39, Vincent Palatin wrote:
>> +/* need tcg for non-UG platform in real mode */
>> +if (!hax_ug_platform())
>> +   tcg_exec_init(tcg_tb_size * 1024 * 1024);
>> +
>
> Oh, it does support unrestricted guest, and in fact without unrestricted
> guest you don't even have SMP!
>
> Would you post a v2 that removes (after this patch 2) as much code as
> possible related to non-UG platforms?

Yes I can do this.

-- 
Vincent

Re: [Qemu-devel] [PATCH 0/3] [RFC] Add HAX support

2016-11-08 Thread Vincent Palatin

On Tue, Nov 8, 2016 at 6:43 PM, Paolo Bonzini  wrote:
>
>
>
> On 08/11/2016 16:39, Vincent Palatin wrote:
> > I took a stab at trying to rebase/upstream the support for Intel HAXM.
> > (Hardware Accelerated Execution Manager).
> > Intel HAX is a kernel-based hardware acceleration module for Windows
> > and MacOSX.
> >
> > I have based my work on the last version of the source code I found:
> > the emu-2.2-release branch in the external/qemu-android repository as
> > used by the Android emulator.
> > In patch 2/3, I have forward-ported the core HAX code mostly unmodified from
> > there, I just did some minor touch up to make it build and run properly.
> > So it might contain some outdated constructs and probably requires more
> > attention (thus the 'RFC' for this patchset).
>
> Does HAXM support the "unrestricted guest" feature in Westmere and more
> recent processors?


Yes it does; as mentioned in the last paragraph of my message, I have
actually done a fair chunk of my testing in UG mode.

>
>   If so, I think we should only support those
> processors and slash all the part related to HAX_EMULATE_STATE_INITIAL
> and HAX_EMULATE_STATE_REAL.  This would probably let us make patch 3
> much less intrusive.

Sure, the whole patchset would be lighter; I'm not sure what proportion of
users have VT machines without UG support, though.

>
> That said, patch 3's cpu-exec.c surgery is much nicer on the surface
> than if you look in depth, :) and since you don't look in depth if you
> steer clear of target-i386/hax*, I think it's okay to start with your
> patches and clean up progressively.  Others may disagree...  Also, we're
> now in soft freeze so the patches wouldn't be merged anyway for a few weeks.
>
> Paolo

Re: [Qemu-devel] [PATCH v2] xen: Fix xenpv machine initialisation

2016-11-08 Thread Stefano Stabellini

On Tue, 8 Nov 2016, Anthony PERARD wrote:
> When using QEMU for a Xen PV guest, QEMU aborts with:
> xen-common.c:118:xen_init: Object 0x7f2b8325dcb0 is not an instance of type generic-pc-machine
> 
> This is because the machine 'xenpv' also uses accel=xen. Moving the code
> to xen_hvm_init() fixes the issue.
> 
> This fixes 021746c131cdfeab9d82ff918795a9f18d20d7ae.
> 
> Signed-off-by: Anthony PERARD 

Reviewed-by: Stefano Stabellini 


> CC: Wei Liu 
> CC: Eduardo Habkost 
> ---
>  xen-common.c | 6 --
>  xen-hvm.c| 4 
>  2 files changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/xen-common.c b/xen-common.c
> index bacf962..9099760 100644
> --- a/xen-common.c
> +++ b/xen-common.c
> @@ -9,7 +9,6 @@
>   */
>  
>  #include "qemu/osdep.h"
> -#include "hw/i386/pc.h"
>  #include "hw/xen/xen_backend.h"
>  #include "qmp-commands.h"
>  #include "sysemu/char.h"
> @@ -115,11 +114,6 @@ static void xen_change_state_handler(void *opaque, int running,
>  
>  static int xen_init(MachineState *ms)
>  {
> -PCMachineState *pcms = PC_MACHINE(ms);
> -
> -/* Disable ACPI build because Xen handles it */
> -pcms->acpi_build_enabled = false;
> -
>  xen_xc = xc_interface_open(0, 0, 0);
>  if (xen_xc == NULL) {
>  xen_pv_printf(NULL, 0, "can't open xen interface\n");
> diff --git a/xen-hvm.c b/xen-hvm.c
> index 2f348ed..150c7e7 100644
> --- a/xen-hvm.c
> +++ b/xen-hvm.c
> @@ -1316,6 +1316,10 @@ void xen_hvm_init(PCMachineState *pcms, MemoryRegion **ram_memory)
>  }
>  xen_be_register_common();
>  xen_read_physmap(state);
> +
> +/* Disable ACPI build because Xen handles it */
> +pcms->acpi_build_enabled = false;
> +
>  return;
>  
>  err:
> -- 
> Anthony PERARD
>

Re: [Qemu-devel] [PATCH v11 05/22] vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops

2016-11-08 Thread Alex Williamson

On Wed, 9 Nov 2016 00:17:53 +0530
Kirti Wankhede  wrote:

> On 11/8/2016 10:09 PM, Alex Williamson wrote:
> > On Tue, 8 Nov 2016 19:25:35 +0530
> > Kirti Wankhede  wrote:
> >   
> ...
> 
>  -
>  +int (*pin_pages)(void *iommu_data, unsigned long *user_pfn,
>  + int npage, int prot,
>  + unsigned long *phys_pfn);
>  +int (*unpin_pages)(void *iommu_data,
> >>>
> >>> Are we changing from long to int here simply because of the absurdity
> >>> in passing in more than a 2^31 entry array, that would already consume
> >>> more than 16GB itself?
> >>> 
> >>
> >> These are on demand pin/unpin request, will that request go beyond 16GB
> >> limit? For Nvidia vGPU solution, pin request will not go beyond this 
> >> limit.  
> > 
> > 16G is simply the size of the user_pfn or phys_pfn arrays at a maximal
> > int32_t npage value, the interface actually allows mapping up to 8TB
> > per call, but at that point we have 16GB of input, 16GB of output, and
> > 80GB of vfio_pfns created.  So I don't really have a problem changing
> > from long to int given lack of scalability in the API in general, but
> > it does make me second guess the API itself.  Thanks,
> >   
> 
> Changing to 'long'; in future we might enhance this API without changing
> its signature.

I think the pfn arrays are more of a problem long term than whether we
can only map 2^31 pfns in one call.  I particularly dislike that the
caller provides both the iova and pfn arrays for unpinning.  Being an
in-kernel driver, we should trust it, but it makes the interface
difficult to use and seems like it indicates that our tracking data
structures aren't architected the way they should be.  Upstream, this
API will need to be flexible and change over time; it's only the
downstream distros that may lock in a kABI.  Not breaking them should
be a consideration, but it also needs to be weighed against long term
upstream goals.  Thanks,

Alex
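
The figures quoted in this exchange can be checked with a few lines of arithmetic. The sketch below assumes 8-byte pfn entries and 4 KiB pages, and rounds the int32_t limit up to 2^31 as the mail does; the 40-byte per-entry size is only what the 80GB figure implies, not a number stated in the thread. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NPAGE_MAX  (UINT64_C(1) << 31)  /* int32_t limit, rounded to 2^31 */
#define PFN_ENTRY  UINT64_C(8)          /* assumed sizeof(unsigned long) */
#define PAGE_SZ    UINT64_C(4096)       /* assumed 4 KiB pages */
#define GIB        (UINT64_C(1) << 30)

/* Size of one user_pfn or phys_pfn array for npage entries. */
static uint64_t pfn_array_bytes(uint64_t npage)
{
    assert(npage <= NPAGE_MAX);
    return npage * PFN_ENTRY;
}

/* Amount of guest memory covered by npage pages. */
static uint64_t mapped_bytes(uint64_t npage)
{
    assert(npage <= NPAGE_MAX);
    return npage * PAGE_SZ;
}

/* Per-entry tracking cost implied by a total tracking footprint. */
static uint64_t bytes_per_tracked_pfn(uint64_t total_bytes, uint64_t npage)
{
    assert(npage > 0);
    return total_bytes / npage;
}
```

At the maximal npage, each array is 16 GiB, the call covers 8 TiB, and 80 GiB of tracking structures works out to 40 bytes per pinned page.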

Re: [Qemu-devel] [QEMU PATCH v2] kvmclock: advance clock by time window between vm_stop and pre_save

2016-11-08 Thread Marcelo Tosatti

On Tue, Nov 08, 2016 at 10:22:56AM +, Dr. David Alan Gilbert wrote:
> * Marcelo Tosatti (mtosa...@redhat.com) wrote:
> > On Mon, Nov 07, 2016 at 08:03:50PM +, Dr. David Alan Gilbert wrote:
> > > * Marcelo Tosatti (mtosa...@redhat.com) wrote:
> > > > On Mon, Nov 07, 2016 at 03:46:11PM +, Dr. David Alan Gilbert wrote:
> > > > > * Marcelo Tosatti (mtosa...@redhat.com) wrote:
> > > > > > This patch, relative to pre-copy migration codepath,
> > > > > > measures the time between vm_stop() and pre_save(),
> > > > > > which includes copying the remaining RAM to destination,
> > > > > > and advances the clock by that amount.
> > > > > > 
> > > > > > In a VM with 5 seconds downtime, this reduces the guest
> > > > > > clock difference on destination from 5s to 0.2s.
> > > > > > 
> > > > > > Tested with Linux and Windows 2012 R2 guests with -cpu XXX,+hv-time.
> > > > > 
> > > > > One thing that bothers me is that it's only this clock that's
> > > > > getting corrected; doesn't it cause things to get upset when
> > > > > > one clock moves and the others don't?
> > > > 
> > > > If you are correlating the clocks, then yes.
> > > > 
> > > > Older Linux guests get upset (marking the TSC clocksource unstable
> > > > because the watchdog checks TSC vs kvmclock), but there is a
> > > > workaround for it in newer guests (kvmclock interface to notify
> > > > watchdog to not complain).
> > > > 
> > > > Note marking TSC clocksource unstable on older guests is harmless
> > > > because kvmclock is the standard clocksource.
> > > > 
> > > > For Windows guests, I don't know that Windows correlates between
> > > > different clocks.
> > > > 
> > > > That is, there is relative control as to which software reads
> > > > kvmclock or the Windows TIMER MSR, so I don't see the need to
> > > > advance every clock exposed.
> > > > 
> > > > > Shouldn't the pause delay be recorded somewhere architecturally
> > > > > independent and then be a thing that kvm-clock happens to use and
> > > > > other clocks might as well?
> > > > 
> > > > In theory, yes. In practice, I don't see the need for this...
> > > 
> > > It seems unlikely to me that x86 is the only one that will want
> > > to do something similar.
> > 
> > Can't they copy what kvmclock is doing today? 
> 
> We shouldn't have copies of code all over should we?
> 
> Dave

Fine, I'll add a notifier.
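
One possible shape for such a notifier, as a rough sketch only: the pause window is measured once, between vm_stop and pre_save, and handed to every registered clock. None of these names (pause_cb, pause_notifier_add, on_vm_stop, on_pre_save, toy_clock_advance) exist in QEMU; they are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: each interested clock gets the pause length. */
typedef void (*pause_cb)(void *opaque, int64_t delta_ns);

#define MAX_PAUSE_NOTIFIERS 4

static struct {
    pause_cb cb;
    void *opaque;
} pause_notifiers[MAX_PAUSE_NOTIFIERS];
static size_t n_pause_notifiers;
static int64_t vm_stop_ns;

/* Register a consumer; returns -1 when the (arbitrary) table is full. */
static int pause_notifier_add(pause_cb cb, void *opaque)
{
    if (n_pause_notifiers == MAX_PAUSE_NOTIFIERS) {
        return -1;
    }
    pause_notifiers[n_pause_notifiers].cb = cb;
    pause_notifiers[n_pause_notifiers].opaque = opaque;
    n_pause_notifiers++;
    return 0;
}

/* Called when the guest is stopped: remember the host time. */
static void on_vm_stop(int64_t host_now_ns)
{
    vm_stop_ns = host_now_ns;
}

/* Called at pre_save: pass the elapsed window to every registered clock. */
static void on_pre_save(int64_t host_now_ns)
{
    int64_t delta = host_now_ns - vm_stop_ns;

    assert(delta >= 0);          /* host clock assumed monotonic */
    for (size_t k = 0; k < n_pause_notifiers; k++) {
        pause_notifiers[k].cb(pause_notifiers[k].opaque, delta);
    }
}

/* Example consumer standing in for kvmclock: advance the saved value. */
static void toy_clock_advance(void *opaque, int64_t delta_ns)
{
    uint64_t *clock_ns = opaque;

    *clock_ns += (uint64_t)delta_ns;
}
```

With a 5 second window between the two calls, the saved clock value moves forward by exactly 5 seconds, and any other clock could register its own callback instead of copying the logic.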

Re: [Qemu-devel] [PATCH v11 05/22] vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops

2016-11-08 Thread Kirti Wankhede



On 11/8/2016 10:09 PM, Alex Williamson wrote:
> On Tue, 8 Nov 2016 19:25:35 +0530
> Kirti Wankhede  wrote:
> 
...

 -
 +  int (*pin_pages)(void *iommu_data, unsigned long *user_pfn,
 +   int npage, int prot,
 +   unsigned long *phys_pfn);
 +  int (*unpin_pages)(void *iommu_data,  
>>>
>>> Are we changing from long to int here simply because of the absurdity
>>> in passing in more than a 2^31 entry array, that would already consume
>>> more than 16GB itself?
>>>   
>>
>> These are on demand pin/unpin request, will that request go beyond 16GB
>> limit? For Nvidia vGPU solution, pin request will not go beyond this limit.
> 
> 16G is simply the size of the user_pfn or phys_pfn arrays at a maximal
> int32_t npage value, the interface actually allows mapping up to 8TB
> per call, but at that point we have 16GB of input, 16GB of output, and
> 80GB of vfio_pfns created.  So I don't really have a problem changing
> from long to int given lack of scalability in the API in general, but
> it does make me second guess the API itself.  Thanks,
> 

Changing to 'long'; in future we might enhance this API without changing
its signature.

Thanks,
Kirti

[Qemu-devel] [kvm-unit-tests PATCH v8 3/3] arm: pmu: Add CPI checking

2016-11-08 Thread Wei Huang

From: Christopher Covington 

Calculate the numbers of cycles per instruction (CPI) implied by ARM
PMU cycle counter values. The code includes a strict checking facility
intended for the -icount option in TCG mode in the configuration file.

Signed-off-by: Christopher Covington 
Signed-off-by: Wei Huang 
---
 arm/pmu.c | 101 +-
 arm/unittests.cfg |  14 
 2 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/arm/pmu.c b/arm/pmu.c
index d5e3ac3..09aff89 100644
--- a/arm/pmu.c
+++ b/arm/pmu.c
@@ -15,6 +15,7 @@
 #include "libcflat.h"
 
 #define PMU_PMCR_E (1 << 0)
+#define PMU_PMCR_C (1 << 2)
 #define PMU_PMCR_N_SHIFT   11
 #define PMU_PMCR_N_MASK0x1f
 #define PMU_PMCR_ID_SHIFT  16
@@ -75,6 +76,23 @@ static inline void pmccfiltr_write(uint32_t value)
pmselr_write(PMU_CYCLE_IDX);
pmxevtyper_write(value);
 }
+
+/*
+ * Extra instructions inserted by the compiler would be difficult to compensate
+ * for, so hand assemble everything between, and including, the PMCR accesses
+ * to start and stop counting.
+ */
+static inline void loop(int i, uint32_t pmcr)
+{
+   asm volatile(
+   "   mcr p15, 0, %[pmcr], c9, c12, 0\n"
+   "1: subs%[i], %[i], #1\n"
+   "   bgt 1b\n"
+   "   mcr p15, 0, %[z], c9, c12, 0\n"
+   : [i] "+r" (i)
+   : [pmcr] "r" (pmcr), [z] "r" (0)
+   : "cc");
+}
 #elif defined(__aarch64__)
 static inline uint32_t pmcr_read(void)
 {
@@ -106,6 +124,23 @@ static inline void pmccfiltr_write(uint32_t value)
 {
asm volatile("msr pmccfiltr_el0, %0" : : "r" (value));
 }
+
+/*
+ * Extra instructions inserted by the compiler would be difficult to compensate
+ * for, so hand assemble everything between, and including, the PMCR accesses
+ * to start and stop counting.
+ */
+static inline void loop(int i, uint32_t pmcr)
+{
+   asm volatile(
+   "   msr pmcr_el0, %[pmcr]\n"
+   "1: subs%[i], %[i], #1\n"
+   "   b.gt1b\n"
+   "   msr pmcr_el0, xzr\n"
+   : [i] "+r" (i)
+   : [pmcr] "r" (pmcr)
+   : "cc");
+}
 #endif
 
 /*
@@ -156,8 +191,71 @@ static bool check_cycles_increase(void)
return true;
 }
 
-int main(void)
+/*
+ * Execute a known number of guest instructions. Only odd instruction counts
+ * greater than or equal to 3 are supported by the in-line assembly code. The
+ * control register (PMCR_EL0) is initialized with the provided value (allowing
+ * for example for the cycle counter or event counters to be reset). At the end
+ * of the exact instruction loop, zero is written to PMCR_EL0 to disable
+ * counting, allowing the cycle counter or event counters to be read at the
+ * leisure of the calling code.
+ */
+static void measure_instrs(int num, uint32_t pmcr)
+{
+   int i = (num - 1) / 2;
+
+   assert(num >= 3 && ((num - 1) % 2 == 0));
+   loop(i, pmcr);
+}
+
+/*
+ * Measure cycle counts for various known instruction counts. Ensure that the
+ * cycle counter progresses (similar to check_cycles_increase() but with more
+ * instructions and using reset and stop controls). If supplied a positive,
+ * nonzero CPI parameter, also strictly check that every measurement matches
+ * it. Strict CPI checking is used to test -icount mode.
+ */
+static bool check_cpi(int cpi)
+{
+   uint32_t pmcr = pmcr_read() | PMU_PMCR_C | PMU_PMCR_E;
+   
+   if (cpi > 0)
+   printf("Checking for CPI=%d.\n", cpi);
+   printf("instrs : cycles0 cycles1 ...\n");
+
+   for (int i = 3; i < 300; i += 32) {
+   int avg, sum = 0;
+
+   printf("%d :", i);
+   for (int j = 0; j < NR_SAMPLES; j++) {
+   int cycles;
+
+   measure_instrs(i, pmcr);
+   cycles =pmccntr_read();
+   printf(" %d", cycles);
+
+   if (!cycles || (cpi > 0 && cycles != i * cpi)) {
+   printf("\n");
+   return false;
+   }
+
+   sum += cycles;
+   }
+   avg = sum / NR_SAMPLES;
+   printf(" sum=%d avg=%d avg_ipc=%d avg_cpi=%d\n",
+   sum, avg, i / avg, avg / i);
+   }
+
+   return true;
+}
+
+int main(int argc, char *argv[])
 {
+   int cpi = 0;
+
+   if (argc >= 1)
+   cpi = atol(argv[0]);
+
report_prefix_push("pmu");
 
/* init for PMU event access, right now only care about cycle count */
@@ -166,6 +264,7 @@ int main(void)
 
report("Control register", check_pmcr());
report("Monotonically increasing cycle count", check_cycles_increase());
+   report("Cycle/instruction ratio", check_cpi(cpi));
 
return report_summary();
 }
diff --git a/arm/unittests.cfg
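
The counting rules in this patch can be restated in portable C, away from the inline assembly. This is a sketch based on my reading of the code above: each loop iteration is taken to be two instructions (subs plus branch) and the final PMCR write one more, which is why only odd counts are accepted. loop_iterations() and sample_ok() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Loop iterations needed for an odd instruction count: two instructions
 * per iteration plus one for the PMCR write that stops counting. */
static int loop_iterations(int num)
{
    assert(num >= 3 && (num - 1) % 2 == 0);
    return (num - 1) / 2;
}

/* Pass/fail rule of check_cpi(): a zero reading always fails; with
 * cpi > 0 the reading must equal instrs * cpi exactly. */
static bool sample_ok(int cycles, int instrs, int cpi)
{
    if (cycles == 0) {
        return false;
    }
    if (cpi > 0 && cycles != instrs * cpi) {
        return false;
    }
    return true;
}
```

Note that the summary line prints avg_ipc with integer division, so it reads 0 whenever the average cycle count exceeds the instruction count (for example 35 instructions in 70 cycles).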

Re: [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create

2016-11-08 Thread Jeff Cody

On Tue, Nov 08, 2016 at 10:24:50AM -0500, John Snow wrote:
> 
> 
> On 11/08/2016 04:11 AM, Kevin Wolf wrote:
> >Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> >>On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> >>>Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> Refactor backup_start as backup_job_create, which only creates the job,
> but does not automatically start it. The old interface, 'backup_start',
> is not kept in favor of limiting the number of nearly-identical interfaces
> that would have to be edited to keep up with QAPI changes in the future.
> 
> Callers that wish to synchronously start the backup_block_job can
> instead just call block_job_start immediately after calling
> backup_job_create.
> 
> Transactions are updated to use the new interface, calling block_job_start
> only during the .commit phase, which helps prevent race conditions where
> jobs may finish before we even finish building the transaction. This may
> happen, for instance, during empty block backup jobs.
> 
> Reported-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: John Snow 
> >>>
> +static void drive_backup_commit(BlkActionState *common)
> +{
> +DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
> +if (state->job) {
> +block_job_start(state->job);
> +}
> }
> >>>
> >>>How could state->job ever be NULL?
> >>>
> >>
> >>Mechanical thinking. It can't. (I definitely didn't copy paste from
> >>the .abort routines. Definitely.)
> >>
> >>>Same question for abort, and for blockdev_backup_commit/abort.
> >>>
> >>
> >>Abort ... we may not have created the job successfully. Abort gets
> >>called whether or not we made it to or through the matching
> >>.prepare.
> >
> >Ah, yes, I always forget about this. It's so counterintuitive (and
> >bdrv_reopen() actually works differently, it only aborts entries that
> >have successfully been prepared).
> >
> >Is there a good reason why qmp_transaction() works this way, especially
> >since we have a separate .clean function?
> >
> >Kevin
> >
> 
> We just don't track which actions have succeeded or not, so we loop through
> all actions on each phase regardless.
> 
> I could add a little state enumeration (or boolean) to each action and I
> could adjust abort to only run on actions that either completed or failed,
> but in this case I think it still wouldn't change the text for .abort,
> because an action may fail before it got to creating the job, for instance.
> 

As far as this part goes, couldn't we just do it without any flags, by not
inserting the state into the snap_bdrv_states list unless it was successful
(assuming _prepare cleans up itself on failure)?  E.g.:

-QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);

 state->ops->prepare(state, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
+g_free(state);
 goto delete_and_fail;
 }
+QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);
 }

> Unless you'd propose undoing .prepare IN .prepare in failure cases, but why
> write abort code twice? I don't mind it living in .abort, personally.
>

Doing it the above way would indeed require prepare functions to clean up
after themselves on failure.

The bdrv_reopen() model does it this way, and I think it makes sense.  With
most APIs, on failure you wouldn't have a way of knowing what has or has not
been done, so it leaves everything in a clean state.  I think this is a good
model to follow.

It is also what most QEMU block interfaces currently do, iirc (.bdrv_open,
etc.) - if it fails, it is assumed that it frees all resources it allocated.

I guess it doesn't have to be done this way, and the complexity can just be
pushed into the _abort() function.  After all, with these transactional
models, there exists an abort function, which differentiates it from most
other APIs.  But the downfall is that we have different ways of handling
essentially the same sort of transactional model in the block layer (between
bdrv_reopen and qmp_transaction), and it trips up reviewers / authors.  

(I don't think changing how qmp_transaction handles this is something that
needs to be handled in this series - but it would be nice in the future
sometime).

Jeff
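
For comparison, the bdrv_reopen()-style model described above can be sketched in a few lines of C: an action joins the rollback set only once its prepare step has succeeded, and a failing prepare is expected to clean up after itself. The struct and function names here are hypothetical and do not correspond to the QEMU transaction code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action with the three phases discussed in the thread. */
struct action {
    int (*prepare_cb)(struct action *a);  /* undoes its own partial work on failure */
    void (*commit_cb)(struct action *a);
    void (*abort_cb)(struct action *a);
    int prepared, committed, aborted, fail;
};

/* Run all actions; on failure abort only those that prepared successfully. */
static int run_transaction(struct action **acts, size_t n)
{
    size_t done;

    for (done = 0; done < n; done++) {
        if (acts[done]->prepare_cb(acts[done]) != 0) {
            while (done-- > 0) {          /* roll back in reverse order */
                acts[done]->abort_cb(acts[done]);
            }
            return -1;
        }
    }
    for (size_t k = 0; k < n; k++) {
        acts[k]->commit_cb(acts[k]);
    }
    return 0;
}

/* Toy callbacks used to observe which phases ran. */
static int toy_prepare(struct action *a)
{
    if (a->fail) {
        return -1;                        /* nothing left behind to undo */
    }
    a->prepared = 1;
    return 0;
}

static void toy_commit(struct action *a)
{
    a->committed = 1;
}

static void toy_abort(struct action *a)
{
    assert(a->prepared);                  /* never called on an unprepared action */
    a->aborted = 1;
}
```

In this model the abort callback can rely on prepare having completed, which is the property that would let the NULL checks in the abort routines go away.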

Re: [Qemu-devel] [PATCH kernel v4 7/7] virtio-balloon: tell host vm's unused page info

2016-11-08 Thread Dave Hansen

On 11/07/2016 09:50 PM, Li, Liang Z wrote:
> Sounds good.  Should we ignore some of the order-0 pages in step 4 if the
> bitmap is full?  Or should we retry to get a complete list of order-0 pages?

I think that's a pretty reasonable thing to do.

>>> It seems the benefit we get for this feature is not as big as that in fast
>> balloon inflating/deflating.

 You should not be using get_max_pfn().  Any patch set that continues
 to use it is not likely to be using a proper algorithm.
>>>
>>> Do you have any suggestion about how to avoid it?
>>
>> Yes: get the pfns from the page free lists alone.  Don't derive
>> them from the pfn limits of the system or zones.
> 
> The 'get_max_pfn()' call can be avoided in this patch, but I think we can't
> avoid it completely. We need it as a hint for allocating a proper
> size bitmap. No?

If you start with higher-order pages, you'll be unlikely to get anywhere
close to filling up a bitmap that was sized to hold all possible order-0
pages on the system.  Any use of max_pfn also means that you'll
completely mis-size bitmaps on sparse systems with large holes.

I think you should size it based on the size of the free lists, if anything.
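
As a rough illustration of the sizing argument, the arithmetic can be written
down as below. This is only a sketch: bitmap_bytes() is a hypothetical helper,
not a kernel function, and it ignores how bits would be mapped back to pfns.
With 4 KiB pages, a max_pfn of 2^28 (1 TiB of address space) needs a 32 MiB
bitmap, while 2^20 free pages (4 GiB) need only 128 KiB.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes needed for a bitmap with one bit per page, rounded up.
 * Hypothetical helper for illustration only.
 */
static size_t bitmap_bytes(unsigned long nbits)
{
    return nbits / CHAR_BIT + (nbits % CHAR_BIT != 0);
}
```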

[Qemu-devel] [kvm-unit-tests PATCH v8 1/3] arm: Add PMU test

2016-11-08 Thread Wei Huang

From: Christopher Covington 

Beginning with a simple sanity check of the control register, add
a unit test for the ARM Performance Monitors Unit (PMU).

Signed-off-by: Christopher Covington 
Signed-off-by: Wei Huang 
---
 arm/Makefile.common |  3 ++-
 arm/pmu.c   | 73 +
 arm/unittests.cfg   |  5 
 3 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 arm/pmu.c

diff --git a/arm/Makefile.common b/arm/Makefile.common
index ccb554d..f98f422 100644
--- a/arm/Makefile.common
+++ b/arm/Makefile.common
@@ -11,7 +11,8 @@ endif
 
 tests-common = \
$(TEST_DIR)/selftest.flat \
-   $(TEST_DIR)/spinlock-test.flat
+   $(TEST_DIR)/spinlock-test.flat \
+   $(TEST_DIR)/pmu.flat
 
 all: test_cases
 
diff --git a/arm/pmu.c b/arm/pmu.c
new file mode 100644
index 0000000..0b29088
--- /dev/null
+++ b/arm/pmu.c
@@ -0,0 +1,73 @@
+/*
+ * Test the ARM Performance Monitors Unit (PMU).
+ *
+ * Copyright (c) 2015-2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License version 2.1 and
+ * only version 2.1 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+#include "libcflat.h"
+
+#define PMU_PMCR_N_SHIFT   11
+#define PMU_PMCR_N_MASK    0x1f
+#define PMU_PMCR_ID_SHIFT  16
+#define PMU_PMCR_ID_MASK   0xff
+#define PMU_PMCR_IMP_SHIFT 24
+#define PMU_PMCR_IMP_MASK  0xff
+
+#if defined(__arm__)
+static inline uint32_t pmcr_read(void)
+{
+   uint32_t ret;
+
+   asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r" (ret));
+   return ret;
+}
+#elif defined(__aarch64__)
+static inline uint32_t pmcr_read(void)
+{
+   uint32_t ret;
+
+   asm volatile("mrs %0, pmcr_el0" : "=r" (ret));
+   return ret;
+}
+#endif
+
+/*
+ * As a simple sanity check on the PMCR_EL0, ensure the implementer field isn't
+ * null. Also print out a couple other interesting fields for diagnostic
+ * purposes. For example, as of fall 2016, QEMU TCG mode doesn't implement
+ * event counters and therefore reports zero event counters, but hopefully
+ * support for at least the instructions event will be added in the future and
+ * the reported number of event counters will become nonzero.
+ */
+static bool check_pmcr(void)
+{
+   uint32_t pmcr;
+
+   pmcr = pmcr_read();
+
+   printf("PMU implementer: %c\n",
+  (pmcr >> PMU_PMCR_IMP_SHIFT) & PMU_PMCR_IMP_MASK);
+   printf("Identification code: 0x%x\n",
+  (pmcr >> PMU_PMCR_ID_SHIFT) & PMU_PMCR_ID_MASK);
+   printf("Event counters:  %d\n",
+  (pmcr >> PMU_PMCR_N_SHIFT) & PMU_PMCR_N_MASK);
+
+   return ((pmcr >> PMU_PMCR_IMP_SHIFT) & PMU_PMCR_IMP_MASK) != 0;
+}
+
+int main(void)
+{
+   report_prefix_push("pmu");
+
+   report("Control register", check_pmcr());
+
+   return report_summary();
+}
diff --git a/arm/unittests.cfg b/arm/unittests.cfg
index 3f6fa45..7645180 100644
--- a/arm/unittests.cfg
+++ b/arm/unittests.cfg
@@ -54,3 +54,8 @@ file = selftest.flat
 smp = $MAX_SMP
 extra_params = -append 'smp'
 groups = selftest
+
+# Test PMU support
+[pmu]
+file = pmu.flat
+groups = pmu
-- 
1.8.3.1
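
For readers who want to try the field extraction without ARM hardware, the
shifts and masks from the patch can be exercised on a plain integer. This is
a sketch only: struct pmcr_fields and pmcr_decode() are hypothetical names,
and the value 0x41013000 used to exercise them is made up for illustration
(implementer 0x41, which is ASCII 'A', identification code 0x01, six event
counters).

```c
#include <assert.h>
#include <stdint.h>

/* Same shifts and masks as in the patch above. */
#define PMU_PMCR_N_SHIFT   11
#define PMU_PMCR_N_MASK    0x1f
#define PMU_PMCR_ID_SHIFT  16
#define PMU_PMCR_ID_MASK   0xff
#define PMU_PMCR_IMP_SHIFT 24
#define PMU_PMCR_IMP_MASK  0xff

/* Hypothetical container for the decoded fields. */
struct pmcr_fields {
    uint8_t imp;   /* implementer code */
    uint8_t id;    /* identification code */
    uint8_t n;     /* number of event counters */
};

/* Decode a raw PMCR value into its three fields. */
static struct pmcr_fields pmcr_decode(uint32_t pmcr)
{
    struct pmcr_fields f;

    f.imp = (pmcr >> PMU_PMCR_IMP_SHIFT) & PMU_PMCR_IMP_MASK;
    f.id  = (pmcr >> PMU_PMCR_ID_SHIFT) & PMU_PMCR_ID_MASK;
    f.n   = (pmcr >> PMU_PMCR_N_SHIFT) & PMU_PMCR_N_MASK;
    return f;
}
```

check_pmcr() in the patch passes exactly when the decoded implementer field
is nonzero.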

[Qemu-devel] [kvm-unit-tests PATCH v8 2/3] arm: pmu: Check cycle count increases

2016-11-08 Thread Wei Huang

From: Christopher Covington 

Ensure that reads of the PMCCNTR_EL0 are monotonically increasing,
even for the smallest delta of two subsequent reads.

Signed-off-by: Christopher Covington 
Signed-off-by: Wei Huang 
---
 arm/pmu.c | 98 +++
 1 file changed, 98 insertions(+)

diff --git a/arm/pmu.c b/arm/pmu.c
index 0b29088..d5e3ac3 100644
--- a/arm/pmu.c
+++ b/arm/pmu.c
@@ -14,6 +14,7 @@
  */
 #include "libcflat.h"
 
+#define PMU_PMCR_E (1 << 0)
 #define PMU_PMCR_N_SHIFT   11
 #define PMU_PMCR_N_MASK    0x1f
 #define PMU_PMCR_ID_SHIFT  16
@@ -21,6 +22,10 @@
 #define PMU_PMCR_IMP_SHIFT 24
 #define PMU_PMCR_IMP_MASK  0xff
 
+#define PMU_CYCLE_IDX  31
+
+#define NR_SAMPLES 10
+
 #if defined(__arm__)
 static inline uint32_t pmcr_read(void)
 {
@@ -29,6 +34,47 @@ static inline uint32_t pmcr_read(void)
asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r" (ret));
return ret;
 }
+
+static inline void pmcr_write(uint32_t value)
+{
+   asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (value));
+}
+
+static inline void pmselr_write(uint32_t value)
+{
+   asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (value));
+}
+
+static inline void pmxevtyper_write(uint32_t value)
+{
+   asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (value));
+}
+
+/*
+ * While PMCCNTR can be accessed as a 64 bit coprocessor register, returning 64
+ * bits doesn't seem worth the trouble when differential usage of the result is
+ * expected (with differences that can easily fit in 32 bits). So just return
+ * the lower 32 bits of the cycle count in AArch32.
+ */
+static inline uint32_t pmccntr_read(void)
+{
+   uint32_t cycles;
+
+   asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
+   return cycles;
+}
+
+static inline void pmcntenset_write(uint32_t value)
+{
+   asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (value));
+}
+
+/* PMCCFILTR is an obsolete name for PMXEVTYPER31 in ARMv7 */
+static inline void pmccfiltr_write(uint32_t value)
+{
+   pmselr_write(PMU_CYCLE_IDX);
+   pmxevtyper_write(value);
+}
 #elif defined(__aarch64__)
 static inline uint32_t pmcr_read(void)
 {
@@ -37,6 +83,29 @@ static inline uint32_t pmcr_read(void)
asm volatile("mrs %0, pmcr_el0" : "=r" (ret));
return ret;
 }
+
+static inline void pmcr_write(uint32_t value)
+{
+   asm volatile("msr pmcr_el0, %0" : : "r" (value));
+}
+
+static inline uint32_t pmccntr_read(void)
+{
+   uint32_t cycles;
+
+   asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
+   return cycles;
+}
+
+static inline void pmcntenset_write(uint32_t value)
+{
+   asm volatile("msr pmcntenset_el0, %0" : : "r" (value));
+}
+
+static inline void pmccfiltr_write(uint32_t value)
+{
+   asm volatile("msr pmccfiltr_el0, %0" : : "r" (value));
+}
 #endif
 
 /*
@@ -63,11 +132,40 @@ static bool check_pmcr(void)
return ((pmcr >> PMU_PMCR_IMP_SHIFT) & PMU_PMCR_IMP_MASK) != 0;
 }
 
+/*
+ * Ensure that the cycle counter progresses between back-to-back reads.
+ */
+static bool check_cycles_increase(void)
+{
+   pmcr_write(pmcr_read() | PMU_PMCR_E);
+
+   for (int i = 0; i < NR_SAMPLES; i++) {
+   unsigned long a, b;
+
+   a = pmccntr_read();
+   b = pmccntr_read();
+
+   if (a >= b) {
+   printf("Read %ld then %ld.\n", a, b);
+   return false;
+   }
+   }
+
+   pmcr_write(pmcr_read() & ~PMU_PMCR_E);
+
+   return true;
+}
+
 int main(void)
 {
report_prefix_push("pmu");
 
+   /* init for PMU event access, right now only care about cycle count */
+   pmcntenset_write(1 << PMU_CYCLE_IDX);
+   pmccfiltr_write(0); /* count cycles in EL0, EL1, but not EL2 */
+
report("Control register", check_pmcr());
+   report("Monotonically increasing cycle count", check_cycles_increase());
 
return report_summary();
 }
-- 
1.8.3.1
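
The back-to-back read check can also be tried off-target by swapping the
register read for a function pointer. The following is a sketch under that
assumption: counter_read_fn, counter_strictly_increases() and the two fake
counters are hypothetical and not part of kvm-unit-tests. Like the patch, it
compares 32-bit values with "a >= b", so a wrap of the low 32 bits between
the two reads would be reported as a failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pmccntr_read(). */
typedef uint32_t (*counter_read_fn)(void *opaque);

/* True if every pair of back-to-back reads is strictly increasing. */
static bool counter_strictly_increases(counter_read_fn read_fn, void *opaque,
                                       int nr_samples)
{
    assert(nr_samples > 0);

    for (int i = 0; i < nr_samples; i++) {
        uint32_t a = read_fn(opaque);
        uint32_t b = read_fn(opaque);

        if (a >= b) {
            return false;
        }
    }
    return true;
}

/* Fake counter that advances by one on every read. */
static uint32_t ticking_counter(void *opaque)
{
    uint32_t *count = opaque;

    return ++*count;
}

/* Fake counter that never advances. */
static uint32_t stuck_counter(void *opaque)
{
    (void)opaque;
    return 0;
}
```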

[Qemu-devel] [PULL 1/1] docs/tracing.txt: Update documentation of default backend

2016-11-08 Thread Stefan Hajnoczi

From: Peter Maydell 

In commit baf86d6b3c we switched the default trace backend from "nop"
to "log". Update the documentation to match.

Signed-off-by: Peter Maydell 
Message-id: 1478276837-31780-1-git-send-email-peter.mayd...@linaro.org
Signed-off-by: Stefan Hajnoczi 
---
 docs/tracing.txt | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/tracing.txt b/docs/tracing.txt
index e62444c..f351998a 100644
--- a/docs/tracing.txt
+++ b/docs/tracing.txt
@@ -150,13 +150,16 @@ The trace backends are chosen at configure time:
 For a list of supported trace backends, try ./configure --help or see below.
 If multiple backends are enabled, the trace is sent to them all.
 
+If no backends are explicitly selected, configure will default to the
+"log" backend.
+
 The following subsections describe the supported trace backends.
 
 === Nop ===
 
 The "nop" backend generates empty trace event functions so that the compiler
-can optimize out trace events completely.  This is the default and imposes no
-performance penalty.
+can optimize out trace events completely.  This imposes no performance
+penalty.
 
 Note that regardless of the selected trace backend, events with the "disable"
 property will be generated with the "nop" backend.
-- 
2.7.4

[Qemu-devel] [PULL 0/1] Tracing patches

2016-11-08 Thread Stefan Hajnoczi

The following changes since commit 207faf24c58859f5240f66bf6decc33b87a1776e:

  Merge remote-tracking branch 'pm215/tags/pull-target-arm-20161107' into 
staging (2016-11-07 14:02:15 +)

are available in the git repository at:

  git://github.com/stefanha/qemu.git tags/tracing-pull-request

for you to fetch changes up to 3b0fc80dd8ed9bd1ac738898e4fbd70c4a618925:

  docs/tracing.txt: Update documentation of default backend (2016-11-08 
18:16:48 +)





Peter Maydell (1):
  docs/tracing.txt: Update documentation of default backend

 docs/tracing.txt | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
2.7.4

[Qemu-devel] [kvm-unit-tests PATCH v8 0/3] ARM PMU tests

2016-11-08 Thread Wei Huang

Changes from v7:
* Standardize PMU register accessor names and remove unused ones
* Use bit defines instead of bit fields
* Change the testing configure for pmu.flat
* Commit comments were updated

Note:
1) Current KVM code has bugs in handling PMCCFILTR write. A fix (see
below) is required for this unit testing code to work correctly under
KVM mode.
https://lists.cs.columbia.edu/pipermail/kvmarm/2016-November/022134.html.
2) Because the code was changed, Drew's original reviewed-by needs to
be acknowledged by him again.

-Wei

Wei Huang (3):
  arm: Add PMU test
  arm: pmu: Check cycle count increases
  arm: pmu: Add CPI checking

 arm/Makefile.common |   3 +-
 arm/pmu.c   | 270 
 arm/unittests.cfg   |  19 
 3 files changed, 291 insertions(+), 1 deletion(-)
 create mode 100644 arm/pmu.c

-- 
1.8.3.1

[Qemu-devel] [PULL 2/3] aio-posix: avoid NULL pointer dereference in aio_epoll_update

2016-11-08 Thread Stefan Hajnoczi

From: Paolo Bonzini 

aio_epoll_update dereferences parameter "node", but it could have been NULL
if deleting an fd handler that was not registered in the first place.

Signed-off-by: Paolo Bonzini 
Reviewed-by: Fam Zheng 
Message-id: 20161108135524.25927-2-pbonz...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 aio-posix.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 4ef34dd..304b016 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -217,21 +217,23 @@ void aio_set_fd_handler(AioContext *ctx,
 
 /* Are we deleting the fd handler? */
 if (!io_read && !io_write) {
-if (node) {
-g_source_remove_poll(&ctx->source, &node->pfd);
+if (node == NULL) {
+return;
+}
 
-/* If the lock is held, just mark the node as deleted */
-if (ctx->walking_handlers) {
-node->deleted = 1;
-node->pfd.revents = 0;
-} else {
-/* Otherwise, delete it for real.  We can't just mark it as
- * deleted because deleted nodes are only cleaned up after
- * releasing the walking_handlers lock.
- */
-QLIST_REMOVE(node, node);
-deleted = true;
-}
+g_source_remove_poll(&ctx->source, &node->pfd);
+
+/* If the lock is held, just mark the node as deleted */
+if (ctx->walking_handlers) {
+node->deleted = 1;
+node->pfd.revents = 0;
+} else {
+/* Otherwise, delete it for real.  We can't just mark it as
+ * deleted because deleted nodes are only cleaned up after
+ * releasing the walking_handlers lock.
+ */
+QLIST_REMOVE(node, node);
+deleted = true;
 }
 } else {
 if (node == NULL) {
-- 
2.7.4

[Qemu-devel] [PULL 1/3] block: Don't mark node clean after failed flush

2016-11-08 Thread Stefan Hajnoczi

From: Kevin Wolf 

Commit 3ff2f67a changed bdrv_co_flush() so that no flush is issued if
the image hasn't been dirtied since the last flush. This is not quite
correct: The condition should be that the image hasn't been dirtied
since the last _successful_ flush. This patch changes the logic
accordingly.

Without this fix, subsequent bdrv_co_flush() calls would return success
without actually doing anything even though the image is still dirty.
The difference is visible in some blkdebug test cases where error
messages incorrectly disappeared after commit 3ff2f67a.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 
Reviewed-by: Denis V. Lunev 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: John Snow 
Message-id: 1478300595-10090-1-git-send-email-kw...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 block/io.c |  4 +++-
 tests/qemu-iotests/026.out | 22 ++
 tests/qemu-iotests/026.out.nocache | 22 ++
 tests/qemu-iotests/071.out |  2 ++
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 37749b6..aa532a5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2372,7 +2372,9 @@ flush_parent:
 ret = bs->file ? bdrv_co_flush(bs->file->bs) : 0;
 out:
 /* Notify any pending flushes that we have completed */
-bs->flushed_gen = current_gen;
+if (ret == 0) {
+bs->flushed_gen = current_gen;
+}
 bs->active_flush_req = false;
 /* Return value is ignored - it's ok if wait queue is empty */
 qemu_co_queue_next(&bs->flush_queue);
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index 8531735..59b8f74 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -14,6 +14,7 @@ No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: off; write
+Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
 write failed: Input/output error
 
@@ -22,6 +23,7 @@ This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 5; imm: off; once: off; write -b
+Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
 write failed: Input/output error
 
@@ -40,6 +42,7 @@ No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: off; write
+Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
@@ -48,6 +51,7 @@ This means waste of disk space, but no harm to data.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: l1_update; errno: 28; imm: off; once: off; write -b
+Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 
@@ -286,12 +290,14 @@ No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_load; errno: 5; imm: off; once: off; write
+Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_load; errno: 5; imm: off; once: off; write -b
+Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
 write failed: Input/output error
 No errors were found on the image.
@@ -308,12 +314,14 @@ No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_load; errno: 28; imm: off; once: off; write
+Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_load; errno: 28; imm: off; once: off; write -b
+Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
 write failed: No space left on device
 No errors were found on the image.
@@ -330,12 +338,14 @@ No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 
 Event: refblock_update_part; errno: 5; imm: off; once: off; write
+Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
 write failed: Input/output error
 No errors were found on the image.
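
The rule that the patch enforces, "clean" means "not dirtied since the last
successful flush", can be modelled in a few lines. This is a sketch and not
the block layer code: Node, node_write(), backend_flush() and node_flush()
are hypothetical, and errors are injected through a field instead of
blkdebug.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block node. */
typedef struct Node {
    unsigned write_gen;     /* bumped on every write */
    unsigned flushed_gen;   /* generation covered by the last good flush */
    int flush_calls;        /* how often the backend was really flushed */
    int next_flush_error;   /* 0, or a negative errno to inject once */
} Node;

static void node_write(Node *n)
{
    n->write_gen++;
}

/* Pretend backend: fails once if an error has been injected. */
static int backend_flush(Node *n)
{
    int ret = n->next_flush_error;

    n->flush_calls++;
    n->next_flush_error = 0;
    return ret;
}

static int node_flush(Node *n)
{
    unsigned current_gen;
    int ret;

    assert(n != NULL);
    current_gen = n->write_gen;

    if (n->flushed_gen == current_gen) {
        return 0;   /* nothing dirtied since the last successful flush */
    }
    ret = backend_flush(n);
    if (ret == 0) {
        /* Only a successful flush marks the node clean. */
        n->flushed_gen = current_gen;
    }
    return ret;
}
```

If flushed_gen were updated unconditionally, the second node_flush() after a
failure would return 0 without calling the backend at all.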

[Qemu-devel] [PULL 0/3] Block patches

2016-11-08 Thread Stefan Hajnoczi

The following changes since commit 207faf24c58859f5240f66bf6decc33b87a1776e:

  Merge remote-tracking branch 'pm215/tags/pull-target-arm-20161107' into 
staging (2016-11-07 14:02:15 +)

are available in the git repository at:

  git://github.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to 35dd66e23ce96283723de58e10d2877ae2be4a1b:

  aio-posix: simplify aio_epoll_update (2016-11-08 17:09:14 +)





Kevin Wolf (1):
  block: Don't mark node clean after failed flush

Paolo Bonzini (2):
  aio-posix: avoid NULL pointer dereference in aio_epoll_update
  aio-posix: simplify aio_epoll_update

 aio-posix.c| 53 +-
 block/io.c |  4 ++-
 tests/qemu-iotests/026.out | 22 
 tests/qemu-iotests/026.out.nocache | 22 
 tests/qemu-iotests/071.out |  2 ++
 5 files changed, 73 insertions(+), 30 deletions(-)

-- 
2.7.4

[Qemu-devel] [PULL 3/3] aio-posix: simplify aio_epoll_update

2016-11-08 Thread Stefan Hajnoczi

From: Paolo Bonzini 

Extract common code out of the "if".

Reviewed-by: Fam Zheng 
Signed-off-by: Paolo Bonzini 
Message-id: 20161108135524.25927-3-pbonz...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 aio-posix.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 304b016..e13b9ab 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -81,29 +81,22 @@ static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 {
 struct epoll_event event;
 int r;
+int ctl;
 
 if (!ctx->epoll_enabled) {
 return;
 }
 if (!node->pfd.events) {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_DEL, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
+ctl = EPOLL_CTL_DEL;
 } else {
 event.data.ptr = node;
 event.events = epoll_events_from_pfd(node->pfd.events);
-if (is_new) {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
-} else {
-r = epoll_ctl(ctx->epollfd, EPOLL_CTL_MOD, node->pfd.fd, &event);
-if (r) {
-aio_epoll_disable(ctx);
-}
-}
+ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
+}
+
+r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
+if (r) {
+aio_epoll_disable(ctx);
 }
 }
 
-- 
2.7.4

Re: [Qemu-devel] [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

2016-11-08 Thread Alex Williamson

On Tue, 8 Nov 2016 21:56:29 +0530
Kirti Wankhede  wrote:

> On 11/8/2016 5:15 AM, Alex Williamson wrote:
> > On Sat, 5 Nov 2016 02:40:45 +0530
> > Kirti Wankhede  wrote:
> >   
> ...
> >>  
> >> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb) 
> >>  
> > 
> > Is the expectation here that this is a generic notifier for all
> > vfio->mdev signaling?  That should probably be made clear in the mdev
> > API to avoid vendor drivers assuming their notifier callback only
> > occurs for unmaps, even if that's currently the case.
> >   
> 
> Ok. Adding comment about notifier callback in mdev_device which is part
> of next patch.
> 
> ...
> 
> >>mutex_lock(&iommu->lock);
> >>  
> >> -  if (!iommu->external_domain) {
> >> +  /* Fail if notifier list is empty */
> >> +  if ((!iommu->external_domain) || (!iommu->notifier.head)) {
> >>ret = -EINVAL;
> >>goto pin_done;
> >>}
> >> @@ -867,6 +870,11 @@ unlock:
> >>/* Report how much was unmapped */
> >>unmap->size = unmapped;
> >>  
> >> +  if (unmapped && iommu->external_domain)
> >> +  blocking_notifier_call_chain(&iommu->notifier,
> >> +   VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >> +   unmap);  
> > 
> > This is after the fact, there's already a gap here where pages are
> > unpinned and the mdev device is still running.  
> 
> Oh, there is a bug here: unpin_pages() now takes user_pfn as an argument
> and looks up the vfio_dma. If it's not found, it doesn't unpin pages. We
> have to call this notifier before vfio_remove_dma(). But if we call this
> before vfio_remove_dma() there will be a deadlock, since iommu->lock is
> already held here and vfio_iommu_type1_unpin_pages() will also try to take
> iommu->lock.
> If we want to call blocking_notifier_call_chain() before
> vfio_remove_dma(), the sequence should be:
> 
> unmapped += dma->size;
> mutex_unlock(&iommu->lock);
> if (iommu->external_domain) {
>   struct vfio_iommu_type1_dma_unmap nb_unmap;
> 
>   nb_unmap.iova = dma->iova;
>   nb_unmap.size = dma->size;
>   blocking_notifier_call_chain(&iommu->notifier,
>                                VFIO_IOMMU_NOTIFY_DMA_UNMAP,
>                                &nb_unmap);
> }
> mutex_lock(&iommu->lock);
> vfio_remove_dma(iommu, dma);

It seems like it would be worthwhile to have the rb-tree rooted in the
vfio-dma, then we only need to call the notifier if there are pages
pinned within that vfio-dma (ie. the rb-tree is not empty).  We can
then release the lock, call the notifier, re-acquire the lock, and
BUG_ON if the rb-tree still is not empty.  We might get duplicate pfns
between separate vfio_dma structs, but as I mentioned in other replies,
that seems like an exception that we don't need to optimize for.

> >  The notifier needs to
> > happen prior to that and I suspect that we need to validate that we
> > have no remaining external pfn references within this vfio_dma block.
> > It seems like we need to root our pfn tracking in the vfio_dma so that
> > we can see that it's empty after the notifier chain and BUG_ON if not.  
> 
> There is no way to find pfns from that iova range with current
> implementation. We can have this validate if we go with linear array of
> iova to track pfns.

Right, I was still hoping to avoid storing the pfn even with the
array/page-table approach though, and instead ask the mm layer for the
mapping again.  Is that too much overhead?  Maybe the page table could store
the phys addr and we could use the low bits below PAGE_MASK to store the
reference count so that each entry is still only 8 bytes(?)
 
> > I would also add some enforcement that external pinning is only enabled
> > when vfio_iommu_type1 is configured for v2 semantics (ie. we only
> > support unmaps exactly matching previous maps).
> >   
> 
> Ok I'll add that check.
> 
> Thanks,
> Kirti
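
The idea floated above, keeping a page-aligned physical address and a
reference count in one 8-byte entry, can be sketched as follows. Everything
here is an assumption made for illustration: a 4 KiB page size, the SKETCH_*
macros and the entry_*() helpers are hypothetical, and the count saturates
at 4095 because only the 12 low bits are free.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 4 KiB pages; SKETCH_PAGE_MASK mirrors the kernel's PAGE_MASK. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))  /* address bits */
#define SKETCH_REF_MAX    (SKETCH_PAGE_SIZE - 1)     /* largest count */

/* Build an entry from a page-aligned address and an initial count. */
static uint64_t entry_pack(uint64_t phys, uint64_t refs)
{
    assert((phys & ~SKETCH_PAGE_MASK) == 0);
    assert(refs <= SKETCH_REF_MAX);
    return phys | refs;
}

static uint64_t entry_phys(uint64_t entry)
{
    return entry & SKETCH_PAGE_MASK;
}

static uint64_t entry_refs(uint64_t entry)
{
    return entry & ~SKETCH_PAGE_MASK;
}

/* Take a reference; returns -1 instead of overflowing into the address. */
static int entry_get(uint64_t *entry)
{
    if (entry_refs(*entry) == SKETCH_REF_MAX) {
        return -1;
    }
    (*entry)++;
    return 0;
}

/* Drop a reference; returns -1 if the count is already zero. */
static int entry_put(uint64_t *entry)
{
    if (entry_refs(*entry) == 0) {
        return -1;
    }
    (*entry)--;
    return 0;
}
```

Whether a 12-bit count is enough for real pinning workloads is exactly the
kind of question the sketch cannot answer.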

Re: [Qemu-devel] [PATCH 2/3] target-i386: Add Intel HAX files

2016-11-08 Thread Paolo Bonzini

On 08/11/2016 16:39, Vincent Palatin wrote:
> +/* need tcg for non-UG platform in real mode */
> +if (!hax_ug_platform())
> +   tcg_exec_init(tcg_tb_size * 1024 * 1024);
> +

Oh, it does support unrestricted guest, and in fact without unrestricted
guest you don't even have SMP!

Would you post a v2 that removes (after this patch 2) as much code as
possible related to non-UG platforms?

Paolo

Re: [Qemu-devel] [PATCH 0/3] [RFC] Add HAX support

2016-11-08 Thread Paolo Bonzini

On 08/11/2016 16:39, Vincent Palatin wrote:
> I took a stab at trying to rebase/upstream the support for Intel HAXM.
> (Hardware Accelerated Execution Manager).
> Intel HAX is a kernel-based hardware acceleration module for Windows and MacOSX.
> 
> I have based my work on the last version of the source code I found:
> the emu-2.2-release branch in the external/qemu-android repository as used by
> the Android emulator.
> In patch 2/3, I have forward-ported the core HAX code mostly unmodified from
> there, I just did some minor touch up to make it build and run properly.
> So it might contain some outdated constructs and probably requires more
> attention (thus the 'RFC' for this patchset).

Does HAXM support the "unrestricted guest" feature in Westmere and more
recent processors?  If so, I think we should only support those
processors and slash all the parts related to HAX_EMULATE_STATE_INITIAL
and HAX_EMULATE_STATE_REAL.  This would probably let us make patch 3
much less intrusive.

That said, patch 3's cpu-exec.c surgery is much nicer on the surface
than if you look in depth, :) and since you don't look in depth if you
steer clear of target-i386/hax*, I think it's okay to start with your
patches and clean up progressively.  Others may disagree...  Also, we're
now in soft freeze so the patches wouldn't be merged anyway for a few weeks.

Paolo

Re: [Qemu-devel] [PATCH] vhost: secure vhost shared log files using argv paremeter

2016-11-08 Thread Rafael David Tinoco

Hello, 

> On Tue, Nov 8, 2016 at 4:49 PM Rafael David Tinoco 
>  wrote:
> Hello Michael, André,
> 
> Could you do a quick review before a final submission ?
> 
> http://paste.ubuntu.com/23446279/
> ...
> (André) > Could it be only a filename? This would simplify testing.
> (Michael) > When vhostlog is not specified, can we just use memfd as we did?
> 
> Michael said: 
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg08197.html
> I think that the best approach is to allow passing in the fd, not the file 
> path. If not passed, use memfd.

Missed this one.

> I do agree :)

Sounds good. I see that the new approach is to let the managing library
create the files and just pass the file descriptors; this way security rules
are applied to the library itself and not to qemu processes.

> Do we really need to give a path? (pass fd with -add-fd/qmp add-fd)

I guess not. So, for shared logs:

- vhostlogfd has to be provided.
- if vhostlogfd is not provided, use memfd.
(we don't want writes in /tmp; should I remove the fallback mechanism from the
memfd logic?)
- if memfd fails, log can't be shared/created and there is a migration blocker.
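
The three rules above amount to a small decision function. Here is a sketch
of it, with the memfd creation passed in as a callback so that it can be run
anywhere. LogChoice, choose_log_fd(), memfd_create_fn and the two stub
callbacks are hypothetical names, not QEMU or vhost API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns an fd >= 0, or -1 on failure. */
typedef int (*memfd_create_fn)(void);

typedef struct LogChoice {
    int fd;                   /* fd backing the shared log, or -1 */
    bool migration_blocked;   /* set when no shared log can be created */
} LogChoice;

/* vhostlogfd < 0 means "not provided by the management layer". */
static LogChoice choose_log_fd(int vhostlogfd, memfd_create_fn try_memfd)
{
    LogChoice choice = { -1, false };

    assert(try_memfd != NULL);

    if (vhostlogfd >= 0) {
        choice.fd = vhostlogfd;     /* rule 1: use the fd that was passed */
        return choice;
    }
    choice.fd = try_memfd();        /* rule 2: fall back to memfd */
    if (choice.fd < 0) {
        choice.fd = -1;
        choice.migration_blocked = true;   /* rule 3: block migration */
    }
    return choice;
}

/* Stub callbacks standing in for a working and a failing memfd. */
static int memfd_ok(void)
{
    return 42;
}

static int memfd_fail(void)
{
    return -1;
}
```

Note that there is no path-based fallback in this sketch, which matches the
wish not to write to /tmp.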

André, Michael,

I'll work on that and send the patches soon. Meanwhile, could you push:

- "vhost: migration blocker only if shared log is use"

so I can backport it to Debian ? 

Thank you,
-Rafael Tinoco

Re: [Qemu-devel] [PATCH] vhost-scsi: Update 'ioeventfd_started' with host notifiers

2016-11-08 Thread Felipe Franciosi


> On 8 Nov 2016, at 18:18, Paolo Bonzini  wrote:
> 
> 
> 
> On 07/11/2016 18:23, Felipe Franciosi wrote:
> >> Following the recent refactor of virtio notifiers [1], more specifically
>> the patch that uses virtio_bus_set_host_notifier [2] by default, core
>> virtio code requires 'ioeventfd_started' to be set to true/false when
>> the host notifiers are configured. Since vhost-scsi uses the legacy
>> interface, this value is not updated.
>> 
>> When booting a guest with a vhost-scsi backend controller, SeaBIOS will
>> initially configure the device which sets all notifiers. The guest will
>> continue to boot fine until the kernel virtio-scsi module reinitialises
>> the device causing a stop followed by another start. Since
>> ioeventfd_started was never set to true, the 'stop' operation triggered
>> by virtio_bus_set_host_notifier() will not result in a call to
>> virtio_pci_ioeventfd_assign(assign=false). This leaves the memory
> >> regions with stale notifiers and results in the next start triggering
>> the following assertion:
>> 
>>  kvm_mem_ioeventfd_add: error adding ioeventfd: File exists
>>  Aborted
>> 
>> This patch updates ioeventfd_started whenever the notifiers are set or
>> cleared, fixing this issue.
>> 
>> Signed-off-by: Felipe Franciosi 
>> 
>> [1] http://lists.nongnu.org/archive/html/qemu-devel/2016-10/msg07748.html
>> [2] http://lists.nongnu.org/archive/html/qemu-devel/2016-10/msg07760.html
>> ---
>> hw/scsi/vhost-scsi.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
>> index 5b26946..1c6e6d4 100644
>> --- a/hw/scsi/vhost-scsi.c
>> +++ b/hw/scsi/vhost-scsi.c
>> @@ -95,6 +95,7 @@ static int vhost_scsi_start(VHostSCSI *s)
>> if (ret < 0) {
>> return ret;
>> }
>> +VIRTIO_BUS(qbus)->ioeventfd_started = true;
>> 
>> s->dev.acked_features = vdev->guest_features;
>> ret = vhost_dev_start(>dev, vdev);
>> @@ -152,6 +153,7 @@ static void vhost_scsi_stop(VHostSCSI *s)
>> vhost_scsi_clear_endpoint(s);
>> vhost_dev_stop(>dev, vdev);
>> vhost_dev_disable_notifiers(>dev, vdev);
>> +VIRTIO_BUS(qbus)->ioeventfd_started = false;
>> }
>> 
>> static uint64_t vhost_scsi_get_features(VirtIODevice *vdev,
>> 
> 
> While a bit hacky, at least for 2.8 the idea should be fine.  Only it
> has to be done in vhost_dev_enable_notifiers and
> vhost_dev_disable_notifiers, close to the calls to
> virtio_device_stop_ioeventfd (i.e., set bus->ioeventfd_started after the
> call) and virtio_device_start_ioeventfd (clear it before the call).

Ok. I'll send a v2 tomorrow which does this from vhost_dev_enable_notifiers(). 
I won't have time to test it now.

> 
> I've now worked through a fix that uses start_ioeventfd/stop_ioeventfd,
> and it does come out nice (29 lines added, 65 removed :)), but it isn't
> bisectable (so it has to be squashed into a single patch) and does not
> cover vhost-net yet.  So let's go with your patch for now.

Perfect. Yeah as I told you on IRC I had a quick go at using the new 
start/stop, but it didn't seem straightforward and I ran out of time to do it 
properly. 

> 
> I'll send the vm_running fix separately, since that one is a real bugfix.

Ok. Thanks for taking on that one.

> 
> Paolo

Cheers,
Felipe
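
The failure mode described in the commit message can be reproduced in a toy
model: if the start path registers a notifier without recording that it did
so, the stop path skips the teardown and the next start hits "File exists".
This is a deliberately simplified sketch, not the virtio code; Bus,
assign_notifier(), core_stop() and legacy_start() are hypothetical names, and
update_flag models whether the fix is applied.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two pieces of state that got out of sync. */
typedef struct Bus {
    bool ioeventfd_started;   /* what the core code believes */
    bool notifier_assigned;   /* what is really registered */
} Bus;

/* Registering twice fails, like the "File exists" error in the log. */
static int assign_notifier(Bus *bus)
{
    if (bus->notifier_assigned) {
        return -EEXIST;
    }
    bus->notifier_assigned = true;
    return 0;
}

/* The stop path only tears down what it believes was started. */
static void core_stop(Bus *bus)
{
    assert(bus != NULL);

    if (!bus->ioeventfd_started) {
        return;
    }
    bus->notifier_assigned = false;
    bus->ioeventfd_started = false;
}

/* Legacy-style start; update_flag == true models the fixed behaviour. */
static int legacy_start(Bus *bus, bool update_flag)
{
    int ret = assign_notifier(bus);

    if (ret < 0) {
        return ret;
    }
    if (update_flag) {
        bus->ioeventfd_started = true;
    }
    return 0;
}
```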

Re: [Qemu-devel] [PATCH 1/3] kvm: move cpu synchronization code

2016-11-08 Thread Paolo Bonzini



On 08/11/2016 16:39, Vincent Palatin wrote:
> Move the generic cpu_synchronize_ functions to the common hw_accel.h header,
> in order to prepare for the addition of a second hardware accelerator.
> 
> Signed-off-by: Vincent Palatin 
> ---
>  cpus.c|  1 +
>  gdbstub.c |  1 +
>  hw/i386/kvm/apic.c|  1 +
>  hw/i386/kvmvapic.c|  1 +
>  hw/misc/vmport.c  |  2 +-
>  include/sysemu/hw_accel.h | 39 +++
>  include/sysemu/kvm.h  | 23 ---
>  monitor.c |  2 +-
>  qom/cpu.c |  2 +-
>  target-arm/cpu.c  |  2 +-
>  target-i386/helper.c  |  1 +
>  target-i386/kvm.c |  1 +
>  12 files changed, 49 insertions(+), 27 deletions(-)
>  create mode 100644 include/sysemu/hw_accel.h
> 
> diff --git a/cpus.c b/cpus.c
> index 5213351..fc78502 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -33,6 +33,7 @@
>  #include "sysemu/block-backend.h"
>  #include "exec/gdbstub.h"
>  #include "sysemu/dma.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "qmp-commands.h"
>  #include "exec/exec-all.h"
> diff --git a/gdbstub.c b/gdbstub.c
> index de62d26..de9b62b 100644
> --- a/gdbstub.c
> +++ b/gdbstub.c
> @@ -32,6 +32,7 @@
>  #define MAX_PACKET_LENGTH 4096
>  
>  #include "qemu/sockets.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "exec/semihost.h"
>  #include "exec/exec-all.h"
> diff --git a/hw/i386/kvm/apic.c b/hw/i386/kvm/apic.c
> index 01cbaa8..328f80c 100644
> --- a/hw/i386/kvm/apic.c
> +++ b/hw/i386/kvm/apic.c
> @@ -14,6 +14,7 @@
>  #include "cpu.h"
>  #include "hw/i386/apic_internal.h"
>  #include "hw/pci/msi.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "target-i386/kvm_i386.h"
>  
> diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
> index b30d1b9..2f767b6 100644
> --- a/hw/i386/kvmvapic.c
> +++ b/hw/i386/kvmvapic.c
> @@ -14,6 +14,7 @@
>  #include "exec/exec-all.h"
>  #include "sysemu/sysemu.h"
>  #include "sysemu/cpus.h"
> +#include "sysemu/hw_accel.h"
>  #include "sysemu/kvm.h"
>  #include "hw/i386/apic_internal.h"
>  #include "hw/sysbus.h"
> diff --git a/hw/misc/vmport.c b/hw/misc/vmport.c
> index c763811..be40930 100644
> --- a/hw/misc/vmport.c
> +++ b/hw/misc/vmport.c
> @@ -25,7 +25,7 @@
>  #include "hw/hw.h"
>  #include "hw/isa/isa.h"
>  #include "hw/i386/pc.h"
> -#include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
>  #include "hw/qdev.h"
>  
>  //#define VMPORT_DEBUG
> diff --git a/include/sysemu/hw_accel.h b/include/sysemu/hw_accel.h
> new file mode 100644
> index 0000000..03812cf
> --- /dev/null
> +++ b/include/sysemu/hw_accel.h
> @@ -0,0 +1,39 @@
> +/*
> + * QEMU Hardware accelerators support
> + *
> + * Copyright 2016 Google, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_HW_ACCEL_H
> +#define QEMU_HW_ACCEL_H
> +
> +#include "qom/cpu.h"
> +#include "sysemu/hax.h"
> +#include "sysemu/kvm.h"
> +
> +static inline void cpu_synchronize_state(CPUState *cpu)
> +{
> +    if (kvm_enabled()) {
> +        kvm_cpu_synchronize_state(cpu);
> +    }
> +}
> +
> +static inline void cpu_synchronize_post_reset(CPUState *cpu)
> +{
> +    if (kvm_enabled()) {
> +        kvm_cpu_synchronize_post_reset(cpu);
> +    }
> +}
> +
> +static inline void cpu_synchronize_post_init(CPUState *cpu)
> +{
> +    if (kvm_enabled()) {
> +        kvm_cpu_synchronize_post_init(cpu);
> +    }
> +}
> +
> +#endif /* QEMU_HW_ACCEL_H */
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index df67cc0..3045ee7 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -461,29 +461,6 @@ void kvm_cpu_synchronize_state(CPUState *cpu);
>  void kvm_cpu_synchronize_post_reset(CPUState *cpu);
>  void kvm_cpu_synchronize_post_init(CPUState *cpu);
>  
> -/* generic hooks - to be moved/refactored once there are more users */
> -
> -static inline void cpu_synchronize_state(CPUState *cpu)
> -{
> -    if (kvm_enabled()) {
> -        kvm_cpu_synchronize_state(cpu);
> -    }
> -}
> -
> -static inline void cpu_synchronize_post_reset(CPUState *cpu)
> -{
> -    if (kvm_enabled()) {
> -        kvm_cpu_synchronize_post_reset(cpu);
> -    }
> -}
> -
> -static inline void cpu_synchronize_post_init(CPUState *cpu)
> -{
> -    if (kvm_enabled()) {
> -        kvm_cpu_synchronize_post_init(cpu);
> -    }
> -}
> -
>  /**
>   * kvm_irqchip_add_msi_route - Add MSI route for specific vector
>   * @s:  KVM state
> diff --git a/monitor.c b/monitor.c
> index 0841d43..d38956f 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -50,7 +50,7 @@
>  #include "sysemu/balloon.h"
>  #include "qemu/timer.h"
>  #include "migration/migration.h"
> -#include "sysemu/kvm.h"
> +#include "sysemu/hw_accel.h"
>  #include "qemu/acl.h"
>  #include "sysemu/tpm.h"
>  #include "qapi/qmp/qerror.h"
> diff

Re: [Qemu-devel] Crashing in tcp_close

2016-11-08 Thread Brian Candler


On 07/11/2016 20:52, Brian Candler wrote:
So either this means that using tap networking instead of user 
networking is fixing all the problems; or it is some other option 
which is different. Really I now need to run qemu with exactly the 
same settings as before, except with tap instead of user networking.


I hacked something together to run qemu directly with the right flags 
for tap networking.


packer.io now connects via ssh to the IP address which the DHCP server 
gives out for that MAC address, and runs the same provisioning code as 
before (actually a whole load of ansible scripts)


I ran it three times successfully from start to end, with no crashes. 
Hence it does appear likely that the crashes have something to do 
with the user networking.


Regards,

Brian.


#!/bin/sh -e
cp output-qemu-ubuntu-base/ubuntu-base.qcow2 output-null-vtp-nmm/vtp-nmm.qcow2


TAP=tap0
sudo tunctl -d tap0
sudo tunctl -u $(whoami)
sudo brctl addif br-lan $TAP
sudo ip link set dev $TAP up

echo "Starting kvm..."
/usr/local/bin/qemu-system-x86_64 \
 -machine type=pc,accel=kvm \
 -device virtio-scsi-pci,id=scsi0 \
 -device scsi-hd,bus=scsi0.0,drive=drive0 \
 -device virtio-net,netdev=network0,id=net0,mac=52:54:00:cc:c1:62 \
 -vnc [::]:24 \
 -name vtp-nmm.qcow2 \
 -boot c \
 -netdev tap,id=network0,ifname=$TAP,script=no,downscript=no \
 -drive if=none,file=output-null-vtp-nmm/vtp-nmm.qcow2,id=drive0,cache=writeback,discard=unmap,format=qcow2 \
 -m 4G
# waiting for exit

sudo tunctl -d tap0

Re: [Qemu-devel] Concerning " [PULL 6/6] curses: Use cursesw instead of curses"

2016-11-08 Thread Cornelia Huck

On Tue, 8 Nov 2016 16:49:51 +
Stefan Hajnoczi  wrote:

> On Tue, Nov 08, 2016 at 10:40:20AM +0300, Sergey Smolov wrote:
> > Dear List!
> > 
> > I've encountered the same problem as was discussed in this thread:
> > https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg07898.html
> > 
> > Has anybody succeeded in solving the problem?
> > 
> > From my side, the problem appears when I run the 'configure' script with
> > '--target-list=aarch64-softmmu' option. The script returns the following
> > message to me:
> > 
> > ERROR: configure test passed without -Werror but failed with -Werror.
> >This is probably a bug in the configure script. The failing command
> >will be at the bottom of config.log.
> >You can run configure with --disable-werror to bypass this check.
> > 
> > I've attached a config.log to this e-mail.
> 
> [...]
> 
> > cc -Werror -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 
> > -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef 
> > -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common 
> > -fwrapv -Wendif-labels -Wmissing-include-dirs -Wempty-body -Wnested-externs 
> > -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers 
> > -Wold-style-declaration -Wold-style-definition -Wtype-limits 
> > -fstack-protector-all -I/usr/include/libpng14 -o config-temp/qemu-conf.exe 
> > config-temp/qemu-conf.c -Wl,-z,relro -Wl,-z,now -pie -m64 -g -lncursesw
> > config-temp/qemu-conf.c: In function ‘main’:
> > config-temp/qemu-conf.c:9:3: error: implicit declaration of function 
> > ‘addwstr’ [-Werror=implicit-function-declaration]
> > config-temp/qemu-conf.c:9:3: error: nested extern declaration of ‘addwstr’ 
> > [-Werror=nested-externs]
> > config-temp/qemu-conf.c:10:3: error: implicit declaration of function 
> > ‘addnwstr’ [-Werror=implicit-function-declaration]
> > config-temp/qemu-conf.c:10:3: error: nested extern declaration of 
> > ‘addnwstr’ [-Werror=nested-externs]
> > cc1: all warnings being treated as errors
> 
> http://pdcurses.sourceforge.net/doc/PDCurses.txt:
> 
>   Wide-character functions from the X/Open standard -- these are only
>   available when PDCurses is built with PDC_WIDE defined, and the
>   prototypes are only available from curses.h when PDC_WIDE is defined
>   before its inclusion in your app:
> 
>   addnwstr    addstr
>   addwstr     addstr
> 
> QEMU does not define PDC_WIDE.  Try adding ./configure
> --extra-cflags=-DPDC_WIDE.

I think the problem is rather the incorrect include detection in
configure -- see <20161107133833.3681-1-msucha...@suse.de> ("[PATCH]
Fix legacy ncurses detection.") and the following thread.

Sergey: Are you running on SLES?

Re: [Qemu-devel] [PATCH] vhost-scsi: Update 'ioeventfd_started' with host notifiers

2016-11-08 Thread Paolo Bonzini



On 07/11/2016 18:23, Felipe Franciosi wrote:
> Following the recent refactor of virtio notifiers [1], more specifically
> the patch that uses virtio_bus_set_host_notifier [2] by default, core
> virtio code requires 'ioeventfd_started' to be set to true/false when
> the host notifiers are configured. Since vhost-scsi uses the legacy
> interface, this value is not updated.
> 
> When booting a guest with a vhost-scsi backend controller, SeaBIOS will
> initially configure the device which sets all notifiers. The guest will
> continue to boot fine until the kernel virtio-scsi module reinitialises
> the device causing a stop followed by another start. Since
> ioeventfd_started was never set to true, the 'stop' operation triggered
> by virtio_bus_set_host_notifier() will not result in a call to
> virtio_pci_ioeventfd_assign(assign=false). This leaves the memory
> regions with stale notifiers and results in the next start triggering
> the following assertion:
> 
>   kvm_mem_ioeventfd_add: error adding ioeventfd: File exists
>   Aborted
> 
> This patch updates ioeventfd_started whenever the notifiers are set or
> cleared, fixing this issue.
> 
> Signed-off-by: Felipe Franciosi 
> 
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2016-10/msg07748.html
> [2] http://lists.nongnu.org/archive/html/qemu-devel/2016-10/msg07760.html
> ---
>  hw/scsi/vhost-scsi.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
> index 5b26946..1c6e6d4 100644
> --- a/hw/scsi/vhost-scsi.c
> +++ b/hw/scsi/vhost-scsi.c
> @@ -95,6 +95,7 @@ static int vhost_scsi_start(VHostSCSI *s)
>  if (ret < 0) {
>  return ret;
>  }
> +VIRTIO_BUS(qbus)->ioeventfd_started = true;
>  
>  s->dev.acked_features = vdev->guest_features;
>  ret = vhost_dev_start(&s->dev, vdev);
> @@ -152,6 +153,7 @@ static void vhost_scsi_stop(VHostSCSI *s)
>  vhost_scsi_clear_endpoint(s);
>  vhost_dev_stop(&s->dev, vdev);
>  vhost_dev_disable_notifiers(&s->dev, vdev);
> +VIRTIO_BUS(qbus)->ioeventfd_started = false;
>  }
>  
>  static uint64_t vhost_scsi_get_features(VirtIODevice *vdev,
> 

While a bit hacky, at least for 2.8 the idea should be fine.  Only it
has to be done in vhost_dev_enable_notifiers and
vhost_dev_disable_notifiers, close to the calls to
virtio_device_stop_ioeventfd (i.e., set bus->ioeventfd_started after the
call) and virtio_device_start_ioeventfd (clear it before the call).
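
The failure mode described in the quoted commit message can be modelled
with a few lines of C. This is only a toy sketch under stated assumptions,
not QEMU code: struct toy_bus and every toy_* function are hypothetical
stand-ins, and a single boolean replaces the real notifier registration.
It shows why a stop that trusts a never-updated ioeventfd_started flag
leaves a stale notifier behind, so that the next start fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical, not QEMU code): one notifier slot plus the bus flag. */
struct toy_bus {
    bool ioeventfd_started;  /* what the core virtio code believes */
    bool notifier_assigned;  /* what is actually registered */
};

/* Register the notifier; -1 models the "File exists" error on a stale entry. */
static int toy_assign(struct toy_bus *bus)
{
    assert(bus != NULL);
    if (bus->notifier_assigned) {
        return -1;
    }
    bus->notifier_assigned = true;
    return 0;
}

/* Core-side stop: only tears the notifier down if the flag says it is running. */
static void toy_core_stop(struct toy_bus *bus)
{
    assert(bus != NULL);
    if (!bus->ioeventfd_started) {
        return;              /* flag never set: stale notifier stays behind */
    }
    bus->notifier_assigned = false;
    bus->ioeventfd_started = false;
}

/* Legacy-style start; update_flag selects whether the flag is kept in sync. */
static int toy_legacy_start(struct toy_bus *bus, bool update_flag)
{
    int ret = toy_assign(bus);
    if (ret == 0 && update_flag) {
        bus->ioeventfd_started = true;
    }
    return ret;
}

/* Start, stop, start again; returns the result of the second start. */
static int toy_restart(bool update_flag)
{
    struct toy_bus bus = { false, false };
    if (toy_legacy_start(&bus, update_flag) != 0) {
        return -2;
    }
    toy_core_stop(&bus);
    return toy_legacy_start(&bus, update_flag);
}
```

In this model toy_restart(false) fails on the second start, while
toy_restart(true) succeeds, which is the behaviour the flag update is
meant to restore.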

I've now worked through a fix that uses start_ioeventfd/stop_ioeventfd,
and it does come out nice (29 lines added, 65 removed :)), but it isn't
bisectable (so it has to be squashed into a single patch) and does not
cover vhost-net yet.  So let's go with your patch for now.

I'll send the vm_running fix separately, since that one is a real bugfix.

Paolo

Re: [Qemu-devel] [PATCH v13 1/2] virtio-crypto: Add virtio crypto device specification

2016-11-08 Thread Halil Pasic



On 10/28/2016 07:23 AM, Gonglei wrote:
> The virtio crypto device is a virtual crypto device (ie. hardware
> crypto accelerator card). Currently, the virtio crypto device provides
> the following crypto services: CIPHER, MAC, HASH, and AEAD.
> 
> In this patch, CIPHER, MAC, HASH, AEAD services are introduced.
> 
> VIRTIO-153
> 
> Signed-off-by: Gonglei 
> CC: Michael S. Tsirkin 
> CC: Cornelia Huck 
> CC: Stefan Hajnoczi 
> CC: Lingli Deng 
> CC: Jani Kokkonen 
> CC: Ola Liljedahl 
> CC: Varun Sethi 
> CC: Zeng Xin 
> CC: Keating Brian 
> CC: Ma Liang J 
> CC: Griffin John 
> CC: Hanweidong 
> CC: Mihai Claudiu Caraman 
> ---
[..]
> +The header is the general header and the union is of the algorithm-specific type,
> +which is set by the driver. All properties in the union are shown as follows.
> +
> +There is a unified idata structure for all symmetric algorithms, including CIPHER, HASH, MAC, and AEAD.
> +
> +The structure is defined as follows:
> +
> +\begin{lstlisting}
> +struct virtio_crypto_sym_input {
> +    /* Destination data guest address, it's useless for plain HASH and MAC */
> +    le64 dst_data_addr;
> +    /* Digest result guest address, it's useless for plain cipher algos */
> +    le64 digest_result_addr;
> +
> +    le32 status;
> +    le32 padding;
> +};
> +

This seems to be out of sync with the code (e.g. I can't find it in
virtio-crypto.h of the series linked in the cover letter). It seems to me
this reflects v4, since this has been gone from the QEMU code since v5.

> +\end{lstlisting}
> +
> +\subsubsection{HASH Service Operation}\label{sec:Device Types / Crypto Device / Device Operation / HASH Service Operation}
> +
> +\begin{lstlisting}
> +struct virtio_crypto_hash_para {
> +    /* length of source data */
> +    le32 src_data_len;
> +    /* hash result length */
> +    le32 hash_result_len;
> +};
> +
> +struct virtio_crypto_hash_input {
> +    struct virtio_crypto_sym_input input;
> +};
> +
> +struct virtio_crypto_hash_output {
> +    /* source data guest address */
> +    le64 src_data_addr;
> +};
> +
> +struct virtio_crypto_hash_data_req {
> +    /* Device-readable part */
> +    struct virtio_crypto_hash_para para;
> +    struct virtio_crypto_hash_output odata;
> +    /* Device-writable part */
> +    struct virtio_crypto_hash_input idata;
> +};
> +\end{lstlisting}
> +
> +Each data request uses virtio_crypto_hash_data_req structure to store information
> +used to run the HASH operations. The request only occupies one entry
> +in the Vring Descriptor Table in the virtio crypto device's dataq, which improves
> +the throughput of data transmitted for the HASH service, so that the virtio crypto
> +device can be better accelerated.
> +
> +The information includes the source data guest physical address stored by \field{odata}.\field{src_data_addr},
> +length of source data stored by \field{para}.\field{src_data_len}, and the digest result guest physical address
> +stored by \field{digest_result_addr} used to save the results of the HASH operations.
> +The address and length can determine exclusive content in the guest memory.
> +

Thus this does not make any sense to me. Furthermore, the problem seems to
persist across the whole specification, so in my opinion there is no point in
reviewing this version. Or am I missing something here? If I'm not missing
anything and the spec describes something quite outdated, when should we
expect a new version of the spec?

Regards,
Halil




Re: [Qemu-devel] [PATCH v11 10/22] vfio iommu type1: Add support for mediated devices

2016-11-08 Thread Alex Williamson

On Tue, 8 Nov 2016 20:36:34 +0530
Kirti Wankhede  wrote:

> On 11/8/2016 4:46 AM, Alex Williamson wrote:
> > On Sat, 5 Nov 2016 02:40:44 +0530
> > Kirti Wankhede  wrote:
> >   
> ...
> 
> >> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >> +static int __vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
> >> +                                    int prot, unsigned long *pfn_base,
> >> +                                    bool do_accounting)
> >> +{
> >> +        struct task_struct *task = dma->task;
> >> +        unsigned long limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +        bool lock_cap = dma->mlock_cap;
> >> +        struct mm_struct *mm = dma->addr_space->mm;
> >> +        int ret;
> >> +        bool rsvd;
> >> +
> >> +        ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
> >> +        if (ret)
> >> +                return ret;
> >> +
> >> +        rsvd = is_invalid_reserved_pfn(*pfn_base);
> >> +
> >> +        if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) {
> >> +                put_pfn(*pfn_base, prot);
> >> +                pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +                        __func__, task->comm, task_pid_nr(task),
> >> +                        limit << PAGE_SHIFT);
> >> +                return -ENOMEM;
> >> +        }
> >> +
> >> +        if (!rsvd && do_accounting)
> >> +                vfio_lock_acct(mm, 1);
> >> +
> >> +        return 1;
> >> +}
> >> +
> >> +static void __vfio_unpin_page_external(struct vfio_addr_space *addr_space,
> >> +                                       unsigned long pfn, int prot,
> >> +                                       bool do_accounting)
> >> +{
> >> +        put_pfn(pfn, prot);
> >> +
> >> +        if (do_accounting)
> >> +                vfio_lock_acct(addr_space->mm, -1);
> > 
> > Can't we batch this like we do elsewhere?  Intel folks, AIUI you intend
> > to pin all VM memory through this side channel, have you tested the
> > scalability and performance of this with larger VMs?  Our vfio_pfn
> > data structure alone is 40 bytes per pinned page, which means for
> > each 1GB of VM memory, we have 10MBs worth of struct vfio_pfn!
> > Additionally, unmapping each 1GB of VM memory will result in 256k
> > separate vfio_lock_acct() callbacks.  I'm concerned that we're not
> > being efficient enough in either space or time.
> > 
> > One thought might be whether we really need to save the pfn, we better
> > always get the same result if we pin it again, or maybe we can just do
> > a lookup through the mm at that point without re-pinning.  Could we get
> > to the point where we only need an atomic_t ref count per page in a
> > linear array relative to the IOVA?  
> 
> Ok. Is System RAM hot-plug supported? How is system RAM hot-plug
> handled? Are there DMA_MAP calls on such hot-plug for additional range?
> If we have a linear array/memory, we will have to realloc it on memory
> hot-plug?

I was thinking a linear array for each IOVA page within a vfio_dma.
The array would track the number of references (pins) of each page.  It
might actually need to be a page table given that a single vfio_dma can
nearly map the entire 64bit address space.  I don't think RAM hotplug
is a factor here, we need to support and properly account for multiple
IOVAs mapping to the same pfn, but the typical case will be a 1:1
mapping, I think that's what we'd optimize for.
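
For reference, the overhead figures in this thread can be reproduced with
a short calculation. The sketch below is illustrative only: it assumes
4 KiB pages and a 4-byte reference count (the usual size of atomic_t),
and the toy_* names are hypothetical helpers, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL      /* assumption: 4 KiB pages */
#define TOY_GIB       (1ULL << 30) /* 1 GiB of guest memory */

/* Number of pages needed to back `bytes` of guest memory. */
static uint64_t toy_pages(uint64_t bytes)
{
    return bytes / TOY_PAGE_SIZE;
}

/* Tracking metadata in bytes when each pinned page costs per_page_bytes. */
static uint64_t toy_overhead(uint64_t bytes, uint64_t per_page_bytes)
{
    assert(per_page_bytes > 0);
    return toy_pages(bytes) * per_page_bytes;
}
```

Under these assumptions, 1 GiB is 262144 pages (the "256k" callbacks),
40 bytes per page comes to 10 MiB, and a 4-byte counter per page comes
to 1 MiB.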

> >  That would give us 1MB per 1GB
> > overhead. The semantics of the pin and unpin would make more sense then
> > too, both would take an IOVA range, only pinning would need a return
> > mechanism. For instance:
> > 
> > int pin_pages(void *iommu_data, dma_addr_t iova_base,
> >   int npage, unsigned long *pfn_base);
> > 
> > This would pin physically contiguous pages up to npage, returning the
> > base pfn and returning the number of pages pinned (<= npage).  The
> > vendor driver would make multiple calls to fill the necessary range.  
> 
> 
> With the current patch, input is user_pfn[] array and npages.
> 
> int vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>int npage, int prot, unsigned long *phys_pfn)
> 
> 
> When guest allocates memory with malloc(), gfns would not be contiguous,
> right? These gfns (user_pfns) are passed as argument here.
> Is there any case where we could get pin/unpin request for contiguous pages?

It would depend on whether the user within the guest is actually
optimizing for hugepages.
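
To make the proposed calling convention concrete, here is a rough sketch
of the loop a vendor driver might use to fill a range with repeated calls.
Everything in it is a hypothetical stand-in: toy_pin_pages() is a mock
that pretends physical memory is contiguous in runs of at most 512 pages
(2MB with 4 KiB pages), it drops the iommu_data argument, and it is not
the interface from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_iova_t;

#define TOY_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define TOY_RUN        512  /* mock: contiguous runs of at most 512 pages */

/*
 * Mock of the proposed pin call: pins up to npage physically contiguous
 * pages starting at iova_base, stores the first pfn in *pfn_base and
 * returns the number of pages pinned (<= npage).
 */
static int toy_pin_pages(toy_iova_t iova_base, int npage,
                         unsigned long *pfn_base)
{
    unsigned long first = (unsigned long)(iova_base >> TOY_PAGE_SHIFT);
    int left_in_run = TOY_RUN - (int)(first % TOY_RUN);

    *pfn_base = first + 0x100000UL;  /* arbitrary made-up pfn offset */
    return npage < left_in_run ? npage : left_in_run;
}

/*
 * Caller side: keep calling until the whole range is pinned.
 * Returns the number of calls made, or a negative value on error.
 */
static int toy_pin_range(toy_iova_t iova, int npage)
{
    int calls = 0;

    assert(npage >= 0);
    while (npage > 0) {
        unsigned long pfn;
        int got = toy_pin_pages(iova, npage, &pfn);

        if (got <= 0) {
            return got < 0 ? got : -1;
        }
        /* a real driver would record pfn .. pfn + got - 1 here */
        iova += (toy_iova_t)got << TOY_PAGE_SHIFT;
        npage -= got;
        calls++;
    }
    return calls;
}
```

With this mock, a 1024-page range starting on a run boundary needs two
calls instead of 1024 per-page requests, which is where hugepage-backed
guests would benefit.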

> > Unpin would then simply be:
> > 
> > void unpin_pages(void *iommu_data, dma_addr_t iova_base, int npage);
> > 
> > Hugepage usage would really make such an interface shine (ie. 2MB+
> > contiguous ranges).  A downside would be the overhead of getting the
> > group and container reference in vfio for each callback, perhaps we'd
> > need to figure out how the vendor driver could hold that reference.  
> 
> In very initial phases of proposal, I had suggested to keep pointer to
> container->iommu_data in struct mdev_device. But that was discarded.

The referencing

Re: [Qemu-devel] [PATCH for-2.8 v2 0/2] aio-posix: epoll cleanups

2016-11-08 Thread Stefan Hajnoczi

On Tue, Nov 08, 2016 at 02:55:22PM +0100, Paolo Bonzini wrote:
> The first fixes a NULL-pointer dereference that was reported by
> Coverity (so definitely for 2.8).  The second is a small simplification.
> 
> Paolo Bonzini (2):
>   aio-posix: avoid NULL pointer dereference in aio_epoll_update
>   aio-posix: simplify aio_epoll_update
> 
>  aio-posix.c | 55 +--
>  1 file changed, 25 insertions(+), 30 deletions(-)
> 
> -- 
> 2.7.4
> 
> 

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan



Re: [Qemu-devel] [PATCH] docs: add document to explain the usage of vNVDIMM

2016-11-08 Thread Stefan Hajnoczi

On Tue, Nov 08, 2016 at 08:46:14PM +0800, Haozhong Zhang wrote:
> Signed-off-by: Haozhong Zhang 
> Reviewed-by: Xiao Guangrong 
> ---
>  docs/nvdimm.txt | 124 
> 
>  1 file changed, 124 insertions(+)
>  create mode 100644 docs/nvdimm.txt
> 
> diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
> new file mode 100644
> index 0000000..fafca39
> --- /dev/null
> +++ b/docs/nvdimm.txt
> @@ -0,0 +1,124 @@
> +QEMU Virtual NVDIMM
> +===================
> +
> +This document explains the usage of the virtual NVDIMM (vNVDIMM) feature,
> +which is available since QEMU v2.6.0.
> +
> +The current QEMU only implements the persistent memory mode of vNVDIMM
> +device.

"and not the block window mode."

Explicitly naming block window mode would be useful for anyone looking
through the docs to find out whether this mode is supported or not.

> +
> +Basic Usage
> +-----------
> +
> +The storage of a vNVDIMM device in QEMU is provided by the memory
> +backend (i.e. memory-backend-file and memory-backend-ram). A simple
> +way to create a vNVDIMM device at startup time is done via the
> +following command line options:
> +
> + -machine pc,nvdimm
> + -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
> + -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
> + -device nvdimm,id=nvdimm1,memdev=mem1
> +
> +Where,
> +
> + - the "nvdimm" machine option enables vNVDIMM feature.
> +
> + - "slots=$N" should be equal to or larger than the total amount of
> +   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.
> +
> + - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
> +   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
> +   >= $RAM_SIZE + $NVDIMM_SIZE here.
> +
> + - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
> +   creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
> +   accesses to the virtual NVDIMM device go to the file $PATH.
> +
> +   "share=on/off" controls the visibility of guest writes. If
> +   "share=on", then guest writes will be applied to the backend
> +   file. If another guest uses the same backend file with option
> +   "share=on", then above writes will be visible to it as well. If
> +   "share=off", then guest writes won't be applied to the backend
> +   file and thus will be invisible to other guests.
> +
> + - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
> +   device whose storage is provided by above memory backend device.
> +
> +Multiple vNVDIMM devices can be created if multiple pairs of "-object"
> +and "-device" are provided.
> +
> +For above command line options, if the guest OS has the proper NVDIMM
> +driver, it should be able to detect a NVDIMM device which is in the
> +persistent memory mode and whose size is $NVDIMM_SIZE.
> +
> +Note:
> +
> +1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
> +   backend file size is not equal to the size given by "size" option,
> +   QEMU will truncate the backend file by ftruncate(2), which will
> +   corrupt the existing data in the backend file, especially for the
> +   shrink case.
> +
> +   QEMU v2.8.0 and later check the backend file size and the "size"
> +   option. If they do not match, QEMU will report errors and abort in
> +   order to avoid the data corruption.
> +
> +2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
> +   option of memory-backend-file, e.g. 4KB alignment on x86.  However,
> +   QEMU v2.7.0 puts an additional alignment requirement, which may
> +   require a larger value than the basic one, e.g. 2MB on x86. This
> +   change breaks the usage of memory-backend-file that only satisfies
> +   the basic alignment.
> +
> +   QEMU v2.8.0 and later remove the additional alignment on non-s390x
> +   architectures, so the broken memory-backend-file can work again.
> +
> +Label
> +-----
> +
> +QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
> +To enable label on vNVDIMM devices, users can simply add
> +"label-size=$SZ" option to "-device nvdimm", e.g.
> +
> + -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
> +
> +Note:
> +
> +1. The minimal label size is 128KB.
> +
> +2. QEMU v2.7.0 and later store labels at the end of backend storage.
> +   If a memory backend file, which was previously used as the backend
> +   of a vNVDIMM device without labels, is now used for a vNVDIMM
> +   device with label, the data in the label area at the end of file
> +   will be inaccessible to the guest. If any useful data (e.g. the
> +   meta-data of the file system) was stored there, the latter usage
> +   may result in guest data corruption (e.g. breakage of guest file
> +   system).
> +
> +Hotplug
> +-------
> +
> +QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
> +devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
>

Re: [Qemu-devel] [PATCH 0/5] Fixes for the MAINTAINERS file

2016-11-08 Thread Stefan Hajnoczi

On Tue, Nov 08, 2016 at 01:17:48PM +0100, Thomas Huth wrote:
> I've currently got some update patches to the MAINTAINERS file
> floating around, and Paolo asked me to send a PULL request for
> them - so here's now the assembled set of patches for a final
> review. If there are no objections, I'll send a PULL request in
> a couple of days.
> 
> Note: I've also included John's patch for the bitmap support here,
> since it is related - if that should go through another tree
> instead, please let me know.
> 
> The m68k update is also a v2 of a patch that I've sent some time ago...
> Laurent, please have another look at that one to see whether it is
> OK now.
> 
> John Snow (1):
>   MAINTAINERS: Add Fam and Jsnow for Bitmap support
> 
> Thomas Huth (4):
>   MAINTAINERS: Add some ARM related files to the corresponding sections
>   sparc: Add slavio_misc.c and eccmemctl.c to the MAINTAINERS file
>   m68k: Update the 68k sections in the MAINTAINERS file
>   MAINTAINERS: Add an entry for the CHRP NVRAM files
> 
>  MAINTAINERS | 37 +++--
>  1 file changed, 35 insertions(+), 2 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 

Reviewed-by: Stefan Hajnoczi 


