Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread Nikunj A Dadhania
David Gibson  writes:

> [ Unknown signature status ]
> On Thu, Apr 06, 2017 at 03:52:47PM +0530, Nikunj A Dadhania wrote:
>> Emulating LL/SC with cmpxchg is not correct, since it can suffer from
>> the ABA problem. However, portable parallel code is written assuming
>> only cmpxchg which means that in practice this is a viable alternative.
>> 
>> Signed-off-by: Nikunj A Dadhania 
>> ---
>>  target/ppc/translate.c | 24 +---
>>  1 file changed, 21 insertions(+), 3 deletions(-)
>> 
>> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
>> index b6abc60..a9c733d 100644
>> --- a/target/ppc/translate.c
>> +++ b/target/ppc/translate.c
>> @@ -73,6 +73,7 @@ static TCGv cpu_cfar;
>>  #endif
>>  static TCGv cpu_xer, cpu_so, cpu_ov, cpu_ca, cpu_ov32, cpu_ca32;
>>  static TCGv cpu_reserve;
>> +static TCGv cpu_reserve_val;
>>  static TCGv cpu_fpscr;
>>  static TCGv_i32 cpu_access_type;
>>  
>> @@ -181,6 +182,9 @@ void ppc_translate_init(void)
>>  cpu_reserve = tcg_global_mem_new(cpu_env,
>>   offsetof(CPUPPCState, reserve_addr),
>>   "reserve_addr");
>> +cpu_reserve_val = tcg_global_mem_new(cpu_env,
>> + offsetof(CPUPPCState, reserve_val),
>> + "reserve_val");
>
> I notice that lqarx is not updated.  Does that matter?

That's correct; I haven't touched that yet. Most of the locks are
implemented using lwarx/stwcx.

Regards
Nikunj
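The ABA hazard that the commit message concedes can be seen in a minimal C11 model of this emulation strategy. This is illustrative code, not QEMU's; `mem` and `reserve_val` are stand-ins for guest memory and the patch's `cpu_reserve_val`:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Model of emulating lwarx/stwcx. with a compare-and-swap: the
 * store-conditional succeeds whenever the current value equals the
 * value seen by the load-reserve, even if the location was modified
 * and then restored in between -- the classic ABA problem, where a
 * real reservation would have been lost. */
static _Atomic uint32_t mem;
static uint32_t reserve_val;   /* mirrors cpu_reserve_val in the patch */

static void emulated_lwarx(void)
{
    reserve_val = atomic_load(&mem);
}

static int emulated_stwcx(uint32_t new_val)
{
    uint32_t expected = reserve_val;
    /* Succeeds iff mem still holds reserve_val: a value match,
     * not a true reservation check. */
    return atomic_compare_exchange_strong(&mem, &expected, new_val);
}
```

As the commit message argues, portable lock-free code is written against cmpxchg semantics anyway, so this value-match approximation is viable in practice.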




Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread David Gibson
On Thu, Apr 06, 2017 at 03:52:47PM +0530, Nikunj A Dadhania wrote:
> Emulating LL/SC with cmpxchg is not correct, since it can suffer from
> the ABA problem. However, portable parallel code is written assuming
> only cmpxchg which means that in practice this is a viable alternative.
> 
> Signed-off-by: Nikunj A Dadhania 
> ---
>  target/ppc/translate.c | 24 +---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index b6abc60..a9c733d 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -73,6 +73,7 @@ static TCGv cpu_cfar;
>  #endif
>  static TCGv cpu_xer, cpu_so, cpu_ov, cpu_ca, cpu_ov32, cpu_ca32;
>  static TCGv cpu_reserve;
> +static TCGv cpu_reserve_val;
>  static TCGv cpu_fpscr;
>  static TCGv_i32 cpu_access_type;
>  
> @@ -181,6 +182,9 @@ void ppc_translate_init(void)
>  cpu_reserve = tcg_global_mem_new(cpu_env,
>   offsetof(CPUPPCState, reserve_addr),
>   "reserve_addr");
> +cpu_reserve_val = tcg_global_mem_new(cpu_env,
> + offsetof(CPUPPCState, reserve_val),
> + "reserve_val");

I notice that lqarx is not updated.  Does that matter?

>  cpu_fpscr = tcg_global_mem_new(cpu_env,
> offsetof(CPUPPCState, fpscr), "fpscr");
> @@ -3023,7 +3027,7 @@ static void gen_##name(DisasContext *ctx)   
>  \
>  }\
>  tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);\
>  tcg_gen_mov_tl(cpu_reserve, t0); \
> -tcg_gen_st_tl(gpr, cpu_env, offsetof(CPUPPCState, reserve_val)); \
> +tcg_gen_mov_tl(cpu_reserve_val, gpr);\
>  tcg_temp_free(t0);   \
>  }
>  
> @@ -3156,14 +3160,28 @@ static void gen_conditional_store(DisasContext *ctx, 
> TCGv EA,
>int reg, int memop)
>  {
>  TCGLabel *l1;
> +TCGv_i32 tmp = tcg_temp_local_new_i32();
> +TCGv t0;
>  
> +tcg_gen_movi_i32(tmp, 0);
>  tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>  l1 = gen_new_label();
>  tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
> -tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
> -tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
> +
> +t0 = tcg_temp_new();
> +tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
> +  ctx->mem_idx, DEF_MEMOP(memop));
> +tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
> +tcg_gen_trunc_tl_i32(tmp, t0);
> +
>  gen_set_label(l1);
> +tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
> +tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);
>  tcg_gen_movi_tl(cpu_reserve, -1);
> +tcg_gen_movi_tl(cpu_reserve_val, 0);
> +
> +tcg_temp_free(t0);
> +tcg_temp_free_i32(tmp);
>  }
>  #endif
>  
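The CR0 update in the quoted hunk (setcond, shli, or) amounts to the following plain-C sketch. The `CRF_EQ_BIT` value here is illustrative only; QEMU defines the real bit position:

```c
#include <stdint.h>

/* Sketch of the CR0 update in gen_conditional_store above: CR0 starts
 * from a copy of the SO (summary overflow) bit, and EQ is set iff the
 * cmpxchg returned the reserved value. CRF_EQ_BIT is an assumed value,
 * not QEMU's definition. */
#define CRF_EQ_BIT 1

static uint32_t cr0_after_stcx(uint32_t so, int cmpxchg_returned_reserve_val)
{
    uint32_t cr0 = so;                          /* trunc of cpu_so      */
    cr0 |= (uint32_t)cmpxchg_returned_reserve_val
           << CRF_EQ_BIT;                       /* setcond + shli + or  */
    return cr0;
}
```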

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64

2017-04-06 Thread Nikunj A Dadhania
Cédric Le Goater  writes:

> Hello Nikunj,
>
> On 04/06/2017 12:22 PM, Nikunj A Dadhania wrote:
>> The series enables Multi-Threaded TCG on PPC64
>> 
>> Patch 01: Use atomic_cmpxchg in store conditional
>>   02: Handle first write to page during atomic operation
>>   03: Generate memory barriers for sync/isync and load/store conditional
>> 
>> Patches are based on ppc-for-2.10
>> 
>> Tested using following:
>> ./ppc64-softmmu/qemu-system-ppc64 -cpu POWER8 -vga none -nographic -machine 
>> pseries,usb=off -m 2G  -smp 8,cores=8,threads=1 -accel tcg,thread=multi  
>> f23.img
>
> I tried it with a Ubuntu 16.04.2 guest using stress --cpu 8. It looked 
> good : the CPU usage of QEMU reached 760% on the host.

Cool.

>> Todo:
>> * Enable other machine types and PPC32.
>
> I am quite ignorant on the topic.
> Have you looked at what it would take to emulate support of the HW
> threads ?

We would need to implement msgsndp (doorbell support for IPIs between
threads of the same core).

> and the PowerNV machine ?

I haven't tried it, but it should work. Just give it a shot and let me
know if you see any problems.

Regards
Nikunj




Re: [Qemu-devel] [PATCH 0/2] use blk_pwrite_zeroes for each zero cluster

2017-04-06 Thread 858585 jemmy
The test results for this patch:

The migration command:
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-all qemu+ssh://10.59.163.38/system

The qemu-img info on the source host:
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 1.0T (1095216660480 bytes)
disk size: 1.5G (1638989824 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The qemu-img info on the destination host (before applying the patch):
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 4.1G (4423286784 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The qemu-img info on the destination host (after applying the patch):
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 2.3G (2496200704 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The disk size is reduced from 4.1G to 2.3G.


On Thu, Apr 6, 2017 at 9:15 PM,   wrote:
> From: Lidong Chen 
>
> BLOCK_SIZE is (1 << 20), while the qcow2 cluster size is 65536 by
> default; this may cause the qcow2 file to grow larger after migration.
> This patch checks each cluster and uses blk_pwrite_zeroes for each
> zero cluster.
>
> Lidong Chen (2):
>   block: make bdrv_get_cluster_size public
>   migration/block: use blk_pwrite_zeroes for each zero cluster
>
>  block/io.c|  2 +-
>  include/block/block.h |  1 +
>  migration/block.c | 34 --
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> --
> 1.8.3.1
>
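The per-cluster zero detection described in the cover letter can be sketched as follows. The names and helpers are illustrative stand-ins, not the actual migration/block code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Walk a buffer in cluster_size chunks and count the chunks that are
 * entirely zero; those are the ones the destination could submit via
 * blk_pwrite_zeroes() instead of writing literal zeroes, avoiding
 * qcow2 cluster allocation. */
static bool chunk_is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

static size_t count_zero_clusters(const uint8_t *buf, size_t len,
                                  size_t cluster_size)
{
    size_t zeroes = 0;
    for (size_t off = 0; off < len; off += cluster_size) {
        size_t chunk = len - off < cluster_size ? len - off : cluster_size;
        if (chunk_is_zero(buf + off, chunk)) {
            zeroes++;
        }
    }
    return zeroes;
}
```

This is why checking at cluster granularity (65536) rather than at BLOCK_SIZE granularity (1 MiB) matters: a mostly-zero megabyte with one dirty cluster would otherwise be written out in full.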



Re: [Qemu-devel] [PATCH RFC v1 3/3] target/ppc: Generate fence operations

2017-04-06 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>> @@ -3028,6 +3030,7 @@ static void gen_##name(DisasContext *ctx)  
>>   \
>>  tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);\
>>  tcg_gen_mov_tl(cpu_reserve, t0); \
>>  tcg_gen_mov_tl(cpu_reserve_val, gpr);\
>> +tcg_gen_mb(TCG_MO_ALL | TCG_BAR_LDAQ);   \
>
> Please put the barrier next to the load.
> I hope we can merge these soon.

Sure

>
>> @@ -3196,6 +3199,7 @@ static void gen_##name(DisasContext *ctx)  
>>  \
>>  if (len > 1) {  \
>>  gen_check_align(ctx, t0, (len) - 1);\
>>  }   \
>> +tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);  \
>>  gen_conditional_store(ctx, t0, rS(ctx->opcode), memop); \
>>  tcg_temp_free(t0);  \
>>  }
>
> This one is more complicated...
>
> The success path includes an atomic_cmpxchg, which itself is a SC barrier. 
> However, the fail path branches across the cmpxchg and so sees no barrier, 
> but 
> one is still required by the architecture.  How about a gen_conditional_store 
> that looks like this:
>
> {
>TCGLabel *l1 = gen_new_label();
>TCGLabel *l2 = gen_new_label();
>TCGv t0;
>
>tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>
>t0 = tcg_temp_new();
>tcg_gen_atomic_cmpxchg_tl(t0, cpu_reserve, cpu_reserve_val,
>  cpu_gpr[reg], ctx->mem_idx,
>  DEF_MEMOP(memop) | MO_ALIGN);
>tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
>tcg_gen_shli_tl(t0, t0, CRF_EQ_BIT);
>tcg_gen_or_tl(t0, t0, cpu_so);
>tcg_gen_trunc_tl(cpu_crf[0], t0);
>tcg_temp_free(t0);
>tcg_gen_br(l2);
>
>gen_set_label(l1);
>
>/* Address mismatch implies failure.  But we still need to provide the
>   memory barrier semantics of the instruction.  */
>tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);
>tcg_gen_trunc_tl(cpu_crf[0], cpu_so);
>
>gen_set_label(l2);
>tcg_gen_movi_tl(cpu_reserve, -1);
> }
>
> Note that you don't need to reset cpu_reserve_val at all.

Sure.

> I just thought of something we might need to check for this and other ll/sc 
> implemetations -- do we need to check for address misalignment along the 
> address comparison failure path?

We do that in the macro:

  if (len > 1) {  \
  gen_check_align(ctx, t0, (len) - 1);\
  }   \

Would we still need a barrier before the alignment check?

> I suspect that we do.

Regards
Nikunj
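The barrier placement discussed above has a direct C11 analogue. This sketch is illustrative, not QEMU code: it mirrors acquire semantics on the load-reserve and release semantics on both store-conditional paths, including the address-mismatch failure path Richard points out:

```c
#include <stdatomic.h>
#include <stdint.h>

/* The load-reserve gets acquire semantics (TCG_BAR_LDAQ in the patch),
 * and the store-conditional must provide release semantics on BOTH
 * paths: implicitly via the CAS on success, and via an explicit fence
 * on the early-failure path (TCG_BAR_STRL). */
static _Atomic uint32_t word;

static uint32_t load_reserve(uint32_t *reserve_val)
{
    uint32_t v = atomic_load_explicit(&word, memory_order_acquire);
    *reserve_val = v;
    return v;
}

static int store_conditional(uint32_t reserve_val, uint32_t new_val,
                             int addr_matches)
{
    if (!addr_matches) {
        /* Address mismatch implies failure, but we still need the
         * instruction's memory barrier semantics. */
        atomic_thread_fence(memory_order_release);
        return 0;
    }
    uint32_t expected = reserve_val;
    return atomic_compare_exchange_strong_explicit(
        &word, &expected, new_val,
        memory_order_acq_rel, memory_order_acquire);
}
```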




Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>>  tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>>  l1 = gen_new_label();
>>  tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>> -tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
>> -tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
>> +
>> +t0 = tcg_temp_new();
>> +tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
>> +  ctx->mem_idx, DEF_MEMOP(memop));
>
> Actually, I noticed another, existing, problem.
>
> This code changes CRF[0] before the user memory write, which might fault.  
> This 
> needs to delay any changes to the architecture visible state until after any 
> exception may be triggered.

Sure; here you are referring to cpu_so being moved into CRF[0].

Regards
Nikunj




Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread Nikunj A Dadhania
Richard Henderson  writes:

> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>> +TCGv_i32 tmp = tcg_temp_local_new_i32();
>> +TCGv t0;
>>
>> +tcg_gen_movi_i32(tmp, 0);
>>  tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>>  l1 = gen_new_label();
>>  tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>> -tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
>> -tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
>> +
>> +t0 = tcg_temp_new();
>> +tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
>> +  ctx->mem_idx, DEF_MEMOP(memop));
>> +tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
>> +tcg_gen_trunc_tl_i32(tmp, t0);
>> +
>>  gen_set_label(l1);
>> +tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
>> +tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);
>
> I encourage you to move these two lines up beside the setcond.
> That way you don't need to use a local tmp, which implies a
> spill/restore from the stack.

Sure.

Regards
Nikunj




Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()

2017-04-06 Thread Peter Xu
On Thu, Apr 06, 2017 at 07:46:47PM +0800, Peter Xu wrote:
> On Thu, Apr 06, 2017 at 12:52:19PM +0200, Auger Eric wrote:
> > Hi Peter,
> > 
> > On 06/04/2017 09:08, Peter Xu wrote:

[...]

> > > +void memory_region_iommu_replay_all(MemoryRegion *mr)
> > > +{
> > > +IOMMUNotifier *notifier;
> > > +
> > > +IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> > > +memory_region_iommu_replay(mr, notifier, false);
> > It is not fully clear to me what the consequence is of always
> > setting is_write=false.
> 
> I am not quite sure about it either, on why we are having the is_write
> parameter on memory_region_iommu_replay(). It's there since the first
> commit that introduced the interface:
> 
> commit a788f227ef7bd2912fcaacdfe13d13ece2998149
> Author: David Gibson 
> Date:   Wed Sep 30 12:13:55 2015 +1000
> 
> memory: Allow replay of IOMMU mapping notifications
> 
> However, IMHO that should be a check against page RW permissions, which
> seems unnecessary during replay. Or rather, I am not sure whether the
> patch below would be good (as a standalone one besides the current
> patch/series):
> 
> --8<--
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6b33b9f..d008a4b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,7 +488,7 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
>  memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> -memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
> +memory_region_iommu_replay(giommu->iommu, &giommu->n);
>  
>  return;
>  }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c4fc94d..5ab0151 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -725,11 +725,8 @@ void memory_region_register_iommu_notifier(MemoryRegion 
> *mr,
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @is_write: Whether to treat the replay as a translate "write"
> - * through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n);
>  
>  /**
>   * memory_region_iommu_replay_all: replay existing IOMMU translations
> diff --git a/memory.c b/memory.c
> index b782d5b..155ca1c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1620,8 +1620,7 @@ uint64_t 
> memory_region_iommu_get_min_page_size(MemoryRegion *mr)
>  return TARGET_PAGE_SIZE;
>  }
>  
> -void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -bool is_write)
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>  {
>  hwaddr addr, granularity;
>  IOMMUTLBEntry iotlb;
> @@ -1635,7 +1634,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, 
> IOMMUNotifier *n,
>  granularity = memory_region_iommu_get_min_page_size(mr);
>  
>  for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> -iotlb = mr->iommu_ops->translate(mr, addr, is_write);
> +/*
> + * For a replay, we don't need to do permission check.
> + * Assuming a "read" operation here to make sure we will pass
> + * the check (in case we don't have write permission on the
> + * page).
> + */
> +iotlb = mr->iommu_ops->translate(mr, addr, false);
>  if (iotlb.perm != IOMMU_NONE) {
>  n->notify(n, &iotlb);
>  }
> @@ -1653,7 +1658,7 @@ void memory_region_iommu_replay_all(MemoryRegion *mr)
>  IOMMUNotifier *notifier;
>  
>  IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> -memory_region_iommu_replay(mr, notifier, false);
> +memory_region_iommu_replay(mr, notifier);
>  }
>  }
> 
> ->8--

This follow-up patch assumes that all pages have read permission. That
may not be true, e.g., for write-only pages. A better way (after a short
discussion with David Gibson) would be to convert is_write in
iommu_ops.translate() to IOMMUAccessFlags, so that we can pass in
IOMMU_NONE here (meaning we do not check permission for this page
translation). Further, we can remove the is_write parameter in
memory_region_iommu_replay().

In any case, I'll move this change out of the current series and send a
fresh post after 2.10. Thanks,

-- peterx
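The replay loop under discussion can be modelled with simplified stand-ins for QEMU's IOMMUTLBEntry/IOMMUNotifier. The types, names, and the identity translation below are illustrative assumptions, not QEMU code:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of memory_region_iommu_replay(): walk the region at its
 * minimum page granularity, translate each address without asserting
 * a particular access permission, and fire the notifier for every
 * valid mapping. */
typedef struct {
    uint64_t iova;
    uint64_t translated;
    int perm;               /* 0 stands in for IOMMU_NONE */
} TlbEntrySketch;

typedef TlbEntrySketch (*translate_fn)(uint64_t addr);
typedef void (*notify_fn)(const TlbEntrySketch *entry, void *opaque);

static size_t replay_sketch(uint64_t region_size, uint64_t granularity,
                            translate_fn translate, notify_fn notify,
                            void *opaque)
{
    size_t fired = 0;
    for (uint64_t addr = 0; addr < region_size; addr += granularity) {
        TlbEntrySketch e = translate(addr);
        if (e.perm != 0) {
            notify(&e, opaque);
            fired++;
        }
    }
    return fired;
}

/* Sample callbacks for demonstration: only the first two pages map. */
static TlbEntrySketch identity_translate(uint64_t addr)
{
    TlbEntrySketch e = { addr, addr, addr < 0x2000 ? 1 : 0 };
    return e;
}

static void count_notify(const TlbEntrySketch *e, void *opaque)
{
    (void)e;
    (void)opaque;
}
```

The permission question in the thread is about what the `translate` callback should be asked for during this walk; the loop structure itself is unchanged either way.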



Re: [Qemu-devel] [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-06 Thread gengdongjiu
Hi Laszlo,
  thanks.

On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear, Laszlo
>>Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
 Hi gengdongjiu,

 On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>
> Hi Laszlo/Biesheuvel/Qemu developer,
>
>Now I have encountered an issue on the ARM64 platform and want to
> consult with you, as described below:
>
> When the guest OS takes a synchronous or asynchronous abort, KVM needs
> to send the error address to QEMU or UEFI through SIGBUS so that the
> APEI table can be generated dynamically. From my investigation, there
> are two ways:
>
> (1) QEMU gets the error address and generates the APEI table, then
> notifies UEFI of this generation, then injects the abort error into
> the guest OS, and the guest OS reads the APEI table.
> (2) QEMU gets the error address and lets UEFI generate the APEI table,
> then injects the abort error into the guest OS, and the guest OS reads
> the APEI table.

 Just being pedantic! I don't think we are talking about creating the APEI 
 table
 dynamically here. The issue is: Once KVM has received an error that is 
 destined
 for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the 
 error
 into the guest OS, a CPER (Common Platform Error Record) has to be 
 generated
 corresponding to the error source (GHES corresponding to memory subsystem,
 processor etc) to allow the guest OS to do anything meaningful with the
 error. So who should create the CPER is the question.

 At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error 
 arrives
 at EL3 and secure firmware (at EL3 or a lower secure exception level) is
 responsible for creating the CPER. ARM is experimenting with using a 
 Standalone
 MM EDK2 image in the secure world to do the CPER creation. This will avoid
 adding the same code in ARM TF in EL3 (better for security). The error 
 will then
 be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM 
 Trusted
 Firmware.

 Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
 interface (as discussed with Christoffer below). So it should generate the 
 CPER
 before injecting the error.

 This is corresponds to (1) above apart from notifying UEFI (I am assuming 
 you
 mean guest UEFI). At this time, the guest OS already knows where to pick 
 up the
 CPER from through the HEST. Qemu has to create the CPER and populate its 
 address
 at the address exported in the HEST. Guest UEFI should not be involved in 
 this
 flow. Its job was to create the HEST at boot and that has been done by this
 stage.

 Qemu folk will be able to add but it looks like support for CPER 
 generation will
 need to be added to Qemu. We need to resolve this.

 Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>> etc/acpi/tables etc/hardware_errors
>>>  ==
>>>  +---+
>>> +--+ | address   | +-> +--+
>>> |HEST  + | registers | |   | Error Status |
>>> + ++ | +-+ |   | Data Block 1 |
>>> | | GHES   | --> | | address | +   | ++
>>> | | GHES   | --> | | address | --+ | |  CPER  |
>>> | | GHES   | --> | | address | + | | |  CPER  |
>>> | | GHES   | --> | | address | -+  | | | |  CPER  |
>>> +-++ +-+-+  |  | | +-++
>>> |  | |
>>> |  | +---> +--+
>>> |  |   | Error 

Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64

2017-04-06 Thread luigi burdo

Hi, I can help test it on my two BE machines too,

if someone can point me to the patch or to where I can download the
commits.


Thanks

Luigi


[Qemu-devel] [PATCH for-2.9?] qcow2: Allow discard of final unaligned cluster

2017-04-06 Thread Eric Blake
As mentioned in commit 0c1bd46, we ignored requests to
discard the trailing cluster of an unaligned image.  While
discard is an advisory operation from the guest standpoint,
(and we are therefore free to ignore any request), our
qcow2 implementation exploits the fact that a discarded
cluster reads back as 0.  As long as we discard on cluster
boundaries, we are fine; but that means we could observe
non-zero data leaked at the tail of an unaligned image.

Enhance iotest 66 to cover this case, and fix the implementation
to honor a discard request on the final partial cluster.

Signed-off-by: Eric Blake 
---

I can't convince myself whether we strongly rely on aligned discards
to guarantee that we read back zeroes on qcow2 (it would be a
stronger contract than what the block layer requires of pdiscard,
since the guest cannot guarantee that a discard does anything). If
we do, then this is a bug fix worthy of 2.9. If we don't, then the
changes to test 66 rely on internal implementation (but the test is
already specific to qcow2), and the patch can wait for 2.10.

At any rate, I do know that we don't want to make discard work on
sub-cluster boundaries anywhere except at the tail of the image
(that's what write-zeroes is for, and it may be slower when it
has to do COW or read-modify-write).  Any reliance that we (might)
have on whole-cluster discards reading back as 0 is also relying
on using aligned operations.
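The predicate this patch implements can be sketched as follows (illustrative helper names, assuming a power-of-two cluster size as in qcow2):

```c
#include <stdbool.h>
#include <stdint.h>

/* An unaligned discard request is honored only when it starts on a
 * cluster boundary and runs exactly to the end of an unaligned image,
 * i.e. it covers the final partial cluster; everything else unaligned
 * is still rejected with -ENOTSUP. */
static bool is_aligned(uint64_t x, uint64_t a)
{
    return (x & (a - 1)) == 0;
}

static bool honor_discard(uint64_t offset, uint64_t count,
                          uint64_t cluster_size, uint64_t image_size)
{
    if (is_aligned(offset | count, cluster_size)) {
        return true;                 /* fully aligned: always honored */
    }
    /* Tail case: cluster-aligned start, request ends at EOF. */
    return is_aligned(offset, cluster_size) && offset + count == image_size;
}
```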

 block/qcow2.c  |  7 ++-
 tests/qemu-iotests/066 | 12 +++-
 tests/qemu-iotests/066.out | 12 
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 6a92d2e..863d889 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2515,7 +2515,12 @@ static coroutine_fn int 
qcow2_co_pdiscard(BlockDriverState *bs,

 if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
 assert(count < s->cluster_size);
-return -ENOTSUP;
+/* Ignore partial clusters, except for the special case of the
+ * complete partial cluster at the end of an unaligned file */
+if (!QEMU_IS_ALIGNED(offset, s->cluster_size) ||
+offset + count != bs->total_sectors * BDRV_SECTOR_SIZE) {
+return -ENOTSUP;
+}
 }

qemu_co_mutex_lock(&s->lock);
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 364166d..c2116a3 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -42,16 +42,18 @@ _supported_fmt qcow2
 _supported_proto generic
 _supported_os Linux

+# Intentionally create an unaligned image
 IMGOPTS="compat=1.1"
-IMG_SIZE=64M
+IMG_SIZE=$((64 * 1024 * 1024 + 512))

 echo
-echo "=== Testing snapshotting an image with zero clusters ==="
+echo "=== Testing cluster discards ==="
 echo
 _make_test_img $IMG_SIZE
-# Write some normal clusters, zero them (creating preallocated zero clusters)
-# and discard those
-$QEMU_IO -c "write 0 256k" -c "write -z 0 256k" -c "discard 0 256k" 
"$TEST_IMG" \
+# Write some normal clusters, zero some of them (creating preallocated
+# zero clusters) and discard everything. Everything should now read as 0.
+$QEMU_IO -c "write 0 256k" -c "write -z 0 256k" -c "write 64M 512" \
+-c "discard 0 $IMG_SIZE" -c "read -P 0 0 $IMG_SIZE" "$TEST_IMG" \
  | _filter_qemu_io
 # Check the image (there shouldn't be any leaks)
 _check_test_img
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 7bc9a10..7c1f31a 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -1,13 +1,17 @@
 QA output created by 066

-=== Testing snapshotting an image with zero clusters ===
+=== Testing cluster discards ===

-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67109376
 wrote 262144/262144 bytes at offset 0
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 262144/262144 bytes at offset 0
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-discard 262144/262144 bytes at offset 0
-256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 67108864
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 67109376/67109376 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 67109376/67109376 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 *** done
-- 
2.9.3




Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-06 Thread 858585 jemmy
On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  wrote:
> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> When migrating at high speed, mig_save_device_bulk invokes
>> bdrv_is_allocated too frequently, which makes the VNC session respond
>> slowly. This patch limits the time used for bdrv_is_allocated.
>
> bdrv_is_allocated() is supposed to yield back to the event loop if it
> needs to block.  If your VNC session is experiencing jitter then it's
> probably because a system call in the bdrv_is_allocated() code path is
> synchronous when it should be asynchronous.
>
> You could try to identify the system call using strace -f -T.  In the
> output you'll see the duration of each system call.  I guess there is a
> file I/O system call that is taking noticable amounts of time.

Yes, I found the reason why bdrv_is_allocated needs to block.

The main reason is the qemu_co_mutex_lock invoked by
qcow2_co_get_block_status:
qemu_co_mutex_lock(&s->lock);
ret = qcow2_get_cluster_offset(bs, sector_num << 9, ,
   _offset);
qemu_co_mutex_unlock(&s->lock);

Another reason is the l2_load invoked by
qcow2_get_cluster_offset:

/* load the l2 table in memory */

ret = l2_load(bs, l2_offset, &l2_table);
if (ret < 0) {
return ret;
}

>
> A proper solution is to refactor the synchronous code to make it
> asynchronous.  This might require invoking the system call from a
> thread pool worker.
>

Yes, I agree with you, but that is a big change.
I will try to figure out how to optimize this code; it may take a long time.

This patch is not a perfect solution, but it can alleviate the problem.

> Stefan



Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context

2017-04-06 Thread Fam Zheng
On Thu, 04/06 17:17, Kevin Wolf wrote:
> Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> > The fact that bs->aio_context is changing can confuse the dataplane
> > iothread, because of the now fine-grained aio context lock.
> > bdrv_drain should rather be a bdrv_drained_begin/end pair, but since
> > bs->aio_context is changing, we can just use aio_disable_external and
> > block_job_pause.
> > 
> > Reported-by: Ed Swierk 
> > Signed-off-by: Fam Zheng 
> > ---
> >  block.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 8893ac1..e70684a 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -4395,11 +4395,14 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
> >  
> >  void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> >  {
> > -AioContext *ctx;
> > +AioContext *ctx = bdrv_get_aio_context(bs);
> >  
> > +aio_disable_external(ctx);
> > +if (bs->job) {
> > +block_job_pause(bs->job);
> > +}
> 
> Even more bs->job users... :-(
> 
> But is this one actually necessary? We drain all pending BHs below, so
> the job shouldn't have any requests in flight or be able to submit new
> ones while we switch.

I'm not 100% sure, but I think the aio_poll() loop below can still fire
co_sleep_cb if we don't do block_job_pause()?

> 
> >  bdrv_drain(bs); /* ensure there are no in-flight requests */
> >  
> > -ctx = bdrv_get_aio_context(bs);
> >  while (aio_poll(ctx, false)) {
> >  /* wait for all bottom halves to execute */
> >  }
> > @@ -4412,6 +4415,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, 
> > AioContext *new_context)
> >  aio_context_acquire(new_context);
> >  bdrv_attach_aio_context(bs, new_context);
> >  aio_context_release(new_context);
> > +if (bs->job) {
> > +block_job_resume(bs->job);
> > +}
> > +aio_enable_external(ctx);
> >  }
> 
> The aio_disable/enable_external() pair seems to make sense anyway.
> 
> Kevin



Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot

2017-04-06 Thread Fam Zheng
On Thu, 04/06 16:36, Kevin Wolf wrote:
> Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> > bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> > from the old bs to the new one, but if they are not in the same aio
> > context this causes an imbalance.
> > 
> > Fix this by forcing callers to move the aio context first, and assert
> > it.
> > 
> > Reported-by: Ed Swierk 
> > Signed-off-by: Fam Zheng 
> > ---
> >  block.c| 3 +++
> >  blockdev.c | 4 ++--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 927ba89..8893ac1 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -1752,6 +1752,9 @@ static void bdrv_replace_child_noperm(BdrvChild 
> > *child,
> >  {
> >  BlockDriverState *old_bs = child->bs;
> >  
> > +if (old_bs && new_bs) {
> > +assert(bdrv_get_aio_context(old_bs) == 
> > bdrv_get_aio_context(new_bs));
> > +}
> >  if (old_bs) {
> >  if (old_bs->quiesce_counter && child->role->drained_end) {
> >  child->role->drained_end(child);
> 
> Can we move this to a separate patch and also add this hunk for
> asserting the same thing for newly attached child nodes:
> 
> @@ -1852,6 +1855,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState 
> *parent_bs,
>  bdrv_get_cumulative_perm(parent_bs, , _perm);
>  
>  assert(parent_bs->drv);
> +assert(bdrv_get_aio_context(parent_bs) == 
> bdrv_get_aio_context(child_bs));
>  parent_bs->drv->bdrv_child_perm(parent_bs, NULL, child_role,
>  perm, shared_perm, , _perm);

OK, will do.



[Qemu-devel] [PULL for-2.9 1/1] vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk

2017-04-06 Thread Alex Williamson
The NVIDIA BAR5 quirk is targeting an ioport BAR.  Some older devices
have a BAR5 which is not ioport and can induce a segfault here.  Test
the BAR type to skip these devices.

Link: https://bugs.launchpad.net/qemu/+bug/1678466
Signed-off-by: Alex Williamson 
---
 hw/vfio/pci-quirks.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e9b493b939db..349085ea12bc 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -660,7 +660,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice 
*vdev, int nr)
 VFIOConfigWindowQuirk *window;
 
 if (!vfio_pci_is(vdev, PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID) ||
-!vdev->vga || nr != 5) {
+!vdev->vga || nr != 5 || !vdev->bars[5].ioport) {
 return;
 }
 




[Qemu-devel] [PULL for-2.9 0/1] VFIO fixes 2017-04-06

2017-04-06 Thread Alex Williamson
The following changes since commit 54d689988c847271d87b3eb113712147129fb811:

  Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into
staging (2017-04-06 09:27:49 +0100)

are available in the git repository at:


  git://github.com/awilliam/qemu-vfio.git tags/vfio-updates-20170406.0

for you to fetch changes up to 8f419c5b43b988df4ef11315aeb8524e50c99687:

  vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk (2017-04-06
16:03:26 -0600)


VFIO fixes 2017-04-06

 - Extra test for NVIDIA BAR5 quirk to avoid segfault (Alex Williamson)


Alex Williamson (1):
  vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk

 hw/vfio/pci-quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



Re: [Qemu-devel] [PATCH 03/21] ppc/pnv: Add support for POWER8+ LPC Controller

2017-04-06 Thread Benjamin Herrenschmidt
On Thu, 2017-04-06 at 14:44 +0200, Cédric Le Goater wrote:
> Maybe we could move that under the LPC object even if it is not strictly
> correct from a HW pov ?

I'm not a fan of this. It's really not part of the LPC controller and
it's specific to a certain crop of P8 machines.

Cheers,
Ben.




Re: [Qemu-devel] [PATCH 04/21] ppc/pnv: enable only one LPC bus

2017-04-06 Thread Benjamin Herrenschmidt
On Thu, 2017-04-06 at 14:35 +0200, Cédric Le Goater wrote:
> but real HW (2 sockets OpenPOWER systems, garrison and firestone)
> does 
> not expose the LPC bus on the second chip, should we do so in qemu ?

It's not so much HW as it is HostBoot. Not a huge deal.

Cheers,
Ben.




Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions

2017-04-06 Thread Richard W.M. Jones
On Thu, Apr 06, 2017 at 04:23:19PM -0500, Eric Blake wrote:
> On 04/06/2017 04:15 PM, Richard W.M. Jones wrote:
> 
>  +++ b/block/io.c
>  @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
>   size_t skip_bytes;
>   int ret;
>   
>  +assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
> >>>
> 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1439922
> > 
> > There is also a minimal reproducer in comment 2.
> 
> If it helps, backtrace shows:
> 
> #4  0x55c09f10 in bdrv_co_do_copy_on_readv
> (child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0) at
> block/io.c:948
> #5  0x55c0a33d in bdrv_aligned_preadv (child=0x56990180,
> req=0x7fffb6dffec0, offset=0, bytes=512, align=1, qiov=0x7fffd0c0,
> flags=1)
> at block/io.c:1058
> #6  0x55c0a8c3 in bdrv_co_preadv (child=0x56990180,
> offset=0, bytes=512, qiov=0x7fffd0c0, flags=BDRV_REQ_COPY_ON_READ)
> at block/io.c:1166
> #7  0x55bf7ccf in blk_co_preadv (blk=0x56a2bc20, offset=0,
> bytes=512, qiov=0x7fffd0c0, flags=0) at block/block-backend.c:927
> #8  0x55bf7e19 in blk_read_entry (opaque=0x7fffd0e0)
> at block/block-backend.c:974
> #9  0x55cc3015 in coroutine_trampoline (i0=1485566720, i1=21845)
> at util/coroutine-ucontext.c:79

The patch that fixes this is simply:

diff --git a/block/io.c b/block/io.c
index 2709a7007f..4f1a730e45 100644
--- a/block/io.c
+++ b/block/io.c
@@ -945,8 +945,6 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
 size_t skip_bytes;
 int ret;
 
-assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
-
 /* Cover entire cluster so no additional backing file I/O is required when
  * allocating cluster in the image file.
  */

It has no ill effects that I can see, and fixes the libguestfs test
suite, but I'm assuming the assert was added for a reason so I'm not
proposing this as a patch.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW



Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions

2017-04-06 Thread Eric Blake
On 04/06/2017 04:15 PM, Richard W.M. Jones wrote:

 +++ b/block/io.c
 @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
  size_t skip_bytes;
  int ret;
  
 +assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
>>>

> https://bugzilla.redhat.com/show_bug.cgi?id=1439922
> 
> There is also a minimal reproducer in comment 2.

If it helps, backtrace shows:

#4  0x55c09f10 in bdrv_co_do_copy_on_readv
(child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0) at
block/io.c:948
#5  0x55c0a33d in bdrv_aligned_preadv (child=0x56990180,
req=0x7fffb6dffec0, offset=0, bytes=512, align=1, qiov=0x7fffd0c0,
flags=1)
at block/io.c:1058
#6  0x55c0a8c3 in bdrv_co_preadv (child=0x56990180,
offset=0, bytes=512, qiov=0x7fffd0c0, flags=BDRV_REQ_COPY_ON_READ)
at block/io.c:1166
#7  0x55bf7ccf in blk_co_preadv (blk=0x56a2bc20, offset=0,
bytes=512, qiov=0x7fffd0c0, flags=0) at block/block-backend.c:927
#8  0x55bf7e19 in blk_read_entry (opaque=0x7fffd0e0)
at block/block-backend.c:974
#9  0x55cc3015 in coroutine_trampoline (i0=1485566720, i1=21845)
at util/coroutine-ucontext.c:79

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-devel] [PATCH] hw/arm/virt: generate 64-bit addressable ACPI objects

2017-04-06 Thread Ard Biesheuvel
Our current ACPI table generation code limits the placement of ACPI
tables to 32-bit addressable memory, in order to be able to emit the
root pointer (RSDP) and root table (RSDT) using table types from the
ACPI 1.0 days.

Since ARM was not supported by ACPI before version 5.0, it makes sense
to lift this restriction. This is not crucial for mach-virt, which is
guaranteed to have some memory available below the 4 GB mark, but it
is a nice-to-have for QEMU machines that do not have any 32-bit
addressable memory, not unlike some real-world 64-bit ARM systems.

Since we already emit a version of the RSDP root pointer that carries
a 64-bit wide address for the 64-bit root table (XSDT), all we need to
do is replace the RSDT generation with the generation of an XSDT table,
and use a different slot in the FADT table to refer to the DSDT.

Note that this only modifies the private interaction between QEMU and
UEFI. Since these tables are all handled by the generic AcpiTableDxe
driver which generates its own RSDP, RSDT and XSDT tables and manages
the DSDT pointer in FADT itself, the tables that are present to the
guest are identical, and so no mach-virt versioning is required for
this change.

Signed-off-by: Ard Biesheuvel 
---
 hw/arm/virt-acpi-build.c| 55 ++---
 include/hw/acpi/acpi-defs.h | 11 +
 2 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index b173bd109b91..d18368094c00 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -391,12 +391,12 @@ static void acpi_dsdt_add_power_button(Aml *scope)
 
 /* RSDP */
 static GArray *
-build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned rsdt_tbl_offset)
+build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned xsdt_tbl_offset)
 {
 AcpiRsdpDescriptor *rsdp = acpi_data_push(rsdp_table, sizeof *rsdp);
-unsigned rsdt_pa_size = sizeof(rsdp->rsdt_physical_address);
-unsigned rsdt_pa_offset =
-(char *)&rsdp->rsdt_physical_address - rsdp_table->data;
+unsigned xsdt_pa_size = sizeof(rsdp->xsdt_physical_address);
+unsigned xsdt_pa_offset =
+(char *)&rsdp->xsdt_physical_address - rsdp_table->data;
 
 bios_linker_loader_alloc(linker, ACPI_BUILD_RSDP_FILE, rsdp_table, 16,
  true /* fseg memory */);
@@ -408,8 +408,8 @@ build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned rsdt_tbl_offset)
 
 /* Address to be filled by Guest linker */
 bios_linker_loader_add_pointer(linker,
-ACPI_BUILD_RSDP_FILE, rsdt_pa_offset, rsdt_pa_size,
-ACPI_BUILD_TABLE_FILE, rsdt_tbl_offset);
+ACPI_BUILD_RSDP_FILE, xsdt_pa_offset, xsdt_pa_size,
+ACPI_BUILD_TABLE_FILE, xsdt_tbl_offset);
 
 /* Checksum to be filled by Guest linker */
 bios_linker_loader_add_checksum(linker, ACPI_BUILD_RSDP_FILE,
@@ -686,7 +686,7 @@ static void build_fadt(GArray *table_data, BIOSLinker *linker,
VirtMachineState *vms, unsigned dsdt_tbl_offset)
 {
 AcpiFadtDescriptorRev5_1 *fadt = acpi_data_push(table_data, sizeof(*fadt));
-unsigned dsdt_entry_offset = (char *)&fadt->dsdt - table_data->data;
+unsigned xdsdt_entry_offset = (char *)&fadt->Xdsdt - table_data->data;
 uint16_t bootflags;
 
 switch (vms->psci_conduit) {
@@ -712,7 +712,7 @@ static void build_fadt(GArray *table_data, BIOSLinker *linker,
 
 /* DSDT address to be filled by Guest linker */
 bios_linker_loader_add_pointer(linker,
-ACPI_BUILD_TABLE_FILE, dsdt_entry_offset, sizeof(fadt->dsdt),
+ACPI_BUILD_TABLE_FILE, xdsdt_entry_offset, sizeof(fadt->Xdsdt),
 ACPI_BUILD_TABLE_FILE, dsdt_tbl_offset);
 
 build_header(linker, table_data,
@@ -772,12 +772,41 @@ struct AcpiBuildState {
 bool patched;
 } AcpiBuildState;
 
+/* Build xsdt table */
+
+static
+void build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets,
+const char *oem_id, const char *oem_table_id)
+{
+int i;
+unsigned xsdt_entries_offset;
+AcpiXsdtDescriptorRev2 *xsdt;
+const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len);
+const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]);
+const size_t xsdt_len = sizeof(*xsdt) + table_data_len;
+
+xsdt = acpi_data_push(table_data, xsdt_len);
+xsdt_entries_offset = (char *)xsdt->table_offset_entry - table_data->data;
+for (i = 0; i < table_offsets->len; ++i) {
+uint64_t ref_tbl_offset = g_array_index(table_offsets, uint32_t, i);
+uint64_t xsdt_entry_offset = xsdt_entries_offset + xsdt_entry_size * i;
+
+/* xsdt->table_offset_entry to be filled by Guest linker */
+bios_linker_loader_add_pointer(linker,
+ACPI_BUILD_TABLE_FILE, xsdt_entry_offset, xsdt_entry_size,
+ACPI_BUILD_TABLE_FILE, ref_tbl_offset);
+}
+build_header(linker, table_data,
+ (void *)xsdt, 

Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions

2017-04-06 Thread Richard W.M. Jones
On Thu, Apr 06, 2017 at 04:03:28PM -0500, Eric Blake wrote:
> On 04/06/2017 03:59 PM, Richard W.M. Jones wrote:
> > On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
> >> This adds assertions that ensure that the necessary write permissions
> >> have been granted before someone attempts to write to a node.
> >>
> >> Signed-off-by: Kevin Wolf 
> >> Acked-by: Fam Zheng 
> >> Reviewed-by: Max Reitz 
> >> ---
> >>  block/io.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/block/io.c b/block/io.c
> >> index 2592ca1..4c79745 100644
> >> --- a/block/io.c
> >> +++ b/block/io.c
> >> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
> >>  size_t skip_bytes;
> >>  int ret;
> >>  
> >> +assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
> > 
> > 
> > This assertion is thrown in the libguestfs test suite.  I filed a bug
> > about it against the Fedora package:
> > 
> > https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html
> 
> Wrong URL?

Ooops, the right link is:

https://bugzilla.redhat.com/show_bug.cgi?id=1439922

There is also a minimal reproducer in comment 2.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/



[Qemu-devel] [PATCH] event: Add signal information to SHUTDOWN

2017-04-06 Thread Eric Blake
qemu_kill_report() is already able to tell whether a shutdown
was triggered by guest action (no output) or by a host signal
(a message about termination is printed via error_report); but
this information is then lost.  Libvirt would like to be able
to distinguish between a SHUTDOWN event triggered solely by
guest request and one triggered by a SIGTERM on the host.

Enhance the SHUTDOWN event to pass the value of shutdown_signal
through to the monitor client, suitably remapped into a
platform-neutral string.  Note that mingw lacks decent signal
support, and will never report a signal because it never calls
qemu_system_killed().

See also https://bugzilla.redhat.com/1384007

Signed-off-by: Eric Blake 
---
 qapi/event.json | 20 +++-
 vl.c| 21 ++---
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/qapi/event.json b/qapi/event.json
index e80f3f4..6aad475 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -5,11 +5,29 @@
 ##

 ##
+# @ShutdownSignal:
+#
+# The list of host signal types known to cause qemu to shut down a guest.
+#
+# @int: SIGINT
+# @hup: SIGHUP
+# @term: SIGTERM
+#
+# Since: 2.10
+##
+{ 'enum': 'ShutdownSignal', 'data': [ 'int', 'hup', 'term' ] }
+
+##
 # @SHUTDOWN:
 #
 # Emitted when the virtual machine has shut down, indicating that qemu is
 # about to exit.
 #
+# @signal: If present, the shutdown was (probably) triggered due to
+# the receipt of the given signal in the host, rather than by a guest
+# action (note that there is an inherent race with a guest choosing to
+# shut down near the same time the host sends a signal). (since 2.10)
+#
 # Note: If the command-line option "-no-shutdown" has been specified, qemu will
 # not exit, and a STOP event will eventually follow the SHUTDOWN event
 #
@@ -21,7 +39,7 @@
 #  "timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
 #
 ##
-{ 'event': 'SHUTDOWN' }
+{ 'event': 'SHUTDOWN', 'data': { '*signal': 'ShutdownSignal' } }

 ##
 # @POWERDOWN:
diff --git a/vl.c b/vl.c
index 0b4ed52..af29b2c 100644
--- a/vl.c
+++ b/vl.c
@@ -1626,9 +1626,23 @@ static int qemu_shutdown_requested(void)
 return atomic_xchg(&shutdown_requested, 0);
 }

-static void qemu_kill_report(void)
+static ShutdownSignal qemu_kill_report(void)
 {
+ShutdownSignal ss = SHUTDOWN_SIGNAL__MAX;
 if (!qtest_driver() && shutdown_signal != -1) {
+switch (shutdown_signal) {
+case SIGINT:
+ss = SHUTDOWN_SIGNAL_INT;
+break;
+#ifdef SIGHUP
+case SIGHUP:
+ss = SHUTDOWN_SIGNAL_HUP;
+break;
+#endif
+case SIGTERM:
+ss = SHUTDOWN_SIGNAL_TERM;
+break;
+}
 if (shutdown_pid == 0) {
 /* This happens for eg ^C at the terminal, so it's worth
  * avoiding printing an odd message in that case.
@@ -1644,6 +1658,7 @@ static void qemu_kill_report(void)
 }
 shutdown_signal = -1;
 }
+return ss;
 }

 static int qemu_reset_requested(void)
@@ -1852,8 +1867,8 @@ static bool main_loop_should_exit(void)
 qemu_system_suspend();
 }
 if (qemu_shutdown_requested()) {
-qemu_kill_report();
-qapi_event_send_shutdown(_abort);
+ShutdownSignal ss = qemu_kill_report();
+qapi_event_send_shutdown(ss < SHUTDOWN_SIGNAL__MAX, ss, &error_abort);
 if (no_shutdown) {
 vm_stop(RUN_STATE_SHUTDOWN);
 } else {
-- 
2.9.3




Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions

2017-04-06 Thread Eric Blake
On 04/06/2017 03:59 PM, Richard W.M. Jones wrote:
> On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
>> This adds assertions that ensure that the necessary write permissions
>> have been granted before someone attempts to write to a node.
>>
>> Signed-off-by: Kevin Wolf 
>> Acked-by: Fam Zheng 
>> Reviewed-by: Max Reitz 
>> ---
>>  block/io.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 2592ca1..4c79745 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
>>  size_t skip_bytes;
>>  int ret;
>>  
>> +assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
> 
> 
> This assertion is thrown in the libguestfs test suite.  I filed a bug
> about it against the Fedora package:
> 
> https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html

Wrong URL?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions

2017-04-06 Thread Richard W.M. Jones
On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
> This adds assertions that ensure that the necessary write permissions
> have been granted before someone attempts to write to a node.
> 
> Signed-off-by: Kevin Wolf 
> Acked-by: Fam Zheng 
> Reviewed-by: Max Reitz 
> ---
>  block/io.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/io.c b/block/io.c
> index 2592ca1..4c79745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
>  size_t skip_bytes;
>  int ret;
>  
> +assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));


This assertion is thrown in the libguestfs test suite.  I filed a bug
about it against the Fedora package:

https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v



Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

2017-04-06 Thread Jason Gunthorpe
On Thu, Apr 06, 2017 at 10:45:54PM +0300, Yuval Shaia wrote:

> > Just add my 2 cents. You didn't answer on my question about other possible
> > implementations. It can be SoftRoCE loopback optimizations, special ULP,
> > RDMA transport, virtual driver with multiple VFs and single PF.
> 
> Please see my response to Jason's comments - eventually, when support for
> VM-to-external-host communication is added, kdbr will become a ULP as
> well.

So, is KDBR only to be used on the HV side? Ie it never shows up in the VM?

That is even weirder, we certainly do not want to see a kernel RDMA
ULP for any of this - the entire point of RDMA is to let user space
implement their protocols without needing a unique kernel component!!

Jason



Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

2017-04-06 Thread Jason Gunthorpe
On Thu, Apr 06, 2017 at 10:42:20PM +0300, Yuval Shaia wrote:

> > I'd rather see someone optimize the loopback path of soft roce than
> > see KDBR :)
> 
> Can we assume that the optimized loopback path will be as fast as direct
> copy between one VM address space to another VM address space?

Well, you'd optimize it until it was a direct memory copy, so I think
that is a reasonable starting assumption.

> > > 3. Our intention is for KDBR to be used in other contexts as well when we 
> > > need
> > >inter VM data exchange, e.g. backend for virtio devices. We didn't see 
> > > how this
> > >kind of requirement can be implemented inside SoftRoce as we don't see 
> > > any
> > >connection between them.
> > 
> > KDBR looks like weak RDMA to me, so it is reasonable question why not
> > use full RDMA with loopback optimization instead of creating something
> > unique.
> 
> True, KDBR exposes an RDMA-like API because its sole user is currently
> the pvrdma device.  But, by design, it can be expanded to support other
> clients, for example a virtio device, which might have other attributes;
> can we expect the same from SoftRoCE?

RDMA handles all sorts of complex virtio-like protocols just
fine. Unclear what 'other attributes' would be. Sounds like over
designing??

> > IMHO, it also makes more sense for something like KDBR to live as a
> > RDMA transport, not as a unique char device, it is obviously very
> > RDMA-like.
> 
> Can you elaborate more on this?
> What exactly it will solve?
> How it will be better than kdbr?

If you are going to do RDMA, then the uAPI for it from the kernel
should be the RDMA subsystem, don't invent unique cdevs that overlap
established kernel functionality without a very, very good reason.

> > .. and the char dev really can't be used when implementing user space
> > RDMA, that would just make a big mess..
> 
> The position of kdbr is not to be a layer *between* user space and device -
> it is *the device* from point of view of the process.

Any RDMA device built on top of kdbr certainly needs to support
/dev/uverbs0 and all the usual RDMA stuff, so again, I fail to see the
point of the special cdev.. Trying to mix /dev/uverbs0 and /dev/kdbr
in your provider would be too goofy and weird.

> > But obviously if you connect pvrdma to real hardware then the page pin
> > comes back.
> 
> The fact that page pin is not needed with Soft RoCE device but is needed
> with real RoCE device is exactly where kdbr can help as it isolates this
> fact from user space process.

I don't see how KDBR helps at all.

To do virtual RDMA you must transfer RDMA objects and commands
unmodified from VM to HV and implement a fairly complicated SW stack
inside the HV.

Once you do that, micro-optimizing for same-machine VM-to-VM copy is
not really such a big deal, IMHO.

The big challenge is keeping the real HW (or softrocee) RDMA objects
in sync with the VM ones and implementing some kind of RDMA-in-RDMA
tunnel to enable migration when using today's HW offload.

I see nothing in kdbr that helps with any of this. All it seems to do
is obfuscate the transfer of RDMA objects and commands to the
hypervisor, and make the transition of a RDMA channel from loopback to
network far, far, more complicated.

> Sorry, we didn't mean "easy" but "simple", and simplest solutions
> are always preferred.  IMHO, currently there is no good solution to
> do data copy between two VMs.

Don't confuse 'simple' with under featured. :)

> Can you comment on the second point - migration? Please note that we need
> it to work both with Soft RoCE and with real device.

I don't see how kdbr helps with migration, you still have to setup the
HW NIC and that needs sharing all the RDMA centric objects from VM to
HV.

Jason



Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

2017-04-06 Thread Yuval Shaia
On Tue, Apr 04, 2017 at 08:33:49PM +0300, Leon Romanovsky wrote:
> 
> I'm not going to repeat Jason's answer, I'm completely agree with him.
> 
> Just add my 2 cents. You didn't answer on my question about other possible
> implementations. It can be SoftRoCE loopback optimizations, special ULP,
> RDMA transport, virtual driver with multiple VFs and single PF.

Please see my response to Jason's comments - eventually, when support for
VM-to-external-host communication is added, kdbr will become a ULP as
well.

Marcel & Yuval

> 
> >




Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device

2017-04-06 Thread Yuval Shaia
On Tue, Apr 04, 2017 at 10:01:55AM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:
> 
> > Here are some thoughts regarding the Soft RoCE usage in our project.
> > We thought about using it as backend for QEMU pvrdma device
> > we didn't how it will support our requirements.
> > 
> > 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
> >removes the need for hw resources, emulated or not, concentrating
> >on one copy from a VM to another.
> 
> I'd rather see someone optimize the loopback path of soft roce than
> see KDBR :)

Can we assume that the optimized loopback path will be as fast as direct
copy between one VM address space to another VM address space?

> 
> > 3. Our intention is for KDBR to be used in other contexts as well when we 
> > need
> >inter VM data exchange, e.g. backend for virtio devices. We didn't see 
> > how this
> >kind of requirement can be implemented inside SoftRoce as we don't see 
> > any
> >connection between them.
> 
> KDBR looks like weak RDMA to me, so it is reasonable question why not
> use full RDMA with loopback optimization instead of creating something
> unique.

True, KDBR exposes an RDMA-like API because its sole user is currently the
pvrdma device.
But, by design, it can be expanded to support other clients, for example a
virtio device, which might have other attributes; can we expect the same
from SoftRoCE?

> 
> IMHO, it also makes more sense for something like KDBR to live as a
> RDMA transport, not as a unique char device, it is obviously very
> RDMA-like.

Can you elaborate more on this?
What exactly it will solve?
How it will be better than kdbr?

As we see it - kdbr, when expanded to support peers on external hosts,
will be like a ULP.

> 
> .. and the char dev really can't be used when implementing user space
> RDMA, that would just make a big mess..

The position of kdbr is not to be a layer *between* user space and device -
it is *the device* from point of view of the process.

> 
> > 4. We don't want all the VM memory to be pinned since it disables 
> > memory-over-commit
> >which in turn will make the pvrdma device useless.
> >We weren't sure how nicely Soft RoCE would play with memory pinning,
> >and we wanted more control over memory management. It may be a solvable
> >issue, but combined with the others it led us to our decision to come
> >up with our kernel bridge (char
> 
> soft roce certainly can be optimized to remove the page pin and always
> run in an ODP-like mode.
> 
> But obviously if you connect pvrdma to real hardware then the page pin
> comes back.

The fact that page pin is not needed with Soft RoCE device but is needed
with real RoCE device is exactly where kdbr can help as it isolates this
fact from user space process.

> 
> >device or not, we went for it since it was the easiest to
> >implement for a POC)
> 
> I can see why it would be easy to implement, but not sure how this
> really improves the kernel..

Sorry, we didn't mean "easy" but "simple", and simplest solutions are
always preferred.
IMHO, currently there is no good solution to do data copy between two VMs.

> 
> Jason

Can you comment on the second point - migration? Please note that we need
it to work both with Soft RoCE and with real device.

Marcel & Yuval



[Qemu-devel] [PATCH for-2.9] throttle: Remove block from group on hot-unplug

2017-04-06 Thread Eric Blake
When a block device that is part of a throttle group is hot-unplugged,
we forgot to remove it from the throttle group. This leaves stale
memory around, and causes an easily reproducible crash:

$ ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -nographic -qmp stdio \
-device virtio-scsi-pci,bus=pci.0 -drive \
id=drive_image2,if=none,format=raw,file=file2,bps=512000,iops=100,group=foo \
-device scsi-hd,id=image2,drive=drive_image2 -drive \
id=drive_image3,if=none,format=raw,file=file3,bps=512000,iops=100,group=foo \
-device scsi-hd,id=image3,drive=drive_image3
{'execute':'qmp_capabilities'}
{'execute':'device_del','arguments':{'id':'image3'}}
{'execute':'system_reset'}

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1428810

Suggested-by: Alberto Garcia 
Signed-off-by: Eric Blake 
---
 block/block-backend.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0b63773..d27c3a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -230,6 +230,9 @@ static void blk_delete(BlockBackend *blk)
 assert(!blk->refcnt);
 assert(!blk->name);
 assert(!blk->dev);
+if (blk->public.throttle_state) {
+blk_io_limits_disable(blk);
+}
 if (blk->root) {
 blk_remove_bs(blk);
 }
-- 
2.9.3




Re: [Qemu-devel] [PATCH for-2.9] block: Ignore guest dev permissions during incoming migration

2017-04-06 Thread Kashyap Chamarthy
On Thu, Apr 06, 2017 at 07:59:53PM +0200, Kevin Wolf wrote:
> Usually guest devices don't like other writers to the same image, so
> they use blk_set_perm() to prevent this from happening. In the migration
> phase before the VM is actually running, though, they don't have a
> problem with writes to the image. On the other hand, storage migration
> needs to be able to write to the image in this phase, so the restrictive
> blk_set_perm() call of qdev devices breaks it.
> 
> This patch flags all BlockBackends with a qdev device as
> blk->disable_perm during incoming migration, which means that the
> requested permissions are stored in the BlockBackend, but not actually
> applied to its root node yet.
> 
> Once migration has finished and the VM should be resumed, the
> permissions are applied. If they cannot be applied (e.g. because the NBD
> server used for block migration hasn't been shut down), resuming the VM
> fails.
> 
> Signed-off-by: Kevin Wolf 
> Tested-by: Kashyap Chamarthy 

Thanks.  Posted the test evidence on your RFC thread:

http://lists.nongnu.org/archive/html/qemu-devel/2017-04/msg01086.html

[...]

-- 
/kashyap



Re: [Qemu-devel] help debugging throttle crash

2017-04-06 Thread Eric Blake
On 04/03/2017 10:24 AM, Alberto Garcia wrote:
> On Mon, Apr 03, 2017 at 09:07:02AM -0500, Eric Blake wrote:
> 
>> At this point, it looks like no one is calling
>> throttle_group_unregister_blk() as a result of the 'device_del',
>> which leaves stale memory around
> 
> I see, I can also reproduce this very easily.
> 
> I wonder if it's not enough to simply disable I/O limits when a
> BlockBackend is deleted?

Seems to pass my testing. Do you want to submit it as a formal patch, or
shall I?

> 
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -230,6 +230,9 @@ static void blk_delete(BlockBackend *blk)
>  assert(!blk->refcnt);
>  assert(!blk->name);
>  assert(!blk->dev);
> +if (blk->public.throttle_state) {
> +blk_io_limits_disable(blk);
> +}
>  if (blk->root) {
>  blk_remove_bs(blk);
>  }
> 
> Berto
> 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-06 Thread Laszlo Ersek
On 04/06/17 14:35, gengdongjiu wrote:
> Dear Laszlo,
>Thanks for your detailed explanation.
> 
> On 2017/3/29 19:58, Laszlo Ersek wrote:
>> (This ought to be one of the longest address lists I've ever seen :)
>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>> measure, I'm adding MST and Igor.)
>>
>> On 03/29/17 12:36, Achin Gupta wrote:
>>> Hi gengdongjiu,
>>>
>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:

 Hi Laszlo/Biesheuvel/Qemu developer,

Now I encounter an issue and want to consult with you on the ARM64 platform, 
 as described below:

 when the guest OS takes a synchronous or asynchronous abort, kvm needs
 to send the error address to Qemu or UEFI through sigbus to
 dynamically generate the APEI table. From my investigation, there are
 two ways:

 (1) Qemu get the error address, and generate the APEI table, then
 notify UEFI to know this generation, then inject abort error to
 guest OS, guest OS read the APEI table.
 (2) Qemu get the error address, and let UEFI to generate the APEI
 table, then inject abort error to guest OS, guest OS read the APEI
 table.
>>>
>>> Just being pedantic! I don't think we are talking about creating the APEI 
>>> table
>>> dynamically here. The issue is: Once KVM has received an error that is 
>>> destined
>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the 
>>> error
>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>> processor etc) to allow the guest OS to do anything meaningful with the
>>> error. So who should create the CPER is the question.
>>>
>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error 
>>> arrives
>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>> responsible for creating the CPER. ARM is experimenting with using a 
>>> Standalone
>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>> adding the same code in ARM TF in EL3 (better for security). The error will 
>>> then
>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM 
>>> Trusted
>>> Firmware.
>>>
>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>> interface (as discussed with Christoffer below). So it should generate the 
>>> CPER
>>> before injecting the error.
>>>
>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming 
>>> you
>>> mean guest UEFI). At this time, the guest OS already knows where to pick up 
>>> the
>>> CPER from through the HEST. Qemu has to create the CPER and populate its 
>>> address
>>> at the address exported in the HEST. Guest UEFI should not be involved in 
>>> this
>>> flow. Its job was to create the HEST at boot and that has been done by this
>>> stage.
>>>
>>> QEMU folks will be able to confirm, but it looks like support for CPER
>>> generation will need to be added to QEMU. We need to resolve this.
>>>
>>>
>>> Do shout if I am missing anything above.
>>
>> After reading this email, the use case looks *very* similar to what
>> we've just done with VMGENID for QEMU 2.9.
>>
>> We have a facility between QEMU and the guest firmware, called "ACPI
>> linker/loader", with which QEMU instructs the firmware to
>>
>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>> ALLOCATE command,
>>
>> - relocate pointers in those blobs, to fields in other (or the same)
>> blobs -- ADD_POINTER command,
>>
>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>
>> - and send GPAs of fields within such blobs back to QEMU --
>> WRITE_POINTER command.
>>
>> This is how I imagine we can map the facility to the current use case
>> (note that this is the first time I read about HEST / GHES / CPER):
>>
>> etc/acpi/tables              etc/hardware_errors
>> ===============              ==========================
>>                              +-----------+
>> +------------+               | address   |        +-> +--------------+
>> |    HEST    |               | registers |        |   | Error Status |
>> | +--------+ |               | +-------+ |        |   | Data Block 1 |
>> | | GHES   |-+-------------> | |address|-+--------+   | +----------+ |
>> | | GHES   |-+-------------> | |address|-+------+     | |   CPER   | |
>> | | GHES   |-+-------------> | |address|-+----+ |     | |   CPER   | |
>> | | GHES   |-+-------------> | |address|-+--+ | |     | |   CPER   | |
>> +-+--------+-+               +-+-------+-+  | | |     | +----------+ |
>>                                             | | |     +--------------+
>>                                             | | |
>>                                             | | +---> +--------------+
>>                                             | |       | Error Status |
>>                                             | |       | Data Block 2 |
>>                                             | |       | +----------+ |

[Qemu-devel] [PATCH v2] kvmvapic: Enable kvmvapic when necessary

2017-04-06 Thread Anthony Xu
If KVM provides VAPIC, don't set up kvmvapic.

Signed-off-by: Anthony Xu 
---
 hw/intc/apic_common.c | 5 ++++-
 include/sysemu/kvm.h  | 1 +
 kvm-all.c             | 5 +++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
index c3829e3..bf72107 100644
--- a/hw/intc/apic_common.c
+++ b/hw/intc/apic_common.c
@@ -317,8 +317,11 @@ static void apic_common_realize(DeviceState *dev, Error 
**errp)
 info = APIC_COMMON_GET_CLASS(s);
 info->realize(dev, errp);
 
-/* Note: We need at least 1M to map the VAPIC option ROM */
+/* Note: We need at least 1M to map the VAPIC option ROM.
+   Under KVM, enable kvmvapic only when KVM doesn't provide
+   the VAPIC capability. */
 if (!vapic && s->vapic_control & VAPIC_ENABLE_MASK &&
+(!kvm_enabled() || (kvm_enabled() && !kvm_has_vapic())) &&
 !hax_enabled() && ram_size >= 1024 * 1024) {
 vapic = sysbus_create_simple("kvmvapic", -1, NULL);
 }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 24281fc..43e0e4c 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -215,6 +215,7 @@ extern KVMState *kvm_state;
 
 bool kvm_has_free_slot(MachineState *ms);
 int kvm_has_sync_mmu(void);
+int kvm_has_vapic(void);
 int kvm_has_vcpu_events(void);
 int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
diff --git a/kvm-all.c b/kvm-all.c
index 90b8573..c331c8c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -2232,6 +2232,11 @@ int kvm_has_sync_mmu(void)
 return kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
 }
 
+int kvm_has_vapic(void)
+{
+return !kvm_check_extension(kvm_state, KVM_CAP_VAPIC);
+}
+
 int kvm_has_vcpu_events(void)
 {
 return kvm_state->vcpu_events;
-- 
1.8.3.1




Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration

2017-04-06 Thread Kashyap Chamarthy
On Thu, Apr 06, 2017 at 07:26:15PM +0200, Kashyap Chamarthy wrote:
> On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote:
> > Usually guest devices don't like other writers to the same image, so
> > they use blk_set_perm() to prevent this from happening. In the migration
> > phase before the VM is actually running, though, they don't have a
> > problem with writes to the image. On the other hand, storage migration
> > needs to be able to write to the image in this phase, so the restrictive
> > blk_set_perm() call of qdev devices breaks it.
> > 
> > This patch flags all BlockBackends with a qdev device as
> > blk->disable_perm during incoming migration, which means that the
> > requested permissions are stored in the BlockBackend, but not actually
> > applied to its root node yet.
> > 
> > Once migration has finished and the VM should be resumed, the
> > permissions are applied. If they cannot be applied (e.g. because the NBD
> > server used for block migration hasn't been shut down), resuming the VM
> > fails.
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  block/block-backend.c | 40 +++-
> >  include/block/block.h |  2 ++
> >  migration/migration.c |  8 
> >  qmp.c |  6 ++
> >  4 files changed, 55 insertions(+), 1 deletion(-)
> 
> With your fix applied, now I don't see the original error ("error:
> internal error: unable to execute QEMU command 'nbd-server-add':
> Conflicts with use by drive-virtio-disk0 as 'root', which does not allow
> 'write' on #block163"), and I can export a disk via `nbd-server-add`
> with the 'writable' flag.
> 
> However, with this fix, running `drive-mirror` seg-faults source QEMU.

TL;DR: Kevin / Max pointed out that the segmentation fault should be fixed
by supplying a 'job-id' parameter to `drive-mirror`.  (For the longer
version, see below.)

So:

Tested-by: Kashyap Chamarthy 

> Reproducer:
> 
> (1) On destination QEMU
> 
> $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \
> -display none -nodefconfig -nodefaults -m 512 \
> -blockdev 
> node-name=bar,driver=qcow2,file.driver=file,file.filename=./dst-disk.qcow2 \
> -serial unix:/tmp/monitor,server,nowait \
> -incoming tcp:localhost: -qmp stdio
> {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, 
> "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}}
> { "execute": "qmp_capabilities" }
> {"return": {}}
> { "execute": "nbd-server-start", "arguments": { "addr": { "type": 
> "inet","data": { "host": "localhost", "port": "" } } } }
> {"return": {}}
> { "execute": "nbd-server-add", "arguments": { "device": "bar","writable": 
> true } }
> {"return": {}}
> /home/stack/src/qemu/nbd/server.c:nbd_receive_request():L706: read failed
> 
> 
> (2) On source QEMU:
> 
> $  /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \
> -display none -nodefconfig -nodefaults -m 512 \
> -device virtio-scsi-pci,id=scsi -device virtio-serial-pci \
> -blockdev 
> node-name=foo,driver=qcow2,file.driver=file,file.filename=./cirros-0.3.5.qcow2
> -qmp stdio
> {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, 
> "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}}
> { "execute": "qmp_capabilities" }
> {"return": {}}
> { "execute": "drive-mirror", "arguments": { "device": "foo", "target": 
> "nbd:localhost::exportname=bar", "sync": "full","format": "raw", "mode": 
> "existing" } }
> Segmentation fault (core dumped)

Okay, there was a missing piece: it was pointed out to me on IRC that
the `drive-mirror` QMP command was missing the 'job-id' parameter
(introduced by commit 71aa986 - "mirror: Add 'job-id' parameter to
'blockdev-mirror' and 'drive-mirror'").  Kevin tells me that this is
mandatory _if_ you're using 'node-name' (which I was, in my '-blockdev'
command line above).

And Max points out, the above should be caught by his:

https://lists.nongnu.org/archive/html/qemu-block/2017-04/msg00059.html 
--
[PATCH for-2.9 0/2] block/mirror: Fix use-after-free

And sure enough, adding the 'job-id' to `drive-mirror` on source will
let `drive-mirror` proceed successfully:

[...]
{ "execute": "drive-mirror", "arguments": { "device": "foo", "job-id": "job-0", 
"target": "nbd:localhost::exportname=bar", "sync": "full","format": "raw", 
"mode": "existing" } }
{"return": {}}
{"timestamp": {"seconds": 1491498318, "microseconds": 435816}, "event": 
"BLOCK_JOB_READY", "data": {"device": "job-0", "len": 41126400, "offset": 
41126400, "speed": 0, "type": "mirror"}}


Issue a `block-job-complete` (or `block-job-cancel`) to end the running
mirror job:

[...]
{"execute": "block-job-complete", "arguments": {"device": "job-0"} }
{"return": {}}
{"timestamp": {"seconds": 1491500183, "microseconds": 768746}, "event": 
"BLOCK_JOB_COMPLETED", "data": {"device": "job-0", "len": 41126400, "offset": 
41126400, "speed": 0, "type": "mirror"}}

Then, 

Re: [Qemu-devel] Fix build break during configuration on musl-libc based Linux systems.

2017-04-06 Thread Peter Maydell
On 6 April 2017 at 19:15, Rainer Müller  wrote:
> A bit late to this thread, but the original problem was also reported
> for Mac OS X with --enable-curses in MacPorts. The build fails with the
> same symptoms as in the original report.
>
> https://trac.macports.org/ticket/53929
>
> As you identified, the problem is that ncurses expects the define
> _XOPEN_SOURCE >= 500 to enable the wide-char function declarations.
>
> The solution to retain access to non-standard API on Mac OS X would be
> to also define _DARWIN_C_SOURCE which enables extensions.

Thanks for the report. I have a feeling that fixing this bug
got lost somewhere -- there was a long thread about it but I
don't think an actual patch ever came out of it :-(

At this point (very near to release of 2.9) I think I'd rather
postpone fixing this until after release. It should be possible
to work around it with an --extra-cflags configure setting,
and our current behaviour is something we've had for many
releases now.

thanks
-- PMM



Re: [Qemu-devel] [PATCH 1/1] qemu-img: img_create does not support image-opts, fix docs

2017-04-06 Thread Kevin Wolf
On 06.04.2017 at 19:45, Jeff Cody wrote:
> The documentation and help for qemu-img claims that 'qemu-img create'
> will take the '--image-opts' argument.  This is not true, so this
> patch removes those claims.
> 
> Signed-off-by: Jeff Cody 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-devel] [PATCH]Enable kvmvapic only when host doesn't support VAPIC capability in KVM mode

2017-04-06 Thread Xu, Anthony
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -2232,6 +2232,10 @@ int kvm_has_sync_mmu(void)
> >  return kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
> >  }
> >
> > +int kvm_has_vapic(void){
> > +return !kvm_check_extension(kvm_state, KVM_CAP_VAPIC);
> > +}
> > +
> 
> Is that function shouldn't return true if KVM is providing VAPIC
> capability?
> 

It should, but it doesn't. See the KVM kernel code segment below: the
extension check returns false when KVM provides hardware-accelerated VAPIC.

Thanks for your comments,

Will resend the patch.


Anthony



arch/x86/kvm/x86.c

case KVM_CAP_VAPIC:
        r = !kvm_x86_ops->cpu_has_accelerated_tpr();
        break;





Re: [Qemu-devel] Fix build break during configuration on musl-libc based Linux systems.

2017-04-06 Thread Rainer Müller
On 2017-02-17 17:57, Peter Maydell wrote:
> On 17 February 2017 at 11:20, Paolo Bonzini  wrote:
>>
>>
>> On 17/02/2017 11:18, Peter Maydell wrote:
>>> Defining _XOPEN_SOURCE is easy enough, and I think we should
>>> do it unconditionally. We should check what effect this has
>>> on the BSD hosts though I guess. (You could argue that we
>>> should be defining _XOPEN_SOURCE anyway for the benefit of
>>> the non-glibc BSD/Solaris/etc platforms.)
>>
>> Sounds good, then I think we should define it to 700 just like glibc does.
> 
> Unfortunately this idea turns out to break OSX compiles,
> because on OSX saying _XOPEN_SOURCE=anything disables
> all the non-X/Open APIs (which you get by default, and
> some of which like mkdtemp we use).

A bit late to this thread, but the original problem was also reported
for Mac OS X with --enable-curses in MacPorts. The build fails with the
same symptoms as in the original report.

https://trac.macports.org/ticket/53929

As you identified, the problem is that ncurses expects the define
_XOPEN_SOURCE >= 500 to enable the wide-char function declarations.

The solution to retain access to non-standard API on Mac OS X would be
to also define _DARWIN_C_SOURCE which enables extensions.

$ cat foo.c
#include 
int main() {
mkdtemp("/tmp/test-XX");
}
$ cc -D_XOPEN_SOURCE=500 -c foo.c
foo.c:4:5: warning: implicit declaration of function 'mkdtemp' is
invalid in C99 [-Wimplicit-function-declaration]
mkdtemp("/tmp/test-XX");
^
1 warning generated.
$ cc -D_XOPEN_SOURCE=500 -D_DARWIN_C_SOURCE -c foo.c
$

A quick test on current master with configure patched to define
  QEMU_CFLAGS="-D_XOPEN_SOURCE=500 -D_DARWIN_C_SOURCE $QEMU_CFLAGS"
compiled fine for both a default configure and with --enable-curses.

Rainer




Re: [Qemu-devel] [PATCH for-2.9 1/1] qemu-img: img_create does not support image-opts, fix docs

2017-04-06 Thread Eric Blake
On 04/06/2017 12:45 PM, Jeff Cody wrote:
> The documentation and help for qemu-img claims that 'qemu-img create'
> will take the '--image-opts' argument.  This is not true, so this
> patch removes those claims.
> 
> Signed-off-by: Jeff Cody 
> ---
>  qemu-img-cmds.hx | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Doc fix, worth having in 2.9 if we still have time.
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64

2017-04-06 Thread G 3


On Apr 6, 2017, at 1:08 PM, luigi burdo wrote:



Hi, I can help test it too on my two BE machines,
if someone can help me find where the patch is or where I can
download the commits.


Thanks
Luigi



Here are the patches:

1/3 https://patchwork.ozlabs.org/patch/747691/
2/3 https://patchwork.ozlabs.org/patch/747692/
3/3 https://patchwork.ozlabs.org/patch/747688/ 



[Qemu-devel] [PATCH for-2.9] block: Ignore guest dev permissions during incoming migration

2017-04-06 Thread Kevin Wolf
Usually guest devices don't like other writers to the same image, so
they use blk_set_perm() to prevent this from happening. In the migration
phase before the VM is actually running, though, they don't have a
problem with writes to the image. On the other hand, storage migration
needs to be able to write to the image in this phase, so the restrictive
blk_set_perm() call of qdev devices breaks it.

This patch flags all BlockBackends with a qdev device as
blk->disable_perm during incoming migration, which means that the
requested permissions are stored in the BlockBackend, but not actually
applied to its root node yet.

Once migration has finished and the VM should be resumed, the
permissions are applied. If they cannot be applied (e.g. because the NBD
server used for block migration hasn't been shut down), resuming the VM
fails.

Signed-off-by: Kevin Wolf 
Tested-by: Kashyap Chamarthy 
---

RFC -> v1:
- s/blk_next/blk_all_next/g [Max]

 block/block-backend.c | 40 +++++++++++++++++++++++++++++++++++++++-
 include/block/block.h |  2 ++
 migration/migration.c |  8 ++++++++
 qmp.c                 |  6 ++++++
 4 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0b63773..18ece99 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -61,6 +61,7 @@ struct BlockBackend {
 
 uint64_t perm;
 uint64_t shared_perm;
+bool disable_perm;
 
 bool allow_write_beyond_eof;
 
@@ -578,7 +579,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t 
shared_perm,
 {
 int ret;
 
-if (blk->root) {
+if (blk->root && !blk->disable_perm) {
 ret = bdrv_child_try_set_perm(blk->root, perm, shared_perm, errp);
 if (ret < 0) {
 return ret;
@@ -597,15 +598,52 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, 
uint64_t *shared_perm)
 *shared_perm = blk->shared_perm;
 }
 
+/*
+ * Notifies the user of all BlockBackends that migration has completed. qdev
+ * devices can tighten their permissions in response (specifically revoke
+ * shared write permissions that we needed for storage migration).
+ *
+ * If an error is returned, the VM cannot be allowed to be resumed.
+ */
+void blk_resume_after_migration(Error **errp)
+{
+BlockBackend *blk;
+Error *local_err = NULL;
+
+for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
+if (!blk->disable_perm) {
+continue;
+}
+
+blk->disable_perm = false;
+
+blk_set_perm(blk, blk->perm, blk->shared_perm, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+blk->disable_perm = true;
+return;
+}
+}
+}
+
 static int blk_do_attach_dev(BlockBackend *blk, void *dev)
 {
 if (blk->dev) {
 return -EBUSY;
 }
+
+/* While migration is still incoming, we don't need to apply the
+ * permissions of guest device BlockBackends. We might still have a block
+ * job or NBD server writing to the image for storage migration. */
+if (runstate_check(RUN_STATE_INMIGRATE)) {
+blk->disable_perm = true;
+}
+
 blk_ref(blk);
 blk->dev = dev;
 blk->legacy_dev = false;
 blk_iostatus_reset(blk);
+
 return 0;
 }
 
diff --git a/include/block/block.h b/include/block/block.h
index 5149260..3e09222 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -366,6 +366,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error 
**errp);
 void bdrv_invalidate_cache_all(Error **errp);
 int bdrv_inactivate_all(void);
 
+void blk_resume_after_migration(Error **errp);
+
 /* Ensure contents are flushed to disk.  */
 int bdrv_flush(BlockDriverState *bs);
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
diff --git a/migration/migration.c b/migration/migration.c
index 54060f7..ad4036f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -349,6 +349,14 @@ static void process_incoming_migration_bh(void *opaque)
 exit(EXIT_FAILURE);
 }
 
+/* If we get an error here, just don't restart the VM yet. */
+blk_resume_after_migration(&local_err);
+if (local_err) {
+error_free(local_err);
+local_err = NULL;
+autostart = false;
+}
+
 /*
  * This must happen after all error conditions are dealt with and
  * we're sure the VM is going to be running on this host.
diff --git a/qmp.c b/qmp.c
index fa82b59..a744e44 100644
--- a/qmp.c
+++ b/qmp.c
@@ -207,6 +207,12 @@ void qmp_cont(Error **errp)
 }
 }
 
+blk_resume_after_migration(&local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
 if (runstate_check(RUN_STATE_INMIGRATE)) {
 autostart = 1;
 } else {
-- 
1.8.3.1




Re: [Qemu-devel] is qemu able to run multiple system emulation simultaneously

2017-04-06 Thread G 3


On Apr 6, 2017, at 12:00 PM, qemu-devel-requ...@nongnu.org wrote:


Dear QEMU developers,

I need multiple ARM-based hardware emulations running simultaneously
in one PC.

I wonder whether QEMU is capable of running two system emulations at
the same time, or whether I have to use two QEMU processes for two
system emulations.

If the former is possible, where can I find some docs about it? I have
searched for two days, and the result is not positive.

Please help. Thanks in advance.

Best regards,
Jiahuan


My vote goes to running multiple instances of QEMU at the same time.  
It is easy to do and works.




Re: [Qemu-devel] [PATCH for-2.9 0/2] block/mirror: Fix use-after-free

2017-04-06 Thread Kevin Wolf
On 03.04.2017 at 19:51, Max Reitz wrote:
> And the exciting 2.9 ride continues!
> 
> When mirroring from a BDS with no parents at all (such as those added
> with -blockdev or blockdev-add), we have a use-after-free in mirror's
> error path. The first patch of this series fixes that, the other adds a
> patch so we don't regress.
> 
> What issue will we find next? Stay tuned!

Thanks, applied to the block branch.

Kevin



[Qemu-devel] [PATCH 1/1] qemu-img: img_create does not support image-opts, fix docs

2017-04-06 Thread Jeff Cody
The documentation and help for qemu-img claims that 'qemu-img create'
will take the '--image-opts' argument.  This is not true, so this
patch removes those claims.

Signed-off-by: Jeff Cody 
---
 qemu-img-cmds.hx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 9c9702c..8ac7822 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("create", img_create,
-"create [-q] [--object objectdef] [--image-opts] [-f fmt] [-o options] 
filename [size]")
+"create [-q] [--object objectdef] [-f fmt] [-o options] filename [size]")
 STEXI
-@item create [--object @var{objectdef}] [--image-opts] [-q] [-f @var{fmt}] [-o 
@var{options}] @var{filename} [@var{size}]
+@item create [--object @var{objectdef}] [-q] [-f @var{fmt}] [-o @var{options}] 
@var{filename} [@var{size}]
 ETEXI
 
 DEF("commit", img_commit,
-- 
2.9.3




Re: [Qemu-devel] EXT :Re: Emulating external registers

2017-04-06 Thread Peter Maydell
On 6 April 2017 at 18:23, Wu, Michael Y [US] (MS)  wrote:
> I changed my code to the following and added the appropriate read/write 
> functions and a MemoryRegionOps.
> memory_region_init_io(reg_memory, NULL, &reg_ops, reg,
>   "reg_mem", 0x0040); //set to 64 bytes
> memory_region_add_subregion(sysmem, 0xFC00, reg_memory);
>
> For the read function I just returned a zero. So if I were to read from the 
> address 0xFC00 it should return a value of 0? The current issue I am 
> having is that gdb hangs when the pointer is accessed. I am starting to think 
> my bare-metal program is incorrect. I also added log messages in my read and 
> write functions. The read function was not accessed.

You'll probably find that what has happened is that your program
has taken an exception which you don't have an exception handler
installed for, and has then jumped off into nowhere or gone into
an infinite loop of taking exceptions. (Probably hitting ^c in
gdb will break into wherever it is.) It's a good idea in bare
metal test code to write at least a minimal set of exception handlers
that can print a message when an unexpected exception occurs and
stop, so you don't get too confused.

You might also want to investigate QEMU's tracing:
-d in_asm,exec,cpu,int,nochain -D debug.log will write tracing to
the debug.log file (a lot of it, and this set of trace flags
slow down execution a lot, but if you're doing very small bare
metal flags that should be ok). This can help in figuring out
what your test program is doing. (Watch out that the in_asm
shows when we first encounter a block of code, but if we
execute the same bit of code a second time we'll only print
the exec and cpu parts of the logging.)

thanks
-- PMM



Re: [Qemu-devel] [PATCH for-2.9 2/2] commit: Set commit_top_bs->total_sectors

2017-04-06 Thread Eric Blake
On 04/06/2017 12:14 PM, Kevin Wolf wrote:
> Like in the mirror filter driver, we also need to set the image size for
> the commit filter driver. This is less likely to be a problem in
> practice than for the mirror because we're not at the active layer here,
> but attaching new parents to a node in the middle of the chain is
> possible, so the size needs to be correct anyway.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/commit.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/block/commit.c b/block/commit.c
> index 4c38220..91d2c34 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState 
> *bs,
>  if (commit_top_bs == NULL) {
>  goto fail;
>  }
> +commit_top_bs->total_sectors = top->total_sectors;

[I'd really like to change total_sectors to bytes in 2.10 or later, but
that's a much hairier mess to untangle. I've got some patches that I
hope to post soon that at least start the process with regards to
bdrv_is_allocated...]

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH for-2.9 1/2] commit: Set commit_top_bs->aio_context

2017-04-06 Thread Eric Blake
On 04/06/2017 12:14 PM, Kevin Wolf wrote:
> The filter driver that is inserted by the commit job needs to use the
> same AioContext as its parent and child nodes.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/commit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration

2017-04-06 Thread Kashyap Chamarthy
On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote:
> Usually guest devices don't like other writers to the same image, so
> they use blk_set_perm() to prevent this from happening. In the migration
> phase before the VM is actually running, though, they don't have a
> problem with writes to the image. On the other hand, storage migration
> needs to be able to write to the image in this phase, so the restrictive
> blk_set_perm() call of qdev devices breaks it.
> 
> This patch flags all BlockBackends with a qdev device as
> blk->disable_perm during incoming migration, which means that the
> requested permissions are stored in the BlockBackend, but not actually
> applied to its root node yet.
> 
> Once migration has finished and the VM should be resumed, the
> permissions are applied. If they cannot be applied (e.g. because the NBD
> server used for block migration hasn't been shut down), resuming the VM
> fails.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/block-backend.c | 40 +++-
>  include/block/block.h |  2 ++
>  migration/migration.c |  8 
>  qmp.c |  6 ++
>  4 files changed, 55 insertions(+), 1 deletion(-)

With your fix applied, now I don't see the original error ("error:
internal error: unable to execute QEMU command 'nbd-server-add':
Conflicts with use by drive-virtio-disk0 as 'root', which does not allow
'write' on #block163"), and I can export a disk via `nbd-server-add`
with the 'writable' flag.

However, with this fix, running `drive-mirror` seg-faults source QEMU.

Reproducer:

(1) On destination QEMU

$ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \
-display none -nodefconfig -nodefaults -m 512 \
-blockdev 
node-name=bar,driver=qcow2,file.driver=file,file.filename=./dst-disk.qcow2 \
-serial unix:/tmp/monitor,server,nowait \
-incoming tcp:localhost: -qmp stdio
{"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, "package": 
" (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "nbd-server-start", "arguments": { "addr": { "type": 
"inet","data": { "host": "localhost", "port": "" } } } }
{"return": {}}
{ "execute": "nbd-server-add", "arguments": { "device": "bar","writable": true 
} }
{"return": {}}
/home/stack/src/qemu/nbd/server.c:nbd_receive_request():L706: read failed


(2) On source QEMU:

$  /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \
-display none -nodefconfig -nodefaults -m 512 \
-device virtio-scsi-pci,id=scsi -device virtio-serial-pci \
-blockdev 
node-name=foo,driver=qcow2,file.driver=file,file.filename=./cirros-0.3.5.qcow2
-qmp stdio
{"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, "package": 
" (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{ "execute": "drive-mirror", "arguments": { "device": "foo", "target": 
"nbd:localhost::exportname=bar", "sync": "full","format": "raw", "mode": 
"existing" } }
Segmentation fault (core dumped)


[...]

-- 
/kashyap



Re: [Qemu-devel] EXT :Re: Emulating external registers

2017-04-06 Thread Wu, Michael Y [US] (MS)
Thank you Peter!

I changed my code to the following and added the appropriate read/write 
functions and a MemoryRegionOps.
memory_region_init_io(reg_memory, NULL, &reg_ops, reg,
  "reg_mem", 0x0040); //set to 64 bytes
memory_region_add_subregion(sysmem, 0xFC00, reg_memory);

For the read function I just returned a zero. So if I were to read from the 
address 0xFC00 it should return a value of 0? The current issue I am having 
is that gdb hangs when the pointer is accessed. I am starting to think my 
bare-metal program is incorrect. I also added log messages in my read and write 
functions. The read function was not accessed.

//Some code in my bare metal program
volatile unsigned int  * const test = (unsigned int *) 0xFC00;
if (0x0 == *test)  //gdb hangs on this line, cannot step further
  {
read_hit();
  }

The file 'unimp.c' you mentioned looks great! I will be using that as 
inspiration for creating my own device model.

-Original Message-
From: Peter Maydell [mailto:peter.mayd...@linaro.org] 
Sent: Thursday, April 06, 2017 1:33 AM
To: Wu, Michael Y [US] (MS)
Cc: qemu-devel@nongnu.org
Subject: EXT :Re: [Qemu-devel] Emulating external registers

On 5 April 2017 at 22:03, Wu, Michael Y [US] (MS)  wrote:
> My current approach is to create a new MemoryRegion in the init function of 
> the g3beige source code (mac_oldworld.c). My idea is to set up some mmio to 
> certain address to represent those registers. For now I only care about 
> reading and writing, but in the future I hope to add additional logic when 
> the registers are written.
> I added the following lines:
> MemoryRegion *reg = g_new(MemoryRegion, 1); 
> memory_region_init_alias(reg, NULL, "reg_mmio",
> get_system_io(), 0, 0x0020);

This is defining your region to be an alias into the system IO region, so 
reading/writing your region will act like reads and writes into those IO ports. 
This doesn't sound like what you were trying to do.
Usually memory regions for registers are defined using memory_region_init_io(), 
where you pass in a structure that includes pointers to functions which are 
called for reads and writes.

> memory_region_add_subregion(sysmem, 0xfc00, reg);

Typically you would create a device model for whatever this device is, and then 
map it in the board code, not directly create a memory region in the board code.

You might find it useful to look at hw/misc/unimp.c, which is a device which 
simply prints log messages when it is read or written. Or try a simple device 
that your board already has.

thanks
-- PMM



Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration

2017-04-06 Thread Kashyap Chamarthy
On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote:
> Usually guest devices don't like other writers to the same image, so
> they use blk_set_perm() to prevent this from happening. In the migration
> phase before the VM is actually running, though, they don't have a
> problem with writes to the image. On the other hand, storage migration
> needs to be able to write to the image in this phase, so the restrictive
> blk_set_perm() call of qdev devices breaks it.
> 
> This patch flags all BlockBackends with a qdev device as
> blk->disable_perm during incoming migration, which means that the
> requested permissions are stored in the BlockBackend, but not actually
> applied to its root node yet.
> 
> Once migration has finished and the VM should be resumed, the
> permissions are applied. If they cannot be applied (e.g. because the NBD
> server used for block migration hasn't been shut down), resuming the VM
> fails.

So I have an environment with a patched QEMU built with your fix to test
it with libvirt APIs, however, there's a libvirt bug that I just
discovered which fails the NBD-based live storage migration:

https://www.redhat.com/archives/libvir-list/2017-April/msg00350.html
-- NBD-based storage migration fails with "error: invalid argument:
monitor must not be NULL"

Meanwhile, I'm about it test with plain QMP.

> Signed-off-by: Kevin Wolf 
> ---
>  block/block-backend.c | 40 +++-
>  include/block/block.h |  2 ++
>  migration/migration.c |  8 
>  qmp.c |  6 ++
>  4 files changed, 55 insertions(+), 1 deletion(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 0b63773..f817040 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -61,6 +61,7 @@ struct BlockBackend {
>  
>  uint64_t perm;
>  uint64_t shared_perm;
> +bool disable_perm;
>  
>  bool allow_write_beyond_eof;
>  
> @@ -578,7 +579,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, 
> uint64_t shared_perm,
>  {
>  int ret;
>  
> -if (blk->root) {
> +if (blk->root && !blk->disable_perm) {
>  ret = bdrv_child_try_set_perm(blk->root, perm, shared_perm, errp);
>  if (ret < 0) {
>  return ret;
> @@ -597,15 +598,52 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, 
> uint64_t *shared_perm)
>  *shared_perm = blk->shared_perm;
>  }
>  
> +/*
> + * Notifies the user of all BlockBackends that migration has completed. qdev
> + * devices can tighten their permissions in response (specifically revoke
> + * shared write permissions that we needed for storage migration).
> + *
> + * If an error is returned, the VM cannot be allowed to be resumed.
> + */
> +void blk_resume_after_migration(Error **errp)
> +{
> +BlockBackend *blk;
> +Error *local_err = NULL;
> +
> +for (blk = blk_next(NULL); blk; blk = blk_next(blk)) {
> +if (!blk->disable_perm) {
> +continue;
> +}
> +
> +blk->disable_perm = false;
> +
> +blk_set_perm(blk, blk->perm, blk->shared_perm, &local_err);
> +if (local_err) {
> +error_propagate(errp, local_err);
> +blk->disable_perm = true;
> +return;
> +}
> +}
> +}
> +
>  static int blk_do_attach_dev(BlockBackend *blk, void *dev)
>  {
>  if (blk->dev) {
>  return -EBUSY;
>  }
> +
> +/* While migration is still incoming, we don't need to apply the
> + * permissions of guest device BlockBackends. We might still have a block
> + * job or NBD server writing to the image for storage migration. */
> +if (runstate_check(RUN_STATE_INMIGRATE)) {
> +blk->disable_perm = true;
> +}
> +
>  blk_ref(blk);
>  blk->dev = dev;
>  blk->legacy_dev = false;
>  blk_iostatus_reset(blk);
> +
>  return 0;
>  }
>  
> diff --git a/include/block/block.h b/include/block/block.h
> index 5149260..3e09222 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -366,6 +366,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error 
> **errp);
>  void bdrv_invalidate_cache_all(Error **errp);
>  int bdrv_inactivate_all(void);
>  
> +void blk_resume_after_migration(Error **errp);
> +
>  /* Ensure contents are flushed to disk.  */
>  int bdrv_flush(BlockDriverState *bs);
>  int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
> diff --git a/migration/migration.c b/migration/migration.c
> index 54060f7..ad4036f 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -349,6 +349,14 @@ static void process_incoming_migration_bh(void *opaque)
>  exit(EXIT_FAILURE);
>  }
>  
> +/* If we get an error here, just don't restart the VM yet. */
> +blk_resume_after_migration(&local_err);
> +if (local_err) {
> +error_free(local_err);
> +local_err = NULL;
> +autostart = false;
> +}
> +
>  /*
>   * This must happen after all error conditions are 

[Qemu-devel] [PATCH for-2.9 0/2] commit: Fix commit_top_bs problems

2017-04-06 Thread Kevin Wolf
The same problem that Fam's patch "mirror: Fix aio context of
mirror_top_bs" fixes for the mirror block job also exists for the
commit block job and the 'commit' HMP command. While comparing the
respective places in mirror.c and commit.c, I also noticed that
commit neglects to set an image size.

This series fixes both problems.

Kevin Wolf (2):
  commit: Set commit_top_bs->aio_context
  commit: Set commit_top_bs->total_sectors

 block/commit.c | 3 +++
 1 file changed, 3 insertions(+)

-- 
1.8.3.1




[Qemu-devel] [PATCH for-2.9 1/2] commit: Set commit_top_bs->aio_context

2017-04-06 Thread Kevin Wolf
The filter driver that is inserted by the commit job needs to use the
same AioContext as its parent and child nodes.

Signed-off-by: Kevin Wolf 
---
 block/commit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index 2832482..4c38220 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 if (commit_top_bs == NULL) {
 goto fail;
 }
+bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(top));
 
 bdrv_set_backing_hd(commit_top_bs, top, &local_err);
 if (local_err) {
@@ -482,6 +483,7 @@ int bdrv_commit(BlockDriverState *bs)
 error_report_err(local_err);
 goto ro_cleanup;
 }
+bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(backing_file_bs));
 
 bdrv_set_backing_hd(commit_top_bs, backing_file_bs, &error_abort);
 bdrv_set_backing_hd(bs, commit_top_bs, &error_abort);
-- 
1.8.3.1




[Qemu-devel] [PATCH for-2.9 2/2] commit: Set commit_top_bs->total_sectors

2017-04-06 Thread Kevin Wolf
Like in the mirror filter driver, we also need to set the image size for
the commit filter driver. This is less likely to be a problem in
practice than for the mirror because we're not at the active layer here,
but attaching new parents to a node in the middle of the chain is
possible, so the size needs to be correct anyway.

Signed-off-by: Kevin Wolf 
---
 block/commit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/commit.c b/block/commit.c
index 4c38220..91d2c34 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 if (commit_top_bs == NULL) {
 goto fail;
 }
+commit_top_bs->total_sectors = top->total_sectors;
 bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(top));
 
 bdrv_set_backing_hd(commit_top_bs, top, &local_err);
-- 
1.8.3.1




[Qemu-devel] [Bug 1635339] Re: qxl_pre_save assertion failure on vm "save"

2017-04-06 Thread rubenvb
I'm running into this issue as well:
Arch Linux
Qemu 2.8.0
spice guest tools: 0.132
QXL driver version (as reported by Windows Device Manager): 10.0.0.15000

Everything else works great. It would save me a lot of rebooting if this could 
get fixed.
If there is anything I can do or try, I'll be glad to help.

Relevant log of VM boot and crash on selecting suspend action in virt-
manager:

2017-04-06 16:59:24.681+: starting up libvirt version: 3.1.0, qemu version: 
2.8.0, hostname: arch-vaio.localdomain
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 
QEMU_AUDIO_DRV=spice /usr/bin/qemu-system-x86_64 -name 
guest=Windows,debug-threads=on -S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-9-Windows/master-key.aes
 -machine pc-i440fx-2.7,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu 
core2duo,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff -m 2048 -realtime 
mlock=off -smp 2,sockets=1,cores=2,threads=1 -uuid 
f14684d3-5f81-4743-8512-e516d85ca2c9 -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-9-Windows/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc 
base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet 
-no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot 
strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device 
ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive 
file=/mnt/media/Qemu/windows10.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -drive 
file=/usr/share/spice-guest-tools/spice-guest-tools.iso,format=raw,if=none,id=drive-ide0-0-1,readonly=on
 -device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -netdev 
user,id=hostnet0 -device 
rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:44:08:31,bus=pci.0,addr=0x3 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev spicevmc,id=charchannel0,name=vdagent -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
 -device usb-tablet,id=input2,bus=usb.0,port=3 -spice 
port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on
 -device 
qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2
 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device 
hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev 
spicevmc,id=charredir0,name=usbredir -device 
usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=1 -chardev 
spicevmc,id=charredir1,name=usbredir -device 
usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
char device redirected to /dev/pts/6 (label charserial0)
main_channel_link: add main channel client
red_dispatcher_set_cursor_peer: 
inputs_connect: inputs channel client create
main_channel_handle_parsed: agent start
ehci warning: guest updated active QH
qemu-system-x86_64: /build/qemu/src/qemu-2.8.0/hw/display/qxl.c:2173: 
qxl_pre_save: Assertion `d->last_release_offset < d->vga.vram_size' failed.
2017-04-06 17:00:32.819+: shutting down, reason=crashed

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1635339

Title:
  qxl_pre_save assertion failure on vm "save"

Status in QEMU:
  New

Bug description:
  When I try and save my Windows 10 VM, I see an assertion failure, and
  the machine is shut down.

  I see the following in the log:

  main_channel_handle_parsed: agent start
  qemu-system-x86_64: /build/qemu-Zwynhi/qemu-2.5+dfsg/hw/display/qxl.c:2101: 
qxl_pre_save: Assertion `d->last_release_offset < d->vga.vram_size' failed.
  2016-10-20 11:52:42.713+: shutting down

  Please let me know what other information would be relevant!

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1635339/+subscriptions



Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation

2017-04-06 Thread Eric Blake
On 04/06/2017 10:01 AM, Alberto Garcia wrote:
> Hi all,
> 
> over the past couple of months I discussed with some of you the
> possibility to extend the qcow2 format in order to improve its
> performance and reduce its memory requirements (particularly with very
> large images).
> 
> After some discussion in the mailing list and the #qemu IRC channel I
> decided to write a prototype of a new extension for qcow2 so I could
> understand better the scope of the changes and have some preliminary
> data about its effects.
> 
> This e-mail is the formal presentation of my proposal to extend the
> on-disk qcow2 format. As you can see this is still an RFC. Due to the
> nature of the changes I would like to get as much feedback as possible
> before going forward.

The idea in general makes sense; I can even remember chatting with Kevin
about similar ideas as far back as 2015, where the biggest drawback is
that it is an incompatible image change, and therefore images created
with the flag cannot be read by older tools.

> === Test results ===
> 
> I have a basic working prototype of this. It's still incomplete -and
> buggy :)- but it gives an idea of what we can expect from it. In my
> implementation each data cluster has 8 subclusters, but that's not set
> in stone (see below).
> 
> I made all tests on an SSD drive, writing to an empty qcow2 image with
> a fully populated 40GB backing image, performing random writes using
> fio with a block size of 4KB.
> 
> I tried with the default and maximum cluster sizes (64KB and 2MB) and
> also with some other sizes. I also made sure to try with 32KB clusters
> so the subcluster size matches the 4KB block size used for the I/O.
> 
> It's important to point out that once a cluster has been completely
> allocated then having subclusters offers no performance benefit. For
> this reason the size of the image for these tests (40GB) was chosen to
> be large enough to guarantee that there are always new clusters being
> allocated. This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
> 
> Here are the results (subcluster size in brackets):
> 
> |-----------------+----------------+-----------------+-------------------|
> |  cluster size   | subclusters=on | subclusters=off | Max L2 cache size |
> |-----------------+----------------+-----------------+-------------------|
> |   2 MB (256 KB) |     440 IOPS   |     100 IOPS    |   160 KB (*)      |
> | 512 KB  (64 KB) |    1000 IOPS   |     300 IOPS    |   640 KB          |
> |  64 KB   (8 KB) |    3000 IOPS   |    1000 IOPS    |     5 MB          |
> |  32 KB   (4 KB) |   12000 IOPS   |    1300 IOPS    |    10 MB          |
> |   4 KB  (512 B) |     100 IOPS   |     100 IOPS    |    80 MB          |
> |-----------------+----------------+-----------------+-------------------|

Those are some cool results!
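As a cross-check on the last column: qcow2 L2 entries are 8 bytes each and each entry maps one data cluster, so caching enough L2 metadata to cover the whole image takes image_size / cluster_size * 8 bytes. A quick sketch (the helper name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte L2 entry maps one data cluster, so covering an image of
 * image_bytes with clusters of cluster_bytes needs this much L2 cache. */
static uint64_t l2_cache_bytes(uint64_t image_bytes, uint64_t cluster_bytes)
{
    return image_bytes / cluster_bytes * 8;
}
```

For the 40 GB test image with 64 KB clusters this gives 5 MB, matching the table.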


> === Changes to the on-disk format ===
> 
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.

You also have to add a new incompatible feature bit, so that older tools
know they can't read the new image correctly, and therefore don't
accidentally corrupt it.

> 
> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
> 
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> 11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
> **<----> <--------------------------------------------------><------->*
>   Rsrved           host cluster offset of data                Reserved
>   (6 bits)                (47 bits)                           (8 bits)
> 
> bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
> bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
> bit 0: all zeros       (QCOW_OFLAG_ZERO)
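The current-format entry above can be decoded with plain masks: the 47-bit host offset occupies bits 9..55, so bits 0..8 and 56..63 are cleared out. A minimal sketch, with the mask constant mirroring the quoted layout:

```c
#include <assert.h>
#include <stdint.h>

#define QCOW_OFLAG_COPIED     (1ULL << 63)
#define QCOW_OFLAG_COMPRESSED (1ULL << 62)
#define QCOW_OFLAG_ZERO       (1ULL << 0)

/* Bits 9..55 of an uncompressed L2 entry hold the host cluster offset
 * (47 bits); bits 1..8 and 56..61 are reserved. */
#define L2E_OFFSET_MASK 0x00fffffffffffe00ULL

static uint64_t l2e_host_offset(uint64_t entry)
{
    return entry & L2E_OFFSET_MASK;
}
```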
> 
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
> 
> (1) Storing the bitmap inside the 64-bit entry
> 
> This is a simple alternative and is the one that I chose for my
> prototype. There are 14 unused bits plus the "all zeroes" one. If
> we steal one from the host offset we have the 16 bits that we need
> for the bitmap and we have 46 bits left for the host offset, which
> is more than enough.

Note that because you are using exactly 8 subclusters, you can require
that the minimum cluster size when subclusters are enabled be 4k (since
we already have a lower-limit of 512-byte sector operation, and don't
want subclusters to be smaller than that); at which case you are
guaranteed that the host cluster offset will be 4k aligned.  So in
reality, once you turn on subclusters, you have:

63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
11111111 11111111 11111111 11111111 11111111 11111111 11111111 11111111
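Whichever layout alternative is adopted, reading and updating a 16-bit bitmap with two bits per subcluster is cheap. The sketch below is illustrative only; the RFC does not fix the numeric encoding of the three states:

```c
#include <assert.h>
#include <stdint.h>

/* Three subcluster states, two bits each.  The numeric values here are
 * an assumption for illustration, not part of the proposal. */
enum subcluster_state {
    SC_UNALLOCATED = 0,
    SC_ALLOCATED   = 1,
    SC_ALL_ZEROES  = 2,
};

static unsigned sc_state(uint16_t bitmap, unsigned subcluster)
{
    return (bitmap >> (2 * subcluster)) & 0x3;
}

static uint16_t sc_set_state(uint16_t bitmap, unsigned subcluster,
                             unsigned state)
{
    bitmap &= (uint16_t)~(0x3u << (2 * subcluster));
    return (uint16_t)(bitmap | (state << (2 * subcluster)));
}
```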

Re: [Qemu-devel] [PATCH 1/5] qdev: qdev_hotplug is really a bool

2017-04-06 Thread Philippe Mathieu-Daudé

On 04/06/2017 10:13 AM, Juan Quintela wrote:

Signed-off-by: Juan Quintela 


Reviewed-by: Philippe Mathieu-Daudé 


---
 hw/core/qdev.c | 4 ++--
 include/hw/qdev-core.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 1e7fb33..6fa46b5 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -39,7 +39,7 @@
 #include "qapi-event.h"
 #include "migration/migration.h"

-int qdev_hotplug = 0;
+bool qdev_hotplug = false;
 static bool qdev_hot_added = false;
 static bool qdev_hot_removed = false;

@@ -385,7 +385,7 @@ void qdev_machine_creation_done(void)
  * ok, initial machine setup is done, starting from now we can
  * only create hotpluggable devices
  */
-qdev_hotplug = 1;
+qdev_hotplug = true;
 }

 bool qdev_machine_modified(void)
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index b44b476..a96a913 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -386,7 +386,7 @@ Object *qdev_get_machine(void);
 /* FIXME: make this a link<> */
 void qdev_set_parent_bus(DeviceState *dev, BusState *bus);

-extern int qdev_hotplug;
+extern bool qdev_hotplug;

 char *qdev_get_dev_path(DeviceState *dev);






Re: [Qemu-devel] [Qemu-arm] [PATCH] target/arm: Add missing entries to excnames[] for log strings

2017-04-06 Thread Peter Maydell
On 6 April 2017 at 17:34, Philippe Mathieu-Daudé  wrote:
> Hi Peter,
>
> excnames[] is only used in arm_log_exception(), can we move it there?

Seems like a good idea. I think this is historic, we used to use
it in separate functions in different files for the 32-bit and
64-bit exception entry code, but those got unified a little while back.

I'll do that as a separate patch, though.

thanks
-- PMM



Re: [Qemu-devel] [Qemu-arm] [PATCH] stellaris: Don't hw_error() on bad register accesses

2017-04-06 Thread Peter Maydell
On 6 April 2017 at 17:24, Philippe Mathieu-Daudé  wrote:
> Hi Peter,
>
>
> On 04/06/2017 10:45 AM, Peter Maydell wrote:
>>  default:
>> -hw_error("gptm_write: Bad offset 0x%x\n", (int)offset);
>> +qemu_log_mask(LOG_GUEST_ERROR,
>> +  "GPTM: read at bad offset 0x%x\n", (int)offset);
>
>
> use HWADDR_PRIx to remove this unnecessary casts here in following changes?
>
> ie: "GPTM: read at bad offset 0x%" HWADDR_PRIx "\n", offset);

I don't think either of the two is clearly better for this
sort of case where the offset is known to be small, so I opted
to leave the code the way it was already written.

> either way:
> Reviewed-by: Philippe Mathieu-Daudé 

Thanks.

-- PMM



Re: [Qemu-devel] [Qemu-arm] [PATCH] target/arm: Add missing entries to excnames[] for log strings

2017-04-06 Thread Philippe Mathieu-Daudé

Hi Peter,

excnames[] is only used in arm_log_exception(), can we move it there?

anyway:
Reviewed-by: Philippe Mathieu-Daudé 

On 04/06/2017 10:45 AM, Peter Maydell wrote:

Recent changes have added new EXCP_ values to ARM but forgot
to update the excnames[] array which is used to provide
human-readable strings when printing information about the
exception for debug logging. Add the missing entries, and
add a comment to the list of #defines to help avoid the mistake
being repeated in future.

Signed-off-by: Peter Maydell 
---
 target/arm/cpu.h   | 1 +
 target/arm/internals.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index a8aabce..e6f05e2 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -58,6 +58,7 @@
 #define EXCP_SEMIHOST   16   /* semihosting call */
 #define EXCP_NOCP   17   /* v7M NOCP UsageFault */
 #define EXCP_INVSTATE   18   /* v7M INVSTATE UsageFault */
+/* NB: new EXCP_ defines should be added to the excnames[] array too */

 #define ARMV7M_EXCP_RESET   1
 #define ARMV7M_EXCP_NMI 2
diff --git a/target/arm/internals.h b/target/arm/internals.h
index f742a41..97ca034 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -70,6 +70,8 @@ static const char * const excnames[] = {
 [EXCP_VIRQ] = "Virtual IRQ",
 [EXCP_VFIQ] = "Virtual FIQ",
 [EXCP_SEMIHOST] = "Semihosting call",
+[EXCP_NOCP] = "v7M NOCP UsageFault",
+[EXCP_INVSTATE] = "v7M INVSTATE UsageFault",
 };

 /* Scale factor for generic timers, ie number of ns per tick.





Re: [Qemu-devel] [Qemu-arm] [PATCH] stellaris: Don't hw_error() on bad register accesses

2017-04-06 Thread Philippe Mathieu-Daudé

Hi Peter,

On 04/06/2017 10:45 AM, Peter Maydell wrote:

Current recommended style is to log a guest error on bad register
accesses, not kill the whole system with hw_error().  Change the
hw_error() calls to log as LOG_GUEST_ERROR or LOG_UNIMP or use
g_assert_not_reached() as appropriate.

Signed-off-by: Peter Maydell 
---
 hw/arm/stellaris.c | 60 +-
 1 file changed, 37 insertions(+), 23 deletions(-)

diff --git a/hw/arm/stellaris.c b/hw/arm/stellaris.c
index 9edcd49..ea7a809 100644
--- a/hw/arm/stellaris.c
+++ b/hw/arm/stellaris.c
@@ -108,7 +108,10 @@ static void gptm_reload(gptm_state *s, int n, int reset)
 } else if (s->mode[n] == 0xa) {
 /* PWM mode.  Not implemented.  */
 } else {
-hw_error("TODO: 16-bit timer mode 0x%x\n", s->mode[n]);
+qemu_log_mask(LOG_UNIMP,
+  "GPTM: 16-bit timer mode unimplemented: 0x%x\n",
+  s->mode[n]);
+return;
 }
 s->tick[n] = tick;
 timer_mod(s->timer[n], tick);
@@ -149,7 +152,9 @@ static void gptm_tick(void *opaque)
 } else if (s->mode[n] == 0xa) {
 /* PWM mode.  Not implemented.  */
 } else {
-hw_error("TODO: 16-bit timer mode 0x%x\n", s->mode[n]);
+qemu_log_mask(LOG_UNIMP,
+  "GPTM: 16-bit timer mode unimplemented: 0x%x\n",
+  s->mode[n]);
 }
 gptm_update_irq(s);
 }
@@ -286,7 +291,8 @@ static void gptm_write(void *opaque, hwaddr offset,
 s->match_prescale[0] = value;
 break;
 default:
-hw_error("gptm_write: Bad offset 0x%x\n", (int)offset);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "GPTM: read at bad offset 0x%x\n", (int)offset);


use HWADDR_PRIx to remove this unnecessary casts here in following changes?

ie: "GPTM: read at bad offset 0x%" HWADDR_PRIx "\n", offset);

either way:
Reviewed-by: Philippe Mathieu-Daudé 


 }
 gptm_update_irq(s);
 }
@@ -425,7 +431,10 @@ static int ssys_board_class(const ssys_state *s)
 }
 /* for unknown classes, fall through */
 default:
-hw_error("ssys_board_class: Unknown class 0x%08x\n", did0);
+/* This can only happen if the hardwired constant did0 value
+ * in this board's stellaris_board_info struct is wrong.
+ */
+g_assert_not_reached();
 }
 }

@@ -479,8 +488,7 @@ static uint64_t ssys_read(void *opaque, hwaddr offset,
 case DID0_CLASS_SANDSTORM:
 return pllcfg_sandstorm[xtal];
 default:
-hw_error("ssys_read: Unhandled class for PLLCFG read.\n");
-return 0;
+g_assert_not_reached();
 }
 }
 case 0x070: /* RCC2 */
@@ -512,7 +520,8 @@ static uint64_t ssys_read(void *opaque, hwaddr offset,
 case 0x1e4: /* USER1 */
 return s->user1;
 default:
-hw_error("ssys_read: Bad offset 0x%x\n", (int)offset);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "SSYS: read at bad offset 0x%x\n", (int)offset);
 return 0;
 }
 }
@@ -614,7 +623,8 @@ static void ssys_write(void *opaque, hwaddr offset,
 s->ldoarst = value;
 break;
 default:
-hw_error("ssys_write: Bad offset 0x%x\n", (int)offset);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "SSYS: write at bad offset 0x%x\n", (int)offset);
 }
 ssys_update(s);
 }
@@ -748,7 +758,8 @@ static uint64_t stellaris_i2c_read(void *opaque, hwaddr 
offset,
 case 0x20: /* MCR */
 return s->mcr;
 default:
-hw_error("strllaris_i2c_read: Bad offset 0x%x\n", (int)offset);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "stellaris_i2c: read at bad offset 0x%x\n", (int)offset);
 return 0;
 }
 }
@@ -823,17 +834,18 @@ static void stellaris_i2c_write(void *opaque, hwaddr 
offset,
 s->mris &= ~value;
 break;
 case 0x20: /* MCR */
-if (value & 1)
-hw_error(
-  "stellaris_i2c_write: Loopback not implemented\n");
-if (value & 0x20)
-hw_error(
-  "stellaris_i2c_write: Slave mode not implemented\n");
+if (value & 1) {
+qemu_log_mask(LOG_UNIMP, "stellaris_i2c: Loopback not 
implemented");
+}
+if (value & 0x20) {
+qemu_log_mask(LOG_UNIMP,
+  "stellaris_i2c: Slave mode not implemented");
+}
 s->mcr = value & 0x31;
 break;
 default:
-hw_error("stellaris_i2c_write: Bad offset 0x%x\n",
-  (int)offset);
+qemu_log_mask(LOG_GUEST_ERROR,
+  "stellaris_i2c: write at bad offset 0x%x\n", 
(int)offset);
 }
 stellaris_i2c_update(s);
 }
@@ -1057,8 +1069,8 @@ static uint64_t stellaris_adc_read(void *opaque, hwaddr 
offset,
 case 0x30: /* SAC */

Re: [Qemu-devel] [PATCH for-2.9 5/5] coroutine: Explicitly specify AioContext when creating coroutine

2017-04-06 Thread Kevin Wolf
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> Coroutine in block layer should always be waken up in bs->aio_context
> rather than the "current" context where it is entered. They differ when
> the main loop is doing QMP tasks.

This whole mechanism is complex stuff that I haven't quite caught up on
yet, but this change means that we probably have some code that runs
under one AioContext or the other depending on whether a CoMutex had to
yield. Waking it up in its original AioContext definitely seemed more
straightforward.

> Race conditions happen without this patch, because the wrong context is
> acquired in co_schedule_bh_cb, while the entered coroutine works on a
> different one.

If the code in co_schedule_bh_cb() assumes that coroutines can never
move between AioContexts, then probably this assumption is what really
needs to be fixed.

For example, another case where this happens is that block jobs follow
their nodes if the AioContext changes and even implement
.attached_aio_context callbacks when they need to drag additional nodes
into the new context. With your change, the job coroutine would remember
the old coroutine and move back to the old context in some cases! I
don't want to be the one to debug this kind of problems...

If we really went this way, we'd need at least a way to change the
AioContext of a coroutine after the fact, and be sure that we call it
everywhere where it's needed (it's this last part that I'm highly
doubtful about for 2.9).

In fact, I think even attempting this is insanity and we need to teach
the infrastructure to cope with coroutines that move between
AioContexts. If it's really just about acquiring the wrong context,
shouldn't then using co->ctx instead of ctx solve the problem?

> Make the block layer explicitly specify a desired context for each created
> coroutine. For the rest, always use qemu_get_aio_context().
> 
> Signed-off-by: Fam Zheng 

> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -63,7 +63,8 @@ typedef void coroutine_fn CoroutineEntry(void *opaque);
>   * Use qemu_coroutine_enter() to actually transfer control to the coroutine.
>   * The opaque argument is passed as the argument to the entry point.
>   */
> -Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque);
> +Coroutine *qemu_coroutine_create(AioContext *ctx,
> + CoroutineEntry *entry, void *opaque);

The new parameter could use some documentation, it's not even obvious
why a coroutine should have an AioContext.

> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -115,7 +118,6 @@ void qemu_coroutine_enter(Coroutine *co)
>  }
>  
>  co->caller = self;
> -co->ctx = qemu_get_current_aio_context();
>  
>  /* Store co->ctx before anything that stores co.  Matches
>   * barrier in aio_co_wake and qemu_co_mutex_wake.
>   */
>   smp_wmb();

The comment suggests that the barrier can go away if you don't set
co->ctx any more.

Kevin



Re: [Qemu-devel] is qemu able to run multiple system emulation simultaneously

2017-04-06 Thread Peter Maydell
On 6 April 2017 at 16:47, Jiahuan Zhang  wrote:
> I need multiple ARM-based hardware emulations running simultaneously in one
> PC.
> I wonder if QEMU is capable of running two system emulations at the same
> time, or if I have to use two QEMU instances for two system emulations.

You need to run two copies of QEMU for this. It should work fine,
though.

thanks
-- PMM



Re: [Qemu-devel] [PATCH RFC v1 3/3] target/ppc: Generate fence operations

2017-04-06 Thread Richard Henderson

On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

@@ -3028,6 +3030,7 @@ static void gen_##name(DisasContext *ctx) 
   \
 tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);\
 tcg_gen_mov_tl(cpu_reserve, t0); \
 tcg_gen_mov_tl(cpu_reserve_val, gpr);\
+tcg_gen_mb(TCG_MO_ALL | TCG_BAR_LDAQ);   \


Please put the barrier next to the load.
I hope we can merge these soon.



@@ -3196,6 +3199,7 @@ static void gen_##name(DisasContext *ctx) 
  \
 if (len > 1) {  \
 gen_check_align(ctx, t0, (len) - 1);\
 }   \
+tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);  \
 gen_conditional_store(ctx, t0, rS(ctx->opcode), memop); \
 tcg_temp_free(t0);  \
 }


This one is more complicated...

The success path includes an atomic_cmpxchg, which itself is a SC barrier. 
However, the fail path branches across the cmpxchg and so sees no barrier, but 
one is still required by the architecture.  How about a gen_conditional_store 
that looks like this:


{
  TCGLabel *l1 = gen_new_label();
  TCGLabel *l2 = gen_new_label();
  TCGv t0;

  tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);

  t0 = tcg_temp_new();
  tcg_gen_atomic_cmpxchg_tl(t0, cpu_reserve, cpu_reserve_val,
cpu_gpr[reg], ctx->mem_idx,
DEF_MEMOP(memop) | MO_ALIGN);
  tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
  tcg_gen_shli_tl(t0, t0, CRF_EQ_BIT);
  tcg_gen_or_tl(t0, t0, cpu_so);
  tcg_gen_trunc_tl(cpu_crf[0], t0);
  tcg_temp_free(t0);
  tcg_gen_br(l2);

  gen_set_label(l1);

  /* Address mismatch implies failure.  But we still need to provide the
 memory barrier semantics of the instruction.  */
  tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);
  tcg_gen_trunc_tl(cpu_crf[0], cpu_so);

  gen_set_label(l2);
  tcg_gen_movi_tl(cpu_reserve, -1);
}

Note that you don't need to reset cpu_reserve_val at all.

I just thought of something we might need to check for this and other ll/sc 
implemetations -- do we need to check for address misalignment along the 
address comparison failure path?  I suspect that we do.



r~
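The fail-path barrier in the sketch above can be stated in C11 terms: stcx. is a barrier even when the store does not happen, so both branches need fence semantics. The model below is ours, with the cmpxchg standing in for the success path:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t word;

/* Model of the conditional store: on address mismatch there is no store,
 * but a full fence is still required; on match, the seq-cst cmpxchg
 * itself provides the barrier. */
static int stcx_model(int reserve_addr_ok, uint32_t expected, uint32_t v)
{
    if (!reserve_addr_ok) {
        atomic_thread_fence(memory_order_seq_cst);
        return 0;
    }
    return atomic_compare_exchange_strong(&word, &expected, v);
}
```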



Re: [Qemu-devel] [PATCH 2/2] parallels: relax check for auto switch of prealloc_mode

2017-04-06 Thread Stefan Hajnoczi
On Wed, Apr 05, 2017 at 06:12:06PM +0300, Denis V. Lunev wrote:
> PRL_PREALLOC_MODE_TRUNCATE can be set only through command line. Remove
> the check that bdrv_truncate() is working on, this is almost always the
> case, while the check could lead to serious consequences during migration
> etc when we should not even try to API which technically could change the
> image (asserts are possible).
> 
> Let us keep this simple.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Stefan Hajnoczi 
> ---
>  block/parallels.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [Qemu-devel] [PATCH for-2.9?] tests/check-qdict: Fix missing brackets

2017-04-06 Thread Eric Blake
On 04/06/2017 10:49 AM, Eric Blake wrote:
> On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote:
>> From: "Dr. David Alan Gilbert" 
>>
>> Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
>> boolean; it's got a point.
>>
>> Signed-off-by: Dr. David Alan Gilbert 
>> ---
>>  tests/check-qdict.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/check-qdict.c b/tests/check-qdict.c
>> index 81162ee572..6f3fbcf9c1 100644
>> --- a/tests/check-qdict.c
>> +++ b/tests/check-qdict.c
>> @@ -559,7 +559,7 @@ static void qdict_join_test(void)
>>  g_assert(qdict_size(dict1) == 2);
>>  g_assert(qdict_size(dict2) == !overwrite);
>>  
>> -g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
>> +g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));
> 
> How is the test passing pre-patch, and why does it not change the test
> post-patch?  Does that mean that overwrite is not doing what we expected?

Replying to myself:

Pre-patch, it was reached twice (once for overwrite=false, once for
overwrite=true), as:

(42 == false) ? 84 : 42
(84 == true) ? 84 : 42

which simplifies to 42 on both iterations, and g_assert(42) succeeds.
In fact, this is a tautology - no matter WHAT value we encounter, the
assert succeeds, so we are not actually testing that the value matters.

Post-patch, it becomes:

42 == (false ? 84 : 42)
84 == (true ? 84 : 42)

which is then asserting that we actually have the value we expect.
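The trap is pure operator precedence: `==` binds tighter than `?:`, so the unparenthesized form compares first and then selects. A standalone illustration (function names are ours):

```c
#include <assert.h>

/* Pre-patch shape: parses as (val == overwrite) ? 84 : 42, which is
 * always nonzero, so g_assert() on it can never fail. */
static int buggy(int val, int overwrite)
{
    return val == overwrite ? 84 : 42;
}

/* Post-patch shape: actually compares val against the expected value. */
static int fixed(int val, int overwrite)
{
    return val == (overwrite ? 84 : 42);
}
```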

Reviewed-by: Eric Blake 

Safe for 2.9, but late enough that it also doesn't matter if it slips
until 2.10.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH RFC v1 2/3] cputlb: handle first atomic write to the page

2017-04-06 Thread Richard Henderson

On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

In case where the conditional write is the first write to the page,
TLB_NOTDIRTY will be set and stop_the_world is triggered. Handle this as
a special case and set the dirty bit. After that fall through to the
actual atomic instruction below.

Signed-off-by: Nikunj A Dadhania 
---
 cputlb.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)


Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread Richard Henderson

On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

 tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
 l1 = gen_new_label();
 tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
-tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
-tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
+
+t0 = tcg_temp_new();
+tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
+  ctx->mem_idx, DEF_MEMOP(memop));


Actually, I noticed another, existing, problem.

This code changes CRF[0] before the user memory write, which might fault.  This 
needs to delay any changes to the architecture visible state until after any 
exception may be triggered.



r~



Re: [Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets

2017-04-06 Thread Dr. David Alan Gilbert
* Eric Blake (ebl...@redhat.com) wrote:
> On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" 
> > 
> > Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
> > boolean; it's got a point.
> > 
> > Signed-off-by: Dr. David Alan Gilbert 
> > ---
> >  tests/check-qdict.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tests/check-qdict.c b/tests/check-qdict.c
> > index 81162ee572..6f3fbcf9c1 100644
> > --- a/tests/check-qdict.c
> > +++ b/tests/check-qdict.c
> > @@ -559,7 +559,7 @@ static void qdict_join_test(void)
> >  g_assert(qdict_size(dict1) == 2);
> >  g_assert(qdict_size(dict2) == !overwrite);
> >  
> > -g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
> > +g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));
> 
> How is the test passing pre-patch, and why does it not change the test
> post-patch?  Does that mean that overwrite is not doing what we expected?

Pre-patch it passes because both 84 and 42 are valid 'true' values to
g_assert.
Post patch it ends up as either:
qdict_get_int(dict1, "foo") == 42
or
qdict_get_int(dict1, "foo") == 84

which is what's intended.

Dave

> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers

2017-04-06 Thread Richard Henderson

On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

+TCGv_i32 tmp = tcg_temp_local_new_i32();
+TCGv t0;

+tcg_gen_movi_i32(tmp, 0);
 tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
 l1 = gen_new_label();
 tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
-tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
-tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
+
+t0 = tcg_temp_new();
+tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
+  ctx->mem_idx, DEF_MEMOP(memop));
+tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
+tcg_gen_trunc_tl_i32(tmp, t0);
+
 gen_set_label(l1);
+tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
+tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);


I encourage you to move these two lines up beside the setcond.
That way you don't need to use a local tmp, which implies a
spill/restore from the stack.
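As background for why the cover letter calls this emulation ABA-prone: a store-conditional built on cmpxchg only checks that memory still *equals* the reserved value, not that it was untouched since the load. A minimal C11 sketch (the `larx`/`stcx` names and the single-word reservation are illustrative, not the QEMU implementation):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint32_t mem;   /* the word the reservation covers */
static uint32_t reserve_val;   /* analogous to cpu_reserve_val */

/* larx: load the word and remember its value as the reservation */
static uint32_t larx(void)
{
    reserve_val = atomic_load(&mem);
    return reserve_val;
}

/* stcx: "succeed" iff the word still equals the reserved value.
 * This is the cmpxchg emulation -- it cannot see intermediate writes. */
static bool stcx(uint32_t newval)
{
    uint32_t expected = reserve_val;
    return atomic_compare_exchange_strong(&mem, &expected, newval);
}
```

If another thread writes A→B→A between `larx` and `stcx`, the cmpxchg still succeeds even though real PowerPC hardware would have lost the reservation — the viability argument is that portable lock-free code only assumes cmpxchg semantics anyway.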


r~



Re: [Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets

2017-04-06 Thread Eric Blake
On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
> boolean; it's got a point.
> 
> Signed-off-by: Dr. David Alan Gilbert 
> ---
>  tests/check-qdict.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/check-qdict.c b/tests/check-qdict.c
> index 81162ee572..6f3fbcf9c1 100644
> --- a/tests/check-qdict.c
> +++ b/tests/check-qdict.c
> @@ -559,7 +559,7 @@ static void qdict_join_test(void)
>  g_assert(qdict_size(dict1) == 2);
>  g_assert(qdict_size(dict2) == !overwrite);
>  
> -g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
> +g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));

How is the test passing pre-patch, and why does it not change the test
post-patch?  Does that mean that overwrite is not doing what we expected?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





[Qemu-devel] is qemu able to run multiple system emulation simultaneously

2017-04-06 Thread Jiahuan Zhang
Dear QEMU developers,

I need multiple ARM-based hardware emulations running simultaneously on one
PC. I wonder whether QEMU is capable of running two system emulations at the
same time, or whether I have to use two QEMU processes, one per emulated
system.

If the former is possible, where can I find some documentation about it? I
have searched for two days without a positive result.

Please help. Thanks in advance.

Best regards,
Jiahuan
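For reference, each qemu-system invocation emulates one machine in its own process, so running two guests simultaneously simply means launching QEMU twice; a minimal sketch (the machine type, memory size, and kernel image paths below are placeholders — the real options depend on the boards being emulated):

```
# first guest, in its own process
qemu-system-arm -M virt -m 512 -kernel guest1-kernel.img -nographic &
# second, fully independent guest
qemu-system-arm -M virt -m 512 -kernel guest2-kernel.img -nographic &
```

The two processes share nothing by default; to connect them (e.g. over a network), each would need its own netdev configuration.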


[Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets

2017-04-06 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
boolean; it's got a point.

Signed-off-by: Dr. David Alan Gilbert 
---
 tests/check-qdict.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/check-qdict.c b/tests/check-qdict.c
index 81162ee572..6f3fbcf9c1 100644
--- a/tests/check-qdict.c
+++ b/tests/check-qdict.c
@@ -559,7 +559,7 @@ static void qdict_join_test(void)
 g_assert(qdict_size(dict1) == 2);
 g_assert(qdict_size(dict2) == !overwrite);
 
-g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
+g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));
 g_assert(qdict_get_int(dict1, "bar") == 23);
 
 if (!overwrite) {
-- 
2.12.2




Re: [Qemu-devel] [PATCH for-2.9 4/5] block: Drain BH in bdrv_drained_begin

2017-04-06 Thread Kevin Wolf
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
> 
> qmp_block_commit
>   commit_active_start
> bdrv_reopen
>   bdrv_reopen_multiple
> bdrv_reopen_prepare
>   bdrv_flush
> aio_poll
>   aio_bh_poll
> aio_bh_call
>   block_job_defer_to_main_loop_bh
> stream_complete
>   bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH before leaving bdrv_drained_begin().
> 
> Signed-off-by: Fam Zheng 

We used to poll BHs in earlier times. Commit 99723548 ('block: add BDS
field to count in-flight requests') changed this, without an explanation
in the commit message.

Paolo, is this simply a bug in that commit, or do you rely on it
somewhere? I remember that you wanted to get rid of some aio_poll()
calls a while ago.

>  block/io.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 2709a70..b9cfd18 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -228,7 +228,12 @@ void bdrv_drained_begin(BlockDriverState *bs)
>  bdrv_parent_drained_begin(bs);
>  }
>  
> -bdrv_drain_recurse(bs);
> +while (true) {
> +if (!bdrv_drain_recurse(bs) &&
> +!aio_poll(bdrv_get_aio_context(bs), false)) {
> +break;
> +}

The indentation is off here.

> +}
>  }

The old code had this in what is now the BDRV_POLL_WHILE() call in
bdrv_drain_recurse(). I think it makes more sense there, saves you a
loop here and fixes bdrv_drain_all_begin() at the same time.

Kevin



Re: [Qemu-devel] [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:53:44 +0800
Cao jin  wrote:

> On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote:  
> >> On Thu, 6 Apr 2017 00:50:22 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote:  
>  The previous intention of trying to handle all sorts of AER faults
>  clearly had more value, though even there the implementation and
>  configuration requirements restricted the practicality.  For instance
>  is AER support actually useful to a customer if it requires all ports
>  of a multifunction device assigned to the VM?  This seems more like a
>  feature targeting whole system partitioning rather than general VM
>  device assignment use cases.  Maybe that's ok, but it should be a clear
>  design decision.
> >>>
> >>> Alex, what kind of testing do you expect to be necessary?
> >>> Would you say testing on real hardware and making it trigger
> >>> AER errors is a requirement?  
> >>
> >> Testing various fatal, non-fatal, and corrected errors with aer-inject,
> >> especially in multfunction configurations (where more than one port
> >> is actually usable) would certainly be required.  If we have cases where
> >> the driver for a companion function can escalate a non-fatal error to a
> >> bus reset, that should be tested, even if it requires temporary hacks to
> >> the host driver for the companion function to trigger that case.  AER
> >> handling is not something that the typical user is going to experience,
> >> so it should to be thoroughly tested to make sure it works when needed
> >> or there's little point to doing it at all.  Thanks,
> >>
> >> Alex  
> > 
> > Some things can be tested within a VM. What would you
> > say would be sufficient on a VM and what has to be
> > tested on bare metal?
> >   
> 
> Does the "bare metal" here mean something like XenServer?

No, bare metal means the non-virtualized host OS.  I think Michael was
trying to facilitate testing by proposing to do it in a VM such that we
can create strange and interesting topologies that aren't bound by a
system in a remote lab having only one NIC port connected.  Thanks,

Alex



Re: [Qemu-devel] [PATCH for-2.9 5/5] coroutine: Explicitly specify AioContext when creating coroutine

2017-04-06 Thread Eric Blake
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> Coroutine in block layer should always be waken up in bs->aio_context

s/waken up/awakened/

> rather than the "current" context where it is entered. They differ when
> the main loop is doing QMP tasks.
> 
> Race conditions happen without this patch, because the wrong context is
> acquired in co_schedule_bh_cb, while the entered coroutine works on a
> different one.
> 
> Make the block layer explicitly specify a desired context for each created
> coroutine. For the rest, always use qemu_get_aio_context().
> 
> Signed-off-by: Fam Zheng 
> ---

The meat of the change is here (using an order file to present your diff
with the interesting changes first can aid review)...

> +++ b/util/qemu-coroutine.c
> @@ -43,7 +43,8 @@ static void coroutine_pool_cleanup(Notifier *n, void *value)
>  }
>  }
>  
> -Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
> +Coroutine *qemu_coroutine_create(AioContext *ctx,
> + CoroutineEntry *entry, void *opaque)
>  {
>  Coroutine *co = NULL;
>  
> @@ -78,6 +79,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, 
> void *opaque)
>  
>  co->entry = entry;
>  co->entry_arg = opaque;
> +co->ctx = ctx;
>  QSIMPLEQ_INIT(&co->co_queue_wakeup);
>  return co;
>  }
> @@ -107,6 +109,7 @@ void qemu_coroutine_enter(Coroutine *co)
>  Coroutine *self = qemu_coroutine_self();
>  CoroutineAction ret;
>  
> +assert(co->ctx);
>  trace_qemu_coroutine_enter(self, co, co->entry_arg);
>  
>  if (co->caller) {
> @@ -115,7 +118,6 @@ void qemu_coroutine_enter(Coroutine *co)
>  }
>  
>  co->caller = self;
> -co->ctx = qemu_get_current_aio_context();

Basically, you close the race by assigning co->ctx sooner (during
creation, rather than entering the coroutine), where non-block callers
still end up with the same context, and block callers now have a chance
to provide their desired context up front.

Makes for a big patch due to the fallout, but the result seems sane to me.

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH v6] vfio error recovery: kernel support

2017-04-06 Thread Alex Williamson
On Thu, 6 Apr 2017 16:49:35 +0800
Cao jin  wrote:

> On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote:
> > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote:  
> >> Apparently, I don't have experience inducing non-fatal errors; device
> >> errors are more a matter of chance, related to the environment
> >> (temperature, humidity, etc.), as I understand it.
> > 
> > I'm not sure how to interpret this statement. I think what Alex is
> > saying is simply that patches should include some justification. They
> > make changes but what are they improving?
> > For example:
> > 
> > I tested device ABC in conditions DEF. Without a patch VM
> > stops. With the patches applied VM recovers and proceeds to
> > use the device normally.
> > 
> > is one reasonable justification imho.
> >   
> 
> Got it. But unfortunately, until now, I haven't seen a VM stop caused by
> a real non-fatal device error during device assignment (I have only seen
> real fatal errors after starting the VM).
> On one side, an AER error could occur in theory; on the other side, few
> people have ever seen a VM stop caused by AER. Now I am asked whether I
> have real evidence or a scenario proving that this patchset is really
> useful. I don't, and we all know it is hard to trigger a real hardware
> error, so it seems I am pushed into a corner. I guess these questions
> also apply to the AER driver's author: if the scenario were easy to
> reproduce, there would be no need to write aer_inject to fake errors.

Yes, non-fatal errors are rare, so rare in fact that I wonder if
anything actually implements them.  Could we take a survey of Linux
drivers with AER support and see which ones actually do anything useful
on a non-fatal error?  I know fatal errors exist, I've seen them.  The
situation I want to avoid is adding non-fatal AER support that never
gets used and just gets in the way of the next attempt to support fatal
error recovery.  We're extending the user ABI here, which means that
for any feature we add, we need to consider how we'll support it long
term and how we'll deprecate it or further extend it when the next
level of support comes.  I'm reluctant to add code that only partially
solves a problem and only addresses the side of the problem that we're
not even sure occurs.  Thanks,

Alex



Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances

2017-04-06 Thread Peter Xu
On Thu, Apr 06, 2017 at 03:00:28PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> > This is v8 of vt-d vfio enablement series.
> > 
> > v8
> > - remove patches 1-9 since merged already
> > - add David's r-b for all the patches
> > - add Aviv's s-o-b in the last patch
> > - rename iommu to iommu_dmar [Jason]
> > - rename last patch subject to "remote IOTLB" [Jason]
> > - pick up jason's two patches to fix vhost breakage
> > - let vhost leverage the new IOMMU notifier interface
> 
> Looks good to me except I'm still wondering why there's a single patch
> by Jason.

Answered in the other thread (the other one has been picked up
already).

> Marcel - you mentioned that you intend to try to open and
> maintain pci-next.  Would you like to pick this one up?  Otherwise, pls
> repost after 2.10.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances

2017-04-06 Thread Peter Xu
On Thu, Apr 06, 2017 at 02:53:46PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote:
> > This is v8 of vt-d vfio enablement series.
> > 
> > v8
> > - remove patches 1-9 since merged already
> > - add David's r-b for all the patches
> > - add Aviv's s-o-b in the last patch
> > - rename iommu to iommu_dmar [Jason]
> > - rename last patch subject to "remote IOTLB" [Jason]
> > - pick up jason's two patches to fix vhost breakage
> 
> I only see one (6/9) - is a patch missing or misattributed?

Oh sorry I should say "jason's one patch". The other patch has already
been merged upstream, which is 375f74f47 ("vhost: generalize iommu
memory region").

> 
> > - let vhost leverage the new IOMMU notifier interface
> 
> Which patch does this?

The first one ("memory: add section range info for IOMMU notifier").

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH 2/3] move xen-hvm.c to hw/i386/xen/

2017-04-06 Thread Sahid Orentino Ferdjaoui
On Wed, Apr 05, 2017 at 04:21:30PM -0700, Anthony Xu wrote:
> move xen-hvm.c to hw/i386/xen/
> 
> 
> Signed-off -by: Anthony Xu 
> 
> 
> 
> ---
>  Makefile.target|  3 +--
>  hw/i386/xen/Makefile.objs  |  2 +-
>  hw/i386/xen/trace-events   | 11 +++
>  xen-hvm.c => hw/i386/xen/xen-hvm.c |  2 +-
>  stubs/Makefile.objs|  1 +
>  xen-hvm-stub.c => stubs/xen-hvm.c  |  0
>  trace-events   | 11 ---
>  7 files changed, 15 insertions(+), 15 deletions(-)
>  rename xen-hvm.c => hw/i386/xen/xen-hvm.c (99%)
>  rename xen-hvm-stub.c => stubs/xen-hvm.c (100%)

Reviewed-By: Sahid Orentino Ferdjaoui 

> 
> diff --git a/Makefile.target b/Makefile.target
> index 48c027f..d5ff0c7 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -150,8 +150,7 @@ obj-y += migration/ram.o migration/savevm.o
>  LIBS := $(libs_softmmu) $(LIBS)
>  
>  # xen support
> -obj-$(CONFIG_XEN_I386) += xen-hvm.o xen-mapcache.o
> -obj-$(call lnot,$(CONFIG_XEN_I386)) += xen-hvm-stub.o
> +obj-$(CONFIG_XEN_I386) += xen-mapcache.o
>  
>  # Hardware support
>  ifeq ($(TARGET_NAME), sparc64)
> diff --git a/hw/i386/xen/Makefile.objs b/hw/i386/xen/Makefile.objs
> index 801a68d..daf4f53 100644
> --- a/hw/i386/xen/Makefile.objs
> +++ b/hw/i386/xen/Makefile.objs
> @@ -1 +1 @@
> -obj-y += xen_platform.o xen_apic.o xen_pvdevice.o
> +obj-y += xen_platform.o xen_apic.o xen_pvdevice.o xen-hvm.o

Refers to the file moved from /xen-hvm.c to /hw/i386/xen/xen-hvm.c

> diff --git a/hw/i386/xen/trace-events b/hw/i386/xen/trace-events
> index 321fe60..f25d622 100644
> --- a/hw/i386/xen/trace-events
> +++ b/hw/i386/xen/trace-events
> @@ -4,3 +4,14 @@ xen_platform_log(char *s) "xen platform: %s"
>  # hw/i386/xen/xen_pvdevice.c
>  xen_pv_mmio_read(uint64_t addr) "WARNING: read from Xen PV Device MMIO space 
> (address %"PRIx64")"
>  xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space 
> (address %"PRIx64")"
> +
> +# xen-hvm.c
> +xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: %#lx, 
> size %#lx"
> +xen_client_set_memory(uint64_t start_addr, unsigned long size, bool 
> log_dirty) "%#"PRIx64" size %#lx, log_dirty %i"
> +handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df, uint32_t 
> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) 
> "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d 
> size=%d"
> +handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t 
> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) 
> "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d 
> size=%d"
> +handle_ioreq_write(void *req, uint32_t type, uint32_t df, uint32_t 
> data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) 
> "I/O=%p write type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d 
> size=%d"
> +cpu_ioreq_pio(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, 
> uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p pio 
> dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d"
> +cpu_ioreq_pio_read_reg(void *req, uint64_t data, uint64_t addr, uint32_t 
> size) "I/O=%p pio read reg data=%#"PRIx64" port=%#"PRIx64" size=%d"
> +cpu_ioreq_pio_write_reg(void *req, uint64_t data, uint64_t addr, uint32_t 
> size) "I/O=%p pio write reg data=%#"PRIx64" port=%#"PRIx64" size=%d"
> +cpu_ioreq_move(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, 
> uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p copy 
> dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d"

coming from /trace-events

> diff --git a/xen-hvm.c b/hw/i386/xen/xen-hvm.c
> similarity index 99%
> rename from xen-hvm.c
> rename to hw/i386/xen/xen-hvm.c
> index 5043beb..0892361 100644
> --- a/xen-hvm.c
> +++ b/hw/i386/xen/xen-hvm.c
> @@ -22,7 +22,7 @@
>  #include "qemu/error-report.h"
>  #include "qemu/range.h"
>  #include "sysemu/xen-mapcache.h"
> -#include "trace-root.h"
> +#include "trace.h"
>  #include "exec/address-spaces.h"
>  
>  #include 
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 6c80613..f5b47bf 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -38,3 +38,4 @@ stub-obj-y += target-get-monitor-def.o
>  stub-obj-y += pc_madt_cpu_entry.o
>  stub-obj-y += vmgenid.o
>  stub-obj-y += xen-common.o
> +stub-obj-y += xen-hvm.o

Refers to the file moved from /xen-hvm-stub.c to /stubs/xen-hvm.c

> diff --git a/xen-hvm-stub.c b/stubs/xen-hvm.c
> similarity index 100%
> rename from xen-hvm-stub.c
> rename to stubs/xen-hvm.c
> diff --git a/trace-events b/trace-events
> index b07a09b..4e14487 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -48,17 +48,6 @@ spice_vmc_register_interface(void *scd) "spice vmc 
> registered interface %p"
>  spice_vmc_unregister_interface(void *scd) "spice vmc 

Re: [Qemu-devel] [PATCH 1/2] parallels: create parallels_inactivate() callback of block driver state

2017-04-06 Thread Denis V. Lunev
On 04/06/2017 06:14 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 05, 2017 at 06:12:05PM +0300, Denis V. Lunev wrote:
>> We should avoid to image file at migration end when BDRV_O_INACTIVE
> s/avoid to image/avoid writing to the image/ ?
yes


>> is set on the block driver state. All modifications should be done
>> in advance, i.e. in bdrv_inactivate.
>>
>> The patch also adds final bdrv_flush() to be sure that changes have
>> reached disk finally before other side will access the image.
> If migration fails/cancels on the source after .bdrv_inactivate() we
> need to return to the "activated" state.  .bdrv_invalidate_cache() will
> be called so I think it needs to be implemented by parallels to set
> s->header->inuse again if BDRV_O_RDWR.
>
> .bdrv_invalidate_cache() is also called on the destination at live
> migration handover.  That's okay - it makes sense for the destination to
> set the inuse field on success.
hmm. sounds like a good catch..


>> -static void parallels_close(BlockDriverState *bs)
>> +static int parallels_inactivate(BlockDriverState *bs)
>>  {
>> +int err;
>>  BDRVParallelsState *s = bs->opaque;
>>  
>> -if (bs->open_flags & BDRV_O_RDWR) {
>> -s->header->inuse = 0;
>> -parallels_update_header(bs);
>> +if (!(bs->open_flags & BDRV_O_RDWR)) {
>> +return 0;
>> +}
>> +
>> +s->header->inuse = 0;
>> +err = parallels_update_header(bs);
>> +if (err < 0) {
> Should we set s->header->inuse again in all error paths in case
> something calls parallels_update_header() again later?
yes.

thank you :)




Re: [Qemu-devel] [PATCH 1/3] move xen-common.c to hw/xen/

2017-04-06 Thread Eric Blake
On 04/06/2017 10:10 AM, Sahid Orentino Ferdjaoui wrote:
> On Wed, Apr 05, 2017 at 04:21:29PM -0700, Anthony Xu wrote:
>> move xen-common.c to hw/xen/
>>
>> Signed-off -by: Anthony Xu 
> 
> nit: s/Signed-off -by/Signed-off-by, not sure if it's really important
> or not.

Some of the automated tooling may choke on the misspelled variant, and
S-o-b is legally important, so it is worth fixing.

By the way, you can use 'git commit -s' or 'git send-email -s' to add
S-o-b in the proper format at a given stage in your workflow, rather
than having to type it by hand; it becomes obvious that you didn't use
git's automation for adding it when you have a typo or when you have
more than one blank line between your S-o-b and the --- separator.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context

2017-04-06 Thread Kevin Wolf
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> The fact that the bs->aio_context is changing can confuse the dataplane
> iothread, because of the now fine granularity aio context lock.
> bdrv_drain should rather be a bdrv_drained_begin/end pair, but since
> bs->aio_context is changing, we can just use aio_disable_external and
> block_job_pause.
> 
> Reported-by: Ed Swierk 
> Signed-off-by: Fam Zheng 
> ---
>  block.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 8893ac1..e70684a 100644
> --- a/block.c
> +++ b/block.c
> @@ -4395,11 +4395,14 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
>  
>  void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
>  {
> -AioContext *ctx;
> +AioContext *ctx = bdrv_get_aio_context(bs);
>  
> +aio_disable_external(ctx);
> +if (bs->job) {
> +block_job_pause(bs->job);
> +}

Even more bs->job users... :-(

But is this one actually necessary? We drain all pending BHs below, so
the job shouldn't have any requests in flight or be able to submit new
ones while we switch.

>  bdrv_drain(bs); /* ensure there are no in-flight requests */
>  
> -ctx = bdrv_get_aio_context(bs);
>  while (aio_poll(ctx, false)) {
>  /* wait for all bottom halves to execute */
>  }
> @@ -4412,6 +4415,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, 
> AioContext *new_context)
>  aio_context_acquire(new_context);
>  bdrv_attach_aio_context(bs, new_context);
>  aio_context_release(new_context);
> +if (bs->job) {
> +block_job_resume(bs->job);
> +}
> +aio_enable_external(ctx);
>  }

The aio_disable/enable_external() pair seems to make sense anyway.

Kevin



Re: [Qemu-devel] [PATCH 1/2] parallels: create parallels_inactivate() callback of block driver state

2017-04-06 Thread Stefan Hajnoczi
On Wed, Apr 05, 2017 at 06:12:05PM +0300, Denis V. Lunev wrote:
> We should avoid to image file at migration end when BDRV_O_INACTIVE

s/avoid to image/avoid writing to the image/ ?

> is set on the block driver state. All modifications should be done
> in advance, i.e. in bdrv_inactivate.
> 
> The patch also adds final bdrv_flush() to be sure that changes have
> reached disk finally before other side will access the image.

If migration fails/cancels on the source after .bdrv_inactivate() we
need to return to the "activated" state.  .bdrv_invalidate_cache() will
be called so I think it needs to be implemented by parallels to set
s->header->inuse again if BDRV_O_RDWR.

.bdrv_invalidate_cache() is also called on the destination at live
migration handover.  That's okay - it makes sense for the destination to
set the inuse field on success.

> -static void parallels_close(BlockDriverState *bs)
> +static int parallels_inactivate(BlockDriverState *bs)
>  {
> +int err;
>  BDRVParallelsState *s = bs->opaque;
>  
> -if (bs->open_flags & BDRV_O_RDWR) {
> -s->header->inuse = 0;
> -parallels_update_header(bs);
> +if (!(bs->open_flags & BDRV_O_RDWR)) {
> +return 0;
> +}
> +
> +s->header->inuse = 0;
> +err = parallels_update_header(bs);
> +if (err < 0) {

Should we set s->header->inuse again in all error paths in case
something calls parallels_update_header() again later?




Re: [Qemu-devel] [PATCH for-2.9 4/5] block: Drain BH in bdrv_drained_begin

2017-04-06 Thread Eric Blake
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
> 
> qmp_block_commit
>   commit_active_start
> bdrv_reopen
>   bdrv_reopen_multiple
> bdrv_reopen_prepare
>   bdrv_flush
> aio_poll
>   aio_bh_poll
> aio_bh_call
>   block_job_defer_to_main_loop_bh
> stream_complete
>   bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH before leaving bdrv_drained_begin().
> 
> Signed-off-by: Fam Zheng 
> ---
>  block/io.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 

Nice writeup.

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier

2017-04-06 Thread Alex Williamson
On Thu,  6 Apr 2017 15:08:36 +0800
Peter Xu  wrote:

> In this patch, IOMMUNotifier.{start|end} are introduced to store section
> information for a specific notifier. When notification occurs, we not
> only check the notification type (MAP|UNMAP), but also check whether the
> notified iova range overlaps with the range of specific IOMMU notifier,
> and skip those notifiers if not in the listened range.
> 
> When removing a region, we need to make sure we removed the correct
> VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
> 
> This patch is solving the problem that vfio-pci devices receive
> duplicated UNMAP notification on x86 platform when vIOMMU is there. The
> issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
> splitted by the (0xfee0, 0xfeef) IRQ region. AFAIK
> this (splitted IOMMU region) is only happening on x86.
> 
> This patch also helps vhost to leverage the new interface as well, so
> that vhost won't get duplicated cache flushes. In that sense, it's a
> slight performance improvement.
> 
> Suggested-by: David Gibson 
> Signed-off-by: Peter Xu 
> ---
> v7->v8:
> - let vhost dmar leverage the new interface as well
> - add some more comments in commit message, mentioning what issue this
>   patch has solved
> - since touched up, removing Alex's a-b and DavidG's r-b
> ---
>  hw/vfio/common.c  | 12 +---
>  hw/virtio/vhost.c | 10 --
>  include/exec/memory.h | 19 ++-
>  memory.c  |  9 +
>  4 files changed, 44 insertions(+), 6 deletions(-)


Acked-by: Alex Williamson 

> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f3ba9b9..6b33b9f 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  giommu->iommu_offset = section->offset_within_address_space -
> section->offset_within_region;
>  giommu->container = container;
> -giommu->n.notify = vfio_iommu_map_notify;
> -giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
> +llend = int128_add(int128_make64(section->offset_within_region),
> +   section->size);
> +llend = int128_sub(llend, int128_one());
> > +iommu_notifier_init(&giommu->n, vfio_iommu_map_notify,
> +IOMMU_NOTIFIER_ALL,
> +section->offset_within_region,
> +int128_get64(llend));
> >  QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>  
> >  memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener 
> *listener,
>  VFIOGuestIOMMU *giommu;
>  
> > QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> -if (giommu->iommu == section->mr) {
> +if (giommu->iommu == section->mr &&
> +giommu->n.start == section->offset_within_region) {
>  memory_region_unregister_iommu_notifier(giommu->iommu,
> > &giommu->n);
>  QLIST_REMOVE(giommu, giommu_next);
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 613494d..185b95b 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener 
> *listener,
>  struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>   iommu_listener);
>  struct vhost_iommu *iommu;
> +Int128 end;
>  
>  if (!memory_region_is_iommu(section->mr)) {
>  return;
>  }
>  
>  iommu = g_malloc0(sizeof(*iommu));
> -iommu->n.notify = vhost_iommu_unmap_notify;
> -iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP;
> +end = int128_add(int128_make64(section->offset_within_region),
> + section->size);
> +end = int128_sub(end, int128_one());
> +iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
> +IOMMU_NOTIFIER_UNMAP,
> +section->offset_within_region,
> +int128_get64(end));
>  iommu->mr = section->mr;
>  iommu->iommu_offset = section->offset_within_address_space -
>section->offset_within_region;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f20b191..0840c89 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -77,13 +77,30 @@ typedef enum {
>  
>  #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP)
>  
> +struct IOMMUNotifier;
> +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier,
> +IOMMUTLBEntry *data);
> +
>  struct IOMMUNotifier {
> -void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
> 

Re: [Qemu-devel] [PATCH 1/3] move xen-common.c to hw/xen/

2017-04-06 Thread Sahid Orentino Ferdjaoui
On Wed, Apr 05, 2017 at 04:21:29PM -0700, Anthony Xu wrote:
> move xen-common.c to hw/xen/
> 
> Signed-off -by: Anthony Xu 

nit: s/Signed-off -by/Signed-off-by/, not sure if it's really important
or not.

Reviewed-By: Sahid Orentino Ferdjaoui 

> 
> ---
>  Makefile.target | 2 --
>  hw/xen/Makefile.objs| 2 +-
>  xen-common.c => hw/xen/xen-common.c | 0
>  stubs/Makefile.objs | 1 +
>  xen-common-stub.c => stubs/xen-common.c | 0
>  5 files changed, 2 insertions(+), 3 deletions(-)
>  rename xen-common.c => hw/xen/xen-common.c (100%)
>  rename xen-common-stub.c => stubs/xen-common.c (100%)
>
> diff --git a/Makefile.target b/Makefile.target
> index 7df2b8c..48c027f 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -150,9 +150,7 @@ obj-y += migration/ram.o migration/savevm.o
>  LIBS := $(libs_softmmu) $(LIBS)
>  
>  # xen support
> -obj-$(CONFIG_XEN) += xen-common.o
>  obj-$(CONFIG_XEN_I386) += xen-hvm.o xen-mapcache.o
> -obj-$(call lnot,$(CONFIG_XEN)) += xen-common-stub.o
>  obj-$(call lnot,$(CONFIG_XEN_I386)) += xen-hvm-stub.o
>  
>  # Hardware support
> diff --git a/hw/xen/Makefile.objs b/hw/xen/Makefile.objs
> index 4be3ec9..64a70bc 100644
> --- a/hw/xen/Makefile.objs
> +++ b/hw/xen/Makefile.objs
> @@ -1,5 +1,5 @@
>  # xen backend driver support
> -common-obj-$(CONFIG_XEN) += xen_backend.o xen_devconfig.o xen_pvdev.o
> +common-obj-$(CONFIG_XEN) += xen_backend.o xen_devconfig.o xen_pvdev.o 
> xen-common.o

Refers to the file moved from /xen-common.c to /hw/xen/xen-common.c

>  obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o
>  obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o 
> xen_pt_graphics.o xen_pt_msi.o
> diff --git a/xen-common.c b/hw/xen/xen-common.c
> similarity index 100%
> rename from xen-common.c
> rename to hw/xen/xen-common.c
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 224f04b..6c80613 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -37,3 +37,4 @@ stub-obj-y += target-monitor-defs.o
>  stub-obj-y += target-get-monitor-def.o
>  stub-obj-y += pc_madt_cpu_entry.o
>  stub-obj-y += vmgenid.o
> +stub-obj-y += xen-common.o

Refers to the file moved from /xen-common-stub.c to /stubs/xen-common.c

> diff --git a/xen-common-stub.c b/stubs/xen-common.c
> similarity index 100%
> rename from xen-common-stub.c
> rename to stubs/xen-common.c
> -- 
> 1.8.3.1
> 



[Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation

2017-04-06 Thread Alberto Garcia
Hi all,

over the past couple of months I discussed with some of you the
possibility to extend the qcow2 format in order to improve its
performance and reduce its memory requirements (particularly with very
large images).

After some discussion in the mailing list and the #qemu IRC channel I
decided to write a prototype of a new extension for qcow2 so I could
understand better the scope of the changes and have some preliminary
data about its effects.

This e-mail is the formal presentation of my proposal to extend the
on-disk qcow2 format. As you can see this is still an RFC. Due to the
nature of the changes I would like to get as much feedback as possible
before going forward.

=== Problem ===

The original problem that I wanted to address is the memory
requirements of qcow2 files if you want to have very large images and
still keep good I/O performance. This is a consequence of its very
structure, which I'm going to describe now.

A qcow2 image is divided into units of constant size called clusters,
and among other things it contains metadata that maps guest addresses
to host addresses (the so-called L1 and L2 tables).

There are two basic problems that result from this:

1) Reading from or writing to a qcow2 image involves reading the
   corresponding entry on the L2 table that maps the guest address to
   the host address. This is very slow because it involves two I/O
   operations: one on the L2 table and the other one on the actual
   data cluster.

2) A cluster is the smallest unit of allocation. Therefore writing a
   mere 512 bytes to an empty disk requires allocating a complete
   cluster and filling it with zeroes (or with data from the backing
   image if there is one). This wastes more disk space and also has a
   negative impact on I/O.
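
As a back-of-the-envelope illustration of (2), the host space consumed by
a small write rounds up to whole clusters. A standalone sketch (the
function name is mine, not from the qcow2 code):

```c
#include <assert.h>
#include <stdint.h>

/* Host bytes that must be allocated to back a guest write of 'len'
 * bytes at 'offset', given that whole clusters are the allocation
 * unit. */
static uint64_t bytes_allocated(uint64_t offset, uint64_t len,
                                uint64_t cluster_size)
{
    uint64_t first = offset / cluster_size;
    uint64_t last = (offset + len - 1) / cluster_size;

    return (last - first + 1) * cluster_size;
}
```

A 512-byte write into an unallocated region of a 64KB-cluster image
still costs the full 64KB, plus the cost of zero-filling it or copying
the data from the backing image.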

Problem (1) can be solved by keeping in memory a cache of the L2
tables (QEMU has an "l2_cache_size" parameter for this purpose). The
amount of disk space used by L2 tables depends on two factors: the
disk size and the cluster size.

The cluster size can be configured when the image is created, and it
can be any power of two between 512 bytes and 2 MB (it defaults to 64
KB).

The maximum amount of space needed for the L2 tables can be calculated
with the following formula:

   max_l2_size = virtual_disk_size * 8 / cluster_size

Large images require a large amount of metadata, and therefore a large
amount of memory for the L2 cache. With the default cluster size
(64KB) that's 128MB of L2 cache for a 1TB qcow2 image.
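
Those numbers follow directly from the formula above; a minimal sketch
to check them (the function name is mine, not from QEMU):

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case total size of the L2 tables: one 8-byte entry per
 * cluster of the virtual disk. */
static uint64_t max_l2_size(uint64_t virtual_disk_size,
                            uint64_t cluster_size)
{
    return virtual_disk_size * 8 / cluster_size;
}
```

For a 1TB image this gives 128MB with 64KB clusters but only 4MB with
the maximum 2MB cluster size, which is why larger clusters shrink the
L2 cache at the price of coarser allocation.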

The only way to reduce the size of the L2 tables is therefore
increasing the cluster size, but this makes the image less efficient
as described earlier in (2).

=== The proposal ===

The idea of this proposal is to extend the qcow2 format by allowing
subcluster allocation. There would be an optional feature that would
need to be enabled when creating the image. The on-disk format would
remain essentially the same, except that each data cluster would be
internally divided into a number of subclusters of equal size.

What this means in practice is that each entry on an L2 table would be
accompanied by a bitmap indicating the allocation state of each one of
the subclusters for that cluster. There are several alternatives for
storing the bitmap, described below.
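
For illustration, locating and testing the subcluster bit for a given
guest offset could look like this (a sketch with invented names; the
actual prototype code will differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the subcluster covering 'guest_offset' within its cluster. */
static unsigned subcluster_index(uint64_t guest_offset,
                                 uint64_t cluster_size,
                                 unsigned n_subclusters)
{
    uint64_t subcluster_size = cluster_size / n_subclusters;

    return (guest_offset % cluster_size) / subcluster_size;
}

/* Bit i of the per-L2-entry bitmap set means subcluster i is
 * allocated. */
static bool subcluster_is_allocated(uint32_t bitmap, unsigned index)
{
    return (bitmap >> index) & 1;
}
```

With 128KB clusters and 4KB subclusters, for example, a 4KB write only
needs its own bit set instead of allocating and filling the whole
cluster.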

Other than L2 entries, all other data structures would remain
unchanged, but for data clusters the smallest unit of allocation would
now be the subcluster. Reference counting would still be at the
cluster level, because there's no way to reference individual
subclusters. Copy-on-write on internal snapshots would need to copy
complete clusters, so that scenario would not benefit from this
change.

I see two main use cases for this feature:

a) The qcow2 image is not too large / the L2 cache is not a problem,
   but you want to increase the allocation performance. In this case
   you can have something like a 128KB cluster with 4KB subclusters
   (with 4KB being a common block size in ext4 and other filesystems)

b) The qcow2 image is very large and you want to save metadata space
   in order to have a smaller L2 cache. In this case you can go for
   the maximum cluster size (2MB) but you want to have smaller
   subclusters to increase the allocation performance and optimize the
   disk usage. This was actually my original use case.

=== Test results ===

I have a basic working prototype of this. It's still incomplete (and
buggy :) but it gives an idea of what we can expect from it. In my
implementation each data cluster has 8 subclusters, but that's not set
in stone (see below).

I made all tests on an SSD drive, writing to an empty qcow2 image with
a fully populated 40GB backing image, performing random writes using
fio with a block size of 4KB.

I tried with the default and maximum cluster sizes (64KB and 2MB) and
also with some other sizes. I also made sure to try with 32KB clusters
so the subcluster size matches the 4KB block size used for the I/O.

It's important to point out that once 

Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context

2017-04-06 Thread Eric Blake
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> The fact that the bs->aio_context is changing can confuse the dataplane
> iothread, because of the now fine-grained aio context lock.
> bdrv_drain should rather be a bdrv_drained_begin/end pair, but since
> bs->aio_context is changing, we can just use aio_disable_external and
> block_job_pause.
> 
> Reported-by: Ed Swierk 
> Signed-off-by: Fam Zheng 
> ---
>  block.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH v5] Allow setting NUMA distance for different NUMA nodes

2017-04-06 Thread Andrew Jones
On Thu, Apr 06, 2017 at 10:18:53AM +0800, He Chen wrote:
> This patch is going to add SLIT table support in QEMU, and provides an
> additional option `dist` for the `-numa` command to allow users to set
> vNUMA distance via the QEMU command line.
> 
> With this patch, when a user wants to create a guest that contains
> several vNUMA nodes and also wants to set distance among those nodes,
> the QEMU command would look like:
> 
> ```
> -numa node,nodeid=0,cpus=0 \
> -numa node,nodeid=1,cpus=1 \
> -numa node,nodeid=2,cpus=2 \
> -numa node,nodeid=3,cpus=3 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=0,dst=2,val=31 \
> -numa dist,src=0,dst=3,val=41 \
> -numa dist,src=1,dst=2,val=21 \
> -numa dist,src=1,dst=3,val=31 \
> -numa dist,src=2,dst=3,val=21 \
> ```
> 
> Signed-off-by: He Chen 
> ---
>  hw/acpi/aml-build.c |  25 +
>  hw/i386/acpi-build.c|   2 +
>  include/hw/acpi/aml-build.h |   1 +
>  include/sysemu/numa.h   |   1 +
>  include/sysemu/sysemu.h |   4 ++
>  numa.c  | 121 
> 
>  qapi-schema.json|  30 ++-
>  qemu-options.hx |  17 ++-
>  8 files changed, 198 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index c6f2032..2c6ab07 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -24,6 +24,7 @@
>  #include "hw/acpi/aml-build.h"
>  #include "qemu/bswap.h"
>  #include "qemu/bitops.h"
> +#include "sysemu/numa.h"
>  
>  static GArray *build_alloc_array(void)
>  {
> @@ -1609,3 +1610,27 @@ void build_srat_memory(AcpiSratMemoryAffinity 
> *numamem, uint64_t base,
>  numamem->base_addr = cpu_to_le64(base);
>  numamem->range_length = cpu_to_le64(len);
>  }
> +
> +/*
> + * ACPI spec 5.2.17 System Locality Distance Information Table
> + * (Revision 2.0 or later)
> + */
> +void build_slit(GArray *table_data, BIOSLinker *linker)
> +{
> +int slit_start, i, j;
> +slit_start = table_data->len;
> +
> +acpi_data_push(table_data, sizeof(AcpiTableHeader));
> +
> +build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> +for (i = 0; i < nb_numa_nodes; i++) {
> +for (j = 0; j < nb_numa_nodes; j++) {
> +build_append_int_noprefix(table_data, numa_info[i].distance[j], 
> 1);
> +}
> +}
> +
> +build_header(linker, table_data,
> + (void *)(table_data->data + slit_start),
> + "SLIT",
> + table_data->len - slit_start, 1, NULL, NULL);
> +}
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 2073108..12730ea 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState 
> *machine)
>  if (pcms->numa_nodes) {
>  acpi_add_table(table_offsets, tables_blob);
>  build_srat(tables_blob, tables->linker, machine);
> +acpi_add_table(table_offsets, tables_blob);
> +build_slit(tables_blob, tables->linker);

We could make the generation of the SLIT dependent on have_numa_distance.

>  }
>  if (acpi_get_mcfg()) {
>  acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 00c21f1..329a0d0 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
>  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> uint64_t len, int node, MemoryAffinityFlags flags);
>  
> +void build_slit(GArray *table_data, BIOSLinker *linker);
>  #endif
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 8f09dcf..2f7a941 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -21,6 +21,7 @@ typedef struct node_info {
>  struct HostMemoryBackend *node_memdev;
>  bool present;
>  QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> +uint8_t distance[MAX_NODES];
>  } NodeInfo;
>  
>  extern NodeInfo numa_info[MAX_NODES];
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 576c7ce..6999545 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -169,6 +169,10 @@ extern int mem_prealloc;
>  
>  #define MAX_NODES 128
>  #define NUMA_NODE_UNASSIGNED MAX_NODES
> +#define NUMA_DISTANCE_MIN 10
> +#define NUMA_DISTANCE_DEFAULT 20
> +#define NUMA_DISTANCE_MAX 254
> +#define NUMA_DISTANCE_UNREACHABLE 255
>  
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
> diff --git a/numa.c b/numa.c
> index 6fc2393..838e45a 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -52,6 +52,7 @@ static int max_numa_nodeid; /* Highest specified NUMA node 
> ID, plus one.
>   */
>  int nb_numa_nodes;
>  NodeInfo numa_info[MAX_NODES];
> +static bool have_numa_distance;
>  
>  void numa_set_mem_node_id(ram_addr_t addr, uint64_t size, 

Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure

2017-04-06 Thread Xiao Guangrong



On 31/03/2017 4:41 PM, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I send it out early in order to get
comments on one point I'm uncertain (see the detailed explanation
below). Thanks for any comments in advance!


Background
---
Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".


Why is it RFC?
---
RFC is added because I'm not sure whether the way in this patch series
that allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least for writes) to the flush
hint addresses in order to perform the necessary flush on the host
back store. Therefore, QEMU needs to create IO memory regions that
cover those flush hint addresses. In order to create those IO memory
regions, QEMU needs to know the flush hint addresses or their offsets
to other known memory regions in advance. So far looks good.

Flush hint addresses are in the guest address space. Looking at how
the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
to take the same way for flush hint addresses, i.e. let the guest
firmware allocate from free addresses and patch them in the flush hint
address structure. (*Please correct me if my following understanding is wrong*)
However, the current allocation and pointer patching are transparent
to QEMU, so QEMU will be unaware of the flush hint addresses, and
consequently have no way to create corresponding IO memory regions in
order to trap guest accesses.


Er, it is awkward, and the flush hint table is static, so it may not be
easily patched.



Alternatively, this patch series moves the allocation of flush hint
addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each
   nvdimm device. Its size is specified by the user via a new pc-dimm
   option 'reserved-size'.



We should make it only work for nvdimm?


   For the following example,
-object memory-backend-file,id=mem0,size=4G,...
-device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
-device pc-dimm,id=dimm1,...
   if dimm0 is allocated to address N ~ N+4G, the address of dimm1
   will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
   dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
   structure for each nvdimm device, it will allocate them from the
   above reserved area, e.g. the flush hint addresses of above dimm0
   are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
   this way, so QEMU can easily create IO memory regions for them.

   If the reserved area is not present or too small, QEMU will report
   errors.



We should make 'reserved-size' always page-aligned and transparent to
the user, i.e., automatically reserve 4K if 'flush-hint' is specified?




Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store

2017-04-06 Thread Xiao Guangrong



On 03/31/2017 04:41 PM, Haozhong Zhang wrote:

fsync() is used to persist modifications to the back store. If the
host NVDIMM is used as the back store, fsync() on Linux will trigger
the write to the host flush hint address.

Signed-off-by: Haozhong Zhang 
---
 hw/mem/nvdimm.c | 22 ++
 include/hw/mem/nvdimm.h | 13 +
 2 files changed, 35 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0..484ab8b 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice 
*dimm)
 return &nvdimm->nvdimm_mr;
 }

+static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
+{
+if (nvdimm->flush_hint_enabled) {
+nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);


Hmm, IIRC host-mem-file does not initialize backend_fd at all.



Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure

2017-04-06 Thread Xiao Guangrong



On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:

On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I send it out early in order to get
comments on one point I'm uncertain (see the detailed explanation
below). Thanks for any comments in advance!
Background
---


Extra background:

Flush Hint Addresses are necessary because:

1. Some hardware configurations may require them.  In other words, a
   cache flush instruction is not enough to persist data.

2. The host file system may need fsync(2) calls (e.g. to persist
   metadata changes).

Without Flush Hint Addresses only some NVDIMM configurations actually
guarantee data persistence.


Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".


Do you have performance data?  I'm concerned that the Flush Hint Address
hardware interface is not virtualization-friendly.

In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:

  wmb();
  for (i = 0; i < nd_region->ndr_mappings; i++)
          if (ndrd_get_flush_wpq(ndrd, i, 0))
                  writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
  wmb();

That looks pretty lightweight - it's an MMIO write between write
barriers.

This patch implements the MMIO write like this:

  void nvdimm_flush(NVDIMMDevice *nvdimm)
  {
      if (nvdimm->backend_fd != -1) {
          /*
           * If the backend store is a physical NVDIMM device, fsync()
           * will trigger the flush via the flush hint on the host device.
           */
          fsync(nvdimm->backend_fd);
      }
  }

The MMIO store instruction turned into a synchronous fsync(2) system
call plus vmexit/vmenter and QEMU userspace context switch:

1. The vcpu blocks during the fsync(2) system call.  The MMIO write
   instruction has an unexpected and huge latency.

2. The vcpu thread holds the QEMU global mutex so all other threads
   (including the monitor) are blocked during fsync(2).  Other vcpu
   threads may block if they vmexit.

It is hard to implement this efficiently in QEMU.  This is why I said
the hardware interface is not virtualization-friendly.  It's cheap on
real hardware but expensive under virtualization.
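
Stripped of the QEMU details, each trapped flush-hint store reduces to a
synchronous fsync(2) on the backing file, along these lines (a
standalone sketch, not QEMU code; the function name is invented):

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* What servicing one guest flush-hint MMIO write costs the host:
 * a full durability barrier on the backing file descriptor, during
 * which the vcpu thread is blocked. */
static int flush_hint_store(int backend_fd)
{
    if (backend_fd < 0) {
        return 0;               /* no backing fd, nothing to flush */
    }
    return fsync(backend_fd);   /* one synchronous flush per guest store */
}
```

An instruction that is a single MMIO write between barriers on bare
metal thus becomes a vmexit plus a blocking system call per store under
virtualization.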

We should think about the optimal way of implementing Flush Hint
Addresses in QEMU.  But if there is no reasonable way to implement them
then I think it's better *not* to implement them, just like the Block
Window feature which is also not virtualization-friendly.  Users who
want a block device can use virtio-blk.  I don't think NVDIMM Block
Window can achieve better performance than virtio-blk under
virtualization (although I'm happy to be proven wrong).

Some ideas for a faster implementation:

1. Use memory_region_clear_global_locking() to avoid taking the QEMU
   global mutex.  Little synchronization is necessary as long as the
   NVDIMM device isn't hot unplugged (not yet supported anyway).

2. Can the host kernel provide a way to mmap Address Flush Hints from
   the physical NVDIMM in cases where the configuration does not require
   host kernel interception?  That way QEMU can map the physical
   NVDIMM's Address Flush Hints directly into the guest.  The hypervisor
   is bypassed and performance would be good.

I'm not sure there is anything we can do to make the case where the host
kernel wants an fsync(2) fast :(.


Good point.

We can assume flush-CPU-cache-for-persistence is always
available on Intel hardware, so the flush hint table is not
needed if the vNVDIMM is backed by a real Intel NVDIMM device.

If the vNVDIMM device is backed by a regular file, I think
fsync is the bottleneck rather than the MMIO virtualization. :(






Re: [Qemu-devel] [PATCH] Put all trace.o into libqemuutil.a

2017-04-06 Thread Stefan Hajnoczi
On Tue, Apr 04, 2017 at 09:39:39PM +, Xu, Anthony wrote:

Thanks, applied to my tracing-next tree:
https://github.com/stefanha/qemu/commits/tracing-next

I fixed up the patch when applying it.  Comments below:

> Put all trace.o into libqemuutil.a

Please use the "trace: " prefix for tracing patches.

All components of QEMU have commonly used prefixes that make it easy to
identify which area a patch affects.  People reading the mailing list or
grepping through git logs rely on this.

> 
> Currently all trace.o are linked into qemu-system, qemu-img,
> qemu-nbd, qemu-io etc., even when the corresponding components
> are not included.
> Put all trace.o into libqemuutil.a so that the linker only pulls in .o
> files containing symbols that are actually referenced by the
> program.
> 
> 
> Signed-off -by: Anthony Xu 

Please use git-format-patch(1) to send correctly formatted patches in
the future.  git-am(1) was unable to apply your email and I had to do it
manually.

Stefan




Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address

2017-04-06 Thread Xiao Guangrong



On 03/31/2017 04:41 PM, Haozhong Zhang wrote:

If option 'reserved-size=RSVD' is present, QEMU will reserve an
address range of size 'RSVD' after the ending address of pc-dimm
device.

For the following example,
-object memory-backend-file,id=mem0,size=4G,...
-device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
-device pc-dimm,id=dimm1,...
if dimm0 is allocated to address N ~ N+4G, the address range of
dimm1 will start from N+4G+4K or higher.

Its current usage is to reserve spaces for flush hint addresses
of nvdimm devices.

Signed-off-by: Haozhong Zhang 
---
 hw/mem/pc-dimm.c | 48 +---
 include/hw/mem/pc-dimm.h |  2 ++
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 9e8dab0..13dcd71 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -28,6 +28,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "hw/virtio/vhost.h"
+#include "exec/address-spaces.h"

 typedef struct pc_dimms_capacity {
  uint64_t size;
@@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, 
MemoryHotplugState *hpms,
 MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
 Error *local_err = NULL;
 uint64_t existing_dimms_capacity = 0;
-uint64_t addr;
+uint64_t addr, size = memory_region_size(mr);
+
+size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, 
> &local_err);
+if (local_err) {
+goto out;
+}

 addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, 
> &local_err);
 if (local_err) {
@@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 addr = pc_dimm_get_free_addr(hpms->base,
>  memory_region_size(&hpms->mr),
>  !addr ? NULL : &addr, align,
> - memory_region_size(mr), &local_err);
> + size, &local_err);
 if (local_err) {
 goto out;
 }
@@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 goto out;
 }

-if (existing_dimms_capacity + memory_region_size(mr) >
+if (existing_dimms_capacity + size >
 machine->maxram_size - machine->ram_size) {
>  error_setg(&local_err, "not enough space, currently 0x%" PRIx64
" in use of total hot pluggable 0x" RAM_ADDR_FMT,
@@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
 PCDIMMDevice *dimm = item->data;
 uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
  PC_DIMM_SIZE_PROP,
+ errp) +
+ object_property_get_int(OBJECT(dimm),
+ PC_DIMM_RSVD_PROP,
  errp);
 if (errp && *errp) {
 goto out;
@@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, 
const char *name,
 error_propagate(errp, local_err);
 }



Should count the reserved size in pc_existing_dimms_capacity_internal()
too.




Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure

2017-04-06 Thread Xiao Guangrong



On 04/06/2017 05:58 PM, Haozhong Zhang wrote:

On 04/06/17 17:39 +0800, Xiao Guangrong wrote:



On 31/03/2017 4:41 PM, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for
nvdimm devices in QEMU.

It's of course not for 2.9. I send it out early in order to get
comments on one point I'm uncertain (see the detailed explanation
below). Thanks for any comments in advance!


Background
---
Flush hint address structure is a substructure of NFIT and specifies
one or more addresses, namely Flush Hint Addresses. Software can write
to any one of these flush hint addresses to cause any preceding writes
to the NVDIMM region to be flushed out of the intervening platform
buffers to the targeted NVDIMM. More details can be found in ACPI Spec
6.1, Section 5.2.25.8 "Flush Hint Address Structure".


Why is it RFC?
---
RFC is added because I'm not sure whether the way in this patch series
that allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least for writes) to the flush
hint addresses in order to perform the necessary flush on the host
back store. Therefore, QEMU needs to create IO memory regions that
cover those flush hint addresses. In order to create those IO memory
regions, QEMU needs to know the flush hint addresses or their offsets
to other known memory regions in advance. So far looks good.

Flush hint addresses are in the guest address space. Looking at how
the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural
to take the same way for flush hint addresses, i.e. let the guest
firmware allocate from free addresses and patch them in the flush hint
address structure. (*Please correct me if my following understanding is wrong*)
However, the current allocation and pointer patching are transparent
to QEMU, so QEMU will be unaware of the flush hint addresses, and
consequently have no way to create corresponding IO memory regions in
order to trap guest accesses.


Er, it is awkward, and the flush hint table is static, so it may not be
easily patched.



Alternatively, this patch series moves the allocation of flush hint
addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each
   nvdimm device. Its size is specified by the user via a new pc-dimm
   option 'reserved-size'.



We should make it only work for nvdimm?



Yes, we can check whether the machine option 'nvdimm' is present when
plugging the nvdimm.


   For the following example,
-object memory-backend-file,id=mem0,size=4G,...
-device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
-device pc-dimm,id=dimm1,...
   if dimm0 is allocated to address N ~ N+4G, the address of dimm1
   will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for
   dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address
   structure for each nvdimm device, it will allocate them from the
   above reserved area, e.g. the flush hint addresses of above dimm0
   are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU in
   this way, so QEMU can easily create IO memory regions for them.

   If the reserved area is not present or too small, QEMU will report
   errors.



We should make 'reserved-size' always page-aligned and transparent to
the user, i.e., automatically reserve 4K if 'flush-hint' is specified?



4K alignment is already enforced by current memory plug code.

About the automatic reservation, is a non-zero default value
acceptable by qemu design/convention in general?


It needn't be a user-visible parameter; a field in the dimm or nvdimm
device struct that records the reserved size is enough.





Re: [Qemu-devel] [PATCH for-2.9 2/5] mirror: Fix aio context of mirror_top_bs

2017-04-06 Thread Eric Blake
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> It should be moved to the same context as source, before inserting to the
> graph.
> 
> Signed-off-by: Fam Zheng 
> ---
>  block/mirror.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 

> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 9e2fecc..e904fef 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1148,6 +1148,7 @@ static void mirror_start_job(const char *job_id, 
> BlockDriverState *bs,
>  return;
>  }
>  mirror_top_bs->total_sectors = bs->total_sectors;
> +bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
>  
>  /* bdrv_append takes ownership of the mirror_top_bs reference, need to 
> keep
>   * it alive until block_job_create() even if bs has no parent. */
> 

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [PATCH for-2.9 2/5] mirror: Fix aio context of mirror_top_bs

2017-04-06 Thread Kevin Wolf
On 06.04.2017 at 16:25, Fam Zheng wrote:
> It should be moved to the same context as source, before inserting to the
> graph.
> 
> Signed-off-by: Fam Zheng 

Reviewed-by: Kevin Wolf 



Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot

2017-04-06 Thread Kevin Wolf
On 06.04.2017 at 16:25, Fam Zheng wrote:
> bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> from old bs to the new one, but if they are not on the same aio context
> this causes an imbalance.
> 
> Fix this by forcing callers to move the aio context first, and assert
> it.
> 
> Reported-by: Ed Swierk 
> Signed-off-by: Fam Zheng 
> ---
>  block.c| 3 +++
>  blockdev.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 927ba89..8893ac1 100644
> --- a/block.c
> +++ b/block.c
> @@ -1752,6 +1752,9 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>  {
>  BlockDriverState *old_bs = child->bs;
>  
> +if (old_bs && new_bs) {
> +assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
> +}
>  if (old_bs) {
>  if (old_bs->quiesce_counter && child->role->drained_end) {
>  child->role->drained_end(child);

Can we move this to a separate patch and also add this hunk for
asserting the same thing for newly attached child nodes:

@@ -1852,6 +1855,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 bdrv_get_cumulative_perm(parent_bs, &perm, &shared_perm);
 
 assert(parent_bs->drv);
+assert(bdrv_get_aio_context(parent_bs) == bdrv_get_aio_context(child_bs));
 parent_bs->drv->bdrv_child_perm(parent_bs, NULL, child_role,
 perm, shared_perm, &perm, &shared_perm);

Kevin



Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot

2017-04-06 Thread Eric Blake
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> from old bs to the new one, but if they are not on the same aio context
> this causes an imbalance.
> 
> Fix this by forcing callers to move the aio context first, and assert
> it.
> 
> Reported-by: Ed Swierk 
> Signed-off-by: Fam Zheng 
> ---
>  block.c| 3 +++
>  blockdev.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org




