Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
David Gibson writes:
> [ Unknown signature status ]
> On Thu, Apr 06, 2017 at 03:52:47PM +0530, Nikunj A Dadhania wrote:
>> Emulating LL/SC with cmpxchg is not correct, since it can suffer from
>> the ABA problem. However, portable parallel code is written assuming
>> only cmpxchg which means that in practice this is a viable alternative.
>>
>> Signed-off-by: Nikunj A Dadhania
>> ---
>> target/ppc/translate.c | 24 +---
>> 1 file changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
>> index b6abc60..a9c733d 100644
>> --- a/target/ppc/translate.c
>> +++ b/target/ppc/translate.c
>> @@ -73,6 +73,7 @@ static TCGv cpu_cfar;
>> #endif
>> static TCGv cpu_xer, cpu_so, cpu_ov, cpu_ca, cpu_ov32, cpu_ca32;
>> static TCGv cpu_reserve;
>> +static TCGv cpu_reserve_val;
>> static TCGv cpu_fpscr;
>> static TCGv_i32 cpu_access_type;
>>
>> @@ -181,6 +182,9 @@ void ppc_translate_init(void)
>>     cpu_reserve = tcg_global_mem_new(cpu_env,
>>                                      offsetof(CPUPPCState, reserve_addr),
>>                                      "reserve_addr");
>> +   cpu_reserve_val = tcg_global_mem_new(cpu_env,
>> +                                        offsetof(CPUPPCState, reserve_val),
>> +                                        "reserve_val");
>
> I notice that lqarx is not updated. Does that matter?

That's correct, I haven't touched that yet. Most of the locks are
implemented using lwarx/stwcx.

Regards,
Nikunj
Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
On Thu, Apr 06, 2017 at 03:52:47PM +0530, Nikunj A Dadhania wrote:
> Emulating LL/SC with cmpxchg is not correct, since it can suffer from
> the ABA problem. However, portable parallel code is written assuming
> only cmpxchg which means that in practice this is a viable alternative.
>
> Signed-off-by: Nikunj A Dadhania
> ---
> target/ppc/translate.c | 24 +---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index b6abc60..a9c733d 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -73,6 +73,7 @@ static TCGv cpu_cfar;
> #endif
> static TCGv cpu_xer, cpu_so, cpu_ov, cpu_ca, cpu_ov32, cpu_ca32;
> static TCGv cpu_reserve;
> +static TCGv cpu_reserve_val;
> static TCGv cpu_fpscr;
> static TCGv_i32 cpu_access_type;
>
> @@ -181,6 +182,9 @@ void ppc_translate_init(void)
>     cpu_reserve = tcg_global_mem_new(cpu_env,
>                                      offsetof(CPUPPCState, reserve_addr),
>                                      "reserve_addr");
> +   cpu_reserve_val = tcg_global_mem_new(cpu_env,
> +                                        offsetof(CPUPPCState, reserve_val),
> +                                        "reserve_val");

I notice that lqarx is not updated. Does that matter?
>     cpu_fpscr = tcg_global_mem_new(cpu_env,
>                                    offsetof(CPUPPCState, fpscr), "fpscr");
> @@ -3023,7 +3027,7 @@ static void gen_##name(DisasContext *ctx)             \
>     }                                                                     \
>     tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);                     \
>     tcg_gen_mov_tl(cpu_reserve, t0);                                      \
> -   tcg_gen_st_tl(gpr, cpu_env, offsetof(CPUPPCState, reserve_val));      \
> +   tcg_gen_mov_tl(cpu_reserve_val, gpr);                                 \
>     tcg_temp_free(t0);                                                    \
> }
>
> @@ -3156,14 +3160,28 @@ static void gen_conditional_store(DisasContext *ctx, TCGv EA,
>                                   int reg, int memop)
> {
>     TCGLabel *l1;
> +   TCGv_i32 tmp = tcg_temp_local_new_i32();
> +   TCGv t0;
>
> +   tcg_gen_movi_i32(tmp, 0);
>     tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>     l1 = gen_new_label();
>     tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
> -   tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
> -   tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
> +
> +   t0 = tcg_temp_new();
> +   tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
> +                             ctx->mem_idx, DEF_MEMOP(memop));
> +   tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
> +   tcg_gen_trunc_tl_i32(tmp, t0);
> +
>     gen_set_label(l1);
> +   tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
> +   tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);
>     tcg_gen_movi_tl(cpu_reserve, -1);
> +   tcg_gen_movi_tl(cpu_reserve_val, 0);
> +
> +   tcg_temp_free(t0);
> +   tcg_temp_free_i32(tmp);
> }
> #endif

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
                               | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64
Cédric Le Goater writes:
> Hello Nikunj,
>
> On 04/06/2017 12:22 PM, Nikunj A Dadhania wrote:
>> The series enables Multi-Threaded TCG on PPC64
>>
>> Patch 01: Use atomic_cmpxchg in store conditional
>>       02: Handle first write to page during atomic operation
>>       03: Generate memory barriers for sync/isync and load/store conditional
>>
>> Patches are based on ppc-for-2.10
>>
>> Tested using following:
>> ./ppc64-softmmu/qemu-system-ppc64 -cpu POWER8 -vga none -nographic -machine
>> pseries,usb=off -m 2G -smp 8,cores=8,threads=1 -accel tcg,thread=multi
>> f23.img
>
> I tried it with a Ubuntu 16.04.2 guest using stress --cpu 8. It looked
> good: the CPU usage of QEMU reached 760% on the host.

Cool.

>> Todo:
>> * Enable other machine types and PPC32.
>
> I am quite ignorant on the topic.
> Have you looked at what it would take to emulate support of the HW
> threads?

We would need to implement msgsndp (doorbell support for IPI between
threads of the same core).

> and the PowerNV machine ?

Haven't tried it, but it should work. Just give it a shot and let me know
if you see problems.

Regards,
Nikunj
Re: [Qemu-devel] [PATCH 0/2] use blk_pwrite_zeroes for each zero cluster
The test result for this patch:

The migration command:

virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893 --copy-storage-all qemu+ssh://10.59.163.38/system

The qemu-img info on the source host:

qemu-img info /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 1.0T (1095216660480 bytes)
disk size: 1.5G (1638989824 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The qemu-img info on the destination host (before applying the patch):

qemu-img info /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 4.1G (4423286784 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The qemu-img info on the destination host (after applying the patch):

qemu-img info /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 2.3G (2496200704 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The disk size is reduced from 4.1G to 2.3G.

On Thu, Apr 6, 2017 at 9:15 PM, wrote:
> From: Lidong Chen
>
> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default;
> this may cause the qcow2 file size to be bigger after migration.
> This patch checks each cluster and uses blk_pwrite_zeroes for each
> zero cluster.
>
> Lidong Chen (2):
>   block: make bdrv_get_cluster_size public
>   migration/block: use blk_pwrite_zeroes for each zero cluster
>
>  block/io.c            |  2 +-
>  include/block/block.h |  1 +
>  migration/block.c     | 34 --
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> --
> 1.8.3.1
Re: [Qemu-devel] [PATCH RFC v1 3/3] target/ppc: Generate fence operations
Richard Henderson writes:
> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>> @@ -3028,6 +3030,7 @@ static void gen_##name(DisasContext *ctx)           \
>>     tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);                     \
>>     tcg_gen_mov_tl(cpu_reserve, t0);                                      \
>>     tcg_gen_mov_tl(cpu_reserve_val, gpr);                                 \
>> +   tcg_gen_mb(TCG_MO_ALL | TCG_BAR_LDAQ);                                \
>
> Please put the barrier next to the load.
> I hope we can merge these soon.

Sure

>> @@ -3196,6 +3199,7 @@ static void gen_##name(DisasContext *ctx)           \
>>     if (len > 1) {                                                        \
>>         gen_check_align(ctx, t0, (len) - 1);                              \
>>     }                                                                     \
>> +   tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);                                \
>>     gen_conditional_store(ctx, t0, rS(ctx->opcode), memop);               \
>>     tcg_temp_free(t0);                                                    \
>> }
>
> This one is more complicated...
>
> The success path includes an atomic_cmpxchg, which itself is a SC barrier.
> However, the fail path branches across the cmpxchg and so sees no barrier,
> but one is still required by the architecture. How about a
> gen_conditional_store that looks like this:
>
> {
>     TCGLabel *l1 = gen_new_label();
>     TCGLabel *l2 = gen_new_label();
>     TCGv t0;
>
>     tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>
>     t0 = tcg_temp_new();
>     tcg_gen_atomic_cmpxchg_tl(t0, cpu_reserve, cpu_reserve_val,
>                               cpu_gpr[reg], ctx->mem_idx,
>                               DEF_MEMOP(memop) | MO_ALIGN);
>     tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
>     tcg_gen_shli_tl(t0, t0, CRF_EQ_BIT);
>     tcg_gen_or_tl(t0, t0, cpu_so);
>     tcg_gen_trunc_tl(cpu_crf[0], t0);
>     tcg_temp_free(t0);
>     tcg_gen_br(l2);
>
>     gen_set_label(l1);
>
>     /* Address mismatch implies failure.  But we still need to provide
>        the memory barrier semantics of the instruction. */
>     tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);
>     tcg_gen_trunc_tl(cpu_crf[0], cpu_so);
>
>     gen_set_label(l2);
>     tcg_gen_movi_tl(cpu_reserve, -1);
> }
>
> Note that you don't need to reset cpu_reserve_val at all.

Sure.
> I just thought of something we might need to check for this and other
> ll/sc implementations -- do we need to check for address misalignment
> along the address comparison failure path?

We do that in the macro:

    if (len > 1) {                                                        \
        gen_check_align(ctx, t0, (len) - 1);                              \
    }                                                                     \

Would we still need a barrier before the alignment check?

> I suspect that we do.

Regards,
Nikunj
Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
Richard Henderson writes:
> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>>     tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>>     l1 = gen_new_label();
>>     tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>> -   tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
>> -   tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
>> +
>> +   t0 = tcg_temp_new();
>> +   tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
>> +                             ctx->mem_idx, DEF_MEMOP(memop));
>
> Actually, I noticed another, existing, problem.
>
> This code changes CRF[0] before the user memory write, which might fault.
> This needs to delay any changes to the architecture visible state until
> after any exception may be triggered.

Sure, here you are mentioning cpu_so being moved to CRF.

Regards,
Nikunj
Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
Richard Henderson writes:
> On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:
>> +   TCGv_i32 tmp = tcg_temp_local_new_i32();
>> +   TCGv t0;
>>
>> +   tcg_gen_movi_i32(tmp, 0);
>>     tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
>>     l1 = gen_new_label();
>>     tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
>> -   tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
>> -   tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
>> +
>> +   t0 = tcg_temp_new();
>> +   tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
>> +                             ctx->mem_idx, DEF_MEMOP(memop));
>> +   tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
>> +   tcg_gen_trunc_tl_i32(tmp, t0);
>> +
>>     gen_set_label(l1);
>> +   tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
>> +   tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);
>
> I encourage you to move these two lines up beside the setcond.
> That way you don't need to use a local tmp, which implies a
> spill/restore from the stack.

Sure.

Regards,
Nikunj
Re: [Qemu-devel] [PATCH v8 3/9] memory: provide iommu_replay_all()
On Thu, Apr 06, 2017 at 07:46:47PM +0800, Peter Xu wrote:
> On Thu, Apr 06, 2017 at 12:52:19PM +0200, Auger Eric wrote:
> > Hi Peter,
> >
> > On 06/04/2017 09:08, Peter Xu wrote:

[...]

> > > +void memory_region_iommu_replay_all(MemoryRegion *mr)
> > > +{
> > > +    IOMMUNotifier *notifier;
> > > +
> > > +    IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> > > +        memory_region_iommu_replay(mr, notifier, false);
> >
> > It is not fully clear to me what is the consequence of setting
> > is_write=false always?
>
> I am not quite sure about it either, on why we are having the is_write
> parameter on memory_region_iommu_replay(). It's there since the first
> commit that introduced the interface:
>
> commit a788f227ef7bd2912fcaacdfe13d13ece2998149
> Author: David Gibson
> Date:   Wed Sep 30 12:13:55 2015 +1000
>
>     memory: Allow replay of IOMMU mapping notifications
>
> However imho that should be a check against page RW permissions, which
> seems unnecessary during replay. Or say, not sure whether below patch
> would be good (as a standalone one besides current patch/series):
>
> --8<--
>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 6b33b9f..d008a4b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -488,7 +488,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>
>          memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> -        memory_region_iommu_replay(giommu->iommu, &giommu->n, false);
> +        memory_region_iommu_replay(giommu->iommu, &giommu->n);
>
>          return;
>      }
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index c4fc94d..5ab0151 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -725,11 +725,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @is_write: Whether to treat the replay as a translate "write"
> - * through the iommu
>   */
> -void
memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -                           bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n);
>
>  /**
>   * memory_region_iommu_replay_all: replay existing IOMMU translations
> diff --git a/memory.c b/memory.c
> index b782d5b..155ca1c 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1620,8 +1620,7 @@ uint64_t memory_region_iommu_get_min_page_size(MemoryRegion *mr)
>      return TARGET_PAGE_SIZE;
>  }
>
> -void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> -                                bool is_write)
> +void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
>  {
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
> @@ -1635,7 +1634,13 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      granularity = memory_region_iommu_get_min_page_size(mr);
>
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> -        iotlb = mr->iommu_ops->translate(mr, addr, is_write);
> +        /*
> +         * For a replay, we don't need to do permission check.
> +         * Assuming a "read" operation here to make sure we will pass
> +         * the check (in case we don't have write permission on the
> +         * page).
> +         */
> +        iotlb = mr->iommu_ops->translate(mr, addr, false);
>          if (iotlb.perm != IOMMU_NONE) {
>              n->notify(n, &iotlb);
>          }
> @@ -1653,7 +1658,7 @@ void memory_region_iommu_replay_all(MemoryRegion *mr)
>      IOMMUNotifier *notifier;
>
>      IOMMU_NOTIFIER_FOREACH(notifier, mr) {
> -        memory_region_iommu_replay(mr, notifier, false);
> +        memory_region_iommu_replay(mr, notifier);
>      }
>  }
>
> -->8--

This follow-up patch has an assumption that all the pages would have read
permission. That may not be true, e.g., for write-only pages. A better
way (after a short discussion with David Gibson) would be converting
is_write in iommu_ops.translate() to IOMMUAccessFlags, so that we can
pass in IOMMU_NONE here (it means, we don't check permission for this
page translation). Further, we can remove the is_write parameter in
memory_region_iommu_replay().
In all cases, I'll move this change out of the current series and send a
fresh new post after 2.10.

Thanks,

--
peterx
Re: [Qemu-devel] [PATCH] kvm: pass the virtual SEI syndrome to guest OS
Hi Laszlo, thanks. On 2017/4/7 2:55, Laszlo Ersek wrote: > On 04/06/17 14:35, gengdongjiu wrote: >> Dear, Laszlo >>Thanks for your detailed explanation. >> >> On 2017/3/29 19:58, Laszlo Ersek wrote: >>> (This ought to be one of the longest address lists I've ever seen :) >>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good >>> measure, I'm adding MST and Igor.) >>> >>> On 03/29/17 12:36, Achin Gupta wrote: Hi gengdongjiu, On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote: > > Hi Laszlo/Biesheuvel/Qemu developer, > >Now I encounter a issue and want to consult with you in ARM64 > platform, as described below: > > when guest OS happen synchronous or asynchronous abort, kvm needs > to send the error address to Qemu or UEFI through sigbus to > dynamically generate APEI table. from my investigation, there are > two ways: > > (1) Qemu get the error address, and generate the APEI table, then > notify UEFI to know this generation, then inject abort error to > guest OS, guest OS read the APEI table. > (2) Qemu get the error address, and let UEFI to generate the APEI > table, then inject abort error to guest OS, guest OS read the APEI > table. Just being pedantic! I don't think we are talking about creating the APEI table dynamically here. The issue is: Once KVM has received an error that is destined for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error into the guest OS, a CPER (Common Platform Error Record) has to be generated corresponding to the error source (GHES corresponding to memory subsystem, processor etc) to allow the guest OS to do anything meaningful with the error. So who should create the CPER is the question. At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives at EL3 and secure firmware (at EL3 or a lower secure exception level) is responsible for creating the CPER. ARM is experimenting with using a Standalone MM EDK2 image in the secure world to do the CPER creation. 
This will avoid adding the same code in ARM TF in EL3 (better for security). The error will then be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted Firmware. Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1 interface (as discussed with Christoffer below). So it should generate the CPER before injecting the error. This is corresponds to (1) above apart from notifying UEFI (I am assuming you mean guest UEFI). At this time, the guest OS already knows where to pick up the CPER from through the HEST. Qemu has to create the CPER and populate its address at the address exported in the HEST. Guest UEFI should not be involved in this flow. Its job was to create the HEST at boot and that has been done by this stage. Qemu folk will be able to add but it looks like support for CPER generation will need to be added to Qemu. We need to resolve this. Do shout if I am missing anything above. >>> >>> After reading this email, the use case looks *very* similar to what >>> we've just done with VMGENID for QEMU 2.9. >>> >>> We have a facility between QEMU and the guest firmware, called "ACPI >>> linker/loader", with which QEMU instructs the firmware to >>> >>> - allocate and download blobs into guest RAM (AcpiNVS type memory) -- >>> ALLOCATE command, >>> >>> - relocate pointers in those blobs, to fields in other (or the same) >>> blobs -- ADD_POINTER command, >>> >>> - set ACPI table checksums -- ADD_CHECKSUM command, >>> >>> - and send GPAs of fields within such blobs back to QEMU -- >>> WRITE_POINTER command. 
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>> [ASCII diagram, garbled and truncated in this archive: the HEST in
>>> etc/acpi/tables contains GHES entries whose address fields point at
>>> address registers in an etc/hardware_errors blob; each address
>>> register in turn points at an Error Status Data Block holding one or
>>> more CPERs.]
Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64
Hi,

I can help test it too on my two BE machines, if someone helps me find
where the patch is or where I can download the commits.

Thanks,
Luigi
[Qemu-devel] [PATCH for-2.9?] qcow2: Allow discard of final unaligned cluster
As mentioned in commit 0c1bd46, we ignored requests to discard the
trailing cluster of an unaligned image. While discard is an advisory
operation from the guest standpoint (and we are therefore free to ignore
any request), our qcow2 implementation exploits the fact that a discarded
cluster reads back as 0. As long as we discard on cluster boundaries, we
are fine; but that means we could observe non-zero data leaked at the
tail of an unaligned image.

Enhance iotest 66 to cover this case, and fix the implementation to honor
a discard request on the final partial cluster.

Signed-off-by: Eric Blake
---
I can't convince myself whether we strongly rely on aligned discards to
guarantee that we read back zeroes on qcow2 (it would be a stronger
contract than what the block layer requires of pdiscard, since the guest
cannot guarantee that a discard does anything). If we do, then this is a
bug fix worthy of 2.9. If we don't, then the changes to test 66 rely on
internal implementation (but the test is already specific to qcow2), and
the patch can wait for 2.10.

At any rate, I do know that we don't want to make discard work on
sub-cluster boundaries anywhere except at the tail of the image (that's
what write-zeroes is for, and it may be slower when it has to do COW or
read-modify-write). Any reliance that we (might) have on whole-cluster
discards reading back as 0 is also relying on using aligned operations.
 block/qcow2.c              |  7 ++-
 tests/qemu-iotests/066     | 12 +++-
 tests/qemu-iotests/066.out | 12
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 6a92d2e..863d889 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2515,7 +2515,12 @@ static coroutine_fn int qcow2_co_pdiscard(BlockDriverState *bs,
     if (!QEMU_IS_ALIGNED(offset | count, s->cluster_size)) {
         assert(count < s->cluster_size);
-        return -ENOTSUP;
+        /* Ignore partial clusters, except for the special case of the
+         * complete partial cluster at the end of an unaligned file */
+        if (!QEMU_IS_ALIGNED(offset, s->cluster_size) ||
+            offset + count != bs->total_sectors * BDRV_SECTOR_SIZE) {
+            return -ENOTSUP;
+        }
     }

     qemu_co_mutex_lock(&s->lock);
diff --git a/tests/qemu-iotests/066 b/tests/qemu-iotests/066
index 364166d..c2116a3 100755
--- a/tests/qemu-iotests/066
+++ b/tests/qemu-iotests/066
@@ -42,16 +42,18 @@ _supported_fmt qcow2
 _supported_proto generic
 _supported_os Linux

+# Intentionally create an unaligned image
 IMGOPTS="compat=1.1"
-IMG_SIZE=64M
+IMG_SIZE=$((64 * 1024 * 1024 + 512))

 echo
-echo "=== Testing snapshotting an image with zero clusters ==="
+echo "=== Testing cluster discards ==="
 echo
 _make_test_img $IMG_SIZE
-# Write some normal clusters, zero them (creating preallocated zero clusters)
-# and discard those
-$QEMU_IO -c "write 0 256k" -c "write -z 0 256k" -c "discard 0 256k" "$TEST_IMG" \
+# Write some normal clusters, zero some of them (creating preallocated
+# zero clusters) and discard everything. Everything should now read as 0.
+$QEMU_IO -c "write 0 256k" -c "write -z 0 256k" -c "write 64M 512" \
+    -c "discard 0 $IMG_SIZE" -c "read -P 0 0 $IMG_SIZE" "$TEST_IMG" \
     | _filter_qemu_io
 # Check the image (there shouldn't be any leaks)
 _check_test_img
diff --git a/tests/qemu-iotests/066.out b/tests/qemu-iotests/066.out
index 7bc9a10..7c1f31a 100644
--- a/tests/qemu-iotests/066.out
+++ b/tests/qemu-iotests/066.out
@@ -1,13 +1,17 @@ QA output created by 066
-=== Testing snapshotting an image with zero clusters ===
+=== Testing cluster discards ===

-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67109376
 wrote 262144/262144 bytes at offset 0
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 262144/262144 bytes at offset 0
 256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-discard 262144/262144 bytes at offset 0
-256 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 512/512 bytes at offset 67108864
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 67109376/67109376 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 67109376/67109376 bytes at offset 0
+64 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 *** done
--
2.9.3
Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration
On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen
>>
>> when migration with high speed, mig_save_device_bulk invoke
>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>> this patch limit the time used for bdrv_is_allocated.
>
> bdrv_is_allocated() is supposed to yield back to the event loop if it
> needs to block. If your VNC session is experiencing jitter then it's
> probably because a system call in the bdrv_is_allocated() code path is
> synchronous when it should be asynchronous.
>
> You could try to identify the system call using strace -f -T. In the
> output you'll see the duration of each system call. I guess there is a
> file I/O system call that is taking noticeable amounts of time.

Yes, I found where bdrv_is_allocated needs to block. The main reason is
the qemu_co_mutex_lock invoked by qcow2_co_get_block_status:

    qemu_co_mutex_lock(&s->lock);
    ret = qcow2_get_cluster_offset(bs, sector_num << 9, , _offset);
    qemu_co_mutex_unlock(&s->lock);

The other reason is the l2_load invoked by qcow2_get_cluster_offset:

    /* load the l2 table in memory */
    ret = l2_load(bs, l2_offset, &l2_table);
    if (ret < 0) {
        return ret;
    }

> A proper solution is to refactor the synchronous code to make it
> asynchronous. This might require invoking the system call from a
> thread pool worker.

Yes, I agree with you, but this is a big change. I will try to find out
how to optimize this code; that may take a long time. This patch is not
a perfect solution, but it can alleviate the problem.

> Stefan
Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context
On Thu, 04/06 17:17, Kevin Wolf wrote:
> Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> > The fact that the bs->aio_context is changing can confuse the dataplane
> > iothread, because of the now fine granularity aio context lock.
> > bdrv_drain should rather be a bdrv_drained_begin/end pair, but since
> > bs->aio_context is changing, we can just use aio_disable_external and
> > block_job_pause.
> >
> > Reported-by: Ed Swierk
> > Signed-off-by: Fam Zheng
> > ---
> >  block.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/block.c b/block.c
> > index 8893ac1..e70684a 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -4395,11 +4395,14 @@ void bdrv_attach_aio_context(BlockDriverState *bs,
> >
> >  void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> >  {
> > -    AioContext *ctx;
> > +    AioContext *ctx = bdrv_get_aio_context(bs);
> >
> > +    aio_disable_external(ctx);
> > +    if (bs->job) {
> > +        block_job_pause(bs->job);
> > +    }
>
> Even more bs->job users... :-(
>
> But is this one actually necessary? We drain all pending BHs below, so
> the job shouldn't have any requests in flight or be able to submit new
> ones while we switch.

I'm not 100% sure, but I think the aio_poll() loop below can still fire
co_sleep_cb if we don't do block_job_pause()?

> >      bdrv_drain(bs); /* ensure there are no in-flight requests */
> >
> > -    ctx = bdrv_get_aio_context(bs);
> >      while (aio_poll(ctx, false)) {
> >          /* wait for all bottom halves to execute */
> >      }
> > @@ -4412,6 +4415,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context)
> >      aio_context_acquire(new_context);
> >      bdrv_attach_aio_context(bs, new_context);
> >      aio_context_release(new_context);
> > +    if (bs->job) {
> > +        block_job_resume(bs->job);
> > +    }
> > +    aio_enable_external(ctx);
> >  }
>
> The aio_disable/enable_external() pair seems to make sense anyway.
>
> Kevin
Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot
On Thu, 04/06 16:36, Kevin Wolf wrote:
> Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> > bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> > from old bs to the new one, but if they are not on the same aio context
> > this causes unbalance.
> >
> > Fix this by forcing callers to move the aio context first, and assert
> > it.
> >
> > Reported-by: Ed Swierk
> > Signed-off-by: Fam Zheng
> > ---
> >  block.c    | 3 +++
> >  blockdev.c | 4 ++--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/block.c b/block.c
> > index 927ba89..8893ac1 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -1752,6 +1752,9 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
> >  {
> >      BlockDriverState *old_bs = child->bs;
> >
> > +    if (old_bs && new_bs) {
> > +        assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
> > +    }
> >      if (old_bs) {
> >          if (old_bs->quiesce_counter && child->role->drained_end) {
> >              child->role->drained_end(child);
>
> Can we move this to a separate patch and also add this hunk for
> asserting the same thing for newly attached child nodes:
>
> @@ -1852,6 +1855,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>      bdrv_get_cumulative_perm(parent_bs, &perm, &shared_perm);
>
>      assert(parent_bs->drv);
> +    assert(bdrv_get_aio_context(parent_bs) == bdrv_get_aio_context(child_bs));
>      parent_bs->drv->bdrv_child_perm(parent_bs, NULL, child_role,
>                                      perm, shared_perm, &perm, &shared_perm);

OK, will do.
[Qemu-devel] [PULL for-2.9 1/1] vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk
The NVIDIA BAR5 quirk is targeting an ioport BAR. Some older devices have
a BAR5 which is not ioport and can induce a segfault here. Test the BAR
type to skip these devices.

Link: https://bugs.launchpad.net/qemu/+bug/1678466
Signed-off-by: Alex Williamson
---
 hw/vfio/pci-quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e9b493b939db..349085ea12bc 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -660,7 +660,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
     VFIOConfigWindowQuirk *window;

     if (!vfio_pci_is(vdev, PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID) ||
-        !vdev->vga || nr != 5) {
+        !vdev->vga || nr != 5 || !vdev->bars[5].ioport) {
         return;
     }
[Qemu-devel] [PULL for-2.9 0/1] VFIO fixes 2017-04-06
The following changes since commit 54d689988c847271d87b3eb113712147129fb811:

  Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging (2017-04-06 09:27:49 +0100)

are available in the git repository at:

  git://github.com/awilliam/qemu-vfio.git tags/vfio-updates-20170406.0

for you to fetch changes up to 8f419c5b43b988df4ef11315aeb8524e50c99687:

  vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk (2017-04-06 16:03:26 -0600)

VFIO fixes 2017-04-06

 - Extra test for NVIDIA BAR5 quirk to avoid segfault (Alex Williamson)

Alex Williamson (1):
      vfio/pci-quirks: Exclude non-ioport BAR from NVIDIA quirk

 hw/vfio/pci-quirks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Re: [Qemu-devel] [PATCH 03/21] ppc/pnv: Add support for POWER8+ LPC Controller
On Thu, 2017-04-06 at 14:44 +0200, Cédric Le Goater wrote:
> May be we could move that under the LPC object even if it is not
> strictly correct from a HW pov ?

I'm not a fan of this. It's really not part of the LPC controller and
it's specific to a certain crop of P8 machines.

Cheers,
Ben.
Re: [Qemu-devel] [PATCH 04/21] ppc/pnv: enable only one LPC bus
On Thu, 2017-04-06 at 14:35 +0200, Cédric Le Goater wrote:
> but real HW (2 sockets OpenPOWER systems, garrison and firestone) does
> not expose the LPC bus on the second chip, should we do so in qemu ?

It's not so much HW as it is HostBoot. Not a huge deal.

Cheers,
Ben.
Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions
On Thu, Apr 06, 2017 at 04:23:19PM -0500, Eric Blake wrote:
> On 04/06/2017 04:15 PM, Richard W.M. Jones wrote:
> > > > +++ b/block/io.c
> > > > @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
> > > >      size_t skip_bytes;
> > > >      int ret;
> > > >
> > > > +    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1439922
> >
> > There is also a minimal reproducer in comment 2.
>
> If it helps, backtrace shows:
>
> #4  0x55c09f10 in bdrv_co_do_copy_on_readv (child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0) at block/io.c:948
> #5  0x55c0a33d in bdrv_aligned_preadv (child=0x56990180, req=0x7fffb6dffec0, offset=0, bytes=512, align=1, qiov=0x7fffd0c0, flags=1) at block/io.c:1058
> #6  0x55c0a8c3 in bdrv_co_preadv (child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0, flags=BDRV_REQ_COPY_ON_READ) at block/io.c:1166
> #7  0x55bf7ccf in blk_co_preadv (blk=0x56a2bc20, offset=0, bytes=512, qiov=0x7fffd0c0, flags=0) at block/block-backend.c:927
> #8  0x55bf7e19 in blk_read_entry (opaque=0x7fffd0e0) at block/block-backend.c:974
> #9  0x55cc3015 in coroutine_trampoline (i0=1485566720, i1=21845) at util/coroutine-ucontext.c:79

The patch that fixes this is simply:

diff --git a/block/io.c b/block/io.c
index 2709a7007f..4f1a730e45 100644
--- a/block/io.c
+++ b/block/io.c
@@ -945,8 +945,6 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
     size_t skip_bytes;
     int ret;
 
-    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
-
     /* Cover entire cluster so no additional backing file I/O is required when
      * allocating cluster in the image file. */

It has no ill effects that I can see, and fixes the libguestfs test suite, but I'm assuming the assert was added for a reason so I'm not proposing this as a patch.

Rich.
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
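For context, the assertion under discussion is a plain bitmask test on the child's granted permissions. A minimal standalone sketch — the BLK_PERM_* values here are meant to mirror QEMU's include/block/block.h, but treat the exact numbers as illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative permission bits; QEMU defines the real BLK_PERM_* set
 * in include/block/block.h. */
enum {
    BLK_PERM_CONSISTENT_READ = 0x01,
    BLK_PERM_WRITE           = 0x02,
    BLK_PERM_WRITE_UNCHANGED = 0x04,
};

/* The assertion boils down to this predicate: copy-on-read writes the
 * data back into the image, so the child must hold at least one of
 * the two write-class permissions before entering that path. */
static bool may_copy_on_read(unsigned perm)
{
    return (perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE)) != 0;
}
```

The backtrace above (blk_co_preadv with flags=BDRV_REQ_COPY_ON_READ) suggests the copy-on-read path is reachable from a user that was granted neither bit, which is why the assertion fires rather than indicating a genuine write-permission bug.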
Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions
On 04/06/2017 04:15 PM, Richard W.M. Jones wrote:
> > > +++ b/block/io.c
> > > @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
> > >      size_t skip_bytes;
> > >      int ret;
> > >
> > > +    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1439922
>
> There is also a minimal reproducer in comment 2.

If it helps, backtrace shows:

#4  0x55c09f10 in bdrv_co_do_copy_on_readv (child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0) at block/io.c:948
#5  0x55c0a33d in bdrv_aligned_preadv (child=0x56990180, req=0x7fffb6dffec0, offset=0, bytes=512, align=1, qiov=0x7fffd0c0, flags=1) at block/io.c:1058
#6  0x55c0a8c3 in bdrv_co_preadv (child=0x56990180, offset=0, bytes=512, qiov=0x7fffd0c0, flags=BDRV_REQ_COPY_ON_READ) at block/io.c:1166
#7  0x55bf7ccf in blk_co_preadv (blk=0x56a2bc20, offset=0, bytes=512, qiov=0x7fffd0c0, flags=0) at block/block-backend.c:927
#8  0x55bf7e19 in blk_read_entry (opaque=0x7fffd0e0) at block/block-backend.c:974
#9  0x55cc3015 in coroutine_trampoline (i0=1485566720, i1=21845) at util/coroutine-ucontext.c:79

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
[Qemu-devel] [PATCH] hw/arm/virt: generate 64-bit addressable ACPI objects
Our current ACPI table generation code limits the placement of ACPI tables to 32-bit addressable memory, in order to be able to emit the root pointer (RSDP) and root table (RSDT) using table types from the ACPI 1.0 days. Since ARM was not supported by ACPI before version 5.0, it makes sense to lift this restriction. This is not crucial for mach-virt, which is guaranteed to have some memory available below the 4 GB mark, but it is a nice-to-have for QEMU machines that do not have any 32-bit addressable memory, not unlike some real world 64-bit ARM systems. Since we already emit a version of the RSDP root pointer that carries a 64-bit wide address for the 64-bit root table (XSDT), all we need to do is replace the RSDT generation with the generation of an XSDT table, and use a different slot in the FADT table to refer to the DSDT. Note that this only modifies the private interaction between QEMU and UEFI. Since these tables are all handled by the generic AcpiTableDxe driver, which generates its own RSDP, RSDT and XSDT tables and manages the DSDT pointer in the FADT itself, the tables presented to the guest are identical, and so no mach-virt versioning is required for this change.
Signed-off-by: Ard Biesheuvel
---
 hw/arm/virt-acpi-build.c    | 55 ++---
 include/hw/acpi/acpi-defs.h | 11 +
 2 files changed, 53 insertions(+), 13 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index b173bd109b91..d18368094c00 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -391,12 +391,12 @@ static void acpi_dsdt_add_power_button(Aml *scope)
 
 /* RSDP */
 static GArray *
-build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned rsdt_tbl_offset)
+build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned xsdt_tbl_offset)
 {
     AcpiRsdpDescriptor *rsdp = acpi_data_push(rsdp_table, sizeof *rsdp);
-    unsigned rsdt_pa_size = sizeof(rsdp->rsdt_physical_address);
-    unsigned rsdt_pa_offset =
-        (char *)&rsdp->rsdt_physical_address - rsdp_table->data;
+    unsigned xsdt_pa_size = sizeof(rsdp->xsdt_physical_address);
+    unsigned xsdt_pa_offset =
+        (char *)&rsdp->xsdt_physical_address - rsdp_table->data;
 
     bios_linker_loader_alloc(linker, ACPI_BUILD_RSDP_FILE, rsdp_table, 16,
                              true /* fseg memory */);
@@ -408,8 +408,8 @@ build_rsdp(GArray *rsdp_table, BIOSLinker *linker, unsigned rsdt_tbl_offset)
 
     /* Address to be filled by Guest linker */
     bios_linker_loader_add_pointer(linker,
-        ACPI_BUILD_RSDP_FILE, rsdt_pa_offset, rsdt_pa_size,
-        ACPI_BUILD_TABLE_FILE, rsdt_tbl_offset);
+        ACPI_BUILD_RSDP_FILE, xsdt_pa_offset, xsdt_pa_size,
+        ACPI_BUILD_TABLE_FILE, xsdt_tbl_offset);
 
     /* Checksum to be filled by Guest linker */
     bios_linker_loader_add_checksum(linker, ACPI_BUILD_RSDP_FILE,
@@ -686,7 +686,7 @@ static void build_fadt(GArray *table_data, BIOSLinker *linker,
                        VirtMachineState *vms, unsigned dsdt_tbl_offset)
 {
     AcpiFadtDescriptorRev5_1 *fadt = acpi_data_push(table_data, sizeof(*fadt));
-    unsigned dsdt_entry_offset = (char *)&fadt->dsdt - table_data->data;
+    unsigned xdsdt_entry_offset = (char *)&fadt->Xdsdt - table_data->data;
     uint16_t bootflags;
 
     switch (vms->psci_conduit) {
@@ -712,7 +712,7 @@ static void build_fadt(GArray *table_data, BIOSLinker *linker,
 
     /* DSDT address to be filled by Guest linker */
     bios_linker_loader_add_pointer(linker,
-        ACPI_BUILD_TABLE_FILE, dsdt_entry_offset, sizeof(fadt->dsdt),
+        ACPI_BUILD_TABLE_FILE, xdsdt_entry_offset, sizeof(fadt->Xdsdt),
         ACPI_BUILD_TABLE_FILE, dsdt_tbl_offset);
 
     build_header(linker, table_data,
@@ -772,12 +772,41 @@ struct AcpiBuildState {
     bool patched;
 } AcpiBuildState;
 
+/* Build xsdt table */
+
+static
+void build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets,
+                const char *oem_id, const char *oem_table_id)
+{
+    int i;
+    unsigned xsdt_entries_offset;
+    AcpiXsdtDescriptorRev2 *xsdt;
+    const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len);
+    const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]);
+    const size_t xsdt_len = sizeof(*xsdt) + table_data_len;
+
+    xsdt = acpi_data_push(table_data, xsdt_len);
+    xsdt_entries_offset = (char *)xsdt->table_offset_entry - table_data->data;
+    for (i = 0; i < table_offsets->len; ++i) {
+        uint64_t ref_tbl_offset = g_array_index(table_offsets, uint32_t, i);
+        uint64_t xsdt_entry_offset = xsdt_entries_offset + xsdt_entry_size * i;
+
+        /* xsdt->table_offset_entry to be filled by Guest linker */
+        bios_linker_loader_add_pointer(linker,
+            ACPI_BUILD_TABLE_FILE, xsdt_entry_offset, xsdt_entry_size,
+            ACPI_BUILD_TABLE_FILE, ref_tbl_offset);
+    }
+    build_header(linker, table_data,
+                 (void *)xsdt,
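The size and offset arithmetic in build_xsdt() can be sanity-checked in isolation: per the ACPI specification, an XSDT is a standard 36-byte SDT header followed by packed 64-bit physical table addresses (the RSDT uses 4-byte entries, which is the source of the 32-bit placement limitation being lifted). A hedged sketch of that arithmetic, independent of QEMU's AcpiXsdtDescriptorRev2 definition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The standard ACPI SDT header (signature, length, revision, checksum,
 * OEM fields, etc.) is 36 bytes; the XSDT body is packed uint64_t
 * physical addresses, one per referenced table. */
#define ACPI_SDT_HEADER_LEN 36

/* Total XSDT length for n referenced tables. */
static size_t xsdt_total_len(size_t ntables)
{
    return ACPI_SDT_HEADER_LEN + sizeof(uint64_t) * ntables;
}

/* Byte offset of entry i within the XSDT: the slot the guest linker
 * patches with the 64-bit physical address of table i. */
static size_t xsdt_entry_offset(size_t i)
{
    return ACPI_SDT_HEADER_LEN + sizeof(uint64_t) * i;
}
```

This mirrors the loop in build_xsdt(): each bios_linker_loader_add_pointer() call targets xsdt_entry_offset(i) with an 8-byte-wide pointer, where the RSDT equivalent would have used 4-byte slots.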
Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions
On Thu, Apr 06, 2017 at 04:03:28PM -0500, Eric Blake wrote:
> On 04/06/2017 03:59 PM, Richard W.M. Jones wrote:
> > On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
> >> This adds assertions that ensure that the necessary write permissions
> >> have been granted before someone attempts to write to a node.
> >>
> >> Signed-off-by: Kevin Wolf
> >> Acked-by: Fam Zheng
> >> Reviewed-by: Max Reitz
> >> ---
> >>  block/io.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/block/io.c b/block/io.c
> >> index 2592ca1..4c79745 100644
> >> --- a/block/io.c
> >> +++ b/block/io.c
> >> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
> >>      size_t skip_bytes;
> >>      int ret;
> >>
> >> +    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
> >
> > This assertion is thrown in the libguestfs test suite. I filed a bug
> > about it against the Fedora package:
> >
> > https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html
>
> Wrong URL?

Ooops, the right link is:

https://bugzilla.redhat.com/show_bug.cgi?id=1439922

There is also a minimal reproducer in comment 2.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/
[Qemu-devel] [PATCH] event: Add signal information to SHUTDOWN
qemu_kill_report() is already able to tell whether a shutdown was triggered by guest action (no output) or by a host signal (a message about termination is printed via error_report); but this information is then lost. Libvirt would like to be able to distinguish between a SHUTDOWN event triggered solely by guest request and one triggered by a SIGTERM on the host. Enhance the SHUTDOWN event to pass the value of shutdown_signal through to the monitor client, suitably remapped into a platform-neutral string. Note that mingw lacks decent signal support, and will never report a signal because it never calls qemu_system_killed().

See also https://bugzilla.redhat.com/1384007

Signed-off-by: Eric Blake
---
 qapi/event.json | 20 +++-
 vl.c            | 21 ++---
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/qapi/event.json b/qapi/event.json
index e80f3f4..6aad475 100644
--- a/qapi/event.json
+++ b/qapi/event.json
@@ -5,11 +5,29 @@
 ##
 
 ##
+# @ShutdownSignal:
+#
+# The list of host signal types known to cause qemu to shut down a guest.
+#
+# @int: SIGINT
+# @hup: SIGHUP
+# @term: SIGTERM
+#
+# Since: 2.10
+##
+{ 'enum': 'ShutdownSignal', 'data': [ 'int', 'hup', 'term' ] }
+
+##
 # @SHUTDOWN:
 #
 # Emitted when the virtual machine has shut down, indicating that qemu is
 # about to exit.
 #
+# @signal: If present, the shutdown was (probably) triggered due to
+#     the receipt of the given signal in the host, rather than by a guest
+#     action (note that there is an inherent race with a guest choosing to
+#     shut down near the same time the host sends a signal). (since 2.10)
+#
 # Note: If the command-line option "-no-shutdown" has been specified, qemu will
 #       not exit, and a STOP event will eventually follow the SHUTDOWN event
 #
@@ -21,7 +39,7 @@
 #   "timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
 #
 ##
-{ 'event': 'SHUTDOWN' }
+{ 'event': 'SHUTDOWN', 'data': { '*signal': 'ShutdownSignal' } }
 
 ##
 # @POWERDOWN:
diff --git a/vl.c b/vl.c
index 0b4ed52..af29b2c 100644
--- a/vl.c
+++ b/vl.c
@@ -1626,9 +1626,23 @@ static int qemu_shutdown_requested(void)
     return atomic_xchg(&shutdown_requested, 0);
 }
 
-static void qemu_kill_report(void)
+static ShutdownSignal qemu_kill_report(void)
 {
+    ShutdownSignal ss = SHUTDOWN_SIGNAL__MAX;
     if (!qtest_driver() && shutdown_signal != -1) {
+        switch (shutdown_signal) {
+        case SIGINT:
+            ss = SHUTDOWN_SIGNAL_INT;
+            break;
+#ifdef SIGHUP
+        case SIGHUP:
+            ss = SHUTDOWN_SIGNAL_HUP;
+            break;
+#endif
+        case SIGTERM:
+            ss = SHUTDOWN_SIGNAL_TERM;
+            break;
+        }
         if (shutdown_pid == 0) {
             /* This happens for eg ^C at the terminal, so it's worth
              * avoiding printing an odd message in that case.
@@ -1644,6 +1658,7 @@ static void qemu_kill_report(void)
         }
         shutdown_signal = -1;
     }
+    return ss;
 }
 
 static int qemu_reset_requested(void)
@@ -1852,8 +1867,8 @@ static bool main_loop_should_exit(void)
         qemu_system_suspend();
     }
     if (qemu_shutdown_requested()) {
-        qemu_kill_report();
-        qapi_event_send_shutdown(&error_abort);
+        ShutdownSignal ss = qemu_kill_report();
+        qapi_event_send_shutdown(ss < SHUTDOWN_SIGNAL__MAX, ss, &error_abort);
         if (no_shutdown) {
             vm_stop(RUN_STATE_SHUTDOWN);
         } else {
-- 
2.9.3
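The signal-to-string remapping in qemu_kill_report() is a small lookup; it can be sketched standalone as below. The string values follow the ShutdownSignal enum in the patch; the function name here is ours, not QEMU's:

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Sketch of the remapping in qemu_kill_report(): host signal number ->
 * ShutdownSignal member name. Returns NULL when the signal is not in
 * the reported set; the event's optional 'signal' member is then
 * simply absent. */
static const char *shutdown_signal_name(int sig)
{
    switch (sig) {
    case SIGINT:
        return "int";
#ifdef SIGHUP               /* mingw has no SIGHUP */
    case SIGHUP:
        return "hup";
#endif
    case SIGTERM:
        return "term";
    default:
        return NULL;
    }
}
```

The NULL case corresponds to ss == SHUTDOWN_SIGNAL__MAX in the patch, where the has_signal flag passed to qapi_event_send_shutdown() is false.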
Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions
On 04/06/2017 03:59 PM, Richard W.M. Jones wrote:
> On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
>> This adds assertions that ensure that the necessary write permissions
>> have been granted before someone attempts to write to a node.
>>
>> Signed-off-by: Kevin Wolf
>> Acked-by: Fam Zheng
>> Reviewed-by: Max Reitz
>> ---
>>  block/io.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 2592ca1..4c79745 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
>>      size_t skip_bytes;
>>      int ret;
>>
>> +    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));
>
> This assertion is thrown in the libguestfs test suite. I filed a bug
> about it against the Fedora package:
>
> https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html

Wrong URL?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-devel] [PULL 43/46] block: Assertions for write permissions
On Tue, Feb 28, 2017 at 09:36:42PM +0100, Kevin Wolf wrote:
> This adds assertions that ensure that the necessary write permissions
> have been granted before someone attempts to write to a node.
>
> Signed-off-by: Kevin Wolf
> Acked-by: Fam Zheng
> Reviewed-by: Max Reitz
> ---
>  block/io.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/block/io.c b/block/io.c
> index 2592ca1..4c79745 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -945,6 +945,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
>      size_t skip_bytes;
>      int ret;
>
> +    assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));

This assertion is thrown in the libguestfs test suite. I filed a bug about it against the Fedora package:

https://lists.gnu.org/archive/html/qemu-block/2017-02/msg01305.html

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Apr 06, 2017 at 10:45:54PM +0300, Yuval Shaia wrote: > > Just add my 2 cents. You didn't answer on my question about other possible > > implementations. It can be SoftRoCE loopback optimizations, special ULP, > > RDMA transport, virtual driver with multiple VFs and single PF. > > Please see my response to Jason's comments - eventually, when a support for > VM to external host communication will be added - kdbr will become ULP as > well. So, is KDBR only to be used on the HV side? Ie it never shows up in the VM? That is even weirder, we certainly do not want to see a kernel RDMA ULP for any of this - the entire point of RDMA is to let user space implement their protocols without needing a unique kernel component!! Jason
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Thu, Apr 06, 2017 at 10:42:20PM +0300, Yuval Shaia wrote:
> > I'd rather see someone optimize the loopback path of soft roce than
> > see KDBR :)
>
> Can we assume that the optimized loopback path will be as fast as direct
> copy between one VM address space to another VM address space?

Well, you'd optimize it until it was a direct memory copy, so I think that is a reasonable starting assumption.

> > > 3. Our intention is for KDBR to be used in other contexts as well when we need
> > >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> > >    kind of requirement can be implemented inside SoftRoce as we don't see any
> > >    connection between them.
> >
> > KDBR looks like weak RDMA to me, so it is reasonable question why not
> > use full RDMA with loopback optimization instead of creating something
> > unique.
>
> True, KDBR exposes an RDMA-like API because its sole user is currently the
> pvrdma device. But, by design, it can be expanded to support other clients,
> for example a virtio device, which might have other attributes; can we
> expect the same from SoftRoCE?

RDMA handles all sorts of complex virtio-like protocols just fine. Unclear what 'other attributes' would be. Sounds like over designing??

> > IMHO, it also makes more sense for something like KDBR to live as a
> > RDMA transport, not as a unique char device, it is obviously very
> > RDMA-like.
>
> Can you elaborate more on this?
> What exactly will it solve?
> How will it be better than kdbr?

If you are going to do RDMA, then the uAPI for it from the kernel should be the RDMA subsystem, don't invent unique cdevs that overlap established kernel functionality without a very, very good reason.

> > .. and the char dev really can't be used when implementing user space
> > RDMA, that would just make a big mess..
>
> The position of kdbr is not to be a layer *between* user space and device -
> it is *the device* from the point of view of the process.

Any RDMA device built on top of kdbr certainly needs to support /dev/uverbs0 and all the usual RDMA stuff, so again, I fail to see the point of the special cdev.. Trying to mix /dev/uverbs0 and /dev/kdbr in your provider would be too goofy and weird.

> > But obviously if you connect pvrdma to real hardware then the page pin
> > comes back.
>
> The fact that page pinning is not needed with a Soft RoCE device but is
> needed with a real RoCE device is exactly where kdbr can help, as it
> isolates this fact from the user space process.

I don't see how KDBR helps at all. To do virtual RDMA you must transfer RDMA objects and commands unmodified from VM to HV and implement a fairly complicated SW stack inside the HV. Once you do that, micro-optimizing for same-machine VM-to-VM copy is not really such a big deal, IMHO.

The big challenge is keeping the real HW (or SoftRoCE) RDMA objects in sync with the VM ones and implementing some kind of RDMA-in-RDMA tunnel to enable migration when using today's HW offload. I see nothing in kdbr that helps with any of this. All it seems to do is obfuscate the transfer of RDMA objects and commands to the hypervisor, and make the transition of a RDMA channel from loopback to network far, far, more complicated.

> Sorry, we didn't mean "easy" but "simple", and simplest solutions
> are always preferred. IMHO, currently there is no good solution to
> do data copy between two VMs.

Don't confuse 'simple' with under featured. :)

> Can you comment on the second point - migration? Please note that we need
> it to work both with Soft RoCE and with a real device.

I don't see how kdbr helps with migration, you still have to setup the HW NIC and that needs sharing all the RDMA centric objects from VM to HV.

Jason
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Tue, Apr 04, 2017 at 08:33:49PM +0300, Leon Romanovsky wrote:
> I'm not going to repeat Jason's answer, I completely agree with him.
>
> Just add my 2 cents. You didn't answer on my question about other possible
> implementations. It can be SoftRoCE loopback optimizations, special ULP,
> RDMA transport, virtual driver with multiple VFs and single PF.

Please see my response to Jason's comments - eventually, when support for VM to external host communication is added, kdbr will become a ULP as well.

Marcel & Yuval
Re: [Qemu-devel] [PATCH RFC] hw/pvrdma: Proposal of a new pvrdma device
On Tue, Apr 04, 2017 at 10:01:55AM -0600, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2017 at 04:38:40PM +0300, Marcel Apfelbaum wrote:
>
> > Here are some thoughts regarding the Soft RoCE usage in our project.
> > We thought about using it as backend for QEMU pvrdma device but
> > we didn't see how it will support our requirements.
> >
> > 1. Does Soft RoCE support inter process (VM) fast path ? The KDBR
> >    removes the need for hw resources, emulated or not, concentrating
> >    on one copy from a VM to another.
>
> I'd rather see someone optimize the loopback path of soft roce than
> see KDBR :)

Can we assume that the optimized loopback path will be as fast as direct copy between one VM address space to another VM address space?

> > 3. Our intention is for KDBR to be used in other contexts as well when we need
> >    inter VM data exchange, e.g. backend for virtio devices. We didn't see how this
> >    kind of requirement can be implemented inside SoftRoce as we don't see any
> >    connection between them.
>
> KDBR looks like weak RDMA to me, so it is reasonable question why not
> use full RDMA with loopback optimization instead of creating something
> unique.

True, KDBR exposes an RDMA-like API because its sole user is currently the pvrdma device. But, by design, it can be expanded to support other clients, for example a virtio device, which might have other attributes; can we expect the same from SoftRoCE?

> IMHO, it also makes more sense for something like KDBR to live as a
> RDMA transport, not as a unique char device, it is obviously very
> RDMA-like.

Can you elaborate more on this? What exactly will it solve? How will it be better than kdbr? As we see it, kdbr, when expanded to support peers on external hosts, will be like a ULP.

> .. and the char dev really can't be used when implementing user space
> RDMA, that would just make a big mess..

The position of kdbr is not to be a layer *between* user space and device - it is *the device* from the point of view of the process.

> > 4. We don't want all the VM memory to be pinned since it disables
> >    memory-over-commit, which in turn will make the pvrdma device useless.
> >    We weren't sure how nicely Soft RoCE would play with memory pinning and
> >    we wanted more control on memory management. It may be a solvable issue,
> >    but combined with the others lead us to our decision to come up with our
> >    kernel bridge (char
>
> soft roce certainly can be optimized to remove the page pin and always
> run in an ODP-like mode.
>
> But obviously if you connect pvrdma to real hardware then the page pin
> comes back.

The fact that page pinning is not needed with a Soft RoCE device but is needed with a real RoCE device is exactly where kdbr can help, as it isolates this fact from the user space process.

> >    device or not, we went for it since it was the easiest to
> >    implement for a POC)
>
> I can see why it would be easy to implement, but not sure how this
> really improves the kernel..

Sorry, we didn't mean "easy" but "simple", and simplest solutions are always preferred. IMHO, currently there is no good solution to do data copy between two VMs.

> Jason

Can you comment on the second point - migration? Please note that we need it to work both with Soft RoCE and with a real device.

Marcel & Yuval
[Qemu-devel] [PATCH for-2.9] throttle: Remove block from group on hot-unplug
When a block device that is part of a throttle group is hot-unplugged, we forgot to remove it from the throttle group. This leaves stale memory around, and causes an easily reproducible crash:

$ ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -nographic -qmp stdio \
  -device virtio-scsi-pci,bus=pci.0 -drive \
  id=drive_image2,if=none,format=raw,file=file2,bps=512000,iops=100,group=foo \
  -device scsi-hd,id=image2,drive=drive_image2 -drive \
  id=drive_image3,if=none,format=raw,file=file3,bps=512000,iops=100,group=foo \
  -device scsi-hd,id=image3,drive=drive_image3
{'execute':'qmp_capabilities'}
{'execute':'device_del','arguments':{'id':'image3'}}
{'execute':'system_reset'}

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1428810
Suggested-by: Alberto Garcia
Signed-off-by: Eric Blake
---
 block/block-backend.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0b63773..d27c3a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -230,6 +230,9 @@ static void blk_delete(BlockBackend *blk)
     assert(!blk->refcnt);
     assert(!blk->name);
     assert(!blk->dev);
+    if (blk->public.throttle_state) {
+        blk_io_limits_disable(blk);
+    }
     if (blk->root) {
         blk_remove_bs(blk);
     }
-- 
2.9.3
Re: [Qemu-devel] [PATCH for-2.9] block: Ignore guest dev permissions during incoming migration
On Thu, Apr 06, 2017 at 07:59:53PM +0200, Kevin Wolf wrote:
> Usually guest devices don't like other writers to the same image, so
> they use blk_set_perm() to prevent this from happening. In the migration
> phase before the VM is actually running, though, they don't have a
> problem with writes to the image. On the other hand, storage migration
> needs to be able to write to the image in this phase, so the restrictive
> blk_set_perm() call of qdev devices breaks it.
>
> This patch flags all BlockBackends with a qdev device as
> blk->disable_perm during incoming migration, which means that the
> requested permissions are stored in the BlockBackend, but not actually
> applied to its root node yet.
>
> Once migration has finished and the VM should be resumed, the
> permissions are applied. If they cannot be applied (e.g. because the NBD
> server used for block migration hasn't been shut down), resuming the VM
> fails.
>
> Signed-off-by: Kevin Wolf
> Tested-by: Kashyap Chamarthy

Thanks. Posted the test evidence on your RFC thread:

http://lists.nongnu.org/archive/html/qemu-devel/2017-04/msg01086.html

[...]

-- 
/kashyap
Re: [Qemu-devel] help debugging throttle crash
On 04/03/2017 10:24 AM, Alberto Garcia wrote:
> On Mon, Apr 03, 2017 at 09:07:02AM -0500, Eric Blake wrote:
>
>> At this point, it looks like no one is calling
>> throttle_group_unregister_blk() as a result of the 'device_del',
>> which leaves stale memory around
>
> I see, I can also reproduce this very easily.
>
> I wonder if it's not enough to simply disable I/O limits when a
> BlockBackend is deleted?

Seems to pass my testing. Do you want to submit it as a formal patch, or shall I?

> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -230,6 +230,9 @@ static void blk_delete(BlockBackend *blk)
>      assert(!blk->refcnt);
>      assert(!blk->name);
>      assert(!blk->dev);
> +    if (blk->public.throttle_state) {
> +        blk_io_limits_disable(blk);
> +    }
>      if (blk->root) {
>          blk_remove_bs(blk);
>      }
>
> Berto

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH] kvm: pass the virtual SEI syndrome to guest OS
On 04/06/17 14:35, gengdongjiu wrote: > Dear, Laszlo >Thanks for your detailed explanation. > > On 2017/3/29 19:58, Laszlo Ersek wrote: >> (This ought to be one of the longest address lists I've ever seen :) >> Thanks for the CC. I'm glad Shannon is already on the CC list. For good >> measure, I'm adding MST and Igor.) >> >> On 03/29/17 12:36, Achin Gupta wrote: >>> Hi gengdongjiu, >>> >>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote: Hi Laszlo/Biesheuvel/Qemu developer, Now I encounter a issue and want to consult with you in ARM64 platform, as described below: when guest OS happen synchronous or asynchronous abort, kvm needs to send the error address to Qemu or UEFI through sigbus to dynamically generate APEI table. from my investigation, there are two ways: (1) Qemu get the error address, and generate the APEI table, then notify UEFI to know this generation, then inject abort error to guest OS, guest OS read the APEI table. (2) Qemu get the error address, and let UEFI to generate the APEI table, then inject abort error to guest OS, guest OS read the APEI table. >>> >>> Just being pedantic! I don't think we are talking about creating the APEI >>> table >>> dynamically here. The issue is: Once KVM has received an error that is >>> destined >>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the >>> error >>> into the guest OS, a CPER (Common Platform Error Record) has to be generated >>> corresponding to the error source (GHES corresponding to memory subsystem, >>> processor etc) to allow the guest OS to do anything meaningful with the >>> error. So who should create the CPER is the question. >>> >>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error >>> arrives >>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is >>> responsible for creating the CPER. ARM is experimenting with using a >>> Standalone >>> MM EDK2 image in the secure world to do the CPER creation. 
This will avoid >>> adding the same code in ARM TF in EL3 (better for security). The error will >>> then >>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM >>> Trusted >>> Firmware. >>> >>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1 >>> interface (as discussed with Christoffer below). So it should generate the >>> CPER >>> before injecting the error. >>> >>> This is corresponds to (1) above apart from notifying UEFI (I am assuming >>> you >>> mean guest UEFI). At this time, the guest OS already knows where to pick up >>> the >>> CPER from through the HEST. Qemu has to create the CPER and populate its >>> address >>> at the address exported in the HEST. Guest UEFI should not be involved in >>> this >>> flow. Its job was to create the HEST at boot and that has been done by this >>> stage. >>> >>> Qemu folk will be able to add but it looks like support for CPER generation >>> will >>> need to be added to Qemu. We need to resolve this. >>> >>> Do shout if I am missing anything above. >> >> After reading this email, the use case looks *very* similar to what >> we've just done with VMGENID for QEMU 2.9. >> >> We have a facility between QEMU and the guest firmware, called "ACPI >> linker/loader", with which QEMU instructs the firmware to >> >> - allocate and download blobs into guest RAM (AcpiNVS type memory) -- >> ALLOCATE command, >> >> - relocate pointers in those blobs, to fields in other (or the same) >> blobs -- ADD_POINTER command, >> >> - set ACPI table checksums -- ADD_CHECKSUM command, >> >> - and send GPAs of fields within such blobs back to QEMU -- >> WRITE_POINTER command. 
>> >> This is how I imagine we can map the facility to the current use case >> (note that this is the first time I read about HEST / GHES / CPER): >> >> etc/acpi/tables etc/hardware_errors >> == >> +---+ >> +--+ | address | +-> +--+ >> |HEST + | registers | | | Error Status | >> + ++ | +-+ | | Data Block 1 | >> | | GHES | --> | | address | + | ++ >> | | GHES | --> | | address | --+ | | CPER | >> | | GHES | --> | | address | + | | | CPER | >> | | GHES | --> | | address | -+ | | | | CPER | >> +-++ +-+-+ | | | +-++ >> | | | >> | | +---> +--+ >> | | | Error Status | >> | | | Data Block 2 | >> | | | ++ >>
[Qemu-devel] [PATCH v2] kvmvapic: Enable kvmvapic when necessary
If KVM provides VAPIC, don't set up kvmvapic.

Signed-off-by: Anthony Xu
---
 hw/intc/apic_common.c | 5 -
 include/sysemu/kvm.h  | 1 +
 kvm-all.c             | 5 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
index c3829e3..bf72107 100644
--- a/hw/intc/apic_common.c
+++ b/hw/intc/apic_common.c
@@ -317,8 +317,11 @@ static void apic_common_realize(DeviceState *dev, Error **errp)
     info = APIC_COMMON_GET_CLASS(s);
     info->realize(dev, errp);
 
-    /* Note: We need at least 1M to map the VAPIC option ROM */
+    /* Note: We need at least 1M to map the VAPIC option ROM,
+       if it is KVM, enable kvmvapic only when KVM doesn't have
+       VAPIC capability */
     if (!vapic && s->vapic_control & VAPIC_ENABLE_MASK &&
+        (!kvm_enabled() || (kvm_enabled() && !kvm_has_vapic())) &&
         !hax_enabled() && ram_size >= 1024 * 1024) {
         vapic = sysbus_create_simple("kvmvapic", -1, NULL);
     }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 24281fc..43e0e4c 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -215,6 +215,7 @@ extern KVMState *kvm_state;
 
 bool kvm_has_free_slot(MachineState *ms);
 int kvm_has_sync_mmu(void);
+int kvm_has_vapic(void);
 int kvm_has_vcpu_events(void);
 int kvm_has_robust_singlestep(void);
 int kvm_has_debugregs(void);
diff --git a/kvm-all.c b/kvm-all.c
index 90b8573..c331c8c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -2232,6 +2232,11 @@ int kvm_has_sync_mmu(void)
     return kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
 }
 
+int kvm_has_vapic(void)
+{
+    return !kvm_check_extension(kvm_state, KVM_CAP_VAPIC);
+}
+
 int kvm_has_vcpu_events(void)
 {
     return kvm_state->vcpu_events;
-- 
1.8.3.1
Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration
On Thu, Apr 06, 2017 at 07:26:15PM +0200, Kashyap Chamarthy wrote: > On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote: > > Usually guest devices don't like other writers to the same image, so > > they use blk_set_perm() to prevent this from happening. In the migration > > phase before the VM is actually running, though, they don't have a > > problem with writes to the image. On the other hand, storage migration > > needs to be able to write to the image in this phase, so the restrictive > > blk_set_perm() call of qdev devices breaks it. > > > > This patch flags all BlockBackends with a qdev device as > > blk->disable_perm during incoming migration, which means that the > > requested permissions are stored in the BlockBackend, but not actually > > applied to its root node yet. > > > > Once migration has finished and the VM should be resumed, the > > permissions are applied. If they cannot be applied (e.g. because the NBD > > server used for block migration hasn't been shut down), resuming the VM > > fails. > > > > Signed-off-by: Kevin Wolf> > --- > > block/block-backend.c | 40 +++- > > include/block/block.h | 2 ++ > > migration/migration.c | 8 > > qmp.c | 6 ++ > > 4 files changed, 55 insertions(+), 1 deletion(-) > > With your fix applied, now I don't see the original error ("error: > internal error: unable to execute QEMU command 'nbd-server-add': > Conflicts with use by drive-virtio-disk0 as 'root', which does not allow > 'write' on #block163"), and I can export a disk via `nbd-server-add` > with 'writeable' flag. > > However, with this fix, running `drive-mirror` seg-faults source QEMU. TL;DR: Kevin / Max pointed out that segmentation fault should be fixed by supplying a 'job-id' parameter to `drive-mirror`. (Longer version, refer below.) 
So: Tested-by: Kashyap Chamarthy > Reproducer: > > (1) On destination QEMU > > $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \ > -display none -nodefconfig -nodefaults -m 512 \ > -blockdev > node-name=bar,driver=qcow2,file.driver=file,file.filename=./dst-disk.qcow2 \ > -serial unix:/tmp/monitor,server,nowait \ > -incoming tcp:localhost: -qmp stdio > {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, > "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}} > { "execute": "qmp_capabilities" } > {"return": {}} > { "execute": "nbd-server-start", "arguments": { "addr": { "type": > "inet","data": { "host": "localhost", "port": "" } } } } > {"return": {}} > { "execute": "nbd-server-add", "arguments": { "device": "bar","writable": > true } } > {"return": {}} > /home/stack/src/qemu/nbd/server.c:nbd_receive_request():L706: read failed > > > (2) On source QEMU: > > $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \ > -display none -nodefconfig -nodefaults -m 512 \ > -device virtio-scsi-pci,id=scsi -device virtio-serial-pci \ > -blockdev > node-name=foo,driver=qcow2,file.driver=file,file.filename=./cirros-0.3.5.qcow2 > -qmp stdio > {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, > "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}} > { "execute": "qmp_capabilities" } > {"return": {}} > { "execute": "drive-mirror", "arguments": { "device": "foo", "target": > "nbd:localhost::exportname=bar", "sync": "full","format": "raw", "mode": > "existing" } } > Segmentation fault (core dumped) Okay, there was a missing piece: I was pointed out on IRC that the `drive-mirror` QMP command was missing the 'job-id' (introduced by commit: 71aa986 - "mirror: Add 'job-id' parameter to 'blockdev-mirror' and 'drive-mirror'") parameter. Kevin tells me that this is mandatory _if_ you're using 'node-name' (which I was, in my '-blockdev' command-line above). 
And Max points out, the above should be caught by his:

https://lists.nongnu.org/archive/html/qemu-block/2017-04/msg00059.html
-- [PATCH for-2.9 0/2] block/mirror: Fix use-after-free

And sure enough, adding the 'job-id' to `drive-mirror` on source will
let `drive-mirror` proceed successfully:

[...]
{ "execute": "drive-mirror", "arguments": { "device": "foo", "job-id": "job-0", "target": "nbd:localhost::exportname=bar", "sync": "full", "format": "raw", "mode": "existing" } }
{"return": {}}
{"timestamp": {"seconds": 1491498318, "microseconds": 435816}, "event": "BLOCK_JOB_READY", "data": {"device": "job-0", "len": 41126400, "offset": 41126400, "speed": 0, "type": "mirror"}}

Issue a `block-job-complete` (or `block-job-cancel`) to end the running
mirror job:

[...]
{"execute": "block-job-complete", "arguments": {"device": "job-0"} }
{"return": {}}
{"timestamp": {"seconds": 1491500183, "microseconds": 768746}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job-0", "len": 41126400, "offset": 41126400, "speed": 0, "type": "mirror"}}

Then,
Re: [Qemu-devel] Fix build break during configuration on musl-libc based Linux systems.
On 6 April 2017 at 19:15, Rainer Müller wrote:
> A bit late to this thread, but the original problem was also reported
> for Mac OS X with --enable-curses in MacPorts. The build fails with the
> same symptoms as in the original report.
>
> https://trac.macports.org/ticket/53929
>
> As you identified, the problem is that ncurses expects the define
> _XOPEN_SOURCE >= 500 to enable the wide-char function declarations.
>
> The solution to retain access to non-standard API on Mac OS X would be
> to also define _DARWIN_C_SOURCE which enables extensions.

Thanks for the report. I have a feeling that fixing this bug got lost
somewhere -- there was a long thread about it but I don't think an
actual patch ever came out of it :-(

At this point (very near to release of 2.9) I think I'd rather postpone
fixing this until after release. It should be possible to work around it
with an --extra-cflags configure setting, and our current behaviour is
something we've had for many releases now.

thanks
-- PMM
Re: [Qemu-devel] [PATCH 1/1] qemu-img: img_create does not support image-opts, fix docs
On 06.04.2017 19:45, Jeff Cody wrote:
> The documentation and help for qemu-img claims that 'qemu-img create'
> will take the '--image-opts' argument. This is not true, so this
> patch removes those claims.
>
> Signed-off-by: Jeff Cody

Thanks, applied to the block branch.

Kevin
Re: [Qemu-devel] [PATCH]Enable kvmvapic only when host doesn't support VAPIC capability in KVM mode
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -2232,6 +2232,10 @@ int kvm_has_sync_mmu(void)
> >      return kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
> >  }
> >
> > +int kvm_has_vapic(void){
> > +    return !kvm_check_extension(kvm_state, KVM_CAP_VAPIC);
> > +}
> > +
>
> Is that function shouldn't return true if KVM is providing VAPIC
> capability?

It should, but it doesn't. See the KVM kernel code segment below; it
returns false if KVM provides VAPIC.

Thanks for your comments, will resend the patch.

Anthony

arch/x86/kvm/x86.c:

    case KVM_CAP_VAPIC:
        r = !kvm_x86_ops->cpu_has_accelerated_tpr();
        break;
Re: [Qemu-devel] Fix build break during configuration on musl-libc based Linux systems.
On 2017-02-17 17:57, Peter Maydell wrote:
> On 17 February 2017 at 11:20, Paolo Bonzini wrote:
>>
>> On 17/02/2017 11:18, Peter Maydell wrote:
>>> Defining _XOPEN_SOURCE is easy enough, and I think we should
>>> do it unconditionally. We should check what effect this has
>>> on the BSD hosts though I guess. (You could argue that we
>>> should be defining _XOPEN_SOURCE anyway for the benefit of
>>> the non-glibc BSD/Solaris/etc platforms.)
>>
>> Sounds good, then I think we should define it to 700 just like glibc does.
>
> Unfortunately this idea turns out to break OSX compiles,
> because on OSX saying _XOPEN_SOURCE=anything disables
> all the non-X/Open APIs (which you get by default, and
> some of which like mkdtemp we use).

A bit late to this thread, but the original problem was also reported
for Mac OS X with --enable-curses in MacPorts. The build fails with the
same symptoms as in the original report.

https://trac.macports.org/ticket/53929

As you identified, the problem is that ncurses expects the define
_XOPEN_SOURCE >= 500 to enable the wide-char function declarations.

The solution to retain access to non-standard API on Mac OS X would be
to also define _DARWIN_C_SOURCE which enables extensions.

$ cat foo.c
#include <unistd.h>

int main() {
    mkdtemp("/tmp/test-XX");
}
$ cc -D_XOPEN_SOURCE=500 -c foo.c
foo.c:4:5: warning: implicit declaration of function 'mkdtemp' is invalid in C99 [-Wimplicit-function-declaration]
    mkdtemp("/tmp/test-XX");
    ^
1 warning generated.
$ cc -D_XOPEN_SOURCE=500 -D_DARWIN_C_SOURCE -c foo.c
$

A quick test on current master with configure patched to define

    QEMU_CFLAGS="-D_XOPEN_SOURCE=500 -D_DARWIN_C_SOURCE $QEMU_CFLAGS"

compiled fine for both a default configure and with --enable-curses.

Rainer
Re: [Qemu-devel] [PATCH for-2.9 1/1] qemu-img: img_create does not support image-opts, fix docs
On 04/06/2017 12:45 PM, Jeff Cody wrote:
> The documentation and help for qemu-img claims that 'qemu-img create'
> will take the '--image-opts' argument. This is not true, so this
> patch removes those claims.
>
> Signed-off-by: Jeff Cody
> ---
>  qemu-img-cmds.hx | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Doc fix, worth having in 2.9 if we still have time.

Reviewed-by: Eric Blake

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [Qemu-ppc] [PATCH RFC v1 0/3] Enable MTTCG on PPC64
On Apr 6, 2017, at 1:08 PM, Luigi Burdo wrote:

> Hi, I can help test it too on my two Be machines, if someone can point
> me at where the patches are or where I can download the commits.
>
> Thanks,
> Luigi

Here are the patches:
1/3 https://patchwork.ozlabs.org/patch/747691/
2/3 https://patchwork.ozlabs.org/patch/747692/
3/3 https://patchwork.ozlabs.org/patch/747688/
[Qemu-devel] [PATCH for-2.9] block: Ignore guest dev permissions during incoming migration
Usually guest devices don't like other writers to the same image, so
they use blk_set_perm() to prevent this from happening. In the migration
phase before the VM is actually running, though, they don't have a
problem with writes to the image. On the other hand, storage migration
needs to be able to write to the image in this phase, so the restrictive
blk_set_perm() call of qdev devices breaks it.

This patch flags all BlockBackends with a qdev device as
blk->disable_perm during incoming migration, which means that the
requested permissions are stored in the BlockBackend, but not actually
applied to its root node yet.

Once migration has finished and the VM should be resumed, the
permissions are applied. If they cannot be applied (e.g. because the NBD
server used for block migration hasn't been shut down), resuming the VM
fails.

Signed-off-by: Kevin Wolf
Tested-by: Kashyap Chamarthy
---
RFC -> v1:
- s/blk_next/blk_all_next/g [Max]

 block/block-backend.c | 40 +++-
 include/block/block.h |  2 ++
 migration/migration.c |  8 
 qmp.c                 |  6 ++
 4 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 0b63773..18ece99 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -61,6 +61,7 @@ struct BlockBackend {
 
     uint64_t perm;
     uint64_t shared_perm;
+    bool disable_perm;
 
     bool allow_write_beyond_eof;
 
@@ -578,7 +579,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, uint64_t shared_perm,
 {
     int ret;
 
-    if (blk->root) {
+    if (blk->root && !blk->disable_perm) {
         ret = bdrv_child_try_set_perm(blk->root, perm, shared_perm, errp);
         if (ret < 0) {
             return ret;
@@ -597,15 +598,52 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, uint64_t *shared_perm)
     *shared_perm = blk->shared_perm;
 }
 
+/*
+ * Notifies the user of all BlockBackends that migration has completed. qdev
+ * devices can tighten their permissions in response (specifically revoke
+ * shared write permissions that we needed for storage migration).
+ *
+ * If an error is returned, the VM cannot be allowed to be resumed.
+ */
+void blk_resume_after_migration(Error **errp)
+{
+    BlockBackend *blk;
+    Error *local_err = NULL;
+
+    for (blk = blk_all_next(NULL); blk; blk = blk_all_next(blk)) {
+        if (!blk->disable_perm) {
+            continue;
+        }
+
+        blk->disable_perm = false;
+
+        blk_set_perm(blk, blk->perm, blk->shared_perm, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            blk->disable_perm = true;
+            return;
+        }
+    }
+}
+
 static int blk_do_attach_dev(BlockBackend *blk, void *dev)
 {
     if (blk->dev) {
         return -EBUSY;
     }
+
+    /* While migration is still incoming, we don't need to apply the
+     * permissions of guest device BlockBackends. We might still have a block
+     * job or NBD server writing to the image for storage migration. */
+    if (runstate_check(RUN_STATE_INMIGRATE)) {
+        blk->disable_perm = true;
+    }
+
     blk_ref(blk);
     blk->dev = dev;
     blk->legacy_dev = false;
     blk_iostatus_reset(blk);
+
     return 0;
 }

diff --git a/include/block/block.h b/include/block/block.h
index 5149260..3e09222 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -366,6 +366,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
 void bdrv_invalidate_cache_all(Error **errp);
 int bdrv_inactivate_all(void);
 
+void blk_resume_after_migration(Error **errp);
+
 /* Ensure contents are flushed to disk. */
 int bdrv_flush(BlockDriverState *bs);
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs);

diff --git a/migration/migration.c b/migration/migration.c
index 54060f7..ad4036f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -349,6 +349,14 @@ static void process_incoming_migration_bh(void *opaque)
         exit(EXIT_FAILURE);
     }
 
+    /* If we get an error here, just don't restart the VM yet. */
+    blk_resume_after_migration(&local_err);
+    if (local_err) {
+        error_free(local_err);
+        local_err = NULL;
+        autostart = false;
+    }
+
     /*
      * This must happen after all error conditions are dealt with and
      * we're sure the VM is going to be running on this host.

diff --git a/qmp.c b/qmp.c
index fa82b59..a744e44 100644
--- a/qmp.c
+++ b/qmp.c
@@ -207,6 +207,12 @@ void qmp_cont(Error **errp)
         }
     }
 
+    blk_resume_after_migration(&local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
     if (runstate_check(RUN_STATE_INMIGRATE)) {
         autostart = 1;
     } else {
-- 
1.8.3.1
Re: [Qemu-devel] is qemu able to run multiple system emulation simultaneously
On Apr 6, 2017, at 12:00 PM, qemu-devel-requ...@nongnu.org wrote:

> Dear QEMU developers,
>
> I need multiple ARM-based hardware emulations running simultaneously in
> one PC. I wonder if QEMU is capable of running two system emulations at
> the same time, or whether I have to use two QEMUs for two system
> emulations. If the former is possible, where can I find some docs about
> it? I have searched for two days and the result is not positive. Please
> help. Thanks in advance.
>
> Best regards,
> Jiahuan

My vote goes to running multiple instances of QEMU at the same time. It
is easy to do and works.
Re: [Qemu-devel] [PATCH for-2.9 0/2] block/mirror: Fix use-after-free
On 03.04.2017 19:51, Max Reitz wrote:
> And the exciting 2.9 ride continues!
>
> When mirroring from a BDS with no parents at all (such as those added
> with -blockdev or blockdev-add), we have a use-after-free in mirror's
> error path. The first patch of this series fixes that, the other adds a
> patch so we don't regress.
>
> What issue will we find next? Stay tuned!

Thanks, applied to the block branch.

Kevin
[Qemu-devel] [PATCH 1/1] qemu-img: img_create does not support image-opts, fix docs
The documentation and help for qemu-img claims that 'qemu-img create'
will take the '--image-opts' argument. This is not true, so this
patch removes those claims.

Signed-off-by: Jeff Cody
---
 qemu-img-cmds.hx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 9c9702c..8ac7822 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -22,9 +22,9 @@ STEXI
 ETEXI
 
 DEF("create", img_create,
-    "create [-q] [--object objectdef] [--image-opts] [-f fmt] [-o options] filename [size]")
+    "create [-q] [--object objectdef] [-f fmt] [-o options] filename [size]")
 STEXI
-@item create [--object @var{objectdef}] [--image-opts] [-q] [-f @var{fmt}] [-o @var{options}] @var{filename} [@var{size}]
+@item create [--object @var{objectdef}] [-q] [-f @var{fmt}] [-o @var{options}] @var{filename} [@var{size}]
 ETEXI
 
 DEF("commit", img_commit,
-- 
2.9.3
Re: [Qemu-devel] EXT :Re: Emulating external registers
On 6 April 2017 at 18:23, Wu, Michael Y [US] (MS) wrote:
> I changed my code to the following and added the appropriate read, write
> functions and a MemoryRegionOps.
>
>     memory_region_init_io(reg_memory, NULL, &reg_ops, reg,
>                           "reg_mem", 0x0040); // set to 64 bytes
>     memory_region_add_subregion(sysmem, 0xFC00, reg_memory);
>
> For the read function I just returned a zero. So if I were to read from the
> address 0xFC00 it should return a value of 0? The current issue I am
> having is that gdb hangs when the pointer is accessed. I am starting to think
> my bare-metal program is incorrect. I also added log messages in my read and
> write functions. The read function was not accessed.

You'll probably find that what has happened is that your program has
taken an exception which you don't have an exception handler installed
for, and has then jumped off into nowhere or gone into an infinite loop
of taking exceptions. (Probably hitting ^c in gdb will break into
wherever it is.) It's a good idea in bare metal test code to write at
least a minimal set of exception handlers that can print a message when
an unexpected exception occurs and stop, so you don't get too confused.

You might also want to investigate QEMU's tracing:

    -d in_asm,exec,cpu,int,nochain -D debug.log

will write tracing to the debug.log file (a lot of it, and this set of
trace flags slows down execution a lot, but if you're doing very small
bare metal tests that should be ok). This can help in figuring out what
your test program is doing. (Watch out that the in_asm shows when we
first encounter a block of code, but if we execute the same bit of code
a second time we'll only print the exec and cpu parts of the logging.)

thanks
-- PMM
Re: [Qemu-devel] [PATCH for-2.9 2/2] commit: Set commit_top_bs->total_sectors
On 04/06/2017 12:14 PM, Kevin Wolf wrote:
> Like in the mirror filter driver, we also need to set the image size for
> the commit filter driver. This is less likely to be a problem in
> practice than for the mirror because we're not at the active layer here,
> but attaching new parents to a node in the middle of the chain is
> possible, so the size needs to be correct anyway.
>
> Signed-off-by: Kevin Wolf
> ---
>  block/commit.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake

> diff --git a/block/commit.c b/block/commit.c
> index 4c38220..91d2c34 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
>      if (commit_top_bs == NULL) {
>          goto fail;
>      }
> +    commit_top_bs->total_sectors = top->total_sectors;

[I'd really like to change total_sectors to bytes in 2.10 or later, but
that's a much hairier mess to untangle. I've got some patches that I
hope to post soon that at least start the process with regards to
bdrv_is_allocated...]

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH for-2.9 1/2] commit: Set commit_top_bs->aio_context
On 04/06/2017 12:14 PM, Kevin Wolf wrote:
> The filter driver that is inserted by the commit job needs to use the
> same AioContext as its parent and child nodes.
>
> Signed-off-by: Kevin Wolf
> ---
>  block/commit.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Eric Blake

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration
On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote: > Usually guest devices don't like other writers to the same image, so > they use blk_set_perm() to prevent this from happening. In the migration > phase before the VM is actually running, though, they don't have a > problem with writes to the image. On the other hand, storage migration > needs to be able to write to the image in this phase, so the restrictive > blk_set_perm() call of qdev devices breaks it. > > This patch flags all BlockBackends with a qdev device as > blk->disable_perm during incoming migration, which means that the > requested permissions are stored in the BlockBackend, but not actually > applied to its root node yet. > > Once migration has finished and the VM should be resumed, the > permissions are applied. If they cannot be applied (e.g. because the NBD > server used for block migration hasn't been shut down), resuming the VM > fails. > > Signed-off-by: Kevin Wolf> --- > block/block-backend.c | 40 +++- > include/block/block.h | 2 ++ > migration/migration.c | 8 > qmp.c | 6 ++ > 4 files changed, 55 insertions(+), 1 deletion(-) With your fix applied, now I don't see the original error ("error: internal error: unable to execute QEMU command 'nbd-server-add': Conflicts with use by drive-virtio-disk0 as 'root', which does not allow 'write' on #block163"), and I can export a disk via `nbd-server-add` with 'writeable' flag. However, with this fix, running `drive-mirror` seg-faults source QEMU. 
Reproducer: (1) On destination QEMU $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \ -display none -nodefconfig -nodefaults -m 512 \ -blockdev node-name=bar,driver=qcow2,file.driver=file,file.filename=./dst-disk.qcow2 \ -serial unix:/tmp/monitor,server,nowait \ -incoming tcp:localhost: -qmp stdio {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}} { "execute": "qmp_capabilities" } {"return": {}} { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet","data": { "host": "localhost", "port": "" } } } } {"return": {}} { "execute": "nbd-server-add", "arguments": { "device": "bar","writable": true } } {"return": {}} /home/stack/src/qemu/nbd/server.c:nbd_receive_request():L706: read failed (2) On source QEMU: $ /home/stack/build/build-qemu/x86_64-softmmu/qemu-system-x86_64 \ -display none -nodefconfig -nodefaults -m 512 \ -device virtio-scsi-pci,id=scsi -device virtio-serial-pci \ -blockdev node-name=foo,driver=qcow2,file.driver=file,file.filename=./cirros-0.3.5.qcow2 -qmp stdio {"QMP": {"version": {"qemu": {"micro": 93, "minor": 8, "major": 2}, "package": " (v2.9.0-rc3-3-g3a8624b)"}, "capabilities": []}} { "execute": "qmp_capabilities" } {"return": {}} { "execute": "drive-mirror", "arguments": { "device": "foo", "target": "nbd:localhost::exportname=bar", "sync": "full","format": "raw", "mode": "existing" } } Segmentation fault (core dumped) [...] -- /kashyap
Re: [Qemu-devel] EXT :Re: Emulating external registers
Thank you Peter!

I changed my code to the following and added the appropriate read, write
functions and a MemoryRegionOps.

    memory_region_init_io(reg_memory, NULL, &reg_ops, reg,
                          "reg_mem", 0x0040); // set to 64 bytes
    memory_region_add_subregion(sysmem, 0xFC00, reg_memory);

For the read function I just returned a zero. So if I were to read from
the address 0xFC00 it should return a value of 0? The current issue I am
having is that gdb hangs when the pointer is accessed. I am starting to
think my bare-metal program is incorrect. I also added log messages in
my read and write functions. The read function was not accessed.

    // Some code in my bare metal program
    volatile unsigned int * const test = (unsigned int *) 0xFC00;
    if (0x0 == *test) // gdb hangs on this line, cannot step further
    {
        read_hit();
    }

The file 'unimp.c' you mentioned looks great! I will be using that as
inspiration for creating my own device model.

-----Original Message-----
From: Peter Maydell [mailto:peter.mayd...@linaro.org]
Sent: Thursday, April 06, 2017 1:33 AM
To: Wu, Michael Y [US] (MS)
Cc: qemu-devel@nongnu.org
Subject: EXT :Re: [Qemu-devel] Emulating external registers

On 5 April 2017 at 22:03, Wu, Michael Y [US] (MS) wrote:
> My current approach is to create a new MemoryRegion in the init function of
> the g3beige source code (mac_oldworld.c). My idea is to set up some mmio to
> certain addresses to represent those registers. For now I only care about
> reading and writing, but in the future I hope to add additional logic when
> the registers are written.
> I added the following lines:
>     MemoryRegion *reg = g_new(MemoryRegion, 1);
>     memory_region_init_alias(reg, NULL, "reg_mmio",
>                              get_system_io(), 0, 0x0020);

This is defining your region to be an alias into the system IO region,
so reading/writing your region will act like reads and writes into those
IO ports. This doesn't sound like what you were trying to do.

Usually memory regions for registers are defined using
memory_region_init_io(), where you pass in a structure that includes
pointers to functions which are called for reads and writes.

> memory_region_add_subregion(sysmem, 0xfc00, reg);

Typically you would create a device model for whatever this device is,
and then map it in the board code, not directly create a memory region
in the board code. You might find it useful to look at hw/misc/unimp.c,
which is a device which simply prints log messages when it is read or
written. Or try a simple device that your board already has.

thanks
-- PMM
Re: [Qemu-devel] [Qemu-block] [RFC PATCH for-2.9] block: Ignore guest dev permissions during incoming migration
On Tue, Apr 04, 2017 at 05:35:56PM +0200, Kevin Wolf wrote:
> Usually guest devices don't like other writers to the same image, so
> they use blk_set_perm() to prevent this from happening. In the migration
> phase before the VM is actually running, though, they don't have a
> problem with writes to the image. On the other hand, storage migration
> needs to be able to write to the image in this phase, so the restrictive
> blk_set_perm() call of qdev devices breaks it.
>
> This patch flags all BlockBackends with a qdev device as
> blk->disable_perm during incoming migration, which means that the
> requested permissions are stored in the BlockBackend, but not actually
> applied to its root node yet.
>
> Once migration has finished and the VM should be resumed, the
> permissions are applied. If they cannot be applied (e.g. because the NBD
> server used for block migration hasn't been shut down), resuming the VM
> fails.

So I have an environment with a patched QEMU built with your fix to test
it with libvirt APIs, however, there's a libvirt bug that I just
discovered which fails the NBD-based live storage migration:

https://www.redhat.com/archives/libvir-list/2017-April/msg00350.html --
NBD-based storage migration fails with "error: invalid argument: monitor
must not be NULL"

Meanwhile, I'm about to test it with plain QMP.
> Signed-off-by: Kevin Wolf> --- > block/block-backend.c | 40 +++- > include/block/block.h | 2 ++ > migration/migration.c | 8 > qmp.c | 6 ++ > 4 files changed, 55 insertions(+), 1 deletion(-) > > diff --git a/block/block-backend.c b/block/block-backend.c > index 0b63773..f817040 100644 > --- a/block/block-backend.c > +++ b/block/block-backend.c > @@ -61,6 +61,7 @@ struct BlockBackend { > > uint64_t perm; > uint64_t shared_perm; > +bool disable_perm; > > bool allow_write_beyond_eof; > > @@ -578,7 +579,7 @@ int blk_set_perm(BlockBackend *blk, uint64_t perm, > uint64_t shared_perm, > { > int ret; > > -if (blk->root) { > +if (blk->root && !blk->disable_perm) { > ret = bdrv_child_try_set_perm(blk->root, perm, shared_perm, errp); > if (ret < 0) { > return ret; > @@ -597,15 +598,52 @@ void blk_get_perm(BlockBackend *blk, uint64_t *perm, > uint64_t *shared_perm) > *shared_perm = blk->shared_perm; > } > > +/* > + * Notifies the user of all BlockBackends that migration has completed. qdev > + * devices can tighten their permissions in response (specifically revoke > + * shared write permissions that we needed for storage migration). > + * > + * If an error is returned, the VM cannot be allowed to be resumed. > + */ > +void blk_resume_after_migration(Error **errp) > +{ > +BlockBackend *blk; > +Error *local_err = NULL; > + > +for (blk = blk_next(NULL); blk; blk = blk_next(blk)) { > +if (!blk->disable_perm) { > +continue; > +} > + > +blk->disable_perm = false; > + > +blk_set_perm(blk, blk->perm, blk->shared_perm, _err); > +if (local_err) { > +error_propagate(errp, local_err); > +blk->disable_perm = true; > +return; > +} > +} > +} > + > static int blk_do_attach_dev(BlockBackend *blk, void *dev) > { > if (blk->dev) { > return -EBUSY; > } > + > +/* While migration is still incoming, we don't need to apply the > + * permissions of guest device BlockBackends. We might still have a block > + * job or NBD server writing to the image for storage migration. 
*/ > +if (runstate_check(RUN_STATE_INMIGRATE)) { > +blk->disable_perm = true; > +} > + > blk_ref(blk); > blk->dev = dev; > blk->legacy_dev = false; > blk_iostatus_reset(blk); > + > return 0; > } > > diff --git a/include/block/block.h b/include/block/block.h > index 5149260..3e09222 100644 > --- a/include/block/block.h > +++ b/include/block/block.h > @@ -366,6 +366,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error > **errp); > void bdrv_invalidate_cache_all(Error **errp); > int bdrv_inactivate_all(void); > > +void blk_resume_after_migration(Error **errp); > + > /* Ensure contents are flushed to disk. */ > int bdrv_flush(BlockDriverState *bs); > int coroutine_fn bdrv_co_flush(BlockDriverState *bs); > diff --git a/migration/migration.c b/migration/migration.c > index 54060f7..ad4036f 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -349,6 +349,14 @@ static void process_incoming_migration_bh(void *opaque) > exit(EXIT_FAILURE); > } > > +/* If we get an error here, just don't restart the VM yet. */ > +blk_resume_after_migration(_err); > +if (local_err) { > +error_free(local_err); > +local_err = NULL; > +autostart = false; > +} > + > /* > * This must happen after all error conditions are
[Qemu-devel] [PATCH for-2.9 0/2] commit: Fix commit_top_bs problems
The same problem that Fam's patch "mirror: Fix aio context of
mirror_top_bs" fixes for the mirror block job also exists for the commit
block job and the 'commit' HMP command. While comparing the respective
places in mirror.c and commit.c, I also noticed that commit neglects to
set an image size. This series fixes both problems.

Kevin Wolf (2):
  commit: Set commit_top_bs->aio_context
  commit: Set commit_top_bs->total_sectors

 block/commit.c | 3 +++
 1 file changed, 3 insertions(+)

-- 
1.8.3.1
[Qemu-devel] [PATCH for-2.9 1/2] commit: Set commit_top_bs->aio_context
The filter driver that is inserted by the commit job needs to use the
same AioContext as its parent and child nodes.

Signed-off-by: Kevin Wolf
---
 block/commit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/commit.c b/block/commit.c
index 2832482..4c38220 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     if (commit_top_bs == NULL) {
         goto fail;
     }
+    bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(top));
 
     bdrv_set_backing_hd(commit_top_bs, top, &local_err);
     if (local_err) {
@@ -482,6 +483,7 @@ int bdrv_commit(BlockDriverState *bs)
         error_report_err(local_err);
         goto ro_cleanup;
     }
+    bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(backing_file_bs));
 
     bdrv_set_backing_hd(commit_top_bs, backing_file_bs, &error_abort);
     bdrv_set_backing_hd(bs, commit_top_bs, &error_abort);
-- 
1.8.3.1
[Qemu-devel] [PATCH for-2.9 2/2] commit: Set commit_top_bs->total_sectors
Like in the mirror filter driver, we also need to set the image size for
the commit filter driver. This is less likely to be a problem in
practice than for the mirror because we're not at the active layer here,
but attaching new parents to a node in the middle of the chain is
possible, so the size needs to be correct anyway.

Signed-off-by: Kevin Wolf
---
 block/commit.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/commit.c b/block/commit.c
index 4c38220..91d2c34 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -335,6 +335,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
     if (commit_top_bs == NULL) {
         goto fail;
     }
+    commit_top_bs->total_sectors = top->total_sectors;
     bdrv_set_aio_context(commit_top_bs, bdrv_get_aio_context(top));
 
     bdrv_set_backing_hd(commit_top_bs, top, &local_err);
-- 
1.8.3.1
[Qemu-devel] [Bug 1635339] Re: qxl_pre_save assertion failure on vm "save"
I'm running into this issue as well: Arch Linux Qemu 2.8.0 spice guest tools: 0.132 QXL driver version (as reported by Windows Device Manager): 10.0.0.15000 Everything else works great. It would save me a lot of rebooting if this could get fixed. If there is anything I can do or try, I'll be glad to help. Relevant log of VM boot and crash on selecting suspend action in virt- manager: 2017-04-06 16:59:24.681+: starting up libvirt version: 3.1.0, qemu version: 2.8.0, hostname: arch-vaio.localdomain LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=spice /usr/bin/qemu-system-x86_64 -name guest=Windows,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-9-Windows/master-key.aes -machine pc-i440fx-2.7,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu core2duo,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff -m 2048 -realtime mlock=off -smp 2,sockets=1,cores=2,threads=1 -uuid f14684d3-5f81-4743-8512-e516d85ca2c9 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-9-Windows/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/mnt/media/Qemu/windows10.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/usr/share/spice-guest-tools/spice-guest-tools.iso,format=raw,if=none,id=drive-ide0-0-1,readonly=on 
-device ide-cd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -netdev user,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:44:08:31,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input2,bus=usb.0,port=3 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=1 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on char device redirected to /dev/pts/6 (label charserial0) main_channel_link: add main channel client red_dispatcher_set_cursor_peer: inputs_connect: inputs channel client create main_channel_handle_parsed: agent start ehci warning: guest updated active QH qemu-system-x86_64: /build/qemu/src/qemu-2.8.0/hw/display/qxl.c:2173: qxl_pre_save: Assertion `d->last_release_offset < d->vga.vram_size' failed. 2017-04-06 17:00:32.819+: shutting down, reason=crashed -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1635339 Title: qxl_pre_save assertion failure on vm "save" Status in QEMU: New Bug description: When I try and save my Windows 10 VM, I see an assertion failure, and the machine is shut down. 
I see the following in the log: main_channel_handle_parsed: agent start qemu-system-x86_64: /build/qemu-Zwynhi/qemu-2.5+dfsg/hw/display/qxl.c:2101: qxl_pre_save: Assertion `d->last_release_offset < d->vga.vram_size' failed. 2016-10-20 11:52:42.713+: shutting down Please let me know what other information would be relevant! To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1635339/+subscriptions
Re: [Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
On 04/06/2017 10:01 AM, Alberto Garcia wrote: > Hi all, > > over the past couple of months I discussed with some of you the > possibility to extend the qcow2 format in order to improve its > performance and reduce its memory requirements (particularly with very > large images). > > After some discussion in the mailing list and the #qemu IRC channel I > decided to write a prototype of a new extension for qcow2 so I could > understand better the scope of the changes and have some preliminary > data about its effects. > > This e-mail is the formal presentation of my proposal to extend the > on-disk qcow2 format. As you can see this is still an RFC. Due to the > nature of the changes I would like to get as much feedback as possible > before going forward. The idea in general makes sense; I can even remember chatting with Kevin about similar ideas as far back as 2015, where the biggest drawback is that it is an incompatible image change, and therefore images created with the flag cannot be read by older tools. > === Test results === > > I have a basic working prototype of this. It's still incomplete -and > buggy :)- but it gives an idea of what we can expect from it. In my > implementation each data cluster has 8 subclusters, but that's not set > in stone (see below). > > I made all tests on an SSD drive, writing to an empty qcow2 image with > a fully populated 40GB backing image, performing random writes using > fio with a block size of 4KB. > > I tried with the default and maximum cluster sizes (64KB and 2MB) and > also with some other sizes. I also made sure to try with 32KB clusters > so the subcluster size matches the 4KB block size used for the I/O. > > It's important to point out that once a cluster has been completely > allocated then having subclusters offers no performance benefit. For > this reason the size of the image for these tests (40GB) was chosen to > be large enough to guarantee that there are always new clusters being > allocated. 
> This is therefore a worst-case scenario (or best-case for
> this feature, if you want).
>
> Here are the results (subcluster size in brackets):
>
> |----------------+----------------+-----------------+-------------------|
> |  cluster size  | subclusters=on | subclusters=off | Max L2 cache size |
> |----------------+----------------+-----------------+-------------------|
> |  2 MB (256 KB) |       440 IOPS |        100 IOPS |        160 KB (*) |
> | 512 KB (64 KB) |      1000 IOPS |        300 IOPS |            640 KB |
> |  64 KB (8 KB)  |      3000 IOPS |       1000 IOPS |              5 MB |
> |  32 KB (4 KB)  |     12000 IOPS |       1300 IOPS |             10 MB |
> |  4 KB (512 B)  |       100 IOPS |        100 IOPS |             80 MB |
> |----------------+----------------+-----------------+-------------------|

Those are some cool results!

> === Changes to the on-disk format ===
>
> The qcow2 on-disk format needs to change so each L2 entry has a bitmap
> indicating the allocation status of each subcluster. There are three
> possible states (unallocated, allocated, all zeroes), so we need two
> bits per subcluster.

You also have to add a new incompatible feature bit, so that older tools know they can't read the new image correctly, and therefore don't accidentally corrupt it.

> An L2 entry is 64 bits wide, and this is the current format (for
> uncompressed clusters):
>
> 63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
> **<----><--------------------------------------------------><-------->*
>  Rsrved           host cluster offset of data                 Reserved
> (6 bits)                  (47 bits)                           (8 bits)
>
>    bit 63: refcount == 1   (QCOW_OFLAG_COPIED)
>    bit 62: compressed = 1  (QCOW_OFLAG_COMPRESSED)
>    bit  0: all zeros       (QCOW_OFLAG_ZERO)
>
> I thought of three alternatives for storing the subcluster bitmaps. I
> haven't made my mind completely about which one is the best one, so
> I'd like to present all three for discussion. Here they are:
>
> (1) Storing the bitmap inside the 64-bit entry
>
>     This is a simple alternative and is the one that I chose for my
>     prototype. There are 14 unused bits plus the "all zeroes" one. If
>     we steal one from the host offset we have the 16 bits that we need
>     for the bitmap and we have 46 bits left for the host offset, which
>     is more than enough.
Note that because you are using exactly 8 subclusters, you can require that the minimum cluster size when subclusters are enabled be 4k (since we already have a lower-limit of 512-byte sector operation, and don't want subclusters to be smaller than that); at which case you are guaranteed that the host cluster offset will be 4k aligned. So in reality, once you turn on subclusters, you have:

63    56 55    48 47    40 39    32 31    24 23    16 15     8 7      0
Re: [Qemu-devel] [PATCH 1/5] qdev: qdev_hotplug is really a bool
On 04/06/2017 10:13 AM, Juan Quintela wrote:

Signed-off-by: Juan Quintela

Reviewed-by: Philippe Mathieu-Daudé

---
 hw/core/qdev.c         | 4 ++--
 include/hw/qdev-core.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 1e7fb33..6fa46b5 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -39,7 +39,7 @@
 #include "qapi-event.h"
 #include "migration/migration.h"
 
-int qdev_hotplug = 0;
+bool qdev_hotplug = false;
 static bool qdev_hot_added = false;
 static bool qdev_hot_removed = false;
 
@@ -385,7 +385,7 @@ void qdev_machine_creation_done(void)
      * ok, initial machine setup is done, starting from now we can
      * only create hotpluggable devices
      */
-    qdev_hotplug = 1;
+    qdev_hotplug = true;
 }
 
 bool qdev_machine_modified(void)

diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index b44b476..a96a913 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -386,7 +386,7 @@ Object *qdev_get_machine(void);
 /* FIXME: make this a link<> */
 void qdev_set_parent_bus(DeviceState *dev, BusState *bus);
 
-extern int qdev_hotplug;
+extern bool qdev_hotplug;
 
 char *qdev_get_dev_path(DeviceState *dev);
Re: [Qemu-devel] [Qemu-arm] [PATCH] target/arm: Add missing entries to excnames[] for log strings
On 6 April 2017 at 17:34, Philippe Mathieu-Daudé wrote:
> Hi Peter,
>
> excnames[] is only used in arm_log_exception(), can we move it there?

Seems like a good idea. I think this is historic, we used to use it in separate functions in different files for the 32-bit and 64-bit exception entry code, but those got unified a little while back.

I'll do that as a separate patch, though.

thanks
-- PMM
Re: [Qemu-devel] [Qemu-arm] [PATCH] stellaris: Don't hw_error() on bad register accesses
On 6 April 2017 at 17:24, Philippe Mathieu-Daudé wrote:
> Hi Peter,
>
> On 04/06/2017 10:45 AM, Peter Maydell wrote:
>> default:
>> -hw_error("gptm_write: Bad offset 0x%x\n", (int)offset);
>> +qemu_log_mask(LOG_GUEST_ERROR,
>> +              "GPTM: read at bad offset 0x%x\n", (int)offset);
>
> use HWADDR_PRIx to remove this unnecessary casts here in following changes?
>
> ie: "GPTM: read at bad offset 0x%" HWADDR_PRIx "\n", offset);

I don't think either of the two is clearly better for this sort of case where the offset is known to be small, so I opted to leave the code the way it was already written.

> either way:
> Reviewed-by: Philippe Mathieu-Daudé

Thanks.

-- PMM
Re: [Qemu-devel] [Qemu-arm] [PATCH] target/arm: Add missing entries to excnames[] for log strings
Hi Peter,

excnames[] is only used in arm_log_exception(), can we move it there?

anyway:
Reviewed-by: Philippe Mathieu-Daudé

On 04/06/2017 10:45 AM, Peter Maydell wrote:

Recent changes have added new EXCP_ values to ARM but forgot to update the excnames[] array which is used to provide human-readable strings when printing information about the exception for debug logging. Add the missing entries, and add a comment to the list of #defines to help avoid the mistake being repeated in future.

Signed-off-by: Peter Maydell
---
 target/arm/cpu.h       | 1 +
 target/arm/internals.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index a8aabce..e6f05e2 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -58,6 +58,7 @@
 #define EXCP_SEMIHOST       16   /* semihosting call */
 #define EXCP_NOCP           17   /* v7M NOCP UsageFault */
 #define EXCP_INVSTATE       18   /* v7M INVSTATE UsageFault */
+/* NB: new EXCP_ defines should be added to the excnames[] array too */
 
 #define ARMV7M_EXCP_RESET   1
 #define ARMV7M_EXCP_NMI     2

diff --git a/target/arm/internals.h b/target/arm/internals.h
index f742a41..97ca034 100644
--- a/target/arm/internals.h
+++ b/target/arm/internals.h
@@ -70,6 +70,8 @@ static const char * const excnames[] = {
     [EXCP_VIRQ] = "Virtual IRQ",
     [EXCP_VFIQ] = "Virtual FIQ",
     [EXCP_SEMIHOST] = "Semihosting call",
+    [EXCP_NOCP] = "v7M NOCP UsageFault",
+    [EXCP_INVSTATE] = "v7M INVSTATE UsageFault",
 };
 
 /* Scale factor for generic timers, ie number of ns per tick.
Re: [Qemu-devel] [Qemu-arm] [PATCH] stellaris: Don't hw_error() on bad register accesses
Hi Peter, On 04/06/2017 10:45 AM, Peter Maydell wrote: Current recommended style is to log a guest error on bad register accesses, not kill the whole system with hw_error(). Change the hw_error() calls to log as LOG_GUEST_ERROR or LOG_UNIMP or use g_assert_not_reached() as appropriate. Signed-off-by: Peter Maydell--- hw/arm/stellaris.c | 60 +- 1 file changed, 37 insertions(+), 23 deletions(-) diff --git a/hw/arm/stellaris.c b/hw/arm/stellaris.c index 9edcd49..ea7a809 100644 --- a/hw/arm/stellaris.c +++ b/hw/arm/stellaris.c @@ -108,7 +108,10 @@ static void gptm_reload(gptm_state *s, int n, int reset) } else if (s->mode[n] == 0xa) { /* PWM mode. Not implemented. */ } else { -hw_error("TODO: 16-bit timer mode 0x%x\n", s->mode[n]); +qemu_log_mask(LOG_UNIMP, + "GPTM: 16-bit timer mode unimplemented: 0x%x\n", + s->mode[n]); +return; } s->tick[n] = tick; timer_mod(s->timer[n], tick); @@ -149,7 +152,9 @@ static void gptm_tick(void *opaque) } else if (s->mode[n] == 0xa) { /* PWM mode. Not implemented. */ } else { -hw_error("TODO: 16-bit timer mode 0x%x\n", s->mode[n]); +qemu_log_mask(LOG_UNIMP, + "GPTM: 16-bit timer mode unimplemented: 0x%x\n", + s->mode[n]); } gptm_update_irq(s); } @@ -286,7 +291,8 @@ static void gptm_write(void *opaque, hwaddr offset, s->match_prescale[0] = value; break; default: -hw_error("gptm_write: Bad offset 0x%x\n", (int)offset); +qemu_log_mask(LOG_GUEST_ERROR, + "GPTM: read at bad offset 0x%x\n", (int)offset); use HWADDR_PRIx to remove this unnecessary casts here in following changes? ie: "GPTM: read at bad offset 0x%" HWADDR_PRIx "\n", offset); either way: Reviewed-by: Philippe Mathieu-Daudé } gptm_update_irq(s); } @@ -425,7 +431,10 @@ static int ssys_board_class(const ssys_state *s) } /* for unknown classes, fall through */ default: -hw_error("ssys_board_class: Unknown class 0x%08x\n", did0); +/* This can only happen if the hardwired constant did0 value + * in this board's stellaris_board_info struct is wrong. 
+ */ +g_assert_not_reached(); } } @@ -479,8 +488,7 @@ static uint64_t ssys_read(void *opaque, hwaddr offset, case DID0_CLASS_SANDSTORM: return pllcfg_sandstorm[xtal]; default: -hw_error("ssys_read: Unhandled class for PLLCFG read.\n"); -return 0; +g_assert_not_reached(); } } case 0x070: /* RCC2 */ @@ -512,7 +520,8 @@ static uint64_t ssys_read(void *opaque, hwaddr offset, case 0x1e4: /* USER1 */ return s->user1; default: -hw_error("ssys_read: Bad offset 0x%x\n", (int)offset); +qemu_log_mask(LOG_GUEST_ERROR, + "SSYS: read at bad offset 0x%x\n", (int)offset); return 0; } } @@ -614,7 +623,8 @@ static void ssys_write(void *opaque, hwaddr offset, s->ldoarst = value; break; default: -hw_error("ssys_write: Bad offset 0x%x\n", (int)offset); +qemu_log_mask(LOG_GUEST_ERROR, + "SSYS: write at bad offset 0x%x\n", (int)offset); } ssys_update(s); } @@ -748,7 +758,8 @@ static uint64_t stellaris_i2c_read(void *opaque, hwaddr offset, case 0x20: /* MCR */ return s->mcr; default: -hw_error("strllaris_i2c_read: Bad offset 0x%x\n", (int)offset); +qemu_log_mask(LOG_GUEST_ERROR, + "stellaris_i2c: read at bad offset 0x%x\n", (int)offset); return 0; } } @@ -823,17 +834,18 @@ static void stellaris_i2c_write(void *opaque, hwaddr offset, s->mris &= ~value; break; case 0x20: /* MCR */ -if (value & 1) -hw_error( - "stellaris_i2c_write: Loopback not implemented\n"); -if (value & 0x20) -hw_error( - "stellaris_i2c_write: Slave mode not implemented\n"); +if (value & 1) { +qemu_log_mask(LOG_UNIMP, "stellaris_i2c: Loopback not implemented"); +} +if (value & 0x20) { +qemu_log_mask(LOG_UNIMP, + "stellaris_i2c: Slave mode not implemented"); +} s->mcr = value & 0x31; break; default: -hw_error("stellaris_i2c_write: Bad offset 0x%x\n", - (int)offset); +qemu_log_mask(LOG_GUEST_ERROR, + "stellaris_i2c: write at bad offset 0x%x\n", (int)offset); } stellaris_i2c_update(s); } @@ -1057,8 +1069,8 @@ static uint64_t stellaris_adc_read(void *opaque, hwaddr offset, case 0x30: /* SAC */
Re: [Qemu-devel] [PATCH for-2.9 5/5] coroutine: Explicitly specify AioContext when creating coroutine
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben: > Coroutine in block layer should always be waken up in bs->aio_context > rather than the "current" context where it is entered. They differ when > the main loop is doing QMP tasks. This whole mechanism is complex stuff that I haven't quite caught up on yet, but this change means that we probably have some code that runs under one AioContext or the other depending on whether a CoMutex had to yield. Waking it up in its original AioContext definitely seemed more straightforward. > Race conditions happen without this patch, because the wrong context is > acquired in co_schedule_bh_cb, while the entered coroutine works on a > different one. If the code in co_schedule_bh_cb() assumes that coroutines can never move between AioContexts, then probably this assumption is what really needs to be fixed. For example, another case where this happens is that block jobs follow their nodes if the AioContext changes and even implement .attached_aio_context callbacks when they need to drag additional nodes into the new context. With your change, the job coroutine would remember the old coroutine and move back to the old context in some cases! I don't want to be the one to debug this kind of problems... If we really went this way, we'd need at least a way to change the AioContext of a coroutine after the fact, and be sure that we call it everywhere where it's needed (it's this last part that I'm highly doubtful about for 2.9). In fact, I think even attempting this is insanity and we need to teach the infrastructure to cope with coroutines that move between AioContexts. If it's really just about acquiring the wrong context, shouldn't then using co->ctx instead of ctx solve the problem? > Make the block layer explicitly specify a desired context for each created > coroutine. For the rest, always use qemu_get_aio_context(). 
>
> Signed-off-by: Fam Zheng
>
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -63,7 +63,8 @@ typedef void coroutine_fn CoroutineEntry(void *opaque);
>   * Use qemu_coroutine_enter() to actually transfer control to the coroutine.
>   * The opaque argument is passed as the argument to the entry point.
>   */
> -Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque);
> +Coroutine *qemu_coroutine_create(AioContext *ctx,
> +                                 CoroutineEntry *entry, void *opaque);

The new parameter could use some documentation, it's not even obvious why a coroutine should have an AioContext.

> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -115,7 +118,6 @@ void qemu_coroutine_enter(Coroutine *co)
>      }
>
>      co->caller = self;
> -    co->ctx = qemu_get_current_aio_context();
>
>      /* Store co->ctx before anything that stores co.  Matches
>       * barrier in aio_co_wake and qemu_co_mutex_wake. */
>      smp_wmb();

The comment suggests that the barrier can go away if you don't set co->ctx any more.

Kevin
Re: [Qemu-devel] is qemu able to run multiple system emulation simultaneously
On 6 April 2017 at 16:47, Jiahuan Zhang wrote:
> I need multiple ARM-based hardware emulation runing simultaneously in one
> PC.
> I wander if QEMU is capble to run two system emulation at the same time.
> Or I have to use two QEMU for two system emulation

You need to run two copies of QEMU for this. It should work fine, though.

thanks
-- PMM
Re: [Qemu-devel] [PATCH RFC v1 3/3] target/ppc: Generate fence operations
On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

@@ -3028,6 +3030,7 @@ static void gen_##name(DisasContext *ctx)    \
     tcg_gen_qemu_ld_tl(gpr, t0, ctx->mem_idx, memop);    \
     tcg_gen_mov_tl(cpu_reserve, t0);                     \
     tcg_gen_mov_tl(cpu_reserve_val, gpr);                \
+    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_LDAQ);               \

Please put the barrier next to the load. I hope we can merge these soon.

@@ -3196,6 +3199,7 @@ static void gen_##name(DisasContext *ctx)    \
     if (len > 1) {                                       \
         gen_check_align(ctx, t0, (len) - 1);             \
     }                                                    \
+    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);               \
     gen_conditional_store(ctx, t0, rS(ctx->opcode), memop); \
     tcg_temp_free(t0);                                   \
 }

This one is more complicated... The success path includes an atomic_cmpxchg, which itself is a SC barrier. However, the fail path branches across the cmpxchg and so sees no barrier, but one is still required by the architecture. How about a gen_conditional_store that looks like this:

{
    TCGLabel *l1 = gen_new_label();
    TCGLabel *l2 = gen_new_label();
    TCGv t0;

    tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);

    t0 = tcg_temp_new();
    tcg_gen_atomic_cmpxchg_tl(t0, cpu_reserve, cpu_reserve_val,
                              cpu_gpr[reg], ctx->mem_idx,
                              DEF_MEMOP(memop) | MO_ALIGN);
    tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
    tcg_gen_shli_tl(t0, t0, CRF_EQ_BIT);
    tcg_gen_or_tl(t0, t0, cpu_so);
    tcg_gen_trunc_tl(cpu_crf[0], t0);
    tcg_temp_free(t0);
    tcg_gen_br(l2);

    gen_set_label(l1);
    /* Address mismatch implies failure.  But we still need to provide
       the memory barrier semantics of the instruction.  */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);
    tcg_gen_trunc_tl(cpu_crf[0], cpu_so);

    gen_set_label(l2);
    tcg_gen_movi_tl(cpu_reserve, -1);
}

Note that you don't need to reset cpu_reserve_val at all.

I just thought of something we might need to check for this and other ll/sc implementations -- do we need to check for address misalignment along the address comparison failure path? I suspect that we do.

r~
Re: [Qemu-devel] [PATCH 2/2] parallels: relax check for auto switch of prealloc_mode
On Wed, Apr 05, 2017 at 06:12:06PM +0300, Denis V. Lunev wrote:
> PRL_PREALLOC_MODE_TRUNCATE can be set only through command line. Remove
> the check that bdrv_truncate() is working on, this is almost always the
> case, while the check could lead to serious consequences during migration
> etc when we should not even try to API which technically could change the
> image (asserts are possible).
>
> Let us keep this simple.
>
> Signed-off-by: Denis V. Lunev
> CC: Stefan Hajnoczi
> ---
>  block/parallels.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi
Re: [Qemu-devel] [PATCH for-2.9?] tests/check-qdict: Fix missing brackets
On 04/06/2017 10:49 AM, Eric Blake wrote:
> On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote:
>> From: "Dr. David Alan Gilbert"
>>
>> Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
>> boolean; it's got a point.
>>
>> Signed-off-by: Dr. David Alan Gilbert
>> ---
>>  tests/check-qdict.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tests/check-qdict.c b/tests/check-qdict.c
>> index 81162ee572..6f3fbcf9c1 100644
>> --- a/tests/check-qdict.c
>> +++ b/tests/check-qdict.c
>> @@ -559,7 +559,7 @@ static void qdict_join_test(void)
>>      g_assert(qdict_size(dict1) == 2);
>>      g_assert(qdict_size(dict2) == !overwrite);
>>
>> -    g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
>> +    g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));
>
> How is the test passing pre-patch, and why does it not change the test
> post-patch? Does that mean that overwrite is not doing what we expected?

Replying to myself:

Pre-patch, it was reached twice (once for overwrite=false, once for overwrite=true), as:

(42 == false) ? 84 : 42
(84 == true) ? 84 : 42

which simplifies to 42 on both iterations, and g_assert(42) succeeds. In fact, this is a tautology - no matter WHAT value we encounter, the assert succeeds, so we are not actually testing that the value matters.

Post-patch, it becomes:

42 == (false ? 84 : 42)
84 == (true ? 84 : 42)

which is then asserting that we actually have the value we expect.

Reviewed-by: Eric Blake

Safe for 2.9, but late enough that it also doesn't matter if it slips until 2.10.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH RFC v1 2/3] cputlb: handle first atomic write to the page
On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

In case where the conditional write is the first write to the page, TLB_NOTDIRTY will be set and stop_the_world is triggered. Handle this as a special case and set the dirty bit. After that fall through to the actual atomic instruction below.

Signed-off-by: Nikunj A Dadhania
---
 cputlb.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson

r~
Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

 tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
 l1 = gen_new_label();
 tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
-tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
-tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
+
+t0 = tcg_temp_new();
+tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
+                          ctx->mem_idx, DEF_MEMOP(memop));

Actually, I noticed another, existing, problem. This code changes CRF[0] before the user memory write, which might fault. This needs to delay any changes to the architecture visible state until after any exception may be triggered.

r~
Re: [Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets
* Eric Blake (ebl...@redhat.com) wrote: > On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote: > > From: "Dr. David Alan Gilbert"> > > > Gcc 7 (on Fedora 26) spotted odd use of integers instead of a > > boolean; it's got a point. > > > > Signed-off-by: Dr. David Alan Gilbert > > --- > > tests/check-qdict.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/tests/check-qdict.c b/tests/check-qdict.c > > index 81162ee572..6f3fbcf9c1 100644 > > --- a/tests/check-qdict.c > > +++ b/tests/check-qdict.c > > @@ -559,7 +559,7 @@ static void qdict_join_test(void) > > g_assert(qdict_size(dict1) == 2); > > g_assert(qdict_size(dict2) == !overwrite); > > > > -g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42); > > +g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42)); > > How is the test passing pre-patch, and why does it not change the test > post-patch? Does that mean that overwrite is not doing what we expected? Pre-patch it passes because either 84 or 42 are valid 'true' values into g_assert. Post patch it ends up as either: qdict_get_int(dict1, "foo") == 42 or qdict_get_int(dict1, "foo") == 84 which is what's intended. Dave > -- > Eric Blake eblake redhat com+1-919-301-3266 > Libvirt virtualization library http://libvirt.org > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [PATCH RFC v1 1/3] target/ppc: Emulate LL/SC using cmpxchg helpers
On 04/06/2017 03:22 AM, Nikunj A Dadhania wrote:

+TCGv_i32 tmp = tcg_temp_local_new_i32();
+TCGv t0;
+tcg_gen_movi_i32(tmp, 0);
 tcg_gen_trunc_tl_i32(cpu_crf[0], cpu_so);
 l1 = gen_new_label();
 tcg_gen_brcond_tl(TCG_COND_NE, EA, cpu_reserve, l1);
-tcg_gen_ori_i32(cpu_crf[0], cpu_crf[0], CRF_EQ);
-tcg_gen_qemu_st_tl(cpu_gpr[reg], EA, ctx->mem_idx, memop);
+
+t0 = tcg_temp_new();
+tcg_gen_atomic_cmpxchg_tl(t0, EA, cpu_reserve_val, cpu_gpr[reg],
+                          ctx->mem_idx, DEF_MEMOP(memop));
+tcg_gen_setcond_tl(TCG_COND_EQ, t0, t0, cpu_reserve_val);
+tcg_gen_trunc_tl_i32(tmp, t0);
+
 gen_set_label(l1);
+tcg_gen_shli_i32(tmp, tmp, CRF_EQ_BIT);
+tcg_gen_or_i32(cpu_crf[0], cpu_crf[0], tmp);

I encourage you to move these two lines up beside the setcond. That way you don't need to use a local tmp, which implies a spill/restore from the stack.

r~
Re: [Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets
On 04/06/2017 10:41 AM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert"
>
> Gcc 7 (on Fedora 26) spotted odd use of integers instead of a
> boolean; it's got a point.
>
> Signed-off-by: Dr. David Alan Gilbert
> ---
>  tests/check-qdict.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tests/check-qdict.c b/tests/check-qdict.c
> index 81162ee572..6f3fbcf9c1 100644
> --- a/tests/check-qdict.c
> +++ b/tests/check-qdict.c
> @@ -559,7 +559,7 @@ static void qdict_join_test(void)
>      g_assert(qdict_size(dict1) == 2);
>      g_assert(qdict_size(dict2) == !overwrite);
>
> -    g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
> +    g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));

How is the test passing pre-patch, and why does it not change the test post-patch? Does that mean that overwrite is not doing what we expected?

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[Qemu-devel] is qemu able to run multiple system emulation simultaneously
Dear QEMU developers,

I need multiple ARM-based hardware emulations running simultaneously in one PC. I wonder if QEMU is capable of running two system emulations at the same time, or if I have to use two QEMU instances for two system emulations. If the former is possible, where can I find some docs about it? I have searched for two days; the result is not positive. Please help. Thanks in advance.

Best regards,
Jiahuan
[Qemu-devel] [PATCH] tests/check-qdict: Fix missing brackets
From: "Dr. David Alan Gilbert"

Gcc 7 (on Fedora 26) spotted odd use of integers instead of a boolean; it's got a point.

Signed-off-by: Dr. David Alan Gilbert
---
 tests/check-qdict.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/check-qdict.c b/tests/check-qdict.c
index 81162ee572..6f3fbcf9c1 100644
--- a/tests/check-qdict.c
+++ b/tests/check-qdict.c
@@ -559,7 +559,7 @@ static void qdict_join_test(void)
     g_assert(qdict_size(dict1) == 2);
     g_assert(qdict_size(dict2) == !overwrite);
 
-    g_assert(qdict_get_int(dict1, "foo") == overwrite ? 84 : 42);
+    g_assert(qdict_get_int(dict1, "foo") == (overwrite ? 84 : 42));
 
     g_assert(qdict_get_int(dict1, "bar") == 23);
 
     if (!overwrite) {
-- 
2.12.2
Re: [Qemu-devel] [PATCH for-2.9 4/5] block: Drain BH in bdrv_drained_begin
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
>
> qmp_block_commit
>   commit_active_start
>     bdrv_reopen
>       bdrv_reopen_multiple
>         bdrv_reopen_prepare
>           bdrv_flush
>             aio_poll
>               aio_bh_poll
>                 aio_bh_call
>                   block_job_defer_to_main_loop_bh
>                     stream_complete
>                       bdrv_reopen
>
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
>
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well. To achieve
> this, this patch forces draining the BH before leaving bdrv_drained_begin().
>
> Signed-off-by: Fam Zheng

We used to poll BHs in earlier times. Commit 99723548 ('block: add BDS field to count in-flight requests') changed this, without an explanation in the commit message. Paolo, is this simply a bug in that commit, or do you rely on it somewhere? I remember that you wanted to get rid of some aio_poll() calls a while ago.

>  block/io.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/io.c b/block/io.c
> index 2709a70..b9cfd18 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -228,7 +228,12 @@ void bdrv_drained_begin(BlockDriverState *bs)
>          bdrv_parent_drained_begin(bs);
>      }
>
> -    bdrv_drain_recurse(bs);
> +    while (true) {
> +        if (!bdrv_drain_recurse(bs) &&
> +            !aio_poll(bdrv_get_aio_context(bs), false)) {
> +                break;
> +        }

The indentation is off here.

> +    }
>  }

The old code had this in what is now the BDRV_POLL_WHILE() call in bdrv_drain_recurse(). I think it makes more sense there, saves you a loop here and fixes bdrv_drain_all_begin() at the same time.

Kevin
Re: [Qemu-devel] [PATCH v6] vfio error recovery: kernel support
On Thu, 6 Apr 2017 16:53:44 +0800 Cao jinwrote: > On 04/06/2017 06:36 AM, Michael S. Tsirkin wrote: > > On Wed, Apr 05, 2017 at 04:19:10PM -0600, Alex Williamson wrote: > >> On Thu, 6 Apr 2017 00:50:22 +0300 > >> "Michael S. Tsirkin" wrote: > >> > >>> On Wed, Apr 05, 2017 at 01:38:22PM -0600, Alex Williamson wrote: > The previous intention of trying to handle all sorts of AER faults > clearly had more value, though even there the implementation and > configuration requirements restricted the practicality. For instance > is AER support actually useful to a customer if it requires all ports > of a multifunction device assigned to the VM? This seems more like a > feature targeting whole system partitioning rather than general VM > device assignment use cases. Maybe that's ok, but it should be a clear > design decision. > >>> > >>> Alex, what kind of testing do you expect to be necessary? > >>> Would you say testing on real hardware and making it trigger > >>> AER errors is a requirement? > >> > >> Testing various fatal, non-fatal, and corrected errors with aer-inject, > >> especially in multfunction configurations (where more than one port > >> is actually usable) would certainly be required. If we have cases where > >> the driver for a companion function can escalate a non-fatal error to a > >> bus reset, that should be tested, even if it requires temporary hacks to > >> the host driver for the companion function to trigger that case. AER > >> handling is not something that the typical user is going to experience, > >> so it should to be thoroughly tested to make sure it works when needed > >> or there's little point to doing it at all. Thanks, > >> > >> Alex > > > > Some things can be tested within a VM. What would you > > say would be sufficient on a VM and what has to be > > tested on bare metal? > > > > Does the "bare metal" here mean something like XenServer? No, bare metal means the non-virtualized host OS. 
I think Michael was trying to facilitate testing by proposing to do it in a VM such that we can create strange and interesting topologies that aren't bound by a system in a remote lab having only one NIC port connected. Thanks, Alex
Re: [Qemu-devel] [PATCH for-2.9 5/5] coroutine: Explicitly specify AioContext when creating coroutine
On 04/06/2017 09:25 AM, Fam Zheng wrote: > Coroutine in block layer should always be waken up in bs->aio_context s/waken up/awakened/ > rather than the "current" context where it is entered. They differ when > the main loop is doing QMP tasks. > > Race conditions happen without this patch, because the wrong context is > acquired in co_schedule_bh_cb, while the entered coroutine works on a > different one. > > Make the block layer explicitly specify a desired context for each created > coroutine. For the rest, always use qemu_get_aio_context(). > > Signed-off-by: Fam Zheng > --- The meat of the change is here (using an order file to present your diff with the interesting changes first can aid review)... > +++ b/util/qemu-coroutine.c > @@ -43,7 +43,8 @@ static void coroutine_pool_cleanup(Notifier *n, void *value) > } > } > > -Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque) > +Coroutine *qemu_coroutine_create(AioContext *ctx, > + CoroutineEntry *entry, void *opaque) > { > Coroutine *co = NULL; > > @@ -78,6 +79,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, > void *opaque) > > co->entry = entry; > co->entry_arg = opaque; > +co->ctx = ctx; > QSIMPLEQ_INIT(&co->co_queue_wakeup); > return co; > } > @@ -107,6 +109,7 @@ void qemu_coroutine_enter(Coroutine *co) > Coroutine *self = qemu_coroutine_self(); > CoroutineAction ret; > > +assert(co->ctx); > trace_qemu_coroutine_enter(self, co, co->entry_arg); > > if (co->caller) { > @@ -115,7 +118,6 @@ void qemu_coroutine_enter(Coroutine *co) > } > > co->caller = self; > -co->ctx = qemu_get_current_aio_context(); Basically, you close the race by assigning co->ctx sooner (during creation, rather than entering the coroutine), where non-block callers still end up with the same context, and block callers now have a chance to provide their desired context up front. Makes for a big patch due to the fallout, but the result seems sane to me. 
Reviewed-by: Eric Blake -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH v6] vfio error recovery: kernel support
On Thu, 6 Apr 2017 16:49:35 +0800 Cao jinwrote: > On 04/06/2017 05:56 AM, Michael S. Tsirkin wrote: > > On Wed, Apr 05, 2017 at 04:54:33PM +0800, Cao jin wrote: > >> Apparently, I don't have experience to induce non-fatal error, device > >> error is more of a chance related with the environment(temperature, > >> humidity, etc) as I understand. > > > > I'm not sure how to interpret this statement. I think what Alex is > > saying is simply that patches should include some justification. They > > make changes but what are they improving? > > For example: > > > > I tested device ABC in conditions DEF. Without a patch VM > > stops. With the patches applied VM recovers and proceeds to > > use the device normally. > > > > is one reasonable justification imho. > > > > Got it. But unfortunately, until now, I haven't seen a VM stop caused by > a real device non-fatal error during device assignment(Only saw real > fatal errors after start VM). > On one side, AER error could occur theoretically; on the other side, > seldom people have seen a VM stop caused by AER. Now I am asked that do > I have a real evidence or scenario to prove that this patchset is really > useful? I don't, and we all know it is hard to trigger a real hardware > error, so, seems I am pushed into the corner. I guess these questions > also apply for AER driver's author, if the scenario is easy to > reproduce, there is no need to write aer_inject to fake errors. Yes, non-fatal errors are rare, so rare in fact that I wonder if anything actually implements them. Could we take a survey of Linux drivers with AER support and see which ones actually do anything useful on a non-fatal error? I know fatal errors exist, I've seen them. The situation I want to avoid is adding non-fatal AER support that never gets used and just gets in the way of the next attempt to support fatal error recovery. 
We're extending the user ABI here, which means that for any feature we add, we need to consider how we'll support it long term and how we'll deprecate it or further extend it when the next level of support comes. I'm reluctant to add code that only partially solves a problem and only addresses the side of the problem that we're not even sure occurs. Thanks, Alex
Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
On Thu, Apr 06, 2017 at 03:00:28PM +0300, Michael S. Tsirkin wrote: > On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote: > > This is v8 of vt-d vfio enablement series. > > > > v8 > > - remove patches 1-9 since merged already > > - add David's r-b for all the patches > > - add Aviv's s-o-b in the last patch > > - rename iommu to iommu_dmar [Jason] > > - rename last patch subject to "remote IOTLB" [Jason] > > - pick up jason's two patches to fix vhost breakage > > - let vhost leverage the new IOMMU notifier interface > > Looks good to me except I'm still wondering why there's a single patch > by Jason. Answered in the other thread (the other one has been picked up already). > Marcel - you mentioned that you intend to try to open and > maintain pci-next. Would you like to pick this one up? Otherwise, pls > repost after 2.10. Thanks, -- peterx
Re: [Qemu-devel] [PATCH v8 0/9] VT-d: vfio enablement and misc enhances
On Thu, Apr 06, 2017 at 02:53:46PM +0300, Michael S. Tsirkin wrote: > On Thu, Apr 06, 2017 at 03:08:35PM +0800, Peter Xu wrote: > > This is v8 of vt-d vfio enablement series. > > > > v8 > > - remove patches 1-9 since merged already > > - add David's r-b for all the patches > > - add Aviv's s-o-b in the last patch > > - rename iommu to iommu_dmar [Jason] > > - rename last patch subject to "remote IOTLB" [Jason] > > - pick up jason's two patches to fix vhost breakage > > I only see one (6/9) - is a patch missing or misattributed? Oh sorry I should say "jason's one patch". The other patch has already been merged upstream, which is 375f74f47 ("vhost: generalize iommu memory region"). > > > - let vhost leverage the new IOMMU notifier interface > > Which patch does this? The first one ("memory: add section range info for IOMMU notifier"). Thanks, -- peterx
Re: [Qemu-devel] [PATCH 2/3] move xen-hvm.c to hw/i386/xen/
On Wed, Apr 05, 2017 at 04:21:30PM -0700, Anthony Xu wrote: > move xen-hvm.c to hw/i386/xen/ > > > Signed-off -by: Anthony Xu> > > > --- > Makefile.target| 3 +-- > hw/i386/xen/Makefile.objs | 2 +- > hw/i386/xen/trace-events | 11 +++ > xen-hvm.c => hw/i386/xen/xen-hvm.c | 2 +- > stubs/Makefile.objs| 1 + > xen-hvm-stub.c => stubs/xen-hvm.c | 0 > trace-events | 11 --- > 7 files changed, 15 insertions(+), 15 deletions(-) > rename xen-hvm.c => hw/i386/xen/xen-hvm.c (99%) > rename xen-hvm-stub.c => stubs/xen-hvm.c (100%) Reviewed-By: Sahid Orentino Ferdjaoui > > diff --git a/Makefile.target b/Makefile.target > index 48c027f..d5ff0c7 100644 > --- a/Makefile.target > +++ b/Makefile.target > @@ -150,8 +150,7 @@ obj-y += migration/ram.o migration/savevm.o > LIBS := $(libs_softmmu) $(LIBS) > > # xen support > -obj-$(CONFIG_XEN_I386) += xen-hvm.o xen-mapcache.o > -obj-$(call lnot,$(CONFIG_XEN_I386)) += xen-hvm-stub.o > +obj-$(CONFIG_XEN_I386) += xen-mapcache.o > > # Hardware support > ifeq ($(TARGET_NAME), sparc64) > diff --git a/hw/i386/xen/Makefile.objs b/hw/i386/xen/Makefile.objs > index 801a68d..daf4f53 100644 > --- a/hw/i386/xen/Makefile.objs > +++ b/hw/i386/xen/Makefile.objs > @@ -1 +1 @@ > -obj-y += xen_platform.o xen_apic.o xen_pvdevice.o > +obj-y += xen_platform.o xen_apic.o xen_pvdevice.o xen-hvm.o Refers to the file moved /xen-hvm.c to /hw/i386/xen/xen-hvm.c > diff --git a/hw/i386/xen/trace-events b/hw/i386/xen/trace-events > index 321fe60..f25d622 100644 > --- a/hw/i386/xen/trace-events > +++ b/hw/i386/xen/trace-events > @@ -4,3 +4,14 @@ xen_platform_log(char *s) "xen platform: %s" > # hw/i386/xen/xen_pvdevice.c > xen_pv_mmio_read(uint64_t addr) "WARNING: read from Xen PV Device MMIO space > (address %"PRIx64")" > xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space > (address %"PRIx64")" > + > +# xen-hvm.c > +xen_ram_alloc(unsigned long ram_addr, unsigned long size) "requested: %#lx, > size %#lx" > +xen_client_set_memory(uint64_t 
start_addr, unsigned long size, bool > log_dirty) "%#"PRIx64" size %#lx, log_dirty %i" > +handle_ioreq(void *req, uint32_t type, uint32_t dir, uint32_t df, uint32_t > data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) > "I/O=%p type=%d dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d > size=%d" > +handle_ioreq_read(void *req, uint32_t type, uint32_t df, uint32_t > data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) > "I/O=%p read type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d > size=%d" > +handle_ioreq_write(void *req, uint32_t type, uint32_t df, uint32_t > data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) > "I/O=%p write type=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d > size=%d" > +cpu_ioreq_pio(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, > uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p pio > dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d" > +cpu_ioreq_pio_read_reg(void *req, uint64_t data, uint64_t addr, uint32_t > size) "I/O=%p pio read reg data=%#"PRIx64" port=%#"PRIx64" size=%d" > +cpu_ioreq_pio_write_reg(void *req, uint64_t data, uint64_t addr, uint32_t > size) "I/O=%p pio write reg data=%#"PRIx64" port=%#"PRIx64" size=%d" > +cpu_ioreq_move(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, > uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p copy > dir=%d df=%d ptr=%d port=%#"PRIx64" data=%#"PRIx64" count=%d size=%d" comming from /trace-events > diff --git a/xen-hvm.c b/hw/i386/xen/xen-hvm.c > similarity index 99% > rename from xen-hvm.c > rename to hw/i386/xen/xen-hvm.c > index 5043beb..0892361 100644 > --- a/xen-hvm.c > +++ b/hw/i386/xen/xen-hvm.c > @@ -22,7 +22,7 @@ > #include "qemu/error-report.h" > #include "qemu/range.h" > #include "sysemu/xen-mapcache.h" > -#include "trace-root.h" > +#include "trace.h" > #include "exec/address-spaces.h" > > #include > diff 
--git a/stubs/Makefile.objs b/stubs/Makefile.objs > index 6c80613..f5b47bf 100644 > --- a/stubs/Makefile.objs > +++ b/stubs/Makefile.objs > @@ -38,3 +38,4 @@ stub-obj-y += target-get-monitor-def.o > stub-obj-y += pc_madt_cpu_entry.o > stub-obj-y += vmgenid.o > stub-obj-y += xen-common.o > +stub-obj-y += xen-hvm.o Refers to the file moved /xen-hvm-stub.c to /hw/i386/stubs/xen-hvm.c > diff --git a/xen-hvm-stub.c b/stubs/xen-hvm.c > similarity index 100% > rename from xen-hvm-stub.c > rename to stubs/xen-hvm.c > diff --git a/trace-events b/trace-events > index b07a09b..4e14487 100644 > --- a/trace-events > +++ b/trace-events > @@ -48,17 +48,6 @@ spice_vmc_register_interface(void *scd) "spice vmc > registered interface %p" > spice_vmc_unregister_interface(void *scd) "spice vmc
Re: [Qemu-devel] [PATCH 1/2] parallels: create parallels_inactivate() callback of block driver state
On 04/06/2017 06:14 PM, Stefan Hajnoczi wrote: > On Wed, Apr 05, 2017 at 06:12:05PM +0300, Denis V. Lunev wrote: >> We should avoid to image file at migration end when BDRV_O_INACTIVE > s/avoid to image/avoid writing to the image/ ? yes >> is set on the block driver state. All modifications should be done >> in advance, i.e. in bdrv_inactivate. >> >> The patch also adds final bdrv_flush() to be sure that changes have >> reached disk finally before other side will access the image. > If migration fails/cancels on the source after .bdrv_inactivate() we > need to return to the "activated" state. .bdrv_invalidate_cache() will > be called so I think it needs to be implemented by parallels to set > s->header->inuse again if BDRV_O_RDWR. > > .bdrv_invalidate_cache() is also called on the destination at live > migration handover. That's okay - it makes sense for the destination to > set the inuse field on success. hmm. sounds like a good catch.. >> -static void parallels_close(BlockDriverState *bs) >> +static int parallels_inactivate(BlockDriverState *bs) >> { >> +int err; >> BDRVParallelsState *s = bs->opaque; >> >> -if (bs->open_flags & BDRV_O_RDWR) { >> -s->header->inuse = 0; >> -parallels_update_header(bs); >> +if (!(bs->open_flags & BDRV_O_RDWR)) { >> +return 0; >> +} >> + >> +s->header->inuse = 0; >> +err = parallels_update_header(bs); >> +if (err < 0) { > Should we set s->header->inuse again in all error paths in case > something calls parallels_hupdate_header() again later? yes. thank you :)
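The error-path point Stefan raises (restore the in-use flag if persisting the header fails, so a later header update writes consistent state) can be summarized as a small control-flow sketch. This uses a plain dict and stand-in callables, not the parallels driver's actual API; the names are hypothetical:

```python
def inactivate(header, update_header, writable):
    """Sketch of the .bdrv_inactivate() flow being reviewed.

    header: dict with an "inuse" field (stand-in for the on-disk header).
    update_header: callable persisting the header, returning 0 or -errno.
    writable: stand-in for (bs->open_flags & BDRV_O_RDWR).
    """
    if not writable:
        return 0  # read-only image: nothing to clear
    header["inuse"] = 0
    err = update_header(header)
    if err < 0:
        # Restore the flag on failure, per the review comment, so a
        # later header update does not accidentally write inuse == 0.
        header["inuse"] = 1
        return err
    return 0
```

The symmetric half of the review comment (re-setting inuse in .bdrv_invalidate_cache() on the destination, or on the source after a failed migration) would be the inverse operation on the same field.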
Re: [Qemu-devel] [PATCH 1/3] move xen-common.c to hw/xen/
On 04/06/2017 10:10 AM, Sahid Orentino Ferdjaoui wrote: > On Wed, Apr 05, 2017 at 04:21:29PM -0700, Anthony Xu wrote: >> move xen-common.c to hw/xen/ >> >> Signed-off -by: Anthony Xu > > nit: s/Signed-off -by/Signed-off-by, not sure if it's really important > or not. Some of the automated tooling may choke on the misspelled variant, and S-o-b is legally important, so it is worth fixing. By the way, you can use 'git commit -s' or 'git send-email -s' to add S-o-b in the proper format at a given stage in your workflow, rather than having to type it by hand; it becomes obvious that you didn't use git's automation for adding it when you have a typo or when you have more than one blank line between your S-o-b and the --- separator. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben: > The fact that the bs->aio_context is changing can confuse the dataplane > iothread, because of the now fine granularity aio context lock. > bdrv_drain should rather be a bdrv_drained_begin/end pair, but since > bs->aio_context is changing, we can just use aio_disable_external and > block_job_pause. > > Reported-by: Ed Swierk> Signed-off-by: Fam Zheng > --- > block.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/block.c b/block.c > index 8893ac1..e70684a 100644 > --- a/block.c > +++ b/block.c > @@ -4395,11 +4395,14 @@ void bdrv_attach_aio_context(BlockDriverState *bs, > > void bdrv_set_aio_context(BlockDriverState *bs, AioContext *new_context) > { > -AioContext *ctx; > +AioContext *ctx = bdrv_get_aio_context(bs); > > +aio_disable_external(ctx); > +if (bs->job) { > +block_job_pause(bs->job); > +} Even more bs->job users... :-( But is this one actually necessary? We drain all pending BHs below, so the job shouldn't have any requests in flight or be able to submit new ones while we switch. > bdrv_drain(bs); /* ensure there are no in-flight requests */ > > -ctx = bdrv_get_aio_context(bs); > while (aio_poll(ctx, false)) { > /* wait for all bottom halves to execute */ > } > @@ -4412,6 +4415,10 @@ void bdrv_set_aio_context(BlockDriverState *bs, > AioContext *new_context) > aio_context_acquire(new_context); > bdrv_attach_aio_context(bs, new_context); > aio_context_release(new_context); > +if (bs->job) { > +block_job_resume(bs->job); > +} > +aio_enable_external(ctx); > } The aio_disabe/enable_external() pair seems to make sense anyway. Kevin
Re: [Qemu-devel] [PATCH 1/2] parallels: create parallels_inactivate() callback of block driver state
On Wed, Apr 05, 2017 at 06:12:05PM +0300, Denis V. Lunev wrote: > We should avoid to image file at migration end when BDRV_O_INACTIVE s/avoid to image/avoid writing to the image/ ? > is set on the block driver state. All modifications should be done > in advance, i.e. in bdrv_inactivate. > > The patch also adds final bdrv_flush() to be sure that changes have > reached disk finally before other side will access the image. If migration fails/cancels on the source after .bdrv_inactivate() we need to return to the "activated" state. .bdrv_invalidate_cache() will be called so I think it needs to be implemented by parallels to set s->header->inuse again if BDRV_O_RDWR. .bdrv_invalidate_cache() is also called on the destination at live migration handover. That's okay - it makes sense for the destination to set the inuse field on success. > -static void parallels_close(BlockDriverState *bs) > +static int parallels_inactivate(BlockDriverState *bs) > { > +int err; > BDRVParallelsState *s = bs->opaque; > > -if (bs->open_flags & BDRV_O_RDWR) { > -s->header->inuse = 0; > -parallels_update_header(bs); > +if (!(bs->open_flags & BDRV_O_RDWR)) { > +return 0; > +} > + > +s->header->inuse = 0; > +err = parallels_update_header(bs); > +if (err < 0) { Should we set s->header->inuse again in all error paths in case something calls parallels_hupdate_header() again later?
Re: [Qemu-devel] [PATCH for-2.9 4/5] block: Drain BH in bdrv_drained_begin
On 04/06/2017 09:25 AM, Fam Zheng wrote: > During block job completion, nothing is preventing > block_job_defer_to_main_loop_bh from being called in a nested > aio_poll(), which is a trouble, such as in this code path: > > qmp_block_commit > commit_active_start > bdrv_reopen > bdrv_reopen_multiple > bdrv_reopen_prepare > bdrv_flush > aio_poll > aio_bh_poll > aio_bh_call > block_job_defer_to_main_loop_bh > stream_complete > bdrv_reopen > > block_job_defer_to_main_loop_bh is the last step of the stream job, > which should have been "paused" by the bdrv_drained_begin/end in > bdrv_reopen_multiple, but it is not done because it's in the form of a > main loop BH. > > Similar to why block jobs should be paused between drained_begin and > drained_end, BHs they schedule must be excluded as well. To achieve > this, this patch forces draining the BH before leaving bdrv_drained_begin(). > > Signed-off-by: Fam Zheng > --- > block/io.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > Nice writeup. Reviewed-by: Eric Blake -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH v8 1/9] memory: add section range info for IOMMU notifier
On Thu, 6 Apr 2017 15:08:36 +0800 Peter Xuwrote: > In this patch, IOMMUNotifier.{start|end} are introduced to store section > information for a specific notifier. When notification occurs, we not > only check the notification type (MAP|UNMAP), but also check whether the > notified iova range overlaps with the range of specific IOMMU notifier, > and skip those notifiers if not in the listened range. > > When removing an region, we need to make sure we removed the correct > VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well. > > This patch is solving the problem that vfio-pci devices receive > duplicated UNMAP notification on x86 platform when vIOMMU is there. The > issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is > splitted by the (0xfee0, 0xfeef) IRQ region. AFAIK > this (splitted IOMMU region) is only happening on x86. > > This patch also helps vhost to leverage the new interface as well, so > that vhost won't get duplicated cache flushes. In that sense, it's an > slight performance improvement. 
> > Suggested-by: David Gibson > Signed-off-by: Peter Xu > --- > v7->v8: > - let vhost dmar leverage the new interface as well > - add some more comments in commit message, mentioning what issue this > patch has solved > - since touched up, removing Alex's a-b and DavidG's r-b > --- > hw/vfio/common.c | 12 +--- > hw/virtio/vhost.c | 10 -- > include/exec/memory.h | 19 ++- > memory.c | 9 + > 4 files changed, 44 insertions(+), 6 deletions(-) Acked-by: Alex Williamson > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c > index f3ba9b9..6b33b9f 100644 > --- a/hw/vfio/common.c > +++ b/hw/vfio/common.c > @@ -478,8 +478,13 @@ static void vfio_listener_region_add(MemoryListener > *listener, > giommu->iommu_offset = section->offset_within_address_space - > section->offset_within_region; > giommu->container = container; > -giommu->n.notify = vfio_iommu_map_notify; > -giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL; > +llend = int128_add(int128_make64(section->offset_within_region), > + section->size); > +llend = int128_sub(llend, int128_one()); > +iommu_notifier_init(>n, vfio_iommu_map_notify, > +IOMMU_NOTIFIER_ALL, > +section->offset_within_region, > +int128_get64(llend)); > QLIST_INSERT_HEAD(>giommu_list, giommu, giommu_next); > > memory_region_register_iommu_notifier(giommu->iommu, >n); > @@ -550,7 +555,8 @@ static void vfio_listener_region_del(MemoryListener > *listener, > VFIOGuestIOMMU *giommu; > > QLIST_FOREACH(giommu, >giommu_list, giommu_next) { > -if (giommu->iommu == section->mr) { > +if (giommu->iommu == section->mr && > +giommu->n.start == section->offset_within_region) { > memory_region_unregister_iommu_notifier(giommu->iommu, > >n); > QLIST_REMOVE(giommu, giommu_next); > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index 613494d..185b95b 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -736,14 +736,20 @@ static void vhost_iommu_region_add(MemoryListener > *listener, > struct vhost_dev *dev = container_of(listener, struct vhost_dev, > 
iommu_listener); > struct vhost_iommu *iommu; > +Int128 end; > > if (!memory_region_is_iommu(section->mr)) { > return; > } > > iommu = g_malloc0(sizeof(*iommu)); > -iommu->n.notify = vhost_iommu_unmap_notify; > -iommu->n.notifier_flags = IOMMU_NOTIFIER_UNMAP; > +end = int128_add(int128_make64(section->offset_within_region), > + section->size); > +end = int128_sub(end, int128_one()); > +iommu_notifier_init(>n, vhost_iommu_unmap_notify, > +IOMMU_NOTIFIER_UNMAP, > +section->offset_within_region, > +int128_get64(end)); > iommu->mr = section->mr; > iommu->iommu_offset = section->offset_within_address_space - >section->offset_within_region; > diff --git a/include/exec/memory.h b/include/exec/memory.h > index f20b191..0840c89 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -77,13 +77,30 @@ typedef enum { > > #define IOMMU_NOTIFIER_ALL (IOMMU_NOTIFIER_MAP | IOMMU_NOTIFIER_UNMAP) > > +struct IOMMUNotifier; > +typedef void (*IOMMUNotify)(struct IOMMUNotifier *notifier, > +IOMMUTLBEntry *data); > + > struct IOMMUNotifier { > -void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data); >
Re: [Qemu-devel] [PATCH 1/3] move xen-common.c to hw/xen/
On Wed, Apr 05, 2017 at 04:21:29PM -0700, Anthony Xu wrote: > move xen-common.c to hw/xen/ > > Signed-off -by: Anthony Xunit: s/Signed-off -by/Signed-off-by, not sure if it's really important or not. Reviewed-By: Sahid Orentino Ferdjaoui > > --- > Makefile.target | 2 -- > hw/xen/Makefile.objs| 2 +- > xen-common.c => hw/xen/xen-common.c | 0 > stubs/Makefile.objs | 1 + > xen-common-stub.c => stubs/xen-common.c | 0 > 5 files changed, 2 insertions(+), 3 deletions(-) > rename xen-common.c => hw/xen/xen-common.c (100%) > rename xen-common-stub.c => stubs/xen-common.c (100%) > > diff --git a/Makefile.target b/Makefile.target > index 7df2b8c..48c027f 100644 > --- a/Makefile.target > +++ b/Makefile.target > @@ -150,9 +150,7 @@ obj-y += migration/ram.o migration/savevm.o > LIBS := $(libs_softmmu) $(LIBS) > > # xen support > -obj-$(CONFIG_XEN) += xen-common.o > obj-$(CONFIG_XEN_I386) += xen-hvm.o xen-mapcache.o > -obj-$(call lnot,$(CONFIG_XEN)) += xen-common-stub.o > obj-$(call lnot,$(CONFIG_XEN_I386)) += xen-hvm-stub.o > > # Hardware support > diff --git a/hw/xen/Makefile.objs b/hw/xen/Makefile.objs > index 4be3ec9..64a70bc 100644 > --- a/hw/xen/Makefile.objs > +++ b/hw/xen/Makefile.objs > @@ -1,5 +1,5 @@ > # xen backend driver support > -common-obj-$(CONFIG_XEN) += xen_backend.o xen_devconfig.o xen_pvdev.o > +common-obj-$(CONFIG_XEN) += xen_backend.o xen_devconfig.o xen_pvdev.o > xen-common.o Refers to the moved file /xen-common.c to /hw/xen/xen-common.c > obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o > obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o > xen_pt_graphics.o xen_pt_msi.o > diff --git a/xen-common.c b/hw/xen/xen-common.c > similarity index 100% > rename from xen-common.c > rename to hw/xen/xen-common.c > diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs > index 224f04b..6c80613 100644 > --- a/stubs/Makefile.objs > +++ b/stubs/Makefile.objs > @@ -37,3 +37,4 @@ stub-obj-y += target-monitor-defs.o > stub-obj-y += 
target-get-monitor-def.o > stub-obj-y += pc_madt_cpu_entry.o > stub-obj-y += vmgenid.o > +stub-obj-y += xen-common.o Refers to the moved file /xen-common-stub.c to /hw/xen/stubs/xen-common.c > diff --git a/xen-common-stub.c b/stubs/xen-common.c > similarity index 100% > rename from xen-common-stub.c > rename to stubs/xen-common.c > -- > 1.8.3.1 >
[Qemu-devel] [RFC] Proposed qcow2 extension: subcluster allocation
Hi all, over the past couple of months I discussed with some of you the possibility to extend the qcow2 format in order to improve its performance and reduce its memory requirements (particularly with very large images). After some discussion in the mailing list and the #qemu IRC channel I decided to write a prototype of a new extension for qcow2 so I could understand better the scope of the changes and have some preliminary data about its effects. This e-mail is the formal presentation of my proposal to extend the on-disk qcow2 format. As you can see this is still an RFC. Due to the nature of the changes I would like to get as much feedback as possible before going forward. === Problem === The original problem that I wanted to address is the memory requirements of qcow2 files if you want to have very large images and still keep good I/O performance. This is a consequence of its very structure, which I'm going to describe now. A qcow2 image is divided into units of constant size called clusters, and among other things it contains metadata that maps guest addresses to host addresses (the so-called L1 and L2 tables). There are two basic problems that result from this: 1) Reading from or writing to a qcow2 image involves reading the corresponding entry on the L2 table that maps the guest address to the host address. This is very slow because it involves two I/O operations: one on the L2 table and the other one on the actual data cluster. 2) A cluster is the smallest unit of allocation. Therefore writing a mere 512 bytes to an empty disk requires allocating a complete cluster and filling it with zeroes (or with data from the backing image if there is one). This wastes more disk space and also has a negative impact on I/O. Problem (1) can be solved by keeping in memory a cache of the L2 tables (QEMU has an "l2_cache_size" parameter for this purpose). The amount of disk space used by L2 tables depends on two factors: the disk size and the cluster size. 
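Since each L2 entry is 8 bytes and maps exactly one cluster, the worst-case L2 metadata size follows from a few lines of arithmetic. A quick sketch (sizes in bytes) that reproduces the figures used in this proposal:

```python
def max_l2_size(virtual_disk_size, cluster_size):
    """Worst-case bytes of L2 tables for a fully allocated qcow2 image:
    one 8-byte L2 entry per data cluster."""
    return virtual_disk_size * 8 // cluster_size


TB = 1024 ** 4
MB = 1024 ** 2
KB = 1024

# 1 TB image, default 64 KB clusters -> 128 MB of L2 metadata;
# the same image with the maximum 2 MB clusters needs only 4 MB.
```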
The cluster size can be configured when the image is created, and it can be any power of two between 512 bytes and 2 MB (it defaults to 64 KB). The maximum amount of space needed for the L2 tables can be calculated with the following formula: max_l2_size = virtual_disk_size * 8 / cluster_size Large images require a large amount of metadata, and therefore a large amount of memory for the L2 cache. With the default cluster size (64KB) that's 128MB of L2 cache for a 1TB qcow2 image. The only way to reduce the size of the L2 tables is therefore increasing the cluster size, but this makes the image less efficient as described earlier in (2). === The proposal === The idea of this proposal is to extend the qcow2 format by allowing subcluster allocation. There would be an optional feature that would need to be enabled when creating the image. The on-disk format would remain essentially the same, except that each data cluster would be internally divided into a number of subclusters of equal size. What this means in practice is that each entry on an L2 table would be accompanied by a bitmap indicating the allocation state of each one of the subclusters for that cluster. There are several alternatives for storing the bitmap, described below. Other than L2 entries, all other data structures would remain unchanged, but for data clusters the smallest unit of allocation would now be the subcluster. Reference counting would still be at the cluster level, because there's no way to reference individual subclusters. Copy-on-write on internal snapshots would need to copy complete clusters, so that scenario would not benefit from this change. I see two main use cases for this feature: a) The qcow2 image is not too large / the L2 cache is not a problem, but you want to increase the allocation performance. 
In this case you can have something like a 128KB cluster with 4KB subclusters (with 4KB being a common block size in ext4 and other filesystems) b) The qcow2 image is very large and you want to save metadata space in order to have a smaller L2 cache. In this case you can go for the maximum cluster size (2MB) but you want to have smaller subclusters to increase the allocation performance and optimize the disk usage. This was actually my original use case. === Test results === I have a basic working prototype of this. It's still incomplete -and buggy :)- but it gives an idea of what we can expect from it. In my implementation each data cluster has 8 subclusters, but that's not set in stone (see below). I made all tests on an SSD drive, writing to an empty qcow2 image with a fully populated 40GB backing image, performing random writes using fio with a block size of 4KB. I tried with the default and maximum cluster sizes (64KB and 2MB) and also with some other sizes. I also made sure to try with 32KB clusters so the subcluster size matches the 4KB block size used for the I/O. It's important to point out that once
Re: [Qemu-devel] [PATCH for-2.9 3/5] block: Quiesce old aio context during bdrv_set_aio_context
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> The fact that the bs->aio_context is changing can confuse the dataplane
> iothread, because of the now fine granularity aio context lock.
> bdrv_drain should rather be a bdrv_drained_begin/end pair, but since
> bs->aio_context is changing, we can just use aio_disable_external and
> block_job_pause.
>
> Reported-by: Ed Swierk
> Signed-off-by: Fam Zheng
> ---
>  block.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>

Reviewed-by: Eric Blake

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH v5] Allow setting NUMA distance for different NUMA nodes
On Thu, Apr 06, 2017 at 10:18:53AM +0800, He Chen wrote:
> This patch is going to add SLIT table support in QEMU, and provides
> additional option `dist` for command `-numa` to allow user set vNUMA
> distance by QEMU command.
>
> With this patch, when a user wants to create a guest that contains
> several vNUMA nodes and also wants to set distance among those nodes,
> the QEMU command would like:
>
> ```
> -numa node,nodeid=0,cpus=0 \
> -numa node,nodeid=1,cpus=1 \
> -numa node,nodeid=2,cpus=2 \
> -numa node,nodeid=3,cpus=3 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=0,dst=2,val=31 \
> -numa dist,src=0,dst=3,val=41 \
> -numa dist,src=1,dst=2,val=21 \
> -numa dist,src=1,dst=3,val=31 \
> -numa dist,src=2,dst=3,val=21 \
> ```
>
> Signed-off-by: He Chen
> ---
>  hw/acpi/aml-build.c         |  25 +
>  hw/i386/acpi-build.c        |   2 +
>  include/hw/acpi/aml-build.h |   1 +
>  include/sysemu/numa.h       |   1 +
>  include/sysemu/sysemu.h     |   4 ++
>  numa.c                      | 121
>  qapi-schema.json            |  30 ++-
>  qemu-options.hx             |  17 ++-
>  8 files changed, 198 insertions(+), 3 deletions(-)
>
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index c6f2032..2c6ab07 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -24,6 +24,7 @@
>  #include "hw/acpi/aml-build.h"
>  #include "qemu/bswap.h"
>  #include "qemu/bitops.h"
> +#include "sysemu/numa.h"
>
>  static GArray *build_alloc_array(void)
>  {
> @@ -1609,3 +1610,27 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>      numamem->base_addr = cpu_to_le64(base);
>      numamem->range_length = cpu_to_le64(len);
>  }
> +
> +/*
> + * ACPI spec 5.2.17 System Locality Distance Information Table
> + * (Revision 2.0 or later)
> + */
> +void build_slit(GArray *table_data, BIOSLinker *linker)
> +{
> +    int slit_start, i, j;
> +    slit_start = table_data->len;
> +
> +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> +
> +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = 0; j < nb_numa_nodes; j++) {
> +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> +        }
> +    }
> +
> +    build_header(linker, table_data,
> +                 (void *)(table_data->data + slit_start),
> +                 "SLIT",
> +                 table_data->len - slit_start, 1, NULL, NULL);
> +}
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 2073108..12730ea 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>      if (pcms->numa_nodes) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, machine);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker);

We could make the generation of the SLIT dependent on have_numa_distance.

>      }
>      if (acpi_get_mcfg()) {
>          acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 00c21f1..329a0d0 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
>  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>                         uint64_t len, int node, MemoryAffinityFlags flags);
>
> +void build_slit(GArray *table_data, BIOSLinker *linker);
>  #endif
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 8f09dcf..2f7a941 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -21,6 +21,7 @@ typedef struct node_info {
>      struct HostMemoryBackend *node_memdev;
>      bool present;
>      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> +    uint8_t distance[MAX_NODES];
>  } NodeInfo;
>
>  extern NodeInfo numa_info[MAX_NODES];
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 576c7ce..6999545 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -169,6 +169,10 @@ extern int mem_prealloc;
>
>  #define MAX_NODES 128
>  #define NUMA_NODE_UNASSIGNED MAX_NODES
> +#define NUMA_DISTANCE_MIN         10
> +#define NUMA_DISTANCE_DEFAULT     20
> +#define NUMA_DISTANCE_MAX         254
> +#define NUMA_DISTANCE_UNREACHABLE 255
>
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
> diff --git a/numa.c b/numa.c
> index 6fc2393..838e45a 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -52,6 +52,7 @@ static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one. */
>  int nb_numa_nodes;
>  NodeInfo numa_info[MAX_NODES];
> +static bool have_numa_distance;
>
>  void numa_set_mem_node_id(ram_addr_t addr, uint64_t size,
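The distance semantics implied by the constants in the patch (10 for local, 20 as the default, 254 as the maximum, 255 for unreachable) can be sketched with a small model. This is illustrative Python, not QEMU's implementation, and the symmetric fill for unset reverse pairs is an assumption of the sketch:

```python
# Model of the -numa dist semantics: a node's distance to itself is the
# minimum (10), unset pairs fall back to the default (20), and explicit
# values must lie in [10, 255], 255 meaning unreachable.

NUMA_DISTANCE_MIN = 10
NUMA_DISTANCE_DEFAULT = 20
NUMA_DISTANCE_MAX = 254
NUMA_DISTANCE_UNREACHABLE = 255

def build_slit_matrix(nb_nodes, dist_entries):
    """dist_entries: {(src, dst): val} from '-numa dist,src=..,dst=..,val=..'."""
    matrix = [[NUMA_DISTANCE_DEFAULT] * nb_nodes for _ in range(nb_nodes)]
    for i in range(nb_nodes):
        matrix[i][i] = NUMA_DISTANCE_MIN          # distance to self
    for (src, dst), val in dist_entries.items():
        if not NUMA_DISTANCE_MIN <= val <= NUMA_DISTANCE_UNREACHABLE:
            raise ValueError(f"invalid distance {val} for {src}->{dst}")
        matrix[src][dst] = val
        matrix[dst][src] = val   # assumption: mirror unset reverse pairs
    return matrix

# The four-node example from the commit message:
m = build_slit_matrix(4, {(0, 1): 21, (0, 2): 31, (0, 3): 41,
                          (1, 2): 21, (1, 3): 31, (2, 3): 21})
print(m[0])   # [10, 21, 31, 41]
```

The flattened row-major bytes of such a matrix are exactly what the quoted build_slit() loop appends after the 8-byte node count.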
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
On 31/03/2017 4:41 PM, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for nvdimm devices in QEMU. It's of course not for 2.9. I send it out early in order to get comments on one point I'm uncertain about (see the detailed explanation below). Thanks for any comments in advance!

Background
---
Flush hint address structure is a substructure of NFIT and specifies one or more addresses, namely Flush Hint Addresses. Software can write to any one of these flush hint addresses to cause any preceding writes to the NVDIMM region to be flushed out of the intervening platform buffers to the targeted NVDIMM. More details can be found in ACPI Spec 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Why is it RFC?
---
RFC is added because I'm not sure whether the way this patch series allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least for writes) to the flush hint addresses in order to perform the necessary flush on the host back store. Therefore, QEMU needs to create IO memory regions that cover those flush hint addresses. In order to create those IO memory regions, QEMU needs to know the flush hint addresses or their offsets to other known memory regions in advance.

So far looks good.

Flush hint addresses are in the guest address space. Looking at how the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural to take the same way for flush hint addresses, i.e. let the guest firmware allocate from free addresses and patch them in the flush hint address structure. (*Please correct me if my following understanding is wrong*) However, the current allocation and pointer patching are transparent to QEMU, so QEMU will be unaware of the flush hint addresses, and consequently have no way to create corresponding IO memory regions in order to trap guest accesses.

Er, it is awkward, and the flush-hint table is static, which may not be easily patched.
Alternatively, this patch series moves the allocation of flush hint addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each nvdimm device. Its size is specified by the user via a new pc-dimm option 'reserved-size'.

We should make it work only for nvdimm?

For the following example,

    -object memory-backend-file,id=mem0,size=4G,...
    -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
    -device pc-dimm,id=dimm1,...

if dimm0 is allocated to address N ~ N+4G, the address of dimm1 will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address structure for each nvdimm device, it will allocate them from the above reserved area, e.g. the flush hint addresses of the above dimm0 are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU this way, so QEMU can easily create IO memory regions for them. If the reserved area is not present or too small, QEMU will report errors.

We should make 'reserved-size' always be page-aligned, and it should be transparent to the user, i.e. automatically reserve 4K if 'flush-hint' is specified?
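The layout in the example can be checked numerically. This is a model of the proposal's address arithmetic only; the function and names are illustrative, not QEMU code:

```python
# Model of the proposed layout: each device occupies [addr, addr + size),
# and an nvdimm with reserved-size=RSVD additionally owns the RSVD bytes
# right after its end, so the next device starts at addr + size + RSVD
# or higher.

def place_dimms(base, dimms):
    """dimms: list of (size, reserved_size) pairs, placed back to back."""
    addrs, addr = [], base
    for size, reserved in dimms:
        addrs.append(addr)
        addr += size + reserved   # the reserved area pushes the next one up
    return addrs

GB, KB = 1024 ** 3, 1024
N = 0x100000000                   # arbitrary base address for the example
# dimm0: 4G nvdimm with reserved-size=4K; dimm1: ordinary pc-dimm.
dimm0, dimm1 = place_dimms(N, [(4 * GB, 4 * KB), (1 * GB, 0)])
print(hex(dimm1 - dimm0))         # 0x100001000, i.e. 4G + 4K
```

The gap between the two devices is exactly size + reserved, matching the "N+4G+4K or higher" wording in the cover letter.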
Re: [Qemu-devel] [RFC PATCH 2/4] nvdimm: add functions to initialize and perform flush on back store
On 03/31/2017 04:41 PM, Haozhong Zhang wrote:

fsync() is used to persist modifications to the back store. If the host NVDIMM is used as the back store, fsync() on Linux will trigger the write to the host flush hint address.

Signed-off-by: Haozhong Zhang
---
 hw/mem/nvdimm.c         | 22 ++
 include/hw/mem/nvdimm.h | 13 +
 2 files changed, 35 insertions(+)

diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index db896b0..484ab8b 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -78,6 +78,26 @@ static MemoryRegion *nvdimm_get_memory_region(PCDIMMDevice *dimm)
     return &nvdimm->nvdimm_mr;
 }

+static void nvdimm_flush_init(NVDIMMDevice *nvdimm, MemoryRegion *hostmem_mr)
+{
+    if (nvdimm->flush_hint_enabled) {
+        nvdimm->backend_fd = memory_region_get_fd(hostmem_mr);

Hmm, IIRC host-mem-file does not initialize backend_fd at all.
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
On 04/06/2017 05:43 PM, Stefan Hajnoczi wrote:

On Fri, Mar 31, 2017 at 04:41:43PM +0800, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for nvdimm devices in QEMU. It's of course not for 2.9. I send it out early in order to get comments on one point I'm uncertain (see the detailed explanation below). Thanks for any comments in advance!

Background
---

Extra background: Flush Hint Addresses are necessary because:

1. Some hardware configurations may require them. In other words, a cache flush instruction is not enough to persist data.

2. The host file system may need fsync(2) calls (e.g. to persist metadata changes).

Without Flush Hint Addresses only some NVDIMM configurations actually guarantee data persistence.

Flush hint address structure is a substructure of NFIT and specifies one or more addresses, namely Flush Hint Addresses. Software can write to any one of these flush hint addresses to cause any preceding writes to the NVDIMM region to be flushed out of the intervening platform buffers to the targeted NVDIMM. More details can be found in ACPI Spec 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Do you have performance data? I'm concerned that the Flush Hint Address hardware interface is not virtualization-friendly.

In Linux drivers/nvdimm/region_devs.c:nvdimm_flush() does:

    wmb();
    for (i = 0; i < nd_region->ndr_mappings; i++)
        if (ndrd_get_flush_wpq(ndrd, i, 0))
            writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
    wmb();

That looks pretty lightweight - it's an MMIO write between write barriers.

This patch implements the MMIO write like this:

    void nvdimm_flush(NVDIMMDevice *nvdimm)
    {
        if (nvdimm->backend_fd != -1) {
            /*
             * If the backend store is a physical NVDIMM device, fsync()
             * will trigger the flush via the flush hint on the host device.
             */
            fsync(nvdimm->backend_fd);
        }
    }

The MMIO store instruction turned into a synchronous fsync(2) system call plus vmexit/vmenter and QEMU userspace context switch:

1.
The vcpu blocks during the fsync(2) system call. The MMIO write instruction has an unexpected and huge latency.

2. The vcpu thread holds the QEMU global mutex, so all other threads (including the monitor) are blocked during fsync(2). Other vcpu threads may block if they vmexit.

It is hard to implement this efficiently in QEMU. This is why I said the hardware interface is not virtualization-friendly. It's cheap on real hardware but expensive under virtualization.

We should think about the optimal way of implementing Flush Hint Addresses in QEMU. But if there is no reasonable way to implement them then I think it's better *not* to implement them, just like the Block Window feature which is also not virtualization-friendly. Users who want a block device can use virtio-blk. I don't think NVDIMM Block Window can achieve better performance than virtio-blk under virtualization (although I'm happy to be proven wrong).

Some ideas for a faster implementation:

1. Use memory_region_clear_global_locking() to avoid taking the QEMU global mutex. Little synchronization is necessary as long as the NVDIMM device isn't hot unplugged (not yet supported anyway).

2. Can the host kernel provide a way to mmap Address Flush Hints from the physical NVDIMM in cases where the configuration does not require host kernel interception? That way QEMU can map the physical NVDIMM's Address Flush Hints directly into the guest. The hypervisor is bypassed and performance would be good.

I'm not sure there is anything we can do to make the case where the host kernel wants an fsync(2) fast :(.

Good point. We can assume flush-CPU-cache-to-make-persistence is always available on Intel's hardware, so the flush-hint table is not needed if the vNVDIMM is based on a real Intel NVDIMM device. If the vNVDIMM device is based on a regular file, I think fsync is the bottleneck rather than the MMIO virtualization. :(
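The cost asymmetry being debated — a single guest store turning into a blocking fsync(2) — can be modelled in a few lines of plain Python. The class and names are illustrative stand-ins for the trapped-MMIO path, not QEMU code:

```python
import os
import tempfile

# Toy model of the flush-hint path discussed above: the guest's MMIO store
# to a flush hint address is trapped and turned into fsync(2) on the
# backing file -- one cheap guest store becomes a full syscall round trip
# (plus, in the real thing, a vmexit/vmenter and the global mutex).

class FlushHintBackend:
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR)

    def mmio_write(self, value):
        # What the patch's nvdimm_flush() does when the trapped write lands:
        os.fsync(self.fd)

    def close(self):
        os.close(self.fd)

with tempfile.NamedTemporaryFile() as f:
    backend = FlushHintBackend(f.name)
    os.write(backend.fd, b"persist me")
    backend.mmio_write(1)        # guest writes 1 to the flush hint address
    backend.close()
    data = open(f.name, "rb").read()

print(data)                      # b'persist me'
```

Timing the mmio_write() call against a bare store makes Stefan's point directly: the fsync dominates, regardless of how the trap itself is optimized.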
Re: [Qemu-devel] [PATCH] Put all trace.o into libqemuutil.a
On Tue, Apr 04, 2017 at 09:39:39PM, Xu, Anthony wrote:

Thanks, applied to my tracing-next tree:
https://github.com/stefanha/qemu/commits/tracing-next

I fixed up the patch when applying it. Comments below:

> Put all trace.o into libqemuutil.a

Please use the "trace: " prefix for tracing patches. All components of QEMU have commonly used prefixes that make it easy to identify which area a patch affects. People reading the mailing list or grepping through git logs rely on this.

> Currently all trace.o are linked into qemu-system, qemu-img,
> qemu-nbd, qemu-io etc., even the corresponding components
> are not included.
> Put all trace.o into libqemuutil.a that the linker would only pull in .o
> files containing symbols that are actually referenced by the
> program.
>
> Signed-off -by: Anthony Xu

Please use git-format-patch(1) to send correctly formatted patches in the future. git-am(1) was unable to apply your email and I had to do it manually.

Stefan
Re: [Qemu-devel] [RFC PATCH 1/4] pc-dimm: add 'reserved-size' to reserve address range after the ending address
On 03/31/2017 04:41 PM, Haozhong Zhang wrote:

If option 'reserved-size=RSVD' is present, QEMU will reserve an address range of size 'RSVD' after the ending address of the pc-dimm device.

For the following example,

    -object memory-backend-file,id=mem0,size=4G,...
    -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
    -device pc-dimm,id=dimm1,...

if dimm0 is allocated to address N ~ N+4G, the address range of dimm1 will start from N+4G+4K or higher.

Its current usage is to reserve space for the flush hint addresses of nvdimm devices.

Signed-off-by: Haozhong Zhang
---
 hw/mem/pc-dimm.c         | 48 +---
 include/hw/mem/pc-dimm.h |  2 ++
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 9e8dab0..13dcd71 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -28,6 +28,7 @@
 #include "sysemu/kvm.h"
 #include "trace.h"
 #include "hw/virtio/vhost.h"
+#include "exec/address-spaces.h"

 typedef struct pc_dimms_capacity {
      uint64_t size;
@@ -44,7 +45,12 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     MemoryRegion *vmstate_mr = ddc->get_vmstate_memory_region(dimm);
     Error *local_err = NULL;
     uint64_t existing_dimms_capacity = 0;
-    uint64_t addr;
+    uint64_t addr, size = memory_region_size(mr);
+
+    size += object_property_get_int(OBJECT(dimm), PC_DIMM_RSVD_PROP, &local_err);
+    if (local_err) {
+        goto out;
+    }

     addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, &local_err);
     if (local_err) {
@@ -54,7 +60,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
     addr = pc_dimm_get_free_addr(hpms->base, memory_region_size(&hpms->mr),
                                  !addr ? NULL : &addr, align,
-                                 memory_region_size(mr), &local_err);
+                                 size, &local_err);
     if (local_err) {
         goto out;
     }
@@ -64,7 +70,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
         goto out;
     }
-    if (existing_dimms_capacity + memory_region_size(mr) >
+    if (existing_dimms_capacity + size >
         machine->maxram_size - machine->ram_size) {
         error_setg(&local_err, "not enough space, currently 0x%" PRIx64
                    " in use of total hot pluggable 0x" RAM_ADDR_FMT,
@@ -315,6 +321,9 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
         PCDIMMDevice *dimm = item->data;
         uint64_t dimm_size = object_property_get_int(OBJECT(dimm),
                                                      PC_DIMM_SIZE_PROP,
                                                      errp) +
                              object_property_get_int(OBJECT(dimm),
                                                      PC_DIMM_RSVD_PROP,
                                                      errp);
         if (errp && *errp) {
             goto out;
@@ -382,6 +391,37 @@ static void pc_dimm_check_memdev_is_busy(Object *obj, const char *name,
     error_propagate(errp, local_err);
 }

Should count the reserved size in pc_existing_dimms_capacity_internal() too.
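The accounting fix Xiao asks for can be modelled in a few lines: a DIMM's footprint must be size + reserved both when picking a free address and when summing existing capacity against maxram. This is an illustrative model, not QEMU's allocator:

```python
# Model of the review point: if the allocator treats a DIMM as
# size + reserved but the capacity check only counts size, the two
# views of remaining room drift apart. Both must use the same footprint.

def dimm_footprint(dimm):
    return dimm["size"] + dimm.get("reserved", 0)

def existing_dimms_capacity(dimms):
    # What pc_existing_dimms_capacity_internal() should sum, per the
    # review comment: reserved bytes included.
    return sum(dimm_footprint(d) for d in dimms)

GB, KB = 1024 ** 3, 1024
dimms = [{"size": 4 * GB, "reserved": 4 * KB},   # nvdimm with flush-hint area
         {"size": 1 * GB}]                       # plain pc-dimm
cap = existing_dimms_capacity(dimms)
print(cap - 5 * GB)   # 4096 -> the reserved bytes are counted too
```

With only the allocator patched, the capacity check would report 4 KB more headroom than actually exists, which is exactly the inconsistency the comment flags.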
Re: [Qemu-devel] [RFC PATCH 0/4] nvdimm: enable flush hint address structure
On 04/06/2017 05:58 PM, Haozhong Zhang wrote:

On 04/06/17 17:39 +0800, Xiao Guangrong wrote:

On 31/03/2017 4:41 PM, Haozhong Zhang wrote:

This patch series constructs the flush hint address structures for nvdimm devices in QEMU. It's of course not for 2.9. I send it out early in order to get comments on one point I'm uncertain (see the detailed explanation below). Thanks for any comments in advance!

Background
---
Flush hint address structure is a substructure of NFIT and specifies one or more addresses, namely Flush Hint Addresses. Software can write to any one of these flush hint addresses to cause any preceding writes to the NVDIMM region to be flushed out of the intervening platform buffers to the targeted NVDIMM. More details can be found in ACPI Spec 6.1, Section 5.2.25.8 "Flush Hint Address Structure".

Why is it RFC?
---
RFC is added because I'm not sure whether the way in this patch series that allocates the guest flush hint addresses is right.

QEMU needs to trap guest accesses (at least for writes) to the flush hint addresses in order to perform the necessary flush on the host back store. Therefore, QEMU needs to create IO memory regions that cover those flush hint addresses. In order to create those IO memory regions, QEMU needs to know the flush hint addresses or their offsets to other known memory regions in advance.

So far looks good.

Flush hint addresses are in the guest address space. Looking at how the current NVDIMM ACPI in QEMU allocates the DSM buffer, it's natural to take the same way for flush hint addresses, i.e. let the guest firmware allocate from free addresses and patch them in the flush hint address structure. (*Please correct me If my following understand is wrong*) However, the current allocation and pointer patching are transparent to QEMU, so QEMU will be unaware of the flush hint addresses, and consequently have no way to create corresponding IO memory regions in order to trap guest accesses.
Er, it is awkward, and the flush-hint table is static, which may not be easily patched.

Alternatively, this patch series moves the allocation of flush hint addresses to QEMU:

1. (Patch 1) We reserve an address range after the end address of each nvdimm device. Its size is specified by the user via a new pc-dimm option 'reserved-size'.

We should make it only work for nvdimm?

Yes, we can check whether the machine option 'nvdimm' is present when plugging the nvdimm.

For the following example,

    -object memory-backend-file,id=mem0,size=4G,...
    -device nvdimm,id=dimm0,memdev=mem0,reserved-size=4K,...
    -device pc-dimm,id=dimm1,...

if dimm0 is allocated to address N ~ N+4G, the address of dimm1 will start from N+4G+4K or higher. N+4G ~ N+4G+4K is reserved for dimm0.

2. (Patch 4) When NVDIMM ACPI code builds the flush hint address structure for each nvdimm device, it will allocate them from the above reserved area, e.g. the flush hint addresses of the above dimm0 are allocated in N+4G ~ N+4G+4K. The addresses are known to QEMU this way, so QEMU can easily create IO memory regions for them. If the reserved area is not present or too small, QEMU will report errors.

We should make 'reserved-size' always be page-aligned, and it should be transparent to the user, i.e. automatically reserve 4K if 'flush-hint' is specified?

4K alignment is already enforced by the current memory plug code. About the automatic reservation, is a non-zero default value acceptable by QEMU design/convention in general?

It needn't be a user-visible parameter; a field contained in the dimm-dev or nvdimm-dev struct that indicates the reserved size would be fine.
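The page-alignment rule discussed here is the usual round-up-to-a-multiple; a one-function sketch (the page size and names are assumptions of the example):

```python
PAGE_SIZE = 4096   # the 4K page size assumed in the thread

def round_up(n, align):
    # Same idea as QEMU's ROUND_UP(): bump n to the next multiple of align.
    return (n + align - 1) // align * align

# An automatically reserved flush-hint area would always land on a
# page boundary, whatever size is actually needed:
for want in (1, 4095, 4096, 5000):
    print(round_up(want, PAGE_SIZE))   # 4096 4096 4096 8192
```

Making the reservation an internal, always-aligned field (as suggested above) means users never see or mis-set it, and the memory plug code's alignment invariant holds by construction.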
Re: [Qemu-devel] [PATCH for-2.9 2/5] mirror: Fix aio context of mirror_top_bs
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> It should be moved to the same context as source, before inserting to the
> graph.
>
> Signed-off-by: Fam Zheng
> ---
>  block/mirror.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake

> diff --git a/block/mirror.c b/block/mirror.c
> index 9e2fecc..e904fef 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1148,6 +1148,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
>          return;
>      }
>      mirror_top_bs->total_sectors = bs->total_sectors;
> +    bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
>
>      /* bdrv_append takes ownership of the mirror_top_bs reference, need to keep
>       * it alive until block_job_create() even if bs has no parent. */

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH for-2.9 2/5] mirror: Fix aio context of mirror_top_bs
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> It should be moved to the same context as source, before inserting to the
> graph.
>
> Signed-off-by: Fam Zheng

Reviewed-by: Kevin Wolf
Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot
Am 06.04.2017 um 16:25 hat Fam Zheng geschrieben:
> bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> from old bs to the new one, but if they are not on the same aio context
> this causes unbalance.
>
> Fix this by forcing callers to move the aio context first, and assert
> it.
>
> Reported-by: Ed Swierk
> Signed-off-by: Fam Zheng
> ---
>  block.c    | 3 +++
>  blockdev.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/block.c b/block.c
> index 927ba89..8893ac1 100644
> --- a/block.c
> +++ b/block.c
> @@ -1752,6 +1752,9 @@ static void bdrv_replace_child_noperm(BdrvChild *child,
>  {
>      BlockDriverState *old_bs = child->bs;
>
> +    if (old_bs && new_bs) {
> +        assert(bdrv_get_aio_context(old_bs) == bdrv_get_aio_context(new_bs));
> +    }
>      if (old_bs) {
>          if (old_bs->quiesce_counter && child->role->drained_end) {
>              child->role->drained_end(child);

Can we move this to a separate patch and also add this hunk for asserting the same thing for newly attached child nodes:

@@ -1852,6 +1855,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     bdrv_get_cumulative_perm(parent_bs, &perm, &shared_perm);

     assert(parent_bs->drv);
+    assert(bdrv_get_aio_context(parent_bs) == bdrv_get_aio_context(child_bs));
     parent_bs->drv->bdrv_child_perm(parent_bs, NULL, child_role,
                                     perm, shared_perm, &perm, &shared_perm);

Kevin
Re: [Qemu-devel] [PATCH for-2.9 1/5] block: Fix unpaired aio_disable_external in external snapshot
On 04/06/2017 09:25 AM, Fam Zheng wrote:
> bdrv_replace_child_noperm tries to hand over the quiesce_counter state
> from old bs to the new one, but if they are not on the same aio context
> this causes unbalance.
>
> Fix this by forcing callers to move the aio context first, and assert
> it.
>
> Reported-by: Ed Swierk
> Signed-off-by: Fam Zheng
> ---
>  block.c    | 3 +++
>  blockdev.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>

Reviewed-by: Eric Blake

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org