Re: Better alternative to strncpy in QEMU.
On 12/04/21 06:51, Thomas Huth wrote:
> I think this is pretty much the same as g_strlcpy() from the glib:
> https://developer.gnome.org/glib/2.66/glib-String-Utility-Functions.html#g-strlcpy
> So I guess Paolo had something different in mind when adding this task?

Yes, I did. strncpy is used legitimately when placing data in a fixed-size buffer that is written to a socket, to a file or to guest memory. The problem with using g_strlcpy in those cases is that it does not write past the first '\0' character, and therefore it can leak host data.

What I had in mind was basically strncpy plus an assertion that the last copied byte will be set to 0. It can be written in many ways, for example strncpy followed by assert(dest[destlen - 1] == '\0'), or assert(strlen(src) < destlen) followed by strncpy, or of course you could write a for loop by hand.

Once you do that, you can split the uses of strncpy in two: those where the reader expects the last byte to be zero, and those where the reader does not. (I don't expect many cases of the first type, because the reader always has to think about how to handle a malicious data stream that lacks a zero terminator.)

As long as you avoid the accidentally quadratic behavior that Peter pointed out, any way is fine, since performance does not matter on these paths. Making the code nice and readable is more important.

Paolo
Re: [PATCH v2 0/2] docs/devel/qgraph: add troubleshooting information
On 12/04/21 16:34, Stefan Hajnoczi wrote:
> v2:
>  * Fix "will unavailable" typo [Thomas]
>
> I recently needed to troubleshoot a case where qos-test terminated
> immediately with no output. In other words, qos-test decided that no
> tests are runnable. After lots of head scratching and some help from
> Emanuele it turned out that the machine types weren't being detected as
> expected. These patches add documentation about how to troubleshoot
> similar cases in the future.
>
> Stefan Hajnoczi (2):
>   libqos/qgraph: fix "UNAVAILBLE" typo
>   docs/devel/qgraph: add troubleshooting information
>
>  docs/devel/qgraph.rst       | 58 +
>  tests/qtest/libqos/qgraph.c |  2 +-
>  2 files changed, 59 insertions(+), 1 deletion(-)

Acked-by: Paolo Bonzini

Thanks, this is helpful.

Paolo
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
On 07/04/21 17:42, Kevin Wolf wrote:
> > +* Publishing other's private information, such as physical or electronic
> > +addresses, without explicit permission
>
> Yes, it's pretty clear that I'm not publishing new information about
> people when I'm keeping them in Cc: when replying to a thread, or even
> when they posted in another thread on the list recently. It becomes much
> less clear for adding people who aren't usually part of the QEMU
> community.

If you took the email from, say, the Libvirt or kernel mailing lists, that would not be considered private. If somebody has two email addresses and you deliberately Cc him on an address that he's only using for communication within his family, that would be a problem.

I agree that this kind of doxing is unlikely to happen, since our communication revolves around email and we generally don't accept pseudonymous contributions. But even there, we have historically had a couple of exceptions to the no-pseudonyms rule.

Paolo
Re: [PATCH 1/4] target/ppc: Code motion required to build disabling tcg
On Mon, Apr 12, 2021 at 12:05:31PM +, Bruno Piazera Larsen wrote:
> > A general advice for this whole series is: make sure you add in some
> > words explaining why you decided to make a particular change. It will be
> > much easier to review if we know what were the logical steps leading to
> > the change.
>
> Fair point, I should've thought about that.
>
> > > This commit does the necessary code motion from translate_init.c.inc
> >
> > For instance, I don't immediately see why these changes are necessary. I
> > see that translate_init.c.inc already has some `#ifdef CONFIG_TCG`, so
> > why do we need to move a bunch of code into cpu.c instead of just adding
> > more code under ifdef CONFIG_TCG? (I'm not saying it's wrong, just
> > trying to understand the reasoning).
>
> There are 3 main reasons for this decision. The first is kind of silly,
> but when I read translate.c my mind jumped to translating machine code to
> TCG, and the amount of TCGv variables at the start reinforced this notion.
> The second was that both s390x and i386 removed it (translate.c) from
> compilation, so I had no good reason to doubt it.
> The last (and arguably most important) is that translate.c is many
> thousands of lines long (translate_init.c.inc alone was almost 11k). The
> whole point of disabling TCG is to speed up compilation and reduce the
> final file size, so I think it makes sense to remove that big file.
> And the final nail in the coffin is that at no point did it cross my mind
> to keep the init part of translation, but remove the rest.
>
> Also, I'm not a fan of big ifdefs, because it's kinda hard to follow them
> when viewing code in vim. Adding that to already having a cpu.c file,
> where it makes sense (to me, at least) to add all CPU related functions,
> just sounded like a good idea.

I think those are all sound reasons; I think not compiling translate.c for !tcg builds is the right choice. We might, however, need to "rescue" certain functions from there by moving them to another file so they can be used by KVM code as well.

> > Is translate_init.c.inc intended to be TCG only? But then I see you
> > moved TCG-only functions out of it (ppc_fixup_cpu) and left non-TCG-only
> > functions (gen_spr_generic).
>
> This is me misjudging what is TCG and what is not, mostly. I think that
> actually moving everything to cpu.c and adding ifdefs, or creating a
> cpu_tcg.c.inc or similar, would be the most sensible code motion, but
> every function I removed from translate gave me 3 new errors. At some
> point I felt like I should leave something behind, otherwise we're
> compiling everything from TCG again, just in different files, so I tried
> to guess what was TCG and what was not (also, I really wanted the RFC out
> by the weekend, I _may_ have rushed a few choices).

I'm actually not sure if we'll want translate_init.c for !tcg builds. It's *primarily* for TCG, but we still need at least some of the cpu state structure for KVM, and some of that is initialized in translate_init. I think it will probably make more sense to leave it in for a first cut. Later refinement might end up removing it. The whole #include translate_init.c.inc thing might make for some awkward fiddling in this, of course.

> > > This moves all functions that start with gdb_* into
> > > target/ppc/gdbstub.c and creates a new function that calls those and
> > > is called by ppc_cpu_realize
> >
> > This looks like it makes sense regardless of disable-tcg, could we have
> > it in a standalone patch?
>
> Sure, I'll try and get it ready ASAP, and make sure I didn't miss one
> function before sending. Just a noob question... do I edit the patch
> manually to have it only contain the gdb code motion, or is there a way
> to move back to before I actually made the commit without needing to
> re-do the changes?
>
> Thomas, I'm adding you to this discussion since it sort of answers your
> message on the other one, about the functions used even in a KVM-only
> build.
>
> > IIRC you don't only have to exclude translate.c from the build, you
> > also have to separate translate_init.c.inc from it, i.e. turn
> > translate_init.c.inc into a proper .c file and get rid of the #include
> > "translate_init.c.inc" statement in translate.c, since many functions
> > in the translate_init.c.inc file are still needed for the KVM-only
> > builds, too. So maybe that's a good place to start as a first mini
> > series.
>
> Now that I know that translate doesn't mean "translating to TCG", this
> idea makes more sense. My question is, is it a better option than the
> code motion I'm suggesting? From my quick check on the bits that I
> haven't touched it might, but at this point it's clear I'm very lost
> with what makes sense hahaha.
>
> Bruno Piazera Larsen
> Instituto de Pesquisas ELDORADO
Re: [PATCH 2/4] target/ppc: added solutions for building with disable-tcg
On Mon, Apr 12, 2021 at 08:40:47AM -0700, Richard Henderson wrote:
> On 4/11/21 10:08 PM, David Gibson wrote:
> > Not directly related to what you're trying to accomplish here, but the
> > whole vscr_sat thing looks really weird. I have no idea why we're
> > splitting out the storage of VSCR[SAT] for the TCG case at all. If
> > you wanted to clean that up as a preliminary, I'd be grateful.
>
> That's about efficiently implementing the vector saturation instructions
> in TCG. See GEN_VXFORM_SAT in translate/vmx-impl.c.inc.
>
> The SAT bit is represented as a vector that e.g. compares the result of
> addition with the result of saturating addition. We don't need to resolve
> that to a single bit until the VSCR register is read.

Aha, thanks for the input. Long term, that might benefit from KVM-specific code paths that don't bother with this, then. However, for the first cut, it's simpler to just keep the current representation, even though it's pretty odd for KVM.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 4/5] blkdebug: do not suspend in the middle of QLIST_FOREACH_SAFE
On 08/04/21 17:59, Emanuele Giuseppe Esposito wrote:

Perhaps insert here: "That would be unsafe in case a rule other than the current one is removed while the coroutine has yielded. Keep FOREACH_SAFE because suspend_request deletes the current rule. After this patch, *all* matching rules are deleted before suspending the coroutine, rather than just one. This doesn't affect the existing testcases."

> Use actions_count to see how many yields to issue.
>
> Co-developed-by: Paolo Bonzini
> Signed-off-by: Emanuele Giuseppe Esposito
> ---
>  block/blkdebug.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 388b5ed615..dffd869b32 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -789,7 +789,6 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
>      if (!qtest_enabled()) {
>          printf("blkdebug: Suspended request '%s'\n", r->tag);
>      }
> -    qemu_coroutine_yield();
>  }
>
>  static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
> @@ -834,6 +833,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
>      QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
>          process_rule(bs, rule, actions_count);
>      }
> +
> +    while (actions_count[ACTION_SUSPEND] > 0) {
> +        qemu_coroutine_yield();
> +        actions_count[ACTION_SUSPEND]--;
> +    }
> +
>      s->state = s->new_state;
>  }
Re: [PATCH 0/5] blkdebug: fix racing condition when iterating on
On 08/04/21 17:59, Emanuele Giuseppe Esposito wrote:
> When qemu_coroutine_enter is executed in a loop (even QLIST_FOREACH_SAFE),
> the new routine can modify the list, for example removing an element,
> causing problems when control is given back to the caller that continues
> iterating on the same list.
>
> Patch 1 solves the issue in blkdebug_debug_resume by restarting the list
> walk after every coroutine_enter if the list has to be fully iterated.
> Patches 2, 3 and 4 aim to fix blkdebug_debug_event by gathering all
> actions that the rules make in a counter and invoking the respective
> coroutine_yield only after processing all requests.
> Patch 5 adds a lock to protect rules and suspended_reqs.

Patch 5 is somewhat independent of the others; right now everything works because it's protected by the AioContext lock. On the other hand, the scenarios in patches 1-4 are bugs even without patch 5. They become more obvious if you see an explicit unlock/lock pair within QTAILQ_FOREACH_SAFE, but they can happen already with just a qemu_coroutine_yield or qemu_coroutine_enter within the iteration.

Paolo
Re: [PATCH RFC RESEND v2 3/6] hw/pci: Add pci_root_bus_max_bus
Hi Xingang,

On 3/25/21 8:22 AM, Wang Xingang wrote:
> From: Xingang Wang
>
> This helps to find max bus number of a root bus.

s/max bus number of a root bus/highest bus number of a bridge hierarchy/?

> Signed-off-by: Xingang Wang
> Signed-off-by: Jiahui Cen
> ---
>  hw/pci/pci.c         | 34 ++
>  include/hw/pci/pci.h |  1 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e17aa9075f..c7957cbf7c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -538,6 +538,40 @@ int pci_bus_num(PCIBus *s)
>      return PCI_BUS_GET_CLASS(s)->bus_num(s);
>  }
>
> +int pci_root_bus_max_bus(PCIBus *bus)
> +{
> +    PCIHostState *host;
> +    PCIDevice *dev;
> +    int max_bus = 0;
> +    int type, devfn;
> +    uint8_t subordinate;
> +
> +    if (!pci_bus_is_root(bus)) {
> +        return 0;
> +    }
> +
> +    host = PCI_HOST_BRIDGE(BUS(bus)->parent);
> +    max_bus = pci_bus_num(host->bus);
> +
> +    for (devfn = 0; devfn < ARRAY_SIZE(host->bus->devices); devfn++) {
> +        dev = host->bus->devices[devfn];
> +
> +        if (!dev) {
> +            continue;
> +        }
> +
> +        type = dev->config[PCI_HEADER_TYPE] & ~PCI_HEADER_TYPE_MULTI_FUNCTION;

Seems there is PCI_DEVICE_GET_CLASS(dev)->is_bridge (see pci_root_bus_in_range). Can't that be used instead?

> +        if (type == PCI_HEADER_TYPE_BRIDGE) {
> +            subordinate = dev->config[PCI_SUBORDINATE_BUS];
> +            if (subordinate > max_bus) {
> +                max_bus = subordinate;

What about the secondary bus number — is it always less than the others?

Thanks,
Eric

> +            }
> +        }
> +    }
> +
> +    return max_bus;
> +}
> +
>  int pci_bus_numa_node(PCIBus *bus)
>  {
>      return PCI_BUS_GET_CLASS(bus)->numa_node(bus);
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 718b5a454a..e0c69534f4 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -450,6 +450,7 @@ static inline PCIBus *pci_get_bus(const PCIDevice *dev)
>      return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
>  }
>  int pci_bus_num(PCIBus *s);
> +int pci_root_bus_max_bus(PCIBus *bus);
>  static inline int pci_dev_bus_num(const PCIDevice *dev)
>  {
>      return pci_bus_num(pci_get_bus(dev));
>  }
[RFC PATCH v2 3/6] hw/arm/virt-acpi-build: Distinguish possible and present cpus
From: Ying Fang When building ACPI tables regarding CPUs we should always build them for the number of possible CPUs, not the number of present CPUs. We then ensure only the present CPUs are enabled in MADT. Furthermore, it is also needed if we are going to support CPU hotplug in the future. This patch is a rework based on Andrew Jones's contribution at https://lists.gnu.org/archive/html/qemu-arm/2018-07/msg00076.html Signed-off-by: Ying Fang Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 14 ++ hw/arm/virt.c| 3 +++ 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f5a2b2d4cb..2ad5dad1bf 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -61,13 +61,16 @@ static void acpi_dsdt_add_cpus(Aml *scope, VirtMachineState *vms) { -MachineState *ms = MACHINE(vms); +CPUArchIdList *possible_cpus = MACHINE(vms)->possible_cpus; uint16_t i; -for (i = 0; i < ms->smp.cpus; i++) { +for (i = 0; i < possible_cpus->len; i++) { Aml *dev = aml_device("C%.03X", i); aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007"))); aml_append(dev, aml_name_decl("_UID", aml_int(i))); +if (possible_cpus->cpus[i].cpu == NULL) { +aml_append(dev, aml_name_decl("_STA", aml_int(0))); +} aml_append(scope, dev); } } @@ -479,6 +482,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) const int *irqmap = vms->irqmap; AcpiMadtGenericDistributor *gicd; AcpiMadtGenericMsiFrame *gic_msi; +CPUArchIdList *possible_cpus = MACHINE(vms)->possible_cpus; int i; acpi_data_push(table_data, sizeof(AcpiMultipleApicTable)); @@ -489,7 +493,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) gicd->base_address = cpu_to_le64(memmap[VIRT_GIC_DIST].base); gicd->version = vms->gic_version; -for (i = 0; i < MACHINE(vms)->smp.cpus; i++) { +for (i = 0; i < possible_cpus->len; i++) { AcpiMadtGenericCpuInterface *gicc = acpi_data_push(table_data, sizeof(*gicc)); ARMCPU *armcpu = 
ARM_CPU(qemu_get_cpu(i)); @@ -504,7 +508,9 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) gicc->cpu_interface_number = cpu_to_le32(i); gicc->arm_mpidr = cpu_to_le64(armcpu->mp_affinity); gicc->uid = cpu_to_le32(i); -gicc->flags = cpu_to_le32(ACPI_MADT_GICC_ENABLED); +if (possible_cpus->cpus[i].cpu != NULL) { +gicc->flags = cpu_to_le32(ACPI_MADT_GICC_ENABLED); +} if (arm_feature(&armcpu->env, ARM_FEATURE_PMU)) { gicc->performance_interrupt = cpu_to_le32(PPI(VIRTUAL_PMU_IRQ)); diff --git a/hw/arm/virt.c b/hw/arm/virt.c index f4ae60ded9..3e5d9b6f26 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -2063,6 +2063,9 @@ static void machvirt_init(MachineState *machine) } qdev_realize(DEVICE(cpuobj), NULL, &error_fatal); + +/* Initialize cpu member here since cpu hotplug is not supported yet */ +machine->possible_cpus->cpus[n].cpu = cpuobj; object_unref(cpuobj); } fdt_add_timer_nodes(vms); -- 2.19.1
[RFC PATCH v2 0/6] hw/arm/virt: Introduce cpu topology support
Hi,

This series is a new version of [0] recently posted by Ying Fang to introduce cpu topology support for the ARM platform. I have taken over his work on this now; thanks for his contribution.

Description:
An accurate cpu topology may help improve the cpu scheduler's decision making when dealing with a multi-core system, so a cpu topology description is helpful to provide the guest with the right view. Dario Faggioli's talk in [1] also shows that the virtual topology can have an impact on scheduling performance. Thus this patch series introduces cpu topology support for the ARM platform.

This series originally comes from Andrew Jones's patches [2], but with some re-arrangement. Thanks for Andrew's contribution.

In this series, both the fdt and the ACPI PPTT table are introduced to present the cpu topology to the guest. And a new function virt_smp_parse(), unlike the default smp_parse(), is introduced, which prefers cores over sockets.

[0] https://patchwork.kernel.org/project/qemu-devel/cover/20210225085627.2263-1-fangyi...@huawei.com/
[1] https://kvmforum2020.sched.com/event/eE1y/virtual-topology-for-virtual-machines-friend-or-foe-dario-faggioli-suse
[2] https://github.com/rhdrjones/qemu/commit/ecfc1565f22187d2c715a99bbcd35cf3a7e428fa

Test results:
After applying this patch series, launch a guest with virt-6.0 and cpu topology configured with -smp 96,sockets=2,clusters=6,cores=4,threads=2. The VM's cpu topology description shows as below:

  Architecture:        aarch64
  Byte Order:          Little Endian
  CPU(s):              96
  On-line CPU(s) list: 0-95
  Thread(s) per core:  2
  Core(s) per socket:  24
  Socket(s):           2
  NUMA node(s):        1
  Vendor ID:           0x48
  Model:               0
  Stepping:            0x1
  BogoMIPS:            200.00
  NUMA node0 CPU(s):   0-95

---
Changelogs:
v1->v2:
- Address Andrew Jones's comments
- Address Michael S. Tsirkin's comments
- Pick up one more patch (patch #6) of Andrew Jones
- Rebased on v6.0.0-rc2 release

---
Andrew Jones (3):
  device_tree: Add qemu_fdt_add_path
  hw/arm/virt: DT: Add cpu-map
  hw/arm/virt: Replace smp_parse with one that prefers cores

Yanan Wang (2):
  hw/acpi/aml-build: Add processor hierarchy node structure
  hw/arm/virt-acpi-build: Add PPTT table

Ying Fang (1):
  hw/arm/virt-acpi-build: Distinguish possible and present cpus

 hw/acpi/aml-build.c          |  27 ++++
 hw/arm/virt-acpi-build.c     |  77 +++++++--
 hw/arm/virt.c                | 120 ++++++++++++++-
 include/hw/acpi/aml-build.h  |   4 ++
 include/hw/arm/virt.h        |   1 +
 include/sysemu/device_tree.h |   1 +
 softmmu/device_tree.c        |  45 +++++-
 7 files changed, 268 insertions(+), 7 deletions(-)

-- 
2.19.1
[RFC PATCH v2 2/6] hw/arm/virt: DT: Add cpu-map
From: Andrew Jones Support device tree CPU topology descriptions. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- hw/arm/virt.c | 41 - include/hw/arm/virt.h | 1 + 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 9f01d9041b..f4ae60ded9 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -352,10 +352,11 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) int cpu; int addr_cells = 1; const MachineState *ms = MACHINE(vms); +const VirtMachineClass *vmc = VIRT_MACHINE_GET_CLASS(vms); int smp_cpus = ms->smp.cpus; /* - * From Documentation/devicetree/bindings/arm/cpus.txt + * See Linux Documentation/devicetree/bindings/arm/cpus.yaml * On ARM v8 64-bit systems value should be set to 2, * that corresponds to the MPIDR_EL1 register size. * If MPIDR_EL1[63:32] value is equal to 0 on all CPUs @@ -408,8 +409,45 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) ms->possible_cpus->cpus[cs->cpu_index].props.node_id); } +if (ms->smp.cpus > 1 && !vmc->no_cpu_topology) { +qemu_fdt_setprop_cell(ms->fdt, nodename, "phandle", + qemu_fdt_alloc_phandle(ms->fdt)); +} + g_free(nodename); } + +if (ms->smp.cpus > 1 && !vmc->no_cpu_topology) { +/* + * See Linux Documentation/devicetree/bindings/cpu/cpu-topology.txt + * In a SMP system, the hierarchy of CPUs is defined through four + * entities that are used to describe the layout of physical CPUs + * in the system: socket/cluster/core/thread. 
+ */ +qemu_fdt_add_subnode(ms->fdt, "/cpus/cpu-map"); + +for (cpu = ms->smp.cpus - 1; cpu >= 0; cpu--) { +char *cpu_path = g_strdup_printf("/cpus/cpu@%d", cpu); +char *map_path; + +if (ms->smp.threads > 1) { +map_path = g_strdup_printf( +"/cpus/cpu-map/%s%d/%s%d/%s%d", +"socket", cpu / (ms->smp.cores * ms->smp.threads), +"core", (cpu / ms->smp.threads) % ms->smp.cores, +"thread", cpu % ms->smp.threads); +} else { +map_path = g_strdup_printf( +"/cpus/cpu-map/%s%d/%s%d", +"socket", cpu / ms->smp.cores, +"core", cpu % ms->smp.cores); +} +qemu_fdt_add_path(ms->fdt, map_path); +qemu_fdt_setprop_phandle(ms->fdt, map_path, "cpu", cpu_path); +g_free(map_path); +g_free(cpu_path); +} +} } static void fdt_add_its_gic_node(VirtMachineState *vms) @@ -2769,6 +2807,7 @@ static void virt_machine_5_2_options(MachineClass *mc) virt_machine_6_0_options(mc); compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len); vmc->no_secure_gpio = true; +vmc->no_cpu_topology = true; } DEFINE_VIRT_MACHINE(5, 2) diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index 921416f918..4a4b98e4a7 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -129,6 +129,7 @@ struct VirtMachineClass { bool no_kvm_steal_time; bool acpi_expose_flash; bool no_secure_gpio; +bool no_cpu_topology; }; struct VirtMachineState { -- 2.19.1
[RFC PATCH v2 4/6] hw/acpi/aml-build: Add processor hierarchy node structure
Add a generic API to build Processor Hierarchy Node Structure(Type 0), which is strictly consistent with descriptions in ACPI 6.3: 5.2.29.1. This function will be used to build ACPI PPTT table for cpu topology. Signed-off-by: Ying Fang Signed-off-by: Henglong Fan Signed-off-by: Yanan Wang --- hw/acpi/aml-build.c | 27 +++ include/hw/acpi/aml-build.h | 4 2 files changed, 31 insertions(+) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index d33ce8954a..75e01aea17 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -1916,6 +1916,33 @@ void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms, table_data->len - slit_start, 1, oem_id, oem_table_id); } +/* + * ACPI 6.3: 5.2.29.1 Processor Hierarchy Node Structure (Type 0) + */ +void build_processor_hierarchy_node(GArray *tbl, uint32_t flags, +uint32_t parent, uint32_t id, +uint32_t *priv_rsrc, uint32_t priv_num) +{ +int i; + +build_append_byte(tbl, 0); /* Type 0 - processor */ +build_append_byte(tbl, 20 + priv_num * 4); /* Length */ +build_append_int_noprefix(tbl, 0, 2); /* Reserved */ +build_append_int_noprefix(tbl, flags, 4); /* Flags */ +build_append_int_noprefix(tbl, parent, 4); /* Parent */ +build_append_int_noprefix(tbl, id, 4); /* ACPI processor ID */ + +/* Number of private resources */ +build_append_int_noprefix(tbl, priv_num, 4); + +/* Private resources[N] */ +if (priv_num > 0 && priv_rsrc != NULL) { +for (i = 0; i < priv_num; i++) { +build_append_int_noprefix(tbl, priv_rsrc[i], 4); +} +} +} + /* build rev1/rev3/rev5.1 FADT */ void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f, const char *oem_id, const char *oem_table_id) diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h index 471266d739..ea74b8f6ed 100644 --- a/include/hw/acpi/aml-build.h +++ b/include/hw/acpi/aml-build.h @@ -462,6 +462,10 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms, 
const char *oem_id, const char *oem_table_id); +void build_processor_hierarchy_node(GArray *tbl, uint32_t flags, +uint32_t parent, uint32_t id, +uint32_t *priv_rsrc, uint32_t priv_num); + void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f, const char *oem_id, const char *oem_table_id); -- 2.19.1
[RFC PATCH v2 1/6] device_tree: Add qemu_fdt_add_path
From: Andrew Jones qemu_fdt_add_path() works like qemu_fdt_add_subnode(), except it also adds any missing subnodes in the path. We also tweak an error message of qemu_fdt_add_subnode(). We'll make use of this new function in a coming patch. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- include/sysemu/device_tree.h | 1 + softmmu/device_tree.c| 45 ++-- 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/include/sysemu/device_tree.h b/include/sysemu/device_tree.h index 8a2fe55622..ef060a9759 100644 --- a/include/sysemu/device_tree.h +++ b/include/sysemu/device_tree.h @@ -121,6 +121,7 @@ uint32_t qemu_fdt_get_phandle(void *fdt, const char *path); uint32_t qemu_fdt_alloc_phandle(void *fdt); int qemu_fdt_nop_node(void *fdt, const char *node_path); int qemu_fdt_add_subnode(void *fdt, const char *name); +int qemu_fdt_add_path(void *fdt, const char *path); #define qemu_fdt_setprop_cells(fdt, node_path, property, ...) \ do { \ diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c index 2691c58cf6..8592c7aa1b 100644 --- a/softmmu/device_tree.c +++ b/softmmu/device_tree.c @@ -541,8 +541,8 @@ int qemu_fdt_add_subnode(void *fdt, const char *name) retval = fdt_add_subnode(fdt, parent, basename); if (retval < 0) { -error_report("FDT: Failed to create subnode %s: %s", name, - fdt_strerror(retval)); +error_report("%s: Failed to create subnode %s: %s", + __func__, name, fdt_strerror(retval)); exit(1); } @@ -550,6 +550,47 @@ int qemu_fdt_add_subnode(void *fdt, const char *name) return retval; } +/* + * Like qemu_fdt_add_subnode(), but will add all missing + * subnodes in the path. 
+ */ +int qemu_fdt_add_path(void *fdt, const char *path) +{ +char *dupname, *basename, *p; +int parent, retval = -1; + +if (path[0] != '/') { +return retval; +} + +parent = fdt_path_offset(fdt, "/"); +p = dupname = g_strdup(path); + +while (p) { +*p = '/'; +basename = p + 1; +p = strchr(p + 1, '/'); +if (p) { +*p = '\0'; +} +retval = fdt_path_offset(fdt, dupname); +if (retval < 0 && retval != -FDT_ERR_NOTFOUND) { +error_report("%s: Invalid path %s: %s", + __func__, path, fdt_strerror(retval)); +exit(1); +} else if (retval == -FDT_ERR_NOTFOUND) { +retval = fdt_add_subnode(fdt, parent, basename); +if (retval < 0) { +break; +} +} +parent = retval; +} + +g_free(dupname); +return retval; +} + void qemu_fdt_dumpdtb(void *fdt, int size) { const char *dumpdtb = current_machine->dumpdtb; -- 2.19.1
[RFC PATCH v2 6/6] hw/arm/virt: Replace smp_parse with one that prefers cores
From: Andrew Jones The virt machine type has never used the CPU topology parameters, other than number of online CPUs and max CPUs. When choosing how to allocate those CPUs the default has been to assume cores. In preparation for using the other CPU topology parameters let's use an smp_parse that prefers cores over sockets. We can also enforce the topology matches max_cpus check because we have no legacy to preserve. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- hw/arm/virt.c | 76 +++ 1 file changed, 76 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 3e5d9b6f26..57ef961cb5 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -79,6 +79,8 @@ #include "hw/virtio/virtio-iommu.h" #include "hw/char/pl011.h" #include "qemu/guest-random.h" +#include "qapi/qmp/qerror.h" +#include "sysemu/replay.h" #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \ static void virt_##major##_##minor##_class_init(ObjectClass *oc, \ @@ -2625,6 +2627,79 @@ static int virt_kvm_type(MachineState *ms, const char *type_str) return fixed_ipa ? 0 : requested_pa_size; } +/* + * Unlike smp_parse() in hw/core/machine.c, we prefer cores over sockets, + * e.g. '-smp 8' creates 1 socket with 8 cores. Whereas '-smp 8' with + * hw/core/machine.c's smp_parse() creates 8 sockets, each with 1 core. + * Additionally, we can enforce the topology matches max_cpus check, + * because we have no legacy to preserve. + */ +static void virt_smp_parse(MachineState *ms, QemuOpts *opts) +{ +if (opts) { +unsigned cpus = qemu_opt_get_number(opts, "cpus", 0); +unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); +unsigned cores = qemu_opt_get_number(opts, "cores", 0); +unsigned threads = qemu_opt_get_number(opts, "threads", 0); + +/* + * Compute missing values; prefer cores over sockets and + * sockets over threads. + */ +if (cpus == 0 || cores == 0) { +sockets = sockets > 0 ? sockets : 1; +threads = threads > 0 ? threads : 1; +if (cpus == 0) { +cores = cores > 0 ? 
cores : 1; +cpus = cores * threads * sockets; +} else { +ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); +cores = ms->smp.max_cpus / (sockets * threads); +} +} else if (sockets == 0) { +threads = threads > 0 ? threads : 1; +sockets = cpus / (cores * threads); +sockets = sockets > 0 ? sockets : 1; +} else if (threads == 0) { +threads = cpus / (cores * sockets); +threads = threads > 0 ? threads : 1; +} else if (sockets * cores * threads < cpus) { +error_report("cpu topology: " + "sockets (%u) * cores (%u) * threads (%u) < " + "smp_cpus (%u)", + sockets, cores, threads, cpus); +exit(1); +} + +ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); + +if (ms->smp.max_cpus < cpus) { +error_report("maxcpus must be equal to or greater than smp"); +exit(1); +} + +if (sockets * cores * threads != ms->smp.max_cpus) { +error_report("cpu topology: " + "sockets (%u) * cores (%u) * threads (%u)" + "!= maxcpus (%u)", + sockets, cores, threads, + ms->smp.max_cpus); +exit(1); +} + +ms->smp.cpus = cpus; +ms->smp.cores = cores; +ms->smp.threads = threads; +ms->smp.sockets = sockets; +} + +if (ms->smp.cpus > 1) { +Error *blocker = NULL; +error_setg(&blocker, QERR_REPLAY_NOT_SUPPORTED, "smp"); +replay_add_blocker(blocker); +} +} + static void virt_machine_class_init(ObjectClass *oc, void *data) { MachineClass *mc = MACHINE_CLASS(oc); @@ -2650,6 +2725,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data) mc->cpu_index_to_instance_props = virt_cpu_index_to_props; mc->default_cpu_type = ARM_CPU_TYPE_NAME("cortex-a15"); mc->get_default_cpu_node_id = virt_get_default_cpu_node_id; +mc->smp_parse = virt_smp_parse; mc->kvm_type = virt_kvm_type; assert(!mc->get_hotplug_handler); mc->get_hotplug_handler = virt_machine_get_hotplug_handler; -- 2.19.1
[RFC PATCH v2 5/6] hw/arm/virt-acpi-build: Add PPTT table
Add the Processor Properties Topology Table (PPTT) to present CPU topology information to ACPI guests. Note, while a DT boot Linux guest with a non-flat CPU topology will see socket and core IDs being sequential integers starting from zero, e.g. with -smp 4,sockets=2,cores=2,threads=1 a DT boot produces

  cpu: 0  package_id: 0   core_id: 0
  cpu: 1  package_id: 0   core_id: 1
  cpu: 2  package_id: 1   core_id: 0
  cpu: 3  package_id: 1   core_id: 1

an ACPI boot produces

  cpu: 0  package_id: 36  core_id: 0
  cpu: 1  package_id: 36  core_id: 1
  cpu: 2  package_id: 96  core_id: 2
  cpu: 3  package_id: 96  core_id: 3

This is due to several reasons:

1) DT cpu nodes do not have an equivalent field to what the PPTT ACPI Processor ID must be, i.e. something equal to the MADT CPU UID or equal to the UID of an ACPI processor container. In both ACPI cases those are platform-dependent IDs assigned by the vendor.

2) While QEMU is the vendor for a guest, if the topology specifies SMT (> 1 thread), then, with ACPI, it is impossible to assign a core-id the same value as a package-id, thus it is not possible to have package-id=0 and core-id=0. This is because package and core containers must be in the same ACPI namespace and therefore must have unique UIDs.

3) ACPI processor containers are not required for PPTT tables to be used and, due to the limitations on which IDs can be selected described above in (2), they are not helpful for QEMU, so we don't build them with this patch. In their absence, Linux assigns its own unique IDs. The maintainers have chosen not to use counters from zero, but rather ACPI table offsets, which explains why the numbers are so much larger than with DT.

4) When there is no SMT (threads=1), the core IDs for ACPI boot guests match the logical CPU IDs, because these IDs must be equal to the MADT CPU UID (as no processor containers are present), and QEMU uses the logical CPU ID for these MADT IDs.
Tested-by: Jiajie Li Signed-off-by: Andrew Jones Signed-off-by: Ying Fang Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 63 1 file changed, 63 insertions(+) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 2ad5dad1bf..03fd812d5a 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -436,6 +436,64 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) vms->oem_table_id); } +/* PPTT */ +static void +build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) +{ +int pptt_start = table_data->len; +int uid = 0, cpus = 0, socket = 0; +MachineState *ms = MACHINE(vms); +unsigned int smp_cores = ms->smp.cores; +unsigned int smp_threads = ms->smp.threads; + +acpi_data_push(table_data, sizeof(AcpiTableHeader)); + +for (socket = 0; cpus < ms->possible_cpus->len; socket++) { +uint32_t socket_offset = table_data->len - pptt_start; +int core; + +build_processor_hierarchy_node( +table_data, 1, /* Physical package */ +0, socket, /* No parent */ +NULL, 0); /* No private resources */ + +for (core = 0; core < smp_cores; core++) { +uint32_t core_offset = table_data->len - pptt_start; +int thread; + +if (smp_threads <= 1) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +socket_offset, uid++, /* Parent is a Socket */ +NULL, 0); /* No private resources */ +} else { +build_processor_hierarchy_node( +table_data, 0, +socket_offset, core, /* Parent is a Socket */ +NULL, 0); /* No private resources */ + +for (thread = 0; thread < smp_threads; thread++) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +core_offset, uid++, /* Parent is a Core */ +NULL, 0); /* No private resources */ +} +} +} +cpus += smp_cores * smp_threads; +} + +build_header(linker, table_data, + (void *)(table_data->data + 
pptt_start), "PPTT", + table_data->len - pptt_start, 2, + vms->oem_id, vms->oem_table_id); +} + /* GTDT */ static void build_gtdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) @@ -707,6 +765,11 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
[PATCH-for-6.1] exec: Remove accel/tcg/ from include paths
When TCG is enabled, the accel/tcg/ include path is added to the project-global include search list. This accel/tcg/ directory contains a header named "internal.h" which, while intended to be internal to accel/tcg/, is accessible by all files compiled when TCG is enabled. This might lead to problems with other directories using the same "internal.h" header name:

$ git ls-files | fgrep /internal.h
accel/tcg/internal.h
include/hw/ide/internal.h
target/hexagon/internal.h
target/mips/internal.h
target/ppc/internal.h
target/s390x/internal.h

As we don't need to expose accel/tcg/ internals to the rest of the code base, simplify by removing it from the include search list, and include the accel/tcg/ public headers relative to the project root search path (which is already in the generic include search path).

Signed-off-by: Philippe Mathieu-Daudé
---
Arguably public accel/tcg/ headers should be exposed under include/.
---
 meson.build                 | 1 -
 include/exec/helper-gen.h   | 4 ++--
 include/exec/helper-proto.h | 4 ++--
 include/exec/helper-tcg.h   | 4 ++--
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index c6f4b0cf5e8..d8bb1ec5aa9 100644
--- a/meson.build
+++ b/meson.build
@@ -258,7 +258,6 @@
     tcg_arch = 'riscv'
   endif
   add_project_arguments('-iquote', meson.current_source_dir() / 'tcg' / tcg_arch,
-                        '-iquote', meson.current_source_dir() / 'accel/tcg',
                         language: ['c', 'cpp', 'objc'])

   accelerators += 'CONFIG_TCG'
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 29c02f85dcc..1c2e7a8ed39 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -81,8 +81,8 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret) \
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index 659f9298e8f..ba100793a7d 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -39,8 +39,8 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \

 #include "helper.h"
 #include "trace/generated-helpers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef IN_HELPER_PROTO
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 27870509a20..68885146355 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -60,8 +60,8 @@

 #include "helper.h"
 #include "trace/generated-helpers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef str
 #undef DEF_HELPER_FLAGS_0
--
2.26.3
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
On 13.04.21 06:41, Markus Armbruster wrote: David Hildenbrand writes: On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. Rebasing my "noreserve"[1] series on this, I get weird errors from QEMU when specifying the new "reserve=off" option for a memory-backend-ram: "Invalid parameter 'reserve'" And it looks like this is the case for any new properties. Poking around, I fail to find what's causing this -- or how to unlock new properties. What is the magic toggle to make it work? Thanks! [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com Wild guess: you didn't add your new properties in the QAPI schema. For a not-so-wild-guess, send us a git-fetch argument for your rebased series. Oh, there is qapi/qom.json -- maybe that does the trick. (I have mixed feelings about having to specify the same thing twice at different locations) I'll have a look if that makes it fly. -- Thanks, David / dhildenb
Re: [PATCH-for-6.1] exec: Remove accel/tcg/ from include paths
Reviewed-by: Claudio Fontana

Ciao,

Claudio

On 4/13/21 10:10 AM, Philippe Mathieu-Daudé wrote:
> When TCG is enabled, the accel/tcg/ include path is added to the
> project-global include search list. This accel/tcg/ directory
> contains a header named "internal.h" which, while intended to
> be internal to accel/tcg/, is accessible by all files compiled
> when TCG is enabled. This might lead to problems with other
> directories using the same "internal.h" header name:
>
> $ git ls-files | fgrep /internal.h
> accel/tcg/internal.h
> include/hw/ide/internal.h
> target/hexagon/internal.h
> target/mips/internal.h
> target/ppc/internal.h
> target/s390x/internal.h
>
> As we don't need to expose accel/tcg/ internals to the rest of
> the code base, simplify by removing it from the include search
> list, and include the accel/tcg/ public headers relative to the
> project root search path (which is already in the generic include
> search path).
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
> Arguably public accel/tcg/ headers should be exposed under include/.
> --- > meson.build | 1 - > include/exec/helper-gen.h | 4 ++-- > include/exec/helper-proto.h | 4 ++-- > include/exec/helper-tcg.h | 4 ++-- > 4 files changed, 6 insertions(+), 7 deletions(-) > > diff --git a/meson.build b/meson.build > index c6f4b0cf5e8..d8bb1ec5aa9 100644 > --- a/meson.build > +++ b/meson.build > @@ -258,7 +258,6 @@ > tcg_arch = 'riscv' >endif >add_project_arguments('-iquote', meson.current_source_dir() / 'tcg' / > tcg_arch, > -'-iquote', meson.current_source_dir() / 'accel/tcg', > language: ['c', 'cpp', 'objc']) > >accelerators += 'CONFIG_TCG' > diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h > index 29c02f85dcc..1c2e7a8ed39 100644 > --- a/include/exec/helper-gen.h > +++ b/include/exec/helper-gen.h > @@ -81,8 +81,8 @@ static inline void glue(gen_helper_, > name)(dh_retvar_decl(ret) \ > #include "helper.h" > #include "trace/generated-helpers.h" > #include "trace/generated-helpers-wrappers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include "accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef DEF_HELPER_FLAGS_0 > #undef DEF_HELPER_FLAGS_1 > diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h > index 659f9298e8f..ba100793a7d 100644 > --- a/include/exec/helper-proto.h > +++ b/include/exec/helper-proto.h > @@ -39,8 +39,8 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), > dh_ctype(t3), \ > > #include "helper.h" > #include "trace/generated-helpers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include "accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef IN_HELPER_PROTO > > diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h > index 27870509a20..68885146355 100644 > --- a/include/exec/helper-tcg.h > +++ b/include/exec/helper-tcg.h > @@ -60,8 +60,8 @@ > > #include "helper.h" > #include "trace/generated-helpers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include 
"accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef str > #undef DEF_HELPER_FLAGS_0 >
[RFC PATCH v2 1/4] vl.c: Add -smp, clusters=* command line support for ARM cpu
A cluster is a group of cores that share some resources (e.g. cache) under the LLC. For example, the ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cores. All clusters share the L3 cache data, while cores within each cluster share the L2 cache. The cache affinity of clusters has been shown to improve Linux kernel scheduling performance, and a patchset has been posted in which a general sched_domain for clusters was added and a cluster level was added to the arch-neutral cpu topology struct, like below:

struct cpu_topology {
    int thread_id;
    int core_id;
    int cluster_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t cluster_sibling;
    cpumask_t llc_sibling;
}

Also, the kernel documentation Documentation/devicetree/bindings/cpu/cpu-topology.txt defines a four-level CPU topology hierarchy: socket/cluster/core/thread. According to it, a socket node's child nodes must be one or more cluster nodes, and a cluster node's child nodes must be one or more cluster nodes or one or more core nodes.

So let's add -smp clusters=* command line support for ARM CPUs, so that a future guest OS can make use of the cluster cpu topology for better scheduling performance. For ARM machines, a four-level cpu hierarchy can be defined as sockets/clusters/cores/threads. Because we currently only support clusters for ARM CPUs, a new member "unsigned smp_clusters" is added to the VirtMachineState structure. 
Signed-off-by: Yanan Wang --- include/hw/arm/virt.h | 1 + qemu-options.hx | 26 +++--- softmmu/vl.c | 3 +++ 3 files changed, 19 insertions(+), 11 deletions(-) diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index 4a4b98e4a7..5d5d156924 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -155,6 +155,7 @@ struct VirtMachineState { char *pciehb_nodename; const int *irqmap; int fdt_size; +unsigned smp_clusters; uint32_t clock_phandle; uint32_t gic_phandle; uint32_t msi_phandle; diff --git a/qemu-options.hx b/qemu-options.hx index fd21002bd6..65343ea23c 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -184,25 +184,29 @@ SRST ERST DEF("smp", HAS_ARG, QEMU_OPTION_smp, -"-smp [cpus=]n[,maxcpus=cpus][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]\n" +"-smp [cpus=]n[,maxcpus=cpus][,clusters=clusters][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]\n" "set the number of CPUs to 'n' [default=1]\n" "maxcpus= maximum number of total cpus, including\n" "offline CPUs for hotplug, etc\n" -"cores= number of CPU cores on one socket (for PC, it's on one die)\n" +"cores= number of CPU cores on one socket\n" +"(it's on one die for PC, and on one cluster for ARM)\n" "threads= number of threads on one CPU core\n" +"clusters= number of CPU clusters on one socket (for ARM only)\n" "dies= number of CPU dies on one socket (for PC only)\n" "sockets= number of discrete sockets in the system\n", QEMU_ARCH_ALL) SRST -``-smp [cpus=]n[,cores=cores][,threads=threads][,dies=dies][,sockets=sockets][,maxcpus=maxcpus]`` -Simulate an SMP system with n CPUs. On the PC target, up to 255 CPUs -are supported. On Sparc32 target, Linux limits the number of usable -CPUs to 4. For the PC target, the number of cores per die, the -number of threads per cores, the number of dies per packages and the -total number of sockets can be specified. Missing values will be -computed. If any on the three values is given, the total number of -CPUs n can be omitted. 
maxcpus specifies the maximum number of -hotpluggable CPUs. +``-smp [cpus=]n[,maxcpus=cpus][,clusters=clusters][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]`` +Simulate an SMP system with n CPUs. On the PC target, up to 255 +CPUs are supported. On the Sparc32 target, Linux limits the number +of usable CPUs to 4. For the PC target, the number of threads per +core, the number of cores per die, the number of dies per package +and the total number of sockets can be specified. For the ARM target, +the number of threads per core, the number of cores per cluster, the +number of clusters per socket and the total number of sockets can be +specified. And missing values will be computed. If any of the five +values is given, the total number of CPUs n can be omitted. Maxcpus +specifies the maximum number of hotpluggable CPUs. ERST DEF("numa", HAS_ARG, QEMU_OPTION_numa, diff --git a/softmmu/vl.c b/softmmu/vl.c index aadb526138..46f5b6b575 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -720,6 +720,9 @@ static QemuOptsList qemu_smp_opts = { }, {
[RFC PATCH v2 0/4] hw/arm/virt: Introduce cluster cpu topology support
Hi,

This series is a new version of [0], posted to introduce cluster cpu topology support for the ARM platform, in addition to the existing sockets, cores, and threads. The code has been rewritten based on patch series [1].

[0] https://patchwork.kernel.org/project/qemu-devel/cover/20210331095343.12172-1-wangyana...@huawei.com/
[1] https://patchwork.kernel.org/project/qemu-devel/cover/20210413080745.33004-1-wangyana...@huawei.com/

Changelogs:
v1->v2:
- Only focus on cluster support for the ARM platform
- Rebase the code on patch series [1]

Description:
A cluster is a group of cores that share some resources (e.g. cache) under the LLC. For example, the ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cores. All clusters share the L3 cache data, while cores within each cluster share the L2 cache. The cache affinity of clusters has been shown to improve Linux kernel scheduling performance, and a patchset [2] has already been posted, where a general sched_domain for clusters was added and a cluster level was added to the arch-neutral cpu topology struct, like below:

struct cpu_topology {
    int thread_id;
    int core_id;
    int cluster_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t cluster_sibling;
    cpumask_t llc_sibling;
};

Also, kernel doc [3]: Documentation/devicetree/bindings/cpu/cpu-topology.txt defines a four-level CPU topology hierarchy: socket/cluster/core/thread. According to it, a socket node's child nodes must be one or more cluster nodes, and a cluster node's child nodes must be one or more cluster nodes or one or more core nodes.

So let's add -smp clusters=* command line support for ARM CPUs, so that a future guest OS can make use of the cluster cpu topology for better scheduling performance. For ARM machines, a four-level cpu hierarchy can be defined as sockets/clusters/cores/threads. 
[2] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210319041618.14316-1-song.bao@hisilicon.com/ [3] https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/cpu/cpu-topology.txt Test results: After applying this patch series, launch a guest with virt-6.0 and cpu topology configured with: -smp 96,sockets=2,clusters=6,cores=4,threads=2, VM's cpu topology description shows as below. lscpu: Architecture:aarch64 Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 2 NUMA node(s):1 Vendor ID: 0x48 Model: 0 Stepping:0x1 BogoMIPS:200.00 NUMA node0 CPU(s): 0-95 Topology information of clusters can also be got: cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0-7 cat /sys/devices/system/cpu/cpu0/topology/cluster_id: 56 cat /sys/devices/system/cpu/cpu8/topology/cluster_cpus_list: 8-15 cat /sys/devices/system/cpu/cpu8/topology/cluster_id: 316 ... cat /sys/devices/system/cpu/cpu95/topology/cluster_cpus_list: 88-95 cat /sys/devices/system/cpu/cpu95/topology/cluster_id: 2936 Yanan Wang (4): vl.c: Add -smp, clusters=* command line support for ARM cpu hw/arm/virt: Parse -smp cluster parameter in virt_smp_parse hw/arm/virt-acpi-build: Add cluster level for PPTT table hw/arm/virt: Add cluster level for device tree hw/arm/virt-acpi-build.c | 55 hw/arm/virt.c| 44 +++- include/hw/arm/virt.h| 1 + qemu-options.hx | 26 +++ softmmu/vl.c | 3 +++ 5 files changed, 78 insertions(+), 51 deletions(-) -- 2.19.1
[RFC PATCH v2 3/4] hw/arm/virt-acpi-build: Add cluster level for PPTT table
Add a Processor Hierarchy Node of cluster level between core level and package level for ARM PPTT table. Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 55 1 file changed, 33 insertions(+), 22 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 03fd812d5a..2b745711d1 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -443,6 +443,7 @@ build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) int pptt_start = table_data->len; int uid = 0, cpus = 0, socket = 0; MachineState *ms = MACHINE(vms); +unsigned int smp_clusters = vms->smp_clusters; unsigned int smp_cores = ms->smp.cores; unsigned int smp_threads = ms->smp.threads; @@ -450,42 +451,52 @@ build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) for (socket = 0; cpus < ms->possible_cpus->len; socket++) { uint32_t socket_offset = table_data->len - pptt_start; -int core; +int cluster; build_processor_hierarchy_node( table_data, 1, /* Physical package */ 0, socket, /* No parent */ NULL, 0); /* No private resources */ -for (core = 0; core < smp_cores; core++) { -uint32_t core_offset = table_data->len - pptt_start; -int thread; - -if (smp_threads <= 1) { -build_processor_hierarchy_node( -table_data, -(1 << 1) | /* ACPI Processor ID valid */ -(1 << 3), /* ACPI 6.3 - Node is a Leaf */ -socket_offset, uid++, /* Parent is a Socket */ -NULL, 0); /* No private resources */ -} else { -build_processor_hierarchy_node( -table_data, 0, -socket_offset, core, /* Parent is a Socket */ -NULL, 0); /* No private resources */ - -for (thread = 0; thread < smp_threads; thread++) { +for (cluster = 0; cluster < smp_clusters; cluster++) { +uint32_t cluster_offset = table_data->len - pptt_start; +int core; + +build_processor_hierarchy_node( +table_data, 0, +socket_offset, cluster, /* Parent is a Socket */ +NULL, 0); /* No private resources */ + +for (core = 0; core < smp_cores; core++) { +uint32_t core_offset = table_data->len - pptt_start; +int 
thread; + +if (smp_threads <= 1) { build_processor_hierarchy_node( table_data, (1 << 1) | /* ACPI Processor ID valid */ -(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ (1 << 3), /* ACPI 6.3 - Node is a Leaf */ -core_offset, uid++, /* Parent is a Core */ +cluster_offset, uid++, /* Parent is a Cluster */ NULL, 0); /* No private resources */ +} else { +build_processor_hierarchy_node( +table_data, 0, +cluster_offset, core, /* Parent is a Cluster */ +NULL, 0); /* No private resources */ + +for (thread = 0; thread < smp_threads; thread++) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +core_offset, uid++, /* Parent is a Core */ +NULL, 0); /* No private resources */ +} } } } -cpus += smp_cores * smp_threads; +cpus += smp_clusters * smp_cores * smp_threads; } build_header(linker, table_data, -- 2.19.1
[RFC PATCH v2 4/4] hw/arm/virt: Add cluster level for device tree
Add a cluster level between core level and package level for ARM device tree. Signed-off-by: Yanan Wang --- hw/arm/virt.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 51797628db..4468a4734b 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -434,14 +434,18 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) if (ms->smp.threads > 1) { map_path = g_strdup_printf( -"/cpus/cpu-map/%s%d/%s%d/%s%d", -"socket", cpu / (ms->smp.cores * ms->smp.threads), +"/cpus/cpu-map/%s%d/%s%d/%s%d/%s%d", +"socket", cpu / (vms->smp_clusters * ms->smp.cores * +ms->smp.threads), +"cluster", (cpu / (ms->smp.cores * ms->smp.threads)) % +vms->smp_clusters, "core", (cpu / ms->smp.threads) % ms->smp.cores, "thread", cpu % ms->smp.threads); } else { map_path = g_strdup_printf( -"/cpus/cpu-map/%s%d/%s%d", -"socket", cpu / ms->smp.cores, +"/cpus/cpu-map/%s%d/%s%d/%s%d", +"socket", cpu / (vms->smp_clusters * ms->smp.cores), +"cluster", (cpu / ms->smp.cores) % vms->smp_clusters, "core", cpu % ms->smp.cores); } qemu_fdt_add_path(ms->fdt, map_path); -- 2.19.1
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
On 13.04.21 10:13, David Hildenbrand wrote: On 13.04.21 06:41, Markus Armbruster wrote: David Hildenbrand writes: On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. Rebasing my "noreserve"[1] series on this, I get weird errors from QEMU when specifying the new "reserve=off" option for a memory-backend-ram: "Invalid parameter 'reserve'" And it looks like this is the case for any new properties. Poking around, I fail to find what's causing this -- or how to unlock new properties. What is the magic toggle to make it work? Thanks! [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com Wild guess: you didn't add your new properties in the QAPI schema. For a not-so-wild-guess, send us a git-fetch argument for your rebased series. Oh, there is qapi/qom.json -- maybe that does the trick. (I have mixed feelings about having to specify the same thing twice at different locations) I'll have a look if that makes it fly. Yes, works just fine -- thanks! -- Thanks, David / dhildenb
Re: [RFC v12 27/65] target/arm: split a15 cpu model and 32bit class functions to cpu32.c
On 4/8/21 12:23 PM, Claudio Fontana wrote: > On 3/28/21 6:18 PM, Richard Henderson wrote: >> On 3/26/21 1:36 PM, Claudio Fontana wrote: >>> provide helper functions there to initialize 32bit models, >>> and export the a15 cpu model. >>> >>> We still need to keep around a15 until we sort out the board configurations. >>> >>> cpu.c will continue to contain the common parts between >>> 32 and 64. >>> >>> Note that we need to build cpu32 also for TARGET_AARCH64, >>> because qemu-system-aarch64 is supposed to be able to run >>> non-aarch64 cpus too. >>> >>> Signed-off-by: Claudio Fontana >> >> Dump state has nothing to do with a15 or the 32-bit models. > > The relationship I see here is that 32-bit builds, > > only really need to use aarch32 version of the dump state, and they do not > need to see the aarch64 version. > >> Moving that should be a separate change. >> >> The gdbstub changes are also a separate change, afaics. > > To hopefully clarify things a bit more here, the current hierarchy seems not right for ARM CPUs also in this respect. TYPE_ARM_CPU is the ancestor of all ARM CPUs, ok. TYPE_AARCH64_CPU is child of TYPE_ARM_CPU, ok. There is however no TYPE_AARCH32_CPU, or TYPE_ARM32_CPU. So what ends up being necessary here (both in the mainline code and in this patch, which just makes it more explicit), is to make the ancestor TYPE_ARM_CPU effectively a 32bit CPU class, with TYPE_AARCH64_CPU overriding class methods in order to morph it into a 64bit CPU class. What this patch does is to make it explicit, by creating a cpu32.c module that contains this non-explicit 32bit ARM CPU class methods, and the registration function to register 32bit ARM cpus. Thanks, Claudio > But the main concern is to split more, I understand: will do. > >> >> I don't understand the a15 comment above. Is it really relevant to moving >> the >> 32-bit cpu models? 
> > > The point there was that we keep around a15 for KVM for the moment, instead > of relegating it to the set of "tcg-only" models, > so that virt board and qtest continue to function also in the KVM-only build. > > Mainly for this code here a question from my side: is the current code > actually already "wrong"? > > I mean, we unconditionally set the aarch64-capable cpu classes to all use > aarch64_gdb_arch_name and gdbstub64, > but what about an aarch64-capable cpu running in 32bit mode? > > Why don't we have, like tentatively done here for arm_cpu_dump_state, > > an arm_gdb_arch_name and an arm_cpu_gdb_read_register that check is_a64() and > call aaarch32_cpu_gdb_read_register or aarch64_cpu_gdb_read_register > accordingly? > > >> >> >> r~ >> >
[RFC PATCH v2 2/4] hw/arm/virt: Parse -smp cluster parameter in virt_smp_parse
There is a separate function virt_smp_parse() in hw/arm/virt.c used to parse cpu topology for the ARM machines. So add parsing of the -smp cluster parameter there; the total number of logical cpus will then be calculated as: max_cpus = sockets * clusters * cores * threads. In virt_smp_parse(), the logic for computing missing values prefers cores over sockets over threads. For compatibility, the value of clusters defaults to 1 if not explicitly specified. Signed-off-by: Yanan Wang --- hw/arm/virt.c | 32 ++-- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 57ef961cb5..51797628db 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -2639,35 +2639,38 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) if (opts) { unsigned cpus = qemu_opt_get_number(opts, "cpus", 0); unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); +unsigned clusters = qemu_opt_get_number(opts, "clusters", 1); unsigned cores = qemu_opt_get_number(opts, "cores", 0); unsigned threads = qemu_opt_get_number(opts, "threads", 0); +VirtMachineState *vms = VIRT_MACHINE(ms); /* - * Compute missing values; prefer cores over sockets and - * sockets over threads. + * Compute missing values; prefer cores over sockets and sockets + * over threads. For compatibility, value of clusters will have + * been set as default 1 if not explicitly specified. */ if (cpus == 0 || cores == 0) { sockets = sockets > 0 ? sockets : 1; threads = threads > 0 ? threads : 1; if (cpus == 0) { cores = cores > 0 ? cores : 1; -cpus = cores * threads * sockets; +cpus = sockets * clusters * cores * threads; } else { ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); -cores = ms->smp.max_cpus / (sockets * threads); +cores = ms->smp.max_cpus / (sockets * clusters * threads); } } else if (sockets == 0) { threads = threads > 0 ? threads : 1; -sockets = cpus / (cores * threads); +sockets = cpus / (clusters * cores * threads); sockets = sockets > 0 ? 
sockets : 1; } else if (threads == 0) { -threads = cpus / (cores * sockets); +threads = cpus / (sockets * clusters * cores); threads = threads > 0 ? threads : 1; -} else if (sockets * cores * threads < cpus) { +} else if (sockets * clusters * cores * threads < cpus) { error_report("cpu topology: " - "sockets (%u) * cores (%u) * threads (%u) < " - "smp_cpus (%u)", - sockets, cores, threads, cpus); + "sockets (%u) * clusters (%u) * cores (%u) * " + "threads (%u) < smp_cpus (%u)", + sockets, clusters, cores, threads, cpus); exit(1); } @@ -2678,11 +2681,11 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) exit(1); } -if (sockets * cores * threads != ms->smp.max_cpus) { +if (sockets * clusters * cores * threads != ms->smp.max_cpus) { error_report("cpu topology: " - "sockets (%u) * cores (%u) * threads (%u)" - "!= maxcpus (%u)", - sockets, cores, threads, + "sockets (%u) * clusters (%u) * cores (%u) * " + "threads (%u) != maxcpus (%u)", + sockets, clusters, cores, threads, ms->smp.max_cpus); exit(1); } @@ -2691,6 +2694,7 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) ms->smp.cores = cores; ms->smp.threads = threads; ms->smp.sockets = sockets; +vms->smp_clusters = clusters; } if (ms->smp.cpus > 1) { -- 2.19.1
Re: trace_FOO_tcg bit-rotted?
On Mon, Apr 12, 2021 at 08:06:57PM +0100, Alex Bennée wrote: > > Stefan Hajnoczi writes: > > > On Fri, Apr 09, 2021 at 05:29:08PM +0100, Alex Bennée wrote: > >> > >> Laurent Vivier writes: > >> > >> > Le 06/04/2021 à 18:00, Alex Bennée a écrit : > >> >> Hi, > >> >> > >> >> It's been awhile since I last played with this but I think we are > >> >> suffering from not having some test cases for tracing code > >> >> generation/execution in the tree. I tried adding a simple trace point to > >> >> see if I could track ERET calls: > >> >> > >> >> --8<---cut here---start->8--- > >> >> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c > >> >> index 0b42e53500..0d643f78fe 100644 > >> >> --- a/target/arm/translate-a64.c > >> >> +++ b/target/arm/translate-a64.c > >> >> @@ -36,6 +36,7 @@ > >> >> #include "exec/log.h" > >> >> > >> >> #include "trace-tcg.h" > >> >> +#include "trace.h" > >> >> #include "translate-a64.h" > >> >> #include "qemu/atomic128.h" > >> >> > >> >> @@ -2302,6 +2303,9 @@ static void disas_uncond_b_reg(DisasContext *s, > >> >> uint32_t insn) > >> >> default: > >> >> goto do_unallocated; > >> >> } > >> >> + > >> >> +trace_eret_tcg(s->current_el, dst); > >> >> + > >> >> if (tb_cflags(s->base.tb) & CF_USE_ICOUNT) { > >> >> gen_io_start(); > >> >> } > >> >> diff --git a/target/arm/trace-events b/target/arm/trace-events > >> >> index 41c63d7570..2d4fca16a1 100644 > >> >> --- a/target/arm/trace-events > >> >> +++ b/target/arm/trace-events > >> >> @@ -1,5 +1,10 @@ > >> >> # See docs/devel/tracing.txt for syntax documentation. > >> >> > >> >> +# translate-a64.c > >> >> +# Mode: softmmu > >> >> +# Targets: TCG(aarch64-softmmu) > >> >> +tcg eret(int current_el, TCGv target_el) "trans_eret: from EL%d", > >> >> "exec_eret: EL%d to EL%"PRId64 > >> > > >> > If I read correctly, the name should be eret_tcg() > >> > And I'm not sure TCGv will be accepted as a parameter type, use > >> > uint64_t instead (and %PRIu64) > >> > >> This was my confusion. 
I thought the trace-events file was prefixed with > >> tcg like guest_mem_before: > >> > >> vcpu tcg guest_mem_before(TCGv vaddr, uint16_t info) "info=%d", > >> "vaddr=0x%016"PRIx64" info=%d" > >> > >> and that signalled the tools to generate _trans, _exec and _tcg hooks in > >> the generated files. The trace code (see other patch) also has logic to > >> translate natural TCG types into the native types as well signalling > >> which values are only visible for the _exec portion. > >> > >> Maybe I'm overthinking this. Perhaps all the TCG tracing use cases are > >> just as easily supported with TCG plugins now and we should deprecate > >> this unused bit of complexity. I certainly understand the plugin > >> interactions better ;-) > > > > Lluís: are you happy to deprecate tcg trace events in favor of TCG > > plugins? > > > > My question is whether TCG plugins are really equivalent here. Will TCG > > plugin users have to write their own log file output code to extract > > this information from the QEMU process (i.e. reinventing tracing)? > > Yes - although there is no reason we couldn't expose the trace output as > a helper. Currently there is: > > /** >* qemu_plugin_outs() - output string via QEMU's logging system >* @string: a string >*/ > void qemu_plugin_outs(const char *string); > > which allows the user to echo to the log in conjunction with -d plugin > on the command line. Plugins are of course free to do their own thing. > > > Is > > the performance at least as good as tracing? > > Well like all things that depends ;-) > > Generally on the sort of events you tend to trace (like the example > memory access) you usually either want to aggregate or filter your > results. With trace output there is no way to do this and you end up > post processing potentially very large files (smaller if you use the > non-default binary format). I don't know if a similar thing is possible > with uprobes and ebpf - I've only ever used the simple logging output in > anger. 
The example plugins generally do things like count total > accesses: > > https://gitlab.com/qemu-project/qemu/-/blob/master/tests/plugin/mem.c > > (pretty fast in comparison to writing out reams of data) > > or aggregate the results: > > > https://gitlab.com/qemu-project/qemu/-/blob/master/contrib/plugins/hotpages.c > > (probably slower while running QEMU, but faster overall because no post > processing of log files required.) > > Of course plugins are a non-default build option because although light > it does have a small performance impact on code generation even when no > instrumentation is occurring. As a result you can't use it without > building a version first. > > If we had test code in the build that used the TCG tracing abilities I > wouldn't worry because at least we are defending the feature and we > wouldn't run into probl
[Bug 1923583] [NEW] colo: pvm flush failed after svm killed
Public bug reported: Hi, the primary VM flush failed after killing the secondary VM (svm), which leaves the primary VM's guest filesystem unavailable. qemu version: 5.2.0 host/guest os: CentOS Linux release 7.6.1810 (Core) Reproduce steps: 1. create a COLO VM following https://github.com/qemu/qemu/blob/master/docs/COLO-FT.txt 2. kill the secondary VM (don't remove the nbd child from quorum on the primary VM) and wait for a minute; the interval depends on the guest OS. Result: the primary VM's filesystem shuts down because of a flush cache error. After several tests, I found that qemu-5.0.0 worked well, and that commit https://git.qemu.org/?p=qemu.git;a=commit;h=883833e29cb800b4d92b5d4736252f4004885191 (block: Flush all children in generic code) introduced this change; both virtio-blk and ide turned out to be affected. I think the nbd (replication) flush failure makes bdrv_co_flush(quorum_bs) fail; here is the call stack. #0 bdrv_co_flush (bs=0x56242b3cc0b0=nbd_bs) at ../block/io.c:2856 #1 0x562428b0f399 in bdrv_co_flush (bs=0x56242b3c7e00=replication_bs) at ../block/io.c:2920 #2 0x562428b0f399 in bdrv_co_flush (bs=0x56242a4ad800=quorum_bs) at ../block/io.c:2920 #3 0x562428b70d56 in blk_do_flush (blk=0x56242a4ad4a0) at ../block/block-backend.c:1672 #4 0x562428b70d87 in blk_aio_flush_entry (opaque=0x7fd0980073f0) at ../block/block-backend.c:1680 #5 0x562428c5f9a7 in coroutine_trampoline (i0=-1409269904, i1=32721) at ../util/coroutine-ucontext.c:173 I am not sure whether I am using COLO improperly. Can we assume that the nbd child of quorum is removed immediately after the svm crashes? Or is it really a bug? Does the following patch fix it? Help is needed! Thanks a lot! 
diff --git a/block/quorum.c b/block/quorum.c index cfc1436..f2c0805 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -1279,7 +1279,7 @@ static BlockDriver bdrv_quorum = { .bdrv_dirname = quorum_dirname, .bdrv_co_block_status = quorum_co_block_status, -.bdrv_co_flush_to_disk = quorum_co_flush, +.bdrv_co_flush = quorum_co_flush, .bdrv_getlength = quorum_getlength, ** Affects: qemu Importance: Undecided Status: New ** Patch added: "primary guest kernel message" https://bugs.launchpad.net/bugs/1923583/+attachment/5487235/+files/primary_guest_dmesg.log -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1923583 Title: colo: pvm flush failed after svm killed Status in QEMU: New To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1923583/+subscriptions
Re: [Bug 1923583] [NEW] colo: pvm flush failed after svm killed
Patchew URL: https://patchew.org/QEMU/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com/ Hi, This series seems to have some coding style problems. See output below for more information: Type: series Message-id: 161830261172.29345.7866671962411605196.malone...@wampee.canonical.com Subject: [Bug 1923583] [NEW] colo: pvm flush failed after svm killed === TEST SCRIPT BEGIN === #!/bin/bash git rev-parse base > /dev/null || exit 0 git config --local diff.renamelimit 0 git config --local diff.renames True git config --local diff.algorithm histogram ./scripts/checkpatch.pl --mailback base.. === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 From https://github.com/patchew-project/qemu * [new tag] patchew/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com -> patchew/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com - [tag update] patchew/20210413081008.3409459-1-f4...@amsat.org -> patchew/20210413081008.3409459-1-f4...@amsat.org Switched to a new branch 'test' f43885d colo: pvm flush failed after svm killed === OUTPUT BEGIN === ERROR: Missing Signed-off-by: line(s) total: 1 errors, 0 warnings, 8 lines checked Commit f43885d3a7e9 (colo: pvm flush failed after svm killed) has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. === OUTPUT END === Test command exited with code: 1 The full log is available at http://patchew.org/logs/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com/testing.checkpatch/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-de...@redhat.com
Re: [PATCH] vhost-user-fs: fix features handling
On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > Make virtio-fs take into account server capabilities. > > Just returning requested features assumes that all of them are implemented > by the server, and results in setting an unsupported configuration if some of them > are absent. > > Signed-off-by: Anton Kuchin > --- > hw/virtio/vhost-user-fs.c | 17 + > 1 file changed, 13 insertions(+), 4 deletions(-) > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > index ac4fc34b36..6cf983ba0e 100644 > --- a/hw/virtio/vhost-user-fs.c > +++ b/hw/virtio/vhost-user-fs.c > @@ -24,6 +24,14 @@ > #include "monitor/monitor.h" > #include "sysemu/sysemu.h" > > +static const int user_feature_bits[] = { > +VIRTIO_F_VERSION_1, > +VIRTIO_RING_F_INDIRECT_DESC, > +VIRTIO_RING_F_EVENT_IDX, > +VIRTIO_F_NOTIFY_ON_EMPTY, > +VHOST_INVALID_FEATURE_BIT > +}; Please add: VIRTIO_F_RING_PACKED VIRTIO_F_IOMMU_PLATFORM QEMU's virtiofsd does not enable either of these for now, but it's worth allowing the vhost-user device backend to participate in negotiation so that this can change in the future (or alternative virtiofsd implementations can support these features). signature.asc Description: PGP signature
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 8:41 AM Markus Armbruster wrote: > > Li Zhang writes: > > > From: Li Zhang > > > > For some scenarios, it needs to hot-add a monitor device. > > But QEMU doesn't support hotplug yet. It also works by adding > > a monitor with a null backend by default and then changing its > > backend to socket by the QMP command "chardev-change". > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > Signed-off-by: Li Zhang > > I think what you're trying to say is that chardev-change does not > work when the character device being changed is used by a QMP monitor. > Correct? > I mean that when the character device is a monitor device, it doesn't work with a QMP monitor. For example, I add 2 QMP monitors and change one of the monitors' backends from socket to a null device. It doesn't work because the monitor device needs to support chardev-change. > If yes, how exactly does it misbehave? The chardev-change command needs the specific device's change callback function. > > Does it work with an HMP monitor? No, it doesn't work. 
> > > --- > > monitor/monitor-internal.h | 3 +++ > > monitor/monitor.c | 2 +- > > monitor/qmp.c | 42 +++--- > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > index 40903d6386..2df6dd21de 100644 > > --- a/monitor/monitor-internal.h > > +++ b/monitor/monitor-internal.h > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char *list); > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > Error **errp); > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > + void *opaque); > > + > > #endif > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > index e94f532cf5..2d255bab18 100644 > > --- a/monitor/monitor.c > > +++ b/monitor/monitor.c > > @@ -157,7 +157,7 @@ static inline bool monitor_is_hmp_non_interactive(const > > Monitor *mon) > > > > static void monitor_flush_locked(Monitor *mon); > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > >void *opaque) > > { > > Monitor *mon = opaque; > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > index 2326bd7f9b..55cfb230d9 100644 > > --- a/monitor/qmp.c > > +++ b/monitor/qmp.c > > @@ -44,6 +44,7 @@ struct QMPRequest { > > Error *err; > > }; > > typedef struct QMPRequest QMPRequest; > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > g_queue_free(mon->qmp_requests); > > } > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > +static int monitor_qmp_change(void *opaque) > > +{ > > +MonitorQMP *mon = opaque; > > + > > +mon->common.use_io_thread = > > +qemu_chr_has_feature(mon->common.chr.chr, > > QEMU_CHAR_FEATURE_GCONTEXT); > > + > > +if (mon->common.use_io_thread) { > > 
+aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > +monitor_qmp_set_handlers_bh, mon); > > +} else { > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > + monitor_qmp_read, monitor_qmp_event, > > + monitor_qmp_change, &mon->common, NULL, > > true); > > +} > > + > > +if (mon->common.out_watch) { > > +g_source_remove(mon->common.out_watch); > > All other updates of @out_watch are under @mon_lock. Why not this one? Sorry, I missed it. I will correct it. > > I have no idea whether g_source_remove() is the right function to call. > Its documentation says "You must use g_source_destroy() for sources > added to a non-default main context." The qemu_chr_fe_set_handlers() > contract is of no help. > > Documentation of g_source_destroy() confuses some more: "This does not > unref the GSource: if you still hold a reference, use g_source_unref() > to drop it. > > Marc-André, can you help? > > > +qemu_mutex_lock(&mon->common.mon_lock); > > +mon->common.out_watch = > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > + monitor_unblocked, &mon->common); > > Bad indentation. Better: > > mon->common.out_watch = > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, >monitor_unblocked, &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, > monitor_unblocked, >
Re: [Virtio-fs] [PATCH] vhost-user-fs: fix features handling
On Mon, Apr 12, 2021 at 02:43:16PM -0400, Vivek Goyal wrote: > On Sun, Apr 11, 2021 at 09:21:54AM +0300, Anton Kuchin wrote: > > > > On 09/04/2021 18:56, Vivek Goyal wrote: > > > On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > > > > Make virtio-fs take into account server capabilities. > > > > > > > > Just returning requested features assumes that all of them are > > > > implemented > > > > by the server and results in setting an unsupported configuration if some of > > > > them > > > > are absent. > > > > > > > > Signed-off-by: Anton Kuchin > > > [CC stefan and qemu-devel.] > > > > > > Can you give more details of what problem exactly you are facing? Or is > > > this fix about avoiding a future problem where the device can refuse > > > to support a feature qemu is requesting? > > > > This fixes an existing problem: qemu ignores features (un)supported by the > > backend, and this works fine only if the backend features match the features of qemu. > > Otherwise qemu incorrectly assumes that the backend supports all of them and > > calls vhost_set_features() not taking into account the result of the previous > > vhost_get_features() call. This breaks the protocol and can crash the server or > > cause incorrect behavior. > > > > This is why I hope it to be accepted in time for the 6.0 release. > > > > > IIUC, this patch is preparing a list of features the vhost-user-fs device > > > can support. Then it calls vhost_get_features() which makes sure that > > > all these features are supported by the real vhost-user device (hdev->features). > > > If not, then the corresponding feature is reset and the remaining features > > > are returned to the caller. > > When this callback is executed in virtio_bus_device_plugged() the list of > > features that the vhost-backend supports has been already obtained earlier by > > vhost_user_get_features() in vuf_device_realize() and stored in > > hdev->features. 
> > > vuf_get_features() should return bitmask of features that > > are common for vhost backend (hdev->features) and frontend > > (vdev->host_features). > > But that's not what exactly this patch seems to be doing. > IIUC, It only resets some of the features from the list passed from > the caller. So whatever has been defined in user_feature_bits[], > and if these features are not supported by the vhost-user backend, then > that feature will be reset before returning to the caller. > > So the question is what are those features which should be in > user_feature_bits[]? For example, by default libvhost-user > also supports: > > /* vhost-user feature bits */ > 1ULL << VHOST_F_LOG_ALL | > 1ULL << VHOST_USER_F_PROTOCOL_FEATURES; > > Should those be in user_feature_bits[] too? So that if a custom > vhost-user-fs backend does not support VHOST_F_LOG_ALL or > VHOST_USER_F_PROTOCOL_FEATURES, it is reset. > > IIUC, your current patch is not going to reset these features if > the caller passed you those in vuf_get_features(,requested_features). > > So to me this becomes more of a question of what are those common > features which both ends of the vhost-user device should support for > it to work, and which should be checked in vuf_get_features(). VHOST_F_LOG_ALL and VHOST_USER_F_PROTOCOL_FEATURES are controlled by hw/virtio/vhost.c and hw/virtio/vhost-user.c. These feature bits are part of the vhost-user protocol and are not involved in guest-visible VIRTIO feature negotiation. It's confusing because these bits use the same namespace as VIRTIO features, but it is correct to omit them from user_feature_bits[]. Stefan signature.asc Description: PGP signature
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 08:40:59AM +0200, Markus Armbruster wrote: > Li Zhang writes: > > > From: Li Zhang > > > > For some scenarios, it needs to hot-add a monitor device. > > But QEMU doesn't support hotplug yet. It also works by adding > > a monitor with null backend by default and then change its > > backend to socket by QMP command "chardev-change". If you need ability to hot-add monitor instances, why not just implement that feature directly, instead of pre-creating monitors with null backends and then later changing the backend ? > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > Signed-off-by: Li Zhang > > I think what what you're trying to say is that chardev-change does not > work when the character device changes is used by a QMP monitor. > Correct? > > If yes, how exactly does it misbehave? > > Does it work with an HMP monitor? > > > --- > > monitor/monitor-internal.h | 3 +++ > > monitor/monitor.c | 2 +- > > monitor/qmp.c | 42 +++--- > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > index 40903d6386..2df6dd21de 100644 > > --- a/monitor/monitor-internal.h > > +++ b/monitor/monitor-internal.h > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char *list); > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > Error **errp); > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > + void *opaque); > > + > > #endif > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > index e94f532cf5..2d255bab18 100644 > > --- a/monitor/monitor.c > > +++ b/monitor/monitor.c > > @@ -157,7 +157,7 @@ static inline bool monitor_is_hmp_non_interactive(const > > Monitor *mon) > > > > static void monitor_flush_locked(Monitor *mon); > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > >void *opaque) > > { > > Monitor 
*mon = opaque; > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > index 2326bd7f9b..55cfb230d9 100644 > > --- a/monitor/qmp.c > > +++ b/monitor/qmp.c > > @@ -44,6 +44,7 @@ struct QMPRequest { > > Error *err; > > }; > > typedef struct QMPRequest QMPRequest; > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > g_queue_free(mon->qmp_requests); > > } > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > +static int monitor_qmp_change(void *opaque) > > +{ > > +MonitorQMP *mon = opaque; > > + > > +mon->common.use_io_thread = > > +qemu_chr_has_feature(mon->common.chr.chr, > > QEMU_CHAR_FEATURE_GCONTEXT); > > + > > +if (mon->common.use_io_thread) { > > +aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > +monitor_qmp_set_handlers_bh, mon); > > +} else { > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > + monitor_qmp_read, monitor_qmp_event, > > + monitor_qmp_change, &mon->common, NULL, > > true); > > +} > > + > > +if (mon->common.out_watch) { > > +g_source_remove(mon->common.out_watch); > > All other updates of @out_watch are under @mon_lock. Why not this one? > > I have no idea whether g_source_remove() is the right function to call. > Its documentation says "You must use g_source_destroy() for sources > added to a non-default main context." The qemu_chr_fe_set_handlers() > contract is of no help. > > Documentation of g_source_destroy() confuses some more: "This does not > unref the GSource: if you still hold a reference, use g_source_unref() > to drop it. > > Marc-André, can you help? > > > +qemu_mutex_lock(&mon->common.mon_lock); > > +mon->common.out_watch = > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > + monitor_unblocked, &mon->common); > > Bad indentation. 
Better: > > mon->common.out_watch = > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, >monitor_unblocked, &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, > monitor_unblocked, > &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, >
Re: [Bug 1923497] [NEW] bios_linker_loader_add_checksum: Assertion `start_offset < file->blob->len' failed
On Mon, 12 Apr 2021 20:29:04 - Ed Davison <1923...@bugs.launchpad.net> wrote: > Public bug reported: > > Trying to boot/start a Windows 10 VM. Worked until recently when this > error started showing up. > > I have the following installed on Fedora 33: > qemu-kvm-5.1.0-9.fc33.x86_64 Could you add the QEMU command line used in your case? > > This is the error: > > Error starting domain: internal error: process exited while connecting > to monitor: qemu-system-x86_64: /builddir/build/BUILD/qemu-5.1.0/hw/acpi > /bios-linker-loader.c:239: bios_linker_loader_add_checksum: Assertion > `start_offset < file->blob->len' failed. > > Traceback (most recent call last): > File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in > cb_wrapper > callback(asyncjob, *args, **kwargs) > File "/usr/share/virt-manager/virtManager/asyncjob.py", line 101, in tmpcb > callback(*args, **kwargs) > File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line > 57, in newfn > ret = fn(self, *args, **kwargs) > File "/usr/share/virt-manager/virtManager/object/domain.py", line 1329, in > startup > self._backend.create() > File "/usr/lib64/python3.9/site-packages/libvirt.py", line 1234, in create > if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self) > libvirt.libvirtError: internal error: process exited while connecting to > monitor: qemu-system-x86_64: > /builddir/build/BUILD/qemu-5.1.0/hw/acpi/bios-linker-loader.c:239: > bios_linker_loader_add_checksum: Assertion `start_offset < file->blob->len' > failed. > > I see this was referenced in a patch from some time ago and supposedly > fixed. Here is the patch info I was able to find: > > http://next.patchew.org/QEMU/1515677902-23436-1-git-send-email- peter.mayd...@linaro.org/1515677902-23436-10-git-send-email- peter.mayd...@linaro.org/ > > ** Affects: qemu > Importance: Undecided > Status: New >
Re: [PATCH RFC RESEND v2 4/6] hw/arm/virt-acpi-build: Add explicit idmap info in IORT table
Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: > From: Xingang Wang > > The idmap of smmuv3 and root complex covers the whole RID space for now, > this patch adds explicit idmap info according to the root bus number range. > This adds an smmuv3 idmap for each bus that has the iommu property enabled. > > Signed-off-by: Xingang Wang > Signed-off-by: Jiahui Cen > --- > hw/arm/virt-acpi-build.c | 103 ++- > 1 file changed, 81 insertions(+), 22 deletions(-) > > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c > index f5a2b2d4cb..5491036c86 100644 > --- a/hw/arm/virt-acpi-build.c > +++ b/hw/arm/virt-acpi-build.c > @@ -44,6 +44,7 @@ > #include "hw/acpi/tpm.h" > #include "hw/pci/pcie_host.h" > #include "hw/pci/pci.h" > +#include "hw/pci/pci_bus.h" > #include "hw/pci-host/gpex.h" > #include "hw/arm/virt.h" > #include "hw/mem/nvdimm.h" > @@ -237,6 +238,41 @@ static void acpi_dsdt_add_tpm(Aml *scope, > VirtMachineState *vms) > aml_append(scope, dev); > } > > +typedef > +struct AcpiIortMapping { > +AcpiIortIdMapping idmap; > +bool iommu; > +} AcpiIortMapping; > + > +/* For all PCI host bridges, walk and insert DMAR scope */ this comment should rather be in the caller; also, DMAR is not an ARM term. 
I would add the comment for this function: build the ID mapping for a given PCI host bridge > +static int > +iort_host_bridges(Object *obj, void *opaque) > +{ > +GArray *map_blob = opaque; > +AcpiIortMapping map; > +AcpiIortIdMapping *idmap = &map.idmap; > +int bus_num, max_bus; > + > +if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { > +PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus; > + > +if (bus) { > +bus_num = pci_bus_num(bus); > +max_bus = pci_root_bus_max_bus(bus); > + > +idmap->input_base = cpu_to_le32(bus_num << 8); > +idmap->id_count = cpu_to_le32((max_bus - bus_num + 1) << 8); > +idmap->output_base = cpu_to_le32(bus_num << 8); > +idmap->flags = cpu_to_le32(0); > + > +map.iommu = pci_root_bus_has_iommu(bus); if iommu is not set, we don't need to populate the idmap and we may even directly continue, i.e. not add the element to map_blob, no? > +g_array_append_val(map_blob, map); > +} > +} > + > +return 0; > +} > + > static void > build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) > { > @@ -247,6 +283,21 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > AcpiIortSmmu3 *smmu; > size_t node_size, iort_node_offset, iort_length, smmu_offset = 0; > AcpiIortRC *rc; > +int smmu_mapping_count; > +GArray *map_blob = g_array_new(false, true, sizeof(AcpiIortMapping)); > +AcpiIortMapping *map; > + > +/* pci_for_each_bus(vms->bus, insert_map, map_blob); */ comment to be removed > +object_child_foreach_recursive(object_get_root(), > + iort_host_bridges, map_blob); > + > +smmu_mapping_count = 0; > +for (int i = 0; i < map_blob->len; i++) { > +map = &g_array_index(map_blob, AcpiIortMapping, i); > +if (map->iommu) { > +smmu_mapping_count++; > +} > +} > > iort = acpi_data_push(table_data, sizeof(*iort)); > > @@ -280,13 +331,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > > /* SMMUv3 node */ > smmu_offset = iort_node_offset + node_size; > -node_size = sizeof(*smmu) + sizeof(*idmap); 
+node_size = sizeof(*smmu) + sizeof(*idmap) * smmu_mapping_count; > iort_length += node_size; > smmu = acpi_data_push(table_data, node_size); > > smmu->type = ACPI_IORT_NODE_SMMU_V3; > smmu->length = cpu_to_le16(node_size); > -smmu->mapping_count = cpu_to_le32(1); > +smmu->mapping_count = cpu_to_le32(smmu_mapping_count); > smmu->mapping_offset = cpu_to_le32(sizeof(*smmu)); > smmu->base_address = cpu_to_le64(vms->memmap[VIRT_SMMU].base); > smmu->flags = cpu_to_le32(ACPI_IORT_SMMU_V3_COHACC_OVERRIDE); > @@ -295,23 +346,28 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > smmu->gerr_gsiv = cpu_to_le32(irq + 2); > smmu->sync_gsiv = cpu_to_le32(irq + 3); > > -/* Identity RID mapping covering the whole input RID range */ > -idmap = &smmu->id_mapping_array[0]; > -idmap->input_base = 0; > -idmap->id_count = cpu_to_le32(0x); > -idmap->output_base = 0; > -/* output IORT node is the ITS group node (the first node) */ > -idmap->output_reference = cpu_to_le32(iort_node_offset); > +for (int i = 0, j = 0; i < map_blob->len; i++) { > +map = &g_array_index(map_blob, AcpiIortMapping, i); > + > +if (!map->iommu) { > +continue; > +} > + > +
[PATCH v5 00/14] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
Based-on: 20210406080126.24010-1-da...@redhat.com Some cleanups previously sent in other context (resizeable allocations), followed by RAM_NORESERVE, implementing it under Linux using MAP_NORESERVE, and letting users configure it for memory backends using the "reserve" property (default: true). MAP_NORESERVE under Linux has in the context of QEMU an effect on 1) Private/shared anonymous memory -> memory-backend-ram,id=mem0,size=10G 2) Private fd-based mappings -> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0 -> memory-backend-memfd,id=mem0,size=10G 3) Private/shared hugetlb mappings -> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or huge pages (3) for the whole memory region. The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS "this mapping might be very sparse". This essentially allows avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using virtio-mem and also supporting hugetlbfs in the future. v4 -> v5: - Sent out shared anonymous RAM fixes separately - Rebased - "hostmem: Wire up RAM_NORESERVE via "reserve" property" -- Adjusted/simplified description of new "reserve" property -- Properly add it to qapi/qom.json - "qmp: Clarify memory backend properties returned via query-memdev" -- Added - "qmp: Include "share" property of memory backends" -- Added - "hmp: Print "share" property of memory backends with "info memdev"" - Added - "qmp: Include "reserve" property of memory backends" -- Adjust description of new "reserve" property v3 -> v4: - Minor comment/description updates - "softmmu/physmem: Fix ram_block_discard_range() to handle shared ..." -- Extended description - "util/mmap-alloc: Pass flags instead of separate bools to ..." 
-- Move flags to include/qemu/osdep.h and rename to "QEMU_MAP_*" - "memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()" -- Adjust to new flags. Handle errors in mmap_activate() for now. - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux" -- Restrict support to Linux only for now - "qmp: Include "reserve" property of memory backends" -- Added - "hmp: Print "reserve" property of memory backends with ..." -- Added v2 -> v3: - Renamed "softmmu/physmem: Drop "shared" parameter from ram_block_add()" to "softmmu/physmem: Mark shared anonymous memory RAM_SHARED" and adjusted the description - Added "softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory" - Added "softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous memory" - Added "util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()" - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE" -- Further tweak code comments -- Handle shared anonymous memory v1 -> v2: - Rebased to upstream and phs_mem_alloc simplifications -- Upstream added the "map_offset" parameter to many RAM allocation interfaces. - "softmmu/physmem: Drop "shared" parameter from ram_block_add()" -- Use local variable "shared" - "memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()" -- Simplify due to phs_mem_alloc changes - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE" -- Add a whole bunch of comments. -- Exclude shared anonymous memory that QEMU doesn't use -- Special-case readonly mappings Cc: Peter Xu Cc: "Michael S. Tsirkin" Cc: Eduardo Habkost Cc: "Dr. 
David Alan Gilbert"
Cc: Richard Henderson
Cc: Paolo Bonzini
Cc: Igor Mammedov
Cc: "Philippe Mathieu-Daudé"
Cc: Stefan Hajnoczi
Cc: Murilo Opsfelder Araujo
Cc: Greg Kurz
Cc: Liam Merwick
Cc: Marcel Apfelbaum

David Hildenbrand (14):
  util/mmap-alloc: Factor out calculation of the pagesize for the guard page
  util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  util/mmap-alloc: Factor out activating of memory to mmap_activate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
  softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()
  util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  hostmem: Wire up RAM_NORESERVE via "reserve" property
  qmp: Clarify memory backend properties returned via query-memdev
  qmp: Include "share" property of memory backends
  hmp: Print "share" property of memory backends with "info memdev"
  qmp: Include "reserve" property of memory backends
  hmp: Print "reserve" property of memory backends with "info memdev"

 backends/hostmem-file.c | 11 +-
 backends/hostmem-memfd.c | 8 +-
 backends/hostmem-ram.c | 7 +-
 backends/hostmem.c | 32 +++
 hw/core/machine-hmp-cmds.c | 4 +
 hw/core/
[PATCH v5 01/14] util/mmap-alloc: Factor out calculation of the pagesize for the guard page
Let's factor out calculating the size of the guard page and rename the variable to make it clearer that this pagesize only applies to the guard page. Reviewed-by: Peter Xu Acked-by: Murilo Opsfelder Araujo Cc: Igor Kotrasinski Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index e6fa8b598b..24854064b4 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +static inline size_t mmap_guard_pagesize(int fd) +{ +#if defined(__powerpc64__) && defined(__linux__) +/* Mappings in the same segment must share the same page size */ +return qemu_fd_getpagesize(fd); +#else +return qemu_real_host_page_size; +#endif +} + void *qemu_ram_mmap(int fd, size_t size, size_t align, @@ -90,12 +100,12 @@ void *qemu_ram_mmap(int fd, bool is_pmem, off_t map_offset) { +const size_t guard_pagesize = mmap_guard_pagesize(fd); int prot; int flags; int map_sync_flags = 0; int guardfd; size_t offset; -size_t pagesize; size_t total; void *guardptr; void *ptr; @@ -116,8 +126,7 @@ void *qemu_ram_mmap(int fd, * anonymous memory is OK. */ flags = MAP_PRIVATE; -pagesize = qemu_fd_getpagesize(fd); -if (fd == -1 || pagesize == qemu_real_host_page_size) { +if (fd == -1 || guard_pagesize == qemu_real_host_page_size) { guardfd = -1; flags |= MAP_ANONYMOUS; } else { @@ -126,7 +135,6 @@ void *qemu_ram_mmap(int fd, } #else guardfd = -1; -pagesize = qemu_real_host_page_size; flags = MAP_PRIVATE | MAP_ANONYMOUS; #endif @@ -138,7 +146,7 @@ void *qemu_ram_mmap(int fd, assert(is_power_of_2(align)); /* Always align to host page size */ -assert(align >= pagesize); +assert(align >= guard_pagesize); flags = MAP_FIXED; flags |= fd == -1 ? MAP_ANONYMOUS : 0; @@ -193,8 +201,8 @@ void *qemu_ram_mmap(int fd, * a guard page guarding against potential buffer overflows. 
*/ total -= offset; -if (total > size + pagesize) { -munmap(ptr + size + pagesize, total - size - pagesize); +if (total > size + guard_pagesize) { +munmap(ptr + size + guard_pagesize, total - size - guard_pagesize); } return ptr; @@ -202,15 +210,8 @@ void *qemu_ram_mmap(int fd, void qemu_ram_munmap(int fd, void *ptr, size_t size) { -size_t pagesize; - if (ptr) { /* Unmap both the RAM block and the guard page */ -#if defined(__powerpc64__) && defined(__linux__) -pagesize = qemu_fd_getpagesize(fd); -#else -pagesize = qemu_real_host_page_size; -#endif -munmap(ptr, size + pagesize); +munmap(ptr, size + mmap_guard_pagesize(fd)); } } -- 2.30.2
[PATCH v5 02/14] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
We want to reserve a memory region without actually populating memory. Let's factor that out. Reviewed-by: Igor Kotrasinski Acked-by: Murilo Opsfelder Araujo Reviewed-by: Richard Henderson Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 58 +++ 1 file changed, 33 insertions(+), 25 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 24854064b4..223d66219c 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +/* + * Reserve a new memory region of the requested size to be used for mapping + * from the given fd (if any). + */ +static void *mmap_reserve(size_t size, int fd) +{ +int flags = MAP_PRIVATE; + +#if defined(__powerpc64__) && defined(__linux__) +/* + * On ppc64 mappings in the same segment (aka slice) must share the same + * page size. Since we will be re-allocating part of this segment + * from the supplied fd, we should make sure to use the same page size, to + * this end we mmap the supplied fd. In this case, set MAP_NORESERVE to + * avoid allocating backing store memory. + * We do this unless we are using the system page size, in which case + * anonymous memory is OK. + */ +if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) { +fd = -1; +flags |= MAP_ANONYMOUS; +} else { +flags |= MAP_NORESERVE; +} +#else +fd = -1; +flags |= MAP_ANONYMOUS; +#endif + +return mmap(0, size, PROT_NONE, flags, fd, 0); +} + static inline size_t mmap_guard_pagesize(int fd) { #if defined(__powerpc64__) && defined(__linux__) @@ -104,7 +136,6 @@ void *qemu_ram_mmap(int fd, int prot; int flags; int map_sync_flags = 0; -int guardfd; size_t offset; size_t total; void *guardptr; @@ -116,30 +147,7 @@ void *qemu_ram_mmap(int fd, */ total = size + align; -#if defined(__powerpc64__) && defined(__linux__) -/* On ppc64 mappings in the same segment (aka slice) must share the same - * page size. 
Since we will be re-allocating part of this segment - * from the supplied fd, we should make sure to use the same page size, to - * this end we mmap the supplied fd. In this case, set MAP_NORESERVE to - * avoid allocating backing store memory. - * We do this unless we are using the system page size, in which case - * anonymous memory is OK. - */ -flags = MAP_PRIVATE; -if (fd == -1 || guard_pagesize == qemu_real_host_page_size) { -guardfd = -1; -flags |= MAP_ANONYMOUS; -} else { -guardfd = fd; -flags |= MAP_NORESERVE; -} -#else -guardfd = -1; -flags = MAP_PRIVATE | MAP_ANONYMOUS; -#endif - -guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0); - +guardptr = mmap_reserve(total, fd); if (guardptr == MAP_FAILED) { return MAP_FAILED; } -- 2.30.2
[PATCH v5 06/14] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
Let's pass flags instead of bools to prepare for passing other flags and update the documentation of qemu_ram_mmap(). Introduce new QEMU_MAP_ flags that abstract the mmap() PROT_ and MAP_ flag handling and simplify it. We expose only flags that are currently supported by qemu_ram_mmap(). Maybe, we'll see qemu_mmap() in the future as well that can implement these flags. Note: We don't use MAP_ flags as some flags (e.g., MAP_SYNC) are only defined for some systems and we want to always be able to identify these flags reliably inside qemu_ram_mmap() -- for example, to properly warn when some future flags are not available or effective on a system. Also, this way we can simplify PROT_ handling as well. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/qemu/mmap-alloc.h | 16 +--- include/qemu/osdep.h | 18 ++ softmmu/physmem.c | 8 +--- util/mmap-alloc.c | 15 --- util/oslib-posix.c| 3 ++- 5 files changed, 42 insertions(+), 18 deletions(-) diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h index 456ff87df1..90d0eee705 100644 --- a/include/qemu/mmap-alloc.h +++ b/include/qemu/mmap-alloc.h @@ -7,18 +7,22 @@ size_t qemu_fd_getpagesize(int fd); size_t qemu_mempath_getpagesize(const char *mem_path); /** - * qemu_ram_mmap: mmap the specified file or device. + * qemu_ram_mmap: mmap anonymous memory, the specified file or device. + * + * mmap() abstraction to map guest RAM, simplifying flag handling, taking + * care of alignment requirements and installing guard pages. * * Parameters: * @fd: the file or the device to mmap * @size: the number of bytes to be mmaped * @align: if not zero, specify the alignment of the starting mapping address; * otherwise, the alignment in use will be determined by QEMU. - * @readonly: true for a read-only mapping, false for read/write. - * @shared: map has RAM_SHARED flag. - * @is_pmem: map has RAM_PMEM flag. 
+ * @qemu_map_flags: QEMU_MAP_* flags * @map_offset: map starts at offset of map_offset from the start of fd * + * Internally, MAP_PRIVATE, MAP_ANONYMOUS and MAP_SHARED_VALIDATE are set + * implicitly based on other parameters. + * * Return: * On success, return a pointer to the mapped area. * On failure, return MAP_FAILED. @@ -26,9 +30,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path); void *qemu_ram_mmap(int fd, size_t size, size_t align, -bool readonly, -bool shared, -bool is_pmem, +uint32_t qemu_map_flags, off_t map_offset); void qemu_ram_munmap(int fd, void *ptr, size_t size); diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index ba15be9c56..5ba89f5460 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -347,6 +347,24 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared); void qemu_vfree(void *ptr); void qemu_anon_ram_free(void *ptr, size_t size); +/* + * Abstraction of PROT_ and MAP_ flags as passed to mmap(), for example, + * consumed by qemu_ram_mmap(). + */ + +/* Map PROT_READ instead of PROT_READ | PROT_WRITE. */ +#define QEMU_MAP_READONLY (1 << 0) + +/* Use MAP_SHARED instead of MAP_PRIVATE. */ +#define QEMU_MAP_SHARED (1 << 1) + +/* + * Use MAP_SYNC | MAP_SHARED_VALIDATE if supported. Ignored without + * QEMU_MAP_SHARED. If mapping fails, warn and fallback to !QEMU_MAP_SYNC. 
+ */ +#define QEMU_MAP_SYNC (1 << 2) + + #define QEMU_MADV_INVALID -1 #if defined(CONFIG_MADVISE) diff --git a/softmmu/physmem.c b/softmmu/physmem.c index fdcd38ba61..d1a1a6b72e 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block, off_t offset, Error **errp) { +uint32_t qemu_map_flags; void *area; block->page_size = qemu_fd_getpagesize(fd); @@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block, perror("ftruncate"); } -area = qemu_ram_mmap(fd, memory, block->mr->align, readonly, - block->flags & RAM_SHARED, block->flags & RAM_PMEM, - offset); +qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0; +qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0; +qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0; +area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset); if (area == MAP_FAILED) { error_setg_errno(errp, errno, "unable to map backing store for guest RAM"); diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 0e2bd7bc0e..1ddc0e2a1e 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -118,9 +118,12 @@ static void *mmap_reserve(size_t size, int fd) * Activate memory in a reserved region from the given
[PATCH v5 03/14] util/mmap-alloc: Factor out activating of memory to mmap_activate()
We want to activate memory within a reserved memory region, to make it accessible. Let's factor that out. Reviewed-by: Richard Henderson Acked-by: Murilo Opsfelder Araujo Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 94 +-- 1 file changed, 50 insertions(+), 44 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 223d66219c..0e2bd7bc0e 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -114,6 +114,52 @@ static void *mmap_reserve(size_t size, int fd) return mmap(0, size, PROT_NONE, flags, fd, 0); } +/* + * Activate memory in a reserved region from the given fd (if any), to make + * it accessible. + */ +static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly, + bool shared, bool is_pmem, off_t map_offset) +{ +const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE); +int map_sync_flags = 0; +int flags = MAP_FIXED; +void *activated_ptr; + +flags |= fd == -1 ? MAP_ANONYMOUS : 0; +flags |= shared ? MAP_SHARED : MAP_PRIVATE; +if (shared && is_pmem) { +map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE; +} + +activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd, + map_offset); +if (activated_ptr == MAP_FAILED && map_sync_flags) { +if (errno == ENOTSUP) { +char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd); +char *file_name = g_malloc0(PATH_MAX); +int len = readlink(proc_link, file_name, PATH_MAX - 1); + +if (len < 0) { +len = 0; +} +file_name[len] = '\0'; +fprintf(stderr, "Warning: requesting persistence across crashes " +"for backend file %s failed. Proceeding without " +"persistence, data might become corrupted in case of host " +"crash.\n", file_name); +g_free(proc_link); +g_free(file_name); +} +/* + * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try + * again without these flags to handle backwards compatibility. 
+ */ +activated_ptr = mmap(ptr, size, prot, flags, fd, map_offset); +} +return activated_ptr; +} + static inline size_t mmap_guard_pagesize(int fd) { #if defined(__powerpc64__) && defined(__linux__) @@ -133,13 +179,8 @@ void *qemu_ram_mmap(int fd, off_t map_offset) { const size_t guard_pagesize = mmap_guard_pagesize(fd); -int prot; -int flags; -int map_sync_flags = 0; -size_t offset; -size_t total; -void *guardptr; -void *ptr; +size_t offset, total; +void *ptr, *guardptr; /* * Note: this always allocates at least one extra page of virtual address @@ -156,45 +197,10 @@ void *qemu_ram_mmap(int fd, /* Always align to host page size */ assert(align >= guard_pagesize); -flags = MAP_FIXED; -flags |= fd == -1 ? MAP_ANONYMOUS : 0; -flags |= shared ? MAP_SHARED : MAP_PRIVATE; -if (shared && is_pmem) { -map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE; -} - offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr; -prot = PROT_READ | (readonly ? 0 : PROT_WRITE); - -ptr = mmap(guardptr + offset, size, prot, - flags | map_sync_flags, fd, map_offset); - -if (ptr == MAP_FAILED && map_sync_flags) { -if (errno == ENOTSUP) { -char *proc_link, *file_name; -int len; -proc_link = g_strdup_printf("/proc/self/fd/%d", fd); -file_name = g_malloc0(PATH_MAX); -len = readlink(proc_link, file_name, PATH_MAX - 1); -if (len < 0) { -len = 0; -} -file_name[len] = '\0'; -fprintf(stderr, "Warning: requesting persistence across crashes " -"for backend file %s failed. Proceeding without " -"persistence, data might become corrupted in case of host " -"crash.\n", file_name); -g_free(proc_link); -g_free(file_name); -} -/* - * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC, - * we will remove these flags to handle compatibility. - */ -ptr = mmap(guardptr + offset, size, prot, flags, fd, map_offset); -} - +ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem, +map_offset); if (ptr == MAP_FAILED) { munmap(guardptr, total); return MAP_FAILED; -- 2.30.2
[PATCH v5 05/14] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()
Let's forward ram_flags instead, renaming memory_region_init_ram_shared_nomigrate() into memory_region_init_ram_flags_nomigrate(). Forward flags to qemu_ram_alloc() and qemu_ram_alloc_internal(). Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- backends/hostmem-ram.c| 6 +++-- hw/m68k/next-cube.c | 4 ++-- include/exec/memory.h | 24 +-- include/exec/ram_addr.h | 2 +- .../memory-region-housekeeping.cocci | 8 +++ softmmu/memory.c | 20 softmmu/physmem.c | 24 --- 7 files changed, 43 insertions(+), 45 deletions(-) diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c index 5cc53e76c9..741e701062 100644 --- a/backends/hostmem-ram.c +++ b/backends/hostmem-ram.c @@ -19,6 +19,7 @@ static void ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { +uint32_t ram_flags; char *name; if (!backend->size) { @@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), name, - backend->size, backend->share, errp); +ram_flags = backend->share ? 
RAM_SHARED : 0; +memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name, + backend->size, ram_flags, errp); g_free(name); } diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c index 92b45d760f..59ccae0d5e 100644 --- a/hw/m68k/next-cube.c +++ b/hw/m68k/next-cube.c @@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine) sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x0210); /* BMAP memory */ -memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64, -true, &error_fatal); +memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64, + RAM_SHARED, &error_fatal); memory_region_add_subregion(sysmem, 0x020c, bmapm1); /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */ memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64); diff --git a/include/exec/memory.h b/include/exec/memory.h index 8ad280e532..10179c6695 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -928,27 +928,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr, Error **errp); /** - * memory_region_init_ram_shared_nomigrate: Initialize RAM memory region. - * Accesses into the region will - * modify memory directly. + * memory_region_init_ram_flags_nomigrate: Initialize RAM memory region. + * Accesses into the region will + * modify memory directly. * * @mr: the #MemoryRegion to be initialized. * @owner: the object that tracks the region's reference count * @name: Region name, becomes part of RAMBlock name used in migration stream *must be unique within any device * @size: size of the region. - * @share: allow remapping RAM to different addresses + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED. * @errp: pointer to Error*, to store an error if it happens. * - * Note that this function is similar to memory_region_init_ram_nomigrate. - * The only difference is part of the RAM region can be remapped. 
+ * Note that this function does not do anything to cause the data in the + * RAM memory region to be migrated; that is the responsibility of the caller. */ -void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr, - Object *owner, - const char *name, - uint64_t size, - bool share, - Error **errp); +void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr, +Object *owner, +const char *name, +uint64_t size, +uint32_t ram_flags, +Error **errp); /** * memory_region_init_resizeable_ram: Initialize memory region with resizeable diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index a7e3378340..6d4513f8e2 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr, RAMBlock *qemu_ram_allo
[PATCH v5 04/14] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(), to clean up and prepare for more flags. Simplify the documentation of passed ram flags: Looking at our documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be repetitive. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- backends/hostmem-memfd.c | 7 --- hw/misc/ivshmem.c| 5 ++--- include/exec/memory.h| 9 +++-- include/exec/ram_addr.h | 6 +- softmmu/memory.c | 7 +++ 5 files changed, 13 insertions(+), 21 deletions(-) diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c index 69b0ae30bb..93b5d1a4cf 100644 --- a/backends/hostmem-memfd.c +++ b/backends/hostmem-memfd.c @@ -36,6 +36,7 @@ static void memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend); +uint32_t ram_flags; char *name; int fd; @@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), - name, backend->size, - backend->share, fd, 0, errp); +ram_flags = backend->share ? 
RAM_SHARED : 0; +memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name, + backend->size, ram_flags, fd, 0, errp); g_free(name); } diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index a1fa4878be..1ba4a98377 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -493,9 +493,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp) size = buf.st_size; /* mmap the region and map into the BAR2 */ -memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), - "ivshmem.bar2", size, true, fd, 0, - &local_err); +memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2", + size, RAM_SHARED, fd, 0, &local_err); if (local_err) { error_propagate(errp, local_err); return; diff --git a/include/exec/memory.h b/include/exec/memory.h index 5728a681b2..8ad280e532 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -991,10 +991,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr, * @size: size of the region. * @align: alignment of the region base address; if 0, the default alignment * (getpagesize()) will be used. - * @ram_flags: Memory region features: - * - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag - * - RAM_PMEM: the memory is persistent memory - * Other bits are ignored now. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @path: the path in which to allocate the RAM. * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens. @@ -1020,7 +1017,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, * @owner: the object that tracks the region's reference count * @name: the name of the region. * @size: size of the region. - * @share: %true if memory must be mmaped with the MAP_SHARED flag + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @fd: the fd to mmap. * @offset: offset within the file referenced by fd * @errp: pointer to Error*, to store an error if it happens. 
@@ -1032,7 +1029,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr, Object *owner, const char *name, uint64_t size, -bool share, +uint32_t ram_flags, int fd, ram_addr_t offset, Error **errp); diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index 3cb9791df3..a7e3378340 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -104,11 +104,7 @@ long qemu_maxrampagesize(void); * Parameters: * @size: the size in bytes of the ram block * @mr: the memory region where the ram block is - * @ram_flags: specify the properties of the ram block, which can be one - * or bit-or of following values - * - RAM_SHARED: mmap the backing file or device with MAP_SHARED - * - RAM_PMEM: the backend @mem_path or @fd is persistent memory - * Other bits are ignored. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @mem_path or @fd: specify the backing file or device * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens diff --git a/softmmu/
[PATCH v5 09/14] hostmem: Wire up RAM_NORESERVE via "reserve" property
Let's provide a way to control the use of RAM_NORESERVE via memory backends using the "reserve" property, which defaults to true (old behavior). Only Linux currently supports clearing the flag (and support is checked at runtime, depending on the setting of "/proc/sys/vm/overcommit_memory"). Windows and POSIX systems other than Linux will bail out with "reserve=false". The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. This essentially allows avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using virtio-mem, and will also allow supporting hugetlbfs in the future. Reviewed-by: Peter Xu Reviewed-by: Eduardo Habkost Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- backends/hostmem-file.c | 11 ++- backends/hostmem-memfd.c | 1 + backends/hostmem-ram.c | 1 + backends/hostmem.c | 32 include/sysemu/hostmem.h | 2 +- qapi/qom.json | 4 6 files changed, 45 insertions(+), 6 deletions(-) diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c index b683da9daf..9d550e53d4 100644 --- a/backends/hostmem-file.c +++ b/backends/hostmem-file.c @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) object_get_typename(OBJECT(backend))); #else HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend); +uint32_t ram_flags; gchar *name; if (!backend->size) { @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), - name, - backend->size, fb->align, - (backend->share ? RAM_SHARED : 0) | - (fb->is_pmem ? RAM_PMEM : 0), +ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; +ram_flags |= fb->is_pmem ?
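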
RAM_PMEM : 0; +memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name, + backend->size, fb->align, ram_flags, fb->mem_path, fb->readonly, errp); g_free(name); #endif diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c index 93b5d1a4cf..f3436b623d 100644 --- a/backends/hostmem-memfd.c +++ b/backends/hostmem-memfd.c @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) name = host_memory_backend_get_name(backend); ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name, backend->size, ram_flags, fd, 0, errp); g_free(name); diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c index 741e701062..b8e55cdbd0 100644 --- a/backends/hostmem-ram.c +++ b/backends/hostmem-ram.c @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) name = host_memory_backend_get_name(backend); ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 
0 : RAM_NORESERVE; memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name, backend->size, ram_flags, errp); g_free(name); diff --git a/backends/hostmem.c b/backends/hostmem.c index c6c1ff5b99..58fdc1b658 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, Error *local_err = NULL; HostMemoryBackend *backend = MEMORY_BACKEND(obj); +if (!backend->reserve && value) { +error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible"); +return; +} + if (!host_memory_backend_mr_inited(backend)) { backend->prealloc = value; return; @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj) /* TODO: convert access to globals to compat properties */ backend->merge = machine_mem_merge(machine); backend->dump = machine_dump_guest_core(machine); +backend->reserve = true; backend->prealloc_threads = 1; } @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp) backend->share = value; } +static bool host_memory_backend_get_reserve(Object *o, Error **errp) +{ +HostMemoryBackend *backend = MEMORY_BACKEND(o); + +return backend->reserve; +} + +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp) +{ +HostMemoryBackend *backend = MEMORY_BACKEND(o); + +if (host_memory_backend_mr_inited(backend)) { +error_setg(errp, "cannot change property value"); +return; +} +if (backen
[PATCH v5 13/14] qmp: Include "reserve" property of memory backends
Let's include the new property. Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-qmp-cmds.c | 1 + qapi/machine.json | 4 2 files changed, 5 insertions(+) diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c index d41db5b93b..2d135ecdd0 100644 --- a/hw/core/machine-qmp-cmds.c +++ b/hw/core/machine-qmp-cmds.c @@ -175,6 +175,7 @@ static int query_memdev(Object *obj, void *opaque) m->dump = object_property_get_bool(obj, "dump", &error_abort); m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort); m->share = object_property_get_bool(obj, "share", &error_abort); +m->reserve = object_property_get_bool(obj, "reserve", &error_abort); m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy", &error_abort); host_nodes = object_property_get_qobject(obj, diff --git a/qapi/machine.json b/qapi/machine.json index 32650bfe9e..5932139d20 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -798,6 +798,9 @@ # # @share: whether memory is private to QEMU or shared (since 6.1) # +# @reserve: whether swap space (or huge pages) was reserved if applicable +# (since 6.1) +# # @host-nodes: host nodes for its memory policy # # @policy: memory policy of memory backend @@ -812,6 +815,7 @@ 'dump': 'bool', 'prealloc': 'bool', 'share': 'bool', +'reserve':'bool', 'host-nodes': ['uint16'], 'policy': 'HostMemPolicy' }} -- 2.30.2
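As a usage sketch (hypothetical values; field names follow this patch and the earlier "share" patch), a QMP client querying memory backends would then see the new property alongside the existing ones:

```
-> { "execute": "query-memdev" }
<- { "return": [ { "id": "mem0", "size": 10737418240, "merge": true,
                   "dump": true, "prealloc": false, "share": false,
                   "reserve": false, "host-nodes": [],
                   "policy": "default" } ] }
```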
[PATCH v5 07/14] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The new flag has the following semantics: " RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge pages if applicable) is skipped: will bail out if not supported. When not set, the OS will do the reservation, if supported for the memory type. " Allow passing it into: - memory_region_init_ram_nomigrate() - memory_region_init_resizeable_ram() - memory_region_init_ram_from_file() ... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag. Bail out if the flag is not supported, which is the case right now for both, POSIX and win32. We will add Linux support next and allow specifying RAM_NORESERVE via memory backends. The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/exec/cpu-common.h | 1 + include/exec/memory.h | 15 --- include/exec/ram_addr.h | 3 ++- include/qemu/osdep.h | 9 - migration/ram.c | 3 +-- softmmu/physmem.c | 15 +++ util/mmap-alloc.c | 7 +++ util/oslib-posix.c| 6 -- util/oslib-win32.c| 13 - 9 files changed, 58 insertions(+), 14 deletions(-) diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h index 5a0a2d93e0..38a47ad4ac 100644 --- a/include/exec/cpu-common.h +++ b/include/exec/cpu-common.h @@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb); ram_addr_t qemu_ram_get_offset(RAMBlock *rb); ram_addr_t qemu_ram_get_used_length(RAMBlock *rb); bool qemu_ram_is_shared(RAMBlock *rb); +bool qemu_ram_is_noreserve(RAMBlock *rb); bool qemu_ram_is_uf_zeroable(RAMBlock *rb); void qemu_ram_set_uf_zeroable(RAMBlock *rb); bool qemu_ram_is_migratable(RAMBlock *rb); diff --git a/include/exec/memory.h b/include/exec/memory.h index 10179c6695..8d77819bcd 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -155,6 +155,13 @@ typedef struct IOMMUTLBEvent { */ #define RAM_UF_WRITEPROTECT (1 << 6) +/* + * RAM is 
mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge + * pages if applicable) is skipped: will bail out if not supported. When not + * set, the OS will do the reservation, if supported for the memory type. + */ +#define RAM_NORESERVE (1 << 7) + static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, IOMMUNotifierFlag flags, hwaddr start, hwaddr end, @@ -937,7 +944,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr, * @name: Region name, becomes part of RAMBlock name used in migration stream *must be unique within any device * @size: size of the region. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE. * @errp: pointer to Error*, to store an error if it happens. * * Note that this function does not do anything to cause the data in the @@ -991,7 +998,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr, * @size: size of the region. * @align: alignment of the region base address; if 0, the default alignment * (getpagesize()) will be used. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE, * @path: the path in which to allocate the RAM. * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens. @@ -1017,7 +1025,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, * @owner: the object that tracks the region's reference count * @name: the name of the region. * @size: size of the region. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE. * @fd: the fd to mmap. * @offset: offset within the file referenced by fd * @errp: pointer to Error*, to store an error if it happens. 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index 6d4513f8e2..551876bed0 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -104,7 +104,8 @@ long qemu_maxrampagesize(void); * Parameters: * @size: the size in bytes of the ram block * @mr: the memory region where the ram block is - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE. * @mem_path or @fd: specify the backing file or device * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index 5ba89f5460..877d9a5eff 100644 --
[PATCH v5 08/14] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no effect on most shared mappings - except for hugetlbfs and anonymous memory. Linux man page: "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux. Also, in Linux hugetlbfs is treated differently - we configure reservation of huge pages from the pool, not reservation of swap space (huge pages cannot be swapped).

The rough behavior is [1]:

a) !Hugetlbfs:
   1) Without MAP_NORESERVE *or* with memory overcommit under Linux
      disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
      accounting/reservation happens:
       For a file backed map
        SHARED or READ-only - 0 cost (the file is the map not swap)
        PRIVATE WRITABLE - size of mapping per instance
       For an anonymous or /dev/zero map
        SHARED - size of mapping
        PRIVATE READ-only - 0 cost (but of little use)
        PRIVATE WRITABLE - size of mapping per instance
   2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:
   1) Without MAP_NORESERVE, huge pages are reserved.
   2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able to configure it for !hugetlbfs globally; this toggle now allows configuring it in a more fine-grained way, not just for the whole system.

The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. 
[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/qemu/osdep.h | 3 ++ softmmu/physmem.c| 1 + util/mmap-alloc.c| 69 ++-- 3 files changed, 71 insertions(+), 2 deletions(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index 877d9a5eff..d5c1860508 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -176,6 +176,9 @@ extern int daemon(int, int); #ifndef MAP_FIXED_NOREPLACE #define MAP_FIXED_NOREPLACE 0 #endif +#ifndef MAP_NORESERVE +#define MAP_NORESERVE 0 +#endif #ifndef ENOMEDIUM #define ENOMEDIUM ENODEV #endif diff --git a/softmmu/physmem.c b/softmmu/physmem.c index 608ba75d18..0d4376b04e 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -2229,6 +2229,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) flags = MAP_FIXED; flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE; +flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; if (block->fd >= 0) { area = mmap(vaddr, length, PROT_READ | PROT_WRITE, flags, block->fd, offset); diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index d0cf4aaee5..838e286ce5 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -20,6 +20,7 @@ #include "qemu/osdep.h" #include "qemu/mmap-alloc.h" #include "qemu/host-utils.h" +#include "qemu/cutils.h" #include "qemu/error-report.h" #define HUGETLBFS_MAGIC 0x958458f6 @@ -83,6 +84,70 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory" +static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags) +{ +#if defined(__linux__) +const bool readonly = qemu_map_flags & QEMU_MAP_READONLY; +const bool shared = qemu_map_flags & QEMU_MAP_SHARED; +gchar *content = NULL; +const char *endptr; +unsigned int tmp; + +/* + * hugetlb accounting is different than ordinary swap reservation: + * a) Hugetlb pages from the pool are reserved for both private and + 
*shared mappings. For shared mappings, all mappers have to specify + *MAP_NORESERVE. + * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory. + */ +if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) { +return true; +} + +/* + * Accountable mappings in the kernel that can be affected by MAP_NORESERVE + * are private writable mappings (see mm/mmap.c:accountable_mapping() in + * Linux). For all shared or readonly mappings, MAP_NORESERVE is always + * implicitly active -- no reservation; this includes shmem. The only + * exception is shared anonymous memory, it is accounted like private + * anonymous memory. + */ +if (readonly || (shared && fd >= 0)) { +return true; +} + +/* + * MAP_NORESERVE is globally ignored for applicable !huge
[PATCH v5 14/14] hmp: Print "reserve" property of memory backends with "info memdev"
Let's print the new property. Reviewed-by: Dr. David Alan Gilbert Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-hmp-cmds.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 004a92b3d6..9bedc77bb4 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -112,6 +112,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict) m->value->prealloc ? "true" : "false"); monitor_printf(mon, " share: %s\n", m->value->share ? "true" : "false"); +monitor_printf(mon, " reserve: %s\n", + m->value->reserve ? "true" : "false"); monitor_printf(mon, " policy: %s\n", HostMemPolicy_str(m->value->policy)); visit_complete(v, &str); -- 2.30.2
[PATCH v5 12/14] hmp: Print "share" property of memory backends with "info memdev"
Let's print the property. Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-hmp-cmds.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 58248cffa3..004a92b3d6 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -110,6 +110,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict) m->value->dump ? "true" : "false"); monitor_printf(mon, " prealloc: %s\n", m->value->prealloc ? "true" : "false"); +monitor_printf(mon, " share: %s\n", + m->value->share ? "true" : "false"); monitor_printf(mon, " policy: %s\n", HostMemPolicy_str(m->value->policy)); visit_complete(v, &str); -- 2.30.2
[PATCH v5 10/14] qmp: Clarify memory backend properties returned via query-memdev
We return information on the currently configured memory backends and don't configure them, so describe what the currently set properties express. Suggested-by: Markus Armbruster Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- qapi/machine.json | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/qapi/machine.json b/qapi/machine.json index 6e90d463fc..758b901185 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -790,11 +790,11 @@ # # @size: memory backend size # -# @merge: enables or disables memory merge support +# @merge: whether memory merge support is enabled # -# @dump: includes memory backend's memory in a core dump or not +# @dump: whether memory backend's memory is included in a core dump # -# @prealloc: enables or disables memory preallocation +# @prealloc: whether memory was preallocated # # @host-nodes: host nodes for its memory policy # -- 2.30.2
Re: [PATCH 0/1] sphinx: qapidoc: Wrap "If" section body in a paragraph node
On Wed, 7 Apr 2021 at 10:41, Markus Armbruster wrote: > > Peter, do you intend to merge this yourself? I have nothing else queued > right now. If you want me to do a pull request for this patch, let me > know. I missed this email earlier. I had assumed somebody else would pick it up, but as it happens I have a couple of last minute patches I want to send out anyway, so I'll pick this up now. thanks -- PMM
[PATCH v5 11/14] qmp: Include "share" property of memory backends
Let's include the property, which can be helpful when debugging, for example, to spot misuse of MAP_PRIVATE which can result in some ugly corner cases (e.g., double-memory consumption on shmem). Use the same description we also use for describing the property. Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-qmp-cmds.c | 1 + qapi/machine.json | 3 +++ 2 files changed, 4 insertions(+) diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c index 68a942595a..d41db5b93b 100644 --- a/hw/core/machine-qmp-cmds.c +++ b/hw/core/machine-qmp-cmds.c @@ -174,6 +174,7 @@ static int query_memdev(Object *obj, void *opaque) m->merge = object_property_get_bool(obj, "merge", &error_abort); m->dump = object_property_get_bool(obj, "dump", &error_abort); m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort); +m->share = object_property_get_bool(obj, "share", &error_abort); m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy", &error_abort); host_nodes = object_property_get_qobject(obj, diff --git a/qapi/machine.json b/qapi/machine.json index 758b901185..32650bfe9e 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -796,6 +796,8 @@ # # @prealloc: whether memory was preallocated # +# @share: whether memory is private to QEMU or shared (since 6.1) +# # @host-nodes: host nodes for its memory policy # # @policy: memory policy of memory backend @@ -809,6 +811,7 @@ 'merge': 'bool', 'dump': 'bool', 'prealloc': 'bool', +'share': 'bool', 'host-nodes': ['uint16'], 'policy': 'HostMemPolicy' }} -- 2.30.2
Re: trace_FOO_tcg bit-rotted?
Stefan Hajnoczi writes: > On Mon, Apr 12, 2021 at 08:06:57PM +0100, Alex Bennée wrote: >> >> Stefan Hajnoczi writes: >> >> > On Fri, Apr 09, 2021 at 05:29:08PM +0100, Alex Bennée wrote: >> >> >> >> Laurent Vivier writes: >> >> >> >> > Le 06/04/2021 à 18:00, Alex Bennée a écrit : >> >> >> Hi, >> >> >> >> >> >> It's been awhile since I last played with this but I think we are >> >> >> suffering from not having some test cases for tracing code >> >> >> generation/execution in the tree. I tried adding a simple trace point >> >> >> to >> >> >> see if I could track ERET calls: >> > >> > Lluís: are you happy to deprecate tcg trace events in favor of TCG >> > plugins? > > That said, I haven't used the TCG trace event functionality and maybe > I'm missing something obvious that Lluís will point out :). I've updated the Cc to what I think is Lluís current email as I was getting bounces from the old academic address. -- Alex Bennée
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 10:58 AM Daniel P. Berrangé wrote: > > On Tue, Apr 13, 2021 at 08:40:59AM +0200, Markus Armbruster wrote: > > Li Zhang writes: > > > > > From: Li Zhang > > > > > > For some scenarios, it needs to hot-add a monitor device. > > > But QEMU doesn't support hotplug yet. It also works by adding > > > a monitor with null backend by default and then change its > > > backend to socket by QMP command "chardev-change". > > If you need ability to hot-add monitor instances, why not just > implement that feature directly, instead of pre-creating monitors > with null backends and then later changing the backend ? > Hi Daniel, I would like to use this solution first because we'd like to use it as soon as possible. It's the best solution to implement the hot-add feature and it may need more effort to do it. I will consider it. > > > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > > > Signed-off-by: Li Zhang > > > > I think what what you're trying to say is that chardev-change does not > > work when the character device changes is used by a QMP monitor. > > Correct? > > > > If yes, how exactly does it misbehave? > > > > Does it work with an HMP monitor? 
> > > > > --- > > > monitor/monitor-internal.h | 3 +++ > > > monitor/monitor.c | 2 +- > > > monitor/qmp.c | 42 +++--- > > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > > index 40903d6386..2df6dd21de 100644 > > > --- a/monitor/monitor-internal.h > > > +++ b/monitor/monitor-internal.h > > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char > > > *list); > > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > > Error **errp); > > > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > > + void *opaque); > > > + > > > #endif > > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > > index e94f532cf5..2d255bab18 100644 > > > --- a/monitor/monitor.c > > > +++ b/monitor/monitor.c > > > @@ -157,7 +157,7 @@ static inline bool > > > monitor_is_hmp_non_interactive(const Monitor *mon) > > > > > > static void monitor_flush_locked(Monitor *mon); > > > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > >void *opaque) > > > { > > > Monitor *mon = opaque; > > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > > index 2326bd7f9b..55cfb230d9 100644 > > > --- a/monitor/qmp.c > > > +++ b/monitor/qmp.c > > > @@ -44,6 +44,7 @@ struct QMPRequest { > > > Error *err; > > > }; > > > typedef struct QMPRequest QMPRequest; > > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > > g_queue_free(mon->qmp_requests); > > > } > > > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > > +static int monitor_qmp_change(void *opaque) > > > +{ > > > +MonitorQMP *mon = opaque; > > > + > > > +mon->common.use_io_thread = > > > +qemu_chr_has_feature(mon->common.chr.chr, > > > 
QEMU_CHAR_FEATURE_GCONTEXT); > > > + > > > +if (mon->common.use_io_thread) { > > > +aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > > +monitor_qmp_set_handlers_bh, mon); > > > +} else { > > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > > + monitor_qmp_read, monitor_qmp_event, > > > + monitor_qmp_change, &mon->common, NULL, > > > true); > > > +} > > > + > > > +if (mon->common.out_watch) { > > > +g_source_remove(mon->common.out_watch); > > > > All other updates of @out_watch are under @mon_lock. Why not this one? > > > > I have no idea whether g_source_remove() is the right function to call. > > Its documentation says "You must use g_source_destroy() for sources > > added to a non-default main context." The qemu_chr_fe_set_handlers() > > contract is of no help. > > > > Documentation of g_source_destroy() confuses some more: "This does not > > unref the GSource: if you still hold a reference, use g_source_unref() > > to drop it. > > > > Marc-André, can you help? > > > > > +qemu_mutex_lock(&mon->common.mon_lock); > > > +mon->common.out_watch = > > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > > + monitor_unblocked, &mon->common); > > > > Bad indentation. Better: > > > > mon->common.out_watch = > > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > >monitor_unb
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
Am 13.04.2021 um 10:13 hat David Hildenbrand geschrieben: > On 13.04.21 06:41, Markus Armbruster wrote: > > David Hildenbrand writes: > > > > > On 12.03.21 18:35, Paolo Bonzini wrote: > > > > Emulators are currently using OptsVisitor (via user_creatable_add_opts) > > > > to parse the -object command line option. This has one extra feature, > > > > compared to keyval, which is automatic conversion of integers to lists > > > > as well as support for lists as repeated options: > > > > -object > > > > memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind > > > > So we cannot replace OptsVisitor with keyval right now. Still, this > > > > patch moves the user_creatable_add_opts logic to vl.c since it is > > > > not needed anywhere else, and makes it go through > > > > user_creatable_add_qapi. > > > > In order to minimize code changes, the predicate still takes a > > > > string. > > > > This can be changed later to use the ObjectType QAPI enum directly. > > > > > > > > > > Rebasing my "noreserve"[1] series on this, I get weird errors from > > > QEMU when specifying the new "reserve=off" option for a > > > memory-backend-ram: > > > > > > "Invalid parameter 'reserve'" > > > > > > And it looks like this is the case for any new properties. Poking > > > around, I fail to find what's causing this -- or how to unlock new > > > properties. What is the magic toggle to make it work? > > > > > > Thanks! > > > > > > [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com > > > > Wild guess: you didn't add your new properties in the QAPI schema. > > > > For a not-so-wild-guess, send us a git-fetch argument for your rebased > > series. > > > > Oh, there is qapi/qom.json -- maybe that does the trick. > > (I have mixed feelings about having to specify the same thing twice at > different locations) The idea is that we'll eventually generate some of the QOM boilerplate from the QAPI schema and remove the duplication again. 
But yes, for the time being, you need to touch both places. Kevin
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
David Hildenbrand writes: > On 13.04.21 06:41, Markus Armbruster wrote: >> David Hildenbrand writes: >> >>> On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. >>> >>> Rebasing my "noreserve"[1] series on this, I get weird errors from >>> QEMU when specifying the new "reserve=off" option for a >>> memory-backend-ram: >>> >>> "Invalid parameter 'reserve'" >>> >>> And it looks like this is the case for any new properties. Poking >>> around, I fail to find what's causing this -- or how to unlock new >>> properties. What is the magic toggle to make it work? >>> >>> Thanks! >>> >>> [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com >> Wild guess: you didn't add your new properties in the QAPI schema. >> For a not-so-wild-guess, send us a git-fetch argument for your >> rebased >> series. >> > > Oh, there is qapi/qom.json -- maybe that does the trick. > > (I have mixed feelings about having to specify the same thing twice at > different locations) With reason. We've talked about generating QOM boilerplate from the QAPI schema, but haven't progressed to actual patches. > I'll have a look if that makes it fly.
[PATCH RESEND v7 00/13] virtio-mem: vfio support
After silence for more than 1.5 months and the feeling like pinging into a black hole, I rebased and retested the patches. I hope we can get them into 6.1 early -- or at least get some more feedback on the patches. @Paolo: Michael and Alex already acked relevant parts -- A virtio-mem device manages a memory region in guest physical address space, represented as a single (currently large) memory region in QEMU, mapped into system memory address space. Before the guest is allowed to use memory blocks, it must coordinate with the hypervisor (plug blocks). After a reboot, all memory is usually unplugged - when the guest comes up, it detects the virtio-mem device and selects memory blocks to plug (based on resize requests from the hypervisor). Memory hot(un)plug consists of (un)plugging memory blocks via a virtio-mem device (triggered by the guest). When unplugging blocks, we discard the memory - similar to memory balloon inflation. In contrast to memory ballooning, we always know which memory blocks a guest may actually use - especially during a reboot, after a crash, or after kexec (and during hibernation as well). Guests agreed to not access unplugged memory again, especially not via DMA. The issue with vfio is that it cannot deal with random discards - for this reason, virtio-mem and vfio can currently only run mutually exclusively. Especially, vfio would currently map the whole memory region (with possibly only little/no plugged blocks), resulting in all pages getting pinned and therefore resulting in a higher memory consumption than expected (turning virtio-mem basically useless in these environments). To make vfio work nicely with virtio-mem, we have to map only the plugged blocks, and map/unmap properly when plugging/unplugging blocks (including discarding of RAM when unplugging). We achieve that by using a new notifier mechanism that communicates changes. 
It's important to map memory in the granularity in which we could see unmaps again (-> virtio-mem block size) - so when e.g., plugging consecutive 100 MB with a block size of 2 MB, we need 50 mappings. When unmapping, we can use a single vfio_unmap call for the applicable range. We expect that the block size of virtio-mem devices will be fairly large in the future (to not run out of mappings and to improve hot(un)plug performance), configured by the user, when used with vfio (e.g., 128MB, 1G, ...), but it will depend on the setup. More info regarding virtio-mem can be found at: https://virtio-mem.gitlab.io/ v7 is located at: g...@github.com:davidhildenbrand/qemu.git virtio-mem-vfio-v7 v6 -> v7: - s/RamDiscardMgr/RamDiscardManager/ - "memory: Introduce RamDiscardManager for RAM memory regions" -- Make RamDiscardManager/RamDiscardListener eat MemoryRegionSections -- Replace notify_discard_all callback by double_discard_supported -- Reshuffle the individual hunks in memory.h -- Provide function wrappers for RamDiscardManager calls - "memory: Helpers to copy/free a MemoryRegionSection" -- Added - "virtio-mem: Implement RamDiscardManager interface" -- Work on MemoryRegionSections instead of ranges -- Minor optimizations - "vfio: Support for RamDiscardManager in the !vIOMMU case" -- Simplify based on new interfaces / MemoryRegionSections -- Minor cleanups and optimizations -- Add a comment regarding dirty bitmap sync. -- Don't store "offset_within_region" in VFIORamDiscardListener - "vfio: Support for RamDiscardManager in the vIOMMU case" -- Adjust to new interface - "softmmu/physmem: Don't use atomic operations in ..." -- Rename variables - "softmmu/physmem: Extend ram_block_discard_(require|disable) ..." -- Rename variables - Rebased and retested v5 -> v6: - "memory: Introduce RamDiscardMgr for RAM memory regions" -- Fix variable names in one prototype. 
- "virtio-mem: Don't report errors when ram_block_discard_range() fails" -- Added - "virtio-mem: Implement RamDiscardMgr interface" -- Don't report an error if discarding fails - Rebased and retested v4 -> v5: - "vfio: Support for RamDiscardMgr in the !vIOMMU case" -- Added more assertions for granularity vs. iommu supported pagesize - "vfio: Sanity check maximum number of DMA mappings with RamDiscardMgr" -- Fix accounting of mappings - "vfio: Disable only uncoordinated discards for VFIO_TYPE1 iommus" -- Fence off SPAPR and add some comments regarding future support. -- Tweak patch description - Rebase and retest v3 -> v4: - "vfio: Query and store the maximum number of DMA mappings -- Limit the patch to querying and storing only -- Renamed to "vfio: Query and store the maximum number of possible DMA mappings" - "vfio: Support for RamDiscardMgr in the !vIOMMU case" -- Remove sanity checks / warning the user - "vfio: Sanity check maximum number of DMA mappings with RamDiscardMgr" -- Perform sanity checks by looking at the number of memslots and all registered RamDiscardMgr sections - Rebase and retest - Reshuffled the patches slightly v2 -> v3: - Rebased + retested - Fixed some typ
[PATCH RESEND v7 02/13] memory: Helpers to copy/free a MemoryRegionSection
In case one wants to create a permanent copy of a MemoryRegionSection, one needs access to flatview_ref()/flatview_unref(). Instead of exposing these, let's just add helpers to copy/free a MemoryRegionSection and properly adjust references. Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 20 softmmu/memory.c | 27 +++ 2 files changed, 47 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 38a3b41ac1..e806d0140e 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1047,6 +1047,26 @@ static inline bool MemoryRegionSection_eq(MemoryRegionSection *a, a->nonvolatile == b->nonvolatile; } +/** + * memory_region_section_new_copy: Copy a memory region section + * + * Allocate memory for a new copy, copy the memory region section, and + * properly take a reference on all relevant members. + * + * @s: the #MemoryRegionSection to copy + */ +MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s); + +/** + * memory_region_section_free_copy: Free a copied memory region section + * + * Free a copy of a memory section created via memory_region_section_new_copy(), + * properly dropping references on all relevant members. 
+ * + * @s: the #MemoryRegionSection to copy + */ +void memory_region_section_free_copy(MemoryRegionSection *s); + /** * memory_region_init: Initialize a memory region * diff --git a/softmmu/memory.c b/softmmu/memory.c index 26ea87d77a..776c7cac38 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -2694,6 +2694,33 @@ MemoryRegionSection memory_region_find(MemoryRegion *mr, return ret; } +MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s) +{ +MemoryRegionSection *tmp = g_new(MemoryRegionSection, 1); + +*tmp = *s; +if (tmp->mr) { +memory_region_ref(tmp->mr); +} +if (tmp->fv) { +bool ret = flatview_ref(tmp->fv); + +g_assert(ret); +} +return tmp; +} + +void memory_region_section_free_copy(MemoryRegionSection *s) +{ +if (s->fv) { +flatview_unref(s->fv); +} +if (s->mr) { +memory_region_unref(s->mr); +} +g_free(s); +} + bool memory_region_present(MemoryRegion *container, hwaddr addr) { MemoryRegion *mr; -- 2.30.2
[PATCH RESEND v7 04/13] virtio-mem: Don't report errors when ram_block_discard_range() fails
Any errors are unexpected and ram_block_discard_range() already properly prints errors. Let's stop manually reporting errors. Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 20 1 file changed, 4 insertions(+), 16 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 471e464171..bbe42ad83b 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -246,17 +246,14 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, uint64_t size, bool plug) { const uint64_t offset = start_gpa - vmem->addr; -int ret; +RAMBlock *rb = vmem->memdev->mr.ram_block; if (virtio_mem_is_busy()) { return -EBUSY; } if (!plug) { -ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size); -if (ret) { -error_report("Unexpected error discarding RAM: %s", - strerror(-ret)); +if (ram_block_discard_range(rb, offset, size)) { return -EBUSY; } } @@ -345,15 +342,12 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem, static int virtio_mem_unplug_all(VirtIOMEM *vmem) { RAMBlock *rb = vmem->memdev->mr.ram_block; -int ret; if (virtio_mem_is_busy()) { return -EBUSY; } -ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb)); -if (ret) { -error_report("Unexpected error discarding RAM: %s", strerror(-ret)); +if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) { return -EBUSY; } bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size); @@ -625,14 +619,8 @@ static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, uint64_t offset, uint64_t size) { RAMBlock *rb = vmem->memdev->mr.ram_block; -int ret; -ret = ram_block_discard_range(rb, offset, size); -if (ret) { -error_report("Unexpected error discarding RAM: %s", strerror(-ret)); -return -EINVAL; -} -return 0; +return 
ram_block_discard_range(rb, offset, size) ? -EINVAL : 0; } static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) -- 2.30.2
[PATCH RESEND v7 06/13] vfio: Support for RamDiscardManager in the !vIOMMU case
Implement support for RamDiscardManager, to prepare for virtio-mem support. Instead of mapping the whole memory section, we only map "populated" parts and update the mapping when notified about discarding/population of memory via the RamDiscardListener. Similarly, when syncing the dirty bitmaps, sync only the actually mapped (populated) parts by replaying via the notifier. Using virtio-mem with vfio is still blocked via ram_block_discard_disable()/ram_block_discard_require() after this patch. Reviewed-by: Alex Williamson Acked-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 164 ++ include/hw/vfio/vfio-common.h | 11 +++ 2 files changed, 175 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index ae5654fcdb..5af7755227 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -649,6 +649,110 @@ out: rcu_read_unlock(); } +static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, +MemoryRegionSection *section) +{ +VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, +listener); +const hwaddr size = int128_get64(section->size); +const hwaddr iova = section->offset_within_address_space; +int ret; + +/* Unmap with a single call. 
*/ +ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); +if (ret) { +error_report("%s: vfio_dma_unmap() failed: %s", __func__, + strerror(-ret)); +} +} + +static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, +MemoryRegionSection *section) +{ +VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, +listener); +const hwaddr end = section->offset_within_region + + int128_get64(section->size); +hwaddr start, next, iova; +void *vaddr; +int ret; + +/* + * Map in (aligned within memory region) minimum granularity, so we can + * unmap in minimum granularity later. + */ +for (start = section->offset_within_region; start < end; start = next) { +next = ROUND_UP(start + 1, vrdl->granularity); +next = MIN(next, end); + +iova = start - section->offset_within_region + + section->offset_within_address_space; +vaddr = memory_region_get_ram_ptr(section->mr) + start; + +ret = vfio_dma_map(vrdl->container, iova, next - start, + vaddr, section->readonly); +if (ret) { +/* Rollback */ +vfio_ram_discard_notify_discard(rdl, section); +return ret; +} +} +return 0; +} + +static void vfio_register_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); +VFIORamDiscardListener *vrdl; + +/* Ignore some corner cases not relevant in practice. 
*/ +g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); +g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, + TARGET_PAGE_SIZE)); +g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); + +vrdl = g_new0(VFIORamDiscardListener, 1); +vrdl->container = container; +vrdl->mr = section->mr; +vrdl->offset_within_address_space = section->offset_within_address_space; +vrdl->size = int128_get64(section->size); +vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, +section->mr); + +g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); +g_assert(vrdl->granularity >= 1 << ctz64(container->pgsizes)); + +ram_discard_listener_init(&vrdl->listener, + vfio_ram_discard_notify_populate, + vfio_ram_discard_notify_discard, true); +ram_discard_manager_register_listener(rdm, &vrdl->listener, section); +QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); +} + +static void vfio_unregister_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); +VFIORamDiscardListener *vrdl = NULL; + +QLIST_FOREACH(vrdl, &container->vrdl_list, next) { +if (vrdl->mr == section->mr && +vrdl->offset_within_address_space == +section->offset_within_addr
[PATCH RESEND v7 03/13] virtio-mem: Factor out traversing unplugged ranges
Let's factor out the core logic, no need to replicate. Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 86 -- 1 file changed, 49 insertions(+), 37 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 655824ff81..471e464171 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -145,6 +145,33 @@ static bool virtio_mem_is_busy(void) return migration_in_incoming_postcopy() || !migration_is_idle(); } +typedef int (*virtio_mem_range_cb)(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size); + +static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg, + virtio_mem_range_cb cb) +{ +unsigned long first_zero_bit, last_zero_bit; +uint64_t offset, size; +int ret = 0; + +first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); +while (first_zero_bit < vmem->bitmap_size) { +offset = first_zero_bit * vmem->block_size; +last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + first_zero_bit + 1) - 1; +size = (last_zero_bit - first_zero_bit + 1) * vmem->block_size; + +ret = cb(vmem, arg, offset, size); +if (ret) { +break; +} +first_zero_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, +last_zero_bit + 2); +} +return ret; +} + static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa, uint64_t size, bool plugged) { @@ -594,33 +621,27 @@ static void virtio_mem_device_unrealize(DeviceState *dev) ram_block_discard_require(false); } -static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) +static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size) { RAMBlock *rb = vmem->memdev->mr.ram_block; -unsigned long first_zero_bit, 
last_zero_bit; -uint64_t offset, length; int ret; -/* Find consecutive unplugged blocks and discard the consecutive range. */ -first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); -while (first_zero_bit < vmem->bitmap_size) { -offset = first_zero_bit * vmem->block_size; -last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, - first_zero_bit + 1) - 1; -length = (last_zero_bit - first_zero_bit + 1) * vmem->block_size; - -ret = ram_block_discard_range(rb, offset, length); -if (ret) { -error_report("Unexpected error discarding RAM: %s", - strerror(-ret)); -return -EINVAL; -} -first_zero_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, -last_zero_bit + 2); +ret = ram_block_discard_range(rb, offset, size); +if (ret) { +error_report("Unexpected error discarding RAM: %s", strerror(-ret)); +return -EINVAL; } return 0; } +static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) +{ +/* Make sure all memory is really discarded after migration. */ +return virtio_mem_for_each_unplugged_range(vmem, NULL, + virtio_mem_discard_range_cb); +} + static int virtio_mem_post_load(void *opaque, int version_id) { if (migration_in_incoming_postcopy()) { @@ -872,28 +893,19 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name, vmem->block_size = value; } -static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem) +static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size) { void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block); -unsigned long first_zero_bit, last_zero_bit; -uint64_t offset, length; -/* - * Find consecutive unplugged blocks and exclude them from migration. - * - * Note: Blocks cannot get (un)plugged during precopy, no locking needed. 
- */ -first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); -while (first_zero_bit < vmem->bitmap_size) { -offset = first_zero_bit * vmem->block_size; -last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, - first_zero_bit + 1) - 1; -length = (last_zero_bit - first_zero_bit + 1) * vmem->block_
[PATCH RESEND v7 01/13] memory: Introduce RamDiscardManager for RAM memory regions
We have some special RAM memory regions (managed by virtio-mem), whereby the guest agreed to only use selected memory ranges. "unused" parts are discarded so they won't consume memory - to logically unplug these memory ranges. Before the VM is allowed to use such logically unplugged memory again, coordination with the hypervisor is required. This results in "sparse" mmaps/RAMBlocks/memory regions, whereby only coordinated parts are valid to be used/accessed by the VM. In most cases, we don't care about that - e.g., in KVM, we simply have a single KVM memory slot. However, in case of vfio, registering the whole region with the kernel results in all pages getting pinned, and therefore an unexpected high memory consumption - discarding of RAM in that context is broken. Let's introduce a way to coordinate discarding/populating memory within a RAM memory region with such special consumers of RAM memory regions: they can register as listeners and get updates on memory getting discarded and populated. Using this machinery, vfio will be able to map only the currently populated parts, resulting in discarded parts not getting pinned and not consuming memory. A RamDiscardManager has to be set for a memory region before it is getting mapped, and cannot change while the memory region is mapped. Note: At some point, we might want to let RAMBlock users (esp. vfio used for nvme://) consume this interface as well. We'll need RAMBlock notifier calls when a RAMBlock is getting mapped/unmapped (via the corresponding memory region), so we can properly register a listener there as well. Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. 
David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 286 ++ softmmu/memory.c | 71 +++ 2 files changed, 335 insertions(+), 22 deletions(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index 5728a681b2..38a3b41ac1 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -42,6 +42,12 @@ typedef struct IOMMUMemoryRegionClass IOMMUMemoryRegionClass; DECLARE_OBJ_CHECKERS(IOMMUMemoryRegion, IOMMUMemoryRegionClass, IOMMU_MEMORY_REGION, TYPE_IOMMU_MEMORY_REGION) +#define TYPE_RAM_DISCARD_MANAGER "qemu:ram-discard-manager" +typedef struct RamDiscardManagerClass RamDiscardManagerClass; +typedef struct RamDiscardManager RamDiscardManager; +DECLARE_OBJ_CHECKERS(RamDiscardManager, RamDiscardManagerClass, + RAM_DISCARD_MANAGER, TYPE_RAM_DISCARD_MANAGER); + #ifdef CONFIG_FUZZ void fuzz_dma_read_cb(size_t addr, size_t len, @@ -65,6 +71,28 @@ struct ReservedRegion { unsigned type; }; +/** + * struct MemoryRegionSection: describes a fragment of a #MemoryRegion + * + * @mr: the region, or %NULL if empty + * @fv: the flat view of the address space the region is mapped in + * @offset_within_region: the beginning of the section, relative to @mr's start + * @size: the size of the section; will not exceed @mr's boundaries + * @offset_within_address_space: the address of the first byte of the section + * relative to the region's address space + * @readonly: writes to this section are ignored + * @nonvolatile: this section is non-volatile + */ +struct MemoryRegionSection { +Int128 size; +MemoryRegion *mr; +FlatView *fv; +hwaddr offset_within_region; +hwaddr offset_within_address_space; +bool readonly; +bool nonvolatile; +}; + typedef struct IOMMUTLBEntry IOMMUTLBEntry; /* See address_space_translate: bit 0 is read, bit 1 is write. 
*/ @@ -441,6 +469,206 @@ struct IOMMUMemoryRegionClass { Error **errp); }; +typedef struct RamDiscardListener RamDiscardListener; +typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl, + MemoryRegionSection *section); +typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl, + MemoryRegionSection *section); + +struct RamDiscardListener { +/* + * @notify_populate: + * + * Notification that previously discarded memory is about to get populated. + * Listeners are able to object. If any listener objects, already + * successfully notified listeners are notified about a discard again. + * + * @rdl: the #RamDiscardListener getting notified + * @section: the #MemoryRegionSection to get populated. The section + * is aligned within the memory region to the minimum granularity + * unless it would exceed the registered section. + * + * Returns 0 on success. If the notification is rejected by the listener, + * an error is returned. + */ +NotifyRamPopulate notify_populate; + +/* + * @notify_
[PATCH RESEND v7 10/13] softmmu/physmem: Don't use atomic operations in ram_block_discard_(disable|require)
We have users in migration context that don't hold the BQL (when finishing migration). To prepare for further changes, use a dedicated mutex instead of atomic operations. Keep using qatomic_read ("READ_ONCE") for the functions that only extract the current state (e.g., used by virtio-balloon), locking isn't necessary. While at it, split up the counter into two variables to make it easier to understand. Suggested-by: Peter Xu Reviewed-by: Peter Xu Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- softmmu/physmem.c | 70 ++- 1 file changed, 39 insertions(+), 31 deletions(-) diff --git a/softmmu/physmem.c b/softmmu/physmem.c index 85034d9c11..aaa2b2eb92 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -3647,56 +3647,64 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root) } } -/* - * If positive, discarding RAM is disabled. If negative, discarding RAM is - * required to work and cannot be disabled. 
- */ -static int ram_block_discard_disabled; +static unsigned int ram_block_discard_required_cnt; +static unsigned int ram_block_discard_disabled_cnt; +static QemuMutex ram_block_discard_disable_mutex; + +static void ram_block_discard_disable_mutex_lock(void) +{ +static gsize initialized; + +if (g_once_init_enter(&initialized)) { +qemu_mutex_init(&ram_block_discard_disable_mutex); +g_once_init_leave(&initialized, 1); +} +qemu_mutex_lock(&ram_block_discard_disable_mutex); +} + +static void ram_block_discard_disable_mutex_unlock(void) +{ +qemu_mutex_unlock(&ram_block_discard_disable_mutex); +} int ram_block_discard_disable(bool state) { -int old; +int ret = 0; +ram_block_discard_disable_mutex_lock(); if (!state) { -qatomic_dec(&ram_block_discard_disabled); -return 0; +ram_block_discard_disabled_cnt--; +} else if (!ram_block_discard_required_cnt) { +ram_block_discard_disabled_cnt++; +} else { +ret = -EBUSY; } - -do { -old = qatomic_read(&ram_block_discard_disabled); -if (old < 0) { -return -EBUSY; -} -} while (qatomic_cmpxchg(&ram_block_discard_disabled, - old, old + 1) != old); -return 0; +ram_block_discard_disable_mutex_unlock(); +return ret; } int ram_block_discard_require(bool state) { -int old; +int ret = 0; +ram_block_discard_disable_mutex_lock(); if (!state) { -qatomic_inc(&ram_block_discard_disabled); -return 0; +ram_block_discard_required_cnt--; +} else if (!ram_block_discard_disabled_cnt) { +ram_block_discard_required_cnt++; +} else { +ret = -EBUSY; } - -do { -old = qatomic_read(&ram_block_discard_disabled); -if (old > 0) { -return -EBUSY; -} -} while (qatomic_cmpxchg(&ram_block_discard_disabled, - old, old - 1) != old); -return 0; +ram_block_discard_disable_mutex_unlock(); +return ret; } bool ram_block_discard_is_disabled(void) { -return qatomic_read(&ram_block_discard_disabled) > 0; +return qatomic_read(&ram_block_discard_disabled_cnt); } bool ram_block_discard_is_required(void) { -return qatomic_read(&ram_block_discard_disabled) < 0; +return 
qatomic_read(&ram_block_discard_required_cnt); } -- 2.30.2
[PATCH RESEND v7 08/13] vfio: Sanity check maximum number of DMA mappings with RamDiscardManager
Although RamDiscardManager can handle running into the maximum number of DMA mappings by propagating errors when creating a DMA mapping, we want to sanity check and warn the user early that there is a theoretical setup issue and that virtio-mem might not be able to provide as much memory towards a VM as desired. As suggested by Alex, let's use the number of KVM memory slots to guess how many other mappings we might see over time. Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 79628d60ae..f8a2fe8441 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -728,6 +728,49 @@ static void vfio_register_ram_discard_listener(VFIOContainer *container, vfio_ram_discard_notify_discard, true); ram_discard_manager_register_listener(rdm, &vrdl->listener, section); QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + +/* + * Sanity-check if we have a theoretically problematic setup where we could + * exceed the maximum number of possible DMA mappings over time. We assume + * that each mapped section in the same address space as a RamDiscardManager + * section consumes exactly one DMA mapping, with the exception of + * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections + * in the same address space as RamDiscardManager sections. + * + * We assume that each section in the address space consumes one memslot. + * We take the number of KVM memory slots as a best guess for the maximum + * number of sections in the address space we could have over time, + * also consuming DMA mappings. 
+ */ +if (container->dma_max_mappings) { +unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; + +#ifdef CONFIG_KVM +if (kvm_enabled()) { +max_memslots = kvm_get_max_memslots(); +} +#endif + +QLIST_FOREACH(vrdl, &container->vrdl_list, next) { +hwaddr start, end; + +start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, +vrdl->granularity); +end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, + vrdl->granularity); +vrdl_mappings += (end - start) / vrdl->granularity; +vrdl_count++; +} + +if (vrdl_mappings + max_memslots - vrdl_count > +container->dma_max_mappings) { +warn_report("%s: possibly running out of DMA mappings. E.g., try" +" increasing the 'block-size' of virtio-mem devies." +" Maximum possible DMA mappings: %d, Maximum possible" +" memslots: %d", __func__, container->dma_max_mappings, +max_memslots); +} +} } static void vfio_unregister_ram_discard_listener(VFIOContainer *container, -- 2.30.2
[PATCH RESEND v7 07/13] vfio: Query and store the maximum number of possible DMA mappings
Let's query the maximum number of possible DMA mappings by querying the available mappings when creating the container (before any mappings are created). We'll use this informaton soon to perform some sanity checks and warn the user. Reviewed-by: Alex Williamson Acked-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 4 include/hw/vfio/vfio-common.h | 1 + 2 files changed, 5 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 5af7755227..79628d60ae 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1931,6 +1931,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, container->fd = fd; container->error = NULL; container->dirty_pages_supported = false; +container->dma_max_mappings = 0; QLIST_INIT(&container->giommu_list); QLIST_INIT(&container->hostwin_list); QLIST_INIT(&container->vrdl_list); @@ -1962,7 +1963,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); container->pgsizes = info->iova_pgsizes; +/* The default in the kernel ("dma_entry_limit") is 65535. 
*/ +container->dma_max_mappings = 65535; if (!ret) { +vfio_get_info_dma_avail(info, &container->dma_max_mappings); vfio_get_iommu_info_migration(container, info); } g_free(info); diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 681432213d..8af11b0a76 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -88,6 +88,7 @@ typedef struct VFIOContainer { uint64_t dirty_pgsizes; uint64_t max_dirty_bitmap_size; unsigned long pgsizes; +unsigned int dma_max_mappings; QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; QLIST_HEAD(, VFIOGroup) group_list; -- 2.30.2
[PATCH RESEND v7 13/13] vfio: Disable only uncoordinated discards for VFIO_TYPE1 iommus
We support coordinated discarding of RAM using the RamDiscardManager for the VFIO_TYPE1 iommus. Let's unlock support for coordinated discards, keeping uncoordinated discards (e.g., via virtio-balloon) disabled if possible. This unlocks virtio-mem + vfio on x86-64. Note that vfio used via "nvme://" by the block layer has to be implemented/unlocked separately. For now, virtio-mem only supports x86-64; we don't restrict RamDiscardManager to x86-64, though: arm64 and s390x are supposed to work as well, and we'll test once unlocking virtio-mem support. The spapr IOMMUs will need special care, to be tackled later, e.g.., once supporting virtio-mem. Note: The block size of a virtio-mem device has to be set to sane sizes, depending on the maximum hotplug size - to not run out of vfio mappings. The default virtio-mem block size is usually in the range of a couple of MBs. The maximum number of mapping is 64k, shared with other users. Assume you want to hotplug 256GB using virtio-mem - the block size would have to be set to at least 8 MiB (resulting in 32768 separate mappings). Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 65 +++- 1 file changed, 53 insertions(+), 12 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 8a9bbf2791..3f0d111360 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -135,6 +135,29 @@ static const char *index_to_str(VFIODevice *vbasedev, int index) } } +static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +{ +switch (container->iommu_type) { +case VFIO_TYPE1v2_IOMMU: +case VFIO_TYPE1_IOMMU: +/* + * We support coordinated discarding of RAM via the RamDiscardManager. 
+ */ +return ram_block_uncoordinated_discard_disable(state); +default: +/* + * VFIO_SPAPR_TCE_IOMMU most probably works just fine with + * RamDiscardManager, however, it is completely untested. + * + * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does + * completely the opposite of managing mapping/pinning dynamically as + * required by RamDiscardManager. We would have to special-case sections + * with a RamDiscardManager. + */ +return ram_block_discard_disable(state); +} +} + int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex, int action, int fd, Error **errp) { @@ -1977,15 +2000,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * new memory, it will not yet set ram_block_discard_set_required() and * therefore, neither stops us here or deals with the sudden memory * consumption of inflated memory. + * + * We do support discarding of memory coordinated via the RamDiscardManager + * with some IOMMU types. vfio_ram_block_discard_disable() handles the + * details once we know which type of IOMMU we are using. 
*/ -ret = ram_block_discard_disable(true); -if (ret) { -error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); -return ret; -} QLIST_FOREACH(container, &space->containers, next) { if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { +ret = vfio_ram_block_discard_disable(container, true); +if (ret) { +error_setg_errno(errp, -ret, + "Cannot set discarding of RAM broken"); +if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, + &container->fd)) { +error_report("vfio: error disconnecting group %d from" + " container", group->groupid); +} +return ret; +} group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); vfio_kvm_device_add_group(group); @@ -2023,6 +2056,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, goto free_container_exit; } +ret = vfio_ram_block_discard_disable(container, true); +if (ret) { +error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); +goto free_container_exit; +} + switch (container->iommu_type) { case VFIO_TYPE1v2_IOMMU: case VFIO_TYPE1_IOMMU: @@ -2070,7 +2109,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, if (ret) { error_setg_errno(errp, errno, "failed to enable container"); ret = -errno; -goto free_conta
[PATCH RESEND v7 05/13] virtio-mem: Implement RamDiscardManager interface
Let's properly notify when (un)plugging blocks, after discarding memory and before allowing the guest to consume memory. Handle errors from notifiers gracefully (e.g., no remaining VFIO mappings) when plugging, rolling back the change and telling the guest that the VM is busy. One special case to take care of is replaying all notifications after restoring the vmstate. The device starts out with all memory discarded, so after loading the vmstate, we have to notify about all plugged blocks. Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 288 - include/hw/virtio/virtio-mem.h | 3 + 2 files changed, 288 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index bbe42ad83b..e209b56057 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -172,7 +172,146 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg, return ret; } -static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa, +/* + * Adjust the memory section to cover the intersection with the given range. + * + * Returns false if the intersection is empty, otherwise returns true. 
+ */ +static bool virito_mem_intersect_memory_section(MemoryRegionSection *s, +uint64_t offset, uint64_t size) +{ +uint64_t start = MAX(s->offset_within_region, offset); +uint64_t end = MIN(s->offset_within_region + int128_get64(s->size), + offset + size); + +if (end <= start) { +return false; +} + +s->offset_within_address_space += start - s->offset_within_region; +s->offset_within_region = start; +s->size = int128_make64(end - start); +return true; +} + +typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg); + +static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem, + MemoryRegionSection *s, + void *arg, + virtio_mem_section_cb cb) +{ +unsigned long first_bit, last_bit; +uint64_t offset, size; +int ret = 0; + +first_bit = s->offset_within_region / vmem->bitmap_size; +first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, first_bit); +while (first_bit < vmem->bitmap_size) { +MemoryRegionSection tmp = *s; + +offset = first_bit * vmem->block_size; +last_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, + first_bit + 1) - 1; +size = (last_bit - first_bit + 1) * vmem->block_size; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +break; +} +ret = cb(&tmp, arg); +if (ret) { +break; +} +first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + last_bit + 2); +} +return ret; +} + +static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg) +{ +RamDiscardListener *rdl = arg; + +return rdl->notify_populate(rdl, s); +} + +static int virtio_mem_notify_discard_cb(MemoryRegionSection *s, void *arg) +{ +RamDiscardListener *rdl = arg; + +rdl->notify_discard(rdl, s); +return 0; +} + +static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset, + uint64_t size) +{ +RamDiscardListener *rdl; + +QLIST_FOREACH(rdl, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +rdl->notify_discard(rdl, &tmp); +} 
+} + +static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset, + uint64_t size) +{ +RamDiscardListener *rdl, *rdl2; +int ret = 0; + +QLIST_FOREACH(rdl, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +ret = rdl->notify_populate(rdl, &tmp); +if (ret) { +break; +} +} + +if (ret) { +/* Notify all already-notified listeners. */ +QLIST_FOREACH(rdl2, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (rdl2 == rdl) { +break; +} +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +rdl2->notify_discard(rdl2, &tmp); +} +} +return re
[PATCH RESEND v7 11/13] softmmu/physmem: Extend ram_block_discard_(require|disable) by two discard types
We want to separate the two cases whereby we discard ram - uncoordinated: e.g., virito-balloon - coordinated: e.g., virtio-mem coordinated via the RamDiscardManager Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 18 +-- softmmu/physmem.c | 54 ++- 2 files changed, 64 insertions(+), 8 deletions(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index e806d0140e..bba7b6643e 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -2873,6 +2873,12 @@ static inline MemOp devend_memop(enum device_endian end) */ int ram_block_discard_disable(bool state); +/* + * See ram_block_discard_disable(): only disable uncoordinated discards, + * keeping coordinated discards (via the RamDiscardManager) enabled. + */ +int ram_block_uncoordinated_discard_disable(bool state); + /* * Inhibit technologies that disable discarding of pages in RAM blocks. * @@ -2882,12 +2888,20 @@ int ram_block_discard_disable(bool state); int ram_block_discard_require(bool state); /* - * Test if discarding of memory in ram blocks is disabled. + * See ram_block_discard_require(): only inhibit technologies that disable + * uncoordinated discarding of pages in RAM blocks, allowing co-existance with + * technologies that only inhibit uncoordinated discards (via the + * RamDiscardManager). + */ +int ram_block_coordinated_discard_require(bool state); + +/* + * Test if any discarding of memory in ram blocks is disabled. */ bool ram_block_discard_is_disabled(void); /* - * Test if discarding of memory in ram blocks is required to work reliably. + * Test if any discarding of memory in ram blocks is required to work reliably. 
*/ bool ram_block_discard_is_required(void); diff --git a/softmmu/physmem.c b/softmmu/physmem.c index aaa2b2eb92..ead7b5c429 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -3647,8 +3647,14 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root) } } +/* Require any discards to work. */ static unsigned int ram_block_discard_required_cnt; +/* Require only coordinated discards to work. */ +static unsigned int ram_block_coordinated_discard_required_cnt; +/* Disable any discards. */ static unsigned int ram_block_discard_disabled_cnt; +/* Disable only uncoordinated discards. */ +static unsigned int ram_block_uncoordinated_discard_disabled_cnt; static QemuMutex ram_block_discard_disable_mutex; static void ram_block_discard_disable_mutex_lock(void) @@ -3674,10 +3680,27 @@ int ram_block_discard_disable(bool state) ram_block_discard_disable_mutex_lock(); if (!state) { ram_block_discard_disabled_cnt--; -} else if (!ram_block_discard_required_cnt) { -ram_block_discard_disabled_cnt++; +} else if (ram_block_discard_required_cnt || + ram_block_coordinated_discard_required_cnt) { +ret = -EBUSY; } else { +ram_block_discard_disabled_cnt++; +} +ram_block_discard_disable_mutex_unlock(); +return ret; +} + +int ram_block_uncoordinated_discard_disable(bool state) +{ +int ret = 0; + +ram_block_discard_disable_mutex_lock(); +if (!state) { +ram_block_uncoordinated_discard_disabled_cnt--; +} else if (ram_block_discard_required_cnt) { ret = -EBUSY; +} else { +ram_block_uncoordinated_discard_disabled_cnt++; } ram_block_discard_disable_mutex_unlock(); return ret; @@ -3690,10 +3713,27 @@ int ram_block_discard_require(bool state) ram_block_discard_disable_mutex_lock(); if (!state) { ram_block_discard_required_cnt--; -} else if (!ram_block_discard_disabled_cnt) { -ram_block_discard_required_cnt++; +} else if (ram_block_discard_disabled_cnt || + ram_block_uncoordinated_discard_disabled_cnt) { +ret = -EBUSY; } else { +ram_block_discard_required_cnt++; +} 
+ram_block_discard_disable_mutex_unlock(); +return ret; +} + +int ram_block_coordinated_discard_require(bool state) +{ +int ret = 0; + +ram_block_discard_disable_mutex_lock(); +if (!state) { +ram_block_coordinated_discard_required_cnt--; +} else if (ram_block_discard_disabled_cnt) { ret = -EBUSY; +} else { +ram_block_coordinated_discard_required_cnt++; } ram_block_discard_disable_mutex_unlock(); return ret; @@ -3701,10 +3741,12 @@ int ram_block_discard_require(bool state) bool ram_block_discard_is_disabled(void) { -return qatomic_read(&ram_block_discard_disabled_cnt); +return qatomic_read(&ram_block_discard_disabled_cnt) || + qatomic_read(&ram_block_uncoordinated_discard_disabled_cnt); } bool ra
[PATCH RESEND v7 09/13] vfio: Support for RamDiscardManager in the vIOMMU case
vIOMMU support works already with RamDiscardManager as long as guests only map populated memory. Both, populated and discarded memory is mapped into &address_space_memory, where vfio_get_xlat_addr() will find that memory, to create the vfio mapping. Sane guests will never map discarded memory (e.g., unplugged memory blocks in virtio-mem) into an IOMMU - or keep it mapped into an IOMMU while memory is getting discarded. However, there are two cases where a malicious guests could trigger pinning of more memory than intended. One case is easy to handle: the guest trying to map discarded memory into an IOMMU. The other case is harder to handle: the guest keeping memory mapped in the IOMMU while it is getting discarded. We would have to walk over all mappings when discarding memory and identify if any mapping would be a violation. Let's keep it simple for now and print a warning, indicating that setting RLIMIT_MEMLOCK can mitigate such attacks. We have to take care of incoming migration: at the point the IOMMUs get restored and start creating mappings in vfio, RamDiscardManager implementations might not be back up and running yet: let's add runstate priorities to enforce the order when restoring. Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. 
David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c| 39 + hw/virtio/virtio-mem.c | 1 + include/migration/vmstate.h | 1 + 3 files changed, 41 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index f8a2fe8441..8a9bbf2791 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -36,6 +36,7 @@ #include "qemu/range.h" #include "sysemu/kvm.h" #include "sysemu/reset.h" +#include "sysemu/runstate.h" #include "trace.h" #include "qapi/error.h" #include "migration/migration.h" @@ -569,6 +570,44 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, error_report("iommu map to non memory area %"HWADDR_PRIx"", xlat); return false; +} else if (memory_region_has_ram_discard_manager(mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); +MemoryRegionSection tmp = { +.mr = mr, +.offset_within_region = xlat, +.size = int128_make64(len), +}; + +/* + * Malicious VMs can map memory into the IOMMU, which is expected + * to remain discarded. vfio will pin all pages, populating memory. + * Disallow that. vmstate priorities make sure any RamDiscardManager + * were already restored before IOMMUs are restored. + */ +if (!ram_discard_manager_is_populated(rdm, &tmp)) { +error_report("iommu map to discarded memory (e.g., unplugged via" + " virtio-mem): %"HWADDR_PRIx"", + iotlb->translated_addr); +return false; +} + +/* + * Malicious VMs might trigger discarding of IOMMU-mapped memory. The + * pages will remain pinned inside vfio until unmapped, resulting in a + * higher memory consumption than expected. If memory would get + * populated again later, there would be an inconsistency between pages + * pinned by vfio and pages seen by QEMU. This is the case until + * unmapped from the IOMMU (e.g., during device reset). + * + * With malicious guests, we really only care about pinning more memory + * than expected. 
RLIMIT_MEMLOCK set for the user/process can never be + * exceeded and can be used to mitigate this problem. + */ +warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" + " RAM (e.g., virtio-mem) works, however, malicious" + " guests can trigger pinning of more memory than" + " intended via an IOMMU. It's possible to mitigate " + " by setting/adjusting RLIMIT_MEMLOCK."); } /* diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index e209b56057..cbd07fc9f1 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -886,6 +886,7 @@ static const VMStateDescription vmstate_virtio_mem_device = { .name = "virtio-mem-device", .minimum_version_id = 1, .version_id = 1, +.priority = MIG_PRI_VIRTIO_MEM, .post_load = virtio_mem_post_load, .fields = (VMStateField[]) { VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks, diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 075ee80096..3bf58ff043 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -153,6 +153,7 @@ typedef enum
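A toy model may make the first case concrete: a bitmap tracks which fixed-size blocks of a region are currently populated, and an IOMMU map request is refused unless every block it covers is populated. The block size, names, and bitmap layout below are invented for illustration; the real check goes through the RamDiscardManager interface (ram_discard_manager_is_populated()).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the populated-memory check: one bit per BLOCK_SIZE block,
 * set when the block is populated, clear when it has been discarded.
 * A mapping request covering any discarded block is rejected. */
#define BLOCK_SIZE 4096u

static bool range_is_populated(const uint8_t *bitmap,
                               uint64_t offset, uint64_t len)
{
    uint64_t first = offset / BLOCK_SIZE;
    uint64_t last = (offset + len - 1) / BLOCK_SIZE;

    for (uint64_t b = first; b <= last; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            return false;   /* a discarded block inside the range */
        }
    }
    return true;
}
```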
[PATCH RESEND v7 12/13] virtio-mem: Require only coordinated discards
We implement the RamDiscardManager interface and only require coordinated discarding of RAM to work. Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index cbd07fc9f1..9e79b5b50c 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -719,7 +719,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) return; } -if (ram_block_discard_require(true)) { +if (ram_block_coordinated_discard_require(true)) { error_setg(errp, "Discarding RAM is disabled"); return; } @@ -727,7 +727,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb)); if (ret) { error_setg_errno(errp, -ret, "Unexpected error discarding RAM"); -ram_block_discard_require(false); +ram_block_coordinated_discard_require(false); return; } @@ -771,7 +771,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev) virtio_del_queue(vdev, 0); virtio_cleanup(vdev); g_free(vmem->bitmap); -ram_block_discard_require(false); +ram_block_coordinated_discard_require(false); } static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, -- 2.30.2
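The ram_block_coordinated_discard_require() used above pairs with ram_block_discard_disable(): two reference counts that exclude each other under a single mutex. Stripped of QEMU specifics (and of the separate uncoordinated counter that the real softmmu/physmem.c code also tracks), the scheme looks roughly like this; all names are invented:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned discard_disabled_cnt;       /* holders of "no discards" */
static unsigned coordinated_required_cnt;   /* holders of "discards needed" */

static int coordinated_discard_require(bool state)
{
    int ret = 0;

    pthread_mutex_lock(&lock);
    if (!state) {
        coordinated_required_cnt--;         /* drop one reference */
    } else if (discard_disabled_cnt) {
        ret = -EBUSY;                       /* the other side holds it */
    } else {
        coordinated_required_cnt++;         /* take one reference */
    }
    pthread_mutex_unlock(&lock);
    return ret;
}

static int discard_disable(bool state)
{
    int ret = 0;

    pthread_mutex_lock(&lock);
    if (!state) {
        discard_disabled_cnt--;
    } else if (coordinated_required_cnt) {
        ret = -EBUSY;                       /* a device needs discards */
    } else {
        discard_disabled_cnt++;
    }
    pthread_mutex_unlock(&lock);
    return ret;
}
```

A device like virtio-mem takes the reference in realize and drops it in unrealize or on the error path, exactly as the hunks above show.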
[PULL 0/3] MIPS patches for 2021-04-13
The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 12:12:09 +0100) are available in the Git repository at: https://github.com/philmd/qemu.git tags/mips-20210413 for you to fetch changes up to f4349ba966abfe39f5d98694abd7c7551d5c8c02: target/mips: Fix TCG temporary leak in gen_cache_operation() (2021-04-13 12:07:00 +0200) MIPS patches queue - Fix invalid Kconfig dependency - Fix missing migrated value - Fix TCG temporary leak Philippe Mathieu-Daudé (3): hw/isa/Kconfig: Add missing dependency VIA VT82C686 -> APM hw/isa/piix4: Migrate Reset Control Register target/mips: Fix TCG temporary leak in gen_cache_operation() hw/isa/piix4.c | 15 ++- target/mips/translate.c | 2 ++ hw/isa/Kconfig | 1 + 3 files changed, 17 insertions(+), 1 deletion(-) -- 2.26.3
[PULL 1/3] hw/isa/Kconfig: Add missing dependency VIA VT82C686 -> APM
TYPE_VIA_PM calls apm_init() in via_pm_realize(), so requires APM to be selected. Reported-by: BALATON Zoltan Fixes: dd0ff8191ab ("isa: express SuperIO dependencies with Kconfig") Signed-off-by: Philippe Mathieu-Daudé Message-Id: <20210302080531.913802-1-f4...@amsat.org> --- hw/isa/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/isa/Kconfig b/hw/isa/Kconfig index 2691eae2f0c..55e0003ce40 100644 --- a/hw/isa/Kconfig +++ b/hw/isa/Kconfig @@ -48,6 +48,7 @@ config VT82C686 select SERIAL_ISA select FDC select USB_UHCI +select APM config SMC37C669 bool -- 2.26.3
[PULL 2/3] hw/isa/piix4: Migrate Reset Control Register
When adding the Reset Control Register in commit 5790b757cfb we forgot to migrate it. While it is possible that a VM using the PIIX4 is migrated just after requesting a system shutdown, it is very unlikely. However, when restoring a migrated VM, we might have the RCR bit #4 set on the stack, and when the VM resumes it directly shuts down. Add a post_load() migration handler and set the default RCR value to 0 for earlier versions, assuming the VM was not going to shut down before migration. Fixes: 5790b757cfb ("piix4: Add the Reset Control Register") Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Dr. David Alan Gilbert Message-Id: <20210324200334.729899-1-f4...@amsat.org> --- hw/isa/piix4.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/hw/isa/piix4.c b/hw/isa/piix4.c index a50d97834c7..b3b6a4378a3 100644 --- a/hw/isa/piix4.c +++ b/hw/isa/piix4.c @@ -93,12 +93,25 @@ static void piix4_isa_reset(DeviceState *dev) pci_conf[0xae] = 0x00; } +static int piix4_ide_post_load(void *opaque, int version_id) +{ +PIIX4State *s = opaque; + +if (version_id == 2) { +s->rcr = 0; +} + +return 0; +} + static const VMStateDescription vmstate_piix4 = { .name = "PIIX4", -.version_id = 2, +.version_id = 3, .minimum_version_id = 2, +.post_load = piix4_ide_post_load, .fields = (VMStateField[]) { VMSTATE_PCI_DEVICE(dev, PIIX4State), +VMSTATE_UINT8_V(rcr, PIIX4State, 3), VMSTATE_END_OF_LIST() } }; -- 2.26.3
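The shape of the compatibility fix can be modelled in a few lines: version 2 streams carry no RCR value, so post-load must pick a safe default instead of keeping whatever happens to be in the field, while version 3 streams keep what they carried. The struct and function below are stand-ins, not QEMU's actual types.

```c
#include <stdint.h>

/* Minimal stand-alone model of the post_load compatibility handler. */
typedef struct {
    uint8_t rcr;    /* Reset Control Register, migrated only since v3 */
} ToyPIIX4State;

static int toy_post_load(ToyPIIX4State *s, int version_id)
{
    if (version_id == 2) {
        s->rcr = 0; /* old stream: assume no reset/shutdown requested */
    }
    return 0;
}
```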
[PULL 3/3] target/mips: Fix TCG temporary leak in gen_cache_operation()
Fix a TCG temporary leak when translating CACHE opcode. Fixes: 0d74a222c27 ("make ITC Configuration Tags accessible to the CPU") Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Richard Henderson Message-Id: <20210406202857.1440744-1-f4...@amsat.org> --- target/mips/translate.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/target/mips/translate.c b/target/mips/translate.c index c518bf3963b..71fa5ec1973 100644 --- a/target/mips/translate.c +++ b/target/mips/translate.c @@ -12804,6 +12804,8 @@ static void gen_cache_operation(DisasContext *ctx, uint32_t op, int base, TCGv t1 = tcg_temp_new(); gen_base_offset_addr(ctx, t1, base, offset); gen_helper_cache(cpu_env, t1, t0); +tcg_temp_free(t1); +tcg_temp_free_i32(t0); } #if defined(TARGET_MIPS64) -- 2.26.3
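The underlying discipline is simply that every translation-time temporary must be released before the translator returns. A toy leak counter (invented names, not TCG's real tcg_temp_new()/tcg_temp_free() API) shows the pairing the patch restores:

```c
/* Toy leak counter illustrating the invariant the patch restores. */
static int live_temps;

static int temp_new(void)
{
    return ++live_temps;        /* allocate: one more live temporary */
}

static void temp_free(int t)
{
    (void)t;
    --live_temps;               /* release: pairs with temp_new() */
}

static void gen_cache_operation_model(void)
{
    int t0 = temp_new();
    int t1 = temp_new();
    /* ... code generation using t0 and t1 would go here ... */
    temp_free(t1);              /* the two calls the patch adds */
    temp_free(t0);
}
```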
Re: [PULL 0/1] NBD fix for 6.0-rc3
On Tue, 13 Apr 2021 at 07:47, Vladimir Sementsov-Ogievskiy wrote: > > 12.04.2021 18:48, Peter Maydell wrote: > > On Mon, 12 Apr 2021 at 13:19, Vladimir Sementsov-Ogievskiy > > wrote: > >> > >> The following changes since commit > >> 555249a59e9cdd6b58da103aba5cf3a2d45c899f: > >> > >>Merge remote-tracking branch > >> 'remotes/ehabkost-gl/tags/x86-next-pull-request' into staging (2021-04-10 > >> 16:58:56 +0100) > >> > >> are available in the Git repository at: > >> > >>https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-nbd-2021-04-12 > >> > >> for you to fetch changes up to d3c278b689845558cd9811940436b28ae6afc5d7: > >> > >>block/nbd: fix possible use after free of s->connect_thread (2021-04-12 > >> 11:56:03 +0300) > >> > >> > >> One fix of possible use-after-free in NBD block-driver for 6.0-rc3 > >> > >> > >> > >> Note: the tag is signed by a new key, as I've lost the old one. So, > >> now there are two keys with my name on http://keys.gnupg.net, the elder > >> is lost. I feel stupid about that :(. Anyway, both keys are not signed by > >> anybody except for myself. And this is my first pull-request to Qemu, > >> so, I think some kind of TOFU still applies. > > > > I'd really rather not deal with trying to add a new source of pull > > requests the day before rc3, please. Eric, could you do a pull > > or something? > Hmm. Ok, that's not a degradation of 6.0 and there is no existing bug > somewhere, so we can just not care for 6.0. I think fixing a use-after-free is sensible for rc3; it's only having it come to me via a new unknown submitter of pull requests that I'm not happy about. thanks -- PMM
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
Hi Paolo, On 13.04.21 09:42, Paolo Bonzini wrote: > On 07/04/21 17:42, Kevin Wolf wrote: >>> +* Publishing other's private information, such as physical or >>> electronic >>> +addresses, without explicit permission >> >> Yes, it's pretty clear that I'm not publishing new information about >> people when I'm keeping them in Cc: when replying to a thread, or even >> when they posted in another thread on the list recently. It becomes much >> less clear for adding people who aren't usually part of the QEMU >> community. > > If you took the email from, say, the Libvirt or kernel mailing lists, > that would not be considered private. If somebody has two email > addresses and you deliberately Cc him on an address that he's only using > for communications within his family, that would be a problem. I have to admit I had originally stumbled over this bullet point myself, reading it as private=personal. So maybe it might help avoid ambiguities for non-native readers to formulate it as "non-public" instead? Like, if someone posts to a public mailing list with their private as opposed to business address in the footer. Then I would consider it public. I did intentionally use a private email for topics such as PReP. Or consider the case you get a bug report not copied to the public mailing lists from someone you don't know. Then I would still expect to be allowed to attribute a commit via Reported-by/CC to that person, as it seems in his/her interest to get the bug fixed and be notified, unless explicitly requested otherwise. Mistakes can always happen, but I feel it needs to be the responsibility of the sender, not of the receiver, to ensure that only data is shared that the project members may use for valid development purposes. Not sure how to extend that bullet point to make its purpose/scope clearer. Cheers, Andreas -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Felix Imendörffer HRB 36809 (AG Nürnberg)
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
On Tue, 13 Apr 2021 at 11:23, Andreas Färber wrote: > Or consider the case you get a bug report not copied to the public > mailing lists from someone you don't know. Then I would still expect to > be allowed to attribute a commit via Reported-by/CC to that person, as > it seems in his/her interest to get the bug fixed and be notified, > unless explicitly requested otherwise. FWIW, in this kind of situation, I generally try to explicitly ask the submitter if they're OK with my adding a reported-by tag, just as a matter of politeness. Not everybody is OK with having their email address publicly recorded on mailing list archives and in git history forever. thanks -- PMM
Re: Commit "x86/kvm: Move context tracking where it belongs" broke guest time accounting
On Wed, 7 Apr 2021 at 18:55, Michael Tokarev wrote: > > Hi! > > It looks like this commit: > > commit 87fa7f3e98a1310ef1ac1900e7ee7f9610a038bc > Author: Thomas Gleixner > Date: Wed Jul 8 21:51:54 2020 +0200 > > x86/kvm: Move context tracking where it belongs > > Context tracking for KVM happens way too early in the vcpu_run() > code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() > cannot use RCU and should also be not instrumented. > > The current way of doing this covers way too much code. Move it closer to > the actual vmenter/exit code. > > broke kvm guest cpu time accounting - after this commit, when running > qemu-system-x86_64 -enable-kvm, the guest time (in /proc/stat and > elsewhere) is always 0. > > I dunno why it happened, but it happened, and all kernels after 5.9 > are affected by this. > > This commit is found in a (painful) git bisect between kernel 5.8 and 5.10. Hi Michael, Please have a try. https://lore.kernel.org/kvm/1618298169-3831-1-git-send-email-wanpen...@tencent.com/ Wanpeng
Re: [PATCH] cutils: fix memory leak in get_relocated_path()
Is this fix aiming at 6.0 release? On 4/12/21 7:02 PM, Stefano Garzarella wrote: > get_relocated_path() allocates a GString object and returns the > character data (C string) to the caller without freeing the memory > allocated for that object as reported by valgrind: > > 24 bytes in 1 blocks are definitely lost in loss record 2,805 of 6,532 > at 0x4839809: malloc (vg_replace_malloc.c:307) > by 0x55AABB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C2481: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C4827: g_string_sized_new (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C4CEA: g_string_new (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x906314: get_relocated_path (cutils.c:1036) > by 0x6E1F77: qemu_read_default_config_file (vl.c:2122) > by 0x6E1F77: qemu_init (vl.c:2687) > by 0x3E3AF8: main (main.c:49) > > Let's use g_string_free(gstring, false) to free only the GString object > and transfer the ownership of the character data to the caller. > > Fixes: f4f5ed2cbd ("cutils: introduce get_relocated_path") > Signed-off-by: Stefano Garzarella > --- > util/cutils.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/util/cutils.c b/util/cutils.c > index ee908486da..c9b91e7535 100644 > --- a/util/cutils.c > +++ b/util/cutils.c > @@ -1055,5 +1055,5 @@ char *get_relocated_path(const char *dir) > assert(G_IS_DIR_SEPARATOR(dir[-1])); > g_string_append(result, dir - 1); > } > -return result->str; > +return g_string_free(result, false); > } >
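The glib idiom may be easier to see in a dependency-free analogue: freeing only the wrapper object while transferring ownership of the heap-allocated character data to the caller, which mirrors what g_string_free(result, false) does in the patch. The type and functions below are invented for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-alone analogue of g_string_free(gs, FALSE). */
typedef struct {
    char *str;
} StrBuf;

static StrBuf *str_new(const char *init)
{
    StrBuf *sb = malloc(sizeof(*sb));

    sb->str = malloc(strlen(init) + 1);
    strcpy(sb->str, init);
    return sb;
}

static char *str_free(StrBuf *sb, bool free_data)
{
    char *s = sb->str;

    if (free_data) {
        free(s);
        s = NULL;
    }
    free(sb);       /* the wrapper itself is always released */
    return s;       /* caller owns the data when free_data is false */
}
```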
[PATCH for-6.0] x86: acpi: use offset instead of pointer when using build_header()
Do the same as in commit 4d027afeb3a97 ("Virt: ACPI: fix qemu assert due to re-assigned table data address") for the remaining tables that use a pointer saved at the beginning to build the header, to avoid an assert when table_data is relocated due to an implicit resize. Reported-in: https://bugs.launchpad.net/bugs/1923497 Signed-off-by: Igor Mammedov --- PS: I have a build_header() refactoring patch that requires an offset instead of a pointer, to make it harder to misuse, but it's a bit intrusive for last-minute fixes. So here goes the simplified variant; I'll post the refactoring patch for 6.1 later. --- hw/acpi/aml-build.c | 15 +-- hw/i386/acpi-build.c | 8 ++-- 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index d33ce8954a..f0035d2b4a 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -1830,6 +1830,7 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, int i; unsigned rsdt_entries_offset; AcpiRsdtDescriptorRev1 *rsdt; +int rsdt_start = table_data->len; const unsigned table_data_len = (sizeof(uint32_t) * table_offsets->len); const unsigned rsdt_entry_size = sizeof(rsdt->table_offset_entry[0]); const size_t rsdt_len = sizeof(*rsdt) + table_data_len; @@ -1846,7 +1847,8 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, ACPI_BUILD_TABLE_FILE, ref_tbl_offset); } build_header(linker, table_data, - (void *)rsdt, "RSDT", rsdt_len, 1, oem_id, oem_table_id); + (void *)(table_data->data + rsdt_start), + "RSDT", rsdt_len, 1, oem_id, oem_table_id); } /* Build xsdt table */ @@ -1857,6 +1859,7 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, int i; unsigned xsdt_entries_offset; AcpiXsdtDescriptorRev2 *xsdt; +int xsdt_start = table_data->len; const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len); const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]); const size_t xsdt_len = sizeof(*xsdt) + table_data_len; @@ -1873,7
+1876,8 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, ACPI_BUILD_TABLE_FILE, ref_tbl_offset); } build_header(linker, table_data, - (void *)xsdt, "XSDT", xsdt_len, 1, oem_id, oem_table_id); + (void *)(table_data->data + xsdt_start), + "XSDT", xsdt_len, 1, oem_id, oem_table_id); } void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, @@ -2053,10 +2057,9 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, uint64_t control_area_start_address; TPMIf *tpmif = tpm_find(); uint32_t start_method; -void *tpm2_ptr; tpm2_start = table_data->len; -tpm2_ptr = acpi_data_push(table_data, sizeof(AcpiTableHeader)); +acpi_data_push(table_data, sizeof(AcpiTableHeader)); /* Platform Class */ build_append_int_noprefix(table_data, TPM2_ACPI_CLASS_CLIENT, 2); @@ -2095,8 +2098,8 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, log_addr_offset, 8, ACPI_BUILD_TPMLOG_FILE, 0); build_header(linker, table_data, - tpm2_ptr, "TPM2", table_data->len - tpm2_start, 4, oem_id, - oem_table_id); + (void *)(table_data->data + tpm2_start), + "TPM2", table_data->len - tpm2_start, 4, oem_id, oem_table_id); } Aml *build_crs(PCIHostState *host, CrsRangeSet *range_set, uint32_t io_offset, diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index de98750aef..daaf8f473e 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -1816,6 +1816,7 @@ build_hpet(GArray *table_data, BIOSLinker *linker, const char *oem_id, const char *oem_table_id) { Acpi20Hpet *hpet; +int hpet_start = table_data->len; hpet = acpi_data_push(table_data, sizeof(*hpet)); /* Note timer_block_id value must be kept in sync with value advertised by @@ -1824,13 +1825,15 @@ build_hpet(GArray *table_data, BIOSLinker *linker, const char *oem_id, hpet->timer_block_id = cpu_to_le32(0x8086a201); hpet->addr.address = cpu_to_le64(HPET_BASE); build_header(linker, table_data, - (void *)hpet, "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); + 
(void *)(table_data->data + hpet_start), + "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); } static void build_tpm_tcpa(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, const char *oem_id, const char *oem_table_id) { +int tcpa_start = table_data->len; Acpi20Tcpa *tcpa = acpi_data_push(table_data, sizeof *tcpa); unsigned log_addr_size = sizeof(tcpa->log_area_start_address); unsigned log_addr_offset = @@ -184
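The bug class being fixed is easy to reproduce outside QEMU: a growable byte array may realloc() its backing storage, so a pointer taken before the growth can dangle, while a byte offset stays valid and the pointer can be recomputed from it at use time. A minimal sketch with invented types (error handling elided):

```c
#include <stdlib.h>
#include <string.h>

/* Growable byte array whose backing buffer may move on growth. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} ByteArray;

static size_t push(ByteArray *a, const void *bytes, size_t n)
{
    size_t off = a->len;

    if (a->len + n > a->cap) {
        a->cap = (a->len + n) * 2;
        a->data = realloc(a->data, a->cap);   /* may move the buffer */
    }
    memcpy(a->data + a->len, bytes, n);
    a->len += n;
    return off;     /* hand back an offset, never a raw pointer */
}
```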
Re: [PATCH RFC RESEND v2 3/6] hw/pci: Add pci_root_bus_max_bus
Hi Eric, On 2021/4/13 16:05, Auger Eric wrote: Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: From: Xingang Wang This helps to find max bus number of a root bus. s/max bus number of a root bus/highest bus number of a bridge hierarchy? Thanks, I will change the description. Signed-off-by: Xingang Wang Signed-off-by: Jiahui Cen --- hw/pci/pci.c | 34 ++ include/hw/pci/pci.h | 1 + 2 files changed, 35 insertions(+) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e17aa9075f..c7957cbf7c 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -538,6 +538,40 @@ int pci_bus_num(PCIBus *s) return PCI_BUS_GET_CLASS(s)->bus_num(s); } +int pci_root_bus_max_bus(PCIBus *bus) +{ +PCIHostState *host; +PCIDevice *dev; +int max_bus = 0; +int type, devfn; +uint8_t subordinate; + +if (!pci_bus_is_root(bus)) { +return 0; +} + +host = PCI_HOST_BRIDGE(BUS(bus)->parent); +max_bus = pci_bus_num(host->bus); + +for (devfn = 0; devfn < ARRAY_SIZE(host->bus->devices); devfn++) { +dev = host->bus->devices[devfn]; + +if (!dev) { +continue; +} + +type = dev->config[PCI_HEADER_TYPE] & ~PCI_HEADER_TYPE_MULTI_FUNCTION; Seems there is PCI_DEVICE_GET_CLASS(dev)->is_bridge (see pci_root_bus_in_range). Can't that be used instead? Thanks, I will simplify this. +if (type == PCI_HEADER_TYPE_BRIDGE) { +subordinate = dev->config[PCI_SUBORDINATE_BUS]; +if (subordinate > max_bus) { +max_bus = subordinate; what about the secondary bus number, it is always less than the others? Thanks, the secondary bus number should be taken into account. Maybe a pci_root_bus_range to get [min_bus, max_bus] would be better. 
Thanks Eric +} +} +} + +return max_bus; +} + int pci_bus_numa_node(PCIBus *bus) { return PCI_BUS_GET_CLASS(bus)->numa_node(bus); diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index 718b5a454a..e0c69534f4 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -450,6 +450,7 @@ static inline PCIBus *pci_get_bus(const PCIDevice *dev) return PCI_BUS(qdev_get_parent_bus(DEVICE(dev))); } int pci_bus_num(PCIBus *s); +int pci_root_bus_max_bus(PCIBus *bus); static inline int pci_dev_bus_num(const PCIDevice *dev) { return pci_bus_num(pci_get_bus(dev)); . Xingang .
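A stand-alone sketch of the helper under discussion: scan the devices on a root bus, and for every bridge take its subordinate bus number, which bounds the whole hierarchy behind it (in a valid configuration the secondary bus number never exceeds the subordinate one, which touches on the reviewer's question above). All types here are toys, not QEMU's PCIBus/PCIDevice structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy PCI device: just the fields the scan needs. */
typedef struct {
    int present;
    int is_bridge;
    uint8_t secondary;      /* first bus behind the bridge */
    uint8_t subordinate;    /* highest bus behind the bridge */
} ToyPciDev;

static int root_bus_max_bus(int root_bus_num, const ToyPciDev *devs, size_t n)
{
    int max_bus = root_bus_num;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].present && devs[i].is_bridge &&
            devs[i].subordinate > max_bus) {
            max_bus = devs[i].subordinate;
        }
    }
    return max_bus;
}
```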
Re: [PATCH] vhost-user-fs: fix features handling
* Stefan Hajnoczi (stefa...@redhat.com) wrote: > On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > > Make virtio-fs take into account server capabilities. > > > > Just returning requested features assumes that all of them are implemented > > by the server and results in setting an unsupported configuration if some of them > > are absent. > > > > Signed-off-by: Anton Kuchin > > --- > > hw/virtio/vhost-user-fs.c | 17 + > > 1 file changed, 13 insertions(+), 4 deletions(-) > > > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > > index ac4fc34b36..6cf983ba0e 100644 > > --- a/hw/virtio/vhost-user-fs.c > > +++ b/hw/virtio/vhost-user-fs.c > > @@ -24,6 +24,14 @@ > > #include "monitor/monitor.h" > > #include "sysemu/sysemu.h" > > > > +static const int user_feature_bits[] = { > > +VIRTIO_F_VERSION_1, > > +VIRTIO_RING_F_INDIRECT_DESC, > > +VIRTIO_RING_F_EVENT_IDX, > > +VIRTIO_F_NOTIFY_ON_EMPTY, > > +VHOST_INVALID_FEATURE_BIT > > +}; > > Please add: > > VIRTIO_F_RING_PACKED > > VIRTIO_F_IOMMU_PLATFORM > > QEMU's virtiofsd does not enable either of these for now, but it's worth > allowing the vhost-user device backend to participate in negotiation so > that this can change in the future (or alternative virtiofsd > implementations can support these features). OK, so: Reviewed-by: Dr. David Alan Gilbert and queued, I'll add those extra 2 lines. We seem pretty inconsistent about all the different vhost-user devices. Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
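What the feature table achieves can be sketched generically: every bit listed in the table survives only if the backend also offers it, while bits not listed pass through untouched. Bit numbers and names below are invented for illustration; the real clamping happens inside QEMU's vhost layer.

```c
#include <stdint.h>

/* Sentinel ending a feature-bit table, analogous to the invalid bit
 * terminating user_feature_bits[] above. */
#define FEAT_TABLE_END ((unsigned)-1)

static uint64_t clamp_features(const unsigned *bits, uint64_t backend,
                               uint64_t requested)
{
    for (unsigned i = 0; bits[i] != FEAT_TABLE_END; i++) {
        uint64_t mask = 1ULL << bits[i];

        if (!(backend & mask)) {
            requested &= ~mask;     /* backend lacks it: drop the bit */
        }
    }
    return requested;
}
```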
[PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C"
qemu/osdep.h is quite special in that, despite being part of QEMU sources, it is included by C++ source files as well. disas/nanomips.cpp is doing so within an 'extern "C"' block, which breaks with latest glib due to the inclusion of templates in glib.h. These patches implement Daniel Berrangé's idea of pushing the 'extern "C"' block within glib.h and including system headers (including glib.h, and in fact QEMU's own glib-compat.h too) *outside* the block. (CI has not finished running yet, but it seems encouraging). Paolo Paolo Bonzini (2): osdep: include glib-compat.h before other QEMU headers osdep: protect qemu/osdep.h with extern "C" disas/nanomips.cpp | 2 +- include/qemu/compiler.h | 6 ++ include/qemu/osdep.h| 13 +++-- 3 files changed, 18 insertions(+), 3 deletions(-) -- 2.30.1
[PATCH 2/2] osdep: protect qemu/osdep.h with extern "C"
System headers may include templates if compiled with a C++ compiler, which cause the compiler to complain if qemu/osdep.h is included within a C++ source file's 'extern "C"' block. Add an 'extern "C"' block directly to qemu/osdep.h, so that system headers can be kept out of it. There is a stray declaration early in qemu/osdep.h, which needs to be special cased. Add a definition in qemu/compiler.h to make it look nice. config-host.h, CONFIG_TARGET, exec/poison.h and qemu/compiler.h are included outside the 'extern "C"' block; that is not an issue because they consist entirely of preprocessor directives. Signed-off-by: Paolo Bonzini --- disas/nanomips.cpp | 2 +- include/qemu/compiler.h | 6 ++ include/qemu/osdep.h| 10 +- 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/disas/nanomips.cpp b/disas/nanomips.cpp index 2b09655271..8ddef897f0 100644 --- a/disas/nanomips.cpp +++ b/disas/nanomips.cpp @@ -27,8 +27,8 @@ * Reference Manual", Revision 01.01, April 27, 2018 */ -extern "C" { #include "qemu/osdep.h" +extern "C" { #include "disas/dis-asm.h" } diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h index cf28bb2bcd..091c45248b 100644 --- a/include/qemu/compiler.h +++ b/include/qemu/compiler.h @@ -11,6 +11,12 @@ #define QEMU_STATIC_ANALYSIS 1 #endif +#ifdef __cplusplus +#define QEMU_EXTERN_C extern "C" +#else +#define QEMU_EXTERN_C extern +#endif + #define QEMU_NORETURN __attribute__ ((__noreturn__)) #define QEMU_WARN_UNUSED_RESULT __attribute__((warn_unused_result)) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index b67b0a1e8c..3f8785a471 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -57,7 +57,7 @@ #define daemon qemu_fake_daemon_function #include #undef daemon -extern int daemon(int, int); +QEMU_EXTERN_C int daemon(int, int); #endif #ifdef _WIN32 @@ -113,6 +113,10 @@ extern int daemon(int, int); #include "glib-compat.h" +#ifdef __cplusplus +extern "C" { +#endif + #ifdef _WIN32 #include "sysemu/os-win32.h" #endif @@ 
-723,4 +727,8 @@ static inline int platform_does_not_support_system(const char *command) } #endif /* !HAVE_SYSTEM_FUNCTION */ +#ifdef __cplusplus +} +#endif + #endif -- 2.30.1
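The core of the patch is a two-line macro. Reproduced stand-alone with a hypothetical function (the real use is the daemon() declaration in qemu/osdep.h), it compiles as plain C today and would give the declaration C linkage under a C++ compiler:

```c
/* Expands to an explicit C-linkage declaration under C++, and to an
 * ordinary extern in plain C. */
#ifdef __cplusplus
#define QEMU_EXTERN_C extern "C"
#else
#define QEMU_EXTERN_C extern
#endif

QEMU_EXTERN_C int fake_daemon(int nochdir, int noclose);

int fake_daemon(int nochdir, int noclose)
{
    return nochdir + noclose;   /* placeholder body for illustration */
}
```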
[PATCH 1/2] osdep: include glib-compat.h before other QEMU headers
glib-compat.h is sort of like a system header, and it needs to include system headers (glib.h) that may dislike being included under 'extern "C"'. Move it right after all system headers and before all other QEMU headers. Signed-off-by: Paolo Bonzini --- include/qemu/osdep.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index ba15be9c56..b67b0a1e8c 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -111,6 +111,8 @@ extern int daemon(int, int); #define WEXITSTATUS(x) (x) #endif +#include "glib-compat.h" + #ifdef _WIN32 #include "sysemu/os-win32.h" #endif @@ -123,7 +125,6 @@ extern int daemon(int, int); #include #endif -#include "glib-compat.h" #include "qemu/typedefs.h" /* -- 2.30.1
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
Peter Maydell writes: > On Tue, 13 Apr 2021 at 11:23, Andreas Färber wrote: >> Or consider the case you get a bug report not copied to the public >> mailing lists from someone you don't know. Then I would still expect to >> be allowed to attribute a commit via Reported-by/CC to that person, as >> it seems in his/her interest to get the bug fixed and be notified, >> unless explicitly requested otherwise. > > FWIW, in this kind of situation, I generally try to explicitly > ask the submitter if they're OK with my adding a reported-by > tag, just as a matter of politeness. Not everybody is OK with > having their email address publicly recorded on mailing list > archives and in git history forever. That's what I'd do, too. Still, neglecting to ask for permission to publicly credit a bug report is not anywhere near doxing. If the public credit turns out to be unwanted, a sincere apology is obviously called for. People may exist who need to be slapped over the head with a code of conduct to figure that out. I hope we'll never need to do that. Anyway. What I see at work here is one of the unintended consequences of formal codes of conduct: they read like law, so people read them lawyerly. Our CoC attempts to avoid this by explicitly stating its *purpose*: "a guide to make it easier to be excellent to each other." This applies to the QEMU leadership committee in spades. Treating negligent publication of some technical e-mail's sender address as malicious doxing wouldn't be excellent to anyone, it would be the leadership committee shooting themselves in the foot with a machine gun. Let's not worry about that, okay?
Re: [PATCH] cutils: fix memory leak in get_relocated_path()
On Tue, Apr 13, 2021 at 12:59:36PM +0200, Philippe Mathieu-Daudé wrote: Is this fix aiming at 6.0 release? The leak is minimal, but the fix is very simple. So, I think it can go if someone has a pull request to send with other patches, but I'm not sure with which tree. Thanks, Stefano On 4/12/21 7:02 PM, Stefano Garzarella wrote: get_relocated_path() allocates a GString object and returns the character data (C string) to the caller without freeing the memory allocated for that object as reported by valgrind: 24 bytes in 1 blocks are definitely lost in loss record 2,805 of 6,532 at 0x4839809: malloc (vg_replace_malloc.c:307) by 0x55AABB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C2481: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C4827: g_string_sized_new (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C4CEA: g_string_new (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x906314: get_relocated_path (cutils.c:1036) by 0x6E1F77: qemu_read_default_config_file (vl.c:2122) by 0x6E1F77: qemu_init (vl.c:2687) by 0x3E3AF8: main (main.c:49) Let's use g_string_free(gstring, false) to free only the GString object and transfer the ownership of the character data to the caller. Fixes: f4f5ed2cbd ("cutils: introduce get_relocated_path") Signed-off-by: Stefano Garzarella --- util/cutils.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/util/cutils.c b/util/cutils.c index ee908486da..c9b91e7535 100644 --- a/util/cutils.c +++ b/util/cutils.c @@ -1055,5 +1055,5 @@ char *get_relocated_path(const char *dir) assert(G_IS_DIR_SEPARATOR(dir[-1])); g_string_append(result, dir - 1); } -return result->str; +return g_string_free(result, false); }
Re: [PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C"
Patchew URL: https://patchew.org/QEMU/20210413113741.214867-1-pbonz...@redhat.com/ Hi, This series seems to have some coding style problems. See output below for more information: Type: series Message-id: 20210413113741.214867-1-pbonz...@redhat.com Subject: [PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C" === TEST SCRIPT BEGIN === #!/bin/bash git rev-parse base > /dev/null || exit 0 git config --local diff.renamelimit 0 git config --local diff.renames True git config --local diff.algorithm histogram ./scripts/checkpatch.pl --mailback base.. === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 From https://github.com/patchew-project/qemu - [tag update] patchew/20210408195534.647895-1-antonkuc...@yandex-team.ru -> patchew/20210408195534.647895-1-antonkuc...@yandex-team.ru * [new tag] patchew/20210413113741.214867-1-pbonz...@redhat.com -> patchew/20210413113741.214867-1-pbonz...@redhat.com Switched to a new branch 'test' 992fa52 osdep: protect qemu/osdep.h with extern "C" bc2310f osdep: include glib-compat.h before other QEMU headers === OUTPUT BEGIN === 1/2 Checking commit bc2310f731a5 (osdep: include glib-compat.h before other QEMU headers) 2/2 Checking commit 992fa52da413 (osdep: protect qemu/osdep.h with extern "C") WARNING: architecture specific defines should be avoided #51: FILE: include/qemu/compiler.h:14: +#ifdef __cplusplus ERROR: storage class should be at the beginning of the declaration #52: FILE: include/qemu/compiler.h:15: +#define QEMU_EXTERN_C extern "C" ERROR: storage class should be at the beginning of the declaration #54: FILE: include/qemu/compiler.h:17: +#define QEMU_EXTERN_C extern WARNING: architecture specific defines should be avoided #77: FILE: include/qemu/osdep.h:116: +#ifdef __cplusplus WARNING: architecture specific defines should be avoided #88: FILE: include/qemu/osdep.h:730: +#ifdef __cplusplus total: 2 errors, 3 warnings, 47 lines checked Patch 2/2 has style problems, please review. 
If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. === OUTPUT END === Test command exited with code: 1 The full log is available at http://patchew.org/logs/20210413113741.214867-1-pbonz...@redhat.com/testing.checkpatch/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-de...@redhat.com
Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
On 06.04.21 17:51, Vladimir Sementsov-Ogievskiy wrote: If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. block/nbd.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..1d4668d42d 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -443,6 +443,11 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) BDRVNBDState *s = bs->opaque; NBDConnectThread *thr = s->connect_thread; +if (!thr) { +/* detached */ +return -1; +} + qemu_mutex_lock(&thr->mutex); switch (thr->state) { First, it is a bit strange not to set *errp in these cases. 
I don’t think it’d make a difference anyway, but now that I look into it... The error would be propagated to s->connect_err, but that isn’t used anywhere, so... What’s the point? Anyway. What I really wanted to ask is: nbd_co_establish_connection() may create a thread, which accesses the thread. Is that a problem? Like, can this happen:
- nbd_co_establish_connection(): thr->state := THREAD_RUNNING
- nbd_co_establish_connection(): thread is created
- nbd_co_establish_connection(): thr->mutex unlocked
- connect_thread_func(): thr->mutex locked
- connect_thread_func(): thr->state := something else
- connect_thread_func(): thr->mutex unlocked
- nbd_co_establish_connection(): yields
- (As I understood it, this yield leads to nbd_co_establish_connection_cancel() continuing)
- nbd_co_EC_cancel(): thr->mutex locked
- nbd_co_EC_cancel(): do_free true
- nbd_co_EC_cancel(): nbd_free_connect_thread()
- connect_thread_func(): nbd_free_connect_thread()
? Max
Re: [PATCH 2/2] osdep: protect qemu/osdep.h with extern "C"
On Tue, 13 Apr 2021 at 12:37, Paolo Bonzini wrote: > > System headers may include templates if compiled with a C++ compiler, > which cause the compiler to complain if qemu/osdep.h is included > within a C++ source file's 'extern "C"' block. Add > an 'extern "C"' block directly to qemu/osdep.h, so that > system headers can be kept out of it. > > There is a stray declaration early in qemu/osdep.h, which needs > to be special cased. Add a definition in qemu/compiler.h to > make it look nice. > > config-host.h, CONFIG_TARGET, exec/poison.h and qemu/compiler.h > are included outside the 'extern "C"' block; that is not > an issue because they consist entirely of preprocessor directives. > > Signed-off-by: Paolo Bonzini > --- > disas/nanomips.cpp | 2 +- > include/qemu/compiler.h | 6 ++ > include/qemu/osdep.h| 10 +- > 3 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/disas/nanomips.cpp b/disas/nanomips.cpp > index 2b09655271..8ddef897f0 100644 > --- a/disas/nanomips.cpp > +++ b/disas/nanomips.cpp > @@ -27,8 +27,8 @@ > * Reference Manual", Revision 01.01, April 27, 2018 > */ > > -extern "C" { > #include "qemu/osdep.h" > +extern "C" { > #include "disas/dis-asm.h" > } > > diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h > index cf28bb2bcd..091c45248b 100644 > --- a/include/qemu/compiler.h > +++ b/include/qemu/compiler.h > @@ -11,6 +11,12 @@ > #define QEMU_STATIC_ANALYSIS 1 > #endif > > +#ifdef __cplusplus > +#define QEMU_EXTERN_C extern "C" > +#else > +#define QEMU_EXTERN_C extern > +#endif > + > #define QEMU_NORETURN __attribute__ ((__noreturn__)) > > #define QEMU_WARN_UNUSED_RESULT __attribute__((warn_unused_result)) > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > index b67b0a1e8c..3f8785a471 100644 > --- a/include/qemu/osdep.h > +++ b/include/qemu/osdep.h > @@ -57,7 +57,7 @@ > #define daemon qemu_fake_daemon_function > #include > #undef daemon > -extern int daemon(int, int); > +QEMU_EXTERN_C int daemon(int, int); > #endif > > 
#ifdef _WIN32 > @@ -113,6 +113,10 @@ extern int daemon(int, int); > > #include "glib-compat.h" > > +#ifdef __cplusplus > +extern "C" { > +#endif > + > #ifdef _WIN32 > #include "sysemu/os-win32.h" > #endif There are some system header includes in osdep.h below this point (sys/shm.h and sys/uio.h) -- don't they need to be moved up to go with the other system includes first ? thanks -- PMM
Re: [PATCH RFC RESEND v2 4/6] hw/arm/virt-acpi-build: Add explicit idmap info in IORT table
Hi Eric, On 2021/4/13 17:03, Auger Eric wrote: Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: From: Xingang Wang The idmap of smmuv3 and root complex covers the whole RID space for now, this patch add explicit idmap info according to root bus number range. This add smmuv3 idmap for certain bus which has enabled the iommu property. Signed-off-by: Xingang Wang Signed-off-by: Jiahui Cen --- hw/arm/virt-acpi-build.c | 103 ++- 1 file changed, 81 insertions(+), 22 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f5a2b2d4cb..5491036c86 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -44,6 +44,7 @@ #include "hw/acpi/tpm.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" +#include "hw/pci/pci_bus.h" #include "hw/pci-host/gpex.h" #include "hw/arm/virt.h" #include "hw/mem/nvdimm.h" @@ -237,6 +238,41 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms) aml_append(scope, dev); } +typedef +struct AcpiIortMapping { +AcpiIortIdMapping idmap; +bool iommu; +} AcpiIortMapping; + +/* For all PCI host bridges, walk and insert DMAR scope */ this comment should rather be in the caller also DMAR is not the ARM vocable. I would add the comment for this function: build the ID mapping for aa given PCI host bridge Thanks, I will fix the comment. 
+static int +iort_host_bridges(Object *obj, void *opaque) +{ +GArray *map_blob = opaque; +AcpiIortMapping map; +AcpiIortIdMapping *idmap = &map.idmap; +int bus_num, max_bus; + +if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { +PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus; + +if (bus) { +bus_num = pci_bus_num(bus); +max_bus = pci_root_bus_max_bus(bus); + +idmap->input_base = cpu_to_le32(bus_num << 8); +idmap->id_count = cpu_to_le32((max_bus - bus_num + 1) << 8); +idmap->output_base = cpu_to_le32(bus_num << 8); +idmap->flags = cpu_to_le32(0); + +map.iommu = pci_root_bus_has_iommu(bus); if iommu is not set, we don't need to populate the idmap and we may even directly continue, ie. not add the element the map_bap_blob, no? Thanks, if we leave the whole range when iommu is not set, this indeed should be dropped in map_blob. +g_array_append_val(map_blob, map); +} +} + +return 0; +} + static void build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) { @@ -247,6 +283,21 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) AcpiIortSmmu3 *smmu; size_t node_size, iort_node_offset, iort_length, smmu_offset = 0; AcpiIortRC *rc; +int smmu_mapping_count; +GArray *map_blob = g_array_new(false, true, sizeof(AcpiIortMapping)); +AcpiIortMapping *map; + +/* pci_for_each_bus(vms->bus, insert_map, map_blob); */ comment to be removed Done. 
+object_child_foreach_recursive(object_get_root(), + iort_host_bridges, map_blob); + +smmu_mapping_count = 0; +for (int i = 0; i < map_blob->len; i++) { +map = &g_array_index(map_blob, AcpiIortMapping, i); +if (map->iommu) { +smmu_mapping_count++; +} +} iort = acpi_data_push(table_data, sizeof(*iort)); @@ -280,13 +331,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) /* SMMUv3 node */ smmu_offset = iort_node_offset + node_size; -node_size = sizeof(*smmu) + sizeof(*idmap); +node_size = sizeof(*smmu) + sizeof(*idmap) * smmu_mapping_count; iort_length += node_size; smmu = acpi_data_push(table_data, node_size); smmu->type = ACPI_IORT_NODE_SMMU_V3; smmu->length = cpu_to_le16(node_size); -smmu->mapping_count = cpu_to_le32(1); +smmu->mapping_count = cpu_to_le32(smmu_mapping_count); smmu->mapping_offset = cpu_to_le32(sizeof(*smmu)); smmu->base_address = cpu_to_le64(vms->memmap[VIRT_SMMU].base); smmu->flags = cpu_to_le32(ACPI_IORT_SMMU_V3_COHACC_OVERRIDE); @@ -295,23 +346,28 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) smmu->gerr_gsiv = cpu_to_le32(irq + 2); smmu->sync_gsiv = cpu_to_le32(irq + 3); -/* Identity RID mapping covering the whole input RID range */ -idmap = &smmu->id_mapping_array[0]; -idmap->input_base = 0; -idmap->id_count = cpu_to_le32(0x); -idmap->output_base = 0; -/* output IORT node is the ITS group node (the first node) */ -idmap->output_reference = cpu_to_le32(iort_node_offset); +for (int i = 0, j = 0; i < map_blob->len; i++) { +map = &g_array_index(map_blob, AcpiIortMapping, i); + +if (!map->iommu) { +continue; +} + +idmap =
Re: testing/next - hexagon toolchain update
Brian Cain writes: > Alex, > > You are the one maintaining the testing/next tree at > https://gitlab.com/stsquad/qemu correct? The current patch series for > hexagon under review requires toolchain updates. These changes to > llvm/clang landed in the last week or two. > > Can you apply this patch? I've applied to my tree. I didn't get it in this cycle but I'll post a new testing/next for review in the next week or so ready for master to re-open. > > > > From 68547357c895934796e9b4687338bb9e39ac86c5 Mon Sep 17 00:00:00 2001 > From: Brian Cain mailto:bc...@quicinc.com > Date: Thu, 1 Apr 2021 10:32:24 -0500 > Subject: [PATCH] Update llvm-project commit > > clang was updated with new inline asm registers for hexagon, this is > necessary for QEMU test cases currently under review. > > Signed-off-by: Brian Cain mailto:bc...@quicinc.com > --- > tests/docker/dockerfiles/debian-hexagon-cross.docker | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/docker/dockerfiles/debian-hexagon-cross.docker > b/tests/docker/dockerfiles/debian-hexagon-cross.docker > index b6fb651..1d19e8f 100644 > --- a/tests/docker/dockerfiles/debian-hexagon-cross.docker > +++ b/tests/docker/dockerfiles/debian-hexagon-cross.docker > @@ -24,7 +24,7 @@ RUN apt update && \ > ENV TOOLCHAIN_INSTALL /usr/local > ENV ROOTFS /usr/local > > -ENV LLVM_URL > https://github.com/llvm/llvm-project/archive/3d8149c2a1228609fd7d7c91a04681304a2f0ca9.tar.gz > +ENV LLVM_URL > https://github.com/llvm/llvm-project/archive/bfcd21876adc3498065e4da92799f613e730d475.tar.gz > ENV MUSL_URL > https://github.com/quic/musl/archive/aff74b395fbf59cd7e93b3691905aa1af6c0778c.tar.gz > ENV LINUX_URL > https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.6.18.tar.xz > > [2. 0001-Update-llvm-project-commit.patch --- text/x-diff; > 0001-Update-llvm-project-commit.patch]... -- Alex Bennée
Re: [RFC v12 00/65] arm cleanup experiment for kvm-only build
On 3/28/21 9:27 PM, Richard Henderson wrote: > On 3/26/21 1:35 PM, Claudio Fontana wrote: >> Here a new version of the series that enables kvm-only builds. >> >> The goal here is to enable the KVM-only build, but there is >> some additional cleanup too. >> >> In this iteration I mostly fixed existing issues, and added an attempt >> to put more content in cpu-sve. More splitting still required >> for both cpu-sve and cpu-auth. >> >> Comments welcome, thanks, > > My number 1 takeaway is that I don't think we should bother trying so hard to > compile out aarch64-specific code. The number of ifdefs seems to be quite > high > in the end. > Understand - it turns out to be useful in practice though to go through some effort to improve the distinction between aarch64-specific code , aarch32-specific code, and code that is common to both, even though we might later decide to just have basically TARGET_AARCH64 be always true (ie removing it) in a future series. > The cleanup that I think would be better would be to *remove* the 32-bit-only > qemu-system-arm and build it all into a single binary. That would reduce the > number of build combinations and could in turn simplify or speed up CI. Still, I think it is interesting for the reader of the code, and in trying to debug issues, to be mindful about which code is 32bit-only, which code is 64-bit only, which is common. Not that this series fixes everything in this regard. > > We probably cannot remove qemu-arm for 32-bit linux-user binaries. But my > guess is the amount of aarch64-specific code that would leak in is pretty > minimal. The bulk of the difference is in the set of tcg helpers, which are > segregated already. > > The tcg/kvm split is much more interesting, and that part of the patch set > looks pretty good. > > > r~ > Thanks, I'll send a new version momentarily, we can continue from there. Ciao, Claudio
[PULL 1/3] hw/arm/mps2-tz: Fix MPC setting for AN524 SRAM block
The AN524 has three MPCs: one for the BRAM, one for the QSPI flash, and one for the DDR. We incorrectly set the .mpc field in the RAMInfo struct for the SRAM block to 1, giving it the same MPC we are using for the QSPI. The effect of this was that the QSPI didn't get mapped into the system address space at all, via an MPC or otherwise, and guest programs which tried to read from the QSPI would get a bus error. Correct the SRAM RAMInfo to indicate that it does not have an associated MPC. Fixes: 25ff112a8cc ("hw/arm/mps2-tz: Add new mps3-an524 board") Signed-off-by: Peter Maydell Reviewed-by: Richard Henderson Reviewed-by: Philippe Mathieu-Daudé Message-id: 20210409150527.15053-2-peter.mayd...@linaro.org --- hw/arm/mps2-tz.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/mps2-tz.c b/hw/arm/mps2-tz.c index 3fbe3d29f95..5ebd671bf83 100644 --- a/hw/arm/mps2-tz.c +++ b/hw/arm/mps2-tz.c @@ -238,7 +238,7 @@ static const RAMInfo an524_raminfo[] = { { .name = "sram", .base = 0x2000, .size = 32 * 4 * KiB, -.mpc = 1, +.mpc = -1, .mrindex = 1, }, { /* We don't model QSPI flash yet; for now expose it as simple ROM */ -- 2.20.1
Re: [PULL 00/13] qemu-sparc queue 20210412
On Mon, 12 Apr 2021 at 23:20, Mark Cave-Ayland wrote: > > The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: > > Merge remote-tracking branch > 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 > 12:12:09 +0100) > > are available in the Git repository at: > > git://github.com/mcayland/qemu.git tags/qemu-sparc-20210412 > > for you to fetch changes up to ce94fa7aa646a18e9b9105a32eea2152b202b431: > > tests/qtest: add tests for am53c974 device (2021-04-12 22:37:11 +0100) > > > qemu-sparc queue > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/6.0 for any user-visible changes. -- PMM
[PULL 2/3] hw/arm/mps2-tz: Assert if more than one RAM is attached to an MPC
Each board in mps2-tz.c specifies a RAMInfo[] array providing information about each RAM in the board. The .mpc field of the RAMInfo struct specifies which MPC, if any, the RAM is attached to. We already assert if the array doesn't have any entry for an MPC, but we don't diagnose the error of using the same MPC number twice (which is quite easy to do by accident if copy-and-pasting structure entries). Enhance find_raminfo_for_mpc() so that it detects multiple entries for the MPC as well as missing entries. Signed-off-by: Peter Maydell Reviewed-by: Richard Henderson Reviewed-by: Philippe Mathieu-Daudé Message-id: 20210409150527.15053-3-peter.mayd...@linaro.org --- hw/arm/mps2-tz.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/hw/arm/mps2-tz.c b/hw/arm/mps2-tz.c index 5ebd671bf83..25016e464d9 100644 --- a/hw/arm/mps2-tz.c +++ b/hw/arm/mps2-tz.c @@ -306,14 +306,18 @@ static const RAMInfo *find_raminfo_for_mpc(MPS2TZMachineState *mms, int mpc) { MPS2TZMachineClass *mmc = MPS2TZ_MACHINE_GET_CLASS(mms); const RAMInfo *p; +const RAMInfo *found = NULL; for (p = mmc->raminfo; p->name; p++) { if (p->mpc == mpc && !(p->flags & IS_ALIAS)) { -return p; +/* There should only be one entry in the array for this MPC */ +g_assert(!found); +found = p; } } /* if raminfo array doesn't have an entry for each MPC this is a bug */ -g_assert_not_reached(); +assert(found); +return found; } static MemoryRegion *mr_for_raminfo(MPS2TZMachineState *mms, -- 2.20.1
[PULL 3/3] sphinx: qapidoc: Wrap "If" section body in a paragraph node
From: John Snow These sections need to be wrapped in a block-level element, such as Paragraph in order for them to be rendered into Texinfo correctly. Before (e.g.): If defined(CONFIG_REPLICATION) became: .SS If \fBdefined(CONFIG_REPLICATION)\fP.SS \fBBlockdevOptionsReplication\fP (Object) ... After: If defined(CONFIG_REPLICATION) becomes: .SS If .sp \fBdefined(CONFIG_REPLICATION)\fP .SS \fBBlockdevOptionsReplication\fP (Object) ... Reported-by: Markus Armbruster Tested-by: Markus Armbruster Signed-off-by: John Snow Reviewed-by: Peter Maydell Message-id: 20210406141909.1992225-2-js...@redhat.com Signed-off-by: Peter Maydell --- docs/sphinx/qapidoc.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/sphinx/qapidoc.py b/docs/sphinx/qapidoc.py index b7b86b5dffb..b7a2d39c105 100644 --- a/docs/sphinx/qapidoc.py +++ b/docs/sphinx/qapidoc.py @@ -278,7 +278,9 @@ def _nodes_for_if_section(self, ifcond): nodelist = [] if ifcond: snode = self._make_section('If') -snode += self._nodes_for_ifcond(ifcond, with_if=False) +snode += nodes.paragraph( +'', '', *self._nodes_for_ifcond(ifcond, with_if=False) +) nodelist.append(snode) return nodelist -- 2.20.1
[PULL 0/3] target-arm queue
A few last patches to go in for rc3... The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 12:12:09 +0100) are available in the Git repository at: https://git.linaro.org/people/pmaydell/qemu-arm.git tags/pull-target-arm-20210413 for you to fetch changes up to 2d18b4ca023ca1a3aee18064251d6e6e1084f3eb: sphinx: qapidoc: Wrap "If" section body in a paragraph node (2021-04-13 10:14:58 +0100) target-arm queue: * Fix MPC setting for AN524 SRAM block * sphinx: qapidoc: Wrap "If" section body in a paragraph node John Snow (1): sphinx: qapidoc: Wrap "If" section body in a paragraph node Peter Maydell (2): hw/arm/mps2-tz: Fix MPC setting for AN524 SRAM block hw/arm/mps2-tz: Assert if more than one RAM is attached to an MPC docs/sphinx/qapidoc.py | 4 +++- hw/arm/mps2-tz.c | 10 +++--- 2 files changed, 10 insertions(+), 4 deletions(-)
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c| 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); +struct vfio_iommu_type1_cache_invalidate ustruct = {}; +VFIOContainer *container = giommu->container; +int ret; + +assert(iotlb->perm == IOMMU_NONE); + 
+ustruct.argsz = sizeof(ustruct); +ustruct.flags = 0; +ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info); +ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1; +ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB; + +switch (iotlb->granularity) { +case IOMMU_INV_GRAN_DOMAIN: +ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN; +break; +case IOMMU_INV_GRAN_PASID: +{ +struct iommu_inv_pasid_info *pasid_info; +int archid = -1; + +pasid_info = &ustruct.info.granu.pasid_info; +ustruct.info.granularity = IOMMU_INV_GRANU_PASID; +if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) { +pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID; +archid = iotlb->arch_id; +} +pasid_info->archid = archid; +trace_vfio_iommu_asid_inv_iotlb(archid); +break; +} +case IOMMU_INV_GRAN_ADDR: +{ +hwaddr start = iotlb->iova + giommu->iommu_offset; +struct iommu_inv_addr_info *addr_info; +size_t size = iotlb->addr_mask + 1; +int archid = -1; + +addr_info = &ustruct.info.granu.addr_info; +ustruct.info.granularity = IOMMU_INV_GRANU_ADDR; +if (iotlb->leaf) { +addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF; +} +if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) { +addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID; +archid = iotlb->arch_id; +} +addr_info->archid = archid; +addr_info->addr = start; +addr_info->granule_size = size; +addr_info->nb_granules = 1; +trace_vfio_iommu_addr_inv_iotlb(archid, start, size, +1, iotlb->leaf); +break; +} Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel. Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page. 
(pSMMU supports RIL) Thanks, Kunkun Jiang +} + +ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct); +if (ret) { +error_report("%p: failed to invalidate CACHE (%d)", container, ret); +} +} + static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) { VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); @@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container, } } +static void vfio_prereg_listener_region_add(MemoryListener *listener, +MemoryRegionSection *section) +{ +VFIOContainer *container = +container_
Re: [PATCH for-6.0] x86: acpi: use offset instead of pointer when using build_header()
On Tue, Apr 13, 2021 at 07:14:00AM -0400, Igor Mammedov wrote: > Do the same as in commit > (4d027afeb3a97 Virt: ACPI: fix qemu assert due to re-assigned table data > address) > for remaining tables that happen to use saved at > the beginning pointer to build header to avoid assert > when table_data is relocated due to implicit re-size. > > Reported-in: https://bugs.launchpad.net/bugs/1923497 Doesn't this fix the bug? If so - Isn't this Fixes: ? > Signed-off-by: Igor Mammedov > --- > PS: > I have build_header() refactoring patch that requires offset > instead of pointer, to make it harder to misuse but it's > a bit intrusive for last minute fixes. So here goes simplified > variant, and I'll post refactoring patch for 6.1. later. > --- > hw/acpi/aml-build.c | 15 +-- > hw/i386/acpi-build.c | 8 ++-- > 2 files changed, 15 insertions(+), 8 deletions(-) > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c > index d33ce8954a..f0035d2b4a 100644 > --- a/hw/acpi/aml-build.c > +++ b/hw/acpi/aml-build.c > @@ -1830,6 +1830,7 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > int i; > unsigned rsdt_entries_offset; > AcpiRsdtDescriptorRev1 *rsdt; > +int rsdt_start = table_data->len; > const unsigned table_data_len = (sizeof(uint32_t) * table_offsets->len); > const unsigned rsdt_entry_size = sizeof(rsdt->table_offset_entry[0]); > const size_t rsdt_len = sizeof(*rsdt) + table_data_len; > @@ -1846,7 +1847,8 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > ACPI_BUILD_TABLE_FILE, ref_tbl_offset); > } > build_header(linker, table_data, > - (void *)rsdt, "RSDT", rsdt_len, 1, oem_id, oem_table_id); > + (void *)(table_data->data + rsdt_start), > + "RSDT", rsdt_len, 1, oem_id, oem_table_id); > } > > /* Build xsdt table */ > @@ -1857,6 +1859,7 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > int i; > unsigned xsdt_entries_offset; > AcpiXsdtDescriptorRev2 *xsdt; > +int xsdt_start = 
table_data->len; > const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len); > const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]); > const size_t xsdt_len = sizeof(*xsdt) + table_data_len; > @@ -1873,7 +1876,8 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > ACPI_BUILD_TABLE_FILE, ref_tbl_offset); > } > build_header(linker, table_data, > - (void *)xsdt, "XSDT", xsdt_len, 1, oem_id, oem_table_id); > + (void *)(table_data->data + xsdt_start), > + "XSDT", xsdt_len, 1, oem_id, oem_table_id); > } > > void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, > @@ -2053,10 +2057,9 @@ void build_tpm2(GArray *table_data, BIOSLinker > *linker, GArray *tcpalog, > uint64_t control_area_start_address; > TPMIf *tpmif = tpm_find(); > uint32_t start_method; > -void *tpm2_ptr; > > tpm2_start = table_data->len; > -tpm2_ptr = acpi_data_push(table_data, sizeof(AcpiTableHeader)); > +acpi_data_push(table_data, sizeof(AcpiTableHeader)); > > /* Platform Class */ > build_append_int_noprefix(table_data, TPM2_ACPI_CLASS_CLIENT, 2); > @@ -2095,8 +2098,8 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, > GArray *tcpalog, > log_addr_offset, 8, > ACPI_BUILD_TPMLOG_FILE, 0); > build_header(linker, table_data, > - tpm2_ptr, "TPM2", table_data->len - tpm2_start, 4, oem_id, > - oem_table_id); > + (void *)(table_data->data + tpm2_start), > + "TPM2", table_data->len - tpm2_start, 4, oem_id, > oem_table_id); > } > > Aml *build_crs(PCIHostState *host, CrsRangeSet *range_set, uint32_t > io_offset, > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c > index de98750aef..daaf8f473e 100644 > --- a/hw/i386/acpi-build.c > +++ b/hw/i386/acpi-build.c > @@ -1816,6 +1816,7 @@ build_hpet(GArray *table_data, BIOSLinker *linker, > const char *oem_id, > const char *oem_table_id) > { > Acpi20Hpet *hpet; > +int hpet_start = table_data->len; > > hpet = acpi_data_push(table_data, sizeof(*hpet)); > /* Note timer_block_id value 
must be kept in sync with value advertised > by > @@ -1824,13 +1825,15 @@ build_hpet(GArray *table_data, BIOSLinker *linker, > const char *oem_id, > hpet->timer_block_id = cpu_to_le32(0x8086a201); > hpet->addr.address = cpu_to_le64(HPET_BASE); > build_header(linker, table_data, > - (void *)hpet, "HPET", sizeof(*hpet), 1, oem_id, > oem_table_id); > + (void *)(table_data->data + hpet_start), > + "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); > } > > static
Re: [RFC v12 60/65] target/arm: cpu-pauth: new module for ARMv8.3 Pointer Authentication
On 3/28/21 9:05 PM, Richard Henderson wrote: > On 3/26/21 1:36 PM, Claudio Fontana wrote: >> Pointer Authentication is an AARCH64-only ARMv8.3 optional >> extension, whose cpu properties can be separated out in its own module. >> >> Signed-off-by: Claudio Fontana >> --- >> target/arm/cpu.h | 3 -- >> target/arm/tcg/cpu-pauth.h | 34 >> target/arm/cpu.c | 4 +-- >> target/arm/cpu64.c | 35 ++-- >> target/arm/tcg/cpu-pauth.c | 66 ++ >> target/arm/tcg/meson.build | 1 + >> 6 files changed, 105 insertions(+), 38 deletions(-) >> create mode 100644 target/arm/tcg/cpu-pauth.h >> create mode 100644 target/arm/tcg/cpu-pauth.c > > No move + rename at once. > > Also, you've started using tcg_sve_* and I think that might as well apply to > these, in that second step. > > r~ > The idea for tcg_sve_* was in contrast to cpu_sve_*, which contains the common parts of cpu_sve. So the idea for SVE is: cpu-sve.c : CPU SVE module, contains the common functions. tcg/tcg-sve.c : TCG-specific parts of cpu-sve kvm/kvm-sve.c : KVM-specific parts of cpu-sve Now for PAuth we only have a TCG implementation, so that is the reason that this patch uses the more general name. If still tcg/tcg-pauth.c seems better we can go for it, but it just seems not very logical without a corresponding general cpu-pauth.c Ciao, Claudio
Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
13.04.2021 14:53, Max Reitz wrote: On 06.04.21 17:51, Vladimir Sementsov-Ogievskiy wrote: If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. block/nbd.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..1d4668d42d 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -443,6 +443,11 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) BDRVNBDState *s = bs->opaque; NBDConnectThread *thr = s->connect_thread; + if (!thr) { + /* detached */ + return -1; + } + qemu_mutex_lock(&thr->mutex); switch (thr->state) { First, it is a bit strange not to set *errp in these cases. Oops, right! 
(ashamed) I don’t think it’d make a difference anyway, but now that I look into it... The error would be propagated to s->connect_err, but that isn’t used anywhere, so... What’s the point? Yes, it's unused. That's to be improved later. Anyway. What I really wanted to ask is: nbd_co_establish_connection() may create a thread, which accesses the thread. Is that a problem? Like, can this happen:
- nbd_co_establish_connection(): thr->state := THREAD_RUNNING
- nbd_co_establish_connection(): thread is created
- nbd_co_establish_connection(): thr->mutex unlocked
- connect_thread_func(): thr->mutex locked
- connect_thread_func(): thr->state := something else
- connect_thread_func(): thr->mutex unlocked
- nbd_co_establish_connection(): yields
- (As I understood it, this yield leads to nbd_co_establish_connection_cancel() continuing)
- nbd_co_EC_cancel(): thr->mutex locked
- nbd_co_EC_cancel(): do_free true
- nbd_co_EC_cancel(): nbd_free_connect_thread()
- connect_thread_func(): nbd_free_connect_thread()
I think not. Here we are safe: connect_thread_func will only do the free if the thread is in CONNECT_THREAD_RUNNING_DETACHED state. The thread becomes CONNECT_THREAD_RUNNING_DETACHED only on one code path in nbd_co_establish_connection_cancel(), and on that path do_free is false. OK, what you say is possible if we call nbd_co_establish_connection_cancel() twice with detach=true. But that should not be possible if it is called only from .bdrv_close (indirectly) of the nbd driver. The problem is that nbd_co_EC_cancel() may free the thread state, and then nbd_co_establish_connection() reuses the saved thr local variable after the yield. Still (as I noted before), I've never reproduced it on master, only on my branch with some modifications. Still it seems possible anyway. -- Best regards, Vladimir