Re: Better alternative to strncpy in QEMU.
On 12/04/21 06:51, Thomas Huth wrote:
> I think this is pretty much the same as g_strlcpy() from the glib:
> https://developer.gnome.org/glib/2.66/glib-String-Utility-Functions.html#g-strlcpy
> So I guess Paolo had something different in mind when adding this task?

Yes, I did. strncpy is used legitimately when placing data in a fixed-size buffer that is written to a socket, to a file or to guest memory. The problem with using g_strlcpy in those cases is that it does not write past the first '\0' character, and therefore it can leak host data.

What I had in mind was basically strncpy plus an assertion that the last copied byte will be set to 0. It can be written in many ways, for example strncpy followed by assert(dest[destlen - 1] == '\0'), or assert(strlen(src) < destlen) followed by strncpy, or of course you could write a for loop by hand.

Once you do that, you can split the uses of strncpy in two: those where the reader expects the last byte to be zero, and those where the reader does not. (I don't expect many cases of the first type, because the reader always has to think about how to handle a malicious data stream that lacks a zero terminator.)

As long as you avoid the accidentally quadratic behavior that Peter pointed out, any way is fine, since performance does not matter on these paths. Making the code nice and readable is more important.

Paolo
Re: [PATCH v2 0/2] docs/devel/qgraph: add troubleshooting information
On 12/04/21 16:34, Stefan Hajnoczi wrote:
> v2:
>  * Fix "will unavailable" typo [Thomas]
>
> I recently needed to troubleshoot a case where qos-test terminated
> immediately with no output. In other words, qos-test decided that no
> tests are runnable. After lots of head scratching and some help from
> Emanuele it turned out that the machine types weren't being detected as
> expected. These patches add documentation about how to troubleshoot
> similar cases in the future.
>
> Stefan Hajnoczi (2):
>   libqos/qgraph: fix "UNAVAILBLE" typo
>   docs/devel/qgraph: add troubleshooting information
>
>  docs/devel/qgraph.rst       | 58 +
>  tests/qtest/libqos/qgraph.c |  2 +-
>  2 files changed, 59 insertions(+), 1 deletion(-)

Acked-by: Paolo Bonzini

Thanks, this is helpful.

Paolo
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
On 07/04/21 17:42, Kevin Wolf wrote:
> > +* Publishing other's private information, such as physical or electronic
> > +addresses, without explicit permission
>
> Yes, it's pretty clear that I'm not publishing new information about
> people when I'm keeping them in Cc: when replying to a thread, or even
> when they posted in another thread on the list recently. It becomes much
> less clear for adding people who aren't usually part of the QEMU
> community.

If you took the email from, say, the Libvirt or kernel mailing lists, that would not be considered private. If somebody has two email addresses and you deliberately Cc him on an address that he's only using for communication within his family, that would be a problem.

I agree that this kind of doxing is unlikely to happen, since our communication revolves around email and we generally don't accept pseudonymous contributions. But even there, we have historically had a couple of exceptions to the no-pseudonyms rule.

Paolo
Re: [PATCH 1/4] target/ppc: Code motion required to build disabling tcg
On Mon, Apr 12, 2021 at 12:05:31PM +, Bruno Piazera Larsen wrote:
> > A general advice for this whole series is: make sure you add in some
> > words explaining why you decided to make a particular change. It will be
> > much easier to review if we know what were the logical steps leading to
> > the change.
>
> Fair point, I should've thought about that.
>
> > > This commit does the necessary code motion from translate_init.c.inc
> >
> > For instance, I don't immediately see why these changes are necessary. I
> > see that translate_init.c.inc already has some `#ifdef CONFIG_TCG`, so
> > why do we need to move a bunch of code into cpu.c instead of just adding
> > more code under ifdef CONFIG_TCG? (I'm not saying it's wrong, just
> > trying to understand the reasoning).
>
> There are 3 main reasons for this decision. The first is kind of silly,
> but when I read translate.c my mind jumped to translating machine code to
> TCG, and the amount of TCGv variables at the start reinforced this notion.
> The second was that both s390x and i386 removed it (translate.c) from
> compilation, so I had no good reason to doubt it.
> The last (and arguably most important) is that translate.c is many
> thousands of lines long (translate_init.c.inc alone was almost 11k). The
> whole point of disabling TCG is to speed up compilation and reduce the
> final file size, so I think it makes sense to remove that big file.
> And the final nail in the coffin is that at no point did it cross my mind
> to keep the init part of translation, but remove the rest.
>
> Also, I'm not a fan of big ifdefs, because it's kinda hard to follow them
> when viewing code in vim. Adding that to already having a cpu.c file,
> where it makes sense (to me, at least) to add all CPU related functions,
> just sounded like a good idea.

I think those are all sound reasons; I think not compiling translate.c for !tcg builds is the right choice. We might, however, need to "rescue" certain functions from there by moving them to another file so they can be used by KVM code as well.

> > Is translate_init.c.inc intended to be TCG only? But then I see you
> > moved TCG-only functions out of it (ppc_fixup_cpu) and left non-TCG-only
> > functions (gen_spr_generic).
>
> This is me misjudging what is TCG and what is not, mostly. I think that
> actually moving everything to cpu.c and adding ifdefs, or creating a
> cpu_tcg.c.inc or similar, would be the most sensible code motion, but
> every function I removed from translate gave me 3 new errors. At some
> point I felt like I should leave something behind, otherwise we're
> compiling everything from TCG again, just in different files, so I tried
> to guess what was TCG and what was not (also, I really wanted the RFC out
> by the weekend, I _may_ have rushed a few choices).

I'm actually not sure if we'll want translate_init.c for !tcg builds. It's *primarily* for TCG, but we still need at least some of the cpu state structure for KVM, and some of that is initialized in translate_init. I think it will probably make more sense to leave it in for a first cut. Later refinement might end up removing it. The whole #include translate_init.c.inc thing might make for some awkward fiddling in this, of course.

> > > This moves all functions that start with gdb_* into
> > > target/ppc/gdbstub.c and creates a new function that calls those and
> > > is called by ppc_cpu_realize
> >
> > This looks like it makes sense regardless of disable-tcg, could we have
> > it in a standalone patch?
>
> Sure, I'll try and get it ready ASAP, and make sure I didn't miss one
> function before sending. Just a noob question... do I edit the patch
> manually to have it only contain the gdb code motion, or is there a way
> to move back to before I actually made the commit without needing to
> re-do the changes?
>
> Thomas, I'm adding you to this discussion since it sort of answers your
> message on the other one, about the functions used even in a KVM-only
> build.
>
> > IIRC you don't only have to exclude translate.c from the build, you
> > also have to separate translate_init.c.inc from it, i.e. turn
> > translate_init.c.inc into a proper .c file and get rid of the #include
> > "translate_init.c.inc" statement in translate.c, since many functions
> > in the translate_init.c.inc file are still needed for the KVM-only
> > builds, too. So maybe that's a good place to start as a first mini
> > series.
>
> Now that I know that translate doesn't mean "translating to TCG", this
> idea makes more sense. My question is, is it a better option than the
> code motion I'm suggesting? From my quick check on the bits that I
> haven't touched it might, but at this point it's clear I'm very lost
> with what makes sense hahaha.
>
> Bruno Piazera Larsen
> Instituto de Pesquisas ELDORADO
Re: [PATCH 2/4] target/ppc: added solutions for building with disable-tcg
On Mon, Apr 12, 2021 at 08:40:47AM -0700, Richard Henderson wrote:
> On 4/11/21 10:08 PM, David Gibson wrote:
> > Not directly related to what you're trying to accomplish here, but the
> > whole vscr_sat thing looks really weird. I have no idea why we're
> > splitting out the storage of VSCR[SAT] for the TCG case at all. If
> > you wanted to clean that up as a preliminary, I'd be grateful.
>
> That's about efficiently implementing the vector saturation instructions
> in TCG. See GEN_VXFORM_SAT in translate/vmx-impl.c.inc.
>
> The SAT bit is represented as a vector that e.g. compares the result of
> addition with the result of saturating addition. We don't need to resolve
> that to a single bit until the VSCR register is read.

Aha, thanks for the input. Long term, that might benefit from KVM-specific code paths that don't bother with this, then. However, for the first cut, it's simpler to just keep the current representation, even though it's pretty odd for KVM.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH 4/5] blkdebug: do not suspend in the middle of QLIST_FOREACH_SAFE
On 08/04/21 17:59, Emanuele Giuseppe Esposito wrote:

Perhaps insert here: "That would be unsafe in case a rule other than the current one is removed while the coroutine has yielded. Keep FOREACH_SAFE because suspend_request deletes the current rule. After this patch, *all* matching rules are deleted before suspending the coroutine, rather than just one. This doesn't affect the existing testcases."

> Use actions_count to see how many yields to issue.
>
> Co-developed-by: Paolo Bonzini
> Signed-off-by: Emanuele Giuseppe Esposito
> ---
>  block/blkdebug.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 388b5ed615..dffd869b32 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -789,7 +789,6 @@ static void suspend_request(BlockDriverState *bs, BlkdebugRule *rule)
>      if (!qtest_enabled()) {
>          printf("blkdebug: Suspended request '%s'\n", r->tag);
>      }
> -    qemu_coroutine_yield();
>  }
>
>  static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
> @@ -834,6 +833,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
>      QLIST_FOREACH_SAFE(rule, &s->rules[event], next, next) {
>          process_rule(bs, rule, actions_count);
>      }
> +
> +    while (actions_count[ACTION_SUSPEND] > 0) {
> +        qemu_coroutine_yield();
> +        actions_count[ACTION_SUSPEND]--;
> +    }
> +
>      s->state = s->new_state;
>  }
Re: [PATCH 0/5] blkdebug: fix racing condition when iterating on
On 08/04/21 17:59, Emanuele Giuseppe Esposito wrote:
> When qemu_coroutine_enter is executed in a loop (even QLIST_FOREACH_SAFE),
> the new routine can modify the list, for example removing an element,
> causing problems when control is given back to the caller that continues
> iterating on the same list.
>
> Patch 1 solves the issue in blkdebug_debug_resume by restarting the list
> walk after every coroutine_enter if the list has to be fully iterated.
> Patches 2, 3 and 4 aim to fix blkdebug_debug_event by gathering all
> actions that the rules make in a counter and invoking the respective
> coroutine_yield only after processing all requests.
> Patch 5 adds a lock to protect rules and suspended_reqs.

Patch 5 is somewhat independent of the others; right now everything works because it's protected by the AioContext lock. On the other hand, the scenarios in patches 1-4 are bugs even without patch 5. They become more obvious if you see an explicit unlock/lock pair within QTAILQ_FOREACH_SAFE, but they can happen already with just a qemu_coroutine_yield or qemu_coroutine_enter within the iteration.

Paolo
Re: [PATCH RFC RESEND v2 3/6] hw/pci: Add pci_root_bus_max_bus
Hi Xingang,

On 3/25/21 8:22 AM, Wang Xingang wrote:
> From: Xingang Wang
>
> This helps to find max bus number of a root bus.

s/max bus number of a root bus/highest bus number of a bridge hierarchy/?

> Signed-off-by: Xingang Wang
> Signed-off-by: Jiahui Cen
> ---
>  hw/pci/pci.c         | 34 ++
>  include/hw/pci/pci.h |  1 +
>  2 files changed, 35 insertions(+)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e17aa9075f..c7957cbf7c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -538,6 +538,40 @@ int pci_bus_num(PCIBus *s)
>      return PCI_BUS_GET_CLASS(s)->bus_num(s);
>  }
>
> +int pci_root_bus_max_bus(PCIBus *bus)
> +{
> +    PCIHostState *host;
> +    PCIDevice *dev;
> +    int max_bus = 0;
> +    int type, devfn;
> +    uint8_t subordinate;
> +
> +    if (!pci_bus_is_root(bus)) {
> +        return 0;
> +    }
> +
> +    host = PCI_HOST_BRIDGE(BUS(bus)->parent);
> +    max_bus = pci_bus_num(host->bus);
> +
> +    for (devfn = 0; devfn < ARRAY_SIZE(host->bus->devices); devfn++) {
> +        dev = host->bus->devices[devfn];
> +
> +        if (!dev) {
> +            continue;
> +        }
> +
> +        type = dev->config[PCI_HEADER_TYPE] & ~PCI_HEADER_TYPE_MULTI_FUNCTION;

Seems there is PCI_DEVICE_GET_CLASS(dev)->is_bridge (see pci_root_bus_in_range). Can't that be used instead?

> +        if (type == PCI_HEADER_TYPE_BRIDGE) {
> +            subordinate = dev->config[PCI_SUBORDINATE_BUS];
> +            if (subordinate > max_bus) {
> +                max_bus = subordinate;

What about the secondary bus number — is it always less than the others?

Thanks,
Eric

> +            }
> +        }
> +    }
> +
> +    return max_bus;
> +}
> +
>  int pci_bus_numa_node(PCIBus *bus)
>  {
>      return PCI_BUS_GET_CLASS(bus)->numa_node(bus);
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 718b5a454a..e0c69534f4 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -450,6 +450,7 @@ static inline PCIBus *pci_get_bus(const PCIDevice *dev)
>      return PCI_BUS(qdev_get_parent_bus(DEVICE(dev)));
>  }
>  int pci_bus_num(PCIBus *s);
> +int pci_root_bus_max_bus(PCIBus *bus);
>  static inline int pci_dev_bus_num(const PCIDevice *dev)
>  {
>      return pci_bus_num(pci_get_bus(dev));
>  }
[RFC PATCH v2 3/6] hw/arm/virt-acpi-build: Distinguish possible and present cpus
From: Ying Fang When building ACPI tables regarding CPUs we should always build them for the number of possible CPUs, not the number of present CPUs. We then ensure only the present CPUs are enabled in MADT. Furthermore, it is also needed if we are going to support CPU hotplug in the future. This patch is a rework based on Andrew Jones's contribution at https://lists.gnu.org/archive/html/qemu-arm/2018-07/msg00076.html Signed-off-by: Ying Fang Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 14 ++ hw/arm/virt.c| 3 +++ 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f5a2b2d4cb..2ad5dad1bf 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -61,13 +61,16 @@ static void acpi_dsdt_add_cpus(Aml *scope, VirtMachineState *vms) { -MachineState *ms = MACHINE(vms); +CPUArchIdList *possible_cpus = MACHINE(vms)->possible_cpus; uint16_t i; -for (i = 0; i < ms->smp.cpus; i++) { +for (i = 0; i < possible_cpus->len; i++) { Aml *dev = aml_device("C%.03X", i); aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0007"))); aml_append(dev, aml_name_decl("_UID", aml_int(i))); +if (possible_cpus->cpus[i].cpu == NULL) { +aml_append(dev, aml_name_decl("_STA", aml_int(0))); +} aml_append(scope, dev); } } @@ -479,6 +482,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) const int *irqmap = vms->irqmap; AcpiMadtGenericDistributor *gicd; AcpiMadtGenericMsiFrame *gic_msi; +CPUArchIdList *possible_cpus = MACHINE(vms)->possible_cpus; int i; acpi_data_push(table_data, sizeof(AcpiMultipleApicTable)); @@ -489,7 +493,7 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) gicd->base_address = cpu_to_le64(memmap[VIRT_GIC_DIST].base); gicd->version = vms->gic_version; -for (i = 0; i < MACHINE(vms)->smp.cpus; i++) { +for (i = 0; i < possible_cpus->len; i++) { AcpiMadtGenericCpuInterface *gicc = acpi_data_push(table_data, sizeof(*gicc)); ARMCPU *armcpu = 
ARM_CPU(qemu_get_cpu(i)); @@ -504,7 +508,9 @@ build_madt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) gicc->cpu_interface_number = cpu_to_le32(i); gicc->arm_mpidr = cpu_to_le64(armcpu->mp_affinity); gicc->uid = cpu_to_le32(i); -gicc->flags = cpu_to_le32(ACPI_MADT_GICC_ENABLED); +if (possible_cpus->cpus[i].cpu != NULL) { +gicc->flags = cpu_to_le32(ACPI_MADT_GICC_ENABLED); +} if (arm_feature(&armcpu->env, ARM_FEATURE_PMU)) { gicc->performance_interrupt = cpu_to_le32(PPI(VIRTUAL_PMU_IRQ)); diff --git a/hw/arm/virt.c b/hw/arm/virt.c index f4ae60ded9..3e5d9b6f26 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -2063,6 +2063,9 @@ static void machvirt_init(MachineState *machine) } qdev_realize(DEVICE(cpuobj), NULL, &error_fatal); + +/* Initialize cpu member here since cpu hotplug is not supported yet */ +machine->possible_cpus->cpus[n].cpu = cpuobj; object_unref(cpuobj); } fdt_add_timer_nodes(vms); -- 2.19.1
[RFC PATCH v2 0/6] hw/arm/virt: Introduce cpu topology support
Hi,

This series is a new version of [0] recently posted by Ying Fang to introduce cpu topology support for the ARM platform. I have taken over his work on this now; thanks for his contribution.

Description:
An accurate cpu topology may help improve the cpu scheduler's decision making when dealing with a multi-core system, so a cpu topology description is helpful to provide the guest with the right view. Dario Faggioli's talk in [1] also shows that the virtual topology can have an impact on scheduling performance. Thus this patch series introduces cpu topology support for the ARM platform.

This series originally comes from Andrew Jones's patches [2], but with some re-arrangement. Thanks for Andrew's contribution.

In this series, both the fdt and the ACPI PPTT table are introduced to present the cpu topology to the guest. And a new function virt_smp_parse(), unlike the default smp_parse(), is introduced, which prefers cores over sockets.

[0] https://patchwork.kernel.org/project/qemu-devel/cover/20210225085627.2263-1-fangyi...@huawei.com/
[1] https://kvmforum2020.sched.com/event/eE1y/virtual-topology-for-virtual-machines-friend-or-foe-dario-faggioli-suse
[2] https://github.com/rhdrjones/qemu/commit/ecfc1565f22187d2c715a99bbcd35cf3a7e428fa

Test results:
After applying this patch series, launch a guest with virt-6.0 and cpu topology configured with -smp 96,sockets=2,clusters=6,cores=4,threads=2. The VM's cpu topology description shows as below:

  Architecture:        aarch64
  Byte Order:          Little Endian
  CPU(s):              96
  On-line CPU(s) list: 0-95
  Thread(s) per core:  2
  Core(s) per socket:  24
  Socket(s):           2
  NUMA node(s):        1
  Vendor ID:           0x48
  Model:               0
  Stepping:            0x1
  BogoMIPS:            200.00
  NUMA node0 CPU(s):   0-95

---
Changelogs:
v1->v2:
- Address Andrew Jones's comments
- Address Michael S. Tsirkin's comments
- Pick up one more patch (patch #6) of Andrew Jones
- Rebased on v6.0.0-rc2 release

---
Andrew Jones (3):
  device_tree: Add qemu_fdt_add_path
  hw/arm/virt: DT: Add cpu-map
  hw/arm/virt: Replace smp_parse with one that prefers cores

Yanan Wang (2):
  hw/acpi/aml-build: Add processor hierarchy node structure
  hw/arm/virt-acpi-build: Add PPTT table

Ying Fang (1):
  hw/arm/virt-acpi-build: Distinguish possible and present cpus

 hw/acpi/aml-build.c          |  27 ++++
 hw/arm/virt-acpi-build.c     |  77 +++++++--
 hw/arm/virt.c                | 120 ++++++++++++++-
 include/hw/acpi/aml-build.h  |   4 ++
 include/hw/arm/virt.h        |   1 +
 include/sysemu/device_tree.h |   1 +
 softmmu/device_tree.c        |  45 +++++-
 7 files changed, 268 insertions(+), 7 deletions(-)

-- 
2.19.1
[RFC PATCH v2 2/6] hw/arm/virt: DT: Add cpu-map
From: Andrew Jones Support device tree CPU topology descriptions. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- hw/arm/virt.c | 41 - include/hw/arm/virt.h | 1 + 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 9f01d9041b..f4ae60ded9 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -352,10 +352,11 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) int cpu; int addr_cells = 1; const MachineState *ms = MACHINE(vms); +const VirtMachineClass *vmc = VIRT_MACHINE_GET_CLASS(vms); int smp_cpus = ms->smp.cpus; /* - * From Documentation/devicetree/bindings/arm/cpus.txt + * See Linux Documentation/devicetree/bindings/arm/cpus.yaml * On ARM v8 64-bit systems value should be set to 2, * that corresponds to the MPIDR_EL1 register size. * If MPIDR_EL1[63:32] value is equal to 0 on all CPUs @@ -408,8 +409,45 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) ms->possible_cpus->cpus[cs->cpu_index].props.node_id); } +if (ms->smp.cpus > 1 && !vmc->no_cpu_topology) { +qemu_fdt_setprop_cell(ms->fdt, nodename, "phandle", + qemu_fdt_alloc_phandle(ms->fdt)); +} + g_free(nodename); } + +if (ms->smp.cpus > 1 && !vmc->no_cpu_topology) { +/* + * See Linux Documentation/devicetree/bindings/cpu/cpu-topology.txt + * In a SMP system, the hierarchy of CPUs is defined through four + * entities that are used to describe the layout of physical CPUs + * in the system: socket/cluster/core/thread. 
+ */ +qemu_fdt_add_subnode(ms->fdt, "/cpus/cpu-map"); + +for (cpu = ms->smp.cpus - 1; cpu >= 0; cpu--) { +char *cpu_path = g_strdup_printf("/cpus/cpu@%d", cpu); +char *map_path; + +if (ms->smp.threads > 1) { +map_path = g_strdup_printf( +"/cpus/cpu-map/%s%d/%s%d/%s%d", +"socket", cpu / (ms->smp.cores * ms->smp.threads), +"core", (cpu / ms->smp.threads) % ms->smp.cores, +"thread", cpu % ms->smp.threads); +} else { +map_path = g_strdup_printf( +"/cpus/cpu-map/%s%d/%s%d", +"socket", cpu / ms->smp.cores, +"core", cpu % ms->smp.cores); +} +qemu_fdt_add_path(ms->fdt, map_path); +qemu_fdt_setprop_phandle(ms->fdt, map_path, "cpu", cpu_path); +g_free(map_path); +g_free(cpu_path); +} +} } static void fdt_add_its_gic_node(VirtMachineState *vms) @@ -2769,6 +2807,7 @@ static void virt_machine_5_2_options(MachineClass *mc) virt_machine_6_0_options(mc); compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len); vmc->no_secure_gpio = true; +vmc->no_cpu_topology = true; } DEFINE_VIRT_MACHINE(5, 2) diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index 921416f918..4a4b98e4a7 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -129,6 +129,7 @@ struct VirtMachineClass { bool no_kvm_steal_time; bool acpi_expose_flash; bool no_secure_gpio; +bool no_cpu_topology; }; struct VirtMachineState { -- 2.19.1
[RFC PATCH v2 4/6] hw/acpi/aml-build: Add processor hierarchy node structure
Add a generic API to build Processor Hierarchy Node Structure(Type 0), which is strictly consistent with descriptions in ACPI 6.3: 5.2.29.1. This function will be used to build ACPI PPTT table for cpu topology. Signed-off-by: Ying Fang Signed-off-by: Henglong Fan Signed-off-by: Yanan Wang --- hw/acpi/aml-build.c | 27 +++ include/hw/acpi/aml-build.h | 4 2 files changed, 31 insertions(+) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index d33ce8954a..75e01aea17 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -1916,6 +1916,33 @@ void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms, table_data->len - slit_start, 1, oem_id, oem_table_id); } +/* + * ACPI 6.3: 5.2.29.1 Processor Hierarchy Node Structure (Type 0) + */ +void build_processor_hierarchy_node(GArray *tbl, uint32_t flags, +uint32_t parent, uint32_t id, +uint32_t *priv_rsrc, uint32_t priv_num) +{ +int i; + +build_append_byte(tbl, 0); /* Type 0 - processor */ +build_append_byte(tbl, 20 + priv_num * 4); /* Length */ +build_append_int_noprefix(tbl, 0, 2); /* Reserved */ +build_append_int_noprefix(tbl, flags, 4); /* Flags */ +build_append_int_noprefix(tbl, parent, 4); /* Parent */ +build_append_int_noprefix(tbl, id, 4); /* ACPI processor ID */ + +/* Number of private resources */ +build_append_int_noprefix(tbl, priv_num, 4); + +/* Private resources[N] */ +if (priv_num > 0 && priv_rsrc != NULL) { +for (i = 0; i < priv_num; i++) { +build_append_int_noprefix(tbl, priv_rsrc[i], 4); +} +} +} + /* build rev1/rev3/rev5.1 FADT */ void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f, const char *oem_id, const char *oem_table_id) diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h index 471266d739..ea74b8f6ed 100644 --- a/include/hw/acpi/aml-build.h +++ b/include/hw/acpi/aml-build.h @@ -462,6 +462,10 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms, 
const char *oem_id, const char *oem_table_id); +void build_processor_hierarchy_node(GArray *tbl, uint32_t flags, +uint32_t parent, uint32_t id, +uint32_t *priv_rsrc, uint32_t priv_num); + void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f, const char *oem_id, const char *oem_table_id); -- 2.19.1
[RFC PATCH v2 1/6] device_tree: Add qemu_fdt_add_path
From: Andrew Jones qemu_fdt_add_path() works like qemu_fdt_add_subnode(), except it also adds any missing subnodes in the path. We also tweak an error message of qemu_fdt_add_subnode(). We'll make use of this new function in a coming patch. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- include/sysemu/device_tree.h | 1 + softmmu/device_tree.c| 45 ++-- 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/include/sysemu/device_tree.h b/include/sysemu/device_tree.h index 8a2fe55622..ef060a9759 100644 --- a/include/sysemu/device_tree.h +++ b/include/sysemu/device_tree.h @@ -121,6 +121,7 @@ uint32_t qemu_fdt_get_phandle(void *fdt, const char *path); uint32_t qemu_fdt_alloc_phandle(void *fdt); int qemu_fdt_nop_node(void *fdt, const char *node_path); int qemu_fdt_add_subnode(void *fdt, const char *name); +int qemu_fdt_add_path(void *fdt, const char *path); #define qemu_fdt_setprop_cells(fdt, node_path, property, ...) \ do { \ diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c index 2691c58cf6..8592c7aa1b 100644 --- a/softmmu/device_tree.c +++ b/softmmu/device_tree.c @@ -541,8 +541,8 @@ int qemu_fdt_add_subnode(void *fdt, const char *name) retval = fdt_add_subnode(fdt, parent, basename); if (retval < 0) { -error_report("FDT: Failed to create subnode %s: %s", name, - fdt_strerror(retval)); +error_report("%s: Failed to create subnode %s: %s", + __func__, name, fdt_strerror(retval)); exit(1); } @@ -550,6 +550,47 @@ int qemu_fdt_add_subnode(void *fdt, const char *name) return retval; } +/* + * Like qemu_fdt_add_subnode(), but will add all missing + * subnodes in the path. 
+ */ +int qemu_fdt_add_path(void *fdt, const char *path) +{ +char *dupname, *basename, *p; +int parent, retval = -1; + +if (path[0] != '/') { +return retval; +} + +parent = fdt_path_offset(fdt, "/"); +p = dupname = g_strdup(path); + +while (p) { +*p = '/'; +basename = p + 1; +p = strchr(p + 1, '/'); +if (p) { +*p = '\0'; +} +retval = fdt_path_offset(fdt, dupname); +if (retval < 0 && retval != -FDT_ERR_NOTFOUND) { +error_report("%s: Invalid path %s: %s", + __func__, path, fdt_strerror(retval)); +exit(1); +} else if (retval == -FDT_ERR_NOTFOUND) { +retval = fdt_add_subnode(fdt, parent, basename); +if (retval < 0) { +break; +} +} +parent = retval; +} + +g_free(dupname); +return retval; +} + void qemu_fdt_dumpdtb(void *fdt, int size) { const char *dumpdtb = current_machine->dumpdtb; -- 2.19.1
[RFC PATCH v2 6/6] hw/arm/virt: Replace smp_parse with one that prefers cores
From: Andrew Jones The virt machine type has never used the CPU topology parameters, other than number of online CPUs and max CPUs. When choosing how to allocate those CPUs the default has been to assume cores. In preparation for using the other CPU topology parameters let's use an smp_parse that prefers cores over sockets. We can also enforce the topology matches max_cpus check because we have no legacy to preserve. Signed-off-by: Andrew Jones Signed-off-by: Yanan Wang --- hw/arm/virt.c | 76 +++ 1 file changed, 76 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 3e5d9b6f26..57ef961cb5 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -79,6 +79,8 @@ #include "hw/virtio/virtio-iommu.h" #include "hw/char/pl011.h" #include "qemu/guest-random.h" +#include "qapi/qmp/qerror.h" +#include "sysemu/replay.h" #define DEFINE_VIRT_MACHINE_LATEST(major, minor, latest) \ static void virt_##major##_##minor##_class_init(ObjectClass *oc, \ @@ -2625,6 +2627,79 @@ static int virt_kvm_type(MachineState *ms, const char *type_str) return fixed_ipa ? 0 : requested_pa_size; } +/* + * Unlike smp_parse() in hw/core/machine.c, we prefer cores over sockets, + * e.g. '-smp 8' creates 1 socket with 8 cores. Whereas '-smp 8' with + * hw/core/machine.c's smp_parse() creates 8 sockets, each with 1 core. + * Additionally, we can enforce the topology matches max_cpus check, + * because we have no legacy to preserve. + */ +static void virt_smp_parse(MachineState *ms, QemuOpts *opts) +{ +if (opts) { +unsigned cpus = qemu_opt_get_number(opts, "cpus", 0); +unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); +unsigned cores = qemu_opt_get_number(opts, "cores", 0); +unsigned threads = qemu_opt_get_number(opts, "threads", 0); + +/* + * Compute missing values; prefer cores over sockets and + * sockets over threads. + */ +if (cpus == 0 || cores == 0) { +sockets = sockets > 0 ? sockets : 1; +threads = threads > 0 ? threads : 1; +if (cpus == 0) { +cores = cores > 0 ? 
cores : 1; +cpus = cores * threads * sockets; +} else { +ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); +cores = ms->smp.max_cpus / (sockets * threads); +} +} else if (sockets == 0) { +threads = threads > 0 ? threads : 1; +sockets = cpus / (cores * threads); +sockets = sockets > 0 ? sockets : 1; +} else if (threads == 0) { +threads = cpus / (cores * sockets); +threads = threads > 0 ? threads : 1; +} else if (sockets * cores * threads < cpus) { +error_report("cpu topology: " + "sockets (%u) * cores (%u) * threads (%u) < " + "smp_cpus (%u)", + sockets, cores, threads, cpus); +exit(1); +} + +ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); + +if (ms->smp.max_cpus < cpus) { +error_report("maxcpus must be equal to or greater than smp"); +exit(1); +} + +if (sockets * cores * threads != ms->smp.max_cpus) { +error_report("cpu topology: " + "sockets (%u) * cores (%u) * threads (%u)" + "!= maxcpus (%u)", + sockets, cores, threads, + ms->smp.max_cpus); +exit(1); +} + +ms->smp.cpus = cpus; +ms->smp.cores = cores; +ms->smp.threads = threads; +ms->smp.sockets = sockets; +} + +if (ms->smp.cpus > 1) { +Error *blocker = NULL; +error_setg(&blocker, QERR_REPLAY_NOT_SUPPORTED, "smp"); +replay_add_blocker(blocker); +} +} + static void virt_machine_class_init(ObjectClass *oc, void *data) { MachineClass *mc = MACHINE_CLASS(oc); @@ -2650,6 +2725,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data) mc->cpu_index_to_instance_props = virt_cpu_index_to_props; mc->default_cpu_type = ARM_CPU_TYPE_NAME("cortex-a15"); mc->get_default_cpu_node_id = virt_get_default_cpu_node_id; +mc->smp_parse = virt_smp_parse; mc->kvm_type = virt_kvm_type; assert(!mc->get_hotplug_handler); mc->get_hotplug_handler = virt_machine_get_hotplug_handler; -- 2.19.1
[RFC PATCH v2 5/6] hw/arm/virt-acpi-build: Add PPTT table
Add the Processor Properties Topology Table (PPTT) to present CPU topology information to ACPI guests. Note, while a DT boot Linux guest with a non-flat CPU topology will see socket and core IDs being sequential integers starting from zero, e.g. with -smp 4,sockets=2,cores=2,threads=1 a DT boot produces

  cpu: 0  package_id: 0   core_id: 0
  cpu: 1  package_id: 0   core_id: 1
  cpu: 2  package_id: 1   core_id: 0
  cpu: 3  package_id: 1   core_id: 1

an ACPI boot produces

  cpu: 0  package_id: 36  core_id: 0
  cpu: 1  package_id: 36  core_id: 1
  cpu: 2  package_id: 96  core_id: 2
  cpu: 3  package_id: 96  core_id: 3

This is due to several reasons:

1) DT cpu nodes do not have an equivalent field to what the PPTT ACPI Processor ID must be, i.e. something equal to the MADT CPU UID or equal to the UID of an ACPI processor container. In both ACPI cases those are platform-dependent IDs assigned by the vendor.

2) While QEMU is the vendor for a guest, if the topology specifies SMT (> 1 thread), then, with ACPI, it is impossible to assign a core-id the same value as a package-id, thus it is not possible to have package-id=0 and core-id=0. This is because package and core containers must be in the same ACPI namespace and therefore must have unique UIDs.

3) ACPI processor containers are not required for PPTT tables to be used and, due to the limitations on which IDs can be selected described above in (2), they are not helpful for QEMU, so we don't build them with this patch. In their absence, Linux assigns its own unique IDs. The maintainers have chosen not to use counters from zero, but rather ACPI table offsets, which explains why the numbers are so much larger than with DT.

4) When there is no SMT (threads=1), the core IDs for ACPI boot guests match the logical CPU IDs, because these IDs must be equal to the MADT CPU UID (as no processor containers are present), and QEMU uses the logical CPU ID for these MADT IDs.
Tested-by: Jiajie Li Signed-off-by: Andrew Jones Signed-off-by: Ying Fang Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 63 1 file changed, 63 insertions(+) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 2ad5dad1bf..03fd812d5a 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -436,6 +436,64 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) vms->oem_table_id); } +/* PPTT */ +static void +build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) +{ +int pptt_start = table_data->len; +int uid = 0, cpus = 0, socket = 0; +MachineState *ms = MACHINE(vms); +unsigned int smp_cores = ms->smp.cores; +unsigned int smp_threads = ms->smp.threads; + +acpi_data_push(table_data, sizeof(AcpiTableHeader)); + +for (socket = 0; cpus < ms->possible_cpus->len; socket++) { +uint32_t socket_offset = table_data->len - pptt_start; +int core; + +build_processor_hierarchy_node( +table_data, 1, /* Physical package */ +0, socket, /* No parent */ +NULL, 0); /* No private resources */ + +for (core = 0; core < smp_cores; core++) { +uint32_t core_offset = table_data->len - pptt_start; +int thread; + +if (smp_threads <= 1) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +socket_offset, uid++, /* Parent is a Socket */ +NULL, 0); /* No private resources */ +} else { +build_processor_hierarchy_node( +table_data, 0, +socket_offset, core, /* Parent is a Socket */ +NULL, 0); /* No private resources */ + +for (thread = 0; thread < smp_threads; thread++) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +core_offset, uid++, /* Parent is a Core */ +NULL, 0); /* No private resources */ +} +} +} +cpus += smp_cores * smp_threads; +} + +build_header(linker, table_data, + (void *)(table_data->data + 
pptt_start), "PPTT", + table_data->len - pptt_start, 2, + vms->oem_id, vms->oem_table_id); +} + /* GTDT */ static void build_gtdt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) @@ -707,6 +765,11 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
[PATCH-for-6.1] exec: Remove accel/tcg/ from include paths
When TCG is enabled, the accel/tcg/ include path is added to the project-global include search list. This accel/tcg/ directory contains a header named "internal.h" which, while intended to be internal to accel/tcg/, is accessible by all files compiled when TCG is enabled. This might lead to problems with other directories using the same "internal.h" header name:

$ git ls-files | fgrep /internal.h
accel/tcg/internal.h
include/hw/ide/internal.h
target/hexagon/internal.h
target/mips/internal.h
target/ppc/internal.h
target/s390x/internal.h

As we don't need to expose accel/tcg/ internals to the rest of the code base, simplify by removing it from the include search list, and include the accel/tcg/ public headers relative to the project root search path (which is already in the generic include search path).

Signed-off-by: Philippe Mathieu-Daudé
---
Arguably public accel/tcg/ headers should be exposed under include/.
---
 meson.build                 | 1 -
 include/exec/helper-gen.h   | 4 ++--
 include/exec/helper-proto.h | 4 ++--
 include/exec/helper-tcg.h   | 4 ++--
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/meson.build b/meson.build
index c6f4b0cf5e8..d8bb1ec5aa9 100644
--- a/meson.build
+++ b/meson.build
@@ -258,7 +258,6 @@
     tcg_arch = 'riscv'
   endif
   add_project_arguments('-iquote', meson.current_source_dir() / 'tcg' / tcg_arch,
-                        '-iquote', meson.current_source_dir() / 'accel/tcg',
                         language: ['c', 'cpp', 'objc'])

   accelerators += 'CONFIG_TCG'
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 29c02f85dcc..1c2e7a8ed39 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -81,8 +81,8 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret) \
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index 659f9298e8f..ba100793a7d 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -39,8 +39,8 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \

 #include "helper.h"
 #include "trace/generated-helpers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef IN_HELPER_PROTO
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 27870509a20..68885146355 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -60,8 +60,8 @@

 #include "helper.h"
 #include "trace/generated-helpers.h"
-#include "tcg-runtime.h"
-#include "plugin-helpers.h"
+#include "accel/tcg/tcg-runtime.h"
+#include "accel/tcg/plugin-helpers.h"

 #undef str
 #undef DEF_HELPER_FLAGS_0
--
2.26.3
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
On 13.04.21 06:41, Markus Armbruster wrote: David Hildenbrand writes: On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. Rebasing my "noreserve"[1] series on this, I get weird errors from QEMU when specifying the new "reserve=off" option for a memory-backend-ram: "Invalid parameter 'reserve'" And it looks like this is the case for any new properties. Poking around, I fail to find what's causing this -- or how to unlock new properties. What is the magic toggle to make it work? Thanks! [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com Wild guess: you didn't add your new properties in the QAPI schema. For a not-so-wild-guess, send us a git-fetch argument for your rebased series. Oh, there is qapi/qom.json -- maybe that does the trick. (I have mixed feelings about having to specify the same thing twice at different locations) I'll have a look if that makes it fly. -- Thanks, David / dhildenb
Re: [PATCH-for-6.1] exec: Remove accel/tcg/ from include paths
Reviewed-by: Claudio Fontana

Ciao,

Claudio

On 4/13/21 10:10 AM, Philippe Mathieu-Daudé wrote:
> When TCG is enabled, the accel/tcg/ include path is added to the
> project-global include search list. This accel/tcg/ directory
> contains a header named "internal.h" which, while intended to
> be internal to accel/tcg/, is accessible by all files compiled
> when TCG is enabled. This might lead to problems with other
> directories using the same "internal.h" header name:
>
> $ git ls-files | fgrep /internal.h
> accel/tcg/internal.h
> include/hw/ide/internal.h
> target/hexagon/internal.h
> target/mips/internal.h
> target/ppc/internal.h
> target/s390x/internal.h
>
> As we don't need to expose accel/tcg/ internals to the rest of
> the code base, simplify by removing it from the include search
> list, and include the accel/tcg/ public headers relative to the
> project root search path (which is already in the generic include
> search path).
>
> Signed-off-by: Philippe Mathieu-Daudé
> ---
> Arguably public accel/tcg/ headers should be exposed under include/.
> --- > meson.build | 1 - > include/exec/helper-gen.h | 4 ++-- > include/exec/helper-proto.h | 4 ++-- > include/exec/helper-tcg.h | 4 ++-- > 4 files changed, 6 insertions(+), 7 deletions(-) > > diff --git a/meson.build b/meson.build > index c6f4b0cf5e8..d8bb1ec5aa9 100644 > --- a/meson.build > +++ b/meson.build > @@ -258,7 +258,6 @@ > tcg_arch = 'riscv' >endif >add_project_arguments('-iquote', meson.current_source_dir() / 'tcg' / > tcg_arch, > -'-iquote', meson.current_source_dir() / 'accel/tcg', > language: ['c', 'cpp', 'objc']) > >accelerators += 'CONFIG_TCG' > diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h > index 29c02f85dcc..1c2e7a8ed39 100644 > --- a/include/exec/helper-gen.h > +++ b/include/exec/helper-gen.h > @@ -81,8 +81,8 @@ static inline void glue(gen_helper_, > name)(dh_retvar_decl(ret) \ > #include "helper.h" > #include "trace/generated-helpers.h" > #include "trace/generated-helpers-wrappers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include "accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef DEF_HELPER_FLAGS_0 > #undef DEF_HELPER_FLAGS_1 > diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h > index 659f9298e8f..ba100793a7d 100644 > --- a/include/exec/helper-proto.h > +++ b/include/exec/helper-proto.h > @@ -39,8 +39,8 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), > dh_ctype(t3), \ > > #include "helper.h" > #include "trace/generated-helpers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include "accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef IN_HELPER_PROTO > > diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h > index 27870509a20..68885146355 100644 > --- a/include/exec/helper-tcg.h > +++ b/include/exec/helper-tcg.h > @@ -60,8 +60,8 @@ > > #include "helper.h" > #include "trace/generated-helpers.h" > -#include "tcg-runtime.h" > -#include "plugin-helpers.h" > +#include 
"accel/tcg/tcg-runtime.h" > +#include "accel/tcg/plugin-helpers.h" > > #undef str > #undef DEF_HELPER_FLAGS_0 >
[RFC PATCH v2 1/4] vl.c: Add -smp, clusters=* command line support for ARM cpu
A cluster is a group of cores that share some resources (e.g. cache) under the LLC. For example, the ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cores. All clusters share the L3 cache data, while cores within each cluster share the L2 cache. The cache affinity of clusters has been shown to improve Linux kernel scheduling performance, and a patchset has been posted in which a general sched_domain for clusters was added and a cluster level was added to the arch-neutral cpu topology struct, like below:

struct cpu_topology {
    int thread_id;
    int core_id;
    int cluster_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t cluster_sibling;
    cpumask_t llc_sibling;
}

Also, the kernel documentation Documentation/devicetree/bindings/cpu/cpu-topology.txt defines a four-level CPU topology hierarchy: socket/cluster/core/thread. According to it, a socket node's child nodes must be one or more cluster nodes, and a cluster node's child nodes must be one or more cluster nodes or one or more core nodes.

So let's add -smp clusters=* command line support for ARM CPUs, so that a future guest OS can make use of the cluster cpu topology for better scheduling performance. For ARM machines, a four-level cpu hierarchy can be defined as sockets/clusters/cores/threads. Because we currently only support clusters for ARM CPUs, a new member "unsigned smp_clusters" is added to the VirtMachineState structure. 
Signed-off-by: Yanan Wang --- include/hw/arm/virt.h | 1 + qemu-options.hx | 26 +++--- softmmu/vl.c | 3 +++ 3 files changed, 19 insertions(+), 11 deletions(-) diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index 4a4b98e4a7..5d5d156924 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -155,6 +155,7 @@ struct VirtMachineState { char *pciehb_nodename; const int *irqmap; int fdt_size; +unsigned smp_clusters; uint32_t clock_phandle; uint32_t gic_phandle; uint32_t msi_phandle; diff --git a/qemu-options.hx b/qemu-options.hx index fd21002bd6..65343ea23c 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -184,25 +184,29 @@ SRST ERST DEF("smp", HAS_ARG, QEMU_OPTION_smp, -"-smp [cpus=]n[,maxcpus=cpus][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]\n" +"-smp [cpus=]n[,maxcpus=cpus][,clusters=clusters][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]\n" "set the number of CPUs to 'n' [default=1]\n" "maxcpus= maximum number of total cpus, including\n" "offline CPUs for hotplug, etc\n" -"cores= number of CPU cores on one socket (for PC, it's on one die)\n" +"cores= number of CPU cores on one socket\n" +"(it's on one die for PC, and on one cluster for ARM)\n" "threads= number of threads on one CPU core\n" +"clusters= number of CPU clusters on one socket (for ARM only)\n" "dies= number of CPU dies on one socket (for PC only)\n" "sockets= number of discrete sockets in the system\n", QEMU_ARCH_ALL) SRST -``-smp [cpus=]n[,cores=cores][,threads=threads][,dies=dies][,sockets=sockets][,maxcpus=maxcpus]`` -Simulate an SMP system with n CPUs. On the PC target, up to 255 CPUs -are supported. On Sparc32 target, Linux limits the number of usable -CPUs to 4. For the PC target, the number of cores per die, the -number of threads per cores, the number of dies per packages and the -total number of sockets can be specified. Missing values will be -computed. If any on the three values is given, the total number of -CPUs n can be omitted. 
maxcpus specifies the maximum number of -hotpluggable CPUs. +``-smp [cpus=]n[,maxcpus=cpus][,clusters=clusters][,cores=cores][,threads=threads][,dies=dies][,sockets=sockets]`` +Simulate an SMP system with n CPUs. On the PC target, up to 255 +CPUs are supported. On the Sparc32 target, Linux limits the number +of usable CPUs to 4. For the PC target, the number of threads per +core, the number of cores per die, the number of dies per package +and the total number of sockets can be specified. For the ARM target, +the number of threads per core, the number of cores per cluster, the +number of clusters per socket and the total number of sockets can be +specified. And missing values will be computed. If any of the five +values is given, the total number of CPUs n can be omitted. Maxcpus +specifies the maximum number of hotpluggable CPUs. ERST DEF("numa", HAS_ARG, QEMU_OPTION_numa, diff --git a/softmmu/vl.c b/softmmu/vl.c index aadb526138..46f5b6b575 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -720,6 +720,9 @@ static QemuOptsList qemu_smp_opts = { }, {
[RFC PATCH v2 0/4] hw/arm/virt: Introduce cluster cpu topology support
Hi,

This series is a new version of [0], posted to introduce cluster cpu topology support for the ARM platform, in addition to the existing sockets, cores, and threads. The code has been rewritten based on patch series [1].

[0] https://patchwork.kernel.org/project/qemu-devel/cover/20210331095343.12172-1-wangyana...@huawei.com/
[1] https://patchwork.kernel.org/project/qemu-devel/cover/20210413080745.33004-1-wangyana...@huawei.com/

Changelogs:
v1->v2:
- Only focus on cluster support for the ARM platform
- Rebase the code on patch series [1]

Description:
A cluster is a group of cores that share some resources (e.g. cache) under the LLC. For example, the ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cores. All clusters share the L3 cache data, while cores within each cluster share the L2 cache. The cache affinity of clusters has been shown to improve Linux kernel scheduling performance, and a patchset [2] has already been posted, where a general sched_domain for clusters was added and a cluster level was added to the arch-neutral cpu topology struct, like below:

struct cpu_topology {
    int thread_id;
    int core_id;
    int cluster_id;
    int package_id;
    int llc_id;
    cpumask_t thread_sibling;
    cpumask_t core_sibling;
    cpumask_t cluster_sibling;
    cpumask_t llc_sibling;
};

Also, kernel doc [3]: Documentation/devicetree/bindings/cpu/cpu-topology.txt defines a four-level CPU topology hierarchy: socket/cluster/core/thread. According to it, a socket node's child nodes must be one or more cluster nodes, and a cluster node's child nodes must be one or more cluster nodes or one or more core nodes.

So let's add -smp clusters=* command line support for ARM CPUs, so that a future guest OS can make use of the cluster cpu topology for better scheduling performance. For ARM machines, a four-level cpu hierarchy can be defined as sockets/clusters/cores/threads. 
[2] https://patchwork.kernel.org/project/linux-arm-kernel/cover/20210319041618.14316-1-song.bao@hisilicon.com/ [3] https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/cpu/cpu-topology.txt Test results: After applying this patch series, launch a guest with virt-6.0 and cpu topology configured with: -smp 96,sockets=2,clusters=6,cores=4,threads=2, VM's cpu topology description shows as below. lscpu: Architecture:aarch64 Byte Order: Little Endian CPU(s): 96 On-line CPU(s) list: 0-95 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 2 NUMA node(s):1 Vendor ID: 0x48 Model: 0 Stepping:0x1 BogoMIPS:200.00 NUMA node0 CPU(s): 0-95 Topology information of clusters can also be got: cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list: 0-7 cat /sys/devices/system/cpu/cpu0/topology/cluster_id: 56 cat /sys/devices/system/cpu/cpu8/topology/cluster_cpus_list: 8-15 cat /sys/devices/system/cpu/cpu8/topology/cluster_id: 316 ... cat /sys/devices/system/cpu/cpu95/topology/cluster_cpus_list: 88-95 cat /sys/devices/system/cpu/cpu95/topology/cluster_id: 2936 Yanan Wang (4): vl.c: Add -smp, clusters=* command line support for ARM cpu hw/arm/virt: Parse -smp cluster parameter in virt_smp_parse hw/arm/virt-acpi-build: Add cluster level for PPTT table hw/arm/virt: Add cluster level for device tree hw/arm/virt-acpi-build.c | 55 hw/arm/virt.c| 44 +++- include/hw/arm/virt.h| 1 + qemu-options.hx | 26 +++ softmmu/vl.c | 3 +++ 5 files changed, 78 insertions(+), 51 deletions(-) -- 2.19.1
[RFC PATCH v2 3/4] hw/arm/virt-acpi-build: Add cluster level for PPTT table
Add a Processor Hierarchy Node of cluster level between core level and package level for ARM PPTT table. Signed-off-by: Yanan Wang --- hw/arm/virt-acpi-build.c | 55 1 file changed, 33 insertions(+), 22 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 03fd812d5a..2b745711d1 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -443,6 +443,7 @@ build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) int pptt_start = table_data->len; int uid = 0, cpus = 0, socket = 0; MachineState *ms = MACHINE(vms); +unsigned int smp_clusters = vms->smp_clusters; unsigned int smp_cores = ms->smp.cores; unsigned int smp_threads = ms->smp.threads; @@ -450,42 +451,52 @@ build_pptt(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) for (socket = 0; cpus < ms->possible_cpus->len; socket++) { uint32_t socket_offset = table_data->len - pptt_start; -int core; +int cluster; build_processor_hierarchy_node( table_data, 1, /* Physical package */ 0, socket, /* No parent */ NULL, 0); /* No private resources */ -for (core = 0; core < smp_cores; core++) { -uint32_t core_offset = table_data->len - pptt_start; -int thread; - -if (smp_threads <= 1) { -build_processor_hierarchy_node( -table_data, -(1 << 1) | /* ACPI Processor ID valid */ -(1 << 3), /* ACPI 6.3 - Node is a Leaf */ -socket_offset, uid++, /* Parent is a Socket */ -NULL, 0); /* No private resources */ -} else { -build_processor_hierarchy_node( -table_data, 0, -socket_offset, core, /* Parent is a Socket */ -NULL, 0); /* No private resources */ - -for (thread = 0; thread < smp_threads; thread++) { +for (cluster = 0; cluster < smp_clusters; cluster++) { +uint32_t cluster_offset = table_data->len - pptt_start; +int core; + +build_processor_hierarchy_node( +table_data, 0, +socket_offset, cluster, /* Parent is a Socket */ +NULL, 0); /* No private resources */ + +for (core = 0; core < smp_cores; core++) { +uint32_t core_offset = table_data->len - pptt_start; +int 
thread; + +if (smp_threads <= 1) { build_processor_hierarchy_node( table_data, (1 << 1) | /* ACPI Processor ID valid */ -(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ (1 << 3), /* ACPI 6.3 - Node is a Leaf */ -core_offset, uid++, /* Parent is a Core */ +cluster_offset, uid++, /* Parent is a Cluster */ NULL, 0); /* No private resources */ +} else { +build_processor_hierarchy_node( +table_data, 0, +cluster_offset, core, /* Parent is a Cluster */ +NULL, 0); /* No private resources */ + +for (thread = 0; thread < smp_threads; thread++) { +build_processor_hierarchy_node( +table_data, +(1 << 1) | /* ACPI Processor ID valid */ +(1 << 2) | /* ACPI 6.3 - Processor is a Thread */ +(1 << 3), /* ACPI 6.3 - Node is a Leaf */ +core_offset, uid++, /* Parent is a Core */ +NULL, 0); /* No private resources */ +} } } } -cpus += smp_cores * smp_threads; +cpus += smp_clusters * smp_cores * smp_threads; } build_header(linker, table_data, -- 2.19.1
[RFC PATCH v2 4/4] hw/arm/virt: Add cluster level for device tree
Add a cluster level between core level and package level for ARM device tree. Signed-off-by: Yanan Wang --- hw/arm/virt.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 51797628db..4468a4734b 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -434,14 +434,18 @@ static void fdt_add_cpu_nodes(const VirtMachineState *vms) if (ms->smp.threads > 1) { map_path = g_strdup_printf( -"/cpus/cpu-map/%s%d/%s%d/%s%d", -"socket", cpu / (ms->smp.cores * ms->smp.threads), +"/cpus/cpu-map/%s%d/%s%d/%s%d/%s%d", +"socket", cpu / (vms->smp_clusters * ms->smp.cores * +ms->smp.threads), +"cluster", (cpu / (ms->smp.cores * ms->smp.threads)) % +vms->smp_clusters, "core", (cpu / ms->smp.threads) % ms->smp.cores, "thread", cpu % ms->smp.threads); } else { map_path = g_strdup_printf( -"/cpus/cpu-map/%s%d/%s%d", -"socket", cpu / ms->smp.cores, +"/cpus/cpu-map/%s%d/%s%d/%s%d", +"socket", cpu / (vms->smp_clusters * ms->smp.cores), +"cluster", (cpu / ms->smp.cores) % vms->smp_clusters, "core", cpu % ms->smp.cores); } qemu_fdt_add_path(ms->fdt, map_path); -- 2.19.1
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
On 13.04.21 10:13, David Hildenbrand wrote: On 13.04.21 06:41, Markus Armbruster wrote: David Hildenbrand writes: On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. Rebasing my "noreserve"[1] series on this, I get weird errors from QEMU when specifying the new "reserve=off" option for a memory-backend-ram: "Invalid parameter 'reserve'" And it looks like this is the case for any new properties. Poking around, I fail to find what's causing this -- or how to unlock new properties. What is the magic toggle to make it work? Thanks! [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com Wild guess: you didn't add your new properties in the QAPI schema. For a not-so-wild-guess, send us a git-fetch argument for your rebased series. Oh, there is qapi/qom.json -- maybe that does the trick. (I have mixed feelings about having to specify the same thing twice at different locations) I'll have a look if that makes it fly. Yes, works just fine -- thanks! -- Thanks, David / dhildenb
Re: [RFC v12 27/65] target/arm: split a15 cpu model and 32bit class functions to cpu32.c
On 4/8/21 12:23 PM, Claudio Fontana wrote: > On 3/28/21 6:18 PM, Richard Henderson wrote: >> On 3/26/21 1:36 PM, Claudio Fontana wrote: >>> provide helper functions there to initialize 32bit models, >>> and export the a15 cpu model. >>> >>> We still need to keep around a15 until we sort out the board configurations. >>> >>> cpu.c will continue to contain the common parts between >>> 32 and 64. >>> >>> Note that we need to build cpu32 also for TARGET_AARCH64, >>> because qemu-system-aarch64 is supposed to be able to run >>> non-aarch64 cpus too. >>> >>> Signed-off-by: Claudio Fontana >> >> Dump state has nothing to do with a15 or the 32-bit models. > > The relationship I see here is that 32-bit builds, > > only really need to use aarch32 version of the dump state, and they do not > need to see the aarch64 version. > >> Moving that should be a separate change. >> >> The gdbstub changes are also a separate change, afaics. > > To hopefully clarify things a bit more here, the current hierarchy seems not right for ARM CPUs also in this respect. TYPE_ARM_CPU is the ancestor of all ARM CPUs, ok. TYPE_AARCH64_CPU is child of TYPE_ARM_CPU, ok. There is however no TYPE_AARCH32_CPU, or TYPE_ARM32_CPU. So what ends up being necessary here (both in the mainline code and in this patch, which just makes it more explicit), is to make the ancestor TYPE_ARM_CPU effectively a 32bit CPU class, with TYPE_AARCH64_CPU overriding class methods in order to morph it into a 64bit CPU class. What this patch does is to make it explicit, by creating a cpu32.c module that contains this non-explicit 32bit ARM CPU class methods, and the registration function to register 32bit ARM cpus. Thanks, Claudio > But the main concern is to split more, I understand: will do. > >> >> I don't understand the a15 comment above. Is it really relevant to moving >> the >> 32-bit cpu models? 
> > > The point there was that we keep around a15 for KVM for the moment, instead > of relegating it to the set of "tcg-only" models, > so that virt board and qtest continue to function also in the KVM-only build. > > Mainly for this code here a question from my side: is the current code > actually already "wrong"? > > I mean, we unconditionally set the aarch64-capable cpu classes to all use > aarch64_gdb_arch_name and gdbstub64, > but what about an aarch64-capable cpu running in 32bit mode? > > Why don't we have, like tentatively done here for arm_cpu_dump_state, > > an arm_gdb_arch_name and an arm_cpu_gdb_read_register that check is_a64() and > call aaarch32_cpu_gdb_read_register or aarch64_cpu_gdb_read_register > accordingly? > > >> >> >> r~ >> >
[RFC PATCH v2 2/4] hw/arm/virt: Parse -smp cluster parameter in virt_smp_parse
There is a separate function virt_smp_parse() in hw/arm/virt.c used to parse cpu topology for the ARM machines. So add parsing of the -smp cluster parameter there; the total number of logical cpus will then be calculated as: max_cpus = sockets * clusters * cores * threads. In virt_smp_parse(), the logic for computing missing values prefers cores over sockets over threads. For compatibility, the value of clusters defaults to 1 if not explicitly specified. Signed-off-by: Yanan Wang --- hw/arm/virt.c | 32 ++-- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 57ef961cb5..51797628db 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -2639,35 +2639,38 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) if (opts) { unsigned cpus = qemu_opt_get_number(opts, "cpus", 0); unsigned sockets = qemu_opt_get_number(opts, "sockets", 0); +unsigned clusters = qemu_opt_get_number(opts, "clusters", 1); unsigned cores = qemu_opt_get_number(opts, "cores", 0); unsigned threads = qemu_opt_get_number(opts, "threads", 0); +VirtMachineState *vms = VIRT_MACHINE(ms); /* - * Compute missing values; prefer cores over sockets and - * sockets over threads. + * Compute missing values; prefer cores over sockets and sockets + * over threads. For compatibility, value of clusters will have + * been set as default 1 if not explicitly specified. */ if (cpus == 0 || cores == 0) { sockets = sockets > 0 ? sockets : 1; threads = threads > 0 ? threads : 1; if (cpus == 0) { cores = cores > 0 ? cores : 1; -cpus = cores * threads * sockets; +cpus = sockets * clusters * cores * threads; } else { ms->smp.max_cpus = qemu_opt_get_number(opts, "maxcpus", cpus); -cores = ms->smp.max_cpus / (sockets * threads); +cores = ms->smp.max_cpus / (sockets * clusters * threads); } } else if (sockets == 0) { threads = threads > 0 ? threads : 1; -sockets = cpus / (cores * threads); +sockets = cpus / (clusters * cores * threads); sockets = sockets > 0 ? 
sockets : 1; } else if (threads == 0) { -threads = cpus / (cores * sockets); +threads = cpus / (sockets * clusters * cores); threads = threads > 0 ? threads : 1; -} else if (sockets * cores * threads < cpus) { +} else if (sockets * clusters * cores * threads < cpus) { error_report("cpu topology: " - "sockets (%u) * cores (%u) * threads (%u) < " - "smp_cpus (%u)", - sockets, cores, threads, cpus); + "sockets (%u) * clusters (%u) * cores (%u) * " + "threads (%u) < smp_cpus (%u)", + sockets, clusters, cores, threads, cpus); exit(1); } @@ -2678,11 +2681,11 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) exit(1); } -if (sockets * cores * threads != ms->smp.max_cpus) { +if (sockets * clusters * cores * threads != ms->smp.max_cpus) { error_report("cpu topology: " - "sockets (%u) * cores (%u) * threads (%u)" - "!= maxcpus (%u)", - sockets, cores, threads, + "sockets (%u) * clusters (%u) * cores (%u) * " + "threads (%u) != maxcpus (%u)", + sockets, clusters, cores, threads, ms->smp.max_cpus); exit(1); } @@ -2691,6 +2694,7 @@ static void virt_smp_parse(MachineState *ms, QemuOpts *opts) ms->smp.cores = cores; ms->smp.threads = threads; ms->smp.sockets = sockets; +vms->smp_clusters = clusters; } if (ms->smp.cpus > 1) { -- 2.19.1
Re: trace_FOO_tcg bit-rotted?
On Mon, Apr 12, 2021 at 08:06:57PM +0100, Alex Bennée wrote: > > Stefan Hajnoczi writes: > > > On Fri, Apr 09, 2021 at 05:29:08PM +0100, Alex Bennée wrote: > >> > >> Laurent Vivier writes: > >> > >> > Le 06/04/2021 à 18:00, Alex Bennée a écrit : > >> >> Hi, > >> >> > >> >> It's been awhile since I last played with this but I think we are > >> >> suffering from not having some test cases for tracing code > >> >> generation/execution in the tree. I tried adding a simple trace point to > >> >> see if I could track ERET calls: > >> >> > >> >> --8<---cut here---start->8--- > >> >> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c > >> >> index 0b42e53500..0d643f78fe 100644 > >> >> --- a/target/arm/translate-a64.c > >> >> +++ b/target/arm/translate-a64.c > >> >> @@ -36,6 +36,7 @@ > >> >> #include "exec/log.h" > >> >> > >> >> #include "trace-tcg.h" > >> >> +#include "trace.h" > >> >> #include "translate-a64.h" > >> >> #include "qemu/atomic128.h" > >> >> > >> >> @@ -2302,6 +2303,9 @@ static void disas_uncond_b_reg(DisasContext *s, > >> >> uint32_t insn) > >> >> default: > >> >> goto do_unallocated; > >> >> } > >> >> + > >> >> +trace_eret_tcg(s->current_el, dst); > >> >> + > >> >> if (tb_cflags(s->base.tb) & CF_USE_ICOUNT) { > >> >> gen_io_start(); > >> >> } > >> >> diff --git a/target/arm/trace-events b/target/arm/trace-events > >> >> index 41c63d7570..2d4fca16a1 100644 > >> >> --- a/target/arm/trace-events > >> >> +++ b/target/arm/trace-events > >> >> @@ -1,5 +1,10 @@ > >> >> # See docs/devel/tracing.txt for syntax documentation. > >> >> > >> >> +# translate-a64.c > >> >> +# Mode: softmmu > >> >> +# Targets: TCG(aarch64-softmmu) > >> >> +tcg eret(int current_el, TCGv target_el) "trans_eret: from EL%d", > >> >> "exec_eret: EL%d to EL%"PRId64 > >> > > >> > If I read correctly, the name should be eret_tcg() > >> > And I'm not sure TCGv will be accepted as a parameter type, use > >> > uint64_t instead (and %PRIu64) > >> > >> This was my confusion. 
I thought the trace-events file was prefixed with > >> tcg like guest_mem_before: > >> > >> vcpu tcg guest_mem_before(TCGv vaddr, uint16_t info) "info=%d", > >> "vaddr=0x%016"PRIx64" info=%d" > >> > >> and that signalled the tools to generate _trans, _exec and _tcg hooks in > >> the generated files. The trace code (see other patch) also has logic to > >> translate natural TCG types into the native types as well signalling > >> which values are only visible for the _exec portion. > >> > >> Maybe I'm overthinking this. Perhaps all the TCG tracing use cases are > >> just as easily supported with TCG plugins now and we should deprecate > >> this unused bit of complexity. I certainly understand the plugin > >> interactions better ;-) > > > > Lluís: are you happy to deprecate tcg trace events in favor of TCG > > plugins? > > > > My question is whether TCG plugins are really equivalent here. Will TCG > > plugin users have to write their own log file output code to extract > > this information from the QEMU process (i.e. reinventing tracing)? > > Yes - although there is no reason we couldn't expose the trace output as > a helper. Currently there is: > > /** >* qemu_plugin_outs() - output string via QEMU's logging system >* @string: a string >*/ > void qemu_plugin_outs(const char *string); > > which allows the user to echo to the log in conjunction with -d plugin > on the command line. Plugins are of course free to do their own thing. > > > Is > > the performance at least as good as tracing? > > Well like all things that depends ;-) > > Generally on the sort of events you tend to trace (like the example > memory access) you usually either want to aggregate or filter your > results. With trace output there is no way to do this and you end up > post processing potentially very large files (smaller if you use the > non-default binary format). I don't know if a similar thing is possible > with uprobes and ebpf - I've only ever used the simple logging output in > anger. 
The example plugins generally do things like count total > accesses: > > https://gitlab.com/qemu-project/qemu/-/blob/master/tests/plugin/mem.c > > (pretty fast in comparison to writing out reams of data) > > or aggregate the results: > > > https://gitlab.com/qemu-project/qemu/-/blob/master/contrib/plugins/hotpages.c > > (probably slower while running QEMU, but faster overall because no post > processing of log files required.) > > Of course plugins are a non-default build option because although light > it does have a small performance impact on code generation even when no > instrumentation is occurring. As a result you can't use it without > building a version first. > > If we had test code in the build that used the TCG tracing abilities I > wouldn't worry because at least we are defending the feature and we > wouldn't run into probl
[Bug 1923583] [NEW] colo: pvm flush failed after svm killed
Public bug reported: Hi, the primary VM flush failed after killing the secondary VM (svm), which leaves the primary VM's guest filesystem unavailable. qemu version: 5.2.0 host/guest os: CentOS Linux release 7.6.1810 (Core) Reproduce steps: 1. create a COLO VM following https://github.com/qemu/qemu/blob/master/docs/COLO-FT.txt 2. kill the secondary VM (don't remove the nbd child from quorum on the primary VM) and wait for a minute; the interval depends on the guest OS. Result: the primary VM's filesystem shuts down because of a flush cache error. After several tests, I found that qemu-5.0.0 worked well, and that commit https://git.qemu.org/?p=qemu.git;a=commit;h=883833e29cb800b4d92b5d4736252f4004885191 (block: Flush all children in generic code) introduced this change; both virtio-blk and ide turned out to be affected. I think the nbd (replication) flush failure makes bdrv_co_flush(quorum_bs) fail; here is the call stack. #0 bdrv_co_flush (bs=0x56242b3cc0b0=nbd_bs) at ../block/io.c:2856 #1 0x562428b0f399 in bdrv_co_flush (bs=0x56242b3c7e00=replication_bs) at ../block/io.c:2920 #2 0x562428b0f399 in bdrv_co_flush (bs=0x56242a4ad800=quorum_bs) at ../block/io.c:2920 #3 0x562428b70d56 in blk_do_flush (blk=0x56242a4ad4a0) at ../block/block-backend.c:1672 #4 0x562428b70d87 in blk_aio_flush_entry (opaque=0x7fd0980073f0) at ../block/block-backend.c:1680 #5 0x562428c5f9a7 in coroutine_trampoline (i0=-1409269904, i1=32721) at ../util/coroutine-ucontext.c:173 I am not sure whether I am using COLO improperly. Can we assume that the nbd child of quorum is removed immediately after the svm crashes? Or is it really a bug? Does the following patch fix it? Help is needed! Thanks a lot! 
diff --git a/block/quorum.c b/block/quorum.c index cfc1436..f2c0805 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -1279,7 +1279,7 @@ static BlockDriver bdrv_quorum = { .bdrv_dirname = quorum_dirname, .bdrv_co_block_status = quorum_co_block_status, -.bdrv_co_flush_to_disk = quorum_co_flush, +.bdrv_co_flush = quorum_co_flush, .bdrv_getlength = quorum_getlength, ** Affects: qemu Importance: Undecided Status: New ** Patch added: "primary guest kernel message" https://bugs.launchpad.net/bugs/1923583/+attachment/5487235/+files/primary_guest_dmesg.log -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1923583 Title: colo: pvm flush failed after svm killed Status in QEMU: New To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1923583/+subscriptions
Re: [Bug 1923583] [NEW] colo: pvm flush failed after svm killed
Patchew URL: https://patchew.org/QEMU/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com/ Hi, This series seems to have some coding style problems. See output below for more information: Type: series Message-id: 161830261172.29345.7866671962411605196.malone...@wampee.canonical.com Subject: [Bug 1923583] [NEW] colo: pvm flush failed after svm killed === TEST SCRIPT BEGIN === #!/bin/bash git rev-parse base > /dev/null || exit 0 git config --local diff.renamelimit 0 git config --local diff.renames True git config --local diff.algorithm histogram ./scripts/checkpatch.pl --mailback base.. === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 From https://github.com/patchew-project/qemu * [new tag] patchew/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com -> patchew/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com - [tag update] patchew/20210413081008.3409459-1-f4...@amsat.org -> patchew/20210413081008.3409459-1-f4...@amsat.org Switched to a new branch 'test' f43885d colo: pvm flush failed after svm killed === OUTPUT BEGIN === ERROR: Missing Signed-off-by: line(s) total: 1 errors, 0 warnings, 8 lines checked Commit f43885d3a7e9 (colo: pvm flush failed after svm killed) has style problems, please review. If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. === OUTPUT END === Test command exited with code: 1 The full log is available at http://patchew.org/logs/161830261172.29345.7866671962411605196.malone...@wampee.canonical.com/testing.checkpatch/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-de...@redhat.com
Re: [PATCH] vhost-user-fs: fix features handling
On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > Make virtio-fs take into account server capabilities. > > Just returning requested features assumes that all of them are implemented > by the server, and results in setting an unsupported configuration if some of them > are absent. > > Signed-off-by: Anton Kuchin > --- > hw/virtio/vhost-user-fs.c | 17 + > 1 file changed, 13 insertions(+), 4 deletions(-) > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > index ac4fc34b36..6cf983ba0e 100644 > --- a/hw/virtio/vhost-user-fs.c > +++ b/hw/virtio/vhost-user-fs.c > @@ -24,6 +24,14 @@ > #include "monitor/monitor.h" > #include "sysemu/sysemu.h" > > +static const int user_feature_bits[] = { > +VIRTIO_F_VERSION_1, > +VIRTIO_RING_F_INDIRECT_DESC, > +VIRTIO_RING_F_EVENT_IDX, > +VIRTIO_F_NOTIFY_ON_EMPTY, > +VHOST_INVALID_FEATURE_BIT > +}; Please add: VIRTIO_F_RING_PACKED VIRTIO_F_IOMMU_PLATFORM QEMU's virtiofsd does not enable either of these for now, but it's worth allowing the vhost-user device backend to participate in negotiation so that this can change in the future (or alternative virtiofsd implementations can support these features). signature.asc Description: PGP signature
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 8:41 AM Markus Armbruster wrote: > > Li Zhang writes: > > > From: Li Zhang > > > > For some scenarios, it needs to hot-add a monitor device. > > But QEMU doesn't support hotplug yet. It also works by adding > > a monitor with a null backend by default and then changing its > > backend to socket by the QMP command "chardev-change". > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > Signed-off-by: Li Zhang > > I think what you're trying to say is that chardev-change does not > work when the character device being changed is used by a QMP monitor. > Correct? > I mean that when the character device is a monitor device, it doesn't work with a QMP monitor. For example, I add 2 QMP monitors and change one of the monitors' backends from socket to a null device. It doesn't work because the monitor device needs to support chardev-change. > If yes, how exactly does it misbehave? The chardev-change command needs the specific device's change callback function. > > Does it work with an HMP monitor? No, it doesn't work. 
> > > --- > > monitor/monitor-internal.h | 3 +++ > > monitor/monitor.c | 2 +- > > monitor/qmp.c | 42 +++--- > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > index 40903d6386..2df6dd21de 100644 > > --- a/monitor/monitor-internal.h > > +++ b/monitor/monitor-internal.h > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char *list); > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > Error **errp); > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > + void *opaque); > > + > > #endif > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > index e94f532cf5..2d255bab18 100644 > > --- a/monitor/monitor.c > > +++ b/monitor/monitor.c > > @@ -157,7 +157,7 @@ static inline bool monitor_is_hmp_non_interactive(const > > Monitor *mon) > > > > static void monitor_flush_locked(Monitor *mon); > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > >void *opaque) > > { > > Monitor *mon = opaque; > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > index 2326bd7f9b..55cfb230d9 100644 > > --- a/monitor/qmp.c > > +++ b/monitor/qmp.c > > @@ -44,6 +44,7 @@ struct QMPRequest { > > Error *err; > > }; > > typedef struct QMPRequest QMPRequest; > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > g_queue_free(mon->qmp_requests); > > } > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > +static int monitor_qmp_change(void *opaque) > > +{ > > +MonitorQMP *mon = opaque; > > + > > +mon->common.use_io_thread = > > +qemu_chr_has_feature(mon->common.chr.chr, > > QEMU_CHAR_FEATURE_GCONTEXT); > > + > > +if (mon->common.use_io_thread) { > > 
+aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > +monitor_qmp_set_handlers_bh, mon); > > +} else { > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > + monitor_qmp_read, monitor_qmp_event, > > + monitor_qmp_change, &mon->common, NULL, > > true); > > +} > > + > > +if (mon->common.out_watch) { > > +g_source_remove(mon->common.out_watch); > > All other updates of @out_watch are under @mon_lock. Why not this one? Sorry, I missed it. I will correct it. > > I have no idea whether g_source_remove() is the right function to call. > Its documentation says "You must use g_source_destroy() for sources > added to a non-default main context." The qemu_chr_fe_set_handlers() > contract is of no help. > > Documentation of g_source_destroy() confuses some more: "This does not > unref the GSource: if you still hold a reference, use g_source_unref() > to drop it. > > Marc-André, can you help? > > > +qemu_mutex_lock(&mon->common.mon_lock); > > +mon->common.out_watch = > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > + monitor_unblocked, &mon->common); > > Bad indentation. Better: > > mon->common.out_watch = > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, >monitor_unblocked, &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, > monitor_unblocked, >
Re: [Virtio-fs] [PATCH] vhost-user-fs: fix features handling
On Mon, Apr 12, 2021 at 02:43:16PM -0400, Vivek Goyal wrote: > On Sun, Apr 11, 2021 at 09:21:54AM +0300, Anton Kuchin wrote: > > > > On 09/04/2021 18:56, Vivek Goyal wrote: > > > On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > > > > Make virtio-fs take into account server capabilities. > > > > > > > > Just returning requested features assumes that all of them are > > > > implemented > > > > by the server and results in setting an unsupported configuration if some of > > > > them > > > > are absent. > > > > > > > > Signed-off-by: Anton Kuchin > > > [CC stefan and qemu-devel.] > > > > > > Can you give more details of what problem exactly you are facing? Or is > > > this fix about avoiding a future problem where the device can refuse > > > to support a feature qemu is requesting? > > > > This fixes an existing problem: qemu ignores features (un)supported by the > > backend, and this works fine only if the backend features match the features of qemu. > > Otherwise qemu incorrectly assumes that the backend supports all of them and > > calls vhost_set_features() not taking into account the result of the previous > > vhost_get_features() call. This breaks the protocol and can crash the server or > > cause incorrect behavior. > > > > This is why I hope it to be accepted in time for the 6.0 release. > > > > > IIUC, this patch is preparing a list of features the vhost-user-fs device > > > can support. Then it calls vhost_get_features() which makes sure that > > > all these features are supported by the real vhost-user device (hdev->features). > > > If not, then the corresponding feature is reset and the remaining features > > > are returned to the caller. > > When this callback is executed in virtio_bus_device_plugged() the list of > > features that the vhost-backend supports has been already obtained earlier by > > vhost_user_get_features() in vuf_device_realize() and stored in > > hdev->features. 
> > > vuf_get_features() should return bitmask of features that > > are common for vhost backend (hdev->features) and frontend > > (vdev->host_features). > > But that's not what exactly this patch seems to be doing. > IIUC, It only resets some of the features from the list passed from > the caller. So whatever has been defined in user_feature_bits[], > and if these features are not supported by the vhost-user backend, then > that feature will be reset before returning to the caller. > > So the question is what are those features which should be in > user_feature_bits[]? For example, by default libvhost-user > also supports: > > /* vhost-user feature bits */ > 1ULL << VHOST_F_LOG_ALL | > 1ULL << VHOST_USER_F_PROTOCOL_FEATURES; > > Should those be in user_feature_bits[] too? So that if a custom > vhost-user-fs backend does not support VHOST_F_LOG_ALL or > VHOST_USER_F_PROTOCOL_FEATURES, it is reset. > > IIUC, your current patch is not going to reset these features if > the caller passed you those in vuf_get_features(,requested_features). > > So to me this becomes more of a question of what are those common > features which both ends of the vhost-user device should support for > it to work, and which should be checked in vuf_get_features(). VHOST_F_LOG_ALL and VHOST_USER_F_PROTOCOL_FEATURES are controlled by hw/virtio/vhost.c and hw/virtio/vhost-user.c. These feature bits are part of the vhost-user protocol and are not involved in guest-visible VIRTIO feature negotiation. It's confusing because these bits use the same namespace as VIRTIO features, but it is correct to omit them from user_feature_bits[]. Stefan signature.asc Description: PGP signature
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 08:40:59AM +0200, Markus Armbruster wrote: > Li Zhang writes: > > > From: Li Zhang > > > > For some scenarios, it needs to hot-add a monitor device. > > But QEMU doesn't support hotplug yet. It also works by adding > > a monitor with null backend by default and then change its > > backend to socket by QMP command "chardev-change". If you need ability to hot-add monitor instances, why not just implement that feature directly, instead of pre-creating monitors with null backends and then later changing the backend ? > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > Signed-off-by: Li Zhang > > I think what what you're trying to say is that chardev-change does not > work when the character device changes is used by a QMP monitor. > Correct? > > If yes, how exactly does it misbehave? > > Does it work with an HMP monitor? > > > --- > > monitor/monitor-internal.h | 3 +++ > > monitor/monitor.c | 2 +- > > monitor/qmp.c | 42 +++--- > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > index 40903d6386..2df6dd21de 100644 > > --- a/monitor/monitor-internal.h > > +++ b/monitor/monitor-internal.h > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char *list); > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > Error **errp); > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > + void *opaque); > > + > > #endif > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > index e94f532cf5..2d255bab18 100644 > > --- a/monitor/monitor.c > > +++ b/monitor/monitor.c > > @@ -157,7 +157,7 @@ static inline bool monitor_is_hmp_non_interactive(const > > Monitor *mon) > > > > static void monitor_flush_locked(Monitor *mon); > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > >void *opaque) > > { > > Monitor 
*mon = opaque; > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > index 2326bd7f9b..55cfb230d9 100644 > > --- a/monitor/qmp.c > > +++ b/monitor/qmp.c > > @@ -44,6 +44,7 @@ struct QMPRequest { > > Error *err; > > }; > > typedef struct QMPRequest QMPRequest; > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > g_queue_free(mon->qmp_requests); > > } > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > +static int monitor_qmp_change(void *opaque) > > +{ > > +MonitorQMP *mon = opaque; > > + > > +mon->common.use_io_thread = > > +qemu_chr_has_feature(mon->common.chr.chr, > > QEMU_CHAR_FEATURE_GCONTEXT); > > + > > +if (mon->common.use_io_thread) { > > +aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > +monitor_qmp_set_handlers_bh, mon); > > +} else { > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > + monitor_qmp_read, monitor_qmp_event, > > + monitor_qmp_change, &mon->common, NULL, > > true); > > +} > > + > > +if (mon->common.out_watch) { > > +g_source_remove(mon->common.out_watch); > > All other updates of @out_watch are under @mon_lock. Why not this one? > > I have no idea whether g_source_remove() is the right function to call. > Its documentation says "You must use g_source_destroy() for sources > added to a non-default main context." The qemu_chr_fe_set_handlers() > contract is of no help. > > Documentation of g_source_destroy() confuses some more: "This does not > unref the GSource: if you still hold a reference, use g_source_unref() > to drop it. > > Marc-André, can you help? > > > +qemu_mutex_lock(&mon->common.mon_lock); > > +mon->common.out_watch = > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > + monitor_unblocked, &mon->common); > > Bad indentation. 
Better: > > mon->common.out_watch = > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, >monitor_unblocked, &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, > monitor_unblocked, > &mon->common); > > or > > mon->common.out_watch = qemu_chr_fe_add_watch(&mon->common.chr, > G_IO_OUT | G_IO_HUP, >
Re: [Bug 1923497] [NEW] bios_linker_loader_add_checksum: Assertion `start_offset < file->blob->len' failed
On Mon, 12 Apr 2021 20:29:04 - Ed Davison <1923...@bugs.launchpad.net> wrote: > Public bug reported: > > Trying to boot/start a Windows 10 VM. Worked until recently when this > error started showing up. > > I have the following installed on Fedora 33: > qemu-kvm-5.1.0-9.fc33.x86_64 Could you add the QEMU command line used in your case? > > This is the error: > > Error starting domain: internal error: process exited while connecting > to monitor: qemu-system-x86_64: /builddir/build/BUILD/qemu-5.1.0/hw/acpi > /bios-linker-loader.c:239: bios_linker_loader_add_checksum: Assertion > `start_offset < file->blob->len' failed. > > Traceback (most recent call last): > File "/usr/share/virt-manager/virtManager/asyncjob.py", line 65, in > cb_wrapper > callback(asyncjob, *args, **kwargs) > File "/usr/share/virt-manager/virtManager/asyncjob.py", line 101, in tmpcb > callback(*args, **kwargs) > File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line > 57, in newfn > ret = fn(self, *args, **kwargs) > File "/usr/share/virt-manager/virtManager/object/domain.py", line 1329, in > startup > self._backend.create() > File "/usr/lib64/python3.9/site-packages/libvirt.py", line 1234, in create > if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self) > libvirt.libvirtError: internal error: process exited while connecting to > monitor: qemu-system-x86_64: > /builddir/build/BUILD/qemu-5.1.0/hw/acpi/bios-linker-loader.c:239: > bios_linker_loader_add_checksum: Assertion `start_offset < file->blob->len' > failed. > > I see this was referenced in a patch from some time ago and supposedly > fixed. Here is the patch info I was able to find: > > http://next.patchew.org/QEMU/1515677902-23436-1-git-send-email- peter.mayd...@linaro.org/1515677902-23436-10-git-send-email- peter.mayd...@linaro.org/ > > ** Affects: qemu > Importance: Undecided > Status: New >
Re: [PATCH RFC RESEND v2 4/6] hw/arm/virt-acpi-build: Add explicit idmap info in IORT table
Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: > From: Xingang Wang > > The idmap of smmuv3 and root complex covers the whole RID space for now, > this patch adds explicit idmap info according to the root bus number range. > This adds an smmuv3 idmap for each bus that has the iommu property enabled. > > Signed-off-by: Xingang Wang > Signed-off-by: Jiahui Cen > --- > hw/arm/virt-acpi-build.c | 103 ++- > 1 file changed, 81 insertions(+), 22 deletions(-) > > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c > index f5a2b2d4cb..5491036c86 100644 > --- a/hw/arm/virt-acpi-build.c > +++ b/hw/arm/virt-acpi-build.c > @@ -44,6 +44,7 @@ > #include "hw/acpi/tpm.h" > #include "hw/pci/pcie_host.h" > #include "hw/pci/pci.h" > +#include "hw/pci/pci_bus.h" > #include "hw/pci-host/gpex.h" > #include "hw/arm/virt.h" > #include "hw/mem/nvdimm.h" > @@ -237,6 +238,41 @@ static void acpi_dsdt_add_tpm(Aml *scope, > VirtMachineState *vms) > aml_append(scope, dev); > } > > +typedef > +struct AcpiIortMapping { > +AcpiIortIdMapping idmap; > +bool iommu; > +} AcpiIortMapping; > + > +/* For all PCI host bridges, walk and insert DMAR scope */ this comment should rather be in the caller; also, DMAR is not an ARM term. 
I would add the comment for this function: build the ID mapping for a given PCI host bridge > +static int > +iort_host_bridges(Object *obj, void *opaque) > +{ > +GArray *map_blob = opaque; > +AcpiIortMapping map; > +AcpiIortIdMapping *idmap = &map.idmap; > +int bus_num, max_bus; > + > +if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { > +PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus; > + > +if (bus) { > +bus_num = pci_bus_num(bus); > +max_bus = pci_root_bus_max_bus(bus); > + > +idmap->input_base = cpu_to_le32(bus_num << 8); > +idmap->id_count = cpu_to_le32((max_bus - bus_num + 1) << 8); > +idmap->output_base = cpu_to_le32(bus_num << 8); > +idmap->flags = cpu_to_le32(0); > + > +map.iommu = pci_root_bus_has_iommu(bus); if iommu is not set, we don't need to populate the idmap and we may even directly continue, i.e. not add the element to map_blob, no? > +g_array_append_val(map_blob, map); > +} > +} > + > +return 0; > +} > + > static void > build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) > { > @@ -247,6 +283,21 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > AcpiIortSmmu3 *smmu; > size_t node_size, iort_node_offset, iort_length, smmu_offset = 0; > AcpiIortRC *rc; > +int smmu_mapping_count; > +GArray *map_blob = g_array_new(false, true, sizeof(AcpiIortMapping)); > +AcpiIortMapping *map; > + > +/* pci_for_each_bus(vms->bus, insert_map, map_blob); */ comment to be removed > +object_child_foreach_recursive(object_get_root(), > + iort_host_bridges, map_blob); > + > +smmu_mapping_count = 0; > +for (int i = 0; i < map_blob->len; i++) { > +map = &g_array_index(map_blob, AcpiIortMapping, i); > +if (map->iommu) { > +smmu_mapping_count++; > +} > +} > > iort = acpi_data_push(table_data, sizeof(*iort)); > > @@ -280,13 +331,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > > /* SMMUv3 node */ > smmu_offset = iort_node_offset + node_size; > -node_size = sizeof(*smmu) + sizeof(*idmap); 
+node_size = sizeof(*smmu) + sizeof(*idmap) * smmu_mapping_count; > iort_length += node_size; > smmu = acpi_data_push(table_data, node_size); > > smmu->type = ACPI_IORT_NODE_SMMU_V3; > smmu->length = cpu_to_le16(node_size); > -smmu->mapping_count = cpu_to_le32(1); > +smmu->mapping_count = cpu_to_le32(smmu_mapping_count); > smmu->mapping_offset = cpu_to_le32(sizeof(*smmu)); > smmu->base_address = cpu_to_le64(vms->memmap[VIRT_SMMU].base); > smmu->flags = cpu_to_le32(ACPI_IORT_SMMU_V3_COHACC_OVERRIDE); > @@ -295,23 +346,28 @@ build_iort(GArray *table_data, BIOSLinker *linker, > VirtMachineState *vms) > smmu->gerr_gsiv = cpu_to_le32(irq + 2); > smmu->sync_gsiv = cpu_to_le32(irq + 3); > > -/* Identity RID mapping covering the whole input RID range */ > -idmap = &smmu->id_mapping_array[0]; > -idmap->input_base = 0; > -idmap->id_count = cpu_to_le32(0x); > -idmap->output_base = 0; > -/* output IORT node is the ITS group node (the first node) */ > -idmap->output_reference = cpu_to_le32(iort_node_offset); > +for (int i = 0, j = 0; i < map_blob->len; i++) { > +map = &g_array_index(map_blob, AcpiIortMapping, i); > + > +if (!map->iommu) { > +continue; > +} > + > +
[PATCH v5 00/14] RAM_NORESERVE, MAP_NORESERVE and hostmem "reserve" property
Based-on: 20210406080126.24010-1-da...@redhat.com Some cleanups previously sent in other context (resizeable allocations), followed by RAM_NORESERVE, implementing it under Linux using MAP_NORESERVE, and letting users configure it for memory backends using the "reserve" property (default: true). MAP_NORESERVE under Linux has in the context of QEMU an effect on 1) Private/shared anonymous memory -> memory-backend-ram,id=mem0,size=10G 2) Private fd-based mappings -> memory-backend-file,id=mem0,size=10G,mem-path=/dev/shm/0 -> memory-backend-memfd,id=mem0,size=10G 3) Private/shared hugetlb mappings -> memory-backend-memfd,id=mem0,size=10G,hugetlb=on,hugetlbsize=2M With MAP_NORESERVE/"reserve=off", we won't be reserving swap space (1/2) or huge pages (3) for the whole memory region. The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. MAP_NORESERVE tells the OS "this mapping might be very sparse". This essentially allows avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using virtio-mem and also supporting hugetlbfs in the future. v4 -> v5: - Sent out shared anonymous RAM fixes separately - Rebased - "hostmem: Wire up RAM_NORESERVE via "reserve" property" -- Adjusted/simplified description of new "reserve" property -- Properly add it to qapi/qom.json - "qmp: Clarify memory backend properties returned via query-memdev" -- Added - "qmp: Include "share" property of memory backends" -- Added - "hmp: Print "share" property of memory backends with "info memdev"" - Added - "qmp: Include "reserve" property of memory backends" -- Adjust description of new "reserve" property v3 -> v4: - Minor comment/description updates - "softmmu/physmem: Fix ram_block_discard_range() to handle shared ..." -- Extended description - "util/mmap-alloc: Pass flags instead of separate bools to ..." 
-- Move flags to include/qemu/osdep.h and rename to "QEMU_MAP_*" - "memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()" -- Adjust to new flags. Handle errors in mmap_activate() for now. - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux" -- Restrict support to Linux only for now - "qmp: Include "reserve" property of memory backends" -- Added - "hmp: Print "reserve" property of memory backends with ..." -- Added v2 -> v3: - Renamed "softmmu/physmem: Drop "shared" parameter from ram_block_add()" to "softmmu/physmem: Mark shared anonymous memory RAM_SHARED" and adjusted the description - Added "softmmu/physmem: Fix ram_block_discard_range() to handle shared anonymous memory" - Added "softmmu/physmem: Fix qemu_ram_remap() to handle shared anonymous memory" - Added "util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()" - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE" -- Further tweak code comments -- Handle shared anonymous memory v1 -> v2: - Rebased to upstream and phs_mem_alloc simplifications -- Upstream added the "map_offset" parameter to many RAM allocation interfaces. - "softmmu/physmem: Drop "shared" parameter from ram_block_add()" -- Use local variable "shared" - "memory: introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()" -- Simplify due to phs_mem_alloc changes - "util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE" -- Add a whole bunch of comments. -- Exclude shared anonymous memory that QEMU doesn't use -- Special-case readonly mappings Cc: Peter Xu Cc: "Michael S. Tsirkin" Cc: Eduardo Habkost Cc: "Dr. 
David Alan Gilbert"
Cc: Richard Henderson
Cc: Paolo Bonzini
Cc: Igor Mammedov
Cc: "Philippe Mathieu-Daudé"
Cc: Stefan Hajnoczi
Cc: Murilo Opsfelder Araujo
Cc: Greg Kurz
Cc: Liam Merwick
Cc: Marcel Apfelbaum

David Hildenbrand (14):
  util/mmap-alloc: Factor out calculation of the pagesize for the guard page
  util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
  util/mmap-alloc: Factor out activating of memory to mmap_activate()
  softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
  softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()
  util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
  memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
  util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
  hostmem: Wire up RAM_NORESERVE via "reserve" property
  qmp: Clarify memory backend properties returned via query-memdev
  qmp: Include "share" property of memory backends
  hmp: Print "share" property of memory backends with "info memdev"
  qmp: Include "reserve" property of memory backends
  hmp: Print "reserve" property of memory backends with "info memdev"

 backends/hostmem-file.c | 11 +-
 backends/hostmem-memfd.c | 8 +-
 backends/hostmem-ram.c | 7 +-
 backends/hostmem.c | 32 +++
 hw/core/machine-hmp-cmds.c | 4 +
 hw/core/
[PATCH v5 01/14] util/mmap-alloc: Factor out calculation of the pagesize for the guard page
Let's factor out calculating the size of the guard page and rename the variable to make it clearer that this pagesize only applies to the guard page. Reviewed-by: Peter Xu Acked-by: Murilo Opsfelder Araujo Cc: Igor Kotrasinski Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index e6fa8b598b..24854064b4 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -82,6 +82,16 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +static inline size_t mmap_guard_pagesize(int fd) +{ +#if defined(__powerpc64__) && defined(__linux__) +/* Mappings in the same segment must share the same page size */ +return qemu_fd_getpagesize(fd); +#else +return qemu_real_host_page_size; +#endif +} + void *qemu_ram_mmap(int fd, size_t size, size_t align, @@ -90,12 +100,12 @@ void *qemu_ram_mmap(int fd, bool is_pmem, off_t map_offset) { +const size_t guard_pagesize = mmap_guard_pagesize(fd); int prot; int flags; int map_sync_flags = 0; int guardfd; size_t offset; -size_t pagesize; size_t total; void *guardptr; void *ptr; @@ -116,8 +126,7 @@ void *qemu_ram_mmap(int fd, * anonymous memory is OK. */ flags = MAP_PRIVATE; -pagesize = qemu_fd_getpagesize(fd); -if (fd == -1 || pagesize == qemu_real_host_page_size) { +if (fd == -1 || guard_pagesize == qemu_real_host_page_size) { guardfd = -1; flags |= MAP_ANONYMOUS; } else { @@ -126,7 +135,6 @@ void *qemu_ram_mmap(int fd, } #else guardfd = -1; -pagesize = qemu_real_host_page_size; flags = MAP_PRIVATE | MAP_ANONYMOUS; #endif @@ -138,7 +146,7 @@ void *qemu_ram_mmap(int fd, assert(is_power_of_2(align)); /* Always align to host page size */ -assert(align >= pagesize); +assert(align >= guard_pagesize); flags = MAP_FIXED; flags |= fd == -1 ? MAP_ANONYMOUS : 0; @@ -193,8 +201,8 @@ void *qemu_ram_mmap(int fd, * a guard page guarding against potential buffer overflows. 
*/ total -= offset; -if (total > size + pagesize) { -munmap(ptr + size + pagesize, total - size - pagesize); +if (total > size + guard_pagesize) { +munmap(ptr + size + guard_pagesize, total - size - guard_pagesize); } return ptr; @@ -202,15 +210,8 @@ void *qemu_ram_mmap(int fd, void qemu_ram_munmap(int fd, void *ptr, size_t size) { -size_t pagesize; - if (ptr) { /* Unmap both the RAM block and the guard page */ -#if defined(__powerpc64__) && defined(__linux__) -pagesize = qemu_fd_getpagesize(fd); -#else -pagesize = qemu_real_host_page_size; -#endif -munmap(ptr, size + pagesize); +munmap(ptr, size + mmap_guard_pagesize(fd)); } } -- 2.30.2
[PATCH v5 02/14] util/mmap-alloc: Factor out reserving of a memory region to mmap_reserve()
We want to reserve a memory region without actually populating memory. Let's factor that out. Reviewed-by: Igor Kotrasinski Acked-by: Murilo Opsfelder Araujo Reviewed-by: Richard Henderson Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 58 +++ 1 file changed, 33 insertions(+), 25 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 24854064b4..223d66219c 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -82,6 +82,38 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +/* + * Reserve a new memory region of the requested size to be used for mapping + * from the given fd (if any). + */ +static void *mmap_reserve(size_t size, int fd) +{ +int flags = MAP_PRIVATE; + +#if defined(__powerpc64__) && defined(__linux__) +/* + * On ppc64 mappings in the same segment (aka slice) must share the same + * page size. Since we will be re-allocating part of this segment + * from the supplied fd, we should make sure to use the same page size, to + * this end we mmap the supplied fd. In this case, set MAP_NORESERVE to + * avoid allocating backing store memory. + * We do this unless we are using the system page size, in which case + * anonymous memory is OK. + */ +if (fd == -1 || qemu_fd_getpagesize(fd) == qemu_real_host_page_size) { +fd = -1; +flags |= MAP_ANONYMOUS; +} else { +flags |= MAP_NORESERVE; +} +#else +fd = -1; +flags |= MAP_ANONYMOUS; +#endif + +return mmap(0, size, PROT_NONE, flags, fd, 0); +} + static inline size_t mmap_guard_pagesize(int fd) { #if defined(__powerpc64__) && defined(__linux__) @@ -104,7 +136,6 @@ void *qemu_ram_mmap(int fd, int prot; int flags; int map_sync_flags = 0; -int guardfd; size_t offset; size_t total; void *guardptr; @@ -116,30 +147,7 @@ void *qemu_ram_mmap(int fd, */ total = size + align; -#if defined(__powerpc64__) && defined(__linux__) -/* On ppc64 mappings in the same segment (aka slice) must share the same - * page size. 
Since we will be re-allocating part of this segment - * from the supplied fd, we should make sure to use the same page size, to - * this end we mmap the supplied fd. In this case, set MAP_NORESERVE to - * avoid allocating backing store memory. - * We do this unless we are using the system page size, in which case - * anonymous memory is OK. - */ -flags = MAP_PRIVATE; -if (fd == -1 || guard_pagesize == qemu_real_host_page_size) { -guardfd = -1; -flags |= MAP_ANONYMOUS; -} else { -guardfd = fd; -flags |= MAP_NORESERVE; -} -#else -guardfd = -1; -flags = MAP_PRIVATE | MAP_ANONYMOUS; -#endif - -guardptr = mmap(0, total, PROT_NONE, flags, guardfd, 0); - +guardptr = mmap_reserve(total, fd); if (guardptr == MAP_FAILED) { return MAP_FAILED; } -- 2.30.2
[PATCH v5 06/14] util/mmap-alloc: Pass flags instead of separate bools to qemu_ram_mmap()
Let's pass flags instead of bools to prepare for passing other flags and update the documentation of qemu_ram_mmap(). Introduce new QEMU_MAP_ flags that abstract the mmap() PROT_ and MAP_ flag handling and simplify it. We expose only flags that are currently supported by qemu_ram_mmap(). Maybe, we'll see qemu_mmap() in the future as well that can implement these flags. Note: We don't use MAP_ flags as some flags (e.g., MAP_SYNC) are only defined for some systems and we want to always be able to identify these flags reliably inside qemu_ram_mmap() -- for example, to properly warn when some future flags are not available or effective on a system. Also, this way we can simplify PROT_ handling as well. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/qemu/mmap-alloc.h | 16 +--- include/qemu/osdep.h | 18 ++ softmmu/physmem.c | 8 +--- util/mmap-alloc.c | 15 --- util/oslib-posix.c| 3 ++- 5 files changed, 42 insertions(+), 18 deletions(-) diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h index 456ff87df1..90d0eee705 100644 --- a/include/qemu/mmap-alloc.h +++ b/include/qemu/mmap-alloc.h @@ -7,18 +7,22 @@ size_t qemu_fd_getpagesize(int fd); size_t qemu_mempath_getpagesize(const char *mem_path); /** - * qemu_ram_mmap: mmap the specified file or device. + * qemu_ram_mmap: mmap anonymous memory, the specified file or device. + * + * mmap() abstraction to map guest RAM, simplifying flag handling, taking + * care of alignment requirements and installing guard pages. * * Parameters: * @fd: the file or the device to mmap * @size: the number of bytes to be mmaped * @align: if not zero, specify the alignment of the starting mapping address; * otherwise, the alignment in use will be determined by QEMU. - * @readonly: true for a read-only mapping, false for read/write. - * @shared: map has RAM_SHARED flag. - * @is_pmem: map has RAM_PMEM flag. 
+ * @qemu_map_flags: QEMU_MAP_* flags * @map_offset: map starts at offset of map_offset from the start of fd * + * Internally, MAP_PRIVATE, MAP_ANONYMOUS and MAP_SHARED_VALIDATE are set + * implicitly based on other parameters. + * * Return: * On success, return a pointer to the mapped area. * On failure, return MAP_FAILED. @@ -26,9 +30,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path); void *qemu_ram_mmap(int fd, size_t size, size_t align, -bool readonly, -bool shared, -bool is_pmem, +uint32_t qemu_map_flags, off_t map_offset); void qemu_ram_munmap(int fd, void *ptr, size_t size); diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index ba15be9c56..5ba89f5460 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -347,6 +347,24 @@ void *qemu_anon_ram_alloc(size_t size, uint64_t *align, bool shared); void qemu_vfree(void *ptr); void qemu_anon_ram_free(void *ptr, size_t size); +/* + * Abstraction of PROT_ and MAP_ flags as passed to mmap(), for example, + * consumed by qemu_ram_mmap(). + */ + +/* Map PROT_READ instead of PROT_READ | PROT_WRITE. */ +#define QEMU_MAP_READONLY (1 << 0) + +/* Use MAP_SHARED instead of MAP_PRIVATE. */ +#define QEMU_MAP_SHARED (1 << 1) + +/* + * Use MAP_SYNC | MAP_SHARED_VALIDATE if supported. Ignored without + * QEMU_MAP_SHARED. If mapping fails, warn and fallback to !QEMU_MAP_SYNC. 
+ */ +#define QEMU_MAP_SYNC (1 << 2) + + #define QEMU_MADV_INVALID -1 #if defined(CONFIG_MADVISE) diff --git a/softmmu/physmem.c b/softmmu/physmem.c index fdcd38ba61..d1a1a6b72e 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -1533,6 +1533,7 @@ static void *file_ram_alloc(RAMBlock *block, off_t offset, Error **errp) { +uint32_t qemu_map_flags; void *area; block->page_size = qemu_fd_getpagesize(fd); @@ -1580,9 +1581,10 @@ static void *file_ram_alloc(RAMBlock *block, perror("ftruncate"); } -area = qemu_ram_mmap(fd, memory, block->mr->align, readonly, - block->flags & RAM_SHARED, block->flags & RAM_PMEM, - offset); +qemu_map_flags = readonly ? QEMU_MAP_READONLY : 0; +qemu_map_flags |= (block->flags & RAM_SHARED) ? QEMU_MAP_SHARED : 0; +qemu_map_flags |= (block->flags & RAM_PMEM) ? QEMU_MAP_SYNC : 0; +area = qemu_ram_mmap(fd, memory, block->mr->align, qemu_map_flags, offset); if (area == MAP_FAILED) { error_setg_errno(errp, errno, "unable to map backing store for guest RAM"); diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 0e2bd7bc0e..1ddc0e2a1e 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -118,9 +118,12 @@ static void *mmap_reserve(size_t size, int fd) * Activate memory in a reserved region from the given
[PATCH v5 03/14] util/mmap-alloc: Factor out activating of memory to mmap_activate()
We want to activate memory within a reserved memory region, to make it accessible. Let's factor that out. Reviewed-by: Richard Henderson Acked-by: Murilo Opsfelder Araujo Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- util/mmap-alloc.c | 94 +-- 1 file changed, 50 insertions(+), 44 deletions(-) diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index 223d66219c..0e2bd7bc0e 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -114,6 +114,52 @@ static void *mmap_reserve(size_t size, int fd) return mmap(0, size, PROT_NONE, flags, fd, 0); } +/* + * Activate memory in a reserved region from the given fd (if any), to make + * it accessible. + */ +static void *mmap_activate(void *ptr, size_t size, int fd, bool readonly, + bool shared, bool is_pmem, off_t map_offset) +{ +const int prot = PROT_READ | (readonly ? 0 : PROT_WRITE); +int map_sync_flags = 0; +int flags = MAP_FIXED; +void *activated_ptr; + +flags |= fd == -1 ? MAP_ANONYMOUS : 0; +flags |= shared ? MAP_SHARED : MAP_PRIVATE; +if (shared && is_pmem) { +map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE; +} + +activated_ptr = mmap(ptr, size, prot, flags | map_sync_flags, fd, + map_offset); +if (activated_ptr == MAP_FAILED && map_sync_flags) { +if (errno == ENOTSUP) { +char *proc_link = g_strdup_printf("/proc/self/fd/%d", fd); +char *file_name = g_malloc0(PATH_MAX); +int len = readlink(proc_link, file_name, PATH_MAX - 1); + +if (len < 0) { +len = 0; +} +file_name[len] = '\0'; +fprintf(stderr, "Warning: requesting persistence across crashes " +"for backend file %s failed. Proceeding without " +"persistence, data might become corrupted in case of host " +"crash.\n", file_name); +g_free(proc_link); +g_free(file_name); +} +/* + * If mmap failed with MAP_SHARED_VALIDATE | MAP_SYNC, we will try + * again without these flags to handle backwards compatibility. 
+ */ +activated_ptr = mmap(ptr, size, prot, flags, fd, map_offset); +} +return activated_ptr; +} + static inline size_t mmap_guard_pagesize(int fd) { #if defined(__powerpc64__) && defined(__linux__) @@ -133,13 +179,8 @@ void *qemu_ram_mmap(int fd, off_t map_offset) { const size_t guard_pagesize = mmap_guard_pagesize(fd); -int prot; -int flags; -int map_sync_flags = 0; -size_t offset; -size_t total; -void *guardptr; -void *ptr; +size_t offset, total; +void *ptr, *guardptr; /* * Note: this always allocates at least one extra page of virtual address @@ -156,45 +197,10 @@ void *qemu_ram_mmap(int fd, /* Always align to host page size */ assert(align >= guard_pagesize); -flags = MAP_FIXED; -flags |= fd == -1 ? MAP_ANONYMOUS : 0; -flags |= shared ? MAP_SHARED : MAP_PRIVATE; -if (shared && is_pmem) { -map_sync_flags = MAP_SYNC | MAP_SHARED_VALIDATE; -} - offset = QEMU_ALIGN_UP((uintptr_t)guardptr, align) - (uintptr_t)guardptr; -prot = PROT_READ | (readonly ? 0 : PROT_WRITE); - -ptr = mmap(guardptr + offset, size, prot, - flags | map_sync_flags, fd, map_offset); - -if (ptr == MAP_FAILED && map_sync_flags) { -if (errno == ENOTSUP) { -char *proc_link, *file_name; -int len; -proc_link = g_strdup_printf("/proc/self/fd/%d", fd); -file_name = g_malloc0(PATH_MAX); -len = readlink(proc_link, file_name, PATH_MAX - 1); -if (len < 0) { -len = 0; -} -file_name[len] = '\0'; -fprintf(stderr, "Warning: requesting persistence across crashes " -"for backend file %s failed. Proceeding without " -"persistence, data might become corrupted in case of host " -"crash.\n", file_name); -g_free(proc_link); -g_free(file_name); -} -/* - * if map failed with MAP_SHARED_VALIDATE | MAP_SYNC, - * we will remove these flags to handle compatibility. - */ -ptr = mmap(guardptr + offset, size, prot, flags, fd, map_offset); -} - +ptr = mmap_activate(guardptr + offset, size, fd, readonly, shared, is_pmem, +map_offset); if (ptr == MAP_FAILED) { munmap(guardptr, total); return MAP_FAILED; -- 2.30.2
[PATCH v5 05/14] softmmu/memory: Pass ram_flags to memory_region_init_ram_shared_nomigrate()
Let's forward ram_flags instead, renaming memory_region_init_ram_shared_nomigrate() into memory_region_init_ram_flags_nomigrate(). Forward flags to qemu_ram_alloc() and qemu_ram_alloc_internal(). Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- backends/hostmem-ram.c| 6 +++-- hw/m68k/next-cube.c | 4 ++-- include/exec/memory.h | 24 +-- include/exec/ram_addr.h | 2 +- .../memory-region-housekeeping.cocci | 8 +++ softmmu/memory.c | 20 softmmu/physmem.c | 24 --- 7 files changed, 43 insertions(+), 45 deletions(-) diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c index 5cc53e76c9..741e701062 100644 --- a/backends/hostmem-ram.c +++ b/backends/hostmem-ram.c @@ -19,6 +19,7 @@ static void ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { +uint32_t ram_flags; char *name; if (!backend->size) { @@ -27,8 +28,9 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_shared_nomigrate(&backend->mr, OBJECT(backend), name, - backend->size, backend->share, errp); +ram_flags = backend->share ? 
RAM_SHARED : 0; +memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name, + backend->size, ram_flags, errp); g_free(name); } diff --git a/hw/m68k/next-cube.c b/hw/m68k/next-cube.c index 92b45d760f..59ccae0d5e 100644 --- a/hw/m68k/next-cube.c +++ b/hw/m68k/next-cube.c @@ -986,8 +986,8 @@ static void next_cube_init(MachineState *machine) sysbus_mmio_map(SYS_BUS_DEVICE(pcdev), 1, 0x0210); /* BMAP memory */ -memory_region_init_ram_shared_nomigrate(bmapm1, NULL, "next.bmapmem", 64, -true, &error_fatal); +memory_region_init_ram_flags_nomigrate(bmapm1, NULL, "next.bmapmem", 64, + RAM_SHARED, &error_fatal); memory_region_add_subregion(sysmem, 0x020c, bmapm1); /* The Rev_2.5_v66.bin firmware accesses it at 0x820c0020, too */ memory_region_init_alias(bmapm2, NULL, "next.bmapmem2", bmapm1, 0x0, 64); diff --git a/include/exec/memory.h b/include/exec/memory.h index 8ad280e532..10179c6695 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -928,27 +928,27 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr, Error **errp); /** - * memory_region_init_ram_shared_nomigrate: Initialize RAM memory region. - * Accesses into the region will - * modify memory directly. + * memory_region_init_ram_flags_nomigrate: Initialize RAM memory region. + * Accesses into the region will + * modify memory directly. * * @mr: the #MemoryRegion to be initialized. * @owner: the object that tracks the region's reference count * @name: Region name, becomes part of RAMBlock name used in migration stream *must be unique within any device * @size: size of the region. - * @share: allow remapping RAM to different addresses + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED. * @errp: pointer to Error*, to store an error if it happens. * - * Note that this function is similar to memory_region_init_ram_nomigrate. - * The only difference is part of the RAM region can be remapped. 
+ * Note that this function does not do anything to cause the data in the + * RAM memory region to be migrated; that is the responsibility of the caller. */ -void memory_region_init_ram_shared_nomigrate(MemoryRegion *mr, - Object *owner, - const char *name, - uint64_t size, - bool share, - Error **errp); +void memory_region_init_ram_flags_nomigrate(MemoryRegion *mr, +Object *owner, +const char *name, +uint64_t size, +uint32_t ram_flags, +Error **errp); /** * memory_region_init_resizeable_ram: Initialize memory region with resizeable diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index a7e3378340..6d4513f8e2 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -122,7 +122,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr, RAMBlock *qemu_ram_allo
[PATCH v5 04/14] softmmu/memory: Pass ram_flags to qemu_ram_alloc_from_fd()
Let's pass in ram flags just like we do with qemu_ram_alloc_from_file(), to clean up and prepare for more flags. Simplify the documentation of passed ram flags: Looking at our documentation of RAM_SHARED and RAM_PMEM is sufficient, no need to be repetitive. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- backends/hostmem-memfd.c | 7 --- hw/misc/ivshmem.c| 5 ++--- include/exec/memory.h| 9 +++-- include/exec/ram_addr.h | 6 +- softmmu/memory.c | 7 +++ 5 files changed, 13 insertions(+), 21 deletions(-) diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c index 69b0ae30bb..93b5d1a4cf 100644 --- a/backends/hostmem-memfd.c +++ b/backends/hostmem-memfd.c @@ -36,6 +36,7 @@ static void memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) { HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend); +uint32_t ram_flags; char *name; int fd; @@ -53,9 +54,9 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), - name, backend->size, - backend->share, fd, 0, errp); +ram_flags = backend->share ? 
RAM_SHARED : 0; +memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name, + backend->size, ram_flags, fd, 0, errp); g_free(name); } diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c index a1fa4878be..1ba4a98377 100644 --- a/hw/misc/ivshmem.c +++ b/hw/misc/ivshmem.c @@ -493,9 +493,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp) size = buf.st_size; /* mmap the region and map into the BAR2 */ -memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), - "ivshmem.bar2", size, true, fd, 0, - &local_err); +memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s), "ivshmem.bar2", + size, RAM_SHARED, fd, 0, &local_err); if (local_err) { error_propagate(errp, local_err); return; diff --git a/include/exec/memory.h b/include/exec/memory.h index 5728a681b2..8ad280e532 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -991,10 +991,7 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr, * @size: size of the region. * @align: alignment of the region base address; if 0, the default alignment * (getpagesize()) will be used. - * @ram_flags: Memory region features: - * - RAM_SHARED: memory must be mmaped with the MAP_SHARED flag - * - RAM_PMEM: the memory is persistent memory - * Other bits are ignored now. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @path: the path in which to allocate the RAM. * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens. @@ -1020,7 +1017,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, * @owner: the object that tracks the region's reference count * @name: the name of the region. * @size: size of the region. - * @share: %true if memory must be mmaped with the MAP_SHARED flag + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @fd: the fd to mmap. * @offset: offset within the file referenced by fd * @errp: pointer to Error*, to store an error if it happens. 
@@ -1032,7 +1029,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr, Object *owner, const char *name, uint64_t size, -bool share, +uint32_t ram_flags, int fd, ram_addr_t offset, Error **errp); diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index 3cb9791df3..a7e3378340 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -104,11 +104,7 @@ long qemu_maxrampagesize(void); * Parameters: * @size: the size in bytes of the ram block * @mr: the memory region where the ram block is - * @ram_flags: specify the properties of the ram block, which can be one - * or bit-or of following values - * - RAM_SHARED: mmap the backing file or device with MAP_SHARED - * - RAM_PMEM: the backend @mem_path or @fd is persistent memory - * Other bits are ignored. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. * @mem_path or @fd: specify the backing file or device * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens diff --git a/softmmu/
[PATCH v5 09/14] hostmem: Wire up RAM_NORESERVE via "reserve" property
Let's provide a way to control the use of RAM_NORESERVE via memory backends using the "reserve" property, which defaults to true (old behavior). Only Linux currently supports clearing the flag (and support is checked at runtime, depending on the setting of "/proc/sys/vm/overcommit_memory"). Windows and POSIX systems other than Linux will bail out with "reserve=false". The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. This essentially allows avoiding having to set "/proc/sys/vm/overcommit_memory == 1" when using virtio-mem, and will also allow supporting hugetlbfs in the future. Reviewed-by: Peter Xu Reviewed-by: Eduardo Habkost Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- backends/hostmem-file.c | 11 ++- backends/hostmem-memfd.c | 1 + backends/hostmem-ram.c | 1 + backends/hostmem.c | 32 include/sysemu/hostmem.h | 2 +- qapi/qom.json | 4 6 files changed, 45 insertions(+), 6 deletions(-) diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c index b683da9daf..9d550e53d4 100644 --- a/backends/hostmem-file.c +++ b/backends/hostmem-file.c @@ -40,6 +40,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) object_get_typename(OBJECT(backend))); #else HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend); +uint32_t ram_flags; gchar *name; if (!backend->size) { @@ -52,11 +53,11 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) } name = host_memory_backend_get_name(backend); -memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), - name, - backend->size, fb->align, - (backend->share ? RAM_SHARED : 0) | - (fb->is_pmem ? RAM_PMEM : 0), +ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; +ram_flags |= fb->is_pmem ?
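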
RAM_PMEM : 0; +memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name, + backend->size, fb->align, ram_flags, fb->mem_path, fb->readonly, errp); g_free(name); #endif diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c index 93b5d1a4cf..f3436b623d 100644 --- a/backends/hostmem-memfd.c +++ b/backends/hostmem-memfd.c @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) name = host_memory_backend_get_name(backend); ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 0 : RAM_NORESERVE; memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name, backend->size, ram_flags, fd, 0, errp); g_free(name); diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c index 741e701062..b8e55cdbd0 100644 --- a/backends/hostmem-ram.c +++ b/backends/hostmem-ram.c @@ -29,6 +29,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp) name = host_memory_backend_get_name(backend); ram_flags = backend->share ? RAM_SHARED : 0; +ram_flags |= backend->reserve ? 
0 : RAM_NORESERVE; memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name, backend->size, ram_flags, errp); g_free(name); diff --git a/backends/hostmem.c b/backends/hostmem.c index c6c1ff5b99..58fdc1b658 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -217,6 +217,11 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, Error *local_err = NULL; HostMemoryBackend *backend = MEMORY_BACKEND(obj); +if (!backend->reserve && value) { +error_setg(errp, "'prealloc=on' and 'reserve=off' are incompatible"); +return; +} + if (!host_memory_backend_mr_inited(backend)) { backend->prealloc = value; return; @@ -268,6 +273,7 @@ static void host_memory_backend_init(Object *obj) /* TODO: convert access to globals to compat properties */ backend->merge = machine_mem_merge(machine); backend->dump = machine_dump_guest_core(machine); +backend->reserve = true; backend->prealloc_threads = 1; } @@ -426,6 +432,28 @@ static void host_memory_backend_set_share(Object *o, bool value, Error **errp) backend->share = value; } +static bool host_memory_backend_get_reserve(Object *o, Error **errp) +{ +HostMemoryBackend *backend = MEMORY_BACKEND(o); + +return backend->reserve; +} + +static void host_memory_backend_set_reserve(Object *o, bool value, Error **errp) +{ +HostMemoryBackend *backend = MEMORY_BACKEND(o); + +if (host_memory_backend_mr_inited(backend)) { +error_setg(errp, "cannot change property value"); +return; +} +if (backen
[PATCH v5 13/14] qmp: Include "reserve" property of memory backends
Let's include the new property. Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-qmp-cmds.c | 1 + qapi/machine.json | 4 2 files changed, 5 insertions(+) diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c index d41db5b93b..2d135ecdd0 100644 --- a/hw/core/machine-qmp-cmds.c +++ b/hw/core/machine-qmp-cmds.c @@ -175,6 +175,7 @@ static int query_memdev(Object *obj, void *opaque) m->dump = object_property_get_bool(obj, "dump", &error_abort); m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort); m->share = object_property_get_bool(obj, "share", &error_abort); +m->reserve = object_property_get_bool(obj, "reserve", &error_abort); m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy", &error_abort); host_nodes = object_property_get_qobject(obj, diff --git a/qapi/machine.json b/qapi/machine.json index 32650bfe9e..5932139d20 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -798,6 +798,9 @@ # # @share: whether memory is private to QEMU or shared (since 6.1) # +# @reserve: whether swap space (or huge pages) was reserved if applicable +# (since 6.1) +# # @host-nodes: host nodes for its memory policy # # @policy: memory policy of memory backend @@ -812,6 +815,7 @@ 'dump': 'bool', 'prealloc': 'bool', 'share': 'bool', +'reserve':'bool', 'host-nodes': ['uint16'], 'policy': 'HostMemPolicy' }} -- 2.30.2
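As a usage sketch (hypothetical values; field names follow this patch and the earlier "share" patch), a QMP client querying memory backends would then see the new property alongside the existing ones:

```
-> { "execute": "query-memdev" }
<- { "return": [ { "id": "mem0", "size": 10737418240, "merge": true,
                   "dump": true, "prealloc": false, "share": false,
                   "reserve": false, "host-nodes": [],
                   "policy": "default" } ] }
```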
[PATCH v5 07/14] memory: Introduce RAM_NORESERVE and wire it up in qemu_ram_mmap()
Let's introduce RAM_NORESERVE, allowing mmap'ing with MAP_NORESERVE. The new flag has the following semantics: " RAM is mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge pages if applicable) is skipped: will bail out if not supported. When not set, the OS will do the reservation, if supported for the memory type. " Allow passing it into: - memory_region_init_ram_nomigrate() - memory_region_init_resizeable_ram() - memory_region_init_ram_from_file() ... and teach qemu_ram_mmap() and qemu_anon_ram_alloc() about the flag. Bail out if the flag is not supported, which is the case right now for both, POSIX and win32. We will add Linux support next and allow specifying RAM_NORESERVE via memory backends. The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/exec/cpu-common.h | 1 + include/exec/memory.h | 15 --- include/exec/ram_addr.h | 3 ++- include/qemu/osdep.h | 9 - migration/ram.c | 3 +-- softmmu/physmem.c | 15 +++ util/mmap-alloc.c | 7 +++ util/oslib-posix.c| 6 -- util/oslib-win32.c| 13 - 9 files changed, 58 insertions(+), 14 deletions(-) diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h index 5a0a2d93e0..38a47ad4ac 100644 --- a/include/exec/cpu-common.h +++ b/include/exec/cpu-common.h @@ -58,6 +58,7 @@ void *qemu_ram_get_host_addr(RAMBlock *rb); ram_addr_t qemu_ram_get_offset(RAMBlock *rb); ram_addr_t qemu_ram_get_used_length(RAMBlock *rb); bool qemu_ram_is_shared(RAMBlock *rb); +bool qemu_ram_is_noreserve(RAMBlock *rb); bool qemu_ram_is_uf_zeroable(RAMBlock *rb); void qemu_ram_set_uf_zeroable(RAMBlock *rb); bool qemu_ram_is_migratable(RAMBlock *rb); diff --git a/include/exec/memory.h b/include/exec/memory.h index 10179c6695..8d77819bcd 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -155,6 +155,13 @@ typedef struct IOMMUTLBEvent { */ #define RAM_UF_WRITEPROTECT (1 << 6) +/* + * RAM is 
mmap-ed with MAP_NORESERVE. When set, reserving swap space (or huge + * pages if applicable) is skipped: will bail out if not supported. When not + * set, the OS will do the reservation, if supported for the memory type. + */ +#define RAM_NORESERVE (1 << 7) + static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn, IOMMUNotifierFlag flags, hwaddr start, hwaddr end, @@ -937,7 +944,7 @@ void memory_region_init_ram_nomigrate(MemoryRegion *mr, * @name: Region name, becomes part of RAMBlock name used in migration stream *must be unique within any device * @size: size of the region. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_NORESERVE. * @errp: pointer to Error*, to store an error if it happens. * * Note that this function does not do anything to cause the data in the @@ -991,7 +998,8 @@ void memory_region_init_resizeable_ram(MemoryRegion *mr, * @size: size of the region. * @align: alignment of the region base address; if 0, the default alignment * (getpagesize()) will be used. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE, * @path: the path in which to allocate the RAM. * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens. @@ -1017,7 +1025,8 @@ void memory_region_init_ram_from_file(MemoryRegion *mr, * @owner: the object that tracks the region's reference count * @name: the name of the region. * @size: size of the region. - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE. * @fd: the fd to mmap. * @offset: offset within the file referenced by fd * @errp: pointer to Error*, to store an error if it happens. 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h index 6d4513f8e2..551876bed0 100644 --- a/include/exec/ram_addr.h +++ b/include/exec/ram_addr.h @@ -104,7 +104,8 @@ long qemu_maxrampagesize(void); * Parameters: * @size: the size in bytes of the ram block * @mr: the memory region where the ram block is - * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM. + * @ram_flags: RamBlock flags. Supported flags: RAM_SHARED, RAM_PMEM, + * RAM_NORESERVE. * @mem_path or @fd: specify the backing file or device * @readonly: true to open @path for reading, false for read/write. * @errp: pointer to Error*, to store an error if it happens diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index 5ba89f5460..877d9a5eff 100644 --
[PATCH v5 08/14] util/mmap-alloc: Support RAM_NORESERVE via MAP_NORESERVE under Linux
Let's support RAM_NORESERVE via MAP_NORESERVE on Linux. The flag has no effect on most shared mappings - except for hugetlbfs and anonymous memory. Linux man page: "MAP_NORESERVE: Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag had effect only for private writable mappings."

Note that the "guarantee" part is wrong with memory overcommit in Linux. Also, in Linux hugetlbfs is treated differently - we configure reservation of huge pages from the pool, not reservation of swap space (huge pages cannot be swapped).

The rough behavior is [1]:

a) !Hugetlbfs:
   1) Without MAP_NORESERVE *or* with memory overcommit under Linux
      disabled ("/proc/sys/vm/overcommit_memory == 2"), the following
      accounting/reservation happens:
       For a file backed map
        SHARED or READ-only - 0 cost (the file is the map not swap)
        PRIVATE WRITABLE - size of mapping per instance
       For an anonymous or /dev/zero map
        SHARED - size of mapping
        PRIVATE READ-only - 0 cost (but of little use)
        PRIVATE WRITABLE - size of mapping per instance
   2) With MAP_NORESERVE, no accounting/reservation happens.

b) Hugetlbfs:
   1) Without MAP_NORESERVE, huge pages are reserved.
   2) With MAP_NORESERVE, no huge pages are reserved.

Note: With "/proc/sys/vm/overcommit_memory == 0", we were already able to configure it for !hugetlbfs globally; this toggle now allows configuring it in a more fine-grained way, not just for the whole system.

The target use case is virtio-mem, which dynamically exposes memory inside a large, sparse memory area to the VM. 
[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/qemu/osdep.h | 3 ++ softmmu/physmem.c| 1 + util/mmap-alloc.c| 69 ++-- 3 files changed, 71 insertions(+), 2 deletions(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index 877d9a5eff..d5c1860508 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -176,6 +176,9 @@ extern int daemon(int, int); #ifndef MAP_FIXED_NOREPLACE #define MAP_FIXED_NOREPLACE 0 #endif +#ifndef MAP_NORESERVE +#define MAP_NORESERVE 0 +#endif #ifndef ENOMEDIUM #define ENOMEDIUM ENODEV #endif diff --git a/softmmu/physmem.c b/softmmu/physmem.c index 608ba75d18..0d4376b04e 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -2229,6 +2229,7 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) flags = MAP_FIXED; flags |= block->flags & RAM_SHARED ? MAP_SHARED : MAP_PRIVATE; +flags |= block->flags & RAM_NORESERVE ? MAP_NORESERVE : 0; if (block->fd >= 0) { area = mmap(vaddr, length, PROT_READ | PROT_WRITE, flags, block->fd, offset); diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c index d0cf4aaee5..838e286ce5 100644 --- a/util/mmap-alloc.c +++ b/util/mmap-alloc.c @@ -20,6 +20,7 @@ #include "qemu/osdep.h" #include "qemu/mmap-alloc.h" #include "qemu/host-utils.h" +#include "qemu/cutils.h" #include "qemu/error-report.h" #define HUGETLBFS_MAGIC 0x958458f6 @@ -83,6 +84,70 @@ size_t qemu_mempath_getpagesize(const char *mem_path) return qemu_real_host_page_size; } +#define OVERCOMMIT_MEMORY_PATH "/proc/sys/vm/overcommit_memory" +static bool map_noreserve_effective(int fd, uint32_t qemu_map_flags) +{ +#if defined(__linux__) +const bool readonly = qemu_map_flags & QEMU_MAP_READONLY; +const bool shared = qemu_map_flags & QEMU_MAP_SHARED; +gchar *content = NULL; +const char *endptr; +unsigned int tmp; + +/* + * hugetlb accounting is different than ordinary swap reservation: + * a) Hugetlb pages from the pool are reserved for both private and + 
*shared mappings. For shared mappings, all mappers have to specify + *MAP_NORESERVE. + * b) MAP_NORESERVE is not affected by /proc/sys/vm/overcommit_memory. + */ +if (qemu_fd_getpagesize(fd) != qemu_real_host_page_size) { +return true; +} + +/* + * Accountable mappings in the kernel that can be affected by MAP_NORESERVE + * are private writable mappings (see mm/mmap.c:accountable_mapping() in + * Linux). For all shared or readonly mappings, MAP_NORESERVE is always + * implicitly active -- no reservation; this includes shmem. The only + * exception is shared anonymous memory, it is accounted like private + * anonymous memory. + */ +if (readonly || (shared && fd >= 0)) { +return true; +} + +/* + * MAP_NORESERVE is globally ignored for applicable !huge
[PATCH v5 14/14] hmp: Print "reserve" property of memory backends with "info memdev"
Let's print the new property. Reviewed-by: Dr. David Alan Gilbert Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-hmp-cmds.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 004a92b3d6..9bedc77bb4 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -112,6 +112,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict) m->value->prealloc ? "true" : "false"); monitor_printf(mon, " share: %s\n", m->value->share ? "true" : "false"); +monitor_printf(mon, " reserve: %s\n", + m->value->reserve ? "true" : "false"); monitor_printf(mon, " policy: %s\n", HostMemPolicy_str(m->value->policy)); visit_complete(v, &str); -- 2.30.2
[PATCH v5 12/14] hmp: Print "share" property of memory backends with "info memdev"
Let's print the property. Cc: Markus Armbruster Cc: Eric Blake Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-hmp-cmds.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c index 58248cffa3..004a92b3d6 100644 --- a/hw/core/machine-hmp-cmds.c +++ b/hw/core/machine-hmp-cmds.c @@ -110,6 +110,8 @@ void hmp_info_memdev(Monitor *mon, const QDict *qdict) m->value->dump ? "true" : "false"); monitor_printf(mon, " prealloc: %s\n", m->value->prealloc ? "true" : "false"); +monitor_printf(mon, " share: %s\n", + m->value->share ? "true" : "false"); monitor_printf(mon, " policy: %s\n", HostMemPolicy_str(m->value->policy)); visit_complete(v, &str); -- 2.30.2
[PATCH v5 10/14] qmp: Clarify memory backend properties returned via query-memdev
We return information on the currently configured memory backends and don't configure them, so describe what the currently set properties express. Suggested-by: Markus Armbruster Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- qapi/machine.json | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/qapi/machine.json b/qapi/machine.json index 6e90d463fc..758b901185 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -790,11 +790,11 @@ # # @size: memory backend size # -# @merge: enables or disables memory merge support +# @merge: whether memory merge support is enabled # -# @dump: includes memory backend's memory in a core dump or not +# @dump: whether memory backend's memory is included in a core dump # -# @prealloc: enables or disables memory preallocation +# @prealloc: whether memory was preallocated # # @host-nodes: host nodes for its memory policy # -- 2.30.2
Re: [PATCH 0/1] sphinx: qapidoc: Wrap "If" section body in a paragraph node
On Wed, 7 Apr 2021 at 10:41, Markus Armbruster wrote: > > Peter, do you intend to merge this yourself? I have nothing else queued > right now. If you want me to do a pull request for this patch, let me > know. I missed this email earlier. I had assumed somebody else would pick it up, but as it happens I have a couple of last minute patches I want to send out anyway, so I'll pick this up now. thanks -- PMM
[PATCH v5 11/14] qmp: Include "share" property of memory backends
Let's include the property, which can be helpful when debugging, for example, to spot misuse of MAP_PRIVATE which can result in some ugly corner cases (e.g., double-memory consumption on shmem). Use the same description we also use for describing the property. Cc: Eric Blake Cc: Markus Armbruster Cc: Igor Mammedov Signed-off-by: David Hildenbrand --- hw/core/machine-qmp-cmds.c | 1 + qapi/machine.json | 3 +++ 2 files changed, 4 insertions(+) diff --git a/hw/core/machine-qmp-cmds.c b/hw/core/machine-qmp-cmds.c index 68a942595a..d41db5b93b 100644 --- a/hw/core/machine-qmp-cmds.c +++ b/hw/core/machine-qmp-cmds.c @@ -174,6 +174,7 @@ static int query_memdev(Object *obj, void *opaque) m->merge = object_property_get_bool(obj, "merge", &error_abort); m->dump = object_property_get_bool(obj, "dump", &error_abort); m->prealloc = object_property_get_bool(obj, "prealloc", &error_abort); +m->share = object_property_get_bool(obj, "share", &error_abort); m->policy = object_property_get_enum(obj, "policy", "HostMemPolicy", &error_abort); host_nodes = object_property_get_qobject(obj, diff --git a/qapi/machine.json b/qapi/machine.json index 758b901185..32650bfe9e 100644 --- a/qapi/machine.json +++ b/qapi/machine.json @@ -796,6 +796,8 @@ # # @prealloc: whether memory was preallocated # +# @share: whether memory is private to QEMU or shared (since 6.1) +# # @host-nodes: host nodes for its memory policy # # @policy: memory policy of memory backend @@ -809,6 +811,7 @@ 'merge': 'bool', 'dump': 'bool', 'prealloc': 'bool', +'share': 'bool', 'host-nodes': ['uint16'], 'policy': 'HostMemPolicy' }} -- 2.30.2
Re: trace_FOO_tcg bit-rotted?
Stefan Hajnoczi writes: > On Mon, Apr 12, 2021 at 08:06:57PM +0100, Alex Bennée wrote: >> >> Stefan Hajnoczi writes: >> >> > On Fri, Apr 09, 2021 at 05:29:08PM +0100, Alex Bennée wrote: >> >> >> >> Laurent Vivier writes: >> >> >> >> > Le 06/04/2021 à 18:00, Alex Bennée a écrit : >> >> >> Hi, >> >> >> >> >> >> It's been awhile since I last played with this but I think we are >> >> >> suffering from not having some test cases for tracing code >> >> >> generation/execution in the tree. I tried adding a simple trace point >> >> >> to >> >> >> see if I could track ERET calls: >> > >> > Lluís: are you happy to deprecate tcg trace events in favor of TCG >> > plugins? > > That said, I haven't used the TCG trace event functionality and maybe > I'm missing something obvious that Lluís will point out :). I've updated the Cc to what I think is Lluís current email as I was getting bounces from the old academic address. -- Alex Bennée
Re: [PATCH 2/2] Support monitor chardev hotswap with QMP
On Tue, Apr 13, 2021 at 10:58 AM Daniel P. Berrangé wrote: > > On Tue, Apr 13, 2021 at 08:40:59AM +0200, Markus Armbruster wrote: > > Li Zhang writes: > > > > > From: Li Zhang > > > > > > For some scenarios, it needs to hot-add a monitor device. > > > But QEMU doesn't support hotplug yet. It also works by adding > > > a monitor with null backend by default and then change its > > > backend to socket by QMP command "chardev-change". > > If you need ability to hot-add monitor instances, why not just > implement that feature directly, instead of pre-creating monitors > with null backends and then later changing the backend ? > Hi Daniel, I would like to use this solution first because we'd like to use it as soon as possible. It's the best solution to implement the hot-add feature and it may need more effort to do it. I will consider it. > > > > > > So this patch is to support monitor chardev hotswap with QMP. > > > > > > Signed-off-by: Li Zhang > > > > I think what what you're trying to say is that chardev-change does not > > work when the character device changes is used by a QMP monitor. > > Correct? > > > > If yes, how exactly does it misbehave? > > > > Does it work with an HMP monitor? 
> > > > > --- > > > monitor/monitor-internal.h | 3 +++ > > > monitor/monitor.c | 2 +- > > > monitor/qmp.c | 42 +++--- > > > 3 files changed, 43 insertions(+), 4 deletions(-) > > > > > > diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h > > > index 40903d6386..2df6dd21de 100644 > > > --- a/monitor/monitor-internal.h > > > +++ b/monitor/monitor-internal.h > > > @@ -186,4 +186,7 @@ int hmp_compare_cmd(const char *name, const char > > > *list); > > > void qmp_query_qmp_schema(QDict *qdict, QObject **ret_data, > > > Error **errp); > > > > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > > + void *opaque); > > > + > > > #endif > > > diff --git a/monitor/monitor.c b/monitor/monitor.c > > > index e94f532cf5..2d255bab18 100644 > > > --- a/monitor/monitor.c > > > +++ b/monitor/monitor.c > > > @@ -157,7 +157,7 @@ static inline bool > > > monitor_is_hmp_non_interactive(const Monitor *mon) > > > > > > static void monitor_flush_locked(Monitor *mon); > > > > > > -static gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > > +gboolean monitor_unblocked(GIOChannel *chan, GIOCondition cond, > > >void *opaque) > > > { > > > Monitor *mon = opaque; > > > diff --git a/monitor/qmp.c b/monitor/qmp.c > > > index 2326bd7f9b..55cfb230d9 100644 > > > --- a/monitor/qmp.c > > > +++ b/monitor/qmp.c > > > @@ -44,6 +44,7 @@ struct QMPRequest { > > > Error *err; > > > }; > > > typedef struct QMPRequest QMPRequest; > > > +static void monitor_qmp_set_handlers_bh(void *opaque); > > > > > > QmpCommandList qmp_commands, qmp_cap_negotiation_commands; > > > > > > @@ -480,7 +481,35 @@ void monitor_data_destroy_qmp(MonitorQMP *mon) > > > g_queue_free(mon->qmp_requests); > > > } > > > > > > -static void monitor_qmp_setup_handlers_bh(void *opaque) > > > +static int monitor_qmp_change(void *opaque) > > > +{ > > > +MonitorQMP *mon = opaque; > > > + > > > +mon->common.use_io_thread = > > > +qemu_chr_has_feature(mon->common.chr.chr, > > > 
QEMU_CHAR_FEATURE_GCONTEXT); > > > + > > > +if (mon->common.use_io_thread) { > > > +aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread), > > > +monitor_qmp_set_handlers_bh, mon); > > > +} else { > > > +qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, > > > + monitor_qmp_read, monitor_qmp_event, > > > + monitor_qmp_change, &mon->common, NULL, > > > true); > > > +} > > > + > > > +if (mon->common.out_watch) { > > > +g_source_remove(mon->common.out_watch); > > > > All other updates of @out_watch are under @mon_lock. Why not this one? > > > > I have no idea whether g_source_remove() is the right function to call. > > Its documentation says "You must use g_source_destroy() for sources > > added to a non-default main context." The qemu_chr_fe_set_handlers() > > contract is of no help. > > > > Documentation of g_source_destroy() confuses some more: "This does not > > unref the GSource: if you still hold a reference, use g_source_unref() > > to drop it. > > > > Marc-André, can you help? > > > > > +qemu_mutex_lock(&mon->common.mon_lock); > > > +mon->common.out_watch = > > > +qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > > > + monitor_unblocked, &mon->common); > > > > Bad indentation. Better: > > > > mon->common.out_watch = > > qemu_chr_fe_add_watch(&mon->common.chr, G_IO_OUT | G_IO_HUP, > >monitor_unb
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
Am 13.04.2021 um 10:13 hat David Hildenbrand geschrieben: > On 13.04.21 06:41, Markus Armbruster wrote: > > David Hildenbrand writes: > > > > > On 12.03.21 18:35, Paolo Bonzini wrote: > > > > Emulators are currently using OptsVisitor (via user_creatable_add_opts) > > > > to parse the -object command line option. This has one extra feature, > > > > compared to keyval, which is automatic conversion of integers to lists > > > > as well as support for lists as repeated options: > > > > -object > > > > memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind > > > > So we cannot replace OptsVisitor with keyval right now. Still, this > > > > patch moves the user_creatable_add_opts logic to vl.c since it is > > > > not needed anywhere else, and makes it go through > > > > user_creatable_add_qapi. > > > > In order to minimize code changes, the predicate still takes a > > > > string. > > > > This can be changed later to use the ObjectType QAPI enum directly. > > > > > > > > > > Rebasing my "noreserve"[1] series on this, I get weird errors from > > > QEMU when specifying the new "reserve=off" option for a > > > memory-backend-ram: > > > > > > "Invalid parameter 'reserve'" > > > > > > And it looks like this is the case for any new properties. Poking > > > around, I fail to find what's causing this -- or how to unlock new > > > properties. What is the magic toggle to make it work? > > > > > > Thanks! > > > > > > [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com > > > > Wild guess: you didn't add your new properties in the QAPI schema. > > > > For a not-so-wild-guess, send us a git-fetch argument for your rebased > > series. > > > > Oh, there is qapi/qom.json -- maybe that does the trick. > > (I have mixed feelings about having to specify the same thing twice at > different locations) The idea is that we'll eventually generate some of the QOM boilerplate from the QAPI schema and remove the duplication again. 
But yes, for the time being, you need to touch both places. Kevin
Re: [PATCH v2 2/3] qom: move user_creatable_add_opts logic to vl.c and QAPIfy it
David Hildenbrand writes: > On 13.04.21 06:41, Markus Armbruster wrote: >> David Hildenbrand writes: >> >>> On 12.03.21 18:35, Paolo Bonzini wrote: Emulators are currently using OptsVisitor (via user_creatable_add_opts) to parse the -object command line option. This has one extra feature, compared to keyval, which is automatic conversion of integers to lists as well as support for lists as repeated options: -object memory-backend-ram,id=pc.ram,size=1048576000,host-nodes=0,policy=bind So we cannot replace OptsVisitor with keyval right now. Still, this patch moves the user_creatable_add_opts logic to vl.c since it is not needed anywhere else, and makes it go through user_creatable_add_qapi. In order to minimize code changes, the predicate still takes a string. This can be changed later to use the ObjectType QAPI enum directly. >>> >>> Rebasing my "noreserve"[1] series on this, I get weird errors from >>> QEMU when specifying the new "reserve=off" option for a >>> memory-backend-ram: >>> >>> "Invalid parameter 'reserve'" >>> >>> And it looks like this is the case for any new properties. Poking >>> around, I fail to find what's causing this -- or how to unlock new >>> properties. What is the magic toggle to make it work? >>> >>> Thanks! >>> >>> [1] https://lkml.kernel.org/r/20210319101230.21531-1-da...@redhat.com >> Wild guess: you didn't add your new properties in the QAPI schema. >> For a not-so-wild-guess, send us a git-fetch argument for your >> rebased >> series. >> > > Oh, there is qapi/qom.json -- maybe that does the trick. > > (I have mixed feelings about having to specify the same thing twice at > different locations) With reason. We've talked about generating QOM boilerplate from the QAPI schema, but haven't progressed to actual patches. > I'll have a look if that makes it fly.
[PATCH RESEND v7 00/13] virtio-mem: vfio support
After silence for more than 1.5 months and the feeling like pinging into a black hole, I rebased and retested the patches. I hope we can get them into 6.1 early -- or at least get some more feedback on the patches. @Paolo: Michael and Alex already acked relevant parts -- A virtio-mem device manages a memory region in guest physical address space, represented as a single (currently large) memory region in QEMU, mapped into system memory address space. Before the guest is allowed to use memory blocks, it must coordinate with the hypervisor (plug blocks). After a reboot, all memory is usually unplugged - when the guest comes up, it detects the virtio-mem device and selects memory blocks to plug (based on resize requests from the hypervisor). Memory hot(un)plug consists of (un)plugging memory blocks via a virtio-mem device (triggered by the guest). When unplugging blocks, we discard the memory - similar to memory balloon inflation. In contrast to memory ballooning, we always know which memory blocks a guest may actually use - especially during a reboot, after a crash, or after kexec (and during hibernation as well). Guests agreed to not access unplugged memory again, especially not via DMA. The issue with vfio is that it cannot deal with random discards - for this reason, virtio-mem and vfio can currently only run mutually exclusively. Especially, vfio would currently map the whole memory region (with possibly only little/no plugged blocks), resulting in all pages getting pinned and therefore resulting in a higher memory consumption than expected (turning virtio-mem basically useless in these environments). To make vfio work nicely with virtio-mem, we have to map only the plugged blocks, and map/unmap properly when plugging/unplugging blocks (including discarding of RAM when unplugging). We achieve that by using a new notifier mechanism that communicates changes. 
It's important to map memory in the granularity in which we could see unmaps again (-> virtio-mem block size) - so when e.g., plugging consecutive 100 MB with a block size of 2 MB, we need 50 mappings. When unmapping, we can use a single vfio_unmap call for the applicable range. We expect that the block size of virtio-mem devices will be fairly large in the future (to not run out of mappings and to improve hot(un)plug performance), configured by the user, when used with vfio (e.g., 128MB, 1G, ...), but it will depend on the setup. More info regarding virtio-mem can be found at: https://virtio-mem.gitlab.io/ v7 is located at: g...@github.com:davidhildenbrand/qemu.git virtio-mem-vfio-v7 v6 -> v7: - s/RamDiscardMgr/RamDiscardManager/ - "memory: Introduce RamDiscardManager for RAM memory regions" -- Make RamDiscardManager/RamDiscardListener eat MemoryRegionSections -- Replace notify_discard_all callback by double_discard_supported -- Reshuffle the individual hunks in memory.h -- Provide function wrappers for RamDiscardManager calls - "memory: Helpers to copy/free a MemoryRegionSection" -- Added - "virtio-mem: Implement RamDiscardManager interface" -- Work on MemoryRegionSections instead of ranges -- Minor optimizations - "vfio: Support for RamDiscardManager in the !vIOMMU case" -- Simplify based on new interfaces / MemoryRegionSections -- Minor cleanups and optimizations -- Add a comment regarding dirty bitmap sync. -- Don't store "offset_within_region" in VFIORamDiscardListener - "vfio: Support for RamDiscardManager in the vIOMMU case" -- Adjust to new interface - "softmmu/physmem: Don't use atomic operations in ..." -- Rename variables - "softmmu/physmem: Extend ram_block_discard_(require|disable) ..." -- Rename variables - Rebased and retested v5 -> v6: - "memory: Introduce RamDiscardMgr for RAM memory regions" -- Fix variable names in one prototype. 
- "virtio-mem: Don't report errors when ram_block_discard_range() fails" -- Added - "virtio-mem: Implement RamDiscardMgr interface" -- Don't report an error if discarding fails - Rebased and retested v4 -> v5: - "vfio: Support for RamDiscardMgr in the !vIOMMU case" -- Added more assertions for granularity vs. iommu supported pagesize - "vfio: Sanity check maximum number of DMA mappings with RamDiscardMgr" -- Fix accounting of mappings - "vfio: Disable only uncoordinated discards for VFIO_TYPE1 iommus" -- Fence off SPAPR and add some comments regarding future support. -- Tweak patch description - Rebase and retest v3 -> v4: - "vfio: Query and store the maximum number of DMA mappings -- Limit the patch to querying and storing only -- Renamed to "vfio: Query and store the maximum number of possible DMA mappings" - "vfio: Support for RamDiscardMgr in the !vIOMMU case" -- Remove sanity checks / warning the user - "vfio: Sanity check maximum number of DMA mappings with RamDiscardMgr" -- Perform sanity checks by looking at the number of memslots and all registered RamDiscardMgr sections - Rebase and retest - Reshuffled the patches slightly v2 -> v3: - Rebased + retested - Fixed some typ
[PATCH RESEND v7 02/13] memory: Helpers to copy/free a MemoryRegionSection
In case one wants to create a permanent copy of a MemoryRegionSection, one needs access to flatview_ref()/flatview_unref(). Instead of exposing these, let's just add helpers to copy/free a MemoryRegionSection and properly adjust references. Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 20 softmmu/memory.c | 27 +++ 2 files changed, 47 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 38a3b41ac1..e806d0140e 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1047,6 +1047,26 @@ static inline bool MemoryRegionSection_eq(MemoryRegionSection *a, a->nonvolatile == b->nonvolatile; } +/** + * memory_region_section_new_copy: Copy a memory region section + * + * Allocate memory for a new copy, copy the memory region section, and + * properly take a reference on all relevant members. + * + * @s: the #MemoryRegionSection to copy + */ +MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s); + +/** + * memory_region_section_free_copy: Free a copied memory region section + * + * Free a copy of a memory section created via memory_region_section_new_copy(), + * properly dropping references on all relevant members. 
+ * + * @s: the #MemoryRegionSection to copy + */ +void memory_region_section_free_copy(MemoryRegionSection *s); + /** * memory_region_init: Initialize a memory region * diff --git a/softmmu/memory.c b/softmmu/memory.c index 26ea87d77a..776c7cac38 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -2694,6 +2694,33 @@ MemoryRegionSection memory_region_find(MemoryRegion *mr, return ret; } +MemoryRegionSection *memory_region_section_new_copy(MemoryRegionSection *s) +{ +MemoryRegionSection *tmp = g_new(MemoryRegionSection, 1); + +*tmp = *s; +if (tmp->mr) { +memory_region_ref(tmp->mr); +} +if (tmp->fv) { +bool ret = flatview_ref(tmp->fv); + +g_assert(ret); +} +return tmp; +} + +void memory_region_section_free_copy(MemoryRegionSection *s) +{ +if (s->fv) { +flatview_unref(s->fv); +} +if (s->mr) { +memory_region_unref(s->mr); +} +g_free(s); +} + bool memory_region_present(MemoryRegion *container, hwaddr addr) { MemoryRegion *mr; -- 2.30.2
[PATCH RESEND v7 04/13] virtio-mem: Don't report errors when ram_block_discard_range() fails
Any errors are unexpected and ram_block_discard_range() already properly prints errors. Let's stop manually reporting errors. Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 20 1 file changed, 4 insertions(+), 16 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 471e464171..bbe42ad83b 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -246,17 +246,14 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, uint64_t size, bool plug) { const uint64_t offset = start_gpa - vmem->addr; -int ret; +RAMBlock *rb = vmem->memdev->mr.ram_block; if (virtio_mem_is_busy()) { return -EBUSY; } if (!plug) { -ret = ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size); -if (ret) { -error_report("Unexpected error discarding RAM: %s", - strerror(-ret)); +if (ram_block_discard_range(rb, offset, size)) { return -EBUSY; } } @@ -345,15 +342,12 @@ static void virtio_mem_resize_usable_region(VirtIOMEM *vmem, static int virtio_mem_unplug_all(VirtIOMEM *vmem) { RAMBlock *rb = vmem->memdev->mr.ram_block; -int ret; if (virtio_mem_is_busy()) { return -EBUSY; } -ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb)); -if (ret) { -error_report("Unexpected error discarding RAM: %s", strerror(-ret)); +if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) { return -EBUSY; } bitmap_clear(vmem->bitmap, 0, vmem->bitmap_size); @@ -625,14 +619,8 @@ static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, uint64_t offset, uint64_t size) { RAMBlock *rb = vmem->memdev->mr.ram_block; -int ret; -ret = ram_block_discard_range(rb, offset, size); -if (ret) { -error_report("Unexpected error discarding RAM: %s", strerror(-ret)); -return -EINVAL; -} -return 0; +return 
ram_block_discard_range(rb, offset, size) ? -EINVAL : 0; } static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) -- 2.30.2
[PATCH RESEND v7 06/13] vfio: Support for RamDiscardManager in the !vIOMMU case
Implement support for RamDiscardManager, to prepare for virtio-mem support. Instead of mapping the whole memory section, we only map "populated" parts and update the mapping when notified about discarding/population of memory via the RamDiscardListener. Similarly, when syncing the dirty bitmaps, sync only the actually mapped (populated) parts by replaying via the notifier. Using virtio-mem with vfio is still blocked via ram_block_discard_disable()/ram_block_discard_require() after this patch. Reviewed-by: Alex Williamson Acked-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 164 ++ include/hw/vfio/vfio-common.h | 11 +++ 2 files changed, 175 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index ae5654fcdb..5af7755227 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -649,6 +649,110 @@ out: rcu_read_unlock(); } +static void vfio_ram_discard_notify_discard(RamDiscardListener *rdl, +MemoryRegionSection *section) +{ +VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, +listener); +const hwaddr size = int128_get64(section->size); +const hwaddr iova = section->offset_within_address_space; +int ret; + +/* Unmap with a single call. 
*/ +ret = vfio_dma_unmap(vrdl->container, iova, size , NULL); +if (ret) { +error_report("%s: vfio_dma_unmap() failed: %s", __func__, + strerror(-ret)); +} +} + +static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl, +MemoryRegionSection *section) +{ +VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener, +listener); +const hwaddr end = section->offset_within_region + + int128_get64(section->size); +hwaddr start, next, iova; +void *vaddr; +int ret; + +/* + * Map in (aligned within memory region) minimum granularity, so we can + * unmap in minimum granularity later. + */ +for (start = section->offset_within_region; start < end; start = next) { +next = ROUND_UP(start + 1, vrdl->granularity); +next = MIN(next, end); + +iova = start - section->offset_within_region + + section->offset_within_address_space; +vaddr = memory_region_get_ram_ptr(section->mr) + start; + +ret = vfio_dma_map(vrdl->container, iova, next - start, + vaddr, section->readonly); +if (ret) { +/* Rollback */ +vfio_ram_discard_notify_discard(rdl, section); +return ret; +} +} +return 0; +} + +static void vfio_register_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); +VFIORamDiscardListener *vrdl; + +/* Ignore some corner cases not relevant in practice. 
*/ +g_assert(QEMU_IS_ALIGNED(section->offset_within_region, TARGET_PAGE_SIZE)); +g_assert(QEMU_IS_ALIGNED(section->offset_within_address_space, + TARGET_PAGE_SIZE)); +g_assert(QEMU_IS_ALIGNED(int128_get64(section->size), TARGET_PAGE_SIZE)); + +vrdl = g_new0(VFIORamDiscardListener, 1); +vrdl->container = container; +vrdl->mr = section->mr; +vrdl->offset_within_address_space = section->offset_within_address_space; +vrdl->size = int128_get64(section->size); +vrdl->granularity = ram_discard_manager_get_min_granularity(rdm, +section->mr); + +g_assert(vrdl->granularity && is_power_of_2(vrdl->granularity)); +g_assert(vrdl->granularity >= 1 << ctz64(container->pgsizes)); + +ram_discard_listener_init(&vrdl->listener, + vfio_ram_discard_notify_populate, + vfio_ram_discard_notify_discard, true); +ram_discard_manager_register_listener(rdm, &vrdl->listener, section); +QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); +} + +static void vfio_unregister_ram_discard_listener(VFIOContainer *container, + MemoryRegionSection *section) +{ +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr); +VFIORamDiscardListener *vrdl = NULL; + +QLIST_FOREACH(vrdl, &container->vrdl_list, next) { +if (vrdl->mr == section->mr && +vrdl->offset_within_address_space == +section->offset_within_addr
[PATCH RESEND v7 03/13] virtio-mem: Factor out traversing unplugged ranges
Let's factor out the core logic, no need to replicate. Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 86 -- 1 file changed, 49 insertions(+), 37 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 655824ff81..471e464171 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -145,6 +145,33 @@ static bool virtio_mem_is_busy(void) return migration_in_incoming_postcopy() || !migration_is_idle(); } +typedef int (*virtio_mem_range_cb)(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size); + +static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg, + virtio_mem_range_cb cb) +{ +unsigned long first_zero_bit, last_zero_bit; +uint64_t offset, size; +int ret = 0; + +first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); +while (first_zero_bit < vmem->bitmap_size) { +offset = first_zero_bit * vmem->block_size; +last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + first_zero_bit + 1) - 1; +size = (last_zero_bit - first_zero_bit + 1) * vmem->block_size; + +ret = cb(vmem, arg, offset, size); +if (ret) { +break; +} +first_zero_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, +last_zero_bit + 2); +} +return ret; +} + static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa, uint64_t size, bool plugged) { @@ -594,33 +621,27 @@ static void virtio_mem_device_unrealize(DeviceState *dev) ram_block_discard_require(false); } -static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) +static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size) { RAMBlock *rb = vmem->memdev->mr.ram_block; -unsigned long first_zero_bit, 
last_zero_bit; -uint64_t offset, length; int ret; -/* Find consecutive unplugged blocks and discard the consecutive range. */ -first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); -while (first_zero_bit < vmem->bitmap_size) { -offset = first_zero_bit * vmem->block_size; -last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, - first_zero_bit + 1) - 1; -length = (last_zero_bit - first_zero_bit + 1) * vmem->block_size; - -ret = ram_block_discard_range(rb, offset, length); -if (ret) { -error_report("Unexpected error discarding RAM: %s", - strerror(-ret)); -return -EINVAL; -} -first_zero_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, -last_zero_bit + 2); +ret = ram_block_discard_range(rb, offset, size); +if (ret) { +error_report("Unexpected error discarding RAM: %s", strerror(-ret)); +return -EINVAL; } return 0; } +static int virtio_mem_restore_unplugged(VirtIOMEM *vmem) +{ +/* Make sure all memory is really discarded after migration. */ +return virtio_mem_for_each_unplugged_range(vmem, NULL, + virtio_mem_discard_range_cb); +} + static int virtio_mem_post_load(void *opaque, int version_id) { if (migration_in_incoming_postcopy()) { @@ -872,28 +893,19 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name, vmem->block_size = value; } -static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem) +static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg, + uint64_t offset, uint64_t size) { void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block); -unsigned long first_zero_bit, last_zero_bit; -uint64_t offset, length; -/* - * Find consecutive unplugged blocks and exclude them from migration. - * - * Note: Blocks cannot get (un)plugged during precopy, no locking needed. 
- */ -first_zero_bit = find_first_zero_bit(vmem->bitmap, vmem->bitmap_size); -while (first_zero_bit < vmem->bitmap_size) { -offset = first_zero_bit * vmem->block_size; -last_zero_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, - first_zero_bit + 1) - 1; -length = (last_zero_bit - first_zero_bit + 1) * vmem->block_
[PATCH RESEND v7 01/13] memory: Introduce RamDiscardManager for RAM memory regions
We have some special RAM memory regions (managed by virtio-mem), whereby the guest agreed to only use selected memory ranges. "unused" parts are discarded so they won't consume memory - to logically unplug these memory ranges. Before the VM is allowed to use such logically unplugged memory again, coordination with the hypervisor is required. This results in "sparse" mmaps/RAMBlocks/memory regions, whereby only coordinated parts are valid to be used/accessed by the VM. In most cases, we don't care about that - e.g., in KVM, we simply have a single KVM memory slot. However, in case of vfio, registering the whole region with the kernel results in all pages getting pinned, and therefore an unexpected high memory consumption - discarding of RAM in that context is broken. Let's introduce a way to coordinate discarding/populating memory within a RAM memory region with such special consumers of RAM memory regions: they can register as listeners and get updates on memory getting discarded and populated. Using this machinery, vfio will be able to map only the currently populated parts, resulting in discarded parts not getting pinned and not consuming memory. A RamDiscardManager has to be set for a memory region before it is getting mapped, and cannot change while the memory region is mapped. Note: At some point, we might want to let RAMBlock users (esp. vfio used for nvme://) consume this interface as well. We'll need RAMBlock notifier calls when a RAMBlock is getting mapped/unmapped (via the corresponding memory region), so we can properly register a listener there as well. Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. 
David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 286 ++ softmmu/memory.c | 71 +++ 2 files changed, 335 insertions(+), 22 deletions(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index 5728a681b2..38a3b41ac1 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -42,6 +42,12 @@ typedef struct IOMMUMemoryRegionClass IOMMUMemoryRegionClass; DECLARE_OBJ_CHECKERS(IOMMUMemoryRegion, IOMMUMemoryRegionClass, IOMMU_MEMORY_REGION, TYPE_IOMMU_MEMORY_REGION) +#define TYPE_RAM_DISCARD_MANAGER "qemu:ram-discard-manager" +typedef struct RamDiscardManagerClass RamDiscardManagerClass; +typedef struct RamDiscardManager RamDiscardManager; +DECLARE_OBJ_CHECKERS(RamDiscardManager, RamDiscardManagerClass, + RAM_DISCARD_MANAGER, TYPE_RAM_DISCARD_MANAGER); + #ifdef CONFIG_FUZZ void fuzz_dma_read_cb(size_t addr, size_t len, @@ -65,6 +71,28 @@ struct ReservedRegion { unsigned type; }; +/** + * struct MemoryRegionSection: describes a fragment of a #MemoryRegion + * + * @mr: the region, or %NULL if empty + * @fv: the flat view of the address space the region is mapped in + * @offset_within_region: the beginning of the section, relative to @mr's start + * @size: the size of the section; will not exceed @mr's boundaries + * @offset_within_address_space: the address of the first byte of the section + * relative to the region's address space + * @readonly: writes to this section are ignored + * @nonvolatile: this section is non-volatile + */ +struct MemoryRegionSection { +Int128 size; +MemoryRegion *mr; +FlatView *fv; +hwaddr offset_within_region; +hwaddr offset_within_address_space; +bool readonly; +bool nonvolatile; +}; + typedef struct IOMMUTLBEntry IOMMUTLBEntry; /* See address_space_translate: bit 0 is read, bit 1 is write. 
*/ @@ -441,6 +469,206 @@ struct IOMMUMemoryRegionClass { Error **errp); }; +typedef struct RamDiscardListener RamDiscardListener; +typedef int (*NotifyRamPopulate)(RamDiscardListener *rdl, + MemoryRegionSection *section); +typedef void (*NotifyRamDiscard)(RamDiscardListener *rdl, + MemoryRegionSection *section); + +struct RamDiscardListener { +/* + * @notify_populate: + * + * Notification that previously discarded memory is about to get populated. + * Listeners are able to object. If any listener objects, already + * successfully notified listeners are notified about a discard again. + * + * @rdl: the #RamDiscardListener getting notified + * @section: the #MemoryRegionSection to get populated. The section + * is aligned within the memory region to the minimum granularity + * unless it would exceed the registered section. + * + * Returns 0 on success. If the notification is rejected by the listener, + * an error is returned. + */ +NotifyRamPopulate notify_populate; + +/* + * @notify_
[PATCH RESEND v7 10/13] softmmu/physmem: Don't use atomic operations in ram_block_discard_(disable|require)
We have users in migration context that don't hold the BQL (when finishing migration). To prepare for further changes, use a dedicated mutex instead of atomic operations. Keep using qatomic_read ("READ_ONCE") for the functions that only extract the current state (e.g., used by virtio-balloon), locking isn't necessary. While at it, split up the counter into two variables to make it easier to understand. Suggested-by: Peter Xu Reviewed-by: Peter Xu Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- softmmu/physmem.c | 70 ++- 1 file changed, 39 insertions(+), 31 deletions(-) diff --git a/softmmu/physmem.c b/softmmu/physmem.c index 85034d9c11..aaa2b2eb92 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -3647,56 +3647,64 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root) } } -/* - * If positive, discarding RAM is disabled. If negative, discarding RAM is - * required to work and cannot be disabled. 
- */ -static int ram_block_discard_disabled; +static unsigned int ram_block_discard_required_cnt; +static unsigned int ram_block_discard_disabled_cnt; +static QemuMutex ram_block_discard_disable_mutex; + +static void ram_block_discard_disable_mutex_lock(void) +{ +static gsize initialized; + +if (g_once_init_enter(&initialized)) { +qemu_mutex_init(&ram_block_discard_disable_mutex); +g_once_init_leave(&initialized, 1); +} +qemu_mutex_lock(&ram_block_discard_disable_mutex); +} + +static void ram_block_discard_disable_mutex_unlock(void) +{ +qemu_mutex_unlock(&ram_block_discard_disable_mutex); +} int ram_block_discard_disable(bool state) { -int old; +int ret = 0; +ram_block_discard_disable_mutex_lock(); if (!state) { -qatomic_dec(&ram_block_discard_disabled); -return 0; +ram_block_discard_disabled_cnt--; +} else if (!ram_block_discard_required_cnt) { +ram_block_discard_disabled_cnt++; +} else { +ret = -EBUSY; } - -do { -old = qatomic_read(&ram_block_discard_disabled); -if (old < 0) { -return -EBUSY; -} -} while (qatomic_cmpxchg(&ram_block_discard_disabled, - old, old + 1) != old); -return 0; +ram_block_discard_disable_mutex_unlock(); +return ret; } int ram_block_discard_require(bool state) { -int old; +int ret = 0; +ram_block_discard_disable_mutex_lock(); if (!state) { -qatomic_inc(&ram_block_discard_disabled); -return 0; +ram_block_discard_required_cnt--; +} else if (!ram_block_discard_disabled_cnt) { +ram_block_discard_required_cnt++; +} else { +ret = -EBUSY; } - -do { -old = qatomic_read(&ram_block_discard_disabled); -if (old > 0) { -return -EBUSY; -} -} while (qatomic_cmpxchg(&ram_block_discard_disabled, - old, old - 1) != old); -return 0; +ram_block_discard_disable_mutex_unlock(); +return ret; } bool ram_block_discard_is_disabled(void) { -return qatomic_read(&ram_block_discard_disabled) > 0; +return qatomic_read(&ram_block_discard_disabled_cnt); } bool ram_block_discard_is_required(void) { -return qatomic_read(&ram_block_discard_disabled) < 0; +return 
qatomic_read(&ram_block_discard_required_cnt); } -- 2.30.2
[PATCH RESEND v7 08/13] vfio: Sanity check maximum number of DMA mappings with RamDiscardManager
Although RamDiscardManager can handle running into the maximum number of DMA mappings by propagating errors when creating a DMA mapping, we want to sanity check and warn the user early that there is a theoretical setup issue and that virtio-mem might not be able to provide as much memory towards a VM as desired. As suggested by Alex, let's use the number of KVM memory slots to guess how many other mappings we might see over time. Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 43 +++ 1 file changed, 43 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 79628d60ae..f8a2fe8441 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -728,6 +728,49 @@ static void vfio_register_ram_discard_listener(VFIOContainer *container, vfio_ram_discard_notify_discard, true); ram_discard_manager_register_listener(rdm, &vrdl->listener, section); QLIST_INSERT_HEAD(&container->vrdl_list, vrdl, next); + +/* + * Sanity-check if we have a theoretically problematic setup where we could + * exceed the maximum number of possible DMA mappings over time. We assume + * that each mapped section in the same address space as a RamDiscardManager + * section consumes exactly one DMA mapping, with the exception of + * RamDiscardManager sections; i.e., we don't expect to have gIOMMU sections + * in the same address space as RamDiscardManager sections. + * + * We assume that each section in the address space consumes one memslot. + * We take the number of KVM memory slots as a best guess for the maximum + * number of sections in the address space we could have over time, + * also consuming DMA mappings. 
+ */ +if (container->dma_max_mappings) { +unsigned int vrdl_count = 0, vrdl_mappings = 0, max_memslots = 512; + +#ifdef CONFIG_KVM +if (kvm_enabled()) { +max_memslots = kvm_get_max_memslots(); +} +#endif + +QLIST_FOREACH(vrdl, &container->vrdl_list, next) { +hwaddr start, end; + +start = QEMU_ALIGN_DOWN(vrdl->offset_within_address_space, +vrdl->granularity); +end = ROUND_UP(vrdl->offset_within_address_space + vrdl->size, + vrdl->granularity); +vrdl_mappings += (end - start) / vrdl->granularity; +vrdl_count++; +} + +if (vrdl_mappings + max_memslots - vrdl_count > +container->dma_max_mappings) { +warn_report("%s: possibly running out of DMA mappings. E.g., try" +" increasing the 'block-size' of virtio-mem devies." +" Maximum possible DMA mappings: %d, Maximum possible" +" memslots: %d", __func__, container->dma_max_mappings, +max_memslots); +} +} } static void vfio_unregister_ram_discard_listener(VFIOContainer *container, -- 2.30.2
[PATCH RESEND v7 07/13] vfio: Query and store the maximum number of possible DMA mappings
Let's query the maximum number of possible DMA mappings by querying the available mappings when creating the container (before any mappings are created). We'll use this informaton soon to perform some sanity checks and warn the user. Reviewed-by: Alex Williamson Acked-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 4 include/hw/vfio/vfio-common.h | 1 + 2 files changed, 5 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 5af7755227..79628d60ae 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1931,6 +1931,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, container->fd = fd; container->error = NULL; container->dirty_pages_supported = false; +container->dma_max_mappings = 0; QLIST_INIT(&container->giommu_list); QLIST_INIT(&container->hostwin_list); QLIST_INIT(&container->vrdl_list); @@ -1962,7 +1963,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, vfio_host_win_add(container, 0, (hwaddr)-1, info->iova_pgsizes); container->pgsizes = info->iova_pgsizes; +/* The default in the kernel ("dma_entry_limit") is 65535. 
*/ +container->dma_max_mappings = 65535; if (!ret) { +vfio_get_info_dma_avail(info, &container->dma_max_mappings); vfio_get_iommu_info_migration(container, info); } g_free(info); diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 681432213d..8af11b0a76 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -88,6 +88,7 @@ typedef struct VFIOContainer { uint64_t dirty_pgsizes; uint64_t max_dirty_bitmap_size; unsigned long pgsizes; +unsigned int dma_max_mappings; QLIST_HEAD(, VFIOGuestIOMMU) giommu_list; QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list; QLIST_HEAD(, VFIOGroup) group_list; -- 2.30.2
[PATCH RESEND v7 13/13] vfio: Disable only uncoordinated discards for VFIO_TYPE1 iommus
We support coordinated discarding of RAM using the RamDiscardManager for the VFIO_TYPE1 iommus. Let's unlock support for coordinated discards, keeping uncoordinated discards (e.g., via virtio-balloon) disabled if possible. This unlocks virtio-mem + vfio on x86-64. Note that vfio used via "nvme://" by the block layer has to be implemented/unlocked separately. For now, virtio-mem only supports x86-64; we don't restrict RamDiscardManager to x86-64, though: arm64 and s390x are supposed to work as well, and we'll test once unlocking virtio-mem support. The spapr IOMMUs will need special care, to be tackled later, e.g.., once supporting virtio-mem. Note: The block size of a virtio-mem device has to be set to sane sizes, depending on the maximum hotplug size - to not run out of vfio mappings. The default virtio-mem block size is usually in the range of a couple of MBs. The maximum number of mapping is 64k, shared with other users. Assume you want to hotplug 256GB using virtio-mem - the block size would have to be set to at least 8 MiB (resulting in 32768 separate mappings). Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c | 65 +++- 1 file changed, 53 insertions(+), 12 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 8a9bbf2791..3f0d111360 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -135,6 +135,29 @@ static const char *index_to_str(VFIODevice *vbasedev, int index) } } +static int vfio_ram_block_discard_disable(VFIOContainer *container, bool state) +{ +switch (container->iommu_type) { +case VFIO_TYPE1v2_IOMMU: +case VFIO_TYPE1_IOMMU: +/* + * We support coordinated discarding of RAM via the RamDiscardManager. 
+ */ +return ram_block_uncoordinated_discard_disable(state); +default: +/* + * VFIO_SPAPR_TCE_IOMMU most probably works just fine with + * RamDiscardManager, however, it is completely untested. + * + * VFIO_SPAPR_TCE_v2_IOMMU with "DMA memory preregistering" does + * completely the opposite of managing mapping/pinning dynamically as + * required by RamDiscardManager. We would have to special-case sections + * with a RamDiscardManager. + */ +return ram_block_discard_disable(state); +} +} + int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex, int action, int fd, Error **errp) { @@ -1977,15 +2000,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, * new memory, it will not yet set ram_block_discard_set_required() and * therefore, neither stops us here or deals with the sudden memory * consumption of inflated memory. + * + * We do support discarding of memory coordinated via the RamDiscardManager + * with some IOMMU types. vfio_ram_block_discard_disable() handles the + * details once we know which type of IOMMU we are using. 
*/ -ret = ram_block_discard_disable(true); -if (ret) { -error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); -return ret; -} QLIST_FOREACH(container, &space->containers, next) { if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { +ret = vfio_ram_block_discard_disable(container, true); +if (ret) { +error_setg_errno(errp, -ret, + "Cannot set discarding of RAM broken"); +if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, + &container->fd)) { +error_report("vfio: error disconnecting group %d from" + " container", group->groupid); +} +return ret; +} group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); vfio_kvm_device_add_group(group); @@ -2023,6 +2056,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, goto free_container_exit; } +ret = vfio_ram_block_discard_disable(container, true); +if (ret) { +error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken"); +goto free_container_exit; +} + switch (container->iommu_type) { case VFIO_TYPE1v2_IOMMU: case VFIO_TYPE1_IOMMU: @@ -2070,7 +2109,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, if (ret) { error_setg_errno(errp, errno, "failed to enable container"); ret = -errno; -goto free_conta
[PATCH RESEND v7 05/13] virtio-mem: Implement RamDiscardManager interface
Let's properly notify when (un)plugging blocks, after discarding memory and before allowing the guest to consume memory. Handle errors from notifiers gracefully (e.g., no remaining VFIO mappings) when plugging, rolling back the change and telling the guest that the VM is busy. One special case to take care of is replaying all notifications after restoring the vmstate. The device starts out with all memory discarded, so after loading the vmstate, we have to notify about all plugged blocks. Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 288 - include/hw/virtio/virtio-mem.h | 3 + 2 files changed, 288 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index bbe42ad83b..e209b56057 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -172,7 +172,146 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg, return ret; } -static bool virtio_mem_test_bitmap(VirtIOMEM *vmem, uint64_t start_gpa, +/* + * Adjust the memory section to cover the intersection with the given range. + * + * Returns false if the intersection is empty, otherwise returns true. 
+ */ +static bool virito_mem_intersect_memory_section(MemoryRegionSection *s, +uint64_t offset, uint64_t size) +{ +uint64_t start = MAX(s->offset_within_region, offset); +uint64_t end = MIN(s->offset_within_region + int128_get64(s->size), + offset + size); + +if (end <= start) { +return false; +} + +s->offset_within_address_space += start - s->offset_within_region; +s->offset_within_region = start; +s->size = int128_make64(end - start); +return true; +} + +typedef int (*virtio_mem_section_cb)(MemoryRegionSection *s, void *arg); + +static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem, + MemoryRegionSection *s, + void *arg, + virtio_mem_section_cb cb) +{ +unsigned long first_bit, last_bit; +uint64_t offset, size; +int ret = 0; + +first_bit = s->offset_within_region / vmem->bitmap_size; +first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, first_bit); +while (first_bit < vmem->bitmap_size) { +MemoryRegionSection tmp = *s; + +offset = first_bit * vmem->block_size; +last_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, + first_bit + 1) - 1; +size = (last_bit - first_bit + 1) * vmem->block_size; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +break; +} +ret = cb(&tmp, arg); +if (ret) { +break; +} +first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + last_bit + 2); +} +return ret; +} + +static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg) +{ +RamDiscardListener *rdl = arg; + +return rdl->notify_populate(rdl, s); +} + +static int virtio_mem_notify_discard_cb(MemoryRegionSection *s, void *arg) +{ +RamDiscardListener *rdl = arg; + +rdl->notify_discard(rdl, s); +return 0; +} + +static void virtio_mem_notify_unplug(VirtIOMEM *vmem, uint64_t offset, + uint64_t size) +{ +RamDiscardListener *rdl; + +QLIST_FOREACH(rdl, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +rdl->notify_discard(rdl, &tmp); +} 
+} + +static int virtio_mem_notify_plug(VirtIOMEM *vmem, uint64_t offset, + uint64_t size) +{ +RamDiscardListener *rdl, *rdl2; +int ret = 0; + +QLIST_FOREACH(rdl, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +ret = rdl->notify_populate(rdl, &tmp); +if (ret) { +break; +} +} + +if (ret) { +/* Notify all already-notified listeners. */ +QLIST_FOREACH(rdl2, &vmem->rdl_list, next) { +MemoryRegionSection tmp = *rdl->section; + +if (rdl2 == rdl) { +break; +} +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +continue; +} +rdl2->notify_discard(rdl2, &tmp); +} +} +return re
[PATCH RESEND v7 11/13] softmmu/physmem: Extend ram_block_discard_(require|disable) by two discard types
We want to separate the two cases whereby we discard ram - uncoordinated: e.g., virito-balloon - coordinated: e.g., virtio-mem coordinated via the RamDiscardManager Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- include/exec/memory.h | 18 +-- softmmu/physmem.c | 54 ++- 2 files changed, 64 insertions(+), 8 deletions(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index e806d0140e..bba7b6643e 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -2873,6 +2873,12 @@ static inline MemOp devend_memop(enum device_endian end) */ int ram_block_discard_disable(bool state); +/* + * See ram_block_discard_disable(): only disable uncoordinated discards, + * keeping coordinated discards (via the RamDiscardManager) enabled. + */ +int ram_block_uncoordinated_discard_disable(bool state); + /* * Inhibit technologies that disable discarding of pages in RAM blocks. * @@ -2882,12 +2888,20 @@ int ram_block_discard_disable(bool state); int ram_block_discard_require(bool state); /* - * Test if discarding of memory in ram blocks is disabled. + * See ram_block_discard_require(): only inhibit technologies that disable + * uncoordinated discarding of pages in RAM blocks, allowing co-existance with + * technologies that only inhibit uncoordinated discards (via the + * RamDiscardManager). + */ +int ram_block_coordinated_discard_require(bool state); + +/* + * Test if any discarding of memory in ram blocks is disabled. */ bool ram_block_discard_is_disabled(void); /* - * Test if discarding of memory in ram blocks is required to work reliably. + * Test if any discarding of memory in ram blocks is required to work reliably. 
*/ bool ram_block_discard_is_required(void); diff --git a/softmmu/physmem.c b/softmmu/physmem.c index aaa2b2eb92..ead7b5c429 100644 --- a/softmmu/physmem.c +++ b/softmmu/physmem.c @@ -3647,8 +3647,14 @@ void mtree_print_dispatch(AddressSpaceDispatch *d, MemoryRegion *root) } } +/* Require any discards to work. */ static unsigned int ram_block_discard_required_cnt; +/* Require only coordinated discards to work. */ +static unsigned int ram_block_coordinated_discard_required_cnt; +/* Disable any discards. */ static unsigned int ram_block_discard_disabled_cnt; +/* Disable only uncoordinated discards. */ +static unsigned int ram_block_uncoordinated_discard_disabled_cnt; static QemuMutex ram_block_discard_disable_mutex; static void ram_block_discard_disable_mutex_lock(void) @@ -3674,10 +3680,27 @@ int ram_block_discard_disable(bool state) ram_block_discard_disable_mutex_lock(); if (!state) { ram_block_discard_disabled_cnt--; -} else if (!ram_block_discard_required_cnt) { -ram_block_discard_disabled_cnt++; +} else if (ram_block_discard_required_cnt || + ram_block_coordinated_discard_required_cnt) { +ret = -EBUSY; } else { +ram_block_discard_disabled_cnt++; +} +ram_block_discard_disable_mutex_unlock(); +return ret; +} + +int ram_block_uncoordinated_discard_disable(bool state) +{ +int ret = 0; + +ram_block_discard_disable_mutex_lock(); +if (!state) { +ram_block_uncoordinated_discard_disabled_cnt--; +} else if (ram_block_discard_required_cnt) { ret = -EBUSY; +} else { +ram_block_uncoordinated_discard_disabled_cnt++; } ram_block_discard_disable_mutex_unlock(); return ret; @@ -3690,10 +3713,27 @@ int ram_block_discard_require(bool state) ram_block_discard_disable_mutex_lock(); if (!state) { ram_block_discard_required_cnt--; -} else if (!ram_block_discard_disabled_cnt) { -ram_block_discard_required_cnt++; +} else if (ram_block_discard_disabled_cnt || + ram_block_uncoordinated_discard_disabled_cnt) { +ret = -EBUSY; } else { +ram_block_discard_required_cnt++; +} 
+ram_block_discard_disable_mutex_unlock(); +return ret; +} + +int ram_block_coordinated_discard_require(bool state) +{ +int ret = 0; + +ram_block_discard_disable_mutex_lock(); +if (!state) { +ram_block_coordinated_discard_required_cnt--; +} else if (ram_block_discard_disabled_cnt) { ret = -EBUSY; +} else { +ram_block_coordinated_discard_required_cnt++; } ram_block_discard_disable_mutex_unlock(); return ret; @@ -3701,10 +3741,12 @@ int ram_block_discard_require(bool state) bool ram_block_discard_is_disabled(void) { -return qatomic_read(&ram_block_discard_disabled_cnt); +return qatomic_read(&ram_block_discard_disabled_cnt) || + qatomic_read(&ram_block_uncoordinated_discard_disabled_cnt); } bool ra
[PATCH RESEND v7 09/13] vfio: Support for RamDiscardManager in the vIOMMU case
vIOMMU support works already with RamDiscardManager as long as guests only map populated memory. Both, populated and discarded memory is mapped into &address_space_memory, where vfio_get_xlat_addr() will find that memory, to create the vfio mapping. Sane guests will never map discarded memory (e.g., unplugged memory blocks in virtio-mem) into an IOMMU - or keep it mapped into an IOMMU while memory is getting discarded. However, there are two cases where a malicious guests could trigger pinning of more memory than intended. One case is easy to handle: the guest trying to map discarded memory into an IOMMU. The other case is harder to handle: the guest keeping memory mapped in the IOMMU while it is getting discarded. We would have to walk over all mappings when discarding memory and identify if any mapping would be a violation. Let's keep it simple for now and print a warning, indicating that setting RLIMIT_MEMLOCK can mitigate such attacks. We have to take care of incoming migration: at the point the IOMMUs get restored and start creating mappings in vfio, RamDiscardManager implementations might not be back up and running yet: let's add runstate priorities to enforce the order when restoring. Acked-by: Alex Williamson Reviewed-by: Alex Williamson Acked-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. 
David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/vfio/common.c| 39 + hw/virtio/virtio-mem.c | 1 + include/migration/vmstate.h | 1 + 3 files changed, 41 insertions(+) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index f8a2fe8441..8a9bbf2791 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -36,6 +36,7 @@ #include "qemu/range.h" #include "sysemu/kvm.h" #include "sysemu/reset.h" +#include "sysemu/runstate.h" #include "trace.h" #include "qapi/error.h" #include "migration/migration.h" @@ -569,6 +570,44 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, error_report("iommu map to non memory area %"HWADDR_PRIx"", xlat); return false; +} else if (memory_region_has_ram_discard_manager(mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(mr); +MemoryRegionSection tmp = { +.mr = mr, +.offset_within_region = xlat, +.size = int128_make64(len), +}; + +/* + * Malicious VMs can map memory into the IOMMU, which is expected + * to remain discarded. vfio will pin all pages, populating memory. + * Disallow that. vmstate priorities make sure any RamDiscardManager + * were already restored before IOMMUs are restored. + */ +if (!ram_discard_manager_is_populated(rdm, &tmp)) { +error_report("iommu map to discarded memory (e.g., unplugged via" + " virtio-mem): %"HWADDR_PRIx"", + iotlb->translated_addr); +return false; +} + +/* + * Malicious VMs might trigger discarding of IOMMU-mapped memory. The + * pages will remain pinned inside vfio until unmapped, resulting in a + * higher memory consumption than expected. If memory would get + * populated again later, there would be an inconsistency between pages + * pinned by vfio and pages seen by QEMU. This is the case until + * unmapped from the IOMMU (e.g., during device reset). + * + * With malicious guests, we really only care about pinning more memory + * than expected. 
RLIMIT_MEMLOCK set for the user/process can never be + * exceeded and can be used to mitigate this problem. + */ +warn_report_once("Using vfio with vIOMMUs and coordinated discarding of" + " RAM (e.g., virtio-mem) works, however, malicious" + " guests can trigger pinning of more memory than" + " intended via an IOMMU. It's possible to mitigate " + " by setting/adjusting RLIMIT_MEMLOCK."); } /* diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index e209b56057..cbd07fc9f1 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -886,6 +886,7 @@ static const VMStateDescription vmstate_virtio_mem_device = { .name = "virtio-mem-device", .minimum_version_id = 1, .version_id = 1, +.priority = MIG_PRI_VIRTIO_MEM, .post_load = virtio_mem_post_load, .fields = (VMStateField[]) { VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks, diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h index 075ee80096..3bf58ff043 100644 --- a/include/migration/vmstate.h +++ b/include/migration/vmstate.h @@ -153,6 +153,7 @@ typedef enum
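A toy model may make the first case concrete: a bitmap tracks which fixed-size blocks of a region are currently populated, and an IOMMU map request is refused unless every block it covers is populated. The block size, names, and bitmap layout below are invented for illustration; the real check goes through the RamDiscardManager interface (ram_discard_manager_is_populated()).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the populated-memory check: one bit per BLOCK_SIZE block,
 * set when the block is populated, clear when it has been discarded.
 * A mapping request covering any discarded block is rejected. */
#define BLOCK_SIZE 4096u

static bool range_is_populated(const uint8_t *bitmap,
                               uint64_t offset, uint64_t len)
{
    uint64_t first = offset / BLOCK_SIZE;
    uint64_t last = (offset + len - 1) / BLOCK_SIZE;

    for (uint64_t b = first; b <= last; b++) {
        if (!(bitmap[b / 8] & (1u << (b % 8)))) {
            return false;   /* a discarded block inside the range */
        }
    }
    return true;
}
```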
[PATCH RESEND v7 12/13] virtio-mem: Require only coordinated discards
We implement the RamDiscardManager interface and only require coordinated discarding of RAM to work. Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Pankaj Gupta Acked-by: Michael S. Tsirkin Reviewed-by: Michael S. Tsirkin Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Alex Williamson Cc: Dr. David Alan Gilbert Cc: Igor Mammedov Cc: Pankaj Gupta Cc: Peter Xu Cc: Auger Eric Cc: Wei Yang Cc: teawater Cc: Marek Kedzierski Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index cbd07fc9f1..9e79b5b50c 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -719,7 +719,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) return; } -if (ram_block_discard_require(true)) { +if (ram_block_coordinated_discard_require(true)) { error_setg(errp, "Discarding RAM is disabled"); return; } @@ -727,7 +727,7 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) ret = ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb)); if (ret) { error_setg_errno(errp, -ret, "Unexpected error discarding RAM"); -ram_block_discard_require(false); +ram_block_coordinated_discard_require(false); return; } @@ -771,7 +771,7 @@ static void virtio_mem_device_unrealize(DeviceState *dev) virtio_del_queue(vdev, 0); virtio_cleanup(vdev); g_free(vmem->bitmap); -ram_block_discard_require(false); +ram_block_coordinated_discard_require(false); } static int virtio_mem_discard_range_cb(const VirtIOMEM *vmem, void *arg, -- 2.30.2
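The ram_block_coordinated_discard_require() used above pairs with ram_block_discard_disable(): two reference counts that exclude each other under a single mutex. Stripped of QEMU specifics (and of the separate uncoordinated counter that the real softmmu/physmem.c code also tracks), the scheme looks roughly like this; all names are invented:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned discard_disabled_cnt;       /* holders of "no discards" */
static unsigned coordinated_required_cnt;   /* holders of "discards needed" */

static int coordinated_discard_require(bool state)
{
    int ret = 0;

    pthread_mutex_lock(&lock);
    if (!state) {
        coordinated_required_cnt--;         /* drop one reference */
    } else if (discard_disabled_cnt) {
        ret = -EBUSY;                       /* the other side holds it */
    } else {
        coordinated_required_cnt++;         /* take one reference */
    }
    pthread_mutex_unlock(&lock);
    return ret;
}

static int discard_disable(bool state)
{
    int ret = 0;

    pthread_mutex_lock(&lock);
    if (!state) {
        discard_disabled_cnt--;
    } else if (coordinated_required_cnt) {
        ret = -EBUSY;                       /* a device needs discards */
    } else {
        discard_disabled_cnt++;
    }
    pthread_mutex_unlock(&lock);
    return ret;
}
```

A device like virtio-mem takes the reference in realize and drops it in unrealize or on the error path, exactly as the hunks above show.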
[PULL 0/3] MIPS patches for 2021-04-13
The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 12:12:09 +0100) are available in the Git repository at: https://github.com/philmd/qemu.git tags/mips-20210413 for you to fetch changes up to f4349ba966abfe39f5d98694abd7c7551d5c8c02: target/mips: Fix TCG temporary leak in gen_cache_operation() (2021-04-13 12:07:00 +0200) MIPS patches queue - Fix invalid Kconfig dependency - Fix missing migrated value - Fix TCG temporary leak Philippe Mathieu-Daudé (3): hw/isa/Kconfig: Add missing dependency VIA VT82C686 -> APM hw/isa/piix4: Migrate Reset Control Register target/mips: Fix TCG temporary leak in gen_cache_operation() hw/isa/piix4.c | 15 ++- target/mips/translate.c | 2 ++ hw/isa/Kconfig | 1 + 3 files changed, 17 insertions(+), 1 deletion(-) -- 2.26.3
[PULL 1/3] hw/isa/Kconfig: Add missing dependency VIA VT82C686 -> APM
TYPE_VIA_PM calls apm_init() in via_pm_realize(), so requires APM to be selected. Reported-by: BALATON Zoltan Fixes: dd0ff8191ab ("isa: express SuperIO dependencies with Kconfig") Signed-off-by: Philippe Mathieu-Daudé Message-Id: <20210302080531.913802-1-f4...@amsat.org> --- hw/isa/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/isa/Kconfig b/hw/isa/Kconfig index 2691eae2f0c..55e0003ce40 100644 --- a/hw/isa/Kconfig +++ b/hw/isa/Kconfig @@ -48,6 +48,7 @@ config VT82C686 select SERIAL_ISA select FDC select USB_UHCI +select APM config SMC37C669 bool -- 2.26.3
[PULL 2/3] hw/isa/piix4: Migrate Reset Control Register
When adding the Reset Control Register in commit 5790b757cfb we forgot to migrate it. While it is possible that a VM using the PIIX4 is migrated just after requesting a system shutdown, it is very unlikely. However, when restoring a migrated VM, we might have the RCR bit #4 set on the stack, and when the VM resumes it directly shuts down. Add a post_load() migration handler and set the default RCR value to 0 for earlier versions, assuming the VM was not going to shut down before migration. Fixes: 5790b757cfb ("piix4: Add the Reset Control Register") Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Dr. David Alan Gilbert Message-Id: <20210324200334.729899-1-f4...@amsat.org> --- hw/isa/piix4.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/hw/isa/piix4.c b/hw/isa/piix4.c index a50d97834c7..b3b6a4378a3 100644 --- a/hw/isa/piix4.c +++ b/hw/isa/piix4.c @@ -93,12 +93,25 @@ static void piix4_isa_reset(DeviceState *dev) pci_conf[0xae] = 0x00; } +static int piix4_ide_post_load(void *opaque, int version_id) +{ +PIIX4State *s = opaque; + +if (version_id == 2) { +s->rcr = 0; +} + +return 0; +} + static const VMStateDescription vmstate_piix4 = { .name = "PIIX4", -.version_id = 2, +.version_id = 3, .minimum_version_id = 2, +.post_load = piix4_ide_post_load, .fields = (VMStateField[]) { VMSTATE_PCI_DEVICE(dev, PIIX4State), +VMSTATE_UINT8_V(rcr, PIIX4State, 3), VMSTATE_END_OF_LIST() } }; -- 2.26.3
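The shape of the compatibility fix can be modelled in a few lines: version 2 streams carry no RCR value, so post-load must pick a safe default instead of keeping whatever happens to be in the field, while version 3 streams keep what they carried. The struct and function below are stand-ins, not QEMU's actual types.

```c
#include <stdint.h>

/* Minimal stand-alone model of the post_load compatibility handler. */
typedef struct {
    uint8_t rcr;    /* Reset Control Register, migrated only since v3 */
} ToyPIIX4State;

static int toy_post_load(ToyPIIX4State *s, int version_id)
{
    if (version_id == 2) {
        s->rcr = 0; /* old stream: assume no reset/shutdown requested */
    }
    return 0;
}
```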
[PULL 3/3] target/mips: Fix TCG temporary leak in gen_cache_operation()
Fix a TCG temporary leak when translating CACHE opcode. Fixes: 0d74a222c27 ("make ITC Configuration Tags accessible to the CPU") Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Richard Henderson Message-Id: <20210406202857.1440744-1-f4...@amsat.org> --- target/mips/translate.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/target/mips/translate.c b/target/mips/translate.c index c518bf3963b..71fa5ec1973 100644 --- a/target/mips/translate.c +++ b/target/mips/translate.c @@ -12804,6 +12804,8 @@ static void gen_cache_operation(DisasContext *ctx, uint32_t op, int base, TCGv t1 = tcg_temp_new(); gen_base_offset_addr(ctx, t1, base, offset); gen_helper_cache(cpu_env, t1, t0); +tcg_temp_free(t1); +tcg_temp_free_i32(t0); } #if defined(TARGET_MIPS64) -- 2.26.3
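The underlying discipline is simply that every translation-time temporary must be released before the translator returns. A toy leak counter (invented names, not TCG's real tcg_temp_new()/tcg_temp_free() API) shows the pairing the patch restores:

```c
/* Toy leak counter illustrating the invariant the patch restores. */
static int live_temps;

static int temp_new(void)
{
    return ++live_temps;        /* allocate: one more live temporary */
}

static void temp_free(int t)
{
    (void)t;
    --live_temps;               /* release: pairs with temp_new() */
}

static void gen_cache_operation_model(void)
{
    int t0 = temp_new();
    int t1 = temp_new();
    /* ... code generation using t0 and t1 would go here ... */
    temp_free(t1);              /* the two calls the patch adds */
    temp_free(t0);
}
```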
Re: [PULL 0/1] NBD fix for 6.0-rc3
On Tue, 13 Apr 2021 at 07:47, Vladimir Sementsov-Ogievskiy wrote: > > 12.04.2021 18:48, Peter Maydell wrote: > > On Mon, 12 Apr 2021 at 13:19, Vladimir Sementsov-Ogievskiy > > wrote: > >> > >> The following changes since commit > >> 555249a59e9cdd6b58da103aba5cf3a2d45c899f: > >> > >>Merge remote-tracking branch > >> 'remotes/ehabkost-gl/tags/x86-next-pull-request' into staging (2021-04-10 > >> 16:58:56 +0100) > >> > >> are available in the Git repository at: > >> > >>https://src.openvz.org/scm/~vsementsov/qemu.git tags/pull-nbd-2021-04-12 > >> > >> for you to fetch changes up to d3c278b689845558cd9811940436b28ae6afc5d7: > >> > >>block/nbd: fix possible use after free of s->connect_thread (2021-04-12 > >> 11:56:03 +0300) > >> > >> > >> One fix of possible use-after-free in NBD block-driver for 6.0-rc3 > >> > >> > >> > >> Note: the tag is signed by a new key, as I've lost the old one. So, > >> now there are two keys with my name on http://keys.gnupg.net, the elder > >> is lost. I feel stupid about that :(. Anyway, both keys are not signed by > >> anybody except for myself. And this is my first pull-request to Qemu, > >> so, I think some kind of TOFU still applies. > > > > I'd really rather not deal with trying to add a new source of pull > > requests the day before rc3, please. Eric, could you do a pull > > or something? > Hmm. Ok, that's not a degradation of 6.0 and there is no existing bug > somewhere, so we can just not care for 6.0. I think fixing a use-after-free is sensible for rc3; it's only having it come to me via a new unknown submitter of pull requests that I'm not happy about. thanks -- PMM
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
Hi Paolo, On 13.04.21 09:42, Paolo Bonzini wrote: > On 07/04/21 17:42, Kevin Wolf wrote: >>> +* Publishing other's private information, such as physical or >>> electronic >>> +addresses, without explicit permission >> >> Yes, it's pretty clear that I'm not publishing new information about >> people when I'm keeping them in Cc: when replying to a thread, or even >> when they posted in another thread on the list recently. It becomes much >> less clear for adding people who aren't usually part of the QEMU >> community. > > If you took the email from, say, the Libvirt or kernel mailing lists, > that would not be considered private. If somebody has two email > addresses and you deliberately Cc him on an address that he's only using > for communications within his family, that would be a problem. I have to admit I had originally stumbled over this bullet point myself, reading it as private=personal. So maybe it might help avoid ambiguities for non-native readers to formulate it as "non-public" instead? Like, if someone posts to a public mailing list with their private as opposed to business address in the footer. Then I would consider it public. I did intentionally use a private email for topics such as PReP. Or consider the case you get a bug report not copied to the public mailing lists from someone you don't know. Then I would still expect to be allowed to attribute a commit via Reported-by/CC to that person, as it seems in his/her interest to get the bug fixed and be notified, unless explicitly requested otherwise. Mistakes can always happen, but I feel it needs to be the responsibility of the sender, not of the receiver, to ensure that only data is shared that the project members may use for valid development purposes. Not sure how to extend that bullet point to make its purpose/scope clearer. Cheers, Andreas -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Felix Imendörffer HRB 36809 (AG Nürnberg)
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
On Tue, 13 Apr 2021 at 11:23, Andreas Färber wrote: > Or consider the case you get a bug report not copied to the public > mailing lists from someone you don't know. Then I would still expect to > be allowed to attribute a commit via Reported-by/CC to that person, as > it seems in his/her interest to get the bug fixed and be notified, > unless explicitly requested otherwise. FWIW, in this kind of situation, I generally try to explicitly ask the submitter if they're OK with my adding a reported-by tag, just as a matter of politeness. Not everybody is OK with having their email address publicly recorded on mailing list archives and in git history forever. thanks -- PMM
Re: Commit "x86/kvm: Move context tracking where it belongs" broke guest time accounting
On Wed, 7 Apr 2021 at 18:55, Michael Tokarev wrote: > > Hi! > > It looks like this commit: > > commit 87fa7f3e98a1310ef1ac1900e7ee7f9610a038bc > Author: Thomas Gleixner > Date: Wed Jul 8 21:51:54 2020 +0200 > > x86/kvm: Move context tracking where it belongs > > Context tracking for KVM happens way too early in the vcpu_run() > code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() > cannot use RCU and should also be not instrumented. > > The current way of doing this covers way too much code. Move it closer to > the actual vmenter/exit code. > > broke kvm guest cpu time accounting - after this commit, when running > qemu-system-x86_64 -enable-kvm, the guest time (in /proc/stat and > elsewhere) is always 0. > > I dunno why it happened, but it happened, and all kernels after 5.9 > are affected by this. > > This commit is found in a (painful) git bisect between kernel 5.8 and 5.10. Hi Michael, Please have a try. https://lore.kernel.org/kvm/1618298169-3831-1-git-send-email-wanpen...@tencent.com/ Wanpeng
Re: [PATCH] cutils: fix memory leak in get_relocated_path()
Is this fix aiming at 6.0 release? On 4/12/21 7:02 PM, Stefano Garzarella wrote: > get_relocated_path() allocates a GString object and returns the > character data (C string) to the caller without freeing the memory > allocated for that object as reported by valgrind: > > 24 bytes in 1 blocks are definitely lost in loss record 2,805 of 6,532 > at 0x4839809: malloc (vg_replace_malloc.c:307) > by 0x55AABB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C2481: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C4827: g_string_sized_new (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x55C4CEA: g_string_new (in /usr/lib64/libglib-2.0.so.0.6600.8) > by 0x906314: get_relocated_path (cutils.c:1036) > by 0x6E1F77: qemu_read_default_config_file (vl.c:2122) > by 0x6E1F77: qemu_init (vl.c:2687) > by 0x3E3AF8: main (main.c:49) > > Let's use g_string_free(gstring, false) to free only the GString object > and transfer the ownership of the character data to the caller. > > Fixes: f4f5ed2cbd ("cutils: introduce get_relocated_path") > Signed-off-by: Stefano Garzarella > --- > util/cutils.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/util/cutils.c b/util/cutils.c > index ee908486da..c9b91e7535 100644 > --- a/util/cutils.c > +++ b/util/cutils.c > @@ -1055,5 +1055,5 @@ char *get_relocated_path(const char *dir) > assert(G_IS_DIR_SEPARATOR(dir[-1])); > g_string_append(result, dir - 1); > } > -return result->str; > +return g_string_free(result, false); > } >
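The glib idiom may be easier to see in a dependency-free analogue: freeing only the wrapper object while transferring ownership of the heap-allocated character data to the caller, which mirrors what g_string_free(result, false) does in the patch. The type and functions below are invented for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-alone analogue of g_string_free(gs, FALSE). */
typedef struct {
    char *str;
} StrBuf;

static StrBuf *str_new(const char *init)
{
    StrBuf *sb = malloc(sizeof(*sb));

    sb->str = malloc(strlen(init) + 1);
    strcpy(sb->str, init);
    return sb;
}

static char *str_free(StrBuf *sb, bool free_data)
{
    char *s = sb->str;

    if (free_data) {
        free(s);
        s = NULL;
    }
    free(sb);       /* the wrapper itself is always released */
    return s;       /* caller owns the data when free_data is false */
}
```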
[PATCH for-6.0] x86: acpi: use offset instead of pointer when using build_header()
Do the same as in commit 4d027afeb3a97 ("Virt: ACPI: fix qemu assert due to re-assigned table data address") for the remaining tables that use a pointer saved at the beginning to build the header, to avoid an assert when table_data is relocated due to an implicit resize. Reported-in: https://bugs.launchpad.net/bugs/1923497 Signed-off-by: Igor Mammedov --- PS: I have a build_header() refactoring patch that requires an offset instead of a pointer, to make it harder to misuse, but it's a bit intrusive for last-minute fixes. So here goes the simplified variant; I'll post the refactoring patch for 6.1 later. --- hw/acpi/aml-build.c | 15 +-- hw/i386/acpi-build.c | 8 ++-- 2 files changed, 15 insertions(+), 8 deletions(-) diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c index d33ce8954a..f0035d2b4a 100644 --- a/hw/acpi/aml-build.c +++ b/hw/acpi/aml-build.c @@ -1830,6 +1830,7 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, int i; unsigned rsdt_entries_offset; AcpiRsdtDescriptorRev1 *rsdt; +int rsdt_start = table_data->len; const unsigned table_data_len = (sizeof(uint32_t) * table_offsets->len); const unsigned rsdt_entry_size = sizeof(rsdt->table_offset_entry[0]); const size_t rsdt_len = sizeof(*rsdt) + table_data_len; @@ -1846,7 +1847,8 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, ACPI_BUILD_TABLE_FILE, ref_tbl_offset); } build_header(linker, table_data, - (void *)rsdt, "RSDT", rsdt_len, 1, oem_id, oem_table_id); + (void *)(table_data->data + rsdt_start), + "RSDT", rsdt_len, 1, oem_id, oem_table_id); } /* Build xsdt table */ @@ -1857,6 +1859,7 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, int i; unsigned xsdt_entries_offset; AcpiXsdtDescriptorRev2 *xsdt; +int xsdt_start = table_data->len; const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len); const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]); const size_t xsdt_len = sizeof(*xsdt) + table_data_len; @@ -1873,7
+1876,8 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, GArray *table_offsets, ACPI_BUILD_TABLE_FILE, ref_tbl_offset); } build_header(linker, table_data, - (void *)xsdt, "XSDT", xsdt_len, 1, oem_id, oem_table_id); + (void *)(table_data->data + xsdt_start), + "XSDT", xsdt_len, 1, oem_id, oem_table_id); } void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, @@ -2053,10 +2057,9 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, uint64_t control_area_start_address; TPMIf *tpmif = tpm_find(); uint32_t start_method; -void *tpm2_ptr; tpm2_start = table_data->len; -tpm2_ptr = acpi_data_push(table_data, sizeof(AcpiTableHeader)); +acpi_data_push(table_data, sizeof(AcpiTableHeader)); /* Platform Class */ build_append_int_noprefix(table_data, TPM2_ACPI_CLASS_CLIENT, 2); @@ -2095,8 +2098,8 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, log_addr_offset, 8, ACPI_BUILD_TPMLOG_FILE, 0); build_header(linker, table_data, - tpm2_ptr, "TPM2", table_data->len - tpm2_start, 4, oem_id, - oem_table_id); + (void *)(table_data->data + tpm2_start), + "TPM2", table_data->len - tpm2_start, 4, oem_id, oem_table_id); } Aml *build_crs(PCIHostState *host, CrsRangeSet *range_set, uint32_t io_offset, diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c index de98750aef..daaf8f473e 100644 --- a/hw/i386/acpi-build.c +++ b/hw/i386/acpi-build.c @@ -1816,6 +1816,7 @@ build_hpet(GArray *table_data, BIOSLinker *linker, const char *oem_id, const char *oem_table_id) { Acpi20Hpet *hpet; +int hpet_start = table_data->len; hpet = acpi_data_push(table_data, sizeof(*hpet)); /* Note timer_block_id value must be kept in sync with value advertised by @@ -1824,13 +1825,15 @@ build_hpet(GArray *table_data, BIOSLinker *linker, const char *oem_id, hpet->timer_block_id = cpu_to_le32(0x8086a201); hpet->addr.address = cpu_to_le64(HPET_BASE); build_header(linker, table_data, - (void *)hpet, "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); + 
(void *)(table_data->data + hpet_start), + "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); } static void build_tpm_tcpa(GArray *table_data, BIOSLinker *linker, GArray *tcpalog, const char *oem_id, const char *oem_table_id) { +int tcpa_start = table_data->len; Acpi20Tcpa *tcpa = acpi_data_push(table_data, sizeof *tcpa); unsigned log_addr_size = sizeof(tcpa->log_area_start_address); unsigned log_addr_offset = @@ -184
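The bug class being fixed is easy to reproduce outside QEMU: a growable byte array may realloc() its backing storage, so a pointer taken before the growth can dangle, while a byte offset stays valid and the pointer can be recomputed from it at use time. A minimal sketch with invented types (error handling elided):

```c
#include <stdlib.h>
#include <string.h>

/* Growable byte array whose backing buffer may move on growth. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} ByteArray;

static size_t push(ByteArray *a, const void *bytes, size_t n)
{
    size_t off = a->len;

    if (a->len + n > a->cap) {
        a->cap = (a->len + n) * 2;
        a->data = realloc(a->data, a->cap);   /* may move the buffer */
    }
    memcpy(a->data + a->len, bytes, n);
    a->len += n;
    return off;     /* hand back an offset, never a raw pointer */
}
```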
Re: [PATCH RFC RESEND v2 3/6] hw/pci: Add pci_root_bus_max_bus
Hi Eric, On 2021/4/13 16:05, Auger Eric wrote: Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: From: Xingang Wang This helps to find max bus number of a root bus. s/max bus number of a root bus/highest bus number of a bridge hierarchy? Thanks, I will change the description. Signed-off-by: Xingang Wang Signed-off-by: Jiahui Cen --- hw/pci/pci.c | 34 ++ include/hw/pci/pci.h | 1 + 2 files changed, 35 insertions(+) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index e17aa9075f..c7957cbf7c 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -538,6 +538,40 @@ int pci_bus_num(PCIBus *s) return PCI_BUS_GET_CLASS(s)->bus_num(s); } +int pci_root_bus_max_bus(PCIBus *bus) +{ +PCIHostState *host; +PCIDevice *dev; +int max_bus = 0; +int type, devfn; +uint8_t subordinate; + +if (!pci_bus_is_root(bus)) { +return 0; +} + +host = PCI_HOST_BRIDGE(BUS(bus)->parent); +max_bus = pci_bus_num(host->bus); + +for (devfn = 0; devfn < ARRAY_SIZE(host->bus->devices); devfn++) { +dev = host->bus->devices[devfn]; + +if (!dev) { +continue; +} + +type = dev->config[PCI_HEADER_TYPE] & ~PCI_HEADER_TYPE_MULTI_FUNCTION; Seems there is PCI_DEVICE_GET_CLASS(dev)->is_bridge (see pci_root_bus_in_range). Can't that be used instead? Thanks, I will simplify this. +if (type == PCI_HEADER_TYPE_BRIDGE) { +subordinate = dev->config[PCI_SUBORDINATE_BUS]; +if (subordinate > max_bus) { +max_bus = subordinate; what about the secondary bus number, it is always less than the others? Thanks, the secondary bus number should be taken into account. Maybe a pci_root_bus_range to get [min_bus, max_bus] would be better. 
Thanks Eric +} +} +} + +return max_bus; +} + int pci_bus_numa_node(PCIBus *bus) { return PCI_BUS_GET_CLASS(bus)->numa_node(bus); diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index 718b5a454a..e0c69534f4 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -450,6 +450,7 @@ static inline PCIBus *pci_get_bus(const PCIDevice *dev) return PCI_BUS(qdev_get_parent_bus(DEVICE(dev))); } int pci_bus_num(PCIBus *s); +int pci_root_bus_max_bus(PCIBus *bus); static inline int pci_dev_bus_num(const PCIDevice *dev) { return pci_bus_num(pci_get_bus(dev)); . Xingang .
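A stand-alone sketch of the helper under discussion: scan the devices on a root bus, and for every bridge take its subordinate bus number, which bounds the whole hierarchy behind it (in a valid configuration the secondary bus number never exceeds the subordinate one, which touches on the reviewer's question above). All types here are toys, not QEMU's PCIBus/PCIDevice structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy PCI device: just the fields the scan needs. */
typedef struct {
    int present;
    int is_bridge;
    uint8_t secondary;      /* first bus behind the bridge */
    uint8_t subordinate;    /* highest bus behind the bridge */
} ToyPciDev;

static int root_bus_max_bus(int root_bus_num, const ToyPciDev *devs, size_t n)
{
    int max_bus = root_bus_num;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].present && devs[i].is_bridge &&
            devs[i].subordinate > max_bus) {
            max_bus = devs[i].subordinate;
        }
    }
    return max_bus;
}
```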
Re: [PATCH] vhost-user-fs: fix features handling
* Stefan Hajnoczi (stefa...@redhat.com) wrote: > On Thu, Apr 08, 2021 at 10:55:34PM +0300, Anton Kuchin wrote: > > Make virtio-fs take into account server capabilities. > > > > Just returning requested features assumes that all of them are implemented > > by the server and results in setting an unsupported configuration if some of them > > are absent. > > > > Signed-off-by: Anton Kuchin > > --- > > hw/virtio/vhost-user-fs.c | 17 + > > 1 file changed, 13 insertions(+), 4 deletions(-) > > > > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c > > index ac4fc34b36..6cf983ba0e 100644 > > --- a/hw/virtio/vhost-user-fs.c > > +++ b/hw/virtio/vhost-user-fs.c > > @@ -24,6 +24,14 @@ > > #include "monitor/monitor.h" > > #include "sysemu/sysemu.h" > > > > +static const int user_feature_bits[] = { > > +VIRTIO_F_VERSION_1, > > +VIRTIO_RING_F_INDIRECT_DESC, > > +VIRTIO_RING_F_EVENT_IDX, > > +VIRTIO_F_NOTIFY_ON_EMPTY, > > +VHOST_INVALID_FEATURE_BIT > > +}; > > Please add: > > VIRTIO_F_RING_PACKED > > VIRTIO_F_IOMMU_PLATFORM > > QEMU's virtiofsd does not enable either of these for now, but it's worth > allowing the vhost-user device backend to participate in negotiation so > that this can change in the future (or alternative virtiofsd > implementations can support these features). OK, so: Reviewed-by: Dr. David Alan Gilbert and queued, I'll add those extra 2 lines. We seem pretty inconsistent about all the different vhost-user devices. Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
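What the feature table achieves can be sketched generically: every bit listed in the table survives only if the backend also offers it, while bits not listed pass through untouched. Bit numbers and names below are invented for illustration; the real clamping happens inside QEMU's vhost layer.

```c
#include <stdint.h>

/* Sentinel ending a feature-bit table, analogous to the invalid bit
 * terminating user_feature_bits[] above. */
#define FEAT_TABLE_END ((unsigned)-1)

static uint64_t clamp_features(const unsigned *bits, uint64_t backend,
                               uint64_t requested)
{
    for (unsigned i = 0; bits[i] != FEAT_TABLE_END; i++) {
        uint64_t mask = 1ULL << bits[i];

        if (!(backend & mask)) {
            requested &= ~mask;     /* backend lacks it: drop the bit */
        }
    }
    return requested;
}
```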
[PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C"
qemu/osdep.h is quite special in that, despite being part of QEMU sources, it is included by C++ source files as well. disas/nanomips.cpp is doing so within an 'extern "C"' block, which breaks with latest glib due to the inclusion of templates in glib.h. These patches implement Daniel Berrangé's idea of pushing the 'extern "C"' block within glib.h and including system headers (including glib.h, and in fact QEMU's own glib-compat.h too) *outside* the block. (CI has not finished running yet, but it seems encouraging). Paolo Paolo Bonzini (2): osdep: include glib-compat.h before other QEMU headers osdep: protect qemu/osdep.h with extern "C" disas/nanomips.cpp | 2 +- include/qemu/compiler.h | 6 ++ include/qemu/osdep.h| 13 +++-- 3 files changed, 18 insertions(+), 3 deletions(-) -- 2.30.1
[PATCH 2/2] osdep: protect qemu/osdep.h with extern "C"
System headers may include templates if compiled with a C++ compiler, which cause the compiler to complain if qemu/osdep.h is included within a C++ source file's 'extern "C"' block. Add an 'extern "C"' block directly to qemu/osdep.h, so that system headers can be kept out of it. There is a stray declaration early in qemu/osdep.h, which needs to be special cased. Add a definition in qemu/compiler.h to make it look nice. config-host.h, CONFIG_TARGET, exec/poison.h and qemu/compiler.h are included outside the 'extern "C"' block; that is not an issue because they consist entirely of preprocessor directives. Signed-off-by: Paolo Bonzini --- disas/nanomips.cpp | 2 +- include/qemu/compiler.h | 6 ++ include/qemu/osdep.h| 10 +- 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/disas/nanomips.cpp b/disas/nanomips.cpp index 2b09655271..8ddef897f0 100644 --- a/disas/nanomips.cpp +++ b/disas/nanomips.cpp @@ -27,8 +27,8 @@ * Reference Manual", Revision 01.01, April 27, 2018 */ -extern "C" { #include "qemu/osdep.h" +extern "C" { #include "disas/dis-asm.h" } diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h index cf28bb2bcd..091c45248b 100644 --- a/include/qemu/compiler.h +++ b/include/qemu/compiler.h @@ -11,6 +11,12 @@ #define QEMU_STATIC_ANALYSIS 1 #endif +#ifdef __cplusplus +#define QEMU_EXTERN_C extern "C" +#else +#define QEMU_EXTERN_C extern +#endif + #define QEMU_NORETURN __attribute__ ((__noreturn__)) #define QEMU_WARN_UNUSED_RESULT __attribute__((warn_unused_result)) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index b67b0a1e8c..3f8785a471 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -57,7 +57,7 @@ #define daemon qemu_fake_daemon_function #include #undef daemon -extern int daemon(int, int); +QEMU_EXTERN_C int daemon(int, int); #endif #ifdef _WIN32 @@ -113,6 +113,10 @@ extern int daemon(int, int); #include "glib-compat.h" +#ifdef __cplusplus +extern "C" { +#endif + #ifdef _WIN32 #include "sysemu/os-win32.h" #endif @@ 
-723,4 +727,8 @@ static inline int platform_does_not_support_system(const char *command) } #endif /* !HAVE_SYSTEM_FUNCTION */ +#ifdef __cplusplus +} +#endif + #endif -- 2.30.1
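The core of the patch is a two-line macro. Reproduced stand-alone with a hypothetical function (the real use is the daemon() declaration in qemu/osdep.h), it compiles as plain C today and would give the declaration C linkage under a C++ compiler:

```c
/* Expands to an explicit C-linkage declaration under C++, and to an
 * ordinary extern in plain C. */
#ifdef __cplusplus
#define QEMU_EXTERN_C extern "C"
#else
#define QEMU_EXTERN_C extern
#endif

QEMU_EXTERN_C int fake_daemon(int nochdir, int noclose);

int fake_daemon(int nochdir, int noclose)
{
    return nochdir + noclose;   /* placeholder body for illustration */
}
```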
[PATCH 1/2] osdep: include glib-compat.h before other QEMU headers
glib-compat.h is sort of like a system header, and it needs to include system headers (glib.h) that may dislike being included under 'extern "C"'. Move it right after all system headers and before all other QEMU headers. Signed-off-by: Paolo Bonzini --- include/qemu/osdep.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index ba15be9c56..b67b0a1e8c 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -111,6 +111,8 @@ extern int daemon(int, int); #define WEXITSTATUS(x) (x) #endif +#include "glib-compat.h" + #ifdef _WIN32 #include "sysemu/os-win32.h" #endif @@ -123,7 +125,6 @@ extern int daemon(int, int); #include #endif -#include "glib-compat.h" #include "qemu/typedefs.h" /* -- 2.30.1
Re: [PATCH] docs: Add a QEMU Code of Conduct and Conflict Resolution Policy document
Peter Maydell writes: > On Tue, 13 Apr 2021 at 11:23, Andreas Färber wrote: >> Or consider the case you get a bug report not copied to the public >> mailing lists from someone you don't know. Then I would still expect to >> be allowed to attribute a commit via Reported-by/CC to that person, as >> it seems in his/her interest to get the bug fixed and be notified, >> unless explicitly requested otherwise. > > FWIW, in this kind of situation, I generally try to explicitly > ask the submitter if they're OK with my adding a reported-by > tag, just as a matter of politeness. Not everybody is OK with > having their email address publicly recorded on mailing list > archives and in git history forever. That's what I'd do, too. Still, neglecting to ask for permission to publicly credit a bug report is not anywhere near doxing. If the public credit turns out to be unwanted, a sincere apology is obviously called for. People may exist who need to be slapped over the head with a code of conduct to figure that out. I hope we'll never need to do that. Anyway. What I see at work here is one of the unintended consequences of formal codes of conduct: they read like law, so people read them lawyerly. Our CoC attempts to avoid this by explicitly stating its *purpose*: "a guide to make it easier to be excellent to each other." This applies to the QEMU leadership committee in spades. Treating negligent publication of some technical e-mail's sender address as malicious doxing wouldn't be excellent to anyone, it would be the leadership committee shooting themselves in the foot with a machine gun. Let's not worry about that, okay?
Re: [PATCH] cutils: fix memory leak in get_relocated_path()
On Tue, Apr 13, 2021 at 12:59:36PM +0200, Philippe Mathieu-Daudé wrote: Is this fix aiming at 6.0 release? The leak is minimal, but the fix is very simple. So, I think it can go if someone has a pull request to send with other patches, but I'm not sure with which tree. Thanks, Stefano On 4/12/21 7:02 PM, Stefano Garzarella wrote: get_relocated_path() allocates a GString object and returns the character data (C string) to the caller without freeing the memory allocated for that object as reported by valgrind: 24 bytes in 1 blocks are definitely lost in loss record 2,805 of 6,532 at 0x4839809: malloc (vg_replace_malloc.c:307) by 0x55AABB8: g_malloc (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C2481: g_slice_alloc (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C4827: g_string_sized_new (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x55C4CEA: g_string_new (in /usr/lib64/libglib-2.0.so.0.6600.8) by 0x906314: get_relocated_path (cutils.c:1036) by 0x6E1F77: qemu_read_default_config_file (vl.c:2122) by 0x6E1F77: qemu_init (vl.c:2687) by 0x3E3AF8: main (main.c:49) Let's use g_string_free(gstring, false) to free only the GString object and transfer the ownership of the character data to the caller. Fixes: f4f5ed2cbd ("cutils: introduce get_relocated_path") Signed-off-by: Stefano Garzarella --- util/cutils.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/util/cutils.c b/util/cutils.c index ee908486da..c9b91e7535 100644 --- a/util/cutils.c +++ b/util/cutils.c @@ -1055,5 +1055,5 @@ char *get_relocated_path(const char *dir) assert(G_IS_DIR_SEPARATOR(dir[-1])); g_string_append(result, dir - 1); } -return result->str; +return g_string_free(result, false); }
Re: [PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C"
Patchew URL: https://patchew.org/QEMU/20210413113741.214867-1-pbonz...@redhat.com/ Hi, This series seems to have some coding style problems. See output below for more information: Type: series Message-id: 20210413113741.214867-1-pbonz...@redhat.com Subject: [PATCH 0/2] osdep: allow including qemu/osdep.h outside extern "C" === TEST SCRIPT BEGIN === #!/bin/bash git rev-parse base > /dev/null || exit 0 git config --local diff.renamelimit 0 git config --local diff.renames True git config --local diff.algorithm histogram ./scripts/checkpatch.pl --mailback base.. === TEST SCRIPT END === Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384 From https://github.com/patchew-project/qemu - [tag update] patchew/20210408195534.647895-1-antonkuc...@yandex-team.ru -> patchew/20210408195534.647895-1-antonkuc...@yandex-team.ru * [new tag] patchew/20210413113741.214867-1-pbonz...@redhat.com -> patchew/20210413113741.214867-1-pbonz...@redhat.com Switched to a new branch 'test' 992fa52 osdep: protect qemu/osdep.h with extern "C" bc2310f osdep: include glib-compat.h before other QEMU headers === OUTPUT BEGIN === 1/2 Checking commit bc2310f731a5 (osdep: include glib-compat.h before other QEMU headers) 2/2 Checking commit 992fa52da413 (osdep: protect qemu/osdep.h with extern "C") WARNING: architecture specific defines should be avoided #51: FILE: include/qemu/compiler.h:14: +#ifdef __cplusplus ERROR: storage class should be at the beginning of the declaration #52: FILE: include/qemu/compiler.h:15: +#define QEMU_EXTERN_C extern "C" ERROR: storage class should be at the beginning of the declaration #54: FILE: include/qemu/compiler.h:17: +#define QEMU_EXTERN_C extern WARNING: architecture specific defines should be avoided #77: FILE: include/qemu/osdep.h:116: +#ifdef __cplusplus WARNING: architecture specific defines should be avoided #88: FILE: include/qemu/osdep.h:730: +#ifdef __cplusplus total: 2 errors, 3 warnings, 47 lines checked Patch 2/2 has style problems, please review. 
If any of these errors are false positives report them to the maintainer, see CHECKPATCH in MAINTAINERS. === OUTPUT END === Test command exited with code: 1 The full log is available at http://patchew.org/logs/20210413113741.214867-1-pbonz...@redhat.com/testing.checkpatch/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-de...@redhat.com
Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
On 06.04.21 17:51, Vladimir Sementsov-Ogievskiy wrote: If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. block/nbd.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..1d4668d42d 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -443,6 +443,11 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) BDRVNBDState *s = bs->opaque; NBDConnectThread *thr = s->connect_thread; +if (!thr) { +/* detached */ +return -1; +} + qemu_mutex_lock(&thr->mutex); switch (thr->state) { First, it is a bit strange not to set *errp in these cases. 
I don’t think it’d make a difference anyway, but now that I look into it... The error would be propagated to s->connect_err, but that isn’t used anywhere, so... What’s the point? Anyway. What I really wanted to ask is: nbd_co_establish_connection() may create a thread, which accesses the thread. Is that a problem? Like, can this happen:
- nbd_co_establish_connection(): thr->state := THREAD_RUNNING
- nbd_co_establish_connection(): thread is created
- nbd_co_establish_connection(): thr->mutex unlocked
- connect_thread_func(): thr->mutex locked
- connect_thread_func(): thr->state := something else
- connect_thread_func(): thr->mutex unlocked
- nbd_co_establish_connection(): yields
- (As I understood it, this yield leads to nbd_co_establish_connection_cancel() continuing)
- nbd_co_EC_cancel(): thr->mutex locked
- nbd_co_EC_cancel(): do_free true
- nbd_co_EC_cancel(): nbd_free_connect_thread()
- connect_thread_func(): nbd_free_connect_thread()
? Max
Re: [PATCH 2/2] osdep: protect qemu/osdep.h with extern "C"
On Tue, 13 Apr 2021 at 12:37, Paolo Bonzini wrote: > > System headers may include templates if compiled with a C++ compiler, > which cause the compiler to complain if qemu/osdep.h is included > within a C++ source file's 'extern "C"' block. Add > an 'extern "C"' block directly to qemu/osdep.h, so that > system headers can be kept out of it. > > There is a stray declaration early in qemu/osdep.h, which needs > to be special cased. Add a definition in qemu/compiler.h to > make it look nice. > > config-host.h, CONFIG_TARGET, exec/poison.h and qemu/compiler.h > are included outside the 'extern "C"' block; that is not > an issue because they consist entirely of preprocessor directives. > > Signed-off-by: Paolo Bonzini > --- > disas/nanomips.cpp | 2 +- > include/qemu/compiler.h | 6 ++ > include/qemu/osdep.h| 10 +- > 3 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/disas/nanomips.cpp b/disas/nanomips.cpp > index 2b09655271..8ddef897f0 100644 > --- a/disas/nanomips.cpp > +++ b/disas/nanomips.cpp > @@ -27,8 +27,8 @@ > * Reference Manual", Revision 01.01, April 27, 2018 > */ > > -extern "C" { > #include "qemu/osdep.h" > +extern "C" { > #include "disas/dis-asm.h" > } > > diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h > index cf28bb2bcd..091c45248b 100644 > --- a/include/qemu/compiler.h > +++ b/include/qemu/compiler.h > @@ -11,6 +11,12 @@ > #define QEMU_STATIC_ANALYSIS 1 > #endif > > +#ifdef __cplusplus > +#define QEMU_EXTERN_C extern "C" > +#else > +#define QEMU_EXTERN_C extern > +#endif > + > #define QEMU_NORETURN __attribute__ ((__noreturn__)) > > #define QEMU_WARN_UNUSED_RESULT __attribute__((warn_unused_result)) > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h > index b67b0a1e8c..3f8785a471 100644 > --- a/include/qemu/osdep.h > +++ b/include/qemu/osdep.h > @@ -57,7 +57,7 @@ > #define daemon qemu_fake_daemon_function > #include > #undef daemon > -extern int daemon(int, int); > +QEMU_EXTERN_C int daemon(int, int); > #endif > > 
#ifdef _WIN32 > @@ -113,6 +113,10 @@ extern int daemon(int, int); > > #include "glib-compat.h" > > +#ifdef __cplusplus > +extern "C" { > +#endif > + > #ifdef _WIN32 > #include "sysemu/os-win32.h" > #endif There are some system header includes in osdep.h below this point (sys/shm.h and sys/uio.h) -- don't they need to be moved up to go with the other system includes first ? thanks -- PMM
Re: [PATCH RFC RESEND v2 4/6] hw/arm/virt-acpi-build: Add explicit idmap info in IORT table
Hi Eric, On 2021/4/13 17:03, Auger Eric wrote: Hi Xingang, On 3/25/21 8:22 AM, Wang Xingang wrote: From: Xingang Wang The idmap of smmuv3 and root complex covers the whole RID space for now, this patch add explicit idmap info according to root bus number range. This add smmuv3 idmap for certain bus which has enabled the iommu property. Signed-off-by: Xingang Wang Signed-off-by: Jiahui Cen --- hw/arm/virt-acpi-build.c | 103 ++- 1 file changed, 81 insertions(+), 22 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index f5a2b2d4cb..5491036c86 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -44,6 +44,7 @@ #include "hw/acpi/tpm.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" +#include "hw/pci/pci_bus.h" #include "hw/pci-host/gpex.h" #include "hw/arm/virt.h" #include "hw/mem/nvdimm.h" @@ -237,6 +238,41 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms) aml_append(scope, dev); } +typedef +struct AcpiIortMapping { +AcpiIortIdMapping idmap; +bool iommu; +} AcpiIortMapping; + +/* For all PCI host bridges, walk and insert DMAR scope */ this comment should rather be in the caller also DMAR is not the ARM vocable. I would add the comment for this function: build the ID mapping for aa given PCI host bridge Thanks, I will fix the comment. 
+static int +iort_host_bridges(Object *obj, void *opaque) +{ +GArray *map_blob = opaque; +AcpiIortMapping map; +AcpiIortIdMapping *idmap = &map.idmap; +int bus_num, max_bus; + +if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { +PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus; + +if (bus) { +bus_num = pci_bus_num(bus); +max_bus = pci_root_bus_max_bus(bus); + +idmap->input_base = cpu_to_le32(bus_num << 8); +idmap->id_count = cpu_to_le32((max_bus - bus_num + 1) << 8); +idmap->output_base = cpu_to_le32(bus_num << 8); +idmap->flags = cpu_to_le32(0); + +map.iommu = pci_root_bus_has_iommu(bus); if iommu is not set, we don't need to populate the idmap and we may even directly continue, ie. not add the element the map_bap_blob, no? Thanks, if we leave the whole range when iommu is not set, this indeed should be dropped in map_blob. +g_array_append_val(map_blob, map); +} +} + +return 0; +} + static void build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) { @@ -247,6 +283,21 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) AcpiIortSmmu3 *smmu; size_t node_size, iort_node_offset, iort_length, smmu_offset = 0; AcpiIortRC *rc; +int smmu_mapping_count; +GArray *map_blob = g_array_new(false, true, sizeof(AcpiIortMapping)); +AcpiIortMapping *map; + +/* pci_for_each_bus(vms->bus, insert_map, map_blob); */ comment to be removed Done. 
+object_child_foreach_recursive(object_get_root(), + iort_host_bridges, map_blob); + +smmu_mapping_count = 0; +for (int i = 0; i < map_blob->len; i++) { +map = &g_array_index(map_blob, AcpiIortMapping, i); +if (map->iommu) { +smmu_mapping_count++; +} +} iort = acpi_data_push(table_data, sizeof(*iort)); @@ -280,13 +331,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) /* SMMUv3 node */ smmu_offset = iort_node_offset + node_size; -node_size = sizeof(*smmu) + sizeof(*idmap); +node_size = sizeof(*smmu) + sizeof(*idmap) * smmu_mapping_count; iort_length += node_size; smmu = acpi_data_push(table_data, node_size); smmu->type = ACPI_IORT_NODE_SMMU_V3; smmu->length = cpu_to_le16(node_size); -smmu->mapping_count = cpu_to_le32(1); +smmu->mapping_count = cpu_to_le32(smmu_mapping_count); smmu->mapping_offset = cpu_to_le32(sizeof(*smmu)); smmu->base_address = cpu_to_le64(vms->memmap[VIRT_SMMU].base); smmu->flags = cpu_to_le32(ACPI_IORT_SMMU_V3_COHACC_OVERRIDE); @@ -295,23 +346,28 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) smmu->gerr_gsiv = cpu_to_le32(irq + 2); smmu->sync_gsiv = cpu_to_le32(irq + 3); -/* Identity RID mapping covering the whole input RID range */ -idmap = &smmu->id_mapping_array[0]; -idmap->input_base = 0; -idmap->id_count = cpu_to_le32(0x); -idmap->output_base = 0; -/* output IORT node is the ITS group node (the first node) */ -idmap->output_reference = cpu_to_le32(iort_node_offset); +for (int i = 0, j = 0; i < map_blob->len; i++) { +map = &g_array_index(map_blob, AcpiIortMapping, i); + +if (!map->iommu) { +continue; +} + +idmap =
Re: testing/next - hexagon toolchain update
Brian Cain writes: > Alex, > > You are the one maintaining the testing/next tree at > https://gitlab.com/stsquad/qemu correct? The current patch series for > hexagon under review requires toolchain updates. These changes to > llvm/clang landed in the last week or two. > > Can you apply this patch? I've applied to my tree. I didn't get it in this cycle but I'll post a new testing/next for review in the next week or so ready for master to re-open. > > > > From 68547357c895934796e9b4687338bb9e39ac86c5 Mon Sep 17 00:00:00 2001 > From: Brian Cain mailto:bc...@quicinc.com > Date: Thu, 1 Apr 2021 10:32:24 -0500 > Subject: [PATCH] Update llvm-project commit > > clang was updated with new inline asm registers for hexagon, this is > necessary for QEMU test cases currently under review. > > Signed-off-by: Brian Cain mailto:bc...@quicinc.com > --- > tests/docker/dockerfiles/debian-hexagon-cross.docker | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tests/docker/dockerfiles/debian-hexagon-cross.docker > b/tests/docker/dockerfiles/debian-hexagon-cross.docker > index b6fb651..1d19e8f 100644 > --- a/tests/docker/dockerfiles/debian-hexagon-cross.docker > +++ b/tests/docker/dockerfiles/debian-hexagon-cross.docker > @@ -24,7 +24,7 @@ RUN apt update && \ > ENV TOOLCHAIN_INSTALL /usr/local > ENV ROOTFS /usr/local > > -ENV LLVM_URL > https://github.com/llvm/llvm-project/archive/3d8149c2a1228609fd7d7c91a04681304a2f0ca9.tar.gz > +ENV LLVM_URL > https://github.com/llvm/llvm-project/archive/bfcd21876adc3498065e4da92799f613e730d475.tar.gz > ENV MUSL_URL > https://github.com/quic/musl/archive/aff74b395fbf59cd7e93b3691905aa1af6c0778c.tar.gz > ENV LINUX_URL > https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.6.18.tar.xz > > [2. 0001-Update-llvm-project-commit.patch --- text/x-diff; > 0001-Update-llvm-project-commit.patch]... -- Alex Bennée
Re: [RFC v12 00/65] arm cleanup experiment for kvm-only build
On 3/28/21 9:27 PM, Richard Henderson wrote: > On 3/26/21 1:35 PM, Claudio Fontana wrote: >> Here a new version of the series that enables kvm-only builds. >> >> The goal here is to enable the KVM-only build, but there is >> some additional cleanup too. >> >> In this iteration I mostly fixed existing issues, and added an attempt >> to put more content in cpu-sve. More splitting still required >> for both cpu-sve and cpu-auth. >> >> Comments welcome, thanks, > > My number 1 takeaway is that I don't think we should bother trying so hard to > compile out aarch64-specific code. The number of ifdefs seems to be quite > high > in the end. > Understand - it turns out to be useful in practice though to go through some effort to improve the distinction between aarch64-specific code , aarch32-specific code, and code that is common to both, even though we might later decide to just have basically TARGET_AARCH64 be always true (ie removing it) in a future series. > The cleanup that I think would be better would be to *remove* the 32-bit-only > qemu-system-arm and build it all into a single binary. That would reduce the > number of build combinations and could in turn simplify or speed up CI. Still, I think it is interesting for the reader of the code, and in trying to debug issues, to be mindful about which code is 32bit-only, which code is 64-bit only, which is common. Not that this series fixes everything in this regard. > > We probably cannot remove qemu-arm for 32-bit linux-user binaries. But my > guess is the amount of aarch64-specific code that would leak in is pretty > minimal. The bulk of the difference is in the set of tcg helpers, which are > segregated already. > > The tcg/kvm split is much more interesting, and that part of the patch set > looks pretty good. > > > r~ > Thanks, I'll send a new version momentarily, we can continue from there. Ciao, Claudio
[PULL 1/3] hw/arm/mps2-tz: Fix MPC setting for AN524 SRAM block
The AN524 has three MPCs: one for the BRAM, one for the QSPI flash, and one for the DDR. We incorrectly set the .mpc field in the RAMInfo struct for the SRAM block to 1, giving it the same MPC we are using for the QSPI. The effect of this was that the QSPI didn't get mapped into the system address space at all, via an MPC or otherwise, and guest programs which tried to read from the QSPI would get a bus error. Correct the SRAM RAMInfo to indicate that it does not have an associated MPC. Fixes: 25ff112a8cc ("hw/arm/mps2-tz: Add new mps3-an524 board") Signed-off-by: Peter Maydell Reviewed-by: Richard Henderson Reviewed-by: Philippe Mathieu-Daudé Message-id: 20210409150527.15053-2-peter.mayd...@linaro.org --- hw/arm/mps2-tz.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/mps2-tz.c b/hw/arm/mps2-tz.c index 3fbe3d29f95..5ebd671bf83 100644 --- a/hw/arm/mps2-tz.c +++ b/hw/arm/mps2-tz.c @@ -238,7 +238,7 @@ static const RAMInfo an524_raminfo[] = { { .name = "sram", .base = 0x2000, .size = 32 * 4 * KiB, -.mpc = 1, +.mpc = -1, .mrindex = 1, }, { /* We don't model QSPI flash yet; for now expose it as simple ROM */ -- 2.20.1
Re: [PULL 00/13] qemu-sparc queue 20210412
On Mon, 12 Apr 2021 at 23:20, Mark Cave-Ayland wrote: > > The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: > > Merge remote-tracking branch > 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 > 12:12:09 +0100) > > are available in the Git repository at: > > git://github.com/mcayland/qemu.git tags/qemu-sparc-20210412 > > for you to fetch changes up to ce94fa7aa646a18e9b9105a32eea2152b202b431: > > tests/qtest: add tests for am53c974 device (2021-04-12 22:37:11 +0100) > > > qemu-sparc queue > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/6.0 for any user-visible changes. -- PMM
[PULL 2/3] hw/arm/mps2-tz: Assert if more than one RAM is attached to an MPC
Each board in mps2-tz.c specifies a RAMInfo[] array providing information about each RAM in the board. The .mpc field of the RAMInfo struct specifies which MPC, if any, the RAM is attached to. We already assert if the array doesn't have any entry for an MPC, but we don't diagnose the error of using the same MPC number twice (which is quite easy to do by accident if copy-and-pasting structure entries). Enhance find_raminfo_for_mpc() so that it detects multiple entries for the MPC as well as missing entries. Signed-off-by: Peter Maydell Reviewed-by: Richard Henderson Reviewed-by: Philippe Mathieu-Daudé Message-id: 20210409150527.15053-3-peter.mayd...@linaro.org --- hw/arm/mps2-tz.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/hw/arm/mps2-tz.c b/hw/arm/mps2-tz.c index 5ebd671bf83..25016e464d9 100644 --- a/hw/arm/mps2-tz.c +++ b/hw/arm/mps2-tz.c @@ -306,14 +306,18 @@ static const RAMInfo *find_raminfo_for_mpc(MPS2TZMachineState *mms, int mpc) { MPS2TZMachineClass *mmc = MPS2TZ_MACHINE_GET_CLASS(mms); const RAMInfo *p; +const RAMInfo *found = NULL; for (p = mmc->raminfo; p->name; p++) { if (p->mpc == mpc && !(p->flags & IS_ALIAS)) { -return p; +/* There should only be one entry in the array for this MPC */ +g_assert(!found); +found = p; } } /* if raminfo array doesn't have an entry for each MPC this is a bug */ -g_assert_not_reached(); +assert(found); +return found; } static MemoryRegion *mr_for_raminfo(MPS2TZMachineState *mms, -- 2.20.1
[PULL 3/3] sphinx: qapidoc: Wrap "If" section body in a paragraph node
From: John Snow These sections need to be wrapped in a block-level element, such as Paragraph in order for them to be rendered into Texinfo correctly. Before (e.g.): If defined(CONFIG_REPLICATION) became: .SS If \fBdefined(CONFIG_REPLICATION)\fP.SS \fBBlockdevOptionsReplication\fP (Object) ... After: If defined(CONFIG_REPLICATION) becomes: .SS If .sp \fBdefined(CONFIG_REPLICATION)\fP .SS \fBBlockdevOptionsReplication\fP (Object) ... Reported-by: Markus Armbruster Tested-by: Markus Armbruster Signed-off-by: John Snow Reviewed-by: Peter Maydell Message-id: 20210406141909.1992225-2-js...@redhat.com Signed-off-by: Peter Maydell --- docs/sphinx/qapidoc.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/sphinx/qapidoc.py b/docs/sphinx/qapidoc.py index b7b86b5dffb..b7a2d39c105 100644 --- a/docs/sphinx/qapidoc.py +++ b/docs/sphinx/qapidoc.py @@ -278,7 +278,9 @@ def _nodes_for_if_section(self, ifcond): nodelist = [] if ifcond: snode = self._make_section('If') -snode += self._nodes_for_ifcond(ifcond, with_if=False) +snode += nodes.paragraph( +'', '', *self._nodes_for_ifcond(ifcond, with_if=False) +) nodelist.append(snode) return nodelist -- 2.20.1
[PULL 0/3] target-arm queue
A few last patches to go in for rc3... The following changes since commit c1e90def01bdb8fcbdbebd9d1eaa8e4827ece620: Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20210412' into staging (2021-04-12 12:12:09 +0100) are available in the Git repository at: https://git.linaro.org/people/pmaydell/qemu-arm.git tags/pull-target-arm-20210413 for you to fetch changes up to 2d18b4ca023ca1a3aee18064251d6e6e1084f3eb: sphinx: qapidoc: Wrap "If" section body in a paragraph node (2021-04-13 10:14:58 +0100) target-arm queue: * Fix MPC setting for AN524 SRAM block * sphinx: qapidoc: Wrap "If" section body in a paragraph node John Snow (1): sphinx: qapidoc: Wrap "If" section body in a paragraph node Peter Maydell (2): hw/arm/mps2-tz: Fix MPC setting for AN524 SRAM block hw/arm/mps2-tz: Assert if more than one RAM is attached to an MPC docs/sphinx/qapidoc.py | 4 +++- hw/arm/mps2-tz.c | 10 +++--- 2 files changed, 10 insertions(+), 4 deletions(-)
Re: [RFC v9 15/29] vfio: Set up nested stage mappings
Hi Eric, On 2021/4/11 20:08, Eric Auger wrote: In nested mode, legacy vfio_iommu_map_notify cannot be used as there is no "caching" mode and we do not trap on map. On Intel, vfio_iommu_map_notify was used to DMA map the RAM through the host single stage. With nested mode, we need to setup the stage 2 and the stage 1 separately. This patch introduces a prereg_listener to setup the stage 2 mapping. The stage 1 mapping, owned by the guest, is passed to the host when the guest invalidates the stage 1 configuration, through a dedicated PCIPASIDOps callback. Guest IOTLB invalidations are cascaded downto the host through another IOMMU MR UNMAP notifier. Signed-off-by: Eric Auger --- v7 -> v8: - properly handle new IOMMUTLBEntry fields and especially propagate DOMAIN and PASID based invalidations v6 -> v7: - remove PASID based invalidation v5 -> v6: - add error_report_err() - remove the abort in case of nested stage case v4 -> v5: - use VFIO_IOMMU_SET_PASID_TABLE - use PCIPASIDOps for config notification v3 -> v4: - use iommu_inv_pasid_info for ASID invalidation v2 -> v3: - use VFIO_IOMMU_ATTACH_PASID_TABLE - new user API - handle leaf v1 -> v2: - adapt to uapi changes - pass the asid - pass IOMMU_NOTIFIER_S1_CFG when initializing the config notifier --- hw/vfio/common.c | 139 +-- hw/vfio/pci.c| 21 +++ hw/vfio/trace-events | 2 + 3 files changed, 157 insertions(+), 5 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 0cd7ef2139..e369d451e7 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -595,6 +595,73 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr, return true; } +/* Propagate a guest IOTLB invalidation to the host (nested mode) */ +static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) +{ +VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); +struct vfio_iommu_type1_cache_invalidate ustruct = {}; +VFIOContainer *container = giommu->container; +int ret; + +assert(iotlb->perm == IOMMU_NONE); + 
+ustruct.argsz = sizeof(ustruct); +ustruct.flags = 0; +ustruct.info.argsz = sizeof(struct iommu_cache_invalidate_info); +ustruct.info.version = IOMMU_CACHE_INVALIDATE_INFO_VERSION_1; +ustruct.info.cache = IOMMU_CACHE_INV_TYPE_IOTLB; + +switch (iotlb->granularity) { +case IOMMU_INV_GRAN_DOMAIN: +ustruct.info.granularity = IOMMU_INV_GRANU_DOMAIN; +break; +case IOMMU_INV_GRAN_PASID: +{ +struct iommu_inv_pasid_info *pasid_info; +int archid = -1; + +pasid_info = &ustruct.info.granu.pasid_info; +ustruct.info.granularity = IOMMU_INV_GRANU_PASID; +if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) { +pasid_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID; +archid = iotlb->arch_id; +} +pasid_info->archid = archid; +trace_vfio_iommu_asid_inv_iotlb(archid); +break; +} +case IOMMU_INV_GRAN_ADDR: +{ +hwaddr start = iotlb->iova + giommu->iommu_offset; +struct iommu_inv_addr_info *addr_info; +size_t size = iotlb->addr_mask + 1; +int archid = -1; + +addr_info = &ustruct.info.granu.addr_info; +ustruct.info.granularity = IOMMU_INV_GRANU_ADDR; +if (iotlb->leaf) { +addr_info->flags |= IOMMU_INV_ADDR_FLAGS_LEAF; +} +if (iotlb->flags & IOMMU_INV_FLAGS_ARCHID) { +addr_info->flags |= IOMMU_INV_ADDR_FLAGS_ARCHID; +archid = iotlb->arch_id; +} +addr_info->archid = archid; +addr_info->addr = start; +addr_info->granule_size = size; +addr_info->nb_granules = 1; +trace_vfio_iommu_addr_inv_iotlb(archid, start, size, +1, iotlb->leaf); +break; +} Should we pass a size to host kernel here, even if vSMMU doesn't support RIL or guest kernel doesn't use RIL? It will cause TLBI issue in this scenario: Guest kernel issues a TLBI cmd without "range" (tg = 0) to invalidate a 2M huge page. Then qemu passed the iova and size (4K) to host kernel. Finally, host kernel issues a TLBI cmd with "range" (4K) which can not invalidate the TLB entry of 2M huge page. 
(pSMMU supports RIL) Thanks, Kunkun Jiang +} + +ret = ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, &ustruct); +if (ret) { +error_report("%p: failed to invalidate CACHE (%d)", container, ret); +} +} + static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) { VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); @@ -776,6 +843,35 @@ static void vfio_dma_unmap_ram_section(VFIOContainer *container, } } +static void vfio_prereg_listener_region_add(MemoryListener *listener, +MemoryRegionSection *section) +{ +VFIOContainer *container = +container_
Re: [PATCH for-6.0] x86: acpi: use offset instead of pointer when using build_header()
On Tue, Apr 13, 2021 at 07:14:00AM -0400, Igor Mammedov wrote: > Do the same as in commit > (4d027afeb3a97 Virt: ACPI: fix qemu assert due to re-assigned table data > address) > for remaining tables that happen to use saved at > the beginning pointer to build header to avoid assert > when table_data is relocated due to implicit re-size. > > Reported-in: https://bugs.launchpad.net/bugs/1923497 Doesn't this fix the bug? If so - Isn't this Fixes: ? > Signed-off-by: Igor Mammedov > --- > PS: > I have build_header() refactoring patch that requires offset > instead of pointer, to make it harder to misuse but it's > a bit intrusive for last minute fixes. So here goes simplified > variant, and I'll post refactoring patch for 6.1. later. > --- > hw/acpi/aml-build.c | 15 +-- > hw/i386/acpi-build.c | 8 ++-- > 2 files changed, 15 insertions(+), 8 deletions(-) > > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c > index d33ce8954a..f0035d2b4a 100644 > --- a/hw/acpi/aml-build.c > +++ b/hw/acpi/aml-build.c > @@ -1830,6 +1830,7 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > int i; > unsigned rsdt_entries_offset; > AcpiRsdtDescriptorRev1 *rsdt; > +int rsdt_start = table_data->len; > const unsigned table_data_len = (sizeof(uint32_t) * table_offsets->len); > const unsigned rsdt_entry_size = sizeof(rsdt->table_offset_entry[0]); > const size_t rsdt_len = sizeof(*rsdt) + table_data_len; > @@ -1846,7 +1847,8 @@ build_rsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > ACPI_BUILD_TABLE_FILE, ref_tbl_offset); > } > build_header(linker, table_data, > - (void *)rsdt, "RSDT", rsdt_len, 1, oem_id, oem_table_id); > + (void *)(table_data->data + rsdt_start), > + "RSDT", rsdt_len, 1, oem_id, oem_table_id); > } > > /* Build xsdt table */ > @@ -1857,6 +1859,7 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > int i; > unsigned xsdt_entries_offset; > AcpiXsdtDescriptorRev2 *xsdt; > +int xsdt_start = 
table_data->len; > const unsigned table_data_len = (sizeof(uint64_t) * table_offsets->len); > const unsigned xsdt_entry_size = sizeof(xsdt->table_offset_entry[0]); > const size_t xsdt_len = sizeof(*xsdt) + table_data_len; > @@ -1873,7 +1876,8 @@ build_xsdt(GArray *table_data, BIOSLinker *linker, > GArray *table_offsets, > ACPI_BUILD_TABLE_FILE, ref_tbl_offset); > } > build_header(linker, table_data, > - (void *)xsdt, "XSDT", xsdt_len, 1, oem_id, oem_table_id); > + (void *)(table_data->data + xsdt_start), > + "XSDT", xsdt_len, 1, oem_id, oem_table_id); > } > > void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base, > @@ -2053,10 +2057,9 @@ void build_tpm2(GArray *table_data, BIOSLinker > *linker, GArray *tcpalog, > uint64_t control_area_start_address; > TPMIf *tpmif = tpm_find(); > uint32_t start_method; > -void *tpm2_ptr; > > tpm2_start = table_data->len; > -tpm2_ptr = acpi_data_push(table_data, sizeof(AcpiTableHeader)); > +acpi_data_push(table_data, sizeof(AcpiTableHeader)); > > /* Platform Class */ > build_append_int_noprefix(table_data, TPM2_ACPI_CLASS_CLIENT, 2); > @@ -2095,8 +2098,8 @@ void build_tpm2(GArray *table_data, BIOSLinker *linker, > GArray *tcpalog, > log_addr_offset, 8, > ACPI_BUILD_TPMLOG_FILE, 0); > build_header(linker, table_data, > - tpm2_ptr, "TPM2", table_data->len - tpm2_start, 4, oem_id, > - oem_table_id); > + (void *)(table_data->data + tpm2_start), > + "TPM2", table_data->len - tpm2_start, 4, oem_id, > oem_table_id); > } > > Aml *build_crs(PCIHostState *host, CrsRangeSet *range_set, uint32_t > io_offset, > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c > index de98750aef..daaf8f473e 100644 > --- a/hw/i386/acpi-build.c > +++ b/hw/i386/acpi-build.c > @@ -1816,6 +1816,7 @@ build_hpet(GArray *table_data, BIOSLinker *linker, > const char *oem_id, > const char *oem_table_id) > { > Acpi20Hpet *hpet; > +int hpet_start = table_data->len; > > hpet = acpi_data_push(table_data, sizeof(*hpet)); > /* Note timer_block_id value 
must be kept in sync with value advertised > by > @@ -1824,13 +1825,15 @@ build_hpet(GArray *table_data, BIOSLinker *linker, > const char *oem_id, > hpet->timer_block_id = cpu_to_le32(0x8086a201); > hpet->addr.address = cpu_to_le64(HPET_BASE); > build_header(linker, table_data, > - (void *)hpet, "HPET", sizeof(*hpet), 1, oem_id, > oem_table_id); > + (void *)(table_data->data + hpet_start), > + "HPET", sizeof(*hpet), 1, oem_id, oem_table_id); > } > > static
Re: [RFC v12 60/65] target/arm: cpu-pauth: new module for ARMv8.3 Pointer Authentication
On 3/28/21 9:05 PM, Richard Henderson wrote: > On 3/26/21 1:36 PM, Claudio Fontana wrote: >> Pointer Authentication is an AARCH64-only ARMv8.3 optional >> extension, whose cpu properties can be separated out in its own module. >> >> Signed-off-by: Claudio Fontana >> --- >> target/arm/cpu.h | 3 -- >> target/arm/tcg/cpu-pauth.h | 34 >> target/arm/cpu.c | 4 +-- >> target/arm/cpu64.c | 35 ++-- >> target/arm/tcg/cpu-pauth.c | 66 ++ >> target/arm/tcg/meson.build | 1 + >> 6 files changed, 105 insertions(+), 38 deletions(-) >> create mode 100644 target/arm/tcg/cpu-pauth.h >> create mode 100644 target/arm/tcg/cpu-pauth.c > > No move + rename at once. > > Also, you've started using tcg_sve_* and I think that might as well apply to > these, in that second step. > > r~ > The idea for tcg_sve_* was in contrast to cpu_sve_*, which contains the common parts of cpu_sve. So the idea for SVE is: cpu-sve.c : CPU SVE module, contains the common functions. tcg/tcg-sve.c : TCG-specific parts of cpu-sve kvm/kvm-sve.c : KVM-specific parts of cpu-sve Now for PAuth we only have a TCG implementation, so that is the reason that this patch uses the more general name. If still tcg/tcg-pauth.c seems better we can go for it, but it just seems not very logical without a corresponding general cpu-pauth.c Ciao, Claudio
Re: [PATCH for-6.0] block/nbd: fix possible use after free of s->connect_thread
13.04.2021 14:53, Max Reitz wrote: On 06.04.21 17:51, Vladimir Sementsov-Ogievskiy wrote: If on nbd_close() we detach the thread (in nbd_co_establish_connection_cancel() thr->state becomes CONNECT_THREAD_RUNNING_DETACHED), after that point we should not use s->connect_thread (which is set to NULL), as running thread may free it at any time. Still nbd_co_establish_connection() does exactly this: it saves s->connect_thread to local variable (just for better code style) and use it even after yield point, when thread may be already detached. Fix that. Also check thr to be non-NULL on nbd_co_establish_connection() start for safety. After this patch "case CONNECT_THREAD_RUNNING_DETACHED" becomes impossible in the second switch in nbd_co_establish_connection(). Still, don't add extra abort() just before the release. If it somehow possible to reach this "case:" it won't hurt. Anyway, good refactoring of all this reconnect mess will come soon. Signed-off-by: Vladimir Sementsov-Ogievskiy --- Hi all! I faced a crash, just running 277 iotest in a loop. I can't reproduce it on master, it reproduces only on my branch with nbd reconnect refactorings. Still, it seems very possible that it may crash under some conditions. So I propose this patch for 6.0. It's written so that it's obvious that it will not hurt: pre-patch, on first hunk we'll just crash if thr is NULL, on second hunk it's safe to return -1, and using thr when s->connect_thread is already zeroed is obviously wrong. block/nbd.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index c26dc5a54f..1d4668d42d 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -443,6 +443,11 @@ nbd_co_establish_connection(BlockDriverState *bs, Error **errp) BDRVNBDState *s = bs->opaque; NBDConnectThread *thr = s->connect_thread; + if (!thr) { + /* detached */ + return -1; + } + qemu_mutex_lock(&thr->mutex); switch (thr->state) { First, it is a bit strange not to set *errp in these cases. Oops, right! 
(ashamed) I don’t think it’d make a difference anyway, but now that I look into it... The error would be propagated to s->connect_err, but that isn’t used anywhere, so... What’s the point? Yes, it's unused. That's to be improved later. Anyway. What I really wanted to ask is: nbd_co_establish_connection() may create a thread, which accesses the thread. Is that a problem? Like, can this happen:
- nbd_co_establish_connection(): thr->state := THREAD_RUNNING
- nbd_co_establish_connection(): thread is created
- nbd_co_establish_connection(): thr->mutex unlocked
- connect_thread_func(): thr->mutex locked
- connect_thread_func(): thr->state := something else
- connect_thread_func(): thr->mutex unlocked
- nbd_co_establish_connection(): yields
- (As I understood it, this yield leads to nbd_co_establish_connection_cancel() continuing)
- nbd_co_EC_cancel(): thr->mutex locked
- nbd_co_EC_cancel(): do_free true
- nbd_co_EC_cancel(): nbd_free_connect_thread()
- connect_thread_func(): nbd_free_connect_thread()
I think not. Here we are safe: connect_thread_func will only do the free if the thread is in CONNECT_THREAD_RUNNING_DETACHED state. The thread becomes CONNECT_THREAD_RUNNING_DETACHED only on one code path in nbd_co_establish_connection_cancel(), and on that path do_free is false. OK, what you say is possible if we call nbd_co_establish_connection_cancel() twice with detach=true. But that should not be possible if it is called only from .bdrv_close (indirectly) of the nbd driver. The problem is that nbd_co_EC_cancel() may free the thread state, and then nbd_co_establish_connection() reuses the saved thr local variable after the yield. Still (as I noted before), I've never reproduced it on master, only on my branch with some modifications. Still it seems possible anyway. -- Best regards, Vladimir