Re: [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
On 8/20/20 1:26 AM, David Gibson wrote:
> On Fri, Aug 14, 2020 at 05:54:22PM -0300, Daniel Henrique Barboza wrote:
>> We can't use the input from machine->numa_state->nodes directly
>> in the pSeries machine because PAPR does not work with raw distance
>> values, like ACPI SLIT does. We need to determine common
>> associativity domains, based on similar performance/distance of the
>> resources, and set these domains in the associativity array that goes
>> to the FDT of each resource.
>>
>> To ease the translation from regular ACPI NUMA distance info
>> to our PAPR dialect, let's create a matrix called numa_assoc_domains
>> in the SpaprMachineClass. This matrix will be initialized during
>> machine init, where we will read NUMA information from user input,
>> apply a heuristic to determine the associativity domains for each node,
>> then populate numa_assoc_domains accordingly.
>>
>> The changes are mostly centered in the spapr_set_associativity()
>> helper that will use the values of numa_assoc_domains instead of
>> using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
>> h_home_node_associativity() being the exceptions.
>>
>> To keep the changes under control, we'll plug in the matrix usage
>> in the existing code first. The actual heuristic to determine
>> the associativity domains for each NUMA node will come in a follow-up
>> patch.
>>
>> Note that the matrix is initialized with zeros, meaning that there
>> are no guest changes implemented in this patch. We'll keep these
>> changes away from legacy NUMA guests by not initializing the matrix
>> in these cases.
>>
>> Signed-off-by: Daniel Henrique Barboza
>
> IIUC, what this is basically doing is that instead of doing the
> translation from qemu's internal NUMA representation to PAPR's at the
> point(s) we emit the PAPR information, we're moving it to persistent
> data we calculate for each (qemu) numa node then just copy out when we
> emit the information?

Yes.
The original intention was to not go straight from QEMU numa_state to
associativity arrays, given that for each numa_state entry we need to
(1) round the value up to one of the distances the kernel understands
(10, 20, 40, 80, 160) and (2) determine the associativity domains for
each position of the associativity array. I could have done basically
the same thing on top of the existing numa_state info, but I wasn't
sure whether overwriting it was a good idea.

> That could be a reasonable idea, and indeed the rest of the series
> might be easier to understand if this change were earlier in the
> series.
>
> In particular it means we might be able to localise all the hacks for
> calculating the right vectors depending on machine version/options
> into one place.

This didn't cross my mind when I first wrote it, but it is a good idea
to put all the NUMA-related code into the same function. I considered
creating a spapr_numa.c to avoid putting more stuff into spapr.c but,
for the amount of code being added, I didn't think it was justified.
> A couple of nits, though:
>
>> ---
>>  hw/ppc/spapr.c                | 46 +++
>>  hw/ppc/spapr_hcall.c          | 13 --
>>  hw/ppc/spapr_nvdimm.c         | 13 +-
>>  hw/ppc/spapr_pci.c            |  3 ++-
>>  include/hw/ppc/spapr.h        |  7 +-
>>  include/hw/ppc/spapr_nvdimm.h |  5 ++--
>>  6 files changed, 59 insertions(+), 28 deletions(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index b80a6f6936..4f50ab21ee 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>>      return ret;
>>  }
>>
>> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
>> +                             MachineState *machine)
>>  {
>> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
>> +    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
>> +    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
>> +    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
>>      uint8_t assoc_size = 0x4;
>>
>>      if (cpu_index >= 0) {
>> @@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>>      }
>>
>>      assoc[0] = cpu_to_be32(assoc_size);
>> -    assoc[1] = cpu_to_be32(0x0);
>> -    assoc[2] = cpu_to_be32(0x0);
>> -    assoc[3] = cpu_to_be32(0x0);
>> +    assoc[1] = cpu_to_be32(assoc_domain1);
>> +    assoc[2] = cpu_to_be32(assoc_domain2);
>> +    assoc[3] = cpu_to_be32(assoc_domain3);
>>      assoc[4] = cpu_to_be32(node_id);
>
> So spapr_set_associativity() is already a slightly dangerous function,
> because the required buffer space for 'assoc' varies in a non-obvious
> way depending on whether cpu_index is >= 0. I didn't comment on that
> when it was introduced, because it's not really any worse than what it
> replaced. But with this change, I think we can do better.
>
> I'd suggest storing the full PAPR associativity vector for each qemu
> numa node verbatim, so it can just be copied straight into the device
> tree without interpretation. Then the helper can actually do the
> property set, and we don't need magically sized locals any more.
Re: [PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
On Fri, Aug 14, 2020 at 05:54:22PM -0300, Daniel Henrique Barboza wrote:
> We can't use the input from machine->numa_state->nodes directly
> in the pSeries machine because PAPR does not work with raw distance
> values, like ACPI SLIT does. We need to determine common
> associativity domains, based on similar performance/distance of the
> resources, and set these domains in the associativity array that goes
> to the FDT of each resource.
>
> To ease the translation from regular ACPI NUMA distance info
> to our PAPR dialect, let's create a matrix called numa_assoc_domains
> in the SpaprMachineClass. This matrix will be initialized during
> machine init, where we will read NUMA information from user input,
> apply a heuristic to determine the associativity domains for each node,
> then populate numa_assoc_domains accordingly.
>
> The changes are mostly centered in the spapr_set_associativity()
> helper that will use the values of numa_assoc_domains instead of
> using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
> h_home_node_associativity() being the exceptions.
>
> To keep the changes under control, we'll plug in the matrix usage
> in the existing code first. The actual heuristic to determine
> the associativity domains for each NUMA node will come in a follow-up
> patch.
>
> Note that the matrix is initialized with zeros, meaning that there
> are no guest changes implemented in this patch. We'll keep these
> changes away from legacy NUMA guests by not initializing the matrix
> in these cases.
>
> Signed-off-by: Daniel Henrique Barboza

IIUC, what this is basically doing is that instead of doing the
translation from qemu's internal NUMA representation to PAPR's at the
point(s) we emit the PAPR information, we're moving it to persistent
data we calculate for each (qemu) numa node then just copy out when we
emit the information?

That could be a reasonable idea, and indeed the rest of the series
might be easier to understand if this change were earlier in the
series.
In particular it means we might be able to localise all the hacks for
calculating the right vectors depending on machine version/options
into one place.

A couple of nits, though:

> ---
>  hw/ppc/spapr.c                | 46 +++
>  hw/ppc/spapr_hcall.c          | 13 --
>  hw/ppc/spapr_nvdimm.c         | 13 +-
>  hw/ppc/spapr_pci.c            |  3 ++-
>  include/hw/ppc/spapr.h        |  7 +-
>  include/hw/ppc/spapr_nvdimm.h |  5 ++--
>  6 files changed, 59 insertions(+), 28 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index b80a6f6936..4f50ab21ee 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
>      return ret;
>  }
>
> -void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
> +void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
> +                             MachineState *machine)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
> +    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
> +    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
> +    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
>      uint8_t assoc_size = 0x4;
>
>      if (cpu_index >= 0) {
> @@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
>      }
>
>      assoc[0] = cpu_to_be32(assoc_size);
> -    assoc[1] = cpu_to_be32(0x0);
> -    assoc[2] = cpu_to_be32(0x0);
> -    assoc[3] = cpu_to_be32(0x0);
> +    assoc[1] = cpu_to_be32(assoc_domain1);
> +    assoc[2] = cpu_to_be32(assoc_domain2);
> +    assoc[3] = cpu_to_be32(assoc_domain3);
>      assoc[4] = cpu_to_be32(node_id);

So spapr_set_associativity() is already a slightly dangerous function,
because the required buffer space for 'assoc' varies in a non-obvious
way depending on whether cpu_index is >= 0. I didn't comment on that
when it was introduced, because it's not really any worse than what it
replaced. But with this change, I think we can do better.
I'd suggest storing the full PAPR associativity vector for each qemu
numa node verbatim, so it can just be copied straight into the device
tree without interpretation. Then the helper can actually do the
property set, and we don't need magically sized locals any more.

Obviously there will need to be some more handling for the extra layer
we add on cpu assoc vectors. We could store a vector for each vcpu as
well, or just have a hack to adjust these (fdt_appendprop() might be
useful).

> }
>
> -static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
> +static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
> +                                   MachineState *machine)
> {
>     int index = spapr_get_vcpu_id(cpu);
>     uint32_t associativity[6];
> -    spapr_set_associativity(associativity, cpu->node_id, index);
[PATCH 08/10] spapr: introduce SpaprMachineClass::numa_assoc_domains
We can't use the input from machine->numa_state->nodes directly
in the pSeries machine because PAPR does not work with raw distance
values, like ACPI SLIT does. We need to determine common
associativity domains, based on similar performance/distance of the
resources, and set these domains in the associativity array that goes
to the FDT of each resource.

To ease the translation from regular ACPI NUMA distance info
to our PAPR dialect, let's create a matrix called numa_assoc_domains
in the SpaprMachineClass. This matrix will be initialized during
machine init, where we will read NUMA information from user input,
apply a heuristic to determine the associativity domains for each node,
then populate numa_assoc_domains accordingly.

The changes are mostly centered in the spapr_set_associativity()
helper that will use the values of numa_assoc_domains instead of
using 0x0, with spapr_dt_dynamic_reconfiguration_memory() and
h_home_node_associativity() being the exceptions.

To keep the changes under control, we'll plug in the matrix usage
in the existing code first. The actual heuristic to determine
the associativity domains for each NUMA node will come in a follow-up
patch.

Note that the matrix is initialized with zeros, meaning that there
are no guest changes implemented in this patch. We'll keep these
changes away from legacy NUMA guests by not initializing the matrix
in these cases.
Signed-off-by: Daniel Henrique Barboza
---
 hw/ppc/spapr.c                | 46 +++
 hw/ppc/spapr_hcall.c          | 13 --
 hw/ppc/spapr_nvdimm.c         | 13 +-
 hw/ppc/spapr_pci.c            |  3 ++-
 include/hw/ppc/spapr.h        |  7 +-
 include/hw/ppc/spapr_nvdimm.h |  5 ++--
 6 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b80a6f6936..4f50ab21ee 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -201,8 +201,13 @@ static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
     return ret;
 }
 
-void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
+void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index,
+                             MachineState *machine)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
+    uint8_t assoc_domain1 = smc->numa_assoc_domains[node_id][0];
+    uint8_t assoc_domain2 = smc->numa_assoc_domains[node_id][1];
+    uint8_t assoc_domain3 = smc->numa_assoc_domains[node_id][2];
     uint8_t assoc_size = 0x4;
 
     if (cpu_index >= 0) {
@@ -211,17 +216,18 @@ void spapr_set_associativity(uint32_t *assoc, int node_id, int cpu_index)
     }
 
     assoc[0] = cpu_to_be32(assoc_size);
-    assoc[1] = cpu_to_be32(0x0);
-    assoc[2] = cpu_to_be32(0x0);
-    assoc[3] = cpu_to_be32(0x0);
+    assoc[1] = cpu_to_be32(assoc_domain1);
+    assoc[2] = cpu_to_be32(assoc_domain2);
+    assoc[3] = cpu_to_be32(assoc_domain3);
     assoc[4] = cpu_to_be32(node_id);
 }
 
-static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu)
+static int spapr_fixup_cpu_numa_dt(void *fdt, int offset, PowerPCCPU *cpu,
+                                   MachineState *machine)
 {
     int index = spapr_get_vcpu_id(cpu);
     uint32_t associativity[6];
-    spapr_set_associativity(associativity, cpu->node_id, index);
+    spapr_set_associativity(associativity, cpu->node_id, index, machine);
 
     /* Advertise NUMA via ibm,associativity */
     return fdt_setprop(fdt, offset, "ibm,associativity", associativity,
@@ -335,14 +341,14 @@ static void add_str(GString *s, const gchar *s1)
 }
 
 static int spapr_dt_memory_node(void *fdt, int nodeid, hwaddr start,
-                                hwaddr size)
+                                hwaddr size, MachineState *machine)
 {
     uint32_t associativity[5];
     char mem_name[32];
     uint64_t mem_reg_property[2];
     int off;
 
-    spapr_set_associativity(associativity, nodeid, -1);
+    spapr_set_associativity(associativity, nodeid, -1, machine);
 
     mem_reg_property[0] = cpu_to_be64(start);
     mem_reg_property[1] = cpu_to_be64(size);
@@ -574,6 +580,7 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
                                                    void *fdt)
 {
     MachineState *machine = MACHINE(spapr);
+    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(machine);
     int nb_numa_nodes = machine->numa_state->num_nodes;
     int ret, i, offset;
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
@@ -628,12 +635,17 @@ static int spapr_dt_dynamic_reconfiguration_memory(SpaprMachineState *spapr,
     int_buf[1] = cpu_to_be32(4); /* Number of entries per associativity list */
     cur_index += 2;
     for (i = 0; i < nr_nodes; i++) {
+        uint8_t assoc_domain1 = smc->numa_assoc_domains[i][0];
+        uint8_t assoc_domain2 = smc->numa_assoc_domains[i][1];
+        uint8_t assoc_domain3 = smc->numa_assoc_domains[i][2];
+
         uint32_t associativity[] = {
-