[PATCH 03/10] spapr: robustify NVLink2 NUMA node logic

2020-08-15 Thread Daniel Henrique Barboza
NVLink2 GPUs are allocated in their own NUMA node, at maximum
distance from every other resource on the board. The existing
logic makes some assumptions that don't scale well:

- only NVLink2 GPUs will ever require such a mechanism, meaning
that the GPU logic is tightly coupled with the NUMA setup of
the machine via how ibm,max-associativity-domains is set.

- the code relies on QEMU's lack of support for sparse NUMA
nodes. That support may eventually be implemented, and then the
assumption that spapr->gpu_numa_id represents the total of NUMA
nodes plus all NUMA ids generated for the GPUs, which only holds
while QEMU NUMA nodes are not sparse, becomes a recipe for
disaster.

This patch fixes both assumptions by creating a generic
mechanism to get an available NUMA node, regardless of whether
the NUMA setup is sparse. The idea is to rename the existing
spapr->gpu_numa_id to spapr->current_numa_id and to add a new
spapr->extra_numa_nodes attribute. Both are used in a new function,
spapr_pci_get_available_numa_id(), which takes a possibly sparse
NUMA configuration into account when retrieving an available
NUMA id for the caller. Each consecutive call of
spapr_pci_get_available_numa_id() generates a new ID, until
numa_state->num_nodes + spapr->extra_numa_nodes would exceed
MAX_NODES. The code is generic: only NVLink2 uses it at the
moment, but it is available to any other device in the future.
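
For illustration, here is a minimal, self-contained sketch of that
allocation strategy, using toy state instead of QEMU's machine state.
MAX_NODES_DEMO, node_present and the other names are assumptions for the
example, not the patch's code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES_DEMO 8          /* stand-in for QEMU's MAX_NODES */

/* Node ids already taken by user-configured NUMA nodes (sparse: 0 and 2). */
static bool node_present[MAX_NODES_DEMO] = { true, false, true };

static unsigned num_nodes = 2;    /* user-configured NUMA nodes */
static unsigned extra_numa_nodes; /* ids handed out for devices so far */

/* Return the lowest unused node id, or -1 once the limit is hit. */
static int get_available_numa_id(void)
{
    unsigned i;

    if (num_nodes + extra_numa_nodes >= MAX_NODES_DEMO) {
        return -1;                /* mirrors the patch's error_setg() path */
    }
    for (i = 0; i < MAX_NODES_DEMO; i++) {
        if (!node_present[i]) {
            node_present[i] = true;
            extra_numa_nodes++;
            return (int)i;
        }
    }
    return -1;
}

int main(void)
{
    int j;

    /* With ids 0 and 2 in use, consecutive calls return 1, 3 and 4. */
    for (j = 0; j < 3; j++) {
        printf("allocated node id: %d\n", get_available_numa_id());
    }
    return 0;
}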

With this new function in place, we can decouple the
ibm,max-associativity-domains logic from the NVLink2 logic by
using the new spapr->extra_numa_nodes to define the maxdomains
of the fourth NUMA level. Instead of defining it as gpu_numa_id,
use num_nodes + extra_numa_nodes. This also makes it resilient
to any future change in the support of sparse NUMA nodes.
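
As a worked example of the new maxdomains definition (the configuration
values are assumed for illustration): with two user-configured NUMA nodes
and one extra node allocated for a GPU, the property cells from the
spapr.c hunk below come out as follows:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Assumed example: 2 user NUMA nodes plus 1 extra node for a GPU.
     * The real code wraps each cell in cpu_to_be32(); omitted here. */
    uint32_t num_nodes = 2, extra_numa_nodes = 1;
    uint32_t maxdomain = extra_numa_nodes > 1 ? 1 : 0;  /* 0 in this case */
    uint32_t maxdomains[] = {
        4,                             /* first cell: number of levels */
        maxdomain, maxdomain, maxdomain,
        num_nodes + extra_numa_nodes,  /* 3: must cover all node ids */
    };
    int i;

    for (i = 0; i < 5; i++) {
        printf("cell %d: %u\n", i, (unsigned)maxdomains[i]);
    }
    return 0;
}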

Despite all the code juggling, there is no functional change,
because sparse NUMA nodes aren't supported yet and we do not
support distinct NUMA distances via user input. The next patches
will change that.

Signed-off-by: Daniel Henrique Barboza 
---
 hw/ppc/spapr.c  | 15 ++-
 hw/ppc/spapr_pci.c  | 33 +
 hw/ppc/spapr_pci_nvlink2.c  | 10 ++
 include/hw/pci-host/spapr.h |  2 ++
 include/hw/ppc/spapr.h  |  4 +++-
 5 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3b16edaf4c..22e78cfc84 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -910,13 +910,13 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
         cpu_to_be32(SPAPR_MEMORY_BLOCK_SIZE & 0xffffffff),
         cpu_to_be32(ms->smp.max_cpus / ms->smp.threads),
     };
-    uint32_t maxdomain = cpu_to_be32(spapr->gpu_numa_id > 1 ? 1 : 0);
+    uint32_t maxdomain = cpu_to_be32(spapr->extra_numa_nodes > 1 ? 1 : 0);
     uint32_t maxdomains[] = {
         cpu_to_be32(4),
         maxdomain,
         maxdomain,
         maxdomain,
-        cpu_to_be32(spapr->gpu_numa_id),
+        cpu_to_be32(ms->numa_state->num_nodes + spapr->extra_numa_nodes),
     };
 
     _FDT(rtas = fdt_add_subnode(fdt, 0, "rtas"));
@@ -2824,13 +2824,18 @@ static void spapr_machine_init(MachineState *machine)
     /*
      * NVLink2-connected GPU RAM needs to be placed on a separate NUMA node.
      * We assign a new numa ID per GPU in spapr_pci_collect_nvgpu() which is
-     * called from vPHB reset handler so we initialize the counter here.
+     * called from vPHB reset handler. We have code to generate an extra numa
+     * id to place the GPU via 'extra_numa_nodes' and 'current_numa_id', which
+     * are initialized here.
+     *
      * If no NUMA is configured from the QEMU side, we start from 1 as GPU RAM
      * must be equally distant from any other node.
-     * The final value of spapr->gpu_numa_id is going to be written to
+     *
+     * The extra NUMA node ids generated for GPU usage will be written to
      * max-associativity-domains in spapr_build_fdt().
      */
-    spapr->gpu_numa_id = MAX(1, machine->numa_state->num_nodes);
+    spapr->current_numa_id = 0;
+    spapr->extra_numa_nodes = 0;
 
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 0a418f1e67..09ac58fd7f 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -2492,3 +2492,36 @@ void spapr_pci_switch_vga(bool big_endian)
                            &big_endian);
     }
 }
+
+unsigned spapr_pci_get_available_numa_id(Error **errp)
+{
+    MachineState *machine = MACHINE(qdev_get_machine());
+    SpaprMachineState *spapr = SPAPR_MACHINE(machine);
+    NodeInfo *numa_info = machine->numa_state->nodes;
+    unsigned i, start;
+
+    if (machine->numa_state->num_nodes + spapr->extra_numa_nodes >= MAX_NODES) {
+        error_setg(errp,
+                   "Unable to get an extra NUMA node beyond MAX_NODES = %d",
+                   MAX_NODES);
+
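
The archive preview ends mid-function. For context, here is a hedged
sketch of how a caller could consume this API, following QEMU's usual
Error conventions; the callsite, the 'nvslot' variable and its 'numa_id'
field are illustrative assumptions, not code from this patch:

/*
 * Hypothetical caller, e.g. inside the NVLink2 GPU collection path.
 */
Error *local_err = NULL;
unsigned numa_id = spapr_pci_get_available_numa_id(&local_err);

if (local_err) {
    /* No free node id left below MAX_NODES: report and give up. */
    error_report_err(local_err);
    return;
}

nvslot->numa_id = numa_id;    /* place the GPU RAM on the new node */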