Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <[email protected]>
> Sent: September 9, 2021 6:32
> To: Wei Chen <[email protected]>
> Cc: Julien Grall <[email protected]>; Stefano Stabellini
> <[email protected]>; [email protected]; Bertrand Marquis
> <[email protected]>; Jan Beulich <[email protected]>
> Subject: RE: [XEN RFC PATCH 23/40] xen/arm: introduce a helper to parse
> device tree memory node
> 
> On Wed, 8 Sep 2021, Wei Chen wrote:
> > > > >>> @@ -55,6 +57,79 @@ static int __init dtb_numa_processor_affinity_init(nodeid_t node)
> > > > >>>       return 0;
> > > > >>>   }
> > > > >>>
> > > > >>> +/* Callback for parsing of the memory regions affinity */
> > > > >>> +static int __init dtb_numa_memory_affinity_init(nodeid_t node,
> > > > >>> +                                paddr_t start, paddr_t size)
> > > > >>> +{
> > > > >>> +    struct node *nd;
> > > > >>> +    paddr_t end;
> > > > >>> +    int i;
> > > > >>> +
> > > > >>> +    if ( srat_disabled() )
> > > > >>> +        return -EINVAL;
> > > > >>> +
> > > > >>> +    end = start + size;
> > > > >>> +    if ( num_node_memblks >= NR_NODE_MEMBLKS )
> > > > >>> +    {
> > > > >>> +        dprintk(XENLOG_WARNING,
> > > > >>> +                "Too many numa entry, try bigger NR_NODE_MEMBLKS \n");
> > > > >>> +        bad_srat();
> > > > >>> +        return -EINVAL;
> > > > >>> +    }
> > > > >>> +
> > > > >>> +    /* It is fine to add this area to the nodes data it will be used later */
> > > > >>> +    i = conflicting_memblks(start, end);
> > > > >>> +    /* No conflicting memory block, we can save it for later usage */;
> > > > >>> +    if ( i < 0 )
> > > > >>> +        goto save_memblk;
> > > > >>> +
> > > > >>> +    if ( memblk_nodeid[i] == node ) {
> > > > >>> +        /*
> > > > >>> +         * Overlaps with other memblk in the same node, warning here.
> > > > >>> +         * This memblk will be merged with conflicted memblk later.
> > > > >>> +         */
> > > > >>> +        printk(XENLOG_WARNING
> > > > >>> +               "DT: NUMA NODE %u (%"PRIx64
> > > > >>> +               "-%"PRIx64") overlaps with itself (%"PRIx64"-%"PRIx64")\n",
> > > > >>> +               node, start, end,
> > > > >>> +               node_memblk_range[i].start, node_memblk_range[i].end);
> > > > >>> +    } else {
> > > > >>> +        /*
> > > > >>> +         * Conflict with memblk in other node, this is an error.
> > > > >>> +         * The NUMA information is invalid, NUMA will be turn off.
> > > > >>> +         */
> > > > >>> +        printk(XENLOG_ERR
> > > > >>> +               "DT: NUMA NODE %u (%"PRIx64"-%"
> > > > >>> +               PRIx64") overlaps with NODE %u (%"PRIx64"-%"PRIx64")\n",
> > > > >>> +               node, start, end, memblk_nodeid[i],
> > > > >>> +               node_memblk_range[i].start, node_memblk_range[i].end);
> > > > >>> +        bad_srat();
> > > > >>> +        return -EINVAL;
> > > > >>> +    }
> > > > >>> +
> > > > >>> +save_memblk:
> > > > >>> +    nd = &nodes[node];
> > > > >>> +    if ( !node_test_and_set(node, memory_nodes_parsed) ) {
> > > > >>> +        nd->start = start;
> > > > >>> +        nd->end = end;
> > > > >>> +    } else {
> > > > >>> +        if ( start < nd->start )
> > > > >>> +            nd->start = start;
> > > > >>> +        if ( nd->end < end )
> > > > >>> +            nd->end = end;
> > > > >>> +    }
> > > > >>> +
> > > > >>> +    printk(XENLOG_INFO "DT: NUMA node %u %"PRIx64"-%"PRIx64"\n",
> > > > >>> +           node, start, end);
> > > > >>> +
> > > > >>> +    node_memblk_range[num_node_memblks].start = start;
> > > > >>> +    node_memblk_range[num_node_memblks].end = end;
> > > > >>> +    memblk_nodeid[num_node_memblks] = node;
> > > > >>> +    num_node_memblks++;
> > > > >>
> > > > >>
> > > > >> Is it possible to have non-contiguous ranges of memory for a single
> > > > >> NUMA node?
> > > > >>
> > > > >> Looking at the DT bindings and Linux implementation, it seems possible.
> > > > >> Here, it seems that node_memblk_range/memblk_nodeid could handle it,
> > > > >> but nodes couldn't.
> > > > >
> > > > > Yes, you're right. I copied this code from x86 ACPI NUMA. Does ACPI
> > > > > allow non-contiguous ranges of memory for a single NUMA node too?
> > > >
> > > > I couldn't find any restriction for ACPI, although I only briefly
> > > > looked at the spec.
> > > >
> > > > > If yes, I think
> > > > > this will affect x86 ACPI NUMA too. In the next version, we plan to
> > > > > merge dtb_numa_memory_affinity_init and acpi_numa_memory_affinity_init
> > > > > into a neutral function, so we can fix them at the same time.
> > > > >
> > > > > If not, maybe we have to keep separate implementations for DT and ACPI here.
> > > >
> > > > I am not entirely sure what you mean. Are you saying if ACPI doesn't
> > > > allow non-contiguous ranges of memory, then we should keep the
> > > > implementations separate?
> > > >
> > > > If so, then I disagree with that. It is fine to have code that supports
> > > > more than what a firmware table supports. The main benefit is less code
> > > > and therefore less long-term maintenance (with the current solution we
> > > > would need to check both the ACPI and DT implementations if there is a
> > > > bug in one).
> > > >
> > >
> > > Yes, I agree.
> > >
> >
> > I am looking for a way to address this comment. The current "nodes"
> > array does not account for the situation where memory addresses of
> > different NUMA nodes can be interleaved.
> >
> > This code exists in the x86 NUMA implementation; I think it may be based
> > on an early version of the Linux x86 NUMA code. In recent Linux code,
> > neither ACPI/numa/srat.c [1] nor the x86 NUMA code [2] uses "nodes" to
> > record NUMA memory address boundaries, and they don't depend on "nodes"
> > for sanity checks.
> >
> > To fix this, we would have to upgrade the x86 NUMA driver. That would
> > have a significant impact on Xen-x86, and I think it might be out of
> > scope for this series. Can we start another thread to discuss it?
> >
> > Or could you suggest some simpler way to fix it?
> 
> It looks like we would have to replace all the node->start /
> node->end checks with node_memblk_range checks. There are a few of them
> in valid_numa_range, conflicting_memblks, cutoff_node, and
> nodes_cover_memory. It wouldn't be trivial.
>
> Although I do think that non-contiguous memory for NUMA nodes is
> important to support, the patch series is already 40 patches. I don't
> think it is a good idea to add other significant changes to it. I
> wouldn't upgrade the x86 NUMA driver now. If we can't find a better way,
> we can proceed as you are doing in this version, with the known gap that
> we can't deal with non-contiguous memory for NUMA nodes, and fix it with
> a follow-up series later. In that case we would want to have an explicit
> check for non-contiguous memory for NUMA nodes in
> dtb_numa_memory_affinity_init and error out if found.
> 

Yes, I think this is the more appropriate approach for now.
I will add some code to do an explicit check and give a warning/error.
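Something along these lines, reusing the variables from the quoted hunk
(a rough, untested sketch; the exact condition and message wording are
open for discussion, and it would also reject blocks that arrive out of
order and only become contiguous later):

save_memblk:
    nd = &nodes[node];
    if ( !node_test_and_set(node, memory_nodes_parsed) )
    {
        nd->start = start;
        nd->end = end;
    }
    else
    {
        /*
         * "nodes" can only record a single [start, end) span per node.
         * If the new block leaves a hole, that span would silently cover
         * memory that may belong to another node, so reject the NUMA
         * information instead of merging.
         */
        if ( start > nd->end || end < nd->start )
        {
            printk(XENLOG_ERR
                   "DT: NUMA node %u memory is non-contiguous, NUMA will be disabled\n",
                   node);
            bad_srat();
            return -EINVAL;
        }
        if ( start < nd->start )
            nd->start = start;
        if ( nd->end < end )
            nd->end = end;
    }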
 
> 
> > Also, on Linux, NUMA implementations for x86 are different from the
> > Arm64 and RISC-V implementations. [3]
> >
> > [1] https://github.com/torvalds/linux/blob/master/drivers/acpi/numa/srat.c
> > [2] https://github.com/torvalds/linux/blob/master/arch/x86/mm/numa.c
> > [3] https://github.com/torvalds/linux/blob/master/drivers/base/arch_numa.c
> 
> 
> In general, I like the idea of sharing code as much as possible between
> architectures (x86, ARM, etc.) and between DT/ACPI because it makes the
> code maintenance easier and one might even gain certain features for
> free.
>
> However, the exercise of sharing code shouldn't take significant
> additional effort. In fact, it should decrease the overall effort:
> instead of writing new code, one just takes existing code and moves it
> to common.
>
> In this instance, I think it would be good to be able to share the NUMA
> initialization code between x86/ARM and ACPI/DT if it doesn't cause
> extra effort.
>
> Here the extra effort that my previous comment might cause doesn't
> derive from x86/ARM or DT/ACPI code sharing. It derives from the fact
> that our existing code unfortunately doesn't deal with non-contiguous
> memory for NUMA nodes. That is something we need to find a way to cope
> with anyway.

Yes. I didn't post the links above to argue for keeping Arm and x86
separate. I'm not sure exactly why Linux does this, but I think for Xen
we can still try to share code between Arm/x86 and DT/ACPI.
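
For the merge, I am thinking of something along these lines (just a
sketch, untested; the helper name numa_update_node_memblk and the
"prefix" argument are made up for illustration, and the nodes[]
bookkeeping from the current patch is omitted for brevity):

/*
 * Common helper shared by the DT and ACPI parsers. Both
 * dtb_numa_memory_affinity_init and acpi_numa_memory_affinity_init
 * would reduce to thin wrappers that extract (node, start, size) from
 * their firmware table and delegate the checks and bookkeeping here.
 */
static int __init numa_update_node_memblk(nodeid_t node, paddr_t start,
                                          paddr_t size, const char *prefix)
{
    paddr_t end = start + size;
    int i;

    if ( srat_disabled() )
        return -EINVAL;

    if ( num_node_memblks >= NR_NODE_MEMBLKS )
    {
        bad_srat();
        return -EINVAL;
    }

    /* An overlap with a different node invalidates the firmware table. */
    i = conflicting_memblks(start, end);
    if ( i >= 0 && memblk_nodeid[i] != node )
    {
        printk(XENLOG_ERR
               "%s: NUMA node %u (%"PRIx64"-%"PRIx64") overlaps with node %u\n",
               prefix, node, start, end, memblk_nodeid[i]);
        bad_srat();
        return -EINVAL;
    }

    node_memblk_range[num_node_memblks].start = start;
    node_memblk_range[num_node_memblks].end = end;
    memblk_nodeid[num_node_memblks] = node;
    num_node_memblks++;

    return 0;
}

/* The DT callback then becomes a one-liner: */
static int __init dtb_numa_memory_affinity_init(nodeid_t node,
                                                paddr_t start, paddr_t size)
{
    return numa_update_node_memblk(node, start, size, "DT");
}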

Cheers,
Wei Chen
