Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-29 Thread Yunsheng Lin
On 2019/10/29 16:53, Michal Hocko wrote:
> On Mon 28-10-19 17:20:33, Yunsheng Lin wrote:
>> On 2019/10/12 15:40, Greg KH wrote:
>>> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
 add pci and acpi maintainer
 cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org

 On 2019/10/11 19:15, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>> But I failed to see why the above is related to making 
>> node_to_cpumask_map()
>> NUMA_NO_NODE aware?
>
> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> have a node assigned.
>
> It not having one, is a straight up bug. We must not silently accept
> NO_NODE there, ever.
>

 I suppose that by "not silently accept NO_NODE" you mean reporting a lack
 of affinity when the node of a pcie device is not set.
>>>
>>> If the firmware of a pci device does not provide the node information,
>>> then yes, warn about that.
>>>
 As Greg has asked about in [1]:
 what is a user to do when the user sees the kernel reporting that?

 We may tell the user to contact their vendor for info or updates about
 that when they do not know their system well enough, but their
 vendor may get away with this by quoting the ACPI spec, as the spec
 considers this optional. Should the user believe this is indeed a
 fw bug or a misreport from the kernel?
>>>
>>> Say it is a firmware bug; if it is a firmware bug, that's simple.
>>>
 If this kind of reporting is common practice and will not cause any
 misunderstanding, then maybe we can report that.
>>>
>>> Yes, please do so, that's the only way those boxes are ever going to get
>>> fixed.  And go add the test to the "firmware testing" tool that is based
>>> on Linux that Intel has somewhere, to give vendors a chance to fix this
>>> before they ship hardware.
>>>
>>> This shouldn't be a big deal, we warn of other hardware bugs all the
>>> time.
>>
>> Hi, all.
>>
>> The warning for the above case has been added in [1].
>>
>> So maybe it makes sense to make node_to_cpumask_map() NUMA_NO_NODE aware
>> now?
>>
>> If yes, this patch can still be applied cleanly to the latest Linus tree.
>> Do I need to resend it?
>>
> 
> By this patch you mean 
> http://lkml.kernel.org/r/1568724534-146242-1-git-send-email-linyunsh...@huawei.com
> right?

Yes.

> 
> I would just resend it unless there is still a clear disagreement over
> it.

Ok, thanks.

Will resend it to see if there is still a disagreement over it.

> 
>> [1] 
>> https://lore.kernel.org/linux-pci/1571467543-26125-1-git-send-email-linyunsh...@huawei.com/
> 
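
For context, the patch being resent makes the per-arch cpumask_of_node()
helpers treat NUMA_NO_NODE as "no preference" instead of using it as an
array index. A minimal sketch of the idea, assuming the x86 helper (the
actual patch touches several architectures; see the lkml link above for
the real diff):

    /* Sketch only: node_to_cpumask_map[NUMA_NO_NODE] would index the
     * map at -1, so return all online CPUs for "no preference". */
    const struct cpumask *cpumask_of_node(int node)
    {
            if (node == NUMA_NO_NODE)
                    return cpu_online_mask;
            return node_to_cpumask_map[node];
    }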



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-29 Thread Michal Hocko
On Mon 28-10-19 17:20:33, Yunsheng Lin wrote:
> On 2019/10/12 15:40, Greg KH wrote:
> > On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >> add pci and acpi maintainer
> >> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >>
> >> On 2019/10/11 19:15, Peter Zijlstra wrote:
> >>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>  But I failed to see why the above is related to making 
>  node_to_cpumask_map()
>  NUMA_NO_NODE aware?
> >>>
> >>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> >>> have a node assigned.
> >>>
> >>> It not having one, is a straight up bug. We must not silently accept
> >>> NO_NODE there, ever.
> >>>
> >>
> >> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> >> of affinity when the node of a pcie device is not set.
> > 
> > If the firmware of a pci device does not provide the node information,
> > then yes, warn about that.
> > 
> >> As Greg has asked about in [1]:
> >> what is a user to do when the user sees the kernel reporting that?
> >>
> >> We may tell the user to contact their vendor for info or updates about
> >> that when they do not know their system well enough, but their
> >> vendor may get away with this by quoting the ACPI spec, as the spec
> >> considers this optional. Should the user believe this is indeed a
> >> fw bug or a misreport from the kernel?
> > 
> > Say it is a firmware bug; if it is a firmware bug, that's simple.
> > 
> >> If this kind of reporting is common practice and will not cause any
> >> misunderstanding, then maybe we can report that.
> > 
> > Yes, please do so, that's the only way those boxes are ever going to get
> > fixed.  And go add the test to the "firmware testing" tool that is based
> > on Linux that Intel has somewhere, to give vendors a chance to fix this
> > before they ship hardware.
> > 
> > This shouldn't be a big deal, we warn of other hardware bugs all the
> > time.
> 
> Hi, all.
> 
> The warning for the above case has been added in [1].
> 
> So maybe it makes sense to make node_to_cpumask_map() NUMA_NO_NODE aware
> now?
> 
> If yes, this patch can still be applied cleanly to the latest Linus tree.
> Do I need to resend it?
> 

By this patch you mean 
http://lkml.kernel.org/r/1568724534-146242-1-git-send-email-linyunsh...@huawei.com
right?

I would just resend it unless there is still a clear disagreement over
it.

> [1] 
> https://lore.kernel.org/linux-pci/1571467543-26125-1-git-send-email-linyunsh...@huawei.com/

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-28 Thread Yunsheng Lin
On 2019/10/12 15:40, Greg KH wrote:
> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>> add pci and acpi maintainer
>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>>
>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
 But I failed to see why the above is related to making 
 node_to_cpumask_map()
 NUMA_NO_NODE aware?
>>>
>>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>>> have a node assigned.
>>>
>>> It not having one, is a straight up bug. We must not silently accept
>>> NO_NODE there, ever.
>>>
>>
>> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
>> of affinity when the node of a pcie device is not set.
> 
> If the firmware of a pci device does not provide the node information,
> then yes, warn about that.
> 
>> As Greg has asked about in [1]:
>> what is a user to do when the user sees the kernel reporting that?
>>
>> We may tell the user to contact their vendor for info or updates about
>> that when they do not know their system well enough, but their
>> vendor may get away with this by quoting the ACPI spec, as the spec
>> considers this optional. Should the user believe this is indeed a
>> fw bug or a misreport from the kernel?
> 
> Say it is a firmware bug; if it is a firmware bug, that's simple.
> 
>> If this kind of reporting is common practice and will not cause any
>> misunderstanding, then maybe we can report that.
> 
> Yes, please do so, that's the only way those boxes are ever going to get
> fixed.  And go add the test to the "firmware testing" tool that is based
> on Linux that Intel has somewhere, to give vendors a chance to fix this
> before they ship hardware.
> 
> This shouldn't be a big deal, we warn of other hardware bugs all the
> time.

Hi, all.

The warning for the above case has been added in [1].

So maybe it makes sense to make node_to_cpumask_map() NUMA_NO_NODE aware
now?

If yes, this patch can still be applied cleanly to the latest Linus tree.
Do I need to resend it?


[1] 
https://lore.kernel.org/linux-pci/1571467543-26125-1-git-send-email-linyunsh...@huawei.com/
> 
> thanks,
> 
> greg k-h
> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-16 Thread Yunsheng Lin
On 2019/10/16 0:58, Greg KH wrote:
> On Tue, Oct 15, 2019 at 06:40:29PM +0800, Yunsheng Lin wrote:
>> On 2019/10/14 17:25, Greg KH wrote:
>>> On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
 On 2019/10/12 18:47, Greg KH wrote:
> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
>>> On 2019/10/12 15:40, Greg KH wrote:
 On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> add pci and acpi maintainer
> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>
> On 2019/10/11 19:15, Peter Zijlstra wrote:
>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>>> But I failed to see why the above is related to making 
>>> node_to_cpumask_map()
>>> NUMA_NO_NODE aware?
>>
>> Your initial bug is for hns3, which is a PCI device, which really 
>> _MUST_
>> have a node assigned.
>>
>> It not having one, is a straight up bug. We must not silently accept
>> NO_NODE there, ever.
>>
>
> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> of affinity when the node of a pcie device is not set.

 If the firmware of a pci device does not provide the node information,
 then yes, warn about that.

> As Greg has asked about in [1]:
> what is a user to do when the user sees the kernel reporting that?
>
> We may tell the user to contact their vendor for info or updates about
> that when they do not know their system well enough, but their
> vendor may get away with this by quoting the ACPI spec, as the spec
> considers this optional. Should the user believe this is indeed a
> fw bug or a misreport from the kernel?

 Say it is a firmware bug; if it is a firmware bug, that's simple.

> If this kind of reporting is common practice and will not cause any
> misunderstanding, then maybe we can report that.

 Yes, please do so, that's the only way those boxes are ever going to 
 get
 fixed.  And go add the test to the "firmware testing" tool that is 
 based
 on Linux that Intel has somewhere, to give vendors a chance to fix this
 before they ship hardware.

 This shouldn't be a big deal, we warn of other hardware bugs all the
 time.
>>>
>>> Ok, thanks for clarifying.
>>>
>>> Will send a patch to catch the case where a pcie device does not have its
>>> numa node set, and warn about it.
>>>
>>> Maybe use dev->bus to verify if it is a pci device?
>>
>> No, do that in the pci bus core code itself, when creating the devices
>> as that is when you know, or do not know, the numa node, right?
>>
>> This can't be in the driver core only, as each bus type will have a
>> different way of determining what node the device is on.  For some
>> reason, I thought the PCI core code already does this, right?
>
> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> thing...
>
> Anyway, it looks like the pci core code does call set_dev_node() based
> on the PCI bridge, so if that is set up properly, all should be fine.
>
> If not, well, you have buggy firmware and you need to warn about that at
> the time you are creating the bridge.  Look at the call to
> pcibus_to_node() in pci_register_host_bridge().

 Thanks for pointing out the specific function.
 Maybe we do not need to warn about the case when the device has a parent,
 because we would already have warned about the parent if the device has a
 parent and the parent also has a node of NO_NODE, so there is no need to
 warn about the child device anymore? Like below:

 @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
 	list_add_tail(&bus->node, &pci_root_buses);
 	up_write(&pci_bus_sem);

 +   if (nr_node_ids > 1 && !parent &&
>>>
>>> Why do you need to check this?  If you have a parent, your node should
>>> be set; if not, that's an error, right?
>>
>> If the device has a parent and the parent device also has a node of
>> NUMA_NO_NODE, then we may already have warned about the parent device,
>> so we do not have to warn about the child device?
> 
> But it's a PCI bridge; if it is not set properly, that needs to be fixed,
> otherwise the PCI devices attached to it have no hope of working
> properly.

You may be right, thanks.

If it's a root PCI bridge and it does have a parent device, but
the parent device is not a pcie device and its node is NUMA_NO_NODE,
then we will miss warning about this case.

> 
>> In pci_register_host_bridge():
>>
>  if (!parent)
>  set_dev_node(bus->bridge, pcibus_to_node(bus));

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-15 Thread Greg KH
On Tue, Oct 15, 2019 at 06:40:29PM +0800, Yunsheng Lin wrote:
> On 2019/10/14 17:25, Greg KH wrote:
> > On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
> >> On 2019/10/12 18:47, Greg KH wrote:
> >>> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>  On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> > On 2019/10/12 15:40, Greg KH wrote:
> >> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >>> add pci and acpi maintainer
> >>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >>>
> >>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>  On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> > But I failed to see why the above is related to making 
> > node_to_cpumask_map()
> > NUMA_NO_NODE aware?
> 
>  Your initial bug is for hns3, which is a PCI device, which really 
>  _MUST_
>  have a node assigned.
> 
>  It not having one, is a straight up bug. We must not silently accept
>  NO_NODE there, ever.
> 
> >>>
> >>> I suppose that by "not silently accept NO_NODE" you mean reporting a
> >>> lack of affinity when the node of a pcie device is not set.
> >>
> >> If the firmware of a pci device does not provide the node information,
> >> then yes, warn about that.
> >>
> >>> As Greg has asked about in [1]:
> >>> what is a user to do when the user sees the kernel reporting that?
> >>>
> >>> We may tell the user to contact their vendor for info or updates about
> >>> that when they do not know their system well enough, but their
> >>> vendor may get away with this by quoting the ACPI spec, as the spec
> >>> considers this optional. Should the user believe this is indeed a
> >>> fw bug or a misreport from the kernel?
> >>
> >> Say it is a firmware bug; if it is a firmware bug, that's simple.
> >>
> >>> If this kind of reporting is common practice and will not cause any
> >>> misunderstanding, then maybe we can report that.
> >>
> >> Yes, please do so, that's the only way those boxes are ever going to 
> >> get
> >> fixed.  And go add the test to the "firmware testing" tool that is 
> >> based
> >> on Linux that Intel has somewhere, to give vendors a chance to fix this
> >> before they ship hardware.
> >>
> >> This shouldn't be a big deal, we warn of other hardware bugs all the
> >> time.
> >
> > Ok, thanks for clarifying.
> >
> > Will send a patch to catch the case where a pcie device does not have its
> > numa node set, and warn about it.
> >
> > Maybe use dev->bus to verify if it is a pci device?
> 
>  No, do that in the pci bus core code itself, when creating the devices
>  as that is when you know, or do not know, the numa node, right?
> 
>  This can't be in the driver core only, as each bus type will have a
>  different way of determining what node the device is on.  For some
>  reason, I thought the PCI core code already does this, right?
> >>>
> >>> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> >>> thing...
> >>>
> >>> Anyway, it looks like the pci core code does call set_dev_node() based
> >>> on the PCI bridge, so if that is set up properly, all should be fine.
> >>>
> >>> If not, well, you have buggy firmware and you need to warn about that at
> >>> the time you are creating the bridge.  Look at the call to
> >>> pcibus_to_node() in pci_register_host_bridge().
> >>
> >> Thanks for pointing out the specific function.
> >> Maybe we do not need to warn about the case when the device has a parent,
> >> because we would already have warned about the parent if the device has a
> >> parent and the parent also has a node of NO_NODE, so there is no need to
> >> warn about the child device anymore? Like below:
> >>
> >> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> >> 	list_add_tail(&bus->node, &pci_root_buses);
> >> 	up_write(&pci_bus_sem);
> >>
> >> +   if (nr_node_ids > 1 && !parent &&
> > 
> > Why do you need to check this?  If you have a parent, your node should
> > be set; if not, that's an error, right?
> 
> If the device has a parent and the parent device also has a node of
> NUMA_NO_NODE, then we may already have warned about the parent device,
> so we do not have to warn about the child device?

But it's a PCI bridge; if it is not set properly, that needs to be fixed,
otherwise the PCI devices attached to it have no hope of working
properly.

> In pci_register_host_bridge():
> 
>   if (!parent)
>   set_dev_node(bus->bridge, pcibus_to_node(bus));
> 
> The above only sets the node of the bridge device to the node of the bus
> if the bridge device does not have a parent.

Odd, what happens to devices behind another bridge today?  Are their
nodes set properly today?  Is the node 

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-15 Thread Yunsheng Lin
On 2019/10/14 17:25, Greg KH wrote:
> On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
>> On 2019/10/12 18:47, Greg KH wrote:
>>> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
 On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 15:40, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>>> add pci and acpi maintainer
>>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>>>
>>> On 2019/10/11 19:15, Peter Zijlstra wrote:
 On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> But I failed to see why the above is related to making 
> node_to_cpumask_map()
> NUMA_NO_NODE aware?

 Your initial bug is for hns3, which is a PCI device, which really 
 _MUST_
 have a node assigned.

 It not having one, is a straight up bug. We must not silently accept
 NO_NODE there, ever.

>>>
>>> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
>>> of affinity when the node of a pcie device is not set.
>>
>> If the firmware of a pci device does not provide the node information,
>> then yes, warn about that.
>>
>>> As Greg has asked about in [1]:
>>> what is a user to do when the user sees the kernel reporting that?
>>>
>>> We may tell the user to contact their vendor for info or updates about
>>> that when they do not know their system well enough, but their
>>> vendor may get away with this by quoting the ACPI spec, as the spec
>>> considers this optional. Should the user believe this is indeed a
>>> fw bug or a misreport from the kernel?
>>
>> Say it is a firmware bug; if it is a firmware bug, that's simple.
>>
>>> If this kind of reporting is common practice and will not cause any
>>> misunderstanding, then maybe we can report that.
>>
>> Yes, please do so, that's the only way those boxes are ever going to get
>> fixed.  And go add the test to the "firmware testing" tool that is based
>> on Linux that Intel has somewhere, to give vendors a chance to fix this
>> before they ship hardware.
>>
>> This shouldn't be a big deal, we warn of other hardware bugs all the
>> time.
>
> Ok, thanks for clarifying.
>
> Will send a patch to catch the case where a pcie device does not have its
> numa node set, and warn about it.
>
> Maybe use dev->bus to verify if it is a pci device?

 No, do that in the pci bus core code itself, when creating the devices
 as that is when you know, or do not know, the numa node, right?

 This can't be in the driver core only, as each bus type will have a
 different way of determining what node the device is on.  For some
 reason, I thought the PCI core code already does this, right?
>>>
>>> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
>>> thing...
>>>
>>> Anyway, it looks like the pci core code does call set_dev_node() based
>>> on the PCI bridge, so if that is set up properly, all should be fine.
>>>
>>> If not, well, you have buggy firmware and you need to warn about that at
>>> the time you are creating the bridge.  Look at the call to
>>> pcibus_to_node() in pci_register_host_bridge().
>>
>> Thanks for pointing out the specific function.
>> Maybe we do not need to warn about the case when the device has a parent,
>> because we would already have warned about the parent if the device has a
>> parent and the parent also has a node of NO_NODE, so there is no need to
>> warn about the child device anymore? Like below:
>>
>> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
>> 	list_add_tail(&bus->node, &pci_root_buses);
>> 	up_write(&pci_bus_sem);
>>
>> +   if (nr_node_ids > 1 && !parent &&
> 
> Why do you need to check this?  If you have a parent, your node should
> be set; if not, that's an error, right?

If the device has a parent and the parent device also has a node of
NUMA_NO_NODE, then we may already have warned about the parent device,
so we do not have to warn about the child device?

In pci_register_host_bridge():

if (!parent)
set_dev_node(bus->bridge, pcibus_to_node(bus));

The above only sets the node of the bridge device to the node of the bus
if the bridge device does not have a parent.

bus->dev.parent = bus->bridge;

dev_set_name(&bus->dev, "%04x:%02x", pci_domain_nr(bus), bus->number);
name = dev_name(&bus->dev);

err = device_register(&bus->dev);

The above then sets the bus device's parent to the bridge device, and then
calls device_register(), which will set the bus device's node according to
the bridge device's node.

> 
>> +   dev_to_node(bus->bridge) == NUMA_NO_NODE)
>> +   dev_err(bus->bridge, FW_BUG "No node assigned on NUMA 
>> capable HW. Please contact your vendor for updates.\n");
>> +
>>  
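
The node inheritance described above lives in the driver core rather than
in PCI; roughly, from drivers/base/core.c (device_add()) and
include/linux/device.h of the kernels under discussion:

    /* device_add(), trimmed: a child device with no node set
     * inherits its parent's node. */
    if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
            set_dev_node(dev, dev_to_node(parent));

    /* include/linux/device.h, CONFIG_NUMA=y */
    static inline int dev_to_node(struct device *dev)
    {
            return dev->numa_node;
    }

    static inline void set_dev_node(struct device *dev, int node)
    {
            dev->numa_node = node;
    }

So a root bridge with no parent and no firmware-provided node stays at
NUMA_NO_NODE, and every child device then inherits that.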

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Greg KH
On Mon, Oct 14, 2019 at 11:49:12AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 11:25:09AM +0200, Greg KH wrote:
> > Good luck, I don't really think that most, if any, of this is needed,
> > but hey, it's nice to clean it up where it can be :)
> 
> Some of the virtual devices we have (that use devm) really ought to set
> the node too, like drivers/base/cpu.c and driver/base/node.c and
> arguably the cooling devices too (they create a device per cpu).
> 
> The patch I had here:
> 
>   
> https://lkml.kernel.org/r/20190925214526.ga4...@worktop.programming.kicks-ass.net
> 
> takes the more radical approach of requiring a node, except when
> explicitly marked otherwise (the fake devices that don't use devm, for
> example).

I like that patch :)

> But yes, PCI and other physical busses really should have a node
> set, no excuses.

Agreed, at least just warning on the bus creation will make it a bit
less "noisy", in contrast to your patch.  But the messages in your patch
show people just how broken their bioses really are.  Which is always
fun...

> Anyway, I don't think non-physical devices actually use
> cpumask_of_node() much; a quick grep didn't show any.

That's good.

thanks,

greg k-h


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Peter Zijlstra
On Mon, Oct 14, 2019 at 11:25:09AM +0200, Greg KH wrote:
> Good luck, I don't really think that most, if any, of this is needed,
> but hey, it's nice to clean it up where it can be :)

Some of the virtual devices we have (that use devm) really ought to set
the node too, like drivers/base/cpu.c and drivers/base/node.c and
arguably the cooling devices too (they create a device per cpu).

The patch I had here:

  
https://lkml.kernel.org/r/20190925214526.ga4...@worktop.programming.kicks-ass.net

takes the more radical approach of requiring a node, except when
explicitly marked otherwise (the fake devices that don't use devm, for
example).

But yes, PCI and other physical busses really should have a node
set, no excuses.

Anyway, I don't think non-physical devices actually use
cpumask_of_node() much; a quick grep didn't show any.
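
To see why an unset node matters to drivers at all, here is a hedged
sketch of the usual consumer pattern (hypothetical code, not from hns3;
the function and names are illustrative):

    /* Hypothetical driver pattern: pick CPUs near the device for
     * queue/IRQ spreading. */
    static void example_setup_affinity(struct pci_dev *pdev)
    {
            int node = dev_to_node(&pdev->dev);  /* NUMA_NO_NODE if fw was silent */
            const struct cpumask *mask = cpumask_of_node(node);

            /* On x86 before the patch under discussion, node == -1 here
             * means reading node_to_cpumask_map[-1] out of bounds. */
            /* ... spread queues over 'mask' ... */
    }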


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Greg KH
On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 18:47, Greg KH wrote:
> > On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
> >> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> >>> On 2019/10/12 15:40, Greg KH wrote:
>  On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> > add pci and acpi maintainer
> > cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >
> > On 2019/10/11 19:15, Peter Zijlstra wrote:
> >> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >>> But I failed to see why the above is related to making 
> >>> node_to_cpumask_map()
> >>> NUMA_NO_NODE aware?
> >>
> >> Your initial bug is for hns3, which is a PCI device, which really 
> >> _MUST_
> >> have a node assigned.
> >>
> >> It not having one, is a straight up bug. We must not silently accept
> >> NO_NODE there, ever.
> >>
> >
> > I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> > of affinity when the node of a pcie device is not set.
> 
>  If the firmware of a pci device does not provide the node information,
>  then yes, warn about that.
> 
> > As Greg has asked about in [1]:
> > what is a user to do when the user sees the kernel reporting that?
> >
> > We may tell the user to contact their vendor for info or updates about
> > that when they do not know their system well enough, but their
> > vendor may get away with this by quoting the ACPI spec, as the spec
> > considers this optional. Should the user believe this is indeed a
> > fw bug or a misreport from the kernel?
> 
>  Say it is a firmware bug; if it is a firmware bug, that's simple.
> 
> > If this kind of reporting is common practice and will not cause any
> > misunderstanding, then maybe we can report that.
> 
>  Yes, please do so, that's the only way those boxes are ever going to get
>  fixed.  And go add the test to the "firmware testing" tool that is based
>  on Linux that Intel has somewhere, to give vendors a chance to fix this
>  before they ship hardware.
> 
>  This shouldn't be a big deal, we warn of other hardware bugs all the
>  time.
> >>>
> >>> Ok, thanks for clarifying.
> >>>
> >>> Will send a patch to catch the case where a pcie device does not have
> >>> its numa node set, and warn about it.
> >>>
> >>> Maybe use dev->bus to verify if it is a pci device?
> >>
> >> No, do that in the pci bus core code itself, when creating the devices
> >> as that is when you know, or do not know, the numa node, right?
> >>
> >> This can't be in the driver core only, as each bus type will have a
> > different way of determining what node the device is on.  For some
> >> reason, I thought the PCI core code already does this, right?
> > 
> > Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> > thing...
> > 
> > Anyway, it looks like the pci core code does call set_dev_node() based
> > on the PCI bridge, so if that is set up properly, all should be fine.
> > 
> > If not, well, you have buggy firmware and you need to warn about that at
> > the time you are creating the bridge.  Look at the call to
> > pcibus_to_node() in pci_register_host_bridge().
> 
> Thanks for pointing out the specific function.
> Maybe we do not need to warn about the case when the device has a parent,
> because we would already have warned about the parent if the device has a
> parent and the parent also has a node of NO_NODE, so there is no need to
> warn about the child device anymore? Like below:
> 
> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
> 	list_add_tail(&bus->node, &pci_root_buses);
> 	up_write(&pci_bus_sem);
> 
> +   if (nr_node_ids > 1 && !parent &&

Why do you need to check this?  If you have a parent, your node should
be set; if not, that's an error, right?

> +   dev_to_node(bus->bridge) == NUMA_NO_NODE)
> +   dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable 
> HW. Please contact your vendor for updates.\n");
> +
> return 0;

Who set that bus->bridge node to NUMA_NO_NODE?
If that is set, the firmware is broken, as you say, but you need to tell
the user what firmware is broken.

Try something like this out and see what happens on your machine that
had things "broken".  What does it say?

> Also, we do not need to warn about that in pci_device_add(), right?
> Because we must have warned about the pci host bridge of the pci device.

That should be true, yes.

> I may be wrong about the above because I am not so familiar with PCI.
> 
> > 
> > And yes, you need to do this all on a per-bus-type basis, as has been
> > pointed out.  It's up to the bus to create the device and set this up
> > properly.
> 
> Thanks.
> Will do that on a per-bus-type basis.

Good luck, I don't really think that most, if any, of this is needed,
but hey, it's nice to clean it up where it can be :)

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Yunsheng Lin
On 2019/10/12 18:47, Greg KH wrote:
> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
>>> On 2019/10/12 15:40, Greg KH wrote:
 On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> add pci and acpi maintainer
> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>
> On 2019/10/11 19:15, Peter Zijlstra wrote:
>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>>> But I failed to see why the above is related to making 
>>> node_to_cpumask_map()
>>> NUMA_NO_NODE aware?
>>
>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>> have a node assigned.
>>
>> It not having one, is a straight up bug. We must not silently accept
>> NO_NODE there, ever.
>>
>
> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> of affinity when the node of a pcie device is not set.

 If the firmware of a pci device does not provide the node information,
 then yes, warn about that.

> As Greg has asked about in [1]:
> what is a user to do when the user sees the kernel reporting that?
>
> We may tell the user to contact their vendor for info or updates about
> that when they do not know their system well enough, but their
> vendor may get away with this by quoting the ACPI spec, as the spec
> considers this optional. Should the user believe this is indeed a
> fw bug or a misreport from the kernel?

 Say it is a firmware bug; if it is a firmware bug, that's simple.

> If this kind of reporting is common practice and will not cause any
> misunderstanding, then maybe we can report that.

 Yes, please do so, that's the only way those boxes are ever going to get
 fixed.  And go add the test to the "firmware testing" tool that is based
 on Linux that Intel has somewhere, to give vendors a chance to fix this
 before they ship hardware.

 This shouldn't be a big deal, we warn of other hardware bugs all the
 time.
>>>
>>> Ok, thanks for clarifying.
>>>
>>> Will send a patch to catch the case where a pcie device does not have its
>>> numa node set, and warn about it.
>>>
>>> Maybe use dev->bus to verify if it is a pci device?
>>
>> No, do that in the pci bus core code itself, when creating the devices
>> as that is when you know, or do not know, the numa node, right?
>>
>> This can't be in the driver core only, as each bus type will have a
>> different way of determining what node the device is on.  For some
>> reason, I thought the PCI core code already does this, right?
> 
> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> thing...
> 
> Anyway, it looks like the pci core code does call set_dev_node() based
> on the PCI bridge, so if that is set up properly, all should be fine.
> 
> If not, well, you have buggy firmware and you need to warn about that at
> the time you are creating the bridge.  Look at the call to
> pcibus_to_node() in pci_register_host_bridge().

Thanks for pointing out the specific function.
Maybe we do not need to warn about the case when the device has a parent,
because we would already have warned about the parent if the device has a
parent and the parent also has a node of NO_NODE, so there is no need to
warn about the child device anymore? Like below:

@@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
	list_add_tail(&bus->node, &pci_root_buses);
	up_write(&pci_bus_sem);

+	if (nr_node_ids > 1 && !parent &&
+	    dev_to_node(bus->bridge) == NUMA_NO_NODE)
+		dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
+
	return 0;


Also, we do not need to warn about that in pci_device_add(), right?
Because we must have warned about the pci host bridge of the pci device.

I may be wrong about the above because I am not so familiar with PCI.

> 
> And yes, you need to do this all on a per-bus-type basis, as has been
> pointed out.  It's up to the bus to create the device and set this up
> properly.

Thanks.
Will do that on a per-bus-type basis.

> 
> thanks,
> 
> greg k-h
> 
> .
> 
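
The FW_BUG prefix used in the diff above is the kernel's standard printk
marker for firmware problems; from include/linux/printk.h:

    #define FW_BUG		"[Firmware Bug]: "
    #define FW_WARN		"[Firmware Warn]: "
    #define FW_INFO		"[Firmware Info]: "

so the dev_err() would surface in dmesg tagged "[Firmware Bug]:" against
the specific bridge device.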



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-12 Thread Greg KH
On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> > On 2019/10/12 15:40, Greg KH wrote:
> > > On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> > >> add pci and acpi maintainer
> > >> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> > >>
> > >> On 2019/10/11 19:15, Peter Zijlstra wrote:
> > >>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >  But I failed to see why the above is related to making 
> >  node_to_cpumask_map()
> >  NUMA_NO_NODE aware?
> > >>>
> > >>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> > >>> have a node assigned.
> > >>>
> > >>> It not having one, is a straight up bug. We must not silently accept
> > >>> NO_NODE there, ever.
> > >>>
> > >>
> > >> I suppose that by "not silently accept NO_NODE" you mean reporting a
> > >> lack of affinity when the node of a pcie device is not set.
> > > 
> > > If the firmware of a pci device does not provide the node information,
> > > then yes, warn about that.
> > > 
> > >> As Greg has asked about in [1]:
> > >> what is a user to do when the user sees the kernel reporting that?
> > >>
> > >> We may tell the user to contact their vendor for info or updates about
> > >> that when they do not know their system well enough, but their
> > >> vendor may get away with this by quoting the ACPI spec, as the spec
> > >> considers this optional. Should the user believe this is indeed a
> > >> fw bug or a misreport from the kernel?
> > > 
> > > Say it is a firmware bug; if it is a firmware bug, that's simple.
> > > 
> > >> If this kind of reporting is common practice and will not cause any
> > >> misunderstanding, then maybe we can report that.
> > > 
> > > Yes, please do so, that's the only way those boxes are ever going to get
> > > fixed.  And go add the test to the "firmware testing" tool that is based
> > > on Linux that Intel has somewhere, to give vendors a chance to fix this
> > > before they ship hardware.
> > > 
> > > This shouldn't be a big deal, we warn of other hardware bugs all the
> > > time.
> > 
> > Ok, thanks for clarifying.
> > 
> > Will send a patch to catch the case where a pcie device does not have its
> > numa node set, and warn about it.
> > 
> > Maybe use dev->bus to verify if it is a pci device?
> 
> No, do that in the pci bus core code itself, when creating the devices
> as that is when you know, or do not know, the numa node, right?
> 
> This can't be in the driver core only, as each bus type will have a
> different way of determining what node the device is on.  For some
> reason, I thought the PCI core code already does this, right?

Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
thing...

Anyway, it looks like the pci core code does call set_dev_node() based
on the PCI bridge, so if that is set up properly, all should be fine.

If not, well, you have buggy firmware and you need to warn about that at
the time you are creating the bridge.  Look at the call to
pcibus_to_node() in pci_register_host_bridge().

And yes, you need to do this all on a per-bus-type basis, as has been
pointed out.  It's up to the bus to create the device and set this up
properly.

thanks,

greg k-h
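
For reference, the call Greg points at sits in pci_register_host_bridge()
in drivers/pci/probe.c; the relevant lines (also quoted verbatim elsewhere
in this thread) are:

    /* Only a root bridge without a parent device takes its node
     * from the bus, via the arch's pcibus_to_node(). */
    if (!parent)
            set_dev_node(bus->bridge, pcibus_to_node(bus));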


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-12 Thread Greg KH
On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 15:40, Greg KH wrote:
> > On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> >> add pci and acpi maintainer
> >> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >>
> >> On 2019/10/11 19:15, Peter Zijlstra wrote:
> >>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>  But I failed to see why the above is related to making 
>  node_to_cpumask_map()
>  NUMA_NO_NODE aware?
> >>>
> >>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> >>> have a node assigned.
> >>>
> >>> It not having one, is a straight up bug. We must not silently accept
> >>> NO_NODE there, ever.
> >>>
> >>
> >> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> >> of affinity when the node of a pcie device is not set.
> > 
> > If the firmware of a pci device does not provide the node information,
> > then yes, warn about that.
> > 
> >> As Greg has asked about in [1]:
> >> what is a user to do when the user sees the kernel reporting that?
> >>
> >> We may tell the user to contact their vendor for info or updates about
> >> that when they do not know their system well enough, but their
> >> vendor may get away with this by quoting the ACPI spec, as the spec
> >> considers this optional. Should the user believe this is indeed a
> >> fw bug or a misreport from the kernel?
> > 
> > Say it is a firmware bug; if it is a firmware bug, that's simple.
> > 
> >> If this kind of reporting is common practice and will not cause any
> >> misunderstanding, then maybe we can report that.
> > 
> > Yes, please do so, that's the only way those boxes are ever going to get
> > fixed.  And go add the test to the "firmware testing" tool that is based
> > on Linux that Intel has somewhere, to give vendors a chance to fix this
> > before they ship hardware.
> > 
> > This shouldn't be a big deal, we warn of other hardware bugs all the
> > time.
> 
> Ok, thanks for clarifying.
> 
> Will send a patch to catch the case where a pcie device does not have its
> numa node set, and warn about it.
> 
> Maybe use dev->bus to verify if it is a pci device?

No, do that in the pci bus core code itself, when creating the devices
as that is when you know, or do not know, the numa node, right?

This can't be in the driver core only, as each bus type will have a
different way of determining what node the device is on.  For some
reason, I thought the PCI core code already does this, right?

thanks,

greg k-h


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-12 Thread Yunsheng Lin
On 2019/10/12 15:40, Greg KH wrote:
> On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
>> add pci and acpi maintainer
>> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>>
>> On 2019/10/11 19:15, Peter Zijlstra wrote:
>>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
 But I failed to see why the above is related to making 
 node_to_cpumask_map()
 NUMA_NO_NODE aware?
>>>
>>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>>> have a node assigned.
>>>
>>> It not having one, is a straight up bug. We must not silently accept
>>> NO_NODE there, ever.
>>>
>>
>> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
>> of affinity when the node of a pcie device is not set.
> 
> If the firmware of a pci device does not provide the node information,
> then yes, warn about that.
> 
>> As Greg has asked about in [1]:
>> what is a user to do when the user sees the kernel reporting that?
>>
>> We may tell the user to contact their vendor for info or updates about
>> that when they do not know their system well enough, but their
>> vendor may get away with this by quoting the ACPI spec, as the spec
>> considers this optional. Should the user believe this is indeed a
>> fw bug or a misreport from the kernel?
> 
> Say it is a firmware bug; if it is a firmware bug, that's simple.
> 
>> If this kind of reporting is common practice and will not cause any
>> misunderstanding, then maybe we can report that.
> 
> Yes, please do so, that's the only way those boxes are ever going to get
> fixed.  And go add the test to the "firmware testing" tool that is based
> on Linux that Intel has somewhere, to give vendors a chance to fix this
> before they ship hardware.
> 
> This shouldn't be a big deal, we warn of other hardware bugs all the
> time.

Ok, thanks for clarifying.

Will send a patch to catch the case where a pcie device does not have its
numa node set, and warn about it.

Maybe use dev->bus to verify if it is a pci device?

> 
> thanks,
> 
> greg k-h
> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-12 Thread Greg KH
On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> add pci and acpi maintainer
> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> 
> On 2019/10/11 19:15, Peter Zijlstra wrote:
> > On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >> But I failed to see why the above is related to making 
> >> node_to_cpumask_map()
> >> NUMA_NO_NODE aware?
> > 
> > Your initial bug is for hns3, which is a PCI device, which really _MUST_
> > have a node assigned.
> > 
> > It not having one, is a straight up bug. We must not silently accept
> > NO_NODE there, ever.
> > 
> 
> I suppose that by "not silently accept NO_NODE" you mean reporting a lack
> of affinity when the node of a pcie device is not set.

If the firmware of a pci device does not provide the node information,
then yes, warn about that.

> As Greg has asked about in [1]:
> what is a user to do when the user sees the kernel reporting that?
> 
> We may tell the user to contact their vendor for info or updates about
> that when they do not know their system well enough, but their
> vendor may get away with this by quoting the ACPI spec, as the spec
> considers this optional. Should the user believe this is indeed a
> fw bug or a misreport from the kernel?

Say it is a firmware bug; if it is a firmware bug, that's simple.

> If this kind of reporting is common practice and will not cause any
> misunderstanding, then maybe we can report that.

Yes, please do so, that's the only way those boxes are ever going to get
fixed.  And go add the test to the "firmware testing" tool that is based
on Linux that Intel has somewhere, to give vendors a chance to fix this
before they ship hardware.

This shouldn't be a big deal, we warn of other hardware bugs all the
time.

thanks,

greg k-h


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-12 Thread Yunsheng Lin
add pci and acpi maintainer
cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org

On 2019/10/11 19:15, Peter Zijlstra wrote:
> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>> But I failed to see why the above is related to making node_to_cpumask_map()
>> NUMA_NO_NODE aware?
> 
> Your initial bug is for hns3, which is a PCI device, which really _MUST_
> have a node assigned.
> 
> It not having one, is a straight up bug. We must not silently accept
> NO_NODE there, ever.
> 

I suppose that by "not silently accept NO_NODE" you mean reporting a lack
of affinity when the node of a pcie device is not set.

As Greg has asked about in [1]:
what is a user to do when the user sees the kernel reporting that?

We may tell the user to contact their vendor for info or updates about
that when they do not know their system well enough, but their
vendor may get away with this by quoting the ACPI spec, as the spec
considers this optional. Should the user believe this is indeed a
fw bug or a misreport from the kernel?

If this kind of reporting is common practice and will not cause any
misunderstanding, then maybe we can report that.

[1] https://lore.kernel.org/lkml/20190905055727.gb23...@kroah.com/

> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-11 Thread Peter Zijlstra
On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> But I failed to see why the above is related to making node_to_cpumask_map()
> NUMA_NO_NODE aware?

Your initial bug is for hns3, which is a PCI device, which really _MUST_
have a node assigned.

It not having one, is a straight up bug. We must not silently accept
NO_NODE there, ever.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-10 Thread Yunsheng Lin
On 2019/10/10 15:32, Michal Hocko wrote:
> On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
>> On 2019/10/9 20:25, Robin Murphy wrote:
>>> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
 On 2019/9/25 18:41, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>  From the discussion above, it seems making the node_to_cpumask_map()
>> NUMA_NO_NODE aware is the most feasible way to move forward.
>
> That's still wrong.

 Hi, Peter

 It seems this discussion has gone around in circles.

  From my understanding, NUMA_NO_NODE, which means no numa node preference,
 is the state describing the node of a virtual device or a physical device
 that has equal distance to all cpus.

 We can be stricter if the device does have a nearer node, but we cannot
 deny that a device may have no numa node preference or node affinity,
 which also means the control or data buffer can be allocated at the node
 where the process is running.

 As you have proposed, making it -2 and having dev_to_node() warn if the
 device does have a nearer node that is not set by the fw is a way to be
 stricter.

 But I think maybe being stricter is not really relevant to NUMA_NO_NODE,
 because we do need a state to describe a device that has equal distance to
 all nodes, even if it is not physically scalable.

 Any better suggestion to move this forward?
>>>
>>> FWIW (since this is in my inbox), it sounds like the fundamental issue is 
>>> that NUMA_NO_NODE is conflated for at least two different purposes, so 
>>> trying to sort that out would be a good first step. AFAICS we have genuine 
>>> "don't care" cases like alloc_pages_node(), where if the producer says it 
>>> doesn't matter then the consumer is free to make its own judgement on what 
>>> to do, and fundamentally different "we expect this thing to have an 
>>> affinity but it doesn't, so we can't say what's appropriate" cases which 
>>> could really do with some separate indicator like "NUMA_INVALID_NODE".
>>>
>>> The tricky part is then bestowed on the producers to decide whether they 
>>> can downgrade "invalid" to "don't care". You can technically build 'a 
>>> device' whose internal logic is distributed between nodes and thus appears 
>>> to have equal affinity - interrupt controllers, for example, may have 
>>> per-CPU or per-node interfaces that end up looking like that - so although 
>>> it's unlikely it's not outright nonsensical. Similarly a 'device' that's 
>>> actually emulated behind a firmware call interface may well effectively 
>>> have no real affinity.
>>
>> We may set the node of the physical device to NUMA_INVALID_NODE when the
>> fw does not provide one.
>>
>> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
>> with nid being NUMA_INVALID_NODE?
> 
> There is nothing sensible the allocator can do. The only point of
> NUMA_INVALID_NODE would be to catch potential misconfigurations and
> report them to users so they can complain to their HW/FW suppliers.
> 
> Pushing it to other subsystems doesn't make much sense IMHO because there
> is nothing really actionable. Refusing an allocation altogether sounds
> like a bad plan to me.
>  
>> If we change the node to a default one (like node 0) when the node of a
>> device is NUMA_INVALID_NODE in device_add(), how do we know the default
>> one (like node 0) is the right one to choose?
> 
> Exactly. We cannot really assume any node in that situation.
>  
>> From the previous discussion, the below seems not to have reached consensus yet:
>> 1) Do we need a state like NUMA_NO_NODE to describe that the device does not
>>    have any numa preference?
> 
> This is the traditional meaning the MM subsystem is using.
> 
>> 2) What do we do if the fw does not provide a node for the device? Should
>>    we guess and pick one for it, and how do we do the guessing? Or leave it
>>    as it is and handle it as NUMA_NO_NODE?
> 
> As already pointed out several times, picking any node is rather error
> prone. You can never assume topology. We used to assume that there would
> always be a node 0 but that is not really the case (see 3e8589963773
> ("memcg: make it work on sparse non-0-node systems")). Nodes might also
> come and go so this might just lead to all sorts of subtle problems.
> 
> On the other hand using NUMA_NO_NODE as no preference could only lead to
> slightly sub-optimal performance.
> 
> I do agree with Peter that reporting a lack of affinity might be useful
> but we shouldn't really try to be clever and make up the affinity
> willy-nilly.

Ok, thanks for clarifying.

So it seems we need to do the below if I understand it correctly:
1. fix up, as much as possible, the node id of a device that has a clear
   node affinity but does not have it set.
2. catch the case where a device should have a nearer node but does not
   have it set, and warn about it.

But I failed to see why the above is related to making node_to_cpumask_map()
NUMA_NO_NODE aware?

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-10 Thread Peter Zijlstra
On Wed, Oct 09, 2019 at 01:25:14PM +0100, Robin Murphy wrote:
> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> > On 2019/9/25 18:41, Peter Zijlstra wrote:
> > > On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> > > >  From the discussion above, it seems making the node_to_cpumask_map()
> > > > NUMA_NO_NODE aware is the most feasible way to move forward.
> > > 
> > > That's still wrong.
> > 
> > Hi, Peter
> > 
> > It seems this discussion has gone around in circles.
> > 
> >  From my understanding, NUMA_NO_NODE, which means no numa node preference,
> > is the state describing the node of a virtual device or a physical device
> > that has equal distance to all cpus.

So I _really_ don't believe in the equidistant physical device. Physics
just doesn't allow that. Or rather, you can, but then it'll be so slow
it doesn't matter.

The only possible option is equidistant to a _small_ number of nodes,
and if that is a reality, then we should look at that. So far however
it's purely been a hypothetical device.

> > We can be stricter if the device does have a nearer node, but we cannot
> > deny that a device may have no numa node preference or node affinity,
> > which also means the control or data buffer can be allocated at the node
> > where the process is running.
> > 
> > As you have proposed, making it -2 and having dev_to_node() warn if the
> > device does have a nearer node that is not set by the fw is a way to be
> > stricter.

Because it is 100% guaranteed (we have proof) that BIOS is shit and
doesn't set node affinity for devices that really should have it.

So we're trading a hypothetical shared device vs not reporting actual
BIOS bugs. That's no contest.

Worse, we have virtual devices that have clear node affinity without it
set.

So we're growing shit, allowing bugs, and what do we get in return? Warm
fuzzies is not it.

> > Any better suggestion to move this forward?
> 
> FWIW (since this is in my inbox), it sounds like the fundamental issue is
> that NUMA_NO_NODE is conflated for at least two different purposes, so
> trying to sort that out would be a good first step. AFAICS we have genuine
> "don't care" cases like alloc_pages_node(), where if the producer says it
> doesn't matter then the consumer is free to make its own judgement on what
> to do, and fundamentally different "we expect this thing to have an affinity
> but it doesn't, so we can't say what's appropriate" cases which could really
> do with some separate indicator like "NUMA_INVALID_NODE".

It can possibly be a 3-state:

 - UNKNOWN; overridden by parent/bus/etc..
   ERROR when still UNKNOWN on register.

 - INVALID; ERROR on devm usage.
   for virtual devices / pure sysfs nodes

 - NO_NODE; may only be set on virtual devices (we can check against PCI
   bus etc..) when there really is no better option.

But I only want to see the NO_NODE crap at the end, after all other
possible avenues have been exhausted.
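
A rough illustration of that three-state idea (hypothetical constants for
the sketch; only NUMA_NO_NODE (-1) exists in include/linux/numa.h):

    /* Hypothetical sketch of the proposal above, not existing kernel code. */
    #define NUMA_NO_NODE		(-1)	/* no preference; virtual devices only */
    #define NUMA_INVALID_NODE	(-2)	/* virtual/pure-sysfs devices; ERROR on devm usage */
    #define NUMA_UNKNOWN_NODE	(-3)	/* default; overridden by parent/bus;
					 * ERROR if still unknown on register */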

> The tricky part is then bestowed on the producers to decide whether they can
> downgrade "invalid" to "don't care". You can technically build 'a device'
> whose internal logic is distributed between nodes and thus appears to have
> equal affinity - interrupt controllers, for example, may have per-CPU or
> per-node interfaces that end up looking like that - so although it's
> unlikely it's not outright nonsensical.

I'm thinking we should/do create per cpu/node devices for such
distributed stuff. For instance, we create per-cpu clockevent devices
(where appropriate).

> Similarly a 'device' that's actually emulated behind a firmware call
> interface may well effectively have no real affinity.

Emulated devices are typically slow as heck and should be avoided if at
all possible. I don't see NUMA affinity being important for them.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-10 Thread Michal Hocko
On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
> On 2019/10/9 20:25, Robin Murphy wrote:
> > On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> >> On 2019/9/25 18:41, Peter Zijlstra wrote:
> >>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>   From the discussion above, it seems making the node_to_cpumask_map()
>  NUMA_NO_NODE aware is the most feasible way to move forward.
> >>>
> >>> That's still wrong.
> >>
> >> Hi, Peter
> >>
> >> It seems this discussion has gone around in circles.
> >>
> >>  From my understanding, NUMA_NO_NODE, which means no numa node preference,
> >> is the state describing the node of a virtual device or a physical device
> >> that has equal distance to all cpus.
> >>
> >> We can be stricter if the device does have a nearer node, but we cannot
> >> deny that a device may have no numa node preference or node affinity,
> >> which also means the control or data buffer can be allocated at the node
> >> where the process is running.
> >>
> >> As you have proposed, making it -2 and having dev_to_node() warn if the
> >> device does have a nearer node that is not set by the fw is a way to be
> >> stricter.
> >>
> >> But I think maybe being stricter is not really relevant to NUMA_NO_NODE,
> >> because we do need a state to describe a device that has equal distance
> >> to all nodes, even if it is not physically scalable.
> >>
> >> Any better suggestion to move this forward?
> > 
> > FWIW (since this is in my inbox), it sounds like the fundamental issue is 
> > that NUMA_NO_NODE is conflated for at least two different purposes, so 
> > trying to sort that out would be a good first step. AFAICS we have genuine 
> > "don't care" cases like alloc_pages_node(), where if the producer says it 
> > doesn't matter then the consumer is free to make its own judgement on what 
> > to do, and fundamentally different "we expect this thing to have an 
> > affinity but it doesn't, so we can't say what's appropriate" cases which 
> > could really do with some separate indicator like "NUMA_INVALID_NODE".
> > 
> > The tricky part is then bestowed on the producers to decide whether they 
> > can downgrade "invalid" to "don't care". You can technically build 'a 
> > device' whose internal logic is distributed between nodes and thus appears 
> > to have equal affinity - interrupt controllers, for example, may have 
> > per-CPU or per-node interfaces that end up looking like that - so although 
> > it's unlikely it's not outright nonsensical. Similarly a 'device' that's 
> > actually emulated behind a firmware call interface may well effectively 
> > have no real affinity.
> 
> We may set the node of the physical device to NUMA_INVALID_NODE when the
> fw does not provide one.
> 
> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
> with nid being NUMA_INVALID_NODE?

There is nothing sensible the allocator can do. The only point of
NUMA_INVALID_NODE would be to catch potential misconfigurations and
report them to users so they can complain to their HW/FW suppliers.

Pushing it to other subsystems doesn't make much sense IMHO because there
is nothing really actionable. Refusing an allocation altogether sounds
like a bad plan to me.
 
> If we change the node to a default one (like node 0) when the node of a
> device is NUMA_INVALID_NODE in device_add(), how do we know the default
> one (like node 0) is the right one to choose?

Exactly. We cannot really assume any node in that situation.
 
> From the previous discussion, the points below do not seem to have reached
> consensus yet:
> 1) Do we need a state like NUMA_NO_NODE to describe that the device does not
>    have any numa preference?

This is the traditional meaning the MM subsystem is using.

> 2) What do we do if the fw does not provide a node for the device? Should
>    we guess and pick one for it, and how do we do the guessing? Or leave it
>    as it is and handle it as NUMA_NO_NODE?

As already pointed out several times, picking any node is rather error
prone. You can never assume topology. We used to assume that there
would always be a node 0, but that is not really the case (see 3e8589963773
("memcg: make it work on sparse non-0-node systems")). Nodes might also
come and go, so this might just lead to all sorts of subtle problems.

On the other hand, using NUMA_NO_NODE as no preference could only lead to
slightly sub-optimal performance.

I do agree with Peter that reporting a lack of affinity might be useful,
but we shouldn't really try to be clever and make up the affinity
willy-nilly.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-10 Thread Yunsheng Lin
On 2019/10/9 20:25, Robin Murphy wrote:
> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
>> On 2019/9/25 18:41, Peter Zijlstra wrote:
>>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
  From the discussion above, it seems making the node_to_cpumask_map()
 NUMA_NO_NODE aware is the most feasible way to move forward.
>>>
>>> That's still wrong.
>>
>> Hi, Peter
>>
>> It seems this discussion has gone around in circles.
>>
>>  From my understanding, NUMA_NO_NODE, which means no NUMA node preference,
>> is the state describing the node of a virtual device, or of a physical device
>> that has equal distance to all cpus.
>>
>> We can be stricter if the device does have a nearer node, but we cannot
>> deny that a device may have no NUMA node preference or node affinity,
>> which also means the control or data buffer can be allocated at the node
>> where the process is running.
>>
>> As you have proposed, making it -2 and having dev_to_node() warn if the
>> device does have a nearer node that is not set by the fw is a way to be
>> stricter.
>>
>> But I think maybe being stricter is not really relevant to NUMA_NO_NODE,
>> because we do need a state to describe a device that has equal distance to
>> all nodes, even if it is not physically scalable.
>>
>> Any better suggestion to move this forward?
> 
> FWIW (since this is in my inbox), it sounds like the fundamental issue is 
> that NUMA_NO_NODE is conflated for at least two different purposes, so trying 
> to sort that out would be a good first step. AFAICS we have genuine "don't 
> care" cases like alloc_pages_node(), where if the producer says it doesn't 
> matter then the consumer is free to make its own judgement on what to do, and 
> fundamentally different "we expect this thing to have an affinity but it 
> doesn't, so we can't say what's appropriate" cases which could really do with 
> some separate indicator like "NUMA_INVALID_NODE".
> 
> The tricky part is then bestowed on the producers to decide whether they can 
> downgrade "invalid" to "don't care". You can technically build 'a device' 
> whose internal logic is distributed between nodes and thus appears to have 
> equal affinity - interrupt controllers, for example, may have per-CPU or 
> per-node interfaces that end up looking like that - so although it's unlikely 
> it's not outright nonsensical. Similarly a 'device' that's actually emulated 
> behind a firmware call interface may well effectively have no real affinity.

We may set the node of a physical device to NUMA_INVALID_NODE when the fw
does not provide one.

But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
with nid being NUMA_INVALID_NODE?

If we change the node to a default one (like node 0) when the node of a device
is NUMA_INVALID_NODE in device_add(), how do we know the default one (like
node 0) is the right one to choose?

From the previous discussion, the points below do not seem to have reached
consensus yet:
1) Do we need a state like NUMA_NO_NODE to describe that the device does not
   have any numa preference?

2) What do we do if the fw does not provide a node for the device? Should
   we guess and pick one for it, and how do we do the guessing? Or leave it
   as it is and handle it as NUMA_NO_NODE?

The point of adding another state like NUMA_INVALID_NODE seems to be to catch
that case and warn about it when the device does have a nearer
node and the fw does not provide one, while alloc_pages_node() still needs to
handle it as NUMA_NO_NODE?

If the above is true, then maybe we can move forward with the above goal.

Thanks very much for the suggestion.

> 



> Robin.
> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-09 Thread Robin Murphy

On 2019-10-08 9:38 am, Yunsheng Lin wrote:

On 2019/9/25 18:41, Peter Zijlstra wrote:

On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:

 From the discussion above, it seems making the node_to_cpumask_map()
NUMA_NO_NODE aware is the most feasible way to move forward.


That's still wrong.


Hi, Peter

It seems this discussion has gone around in circles.

 From my understanding, NUMA_NO_NODE, which means no NUMA node preference,
is the state describing the node of a virtual device, or of a physical device
that has equal distance to all cpus.

We can be stricter if the device does have a nearer node, but we cannot
deny that a device may have no NUMA node preference or node affinity,
which also means the control or data buffer can be allocated at the node where
the process is running.

As you have proposed, making it -2 and having dev_to_node() warn if the device
does have a nearer node that is not set by the fw is a way to be stricter.

But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
we do need a state to describe a device that has equal distance to all nodes,
even if it is not physically scalable.

Any better suggestion to move this forward?


FWIW (since this is in my inbox), it sounds like the fundamental issue 
is that NUMA_NO_NODE is conflated for at least two different purposes, 
so trying to sort that out would be a good first step. AFAICS we have 
genuine "don't care" cases like alloc_pages_node(), where if the 
producer says it doesn't matter then the consumer is free to make its 
own judgement on what to do, and fundamentally different "we expect this 
thing to have an affinity but it doesn't, so we can't say what's 
appropriate" cases which could really do with some separate indicator 
like "NUMA_INVALID_NODE".


The tricky part is then bestowed on the producers to decide whether they 
can downgrade "invalid" to "don't care". You can technically build 'a 
device' whose internal logic is distributed between nodes and thus 
appears to have equal affinity - interrupt controllers, for example, may 
have per-CPU or per-node interfaces that end up looking like that - so 
although it's unlikely it's not outright nonsensical. Similarly a 
'device' that's actually emulated behind a firmware call interface may 
well effectively have no real affinity.


Robin.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-08 Thread Yunsheng Lin
On 2019/9/25 18:41, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>> From the discussion above, it seems making the node_to_cpumask_map()
>> NUMA_NO_NODE aware is the most feasible way to move forward.
> 
> That's still wrong.

Hi, Peter

It seems this discussion has gone around in circles.

From my understanding, NUMA_NO_NODE, which means no NUMA node preference,
is the state describing the node of a virtual device, or of a physical device
that has equal distance to all cpus.

We can be stricter if the device does have a nearer node, but we cannot
deny that a device may have no NUMA node preference or node affinity,
which also means the control or data buffer can be allocated at the node where
the process is running.

As you have proposed, making it -2 and having dev_to_node() warn if the device
does have a nearer node that is not set by the fw is a way to be stricter.

But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because
we do need a state to describe a device that has equal distance to all nodes,
even if it is not physically scalable.

Any better suggestion to move this forward?

> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2019 at 01:45:53PM +0200, Geert Uytterhoeven wrote:
> Hi Peter,
> 
> On Thu, Sep 26, 2019 at 11:42 AM Peter Zijlstra  wrote:
> > On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> > > I am sorry but I still do not understand why you consider this
> > > whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
> > > reality and that using the full cpu mask is a reasonable answer to that.
> >
> > Because it doesn't make physical sense. A device _cannot_ be local to
> > all CPUs in a NUMA system.
> 
> While it cannot be local to all CPUs, it can be at a uniform (equal) distance
> to each CPU node, can't it?

Only in some really narrow cases; and I'm not sure those are realistic,
nor, even then, whether not providing NUMA info is the best way to describe that.

I suppose it is possible to have a PCI bridge shared between two nodes,
such that the PCI devices have equidistance; esp. if that all lives in a
package. But the moment you scale this out, you either get devices that
are 'local' to a package while having multiple packages, or if you
maintain a single bridge in a big system, things become so slow it all
doesn't matter anyway (try having an equidistant device in a 16-node
system).

I'm saying that assigning a node (one of the shared ones) is, in the generic
case of multiple packages, the better solution over assigning all nodes.

The other solution is migrating the device model over to a node mask,
instead of a single node. But like I said; I'm not sure anybody actually
built something like this. So I'm not sure it matters.

OTOH, allowing NUMA to go undescribed has led to a whole host of crap,
which will only get worse if we don't become stricter.
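
Purely as illustration of that node-mask alternative (struct device carries
no such mask today, so the nodemask_t argument below stands in for that
hypothetical field):

/* Build the usable-CPU mask as the union over all of a device's 'near'
 * nodes; with a single-node device model this degenerates to one node. */
static void dev_nodes_to_cpumask(const nodemask_t *nodes, struct cpumask *out)
{
	int nid;

	cpumask_clear(out);
	for_each_node_mask(nid, *nodes)
		cpumask_or(out, out, cpumask_of_node(nid));
}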


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Peter Zijlstra
On Thu, Sep 26, 2019 at 11:05:59AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> > [7.149889] [Firmware Bug]: device: 'pci:7f': no node assigned on 
> > NUMA capable HW
> > [7.882888] [Firmware Bug]: device: 'pci:ff': no node assigned on 
> > NUMA capable HW
> 
> Going by the limited number of intel numa boxes I have, it looks like:
> 
>   socket = (~busid) >> (8-n)

Bah, I got my notes mixed up, it should be: busid >> (8-n)

> where 'n' is the number of bits required to encode the largest socket
> id, ie 1 for 2-socket and 2 for 4 socket.
> 
> For 8 socket systems we start using pci domains, and things get more
> 'interesting' :/


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Geert Uytterhoeven
Hi Peter,

On Thu, Sep 26, 2019 at 11:42 AM Peter Zijlstra  wrote:
> On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> > I am sorry but I still do not understand why you consider this
> > whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
> > reality and that using the full cpu mask is a reasonable answer to that.
>
> Because it doesn't make physical sense. A device _cannot_ be local to
> all CPUs in a NUMA system.

While it cannot be local to all CPUs, it can be at a uniform (equal) distance
to each CPU node, can't it?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-26 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 11:45:26PM +0200, Peter Zijlstra wrote:
> [7.149889] [Firmware Bug]: device: 'pci:7f': no node assigned on NUMA 
> capable HW
> [7.882888] [Firmware Bug]: device: 'pci:ff': no node assigned on NUMA 
> capable HW

Going by the limited number of intel numa boxes I have, it looks like:

  socket = (~busid) >> (8-n)

where 'n' is the number of bits required to encode the largest socket
id, ie 1 for 2-socket and 2 for 4 socket.

For 8 socket systems we start using pci domains, and things get more
'interesting' :/
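
A minimal userspace sketch of that heuristic (the helper name is invented;
'n' is the number of bits needed for the largest socket id, as described):

#include <stdio.h>

static unsigned int busid_to_socket(unsigned int busid, unsigned int n)
{
	return busid >> (8 - n);
}

int main(void)
{
	/* 2-socket box (n = 1): bus 0x7f -> socket 0, bus 0xff -> socket 1,
	 * matching the pci 7f/ff root buses in the dmesg above. */
	printf("bus 0x7f -> socket %u\n", busid_to_socket(0x7f, 1));
	printf("bus 0xff -> socket %u\n", busid_to_socket(0xff, 1));
	return 0;
}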


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 06:31:54PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> > I am sorry but I still do not understand why you consider this
> > whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
> > reality and that using the full cpu mask is a reasonable answer to that.
> 
> Because it doesn't make physical sense. A device _cannot_ be local to
> all CPUs in a NUMA system.

The below patch still gives a fair amount of noise on my fairly old and
crufty IVB-EP, but it gets rid of most of the simple stuff.

[2.890739] [Firmware Bug]: device: 'platform': no node assigned on NUMA 
capable HW
[2.901855] [Firmware Bug]: device: 'vtcon0': no node assigned on NUMA 
capable HW
[2.911804] [Firmware Bug]: device: 'id': no node assigned on NUMA capable HW
[3.800832] [Firmware Bug]: device: 'fbcon': no node assigned on NUMA 
capable HW
[4.824808] [Firmware Bug]: device: 'LNXSYSTM:00': no node assigned on NUMA 
capable HW
[5.112739] [Firmware Bug]: device: 'pci:00': no node assigned on NUMA 
capable HW
[6.703425] [Firmware Bug]: device: 'pci:80': no node assigned on NUMA 
capable HW
[7.049515] [Firmware Bug]: device: 'ACPI0004:00': no node assigned on NUMA 
capable HW
[7.078823] [Firmware Bug]: device: 'ACPI0004:01': no node assigned on NUMA 
capable HW
[7.149889] [Firmware Bug]: device: 'pci:7f': no node assigned on NUMA 
capable HW
[7.158798] [Firmware Bug]: device: ':7f': no node assigned on NUMA 
capable HW
[7.183796] [Firmware Bug]: device: ':7f:08.0': no node assigned on NUMA 
capable HW
[7.199796] [Firmware Bug]: device: ':7f:09.0': no node assigned on NUMA 
capable HW
[7.215792] [Firmware Bug]: device: ':7f:0a.0': no node assigned on NUMA 
capable HW
[7.231791] [Firmware Bug]: device: ':7f:0a.1': no node assigned on NUMA 
capable HW
[7.247793] [Firmware Bug]: device: ':7f:0a.2': no node assigned on NUMA 
capable HW
[7.262794] [Firmware Bug]: device: ':7f:0a.3': no node assigned on NUMA 
capable HW
[7.278789] [Firmware Bug]: device: ':7f:0b.0': no node assigned on NUMA 
capable HW
[7.294787] [Firmware Bug]: device: ':7f:0b.3': no node assigned on NUMA 
capable HW
[7.310794] [Firmware Bug]: device: ':7f:0c.0': no node assigned on NUMA 
capable HW
[7.325796] [Firmware Bug]: device: ':7f:0c.1': no node assigned on NUMA 
capable HW
[7.341790] [Firmware Bug]: device: ':7f:0c.2': no node assigned on NUMA 
capable HW
[7.357789] [Firmware Bug]: device: ':7f:0c.3': no node assigned on NUMA 
capable HW
[7.373789] [Firmware Bug]: device: ':7f:0c.4': no node assigned on NUMA 
capable HW
[7.388789] [Firmware Bug]: device: ':7f:0d.0': no node assigned on NUMA 
capable HW
[7.404791] [Firmware Bug]: device: ':7f:0d.1': no node assigned on NUMA 
capable HW
[7.420789] [Firmware Bug]: device: ':7f:0d.2': no node assigned on NUMA 
capable HW
[7.436790] [Firmware Bug]: device: ':7f:0d.3': no node assigned on NUMA 
capable HW
[7.451789] [Firmware Bug]: device: ':7f:0d.4': no node assigned on NUMA 
capable HW
[7.467799] [Firmware Bug]: device: ':7f:0e.0': no node assigned on NUMA 
capable HW
[7.483797] [Firmware Bug]: device: ':7f:0e.1': no node assigned on NUMA 
capable HW
[7.499830] [Firmware Bug]: device: ':7f:0f.0': no node assigned on NUMA 
capable HW
[7.515825] [Firmware Bug]: device: ':7f:0f.1': no node assigned on NUMA 
capable HW
[7.530823] [Firmware Bug]: device: ':7f:0f.2': no node assigned on NUMA 
capable HW
[7.546824] [Firmware Bug]: device: ':7f:0f.3': no node assigned on NUMA 
capable HW
[7.562823] [Firmware Bug]: device: ':7f:0f.4': no node assigned on NUMA 
capable HW
[7.578822] [Firmware Bug]: device: ':7f:0f.5': no node assigned on NUMA 
capable HW
[7.594830] [Firmware Bug]: device: ':7f:10.0': no node assigned on NUMA 
capable HW
[7.609834] [Firmware Bug]: device: ':7f:10.1': no node assigned on NUMA 
capable HW
[7.625825] [Firmware Bug]: device: ':7f:10.2': no node assigned on NUMA 
capable HW
[7.641824] [Firmware Bug]: device: ':7f:10.3': no node assigned on NUMA 
capable HW
[7.657825] [Firmware Bug]: device: ':7f:10.4': no node assigned on NUMA 
capable HW
[7.673824] [Firmware Bug]: device: ':7f:10.5': no node assigned on NUMA 
capable HW
[7.689792] [Firmware Bug]: device: ':7f:10.6': no node assigned on NUMA 
capable HW
[7.704825] [Firmware Bug]: device: ':7f:10.7': no node assigned on NUMA 
capable HW
[7.720791] [Firmware Bug]: device: ':7f:13.0': no node assigned on NUMA 
capable HW
[7.736793] [Firmware Bug]: device: ':7f:13.1': no node assigned on NUMA 
capable HW
[7.752791] [Firmware Bug]: device: ':7f:13.4': no node assigned on NUMA 
capable HW
[7.767780] [Firmware Bug]: 

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 03:25:44PM +0200, Michal Hocko wrote:
> I am sorry but I still do not understand why you consider this
> whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
> reality and that using the full cpu mask is a reasonable answer to that.

Because it doesn't make physical sense. A device _cannot_ be local to
all CPUs in a NUMA system.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Michal Hocko
On Wed 25-09-19 12:40:40, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:
> 
> > > The below would get rid of the PMU and workqueue warnings with no
> > > side-effects (the device isn't used for anything except sysfs).
> > 
> > Hardcoding to 0 is simply wrong if node 0 is cpuless, for example...
> 
> It doesn't matter that 0 is _never_ used. These are fake devices,
> and all we care about is getting rid of that error.

That is a very subtle and hard-to-review assumption. Even if this holds
now, a future change might easily break it, AFAIU. It also assumes that
you catch all such special devices.

I am sorry but I still do not understand why you consider this
whack-a-mole better than simply living with the fact that NUMA_NO_NODE is a
reality and that using the full cpu mask is a reasonable answer to that.
Anyway, I feel we are going in a loop here, so I will leave the final
decision to you.

> If it makes you feel better we can make it -2 and have dev_to_node()
> WARN if it ever sees one.

That would help.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> From the discussion above, it seems making the node_to_cpumask_map()
> NUMA_NO_NODE aware is the most feasible way to move forward.

That's still wrong.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 03:19:39PM +0200, Michal Hocko wrote:

> > The below would get rid of the PMU and workqueue warnings with no
> > side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong if node 0 is cpuless, for example...

It doesn't matter that 0 is _never_ used. These are fake devices,
and all we care about is getting rid of that error.

If it makes you feel better we can make it -2 and have dev_to_node()
WARN if it ever sees one.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-25 Thread Yunsheng Lin
On 2019/9/24 21:19, Michal Hocko wrote:
> On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
 On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
>>>
> We can push back and say we don't respect the specification because it
> is batshit insane ;-)

 Here is my fingers crossed.

 [...]

> Now granted; there's a number of virtual devices that really don't have
> a node affinity, but then, those are not hurt by forcing them onto a
> random node, they really don't do anything. Like:

 Do you really consider a random node a better fix than simply living
 with a more robust NUMA_NO_NODE which tells the actual state? Page
 allocator would effectively use the local node in that case. Any code
 using the cpumask will know that any of the online cpus are usable.
>>>
>>> For the pmu devices? Yes, those 'devices' aren't actually used for
>>> anything other than sysfs entries.
>>>
>>> Nothing else uses the struct device.
>>
>> The below would get rid of the PMU and workqueue warnings with no
>> side-effects (the device isn't used for anything except sysfs).
> 
> Hardcoding to 0 is simply wrong if node 0 is cpuless, for example...

Hi, Peter & Michal

From the discussion above, it seems making the node_to_cpumask_map()
NUMA_NO_NODE aware is the most feasible way to move forward.

Any suggestion?

> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 14:59:36, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> > 
> > > > We can push back and say we don't respect the specification because it
> > > > is batshit insane ;-)
> > > 
> > > Here is my fingers crossed.
> > > 
> > > [...]
> > > 
> > > > Now granted; there's a number of virtual devices that really don't have
> > > > a node affinity, but then, those are not hurt by forcing them onto a
> > > > random node, they really don't do anything. Like:
> > > 
> > > Do you really consider a random node a better fix than simply living
> > > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > > allocator would effectively use the local node in that case. Any code
> > > using the cpumask will know that any of the online cpus are usable.
> > 
> > For the pmu devices? Yes, those 'devices' aren't actually used for
> > anything other than sysfs entries.
> > 
> > Nothing else uses the struct device.
> 
> The below would get rid of the PMU and workqueue warnings with no
> side-effects (the device isn't used for anything except sysfs).

Hardcoding to 0 is simply wrong if node 0 is cpuless, for example...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 02:43:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> 
> > > We can push back and say we don't respect the specification because it
> > > is batshit insane ;-)
> > 
> > Here is my fingers crossed.
> > 
> > [...]
> > 
> > > Now granted; there's a number of virtual devices that really don't have
> > > a node affinity, but then, those are not hurt by forcing them onto a
> > > random node, they really don't do anything. Like:
> > 
> > Do you really consider a random node a better fix than simply living
> > with a more robust NUMA_NO_NODE which tells the actual state? Page
> > allocator would effectively use the local node in that case. Any code
> > using the cpumask will know that any of the online cpus are usable.
> 
> For the pmu devices? Yes, those 'devices' aren't actually used for
> anything other than sysfs entries.
> 
> Nothing else uses the struct device.

The below would get rid of the PMU and workqueue warnings with no
side-effects (the device isn't used for anything except sysfs).

I'm stuck in the device code for BDIs; I can't find a sane place to set
the node before it gets added, due to it using device_create_vargs().

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4f08b17d6426..2a64dcc3d70f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9965,6 +9965,7 @@ static int pmu_dev_alloc(struct pmu *pmu)
 	if (!pmu->dev)
 		goto out;
 
+	set_dev_node(pmu->dev, 0);
 	pmu->dev->groups = pmu->attr_groups;
 	device_initialize(pmu->dev);
 	ret = dev_set_name(pmu->dev, "%s", pmu->name);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bc2e09a8ea61..efafc4590bbe 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5613,6 +5613,7 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 	wq_dev->dev.bus = &wq_subsys;
 	wq_dev->dev.release = wq_device_release;
 	dev_set_name(&wq_dev->dev, "%s", wq->name);
+	set_dev_node(&wq_dev->dev, 0);
 
 	/*
 	 * unbound_attrs are created separately.  Suppress uevent until


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 02:25:00PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:

> > We can push back and say we don't respect the specification because it
> > is batshit insane ;-)
> 
> Here is my fingers crossed.
> 
> [...]
> 
> > Now granted; there's a number of virtual devices that really don't have
> > a node affinity, but then, those are not hurt by forcing them onto a
> > random node, they really don't do anything. Like:
> 
> Do you really consider a random node a better fix than simply living
> with a more robust NUMA_NO_NODE which tells the actual state? Page
> allocator would effectively use the local node in that case. Any code
> using the cpumask will know that any of the online cpus are usable.

For the pmu devices? Yes, those 'devices' aren't actually used for
anything other than sysfs entries.

Nothing else uses the struct device.

> Compare that to a wild guess that might be easily wrong and have subtle
> side effects which are really hard to debug. You will only see a higher
> utilization on a specific node. Good luck with a bug report like that.

We'd have the FW_BUG in the dmesg, which should be a big fat clue.




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 14:09:43, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 01:54:01PM +0200, Michal Hocko wrote:
> > On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> > > On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> > [...]
> > > > To be honest I really fail to see why you object to a simple semantic
> > > > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> > > 
> > > Because it feels wrong. The device needs to be _somewhere_. It simply
> > > cannot be node-less.
> > 
> > What if it doesn't have any numa preference for whatever reason? There
> > is no other way to express that than NUMA_NO_NODE.
> 
> Like I said; how does that physically work? The device needs to be
> somewhere. It _must_ have a preference.
> 
> > Anyway, I am not going to argue more about this because it seems more of
> > a discussion about "HW shouldn't be doing that although the specification
> > allows that" which cannot really have any outcome except "feels
> > correct/wrong".
> 
> We can push back and say we don't respect the specification because it
> is batshit insane ;-)

Here is my fingers crossed.

[...]

> Now granted; there's a number of virtual devices that really don't have
> a node affinity, but then, those are not hurt by forcing them onto a
> random node, they really don't do anything. Like:

Do you really consider a random node a better fix than simply living
with a more robust NUMA_NO_NODE which tells the actual state? Page
allocator would effectively use the local node in that case. Any code
using the cpumask will know that any of the online cpus are usable.
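
For reference, the "no preference" semantic being referred to, as a simplified
sketch modeled on alloc_pages_node() in include/linux/gfp.h:

static inline struct page *
alloc_pages_node_sketch(int nid, gfp_t gfp_mask, unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();	/* effectively the local node */

	return __alloc_pages_node(nid, gfp_mask, order);
}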

Compare that to a wild guess that might be easily wrong and have subtle
side effects which are really hard to debug. You will only see a higher
utilization on a specific node. Good luck with a bug report like that.

Anyway, I really do not feel strongly about that. If you really consider
it a bad idea then I can live with that. This just felt easier and
reasonably consistent to address. Implementing the guessing and fighting
vendors who really do not feel like providing a real affinity sounds
harder and more error-prone.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 01:54:01PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> [...]
> > > To be honest I really fail to see why you object to a simple semantic
> > > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> > 
> > Because it feels wrong. The device needs to be _somewhere_. It simply
> > cannot be node-less.
> 
> What if it doesn't have any numa preference for whatever reason? There
> is no other way to express that than NUMA_NO_NODE.

Like I said; how does that physically work? The device needs to be
somewhere. It _must_ have a preference.

> Anyway, I am not going to argue more about this because it seems more of
> a discussion about "HW shouldn't be doing that although the specification
> allows that" which cannot really have any outcome except "feels
> correct/wrong".

We can push back and say we don't respect the specification because it
is batshit insane ;-)

> If you really feel strongly about this then we should think of a proper
> way to prevent this from happening because an out-of-bounds access is
> certainly not something we really want, right?

I just genuinely don't understand it. And I refuse to duct tape it.

And as shown in that email here:

  https://lkml.kernel.org/r/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com

there is a ton of broken...

[15.061682] node node0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
[15.285602] node node3: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

[15.360241] cpu cpu0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
[24.768305] cpu cpu127: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

[39.623339] clockevents clockevent0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
...
[48.769530] clockevents clockevent127: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

That's all broken for no reason; those things actually _have_ a trivial
node affinity.

By silently accepting we let this stuff fester.

Now granted; there's a number of virtual devices that really don't have
a node affinity, but then, those are not hurt by forcing them onto a
random node, they really don't do anything. Like:

[48.913502] event_source armv8_pmuv3_0: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[48.985462] event_source breakpoint: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[49.057120] event_source uprobe: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[49.128431] event_source kprobe: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[49.199742] event_source tracepoint: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[49.271399] event_source software: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

That's just fake devices to get a sysfs entry.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 19:58, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:44:28PM +0800, Yunsheng Lin wrote:
>> From [1], there are a lot of devices with a node id of NUMA_NO_NODE with the
>> FW_BUG.
>>
>> [1] 
>> https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com/
> 
> So aside from all the software devices which we can (and really should)
> fix; these:
> 
> [26.470076]  pci:00: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [26.815436]  pci:7b: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.004447]  pci:7a: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.236797]  pci:78: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [27.505833]  pci:7c: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.056452]  pci:74: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.470018]  pci:80: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.726411]  pci:bb: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [28.916656]  pci:ba: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.152839]  pci:b8: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.425808]  pci:bc: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> [29.718593]  pci:b4: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
> 
> look like actual problems. How can PCI devices not have a node assigned?

The above PCI devices do not have a node assigned because I downgraded
the bios to an older version that has not implemented the "Proximity Domain"
feature specified by ACPI, which is an optional feature, so the bios vendor
denied that it is a bug in the bios.
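
For context, a simplified sketch of where that node normally comes from
(modeled on the kernel's ACPI PCI root handling, not the exact call site):
the optional _PXM object supplies a proximity domain, and without it the
lookup yields NUMA_NO_NODE.

static int pci_root_node_sketch(struct acpi_device *device)
{
	/* returns NUMA_NO_NODE when the firmware provides no _PXM */
	return acpi_get_node(device->handle);
}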

> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 07:44:28PM +0800, Yunsheng Lin wrote:
> From [1], there are a lot of devices with a node id of NUMA_NO_NODE with the
> FW_BUG.
> 
> [1] 
> https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com/

So aside from all the software devices which we can (and really should)
fix; these:

[26.470076]  pci:00: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[26.815436]  pci:7b: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[27.004447]  pci:7a: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[27.236797]  pci:78: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[27.505833]  pci:7c: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[28.056452]  pci:74: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[28.470018]  pci:80: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[28.726411]  pci:bb: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[28.916656]  pci:ba: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[29.152839]  pci:b8: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[29.425808]  pci:bc: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.
[29.718593]  pci:b4: has invalid NUMA node(-1), default node of 0 now selected. Readjust it by writing to sysfs numa_node or contact your vendor for updates.

look like actual problems. How can PCI devices not have a node assigned?


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 13:23:49, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
[...]
> > To be honest I really fail to see why you object to a simple semantic
> > where NUMA_NO_NODE implies all usable cpus. Could you explain that please?
> 
> Because it feels wrong. The device needs to be _somewhere_. It simply
> cannot be node-less.

What if it doesn't have any numa preference for whatever reason? There
is no other way to express that than NUMA_NO_NODE.

Anyway, I am not going to argue more about this because it seems more of
a discussion about "HW shouldn't be doing that although the specification
allows that" which cannot really have any outcome except "feels
correct/wrong".

If you really feel strongly about this then we should think of a proper
way to prevent this from happening because an out-of-bounds access is
certainly not something we really want, right?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 19:28, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 17:25, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
 On 2019/9/24 4:34, Peter Zijlstra wrote:
>>>
> I'm saying the ACPI standard is wrong. Explain to me how it is
> physically possible to have a device without NUMA affinity in a NUMA
> system?
>
>  1) The fundamental interconnect is not uniform.
>  2) The device needs to actually be somewhere.
>

 From what I can see, NUMA_NO_NODE may make sense in the below case:

 1) Theoretically, there would be a device that can access all the memory
 uniformly and can be accessed by all cpus uniformly, even in a NUMA system.
 Suppose we have two nodes, and the device just sits in the middle of the
 interconnect between the two nodes.

 Even if we define a third node solely for the device, we may need to look at
 the node distance to decide whether the device can be accessed uniformly.

 Or we can decide that the device can be accessed uniformly by setting
 its node to NUMA_NO_NODE.
>>>
>>> This is indeed a theoretical case; it doesn't scale. The moment you're
>>> adding multiple sockets or even board interconnects this all goes out
>>> the window.
>>>
>>> And in this case, forcing the device to either node is fine.
>>
>> Not really.
>> For packet sending and receiving, the buffer memory may be allocated
>> dynamically. The node of the tx buffer memory is mainly based on the cpu
>> that is sending, the node of the rx buffer memory is mainly based on
>> the cpu the interrupt handler of the device is running on, and the
>> device's interrupt affinity is mainly based on the node id of the device.
>>
>> We can bind the processes that are using the device to both nodes
>> in order to utilize memory on both nodes for packet sending.
>>
>> But for packet receiving, node 1 may not be used because the node
>> id of the device is forced to node 0, which is the default way to bind
>> the interrupt to a cpu on the same node.
>>
>> If node_to_cpumask_map() returns all usable cpus when the device's node
>> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.
> 
> s/binded/bound/
> 
> Sure; the data can be allocated wherever, but the control structures are
> not dynamically allocated every time. They are persistent, and they will
> be local to some node.
> 
> Anyway, are you saying this stupid corner case is actually relevant?
> Because how does it scale out? What if you have 8 sockets, with each
> socket having 2 nodes and 1 such magic device. Then returning all CPUs
> is just plain wrong.

Yes, the hardware may not scale out, but what about the virtual device?

> 
 2) For many virtual devices, such as tun or loopback netdevice, they
 are also accessed uniformly by all cpus.
>>>
>>> Not true; the virtual device will sit in memory local to some node.
>>>
>>> And as with physical devices, you probably want at least one (virtual)
>>> queue per node.
>>
>> There may be similar handling as above for virtual device too.
> 
> And it'd be similarly broken.

From [1], there are a lot of devices with a node id of NUMA_NO_NODE with the
FW_BUG.

[1] 
https://lore.kernel.org/lkml/5a188e2b-6c07-a9db-fbaa-561e9362d...@huawei.com/


> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 07:07:36PM +0800, Yunsheng Lin wrote:
> On 2019/9/24 17:25, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> >> On 2019/9/24 4:34, Peter Zijlstra wrote:
> > 
> >>> I'm saying the ACPI standard is wrong. Explain to me how it is
> >>> physically possible to have a device without NUMA affinity in a NUMA
> >>> system?
> >>>
> >>>  1) The fundamental interconnect is not uniform.
> >>>  2) The device needs to actually be somewhere.
> >>>
> >>
> >> From what I can see, NUMA_NO_NODE may make sense in the below case:
> >>
> >> 1) Theoretically, there would be a device that can access all the memory
> >> uniformly and can be accessed by all cpus uniformly, even in a NUMA system.
> >> Suppose we have two nodes, and the device just sits in the middle of the
> >> interconnect between the two nodes.
> >>
> >> Even if we define a third node solely for the device, we may need to look at
> >> the node distance to decide whether the device can be accessed uniformly.
> >>
> >> Or we can decide that the device can be accessed uniformly by setting
> >> its node to NUMA_NO_NODE.
> > 
> > This is indeed a theoretical case; it doesn't scale. The moment you're
> > adding multiple sockets or even board interconnects this all goes out
> > the window.
> > 
> > And in this case, forcing the device to either node is fine.
> 
> Not really.
> For packet sending and receiving, the buffer memory may be allocated
> dynamically. Node of tx buffer memory is mainly based on the cpu
> that is sending sending, node of rx buffer memory is mainly based on
> the cpu the interrupt handler of the device is running on, and the
> device' interrupt affinity is mainly based on node id of the device.
> 
> We can bind the processes that are using the device to both nodes
> in order to utilize memory on both nodes for packet sending.
> 
> But for packet receiving, node 1 may not be used because the node
> id of the device is forced to node 0, which is the default way to bind
> the interrupt to a cpu on the same node.
> 
> If node_to_cpumask_map() returns all usable cpus when the device's node
> id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.

s/binded/bound/

Sure; the data can be allocated wherever, but the control structures are
not dynamically allocated every time. They are persistent, and they will
be local to some node.

Anyway, are you saying this stupid corner case is actually relevant?
Because how does it scale out? What if you have 8 sockets, with each
socket having 2 nodes and 1 such magic device? Then returning all CPUs
is just plain wrong.

> >> 2) For many virtual devices, such as tun or loopback netdevice, they
> >> are also accessed uniformly by all cpus.
> > 
> > Not true; the virtual device will sit in memory local to some node.
> > 
> > And as with physical devices, you probably want at least one (virtual)
> > queue per node.
> 
> There may be similar handling as above for virtual device too.

And it'd be similarly broken.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 12:56:22PM +0200, Michal Hocko wrote:
> On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Even the
> > > > > ACPI standard is considering this optional. Yunsheng Lin has referred
> > > > > to the specific part of the standard in one of the earlier discussions.
> > > > > Trying to guess the node affinity is worse than providing all CPUs 
> > > > > IMHO.
> > > > 
> > > > I'm saying the ACPI standard is wrong.
> > > 
> > > Even if you were right on this the reality is that a HW is likely to
> > > follow that standard and we cannot rule out NUMA_NO_NODE being
> > > specified. As of now we would access beyond the defined array and that
> > > is clearly a bug.
> > 
> > Right, because the device node is wrong, so we fix _that_!
> > 
> > > Let's assume that this is really a bug for a moment. What are you going
> > to do about that? BUG_ON? I do not really see any solution besides either
> > providing something sensible or BUG_ON. If you are worried about a
> > conditional then this should be pretty easy to solve by starting the
> > array at index -1 and associating it with the online cpu mask.
> > 
> > The same thing I proposed earlier; force the device node to 0 (or any
> > other convenient random valid value) and issue a FW_BUG message to the
> > console.
> 
> Why would you "fix" anything and how do you know that node 0 is the
> right choice? I have seen setups with node 0 without any memory and
> similar unexpected things.

We don't know 0 is right; but we know 'unknown' is wrong, so we FW_BUG
and pick _something_.

> To be honest I really fail to see why you object to a simple semantic
> where NUMA_NO_NODE implies all usable cpus. Could you explain that please?

Because it feels wrong. The device needs to be _somewhere_. It simply
cannot be node-less.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Yunsheng Lin
On 2019/9/24 17:25, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
>> On 2019/9/24 4:34, Peter Zijlstra wrote:
> 
>>> I'm saying the ACPI standard is wrong. Explain to me how it is
>>> physically possible to have a device without NUMA affinity in a NUMA
>>> system?
>>>
>>>  1) The fundamental interconnect is not uniform.
>>>  2) The device needs to actually be somewhere.
>>>
>>
>> From what I can see, NUMA_NO_NODE may make sense in the below case:
>>
>> 1) Theoretically, there would be a device that can access all the memory
>> uniformly and can be accessed by all cpus uniformly, even in a NUMA system.
>> Suppose we have two nodes, and the device just sits in the middle of the
>> interconnect between the two nodes.
>>
>> Even if we define a third node solely for the device, we may need to look at
>> the node distance to decide whether the device can be accessed uniformly.
>>
>> Or we can decide that the device can be accessed uniformly by setting
>> its node to NUMA_NO_NODE.
> 
> This is indeed a theoretical case; it doesn't scale. The moment you're
> adding multiple sockets or even board interconnects this all goes out
> the window.
> 
> And in this case, forcing the device to either node is fine.

Not really.
For packet sending and receiving, the buffer memory may be allocated
dynamically. The node of the tx buffer memory is mainly based on the cpu
that is sending, the node of the rx buffer memory is mainly based on
the cpu the interrupt handler of the device is running on, and the
device's interrupt affinity is mainly based on the node id of the device.

We can bind the processes that are using the device to both nodes
in order to utilize memory on both nodes for packet sending.

But for packet receiving, node 1 may not be used because the node
id of the device is forced to node 0, which is the default way to bind
the interrupt to a cpu on the same node.

If node_to_cpumask_map() returns all usable cpus when the device's node
id is NUMA_NO_NODE, then interrupt can be binded to the cpus on both nodes.
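
A sketch of the driver-side pattern described above (hypothetical helper,
not taken from the hns3 driver): derive the IRQ affinity hint from the
device's node, so that with the proposed semantics a NUMA_NO_NODE device
gets all online cpus instead of an out-of-bounds map access.

static void set_irq_hint_from_dev(struct device *dev, unsigned int irq)
{
	irq_set_affinity_hint(irq, cpumask_of_node(dev_to_node(dev)));
}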

> 
>> 2) For many virtual devices, such as tun or loopback netdevice, they
>> are also accessed uniformly by all cpus.
> 
> Not true; the virtual device will sit in memory local to some node.
> 
> And as with physical devices, you probably want at least one (virtual)
> queue per node.

There may be similar handling as above for virtual device too.

> 
> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Tue 24-09-19 11:17:14, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> > [...]
> > > > Even the
> > > > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > > > the specific part of the standard in one of the earlier discussions.
> > > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > > 
> > > I'm saying the ACPI standard is wrong.
> > 
> > Even if you were right on this the reality is that a HW is likely to
> > follow that standard and we cannot rule out NUMA_NO_NODE being
> > specified. As of now we would access beyond the defined array and that
> > is clearly a bug.
> 
> Right, because the device node is wrong, so we fix _that_!
> 
> > Let's assume that this is really a bug for a moment. What are you going
> > to do about that? BUG_ON? I do not really see any solution besides either
> > providing something sensible or BUG_ON. If you are worried about a
> > conditional then this should be pretty easy to solve by starting the
> > array at index -1 and associating it with the online cpu mask.
> 
> The same thing I proposed earlier; force the device node to 0 (or any
> other convenient random valid value) and issue a FW_BUG message to the
> console.

Why would you "fix" anything and how do you know that node 0 is the
right choice? I have seen setups with node 0 without any memory and
similar unexpected things.

To be honest I really fail to see why you object to a simple semantic
where NUMA_NO_NODE implies all usable cpus. Could you explain that please?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 09:29:50AM +0800, Yunsheng Lin wrote:
> On 2019/9/24 4:34, Peter Zijlstra wrote:

> > I'm saying the ACPI standard is wrong. Explain to me how it is
> > physically possible to have a device without NUMA affinity in a NUMA
> > system?
> > 
> >  1) The fundamental interconnect is not uniform.
> >  2) The device needs to actually be somewhere.
> > 
> 
> From what I can see, NUMA_NO_NODE may make sense in the below case:
> 
> 1) Theoretically, there would be a device that can access all the memory
> uniformly and can be accessed by all cpus uniformly, even in a NUMA system.
> Suppose we have two nodes, and the device just sits in the middle of the
> interconnect between the two nodes.
> 
> Even if we define a third node solely for the device, we may need to look at
> the node distance to decide whether the device can be accessed uniformly.
> 
> Or we can decide that the device can be accessed uniformly by setting
> its node to NUMA_NO_NODE.

This is indeed a theoretical case; it doesn't scale. The moment you're
adding multiple sockets or even board interconnects this all goes out
the window.

And in this case, forcing the device to either node is fine.

> 2) For many virtual devices, such as tun or loopback netdevice, they
> are also accessed uniformly by all cpus.

Not true; the virtual device will sit in memory local to some node.

And as with physical devices, you probably want at least one (virtual)
queue per node.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 09:47:51AM +0200, Michal Hocko wrote:
> On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> > On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> [...]
> > > Even the
> > > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > > the specific part of the standard in one of the earlier discussions.
> > > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> > 
> > I'm saying the ACPI standard is wrong.
> 
> Even if you were right on this the reality is that a HW is likely to
> follow that standard and we cannot rule out NUMA_NO_NODE being
> specified. As of now we would access beyond the defined array and that
> is clearly a bug.

Right, because the device node is wrong, so we fix _that_!

> Let's assume that this is really a bug for a moment. What are you going
> to do about that? BUG_ON? I do not really see any solution besides either
> providing something sensible or BUG_ON. If you are worried about a
> conditional then this should be pretty easy to solve by starting the
> array at index -1 and associating it with the online cpu mask.

The same thing I proposed earlier; force the device node to 0 (or any
other convenient random valid value) and issue a FW_BUG message to the
console.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-24 Thread Michal Hocko
On Mon 23-09-19 22:34:10, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
[...]
> > Even the
> > ACPI standard is considering this optional. Yunsheng Lin has referred to
> > the specific part of the standard in one of the earlier discussions.
> > Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong.

Even if you were right on this the reality is that a HW is likely to
follow that standard and we cannot rule out NUMA_NO_NODE being
specified. As of now we would access beyond the defined array and that
is clearly a bug.

Let's assume that this is really a bug for a moment. What are you going
to do about that? BUG_ON? I do not really see any solution besides either
providing something sensible or BUG_ON. If you are worried about a
conditional then this should be pretty easy to solve by starting the
array at index -1 and associating it with the online cpu mask.
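
A minimal sketch of that semantic, modeled on the x86 cpumask_of_node()
hunk quoted at the end of this thread (the WARN_ON fallback here is an
illustration, not the actual patch):

const struct cpumask *cpumask_of_node(int node)
{
	if (node == NUMA_NO_NODE)
		return cpu_online_mask;	/* no preference: any online cpu */

	if (WARN_ON((unsigned int)node >= nr_node_ids))
		return cpu_none_mask;	/* out-of-range id is a plain bug */

	return node_to_cpumask_map[node];
}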
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Yunsheng Lin
On 2019/9/24 4:34, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
>> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> 
>> To the NUMA_NO_NODE itself. Your earlier email noted:
>> : > +
>> : >  if ((unsigned)node >= nr_node_ids) {
>> : >  printk(KERN_WARNING
>> :  "cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
>> : 
>> : I still think this makes absolutely no sense whatsoever.
>>
>> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
>> check?
> 
> The NUMA_NO_NODE thing. It's a physical impossibility. And if the
> device description doesn't give us a node, then the description is
> incomplete and wrong and we should bloody well complain about it.
> 
>> As to NUMA_NO_NODE, I believe this makes sense because it is
>> the only way that a device is not bound to any numa node.
> 
> Which is a physical impossibility.
> 
>> Even the
>> ACPI standard considers this optional. Yunsheng Lin has referred to
>> the specific part of the standard in one of the earlier discussions.
>> Trying to guess the node affinity is worse than providing all CPUs IMHO.
> 
> I'm saying the ACPI standard is wrong. Explain to me how it is
> physically possible to have a device without NUMA affinity in a NUMA
> system?
> 
>  1) The fundamental interconnect is not uniform.
>  2) The device needs to actually be somewhere.
> 

From what I can see, NUMA_NO_NODE may make sense in the cases below:

1) Theoretically, there could be a device that accesses all memory
uniformly and is accessed by all cpus uniformly, even in a NUMA system.
Suppose we have two nodes, and the device sits in the middle of the
interconnect between the two nodes.

Even if we define a third node solely for the device, we may still need
to look at the node distances to decide whether the device can be
accessed uniformly (see the sketch after point 2 below).

Or we can declare that the device is accessed uniformly by setting
its node to NUMA_NO_NODE.


2) Many virtual devices, such as the tun or loopback netdevice, are
also accessed uniformly by all cpus.
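
The distance check mentioned under point 1 might look like this rough
sketch (the helper is hypothetical): treat the device node as uniform
if it is equidistant from every node that has CPUs.

#include <linux/nodemask.h>
#include <linux/topology.h>
#include <linux/types.h>

static bool device_node_is_uniform(int dev_node)
{
        int nid, dist = -1;

        for_each_node_state(nid, N_CPU) {
                if (dist < 0)
                        dist = node_distance(dev_node, nid);
                else if (node_distance(dev_node, nid) != dist)
                        return false;   /* closer to some nodes */
        }
        return true;
}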



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2019 at 06:52:35PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:

> To the NUMA_NO_NODE itself. Your earlier email noted:
> : > +
> : >   if ((unsigned)node >= nr_node_ids) {
> : >   printk(KERN_WARNING
> : >   "cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
> : 
> : I still think this makes absolutely no sense whatsoever.
> 
> Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
> check?

The NUMA_NO_NODE thing. It's a physical impossibility. And if the
device description doesn't give us a node, then the description is
incomplete and wrong and we should bloody well complain about it.

> As to NUMA_NO_NODE, I believe this makes sense because it is
> the only way that a device is not bound to any numa node.

Which is a physical impossibility.

> Even the
> ACPI standard considers this optional. Yunsheng Lin has referred to
> the specific part of the standard in one of the earlier discussions.
> Trying to guess the node affinity is worse than providing all CPUs IMHO.

I'm saying the ACPI standard is wrong. Explain to me how it is
physically possible to have a device without NUMA affinity in a NUMA
system?

 1) The fundamental interconnect is not uniform.
 2) The device needs to actually be somewhere.

From these it seems to follow that access to the device is subject to
NUMA.



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Michal Hocko
On Mon 23-09-19 17:48:52, Peter Zijlstra wrote:
> On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> > On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> 
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 4123100e..9859acb 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > > >   */
> > > >  const struct cpumask *cpumask_of_node(int node)
> > > >  {
> > > > +   if (node == NUMA_NO_NODE)
> > > > +           return cpu_online_mask;
> > > 
> > > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > > sure that if I put:
> > > 
> > >   lockdep_assert_cpus_held();
> > 
> > Is this documented somewhere?
> 
> No idea... common sense :-)

I thought that and cpuhotplug were forbidden to be used in the same
sentence :p

> > Also how does that differ from a normal
> > case when a proper node is used? The cpumask will always be dynamic in
> > the cpu hotplug presence, right?
> 
> As per normal yes, and I'm fairly sure there's a ton of bugs. Any
> 'online' state is subject to change except when you're holding
> sufficient locks to stop it.
> 
> Disabling preemption also stabilizes it, because cpu unplug relies on
> stop-machine.

OK, I guess it is fair to document that callers should be careful when
using this if they absolutely need any stability. But I strongly suspect
they simply do not care all that much. They mostly want something
that gives them an idea of which CPUs are close to the device, and
they can tolerate some races.

In other words, this is more of an optimization than a correctness issue.
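
A sketch of such a best-effort caller (the helper is hypothetical):

#include <linux/cpumask.h>
#include <linux/device.h>
#include <linux/topology.h>

static int pick_nearby_cpu(struct device *dev)
{
        const struct cpumask *mask = cpumask_of_node(dev_to_node(dev));

        /* Best effort: hotplug may change the masks underneath us,
         * and the caller tolerates a stale answer. */
        return cpumask_any_and(mask, cpu_online_mask);
}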
 
> > > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > > the online mask is gibberish.
> > 
> > Can the returned cpu mask go away?
> 
> No, the cpu_online_mask itself has static storage, the contents OTOH can
> change at will. Very little practical difference :-)
 
OK, thanks for the confirmation. I was worried that I've overlooked
something.

To the NUMA_NO_NODE itself. Your earlier email noted:
: > +
: > if ((unsigned)node >= nr_node_ids) {
: > printk(KERN_WARNING
: > "cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",
: 
: I still think this makes absolutely no sense whatsoever.

Did you mean the NUMA_NO_NODE handling or the specific node >= nr_node_ids
check?

As to NUMA_NO_NODE, I believe this makes sense because it is
the only way that a device is not bound to any numa node. Even the
ACPI standard considers this optional. Yunsheng Lin has referred to
the specific part of the standard in one of the earlier discussions.
Trying to guess the node affinity is worse than providing all CPUs IMHO.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Mon, Sep 23, 2019 at 05:28:56PM +0200, Michal Hocko wrote:
> On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:

> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 4123100e..9859acb 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> > >   */
> > >  const struct cpumask *cpumask_of_node(int node)
> > >  {
> > > + if (node == NUMA_NO_NODE)
> > > +         return cpu_online_mask;
> > 
> > This mandates the caller holds cpus_read_lock() or something, I'm pretty
> > sure that if I put:
> > 
> > lockdep_assert_cpus_held();
> 
> Is this documented somewhere?

No idea... common sense :-)

> Also how does that differ from a normal
> case when a proper node is used? The cpumask will always be dynamic in
> the cpu hotplug presence, right?

As per normal yes, and I'm fairly sure there's a ton of bugs. Any
'online' state is subject to change except when you're holding
sufficient locks to stop it.

Disabling preemption also stabilizes it, because cpu unplug relies on
stop-machine.
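
For reference, a caller that really needs a stable view would bracket
the access along the lines of this sketch (the walker function is
made up):

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

static void walk_node_cpus_stable(int node)
{
        int cpu;

        cpus_read_lock();       /* holds off CPU hotplug */
        for_each_cpu(cpu, cpumask_of_node(node)) {
                /* mask contents cannot change while the lock is held */
        }
        cpus_read_unlock();
}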

> > here, it comes apart real quick. Without holding the cpu hotplug lock,
> > the online mask is gibberish.
> 
> Can the returned cpu mask go away?

No, the cpu_online_mask itself has static storage, the contents OTOH can
change at will. Very little practical difference :-)




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Michal Hocko
On Mon 23-09-19 17:15:19, Peter Zijlstra wrote:
> On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> > When the return value of dev_to_node() is passed to cpumask_of_node()
> > without checking whether the device's node id is NUMA_NO_NODE, a
> > global out-of-bounds access is detected by KASAN.
> > 
> > From the discussion [1], NUMA_NO_NODE really means no node affinity,
> > which also means all cpus should be usable. So cpumask_of_node()
> > should always return all online cpus when the caller passes
> > NUMA_NO_NODE as the node id, similar to the way the page allocator
> > handles NUMA_NO_NODE.
> > 
> > But we cannot really copy the page allocator logic, simply because
> > the page allocator doesn't enforce the near node affinity. It just
> > picks it up as a preferred node but is then free to fall back to any
> > other numa node. This is not the case here: node_to_cpumask_map will
> > restrict to the particular node's cpus only, which would give really
> > non-deterministic behavior depending on where the code is executed.
> > So in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
> > 
> > Also there is a debugging version of node_to_cpumask_map() for x86
> > and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is
> > defined; this patch changes it to handle NUMA_NO_NODE like the normal
> > node_to_cpumask_map().
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1125789/
> 
> That is bloody unusable, don't do that. Use:
> 
>   https://lkml.kernel.org/r/$MSGID
> 
> if anything. Then I can find it in my local mbox without having to
> resort to touching a mouse and shitty browser software.
> 
> (also patchwork is absolute crap for reading email threads)
> 
> Anyway, I found it -- I think, I refused to click the link. I replied
> there.
> 
> > Signed-off-by: Yunsheng Lin 
> > Suggested-by: Michal Hocko 
> > Acked-by: Michal Hocko 
> 
> 
> 
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 4123100e..9859acb 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
> >   */
> >  const struct cpumask *cpumask_of_node(int node)
> >  {
> > +   if (node == NUMA_NO_NODE)
> > +           return cpu_online_mask;
> 
> This mandates the caller holds cpus_read_lock() or something, I'm pretty
> sure that if I put:
> 
>   lockdep_assert_cpus_held();

Is this documented somewhere? Also how does that differ from a normal
case when a proper node is used? The cpumask will always be dynamic in
the cpu hotplug presence, right?

> here, it comes apart real quick. Without holding the cpu hotplug lock,
> the online mask is gibberish.

Can the returned cpu mask go away?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-23 Thread Peter Zijlstra
On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> When the return value of dev_to_node() is passed to cpumask_of_node()
> without checking whether the device's node id is NUMA_NO_NODE, a
> global out-of-bounds access is detected by KASAN.
> 
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So cpumask_of_node()
> should always return all online cpus when the caller passes
> NUMA_NO_NODE as the node id, similar to the way the page allocator
> handles NUMA_NO_NODE.
> 
> But we cannot really copy the page allocator logic, simply because
> the page allocator doesn't enforce the near node affinity. It just
> picks it up as a preferred node but is then free to fall back to any
> other numa node. This is not the case here: node_to_cpumask_map will
> restrict to the particular node's cpus only, which would give really
> non-deterministic behavior depending on where the code is executed.
> So in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
> 
> Also there is a debugging version of node_to_cpumask_map() for x86
> and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is
> defined; this patch changes it to handle NUMA_NO_NODE like the normal
> node_to_cpumask_map().
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/

That is bloody unusable, don't do that. Use:

  https://lkml.kernel.org/r/$MSGID

if anything. Then I can find it in my local mbox without having to
resort to touching a mouse and shitty browser software.

(also patchwork is absolute crap for reading email threads)

Anyway, I found it -- I think, I refused to click the link. I replied
there.

> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 
> Acked-by: Michal Hocko 



> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4123100e..9859acb 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -861,6 +861,9 @@ void numa_remove_cpu(int cpu)
>   */
>  const struct cpumask *cpumask_of_node(int node)
>  {
> + if (node == NUMA_NO_NODE)
> +         return cpu_online_mask;

This mandates the caller holds cpus_read_lock() or something, I'm pretty
sure that if I put:

lockdep_assert_cpus_held();

here, it comes apart real quick. Without holding the cpu hotplug lock,
the online mask is gibberish.

> +
>   if ((unsigned)node >= nr_node_ids) {
>   printk(KERN_WARNING
>   "cpumask_of_node(%d): (unsigned)node >= nr_node_ids(%u)\n",

I still think this makes absolutely no sense whatsoever.


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-22 Thread Yunsheng Lin
On 2019/9/22 6:38, Paul Burton wrote:
> Hi Yunsheng,
> 
> On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
>> When the return value of dev_to_node() is passed to cpumask_of_node()
>> without checking whether the device's node id is NUMA_NO_NODE, a
>> global out-of-bounds access is detected by KASAN.
>>
>> From the discussion [1], NUMA_NO_NODE really means no node affinity,
>> which also means all cpus should be usable. So cpumask_of_node()
>> should always return all online cpus when the caller passes
>> NUMA_NO_NODE as the node id, similar to the way the page allocator
>> handles NUMA_NO_NODE.
>>
>> But we cannot really copy the page allocator logic, simply because
>> the page allocator doesn't enforce the near node affinity. It just
>> picks it up as a preferred node but is then free to fall back to any
>> other numa node. This is not the case here: node_to_cpumask_map will
>> restrict to the particular node's cpus only, which would give really
>> non-deterministic behavior depending on where the code is executed.
>> So in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>>
>> Also there is a debugging version of node_to_cpumask_map() for x86
>> and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is
>> defined; this patch changes it to handle NUMA_NO_NODE like the normal
>> node_to_cpumask_map().
>>
>> [1] https://lore.kernel.org/patchwork/patch/1125789/
>> Signed-off-by: Yunsheng Lin 
>> Suggested-by: Michal Hocko 
>> Acked-by: Michal Hocko 
> 
> If you end up sending another revision then I think it would be worth
> replacing -1 with NUMA_NO_NODE in
> arch/mips/include/asm/mach-ip27/topology.h for consistency, but in any
> case:

Perhaps it is better to replace -1 with NUMA_NO_NODE along with the
cpu_all_mask -> cpu_online_mask change, if that change turns out to be
reasonable.

Anyway, I will do that if another version is needed.

> 
> Acked-by: Paul Burton  # MIPS bits

Thanks for that.

> 
> Thanks,
> Paul
> 
>> ---
>> V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a
>> little controversial, may need deeper investigation, and rebased
>> on the latest linux-next.
>> V5: Drop unsigned "fix" change for x86/arm64, and change comment log
>> according to Michal's comment.
>> V4: Have all these changes in a single patch.
>> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
>> for NUMA_NO_NODE case, and change the commit log to better justify
>> the change.
>> V2: make the node id checking change to other arches too.
>> ---
>>  arch/arm64/include/asm/numa.h                    | 3 +++
>>  arch/arm64/mm/numa.c                             | 3 +++
>>  arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
>>  arch/s390/include/asm/topology.h                 | 3 +++
>>  arch/x86/include/asm/topology.h                  | 3 +++
>>  arch/x86/mm/numa.c                               | 3 +++
>>  6 files changed, 18 insertions(+), 1 deletion(-)
> 
> .
> 



Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-21 Thread Paul Burton
Hi Yunsheng,

On Tue, Sep 17, 2019 at 08:48:54PM +0800, Yunsheng Lin wrote:
> When the return value of dev_to_node() is passed to cpumask_of_node()
> without checking whether the device's node id is NUMA_NO_NODE, a
> global out-of-bounds access is detected by KASAN.
> 
> From the discussion [1], NUMA_NO_NODE really means no node affinity,
> which also means all cpus should be usable. So cpumask_of_node()
> should always return all online cpus when the caller passes
> NUMA_NO_NODE as the node id, similar to the way the page allocator
> handles NUMA_NO_NODE.
> 
> But we cannot really copy the page allocator logic, simply because
> the page allocator doesn't enforce the near node affinity. It just
> picks it up as a preferred node but is then free to fall back to any
> other numa node. This is not the case here: node_to_cpumask_map will
> restrict to the particular node's cpus only, which would give really
> non-deterministic behavior depending on where the code is executed.
> So in fact we really want to return cpu_online_mask for NUMA_NO_NODE.
> 
> Also there is a debugging version of node_to_cpumask_map() for x86
> and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is
> defined; this patch changes it to handle NUMA_NO_NODE like the normal
> node_to_cpumask_map().
> 
> [1] https://lore.kernel.org/patchwork/patch/1125789/
> Signed-off-by: Yunsheng Lin 
> Suggested-by: Michal Hocko 
> Acked-by: Michal Hocko 

If you end up sending another revision then I think it would be worth
replacing -1 with NUMA_NO_NODE in
arch/mips/include/asm/mach-ip27/topology.h for consistency, but in any
case:

Acked-by: Paul Burton  # MIPS bits

Thanks,
Paul

> ---
> V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a
> little controversial, may need deeper investigation, and rebased
> on the latest linux-next.
> V5: Drop unsigned "fix" change for x86/arm64, and change comment log
> according to Michal's comment.
> V4: Have all these changes in a single patch.
> V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
> for NUMA_NO_NODE case, and change the commit log to better justify
> the change.
> V2: make the node id checking change to other arches too.
> ---
>  arch/arm64/include/asm/numa.h                    | 3 +++
>  arch/arm64/mm/numa.c                             | 3 +++
>  arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
>  arch/s390/include/asm/topology.h                 | 3 +++
>  arch/x86/include/asm/topology.h                  | 3 +++
>  arch/x86/mm/numa.c                               | 3 +++
>  6 files changed, 18 insertions(+), 1 deletion(-)


[PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-09-17 Thread Yunsheng Lin
When the return value of dev_to_node() is passed to cpumask_of_node()
without checking whether the device's node id is NUMA_NO_NODE, a
global out-of-bounds access is detected by KASAN.
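
For illustration (not part of the patch): with no firmware-provided
affinity, dev_to_node() returns NUMA_NO_NODE (-1), so a call like the
hypothetical one below ends up indexing node_to_cpumask_map[-1]:

        /* 'pdev' is some PCI device with no proximity info */
        const struct cpumask *mask =
                cpumask_of_node(dev_to_node(&pdev->dev));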

From the discussion [1], NUMA_NO_NODE really means no node affinity,
which also means all cpus should be usable. So cpumask_of_node()
should always return all online cpus when the caller passes
NUMA_NO_NODE as the node id, similar to the way the page allocator
handles NUMA_NO_NODE.

But we cannot really copy the page allocator logic, simply because
the page allocator doesn't enforce the near node affinity. It just
picks it up as a preferred node but is then free to fall back to any
other numa node. This is not the case here: node_to_cpumask_map will
restrict to the particular node's cpus only, which would give really
non-deterministic behavior depending on where the code is executed.
So in fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Also there is a debugging version of node_to_cpumask_map() for x86
and arm64, which is only used when CONFIG_DEBUG_PER_CPU_MAPS is
defined; this patch changes it to handle NUMA_NO_NODE like the normal
node_to_cpumask_map().

[1] https://lore.kernel.org/patchwork/patch/1125789/
Signed-off-by: Yunsheng Lin 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
---
V6: Drop the cpu_all_mask -> cpu_online_mask change for it seems a
little controversial, may need deeper investigation, and rebased
on the latest linux-next.
V5: Drop unsigned "fix" change for x86/arm64, and change comment log
according to Michal's comment.
V4: Have all these changes in a single patch.
V3: Change to only handle NUMA_NO_NODE, and return cpu_online_mask
for NUMA_NO_NODE case, and change the commit log to better justify
the change.
V2: make the node id checking change to other arches too.
---
 arch/arm64/include/asm/numa.h                    | 3 +++
 arch/arm64/mm/numa.c                             | 3 +++
 arch/mips/include/asm/mach-loongson64/topology.h | 4 +++-
 arch/s390/include/asm/topology.h                 | 3 +++
 arch/x86/include/asm/topology.h                  | 3 +++
 arch/x86/mm/numa.c                               | 3 +++
 6 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01..c8a4b31 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -25,6 +25,9 @@ const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+           return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 4decf16..5ae7eea 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -46,6 +46,9 @@ EXPORT_SYMBOL(node_to_cpumask_map);
  */
 const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+           return cpu_online_mask;
+
if (WARN_ON(node >= nr_node_ids))
return cpu_none_mask;
 
diff --git a/arch/mips/include/asm/mach-loongson64/topology.h b/arch/mips/include/asm/mach-loongson64/topology.h
index 7ff819a..e78daa6 100644
--- a/arch/mips/include/asm/mach-loongson64/topology.h
+++ b/arch/mips/include/asm/mach-loongson64/topology.h
@@ -5,7 +5,9 @@
 #ifdef CONFIG_NUMA
 
 #define cpu_to_node(cpu)   (cpu_logical_map(cpu) >> 2)
-#define cpumask_of_node(node)  (&__node_data[(node)]->cpumask)
+#define cpumask_of_node(node)  ((node) == NUMA_NO_NODE ?       \
+                                cpu_online_mask :               \
+                                &__node_data[(node)]->cpumask)
 
 struct pci_bus;
 extern int pcibus_to_node(struct pci_bus *);
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index cca406f..1bd2e73 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -78,6 +78,9 @@ static inline int cpu_to_node(int cpu)
 #define cpumask_of_node cpumask_of_node
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+           return cpu_online_mask;
+
    return &node_to_cpumask_map[node];
 }
 
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d23..7fa82e1 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -69,6 +69,9 @@ extern const struct cpumask *cpumask_of_node(int node);
 /* Returns a pointer to the cpumask of CPUs on Node 'node'. */
 static inline const struct cpumask *cpumask_of_node(int node)
 {
+   if (node == NUMA_NO_NODE)
+           return cpu_online_mask;
+
return node_to_cpumask_map[node];
 }
 #endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4123100e..9859acb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -861,6 +861,9 @@ void