Re: Purpose of numa_node?
On Wed, Feb 13, 2008 at 10:52 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> /sys/devices/pci:40/:40:0f.0/numa_node 1
> /sys/devices/pci:40/:40:10.0/numa_node 1
> /sys/devices/pci:40/:40:11.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
> /sys/devices/pci:40/:40:13.0/numa_node 1
>
> The 5 last lines above would report 0 instead of 1 with an older kernel.
> Everything looks correct now (:40 is the second PCIe bus and it is
> attached to socket #1).
>
> Thanks a lot, Yinghai! Are you planning to merge these patches in the
> near future? 2.6.26?
>

ingo put them in x86.git#testing

please check
http://people.redhat.com/mingo/x86.git/README
to get that.

YH
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Purpose of numa_node?
On Feb 13, 2008 10:52 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Yinghai Lu wrote:
> >>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> >>> I should get 1 for some devices. And if I unplug/replug a device using
> >>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> >>> like the code is there but things are initialized in the wrong order.
> >>>
> >> do you have
> >> ...
> >> bus 00 -> pxm 0 -> node 0
> >> ...
> >> bus 40 -> pxm 1 -> node 1
> >> ...
> >> bus 80 -> pxm 1 -> node 1
> >>
> >> in your boot msg or dmesg?
> >>
> >> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> >> ask your HW vendor to add that in their BIOS, or use my patchset.
> >>
> >
> > please try the attached patchset
> >
> > please get x86.git then use quilt apply the patch
> >
> > http://people.redhat.com/mingo/x86.git/README
> >
>
> I finally managed to test this and it seems to work. I now get the
> following numa_node attributes:
> /sys/devices/pci:00/:00:01.0/numa_node 0
> /sys/devices/pci:00/:00:07.0/numa_node 0
> /sys/devices/pci:00/:00:07.0/:38:0d.0/numa_node 0
> /sys/devices/pci:00/:00:08.0/numa_node 0
> /sys/devices/pci:00/:00:08.1/numa_node 0
> /sys/devices/pci:00/:00:08.2/numa_node 0
> /sys/devices/pci:00/:00:09.0/numa_node 0
> /sys/devices/pci:00/:00:09.1/numa_node 0
> /sys/devices/pci:00/:00:09.2/numa_node 0
> /sys/devices/pci:00/:00:0a.0/numa_node 0
> /sys/devices/pci:00/:00:0a.0/:22:00.0/numa_node 0
> /sys/devices/pci:00/:00:0b.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/:0c:00.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/:0c:00.0/:0d:00.0/numa_node 0
> /sys/devices/pci:00/:00:0d.0/numa_node 0
> /sys/devices/pci:00/:00:0d.0/:01:00.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/:17:00.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/:17:00.0/:18:00.0/numa_node 0
> /sys/devices/pci:00/:00:18.0/numa_node 0
> /sys/devices/pci:00/:00:18.1/numa_node 0
> /sys/devices/pci:00/:00:18.2/numa_node 0
> /sys/devices/pci:00/:00:18.3/numa_node 0
> /sys/devices/pci:00/:00:19.0/numa_node 0
> /sys/devices/pci:00/:00:19.1/numa_node 0
> /sys/devices/pci:00/:00:19.2/numa_node 0
> /sys/devices/pci:00/:00:19.3/numa_node 0
> /sys/devices/pci:00/:00:1a.0/numa_node 0
> /sys/devices/pci:00/:00:1a.1/numa_node 0
> /sys/devices/pci:00/:00:1a.2/numa_node 0
> /sys/devices/pci:00/:00:1a.3/numa_node 0
> /sys/devices/pci:00/:00:1b.0/numa_node 0
> /sys/devices/pci:00/:00:1b.1/numa_node 0
> /sys/devices/pci:00/:00:1b.2/numa_node 0
> /sys/devices/pci:00/:00:1b.3/numa_node 0
> /sys/devices/pci:40/:40:0f.0/numa_node 1
> /sys/devices/pci:40/:40:10.0/numa_node 1
> /sys/devices/pci:40/:40:11.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
> /sys/devices/pci:40/:40:13.0/numa_node 1
>
> The 5 last lines above would report 0 instead of 1 with an older kernel.
> Everything looks correct now (:40 is the second PCIe bus and it is
> attached to socket #1).
>
> Thanks a lot, Yinghai! Are you planning to merge these patches in the
> near future? 2.6.26?

Andi thought that is too hardware related...
they have stayed a while in -mm.

these patchset could be only useful when you have several HT chains,
and BIOS doesn't have pxm->node in dsdt, or doesn't allocate io
resource to some of addon cards.

YH
Re: Purpose of numa_node?
Yinghai Lu wrote:
>>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
>>> I should get 1 for some devices. And if I unplug/replug a device using
>>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
>>> like the code is there but things are initialized in the wrong order.
>>>
>> do you have
>> ...
>> bus 00 -> pxm 0 -> node 0
>> ...
>> bus 40 -> pxm 1 -> node 1
>> ...
>> bus 80 -> pxm 1 -> node 1
>>
>> in your boot msg or dmesg?
>>
>> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
>> ask your HW vendor to add that in their BIOS, or use my patchset.
>>
>
> please try the attached patchset
>
> please get x86.git then use quilt apply the patch
>
> http://people.redhat.com/mingo/x86.git/README
>

I finally managed to test this and it seems to work. I now get the
following numa_node attributes:
/sys/devices/pci:00/:00:01.0/numa_node 0
/sys/devices/pci:00/:00:07.0/numa_node 0
/sys/devices/pci:00/:00:07.0/:38:0d.0/numa_node 0
/sys/devices/pci:00/:00:08.0/numa_node 0
/sys/devices/pci:00/:00:08.1/numa_node 0
/sys/devices/pci:00/:00:08.2/numa_node 0
/sys/devices/pci:00/:00:09.0/numa_node 0
/sys/devices/pci:00/:00:09.1/numa_node 0
/sys/devices/pci:00/:00:09.2/numa_node 0
/sys/devices/pci:00/:00:0a.0/numa_node 0
/sys/devices/pci:00/:00:0a.0/:22:00.0/numa_node 0
/sys/devices/pci:00/:00:0b.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/:0c:00.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/:0c:00.0/:0d:00.0/numa_node 0
/sys/devices/pci:00/:00:0d.0/numa_node 0
/sys/devices/pci:00/:00:0d.0/:01:00.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/:17:00.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/:17:00.0/:18:00.0/numa_node 0
/sys/devices/pci:00/:00:18.0/numa_node 0
/sys/devices/pci:00/:00:18.1/numa_node 0
/sys/devices/pci:00/:00:18.2/numa_node 0
/sys/devices/pci:00/:00:18.3/numa_node 0
/sys/devices/pci:00/:00:19.0/numa_node 0
/sys/devices/pci:00/:00:19.1/numa_node 0
/sys/devices/pci:00/:00:19.2/numa_node 0
/sys/devices/pci:00/:00:19.3/numa_node 0
/sys/devices/pci:00/:00:1a.0/numa_node 0
/sys/devices/pci:00/:00:1a.1/numa_node 0
/sys/devices/pci:00/:00:1a.2/numa_node 0
/sys/devices/pci:00/:00:1a.3/numa_node 0
/sys/devices/pci:00/:00:1b.0/numa_node 0
/sys/devices/pci:00/:00:1b.1/numa_node 0
/sys/devices/pci:00/:00:1b.2/numa_node 0
/sys/devices/pci:00/:00:1b.3/numa_node 0
/sys/devices/pci:40/:40:0f.0/numa_node 1
/sys/devices/pci:40/:40:10.0/numa_node 1
/sys/devices/pci:40/:40:11.0/numa_node 1
/sys/devices/pci:40/:40:12.0/numa_node 1
/sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
/sys/devices/pci:40/:40:13.0/numa_node 1

The 5 last lines above would report 0 instead of 1 with an older kernel.
Everything looks correct now (:40 is the second PCIe bus and it is
attached to socket #1).

Thanks a lot, Yinghai! Are you planning to merge these patches in the
near future? 2.6.26?

Brice

PS: I saved the corresponding dmesg. If you want to look at it, please
let me know.
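
A listing like the one in this message can be gathered with a short script. The following is only a minimal sketch, assuming each device directory under sysfs exposes a `numa_node` file containing a single integer; the function name `collect_numa_nodes` and the `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
import os
from collections import defaultdict


def collect_numa_nodes(sysfs_root="/sys/devices"):
    """Map each reported NUMA node number to the device directories reporting it.

    Hypothetical helper: walks the tree without following symlinks and
    reads every file named "numa_node" it finds.
    """
    by_node = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(sysfs_root):
        if "numa_node" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "numa_node")) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed attribute: skip this device
        by_node[node].append(dirpath)
    return {node: sorted(paths) for node, paths in by_node.items()}
```

Calling `collect_numa_nodes()` on a live system and printing the result grouped by key gives one bucket per node. On hardware like the machine discussed here, a kernel without the patchset would be expected to put every device in the node 0 bucket.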
Re: Purpose of numa_node?
On Jan 31, 2008 1:42 PM, Yinghai Lu <[EMAIL PROTECTED]> wrote:
>
> On Jan 31, 2008 1:35 PM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> > Yinghai Lu wrote:
> > > On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> > >
> > >> It works fine on regular machines such as dual opterons. However, I
> > >> noticed recently that it was wrong on some quad-opteron machines (see
> > >> http://marc.info/?l=linux-pci=11907248538=2) because something
> > >> is not initialized in the right order. But I haven't tested 2.6.24 on
> > >> this hardware yet, and I don't know if things have changed regarding
> > >> this.
> > >>
> > >
> > > that will depend if you dsdt have _PXM for your pci root bus.
> > > otherwise you will get all -1
> > >
> >
> > Have a look at the above link. I don't get -1. I get 0 everywhere, while
> > I should get 1 for some devices. And if I unplug/replug a device using
> > fakephp, numa_node becomes correct (1 instead of 0). This just looks
> > like the code is there but things are initialized in the wrong order.
>
> do you have
> ...
> bus 00 -> pxm 0 -> node 0
> ...
> bus 40 -> pxm 1 -> node 1
> ...
> bus 80 -> pxm 1 -> node 1
>
> in your boot msg or dmesg?
>
> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> ask your HW vendor to add that in their BIOS, or use my patchset.

please try the attached patchset

please get x86.git then use quilt apply the patch

http://people.redhat.com/mingo/x86.git/README

YH

patches_01312008_mm_bus_numa.tar.bz2
Description: BZip2 compressed data
Re: Purpose of numa_node?
On Jan 31, 2008 1:35 PM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Yinghai Lu wrote:
> > On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> >
> >> It works fine on regular machines such as dual opterons. However, I
> >> noticed recently that it was wrong on some quad-opteron machines (see
> >> http://marc.info/?l=linux-pci=11907248538=2) because something
> >> is not initialized in the right order. But I haven't tested 2.6.24 on
> >> this hardware yet, and I don't know if things have changed regarding this.
> >>
> >
> > that will depend if you dsdt have _PXM for your pci root bus.
> > otherwise you will get all -1
> >
>
> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> I should get 1 for some devices. And if I unplug/replug a device using
> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> like the code is there but things are initialized in the wrong order.

do you have
...
bus 00 -> pxm 0 -> node 0
...
bus 40 -> pxm 1 -> node 1
...
bus 80 -> pxm 1 -> node 1

in your boot msg or dmesg?

if not, your dsdt doesn't have _PXM for pci root bus. or you need to
ask your HW vendor to add that in their BIOS, or use my patchset.

YH
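
One way to check a saved boot log for the bus-to-node lines asked about above is a small parser. This is a sketch only: it assumes the kernel prints the mapping exactly in the `bus XX -> pxm N -> node M` form quoted in this message (the surrounding log format is not shown in the thread), and the name `parse_bus_nodes` is hypothetical.

```python
import re

# Assumed line shape, taken from the message above: "bus 40 -> pxm 1 -> node 1"
BUS_NODE_RE = re.compile(
    r"bus\s+([0-9a-fA-F]{2})\s*->\s*pxm\s+(\d+)\s*->\s*node\s+(-?\d+)"
)


def parse_bus_nodes(dmesg_text):
    """Return {bus: (pxm, node)} for every matching line in the log text."""
    result = {}
    for match in BUS_NODE_RE.finditer(dmesg_text):
        bus, pxm, node = match.groups()
        result[bus.lower()] = (int(pxm), int(node))
    return result
```

An empty result for a log saved with `dmesg > boot.log` would suggest the case described above, where the DSDT carries no _PXM for the PCI root buses.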
Re: Purpose of numa_node?
Yinghai Lu wrote:
> On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
>
>> It works fine on regular machines such as dual opterons. However, I
>> noticed recently that it was wrong on some quad-opteron machines (see
>> http://marc.info/?l=linux-pci=11907248538=2) because something
>> is not initialized in the right order. But I haven't tested 2.6.24 on
>> this hardware yet, and I don't know if things have changed regarding this.
>>
>
> that will depend if you dsdt have _PXM for your pci root bus.
> otherwise you will get all -1
>

Have a look at the above link. I don't get -1. I get 0 everywhere, while
I should get 1 for some devices. And if I unplug/replug a device using
fakephp, numa_node becomes correct (1 instead of 0). This just looks
like the code is there but things are initialized in the wrong order.

Brice
Re: Purpose of numa_node?
On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Paul Mundt wrote:
> > On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> >
> >> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> >> noticed that the numa_node field in struct device isn't actually used
> >> anywhere. We just have a couple dozen lines of code to conditionally
> >> create a sysfs file that will always return -1. Is anyone even working on
> >> code to actually use this field? I think it's a good piece of information
> >> to keep track of, so I'm not suggesting we remove it, but I want to make
> >> sure I'm not stepping on toes or duplicating effort if I try to make it
> >> useful.
> >>
> > It's manipulated with accessors. If you look at the users of
> > dev_to_node()/set_dev_node() you can see where it's being used. It's
> > primarily used in allocation paths for node locality, and the existing
> > set_dev_node() callsites are places where node locality information
> > already exists (ie, which node a given controller sits on). You can see
> > this in places like PCI (pcibus_to_node()) and USB, with node allocation
> > hints used in places like the dmapool and skb alloc paths.
> >
> > The in-kernel use looks perfectly sane in that regard, though I'm not
> > sure what the point of exporting this as a RO attribute to userspace is.
> > Presumably someone has a tool somewhere that cares about this.
> >
>
> I added the numa_node sysfs attribute in the beginning to make it easier
> to bind processes near some devices. So yes I have some user-space tool
> using it. It is much easier to use than the local_cpus field on large
> machines, especially when you use the libnuma interface to bind things,
> since you don't have to translate numa_node from/to cpumasks.
>
> It works fine on regular machines such as dual opterons. However, I
> noticed recently that it was wrong on some quad-opteron machines (see
> http://marc.info/?l=linux-pci=11907248538=2) because something
> is not initialized in the right order. But I haven't tested 2.6.24 on
> this hardware yet, and I don't know if things have changed regarding this.

that will depend if you dsdt have _PXM for your pci root bus.
otherwise you will get all -1

I have a patchset locally that it call bus_numa, can get that from pci
conf space for AMD64 based machine. so you can use that for AMD64
system without _PXM for pci root bus or even with acpi=off.

let me know if you want test it.

YH
Re: Purpose of numa_node?
Paul Mundt wrote:
> On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
>
>> While pondering ways to optimize I/O and swapping on large NUMA machines, I
>> noticed that the numa_node field in struct device isn't actually used
>> anywhere. We just have a couple dozen lines of code to conditionally
>> create a sysfs file that will always return -1. Is anyone even working on
>> code to actually use this field? I think it's a good piece of information
>> to keep track of, so I'm not suggesting we remove it, but I want to make
>> sure I'm not stepping on toes or duplicating effort if I try to make it
>> useful.
>>
> It's manipulated with accessors. If you look at the users of
> dev_to_node()/set_dev_node() you can see where it's being used. It's
> primarily used in allocation paths for node locality, and the existing
> set_dev_node() callsites are places where node locality information
> already exists (ie, which node a given controller sits on). You can see
> this in places like PCI (pcibus_to_node()) and USB, with node allocation
> hints used in places like the dmapool and skb alloc paths.
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.
>

I added the numa_node sysfs attribute in the beginning to make it easier
to bind processes near some devices. So yes I have some user-space tool
using it. It is much easier to use than the local_cpus field on large
machines, especially when you use the libnuma interface to bind things,
since you don't have to translate numa_node from/to cpumasks.

It works fine on regular machines such as dual opterons. However, I
noticed recently that it was wrong on some quad-opteron machines (see
http://marc.info/?l=linux-pci=11907248538=2) because something
is not initialized in the right order. But I haven't tested 2.6.24 on
this hardware yet, and I don't know if things have changed regarding this.

Brice
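
Binding a process near a device, as described in the message above, could look roughly like this from user space. This is a sketch under stated assumptions, not the tool mentioned in the thread: it swaps libnuma for Python's `os.sched_setaffinity` (so it pins CPUs only and sets no memory policy), it assumes the node's CPUs are listed in `/sys/devices/system/node/nodeN/cpulist`, and the helper names are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_near_device(device_dir, node_root="/sys/devices/system/node"):
    """Return the CPUs of the device's NUMA node, or None if the node is -1."""
    with open(os.path.join(device_dir, "numa_node")) as f:
        node = int(f.read().strip())
    if node < 0:
        return None  # the kernel has no locality information for this device
    with open(os.path.join(node_root, f"node{node}", "cpulist")) as f:
        return parse_cpulist(f.read())


def bind_near_device(device_dir, pid=0):
    """Pin a process (pid 0 means the caller) to the CPUs local to the device."""
    cpus = cpus_near_device(device_dir)
    if cpus:
        os.sched_setaffinity(pid, cpus)
    return cpus
```

With libnuma the node number read from `numa_node` could be handed to the library directly, which is the convenience the message points out; the cpulist lookup here is only needed because this sketch works in terms of CPU sets.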
Re: Purpose of numa_node?
Paul Mundt <[EMAIL PROTECTED]> writes:
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.

The idea was to allow e.g. NUMA aware irqbalanced that directs
the interrupts on the same node as the device is connected to.
Don't know if it was ever actually implemented.

-Andi
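
The interrupt-steering idea above could be sketched as follows. This is not irqbalance's implementation; it only illustrates turning a set of node-local CPUs into the hexadecimal bitmask format accepted by `/proc/irq/<irq>/smp_affinity`. The function names are hypothetical, and writing the file requires root.

```python
def cpus_to_irq_mask(cpus):
    """Encode CPU ids as comma-separated 32-bit hex words, highest word first."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    words = []
    while True:
        words.append(f"{bits & 0xFFFFFFFF:08x}")
        bits >>= 32
        if bits == 0:
            break
    return ",".join(reversed(words))


def steer_irq(irq, cpus):
    """Direct an IRQ to the given CPUs (hypothetical helper; needs root)."""
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpus_to_irq_mask(cpus))
```

The CPU set would come from the node reported in the device's `numa_node` attribute, so the scheme is only as reliable as that attribute.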
Re: Purpose of numa_node?
On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> noticed that the numa_node field in struct device isn't actually used
> anywhere. We just have a couple dozen lines of code to conditionally
> create a sysfs file that will always return -1. Is anyone even working on
> code to actually use this field? I think it's a good piece of information
> to keep track of, so I'm not suggesting we remove it, but I want to make
> sure I'm not stepping on toes or duplicating effort if I try to make it
> useful.
>
It's manipulated with accessors. If you look at the users of
dev_to_node()/set_dev_node() you can see where it's being used. It's
primarily used in allocation paths for node locality, and the existing
set_dev_node() callsites are places where node locality information
already exists (ie, which node a given controller sits on). You can see
this in places like PCI (pcibus_to_node()) and USB, with node allocation
hints used in places like the dmapool and skb alloc paths.

The in-kernel use looks perfectly sane in that regard, though I'm not
sure what the point of exporting this as a RO attribute to userspace is.
Presumably someone has a tool somewhere that cares about this.
Purpose of numa_node?
While pondering ways to optimize I/O and swapping on large NUMA machines, I
noticed that the numa_node field in struct device isn't actually used
anywhere. We just have a couple dozen lines of code to conditionally
create a sysfs file that will always return -1. Is anyone even working on
code to actually use this field? I think it's a good piece of information
to keep track of, so I'm not suggesting we remove it, but I want to make
sure I'm not stepping on toes or duplicating effort if I try to make it
useful.

-- Chris