Re: Purpose of numa_node?

2008-02-20 Thread Yinghai Lu
On Wed, Feb 13, 2008 at 10:52 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
>  /sys/devices/pci:40/:40:0f.0/numa_node 1
>  /sys/devices/pci:40/:40:10.0/numa_node 1
>  /sys/devices/pci:40/:40:11.0/numa_node 1
>  /sys/devices/pci:40/:40:12.0/numa_node 1
>  /sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
>  /sys/devices/pci:40/:40:13.0/numa_node 1
>
>  The 5 last lines above would report 0 instead of 1 with an older kernel.
>  Everything looks correct now (:40 is the second PCIe bus and it is
>  attached to socket #1).
>
>  Thanks a lot, Yinghai! Are you planning to merge these patches in the
>  near future? 2.6.26?
>
Ingo put them in x86.git#testing.

Please check
http://people.redhat.com/mingo/x86.git/README
for how to get that tree.

YH
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Purpose of numa_node?

2008-02-13 Thread Yinghai Lu
On Feb 13, 2008 10:52 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Yinghai Lu wrote:
> >>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> >>> I should get 1 for some devices. And if I unplug/replug a device using
> >>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> >>> like the code is there but things are initialized in the wrong order.
> >>>
> >> do you have
> >> ...
> >> bus 00 -> pxm 0 -> node 0
> >> ...
> >> bus 40 -> pxm 1 -> node 1
> >> ...
> >> bus 80 -> pxm 1 -> node 1
> >>
> >> in your boot msg or dmesg?
> >>
> >> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> >> ask your HW vendor to add that in their BIOS, or use my patchset.
> >>
> >
> > please try the attached patchset
> >
> > please get x86.git then use quilt apply the patch
> >
> > http://people.redhat.com/mingo/x86.git/README
> >
>
> I finally managed to test this and it seems to work. I now get the
> following numa_node attributes:
> /sys/devices/pci:00/:00:01.0/numa_node 0
> /sys/devices/pci:00/:00:07.0/numa_node 0
> /sys/devices/pci:00/:00:07.0/:38:0d.0/numa_node 0
> /sys/devices/pci:00/:00:08.0/numa_node 0
> /sys/devices/pci:00/:00:08.1/numa_node 0
> /sys/devices/pci:00/:00:08.2/numa_node 0
> /sys/devices/pci:00/:00:09.0/numa_node 0
> /sys/devices/pci:00/:00:09.1/numa_node 0
> /sys/devices/pci:00/:00:09.2/numa_node 0
> /sys/devices/pci:00/:00:0a.0/numa_node 0
> /sys/devices/pci:00/:00:0a.0/:22:00.0/numa_node 0
> /sys/devices/pci:00/:00:0b.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/:0c:00.0/numa_node 0
> /sys/devices/pci:00/:00:0c.0/:0c:00.0/:0d:00.0/numa_node 0
> /sys/devices/pci:00/:00:0d.0/numa_node 0
> /sys/devices/pci:00/:00:0d.0/:01:00.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/:17:00.0/numa_node 0
> /sys/devices/pci:00/:00:0e.0/:17:00.0/:18:00.0/numa_node 0
> /sys/devices/pci:00/:00:18.0/numa_node 0
> /sys/devices/pci:00/:00:18.1/numa_node 0
> /sys/devices/pci:00/:00:18.2/numa_node 0
> /sys/devices/pci:00/:00:18.3/numa_node 0
> /sys/devices/pci:00/:00:19.0/numa_node 0
> /sys/devices/pci:00/:00:19.1/numa_node 0
> /sys/devices/pci:00/:00:19.2/numa_node 0
> /sys/devices/pci:00/:00:19.3/numa_node 0
> /sys/devices/pci:00/:00:1a.0/numa_node 0
> /sys/devices/pci:00/:00:1a.1/numa_node 0
> /sys/devices/pci:00/:00:1a.2/numa_node 0
> /sys/devices/pci:00/:00:1a.3/numa_node 0
> /sys/devices/pci:00/:00:1b.0/numa_node 0
> /sys/devices/pci:00/:00:1b.1/numa_node 0
> /sys/devices/pci:00/:00:1b.2/numa_node 0
> /sys/devices/pci:00/:00:1b.3/numa_node 0
> /sys/devices/pci:40/:40:0f.0/numa_node 1
> /sys/devices/pci:40/:40:10.0/numa_node 1
> /sys/devices/pci:40/:40:11.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/numa_node 1
> /sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
> /sys/devices/pci:40/:40:13.0/numa_node 1
>
> The 5 last lines above would report 0 instead of 1 with an older kernel.
> Everything looks correct now (:40 is the second PCIe bus and it is
> attached to socket #1).
>
> Thanks a lot, Yinghai! Are you planning to merge these patches in the
> near future? 2.6.26?

Andi thought they are too hardware-specific...
They have stayed in -mm for a while.

This patchset is only useful
when you have several HT chains and the BIOS doesn't provide pxm->node in the DSDT,
or doesn't allocate I/O resources to some add-on cards.

YH


Re: Purpose of numa_node?

2008-02-13 Thread Brice Goglin
Yinghai Lu wrote:
>>> Have a look at the above link. I don't get -1. I get 0 everywhere, while
>>> I should get 1 for some devices. And if I unplug/replug a device using
>>> fakephp, numa_node becomes correct (1 instead of 0). This just looks
>>> like the code is there but things are initialized in the wrong order.
>>>   
>> do you have
>> ...
>> bus 00 -> pxm 0 -> node 0
>> ...
>> bus 40 -> pxm 1 -> node 1
>> ...
>> bus 80 -> pxm 1 -> node 1
>>
>> in your boot msg or dmesg?
>>
>> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
>> ask your HW vendor to add that in their BIOS, or use my patchset.
>> 
>
> please try the attached patchset
>
> please get x86.git then use quilt apply the patch
>
> http://people.redhat.com/mingo/x86.git/README
>   

I finally managed to test this and it seems to work. I now get the
following numa_node attributes:
/sys/devices/pci:00/:00:01.0/numa_node 0
/sys/devices/pci:00/:00:07.0/numa_node 0
/sys/devices/pci:00/:00:07.0/:38:0d.0/numa_node 0
/sys/devices/pci:00/:00:08.0/numa_node 0
/sys/devices/pci:00/:00:08.1/numa_node 0
/sys/devices/pci:00/:00:08.2/numa_node 0
/sys/devices/pci:00/:00:09.0/numa_node 0
/sys/devices/pci:00/:00:09.1/numa_node 0
/sys/devices/pci:00/:00:09.2/numa_node 0
/sys/devices/pci:00/:00:0a.0/numa_node 0
/sys/devices/pci:00/:00:0a.0/:22:00.0/numa_node 0
/sys/devices/pci:00/:00:0b.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/:0c:00.0/numa_node 0
/sys/devices/pci:00/:00:0c.0/:0c:00.0/:0d:00.0/numa_node 0
/sys/devices/pci:00/:00:0d.0/numa_node 0
/sys/devices/pci:00/:00:0d.0/:01:00.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/:17:00.0/numa_node 0
/sys/devices/pci:00/:00:0e.0/:17:00.0/:18:00.0/numa_node 0
/sys/devices/pci:00/:00:18.0/numa_node 0
/sys/devices/pci:00/:00:18.1/numa_node 0
/sys/devices/pci:00/:00:18.2/numa_node 0
/sys/devices/pci:00/:00:18.3/numa_node 0
/sys/devices/pci:00/:00:19.0/numa_node 0
/sys/devices/pci:00/:00:19.1/numa_node 0
/sys/devices/pci:00/:00:19.2/numa_node 0
/sys/devices/pci:00/:00:19.3/numa_node 0
/sys/devices/pci:00/:00:1a.0/numa_node 0
/sys/devices/pci:00/:00:1a.1/numa_node 0
/sys/devices/pci:00/:00:1a.2/numa_node 0
/sys/devices/pci:00/:00:1a.3/numa_node 0
/sys/devices/pci:00/:00:1b.0/numa_node 0
/sys/devices/pci:00/:00:1b.1/numa_node 0
/sys/devices/pci:00/:00:1b.2/numa_node 0
/sys/devices/pci:00/:00:1b.3/numa_node 0
/sys/devices/pci:40/:40:0f.0/numa_node 1
/sys/devices/pci:40/:40:10.0/numa_node 1
/sys/devices/pci:40/:40:11.0/numa_node 1
/sys/devices/pci:40/:40:12.0/numa_node 1
/sys/devices/pci:40/:40:12.0/:51:00.0/numa_node 1
/sys/devices/pci:40/:40:13.0/numa_node 1

The 5 last lines above would report 0 instead of 1 with an older kernel.
Everything looks correct now (:40 is the second PCIe bus and it is
attached to socket #1).
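
A listing like the one above can be regenerated by walking sysfs. The following is a minimal Python sketch, not the command used in this thread (which is not shown). The function name list_numa_nodes and its sysfs_root parameter are made up for illustration, and it assumes the numa_node files sit below the /sys/devices/pci* root-bus directories, as in the paths above.

```python
import glob
import os


def list_numa_nodes(sysfs_root="/sys/devices"):
    """Return (path, node) pairs for every numa_node file below the pci* roots.

    sysfs_root is a hypothetical parameter so the walk can be pointed at a
    test directory instead of the real sysfs.
    """
    results = []
    for root_bus in sorted(glob.glob(os.path.join(sysfs_root, "pci*"))):
        # os.walk does not follow symlinks by default, which avoids loops
        # through the many symlinks found in sysfs.
        for dirpath, dirnames, filenames in os.walk(root_bus):
            dirnames.sort()  # visit child devices in a stable order
            if "numa_node" in filenames:
                path = os.path.join(dirpath, "numa_node")
                with open(path) as f:
                    results.append((path, int(f.read().strip())))
    return results


if __name__ == "__main__":
    for path, node in list_numa_nodes():
        print(path, node)
```

A value of -1 in the output means the kernel has no locality information for that device.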

Thanks a lot, Yinghai! Are you planning to merge these patches in the
near future? 2.6.26?

Brice

PS: I saved the corresponding dmesg. If you want to look at it, please
let me know.



Re: Purpose of numa_node?

2008-01-31 Thread Yinghai Lu
On Jan 31, 2008 1:42 PM, Yinghai Lu <[EMAIL PROTECTED]> wrote:
>
> On Jan 31, 2008 1:35 PM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> > Yinghai Lu wrote:
> > > On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> > >
> > >> It works fine on regular machines such as dual opterons. However, I
> > >> noticed recently that it was wrong on some quad-opteron machines (see
> > >> http://marc.info/?l=linux-pci&m=11907248538&w=2) because something
> > >> is not initialized in the right order. But I haven't tested 2.6.24 on
> > >> this hardware yet, and I don't know if things have changed regarding 
> > >> this.
> > >>
> > >
> > > that will depend if you dsdt have _PXM for your pci root bus.
> > > otherwise you will get all -1
> > >
> >
> > Have a look at the above link. I don't get -1. I get 0 everywhere, while
> > I should get 1 for some devices. And if I unplug/replug a device using
> > fakephp, numa_node becomes correct (1 instead of 0). This just looks
> > like the code is there but things are initialized in the wrong order.
>
> do you have
> ...
> bus 00 -> pxm 0 -> node 0
> ...
> bus 40 -> pxm 1 -> node 1
> ...
> bus 80 -> pxm 1 -> node 1
>
> in your boot msg or dmesg?
>
> if not, your dsdt doesn't have _PXM for pci root bus. or you need to
> ask your HW vendor to add that in their BIOS, or use my patchset.

Please try the attached patchset.

Please get x86.git, then use quilt to apply the patches.

http://people.redhat.com/mingo/x86.git/README

YH


patches_01312008_mm_bus_numa.tar.bz2
Description: BZip2 compressed data


Re: Purpose of numa_node?

2008-01-31 Thread Yinghai Lu
On Jan 31, 2008 1:35 PM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Yinghai Lu wrote:
> > On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> >
> >> It works fine on regular machines such as dual opterons. However, I
> >> noticed recently that it was wrong on some quad-opteron machines (see
> >> http://marc.info/?l=linux-pci&m=11907248538&w=2) because something
> >> is not initialized in the right order. But I haven't tested 2.6.24 on
> >> this hardware yet, and I don't know if things have changed regarding this.
> >>
> >
> > that will depend if you dsdt have _PXM for your pci root bus.
> > otherwise you will get all -1
> >
>
> Have a look at the above link. I don't get -1. I get 0 everywhere, while
> I should get 1 for some devices. And if I unplug/replug a device using
> fakephp, numa_node becomes correct (1 instead of 0). This just looks
> like the code is there but things are initialized in the wrong order.

do you have
...
bus 00 -> pxm 0 -> node 0
...
bus 40 -> pxm 1 -> node 1
...
bus 80 -> pxm 1 -> node 1

in your boot messages or dmesg?

If not, your DSDT doesn't have _PXM for the PCI root bus, so you need to
either ask your HW vendor to add it to their BIOS or use my patchset.

YH
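
The check described above can be scripted. This is a minimal Python sketch that assumes the kernel prints lines shaped exactly like the "bus 00 -> pxm 0 -> node 0" examples quoted in this message; the real message may carry extra prefix text, which the search tolerates. The names BUS_LINE and parse_bus_nodes are hypothetical.

```python
import re

# Pattern modelled on the lines quoted in the thread; the bus number is hex.
BUS_LINE = re.compile(
    r"bus\s+([0-9a-fA-F]{2})\s*->\s*pxm\s+(\d+)\s*->\s*node\s+(-?\d+)"
)


def parse_bus_nodes(dmesg_text):
    """Map each PCI root bus number to the NUMA node reported for it."""
    mapping = {}
    for line in dmesg_text.splitlines():
        match = BUS_LINE.search(line)
        if match:
            mapping[int(match.group(1), 16)] = int(match.group(3))
    return mapping


if __name__ == "__main__":
    sample = (
        "bus 00 -> pxm 0 -> node 0\n"
        "bus 40 -> pxm 1 -> node 1\n"
        "bus 80 -> pxm 1 -> node 1\n"
    )
    print(parse_bus_nodes(sample))  # {0: 0, 64: 1, 128: 1}
```

An empty result for a real dmesg capture would correspond to the "if not" case: no _PXM information was found for the root buses.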


Re: Purpose of numa_node?

2008-01-31 Thread Brice Goglin
Yinghai Lu wrote:
> On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
>   
>> It works fine on regular machines such as dual opterons. However, I
>> noticed recently that it was wrong on some quad-opteron machines (see
>> http://marc.info/?l=linux-pci&m=11907248538&w=2) because something
>> is not initialized in the right order. But I haven't tested 2.6.24 on
>> this hardware yet, and I don't know if things have changed regarding this.
>> 
>
> that will depend if you dsdt have _PXM for your pci root bus.
> otherwise you will get all -1
>   

Have a look at the above link. I don't get -1. I get 0 everywhere, while
I should get 1 for some devices. And if I unplug/replug a device using
fakephp, numa_node becomes correct (1 instead of 0). This just looks
like the code is there but things are initialized in the wrong order.

Brice




Re: Purpose of numa_node?

2008-01-31 Thread Yinghai Lu
On Jan 31, 2008 5:42 AM, Brice Goglin <[EMAIL PROTECTED]> wrote:
> Paul Mundt wrote:
> > On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> >
> >> While pondering ways to optimize I/O and swapping on large NUMA machines, I
> >> noticed that the numa_node field in struct device isn't actually used
> >> anywhere. We just have a couple dozen lines of code to conditionally
> >>  create a sysfs file that will always return -1.  Is anyone even working on
> >> code to actually use this field?  I think it's a good piece of information
> >> to keep track of, so I'm not suggesting we remove it, but I want to make
> >> sure I'm not stepping on toes or duplicating effort if I try to make it
> >> useful.
> >>
> > It's manipulated with accessors. If you look at the users of
> > dev_to_node()/set_dev_node() you can see where it's being used. It's
> > primarily used in allocation paths for node locality, and the existing
> > set_dev_node() callsites are places where node locality information
> > already exists (ie, which node a given controller sits on). You can see
> > this in places like PCI (pcibus_to_node()) and USB, with node allocation
> > hints used in places like the dmapool and skb alloc paths.
> >
> > The in-kernel use looks perfectly sane in that regard, though I'm not
> > sure what the point of exporting this as a RO attribute to userspace is.
> > Presumably someone has a tool somewhere that cares about this.
> >
>
> I added the numa_node sysfs attribute in the beginning to make it easier
> to bind processes near some devices. So yes I have some user-space tool
> using it. It is much easier to use than the local_cpus field on large
> machines, especially when you use the libnuma interface to bind things,
> since you don't have to translate numa_node from/to cpumasks.
>
> It works fine on regular machines such as dual opterons. However, I
> noticed recently that it was wrong on some quad-opteron machines (see
> http://marc.info/?l=linux-pci&m=11907248538&w=2) because something
> is not initialized in the right order. But I haven't tested 2.6.24 on
> this hardware yet, and I don't know if things have changed regarding this.

That depends on whether your DSDT has _PXM for your PCI root bus.
Otherwise you will get -1 everywhere.

I have a local patchset, called bus_numa, that can get that information from PCI
config space on AMD64-based machines,
so you can use it on AMD64 systems without _PXM for the PCI root bus, or
even with acpi=off.

Let me know if you want to test it.

YH


Re: Purpose of numa_node?

2008-01-31 Thread Brice Goglin

Paul Mundt wrote:
> On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
>
>> While pondering ways to optimize I/O and swapping on large NUMA machines, I
>> noticed that the numa_node field in struct device isn't actually used
>> anywhere. We just have a couple dozen lines of code to conditionally
>>  create a sysfs file that will always return -1.  Is anyone even working on
>> code to actually use this field?  I think it's a good piece of information
>> to keep track of, so I'm not suggesting we remove it, but I want to make
>> sure I'm not stepping on toes or duplicating effort if I try to make it
>> useful.
>>
> It's manipulated with accessors. If you look at the users of
> dev_to_node()/set_dev_node() you can see where it's being used. It's
> primarily used in allocation paths for node locality, and the existing
> set_dev_node() callsites are places where node locality information
> already exists (ie, which node a given controller sits on). You can see
> this in places like PCI (pcibus_to_node()) and USB, with node allocation
> hints used in places like the dmapool and skb alloc paths.
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.
>


I added the numa_node sysfs attribute in the beginning to make it easier 
to bind processes near some devices. So yes I have some user-space tool 
using it. It is much easier to use than the local_cpus field on large 
machines, especially when you use the libnuma interface to bind things, 
since you don't have to translate numa_node from/to cpumasks.
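
As a rough illustration of that use case, here is a minimal Python sketch that reads a device's numa_node attribute and pins the calling process to the CPUs of that node. It is not the tool mentioned above. Python's standard library has no libnuma binding, so the sketch reads the node's CPU list from /sys/devices/system/node/nodeN/cpulist and calls os.sched_setaffinity() as a stand-in; with libnuma the node number could be passed to numa_run_on_node() directly. It only sets CPU affinity, not memory policy. All function names and the sysfs_root parameter are hypothetical.

```python
import os


def read_numa_node(device_dir):
    """Return the device's NUMA node, or None when sysfs reports -1 (unknown)."""
    with open(os.path.join(device_dir, "numa_node")) as f:
        node = int(f.read().strip())
    return None if node < 0 else node


def parse_cpulist(text):
    """Parse a sysfs CPU list such as '0-3,8' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node, sysfs_root="/sys/devices/system/node"):
    """Return the CPUs that belong to the given NUMA node."""
    with open(os.path.join(sysfs_root, "node%d" % node, "cpulist")) as f:
        return parse_cpulist(f.read())


def bind_near_device(device_dir):
    """Pin the current process to the CPUs of the device's node, if known."""
    node = read_numa_node(device_dir)
    if node is None:
        return None  # no locality information, leave the affinity alone
    cpus = cpus_of_node(node)
    if cpus:
        os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return node
```

Calling bind_near_device() with the sysfs directory of a PCI device would return the node it bound to, or None when the attribute reads -1.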


It works fine on regular machines such as dual opterons. However, I 
noticed recently that it was wrong on some quad-opteron machines (see 
http://marc.info/?l=linux-pci&m=11907248538&w=2) because something
is not initialized in the right order. But I haven't tested 2.6.24 on 
this hardware yet, and I don't know if things have changed regarding this.


Brice



Re: Purpose of numa_node?

2008-01-31 Thread Andi Kleen
Paul Mundt <[EMAIL PROTECTED]> writes:
>
> The in-kernel use looks perfectly sane in that regard, though I'm not
> sure what the point of exporting this as a RO attribute to userspace is.
> Presumably someone has a tool somewhere that cares about this.

The idea was to allow e.g. NUMA aware irqbalanced that directs the interrupts
on the same node as the device is connected to. Don't know if it was
ever actually implemented.
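
To make the idea concrete, here is a minimal Python sketch of steering one interrupt to the CPUs of a device's node. It is not how irqbalance works; it only shows the mechanism, assuming the CPU set of the node is already known and that the affinity is set by writing a hex CPU mask to /proc/irq/<irq>/smp_affinity (which needs root). The function names and the proc_root parameter are hypothetical.

```python
import os


def cpus_to_irq_mask(cpus):
    """Format a set of CPU numbers as a hex mask in comma-separated 32-bit words."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    words.reverse()  # most significant word first
    # The leading word is printed without padding, the rest as 8 hex digits.
    return ",".join(
        [format(words[0], "x")] + [format(word, "08x") for word in words[1:]]
    )


def set_irq_affinity(irq, cpus, proc_root="/proc/irq"):
    """Write the CPU mask for one IRQ; proc_root is overridable for testing."""
    path = os.path.join(proc_root, str(irq), "smp_affinity")
    with open(path, "w") as f:
        f.write(cpus_to_irq_mask(cpus))
```

For example, cpus_to_irq_mask({0, 1, 2, 3}) gives "f", and a set containing only CPU 32 gives "1,00000000".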

-Andi
 


Re: Purpose of numa_node?

2008-01-31 Thread Paul Mundt
On Wed, Jan 30, 2008 at 07:48:13PM -0500, Chris Snook wrote:
> While pondering ways to optimize I/O and swapping on large NUMA machines, I 
> noticed that the numa_node field in struct device isn't actually used 
> anywhere. We just have a couple dozen lines of code to conditionally 
>  create a sysfs file that will always return -1.  Is anyone even working on 
> code to actually use this field?  I think it's a good piece of information 
> to keep track of, so I'm not suggesting we remove it, but I want to make 
> sure I'm not stepping on toes or duplicating effort if I try to make it 
> useful.
> 
It's manipulated with accessors. If you look at the users of
dev_to_node()/set_dev_node() you can see where it's being used. It's
primarily used in allocation paths for node locality, and the existing
set_dev_node() callsites are places where node locality information
already exists (ie, which node a given controller sits on). You can see
this in places like PCI (pcibus_to_node()) and USB, with node allocation
hints used in places like the dmapool and skb alloc paths.

The in-kernel use looks perfectly sane in that regard, though I'm not
sure what the point of exporting this as a RO attribute to userspace is.
Presumably someone has a tool somewhere that cares about this.


Purpose of numa_node?

2008-01-30 Thread Chris Snook
While pondering ways to optimize I/O and swapping on large NUMA machines, I 
noticed that the numa_node field in struct device isn't actually used anywhere. 
 We just have a couple dozen lines of code to conditionally create a sysfs file 
that will always return -1.  Is anyone even working on code to actually use this 
field?  I think it's a good piece of information to keep track of, so I'm not 
suggesting we remove it, but I want to make sure I'm not stepping on toes or 
duplicating effort if I try to make it useful.


-- Chris

