Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)

2014-04-11 Thread Dave Hansen
On 04/11/2014 03:13 PM, David Rientjes wrote:
> What additional information, in your opinion, can we export to assist 
> userspace in making this determination that $address is on $nid?

In the case of overlapping nodes, the only place we actually have *all*
of the information is in the 'struct page' itself.  Ulrich's original
patch obviously _works_, and especially if it's an interface only for
debugging purposes, it seems silly to spend virtually any time
optimizing it.  Keeping it close to pagemap's implementation lessens the
likelihood that we'll screw things up.

I assume that the original problem was trying to figure out what NUMA
affinity a given range of pages mapped in to a _process_ has, and that
/proc/$pid/numa_maps is too coarse.  Is that right, Ulrich?
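
For readers following along, here is a rough sketch of the first half of that lookup: turning a virtual address into a PFN via /proc/$pid/pagemap. The bit layout (PFN in bits 0-54, "present" in bit 63) is the one described in Documentation/vm/pagemap.txt; the helper names are made up for illustration and are not part of any patch in this thread.

```c
#include <stdint.h>

/* Layout per Documentation/vm/pagemap.txt: bits 0-54 hold the PFN when
 * bit 63 (page present) is set. */
#define PM_PFN_MASK	((UINT64_C(1) << 55) - 1)
#define PM_PRESENT	(UINT64_C(1) << 63)

/* Byte offset of the 64-bit pagemap entry that describes vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Returns 1 and stores the PFN if the entry describes a present page,
 * otherwise returns 0 and leaves *pfn untouched. */
static int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return 0;
	*pfn = entry & PM_PFN_MASK;
	return 1;
}
```

A caller would pread() eight bytes from pagemap at pagemap_offset() and pass them to pagemap_entry_to_pfn(); the missing piece this thread is about is the step from that PFN to a node id.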

If you want to go the route of calculating and exporting the physical
ranges that nodes uniquely own, you've *GOT* to handle the overlaps.
Naoya had the right idea.  His idea seemed to get shot down with the
misunderstanding that node pfn ranges never overlap.

The only other question is how many of these kpage* things we're going
to put in here until we've exported the entire contents of 'struct page'
5 times over. :)

We could add some tracepoints to the pagemap to dump lots of information
in to a trace buffer that could be later read back.  If you want
detailed information (NUMA for instance), you turn on the tracepoints and
read pagemap for the range you care about.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)

2014-04-11 Thread David Rientjes
On Fri, 11 Apr 2014, Dave Hansen wrote:

> > So?  Who cares if there are non-addressable holes in part of the span?  
> > Ulrich, correct me if I'm wrong, but it seems you're looking for just a 
> > address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually 
> > expecting that there are no holes in a node for things like acpi or I/O or 
> > reserved memory.
> ...
> > I think trying to represent holes and handling different memory models and 
> > hotplug in special ways is complete overkill.
> 
> This isn't just about memory hotplug or different memory models.  There
> are systems out there today, in production, that have layouts like this:
> 
> |--Node0-|
>  |--Node1-|
> 
> and this:
> 
> |--Node0-|
>  |-Node1-|
> 

What additional information, in your opinion, can we export to assist 
userspace in making this determination that $address is on $nid?


Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)

2014-04-11 Thread Dave Hansen
On 04/11/2014 04:00 AM, David Rientjes wrote:
> On Thu, 10 Apr 2014, Naoya Horiguchi wrote:
>> > Yes, that's right, but it seems to me that just node_start_pfn and 
>> > node_end_pfn
>> > is not enough because there can be holes (without any page struct backed) 
>> > inside
>> > [node_start_pfn, node_end_pfn), and it's not aware of memory hotplug.
>> > 
> So?  Who cares if there are non-addressable holes in part of the span?  
> Ulrich, correct me if I'm wrong, but it seems you're looking for just a 
> address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually 
> expecting that there are no holes in a node for things like acpi or I/O or 
> reserved memory.
...
> I think trying to represent holes and handling different memory models and 
> hotplug in special ways is complete overkill.

This isn't just about memory hotplug or different memory models.  There
are systems out there today, in production, that have layouts like this:

|--Node0-|
 |--Node1-|

and this:

|--Node0-|
 |-Node1-|

For those systems, this interface has no meaning.  Given a page in the
shared-span areas, this interface provides no way to figure out which
node it is in.

If you want a non-portable hack that just works on one system, I'd
suggest parsing the existing firmware tables.
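
To make the ambiguity concrete, here is a small userspace sketch. It assumes each node is described only by a [start, end) span, which is what the proposed start_phys_addr/end_phys_addr files would give you; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct node_span {
	int nid;
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive, like node_end_pfn() */
};

/* Counts how many spans contain addr.  The first matching node id is
 * stored in *nid, but it is only meaningful when the result is 1. */
static int spans_containing(const struct node_span *spans, size_t n,
			    uint64_t addr, int *nid)
{
	int matches = 0;

	for (size_t i = 0; i < n; i++) {
		if (addr >= spans[i].start && addr < spans[i].end) {
			if (matches == 0)
				*nid = spans[i].nid;
			matches++;
		}
	}
	return matches;
}
```

With two overlapping spans, any address in the shared region returns 2, and nothing in the span data says which node actually owns the page.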


Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)

2014-04-11 Thread David Rientjes
On Thu, 10 Apr 2014, Naoya Horiguchi wrote:

> Yes, that's right, but it seems to me that just node_start_pfn and 
> node_end_pfn
> is not enough because there can be holes (without any page struct backed) 
> inside
> [node_start_pfn, node_end_pfn), and it's not aware of memory hotplug.
> 

So?  Who cares if there are non-addressable holes in part of the span?  
Ulrich, correct me if I'm wrong, but it seems you're looking for just a 
address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually 
expecting that there are no holes in a node for things like acpi or I/O or 
reserved memory.

The node spans a contiguous length of memory; there's no consideration for 
addresses that aren't actually backed by physical memory.  We are just 
representing proximity domains that have a base address and length in the 
acpi world.

Memory hotplug is already taken care of because onlining and offlining 
nodes already add these node classes and {start,end}_phys_addr would 
show up automatically.  If you use node_start_pfn(nid) and 
node_end_pfn(nid) as suggested, there's no further consideration needed for 
hotplug.

I think trying to represent holes and handling different memory models and 
hotplug in special ways is complete overkill.

Ulrich, can I have your ack?
---
 Documentation/ABI/stable/sysfs-devices-node | 12 
 drivers/base/node.c | 18 ++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -63,6 +63,18 @@ Description:
 		The node's hit/miss statistics, in units of pages.
 		See Documentation/numastat.txt
 
+What:		/sys/devices/system/node/nodeX/start_phys_addr
+Date:		April 2014
+Contact:	David Rientjes <rient...@google.com>
+Description:
+		The physical base address of this node.
+
+What:		/sys/devices/system/node/nodeX/end_phys_addr
+Date:		April 2014
+Contact:	David Rientjes <rient...@google.com>
+Description:
+		The physical base + length address of this node.
+
 What:		/sys/devices/system/node/nodeX/distance
 Date:		October 2002
 Contact:	Linux Memory Management list <linux...@kvack.org>
diff --git a/drivers/base/node.c b/drivers/base/node.c
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -170,6 +170,20 @@ static ssize_t node_read_numastat(struct device *dev,
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
+static ssize_t node_read_start_phys_addr(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "0x%lx\n", node_start_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(start_phys_addr, S_IRUGO, node_read_start_phys_addr, NULL);
+
+static ssize_t node_read_end_phys_addr(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "0x%lx\n", node_end_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(end_phys_addr, S_IRUGO, node_read_end_phys_addr, NULL);
+
 static ssize_t node_read_vmstat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
@@ -286,6 +300,8 @@ static int register_node(struct node *node, int num, struct node *parent)
 	device_create_file(&node->dev, &dev_attr_cpulist);
 	device_create_file(&node->dev, &dev_attr_meminfo);
 	device_create_file(&node->dev, &dev_attr_numastat);
+	device_create_file(&node->dev, &dev_attr_start_phys_addr);
+	device_create_file(&node->dev, &dev_attr_end_phys_addr);
 	device_create_file(&node->dev, &dev_attr_distance);
 	device_create_file(&node->dev, &dev_attr_vmstat);
 
@@ -311,6 +327,8 @@ void unregister_node(struct node *node)
 	device_remove_file(&node->dev, &dev_attr_cpulist);
 	device_remove_file(&node->dev, &dev_attr_meminfo);
 	device_remove_file(&node->dev, &dev_attr_numastat);
+	device_remove_file(&node->dev, &dev_attr_start_phys_addr);
+	device_remove_file(&node->dev, &dev_attr_end_phys_addr);
 	device_remove_file(&node->dev, &dev_attr_distance);
 	device_remove_file(&node->dev, &dev_attr_vmstat);
 


Re: NUMA node information for pages

2014-04-10 Thread David Rientjes
On Wed, 9 Apr 2014, Naoya Horiguchi wrote:

> >  [ And that block_size_bytes file is absolutely horrid, why are we
> >exporting all this information in hex and not telling anybody? ]
> 
> Indeed, implicit hex numbers like this are commonly used in many places.
> I guess that it's maybe for historical reasons.
> 

I think it was meant to be simple so that you could easily add the length 
to the start, but it should at least prefix this with `0x'.  That code has 
been around for years, though, so we probably can't fix it now.

> > I'd much prefer a single change that works for everybody and userspace can 
> > rely on exporting accurate information as long as sysfs is mounted, and 
> > not even need to rely on getpagesize() to convert from pfn to physical 
> > address: just simple {start,end}_phys_addr files added to 
> > /sys/devices/system/node/nodeN/ for node N.  Online information can 
> > already be parsed for these ranges from /sys/devices/system/node/online.
> 
> OK, so what if some node has multiple address ranges?  I don't think that
> having start(end)_phys_addr simply return the minimum (maximum) possible
> address is optimal, because users can't know about the void ranges between
> valid address ranges (a nonexistent pfn should not belong to any node).
> Would printing multiline (or comma-separated) ranges be preferable, for
> example like below?
> 
>   $ cat /sys/devices/system/node/nodeN/phys_addr
>   0x0-0x8000
>   0x1-0x18000
> 

What the...?  nodeN should represent the pgdat for that node and a pgdat 
can only have a single range.  I'm suggesting that 
/sys/devices/system/node/nodeN/start_phys_addr returns 
node_start_pfn(N) << PAGE_SHIFT and 
/sys/devices/system/node/nodeN/end_phys_addr returns
node_end_pfn(N) << PAGE_SHIFT and prefix them correctly this time with 
`0x'.
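
A minimal userspace sketch of how the two proposed files might be consumed, assuming they print a single `0x'-prefixed hex value followed by a newline as described above. The function names are illustrative only, and the containment test treats the end address as exclusive, matching node_end_pfn().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parses text such as "0x100000000\n".  Returns 0 on success, -1 if no
 * hex digits were found or the value is out of range. */
static int parse_phys_addr(const char *text, uint64_t *out)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(text, &end, 16);	/* base 16 accepts an optional 0x */
	if (errno || end == text)
		return -1;
	*out = v;
	return 0;
}

/* True if addr lies in [start, end). */
static int addr_in_node_span(uint64_t addr, uint64_t start, uint64_t end)
{
	return addr >= start && addr < end;
}
```

Note that a hit here only says the address is inside the node's span; it says nothing about holes, and it cannot disambiguate spans that overlap.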


Re: NUMA node information for pages

2014-04-09 Thread David Rientjes
On Tue, 8 Apr 2014, Naoya Horiguchi wrote:

> Memory hotplug is done on a memory block basis, so if we get the info from under
> /sys/devices/system/memory/memory<ID> it should be memory hotplug-aware
> (/sys/devices/system/memory/memory<ID>/state shows online/offline status.)
> 
> And IIUC, a "pfn-node_id" mapping might already be available to userspace.
> /sys/devices/system/memory/block_size_bytes exports the memory block size,
> so we can simply map a pfn (physical address) to a memory block ID by
> (physical address)/(memory block size), then we can find the associated node
> from /sys/devices/system/memory/memory<ID>
> 
>   $ ls -l /sys/devices/system/memory/memory0
>   ...
>   lrwxrwxrwx 1 root root 0 Apr  8 00:15 node0 -> ../../node/node0
> 

That's only possible with sparsemem and if you have memory hotplug 
enabled.  I'm thinking that Ulrich is looking for a solution that won't 
have such a dependency and work for all memory models (including one that 
disables NUMA and simply represents all memory as one big node).

 [ And that block_size_bytes file is absolutely horrid, why are we
   exporting all this information in hex and not telling anybody? ]

I'd much prefer a single change that works for everybody and userspace can 
rely on exporting accurate information as long as sysfs is mounted, and 
not even need to rely on getpagesize() to convert from pfn to physical 
address: just simple {start,end}_phys_addr files added to 
/sys/devices/system/node/nodeN/ for node N.  Online information can 
already be parsed for these ranges from /sys/devices/system/node/online.
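
For reference, the block-ID arithmetic Naoya describes can be sketched as below. This assumes block_size_bytes holds hex digits with no `0x' prefix (the complaint above) and that the memory<ID> index is simply the physical address divided by the block size, as his mail suggests; both helper names are invented for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* block_size_bytes is printed in hex without a "0x" prefix, so the base
 * has to be given explicitly.  Returns 0 if nothing could be parsed. */
static uint64_t parse_block_size(const char *text)
{
	return strtoull(text, NULL, 16);
}

/* Index of the memory block that covers phys_addr.  The caller must
 * make sure block_size is non-zero. */
static uint64_t memory_block_id(uint64_t phys_addr, uint64_t block_size)
{
	return phys_addr / block_size;
}
```

The node would then be found by looking for the nodeN symlink inside the matching memory<ID> directory, which only exists with sparsemem and memory hotplug enabled.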


Re: NUMA node information for pages

2014-04-07 Thread Ulrich Drepper
On Mon, Mar 31, 2014 at 9:24 PM, Naoya Horiguchi
<n-horigu...@ah.jp.nec.com> wrote:
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me to add a new interface somewhere under
> /sys/devices/system/node/nodeN which shows the pfn range of a given node.
> If this doesn't work for your usecase, could you explain more about how you
> use this information?

I have no problem with that type of interface.  It'll be more work
figuring out the details since the interface I proposed is trivial and
mimics that of kpageflags etc but that's manageable.

I'll see whether I can figure out the necessary details.  I imagine
that if the PFNs are indeed always clustered for each node then, as
David proposes, text output like

  PFNSTART PFNSTOP

in a file below /sys/devices/system/node/nodeN should be sufficient.

How does memory hot plug work in this situation?  If the PFNs are
allocated densely at startup then there might potentially be many ranges
for each node.
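
A sketch of a reader for such a file, written so that it also copes with the many-ranges case raised above. The exact format was never specified in this thread; one "PFNSTART PFNSTOP" pair of decimal numbers per line is an assumption, as are the names.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct pfn_range {
	uint64_t start;
	uint64_t stop;
};

/* Parses up to max "START STOP" pairs from text and returns how many
 * were stored in out.  Parsing stops at the first malformed pair. */
static size_t parse_pfn_ranges(const char *text, struct pfn_range *out,
			       size_t max)
{
	size_t n = 0;
	int used = 0;

	while (n < max &&
	       sscanf(text, "%" SCNu64 " %" SCNu64 "%n",
		      &out[n].start, &out[n].stop, &used) == 2) {
		text += used;	/* continue after the pair just read */
		n++;
	}
	return n;
}
```

If the file only ever holds one pair, the same function simply returns 1.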


Re: NUMA node information for pages

2014-03-31 Thread David Rientjes
On Mon, 31 Mar 2014, Naoya Horiguchi wrote:

> > I might be missing something but I couldn't find a way to use the
> > pagemap information to then look up the NUMA node the respective page is
> > located on.  Especially when analyzing anomalies this is really
> > useful.  The /proc/kpageflags and /proc/kpagecount files don't have that
> > information.
> > 
> > If this is correct, could the attached patch be considered?  It's really
> > simple and follows the same line as the kpageflags file.
> 
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me to add a new interface somewhere under
> /sys/devices/system/node/nodeN which shows the pfn range of a given node.

If that's the direction we're going, I'd much prefer just the physical 
start and end addresses be exported rather than pfn so we don't need to do
getpagesize() in userspace.
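
For comparison, this is roughly the conversion userspace would need if only PFNs were exported. It is a sketch with made-up helper names; sysconf(_SC_PAGESIZE) is the POSIX equivalent of getpagesize().

```c
#include <stdint.h>
#include <unistd.h>

/* Physical address of the first byte of the page frame. */
static uint64_t pfn_to_phys(uint64_t pfn, uint64_t page_size)
{
	return pfn * page_size;
}

/* Page size of the running system, or 0 if it cannot be queried. */
static uint64_t runtime_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	return sz > 0 ? (uint64_t)sz : 0;
}
```

Exporting physical addresses directly would make both helpers unnecessary for this use case.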


NUMA node information for pages

2014-03-31 Thread Ulrich Drepper
I might be missing something but I couldn't find a way to use the
pagemap information to then look up the NUMA node the respective page is
located on.  Especially when analyzing anomalies this is really
useful.  The /proc/kpageflags and /proc/kpagecount files don't have that
information.

If this is correct, could the attached patch be considered?  It's really
simple and follows the same line as the kpageflags file.


Signed-off-by: Ulrich Drepper <drep...@gmail.com>

 Documentation/vm/pagemap.txt |    3 ++
 fs/proc/page.c               |   50 +++
 2 files changed, 53 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 5948e45..413b34c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -34,6 +34,9 @@ There are three components to pagemap:
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
    times each page is mapped, indexed by PFN.
 
+ * /proc/kpagenode.  This file contains a 32-bit number giving the NUMA
+   node each page is located on, indexed by PFN.
+
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e647c55..65bea9f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -15,6 +15,9 @@
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
 
+#define KNIDSIZE sizeof(s32)
+#define KNIDMASK (KNIDSIZE - 1)
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -212,10 +215,57 @@ static const struct file_operations 
proc_kpageflags_operations = {
.read = kpageflags_read,
 };
 
+/* /proc/kpagenode - an array exposing node information for pages
+ *
+ * Each entry is a s32 representing the corresponding
+ * physical page flags.
+ */
+
+static ssize_t kpagenode_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	s32 __user *out = (s32 __user *)buf;	/* entries are s32, not u64 */
+	unsigned long src = *ppos;
+	unsigned long pfn = src / KNIDSIZE;
+	ssize_t ret = 0;
+
+	count = min_t(unsigned long, count, (max_pfn * KNIDSIZE) - src);
+	if (src & KNIDMASK || count & KNIDMASK)	/* must be entry-aligned */
+		return -EINVAL;
+
+	while (count > 0) {
+		int nid;
+		if (pfn_valid(pfn))
+			nid = pfn_to_nid(pfn);
+		else
+			nid = -1;
+
+		if (put_user(nid, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KNIDSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagenode_operations = {
+   .llseek = mem_lseek,
+   .read = kpagenode_read,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagenode", S_IRUSR, NULL, &proc_kpagenode_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);
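
As a rough userspace model of the clamp-and-validate step in
kpagenode_read() above: it assumes the intended contract is that both the
file offset and the length are multiples of the 4-byte entry size, and
that a read starting at or beyond max_pfn entries yields zero bytes.  The
function name and the end-of-file handling are assumptions made for this
sketch, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define KNID_ENTRY_SIZE	4u			/* sizeof(s32) in the patch */
#define KNID_ENTRY_MASK	(KNID_ENTRY_SIZE - 1)

/* Return the number of bytes a read of `count` bytes at `offset` would
 * cover, or -1 where the kernel code would return -EINVAL. */
static int64_t knid_readable_bytes(uint64_t offset, uint64_t count,
				   uint64_t max_pfn)
{
	uint64_t end = max_pfn * KNID_ENTRY_SIZE;
	/* Assumption: an offset past the end means "nothing left". */
	uint64_t avail = offset < end ? end - offset : 0;

	if (count > avail)
		count = avail;
	/* Offset and clamped length must both be whole entries. */
	if ((offset & KNID_ENTRY_MASK) || (count & KNID_ENTRY_MASK))
		return -1;
	return (int64_t)count;
}
```

Masking with (size - 1) only works because the entry size is a power of
two; testing against the size itself would check a single bit rather
than the alignment.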


NUMA node information for pages

2014-03-31 Thread Ulrich Drepper
I might be missing something but I couldn't find a way to use the
pagemap information to then look up the NUMA node the respective page is
located on.  Especially when analyzing anomalies this is really
useful.  The /proc/kpageflags and /proc/kpagecount files don't have that
information.

If this is correct, could the attached patch be considered?  It's really
simple and follows the same line as the kpageflags file.


Signed-off-by: Ulrich Drepper drep...@gmail.com

 Documentation/vm/pagemap.txt |    3 ++
 fs/proc/page.c               |   50 +++
 2 files changed, 53 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 5948e45..413b34c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -34,6 +34,9 @@ There are three components to pagemap:
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
 
+ * /proc/kpagenode.  This file contains the 32-bit number of the NUMA node
+   each page is located on, indexed by PFN.
+
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
page, indexed by PFN.
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e647c55..65bea9f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -15,6 +15,9 @@
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
 
+#define KNIDSIZE sizeof(s32)
+#define KNIDMASK (KNIDSIZE - 1)
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -212,10 +215,57 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+/* /proc/kpagenode - an array exposing node information for pages
+ *
+ * Each entry is an s32 holding the NUMA node id of the
+ * corresponding physical page, or -1 if the pfn is not valid.
+ */
+
+static ssize_t kpagenode_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	s32 __user *out = (s32 __user *)buf;	/* entries are s32, not u64 */
+	unsigned long src = *ppos;
+	unsigned long pfn = src / KNIDSIZE;
+	ssize_t ret = 0;
+
+	count = min_t(unsigned long, count, (max_pfn * KNIDSIZE) - src);
+	if (src & KNIDMASK || count & KNIDMASK)	/* must be entry-aligned */
+		return -EINVAL;
+
+	while (count > 0) {
+		int nid;
+		if (pfn_valid(pfn))
+			nid = pfn_to_nid(pfn);
+		else
+			nid = -1;
+
+		if (put_user(nid, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KNIDSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagenode_operations = {
+   .llseek = mem_lseek,
+   .read = kpagenode_read,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagenode", S_IRUSR, NULL, &proc_kpagenode_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);


Re: NUMA node information for pages

2014-03-31 Thread David Rientjes
On Mon, 31 Mar 2014, Naoya Horiguchi wrote:

> > I might be missing something but I couldn't find a way to use the
> > pagemap information to then look up the NUMA node the respective page is
> > located on.  Especially when analyzing anomalies this is really
> > useful.  The /proc/kpageflags and /proc/kpagecount files don't have that
> > information.
> > 
> > If this is correct, could the attached patch be considered?  It's really
> > simple and follows the same line as the kpageflags file.
> 
> The information about pfn-node mapping seldom (or never) changes after boot,
> so it seems better to me to add a new interface somewhere under
> /sys/devices/system/node/nodeN that shows the pfn range of a given node.

If that's the direction we're going, I'd much prefer just the physical 
start and end addresses be exported rather than pfn so we don't need to do
getpagesize() in userspace.
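
To make the difference concrete, a small sketch of what userspace would
do with per-node ranges.  The struct and function names are hypothetical;
no such sysfs file format is defined in this thread.  With a pfn-based
export the caller has to scale by the page size first, whereas with byte
addresses the comparison is direct.  Since node ranges can overlap, as
raised elsewhere in this thread, the lookup collects every matching node
rather than stopping at the first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node record as userspace might parse it. */
struct node_phys_range {
	int nid;
	uint64_t start;		/* first physical byte of the range */
	uint64_t end;		/* last physical byte, inclusive */
};

/* What a pfn-based interface would force on every caller; page_size
 * would come from getpagesize(). */
static uint64_t pfn_to_phys(uint64_t pfn, uint64_t page_size)
{
	return pfn * page_size;
}

/* Write the ids of all nodes whose range contains addr into out (at
 * most out_len of them) and return how many ranges matched in total. */
static size_t nodes_containing(const struct node_phys_range *r, size_t n,
			       uint64_t addr, int *out, size_t out_len)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		if (addr < r[i].start || addr > r[i].end)
			continue;
		if (hits < out_len)
			out[hits] = r[i].nid;
		hits++;
	}
	return hits;
}
```

A result greater than one means the ranges alone cannot decide which
node owns the address.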