Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
On 04/11/2014 03:13 PM, David Rientjes wrote:
> What additional information, in your opinion, can we export to assist
> userspace in making this determination that $address is on $nid?

In the case of overlapping nodes, the only place we actually have *all*
of the information is in the 'struct page' itself. Ulrich's original
patch obviously _works_, and especially if it's an interface only for
debugging purposes, it seems silly to spend virtually any time
optimizing it. Keeping it close to pagemap's implementation lessens the
likelihood that we'll screw things up.

I assume that the original problem was trying to figure out what NUMA
affinity a given range of pages mapped in to a _process_ has, and that
/proc/$pid/numa_maps is too coarse. Is that right, Ulrich?

If you want to go the route of calculating and exporting the physical
ranges that nodes uniquely own, you've *GOT* to handle the overlaps.
Naoya had the right idea. His idea seemed to get shot down with the
misunderstanding that node pfn ranges never overlap.

The only other question is how many of these kpage* things we're going
to put in here until we've exported the entire contents of 'struct page'
5 times over. :)

We could add some tracepoints to the pagemap to dump lots of information
in to a trace buffer that could be later read back. If you want
detailed information (NUMA for instance), you turn on the tracepoints
and read pagemap for the range you care about.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
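
For the per-process case mentioned above, pagemap already yields the PFN
behind a virtual address; only the PFN-to-node step is missing. A minimal
sketch of decoding one pagemap entry, assuming the layout described in
Documentation/vm/pagemap.txt (bit 63 "page present", bit 62 "page
swapped", bits 0-54 the PFN when present). The macro and function names
are illustrative only and are not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Layout per Documentation/vm/pagemap.txt: bit 63 = page present,
 * bit 62 = page swapped, bits 0-54 = PFN (meaningful only if present). */
#define PM_PRESENT   (UINT64_C(1) << 63)
#define PM_SWAPPED   (UINT64_C(1) << 62)
#define PM_PFN_MASK  ((UINT64_C(1) << 55) - 1)

/* Byte offset of the 64-bit entry for vaddr inside /proc/<pid>/pagemap. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/* Returns 1 and stores the PFN if the page is resident in RAM, else 0. */
static int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED))
		return 0;
	*pfn = entry & PM_PFN_MASK;
	return 1;
}
```

A caller would pread() eight bytes at pagemap_offset(vaddr, page_size)
and pass the value to pagemap_entry_to_pfn(); what to do with the PFN
afterwards is exactly the open question in this thread.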
Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
On Fri, 11 Apr 2014, Dave Hansen wrote:
> > So? Who cares if there are non-addressable holes in part of the span?
> > Ulrich, correct me if I'm wrong, but it seems you're looking for just an
> > address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
> > expecting that there are no holes in a node for things like ACPI or I/O or
> > reserved memory.
> ...
> > I think trying to represent holes and handling different memory models and
> > hotplug in special ways is complete overkill.
>
> This isn't just about memory hotplug or different memory models. There
> are systems out there today, in production, that have layouts like this:
>
>     |--Node0-|
>         |--Node1-|
>
> and this:
>
>     |--Node0-|
>      |-Node1-|
>

What additional information, in your opinion, can we export to assist
userspace in making this determination that $address is on $nid?
Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
On 04/11/2014 04:00 AM, David Rientjes wrote:
> On Thu, 10 Apr 2014, Naoya Horiguchi wrote:
> > Yes, that's right, but it seems to me that just node_start_pfn and
> > node_end_pfn is not enough because there can be holes (without any
> > page struct backed) inside [node_start_pfn, node_end_pfn), and it's
> > not aware of memory hotplug.
>
> So? Who cares if there are non-addressable holes in part of the span?
> Ulrich, correct me if I'm wrong, but it seems you're looking for just an
> address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
> expecting that there are no holes in a node for things like ACPI or I/O or
> reserved memory.
...
> I think trying to represent holes and handling different memory models and
> hotplug in special ways is complete overkill.

This isn't just about memory hotplug or different memory models. There
are systems out there today, in production, that have layouts like this:

    |--Node0-|
        |--Node1-|

and this:

    |--Node0-|
     |-Node1-|

For those systems, this interface has no meaning. Given a page in the
shared-span areas, this interface provides no way to figure out which
node it is in.

If you want a non-portable hack that just works on one system, I'd
suggest parsing the existing firmware tables.
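
The ambiguity described above can be sketched in a few lines of
userspace C. This is only an illustration: struct node_span and
nodes_spanning() are hypothetical names, and the spans are whatever a
tool would have read from the proposed start_phys_addr/end_phys_addr
files. When the spans overlap, more than one node matches an address in
the shared area, so the spans alone cannot name the owner:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One [start, end) physical span per node, as the proposed
 * start_phys_addr/end_phys_addr files would report it. */
struct node_span {
	int nid;
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/* Counts the spans containing addr and stores the first matching nid.
 * A result greater than 1 means the spans alone cannot name the node. */
static size_t nodes_spanning(const struct node_span *spans, size_t n,
			     uint64_t addr, int *first_nid)
{
	size_t hits = 0;

	assert(spans != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (addr >= spans[i].start && addr < spans[i].end) {
			if (hits == 0 && first_nid)
				*first_nid = spans[i].nid;
			hits++;
		}
	}
	return hits;
}
```

With two made-up spans, node 0 at [0x0, 0x80000000) and node 1 at
[0x40000000, 0xC0000000), an address such as 0x50000000 matches both,
and only per-page information can break the tie.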
Re: [PATCH] drivers/base/node.c: export physical address range of given node (Re: NUMA node information for pages)
On Thu, 10 Apr 2014, Naoya Horiguchi wrote:
> Yes, that's right, but it seems to me that just node_start_pfn and
> node_end_pfn is not enough because there can be holes (without any
> page struct backed) inside [node_start_pfn, node_end_pfn), and it's
> not aware of memory hotplug.
>

So? Who cares if there are non-addressable holes in part of the span?
Ulrich, correct me if I'm wrong, but it seems you're looking for just an
address-to-nodeid mapping (or pfn-to-nodeid mapping) and aren't actually
expecting that there are no holes in a node for things like ACPI or I/O or
reserved memory.

The node spans a contiguous length of memory; there's no consideration for
addresses that aren't actually backed by physical memory. We are just
representing proximity domains that have a base address and length in the
ACPI world.

Memory hotplug is already taken care of because onlining and offlining
nodes already add these node classes and {start,end}_phys_addr would
show up automatically. If you use node_start_pfn(nid) and
node_end_pfn(nid) as suggested, there's no further consideration needed for
hotplug.

I think trying to represent holes and handling different memory models and
hotplug in special ways is complete overkill.

Ulrich, can I have your ack?
---
 Documentation/ABI/stable/sysfs-devices-node | 12 ++++++++++++
 drivers/base/node.c                         | 18 ++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -63,6 +63,18 @@ Description:
 		The node's hit/miss statistics, in units of pages.
 		See Documentation/numastat.txt
 
+What:		/sys/devices/system/node/nodeX/start_phys_addr
+Date:		April 2014
+Contact:	David Rientjes <rient...@google.com>
+Description:
+		The physical base address of this node.
+
+What:		/sys/devices/system/node/nodeX/end_phys_addr
+Date:		April 2014
+Contact:	David Rientjes <rient...@google.com>
+Description:
+		The physical base + length address of this node.
+
 What:		/sys/devices/system/node/nodeX/distance
 Date:		October 2002
 Contact:	Linux Memory Management list <linux...@kvack.org>
diff --git a/drivers/base/node.c b/drivers/base/node.c
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -170,6 +170,20 @@ static ssize_t node_read_numastat(struct device *dev,
 }
 static DEVICE_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
+static ssize_t node_read_start_phys_addr(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "0x%lx\n", node_start_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(start_phys_addr, S_IRUGO, node_read_start_phys_addr, NULL);
+
+static ssize_t node_read_end_phys_addr(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "0x%lx\n", node_end_pfn(dev->id) << PAGE_SHIFT);
+}
+static DEVICE_ATTR(end_phys_addr, S_IRUGO, node_read_end_phys_addr, NULL);
+
 static ssize_t node_read_vmstat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
@@ -286,6 +300,8 @@ static int register_node(struct node *node, int num, struct node *parent)
 		device_create_file(&node->dev, &dev_attr_cpulist);
 		device_create_file(&node->dev, &dev_attr_meminfo);
 		device_create_file(&node->dev, &dev_attr_numastat);
+		device_create_file(&node->dev, &dev_attr_start_phys_addr);
+		device_create_file(&node->dev, &dev_attr_end_phys_addr);
 		device_create_file(&node->dev, &dev_attr_distance);
 		device_create_file(&node->dev, &dev_attr_vmstat);
 
@@ -311,6 +327,8 @@ void unregister_node(struct node *node)
 	device_remove_file(&node->dev, &dev_attr_cpulist);
 	device_remove_file(&node->dev, &dev_attr_meminfo);
 	device_remove_file(&node->dev, &dev_attr_numastat);
+	device_remove_file(&node->dev, &dev_attr_start_phys_addr);
+	device_remove_file(&node->dev, &dev_attr_end_phys_addr);
 	device_remove_file(&node->dev, &dev_attr_distance);
 	device_remove_file(&node->dev, &dev_attr_vmstat);
 
Re: NUMA node information for pages
On Wed, 9 Apr 2014, Naoya Horiguchi wrote:
> > [ And that block_size_bytes file is absolutely horrid, why are we
> >   exporting all this information in hex and not telling anybody? ]
>
> Indeed, this kind of implicit hex numbers are commonly used in many places.
> I guess that it's maybe for historical reasons.
>

I think it was meant to be simple so that you could easily add the length
to the start, but it should at least prefix this with `0x'. That code has
been around for years, though, so we probably can't fix it now.

> > I'd much prefer a single change that works for everybody and userspace can
> > rely on exporting accurate information as long as sysfs is mounted, and
> > not even need to rely on getpagesize() to convert from pfn to physical
> > address: just simple {start,end}_phys_addr files added to
> > /sys/devices/system/node/nodeN/ for node N. Online information can
> > already be parsed for these ranges from /sys/devices/system/node/online.
>
> OK, so what if some node has multiple address ranges? I don't think that
> start(end)_phys_addr simply returns minimum (maximum) possible address is
> optimal, because users can't know about void range between valid address
> ranges (non-exist pfn should not belong to any node).
> Are printing multilined (or comma-separated) ranges preferable for example
> like below?
>
>   $ cat /sys/devices/system/node/nodeN/phys_addr
>   0x0-0x8000
>   0x1-0x18000
>

What the...? nodeN should represent the pgdat for that node and a pgdat
can only have a single range.

I'm suggesting that /sys/devices/system/node/nodeN/start_phys_addr returns
node_start_pfn(N) << PAGE_SHIFT and
/sys/devices/system/node/nodeN/end_phys_addr returns
node_end_pfn(N) << PAGE_SHIFT and prefix them correctly this time with
`0x'.
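
The conversion being discussed is a plain shift by the page-size
exponent. A minimal userspace sketch, assuming the caller already knows
the shift (12 for 4 KiB pages, which is an example value and not true of
every system); pfn_to_phys() and phys_to_pfn() are illustrative names.
The arithmetic is done in 64 bits so that it does not wrap on builds
where unsigned long is 32 bits wide:

```c
#include <assert.h>
#include <stdint.h>

/* Physical address of the first byte of page frame pfn.
 * The shift is done on a 64-bit value so it cannot wrap on builds
 * where unsigned long is only 32 bits wide. */
static uint64_t pfn_to_phys(uint64_t pfn, unsigned int page_shift)
{
	assert(page_shift < 64);
	return pfn << page_shift;
}

/* Inverse: page frame number containing physical address phys. */
static uint64_t phys_to_pfn(uint64_t phys, unsigned int page_shift)
{
	assert(page_shift < 64);
	return phys >> page_shift;
}
```

Exporting addresses rather than PFNs moves this step into the kernel,
which is the stated reason for preferring {start,end}_phys_addr.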
Re: NUMA node information for pages
On Tue, 8 Apr 2014, Naoya Horiguchi wrote:
> memory hotplug is done in memory block basis, so if we get info from under
> /sys/devices/system/memory/memory<ID> it should be memory hotplug-aware
> (/sys/devices/system/memory/memory<ID>/state shows online/offline status.)
>
> And IIUC, "pfn-node_id" mapping might be already available for userspace.
> /sys/devices/system/memory/block_size_bytes exports memory block size,
> so we can simply map pfn (physical address) into memory block ID by
> (physical address)/(memory block size), then we can find associated node
> from /sys/devices/system/memory/memory<ID>
>
>   $ ls -l /sys/devices/system/memory/memory0
>   ...
>   lrwxrwxrwx 1 root root 0 Apr  8 00:15 node0 -> ../../node/node0
>

That's only possible with sparsemem and if you have memory hotplug
enabled. I'm thinking that Ulrich is looking for a solution that won't
have such a dependency and works for all memory models (including one that
disables NUMA and simply represents all memory as one big node).

[ And that block_size_bytes file is absolutely horrid, why are we
  exporting all this information in hex and not telling anybody? ]

I'd much prefer a single change that works for everybody and userspace can
rely on exporting accurate information as long as sysfs is mounted, and
not even need to rely on getpagesize() to convert from pfn to physical
address: just simple {start,end}_phys_addr files added to
/sys/devices/system/node/nodeN/ for node N. Online information can
already be parsed for these ranges from /sys/devices/system/node/online.
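
The block-based lookup quoted above comes down to one division, plus the
hex parsing complained about in the aside. A rough sketch, assuming the
contents of block_size_bytes have already been read into a string;
parse_block_size() and memory_block_id() are hypothetical helper names,
and the final step (resolving the nodeN link under the memory<ID>
directory) is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* block_size_bytes is printed as hexadecimal digits without a "0x"
 * prefix, so parse it explicitly in base 16.
 * Returns 0 if no digits could be parsed. */
static uint64_t parse_block_size(const char *text)
{
	char *end;
	unsigned long long v;

	assert(text != NULL);
	v = strtoull(text, &end, 16);
	return end == text ? 0 : (uint64_t)v;
}

/* Index of the memory block (memory<ID> directory) containing phys. */
static uint64_t memory_block_id(uint64_t phys, uint64_t block_size)
{
	assert(block_size != 0);
	return phys / block_size;
}
```

As noted in the reply, this route depends on sparsemem and memory
hotplug being enabled, so it is not a general answer.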
Re: NUMA node information for pages
On Mon, Mar 31, 2014 at 9:24 PM, Naoya Horiguchi wrote:
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me that adding a new interface somewhere under
> /sys/devices/system/node/nodeN which shows pfn range of a given node.
> If this doesn't work for your usecase, could you explain more about how you
> use this information?

I have no problem with that type of interface. It'll be more work
figuring out the details since the interface I proposed is trivial and
mimics that of kpageflags etc., but that's manageable. I'll see
whether I can figure out the necessary details.

I imagine that if the PFNs are indeed always clustered for each node
then, as David proposes, text output like

    PFNSTART PFNSTOP

in a file below /sys/devices/system/node/nodeN should be sufficient.
How does memory hotplug work in this situation? If the PFNs are
allocated densely at startup then there might potentially be many ranges
for each node.
Re: NUMA node information for pages
On Mon, 31 Mar 2014, Naoya Horiguchi wrote:
> > I might be missing something but I couldn't find a way to use the
> > pagemap information to then look up the NUMA node the respective page is
> > located on. Especially when analyzing anomalies this is really
> > useful. The /proc/kpageflags and /proc/kpagecount files don't have that
> > information.
> >
> > If this is correct, could the attached patch be considered? It's really
> > simple and follows the same line as the kpageflags file.
>
> The information about "pfn-node" mapping seldom (or never) changes after boot,
> so it seems better to me that adding a new interface somewhere under
> /sys/devices/system/node/nodeN which shows pfn range of a given node.

If that's the direction we're going, I'd much prefer just the physical
start and end addresses be exported rather than pfn so we don't need to do
getpagesize() in userspace.
NUMA node information for pages
I might be missing something but I couldn't find a way to use the
pagemap information to then look up the NUMA node the respective page is
located on. Especially when analyzing anomalies this is really
useful. The /proc/kpageflags and /proc/kpagecount files don't have that
information.

If this is correct, could the attached patch be considered? It's really
simple and follows the same line as the kpageflags file.

Signed-off-by: Ulrich Drepper <drep...@gmail.com>

 Documentation/vm/pagemap.txt |  3 +++
 fs/proc/page.c               | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 5948e45..413b34c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -34,6 +34,9 @@ There are three components to pagemap:
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
    times each page is mapped, indexed by PFN.
 
+ * /proc/kpagenode.  This file contains a 32-bit number of the NUMA node
+   each page is mapped on.
+
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e647c55..65bea9f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -15,6 +15,9 @@
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
 
+#define KNIDSIZE sizeof(s32)
+#define KNIDMASK (KNIDSIZE - 1)
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -212,10 +215,57 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+/* /proc/kpagenode - an array exposing node information for pages
+ *
+ * Each entry is a s32 representing the corresponding
+ * physical page flags.
+ */
+
+static ssize_t kpagenode_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	unsigned long src = *ppos;
+	unsigned long pfn = src / KNIDSIZE;
+	ssize_t ret = 0;
+
+	count = min_t(unsigned long, count, (max_pfn * KNIDSIZE) - src);
+	if (src & KNIDSIZE || count & KNIDMASK)
+		return -EINVAL;
+
+	while (count > 0) {
+		int nid;
+		if (pfn_valid(pfn))
+			nid = pfn_to_nid(pfn);
+		else
+			nid = -1;
+
+		if (put_user(nid, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KNIDSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagenode_operations = {
+	.llseek = mem_lseek,
+	.read = kpagenode_read,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagenode", S_IRUSR, NULL, &proc_kpagenode_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);
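
On the userspace side, a consumer of such a file would seek to
pfn * 4 and read 4-byte entries. A small sketch under the assumption
that the file behaves as the documentation hunk above describes (one
signed 32-bit node id per PFN in host byte order, -1 for a PFN without a
valid page); /proc/kpagenode is only the interface proposed in this
patch, and the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Entry width described in the patch: one signed 32-bit node id per PFN. */
#define KPAGENODE_ENTRY_SIZE sizeof(int32_t)

/* File offset of the entry for pfn in the proposed /proc/kpagenode. */
static uint64_t kpagenode_offset(uint64_t pfn)
{
	return pfn * KPAGENODE_ENTRY_SIZE;
}

/* Decodes the i-th entry from a buffer read at an entry-aligned offset.
 * Returns the node id, or -1 for a PFN the kernel reported as invalid. */
static int32_t kpagenode_entry(const unsigned char *buf, size_t len, size_t i)
{
	int32_t nid;

	assert(buf != NULL && (i + 1) * KPAGENODE_ENTRY_SIZE <= len);
	memcpy(&nid, buf + i * KPAGENODE_ENTRY_SIZE, sizeof(nid));
	return nid;
}
```

Combined with a PFN taken from /proc/<pid>/pagemap, a single pread() at
kpagenode_offset(pfn) would answer the per-page question directly.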
NUMA node information for pages
I might be missing something but I couldn't find a way to use the
pagemap information to then look up the NUMA node the respective page
is located on. Especially when analyzing anomalies this is really
useful. The /proc/kpageflags and /proc/kpagecount files don't have
that information.

If this is correct, could the attached patch be considered? It's
really simple and follows the same line as the kpageflags file.

Signed-off-by: Ulrich Drepper <drep...@gmail.com>

 Documentation/vm/pagemap.txt |  3 ++
 fs/proc/page.c               | 50 +++
 2 files changed, 53 insertions(+)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 5948e45..413b34c 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -34,6 +34,9 @@ There are three components to pagemap:
  * /proc/kpagecount.  This file contains a 64-bit count of the number of
    times each page is mapped, indexed by PFN.
 
+ * /proc/kpagenode.  This file contains a 32-bit number of the NUMA node
+   each page is mapped on.
+
  * /proc/kpageflags.  This file contains a 64-bit set of flags for each
    page, indexed by PFN.
diff --git a/fs/proc/page.c b/fs/proc/page.c
index e647c55..65bea9f 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -15,6 +15,9 @@
 #define KPMSIZE sizeof(u64)
 #define KPMMASK (KPMSIZE - 1)
 
+#define KNIDSIZE sizeof(s32)
+#define KNIDMASK (KNIDSIZE - 1)
+
 /* /proc/kpagecount - an array exposing page counts
  *
  * Each entry is a u64 representing the corresponding
@@ -212,10 +215,57 @@ static const struct file_operations proc_kpageflags_operations = {
 	.read = kpageflags_read,
 };
 
+/* /proc/kpagenode - an array exposing node information for pages
+ *
+ * Each entry is a s32 representing the corresponding
+ * physical page flags.
+ */
+
+static ssize_t kpagenode_read(struct file *file, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	u64 __user *out = (u64 __user *)buf;
+	unsigned long src = *ppos;
+	unsigned long pfn = src / KNIDSIZE;
+	ssize_t ret = 0;
+
+	count = min_t(unsigned long, count, (max_pfn * KNIDSIZE) - src);
+	if (src & KNIDSIZE || count & KNIDMASK)
+		return -EINVAL;
+
+	while (count > 0) {
+		int nid;
+		if (pfn_valid(pfn))
+			nid = pfn_to_nid(pfn);
+		else
+			nid = -1;
+
+		if (put_user(nid, out)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		pfn++;
+		out++;
+		count -= KNIDSIZE;
+	}
+
+	*ppos += (char __user *)out - buf;
+	if (!ret)
+		ret = (char __user *)out - buf;
+	return ret;
+}
+
+static const struct file_operations proc_kpagenode_operations = {
+	.llseek = mem_lseek,
+	.read = kpagenode_read,
+};
+
 static int __init proc_page_init(void)
 {
 	proc_create("kpagecount", S_IRUSR, NULL, &proc_kpagecount_operations);
 	proc_create("kpageflags", S_IRUSR, NULL, &proc_kpageflags_operations);
+	proc_create("kpagenode", S_IRUSR, NULL, &proc_kpagenode_operations);
 	return 0;
 }
 fs_initcall(proc_page_init);
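For readers who want to see how userspace might consume such a file, here is a minimal sketch. It assumes the layout described in the documentation hunk of the patch (one signed 32-bit node id per PFN, -1 for an invalid PFN) and relies on the documented pagemap entry format (bit 63 = present, bits 0-54 = PFN). The helper names pagemap_entry_to_pfn() and read_page_node() are made up for illustration, and /proc/kpagenode exists only with this proposed patch applied, so the path is passed in by the caller.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Documented pagemap entry layout: bit 63 set means the page is
 * present in RAM, and bits 0-54 then hold the PFN. */
#define PM_PRESENT	(1ULL << 63)
#define PM_PFN_MASK	((1ULL << 55) - 1)

/* Hypothetical helper: extract the PFN from one 64-bit pagemap entry.
 * Returns 0 on success, -1 if the page is not present. */
static int pagemap_entry_to_pfn(uint64_t entry, uint64_t *pfn)
{
	if (!(entry & PM_PRESENT))
		return -1;
	*pfn = entry & PM_PFN_MASK;
	return 0;
}

/* Hypothetical helper: read the node id for one PFN from a file laid
 * out as an array of 32-bit signed entries indexed by PFN (the format
 * assumed for the proposed /proc/kpagenode).
 * Returns 0 on success, -1 on open failure or short read. */
static int read_page_node(const char *path, uint64_t pfn, int32_t *nid)
{
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, nid, sizeof(*nid), (off_t)(pfn * sizeof(*nid)));
	close(fd);
	return n == (ssize_t)sizeof(*nid) ? 0 : -1;
}
```

A caller would read the 64-bit entry for a virtual page from /proc/<pid>/pagemap, pass it through pagemap_entry_to_pfn(), and then call read_page_node("/proc/kpagenode", pfn, &nid). Reading real PFNs from pagemap may require privileges on newer kernels.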
Re: NUMA node information for pages
On Mon, 31 Mar 2014, Naoya Horiguchi wrote:

> > I might be missing something but I couldn't find a way to use the
> > pagemap information to then look up the NUMA node the respective page
> > is located on. Especially when analyzing anomalies this is really
> > useful. The /proc/kpageflags and /proc/kpagecount files don't have
> > that information.
> >
> > If this is correct, could the attached patch be considered? It's
> > really simple and follows the same line as the kpageflags file.
>
> The information about the pfn-node mapping seldom (or never) changes
> after boot, so it seems better to me to add a new interface somewhere
> under /sys/devices/system/node/nodeN which shows the pfn range of a
> given node.

If that's the direction we're going, I'd much prefer that just the
physical start and end addresses be exported rather than pfns, so we
don't need to call getpagesize() in userspace.
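To make the trade-off concrete, here is a rough sketch of what userspace would do with per-node ranges. Everything in it is an assumption for illustration: struct node_range and both helpers are hypothetical, and no particular sysfs file format is implied. With pfn ranges the caller first has to scale by the page size (the getpagesize() step mentioned above); with physical addresses the lookup is a plain range comparison. The lookup returns every matching span rather than the first, since node spans can overlap on some systems.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: the physical span [start, end) of one node. */
struct node_range {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* The extra step a pfn-based interface pushes onto userspace:
 * page_size would come from getpagesize() or sysconf(_SC_PAGESIZE). */
static uint64_t pfn_to_phys(uint64_t pfn, uint64_t page_size)
{
	return pfn * page_size;
}

/* Store up to max node ids whose span contains addr in out[].
 * Returns the total number of matching spans, which can exceed max
 * and is greater than 1 when spans overlap at addr. */
static size_t nodes_spanning(const struct node_range *r, size_t n,
			     uint64_t addr, int *out, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		if (addr >= r[i].start && addr < r[i].end) {
			if (found < max)
				out[found] = r[i].nid;
			found++;
		}
	}
	return found;
}
```

A return value above 1 means the ranges alone cannot say which node owns the address, and a per-page lookup would still be needed for that case.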