Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-10-13 Thread Anshuman Khandual
On 09/20/2016 06:24 AM, David Rientjes wrote:
> On Sat, 17 Sep 2016, Anshuman Khandual wrote:
> 
>>> I'm questioning if this information can be inferred from information
>>> already in /proc/zoneinfo and sysfs.  We know the no-fallback zonelist is
>>> going to include the local node, and we know the other zonelists are
>>> either node ordered or zone ordered (or do we need to extend
>>> vm.numa_zonelist_order for default?).  I may have missed what new
>>> knowledge this interface is imparting on us.
>>
>> IIUC /proc/zoneinfo lists down zone internal state and statistics for
>> all zones on the system at any given point of time. The no-fallback
>> list contains the zones from the local node and fallback (which gets
>> used more often than the no-fallback) list contains all zones either
>> in node-ordered or zone-ordered manner. In most of the platforms the
>> default being the node order but the sequence of present nodes in
>> that order is determined by various factors like NUMA distance, load,
>> presence of CPUs on the node etc. This order of nodes in the fallback
>> list is the most important information derived out of this interface.
>>
> The point is that all of this can be inferred with information already 
> provided, so the additional interface seems unnecessary.  The only 
> extension I think that is needed is to determine if the order is node or 
> zone when vm.numa_zonelist_order == default and we shouldn't parse this 
> from dmesg.

Okay. Seems like the general view is that this interface is not necessary.
Hence I won't be posting the debugfs version for now.



Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-19 Thread David Rientjes
On Sat, 17 Sep 2016, Anshuman Khandual wrote:

> > I'm questioning if this information can be inferred from information 
> > already in /proc/zoneinfo and sysfs.  We know the no-fallback zonelist is 
> > going to include the local node, and we know the other zonelists are 
> > either node ordered or zone ordered (or do we need to extend 
> > vm.numa_zonelist_order for default?).  I may have missed what new 
> > knowledge this interface is imparting on us.
> 
> IIUC /proc/zoneinfo lists down zone internal state and statistics for
> all zones on the system at any given point of time. The no-fallback
> list contains the zones from the local node and fallback (which gets
> used more often than the no-fallback) list contains all zones either
> in node-ordered or zone-ordered manner. In most of the platforms the
> default being the node order but the sequence of present nodes in
> that order is determined by various factors like NUMA distance, load,
> presence of CPUs on the node etc. This order of nodes in the fallback
> list is the most important information derived out of this interface.
> 

The point is that all of this can be inferred with information already 
provided, so the additional interface seems unnecessary.  The only 
extension I think that is needed is to determine if the order is node or 
zone when vm.numa_zonelist_order == default and we shouldn't parse this 
from dmesg.
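
A minimal userspace sketch of the problem described above, assuming the
sysctl accepts the "[Dd]efault", "[Nn]ode" and "[Zz]one" spellings and is
classified by its first character. The enum and parse_zonelist_order() are
hypothetical names for illustration, not kernel symbols. The point it makes
is that a value of "default" does not, by itself, say which ordering the
kernel actually chose.

```c
#include <stddef.h>

/* Hypothetical userspace names; these are not kernel symbols. */
enum zonelist_order {
	ORDER_DEFAULT,	/* kernel picks node or zone order on its own */
	ORDER_NODE,
	ORDER_ZONE,
	ORDER_INVALID,
};

/*
 * Classify a vm.numa_zonelist_order string by its first character,
 * assuming the "[Dd]efault", "[Nn]ode" and "[Zz]one" spellings.
 * A trailing newline (as read from /proc/sys) does not matter here.
 */
enum zonelist_order parse_zonelist_order(const char *s)
{
	if (s == NULL)
		return ORDER_INVALID;

	switch (s[0]) {
	case 'd': case 'D':
		return ORDER_DEFAULT;
	case 'n': case 'N':
		return ORDER_NODE;
	case 'z': case 'Z':
		return ORDER_ZONE;
	default:
		return ORDER_INVALID;
	}
}
```

A caller that gets ORDER_DEFAULT back still has to find out the effective
order some other way, which is the gap pointed out in the message above.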


Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-16 Thread Anshuman Khandual
On 09/12/2016 11:43 PM, David Rientjes wrote:
> On Mon, 12 Sep 2016, Anshuman Khandual wrote:
> 
>>>> after memory or node hot[un]plug is desirable. This change adds one
>>>> new sysfs interface (/sys/devices/system/memory/system_zone_details)
>>>> which will fetch and dump this information.
>>> Doesn't this violate the "one value per file" sysfs rule?  Does it
>>> belong in debugfs instead?
>>
>> Yeah sure. Will make it a debugfs interface.
>>
> 
> So the intended reader of this file is running as root?

Yeah.

> 
>>> I also really question the need to dump kernel addresses out, filtered 
>>> or not.  What's the point?
>>
>> Hmm, thought it to be an additional information. But yes its additional
>> and can be dropped.
>>
> 
> I'm questioning if this information can be inferred from information 
> already in /proc/zoneinfo and sysfs.  We know the no-fallback zonelist is 
> going to include the local node, and we know the other zonelists are 
> either node ordered or zone ordered (or do we need to extend 
> vm.numa_zonelist_order for default?).  I may have missed what new 
> knowledge this interface is imparting on us.

IIUC /proc/zoneinfo lists the internal state and statistics of all zones
on the system at any given point in time. The no-fallback list contains
the zones from the local node, and the fallback list (which gets used
more often than the no-fallback one) contains all zones in either node
order or zone order. On most platforms the default is node order, but
the sequence of present nodes in that order is determined by various
factors like NUMA distance, load, presence of CPUs on the node etc. This
order of nodes in the fallback list is the most important information
derived from this interface.
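
To illustrate the distance factor only, here is a rough userspace sketch
that sorts node ids by one node's row of the NUMA distance table (the kind
of row exposed under /sys/devices/system/node/nodeN/distance). The function
name is hypothetical, and the sketch deliberately ignores the load and
CPU-presence factors mentioned above, so it approximates rather than
reproduces the kernel's ordering.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill order[] with node ids sorted by ascending distance from the
 * local node, given that node's row of the distance table. Nodes at
 * equal distance keep ascending node-id order (the insertion sort
 * below is stable). Load and CPU presence are not modelled.
 */
void order_nodes_by_distance(const int *distance, size_t nr_nodes, int *order)
{
	size_t i, j;

	assert(distance != NULL && order != NULL);

	for (i = 0; i < nr_nodes; i++) {
		int node = (int)i;

		/* Shift strictly farther nodes right to make room. */
		for (j = i; j > 0 && distance[order[j - 1]] > distance[node]; j--)
			order[j] = order[j - 1];
		order[j] = node;
	}
}
```

With a row of {20, 10, 20, 40} for node 1, this yields 1, 0, 2, 3. The KVM
example in the patch lists node 1's fallback order as 1, 2, 3, 0, so unless
that guest's distance table is skewed that way, something beyond raw
distance is shaping the real list.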



Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-12 Thread David Rientjes
On Mon, 12 Sep 2016, Anshuman Khandual wrote:

> > > after memory or node hot[un]plug is desirable. This change adds one
> > > new sysfs interface (/sys/devices/system/memory/system_zone_details)
> > > which will fetch and dump this information.
> > Doesn't this violate the "one value per file" sysfs rule?  Does it
> > belong in debugfs instead?
> 
> Yeah sure. Will make it a debugfs interface.
> 

So the intended reader of this file is running as root?

> > I also really question the need to dump kernel addresses out, filtered 
> > or not.  What's the point?
> 
> Hmm, thought it to be an additional information. But yes its additional
> and can be dropped.
> 

I'm questioning if this information can be inferred from information 
already in /proc/zoneinfo and sysfs.  We know the no-fallback zonelist is 
going to include the local node, and we know the other zonelists are 
either node ordered or zone ordered (or do we need to extend 
vm.numa_zonelist_order for default?).  I may have missed what new 
knowledge this interface is imparting on us.


Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-11 Thread Anshuman Khandual
On 09/09/2016 01:54 AM, Dave Hansen wrote:
> On 09/07/2016 07:46 PM, Anshuman Khandual wrote:
>> after memory or node hot[un]plug is desirable. This change adds one
>> new sysfs interface (/sys/devices/system/memory/system_zone_details)
>> which will fetch and dump this information.
> Doesn't this violate the "one value per file" sysfs rule?  Does it
> belong in debugfs instead?

Yeah sure. Will make it a debugfs interface.

> 
> I also really question the need to dump kernel addresses out, filtered
> or not.  What's the point?

Hmm, I thought of it as additional information. But yes, it's only
additional and can be dropped.



Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-11 Thread Anshuman Khandual
On 09/09/2016 07:06 PM, Michal Hocko wrote:
> On Thu 08-09-16 08:16:58, Anshuman Khandual wrote:
>> Each individual node in the system has a ZONELIST_FALLBACK zonelist
>> and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
>> order of zones during memory allocations. Sometimes it helps to dump
>> these zonelists to see the priority order of various zones in them.
>>
>> Particularly platforms which support memory hotplug into previously
>> non existing zones (at boot), this interface helps in visualizing
>> which all zonelists of the system at what priority level, the new
>> hot added memory ends up in. POWER is such a platform where all the
>> memory detected during boot time remains with ZONE_DMA for good but
>> then hot plug process can actually get new memory into ZONE_MOVABLE.
>> So having a way to get the snapshot of the zonelists on the system
>> after memory or node hot[un]plug is desirable. This change adds one
>> new sysfs interface (/sys/devices/system/memory/system_zone_details)
>> which will fetch and dump this information.
> I am still not sure I understand why this is helpful and who is the
> consumer for this interface and how it will benefit from the
> information. Dave (who doesn't seem to be on the CC list re-added) had
> another objection that this breaks one-value-per-file rule for sysfs
> files.

It helps in understanding the relative priority of each memory zone of the
system during various allocation scenarios. It's particularly helpful after
hotplug/unplug of additional memory into a previously non-existent zone on
a node.

> 
> This all smells like a debugging feature to me and so it should go into
> debugfs.

Sure, will make it a debugfs interface.
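
As a rough illustration of what "relative priority" means here, the
following userspace model walks a NULL-terminated list of zone references
in order and picks the first zone that can satisfy a request. The struct
and function names are simplified, hypothetical stand-ins for the kernel's
struct zone and struct zoneref; the real allocator also checks watermarks,
cpusets and more, none of which is modelled.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's struct zone / struct zoneref. */
struct model_zone {
	int node_id;
	const char *name;
	unsigned long free_pages;
};

struct model_zoneref {
	struct model_zone *zone;	/* NULL terminates the list */
};

/*
 * Return the first zone, in list order, with enough free pages for the
 * request, or NULL if none qualifies. Earlier entries have priority.
 */
struct model_zone *first_usable_zone(const struct model_zoneref *refs,
				     unsigned long nr_pages)
{
	size_t i;

	for (i = 0; refs[i].zone != NULL; i++) {
		if (refs[i].zone->free_pages >= nr_pages)
			return refs[i].zone;
	}
	return NULL;
}
```

In this model a fallback list {node 0, node 1} lets a request spill over
to node 1 when node 0 is short of pages, while a no-fallback list
{node 0} fails the same request. Where hot-added memory lands in such a
list decides how early it gets used.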



Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-09 Thread Michal Hocko
On Thu 08-09-16 08:16:58, Anshuman Khandual wrote:
> Each individual node in the system has a ZONELIST_FALLBACK zonelist
> and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
> order of zones during memory allocations. Sometimes it helps to dump
> these zonelists to see the priority order of various zones in them.
> 
> Particularly platforms which support memory hotplug into previously
> non existing zones (at boot), this interface helps in visualizing
> which all zonelists of the system at what priority level, the new
> hot added memory ends up in. POWER is such a platform where all the
> memory detected during boot time remains with ZONE_DMA for good but
> then hot plug process can actually get new memory into ZONE_MOVABLE.
> So having a way to get the snapshot of the zonelists on the system
> after memory or node hot[un]plug is desirable. This change adds one
> new sysfs interface (/sys/devices/system/memory/system_zone_details)
> which will fetch and dump this information.

I am still not sure I understand why this is helpful, who the consumer
of this interface is, and how it will benefit from the information. Dave
(who doesn't seem to be on the CC list; re-added) had another objection:
this breaks the one-value-per-file rule for sysfs files.

This all smells like a debugging feature to me and so it should go into
debugfs.

> Example zonelist information from a KVM guest.
> 
> [NODE (0)]
> ZONELIST_FALLBACK
> (0) (node 0) (DMA 0xc0006300)
> (1) (node 1) (DMA 0xc0016300)
> (2) (node 2) (DMA 0xc0026300)
> (3) (node 3) (DMA 0xc003ffdba300)
> ZONELIST_NOFALLBACK
> (0) (node 0) (DMA 0xc0006300)
> [NODE (1)]
> ZONELIST_FALLBACK
> (0) (node 1) (DMA 0xc0016300)
> (1) (node 2) (DMA 0xc0026300)
> (2) (node 3) (DMA 0xc003ffdba300)
> (3) (node 0) (DMA 0xc0006300)
> ZONELIST_NOFALLBACK
> (0) (node 1) (DMA 0xc0016300)
> [NODE (2)]
> ZONELIST_FALLBACK
> (0) (node 2) (DMA 0xc0026300)
> (1) (node 3) (DMA 0xc003ffdba300)
> (2) (node 0) (DMA 0xc0006300)
> (3) (node 1) (DMA 0xc0016300)
> ZONELIST_NOFALLBACK
> (0) (node 2) (DMA 0xc0026300)
> [NODE (3)]
> ZONELIST_FALLBACK
> (0) (node 3) (DMA 0xc003ffdba300)
> (1) (node 0) (DMA 0xc0006300)
> (2) (node 1) (DMA 0xc0016300)
> (3) (node 2) (DMA 0xc0026300)
> ZONELIST_NOFALLBACK
> (0) (node 3) (DMA 0xc003ffdba300)
> 
> Signed-off-by: Anshuman Khandual 
> ---
> Changes in V4:
> - Explicitly included mmzone.h header inside page_alloc.c
> - Changed the kernel address printing from %lx to %pK
> 
> Changes in V3:
> - Moved all these new sysfs code inside CONFIG_NUMA
> 
> Changes in V2:
> - Added more details into the commit message
> - Added sysfs interface file details into the commit message
> - Added ../ABI/testing/sysfs-system-zone-details file
> 
>  .../ABI/testing/sysfs-system-zone-details  |  9
>  drivers/base/memory.c                      | 52 ++
>  mm/page_alloc.c                            |  1 +
>  3 files changed, 62 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-system-zone-details
> 
> diff --git a/Documentation/ABI/testing/sysfs-system-zone-details b/Documentation/ABI/testing/sysfs-system-zone-details
> new file mode 100644
> index 000..9c13b2e
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-system-zone-details
> @@ -0,0 +1,9 @@
> +What:          /sys/devices/system/memory/system_zone_details
> +Date:          Sep 2016
> +KernelVersion: 4.8
> +Contact:       khand...@linux.vnet.ibm.com
> +Description:
> +               This read only file dumps the zonelist and it's constituent
> +               zones information for both ZONELIST_FALLBACK and ZONELIST_
> +               NOFALLBACK zonelists for each online node of the system at
> +               any given point of time.
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index dc75de9..c7ab991 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -442,7 +442,56 @@ print_block_size(struct device *dev, struct device_attribute *attr,
>         return sprintf(buf, "%lx\n", get_memory_block_size());
>  }
>  
> +#ifdef CONFIG_NUMA
> +static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist)
> +{
> +       unsigned int i;
> +       ssize_t count = 0;
> +
> +       for (i = 0; zonelist->_zonerefs[i].zone; i++) {
> +               count += sprintf(buf + count,
> 

Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-08 Thread Dave Hansen
On 09/07/2016 07:46 PM, Anshuman Khandual wrote:
> after memory or node hot[un]plug is desirable. This change adds one
> new sysfs interface (/sys/devices/system/memory/system_zone_details)
> which will fetch and dump this information.

Doesn't this violate the "one value per file" sysfs rule?  Does it
belong in debugfs instead?

I also really question the need to dump kernel addresses out, filtered
or not.  What's the point?


Re: [PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-08 Thread kbuild test robot
Hi Anshuman,

[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on v4.8-rc5]
[cannot apply to next-20160908]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Anshuman-Khandual/mm-Add-sysfs-interface-to-dump-each-node-s-zonelist-information/20160908-104922
config: x86_64-lkp (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/base/memory.c: In function 'dump_zonelist':
>> drivers/base/memory.c:455:4: error: 'zone_names' undeclared (first use in this function)
       zone_names[zonelist->_zonerefs[i].zone_idx],
       ^~
   drivers/base/memory.c:455:4: note: each undeclared identifier is reported only once for each function it appears in

vim +/zone_names +455 drivers/base/memory.c

   449          ssize_t count = 0;
   450
   451          for (i = 0; zonelist->_zonerefs[i].zone; i++) {
   452                  count += sprintf(buf + count,
   453                          "\t\t(%d) (node %d) (%-7s 0x%pK)\n", i,
   454                          zonelist->_zonerefs[i].zone->zone_pgdat->node_id,
 > 455                          zone_names[zonelist->_zonerefs[i].zone_idx],
   456                          (void *) zonelist->_zonerefs[i].zone);
   457          }
   458          return count;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data
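
The error above says zone_names is not visible from drivers/base/memory.c
in the tested tree, presumably because the array is private to another
file. One generic way around that kind of problem, sketched here in
userspace C, is to keep the table private and export a bounds-checked
accessor instead. The table contents and the zone_name_of() name are
illustrative assumptions only; this is not what the patch or the kernel
does.

```c
#include <stddef.h>

/* Illustrative table; the kernel's real zone set depends on the config. */
static const char * const demo_zone_names[] = {
	"DMA", "Normal", "Movable",
};

#define DEMO_NR_ZONES (sizeof(demo_zone_names) / sizeof(demo_zone_names[0]))

/*
 * Bounds-checked lookup that keeps the table private to this file.
 * Returns "unknown" for an out-of-range index rather than reading
 * past the end of the array.
 */
const char *zone_name_of(size_t idx)
{
	if (idx >= DEMO_NR_ZONES)
		return "unknown";
	return demo_zone_names[idx];
}
```

Other files would then only need the accessor's prototype from a shared
header, not a declaration of the array itself.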


[PATCH V4] mm: Add sysfs interface to dump each node's zonelist information

2016-09-07 Thread Anshuman Khandual
Each individual node in the system has a ZONELIST_FALLBACK zonelist
and a ZONELIST_NOFALLBACK zonelist. These zonelists decide fallback
order of zones during memory allocations. Sometimes it helps to dump
these zonelists to see the priority order of various zones in them.

Particularly on platforms which support memory hotplug into zones that
did not exist at boot, this interface helps in visualizing which
zonelists of the system the newly hot-added memory ends up in, and at
what priority level. POWER is such a platform: all the memory detected
during boot remains in ZONE_DMA for good, but the hotplug process can
then get new memory into ZONE_MOVABLE. So having a way to get a snapshot
of the zonelists on the system after memory or node hot[un]plug is
desirable. This change adds one new sysfs interface
(/sys/devices/system/memory/system_zone_details) which will fetch and
dump this information.

Example zonelist information from a KVM guest.

[NODE (0)]
ZONELIST_FALLBACK
(0) (node 0) (DMA 0xc0006300)
(1) (node 1) (DMA 0xc0016300)
(2) (node 2) (DMA 0xc0026300)
(3) (node 3) (DMA 0xc003ffdba300)
ZONELIST_NOFALLBACK
(0) (node 0) (DMA 0xc0006300)
[NODE (1)]
ZONELIST_FALLBACK
(0) (node 1) (DMA 0xc0016300)
(1) (node 2) (DMA 0xc0026300)
(2) (node 3) (DMA 0xc003ffdba300)
(3) (node 0) (DMA 0xc0006300)
ZONELIST_NOFALLBACK
(0) (node 1) (DMA 0xc0016300)
[NODE (2)]
ZONELIST_FALLBACK
(0) (node 2) (DMA 0xc0026300)
(1) (node 3) (DMA 0xc003ffdba300)
(2) (node 0) (DMA 0xc0006300)
(3) (node 1) (DMA 0xc0016300)
ZONELIST_NOFALLBACK
(0) (node 2) (DMA 0xc0026300)
[NODE (3)]
ZONELIST_FALLBACK
(0) (node 3) (DMA 0xc003ffdba300)
(1) (node 0) (DMA 0xc0006300)
(2) (node 1) (DMA 0xc0016300)
(3) (node 2) (DMA 0xc0026300)
ZONELIST_NOFALLBACK
(0) (node 3) (DMA 0xc003ffdba300)
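
To read the listing: each ZONELIST_FALLBACK entry is the order in which
candidates are tried for an allocation on that node. The sketch below is a
simplified user-space model of that walk, not the kernel allocator: it
ignores watermarks, cpusets and zone types, and pick_node() and the
free_pages[] input are hypothetical names introduced only for
illustration. The order array is copied from the "[NODE (1)]" entry above.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback order for node 1, copied from the "[NODE (1)]" entry above. */
static const int node1_fallback_order[] = { 1, 2, 3, 0 };

/*
 * Return the first node in the given order that still has free pages,
 * or -1 if every candidate is exhausted. free_pages[] is indexed by
 * node id and is a made-up input for this model.
 */
static int pick_node(const int *order, size_t n,
                     const unsigned long *free_pages)
{
    size_t i;

    assert(order != NULL && free_pages != NULL);

    for (i = 0; i < n; i++) {
        if (free_pages[order[i]] > 0)
            return order[i];
    }
    return -1;
}
```

With free pages on every node the local node 1 is chosen; if nodes 1 and 2
are empty the walk falls through to node 3 before it would consider node 0.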

Signed-off-by: Anshuman Khandual 
---
Changes in V4:
- Explicitly included mmzone.h header inside page_alloc.c
- Changed the kernel address printing from %lx to %pK

Changes in V3:
- Moved all of this new sysfs code inside CONFIG_NUMA

Changes in V2:
- Added more details into the commit message
- Added sysfs interface file details into the commit message
- Added ../ABI/testing/sysfs-system-zone-details file

 .../ABI/testing/sysfs-system-zone-details |  9
 drivers/base/memory.c                     | 52 ++
 mm/page_alloc.c                           |  1 +
 3 files changed, 62 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-system-zone-details

diff --git a/Documentation/ABI/testing/sysfs-system-zone-details b/Documentation/ABI/testing/sysfs-system-zone-details
new file mode 100644
index 0000000..9c13b2e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-system-zone-details
@@ -0,0 +1,9 @@
+What:  /sys/devices/system/memory/system_zone_details
+Date:  Sep 2016
+KernelVersion: 4.8
+Contact:   khand...@linux.vnet.ibm.com
+Description:
+   This read-only file dumps the zonelist and its constituent
+   zone information for both the ZONELIST_FALLBACK and
+   ZONELIST_NOFALLBACK zonelists of each online node in the
+   system at any given point in time.
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index dc75de9..c7ab991 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -442,7 +442,56 @@ print_block_size(struct device *dev, struct device_attribute *attr,
    return sprintf(buf, "%lx\n", get_memory_block_size());
 }
 
+#ifdef CONFIG_NUMA
+static ssize_t dump_zonelist(char *buf, struct zonelist *zonelist)
+{
+   unsigned int i;
+   ssize_t count = 0;
+
+   for (i = 0; zonelist->_zonerefs[i].zone; i++) {
+       count += sprintf(buf + count,
+           "\t\t(%d) (node %d) (%-7s 0x%pK)\n", i,
+           zonelist->_zonerefs[i].zone->zone_pgdat->node_id,
+           zone_names[zonelist->_zonerefs[i].zone_idx],
+           (void *) zonelist->_zonerefs[i].zone);
+   }
+   return count;
+}
+
+static ssize_t dump_zonelists(char *buf)
+{
+   struct zonelist *zonelist;
+   unsigned int node;
+   ssize_t count = 0;
+
+   for_each_online_node(node) {
+       zonelist = &(NODE_DATA(node)->
+               node_zonelists[ZONELIST_FALLBACK]);
+       count += sprintf(buf + count, "[NODE