Re: [PATCH V2 0/3] Define coherent device memory node

2017-02-13 Thread Anshuman Khandual
On 02/13/2017 09:04 PM, Vlastimil Babka wrote:
> On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
>>  These three patches define the CDM node with HugeTLB & Buddy allocation
>> isolation. Please refer to the last RFC posting mentioned here for details.
>> The series has been split up to ease the review process. The next parts of
>> the work, like VM flags, auto NUMA and KSM interactions with tagged VMAs,
>> will follow later.
> 
> Hi,
> 
> I'm not sure if splitting into smaller series and focusing on partial
> implementations is helpful at this point, until there's some consensus
> about the whole approach from a big-picture perspective.

I have been trying to reach that consensus through the CDM RFCs, but there
was not enough feedback from the larger MM community. Hence I decided to
split up the series and ask for smaller chunks of code to be reviewed and
debated, thinking this would be a better approach. These three patches are
complete in themselves from a functionality point of view. VMA flags, auto
NUMA and KSM are additional feature improvements on top of this core set
of patches.

RFC V2: https://lkml.org/lkml/2017/1/29/198  (zonelist and cpuset)
RFC V1: https://lkml.org/lkml/2016/10/24/19  (zonelist method)
RFC v2: https://lkml.org/lkml/2016/11/22/339 (cpuset method)

> 
> Note that it's also confusing that v1 of this partial patchset mentioned
> some alternative implementations, but only as git branches, and the
> discussion about their differences is linked elsewhere. That further
> makes meaningful review harder IMHO.

In my last RFC I had posted two alternate approaches besides the GFP-flag
based buddy method. There was not much discussion on them, apart from some
generic top-cpuset characteristics. The currently posted nodemask based
isolation method is the most minimal and least intrusive one, with a very
small amount of code change and little impact on common MM code, IMHO. But
yes, if required I can go ahead and post all the other alternate methods on
this thread, if looking into them helps with comparison and review.

> 
> Going back to the bigger picture, I've read the comments on previous
> postings and I think Jerome makes many good points in this subthread [1]
> against the idea of representing the device memory as generic memory
> nodes and expecting userspace to mbind() to them. So if I make a program
> that uses mbind() to back some mmapped area with memory of "devices like
> accelerators, GPU cards, network cards, FPGA cards, PLD cards etc which
> might contain on board memory", then it will get such memory... and then
> what? How will it benefit from it? I will also need to tell some driver
> to make the device do some operations with this memory, right? And that
> most likely won't be a generic operation. In that case I can also ask
> the driver to give me that memory in the first place, and it can apply
> whatever policies are best for the device in question? And it's also the
> driver that can detect if the device memory is being wasted by a process
> that isn't currently performing the interesting operations, while
> another process that does them had to fall back its allocations to system
> memory and thus runs slower. I expect the NUMA balancing can't catch
> that for device memory (and you also disable it anyway?) So I don't
> really see how a generic solution would work, without having a full
> concrete example, and thus it's really hard to say that this approach is
> the right way to go and should be merged.

Okay, let me attempt to explain this.

* User space using mbind() to get CDM memory is an additional benefit we
  get by making CDM plug in as a node and participate in the buddy
  allocator. But the overall idea, from the user space point of view, is
  that the application can allocate any generic buffer and then use that
  buffer either from the CPU side or from the device, without knowing
  where the buffer is actually mapped physically. That gives user space a
  seamless and transparent view in which CPU compute and possible
  device-based compute can work together. This is not possible with a
  driver-allocated buffer.
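
  To make the mbind() part concrete, here is a minimal userspace sketch,
  assuming the CDM memory shows up as node 1 on the system (the node id
  is an assumption for illustration only, not part of the patch set):

  #include <numaif.h>             /* mbind(), link with -lnuma */
  #include <sys/mman.h>
  #include <stdio.h>

  int main(void)
  {
          size_t len = 1UL << 21;         /* 2MB generic buffer */
          void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Bind the buffer to the assumed CDM node (id 1) */
          unsigned long nodemask = 1UL << 1;
          if (mbind(buf, len, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, MPOL_MF_MOVE))
                  perror("mbind");        /* buffer keeps default policy */

          ((char *)buf)[0] = 1;   /* CPU touch faults pages in on the node */
          munmap(buf, len);
          return 0;
  }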

* Placement of memory for the buffer can start out on system memory, when
  the CPU faults on it first. But a driver can manage the migration
  between system RAM and CDM memory once the buffer is being used from
  the CPU and the device interchangeably. As you have mentioned, the
  driver will have more information about where each part of the buffer
  should be placed at any point in time, and it can make that happen with
  migration. So both allocation and placement are decided by the driver
  at runtime. CDM provides the framework for this kind of device-assisted
  compute and driver-managed memory placement.
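
  As an illustration of the migration half of this, the following sketch
  uses the existing move_pages(2) syscall from userspace to shuttle a page
  between system RAM (node 0 here) and the assumed CDM node 1. A real
  driver would do the equivalent in-kernel based on device access
  patterns; the node ids and the helper are purely illustrative:

  #include <numaif.h>             /* move_pages(), link with -lnuma */
  #include <sys/mman.h>
  #include <stdio.h>

  /* Ask the kernel to migrate one page; returns the node the page ended
   * up on via status[0], or a negative errno for that page. */
  static int move_to_node(void *page, int node)
  {
          void *pages[1] = { page };
          int nodes[1] = { node };
          int status[1] = { 0 };

          if (move_pages(0 /* self */, 1, pages, nodes, status,
                         MPOL_MF_MOVE))
                  perror("move_pages");
          return status[0];
  }

  int main(void)
  {
          char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED)
                  return 1;

          buf[0] = 1;             /* fault the page in on system RAM */
          printf("now on node %d\n", move_to_node(buf, 1)); /* to CDM  */
          printf("now on node %d\n", move_to_node(buf, 0)); /* back    */
          return 0;
  }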

* If one application has CDM memory placed on its buffer but has not been
  using it for a long time, while another application that really wanted
  CDM is forced to fall back on system RAM, the driver can detect these
  kinds of situations through memory access patterns on the device HW and
  

Re: [PATCH V2 0/3] Define coherent device memory node

2017-02-13 Thread Vlastimil Babka
On 02/10/2017 11:06 AM, Anshuman Khandual wrote:
>   These three patches define the CDM node with HugeTLB & Buddy allocation
> isolation. Please refer to the last RFC posting mentioned here for details.
> The series has been split up to ease the review process. The next parts of
> the work, like VM flags, auto NUMA and KSM interactions with tagged VMAs,
> will follow later.

Hi,

I'm not sure if splitting into smaller series and focusing on partial
implementations is helpful at this point, until there's some consensus
about the whole approach from a big-picture perspective.

Note that it's also confusing that v1 of this partial patchset mentioned
some alternative implementations, but only as git branches, and the
discussion about their differences is linked elsewhere. That further
makes meaningful review harder IMHO.

Going back to the bigger picture, I've read the comments on previous
postings and I think Jerome makes many good points in this subthread [1]
against the idea of representing the device memory as generic memory
nodes and expecting userspace to mbind() to them. So if I make a program
that uses mbind() to back some mmapped area with memory of "devices like
accelerators, GPU cards, network cards, FPGA cards, PLD cards etc which
might contain on board memory", then it will get such memory... and then
what? How will it benefit from it? I will also need to tell some driver
to make the device do some operations with this memory, right? And that
most likely won't be a generic operation. In that case I can also ask
the driver to give me that memory in the first place, and it can apply
whatever policies are best for the device in question? And it's also the
driver that can detect if the device memory is being wasted by a process
that isn't currently performing the interesting operations, while
another process that does them had to fall back its allocations to system
memory and thus runs slower. I expect the NUMA balancing can't catch
that for device memory (and you also disable it anyway?) So I don't
really see how a generic solution would work, without having a full
concrete example, and thus it's really hard to say that this approach is
the right way to go and should be merged.

The only examples I've noticed that don't require any special operations
to benefit from placement in the "device memory" were fast memories like
MCDRAM, which differentiate themselves by the performance of generic CPU
operations, so that's not really "device memory" by your terminology. And
I would expect that policing access to such performance-differentiated
memory is already possible with e.g. cpusets?
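
For reference, a minimal sketch of the cpuset-based policing mentioned
above, assuming cgroup v1 cpusets mounted at /sys/fs/cgroup/cpuset, a
pre-created "no_cdm" cpuset, 64 CPUs, and system RAM on node 0 (mount
point, cpuset name and topology are all assumptions for illustration):

#include <stdio.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        if (fputs(val, f) < 0)
                perror(path);
        return fclose(f);
}

int main(void)
{
        char pid[16];

        /* cpuset v1 needs both cpus and mems set before tasks attach;
         * allow all CPUs but only node 0, i.e. no CDM node. */
        write_str("/sys/fs/cgroup/cpuset/no_cdm/cpuset.cpus", "0-63");
        write_str("/sys/fs/cgroup/cpuset/no_cdm/cpuset.mems", "0");

        /* Move ourselves into the restricted cpuset. */
        snprintf(pid, sizeof(pid), "%d", getpid());
        return write_str("/sys/fs/cgroup/cpuset/no_cdm/tasks", pid);
}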

Thanks,
Vlastimil

[1] https://lkml.kernel.org/r/20161025153256.gb6...@gmail.com

> https://lkml.org/lkml/2017/1/29/198
> 
> Changes in V2:
> 
> * Removed the redundant nodemask_has_cdm() check from the zonelist iterator
> * Dropped the nodemask_has_cdm() function itself
> * Added node_set/clear_state_cdm() functions and removed a bunch of #ifdefs
> * Moved the CDM helper functions from node.h into the nodemask.h header
> * Fixed the build failure with an additional CONFIG_NEED_MULTIPLE_NODES check
> 
> Previous V1: (https://lkml.org/lkml/2017/2/8/329)
> 
> Anshuman Khandual (3):
>   mm: Define coherent device memory (CDM) node
>   mm: Enable HugeTLB allocation isolation for CDM nodes
>   mm: Enable Buddy allocation isolation for CDM nodes
> 
>  Documentation/ABI/stable/sysfs-devices-node |  7 +
>  arch/powerpc/Kconfig                        |  1 +
>  arch/powerpc/mm/numa.c                      |  7 +
>  drivers/base/node.c                         |  6 +
>  include/linux/nodemask.h                    | 58 +-
>  mm/Kconfig                                  |  4 +
>  mm/hugetlb.c                                | 25 +-
>  mm/memory_hotplug.c                         |  3 +
>  mm/page_alloc.c                             | 24 +-
>  9 files changed, 123 insertions(+), 12 deletions(-)
> 



[PATCH V2 0/3] Define coherent device memory node

2017-02-10 Thread Anshuman Khandual
These three patches define the CDM node with HugeTLB & Buddy allocation
isolation. Please refer to the last RFC posting mentioned here for details.
The series has been split up to ease the review process. The next parts of
the work, like VM flags, auto NUMA and KSM interactions with tagged VMAs,
will follow later.

https://lkml.org/lkml/2017/1/29/198

Changes in V2:

* Removed the redundant nodemask_has_cdm() check from the zonelist iterator
* Dropped the nodemask_has_cdm() function itself
* Added node_set/clear_state_cdm() functions and removed a bunch of #ifdefs
* Moved the CDM helper functions from node.h into the nodemask.h header
* Fixed the build failure with an additional CONFIG_NEED_MULTIPLE_NODES check
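
For readers without the patches at hand, a rough sketch of what the
node_set/clear_state_cdm() helpers in nodemask.h might look like. This
assumes a CONFIG_COHERENT_DEVICE option and a new N_COHERENT_DEVICE entry
in enum node_states; the bodies are illustrative, not the actual patch:

#ifdef CONFIG_COHERENT_DEVICE
static inline void node_set_state_cdm(int node)
{
	node_set_state(node, N_COHERENT_DEVICE);
}

static inline void node_clear_state_cdm(int node)
{
	node_clear_state(node, N_COHERENT_DEVICE);
}
#else
static inline void node_set_state_cdm(int node) { }
static inline void node_clear_state_cdm(int node) { }
#endif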

Previous V1: (https://lkml.org/lkml/2017/2/8/329)

Anshuman Khandual (3):
  mm: Define coherent device memory (CDM) node
  mm: Enable HugeTLB allocation isolation for CDM nodes
  mm: Enable Buddy allocation isolation for CDM nodes

 Documentation/ABI/stable/sysfs-devices-node |  7 +
 arch/powerpc/Kconfig                        |  1 +
 arch/powerpc/mm/numa.c                      |  7 +
 drivers/base/node.c                         |  6 +
 include/linux/nodemask.h                    | 58 +-
 mm/Kconfig                                  |  4 +
 mm/hugetlb.c                                | 25 +-
 mm/memory_hotplug.c                         |  3 +
 mm/page_alloc.c                             | 24 +-
 9 files changed, 123 insertions(+), 12 deletions(-)

-- 
2.9.3


