Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-11-16 Thread chetan L
On Fri, Sep 8, 2017 at 1:43 PM, Dan Williams  wrote:
> On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu  wrote:
>> On 2017/9/8 1:27, Jerome Glisse wrote:
> [..]
>>> No, these are two orthogonal things; they do not conflict with each other,
>>> quite the contrary. HMM (the CDM part is no different) is a set of helpers,
>>> see it as a toolbox, for device drivers.
>>>
>>> HMAT is a way for firmware to report memory resources with more information
>>> than just ranges of physical addresses. HMAT is specific to platforms that
>>> rely on ACPI. HMAT does not provide any helpers to manage this memory.
>>>
>>> So a device driver can get information about device memory from HMAT and
>>> then use HMM to help in managing and using this memory.
>>>
>>
>> Yes, but as Balbir mentioned, that requires:
>> 1. Don't online the memory as a NUMA node
>> 2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver
>>
>> And I'm not sure whether Intel is going to use this HMM-CDM based method
>> for their "target domain" memory, or whether they prefer the NUMA approach?
>> Ross? Dan?
>
> The starting / strawman proposal for performance differentiated memory
> ranges is to get platform firmware to mark them reserved by default.
> Then, after we parse the HMAT, make them available via the device-dax
> mechanism so that applications that need 100% guaranteed access to
> these potentially high-value / limited-capacity ranges can be sure to
> get them by default, i.e. before any random kernel objects are placed
> in them. Otherwise, if there are no dedicated users for the memory
> ranges via device-dax, or they don't need the total capacity, we want
> to hotplug that memory into the general purpose memory allocator with
> a numa node number so typical numactl and memory-management flows are
> enabled.
>
> Ideally this would not be specific to HMAT and any agent that knows
> differentiated performance characteristics of a memory range could use
> this scheme.

@Dan/Ross

With this approach, in an SVM environment, if you wanted a PRI (page
grant) request to be satisfied from this HMAT-indexed memory node,
do you think we could make that happen? If yes, is that something
you are currently working on?


Chetan


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-08 Thread Dan Williams
On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu  wrote:
> On 2017/9/8 1:27, Jerome Glisse wrote:
[..]
>> No, these are two orthogonal things; they do not conflict with each other,
>> quite the contrary. HMM (the CDM part is no different) is a set of helpers,
>> see it as a toolbox, for device drivers.
>>
>> HMAT is a way for firmware to report memory resources with more information
>> than just ranges of physical addresses. HMAT is specific to platforms that
>> rely on ACPI. HMAT does not provide any helpers to manage this memory.
>>
>> So a device driver can get information about device memory from HMAT and
>> then use HMM to help in managing and using this memory.
>>
>
> Yes, but as Balbir mentioned, that requires:
> 1. Don't online the memory as a NUMA node
> 2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver
>
> And I'm not sure whether Intel is going to use this HMM-CDM based method
> for their "target domain" memory, or whether they prefer the NUMA approach?
> Ross? Dan?

The starting / strawman proposal for performance differentiated memory
ranges is to get platform firmware to mark them reserved by default.
Then, after we parse the HMAT, make them available via the device-dax
mechanism so that applications that need 100% guaranteed access to
these potentially high-value / limited-capacity ranges can be sure to
get them by default, i.e. before any random kernel objects are placed
in them. Otherwise, if there are no dedicated users for the memory
ranges via device-dax, or they don't need the total capacity, we want
to hotplug that memory into the general purpose memory allocator with
a numa node number so typical numactl and memory-management flows are
enabled.

Ideally this would not be specific to HMAT and any agent that knows
differentiated performance characteristics of a memory range could use
this scheme.
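
A minimal userspace sketch of the device-dax consumption model described
above; the /dev/dax0.0 instance name and the 2 MiB mapping size are
assumptions for illustration, and the alternative NUMA-node path would simply
be ordinary numactl/libnuma binding rather than a direct mapping:

/*
 * Minimal sketch of the device-dax model: the reserved,
 * performance-differentiated range is exposed as a character device and an
 * application maps it directly, guaranteeing it gets that memory.  The
 * /dev/dax0.0 instance name and the 2 MiB size are assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* device-dax maps are alignment-sized */
	int fd = open("/dev/dax0.0", O_RDWR);
	void *addr;

	if (fd < 0) {
		perror("open /dev/dax0.0");
		return 1;
	}

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	memset(addr, 0, len);		/* touch the guaranteed-placement memory */
	munmap(addr, len);
	close(fd);
	return 0;
}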


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-08 Thread Jerome Glisse
On Fri, Sep 08, 2017 at 01:43:44PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
> <>
> > Does HMAT support device hotplug? I am unfamiliar with the whole inner
> > working of ACPI versus PCIe. Anyway, I don't see any issue with device
> > memory also showing through HMAT, but like I said, the device driver for
> > the device will want to be in total control of that memory.
> 
> Yep, the HMAT will support device hotplug via the _HMA method (section 6.2.18
> of ACPI 6.2).  This basically supplies an entirely new HMAT that the system
> will use to replace the current one.
> 
> I don't yet have support for _HMA in my enabling, but I do intend to add
> support for it once we settle on a sysfs API for the regular boot-time case.
> 
> > Like I said, the issue here is that the core kernel is unaware of the
> > device activity, i.e. on what part of memory the device is actively
> > working. So the core mm can not make an informed decision on what should
> > be migrated to device memory. Also we do not want regular memory
> > allocation to end up in device memory unless explicitly asked for. There
> > are a few reasons for that. First, this memory might not only be used for
> > compute tasks but also for graphics, and in that case there are hard
> > constraints on physically contiguous memory allocation that require the
> > GPU to move things around to make room for graphics objects (can't allow
> > GUP).
> >
> > Second reason, the device memory is inherently unreliable. If there is a
> > bug in the device driver or the user manages to trigger a faulty condition
> > on the GPU, the device might need a hard reset (i.e. cut PCIe power to the
> > device), which leads to loss of memory content. While GPUs are becoming
> > more and more resilient, they are still prone to lockups.
> >
> > Finally, for GPUs there is a common pattern of memory over-commit. You
> > pretend to each application that it is the only one and allow each of them
> > to allocate all of the device memory, or more than it could with strict
> > sharing. As GPUs have long timeslices between switching to different
> > contexts/applications, they can easily move large chunks of the process
> > memory out and in at context/application switching. This has proven to be
> > a key aspect of allowing maximum performance across several concurrent
> > applications/contexts.
> >
> > To implement this, the easiest solution is for the device to lie about how
> > much memory it has and use the system memory as an overflow.
> 
> I don't think any of this precludes the HMAT being involved.  This is all very
> similar to what I think we need to do for high bandwidth memory, for example.
> We don't want the OS to use it for anything, and we want all of it to be
> available for applications to allocate and use for their specific workload.
> We don't want to make any assumptions about how it can or should be used.
> 
> The HMAT is just there to give us a few things:
> 
> 1) It provides us with an explicit way of telling the OS not to use the
> memory, in the form of the "Reservation hint" flag in the Memory Subsystem
> Address Range Structure (ACPI 6.2 section 5.2.27.3).  I expect that this will
> be set for persistent memory and HBM, and it sounds like you'd expect it to be
> set for your device memory as well.
> 
> 2) It provides us with a way of telling userspace "hey, I know about some
> memory, and I can tell you its performance characteristics".  All control of
> how this memory is allocated and used is still left to userspace.
> 
> > I am not saying that NUMA is not the way forward, I am saying that as it
> > is today it is not suited for this. It is lacking metrics, it is lacking
> > logic, it is lacking features. We could add all this, but it is a lot of
> > work and I don't feel that we have enough real-world experience to do so
> > now. I would rather have each device grow proper infrastructure in its
> > driver through a device-specific API.
> 
> To be clear, I'm not proposing that we teach the NUMA code how to
> automatically allocate for a given numa node, balance, etc. memory described
> by the HMAT.  All I want is an API that says "here is some memory, I'll tell
> you all I can about it and let you do with it what you will", and perhaps a
> way to manually allocate what you want.
> 
> And yes, this is very hand-wavy at this point. :)  After I get the sysfs
> portion sussed out the next step is to work on enabling something like
> libnuma to allow the memory to be manually allocated.
> 
> I think this works for both my use case and yours, correct?

Depends what you mean. Using NUMA as it is today, no. Growing a new API on
the side of libnuma, maybe. It is hard to say. Right now GPUs do have a very
rich API, see OpenCL or the CUDA API. Anything less expressive than what they
offer would not work.

The existing libnuma API is ill-suited. It is too static. GPU workloads are
more dynamic, and as a result the virtual address space 


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-08 Thread Ross Zwisler
On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
<>
> Does HMAT support device hotplug? I am unfamiliar with the whole inner
> working of ACPI versus PCIe. Anyway, I don't see any issue with device
> memory also showing through HMAT, but like I said, the device driver for
> the device will want to be in total control of that memory.

Yep, the HMAT will support device hotplug via the _HMA method (section 6.2.18
of ACPI 6.2).  This basically supplies an entirely new HMAT that the system
will use to replace the current one.

I don't yet have support for _HMA in my enabling, but I do intend to add
support for it once we settle on a sysfs API for the regular boot-time case.

> Like I said, the issue here is that the core kernel is unaware of the
> device activity, i.e. on what part of memory the device is actively
> working. So the core mm can not make an informed decision on what should
> be migrated to device memory. Also we do not want regular memory
> allocation to end up in device memory unless explicitly asked for. There
> are a few reasons for that. First, this memory might not only be used for
> compute tasks but also for graphics, and in that case there are hard
> constraints on physically contiguous memory allocation that require the
> GPU to move things around to make room for graphics objects (can't allow
> GUP).
>
> Second reason, the device memory is inherently unreliable. If there is a
> bug in the device driver or the user manages to trigger a faulty condition
> on the GPU, the device might need a hard reset (i.e. cut PCIe power to the
> device), which leads to loss of memory content. While GPUs are becoming
> more and more resilient, they are still prone to lockups.
>
> Finally, for GPUs there is a common pattern of memory over-commit. You
> pretend to each application that it is the only one and allow each of them
> to allocate all of the device memory, or more than it could with strict
> sharing. As GPUs have long timeslices between switching to different
> contexts/applications, they can easily move large chunks of the process
> memory out and in at context/application switching. This has proven to be
> a key aspect of allowing maximum performance across several concurrent
> applications/contexts.
>
> To implement this, the easiest solution is for the device to lie about how
> much memory it has and use the system memory as an overflow.

I don't think any of this precludes the HMAT being involved.  This is all very
similar to what I think we need to do for high bandwidth memory, for example.
We don't want the OS to use it for anything, and we want all of it to be
available for applications to allocate and use for their specific workload.
We don't want to make any assumptions about how it can or should be used.

The HMAT is just there to give us a few things:

1) It provides us with an explicit way of telling the OS not to use the
memory, in the form of the "Reservation hint" flag in the Memory Subsystem
Address Range Structure (ACPI 6.2 section 5.2.27.3).  I expect that this will
be set for persistent memory and HBM, and it sounds like you'd expect it to be
set for your device memory as well.

2) It provides us with a way of telling userspace "hey, I know about some
memory, and I can tell you its performance characteristics".  All control of
how this memory is allocated and used is still left to userspace.

> I am not saying that NUMA is not the way forward, I am saying that as it is
> today it is not suited for this. It is lacking metrics, it is lacking logic,
> it is lacking features. We could add all this, but it is a lot of work and I
> don't feel that we have enough real-world experience to do so now. I would
> rather have each device grow proper infrastructure in its driver through a
> device-specific API.

To be clear, I'm not proposing that we teach the NUMA code how to
automatically allocate for a given numa node, balance, etc. memory described
by the HMAT.  All I want is an API that says "here is some memory, I'll tell
you all I can about it and let you do with it what you will", and perhaps a
way to manually allocate what you want.

And yes, this is very hand-wavy at this point. :)  After I get the sysfs
portion sussed out the next step is to work on enabling something like
libnuma to allow the memory to be manually allocated.
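
A minimal sketch of that manual-allocation flow using today's libnuma API;
the node number is an assumption for illustration, and the HMAT-aware sysfs
and libnuma extensions discussed here do not exist yet (build with -lnuma):

/*
 * Minimal libnuma sketch of manual placement: explicitly back an allocation
 * with one (possibly CPU-less) memory node.  Node 2 is an assumed example of
 * an HMAT-described node.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB */
	int node = 2;			/* assumed target memory node */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	/* Ask for pages served specifically from the chosen node. */
	buf = numa_alloc_onnode(len, node);
	if (!buf) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		return 1;
	}

	memset(buf, 0, len);		/* fault the pages in on that node */
	numa_free(buf, len);
	return 0;
}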

I think this works for both my use case and yours, correct?

> Then identify common patterns and from there try to build a sane API (if any
> such thing exists :)) rather than trying today to build the whole house from
> the ground up with just a foggy idea of how it should look in the end.

Yea, I do see your point.  My worry is that if I define an API, and you define
an API, we'll end up in two different places with people using our different
APIs, then:

https://xkcd.com/927/

:)

The HMAT enabling I'm trying to do is very passive - it doesn't actively do
*anything* with the memory; its entire purpose is to give userspace more
information about the memory so userspace can make 


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-07 Thread Bob Liu
On 2017/9/8 1:27, Jerome Glisse wrote:
>> On 2017/9/6 10:12, Jerome Glisse wrote:
>>> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
 On 2017/9/6 2:54, Ross Zwisler wrote:
> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
 On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jérôme Glisse wrote:
> 
> [...]
> 
>>> For HMM each process gives hints (somewhat similar to mbind) for ranges of
>>> virtual addresses to the device kernel driver (through some API like OpenCL
>>> or CUDA for GPUs, for instance). All this is device-driver-specific ioctl.
>>>
>>> The kernel device driver has an overall view of all the processes that use
>>> the device and each of the memory advice they gave. From that information
>>> the kernel device driver decides what part of each process address space to
>>> migrate to device memory.
>>
>> Oh, I mean CDM-HMM.  I'm fine with HMM.
> 
> They are one and the same really. In both cases HMM is just a set of helpers
> for device drivers.
> 
>>> This is obviously dynamic and likely to change over the process lifetime.
>>>
>>> My understanding is that HMAT wants a similar API to allow a process to
>>> give direction on where each range of virtual addresses should be
>>> allocated. It is expected that most
>>
>> Right, but it is not clear who should manage the physical memory allocation
>> and set up the pagetable mapping. A new driver or the kernel?
> 
> Physical device memory is managed by the kernel device driver as it is today
> and as it will be tomorrow. HMM does not change that, nor does it require
> any change to that.
> 

Can someone from Intel give more information about the plan for managing
HMAT-reported memory?

> Migrating process memory to or from the device is done by the kernel through
> the regular page migration. HMM provides new helpers for device drivers to
> initiate such migration. There is no mechanism like auto NUMA migration,
> for the reasons I explained previously.
> 
> The kernel device driver uses all the knowledge it has to decide what to
> migrate to device memory. Nothing new here either; it is what happens today
> for specially allocated device objects, and it will just happen all the same
> for regular mmap memory (private anonymous or mmap of a regular file of a
> filesystem).
> 
> 
> So every low-level thing happens in the kernel. Userspace only provides
> directives to the kernel device driver through a device-specific API. But
> the kernel device driver can ignore or override those directives.
> 
> 
>>> software can easily infer what part of its address space will need more
>>> bandwidth, smaller latency versus what part is sparsely accessed ...
>>>
>>> For HMAT I think the first targets are HBM and persistent memory, and
>>> device memory might be added later if that makes sense.
>>>
>>
>> Okay, so there are two potential ways for CPU-addressable cache-coherent
>> device memory (or CPU-less NUMA memory, or "target domain" memory in the
>> ACPI spec)?
>> 1. CDM-HMM
>> 2. HMAT
> 
> No, these are two orthogonal things; they do not conflict with each other,
> quite the contrary. HMM (the CDM part is no different) is a set of helpers,
> see it as a toolbox, for device drivers.
> 
> HMAT is a way for firmware to report memory resources with more information
> than just ranges of physical addresses. HMAT is specific to platforms that
> rely on ACPI. HMAT does not provide any helpers to manage this memory.
> 
> So a device driver can get information about device memory from HMAT and
> then use HMM to help in managing and using this memory.
> 

Yes, but as Balbir mentioned, that requires:
1. Don't online the memory as a NUMA node
2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver

And I'm not sure whether Intel is going to use this HMM-CDM based method for
their "target domain" memory, or whether they prefer the NUMA approach?
Ross? Dan?

--
Thanks,
Bob Liu




Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-07 Thread Jerome Glisse
> On 2017/9/6 10:12, Jerome Glisse wrote:
> > On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> >> On 2017/9/6 2:54, Ross Zwisler wrote:
> >>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>  On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > On 2017/9/4 23:51, Jerome Glisse wrote:
> >> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>> On 2017/8/17 8:05, Jérôme Glisse wrote:

[...]

> > For HMM each process gives hints (somewhat similar to mbind) for ranges of
> > virtual addresses to the device kernel driver (through some API like OpenCL
> > or CUDA for GPUs, for instance). All this is device-driver-specific ioctl.
> > 
> > The kernel device driver has an overall view of all the processes that use
> > the device and each of the memory advice they gave. From that information
> > the kernel device driver decides what part of each process address space to
> > migrate to device memory.
> 
> Oh, I mean CDM-HMM.  I'm fine with HMM.

They are one and the same really. In both cases HMM is just a set of helpers
for device drivers.

> > This is obviously dynamic and likely to change over the process lifetime.
> > 
> > My understanding is that HMAT wants a similar API to allow a process to
> > give direction on where each range of virtual addresses should be
> > allocated. It is expected that most
> 
> Right, but it is not clear who should manage the physical memory allocation
> and set up the pagetable mapping. A new driver or the kernel?

Physical device memory is managed by the kernel device driver as it is today
and as it will be tomorrow. HMM does not change that, nor does it require
any change to that.

Migrating process memory to or from the device is done by the kernel through
the regular page migration. HMM provides new helpers for device drivers to
initiate such migration. There is no mechanism like auto NUMA migration,
for the reasons I explained previously.

The kernel device driver uses all the knowledge it has to decide what to
migrate to device memory. Nothing new here either; it is what happens today
for specially allocated device objects, and it will just happen all the same
for regular mmap memory (private anonymous or mmap of a regular file of a
filesystem).


So every low-level thing happens in the kernel. Userspace only provides
directives to the kernel device driver through a device-specific API. But the
kernel device driver can ignore or override those directives.
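
For a concrete sense of what such userspace directives look like today, a
short sketch using the CUDA runtime (one of the device-specific APIs named
above); device 0 and the buffer size are assumptions, and the driver remains
free to override these hints:

/*
 * Illustration of userspace giving placement hints through a device-specific
 * API (here the CUDA runtime): the application advises, the kernel driver
 * decides what to migrate.  Build with nvcc or link against libcudart.
 */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
	float *buf;
	size_t len = 64UL << 20;	/* 64 MiB of managed memory */
	int device = 0;			/* assumed GPU */

	if (cudaMallocManaged((void **)&buf, len, cudaMemAttachGlobal) != cudaSuccess) {
		fprintf(stderr, "cudaMallocManaged failed\n");
		return 1;
	}

	/* Hint: prefer to place this range in the GPU's device memory. */
	cudaMemAdvise(buf, len, cudaMemAdviseSetPreferredLocation, device);

	/* Optionally migrate it up front instead of waiting for GPU faults. */
	cudaMemPrefetchAsync(buf, len, device, 0);
	cudaDeviceSynchronize();

	cudaFree(buf);
	return 0;
}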


> > software can easily infer what part of its address space will need more
> > bandwidth, smaller latency versus what part is sparsely accessed ...
> > 
> > For HMAT I think the first targets are HBM and persistent memory, and
> > device memory might be added later if that makes sense.
> > 
> 
> Okay, so there are two potential ways for CPU-addressable cache-coherent
> device memory (or CPU-less NUMA memory, or "target domain" memory in the
> ACPI spec)?
> 1. CDM-HMM
> 2. HMAT

No, these are two orthogonal things; they do not conflict with each other,
quite the contrary. HMM (the CDM part is no different) is a set of helpers,
see it as a toolbox, for device drivers.

HMAT is a way for firmware to report memory resources with more information
than just ranges of physical addresses. HMAT is specific to platforms that
rely on ACPI. HMAT does not provide any helpers to manage this memory.

So a device driver can get information about device memory from HMAT and
then use HMM to help in managing and using this memory.

Jérôme
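
A rough kernel-side sketch of the registration step such a driver would
perform with the helper this patch adds; the callback signatures follow a
reading of the HMM-v25 series and may not match the final API exactly, and
my_dev/my_cdm_res are placeholders supplied by the driver:

/*
 * Rough sketch (not taken from the patch itself) of how a driver for
 * CPU-addressable coherent device memory might register its range with HMM,
 * assuming the hmm_devmem_add_resource() helper added by this series.
 */
#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/ioport.h>
#include <linux/mm.h>

static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* Hand the page back to the device's own memory allocator. */
}

static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma, unsigned long addr,
			   const struct page *page, unsigned int flags,
			   pmd_t *pmdp)
{
	/* CPU fault on device memory: migrate the page back to system RAM. */
	return -EINVAL;		/* placeholder */
}

static const struct hmm_devmem_ops my_devmem_ops = {
	.free	= my_devmem_free,
	.fault	= my_devmem_fault,
};

static int my_register_cdm(struct device *my_dev, struct resource *my_cdm_res)
{
	struct hmm_devmem *devmem;

	/* Hotplug the coherent device memory as ZONE_DEVICE pages. */
	devmem = hmm_devmem_add_resource(&my_devmem_ops, my_dev, my_cdm_res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/*
	 * From here the driver can hand out struct pages from this range and
	 * use the HMM migrate helpers to move process memory in and out.
	 */
	return 0;
}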


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-07 Thread Jerome Glisse
> On 2017/9/6 10:12, Jerome Glisse wrote:
> > On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> >> On 2017/9/6 2:54, Ross Zwisler wrote:
> >>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>  On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > On 2017/9/4 23:51, Jerome Glisse wrote:
> >> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>  Unlike unaddressable memory, coherent device memory has a real
>  resource associated with it on the system (as CPU can address
>  it). Add a new helper to hotplug such memory within the HMM
>  framework.
> 
> >>>
> >>> Got a new question: coherent device (e.g. CCIX) memory is likely
> >>> reported to the OS through ACPI and recognized as a NUMA memory node.
> >>> Then how can its memory be captured and managed by the HMM framework?
> >>>
> >>
> >> The only platform that has such memory today is powerpc, and it is not
> >> reported as regular memory by the firmware, hence why they need this
> >> helper.
> >>
> >> I don't think anyone has defined anything yet for x86 and ACPI. As this is
> >
> > Not yet, but now the ACPI spec has the Heterogeneous Memory Attribute
> > Table (HMAT) defined in ACPI 6.2.
> > The HMAT can cover CPU-addressable memory types (though not non-cache-
> > coherent on-device memory).
> >
> > Ross from Intel has already done some work on this, see:
> > https://lwn.net/Articles/724562/
> >
> > arm64 supports ACPI also; there will likely be more of this kind of device
> > when CCIX is out (should be very soon if on schedule).
> 
>  HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
>  i.e. when you have several kinds of memory, each with different
>  characteristics:
>    - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
>      small (i.e. a few gigabytes)
>    - Persistent memory: slower (both latency and bandwidth), big (terabytes)
>    - DDR (good old memory): characteristics are between HBM and persistent
> 
>  So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
>  Note that device memory can have a hierarchy of memory itself (HBM, GDDR,
>  and maybe even persistent memory).
> 
> >> memory on a PCIe-like interface, I don't expect it to be reported as a
> >> NUMA memory node but as an I/O range like any regular PCIe resource. The
> >> device driver, through capability flags, would then figure out if the
> >> link between the device and CPU is CCIX-capable; if so, it can use this
> >> helper to hotplug it as device memory.
> >>
> >
> > From my point of view, cache-coherent device memory will be popular soon
> > and reported through ACPI/UEFI. Extending NUMA policy still sounds more
> > reasonable to me.
> 
>  Cache-coherent devices will be reported through standard mechanisms
>  defined by the bus standard they are using. To my knowledge all the
>  standards are either on top of PCIe or are similar to PCIe.
> 
>  It is true that on many platforms the PCIe resource is managed/initialized
>  by the BIOS (UEFI), but it is platform specific. In some cases we reprogram
>  what the BIOS picks.
> 
>  So like I was saying, I don't expect the BIOS/UEFI to report device memory
>  as regular memory. It will be reported as a regular PCIe resource, and then
>  the device driver will be able to determine through some flags if the link
>  between the CPU(s) and the device is cache coherent or not. At that point
>  the device driver can register it with the HMM helper.
> 
> 
>  The whole NUMA discussion happened several times in the past; I suggest
>  looking in the mm list archive for them. But it was ruled out for several
>  reasons. Off the top of my head:
>    - people hate CPU-less nodes, and device memory is inherently CPU-less
> >>>
> >>> With the introduction of the HMAT in ACPI 6.2 one of the things that
> >>> was added was the ability to have an ACPI proximity domain that isn't
> >>> associated with a CPU.  This can be seen in the changes in the text of
> >>> the "Proximity Domain" field in table 5-73 which describes the "Memory
> >>> Affinity Structure".  One of the major features of the HMAT was the
> >>> separation of "Initiator" proximity domains (CPUs, devices that initiate
> >>> memory transfers), and "target" proximity domains (memory regions, be
> >>> they attached to a CPU or some other device).
> >>>
> >>> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> >>> already in a place where we have to support CPU-less 


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-06 Thread Bob Liu
On 2017/9/6 10:12, Jerome Glisse wrote:
> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
 On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> On 2017/9/4 23:51, Jerome Glisse wrote:
>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
 Unlike unaddressable memory, coherent device memory has a real
 resource associated with it on the system (as CPU can address
 it). Add a new helper to hotplug such memory within the HMM
 framework.

>>>
>>> Got a new question: coherent device (e.g. CCIX) memory is likely
>>> reported to the OS through ACPI and recognized as a NUMA memory node.
>>> Then how can its memory be captured and managed by the HMM framework?
>>>
>>
>> The only platform that has such memory today is powerpc, and it is not
>> reported as regular memory by the firmware, hence why they need this helper.
>>
>> I don't think anyone has defined anything yet for x86 and ACPI. As this is
>
> Not yet, but now the ACPI spec has the Heterogeneous Memory Attribute
> Table (HMAT) defined in ACPI 6.2.
> The HMAT can cover CPU-addressable memory types (though not non-cache-
> coherent on-device memory).
>
> Ross from Intel has already done some work on this, see:
> https://lwn.net/Articles/724562/
>
> arm64 supports ACPI also; there will likely be more of this kind of device
> when CCIX is out (should be very soon if on schedule).

 HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
 i.e. when you have several kinds of memory, each with different characteristics:
   - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
     small (i.e. a few gigabytes)
   - Persistent memory: slower (both latency and bandwidth), big (terabytes)
   - DDR (good old memory): characteristics are between HBM and persistent

 So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
 Note that device memory can have a hierarchy of memory itself (HBM, GDDR and
 maybe even persistent memory).

>> memory on a PCIe-like interface, I don't expect it to be reported as a
>> NUMA memory node but as an I/O range like any regular PCIe resource. The
>> device driver, through capability flags, would then figure out if the link
>> between the device and CPU is CCIX-capable; if so, it can use this helper
>> to hotplug it as device memory.
>>
>
> From my point of view, cache-coherent device memory will be popular soon and
> reported through ACPI/UEFI. Extending NUMA policy still sounds more
> reasonable to me.

 Cache coherent devices will be reported through standard mechanisms defined by
 the bus standard they are using. To my knowledge all the standards are either
 on top of PCIE or are similar to PCIE.

 It is true that on many platforms PCIE resources are managed/initialized by the
 BIOS (UEFI), but that is platform specific. In some cases we reprogram what the
 BIOS picked.

 So, like I was saying, I don't expect the BIOS/UEFI to report device memory as
 regular memory. It will be reported as a regular PCIE resource, and then the
 device driver will be able to determine through some flags whether the link
 between the CPU(s) and the device is cache coherent or not. At that point the
 device driver can register it with the HMM helper.
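
As an illustration, here is a minimal sketch of the driver flow just described,
assuming the hmm_devmem_add_resource() helper this series proposes (names may
differ in the final API); MY_COHERENT_LINK_CAP_ID is a purely hypothetical
capability ID and the hmm_devmem_ops callbacks are left as placeholders:

#include <linux/pci.h>
#include <linux/hmm.h>
#include <linux/err.h>

/* Hypothetical capability ID, for illustration only. */
#define MY_COHERENT_LINK_CAP_ID	0x1234

static const struct hmm_devmem_ops my_cdm_ops = {
	/* .fault and .free callbacks would be supplied by the driver */
};

static int my_driver_register_cdm(struct pci_dev *pdev, int bar)
{
	struct hmm_devmem *devmem;

	/* Only treat the BAR as device memory if the link is cache coherent. */
	if (!pci_find_ext_capability(pdev, MY_COHERENT_LINK_CAP_ID))
		return -ENODEV;

	/* BAR 'bar' covers the CPU-addressable, coherent device memory. */
	devmem = hmm_devmem_add_resource(&my_cdm_ops, &pdev->dev,
					 &pdev->resource[bar]);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* The range is now ZONE_DEVICE memory managed through HMM. */
	return 0;
}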


 The whole NUMA discussion happened several times in the past; I suggest
 looking in the mm list archives for it. But it was ruled out for several
 reasons. Off the top of my head:
   - people hate CPU-less nodes, and device memory is inherently CPU-less
>>>
>>> With the introduction of the HMAT in ACPI 6.2 one of the things that was 
>>> added
>>> was the ability to have an ACPI proximity domain that isn't associated with 
>>> a
>>> CPU.  This can be seen in the changes in the text of the "Proximity Domain"
>>> field in table 5-73 which describes the "Memory Affinity Structure".  One of
>>> the major features of the HMAT was the separation of "Initiator" proximity
>>> domains (CPUs, devices that initiate memory transfers), and "target" 
>>> proximity
>>> domains (memory regions, be they attached to a CPU or some other device).
>>>
>>> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
>>> already in a place where we have to support CPU-less NUMA nodes.
>>>
   - device drivers want total control over memory, and thus to be isolated
 from mm mechanisms; doing all those special cases was not welcome
>>>
>>> I agree that the kernel doesn't have enough information to be able to
>>> accurately handle 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Jerome Glisse
On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> On 2017/9/6 2:54, Ross Zwisler wrote:
> > On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>> On 2017/9/4 23:51, Jerome Glisse wrote:
>  On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > On 2017/8/17 8:05, Jérôme Glisse wrote:
> >> Unlike unaddressable memory, coherent device memory has a real
> >> resource associated with it on the system (as CPU can address
> >> it). Add a new helper to hotplug such memory within the HMM
> >> framework.
> >>
> >
> > Got a new question: coherent device (e.g. CCIX) memory is likely reported
> > to the OS through ACPI and recognized as a NUMA memory node.
> > Then how can its memory be captured and managed by the HMM framework?
> >
> 
>  Only platform that has such memory today is powerpc and it is not 
>  reported
>  as regular memory by the firmware hence why they need this helper.
> 
>  I don't think anyone has defined anything yet for x86 and acpi. As this 
>  is
> >>>
> >>> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> >>> Table (HMAT) defined in ACPI 6.2.
> >>> The HMAT can cover CPU-addressable memory types (though not
> >>> non-cache-coherent on-device memory).
> >>>
> >>> Ross from Intel has already done some work on this, see:
> >>> https://lwn.net/Articles/724562/
> >>>
> >>> arm64 supports ACPI as well, and there will likely be more devices of this
> >>> kind when CCIX is out (which should be very soon if on schedule).
> >>
> >> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory 
> >> ie
> >> when you have several kind of memory each with different characteristics:
> >>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> >> small (ie few giga bytes)
> >>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> >>   - DDR (good old memory) well characteristics are between HBM and 
> >> persistent
> >>
> >> So AFAICT this has nothing to do with what HMM is for, ie device memory. 
> >> Note
> >> that device memory can have a hierarchy of memory themself (HBM, GDDR and 
> >> in
> >> maybe even persistent memory).
> >>
>  memory on PCIE like interface then i don't expect it to be reported as 
>  NUMA
>  memory node but as io range like any regular PCIE resources. Device 
>  driver
>  through capabilities flags would then figure out if the link between the
>  device and CPU is CCIX capable if so it can use this helper to hotplug it
>  as device memory.
> 
> >>>
> >>> From my point of view, cache-coherent device memory will become popular
> >>> soon and will be reported through ACPI/UEFI. Extending NUMA policy still
> >>> sounds more reasonable to me.
> >>
> >> Cache coherent device will be reported through standard mecanisms defined 
> >> by
> >> the bus standard they are using. To my knowledge all the standard are 
> >> either
> >> on top of PCIE or are similar to PCIE.
> >>
> >> It is true that on many platform PCIE resource is manage/initialize by the
> >> bios (UEFI) but it is platform specific. In some case we reprogram what the
> >> bios pick.
> >>
> >> So like i was saying i don't expect the BIOS/UEFI to report device memory 
> >> as
> >> regular memory. It will be reported as a regular PCIE resources and then 
> >> the
> >> device driver will be able to determine through some flags if the link 
> >> between
> >> the CPU(s) and the device is cache coherent or not. At that point the 
> >> device
> >> driver can use register it with HMM helper.
> >>
> >>
> >> The whole NUMA discussion happen several time in the past i suggest looking
> >> on mm list archive for them. But it was rule out for several reasons. Top 
> >> of
> >> my head:
> >>   - people hate CPU less node and device memory is inherently CPU less
> > 
> > With the introduction of the HMAT in ACPI 6.2 one of the things that was 
> > added
> > was the ability to have an ACPI proximity domain that isn't associated with 
> > a
> > CPU.  This can be seen in the changes in the text of the "Proximity Domain"
> > field in table 5-73 which describes the "Memory Affinity Structure".  One of
> > the major features of the HMAT was the separation of "Initiator" proximity
> > domains (CPUs, devices that initiate memory transfers), and "target" 
> > proximity
> > domains (memory regions, be they attached to a CPU or some other device).
> > 
> > ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> > already in a place where we have to support CPU-less NUMA nodes.
> > 
> >>   - device driver want total control over memory and thus to be isolated 
> >> from
> >> mm mecanism and doing all those special cases was not welcome
> > 
> > I agree that the kernel doesn't have enough information to be able to
> > accurately handle all the use cases for the various types 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Bob Liu
On 2017/9/6 2:54, Ross Zwisler wrote:
> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
 On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jérôme Glisse wrote:
>> Unlike unaddressable memory, coherent device memory has a real
>> resource associated with it on the system (as CPU can address
>> it). Add a new helper to hotplug such memory within the HMM
>> framework.
>>
>
> Got a new question: coherent device (e.g. CCIX) memory is likely reported
> to the OS through ACPI and recognized as a NUMA memory node.
> Then how can its memory be captured and managed by the HMM framework?
>

 Only platform that has such memory today is powerpc and it is not reported
 as regular memory by the firmware hence why they need this helper.

 I don't think anyone has defined anything yet for x86 and acpi. As this is
>>>
>>> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
>>> Table (HMAT) defined in ACPI 6.2.
>>> The HMAT can cover CPU-addressable memory types (though not
>>> non-cache-coherent on-device memory).
>>>
>>> Ross from Intel has already done some work on this, see:
>>> https://lwn.net/Articles/724562/
>>>
>>> arm64 supports ACPI as well, and there will likely be more devices of this
>>> kind when CCIX is out (which should be very soon if on schedule).
>>
>> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
>> when you have several kind of memory each with different characteristics:
>>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>> small (ie few giga bytes)
>>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>>   - DDR (good old memory) well characteristics are between HBM and persistent
>>
>> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
>> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
>> maybe even persistent memory).
>>
 memory on PCIE like interface then i don't expect it to be reported as NUMA
 memory node but as io range like any regular PCIE resources. Device driver
 through capabilities flags would then figure out if the link between the
 device and CPU is CCIX capable if so it can use this helper to hotplug it
 as device memory.

>>>
>>> From my point of view, cache-coherent device memory will become popular
>>> soon and will be reported through ACPI/UEFI. Extending NUMA policy still
>>> sounds more reasonable to me.
>>
>> Cache coherent device will be reported through standard mecanisms defined by
>> the bus standard they are using. To my knowledge all the standard are either
>> on top of PCIE or are similar to PCIE.
>>
>> It is true that on many platform PCIE resource is manage/initialize by the
>> bios (UEFI) but it is platform specific. In some case we reprogram what the
>> bios pick.
>>
>> So like i was saying i don't expect the BIOS/UEFI to report device memory as
>> regular memory. It will be reported as a regular PCIE resources and then the
>> device driver will be able to determine through some flags if the link 
>> between
>> the CPU(s) and the device is cache coherent or not. At that point the device
>> driver can use register it with HMM helper.
>>
>>
>> The whole NUMA discussion happen several time in the past i suggest looking
>> on mm list archive for them. But it was rule out for several reasons. Top of
>> my head:
>>   - people hate CPU less node and device memory is inherently CPU less
> 
> With the introduction of the HMAT in ACPI 6.2 one of the things that was added
> was the ability to have an ACPI proximity domain that isn't associated with a
> CPU.  This can be seen in the changes in the text of the "Proximity Domain"
> field in table 5-73 which describes the "Memory Affinity Structure".  One of
> the major features of the HMAT was the separation of "Initiator" proximity
> domains (CPUs, devices that initiate memory transfers), and "target" proximity
> domains (memory regions, be they attached to a CPU or some other device).
> 
> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> already in a place where we have to support CPU-less NUMA nodes.
> 
>>   - device driver want total control over memory and thus to be isolated from
>> mm mecanism and doing all those special cases was not welcome
> 
> I agree that the kernel doesn't have enough information to be able to
> accurately handle all the use cases for the various types of heterogeneous
> memory.   The goal of my HMAT enabling is to allow that memory to be reserved
> from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
> Address Range Structure, then provide userspace with enough information to be
> able to distinguish between the various types of memory in the system so it
> can allocate & utilize it 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Jerome Glisse
On Tue, Sep 05, 2017 at 01:00:13PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >  On 2017/8/17 8:05, Jérôme Glisse wrote:
> > > > Unlike unaddressable memory, coherent device memory has a real
> > > > resource associated with it on the system (as CPU can address
> > > > it). Add a new helper to hotplug such memory within the HMM
> > > > framework.
> > > >
> > > 
> > >  Got a new question: coherent device (e.g. CCIX) memory is likely
> > >  reported to the OS through ACPI and recognized as a NUMA memory node.
> > >  Then how can its memory be captured and managed by the HMM framework?
> > > 
> > > >>>
> > > >>> The only platform that has such memory today is powerpc, and there it
> > > >>> is not reported as regular memory by the firmware, hence why they need
> > > >>> this helper.
> > > >>>
> > > >>> I don't think anyone has defined anything yet for x86 and ACPI. As
> > > >>> this is
> > > >>
> > > >> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> > > >> Table (HMAT) defined in ACPI 6.2.
> > > >> The HMAT can cover CPU-addressable memory types (though not
> > > >> non-cache-coherent on-device memory).
> > > >>
> > > >> Ross from Intel has already done some work on this, see:
> > > >> https://lwn.net/Articles/724562/
> > > >>
> > > >> arm64 supports ACPI as well, and there will likely be more devices of
> > > >> this kind when CCIX is out (which should be very soon if on schedule).
> > > > 
> > > > HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy"
> > > > memory, ie when you have several kinds of memory, each with different
> > > > characteristics:
> > > >   - HBM: very fast (latency) and high bandwidth, non-persistent,
> > > > somewhat small (ie a few gigabytes)
> > > >   - Persistent memory: slower (both latency and bandwidth), big
> > > > (terabytes)
> > > >   - DDR (good old memory): characteristics are between HBM and
> > > > persistent
> > > > 
> > > 
> > > Okay, then how should the kernel handle the situation of "kinds of memory
> > > each with different characteristics"?
> > > Does someone have any suggestion?  I thought HMM could do this.
> > > NUMA policy/node distance is good but perhaps requires some extending,
> > > e.g. an HBM node can't be swapped and can't accept DDR fallback
> > > allocation.
> > 
> > I don't think there is any consensus for this. I put forward the idea that
> > NUMA needed to be extended, as with a deep hierarchy it is not only the
> > distance between two nodes that matters but also other factors like
> > persistence, bandwidth, latency ...
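
For the specific fallback concern above, a strict no-fallback binding to a
single node is already expressible today with mbind(2) and MPOL_BIND; what
existing NUMA policy cannot express are the extra attributes (persistence,
bandwidth, latency) just mentioned. A minimal user-space sketch, assuming the
HBM node happens to be node 1 and building with -lnuma:

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

#define HBM_NODE 1			/* assumed node number of the HBM node */

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB */
	unsigned long nodemask = 1UL << HBM_NODE;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * MPOL_BIND restricts allocations for this range to the HBM node;
	 * unlike the default/preferred policies it does not silently fall
	 * back to DDR nodes when the HBM node runs out of memory.
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	buf[0] = 1;	/* first touch faults the page in on the HBM node */
	return 0;
}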
> > 
> > 
> > > > So AFAICT this has nothing to do with what HMM is for, ie device
> > > > memory. Note that device memory can have a hierarchy of memory itself
> > > > (HBM, GDDR and maybe even persistent memory).
> > > > 
> > > 
> > > This looks like a subset of HMAT, when the CPU can address device memory
> > > directly in a cache-coherent way.
> > 
> > It is not, it is much more complex than that. The Linux kernel has no idea
> > what is going on in a device and thus does not have any useful information
> > to make proper decisions regarding device memory. Here "device" means a real
> > device, ie something with processing capability, not something like HBM or
> > persistent memory, even if the latter is associated with a struct device
> > inside the Linux kernel.
> > 
> > > 
> > > 
> > > >>> memory on a PCIE-like interface, I don't expect it to be reported as
> > > >>> a NUMA memory node but as an io range, like any regular PCIE resource.
> > > >>> The device driver, through capability flags, would then figure out
> > > >>> whether the link between the device and the CPU is CCIX capable; if
> > > >>> so, it can use this helper to hotplug it as device memory.
> > > >>>
> > > >>
> > > >> From my point of view, cache-coherent device memory will become
> > > >> popular soon and will be reported through ACPI/UEFI. Extending NUMA
> > > >> policy still sounds more reasonable to me.
> > > > 
> > > > Cache coherent devices will be reported through standard mechanisms
> > > > defined by the bus standard they are using. To my knowledge all the
> > > > standards are either on top of PCIE or are similar to PCIE.
> > > >
> > > > It is true that on many platforms PCIE resources are managed/initialized
> > > > by the BIOS (UEFI), but that is platform specific. In some cases we
> > > > reprogram what the BIOS picked.
> > > > 
> > > > So, like I was saying, I don't expect the BIOS/UEFI to report device
> > > > memory as
> > > 
> > > But it's happening.
> > > In my understanding, that's 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Ross Zwisler
On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >  On 2017/8/17 8:05, Jérôme Glisse wrote:
> > > Unlike unaddressable memory, coherent device memory has a real
> > > resource associated with it on the system (as CPU can address
> > > it). Add a new helper to hotplug such memory within the HMM
> > > framework.
> > >
> > 
> >  Got a new question: coherent device (e.g. CCIX) memory is likely reported
> >  to the OS through ACPI and recognized as a NUMA memory node.
> >  Then how can its memory be captured and managed by the HMM framework?
> > 
> > >>>
> > >>> Only platform that has such memory today is powerpc and it is not 
> > >>> reported
> > >>> as regular memory by the firmware hence why they need this helper.
> > >>>
> > >>> I don't think anyone has defined anything yet for x86 and acpi. As this 
> > >>> is
> > >>
> > >> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> > >> Table (HMAT) defined in ACPI 6.2.
> > >> The HMAT can cover CPU-addressable memory types (though not
> > >> non-cache-coherent on-device memory).
> > >>
> > >> Ross from Intel has already done some work on this, see:
> > >> https://lwn.net/Articles/724562/
> > >>
> > >> arm64 supports ACPI as well, and there will likely be more devices of
> > >> this kind when CCIX is out (which should be very soon if on schedule).
> > > 
> > > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory 
> > > ie
> > > when you have several kind of memory each with different characteristics:
> > >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> > > small (ie few giga bytes)
> > >   - Persistent memory, slower (both latency and bandwidth) big (tera 
> > > bytes)
> > >   - DDR (good old memory) well characteristics are between HBM and 
> > > persistent
> > > 
> > 
> > Okay, then how the kernel handle the situation of "kind of memory each with 
> > different characteristics"?
> > Does someone have any suggestion?  I thought HMM can do this.
> > Numa policy/node distance is good but perhaps require a few extending, e.g 
> > a HBM node can't be
> > swap, can't accept DDR fallback allocation.
> 
> I don't think there is any consensus for this. I put forward the idea that 
> NUMA
> needed to be extended as with deep hierarchy it is not only the distance 
> between
> two nodes but also others factors like persistency, bandwidth, latency ...
> 
> 
> > > So AFAICT this has nothing to do with what HMM is for, ie device memory. 
> > > Note
> > > that device memory can have a hierarchy of memory themself (HBM, GDDR and 
> > > in
> > > maybe even persistent memory).
> > > 
> > 
> > This looks like a subset of HMAT when CPU can address device memory 
> > directly in cache-coherent way.
> 
> It is not, it is much more complex than that. The Linux kernel has no idea
> what is going on in a device and thus does not have any useful information to
> make proper decisions regarding device memory. Here "device" means a real
> device, ie something with processing capability, not something like HBM or
> persistent memory, even if the latter is associated with a struct device
> inside the Linux kernel.
> 
> > 
> > 
> > >>> memory on PCIE like interface then i don't expect it to be reported as 
> > >>> NUMA
> > >>> memory node but as io range like any regular PCIE resources. Device 
> > >>> driver
> > >>> through capabilities flags would then figure out if the link between the
> > >>> device and CPU is CCIX capable if so it can use this helper to hotplug 
> > >>> it
> > >>> as device memory.
> > >>>
> > >>
> > >> From my point of view,  Cache coherent device memory will popular soon 
> > >> and
> > >> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
> > >> reasonable
> > >> to me.
> > > 
> > > Cache coherent device will be reported through standard mecanisms defined 
> > > by
> > > the bus standard they are using. To my knowledge all the standard are 
> > > either
> > > on top of PCIE or are similar to PCIE.
> > > 
> > > It is true that on many platform PCIE resource is manage/initialize by the
> > > bios (UEFI) but it is platform specific. In some case we reprogram what 
> > > the
> > > bios pick.
> > > 
> > > So like i was saying i don't expect the BIOS/UEFI to report device memory 
> > > as
> > 
> > But it's happening.
> > In my understanding, that's why HMAT was introduced:
> > for reporting device memory as regular memory (with different
> > characteristics).
> 
> That is not my understanding, but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory, which I do not consider device
> memory. Sure, persistent memory is assigned a device struct 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Ross Zwisler
On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > On 2017/9/4 23:51, Jerome Glisse wrote:
> > > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >> On 2017/8/17 8:05, Jérôme Glisse wrote:
> > >>> Unlike unaddressable memory, coherent device memory has a real
> > >>> resource associated with it on the system (as CPU can address
> > >>> it). Add a new helper to hotplug such memory within the HMM
> > >>> framework.
> > >>>
> > >>
> > >> Got a new question: coherent device (e.g. CCIX) memory is likely
> > >> reported to the OS through ACPI and recognized as a NUMA memory node.
> > >> Then how can its memory be captured and managed by the HMM framework?
> > >>
> > > 
> > > The only platform that has such memory today is powerpc, and there it is
> > > not reported as regular memory by the firmware, hence why they need this
> > > helper.
> > >
> > > I don't think anyone has defined anything yet for x86 and ACPI. As this is
> > 
> > Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> > Table (HMAT) defined in ACPI 6.2.
> > The HMAT can cover CPU-addressable memory types (though not
> > non-cache-coherent on-device memory).
> >
> > Ross from Intel has already done some work on this, see:
> > https://lwn.net/Articles/724562/
> >
> > arm64 supports ACPI as well, and there will likely be more devices of this
> > kind when CCIX is out (which should be very soon if on schedule).
> 
> HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory, ie
> when you have several kinds of memory, each with different characteristics:
>   - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
> small (ie a few gigabytes)
>   - Persistent memory: slower (both latency and bandwidth), big (terabytes)
>   - DDR (good old memory): characteristics are between HBM and persistent
>
> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> that device memory can have a hierarchy of memory itself (HBM, GDDR and maybe
> even persistent memory).
> 
> > > memory on a PCIE-like interface, I don't expect it to be reported as a
> > > NUMA memory node but as an io range, like any regular PCIE resource. The
> > > device driver, through capability flags, would then figure out whether the
> > > link between the device and the CPU is CCIX capable; if so, it can use
> > > this helper to hotplug it as device memory.
> > > 
> > 
> > From my point of view, cache-coherent device memory will become popular
> > soon and will be reported through ACPI/UEFI. Extending NUMA policy still
> > sounds more reasonable to me.
> 
> Cache coherent devices will be reported through standard mechanisms defined by
> the bus standard they are using. To my knowledge all the standards are either
> on top of PCIE or are similar to PCIE.
>
> It is true that on many platforms PCIE resources are managed/initialized by
> the BIOS (UEFI), but that is platform specific. In some cases we reprogram
> what the BIOS picked.
>
> So, like I was saying, I don't expect the BIOS/UEFI to report device memory as
> regular memory. It will be reported as a regular PCIE resource, and then the
> device driver will be able to determine through some flags whether the link
> between the CPU(s) and the device is cache coherent or not. At that point the
> device driver can register it with the HMM helper.
> 
> 
> The whole NUMA discussion happened several times in the past; I suggest
> looking in the mm list archives for it. But it was ruled out for several
> reasons. Off the top of my head:
>   - people hate CPU-less nodes, and device memory is inherently CPU-less

With the introduction of the HMAT in ACPI 6.2 one of the things that was added
was the ability to have an ACPI proximity domain that isn't associated with a
CPU.  This can be seen in the changes in the text of the "Proximity Domain"
field in table 5-73 which describes the "Memory Affinity Structure".  One of
the major features of the HMAT was the separation of "Initiator" proximity
domains (CPUs, devices that initiate memory transfers), and "target" proximity
domains (memory regions, be they attached to a CPU or some other device).

ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
already in a place where we have to support CPU-less NUMA nodes.
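
Such memory-only nodes are already visible through the existing NUMA
interfaces; as a sketch, a small libnuma program (built with -lnuma) can walk
the nodes and flag the CPU-less ones:

#include <numa.h>
#include <stdio.h>

int main(void)
{
	struct bitmask *cpus;
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}

	cpus = numa_allocate_cpumask();
	for (node = 0; node <= numa_max_node(); node++) {
		long long freemem;
		long long size = numa_node_size64(node, &freemem);

		if (size <= 0)
			continue;	/* node has no memory configured */

		numa_node_to_cpus(node, cpus);
		printf("node %d: %lld MiB%s\n", node, size >> 20,
		       numa_bitmask_weight(cpus) ? "" : " (CPU-less)");
	}
	numa_bitmask_free(cpus);
	return 0;
}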

>   - device drivers want total control over memory, and thus to be isolated
> from mm mechanisms; doing all those special cases was not welcome

I agree that the kernel doesn't have enough information to be able to
accurately handle all the use cases for the various types of heterogeneous
memory.   The goal of my HMAT enabling is to allow that memory to be reserved
from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
Address Range Structure, then provide userspace with enough information to be
able to distinguish between the various types of memory in the system so it
can allocate & utilize it appropriately.
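
For reference, the raw HMAT is already exported to user space on systems whose
firmware actually provides one (root can read /sys/firmware/acpi/tables/HMAT);
a minimal sketch that dumps the standard ACPI table header:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	unsigned char hdr[36];	/* standard ACPI table header is 36 bytes */
	uint32_t length;
	FILE *f = fopen("/sys/firmware/acpi/tables/HMAT", "rb");

	if (!f) {
		perror("HMAT");	/* no HMAT provided, or not running as root */
		return 1;
	}
	if (fread(hdr, 1, sizeof(hdr), f) != sizeof(hdr)) {
		fclose(f);
		return 1;
	}
	memcpy(&length, hdr + 4, sizeof(length));	/* length is little-endian */
	printf("signature %.4s, length %u bytes, revision %u\n",
	       (const char *)hdr, (unsigned)length, (unsigned)hdr[8]);
	fclose(f);
	return 0;
}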

>   - existing NUMA migration mechanisms are ill suited for this 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Dan Williams
On Tue, Sep 5, 2017 at 6:50 AM, Jerome Glisse  wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
>> On 2017/9/5 10:38, Jerome Glisse wrote:
>> > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>> >> On 2017/9/4 23:51, Jerome Glisse wrote:
>> >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>  On 2017/8/17 8:05, Jérôme Glisse wrote:
>> > Unlike unaddressable memory, coherent device memory has a real
>> > resource associated with it on the system (as CPU can address
>> > it). Add a new helper to hotplug such memory within the HMM
>> > framework.
>> >
>> 
>>  Got an new question, coherent device( e.g CCIX) memory are likely 
>>  reported to OS
>>  through ACPI and recognized as NUMA memory node.
>>  Then how can their memory be captured and managed by HMM framework?
>> 
>> >>>
>> >>> Only platform that has such memory today is powerpc and it is not 
>> >>> reported
>> >>> as regular memory by the firmware hence why they need this helper.
>> >>>
>> >>> I don't think anyone has defined anything yet for x86 and acpi. As this 
>> >>> is
>> >>
>> >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>> >> Table (HMAT) table defined in ACPI 6.2.
>> >> The HMAT can cover CPU-addressable memory types(though not non-cache
>> >> coherent on-device memory).
>> >>
>> >> Ross from Intel already done some work on this, see:
>> >> https://lwn.net/Articles/724562/
>> >>
>> >> arm64 supports APCI also, there is likely more this kind of device when 
>> >> CCIX
>> >> is out (should be very soon if on schedule).
>> >
>> > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory 
>> > ie
>> > when you have several kind of memory each with different characteristics:
>> >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
>> > small (ie few giga bytes)
>> >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>> >   - DDR (good old memory) well characteristics are between HBM and 
>> > persistent
>> >
>>
>> Okay, then how the kernel handle the situation of "kind of memory each with 
>> different characteristics"?
>> Does someone have any suggestion?  I thought HMM can do this.
>> Numa policy/node distance is good but perhaps require a few extending, e.g a 
>> HBM node can't be
>> swap, can't accept DDR fallback allocation.
>
> I don't think there is any consensus for this. I put forward the idea that 
> NUMA
> needed to be extended as with deep hierarchy it is not only the distance 
> between
> two nodes but also others factors like persistency, bandwidth, latency ...
>
>
>> > So AFAICT this has nothing to do with what HMM is for, ie device memory. 
>> > Note
>> > that device memory can have a hierarchy of memory themself (HBM, GDDR and 
>> > in
>> > maybe even persistent memory).
>> >
>>
>> This looks like a subset of HMAT when CPU can address device memory directly 
>> in cache-coherent way.
>
> It is not, it is much more complex than that. Linux kernel has no idea on 
> what is
> going on a device and thus do not have any usefull informations to make proper
> decission regarding device memory. Here device is real device ie something 
> with
> processing capability, not something like HBM or persistent memory even if the
> latter is associated with a struct device inside linux kernel.
>
>>
>>
>> >>> memory on PCIE like interface then i don't expect it to be reported as 
>> >>> NUMA
>> >>> memory node but as io range like any regular PCIE resources. Device 
>> >>> driver
>> >>> through capabilities flags would then figure out if the link between the
>> >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>> >>> as device memory.
>> >>>
>> >>
>> >> From my point of view,  Cache coherent device memory will popular soon and
>> >> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
>> >> reasonable
>> >> to me.
>> >
>> > Cache coherent device will be reported through standard mecanisms defined 
>> > by
>> > the bus standard they are using. To my knowledge all the standard are 
>> > either
>> > on top of PCIE or are similar to PCIE.
>> >
>> > It is true that on many platform PCIE resource is manage/initialize by the
>> > bios (UEFI) but it is platform specific. In some case we reprogram what the
>> > bios pick.
>> >
>> > So like i was saying i don't expect the BIOS/UEFI to report device memory 
>> > as
>>
>> But it's happening.
>> In my understanding, that's why HMAT was introduced.
>> For reporting device memory as regular memory(with different 
>> characteristics).
>
> That is not my understanding but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory. Which i do not consider as device
> memory. Sure persistent memory is assign a device struct because it is easier
> for integration with the block system i assume. But it does not make it a
> device in my view. For me a device is a piece of 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-05 Thread Jerome Glisse
On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> On 2017/9/5 10:38, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>  On 2017/8/17 8:05, Jérôme Glisse wrote:
> > Unlike unaddressable memory, coherent device memory has a real
> > resource associated with it on the system (as CPU can address
> > it). Add a new helper to hotplug such memory within the HMM
> > framework.
> >
> 
>  Got an new question, coherent device( e.g CCIX) memory are likely 
>  reported to OS 
>  through ACPI and recognized as NUMA memory node.
>  Then how can their memory be captured and managed by HMM framework?
> 
> >>>
> >>> Only platform that has such memory today is powerpc and it is not reported
> >>> as regular memory by the firmware hence why they need this helper.
> >>>
> >>> I don't think anyone has defined anything yet for x86 and acpi. As this is
> >>
> >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> >> Table (HMAT) table defined in ACPI 6.2.
> >> The HMAT can cover CPU-addressable memory types(though not non-cache
> >> coherent on-device memory).
> >>
> >> Ross from Intel already done some work on this, see:
> >> https://lwn.net/Articles/724562/
> >>
> >> arm64 supports APCI also, there is likely more this kind of device when 
> >> CCIX
> >> is out (should be very soon if on schedule).
> > 
> > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> > when you have several kind of memory each with different characteristics:
> >   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> > small (ie few giga bytes)
> >   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
> >   - DDR (good old memory) well characteristics are between HBM and 
> > persistent
> > 
> 
> Okay, then how the kernel handle the situation of "kind of memory each with 
> different characteristics"?
> Does someone have any suggestion?  I thought HMM can do this.
> Numa policy/node distance is good but perhaps require a few extending, e.g a 
> HBM node can't be
> swap, can't accept DDR fallback allocation.

I don't think there is any consensus on this. I put forward the idea that NUMA
needs to be extended: with a deep hierarchy it is not only the distance between
two nodes that matters but also other factors like persistence, bandwidth, latency, ...


> > So AFAICT this has nothing to do with what HMM is for, ie device memory. 
> > Note
> > that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> > maybe even persistent memory).
> > 
> 
> This looks like a subset of HMAT when CPU can address device memory directly 
> in cache-coherent way.

It is not; it is much more complex than that. The Linux kernel has no idea what is
going on inside a device and thus does not have any useful information to make proper
decisions regarding device memory. Here "device" means a real device, i.e. something with
processing capability, not something like HBM or persistent memory, even if the
latter is associated with a struct device inside the Linux kernel.

> 
> 
> >>> memory on PCIE like interface then i don't expect it to be reported as 
> >>> NUMA
> >>> memory node but as io range like any regular PCIE resources. Device driver
> >>> through capabilities flags would then figure out if the link between the
> >>> device and CPU is CCIX capable if so it can use this helper to hotplug it
> >>> as device memory.
> >>>
> >>
> >> From my point of view,  Cache coherent device memory will popular soon and
> >> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
> >> reasonable
> >> to me.
> > 
> > Cache coherent device will be reported through standard mecanisms defined by
> > the bus standard they are using. To my knowledge all the standard are either
> > on top of PCIE or are similar to PCIE.
> > 
> > It is true that on many platform PCIE resource is manage/initialize by the
> > bios (UEFI) but it is platform specific. In some case we reprogram what the
> > bios pick.
> > 
> > So like i was saying i don't expect the BIOS/UEFI to report device memory as
> 
> But it's happening.
> In my understanding, that's why HMAT was introduced.
> For reporting device memory as regular memory(with different characteristics).

That is not my understanding, but only Intel can confirm. HMAT was introduced
for things like HBM or persistent memory, which I do not consider to be device
memory. Sure, persistent memory is assigned a device struct, I assume because that
makes integration with the block layer easier, but that does not make it a
device in my view. For me a device is a piece of hardware that has some
processing capabilities (network adapter, sound card, GPU, ...).

But we can argue about semantics and what a device is. For all intents and purposes,
a device in the HMM context is some piece of hardware with 

Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Bob Liu
On 2017/9/5 10:38, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
 On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
>

 Got an new question, coherent device( e.g CCIX) memory are likely reported 
 to OS 
 through ACPI and recognized as NUMA memory node.
 Then how can their memory be captured and managed by HMM framework?

>>>
>>> Only platform that has such memory today is powerpc and it is not reported
>>> as regular memory by the firmware hence why they need this helper.
>>>
>>> I don't think anyone has defined anything yet for x86 and acpi. As this is
>>
>> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
>> Table (HMAT) table defined in ACPI 6.2.
>> The HMAT can cover CPU-addressable memory types(though not non-cache
>> coherent on-device memory).
>>
>> Ross from Intel already done some work on this, see:
>> https://lwn.net/Articles/724562/
>>
>> arm64 supports APCI also, there is likely more this kind of device when CCIX
>> is out (should be very soon if on schedule).
> 
> HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie
> when you have several kind of memory each with different characteristics:
>   - HBM very fast (latency) and high bandwidth, non persistent, somewhat
> small (ie few giga bytes)
>   - Persistent memory, slower (both latency and bandwidth) big (tera bytes)
>   - DDR (good old memory) well characteristics are between HBM and persistent
> 

Okay, then how does the kernel handle the situation of "kinds of memory each with
different characteristics"?
Does anyone have a suggestion? I thought HMM could do this.
NUMA policy/node distance is good but perhaps requires some extending, e.g. an HBM
node can't be used as swap and can't accept DDR fallback allocations.
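For the no-fallback part, userspace can already get something close with a strict memory policy; here is a minimal sketch (not from these patches; node 1 is a hypothetical HBM node) using mbind(MPOL_BIND):

/*
 * Sketch: mmap() an anonymous region and bind it to a single node so its
 * pages may only come from that node (no fallback to other nodes).
 * Link with -lnuma for the mbind() wrapper.
 */
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t sz = 16 << 20;                   /* 16 MiB */
        unsigned long nodemask = 1UL << 1;      /* only node 1 allowed */
        void *p;

        p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        if (mbind(p, sz, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
                perror("mbind");
                return 1;
        }
        memset(p, 0, sz);       /* faults must now be satisfied from node 1 */
        munmap(p, sz);
        return 0;
}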

> So AFAICT this has nothing to do with what HMM is for, ie device memory. Note
> that device memory can have a hierarchy of memory themself (HBM, GDDR and in
> maybe even persistent memory).
> 

This looks like a subset of HMAT for the case where the CPU can address device
memory directly in a cache-coherent way.


>>> memory on PCIE like interface then i don't expect it to be reported as NUMA
>>> memory node but as io range like any regular PCIE resources. Device driver
>>> through capabilities flags would then figure out if the link between the
>>> device and CPU is CCIX capable if so it can use this helper to hotplug it
>>> as device memory.
>>>
>>
>> From my point of view,  Cache coherent device memory will popular soon and
>> reported through ACPI/UEFI. Extending NUMA policy still sounds more 
>> reasonable
>> to me.
> 
> Cache coherent device will be reported through standard mecanisms defined by
> the bus standard they are using. To my knowledge all the standard are either
> on top of PCIE or are similar to PCIE.
> 
> It is true that on many platform PCIE resource is manage/initialize by the
> bios (UEFI) but it is platform specific. In some case we reprogram what the
> bios pick.
> 
> So like i was saying i don't expect the BIOS/UEFI to report device memory as

But it's happening.
In my understanding, that's why HMAT was introduced:
for reporting device memory as regular memory (with different characteristics).

--
Regards,
Bob Liu

> regular memory. It will be reported as a regular PCIE resources and then the
> device driver will be able to determine through some flags if the link between
> the CPU(s) and the device is cache coherent or not. At that point the device
> driver can use register it with HMM helper.
> 
> 
> The whole NUMA discussion happen several time in the past i suggest looking
> on mm list archive for them. But it was rule out for several reasons. Top of
> my head:
>   - people hate CPU less node and device memory is inherently CPU less
>   - device driver want total control over memory and thus to be isolated from
> mm mecanism and doing all those special cases was not welcome
>   - existing NUMA migration mecanism are ill suited for this memory as
> access by the device to the memory is unknown to core mm and there
> is no easy way to report it or track it (this kind of depends on the
> platform and hardware)
> 
> I am likely missing other big points.
> 
> Cheers,
> Jérôme
> 
> .
> 




Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Balbir Singh
On Tue, Sep 5, 2017 at 1:51 AM, Jerome Glisse  wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>> > Unlike unaddressable memory, coherent device memory has a real
>> > resource associated with it on the system (as CPU can address
>> > it). Add a new helper to hotplug such memory within the HMM
>> > framework.
>> >
>>
>> Got an new question, coherent device( e.g CCIX) memory are likely reported 
>> to OS
>> through ACPI and recognized as NUMA memory node.
>> Then how can their memory be captured and managed by HMM framework?
>>
>
> Only platform that has such memory today is powerpc and it is not reported
> as regular memory by the firmware hence why they need this helper.
>
> I don't think anyone has defined anything yet for x86 and acpi. As this is
> memory on PCIE like interface then i don't expect it to be reported as NUMA
> memory node but as io range like any regular PCIE resources. Device driver
> through capabilities flags would then figure out if the link between the
> device and CPU is CCIX capable if so it can use this helper to hotplug it
> as device memory.

Yep, the arch needs to do the right thing at hotplug time, which is:

1. Don't online the memory as a NUMA node
2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver

Like Jerome said, and as we tried as well, the NUMA approach needs more
agreement and discussion, and probably extensions.

Balbir Singh
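As a rough illustration of step 2 above, here is a sketch of the driver-side registration using the helper added by this patch; everything named foo_* is hypothetical and the ops callbacks are omitted:

/*
 * Sketch only: hand a coherent, CPU-addressable (CDM) range to HMM via
 * hmm_devmem_add_resource(). A real driver supplies its own .fault/.free
 * callbacks and discovers the resource and coherency from its bus.
 */
#include <linux/device.h>
#include <linux/hmm.h>
#include <linux/ioport.h>

/* the driver's .fault/.free callbacks go here; omitted in this sketch */
static const struct hmm_devmem_ops foo_devmem_ops;

static int foo_register_cdm(struct device *dev, struct resource *res)
{
        struct hmm_devmem *devmem;

        /*
         * res->desc must be IORES_DESC_DEVICE_PUBLIC_MEMORY, matching the
         * check at the top of hmm_devmem_add_resource() in this patch.
         */
        devmem = hmm_devmem_add_resource(&foo_devmem_ops, dev, res);
        if (IS_ERR(devmem))
                return PTR_ERR(devmem);
        return 0;
}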


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Jerome Glisse
On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> On 2017/9/4 23:51, Jerome Glisse wrote:
> > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> >> On 2017/8/17 8:05, Jérôme Glisse wrote:
> >>> Unlike unaddressable memory, coherent device memory has a real
> >>> resource associated with it on the system (as CPU can address
> >>> it). Add a new helper to hotplug such memory within the HMM
> >>> framework.
> >>>
> >>
> >> Got an new question, coherent device( e.g CCIX) memory are likely reported 
> >> to OS 
> >> through ACPI and recognized as NUMA memory node.
> >> Then how can their memory be captured and managed by HMM framework?
> >>
> > 
> > Only platform that has such memory today is powerpc and it is not reported
> > as regular memory by the firmware hence why they need this helper.
> > 
> > I don't think anyone has defined anything yet for x86 and acpi. As this is
> 
> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute
> Table (HMAT) table defined in ACPI 6.2.
> The HMAT can cover CPU-addressable memory types(though not non-cache
> coherent on-device memory).
> 
> Ross from Intel already done some work on this, see:
> https://lwn.net/Articles/724562/
> 
> arm64 supports APCI also, there is likely more this kind of device when CCIX
> is out (should be very soon if on schedule).

HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory, i.e.
when you have several kinds of memory each with different characteristics:
  - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
    small (i.e. a few gigabytes)
  - Persistent memory: slower (both latency and bandwidth), big (terabytes)
  - DDR (good old memory): characteristics fall between HBM and persistent memory

So AFAICT this has nothing to do with what HMM is for, i.e. device memory. Note
that devices can have a hierarchy of memory themselves (HBM, GDDR and
maybe even persistent memory).

> > memory on PCIE like interface then i don't expect it to be reported as NUMA
> > memory node but as io range like any regular PCIE resources. Device driver
> > through capabilities flags would then figure out if the link between the
> > device and CPU is CCIX capable if so it can use this helper to hotplug it
> > as device memory.
> > 
> 
> From my point of view,  Cache coherent device memory will popular soon and
> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
> to me.

Cache-coherent devices will be reported through the standard mechanisms defined by
the bus standard they are using. To my knowledge all of those standards are either
on top of PCIE or similar to PCIE.

It is true that on many platforms PCIE resources are managed/initialized by the
BIOS (UEFI), but that is platform specific. In some cases we reprogram what the
BIOS picked.

So like I was saying, I don't expect the BIOS/UEFI to report device memory as
regular memory. It will be reported as a regular PCIE resource, and then the
device driver will be able to determine through some flags whether the link between
the CPU(s) and the device is cache coherent or not. At that point the device
driver can register it with the HMM helper.


The whole NUMA discussion has happened several times in the past; I suggest looking
through the mm list archives for it. But it was ruled out for several reasons. Off the
top of my head:
  - people hate CPU-less nodes, and device memory is inherently CPU-less
  - device drivers want total control over their memory and thus to be isolated from
    mm mechanisms, and doing all those special cases was not welcome
  - existing NUMA migration mechanisms are ill suited for this memory, as
    access by the device to the memory is unknown to core mm and there
    is no easy way to report it or track it (this somewhat depends on the
    platform and hardware)

I am likely missing other big points.

Cheers,
Jérôme


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Bob Liu
On 2017/9/4 23:51, Jerome Glisse wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>> On 2017/8/17 8:05, Jérôme Glisse wrote:
>>> Unlike unaddressable memory, coherent device memory has a real
>>> resource associated with it on the system (as CPU can address
>>> it). Add a new helper to hotplug such memory within the HMM
>>> framework.
>>>
>>
>> Got an new question, coherent device( e.g CCIX) memory are likely reported 
>> to OS 
>> through ACPI and recognized as NUMA memory node.
>> Then how can their memory be captured and managed by HMM framework?
>>
> 
> Only platform that has such memory today is powerpc and it is not reported
> as regular memory by the firmware hence why they need this helper.
> 
> I don't think anyone has defined anything yet for x86 and acpi. As this is

Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
Table (HMAT) defined in ACPI 6.2.
The HMAT can cover CPU-addressable memory types (though not non-cache-coherent
on-device memory).

Ross from Intel has already done some work on this, see:
https://lwn.net/Articles/724562/

arm64 supports ACPI as well; there will likely be more devices of this kind when CCIX
is out (which should be very soon if it stays on schedule).

> memory on PCIE like interface then i don't expect it to be reported as NUMA
> memory node but as io range like any regular PCIE resources. Device driver
> through capabilities flags would then figure out if the link between the
> device and CPU is CCIX capable if so it can use this helper to hotplug it
> as device memory.
> 

From my point of view, cache-coherent device memory will be popular soon and
reported through ACPI/UEFI.
Extending NUMA policy still sounds more reasonable to me.

--
Thanks,
Bob Liu



Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-04 Thread Jerome Glisse
On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jérôme Glisse wrote:
> > Unlike unaddressable memory, coherent device memory has a real
> > resource associated with it on the system (as CPU can address
> > it). Add a new helper to hotplug such memory within the HMM
> > framework.
> > 
> 
> Got an new question, coherent device( e.g CCIX) memory are likely reported to 
> OS 
> through ACPI and recognized as NUMA memory node.
> Then how can their memory be captured and managed by HMM framework?
> 

The only platform that has such memory today is powerpc, and it is not reported
as regular memory by the firmware, hence why they need this helper.

I don't think anyone has defined anything yet for x86 and ACPI. As this is
memory on a PCIE-like interface, I don't expect it to be reported as a NUMA
memory node but as an IO range like any regular PCIE resource. The device driver,
through capability flags, would then figure out whether the link between the
device and the CPU is CCIX capable; if so, it can use this helper to hotplug it
as device memory.

Jérôme


Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-09-03 Thread Bob Liu
On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.
> 

Got a new question: coherent device (e.g. CCIX) memory is likely reported to the OS
through ACPI and recognized as a NUMA memory node.
Then how can its memory be captured and managed by the HMM framework?

--
Regards,
Bob Liu

> Changed since v2:
>   - s/host/public
> Changed since v1:
>   - s/public/host
> 
> Signed-off-by: Jérôme Glisse 
> Reviewed-by: Balbir Singh 
> ---
>  include/linux/hmm.h |  3 ++
>  mm/hmm.c| 88 
> ++---
>  2 files changed, 86 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79e63178fd87..5866f3194c26 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -443,6 +443,9 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> struct device *device,
> unsigned long size);
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res);
>  void hmm_devmem_remove(struct hmm_devmem *devmem);
>  
>  /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 1a1e79d390c1..3faa4d40295e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void 
> *data)
>   zone = page_zone(page);
>  
>   mem_hotplug_begin();
> - __remove_pages(zone, start_pfn, npages);
> + if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> + __remove_pages(zone, start_pfn, npages);
> + else
> + arch_remove_memory(start_pfn << PAGE_SHIFT,
> +npages << PAGE_SHIFT);
>   mem_hotplug_done();
>  
>   hmm_devmem_radix_release(resource);
> @@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>   if (is_ram == REGION_INTERSECTS)
>   return -ENXIO;
>  
> - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> + if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> + else
> + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +
>   devmem->pagemap.res = devmem->resource;
>   devmem->pagemap.page_fault = hmm_devmem_fault;
>   devmem->pagemap.page_free = hmm_devmem_free;
> @@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem 
> *devmem)
>* over the device memory is un-accessible thus we do not want to
>* create a linear mapping for the memory like arch_add_memory()
>* would do.
> +  *
> +  * For device public memory, which is accesible by the CPU, we do
> +  * want the linear mapping and thus use arch_add_memory().
>*/
> - ret = add_pages(nid, align_start >> PAGE_SHIFT,
> - align_size >> PAGE_SHIFT, false);
> + if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> + ret = arch_add_memory(nid, align_start, align_size, false);
> + else
> + ret = add_pages(nid, align_start >> PAGE_SHIFT,
> + align_size >> PAGE_SHIFT, false);
>   if (ret) {
>   mem_hotplug_done();
>   goto error_add_memory;
> @@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct 
> hmm_devmem_ops *ops,
>  }
>  EXPORT_SYMBOL(hmm_devmem_add);
>  
> +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> +struct device *device,
> +struct resource *res)
> +{
> + struct hmm_devmem *devmem;
> + int ret;
> +
> + if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> + return ERR_PTR(-EINVAL);
> +
> + static_branch_enable(&device_private_key);
> +
> + devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
> +GFP_KERNEL, dev_to_node(device));
> + if (!devmem)
> + return ERR_PTR(-ENOMEM);
> +
> + init_completion(&devmem->completion);
> + devmem->pfn_first = -1UL;
> + devmem->pfn_last = -1UL;
> + devmem->resource = res;
> + devmem->device = device;
> + devmem->ops = ops;
> +
> + ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
> +   0, GFP_KERNEL);
> + if (ret)
> + goto error_percpu_ref;
> +
> + ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
> + if (ret)
> + goto error_devm_add_action;
> +
> +
> + devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
> + devmem->pfn_last = devmem->pfn_first +
> +

[HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

2017-08-16 Thread Jérôme Glisse
Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.

Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host

Signed-off-by: Jérôme Glisse 
Reviewed-by: Balbir Singh 
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c| 88 ++---
 2 files changed, 86 insertions(+), 5 deletions(-)
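
As a rough, hypothetical sketch of how a device driver would consume this helper: the driver is expected to obtain a struct resource describing its coherent (CPU-addressable) memory, with desc set to IORES_DESC_DEVICE_PUBLIC_MEMORY by whatever platform code discovered it, and hand that resource to hmm_devmem_add_resource() together with its hmm_devmem_ops. Everything below -- the cdm_* names, the callback bodies, and the assumption that the resource comes from platform code -- is illustrative only and not part of this patch; the callback signatures follow the hmm_devmem_ops definition introduced earlier in this series.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/hmm.h>
#include <linux/ioport.h>
#include <linux/mm.h>

/* Hypothetical driver callbacks; a real driver would tie these to its
 * own device memory allocator and migration logic. */
static void cdm_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* Return the page to the device-side allocator (driver specific). */
}

static int cdm_devmem_fault(struct hmm_devmem *devmem,
			    struct vm_area_struct *vma,
			    unsigned long addr,
			    const struct page *page,
			    unsigned int flags,
			    pmd_t *pmdp)
{
	/*
	 * Device public memory is CPU addressable, so unlike the private
	 * (unaddressable) case this path is not expected to be taken.
	 */
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops cdm_devmem_ops = {
	.free	= cdm_devmem_free,
	.fault	= cdm_devmem_fault,
};

/* Called at probe time with a resource the platform flagged as CDM. */
static int cdm_hotplug_devmem(struct device *dev, struct resource *res)
{
	struct hmm_devmem *devmem;

	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
		return -EINVAL;

	devmem = hmm_devmem_add_resource(&cdm_devmem_ops, dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/*
	 * The range devmem->pfn_first .. devmem->pfn_last is now hotplugged
	 * as MEMORY_DEVICE_PUBLIC pages and, unlike the private case, sits
	 * in the linear mapping, so the CPU can access it directly.
	 */
	return 0;
}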

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79e63178fd87..5866f3194c26 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -443,6 +443,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
  struct device *device,
  unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+  struct device *device,
+  struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 1a1e79d390c1..3faa4d40295e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
zone = page_zone(page);
 
mem_hotplug_begin();
-   __remove_pages(zone, start_pfn, npages);
+   if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+   __remove_pages(zone, start_pfn, npages);
+   else
+   arch_remove_memory(start_pfn << PAGE_SHIFT,
+  npages << PAGE_SHIFT);
mem_hotplug_done();
 
hmm_devmem_radix_release(resource);
@@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
if (is_ram == REGION_INTERSECTS)
return -ENXIO;
 
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+   devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+   else
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
devmem->pagemap.res = devmem->resource;
devmem->pagemap.page_fault = hmm_devmem_fault;
devmem->pagemap.page_free = hmm_devmem_free;
@@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 * over the device memory is un-accessible thus we do not want to
 * create a linear mapping for the memory like arch_add_memory()
 * would do.
+*
+* For device public memory, which is accessible by the CPU, we do
+* want the linear mapping and thus use arch_add_memory().
 */
-   ret = add_pages(nid, align_start >> PAGE_SHIFT,
-   align_size >> PAGE_SHIFT, false);
+   if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+   ret = arch_add_memory(nid, align_start, align_size, false);
+   else
+   ret = add_pages(nid, align_start >> PAGE_SHIFT,
+   align_size >> PAGE_SHIFT, false);
if (ret) {
mem_hotplug_done();
goto error_add_memory;
@@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+  struct device *device,
+  struct resource *res)
+{
+   struct hmm_devmem *devmem;
+   int ret;
+
+   if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+   return ERR_PTR(-EINVAL);
+
+   static_branch_enable(&device_private_key);
+
+   devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+  GFP_KERNEL, dev_to_node(device));
+   if (!devmem)
+   return ERR_PTR(-ENOMEM);
+
+   init_completion(&devmem->completion);
+   devmem->pfn_first = -1UL;
+   devmem->pfn_last = -1UL;
+   devmem->resource = res;
+   devmem->device = device;
+   devmem->ops = ops;
+
+   ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+ 0, GFP_KERNEL);
+   if (ret)
+   goto error_percpu_ref;
+
+   ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+   if (ret)
+   goto error_devm_add_action;
+
+
+   devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+   devmem->pfn_last = devmem->pfn_first +
+  (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+   ret = hmm_devmem_pages_create(devmem);
+   if (ret)
+   goto error_devm_add_action;
+
+   devres_add(device, devmem);
+
+   ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+   if (ret) {
+   hmm_devmem_remove(devmem);
+   return ERR_PTR(ret);
+   }
+
+   return devmem;