Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Fri, Sep 8, 2017 at 1:43 PM, Dan Williams wrote:
> On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu wrote:
>> On 2017/9/8 1:27, Jerome Glisse wrote:
> [..]
>>> No, these are two orthogonal things; they do not conflict with each
>>> other, quite the contrary. HMM (the CDM part is no different) is a set
>>> of helpers, see it as a toolbox, for device drivers.
>>>
>>> HMAT is a way for firmware to report memory resources with more
>>> information than just ranges of physical addresses. HMAT is specific
>>> to platforms that rely on ACPI. HMAT does not provide any helpers to
>>> manage this memory.
>>>
>>> So a device driver can get information about device memory from HMAT
>>> and then use HMM to help in managing and using this memory.
>>>
>>
>> Yes, but as Balbir mentioned this requires:
>> 1. Don't online the memory as a NUMA node
>> 2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver
>>
>> And I'm not sure whether Intel is going to use this HMM-CDM based
>> method for their "target domain" memory, or whether they prefer the
>> NUMA approach? Ross? Dan?
>
> The starting / strawman proposal for performance differentiated memory
> ranges is to get platform firmware to mark them reserved by default.
> Then, after we parse the HMAT, make them available via the device-dax
> mechanism so that applications that need 100% guaranteed access to
> these potentially high-value / limited-capacity ranges can be sure to
> get them by default, i.e. before any random kernel objects are placed
> in them. Otherwise, if there are no dedicated users for the memory
> ranges via device-dax, or they don't need the total capacity, we want
> to hotplug that memory into the general purpose memory allocator with
> a NUMA node number so typical numactl and memory-management flows are
> enabled.
>
> Ideally this would not be specific to HMAT, and any agent that knows
> the differentiated performance characteristics of a memory range could
> use this scheme.
@Dan/Ross: With this approach, in an SVM environment, if you wanted a
PRI (page grant) request to be satisfied from this HMAT-indexed memory
node, do you think we could make that happen? If yes, is that something
you are currently working on?

Chetan
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Thu, Sep 7, 2017 at 6:59 PM, Bob Liu wrote:
> On 2017/9/8 1:27, Jerome Glisse wrote:
[..]
>> No, these are two orthogonal things; they do not conflict with each
>> other, quite the contrary. HMM (the CDM part is no different) is a set
>> of helpers, see it as a toolbox, for device drivers.
>>
>> HMAT is a way for firmware to report memory resources with more
>> information than just ranges of physical addresses. HMAT is specific
>> to platforms that rely on ACPI. HMAT does not provide any helpers to
>> manage this memory.
>>
>> So a device driver can get information about device memory from HMAT
>> and then use HMM to help in managing and using this memory.
>>
>
> Yes, but as Balbir mentioned this requires:
> 1. Don't online the memory as a NUMA node
> 2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver
>
> And I'm not sure whether Intel is going to use this HMM-CDM based
> method for their "target domain" memory, or whether they prefer the
> NUMA approach? Ross? Dan?

The starting / strawman proposal for performance differentiated memory
ranges is to get platform firmware to mark them reserved by default.
Then, after we parse the HMAT, make them available via the device-dax
mechanism so that applications that need 100% guaranteed access to
these potentially high-value / limited-capacity ranges can be sure to
get them by default, i.e. before any random kernel objects are placed
in them. Otherwise, if there are no dedicated users for the memory
ranges via device-dax, or they don't need the total capacity, we want
to hotplug that memory into the general purpose memory allocator with
a NUMA node number so typical numactl and memory-management flows are
enabled.

Ideally this would not be specific to HMAT, and any agent that knows
the differentiated performance characteristics of a memory range could
use this scheme.
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Fri, Sep 08, 2017 at 01:43:44PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
> <>
> > Does HMAT support device hotplug? I am unfamiliar with the whole inner
> > working of ACPI versus PCIe. Anyway, I don't see any issue with device
> > memory also showing through HMAT, but like I said, the device driver
> > for the device will want to be in total control of that memory.
>
> Yep, the HMAT will support device hotplug via the _HMA method (section
> 6.2.18 of ACPI 6.2). This basically supplies an entirely new HMAT that
> the system will use to replace the current one.
>
> I don't yet have support for _HMA in my enabling, but I do intend to
> add support for it once we settle on a sysfs API for the regular
> boot-time case.
>
> > Like I said, the issue here is that the core kernel is unaware of the
> > device activity, i.e. on what part of memory the device is actively
> > working. So core mm cannot make informed decisions on what should be
> > migrated to device memory. Also we do not want regular memory
> > allocations to end up in device memory unless explicitly asked for.
> > A few reasons for that. First, this memory might not only be used for
> > compute tasks but also for graphics, and in that case there are hard
> > constraints on physically contiguous memory allocation that require
> > the GPU to move things around to make room for graphics objects
> > (can't allow GUP).
> >
> > Second reason, the device memory is inherently unreliable. If there
> > is a bug in the device driver or the user manages to trigger a faulty
> > condition on the GPU, the device might need a hard reset (i.e. cut
> > PCIe power to the device), which leads to loss of memory content.
> > While GPUs are becoming more and more resilient, they are still prone
> > to lockups.
> >
> > Finally, for GPUs there is a common pattern of memory over-commit.
> > You pretend to each application that it is the only one and allow
> > each of them to allocate all of the device memory, or more than it
> > could with strict sharing. As GPUs have long timeslices between
> > switching to different contexts/applications, they can easily move
> > large chunks of the process memory out and in at context/application
> > switching. This has proven to be a key aspect of allowing maximum
> > performance across several concurrent applications/contexts.
> >
> > To implement this, the easiest solution is for the device to lie
> > about how much memory it has and use the system memory as an
> > overflow.
>
> I don't think any of this precludes the HMAT being involved. This is
> all very similar to what I think we need to do for high bandwidth
> memory, for example. We don't want the OS to use it for anything, and
> we want all of it to be available for applications to allocate and use
> for their specific workload. We don't want to make any assumptions
> about how it can or should be used.
>
> The HMAT is just there to give us a few things:
>
> 1) It provides us with an explicit way of telling the OS not to use
> the memory, in the form of the "Reservation hint" flag in the Memory
> Subsystem Address Range Structure (ACPI 6.2 section 5.2.27.3). I
> expect that this will be set for persistent memory and HBM, and it
> sounds like you'd expect it to be set for your device memory as well.
>
> 2) It provides us with a way of telling userspace "hey, I know about
> some memory, and I can tell you its performance characteristics". All
> control of how this memory is allocated and used is still left to
> userspace.
>
> > I am not saying that NUMA is not the way forward; I am saying that as
> > it is today it is not suited for this. It is lacking metrics, it is
> > lacking logic, it is lacking features. We could add all this, but it
> > is a lot of work and I don't feel that we have enough real world
> > experience to do so now. I would rather have each device grow proper
> > infrastructure in its driver through a device specific API.
>
> To be clear, I'm not proposing that we teach the NUMA code how to
> automatically allocate for a given NUMA node, balance, etc. memory
> described by the HMAT. All I want is an API that says "here is some
> memory, I'll tell you all I can about it and let you do with it what
> you will", and perhaps a way to manually allocate what you want.
>
> And yes, this is very hand-wavy at this point. :) After I get the
> sysfs portion sussed out, the next step is to work on enabling
> something like libnuma to allow the memory to be manually allocated.
>
> I think this works for both my use case and yours, correct?

Depends what you mean. Using NUMA as it is today, no. Growing a new API
on the side of libnuma, maybe. It is hard to say. Right now the GPUs do
have a very rich API, see the OpenCL or CUDA APIs. Anything less
expressive than what they offer would not work. The existing libnuma
API is ill-suited. It is too static. GPU workloads are more dynamic; as
a result the virtual address space
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 05, 2017 at 03:20:50PM -0400, Jerome Glisse wrote:
<>
> Does HMAT support device hotplug? I am unfamiliar with the whole inner
> working of ACPI versus PCIe. Anyway, I don't see any issue with device
> memory also showing through HMAT, but like I said, the device driver
> for the device will want to be in total control of that memory.

Yep, the HMAT will support device hotplug via the _HMA method (section
6.2.18 of ACPI 6.2). This basically supplies an entirely new HMAT that
the system will use to replace the current one.

I don't yet have support for _HMA in my enabling, but I do intend to
add support for it once we settle on a sysfs API for the regular
boot-time case.

> Like I said, the issue here is that the core kernel is unaware of the
> device activity, i.e. on what part of memory the device is actively
> working. So core mm cannot make informed decisions on what should be
> migrated to device memory. Also we do not want regular memory
> allocations to end up in device memory unless explicitly asked for.
> A few reasons for that. First, this memory might not only be used for
> compute tasks but also for graphics, and in that case there are hard
> constraints on physically contiguous memory allocation that require
> the GPU to move things around to make room for graphics objects
> (can't allow GUP).
>
> Second reason, the device memory is inherently unreliable. If there is
> a bug in the device driver or the user manages to trigger a faulty
> condition on the GPU, the device might need a hard reset (i.e. cut
> PCIe power to the device), which leads to loss of memory content.
> While GPUs are becoming more and more resilient, they are still prone
> to lockups.
>
> Finally, for GPUs there is a common pattern of memory over-commit. You
> pretend to each application that it is the only one and allow each of
> them to allocate all of the device memory, or more than it could with
> strict sharing. As GPUs have long timeslices between switching to
> different contexts/applications, they can easily move large chunks of
> the process memory out and in at context/application switching. This
> has proven to be a key aspect of allowing maximum performance across
> several concurrent applications/contexts.
>
> To implement this, the easiest solution is for the device to lie about
> how much memory it has and use the system memory as an overflow.

I don't think any of this precludes the HMAT being involved. This is
all very similar to what I think we need to do for high bandwidth
memory, for example. We don't want the OS to use it for anything, and
we want all of it to be available for applications to allocate and use
for their specific workload. We don't want to make any assumptions
about how it can or should be used.

The HMAT is just there to give us a few things:

1) It provides us with an explicit way of telling the OS not to use the
memory, in the form of the "Reservation hint" flag in the Memory
Subsystem Address Range Structure (ACPI 6.2 section 5.2.27.3). I expect
that this will be set for persistent memory and HBM, and it sounds like
you'd expect it to be set for your device memory as well.

2) It provides us with a way of telling userspace "hey, I know about
some memory, and I can tell you its performance characteristics". All
control of how this memory is allocated and used is still left to
userspace.

> I am not saying that NUMA is not the way forward; I am saying that as
> it is today it is not suited for this. It is lacking metrics, it is
> lacking logic, it is lacking features. We could add all this, but it
> is a lot of work and I don't feel that we have enough real world
> experience to do so now. I would rather have each device grow proper
> infrastructure in its driver through a device specific API.

To be clear, I'm not proposing that we teach the NUMA code how to
automatically allocate for a given NUMA node, balance, etc. memory
described by the HMAT. All I want is an API that says "here is some
memory, I'll tell you all I can about it and let you do with it what
you will", and perhaps a way to manually allocate what you want.

And yes, this is very hand-wavy at this point. :) After I get the sysfs
portion sussed out, the next step is to work on enabling something like
libnuma to allow the memory to be manually allocated.

I think this works for both my use case and yours, correct?

> Then identify common patterns and from there try to build a sane API
> (if any such thing exists :)) rather than trying today to build the
> whole house from the ground up with just a foggy idea of how it should
> look in the end.

Yeah, I do see your point. My worry is that if I define an API and you
define an API, we'll end up in two different places with people using
our different APIs, then: https://xkcd.com/927/ :)

The HMAT enabling I'm trying to do is very passive - it doesn't
actively do *anything* with the memory; its entire purpose is to give
userspace more information about the memory so userspace can make
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/9/8 1:27, Jerome Glisse wrote:
>> On 2017/9/6 10:12, Jerome Glisse wrote:
>>> On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
>>> On 2017/9/6 2:54, Ross Zwisler wrote:
>>> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
>>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
>>> On 2017/8/17 8:05, Jérôme Glisse wrote:
[...]
>>> For HMM, each process gives hints (somewhat similar to mbind) for
>>> ranges of virtual addresses to the device kernel driver (through
>>> some API like OpenCL or CUDA for GPUs, for instance). All this being
>>> device driver specific ioctls.
>>>
>>> The kernel device driver has an overall view of all the processes
>>> that use the device and each piece of memory advice they gave. From
>>> that information the kernel device driver decides what part of each
>>> process address space to migrate to device memory.
>>
>> Oh, I mean CDM-HMM. I'm fine with HMM.
>
> They are one and the same really. In both cases HMM is just a set of
> helpers for device drivers.
>
>>> This is obviously dynamic and likely to change over the process
>>> lifetime.
>>>
>>> My understanding is that HMAT wants a similar API to allow processes
>>> to give direction on where each range of virtual addresses should be
>>> allocated. It is expected that most
>>
>> Right, but it's not clear who should manage the physical memory
>> allocation and set up the pagetable mapping. A new driver or the
>> kernel?
>
> Physical device memory is managed by the kernel device driver as it is
> today and as it will be tomorrow. HMM does not change that, nor does
> it require any change to that.
>

Can someone from Intel give more information about the plan for
managing HMAT reported memory?

> Migrating process memory to or from the device is done by the kernel
> through regular page migration. HMM provides new helpers for device
> drivers to initiate such migration. There is no mechanism like auto
> NUMA migration, for the reasons I explained previously.
>
> The kernel device driver uses all the knowledge it has to decide what
> to migrate to device memory. Nothing new here either; it is what
> happens today for specially allocated device objects, and it will just
> happen all the same for regular mmap memory (private anonymous or mmap
> of a regular file of a filesystem).
>
> So every low level thing happens in the kernel. Userspace only
> provides directives to the kernel device driver through a device
> specific API. But the kernel device driver can ignore or override
> those directives.
>
>>> software can easily infer what part of its address space will need
>>> more bandwidth or smaller latency versus what part is sparsely
>>> accessed ...
>>>
>>> For HMAT I think the first targets are HBM and persistent memory,
>>> and device memory might be added later if that makes sense.
>>
>> Okay, so there are two potential ways for CPU-addressable
>> cache-coherent device memory (or cpu-less numa memory, or "target
>> domain" memory in the ACPI spec)?
>> 1. CDM-HMM
>> 2. HMAT
>
> No, these are two orthogonal things; they do not conflict with each
> other, quite the contrary. HMM (the CDM part is no different) is a set
> of helpers, see it as a toolbox, for device drivers.
>
> HMAT is a way for firmware to report memory resources with more
> information than just ranges of physical addresses. HMAT is specific
> to platforms that rely on ACPI. HMAT does not provide any helpers to
> manage this memory.
>
> So a device driver can get information about device memory from HMAT
> and then use HMM to help in managing and using this memory.
>

Yes, but as Balbir mentioned this requires:
1. Don't online the memory as a NUMA node
2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver

And I'm not sure whether Intel is going to use this HMM-CDM based
method for their "target domain" memory, or whether they prefer the
NUMA approach? Ross? Dan?

--
Thanks,
Bob Liu
On Wed, Sep 06, 2017 at 09:25:36AM +0800, Bob Liu wrote:
> On 2017/9/6 2:54, Ross Zwisler wrote:
> > On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> >> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >>> On 2017/9/4 23:51, Jerome Glisse wrote:
> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > On 2017/8/17 8:05, Jérôme Glisse wrote:
> >> Unlike unaddressable memory, coherent device memory has a real
> >> resource associated with it on the system (as the CPU can address
> >> it). Add a new helper to hotplug such memory within the HMM
> >> framework.
> >
> > Got a new question: coherent device (e.g. CCIX) memory is likely to be
> > reported to the OS through ACPI and recognized as a NUMA memory node.
> > Then how can its memory be captured and managed by the HMM framework?
>
> The only platform that has such memory today is powerpc, and there it is not
> reported as regular memory by the firmware, hence why they need this helper.
>
> I don't think anyone has defined anything yet for x86 and ACPI. As this is
> >>>
> >>> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute
> >>> Table (HMAT) defined in ACPI 6.2.
> >>> The HMAT can cover CPU-addressable memory types (though not
> >>> non-cache-coherent on-device memory).
> >>>
> >>> Ross from Intel has already done some work on this, see:
> >>> https://lwn.net/Articles/724562/
> >>>
> >>> arm64 supports ACPI as well; there will likely be more devices of this
> >>> kind when CCIX is out (which should be very soon if on schedule).
> >>
> >> HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
> >> i.e. when you have several kinds of memory, each with different
> >> characteristics:
> >> - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
> >>   small (a few gigabytes)
> >> - Persistent memory: slower (both latency and bandwidth), big (terabytes)
> >> - DDR (good old memory): characteristics between HBM and persistent memory
> >>
> >> So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
> >> Note that device memory can have a hierarchy of memory itself (HBM, GDDR,
> >> and maybe even persistent memory).
> >>
> memory on a PCIE-like interface, then I don't expect it to be reported as a
> NUMA memory node but as an I/O range like any regular PCIE resource. The
> device driver, through capability flags, would then figure out whether the
> link between the device and the CPU is CCIX-capable; if so, it can use this
> helper to hotplug it as device memory.
>
> >>>
> >>> From my point of view, cache-coherent device memory will be popular soon
> >>> and reported through ACPI/UEFI. Extending NUMA policy still sounds more
> >>> reasonable to me.
> >>
> >> Cache-coherent devices will be reported through standard mechanisms
> >> defined by the bus standard they are using. To my knowledge all the
> >> standards are either on top of PCIE or similar to PCIE.
> >>
> >> It is true that on many platforms PCIE resources are managed/initialized
> >> by the BIOS (UEFI), but that is platform-specific. In some cases we
> >> reprogram what the BIOS picked.
> >>
> >> So, as I was saying, I don't expect the BIOS/UEFI to report device memory
> >> as regular memory. It will be reported as a regular PCIE resource, and
> >> then the device driver will be able to determine through some flags
> >> whether the link between the CPU(s) and the device is cache-coherent or
> >> not. At that point the device driver can register it with the HMM helper.
> >>
> >> The whole NUMA discussion has happened several times in the past; I
> >> suggest looking in the mm list archives for it. But it was ruled out for
> >> several reasons. Off the top of my head:
> >> - people hate CPU-less nodes, and device memory is inherently CPU-less
> >
> > With the introduction of the HMAT in ACPI 6.2, one of the things that was
> > added was the ability to have an ACPI proximity domain that isn't
> > associated with a CPU. This can be seen in the changes to the text of the
> > "Proximity Domain" field in table 5-73, which describes the "Memory
> > Affinity Structure". One of the major features of the HMAT was the
> > separation of "initiator" proximity domains (CPUs, devices that initiate
> > memory transfers) and "target" proximity domains (memory regions, be they
> > attached to a CPU or some other device).
> >
> > ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> > already in a place where we have to support CPU-less NUMA nodes.
> >
> >> - device drivers want total control over memory, and thus to be isolated
> >>   from mm mechanisms; doing all those special cases was not welcome
> >
> > I agree that the kernel doesn't have enough information to be able to
> > accurately handle all the use cases for the various types
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/9/6 2:54, Ross Zwisler wrote:
> On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
>> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
>>> On 2017/9/4 23:51, Jerome Glisse wrote:
On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jérôme Glisse wrote:
>> Unlike unaddressable memory, coherent device memory has a real
>> resource associated with it on the system (as the CPU can address
>> it). Add a new helper to hotplug such memory within the HMM
>> framework.
>>
>
> Got a new question: coherent device (e.g. CCIX) memory is likely
> reported to the OS through ACPI and recognized as a NUMA memory node.
> Then how can its memory be captured and managed by the HMM framework?
>
The only platform that has such memory today is powerpc, and it is not reported
as regular memory by the firmware, hence why they need this helper.

I don't think anyone has defined anything yet for x86 and ACPI. As this is
>>>
>>> Not yet, but now the ACPI spec has the Heterogeneous Memory Attribute
>>> Table (HMAT) defined in ACPI 6.2.
>>> The HMAT can cover CPU-addressable memory types (though not non-cache-coherent
>>> on-device memory).
>>>
>>> Ross from Intel has already done some work on this, see:
>>> https://lwn.net/Articles/724562/
>>>
>>> arm64 supports ACPI also; there will likely be more of this kind of device
>>> when CCIX is out (should be very soon if on schedule).
>>
>> HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
>> i.e. when you have several kinds of memory, each with different characteristics:
>> - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
>>   small (i.e. a few gigabytes)
>> - Persistent memory: slower (both latency and bandwidth), big (terabytes)
>> - DDR (good old memory): characteristics are between HBM and persistent
>>
>> So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
>> Note that device memory can have a hierarchy of memory itself (HBM, GDDR, and
>> maybe even persistent memory).
>>
memory on a PCIE-like interface, then I don't expect it to be reported as a NUMA
memory node but as an I/O range like any regular PCIE resource. The device driver,
through capability flags, would then figure out if the link between the device and
CPU is CCIX-capable; if so, it can use this helper to hotplug it as device memory.
>>>
>>> From my point of view, cache-coherent device memory will be popular soon and
>>> reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable
>>> to me.
>>
>> Cache-coherent devices will be reported through standard mechanisms defined by
>> the bus standard they are using. To my knowledge all the standards are either
>> on top of PCIE or are similar to PCIE.
>>
>> It is true that on many platforms the PCIE resource is managed/initialized by the
>> BIOS (UEFI), but it is platform specific. In some cases we reprogram what the
>> BIOS picks.
>>
>> So like I was saying, I don't expect the BIOS/UEFI to report device memory as
>> regular memory. It will be reported as a regular PCIE resource, and then the
>> device driver will be able to determine through some flags whether the link
>> between the CPU(s) and the device is cache coherent or not. At that point the
>> device driver can register it with the HMM helper.
>>
>>
>> The whole NUMA discussion happened several times in the past; I suggest looking
>> in the mm list archives for them. But it was ruled out for several reasons. Off
>> the top of my head:
>> - people hate CPU-less nodes, and device memory is inherently CPU-less
>
> With the introduction of the HMAT in ACPI 6.2, one of the things that was added
> was the ability to have an ACPI proximity domain that isn't associated with a
> CPU. This can be seen in the changes to the text of the "Proximity Domain"
> field in table 5-73, which describes the "Memory Affinity Structure".
> One of the major features of the HMAT was the separation of "initiator" proximity
> domains (CPUs, devices that initiate memory transfers) and "target" proximity
> domains (memory regions, be they attached to a CPU or some other device).
>
> ACPI proximity domains map directly to Linux NUMA nodes, so I think we're
> already in a place where we have to support CPU-less NUMA nodes.
>
>> - device drivers want total control over memory and thus to be isolated from
>>   mm mechanisms, and doing all those special cases was not welcome
>
> I agree that the kernel doesn't have enough information to be able to
> accurately handle all the use cases for the various types of heterogeneous
> memory. The goal of my HMAT enabling is to allow that memory to be reserved
> from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem
> Address Range Structure, then provide userspace with enough information to be
> able to distinguish between the various types of memory in the system so it
> can allocate & utilize it
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 05, 2017 at 01:00:13PM -0600, Ross Zwisler wrote:
> On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > > On 2017/8/17 8:05, Jérôme Glisse wrote:
> > > > Unlike unaddressable memory, coherent device memory has a real
> > > > resource associated with it on the system (as the CPU can address
> > > > it). Add a new helper to hotplug such memory within the HMM
> > > > framework.
> > > >
> > >
> > > Got a new question: coherent device (e.g. CCIX) memory is likely
> > > reported to the OS through ACPI and recognized as a NUMA memory node.
> > > Then how can its memory be captured and managed by the HMM framework?
> > >
> > > >>>
> > > >>> The only platform that has such memory today is powerpc, and it is not
> > > >>> reported as regular memory by the firmware, hence why they need this helper.
> > > >>>
> > > >>> I don't think anyone has defined anything yet for x86 and ACPI. As this is
> > > >>
> > > >> Not yet, but now the ACPI spec has the Heterogeneous Memory Attribute
> > > >> Table (HMAT) defined in ACPI 6.2.
> > > >> The HMAT can cover CPU-addressable memory types (though not non-cache-coherent
> > > >> on-device memory).
> > > >>
> > > >> Ross from Intel has already done some work on this, see:
> > > >> https://lwn.net/Articles/724562/
> > > >>
> > > >> arm64 supports ACPI also; there will likely be more of this kind of device
> > > >> when CCIX is out (should be very soon if on schedule).
> > > >
> > > > HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
> > > > i.e. when you have several kinds of memory, each with different characteristics:
> > > > - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
> > > >   small (i.e. a few gigabytes)
> > > > - Persistent memory: slower (both latency and bandwidth), big (terabytes)
> > > > - DDR (good old memory): characteristics are between HBM and persistent
> > > >
> > >
> > > Okay, then how does the kernel handle the situation of "kinds of memory each
> > > with different characteristics"?
> > > Does someone have any suggestion? I thought HMM could do this.
> > > NUMA policy/node distance is good but perhaps requires some extending, e.g.
> > > an HBM node can't be swapped, can't accept DDR fallback allocation.
> >
> > I don't think there is any consensus for this. I put forward the idea that NUMA
> > needed to be extended, as with deep hierarchy it is not only the distance between
> > two nodes but also other factors like persistency, bandwidth, latency ...
> >
> > > > So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
> > > > Note that device memory can have a hierarchy of memory itself (HBM, GDDR and
> > > > maybe even persistent memory).
> > > >
> > >
> > > This looks like a subset of HMAT when the CPU can address device memory
> > > directly in a cache-coherent way.
> >
> > It is not; it is much more complex than that. The Linux kernel has no idea
> > what is going on in a device and thus does not have any useful information to
> > make proper decisions regarding device memory. Here "device" means a real device,
> > i.e. something with processing capability, not something like HBM or persistent
> > memory, even if the latter is associated with a struct device inside the Linux
> > kernel.
> >
> >
> > > >>> memory on a PCIE-like interface, then I don't expect it to be reported
> > > >>> as a NUMA memory node but as an I/O range like any regular PCIE resource.
> > > >>> The device driver, through capability flags, would then figure out if the
> > > >>> link between the device and CPU is CCIX-capable; if so, it can use this
> > > >>> helper to hotplug it as device memory.
> > > >>>
> > > >>
> > > >> From my point of view, cache-coherent device memory will be popular soon
> > > >> and reported through ACPI/UEFI. Extending NUMA policy still sounds more
> > > >> reasonable to me.
> > > >
> > > > Cache-coherent devices will be reported through standard mechanisms defined
> > > > by the bus standard they are using. To my knowledge all the standards are
> > > > either on top of PCIE or are similar to PCIE.
> > > >
> > > > It is true that on many platforms the PCIE resource is managed/initialized
> > > > by the BIOS (UEFI), but it is platform specific. In some cases we reprogram
> > > > what the BIOS picks.
> > > >
> > > > So like I was saying, I don't expect the BIOS/UEFI to report device memory
> > > > as
> > >
> > > But it's happening.
> > > In my understanding, that's
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 05, 2017 at 09:50:17AM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> > On 2017/9/5 10:38, Jerome Glisse wrote:
> > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > >> On 2017/9/4 23:51, Jerome Glisse wrote:
> > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > On 2017/8/17 8:05, Jérôme Glisse wrote:
> > > Unlike unaddressable memory, coherent device memory has a real
> > > resource associated with it on the system (as the CPU can address
> > > it). Add a new helper to hotplug such memory within the HMM
> > > framework.
> > >
> >
> > Got a new question: coherent device (e.g. CCIX) memory is likely
> > reported to the OS through ACPI and recognized as a NUMA memory node.
> > Then how can its memory be captured and managed by the HMM framework?
> >
> > >>>
> > >>> The only platform that has such memory today is powerpc, and it is not
> > >>> reported as regular memory by the firmware, hence why they need this helper.
> > >>>
> > >>> I don't think anyone has defined anything yet for x86 and ACPI. As this is
> > >>
> > >> Not yet, but now the ACPI spec has the Heterogeneous Memory Attribute
> > >> Table (HMAT) defined in ACPI 6.2.
> > >> The HMAT can cover CPU-addressable memory types (though not non-cache-coherent
> > >> on-device memory).
> > >>
> > >> Ross from Intel has already done some work on this, see:
> > >> https://lwn.net/Articles/724562/
> > >>
> > >> arm64 supports ACPI also; there will likely be more of this kind of device
> > >> when CCIX is out (should be very soon if on schedule).
> > >
> > > HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory,
> > > i.e. when you have several kinds of memory, each with different characteristics:
> > > - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat
> > >   small (i.e. a few gigabytes)
> > > - Persistent memory: slower (both latency and bandwidth), big (terabytes)
> > > - DDR (good old memory): characteristics are between HBM and persistent
> > >
> >
> > Okay, then how does the kernel handle the situation of "kinds of memory each
> > with different characteristics"?
> > Does someone have any suggestion? I thought HMM could do this.
> > NUMA policy/node distance is good but perhaps requires some extending, e.g.
> > an HBM node can't be swapped, can't accept DDR fallback allocation.
>
> I don't think there is any consensus for this. I put forward the idea that NUMA
> needed to be extended, as with deep hierarchy it is not only the distance between
> two nodes but also other factors like persistency, bandwidth, latency ...
>
> > > So AFAICT this has nothing to do with what HMM is for, i.e. device memory.
> > > Note that device memory can have a hierarchy of memory itself (HBM, GDDR and
> > > maybe even persistent memory).
> > >
> >
> > This looks like a subset of HMAT when the CPU can address device memory
> > directly in a cache-coherent way.
>
> It is not; it is much more complex than that. The Linux kernel has no idea
> what is going on in a device and thus does not have any useful information to
> make proper decisions regarding device memory. Here "device" means a real device,
> i.e. something with processing capability, not something like HBM or persistent
> memory, even if the latter is associated with a struct device inside the Linux
> kernel.
>
>
> > >>> memory on a PCIE-like interface, then I don't expect it to be reported as
> > >>> a NUMA memory node but as an I/O range like any regular PCIE resource.
> > >>> The device driver, through capability flags, would then figure out if the
> > >>> link between the device and CPU is CCIX-capable; if so, it can use this
> > >>> helper to hotplug it as device memory.
> > >>>
> > >>
> > >> From my point of view, cache-coherent device memory will be popular soon
> > >> and reported through ACPI/UEFI. Extending NUMA policy still sounds more
> > >> reasonable to me.
> > >
> > > Cache-coherent devices will be reported through standard mechanisms defined
> > > by the bus standard they are using. To my knowledge all the standards are
> > > either on top of PCIE or are similar to PCIE.
> > >
> > > It is true that on many platforms the PCIE resource is managed/initialized
> > > by the BIOS (UEFI), but it is platform specific. In some cases we reprogram
> > > what the BIOS picks.
> > >
> > > So like I was saying, I don't expect the BIOS/UEFI to report device memory
> > > as
> >
> > But it's happening.
> > In my understanding, that's why HMAT was introduced.
> > For reporting device memory as regular memory (with different characteristics).
>
> That is not my understanding, but only Intel can confirm. HMAT was introduced
> for things like HBM or persistent memory, which I do not consider as device
> memory. Sure, persistent memory is assigned a device struct
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Mon, Sep 04, 2017 at 10:38:27PM -0400, Jerome Glisse wrote:
> On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> > On 2017/9/4 23:51, Jerome Glisse wrote:
> > > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> > >> On 2017/8/17 8:05, Jérôme Glisse wrote:
> > >>> Unlike unaddressable memory, coherent device memory has a real resource associated with it on the system (as the CPU can address it). Add a new helper to hotplug such memory within the HMM framework.
> > >>
> > >> Got a new question: coherent device (e.g. CCIX) memory is likely reported to the OS through ACPI and recognized as a NUMA memory node. Then how can its memory be captured and managed by the HMM framework?
> > >
> > > The only platform that has such memory today is powerpc and it is not reported as regular memory by the firmware, hence why they need this helper.
> > >
> > > I don't think anyone has defined anything yet for x86 and ACPI. As this is
> >
> > Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute Table (HMAT) defined in ACPI 6.2. The HMAT can cover CPU-addressable memory types (though not non-cache-coherent on-device memory).
> >
> > Ross from Intel has already done some work on this, see: https://lwn.net/Articles/724562/
> >
> > arm64 supports ACPI also; there will likely be more of this kind of device when CCIX is out (which should be very soon if on schedule).
>
> HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory, i.e. when you have several kinds of memory, each with different characteristics:
>   - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat small (i.e. a few gigabytes)
>   - Persistent memory: slower (both latency and bandwidth), big (terabytes)
>   - DDR (good old memory): characteristics between HBM and persistent
>
> So AFAICT this has nothing to do with what HMM is for, i.e. device memory. Note that device memory can have a hierarchy of memory itself (HBM, GDDR and maybe even persistent memory).
>
> > > memory on a PCIE-like interface, then I don't expect it to be reported as a NUMA memory node but as an IO range like any regular PCIE resource. The device driver, through capability flags, would then figure out whether the link between the device and CPU is CCIX-capable; if so it can use this helper to hotplug it as device memory.
> >
> > From my point of view, cache-coherent device memory will be popular soon and be reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable to me.
>
> Cache-coherent devices will be reported through standard mechanisms defined by the bus standard they are using. To my knowledge all the standards are either on top of PCIE or are similar to PCIE.
>
> It is true that on many platforms PCIE resources are managed/initialized by the BIOS (UEFI), but it is platform specific. In some cases we reprogram what the BIOS picks.
>
> So like I was saying, I don't expect the BIOS/UEFI to report device memory as regular memory. It will be reported as a regular PCIE resource and then the device driver will be able to determine through some flags whether the link between the CPU(s) and the device is cache coherent or not. At that point the device driver can register it with the HMM helper.
>
> The whole NUMA discussion happened several times in the past; I suggest looking in the mm list archives for them. But it was ruled out for several reasons. Top of my head:
> - people hate CPU-less nodes and device memory is inherently CPU-less

With the introduction of the HMAT in ACPI 6.2, one of the things that was added was the ability to have an ACPI proximity domain that isn't associated with a CPU. This can be seen in the changes in the text of the "Proximity Domain" field in table 5-73, which describes the "Memory Affinity Structure".

One of the major features of the HMAT was the separation of "initiator" proximity domains (CPUs, devices that initiate memory transfers) and "target" proximity domains (memory regions, be they attached to a CPU or some other device). ACPI proximity domains map directly to Linux NUMA nodes, so I think we're already in a place where we have to support CPU-less NUMA nodes.

> - device drivers want total control over memory and thus to be isolated from mm mechanisms, and doing all those special cases was not welcome

I agree that the kernel doesn't have enough information to be able to accurately handle all the use cases for the various types of heterogeneous memory. The goal of my HMAT enabling is to allow that memory to be reserved from kernel use via the "Reservation Hint" in the HMAT's Memory Subsystem Address Range Structure, then provide userspace with enough information to be able to distinguish between the various types of memory in the system so it can allocate & utilize it appropriately.

> - existing NUMA migration mechanisms are ill-suited for this
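The initiator/target split described above can be illustrated with a small toy model. Everything here is invented for illustration (the class, field names, and data are not any kernel or ACPI API; the kernel parses the real HMAT from ACPI tables): each target proximity domain records its read latency as seen from each initiator domain, and a CPU-less memory-only node is simply a target with no CPU.

```python
# Toy model of HMAT "initiator" vs "target" proximity domains.
# All names and numbers are illustrative, not a real interface.

from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    domain: int        # ACPI proximity domain -> Linux NUMA node
    has_cpu: bool      # False models a CPU-less (memory-only) node
    latency_ns: dict   # read latency from each initiator domain


def best_target(initiator: int, targets):
    """Pick the target domain with the lowest read latency from `initiator`."""
    return min(targets, key=lambda t: t.latency_ns[initiator]).domain


targets = [
    Target(domain=0, has_cpu=True,  latency_ns={0: 80,  1: 130}),  # local DDR
    Target(domain=2, has_cpu=False, latency_ns={0: 50,  1: 120}),  # HBM, CPU-less
    Target(domain=3, has_cpu=False, latency_ns={0: 300, 1: 310}),  # pmem, CPU-less
]

print(best_target(0, targets))  # HBM (domain 2) is nearest to initiator 0
```

The point of the sketch is only that "nearest memory" becomes a per-initiator question once targets are decoupled from CPUs, which is exactly what CPU-less NUMA nodes expose.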
On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote:
> On 2017/9/5 10:38, Jerome Glisse wrote:
> > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote:
> >> On 2017/9/4 23:51, Jerome Glisse wrote:
> >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote:
> On 2017/8/17 8:05, Jérôme Glisse wrote:
> > Unlike unaddressable memory, coherent device memory has a real resource associated with it on the system (as the CPU can address it). Add a new helper to hotplug such memory within the HMM framework.
>
> Got a new question: coherent device (e.g. CCIX) memory is likely reported to the OS through ACPI and recognized as a NUMA memory node. Then how can its memory be captured and managed by the HMM framework?
>
> >>> The only platform that has such memory today is powerpc and it is not reported as regular memory by the firmware, hence why they need this helper.
> >>>
> >>> I don't think anyone has defined anything yet for x86 and ACPI. As this is
> >>
> >> Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute Table (HMAT) defined in ACPI 6.2. The HMAT can cover CPU-addressable memory types (though not non-cache-coherent on-device memory).
> >>
> >> Ross from Intel has already done some work on this, see: https://lwn.net/Articles/724562/
> >>
> >> arm64 supports ACPI also; there will likely be more of this kind of device when CCIX is out (which should be very soon if on schedule).
> >
> > HMAT is not for the same thing. AFAIK HMAT is for deep "hierarchy" memory, i.e. when you have several kinds of memory, each with different characteristics:
> >   - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat small (i.e. a few gigabytes)
> >   - Persistent memory: slower (both latency and bandwidth), big (terabytes)
> >   - DDR (good old memory): characteristics between HBM and persistent
>
> Okay, then how does the kernel handle the situation of "kinds of memory each with different characteristics"? Does someone have any suggestion? I thought HMM could do this. NUMA policy/node distance is good but perhaps requires some extending, e.g. an HBM node can't be swapped, can't accept DDR fallback allocation.

I don't think there is any consensus for this. I put forward the idea that NUMA needed to be extended, as with a deep hierarchy it is not only the distance between two nodes but also other factors like persistency, bandwidth, latency ...

> > So AFAICT this has nothing to do with what HMM is for, i.e. device memory. Note that device memory can have a hierarchy of memory itself (HBM, GDDR and maybe even persistent memory).
>
> This looks like a subset of HMAT, when the CPU can address device memory directly in a cache-coherent way.

It is not; it is much more complex than that. The Linux kernel has no idea what is going on on a device and thus does not have any useful information to make proper decisions regarding device memory. Here a device is a real device, i.e. something with processing capability, not something like HBM or persistent memory, even if the latter is associated with a struct device inside the Linux kernel.

> >>> memory on a PCIE-like interface, then I don't expect it to be reported as a NUMA memory node but as an IO range like any regular PCIE resource. The device driver, through capability flags, would then figure out whether the link between the device and CPU is CCIX-capable; if so it can use this helper to hotplug it as device memory.
> >>
> >> From my point of view, cache-coherent device memory will be popular soon and be reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable to me.
> >
> > Cache-coherent devices will be reported through standard mechanisms defined by the bus standard they are using. To my knowledge all the standards are either on top of PCIE or are similar to PCIE.
> >
> > It is true that on many platforms PCIE resources are managed/initialized by the BIOS (UEFI), but it is platform specific. In some cases we reprogram what the BIOS picks.
> >
> > So like I was saying, I don't expect the BIOS/UEFI to report device memory as
>
> But it's happening. In my understanding, that's why HMAT was introduced: for reporting device memory as regular memory (with different characteristics).

That is not my understanding, but only Intel can confirm. HMAT was introduced for things like HBM or persistent memory, which I do not consider to be device memory. Sure, persistent memory is assigned a device struct because it is easier for integration with the block system, I assume. But that does not make it a device in my view. For me a device is a piece of hardware that has some processing capabilities (network adapter, sound card, GPU, ...)

But we can argue about semantics and what a device is. For all intents and purposes, a device in the HMM context is some piece of hardware with
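The flow Jerome describes above — the bus reports the range as a regular PCIE resource, and the driver decides from link-capability flags whether to register it as HMM device memory or leave it as a plain IO range — can be sketched as a toy decision function. The flag name and return values are invented for illustration; on the kernel side the entry point added by this series is, as I understand it, the hmm_devmem_add_resource() helper.

```python
# Toy model of the driver-side decision Jerome describes.
# CAP_CACHE_COHERENT and the returned labels are illustrative only;
# the real kernel path registers the range via an HMM helper.

CAP_CACHE_COHERENT = 0x1   # e.g. a CCIX-capable, cache-coherent link


def claim_memory(range_start, range_size, caps):
    """Decide how to expose a device memory range based on link capabilities."""
    if caps & CAP_CACHE_COHERENT:
        # CPU can address it coherently: hotplug as HMM device memory.
        return ("hmm-device-memory", range_start, range_size)
    # Not coherent: leave it as an ordinary IO range.
    return ("io-range", range_start, range_size)


print(claim_memory(0x100000000, 16 << 30, CAP_CACHE_COHERENT)[0])  # hmm-device-memory
print(claim_memory(0x200000000, 1 << 30, 0)[0])                    # io-range
```

The point is that the decision lives in the driver, keyed off bus-reported capabilities, rather than in firmware tables.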
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 05, 2017 at 11:50:57AM +0800, Bob Liu wrote: > On 2017/9/5 10:38, Jerome Glisse wrote: > > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote: > >> On 2017/9/4 23:51, Jerome Glisse wrote: > >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: > On 2017/8/17 8:05, Jérôme Glisse wrote: > > Unlike unaddressable memory, coherent device memory has a real > > resource associated with it on the system (as CPU can address > > it). Add a new helper to hotplug such memory within the HMM > > framework. > > > > Got an new question, coherent device( e.g CCIX) memory are likely > reported to OS > through ACPI and recognized as NUMA memory node. > Then how can their memory be captured and managed by HMM framework? > > >>> > >>> Only platform that has such memory today is powerpc and it is not reported > >>> as regular memory by the firmware hence why they need this helper. > >>> > >>> I don't think anyone has defined anything yet for x86 and acpi. As this is > >> > >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute > >> Table (HMAT) table defined in ACPI 6.2. > >> The HMAT can cover CPU-addressable memory types(though not non-cache > >> coherent on-device memory). > >> > >> Ross from Intel already done some work on this, see: > >> https://lwn.net/Articles/724562/ > >> > >> arm64 supports APCI also, there is likely more this kind of device when > >> CCIX > >> is out (should be very soon if on schedule). > > > > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie > > when you have several kind of memory each with different characteristics: > > - HBM very fast (latency) and high bandwidth, non persistent, somewhat > > small (ie few giga bytes) > > - Persistent memory, slower (both latency and bandwidth) big (tera bytes) > > - DDR (good old memory) well characteristics are between HBM and > > persistent > > > > Okay, then how the kernel handle the situation of "kind of memory each with > different characteristics"? 
> Does someone have any suggestion? I thought HMM can do this. > Numa policy/node distance is good but perhaps require a few extending, e.g a > HBM node can't be > swap, can't accept DDR fallback allocation. I don't think there is any consensus for this. I put forward the idea that NUMA needed to be extended as with deep hierarchy it is not only the distance between two nodes but also others factors like persistency, bandwidth, latency ... > > So AFAICT this has nothing to do with what HMM is for, ie device memory. > > Note > > that device memory can have a hierarchy of memory themself (HBM, GDDR and in > > maybe even persistent memory). > > > > This looks like a subset of HMAT when CPU can address device memory directly > in cache-coherent way. It is not, it is much more complex than that. Linux kernel has no idea on what is going on a device and thus do not have any usefull informations to make proper decission regarding device memory. Here device is real device ie something with processing capability, not something like HBM or persistent memory even if the latter is associated with a struct device inside linux kernel. > > > >>> memory on PCIE like interface then i don't expect it to be reported as > >>> NUMA > >>> memory node but as io range like any regular PCIE resources. Device driver > >>> through capabilities flags would then figure out if the link between the > >>> device and CPU is CCIX capable if so it can use this helper to hotplug it > >>> as device memory. > >>> > >> > >> From my point of view, Cache coherent device memory will popular soon and > >> reported through ACPI/UEFI. Extending NUMA policy still sounds more > >> reasonable > >> to me. > > > > Cache coherent device will be reported through standard mecanisms defined by > > the bus standard they are using. To my knowledge all the standard are either > > on top of PCIE or are similar to PCIE. 
> > > > It is true that on many platform PCIE resource is manage/initialize by the > > bios (UEFI) but it is platform specific. In some case we reprogram what the > > bios pick. > > > > So like i was saying i don't expect the BIOS/UEFI to report device memory as > > But it's happening. > In my understanding, that's why HMAT was introduced. > For reporting device memory as regular memory(with different characteristics). That is not my understanding, but only Intel can confirm. HMAT was introduced for things like HBM or persistent memory, which I do not consider device memory. Sure, persistent memory is assigned a device struct because it is easier for integration with the block system, I assume, but that does not make it a device in my view. For me a device is a piece of hardware that has some processing capabilities (network adapter, sound card, GPU, ...). But we can argue about semantics and what a device is. For all intents and purposes, a device in the HMM context is some piece of hardware with
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/9/5 10:38, Jerome Glisse wrote: > On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote: >> On 2017/9/4 23:51, Jerome Glisse wrote: >>> On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: On 2017/8/17 8:05, Jérôme Glisse wrote: > Unlike unaddressable memory, coherent device memory has a real > resource associated with it on the system (as CPU can address > it). Add a new helper to hotplug such memory within the HMM > framework. > Got an new question, coherent device( e.g CCIX) memory are likely reported to OS through ACPI and recognized as NUMA memory node. Then how can their memory be captured and managed by HMM framework? >>> >>> Only platform that has such memory today is powerpc and it is not reported >>> as regular memory by the firmware hence why they need this helper. >>> >>> I don't think anyone has defined anything yet for x86 and acpi. As this is >> >> Not yet, but now the ACPI spec has Heterogeneous Memory Attribute >> Table (HMAT) table defined in ACPI 6.2. >> The HMAT can cover CPU-addressable memory types(though not non-cache >> coherent on-device memory). >> >> Ross from Intel already done some work on this, see: >> https://lwn.net/Articles/724562/ >> >> arm64 supports APCI also, there is likely more this kind of device when CCIX >> is out (should be very soon if on schedule). > > HMAT is not for the same thing, AFAIK HMAT is for deep "hierarchy" memory ie > when you have several kind of memory each with different characteristics: > - HBM very fast (latency) and high bandwidth, non persistent, somewhat > small (ie few giga bytes) > - Persistent memory, slower (both latency and bandwidth) big (tera bytes) > - DDR (good old memory) well characteristics are between HBM and persistent > Okay, then how the kernel handle the situation of "kind of memory each with different characteristics"? Does someone have any suggestion? I thought HMM can do this. 
NUMA policy/node distance is good but perhaps requires some extending, e.g. an HBM node can't be swapped out and can't accept DDR fallback allocation. > So AFAICT this has nothing to do with what HMM is for, ie device memory. Note > that device memory can have a hierarchy of memory themself (HBM, GDDR and in > maybe even persistent memory). > This looks like a subset of HMAT when CPU can address device memory directly in cache-coherent way. >>> memory on PCIE like interface then i don't expect it to be reported as NUMA >>> memory node but as io range like any regular PCIE resources. Device driver >>> through capabilities flags would then figure out if the link between the >>> device and CPU is CCIX capable if so it can use this helper to hotplug it >>> as device memory. >>> >> >> From my point of view, Cache coherent device memory will popular soon and >> reported through ACPI/UEFI. Extending NUMA policy still sounds more >> reasonable >> to me. > > Cache coherent device will be reported through standard mecanisms defined by > the bus standard they are using. To my knowledge all the standard are either > on top of PCIE or are similar to PCIE. > > It is true that on many platform PCIE resource is manage/initialize by the > bios (UEFI) but it is platform specific. In some case we reprogram what the > bios pick. > > So like i was saying i don't expect the BIOS/UEFI to report device memory as But it's happening. In my understanding, that's why HMAT was introduced. For reporting device memory as regular memory (with different characteristics). -- Regards, Bob Liu > regular memory. It will be reported as a regular PCIE resources and then the > device driver will be able to determine through some flags if the link between > the CPU(s) and the device is cache coherent or not. At that point the device > driver can use register it with HMM helper. > > > The whole NUMA discussion happen several time in the past i suggest looking > on mm list archive for them. 
But it was rule out for several reasons. Top of > my head: > - people hate CPU less node and device memory is inherently CPU less > - device driver want total control over memory and thus to be isolated from > mm mecanism and doing all those special cases was not welcome > - existing NUMA migration mecanism are ill suited for this memory as > access by the device to the memory is unknown to core mm and there > is no easy way to report it or track it (this kind of depends on the > platform and hardware) > > I am likely missing other big points. > > Cheers, > Jérôme > > . >
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 5, 2017 at 1:51 AM, Jerome Glisse wrote: > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: >> On 2017/8/17 8:05, Jérôme Glisse wrote: >> > Unlike unaddressable memory, coherent device memory has a real >> > resource associated with it on the system (as CPU can address >> > it). Add a new helper to hotplug such memory within the HMM >> > framework. >> > >> >> Got an new question, coherent device( e.g CCIX) memory are likely reported >> to OS >> through ACPI and recognized as NUMA memory node. >> Then how can their memory be captured and managed by HMM framework? >> > > Only platform that has such memory today is powerpc and it is not reported > as regular memory by the firmware hence why they need this helper. > > I don't think anyone has defined anything yet for x86 and acpi. As this is > memory on PCIE like interface then i don't expect it to be reported as NUMA > memory node but as io range like any regular PCIE resources. Device driver > through capabilities flags would then figure out if the link between the > device and CPU is CCIX capable if so it can use this helper to hotplug it > as device memory. Yep, the arch needs to do the right thing at hotplug time, which is 1. Don't online the memory as a NUMA node 2. Use the HMM-CDM APIs to map the memory to ZONE_DEVICE via the driver Like Jerome said and we tried as well, the NUMA approach needs more agreement and discussion, and probably extensions. Balbir Singh
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Tue, Sep 05, 2017 at 09:13:24AM +0800, Bob Liu wrote: > On 2017/9/4 23:51, Jerome Glisse wrote: > > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: > >> On 2017/8/17 8:05, Jérôme Glisse wrote: > >>> Unlike unaddressable memory, coherent device memory has a real > >>> resource associated with it on the system (as CPU can address > >>> it). Add a new helper to hotplug such memory within the HMM > >>> framework. > >>> > >> > >> Got an new question, coherent device( e.g CCIX) memory are likely reported > >> to OS > >> through ACPI and recognized as NUMA memory node. > >> Then how can their memory be captured and managed by HMM framework? > >> > > > > Only platform that has such memory today is powerpc and it is not reported > > as regular memory by the firmware hence why they need this helper. > > > > I don't think anyone has defined anything yet for x86 and acpi. As this is > > Not yet, but now the ACPI spec has Heterogeneous Memory Attribute > Table (HMAT) table defined in ACPI 6.2. > The HMAT can cover CPU-addressable memory types(though not non-cache > coherent on-device memory). > > Ross from Intel already done some work on this, see: > https://lwn.net/Articles/724562/ > > arm64 supports APCI also, there is likely more this kind of device when CCIX > is out (should be very soon if on schedule). HMAT is not for the same thing; AFAIK HMAT is for deep "hierarchy" memory, ie when you have several kinds of memory, each with different characteristics: - HBM: very fast (latency) and high bandwidth, non-persistent, somewhat small (ie a few gigabytes) - Persistent memory: slower (both latency and bandwidth), big (terabytes) - DDR (good old memory): characteristics between HBM and persistent memory So AFAICT this has nothing to do with what HMM is for, ie device memory. Note that device memory can have a hierarchy of memory itself (HBM, GDDR and maybe even persistent memory). 
> > memory on PCIE like interface then i don't expect it to be reported as NUMA > > memory node but as io range like any regular PCIE resources. Device driver > > through capabilities flags would then figure out if the link between the > > device and CPU is CCIX capable if so it can use this helper to hotplug it > > as device memory. > > > > From my point of view, Cache coherent device memory will popular soon and > reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable > to me. Cache coherent devices will be reported through standard mechanisms defined by the bus standard they are using. To my knowledge all the standards are either on top of PCIE or similar to PCIE. It is true that on many platforms PCIE resources are managed/initialized by the BIOS (UEFI), but that is platform specific. In some cases we reprogram what the BIOS picked. So like I was saying, I don't expect the BIOS/UEFI to report device memory as regular memory. It will be reported as a regular PCIE resource, and then the device driver will be able to determine through some flags whether the link between the CPU(s) and the device is cache coherent or not. At that point the device driver can register it with the HMM helper. The whole NUMA discussion happened several times in the past; I suggest looking in the mm list archives for them. It was ruled out for several reasons. Off the top of my head: - people hate CPU-less nodes, and device memory is inherently CPU-less - device drivers want total control over memory and thus to be isolated from mm mechanisms, and doing all those special cases was not welcome - existing NUMA migration mechanisms are ill suited for this memory, as access by the device to the memory is unknown to the core mm and there is no easy way to report or track it (this kind of depends on the platform and hardware) I am likely missing other big points. Cheers, Jérôme
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/9/4 23:51, Jerome Glisse wrote: > On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: >> On 2017/8/17 8:05, Jérôme Glisse wrote: >>> Unlike unaddressable memory, coherent device memory has a real >>> resource associated with it on the system (as CPU can address >>> it). Add a new helper to hotplug such memory within the HMM >>> framework. >>> >> >> Got an new question, coherent device( e.g CCIX) memory are likely reported >> to OS >> through ACPI and recognized as NUMA memory node. >> Then how can their memory be captured and managed by HMM framework? >> > > Only platform that has such memory today is powerpc and it is not reported > as regular memory by the firmware hence why they need this helper. > > I don't think anyone has defined anything yet for x86 and acpi. As this is Not yet, but the ACPI spec now has the Heterogeneous Memory Attribute Table (HMAT) defined in ACPI 6.2. The HMAT can cover CPU-addressable memory types (though not non-cache-coherent on-device memory). Ross from Intel has already done some work on this, see: https://lwn.net/Articles/724562/ arm64 supports ACPI also; there will likely be more of this kind of device when CCIX is out (should be very soon if on schedule). > memory on PCIE like interface then i don't expect it to be reported as NUMA > memory node but as io range like any regular PCIE resources. Device driver > through capabilities flags would then figure out if the link between the > device and CPU is CCIX capable if so it can use this helper to hotplug it > as device memory. > From my point of view, cache coherent device memory will be popular soon and reported through ACPI/UEFI. Extending NUMA policy still sounds more reasonable to me. -- Thanks, Bob Liu
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On Mon, Sep 04, 2017 at 11:09:14AM +0800, Bob Liu wrote: > On 2017/8/17 8:05, Jérôme Glisse wrote: > > Unlike unaddressable memory, coherent device memory has a real > > resource associated with it on the system (as CPU can address > > it). Add a new helper to hotplug such memory within the HMM > > framework. > > > > Got an new question, coherent device( e.g CCIX) memory are likely reported to > OS > through ACPI and recognized as NUMA memory node. > Then how can their memory be captured and managed by HMM framework? > The only platform that has such memory today is powerpc, and it is not reported as regular memory by the firmware, hence why they need this helper. I don't think anyone has defined anything yet for x86 and ACPI. As this is memory on a PCIE-like interface, I don't expect it to be reported as a NUMA memory node but as an io range like any regular PCIE resource. The device driver, through capability flags, would then figure out if the link between the device and CPU is CCIX capable; if so, it can use this helper to hotplug it as device memory. Jérôme
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/8/17 8:05, Jérôme Glisse wrote: > Unlike unaddressable memory, coherent device memory has a real > resource associated with it on the system (as CPU can address > it). Add a new helper to hotplug such memory within the HMM > framework. > Got an new question, coherent device( e.g CCIX) memory are likely reported to OS through ACPI and recognized as NUMA memory node. Then how can their memory be captured and managed by HMM framework? -- Regards, Bob Liu > Changed since v2: > - s/host/public > Changed since v1: > - s/public/host > > Signed-off-by: Jérôme Glisse> Reviewed-by: Balbir Singh > --- > include/linux/hmm.h | 3 ++ > mm/hmm.c| 88 > ++--- > 2 files changed, 86 insertions(+), 5 deletions(-) > > diff --git a/include/linux/hmm.h b/include/linux/hmm.h > index 79e63178fd87..5866f3194c26 100644 > --- a/include/linux/hmm.h > +++ b/include/linux/hmm.h > @@ -443,6 +443,9 @@ struct hmm_devmem { > struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops, > struct device *device, > unsigned long size); > +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, > +struct device *device, > +struct resource *res); > void hmm_devmem_remove(struct hmm_devmem *devmem); > > /* > diff --git a/mm/hmm.c b/mm/hmm.c > index 1a1e79d390c1..3faa4d40295e 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void > *data) > zone = page_zone(page); > > mem_hotplug_begin(); > - __remove_pages(zone, start_pfn, npages); > + if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) > + __remove_pages(zone, start_pfn, npages); > + else > + arch_remove_memory(start_pfn << PAGE_SHIFT, > +npages << PAGE_SHIFT); > mem_hotplug_done(); > > hmm_devmem_radix_release(resource); > @@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem > *devmem) > if (is_ram == REGION_INTERSECTS) > return -ENXIO; > > - devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; > + if (devmem->resource->desc == 
IORES_DESC_DEVICE_PUBLIC_MEMORY) > + devmem->pagemap.type = MEMORY_DEVICE_PUBLIC; > + else > + devmem->pagemap.type = MEMORY_DEVICE_PRIVATE; > + > devmem->pagemap.res = devmem->resource; > devmem->pagemap.page_fault = hmm_devmem_fault; > devmem->pagemap.page_free = hmm_devmem_free; > @@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem > *devmem) >* over the device memory is un-accessible thus we do not want to >* create a linear mapping for the memory like arch_add_memory() >* would do. > + * > + * For device public memory, which is accesible by the CPU, we do > + * want the linear mapping and thus use arch_add_memory(). >*/ > - ret = add_pages(nid, align_start >> PAGE_SHIFT, > - align_size >> PAGE_SHIFT, false); > + if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC) > + ret = arch_add_memory(nid, align_start, align_size, false); > + else > + ret = add_pages(nid, align_start >> PAGE_SHIFT, > + align_size >> PAGE_SHIFT, false); > if (ret) { > mem_hotplug_done(); > goto error_add_memory; > @@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct > hmm_devmem_ops *ops, > } > EXPORT_SYMBOL(hmm_devmem_add); > > +struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, > +struct device *device, > +struct resource *res) > +{ > + struct hmm_devmem *devmem; > + int ret; > + > + if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY) > + return ERR_PTR(-EINVAL); > + > + static_branch_enable(_private_key); > + > + devmem = devres_alloc_node(_devmem_release, sizeof(*devmem), > +GFP_KERNEL, dev_to_node(device)); > + if (!devmem) > + return ERR_PTR(-ENOMEM); > + > + init_completion(>completion); > + devmem->pfn_first = -1UL; > + devmem->pfn_last = -1UL; > + devmem->resource = res; > + devmem->device = device; > + devmem->ops = ops; > + > + ret = percpu_ref_init(>ref, _devmem_ref_release, > + 0, GFP_KERNEL); > + if (ret) > + goto error_percpu_ref; > + > + ret = devm_add_action(device, hmm_devmem_ref_exit, >ref); > + if (ret) > + 
goto error_devm_add_action; > + > + > + devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT; > + devmem->pfn_last =
Re: [HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3
On 2017/8/17 8:05, Jérôme Glisse wrote:
> Unlike unaddressable memory, coherent device memory has a real
> resource associated with it on the system (as CPU can address
> it). Add a new helper to hotplug such memory within the HMM
> framework.

Got a new question: coherent device (e.g. CCIX) memory is likely to be
reported to the OS through ACPI and recognized as a NUMA memory node.
How, then, can such memory be captured and managed by the HMM framework?

[snip quoted patch]

--
Regards,
Bob Liu
[HMM-v25 19/19] mm/hmm: add new helper to hotplug CDM memory region v3

Unlike unaddressable memory, coherent device memory has a real
resource associated with it on the system (as CPU can address
it). Add a new helper to hotplug such memory within the HMM
framework.

Changed since v2:
  - s/host/public
Changed since v1:
  - s/public/host

Signed-off-by: Jérôme Glisse
Reviewed-by: Balbir Singh
---
 include/linux/hmm.h |  3 ++
 mm/hmm.c            | 88 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79e63178fd87..5866f3194c26 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -443,6 +443,9 @@ struct hmm_devmem {
 struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
				   struct device *device,
				   unsigned long size);
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res);
 void hmm_devmem_remove(struct hmm_devmem *devmem);
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 1a1e79d390c1..3faa4d40295e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -854,7 +854,11 @@ static void hmm_devmem_release(struct device *dev, void *data)
	zone = page_zone(page);
 
	mem_hotplug_begin();
-	__remove_pages(zone, start_pfn, npages);
+	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
+		__remove_pages(zone, start_pfn, npages);
+	else
+		arch_remove_memory(start_pfn << PAGE_SHIFT,
+				   npages << PAGE_SHIFT);
	mem_hotplug_done();
 
	hmm_devmem_radix_release(resource);
@@ -890,7 +894,11 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
	if (is_ram == REGION_INTERSECTS)
		return -ENXIO;
 
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	if (devmem->resource->desc == IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
+	else
+		devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+
	devmem->pagemap.res = devmem->resource;
	devmem->pagemap.page_fault = hmm_devmem_fault;
	devmem->pagemap.page_free = hmm_devmem_free;
@@ -935,9 +943,15 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
	 * over the device memory is un-accessible thus we do not want to
	 * create a linear mapping for the memory like arch_add_memory()
	 * would do.
+	 *
+	 * For device public memory, which is accesible by the CPU, we do
+	 * want the linear mapping and thus use arch_add_memory().
	 */
-	ret = add_pages(nid, align_start >> PAGE_SHIFT,
-			align_size >> PAGE_SHIFT, false);
+	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
+		ret = arch_add_memory(nid, align_start, align_size, false);
+	else
+		ret = add_pages(nid, align_start >> PAGE_SHIFT,
+				align_size >> PAGE_SHIFT, false);
	if (ret) {
		mem_hotplug_done();
		goto error_add_memory;
@@ -1084,6 +1098,67 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
 }
 EXPORT_SYMBOL(hmm_devmem_add);
 
+struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
+					   struct device *device,
+					   struct resource *res)
+{
+	struct hmm_devmem *devmem;
+	int ret;
+
+	if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
+		return ERR_PTR(-EINVAL);
+
+	static_branch_enable(&device_private_key);
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = res;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_devm_add_action;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return
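[Editor's note: a hypothetical driver-side sketch of how the new helper might be used, assuming the `hmm_devmem_ops` callback signatures of this patch series; the `my_*` names and the origin of the `struct resource` are illustrative only, not part of the patch.]

```c
#include <linux/hmm.h>
#include <linux/ioport.h>

static void my_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* return the page to the device's own allocator */
}

static int my_devmem_fault(struct hmm_devmem *devmem,
			   struct vm_area_struct *vma,
			   unsigned long addr, const struct page *page,
			   unsigned int flags, pmd_t *pmdp)
{
	/* public (CPU-addressable) memory is not expected to fault this way */
	return VM_FAULT_SIGBUS;
}

static const struct hmm_devmem_ops my_devmem_ops = {
	.free  = my_devmem_free,
	.fault = my_devmem_fault,
};

/* res describes a CPU-addressable (CDM) range the driver discovered,
 * e.g. from firmware tables; its desc must be
 * IORES_DESC_DEVICE_PUBLIC_MEMORY or the helper returns -EINVAL. */
static int my_driver_register_cdm(struct device *dev, struct resource *res)
{
	struct hmm_devmem *devmem;

	devmem = hmm_devmem_add_resource(&my_devmem_ops, dev, res);
	if (IS_ERR(devmem))
		return PTR_ERR(devmem);

	/* pages in [devmem->pfn_first, devmem->pfn_last) are now
	 * ZONE_DEVICE pages of type MEMORY_DEVICE_PUBLIC, managed by
	 * the driver but migratable/addressable by the CPU */
	return 0;
}
```

Unlike `hmm_devmem_add()`, which allocates an unaddressable region itself, the caller here supplies the existing physical resource, and the helper hotplugs it via `arch_add_memory()` so a linear mapping exists.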