Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 30/12/2017 07:58, Matthew Wilcox wrote:
> On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
>>> Perhaps we can enlist /proc/iomem or a similar enumeration interface
>>> to tell userspace the NUMA node and whether the kernel thinks it has
>>> better or worse performance characteristics relative to base
>>> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
>>> publishing absolute numbers in sysfs userspace will default to looking
>>> for specific magic numbers in sysfs vs asking the kernel for memory
>>> that has performance characteristics relative to base "System RAM". In
>>> other words the absolute performance information that the HMAT
>>> publishes is useful to the kernel, but it's not clear that userspace
>>> needs that vs a relative indicator for making NUMA node preference
>>> decisions.
>> Some HPC users will benchmark the machine to discover actual
>> performance numbers anyway.
>> However, most users won't do this. They will want to know the relative
>> performance of different nodes. If you normalize HMAT values by dividing
>> them by system-RAM values, that's likely OK. If you just say "that
>> node is faster than system RAM", it's not precise enough.
> So "this memory has 800% bandwidth of normal" and "this memory has 70%
> bandwidth of normal"?

I guess that would work.

Brice
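For what it's worth, the normalization being discussed here is trivial for
userspace to do on top of the absolute numbers. The sketch below is purely
illustrative: the sysfs root and the assumption that mem_tgt0 is the plain
system-RAM target are not guaranteed by the posted patches.

/*
 * Minimal sketch of the "percentage of normal" idea: express one target's
 * read bandwidth relative to the system-RAM target. Paths and target
 * numbering are assumptions for illustration only.
 */
#include <stdio.h>

static long read_attr(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -1;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	/* Assumed locations; adjust to wherever the mem_tgtN dirs land. */
	long ddr = read_attr("/sys/devices/system/hmat/mem_tgt0/local_init/read_bw_MBps");
	long hbm = read_attr("/sys/devices/system/hmat/mem_tgt2/local_init/read_bw_MBps");

	if (ddr > 0 && hbm > 0)
		printf("mem_tgt2 has %ld%% of system-RAM read bandwidth\n",
		       hbm * 100 / ddr);
	return 0;
}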
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
> > Perhaps we can enlist /proc/iomem or a similar enumeration interface
> > to tell userspace the NUMA node and whether the kernel thinks it has
> > better or worse performance characteristics relative to base
> > system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> > publishing absolute numbers in sysfs userspace will default to looking
> > for specific magic numbers in sysfs vs asking the kernel for memory
> > that has performance characteristics relative to base "System RAM". In
> > other words the absolute performance information that the HMAT
> > publishes is useful to the kernel, but it's not clear that userspace
> > needs that vs a relative indicator for making NUMA node preference
> > decisions.
>
> Some HPC users will benchmark the machine to discover actual
> performance numbers anyway.
> However, most users won't do this. They will want to know the relative
> performance of different nodes. If you normalize HMAT values by dividing
> them by system-RAM values, that's likely OK. If you just say "that
> node is faster than system RAM", it's not precise enough.

So "this memory has 800% bandwidth of normal" and "this memory has 70%
bandwidth of normal"?
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 22/12/2017 23:53, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin wrote:
>> On 20/12/2017 23:41, Ross Zwisler wrote:
> [..]
>> Hello
>>
>> I can confirm that HPC runtimes are going to use these patches (at least
>> all runtimes that use hwloc for topology discovery, but that's the vast
>> majority of HPC anyway).
>>
>> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> explicitly detect that specific crazy table to find out which NUMA nodes
>> were local to which cores, and to find out which NUMA nodes were
>> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> application because the reported latencies didn't match reality. Quite
>> annoying.
>>
>> With Ross' patches, we can easily get what we need:
>> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> can only report a single local node per CPU (doesn't work for KNL and
>> upcoming architectures with HBM+DDR+...)
>> * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> And we can still look at SLIT under /sys/devices/system/node if really
>> needed.
>>
>> And of course having this in sysfs is much better than parsing ACPI
>> tables that are only accessible to root :)
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes. Once you need to be root
> for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
> use case?

I don't think it would be a blocker.

> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.

Some HPC users will benchmark the machine to discover actual
performance numbers anyway.
However, most users won't do this. They will want to know the relative
performance of different nodes. If you normalize HMAT values by dividing
them by system-RAM values, that's likely OK. If you just say "that
node is faster than system RAM", it's not precise enough.

Brice
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 2017/12/23 6:31, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
>> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
> <>
>>> We solve this issue by providing userspace with performance information on
>>> individual memory ranges. This performance information is exposed via
>>> sysfs:
>>>
>>> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>> mem_tgt2/firmware_id:1
>>> mem_tgt2/is_cached:0
>>> mem_tgt2/local_init/read_bw_MBps:40960
>>> mem_tgt2/local_init/read_lat_nsec:50
>>> mem_tgt2/local_init/write_bw_MBps:40960
>>> mem_tgt2/local_init/write_lat_nsec:50
> <>
>> We will enlist properties for all possible "source --> target" pairs on the system?
>
> Nope, just 'local' initiator/target pairs. I talk about the reasoning for
> this in the cover letter for patch 3:
>
> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013574.html
>
>> Right now it shows only bandwidth and latency properties, can it accommodate
>> other properties as well in the future?
>
> We also have an 'is_cached' attribute for the memory targets if they are
> involved in a caching hierarchy, but right now those are all the things we
> expose. We can potentially expose whatever we want that is present in the
> HMAT, but those seemed like a good start.
>
> I noticed that in your presentation you had some other examples of attributes
> you cared about:
>
> * reliability
> * power consumption
> * density
>
> The HMAT doesn't provide this sort of information at present, but we
> could/would add them to sysfs if the HMAT ever grew support for them.
>
>>> This allows applications to easily find the memory that they want to use.
>>> We expect that the existing NUMA APIs will be enhanced to use this new
>>> information so that applications can continue to use them to select their
>>> desired memory.
>>
>> I had presented a proposal for NUMA redesign at the Plumbers Conference this
>> year where various memory devices with different kinds of memory attributes
>> can be represented in the kernel and be used explicitly from user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and I don't have an RFC for it yet for discussion here.
>>
>> https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
>>
>> The problem is, designing the sysfs interface for memory attribute detection
>> from user space without first thinking about redesigning NUMA for
>> heterogeneous memory may not be a good idea. Will look into this further.
>
> I took another look at your presentation, and overall I think that if/when a
> NUMA redesign like this takes place ACPI systems with HMAT tables will be able
> to participate. But I think we are probably a ways away from that, and like I

I'm afraid not: cache-coherent buses like CCIX/OpenCAPI are coming out soon,
not to mention Systems-on-Chip that already link DDR, HBM, CPUs and
accelerators over an internal bus.

> said in my previous mail ACPI systems with memory-only NUMA nodes are going to
> exist and need to be supported with the current NUMA scheme. Hence I don't

And it is not only memory-only nodes; the accelerators can also be masters,
just like CPUs.

> think that this patch series conflicts with your proposal.

I don't see a conflict either, but perhaps we should think about a longer-term
solution that covers more situations/platforms. Anshuman's proposal is a really
good starting point for us.

Cheers,
Bob Liu
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/23/2017 03:43 AM, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
>> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
>>> Quick Summary
>>>
>>> Platforms exist today which have multiple types of memory attached to a
>>> single CPU. These disparate memory ranges have some characteristics in
>>> common, such as CPU cache coherence, but they can have wide ranges of
>>> performance both in terms of latency and bandwidth.
>>
>> Right.
>>
>>> For example, consider a system that contains persistent memory, standard
>>> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
>>> There could potentially be an order of magnitude or more difference in
>>> performance between the slowest and fastest memory attached to that CPU.
>>
>> Right.
>>
>>> With the current Linux code NUMA nodes are CPU-centric, so all the memory
>>> attached to a given CPU will be lumped into the same NUMA node. This makes
>>> it very difficult for userspace applications to understand the performance
>>> of different memory ranges on a given CPU.
>>
>> Right, but that might require fundamental changes to the NUMA representation.
>> Plugging those memory ranges in as separate NUMA nodes, identifying them
>> through sysfs and allocating from them through mbind() seems like a
>> short-term solution.
>>
>> Though if we decide to go in this direction, a sysfs interface or something
>> similar is required to enumerate memory properties.
>
> Yep, and this patch series is trying to be the sysfs interface that is
> required to enumerate the memory properties. :) It's a certainty that we will
> have memory-only NUMA nodes, at least on platforms that support ACPI.
> Supporting memory-only proximity domains (which Linux turns into memory-only
> NUMA nodes) is explicitly supported with the introduction of the HMAT in
> ACPI 6.2.

Yeah, even POWER platforms can have memory-only NUMA nodes.

> It also turns out that the existing memory management code already deals with
> them just fine - you see this with my hmat_examples setup:
>
> https://github.com/rzwisler/hmat_examples
>
> Both configurations created by this repo create memory-only NUMA nodes, even
> with upstream kernels. My patches don't change that, they just provide a
> sysfs representation of the HMAT so users can discover the memory that exists
> in the system.

Once it's a NUMA node, everything will work as is from the MM interface point
of view. But the point is how we export these properties to user space. My
only concern is that we not do it in a way that gets locked in before first
going through a NUMA redesign for this new attribute-based memory, that's all.

>>> We solve this issue by providing userspace with performance information on
>>> individual memory ranges. This performance information is exposed via
>>> sysfs:
>>>
>>> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>> mem_tgt2/firmware_id:1
>>> mem_tgt2/is_cached:0
>>> mem_tgt2/local_init/read_bw_MBps:40960
>>> mem_tgt2/local_init/read_lat_nsec:50
>>> mem_tgt2/local_init/write_bw_MBps:40960
>>> mem_tgt2/local_init/write_lat_nsec:50
>>
>> I might have missed discussions from earlier versions: why do we have this
>> kind of "source --> target" model? We will enlist properties for all
>> possible "source --> target" pairs on the system? Right now it shows only
>> bandwidth and latency properties, can it accommodate other properties
>> as well in the future?
>
> The initiator/target model is useful in preventing us from needing a
> MAX_NUMA_NODES x MAX_NUMA_NODES sized table for each performance attribute. I
> talked about it a little more here:

That makes it even more complex. Not only do we have a memory attribute like
bandwidth specific to the range, we are also exporting its relative values as
seen from different CPU nodes. It's again a kind of NUMA distance table being
exported in the generic sysfs path like /sys/devices/. The problem is that
possible future memory attributes like 'reliability', 'density' and 'power
consumption' might not need a "source --> destination" kind of model, as they
don't change based on which CPU node is accessing them.

> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013654.html
>
>>> This allows applications to easily find the memory that they want to use.
>>> We expect that the existing NUMA APIs will be enhanced to use this new
>>> information so that applications can continue to use them to select their
>>> desired memory.
>>
>> I had presented a proposal for NUMA redesign at the Plumbers Conference this
>> year where various memory devices with different kinds of memory attributes
>> can be represented in the kernel and be used explicitly from user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and I don't have an RFC for it yet for discussion here.
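The table-size argument behind the initiator/target model is easiest to see in
a sketch. The struct layouts and the MAX_NUMA_NODES value below are
illustrative assumptions only, not what the patches actually implement.

/*
 * Sketch of the data model being debated: a full initiator x target
 * matrix vs. only "local" initiator/target pairs per target.
 */
#define MAX_NUMA_NODES 1024

/* Full matrix: one entry per (initiator, target) pair, per attribute. */
struct hmat_full_matrix {
	unsigned int read_bw_MBps[MAX_NUMA_NODES][MAX_NUMA_NODES];
	unsigned int read_lat_nsec[MAX_NUMA_NODES][MAX_NUMA_NODES];
};

/* Local-pairs model: each target describes only its local initiator. */
struct hmat_local_init {
	unsigned int read_bw_MBps;
	unsigned int read_lat_nsec;
	unsigned int write_bw_MBps;
	unsigned int write_lat_nsec;
};

struct hmat_target {
	int firmware_id;
	int local_initiator;		/* NUMA node of the local CPU(s) */
	struct hmat_local_init perf;	/* one entry, not MAX_NUMA_NODES */
};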
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/22/2017 10:43 PM, Dave Hansen wrote:
> On 12/21/2017 07:09 PM, Anshuman Khandual wrote:
>> I had presented a proposal for NUMA redesign at the Plumbers Conference this
>> year where various memory devices with different kinds of memory attributes
>> can be represented in the kernel and be used explicitly from user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and I don't have an RFC for it yet for discussion here.
> I think that's the best reason to "re-use NUMA" for this: it's _not_
> intrusive.
>
> Also, from an x86 perspective, these HMAT systems *will* be out there.
> Old versions of Linux *will* see different types of memory as separate
> NUMA nodes. So, if we are going to do something different, it's going
> to be interesting to un-teach those systems about using the NUMA APIs
> for this. That ship has sailed.

I understand the need to fetch these details from ACPI/DT for applications to
target these distinct memory-only NUMA nodes. This can be done by parsing
platform-specific values from the /proc/acpi/ or /proc/device-tree/ interfaces.
That can be a short-term solution until the NUMA redesign is figured out. But
adding generic devices like "hmat" in the /sys/devices/ path, which will be
locked in for good, seems problematic.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Sat, Dec 23, 2017 at 12:57 AM, Dan Williams wrote:
> On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler
> wrote:
>> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin
>>> wrote:
>>> > On 20/12/2017 23:41, Ross Zwisler wrote:
>>> [..]
>>> > Hello
>>> >
>>> > I can confirm that HPC runtimes are going to use these patches (at least
>>> > all runtimes that use hwloc for topology discovery, but that's the vast
>>> > majority of HPC anyway).
>>> >
>>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>>> > explicitly detect that specific crazy table to find out which NUMA nodes
>>> > were local to which cores, and to find out which NUMA nodes were
>>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>>> > application because the reported latencies didn't match reality. Quite
>>> > annoying.
>>> >
>>> > With Ross' patches, we can easily get what we need:
>>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>>> > can only report a single local node per CPU (doesn't work for KNL and
>>> > upcoming architectures with HBM+DDR+...)
>>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>>> > And we can still look at SLIT under /sys/devices/system/node if really
>>> > needed.
>>> >
>>> > And of course having this in sysfs is much better than parsing ACPI
>>> > tables that are only accessible to root :)
>>>
>>> On this point, it's not clear to me that we should allow these sysfs
>>> entries to be world readable. Given /proc/iomem now hides physical
>>> address information from non-root we at least need to be careful not
>>> to undo that with new sysfs HMAT attributes.
>>
>> This enabling does not expose any physical addresses to userspace. It only
>> provides performance numbers from the HMAT and associates them with existing
>> NUMA nodes. Are you worried that exposing performance numbers to non-root
>> users via sysfs poses a security risk?
>
> It's an information disclosure that it's not clear we need to make to
> non-root processes.
>
> I'm more worried about userspace growing dependencies on the absolute
> numbers when those numbers can change from platform to platform.
> Differentiated memory on one platform may be the common memory pool on
> another.
>
> To me this has parallels with storage device hinting where
> specifications like T10 have a complex enumeration of all the
> performance hints that can be passed to the device, but the Linux
> enabling effort aims for a sanitized set of relative hints that make
> sense. It's more flexible if userspace specifies a relative intent
> rather than an absolute performance target. Putting all the HMAT
> information into sysfs gives userspace more information than it could
> possibly do anything reasonable with, at least outside of specialized
> apps that are hand tuned for a given hardware platform.

That's a valid point IMO. It is sort of tempting to expose everything to user
space verbatim, especially early in the enabling process when the kernel has
not yet found suitable ways to utilize the given information, but the very act
of exposing it may affect what can be done with it in the future. User space
interfaces need to stay around and be supported forever, at least potentially,
so adding every one of them is a serious commitment.

Thanks,
Rafael
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin
>> wrote:
>> > On 20/12/2017 23:41, Ross Zwisler wrote:
>> [..]
>> > Hello
>> >
>> > I can confirm that HPC runtimes are going to use these patches (at least
>> > all runtimes that use hwloc for topology discovery, but that's the vast
>> > majority of HPC anyway).
>> >
>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> > explicitly detect that specific crazy table to find out which NUMA nodes
>> > were local to which cores, and to find out which NUMA nodes were
>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> > application because the reported latencies didn't match reality. Quite
>> > annoying.
>> >
>> > With Ross' patches, we can easily get what we need:
>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> > can only report a single local node per CPU (doesn't work for KNL and
>> > upcoming architectures with HBM+DDR+...)
>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> > And we can still look at SLIT under /sys/devices/system/node if really
>> > needed.
>> >
>> > And of course having this in sysfs is much better than parsing ACPI
>> > tables that are only accessible to root :)
>>
>> On this point, it's not clear to me that we should allow these sysfs
>> entries to be world readable. Given /proc/iomem now hides physical
>> address information from non-root we at least need to be careful not
>> to undo that with new sysfs HMAT attributes.
>
> This enabling does not expose any physical addresses to userspace. It only
> provides performance numbers from the HMAT and associates them with existing
> NUMA nodes. Are you worried that exposing performance numbers to non-root
> users via sysfs poses a security risk?

It's an information disclosure that it's not clear we need to make to
non-root processes.

I'm more worried about userspace growing dependencies on the absolute
numbers when those numbers can change from platform to platform.
Differentiated memory on one platform may be the common memory pool on
another.

To me this has parallels with storage device hinting where
specifications like T10 have a complex enumeration of all the
performance hints that can be passed to the device, but the Linux
enabling effort aims for a sanitized set of relative hints that make
sense. It's more flexible if userspace specifies a relative intent
rather than an absolute performance target. Putting all the HMAT
information into sysfs gives userspace more information than it could
possibly do anything reasonable with, at least outside of specialized
apps that are hand tuned for a given hardware platform.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin wrote:
> > On 20/12/2017 23:41, Ross Zwisler wrote:
> [..]
> > Hello
> >
> > I can confirm that HPC runtimes are going to use these patches (at least
> > all runtimes that use hwloc for topology discovery, but that's the vast
> > majority of HPC anyway).
> >
> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> > explicitly detect that specific crazy table to find out which NUMA nodes
> > were local to which cores, and to find out which NUMA nodes were
> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
> > application because the reported latencies didn't match reality. Quite
> > annoying.
> >
> > With Ross' patches, we can easily get what we need:
> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> > can only report a single local node per CPU (doesn't work for KNL and
> > upcoming architectures with HBM+DDR+...)
> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
> > And we can still look at SLIT under /sys/devices/system/node if really
> > needed.
> >
> > And of course having this in sysfs is much better than parsing ACPI
> > tables that are only accessible to root :)
>
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes.

This enabling does not expose any physical addresses to userspace. It only
provides performance numbers from the HMAT and associates them with existing
NUMA nodes. Are you worried that exposing performance numbers to non-root
users via sysfs poses a security risk?

> Once you need to be root for this info, is parsing binary HMAT vs sysfs a
> blocker for the HPC use case?
>
> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin wrote:
> On 20/12/2017 23:41, Ross Zwisler wrote:
[..]
> Hello
>
> I can confirm that HPC runtimes are going to use these patches (at least
> all runtimes that use hwloc for topology discovery, but that's the vast
> majority of HPC anyway).
>
> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> explicitly detect that specific crazy table to find out which NUMA nodes
> were local to which cores, and to find out which NUMA nodes were
> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
> application because the reported latencies didn't match reality. Quite
> annoying.
>
> With Ross' patches, we can easily get what we need:
> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> can only report a single local node per CPU (doesn't work for KNL and
> upcoming architectures with HBM+DDR+...)
> * which NUMA nodes are slow/fast (for both bandwidth and latency)
> And we can still look at SLIT under /sys/devices/system/node if really
> needed.
>
> And of course having this in sysfs is much better than parsing ACPI
> tables that are only accessible to root :)

On this point, it's not clear to me that we should allow these sysfs
entries to be world readable. Given /proc/iomem now hides physical
address information from non-root we at least need to be careful not
to undo that with new sysfs HMAT attributes. Once you need to be root
for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
use case?

Perhaps we can enlist /proc/iomem or a similar enumeration interface
to tell userspace the NUMA node and whether the kernel thinks it has
better or worse performance characteristics relative to base
system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
publishing absolute numbers in sysfs userspace will default to looking
for specific magic numbers in sysfs vs asking the kernel for memory
that has performance characteristics relative to base "System RAM". In
other words the absolute performance information that the HMAT
publishes is useful to the kernel, but it's not clear that userspace
needs that vs a relative indicator for making NUMA node preference
decisions.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
<>
> > We solve this issue by providing userspace with performance information on
> > individual memory ranges. This performance information is exposed via
> > sysfs:
> >
> > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> > mem_tgt2/firmware_id:1
> > mem_tgt2/is_cached:0
> > mem_tgt2/local_init/read_bw_MBps:40960
> > mem_tgt2/local_init/read_lat_nsec:50
> > mem_tgt2/local_init/write_bw_MBps:40960
> > mem_tgt2/local_init/write_lat_nsec:50
<>
> We will enlist properties for all possible "source --> target" pairs on the system?

Nope, just 'local' initiator/target pairs. I talk about the reasoning for
this in the cover letter for patch 3:

https://lists.01.org/pipermail/linux-nvdimm/2017-December/013574.html

> Right now it shows only bandwidth and latency properties, can it accommodate
> other properties as well in the future?

We also have an 'is_cached' attribute for the memory targets if they are
involved in a caching hierarchy, but right now those are all the things we
expose. We can potentially expose whatever we want that is present in the
HMAT, but those seemed like a good start.

I noticed that in your presentation you had some other examples of attributes
you cared about:

* reliability
* power consumption
* density

The HMAT doesn't provide this sort of information at present, but we
could/would add them to sysfs if the HMAT ever grew support for them.

> > This allows applications to easily find the memory that they want to use.
> > We expect that the existing NUMA APIs will be enhanced to use this new
> > information so that applications can continue to use them to select their
> > desired memory.
>
> I had presented a proposal for NUMA redesign at the Plumbers Conference this
> year where various memory devices with different kinds of memory attributes
> can be represented in the kernel and be used explicitly from user space.
> Here is the link to the proposal if you feel interested. The proposal is
> very intrusive and I don't have an RFC for it yet for discussion here.
>
> https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
>
> The problem is, designing the sysfs interface for memory attribute detection
> from user space without first thinking about redesigning NUMA for
> heterogeneous memory may not be a good idea. Will look into this further.

I took another look at your presentation, and overall I think that if/when a
NUMA redesign like this takes place ACPI systems with HMAT tables will be able
to participate. But I think we are probably a ways away from that, and like I
said in my previous mail ACPI systems with memory-only NUMA nodes are going to
exist and need to be supported with the current NUMA scheme. Hence I don't
think that this patch series conflicts with your proposal.
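To make the "applications can easily find the memory that they want" point
concrete, a userspace consumer of these attributes could be as small as the
sketch below. The /sys/devices/system/hmat root is an assumption here; whatever
the final location, only the directory walk changes.

/*
 * Sketch: walk the mem_tgtN entries and print the advertised local
 * bandwidth/latency so an application can pick a target.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define HMAT_ROOT "/sys/devices/system/hmat"	/* assumed location */

static long read_attr(const char *tgt, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), HMAT_ROOT "/%s/local_init/%s", tgt, attr);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(void)
{
	DIR *d = opendir(HMAT_ROOT);
	struct dirent *de;

	if (!d)
		return 1;
	while ((de = readdir(d)) != NULL) {
		if (strncmp(de->d_name, "mem_tgt", 7))
			continue;
		printf("%s: read %ld MB/s, %ld ns; write %ld MB/s, %ld ns\n",
		       de->d_name,
		       read_attr(de->d_name, "read_bw_MBps"),
		       read_attr(de->d_name, "read_lat_nsec"),
		       read_attr(de->d_name, "write_bw_MBps"),
		       read_attr(de->d_name, "write_lat_nsec"));
	}
	closedir(d);
	return 0;
}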
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote: > On 12/14/2017 07:40 AM, Ross Zwisler wrote: > > Quick Summary > > > > Platforms exist today which have multiple types of memory attached to a > > single CPU. These disparate memory ranges have some characteristics in > > common, such as CPU cache coherence, but they can have wide ranges of > > performance both in terms of latency and bandwidth. > > Right. > > > > > For example, consider a system that contains persistent memory, standard > > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. > > There could potentially be an order of magnitude or more difference in > > performance between the slowest and fastest memory attached to that CPU. > > Right. > > > > > With the current Linux code NUMA nodes are CPU-centric, so all the memory > > attached to a given CPU will be lumped into the same NUMA node. This makes > > it very difficult for userspace applications to understand the performance > > of different memory ranges on a given CPU. > > Right but that might require fundamental changes to the NUMA representation. > Plugging those memory as separate NUMA nodes, identify them through sysfs > and try allocating from it through mbind() seems like a short term solution. > > Though if we decide to go in this direction, sysfs interface or something > similar is required to enumerate memory properties. Yep, and this patch series is trying to be the sysfs interface that is required to enumerate the memory properties. :) It's a certainty that we will have memory-only NUMA nodes, at least on platforms that support ACPI. Memory-only proximity domains (which Linux turns into memory-only NUMA nodes) are explicitly supported with the introduction of the HMAT in ACPI 6.2. It also turns out that the existing memory management code already deals with them just fine - you can see this with my hmat_examples setup: https://github.com/rzwisler/hmat_examples Both configurations created by this repo create memory-only NUMA nodes, even with upstream kernels. My patches don't change that, they just provide a sysfs representation of the HMAT so users can discover the memory that exists in the system. > > We solve this issue by providing userspace with performance information on > > individual memory ranges. This performance information is exposed via > > sysfs: > > > > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > > mem_tgt2/firmware_id:1 > > mem_tgt2/is_cached:0 > > mem_tgt2/local_init/read_bw_MBps:40960 > > mem_tgt2/local_init/read_lat_nsec:50 > > mem_tgt2/local_init/write_bw_MBps:40960 > > mem_tgt2/local_init/write_lat_nsec:50 > > I might have missed discussions from earlier versions, why we have this > kind of a "source --> target" model ? We will enlist properties for all > possible "source --> target" on the system ? Right now it shows only > bandwidth and latency properties, can it accommodate other properties > as well in future ? The initiator/target model is useful in preventing us from needing a MAX_NUMA_NODES x MAX_NUMA_NODES sized table for each performance attribute. I talked about it a little more here: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013654.html > > This allows applications to easily find the memory that they want to use. > > We expect that the existing NUMA APIs will be enhanced to use this new > > information so that applications can continue to use them to select their > > desired memory.
> > I had presented a proposal for NUMA redesign in the Plumbers Conference this > year where various memory devices with different kind of memory attributes > can be represented in the kernel and be used explicitly from the user space. > Here is the link to the proposal if you feel interested. The proposal is > very intrusive and also I dont have a RFC for it yet for discussion here. > > https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf I'll take a look, but my first reaction is that I agree with Dave that it seems hard to re-teach systems a new NUMA scheme. This patch series doesn't attempt to do that - it is very unintrusive and only informs users about the memory-only NUMA nodes that will already exist in their ACPI-based systems. > Problem is, designing the sysfs interface for memory attribute detection > from user space without first thinking about redesigning the NUMA for > heterogeneous memory may not be a good idea. Will look into this further.
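As a purely illustrative aside on the "existing NUMA APIs" point above (the node numbers and program name below are made up): once the sysfs information identifies, say, node 2 as the high-bandwidth memory-only node, existing NUMA tooling can already bind an application to it with no new kernel interface:

    numactl --cpunodebind=0 --membind=2 ./my_hpc_app
    # or spread allocations across two fast memory-only nodes
    numactl --interleave=2,3 ./my_hpc_app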
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Thu, Dec 21, 2017 at 01:41:15AM +, Elliott, Robert (Persistent Memory) wrote: > > > > -Original Message- > > From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of > > Ross Zwisler > ... > > > > On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote: > ... > > > initiator is a CPU? I'd have expected you to expose a memory controller > > > abstraction rather than re-use storage terminology. > > > > Yea, I agree that at first blush it seems weird. It turns out that > > looking at it in sort of a storage initiator/target way is beneficial, > > though, because it allows us to cut down on the number of data values > > we need to represent. > > > > For example the SLIT, which doesn't differentiate between initiator and > > target proximity domains (and thus nodes) always represents a system > > with N proximity domains using a NxN distance table. This makes sense > > if every node contains both CPUs and memory. > > > > With the introduction of the HMAT, though, we can have memory-only > > initiator nodes and we can explicitly associate them with their local > > CPU. This is necessary so that we can separate memory with different > > performance characteristics (HBM vs normal memory vs persistent memory, > > for example) that are all attached to the same CPU. > > > > So, say we now have a system with 4 CPUs, and each of those CPUs has 3 > > different types of memory attached to it. We now have 16 total proximity > > domains, 4 CPU and 12 memory. > > The CPU cores that make up a node can have performance restrictions of > their own; for example, they might max out at 10 GB/s even though the > memory controller supports 120 GB/s (meaning you need to use 12 cores > on the node to fully exercise memory). It'd be helpful to report this, > so software can decide how many cores to use for bandwidth-intensive work. > > > If we represent this with the SLIT we end up with a 16 X 16 distance table > > (256 entries), most of which don't matter because they are memory-to- > > memory distances which don't make sense. > > > > In the HMAT, though, we separate out the initiators and the targets and > > put them into separate lists. (See 5.2.27.4 System Locality Latency and > > Bandwidth Information Structure in ACPI 6.2 for details.) So, this same > > config in the HMAT only has 4*12=48 performance values of each type, all > > of which convey meaningful information. > > > > The HMAT indeed even uses the storage "initiator" and "target" > > terminology. :) > > Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem > driver") have performance differences too. A CPU might include > CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and > memory controllers that reach 120 GB/s. I guess these would be > represented as extra initiators on the node? For both of your comments I think all of this comes down to how you want to represent your platform in the HMAT. The sysfs representation just shows you what is in the HMAT. Each initiator node is just a single NUMA node (think of it as a NUMA node which has the characteristic that it can initiate memory requests), so I don't think there is a way to have "extra initiators on the node". I think what you're talking about is separating the DMA engines and CPU cores into separate NUMA nodes, both of which are initiators. I think this is probably fine as it conveys useful info. 
I don't think the HMAT has a concept of increasing bandwidth for number of CPU cores used - it just has a single bandwidth number (well, one for read and one for write) per initiator/target pair. I don't think we want to add this, either - the HMAT is already very complex.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/21/2017 07:09 PM, Anshuman Khandual wrote: > I had presented a proposal for NUMA redesign in the Plumbers Conference this > year where various memory devices with different kind of memory attributes > can be represented in the kernel and be used explicitly from the user space. > Here is the link to the proposal if you feel interested. The proposal is > very intrusive and also I dont have a RFC for it yet for discussion here. I think that's the best reason to "re-use NUMA" for this: it's _not_ intrusive. Also, from an x86 perspective, these HMAT systems *will* be out there. Old versions of Linux *will* see different types of memory as separate NUMA nodes. So, if we are going to do something different, it's going to be interesting to un-teach those systems about using the NUMA APIs for this. That ship has sailed.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/22/2017 04:01 PM, Kogut, Jaroslaw wrote: >> ... first thinking about redesigning the NUMA for >> heterogeneous memory may not be a good idea. Will look into this further. > I agree with comment that first a direction should be defined how to handle > heterogeneous memory system. > >> https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/ >> Hierarchical_NUMA_Design_Plumbers_2017.pdf > I miss in the presentation a user perspective of the new approach, e.g. > - How does application developer see/understand the heterogeneous memory > system? From the user perspective - Each memory node (with or without CPU) is a NUMA node with attributes - User should detect these NUMA nodes from sysfs (not part of proposal) - User allocates/operates/destroys VMA with new sys calls (_mattr based) > - How does app developer use the heterogeneous memory system? - Through existing and new system calls > - What are modification in API/sys interfaces? - The presentation has possible addition of new system calls with 'u64 _mattr' representation for memory attributes which can be used while requesting different kinds of memory from the kernel > > In other hand, if we assume that separate memory NUMA node has different > memory capabilities/attributes from stand point of particular CPU, it is easy > to explain for user how to describe/handle heterogeneous memory. > > Of course, current numa design is not sufficient in kernel in following areas > today: > - Exposing memory attributes that describe heterogeneous memory system > - Interfaces to use the heterogeneous memory system, e.g. more sophisticated > policies > - Internal mechanism in memory management, e.g. automigration, maybe > something else. Right, we would need - Representation of NUMA with attributes - APIs/syscalls for accessing the intended memory from user space - Memory management policies and algorithms navigating through all these new attributes in various situations IMHO, we should not consider sysfs interfaces for heterogeneous memory (which will be an ABI going forward and hence cannot be changed easily) before we get the NUMA redesign right.
RE: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
> ... first thinking about redesigning the NUMA for > heterogeneous memory may not be a good idea. Will look into this further. I agree with comment that first a direction should be defined how to handle heterogeneous memory system. > https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/ > Hierarchical_NUMA_Design_Plumbers_2017.pdf I miss in the presentation a user perspective of the new approach, e.g. - How does application developer see/understand the heterogeneous memory system? - How does app developer use the heterogeneous memory system? - What are modification in API/sys interfaces? In other hand, if we assume that separate memory NUMA node has different memory capabilities/attributes from stand point of particular CPU, it is easy to explain for user how to describe/handle heterogeneous memory. Of course, current numa design is not sufficient in kernel in following areas today: - Exposing memory attributes that describe heterogeneous memory system - Interfaces to use the heterogeneous memory system, e.g. more sophisticated policies - Internal mechanism in memory management, e.g. automigration, maybe something else. > -Original Message- > From: Anshuman Khandual [mailto:khand...@linux.vnet.ibm.com] > Sent: Friday, December 22, 2017 4:10 AM > To: Ross Zwisler ; linux-kernel@vger.kernel.org > Cc: Anaczkowski, Lukasz ; Box, David E > ; Kogut, Jaroslaw ; Koss, > Marcin ; Koziej, Artur ; > Lahtinen, Joonas ; Moore, Robert > ; Nachimuthu, Murugasamy > ; Odzioba, Lukasz > ; Wysocki, Rafael J ; > Rafael J. Wysocki ; Schmauss, Erik > ; Verma, Vishal L ; > Zheng, Lv ; Andrew Morton foundation.org>; Balbir Singh ; Brice Goglin > ; Williams, Dan J ; > Hansen, Dave ; Jerome Glisse ; > John Hubbard ; Len Brown ; Tim > Chen ; de...@acpica.org; linux- > a...@vger.kernel.org; linux...@kvack.org; linux-nvd...@lists.01.org > Subject: Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT > > On 12/14/2017 07:40 AM, Ross Zwisler wrote: > > Quick Summary > > > > Platforms exist today which have multiple types of memory attached to > > a single CPU. These disparate memory ranges have some characteristics > > in common, such as CPU cache coherence, but they can have wide ranges > > of performance both in terms of latency and bandwidth. > > Right. > > > > > For example, consider a system that contains persistent memory, > > standard DDR memory and High Bandwidth Memory (HBM), all attached to > the same CPU. > > There could potentially be an order of magnitude or more difference in > > performance between the slowest and fastest memory attached to that CPU. > > Right. > > > > > With the current Linux code NUMA nodes are CPU-centric, so all the > > memory attached to a given CPU will be lumped into the same NUMA node. > > This makes it very difficult for userspace applications to understand > > the performance of different memory ranges on a given CPU. > > Right but that might require fundamental changes to the NUMA > representation. > Plugging those memory as separate NUMA nodes, identify them through sysfs > and try allocating from it through mbind() seems like a short term solution. > > Though if we decide to go in this direction, sysfs interface or something > similar > is required to enumerate memory properties. > > > > > We solve this issue by providing userspace with performance > > information on individual memory ranges. This performance information > > is exposed via > > sysfs: > > > > # grep . 
mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > > mem_tgt2/firmware_id:1 > > mem_tgt2/is_cached:0 > > mem_tgt2/local_init/read_bw_MBps:40960 > > mem_tgt2/local_init/read_lat_nsec:50 > > mem_tgt2/local_init/write_bw_MBps:40960 > > mem_tgt2/local_init/write_lat_nsec:50 > > I might have missed discussions from earlier versions, why we have this kind > of > a "source --> target" model ? We will enlist properties for all possible > "source -- > > target" on the system ? Right now it shows only bandwidth and latency > properties, can it accommodate other properties as well in future ? > > > > > This allows applications to easily find the memory that they want to use. > > We expect that the existing NUMA APIs will be enhanced to use this new > > information so that applications can continue to use them to select > > their desired memory. > > I had presented a proposal for NUMA redesign in the Plumbers Conference this > year where various memory devices with different kind of memory attributes > can be represented in the kernel and be used explicitly
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/14/2017 07:40 AM, Ross Zwisler wrote: > Quick Summary > > Platforms exist today which have multiple types of memory attached to a > single CPU. These disparate memory ranges have some characteristics in > common, such as CPU cache coherence, but they can have wide ranges of > performance both in terms of latency and bandwidth. Right. > > For example, consider a system that contains persistent memory, standard > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. > There could potentially be an order of magnitude or more difference in > performance between the slowest and fastest memory attached to that CPU. Right. > > With the current Linux code NUMA nodes are CPU-centric, so all the memory > attached to a given CPU will be lumped into the same NUMA node. This makes > it very difficult for userspace applications to understand the performance > of different memory ranges on a given CPU. Right but that might require fundamental changes to the NUMA representation. Plugging those memory as separate NUMA nodes, identify them through sysfs and try allocating from it through mbind() seems like a short term solution. Though if we decide to go in this direction, sysfs interface or something similar is required to enumerate memory properties. > > We solve this issue by providing userspace with performance information on > individual memory ranges. This performance information is exposed via > sysfs: > > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > mem_tgt2/firmware_id:1 > mem_tgt2/is_cached:0 > mem_tgt2/local_init/read_bw_MBps:40960 > mem_tgt2/local_init/read_lat_nsec:50 > mem_tgt2/local_init/write_bw_MBps:40960 > mem_tgt2/local_init/write_lat_nsec:50 I might have missed discussions from earlier versions, why we have this kind of a "source --> target" model ? We will enlist properties for all possible "source --> target" on the system ? Right now it shows only bandwidth and latency properties, can it accommodate other properties as well in future ? > > This allows applications to easily find the memory that they want to use. > We expect that the existing NUMA APIs will be enhanced to use this new > information so that applications can continue to use them to select their > desired memory. I had presented a proposal for NUMA redesign in the Plumbers Conference this year where various memory devices with different kind of memory attributes can be represented in the kernel and be used explicitly from the user space. Here is the link to the proposal if you feel interested. The proposal is very intrusive and also I dont have a RFC for it yet for discussion here. https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf Problem is, designing the sysfs interface for memory attribute detection from user space without first thinking about redesigning the NUMA for heterogeneous memory may not be a good idea. Will look into this further.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Le 20/12/2017 à 23:41, Ross Zwisler a écrit : > On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote: >> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler >> wrote: >>> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote: On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote: > On 12/20/2017 10:19 AM, Matthew Wilcox wrote: >> I don't know what the right interface is, but my laptop has a set of >> /sys/devices/system/memory/memoryN/ directories. Perhaps this is the >> right place to expose write_bw (etc). > Those directories are already too redundant and wasteful. I think we'd > really rather not add to them. In addition, it's technically possible > to have a memory section span NUMA nodes and have different performance > properties, which make it impossible to represent there. > > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have > uniform performance properties in the HMAT, and we just so happen to > always create one NUMA node per PXM. So, NUMA nodes really are a good > fit. I think you're missing my larger point which is that I don't think this should be exposed to userspace as an ACPI feature. Because if you do, then it'll also be exposed to userspace as an openfirmware feature. And sooner or later a devicetree feature. And then writing a portable program becomes an exercise in suffering. So, what's the right place in sysfs that isn't tied to ACPI? A new directory or set of directories under /sys/devices/system/memory/ ? >>> Oh, the current location isn't at all tied to acpi except that it happens to >>> be named 'hmat'. When it was all named 'hmem' it was just: >>> >>> /sys/devices/system/hmem >>> >>> Which has no ACPI-isms at all. I'm happy to move it under >>> /sys/devices/system/memory/hmat if that's helpful, but I think we still have >>> the issue that the data represented therein is still pulled right from the >>> HMAT, and I don't know how to abstract it into something more platform >>> agnostic until I know what data is provided by those other platforms. >>> >>> For example, the HMAT provides latency information and bandwidth information >>> for both reads and writes. Will the devicetree/openfirmware/etc version >>> have >>> this same info, or will it be just different enough that it won't translate >>> into whatever I choose to stick in sysfs? >> For the initial implementation do we need to have a representation of >> all the performance data? Given that >> /sys/devices/system/node/nodeX/distance is the only generic >> performance attribute published by the kernel today it is already the >> case that applications that need to target specific memories need to >> go parse information that is not provided by the kernel by default. >> The question is can those specialized applications stay special and go >> parse the platform specific data sources, like raw HMAT, directly, or >> do we expect general purpose applications to make use of this data? I >> think a firmware-id to numa-node translation facility >> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can >> build on with more information as specific use cases arise. > We don't represent all the performance data, we only represent the data for > local initiator/target pairs. I do think that this is useful to have in sysfs > because it provides a way to easily answer the most commonly asked questions > (or at least what I'm guessing will be the most commmonly asked queststions), > i.e. 
"given a CPU, what are the speeds of the various types of memory attached > to it", and "given a chunk of memory, how fast is it and to which CPU is it > local"? By providing this base level of information I'm hoping to prevent > most applications from having to parse the HMAT directly. > > The question of whether or not to include this local performance information > was one of the main questions of the initial RFC patch series, and I did get > feedback (albiet off-list) that the local performance information was > valuable to at least some users. I did intentionally structure my (now very > short) set so that the performance information was added as a separate patch, > so we can get to the place you're talking about where we only provide firmware > id <=> proximity domain mappings by just leaving off the last patch in the > series. > Hello I can confirm that HPC runtimes are going to use these patches (at least all runtimes that use hwloc for topology discovery, but that's the vast majority of HPC anyway). We really didn't like KNL exposing a hacky SLIT table [1]. We had to explicitly detect that specific crazy table to find out which NUMA nodes were local to which cores, and to find out which NUMA nodes were HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the application because the reported latencies didn't match reality. Quite annoying. With Ross' patches, we can easily get
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed 20-12-17 09:41:07, Ross Zwisler wrote: > On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote: > > On Thu, Dec 14, 2017 at 02:00:32PM +0100, Michal Hocko wrote: > <> > > > What is the testing procedure? How can I set up qemu to simulate such HW? > > > > Well, the QEMU table simulation is gross, so I'd rather not get everyone > > testing with that. Injecting custom HMAT and SRAT tables via > > initrd/initramfs > > is a much better way: > > > > https://www.kernel.org/doc/Documentation/acpi/initrd_table_override.txt > > > > Dan recently posted a patch that lets this happen for the HMAT: > > > > https://lists.01.org/pipermail/linux-nvdimm/2017-December/013545.html > > > > I'm working right now on getting an easier way to generate HMAT tables - > > I'll > > let you know when I have something working. > I've posted details on how to set up test configurations using injected HMAT > and SRAT tables here: > > https://github.com/rzwisler/hmat_examples > > So far I've got two different sample configs, and we can add more as they are > useful. Having the sample configs in github is also nice because if someone > finds a config that causes a kernel issue it can be reported then added to > this list of example configs for future testing. > > Please let me know if you have trouble getting this working. Thanks a lot Ross, I will try this but things are getting pretty busy here before the holiday so I won't be able to get to it and your other email before the new year. Sorry about that. -- Michal Hocko SUSE Labs
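For anyone else wanting to reproduce this, the table-injection flow from the initrd_table_override document linked above boils down to roughly the following (file names are placeholders):

    iasl hmat.asl                      # compile the ASL source into hmat.aml
    mkdir -p kernel/firmware/acpi
    cp hmat.aml srat.aml kernel/firmware/acpi/
    find kernel | cpio -H newc --create > instrumented_initrd
    # prepend the override cpio to the normal initrd and boot the result
    cat instrumented_initrd /boot/initrd.img > /boot/initrd.img-hmat-test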
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
Matthew Wilcox writes: > On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote: >> What I'm hoping to do with this series is to just provide a sysfs >> representation of the HMAT so that applications can know which NUMA nodes to >> select with existing utilities like numactl. This series does not currently >> alter any kernel behavior, it only provides a sysfs interface. >> >> Say for example you had a system with some high bandwidth memory (HBM), and >> you wanted to use it for a specific application. You could use the sysfs >> representation of the HMAT to figure out which memory target held your HBM. >> You could do this by looking at the local bandwidth values for the various >> memory targets, so: >> >> # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps >> /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920 >> /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960 >> /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960 >> /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960 >> >> and look for the one that corresponds to your HBM speed. (These numbers are >> made up, but you get the idea.) > > Presumably ACPI-based platforms will not be the only ones who have the > ability to expose different bandwidth memories in the future. I think > we need a platform-agnostic way ... right, PowerPC people? Yes! I don't have any detail at hand but will try and rustle something up. cheers
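Building on the grep example quoted above, a one-liner along these lines (same assumed sysfs paths) would pick out the target with the highest local write bandwidth, i.e. the likely HBM node:

    grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps \
        | sort -t: -k2 -rn | head -1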
RE: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
> -Original Message- > From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of > Ross Zwisler ... > > On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote: ... > > initiator is a CPU? I'd have expected you to expose a memory controller > > abstraction rather than re-use storage terminology. > > Yea, I agree that at first blush it seems weird. It turns out that > looking at it in sort of a storage initiator/target way is beneficial, > though, because it allows us to cut down on the number of data values > we need to represent. > > For example the SLIT, which doesn't differentiate between initiator and > target proximity domains (and thus nodes) always represents a system > with N proximity domains using a NxN distance table. This makes sense > if every node contains both CPUs and memory. > > With the introduction of the HMAT, though, we can have memory-only > initiator nodes and we can explicitly associate them with their local > CPU. This is necessary so that we can separate memory with different > performance characteristics (HBM vs normal memory vs persistent memory, > for example) that are all attached to the same CPU. > > So, say we now have a system with 4 CPUs, and each of those CPUs has 3 > different types of memory attached to it. We now have 16 total proximity > domains, 4 CPU and 12 memory. The CPU cores that make up a node can have performance restrictions of their own; for example, they might max out at 10 GB/s even though the memory controller supports 120 GB/s (meaning you need to use 12 cores on the node to fully exercise memory). It'd be helpful to report this, so software can decide how many cores to use for bandwidth-intensive work. > If we represent this with the SLIT we end up with a 16 X 16 distance table > (256 entries), most of which don't matter because they are memory-to- > memory distances which don't make sense. > > In the HMAT, though, we separate out the initiators and the targets and > put them into separate lists. (See 5.2.27.4 System Locality Latency and > Bandwidth Information Structure in ACPI 6.2 for details.) So, this same > config in the HMAT only has 4*12=48 performance values of each type, all > of which convey meaningful information. > > The HMAT indeed even uses the storage "initiator" and "target" > terminology. :) Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem driver") have performance differences too. A CPU might include CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and memory controllers that reach 120 GB/s. I guess these would be represented as extra initiators on the node? --- Robert Elliott, HPE Persistent Memory
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote: > On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler > wrote: > > On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote: > >> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote: > >> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote: > >> > > I don't know what the right interface is, but my laptop has a set of > >> > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the > >> > > right place to expose write_bw (etc). > >> > > >> > Those directories are already too redundant and wasteful. I think we'd > >> > really rather not add to them. In addition, it's technically possible > >> > to have a memory section span NUMA nodes and have different performance > >> > properties, which make it impossible to represent there. > >> > > >> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have > >> > uniform performance properties in the HMAT, and we just so happen to > >> > always create one NUMA node per PXM. So, NUMA nodes really are a good > >> > fit. > >> > >> I think you're missing my larger point which is that I don't think this > >> should be exposed to userspace as an ACPI feature. Because if you do, > >> then it'll also be exposed to userspace as an openfirmware feature. > >> And sooner or later a devicetree feature. And then writing a portable > >> program becomes an exercise in suffering. > >> > >> So, what's the right place in sysfs that isn't tied to ACPI? A new > >> directory or set of directories under /sys/devices/system/memory/ ? > > > > Oh, the current location isn't at all tied to acpi except that it happens to > > be named 'hmat'. When it was all named 'hmem' it was just: > > > > /sys/devices/system/hmem > > > > Which has no ACPI-isms at all. I'm happy to move it under > > /sys/devices/system/memory/hmat if that's helpful, but I think we still have > > the issue that the data represented therein is still pulled right from the > > HMAT, and I don't know how to abstract it into something more platform > > agnostic until I know what data is provided by those other platforms. > > > > For example, the HMAT provides latency information and bandwidth information > > for both reads and writes. Will the devicetree/openfirmware/etc version > > have > > this same info, or will it be just different enough that it won't translate > > into whatever I choose to stick in sysfs? > > For the initial implementation do we need to have a representation of > all the performance data? Given that > /sys/devices/system/node/nodeX/distance is the only generic > performance attribute published by the kernel today it is already the > case that applications that need to target specific memories need to > go parse information that is not provided by the kernel by default. > The question is can those specialized applications stay special and go > parse the platform specific data sources, like raw HMAT, directly, or > do we expect general purpose applications to make use of this data? I > think a firmware-id to numa-node translation facility > (/sys/devices/system/node/nodeX/fwid) is a simple start that we can > build on with more information as specific use cases arise. We don't represent all the performance data, we only represent the data for local initiator/target pairs. I do think that this is useful to have in sysfs because it provides a way to easily answer the most commonly asked questions (or at least what I'm guessing will be the most commonly asked questions), i.e. 
"given a CPU, what are the speeds of the various types of memory attached to it", and "given a chunk of memory, how fast is it and to which CPU is it local"? By providing this base level of information I'm hoping to prevent most applications from having to parse the HMAT directly. The question of whether or not to include this local performance information was one of the main questions of the initial RFC patch series, and I did get feedback (albeit off-list) that the local performance information was valuable to at least some users. I did intentionally structure my (now very short) set so that the performance information was added as a separate patch, so we can get to the place you're talking about where we only provide firmware id <=> proximity domain mappings by just leaving off the last patch in the series. I'm personally still of the opinion though that this last patch does add value.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler wrote: > On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote: >> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote: >> > On 12/20/2017 10:19 AM, Matthew Wilcox wrote: >> > > I don't know what the right interface is, but my laptop has a set of >> > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the >> > > right place to expose write_bw (etc). >> > >> > Those directories are already too redundant and wasteful. I think we'd >> > really rather not add to them. In addition, it's technically possible >> > to have a memory section span NUMA nodes and have different performance >> > properties, which make it impossible to represent there. >> > >> > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have >> > uniform performance properties in the HMAT, and we just so happen to >> > always create one NUMA node per PXM. So, NUMA nodes really are a good fit. >> >> I think you're missing my larger point which is that I don't think this >> should be exposed to userspace as an ACPI feature. Because if you do, >> then it'll also be exposed to userspace as an openfirmware feature. >> And sooner or later a devicetree feature. And then writing a portable >> program becomes an exercise in suffering. >> >> So, what's the right place in sysfs that isn't tied to ACPI? A new >> directory or set of directories under /sys/devices/system/memory/ ? > > Oh, the current location isn't at all tied to acpi except that it happens to > be named 'hmat'. When it was all named 'hmem' it was just: > > /sys/devices/system/hmem > > Which has no ACPI-isms at all. I'm happy to move it under > /sys/devices/system/memory/hmat if that's helpful, but I think we still have > the issue that the data represented therein is still pulled right from the > HMAT, and I don't know how to abstract it into something more platform > agnostic until I know what data is provided by those other platforms. > > For example, the HMAT provides latency information and bandwidth information > for both reads and writes. Will the devicetree/openfirmware/etc version have > this same info, or will it be just different enough that it won't translate > into whatever I choose to stick in sysfs? For the initial implementation do we need to have a representation of all the performance data? Given that /sys/devices/system/node/nodeX/distance is the only generic performance attribute published by the kernel today it is already the case that applications that need to target specific memories need to go parse information that is not provided by the kernel by default. The question is can those specialized applications stay special and go parse the platform specific data sources, like raw HMAT, directly, or do we expect general purpose applications to make use of this data? I think a firmware-id to numa-node translation facility (/sys/devices/system/node/nodeX/fwid) is a simple start that we can build on with more information as specific use cases arise.
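For reference, the generic attribute mentioned above is readable today on any NUMA machine, independent of this patch set; on a hypothetical two-node box the output looks something like:

    $ cat /sys/devices/system/node/node0/distance
    10 21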
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote: > On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote: > > On 12/20/2017 10:19 AM, Matthew Wilcox wrote: > > > I don't know what the right interface is, but my laptop has a set of > > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the > > > right place to expose write_bw (etc). > > > > Those directories are already too redundant and wasteful. I think we'd > > really rather not add to them. In addition, it's technically possible > > to have a memory section span NUMA nodes and have different performance > > properties, which make it impossible to represent there. > > > > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have > > uniform performance properties in the HMAT, and we just so happen to > > always create one NUMA node per PXM. So, NUMA nodes really are a good fit. > > I think you're missing my larger point which is that I don't think this > should be exposed to userspace as an ACPI feature. Because if you do, > then it'll also be exposed to userspace as an openfirmware feature. > And sooner or later a devicetree feature. And then writing a portable > program becomes an exercise in suffering. > > So, what's the right place in sysfs that isn't tied to ACPI? A new > directory or set of directories under /sys/devices/system/memory/ ? Oh, the current location isn't at all tied to acpi except that it happens to be named 'hmat'. When it was all named 'hmem' it was just: /sys/devices/system/hmem Which has no ACPI-isms at all. I'm happy to move it under /sys/devices/system/memory/hmat if that's helpful, but I think we still have the issue that the data represented therein is still pulled right from the HMAT, and I don't know how to abstract it into something more platform agnostic until I know what data is provided by those other platforms. For example, the HMAT provides latency information and bandwidth information for both reads and writes. Will the devicetree/openfirmware/etc version have this same info, or will it be just different enough that it won't translate into whatever I choose to stick in sysfs?
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote: > On 12/20/2017 10:19 AM, Matthew Wilcox wrote: > > I don't know what the right interface is, but my laptop has a set of > > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the > > right place to expose write_bw (etc). > > Those directories are already too redundant and wasteful. I think we'd > really rather not add to them. In addition, it's technically possible > to have a memory section span NUMA nodes and have different performance > properties, which make it impossible to represent there. > > In any case, ACPI PXM's (Proximity Domains) are guaranteed to have > uniform performance properties in the HMAT, and we just so happen to > always create one NUMA node per PXM. So, NUMA nodes really are a good fit. I think you're missing my larger point which is that I don't think this should be exposed to userspace as an ACPI feature. Because if you do, then it'll also be exposed to userspace as an openfirmware feature. And sooner or later a devicetree feature. And then writing a portable program becomes an exercise in suffering. So, what's the right place in sysfs that isn't tied to ACPI? A new directory or set of directories under /sys/devices/system/memory/ ?
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote: > On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote: > > What I'm hoping to do with this series is to just provide a sysfs > > representation of the HMAT so that applications can know which NUMA nodes to > > select with existing utilities like numactl. This series does not currently > > alter any kernel behavior, it only provides a sysfs interface. > > > > Say for example you had a system with some high bandwidth memory (HBM), and > > you wanted to use it for a specific application. You could use the sysfs > > representation of the HMAT to figure out which memory target held your HBM. > > You could do this by looking at the local bandwidth values for the various > > memory targets, so: > > > > # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps > > /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920 > > /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960 > > /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960 > > /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960 > > > > and look for the one that corresponds to your HBM speed. (These numbers are > > made up, but you get the idea.) > > Presumably ACPI-based platforms will not be the only ones who have the > ability to expose different bandwidth memories in the future. I think > we need a platform-agnostic way ... right, PowerPC people? Hey Matthew, Yep, this is where I started as well. My plan with my initial implementation was to try and make the sysfs representation as platform agnostic as possible, and just have the ACPI HMAT as one of the many places to gather the data needed to populate sysfs. However, as I began coding, the implementation became very specific to the HMAT, probably because I don't know of a way that this type of info is represented on another platform. John Hubbard noticed the same thing and asked me to s/HMEM/HMAT/ everywhere and just make it HMAT specific, and to prevent it from being confused with the HMM work: https://lkml.org/lkml/2017/7/7/33 https://lkml.org/lkml/2017/7/7/442 I'm open to making it more platform agnostic if I can get my hands on a parallel effort in another platform and tease out the commonality, but trying to do that without a second example hasn't worked out. If we don't have a good second example right now I think maybe we should put this in and then merge it with the second example when it comes along. > I don't know what the right interface is, but my laptop has a set of > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the > right place to expose write_bw (etc). > > > Once you know the NUMA node of your HBM, you can figure out the NUMA node of > > its local initiator: > > > > # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init* > > /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0 > > > > So, in our made-up example our HBM is located in numa node 2, and the local > > CPU for that HBM is at numa node 0. > > initiator is a CPU? I'd have expected you to expose a memory controller > abstraction rather than re-use storage terminology. Yea, I agree that at first blush it seems weird. It turns out that looking at it in sort of a storage initiator/target way is beneficial, though, because it allows us to cut down on the number of data values we need to represent. 
For example the SLIT, which doesn't differentiate between initiator and target proximity domains (and thus nodes) always represents a system with N proximity domains using a NxN distance table. This makes sense if every node contains both CPUs and memory. With the introduction of the HMAT, though, we can have memory-only initiator nodes and we can explicitly associate them with their local CPU. This is necessary so that we can separate memory with different performance characteristics (HBM vs normal memory vs persistent memory, for example) that are all attached to the same CPU. So, say we now have a system with 4 CPUs, and each of those CPUs has 3 different types of memory attached to it. We now have 16 total proximity domains, 4 CPU and 12 memory. If we represent this with the SLIT we end up with a 16 X 16 distance table (256 entries), most of which don't matter because they are memory-to-memory distances which don't make sense. In the HMAT, though, we separate out the initiators and the targets and put them into separate lists. (See 5.2.27.4 System Locality Latency and Bandwidth Information Structure in ACPI 6.2 for details.) So, this same config in the HMAT only has 4*12=48 performance values of each type, all of which convey meaningful information. The HMAT indeed even uses the storage "initiator" and "target" terminology. :)
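The entry-count arithmetic from this example, spelled out with the same numbers (4 initiator domains, 12 target domains, 16 domains total):

    initiators=4 targets=12
    domains=$(( initiators + targets ))
    echo "SLIT entries: $(( domains * domains ))"                   # 16 x 16 = 256
    echo "HMAT entries per data type: $(( initiators * targets ))"  # 4 x 12 = 48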
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/20/2017 10:19 AM, Matthew Wilcox wrote: > I don't know what the right interface is, but my laptop has a set of > /sys/devices/system/memory/memoryN/ directories. Perhaps this is the > right place to expose write_bw (etc). Those directories are already too redundant and wasteful. I think we'd really rather not add to them. In addition, it's technically possible to have a memory section span NUMA nodes and have different performance properties, which make it impossible to represent there. In any case, ACPI PXM's (Proximity Domains) are guaranteed to have uniform performance properties in the HMAT, and we just so happen to always create one NUMA node per PXM. So, NUMA nodes really are a good fit.
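For context, each existing memoryN directory already contains a symlink to the NUMA node that owns that section; the section and node numbers below are arbitrary examples:

    $ ls -d /sys/devices/system/memory/memory32/node*
    /sys/devices/system/memory/memory32/node1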
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote: > What I'm hoping to do with this series is to just provide a sysfs > representation of the HMAT so that applications can know which NUMA nodes to > select with existing utilities like numactl. This series does not currently > alter any kernel behavior, it only provides a sysfs interface. > > Say for example you had a system with some high bandwidth memory (HBM), and > you wanted to use it for a specific application. You could use the sysfs > representation of the HMAT to figure out which memory target held your HBM. > You could do this by looking at the local bandwidth values for the various > memory targets, so: > > # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps > /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920 > /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960 > /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960 > /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960 > > and look for the one that corresponds to your HBM speed. (These numbers are > made up, but you get the idea.) Presumably ACPI-based platforms will not be the only ones who have the ability to expose different bandwidth memories in the future. I think we need a platform-agnostic way ... right, PowerPC people? I don't know what the right interface is, but my laptop has a set of /sys/devices/system/memory/memoryN/ directories. Perhaps this is the right place to expose write_bw (etc). > Once you know the NUMA node of your HBM, you can figure out the NUMA node of > its local initiator: > > # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init* > /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0 > > So, in our made-up example our HBM is located in numa node 2, and the local > CPU for that HBM is at numa node 0. initiator is a CPU? I'd have expected you to expose a memory controller abstraction rather than re-use storage terminology.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote: > On Thu, Dec 14, 2017 at 02:00:32PM +0100, Michal Hocko wrote: <> > > What is the testing procedure? How can I set up qemu to simulate such HW? > > Well, the QEMU table simulation is gross, so I'd rather not get everyone > testing with that. Injecting custom HMAT and SRAT tables via initrd/initramfs > is a much better way: > > https://www.kernel.org/doc/Documentation/acpi/initrd_table_override.txt > > Dan recently posted a patch that lets this happen for the HMAT: > > https://lists.01.org/pipermail/linux-nvdimm/2017-December/013545.html > > I'm working right now on getting an easier way to generate HMAT tables - I'll > let you know when I have something working. I've posted details on how to set up test configurations using injected HMAT and SRAT tables here: https://github.com/rzwisler/hmat_examples So far I've got two different sample configs, and we can add more as they are useful. Having the sample configs in github is also nice because if someone finds a config that causes a kernel issue it can be reported and then added to this list of example configs for future testing. Please let me know if you have trouble getting this working.
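A rough outline of the table-injection flow described in that document; the file names below are examples, and the .aml files would come from the hmat_examples repo or from your own compiled ASL:

    # Tables must sit in kernel/firmware/acpi/ inside an uncompressed cpio
    # archive that is concatenated in front of the regular initrd.
    mkdir -p kernel/firmware/acpi
    cp hmat.aml srat.aml kernel/firmware/acpi/
    find kernel | cpio -H newc --create > /boot/tables_initrd
    cat /boot/initrd >> /boot/tables_initrd
    # Then boot with initrd=/boot/tables_initrd (bootloader-specific).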
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On Thu, Dec 14, 2017 at 02:00:32PM +0100, Michal Hocko wrote: > [CC linux-api] Oh, thanks. I'll add them to my CC list for sysfs related changes in the future. > On Wed 13-12-17 19:10:16, Ross Zwisler wrote: > > This is the third revision of my patches adding a sysfs representation > > of the ACPI Heterogeneous Memory Attribute Table (HMAT). These patches > > are based on v4.15-rc3 and a working tree can be found here: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmat_v3 > > > > My goal is to get these patches merged for v4.16. > > Has anyone actually reviewed the overall design already for this to be a 4.16 > thing? I do not see any acks/reviewed-bys in any of the patches... > > > Changes from previous version (https://lkml.org/lkml/2017/7/6/749): > > ... comments on this last posting are touching the surface rather than > really discussing the overall design. Yep, that's a fair assessment. I would love a more in-depth review of the code so far. :) What I'm hoping to do with this series is to just provide a sysfs representation of the HMAT so that applications can know which NUMA nodes to select with existing utilities like numactl. This series does not currently alter any kernel behavior, it only provides a sysfs interface. Say for example you had a system with some high bandwidth memory (HBM), and you wanted to use it for a specific application. You could use the sysfs representation of the HMAT to figure out which memory target held your HBM. You could do this by looking at the local bandwidth values for the various memory targets, so: # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920 /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960 /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960 /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960 and look for the one that corresponds to your HBM speed. (These numbers are made up, but you get the idea.) Alternatively if you knew the physical addresses of your HBM you could look for it by finding the numa node that owns the appropriate memory sections, so: # ls -d /sys/devices/system/hmat/mem_tgt2/node2/memory* /sys/devices/system/hmat/mem_tgt2/node2/memory0 /sys/devices/system/hmat/mem_tgt2/node2/memory1 etc. Once you know the NUMA node of your HBM, you can figure out the NUMA node of its local initiator: # ls -d /sys/devices/system/hmat/mem_tgt2/local_init/mem_init* /sys/devices/system/hmat/mem_tgt2/local_init/mem_init0 So, in our made-up example our HBM is located in numa node 2, and the local CPU for that HBM is at numa node 0. You would then use numactl to bind your app to those numa nodes: numactl --membind=2 --cpunodebind=0 ./my_application Does that make sense? Eventually we can enhance numactl so it can automatically choose memory with higher bandwidth, etc., but I think just this bit of kernel enabling gets us started in the right direction. > > - Changed "HMEM" to "HMAT" and "hmem" to "hmat" throughout to make sure > >that this effort doesn't get confused with Jerome's HMM work and to > >make it clear that this enabling is tightly tied to the ACPI HMAT > >table. (John Hubbard) > > > > - Moved the link in the initiator (i.e. mem_init0/mem_tgt2) from > >pointing to the "mem_tgt2/local_init" attribute group to instead > >point at the mem_tgt2 target itself. 
(Brice Goglin) > > > > - Simplified the contents of both the initiators and the targets so > >that we just symlink to the NUMA node and don't duplicate > >information. For initiators this means that we no longer enumerate > >CPUs, and for targets this means that we don't provide physical > >address start and length information. All of this is already > >available in the NUMA node directory itself (i.e. > >/sys/devices/system/node/node0), and it already accounts for the fact > >that both multiple CPUs and multiple memory regions can be owned by a > >given NUMA node. Also removed some extra attributes (is_enabled, > >is_isolated) which I don't think are useful at this point in time. > > > > I have tested this against many different configs that I implemented > > using qemu. > > What is the testing procedure? How can I set up qemu to simulate such HW? Well, the QEMU table simulation is gross, so I'd rather not get everyone testing with that. Injecting custom HMAT and SRAT tables via initrd/initramfs is a much better way: https://www.kernel.org/doc/Documentation/acpi/initrd_table_override.txt Dan recently posted a patch that lets this happen for the HMAT: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013545.html I'm working right now on getting an easier way to generate HMAT tables - I'll let you know when I have something working. > [Keeping the rest of the email
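Putting the pieces above together, a sketch that picks the target with the highest local write bandwidth and binds to it; the sysfs values and ./my_application are the same made-up examples used above:

    best=$(grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps \
           | sort -t: -k2 -rn | head -1 | cut -d: -f1)
    best=$(dirname "$(dirname "$best")")                    # e.g. .../mem_tgt2
    mem_node=$(basename "$best"/node* | sed 's/^node//')
    cpu_node=$(basename "$best"/local_init/mem_init* | sed 's/^mem_init//')
    numactl --membind=$mem_node --cpunodebind=$cpu_node ./my_application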
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
[CC linux-api] On Wed 13-12-17 19:10:16, Ross Zwisler wrote: > This is the third revision of my patches adding a sysfs representation > of the ACPI Heterogeneous Memory Attribute Table (HMAT). These patches > are based on v4.15-rc3 and a working tree can be found here: > > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmat_v3 > > My goal is to get these patches merged for v4.16. Has anyone actually reviewed the overall design already for this to be a 4.16 thing? I do not see any acks/reviewed-bys in any of the patches... > Changes from previous version (https://lkml.org/lkml/2017/7/6/749): ... comments on this last posting are touching the surface rather than really discussing the overall design. > - Changed "HMEM" to "HMAT" and "hmem" to "hmat" throughout to make sure >that this effort doesn't get confused with Jerome's HMM work and to >make it clear that this enabling is tightly tied to the ACPI HMAT >table. (John Hubbard) > > - Moved the link in the initiator (i.e. mem_init0/mem_tgt2) from >pointing to the "mem_tgt2/local_init" attribute group to instead >point at the mem_tgt2 target itself. (Brice Goglin) > > - Simplified the contents of both the initiators and the targets so >that we just symlink to the NUMA node and don't duplicate >information. For initiators this means that we no longer enumerate >CPUs, and for targets this means that we don't provide physical >address start and length information. All of this is already >available in the NUMA node directory itself (i.e. >/sys/devices/system/node/node0), and it already accounts for the fact >that both multiple CPUs and multiple memory regions can be owned by a >given NUMA node. Also removed some extra attributes (is_enabled, >is_isolated) which I don't think are useful at this point in time. > > I have tested this against many different configs that I implemented > using qemu. What is the testing procedure? How can I set up qemu to simulate such HW? [Keeping the rest of the email for linux-api reference] > --- > > Quick Summary > > Platforms exist today which have multiple types of memory attached to a > single CPU. These disparate memory ranges have some characteristics in > common, such as CPU cache coherence, but they can have wide ranges of > performance both in terms of latency and bandwidth. > > For example, consider a system that contains persistent memory, standard > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU. > There could potentially be an order of magnitude or more difference in > performance between the slowest and fastest memory attached to that CPU. > > With the current Linux code NUMA nodes are CPU-centric, so all the memory > attached to a given CPU will be lumped into the same NUMA node. This makes > it very difficult for userspace applications to understand the performance > of different memory ranges on a given CPU. > > We solve this issue by providing userspace with performance information on > individual memory ranges. This performance information is exposed via > sysfs: > > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null > mem_tgt2/firmware_id:1 > mem_tgt2/is_cached:0 > mem_tgt2/local_init/read_bw_MBps:40960 > mem_tgt2/local_init/read_lat_nsec:50 > mem_tgt2/local_init/write_bw_MBps:40960 > mem_tgt2/local_init/write_lat_nsec:50 > > This allows applications to easily find the memory that they want to use. 
> We expect that the existing NUMA APIs will be enhanced to use this new > information so that applications can continue to use them to select their > desired memory. How? Could you provide some examples? > Lots of Details > > This patch set provides a sysfs representation of parts of the > Heterogeneous Memory Attribute Table (HMAT), newly defined in ACPI 6.2. > One major conceptual change in ACPI 6.2 related to this work is that > proximity domains no longer need to contain a processor. We can now > have memory-only proximity domains, which means that we can now have > memory-only Linux NUMA nodes. > > Here is an example configuration where we have a single processor, one > range of regular memory and one range of HBM: >
> +---------------+      +----------------+
> | Processor     |      | Memory         |
> | prox domain 0 +------+ prox domain 1  |
> | NUMA node 1   |      | NUMA node 2    |
> +-------+-------+      +----------------+
>         |
> +-------+-------+
> | HBM           |
> | prox domain 2 |
> | NUMA node 0   |
> +---------------+
>
> This gives us one initiator (the processor) and two targets (the two memory > ranges). Each of these three has its own ACPI proximity domain and > associated Linux NUMA node. Note also that while there is a 1:1 mapping > from each proximity domain to each NUMA node, the numbers don't necessarily > match up. Additionally we can have extra NUMA nodes that don't map back to > ACPI