Re: [Xen-devel] Draft NVDIMM proposal
On Thu, May 17, 2018 at 7:52 AM, George Dunlap wrote:
> On 05/15/2018 07:06 PM, Dan Williams wrote:
>> On Tue, May 15, 2018 at 7:19 AM, George Dunlap wrote:
>>> So, who decides what this SPA range and interleave set is? Can the operating system change these interleave sets and mappings, or change data from PMEM to BLK, and if so, how?
>>
>> The interleave-set to SPA range association and delineation of capacity between PMEM and BLK access modes is currently out-of-scope for ACPI. The BIOS reports the configuration to the OS via the NFIT, but the configuration is currently written by vendor specific tooling. Longer term it would be great for this mechanism to become standardized and available to the OS, but for now it requires platform specific tooling to change the DIMM interleave configuration.
>
> OK -- I was sort of assuming that different hardware would have different drivers in Linux that ndctl knew how to drive (just like any other hardware with vendor-specific interfaces);

That way potentially lies madness, at least for me as a Linux sub-system maintainer. There is no value in the kernel helping vendors do the same thing in slightly different ways. libnvdimm + nfit is 100% an open-standards driver, and the hope is to be able to deprecate non-public vendor-specific support over time and consolidate work-alike support from vendor specs into ACPI. The public standards that the kernel enables are:

http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf
http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/
https://msdn.microsoft.com/library/windows/hardware/mt604741

> but it sounds a bit more like at the moment it's binary blobs either in the BIOS/firmware, or a vendor-supplied tool.
Only for the functionality, like interleave set configuration, that is not defined in those standards. Even then the impact is only userspace tooling, not the kernel. Also, we are seeing that functionality bleed into the standards over time. For example, label methods used to exist only in the Intel DSM document, but have now been standardized in ACPI 6.2. Firmware update, which was a private interface, has now graduated to the public Intel DSM document. Hopefully more and more functionality transitions into an ACPI definition over time. Any common functionality in those Intel, HPE, and MSFT command formats is comprehended / abstracted by the ndctl tool.

>>> And so (here's another guess) -- when you're talking about namespaces and label areas, you're talking about namespaces stored *within a pre-existing SPA range*. You use the same format as described in the UEFI spec, but ignore all the stuff about interleave sets and whatever, and use system physical addresses relative to the SPA range rather than DPAs.
>>
>> Well, we don't ignore it because we need to validate in the driver that the interleave set configuration matches a checksum that we generated when the namespace was first instantiated on the interleave set. However, you are right, for accesses at run time all we care about is the SPA for PMEM accesses.
> [snip]
>> They can change, but only under the control of the BIOS. All changes to the interleave set configuration need a reboot because the memory controller needs to be set up differently at system-init time.
> [snip]
>> No, the checksum I'm referring to is the interleave set cookie (see: "SetCookie" in the UEFI 2.7 specification). It validates that the interleave set backing the SPA has not changed configuration since the last boot.
> [snip]
>> The NVDIMM just provides a storage area for the OS to write opaque data that just happens to conform to the UEFI Namespace label format. The interleave-set configuration is stored in yet another out-of-band location on the DIMM or on some platform-specific storage location and is consulted / restored by the BIOS each boot. The NFIT is the output from the platform specific physical mappings of the DIMMs, and Namespaces are logical volumes built on top of those hard-defined NFIT boundaries.
>
> OK, so what I'm hearing is:
>
> The label area isn't "within a pre-existing SPA range" as I was guessing (i.e., similar to a partition table residing within a disk); it is the per-DIMM label area as described by the UEFI spec.
>
> But, the interleave set data in the label area doesn't *control* the hardware -- the NVDIMM controller / BIOS / firmware don't read it or do anything based on what's in it. Rather, the interleave set data in the label area is there to *record*, for the operating system's benefit, what the hardware configuration was when the labels were created, so that if it changes, the OS knows that the label area is
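The interleave set cookie check described above can be illustrated with a toy sketch. This is not the UEFI "SetCookie" algorithm (which hashes specific NFIT fields); the hash choice, tuple layout, and names here are invented for illustration:

```python
import hashlib

def interleave_set_cookie(dimm_info):
    # Toy stand-in for the UEFI "SetCookie": hash the physical
    # composition of the interleave set so any change is detectable.
    h = hashlib.sha256()
    for serial, dpa_offset, size in sorted(dimm_info):
        h.update(f"{serial}:{dpa_offset}:{size}".encode())
    return h.hexdigest()

def labels_current(stored_cookie, dimm_info):
    # At boot the OS recomputes the cookie from the NFIT-described
    # configuration and compares it with the value recorded in the
    # labels when the namespace was created.
    return stored_cookie == interleave_set_cookie(dimm_info)

config = [("DIMM-A", 0, 1 << 30), ("DIMM-B", 0, 1 << 30)]
cookie = interleave_set_cookie(config)
assert labels_current(cookie, config)          # same composition: labels usable
assert not labels_current(cookie, config[:1])  # composition changed: labels stale
```

A mismatch means the labels are stale or the physical composition of the set changed, exactly the "record, not control" role described above.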
Re: [Xen-devel] Draft NVDIMM proposal
On 05/15/2018 07:06 PM, Dan Williams wrote:
> On Tue, May 15, 2018 at 7:19 AM, George Dunlap wrote:
>> So, who decides what this SPA range and interleave set is? Can the operating system change these interleave sets and mappings, or change data from PMEM to BLK, and if so, how?
>
> The interleave-set to SPA range association and delineation of capacity between PMEM and BLK access modes is currently out-of-scope for ACPI. The BIOS reports the configuration to the OS via the NFIT, but the configuration is currently written by vendor specific tooling. Longer term it would be great for this mechanism to become standardized and available to the OS, but for now it requires platform specific tooling to change the DIMM interleave configuration.

OK -- I was sort of assuming that different hardware would have different drivers in Linux that ndctl knew how to drive (just like any other hardware with vendor-specific interfaces); but it sounds a bit more like at the moment it's binary blobs either in the BIOS/firmware, or a vendor-supplied tool.

>> And so (here's another guess) -- when you're talking about namespaces and label areas, you're talking about namespaces stored *within a pre-existing SPA range*. You use the same format as described in the UEFI spec, but ignore all the stuff about interleave sets and whatever, and use system physical addresses relative to the SPA range rather than DPAs.
>
> Well, we don't ignore it because we need to validate in the driver that the interleave set configuration matches a checksum that we generated when the namespace was first instantiated on the interleave set. However, you are right, for accesses at run time all we care about is the SPA for PMEM accesses.

[snip]

> They can change, but only under the control of the BIOS. All changes to the interleave set configuration need a reboot because the memory controller needs to be set up differently at system-init time.

[snip]

> No, the checksum I'm referring to is the interleave set cookie (see: "SetCookie" in the UEFI 2.7 specification). It validates that the interleave set backing the SPA has not changed configuration since the last boot.

[snip]

> The NVDIMM just provides a storage area for the OS to write opaque data that just happens to conform to the UEFI Namespace label format. The interleave-set configuration is stored in yet another out-of-band location on the DIMM or on some platform-specific storage location and is consulted / restored by the BIOS each boot. The NFIT is the output from the platform specific physical mappings of the DIMMs, and Namespaces are logical volumes built on top of those hard-defined NFIT boundaries.

OK, so what I'm hearing is:

The label area isn't "within a pre-existing SPA range" as I was guessing (i.e., similar to a partition table residing within a disk); it is the per-DIMM label area as described by the UEFI spec.

But, the interleave set data in the label area doesn't *control* the hardware -- the NVDIMM controller / BIOS / firmware don't read it or do anything based on what's in it. Rather, the interleave set data in the label area is there to *record*, for the operating system's benefit, what the hardware configuration was when the labels were created, so that if it changes, the OS knows that the label area is invalid; it must either refrain from touching the NVRAM (if it wants to preserve the data), or write a new label area.

The OS can also use labels to partition a single SPA range into several namespaces. It can't change the interleaving, but it can specify that [0-A) is one namespace, [A-B) is another namespace, and these namespaces will naturally map into the SPA range advertised in the NFIT.
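The partitioning just described, labels carving one fixed SPA range into the namespaces [0-A) and [A-B), can be sketched as follows. The function and the label layout are invented for illustration; real labels live in the per-DIMM label area:

```python
def namespace_spa_ranges(spa_base, spa_size, labels):
    # Each label entry gives (name, offset, size) relative to the SPA
    # range; the OS turns these into absolute system physical addresses.
    ranges = {}
    spans = []
    for name, offset, size in labels:
        if offset + size > spa_size:
            raise ValueError(f"namespace {name} exceeds the SPA range")
        ranges[name] = (spa_base + offset, spa_base + offset + size)
        spans.append((offset, offset + size))
    spans.sort()
    for (_, end1), (start2, _) in zip(spans, spans[1:]):
        if start2 < end1:
            raise ValueError("namespaces overlap")
    return ranges

# A single 2 GiB SPA range split into [0-A) and [A-B), as in the text.
A, B = 1 << 30, 2 << 30
ranges = namespace_spa_ranges(0x1_0000_0000, B,
                              [("ns0", 0, A), ("ns1", A, B - A)])
assert ranges["ns0"] == (0x1_0000_0000, 0x1_0000_0000 + A)
assert ranges["ns1"] == (0x1_0000_0000 + A, 0x1_0000_0000 + B)
```

The interleaving itself never appears here: by the time the OS does this arithmetic, the SPA range is a fixed fact reported by the NFIT.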
And if a controller allows the same memory to be used either as PMEM or PBLK, it can write which *should* be used for which, and then can avoid accessing the same underlying NVRAM in two different ways (which will yield unpredictable results). That makes sense. >> If SPA regions don't change after boot, and if Xen can find its own >> Xen-specific namespace to use for the frame tables by reading the NFIT >> table, then that significantly reduces the amount of interaction it >> needs with Linux. >> >> If SPA regions *can* change after boot, and if Xen must rely on Linux to >> read labels and find out what it can safely use for frame tables, then >> it makes things significantly more involved. Not impossible by any >> means, but a lot more complicated. >> >> Hope all that makes sense -- thanks again for your help. > > I think it does, but it seems namespaces are out of reach for Xen > without some agent / enabling that can execute the necessary AML > methods. Sure, we're pretty much used to that. :-) We'll have Linux read the label area and tell Xen what it needs to know. But: * Xen can know the SPA ranges of all potential NVDIMMs before dom0 starts. So it can tell, for instance, if a page
Re: [Xen-devel] Draft NVDIMM proposal
On 15/05/18 19:06, Dan Williams wrote:
> On Tue, May 15, 2018 at 7:19 AM, George Dunlap wrote:
>> On 05/11/2018 05:33 PM, Dan Williams wrote:
>>
>> This is all pretty foundational. Xen can read static ACPI tables, but it can't do AML. So to do a proper design for Xen, we need to know:
>
> Oooh, ok, no AML in Xen...
>
>> 1. If Xen can find out, without Linux's help, what namespaces exist and if there is one it can use for its own purposes
>
> Yeah, no, not without calling AML methods.

One particularly thorny issue with Xen's architecture is the ownership of the ACPI OSPM, and the fact that there can only be one in the system. Dom0 has to be the OSPM in practice, as we don't want to port most of the Linux drivers and infrastructure into the hypervisor. If we knew a priori that certain AML methods had no side effects, then we could in principle execute them from the hypervisor, but this is an undecidable problem in general. As a result, everything involving AML requires dom0 to decipher the information and pass it to Xen at boot.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] Draft NVDIMM proposal
On Tue, May 15, 2018 at 7:19 AM, George Dunlap wrote:
> On 05/11/2018 05:33 PM, Dan Williams wrote:
>> [ adding linux-nvdimm ]
>>
>> Great write up! Some comments below...
>
> Thanks for the quick response!
>
> It seems I still have some fundamental misconceptions about what's going on, so I'd better start with that. :-)
>
> Here's the part that I'm having a hard time getting.
>
> If actual data on the NVDIMMs is a noun, and the act of writing is a verb, then the SPA and interleave sets are adverbs: they define *how* the write happens. When the processor says, "Write to address X", the memory controller converts address X into a (DIMM, DPA) tuple to actually write the data.
>
> So, who decides what this SPA range and interleave set is? Can the operating system change these interleave sets and mappings, or change data from PMEM to BLK, and if so, how?

The interleave-set to SPA range association and delineation of capacity between PMEM and BLK access modes is currently out-of-scope for ACPI. The BIOS reports the configuration to the OS via the NFIT, but the configuration is currently written by vendor specific tooling. Longer term it would be great for this mechanism to become standardized and available to the OS, but for now it requires platform specific tooling to change the DIMM interleave configuration.

> If you read through section 13.19 of the UEFI manual, it seems to imply that this is determined by the label area -- that each DIMM has a separate label area describing regions local to that DIMM; and that if you have 4 DIMMs you'll have 4 label areas, and each label area will have a label describing the DPA region on that DIMM which corresponds to the interleave set. And somehow someone sets up the interleave sets and SPA based on what's written there.
>
> Which would mean that an operating system could change how the interleave sets work by rewriting the various labels on the DIMMs; for instance, changing a single 4-way set spanning the entirety of 4 DIMMs, to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning half of 2 DIMMs each.

If a DIMM supports both the PMEM and BLK mechanisms for accessing the same DPA, then the label provides the disambiguation and tells the OS to enforce one access mechanism per DPA at a time. Otherwise the OS has no ability to affect the interleave-set configuration; it's all initialized by platform BIOS/firmware before the OS boots.

> But then you say:
>
>> Unlike NVMe, an NVDIMM itself has no concept of namespaces. Some DIMMs provide a "label area" which is an out-of-band non-volatile memory area where the OS can store whatever it likes. The UEFI 2.7 specification defines a data format for the definition of namespaces on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure.
>
> OK, so that sounds like no, that's not what happens. So where do the SPA range and interleave sets come from?
>
> Random guess: The BIOS / firmware makes it up. Either it's hard-coded, or there's some menu in the BIOS you can use to change things around; but once it hits the operating system, that's it -- the mapping of SPA range onto interleave sets onto DIMMs is, from the operating system's point of view, fixed.

Correct.

> And so (here's another guess) -- when you're talking about namespaces and label areas, you're talking about namespaces stored *within a pre-existing SPA range*. You use the same format as described in the UEFI spec, but ignore all the stuff about interleave sets and whatever, and use system physical addresses relative to the SPA range rather than DPAs.

Well, we don't ignore it because we need to validate in the driver that the interleave set configuration matches a checksum that we generated when the namespace was first instantiated on the interleave set. However, you are right, for accesses at run time all we care about is the SPA for PMEM accesses.

> Is that right?
>
> But then there's things like this:
>
>> There is no obligation for an NVDIMM to provide a label area, and as far as I know all NVDIMMs on the market today do not provide a label area.
> [snip]
>> Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.
>
> So are "all NVDIMMs on the market today" then classed as "legacy" NVDIMMs because they don't support labels? And if labels are simply the NVDIMM equivalent of a partition table, then what does it mean to "support" or "not support" labels?

Yes, the term "legacy" has been thrown around for NVDIMMs that do not support labels. The way this support is determined is whether the platform publishes the _LSI, _LSR, and _LSW methods in ACPI (see: 6.5.10 NVDIMM Label Methods in ACPI 6.2a). I.e. each DIMM is represented by an ACPI device object, and we query those
Re: [Xen-devel] Draft NVDIMM proposal
On Tue, May 15, 2018 at 5:26 AM, Jan Beulich wrote:
>>> On 15.05.18 at 12:12, wrote:
[..]
>> That is, each fsdax / devdax namespace has a superblock that, in part, defines what parts are used for Linux and what parts are used for data. Or to put it a different way: Linux decides which parts of a namespace to use for page structures, and writes it down in the metadata starting in the first page of the namespace.
>
> And that metadata layout is agreed upon between all OS vendors?

The only agreed-upon metadata layouts across all OS vendors are the ones that are specified in UEFI. We typically only need inter-OS and UEFI compatibility for booting and other pre-OS accesses. For Linux, "raw" and "sector" mode namespaces defined by namespace labels are inter-OS compatible, while "fsdax", "devdax", and so-called "label-less" configurations are not.
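Dan's compatibility summary can be condensed into a small lookup. The mode names are the ndctl ones used in this thread; the table and function are just an illustrative restatement of the paragraph above:

```python
# Which namespace configurations are expected to be readable across
# OSes / pre-OS environments, per the summary above: only layouts
# specified in UEFI qualify.
INTER_OS_COMPATIBLE = {
    "raw":        True,   # plain namespace, UEFI-defined labels
    "sector":     True,   # BTT layout, also specified in UEFI
    "fsdax":      False,  # Linux-private info block
    "devdax":     False,  # Linux-private info block
    "label-less": False,  # no labels; Linux 1:1 region mapping
}

def portable(mode):
    # True if another OS (or pre-OS firmware) can be expected to
    # interpret a namespace created in this mode.
    return INTER_OS_COMPATIBLE[mode]

assert portable("sector")
assert not portable("fsdax")
```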
Re: [Xen-devel] Draft NVDIMM proposal
On 05/11/2018 05:33 PM, Dan Williams wrote:
> [ adding linux-nvdimm ]
>
> Great write up! Some comments below...

Thanks for the quick response!

It seems I still have some fundamental misconceptions about what's going on, so I'd better start with that. :-)

Here's the part that I'm having a hard time getting.

If actual data on the NVDIMMs is a noun, and the act of writing is a verb, then the SPA and interleave sets are adverbs: they define *how* the write happens. When the processor says, "Write to address X", the memory controller converts address X into a (DIMM, DPA) tuple to actually write the data.

So, who decides what this SPA range and interleave set is? Can the operating system change these interleave sets and mappings, or change data from PMEM to BLK, and if so, how?

If you read through section 13.19 of the UEFI manual, it seems to imply that this is determined by the label area -- that each DIMM has a separate label area describing regions local to that DIMM; and that if you have 4 DIMMs you'll have 4 label areas, and each label area will have a label describing the DPA region on that DIMM which corresponds to the interleave set. And somehow someone sets up the interleave sets and SPA based on what's written there.

Which would mean that an operating system could change how the interleave sets work by rewriting the various labels on the DIMMs; for instance, changing a single 4-way set spanning the entirety of 4 DIMMs, to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning half of 2 DIMMs each.

But then you say:

> Unlike NVMe, an NVDIMM itself has no concept of namespaces. Some DIMMs provide a "label area" which is an out-of-band non-volatile memory area where the OS can store whatever it likes. The UEFI 2.7 specification defines a data format for the definition of namespaces on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure.

OK, so that sounds like no, that's not what happens. So where do the SPA range and interleave sets come from?

Random guess: The BIOS / firmware makes it up. Either it's hard-coded, or there's some menu in the BIOS you can use to change things around; but once it hits the operating system, that's it -- the mapping of SPA range onto interleave sets onto DIMMs is, from the operating system's point of view, fixed.

And so (here's another guess) -- when you're talking about namespaces and label areas, you're talking about namespaces stored *within a pre-existing SPA range*. You use the same format as described in the UEFI spec, but ignore all the stuff about interleave sets and whatever, and use system physical addresses relative to the SPA range rather than DPAs.

Is that right?

But then there's things like this:

> There is no obligation for an NVDIMM to provide a label area, and as far as I know all NVDIMMs on the market today do not provide a label area.
[snip]
> Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.

So are "all NVDIMMs on the market today" then classed as "legacy" NVDIMMs because they don't support labels? And if labels are simply the NVDIMM equivalent of a partition table, then what does it mean to "support" or "not support" labels?

And then there's this:

> In any event we do the DIMM to SPA association first before reading labels. The OS calculates a so-called "Interleave Set Cookie" from the NFIT information to compare against a similar value stored in the labels. This lets the OS determine that the Interleave Set composition has not changed from when the labels were initially written. An Interleave Set Cookie mismatch indicates the labels are stale, corrupted, or that the physical composition of the Interleave Set has changed.

So wait, the SPA and interleave sets can actually change? And the labels which the OS reads actually are per-DIMM, and do control somehow how the DPA ranges of individual DIMMs are mapped into interleave sets and exposed as SPAs? (And perhaps, can be changed by the operating system?)

And:

> There are checksums in the Namespace definition to account for label validity. Starting with ACPI 6.2, DSMs for labels are deprecated in favor of the new / named methods for label access: _LSI, _LSR, and _LSW.

Does this mean the methods will use checksums to verify writes to the label area, and refuse writes which create invalid labels?

If all of the above is true, then in what way can it be said that "NVDIMM has no concept of namespaces", that an OS can "store whatever it likes" in the label area, and that UEFI namespaces are "on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure"?

I'm sorry if this is obvious, but I am exactly as confused as I was before I started writing this. :-)

This is all pretty foundational. Xen can read static ACPI tables, but it can't do AML. So to do a
Re: [Xen-devel] Draft NVDIMM proposal
> On May 15, 2018, at 1:26 PM, Jan Beulich wrote:
>>> On 15.05.18 at 12:12, wrote:
>>>> On May 15, 2018, at 11:05 AM, Roger Pau Monne wrote:
>>>>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>>>>>> [ adding linux-nvdimm ]
>>>>>>
>>>>>> Great write up! Some comments below...
>>>>>>
>>>>>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap wrote:
>>>>>>> To use a namespace, an operating system needs at a minimum two pieces of information: The UUID and/or Name of the namespace, and the SPA range where that namespace is mapped; and ideally also the Type and Abstraction Type to know how to interpret the data inside.
>>>>>>
>>>>>> Not necessarily, no. Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.
>>>>>
>>>>> In that case, how does Linux know which area of the NVDIMM it should use to store the page structures?
>>>>
>>>> The answer to that is right here:
>>>>
>>>> `fsdax` and `devdax` mode are both designed to make it possible for user processes to have direct mapping of NVRAM. As such, both are only suitable for PMEM namespaces (?). Both also need to have kernel page structures allocated for each page of NVRAM; this amounts to 64 bytes for every 4k of NVRAM. Memory for these page structures can either be allocated out of normal "system" memory, or inside the PMEM namespace itself.
>>>>
>>>> In both cases, an "info block", very similar to the BTT info block, is written to the beginning of the namespace when created. This info block specifies whether the page structures come from system memory or from the namespace itself. If from the namespace itself, it contains information about what parts of the namespace have been set aside for Linux to use for this purpose.
>>>>
>>>> That is, each fsdax / devdax namespace has a superblock that, in part, defines what parts are used for Linux and what parts are used for data. Or to put it a different way: Linux decides which parts of a namespace to use for page structures, and writes it down in the metadata starting in the first page of the namespace.
>
> And that metadata layout is agreed upon between all OS vendors?
>
>>>> Linux has also defined "Type GUIDs" for these two types of namespace to be stored in the namespace label, although these are not yet in the ACPI spec.
>>>>>
>>>>>> They never will be. One of the motivations for GUIDs is that an OS can define private ones without needing to go back and standardize them. Only GUIDs that are needed for inter-OS / pre-OS compatibility would need to be defined in ACPI, and there is no expectation that other OSes understand Linux's format for reserving page structure space.
>>>>>
>>>>> Maybe it would be helpful to somehow mark those areas as "non-persistent" storage, so that other OSes know they can use this space for temporary data that doesn't need to survive across reboots?
>>>>
>>>> In theory there’s no reason another OS couldn’t learn Linux’s format, discover where the blocks were, and use those blocks for its own purposes while Linux wasn’t running.
>
> This looks to imply "no" to my question above, in which case I wonder how we would use (part of) the space when the "other" owner is e.g. Windows.

So in classic DOS partition tables, you have partition types; and various operating systems just sort of “claimed” numbers for themselves (e.g., NTFS, Linux Swap, and so on). But the DOS partition table number space is actually quite small.

So in namespaces, you have a similar concept, except that it’s called a “type GUID”, and it’s massively long — long enough that anyone who wants to make a new type can simply generate one randomly and be pretty confident that nobody else is using that one.

So if the labels contain a TGUID you understand, you use it, just like you would a partition that you understand. If it contains GUIDs you don’t understand, you’d better leave it alone.
-George
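George's DOS-partition analogy is easy to demonstrate: a 128-bit type GUID generated at random is effectively collision-free, unlike a one-byte DOS partition type. A small sketch (the GUIDs here are made up on the spot, not real Linux type GUIDs):

```python
import uuid

# A random version-4 UUID has 122 random bits, so two OS vendors
# independently generating private type GUIDs will, in practice,
# never collide -- unlike the 256-value DOS partition type byte.
linux_private_tguid = uuid.uuid4()

def handle_namespace(label_tguid, understood):
    # Use namespaces whose type GUID you recognize;
    # leave the rest strictly alone.
    return "use" if label_tguid in understood else "leave alone"

understood = {linux_private_tguid}
assert handle_namespace(linux_private_tguid, understood) == "use"
assert handle_namespace(uuid.uuid4(), understood) == "leave alone"
```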
Re: [Xen-devel] Draft NVDIMM proposal
>>> On 15.05.18 at 12:12, wrote:
>> On May 15, 2018, at 11:05 AM, Roger Pau Monne wrote:
>>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>>>> [ adding linux-nvdimm ]
>>>>
>>>> Great write up! Some comments below...
>>>>
>>>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap wrote:
>>>>> To use a namespace, an operating system needs at a minimum two pieces of information: The UUID and/or Name of the namespace, and the SPA range where that namespace is mapped; and ideally also the Type and Abstraction Type to know how to interpret the data inside.
>>>>
>>>> Not necessarily, no. Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.
>>>
>>> In that case, how does Linux know which area of the NVDIMM it should use to store the page structures?
>>
>> The answer to that is right here:
>>
>> `fsdax` and `devdax` mode are both designed to make it possible for user processes to have direct mapping of NVRAM. As such, both are only suitable for PMEM namespaces (?). Both also need to have kernel page structures allocated for each page of NVRAM; this amounts to 64 bytes for every 4k of NVRAM. Memory for these page structures can either be allocated out of normal "system" memory, or inside the PMEM namespace itself.
>>
>> In both cases, an "info block", very similar to the BTT info block, is written to the beginning of the namespace when created. This info block specifies whether the page structures come from system memory or from the namespace itself. If from the namespace itself, it contains information about what parts of the namespace have been set aside for Linux to use for this purpose.
>>
>> That is, each fsdax / devdax namespace has a superblock that, in part, defines what parts are used for Linux and what parts are used for data. Or to put it a different way: Linux decides which parts of a namespace to use for page structures, and writes it down in the metadata starting in the first page of the namespace.

And that metadata layout is agreed upon between all OS vendors?

>> Linux has also defined "Type GUIDs" for these two types of namespace to be stored in the namespace label, although these are not yet in the ACPI spec.
>>>>
>>>> They never will be. One of the motivations for GUIDs is that an OS can define private ones without needing to go back and standardize them. Only GUIDs that are needed for inter-OS / pre-OS compatibility would need to be defined in ACPI, and there is no expectation that other OSes understand Linux's format for reserving page structure space.
>>>
>>> Maybe it would be helpful to somehow mark those areas as "non-persistent" storage, so that other OSes know they can use this space for temporary data that doesn't need to survive across reboots?
>>
>> In theory there’s no reason another OS couldn’t learn Linux’s format, discover where the blocks were, and use those blocks for its own purposes while Linux wasn’t running.

This looks to imply "no" to my question above, in which case I wonder how we would use (part of) the space when the "other" owner is e.g. Windows.

Jan
Re: [Xen-devel] Draft NVDIMM proposal
> On May 15, 2018, at 11:05 AM, Roger Pau Monne wrote:
>
> Just some replies/questions to some of the points raised below.
>
> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>> [ adding linux-nvdimm ]
>>
>> Great write up! Some comments below...
>>
>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap wrote:
>>> To use a namespace, an operating system needs at a minimum two pieces of information: The UUID and/or Name of the namespace, and the SPA range where that namespace is mapped; and ideally also the Type and Abstraction Type to know how to interpret the data inside.
>>
>> Not necessarily, no. Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.
>
> In that case, how does Linux know which area of the NVDIMM it should use to store the page structures?

The answer to that is right here:

`fsdax` and `devdax` mode are both designed to make it possible for user processes to have direct mapping of NVRAM. As such, both are only suitable for PMEM namespaces (?). Both also need to have kernel page structures allocated for each page of NVRAM; this amounts to 64 bytes for every 4k of NVRAM. Memory for these page structures can either be allocated out of normal "system" memory, or inside the PMEM namespace itself.

In both cases, an "info block", very similar to the BTT info block, is written to the beginning of the namespace when created. This info block specifies whether the page structures come from system memory or from the namespace itself. If from the namespace itself, it contains information about what parts of the namespace have been set aside for Linux to use for this purpose.

That is, each fsdax / devdax namespace has a superblock that, in part, defines what parts are used for Linux and what parts are used for data. Or to put it a different way: Linux decides which parts of a namespace to use for page structures, and writes it down in the metadata starting in the first page of the namespace.

Linux has also defined "Type GUIDs" for these two types of namespace to be stored in the namespace label, although these are not yet in the ACPI spec.

>> They never will be. One of the motivations for GUIDs is that an OS can define private ones without needing to go back and standardize them. Only GUIDs that are needed for inter-OS / pre-OS compatibility would need to be defined in ACPI, and there is no expectation that other OSes understand Linux's format for reserving page structure space.
>
> Maybe it would be helpful to somehow mark those areas as "non-persistent" storage, so that other OSes know they can use this space for temporary data that doesn't need to survive across reboots?

In theory there’s no reason another OS couldn’t learn Linux’s format, discover where the blocks were, and use those blocks for its own purposes while Linux wasn’t running. But that won’t help Xen, as we want to use those blocks while Linux *is* running.

-George
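The 64-bytes-per-4k figure quoted in this exchange works out to 1/64 of a namespace's capacity, which makes the "where do the page structures live" question concrete at NVDIMM sizes. The constants come straight from the thread; the function name is invented:

```python
PAGE_SIZE = 4096       # bytes of NVRAM described by one page structure
STRUCT_PAGE = 64       # bytes per kernel page structure, per the thread

def page_struct_bytes(nvram_bytes):
    # Total metadata needed to cover a namespace of the given size.
    return (nvram_bytes // PAGE_SIZE) * STRUCT_PAGE

TiB, GiB = 1 << 40, 1 << 30
# 64/4096 = 1/64 of capacity: a 1 TiB namespace needs 16 GiB of page
# structures -- large enough that allocating them inside the namespace
# itself, rather than in system RAM, becomes attractive.
assert page_struct_bytes(TiB) == 16 * GiB
```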
Re: [Xen-devel] Draft NVDIMM proposal
Just some replies/questions to some of the points raised below.

On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
> [ adding linux-nvdimm ]
>
> Great write up! Some comments below...
>
> On Wed, May 9, 2018 at 10:35 AM, George Dunlap wrote:
> >> To use a namespace, an operating system needs at a minimum two pieces of information: The UUID and/or Name of the namespace, and the SPA range where that namespace is mapped; and ideally also the Type and Abstraction Type to know how to interpret the data inside.
>
> Not necessarily, no. Linux supports "label-less" mode where it exposes the raw capacity of a region in a 1:1 mapped namespace without a label. This is how Linux supports "legacy" NVDIMMs that do not support labels.

In that case, how does Linux know which area of the NVDIMM it should use to store the page structures?

> >> `fsdax` and `devdax` mode are both designed to make it possible for user processes to have direct mapping of NVRAM. As such, both are only suitable for PMEM namespaces (?). Both also need to have kernel page structures allocated for each page of NVRAM; this amounts to 64 bytes for every 4k of NVRAM. Memory for these page structures can either be allocated out of normal "system" memory, or inside the PMEM namespace itself.
> >>
> >> In both cases, an "info block", very similar to the BTT info block, is written to the beginning of the namespace when created. This info block specifies whether the page structures come from system memory or from the namespace itself. If from the namespace itself, it contains information about what parts of the namespace have been set aside for Linux to use for this purpose.
> >>
> >> Linux has also defined "Type GUIDs" for these two types of namespace to be stored in the namespace label, although these are not yet in the ACPI spec.
>
> They never will be.
> One of the motivations for GUIDs is that an OS can define private ones without needing to go back and standardize them. Only GUIDs that are needed for inter-OS / pre-OS compatibility would need to be defined in ACPI, and there is no expectation that other OSes understand Linux's format for reserving page structure space.

Maybe it would be helpful to somehow mark those areas as "non-persistent" storage, so that other OSes know they can use this space for temporary data that doesn't need to survive across reboots?

> >> # Proposed design / roadmap
> >>
> >> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables and the DSM methods; mappings are treated by Xen during this phase as MMIO.
> >>
> >> Once dom0 is ready to pass parts of a namespace through to a guest, it makes a hypercall to tell Xen about the namespace. It includes any regions of the namespace which Xen may use for 'scratch'; it also includes a flag to indicate whether this 'scratch' space may be used for frame tables from other namespaces.
> >>
> >> Frame tables are then created for this SPA range. They will be allocated from, in this order: 1) designated 'scratch' range from within this namespace 2) designated 'scratch' range from other namespaces which has been marked as sharable 3) system RAM.
> >>
> >> Xen will either verify that dom0 has no existing mappings, or promote the mappings to full pages (taking appropriate reference counts for mappings). Dom0 must ensure that this namespace is not unmapped, modified, or relocated until it asks Xen to unmap it.
> >>
> >> For Xen frame tables, to begin with, set aside a partition inside a namespace to be used by Xen. Pass this in to Xen when activating the namespace; this could be either 2a or 3a from "Page structure allocation". After that, we could decide which of the two more streamlined approaches (2b or 3b) to pursue.
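The three-step allocation order quoted above can be sketched as follows (a minimal illustration; the data structures and names are hypothetical, not part of the proposal or of Xen):

```python
# Sketch of the proposed frame-table allocation order:
#   1) 'scratch' range within the namespace being registered,
#   2) 'scratch' range another namespace has marked sharable,
#   3) ordinary system RAM.
def frame_table_source(namespace, other_namespaces):
    """Return which pool frame tables for 'namespace' would come from."""
    if namespace.get("scratch"):
        return ("own-scratch", namespace["scratch"])
    for other in other_namespaces:
        if other.get("scratch") and other.get("sharable"):
            return ("shared-scratch", other["scratch"])
    return ("system-ram", None)

# A namespace with no scratch of its own falls back to a sharable
# scratch range donated by another namespace, then to system RAM.
ns = {"name": "pmem0", "scratch": None}
others = [{"name": "pmem1", "scratch": (0x1000, 0x8000), "sharable": True}]
print(frame_table_source(ns, others)[0])
```

The point of the ordering is to avoid consuming system RAM for NVDIMM metadata whenever persistent scratch space is available.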
> >> At this point, dom0 can pass parts of the mapped namespace into guests. Unfortunately, passing files on an fsdax filesystem is probably not safe; but we can pass in full dev-dax or fsdax partitions.
> >>
> >> From a guest perspective, I propose we provide static NFIT only, no access to labels to begin with. This can be generated in hvmloader and/or the toolstack acpi code.
>
> I'm ignorant of Xen internals, but can you not reuse the existing QEMU emulation for labels and NFIT?

We only use QEMU for HVM guests, which would still leave PVH guests without NVDIMM support. Ideally we would like to use the same solution for both HVM and PVH, which means QEMU cannot be part of that solution.

Thanks, Roger.
Re: [Xen-devel] Draft NVDIMM proposal
[ adding linux-nvdimm ]

Great write up! Some comments below...

On Wed, May 9, 2018 at 10:35 AM, George Dunlap wrote:
> Dan,
>
> I understand that you're the NVDIMM maintainer for Linux. I've been working with your colleagues to try to sort out an architecture to allow NVRAM to be passed to guests under the Xen hypervisor.
>
> If you have time, I'd appreciate it if you could skim through at least the first section of the document below ("NVDIMM Overview"), concerning NVDIMM devices and Linux, to see if I've made any mistakes.
>
> If you're up for it, additional early feedback on the proposed Xen architecture, from a Linux perspective, would be awesome as well.
>
> Thanks,
> -George
>
> On 05/09/2018 06:29 PM, George Dunlap wrote:
>> Below is an initial draft of an NVDIMM proposal. I'll submit a patch to include it in the tree at some point, but I thought for initial discussion it would be easier if it were copied in-line.
>>
>> I've done a fair amount of investigation, but it's quite likely I've made mistakes. Please send me corrections where necessary.
>>
>> -George
>>
>> ---
>> % NVDIMMs and Xen
>> % George Dunlap
>> % Revision 0.1
>>
>> # NVDIMM overview
>>
>> It's very difficult, from the various specs, to actually get a complete enough picture of what's going on to make a good design. This section is meant as an overview of the current hardware, firmware, and Linux interfaces sufficient to inform a discussion of the issues in designing a Xen interface for NVDIMMs.
>>
>> ## DIMMs, Namespaces, and access methods
>>
>> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of memory on a DIMM are specified by a _DIMM physical address_ or DPA. Each DIMM is attached to an NVDIMM controller.
>>
>> Memory on the DIMMs is divided up into _namespaces_.
>> The word "namespace" is rather misleading though; a namespace in this context is not actually a space of names (contrast, for example, "C++ namespaces"); rather, it's more like a SCSI LUN, or a volume, or a partition on a drive: a set of data which is meant to be viewed and accessed as a unit. (The name was apparently carried over from NVMe devices, which were precursors of the NVDIMM spec.)

Unlike NVMe, an NVDIMM itself has no concept of namespaces. Some DIMMs provide a "label area", which is an out-of-band non-volatile memory area where the OS can store whatever it likes. The UEFI 2.7 specification defines a data format for the definition of namespaces on top of persistent memory ranges advertised to the OS via the ACPI NFIT structure.

There is no obligation for an NVDIMM to provide a label area, and as far as I know, NVDIMMs on the market today do not provide one. That said, QEMU has the ability to associate a virtual label area with its virtual NVDIMM representation.

>> The NVDIMM controller allows two ways to access the DIMM. One is mapped 1-1 in _system physical address space_ (SPA), much like normal RAM. This method of access is called _PMEM_. The other method is similar to that of a PCI device: you have a control and status register which controls an 8k aperture window into the DIMM. This method of access is called _PBLK_.
>>
>> In the case of PMEM, as in the case of DRAM, addresses from the SPA are interleaved across a set of DIMMs (an _interleave set_) for performance reasons. A specific PMEM namespace will be a single contiguous DPA range across all DIMMs in its interleave set. For example, you might have a namespace for DPAs `0-0x5000` on DIMMs 0 and 1; and another namespace for DPAs `0x8000-0xa000` on DIMMs 0, 1, 2, and 3.
>>
>> In the case of PBLK, a namespace always resides on a single DIMM. However, that namespace can be made up of multiple discontiguous chunks of space on that DIMM.
>> For instance, in our example above, we might have a namespace on DIMM 0 consisting of DPAs `0x5000-0x6000`, `0x8000-0x9000`, and `0xa000-0xf000`.
>>
>> The interleaving of PMEM has implications for the speed and reliability of the namespace: Much like RAID 0, it maximizes speed, but it means that if any one DIMM fails, the data from the entire namespace is corrupted. PBLK makes it slightly less straightforward to access, but it allows OS software to apply RAID-like logic to balance redundancy and speed.
>>
>> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM; for large systems without 5-level paging, this is actually becoming a limitation. Using PBLK allows existing 4-level paged systems to access an arbitrary amount of NVDIMM.
>>
>> ## Namespaces, labels, and the label area
>>
>> A namespace is a mapping from the SPA and MMIO space into the DIMM.
>>
>> The firmware and/or operating system can talk to the NVDIMM controller to set up mappings from
Re: [Xen-devel] Draft NVDIMM proposal
Dan,

I understand that you're the NVDIMM maintainer for Linux. I've been working with your colleagues to try to sort out an architecture to allow NVRAM to be passed to guests under the Xen hypervisor.

If you have time, I'd appreciate it if you could skim through at least the first section of the document below ("NVDIMM Overview"), concerning NVDIMM devices and Linux, to see if I've made any mistakes.

If you're up for it, additional early feedback on the proposed Xen architecture, from a Linux perspective, would be awesome as well.

Thanks,
 -George

On 05/09/2018 06:29 PM, George Dunlap wrote:
> Below is an initial draft of an NVDIMM proposal. I'll submit a patch to include it in the tree at some point, but I thought for initial discussion it would be easier if it were copied in-line.
>
> I've done a fair amount of investigation, but it's quite likely I've made mistakes. Please send me corrections where necessary.
>
> -George
>
> ---
> % NVDIMMs and Xen
> % George Dunlap
> % Revision 0.1
>
> # NVDIMM overview
>
> It's very difficult, from the various specs, to actually get a complete enough picture of what's going on to make a good design. This section is meant as an overview of the current hardware, firmware, and Linux interfaces sufficient to inform a discussion of the issues in designing a Xen interface for NVDIMMs.
>
> ## DIMMs, Namespaces, and access methods
>
> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form factor) that contains _non-volatile RAM_ (NVRAM). Individual bytes of memory on a DIMM are specified by a _DIMM physical address_ or DPA. Each DIMM is attached to an NVDIMM controller.
>
> Memory on the DIMMs is divided up into _namespaces_.
> The word "namespace" is rather misleading though; a namespace in this context is not actually a space of names (contrast, for example, "C++ namespaces"); rather, it's more like a SCSI LUN, or a volume, or a partition on a drive: a set of data which is meant to be viewed and accessed as a unit. (The name was apparently carried over from NVMe devices, which were precursors of the NVDIMM spec.)
>
> The NVDIMM controller allows two ways to access the DIMM. One is mapped 1-1 in _system physical address space_ (SPA), much like normal RAM. This method of access is called _PMEM_. The other method is similar to that of a PCI device: you have a control and status register which controls an 8k aperture window into the DIMM. This method of access is called _PBLK_.
>
> In the case of PMEM, as in the case of DRAM, addresses from the SPA are interleaved across a set of DIMMs (an _interleave set_) for performance reasons. A specific PMEM namespace will be a single contiguous DPA range across all DIMMs in its interleave set. For example, you might have a namespace for DPAs `0-0x5000` on DIMMs 0 and 1; and another namespace for DPAs `0x8000-0xa000` on DIMMs 0, 1, 2, and 3.
>
> In the case of PBLK, a namespace always resides on a single DIMM. However, that namespace can be made up of multiple discontiguous chunks of space on that DIMM. For instance, in our example above, we might have a namespace on DIMM 0 consisting of DPAs `0x5000-0x6000`, `0x8000-0x9000`, and `0xa000-0xf000`.
>
> The interleaving of PMEM has implications for the speed and reliability of the namespace: Much like RAID 0, it maximizes speed, but it means that if any one DIMM fails, the data from the entire namespace is corrupted. PBLK makes it slightly less straightforward to access, but it allows OS software to apply RAID-like logic to balance redundancy and speed.
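The RAID-0-style striping described above can be sketched as a simple address mapping (illustrative only: the 256-byte line size is an assumption for the example; real platforms define the interleave granularity in the NFIT interleave tables):

```python
# Sketch of PMEM interleaving: consecutive "lines" of the namespace's
# SPA range rotate across the DIMMs of the interleave set, RAID-0 style.
LINE = 256  # assumed interleave granularity, for illustration

def spa_to_dimm_dpa(spa_offset: int, ndimms: int, dpa_base: int):
    """Map an offset into a PMEM namespace to (DIMM index, DPA)."""
    line, byte = divmod(spa_offset, LINE)
    dimm = line % ndimms                        # which DIMM holds this line
    dpa = dpa_base + (line // ndimms) * LINE + byte
    return dimm, dpa

# Two-DIMM interleave set: line 0 -> DIMM 0, line 1 -> DIMM 1, line 2 -> DIMM 0...
print(spa_to_dimm_dpa(0, 2, 0))     # (0, 0)
print(spa_to_dimm_dpa(256, 2, 0))   # (1, 0)
print(spa_to_dimm_dpa(512, 2, 0))   # (0, 256)
```

This also makes the reliability point above visible: every LINE-sized stripe of the namespace touches each DIMM in turn, so losing any one DIMM punches holes throughout the entire namespace.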
>
> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM; for large systems without 5-level paging, this is actually becoming a limitation. Using PBLK allows existing 4-level paged systems to access an arbitrary amount of NVDIMM.
>
> ## Namespaces, labels, and the label area
>
> A namespace is a mapping from the SPA and MMIO space into the DIMM.
>
> The firmware and/or operating system can talk to the NVDIMM controller to set up mappings from SPA and MMIO space into the DIMM. Because the memory and PCI devices are separate, it would be possible for buggy firmware or NVDIMM controller drivers to misconfigure things such that the same DPA is exposed in multiple places; if so, the results are undefined.
>
> Namespaces are constructed out of "labels". Each DIMM has a Label Storage Area, which is persistent but logically separate from the device-addressable areas on the DIMM. A label on a DIMM describes a single contiguous region of DPA on that DIMM. A PMEM namespace is made up of one label from each of the DIMMs which make up its interleave set; a PBLK namespace is made up of one label for each chunk of range.
>
> In our examples above, the first PMEM namespace would be made of two labels (one on DIMM 0 and one on DIMM 1, each describing DPA