Re: [Xen-devel] Draft NVDIMM proposal

2018-05-22 Thread Dan Williams
On Thu, May 17, 2018 at 7:52 AM, George Dunlap  wrote:
> On 05/15/2018 07:06 PM, Dan Williams wrote:
>> On Tue, May 15, 2018 at 7:19 AM, George Dunlap  
>> wrote:
>>> So, who decides what this SPA range and interleave set is?  Can the
>>> operating system change these interleave sets and mappings, or change
>>> data from PMEM to BLK, and if so, how?
>>
>> The interleave-set to SPA range association and delineation of
>> capacity between PMEM and BLK access modes is currently out of scope for
>> ACPI. The BIOS reports the configuration to the OS via the NFIT, but
>> the configuration is currently written by vendor specific tooling.
>> Longer term it would be great for this mechanism to become
>> standardized and available to the OS, but for now it requires platform
>> specific tooling to change the DIMM interleave configuration.
>
> OK -- I was sort of assuming that different hardware would have
> different drivers in Linux that ndctl knew how to drive (just like any
> other hardware with vendor-specific interfaces);

That way potentially lies madness, at least for me as a Linux
sub-system maintainer. There is no value for the kernel to help enable
vendors to do the same thing in slightly different ways. libnvdimm +
nfit is 100% an open standards driver and the hope is to be able to
deprecate non-public vendor-specific support over time, and
consolidate work-alike support from vendor specs into ACPI. The public
standards that the kernel enables are:

http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
http://www.uefi.org/sites/default/files/resources/UEFI%20Spec%202_7_A%20Sept%206.pdf
http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/
https://msdn.microsoft.com/library/windows/hardware/mt604741

> but it sounds a bit
> more like at the moment it's binary blobs either in the BIOS/firmware,
> or a vendor-supplied tool.

Only for the functionality, like interleave set configuration, that is
not defined in those standards. Even then the impact is only userspace
tooling, not the kernel. Also, we are seeing that functionality bleed
into the standards over time. For example, label methods used to only
exist in the Intel DSM document, but have now been standardized in
ACPI 6.2. Firmware update, which was a private interface, has now
graduated to the public Intel DSM document. Hopefully more and more
functionality transitions into an ACPI definition over time. Any
common functionality in those Intel, HPE, and MSFT command formats is
comprehended / abstracted by the ndctl tool.
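
As a rough illustration of what that abstraction looks like from the
userspace side, here is a minimal enumeration sketch against libndctl (the
library behind ndctl). The API names are recalled from memory, so check
them against <ndctl/libndctl.h> before relying on this:

```c
/* Sketch: walk NVDIMM buses, DIMMs, and regions via libndctl.
 * Function names are from memory of the libndctl API; verify against
 * <ndctl/libndctl.h>.  Build with: cc walk.c -lndctl
 */
#include <stdio.h>
#include <ndctl/libndctl.h>

int main(void)
{
    struct ndctl_ctx *ctx;
    struct ndctl_bus *bus;
    struct ndctl_dimm *dimm;
    struct ndctl_region *region;

    if (ndctl_new(&ctx) < 0)
        return 1;

    ndctl_bus_foreach(ctx, bus) {
        printf("bus %s (provider %s)\n",
               ndctl_bus_get_devname(bus), ndctl_bus_get_provider(bus));
        ndctl_dimm_foreach(bus, dimm)
            printf("  dimm %s\n", ndctl_dimm_get_devname(dimm));
        ndctl_region_foreach(bus, region)
            printf("  region %s, size %llu bytes\n",
                   ndctl_region_get_devname(region),
                   (unsigned long long)ndctl_region_get_size(region));
    }

    ndctl_unref(ctx);
    return 0;
}
```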

>
>>> And so (here's another guess) -- when you're talking about namespaces
>>> and label areas, you're talking about namespaces stored *within a
>>> pre-existing SPA range*.  You use the same format as described in the
>>> UEFI spec, but ignore all the stuff about interleave sets and whatever,
>>> and use system physical addresses relative to the SPA range rather than
>>> DPAs.
>>
>> Well, we don't ignore it because we need to validate in the driver
>> that the interleave set configuration matches a checksum that we
>> generated when the namespace was first instantiated on the interleave
>> set. However, you are right, for accesses at run time all we care
>> about is the SPA for PMEM accesses.
> [snip]
>> They can change, but only under the control of the BIOS. All changes
>> to the interleave set configuration need a reboot because the memory
>> controller needs to be set up differently at system-init time.
> [snip]
>> No, the checksum I'm referring to is the interleave set cookie (see:
>> "SetCookie" in the UEFI 2.7 specification). It validates that the
>> interleave set backing the SPA has not changed configuration since the
>> last boot.
> [snip]
>> The NVDIMM just provides storage area for the OS to write opaque data
>> that just happens to conform to the UEFI Namespace label format. The
>> interleave-set configuration is stored in yet another out-of-band
>> location on the DIMM or on some platform-specific storage location and
>> is consulted / restored by the BIOS each boot. The NFIT is the output
>> from the platform specific physical mappings of the DIMMs, and
>> Namespaces are logical volumes built on top of those hard-defined NFIT
>> boundaries.
>
> OK, so what I'm hearing is:
>
> The label area isn't "within a pre-existing SPA range" as I was guessing
> (i.e., similar to a partition table residing within a disk); it is the
> per-DIMM label area as described by UEFI spec.
>
> But, the interleave set data in the label area doesn't *control* the
> hardware -- the NVDIMM controller / bios / firmware don't read it or do
> anything based on what's in it.  Rather, the interleave set data in the
> label area is there to *record*, for the operating system's benefit,
> what the hardware configuration was when the labels were created, so
> that if it changes, the OS knows that the label area is invalid; it must
> either refrain from touching the NVRAM (if it wants to preserve the
> data), or write a new label area.

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-17 Thread George Dunlap
On 05/15/2018 07:06 PM, Dan Williams wrote:
> On Tue, May 15, 2018 at 7:19 AM, George Dunlap  
> wrote:
>> So, who decides what this SPA range and interleave set is?  Can the
>> operating system change these interleave sets and mappings, or change
>> data from PMEM to BLK, and if so, how?
> 
> The interleave-set to SPA range association and delineation of
> capacity between PMEM and BLK access modes is currently out of scope for
> ACPI. The BIOS reports the configuration to the OS via the NFIT, but
> the configuration is currently written by vendor specific tooling.
> Longer term it would be great for this mechanism to become
> standardized and available to the OS, but for now it requires platform
> specific tooling to change the DIMM interleave configuration.

OK -- I was sort of assuming that different hardware would have
different drivers in Linux that ndctl knew how to drive (just like any
other hardware with vendor-specific interfaces); but it sounds a bit
more like at the moment it's binary blobs either in the BIOS/firmware,
or a vendor-supplied tool.

>> And so (here's another guess) -- when you're talking about namespaces
>> and label areas, you're talking about namespaces stored *within a
>> pre-existing SPA range*.  You use the same format as described in the
>> UEFI spec, but ignore all the stuff about interleave sets and whatever,
>> and use system physical addresses relative to the SPA range rather than
>> DPAs.
> 
> Well, we don't ignore it because we need to validate in the driver
> that the interleave set configuration matches a checksum that we
> generated when the namespace was first instantiated on the interleave
> set. However, you are right, for accesses at run time all we care
> about is the SPA for PMEM accesses.
[snip]
> They can change, but only under the control of the BIOS. All changes
> to the interleave set configuration need a reboot because the memory
> controller needs to be set up differently at system-init time.
[snip]
> No, the checksum I'm referring to is the interleave set cookie (see:
> "SetCookie" in the UEFI 2.7 specification). It validates that the
> interleave set backing the SPA has not changed configuration since the
> last boot.
[snip]
> The NVDIMM just provides storage area for the OS to write opaque data
> that just happens to conform to the UEFI Namespace label format. The
> interleave-set configuration is stored in yet another out-of-band
> location on the DIMM or on some platform-specific storage location and
> is consulted / restored by the BIOS each boot. The NFIT is the output
> from the platform specific physical mappings of the DIMMs, and
> Namespaces are logical volumes built on top of those hard-defined NFIT
> boundaries.

OK, so what I'm hearing is:

The label area isn't "within a pre-existing SPA range" as I was guessing
(i.e., similar to a partition table residing within a disk); it is the
per-DIMM label area as described by UEFI spec.

But, the interleave set data in the label area doesn't *control* the
hardware -- the NVDIMM controller / bios / firmware don't read it or do
anything based on what's in it.  Rather, the interleave set data in the
label area is there to *record*, for the operating system's benefit,
what the hardware configuration was when the labels were created, so
that if it changes, the OS knows that the label area is invalid; it must
either refrain from touching the NVRAM (if it wants to preserve the
data), or write a new label area.
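
A rough sketch of that validity check; the helper that derives the cookie
from the NFIT is an invented placeholder, not the actual Linux or UEFI code:

```c
/* Sketch of interleave-set cookie validation.  The cookie computation is
 * a placeholder; the real one is derived from the NFIT-described
 * interleave set (DIMM identifiers, region offsets, ...) per UEFI 2.7.
 */
#include <stdbool.h>
#include <stdint.h>

struct label_sketch {
    uint64_t isetcookie;    /* "SetCookie" recorded when the label was written */
    /* ... uuid, name, dpa, rawsize, etc. ... */
};

/* Placeholder: recompute the cookie from this boot's NFIT description
 * of the interleave set backing the region. */
static uint64_t cookie_from_nfit(unsigned region_id)
{
    (void)region_id;
    return 0;   /* real code hashes NFIT interleave information */
}

static bool labels_still_valid(unsigned region_id,
                               const struct label_sketch *label)
{
    /* A mismatch means the interleave set changed (or the label is
     * stale or corrupt): treat the label area as invalid. */
    return cookie_from_nfit(region_id) == label->isetcookie;
}
```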

The OS can also use labels to partition a single SPA range into several
namespaces.  It can't change the interleaving, but it can specify that
[0-A) is one namespace, [A-B) is another namespace,  and these
namespaces will naturally map into the SPA range advertised in the NFIT.

And if a controller allows the same memory to be used either as PMEM or
PBLK, the OS can record in the labels which mode *should* be used for
which range, and then avoid accessing the same underlying NVRAM in two
different ways (which would yield unpredictable results).

That makes sense.

>> If SPA regions don't change after boot, and if Xen can find its own
>> Xen-specific namespace to use for the frame tables by reading the NFIT
>> table, then that significantly reduces the amount of interaction it
>> needs with Linux.
>>
>> If SPA regions *can* change after boot, and if Xen must rely on Linux to
>> read labels and find out what it can safely use for frame tables, then
>> it makes things significantly more involved.  Not impossible by any
>> means, but a lot more complicated.
>>
>> Hope all that makes sense -- thanks again for your help.
> 
> I think it does, but it seems namespaces are out of reach for Xen
> without some agent / enabling that can execute the necessary AML
> methods.

Sure, we're pretty much used to that. :-)  We'll have Linux read the
label area and tell Xen what it needs to know.  But:

* Xen can know the SPA ranges of all potential NVDIMMs before dom0
starts.  So it can tell, for instance, if a page 

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread Andrew Cooper
On 15/05/18 19:06, Dan Williams wrote:
> On Tue, May 15, 2018 at 7:19 AM, George Dunlap  
> wrote:
>> On 05/11/2018 05:33 PM, Dan Williams wrote:
>>
>> This is all pretty foundational.  Xen can read static ACPI tables, but
>> it can't do AML.  So to do a proper design for Xen, we need to know:
> Oooh, ok, no AML in Xen...
>
>> 1. If Xen can find out, without Linux's help, what namespaces exist and
>> if there is one it can use for its own purposes
> Yeah, no, not without calling AML methods.

One particularly thorny issue with Xen's architecture is the ownership
of the ACPI OSPM, and the fact that there can only be one in the
system.  Dom0 has to be the OSPM in practice, as we don't want to port
most of the Linux drivers and infrastructure into the hypervisor.

If we knew a priori that certain AML methods had no side effects, then
we could in principle execute them from the hypervisor, but this is an
undecidable problem in general.  As a result, everything involving AML
requires dom0 to decipher the information and pass it to Xen at boot.
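
As a purely hypothetical sketch of what "passing it to Xen" could look like
(the structure and call below are invented for illustration; this is not an
existing Xen interface), dom0 might hand the hypervisor one record per
namespace it has decoded from the NFIT and labels:

```c
/* Hypothetical record dom0 could pass to Xen after doing the AML/label
 * parsing itself.  Not an existing Xen hypercall or ABI.
 */
#include <stdint.h>

struct nvdimm_range_report {
    uint64_t spa_base;      /* start of the namespace in system physical address space */
    uint64_t spa_len;       /* length in bytes */
    uint64_t scratch_off;   /* offset of a region Xen may use for frame tables */
    uint64_t scratch_len;   /* length of that scratch region (0 = none) */
    uint32_t flags;         /* e.g. whether the scratch may serve other namespaces */
};

/* dom0 would fill one of these in per namespace and issue a hypothetical
 * hypercall, e.g. xc_nvdimm_register(xch, &report);   (invented name) */
```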

~Andrew


Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread Dan Williams
On Tue, May 15, 2018 at 7:19 AM, George Dunlap  wrote:
> On 05/11/2018 05:33 PM, Dan Williams wrote:
>> [ adding linux-nvdimm ]
>>
>> Great write up! Some comments below...
>
> Thanks for the quick response!
>
> It seems I still have some fundamental misconceptions about what's going
> on, so I'd better start with that. :-)
>
> Here's the part that I'm having a hard time getting.
>
> If actual data on the NVDIMMs is a noun, and the act of writing is a
> verb, then the SPA and interleave sets are adverbs: they define *how*
> the write happens.  When the processor says, "Write to address X", the
> memory controller converts address X into a <DIMM, DIMM address> tuple to actually write the data.
>
> So, who decides what this SPA range and interleave set is?  Can the
> operating system change these interleave sets and mappings, or change
> data from PMEM to BLK, and if so, how?

The interleave-set to SPA range association and delineation of
capacity between PMEM and BLK access modes is currently out of scope for
ACPI. The BIOS reports the configuration to the OS via the NFIT, but
the configuration is currently written by vendor specific tooling.
Longer term it would be great for this mechanism to become
standardized and available to the OS, but for now it requires platform
specific tooling to change the DIMM interleave configuration.
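
For orientation, the NFIT entry that carries this report is the System
Physical Address (SPA) Range Structure. A sketch of its fields, recalled
from ACPI 6.2 (cf. Linux's struct acpi_nfit_system_address), so verify
against the spec before use:

```c
/* Rough sketch of the ACPI NFIT "SPA Range Structure" describing one
 * interleave set's window in system physical address space.  Field list
 * is from memory of ACPI 6.2; check the spec for the normative layout.
 */
#include <stdint.h>

struct nfit_spa_range_sketch {
    uint16_t type;                  /* 0 = SPA Range Structure */
    uint16_t length;                /* size of this sub-table */
    uint16_t range_index;           /* referenced by other NFIT sub-tables */
    uint16_t flags;
    uint32_t reserved;
    uint32_t proximity_domain;      /* NUMA node of the range */
    uint8_t  range_type_guid[16];   /* e.g. the "persistent memory region" GUID */
    uint64_t spa_base;              /* system physical address of the range */
    uint64_t spa_length;            /* length in bytes */
    uint64_t memory_attributes;     /* EFI memory mapping attributes */
};
```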

> If you read through section 13.19 of the UEFI manual, it seems to imply
> that this is determined by the label area -- that each DIMM has a
> separate label area describing regions local to that DIMM; and that if
> you have 4 DIMMs you'll have 4 label areas, and each label area will
> have a label describing the DPA region on that DIMM which corresponds to
> the interleave set.  And somehow someone sets up the interleave sets and
> SPA based on what's written there.
>
> Which would mean that an operating system could change how the
> interleave sets work by rewriting the various labels on the DIMMs; for
> instance, changing a single 4-way set spanning the entirety of 4 DIMMs,
> to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning
> half of 2 DIMMs each.

If a DIMM supports both the PMEM and BLK mechanisms for accessing the
same DPA, then the label provides the disambiguation and tells the OS to
enforce one access mechanism per DPA at a time. Otherwise the OS has
no ability to affect the interleave-set configuration; it's all
initialized by platform BIOS/firmware before the OS boots.

>
> But then you say:
>
>> Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs
>> provide a "label area" which is an out-of-band non-volatile memory
>> area where the OS can store whatever it likes. The UEFI 2.7
>> specification defines a data format for the definition of namespaces
>> on top of persistent memory ranges advertised to the OS via the ACPI
>> NFIT structure.
>
> OK, so that sounds like no, that's not what happens.  So where do the
> SPA range and interleave sets come from?
>
> Random guess: The BIOS / firmware makes it up.  Either it's hard-coded,
> or there's some menu in the BIOS you can use to change things around;
> but once it hits the operating system, that's it -- the mapping of SPA
> range onto interleave sets onto DIMMs is, from the operating system's
> point of view, fixed.

Correct.

> And so (here's another guess) -- when you're talking about namespaces
> and label areas, you're talking about namespaces stored *within a
> pre-existing SPA range*.  You use the same format as described in the
> UEFI spec, but ignore all the stuff about interleave sets and whatever,
> and use system physical addresses relative to the SPA range rather than
> DPAs.

Well, we don't ignore it because we need to validate in the driver
that the interleave set configuration matches a checksum that we
generated when the namespace was first instantiated on the interleave
set. However, you are right, for accesses at run time all we care
about is the SPA for PMEM accesses.

>
> Is that right?
>
> But then there's things like this:
>
>> There is no obligation for an NVDIMM to provide a label area, and as
>> far as I know all NVDIMMs on the market today do not provide a label
>> area.
> [snip]
>> Linux supports "label-less" mode where it exposes
>> the raw capacity of a region in 1:1 mapped namespace without a label.
>> This is how Linux supports "legacy" NVDIMMs that do not support
>> labels.
>
> So are "all NVDIMMs on the market today" then classed as "legacy"
> NVDIMMs because they don't support labels?  And if labels are simply the
> NVDIMM equivalent of a partition table, then what does it mean to
> "support" or "not support" labels?

Yes, the term "legacy" has been thrown around for NVDIMMs that do not
support labels. The way this support is determined is whether the
platform publishes the _LSI, _LSR, and _LSW methods in ACPI (see:
6.5.10 NVDIMM Label Methods in ACPI 6.2a). I.e. each DIMM is
represented by an ACPI device object, and we query those 

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread Dan Williams
On Tue, May 15, 2018 at 5:26 AM, Jan Beulich  wrote:
 On 15.05.18 at 12:12,  wrote:
[..]
>> That is, each fsdax / devdax namespace has a superblock that, in part,
>> defines what parts are used for Linux and what parts are used for data.  Or
>> to put it a different way: Linux decides which parts of a namespace to use
>> for page structures, and writes it down in the metadata starting in the first
>> page of the namespace.
>
> And that metadata layout is agreed upon between all OS vendors?

The only agreed upon metadata layouts across all OS vendors are the
ones that are specified in UEFI. We typically only need inter-OS and
UEFI compatibility for booting and other pre-OS accesses. For Linux
"raw" and "sector" mode namespaces defined by namespace labels are
inter-OS compatible while "fsdax", "devdax", and so called
"label-less" configurations are not.


Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread George Dunlap
On 05/11/2018 05:33 PM, Dan Williams wrote:
> [ adding linux-nvdimm ]
> 
> Great write up! Some comments below...

Thanks for the quick response!

It seems I still have some fundamental misconceptions about what's going
on, so I'd better start with that. :-)

Here's the part that I'm having a hard time getting.

If actual data on the NVDIMMs is a noun, and the act of writing is a
verb, then the SPA and interleave sets are adverbs: they define *how*
the write happens.  When the processor says, "Write to address X", the
memory controller converts address X into a <DIMM, DIMM address> tuple to actually write the data.
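
A toy model of that conversion, assuming a simple 4-way interleave with a
fixed 4 KiB granule; real memory controllers follow the NFIT interleave
tables, so this only illustrates the idea:

```c
/* Toy model: map an offset within an interleave set's SPA range to a
 * (DIMM, DPA) pair, assuming 4 DIMMs and a 4 KiB interleave granule.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_DIMMS 4
#define GRANULE   4096u     /* bytes written to one DIMM before rotating */

struct dimm_addr {
    unsigned dimm;          /* which DIMM in the interleave set */
    uint64_t dpa;           /* offset within that DIMM's contribution */
};

static struct dimm_addr spa_offset_to_dimm(uint64_t spa_offset)
{
    uint64_t granule = spa_offset / GRANULE;
    struct dimm_addr out = {
        .dimm = (unsigned)(granule % NUM_DIMMS),
        .dpa  = (granule / NUM_DIMMS) * GRANULE + spa_offset % GRANULE,
    };
    return out;
}

int main(void)
{
    struct dimm_addr a = spa_offset_to_dimm(0x7000);   /* 28 KiB into the range */
    /* With 4 DIMMs and 4 KiB granules: granule 7 -> DIMM 3, DPA 0x1000 */
    printf("DIMM %u, DPA 0x%llx\n", a.dimm, (unsigned long long)a.dpa);
    return 0;
}
```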

So, who decides what this SPA range and interleave set is?  Can the
operating system change these interleave sets and mappings, or change
data from PMEM to BLK, and if so, how?

If you read through section 13.19 of the UEFI manual, it seems to imply
that this is determined by the label area -- that each DIMM has a
separate label area describing regions local to that DIMM; and that if
you have 4 DIMMs you'll have 4 label areas, and each label area will
have a label describing the DPA region on that DIMM which corresponds to
the interleave set.  And somehow someone sets up the interleave sets and
SPA based on what's written there.

Which would mean that an operating system could change how the
interleave sets work by rewriting the various labels on the DIMMs; for
instance, changing a single 4-way set spanning the entirety of 4 DIMMs,
to one 4-way set spanning half of 4 DIMMs, and 2 2-way sets spanning
half of 2 DIMMs each.

But then you say:

> Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs
> provide a "label area" which is an out-of-band non-volatile memory
> area where the OS can store whatever it likes. The UEFI 2.7
> specification defines a data format for the definition of namespaces
> on top of persistent memory ranges advertised to the OS via the ACPI
> NFIT structure.

OK, so that sounds like no, that's not what happens.  So where do the
SPA range and interleave sets come from?

Random guess: The BIOS / firmware makes it up.  Either it's hard-coded,
or there's some menu in the BIOS you can use to change things around;
but once it hits the operating system, that's it -- the mapping of SPA
range onto interleave sets onto DIMMs is, from the operating system's
point of view, fixed.

And so (here's another guess) -- when you're talking about namespaces
and label areas, you're talking about namespaces stored *within a
pre-existing SPA range*.  You use the same format as described in the
UEFI spec, but ignore all the stuff about interleave sets and whatever,
and use system physical addresses relative to the SPA range rather than
DPAs.

Is that right?

But then there's things like this:

> There is no obligation for an NVDIMM to provide a label area, and as
> far as I know all NVDIMMs on the market today do not provide a label
> area.
[snip]
> Linux supports "label-less" mode where it exposes
> the raw capacity of a region in 1:1 mapped namespace without a label.
> This is how Linux supports "legacy" NVDIMMs that do not support
> labels.

So are "all NVDIMMs on the market today" then classed as "legacy"
NVDIMMs because they don't support labels?  And if labels are simply the
NVDIMM equivalent of a partition table, then what does it mean to
"support" or "not support" labels?

And then there's this:

> In any
> event we do the DIMM to SPA association first before reading labels.
> The OS calculates a so called "Interleave Set Cookie" from the NFIT
> information to compare against a similar value stored in the labels.
> This lets the OS determine that the Interleave Set composition has not
> changed from when the labels were initially written. An Interleave Set
> Cookie mismatch indicates the labels are stale, corrupted, or that the
> physical composition of the Interleave Set has changed.

So wait, the SPA and interleave sets can actually change?  And the
labels which the OS reads actually are per-DIMM, and do control somehow
how the DPA ranges of individual DIMMs are mapped into interleave sets
and exposed as SPAs?  (And perhaps, can be changed by the operating system?)

And:

> There are checksums in the Namespace definition to account for label
> validity. Starting with ACPI 6.2 DSMs for labels are deprecated in
> favor of the new / named methods for label access _LSI, _LSR, and
> _LSW.

Does this mean the methods will use checksums to verify writes to the
label area, and refuse writes which create invalid labels?

If all of the above is true, then in what way can it be said that
"NVDIMM has no concept of namespaces", that an OS can "store whatever it
likes" in the label area, and that UEFI namespaces are "on top of
persistent memory ranges advertised to the OS via the ACPI NFIT structure"?

I'm sorry if this is obvious, but I am exactly as confused as I was
before I started writing this. :-)

This is all pretty foundational.  Xen can read static ACPI tables, but
it can't do AML.  So to do a 

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread George Dunlap


> On May 15, 2018, at 1:26 PM, Jan Beulich  wrote:
> 
 On 15.05.18 at 12:12,  wrote:
>>> On May 15, 2018, at 11:05 AM, Roger Pau Monne  wrote:
>>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
 [ adding linux-nvdimm ]
 
 Great write up! Some comments below...
 
 On Wed, May 9, 2018 at 10:35 AM, George Dunlap  
 wrote:
>> To use a namespace, an operating system needs at a minimum two pieces
>> of information: The UUID and/or Name of the namespace, and the SPA
>> range where that namespace is mapped; and ideally also the Type and
>> Abstraction Type to know how to interpret the data inside.
 
 Not necessarily, no. Linux supports "label-less" mode where it exposes
 the raw capacity of a region in 1:1 mapped namespace without a label.
 This is how Linux supports "legacy" NVDIMMs that do not support
 labels.
>>> 
>>> In that case, how does Linux know which area of the NVDIMM it should
>>> use to store the page structures?
>> 
>> The answer to that is right here:
>> 
>> `fsdax` and `devdax` mode are both designed to make it possible for
>> user processes to have direct mapping of NVRAM.  As such, both are
>> only suitable for PMEM namespaces (?).  Both also need to have kernel
>> page structures allocated for each page of NVRAM; this amounts to 64
>> bytes for every 4k of NVRAM.  Memory for these page structures can
>> either be allocated out of normal "system" memory, or inside the PMEM
>> namespace itself.
>> 
>> In both cases, an "info block", very similar to the BTT info block, is
>> written to the beginning of the namespace when created.  This info
>> block specifies whether the page structures come from system memory or
>> from the namespace itself.  If from the namespace itself, it contains
>> information about what parts of the namespace have been set aside for
>> Linux to use for this purpose.
>> 
>> That is, each fsdax / devdax namespace has a superblock that, in part, 
>> defines what parts are used for Linux and what parts are used for data.  Or 
>> to put it a different way: Linux decides which parts of a namespace to use 
>> for page structures, and writes it down in the metadata starting in the 
>> first 
>> page of the namespace.
> 
> And that metadata layout is agreed upon between all OS vendors?
> 
>> Linux has also defined "Type GUIDs" for these two types of namespace
>> to be stored in the namespace label, although these are not yet in the
>> ACPI spec.
 
 They never will be. One of the motivations for GUIDs is that an OS can
 define private ones without needing to go back and standardize them.
 Only GUIDs that are needed to inter-OS / pre-OS compatibility would
 need to be defined in ACPI, and there is no expectation that other
 OSes understand Linux's format for reserving page structure space.
>>> 
>>> Maybe it would be helpful to somehow mark those areas as
>>> "non-persistent" storage, so that other OSes know they can use this
>>> space for temporary data that doesn't need to survive across reboots?
>> 
>> In theory there’s no reason another OS couldn’t learn Linux’s format, 
>> discover where the blocks were, and use those blocks for its own purposes 
>> while Linux wasn’t running.
> 
> This looks to imply "no" to my question above, in which case I wonder how
> we would use (part of) the space when the "other" owner is e.g. Windows.

So in classic DOS partition tables, you have partition types; and various 
operating systems just sort of “claimed” numbers for themselves (e.g., NTFS, 
Linux Swap, etc.).

But the DOS partition table number space is actually quite small.  So in 
namespaces, you have a similar concept, except that it’s called a “type GUID”, 
and it’s massively long — long enough that anyone who wants to make a new type can 
simply generate one randomly and be pretty confident that nobody else is using 
that one.

So if the labels contain a TGUID you understand, you use it, just like you 
would a partition that you understand.  If they contain GUIDs you don’t 
understand, you’d better leave it alone.
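
In code terms, the rule is roughly the following (the GUID value here is a
made-up placeholder, not a real Linux or UEFI type GUID):

```c
/* Sketch of the "only touch type GUIDs you recognize" rule.  The GUID
 * value is a placeholder; the real type GUIDs live in the spec and in
 * the Linux nvdimm code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint8_t b[16]; } guid_t;

/* Placeholder only -- NOT a real type GUID. */
static const guid_t GUID_I_UNDERSTAND = {{ 0x11, 0x22, 0x33, 0x44 /* ... */ }};

static bool can_claim_namespace(const guid_t *label_type_guid)
{
    /* Claim the namespace only if its type GUID is one we know how to
     * interpret; anything unrecognized belongs to another OS or agent. */
    return memcmp(label_type_guid, &GUID_I_UNDERSTAND, sizeof(guid_t)) == 0;
}

int main(void)
{
    guid_t unknown = {{ 0xde, 0xad, 0xbe, 0xef /* ... */ }};
    printf("claim it? %s\n", can_claim_namespace(&unknown) ? "yes" : "no");
    return 0;
}
```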

 -George

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread Jan Beulich
>>> On 15.05.18 at 12:12,  wrote:
>> On May 15, 2018, at 11:05 AM, Roger Pau Monne  wrote:
>> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>>> [ adding linux-nvdimm ]
>>> 
>>> Great write up! Some comments below...
>>> 
>>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap  
>>> wrote:
> To use a namespace, an operating system needs at a minimum two pieces
> of information: The UUID and/or Name of the namespace, and the SPA
> range where that namespace is mapped; and ideally also the Type and
> Abstraction Type to know how to interpret the data inside.
>>> 
>>> Not necessarily, no. Linux supports "label-less" mode where it exposes
>>> the raw capacity of a region in 1:1 mapped namespace without a label.
>>> This is how Linux supports "legacy" NVDIMMs that do not support
>>> labels.
>> 
>> In that case, how does Linux know which area of the NVDIMM it should
>> use to store the page structures?
> 
> The answer to that is right here:
> 
> `fsdax` and `devdax` mode are both designed to make it possible for
> user processes to have direct mapping of NVRAM.  As such, both are
> only suitable for PMEM namespaces (?).  Both also need to have kernel
> page structures allocated for each page of NVRAM; this amounts to 64
> bytes for every 4k of NVRAM.  Memory for these page structures can
> either be allocated out of normal "system" memory, or inside the PMEM
> namespace itself.
> 
> In both cases, an "info block", very similar to the BTT info block, is
> written to the beginning of the namespace when created.  This info
> block specifies whether the page structures come from system memory or
> from the namespace itself.  If from the namespace itself, it contains
> information about what parts of the namespace have been set aside for
> Linux to use for this purpose.
> 
> That is, each fsdax / devdax namespace has a superblock that, in part, 
> defines what parts are used for Linux and what parts are used for data.  Or 
> to put it a different way: Linux decides which parts of a namespace to use 
> for page structures, and writes it down in the metadata starting in the first 
> page of the namespace.

And that metadata layout is agreed upon between all OS vendors?

> Linux has also defined "Type GUIDs" for these two types of namespace
> to be stored in the namespace label, although these are not yet in the
> ACPI spec.
>>> 
>>> They never will be. One of the motivations for GUIDs is that an OS can
>>> define private ones without needing to go back and standardize them.
>>> Only GUIDs that are needed to inter-OS / pre-OS compatibility would
>>> need to be defined in ACPI, and there is no expectation that other
>>> OSes understand Linux's format for reserving page structure space.
>> 
>> Maybe it would be helpful to somehow mark those areas as
>> "non-persistent" storage, so that other OSes know they can use this
>> space for temporary data that doesn't need to survive across reboots?
> 
> In theory there’s no reason another OS couldn’t learn Linux’s format, 
> discover where the blocks were, and use those blocks for its own purposes 
> while Linux wasn’t running.

This looks to imply "no" to my question above, in which case I wonder how
we would use (part of) the space when the "other" owner is e.g. Windows.

Jan



Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread George Dunlap


> On May 15, 2018, at 11:05 AM, Roger Pau Monne  wrote:
> 
> Just some replies/questions to some of the points raised below.
> 
> On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
>> [ adding linux-nvdimm ]
>> 
>> Great write up! Some comments below...
>> 
>> On Wed, May 9, 2018 at 10:35 AM, George Dunlap  
>> wrote:
 To use a namespace, an operating system needs at a minimum two pieces
 of information: The UUID and/or Name of the namespace, and the SPA
 range where that namespace is mapped; and ideally also the Type and
 Abstraction Type to know how to interpret the data inside.
>> 
>> Not necessarily, no. Linux supports "label-less" mode where it exposes
>> the raw capacity of a region in 1:1 mapped namespace without a label.
>> This is how Linux supports "legacy" NVDIMMs that do not support
>> labels.
> 
> In that case, how does Linux know which area of the NVDIMM it should
> use to store the page structures?

The answer to that is right here:

 `fsdax` and `devdax` mode are both designed to make it possible for
 user processes to have direct mapping of NVRAM.  As such, both are
 only suitable for PMEM namespaces (?).  Both also need to have kernel
 page structures allocated for each page of NVRAM; this amounts to 64
 bytes for every 4k of NVRAM.  Memory for these page structures can
 either be allocated out of normal "system" memory, or inside the PMEM
 namespace itself.
 
 In both cases, an "info block", very similar to the BTT info block, is
 written to the beginning of the namespace when created.  This info
 block specifies whether the page structures come from system memory or
 from the namespace itself.  If from the namespace itself, it contains
 information about what parts of the namespace have been set aside for
 Linux to use for this purpose.

That is, each fsdax / devdax namespace has a superblock that, in part, defines 
what parts are used for Linux and what parts are used for data.  Or to put it a 
different way: Linux decides which parts of a namespace to use for page 
structures, and writes it down in the metadata starting in the first page of 
the namespace.
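
The overhead is easy to quantify: 64 bytes per 4 KiB page is 1/64th of the
namespace, so for example a 1 TiB namespace needs about 16 GiB of page
structures, whether they come from system RAM or from the namespace itself.
A small worked example:

```c
/* Back-of-envelope cost of the page structures described above:
 * 64 bytes of metadata per 4 KiB page = 1/64th of the namespace.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t ns_bytes  = 1ull << 40;   /* 1 TiB namespace */
    const uint64_t page_size = 4096;
    const uint64_t per_page  = 64;           /* bytes of metadata per page */
    uint64_t overhead = ns_bytes / page_size * per_page;

    printf("page-structure overhead: %llu MiB (%.2f%% of the namespace)\n",
           (unsigned long long)(overhead >> 20),
           100.0 * (double)overhead / (double)ns_bytes);
    return 0;
}
```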


 
 Linux has also defined "Type GUIDs" for these two types of namespace
 to be stored in the namespace label, although these are not yet in the
 ACPI spec.
>> 
>> They never will be. One of the motivations for GUIDs is that an OS can
>> define private ones without needing to go back and standardize them.
>> Only GUIDs that are needed to inter-OS / pre-OS compatibility would
>> need to be defined in ACPI, and there is no expectation that other
>> OSes understand Linux's format for reserving page structure space.
> 
> Maybe it would be helpful to somehow mark those areas as
> "non-persistent" storage, so that other OSes know they can use this
> space for temporary data that doesn't need to survive across reboots?

In theory there’s no reason another OS couldn’t learn Linux’s format, discover 
where the blocks were, and use those blocks for its own purposes while Linux 
wasn’t running.

But that won’t help Xen, as we want to use those blocks while Linux *is* 
running.

 -George


Re: [Xen-devel] Draft NVDIMM proposal

2018-05-15 Thread Roger Pau Monné
Just some replies/questions to some of the points raised below.

On Fri, May 11, 2018 at 09:33:10AM -0700, Dan Williams wrote:
> [ adding linux-nvdimm ]
> 
> Great write up! Some comments below...
> 
> On Wed, May 9, 2018 at 10:35 AM, George Dunlap  
> wrote:
> >> To use a namespace, an operating system needs at a minimum two pieces
> >> of information: The UUID and/or Name of the namespace, and the SPA
> >> range where that namespace is mapped; and ideally also the Type and
> >> Abstraction Type to know how to interpret the data inside.
> 
> Not necessarily, no. Linux supports "label-less" mode where it exposes
> the raw capacity of a region in 1:1 mapped namespace without a label.
> This is how Linux supports "legacy" NVDIMMs that do not support
> labels.

In that case, how does Linux know which area of the NVDIMM it should
use to store the page structures?

> >> `fsdax` and `devdax` mode are both designed to make it possible for
> >> user processes to have direct mapping of NVRAM.  As such, both are
> >> only suitable for PMEM namespaces (?).  Both also need to have kernel
> >> page structures allocated for each page of NVRAM; this amounts to 64
> >> bytes for every 4k of NVRAM.  Memory for these page structures can
> >> either be allocated out of normal "system" memory, or inside the PMEM
> >> namespace itself.
> >>
> >> In both cases, an "info block", very similar to the BTT info block, is
> >> written to the beginning of the namespace when created.  This info
> >> block specifies whether the page structures come from system memory or
> >> from the namespace itself.  If from the namespace itself, it contains
> >> information about what parts of the namespace have been set aside for
> >> Linux to use for this purpose.
> >>
> >> Linux has also defined "Type GUIDs" for these two types of namespace
> >> to be stored in the namespace label, although these are not yet in the
> >> ACPI spec.
> 
> They never will be. One of the motivations for GUIDs is that an OS can
> define private ones without needing to go back and standardize them.
> Only GUIDs that are needed to inter-OS / pre-OS compatibility would
> need to be defined in ACPI, and there is no expectation that other
> OSes understand Linux's format for reserving page structure space.

Maybe it would be helpful to somehow mark those areas as
"non-persistent" storage, so that other OSes know they can use this
space for temporary data that doesn't need to survive across reboots?

> >> # Proposed design / roadmap
> >>
> >> Initially, dom0 accesses the NVRAM as normal, using static ACPI tables
> >> and the DSM methods; mappings are treated by Xen during this phase as
> >> MMIO.
> >>
> >> Once dom0 is ready to pass parts of a namespace through to a guest, it
> >> makes a hypercall to tell Xen about the namespace.  It includes any
> >> regions of the namespace which Xen may use for 'scratch'; it also
> >> includes a flag to indicate whether this 'scratch' space may be used
> >> for frame tables from other namespaces.
> >>
> >> Frame tables are then created for this SPA range.  They will be
> >> allocated from, in this order: 1) designated 'scratch' range from
> >> within this namespace 2) designated 'scratch' range from other
> >> namespaces which has been marked as sharable 3) system RAM.
> >>
> >> Xen will either verify that dom0 has no existing mappings, or promote
> >> the mappings to full pages (taking appropriate reference counts for
> >> mappings).  Dom0 must ensure that this namespace is not unmapped,
> >> modified, or relocated until it asks Xen to unmap it.
> >>
> >> For Xen frame tables, to begin with, set aside a partition inside a
> >> namespace to be used by Xen.  Pass this in to Xen when activating the
> >> namespace; this could be either 2a or 3a from "Page structure
> >> allocation".  After that, we could decide which of the two more
> >> streamlined approaches (2b or 3b) to pursue.
> >>
> >> At this point, dom0 can pass parts of the mapped namespace into
> >> guests.  Unfortunately, passing files on a fsdax filesystem is
> >> probably not safe; but we can pass in full dev-dax or fsdax
> >> partitions.
> >>
> >> From a guest perspective, I propose we provide static NFIT only, no
> >> access to labels to begin with.  This can be generated in hvmloader
> >> and/or the toolstack acpi code.
> 
> I'm ignorant of Xen internals, but can you not reuse the existing QEMU
> emulation for labels and NFIT?

We only use QEMU for HVM guests, which would still leave PVH guests
without NVDIMM support. Ideally we would like to use the same solution
for both HVM and PVH, which means QEMU cannot be part of that
solution.

Thanks, Roger.


Re: [Xen-devel] Draft NVDIMM proposal

2018-05-11 Thread Dan Williams
[ adding linux-nvdimm ]

Great write up! Some comments below...

On Wed, May 9, 2018 at 10:35 AM, George Dunlap  wrote:
> Dan,
>
> I understand that you're the NVDIMM maintainer for Linux.  I've been
> working with your colleagues to try to sort out an architecture to allow
> NVRAM to be passed to guests under the Xen hypervisor.
>
> If you have time, I'd appreciate it if you could skim through at least
> the first section of the document below ("NVDIMM Overview"), concerning
> NVDIMM devices and Linux, to see if I've made any mistakes.
>
> If you're up for it, additional early feedback on the proposed Xen
> architecture, from a Linux perspective, would be awesome as well.
>
> Thanks,
>  -George
>
> On 05/09/2018 06:29 PM, George Dunlap wrote:
>> Below is an initial draft of an NVDIMM proposal.  I'll submit a patch to
>> include it in the tree at some point, but I thought for initial
>> discussion it would be easier if it were copied in-line.
>>
>> I've done a fair amount of investigation, but it's quite likely I've
>> made mistakes.  Please send me corrections where necessary.
>>
>> -George
>>
>> ---
>> % NVDIMMs and Xen
>> % George Dunlap
>> % Revision 0.1
>>
>> # NVDIMM overview
>>
>> It's very difficult, from the various specs, to actually get a
>> complete enough picture of what's going on to make a good design.
>> This section is meant as an overview of the current hardware,
>> firmware, and Linux interfaces sufficient to inform a discussion of
>> the issues in designing a Xen interface for NVDIMMs.
>>
>> ## DIMMs, Namespaces, and access methods
>>
>> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form
>> factor) that contains _non-volatile RAM_ (NVRAM).  Individual bytes of
>> memory on a DIMM are specified by a _DIMM physical address_ or DPA.
>> Each DIMM is attached to an NVDIMM controller.
>>
>> Memory on the DIMMs is divided up into _namespaces_.  The word
>> "namespace" is rather misleading though; a namespace in this context
>> is not actually a space of names (contrast, for example "C++
>> namespaces"); rather, it's more like a SCSI LUN, or a volume, or a
>> partition on a drive: a set of data which is meant to be viewed and
>> accessed as a unit.  (The name was apparently carried over from NVMe
>> devices, which were precursors of the NVDIMM spec.)

Unlike NVMe an NVDIMM itself has no concept of namespaces. Some DIMMs
provide a "label area" which is an out-of-band non-volatile memory
area where the OS can store whatever it likes. The UEFI 2.7
specification defines a data format for the definition of namespaces
on top of persistent memory ranges advertised to the OS via the ACPI
NFIT structure.
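
For orientation, a simplified sketch of the per-namespace label record that
format defines; the field set below is recalled from the UEFI 2.7 / Linux
definitions and abbreviated, so treat it as illustrative rather than the
authoritative layout:

```c
/* Simplified sketch of a UEFI namespace label as stored in a DIMM's
 * label area.  Field selection and ordering are from memory and
 * abbreviated; see UEFI 2.7 (and Linux's struct nd_namespace_label)
 * for the real thing.
 */
#include <stdint.h>

struct namespace_label_sketch {
    uint8_t  uuid[16];      /* identifies the namespace this label belongs to */
    char     name[64];      /* optional human-readable name */
    uint32_t flags;
    uint16_t nlabel;        /* number of labels making up the namespace */
    uint16_t position;      /* this label's position within that set */
    uint64_t isetcookie;    /* interleave-set cookie ("SetCookie") */
    uint64_t lbasize;       /* logical block size, for block-style access */
    uint64_t dpa;           /* start of this label's range in DIMM physical addresses */
    uint64_t rawsize;       /* length of this label's range */
    uint32_t slot;          /* which label-area slot this record occupies */
    /* later label versions also carry type/abstraction GUIDs and a checksum */
};
```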

There is no obligation for an NVDIMM to provide a label area, and as
far as I know all NVDIMMs on the market today do not provide a label
area. That said, QEMU has the ability to associate a virtual label
area with its virtual NVDIMM representation.

>> The NVDIMM controller allows two ways to access the DIMM.  One is
>> mapped 1-1 in _system physical address space_ (SPA), much like normal
>> RAM.  This method of access is called _PMEM_.  The other method is
>> similar to that of a PCI device: you have control and status
>> registers which control an 8k aperture window into the DIMM.  This
>> method of access is called _PBLK_.
>>
>> In the case of PMEM, as in the case of DRAM, addresses from the SPA
>> are interleaved across a set of DIMMs (an _interleave set_) for
>> performance reasons.  A specific PMEM namespace will be a single
>> contiguous DPA range across all DIMMs in its interleave set.  For
>> example, you might have a namespace for DPAs `0-0x5000` on DIMMs 0
>> and 1; and another namespace for DPAs `0x8000-0xa000` on DIMMs
>> 0, 1, 2, and 3.
>>
>> In the case of PBLK, a namespace always resides on a single DIMM.
>> However, that namespace can be made up of multiple discontiguous
>> chunks of space on that DIMM.  For instance, in our example above, we
>> might have a namespace on DIMM 0 consisting of DPAs
>> `0x5000-0x6000`, `0x8000-0x9000`, and
>> `0xa000-0xf000`.
>>
>> The interleaving of PMEM has implications for the speed and
>> reliability of the namespace: Much like RAID 0, it maximizes speed,
>> but it means that if any one DIMM fails, the data from the entire
>> namespace is corrupted.  PBLK makes it slightly less straightforward
>> to access, but it allows OS software to apply RAID-like logic to
>> balance redundancy and speed.
>>
>> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM;
>> for large systems without 5-level paging, this is actually becoming a
>> limitation.  Using PBLK allows existing 4-level paged systems to
>> access an arbitrary amount of NVDIMM.
>>
>> ## Namespaces, labels, and the label area
>>
>> A namespace is a mapping from the SPA and MMIO space into the DIMM.
>>
>> The firmware and/or operating system can talk to the NVDIMM controller
>> to set up mappings from 

Re: [Xen-devel] Draft NVDIMM proposal

2018-05-09 Thread George Dunlap
Dan,

I understand that you're the NVDIMM maintainer for Linux.  I've been
working with your colleagues to try to sort out an architecture to allow
NVRAM to be passed to guests under the Xen hypervisor.

If you have time, I'd appreciate it if you could skim through at least
the first section of the document below ("NVDIMM Overview"), concerning
NVDIMM devices and Linux, to see if I've made any mistakes.

If you're up for it, additional early feedback on the proposed Xen
architecture, from a Linux perspective, would be awesome as well.

Thanks,
 -George

On 05/09/2018 06:29 PM, George Dunlap wrote:
> Below is an initial draft of an NVDIMM proposal.  I'll submit a patch to
> include it in the tree at some point, but I thought for initial
> discussion it would be easier if it were copied in-line.
> 
> I've done a fair amount of investigation, but it's quite likely I've
> made mistakes.  Please send me corrections where necessary.
> 
> -George
> 
> ---
> % NVDIMMs and Xen
> % George Dunlap
> % Revision 0.1
> 
> # NVDIMM overview
> 
> It's very difficult, from the various specs, to actually get a
> complete enough picture of what's going on to make a good design.
> This section is meant as an overview of the current hardware,
> firmware, and Linux interfaces sufficient to inform a discussion of
> the issues in designing a Xen interface for NVDIMMs.
> 
> ## DIMMs, Namespaces, and access methods
> 
> An NVDIMM is a DIMM (_dual in-line memory module_ -- a physical form
> factor) that contains _non-volatile RAM_ (NVRAM).  Individual bytes of
> memory on a DIMM are specified by a _DIMM physical address_ or DPA.
> Each DIMM is attached to an NVDIMM controller.
> 
> Memory on the DIMMs is divided up into _namespaces_.  The word
> "namespace" is rather misleading though; a namespace in this context
> is not actually a space of names (contrast, for example "C++
> namespaces"); rather, it's more like a SCSI LUN, or a volume, or a
> partition on a drive: a set of data which is meant to be viewed and
> accessed as a unit.  (The name was apparently carried over from NVMe
> devices, which were precursors of the NVDIMM spec.)
> 
> The NVDIMM controller allows two ways to access the DIMM.  One is
> mapped 1-1 in _system physical address space_ (SPA), much like normal
> RAM.  This method of access is called _PMEM_.  The other method is
> similar to that of a PCI device: you have control and status
> registers which control an 8k aperture window into the DIMM.  This
> method of access is called _PBLK_.
> 
> In the case of PMEM, as in the case of DRAM, addresses from the SPA
> are interleaved across a set of DIMMs (an _interleave set_) for
> performance reasons.  A specific PMEM namespace will be a single
> contiguous DPA range across all DIMMs in its interleave set.  For
> example, you might have a namespace for DPAs `0-0x5000` on DIMMs 0
> and 1; and another namespace for DPAs `0x8000-0xa000` on DIMMs
> 0, 1, 2, and 3.
> 
> In the case of PBLK, a namespace always resides on a single DIMM.
> However, that namespace can be made up of multiple discontiguous
> chunks of space on that DIMM.  For instance, in our example above, we
> might have a namespace on DIMM 0 consisting of DPAs
> `0x5000-0x6000`, `0x8000-0x9000`, and
> `0xa000-0xf000`.
> 
> The interleaving of PMEM has implications for the speed and
> reliability of the namespace: Much like RAID 0, it maximizes speed,
> but it means that if any one DIMM fails, the data from the entire
> namespace is corrupted.  PBLK makes it slightly less straightforward
> to access, but it allows OS software to apply RAID-like logic to
> balance redundancy and speed.
> 
> Furthermore, PMEM requires one byte of SPA for every byte of NVDIMM;
> for large systems without 5-level paging, this is actually becoming a
> limitation.  Using PBLK allows existing 4-level paged systems to
> access an arbitrary amount of NVDIMM.
> 
> ## Namespaces, labels, and the label area
> 
> A namespace is a mapping from the SPA and MMIO space into the DIMM.
> 
> The firmware and/or operating system can talk to the NVDIMM controller
> to set up mappings from SPA and MMIO space into the DIMM.  Because the
> memory and PCI devices are separate, it would be possible for buggy
> firmware or NVDIMM controller drivers to misconfigure things such that
> the same DPA is exposed in multiple places; if so, the results are
> undefined.
> 
> Namespaces are constructed out of "labels".  Each DIMM has a Label
> Storage Area, which is persistent but logically separate from the
> device-addressable areas on the DIMM.  A label on a DIMM describes a
> single contiguous region of DPA on that DIMM.  A PMEM namespace is
> made up of one label from each of the DIMMs which make its interleave
> set; a PBLK namespace is made up of one label for each chunk of range.
> 
> In our examples above, the first PMEM namespace would be made of two
> labels (one on DIMM 0 and one on DIMM 1, each describing DPA
>