Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Dan Williams
On Fri, Oct 7, 2016 at 2:42 PM, Linda Knippers  wrote:
>
>
> On 10/7/2016 3:52 PM, Dan Williams wrote:
>> On Fri, Oct 7, 2016 at 11:19 AM, Linda Knippers  
>> wrote:
>>> Hi Dan,
>>>
>>> A couple of general questions...
>>>
>>> On 10/7/2016 12:38 PM, Dan Williams wrote:
>>>> With the arrival of the device-dax facility in 4.7 a pmem namespace can
>>>> now be configured into a total of four distinct modes: 'raw', 'sector',
>>>> 'memory', and 'dax'. Where raw, sector, and memory are block device
>>>> modes and dax supports the device-dax character device. With that degree
>>>> of freedom in the use cases it is overly restrictive to continue the
>>>> current limit of only one pmem namespace per-region, or "interleave-set"
>>>> in ACPI 6+ terminology.
>>>
>>> If I understand correctly, at least some of the restrictions were
>>> part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
>>> The most recent namespace spec on pmem.io hasn't been updated to remove
>>> those restrictions.  Is there a different public spec?
>>
>> Yes, this is Linux specific and use of this capability needs to be
>> cognizant that it could create a configuration that is not understood
>> by EFI, or other OSes (including older Linux implementations).  I plan
>> to add documentation to ndctl along these lines.  This is similar to
>> the current situation with 'pfn' and 'dax' info blocks that are also
>> Linux specific.  However, I should note that this implementation
>> changes none of the interpretation of the fields nor layout of the
>> existing label specification.  It simply allows two pmem labels that
>> happen to appear in the same region to result in two namespaces rather
>> than 0.
>
> Ok, but the namespace spec says that's not allowed.  It seemed like an odd
> restriction to be in the label spec but it is there.

The restriction greatly simplified the implementation back a couple
years ago when we assumed that partitioning of block devices could
handle any cases of needing distinct operation modes for different
sub-divisions of pmem.  If you look at the original implementation of
the btt, before it went upstream, it was designed as a stacked block
device driver.  In that arrangement you could theoretically have
/dev/pmem0 as the whole disk device and then create a btt
configuration on top of /dev/pmem0p1 but leave /dev/pmem0p2 as a plain
/ raw pmem device.

We killed that design during the review process and moved btt to be an
intrinsic property of the whole device.

Another development since that one-namespace-per-region restriction
seemed tenable is that we (the Linux community) decided not to pursue
raw-device-dax support for block devices.  Linux block
devices are tied in with the page cache and filesystems sometimes use
the block-device-inode page cache to submit metadata updates
(particularly ext4).  This collided with the msync/fsync dirty
cacheline tracking implementation.  We started to fix a few of those
collisions, but then decided it would be better to leave block devices
alone and move raw-device-dax support to its own / new device node
type.  That's the genesis of device-dax.
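
For readers who have not used it, device-dax exposes the namespace as a
character device that is mmap()ed directly, with no block layer or page
cache in between.  A rough usage sketch (the device name, mapping size,
and alignment here are assumptions, and making stores durable is left to
the application, for example via cache-flush instructions or a library
such as libpmem):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical device-dax node; the real name depends on the config. */
    const char *path = "/dev/dax0.0";
    size_t len = 2UL << 20;     /* map 2M; required alignment is per-device */
    int fd = open(path, O_RDWR);

    if (fd < 0) {
        perror(path);
        return 1;
    }

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /*
     * Loads and stores go straight to the pmem; there is no page cache,
     * so the msync/fsync dirty-tracking discussed above never applies.
     * Making the store durable is up to the application (CPU cache
     * flushes, e.g. via libpmem), which is out of scope for this sketch.
     */
    memcpy(addr, "hello", 6);

    munmap(addr, len);
    close(fd);
    return 0;
}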

So the rationale of "don't allow sub-division because an
implementation can just use block device partitions for different use
cases" no longer holds.

>>>> This series adds support for reading and writing configurations that
>>>> describe multiple pmem allocations within a region.  The new rules for
>>>> allocating / validating the available capacity when blk and pmem regions
>>>> alias are (quoting space_valid()):
>>>>
>>>>    BLK-space is valid as long as it does not precede a PMEM
>>>>    allocation in a given region. PMEM-space must be contiguous
>>>>    and adjacent to an existing allocation (if one
>>>>    exists).
>>>
>>> Why is this new rule necessary?  Is this a HW-specific rule or something
>>> related to how Linux could possibly support something?  Why do we care
>>> whether blk-space is before or after pmem-space? If it's a HW-specific
>>> rule, then shouldn't the enforcement be in the management tool that
>>> configures the namespaces?
>>
>> It is not HW specific, and it's not new in the sense that we already
>> arrange for pmem to be allocated from low addresses and blk to be
>> allocated from high addresses.
>
> Who's the "we"?

"We" == the current Linux kernel implementation, i.e. we the Linux community.

> Does the location within the region come from the OS
> or from the tool that created the namespace?  (I should probably know
> this but not having labels, I've never looked at this.)

The location is chosen by the kernel.  Userspace only selects the size.

> If we're relaxing some of the rules, it seems like one could have
> pmem, then block, then free space, and later want to use free space
> for another pmem range.  If hardware supported it and the management
> tool created it, would the kernel allow it?

As long as external tooling lays down those labels in that manner, the
kernel will accept the configuration.

Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Linda Knippers


On 10/7/2016 3:52 PM, Dan Williams wrote:
> On Fri, Oct 7, 2016 at 11:19 AM, Linda Knippers  
> wrote:
>> Hi Dan,
>>
>> A couple of general questions...
>>
>> On 10/7/2016 12:38 PM, Dan Williams wrote:
>>> With the arrival of the device-dax facility in 4.7 a pmem namespace can
>>> now be configured into a total of four distinct modes: 'raw', 'sector',
>>> 'memory', and 'dax'. Where raw, sector, and memory are block device
>>> modes and dax supports the device-dax character device. With that degree
>>> of freedom in the use cases it is overly restrictive to continue the
>>> current limit of only one pmem namespace per-region, or "interleave-set"
>>> in ACPI 6+ terminology.
>>
>> If I understand correctly, at least some of the restrictions were
>> part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
>> The most recent namespace spec on pmem.io hasn't been updated to remove
>> those restrictions.  Is there a different public spec?
> 
> Yes, this is Linux specific and use of this capability needs to be
> cognizant that it could create a configuration that is not understood
> by EFI, or other OSes (including older Linux implementations).  I plan
> to add documentation to ndctl along these lines.  This is similar to
> the current situation with 'pfn' and 'dax' info blocks that are also
> Linux specific.  However, I should note that this implementation
> changes none of the interpretation of the fields nor layout of the
> existing label specification.  It simply allows two pmem labels that
> happen to appear in the same region to result in two namespaces rather
> than 0.

Ok, but the namespace spec says that's not allowed.  It seemed like an odd
restriction to be in the label spec but it is there.
> 
>>> This series adds support for reading and writing configurations that
>>> describe multiple pmem allocations within a region.  The new rules for
>>> allocating / validating the available capacity when blk and pmem regions
>>> alias are (quoting space_valid()):
>>>
>>>BLK-space is valid as long as it does not precede a PMEM
>>>allocation in a given region. PMEM-space must be contiguous
>>>and adjacent to an existing allocation (if one
>>>exists).
>>
>> Why is this new rule necessary?  Is this a HW-specific rule or something
>> related to how Linux could possibly support something?  Why do we care
>> whether blk-space is before or after pmem-space? If it's a HW-specific
>> rule, then shouldn't the enforcement be in the management tool that
>> configures the namespaces?
> 
> It is not HW specific, and it's not new in the sense that we already
> arrange for pmem to be allocated from low addresses and blk to be
> allocated from high addresses.  

Who's the "we"?  Does the location within the region come from the OS
or from the tool that created the namespace?  (I should probably know
this but not having labels, I've never looked at this.)

If we're relaxing some of the rules, it seems like one could have
pmem, then block, then free space, and later want to use free space
for another pmem range.  If hardware supported it and the management
tool created it, would the kernel allow it?

> If another implementation violated
> this constraint Linux would parse it just fine. The constraint is a
> Linux decision to maximize available pmem capacity when blk and pmem
> alias.  So this is a situation where Linux is liberal in what it will
> accept when reading labels, but conservative on the configurations it
> will create when writing labels.

Is it ndctl that's being conservative?  It seems like the kernel shouldn't care.
> 
>>> Where "adjacent" allocations grow an existing namespace.  Note that
>>> growing a namespace is potentially destructive if free space is consumed
>>> from a location preceding the current allocation.  There is no support
>>> for dis-continuity within a given namespace allocation.
>>
>> Are you talking about DPAs here?
> 
> No, this is referring to system-physical-address partitioning.
> 
>>> Previously, since there was only one namespace per-region, the resulting
>>> pmem device would be named after the region.  Now, subsequent namespaces
>>> after the first are named with the region index and a
>>> "." suffix. For example:
>>>
>>>   /dev/pmem0.1
>>
>> According to the existing namespace spec, you can already have multiple
>> block namespaces on a device. I've not seen a system with block namespaces
>> so what do those /dev entries look like?  (The dots are somewhat 
>> unattractive.)
> 
> Block namespaces result in devices with names like "/dev/ndblk0.0"
> where the X.Y numbers are ..  This new
> naming for pmem devices is following that precedent.  The "dot" was
> originally adopted from Linux USB device naming.

Does this mean that if someone updates their kernel then their /dev/pmem0
becomes /dev/pmem0.0?  Or do you only get the dot if there is more
than one namespace per region?

-- ljk


> 


Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Dan Williams
On Fri, Oct 7, 2016 at 11:19 AM, Linda Knippers  wrote:
> Hi Dan,
>
> A couple of general questions...
>
> On 10/7/2016 12:38 PM, Dan Williams wrote:
>> With the arrival of the device-dax facility in 4.7 a pmem namespace can
>> now be configured into a total of four distinct modes: 'raw', 'sector',
>> 'memory', and 'dax'. Where raw, sector, and memory are block device
>> modes and dax supports the device-dax character device. With that degree
>> of freedom in the use cases it is overly restrictive to continue the
>> current limit of only one pmem namespace per-region, or "interleave-set"
>> in ACPI 6+ terminology.
>
> If I understand correctly, at least some of the restrictions were
> part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
> The most recent namespace spec on pmem.io hasn't been updated to remove
> those restrictions.  Is there a different public spec?

Yes, this is Linux specific and use of this capability needs to be
cognizant that it could create a configuration that is not understood
by EFI, or other OSes (including older Linux implementations).  I plan
to add documentation to ndctl along these lines.  This is similar to
the current situation with 'pfn' and 'dax' info blocks that are also
Linux specific.  However, I should note that this implementation
changes none of the interpretation of the fields nor layout of the
existing label specification.  It simply allows two pmem labels that
happen to appear in the same region to result in two namespaces rather
than 0.
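
To illustrate the counting change with a deliberately simplified,
hypothetical sketch (these structures are not the on-media label format
or the kernel's scanning code): the old policy rejected a region holding
more than one pmem label, while the new policy yields one namespace per
label found in that region.

#include <stdio.h>

/* Hypothetical, simplified pmem label: just enough fields for the example. */
struct pmem_label {
    int region;                 /* interleave-set this label belongs to */
    const char *uuid;           /* namespace identity */
    unsigned long long dpa;     /* start offset within the region */
    unsigned long long size;    /* allocation size */
};

/*
 * Old policy: more than one pmem label in a region => zero namespaces.
 * New policy: every pmem label in the region yields a namespace.
 */
static int scan_region(const struct pmem_label *labels, int nlabels,
                       int region, int strict_single)
{
    int count = 0;

    for (int i = 0; i < nlabels; i++)
        if (labels[i].region == region)
            count++;

    if (strict_single && count > 1)
        return 0;       /* old behavior: configuration rejected */
    return count;       /* new behavior: one namespace per label */
}

int main(void)
{
    const struct pmem_label labels[] = {
        { .region = 0, .uuid = "uuid-a", .dpa = 0,        .size = 1 << 30 },
        { .region = 0, .uuid = "uuid-b", .dpa = 1u << 30, .size = 1 << 30 },
    };

    printf("old policy: %d namespaces\n", scan_region(labels, 2, 0, 1));
    printf("new policy: %d namespaces\n", scan_region(labels, 2, 0, 0));
    return 0;
}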

>> This series adds support for reading and writing configurations that
>> describe multiple pmem allocations within a region.  The new rules for
>> allocating / validating the available capacity when blk and pmem regions
>> alias are (quoting space_valid()):
>>
>>BLK-space is valid as long as it does not precede a PMEM
>>allocation in a given region. PMEM-space must be contiguous
>>and adjacent to an existing allocation (if one
>>exists).
>
> Why is this new rule necessary?  Is this a HW-specific rule or something
> related to how Linux could possibly support something?  Why do we care
> whether blk-space is before or after pmem-space? If it's a HW-specific
> rule, then shouldn't the enforcement be in the management tool that
> configures the namespaces?

It is not HW specific, and it's not new in the sense that we already
arrange for pmem to be allocated from low addresses and blk to be
allocated from high addresses.  If another implementation violated
this constraint Linux would parse it just fine. The constraint is a
Linux decision to maximize available pmem capacity when blk and pmem
alias.  So this is a situation where Linux is liberal in what it will
accept when reading labels, but conservative on the configurations it
will create when writing labels.
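
A toy sketch of that low/high convention (an illustration only, not the
libnvdimm allocator): carving pmem extents from the bottom of a region's
free span and blk extents from the top keeps the pmem side contiguous,
which is what maximizes the capacity that can still be claimed as pmem.

#include <stdio.h>

/* One region's free space, tracked as a single span for illustration. */
struct span {
    unsigned long long start;
    unsigned long long end;     /* exclusive */
};

/* pmem extents are carved from the low end of the span... */
static unsigned long long alloc_pmem(struct span *avail, unsigned long long size)
{
    unsigned long long start = avail->start;

    if (size > avail->end - avail->start)
        return -1ULL;           /* no room */
    avail->start += size;
    return start;
}

/* ...and blk extents from the high end, so pmem stays contiguous and low. */
static unsigned long long alloc_blk(struct span *avail, unsigned long long size)
{
    if (size > avail->end - avail->start)
        return -1ULL;
    avail->end -= size;
    return avail->end;
}

int main(void)
{
    struct span avail = { .start = 0, .end = 4ULL << 30 };  /* 4G region */

    printf("pmem extent at %#llx\n", alloc_pmem(&avail, 1ULL << 30));
    printf("blk  extent at %#llx\n", alloc_blk(&avail, 1ULL << 30));
    printf("pmem extent at %#llx\n", alloc_pmem(&avail, 1ULL << 30));
    return 0;
}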

>> Where "adjacent" allocations grow an existing namespace.  Note that
>> growing a namespace is potentially destructive if free space is consumed
>> from a location preceding the current allocation.  There is no support
>> for dis-continuity within a given namespace allocation.
>
> Are you talking about DPAs here?

No, this is referring to system-physical-address partitioning.

>> Previously, since there was only one namespace per-region, the resulting
>> pmem device would be named after the region.  Now, subsequent namespaces
>> after the first are named with the region index and a
>> "." suffix. For example:
>>
>>   /dev/pmem0.1
>
> According to the existing namespace spec, you can already have multiple
> block namespaces on a device. I've not seen a system with block namespaces
> so what do those /dev entries look like?  (The dots are somewhat 
> unattractive.)

Block namespaces result in devices with names like "/dev/ndblk0.0"
where the X.Y numbers are ..  This new
naming for pmem devices is following that precedent.  The "dot" was
originally adopted from Linux USB device naming.
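
A hedged sketch of the naming scheme as described in this thread (not
the kernel's formatting code; reading the second number as the namespace
index within the region is an assumption based on the examples above):

#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative only: build the /dev name from the region index and the
 * namespace's index within that region, following the pattern described
 * above (bare "pmemR" for the first pmem namespace, "pmemR.N" afterwards,
 * "ndblkR.N" for blk namespaces).
 */
static void format_name(char *buf, size_t len, int is_blk,
                        int region, int nsindex)
{
    if (is_blk)
        snprintf(buf, len, "ndblk%d.%d", region, nsindex);
    else if (nsindex == 0)
        snprintf(buf, len, "pmem%d", region);
    else
        snprintf(buf, len, "pmem%d.%d", region, nsindex);
}

int main(void)
{
    char name[32];

    format_name(name, sizeof(name), 0, 0, 0);
    printf("/dev/%s\n", name);  /* /dev/pmem0 */
    format_name(name, sizeof(name), 0, 0, 1);
    printf("/dev/%s\n", name);  /* /dev/pmem0.1 */
    format_name(name, sizeof(name), 1, 0, 0);
    printf("/dev/%s\n", name);  /* /dev/ndblk0.0 */
    return 0;
}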


Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Linda Knippers
Hi Dan,

A couple of general questions...

On 10/7/2016 12:38 PM, Dan Williams wrote:
> With the arrival of the device-dax facility in 4.7 a pmem namespace can
> now be configured into a total of four distinct modes: 'raw', 'sector',
> 'memory', and 'dax'. Where raw, sector, and memory are block device
> modes and dax supports the device-dax character device. With that degree
> of freedom in the use cases it is overly restrictive to continue the
> current limit of only one pmem namespace per-region, or "interleave-set"
> in ACPI 6+ terminology.

If I understand correctly, at least some of the restrictions were
part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
The most recent namespace spec on pmem.io hasn't been updated to remove
those restrictions.  Is there a different public spec?

> This series adds support for reading and writing configurations that
> describe multiple pmem allocations within a region.  The new rules for
> allocating / validating the available capacity when blk and pmem regions
> alias are (quoting space_valid()):
> 
>BLK-space is valid as long as it does not precede a PMEM
>allocation in a given region. PMEM-space must be contiguous
>and adjacent to an existing allocation (if one
>exists).

Why is this new rule necessary?  Is this a HW-specific rule or something
related to how Linux could possibly support something?  Why do we care
whether blk-space is before or after pmem-space? If it's a HW-specific
rule, then shouldn't the enforcement be in the management tool that
configures the namespaces?

> Where "adjacent" allocations grow an existing namespace.  Note that
> growing a namespace is potentially destructive if free space is consumed
> from a location preceding the current allocation.  There is no support
> for dis-continuity within a given namespace allocation.

Are you talking about DPAs here?

> Previously, since there was only one namespace per-region, the resulting
> pmem device would be named after the region.  Now, subsequent namespaces
> after the first are named with the region index and a
> "." suffix. For example:
> 
>   /dev/pmem0.1

According to the existing namespace spec, you can already have multiple
block namespaces on a device. I've not seen a system with block namespaces
so what do those /dev entries look like?  (The dots are somewhat unattractive.)

-- ljk
> 
> ---
> 
> Dan Williams (14):
>   libnvdimm, region: move region-mapping input-paramters to 
> nd_mapping_desc
>   libnvdimm, label: convert label tracking to a linked list
>   libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() 
> helper
>   libnvdimm, namespace: unify blk and pmem label scanning
>   tools/testing/nvdimm: support for sub-dividing a pmem region
>   libnvdimm, namespace: allow multiple pmem-namespaces per region at scan 
> time
>   libnvdimm, namespace: sort namespaces by dpa at init
>   libnvdimm, region: update nd_region_available_dpa() for multi-pmem 
> support
>   libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
>   libnvdimm, namespace: update label implementation for multi-pmem
>   libnvdimm, namespace: enable allocation of multiple pmem namespaces
>   libnvdimm, namespace: filter out of range labels in scan_labels()
>   libnvdimm, namespace: lift single pmem limit in scan_labels()
>   libnvdimm, namespace: allow creation of multiple pmem-namespaces per 
> region
> 
> 
>  drivers/acpi/nfit/core.c  |   30 +
>  drivers/nvdimm/dimm_devs.c|  192 ++--
>  drivers/nvdimm/label.c|  192 +---
>  drivers/nvdimm/namespace_devs.c   |  786 
> +++--
>  drivers/nvdimm/nd-core.h  |   23 +
>  drivers/nvdimm/nd.h   |   28 +
>  drivers/nvdimm/region_devs.c  |   58 ++
>  include/linux/libnvdimm.h |   25 -
>  include/linux/nd.h|8 
>  tools/testing/nvdimm/test/iomap.c |  134 --
>  tools/testing/nvdimm/test/nfit.c  |   21 -
>  tools/testing/nvdimm/test/nfit_test.h |   12 -
>  12 files changed, 1055 insertions(+), 454 deletions(-)
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
> 


[PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Dan Williams
With the arrival of the device-dax facility in 4.7, a pmem namespace can
now be configured into a total of four distinct modes: 'raw', 'sector',
'memory', and 'dax', where raw, sector, and memory are block device
modes and dax supports the device-dax character device. With that degree
of freedom in the use cases, it is overly restrictive to continue the
current limit of only one pmem namespace per-region, or "interleave-set"
in ACPI 6+ terminology.

This series adds support for reading and writing configurations that
describe multiple pmem allocations within a region.  The new rules for
allocating / validating the available capacity when blk and pmem regions
alias are (quoting space_valid()):

    BLK-space is valid as long as it does not precede a PMEM
    allocation in a given region. PMEM-space must be contiguous
    and adjacent to an existing allocation (if one
    exists).

Where "adjacent" allocations grow an existing namespace.  Note that
growing a namespace is potentially destructive if free space is consumed
from a location preceding the current allocation.  There is no support
for dis-continuity within a given namespace allocation.
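
To make those two conditions concrete, here is a minimal stand-alone
sketch, not the kernel's space_valid(), just the same rule expressed
over a toy region that holds at most one existing pmem extent:

#include <stdbool.h>
#include <stdio.h>

/* Toy region: at most one existing pmem extent (size 0 means none). */
struct region {
    unsigned long long pmem_start;
    unsigned long long pmem_size;
};

/* BLK-space is valid as long as it does not precede the PMEM allocation. */
static bool blk_space_valid(const struct region *r, unsigned long long start)
{
    if (!r->pmem_size)
        return true;
    return start >= r->pmem_start + r->pmem_size;
}

/*
 * PMEM-space must be contiguous and adjacent to the existing allocation
 * (if one exists): a new extent may only touch the current one at either end.
 */
static bool pmem_space_valid(const struct region *r,
                             unsigned long long start, unsigned long long size)
{
    if (!r->pmem_size)
        return true;
    return start + size == r->pmem_start ||         /* grows it downward */
           start == r->pmem_start + r->pmem_size;   /* grows it upward */
}

int main(void)
{
    /* 1G of pmem already allocated at the bottom of the region. */
    struct region r = { .pmem_start = 0, .pmem_size = 1ULL << 30 };

    printf("pmem at 1G: %d\n", pmem_space_valid(&r, 1ULL << 30, 1ULL << 30)); /* 1: adjacent */
    printf("pmem at 3G: %d\n", pmem_space_valid(&r, 3ULL << 30, 1ULL << 30)); /* 0: leaves a gap */
    printf("blk  at 2G: %d\n", blk_space_valid(&r, 2ULL << 30));              /* 1: after pmem */
    printf("blk  at 0G: %d\n", blk_space_valid(&r, 0));                       /* 0: precedes pmem */
    return 0;
}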

Previously, since there was only one namespace per-region, the resulting
pmem device would be named after the region.  Now, subsequent namespaces
after the first are named with the region index and a
"." suffix. For example:

/dev/pmem0.1

---

Dan Williams (14):
  libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
  libnvdimm, label: convert label tracking to a linked list
  libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() 
helper
  libnvdimm, namespace: unify blk and pmem label scanning
  tools/testing/nvdimm: support for sub-dividing a pmem region
  libnvdimm, namespace: allow multiple pmem-namespaces per region at scan 
time
  libnvdimm, namespace: sort namespaces by dpa at init
  libnvdimm, region: update nd_region_available_dpa() for multi-pmem support
  libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
  libnvdimm, namespace: update label implementation for multi-pmem
  libnvdimm, namespace: enable allocation of multiple pmem namespaces
  libnvdimm, namespace: filter out of range labels in scan_labels()
  libnvdimm, namespace: lift single pmem limit in scan_labels()
  libnvdimm, namespace: allow creation of multiple pmem-namespaces per 
region


 drivers/acpi/nfit/core.c  |   30 +
 drivers/nvdimm/dimm_devs.c|  192 ++--
 drivers/nvdimm/label.c|  192 +---
 drivers/nvdimm/namespace_devs.c   |  786 +++--
 drivers/nvdimm/nd-core.h  |   23 +
 drivers/nvdimm/nd.h   |   28 +
 drivers/nvdimm/region_devs.c  |   58 ++
 include/linux/libnvdimm.h |   25 -
 include/linux/nd.h|8 
 tools/testing/nvdimm/test/iomap.c |  134 --
 tools/testing/nvdimm/test/nfit.c  |   21 -
 tools/testing/nvdimm/test/nfit_test.h |   12 -
 12 files changed, 1055 insertions(+), 454 deletions(-)

