Re: [PATCH v4 3/6] libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.

2017-07-05 Thread Linda Knippers
On 07/04/2017 04:37 PM, Dan Williams wrote:
> On Tue, Jul 4, 2017 at 1:08 PM, Jerry Hoemann  wrote:
>> On Sat, Jul 01, 2017 at 01:46:03PM -0700, Dan Williams wrote:
>>> On Sat, Jul 1, 2017 at 1:38 PM, Jerry Hoemann  wrote:
 On Sat, Jul 01, 2017 at 01:10:31PM -0700, Dan Williams wrote:
> On Sat, Jul 1, 2017 at 1:08 PM, Dan Williams  
> wrote:
>> On Sat, Jul 1, 2017 at 12:58 PM, Jerry Hoemann  
>> wrote:
>>> On Fri, Jun 30, 2017 at 08:55:22PM -0700, Dan Williams wrote:
>>>
>>> ...
>>>
 On Fri, Jun 30, 2017 at 9:09 AM, Jerry Hoemann  
 wrote:
> +   if (cmd == ND_CMD_CALL)
> +   dsm_mask = nd_desc->bus_dsm_mask;
> desc = nd_cmd_bus_desc(cmd);
> uuid = to_nfit_uuid(NFIT_DEV_BUS);
> handle = adev->handle;
> @@ -1613,6 +1615,7 @@ static void acpi_nfit_init_dsms(struct 
> acpi_nfit_desc *acpi_desc)
> struct nvdimm_bus_descriptor *nd_desc = &acpi_desc->nd_desc;
> const u8 *uuid = to_nfit_uuid(NFIT_DEV_BUS);
> struct acpi_device *adev;
> +   unsigned long dsm_mask;
> int i;
>
> nd_desc->cmd_mask = acpi_desc->bus_cmd_force_en;
> @@ -1624,6 +1627,11 @@ static void acpi_nfit_init_dsms(struct 
> acpi_nfit_desc *acpi_desc)
> if (acpi_check_dsm(adev->handle, uuid, 1, 1ULL << i))
> set_bit(i, &nd_desc->cmd_mask);
> set_bit(ND_CMD_CALL, &nd_desc->cmd_mask);
> +
> +   dsm_mask = 0x3bf;

 I went ahead and fixed this up to use dsm_mask defined like this:

 +   dsm_mask =
 +   (1 << ND_CMD_ARS_CAP) |
 +   (1 << ND_CMD_ARS_START) |
 +   (1 << ND_CMD_ARS_STATUS) |
 +   (1 << ND_CMD_CLEAR_ERROR) |
 +   (1 << NFIT_CMD_TRANSLATE_SPA) |
 +   (1 << NFIT_CMD_ARS_INJECT_SET) |
 +   (1 << NFIT_CMD_ARS_INJECT_CLEAR) |
 +   (1 << NFIT_CMD_ARS_INJECT_GET);

 This drops function number 0 which userspace has no need to call.
>>>
>>> Actually I like to call function 0.  It's an excellent test when
>>> modifying the code path, as it's a no-side-effects function whose output
>>> is known in advance and instantly recognizable.  I also use it when
>>> testing new firmware.
>>>
>>> What is the downside to allowing it?  What bad things happen?
>>
>> It allows implementations to bypass the standardization process and
>> ship new root DSMs. It's always possible to patch the kernel locally
>> for development, so I see no reason to ship this capability globally.

 I don't understand this comment, but I think your next comment
 essentially says to disregard this comment?
>>>
>>> Yes, sorry.
>>>
> Actually, just the discovery portion does not lead to this leak, but
> it's redundant when we have the 'dsm_mask' sysfs attribute.

 No.  The generation of the mask in sysfs is not done by
 executing the code in acpi_nfit_ctl.  That is one of the reasons I call
 function 0: to test changes I am making to the ioctl path itself.
 The sysfs mask has nothing to do with that path and cannot be used
 to serve this purpose.

 And since the content of sysfs has been edited, it also cannot be
 used as a basic test of firmware.

 What is the downside to allowing the calling of function 0?
>>>
>>> It needlessly expands the kernel ABI. I would suggest, if you want to
>>
>> No.  It is not needless.  It is not an ABI extension.
>> Same goes for the override feature.

I have never understood why allowing function 0 is considered harmful.
It is a standard function defined by ACPI in general and specifically
for NVDIMM Root Device _DSMs.  It is also defined for each vendor-specific
DSM family.  It is not an ABI extension.  It is a standard.

> If the need is testing then we have a tools/testing/nvdimm for that.
> Of course it's an ABI extension, it allows userspace to discover DSM
> function numbers the kernel didn't know about at compile time.

It also allows user space to determine which DSMs are actually supported
by the platform, which may be a subset of the defined set, in a standard
way.  Exposing information only in /sys just makes it harder for people
writing software (tools, tests, whatever) that need to support more than
just Linux.

>> I hope that ACPI doesn't extend the specification in the future because
>> we'll just have to redo these patches yet again.
> 
> Hopefully this is the last ACPI spec version where we add new DSMs to
> the root device. 

I wouldn't bet on it.

> All future methods should be named methods, like what
> the specification started doing for NVDIMM leaf devices with 
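
For readers following along, here is a minimal sketch of the gating being
discussed, reconstructed from the patch fragments quoted above. The helper
name is invented for illustration; in the driver the check sits inline in
acpi_nfit_ctl() and also validates the command itself.

	/*
	 * Sketch: a bus-scoped ND_CMD_CALL carries a raw DSM function
	 * number from userspace, which is checked against the mask that
	 * acpi_nfit_init_dsms() built at init time.
	 */
	static int nfit_check_bus_dsm(struct nvdimm_bus_descriptor *nd_desc,
			unsigned int cmd, unsigned int func)
	{
		unsigned long dsm_mask = nd_desc->cmd_mask;

		if (cmd == ND_CMD_CALL)
			dsm_mask = nd_desc->bus_dsm_mask;
		if (!test_bit(func, &dsm_mask))
			return -ENOTTY; /* function not whitelisted */
		return 0;
	}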

Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers
On 06/29/2017 06:58 PM, Dan Williams wrote:
> On Thu, Jun 29, 2017 at 3:49 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>>> The parent region of the namespace will have a 'volatile' type:
>>>
>>> # cat /sys/bus/nd/devices/region0/devtype
>>> nd_volatile
>>
>>
>> If all I know is the /dev/pmem device name, how do I find that?
>>
> 
> cat $(readlink -f /sys/block/pmem0/device)/../devtype
> 
> ...this is where 'ndctl list' will get the information.
> 

Thanks.

I think we need a section 4 pmem manpage like the ones that exist for
mem, sd, fd, md, etc., where we can put stuff like this, as well
as providing some overview information that will point people to
other resources.  I'll give that some thought unless there is one
already that I'm not finding.

-- ljk
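
To make that concrete, a small userspace sketch of the lookup Dan describes;
the device name and sysfs layout are as quoted above, and error handling is
trimmed for brevity.

	#include <limits.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		char dev[PATH_MAX], attr[PATH_MAX + 16], devtype[32];
		FILE *f;

		/* /sys/block/pmem0/device links into the nd bus hierarchy */
		if (!realpath("/sys/block/pmem0/device", dev))
			return 1;
		/* the namespace device's parent is the region */
		snprintf(attr, sizeof(attr), "%s/../devtype", dev);
		f = fopen(attr, "r");
		if (!f || !fgets(devtype, sizeof(devtype), f))
			return 1;
		/* "nd_volatile" means the backing range is not persistent */
		printf("%s", devtype);
		return 0;
	}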




Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers



On 6/29/2017 6:43 PM, Dan Williams wrote:

On Thu, Jun 29, 2017 at 3:35 PM, Linda Knippers <linda.knipp...@hpe.com> wrote:

On 06/29/2017 06:28 PM, Dan Williams wrote:

On Thu, Jun 29, 2017 at 3:12 PM, Linda Knippers <linda.knipp...@hpe.com> wrote:
[..]

The /dev/pmem
device name just tells you that your block device is hosted by a
driver that knows how to handle persistent memory constraints, but any
other details about the nature of the address range need to come from
other sources of information, and potentially information sources that
the kernel does not know about.



I'm asking about the other source of information in this specific case
where we're exposing pmem devices that will never ever be persistent.
Before we add these devices, I think we should be able to tell the user
how they can know the properties of the underlying device.


The only way I can think to indicate this is with a platform + device
whitelist in a tool like ndctl. Where the tool says "yes, these
xyz-vendor DIMMs on this abc-vendor platform with this 123-version
BIOS" is a known good persistent configuration.


Doesn't the kernel know that something will never ever be persistent
because the NFIT type says NFIT_SPA_VDISK, NFIT_SPA_VCD, or NFIT_SPA_VOLATILE?
That's the case I'm asking about here.   In this patch, you're adding support
for creating /dev/pmem devices for those address ranges.  My question is
how the admin/user knows that those devices will never ever be persistent.


The parent region of the namespace will have a 'volatile' type:

# cat /sys/bus/nd/devices/region0/devtype
nd_volatile


If all I know is the /dev/pmem device name, how do I find that?

-- ljk



I don't think we need ndctl to know which vendors' hardware/firmware
actually works as advertised.


Certification stickers are what I was thinking, but I was missing your
direct question.



Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers
On 06/29/2017 06:28 PM, Dan Williams wrote:
> On Thu, Jun 29, 2017 at 3:12 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
> [..]
>>> The /dev/pmem
>>> device name just tells you that your block device is hosted by a
>>> driver that knows how to handle persistent memory constraints, but any
>>> other details about the nature of the address range need to come from
>>> other sources of information, and potentially information sources that
>>> the kernel does not know about.
>>
>>
>> I'm asking about the other source of information in this specific case
>> where we're exposing pmem devices that will never ever be persistent.
>> Before we add these devices, I think we should be able to tell the user
>> how they can know the properties of the underlying device.
> 
> The only way I can think to indicate this is with a platform + device
> whitelist in a tool like ndctl. Where the tool says "yes, these
> xyz-vendor DIMMs on this abc-vendor platform with this 123-version
> BIOS" is a known good persistent configuration.

Doesn't the kernel know that something will never ever be persistent
because the NFIT type says NFIT_SPA_VDISK, NFIT_SPA_VCD, or NFIT_SPA_VOLATILE?
That's the case I'm asking about here.   In this patch, you're adding support
for creating /dev/pmem devices for those address ranges.  My question is
how the admin/user knows that those devices will never ever be persistent.

I don't think we need ndctl to know which vendors' hardware/firmware
actually works as advertised.

-- ljk


Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers



On 6/29/2017 5:50 PM, Dan Williams wrote:

On Thu, Jun 29, 2017 at 2:16 PM, Linda Knippers <linda.knipp...@hpe.com> wrote:

On 06/29/2017 04:42 PM, Dan Williams wrote:

On Thu, Jun 29, 2017 at 12:20 PM, Linda Knippers <linda.knipp...@hpe.com> wrote:

On 06/29/2017 01:54 PM, Dan Williams wrote:

Allow volatile nfit ranges to participate in all the same infrastructure
provided for persistent memory regions.


This seems to be a bit more than "other rework".


It's part of the rationale for having a "write_cache" control
attribute. There's only so much I can squeeze into the subject line,
but it is mentioned in the cover letter.


A resulting namespace
device will still be called "pmem", but the parent region type will be
"nd_volatile".


What does this look like to a user or admin?  How does someone know that
/dev/pmemX is persistent memory and /dev/pmemY isn't?  Someone shouldn't
have to weed through /sys, ndctl, or some other interface to figure that out
in the future if they don't have to do that today.  We have different
names for BTT namespaces.  Is there a different name for volatile ranges?


No, the block device name is still /dev/pmem. It's already the case
that you need to check behind just the name of the device to figure
out if something is actually volatile or not (see memmap=ss!nn
configurations),


I don't have any experience with using memmap but if it's primarily used
by developers without NVDIMMs, they'd know it's not persistent.  Or is it
primarily used by administrators using non-NFIT NVDIMMs, in which case it
is persistent?

In any case, how exactly does one determine whether the device is volatile
or not?  I'm dumb so tell me the command line or API.


Especially with memmap= or e820-defined memory it's unknowable from
the kernel. We don't know if the user is using it to cover for a
platform where there is no BIOS support for advertising persistent
memory, or if they have a BIOS that does not produce an NFIT as is the
case here [1], or if it is some developer just testing with no
expectation of persistence.

[1]: https://github.com/pmem/ndctl/issues/21


Ok.  I'm not really concerned about those cases but was asking since
you mentioned memmap as an example.

In any case, how does someone, like a system administrator, confirm that
a /dev/pmem device is a device that claims to be persistent?  Is there
a specific ndctl command line that would make it obvious what the Linux
device is on a device that claims to be persistent?


so I would not be in favor of changing the device
name if we think the memory might not be persistent. Moreover, I think
it was a mistake that we changed the device name for btt, and
I'm glad Matthew talked me out of making the same mistake with
memory-mode vs raw-mode pmem namespaces. So, the block device name
just reflects the driver of the block device, not the properties of
the device, just like all other block device instances.


I agree that creating a new device name for BTT was perhaps a mistake,
although it would be good to know how to query a device property for
sector atomicity.  The difference between BTT vs. non-BTT seems less
critical to me than knowing in an obvious way whether the device is
actually persistent.


We don't have a good way to answer "actually persistent" in the
general case. I'm thinking of cases where the energy source on the
DIMM has died, or we trigger one of the conditions that leads to the
""unable to guarantee persistence of writes" message.


There are certainly error conditions that can happen and we've talked
about that bit in our health status discussions.  I think the question
of whether the device is healthy enough to be persistent right now
is different from whether the device is never ever going to be persistent.


The /dev/pmem
device name just tells you that your block device is hosted by a
driver that knows how to handle persistent memory constraints, but any
other details about the nature of the address range need to come from
other sources of information, and potentially information sources that
the kernel does not know about.


I'm asking about the other source of information in this specific case
where we're exposing pmem devices that will never ever be persistent.
Before we add these devices, I think we should be able to tell the user
how they can know the properties of the underlying device.

-- ljk


Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers
On 06/29/2017 04:42 PM, Dan Williams wrote:
> On Thu, Jun 29, 2017 at 12:20 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>> On 06/29/2017 01:54 PM, Dan Williams wrote:
>>> Allow volatile nfit ranges to participate in all the same infrastructure
>>> provided for persistent memory regions.
>>
>> This seems to be a bit more than "other rework".
> 
> It's part of the rationale for having a "write_cache" control
> attribute. There's only so much I can squeeze into the subject line,
> but it is mentioned in the cover letter.
> 
>>> A resulting namespace
>>> device will still be called "pmem", but the parent region type will be
>>> "nd_volatile".
>>
>> What does this look like to a user or admin?  How does someone know that
>> /dev/pmemX is persistent memory and /dev/pmemY isn't?  Someone shouldn't
>> have to weed through /sys, ndctl, or some other interface to figure that out
>> in the future if they don't have to do that today.  We have different
>> names for BTT namespaces.  Is there a different name for volatile ranges?
> 
> No, the block device name is still /dev/pmem. It's already the case
> that you need to check behind just the name of the device to figure
> out if something is actually volatile or not (see memmap=ss!nn
> configurations),

I don't have any experience with using memmap but if it's primarily used
by developers without NVDIMMs, they'd know it's not persistent.  Or is it
primarily used by administrators using non-NFIT NVDIMMs, in which case it
is persistent?

In any case, how exactly does one determine whether the device is volatile
or not?  I'm dumb so tell me the command line or API.

> so I would not be in favor of changing the device
> name if we think the memory might not be persistent. Moreover, I think
> it was a mistake that we changed the device name for btt, and
> I'm glad Matthew talked me out of making the same mistake with
> memory-mode vs raw-mode pmem namespaces. So, the block device name
> just reflects the driver of the block device, not the properties of
> the device, just like all other block device instances.

I agree that creating a new device name for BTT was perhaps a mistake,
although it would be good to know how to query a device property for
sector atomicity.  The difference between BTT vs. non-BTT seems less
critical to me than knowing in an obvious way whether the device is
actually persistent.

-- ljk





Re: [PATCH v4 12/16] libnvdimm, nfit: enable support for volatile ranges

2017-06-29 Thread Linda Knippers
On 06/29/2017 01:54 PM, Dan Williams wrote:
> Allow volatile nfit ranges to participate in all the same infrastructure
> provided for persistent memory regions. 

This seems to be a bit more than "other rework".

> A resulting namespace
> device will still be called "pmem", but the parent region type will be
> "nd_volatile". 

What does this look like to a user or admin?  How does someone know that
/dev/pmemX is persistent memory and /dev/pmemY isn't?  Someone shouldn't
> have to weed through /sys, ndctl, or some other interface to figure that out
in the future if they don't have to do that today.  We have different
names for BTT namespaces.  Is there a different name for volatile ranges?

-- ljk

> This is in preparation for disabling the dax ->flush()
> operation in the pmem driver when it is hosted on a volatile range.
> 
> Cc: Jan Kara 
> Cc: Jeff Moyer 
> Cc: Christoph Hellwig 
> Cc: Matthew Wilcox 
> Cc: Ross Zwisler 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit/core.c|9 -
>  drivers/nvdimm/bus.c|8 
>  drivers/nvdimm/core.c   |2 +-
>  drivers/nvdimm/dax_devs.c   |2 +-
>  drivers/nvdimm/dimm_devs.c  |2 +-
>  drivers/nvdimm/namespace_devs.c |8 
>  drivers/nvdimm/nd-core.h|9 +
>  drivers/nvdimm/pfn_devs.c   |4 ++--
>  drivers/nvdimm/region_devs.c|   27 ++-
>  9 files changed, 44 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index ac2436538b7e..60d1ca149cc1 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2227,6 +2227,13 @@ static bool nfit_spa_is_virtual(struct 
> acpi_nfit_system_address *spa)
>   nfit_spa_type(spa) == NFIT_SPA_PCD);
>  }
>  
> +static bool nfit_spa_is_volatile(struct acpi_nfit_system_address *spa)
> +{
> + return (nfit_spa_type(spa) == NFIT_SPA_VDISK ||
> + nfit_spa_type(spa) == NFIT_SPA_VCD   ||
> + nfit_spa_type(spa) == NFIT_SPA_VOLATILE);
> +}
> +
>  static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
>   struct nfit_spa *nfit_spa)
>  {
> @@ -2301,7 +2308,7 @@ static int acpi_nfit_register_region(struct 
> acpi_nfit_desc *acpi_desc,
>   ndr_desc);
>   if (!nfit_spa->nd_region)
>   rc = -ENOMEM;
> - } else if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE) {
> + } else if (nfit_spa_is_volatile(spa)) {
>   nfit_spa->nd_region = nvdimm_volatile_region_create(nvdimm_bus,
>   ndr_desc);
>   if (!nfit_spa->nd_region)
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index e9361bffe5ee..4cfba534814b 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -38,13 +38,13 @@ static int to_nd_device_type(struct device *dev)
>  {
>   if (is_nvdimm(dev))
>   return ND_DEVICE_DIMM;
> - else if (is_nd_pmem(dev))
> + else if (is_memory(dev))
>   return ND_DEVICE_REGION_PMEM;
>   else if (is_nd_blk(dev))
>   return ND_DEVICE_REGION_BLK;
>   else if (is_nd_dax(dev))
>   return ND_DEVICE_DAX_PMEM;
> - else if (is_nd_pmem(dev->parent) || is_nd_blk(dev->parent))
> + else if (is_nd_region(dev->parent))
>   return nd_region_to_nstype(to_nd_region(dev->parent));
>  
>   return 0;
> @@ -56,7 +56,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
> kobj_uevent_env *env)
>* Ensure that region devices always have their numa node set as
>* early as possible.
>*/
> - if (is_nd_pmem(dev) || is_nd_blk(dev))
> + if (is_nd_region(dev))
>   set_dev_node(dev, to_nd_region(dev)->numa_node);
>   return add_uevent_var(env, "MODALIAS=" ND_DEVICE_MODALIAS_FMT,
>   to_nd_device_type(dev));
> @@ -65,7 +65,7 @@ static int nvdimm_bus_uevent(struct device *dev, struct 
> kobj_uevent_env *env)
>  static struct module *to_bus_provider(struct device *dev)
>  {
>   /* pin bus providers while regions are enabled */
> - if (is_nd_pmem(dev) || is_nd_blk(dev)) {
> + if (is_nd_region(dev)) {
>   struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
>  
>   return nvdimm_bus->nd_desc->module;
> diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
> index 2dee908e4bae..22e3ef463401 100644
> --- a/drivers/nvdimm/core.c
> +++ b/drivers/nvdimm/core.c
> @@ -504,7 +504,7 @@ void nvdimm_badblocks_populate(struct nd_region 
> *nd_region,
>   struct nvdimm_bus *nvdimm_bus;
>   struct list_head *poison_list;
>  
> - if (!is_nd_pmem(&nd_region->dev)) {
> + if (!is_memory(&nd_region->dev)) {
>   dev_WARN_ONCE(&nd_region->dev, 1,
>   "%s only valid for pmem regions\n", __func__);
>   return;
> diff --git a/drivers/nvdimm/dax_devs.c 
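
For reference, the is_memory() helper the patch switches to presumably looks
something like the sketch below, modeled on the existing is_nd_pmem() pattern
in drivers/nvdimm/nd-core.h; the exact merged form may differ.

	/* "memory" covers pmem-style regions whether the backing
	 * range is persistent or volatile
	 */
	static inline bool is_nd_volatile(struct device *dev)
	{
		return dev ? dev->type == &nd_volatile_device_type : false;
	}

	static inline bool is_memory(struct device *dev)
	{
		return is_nd_pmem(dev) || is_nd_volatile(dev);
	}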

Re: [PATCH] Add support of NVDIMM memory error notification in ACPI 6.2

2017-06-08 Thread Linda Knippers



On 6/7/2017 5:33 PM, Kani, Toshimitsu wrote:

On Wed, 2017-06-07 at 14:06 -0700, Dan Williams wrote:

On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu 
wrote:

On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote:

On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani 
wrote:


 :

+
+static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle)
+{
+   struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);
+
+   acpi_nfit_ars_rescan(acpi_desc);


I wonder if we should gate re-scanning with a similar:

if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON)

...check that we do in the mce notification case? Maybe not since we
don't get an indication of where the error is without a rescan.


I think this mce case is different since the MCE handler already
knows where the new poison location is and can update badblocks
information for it.  Starting ARS is an optional precaution.


However, at a minimum I think we need support for the new Start
ARS flag ("If set to 1 the firmware shall return data from a
previous scrub, if any, without starting a new scrub") and use
that for this case.


That's an interesting idea.  But I wonder how users know if it is
OK to set this flag as it relies on BIOS implementation that is not
described in ACPI...


Ugh, you're right. We might need a revision-id=2 version of Start ARS
so software knows that the BIOS is aware of the new flag.


My bad.  Looking at ACPI 6.2, it actually defines what you described.
Start ARS now defines bit[1] in Flags which can be set to avoid
scanning for this notification.  I will update the patch to set this
flag when HW_ERROR_SCRUB_ON is not set.


Wasn't Dan concerned about how the OS can know whether the FW supports
that bit in the Start ARS?

The Query ARS Capabilities DSM has a bit that tells the OS whether the
platform supports the notification and the point of the notification was
to tell the OS it could do a Start ARS with bit 1 set.  Of course, if
you get the notification then that means the platform has the capability
to deliver it, but it might not hurt to check the flag from the Query
Capabilities bit.

If the spec isn't clear about the relationship between the bits, we could
clarify that in the spec without bumping the revision number for the call.

We could also check to see what a platform does if it gets a flag it
doesn't know about.  Would that be an Invalid Input Parameter?

-- ljk


Thanks,
-Toshi
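
A sketch of the update Toshi describes; it assumes acpi_nfit_ars_rescan()
grows a flags argument, and ND_ARS_RETURN_PREV_DATA is a made-up name for
the ACPI 6.2 Start ARS Flags bit[1], since the merged constant is not
visible in this thread.

	/* hypothetical name for Start ARS Flags bit[1] */
	#define ND_ARS_RETURN_PREV_DATA	(1 << 1)

	static void acpi_nfit_uc_error_notify(struct device *dev,
			acpi_handle handle)
	{
		struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev);

		/* only force a fresh scrub when HW_ERROR_SCRUB_ON is set;
		 * otherwise ask firmware for the data it already has
		 */
		if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON)
			acpi_nfit_ars_rescan(acpi_desc, 0);
		else
			acpi_nfit_ars_rescan(acpi_desc, ND_ARS_RETURN_PREV_DATA);
	}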



Re: [PATCH] libnvdimm, region: sysfs trigger for nvdimm_flush()

2017-04-24 Thread Linda Knippers
On 04/21/2017 07:48 PM, Dan Williams wrote:
> The nvdimm_flush() mechanism helps to reduce the impact of an ADR
> (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
> platform WPQ (write-pending-queue) buffers when power is removed. The
> nvdimm_flush() mechanism performs that same function on-demand.
> 
> When a pmem namespace is associated with a block device, an
> nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
> request. However, when a namespace is in device-dax mode, or namespaces
> are disabled, userspace needs another path.

Why would a user need to flush a disabled namespace?

> The new 'flush' attribute is visible when it can be determined that the
> interleave-set either does, or does not have DIMMs that expose WPQ-flush
> addresses, "flush-hints" in ACPI NFIT terminology. It returns "1" and
> flushes DIMMs, or returns "0" when the flush operation is a platform nop.

It seems a little odd to me that reading a read-only attribute both
tells you that the device has flush hints and also triggers a flush.
This means that anyone at any time can cause a flush.  Do we want that?

-- ljk

> 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/region_devs.c |   17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 8de5a04644a1..3495b4c23941 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -255,6 +255,19 @@ static ssize_t size_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(size);
>  
> +static ssize_t flush_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct nd_region *nd_region = to_nd_region(dev);
> +
> + if (nvdimm_has_flush(nd_region)) {
> + nvdimm_flush(nd_region);
> + return sprintf(buf, "1\n");
> + }
> + return sprintf(buf, "0\n");
> +}
> +static DEVICE_ATTR_RO(flush);
> +
>  static ssize_t mappings_show(struct device *dev,
>   struct device_attribute *attr, char *buf)
>  {
> @@ -474,6 +487,7 @@ static DEVICE_ATTR_RO(resource);
>  
>  static struct attribute *nd_region_attributes[] = {
>   &dev_attr_size.attr,
> + &dev_attr_flush.attr,
>   &dev_attr_nstype.attr,
>   &dev_attr_mappings.attr,
>   &dev_attr_btt_seed.attr,
> @@ -508,6 +522,9 @@ static umode_t region_visible(struct kobject *kobj, 
> struct attribute *a, int n)
>   if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
>   return 0;
>  
> + if (a == &dev_attr_flush.attr && nvdimm_has_flush(nd_region) < 0)
> + return 0;
> +
>   if (a != &dev_attr_set_cookie.attr
>   && a != &dev_attr_available_size.attr)
>   return a->mode;
> 
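
To make the concern above concrete: with this patch, any read of the
attribute performs the flush. A trivial userspace sketch (region number and
path are illustrative):

	#include <stdio.h>

	int main(void)
	{
		char buf[4];
		FILE *f = fopen("/sys/bus/nd/devices/region0/flush", "r");

		if (!f || !fgets(buf, sizeof(buf), f))
			return 1;
		/* "1" => WPQ flush performed; "0" => platform nop */
		printf("flush: %s", buf);
		return 0;
	}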



Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Linda Knippers


On 10/7/2016 3:52 PM, Dan Williams wrote:
> On Fri, Oct 7, 2016 at 11:19 AM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>> Hi Dan,
>>
>> A couple of general questions...
>>
>> On 10/7/2016 12:38 PM, Dan Williams wrote:
>>> With the arrival of the device-dax facility in 4.7 a pmem namespace can
>>> now be configured into a total of four distinct modes: 'raw', 'sector',
>>> 'memory', and 'dax'. Where raw, sector, and memory are block device
>>> modes and dax supports the device-dax character device. With that degree
>>> of freedom in the use cases it is overly restrictive to continue the
>>> current limit of only one pmem namespace per-region, or "interleave-set"
>>> in ACPI 6+ terminology.
>>
>> If I understand correctly, at least some of the restrictions were
>> part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
>> The most recent namespace spec on pmem.io hasn't been updated to remove
>> those restrictions.  Is there a different public spec?
> 
> Yes, this is Linux specific and use of this capability needs to be
> cognizant that it could create a configuration that is not understood
> by EFI, or other OSes (including older Linux implementations).  I plan
> to add documentation to ndctl along these lines.  This is similar to
> the current situation with 'pfn' and 'dax' info blocks that are also
> Linux specific.  However, I should note that this implementation
> changes none of the interpretation of the fields nor layout of the
> existing label specification.  It simply allows two pmem labels that
> happen to appear in the same region to result in two namespaces rather
> than 0.

Ok, but the namespace spec says that's not allowed.  It seemed like an odd
restriction to be in the label spec but it is there.
> 
>>> This series adds support for reading and writing configurations that
>>> describe multiple pmem allocations within a region.  The new rules for
>>> allocating / validating the available capacity when blk and pmem regions
>>> alias are (quoting space_valid()):
>>>
>>>BLK-space is valid as long as it does not precede a PMEM
>>>allocation in a given region. PMEM-space must be contiguous
>>>and adjacent to an existing allocation (if one
>>>exists).
>>
>> Why is this new rule necessary?  Is this a HW-specific rule or something
>> related to how Linux could possibly support something?  Why do we care
>> whether blk-space is before or after pmem-space? If it's a HW-specific
>> rule, then shouldn't the enforcement be in the management tool that
>> configures the namespaces?
> 
> It is not HW specific, and it's not new in the sense that we already
> arrange for pmem to be allocated from low addresses and blk to be
> allocated from high addresses.  

Who's the "we"?  Does the location within the region come from the OS
or from the tool that created the namespace?  (I should probably know
this but not having labels, I've never looked at this.)

If we're relaxing some of the rules, it seems like one could have
pmem, then block, then free space, and later want to use free space
for another pmem range.  If hardware supported it and the management
tool created it, would the kernel allow it?

> If another implementation violated
> this constraint Linux would parse it just fine. The constraint is a
> Linux decision to maximize available pmem capacity when blk and pmem
> alias.  So this is a situation where Linux is liberal in what it will
> accept when reading labels, but conservative on the configurations it
> will create when writing labels.

Is it ndctl that's being conservative?  It seems like the kernel shouldn't care.
> 
>>> Where "adjacent" allocations grow an existing namespace.  Note that
>>> growing a namespace is potentially destructive if free space is consumed
>>> from a location preceding the current allocation.  There is no support
>>> for dis-continuity within a given namespace allocation.
>>
>> Are you talking about DPAs here?
> 
> No, this is referring to system-physical-address partitioning.
> 
>>> Previously, since there was only one namespace per-region, the resulting
>>> pmem device would be named after the region.  Now, subsequent namespaces
>>> after the first are named with the region index and a
>>> "." suffix. For example:
>>>
>>>   /dev/pmem0.1
>>
>> According to the existing namespace spec, you can already have multiple
>> block namespaces on a device. I've not see a system with block namespaces
>> so what do those /dev entries look like?  (The dots are somewhat 
>> unattractive.)
> 
> Block namespaces result in devices with names like "/dev/ndblk0.0"
> where the X.Y numbers are ..  This new
> naming for pmem devices is following that precedent.  The "dot" was
> originally adopted from Linux USB device naming.

Does this mean that if someone updates their kernel then their /dev/pmem0
becomes /dev/pmem0.0?  Or do you only get the dot if there is more
than one namespace per region?

-- ljk
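
Restating the space_valid() rule quoted earlier in this thread as code, a
toy model of the stated policy, not the kernel's implementation:

	#include <stdbool.h>
	#include <stdint.h>

	struct span { uint64_t start, size; };

	/* BLK-space must not precede a PMEM allocation in the region */
	static bool blk_space_valid(struct span blk, const struct span *pmem)
	{
		return !pmem || blk.start >= pmem->start + pmem->size;
	}

	/* PMEM-space must be contiguous and adjacent to any existing
	 * allocation
	 */
	static bool pmem_space_valid(struct span next, const struct span *pmem)
	{
		if (!pmem)
			return true; /* first pmem allocation in the region */
		return next.start + next.size == pmem->start ||
		       pmem->start + pmem->size == next.start;
	}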


> 


Re: [PATCH 00/14] libnvdimm: support sub-divisions of pmem for 4.9

2016-10-07 Thread Linda Knippers
Hi Dan,

A couple of general questions...

On 10/7/2016 12:38 PM, Dan Williams wrote:
> With the arrival of the device-dax facility in 4.7 a pmem namespace can
> now be configured into a total of four distinct modes: 'raw', 'sector',
> 'memory', and 'dax'. Where raw, sector, and memory are block device
> modes and dax supports the device-dax character device. With that degree
> of freedom in the use cases it is overly restrictive to continue the
> current limit of only one pmem namespace per-region, or "interleave-set"
> in ACPI 6+ terminology.

If I understand correctly, at least some of the restrictions were
part of the Intel NVDIMM Namespace spec rather than ACPI/NFIT restrictions.
The most recent namespace spec on pmem.io hasn't been updated to remove
those restrictions.  Is there a different public spec?

> This series adds support for reading and writing configurations that
> describe multiple pmem allocations within a region.  The new rules for
> allocating / validating the available capacity when blk and pmem regions
> alias are (quoting space_valid()):
> 
>BLK-space is valid as long as it does not precede a PMEM
>allocation in a given region. PMEM-space must be contiguous
>and adjacent to an existing allocation (if one
>exists).

Why is this new rule necessary?  Is this a HW-specific rule or something
related to how Linux could possibly support something?  Why do we care
whether blk-space is before or after pmem-space? If it's a HW-specific
rule, then shouldn't the enforcement be in the management tool that
configures the namespaces?

> Where "adjacent" allocations grow an existing namespace.  Note that
> growing a namespace is potentially destructive if free space is consumed
> from a location preceding the current allocation.  There is no support
> for dis-continuity within a given namespace allocation.

Are you talking about DPAs here?

> Previously, since there was only one namespace per-region, the resulting
> pmem device would be named after the region.  Now, subsequent namespaces
> after the first are named with the region index and a
> "." suffix. For example:
> 
>   /dev/pmem0.1

According to the existing namespace spec, you can already have multiple
> block namespaces on a device. I've not seen a system with block namespaces
so what do those /dev entries look like?  (The dots are somewhat unattractive.)

-- ljk
> 
> ---
> 
> Dan Williams (14):
>   libnvdimm, region: move region-mapping input-paramters to 
> nd_mapping_desc
>   libnvdimm, label: convert label tracking to a linked list
>   libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() 
> helper
>   libnvdimm, namespace: unify blk and pmem label scanning
>   tools/testing/nvdimm: support for sub-dividing a pmem region
>   libnvdimm, namespace: allow multiple pmem-namespaces per region at scan 
> time
>   libnvdimm, namespace: sort namespaces by dpa at init
>   libnvdimm, region: update nd_region_available_dpa() for multi-pmem 
> support
>   libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
>   libnvdimm, namespace: update label implementation for multi-pmem
>   libnvdimm, namespace: enable allocation of multiple pmem namespaces
>   libnvdimm, namespace: filter out of range labels in scan_labels()
>   libnvdimm, namespace: lift single pmem limit in scan_labels()
>   libnvdimm, namespace: allow creation of multiple pmem-namespaces per 
> region
> 
> 
>  drivers/acpi/nfit/core.c  |   30 +
>  drivers/nvdimm/dimm_devs.c|  192 ++--
>  drivers/nvdimm/label.c|  192 +---
>  drivers/nvdimm/namespace_devs.c   |  786 +++--
>  drivers/nvdimm/nd-core.h  |   23 +
>  drivers/nvdimm/nd.h   |   28 +
>  drivers/nvdimm/region_devs.c  |   58 ++
>  include/linux/libnvdimm.h |   25 -
>  include/linux/nd.h|8 
>  tools/testing/nvdimm/test/iomap.c |  134 --
>  tools/testing/nvdimm/test/nfit.c  |   21 -
>  tools/testing/nvdimm/test/nfit_test.h |   12 -
>  12 files changed, 1055 insertions(+), 454 deletions(-)
> 


Re: [PATCH] acpi, nfit: fix acpi event notifications for nfit

2016-08-18 Thread Linda Knippers


On 8/18/2016 3:48 PM, Dan Williams wrote:
> On Thu, Aug 18, 2016 at 11:48 AM, Vishal Verma  
> wrote:
>> The nfit driver had an acpi event notification handler, but it never
>> would've worked because we weren't setting the
>> ACPI_DRIVER_ALL_NOTIFY_EVENTS flag in acpi_driver.
> 
> Let's update the changelog to be helpful for someone implementing a
> backport or taking this back to a -stable branch.  Something like:
> 
> Subject: acpi, nfit: fix event notifications
> 
> Commit 209851649dc4 "acpi: nfit: Add support for hot-add" added
> support for _FIT notifications, but it neglected to set the
> ACPI_DRIVER_ALL_NOTIFY_EVENTS flag that acpi_bus_notify() uses to gate
> notification delivery.

While we're at it, should we update the notifier function to explicitly check
for event 0x80 before re-evaluating the _FIT?  I'm thinking about some time
in the future when there might be more than one event.

-- ljk
> 
> Fixes: 209851649dc4 ("acpi: nfit: Add support for hot-add")
> 
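The check suggested in the message above would amount to a one-line gate in the notify handler; a hedged sketch (the handler name and surrounding code are assumptions, not from the thread):

	/* Sketch only: re-evaluate _FIT solely for the NFIT update
	 * notification (conventionally event value 0x80), so that any
	 * future event types are not mistaken for a _FIT change. */
	static void sketch_nfit_notify(struct acpi_device *adev, u32 event)
	{
		if (event != 0x80)
			return;
		/* ... evaluate _FIT and re-run NFIT initialization ... */
	}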


Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Linda Knippers
On 07/22/2016 03:25 PM, Linda Knippers wrote:
> 
> 
> On 7/22/2016 11:36 AM, Viresh Kumar wrote:
>> On 22-07-16, 17:14, Andreas Herrmann wrote:
>>> diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
>>> index a7ecb9a..3f0ce2a 100644
>>> --- a/drivers/cpufreq/pcc-cpufreq.c
>>> +++ b/drivers/cpufreq/pcc-cpufreq.c
>>> @@ -555,8 +555,6 @@ static int pcc_cpufreq_cpu_init(struct cpufreq_policy 
>>> *policy)
>>> policy->min = policy->cpuinfo.min_freq =
>>> ioread32(&pcch_hdr->minimum_frequency) * 1000;
>>>  
>>> -   policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
>>> -
>>> pr_debug("init: policy->max is %d, policy->min is %d\n",
>>> policy->max, policy->min);
>>>  out:
>>
>> Hi Rafael,
>>
>> I am very confused on this, can you help me understand ?
>>
>> - CPUFREQ_ETERNAL = -1
>> - unsigned int transition_latency = CPUFREQ_ETERNAL, will set it to UINT_MAX.
>> - Many drivers do it today
>>
>> cpufreq.c
>>
>>  if (policy->governor->max_transition_latency &&
>>  policy->cpuinfo.transition_latency >
>>  policy->governor->max_transition_latency) {
>>
>> - And this check will always fail, unless max_transition_latency is zero.
>>
>> What am I missing ?
> 
> I don't know what's missing but I can reproduce the problem.

I added a debug message to show the transition latency values.

[   36.113829] cpufreq: ondemand governor failed, too long transition latency 
of HW, fallback to
performance governor
[   36.164688] cpufreq: cpufreq_governor: max_transition_latency 0x1000, 
transition_latency
0x4294967295

max_transition_latency for ondemand seems to come from
#define TRANSITION_LATENCY_LIMIT (10 * 1000 * 1000)

How does this work for any driver?
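The arithmetic is easy to reproduce outside the kernel; a standalone illustration in plain C, using only the values quoted above:

	#include <limits.h>
	#include <stdio.h>

	#define CPUFREQ_ETERNAL			(-1)
	#define TRANSITION_LATENCY_LIMIT	(10 * 1000 * 1000)

	int main(void)
	{
		/* the implicit conversion at the root of the problem */
		unsigned int transition_latency = CPUFREQ_ETERNAL;

		printf("transition_latency = %u (UINT_MAX = %u)\n",
		       transition_latency, UINT_MAX);
		/* 4294967295 > 10000000, so a governor with a nonzero
		 * latency limit (like ondemand) always rejects the policy */
		return transition_latency > TRANSITION_LATENCY_LIMIT ? 1 : 0;
	}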

-- ljk
> 
> -- ljk
> 



Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Linda Knippers
On 07/22/2016 03:25 PM, Linda Knippers wrote:
> 
> 
> On 7/22/2016 11:36 AM, Viresh Kumar wrote:
>> On 22-07-16, 17:14, Andreas Herrmann wrote:
>>> diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
>>> index a7ecb9a..3f0ce2a 100644
>>> --- a/drivers/cpufreq/pcc-cpufreq.c
>>> +++ b/drivers/cpufreq/pcc-cpufreq.c
>>> @@ -555,8 +555,6 @@ static int pcc_cpufreq_cpu_init(struct cpufreq_policy 
>>> *policy)
>>> policy->min = policy->cpuinfo.min_freq =
>>> ioread32(_hdr->minimum_frequency) * 1000;
>>>  
>>> -   policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
>>> -
>>> pr_debug("init: policy->max is %d, policy->min is %d\n",
>>> policy->max, policy->min);
>>>  out:
>>
>> Hi Rafael,
>>
>> I am very confused on this, can you help me understand ?
>>
>> - CPUFREQ_ETERNAL = -1
>> - unsigned int transition_latency = CPUFREQ_ETERNAL, will set it to UINT_MAX.
>> - Many drivers do it today
>>
>> cpufreq.c
>>
>>  if (policy->governor->max_transition_latency &&
>>  policy->cpuinfo.transition_latency >
>>  policy->governor->max_transition_latency) {
>>
>> - And this check will always fail, unless max_transition_latency is zero.
>>
>> What am I missing ?
> 
> I don't know what's missing but I can reproduce the problem.

I added a debug message to show the transition latency values.

[   36.113829] cpufreq: ondemand governor failed, too long transition latency 
of HW, fallback to
performance governor
[   36.164688] cpufreq: cpufreq_governor: max_transition_latency 0x1000, 
transition_latency
0x4294967295

max_transition latency for ondemand seems to come from
#define TRANSITION_LATENCY_LIMIT(10 * 1000 * 1000)

How does this work for any driver?

-- ljk
> 
> -- ljk
> 



Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Linda Knippers


On 7/22/2016 11:36 AM, Viresh Kumar wrote:
> On 22-07-16, 17:14, Andreas Herrmann wrote:
>> diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
>> index a7ecb9a..3f0ce2a 100644
>> --- a/drivers/cpufreq/pcc-cpufreq.c
>> +++ b/drivers/cpufreq/pcc-cpufreq.c
>> @@ -555,8 +555,6 @@ static int pcc_cpufreq_cpu_init(struct cpufreq_policy 
>> *policy)
>>  policy->min = policy->cpuinfo.min_freq =
>>  ioread32(&pcch_hdr->minimum_frequency) * 1000;
>>  
>> -policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
>> -
>>  pr_debug("init: policy->max is %d, policy->min is %d\n",
>>  policy->max, policy->min);
>>  out:
> 
> Hi Rafael,
> 
> I am very confused on this, can you help me understand ?
> 
> - CPUFREQ_ETERNAL = -1
> - unsigned int transition_latency = CPUFREQ_ETERNAL, will set it to UINT_MAX.
> - Many drivers do it today
> 
> cpufreq.c
> 
>   if (policy->governor->max_transition_latency &&
>   policy->cpuinfo.transition_latency >
>   policy->governor->max_transition_latency) {
> 
> - And this check will always fail, unless max_transition_latency is zero.
> 
> What am I missing ?

I don't know what's missing but I can reproduce the problem.

-- ljk


Re: [PATCH] Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"

2016-07-22 Thread Linda Knippers


On 7/22/2016 11:36 AM, Viresh Kumar wrote:
> On 22-07-16, 17:14, Andreas Herrmann wrote:
>> diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
>> index a7ecb9a..3f0ce2a 100644
>> --- a/drivers/cpufreq/pcc-cpufreq.c
>> +++ b/drivers/cpufreq/pcc-cpufreq.c
>> @@ -555,8 +555,6 @@ static int pcc_cpufreq_cpu_init(struct cpufreq_policy 
>> *policy)
>>  policy->min = policy->cpuinfo.min_freq =
>>  ioread32(_hdr->minimum_frequency) * 1000;
>>  
>> -policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
>> -
>>  pr_debug("init: policy->max is %d, policy->min is %d\n",
>>  policy->max, policy->min);
>>  out:
> 
> Hi Rafael,
> 
> I am very confused on this, can you help me understand ?
> 
> - CPUFREQ_ETERNAL = -1
> - unsigned int transition_latency = CPUFREQ_ETERNAL, will set it to UINT_MAX.
> - Many drivers do it today
> 
> cpufreq.c
> 
>   if (policy->governor->max_transition_latency &&
>   policy->cpuinfo.transition_latency >
>   policy->governor->max_transition_latency) {
> 
> - And this check will always fail, unless max_transition_latency is zero.
> 
> What am I missing ?

I don't know what's missing but I can reproduce the problem.

-- ljk


Re: [PATCH v2 3/3] nfit: do an ARS scrub on hitting a latent media error

2016-07-21 Thread Linda Knippers


On 7/20/2016 9:50 PM, Vishal Verma wrote:
> When a latent (unknown to 'badblocks') error is encountered, it will
> trigger a machine check exception. On a system with machine check
> recovery, this will only SIGBUS the process(es) which had the bad page
> mapped (as opposed to a kernel panic on platforms without machine
> check recovery features). In the former case, we want to trigger a full
> rescan of that nvdimm bus. This will allow any additional, new errors
> to be captured in the block devices' badblocks lists, and offending
> operations on them can be trapped early, avoiding machine checks.

Do we really need to rescan all SPA ranges?  If the problem is with
an NVDIMM, wouldn't the blast radius be the device or its interleave set,
which would be part of the same SPA range?

I don't know what the overhead associated with a scan is, which is why I'm 
asking.

-- ljk

> 
> This is done by registering a callback function with the
> x86_mce_decoder_chain and calling the new ars_rescan functionality with
> the address in the mce notification.
> 
> Cc: Dan Williams 
> Cc: Rafael J. Wysocki 
> Cc: Tony Luck 
> Cc: 
> Cc: 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/nfit.c | 89 +
>  drivers/acpi/nfit.h |  1 +
>  2 files changed, 90 insertions(+)
> 
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index 4e65255..c9f1ee4 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -12,6 +12,7 @@
>   */
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -24,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "nfit.h"
>  
>  /*
> @@ -51,6 +53,9 @@ module_param(disable_vendor_specific, bool, S_IRUGO);
>  MODULE_PARM_DESC(disable_vendor_specific,
>   "Limit commands to the publicly specified set\n");
>  
> +static LIST_HEAD(acpi_descs);
> +static DEFINE_MUTEX(acpi_desc_lock);
> +
>  static struct workqueue_struct *nfit_wq;
>  
>  struct nfit_table_prev {
> @@ -2416,9 +2421,11 @@ static int acpi_nfit_check_deletions(struct 
> acpi_nfit_desc *acpi_desc,
>  
>  int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
>  {
> + struct acpi_nfit_desc *acpi_desc_entry;
>   struct device *dev = acpi_desc->dev;
>   struct nfit_table_prev prev;
>   const void *end;
> + int found = 0;
>   u8 *data;
>   int rc;
>  
> @@ -2473,6 +2480,19 @@ int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, 
> acpi_size sz)
>  
>   rc = acpi_nfit_register_regions(acpi_desc);
>  
> + /*
> +  * We may get here due to an update of the nfit via _FIT.
> +  * Check if the acpi_desc we're (re)initializing is already
> +  * present in the list, and if so, don't re-add it
> +  */
> + mutex_lock(&acpi_desc_lock);
> + list_for_each_entry(acpi_desc_entry, &acpi_descs, list)
> + if (acpi_desc_entry == acpi_desc)
> + found = 1;
> + if (found == 0)
> + list_add_tail(&acpi_desc->list, &acpi_descs);
> + mutex_unlock(&acpi_desc_lock);
> +
>   out_unlock:
>   mutex_unlock(&acpi_desc->init_mutex);
>   return rc;
> @@ -2555,6 +2575,65 @@ static int acpi_nfit_ars_rescan(struct acpi_nfit_desc 
> *acpi_desc)
>   return 0;
>  }
>  
> +static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
> + void *data)
> +{
> + struct mce *mce = (struct mce *)data;
> + struct acpi_nfit_desc *acpi_desc;
> + struct nfit_spa *nfit_spa;
> +
> + /* We only care about memory errors */
> + if (!(mce->status & MCACOD))
> + return NOTIFY_DONE;
> +
> + /*
> +  * mce->addr contains the physical addr accessed that caused the
> +  * machine check. We need to walk through the list of NFITs, and see
> +  * if any of them matches that address, and only then start a scrub.
> +  */
> + mutex_lock(&acpi_desc_lock);
> + list_for_each_entry(acpi_desc, &acpi_descs, list) {
> + struct device *dev = acpi_desc->dev;
> + int found_match = 0;
> +
> + list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
> + struct acpi_nfit_system_address *spa = nfit_spa->spa;
> +
> + if (nfit_spa_type(spa) != NFIT_SPA_PM)
> + continue;
> + /* find the spa that covers the mce addr */
> + if (spa->address > mce->addr)
> + continue;
> + if ((spa->address + spa->length - 1) < mce->addr)
> + continue;
> + found_match = 1;
> + dev_dbg(dev, "%s: addr in SPA %d (0x%llx, 0x%llx)\n",
> + __func__, spa->range_index, spa->address,
> + 
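The SPA-matching loop in the patch above reduces to a simple containment test; factored into a helper here for clarity (illustrative only, not part of the patch):

	/* True when the system-physical-address range described by an NFIT
	 * SPA entry covers the address reported in the machine check. */
	static bool sketch_spa_contains(u64 spa_address, u64 spa_length, u64 addr)
	{
		return addr >= spa_address && addr <= spa_address + spa_length - 1;
	}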

Re: [PATCH v2 3/3] nfit: do an ARS scrub on hitting a latent media error

2016-07-21 Thread Linda Knippers


On 7/21/2016 5:10 PM, Vishal Verma wrote:
> On 07/21, Linda Knippers wrote:
>>
>>
>> On 7/20/2016 9:50 PM, Vishal Verma wrote:
>>> When a latent (unknown to 'badblocks') error is encountered, it will
>>> trigger a machine check exception. On a system with machine check
>>> recovery, this will only SIGBUS the process(es) which had the bad page
>>> mapped (as opposed to a kernel panic on platforms without machine
>>> check recovery features). In the former case, we want to trigger a full
>>> rescan of that nvdimm bus. This will allow any additional, new errors
>>> to be captured in the block devices' badblocks lists, and offending
>>> operations on them can be trapped early, avoiding machine checks.
>>
>> Do we really need to rescan all SPA ranges?  If the problem is with
>> an NVDIMM, wouldn't the blast radius be the device or its interleave set,
>> which would be part of the same SPA range?
>>
>> I don't know what the overhead associated with a scan is which is why I'm 
>> asking.
> 
> You're right that we don't _need_ to scan all ranges, and that the scrub
> can be long-running, but we just take this 'event' as an opportunity to
> basically refresh everything. Since it is asynchronous, we're not
> holding anything up.

We're not holding up anything in the kernel, but I assume it's
not a zero-overhead operation.  The memory controller may be doing something
or the platform firmware could be doing something which could introduce
latency spikes.  It's the kind of thing that really annoys some customers
but maybe following an MCE no one cares about that.

-- ljk
> 
>>
>> -- ljk
>>
>>>
>>> This is done by registering a callback function with the
>>> x86_mce_decoder_chain and calling the new ars_rescan functionality with
>>> the address in the mce notification.
>>>
>>> Cc: Dan Williams <dan.j.willi...@intel.com>
>>> Cc: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>>> Cc: Tony Luck <tony.l...@intel.com>
>>> Cc: <linux-a...@vger.kernel.org>
>>> Cc: <linux-nvd...@lists.01.org>
>>> Signed-off-by: Vishal Verma <vishal.l.ve...@intel.com>
>>> ---
>>>  drivers/acpi/nfit.c | 89 +
>>>  drivers/acpi/nfit.h |  1 +
>>>  2 files changed, 90 insertions(+)
>>>
>>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>>> index 4e65255..c9f1ee4 100644
>>> --- a/drivers/acpi/nfit.c
>>> +++ b/drivers/acpi/nfit.c
>>> @@ -12,6 +12,7 @@
>>>   */
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -24,6 +25,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include "nfit.h"
>>>  
>>>  /*
>>> @@ -51,6 +53,9 @@ module_param(disable_vendor_specific, bool, S_IRUGO);
>>>  MODULE_PARM_DESC(disable_vendor_specific,
>>> "Limit commands to the publicly specified set\n");
>>>  
>>> +static LIST_HEAD(acpi_descs);
>>> +static DEFINE_MUTEX(acpi_desc_lock);
>>> +
>>>  static struct workqueue_struct *nfit_wq;
>>>  
>>>  struct nfit_table_prev {
>>> @@ -2416,9 +2421,11 @@ static int acpi_nfit_check_deletions(struct 
>>> acpi_nfit_desc *acpi_desc,
>>>  
>>>  int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, acpi_size sz)
>>>  {
>>> +   struct acpi_nfit_desc *acpi_desc_entry;
>>> struct device *dev = acpi_desc->dev;
>>> struct nfit_table_prev prev;
>>> const void *end;
>>> +   int found = 0;
>>> u8 *data;
>>> int rc;
>>>  
>>> @@ -2473,6 +2480,19 @@ int acpi_nfit_init(struct acpi_nfit_desc *acpi_desc, 
>>> acpi_size sz)
>>>  
>>> rc = acpi_nfit_register_regions(acpi_desc);
>>>  
>>> +   /*
>>> +* We may get here due to an update of the nfit via _FIT.
>>> +* Check if the acpi_desc we're (re)initializing is already
>>> +* present in the list, and if so, don't re-add it
>>> +*/
>>> +   mutex_lock(&acpi_desc_lock);
>>> +   list_for_each_entry(acpi_desc_entry, &acpi_descs, list)
>>> +   if (acpi_desc_entry == acpi_desc)
>>> +   found = 1;
>>> +   if (found == 0)
>>> +   list_add_tail(&acpi_desc->list, &acpi_descs);
>>> +   mutex_unlock(&acpi_desc_loc

Re: [PATCH v2 2/3] nfit, libnvdimm: allow an ARS scrub to be triggered on demand

2016-07-21 Thread Linda Knippers
On 07/20/2016 09:50 PM, Vishal Verma wrote:
> Normally, an ARS (Address Range Scrub) only happens at
> boot/initialization time. There can however arise situations where a
> bus-wide rescan is needed - notably, in the case of discovering a latent
> media error, we should do a full rescan to figure out what other sectors
> are bad, and thus potentially avoid triggering an mce on them in the
> future. Also provide a sysfs trigger to start a bus-wide scrub.

I don't see anything in here that checks to see if the platform actually
supports ARS before setting all this stuff up.  Setting up an MCE handler
and exposing a sysfs trigger for something that is optional and perhaps
not implemented doesn't seem helpful.  Or is there a check that I missed?

-- ljk

> 
> Cc: Dan Williams 
> Cc: Rafael J. Wysocki 
> Cc: 
> Cc: 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/nfit.c  | 123 +--
>  drivers/acpi/nfit.h  |   4 +-
>  drivers/nvdimm/core.c|   7 +++
>  include/linux/libnvdimm.h|   1 +
>  tools/testing/nvdimm/test/nfit.c |  16 +
>  5 files changed, 131 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index ac6ddcc0..4e65255 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -806,8 +807,41 @@ static ssize_t revision_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(revision);
>  
> +/*
> + * This shows the number of full Address Range Scrubs that have been
> + * completed since driver load time. Userspace can wait on this using
> + * select/poll etc.
> + */
> +static ssize_t scrub_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
> + struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
> + struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
> +
> + return sprintf(buf, "%d\n", acpi_desc->scrub_count);
> +}
> +
> +static int acpi_nfit_ars_rescan(struct acpi_nfit_desc *acpi_desc);
> +
> +static ssize_t scrub_store(struct device *dev,
> + struct device_attribute *attr, const char *buf, size_t size)
> +{
> + struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
> + struct nvdimm_bus_descriptor *nd_desc = to_nd_desc(nvdimm_bus);
> + struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
> + int rc;
> +
> + rc = acpi_nfit_ars_rescan(acpi_desc);
> + if (rc)
> + return rc;
> + return size;
> +}
> +static DEVICE_ATTR_RW(scrub);
> +
>  static struct attribute *acpi_nfit_attributes[] = {
>   &dev_attr_revision.attr,
> + &dev_attr_scrub.attr,
>   NULL,
>  };
>  
> @@ -2138,7 +2172,7 @@ static void acpi_nfit_async_scrub(struct acpi_nfit_desc 
> *acpi_desc,
>   unsigned int tmo = scrub_timeout;
>   int rc;
>  
> - if (nfit_spa->ars_done || !nfit_spa->nd_region)
> + if (!(nfit_spa->ars_required && nfit_spa->nd_region))
>   return;
>  
>   rc = ars_start(acpi_desc, nfit_spa);
> @@ -2227,7 +2261,9 @@ static void acpi_nfit_scrub(struct work_struct *work)
>* firmware initiated scrubs to complete and then we go search for the
>* affected spa regions to mark them scanned.  In the second phase we
>* initiate a directed scrub for every range that was not scrubbed in
> -  * phase 1.
> +  * phase 1. If we're called for a 'rescan', we harmlessly pass through
> +  * the first phase, but really only care about running phase 2, where
> +  * regions can be notified of new poison.
>*/
>  
>   /* process platform firmware initiated scrubs */
> @@ -2330,14 +2366,17 @@ static void acpi_nfit_scrub(struct work_struct *work)
>* Flag all the ranges that still need scrubbing, but
>* register them now to make data available.
>*/
> - if (nfit_spa->nd_region)
> - nfit_spa->ars_done = 1;
> - else
> + if (!nfit_spa->nd_region) {
> + nfit_spa->ars_required = 1;
>   acpi_nfit_register_region(acpi_desc, nfit_spa);
> + }
>   }
>  
>   list_for_each_entry(nfit_spa, &acpi_desc->spas, list)
>   acpi_nfit_async_scrub(acpi_desc, nfit_spa);
> + acpi_desc->scrub_count++;
> + if (acpi_desc->scrub_count_state)
> + sysfs_notify_dirent(acpi_desc->scrub_count_state);
>   mutex_unlock(&acpi_desc->init_mutex);
>  }
>  
> @@ -2495,6 +2534,27 @@ static int acpi_nfit_clear_to_send(struct 
> nvdimm_bus_descriptor *nd_desc,
>   return 0;
>  }
>  
> +static int acpi_nfit_ars_rescan(struct acpi_nfit_desc *acpi_desc)
> +{
> + 
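The patch comment above notes that userspace can wait on the 'scrub' attribute with select/poll. A hedged userspace sketch of that pattern (the sysfs path is an assumption; sysfs change notifications arrive as POLLPRI and the attribute must be read before and after the poll):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[16];
		/* path is illustrative; the real location depends on the bus device */
		int fd = open("/sys/bus/nd/devices/ndbus0/nfit/scrub", O_RDONLY);
		struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

		if (fd < 0)
			return 1;
		pread(fd, buf, sizeof(buf), 0);	/* arm the notification */
		poll(&pfd, 1, -1);		/* wait for sysfs_notify_dirent() */
		pread(fd, buf, sizeof(buf), 0);	/* read the fresh scrub count */
		printf("scrub count: %s", buf);
		close(fd);
		return 0;
	}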

Re: [PATCH v2 2/3] nfit, libnvdimm: allow an ARS scrub to be triggered on demand

2016-07-21 Thread Linda Knippers


On 7/21/2016 3:46 PM, Dan Williams wrote:
> On Thu, Jul 21, 2016 at 12:40 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>> On 07/20/2016 09:50 PM, Vishal Verma wrote:
>>> Normally, an ARS (Address Range Scrub) only happens at
>>> boot/initialization time. There can however arise situations where a
>>> bus-wide rescan is needed - notably, in the case of discovering a latent
>>> media error, we should do a full rescan to figure out what other sectors
>>> are bad, and thus potentially avoid triggering an mce on them in the
>>> future. Also provide a sysfs trigger to start a bus-wide scrub.
>>
>> I don't see anything in here that checks to see if the platform actually
>> supports ARS before setting all this stuff up.  Setting up an MCE handler
>> and exposing a sysfs trigger for something that is optional and perhaps
>> not implemented doesn't seem helpful.  Or is there a check that I missed?
> 
> We'll get -ENOTTY to ars_start(), but you're right it's a good idea to
> hide the scrub attribute if a platform does not have ars support.
> 
> Vishal, can you add an is_visible() routine to
> acpi_nfit_attribute_group() to hide 'scrub' on platforms that do not
> implement the ARS commands?

It's also possible that a platform might only support ARS at boot time
so subsequent scrubs would fail or not return any new information.
I don't think there's a way to know that in advance, though.

-- ljk
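
A hedged sketch of the is_visible() approach suggested above (the support predicate and helper names are assumptions, not from the thread):

	/* Hide the 'scrub' attribute when the platform lacks ARS support.
	 * ars_supported() stands in for whatever predicate the driver can
	 * derive from the bus command mask; it is not a real function. */
	static umode_t sketch_nfit_visible(struct kobject *kobj,
			struct attribute *a, int n)
	{
		if (a == &dev_attr_scrub.attr && !ars_supported())
			return 0;	/* attribute is not created */
		return a->mode;
	}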



Re: IOMMU+DMAR causing NMIs-s

2016-07-13 Thread Linda Knippers


On 7/13/2016 10:48 AM, Alex Williamson wrote:
> On Wed, 13 Jul 2016 12:18:59 +0200
> Joerg Roedel  wrote:
> 
>> On Wed, Jul 13, 2016 at 12:58:24PM +0300, Meelis Roos wrote:
>>>>> Just got http://kodu.ut.ee/~mroos/4.6-dmar-fault2.png when playing with
>>>>> BIOS settings (disabling NUMA). It is the first time I see at least some
>>>>> info in NMI decode.
>>>>
>>>> This looks interesting. Can you please post output of 'lspci -vvv' and
>>>> 'lspci -t'?
>>>
>>> Here.  
>>
>> Thanks. So device 00:1e.0 is a PCI-bridge which has some 32-bit
>> PCI-devices behind it. One of these devices tries to read address
>> 0xb000, which is blocked by the IOMMU and causes the fault seen in the
>> screen-shot. The fault also causes a PCI-error which is then reported
>> through the NMI, causing your kernel panic.
>>
>> So the 32bit PCI devices behind the bridge are:
>>
>> 01:03.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
>> ES1000 (rev 02) (prog-if 00 [VGA controller])
>> 01:04.0 System peripheral: Compaq Computer Corporation Integrated Lights Out 
>> Controller (rev 03)
>> 01:04.2 System peripheral: Compaq Computer Corporation Integrated Lights Out 
>>  Processor (rev 03)
>> 01:04.4 USB controller: Hewlett-Packard Company Integrated Lights-Out 
>> Standard Virtual USB Controller (prog-if 00 [UHCI])
>> 01:04.6 IPMI SMIC interface: Hewlett-Packard Company Integrated Lights-Out 
>> Standard KCS Interface (prog-if 01)
>>
>> Can you try to disable this 'Lights Out' processor? Maybe it is causing
>> the issues. On the other side, the radeon driver for the ATI card is
>> also known for causing faults from time to time. Can you capture the
>> kernel messages right before a crash too?
> 
> IIRC, blacklisting the hpwdt module can defuse those NMIs and might
> help us see more of the actual DMAR faults.  Blacklist in modprobe.d
> and rebuild initrd. Thanks,
> 
> Alex
> 
> PS - never assume BIOS release notes are actually complete

I agree. I'd do the BIOS update and also make sure the iLO FW is current.

-- ljk
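
For reference, the blacklist Alex suggests is a one-line modprobe.d entry (the file name is illustrative):

	# /etc/modprobe.d/hpwdt-blacklist.conf
	blacklist hpwdt

After adding it, rebuild the initrd with the distribution's tool (e.g. 'dracut -f' or 'update-initramfs -u') so hpwdt is not pulled in at boot.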
> 



Re: [PATCH v2] libnvdimm, nfit: treat volatile virtual CD region as pmem

2016-06-27 Thread Linda Knippers
tual CD 
>>> and
>>> virtual Disk. Looks the "SPA index" is not really _reserved_.
>>
>> If we had multiple VCD regions userspace would benefit from some way
>> to the distinguish them which is why sysfs exports the spa range
>> index.  If there are no memdev associations the spa range index is
>> likely the only identifying attribute we have left.
>>
> 
> Yes, this is a benefit when using multiple VCD regions. 
> 
> Maybe the SPA index of VCD doesn't need to be zero, and it also doesn't need
> associated by NVDIMM region mapping. But ACPI spec needs to be changed.
> 
> Hi Linda, do you have any opinion on whether the VCD SPA index shall be set to
> zero?

My opinion is that we should fix the spec.  But in the meantime, I think the
value should be zero.

My understanding is that the primary use of this feature is to support
http boot with UEFI and in that case, there is probably only one device.
I think having these as "pmem" devices is already confusing so what's a little
more confusion if we end up with more than one of them.

-- ljk

>  
>>>
>>>>>
>>>>> After testing on OVMF, the pmem driver can support a region that isn't
>>>>> associated with any NVDIMM mapping. So, treating VCD as pmem is an idea
>>>>> to get a pmem block device that contains the iso.
>>>>>
>>>>> v2
>>>>> Removed the code for setting VCD to a read-only region.
>>>>>
>>>>> Cc: Gary Lin <g...@suse.com>
>>>>> Cc: Dan Williams <dan.j.willi...@intel.com>
>>>>> Cc: Ross Zwisler <ross.zwis...@linux.intel.com>
>>>>> Cc: "Rafael J. Wysocki" <r...@rjwysocki.net>
>>>>> Cc: Linda Knippers <linda.knipp...@hpe.com>
>>>>> Signed-off-by: Lee, Chun-Yi <j...@suse.com>
>>>>> ---
>>>>>  drivers/acpi/nfit.c  |  8 +++-
>>>>>  drivers/nvdimm/region_devs.c | 26 +-
>>>>>  include/linux/libnvdimm.h|  2 ++
>>>>>  3 files changed, 34 insertions(+), 2 deletions(-)
>>>>>
>>>>
>>>> I don't think we need "nd_vcd_device_type".  I think this patch can
>>>> just be reduced to:
>>>>
>>>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>>>> index 32579a7b71d5..d289e772c20a 100644
>>>> --- a/drivers/acpi/nfit.c
>>>> +++ b/drivers/acpi/nfit.c
>>>> @@ -1948,6 +1948,7 @@ static int acpi_nfit_init_mapping(struct
>>>> acpi_nfit_desc *acpi_desc,
>>>> nd_mapping->nvdimm = nvdimm;
>>>> switch (nfit_spa_type(spa)) {
>>>> case NFIT_SPA_PM:
>>>> +   case NFIT_SPA_VCD:
>>>> case NFIT_SPA_VOLATILE:
>>>> nd_mapping->start = memdev->address;
>>>> nd_mapping->size = memdev->region_size;
>>>
>>> It does not work.
>>>
>>> Actually I removed the above code in version two because it is useless.
>>>
>>> A VCD SPA should not be referenced by any "NVDIMM Region Mapping Structure",
>>> so it does not have any memdev in the memdevs list for mapping. That means
>>> acpi_nfit_init_mapping() is never called from acpi_nfit_register_region().
>>
>> Ah, sorry!  I should probably mock this up in nfit_test before making
>> assertions like that.  I still want to find a way to minimize this
>> patch and not introduce a new region type, because it just complicates
>> the implementation for no discernible benefit.
> 
> Thanks for your suggestions. In the next version, I will remove the VCD
> region from libnvdimm to simplify my patch. Then all changes will be only in
> acpi/nfit.
> 
> 
> Thanks a lot!
> Joey Lee
> 


Re: [PATCH] libnvdimm, nfit: treat volatile virtual CD region as read-only pmem

2016-06-09 Thread Linda Knippers
On 6/4/2016 7:01 AM, joeyli wrote:
> Hi Dan, 
> 
> Thanks for your review.
> 
> On Fri, Jun 03, 2016 at 12:27:34PM -0700, Dan Williams wrote:
>> On Fri, Jun 3, 2016 at 12:13 AM, Lee, Chun-Yi  
>> wrote:
>>> This patch adds codes to treat a volatile virtual CD region as a
>>> read-only pmem region, then read-only /dev/pmem* device can be mounted
>>> with iso9660.
>>>
>>> It's useful to work with the httpboot in EFI firmware to pull a remote
>>> ISO file to the local memory region for booting and installation.
>>>
>>> Wiki page of UEFI HTTPBoot with OVMF:
>>> https://en.opensuse.org/UEFI_HTTPBoot_with_OVMF
>>>
>>> Signed-off-by: Lee, Chun-Yi 
>>> Cc: Gary Lin 
>>> Cc: Dan Williams 
>>> Cc: Ross Zwisler 
>>> Cc: "Rafael J. Wysocki" 
>>> ---
>>>  drivers/acpi/nfit.c  |  8 +++-
>>>  drivers/nvdimm/region_devs.c | 26 +-
>>>  include/linux/libnvdimm.h|  2 ++
>>>  3 files changed, 34 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>>> index 2215fc8..b100a17 100644
>>> --- a/drivers/acpi/nfit.c
>>> +++ b/drivers/acpi/nfit.c
>>> @@ -1949,6 +1949,7 @@ static int acpi_nfit_init_mapping(struct 
>>> acpi_nfit_desc *acpi_desc,
>>> switch (nfit_spa_type(spa)) {
>>> case NFIT_SPA_PM:
>>> case NFIT_SPA_VOLATILE:
>>> +   case NFIT_SPA_VCD:
>>> nd_mapping->start = memdev->address;
>>> nd_mapping->size = memdev->region_size;
>>> break;
>>
>> Why do we need to distinguish NFIT_SPA_VOLATILE vs NFIT_SPA_VCD, i.e.
>> what happens if something writes to a VCD device?
> 
> Actually I didn't try to write to the SPA-VCD device before. Every time I
> mount it, the system responds read-only:
> 
> # mount /dev/pmem0 /mnt/
> mount: /dev/pmem0 is write-protected, mounting read-only
> 
> If it can be written, then I think there is no difference between a
> NFIT_SPA_VOLATILE and a NFIT_SPA_VCD region.

It's not clear to me what the expectations for this type of device are, or
whether they should be read-only.  The ACPI spec is not helpful here.
The other Disk Region and CD Region types are also unclear.  Anyone
care to define them?
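
For reference, these are the SPA range types the kernel already knows how
to identify, quoting the enum from drivers/acpi/nfit.h (from memory, so
double-check the source):

	NFIT_SPA_VOLATILE,	/* volatile memory */
	NFIT_SPA_PM,		/* persistent memory */
	NFIT_SPA_DCR,		/* control region */
	NFIT_SPA_BDW,		/* block data window */
	NFIT_SPA_VDISK,		/* volatile virtual disk */
	NFIT_SPA_VCD,		/* volatile virtual CD */
	NFIT_SPA_PDISK,		/* persistent virtual disk */
	NFIT_SPA_PCD,		/* persistent virtual CD */

The spec defines GUIDs for all of these but says almost nothing about the
expected semantics of the virtual disk/CD types.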

> I implemented this patch to treat the VCD region as read-only pmem because
> the pmem region generates a /dev/pmem* device that can be mounted.

I'm a bit worried about this type of device showing up as a "pmem" device.
I realize they're described in the NFIT (not sure why but they are) but do
any of the operations that we support for other pmem devices work on these?
Do root device DSMs make any sense?  Are there other DSMs?  What will happen
if someone uses ndctl to reconfigure the device?

I'm especially concerned about systems that might have one of these devices
and also have NVDIMMs.  Do all the pmem devices get numbered differently if
this device comes and goes across reboots?  (I know, we shouldn't rely on
those names but it will still confuse people.)  Can they be some other name
that better represents what they're trying to be?

> Maybe I missed something. Does a NFIT_SPA_VOLATILE region also generate a
> device in /dev that can be mounted with a filesystem?

Yes.

-- ljk

> Then I think treating the VCD region as
> a read-only VOLATILE region is also a solution.
> 
> 
> Thanks a lot!
> Joey Lee
> 


Re: [PATCH 01/13] driver core, libnvdimm: disable manual unbind of dimms while region active

2016-06-06 Thread Linda Knippers


On 6/6/2016 4:36 PM, Dan Williams wrote:
> On Mon, Jun 6, 2016 at 1:20 PM, Linda Knippers <linda.knipp...@hpe.com> wrote:
> [..]
>>> A solution to the posted-write-queue flushing needs to be available
>>> and a platform can choose to use flush hints or ADR.  If the NFIT
>>> defines an NVDIMM device without hints we assume the platform must
>>> have ADR.  If the platform NFIT neglects to define an NVDIMM to
>>> physical address range mapping, we warn about a potentially broken
>>> BIOS.  Hopefully we can make this clearer in future versions of the
>>> spec.
>>
>> You lost me on those last 2 sentences.  An NVDIMM doesn't have to have
>> an SPA range, but that seems to be unrelated to pcommit or flushes.
> 
> Ok, now you've lost me...
> 
> We need a SPA range to be able to do I/O whether that SPA range is
> direct access to media or a block-window aperture.  If an NFIT
> includes a "System Physical Address (SPA) Range Structure", but
> neglects to include a corresponding "NVDIMM Region Mapping Structure"
> then the kernel has no idea what actual dimm device(s) back that
> memory region.  Without a memory device mapping it is undefined
> whether I/O to a SPA range requires flushing or not.

Right, if you have an SPA Range Structure, you need an NVDIMM Region
Mapping Structure referencing it (although now I'm wondering if that's
true if it's one of the other GUID types, like a virtual CD...).  But the
reverse isn't true.  You can have a Region Mapping Structure without an
SPA range, and that's what I thought you were flagging as an error in
your previous sentence.

> This patch set silences the warning about "not being able to guarantee
> persistence" when the BIOS provides a "NVDIMM Region Mapping
> Structure".  When that structure is present the kernel uses flush
> hints when provided, but ADR otherwise.  See the implementation of
> nvdimm_flush() in patch 4.

Ok, I see that now.

Patch 0 and the commit messages could have used a bit more info.
The big picture was a bit difficult to see, and patch 4 doesn't say
anything about finishing the TODOs from the previous patch.
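
For anyone else trying to follow along, my reading of the nvdimm_flush()
path in patch 4 boils down to roughly this (a simplified sketch, not the
literal patch; the flush_wpq field name is assumed):

	/* flush posted-write queues for all dimms backing a region */
	void nvdimm_flush(struct nd_region *nd_region)
	{
		struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
		int i;

		wmb();	/* order prior writes before triggering the WPQ flush */
		for (i = 0; i < nd_region->ndr_mappings; i++)
			if (ndrd->flush_wpq[i])	/* dimm provides flush hints */
				writeq(1, ndrd->flush_wpq[i]);
		wmb();	/* dimms without hints are assumed covered by ADR */
	}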

Thanks,

-- ljk


Re: [PATCH 01/13] driver core, libnvdimm: disable manual unbind of dimms while region active

2016-06-06 Thread Linda Knippers


On 6/6/2016 3:46 PM, Dan Williams wrote:
> On Mon, Jun 6, 2016 at 12:36 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>>
>>
>> On 6/6/2016 3:31 PM, Dan Williams wrote:
>>> On Mon, Jun 6, 2016 at 12:25 PM, Linda Knippers <linda.knipp...@hpe.com> 
>>> wrote:
>>>> On 6/4/2016 4:52 PM, Dan Williams wrote:
>>>>> There are scenarios where we need a middle ground between disabling all
>>>>> manual bind/unbind attempts (via driver->suppress_bind_attrs) and
>>>>> allowing unbind at any userspace-determined time.  Pinning modules takes
>>>>> away one vector for unwanted out-of-sequence device_release_driver()
>>>>> invocations, this new mechanism (via device->suppress_unbind_attr) takes
>>>>> away another.
>>>>>
>>>>> The first user of this mechanism is the libnvdimm sub-system where
>>>>> manual dimm disabling should be prevented while the dimm is active in
>>>>> any region.  Note that there is a 1:N dimm-to-region relationship which
>>>>> is why this is implemented as a disable count rather than a flag.  This
>>>>> forces userspace to disable regions before dimms when manually shutting
>>>>> down a bus topology.
>>>>
>>>> How is this related to deprecating pcommit?
>>>
>>> We need guarantees that the flush hint mappings are valid for the
>>> duration of a pmem namespace being enabled.  I am going to move the
>>> mapping of the flush hint region from per-dimm to per-region.  However
>>> since multiple regions may reference the same dimm the mapping needs
>>> to be reference counted and shared across regions.  This will be
>>> similar to the arrangement we have for BLK-regions that share a
>>> control region mapping.
>>
>> Why are things moving around?  Aren't flush hints defined per NFIT device
>> handle, making them an optional per-dimm thing?
>>
>> I don't understand a lot of this patch series and had the same questions
>> as Jeff.  How does deprecating pcommit, because it's not necessary with ADR,
>> change so much?
> 
> This patch set deprecates pcommit, and introduces the usage of flush
> hints into the pmem path.  The introduction patch used the word
> "usually"; I should have fleshed that out to say "usually ADR, or
> explicit use of flush hints".
> 
> A solution to the posted-write-queue flushing needs to be available
> and a platform can choose to use flush hints or ADR.  If the NFIT
> defines an NVDIMM device without hints we assume the platform must
> have ADR.  If the platform NFIT neglects to define an NVDIMM to
> physical address range mapping, we warn about a potentially broken
> BIOS.  Hopefully we can make this clearer in future versions of the
> spec.

You lost me on those last 2 sentences.  An NVDIMM doesn't have to have
an SPA range, but that seems to be unrelated to pcommit or flushes.

-- ljk


Re: [PATCH 01/13] driver core, libnvdimm: disable manual unbind of dimms while region active

2016-06-06 Thread Linda Knippers


On 6/6/2016 3:31 PM, Dan Williams wrote:
> On Mon, Jun 6, 2016 at 12:25 PM, Linda Knippers <linda.knipp...@hpe.com> 
> wrote:
>> On 6/4/2016 4:52 PM, Dan Williams wrote:
>>> There are scenarios where we need a middle ground between disabling all
>>> manual bind/unbind attempts (via driver->suppress_bind_attrs) and
>>> allowing unbind at any userspace-determined time.  Pinning modules takes
>>> away one vector for unwanted out-of-sequence device_release_driver()
>>> invocations, this new mechanism (via device->suppress_unbind_attr) takes
>>> away another.
>>>
>>> The first user of this mechanism is the libnvdimm sub-system where
>>> manual dimm disabling should be prevented while the dimm is active in
>>> any region.  Note that there is a 1:N dimm-to-region relationship which
>>> is why this is implemented as a disable count rather than a flag.  This
>>> forces userspace to disable regions before dimms when manually shutting
>>> down a bus topology.
>>
>> How is this related to deprecating pcommit?
> 
> We need guarantees that the flush hint mappings are valid for the
> duration of a pmem namespace being enabled.  I am going to move the
> mapping of the flush hint region from per-dimm to per-region.  However
> since multiple regions may reference the same dimm the mapping needs
> to be reference counted and shared across regions.  This will be
> similar to the arrangement we have for BLK-regions that share a
> control region mapping.

Why are things moving around?  Aren't flush hints defined per NFIT device
handle, making them an optional per-dimm thing?

I don't understand a lot of this patch series and had the same questions
as Jeff.  How does deprecating pcommit, because it's not necessary with ADR,
change so much?
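
For reference, the flush hints really are keyed by device handle in the
table itself; the ACPICA definition is roughly this (quoting from memory,
so double-check the header):

	struct acpi_nfit_flush_address {
		struct acpi_nfit_header header;
		u32 device_handle;	/* the dimm these hints apply to */
		u16 hint_count;
		u8 reserved[6];
		u64 hint_address[1];	/* hint_count flush addresses */
	};

which is why they read as an optional per-dimm property.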

-- ljk
> 


Re: [PATCH 01/13] driver core, libnvdimm: disable manual unbind of dimms while region active

2016-06-06 Thread Linda Knippers
On 6/4/2016 4:52 PM, Dan Williams wrote:
> There are scenarios where we need a middle ground between disabling all
> manual bind/unbind attempts (via driver->suppress_bind_attrs) and
> allowing unbind at any userspace-determined time.  Pinning modules takes
> away one vector for unwanted out-of-sequence device_release_driver()
> invocations; this new mechanism (via device->suppress_unbind_attr) takes
> away another.
> 
> The first user of this mechanism is the libnvdimm sub-system where
> manual dimm disabling should be prevented while the dimm is active in
> any region.  Note that there is a 1:N dimm-to-region relationship which
> is why this is implemented as a disable count rather than a flag.  This
> forces userspace to disable regions before dimms when manually shutting
> down a bus topology.
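
(If I'm reading the patch right, that means a manual unbind of an in-use
dimm, e.g. something like "echo -n nmem0 > /sys/bus/nd/drivers/nvdimm/unbind",
now fails with -EBUSY until the dimm's regions are disabled first.)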

How is this related to deprecating pcommit?

-- ljk
> 
> This does not affect any of the kernel-internal initiated invocations of
> device_release_driver().
> 
> Cc: Greg Kroah-Hartman 
> Signed-off-by: Dan Williams 
> ---
>  drivers/base/base.h |1 +
>  drivers/base/bus.c  |   12 ++--
>  drivers/base/core.c |1 +
>  drivers/base/dd.c   |2 +-
>  drivers/nvdimm/namespace_devs.c |1 +
>  drivers/nvdimm/region_devs.c|4 +++-
>  include/linux/device.h  |   20 
>  7 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index e05db388bd1c..129814b17ca6 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -109,6 +109,7 @@ extern int bus_add_driver(struct device_driver *drv);
>  extern void bus_remove_driver(struct device_driver *drv);
>  
>  extern void driver_detach(struct device_driver *drv);
> +extern void __device_release_driver(struct device *dev);
>  extern int driver_probe_device(struct device_driver *drv, struct device 
> *dev);
>  extern void driver_deferred_probe_del(struct device *dev);
>  static inline int driver_match_device(struct device_driver *drv,
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index 6470eb8088f4..b48a903f2d28 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -188,10 +188,18 @@ static ssize_t unbind_store(struct device_driver *drv, 
> const char *buf,
>   if (dev && dev->driver == drv) {
>   if (dev->parent)/* Needed for USB */
>   device_lock(dev->parent);
> - device_release_driver(dev);
> +
> + device_lock(dev);
> + if (atomic_read(&dev->suppress_unbind_attr) > 0)
> + err = -EBUSY;
> + else {
> + __device_release_driver(dev);
> + err = count;
> + }
> + device_unlock(dev);
> +
>   if (dev->parent)
>   device_unlock(dev->parent);
> - err = count;
>   }
>   put_device(dev);
>   bus_put(bus);
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 0a8bdade53f2..0ea0e8560e1d 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -708,6 +708,7 @@ void device_initialize(struct device *dev)
>   INIT_LIST_HEAD(&dev->devres_head);
>   device_pm_init(dev);
>   set_dev_node(dev, -1);
> + atomic_set(&dev->suppress_unbind_attr, 0);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
>   INIT_LIST_HEAD(&dev->msi_list);
>  #endif
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 16688f50729c..9e21817ca2d6 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -756,7 +756,7 @@ EXPORT_SYMBOL_GPL(driver_attach);
>   * __device_release_driver() must be called with @dev lock held.
>   * When called for a USB interface, @dev->parent lock must be held as well.
>   */
> -static void __device_release_driver(struct device *dev)
> +void __device_release_driver(struct device *dev)
>  {
>   struct device_driver *drv;
>  
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index c5e3196c45b0..e65572b6092c 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1950,6 +1950,7 @@ static int init_active_labels(struct nd_region 
> *nd_region)
>   }
>   nd_mapping->ndd = ndd;
>   atomic_inc(&nvdimm->busy);
> + device_disable_unbind_attr(&nvdimm->dev);
>   get_ndd(ndd);
>  
>   count = nd_label_active_count(ndd);
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index 40fcfea26fbb..320f0f3ea367 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -427,8 +427,10 @@ static void nd_region_notify_driver_action(struct 
> nvdimm_bus *nvdimm_bus,
>   nd_mapping->labels = NULL;
>   put_ndd(ndd);
>   nd_mapping->ndd = NULL;
> - if (ndd)
> + if (ndd) {
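
The include/linux/device.h hunk is cut off above, but judging from the
call sites the new helpers are presumably along these lines (a sketch;
the names come from the patch, the bodies are assumed):

	static inline void device_disable_unbind_attr(struct device *dev)
	{
		atomic_inc(&dev->suppress_unbind_attr);
	}

	static inline void device_enable_unbind_attr(struct device *dev)
	{
		atomic_dec(&dev->suppress_unbind_attr);
	}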


Re: [PATCH v6] acpi: Issue _OSC call for native thermal interrupt handling

2016-03-23 Thread Linda Knippers
I raised a general concern on a previous patch so I found a 1P server
with Skylake and HWP to try.  This doesn't qualify as a tested-by
since all I did was apply the patch and boot the server but hey, it booted.

I do have a question below...

On 3/23/2016 12:07 AM, Srinivas Pandruvada wrote:
> There are several reports of freezes when enabling the HWP (Hardware
> PStates) feature on Skylake based systems via the Intel P-states driver.
> The root cause is identified as the HWP interrupts causing BIOS code to
> freeze. HWP interrupts use the thermal LVT.
> Linux natively handles thermal interrupts, but on Skylake based systems
> SMM will take control of thermal interrupts. This is a problem for several
> reasons:
> - It freezes in the BIOS when it tries to handle the thermal interrupt,
> which will require a BIOS upgrade
> - With SMM handling thermal we lose all the reporting features of the
> Linux arch/x86/kernel/cpu/mcheck/therm_throt driver
> - Some thermal drivers like the x86-package-temp driver depend on the
> thermal threshold interrupts
> - The HWP interrupts are useful for debugging and tuning performance
> 
> So we need native handling of thermal interrupts. This requires some
> way to inform SMM that the OS can handle thermal interrupts. This can be
> done by using _OSC/_PDC under the processor scope very early in the ACPI
> initialization flow.
> Bit 12 of _OSC/_PDC in the processor scope defines whether the OS supports
> handling of native interrupts for Collaborative Processor Performance
> Control (CPPC) notifications. Since HWP is an implementation of CPPC,
> setting this bit is equivalent to informing SMM that the OS is capable of
> handling thermal interrupts.
> Refer to this document for details on _OSC/_PDC
> http://www.intel.com/content/www/us/en/standards/processor-vendor-
> specific-acpi-specification.html
> 
> This change introduces a new function, acpi_early_processor_set_osc(),
> which walks the ACPI namespace, finds the ACPI processor objects, and
> sets the capability via the _OSC method.
> 
> Also this change writes HWP status bits to 0 to clear any HWP status
> bits in intel_thermal_interrupt().
> 
> Signed-off-by: Srinivas Pandruvada 
> ---
> v6:
> Added __init for two functions and moved dummy function to internal.h
> Shortened the name of the function.
> 
> v5:
> No code change. Changed commit message to explain HWP is a implementation
> of CPPC.
> 
> v4:
> Suggestion by Boris for removing use of intermediate variable for
> clearing HWP status and using boot_cpu_has instead of static_cpu_has
> 
> v3:
> - Added CONFIG_X86 around static_cpu_has() to fix compile error on
> ARCH=ia64, reported by kbuild test robot
> - return AE_CTRL_TERMINATE to terminate acpi name walk space, when _OSC
> is set already once.
> v2:
> Unnecessary newline was introduced, removed that in acpi_processor.c
> 
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  3 +++
>  drivers/acpi/acpi_processor.c| 44 ++++++++++++++++++++++++++++++++++++++++++++
>  drivers/acpi/bus.c   |  3 +++
>  drivers/acpi/internal.h  |  6 ++++++
>  4 files changed, 56 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c 
> b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 2c5aaf8..0553858 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -385,6 +385,9 @@ static void intel_thermal_interrupt(void)
>  {
>   __u64 msr_val;
>  
> + if (static_cpu_has(X86_FEATURE_HWP))
> + wrmsrl_safe(MSR_HWP_STATUS, 0);
> +
>   rdmsrl(MSR_IA32_THERM_STATUS, msr_val);
>  
>   /* Check for violation of core thermal thresholds*/
> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> index 6979186..ba50f46 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -491,6 +491,50 @@ static void acpi_processor_remove(struct acpi_device 
> *device)
>  }
>  #endif /* CONFIG_ACPI_HOTPLUG_CPU */
>  
> +#ifdef CONFIG_X86
> +static bool acpi_hwp_native_thermal_lvt_set;
> +static acpi_status __init acpi_hwp_native_thermal_lvt_osc(acpi_handle handle,
> +   u32 lvl,
> +   void *context,
> +   void **rv)
> +{
> + u8 sb_uuid_str[] = "4077A616-290C-47BE-9EBD-D87058713953";
> + u32 capbuf[2];
> + struct acpi_osc_context osc_context = {
> + .uuid_str = sb_uuid_str,
> + .rev = 1,
> + .cap.length = 8,
> + .cap.pointer = capbuf,
> + };
> +
> + if (acpi_hwp_native_thermal_lvt_set)
> + return AE_CTRL_TERMINATE;
> +
> + capbuf[0] = 0x0000;
> + capbuf[1] = 0x1000; /* set bit 12 */
> +
> + if (ACPI_SUCCESS(acpi_run_osc(handle, &osc_context))) {
> + acpi_hwp_native_thermal_lvt_set = true;
> + kfree(osc_context.ret.pointer);

There are other boot messages that indicate when something is 


Re: [PATCH v4] acpi: Issue _OSC call for native thermal interrupt handling

2016-03-19 Thread Linda Knippers

On 3/17/2016 2:24 PM, Srinivas Pandruvada wrote:
> There are several reports of freezes when enabling the HWP (Hardware
> PStates) feature on Skylake based systems via the Intel P-states driver.
> The root cause is identified as the HWP interrupts causing BIOS code to
> freeze. HWP interrupts use the thermal LVT.
> Linux natively handles thermal interrupts, but on Skylake based systems
> SMM will take control of thermal interrupts. This is a problem for several
> reasons:
> - It freezes in the BIOS when it tries to handle the thermal interrupt,
> which will require a BIOS upgrade
> - With SMM handling thermal we lose all the reporting features of the
> Linux arch/x86/kernel/cpu/mcheck/therm_throt driver
> - Some thermal drivers like the x86-package-temp driver depend on the
> thermal threshold interrupts
> - The HWP interrupts are useful for debugging and tuning performance
> 
> So we need native handling of thermal interrupts. 

Is this working around a firmware bug?

> To inform SMM that
> the OS will handle thermal interrupts, we need to use _OSC under the
> processor scope very early in the ACPI initialization flow.

Does _OSC say that OS "will" handle thermal interrupts or that it "can"
handle thermal interrupts?  Couldn't a platform decide it wants to handle
them, regardless of what the OS may be capable of?

> This needs to be done
> before the SMM code path looks for _OSC capabilities. Bit 12 of
> _OSC in the processor scope defines whether the OS will handle thermal
> interrupts. When bit 12 is set to 1, the OS will handle thermal interrupts.
> Refer to this document for details on _OSC
> http://www.intel.com/content/www/us/en/standards/processor-vendor-
> specific-acpi-specification.html

Where is bit 12 documented?

> This change introduces a new function, acpi_early_processor_set_osc(),
> which walks the ACPI namespace, finds the ACPI processor objects, and
> sets the capability via the _OSC method to take over the thermal LVT.

Does this change just affect Skylake platforms or all platforms?

-- ljk
> 
> Also this change writes HWP status bits to 0 to clear any HWP status
> bits in intel_thermal_interrupt().
> 
> Signed-off-by: Srinivas Pandruvada 
> ---
> v4:
> Suggestion by Boris for removing use of intermediate variable for
> clearing HWP status and using boot_cpu_has instead of static_cpu_has
> 
> v3:
> - Added CONFIG_X86 around static_cpu_has() to fix compile error on
> ARCH=ia64, reported by kbuild test robot
> - return AE_CTRL_TERMINATE to terminate acpi name walk space, when _OSC
> is set already once.
> v2:
> Unnecessary newline was introduced, removed that in acpi_processor.c
> 
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  3 +++
>  drivers/acpi/acpi_processor.c| 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/acpi/bus.c   |  3 +++
>  drivers/acpi/internal.h  |  2 ++
>  4 files changed, 55 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c 
> b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 2c5aaf8..0553858 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -385,6 +385,9 @@ static void intel_thermal_interrupt(void)
>  {
>   __u64 msr_val;
>  
> + if (static_cpu_has(X86_FEATURE_HWP))
> + wrmsrl_safe(MSR_HWP_STATUS, 0);
> +
>   rdmsrl(MSR_IA32_THERM_STATUS, msr_val);
>  
>   /* Check for violation of core thermal thresholds*/
> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> index 6979186..18da84f 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -491,6 +491,53 @@ static void acpi_processor_remove(struct acpi_device 
> *device)
>  }
>  #endif /* CONFIG_ACPI_HOTPLUG_CPU */
>  
> +#ifdef CONFIG_X86
> +static bool acpi_hwp_native_thermal_lvt_set;
> +static acpi_status acpi_set_hwp_native_thermal_lvt_osc(acpi_handle handle,
> +u32 lvl, void *context,
> +void **rv)
> +{
> + u8 sb_uuid_str[] = "4077A616-290C-47BE-9EBD-D87058713953";
> + u32 capbuf[2];
> + struct acpi_osc_context osc_context = {
> + .uuid_str = sb_uuid_str,
> + .rev = 1,
> + .cap.length = 8,
> + .cap.pointer = capbuf,
> + };
> +
> + if (acpi_hwp_native_thermal_lvt_set)
> + return AE_CTRL_TERMINATE;
> +
> + capbuf[0] = 0x0000;
> + capbuf[1] = 0x1000; /* set bit 12 */
> +
> + if (ACPI_SUCCESS(acpi_run_osc(handle, &osc_context))) {
> + acpi_hwp_native_thermal_lvt_set = true;
> + kfree(osc_context.ret.pointer);
> + }
> +
> + return AE_OK;
> +}
> +
> +void acpi_early_processor_set_osc(void)
> +{
> + if (boot_cpu_has(X86_FEATURE_HWP)) {
> + acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
> + ACPI_UINT32_MAX,
> + acpi_set_hwp_native_thermal_lvt_osc,

Re: [PATCH v4] acpi: Issue _OSC call for native thermal interrupt handling

2016-03-19 Thread Linda Knippers


On 3/17/2016 4:36 PM, Srinivas Pandruvada wrote:
> On Thu, 2016-03-17 at 16:03 -0400, Linda Knippers wrote:
>> On 3/17/2016 2:24 PM, Srinivas Pandruvada wrote:
>>>
>>> There are several reports of freeze on enabling HWP (Hardware
>>> PStates)
>>> feature on Skylake based systems by Intel P states driver. The root
>>> cause is identified as the HWP interrupts causing BIOS code to
>>> freeze.
>>> HWP interrupts use the thermal LVT.
>>> Linux natively handles thermal interrupts, but in Skylake based
>>> systems
>>> SMM will take control of thermal interrupts. This is a problem for
>>> several
>>> reasons:
>>> - It freezes in the BIOS when it tries to handle the thermal interrupt,
>>> which
>>> will require a BIOS upgrade
>>> - With SMM handling thermal we lose all the reporting features of
>>> Linux arch/x86/kernel/cpu/mcheck/therm_throt driver
>>> - Some thermal drivers like x86-package-temp driver depends on the
>>> thermal
>>> threshold interrupts
>>> - The HWP interrupts are useful for debugging and tuning
>>> performance
>>>
>>> So we need native handling of thermal interrupts. 
>> Is this working around a firmware bug?
> Yes. But Linux has always had the capability to handle thermal interrupts
> natively, so we should continue to have this under OS control, when
> possible.
> 
>>
>>>
>>> To inform SMM that
>>> OS will handle thermal interrupts, we need to use _OSC under
>>> processor
>>> scope very early in ACPI initialization flow. 
>> Does _OSC say that OS "will" handle thermal interrupts or that it
>> "can"
>> handle thermal interrupts?
> 
> Can handle.
> 
>>   Couldn't a platform decide it wants to handle
>> them, regardless of what the OS may be capable of?
>>
> Yes. In that case platform can change the delivery mode bits to SMM in
> LVT.

And would still exhibit the BIOS bug?

>>>
>>> This needs to be done
>>> before SMM code path looks for _OSC capabilities. The bit 12 of
>>> _OSC in processor scope defines whether OS will handle thermal
>>> interrupts.
>>> When bit 12 is set to 1, OS will handle thermal interrupts.
>>> Refer to this document for details on _OSC
>>> http://www.intel.com/content/www/us/en/standards/processor-vendor-
>>> specific-acpi-specification.html
>> Where is bit 12 documented?
>>
> In the above document.

When I look at that document, I see bit 12 described as
"If set, OSPM supports native interrupt handling for Collaborative Processor
Performance Control notifications."  Is that the same thing or am
I looking at the wrong table?

-- ljk

> 
>>>
>>> This change introduces a new function
>>> acpi_early_processor_set_osc(),
>>> which walks acpi name space and finds acpi processor object and
>>> set capability via _OSC method to take over thermal LVT.
>> Does this change just affect Skylake platforms or all platforms?
> Any platform which has the Intel® Speed Shift Technology (aka HWP) feature
> present and enabled.
> 
> Thanks,
> Srinivas 
> 
>>
>> -- ljk
>>>
>>>
>>> Also this change writes HWP status bits to 0 to clear any HWP
>>> status
>>> bits in intel_thermal_interrupt().
>>>
>>> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
>>> ---
>>> v4:
>>> Suggestion by Boris for removing use of intermediate variable for
>>> clearing HWP status and using boot_cpu_has instead of
>>> static_cpu_has
>>>
>>> v3:
>>> - Added CONFIG_X86 around static_cpu_has() to fix compile error on
>>> ARCH=ia64, reported by kbuild test robot
>>> - return AE_CTRL_TERMINATE to terminate the ACPI namespace walk when
>>> _OSC has already been set once.
>>> v2:
>>> Unnecessary newline was introduced, removed that in
>>> acpi_processor.c
>>>
>>>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  3 ++
>>>  drivers/acpi/acpi_processor.c| 47
>>> 
>>>  drivers/acpi/bus.c   |  3 ++
>>>  drivers/acpi/internal.h  |  2 ++
>>>  4 files changed, 55 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c
>>> b/arch/x86/kernel/cpu/mcheck/therm_throt.c
>>> index 2c5aaf8..0553858 100644
>>> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
>>> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c

Re: [PATCH v4] acpi: Issue _OSC call for native thermal interrupt handling

2016-03-19 Thread Linda Knippers


On 3/17/2016 5:12 PM, Srinivas Pandruvada wrote:

> This needs to be done
> before the SMM code path looks for _OSC capabilities. Bit 12 of
> _OSC in the processor scope defines whether the OS will handle
> thermal interrupts.
> When bit 12 is set to 1, the OS will handle thermal interrupts.
> Refer to this document for details on _OSC
> http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html
 Where is bit 12 documented?

>>> In the above document.
>> When I look at that document, I see bit 12 described as
>> "If set, OSPM supports native interrupt handling for Collaborative
>> Processor
>> Performance Control notifications."  Is that the same thing or am
>> I looking at the wrong table?
> Yes. If you look at section 14.4 in the Intel SDM, you will see that
> "HWP is an implementation of the ACPI-defined Collaborative Processor
> Performance Control (CPPC)". Section 14.4.5 also specifies that HWP
> uses IA32_THERM_STATUS to communicate whether there are notifications,
> which are delivered via the thermal interrupt.

Ok, thanks. That wasn't clear from the commit message.  It
sounded like bit 12 directly indicated that the OS will handle
thermal interrupts but it's a bit more indirect than that.

> You asked above if the platform can handle these notifications in SMM only.
> If it does, the notifications will arrive as ACPI notifications. We
> don't have support for such notifications in Linux yet.

What I meant to ask was whether the platform can disregard the _OSC information
and handle thermal events on its own, without OS involvement.
For example, servers typically don't want to rely on the OS to manage
thermal issues.



> This change introduces a new function
> acpi_early_processor_set_osc(),
> which walks the ACPI namespace, finds the ACPI processor objects, and
> sets the capability via the _OSC method to take over the thermal LVT.
 Does this change just affect Skylake platforms or all platforms?
>>> Any platform with the Intel® Speed Shift Technology (aka HWP)
>>> feature present and enabled.

Could this be an unexpected change in behavior for platforms
with HWP that don't have this bug, assuming they would look at
the _OSC CPPC bit?  That's actually my main concern here.

-- ljk

>>>
>>> Thanks,
>>> Srinivas 
>>>


 -- ljk
>
>
>
> Also this change writes HWP status bits to 0 to clear any HWP
> status
> bits in intel_thermal_interrupt().
>
> Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> ---
> v4:
> Suggestion by Boris for removing use of intermediate variable
> for
> clearing HWP status and using boot_cpu_has instead of
> static_cpu_has
>
> v3:
> - Added CONFIG_X86 around static_cpu_has() to fix compile error
> on
> ARCH=ia64, reported by kbuild test robot
> - return AE_CTRL_TERMINATE to terminate acpi name walk space,
> when
> _OSC
> is set already once.
> v2:
> Unnecessary newline was introduced, removed that in
> acpi_processor.c
>
>  arch/x86/kernel/cpu/mcheck/therm_throt.c |  3 ++
>  drivers/acpi/acpi_processor.c| 47
> 
>  drivers/acpi/bus.c   |  3 ++
>  drivers/acpi/internal.h  |  2 ++
>  4 files changed, 55 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 2c5aaf8..0553858 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -385,6 +385,9 @@ static void intel_thermal_interrupt(void)
>  {
>   __u64 msr_val;
>  
> + if (static_cpu_has(X86_FEATURE_HWP))
> + wrmsrl_safe(MSR_HWP_STATUS, 0);
> +
>   rdmsrl(MSR_IA32_THERM_STATUS, msr_val);
>  
>   /* Check for violation of core thermal thresholds*/
> diff --git a/drivers/acpi/acpi_processor.c
> b/drivers/acpi/acpi_processor.c
> index 6979186..18da84f 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -491,6 +491,53 @@ static void acpi_processor_remove(struct
> acpi_device *device)
>  }
>  #endif /* CONFIG_ACPI_HOTPLUG_CPU */
>  
> +#ifdef CONFIG_X86
> +static bool acpi_hwp_native_thermal_lvt_set;
> +static acpi_status acpi_set_hwp_native_thermal_lvt_osc(acpi_handle handle,
> +						u32 lvl, void *context, void **rv)
> +{
> +	u8 sb_uuid_str[] = "4077A616-290C-47BE-9EBD-D87058713953";
> +	u32 capbuf[2];
> +	struct acpi_osc_context osc_context = {
> +		.uuid_str = sb_uuid_str,
> +		.rev = 1,
> +		.cap.length = 8,
> +		.cap.pointer = capbuf,
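
The archive truncates the quoted patch here.  For context, a minimal
sketch (not the literal remainder of the patch) of how such a capability
buffer might be completed and evaluated, assuming the kernel's
acpi_run_osc() helper and the bit-12 value discussed in this thread:

	/* Sketch only: advertise native thermal interrupt handling */
	capbuf[0] = 0x0;	/* query/reserved DWORD */
	capbuf[1] = 0x1000;	/* bit 12: OSPM handles the thermal LVT */

	if (ACPI_SUCCESS(acpi_run_osc(handle, &osc_context))) {
		kfree(osc_context.ret.pointer);
		acpi_hwp_native_thermal_lvt_set = true;
		/* per the v3 changelog, one successful call ends the walk */
		return AE_CTRL_TERMINATE;
	}
	return AE_OK;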

Re: [PATCH v4] acpi: Issue _OSC call for native thermal interrupt handling

2016-03-19 Thread Linda Knippers
On 3/17/2016 8:17 PM, Rafael J. Wysocki wrote:


>>> This change introduces a new function
>>> acpi_early_processor_set_osc(),
>>> which walks the ACPI namespace, finds the ACPI processor objects, and
>>> sets the capability via the _OSC method to take over the thermal LVT.
>> Does this change just affect Skylake platforms or all platforms?
> Any platform with the Intel® Speed Shift Technology (aka HWP)
> feature present and enabled.
>>
>> Could this be an unexpected change in behavior for platforms
>> with HWP that don't have this bug, assuming they would look at
>> the _OSC CPPC bit?  That's actually my main concern here.
> 
> Do you have any specific platforms in mind or just in general?

It's just general right now but as more HWP server platforms come out,
it may become specific.

-- ljk
> 
> Thanks,
> Rafael
> 


Re: [PATCH 1/3] nfit: fix multi-interface dimm handling, acpi6.1 compatibility

2016-02-08 Thread Linda Knippers
On 2/8/2016 2:10 PM, Linda Knippers wrote:
> On 2/8/2016 1:30 PM, Dan Williams wrote:
>> ACPI 6.1 clarified that multi-interface dimms require multiple control
>> region entries (DCRs) per dimm.  Previously we were assuming that a
>> control region is only present when block-data-windows are present.
> 
> We need to give this a quick test with NVDIMM-N because those types of
> NVDIMMs have control regions without block-data-windows.  We've fixed
> bugs related to that assumption a couple of times.

I can't test the conditions for which these changes were made but the
code continues to work on my NVDIMM-N system where we have control
regions w/o block-data windows.

-- ljk

> 
>> This implementation was done with an eye to being compatible with the
>> looser ACPI 6.0 interpretation of this table.
>>
>> 1/ When coalescing the memory device (MEMDEV) tables for a single dimm,
>> coalesce on device_handle rather than control region index.
>>
>> 2/ Whenever we disocver a control region with non-zero block windows
> discover
> 
>> re-scan for block-data-window (BDW) entries.
>>
>> We may need to revisit this if a DIMM ever implements a format interface
>> outside of blk or pmem, but that is not on the foreseeable horizon.
>>
>> Cc: 
>> Signed-off-by: Dan Williams 
>> ---
>>  drivers/acpi/nfit.c |   71 
>> +--
>>  1 file changed, 35 insertions(+), 36 deletions(-)
>>
>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>> index ad6d8c6b777e..424b362e8fdc 100644
>> --- a/drivers/acpi/nfit.c
>> +++ b/drivers/acpi/nfit.c
>> @@ -469,37 +469,16 @@ static void nfit_mem_find_spa_bdw(struct 
>> acpi_nfit_desc *acpi_desc,
>>  nfit_mem->bdw = NULL;
>>  }
>>  
>> -static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
>> +static void nfit_mem_init_bdw(struct acpi_nfit_desc *acpi_desc,
>>  struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
>>  {
>>  u16 dcr = __to_nfit_memdev(nfit_mem)->region_index;
>>  struct nfit_memdev *nfit_memdev;
>>  struct nfit_flush *nfit_flush;
>> -struct nfit_dcr *nfit_dcr;
>>  struct nfit_bdw *nfit_bdw;
>>  struct nfit_idt *nfit_idt;
>>  u16 idt_idx, range_index;
>>  
>> -list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
>> -if (nfit_dcr->dcr->region_index != dcr)
>> -continue;
>> -nfit_mem->dcr = nfit_dcr->dcr;
>> -break;
>> -}
>> -
>> -if (!nfit_mem->dcr) {
>> -dev_dbg(acpi_desc->dev, "SPA %d missing:%s%s\n",
>> -spa->range_index, __to_nfit_memdev(nfit_mem)
>> -? "" : " MEMDEV", nfit_mem->dcr ? "" : " DCR");
>> -return -ENODEV;
>> -}
>> -
>> -/*
>> - * We've found enough to create an nvdimm, optionally
>> - * find an associated BDW
>> - */
>> -list_add(&nfit_mem->list, &acpi_desc->dimms);
>> -
>>  list_for_each_entry(nfit_bdw, &acpi_desc->bdws, list) {
>>  if (nfit_bdw->bdw->region_index != dcr)
>>  continue;
>> @@ -508,12 +487,12 @@ static int nfit_mem_add(struct acpi_nfit_desc 
>> *acpi_desc,
>>  }
>>  
>>  if (!nfit_mem->bdw)
>> -return 0;
>> +return;
>>  
>>  nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
>>  
>>  if (!nfit_mem->spa_bdw)
>> -return 0;
>> +return;
>>  
>>  range_index = nfit_mem->spa_bdw->range_index;
>>  list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
>> @@ -538,8 +517,6 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
>>  }
>>  break;
>>  }
>> -
>> -return 0;
>>  }
>>  
>>  static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
>> @@ -548,7 +525,6 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc 
>> *acpi_desc,
>>  struct nfit_mem *nfit_mem, *found;
>>  struct nfit_memdev *nfit_memdev;
>>  int type = nfit_spa_type(spa);
>> -u16 dcr;
>>  
>>  switch (type) {
>>  case NFIT_SPA_DCR:
>> @@ -559,14 +535,18 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc 
>> *acpi_desc,
>>  }
>>  
>>  list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {

Re: [PATCH 1/3] nfit: fix multi-interface dimm handling, acpi6.1 compatibility

2016-02-08 Thread Linda Knippers
On 2/8/2016 1:30 PM, Dan Williams wrote:
> ACPI 6.1 clarified that multi-interface dimms require multiple control
> region entries (DCRs) per dimm.  Previously we were assuming that a
> control region is only present when block-data-windows are present.

We need to give this a quick test with NVDIMM-N because those types of
NVDIMMs have control regions without block-data-windows.  We've fixed
bugs related to that assumption a couple of times.

> This implementation was done with an eye to being compatible with the
> looser ACPI 6.0 interpretation of this table.
> 
> 1/ When coalescing the memory device (MEMDEV) tables for a single dimm,
> coalesce on device_handle rather than control region index.
> 
> 2/ Whenever we disocver a control region with non-zero block windows
discover

> re-scan for block-data-window (BDW) entries.
> 
> We may need to revisit this if a DIMM ever implements a format interface
> outside of blk or pmem, but that is not on the foreseeable horizon.
> 
> Cc: 
> Signed-off-by: Dan Williams 
> ---
>  drivers/acpi/nfit.c |   71 
> +--
>  1 file changed, 35 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index ad6d8c6b777e..424b362e8fdc 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -469,37 +469,16 @@ static void nfit_mem_find_spa_bdw(struct acpi_nfit_desc 
> *acpi_desc,
>   nfit_mem->bdw = NULL;
>  }
>  
> -static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
> +static void nfit_mem_init_bdw(struct acpi_nfit_desc *acpi_desc,
>   struct nfit_mem *nfit_mem, struct acpi_nfit_system_address *spa)
>  {
>   u16 dcr = __to_nfit_memdev(nfit_mem)->region_index;
>   struct nfit_memdev *nfit_memdev;
>   struct nfit_flush *nfit_flush;
> - struct nfit_dcr *nfit_dcr;
>   struct nfit_bdw *nfit_bdw;
>   struct nfit_idt *nfit_idt;
>   u16 idt_idx, range_index;
>  
> - list_for_each_entry(nfit_dcr, &acpi_desc->dcrs, list) {
> - if (nfit_dcr->dcr->region_index != dcr)
> - continue;
> - nfit_mem->dcr = nfit_dcr->dcr;
> - break;
> - }
> -
> - if (!nfit_mem->dcr) {
> - dev_dbg(acpi_desc->dev, "SPA %d missing:%s%s\n",
> - spa->range_index, __to_nfit_memdev(nfit_mem)
> - ? "" : " MEMDEV", nfit_mem->dcr ? "" : " DCR");
> - return -ENODEV;
> - }
> -
> - /*
> -  * We've found enough to create an nvdimm, optionally
> -  * find an associated BDW
> -  */
> - list_add(&nfit_mem->list, &acpi_desc->dimms);
> -
>   list_for_each_entry(nfit_bdw, &acpi_desc->bdws, list) {
>   if (nfit_bdw->bdw->region_index != dcr)
>   continue;
> @@ -508,12 +487,12 @@ static int nfit_mem_add(struct acpi_nfit_desc 
> *acpi_desc,
>   }
>  
>   if (!nfit_mem->bdw)
> - return 0;
> + return;
>  
>   nfit_mem_find_spa_bdw(acpi_desc, nfit_mem);
>  
>   if (!nfit_mem->spa_bdw)
> - return 0;
> + return;
>  
>   range_index = nfit_mem->spa_bdw->range_index;
>   list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
> @@ -538,8 +517,6 @@ static int nfit_mem_add(struct acpi_nfit_desc *acpi_desc,
>   }
>   break;
>   }
> -
> - return 0;
>  }
>  
>  static int nfit_mem_dcr_init(struct acpi_nfit_desc *acpi_desc,
> @@ -548,7 +525,6 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc 
> *acpi_desc,
>   struct nfit_mem *nfit_mem, *found;
>   struct nfit_memdev *nfit_memdev;
>   int type = nfit_spa_type(spa);
> - u16 dcr;
>  
>   switch (type) {
>   case NFIT_SPA_DCR:
> @@ -559,14 +535,18 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc 
> *acpi_desc,
>   }
>  
>   list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
> - int rc;
> + struct nfit_dcr *nfit_dcr;
> + u32 device_handle;
> + u16 dcr;
>  
>   if (nfit_memdev->memdev->range_index != spa->range_index)
>   continue;
>   found = NULL;
>   dcr = nfit_memdev->memdev->region_index;
> + device_handle = nfit_memdev->memdev->device_handle;
>   list_for_each_entry(nfit_mem, &acpi_desc->dimms, list)
> - if (__to_nfit_memdev(nfit_mem)->region_index == dcr) {
> + if (__to_nfit_memdev(nfit_mem)->device_handle
> + == device_handle) {
>   found = nfit_mem;
>   break;
>   }
> @@ -579,6 +559,31 @@ static int nfit_mem_dcr_init(struct acpi_nfit_desc 
> *acpi_desc,
>   if (!nfit_mem)
>   return -ENOMEM;
>   INIT_LIST_HEAD(_mem->list);
> + list_add(&nfit_mem->list, &acpi_desc->dimms);
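
To illustrate the coalescing change, a hypothetical NFIT fragment for a
single multi-interface DIMM (handle and index values invented here):

	/*
	 * MEMDEV { device_handle = 0x1, region_index = 2 }  ->  DCR 2 (pmem)
	 * MEMDEV { device_handle = 0x1, region_index = 3 }  ->  DCR 3 (blk)
	 *
	 * Coalescing on region_index would create two nvdimm objects for
	 * the same physical DIMM; coalescing on device_handle creates one,
	 * with both control regions attached to it.
	 */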

Re: [PATCH 1/4] nvdimm: Add wrapper for IOCTL pass thru.

2015-11-10 Thread Linda Knippers

On 11/10/2015 3:27 PM, Dan Williams wrote:

On Tue, Nov 10, 2015 at 11:49 AM, Jerry Hoemann  wrote:

On Tue, Nov 10, 2015 at 12:51:59PM -0500, Jeff Moyer wrote:

Jerry Hoemann  writes:


Add IOCTL type 'P' to denote NVDIMM_TYPE_PASSTHRU.


Can't you just make passthrough a separate command?  If you actually add


There are multiple conflicting NVDIMM _DSMs running around; they
are "device specific".  So we should plan in general and not just
for the example DSM that Intel added support for.  These DSMs have
overlapping and incompatible function IDs.

The Intel example is an example, not a standard.  They are free to
change it at will, so we can't be certain there won't be a
conflict some time in the future if we try to use their number space.

I'm trying to create a generic pass-through that any vendor can use.  Putting
this in the Intel function number space doesn't make a lot of sense to me.


It isn't the "Intel" function number space.  The fact that they
currently align is just a happy accident.


It's not really a happy accident. Your commit message says it
was derived from the Intel spec 'for convenience', which I think is convenient
for anything that implements that spec.

We've discussed ways of supporting different command sets with you
and determined that this pass-through mechanism was a good approach
because it allows multiple different command sets to be support in
a generic way.  Blending the two flavors (generic pass through and explicit
function definitions) is confusing to me.


The kernel is free to break
the 1:1 ioctl number to DSM function number relationship, and I think
it would make the implementation cleaner in this case.


To me it's less clean and even for your own example spec, less
convenient if Intel ever updates that spec.
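
For reference, the generic pass-through that eventually landed upstream
(ND_CMD_CALL) wraps a DSM call in a family/function envelope.  A sketch
of that envelope, recalled from include/uapi/linux/ndctl.h, so field
details may differ:

	struct nd_cmd_pkg {
		__u64 nd_family;	/* which DSM family (UUID) to target */
		__u64 nd_command;	/* function index within that family */
		__u32 nd_size_in;	/* size of the input payload */
		__u32 nd_size_out;	/* size of the output payload */
		__u32 nd_reserved2[9];	/* reserved, must be zero */
		__u32 nd_fw_size;	/* output: bytes firmware filled in */
		unsigned char nd_payload[];	/* input, then output */
	};

	/* Keeping the family explicit means no single vendor's function
	 * number space is overloaded, which is the point argued above. */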

-- ljk




Re: [PATCH 1/2] dax: fix O_DIRECT I/O to the last block of a blockdev

2015-09-08 Thread Linda Knippers
This patch and the 2/2 patch don't seem to have gone anywhere.
Willy?  or Ross?

-- ljk

On 8/14/2015 4:53 PM, Linda Knippers wrote:
> On 8/14/2015 4:15 PM, Jeff Moyer wrote:
>> commit bbab37ddc20b (block: Add support for DAX reads/writes to
>> block devices) caused a regression in mkfs.xfs.  That utility
>> sets the block size of the device to the logical block size
>> using the BLKBSZSET ioctl, and then issues a single sector read
>> from the last sector of the device.  This results in the dax_io
>> code trying to do a page-sized read from 512 bytes from the end
>> of the device.  The result is -ERANGE being returned to userspace.
>>
>> The fix is to align the block to the page size before calling
>> get_block.
>>
>> Thanks to willy for simplifying my original patch.
>>
>> Signed-off-by: Jeff Moyer 
> 
> Tested-by:  Linda Knippers 
> 
>> ---
>>  fs/dax.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index a7f77e1..ef35a20 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -116,7 +116,8 @@ static ssize_t dax_io(struct inode *inode, struct 
>> iov_iter *iter,
>>  unsigned len;
>>  if (pos == max) {
>>  unsigned blkbits = inode->i_blkbits;
>> -sector_t block = pos >> blkbits;
>> +long page = pos >> PAGE_SHIFT;
>> +sector_t block = page << (PAGE_SHIFT - blkbits);
>>  unsigned first = pos - (block << blkbits);
>>  long size;
>>  
>>
> 
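
A worked example of the failure, assuming a 4096-byte page
(PAGE_SHIFT = 12), a 512-byte block size (blkbits = 9), and a read of
the last sector at pos = size - 512:

	old: block = pos >> 9            /* the last 512-byte sector */
	     dax_io reads a full page starting at that sector, i.e.
	     4096 bytes beginning 512 bytes before end-of-device -> -ERANGE

	new: page  = pos >> 12           /* the page containing pos */
	     block = page << (12 - 9)    /* first sector of that page */
	     first = pos - (block << 9)  /* offset of pos within the page */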




[tip:x86/urgent] x86/mm/srat: Print non-volatile flag in SRAT

2015-09-02 Thread tip-bot for Linda Knippers
Commit-ID:  31e09b18c863718939e3e9c30eee55f9011d85ee
Gitweb: http://git.kernel.org/tip/31e09b18c863718939e3e9c30eee55f9011d85ee
Author: Linda Knippers 
AuthorDate: Tue, 1 Sep 2015 15:41:55 -0400
Committer:  Ingo Molnar 
CommitDate: Wed, 2 Sep 2015 09:33:25 +0200

x86/mm/srat: Print non-volatile flag in SRAT

With the addition of NVDIMM support, a question came up as to
whether NVDIMM ranges should be in the SRAT with this bit set.
I think the consensus was no because the ranges are in the NFIT
with proximity domain information there.

ACPI is not clear on the meaning of this bit in the SRAT.
If someone is setting it, we might want to ask them what they
expect to happen with it.

Right now this bit is only printed if all the ACPI debug
information is turned on.

Signed-off-by: Linda Knippers 
Acked-by: Thomas Gleixner 
Cc: Peter Zijlstra 
Link: http://lkml.kernel.org/r/20150901194154.GA4939@ljkz400
Signed-off-by: Ingo Molnar 
---
 arch/x86/mm/srat.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 66338a6..c2aea63 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -192,10 +192,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 
node_set(node, numa_nodes_parsed);
 
-   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
+   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
-   hotpluggable ? " hotplug" : "");
+   hotpluggable ? " hotplug" : "",
+   ma->flags & ACPI_SRAT_MEM_NON_VOLATILE ? " non-volatile" : "");
 
/* Mark hotplug range in memblock. */
if (hotpluggable && memblock_mark_hotplug(start, ma->length))



[PATCH v2] arch/x86/mm/srat: Print non-volatile flag in SRAT

2015-09-01 Thread Linda Knippers

With the addition of NVDIMM support, a question came up as to whether
NVDIMM ranges should be in the SRAT with this bit set.  I think the
consensus was no because the ranges are in the NFIT with proximity
domain information there.

ACPI is not clear on the meaning of this bit in the SRAT.
If someone is setting it, we might want to ask them what they expect
to happen with it.

Right now this bit is only printed if all the ACPI debug information is
turned on.  

Signed-off-by: Linda Knippers 
---
 arch/x86/mm/srat.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 66338a6..c2aea63 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -192,10 +192,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 
node_set(node, numa_nodes_parsed);
 
-   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
+   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
-   hotpluggable ? " hotplug" : "");
+   hotpluggable ? " hotplug" : "",
+   ma->flags & ACPI_SRAT_MEM_NON_VOLATILE ? " non-volatile" : "");
 
/* Mark hotplug range in memblock. */
if (hotpluggable && memblock_mark_hotplug(start, ma->length))
-- 
1.8.3.1
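
With the change applied, a node whose SRAT entry sets both flags would
log a line like this (node and address values are hypothetical):

	SRAT: Node 1 PXM 1 [mem 0x100000000-0x1ffffffff] hotplug non-volatile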



Re: [PATCH] arch/x86/mm/srat: Print non-volatile flag in SRAT

2015-09-01 Thread Linda Knippers
On 9/1/2015 3:17 PM, Thomas Gleixner wrote:
> On Tue, 1 Sep 2015, Linda Knippers wrote:
>> Nobody checks this flag but it would be interesting to know if it's being
>> set on any platforms.
> 
> What you're omitting to explain is WHY it would be interesting.

With the addition of NVDIMM support, a question came up as to whether
NVDIMM ranges should be in the SRAT with this bit set.  I think the
consensus was no because the ranges are in the NFIT with proximity
domain information there.

ACPI is not clear on the meaning of this bit in the SRAT.
If someone is setting it, we might want to ask them what they expect
to happen with it.

Right now this bit is only printed if all the ACPI debug information is
turned on.

Sorry, I should have explained more.

-- ljk

> 
> Thanks,
> 
>   tglx
> 



[PATCH] arch/x86/mm/srat: Print non-volatile flag in SRAT

2015-09-01 Thread Linda Knippers

Nobody checks this flag but it would be interesting to know if it's being
set on any platforms.

Signed-off-by: Linda Knippers 
---
 arch/x86/mm/srat.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 66338a6..c2aea63 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -192,10 +192,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
 
node_set(node, numa_nodes_parsed);
 
-   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
+   pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
-   hotpluggable ? " hotplug" : "");
+   hotpluggable ? " hotplug" : "",
+   ma->flags & ACPI_SRAT_MEM_NON_VOLATILE ? " non-volatile" : "");
 
/* Mark hotplug range in memblock. */
if (hotpluggable && memblock_mark_hotplug(start, ma->length))
-- 
1.8.3.1




Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-27 Thread Linda Knippers
On 8/27/2015 1:04 PM, Dan Williams wrote:
> On Thu, Aug 27, 2015 at 9:32 AM, Linda Knippers  wrote:
>> I know this seems like a nit but we're going to live with this stuff for
>> a long time.
> 
> Along those lines, Bob just came by my office and gave me a decisive
> argument that I think trumps my ABI concerns, and one I'd be willing
> to argue for 4.2 inclusion.  Some BIOS implementers may do the work to
> get the polarity correct others may just read the flag name and get it
> wrong.  Once that happens the field loses all meaning regardless of
> whether Linux interprets it correctly.

BIOS developers, BIOS testers, and the OS folks working with those teams
will be grateful.

Thanks,

-- ljk





Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-27 Thread Linda Knippers
On 8/27/2015 11:54 AM, Dan Williams wrote:
> On Thu, Aug 27, 2015 at 8:35 AM, Linda Knippers  wrote:
>> On 8/27/2015 11:30 AM, Dan Williams wrote:
>>> On Thu, Aug 27, 2015 at 7:43 AM, Linda Knippers  
>>> wrote:
>>>> I don't see why we can't fix the names so they make sense now before there
>>>> is hardware in the market.  People doing testing and debugging look at 
>>>> stuff
>>>> in /sys and they write their own scripts too, not necessarily in python.
>>>
>>> The practical concern at this point is that it is too late in the 4.2
>>> development cycle, in my opinion, to push for a cosmetic change like
>>> this.  We also have versions of libndctl starting to leak into
>>> distributions.  Changing this in 4.3 means breaking the ABI from 4.2
>>> and managing both ways in future versions of the library as well as
>>> getting all distributions to update.  Not insurmountable, but also
>>> something I don't want to take on just for a few characters in a sysfs
>>> file.  I think this is better fixed with documentation which I still
>>> owe to the Documentation/ABI/ directory.
>>
>> Is there any distribution that is going to enable this with 4.2?  I know
>> we're using it for testing now but there is still quite a bit of work
>> queued up for 4.3 and more still left to be done.
> 
> I'm not happy that this is confusing folks, and it is unfortunate that
> the polarity is reversed.
> 
> However, I'm more concerned about the fact that I'd be knowingly
> breaking ABI from 4.2 to 4.3.  I originally thought "no one" was using
> e820 type-12 for enumerating nvdimm resources, but I got shouted down
> when trying to convert that exclusively to the ACPI 6 definition.  The
> Linus edict of "we don't break released ABI" is ringing in my ears.

Maybe not too late for 4.2 then.  They're just string changes.

> The current name is also consistent with the current / released ACPICA
> definition.

They seem to be open to fixing it.

I know this seems like a nit but we're going to live with this stuff for
a long time.

-- ljk


Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-27 Thread Linda Knippers
On 8/27/2015 11:30 AM, Dan Williams wrote:
> On Thu, Aug 27, 2015 at 7:43 AM, Linda Knippers  wrote:
>> I don't see why we can't fix the names so they make sense now before there
>> is hardware in the market.  People doing testing and debugging look at stuff
>> in /sys and they write their own scripts too, not necessarily in python.
> 
> The practical concern at this point is that it is too late in the 4.2
> development cycle, in my opinion, to push for a cosmetic change like
> this.  We also have versions of libndctl starting to leak into
> distributions.  Changing this in 4.3 means breaking the ABI from 4.2
> and managing both ways in future versions of the library as well as
> getting all distributions to update.  Not insurmountable, but also
> something I don't want to take on just for a few characters in a sysfs
> file.  I think this is better fixed with documentation which I still
> owe to the Documentation/ABI/ directory.

Is there any distribution that is going to enable this with 4.2?  I know
we're using it for testing now but there is still quite a bit of work
queued up for 4.3 and more still left to be done.

-- ljk



Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-27 Thread Linda Knippers
On 8/26/2015 6:00 PM, Dan Williams wrote:
> On Wed, Aug 26, 2015 at 2:44 PM, Toshi Kani  wrote:
>> On Wed, 2015-08-26 at 14:30 -0700, Dan Williams wrote:
>>> On Wed, Aug 26, 2015 at 2:12 PM, Toshi Kani  wrote:
>>>> On Wed, 2015-08-26 at 10:16 -0700, Dan Williams wrote:
>>>>> On Wed, Aug 26, 2015 at 9:20 AM, Toshi Kani  wrote:
>>>>>> ACPI 6.0 NFIT Memory Device State Flags in Table 5-129 defines
>>>>>> bit 3 as follows.
>>>>>>
>>>>>>   Bit [3] set to 1 to indicate that the Memory Device is observed
>>>>>>   to be not armed prior to OSPM hand off. A Memory Device is
>>>>>>   considered armed if it is able to accept persistent writes.
>>>>>>
>>>>>> This bit is currently defined as ACPI_NFIT_MEM_ARMED, which can be
>>>>>> confusing as if the Memory Device is armed when this bit is set.
>>>>>>
>>>>>> Change the name to ACPI_NFIT_MEM_NOT_ARMED per the spec.
>>>>>>
>>>>>> Signed-off-by: Toshi Kani 
>>>>>> Cc: Dan Williams 
>>>>>> Cc: Bob Moore 
>>>>>> Cc: Rafael J. Wysocki 
>>>>>> ---
>>>>>>  drivers/acpi/nfit.c  |6 +++---
>>>>>>  drivers/acpi/nfit.h  |2 +-
>>>>>>  include/acpi/actbl1.h|2 +-
>>>>>
>>>>> This file "include/acpi/actbl1.h" is owned by the ACPICA project so
>>>>> any changes need to come through them.  But that said, I'm not sure we
>>>>> need friendly names at this level.
>>>>
>>>> I think the name is misleading, but I agree with the process and this
>>>> patch 2 can be dropped.  It'd be nice if the ACPICA project can pick it
>>>> up later when they have a chance, though.
>>>>
>>>>> What I usually say about sysfs name changes to be more human friendly
>>>>> is "sysfs is not a UI", i.e. it's not necessarily meant to be user
>>>>> friendly.  As long as the names for the flags are distinct then
>>>>> wrapping descriptive / accurate names around them is the role of
>>>>> libndctl and userspace management software.
>>>>>
>>>>> Similar feedback for patch1 in the sense that I don't think we need to
>>>>> update the sysfs naming.  For example the API to retrieve the state of
>>>>> the "arm" flag in libndctl is ndctl_dimm_failed_arm().
>>>>
>>>> I agree that we do not want to change sysfs API for friendliness, and I
>>>> understand that libndctl already consumes the strings...  But I think
>>>> they can be confusing in the long run, i.e. the flags are likely to be
>>>> extended with additional info, and more people may be looking at sysfs
>>>> for the state.  It'd be a lot harder to change them later.
>>>
>>> The starting premise though is that this will be nicer for scripts
>>> that want to avoid the library.  Properly handling the async device
>>> registration semantics of the libnvdimm-sysfs interface is hard to get
>>> right in a script.  I'm trying my best to discourage raw use of sysfs
>>> for this reason.  Small fixes to the names of flags seems to miss this
>>> wider point.
>>
>> Okay, I guess I will have to jump on the bandwagon and discourage people to
>> look at sysfs... ;-P
>>
> 
> That said, I'm not opposed to looking at something like Python-binding
> for libndctl to make scripting easier.

I don't see why we can't fix the names so they make sense now before there
is hardware in the market.  People doing testing and debugging look at stuff
in /sys and they write their own scripts too, not necessarily in python.

If they only make sense to someone using your library, I think we've missed the
mark.  Toshi is reacting to feedback we're getting from people who are
starting to test this stuff.

-- ljk


Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-27 Thread Linda Knippers
On 8/27/2015 1:04 PM, Dan Williams wrote:
> On Thu, Aug 27, 2015 at 9:32 AM, Linda Knippers linda.knipp...@hp.com wrote:
>> I know this seems like a nit but we're going to live with this stuff for
>> a long time.
>
> Along those lines, Bob just came by my office and gave me a decisive
> argument that I think trumps my ABI concerns, and one I'd be willing
> to argue for 4.2 inclusion.  Some BIOS implementers may do the work to
> get the polarity correct; others may just read the flag name and get it
> wrong.  Once that happens the field loses all meaning regardless of
> whether Linux interprets it correctly.

BIOS developers, BIOS testers, and the OS folks working with those teams
will be grateful.

Thanks,

-- ljk


Re: [PATCH 2/2]: acpica/nfit: Rename not-armed bit definition

2015-08-26 Thread Linda Knippers
On 8/26/2015 1:16 PM, Dan Williams wrote:
> On Wed, Aug 26, 2015 at 9:20 AM, Toshi Kani  wrote:
>> ACPI 6.0 NFIT Memory Device State Flags in Table 5-129 defines
>> bit 3 as follows.
>>
>>   Bit [3] set to 1 to indicate that the Memory Device is observed
>>   to be not armed prior to OSPM hand off. A Memory Device is
>>   considered armed if it is able to accept persistent writes.
>>
>> This bit is currently defined as ACPI_NFIT_MEM_ARMED, which can be
>> confusing as if the Memory Device is armed when this bit is set.
>>
>> Change the name to ACPI_NFIT_MEM_NOT_ARMED per the spec.
>>
>> Signed-off-by: Toshi Kani 
>> Cc: Dan Williams 
>> Cc: Bob Moore 
>> Cc: Rafael J. Wysocki 
>> ---
>>  drivers/acpi/nfit.c  |6 +++---
>>  drivers/acpi/nfit.h  |2 +-
>>  include/acpi/actbl1.h|2 +-
> 
> This file "include/acpi/actbl1.h" is owned by the ACPICA project so
> any changes need to come through them.  But that said, I'm not sure we
> need friendly names at this level.
> 
> What I usually say about sysfs name changes to be more human friendly
> is "sysfs is not a UI", i.e. it's not necessarily meant to be user
> friendly.  As long as the names for the flags are distinct then
> wrapping descriptive / accurate names around them is the role of
> libndctl and userspace management software.

I think there's a difference between unfriendly and misleading or confusing.
If names didn't matter at all we could just call them bit0, bit1, bit2,...
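
To make the polarity trap concrete, a hypothetical sketch (not code from
either patch; "memdev" here stands for a struct acpi_nfit_memory_map
entry and the warnings are illustrative):

	/* Old name: the test reads as "the DIMM is armed" even though
	 * the spec says the set bit means it is NOT armed. */
	if (memdev->flags & ACPI_NFIT_MEM_ARMED)
		dev_warn(dev, "DIMM cannot accept persistent writes\n");

	/* Proposed name: the same check says what it means. */
	if (memdev->flags & ACPI_NFIT_MEM_NOT_ARMED)
		dev_warn(dev, "DIMM cannot accept persistent writes\n");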

> Similar feedback for patch1 in the sense that I don't think we need to
> update the sysfs naming.  For example the API to retrieve the state of
> the "arm" flag in libndctl is ndctl_dimm_failed_arm().

It would be so nice for scripts and humans if the sysfs names made as
much sense.

-- ljk



Re: [PATCH 1/2] dax: fix O_DIRECT I/O to the last block of a blockdev

2015-08-14 Thread Linda Knippers
On 8/14/2015 4:15 PM, Jeff Moyer wrote:
> commit bbab37ddc20b (block: Add support for DAX reads/writes to
> block devices) caused a regression in mkfs.xfs.  That utility
> sets the block size of the device to the logical block size
> using the BLKBSZSET ioctl, and then issues a single sector read
> from the last sector of the device.  This results in the dax_io
> code trying to do a page-sized read from 512 bytes from the end
> of the device.  The result is -ERANGE being returned to userspace.
> 
> The fix is to align the block to the page size before calling
> get_block.
> 
> Thanks to willy for simplifying my original patch.
> 
> Signed-off-by: Jeff Moyer 

Tested-by:  Linda Knippers 

> ---
>  fs/dax.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a7f77e1..ef35a20 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -116,7 +116,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
>   unsigned len;
>   if (pos == max) {
>   unsigned blkbits = inode->i_blkbits;
> - sector_t block = pos >> blkbits;
> + long page = pos >> PAGE_SHIFT;
> + sector_t block = page << (PAGE_SHIFT - blkbits);
>   unsigned first = pos - (block << blkbits);
>   long size;
>  
> 
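
To see why the alignment matters, take illustrative numbers (4k pages,
blkbits = 9, and a page-aligned device size): for a read at
pos = device_size - 512, the old code computed block = pos >> 9, the very
last 512-byte sector, and dax_io then issued a page-sized request starting
512 bytes before the end of the device.  The new code takes
page = pos >> PAGE_SHIFT and rebuilds block = page << (PAGE_SHIFT - 9),
the first sector of the last full page, so first = pos - (block << 9)
becomes 3584 and the page-sized access stays inside the device.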



Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Linda Knippers
On 8/13/2015 1:14 PM, Jeff Moyer wrote:
> Linda Knippers  writes:
> 
>>> I'd be fine with changing the persistent memory block device to only
>>> support 4k logical, 4k physical block size.  That probably makes the
>>> most sense.
>>
>> If that's what we want, the current patch doesn't do that.
>> https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
>>
>> It causes the physical block size to be PAGE_SIZE but the
>> logical block size is still 512.  However, the minimum_io_size
>> is now 4096 (same as physical block size, I assume).  The
>> optimal_io_size is still 0.  What does that mean?
> 
> physical block size - device's internal block size
> logical block size - addressable unit

Right, but it's still reported as 512 and that doesn't work.

> optimal io size - device's preferred unit for streaming

So 0 is ok.

> minimum io size - device’s preferred minimum unit for random I/O
> 
> See Martin Petersen's "Linux & Advanced Storage Interfaces" document for
> more information.
> 
>> Whatever we go with, we should do something because 4.2rc6 is still
>> broken, unable to create a xfs file system on a pmem device, ever
>> since the change to use DAX on block devices with O_DIRECT.
> 
> We can change the block device to export logical/physical block sizes of
> PAGE_SIZE.  However, when persistent memory support comes to platforms
> that support page sizes > 32k, xfs will again run into problems (Dave
> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
> Arguably, you can use pmem and dax on such platforms using RAM today for
> testing.  Do we care about breaking that?

I would think so.  AARCH64 uses 64k pages today.

I think Documentation/filesystems/dax.txt could use a little update
too.  It has a section "Implementation Tips for Block Driver Writers"
that makes it sound easy but now I wonder if it even works with the
example ram drivers.  Should we be able to read any 512 byte
"sector"?

-- ljk
> 
> Cheers,
> Jeff
> 



Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-13 Thread Linda Knippers
On 8/13/2015 10:00 AM, Jeff Moyer wrote:
> Boaz Harrosh  writes:
> 
>> On 08/13/2015 12:11 AM, Jeff Moyer wrote:
>>> Boaz Harrosh  writes:
>>>
>>>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>>>> <>
>>>>>
>>>>>> We need to cope with the case where the end of a partition isn't on a
>>>>>> page boundary though.
>>>>>
>>>>> Well, that's usually done by falling back to buffered I/O.  I gave that
>>>>> a try and panicked the box.  :)  I'll keep looking into it, but probably
>>>>> won't have another patch until next week.
>>>>>
>>>>
>>>> Let's slow down for a sec, please.
>>>>
>>>> We have all established that an unaligned partition start is BAD and not
>>>> supported?
>>>
>>> No.  Unaligned partitions on RAID arrays or 512e devices are bad because
>>> they result in suboptimal performance.  They are most certainly still
>>> supported, though.
>>>
>>
>> What ?
>>
>> I meant for dax on pmem or brd. I meant that we *do not* support dax access
>> on an unaligned partition start. (None dax is perfectly supported)
> 
> Sorry, I thought your statement was broader than that.
> 
>> We did it this way because of the direct_access API that returns a pfn
>> with is PAGE_SIZE. We could introduce a pfn+offset but we thought it is
>> not worth it, and that dax should be page aligned for code simplicity
> 
> I'd be fine with changing the persistent memory block device to only
> support 4k logical, 4k physical block size.  That probably makes the
> most sense.

If that's what we want, the current patch doesn't do that.
https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

It causes the physical block size to be PAGE_SIZE but the
logical block size is still 512.  However, the minimum_io_size
is now 4096 (same as physical block size, I assume).  The
optimal_io_size is still 0.  What does that mean?
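
For reference, a rough sketch (not the actual pmem driver source) of how a
block driver would advertise page-sized blocks all the way down, which is
what "4k logical, 4k physical" asks for; "q" is the driver's request_queue
and each knob surfaces directly in /sys/block/<dev>/queue/:

	blk_queue_logical_block_size(q, PAGE_SIZE);  /* addressable unit */
	blk_queue_physical_block_size(q, PAGE_SIZE); /* internal block size */
	blk_queue_io_min(q, PAGE_SIZE);              /* minimum_io_size */
	blk_queue_io_opt(q, 0);                      /* optimal_io_size: none */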

Whatever we go with, we should do something because 4.2rc6 is still
broken, unable to create an xfs file system on a pmem device, ever
since the change to use DAX on block devices with O_DIRECT.

-- ljk
> 
> Cheers,
> Jeff
> 



Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Linda Knippers
On 8/10/2015 5:27 PM, Dave Chinner wrote:
> On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
>> On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
>>> On 08/06/2015 11:34 PM, Dave Chinner wrote:
>>>> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>>>>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>>>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>>>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>>>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>>>>> <>
>>>>>>>>>
>>>>>>>>> I sat down with Linda to look into it, and the problem is that
>>>>>>>>> mkfs.xfs sets the blocksize of the device to 512 (via BLKBSZSET),
>>>>>>>>> and then reads from the last sector of the device.  This results in
>>>>>>>>> dax_io trying to do a page-sized I/O at 512 bytes from the end of
>>>>>>>>> the device.
>>>>>>>>
>>>>>
>>>>> This part I do not understand. how is mkfs.xfs reading the sector?
>>>>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>>>>
>>>> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
>>>> told that it is working on an image file does it fall back to
>>>> buffered IO. All of the XFS userspace tools work this way to prevent
>>>> page cache pollution issues with read-once or write-once data during
>>>> operation.
> 
>> That patch does cause 'mkfs -t xfs' to work.
>>
>> Before:
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256  agcount=4, agsize=524288 blks
>>  =   sectsz=512   attr=2, projid32bit=1
>^^
> 
> 
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256  agcount=4, agsize=524288 blks
>>  =   sectsz=4096  attr=2, projid32bit=1
>^^^
> 
> So in the after case, mkfs.xfs is behaving differently and not
> exercising the bug. It's seen the:
> 
>> $ cat /sys/block/pmem3/queue/logical_block_size
>> 512
>> $ cat /sys/block/pmem3/queue/physical_block_size
>> 4096
>   
> 
> 4k physical block size, and hence configured the filesystem with a
> 4k sector size so all IO it issues is physicallly aligned. IOWs,
> mkfs.xfs's last sector read is 4k aligned and sized, and therefore
> the test has not confirmed that the patch fixes the 512 byte last
> sector read is fixed at all.

That is true.  All I reported is that mkfs.xfs now works.  I
had some questions about whether that was right.

The underlying problem is still there, as I can demonstrate with
a simple reproducer that just does what mkfs.xfs would have done
before.
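
Something along these lines; a hypothetical sketch with the device path
purely illustrative:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>		/* BLKBSZSET, BLKGETSIZE64 */

	int main(void)
	{
		unsigned long long bytes;	/* device size */
		int bsz = 512;
		void *buf;
		int fd;

		fd = open("/dev/pmem3", O_RDWR | O_DIRECT);
		if (fd < 0 ||
		    ioctl(fd, BLKBSZSET, &bsz) < 0 ||	/* as mkfs.xfs did */
		    ioctl(fd, BLKGETSIZE64, &bytes) < 0 ||
		    posix_memalign(&buf, 512, 512))
			return 1;

		/* 512-byte O_DIRECT read of the last sector; on the broken
		 * kernel dax_io turns this into a page-sized access past
		 * the end of the device and the read fails with ERANGE. */
		if (pread(fd, buf, 512, bytes - 512) != 512)
			perror("read of last sector");

		close(fd);
		return 0;
	}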

> Isn't there a regression test suite that covers basic block device
> functionality that you can use to test these simple corner cases?

If there is, seems like DAX adds a few twists.

-- ljk
> 
> Cheers,
> 
> Dave.
> 



Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-10 Thread Linda Knippers
On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
> On 08/06/2015 11:34 PM, Dave Chinner wrote:
>> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>>> <>
>>>>>>>
>>>>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>>>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>>>>>> from the last sector of the device.  This results in dax_io trying to do
>>>>>>> a page-sized I/O at 512 bytes from the end of the device.
>>>>>>
>>>
>>> This part I do not understand. how is mkfs.xfs reading the sector?
>>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>>
>> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
>> told that it is working on an image file does it fall back to
>> buffered IO. All of the XFS userspace tools work this way to prevent
>> page cache pollution issues with read-once or write-once data during
>> operation.
>>
> 
> Thanks, yes makes sense. This is a bug at the DAX implementation of
> bdev. Since as you know with DAX there is no difference between
> O_DIRECT and buffered, we must support any aligned IO. I bet it
> should be something with bdev not giving 4K buffer-heads to dax.c.
> 
> Or ... It might just be the infamous bug where the actual partition
> they used was not 4k aligned on its start sector. So the last sector IO
> after partition translation came out wrong. This bug then should be
> fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
> by:Vishal Verma
> 
> Vishal I think we should add CC: sta...@vger.kernel.org to your patch
> because of these fdisk bugs.

That patch does cause 'mkfs -t xfs' to work.

Before:
$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256  agcount=4, agsize=524288 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0  finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

After:

$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256  agcount=4, agsize=524288 blks
 =   sectsz=4096  attr=2, projid32bit=1
 =   crc=0  finobt=0
data =   bsize=4096   blocks=2097152, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

$ cat /sys/block/pmem3/queue/logical_block_size
512
$ cat /sys/block/pmem3/queue/physical_block_size
4096
$ cat /sys/block/pmem3/queue/hw_sector_size
512
$ cat /sys/block/pmem3/queue/minimum_io_size
4096

Previously physical_block_size was 512 and minimum_io_size was 0.
What about logical_block_size and hw_sector_size still being 512?

So do we want to change pmem rather than changing DAX?

-- ljk
> 
>> Cheers,
>> Dave.
> 
> Thanks
> Boaz
> 



Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

2015-08-05 Thread Linda Knippers
On 08/05/2015 06:01 PM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> Hi, Matthew,
>>
>> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>>
>> # mkfs -t xfs -f /dev/pmem0
>> meta-data=/dev/pmem0 isize=256  agcount=4, agsize=524288 blks
>>  =   sectsz=512   attr=2, projid32bit=1
>>  =   crc=0  finobt=0
>> data =   bsize=4096   blocks=2097152, imaxpct=25
>>  =   sunit=0  swidth=0 blks
>> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
>> log  =internal log   bsize=4096   blocks=2560, version=2
>>  =   sectsz=512   sunit=0 blks, lazy-count=1
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>> mkfs.xfs: read failed: Numerical result out of range
>>
>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>> from the last sector of the device.  This results in dax_io trying to do
>> a page-sized I/O at 512 bytes from the end of the device.
> 
> Right - we have to be able to do IO to that last sector, so this is
> a sanity check to tell if the block dev is large enough. The XFS
> kernel code does the same end-of-device sector read when the
> filesystem is mounted, too.
> 
>> bdev_direct_access, receiving this bogus pos/size combo, returns
>> -ERANGE:
>>
>>	if ((sector + DIV_ROUND_UP(size, 512)) >
>>			part_nr_sects_read(bdev->bd_part))
>>		return -ERANGE;
>>
>> Given that file systems supporting dax refuse to mount with a blocksize
>> != page size, I'm guessing this is sort of expected behavior.  However,
>> we really shouldn't be breaking direct I/O on pmem devices.
> 
> If the device is advertising 512 byte sector size support, then this
> needs to work, especially as DAX is completely transparent on the
> block device. Remember that DAX through a filesystem works on
> filesystem data block size boundaries, so a 512 byte sector/4k block
> size filesystem will be able to use DAX for mmapped files just fine.
> 
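For reference, such a 512-byte-sector / 4k-block filesystem can be
requested explicitly; an illustrative invocation, assuming the xfsprogs
of the time:

  $ mkfs.xfs -s size=512 -b size=4096 /dev/pmem0
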
>> So, what do you want to do?  We could make the pmem device's logical
>> block size fixed at the system page size.  Or, we could modify the dax
>> code to work with blocksize < pagesize.  Or, we could continue using the
>> direct I/O codepath for direct block device access.  What do you think?
> 
> I don't know how the pmem device sets up its limits. Can you post
> the output of:
> 
>   /sys/block/pmem0/queue/logical_block_size
512

>   /sys/block/pmem0/queue/physical_block_size
512

>   /sys/block/pmem0/queue/hw_sector_size
512

>   /sys/block/pmem0/queue/minimum_io_size
512

>   /sys/block/pmem0/queue/optimal_io_size
0

Let me know if you need anything else.

-- ljk


> As these all affect how mkfs.xfs configures the filesystem being
> made and so influences the size and alignment of the IO it does
> 
> Cheers,
> 
> Dave.
> 



Re: [Linux-nvdimm] [PATCH v2 08/20] libnd, nd_acpi: regions (block-data-window, persistent memory, volatile memory)

2015-05-28 Thread Linda Knippers
On 5/28/2015 3:59 PM, Dan Williams wrote:
> On Thu, May 28, 2015 at 11:36 AM, Toshi Kani  wrote:
>> On Sat, 2015-05-09 at 16:55 -0700, Dan Williams wrote:
>>> On Mon, May 4, 2015 at 1:26 PM, Toshi Kani  wrote:
>>>> On Tue, 2015-04-28 at 14:24 -0400, Dan Williams wrote:
>>>>>  :
>>>>
>>>> The libnd does not support memdev->flags, which contains "Memory Device
>>>> State Flags" defined in Table 5-129 of ACPI 6.0.  In case of major
>>>> errors, we should only allow a failed NVDIMM to be accessed read-only
>>>> for possible data recovery (or not allow any access when the data is
>>>> completely lost), and should not let users operate normally over the
>>>> corrupted data until the error is dealt with properly.
>>>
>>> I agree with setting read-only access when these flags show that the
>>> battery is not ready to persist new writes, but I don't think we
>>> should block access in the case where the restore from flash failed.
>>> If the data is potentially corrupted we should log that fact, but
>>> otherwise enable access.  I.e. potentially corrupt data is better than
>>> unavailable data.  It's up to filesystem or application to maintain
>>> its own checksums to catch data corruption.
>>>
>>>> Can you set memdev->flags to nd_region(_desc) so that the pmem driver
>>>> can check the status in nd_pmem_probe()?  nd_pmem_probe() can then set
>>>> the disk read-only or fail probing, and log errors accordingly.
>>>
>>> Will do.
>>
>> I do not see this change in v4.  Is this part of the pending changes
>> behind this release?
> 
> Yes, I was holding it off until we had an upstream acceptance baseline
> set.  That is on hold pending Christoph's review.  He's looking to
> come back next Wednesday with deeper review comments.  The runway to
> land this in v4.2 is running short...

Hi Dan,

Do you have a short list of pending changes, especially if some weren't
discussed on the list?  That might help reviewers.

I know we're still looking at and trying a number of things, like using
the BTT on today's NVDIMMs and adding another example DSM, so we will
have more feedback and patches and may need to adapt some of the
structure to do that.  This can happen after the initial patches are
pulled in but I just wanted to let you know where we are.  Not sure
about others.
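
For what it's worth, a hypothetical sketch of the probe-time flag handling
discussed above (illustrative names, not the shipped driver): read-only
when the device cannot persist new writes, a warning rather than a probe
failure when the restore failed:

	/* "flags" would carry the NFIT memory-device state bits down
	 * to the pmem probe path. */
	if (flags & ACPI_NFIT_MEM_ARMED)	/* set means NOT armed */
		set_disk_ro(disk, 1);		/* recovery reads only */
	if (flags & ACPI_NFIT_MEM_RESTORE_FAILED)
		dev_warn(dev, "restore failed, data may be corrupt\n");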

-- ljk



