Re: [PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread J Freyensee
On Fri, 2016-09-02 at 14:43 -0700, Andy Lutomirski wrote:
> On Fri, Sep 2, 2016 at 2:15 PM, J Freyensee
>  wrote:
> > 
> > On Tue, 2016-08-30 at 14:59 -0700, Andy Lutomirski wrote:
> > > 
> > > NVMe devices can advertise multiple power states.  These states
> > > can be either "operational" (the device is fully functional but
> > > possibly slow) or "non-operational" (the device is asleep until
> > > woken up).  Some devices can automatically enter a non-operational
> > > state when idle for a specified amount of time and then
> > > automatically wake back up when needed.
> > > 
> > > The hardware configuration is a table.  For each state, an entry
> > > in the table indicates the next deeper non-operational state, if
> > > any, to autonomously transition to and the idle time required
> > > before transitioning.
> > > 
> > > This patch teaches the driver to program APST so that each
> > > successive non-operational state will be entered after an idle
> > > time equal to 100 times the total latency (entry plus exit)
> > > associated with that state.  A sysfs attribute
> > > 'apst_max_latency_us' gives the maximum acceptable latency in
> > > microseconds; non-operational states with total latency greater
> > > than this value will not be used.  As a special case,
> > > apst_max_latency_us=0 will disable APST entirely.
> > 
> > May I ask a dumb question?
> > 
> > How does this work with multiple NVMe devices plugged into a
> > system?  I would have thought we'd want one apst_max_latency_us
> > entry per NVMe controller for individual control of each device?
> > I have two Fultondale-class devices plugged into a system I tried
> > these patches on (the 4.8-rc4 kernel) and I'm not sure how the
> > single /sys/module/nvme_core/parameters/apst_max_latency_us would
> > work for my 2 devices (and the value is at the default of 25000).
> > 
> 
> Ah, I faked you out :(
> 
> The module parameter (nvme_core/parameters/apst_max_latency_us) just
> sets the default for newly probed devices.  The actual setting is in
> /sys/devices/whatever (symlinked from /sys/block/nvme0n1/device, for
> example).  Perhaps I should name the former
> default_apst_max_latency_us.

It would certainly make it clearer what the entry is for, but then
the name is also getting longer.

Maybe just "default_apst_latency_us"? Or perhaps it's fine as-is.



Re: [PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread Andy Lutomirski
On Fri, Sep 2, 2016 at 2:15 PM, J Freyensee
 wrote:
> On Tue, 2016-08-30 at 14:59 -0700, Andy Lutomirski wrote:
>> NVMe devices can advertise multiple power states.  These states can
>> be either "operational" (the device is fully functional but possibly
>> slow) or "non-operational" (the device is asleep until woken up).
>> Some devices can automatically enter a non-operational state when
>> idle for a specified amount of time and then automatically wake back
>> up when needed.
>>
>> The hardware configuration is a table.  For each state, an entry in
>> the table indicates the next deeper non-operational state, if any,
>> to autonomously transition to and the idle time required before
>> transitioning.
>>
>> This patch teaches the driver to program APST so that each
>> successive non-operational state will be entered after an idle time
>> equal to 100 times the total latency (entry plus exit) associated
>> with that state.  A sysfs attribute 'apst_max_latency_us' gives the
>> maximum acceptable latency in microseconds; non-operational states
>> with total latency greater than this value will not be used.  As a
>> special case, apst_max_latency_us=0 will disable APST entirely.
>
> May I ask a dumb question?
>
> How does this work with multiple NVMe devices plugged into a system?  I
> would have thought we'd want one apst_max_latency_us entry per NVMe
> controller for individual control of each device?  I have two
> Fultondale-class devices plugged into a system I tried these patches on
> (the 4.8-rc4 kernel) and I'm not sure how the single
> /sys/module/nvme_core/parameters/apst_max_latency_us would work for
> my 2 devices (and the value is at the default of 25000).
>

Ah, I faked you out :(

The module parameter (nvme_core/parameters/apst_max_latency_us) just
sets the default for newly probed devices.  The actual setting is in
/sys/devices/whatever (symlinked from /sys/block/nvme0n1/device, for
example).  Perhaps I should name the former
default_apst_max_latency_us.
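
Concretely, the split looks something like this (a sketch; the exact
path under /sys/devices depends on the bus topology, and the
per-device attribute only exists on APST-capable controllers):

  # Module-wide default, in microseconds, applied to controllers
  # probed from now on:
  cat /sys/module/nvme_core/parameters/apst_max_latency_us
  echo 30000 > /sys/module/nvme_core/parameters/apst_max_latency_us

  # Per-controller setting, reached here through the block device's
  # 'device' symlink:
  echo 15000 > /sys/block/nvme0n1/device/apst_max_latency_us

  # On hardware without APST (apsta == 0), the per-device attribute
  # is simply absent:
  test -e /sys/block/nvme0n1/device/apst_max_latency_us || echo "no APST"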

>
>>
>> On hardware without APST support, apst_max_latency_us will not be
>> exposed in sysfs.
>
> Not sure that is true, as from what I see so far, Fultondales don't
> support apst, yet I still see:
>
> [root@nvme-fabric-host01 nvme-cli]# cat
> /sys/module/nvme_core/parameters/apst_max_latency_us
> 25000

That will be there regardless.  It's the value in the sysfs device
directory that won't be there, which is hopefully why you couldn't
find it.

>
>>
>> In theory, the device can expose a "default" APST table, but this
>> doesn't seem to function correctly on my device (Samsung 950), nor
>> does it seem particularly useful.  There is also an optional
>> mechanism by which a configuration can be "saved" so it will be
>> automatically loaded on reset.  This can be configured from
>> userspace, but it doesn't seem useful to support in the driver.
>>
>> On my laptop, enabling APST seems to save nearly 1W.
>>
>> The hardware tables can be decoded in userspace with nvme-cli.
>> 'nvme id-ctrl /dev/nvmeN' will show the power state table and
>> 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
>> configuration.
>
> nvme get-feature -f 0x0c -H /dev/nvme0
>
> isn't working for me, I get a:
>
> [root@nvme-fabric-host01 nvme-cli]# ./nvme get-feature -f 0x0c -H
> /dev/nvme0
> NVMe Status:INVALID_FIELD(2)
>
> I don't have the time right now to investigate further, but I'll assume
> it's because I have Fultondales (though I would have thought this patch
> would have provided enough code for the latest nvme-cli to do this
> new get-feature as-is).

I'm assuming it doesn't work because your hardware doesn't support
APST.  nvme get-feature works even without my patches, since it mostly
bypasses the driver.
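
To make that concrete (a sketch, assuming the INVALID_FIELD status
comes from the controller itself rejecting a feature it doesn't
implement):

  # Query the Autonomous Power State Transition feature (FID 0x0c);
  # -H asks nvme-cli to decode the returned table:
  nvme get-feature -f 0x0c -H /dev/nvme0
  # On a controller reporting apsta == 0, this is expected to fail
  # with "NVMe Status:INVALID_FIELD(2)", as seen above.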

--Andy


Re: [PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-09-02 Thread J Freyensee
On Tue, 2016-08-30 at 14:59 -0700, Andy Lutomirski wrote:
> NVMe devices can advertise multiple power states.  These states can
> be either "operational" (the device is fully functional but possibly
> slow) or "non-operational" (the device is asleep until woken up).
> Some devices can automatically enter a non-operational state when
> idle for a specified amount of time and then automatically wake back
> up when needed.
> 
> The hardware configuration is a table.  For each state, an entry in
> the table indicates the next deeper non-operational state, if any,
> to autonomously transition to and the idle time required before
> transitioning.
> 
> This patch teaches the driver to program APST so that each
> successive non-operational state will be entered after an idle time
> equal to 100 times the total latency (entry plus exit) associated
> with that state.  A sysfs attribute 'apst_max_latency_us' gives the
> maximum acceptable latency in microseconds; non-operational states
> with total latency greater than this value will not be used.  As a
> special case, apst_max_latency_us=0 will disable APST entirely.

May I ask a dumb question?

How does this work with multiple NVMe devices plugged into a system?  I
would have thought we'd want one apst_max_latency_us entry per NVMe
controller for individual control of each device?  I have two
Fultondale-class devices plugged into a system I tried these patches on
(the 4.8-rc4 kernel) and I'm not sure how the single
/sys/module/nvme_core/parameters/apst_max_latency_us would work for
my 2 devices (and the value is at the default of 25000).

Now, from 'nvme id-ctrl /dev/nvme0' (or nvme1):

NVME Identify Controller:
vid : 0x8086
ssvid   : 0x8086
sn  : CVFT41720018800HGN  
mn  : INTEL SSDPE2MD800G4 
fr  : 8DV10151
rab : 0
ieee: 5cd2e4
cmic: 0
mdts: 5
cntlid  : 0
ver : 0
rtd3r   : 0
rtd3e   : 0
oaes: 0
oacs: 0x6
acl : 3
aerl: 3
frmw: 0x2
lpa : 0x2
elpe: 63
npss: 0
avscc   : 0
apsta   : 0 <-

So the Fultondales don't support APST.
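
A quick way to check any controller for APST support is to pull just
the relevant fields out of the identify data (apsta != 0 means
autonomous transitions are supported; npss is the index of the
lowest-power state):

  nvme id-ctrl /dev/nvme0 | grep -E '^(apsta|npss)'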

But I'd still like to ask the dumb question :-).

> 
> On hardware without APST support, apst_max_latency_us will not be
> exposed in sysfs.

Not sure that is true, as from what I see so far, Fultondales don't
support apst, yet I still see:

[root@nvme-fabric-host01 nvme-cli]# cat
/sys/module/nvme_core/parameters/apst_max_latency_us
25000

> 
> In theory, the device can expose a "default" APST table, but this
> doesn't seem to function correctly on my device (Samsung 950), nor
> does it seem particularly useful.  There is also an optional
> mechanism by which a configuration can be "saved" so it will be
> automatically loaded on reset.  This can be configured from
> userspace, but it doesn't seem useful to support in the driver.
> 
> On my laptop, enabling APST seems to save nearly 1W.
> 
> The hardware tables can be decoded in userspace with nvme-cli.
> 'nvme id-ctrl /dev/nvmeN' will show the power state table and
> 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
> configuration.

nvme get-feature -f 0x0c -H /dev/nvme0

isn't working for me, I get a:

[root@nvme-fabric-host01 nvme-cli]# ./nvme get-feature -f 0x0c -H
/dev/nvme0
NVMe Status:INVALID_FIELD(2)

I don't have the time right now to investigate further, but I'll assume
it's because I have Fultondales (though I would have thought this patch
would have provided enough code for the latest nvme-cli to do this
new get-feature as-is).

Jay



[PATCH v2 3/3] nvme: Enable autonomous power state transitions

2016-08-30 Thread Andy Lutomirski
NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each
successive non-operational state will be entered after an idle time
equal to 100 times the total latency (entry plus exit) associated
with that state.  A sysfs attribute 'apst_max_latency_us' gives the
maximum acceptable latency in microseconds; non-operational states
with total latency greater than this value will not be used.  As a
special case, apst_max_latency_us=0 will disable APST entirely.
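
For example (illustrative numbers): a non-operational state with an
entry latency of 5,000 us and an exit latency of 20,000 us has a
total latency of 25,000 us.  With the default apst_max_latency_us of
25000 that state is still eligible, and the driver would program an
idle timeout of 100 * 25,000 us = 2.5 seconds before autonomously
entering it.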

On hardware without APST support, apst_max_latency_us will not be
exposed in sysfs.

In theory, the device can expose a "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

Signed-off-by: Andy Lutomirski 
---
 drivers/nvme/host/core.c | 167 +++
 drivers/nvme/host/nvme.h |   6 ++
 include/linux/nvme.h     |   6 ++
 3 files changed, 179 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9260d2971176..8aea8dfacda6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -56,6 +56,12 @@ EXPORT_SYMBOL_GPL(nvme_max_retries);
 static int nvme_char_major;
 module_param(nvme_char_major, int, 0);
 
+static unsigned long default_apst_max_latency_us = 25000;
+module_param_named(apst_max_latency_us, default_apst_max_latency_us,
+  ulong, 0644);
+MODULE_PARM_DESC(apst_max_latency_us,
+"default max APST latency; overridden per device in sysfs");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1209,6 +1215,98 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_configure_apst(struct nvme_ctrl *ctrl)
+{
+   /*
+* APST (Autonomous Power State Transition) lets us program a
+* table of power state transitions that the controller will
+* perform automatically.  We configure it with a simple
+* heuristic: we are willing to spend at most 2% of the time
+* transitioning between power states.  Therefore, when running
+* in any given state, we will enter the next lower-power
+* non-operational state after waiting 100 * (enlat + exlat)
+* microseconds, as long as that state's total latency is under
+* the requested maximum latency.
+*
+* We will not autonomously enter any non-operational state for
+* which the total latency exceeds apst_max_latency_us.  Users
+* can set apst_max_latency_us to zero to turn off APST.
+*/
+
+   unsigned apste;
+   struct nvme_feat_auto_pst *table;
+   int ret;
+
+   if (!ctrl->apsta)
+   return; /* APST isn't supported. */
+
+   if (ctrl->npss > 31) {
+   dev_warn(ctrl->device, "NPSS is invalid; not using APST\n");
+   return;
+   }
+
+   table = kzalloc(sizeof(*table), GFP_KERNEL);
+   if (!table)
+   return;
+
+   if (ctrl->apst_max_latency_us == 0) {
+   /* Turn off APST. */
+   apste = 0;
+   } else {
+   __le64 target = cpu_to_le64(0);
+   int state;
+
+   /*
+* Walk through all states from lowest- to highest-power.
+* According to the spec, lower-numbered states use more
+* power.  NPSS, despite the name, is the index of the
+* lowest-power state, not the number of states.
+*/
+   for (state = (int)ctrl->npss; state >= 0; state--) {
+   u64 total_latency_us, transition_ms;
+
+   if (target)
+   table->entries[state] = target;
+
+   /*
+* Is this state a useful non-operational state for
+* higher-power states to autonomously transition to?
