Re: [PATCH v3] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-05-08 Thread Ira Weiny
On Sat, May 08, 2021 at 10:06:42AM +0530, Vaibhav Jain wrote:
> Currently drc_pmem_query_stats() generates a dev_err in case the
> "Enable Performance Information Collection" feature is disabled from
> HMC or performance stats are not available for an nvdimm. The error is
> of the form below:
> 
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>performance stats, Err:-10
> 
> This error message confuses users as it implies a possible problem
> with the nvdimm even though it's due to a disabled/unavailable
> feature. We fix this by explicitly handling the H_AUTHORITY and
> H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.
> 
> In case of H_AUTHORITY error an info message is logged instead of an
> error, saying that "Permission denied while accessing performance
> stats" and an EPERM error is returned back.
> 
> In case of H_UNSUPPORTED error we return an EOPNOTSUPP error back from
> drc_pmem_query_stats() indicating that performance stats-query
> operation is not supported on this nvdimm.
> 
> Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
> Signed-off-by: Vaibhav Jain 

Reviewed-by: Ira Weiny 

> ---
> Changelog
> 
> v3:
> * Return EOPNOTSUPP error in case of H_UNSUPPORTED [ Ira ]
> * Return EPERM in case of H_AUTHORITY [ Ira ]
> * Updated patch description
> 
> v2:
> * Updated the message logged in case of H_AUTHORITY error [ Ira ]
> * Switched from dev_warn to dev_info in case of H_AUTHORITY error.
> * Instead of -EPERM return -EACCES for H_AUTHORITY error.
> * Added explicit handling of H_UNSUPPORTED error.
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index ef26fe40efb0..e2b69cc3beaf 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -310,6 +310,13 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
>  		dev_err(&p->pdev->dev,
>  			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
>  		return -ENOENT;
> +	} else if (rc == H_AUTHORITY) {
> +		dev_info(&p->pdev->dev,
> +			 "Permission denied while accessing performance stats");
> +		return -EPERM;
> +	} else if (rc == H_UNSUPPORTED) {
> +		dev_dbg(&p->pdev->dev, "Performance stats unsupported\n");
> +		return -EOPNOTSUPP;
>  	} else if (rc != H_SUCCESS) {
>  		dev_err(&p->pdev->dev,
>  			"Failed to query performance stats, Err:%lld\n", rc);
> -- 
> 2.31.1
> 


Re: [PATCH v2] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-05-05 Thread Ira Weiny
On Thu, May 06, 2021 at 12:46:06AM +0530, Vaibhav Jain wrote:
> Currently drc_pmem_query_stats() generates a dev_err in case the
> "Enable Performance Information Collection" feature is disabled from
> HMC or performance stats are not available for an nvdimm. The error is
> of the form below:
> 
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>performance stats, Err:-10
> 
> This error message confuses users as it implies a possible problem
> with the nvdimm even though it's due to a disabled/unavailable
> feature. We fix this by explicitly handling the H_AUTHORITY and
> H_UNSUPPORTED errors from the H_SCM_PERFORMANCE_STATS hcall.
> 
> In case of H_AUTHORITY error an info message is logged instead of an
> error, saying that "Permission denied while accessing performance
> stats". Also '-EACCES' error is return instead of -EPERM.

I thought you clarified before that this was a permission issue.  So why change
the error to EACCES?

> 
> In case of H_UNSUPPORTED error we return a -EPERM error back from
> drc_pmem_query_stats() indicating that performance stats-query
> operation is not supported on this nvdimm.

EPERM seems wrong here too...  ENOTSUP?

Ira


Re: [PATCH v3 1/3] fsdax: Factor helpers to simplify dax fault code

2021-04-27 Thread Ira Weiny
On Tue, Apr 27, 2021 at 02:44:33AM +, ruansy.f...@fujitsu.com wrote:
> > -Original Message-
> > From: Ira Weiny 
> > Sent: Tuesday, April 27, 2021 7:38 AM
> > Subject: Re: [PATCH v3 1/3] fsdax: Factor helpers to simplify dax fault code
> > 
> > On Thu, Apr 22, 2021 at 09:44:59PM +0800, Shiyang Ruan wrote:
> > > The dax page fault code is too long and a bit difficult to read. And
> > > it is hard to understand when we are trying to add new features. Some of
> > > the PTE/PMD codes have similar logic. So, factor them as helper
> > > functions to simplify the code.
> > >
> > > Signed-off-by: Shiyang Ruan 
> > > Reviewed-by: Christoph Hellwig 
> > > Reviewed-by: Ritesh Harjani 
> > > ---
> > >  fs/dax.c | 153 ++-
> > >  1 file changed, 84 insertions(+), 69 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index b3d27fdc6775..f843fb8fbbf1 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > 
> > [snip]
> > 
> > > @@ -1355,19 +1379,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > >  	entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
> > >  				 0, write && !sync);
> > >
> > > - /*
> > > -  * If we are doing synchronous page fault and inode needs fsync,
> > > -  * we can insert PTE into page tables only after that happens.
> > > -  * Skip insertion for now and return the pfn so that caller can
> > > -  * insert it after fsync is done.
> > > -  */
> > >   if (sync) {
> > > - if (WARN_ON_ONCE(!pfnp)) {
> > > - error = -EIO;
> > > - goto error_finish_iomap;
> > > - }
> > > - *pfnp = pfn;
> > > - ret = VM_FAULT_NEEDDSYNC | major;
> > > + ret = dax_fault_synchronous_pfnp(pfnp, pfn);
> > 
> > I commented on the previous version...  So I'll ask here too.
> > 
> > Why is it ok to drop 'major' here?
> 
> This dax_iomap_pte_fault() finally returns 'ret | major', so I think
> 'major' is not dropped here.  The original code seems to OR the return
> value and 'major' twice.
> 

Thanks I missed that!
Ira
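
For context, the flow Shiyang describes looks roughly like this -- a sketch of
the control flow only, based on his explanation, not the exact fs/dax.c code:

	static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, ...)
	{
		vm_fault_t ret = 0;
		unsigned int major = 0;	/* set to VM_FAULT_MAJOR when a read blocks */
		...
		if (sync)
			/* helper returns only VM_FAULT_NEEDDSYNC now */
			ret = dax_fault_synchronous_pfnp(pfnp, pfn);
		...
		return ret | major;	/* 'major' is ORed in exactly once, here */
	}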


Re: [PATCH v3 1/3] fsdax: Factor helpers to simplify dax fault code

2021-04-26 Thread Ira Weiny
On Thu, Apr 22, 2021 at 09:44:59PM +0800, Shiyang Ruan wrote:
> The dax page fault code is too long and a bit difficult to read. And it
> is hard to understand when we are trying to add new features. Some of the
> PTE/PMD codes have similar logic. So, factor them as helper functions to
> simplify the code.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 153 ++-
>  1 file changed, 84 insertions(+), 69 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b3d27fdc6775..f843fb8fbbf1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c

[snip]

> @@ -1355,19 +1379,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
>  				 0, write && !sync);
>  
> - /*
> -  * If we are doing synchronous page fault and inode needs fsync,
> -  * we can insert PTE into page tables only after that happens.
> -  * Skip insertion for now and return the pfn so that caller can
> -  * insert it after fsync is done.
> -  */
>   if (sync) {
> - if (WARN_ON_ONCE(!pfnp)) {
> - error = -EIO;
> - goto error_finish_iomap;
> - }
> - *pfnp = pfn;
> - ret = VM_FAULT_NEEDDSYNC | major;
> + ret = dax_fault_synchronous_pfnp(pfnp, pfn);

I commented on the previous version...  So I'll ask here too.

Why is it ok to drop 'major' here?

Ira


Re: [PATCH v2 1/3] fsdax: Factor helpers to simplify dax fault code

2021-04-26 Thread Ira Weiny
On Wed, Apr 07, 2021 at 09:38:21PM +0800, Shiyang Ruan wrote:
> The dax page fault code is too long and a bit difficult to read. And it
> is hard to understand when we are trying to add new features. Some of the
> PTE/PMD codes have similar logic. So, factor them as helper functions to
> simplify the code.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 153 ++-
>  1 file changed, 84 insertions(+), 69 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b3d27fdc6775..f843fb8fbbf1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c

[snip]

> @@ -1355,19 +1379,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
>  				 0, write && !sync);
>  
> - /*
> -  * If we are doing synchronous page fault and inode needs fsync,
> -  * we can insert PTE into page tables only after that happens.
> -  * Skip insertion for now and return the pfn so that caller can
> -  * insert it after fsync is done.
> -  */
>   if (sync) {
> - if (WARN_ON_ONCE(!pfnp)) {
> - error = -EIO;
> - goto error_finish_iomap;
> - }
> - *pfnp = pfn;
> - ret = VM_FAULT_NEEDDSYNC | major;
> + ret = dax_fault_synchronous_pfnp(pfnp, pfn);

Is there a reason VM_FAULT_MAJOR should be left out here?

Ira


Re: [PATCH] libnvdimm.h: Remove duplicate struct declaration

2021-04-19 Thread Ira Weiny
On Mon, Apr 19, 2021 at 07:27:25PM +0800, Wan Jiabing wrote:
> struct device is declared at line 133.
> The declaration here is unnecessary. Remove it.
> 
> Signed-off-by: Wan Jiabing 
> ---
>  include/linux/libnvdimm.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> index 01f251b6e36c..89b69e645ac7 100644
> --- a/include/linux/libnvdimm.h
> +++ b/include/linux/libnvdimm.h
> @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct device *dev,
>  
>  struct nvdimm_bus;
>  struct module;
> -struct device;
>  struct nd_blk_region;

What is the coding style preference for pre-declarations like this?  Should
they be placed at the top of the file?

The patch is reasonable but if the intent is to declare right before use for
clarity, both devm_nvdimm_memremap() and nd_blk_region_desc() use struct
device.  So perhaps this duplicate is on purpose?

Ira

>  struct nd_blk_region_desc {
>   int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev);
> -- 
> 2.25.1
> 


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Ira Weiny
On Wed, Apr 14, 2021 at 09:51:40PM +0530, Vaibhav Jain wrote:
> Thanks for looking into this patch Ira,
> 
> Ira Weiny  writes:
> 
> > On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
> >> Currently drc_pmem_query_stats() generates a dev_err in case the
> >> "Enable Performance Information Collection" feature is disabled from
> >> HMC. The error is of the form below:
> >> 
> >> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
> >> performance stats, Err:-10
> >> 
> >> This error message confuses users as it implies a possible problem
> >> with the nvdimm even though it's due to a disabled feature.
> >> 
> >> So we fix this by explicitly handling the H_AUTHORITY error from the
> >> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> >> error, saying that "Performance stats in-accessible".
> >> 
> >> Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
> >> Signed-off-by: Vaibhav Jain 
> >> ---
> >>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> >> index 835163f54244..9216424f8be3 100644
> >> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
> >>  		dev_err(&p->pdev->dev,
> >>  			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
> >>  		return -ENOENT;
> >> +	} else if (rc == H_AUTHORITY) {
> >> +		dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> >> +		return -EPERM;
> >
> > Is this because of a disabled feature or because of permissions?
> 
> It's because of a disabled feature that revokes permission for a guest to
> retrieve performance statistics.
> 
> The feature is called "Enable Performance Information Collection" and
> once disabled the hcall H_SCM_PERFORMANCE_STATS returns an error
> H_AUTHORITY indicating that the guest doesn't have permission to retrieve
> performance statistics.

In that case would it be appropriate to have the error message indicate a
permission issue?

Something like 'permission denied'?

Ira


Re: [PATCH] powerpc/papr_scm: Reduce error severity if nvdimm stats inaccessible

2021-04-14 Thread Ira Weiny
On Wed, Apr 14, 2021 at 06:10:26PM +0530, Vaibhav Jain wrote:
> Currently drc_pmem_query_stats() generates a dev_err in case the
> "Enable Performance Information Collection" feature is disabled from
> HMC. The error is of the form below:
> 
> papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Failed to query
>performance stats, Err:-10
> 
> This error message confuses users as it implies a possible problem
> with the nvdimm even though it's due to a disabled feature.
> 
> So we fix this by explicitly handling the H_AUTHORITY error from the
> H_SCM_PERFORMANCE_STATS hcall and generating a warning instead of an
> error, saying that "Performance stats in-accessible".
> 
> Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..9216424f8be3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -277,6 +277,9 @@ static ssize_t drc_pmem_query_stats(struct papr_scm_priv *p,
>  		dev_err(&p->pdev->dev,
>  			"Unknown performance stats, Err:0x%016lX\n", ret[0]);
>  		return -ENOENT;
> +	} else if (rc == H_AUTHORITY) {
> +		dev_warn(&p->pdev->dev, "Performance stats in-accessible");
> +		return -EPERM;

Is this because of a disabled feature or because of permissions?

Ira

>  	} else if (rc != H_SUCCESS) {
>  		dev_err(&p->pdev->dev,
>  			"Failed to query performance stats, Err:%lld\n", rc);
> -- 
> 2.30.2
> 


Re: [PATCH v1] libnvdimm, dax: Fix a missing check in nd_dax_probe()

2021-04-09 Thread Ira Weiny
On Thu, Apr 08, 2021 at 06:58:26PM -0700, wangyingji...@126.com wrote:
> From: Yingjie Wang 
> 
> In nd_dax_probe(), 'nd_dax' is allocated by nd_dax_alloc().
> nd_dax_alloc() may fail and return NULL, so we should better check

Avoid the use of 'we'.

> it's return value to avoid a NULL pointer dereference
> a bit later in the code.

How about:

"nd_dax_alloc() may fail and return NULL.  Check for NULL before attempting to
use nd_dax to avoid a NULL pointer dereference."

> 
> Fixes: c5ed9268643c ("libnvdimm, dax: autodetect support")
> Signed-off-by: Yingjie Wang 

The code looks good though.

Ira

> ---
>  drivers/nvdimm/dax_devs.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c
> index 99965077bac4..b1426ac03f01 100644
> --- a/drivers/nvdimm/dax_devs.c
> +++ b/drivers/nvdimm/dax_devs.c
> @@ -106,6 +106,8 @@ int nd_dax_probe(struct device *dev, struct nd_namespace_common *ndns)
>  
>  	nvdimm_bus_lock(&ndns->dev);
>  	nd_dax = nd_dax_alloc(nd_region);
> +	if (!nd_dax)
> +		return -ENOMEM;
>  	nd_pfn = &nd_dax->nd_pfn;
>  	dax_dev = nd_pfn_devinit(nd_pfn, ndns);
>  	nvdimm_bus_unlock(&ndns->dev);
> -- 
> 2.7.4
> 


Re: [PATCH] ndtest: Remove redundant NULL check

2021-03-22 Thread Ira Weiny
On Mon, Mar 22, 2021 at 06:00:40PM +0800, Jiapeng Chong wrote:
> Fix the following coccicheck warnings:
> 
> ./tools/testing/nvdimm/test/ndtest.c:491:2-7: WARNING: NULL check before
> some freeing functions is not needed.

I don't think there is anything wrong with this patch specifically but why is
buf not checked for null after the vmalloc?

It seems to me that if size >= DIMM_SIZE and the vmalloc fails the gen pool
allocation is going to be leaked.
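
A check along these lines would close that hole (a sketch against
ndtest_alloc_resource(), untested; vfree(NULL) is a no-op, so the existing
buf_err label already does the right cleanup):

	buf = vmalloc(size);
	if (!buf)
		goto buf_err;	/* frees the gen_pool allocation taken above */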

Ira

> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  tools/testing/nvdimm/test/ndtest.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/tools/testing/nvdimm/test/ndtest.c b/tools/testing/nvdimm/test/ndtest.c
> index 6862915..98b4a43 100644
> --- a/tools/testing/nvdimm/test/ndtest.c
> +++ b/tools/testing/nvdimm/test/ndtest.c
> @@ -487,8 +487,7 @@ static void *ndtest_alloc_resource(struct ndtest_priv *p, size_t size,
>  buf_err:
>  	if (__dma && size >= DIMM_SIZE)
>  		gen_pool_free(ndtest_pool, __dma, size);
> -	if (buf)
> -		vfree(buf);
> +	vfree(buf);
>   kfree(res);
>  
>   return NULL;
> -- 
> 1.8.3.1
> 


Re: [PATCH] libnvdimm/pmem: remove unused header.

2021-01-04 Thread Ira Weiny
On Mon, Jan 04, 2021 at 02:32:06PM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2021 at 2:19 PM Ira Weiny  wrote:
> >
> > On Mon, Jan 04, 2021 at 01:16:32PM -0800, Dan Williams wrote:
> > > On Mon, Dec 28, 2020 at 9:18 AM Ira Weiny  wrote:
> > > >
> > > > On Fri, Dec 25, 2020 at 09:35:46AM +0800, Jianpeng Ma wrote:
> > > > > 'commit a8b456d01cd6 ("bdi: remove BDI_CAP_SYNCHRONOUS_IO")' forgot
> > > >
> > > > This information should be part of a fixes tag.
> > >
> > > Oh, I was just about to comment "don't provide a Fixes tag for pure
> > > cleanups". Fixes is for functional issues that a backporter should
> > > consider.
> >
> > I thought this was discussed recently and it was concluded that 'fixes' does
> > not indicate something should be backported?
> >
> > ...
> >
> > https://lore.kernel.org/lkml/x8flmvawl0158...@kroah.com/
> >
> > At least that is what Greg KH said.  But Dave C. was not happy with this...
> 
> I was thinking of it more from the distro automation tooling that I
> know fires up when it sees a "Fixes" tag. At a minimum this would make
> those systems / humans fire up just to notice, "oh, just a plain
> cleanup with no logical side-effects".

I'm not really following but I don't think it is a big deal either way.  It
just seemed like the text used looked like a 'fixes' line and seemed ok to put
in there.

In this case it is probably fine.

Sorry for the noise,
Ira


Re: [PATCH] libnvdimm/pmem: remove unused header.

2021-01-04 Thread Ira Weiny
On Mon, Jan 04, 2021 at 01:16:32PM -0800, Dan Williams wrote:
> On Mon, Dec 28, 2020 at 9:18 AM Ira Weiny  wrote:
> >
> > On Fri, Dec 25, 2020 at 09:35:46AM +0800, Jianpeng Ma wrote:
> > > 'commit a8b456d01cd6 ("bdi: remove BDI_CAP_SYNCHRONOUS_IO")' forgot
> >
> > This information should be part of a fixes tag.
> 
> Oh, I was just about to comment "don't provide a Fixes tag for pure
> cleanups". Fixes is for functional issues that a backporter should
> consider.

I thought this was discussed recently and it was concluded that 'fixes' does
not indicate something should be backported?

...

https://lore.kernel.org/lkml/x8flmvawl0158...@kroah.com/

At least that is what Greg KH said.  But Dave C. was not happy with this...

:-/

Sorry...

Ira


Re: [PATCH v2] fs/dax: include to fix build error on ARC

2021-01-04 Thread Ira Weiny
On Thu, Dec 31, 2020 at 08:29:14PM -0800, Randy Dunlap wrote:
> fs/dax.c uses copy_user_page() but ARC does not provide that interface,
> resulting in a build error.
> 
> Provide copy_user_page() in  (beside copy_page()) and
> add  to fs/dax.c to fix the build error.
> 
> ../fs/dax.c: In function 'copy_cow_page_dax':
> ../fs/dax.c:702:2: error: implicit declaration of function 'copy_user_page'; 
> did you mean 'copy_to_user_page'? [-Werror=implicit-function-declaration]
> 
> Fixes: cccbce671582 ("filesystem-dax: convert to dax_direct_access()")
> Reported-by: kernel test robot 
> Signed-off-by: Randy Dunlap 

Looks reasonable
Reviewed-by: Ira Weiny 

> Cc: Vineet Gupta 
> Cc: linux-snps-...@lists.infradead.org
> Cc: Dan Williams 
> Acked-by: Vineet Gupta 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Jan Kara 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-nvdimm@lists.01.org
> ---
> v2: rebase, add more Cc:
> 
>  arch/arc/include/asm/page.h |1 +
>  fs/dax.c|1 +
>  2 files changed, 2 insertions(+)
> 
> --- lnx-511-rc1.orig/fs/dax.c
> +++ lnx-511-rc1/fs/dax.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #define CREATE_TRACE_POINTS
> --- lnx-511-rc1.orig/arch/arc/include/asm/page.h
> +++ lnx-511-rc1/arch/arc/include/asm/page.h
> @@ -10,6 +10,7 @@
>  #ifndef __ASSEMBLY__
>  
>  #define clear_page(paddr)		memset((paddr), 0, PAGE_SIZE)
> +#define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>  #define copy_page(to, from)		memcpy((to), (from), PAGE_SIZE)
>  
>  struct vm_area_struct;


Re: [PATCH] nvdimm: Switch to using the new API kobj_to_dev()

2020-12-28 Thread Ira Weiny
On Fri, Dec 25, 2020 at 11:31:05AM +0800, Tian Tao wrote:
> fixed the following coccicheck:
> drivers/nvdimm//namespace_devs.c:1626:60-61: WARNING opportunity for
> kobj_to_dev().
> drivers/nvdimm//region_devs.c:762:60-61: WARNING opportunity for
> kobj_to_dev().
> 
> Since container_of(kobj, typeof(*dev), kobj) and
> container_of(kobj, struct device, kobj) are the same, also
> replace container_of(kobj, typeof(*dev), kobj) with the new kobj_to_dev().
> 
> Signed-off-by: Tian Tao 

Reviewed-by: Ira Weiny 
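
For reference, kobj_to_dev() in include/linux/device.h is exactly the
container_of() it replaces:

	static inline struct device *kobj_to_dev(struct kobject *kobj)
	{
		return container_of(kobj, struct device, kobj);
	}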

> ---
>  drivers/nvdimm/bus.c| 2 +-
>  drivers/nvdimm/core.c   | 2 +-
>  drivers/nvdimm/dimm_devs.c  | 4 ++--
>  drivers/nvdimm/namespace_devs.c | 2 +-
>  drivers/nvdimm/region_devs.c| 4 ++--
>  5 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 2304c61..cf2b70b 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -713,7 +713,7 @@ static struct attribute *nd_numa_attributes[] = {
>  static umode_t nd_numa_attr_visible(struct kobject *kobj, struct attribute *a,
>  		int n)
>  {
> -	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>  
>   if (!IS_ENABLED(CONFIG_NUMA))
>   return 0;
> diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
> index 7de592d..431e47a 100644
> --- a/drivers/nvdimm/core.c
> +++ b/drivers/nvdimm/core.c
> @@ -503,7 +503,7 @@ static DEVICE_ATTR_ADMIN_RW(activate);
>  
>  static umode_t nvdimm_bus_firmware_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
> -	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>   struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
>   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
>   enum nvdimm_fwa_capability cap;
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index b59032e..76ee2c3 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -418,7 +418,7 @@ static struct attribute *nvdimm_attributes[] = {
>  
>  static umode_t nvdimm_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
> -	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>  	struct nvdimm *nvdimm = to_nvdimm(dev);
>   struct nvdimm *nvdimm = to_nvdimm(dev);
>  
>  	if (a != &dev_attr_security.attr && a != &dev_attr_frozen.attr)
> @@ -534,7 +534,7 @@ static struct attribute *nvdimm_firmware_attributes[] = {
>  
>  static umode_t nvdimm_firmware_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
> -	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>   struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
>   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
>   struct nvdimm *nvdimm = to_nvdimm(dev);
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index 6da67f4..1d11ca7 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -1623,7 +1623,7 @@ static struct attribute *nd_namespace_attributes[] = {
>  static umode_t namespace_visible(struct kobject *kobj,
>   struct attribute *a, int n)
>  {
> - struct device *dev = container_of(kobj, struct device, kobj);
> + struct device *dev = kobj_to_dev(kobj);
>  
>  	if (a == &dev_attr_resource.attr && is_namespace_blk(dev))
>   return 0;
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index ef23119..92adfaf 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -644,7 +644,7 @@ static struct attribute *nd_region_attributes[] = {
>  
>  static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
> -	struct device *dev = container_of(kobj, typeof(*dev), kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>   struct nd_region *nd_region = to_nd_region(dev);
>   struct nd_interleave_set *nd_set = nd_region->nd_set;
>   int type = nd_region_to_nstype(nd_region);
> @@ -759,7 +759,7 @@ REGION_MAPPING(31);
>  
>  static umode_t mapping_visible(struct kobject *kobj, struct attribute *a, int n)
>  {
> -	struct device *dev = container_of(kobj, struct device, kobj);
> +	struct device *dev = kobj_to_dev(kobj);
>   struct nd_region *nd_region = to_nd_region(dev);
>  
>   if (n < nd_region->ndr_mappings)
> -- 
> 2.7.4
> 


Re: [PATCH] libnvdimm/pmem: remove unused header.

2020-12-28 Thread Ira Weiny
On Fri, Dec 25, 2020 at 09:35:46AM +0800, Jianpeng Ma wrote:
> 'commit a8b456d01cd6 ("bdi: remove BDI_CAP_SYNCHRONOUS_IO")' forgot

This information should be part of a fixes tag.

Other than that looks ok.

Reviewed-by: Ira Weiny 

> remove the related header file.
> 
> Signed-off-by: Jianpeng Ma 
> ---
>  drivers/nvdimm/pmem.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 875076b0ea6c..f33bdae626ba 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -23,7 +23,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include "pmem.h"
> -- 
> 2.29.2


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-18 Thread Ira Weiny
On Fri, Dec 18, 2020 at 02:57:51PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 17 2020 at 23:43, Thomas Gleixner wrote:
> > The only use case for this in your tree is: kmap() and the possible
> > usage of that mapping outside of the thread context which sets it up.
> >
> > The only hint for doing this at all is:
> >
> > Some users, such as kmap(), sometimes requires PKS to be global.
> >
> > 'sometime requires' is really _not_ a technical explanation.
> >
> > Where is the explanation why kmap() usage 'sometimes' requires this
> > global trainwreck in the first place and where is the analysis why this
> > can't be solved differently?
> >
> > Detailed use case analysis please.
> 
> A lengthy conversation with Dan and Dave over IRC confirmed what I was
> suspecting.
> 
> The approach of this whole PKS thing is to make _all_ existing code
> magically "work". That means aside of the obvious thread local mappings,
> the kmap() part is needed to solve the problem of async handling where
> the mapping is handed to some other context which then uses it and
> notifies the context which created the mapping when done. That's the
> principle which was used to make highmem work long time ago.
> 
> IMO that was a mistake back then. The right thing would have been to
> change the code so that it does not rely on a temporary mapping created
> by the initiator. Instead let the initiator hand the page over to the
> other context which then creates a temporary mapping for fiddling with
> it. Water under the bridge...

But maybe not.  We are getting rid of a lot of the kmaps and once the bulk are
gone perhaps we can change this and remove kmap completely?

> 
> Glueing PKS on to that kmap() thing is horrible and global PKS is pretty
> much the opposite of what PKS wants to achieve. It's disabling
> protection systemwide for an unspecified amount of time and for all
> contexts.

I agree.  This is why I have been working on converting kmap() call sites to
kmap_local_page().[1]
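
The conversion pattern referred to here looks like the following;
kmap_local_page()/kunmap_local() are the real APIs, the surrounding function
is just an illustration:

	#include <linux/highmem.h>
	#include <linux/string.h>

	static void copy_out_example(struct page *page, void *dst, size_t len)
	{
		void *addr = kmap_local_page(page);	/* thread-local mapping */

		memcpy(dst, addr, len);
		kunmap_local(addr);	/* must be unmapped in the acquiring context */
	}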

> 
> So instead of trying to make global PKS "work" we really should go and
> take a smarter approach.
> 
>   1) Many kmap() use cases are strictly thread local and the mapped
>  address is never handed to some other context, which means this can
>  be replaced with kmap_local() now, which preserves the mapping
>  across preemption. PKS just works nicely on top of that.

Yes hence the massive kmap->kmap_thread patch set which is now becoming
kmap_local_page().[2]

> 
>   2) Modify kmap() so that it marks the to-be-mapped page as 'globally
>  unprotected' instead of doing this global unprotect PKS dance.
>  kunmap() undoes that. That obviously needs some thought
>  vs. refcounting if there are concurrent users, but that's a
>  solvable problem either as part of struct page itself or
>  stored in some global hash.

How would this globally unprotected flag work?  I suppose if kmap created a new
PTE we could make that PTE non-PKS protected, then we don't have to fiddle with
the register...  I think I like that idea.
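
Purely to make that concrete, the flag idea could look something like the
sketch below.  Every name in it is hypothetical (no such flag, field, or
helper exists); it also nods at Thomas' refcounting concern by keeping a
per-page count of concurrent kmap() users:

	/* Hypothetical sketch only -- none of these names exist. */
	static inline bool page_is_globally_unprotected(struct page *page)
	{
		return test_bit(PG_pks_unprotected, &page->flags);
	}

	void kmap_global_unprotect(struct page *page)
	{
		/* first mapper flips the page to 'globally unprotected' */
		if (atomic_inc_return(&page->pks_unprotect_count) == 1)
			set_bit(PG_pks_unprotected, &page->flags);
	}

	void kunmap_global_reprotect(struct page *page)
	{
		/* last unmapper restores protection */
		if (atomic_dec_and_test(&page->pks_unprotect_count))
			clear_bit(PG_pks_unprotected, &page->flags);
	}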

> 
>   3) Have PKS modes:
> 
>  - STRICT:   No pardon
>  
>  - RELAXED:  Warn and unprotect temporary for the current context
> 
>  - SILENT: Like RELAXED, but w/o warning to make sysadmins happy.
>  Default should be RELAXED.
> 
>  - OFF:  Disable the whole PKS thing

I'm not really sure how this solves the global problem but it is probably worth
having in general.

> 
> 
>   4) Have a smart #PF mechanism which does:
> 
>  if (error_code & X86_PF_PK) {
>  page = virt_to_page(address);
> 
>  if (!page || !page_is_globally_unprotected(page))
>  goto die;
> 
>  if (pks_mode == PKS_MODE_STRICT)
>goto die;
> 
>  WARN_ONCE(pks_mode == PKS_MODE_RELAXED, "Useful info ...");
> 
>  temporary_unprotect(page, regs);
>  return;
>  }

I feel like this is very similar to what I had in the global patch you found in
my git tree with the exception of the RELAXED mode.  I simply had globally
unprotected or die.

global_pkey_is_enabled() handles the page_is_globally_unprotected() and
temporary_unprotect().[3]

Anyway, I'm sorry (but not sorry) that you found it.  I've been trying to get
0-day and other testing on it and my public tree was the easiest way to do
that.  Anyway...

The patch as a whole needs work.  You are 100% correct that if a mapping is
handed to another context it is going to suck performance wise.  It has had
some internal review but not much.

Regardless I think unprotecting a global context is the easy part.  The code
you had a problem with (and I see is fully broken) was the restriction of
access.  A failure to update in that direction would only result in a wider
window of access.  I contemplated not doing a global update at all and just
leave the access open until the next context switch.  But the code as it stands
tries to force an update for a couple of reasons:

1) 

Re: [NEEDS-REVIEW] [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 12:41:50PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> >  void disable_TSC(void)
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >  
> > if ((tifp ^ tifn) & _TIF_SLD)
> > switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> 
> Does the selftest for this ever actually schedule()?

At this point I'm not sure.  This code has been in since the beginning, so
it's seen a lot of soak time.

> 
> I see it talking about context switching, but I don't immediately see
> how it would.

We were trying to force parent and child to run on the same CPU.  I suspect
something is wrong in the timing of that test.

Ira


Re: [PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 03:50:55PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> > --- a/arch/x86/kernel/process.c
> > +++ b/arch/x86/kernel/process.c
> > @@ -43,6 +43,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "process.h"
> >  
> > @@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
> > return ret;
> >  }
> >  
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +DECLARE_PER_CPU(u32, pkrs_cache);
> > +static inline void pks_init_task(struct task_struct *tsk)
> 
> First of all. I asked several times now not to glue stuff onto a
> function without a newline inbetween. It's unreadable.

Fixed.

> 
> But what's worse is that the declaration of pkrs_cache which is global
> is in a C file and not in a header. And pkrs_cache is not even used in
> this file. So what?

OK, this was just a complete rebase/refactor mess up on my part.  The
global'ness is not required until we need a global update of the pkrs which was
not part of this series.

I've removed it from this patch.  And cleaned it up in patch 6/10 as well.  And
cleaned it up in the global pkrs patch which you found in my git tree.

> 
> > +{
> > +   /* New tasks get the most restrictive PKRS value */
> > +   tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
> > +}
> > +static inline void pks_sched_in(void)
> 
> Newline between functions. It's fine for stubs, but not for a real 
> implementation.

Again my apologies.

Fixed.

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index d1dfe743e79f..76a62419c446 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> >  
> > return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> 
> Again, why is this global?

In this patch it does not need to be.  I've changed it to static.

> 
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +   u32 *pkrs;
> > +
> > +   if (!static_cpu_has(X86_FEATURE_PKS))
> > +   return;
> > +
> > +	pkrs = get_cpu_ptr(&pkrs_cache);
> 
> So this is called from various places including schedule and also from
> the low level entry/exit code. Why do we need to have an extra
> preempt_disable/enable() there via get/put_cpu_ptr()?
> 
> Just because performance in those code paths does not matter?

Honestly I don't recall the full history at this point.  The
preempt_disable/enable() is required when this is called from
pks_update_protection(), AKA when a user is trying to update the protections
of their key.  What I do remember is that this was originally not preempt
safe and we had a comment to that effect in the early patches.[1]

Somewhere along the line the preempt discussion led us to make write_pkrs()
'self contained' with the preemption protection here.  I just did not think
about any performance issues.  It is safe to call preempt_disable() from a
preempt disabled region, correct?  I seem to recall asking that and the answer
was 'yes'.

I will audit the calls again and adjust the preemption disable as needed.

[1] https://lore.kernel.org/lkml/20200717072056.73134-5-ira.we...@intel.com/#t
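
For reference, preempt_disable()/preempt_enable() nest by design: they
increment and decrement preempt_count, so calling them inside an already
preempt-disabled region is safe:

	preempt_disable();	/* preempt_count: 0 -> 1 */
	preempt_disable();	/* 1 -> 2, nested call is fine */
	/* ... critical section ... */
	preempt_enable();	/* 2 -> 1, preemption still disabled */
	preempt_enable();	/* 1 -> 0, preemption possible again */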

> 
> > +   if (*pkrs != new_pkrs) {
> > +   *pkrs = new_pkrs;
> > +   wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +   }
> > +   put_cpu_ptr(pkrs);
> 
> Now back to the context switch:
> 
> > @@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
> >
> >  if ((tifp ^ tifn) & _TIF_SLD)
> >  switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> 
> How is this supposed to work? 
> 
> switch_to() {
>
>switch_to_extra() {
>   
>   if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
>  prev_tif & _TIF_WORK_CTXSW_PREV))
>  __switch_to_xtra(prev, next);
> 
> I.e. __switch_to_xtra() is only invoked when the above condition is
> true, which is not guaranteed at all.

I did not know that.  I completely misunderstood what __switch_to_xtra()
meant.  I thought it was arch specific 'extra' stuff so it seemed reasonable
to me.

Also, our test seemed to work.  I'm still investigating what may be wrong.

> 
> While I have to admit that I dropped the ball on the update for the
> entry patch, I'm not too sorry about it anymore when looking at this.
> 
> Are you still sure that this is ready for merging?

Nope...

Thanks for the review,
Ira

> 
> Thanks,
> 
> tglx


Re: [PATCH V3 10/10] x86/pks: Add PKS test code

2020-12-17 Thread Ira Weiny
On Thu, Dec 17, 2020 at 12:55:39PM -0800, Dave Hansen wrote:
> On 11/6/20 3:29 PM, ira.we...@intel.com wrote:
> > +   /* Arm for context switch test */
> > +   write(fd, "1", 1);
> > +
> > +   /* Context switch out... */
> > +   sleep(4);
> > +
> > +   /* Check msr restored */
> > +   write(fd, "2", 1);
> 
> These are always tricky.  What you ideally want here is:
> 
> 1. Switch away from this task to a non-PKS task, or
> 2. Switch from this task to a PKS-using task, but one which has a
>different PKS value

Or both...

> 
> then, switch back to this task and make sure PKS maintained its value.
> 
> *But*, there's no absolute guarantee that another task will run.  It
> would not be totally unreasonable to have the kernel just sit in a loop
> without context switching here if no other tasks can run.
> 
> The only way you *know* there is a context switch is by having two tasks
> bound to the same logical CPU and make sure they run one after another.

Ah...  We do that.

...
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	/* Two processes run on CPU 0 so that they go through context switch. */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
...

I think this should be ensuring that both the parent and the child are
running on CPU 0.  At least according to the man page they should be.


	"A child created via fork(2) inherits its parent's CPU affinity mask."


Perhaps a better method would be to synchronize the 2 threads more to ensure
that we are really running at the 'same time' and forcing the context switch.

>  This just gets itself into a state where it *CAN* context switch and
> prays that one will happen.

Not sure what you mean by 'This'?  Do you mean that running on the same CPU
will sometimes not force a context switch?  Or do you mean that the sleeps
could be badly timed and the 2 threads could run 1 after the other on the same
CPU?  The latter is AFAICT the most likely case.

> 
> You can also run a bunch of these in parallel bound to a single CPU.
> That would also give you higher levels of assurance that *some* context
> switch happens at sleep().

I think more cycles is a good idea for sure.  But I'm more comfortable with
forcing the test to be more synchronized so that it is actually running in the
order we think/want it to be.
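
One way to force that alternation deterministically is to lock-step the two
processes over a pair of pipes while both are pinned to CPU 0 -- an
illustrative user-space sketch, not the actual selftest:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <unistd.h>

	int main(void)
	{
		int to_child[2], to_parent[2];
		cpu_set_t cpuset;
		char c = 0;

		if (pipe(to_child) || pipe(to_parent))
			return 1;

		/* Pin to CPU 0; the forked child inherits the affinity mask. */
		CPU_ZERO(&cpuset);
		CPU_SET(0, &cpuset);
		sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);

		if (fork() == 0) {
			read(to_child[0], &c, 1);	/* blocks until parent has run */
			/* ... check this task's view of the PKRS state ... */
			write(to_parent[1], &c, 1);	/* hand CPU 0 back */
			_exit(0);
		}

		/* ... arm the PKS state under test ... */
		write(to_child[1], &c, 1);	/* wakes child: forces a switch on CPU 0 */
		read(to_parent[0], &c, 1);	/* blocks until the child has run */
		/* ... verify the MSR was restored for this task ... */
		return 0;
	}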

> 
> One critical thing with these tests is to sabotage the kernel and then
> run them and make *sure* they fail.  Basically, if you screw up, do they
> actually work to catch it?

I'll try and come up with a more stressful test.

Ira


Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-16 Thread Ira Weiny
On Tue, Dec 15, 2020 at 06:09:02PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 15, 2020 at 5:32 PM Ira Weiny  wrote:
> >
> > On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
> > > On Mon, Nov 23, 2020 at 10:10 PM  wrote:
> 
> > > IOW we have:
> > >
> > > struct extended_pt_regs {
> > >   bool rcu_whatever;
> > >   other generic fields here;
> > >   struct arch_extended_pt_regs arch_regs;
> > >   struct pt_regs regs;
> > > };
> > >
> > > and arch_extended_pt_regs has unsigned long pks;
> > >
> > > and instead of passing a pointer to irqentry_state_t to the generic
> > > entry/exit code, we just pass a pt_regs pointer.  And we have a little
> > > accessor like:
> > >
> > > struct extended_pt_regs *extended_regs(struct pt_regs *) { return
> > > container_of(...); }
> > >
> > > And we tell eBPF that extended_pt_regs is NOT ABI, and we will change
> > > it whenever we feel like just to keep you on your toes, thank you very
> > > much.
> > >
> > > Does this seem reasonable?
> >
> > Conceptually yes.  But I'm failing to see how this implementation can be 
> > made
> > generic for the generic fields.  The pks fields, assuming they stay x86
> > specific, would be reasonable to add in PUSH_AND_CLEAR_REGS.  But the
> > rcu/lockdep field is generic.  Wouldn't we have to modify every 
> > architecture to
> > add space for the rcu/lockdep bool?
> >
> > If not, where is a generic place that could be done?  Basically I'm missing 
> > how
> > the effective stack structure can look like this:
> >
> > > struct extended_pt_regs {
> > >   bool rcu_whatever;
> > >   other generic fields here;
> > >   struct arch_extended_pt_regs arch_regs;
> > >   struct pt_regs regs;
> > > };
> >
> > It seems more reasonable to make it look like:
> >
> > #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > struct extended_pt_regs {
> > unsigned long pkrs;
> > struct pt_regs regs;
> > };
> > #endif
> >
> > And leave the rcu/lockdep bool passed by value as before (still in C).
> 
> We could certainly do this,

I'm going to start with this basic support.  Because I have 0 experience in
most of these architectures.

> but we could also allocate some generic
> space.  PUSH_AND_CLEAR_REGS would get an extra instruction like:
> 
> subq %rsp, $GENERIC_PTREGS_SIZE
> 
> or however this should be written.  That field would be defined in
> asm-offsets.c.  And yes, all the generic-entry architectures would
> need to get onboard.

What do you mean by 'generic-entry' architectures?  I thought they all used the
generic entry code?

Regardless I would need to start another thread on this topic with any of those
architecture maintainers to see what the work load would be for this.  I don't
think I can do it on my own.

FWIW I think it is a bit unfair to hold up the PKS support in x86 for making
these generic fields part of the stack frame.  So perhaps that could be made a
follow on to the PKS series?

> 
> If we wanted to be fancy, we could split the generic area into
> initialize-to-zero and uninitialized for debugging purposes, but that
> might be more complication than is worthwhile.

Ok, agreed, but this is step 3 or 4 at the earliest.

Ira


Re: [PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-12-15 Thread Ira Weiny
On Fri, Dec 11, 2020 at 02:14:28PM -0800, Andy Lutomirski wrote:
> On Mon, Nov 23, 2020 at 10:10 PM  wrote:
> >
> > From: Ira Weiny 
> >
> > Currently struct irqentry_state_t only contains a single bool value
> > > which makes passing it by value reasonable.  However, future patches
> > add information to this struct.  This includes the PKRS thread state,
> > included in this series, as well as information to store kmap reference
> > tracking and PKS global state outside this series.  In total, we
> > anticipate 2 new 32 bit fields and an integer field to be added to the
> > struct beyond the existing bool value.
> >
> > Adding information to irqentry_state_t makes passing by value less
> > efficient.  Therefore, change the entry/exit calls to pass irq_state by
> > reference in preparation for the changes which follow.
> >
> > While at it, make the code easier to follow by changing all the usage
> > sites to consistently use the variable name 'irq_state'.
> 
> After contemplating this for a bit, I think this isn't really the
> right approach.  It *works*, but we've mostly just created a bit of an
> unfortunate situation.

First off please forgive my ignorance on how this code works.

> Our stack, on a (possibly nested) entry looks
> like:
> 
> previous frame (or empty if we came from usermode)
> ---
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
> 
> C frame
> 
> irqentry_state_t (maybe -- the compiler is within its rights to play
> almost arbitrary games here)
> 
> more C stuff
> 
> 
> So what we've accomplished is having two distinct arch register
> regions, one called pt_regs and the other stuck in irqentry_state_t.
> This is annoying because it means that, if we want to access this
> thing without passing a pointer around or access it at all from outer
> frames, we need to do something terrible with the unwinder, and we
> don't want to go there.
> 
> So I propose a somewhat different solution: lay out the stack like this.
> 
> SS
> RSP
> FLAGS
> CS
> RIP
> rest of pt_regs
> PKS
>  extended_pt_regs points here
> 
> C frame
> more C stuff
> ...
> 
> IOW we have:
> 
> struct extended_pt_regs {
>   bool rcu_whatever;
>   other generic fields here;
>   struct arch_extended_pt_regs arch_regs;
>   struct pt_regs regs;
> };
> 
> and arch_extended_pt_regs has unsigned long pks;
> 
> and instead of passing a pointer to irqentry_state_t to the generic
> entry/exit code, we just pass a pt_regs pointer.  And we have a little
> accessor like:
> 
> struct extended_pt_regs *extended_regs(struct pt_regs *) { return
> container_of(...); }
> 
> And we tell eBPF that extended_pt_regs is NOT ABI, and we will change
> it whenever we feel like just to keep you on your toes, thank you very
> much.
> 
> Does this seem reasonable?

Conceptually yes.  But I'm failing to see how this implementation can be made
generic for the generic fields.  The pks fields, assuming they stay x86
specific, would be reasonable to add in PUSH_AND_CLEAR_REGS.  But the
rcu/lockdep field is generic.  Wouldn't we have to modify every architecture to
add space for the rcu/lockdep bool?

If not, where is a generic place that could be done?  Basically I'm missing how
the effective stack structure can look like this:

> struct extended_pt_regs {
>   bool rcu_whatever;
>   other generic fields here;
>   struct arch_extended_pt_regs arch_regs;
>   struct pt_regs regs;
> };

It seems more reasonable to make it look like:

#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
struct extended_pt_regs {
unsigned long pkrs;
struct pt_regs regs;
};
#endif

And leave the rcu/lockdep bool passed by value as before (still in C).

Is that what you mean?  Or am I missing something with the way pt_regs is set
up?  Which is entirely possible because I'm pretty ignorant about how this code
works...  :-/

Ira


Re: [PATCH] mm/up: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Ira Weiny
On Wed, Dec 09, 2020 at 03:13:57PM -0400, Jason Gunthorpe wrote:
> These functions accomplish the same thing but have different
> implementations.
> 
> unpin_user_page() has a bug where it calls mod_node_page_state() after
> calling put_page() which creates a risk that the page could have been
> hot-unplugged from the system.
> 
> Fix this by using put_compound_head() as the only implementation.
> 
> __unpin_devmap_managed_user_page() and related can be deleted as well in
> favour of the simpler, but slower, version in put_compound_head() that has
> an extra atomic page_ref_sub, but always calls put_page() which internally
> contains the special devmap code.
> 
> Move put_compound_head() to be directly after try_grab_compound_head() so
> people can find it in future.
> 
> Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Ira Weiny 

> ---
>  mm/gup.c | 103 +--
>  1 file changed, 23 insertions(+), 80 deletions(-)
> 
> With Matt's folio idea I'd next go on to make a
>   put_folio(folio, refs)
> 
> Which would cleanly eliminate that extra atomic here without duplicating the
> devmap special case.
> 
> This should also be called 'ungrab_compound_head' as we seem to be using the
> word 'grab' to mean 'pin or get' depending on GUP flags.
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>   return NULL;
>  }
>  
> +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> +{
> + if (flags & FOLL_PIN) {
> + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> + refs);
> +
> + if (hpage_pincount_available(page))
> + hpage_pincount_sub(page, refs);
> + else
> + refs *= GUP_PIN_COUNTING_BIAS;
> + }
> +
> + VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> + /*
> +  * Calling put_page() for each ref is unnecessarily slow. Only the last
> +  * ref needs a put_page().
> +  */
> + if (refs > 1)
> + page_ref_sub(page, refs - 1);
> + put_page(page);
> +}
> +
>  /**
>   * try_grab_page() - elevate a page's refcount by a flag-dependent amount
>   *
> @@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
>   return true;
>  }
>  
> -#ifdef CONFIG_DEV_PAGEMAP_OPS
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - int count, refs = 1;
> -
> - if (!page_is_devmap_managed(page))
> - return false;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;
> -
> - count = page_ref_sub_return(page, refs);
> -
> - mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
> - /*
> -  * devmap page refcounts are 1-based, rather than 0-based: if
> -  * refcount is 1, then the page is free and the refcount is
> -  * stable because nobody holds a reference on the page.
> -  */
> - if (count == 1)
> - free_devmap_managed_page(page);
> - else if (!count)
> - __put_page(page);
> -
> - return true;
> -}
> -#else
> -static bool __unpin_devmap_managed_user_page(struct page *page)
> -{
> - return false;
> -}
> -#endif /* CONFIG_DEV_PAGEMAP_OPS */
> -
>  /**
>   * unpin_user_page() - release a dma-pinned page
>   * @page:pointer to page to be released
> @@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>   */
>  void unpin_user_page(struct page *page)
>  {
> - int refs = 1;
> -
> - page = compound_head(page);
> -
> - /*
> -  * For devmap managed pages we need to catch refcount transition from
> -  * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> -  * page is free and we need to inform the device driver through
> -  * callback. See include/linux/memremap.h and HMM for details.
> -  */
> - if (__unpin_devmap_managed_user_page(page))
> - return;
> -
> - if (hpage_pincount_available(page))
> - hpage_pincount_sub(page, 1);
> - else
> - refs = GUP_PIN_COUNTING_BIAS;
> -
> - if (page_ref_sub_and_test(page, refs))
> - _

Re: [PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3

2020-12-08 Thread Ira Weiny
On Tue, Dec 08, 2020 at 04:55:54PM +0100, Thomas Gleixner wrote:
> Ira,
> 
> On Mon, Dec 07 2020 at 14:14, Ira Weiny wrote:
> > Is there any chance of this landing before the kmap stuff gets sorted out?
> 
> I have marked this as needs an update because the change log of 5/10
> sucks. https://lore.kernel.org/r/87lff1xcmv@nanos.tec.linutronix.de
> 
> > It would be nice to have this in 5.11 to build off of.
> 
> It would be nice if people follow up on review request :)

I did, but just as an update to that patch.[1]  Sorry if this caused you to
miss the response.  It would have been better for me to ping you on that patch.
:-/

I was trying to avoid a whole new series just for that single commit message.
Is that generally ok?

Is that commit message still lacking?

Ira

[1] 
https://lore.kernel.org/linux-doc/20201124060956.1405768-1-ira.we...@intel.com/


Re: [PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3

2020-12-07 Thread Ira Weiny
Thomas,

Is there any chance of this landing before the kmap stuff gets sorted out?

It would be nice to have this in 5.11 to build off of.

Ira

On Fri, Nov 06, 2020 at 03:28:58PM -0800, 'Ira Weiny' wrote:
> From: Ira Weiny 
> 
> Changes from V2 [4]
>   Rebased on tip-tree/core/entry
>   From Thomas Gleixner
>   Address bisectability
>   Drop Patch:
>   x86/entry: Move nmi entry/exit into common code
>   From Greg KH
>   Remove WARN_ON's
>   From Dan Williams
>   Add __must_check to pks_key_alloc()
>   New patch: x86/pks: Add PKS defines and config options
>   Split from Enable patch to build on through the series
>   Fix compile errors
> 
> Changes from V1
>   Rebase to TIP master; resolve conflicts and test
>   Clean up some kernel docs updates missed in V1
>   Add irqentry_state_t kernel doc for PKRS field
>   Removed redundant irq_state->pkrs
>   This is only needed when we add the global state and somehow
>   ended up in this patch series.  That will come back when we add
>   the global functionality in.
>   From Thomas Gleixner
>   Update commit messages
>   Add kernel doc for struct irqentry_state_t
>   From Dave Hansen add flags to pks_key_alloc()
> 
> Changes from RFC V3[3]
>   Rebase to TIP master
>   Update test error output
>   Standardize on 'irq_state' for state variables
>   From Dave Hansen
>   Update commit messages
>   Add/clean up comments
>   Add X86_FEATURE_PKS to disabled-features.h and remove some
>   explicit CONFIG checks
>   Move saved_pkrs member of thread_struct
>   Remove superfluous preempt_disable()
>   s/irq_save_pks/irq_save_set_pks/
>   Ensure PKRS is not seen in faults if not configured or not
>   supported
>   s/pks_mknoaccess/pks_mk_noaccess/
>   s/pks_mkread/pks_mk_readonly/
>   s/pks_mkrdwr/pks_mk_readwrite/
>   Change pks_key_alloc return to -EOPNOTSUPP when not supported
>   From Peter Zijlstra
>   Clean up Attribution
>   Remove superfluous preempt_disable()
>   Add union to differentiate exit_rcu/lockdep use in
>   irqentry_state_t
>   From Thomas Gleixner
>   Add preliminary clean up patch and adjust series as needed
> 
> 
> Introduce a new page protection mechanism for supervisor pages, Protection Keys
> Supervisor (PKS).
> 
> 2 use cases for PKS are being developed, trusted keys and PMEM.  Trusted keys
> is a newer use case which is still being explored.  PMEM was submitted as part
> of the RFC (v2) series[1].  However, since then it was found that some callers
> of kmap() require a global implementation of PKS.  Specifically some users of
> kmap() expect mappings to be available to all kernel threads.  While global 
> use
> of PKS is rare it needs to be included for correctness.  Unfortunately the
> kmap() updates required a large patch series to make the needed changes at the
> various kmap() call sites so that patch set has been split out.  Because the
> global PKS feature is only required for that use case it will be deferred to
> that set as well.[2]  This patch set is being submitted as a precursor to both
> of the use cases.
> 
> For an overview of the entire PKS ecosystem, a git tree including this series
> and 2 proposed use cases can be found here:
> 
>   
> https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/
>   
> https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/
> 
> 
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to those pages beyond the normal paging protections.  PKS works in
> a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
> checked in addition to normal paging protections and Access or Writes can be
> disabled via a MSR update without TLB flushes when permissions change.  Also
> like PKU, a page mapping is assigned to a domain by setting pkey bits in the
> page table entry for that mapping.
> 
> Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
> 
> XSAVE is not supported for the PKRS MSR.  Therefore the implementation
> saves/restores the MSR across context switches and during exceptions.  Nested
> exceptions are supported by each exception getting a new PKS state.
> 
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections on mappings with the default pkey value of 0.

[PATCH V3.1] entry: Pass irqentry_state_t by reference

2020-11-23 Thread ira . weiny
From: Ira Weiny 

Currently struct irqentry_state_t only contains a single bool value
which makes passing it by value reasonable.  However, future patches
add information to this struct.  This includes the PKRS thread state,
included in this series, as well as information to store kmap reference
tracking and PKS global state outside this series.  In total, we
anticipate 2 new 32 bit fields and an integer field to be added to the
struct beyond the existing bool value.

Adding information to irqentry_state_t makes passing by value less
efficient.  Therefore, change the entry/exit calls to pass irq_state by
reference in preparation for the changes which follow.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.
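
To see the trade-off concretely, here is a minimal user-space sketch; the
field names beyond the existing bool are assumptions drawn from the paragraph
above, not the kernel's definitions:

#include <stdint.h>

/* Illustrative fields only; beyond 'exit_rcu' these are assumptions
 * based on the message above, not the kernel's definitions. */
struct state {
	_Bool exit_rcu;
	uint32_t thread_pkrs;	/* PKRS thread state (this series) */
	uint32_t kmap_refs;	/* anticipated 32-bit field */
	int lockdep_depth;	/* anticipated integer field */
};

static void enter_by_value(struct state s) { (void)s; } /* copies the struct */
static void enter_by_ref(struct state *s)  { (void)s; } /* copies one pointer */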

Signed-off-by: Ira Weiny 

---
Changes from V3
Clarify commit message regarding the need for this patch

Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
 arch/x86/entry/common.c |  8 
 arch/x86/include/asm/idtentry.h | 25 ++--
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/kvm.c   |  6 +++---
 arch/x86/kernel/nmi.c   |  4 ++--
 arch/x86/kernel/traps.c | 21 
 arch/x86/mm/fault.c |  6 +++---
 include/linux/entry-common.h| 18 +
 kernel/entry/common.c   | 34 +
 9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   irqentry_state_t irq_state;
 
-   state = irqentry_enter(regs);
+   irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);
 
inhcall = get_and_clear_inhcall();
-   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+   if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   irqentry_exit(regs, &irq_state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs);	\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t irq_state;					\
 									\
+	irqentry_enter(regs, &irq_state);				\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &irq_state);				\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs,	\
 									\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t irq_state;					\
 									\
+	irqentry_enter(regs, &irq_state);				\

Re: [ndctl PATCH v2] Rework license identification

2020-11-20 Thread Ira Weiny
On Thu, Nov 19, 2020 at 06:34:01PM -0800, Dan Williams wrote:
> Convert to the LICENSES/ directory format for COPYING from the Linux
> kernel, and switch all remaining files over to SPDX annotations.
> 
> Cc: Ira Weiny 

LGTM

Reviewed-by: Ira Weiny 

> Reported-by: Christoph Hellwig 
> Signed-off-by: Dan Williams 
> ---
> Changes since v1 [1]:
> - (Ira) some date updates to 2020 were missed
> - (Ira) fix a dropped copyright
> - Switch SPDX comment style to // in C files and /* */ in headers like
>   the kernel.
> - Make sure SPDX identifier is the first line in the file, or second for
>   scripts
> 
> [1]: 
> http://lore.kernel.org/r/160321640031.3386448.4879860972349220888.st...@dwillia2-desk3.amr.corp.intel.com
> 
>  COPYING  |   31 +++
>  Documentation/copyright.txt  |2 
>  Documentation/daxctl/Makefile.am |   12 -
>  Documentation/ndctl/Makefile.am  |   12 -
>  LICENSES/other/CC0-1.0   |0 
>  LICENSES/other/MIT   |0 
>  LICENSES/preferred/GPL-2.0   |0 
>  LICENSES/preferred/LGPL-2.1  |0 
>  acpi.h   |   16 --
>  ccan/array_size/LICENSE  |1 
>  ccan/array_size/array_size.h |2 
>  ccan/build_assert/LICENSE|1 
>  ccan/build_assert/build_assert.h |2 
>  ccan/check_type/LICENSE  |1 
>  ccan/check_type/check_type.h |2 
>  ccan/container_of/LICENSE|1 
>  ccan/container_of/container_of.h |2 
>  ccan/endian/LICENSE  |1 
>  ccan/endian/endian.h |2 
>  ccan/list/LICENSE|1 
>  ccan/list/list.c |2 
>  ccan/list/list.h |2 
>  ccan/minmax/LICENSE  |1 
>  ccan/minmax/minmax.h |2 
>  ccan/short_types/LICENSE |1 
>  ccan/short_types/short_types.h   |2 
>  ccan/str/LICENSE |1 
>  ccan/str/debug.c |2 
>  ccan/str/str.c   |2 
>  ccan/str/str.h   |2 
>  ccan/str/str_debug.h |2 
>  contrib/daxctl-qemu-hmat-setup   |3 
>  daxctl/acpi.c|2 
>  daxctl/builtin.h |2 
>  daxctl/daxctl.c  |   16 --
>  daxctl/device.c  |2 
>  daxctl/lib/libdaxctl-private.h   |   14 --
>  daxctl/lib/libdaxctl.c   |   14 --
>  daxctl/libdaxctl.h   |   14 --
>  daxctl/list.c|   14 --
>  daxctl/migrate.c |4 
>  ndctl/action.h   |7 -
>  ndctl/bat.c  |   14 --
>  ndctl/builtin.h  |2 
>  ndctl/bus.c  |4 
>  ndctl/check.c|   14 --
>  ndctl/create-nfit.c  |   14 --
>  ndctl/dimm.c |   14 --
>  ndctl/firmware-update.h  |2 
>  ndctl/inject-error.c |   14 --
>  ndctl/inject-smart.c |4 
>  ndctl/lib/ars.c  |   14 --
>  ndctl/lib/dimm.c |   14 --
>  ndctl/lib/firmware.c |4 
>  ndctl/lib/hpe1.c |   16 --
>  ndctl/lib/hpe1.h |   16 --
>  ndctl/lib/hyperv.c   |2 
>  ndctl/lib/hyperv.h   |4 
>  ndctl/lib/inject.c   |   14 --
>  ndctl/lib/intel.c|   14 --
>  ndctl/lib/intel.h|2 
>  ndctl/lib/libndctl.c |   14 --
>  ndctl/lib/msft.c |   18 --
>  ndctl/lib/msft.h |   18 --
>  ndctl/lib/nfit.c |   14 --
>  ndctl/lib/private.h  |   14 --
>  ndctl/lib/smart.c|   14 --
>  ndctl/libndctl-nfit.h|   18 --
>  ndctl/libndctl.h |   14 --
>  ndctl/list.c |   14 --
>  ndctl/load-keys.c|2 
>  ndctl/monitor.c  |4 
>  ndctl/namespace.c|   14 --
>  ndctl/namespace.h|   14 --
>  ndctl/ndctl.c|   15 --
>  ndctl/ndctl.h|   14 --
>  ndctl/region.c   |   14 --
>  ndctl/test.c |   14 --
>  ndctl/util/json-smart.c  |   14 --
>  ndctl/util/keys.c|2 
>  ndctl/util/keys.h|4 
>  sles/header  |4 
>  test.h   |   14 --
>  test/ack-shutdown-count-set.c|4 
>  test/align.sh|2 
>  test/blk-exhaust.sh  |   13 -
>  test/blk_namespaces.c|   16 --
>  test/btt-check.sh|  

Re: [PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

2020-11-16 Thread Ira Weiny
On Sun, Nov 15, 2020 at 07:58:52PM +0100, Thomas Gleixner wrote:
> Ira,
> 
> On Fri, Nov 06 2020 at 15:29, ira weiny wrote:
> 
> Subject prefix wants to 'entry:'. This changes generic code and the x86
> part is just required to fix the generic code change.

Sorry, yes that was carried incorrectly from earlier versions.

> 
> > Currently struct irqentry_state_t only contains a single bool value
> > which makes passing it by value is reasonable.  However, future patches
> > propose to add information to this struct, for example the PKRS
> > register/thread state.
> >
> > Adding information to irqentry_state_t makes passing by value less
> > efficient.  Therefore, change the entry/exit calls to pass irq_state by
> > reference.
> 
> The PKRS muck needs to add an u32 to that struct. So how is that a
> problem?

There are more fields to be added for the kmap/pmem support, so this will be
needed eventually, even though it is not strictly necessary in the next patch.

> 
> The resulting struct still fits into 64bit which is by far more
> efficiently passed by value than by reference. So which problem are you
> solving here?

I'm getting ahead of myself a bit.  I will be adding more fields for the
kmap/pmem tracking.

Would you accept just a clean up for the variable names in this patch?  I could
then add the pass by reference when I add the new fields later.  Or would an
update to the commit message be ok to land this now?

Ira

> 
> Thanks
> 
> tglx
> 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread

2020-11-09 Thread Ira Weiny
On Tue, Nov 10, 2020 at 02:13:56AM +0100, Thomas Gleixner wrote:
> Ira,
> 
> On Fri, Oct 09 2020 at 12:49, ira weiny wrote:
> > From: Ira Weiny 
> >
> > To correctly support the semantics of kmap() with Kernel protection keys
> > (PKS), kmap() may be required to set the protections on multiple
> > processors (globally).  Enabling PKS globally can be very expensive
> > depending on the requested operation.  Furthermore, enabling a domain
> > globally reduces the protection afforded by PKS.
> >
> > Most kmap() (approx. 209 of 229) callers use the map within a single thread and
> > have no need for the protection domain to be enabled globally.  However, the
> > remaining callers do not follow this pattern and, as best I can tell, expect
> > the mapping to be 'global' and available to any thread who may access the
> > mapping.[1]
> >
> > We don't anticipate global mappings to pmem, however in general there is a
> > danger in changing the semantics of kmap().  Effectively, this would cause an
> > unresolved page fault with little to no information about why the failure
> > occurred.
> >
> > To resolve this a number of options were considered.
> >
> > 1) Attempt to change all the thread local kmap() calls to kmap_atomic()[2]
> > 2) Introduce a flags parameter to kmap() to indicate if the mapping should be
> >    global or not
> > 3) Change ~20 call sites to 'kmap_global()' to indicate that they require a
> >global enablement of the pages.
> > 4) Change ~209 call sites to 'kmap_thread()' to indicate that the mapping is
> >    to be used within that thread of execution only
> >
> > Option 1 is simply not feasible.  Option 2 would require all of the call sites
> > of kmap() to change.  Option 3 seems like a good minimal change but there is a
> > danger that new code may miss the semantic change of kmap() and not get the
> > behavior the developer intended.  Therefore, #4 was chosen.
> 
> There is Option #5:

There is now yes.  :-D

> 
> Convert the thread local kmap() invocations to the proposed kmap_local()
> interface which is coming along [1].

I've been trying to follow that thread.

> 
> That solves a couple of issues:
> 
>  1) It relieves the current kmap_atomic() usage sites from the implicit
> pagefault/preempt disable semantics which apply even when
> CONFIG_HIGHMEM is disabled. kmap_local() still can be invoked from
> atomic context.
> 
>  2) Due to #1 it allows to replace the conditional usage of kmap() and
> kmap_atomic() for purely thread local mappings.
> 
>  3) It puts the burden on the HIGHMEM inflicted systems
> 
>  4) It is actually more efficient for most of the pure thread local use
> cases on HIGHMEM inflicted systems because it avoids the overhead of
> the global lock and the potential kmap slot exhaustion. A potential
> preemption will be more expensive, but that's not really the case we
> want to optimize for.
> 
>  5) It solves the RT issue vs. kmap_atomic()
> 
> So instead of creating yet another variety of kmap() which is just
> scratching the particular PKRS itch, can we please consolidate all of
> that on the wider reaching kmap_local() approach?

Yes I agree.  We absolutely don't want more kmap*() calls and I was hoping to
dovetail into your kmap_local() work.[2]

I've pivoted away from this work a bit to clean up all the
kmap()/memcpy*()/kunmaps() as discussed elsewhere in the thread first.[3]  I
was hoping your work would land and then I could s/kmap_thread()/kmap_local()/
on all of these patches.

Also, we can convert the new memcpy_*_page() calls to kmap_local() as well.
[For now my patch just uses kmap_atomic().]
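
(For readers following along: those helpers are roughly of the following
shape -- a sketch assuming kernel context and the kmap_atomic() form
mentioned above.)

static inline void memcpy_to_page(struct page *page, size_t offset,
				  const char *from, size_t len)
{
	char *to = kmap_atomic(page);

	memcpy(to + offset, from, len);
	kunmap_atomic(to);
}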

I've not looked at all of the patches in your latest version.  Have you
included converting any of the kmap() call sites?  I thought you were more
focused on converting the kmap_atomic() calls to kmap_local()?

Ira

> 
> Thanks,
> 
> tglx
>  
> [1] https://lore.kernel.org/lkml/20201103092712.714480...@linutronix.de/

[2] 
https://lore.kernel.org/lkml/20201012195354.gc2046...@iweiny-desk2.sc.intel.com/
[3] https://lore.kernel.org/lkml/20201009213434.GA839@sol.localdomain/
https://lore.kernel.org/lkml/20201013200149.gi3576...@zeniv.linux.org.uk/


[PATCH V3 09/10] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-11-06 Thread ira . weiny
From: Fenghua Yu 

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Enable PKS on supported CPUs.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from V2
From Thomas: Make this patch last so PKS is not enabled until
all the PKS mechanisms are in place.  Specifically:
1) Modify setup_pks() to call write_pkrs() to properly
   set up the initial value when enabled.

2) Split this patch into two. 1) a precursor patch with
   the required defines/config options and 2) this patch
   which actually enables feature on CPUs which support
   it.

Changes since RFC V3
Per Dave Hansen
Update comment
Add X86_FEATURE_PKS to disabled-features.h
Rebase based on latest TIP tree
---
 arch/x86/include/asm/disabled-features.h |  6 +-
 arch/x86/kernel/cpu/common.c | 15 +++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 164587177152..82540f0c5b6c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,7 +44,11 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
-#define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+# define DISABLE_PKS   0
+#else
+# define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+#endif
 
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57  0
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..f8929a557d72 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cpu.h"
 
@@ -1494,6 +1495,19 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+static void setup_pks(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(INIT_PKRS_VALUE);
+   cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually picking apart the CPU stuff...
  */
@@ -1591,6 +1605,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
x86_init_rdrand(c);
setup_pku(c);
+   setup_pks();
 
/*
 * Clear/Set all flags overridden by options, need do it
-- 
2.28.0.rc0.12.gb6a658bd00c9


[PATCH V3 07/10] x86/fault: Report the PKRS state on fault

2020-11-06 Thread ira . weiny
From: Ira Weiny 

When only user space pkeys are enabled faulting within the kernel was an
unexpected condition which should never happen.  Therefore a WARN_ON in
the kernel fault handler would detect if it ever did.  Now this is no
longer the case if PKS is enabled and supported.

Report a Pkey fault with a normal splat and add the PKRS state to the
fault splat text.  Note the PKS register is reset during an exception
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state the pkrs information will be
correctly reported even if a fault 'faults'.

Suggested-by: Andy Lutomirski 
Signed-off-by: Ira Weiny 

---
Changes from V2
Fix compilation error

Changes from RFC V3
Update commit message
Per Dave Hansen
Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS)
Fix comment
Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of
disabled-features.h
---
 arch/x86/mm/fault.c | 58 ++---
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8d20c4c13abf..90029ce9b0da 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+		irqentry_state_t *irq_state)
 {
if (!oops_may_print())
return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 (error_code & X86_PF_PK)? "protection keys violation" :
   "permissions violation");
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & X86_PF_PK))
+   pr_alert("PKRS: 0x%x\n", irq_state->thread_pkrs);
+#endif
+
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
struct desc_ptr idt, gdt;
u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int signal, int si_code)
+  unsigned long address, int signal, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 */
flags = oops_begin();
 
-   show_fault_oops(regs, error_code, address);
+   show_fault_oops(regs, error_code, address, irq_state);
 
if (task_stack_end_corrupted(tsk))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, u32 pkey, int si_code)
+  unsigned long address, u32 pkey, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
 
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
if (is_f00f_bug(regs, address))
return;
 
-   no_context(regs, error_code, address, SIGSEGV, si_code);
+   no_context(regs, error_code, address, SIGSEGV, si_code, irq_state);
 }
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
+unsigned long address, irqentry_state_t *irq_state)
 {
-   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, irq_state);
 }
 
 static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 */
mmap_read_unlock(mm);
 
-   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
 }
 
 static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
re

[PATCH V3 05/10] x86/entry: Pass irqentry_state_t by reference

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Currently struct irqentry_state_t only contains a single bool value
which makes passing it by value reasonable.  However, future patches
propose to add information to this struct, for example the PKRS
register/thread state.

Adding information to irqentry_state_t makes passing by value less
efficient.  Therefore, change the entry/exit calls to pass irq_state by
reference.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.

Signed-off-by: Ira Weiny 

---
Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
 arch/x86/entry/common.c |  8 
 arch/x86/include/asm/idtentry.h | 25 ++--
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/kvm.c   |  6 +++---
 arch/x86/kernel/nmi.c   |  4 ++--
 arch/x86/kernel/traps.c | 21 
 arch/x86/mm/fault.c |  6 +++---
 include/linux/entry-common.h| 18 +
 kernel/entry/common.c   | 34 +
 9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   irqentry_state_t irq_state;
 
-   state = irqentry_enter(regs);
+   irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);
 
inhcall = get_and_clear_inhcall();
-   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+   if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   irqentry_exit(regs, &irq_state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs);	\
 									\
 __visible noinstr void func(struct pt_regs *regs)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t irq_state;					\
 									\
+	irqentry_enter(regs, &irq_state);				\
 	instrumentation_begin();					\
 	__##func (regs);						\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &irq_state);				\
 }									\
 									\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs,	\
 									\
 __visible noinstr void func(struct pt_regs *regs,			\
 			    unsigned long error_code)			\
 {									\
-	irqentry_state_t state = irqentry_enter(regs);			\
+	irqentry_state_t irq_state;					\
 									\
+	irqentry_enter(regs, &irq_state);				\
 	instrumentation_begin();					\
 	__##func (regs, error_code);					\
 	instrumentation_end();						\
-	irqentry_exit(regs, state);					\
+	irqentry_exit(regs, &irq_state);				\

[PATCH V3 02/10] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.
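
As a quick illustration of the helper's contract, using the flag and bit
names visible in the diff below (a sketch, not part of the patch):

	u32 pkru = read_pkru();

	/* Disable all access for pkey 3; other keys are left untouched. */
	pkru = update_pkey_val(pkru, 3, PKEY_DISABLE_ACCESS);
	write_pkru(pkru);

	/* Relax pkey 3 to read-only; the old AD bit is masked out first. */
	pkru = update_pkey_val(read_pkru(), 3, PKEY_DISABLE_WRITE);
	write_pkru(pkru);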

Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3:
Per Dave Hansen
Update and add comments per Dave's review
Per Peter
Correct attribution
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 --
 arch/x86/mm/pkeys.c  | 23 +++
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a99afc70cc0a..a3bca3211eba 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
-   u32 old_pkru;
-   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-   u32 new_pkru_bits = 0;
+   u32 pkru;
 
/*
 * This check implies XSAVE support.  OSPKE only gets
@@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 */
WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-   /* Set the bits we need in PKRU:  */
-   if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKR_AD_BIT;
-   if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKR_WD_BIT;
-
-   /* Shift the bits in to the correct place in PKRU for pkey: */
-   new_pkru_bits <<= pkey_shift;
-
-   /* Get old PKRU and mask off any old bits in place: */
-   old_pkru = read_pkru();
-   old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-   /* Write old part along with new part: */
-   write_pkru(old_pkru | new_pkru_bits);
+   pkru = read_pkru();
+   pkru = update_pkey_val(pkru, pkey, init_val);
+   write_pkru(pkru);
 
return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..d1dfe743e79f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt)
return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+   /*  Mask out old bit values */
+   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+   /*  Or in new values */
+   if (flags & PKEY_DISABLE_ACCESS)
+   pk_reg |= PKR_AD_BIT << pkey_shift;
+   if (flags & PKEY_DISABLE_WRITE)
+   pk_reg |= PKR_WD_BIT << pkey_shift;
+
+   return pk_reg;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


[PATCH V3 03/10] x86/pks: Add PKS defines and Kconfig options

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Add the Kconfig ARCH_HAS_SUPERVISOR_PKEYS to indicate to core code that
an architecture supports pkeys.  Select it for x86.

Define the CPU features bit needed but leave DISABLE_PKS set to disable
the feature until the implementation can be completed and enabled in a
final patch.
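
For context, the disabled-features mechanism this leans on is essentially
the following (a simplified sketch of arch/x86/include/asm/cpufeature.h):

/*
 * With DISABLE_PKS set in DISABLED_MASK16, this constant-folds to 0 for
 * X86_FEATURE_PKS, so code guarded by it is compiled out until the
 * final patch clears the disable bit.
 */
#define cpu_feature_enabled(bit)					\
	(__builtin_constant_p(bit) && DISABLED_MASK_BIT_SET(bit) ?	\
		0 : static_cpu_has(bit))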

Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Ira Weiny 

---
Changes from V2
New patch for V3:  Split this off from the enable patch to be
able to create cleaner bisectability
---
 arch/x86/Kconfig| 1 +
 arch/x86/include/asm/cpufeatures.h  | 1 +
 arch/x86/include/asm/disabled-features.h| 4 +++-
 arch/x86/include/uapi/asm/processor-flags.h | 2 ++
 mm/Kconfig  | 2 ++
 5 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..78c4c749c6a9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+   select ARCH_HAS_SUPERVISOR_PKEYS
help
  Memory Protection Keys provides a mechanism for enforcing
  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..4deb580324e8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -356,6 +356,7 @@
 #define X86_FEATURE_MOVDIRI	(16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B	(16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD	(16*32+29) /* ENQCMD and ENQCMDS instructions */
+#define X86_FEATURE_PKS		(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34f9771..164587177152 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,8 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57  0
 #else
@@ -82,7 +84,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f884a7..fc9ce7f65683 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
bool
 config ARCH_HAS_PKEYS
bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+   bool
 
 config PERCPU_STATS
bool "Collect percpu memory statistics"
-- 
2.28.0.rc0.12.gb6a658bd00c9


[PATCH V3 06/10] x86/entry: Preserve PKRS MSR across exceptions

2020-11-06 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward.  It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry.

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restore them on exception exit.

Each nested exception is further saved, allowing for any number of levels
of exception handling.
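
Sketched for two nested levels, using the functions introduced in the diff
below, the discipline looks like this (illustrative, not part of the patch):

	irqentry_state_t outer, nested;

	irq_save_set_pkrs(&outer, INIT_PKRS_VALUE);	/* outer exception entry */
	/* ... handler runs and is itself interrupted ... */
	irq_save_set_pkrs(&nested, INIT_PKRS_VALUE);	/* nested exception entry */
	irq_restore_pkrs(&nested);			/* nested exception exit */
	/* ... outer handler resumes with its own saved state ... */
	irq_restore_pkrs(&outer);			/* outer exception exit */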

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] 
https://lore.kernel.org/lkml/calcetrve1i5jdyzd_bcctxqjn+ze3t38efpgjxn1f577m36...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf@nanos.tec.linutronix.de/#t

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Suggested-by: Peter Zijlstra 
Suggested-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
remove redundant irq_state->pkrs
This value is only needed for the global tracking.  So
it should be included in that patch and not in this one.

Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
 arch/x86/entry/common.c | 38 +
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/mm/pkeys.c |  2 +-
 include/linux/entry-common.h| 13 ++
 kernel/entry/common.c   | 14 +--
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..1b6a419a6fac 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   irq_state->thread_pkrs = current->thread.saved_pkrs;
+   write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(irq_state->thread_pkrs);
+   current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+   /* Normally called by irqentry_exit, we must restore pkrs here */
+   irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 801a75615209..11a95e6efd2d 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -27,9 +27,10 @@
 PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 76a62419c446..6892d4524868 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -2

[PATCH V3 10/10] x86/pks: Add PKS test code

2020-11-06 Thread ira . weiny
From: Ira Weiny 

The core PKS functionality provides an interface for kernel users to
reserve keys for their domains, set up the page tables with those keys, and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
   this fd
'2' -- Check that the pkey register updated in '1' is still the same.
   (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
   DEFAULT when run at boot.

Closing the fd will clean up and release the pkey; therefore, to exercise
context switch testing a user space program is provided in:

.../tools/testing/selftests/x86/test_pks.c
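
A user-space driver of the context-switch check might look roughly like
this; the debugfs path below is a placeholder, since the file name is not
shown in this excerpt:

#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

int main(void)
{
	/* Path is assumed; the real program is
	 * tools/testing/selftests/x86/test_pks.c. */
	int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);

	if (fd < 0)
		return 1;
	write(fd, "1", 1);	/* '1': set pkey register to no access */
	sched_yield();		/* encourage a context switch */
	write(fd, "2", 1);	/* '2': check the value survived */
	close(fd);		/* cleanup releases the pkey */
	return 0;
}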

Reviewed-by: Dave Hansen 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Ira Weiny 

---
Changes for V2
Fix compilation errors

Changes for V1
Update for new pks_key_alloc()

Changes from RFC V3
Comments from Dave Hansen
clean up whitespace dmanage
Clean up Kconfig help
Clean up user test error output
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Comments from Jing Han
Remove duplicate stdio.h
---
 Documentation/core-api/protection-keys.rst |   1 +
 arch/x86/mm/fault.c|  23 +
 lib/Kconfig.debug  |  12 +
 lib/Makefile   |   3 +
 lib/pks/Makefile   |   3 +
 lib/pks/pks_test.c | 692 +
 tools/testing/selftests/x86/Makefile   |   3 +-
 tools/testing/selftests/x86/test_pks.c |  66 ++
 8 files changed, 802 insertions(+), 1 deletion(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index c4e6c480562f..8ffdfbff013c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -164,3 +164,4 @@ of WRPKRU.  So to quote from the WRPKRU text:
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.
 
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 90029ce9b0da..916b2d18ed57 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* efi_recover_from_page_fault()*/
 #include 
+#include 
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+   /*
+* If we get a protection key exception it could be because we
+* are running the PKS test.  If so, pks_test_callback() will
+* clear the protection mechanism and return true to indicate
+* the fault was handled.
+*/
+   return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+   return false;
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (!cpu_feature_enabled(X86_FEATURE_PKS))
WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+   if (handle_pks_testing(hw_error_code, irq_state))
+   return;
+
 #ifdef CONFIG_X86_32
/*
 * We can fault-in kernel-space virtual memory on-demand. The
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c789b39ed527..e90e06f5a3b9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2444,6 +2444,18 @@ config HYPERV_TESTING
help
  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+   bool "PKey (S)upervisor testing"
+   default n
+   depends on ARCH_HAS_SUPERVISOR_PKEYS
+   help
+ Select this option to enable testing of PKS core software and
+ hardware.  The 

[PATCH V3 08/10] x86/pks: Add PKS kernel API

2020-11-06 Thread ira . weiny
From: Fenghua Yu 

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mknoaccess(),
pks_mkread(), pks_mkrdwr(), and pks_key_free().  Add 2 new macros;
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.
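
A sketch of the intended calling pattern, using the V3 names from the
changelog below (the paragraph above still lists the pre-rename spellings);
the flags value passed here is assumed:

	int pkey = pks_key_alloc("example", 0);	/* flags value assumed */

	if (pkey < 0)		/* -EOPNOTSUPP when PKS is not supported */
		return pkey;

	/* Map the pages to protect with a pgprot of PAGE_KERNEL_PKEY(pkey). */

	pks_mk_noaccess(pkey);	/* default posture: domain locked */
	pks_mk_readwrite(pkey);	/* open an access window ... */
	/* ... touch the protected mapping ... */
	pks_mk_noaccess(pkey);	/* ... and close it again */

	pks_key_free(pkey);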

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from V2
From Greg KH
Replace all WARN_ON_ONCE() uses with pr_err()
From Dan Williams
Add __must_check to pks_key_alloc() to help ensure users
are using the API correctly

Changes from V1
Per Dave Hansen
Add flags to pks_key_alloc() to help future proof the
interface if/when the key space is exhausted.

Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 102 +---
 arch/x86/include/asm/pgtable_types.h   |  12 ++
 arch/x86/include/asm/pkeys.h   |  11 ++
 arch/x86/include/asm/pkeys_common.h|   4 +
 arch/x86/mm/pkeys.c| 128 +
 include/linux/pgtable.h|   4 +
 include/linux/pkeys.h  |  24 
 7 files changed, 267 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..c4e6c480562f 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ==
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  And It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page, User and Supervisor.  Each of these 32-bit registers stores two separate
+bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,58 @@ with a read()::
 The kernel will send a SIGSEG

[PATCH V3 00/10] PKS: Add Protection Keys Supervisor (PKS) support V3

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Changes from V2 [4]
Rebased on tip-tree/core/entry
From Thomas Gleixner
Address bisectability
Drop Patch:
x86/entry: Move nmi entry/exit into common code
From Greg KH
Remove WARN_ON's
From Dan Williams
Add __must_check to pks_key_alloc()
New patch: x86/pks: Add PKS defines and config options
Split from Enable patch to build on through the series
Fix compile errors

Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series.  That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

2 use cases for PKS are being developed, trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare it needs to be included for correctness.  Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out.  Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:


https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via a MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.
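
Concretely, assigning a mapping to a domain can be sketched as follows; the
variables are assumed to be in scope and the vmap() call is illustrative only:

	/* 'pages', 'nr_pages' and 'pkey' assumed in scope; any mapping
	 * primitive that takes a pgprot_t works the same way. */
	void *addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL_PKEY(pkey));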

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys (1-15) are allocated by an allocator which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

   1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
   2) Wo

[PATCH V3 01/10] x86/pkeys: Create pkeys_common.h

2020-11-06 Thread ira . weiny
From: Ira Weiny 

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.  Normally, these
macros would be put in asm/pkeys.h to be used internally and externally
to the arch code.  However, the defines are required in pgtable.h and
inclusion of pkeys.h in that header creates complex dependencies which
are best resolved in a separate header.

Share these defines by moving them into a new header, change their names
to reflect the common use, and include the header where needed.

Reviewed-by: Dave Hansen 
Signed-off-by: Ira Weiny 

---
NOTE: The initialization of init_pkru_value causes checkpatch errors
because of the space after the '(' in the macros.  We leave this as is
because it is more readable in this format.  And it was existing code.

---
Changes from RFC V3
Per Dave Hansen
Update commit message
Add comment to PKR_AD_KEY macro
---
 arch/x86/include/asm/pgtable.h  | 13 ++---
 arch/x86/include/asm/pkeys.h|  2 ++
 arch/x86/include/asm/pkeys_common.h | 15 +++
 arch/x86/kernel/fpu/xstate.c|  8 
 arch/x86/mm/pkeys.c | 14 ++
 5 files changed, 33 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..bfbfb951fe65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include 
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-   return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+   return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
/*
 * Access-disable disables writes too so we need to check
 * both bits here.
 */
-   return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+   return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include 
+
 #define ARCH_DEFAULT_PKEY  0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index ..737d916f476c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+/*
+ * Generate an Access-Disable mask for the given pkey.  Several of these can be
+ * OR'd together to generate pkey register values.
+ */
+#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..a99afc70cc0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
u32 old_pkru;
-   int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkru_bits = 0;
 
/*
@@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKRU_AD_BIT;
+   new_pkru_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKRU_WD_BIT;
+   new_pkru_bits |= PKR_WD_BIT;
 
/* Shift the bits in to the correct place in 

[PATCH V3 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-11-06 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule in, check the saved task MSR vs the per-cpu value.  If
different proceed to write the MSR.  If not avoid the overhead of the
MSR write and continue.

Follow on patches will update the saved PKRS as well as the MSR if
needed.
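
The write_pkrs() body itself is not visible in this excerpt; a sketch
consistent with the declarations in this series (pkrs_cache, MSR_IA32_PKRS)
would be:

void write_pkrs(u32 new_pkrs)
{
	u32 *pkrs;

	if (!static_cpu_has(X86_FEATURE_PKS))
		return;

	pkrs = get_cpu_ptr(&pkrs_cache);
	if (*pkrs != new_pkrs) {	/* skip the WRMSR when unchanged */
		*pkrs = new_pkrs;
		wrmsrl(MSR_IA32_PKRS, new_pkrs);
	}
	put_cpu_ptr(&pkrs_cache);
}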

Finally it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is
not serializing but still maintains ordering properties similar to
WRPKRU.  The current SDM section on PKRS needs updating but should be
the same as that of WRPKRU.  So to quote from the WRPKRU text:

WRPKRU will never execute transiently. Memory accesses affected
by PKRU register will not execute (even transiently) until all
prior executions of WRPKRU have completed execution and updated
the PKRU register.
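
For illustration, the check-then-write scheme described above reduces to
something like the following sketch.  This is hedged: write_pkrs() is
declared in the pkeys_common.h hunk below, but the body here is a
simplification, and the per-cpu cache name pkrs_cache follows later
patches in the series rather than this one:

    void write_pkrs(u32 new_pkrs)
    {
        u32 *pkrs = get_cpu_ptr(&pkrs_cache);

        /* Skip the costly WRMSR when the CPU already holds this value. */
        if (*pkrs != new_pkrs) {
            *pkrs = new_pkrs;
            wrmsrl(MSR_IA32_PKRS, new_pkrs);
        }
        put_cpu_ptr(pkrs);
    }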

Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from V2
Adjust for PKS enable being final patch.

Changes from V1
Rebase to latest tip/master
Resolve conflicts with INIT_THREAD changes

Changes since RFC V3
Per Dave Hansen
Update commit message
move saved_pkrs to be in a nicer place
Per Peter Zijlstra
Add Comment from Peter
Clean up white space
Update authorship
---
 arch/x86/include/asm/msr-index.h|  1 +
 arch/x86/include/asm/pkeys_common.h | 20 +++
 arch/x86/include/asm/processor.h| 18 -
 arch/x86/kernel/process.c   | 26 
 arch/x86/mm/pkeys.c | 31 +
 5 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 972a34d93505..ddb125e44408 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -754,6 +754,7 @@
 
 #define MSR_IA32_TSC_DEADLINE  0x06E0
 
+#define MSR_IA32_PKRS  0x06E1
 
 #define MSR_TSX_FORCE_ABORT0x010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 737d916f476c..801a75615209 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -12,4 +12,24 @@
  */
 #define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 82a08b585818..e9c65368b0b2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
 #include 
 #include 
 #include 
+#include <asm/pkeys_common.h>
 #include 
 #include 
 #include 
@@ -520,6 +521,12 @@ struct thread_struct {
unsigned long   cr2;
unsigned long   trap_nr;
unsigned long   error_code;
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   /* Saved Protection key register for supervisor mappings */
+   u32 saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
/* Virtual 86 mode info */
struct vm86 *vm86;
@@ -785,7 +792,16 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task) (task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS   .saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS   0
+#endif
+
+#define INIT_THREAD  { \
+   INIT_THREAD_PKRS,   \
+}
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/

Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Wed, Nov 04, 2020 at 02:45:54PM -0800, 'Ira Weiny' wrote:
> On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> > >> This is the wrong ordering, really.
> > >> 
> > >>  x86/entry: Move nmi entry/exit into common code
> > >> 
> > >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> > >> wants to go first.
> > >
> > > Sorry, yes this should be a pre-patch.
> > 
> > I picked it out of the series and applied it to tip core/entry as I have
> > other stuff coming up in that area. 
> 
> Thanks!  I'll rebase to that tree.
> 
> I assume you fixed the spelling error?  Sorry about that.

I'll fix it and send with the other spelling errors I found.

Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> >> This is the wrong ordering, really.
> >> 
> >>  x86/entry: Move nmi entry/exit into common code
> >> 
> >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> >> wants to go first.
> >
> > Sorry, yes this should be a pre-patch.
> 
> I picked it out of the series and applied it to tip core/entry as I have
> other stuff coming up in that area. 

Thanks!  I'll rebase to that tree.

I assume you fixed the spelling error?  Sorry about that.

Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-04 Thread Ira Weiny
On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> On Mon, Nov 02 2020 at 12:53, ira weiny wrote:
> > Fenghua Yu (2):
> >   x86/pks: Enable Protection Keys Supervisor (PKS)
> >   x86/pks: Add PKS kernel API
> >
> > Ira Weiny (7):
> >   x86/pkeys: Create pkeys_common.h
> >   x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
> >   x86/pks: Preserve the PKRS MSR on context switch
> >   x86/entry: Pass irqentry_state_t by reference
> >   x86/entry: Preserve PKRS MSR across exceptions
> >   x86/fault: Report the PKRS state on fault
> >   x86/pks: Add PKS test code
> >
> > Thomas Gleixner (1):
> >   x86/entry: Move nmi entry/exit into common code
> 
> So the actual patch ordering is:
> 
>x86/pkeys: Create pkeys_common.h
>x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
>x86/pks: Enable Protection Keys Supervisor (PKS)
>x86/pks: Preserve the PKRS MSR on context switch
>x86/pks: Add PKS kernel API
> 
>x86/entry: Move nmi entry/exit into common code
>x86/entry: Pass irqentry_state_t by reference
> 
>x86/entry: Preserve PKRS MSR across exceptions
>x86/fault: Report the PKRS state on fault
>x86/pks: Add PKS test code
> 
> This is the wrong ordering, really.
> 
>  x86/entry: Move nmi entry/exit into common code
> 
> is a general cleanup and has absolutely nothing to do with PKRS. So this
> wants to go first.
> 

Sorry, yes this should be a pre-patch.

> Also:
> 
> x86/entry: Move nmi entry/exit into common code
> [from other email]
>>  x86/entry: Pass irqentry_state_t by reference
> 
> is a prerequisite for the rest. So why is it in the middle of the
> series?

It is in the middle because passing by reference is not needed until additional
information is added to irqentry_state_t which is done immediately after this
patch by:

x86/entry: Preserve PKRS MSR across exceptions

I debated squashing the 2 but it made review harder IMO.  But I thought keeping
them in order together made a lot of sense.

> 
> And then you enable all that muck _before_ it is usable:
> 

Strictly speaking you are correct, sorry.  I will reorder the series.

> 
> Bisectability is overrrated, right?

Agreed, bisectability is important.  I thought I had it covered but I was
wrong.

> 
> Once again: Read an understand Documentation/process/*
> 
> Aside of that using a spell checker is not optional.

Agreed.

In looking closer at the entry code I've found a couple of other
instances, so I'll add another precursor patch.

I've also found other errors with the series which I should have caught.  My
apologies I made some last minute changes which I should have checked more
thoroughly.

Thanks,
Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-03 Thread Ira Weiny
On Tue, Nov 03, 2020 at 07:14:07PM +0100, Greg KH wrote:
> On Tue, Nov 03, 2020 at 09:53:36AM -0800, Ira Weiny wrote:
> > On Tue, Nov 03, 2020 at 07:50:24AM +0100, Greg KH wrote:
> > > On Mon, Nov 02, 2020 at 12:53:15PM -0800, ira.we...@intel.com wrote:
> > > > From: Fenghua Yu 
> > > > 
> > 
> > [snip]
> > 
> > > > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > > > index 2955ba976048..0959a4c0ca64 100644
> > > > --- a/include/linux/pkeys.h
> > > > +++ b/include/linux/pkeys.h
> > > > @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void)
> > > >  
> > > >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> > > >  
> > > > +#define PKS_FLAG_EXCLUSIVE 0x00
> > > > +
> > > > +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > > > +static inline int pks_key_alloc(const char * const pkey_user, int flags)
> > > > +{
> > > > +   return -EOPNOTSUPP;
> > > > +}
> > > > +static inline void pks_key_free(int pkey)
> > > > +{
> > > > +}
> > > > +static inline void pks_mk_noaccess(int pkey)
> > > > +{
> > > > +   WARN_ON_ONCE(1);
> > > 
> > > So for panic-on-warn systems, this is ok to reboot the box?
> > 
> > I would not expect this to reboot the box no.  But it is a violation of the 
> > API
> > contract.  If pks_key_alloc() returns an error, calling any of the other
> > functions is an error.
> > 
> > > 
> > > Are you sure, that feels odd...
> > 
> > It does feel odd and downright wrong...  But there are a lot of 
> > WARN_ON_ONCE's
> > out there to catch this type of internal programming error.  Is 
> > panic-on-warn
> > commonly used?
> 
> Yes it is, and we are trying to recover from that as it is something
> that you should recover from.  Properly handle the error and move on.

Sorry, I did not know that...  Ok I'll look at the series because I probably
have others I need to change.

Thanks,
Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-03 Thread Ira Weiny
On Tue, Nov 03, 2020 at 07:50:24AM +0100, Greg KH wrote:
> On Mon, Nov 02, 2020 at 12:53:15PM -0800, ira.we...@intel.com wrote:
> > From: Fenghua Yu 
> > 

[snip]

> > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > index 2955ba976048..0959a4c0ca64 100644
> > --- a/include/linux/pkeys.h
> > +++ b/include/linux/pkeys.h
> > @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void)
> >  
> >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> >  
> > +#define PKS_FLAG_EXCLUSIVE 0x00
> > +
> > +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +static inline int pks_key_alloc(const char * const pkey_user, int flags)
> > +{
> > +   return -EOPNOTSUPP;
> > +}
> > +static inline void pks_key_free(int pkey)
> > +{
> > +}
> > +static inline void pks_mk_noaccess(int pkey)
> > +{
> > +   WARN_ON_ONCE(1);
> 
> So for panic-on-warn systems, this is ok to reboot the box?

I would not expect this to reboot the box no.  But it is a violation of the API
contract.  If pks_key_alloc() returns an error, calling any of the other
functions is an error.

> 
> Are you sure, that feels odd...

It does feel odd and downright wrong...  But there are a lot of WARN_ON_ONCE's
out there to catch this type of internal programming error.  Is panic-on-warn
commonly used?

Ira
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH V2 10/10] x86/pks: Add PKS test code

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The core PKS functionality provides an interface for kernel users to
reserve keys for their domains, set up the page tables with those keys,
and control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
   this fd
'2' -- Check that the pkey register updated in '1' is still the same.
   (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
   DEFAULT when run at boot.

Closing the fd will cleanup and release the pkey, therefore to exercise
context switch testing a user space program is provided in:

.../tools/testing/selftests/x86/test_pks.c
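
A minimal user space driver of the debugfs interface might look like the
sketch below; the debugfs path used here is an assumption, and test_pks.c
above is the authoritative reference:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumed path; see test_pks.c for the real location. */
        int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
        char result[64];
        ssize_t n;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, "0", 1) != 1) {   /* '0': single-pkey access tests */
            perror("write");
            close(fd);
            return 1;
        }
        n = read(fd, result, sizeof(result) - 1);   /* reports PASS/FAIL */
        if (n > 0) {
            result[n] = '\0';
            printf("%s\n", result);
        }
        close(fd);
        return 0;
    }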

Reviewed-by: Dave Hansen 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Ira Weiny 

---
Changes for V1
Update for new pks_key_alloc()

Changes from RFC V3
Comments from Dave Hansen
clean up whitespace dmanage
Clean up Kconfig help
Clean up user test error output
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Comments from Jing Han
Remove duplicate stdio.h
---
 Documentation/core-api/protection-keys.rst |   1 +
 arch/x86/mm/fault.c|  23 +
 lib/Kconfig.debug  |  12 +
 lib/Makefile   |   3 +
 lib/pks/Makefile   |   3 +
 lib/pks/pks_test.c | 691 +
 tools/testing/selftests/x86/Makefile   |   3 +-
 tools/testing/selftests/x86/test_pks.c |  66 ++
 8 files changed, 801 insertions(+), 1 deletion(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index c4e6c480562f..8ffdfbff013c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -164,3 +164,4 @@ of WRPKRU.  So to quote from the WRPKRU text:
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.
 
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 931603102010..0a51b168e8ee 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* efi_recover_from_page_fault()*/
 #include 
+#include 
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+   /*
+* If we get a protection key exception it could be because we
+* are running the PKS test.  If so, pks_test_callback() will
+* clear the protection mechanism and return true to indicate
+* the fault was handled.
+*/
+   return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+   return false;
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (!cpu_feature_enabled(X86_FEATURE_PKS))
WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+   if (handle_pks_testing(hw_error_code, irq_state))
+   return;
+
 #ifdef CONFIG_X86_32
/*
 * We can fault-in kernel-space virtual memory on-demand. The
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d7a7bc3b6098..028beedd86f5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2444,6 +2444,18 @@ config HYPERV_TESTING
help
  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+   bool "PKey (S)upervisor testing"
+   default n
+   depends on ARCH_HAS_SUPERVISOR_PKEYS
+   help
+ Select this option to enable testing of PKS core software and
+ hardware.  The PKS core provides a mechanism to allocate keys 

[PATCH V2 09/10] x86/fault: Report the PKRS state on fault

2020-11-02 Thread ira . weiny
From: Ira Weiny 

When only user space pkeys are enabled, a protection key fault within
the kernel is an unexpected condition which should never happen, so a
WARN_ON in the kernel fault handler detects if it ever does.  Now this
is no longer the case if PKS is enabled and supported.

Report a Pkey fault with a normal splat and add the PKRS state to the
fault splat text.  Note the PKS register is reset during an exception;
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state the pkrs information will be
correctly reported even if a fault 'faults'.
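
With this change a protection key oops gains one extra line; for
example (values here are illustrative, the format comes from the
show_fault_oops() hunk below):

    BUG: unable to handle page fault for address: ffffc90000200000
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0023) - protection keys violation
    PKRS: 0x55555554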

Suggested-by: Andy Lutomirski 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Update commit message
Per Dave Hansen
Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS)
Fix comment
Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of
disabled-features.h
---
 arch/x86/mm/fault.c | 58 ++---
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8d20c4c13abf..931603102010 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+   irqentry_state_t *irq_state)
 {
if (!oops_may_print())
return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 (error_code & X86_PF_PK)? "protection keys violation" :
   "permissions violation");
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & X86_PF_PK))
+   pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
+#endif
+
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
struct desc_ptr idt, gdt;
u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int signal, int si_code)
+  unsigned long address, int signal, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 */
flags = oops_begin();
 
-   show_fault_oops(regs, error_code, address);
+   show_fault_oops(regs, error_code, address, irq_state);
 
if (task_stack_end_corrupted(tsk))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, u32 pkey, int si_code)
+  unsigned long address, u32 pkey, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
 
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
if (is_f00f_bug(regs, address))
return;
 
-   no_context(regs, error_code, address, SIGSEGV, si_code);
+   no_context(regs, error_code, address, SIGSEGV, si_code, irq_state);
 }
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
+unsigned long address, irqentry_state_t *irq_state)
 {
-   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, irq_state);
 }
 
 static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 */
mmap_read_unlock(mm);
 
-   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
 }
 
 static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
return;
}
 
@@ -957,7 +965,7 @@ mm_fault_

[PATCH V2 07/10] x86/entry: Pass irqentry_state_t by reference

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Currently struct irqentry_state_t only contains a single bool value
which makes passing it by value reasonable.  However, future patches
propose to add information to this struct, for example the PKRS
register/thread state.

Adding information to irqentry_state_t makes passing by value less
efficient.  Therefore, change the entry/exit calls to pass irq_state by
reference.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.
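
Concretely, every handler changes from the value-returning form to the
fill-in-by-pointer form; a simplified sketch of the pattern applied
throughout:

    /* before */
    irqentry_state_t state = irqentry_enter(regs);
    /* ... handler body ... */
    irqentry_exit(regs, state);

    /* after */
    irqentry_state_t irq_state;

    irqentry_enter(regs, &irq_state);
    /* ... handler body ... */
    irqentry_exit(regs, &irq_state);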

Signed-off-by: Ira Weiny 

---
Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
 arch/x86/entry/common.c |  8 
 arch/x86/include/asm/idtentry.h | 25 ++--
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/kvm.c   |  6 +++---
 arch/x86/kernel/nmi.c   |  4 ++--
 arch/x86/kernel/traps.c | 21 
 arch/x86/mm/fault.c |  6 +++---
 include/linux/entry-common.h| 18 +
 kernel/entry/common.c   | 34 +
 9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   irqentry_state_t irq_state;
 
-   state = irqentry_enter(regs);
+   irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);
 
inhcall = get_and_clear_inhcall();
-   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+   if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   irqentry_exit(regs, &irq_state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; \
+   \
+   irqentry_enter(regs, &irq_state);   \
    instrumentation_begin();    \
    __##func (regs);    \
    instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_exit(regs, &irq_state);    \
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs,  \
 __visible noinstr void func(struct pt_regs *regs,  \
    unsigned long error_code)   \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; \
+   \
+   irqentry_enter(regs, &irq_state);   \
instrumentation_begin();\
__##func (regs, error_code);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_e

[PATCH V2 08/10] x86/entry: Preserve PKRS MSR across exceptions

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward.  It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry.

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restore them on exception exit.

Each nested exception is further saved allowing for any number of levels
of exception handling.

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] 
https://lore.kernel.org/lkml/calcetrve1i5jdyzd_bcctxqjn+ze3t38efpgjxn1f577m36...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf@nanos.tec.linutronix.de/#t
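
To illustrate the nesting, consider two nested exceptions A and B using
the helpers added below (a sketch; each level's irqentry_state_t lives
on its own stack frame):

    irqentry_state_t a, b;

    irq_save_set_pkrs(&a, INIT_PKRS_VALUE); /* enter A: task's PKRS stashed in a */
    /* ... A runs, possibly changing PKRS via the pks API ... */
    irq_save_set_pkrs(&b, INIT_PKRS_VALUE); /* enter B: A's current PKRS stashed in b */
    irq_restore_pkrs(&b);                   /* exit B: back to A's view */
    irq_restore_pkrs(&a);                   /* exit A: back to the task's PKRS */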

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Suggested-by: Peter Zijlstra 
Suggested-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
remove redundant irq_state->pkrs
This value is only needed for the global tracking.  So
it should be included in that patch and not in this one.

Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
 arch/x86/entry/common.c | 38 +
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/mm/pkeys.c |  2 +-
 include/linux/entry-common.h| 13 ++
 kernel/entry/common.c   | 14 +--
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..1b6a419a6fac 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   irq_state->thread_pkrs = current->thread.saved_pkrs;
+   write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(irq_state->thread_pkrs);
+   current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+   /* Normally called by irqentry_exit, we must restore pkrs here */
+   irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index cd492c23b28c..f921c58793f9 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -31,9 +31,10 @@
 #define PKS_NUM_KEYS   16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 0dc77409957a..39ec034cf379 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -252,7 +252,7 @@ DEFINE_PER_CPU(u3

[PATCH V2 05/10] x86/pks: Add PKS kernel API

2020-11-02 Thread ira . weiny
From: Fenghua Yu 

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mk_noaccess(),
pks_mk_readonly(), pks_mk_readwrite(), and pks_key_free().  Add 2 new macros:
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).

Update the protection key documentation to cover pkeys on supervisor
pages.
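
A hedged sketch of typical use (names are those exported by this patch;
the mapping step is elided since it depends on the caller's allocator):

    static int pks_example(void)
    {
        int pkey = pks_key_alloc("example", PKS_FLAG_EXCLUSIVE);

        if (pkey < 0)
            return pkey;            /* -EOPNOTSUPP when PKS is unavailable */

        /* Map the domain's pages with PAGE_KERNEL_PKEY(pkey), then: */
        pks_mk_noaccess(pkey);      /* default: fault on any access */

        pks_mk_readwrite(pkey);     /* open the domain ... */
        /* ... touch the protected mapping ... */
        pks_mk_noaccess(pkey);      /* ... and close it again */

        pks_key_free(pkey);
        return 0;
    }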

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from V1
Per Dave Hansen
Add flags to pks_key_alloc() to help future proof the
interface if/when the key space is exhausted.

Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 102 ++---
 arch/x86/include/asm/pgtable_types.h   |  12 ++
 arch/x86/include/asm/pkeys.h   |  11 ++
 arch/x86/include/asm/pkeys_common.h|   4 +
 arch/x86/mm/pkeys.c| 126 +
 include/linux/pgtable.h|   4 +
 include/linux/pkeys.h  |  24 
 7 files changed, 265 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..c4e6c480562f 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ==
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers for each type
+of page, User and Supervisor.  Each of these 32-bit registers stores two separate
+bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,58 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==
+
+The following interface i

[PATCH V2 06/10] x86/entry: Move nmi entry/exit into common code

2020-11-02 Thread ira . weiny
From: Thomas Gleixner 

Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t

Signed-off-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from V1
Update commit message
Add kernel doc for struct irqentry_state_t
With Thomas' help
---
 arch/x86/entry/common.c | 34 
 arch/x86/include/asm/idtentry.h |  3 ---
 arch/x86/kernel/cpu/mce/core.c  |  6 ++---
 arch/x86/kernel/nmi.c   |  6 ++---
 arch/x86/kernel/traps.c | 13 ++-
 include/linux/entry-common.h| 39 -
 kernel/entry/common.c   | 36 ++
 7 files changed, 87 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..18d8f17f755c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
-   bool irq_state = lockdep_hardirqs_enabled();
-
-   __nmi_enter();
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   lockdep_hardirq_enter();
-   rcu_nmi_enter();
-
-   instrumentation_begin();
-   trace_hardirqs_off_finish();
-   ftrace_nmi_enter();
-   instrumentation_end();
-
-   return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
-   instrumentation_begin();
-   ftrace_nmi_exit();
-   if (restore) {
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   }
-   instrumentation_end();
-
-   rcu_nmi_exit();
-   lockdep_hardirq_exit();
-   if (restore)
-   lockdep_hardirqs_on(CALLER_ADDR0);
-   __nmi_exit();
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb0ac2f..247a60a47331 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@
 
 #include 
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *   No error code pushed by hardware
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 51bf910b1e9d..403561a89c28 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1981,7 +1981,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
WARN_ON_ONCE(user_mode(regs));
 
@@ -1993,7 +1993,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
/*
 * The call targets are marked noinstr, but objtool can't figure
 * that out because it's an indirect call. Annotate it.
@@ -2004,7 +2004,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..bf250a339655 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
/*
 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
this_cpu_write(nmi_dr7, local_db_save());
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
 
inc_irq_stat(__nmi_count);
 
if (!ignore_nmis)
default_do_nmi(regs);
 
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 
local_db_restore(this_cpu_read(nmi_dr7));
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e19df6cde35d..e1b78829d909 100644
--- a/arch/x86/kernel/

[PATCH V2 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-11-02 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule in, check the saved task MSR vs the per-cpu value.  If
different, proceed to write the MSR.  If not, avoid the overhead of the
MSR write and continue.

Follow on patches will update the saved PKRS as well as the MSR if
needed.

Finally it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is
not serializing but still maintains ordering properties similar to
WRPKRU.  The current SDM section on PKRS needs updating but should be
the same as that of WRPKRU.  So to quote from the WRPKRU text:

WRPKRU will never execute transiently. Memory accesses affected
by PKRU register will not execute (even transiently) until all
prior executions of WRPKRU have completed execution and updated
the PKRU register.

Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from V1
Rebase to latest tip/master
Resolve conflicts with INIT_THREAD changes

Changes since RFC V3
Per Dave Hansen
Update commit message
move saved_pkrs to be in a nicer place
Per Peter Zijlstra
Add Comment from Peter
Clean up white space
Update authorship
---
 arch/x86/include/asm/msr-index.h|  1 +
 arch/x86/include/asm/pkeys_common.h | 20 +++
 arch/x86/include/asm/processor.h| 18 -
 arch/x86/kernel/cpu/common.c|  2 ++
 arch/x86/kernel/process.c   | 26 
 arch/x86/mm/pkeys.c | 31 +
 6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 972a34d93505..ddb125e44408 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -754,6 +754,7 @@
 
 #define MSR_IA32_TSC_DEADLINE  0x06E0
 
+#define MSR_IA32_PKRS  0x06E1
 
 #define MSR_TSX_FORCE_ABORT0x010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 737d916f476c..801a75615209 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -12,4 +12,24 @@
  */
 #define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 60dbcdcb833f..78eb5f483410 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
 #include 
 #include 
 #include 
+#include <asm/pkeys_common.h>
 #include 
 #include 
 #include 
@@ -522,6 +523,12 @@ struct thread_struct {
unsigned long   cr2;
unsigned long   trap_nr;
unsigned long   error_code;
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   /* Saved Protection key register for supervisor mappings */
+   u32 saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
/* Virtual 86 mode info */
struct vm86 *vm86;
@@ -787,7 +794,16 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task) (task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS   .saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS   0
+#endif
+
+#define INIT_THREAD  { \
+   INIT_THREAD_PKRS,   \
+}
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/commo

[PATCH V2 01/10] x86/pkeys: Create pkeys_common.h

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.  Normally, these
macros would be put in asm/pkeys.h to be used internally and externally
to the arch code.  However, the defines are required in pgtable.h and
inclusion of pkeys.h in that header creates complex dependencies which
are best resolved in a separate header.

Share these defines by moving them into a new header, change their names
to reflect the common use, and include the header where needed.
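
For concreteness, the shared layout means the following hold for both
registers (a sketch using the new names; 2 bits per key, AD in the low
bit of each field):

    static_assert(PKR_AD_KEY(0) == 0x01);   /* key 0: AD at bit 0 */
    static_assert(PKR_AD_KEY(2) == 0x10);   /* key 2: AD at bit 4 */
    static_assert((PKR_WD_BIT << (1 * PKR_BITS_PER_PKEY)) == 0x08); /* WD, key 1 */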

Reviewed-by: Dave Hansen 
Signed-off-by: Ira Weiny 

---
NOTE: The initialization of init_pkru_value cause checkpatch errors
because of the space after the '(' in the macros.  We leave this as is
because it is more readable in this format.  And it was existing code.

---
Changes from RFC V3
Per Dave Hansen
Update commit message
Add comment to PKR_AD_KEY macro
---
 arch/x86/include/asm/pgtable.h  | 13 ++---
 arch/x86/include/asm/pkeys.h|  2 ++
 arch/x86/include/asm/pkeys_common.h | 15 +++
 arch/x86/kernel/fpu/xstate.c|  8 
 arch/x86/mm/pkeys.c | 14 ++
 5 files changed, 33 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..bfbfb951fe65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-   return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+   return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
/*
 * Access-disable disables writes too so we need to check
 * both bits here.
 */
-   return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+   return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include <asm/pkeys_common.h>
+
 #define ARCH_DEFAULT_PKEY  0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index ..737d916f476c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+/*
+ * Generate an Access-Disable mask for the given pkey.  Several of these can be
+ * OR'd together to generate pkey register values.
+ */
+#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..a99afc70cc0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
u32 old_pkru;
-   int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkru_bits = 0;
 
/*
@@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKRU_AD_BIT;
+   new_pkru_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKRU_WD_BIT;
+   new_pkru_bits |= PKR_WD_BIT;
 
/* Shift the bits in to the correct place in 

[PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Changes from V1
Rebase to TIP master; resolve conflicts and test
Clean up some kernel docs updates missed in V1
Add irqentry_state_t kernel doc for PKRS field
Removed redundant irq_state->pkrs
This is only needed when we add the global state and somehow
ended up in this patch series.  That will come back when we add
the global functionality in.
From Thomas Gleixner
Update commit messages
Add kernel doc for struct irqentry_state_t
From Dave Hansen add flags to pks_key_alloc()

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

2 use cases for PKS are being developed: trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare it needs to be included for correctness.  Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out.  Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:


https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via a MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys, (1-15) are allocated by an allocator which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

   1) Fast switching of permissions
      1a) Prevents access without page table manipulations
      1b) No TLB flushes required
   2) Works on a per thread basis

PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.
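
Worked out, the default register value this implies is key 0 open (00)
and keys 1-15 access-disabled (01 in each 2-bit field), i.e. (a sketch;
INIT_PKRS_VALUE is defined in patch 4):

    /* 0b 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 00 */
    static_assert(INIT_PKRS_VALUE == 0x55555554);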


[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.we...@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.we...@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.we...@intel.com/


Fenghua Yu (2):
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API

Ira Weiny (7):
  x86/pkeys: Cre

[PATCH V2 03/10] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-11-02 Thread ira . weiny
From: Fenghua Yu 

Protection Keys for Supervisor pages (PKS) enables fast, hardware-thread-specific
manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS and then enable PKS when configured and
indicated by the CPU instance.  While not strictly necessary in this
patch, ARCH_HAS_SUPERVISOR_PKEYS separates this functionality through
the patch series so it is introduced here.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes since RFC V3
Per Dave Hansen
Update comment
Add X86_FEATURE_PKS to disabled-features.h
Rebase based on latest TIP tree
---
 arch/x86/Kconfig|  1 +
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/disabled-features.h|  8 +++-
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c| 13 +
 mm/Kconfig  |  2 ++
 6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..78c4c749c6a9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+   select ARCH_HAS_SUPERVISOR_PKEYS
help
  Memory Protection Keys provides a mechanism for enforcing
  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..4deb580324e8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -356,6 +356,7 @@
 #define X86_FEATURE_MOVDIRI    (16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B  (16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD     (16*32+29) /* ENQCMD and ENQCMDS instructions */
+#define X86_FEATURE_PKS        (16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34f9771..82540f0c5b6c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+# define DISABLE_PKS   0
+#else
+# define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57  0
 #else
@@ -82,7 +88,7 @@
 #define DISABLED_MASK14    0
 #define DISABLED_MASK15    0
 #define DISABLED_MASK16    (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-    DISABLE_ENQCMD)
+    DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17    0
 #define DISABLED_MASK18    0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP   _BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT    22 /* enable Protection Keys support */
 #define X86_CR4_PKE        _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT    24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS        _BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..6a9ca938d9a9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1494,6 +1494,18 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+static void setup_pks(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually

[PATCH V2 02/10] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-11-02 Thread ira . weiny
From: Ira Weiny 

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.
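
A worked example of the helper's effect (a sketch; the flag names are
the existing uapi ones):

    u32 reg = 0;                            /* all keys fully open */

    reg = update_pkey_val(reg, 1, PKEY_DISABLE_WRITE);
    /* key 1 occupies bits [3:2]; WD is the field's high bit: reg == 0x8 */

    reg = update_pkey_val(reg, 1, 0);
    /* clearing the flags reopens the field: reg == 0x0 */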

Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3:
Per Dave Hansen
Update and add comments per Dave's review
Per Peter
Correct attribution
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 --
 arch/x86/mm/pkeys.c  | 23 +++
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a99afc70cc0a..a3bca3211eba 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
-   u32 old_pkru;
-   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-   u32 new_pkru_bits = 0;
+   u32 pkru;
 
/*
 * This check implies XSAVE support.  OSPKE only gets
@@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
 */
WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-   /* Set the bits we need in PKRU:  */
-   if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKR_AD_BIT;
-   if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKR_WD_BIT;
-
-   /* Shift the bits in to the correct place in PKRU for pkey: */
-   new_pkru_bits <<= pkey_shift;
-
-   /* Get old PKRU and mask off any old bits in place: */
-   old_pkru = read_pkru();
-   old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-   /* Write old part along with new part: */
-   write_pkru(old_pkru | new_pkru_bits);
+   pkru = read_pkru();
+   pkru = update_pkey_val(pkru, pkey, init_val);
+   write_pkru(pkru);
 
return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..d1dfe743e79f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt)
return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+   /*  Mask out old bit values */
+   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+   /*  Or in new values */
+   if (flags & PKEY_DISABLE_ACCESS)
+   pk_reg |= PKR_AD_BIT << pkey_shift;
+   if (flags & PKEY_DISABLE_WRITE)
+   pk_reg |= PKR_WD_BIT << pkey_shift;
+
+   return pk_reg;
+}
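
To make the replace semantics concrete, here is a worked call with
made-up values (not part of this patch):

        /* Key 3 starts access-disabled; make it read-only instead.
         * PKR_BITS_PER_PKEY == 2, PKR_AD_BIT == 0x1, PKR_WD_BIT == 0x2.
         */
        u32 reg = PKR_AD_BIT << (3 * PKR_BITS_PER_PKEY);  /* reg == 0x40 */

        reg = update_pkey_val(reg, 3, PKEY_DISABLE_WRITE);
        /* The old AD bit for key 3 is masked out and WD is OR'd in:
         * reg == 0x80; all other keys are untouched.
         */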
-- 
2.28.0.rc0.12.gb6a658bd00c9


Re: [PATCH 07/10] x86/entry: Pass irqentry_state_t by reference

2020-10-27 Thread Ira Weiny
On Fri, Oct 23, 2020 at 11:56:33PM +0200, Thomas Gleixner wrote:
> On Thu, Oct 22 2020 at 15:26, ira weiny wrote:
> > From: Ira Weiny 
> >
> > In preparation for adding PKS information to struct irqentry_state_t
> > change all call sites and usages to pass the struct by reference
> > instead of by value.
> 
> This still does not explain WHY you need to do that. 'Preparation' is a
> pretty useless information.
> 
> What is the actual reason for this? Just because PKS information feels
> better that way?
> 
> Also what is PKS information? Changelogs have to make sense on their own
> and not only in the context of a larger series of changes.

I've reworded this to explain the addition of new members which would make
passing by value less efficient with additional structure changes being added
later in the series.

> 
> > While we are editing the call sites it is a good time to standardize on
> > the name 'irq_state'.
> 
>   While at it change all usage sites to consistently use the variable
>   name 'irq_state'.
> 
> Or something like that. See Documentation/process/...

Sorry, bad habit.

Fixed.

Thanks,
Ira

> 
> Thanks,
> 
> tglx
> 


Re: [PATCH 06/10] x86/entry: Move nmi entry/exit into common code

2020-10-27 Thread Ira Weiny
On Fri, Oct 23, 2020 at 11:50:11PM +0200, Thomas Gleixner wrote:
> On Thu, Oct 22 2020 at 15:26, ira weiny wrote:
> 
> > From: Thomas Gleixner 
> >
> > Lockdep state handling on NMI enter and exit is nothing specific to X86. 
> > It's
> > not any different on other architectures. Also the extra state type is not
> > necessary, irqentry_state_t can carry the necessary information as well.
> >
> > Move it to common code and extend irqentry_state_t to carry lockdep
> > state.
> 
> This lacks something like:
> 
>  [ Ira: Made the states a union as they are mutually exclusive and added
> the missing kernel doc ]

Fair enough.  done.

> 
> Hrm.
>  
> >  #ifndef irqentry_state
> >  typedef struct irqentry_state {
> > -   boolexit_rcu;
> > +   union {
> > +   boolexit_rcu;
> > +   boollockdep;
> > +   };
> >  } irqentry_state_t;
> >  #endif
> 
>   -E_NO_KERNELDOC

Adding: Paul McKenney

I'm happy to write something but I'm very unfamiliar with this code, so I'm
confused about what exactly exit_rcu is flagging.

I can see that exit_rcu is a bad name for the state used in
irqentry_nmi_[enter|exit]().  Furthermore, I see why 'lockdep' is a better
name.  But similar lockdep handling is used in irqentry_exit() if exit_rcu is
true...


Given my limited knowledge, here is my proposed text:

/**
 * struct irqentry_state - Opaque object for exception state storage
 * @exit_rcu: Used exclusively in the irqentry_*() calls; tracks if the
 *exception hit the idle task which requires special handling,
 *including calling rcu_irq_exit(), when the exception exits.
 * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures lockdep
 *   tracking is maintained if hardirqs were already enabled
 *
 * This opaque object is filled in by the irqentry_*_enter() functions and
 * should be passed back into the corresponding irqentry_*_exit() functions
 * when the exception is complete.
 *
 * Callers of irqentry_*_[enter|exit]() should consider this structure opaque
 * and all members private.  Descriptions of the members are provided to aid in
 * the maintenance of the irqentry_*() functions.
 */


Perhaps Paul can enlighten me on how exit_rcu is used beyond just flagging a
call to rcu_irq_exit()?

Why do we call lockdep_hardirqs_off() only when in the idle task?  That implies
that regs_irqs_disabled() can only be false if we were in the idle task, to
match up the lockdep on/off calls.  This does not make sense to me; why do we
need the extra check for exit_rcu?  I'm still trying to understand when
regs_irqs_disabled() is false.


} else if (!regs_irqs_disabled(regs)) {
...
} else {
/*
 * IRQ flags state is correct already. Just tell RCU if it
 * was not watching on entry.
 */
if (state.exit_rcu)
rcu_irq_exit();
}

Also, the comment in irqentry_enter() refers to irq_enter_from_user_mode() which
does not seem to exist anymore.  So I'm not sure what careful sequence it is
referring to.

/*
 * If RCU is not watching then the same careful
 * sequence vs. lockdep and tracing is required
 * as in irq_enter_from_user_mode().
 */

?

Ira


[PATCH 09/10] x86/fault: Report the PKRS state on fault

2020-10-22 Thread ira . weiny
From: Ira Weiny 

When only user space pkeys are enabled, faulting within the kernel was an
unexpected condition which should never happen, so a WARN_ON in the
kernel fault handler would detect if it ever did.  This is no longer the
case once PKS is enabled and supported.

Report a Pkey fault with a normal splat and add the PKRS state to the
fault splat text.  Note that the PKS register is reset during an exception;
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state, the PKRS information will be
correctly reported even if a fault handler itself faults.
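
With this change a protection-key oops carries one additional line of
the form (the value shown is illustrative):

        PKRS: 0x55555554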

Suggested-by: Andy Lutomirski 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Update commit message
Per Dave Hansen
Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS)
Fix comment
Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of
disabled-features.h
---
 arch/x86/mm/fault.c | 58 ++---
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5e3fd7763315..e2f83a596b50 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const 
char *name, u16 index)
 }
 
 static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long 
address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long 
address,
+   irqentry_state_t *irq_state)
 {
if (!oops_may_print())
return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
error_code, unsigned long ad
 (error_code & X86_PF_PK)? "protection keys violation" :
   "permissions violation");
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & 
X86_PF_PK))
+   pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
+#endif
+
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
struct desc_ptr idt, gdt;
u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,
 
 static noinline void
 no_context(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, int signal, int si_code)
+  unsigned long address, int signal, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 */
flags = oops_begin();
 
-   show_fault_oops(regs, error_code, address);
+   show_fault_oops(regs, error_code, address, irq_state);
 
if (task_stack_end_corrupted(tsk))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)
 
 static void
 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-  unsigned long address, u32 pkey, int si_code)
+  unsigned long address, u32 pkey, int si_code,
+  irqentry_state_t *irq_state)
 {
struct task_struct *tsk = current;
 
@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned 
long error_code,
if (is_f00f_bug(regs, address))
return;
 
-   no_context(regs, error_code, address, SIGSEGV, si_code);
+   no_context(regs, error_code, address, SIGSEGV, si_code, irq_state);
 }
 
 static noinline void
 bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
-unsigned long address)
+unsigned long address, irqentry_state_t *irq_state)
 {
-   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+   __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, 
irq_state);
 }
 
 static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 */
mmap_read_unlock(mm);
 
-   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+   __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
 }
 
 static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, 
unsigned long address,
 {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
-   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+   no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
return;
}
 
@@ -957,7 +965,7 @@ mm_fault_

[PATCH 04/10] x86/pks: Preserve the PKRS MSR on context switch

2020-10-22 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU.  Initialize all tasks with the default MSR value.  Then, on
schedule in, compare the saved task MSR value with the per-cpu value.  If
they differ, write the MSR; if not, skip the write and avoid the overhead
of the MSR update.
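
A minimal sketch of that schedule-in check (assuming the pkrs_cache
per-cpu variable and write_pkrs() helper this patch introduces; the
wrapper name here is illustrative):

        static inline void pks_sched_in(void)
        {
                /* Only pay for the WRMSR when the task's saved value and
                 * the cached per-cpu value actually disagree.
                 */
                if (this_cpu_read(pkrs_cache) != current->thread.saved_pkrs)
                        write_pkrs(current->thread.saved_pkrs);
        }

For reference, the default INIT_PKRS_VALUE defined below works out to
0x55555554: Access Disable set for keys 1-15, key 0 unrestricted.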

Follow on patches will update the saved PKRS as well as the MSR if
needed.

Finally it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is
not serializing but still maintains ordering properties similar to
WRPKRU.  The current SDM section on PKRS needs updating but should be
the same as that of WRPKRU.  So to quote from the WRPKRU text:

WRPKRU will never execute transiently. Memory accesses affected
by PKRU register will not execute (even transiently) until all
prior executions of WRPKRU have completed execution and updated
the PKRU register.

Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes since RFC V3
Per Dave Hansen
Update commit message
move saved_pkrs to be in a nicer place
Per Peter Zijlstra
Add Comment from Peter
Clean up white space
Update authorship
---
 arch/x86/include/asm/msr-index.h|  1 +
 arch/x86/include/asm/pkeys_common.h | 20 +++
 arch/x86/include/asm/processor.h| 14 +
 arch/x86/kernel/cpu/common.c|  2 ++
 arch/x86/kernel/process.c   | 26 
 arch/x86/mm/pkeys.c | 31 +
 6 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 972a34d93505..ddb125e44408 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -754,6 +754,7 @@
 
 #define MSR_IA32_TSC_DEADLINE  0x06E0
 
+#define MSR_IA32_PKRS  0x06E1
 
 #define MSR_TSX_FORCE_ABORT0x010F
 
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
index 737d916f476c..801a75615209 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -12,4 +12,24 @@
  */
 #define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
 
+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction.  All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f88c74d7dbd4..49975e44e3dd 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -526,6 +527,12 @@ struct thread_struct {
unsigned long   cr2;
unsigned long   trap_nr;
unsigned long   error_code;
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   /* Saved Protection key register for supervisor mappings */
+   u32 saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
/* Virtual 86 mode info */
struct vm86 *vm86;
@@ -843,8 +850,15 @@ static inline void spin_lock_prefetch(const void *x)
 #define STACK_TOP  TASK_SIZE_LOW
 #define STACK_TOP_MAX  TASK_SIZE_MAX
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS   .saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS   0
+#endif
+
 #define INIT_THREAD  { \
.addr_limit = KERNEL_DS,\
+   INIT_THREAD_PKRS,   \
 }
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6a9ca938d9a9..f8929a557d72 100644
--- a/arch/x86/kernel/cpu/common.c

[PATCH 05/10] x86/pks: Add PKS kernel API

2020-10-22 Thread ira . weiny
From: Fenghua Yu 

PKS allows kernel users to define domains of page mappings which have
additional protections beyond the paging protections.

Add an API to allocate, use, and free a protection key which identifies
such a domain.  Export 5 new symbols pks_key_alloc(), pks_mknoaccess(),
pks_mkread(), pks_mkrdwr(), and pks_key_free().  Add 2 new macros:
PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey).
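
A sketch of how a caller strings these together (the exact signatures
are assumptions based on the names above, not taken from the diff):

        int pkey = pks_key_alloc("example-domain");

        if (pkey < 0)
                return pkey;    /* allocation can fail: exhaustion or no PKS */

        /* Map pages into the domain with PAGE_KERNEL_PKEY(pkey), then: */
        pks_mknoaccess(pkey);   /* lock the domain down */
        pks_mkread(pkey);       /* ... or allow read-only access */
        pks_mkrdwr(pkey);       /* ... or allow read/write access */

        pks_key_free(pkey);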

Update the protection key documentation to cover pkeys on supervisor
pages.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes from RFC V3
Per Dave Hansen
Put WARN_ON_ONCE in pks_key_free()
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change return pks_key_alloc() to EOPNOTSUPP when not
supported or configured
Per Peter Zijlstra
Remove unneeded preempt disable/enable
---
 Documentation/core-api/protection-keys.rst | 101 ++---
 arch/x86/include/asm/pgtable_types.h   |  12 ++
 arch/x86/include/asm/pkeys.h   |  11 ++
 arch/x86/include/asm/pkeys_common.h|   4 +
 arch/x86/mm/pkeys.c| 123 +
 include/linux/pgtable.h|   4 +
 include/linux/pkeys.h  |  22 
 7 files changed, 259 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..e6cb29dda5b8 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,25 +4,33 @@
 Memory Protection Keys
 ==
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
 Memory Protection Keys provides a mechanism for enforcing page-based
 protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
+when an application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  It will be available in future
+non-server Intel parts and future AMD processors.
+
+Future Intel processors will support Protection Keys for Supervisor pages
+(PKS).
+
+For anyone wishing to test or use user space pkeys, it is available in Amazon's
+EC2 C5 instances and is known to work there using an Ubuntu 17.04 image.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.  User and Supervisor pages are
+treated separately.
+
+Protections for each page are controlled with per-CPU registers, one for each
+type of page (User and Supervisor).  Each of these 32-bit registers stores two
+separate bits (Access Disable and Write Disable) for each key.
 
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
-thread a different set of protections from every other thread.
+For Userspace the register is user-accessible (rdpkru/wrpkru).  For
+Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel.
+
+Being a CPU register, pkeys are inherently thread-local, potentially giving
+each thread an independent set of protections from every other thread.
 
 There are two new instructions (RDPKRU/WRPKRU) for reading and writing
 to the new register.  The feature is only available in 64-bit mode,
@@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs.  
These
 permissions are enforced on data access only and have no effect on
 instruction fetches.
 
-Syscalls
-
+For kernel space rdmsr/wrmsr are used to access the kernel MSRs.
+
+
+Syscalls for user space keys
+
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -98,3 +109,57 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==
+
+The following interface is used to allocate, use, and free a pkey which defines
+a 'protection domain' within the kernel.  Setting a pkey value in a supervisor
+mapping adds that mapping to the protect

[PATCH 02/10] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-10-22 Thread ira . weiny
From: Ira Weiny 

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3:
Per Dave Hansen
Update and add comments per Dave's review
Per Peter
Correct attribution
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 --
 arch/x86/mm/pkeys.c  | 23 +++
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a99afc70cc0a..a3bca3211eba 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
 {
-   u32 old_pkru;
-   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
-   u32 new_pkru_bits = 0;
+   u32 pkru;
 
/*
 * This check implies XSAVE support.  OSPKE only gets
@@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
 */
WARN_ON_ONCE(pkey >= arch_max_pkey());
 
-   /* Set the bits we need in PKRU:  */
-   if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKR_AD_BIT;
-   if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKR_WD_BIT;
-
-   /* Shift the bits in to the correct place in PKRU for pkey: */
-   new_pkru_bits <<= pkey_shift;
-
-   /* Get old PKRU and mask off any old bits in place: */
-   old_pkru = read_pkru();
-   old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-   /* Write old part along with new part: */
-   write_pkru(old_pkru | new_pkru_bits);
+   pkru = read_pkru();
+   pkru = update_pkey_val(pkru, pkey, init_val);
+   write_pkru(pkru);
 
return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..d1dfe743e79f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt)
return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+   /*  Mask out old bit values */
+   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+   /*  Or in new values */
+   if (flags & PKEY_DISABLE_ACCESS)
+   pk_reg |= PKR_AD_BIT << pkey_shift;
+   if (flags & PKEY_DISABLE_WRITE)
+   pk_reg |= PKR_WD_BIT << pkey_shift;
+
+   return pk_reg;
+}
-- 
2.28.0.rc0.12.gb6a658bd00c9


[PATCH 06/10] x86/entry: Move nmi entry/exit into common code

2020-10-22 Thread ira . weiny
From: Thomas Gleixner 

Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.
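
The resulting state object looks like this (the same shape quoted in the
review thread earlier in this archive; comments added here for
orientation); the union works because the two members are used on
mutually exclusive entry paths:

        #ifndef irqentry_state
        typedef struct irqentry_state {
                union {
                        bool    exit_rcu;  /* irqentry_enter()/irqentry_exit() */
                        bool    lockdep;   /* irqentry_nmi_enter()/irqentry_nmi_exit() */
                };
        } irqentry_state_t;
        #endif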

Signed-off-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 
---
 arch/x86/entry/common.c | 34 ---
 arch/x86/include/asm/idtentry.h |  3 ---
 arch/x86/kernel/cpu/mce/core.c  |  6 +++---
 arch/x86/kernel/nmi.c   |  6 +++---
 arch/x86/kernel/traps.c | 13 ++--
 include/linux/entry-common.h| 24 +-
 kernel/entry/common.c   | 36 +
 7 files changed, 72 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..18d8f17f755c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
-   bool irq_state = lockdep_hardirqs_enabled();
-
-   __nmi_enter();
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   lockdep_hardirq_enter();
-   rcu_nmi_enter();
-
-   instrumentation_begin();
-   trace_hardirqs_off_finish();
-   ftrace_nmi_enter();
-   instrumentation_end();
-
-   return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
-   instrumentation_begin();
-   ftrace_nmi_exit();
-   if (restore) {
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   }
-   instrumentation_end();
-
-   rcu_nmi_exit();
-   lockdep_hardirq_exit();
-   if (restore)
-   lockdep_hardirqs_on(CALLER_ADDR0);
-   __nmi_exit();
-}
-
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb0ac2f..247a60a47331 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@
 
 #include 
 
-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
 /**
  * DECLARE_IDTENTRY - Declare functions for simple IDT entry points
  *   No error code pushed by hardware
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 1c08cb9eb9f6..eb3338c0bbc1 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1983,7 +1983,7 @@ void (*machine_check_vector)(struct pt_regs *) = 
unexpected_machine_check;
 
 static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
WARN_ON_ONCE(user_mode(regs));
 
@@ -1995,7 +1995,7 @@ static __always_inline void 
exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
/*
 * The call targets are marked noinstr, but objtool can't figure
 * that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void 
exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 }
 
 static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..bf250a339655 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);
 
 DEFINE_IDTENTRY_RAW(exc_nmi)
 {
-   bool irq_state;
+   irqentry_state_t irq_state;
 
/*
 * Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)
 
this_cpu_write(nmi_dr7, local_db_save());
 
-   irq_state = idtentry_enter_nmi(regs);
+   irq_state = irqentry_nmi_enter(regs);
 
inc_irq_stat(__nmi_count);
 
if (!ignore_nmis)
default_do_nmi(regs);
 
-   idtentry_exit_nmi(regs, irq_state);
+   irqentry_nmi_exit(regs, irq_state);
 
local_db_restore(this_cpu_read(nmi_dr7));
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3c70fb34028b..bffbbe29fc8c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -405,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
 #endif
 
-   idtentry_enter_nmi(regs);
+   irqentry_nmi_enter(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
 
@@ -651,12 

[PATCH 10/10] x86/pks: Add PKS test code

2020-10-22 Thread ira . weiny
From: Ira Weiny 

The core PKS functionality provides an interface for kernel users to
reserve keys for their domains, set up the page tables with those keys, and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry.  Basic checks can be triggered on boot with a kernel
command line option while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
   this fd
'2' -- Check that the pkey register updated in '1' is still the same.
   (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
   DEFAULT when run at boot.

Closing the fd will clean up and release the pkey; therefore, to exercise
context switch testing, a user space program is provided in:

.../tools/testing/selftests/x86/test_pks.c
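
A minimal user-space driver for these controls could look like the
sketch below; the debugfs file path is an assumption for illustration
and is not spelled out here:

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                /* Hypothetical path to the PKS test debugfs entry */
                int fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);

                if (fd < 0)
                        return 1;
                write(fd, "0", 1);      /* '0': single-pkey access tests */
                /* ... force a context switch, then write "2" to re-check ... */
                close(fd);              /* releases the pkey allocated to this fd */
                return 0;
        }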

Reviewed-by: Dave Hansen 
Co-developed-by: Fenghua Yu 
Signed-off-by: Fenghua Yu 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Comments from Dave Hansen
clean up whitespace damage
Clean up Kconfig help
Clean up user test error output
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Comments from Jing Han
Remove duplicate stdio.h
---
 Documentation/core-api/protection-keys.rst |   1 +
 arch/x86/mm/fault.c|  23 +
 lib/Kconfig.debug  |  12 +
 lib/Makefile   |   3 +
 lib/pks/Makefile   |   3 +
 lib/pks/pks_test.c | 691 +
 tools/testing/selftests/x86/Makefile   |   3 +-
 tools/testing/selftests/x86/test_pks.c |  66 ++
 8 files changed, 801 insertions(+), 1 deletion(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst 
b/Documentation/core-api/protection-keys.rst
index e6cb29dda5b8..b0196c6cf29a 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -163,3 +163,4 @@ of WRPKRU.  So to quote from the WRPKRU text:
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.
 
+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2f83a596b50..4b23a13a7802 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* efi_recover_from_page_fault()*/
 #include 
+#include 
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
 }
 
+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t 
*irq_state)
+{
+   /*
+* If we get a protection key exception it could be because we
+* are running the PKS test.  If so, pks_test_callback() will
+* clear the protection mechanism and return true to indicate
+* the fault was handled.
+*/
+   return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t 
*irq_state)
+{
+   return false;
+}
+#endif
+
 /*
  * Called for all faults where 'address' is part of the kernel address
  * space.  Might get called for faults that originate from *code* that
@@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long 
hw_error_code,
if (!cpu_feature_enabled(X86_FEATURE_PKS))
WARN_ON_ONCE(hw_error_code & X86_PF_PK);
 
+   if (handle_pks_testing(hw_error_code, irq_state))
+   return;
+
 #ifdef CONFIG_X86_32
/*
 * We can fault-in kernel-space virtual memory on-demand. The
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 66d44d35cc97..8bc32ad9ae94 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2446,6 +2446,18 @@ config HYPERV_TESTING
help
  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TESTING
+   bool "PKey (S)upervisor testing"
+   default n
+   depends on ARCH_HAS_SUPERVISOR_PKEYS
+   help
+ Select this option to enable testing of PKS core software and
+ hardware.  The PKS core provides a mechanism to allocate keys as well
+ as maintain the protection set

[PATCH 08/10] x86/entry: Preserve PKRS MSR across exceptions

2020-10-22 Thread ira . weiny
From: Ira Weiny 

The PKRS MSR is not managed by XSAVE.  It is preserved through a context
switch, but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered:
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward.  It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry.

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restore them on exception exit.

The state of each nested exception is saved as well, allowing for any
number of levels of exception handling.
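
Conceptually the nesting works like this (a sketch using the helpers
added below; each irqentry_state_t carries its own snapshot):

        irqentry_state_t outer, inner;

        irq_save_set_pkrs(&outer, INIT_PKRS_VALUE);     /* exception entry   */
        irq_save_set_pkrs(&inner, INIT_PKRS_VALUE);     /* nested exception  */
        irq_restore_pkrs(&inner);                       /* nested exit       */
        irq_restore_pkrs(&outer);                       /* back to task PKRS */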

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] 
https://lore.kernel.org/lkml/calcetrve1i5jdyzd_bcctxqjn+ze3t38efpgjxn1f577m36...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf@nanos.tec.linutronix.de/#t

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Suggested-by: Peter Zijlstra 
Suggested-by: Thomas Gleixner 
Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
 arch/x86/entry/common.c | 39 +
 arch/x86/include/asm/pkeys_common.h |  5 ++--
 arch/x86/mm/pkeys.c |  2 +-
 include/linux/entry-common.h| 12 +
 kernel/entry/common.c   | 14 +--
 5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..c047ea57bb8c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_XEN_PV
 #include 
@@ -209,6 +210,42 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception.  Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   irq_state->thread_pkrs = current->thread.saved_pkrs;
+   irq_state->pkrs = this_cpu_read(pkrs_cache);
+   write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   write_pkrs(irq_state->pkrs);
+   current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -272,6 +309,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
 
inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+   /* Normally called by irqentry_exit, we must restore pkrs here 
*/
+   irq_restore_pkrs(_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
index cd492c23b28c..f921c58793f9 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -31,9 +31,10 @@
 #definePKS_NUM_KEYS16
 
 #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
 #else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
 #endif
 
 #endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index fd5c4d34c3a5..a05f4edf2f80 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -252,7 +252,7 @@ DEFINE_PER_CPU(u32, pkrs_cache);
  * until all prior executions of WRPKRU have completed execution
  * and updated the PKRU register.
  */
-void write_pkrs(u32 new_p

[PATCH 03/10] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-10-22 Thread ira . weiny
From: Fenghua Yu 

Protection Keys for Supervisor pages (PKS) enables fast, hardware-thread-specific
manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS and then enable PKS when configured and
indicated by the CPU instance.  While not strictly necessary in this
patch, ARCH_HAS_SUPERVISOR_PKEYS separates this functionality through
the patch series so it is introduced here.

Co-developed-by: Ira Weiny 
Signed-off-by: Ira Weiny 
Signed-off-by: Fenghua Yu 

---
Changes since RFC V3
Per Dave Hansen
Update comment
Add X86_FEATURE_PKS to disabled-features.h
Rebase based on latest TIP tree
---
 arch/x86/Kconfig|  1 +
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/disabled-features.h|  8 +++-
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c| 13 +
 mm/Kconfig  |  2 ++
 6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..78c4c749c6a9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+   select ARCH_HAS_SUPERVISOR_PKEYS
help
  Memory Protection Keys provides a mechanism for enforcing
  page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..4deb580324e8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -356,6 +356,7 @@
 #define X86_FEATURE_MOVDIRI(16*32+27) /* MOVDIRI instruction */
 #define X86_FEATURE_MOVDIR64B  (16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD (16*32+29) /* ENQCMD and ENQCMDS 
instructions */
+#define X86_FEATURE_PKS(16*32+31) /* Protection Keys 
for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x8007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery 
support */
diff --git a/arch/x86/include/asm/disabled-features.h 
b/arch/x86/include/asm/disabled-features.h
index 5861d34f9771..82540f0c5b6c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+# define DISABLE_PKS   0
+#else
+# define DISABLE_PKS   (1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57  0
 #else
@@ -82,7 +88,7 @@
 #define DISABLED_MASK140
 #define DISABLED_MASK150
 #define DISABLED_MASK16
(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-DISABLE_ENQCMD)
+DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK170
 #define DISABLED_MASK180
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h 
b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP   _BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT22 /* enable Protection Keys support */
 #define X86_CR4_PKE_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT24 /* enable Protection Keys for 
Supervisor */
+#define X86_CR4_PKS_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..6a9ca938d9a9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1494,6 +1494,18 @@ static void validate_apic_and_package_id(struct 
cpuinfo_x86 *c)
 #endif
 }
 
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+static void setup_pks(void)
+{
+   if (!cpu_feature_enabled(X86_FEATURE_PKS))
+   return;
+
+   cr4_set_bits(X86_CR4_PKS);
+}
+
 /*
  * This does the hard work of actually

[PATCH 01/10] x86/pkeys: Create pkeys_common.h

2020-10-22 Thread ira . weiny
From: Ira Weiny 

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features.  Normally, these
macros would be put in asm/pkeys.h to be used internally and externally
to the arch code.  However, the defines are required in pgtable.h and
inclusion of pkeys.h in that header creates complex dependencies which
are best resolved in a separate header.

Share these defines by moving them into a new header, change their names
to reflect the common use, and include the header where needed.
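
For reference, the shared layout means register values compose like this
(a worked example, not part of the patch):

        /* 2 bits per key: AD in the low bit, WD in the high bit of each
         * field.  Access-disable key 1 plus write-disable key 2:
         *
         *      (PKR_AD_BIT << (1 * PKR_BITS_PER_PKEY)) == 0x04
         *      (PKR_WD_BIT << (2 * PKR_BITS_PER_PKEY)) == 0x20
         *
         * for a combined register value of 0x24.
         */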

Reviewed-by: Dave Hansen 
Signed-off-by: Ira Weiny 

---
NOTE: The initialization of init_pkru_value causes checkpatch errors
because of the space after the '(' in the macros.  We leave this as is
because it is more readable in this format.  And it was existing code.

---
Changes from RFC V3
Per Dave Hansen
Update commit message
Add comment to PKR_AD_KEY macro
---
 arch/x86/include/asm/pgtable.h  | 13 ++---
 arch/x86/include/asm/pkeys.h|  2 ++
 arch/x86/include/asm/pkeys_common.h | 15 +++
 arch/x86/kernel/fpu/xstate.c|  8 
 arch/x86/mm/pkeys.c | 14 ++
 5 files changed, 33 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..bfbfb951fe65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include 
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-   return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+   return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-   int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+   int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
/*
 * Access-disable disables writes too so we need to check
 * both bits here.
 */
-   return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+   return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PKEYS_H
 #define _ASM_X86_PKEYS_H
 
+#include 
+
 #define ARCH_DEFAULT_PKEY  0
 
 /*
diff --git a/arch/x86/include/asm/pkeys_common.h 
b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index ..737d916f476c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+/*
+ * Generate an Access-Disable mask for the given pkey.  Several of these can be
+ * OR'd together to generate pkey register values.
+ */
+#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..a99afc70cc0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int 
pkey,
unsigned long init_val)
 {
u32 old_pkru;
-   int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+   int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkru_bits = 0;
 
/*
@@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, 
int pkey,
 
/* Set the bits we need in PKRU:  */
if (init_val & PKEY_DISABLE_ACCESS)
-   new_pkru_bits |= PKRU_AD_BIT;
+   new_pkru_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
-   new_pkru_bits |= PKRU_WD_BIT;
+   new_pkru_bits |= PKR_WD_BIT;
 
/* Shift the bits in to the correct place in 

[PATCH 07/10] x86/entry: Pass irqentry_state_t by reference

2020-10-22 Thread ira . weiny
From: Ira Weiny 

In preparation for adding PKS information to struct irqentry_state_t
change all call sites and usages to pass the struct by reference
instead of by value.

While we are editing the call sites it is a good time to standardize on
the name 'irq_state'.

Signed-off-by: Ira Weiny 

---
Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
 arch/x86/entry/common.c |  8 
 arch/x86/include/asm/idtentry.h | 25 ++--
 arch/x86/kernel/cpu/mce/core.c  |  4 ++--
 arch/x86/kernel/kvm.c   |  6 +++---
 arch/x86/kernel/nmi.c   |  4 ++--
 arch/x86/kernel/traps.c | 21 
 arch/x86/mm/fault.c |  6 +++---
 include/linux/entry-common.h| 16 ++--
 kernel/entry/common.c   | 34 +
 9 files changed, 65 insertions(+), 59 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
 {
struct pt_regs *old_regs;
bool inhcall;
-   irqentry_state_t state;
+   irqentry_state_t irq_state;
 
-   state = irqentry_enter(regs);
+   irqentry_enter(regs, _state);
old_regs = set_irq_regs(regs);
 
instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct 
pt_regs *regs)
set_irq_regs(old_regs);
 
inhcall = get_and_clear_inhcall();
-   if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+   if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
-   irqentry_exit(regs, state);
+   irqentry_exit(regs, _state);
}
 }
 #endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); 
\
\
 __visible noinstr void func(struct pt_regs *regs)  \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; 
\
\
+   irqentry_enter(regs, _state);   
\
instrumentation_begin();\
__##func (regs);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_exit(regs, _state);
\
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs,  
\
 __visible noinstr void func(struct pt_regs *regs,  \
unsigned long error_code)   \
 {  \
-   irqentry_state_t state = irqentry_enter(regs);  \
+   irqentry_state_t irq_state; 
\
\
+   irqentry_enter(regs, _state);   
\
instrumentation_begin();\
__##func (regs, error_code);\
instrumentation_end();  \
-   irqentry_exit(regs, state); \
+   irqentry_exit(regs, _state);
\
 }  \
\
 static __always_inline void __##func(struct pt_regs *regs, \
@@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs 
*regs, u8 vector);\
 __visible noinstr void func(stru

[PATCH 00/10] PKS: Add Protection Keys Supervisor (PKS) support

2020-10-22 Thread ira . weiny
From: Ira Weiny 

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection
Keys Supervisor (PKS).

2 use cases for PKS are being developed: trusted keys and PMEM.  Trusted keys
is a newer use case which is still being explored.  PMEM was submitted as part
of the RFC (v2) series[1].  However, since then it was found that some callers
of kmap() require a global implementation of PKS.  Specifically some users of
kmap() expect mappings to be available to all kernel threads.  While global use
of PKS is rare, it needs to be included for correctness.  Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out.  Because the
global PKS feature is only required for that use case, it will be deferred to
that set as well.[2]  This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:


https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via a MSR update without TLB flushes when permissions change.  Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR.  Therefore the implementation
saves/restores the MSR across context switches and during exceptions.  Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys (1-15) are allocated by an allocator, which prepares us for key
contention from day one.  Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

   1) Fast switching of permissions
      1a) Prevents access without page table manipulations
      1b) No TLB flushes required
   2) Works on a per thread basis

PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.


[1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.we...@intel.com/
[2] https://lore.kernel.org/lkml/20201009195033.3208459-2-ira.we...@intel.com/
[3] https://lore.kernel.org/lkml/20201009194258.3207172-1-ira.we...@intel.com/


Fenghua Yu (2):
  x86/pks: Enable Protection Keys Supervisor (PKS)
  x86/pks: Add PKS kernel API

Ira Weiny (7):
  x86/pkeys: Create pkeys_common.h
  x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
  x86/pks: Preserve the PKRS MSR on context switch
  x86/entry: Pass irqentry_state_t by reference
  x86/entry: Preserve PKRS MSR across exceptions
  x86/fault: Report the PKRS state on fault
  x86/pks: Add PKS test code

Thomas Gleixner (1):
  x86/entry: Move nmi entry/exit into common code

 Documentation/core-api/protection-keys.rst  | 102 ++-
 arch/x86/Kconfig|   1 +
 arch/x86/entry/common.c |  65 +-
 arch/x86/include/asm/cpufeatures.h  |   1 +
 arch/x86/include/asm

Re: [PATCH] mm/mremap_pages: Fix static key devmap_managed_key updates

2020-10-22 Thread Ira Weiny
On Thu, Oct 22, 2020 at 11:19:43AM -0700, Ralph Campbell wrote:
> 
> On 10/22/20 8:41 AM, Ira Weiny wrote:
> > On Thu, Oct 22, 2020 at 11:37:53AM +0530, Aneesh Kumar K.V wrote:
> > > commit 6f42193fd86e ("memremap: don't use a separate devm action for
> > > devmap_managed_enable_get") changed the static key updates such that we
> > > now call devmap_managed_enable_put() without doing the equivalent
> > > devmap_managed_enable_get().
> > > 
> > > devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
> > > MEMORY_DEVICE_FS_DAX, But memunmap_pages() get called for other pgmap
> > > types too. This results in the below warning when switching between
> > > system-ram and devdax mode for devdax namespace.
> > > 
> > >   jump label: negative count!
> > >   WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 
> > > static_key_slow_try_dec+0x88/0xa0
> > >   Modules linked in:
> > >   
> > > 
> > >   NIP [c0433318] static_key_slow_try_dec+0x88/0xa0
> > >   LR [c0433314] static_key_slow_try_dec+0x84/0xa0
> > >   Call Trace:
> > >   [c00025c1f660] [c0433314] static_key_slow_try_dec+0x84/0xa0 
> > > (unreliable)
> > >   [c00025c1f6d0] [c0433664] 
> > > __static_key_slow_dec_cpuslocked+0x34/0xd0
> > >   [c00025c1f700] [c04337a4] static_key_slow_dec+0x54/0xf0
> > >   [c00025c1f770] [c059c49c] memunmap_pages+0x36c/0x500
> > >   [c00025c1f820] [c0d91d10] devm_action_release+0x30/0x50
> > >   [c00025c1f840] [c0d92e34] release_nodes+0x2f4/0x3e0
> > >   [c00025c1f8f0] [c0d8b15c] 
> > > device_release_driver_internal+0x17c/0x280
> > >   [c00025c1f930] [c0d883a4] bus_remove_device+0x124/0x210
> > >   [c00025c1f9b0] [c0d80ef4] device_del+0x1d4/0x530
> > >   [c00025c1fa70] [c0e341e8] unregister_dev_dax+0x48/0xe0
> > >   [c00025c1fae0] [c0d91d10] devm_action_release+0x30/0x50
> > >   [c00025c1fb00] [c0d92e34] release_nodes+0x2f4/0x3e0
> > >   [c00025c1fbb0] [c0d8b15c] 
> > > device_release_driver_internal+0x17c/0x280
> > >   [c00025c1fbf0] [c0d87000] unbind_store+0x130/0x170
> > >   [c00025c1fc30] [c0d862a0] drv_attr_store+0x40/0x60
> > >   [c00025c1fc50] [c06d316c] sysfs_kf_write+0x6c/0xb0
> > >   [c00025c1fc90] [c06d2328] kernfs_fop_write+0x118/0x280
> > >   [c00025c1fce0] [c000005a79f8] vfs_write+0xe8/0x2a0
> > >   [c00025c1fd30] [c05a7d94] ksys_write+0x84/0x140
> > >   [c00025c1fd80] [c003a430] system_call_exception+0x120/0x270
> > >   [c00025c1fe20] [c000c540] system_call_common+0xf0/0x27c
> > > 
> > > Cc: Christoph Hellwig 
> > > Cc: Dan Williams 
> > > Cc: Sachin Sant 
> > > Cc: linux-nvdimm@lists.01.org
> > > Cc: Ira Weiny 
> > > Cc: Jason Gunthorpe 
> > > Signed-off-by: Aneesh Kumar K.V 
> > > ---
> > >   mm/memremap.c | 19 +++
> > >   1 file changed, 15 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 73a206d0f645..d4402ff3e467 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -158,6 +158,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
> > >   {
> > >   unsigned long pfn;
> > >   int i;
> > > + bool need_devmap_managed = false;
> > > +
> > > + switch (pgmap->type) {
> > > + case MEMORY_DEVICE_PRIVATE:
> > > + case MEMORY_DEVICE_FS_DAX:
> > > + need_devmap_managed = true;
> > > + break;
> > > + default:
> > > + break;
> > > + }
> > 
> > Is it overkill to avoid duplicating this switch logic in
> > page_is_devmap_managed() by creating another call which can be used here?
> 
> Perhaps. I can imagine a helper defined in include/linux/mm.h which
> page_is_devmap_managed() could also call but that would impact a lot of
> places that include mm.h. Since memremap.c already has to have intimate
> knowledge of the pgmap->type, I think limiting the change to just what
> is needed is better for now. So the patch looks OK to me.
> 
> Looking at this some more, I would suggest changing 
> devmap_managed_enable_get()
> and devmap_managed_enable_put() to do the special case checking inst

Re: [PATCH] mm/mremap_pages: Fix static key devmap_managed_key updates

2020-10-22 Thread Ira Weiny
On Thu, Oct 22, 2020 at 11:37:53AM +0530, Aneesh Kumar K.V wrote:
> commit 6f42193fd86e ("memremap: don't use a separate devm action for
> devmap_managed_enable_get") changed the static key updates such that we
> now call devmap_managed_enable_put() without doing the equivalent
> devmap_managed_enable_get().
> 
> devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
> MEMORY_DEVICE_FS_DAX, But memunmap_pages() get called for other pgmap
> types too. This results in the below warning when switching between
> system-ram and devdax mode for devdax namespace.
> 
>  jump label: negative count!
>  WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 
> static_key_slow_try_dec+0x88/0xa0
>  Modules linked in:
>  
> 
>  NIP [c0433318] static_key_slow_try_dec+0x88/0xa0
>  LR [c0433314] static_key_slow_try_dec+0x84/0xa0
>  Call Trace:
>  [c00025c1f660] [c0433314] static_key_slow_try_dec+0x84/0xa0 
> (unreliable)
>  [c00025c1f6d0] [c0433664] 
> __static_key_slow_dec_cpuslocked+0x34/0xd0
>  [c00025c1f700] [c04337a4] static_key_slow_dec+0x54/0xf0
>  [c00025c1f770] [c059c49c] memunmap_pages+0x36c/0x500
>  [c00025c1f820] [c0d91d10] devm_action_release+0x30/0x50
>  [c00025c1f840] [c0d92e34] release_nodes+0x2f4/0x3e0
>  [c00025c1f8f0] [c0d8b15c] 
> device_release_driver_internal+0x17c/0x280
>  [c00025c1f930] [c0d883a4] bus_remove_device+0x124/0x210
>  [c00025c1f9b0] [c0d80ef4] device_del+0x1d4/0x530
>  [c00025c1fa70] [c0e341e8] unregister_dev_dax+0x48/0xe0
>  [c00025c1fae0] [c0d91d10] devm_action_release+0x30/0x50
>  [c00025c1fb00] [c0d92e34] release_nodes+0x2f4/0x3e0
>  [c00025c1fbb0] [c0d8b15c] 
> device_release_driver_internal+0x17c/0x280
>  [c00025c1fbf0] [c0d87000] unbind_store+0x130/0x170
>  [c00025c1fc30] [c0d862a0] drv_attr_store+0x40/0x60
>  [c00025c1fc50] [c06d316c] sysfs_kf_write+0x6c/0xb0
>  [c00025c1fc90] [c06d2328] kernfs_fop_write+0x118/0x280
>  [c00025c1fce0] [c05a79f8] vfs_write+0xe8/0x2a0
>  [c00025c1fd30] [c05a7d94] ksys_write+0x84/0x140
>  [c00025c1fd80] [c003a430] system_call_exception+0x120/0x270
>  [c00025c1fe20] [c000c540] system_call_common+0xf0/0x27c
> 
> Cc: Christoph Hellwig 
> Cc: Dan Williams 
> Cc: Sachin Sant 
> Cc: linux-nvdimm@lists.01.org
> Cc: Ira Weiny 
> Cc: Jason Gunthorpe 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/memremap.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 73a206d0f645..d4402ff3e467 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -158,6 +158,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>   unsigned long pfn;
>   int i;
> + bool need_devmap_managed = false;
> +
> + switch (pgmap->type) {
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_FS_DAX:
> + need_devmap_managed = true;
> + break;
> + default:
> + break;
> + }

Is it overkill to avoid duplicating this switch logic in
page_is_devmap_managed() by creating another call which can be used here?
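
Something like the below is what I have in mind (untested sketch; the helper
name pgmap_needs_devmap_managed() is made up here):

	static bool pgmap_needs_devmap_managed(struct dev_pagemap *pgmap)
	{
		switch (pgmap->type) {
		case MEMORY_DEVICE_PRIVATE:
		case MEMORY_DEVICE_FS_DAX:
			return true;
		default:
			return false;
		}
	}

Both memremap_pages() and memunmap_pages() could then call it instead of open
coding the switch.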

>  
>   dev_pagemap_kill(pgmap);
>   for (i = 0; i < pgmap->nr_range; i++)
> @@ -169,7 +179,8 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>   pageunmap_range(pgmap, i);
>  
>   WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
> - devmap_managed_enable_put();
> + if (need_devmap_managed)
> + devmap_managed_enable_put();
>  }
>  EXPORT_SYMBOL_GPL(memunmap_pages);
>  
> @@ -307,7 +318,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>   .pgprot = PAGE_KERNEL,
>   };
>   const int nr_range = pgmap->nr_range;
> - bool need_devmap_managed = true;
> + bool need_devmap_managed = false;

I'm CC'ing Ralph Campbell because I think some of his work has proposed this
same change.

Ira

>   int error, i;
>  
>   if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
> @@ -327,6 +338,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>   WARN(1, "Missing owner\n");
>   return ERR_PTR(-EINVAL);
>   }
> + need_devmap_managed = true;
>   break;
>   case MEMORY_DEVICE_FS_DAX:
>   if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
> @@ -334,13 +346,12 @@ void *memremap_pages(struct de

Re: [ndctl PATCH] Rework license identification

2020-10-21 Thread Ira Weiny
On Tue, Oct 20, 2020 at 10:53:20AM -0700, Dan Williams wrote:
> Convert to the LICENSES/ directory format for COPYING from the Linux
> kernel, and switch all remaining files over to SPDX annotations.
> 
> Reported-by: Christoph Hellwig 
> Signed-off-by: Dan Williams 
> ---

[snip]

> diff --git a/ndctl/lib/hpe1.h b/ndctl/lib/hpe1.h
> index 1afa54f127a6..acf82af7bb87 100644
> --- a/ndctl/lib/hpe1.h
> +++ b/ndctl/lib/hpe1.h
> @@ -1,16 +1,5 @@
> -/*
> - * Copyright (C) 2016 Hewlett Packard Enterprise Development LP
> - * Copyright (c) 2014-2015, Intel Corporation.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms and conditions of the GNU Lesser General Public License,
> - * version 2.1, as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope it will be useful, but WITHOUT ANY
> - * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> - * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
> - * more details.
> - */
> +// Copyright (C) 2016 Hewlett Packard Enterprise Development LP
> +// SPDX-License-Identifier: LGPL-2.1

Why drop the Intel copyright here but not elsewhere?

>  #ifndef __NDCTL_HPE1_H__
>  #define __NDCTL_HPE1_H__
>  
> diff --git a/ndctl/lib/inject.c b/ndctl/lib/inject.c
> index 815f254308c6..00ef0a4d1d28 100644
> --- a/ndctl/lib/inject.c
> +++ b/ndctl/lib/inject.c
> @@ -1,15 +1,5 @@
> -/*
> - * Copyright (c) 2014-2017, Intel Corporation.
> - *
> - * This program is free software; you can redistribute it and/or modify it
> - * under the terms and conditions of the GNU Lesser General Public License,
> - * version 2.1, as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope it will be useful, but WITHOUT ANY
> - * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
> - * FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for
> - * more details.
> - */
> +// Copyright (c) 2014-2017, Intel Corporation. All rights reserved.
> +// SPDX-License-Identifier: LGPL-2.1

And I'm not sure why some of the copyrights are extended to 2020 while others
are not.  I would think they would all be?  But this is more of a curious
questions.

Ira


Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference

2020-10-20 Thread Ira Weiny
On Mon, Oct 19, 2020 at 11:12:44PM +0200, Thomas Gleixner wrote:
> On Mon, Oct 19 2020 at 13:26, Ira Weiny wrote:
> > On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> > Sorry, let me clarify.  After this patch we have.
> >
> > typedef union irqentry_state {
> > boolexit_rcu;
> > boollockdep;
> > } irqentry_state_t;
> >
> > Which reflects the mutual exclusion of the 2 variables.
> 
> Huch? From the patch I gave you:
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>   boolexit_rcu;
> +   boollockdep;
>  } irqentry_state_t;
>  #endif
> 
> How is that a union?

I was proposing to make it a union.

> 
> > But then when the pkrs stuff is added the union changes back to a structure 
> > and
> > looks like this.
> 
> So you want:
> 
>   1) Move stuff to struct irqentry_state (my patch)
> 
>   2) Change it to a union and pass it as pointer at the same time

No, I would have made it a union in your patch.

Pass by reference would remain largely the same.

> 
>   3) Change it back to struct to add PKRS

Yes.  :-/

> 
> > Is that clear?
> 
> What's clear is that the above is nonsense. We can just do
> 
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>   union {
>   boolexit_rcu;
> boollockdep;
> };
>  } irqentry_state_t;
>  #endif
> 
> right in the patch which I gave you. Because that actually makes sense.

Ok, I'm very sorry.  I was thinking that having a struct containing nothing but
an anonymous union would be unacceptable as a stand-alone item in your patch.
In my experience other maintainers would have rejected such a change and
would have asked, 'why not just make it a union'?

I'm very happy skipping the gymnastics on individual patches in favor of making
the whole series work out in the end.

Thank you for your help again.  :-)

Ira


Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference

2020-10-19 Thread Ira Weiny
On Mon, Oct 19, 2020 at 11:32:50AM +0200, Thomas Gleixner wrote:
> On Sun, Oct 18 2020 at 22:37, Ira Weiny wrote:
> > On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> >> Subject: x86/entry: Move nmi entry/exit into common code
> >> From: Thomas Gleixner 
> >> Date: Fri, 11 Sep 2020 10:09:56 +0200
> >> 
> >> Add blurb here.
> >
> > How about:
> >
> > To prepare for saving PKRS values across NMI's we lift the
> > idtentry_[enter|exit]_nmi() to the common code.  Rename them to
> > irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
> > state in the same irqentry_state_t structure as the other irqentry_*()
> > functions.  Finally, differentiate the state being stored between the NMI 
> > and
> > IRQ path by adding 'lockdep' to irqentry_state_t.
> 
> No. This has absolutely nothing to do with PKRS. It's a cleanup valuable
> by itself and that's how it should have been done right away.
> 
> So the proper changelog is:
> 
>   Lockdep state handling on NMI enter and exit is nothing specific to
>   X86. It's not any different on other architectures. Also the extra
>   state type is not necessary, irqentry_state_t can carry the necessary
>   information as well.
> 
>   Move it to common code and extend irqentry_state_t to carry lockdep
>   state.

Ok sounds good, thanks.

> 
> >> --- a/include/linux/entry-common.h
> >> +++ b/include/linux/entry-common.h
> >> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
> >>  #ifndef irqentry_state
> >>  typedef struct irqentry_state {
> >>boolexit_rcu;
> >> +  boollockdep;
> >>  } irqentry_state_t;
> >
> > Building on what Peter said do you agree this should be made into a union?
> >
> > It may not be strictly necessary in this patch but I think it would reflect 
> > the
> > mutual exclusivity better and could be changed easy enough in the follow on
> > patch which adds the pkrs state.
> 
> Why the heck should it be changed in a patch which adds something
> completely different?

Because the PKRS stuff is used in both NMI and IRQ state.

> 
> Either it's mutually exclusive or not and if so it want's to be done in
> this patch and not in a change which extends the struct for other
> reasons.

Sorry, let me clarify.  After this patch we have.

typedef union irqentry_state {
boolexit_rcu;
boollockdep;
} irqentry_state_t;

Which reflects the mutual exclusion of the 2 variables.

But then when the pkrs stuff is added the union changes back to a structure and
looks like this.

typedef struct irqentry_state {
#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
u32 pkrs;
u32 thread_pkrs;
#endif  
union {
boolexit_rcu;
boollockdep;
};
} irqentry_state_t;

Because the pkrs information is in addition to exit_rcu OR lockdep.

So this is what I meant by 'could be changed easy enough in the follow on
patch'.

Is that clear?

Ira


Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch

2020-10-19 Thread Ira Weiny
On Mon, Oct 19, 2020 at 11:37:14AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 16, 2020 at 10:14:10PM -0700, Ira Weiny wrote:
> > > so it either needs to
> > > explicitly do so, or have an assertion that preemption is indeed
> > > disabled.
> > 
> > However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
> > handle the preempt_disable() for us? 
> 
> It does.
> 
> > Is it not sufficient to rely on that?
> 
> It is.
> 
> > Dave's comment seems to be the opposite where we need to eliminate preempt
> > disable before calling write_pkrs().
> > 
> > FWIW I think I'm mistaken in my response to Dave regarding the
> > preempt_disable() in pks_update_protection().
> 
> Dave's concern is that we're calling with with preemption already
> disabled so disabling it again is superfluous.

Ok, thanks, and after getting my head straight I think I agree with him, and
you.

Thanks, I've reworked the code to remove the superfluous calls.  Sorry about
being so dense...  :-D

Ira


Re: [PATCH RFC V3 6/9] x86/entry: Pass irqentry_state_t by reference

2020-10-18 Thread Ira Weiny
On Fri, Oct 16, 2020 at 02:55:21PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 16 2020 at 13:45, Peter Zijlstra wrote:
> > On Fri, Oct 09, 2020 at 12:42:55PM -0700, ira.we...@intel.com wrote:
> >> @@ -238,7 +236,7 @@ noinstr void idtentry_exit_nmi(struct pt_regs *regs, 
> >> bool restore)
> >>  
> >>rcu_nmi_exit();
> >>lockdep_hardirq_exit();
> >> -  if (restore)
> >> +  if (irq_state->exit_rcu)
> >>lockdep_hardirqs_on(CALLER_ADDR0);
> >>__nmi_exit();
> >>  }
> >
> > That's not nice.. The NMI path is different from the IRQ path and has a
> > different variable. Yes, this works, but *groan*.
> >
> > Maybe union them if you want to avoid bloating the structure, but the
> > above makes it really hard to read.
> 
> Right, and also that nmi entry thing should not be in x86. Something
> like the untested below as first cleanup.

Ok, I see what Peter was talking about.  I've added this patch to the series.

> 
> Thanks,
> 
> tglx
> 
> Subject: x86/entry: Move nmi entry/exit into common code
> From: Thomas Gleixner 
> Date: Fri, 11 Sep 2020 10:09:56 +0200
> 
> Add blurb here.

How about:

To prepare for saving PKRS values across NMI's we lift the
idtentry_[enter|exit]_nmi() to the common code.  Rename them to
irqentry_nmi_[enter|exit]() to reflect the new generic nature and store the
state in the same irqentry_state_t structure as the other irqentry_*()
functions.  Finally, differentiate the state being stored between the NMI and
IRQ path by adding 'lockdep' to irqentry_state_t.

?

> 
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/entry/common.c |   34 --
>  arch/x86/include/asm/idtentry.h |3 ---
>  arch/x86/kernel/cpu/mce/core.c  |6 +++---
>  arch/x86/kernel/nmi.c   |6 +++---
>  arch/x86/kernel/traps.c |   13 +++--
>  include/linux/entry-common.h|   20 
>  kernel/entry/common.c   |   36 
>  7 files changed, 69 insertions(+), 49 deletions(-)
> 

[snip]

> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -343,6 +343,7 @@ void irqentry_exit_to_user_mode(struct p
>  #ifndef irqentry_state
>  typedef struct irqentry_state {
>   boolexit_rcu;
> + boollockdep;
>  } irqentry_state_t;

Building on what Peter said do you agree this should be made into a union?

It may not be strictly necessary in this patch but I think it would reflect the
mutual exclusivity better and could be changed easy enough in the follow on
patch which adds the pkrs state.

Ira


Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API

2020-10-16 Thread Ira Weiny
On Fri, Oct 16, 2020 at 01:07:47PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:54PM -0700, ira.we...@intel.com wrote:
> > +static inline void pks_update_protection(int pkey, unsigned long 
> > protection)
> > +{
> > +   current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> > +pkey, protection);
> > +   preempt_disable();
> > +   write_pkrs(current->thread.saved_pkrs);
> > +   preempt_enable();
> > +}
> 
> write_pkrs() already disables preemption itself. Wrapping it in yet
> another layer is useless.

I was thinking the update to saved_pkrs needed this protection as well and that
was to be included in the preemption disable.  But that too is incorrect.

I've removed this preemption disable.
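
With that, the function reduces to something like the untested sketch below;
write_pkrs() already holds off preemption internally via
get_cpu_ptr()/put_cpu_ptr(), so no outer disable is needed:

	static inline void pks_update_protection(int pkey, unsigned long protection)
	{
		current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
							     pkey, protection);
		write_pkrs(current->thread.saved_pkrs);
	}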

Thanks,
Ira


Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch

2020-10-16 Thread Ira Weiny
On Fri, Oct 16, 2020 at 01:06:36PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:53PM -0700, ira.we...@intel.com wrote:
> 
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, 
> > struct task_struct *next_p)
> >  
> > if ((tifp ^ tifn) & _TIF_SLD)
> > switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> >  
> 
> You seem to have lost the comment proposed here:
> 
>   
> https://lkml.kernel.org/r/20200717083140.gw10...@hirez.programming.kicks-ass.net
> 
> It is useful and important information that the wrmsr normally doesn't
> happen.

Added back in here.

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int 
> > flags)
> >  
> > return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * WRPKRU will never execute transiently. Memory accesses
> > + * affected by PKRU register will not execute (even transiently)
> > + * until all prior executions of WRPKRU have completed execution
> > + * and updated the PKRU register.
> 
> (whitespace damage; space followed by tabstop)

Fixed thanks.

> 
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +   u32 *pkrs;
> > +
> > +   if (!static_cpu_has(X86_FEATURE_PKS))
> > +   return;
> > +
> > +   pkrs = get_cpu_ptr(_cache);
> > +   if (*pkrs != new_pkrs) {
> > +   *pkrs = new_pkrs;
> > +   wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +   }
> > +   put_cpu_ptr(pkrs);
> > +}
> 
> looks familiar that... :-)

Added you as a co-developer if that is ok?

Ira


Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch

2020-10-16 Thread Ira Weiny
On Fri, Oct 16, 2020 at 01:12:26PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> > > +/**
> > > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is 
> > > not
> > > + * serializing but still maintains ordering properties similar to WRPKRU.
> > > + * The current SDM section on PKRS needs updating but should be the same 
> > > as
> > > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > > + *
> > > + *   WRPKRU will never execute transiently. Memory accesses
> > > + *   affected by PKRU register will not execute (even transiently)
> > > + *   until all prior executions of WRPKRU have completed execution
> > > + *   and updated the PKRU register.
> > > + */
> > > +void write_pkrs(u32 new_pkrs)
> > > +{
> > > + u32 *pkrs;
> > > +
> > > + if (!static_cpu_has(X86_FEATURE_PKS))
> > > + return;
> > > +
> > > + pkrs = get_cpu_ptr(_cache);
> > > + if (*pkrs != new_pkrs) {
> > > + *pkrs = new_pkrs;
> > > + wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > > + }
> > > + put_cpu_ptr(pkrs);
> > > +}
> > > 
> > 
> > It bugs me a *bit* that this is being called in a preempt-disabled
> > region, but we still bother with the get/put_cpu jazz.  Are there other
> > future call-sites for this that aren't in preempt-disabled regions?
> 
> So the previous version had a useful comment that got lost.

Ok Looking back I see what happened...  This comment...

 /*
  * PKRS is only temporarily changed during specific code paths.
  * Only a preemption during these windows away from the default
  * value would require updating the MSR.
  */

... was added to pks_sched_in() but that got simplified down because cleaning
up write_pkrs() made the code there obsolete.

> This stuff
> needs to fundamentally be preempt disabled,

I agree, the update to the percpu cache value and MSR can not be torn.

> so it either needs to
> explicitly do so, or have an assertion that preemption is indeed
> disabled.

However, I don't think I understand clearly.  Doesn't [get|put]_cpu_ptr()
handle the preempt_disable() for us?  Is it not sufficient to rely on that?

Dave's comment seems to be the opposite where we need to eliminate preempt
disable before calling write_pkrs().

FWIW I think I'm mistaken in my response to Dave regarding the
preempt_disable() in pks_update_protection().

Ira


Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-10-16 Thread Ira Weiny
On Fri, Oct 16, 2020 at 12:57:43PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 12:42:51PM -0700, ira.we...@intel.com wrote:
> > From: Fenghua Yu 
> > 
> > Define a helper, update_pkey_val(), which will be used to support both
> > Protection Key User (PKU) and the new Protection Key for Supervisor
> > (PKS) in subsequent patches.
> > 
> > Co-developed-by: Ira Weiny 
> > Signed-off-by: Ira Weiny 
> > Signed-off-by: Fenghua Yu 
> > ---
> >  arch/x86/include/asm/pkeys.h |  2 ++
> >  arch/x86/kernel/fpu/xstate.c | 22 --
> >  arch/x86/mm/pkeys.c  | 21 +
> >  3 files changed, 27 insertions(+), 18 deletions(-)
> 
> This is not from Fenghua.
> 
>   
> https://lkml.kernel.org/r/20200717085442.gx10...@hirez.programming.kicks-ass.net
> 
> This is your patch based on the code I wrote.

Ok, I apologize.  Yes the code below was all yours.

Is it ok to add?

Co-developed-by: Peter Zijlstra 
Signed-off-by: Peter Zijlstra 

?

Thanks,
Ira

> 
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index f5efb4007e74..3cf8f775f36d 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -208,3 +208,24 @@ static __init int setup_init_pkru(char *opt)
> > return 1;
> >  }
> >  __setup("init_pkru=", setup_init_pkru);
> > +
> > +/*
> > + * Update the pk_reg value and return it.
> > + *
> > + * Kernel users use the same flags as user space:
> > + * PKEY_DISABLE_ACCESS
> > + * PKEY_DISABLE_WRITE
> > + */
> > +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> > +{
> > +   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > +
> > +   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +
> > +   if (flags & PKEY_DISABLE_ACCESS)
> > +   pk_reg |= PKR_AD_BIT << pkey_shift;
> > +   if (flags & PKEY_DISABLE_WRITE)
> > +   pk_reg |= PKR_WD_BIT << pkey_shift;
> > +
> > +   return pk_reg;
> > +}
> > -- 
> > 2.28.0.rc0.12.gb6a658bd00c9
> > 


Re: [PATCH RFC V3 9/9] x86/pks: Add PKS test code

2020-10-14 Thread Ira Weiny
On Tue, Oct 13, 2020 at 12:02:07PM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> >  #ifdef CONFIG_X86_32
> > /*
> >  * We can fault-in kernel-space virtual memory on-demand. The
> > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> > index cc3510cde64e..f9552bd9341f 100644
> > --- a/include/linux/pkeys.h
> > +++ b/include/linux/pkeys.h
> > @@ -47,7 +47,6 @@ static inline bool arch_pkeys_enabled(void)
> >  static inline void copy_init_pkru_to_fpregs(void)
> >  {
> >  }
> > -
> >  #endif /* ! CONFIG_ARCH_HAS_PKEYS */
> 
> ^ Whitespace damage

Done.

> 
> >  #ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > index 0c781f912f9f..f015c09ba5a1 100644
> > --- a/lib/Kconfig.debug
> > +++ b/lib/Kconfig.debug
> > @@ -2400,6 +2400,18 @@ config HYPERV_TESTING
> > help
> >   Select this option to enable Hyper-V vmbus testing.
> >  
> > +config PKS_TESTING
> > +   bool "PKey(S)upervisor testing"
> 
> Seems like we need a space in there somewhere.

heheh...  yea...

> 
> > +   pid = fork();
> > +   if (pid == 0) {
> > +   fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
> > +   if (fd < 0) {
> > +   printf("cannot open file\n");
> > +   return -1;
> > +   }
> > +
> 
> Will this return code make anybody mad?  Should we have a nicer return
> code for when this is running on non-PKS hardware?

I'm not sure it will matter much but I think it is better to report the missing
file.[1]

> 
> I'm not going to be too picky about this.  I'll just ask one question:
> Has this found real bugs for you?

Many, especially regressions as things have changed.

> 
> Reviewed-by: Dave Hansen 
> 

Thanks,
Ira

[1]

diff --git a/tools/testing/selftests/x86/test_pks.c 
b/tools/testing/selftests/x86/test_pks.c
index 8037a2a9ff5f..11be4e212d54 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
 int main(void)
 {
cpu_set_t cpuset;
@@ -25,9 +27,9 @@ int main(void)
 
pid = fork();
if (pid == 0) {
-   fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+   fd = open(PKS_TEST_FILE, O_RDWR);
if (fd < 0) {
-   printf("cannot open file\n");
+   printf("cannot open %s\n", PKS_TEST_FILE);
return -1;
}
 
@@ -45,9 +47,9 @@ int main(void)
} else {
sleep(2);
 
-   fd = open("/sys/kernel/debug/x86/run_pks", O_RDWR);
+   fd = open(PKS_TEST_FILE, O_RDWR);
if (fd < 0) {
-   printf("cannot open file\n");
+   printf("cannot open %s\n", PKS_TEST_FILE);
return -1;
}
 


Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions

2020-10-14 Thread Ira Weiny
On Wed, Oct 14, 2020 at 09:06:44PM -0700, Dave Hansen wrote:
> On 10/14/20 8:46 PM, Ira Weiny wrote:
> > On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> >> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> >>> @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, 
> >>> irqentry_state_t *state)
> >>>   /* Use the combo lockdep/tracing function */
> >>>   trace_hardirqs_off();
> >>>   instrumentation_end();
> >>> +
> >>> +done:
> >>> + irq_save_pkrs(state);
> >>>  }
> >> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> >> here that PKRS is altered at this site.  Seems like there could be a
> >> better name.
> >>
> >> Even if we did:
> >>
> >>irq_save_set_pkrs(state, INIT_VAL);
> >>
> >> It would probably compile down to the same thing, but be *really*
> >> obvious what's going on.
> > I suppose that is true.  But I think it is odd having a parameter which is 
> > the
> > same for every call site.
> 
> Well, it depends on what you optimize for.  I'm trying to optimize for
> the code being understood quickly the first time someone reads it.  To
> me, that's more important than minimizing the number of function
> parameters (which are essentially free).
>

Agreed.  Sorry, I was not trying to be confrontational.  There are just enough
other things which are going to take me time to get right that I need to focus
on them...  :-D

Sorry,
Ira


Re: [PATCH RFC V3 8/9] x86/fault: Report the PKRS state on fault

2020-10-14 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:56:53AM -0700, Dave Hansen wrote:
> > @@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
> > error_code, unsigned long ad
> >  (error_code & X86_PF_PK)? "protection keys violation" :
> >"permissions violation");
> >  
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +   if (irq_state && (error_code & X86_PF_PK))
> > +   pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
> > +#endif
> 
> This means everyone will see 'PKRS: 0x0', even if they're on non-PKS
> hardware.  I think I'd rather have this only show PKRS when we're on
> cpu_feature_enabled(PKS) hardware.

Good catch, thanks.
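
I'll gate the print on the feature, something like this sketch:

	if (cpu_feature_enabled(X86_FEATURE_PKS) &&
	    irq_state && (error_code & X86_PF_PK))
		pr_alert("PKRS: 0x%x\n", irq_state->pkrs);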

> 
> ...
> > @@ -1148,14 +1156,15 @@ static int fault_in_kernel_space(unsigned long 
> > address)
> >   */
> >  static void
> >  do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> > -  unsigned long address)
> > +  unsigned long address, irqentry_state_t *irq_state)
> >  {
> > /*
> > -* Protection keys exceptions only happen on user pages.  We
> > -* have no user pages in the kernel portion of the address
> > -* space, so do not expect them here.
> > +* If protection keys are not enabled for kernel space
> > +* do not expect Pkey errors here.
> >  */
> 
> Let's fix the double-negative:
> 
>   /*
>* PF_PK is only expected on kernel addresses whenn
>* supervisor pkeys are enabled:
>*/

done. thanks.

> 
> > -   WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> > +   if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS) ||
> > +   !cpu_feature_enabled(X86_FEATURE_PKS))
> > +   WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> 
> Yeah, please stick X86_FEATURE_PKS in disabled-features so you can use
> cpu_feature_enabled(X86_FEATURE_PKS) by itself here..

done.

thanks,
Ira


Re: [PATCH RFC V3 7/9] x86/entry: Preserve PKRS MSR across exceptions

2020-10-14 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:52:32AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > @@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, 
> > irqentry_state_t *state)
> > /* Use the combo lockdep/tracing function */
> > trace_hardirqs_off();
> > instrumentation_end();
> > +
> > +done:
> > +   irq_save_pkrs(state);
> >  }
> 
> One nit: This saves *and* sets PKRS.  It's not obvious from the call
> here that PKRS is altered at this site.  Seems like there could be a
> better name.
> 
> Even if we did:
> 
>   irq_save_set_pkrs(state, INIT_VAL);
> 
> It would probably compile down to the same thing, but be *really*
> obvious what's going on.

I suppose that is true.  But I think it is odd having a parameter which is the
same for every call site.

But I'm not going to quibble over something like this.

Changed,
Ira

> 
> >  void irqentry_exit_cond_resched(void)
> > @@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, 
> > irqentry_state_t *state)
> > /* Check whether this returns to user mode */
> > if (user_mode(regs)) {
> > irqentry_exit_to_user_mode(regs);
> > -   } else if (!regs_irqs_disabled(regs)) {
> > +   return;
> > +   }
> > +
> > +   irq_restore_pkrs(state);
> > +
> > +   if (!regs_irqs_disabled(regs)) {
> > /*
> >  * If RCU was not watching on entry this needs to be done
> >  * carefully and needs the same ordering of lockdep/tracing
> > 
> 


Re: [PATCH RFC V3 5/9] x86/pks: Add PKS kernel API

2020-10-14 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:43:57AM -0700, Dave Hansen wrote:
> > +static inline void pks_update_protection(int pkey, unsigned long 
> > protection)
> > +{
> > +   current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs,
> > +pkey, protection);
> > +   preempt_disable();
> > +   write_pkrs(current->thread.saved_pkrs);
> > +   preempt_enable();
> > +}
> 
> Why does this need preempt count manipulation in addition to the
> get/put_cpu_var() inside of write_pkrs()?

This is a bug.  The disable should be around the update_pkey_val().

> 
> > +/**
> > + * PKS access control functions
> > + *
> > + * Change the access of the domain specified by the pkey.  These are global
> > + * updates.  They only affects the current running thread.  It is 
> > undefined and
> > + * a bug for users to call this without having allocated a pkey and using 
> > it as
> > + * pkey here.
> > + *
> > + * pks_mknoaccess()
> > + * Disable all access to the domain
> > + * pks_mkread()
> > + * Make the domain Read only
> > + * pks_mkrdwr()
> > + * Make the domain Read/Write
> > + *
> > + * @pkey the pkey for which the access should change.
> > + *
> > + */
> > +void pks_mknoaccess(int pkey)
> > +{
> > +   pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
> > +}
> > +EXPORT_SYMBOL_GPL(pks_mknoaccess);
> 
> These are named like PTE manipulation functions, which is kinda weird.
> 
> What's wrong with: pks_disable_access(pkey) ?

Internal review suggested these names.  I'm not dead set on them.

FWIW I would rather they not get too wordy.

I was trying to get some consistency with pks_mk*() as meaning PKS 'make' X.

To me 'disable' implies a state transition whereas 'make' implies we are
'setting' an absolute value.  I think the latter is a better name.  And 'make'
made more sense because 'set' is so overloaded IMHO.

> 
> > +void pks_mkread(int pkey)
> > +{
> > +   pks_update_protection(pkey, PKEY_DISABLE_WRITE);
> > +}
> > +EXPORT_SYMBOL_GPL(pks_mkread);
> 
> I really don't like this name.  It doesn't make readable, or even
> read-only, *especially* if it was already access-disabled.

Ok.

But it does make sense if going from access-disabled to read, correct?  I could
see this being better named pks_mkreadonly() so that going from RW to this
would make more sense.  Especially after thinking about it, 'read only' needs
to be in the name.

Before I change anything I'd like to get consensus on naming.

How about the following?

pks_mk_noaccess()
pks_mk_readonly()
pks_mk_readwrite()

?

> 
> > +static const char pks_key_user0[] = "kernel";
> > +
> > +/* Store names of allocated keys for debug.  Key 0 is reserved for the 
> > kernel.  */
> > +static const char *pks_key_users[PKS_NUM_KEYS] = {
> > +   pks_key_user0
> > +};
> > +
> > +/*
> > + * Each key is represented by a bit.  Bit 0 is set for key 0 and reserved 
> > for
> > + * its use.  We use ulong for the bit operations but only 16 bits are used.
> > + */
> > +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY;
> > +
> > +/*
> > + * pks_key_alloc - Allocate a PKS key
> > + *
> > + * @pkey_user: String stored for debugging of key exhaustion.  The caller 
> > is
> > + * responsible to maintain this memory until pks_key_free().
> > + */
> > +int pks_key_alloc(const char * const pkey_user)
> > +{
> > +   int nr;
> > +
> > +   if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +   return -EINVAL;
> 
> I'm not sure I like -EINVAL for this.  I thought we returned -ENOSPC for
> this case for user pkeys.

-ENOTSUP?

I'm not really sure anyone will need to know the difference between the
platform not supporting the key vs running out of them.  But they are 2
different error conditions.

> 
> > +   while (1) {
> > +   nr = find_first_zero_bit(_key_allocation_map, PKS_NUM_KEYS);
> > +   if (nr >= PKS_NUM_KEYS) {
> > +   pr_info("Cannot allocate supervisor key for %s.\n",
> > +   pkey_user);
> > +   return -ENOSPC;

We return -ENOSPC here when running out of keys.

> > +   }
> > +   if (!test_and_set_bit_lock(nr, _key_allocation_map))
> > +   break;
> > +   }
> > +
> > +   /* for debugging key exhaustion */
> > +   pks_key_users[nr] = pkey_user;
> > +
> > +   return nr;
> > +}
> > +EXPORT_SYMBOL_GPL(pks_key_alloc);
> > +
> > +/*
> > + * pks_key_free - Free a previously allocate PKS key
> > + *
> > + * @pkey: Key to be free'ed
> > + */
> > +void pks_key_free(int pkey)
> > +{
> > +   if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +   return;
> > +
> > +   if (pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)
> > +   return;
> 
> This seems worthy of a WARN_ON_ONCE() at least.  It's essentially
> corrupt data coming into a kernel API.

Ok, Done,
Ira
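
For reference, the check with the suggested warning becomes (sketch):

	if (WARN_ON_ONCE(pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY))
		return;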

Re: [PATCH RFC V3 4/9] x86/pks: Preserve the PKRS MSR on context switch

2020-10-14 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:31:45AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The PKRS MSR is defined as a per-logical-processor register.  This
> > isolates memory access by logical CPU.  Unfortunately, the MSR is not
> > managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> > context switch.
> > 
> > Define a saved PKRS value in the task struct, as well as a cached
> > per-logical-processor MSR value which mirrors the MSR value of the
> > current CPU.  Initialize all tasks with the default MSR value.  Then, on
> > schedule in, check the saved task MSR vs the per-cpu value.  If
> > different proceed to write the MSR.  If not avoid the overhead of the
> > MSR write and continue.
> 
> It's probably nice to note how the WRMSR is special here, in addition to
> the comments below.

Sure,

> 
> >  #endif /*_ASM_X86_PKEYS_INTERNAL_H */
> > diff --git a/arch/x86/include/asm/processor.h 
> > b/arch/x86/include/asm/processor.h
> > index 97143d87994c..da2381136b2d 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -18,6 +18,7 @@ struct vm86;
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -542,6 +543,11 @@ struct thread_struct {
> >  
> > unsigned intsig_on_uaccess_err:1;
> >  
> > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
> > +   /* Saved Protection key register for supervisor mappings */
> > +   u32 saved_pkrs;
> > +#endif
> 
> Could you take a look around thread_struct and see if there are some
> other MSRs near which you can stash this?  This seems like a bit of a
> lonely place.

Are you more concerned with aesthetics or the in-memory struct layout?

How about I put it after error_code?

unsigned long   error_code; 
+   
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+   /* Saved Protection key register for supervisor mappings */ 
+   u32 saved_pkrs; 
+#endif 
+   

?

> 
> ...
> >  void flush_thread(void)
> >  {
> > struct task_struct *tsk = current;
> > @@ -195,6 +212,8 @@ void flush_thread(void)
> > memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
> >  
> > fpu__clear_all(>thread.fpu);
> > +
> > +   pks_init_task(tsk);
> >  }
> >  
> >  void disable_TSC(void)
> > @@ -644,6 +663,8 @@ void __switch_to_xtra(struct task_struct *prev_p, 
> > struct task_struct *next_p)
> >  
> > if ((tifp ^ tifn) & _TIF_SLD)
> > switch_to_sld(tifn);
> > +
> > +   pks_sched_in();
> >  }
> >  
> >  /*
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index 3cf8f775f36d..30f65dd3d0c5 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -229,3 +229,31 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int 
> > flags)
> >  
> > return pk_reg;
> >  }
> > +
> > +DEFINE_PER_CPU(u32, pkrs_cache);
> > +
> > +/**
> > + * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
> > + * serializing but still maintains ordering properties similar to WRPKRU.
> > + * The current SDM section on PKRS needs updating but should be the same as
> > + * that of WRPKRU.  So to quote from the WRPKRU text:
> > + *
> > + * WRPKRU will never execute transiently. Memory accesses
> > + * affected by PKRU register will not execute (even transiently)
> > + * until all prior executions of WRPKRU have completed execution
> > + * and updated the PKRU register.
> > + */
> > +void write_pkrs(u32 new_pkrs)
> > +{
> > +   u32 *pkrs;
> > +
> > +   if (!static_cpu_has(X86_FEATURE_PKS))
> > +   return;
> > +
> > +   pkrs = get_cpu_ptr(_cache);
> > +   if (*pkrs != new_pkrs) {
> > +   *pkrs = new_pkrs;
> > +   wrmsrl(MSR_IA32_PKRS, new_pkrs);
> > +   }
> > +   put_cpu_ptr(pkrs);
> > +}
> > 
> 
> It bugs me a *bit* that this is being called in a preempt-disabled
> region, but we still bother with the get/put_cpu jazz.

Re: [PATCH RFC V3 3/9] x86/pks: Enable Protection Keys Supervisor (PKS)

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 11:23:08AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > +/*
> > + * PKS is independent of PKU and either or both may be supported on a CPU.
> > + * Configure PKS if the cpu supports the feature.
> > + */
> 
> Let's at least be consistent about CPU vs. cpu in a single comment. :)

Sorry, done.

> 
> > +static void setup_pks(void)
> > +{
> > +   if (!IS_ENABLED(CONFIG_ARCH_HAS_SUPERVISOR_PKEYS))
> > +   return;
> > +   if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +   return;
> 
> If you put X86_FEATURE_PKS in disabled-features.h, you can get rid of
> the explicit CONFIG_ check.

Done.
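
With X86_FEATURE_PKS in disabled-features.h the cpu_feature_enabled() check
compiles to false when the config is off, so the function reduces to something
like this sketch:

	static void setup_pks(void)
	{
		if (!cpu_feature_enabled(X86_FEATURE_PKS))
			return;

		cr4_set_bits(X86_CR4_PKS);
	}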

> 
> > +   cr4_set_bits(X86_CR4_PKS);
> > +}
> > +
> >  /*
> >   * This does the hard work of actually picking apart the CPU stuff...
> >   */
> > @@ -1544,6 +1558,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  
> > x86_init_rdrand(c);
> > setup_pku(c);
> > +   setup_pks();
> >  
> > /*
> >  * Clear/Set all flags overridden by options, need do it
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 6c974888f86f..1b9bc004d9bc 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -822,6 +822,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
> > bool
> >  config ARCH_HAS_PKEYS
> > bool
> > +config ARCH_HAS_SUPERVISOR_PKEYS
> > +   bool
> >  
> >  config PERCPU_STATS
> > bool "Collect percpu memory statistics"
> > 
> 


Re: [PATCH RFC V3 2/9] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 10:50:05AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > +/*
> > + * Update the pk_reg value and return it.
> 
> How about:
> 
>   Replace disable bits for @pkey with values from @flags.

Done.

> 
> > + * Kernel users use the same flags as user space:
> > + * PKEY_DISABLE_ACCESS
> > + * PKEY_DISABLE_WRITE
> > + */
> > +u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
> > +{
> > +   int pkey_shift = pkey * PKR_BITS_PER_PKEY;
> > +
> > +   pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
> > +
> > +   if (flags & PKEY_DISABLE_ACCESS)
> > +   pk_reg |= PKR_AD_BIT << pkey_shift;
> > +   if (flags & PKEY_DISABLE_WRITE)
> > +   pk_reg |= PKR_WD_BIT << pkey_shift;
> 
> I still think this deserves two lines of comments:
> 
>   /* Mask out old bit values */
> 
>   /* Or in new values */

Sure, done.
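
With both comments folded in the helper reads (same code as quoted above,
just commented):

	u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
	{
		int pkey_shift = pkey * PKR_BITS_PER_PKEY;

		/* Mask out old bit values */
		pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);

		/* Or in new values */
		if (flags & PKEY_DISABLE_ACCESS)
			pk_reg |= PKR_AD_BIT << pkey_shift;
		if (flags & PKEY_DISABLE_WRITE)
			pk_reg |= PKR_WD_BIT << pkey_shift;

		return pk_reg;
	}
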
Ira


Re: [PATCH RFC PKS/PMEM 24/58] fs/freevxfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 12:25:44PM +0100, Christoph Hellwig wrote:
> > -   kaddr = kmap(pp);
> > +   kaddr = kmap_thread(pp);
> > memcpy(kaddr, vip->vii_immed.vi_immed + offset, PAGE_SIZE);
> > -   kunmap(pp);
> > +   kunmap_thread(pp);
> 
> You only Cced me on this particular patch, which means I have absolutely
> no idea what kmap_thread and kunmap_thread actually do, and thus can't
> provide an informed review.

Sorry the list was so big I struggled with who to CC and on which patches.

> 
> That being said I think your life would be a lot easier if you add
> helpers for the above code sequence and its counterpart that copies
> to a potential hughmem page first, as that hides the implementation
> details from most users.

Matthew Wilcox and Al Viro have suggested similar ideas.

https://lore.kernel.org/lkml/20201013205012.gi2046...@iweiny-desk2.sc.intel.com/

Ira


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 09:01:49PM +0100, Al Viro wrote:
> On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:
> 
> > static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> > int size)
> > {
> > char *vto = kmap_atomic(to);
> > 
> > memcpy(vto, vfrom, size);
> > kunmap_atomic(vto);
> > }
> > 
> > in linux/highmem.h ?
> 
> You mean, like
> static void memcpy_from_page(char *to, struct page *page, size_t offset, 
> size_t len)
> {
> char *from = kmap_atomic(page);
> memcpy(to, from + offset, len);
> kunmap_atomic(from);
> }
> 
> static void memcpy_to_page(struct page *page, size_t offset, const char 
> *from, size_t len)
> {
> char *to = kmap_atomic(page);
> memcpy(to + offset, from, len);
> kunmap_atomic(to);
> }
> 
> static void memzero_page(struct page *page, size_t offset, size_t len)
> {
> char *addr = kmap_atomic(page);
> memset(addr + offset, 0, len);
> kunmap_atomic(addr);
> }
> 
> in lib/iov_iter.c?  FWIW, I don't like that "highpage" in the name and
> highmem.h as location - these make perfect sense regardless of highmem;
> they are normal memory operations with page + offset used instead of
> a pointer...

I was thinking along those lines as well especially because of the direction
this patch set takes kmap().

Thanks for pointing these out to me.  How about I lift them to a common header?
But if not highmem.h, where?
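
As a usage sketch, a helper like the memcpy_from_page() above would collapse
the kmap()/memcpy()/kunmap() pattern under discussion in this thread into a
single call, with no explicit mapping left at the call site:

	memcpy_from_page(data, page, 0, PAGE_SIZE);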

Ira


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:
> On Tue, Oct 13, 2020 at 11:44:29AM -0700, Dan Williams wrote:
> > On Fri, Oct 9, 2020 at 12:52 PM  wrote:
> > >
> > > From: Ira Weiny 
> > >
> > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > the over head of global PKRS updates use the new kmap_thread() call.
> > >
> > > Cc: Nicolas Pitre 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  fs/cramfs/inode.c | 10 +-
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > > index 912308600d39..003c014a42ed 100644
> > > --- a/fs/cramfs/inode.c
> > > +++ b/fs/cramfs/inode.c
> > > @@ -247,8 +247,8 @@ static void *cramfs_blkdev_read(struct super_block 
> > > *sb, unsigned int offset,
> > > struct page *page = pages[i];
> > >
> > > if (page) {
> > > -   memcpy(data, kmap(page), PAGE_SIZE);
> > > -   kunmap(page);
> > > +   memcpy(data, kmap_thread(page), PAGE_SIZE);
> > > +   kunmap_thread(page);
> > 
> > Why does this need a sleepable kmap? This looks like a textbook
> > kmap_atomic() use case.
> 
> There's a lot of code of this form.  Could we perhaps have:
> 
> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> int size)
> {
>   char *vto = kmap_atomic(to);
> 
>   memcpy(vto, vfrom, size);
>   kunmap_atomic(vto);
> }
> 
> in linux/highmem.h ?

Christoph had the same idea.  I'll work on it.

Ira


Re: [PATCH RFC V3 1/9] x86/pkeys: Create pkeys_common.h

2020-10-13 Thread Ira Weiny
On Tue, Oct 13, 2020 at 10:46:16AM -0700, Dave Hansen wrote:
> On 10/9/20 12:42 PM, ira.we...@intel.com wrote:
> > Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work
> > in similar fashions and can share common defines.
> 
> Could we be a bit less abstract?  PKS and PKU each have:
> 1. A single control register
> 2. The same number of keys
> 3. The same number of bits in the register per key
> 4. Access and Write disable in the same bit locations
> 
> That means that we can share all the macros that synthesize and
> manipulate register values between the two features.

Sure.  Done.

> 
> > +++ b/arch/x86/include/asm/pkeys_common.h
> > @@ -0,0 +1,11 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_PKEYS_INTERNAL_H
> > +#define _ASM_X86_PKEYS_INTERNAL_H
> > +
> > +#define PKR_AD_BIT 0x1
> > +#define PKR_WD_BIT 0x2
> > +#define PKR_BITS_PER_PKEY 2
> > +
> > +#define PKR_AD_KEY(pkey)   (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
> 
> Now that this has moved away from its use-site, it's a bit less
> self-documenting.  Let's add a comment:
> 
> /*
>  * Generate an Access-Disable mask for the given pkey.  Several of these
>  * can be OR'd together to generate pkey register values.
>  */

Fair enough. done.
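
So the macro now carries the suggested comment, i.e.:

	/*
	 * Generate an Access-Disable mask for the given pkey.  Several of
	 * these can be OR'd together to generate pkey register values.
	 */
	#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))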

> 
> Once that's in place, along with the updated changelog:
> 
> Reviewed-by: Dave Hansen 

Thanks,
Ira


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Ira Weiny
On Mon, Oct 12, 2020 at 09:02:54PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 12, 2020 at 12:53:54PM -0700, Ira Weiny wrote:
> > On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> > > On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > > > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > > > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > > > always CPU-local and never broadcast.
> > > > 
> > > > So, basically, unless you *must* sleep while the mapping is in place,
> > > > kmap_atomic() is preferred.
> > > 
> > > But kmap_atomic() disables preemption, so the _ideal_ interface would map
> > > it only locally, then on preemption make it global.  I don't even know
> > > if that _can_ be done.  But this email makes it seem like kmap_atomic()
> > > has no downsides.
> > 
> > And that is IIUC what Thomas was trying to solve.
> > 
> > Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]
> > 
> > From what I can see all of these discussions support the need to have something
> > between kmap() and kmap_atomic().
> > 
> > However, the reason behind converting call sites to kmap_thread() are 
> > different
> > between Thomas' patch set and mine.  Both require more kmap granularity.
> > However, they do so with different reasons and underlying implementations 
> > but
> > with the _same_ resulting semantics; a thread local mapping which is
> > preemptable.[2]  Therefore they each focus on changing different call sites.
> > 
> > While this patch set is huge I think it serves a valuable purpose to 
> > identify a
> > large number of call sites which are candidates for this new semantic.
> 
> Yes, I agree.  My problem with this patch-set is that it ties it to
> some Intel feature that almost nobody cares about.

I humbly disagree.  At this level the only thing this is tied to is the idea
that there are additional memory protections available which can be enabled
quickly on a per-thread basis.  PKS on Intel is but 1 implementation of that.

Even the kmap code only has knowledge that there is something which needs
special handling on a devm page.

>
> Maybe we should
> care about it, but you didn't try very hard to make anyone care about
> it in the cover letter.

Ok my bad.  We have customers who care very much about restricting access to
the PMEM pages to prevent bugs in the kernel from causing permanent damage to
their data/file systems.  I'll reword the cover letter better.

> 
> For a future patch-set, I'd like to see you just introduce the new
> API.  Then you can optimise the Intel implementation of it afterwards.
> Those patch-sets have entirely different reviewers.

I considered doing this.  But this seemed more logical because the feature is
being driven by PMEM, which is behind the kmap interface, not by the users of
the API.

I can introduce a patch set with a kmap_thread() call which does nothing if
that is more palatable but it seems wrong to me to do so.
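
For clarity, a "does nothing" kmap_thread() would just be a pass-through to
the existing calls, e.g. this untested sketch, with the PKS-aware behavior
arriving in a later series:

	static inline void *kmap_thread(struct page *page)
	{
		return kmap(page);
	}

	static inline void kunmap_thread(struct page *page)
	{
		kunmap(page);
	}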

Ira


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Ira Weiny
On Mon, Oct 12, 2020 at 05:44:38PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 12, 2020 at 09:28:29AM -0700, Dave Hansen wrote:
> > kmap_atomic() is always preferred over kmap()/kmap_thread().
> > kmap_atomic() is _much_ more lightweight since its TLB invalidation is
> > always CPU-local and never broadcast.
> > 
> > So, basically, unless you *must* sleep while the mapping is in place,
> > kmap_atomic() is preferred.
> 
> But kmap_atomic() disables preemption, so the _ideal_ interface would map
> it only locally, then on preemption make it global.  I don't even know
> if that _can_ be done.  But this email makes it seem like kmap_atomic()
> has no downsides.

And that is IIUC what Thomas was trying to solve.

Also, Linus brought up that kmap_atomic() has quirks in nesting.[1]

From what I can see all of these discussions support the need to have something
between kmap() and kmap_atomic().

However, the reason behind converting call sites to kmap_thread() are different
between Thomas' patch set and mine.  Both require more kmap granularity.
However, they do so with different reasons and underlying implementations but
with the _same_ resulting semantics; a thread local mapping which is
preemptable.[2]  Therefore they each focus on changing different call sites.

While this patch set is huge I think it serves a valuable purpose to identify a
large number of call sites which are candidates for this new semantic.

Ira

[1] 
https://lore.kernel.org/lkml/CAHk-=wgbmwsTOKs23Z=71ebtruloeah2u3tnqt2athewvkb...@mail.gmail.com/
[2] It is important to note these implementations are not incompatible with
each other.  So I don't see yet another 'kmap_something()' being required.


Re: [PATCH RFC PKS/PMEM 22/58] fs/f2fs: Utilize new kmap_thread()

2020-10-12 Thread Ira Weiny
On Fri, Oct 09, 2020 at 06:30:36PM -0700, Eric Biggers wrote:
> On Sat, Oct 10, 2020 at 01:39:54AM +0100, Matthew Wilcox wrote:
> > On Fri, Oct 09, 2020 at 02:34:34PM -0700, Eric Biggers wrote:
> > > On Fri, Oct 09, 2020 at 12:49:57PM -0700, ira.we...@intel.com wrote:
> > > > The kmap() calls in this FS are localized to a single thread.  To avoid
> > > > the over head of global PKRS updates use the new kmap_thread() call.
> > > >
> > > > @@ -2410,12 +2410,12 @@ static inline struct page 
> > > > *f2fs_pagecache_get_page(
> > > >  
> > > >  static inline void f2fs_copy_page(struct page *src, struct page *dst)
> > > >  {
> > > > -   char *src_kaddr = kmap(src);
> > > > -   char *dst_kaddr = kmap(dst);
> > > > +   char *src_kaddr = kmap_thread(src);
> > > > +   char *dst_kaddr = kmap_thread(dst);
> > > >  
> > > > memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
> > > > -   kunmap(dst);
> > > > -   kunmap(src);
> > > > +   kunmap_thread(dst);
> > > > +   kunmap_thread(src);
> > > >  }
> > > 
> > > Wouldn't it make more sense to switch cases like this to kmap_atomic()?
> > > The pages are only mapped to do a memcpy(), then they're immediately unmapped.
> > 
> > Maybe you missed the earlier thread from Thomas trying to do something
> > similar for rather different reasons ...
> > 
> > https://lore.kernel.org/lkml/20200919091751.06...@linutronix.de/
> 
> I did miss it.  I'm not subscribed to any of the mailing lists it was sent to.
> 
> Anyway, it shouldn't matter.  Patchsets should be standalone, and not require
> reading random prior threads on linux-kernel to understand.

Sorry, but I did not think that the discussion above was directly related.  If
I'm not mistaken, Thomas' work was directed at relaxing kmap_atomic() into
kmap_thread() calls.  While interesting, it is not the point of this series.  I
want to convert kmap() callers to the more restrictive kmap_thread().

For this series, changing the kmap_thread() call sites to kmap_atomic() was
considered.  But like I said in the cover letter, kmap_atomic() does not have
the same semantics; it is too strict.  Perhaps I should have expanded on that
explanation.
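
To make that concrete: a number of these sites do something like the
sketch below, where the copy can fault and sleep.  That is illegal
under kmap_atomic() but fine under kmap_thread().
(read_page_to_user() is a made-up helper, not from the series.)

static int read_page_to_user(struct page *p, char __user *buf)
{
        char *addr = kmap_thread(p);
        /* copy_to_user() may fault and schedule */
        int ret = copy_to_user(buf, addr, PAGE_SIZE) ? -EFAULT : 0;

        kunmap_thread(p);
        return ret;
}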

> 
> And I still don't really understand.  After this patchset, there is still code
> nearly identical to the above (doing a temporary mapping just for a memcpy) that
> would still be using kmap_atomic().

I don't understand.  You mean there would be other call sites calling:

kmap_atomic()
memcpy()
kunmap_atomic()

?

> Is the idea that later, such code will be
> converted to use kmap_thread() instead?  If not, why use one over the other?
 

The reason for the new call is that, with PKS added behind kmap(), there are 3
levels of mapping we want:

global kmap (can span threads and sleep)
'thread' kmap (can sleep but not span threads)
'atomic' kmap (can't sleep nor span threads [by definition])
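
Or, as rough code (the copy helpers here are illustrative only):

/* 'atomic': CPU-local; no sleeping while the mapping is held */
static void copy_page_atomic(struct page *dst, struct page *src)
{
        char *s = kmap_atomic(src);
        char *d = kmap_atomic(dst);

        memcpy(d, s, PAGE_SIZE);
        kunmap_atomic(d);       /* nested maps unmap in reverse order */
        kunmap_atomic(s);
}

/* 'thread': may sleep, but only this thread may use the mapping */
static void copy_page_thread(struct page *dst, struct page *src)
{
        char *s = kmap_thread(src);
        char *d = kmap_thread(dst);

        memcpy(d, s, PAGE_SIZE);
        kunmap_thread(dst);
        kunmap_thread(src);
}

/*
 * 'global' remains plain kmap()/kunmap(): the address may be handed
 * to, and used by, other threads.
 */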

As Matthew said perhaps 'global kmaps' may be best changed to vmaps?  I just
don't know the details of every call site.

And since I don't know the call site details: if there are kmap_thread() calls
which are better off as kmap_atomic() calls, I think it is worth converting
them.  But I made the assumption that kmap() users would already be calling
kmap_atomic() if they could (because it is more efficient).

Ira


Re: [PATCH RFC PKS/PMEM 57/58] nvdimm/pmem: Stray access protection for pmem->virt_addr

2020-10-11 Thread Ira Weiny
On Fri, Oct 09, 2020 at 07:53:07PM -0700, John Hubbard wrote:
> On 10/9/20 12:50 PM, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The pmem driver uses a cached virtual address to access its memory
> > directly.  Because the nvdimm driver is well aware of the special
> > protections it has mapped memory with, we call dev_access_[en|dis]able()
> > around the direct pmem->virt_addr (pmem_addr) usage instead of the
> > unnecessary overhead of trying to get a page to kmap.
> > 
> > Signed-off-by: Ira Weiny 
> > ---
> >   drivers/nvdimm/pmem.c | 4 
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index fab29b514372..e4dc1ae990fc 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -148,7 +148,9 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
> > if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
> > return BLK_STS_IOERR;
> > +   dev_access_enable(false);
> > rc = read_pmem(page, page_off, pmem_addr, len);
> > +   dev_access_disable(false);
> 
> Hi Ira!
> 
> The APIs should be tweaked to use a symbol (GLOBAL, PER_THREAD), instead of
> true/false. Try reading the above and you'll see that it sounds like it's
> doing the opposite of what it is ("enable_this(false)" sounds like a clumsy
> API design to *disable*, right?). And there is no hint about the scope.

Sounds reasonable.

> 
> And it *could* be so much more readable like this:
> 
> dev_access_enable(DEV_ACCESS_THIS_THREAD);

I'll think about the flag name.  I'm not liking 'this thread'.

Maybe DEV_ACCESS_[GLOBAL|THREAD]
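
Something along these lines, perhaps (purely a sketch; none of the
names are settled):

enum dev_access_scope {
        DEV_ACCESS_GLOBAL,
        DEV_ACCESS_THREAD,
};

void dev_access_enable(enum dev_access_scope scope);
void dev_access_disable(enum dev_access_scope scope);

/* the pmem read path from the patch would then read: */
        dev_access_enable(DEV_ACCESS_THREAD);
        rc = read_pmem(page, page_off, pmem_addr, len);
        dev_access_disable(DEV_ACCESS_THREAD);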

Ira


Re: [PATCH RFC PKS/PMEM 48/58] drivers/md: Utilize new kmap_thread()

2020-10-11 Thread Ira Weiny
On Sat, Oct 10, 2020 at 10:20:34AM +0800, Coly Li wrote:
> On 2020/10/10 03:50, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > These kmap() calls are localized to a single thread.  To avoid the
> > overhead of global PKRS updates, use the new kmap_thread() call.
> > 
> 
> Hi Ira,
> 
> There were a number of options considered.
> 
> 1) Attempt to change all the thread local kmap() calls to kmap_atomic()
> 2) Introduce a flags parameter to kmap() to indicate if the mapping
> should be global or not
> 3) Change ~20-30 call sites to 'kmap_global()' to indicate that they
> require a global mapping of the pages
> 4) Change ~209 call sites to 'kmap_thread()' to indicate that the
> mapping is to be used within that thread of execution only
> 
> 
> I copied the above information from patch 00/58 to this message. The
> idea behind kmap_thread() is fine to me, but as you said the new API is
> very easy to miss in new code (even for me). I would like to support
> option 2) introducing a flag to kmap(); then we won't forget the new
> thread-localized kmap method, and people won't ask why a _thread()
> function is called but no kthread is created.

Thanks for the feedback.

I'm going to hold off making any changes until others weigh in.  FWIW, I kind
of like option 2 as well.  But there is already kmap_atomic(), so it seemed
like kmap_thread() was more in line with the current API.
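
If option 2 does win out, I would imagine something roughly like the
sketch below (KMAP_GLOBAL/KMAP_THREAD are placeholder names):

enum kmap_scope {
        KMAP_GLOBAL,    /* today's kmap() semantics */
        KMAP_THREAD,    /* thread-local, preemptible mapping */
};

void *kmap(struct page *page, enum kmap_scope scope);
void kunmap(struct page *page, enum kmap_scope scope);

/* e.g. the thread-local copy in f2fs_copy_page() would become: */
        src_kaddr = kmap(src, KMAP_THREAD);
        dst_kaddr = kmap(dst, KMAP_THREAD);
        memcpy(dst_kaddr, src_kaddr, PAGE_SIZE);
        kunmap(dst, KMAP_THREAD);
        kunmap(src, KMAP_THREAD);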

Thanks,
Ira

> 
> Thanks.
> 
> 
> Coly Li
> 


Re: [PATCH RFC PKS/PMEM 10/58] drivers/rdma: Utilize new kmap_thread()

2020-10-11 Thread Ira Weiny
On Sat, Oct 10, 2020 at 11:36:49AM +, Bernard Metzler wrote:
> -ira.we...@intel.com wrote: -
> 

[snip]

> >@@ -505,7 +505,7 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
> > page_array[seg] = p;
> > 
> > if (!c_tx->use_sendpage) {
> >-iov[seg].iov_base = kmap(p) + fp_off;
> >+iov[seg].iov_base = kmap_thread(p) + fp_off;
> 
> This misses a corresponding kunmap_thread() in siw_unmap_pages()
> (pls change line 403 in siw_qp_tx.c as well)

Thanks I missed that.

Done.
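
i.e. siw_unmap_pages() now uses the matching call; roughly like this
(a sketch only, modulo the exact signature in the driver):

static void siw_unmap_pages(struct page **pp, unsigned long kmap_mask)
{
        while (kmap_mask) {
                if (kmap_mask & BIT(0))
                        kunmap_thread(*pp);
                pp++;
                kmap_mask >>= 1;
        }
}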

Ira

> 
> Thanks,
> Bernard.
> 


Re: [PATCH RFC PKS/PMEM 09/58] drivers/gpu: Utilize new kmap_thread()

2020-10-10 Thread Ira Weiny
On Sat, Oct 10, 2020 at 12:03:49AM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2020 at 12:49:44PM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > These kmap() calls in the gpu stack are localized to a single thread.
> > To avoid the overhead of global PKRS updates, use the new kmap_thread()
> > call.
> > 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Patrik Jakobsson 
> > Signed-off-by: Ira Weiny 
> 
> I'm guessing the entire pile goes in through some other tree.
>

Apologies for not realizing there were multiple maintainers here.

But I was thinking it would land together through the mm tree once the core
support lands.  I've tried to split these out in a way that they can be easily
reviewed/acked by the correct developers.

> If so:
> 
> Acked-by: Daniel Vetter 
> 
> If you want this to land through maintainer trees, then we need a
> per-driver split (since aside from amdgpu and radeon they're all different
> subtrees).

It is just an RFC for the moment.  I need to get the core support accepted
first; then this can land.

> 
> btw the two kmap calls in drm you highlight in the cover letter should
> also be convertible to kmap_thread. We only hold vmalloc mappings for a
> longer time (or it'd be quite a driver bug). So if you want maybe throw
> those two as two additional patches on top, and we can do some careful
> review & testing for them.

Cool.  I'll add them in.

Ira

> -Daniel
> 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c  | 12 ++--
> >  drivers/gpu/drm/gma500/gma_display.c |  4 ++--
> >  drivers/gpu/drm/gma500/mmu.c | 10 +-
> >  drivers/gpu/drm/i915/gem/i915_gem_shmem.c|  4 ++--
> >  .../gpu/drm/i915/gem/selftests/i915_gem_context.c|  4 ++--
> >  drivers/gpu/drm/i915/gem/selftests/i915_gem_mman.c   |  8 
> >  drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c |  4 ++--
> >  drivers/gpu/drm/i915/gt/intel_gtt.c  |  4 ++--
> >  drivers/gpu/drm/i915/gt/shmem_utils.c|  4 ++--
> >  drivers/gpu/drm/i915/i915_gem.c  |  8 
> >  drivers/gpu/drm/i915/i915_gpu_error.c|  4 ++--
> >  drivers/gpu/drm/i915/selftests/i915_perf.c   |  4 ++--
> >  drivers/gpu/drm/radeon/radeon_ttm.c  |  4 ++--
> >  13 files changed, 37 insertions(+), 37 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > index 978bae731398..bd564bccb7a3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> > @@ -2437,11 +2437,11 @@ static ssize_t amdgpu_ttm_gtt_read(struct file *f, char __user *buf,
> >  
> > page = adev->gart.pages[p];
> > if (page) {
> > -   ptr = kmap(page);
> > +   ptr = kmap_thread(page);
> > ptr += off;
> >  
> > r = copy_to_user(buf, ptr, cur_size);
> > -   kunmap(adev->gart.pages[p]);
> > +   kunmap_thread(adev->gart.pages[p]);
> > } else
> > r = clear_user(buf, cur_size);
> >  
> > @@ -2507,9 +2507,9 @@ static ssize_t amdgpu_iomem_read(struct file *f, char __user *buf,
> > if (p->mapping != adev->mman.bdev.dev_mapping)
> > return -EPERM;
> >  
> > -   ptr = kmap(p);
> > +   ptr = kmap_thread(p);
> > r = copy_to_user(buf, ptr + off, bytes);
> > -   kunmap(p);
> > +   kunmap_thread(p);
> > if (r)
> > return -EFAULT;
> >  
> > @@ -2558,9 +2558,9 @@ static ssize_t amdgpu_iomem_write(struct file *f, const char __user *buf,
> > if (p->mapping != adev->mman.bdev.dev_mapping)
> > return -EPERM;
> >  
> > -   ptr = kmap(p);
> > +   ptr = kmap_thread(p);
> > r = copy_from_user(ptr + off, buf, bytes);
> > -   kunmap(p);
> > +   kunmap_thread(p);
> > if (r)
> > return -EFAULT;
> >  
> > diff --git a/drivers/gpu/drm/gma500/gma_display.c b/drivers/gpu/drm/gma500/gma_display.c
> > index 3df6d6e850f5..35f4e55c941f 100644
> > --- a/drivers/gpu/drm/gma500/gma_display.c
> > +++ b/drivers/gpu/drm/gma500/gma_display.c
> > @@ -400,9 +400,9 @@ int gma_crtc_cursor_set(stru

Re: [PATCH RFC V3 0/9] PKS: Add Protection Keys Supervisor (PKS) support RFC v3

2020-10-09 Thread Ira Weiny
On Fri, Oct 09, 2020 at 12:42:49PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny 
> 
> This RFC series has been reviewed by Dave Hansen.
> 
> Introduce a new page protection mechanism for supervisor pages, Protection Key
> Supervisor (PKS).
> 
> 2 use cases for PKS are being developed, trusted keys and PMEM.

RFC patch sets for these use cases have also been posted:

PMEM:
https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.we...@intel.com/

Trusted Keys:
https://lore.kernel.org/lkml/20201009201410.3209180-1-ira.we...@intel.com/

Ira

> Trusted keys
> is a newer use case which is still being explored.  PMEM was submitted as part
> of the RFC (v2) series[1].  However, since then it was found that some callers
> of kmap() require a global implementation of PKS.  Specifically some users of
> kmap() expect mappings to be available to all kernel threads.  While global 
> use
> of PKS is rare it needs to be included for correctness.  Unfortunately the
> kmap() updates required a large patch series to make the needed changes at the
> various kmap() call sites so that patch set has been split out.  Because the
> global PKS feature is only required for that use case it will be deferred to
> that set as well.[2]  This patch set is being submitted as a precursor to both
> of the use cases.
> 
> For an overview of the entire PKS ecosystem, a git tree including this series
> and the 2 use cases can be found here:
> 
>   https://github.com/weiny2/linux-kernel/tree/pks-rfc-v3
> 
> 
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to those pages beyond the normal paging protections.  PKS works in
> a similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
> checked in addition to normal paging protections and Access or Writes can be
> disabled via a MSR update without TLB flushes when permissions change.  Also
> like PKU, a page mapping is assigned to a domain by setting pkey bits in the
> page table entry for that mapping.
> 
> Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.
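
For reviewers unfamiliar with PKU: PKRS follows the same per-key bit
layout as PKRU -- two bits per key, Access-Disable and Write-Disable --
so permission updates are simple bit manipulation on the MSR value.
A sketch of the layout, not the actual patch code:

#define PKR_AD_BIT              0x1
#define PKR_WD_BIT              0x2
#define PKR_BITS_PER_PKEY       2

/* deny all access to 'pkey': set its Access-Disable bit */
static inline u32 pkrs_deny_access(u32 pkrs, int pkey)
{
        return pkrs | (PKR_AD_BIT << (pkey * PKR_BITS_PER_PKEY));
}

/* restore full access: clear both AD and WD bits for 'pkey' */
static inline u32 pkrs_allow_access(u32 pkrs, int pkey)
{
        return pkrs & ~((PKR_AD_BIT | PKR_WD_BIT) <<
                        (pkey * PKR_BITS_PER_PKEY));
}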
> 
> XSAVE is not supported for the PKRS MSR.  Therefore the implementation
> saves/restores the MSR across context switches and during exceptions.  Nested
> exceptions are supported by each exception getting a new PKS state.
> 
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections on mappings with the default pkey value of 0.
> 
> Other keys (1-15) are allocated by an allocator which prepares us for key
> contention from day one.  Kernel users should be prepared for the allocator to
> fail either because of key exhaustion or due to PKS not being supported on the
> arch and/or CPU instance.
> 
> The following are key attributes of PKS.
> 
>1) Fast switching of permissions
>   1a) Prevents access without page table manipulations
>   1b) No TLB flushes required
>2) Works on a per thread basis
> 
> PKS is available with 4 and 5 level paging.  Like PKRU it consumes 4 bits from
> the PTE to store the pkey within the entry.
> 
> 
> [1] https://lore.kernel.org/lkml/20200717072056.73134-1-ira.we...@intel.com/
> [2] 
> https://github.com/weiny2/linux-kernel/commit/f10abb0f0d7b4e14f03fc8890313a5830cde1e49
>   and a testing patch
> 
> https://github.com/weiny2/linux-kernel/commit/2a8e0fc7654a7c69b243d628f63b01ff26a5a797
> 
> 
> Fenghua Yu (3):
>   x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
>   x86/pks: Enable Protection Keys Supervisor (PKS)
>   x86/pks: Add PKS kernel API
> 
> Ira Weiny (6):
>   x86/pkeys: Create pkeys_common.h
>   x86/pks: Preserve the PKRS MSR on context switch
>   x86/entry: Pass irqentry_state_t by reference
>   x86/entry: Preserve PKRS MSR across exceptions
>   x86/fault: Report the PKRS state on fault
>   x86/pks: Add PKS test code
> 
>  Documentation/core-api/protection-keys.rst  | 102 ++-
>  arch/x86/Kconfig|   1 +
>  arch/x86/entry/common.c |  57 +-
>  arch/x86/include/asm/cpufeatures.h  |   1 +
>  arch/x86/include/asm/idtentry.h |  29 +-
>  arch/x86/include/asm/msr-index.h|   1 +
>  arch/x86/include/asm/pgtable.h  |  13 +-
>  arch/x86/include/asm/pgtable_types.h|  12 +
>  arch/x86/include/asm/pkeys.h|  15 +
>  arch/x86/include/asm/pkeys_common.h |  36 +
>  arch/x86/include/asm/processor.h|  13 +
>  arch/x86/include/uapi/asm/processor-flags.h |   2 +
>  arch/x86/kernel/cpu/common.c|  17 +
>  arch/x86/k

[PATCH RFC PKS/PMEM 55/58] samples: Utilize new kmap_thread()

2020-10-09 Thread ira . weiny
From: Ira Weiny 

These kmap() calls are localized to a single thread.  To avoid the
overhead of global PKRS updates, use the new kmap_thread() call.

Cc: Kirti Wankhede 
Signed-off-by: Ira Weiny 
---
 samples/vfio-mdev/mbochs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/samples/vfio-mdev/mbochs.c b/samples/vfio-mdev/mbochs.c
index 3cc5e5921682..6d95422c0b46 100644
--- a/samples/vfio-mdev/mbochs.c
+++ b/samples/vfio-mdev/mbochs.c
@@ -479,12 +479,12 @@ static ssize_t mdev_access(struct mdev_device *mdev, char *buf, size_t count,
pos -= MBOCHS_MMIO_BAR_OFFSET;
poff = pos & ~PAGE_MASK;
pg = __mbochs_get_page(mdev_state, pos >> PAGE_SHIFT);
-   map = kmap(pg);
+   map = kmap_thread(pg);
if (is_write)
memcpy(map + poff, buf, count);
else
memcpy(buf, map + poff, count);
-   kunmap(pg);
+   kunmap_thread(pg);
put_page(pg);
 
} else {
-- 
2.28.0.rc0.12.gb6a658bd00c9


  1   2   3   4   5   >