Re: [PATCH] ocxl: Update for AFU descriptor template version 1.1

2019-06-24 Thread Andrew Donnellan

On 5/6/19 9:15 pm, Frederic Barrat wrote:

From: Alastair D'Silva 

The OpenCAPI discovery and configuration specification has been
updated and introduces version 1.1 of the AFU descriptor template,
with new fields to better define the memory layout of an OpenCAPI
adapter.

The ocxl driver doesn't do much yet to support LPC memory but as we
start seeing (non-LPC) AFU images using the new template, this patch
updates the config space parsing code to avoid spitting a warning.

Signed-off-by: Alastair D'Silva 
Signed-off-by: Frederic Barrat 


Acked-by: Andrew Donnellan 

Alastair: you should fix your editor config :) 
https://openpower.xyz/job/snowpatch/job/snowpatch-linux-checkpatch/7564//artifact/linux/checkpatch.log



--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited



Re: [PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation

2019-06-24 Thread Michael Neuling
On Mon, 2019-06-24 at 21:48 +1000, Michael Ellerman wrote:
> Michael Neuling  writes:
> > When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
> > code currently sets:
> > CR0 <- 00 || MSR[TS]
> > but according to the ISA it should be:
> > CR0 <-  0 || MSR[TS] || 0
> 
> Seems bad, what's the worst case impact?

It's a data integrity issue as CR0 is corrupted.

> Do we have a test case for this?

Suraj has a KVM unit test for it.

> > This fixes the bit shift to put the bits in the correct location.
> 
> Fixes: ?

It's been around since we first wrote the code so:

Fixes: 4bb3c7a0208fc13c ("KVM: PPC: Book3S HV: Work around transactional memory 
bugs in POWER9")

Mikey


Re: [PATCH] powerpc: Document xive=off option

2019-06-24 Thread Stewart Smith
Michael Neuling  writes:
> commit 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE
> interrupt controller") added an option to turn off Linux native XIVE
> usage via the xive=off kernel command line option.
>
> This documents this option.
>
> Signed-off-by: Michael Neuling 

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.



Re: [PATCH 2/2] powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails

2019-06-24 Thread Oliver O'Halloran
On Tue, Jun 25, 2019 at 12:59 AM Vaibhav Jain  wrote:
>
> In some cases initial bind of scm memory for an lpar can fail if
> previously it wasn't released using a scm-unbind hcall. This situation
> can arise due to panic of the previous kernel or forced lpar reset. In
> such cases H_SCM_BIND_MEM returns an H_OVERLAP error.

What is a forced lpar reset? fadump?

> To mitigate such cases the patch updates drc_pmem_bind() to force a
> call to drc_pmem_unbind() in case the initial bind of scm memory fails
> with H_OVERLAP error. In case scm-bind operation again fails after the
> forced scm-unbind then we follow the existing error path.
>
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index d790e4e4ffb3..049d7927c0a4 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -44,19 +44,26 @@ struct papr_scm_priv {
> struct nd_interleave_set nd_set;
>  };

> +/* Forward declaration */
pointless comment.

> +static int drc_pmem_unbind(struct papr_scm_priv *);
>
>  static int drc_pmem_bind(struct papr_scm_priv *p)
>  {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> uint64_t rc, token;
> -   uint64_t saved = 0;
> +   uint64_t saved;
> +   bool tried_unbind = false;

nit: kernel style uses reverse christmas tree declarations, so this should be:

unsigned long ret[PLPAR_HCALL_BUFSIZE];
bool tried_unbind = false;
uint64_t rc, token;
uint64_t saved;

Come to think of it rc should probably be signed since the hcall
return codes are negative. I'm surprised that's not causing a warning
since we use %lld to print rc rather than %llu.

> +   dev_dbg(&p->pdev->dev, "bind drc %x\n", p->drc_index);
> /*
>  * When the hypervisor cannot map all the requested memory in a single
>  * hcall it returns H_BUSY and we call again with the token until
>  * we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS
>  * leave the system in an undefined state, so we wait.
>  */
> +retry:
> token = 0;
> +   saved = 0;
>
> do {
> rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
> @@ -68,8 +75,18 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
> } while (rc == H_BUSY);
>
> if (rc) {
> -   dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
> -   return -ENXIO;

> +   /* retry after unbinding */
> +   if (rc == H_OVERLAP && !tried_unbind) {
> +   dev_warn(&p->pdev->dev, "Un-binding and retrying\n");
> +   /* Try unbind and ignore any errors */
> +   tried_unbind = true;
> +   drc_pmem_unbind(p);
> +   goto retry;

I think it'd be cleaner if we put the unbind-and-retry logic into the
probe function, e.g:

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
b/arch/powerpc/platforms/pseries/papr_scm.c
index 96c53b23e58f..d113779fc27c 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -316,6 +316,14 @@ static int papr_scm_probe(struct platform_device *pdev)

/* request the hypervisor to bind this region to somewhere in memory */
rc = drc_pmem_bind(p);
+   if (rc == -EBUSY) {
+   /*
+* If we kexec()ed the previous kernel might have left the DRC
+* bound in memory. Unbind it and try again.
+*/
+   drc_pmem_unbind(p);
+   rc = drc_pmem_bind(p);
+   }
if (rc)
goto err;

Up to you though.


Re: [PATCH v4 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE

2019-06-24 Thread Herbert Xu
On Mon, Jun 24, 2019 at 08:35:33PM +0300, Imre Deak wrote:
> Hi,
> 
> On Thu, Jun 20, 2019 at 02:02:21PM +0800, Herbert Xu wrote:
> > On Mon, Jun 17, 2019 at 09:15:02PM +, Christophe Leroy wrote:
> > > All mapping iterator logic is based on the assumption that sg->offset
> > > is always lower than PAGE_SIZE.
> > > 
> > > But there are situations where sg->offset is such that the SG item
> > > is on the second page.
> 
> could you explain how sg->offset becomes >= PAGE_SIZE?

The network stack can produce SG list elements that are longer
than PAGE_SIZE.  If you then iterate over it one page at a time
the offset can exceed PAGE_SIZE.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration

2019-06-24 Thread Juliet Kim
There's some concern this could retry forever, resulting in live lock.  
Another approach would be to abandon the attempt and fail the LPM 
request. 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192531.html


On 6/24/19 12:23 PM, Nathan Lynch wrote:

Hi Mingming,

mmc  writes:

On 2019-06-21 00:05, Nathan Lynch wrote:

So return -EAGAIN instead of -EBUSY when this race is
encountered. Additionally: logging this event is still appropriate but
use pr_info instead of pr_err; and remove use of unlikely() while here
as this is not a hot path at all.

Looks good. Since it's not a hot path anyway, unlikely() should
benefit from compiler path optimization, and should stay. No?

The latency of this path in rtas_ibm_suspend_me() in the best case is
around 2-3 seconds.

So I think not -- this is such a heavyweight and relatively
seldom-executed path that the unlikely() cannot yield any discernible
performance benefit, and its presence imposes a readability cost.


Re: [PATCH 1/2] powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL

2019-06-24 Thread Oliver O'Halloran
On Tue, Jun 25, 2019 at 1:03 AM Vaibhav Jain  wrote:
>
> A new hcall named H_SCM_UNBIND_ALL has been introduced that can
> unbind all the drc memory-blocks assigned to an lpar. This is
> more efficient than using H_SCM_UNBIND_MEM as currently we don't
> support partial unbind of drc memory-blocks.
>
> Hence this patch proposes following changes to drc_pmem_unbind():
>
> * Update drc_pmem_unbind() to replace hcall H_SCM_UNBIND_MEM with
>   H_SCM_UNBIND_ALL.
>
> * Update drc_pmem_unbind() to handle cases when PHYP asks the guest
>   kernel to wait for a specific amount of time before retrying the
>   hcall via the 'LONG_BUSY' return value. In case it takes more
>   than 1 second to unbind the memory, log a warning.

Have you benchmarked how long a typical bind operation takes for very
large (1+TB) volumes? I remember it taking a while for the driver to
finish probing and I'd rather we didn't add spurious warnings. That
might have been due to the time taken to online memory rather than the
bind though.

> * Ensure appropriate error code is returned back from the function
>   in case of an error.
>
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/include/asm/hvcall.h |  2 +-
>  arch/powerpc/platforms/pseries/papr_scm.c | 37 ---
>  2 files changed, 33 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index 463c63a9fcf1..bb56fa0f976c 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -302,7 +302,7 @@
>  #define H_SCM_UNBIND_MEM0x3F0
>  #define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
>  #define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
> -#define H_SCM_MEM_QUERY0x3FC
> +#define H_SCM_UNBIND_ALL0x3FC
>  #define H_SCM_BLOCK_CLEAR   0x400
>  #define MAX_HCALL_OPCODE   H_SCM_BLOCK_CLEAR

We should probably update all these to match the latest spec. Can you
do that in a separate patch?

> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 96c53b23e58f..d790e4e4ffb3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -11,11 +11,16 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>
>  #define BIND_ANY_ADDR (~0ul)
>
> +/* Scope args for H_SCM_UNBIND_ALL */
> +#define H_UNBIND_SCOPE_ALL (0x1)
> +#define H_UNBIND_SCOPE_DRC (0x2)
> +
>  #define PAPR_SCM_DIMM_CMD_MASK \
> ((1ul << ND_CMD_GET_CONFIG_SIZE) | \
>  (1ul << ND_CMD_GET_CONFIG_DATA) | \
> @@ -78,21 +83,43 @@ static int drc_pmem_unbind(struct papr_scm_priv *p)
>  {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> uint64_t rc, token;
> +   unsigned long starttime;
>
> token = 0;
>
> -   /* NB: unbind has the same retry requirements mentioned above */
> +   dev_dbg(&p->pdev->dev, "unbind drc %x\n", p->drc_index);
> +
> +   /* NB: unbind_all has the same retry requirements as drc_pmem_bind() 
> */
> +   starttime = HZ;
> do {
> -   rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index,
> -   p->bound_addr, p->blocks, token);
> +   /* If this is taking too much time then log warning */
> +   if (jiffies_to_msecs(HZ - starttime) > 1000) {
> +   dev_warn(&p->pdev->dev,
> +"unbind taking > 1s to complete\n");
> +   starttime = HZ;
> +   }
> +
> +   /* Unbind of all SCM resources associated with drcIndex */
> +   rc = plpar_hcall(H_SCM_UNBIND_ALL, ret, H_UNBIND_SCOPE_DRC,
> +p->drc_index, token);
> token = ret[0];
> -   cond_resched();
> +
> +   /* Check if we are stalled for some time */
> +   if (H_IS_LONG_BUSY(rc)) {
> +   msleep(get_longbusy_msecs(rc));
> +   rc = H_BUSY;

> +   token = 0;

The hypervisor should be returning a continue token if the return code
is H_BUSY or any of the H_LONG_BUSY codes. The spec says that if we
get a busy return value then we must use the continue token for the
next unbind, so why are you zeroing it here? If this is required to
make it work then it's a bug in the hypervisor.

> +   } else if (rc == H_BUSY) {
> +   cond_resched();
> +   }
> +
> } while (rc == H_BUSY);
>
> if (rc)
> dev_err(&p->pdev->dev, "unbind error: %lld\n", rc);

A dev_dbg() for the unbind time might be nice.

>
> -   return !!rc;
> +   return rc == H_SUCCESS ? 0 : -ENXIO;
>  }
>
>  static int papr_scm_meta_get(struct papr_scm_priv *p,
> --
> 2.21.0
>


Re: [PATCH 3/3] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2019-06-24 Thread Nicholas Piggin
Anshuman Khandual's on June 24, 2019 4:52 pm:
> 
> 
> On 06/23/2019 03:14 PM, Nicholas Piggin wrote:
>> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
>> Whether or not a vmap is huge depends on the architecture details,
>> alignments, boot options, etc., which the caller can not be expected
>> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>> 
>> This change teaches vmalloc_to_page about larger pages, and returns
>> the struct page that corresponds to the offset within the large page.
>> This makes the API agnostic to mapping implementation details.
>> 
>> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
>> fail gracefully on unexpected huge vmap mappings")
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  include/asm-generic/4level-fixup.h |  1 +
>>  include/asm-generic/5level-fixup.h |  1 +
>>  mm/vmalloc.c   | 37 +++---
>>  3 files changed, 26 insertions(+), 13 deletions(-)
>> 
>> diff --git a/include/asm-generic/4level-fixup.h 
>> b/include/asm-generic/4level-fixup.h
>> index e3667c9a33a5..3cc65a4dd093 100644
>> --- a/include/asm-generic/4level-fixup.h
>> +++ b/include/asm-generic/4level-fixup.h
>> @@ -20,6 +20,7 @@
>>  #define pud_none(pud)   0
>>  #define pud_bad(pud)0
>>  #define pud_present(pud)1
>> +#define pud_large(pud)  0
>>  #define pud_ERROR(pud)  do { } while (0)
>>  #define pud_clear(pud)  pgd_clear(pud)
>>  #define pud_val(pud)pgd_val(pud)
>> diff --git a/include/asm-generic/5level-fixup.h 
>> b/include/asm-generic/5level-fixup.h
>> index bb6cb347018c..c4377db09a4f 100644
>> --- a/include/asm-generic/5level-fixup.h
>> +++ b/include/asm-generic/5level-fixup.h
>> @@ -22,6 +22,7 @@
>>  #define p4d_none(p4d)   0
>>  #define p4d_bad(p4d)0
>>  #define p4d_present(p4d)1
>> +#define p4d_large(p4d)  0
>>  #define p4d_ERROR(p4d)  do { } while (0)
>>  #define p4d_clear(p4d)  pgd_clear(p4d)
>>  #define p4d_val(p4d)pgd_val(p4d)
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 4c9e150e5ad3..4be98f700862 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -36,6 +36,7 @@
>>  #include 
>>  
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  
>> @@ -284,26 +285,36 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>  
>>  if (pgd_none(*pgd))
>>  return NULL;
>> +
>>  p4d = p4d_offset(pgd, addr);
>>  if (p4d_none(*p4d))
>>  return NULL;
>> -pud = pud_offset(p4d, addr);
>> +if (WARN_ON_ONCE(p4d_bad(*p4d)))
>> +return NULL;
> 
> The warning here is a required addition but it needs to be moved after 
> p4d_large()
> check. Please see the next comment below.
> 
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +if (p4d_large(*p4d))
>> +return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> +#endif
>>  
>> -/*
>> - * Don't dereference bad PUD or PMD (below) entries. This will also
>> - * identify huge mappings, which we may encounter on architectures
>> - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>> - * identified as vmalloc addresses by is_vmalloc_addr(), but are
>> - * not [unambiguously] associated with a struct page, so there is
>> - * no correct value to return for them.
>> - */
>> -WARN_ON_ONCE(pud_bad(*pud));
>> -if (pud_none(*pud) || pud_bad(*pud))
>> +pud = pud_offset(p4d, addr);
>> +if (pud_none(*pud))
>> +return NULL;
>> +if (WARN_ON_ONCE(pud_bad(*pud)))
>>  return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +if (pud_large(*pud))
>> +return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>> +#endif
>> +
> 
> pud_bad() on arm64 returns true when the PUD does not point to a next page
> table page implying the fact that it might be a large/huge entry. I am not
> sure if the semantics holds good for other architectures too. But on arm64
> if pud_large() is true, then pud_bad() will be true as well. So pud_bad()
> check must happen after pud_large() check. So the sequence here should be
> 
> 1. pud_none() --> Nothing is in here, return NULL
> 2. pud_large()--> Return offset page address from the huge page 
> mapping
> 3. pud_bad()  --> Return NULL as there is no more page table level left
> 
> Checking pud_bad() first can return NULL for a valid huge mapping.
> 
>>  pmd = pmd_offset(pud, addr);
>> -WARN_ON_ONCE(pmd_bad(*pmd));
>> -if (pmd_none(*pmd) || pmd_bad(*pmd))
>> +if (pmd_none(*pmd))
>> +return NULL;
>> +if (WARN_ON_ONCE(pmd_bad(*pmd)))
>>  return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +if (pmd_large(*pmd))
>> +return 

[PATCH] powerpc/rtas: Fix hang in race against concurrent cpu offline

2019-06-24 Thread Juliet Kim
From 1bd2bf7146876e099eafb292668f9a12dc22f526 Mon Sep 17 00:00:00 2001
From: Juliet Kim 
Date: Mon, 24 Jun 2019 17:17:46 -0400
Subject: [PATCH 1/1] powerpc/rtas: Fix hang in race against concurrent cpu
 offline
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The commit
(“powerpc/rtas: Fix a potential race between CPU-Offline & Migration”)
attempted to fix a hang in Live Partition Mobility(LPM) by abandoning
the LPM attempt if a race between LPM and concurrent CPU offline was
detected.

However, that fix failed to notify the Hypervisor that the LPM attempt
had been abandoned, which results in a system hang.

Fix this by sending a signal to PHYP to cancel the migration, so that
PHYP can stop waiting and clean up the migration.

Fixes: dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & 
Migration")
Signed-off-by: Juliet Kim 
---
This is an alternate solution to the one Nathan proposed in:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192414.html
---
 arch/powerpc/include/asm/hvcall.h | 7 +++
 arch/powerpc/kernel/rtas.c| 8 
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 463c63a..29ca285 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -261,6 +261,7 @@
 #define H_ADD_CONN 0x284
 #define H_DEL_CONN 0x288
 #define H_JOIN 0x298
+#define H_VASI_SIGNAL   0x2A0
 #define H_VASI_STATE0x2A4
 #define H_VIOCTL   0x2A8
 #define H_ENABLE_CRQ   0x2B0
@@ -348,6 +349,12 @@
 #define H_SIGNAL_SYS_RESET_ALL_OTHERS  -2
 /* >= 0 values are CPU number */
 
+/* Values for argument to H_VASI_SIGNAL */
+#define H_SIGNAL_CANCEL_MIG 0x01
+
+/* Values for 2nd argument to H_VASI_SIGNAL */
+#define H_CPU_OFFLINE_DETECTED  0x0604
+
 /* H_GET_CPU_CHARACTERISTICS return values */
 #define H_CPU_CHAR_SPEC_BAR_ORI31  (1ull << 63) // IBM bit 0
 #define H_CPU_CHAR_BCCTRL_SERIALISED   (1ull << 62) // IBM bit 1
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index b824f4c..f9002b7 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -981,6 +981,14 @@ int rtas_ibm_suspend_me(u64 handle)
 
/* Check if we raced with a CPU-Offline Operation */
if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
+
> +   /* We use CANCEL, not ABORT, to gracefully cancel the migration */
+   rc = plpar_hcall_norets(H_VASI_SIGNAL, handle,
+   H_SIGNAL_CANCEL_MIG, H_CPU_OFFLINE_DETECTED);
+
+   if (rc != H_SUCCESS)
+   pr_err("%s: vasi_signal failed %ld\n", __func__, rc);
+
pr_err("%s: Raced against a concurrent CPU-Offline\n",
   __func__);
atomic_set(&data.error, -EBUSY);
-- 
1.8.3.1



Re: [PATCH v4 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE

2019-06-24 Thread Imre Deak
Hi,

On Thu, Jun 20, 2019 at 02:02:21PM +0800, Herbert Xu wrote:
> On Mon, Jun 17, 2019 at 09:15:02PM +, Christophe Leroy wrote:
> > All mapping iterator logic is based on the assumption that sg->offset
> > is always lower than PAGE_SIZE.
> > 
> > But there are situations where sg->offset is such that the SG item
> > is on the second page.

could you explain how sg->offset becomes >= PAGE_SIZE?

--Imre


> > In that case sg_copy_to_buffer() fails to
> > properly copy the data into the buffer. One of the reasons is
> > that the data will be outside the kmapped area used to access that
> > data.
> > 
> > This patch fixes the issue by adjusting the mapping iterator
> > offset and pgoffset fields such that offset is always lower than
> > PAGE_SIZE.
> > 
> > Signed-off-by: Christophe Leroy 
> > Fixes: 4225fc8555a9 ("lib/scatterlist: use page iterator in the mapping 
> > iterator")
> > Cc: sta...@vger.kernel.org
> > ---
> >  lib/scatterlist.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> Good catch.
> 
> > @@ -686,7 +686,12 @@ static bool sg_miter_get_next_page(struct 
> > sg_mapping_iter *miter)
> > sg = miter->piter.sg;
> > pgoffset = miter->piter.sg_pgoffset;
> >  
> > -   miter->__offset = pgoffset ? 0 : sg->offset;
> > +   offset = pgoffset ? 0 : sg->offset;
> > +   while (offset >= PAGE_SIZE) {
> > +   miter->piter.sg_pgoffset = ++pgoffset;
> > +   offset -= PAGE_SIZE;
> > +   }
> 
> How about
> 
>   miter->piter.sg_pgoffset += offset >> PAGE_SHIFT;
>   offset &= PAGE_SIZE - 1;
> 
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 5/5] asm-generic: remove ptrace.h

2019-06-24 Thread Paul Burton
On Mon, Jun 24, 2019 at 07:47:28AM +0200, Christoph Hellwig wrote:
> No one is using this header anymore.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Arnd Bergmann 
> Acked-by: Oleg Nesterov 
> ---
>  MAINTAINERS|  1 -
>  arch/mips/include/asm/ptrace.h |  5 ---
>  include/asm-generic/ptrace.h   | 73 --
>  3 files changed, 79 deletions(-)
>  delete mode 100644 include/asm-generic/ptrace.h

FWIW:

Acked-by: Paul Burton 

Thanks,
Paul

> diff --git a/arch/mips/include/asm/ptrace.h b/arch/mips/include/asm/ptrace.h
> index b6578611dddb..1e76774b36dd 100644
> --- a/arch/mips/include/asm/ptrace.h
> +++ b/arch/mips/include/asm/ptrace.h
> @@ -56,11 +56,6 @@ static inline unsigned long kernel_stack_pointer(struct 
> pt_regs *regs)
>   return regs->regs[31];
>  }
>  
> -/*
> - * Don't use asm-generic/ptrace.h it defines FP accessors that don't make
> - * sense on MIPS.  We rather want an error if they get invoked.
> - */
> -
>  static inline void instruction_pointer_set(struct pt_regs *regs,
> unsigned long val)
>  {


Re: [PATCH v11 02/13] PKCS#7: Refactor verify_pkcs7_signature()

2019-06-24 Thread Thiago Jung Bauermann


Hello David,

AFAIK Mimi is happy with this patch set, but I still need acks from
maintainers of other subsystems that my changes touch before she can
accept it.

Are this patch and the next one ("PKCS#7: Introduce pkcs7_get_digest()")
OK from your PoV?

--
Thiago Jung Bauermann
IBM Linux Technology Center


Thiago Jung Bauermann  writes:

> IMA will need to verify a PKCS#7 signature which has already been parsed.
> For this reason, factor out the code which does that from
> verify_pkcs7_signature() into a new function which takes a struct
> pkcs7_message instead of a data buffer.
>
> Signed-off-by: Thiago Jung Bauermann 
> Reviewed-by: Mimi Zohar 
> Cc: David Howells 
> Cc: David Woodhouse 
> Cc: Herbert Xu 
> Cc: "David S. Miller" 
> ---
>  certs/system_keyring.c   | 61 ++--
>  include/linux/verification.h | 10 ++
>  2 files changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index c05c29ae4d5d..4ba82e52e4b4 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -194,33 +194,27 @@ late_initcall(load_system_certificate_list);
>  #ifdef CONFIG_SYSTEM_DATA_VERIFICATION
>
>  /**
> - * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * verify_pkcs7_message_sig - Verify a PKCS#7-based signature on system data.
>   * @data: The data to be verified (NULL if expecting internal data).
>   * @len: Size of @data.
> - * @raw_pkcs7: The PKCS#7 message that is the signature.
> - * @pkcs7_len: The size of @raw_pkcs7.
> + * @pkcs7: The PKCS#7 message that is the signature.
>   * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
>   *   (void *)1UL for all trusted keys).
>   * @usage: The use to which the key is being put.
>   * @view_content: Callback to gain access to content.
>   * @ctx: Context for callback.
>   */
> -int verify_pkcs7_signature(const void *data, size_t len,
> -const void *raw_pkcs7, size_t pkcs7_len,
> -struct key *trusted_keys,
> -enum key_being_used_for usage,
> -int (*view_content)(void *ctx,
> -const void *data, size_t len,
> -size_t asn1hdrlen),
> -void *ctx)
> +int verify_pkcs7_message_sig(const void *data, size_t len,
> +  struct pkcs7_message *pkcs7,
> +  struct key *trusted_keys,
> +  enum key_being_used_for usage,
> +  int (*view_content)(void *ctx,
> +  const void *data, size_t len,
> +  size_t asn1hdrlen),
> +  void *ctx)
>  {
> - struct pkcs7_message *pkcs7;
>   int ret;
>
> - pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> - if (IS_ERR(pkcs7))
> - return PTR_ERR(pkcs7);
> -
>   /* The data should be detached - so we need to supply it. */
>   if (data && pkcs7_supply_detached_data(pkcs7, data, len) < 0) {
>   pr_err("PKCS#7 signature with non-detached data\n");
> @@ -273,6 +267,41 @@ int verify_pkcs7_signature(const void *data, size_t len,
>   }
>
>  error:
> + pr_devel("<==%s() = %d\n", __func__, ret);
> + return ret;
> +}
> +
> +/**
> + * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * @data: The data to be verified (NULL if expecting internal data).
> + * @len: Size of @data.
> + * @raw_pkcs7: The PKCS#7 message that is the signature.
> + * @pkcs7_len: The size of @raw_pkcs7.
> + * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
> + *   (void *)1UL for all trusted keys).
> + * @usage: The use to which the key is being put.
> + * @view_content: Callback to gain access to content.
> + * @ctx: Context for callback.
> + */
> +int verify_pkcs7_signature(const void *data, size_t len,
> +const void *raw_pkcs7, size_t pkcs7_len,
> +struct key *trusted_keys,
> +enum key_being_used_for usage,
> +int (*view_content)(void *ctx,
> +const void *data, size_t len,
> +size_t asn1hdrlen),
> +void *ctx)
> +{
> + struct pkcs7_message *pkcs7;
> + int ret;
> +
> + pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> + if (IS_ERR(pkcs7))
> + return PTR_ERR(pkcs7);
> +
> + ret = verify_pkcs7_message_sig(data, len, pkcs7, trusted_keys, usage,
> +view_content, ctx);
> +
>   pkcs7_free_message(pkcs7);
>   pr_devel("<==%s() = %d\n", __func__, ret);
>   return ret;
> diff 

Re: [PATCH v11 01/13] MODSIGN: Export module signature definitions

2019-06-24 Thread Thiago Jung Bauermann


Hello Jessica,

AFAIK Mimi is happy with this patch set, but I still need acks from
maintainers of other subsystems that my changes touch before she can
accept it.

Is this patch OK from your PoV?

--
Thiago Jung Bauermann
IBM Linux Technology Center


Thiago Jung Bauermann  writes:

> IMA will use the module_signature format for append signatures, so export
> the relevant definitions and factor out the code which verifies that the
> appended signature trailer is valid.
>
> Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
> and be able to use mod_check_sig() without having to depend on either
> CONFIG_MODULE_SIG or CONFIG_MODULES.
>
> Signed-off-by: Thiago Jung Bauermann 
> Reviewed-by: Mimi Zohar 
> Cc: Jessica Yu 
> ---
>  include/linux/module.h   |  3 --
>  include/linux/module_signature.h | 44 +
>  init/Kconfig |  6 +++-
>  kernel/Makefile  |  1 +
>  kernel/module.c  |  1 +
>  kernel/module_signature.c| 46 ++
>  kernel/module_signing.c  | 56 +---
>  scripts/Makefile |  2 +-
>  8 files changed, 106 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 188998d3dca9..aa56f531cf1e 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -25,9 +25,6 @@
>  #include 
>  #include 
>
> -/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> -#define MODULE_SIG_STRING "~Module signature appended~\n"
> -
>  /* Not Yet Implemented */
>  #define MODULE_SUPPORTED_DEVICE(name)
>
> diff --git a/include/linux/module_signature.h 
> b/include/linux/module_signature.h
> new file mode 100644
> index ..523617fc5b6a
> --- /dev/null
> +++ b/include/linux/module_signature.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +/*
> + * Module signature handling.
> + *
> + * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowe...@redhat.com)
> + */
> +
> +#ifndef _LINUX_MODULE_SIGNATURE_H
> +#define _LINUX_MODULE_SIGNATURE_H
> +
> +/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> +#define MODULE_SIG_STRING "~Module signature appended~\n"
> +
> +enum pkey_id_type {
> + PKEY_ID_PGP,/* OpenPGP generated key ID */
> + PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
> + PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
> +};
> +
> +/*
> + * Module signature information block.
> + *
> + * The constituents of the signature section are, in order:
> + *
> + *   - Signer's name
> + *   - Key identifier
> + *   - Signature data
> + *   - Information block
> + */
> +struct module_signature {
> + u8  algo;   /* Public-key crypto algorithm [0] */
> + u8  hash;   /* Digest algorithm [0] */
> + u8  id_type;/* Key identifier type [PKEY_ID_PKCS7] */
> + u8  signer_len; /* Length of signer's name [0] */
> + u8  key_id_len; /* Length of key identifier [0] */
> + u8  __pad[3];
> + __be32  sig_len;/* Length of signature data */
> +};
> +
> +int mod_check_sig(const struct module_signature *ms, size_t file_len,
> +   const char *name);
> +
> +#endif /* _LINUX_MODULE_SIGNATURE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 8b9ffe236e4f..c2286a3c74c5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1852,6 +1852,10 @@ config BASE_SMALL
>   default 0 if BASE_FULL
>   default 1 if !BASE_FULL
>
> +config MODULE_SIG_FORMAT
> + def_bool n
> + select SYSTEM_DATA_VERIFICATION
> +
>  menuconfig MODULES
>   bool "Enable loadable module support"
>   option modules
> @@ -1929,7 +1933,7 @@ config MODULE_SRCVERSION_ALL
>  config MODULE_SIG
>   bool "Module signature verification"
>   depends on MODULES
> - select SYSTEM_DATA_VERIFICATION
> + select MODULE_SIG_FORMAT
>   help
> Check modules for valid signatures upon load: the signature
> is simply appended to the module. For more information see
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 33824f0385b3..f29ae2997a43 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -58,6 +58,7 @@ endif
>  obj-$(CONFIG_UID16) += uid16.o
>  obj-$(CONFIG_MODULES) += module.o
>  obj-$(CONFIG_MODULE_SIG) += module_signing.o
> +obj-$(CONFIG_MODULE_SIG_FORMAT) += module_signature.o
>  obj-$(CONFIG_KALLSYMS) += kallsyms.o
>  obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_CRASH_CORE) += crash_core.o
> diff --git a/kernel/module.c b/kernel/module.c
> index 6e6712b3aaf5..2712f4d217f5 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/kernel/module_signature.c b/kernel/module_signature.c
> new file mode 100644
> 

Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration

2019-06-24 Thread Nathan Lynch
Hi Mingming,

mmc  writes:
> On 2019-06-21 00:05, Nathan Lynch wrote:
>> So return -EAGAIN instead of -EBUSY when this race is
>> encountered. Additionally: logging this event is still appropriate but
>> use pr_info instead of pr_err; and remove use of unlikely() while here
>> as this is not a hot path at all.
>
> Looks good. Since it's not a hot path anyway, unlikely() should
> benefit from compiler path optimization, and should stay. No?

The latency of this path in rtas_ibm_suspend_me() in the best case is
around 2-3 seconds.

So I think not -- this is such a heavyweight and relatively
seldom-executed path that the unlikely() cannot yield any discernible
performance benefit, and its presence imposes a readability cost.


Re: [next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR

2019-06-24 Thread Sachin Sant



> On 24-Jun-2019, at 8:12 PM, David Hildenbrand  wrote:
> 
> On 24.06.19 16:09, Sachin Sant wrote:
>> Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls.
>> 
>> This problem was introduced with next-20190620 (dc636f5d78).
>> next-20190619 was last good kernel.
>> 
>> Reverting following commit allows the kernel to boot.
>> 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()
>> 
>> 
>> [0.014409] Using shared cache scheduler topology
>> [0.016302] devtmpfs: initialized
>> [0.031022] clocksource: jiffies: mask: 0x max_cycles: 
>> 0x, max_idle_ns: 1911260446275 ns
>> [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, 
>> linear)
>> [0.031575] NET: Registered protocol family 16
>> [0.031724] audit: initializing netlink subsys (disabled)
>> [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized 
>> audit_enabled=0 res=1
>> [0.032249] cpuidle: using governor menu
>> [0.032403] pstore: Registered nvram as persistent store backend
>> [   60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
>> [   60.061254] rcu:  0-: (5999 ticks this GP) 
>> idle=1ea/1/0x4002 softirq=5/5 fqs=2999 
>> [   60.061261]   (t=6000 jiffies g=-1187 q=0)
>> [   60.061265] NMI backtrace for cpu 0
>> [   60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
>> 5.2.0-rc5-next-20190621-autotest-autotest #1
>> [   60.061275] Call Trace:
>> [   60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 
>> (unreliable)
>> [   60.061287] [c018ee85f3c0] [c0b6d464] 
>> nmi_cpu_backtrace+0x144/0x150
>> [   60.061293] [c018ee85f450] [c0b6d61c] 
>> nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
>> [   60.061300] [c018ee85f4f0] [c00692c8] 
>> arch_trigger_cpumask_backtrace+0x28/0x40
>> [   60.061306] [c018ee85f510] [c01c5f90] 
>> rcu_dump_cpu_stacks+0x10c/0x16c
>> [   60.061313] [c018ee85f560] [c01c4fe4] 
>> rcu_sched_clock_irq+0x744/0x990
>> [   60.061318] [c018ee85f630] [c01d5b58] 
>> update_process_times+0x48/0x90
>> [   60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
>> [   60.061330] [c018ee85f690] [c01ea150] 
>> tick_handle_periodic+0x40/0xe0
>> [   60.061336] [c018ee85f6d0] [c002b5cc] 
>> timer_interrupt+0x10c/0x2e0
>> [   60.061342] [c018ee85f730] [c0009204] 
>> decrementer_common+0x134/0x140
>> [   60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>> [   60.061350] LR = arch_local_irq_restore+0x84/0x90
>> [   60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac 
>> (unreliable)
>> [   60.061364] [c018ee85fa50] [c0b88300] 
>> _raw_spin_unlock_irqrestore+0x50/0x80
>> [   60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
>> [   60.061376] [c018ee85fac0] [c0766ea0] 
>> subsys_find_device_by_id+0xf0/0x1a0
>> [   60.061382] [c018ee85fb20] [c0797a94] 
>> walk_memory_blocks+0x84/0x100
>> [   60.061388] [c018ee85fb80] [c0795ea0] 
>> link_mem_sections+0x40/0x60
>> [   60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
>> [   60.061400] [c018ee85fc10] [c0010448] 
>> do_one_initcall+0x68/0x2c0
>> [   60.061406] [c018ee85fce0] [c0f247dc] 
>> kernel_init_freeable+0x318/0x47c
>> [   60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
>> [   60.061417] [c018ee85fe20] [c000ba54] 
>> ret_from_kernel_thread+0x5c/0x68
>> [   88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
>> [swapper/0:1]
>> [   88.016569] Modules linked in:
>> 
> 
> Hi, thanks! Please see
> 
> https://lkml.org/lkml/2019/6/21/600
> 
> and especially
> 
> https://lkml.org/lkml/2019/6/21/908
> 
> Does this fix your problem? The fix is on its way to next.

Yes, this patch fixes the problem for me.

Thanks
-Sachin



Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration

2019-06-24 Thread mmc

On 2019-06-21 00:05, Nathan Lynch wrote:

The protocol for suspending or migrating an LPAR requires all present
processor threads to enter H_JOIN. So if we have threads offline, we
have to temporarily bring them up. This can race with administrator
actions such as SMT state changes. As of dfd718a2ed1f ("powerpc/rtas:
Fix a potential race between CPU-Offline & Migration"),
rtas_ibm_suspend_me() accounts for this, but errors out with -EBUSY
for what almost certainly is a transient condition in any reasonable
scenario.

Callers of rtas_ibm_suspend_me() already retry when -EAGAIN is
returned, and it is typical during a migration for that to happen
repeatedly for several minutes polling the H_VASI_STATE hcall result
before proceeding to the next stage.

So return -EAGAIN instead of -EBUSY when this race is
encountered. Additionally: logging this event is still appropriate but
use pr_info instead of pr_err; and remove use of unlikely() while here
as this is not a hot path at all.



Looks good. Though even if it's not a hot path, shouldn't unlikely() still
benefit from the compiler's branch-layout optimization, and stay? No?


Mingming

Fixes: dfd718a2ed1f ("powerpc/rtas: Fix a potential race between
CPU-Offline & Migration")
Signed-off-by: Nathan Lynch 
---
 arch/powerpc/kernel/rtas.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index fbc676160adf..9b4d2a2ffb4f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -984,10 +984,9 @@ int rtas_ibm_suspend_me(u64 handle)
cpu_hotplug_disable();

/* Check if we raced with a CPU-Offline Operation */
-   if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
-   pr_err("%s: Raced against a concurrent CPU-Offline\n",
-  __func__);
-   atomic_set(&data.error, -EBUSY);
+   if (!cpumask_equal(cpu_present_mask, cpu_online_mask)) {
+   pr_info("%s: Raced against a concurrent CPU-Offline\n", 
__func__);
+   atomic_set(&data.error, -EAGAIN);
goto out_hotplug_enable;
}




Re: [PATCH] ocxl: Fix concurrent AFU open and device removal

2019-06-24 Thread Greg Kurz
On Mon, 24 Jun 2019 16:41:48 +0200
Frederic Barrat  wrote:

> If an ocxl device is unbound through sysfs at the same time its AFU is
> being opened by a user process, the open code may dereference freed
> structures, which can lead to kernel oops messages. You'd have to hit a
> tiny time window, but it's possible. It's fairly easy to test by
> making the time window bigger artificially.
> 
> Fix it with a combination of 2 changes:
> - when an AFU device is found in the IDR by looking for the device
> minor number, we should hold a reference on the device until after the
> context is allocated. A reference on the AFU structure is kept when
> the context is allocated, so we can release the reference on the
> device after the context allocation.
> - with the fix above, there's still another even tinier window,
> between the time the AFU device is found in the IDR and the reference
> on the device is taken. We can fix this one by removing the IDR entry
> earlier, when the device setup is removed, instead of waiting for the
> 'release' device callback. With proper locking around the IDR.
> 
> Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & 
> frontend")
> Signed-off-by: Frederic Barrat 
> ---
> mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think 
> it's that important. If it's for the next merge window, I would add:
> Cc: sta...@vger.kernel.org  # v5.2
> 
> 
> drivers/misc/ocxl/file.c | 23 +++
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
> index 2870c25da166..4d1b44de1492 100644
> --- a/drivers/misc/ocxl/file.c
> +++ b/drivers/misc/ocxl/file.c
> @@ -18,18 +18,15 @@ static struct class *ocxl_class;
>  static struct mutex minors_idr_lock;
>  static struct idr minors_idr;
>  
> -static struct ocxl_file_info *find_file_info(dev_t devno)
> +static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
>  {
>   struct ocxl_file_info *info;
>  
> - /*
> -  * We don't declare an RCU critical section here, as our AFU
> -  * is protected by a reference counter on the device. By the time the
> -  * info reference is removed from the idr, the ref count of
> -  * the device is already at 0, so no user API will access that AFU and
> -  * this function can't return it.
> -  */
> + mutex_lock(&minors_idr_lock);
>   info = idr_find(&minors_idr, MINOR(devno));
> + if (info)
> + get_device(&info->dev);
> + mutex_unlock(&minors_idr_lock);
>   return info;
>  }
>  
> @@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file 
> *file)
>  
>   pr_debug("%s for device %x\n", __func__, inode->i_rdev);
>  
> - info = find_file_info(inode->i_rdev);
> + info = find_and_get_file_info(inode->i_rdev);
>   if (!info)
>   return -ENODEV;
>  
>   rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);
> - if (rc)
> + if (rc) {
> + put_device(&info->dev);

You could have a single call site for put_device() since it's
needed for both branches. No big deal.

>   return rc;
> -
> + }
> + put_device(&info->dev);
>   file->private_data = ctx;
>   return 0;
>  }
> @@ -487,7 +486,6 @@ static void info_release(struct device *dev)
>  {
>   struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, 
> dev);
>  
> - free_minor(info);
>   ocxl_afu_put(info->afu);
>   kfree(info);
>  }
> @@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
>  
>   ocxl_file_make_invisible(info);
>   ocxl_sysfs_unregister_afu(info);
> + free_minor(info);

Since the IDR entry is added by ocxl_file_register_afu(), it seems to make
sense to undo that in ocxl_file_unregister_afu(). Out of curiosity, was there
any historical reason to do this in info_release() in the first place ?

Reviewed-by: Greg Kurz 

>   device_unregister(&info->dev);
>  }
>  



Re: [PATCH] ocxl: Fix concurrent AFU open and device removal

2019-06-24 Thread Greg Kurz
On Mon, 24 Jun 2019 17:39:26 +0200
Frederic Barrat  wrote:

> Le 24/06/2019 à 17:24, Greg Kurz a écrit :
> > On Mon, 24 Jun 2019 16:41:48 +0200
> > Frederic Barrat  wrote:
> >   
> >> If an ocxl device is unbound through sysfs at the same time its AFU is
> >> being opened by a user process, the open code may dereference freed
> >> structures, which can lead to kernel oops messages. You'd have to hit a
> >> tiny time window, but it's possible. It's fairly easy to test by
> >> making the time window bigger artificially.
> >>
> >> Fix it with a combination of 2 changes:
> >> - when an AFU device is found in the IDR by looking for the device
> >> minor number, we should hold a reference on the device until after the
> >> context is allocated. A reference on the AFU structure is kept when
> >> the context is allocated, so we can release the reference on the
> >> device after the context allocation.
> >> - with the fix above, there's still another even tinier window,
> >> between the time the AFU device is found in the IDR and the reference
> >> on the device is taken. We can fix this one by removing the IDR entry
> >> earlier, when the device setup is removed, instead of waiting for the
> >> 'release' device callback. With proper locking around the IDR.
> >>
> >> Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl 
> >> backend & frontend")
> >> Signed-off-by: Frederic Barrat 
> >> ---
> >> mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think 
> >> it's that important. If it's for the next merge window, I would add:
> >> Cc: sta...@vger.kernel.org  # v5.2
> >>
> >>
> >> drivers/misc/ocxl/file.c | 23 +++
> >>   1 file changed, 11 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
> >> index 2870c25da166..4d1b44de1492 100644
> >> --- a/drivers/misc/ocxl/file.c
> >> +++ b/drivers/misc/ocxl/file.c
> >> @@ -18,18 +18,15 @@ static struct class *ocxl_class;
> >>   static struct mutex minors_idr_lock;
> >>   static struct idr minors_idr;
> >>   
> >> -static struct ocxl_file_info *find_file_info(dev_t devno)
> >> +static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
> >>   {
> >>struct ocxl_file_info *info;
> >>   
> >> -  /*
> >> -   * We don't declare an RCU critical section here, as our AFU
> >> -   * is protected by a reference counter on the device. By the time the
> >> -   * info reference is removed from the idr, the ref count of
> >> -   * the device is already at 0, so no user API will access that AFU and
> >> -   * this function can't return it.
> >> -   */
> >> +  mutex_lock(&minors_idr_lock);
> >>    info = idr_find(&minors_idr, MINOR(devno));
> >> +  if (info)
> >> +  get_device(&info->dev);
> >> +  mutex_unlock(&minors_idr_lock);
> >>return info;
> >>   }
> >>   
> >> @@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file 
> >> *file)
> >>   
> >>pr_debug("%s for device %x\n", __func__, inode->i_rdev);
> >>   
> >> -  info = find_file_info(inode->i_rdev);
> >> +  info = find_and_get_file_info(inode->i_rdev);
> >>if (!info)
> >>return -ENODEV;
> >>   
> >>    rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);
> >> -  if (rc)
> >> +  if (rc) {
> >> +  put_device(&info->dev);  
> > 
> > You could have a single call site for put_device() since it's
> > needed for both branches. No big deal.  
> 
> 
> Agreed. Will fix if I end up respinning, but won't if it's the only 
> complaint :-)
> 
> 
> 
> >>return rc;
> >> -
> >> +  }
> >> +  put_device(&info->dev);
> >>file->private_data = ctx;
> >>return 0;
> >>   }
> >> @@ -487,7 +486,6 @@ static void info_release(struct device *dev)
> >>   {
> >>struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, 
> >> dev);
> >>   
> >> -  free_minor(info);
> >>ocxl_afu_put(info->afu);
> >>kfree(info);
> >>   }
> >> @@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
> >>   
> >>ocxl_file_make_invisible(info);
> >>ocxl_sysfs_unregister_afu(info);
> >> +  free_minor(info);  
> > 
> > Since the IDR entry is added by ocxl_file_register_afu(), it seems to make
> > sense to undo that in ocxl_file_unregister_afu(). Out of curiosity, was 
> > there
> > any historical reason to do this in info_release() in the first place ?  
> 
> 
> Yeah, it makes a lot of sense to remove the IDR entry in 
> ocxl_file_unregister_afu(), that's where we undo the device. I wish I 
> had noticed during the code reviews.
> I don't think there was any good reason to have it in info_release() in 
> the first place. I remember the code went through many iterations to get 
> the reference counting on the AFU structure and device done correctly, 
> but we let that one slip.
> 
> I now think the pre-5.2 ocxl code was also exposed to the 2nd window 
> mentioned in the commit log (but the first window is new with the 
> refactoring introduced in 5.2-rc1).
> 

This calls for two separate 

Re: [PATCH] ocxl: Fix concurrent AFU open and device removal

2019-06-24 Thread Frederic Barrat




Le 24/06/2019 à 17:24, Greg Kurz a écrit :

On Mon, 24 Jun 2019 16:41:48 +0200
Frederic Barrat  wrote:


If an ocxl device is unbound through sysfs at the same time its AFU is
being opened by a user process, the open code may dereference freed
structures, which can lead to kernel oops messages. You'd have to hit a
tiny time window, but it's possible. It's fairly easy to test by
making the time window bigger artificially.

Fix it with a combination of 2 changes:
- when an AFU device is found in the IDR by looking for the device
minor number, we should hold a reference on the device until after the
context is allocated. A reference on the AFU structure is kept when
the context is allocated, so we can release the reference on the
device after the context allocation.
- with the fix above, there's still another even tinier window,
between the time the AFU device is found in the IDR and the reference
on the device is taken. We can fix this one by removing the IDR entry
earlier, when the device setup is removed, instead of waiting for the
'release' device callback. With proper locking around the IDR.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & 
frontend")
Signed-off-by: Frederic Barrat 
---
mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think it's 
that important. If it's for the next merge window, I would add:
Cc: sta...@vger.kernel.org  # v5.2


drivers/misc/ocxl/file.c | 23 +++
  1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index 2870c25da166..4d1b44de1492 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -18,18 +18,15 @@ static struct class *ocxl_class;
  static struct mutex minors_idr_lock;
  static struct idr minors_idr;
  
-static struct ocxl_file_info *find_file_info(dev_t devno)

+static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
  {
struct ocxl_file_info *info;
  
-	/*

-* We don't declare an RCU critical section here, as our AFU
-* is protected by a reference counter on the device. By the time the
-* info reference is removed from the idr, the ref count of
-* the device is already at 0, so no user API will access that AFU and
-* this function can't return it.
-*/
+   mutex_lock(&minors_idr_lock);
	info = idr_find(&minors_idr, MINOR(devno));
+   if (info)
+   get_device(&info->dev);
+   mutex_unlock(&minors_idr_lock);
return info;
  }
  
@@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file *file)
  
  	pr_debug("%s for device %x\n", __func__, inode->i_rdev);
  
-	info = find_file_info(inode->i_rdev);

+   info = find_and_get_file_info(inode->i_rdev);
if (!info)
return -ENODEV;
  
	rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);

-   if (rc)
+   if (rc) {
+   put_device(&info->dev);


You could have a single call site for put_device() since it's
needed for both branches. No big deal.



Agreed. Will fix if I end up respinning, but won't if it's the only 
complaint :-)





return rc;
-
+   }
+   put_device(&info->dev);
file->private_data = ctx;
return 0;
  }
@@ -487,7 +486,6 @@ static void info_release(struct device *dev)
  {
struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, 
dev);
  
-	free_minor(info);

ocxl_afu_put(info->afu);
kfree(info);
  }
@@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
  
  	ocxl_file_make_invisible(info);

ocxl_sysfs_unregister_afu(info);
+   free_minor(info);


Since the IDR entry is added by ocxl_file_register_afu(), it seems to make
sense to undo that in ocxl_file_unregister_afu(). Out of curiosity, was there
any historical reason to do this in info_release() in the first place ?



Yeah, it makes a lot of sense to remove the IDR entry in 
ocxl_file_unregister_afu(), that's where we undo the device. I wish I 
had noticed during the code reviews.
I don't think there was any good reason to have it in info_release() in 
the first place. I remember the code went through many iterations to get 
the reference counting on the AFU structure and device done correctly, 
but we let that one slip.


I now think the pre-5.2 ocxl code was also exposed to the 2nd window 
mentioned in the commit log (but the first window is new with the 
refactoring introduced in 5.2-rc1).


  Fred





Reviewed-by: Greg Kurz 


	device_unregister(&info->dev);
  }
  






[PATCH 2/2] powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails

2019-06-24 Thread Vaibhav Jain
In some cases the initial bind of scm memory for an lpar can fail if it
wasn't previously released using a scm-unbind hcall. This situation can
arise due to a panic of the previous kernel or a forced lpar reset. In
such cases H_SCM_BIND_MEM returns an H_OVERLAP error.

To mitigate such cases, this patch updates drc_pmem_bind() to force a
call to drc_pmem_unbind() when the initial bind of scm memory fails
with an H_OVERLAP error. If the scm-bind operation fails again after
the forced scm-unbind, we follow the existing error path.

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index d790e4e4ffb3..049d7927c0a4 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -44,19 +44,26 @@ struct papr_scm_priv {
struct nd_interleave_set nd_set;
 };
 
+/* Forward declaration */
+static int drc_pmem_unbind(struct papr_scm_priv *);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
uint64_t rc, token;
-   uint64_t saved = 0;
+   uint64_t saved;
+   bool tried_unbind = false;
 
+   dev_dbg(&p->pdev->dev, "bind drc %x\n", p->drc_index);
/*
 * When the hypervisor cannot map all the requested memory in a single
 * hcall it returns H_BUSY and we call again with the token until
 * we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS
 * leave the system in an undefined state, so we wait.
 */
+retry:
token = 0;
+   saved = 0;
 
do {
rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
@@ -68,8 +75,18 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
} while (rc == H_BUSY);
 
if (rc) {
-   dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
-   return -ENXIO;
+   /* retry after unbinding */
+   if (rc == H_OVERLAP &&  !tried_unbind) {
+   dev_warn(&p->pdev->dev, "Un-binding and retrying\n");
+   /* Try unbind and ignore any errors */
+   tried_unbind = true;
+   drc_pmem_unbind(p);
+   goto retry;
+
+   } else {
+   dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
+   return -ENXIO;
+   }
}
 
p->bound_addr = saved;
-- 
2.21.0



[PATCH 0/2] powerpc/papr_scm: Workaround for failure of drc bind after kexec

2019-06-24 Thread Vaibhav Jain
Presently an error is returned in response to hcall H_SCM_BIND_MEM when a
new kernel boots on an lpar via kexec. This prevents papr_scm from
registering drc memory regions with nvdimm. The error reported is of the
form below:

"papr_scm ibm,persistent-memory:ibm,pmemory@4412: bind err: -68"

On investigation it was revealed that phyp returns this error because the
previous kernel did not completely release its bindings for the drc
scm-memory blocks, and hence phyp rejected the request to re-bind these
blocks to the lpar with an H_OVERLAP error. Support for a new hcall
H_SCM_UNBIND_ALL was also recently added, which is better suited for
releasing all the bound scm-memory blocks from an lpar.

So, leveraging the new hcall H_SCM_UNBIND_ALL, we can work around the
H_OVERLAP issue during kexec by forcing an unbind of all drc scm-memory
blocks before issuing H_SCM_BIND_MEM to re-bind them to the lpar. This
sequence is also needed when a new kernel boots on an lpar after the
previous kernel panicked and never got an opportunity to call
H_SCM_UNBIND_MEM/ALL.

Hence this patch-set implements following changes to papr_scm module:

* Update it to use H_SCM_UNBIND_ALL instead of H_SCM_UNBIND_MEM

* In case hcall H_SCM_BIND_MEM fails with error H_OVERLAP, force
  H_SCM_UNBIND_ALL and retry the bind operation again.

With the patch-set applied re-bind of drc scm-memory to lpar succeeds after
a kexec to new kernel as illustrated below:

# Old kernel
$ sudo ndctl list -R
[
  {
"dev":"region0",


  }
]
# kexec to new kernel
$ sudo kexec --initrd=... vmlinux
...
...
I'm in purgatory
...
papr_scm ibm,persistent-memory:ibm,pmemory@4412: Un-binding and retrying
...
# New kernel
$ sudo ndctl list -R
$ sudo ndctl list -R
[
  {
"dev":"region0",


  }
]


Vaibhav Jain (2):
  powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
  powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails

 arch/powerpc/include/asm/hvcall.h |  2 +-
 arch/powerpc/platforms/pseries/papr_scm.c | 60 ---
 2 files changed, 53 insertions(+), 9 deletions(-)

-- 
2.21.0



[PATCH 1/2] powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL

2019-06-24 Thread Vaibhav Jain
A new hcall named H_SCM_UNBIND_ALL has been introduced that can unbind
all the drc memory-blocks assigned to an lpar. This is more efficient
than using H_SCM_UNBIND_MEM, as we currently don't support partial
unbind of drc memory-blocks.

Hence this patch proposes the following changes to drc_pmem_unbind():

* Update drc_pmem_unbind() to replace hcall H_SCM_UNBIND_MEM with
  H_SCM_UNBIND_ALL.

* Update drc_pmem_unbind() to handle cases when PHYP asks the guest
  kernel to wait for a specific amount of time before retrying the
  hcall, via the 'LONG_BUSY' return value. If it takes more than
  1 second to unbind the memory, log a warning.

* Ensure an appropriate error code is returned back from the function
  in case of an error.

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/include/asm/hvcall.h |  2 +-
 arch/powerpc/platforms/pseries/papr_scm.c | 37 ---
 2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 463c63a9fcf1..bb56fa0f976c 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -302,7 +302,7 @@
 #define H_SCM_UNBIND_MEM 0x3F0
 #define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
 #define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
-#define H_SCM_MEM_QUERY 0x3FC
+#define H_SCM_UNBIND_ALL 0x3FC
 #define H_SCM_BLOCK_CLEAR   0x400
 #define MAX_HCALL_OPCODE   H_SCM_BLOCK_CLEAR
 
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 96c53b23e58f..d790e4e4ffb3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -11,11 +11,16 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
 #define BIND_ANY_ADDR (~0ul)
 
+/* Scope args for H_SCM_UNBIND_ALL */
+#define H_UNBIND_SCOPE_ALL (0x1)
+#define H_UNBIND_SCOPE_DRC (0x2)
+
 #define PAPR_SCM_DIMM_CMD_MASK \
((1ul << ND_CMD_GET_CONFIG_SIZE) | \
 (1ul << ND_CMD_GET_CONFIG_DATA) | \
@@ -78,21 +83,43 @@ static int drc_pmem_unbind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
uint64_t rc, token;
+   unsigned long starttime;
 
token = 0;
 
-   /* NB: unbind has the same retry requirements mentioned above */
+   dev_dbg(&p->pdev->dev, "unbind drc %x\n", p->drc_index);
+
+   /* NB: unbind_all has the same retry requirements as drc_pmem_bind() */
+   starttime = HZ;
do {
-   rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index,
-   p->bound_addr, p->blocks, token);
+   /* If this is taking too much time then log warning */
+   if (jiffies_to_msecs(HZ - starttime) > 1000) {
+   dev_warn(&p->pdev->dev,
+"unbind taking > 1s to complete\n");
+   starttime = HZ;
+   }
+
+   /* Unbind of all SCM resources associated with drcIndex */
+   rc = plpar_hcall(H_SCM_UNBIND_ALL, ret, H_UNBIND_SCOPE_DRC,
+p->drc_index, token);
token = ret[0];
-   cond_resched();
+
+   /* Check if we are stalled for some time */
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   token = 0;
+
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+
} while (rc == H_BUSY);
 
if (rc)
		dev_err(&p->pdev->dev, "unbind error: %lld\n", rc);
 
-   return !!rc;
+   return rc == H_SUCCESS ? 0 : -ENXIO;
 }
 
 static int papr_scm_meta_get(struct papr_scm_priv *p,
-- 
2.21.0



CVE-2019-12817: Linux kernel: powerpc: Unrelated processes may be able to read/write to each other's virtual memory

2019-06-24 Thread Michael Ellerman
The Linux kernel for powerpc since 4.17 has a bug where unrelated processes may
be able to read/write to each other's virtual memory under certain conditions.

This bug only affects machines using 64-bit CPUs with the hash page table MMU,
see below for more detail on affected CPUs.

To trigger the bug a process must allocate memory above 512TB. That only happens
if userspace explicitly requests it with mmap(). That process must then fork(),
at this point the child incorrectly inherits the "context id" of the parent
associated with the mapping above 512TB. It may then be possible for the
parent/child to write to each other's mappings above 512TB, which should not be
possible, and constitutes memory corruption.

If instead the child process exits, all its context ids are freed, including the
context id that is still in use by the parent for the mapping above 512TB. That
id can then be reallocated to a third process, that process can then read/write
to the parent's mapping above 512TB. Additionally if the freed id is used for
the third process's primary context id, then the parent is able to read/write to
the third process's mappings *below* 512TB.

If the parent and child both exit before another process is allocated the freed
context id, the kernel will notice the double free of the id and print a warning
such as:

  ida_free called for id=103 which is not allocated.
  WARNING: CPU: 8 PID: 7293 at lib/idr.c:520 ida_free_rc+0x1b4/0x1d0

The bug was introduced in commit:
  f384796c40dc ("powerpc/mm: Add support for handling > 512TB address in SLB 
miss")

Which was originally merged in v4.17.

Only machines using the hash page table (HPT) MMU are affected, eg. PowerPC 970
(G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines (powernv) use
the Radix MMU and are not affected, unless the machine has been explicitly
booted in HPT mode (using disable_radix on the kernel command line). KVM guests
on Power9 may be affected if the host or guest is configured to use the HPT MMU.
LPARs under PowerVM on Power9 are affected as they always use the HPT MMU.
Kernels built with PAGE_SIZE=4K are not affected.

The upstream fix is here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca72d88378b2f2444d3ec145dd442d449d3fefbc

There's also a kernel selftest to verify the fix:

  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16391bfc862342f285195013b73c1394fab28b97

Or a similar standalone version is included below.

cheers


cat > test.c <<EOF
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE MAP_FIXED   // "Should be safe" above 512TB
#endif

int main(void)
{
    int p2c[2], c2p[2], rc, status, c, *p;
    unsigned long page_size;
    pid_t pid;

    page_size = sysconf(_SC_PAGESIZE);
    if (page_size != 65536) {
        printf("Unsupported page size - not affected\n");
        return 1;
    }

    // Create a mapping at 512TB to allocate an extended_id
    p = mmap((void *)(512ul << 40), page_size, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        printf("Error: couldn't mmap(), confirm kernel has 4TB support\n");
        return 1;
    }

    printf("parent writing %p = 1\n", p);
    *p = 1;

    assert(pipe(p2c) != -1 && pipe(c2p) != -1);

    pid = fork();
    if (pid == 0) {
        close(p2c[1]);
        close(c2p[0]);
        assert(read(p2c[0], &c, 1) == 1);

        pid = getpid();
        printf("child writing  %p = %d\n", p, pid);
        *p = pid;

        assert(write(c2p[1], &c, 1) == 1);
        assert(read(p2c[0], &c, 1) == 1);
        exit(0);
    }
    close(p2c[0]);
    close(c2p[1]);

    c = 0;
    assert(write(p2c[1], &c, 1) == 1);
    assert(read(c2p[0], &c, 1) == 1);

    // Prevent compiler optimisation
    asm volatile("" : : : "memory");

    rc = 0;
    printf("parent reading %p = %d\n", p, *p);
    if (*p != 1) {
        printf("Error: BUG! parent saw child's write! *p = %d\n", *p);
        rc = 1;
    }

    assert(write(p2c[1], &c, 1) == 1);
    assert(waitpid(pid, &status, 0) != -1);
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

    if (rc == 0)
        printf("success: test completed OK\n");

    return rc;
}
EOF


signature.asc
Description: PGP signature


Re: [next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR

2019-06-24 Thread David Hildenbrand
On 24.06.19 16:09, Sachin Sant wrote:
> Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls.
> 
> This problem was introduced with next-20190620 (dc636f5d78).
> next-20190619 was last good kernel.
> 
> Reverting following commit allows the kernel to boot.
> 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()
> 
> 
> [0.014409] Using shared cache scheduler topology
> [0.016302] devtmpfs: initialized
> [0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, 
> max_idle_ns: 1911260446275 ns
> [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, 
> linear)
> [0.031575] NET: Registered protocol family 16
> [0.031724] audit: initializing netlink subsys (disabled)
> [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized 
> audit_enabled=0 res=1
> [0.032249] cpuidle: using governor menu
> [0.032403] pstore: Registered nvram as persistent store backend
> [   60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
> [   60.061254] rcu:   0-: (5999 ticks this GP) 
> idle=1ea/1/0x4002 softirq=5/5 fqs=2999 
> [   60.061261](t=6000 jiffies g=-1187 q=0)
> [   60.061265] NMI backtrace for cpu 0
> [   60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.2.0-rc5-next-20190621-autotest-autotest #1
> [   60.061275] Call Trace:
> [   60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 
> (unreliable)
> [   60.061287] [c018ee85f3c0] [c0b6d464] 
> nmi_cpu_backtrace+0x144/0x150
> [   60.061293] [c018ee85f450] [c0b6d61c] 
> nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
> [   60.061300] [c018ee85f4f0] [c00692c8] 
> arch_trigger_cpumask_backtrace+0x28/0x40
> [   60.061306] [c018ee85f510] [c01c5f90] 
> rcu_dump_cpu_stacks+0x10c/0x16c
> [   60.061313] [c018ee85f560] [c01c4fe4] 
> rcu_sched_clock_irq+0x744/0x990
> [   60.061318] [c018ee85f630] [c01d5b58] 
> update_process_times+0x48/0x90
> [   60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
> [   60.061330] [c018ee85f690] [c01ea150] 
> tick_handle_periodic+0x40/0xe0
> [   60.061336] [c018ee85f6d0] [c002b5cc] 
> timer_interrupt+0x10c/0x2e0
> [   60.061342] [c018ee85f730] [c0009204] 
> decrementer_common+0x134/0x140
> [   60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
> [   60.061350] LR = arch_local_irq_restore+0x84/0x90
> [   60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac 
> (unreliable)
> [   60.061364] [c018ee85fa50] [c0b88300] 
> _raw_spin_unlock_irqrestore+0x50/0x80
> [   60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
> [   60.061376] [c018ee85fac0] [c0766ea0] 
> subsys_find_device_by_id+0xf0/0x1a0
> [   60.061382] [c018ee85fb20] [c0797a94] 
> walk_memory_blocks+0x84/0x100
> [   60.061388] [c018ee85fb80] [c0795ea0] 
> link_mem_sections+0x40/0x60
> [   60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
> [   60.061400] [c018ee85fc10] [c0010448] 
> do_one_initcall+0x68/0x2c0
> [   60.061406] [c018ee85fce0] [c0f247dc] 
> kernel_init_freeable+0x318/0x47c
> [   60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
> [   60.061417] [c018ee85fe20] [c000ba54] 
> ret_from_kernel_thread+0x5c/0x68
> [   88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
> [   88.016569] Modules linked in:
> 

Hi, thanks! Please see

https://lkml.org/lkml/2019/6/21/600

and especially

https://lkml.org/lkml/2019/6/21/908

Does this fix your problem? The fix is on its way to next.

Cheers!

-- 

Thanks,

David / dhildenb


[PATCH] ocxl: Fix concurrent AFU open and device removal

2019-06-24 Thread Frederic Barrat
If an ocxl device is unbound through sysfs at the same time its AFU is
being opened by a user process, the open code may dereference freed
structures, which can lead to kernel oops messages. You'd have to hit a
tiny time window, but it's possible. It's fairly easy to test by
making the time window bigger artificially.

Fix it with a combination of 2 changes:
- when an AFU device is found in the IDR by looking for the device
minor number, we should hold a reference on the device until after the
context is allocated. A reference on the AFU structure is kept when
the context is allocated, so we can release the reference on the
device after the context allocation.
- with the fix above, there's still another even tinier window,
between the time the AFU device is found in the IDR and the reference
on the device is taken. We can fix this one by removing the IDR entry
earlier, when the device setup is removed, instead of waiting for the
'release' device callback. With proper locking around the IDR.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend")
Signed-off-by: Frederic Barrat 
---
mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think it's 
that important. If it's for the next merge window, I would add:
Cc: sta...@vger.kernel.org  # v5.2


 drivers/misc/ocxl/file.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index 2870c25da166..4d1b44de1492 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -18,18 +18,15 @@ static struct class *ocxl_class;
 static struct mutex minors_idr_lock;
 static struct idr minors_idr;
 
-static struct ocxl_file_info *find_file_info(dev_t devno)
+static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
 {
struct ocxl_file_info *info;
 
-   /*
-* We don't declare an RCU critical section here, as our AFU
-* is protected by a reference counter on the device. By the time the
-* info reference is removed from the idr, the ref count of
-* the device is already at 0, so no user API will access that AFU and
-* this function can't return it.
-*/
+   mutex_lock(&minors_idr_lock);
info = idr_find(&minors_idr, MINOR(devno));
+   if (info)
+   get_device(&info->dev);
+   mutex_unlock(&minors_idr_lock);
return info;
 }
 
@@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file *file)
 
pr_debug("%s for device %x\n", __func__, inode->i_rdev);
 
-   info = find_file_info(inode->i_rdev);
+   info = find_and_get_file_info(inode->i_rdev);
if (!info)
return -ENODEV;
 
rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);
-   if (rc)
+   if (rc) {
+   put_device(&info->dev);
return rc;
-
+   }
+   put_device(&info->dev);
file->private_data = ctx;
return 0;
 }
@@ -487,7 +486,6 @@ static void info_release(struct device *dev)
 {
struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev);
 
-   free_minor(info);
ocxl_afu_put(info->afu);
kfree(info);
 }
@@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
 
ocxl_file_make_invisible(info);
ocxl_sysfs_unregister_afu(info);
+   free_minor(info);
device_unregister(&info->dev);
 }
 
-- 
2.21.0



[next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR

2019-06-24 Thread Sachin Sant
Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls.

This problem was introduced with next-20190620 (dc636f5d78).
next-20190619 was last good kernel.

Reverting following commit allows the kernel to boot.
2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()


[0.014409] Using shared cache scheduler topology
[0.016302] devtmpfs: initialized
[0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, 
max_idle_ns: 1911260446275 ns
[0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, linear)
[0.031575] NET: Registered protocol family 16
[0.031724] audit: initializing netlink subsys (disabled)
[0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized 
audit_enabled=0 res=1
[0.032249] cpuidle: using governor menu
[0.032403] pstore: Registered nvram as persistent store backend
[   60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
[   60.061254] rcu: 0-: (5999 ticks this GP) 
idle=1ea/1/0x4002 softirq=5/5 fqs=2999 
[   60.061261]  (t=6000 jiffies g=-1187 q=0)
[   60.061265] NMI backtrace for cpu 0
[   60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.2.0-rc5-next-20190621-autotest-autotest #1
[   60.061275] Call Trace:
[   60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 
(unreliable)
[   60.061287] [c018ee85f3c0] [c0b6d464] 
nmi_cpu_backtrace+0x144/0x150
[   60.061293] [c018ee85f450] [c0b6d61c] 
nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
[   60.061300] [c018ee85f4f0] [c00692c8] 
arch_trigger_cpumask_backtrace+0x28/0x40
[   60.061306] [c018ee85f510] [c01c5f90] 
rcu_dump_cpu_stacks+0x10c/0x16c
[   60.061313] [c018ee85f560] [c01c4fe4] 
rcu_sched_clock_irq+0x744/0x990
[   60.061318] [c018ee85f630] [c01d5b58] 
update_process_times+0x48/0x90
[   60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
[   60.061330] [c018ee85f690] [c01ea150] 
tick_handle_periodic+0x40/0xe0
[   60.061336] [c018ee85f6d0] [c002b5cc] timer_interrupt+0x10c/0x2e0
[   60.061342] [c018ee85f730] [c0009204] 
decrementer_common+0x134/0x140
[   60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
[   60.061350] LR = arch_local_irq_restore+0x84/0x90
[   60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac 
(unreliable)
[   60.061364] [c018ee85fa50] [c0b88300] 
_raw_spin_unlock_irqrestore+0x50/0x80
[   60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
[   60.061376] [c018ee85fac0] [c0766ea0] 
subsys_find_device_by_id+0xf0/0x1a0
[   60.061382] [c018ee85fb20] [c0797a94] 
walk_memory_blocks+0x84/0x100
[   60.061388] [c018ee85fb80] [c0795ea0] link_mem_sections+0x40/0x60
[   60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
[   60.061400] [c018ee85fc10] [c0010448] do_one_initcall+0x68/0x2c0
[   60.061406] [c018ee85fce0] [c0f247dc] 
kernel_init_freeable+0x318/0x47c
[   60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
[   60.061417] [c018ee85fe20] [c000ba54] 
ret_from_kernel_thread+0x5c/0x68
[   88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[   88.016569] Modules linked in:


Thanks
-Sachin






Re: [PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation

2019-06-24 Thread Michael Ellerman
Michael Neuling  writes:
> When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
> code currently sets:
> CR0 <- 00 || MSR[TS]
> but according to the ISA it should be:
> CR0 <-  0 || MSR[TS] || 0

Seems bad, what's the worst case impact?

Do we have a test case for this?

> This fixes the bit shift to put the bits in the correct location.

Fixes: ?

cheers

> diff --git a/arch/powerpc/kvm/book3s_hv_tm.c b/arch/powerpc/kvm/book3s_hv_tm.c
> index 888e2609e3..31cd0f327c 100644
> --- a/arch/powerpc/kvm/book3s_hv_tm.c
> +++ b/arch/powerpc/kvm/book3s_hv_tm.c
> @@ -131,7 +131,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
>   }
>   /* Set CR0 to indicate previous transactional state */
>   vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
>   /* L=1 => tresume, L=0 => tsuspend */
>   if (instr & (1 << 21)) {
>   if (MSR_TM_SUSPENDED(msr))
> @@ -175,7 +175,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
>  
>   /* Set CR0 to indicate previous transactional state */
>   vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
>   vcpu->arch.shregs.msr &= ~MSR_TS_MASK;
>   return RESUME_GUEST;
>  
> @@ -205,7 +205,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
>  
>   /* Set CR0 to indicate previous transactional state */
>   vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
>   vcpu->arch.shregs.msr = msr | MSR_TS_S;
>   return RESUME_GUEST;
>   }
> -- 
> 2.21.0


Re: [PATCH 1/2] powerpc/64s: Rename PPC_INVALIDATE_ERAT to PPC_ARCH_300_INVALIDATE_ERAT

2019-06-24 Thread Michael Ellerman
Nicholas Piggin  writes:
> Segher Boessenkool's on June 23, 2019 10:03 pm:
>> On Sun, Jun 23, 2019 at 08:41:51PM +1000, Nicholas Piggin wrote:
>>> This makes it clear to the caller that it can only be used on POWER9
>>> and later CPUs.
>> 
> -#define PPC_INVALIDATE_ERAT PPC_SLBIA(7)
>>> +#define PPC_ARCH_300_INVALIDATE_ERAT   PPC_SLBIA(7)
>> 
>> The architecture version is 3.0 (or 3.0B), not "300".  This will work on
>> implementations of later architecture versions as well, so maybe the name
>> isn't so great anyway?
>
> Yeah... this is kernel convention for better or worse. ISA v3.0B
> feature support is called CPU_FTR_ARCH_300, and later architectures
> will advertise that support. For the most part we can use architected
> features (incompatible changes would require additional code).

I'd rather we used 3_0B or something inside the kernel, but I'm not sure
it's worth the churn to rename the existing feature everywhere. And we
can't rename the user visible feature.

But if you're adding a new usage then I'd prefer: PPC_ISA_3_0B_INVALIDATE_ERAT

I dislike "300" because it implies we support ISA v3.0 but we actually
don't, we only support v3.0B.

cheers


Re: [PATCH 4/5] x86: don't use asm-generic/ptrace.h

2019-06-24 Thread Thomas Gleixner
On Mon, 24 Jun 2019, Christoph Hellwig wrote:

> Doing the indirection through macros for the regs accessors just
> makes them harder to read, so implement the helpers directly.
> 
> Note that only the helpers actually used are implemented now.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Ingo Molnar 
> Acked-by: Oleg Nesterov 

Reviewed-by: Thomas Gleixner 


Re: remove asm-generic/ptrace.h v3

2019-06-24 Thread Thomas Gleixner
On Mon, 24 Jun 2019, Christoph Hellwig wrote:
> 
> asm-generic/ptrace.h is a little weird in that it doesn't actually
> implement any functionality, but it provided multiple layers of macros
> that just implement trivial inline functions.  We implement those
> directly in the few architectures and be off with a much simpler
> design.
> 
> I'm not sure which tree is the right place, but maybe this can go through
> the asm-generic tree since it removes an asm-generic header?

Makes sense.

Thanks,

tglx


[PATCH v5 3/4] crypto: talitos - fix hash on SEC1.

2019-06-24 Thread Christophe Leroy
On SEC1, hash provides wrong result when performing hashing in several
steps with input data SG list has more than one element. This was
detected with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[   44.185947] alg: hash: md5-talitos test failed (wrong result) on test vector 
6, cfg="random: may_sleep use_finup src_divs=[25.88%@+8063, 
24.19%@+9588, 28.63%@+16333, 4.60%@+6756, 16.70%@+16281] 
dst_divs=[71.61%@alignmask+16361, 14.36%@+7756, 14.3%@+"
[   44.325122] alg: hash: sha1-talitos test failed (wrong result) on test 
vector 3, cfg="random: inplace use_final src_divs=[16.56%@+16378, 
52.0%@+16329, 21.42%@alignmask+16380, 10.2%@alignmask+16380] 
iv_offset=39"
[   44.493500] alg: hash: sha224-talitos test failed (wrong result) on test 
vector 4, cfg="random: use_final nosimd src_divs=[52.27%@+7401, 
17.34%@+16285, 17.71%@+26, 12.68%@+10644] iv_offset=43"
[   44.673262] alg: hash: sha256-talitos test failed (wrong result) on test 
vector 4, cfg="random: may_sleep use_finup src_divs=[60.6%@+12790, 
17.86%@+1329, 12.64%@alignmask+16300, 8.29%@+15, 0.40%@+13506, 
0.51%@+16322, 0.24%@+16339] dst_divs"

This is due to two issues:
- We have an overlap between the buffer used for copying the input
data (SEC1 doesn't do scatter/gather) and the chained descriptor.
- Data copy is wrong when the previous hash left less than one
blocksize of data to hash, implying a complement of the previous
block with a few bytes from the new request.

Fix it by:
- Moving the second descriptor after the buffer, as moving the buffer
after the descriptor would make it more complex for other cipher
operations (AEAD, ABLKCIPHER)
- Skip the bytes taken from the new request to complete the previous
one by moving the SG list forward.

Fixes: 37b5e8897eb5 ("crypto: talitos - chain in buffered data for ahash on SEC1")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 drivers/crypto/talitos.c | 69 +++++++++++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 95dc3957f358..ab6bd45addf7 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -320,6 +320,21 @@ static int talitos_submit(struct device *dev, int ch, struct talitos_desc *desc,
return -EINPROGRESS;
 }
 
+static __be32 get_request_hdr(struct talitos_request *request, bool is_sec1)
+{
+   struct talitos_edesc *edesc;
+
+   if (!is_sec1)
+   return request->desc->hdr;
+
+   if (!request->desc->next_desc)
+   return request->desc->hdr1;
+
+   edesc = container_of(request->desc, struct talitos_edesc, desc);
+
+   return ((struct talitos_desc *)(edesc->buf + edesc->dma_len))->hdr1;
+}
+
 /*
  * process what was done, notify callback of error if not
  */
@@ -341,12 +356,7 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
 
/* descriptors with their done bits set don't get the error */
rmb();
-   if (!is_sec1)
-   hdr = request->desc->hdr;
-   else if (request->desc->next_desc)
-   hdr = (request->desc + 1)->hdr1;
-   else
-   hdr = request->desc->hdr1;
+   hdr = get_request_hdr(request, is_sec1);
 
if ((hdr & DESC_HDR_DONE) == DESC_HDR_DONE)
status = 0;
@@ -476,8 +486,14 @@ static u32 current_desc_hdr(struct device *dev, int ch)
}
}
 
-   if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc)
-   return (priv->chan[ch].fifo[iter].desc + 1)->hdr;
+   if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc) {
+   struct talitos_edesc *edesc;
+
+   edesc = container_of(priv->chan[ch].fifo[iter].desc,
+struct talitos_edesc, desc);
+   return ((struct talitos_desc *)
+   (edesc->buf + edesc->dma_len))->hdr;
+   }
 
return priv->chan[ch].fifo[iter].desc->hdr;
 }
@@ -1402,15 +1418,11 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
edesc->dst_nents = dst_nents;
edesc->iv_dma = iv_dma;
edesc->dma_len = dma_len;
-   if (dma_len) {
-   void *addr = &edesc->link_tbl[0];
-
-   if (is_sec1 && !dst)
-   addr += sizeof(struct talitos_desc);
-   edesc->dma_link_tbl = dma_map_single(dev, addr,
+   if (dma_len)
+   edesc->dma_link_tbl = dma_map_single(dev, &edesc->link_tbl[0],
 edesc->dma_len,
 DMA_BIDIRECTIONAL);
-   }
+
return edesc;
 }
 
@@ -1722,14 +1734,16 @@ static void common_nonsnoop_hash_unmap(struct device *dev,
struct talitos_private *priv = dev_get_drvdata(dev);
bool is_sec1 = has_ftr_sec1(priv);
struct 

[PATCH v5 0/4] *** SUBJECT HERE ***

2019-06-24 Thread Christophe Leroy
This series is the last set of fixes for the Talitos driver.

We now get a fully clean boot on both SEC1 (SEC1.2 on mpc885) and
SEC2 (SEC2.2 on mpc8321E) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[3.385197] bus: 'platform': really_probe: probing driver talitos with 
device ff02.crypto
[3.450982] random: fast init done
[   12.252548] alg: No test for authenc(hmac(md5),cbc(aes)) 
(authenc-hmac-md5-cbc-aes-talitos-hsna)
[   12.262226] alg: No test for authenc(hmac(md5),cbc(des3_ede)) 
(authenc-hmac-md5-cbc-3des-talitos-hsna)
[   43.310737] Bug in SEC1, padding ourself
[   45.603318] random: crng init done
[   54.612333] talitos ff02.crypto: fsl,sec1.2 algorithms registered in 
/proc/crypto
[   54.620232] driver: 'talitos': driver_bound: bound to device 
'ff02.crypto'

[1.193721] bus: 'platform': really_probe: probing driver talitos with 
device b003.crypto
[1.229197] random: fast init done
[2.714920] alg: No test for authenc(hmac(sha224),cbc(aes)) 
(authenc-hmac-sha224-cbc-aes-talitos)
[2.724312] alg: No test for authenc(hmac(sha224),cbc(aes)) 
(authenc-hmac-sha224-cbc-aes-talitos-hsna)
[4.482045] alg: No test for authenc(hmac(md5),cbc(aes)) 
(authenc-hmac-md5-cbc-aes-talitos)
[4.490940] alg: No test for authenc(hmac(md5),cbc(aes)) 
(authenc-hmac-md5-cbc-aes-talitos-hsna)
[4.500280] alg: No test for authenc(hmac(md5),cbc(des3_ede)) 
(authenc-hmac-md5-cbc-3des-talitos)
[4.509727] alg: No test for authenc(hmac(md5),cbc(des3_ede)) 
(authenc-hmac-md5-cbc-3des-talitos-hsna)
[6.631781] random: crng init done
[   11.521795] talitos b003.crypto: fsl,sec2.2 algorithms registered in 
/proc/crypto
[   11.529803] driver: 'talitos': driver_bound: bound to device 
'b003.crypto'

v2: dropped patch 1 which was irrelevant due to a rebase weirdness. Added Cc to 
stable on the 2 first patches.

v3:
 - removed stable reference in patch 1
 - reworded patch 1 to include name of patch 2 for the dependency.
 - mentionned this dependency in patch 2 as well.
 - corrected the Fixes: sha1 in patch 4
 
v4:
 - using scatterwalk_ffwd() instead of opencodying SG list forwarding.
 - Added a patch to fix sg_copy_to_buffer() when sg->offset is greater than
 PAGE_SIZE, otherwise sg_copy_to_buffer() fails when the list has been
 forwarded with scatterwalk_ffwd().
 - taken the patch "crypto: talitos - eliminate unneeded 'done' functions at 
build time"
 out of the series because it is independent.
 - added a helper to find the header field associated to a request in 
flush_channe()
 
v5:
 - Replacing the while loop by a direct shift/mask operation, as suggested by 
Herbert in patch 1.

Christophe Leroy (4):
  lib/scatterlist: Fix mapping iterator when sg->offset is greater than
PAGE_SIZE
  crypto: talitos - move struct talitos_edesc into talitos.h
  crypto: talitos - fix hash on SEC1.
  crypto: talitos - drop icv_ool

 drivers/crypto/talitos.c | 102 +++
 drivers/crypto/talitos.h |  28 +
 lib/scatterlist.c|   9 +++--
 3 files changed, 74 insertions(+), 65 deletions(-)

-- 
2.13.3



[PATCH v5 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE

2019-06-24 Thread Christophe Leroy
All mapping iterator logic is based on the assumption that sg->offset
is always lower than PAGE_SIZE.

But there are situations where sg->offset is such that the SG item
is on the second page. In that case sg_copy_to_buffer() fails to
properly copy the data into the buffer. One of the reasons is
that the data will be outside the kmapped area used to access that
data.

This patch fixes the issue by adjusting the mapping iterator
offset and pgoffset fields such that offset is always lower than
PAGE_SIZE.

Signed-off-by: Christophe Leroy 
Fixes: 4225fc8555a9 ("lib/scatterlist: use page iterator in the mapping iterator")
Cc: sta...@vger.kernel.org
---
 lib/scatterlist.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 739dc9fe2c55..f0757a67affe 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -678,17 +678,18 @@ static bool sg_miter_get_next_page(struct sg_mapping_iter *miter)
 {
if (!miter->__remaining) {
struct scatterlist *sg;
-   unsigned long pgoffset;
 
if (!__sg_page_iter_next(&miter->piter))
return false;
 
sg = miter->piter.sg;
-   pgoffset = miter->piter.sg_pgoffset;
 
-   miter->__offset = pgoffset ? 0 : sg->offset;
+   miter->__offset = miter->piter.sg_pgoffset ? 0 : sg->offset;
+   miter->piter.sg_pgoffset += miter->__offset >> PAGE_SHIFT;
+   miter->__offset &= PAGE_SIZE - 1;
miter->__remaining = sg->offset + sg->length -
-   (pgoffset << PAGE_SHIFT) - miter->__offset;
+(miter->piter.sg_pgoffset << PAGE_SHIFT) -
+miter->__offset;
miter->__remaining = min_t(unsigned long, miter->__remaining,
   PAGE_SIZE - miter->__offset);
}
-- 
2.13.3



[PATCH v5 4/4] crypto: talitos - drop icv_ool

2019-06-24 Thread Christophe Leroy
icv_ool is not used anymore, drop it.

Fixes: e345177ded17 ("crypto: talitos - fix AEAD processing.")
Signed-off-by: Christophe Leroy 
---
 drivers/crypto/talitos.c | 3 ---
 drivers/crypto/talitos.h | 2 --
 2 files changed, 5 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index ab6bd45addf7..c9d686a0e805 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1285,9 +1285,6 @@ static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
 is_ipsec_esp && !encrypt);
tbl_off += ret;
 
-   /* ICV data */
-   edesc->icv_ool = !encrypt;
-
if (!encrypt && is_ipsec_esp) {
struct talitos_ptr *tbl_ptr = &edesc->link_tbl[tbl_off];
 
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 95f78c6d9206..1469b956948a 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -46,7 +46,6 @@ struct talitos_desc {
  * talitos_edesc - s/w-extended descriptor
  * @src_nents: number of segments in input scatterlist
  * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
  * @iv_dma: dma address of iv for checking continuity and link table
  * @dma_len: length of dma mapped link_tbl space
  * @dma_link_tbl: bus physical address of link_tbl/buf
@@ -61,7 +60,6 @@ struct talitos_desc {
 struct talitos_edesc {
int src_nents;
int dst_nents;
-   bool icv_ool;
dma_addr_t iv_dma;
int dma_len;
dma_addr_t dma_link_tbl;
-- 
2.13.3



[PATCH v5 2/4] crypto: talitos - move struct talitos_edesc into talitos.h

2019-06-24 Thread Christophe Leroy
Moves struct talitos_edesc into talitos.h so that it can be used
from any place in talitos.c

It will be required for the next patch ("crypto: talitos - fix hash
on SEC1")

Signed-off-by: Christophe Leroy 
Cc: sta...@vger.kernel.org
---
 drivers/crypto/talitos.c | 30 ------------------------------
 drivers/crypto/talitos.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index a7704b53529d..95dc3957f358 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -951,36 +951,6 @@ static int aead_des3_setkey(struct crypto_aead *authenc,
goto out;
 }
 
-/*
- * talitos_edesc - s/w-extended descriptor
- * @src_nents: number of segments in input scatterlist
- * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
- * @iv_dma: dma address of iv for checking continuity and link table
- * @dma_len: length of dma mapped link_tbl space
- * @dma_link_tbl: bus physical address of link_tbl/buf
- * @desc: h/w descriptor
- * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
- * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
- *
- * if decrypting (with authcheck), or either one of src_nents or dst_nents
- * is greater than 1, an integrity check value is concatenated to the end
- * of link_tbl data
- */
-struct talitos_edesc {
-   int src_nents;
-   int dst_nents;
-   bool icv_ool;
-   dma_addr_t iv_dma;
-   int dma_len;
-   dma_addr_t dma_link_tbl;
-   struct talitos_desc desc;
-   union {
-   struct talitos_ptr link_tbl[0];
-   u8 buf[0];
-   };
-};
-
 static void talitos_sg_unmap(struct device *dev,
 struct talitos_edesc *edesc,
 struct scatterlist *src,
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 32ad4fc679ed..95f78c6d9206 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -42,6 +42,36 @@ struct talitos_desc {
 
 #define TALITOS_DESC_SIZE  (sizeof(struct talitos_desc) - sizeof(__be32))
 
+/*
+ * talitos_edesc - s/w-extended descriptor
+ * @src_nents: number of segments in input scatterlist
+ * @dst_nents: number of segments in output scatterlist
+ * @icv_ool: whether ICV is out-of-line
+ * @iv_dma: dma address of iv for checking continuity and link table
+ * @dma_len: length of dma mapped link_tbl space
+ * @dma_link_tbl: bus physical address of link_tbl/buf
+ * @desc: h/w descriptor
+ * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
+ * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
+ *
+ * if decrypting (with authcheck), or either one of src_nents or dst_nents
+ * is greater than 1, an integrity check value is concatenated to the end
+ * of link_tbl data
+ */
+struct talitos_edesc {
+   int src_nents;
+   int dst_nents;
+   bool icv_ool;
+   dma_addr_t iv_dma;
+   int dma_len;
+   dma_addr_t dma_link_tbl;
+   struct talitos_desc desc;
+   union {
+   struct talitos_ptr link_tbl[0];
+   u8 buf[0];
+   };
+};
+
 /**
  * talitos_request - descriptor submission request
  * @desc: descriptor pointer (kernel virtual)
-- 
2.13.3



Re: [PATCH 3/3] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings

2019-06-24 Thread Anshuman Khandual



On 06/23/2019 03:14 PM, Nicholas Piggin wrote:
> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
> Whether or not a vmap is huge depends on the architecture details,
> alignments, boot options, etc., which the caller can not be expected
> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
> 
> This change teaches vmalloc_to_page about larger pages, and returns
> the struct page that corresponds to the offset within the large page.
> This makes the API agnostic to mapping implementation details.
> 
> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
> fail gracefully on unexpected huge vmap mappings")
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  include/asm-generic/4level-fixup.h |  1 +
>  include/asm-generic/5level-fixup.h |  1 +
>  mm/vmalloc.c   | 37 +++---
>  3 files changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
> index e3667c9a33a5..3cc65a4dd093 100644
> --- a/include/asm-generic/4level-fixup.h
> +++ b/include/asm-generic/4level-fixup.h
> @@ -20,6 +20,7 @@
>  #define pud_none(pud) 0
>  #define pud_bad(pud) 0
>  #define pud_present(pud) 1
> +#define pud_large(pud)   0
>  #define pud_ERROR(pud)   do { } while (0)
>  #define pud_clear(pud)   pgd_clear(pud)
>  #define pud_val(pud) pgd_val(pud)
> diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
> index bb6cb347018c..c4377db09a4f 100644
> --- a/include/asm-generic/5level-fixup.h
> +++ b/include/asm-generic/5level-fixup.h
> @@ -22,6 +22,7 @@
>  #define p4d_none(p4d) 0
>  #define p4d_bad(p4d) 0
>  #define p4d_present(p4d) 1
> +#define p4d_large(p4d)   0
>  #define p4d_ERROR(p4d)   do { } while (0)
>  #define p4d_clear(p4d)   pgd_clear(p4d)
>  #define p4d_val(p4d) pgd_val(p4d)
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4c9e150e5ad3..4be98f700862 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -36,6 +36,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -284,26 +285,36 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>  
>   if (pgd_none(*pgd))
>   return NULL;
> +
>   p4d = p4d_offset(pgd, addr);
>   if (p4d_none(*p4d))
>   return NULL;
> - pud = pud_offset(p4d, addr);
> + if (WARN_ON_ONCE(p4d_bad(*p4d)))
> + return NULL;

The warning here is a required addition, but it needs to be moved after the
p4d_large() check. Please see the next comment below.

> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> + if (p4d_large(*p4d))
> + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
> +#endif
>  
> - /*
> -  * Don't dereference bad PUD or PMD (below) entries. This will also
> -  * identify huge mappings, which we may encounter on architectures
> -  * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
> -  * identified as vmalloc addresses by is_vmalloc_addr(), but are
> -  * not [unambiguously] associated with a struct page, so there is
> -  * no correct value to return for them.
> -  */
> - WARN_ON_ONCE(pud_bad(*pud));
> - if (pud_none(*pud) || pud_bad(*pud))
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud))
> + return NULL;
> + if (WARN_ON_ONCE(pud_bad(*pud)))
>   return NULL;
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> + if (pud_large(*pud))
> + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +#endif
> +

pud_bad() on arm64 returns true when the PUD does not point to a next page
table page implying the fact that it might be a large/huge entry. I am not
sure if the semantics holds good for other architectures too. But on arm64
if pud_large() is true, then pud_bad() will be true as well. So pud_bad()
check must happen after pud_large() check. So the sequence here should be

1. pud_none()   --> Nothing is in here, return NULL
2. pud_large()  --> Return offset page address from the huge page mapping
3. pud_bad()--> Return NULL as there is no more page table level left

Checking pud_bad() first can return NULL for a valid huge mapping.

>   pmd = pmd_offset(pud, addr);
> - WARN_ON_ONCE(pmd_bad(*pmd));
> - if (pmd_none(*pmd) || pmd_bad(*pmd))
> + if (pmd_none(*pmd))
> + return NULL;
> + if (WARN_ON_ONCE(pmd_bad(*pmd)))
>   return NULL;
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> + if (pmd_large(*pmd))
> + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +#endif

Ditto.

I see that your previous proposal had this right, and it got changed in this
manner after my comments. Sorry about that.

[PATCH v1] powerpc: Fix BUG_ON during memory unplug on radix

2019-06-24 Thread Bharata B Rao
We hit the following BUG_ON when memory hotplugged before reboot
is unplugged after reboot:

kernel BUG at arch/powerpc/mm/pgtable-frag.c:113!

 remove_pagetable+0x594/0x6a0
 (unreliable)
 remove_pagetable+0x94/0x6a0
 vmemmap_free+0x394/0x410
 sparse_remove_one_section+0x26c/0x2e8
 __remove_pages+0x428/0x540
 arch_remove_memory+0xd0/0x170
 __remove_memory+0xd4/0x1a0
 dlpar_remove_lmb+0xbc/0x110
 dlpar_memory+0xa80/0xd20
 handle_dlpar_errorlog+0xa8/0x160
 pseries_hp_work_fn+0x2c/0x60
 process_one_work+0x46c/0x860
 worker_thread+0x364/0x5e0
 kthread+0x1b0/0x1c0
 ret_from_kernel_thread+0x5c/0x68

This occurs because, during reboot-after-hotplug, the hotplugged
memory range gets initialized as regular memory and page
tables are set up using the memblock allocator. This means that we
wouldn't have initialized the PMD or PTE fragment counts for
those PMD or PTE pages.

Fixing this involves three parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment counts appropriately. So PMD and PTE
  table pages allocated during early allocation will have a
  fragment count of 1.
- Convert the pages from memblock pages to regular pages so that
  they can be freed back to the buddy allocator seamlessly. However,
  we do this only for PMD and PTE pages, not for PUD pages.
  PUD pages are freed using kmem_cache_free(), so we need to
  identify memblock pages and free them differently.
- When we do early memblock-based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as a page table page. PAGE_SIZE allocations
  increase the amount of memory used for page tables; an example
  is shown below:

Since we now do PAGE_SIZE allocations for both PUD and PMD tables
(note that PTE table allocation is already PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is an example of how much more we end up allocating for
page tables in case of 64T system RAM:

1. Mapping system RAM

With each PGD entry spanning 512G, 64TB of RAM would need 128 entries
and hence 128 PUD tables. We use 1G mappings at the PUD level (level 2).

With default PUD_TABLE_SIZE(4K), 128*4K=512K (8 64K pages)
With PAGE_SIZE(64K) allocations, 128*64K=8192K (128 64K pages)

2. Mapping struct pages (memmap)

64T of RAM would need 64G for the memmap, with the struct page size being 64B.
Since the memmap array is mapped using 64K mappings, we would need
64 PUD entries or 64 PMD tables (level 3) in total.

With default PMD_TABLE_SIZE(4K), 64*4K=256K (4 64K pages)
With PAGE_SIZE(64K) allocations, 64*64K=4096K (64 64K pages)

There is no change in PTE table (level 4) allocation requirement as
early page table allocation is already using PAGE_SIZE PTE tables.

So essentially with this change we would use 180 more 64K pages
for a 64T system.

Reported-by: Srikanth Aithal 
Signed-off-by: Bharata B Rao 
---
v0 - https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192242.html
Changes in v1:
- Handling PUD table freeing too.
- Added details about how much extra memory we use up with
  this approach into the commit message
- A few cleanups and renames

 arch/powerpc/include/asm/book3s/64/pgalloc.h |  7 +-
 arch/powerpc/include/asm/book3s/64/radix.h   |  1 +
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 15 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 79 +++-
 arch/powerpc/mm/mem.c|  5 ++
 6 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index d5a44912902f..9ae134f260be 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -111,7 +111,12 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, 
unsigned long addr)
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+   struct page *page = virt_to_page(pud);
+
+   if (PageReserved(page))
+   free_reserved_page(page);
+   else
+   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 574eca33f893..4320f2790e8d 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int radix__create_section_mapping(unsigned long start, unsigned long end, int 
nid);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
+void radix__fixup_pgtable_fragments(void);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
 #endif
diff --git a/arch/powerpc/include/asm/sparsemem.h 

Re: [PATCH 0/3] fix vmalloc_to_page for huge vmap mappings

2019-06-24 Thread Anshuman Khandual
Hello Nicholas,

On 06/23/2019 03:14 PM, Nicholas Piggin wrote:
> This is a change broken out from the huge vmap vmalloc series as
> requested. There is a little bit of dependency juggling across

Thanks for splitting up the previous series and sending this one out.


> trees, but patches are pretty trivial. Ideally if Andrew accepts
> this patch and queues it up for next, then the arch patches would
> be merged through those trees then patch 3 gets sent by Andrew.

Fair enough.

> 
> I've tested this with other powerpc and vmalloc patches, with code
> that explicitly tests vmalloc_to_page on vmalloced memory and
> results look fine.

- Anshuman