Re: [PATCH] ocxl: Update for AFU descriptor template version 1.1
On 5/6/19 9:15 pm, Frederic Barrat wrote:
> From: Alastair D'Silva
>
> The OpenCAPI discovery and configuration specification has been updated
> and introduces version 1.1 of the AFU descriptor template, with new
> fields to better define the memory layout of an OpenCAPI adapter. The
> ocxl driver doesn't do much yet to support LPC memory but as we start
> seeing (non-LPC) AFU images using the new template, this patch updates
> the config space parsing code to avoid spitting a warning.
>
> Signed-off-by: Alastair D'Silva
> Signed-off-by: Frederic Barrat

Acked-by: Andrew Donnellan

Alastair: you should fix your editor config :)

https://openpower.xyz/job/snowpatch/job/snowpatch-linux-checkpatch/7564//artifact/linux/checkpatch.log

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com             IBM Australia Limited
Re: [PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation
On Mon, 2019-06-24 at 21:48 +1000, Michael Ellerman wrote:
> Michael Neuling writes:
> > When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
> > code currently sets:
> >     CR0 <- 00 || MSR[TS]
> > but according to the ISA it should be:
> >     CR0 <- 0 || MSR[TS] || 0
>
> Seems bad, what's the worst case impact?

It's a data integrity issue as CR0 is corrupted.

> Do we have a test case for this?

Suraj has a KVM unit test for it.

> > This fixes the bit shift to put the bits in the correct location.
>
> Fixes: ?

It's been around since we first wrote the code so:

Fixes: 4bb3c7a0208fc13c ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")

Mikey
Re: [PATCH] powerpc: Document xive=off option
Michael Neuling writes:
> commit 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE
> interrupt controller") added an option to turn off Linux native XIVE
> usage via the xive=off kernel command line option.
>
> This documents this option.
>
> Signed-off-by: Michael Neuling

Acked-by: Stewart Smith

--
Stewart Smith
OPAL Architect, IBM.
Re: [PATCH 2/2] powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails
On Tue, Jun 25, 2019 at 12:59 AM Vaibhav Jain wrote:
>
> In some cases initial bind of scm memory for an lpar can fail if
> previously it wasn't released using a scm-unbind hcall. This situation
> can arise due to panic of the previous kernel or forced lpar reset. In
> such cases the H_SCM_BIND_MEM returns a H_OVERLAP error.

What is a forced lpar reset? fadump?

> To mitigate such cases the patch updates drc_pmem_bind() to force a
> call to drc_pmem_unbind() in case the initial bind of scm memory fails
> with H_OVERLAP error. In case the scm-bind operation again fails after
> the forced scm-unbind then we follow the existing error path.
>
> Signed-off-by: Vaibhav Jain
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index d790e4e4ffb3..049d7927c0a4 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -44,19 +44,26 @@ struct papr_scm_priv {
>         struct nd_interleave_set nd_set;
>  };
>
> +/* Forward declaration */

pointless comment.

> +static int drc_pmem_unbind(struct papr_scm_priv *);
>
>  static int drc_pmem_bind(struct papr_scm_priv *p)
>  {
>         unsigned long ret[PLPAR_HCALL_BUFSIZE];
>         uint64_t rc, token;
> -       uint64_t saved = 0;
> +       uint64_t saved;
> +       bool tried_unbind = false;

nit: kernel style uses reverse christmas tree declarations, so this
should be:

	unsigned long ret[PLPAR_HCALL_BUFSIZE];
	bool tried_unbind = false;
	uint64_t rc, token;
	uint64_t saved;

Come to think of it rc should probably be signed since the hcall return
codes are negative. I'm surprised that's not causing a warning since we
use %lld to print rc rather than %llu.

> +       dev_dbg(&p->pdev->dev, "bind drc %x\n", p->drc_index);
>         /*
>          * When the hypervisor cannot map all the requested memory in a single
>          * hcall it returns H_BUSY and we call again with the token until
>          * we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS
>          * leave the system in an undefined state, so we wait.
>          */
> +retry:
>         token = 0;
> +       saved = 0;
>
>         do {
>                 rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
> @@ -68,8 +75,18 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
>         } while (rc == H_BUSY);
>
>         if (rc) {
> -               dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
> -               return -ENXIO;
> +               /* retry after unbinding */
> +               if (rc == H_OVERLAP && !tried_unbind) {
> +                       dev_warn(&p->pdev->dev, "Un-binding and retrying\n");
> +                       /* Try unbind and ignore any errors */
> +                       tried_unbind = true;
> +                       drc_pmem_unbind(p);
> +                       goto retry;

I think it'd be cleaner if we put the unbind-and-retry logic into the
probe function, e.g:

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 96c53b23e58f..d113779fc27c 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -316,6 +316,14 @@ static int papr_scm_probe(struct platform_device *pdev)
 
 	/* request the hypervisor to bind this region to somewhere in memory */
 	rc = drc_pmem_bind(p);
+	if (rc == -EBUSY) {
+		/*
+		 * If we kexec()ed the previous kernel might have left the DRC
+		 * bound in memory. Unbind it and try again.
+		 */
+		drc_pmem_unbind(p);
+		rc = drc_pmem_bind(p);
+	}
 	if (rc)
 		goto err;

Up to you though.
Re: [PATCH v4 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
On Mon, Jun 24, 2019 at 08:35:33PM +0300, Imre Deak wrote:
> Hi,
>
> On Thu, Jun 20, 2019 at 02:02:21PM +0800, Herbert Xu wrote:
> > On Mon, Jun 17, 2019 at 09:15:02PM +0000, Christophe Leroy wrote:
> > > All mapping iterator logic is based on the assumption that sg->offset
> > > is always lower than PAGE_SIZE.
> > >
> > > But there are situations where sg->offset is such that the SG item
> > > is on the second page.
>
> could you explain how sg->offset becomes >= PAGE_SIZE?

The network stack can produce SG list elements that are longer than
PAGE_SIZE. If you then iterate over such an element one page at a time,
the offset can exceed PAGE_SIZE.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration
There's some concern this could retry forever, resulting in livelock.
Another approach would be to abandon the attempt and fail the LPM
request.

https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192531.html

On 6/24/19 12:23 PM, Nathan Lynch wrote:
> Hi Mingming,
>
> mmc writes:
> > On 2019-06-21 00:05, Nathan Lynch wrote:
> > > So return -EAGAIN instead of -EBUSY when this race is
> > > encountered. Additionally: logging this event is still appropriate but
> > > use pr_info instead of pr_err; and remove use of unlikely() while here
> > > as this is not a hot path at all.
> >
> > Looks good, since it's not a hot path anyway, so unlikely() should
> > benefit from optimize compiler path, and should stay. No?
>
> The latency of this path in rtas_ibm_suspend_me() in the best case is
> around 2-3 seconds. So I think not -- this is such a heavyweight and
> relatively seldom-executed path that the unlikely() cannot yield any
> discernible performance benefit, and its presence imposes a readability
> cost.
Re: [PATCH 1/2] powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
On Tue, Jun 25, 2019 at 1:03 AM Vaibhav Jain wrote:
>
> A new hcall named H_SCM_UNBIND_ALL has been introduced that can
> unbind all the drc memory-blocks assigned to an lpar. This is
> more efficient than using H_SCM_UNBIND_MEM as currently we don't
> support partial unbind of drc memory-blocks.
>
> Hence this patch proposes the following changes to drc_pmem_unbind():
>
> * Update drc_pmem_unbind() to replace hcall H_SCM_UNBIND_MEM with
>   H_SCM_UNBIND_ALL.
>
> * Update drc_pmem_unbind() to handle cases when PHYP asks the guest
>   kernel to wait for a specific amount of time before retrying the
>   hcall via the 'LONG_BUSY' return value. In case it takes more
>   than 1 second to unbind the memory, log a warning.

Have you benchmarked how long a typical bind operation takes for very
large (1+TB) volumes? I remember it taking a while for the driver to
finish probing and I'd rather we didn't add spurious warnings. That
might have been due to the time taken to online memory rather than the
bind though.

> * Ensure an appropriate error code is returned back from the function
>   in case of an error.
>
> Signed-off-by: Vaibhav Jain
> ---
>  arch/powerpc/include/asm/hvcall.h         |  2 +-
>  arch/powerpc/platforms/pseries/papr_scm.c | 37 ---
>  2 files changed, 33 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/hvcall.h
> b/arch/powerpc/include/asm/hvcall.h
> index 463c63a9fcf1..bb56fa0f976c 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -302,7 +302,7 @@
>  #define H_SCM_UNBIND_MEM        0x3F0
>  #define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
>  #define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
> -#define H_SCM_MEM_QUERY         0x3FC
> +#define H_SCM_UNBIND_ALL        0x3FC
>  #define H_SCM_BLOCK_CLEAR       0x400
>  #define MAX_HCALL_OPCODE        H_SCM_BLOCK_CLEAR

We should probably update all these to match the latest spec. Can you
do that in a separate patch?

> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 96c53b23e58f..d790e4e4ffb3 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -11,11 +11,16 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>
>  #define BIND_ANY_ADDR (~0ul)
>
> +/* Scope args for H_SCM_UNBIND_ALL */
> +#define H_UNBIND_SCOPE_ALL (0x1)
> +#define H_UNBIND_SCOPE_DRC (0x2)
> +
>  #define PAPR_SCM_DIMM_CMD_MASK \
>         ((1ul << ND_CMD_GET_CONFIG_SIZE) | \
>          (1ul << ND_CMD_GET_CONFIG_DATA) | \
> @@ -78,21 +83,43 @@ static int drc_pmem_unbind(struct papr_scm_priv *p)
>  {
>         unsigned long ret[PLPAR_HCALL_BUFSIZE];
>         uint64_t rc, token;
> +       unsigned long starttime;
>
>         token = 0;
>
> -       /* NB: unbind has the same retry requirements mentioned above */
> +       dev_dbg(&p->pdev->dev, "unbind drc %x\n", p->drc_index);
> +
> +       /* NB: unbind_all has the same retry requirements as drc_pmem_bind() */
> +       starttime = HZ;
>         do {
> -               rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index,
> -                                p->bound_addr, p->blocks, token);
> +               /* If this is taking too much time then log warning */
> +               if (jiffies_to_msecs(HZ - starttime) > 1000) {
> +                       dev_warn(&p->pdev->dev,
> +                                "unbind taking > 1s to complete\n");
> +                       starttime = HZ;
> +               }
> +
> +               /* Unbind of all SCM resources associated with drcIndex */
> +               rc = plpar_hcall(H_SCM_UNBIND_ALL, ret, H_UNBIND_SCOPE_DRC,
> +                                p->drc_index, token);
>                 token = ret[0];
> -               cond_resched();
> +
> +               /* Check if we are stalled for some time */
> +               if (H_IS_LONG_BUSY(rc)) {
> +                       msleep(get_longbusy_msecs(rc));
> +                       rc = H_BUSY;
> +                       token = 0;

The hypervisor should be returning a continue token if the return code
is H_BUSY or any of the H_LONG_BUSY codes. The spec says that if we get
a busy return value then we must use the continue token for the next
unbind, so why are you zeroing it here? If this is required to make it
work then it's a bug in the hypervisor.

> +               } else if (rc == H_BUSY) {
> +                       cond_resched();
> +               }
> +
>         } while (rc == H_BUSY);
>
>         if (rc)
>                 dev_err(&p->pdev->dev, "unbind error: %lld\n", rc);

A dev_dbg() for the unbind time might be nice.

>
> -       return !!rc;
> +       return rc == H_SUCCESS ? 0 : -ENXIO;
>  }
>
>  static int papr_scm_meta_get(struct papr_scm_priv *p,
> --
> 2.21.0
>
Re: [PATCH 3/3] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
Anshuman Khandual's on June 24, 2019 4:52 pm:
>
> On 06/23/2019 03:14 PM, Nicholas Piggin wrote:
>> vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
>> Whether or not a vmap is huge depends on the architecture details,
>> alignments, boot options, etc., which the caller can not be expected
>> to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.
>>
>> This change teaches vmalloc_to_page about larger pages, and returns
>> the struct page that corresponds to the offset within the large page.
>> This makes the API agnostic to mapping implementation details.
>>
>> [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
>> fail gracefully on unexpected huge vmap mappings")
>>
>> Signed-off-by: Nicholas Piggin
>> ---
>>  include/asm-generic/4level-fixup.h |  1 +
>>  include/asm-generic/5level-fixup.h |  1 +
>>  mm/vmalloc.c                       | 37 +++---
>>  3 files changed, 26 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/asm-generic/4level-fixup.h
>> b/include/asm-generic/4level-fixup.h
>> index e3667c9a33a5..3cc65a4dd093 100644
>> --- a/include/asm-generic/4level-fixup.h
>> +++ b/include/asm-generic/4level-fixup.h
>> @@ -20,6 +20,7 @@
>>  #define pud_none(pud)           0
>>  #define pud_bad(pud)            0
>>  #define pud_present(pud)        1
>> +#define pud_large(pud)          0
>>  #define pud_ERROR(pud)          do { } while (0)
>>  #define pud_clear(pud)          pgd_clear(pud)
>>  #define pud_val(pud)            pgd_val(pud)
>> diff --git a/include/asm-generic/5level-fixup.h
>> b/include/asm-generic/5level-fixup.h
>> index bb6cb347018c..c4377db09a4f 100644
>> --- a/include/asm-generic/5level-fixup.h
>> +++ b/include/asm-generic/5level-fixup.h
>> @@ -22,6 +22,7 @@
>>  #define p4d_none(p4d)           0
>>  #define p4d_bad(p4d)            0
>>  #define p4d_present(p4d)        1
>> +#define p4d_large(p4d)          0
>>  #define p4d_ERROR(p4d)          do { } while (0)
>>  #define p4d_clear(p4d)          pgd_clear(p4d)
>>  #define p4d_val(p4d)            pgd_val(p4d)
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 4c9e150e5ad3..4be98f700862 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -36,6 +36,7 @@
>>  #include
>>
>>  #include
>> +#include
>>  #include
>>  #include
>>
>> @@ -284,26 +285,36 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>
>>  	if (pgd_none(*pgd))
>>  		return NULL;
>> +
>>  	p4d = p4d_offset(pgd, addr);
>>  	if (p4d_none(*p4d))
>>  		return NULL;
>> -	pud = pud_offset(p4d, addr);
>> +	if (WARN_ON_ONCE(p4d_bad(*p4d)))
>> +		return NULL;
>
> The warning here is a required addition but it needs to be moved after
> the p4d_large() check. Please see the next comment below.
>
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (p4d_large(*p4d))
>> +		return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
>> +#endif
>>
>> -	/*
>> -	 * Don't dereference bad PUD or PMD (below) entries. This will also
>> -	 * identify huge mappings, which we may encounter on architectures
>> -	 * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
>> -	 * identified as vmalloc addresses by is_vmalloc_addr(), but are
>> -	 * not [unambiguously] associated with a struct page, so there is
>> -	 * no correct value to return for them.
>> -	 */
>> -	WARN_ON_ONCE(pud_bad(*pud));
>> -	if (pud_none(*pud) || pud_bad(*pud))
>> +	pud = pud_offset(p4d, addr);
>> +	if (pud_none(*pud))
>> +		return NULL;
>> +	if (WARN_ON_ONCE(pud_bad(*pud)))
>>  		return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (pud_large(*pud))
>> +		return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>> +#endif
>> +
>
> pud_bad() on arm64 returns true when the PUD does not point to a next page
> table page, implying the fact that it might be a large/huge entry. I am not
> sure if the semantics holds good for other architectures too. But on arm64
> if pud_large() is true, then pud_bad() will be true as well. So the pud_bad()
> check must happen after the pud_large() check. So the sequence here should be
>
> 1. pud_none()  --> Nothing is in here, return NULL
> 2. pud_large() --> Return offset page address from the huge page mapping
> 3. pud_bad()   --> Return NULL as there is no more page table level left
>
> Checking pud_bad() first can return NULL for a valid huge mapping.
>
>>  	pmd = pmd_offset(pud, addr);
>> -	WARN_ON_ONCE(pmd_bad(*pmd));
>> -	if (pmd_none(*pmd) || pmd_bad(*pmd))
>> +	if (pmd_none(*pmd))
>> +		return NULL;
>> +	if (WARN_ON_ONCE(pmd_bad(*pmd)))
>>  		return NULL;
>> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>> +	if (pmd_large(*pmd))
>> +		return
[PATCH] powerpc/rtas: Fix hang in race against concurrent cpu offline
From 1bd2bf7146876e099eafb292668f9a12dc22f526 Mon Sep 17 00:00:00 2001
From: Juliet Kim
Date: Mon, 24 Jun 2019 17:17:46 -0400
Subject: [PATCH 1/1] powerpc/rtas: Fix hang in race against concurrent cpu offline
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit dfd718a2ed1f ("powerpc/rtas: Fix a potential race between
CPU-Offline & Migration") attempted to fix a hang in Live Partition
Mobility (LPM) by abandoning the LPM attempt if a race between LPM and
concurrent CPU offline was detected.

However, that fix failed to notify the hypervisor that the LPM attempt
had been abandoned, which results in a system hang.

Fix this by sending a signal to PHYP to cancel the migration, so that
PHYP can stop waiting and clean up the migration.

Fixes: dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration")
Signed-off-by: Juliet Kim
---
This is an alternate solution to the one Nathan proposed in:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192414.html
---
 arch/powerpc/include/asm/hvcall.h | 7 +++++++
 arch/powerpc/kernel/rtas.c        | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 463c63a..29ca285 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -261,6 +261,7 @@
 #define H_ADD_CONN		0x284
 #define H_DEL_CONN		0x288
 #define H_JOIN			0x298
+#define H_VASI_SIGNAL		0x2A0
 #define H_VASI_STATE		0x2A4
 #define H_VIOCTL		0x2A8
 #define H_ENABLE_CRQ		0x2B0
@@ -348,6 +349,12 @@
 #define H_SIGNAL_SYS_RESET_ALL_OTHERS	-2
 /* >= 0 values are CPU number */
 
+/* Values for argument to H_VASI_SIGNAL */
+#define H_SIGNAL_CANCEL_MIG	0x01
+
+/* Values for 2nd argument to H_VASI_SIGNAL */
+#define H_CPU_OFFLINE_DETECTED	0x0604
+
 /* H_GET_CPU_CHARACTERISTICS return values */
 #define H_CPU_CHAR_SPEC_BAR_ORI31	(1ull << 63) // IBM bit 0
 #define H_CPU_CHAR_BCCTRL_SERIALISED	(1ull << 62) // IBM bit 1
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index b824f4c..f9002b7 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -981,6 +981,14 @@ int rtas_ibm_suspend_me(u64 handle)
 
 	/* Check if we raced with a CPU-Offline Operation */
 	if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
+
+		/* We use CANCEL, not ABORT, to gracefully cancel migration */
+		rc = plpar_hcall_norets(H_VASI_SIGNAL, handle,
+				H_SIGNAL_CANCEL_MIG, H_CPU_OFFLINE_DETECTED);
+
+		if (rc != H_SUCCESS)
+			pr_err("%s: vasi_signal failed %ld\n", __func__, rc);
+
 		pr_err("%s: Raced against a concurrent CPU-Offline\n",
 		       __func__);
 		atomic_set(&data.error, -EBUSY);
-- 
1.8.3.1
Re: [PATCH v4 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
Hi,

On Thu, Jun 20, 2019 at 02:02:21PM +0800, Herbert Xu wrote:
> On Mon, Jun 17, 2019 at 09:15:02PM +0000, Christophe Leroy wrote:
> > All mapping iterator logic is based on the assumption that sg->offset
> > is always lower than PAGE_SIZE.
> >
> > But there are situations where sg->offset is such that the SG item
> > is on the second page.

could you explain how sg->offset becomes >= PAGE_SIZE?

--Imre

> > In that case sg_copy_to_buffer() fails to
> > properly copy the data into the buffer. One of the reasons is
> > that the data will be outside the kmapped area used to access that
> > data.
> >
> > This patch fixes the issue by adjusting the mapping iterator
> > offset and pgoffset fields such that offset is always lower than
> > PAGE_SIZE.
> >
> > Signed-off-by: Christophe Leroy
> > Fixes: 4225fc8555a9 ("lib/scatterlist: use page iterator in the mapping
> > iterator")
> > Cc: sta...@vger.kernel.org
> > ---
> >  lib/scatterlist.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
>
> Good catch.
>
> > @@ -686,7 +686,12 @@ static bool sg_miter_get_next_page(struct
> > sg_mapping_iter *miter)
> >  		sg = miter->piter.sg;
> >  		pgoffset = miter->piter.sg_pgoffset;
> >
> > -		miter->__offset = pgoffset ? 0 : sg->offset;
> > +		offset = pgoffset ? 0 : sg->offset;
> > +		while (offset >= PAGE_SIZE) {
> > +			miter->piter.sg_pgoffset = ++pgoffset;
> > +			offset -= PAGE_SIZE;
> > +		}
>
> How about
>
> 	miter->piter.sg_pgoffset += offset >> PAGE_SHIFT;
> 	offset &= PAGE_SIZE - 1;
>
> Thanks,
> --
> Email: Herbert Xu
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 5/5] asm-generic: remove ptrace.h
On Mon, Jun 24, 2019 at 07:47:28AM +0200, Christoph Hellwig wrote:
> No one is using this header anymore.
>
> Signed-off-by: Christoph Hellwig
> Acked-by: Arnd Bergmann
> Acked-by: Oleg Nesterov
> ---
>  MAINTAINERS                    |  1 -
>  arch/mips/include/asm/ptrace.h |  5 ---
>  include/asm-generic/ptrace.h   | 73 --
>  3 files changed, 79 deletions(-)
>  delete mode 100644 include/asm-generic/ptrace.h

FWIW:

Acked-by: Paul Burton

Thanks,
    Paul

> diff --git a/arch/mips/include/asm/ptrace.h b/arch/mips/include/asm/ptrace.h
> index b6578611dddb..1e76774b36dd 100644
> --- a/arch/mips/include/asm/ptrace.h
> +++ b/arch/mips/include/asm/ptrace.h
> @@ -56,11 +56,6 @@ static inline unsigned long kernel_stack_pointer(struct
> pt_regs *regs)
>  	return regs->regs[31];
>  }
>
> -/*
> - * Don't use asm-generic/ptrace.h it defines FP accessors that don't make
> - * sense on MIPS. We rather want an error if they get invoked.
> - */
> -
>  static inline void instruction_pointer_set(struct pt_regs *regs,
>  					   unsigned long val)
>  {
Re: [PATCH v11 02/13] PKCS#7: Refactor verify_pkcs7_signature()
Hello David,

AFAIK Mimi is happy with this patch set, but I still need acks from
maintainers of other subsystems that my changes touch before she can
accept it.

Are this patch and the next one ("PKCS#7: Introduce pkcs7_get_digest()")
OK from your PoV?

--
Thiago Jung Bauermann
IBM Linux Technology Center

Thiago Jung Bauermann writes:
> IMA will need to verify a PKCS#7 signature which has already been parsed.
> For this reason, factor out the code which does that from
> verify_pkcs7_signature() into a new function which takes a struct
> pkcs7_message instead of a data buffer.
>
> Signed-off-by: Thiago Jung Bauermann
> Reviewed-by: Mimi Zohar
> Cc: David Howells
> Cc: David Woodhouse
> Cc: Herbert Xu
> Cc: "David S. Miller"
> ---
>  certs/system_keyring.c       | 61 ++--
>  include/linux/verification.h | 10 ++
>  2 files changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/certs/system_keyring.c b/certs/system_keyring.c
> index c05c29ae4d5d..4ba82e52e4b4 100644
> --- a/certs/system_keyring.c
> +++ b/certs/system_keyring.c
> @@ -194,33 +194,27 @@ late_initcall(load_system_certificate_list);
>  #ifdef CONFIG_SYSTEM_DATA_VERIFICATION
>
>  /**
> - * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * verify_pkcs7_message_sig - Verify a PKCS#7-based signature on system data.
>   * @data: The data to be verified (NULL if expecting internal data).
>   * @len: Size of @data.
> - * @raw_pkcs7: The PKCS#7 message that is the signature.
> - * @pkcs7_len: The size of @raw_pkcs7.
> + * @pkcs7: The PKCS#7 message that is the signature.
>   * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
>   *					(void *)1UL for all trusted keys).
>   * @usage: The use to which the key is being put.
>   * @view_content: Callback to gain access to content.
>   * @ctx: Context for callback.
>   */
> -int verify_pkcs7_signature(const void *data, size_t len,
> -			   const void *raw_pkcs7, size_t pkcs7_len,
> -			   struct key *trusted_keys,
> -			   enum key_being_used_for usage,
> -			   int (*view_content)(void *ctx,
> -					       const void *data, size_t len,
> -					       size_t asn1hdrlen),
> -			   void *ctx)
> +int verify_pkcs7_message_sig(const void *data, size_t len,
> +			     struct pkcs7_message *pkcs7,
> +			     struct key *trusted_keys,
> +			     enum key_being_used_for usage,
> +			     int (*view_content)(void *ctx,
> +						 const void *data, size_t len,
> +						 size_t asn1hdrlen),
> +			     void *ctx)
>  {
> -	struct pkcs7_message *pkcs7;
>  	int ret;
>
> -	pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> -	if (IS_ERR(pkcs7))
> -		return PTR_ERR(pkcs7);
> -
>  	/* The data should be detached - so we need to supply it. */
>  	if (data && pkcs7_supply_detached_data(pkcs7, data, len) < 0) {
>  		pr_err("PKCS#7 signature with non-detached data\n");
> @@ -273,6 +267,41 @@ int verify_pkcs7_signature(const void *data, size_t len,
>  	}
>
>  error:
> +	pr_devel("<==%s() = %d\n", __func__, ret);
> +	return ret;
> +}
> +
> +/**
> + * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
> + * @data: The data to be verified (NULL if expecting internal data).
> + * @len: Size of @data.
> + * @raw_pkcs7: The PKCS#7 message that is the signature.
> + * @pkcs7_len: The size of @raw_pkcs7.
> + * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
> + *					(void *)1UL for all trusted keys).
> + * @usage: The use to which the key is being put.
> + * @view_content: Callback to gain access to content.
> + * @ctx: Context for callback.
> + */
> +int verify_pkcs7_signature(const void *data, size_t len,
> +			   const void *raw_pkcs7, size_t pkcs7_len,
> +			   struct key *trusted_keys,
> +			   enum key_being_used_for usage,
> +			   int (*view_content)(void *ctx,
> +					       const void *data, size_t len,
> +					       size_t asn1hdrlen),
> +			   void *ctx)
> +{
> +	struct pkcs7_message *pkcs7;
> +	int ret;
> +
> +	pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
> +	if (IS_ERR(pkcs7))
> +		return PTR_ERR(pkcs7);
> +
> +	ret = verify_pkcs7_message_sig(data, len, pkcs7, trusted_keys, usage,
> +				       view_content, ctx);
> +
> 	pkcs7_free_message(pkcs7);
> 	pr_devel("<==%s() = %d\n", __func__, ret);
> 	return ret;
> diff
Re: [PATCH v11 01/13] MODSIGN: Export module signature definitions
Hello Jessica,

AFAIK Mimi is happy with this patch set, but I still need acks from
maintainers of other subsystems that my changes touch before she can
accept it.

Is this patch OK from your PoV?

--
Thiago Jung Bauermann
IBM Linux Technology Center

Thiago Jung Bauermann writes:
> IMA will use the module_signature format for append signatures, so export
> the relevant definitions and factor out the code which verifies that the
> appended signature trailer is valid.
>
> Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
> and be able to use mod_check_sig() without having to depend on either
> CONFIG_MODULE_SIG or CONFIG_MODULES.
>
> Signed-off-by: Thiago Jung Bauermann
> Reviewed-by: Mimi Zohar
> Cc: Jessica Yu
> ---
>  include/linux/module.h           |  3 --
>  include/linux/module_signature.h | 44 +
>  init/Kconfig                     |  6 +++-
>  kernel/Makefile                  |  1 +
>  kernel/module.c                  |  1 +
>  kernel/module_signature.c        | 46 ++
>  kernel/module_signing.c          | 56 +---
>  scripts/Makefile                 |  2 +-
>  8 files changed, 106 insertions(+), 53 deletions(-)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 188998d3dca9..aa56f531cf1e 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -25,9 +25,6 @@
>  #include
>  #include
>  #include
>
> -/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> -#define MODULE_SIG_STRING "~Module signature appended~\n"
> -
>  /* Not Yet Implemented */
>  #define MODULE_SUPPORTED_DEVICE(name)
>
> diff --git a/include/linux/module_signature.h
> b/include/linux/module_signature.h
> new file mode 100644
> index ..523617fc5b6a
> --- /dev/null
> +++ b/include/linux/module_signature.h
> @@ -0,0 +1,44 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +/*
> + * Module signature handling.
> + *
> + * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowe...@redhat.com)
> + */
> +
> +#ifndef _LINUX_MODULE_SIGNATURE_H
> +#define _LINUX_MODULE_SIGNATURE_H
> +
> +/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
> +#define MODULE_SIG_STRING "~Module signature appended~\n"
> +
> +enum pkey_id_type {
> +	PKEY_ID_PGP,	/* OpenPGP generated key ID */
> +	PKEY_ID_X509,	/* X.509 arbitrary subjectKeyIdentifier */
> +	PKEY_ID_PKCS7,	/* Signature in PKCS#7 message */
> +};
> +
> +/*
> + * Module signature information block.
> + *
> + * The constituents of the signature section are, in order:
> + *
> + * - Signer's name
> + * - Key identifier
> + * - Signature data
> + * - Information block
> + */
> +struct module_signature {
> +	u8	algo;		/* Public-key crypto algorithm [0] */
> +	u8	hash;		/* Digest algorithm [0] */
> +	u8	id_type;	/* Key identifier type [PKEY_ID_PKCS7] */
> +	u8	signer_len;	/* Length of signer's name [0] */
> +	u8	key_id_len;	/* Length of key identifier [0] */
> +	u8	__pad[3];
> +	__be32	sig_len;	/* Length of signature data */
> +};
> +
> +int mod_check_sig(const struct module_signature *ms, size_t file_len,
> +		  const char *name);
> +
> +#endif /* _LINUX_MODULE_SIGNATURE_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 8b9ffe236e4f..c2286a3c74c5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1852,6 +1852,10 @@ config BASE_SMALL
>  	default 0 if BASE_FULL
>  	default 1 if !BASE_FULL
>
> +config MODULE_SIG_FORMAT
> +	def_bool n
> +	select SYSTEM_DATA_VERIFICATION
> +
>  menuconfig MODULES
>  	bool "Enable loadable module support"
>  	option modules
> @@ -1929,7 +1933,7 @@ config MODULE_SRCVERSION_ALL
>  config MODULE_SIG
>  	bool "Module signature verification"
>  	depends on MODULES
> -	select SYSTEM_DATA_VERIFICATION
> +	select MODULE_SIG_FORMAT
>  	help
>  	  Check modules for valid signatures upon load: the signature
>  	  is simply appended to the module. For more information see
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 33824f0385b3..f29ae2997a43 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -58,6 +58,7 @@ endif
>  obj-$(CONFIG_UID16) += uid16.o
>  obj-$(CONFIG_MODULES) += module.o
>  obj-$(CONFIG_MODULE_SIG) += module_signing.o
> +obj-$(CONFIG_MODULE_SIG_FORMAT) += module_signature.o
>  obj-$(CONFIG_KALLSYMS) += kallsyms.o
>  obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
>  obj-$(CONFIG_CRASH_CORE) += crash_core.o
> diff --git a/kernel/module.c b/kernel/module.c
> index 6e6712b3aaf5..2712f4d217f5 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -19,6 +19,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> diff --git a/kernel/module_signature.c b/kernel/module_signature.c
> new file mode 100644
>
Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration
Hi Mingming,

mmc writes:
> On 2019-06-21 00:05, Nathan Lynch wrote:
>> So return -EAGAIN instead of -EBUSY when this race is
>> encountered. Additionally: logging this event is still appropriate but
>> use pr_info instead of pr_err; and remove use of unlikely() while here
>> as this is not a hot path at all.
>
> Looks good, since it's not a hot path anyway, so unlikely() should
> benefit from optimize compiler path, and should stay. No?

The latency of this path in rtas_ibm_suspend_me() in the best case is
around 2-3 seconds. So I think not -- this is such a heavyweight and
relatively seldom-executed path that the unlikely() cannot yield any
discernible performance benefit, and its presence imposes a readability
cost.
Re: [next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR
> On 24-Jun-2019, at 8:12 PM, David Hildenbrand wrote:
>
> On 24.06.19 16:09, Sachin Sant wrote:
>> Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls.
>>
>> This problem was introduced with next-20190620 (dc636f5d78).
>> next-20190619 was last good kernel.
>>
>> Reverting following commit allows the kernel to boot.
>> 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks()
>>
>> [0.014409] Using shared cache scheduler topology
>> [0.016302] devtmpfs: initialized
>> [0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275 ns
>> [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, linear)
>> [0.031575] NET: Registered protocol family 16
>> [0.031724] audit: initializing netlink subsys (disabled)
>> [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized audit_enabled=0 res=1
>> [0.032249] cpuidle: using governor menu
>> [0.032403] pstore: Registered nvram as persistent store backend
>> [ 60.061246] rcu: INFO: rcu_sched self-detected stall on CPU
>> [ 60.061254] rcu: 0-: (5999 ticks this GP) idle=1ea/1/0x4002 softirq=5/5 fqs=2999
>> [ 60.061261] (t=6000 jiffies g=-1187 q=0)
>> [ 60.061265] NMI backtrace for cpu 0
>> [ 60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-next-20190621-autotest-autotest #1
>> [ 60.061275] Call Trace:
>> [ 60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 (unreliable)
>> [ 60.061287] [c018ee85f3c0] [c0b6d464] nmi_cpu_backtrace+0x144/0x150
>> [ 60.061293] [c018ee85f450] [c0b6d61c] nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
>> [ 60.061300] [c018ee85f4f0] [c00692c8] arch_trigger_cpumask_backtrace+0x28/0x40
>> [ 60.061306] [c018ee85f510] [c01c5f90] rcu_dump_cpu_stacks+0x10c/0x16c
>> [ 60.061313] [c018ee85f560] [c01c4fe4] rcu_sched_clock_irq+0x744/0x990
>> [ 60.061318] [c018ee85f630] [c01d5b58] update_process_times+0x48/0x90
>> [ 60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120
>> [ 60.061330] [c018ee85f690] [c01ea150] tick_handle_periodic+0x40/0xe0
>> [ 60.061336] [c018ee85f6d0] [c002b5cc] timer_interrupt+0x10c/0x2e0
>> [ 60.061342] [c018ee85f730] [c0009204] decrementer_common+0x134/0x140
>> [ 60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4
>> [ 60.061350] LR = arch_local_irq_restore+0x84/0x90
>> [ 60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac (unreliable)
>> [ 60.061364] [c018ee85fa50] [c0b88300] _raw_spin_unlock_irqrestore+0x50/0x80
>> [ 60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150
>> [ 60.061376] [c018ee85fac0] [c0766ea0] subsys_find_device_by_id+0xf0/0x1a0
>> [ 60.061382] [c018ee85fb20] [c0797a94] walk_memory_blocks+0x84/0x100
>> [ 60.061388] [c018ee85fb80] [c0795ea0] link_mem_sections+0x40/0x60
>> [ 60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268
>> [ 60.061400] [c018ee85fc10] [c0010448] do_one_initcall+0x68/0x2c0
>> [ 60.061406] [c018ee85fce0] [c0f247dc] kernel_init_freeable+0x318/0x47c
>> [ 60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150
>> [ 60.061417] [c018ee85fe20] [c000ba54] ret_from_kernel_thread+0x5c/0x68
>> [ 88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
>> [ 88.016569] Modules linked in:
>>
>
> Hi, thanks! Please see
>
> https://lkml.org/lkml/2019/6/21/600
>
> and especially
>
> https://lkml.org/lkml/2019/6/21/908
>
> Does this fix your problem? The fix is on its way to next.

Yes, this patch fixes the problem for me.

Thanks
-Sachin
Re: [PATCH] powerpc/rtas: retry when cpu offline races with suspend/migration
On 2019-06-21 00:05, Nathan Lynch wrote:

The protocol for suspending or migrating an LPAR requires all present
processor threads to enter H_JOIN. So if we have threads offline, we have
to temporarily bring them up. This can race with administrator actions
such as SMT state changes. As of dfd718a2ed1f ("powerpc/rtas: Fix a
potential race between CPU-Offline & Migration"), rtas_ibm_suspend_me()
accounts for this, but errors out with -EBUSY for what almost certainly
is a transient condition in any reasonable scenario.

Callers of rtas_ibm_suspend_me() already retry when -EAGAIN is returned,
and it is typical during a migration for that to happen repeatedly for
several minutes polling the H_VASI_STATE hcall result before proceeding
to the next stage. So return -EAGAIN instead of -EBUSY when this race is
encountered. Additionally: logging this event is still appropriate but
use pr_info instead of pr_err; and remove use of unlikely() while here as
this is not a hot path at all.

Looks good. Though since it's not a hot path anyway, unlikely() should
still help the compiler optimize the branch layout, so couldn't it stay?
No?

Mingming

Fixes: dfd718a2ed1f ("powerpc/rtas: Fix a potential race between CPU-Offline & Migration")
Signed-off-by: Nathan Lynch
---
 arch/powerpc/kernel/rtas.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index fbc676160adf..9b4d2a2ffb4f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -984,10 +984,9 @@ int rtas_ibm_suspend_me(u64 handle)
 	cpu_hotplug_disable();
 
 	/* Check if we raced with a CPU-Offline Operation */
-	if (unlikely(!cpumask_equal(cpu_present_mask, cpu_online_mask))) {
-		pr_err("%s: Raced against a concurrent CPU-Offline\n",
-		       __func__);
-		atomic_set(&data.error, -EBUSY);
+	if (!cpumask_equal(cpu_present_mask, cpu_online_mask)) {
+		pr_info("%s: Raced against a concurrent CPU-Offline\n", __func__);
+		atomic_set(&data.error, -EAGAIN);
 		goto out_hotplug_enable;
 	}
Re: [PATCH] ocxl: Fix concurrent AFU open and device removal
On Mon, 24 Jun 2019 16:41:48 +0200
Frederic Barrat wrote:

> If an ocxl device is unbound through sysfs at the same time its AFU is
> being opened by a user process, the open code may dereference freed
> structures, which can lead to kernel oops messages. You'd have to hit a
> tiny time window, but it's possible. It's fairly easy to test by
> making the time window bigger artificially.
>
> Fix it with a combination of 2 changes:
> - when an AFU device is found in the IDR by looking for the device
> minor number, we should hold a reference on the device until after the
> context is allocated. A reference on the AFU structure is kept when
> the context is allocated, so we can release the reference on the
> device after the context allocation.
> - with the fix above, there's still another even tinier window,
> between the time the AFU device is found in the IDR and the reference
> on the device is taken. We can fix this one by removing the IDR entry
> earlier, when the device setup is removed, instead of waiting for the
> 'release' device callback. With proper locking around the IDR.
>
> Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend")
> Signed-off-by: Frederic Barrat
> ---
> mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think
> it's that important. If it's for the next merge window, I would add:
> Cc: sta...@vger.kernel.org # v5.2
>
>  drivers/misc/ocxl/file.c | 23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
> index 2870c25da166..4d1b44de1492 100644
> --- a/drivers/misc/ocxl/file.c
> +++ b/drivers/misc/ocxl/file.c
> @@ -18,18 +18,15 @@ static struct class *ocxl_class;
>  static struct mutex minors_idr_lock;
>  static struct idr minors_idr;
>
> -static struct ocxl_file_info *find_file_info(dev_t devno)
> +static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
>  {
>  	struct ocxl_file_info *info;
>
> -	/*
> -	 * We don't declare an RCU critical section here, as our AFU
> -	 * is protected by a reference counter on the device. By the time the
> -	 * info reference is removed from the idr, the ref count of
> -	 * the device is already at 0, so no user API will access that AFU and
> -	 * this function can't return it.
> -	 */
> +	mutex_lock(&minors_idr_lock);
>  	info = idr_find(&minors_idr, MINOR(devno));
> +	if (info)
> +		get_device(&info->dev);
> +	mutex_unlock(&minors_idr_lock);
>  	return info;
>  }
>
> @@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file *file)
>
>  	pr_debug("%s for device %x\n", __func__, inode->i_rdev);
>
> -	info = find_file_info(inode->i_rdev);
> +	info = find_and_get_file_info(inode->i_rdev);
>  	if (!info)
>  		return -ENODEV;
>
>  	rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);
> -	if (rc)
> +	if (rc) {
> +		put_device(&info->dev);

You could have a single call site for put_device() since it's needed for
both branches. No big deal.

>  		return rc;
> -
> +	}
> +	put_device(&info->dev);
>  	file->private_data = ctx;
>  	return 0;
>  }
> @@ -487,7 +486,6 @@ static void info_release(struct device *dev)
>  {
>  	struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev);
>
> -	free_minor(info);
>  	ocxl_afu_put(info->afu);
>  	kfree(info);
>  }
> @@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
>
>  	ocxl_file_make_invisible(info);
>  	ocxl_sysfs_unregister_afu(info);
> +	free_minor(info);

Since the IDR entry is added by ocxl_file_register_afu(), it seems to
make sense to undo that in ocxl_file_unregister_afu(). Out of curiosity,
was there any historical reason to do this in info_release() in the
first place ?

Reviewed-by: Greg Kurz

>  	device_unregister(&info->dev);
>  }
>
Re: [PATCH] ocxl: Fix concurrent AFU open and device removal
On Mon, 24 Jun 2019 17:39:26 +0200 Frederic Barrat wrote: > Le 24/06/2019 à 17:24, Greg Kurz a écrit : > > On Mon, 24 Jun 2019 16:41:48 +0200 > > Frederic Barrat wrote: > > > >> If an ocxl device is unbound through sysfs at the same time its AFU is > >> being opened by a user process, the open code may dereference freed > >> stuctures, which can lead to kernel oops messages. You'd have to hit a > >> tiny time window, but it's possible. It's fairly easy to test by > >> making the time window bigger artificially. > >> > >> Fix it with a combination of 2 changes: > >> - when an AFU device is found in the IDR by looking for the device > >> minor number, we should hold a reference on the device until after the > >> context is allocated. A reference on the AFU structure is kept when > >> the context is allocated, so we can release the reference on the > >> device after the context allocation. > >> - with the fix above, there's still another even tinier window, > >> between the time the AFU device is found in the IDR and the reference > >> on the device is taken. We can fix this one by removing the IDR entry > >> earlier, when the device setup is removed, instead of waiting for the > >> 'release' device callback. With proper locking around the IDR. > >> > >> Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl > >> backend & frontend") > >> Signed-off-by: Frederic Barrat > >> --- > >> mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think > >> it's that important. 
If it's for the next merge window, I would add: > >> Cc: sta...@vger.kernel.org # v5.2 > >> > >> > >> drivers/misc/ocxl/file.c | 23 +++ > >> 1 file changed, 11 insertions(+), 12 deletions(-) > >> > >> diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c > >> index 2870c25da166..4d1b44de1492 100644 > >> --- a/drivers/misc/ocxl/file.c > >> +++ b/drivers/misc/ocxl/file.c > >> @@ -18,18 +18,15 @@ static struct class *ocxl_class; > >> static struct mutex minors_idr_lock; > >> static struct idr minors_idr; > >> > >> -static struct ocxl_file_info *find_file_info(dev_t devno) > >> +static struct ocxl_file_info *find_and_get_file_info(dev_t devno) > >> { > >>struct ocxl_file_info *info; > >> > >> - /* > >> - * We don't declare an RCU critical section here, as our AFU > >> - * is protected by a reference counter on the device. By the time the > >> - * info reference is removed from the idr, the ref count of > >> - * the device is already at 0, so no user API will access that AFU and > >> - * this function can't return it. > >> - */ > >> + mutex_lock(_idr_lock); > >>info = idr_find(_idr, MINOR(devno)); > >> + if (info) > >> + get_device(>dev); > >> + mutex_unlock(_idr_lock); > >>return info; > >> } > >> > >> @@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file > >> *file) > >> > >>pr_debug("%s for device %x\n", __func__, inode->i_rdev); > >> > >> - info = find_file_info(inode->i_rdev); > >> + info = find_and_get_file_info(inode->i_rdev); > >>if (!info) > >>return -ENODEV; > >> > >>rc = ocxl_context_alloc(, info->afu, inode->i_mapping); > >> - if (rc) > >> + if (rc) { > >> + put_device(>dev); > > > > You could have a single call site for put_device() since it's > > needed for both branches. No big deal. > > > Agreed. 
Will fix if I end up respinning, but won't if it's the only > complaint :-) > > > > >>return rc; > >> - > >> + } > >> + put_device(>dev); > >>file->private_data = ctx; > >>return 0; > >> } > >> @@ -487,7 +486,6 @@ static void info_release(struct device *dev) > >> { > >>struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, > >> dev); > >> > >> - free_minor(info); > >>ocxl_afu_put(info->afu); > >>kfree(info); > >> } > >> @@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu) > >> > >>ocxl_file_make_invisible(info); > >>ocxl_sysfs_unregister_afu(info); > >> + free_minor(info); > > > > Since the IDR entry is added by ocxl_file_register_afu(), it seems to make > > sense to undo that in ocxl_file_unregister_afu(). Out of curiosity, was > > there > > any historical reason to do this in info_release() in the first place ? > > > Yeah, it makes a lot of sense to remove the IDR entry in > ocxl_file_unregister_afu(), that's where we undo the device. I wish I > had noticed during the code reviews. > I don't think there was any good reason to have it in info_release() in > the first place. I remember the code went through many iterations to get > the reference counting on the AFU structure and device done correctly, > but we let that one slip. > > I now think the pre-5.2 ocxl code was also exposed to the 2nd window > mentioned in the commit log (but the first window is new with the > refactoring introduced in 5.2-rc1). > This calls for two separate
Re: [PATCH] ocxl: Fix concurrent AFU open and device removal
Le 24/06/2019 à 17:24, Greg Kurz a écrit : On Mon, 24 Jun 2019 16:41:48 +0200 Frederic Barrat wrote: If an ocxl device is unbound through sysfs at the same time its AFU is being opened by a user process, the open code may dereference freed stuctures, which can lead to kernel oops messages. You'd have to hit a tiny time window, but it's possible. It's fairly easy to test by making the time window bigger artificially. Fix it with a combination of 2 changes: - when an AFU device is found in the IDR by looking for the device minor number, we should hold a reference on the device until after the context is allocated. A reference on the AFU structure is kept when the context is allocated, so we can release the reference on the device after the context allocation. - with the fix above, there's still another even tinier window, between the time the AFU device is found in the IDR and the reference on the device is taken. We can fix this one by removing the IDR entry earlier, when the device setup is removed, instead of waiting for the 'release' device callback. With proper locking around the IDR. Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend") Signed-off-by: Frederic Barrat --- mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think it's that important. 
If it's for the next merge window, I would add: Cc: sta...@vger.kernel.org # v5.2 drivers/misc/ocxl/file.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c index 2870c25da166..4d1b44de1492 100644 --- a/drivers/misc/ocxl/file.c +++ b/drivers/misc/ocxl/file.c @@ -18,18 +18,15 @@ static struct class *ocxl_class; static struct mutex minors_idr_lock; static struct idr minors_idr; -static struct ocxl_file_info *find_file_info(dev_t devno) +static struct ocxl_file_info *find_and_get_file_info(dev_t devno) { struct ocxl_file_info *info; - /* -* We don't declare an RCU critical section here, as our AFU -* is protected by a reference counter on the device. By the time the -* info reference is removed from the idr, the ref count of -* the device is already at 0, so no user API will access that AFU and -* this function can't return it. -*/ + mutex_lock(_idr_lock); info = idr_find(_idr, MINOR(devno)); + if (info) + get_device(>dev); + mutex_unlock(_idr_lock); return info; } @@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file *file) pr_debug("%s for device %x\n", __func__, inode->i_rdev); - info = find_file_info(inode->i_rdev); + info = find_and_get_file_info(inode->i_rdev); if (!info) return -ENODEV; rc = ocxl_context_alloc(, info->afu, inode->i_mapping); - if (rc) + if (rc) { + put_device(>dev); You could have a single call site for put_device() since it's needed for both branches. No big deal. Agreed. 
Will fix if I end up respinning, but won't if it's the only complaint :-) return rc; - + } + put_device(>dev); file->private_data = ctx; return 0; } @@ -487,7 +486,6 @@ static void info_release(struct device *dev) { struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev); - free_minor(info); ocxl_afu_put(info->afu); kfree(info); } @@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu) ocxl_file_make_invisible(info); ocxl_sysfs_unregister_afu(info); + free_minor(info); Since the IDR entry is added by ocxl_file_register_afu(), it seems to make sense to undo that in ocxl_file_unregister_afu(). Out of curiosity, was there any historical reason to do this in info_release() in the first place ? Yeah, it makes a lot of sense to remove the IDR entry in ocxl_file_unregister_afu(), that's where we undo the device. I wish I had noticed during the code reviews. I don't think there was any good reason to have it in info_release() in the first place. I remember the code went through many iterations to get the reference counting on the AFU structure and device done correctly, but we let that one slip. I now think the pre-5.2 ocxl code was also exposed to the 2nd window mentioned in the commit log (but the first window is new with the refactoring introduced in 5.2-rc1). Fred Reviewed-by: Greg Kurz device_unregister(>dev); }
[PATCH 2/2] powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails
In some cases the initial bind of scm memory for an lpar can fail if it
wasn't previously released using a scm-unbind hcall. This situation can
arise due to a panic of the previous kernel or a forced lpar reset. In
such cases H_SCM_BIND_MEM returns an H_OVERLAP error.

To mitigate such cases the patch updates drc_pmem_bind() to force a call
to drc_pmem_unbind() in case the initial bind of scm memory fails with an
H_OVERLAP error. In case the scm-bind operation fails again after the
forced scm-unbind, we follow the existing error path.

Signed-off-by: Vaibhav Jain
---
 arch/powerpc/platforms/pseries/papr_scm.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index d790e4e4ffb3..049d7927c0a4 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -44,19 +44,26 @@ struct papr_scm_priv {
 	struct nd_interleave_set nd_set;
 };
 
+/* Forward declaration */
+static int drc_pmem_unbind(struct papr_scm_priv *);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
 	unsigned long ret[PLPAR_HCALL_BUFSIZE];
 	uint64_t rc, token;
-	uint64_t saved = 0;
+	uint64_t saved;
+	bool tried_unbind = false;
 
+	dev_dbg(&p->pdev->dev, "bind drc %x\n", p->drc_index);
 	/*
 	 * When the hypervisor cannot map all the requested memory in a single
 	 * hcall it returns H_BUSY and we call again with the token until
 	 * we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS
 	 * leave the system in an undefined state, so we wait.
 	 */
+retry:
 	token = 0;
+	saved = 0;
 	do {
 		rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
@@ -68,8 +75,18 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
 	} while (rc == H_BUSY);
 
 	if (rc) {
-		dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
-		return -ENXIO;
+		/* retry after unbinding */
+		if (rc == H_OVERLAP && !tried_unbind) {
+			dev_warn(&p->pdev->dev, "Un-binding and retrying\n");
+			/* Try unbind and ignore any errors */
+			tried_unbind = true;
+			drc_pmem_unbind(p);
+			goto retry;
+
+		} else {
+			dev_err(&p->pdev->dev, "bind err: %lld\n", rc);
+			return -ENXIO;
+		}
 	}
 
 	p->bound_addr = saved;
-- 
2.21.0
[PATCH 0/2] powerpc/papr_scm: Workaround for failure of drc bind after kexec
Presently an error is returned in response to the hcall H_SCM_BIND_MEM
when a new kernel boots on an lpar via kexec. This prevents papr_scm from
registering drc memory regions with nvdimm. The error reported is of the
form below:

"papr_scm ibm,persistent-memory:ibm,pmemory@4412: bind err: -68"

On investigation it was revealed that phyp returns this error because the
previous kernel did not completely release the bindings for the drc
scm-memory blocks, and hence phyp rejected the request to re-bind these
blocks to the lpar with error H_OVERLAP. Also, support for a new hcall
H_SCM_UNBIND_ALL was recently added, which is better suited for releasing
all the bound scm-memory blocks from an lpar.

So, leveraging the new hcall H_SCM_UNBIND_ALL, we can work around the
H_OVERLAP issue during kexec by forcing an unbind of all drc scm-memory
blocks and issuing H_SCM_BIND_MEM to re-bind them to the lpar. This
sequence will also be needed when a new kernel boots on an lpar after the
previous kernel panicked and never got an opportunity to call
H_SCM_UNBIND_MEM/ALL.

Hence this patch-set implements the following changes to the papr_scm
module:

* Update it to use H_SCM_UNBIND_ALL instead of H_SCM_UNBIND_MEM

* In case hcall H_SCM_BIND_MEM fails with error H_OVERLAP, force
  H_SCM_UNBIND_ALL and retry the bind operation again.

With the patch-set applied, re-bind of drc scm-memory to the lpar
succeeds after a kexec to a new kernel, as illustrated below:

# Old kernel
$ sudo ndctl list -R
[
  {
    "dev":"region0",
  }
]

# kexec to new kernel
$ sudo kexec --initrd=... vmlinux
...
...
I'm in purgatory
...
papr_scm ibm,persistent-memory:ibm,pmemory@4412: Un-binding and retrying
...
# New kernel $ sudo ndctl list -R $ sudo ndctl list -R [ { "dev":"region0", } ] Vaibhav Jain (2): powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails arch/powerpc/include/asm/hvcall.h | 2 +- arch/powerpc/platforms/pseries/papr_scm.c | 60 --- 2 files changed, 53 insertions(+), 9 deletions(-) -- 2.21.0
[PATCH 1/2] powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL
A new hcall named H_SCM_UNBIND_ALL has been introduced that can unbind
all the drc memory-blocks assigned to an lpar. This is more efficient
than using H_SCM_UNBIND_MEM as we currently don't support partial unbind
of drc memory-blocks.

Hence this patch proposes the following changes to drc_pmem_unbind():

* Update drc_pmem_unbind() to replace the hcall H_SCM_UNBIND_MEM with
  H_SCM_UNBIND_ALL.

* Update drc_pmem_unbind() to handle cases when PHYP asks the guest
  kernel to wait for a specific amount of time before retrying the hcall
  via the 'LONG_BUSY' return value. In case it takes more than 1 second
  to unbind the memory, log a warning.

* Ensure an appropriate error code is returned back from the function in
  case of an error.

Signed-off-by: Vaibhav Jain
---
 arch/powerpc/include/asm/hvcall.h         |  2 +-
 arch/powerpc/platforms/pseries/papr_scm.c | 37 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 463c63a9fcf1..bb56fa0f976c 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -302,7 +302,7 @@
 #define H_SCM_UNBIND_MEM        0x3F0
 #define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
 #define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
-#define H_SCM_MEM_QUERY         0x3FC
+#define H_SCM_UNBIND_ALL        0x3FC
 #define H_SCM_BLOCK_CLEAR       0x400
 #define MAX_HCALL_OPCODE        H_SCM_BLOCK_CLEAR
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 96c53b23e58f..d790e4e4ffb3 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -11,11 +11,16 @@
 #include
 #include
 #include
+#include <linux/delay.h>
 #include
 
 #define BIND_ANY_ADDR (~0ul)
 
+/* Scope args for H_SCM_UNBIND_ALL */
+#define H_UNBIND_SCOPE_ALL (0x1)
+#define H_UNBIND_SCOPE_DRC (0x2)
+
 #define PAPR_SCM_DIMM_CMD_MASK \
 	((1ul << ND_CMD_GET_CONFIG_SIZE) | \
 	 (1ul << ND_CMD_GET_CONFIG_DATA) | \
@@ -78,21 +83,43 @@
 static int drc_pmem_unbind(struct papr_scm_priv *p)
 {
 	unsigned long ret[PLPAR_HCALL_BUFSIZE];
 	uint64_t rc, token;
+	unsigned long starttime;
 
 	token = 0;
 
-	/* NB: unbind has the same retry requirements mentioned above */
+	dev_dbg(&p->pdev->dev, "unbind drc %x\n", p->drc_index);
+
+	/* NB: unbind_all has the same retry requirements as drc_pmem_bind() */
+	starttime = jiffies;
 	do {
-		rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index,
-				 p->bound_addr, p->blocks, token);
+		/* If this is taking too much time then log warning */
+		if (jiffies_to_msecs(jiffies - starttime) > 1000) {
+			dev_warn(&p->pdev->dev,
+				 "unbind taking > 1s to complete\n");
+			starttime = jiffies;
+		}
+
+		/* Unbind of all SCM resources associated with drcIndex */
+		rc = plpar_hcall(H_SCM_UNBIND_ALL, ret, H_UNBIND_SCOPE_DRC,
+				 p->drc_index, token);
 		token = ret[0];
-		cond_resched();
+
+		/* Check if we are stalled for some time */
+		if (H_IS_LONG_BUSY(rc)) {
+			msleep(get_longbusy_msecs(rc));
+			rc = H_BUSY;
+			token = 0;
+
+		} else if (rc == H_BUSY) {
+			cond_resched();
+		}
+
 	} while (rc == H_BUSY);
 
 	if (rc)
 		dev_err(&p->pdev->dev, "unbind error: %lld\n", rc);
 
-	return !!rc;
+	return rc == H_SUCCESS ? 0 : -ENXIO;
 }
 
 static int papr_scm_meta_get(struct papr_scm_priv *p,
-- 
2.21.0
CVE-2019-12817: Linux kernel: powerpc: Unrelated processes may be able to read/write to each other's virtual memory
The Linux kernel for powerpc since 4.17 has a bug where unrelated processes may be able to read/write to each other's virtual memory under certain conditions. This bug only affects machines using 64-bit CPUs with the hash page table MMU, see below for more detail on affected CPUs. To trigger the bug a process must allocate memory above 512TB. That only happens if userspace explicitly requests it with mmap(). That process must then fork(), at this point the child incorrectly inherits the "context id" of the parent associated with the mapping above 512TB. It may then be possible for the parent/child to write to each other's mappings above 512TB, which should not be possible, and constitutes memory corruption. If instead the child process exits, all its context ids are freed, including the context id that is still in use by the parent for the mapping above 512TB. That id can then be reallocated to a third process, that process can then read/write to the parent's mapping above 512TB. Additionally if the freed id is used for the third process's primary context id, then the parent is able to read/write to the third process's mappings *below* 512TB. If the parent and child both exit before another process is allocated the freed context id, the kernel will notice the double free of the id and print a warning such as: ida_free called for id=103 which is not allocated. WARNING: CPU: 8 PID: 7293 at lib/idr.c:520 ida_free_rc+0x1b4/0x1d0 The bug was introduced in commit: f384796c40dc ("powerpc/mm: Add support for handling > 512TB address in SLB miss") Which was originally merged in v4.17. Only machines using the hash page table (HPT) MMU are affected, eg. PowerPC 970 (G5), PA6T, Power5/6/7/8/9. By default Power9 bare metal machines (powernv) use the Radix MMU and are not affected, unless the machine has been explicitly booted in HPT mode (using disable_radix on the kernel command line). KVM guests on Power9 may be affected if the host or guest is configured to use the HPT MMU. 
LPARs under PowerVM on Power9 are affected as they always use the HPT MMU.

Kernels built with PAGE_SIZE=4K are not affected.

The upstream fix is here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ca72d88378b2f2444d3ec145dd442d449d3fefbc

There's also a kernel selftest to verify the fix:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16391bfc862342f285195013b73c1394fab28b97

Or a similar standalone version is included below.

cheers

cat > test.c <<EOF
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE MAP_FIXED // "Should be safe" above 512TB
#endif

int main(void)
{
	int p2c[2], c2p[2], rc, status, c, *p;
	unsigned long page_size;
	pid_t pid;

	page_size = sysconf(_SC_PAGESIZE);
	if (page_size != 65536) {
		printf("Unsupported page size - not affected\n");
		return 1;
	}

	// Create a mapping at 512TB to allocate an extended_id
	p = mmap((void *)(512ul << 40), page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		printf("Error: couldn't mmap(), confirm kernel has 4TB support\n");
		return 1;
	}

	printf("parent writing %p = 1\n", p);
	*p = 1;

	assert(pipe(p2c) != -1 && pipe(c2p) != -1);

	pid = fork();
	if (pid == 0) {
		close(p2c[1]);
		close(c2p[0]);
		assert(read(p2c[0], &c, 1) == 1);

		pid = getpid();
		printf("child writing  %p = %d\n", p, pid);
		*p = pid;

		assert(write(c2p[1], &c, 1) == 1);
		assert(read(p2c[0], &c, 1) == 1);
		exit(0);
	}

	close(p2c[0]);
	close(c2p[1]);

	c = 0;
	assert(write(p2c[1], &c, 1) == 1);
	assert(read(c2p[0], &c, 1) == 1);

	// Prevent compiler optimisation
	asm volatile("" : : : "memory");

	rc = 0;
	printf("parent reading %p = %d\n", p, *p);
	if (*p != 1) {
		printf("Error: BUG! parent saw child's write! *p = %d\n", *p);
		rc = 1;
	}

	assert(write(p2c[1], &c, 1) == 1);
	assert(waitpid(pid, &status, 0) != -1);
	assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);

	if (rc == 0)
		printf("success: test completed OK\n");

	return rc;
}
EOF
Re: [next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR
On 24.06.19 16:09, Sachin Sant wrote: > Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls. > > This problem was introduced with next-20190620 (dc636f5d78). > next-20190619 was last good kernel. > > Reverting following commit allows the kernel to boot. > 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks() > > > [0.014409] Using shared cache scheduler topology > [0.016302] devtmpfs: initialized > [0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, > max_idle_ns: 1911260446275 ns > [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, > linear) > [0.031575] NET: Registered protocol family 16 > [0.031724] audit: initializing netlink subsys (disabled) > [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized > audit_enabled=0 res=1 > [0.032249] cpuidle: using governor menu > [0.032403] pstore: Registered nvram as persistent store backend > [ 60.061246] rcu: INFO: rcu_sched self-detected stall on CPU > [ 60.061254] rcu: 0-: (5999 ticks this GP) > idle=1ea/1/0x4002 softirq=5/5 fqs=2999 > [ 60.061261](t=6000 jiffies g=-1187 q=0) > [ 60.061265] NMI backtrace for cpu 0 > [ 60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.2.0-rc5-next-20190621-autotest-autotest #1 > [ 60.061275] Call Trace: > [ 60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 > (unreliable) > [ 60.061287] [c018ee85f3c0] [c0b6d464] > nmi_cpu_backtrace+0x144/0x150 > [ 60.061293] [c018ee85f450] [c0b6d61c] > nmi_trigger_cpumask_backtrace+0x1ac/0x1f0 > [ 60.061300] [c018ee85f4f0] [c00692c8] > arch_trigger_cpumask_backtrace+0x28/0x40 > [ 60.061306] [c018ee85f510] [c01c5f90] > rcu_dump_cpu_stacks+0x10c/0x16c > [ 60.061313] [c018ee85f560] [c01c4fe4] > rcu_sched_clock_irq+0x744/0x990 > [ 60.061318] [c018ee85f630] [c01d5b58] > update_process_times+0x48/0x90 > [ 60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120 > [ 60.061330] [c018ee85f690] [c01ea150] > tick_handle_periodic+0x40/0xe0 > [ 60.061336] [c018ee85f6d0] 
[c002b5cc] > timer_interrupt+0x10c/0x2e0 > [ 60.061342] [c018ee85f730] [c0009204] > decrementer_common+0x134/0x140 > [ 60.061350] --- interrupt: 901 at replay_interrupt_return+0x0/0x4 > [ 60.061350] LR = arch_local_irq_restore+0x84/0x90 > [ 60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac > (unreliable) > [ 60.061364] [c018ee85fa50] [c0b88300] > _raw_spin_unlock_irqrestore+0x50/0x80 > [ 60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150 > [ 60.061376] [c018ee85fac0] [c0766ea0] > subsys_find_device_by_id+0xf0/0x1a0 > [ 60.061382] [c018ee85fb20] [c0797a94] > walk_memory_blocks+0x84/0x100 > [ 60.061388] [c018ee85fb80] [c0795ea0] > link_mem_sections+0x40/0x60 > [ 60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268 > [ 60.061400] [c018ee85fc10] [c0010448] > do_one_initcall+0x68/0x2c0 > [ 60.061406] [c018ee85fce0] [c0f247dc] > kernel_init_freeable+0x318/0x47c > [ 60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150 > [ 60.061417] [c018ee85fe20] [c000ba54] > ret_from_kernel_thread+0x5c/0x68 > [ 88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] > [ 88.016569] Modules linked in: > Hi, thanks! Please see https://lkml.org/lkml/2019/6/21/600 and especially https://lkml.org/lkml/2019/6/21/908 Does this fix your problem? The fix is on its way to next. Cheers! -- Thanks, David / dhildenb
[PATCH] ocxl: Fix concurrent AFU open and device removal
If an ocxl device is unbound through sysfs at the same time its AFU is being opened by a user process, the open code may dereference freed structures, which can lead to kernel oops messages. You'd have to hit a tiny time window, but it's possible. It's fairly easy to test by making the time window bigger artificially.

Fix it with a combination of 2 changes:
- when an AFU device is found in the IDR by looking for the device minor number, we should hold a reference on the device until after the context is allocated. A reference on the AFU structure is kept when the context is allocated, so we can release the reference on the device after the context allocation.
- with the fix above, there's still another even tinier window, between the time the AFU device is found in the IDR and the reference on the device is taken. We can fix this one by removing the IDR entry earlier, when the device setup is removed, instead of waiting for the 'release' device callback. With proper locking around the IDR.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend")
Signed-off-by: Frederic Barrat
---
mpe: this fixes a commit merged in v5.2-rc1. It's late, and I don't think it's that important. If it's for the next merge window, I would add:
Cc: sta...@vger.kernel.org # v5.2

 drivers/misc/ocxl/file.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index 2870c25da166..4d1b44de1492 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -18,18 +18,15 @@ static struct class *ocxl_class;
 static struct mutex minors_idr_lock;
 static struct idr minors_idr;
 
-static struct ocxl_file_info *find_file_info(dev_t devno)
+static struct ocxl_file_info *find_and_get_file_info(dev_t devno)
 {
 	struct ocxl_file_info *info;
 
-	/*
-	 * We don't declare an RCU critical section here, as our AFU
-	 * is protected by a reference counter on the device. By the time the
-	 * info reference is removed from the idr, the ref count of
-	 * the device is already at 0, so no user API will access that AFU and
-	 * this function can't return it.
-	 */
+	mutex_lock(&minors_idr_lock);
 	info = idr_find(&minors_idr, MINOR(devno));
+	if (info)
+		get_device(&info->dev);
+	mutex_unlock(&minors_idr_lock);
 	return info;
 }
 
@@ -58,14 +55,16 @@ static int afu_open(struct inode *inode, struct file *file)
 
 	pr_debug("%s for device %x\n", __func__, inode->i_rdev);
 
-	info = find_file_info(inode->i_rdev);
+	info = find_and_get_file_info(inode->i_rdev);
 	if (!info)
 		return -ENODEV;
 
 	rc = ocxl_context_alloc(&ctx, info->afu, inode->i_mapping);
-	if (rc)
+	if (rc) {
+		put_device(&info->dev);
 		return rc;
-
+	}
+	put_device(&info->dev);
 	file->private_data = ctx;
 	return 0;
 }
@@ -487,7 +486,6 @@ static void info_release(struct device *dev)
 {
 	struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev);
 
-	free_minor(info);
 	ocxl_afu_put(info->afu);
 	kfree(info);
 }
@@ -577,6 +575,7 @@ void ocxl_file_unregister_afu(struct ocxl_afu *afu)
 
 	ocxl_file_make_invisible(info);
 	ocxl_sysfs_unregister_afu(info);
+	free_minor(info);
 	device_unregister(&info->dev);
 }
-- 
2.21.0
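For readers unfamiliar with the idiom, the core of the fix is that the lookup and the reference grab must happen under the same lock, and the table entry must disappear before the registration reference drops. A minimal userspace sketch of that pattern follows — toy names throughout; a plain pthread mutex and an array stand in for the kernel's minors_idr and struct device refcounting, and all refcount operations are simplified to run under the table lock:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy stand-in for the refcounted kernel object (illustration only). */
struct toy_dev { int refcount; };

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct toy_dev *table[16];	/* plays the role of minors_idr */

/* Drop a reference; free the object when the last one goes away. */
void put_dev(struct toy_dev *d)
{
	int free_it;

	pthread_mutex_lock(&table_lock);
	free_it = (--d->refcount == 0);
	pthread_mutex_unlock(&table_lock);
	if (free_it)
		free(d);
}

/* Like the new find_and_get_file_info(): look up AND take the
 * reference under one lock, so unregister can't free the object
 * between the lookup and the get. */
struct toy_dev *find_and_get(int minor)
{
	struct toy_dev *d;

	pthread_mutex_lock(&table_lock);
	d = table[minor];
	if (d)
		d->refcount++;
	pthread_mutex_unlock(&table_lock);
	return d;
}

struct toy_dev *register_dev(int minor)
{
	struct toy_dev *d = calloc(1, sizeof(*d));

	d->refcount = 1;	/* registration reference */
	pthread_mutex_lock(&table_lock);
	table[minor] = d;
	pthread_mutex_unlock(&table_lock);
	return d;
}

/* Like free_minor() moved into ocxl_file_unregister_afu(): remove the
 * table entry first, then drop the registration reference, so a stale
 * entry can never be found after the object starts dying. */
void unregister_dev(int minor)
{
	struct toy_dev *d;

	pthread_mutex_lock(&table_lock);
	d = table[minor];
	table[minor] = NULL;
	pthread_mutex_unlock(&table_lock);
	if (d)
		put_dev(d);
}
```

With this shape, a concurrent open either finds the entry and pins the object before unregister removes it, or finds nothing at all — there is no window where it holds a pointer it never got a reference on.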
[next][PowerPC] RCU stalls while booting linux-next on PowerVM LPAR
Latest -next fails to boot on POWER9 PowerVM LPAR due to RCU stalls. This problem was introduced with next-20190620 (dc636f5d78). next-20190619 was last good kernel. Reverting following commit allows the kernel to boot. 2fd4aeea6b603 : mm/memory_hotplug: move and simplify walk_memory_blocks() [0.014409] Using shared cache scheduler topology [0.016302] devtmpfs: initialized [0.031022] clocksource: jiffies: mask: 0x max_cycles: 0x, max_idle_ns: 1911260446275 ns [0.031034] futex hash table entries: 16384 (order: 5, 2097152 bytes, linear) [0.031575] NET: Registered protocol family 16 [0.031724] audit: initializing netlink subsys (disabled) [0.031796] audit: type=2000 audit(1561344029.030:1): state=initialized audit_enabled=0 res=1 [0.032249] cpuidle: using governor menu [0.032403] pstore: Registered nvram as persistent store backend [ 60.061246] rcu: INFO: rcu_sched self-detected stall on CPU [ 60.061254] rcu: 0-: (5999 ticks this GP) idle=1ea/1/0x4002 softirq=5/5 fqs=2999 [ 60.061261] (t=6000 jiffies g=-1187 q=0) [ 60.061265] NMI backtrace for cpu 0 [ 60.061269] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.2.0-rc5-next-20190621-autotest-autotest #1 [ 60.061275] Call Trace: [ 60.061280] [c018ee85f380] [c0b624ec] dump_stack+0xb0/0xf4 (unreliable) [ 60.061287] [c018ee85f3c0] [c0b6d464] nmi_cpu_backtrace+0x144/0x150 [ 60.061293] [c018ee85f450] [c0b6d61c] nmi_trigger_cpumask_backtrace+0x1ac/0x1f0 [ 60.061300] [c018ee85f4f0] [c00692c8] arch_trigger_cpumask_backtrace+0x28/0x40 [ 60.061306] [c018ee85f510] [c01c5f90] rcu_dump_cpu_stacks+0x10c/0x16c [ 60.061313] [c018ee85f560] [c01c4fe4] rcu_sched_clock_irq+0x744/0x990 [ 60.061318] [c018ee85f630] [c01d5b58] update_process_times+0x48/0x90 [ 60.061325] [c018ee85f660] [c01ea03c] tick_periodic+0x4c/0x120 [ 60.061330] [c018ee85f690] [c01ea150] tick_handle_periodic+0x40/0xe0 [ 60.061336] [c018ee85f6d0] [c002b5cc] timer_interrupt+0x10c/0x2e0 [ 60.061342] [c018ee85f730] [c0009204] decrementer_common+0x134/0x140 [ 60.061350] --- 
interrupt: 901 at replay_interrupt_return+0x0/0x4 [ 60.061350] LR = arch_local_irq_restore+0x84/0x90 [ 60.061357] [c018ee85fa30] [c018ee85fbac] 0xc018ee85fbac (unreliable) [ 60.061364] [c018ee85fa50] [c0b88300] _raw_spin_unlock_irqrestore+0x50/0x80 [ 60.061369] [c018ee85fa70] [c0b69da4] klist_next+0xb4/0x150 [ 60.061376] [c018ee85fac0] [c0766ea0] subsys_find_device_by_id+0xf0/0x1a0 [ 60.061382] [c018ee85fb20] [c0797a94] walk_memory_blocks+0x84/0x100 [ 60.061388] [c018ee85fb80] [c0795ea0] link_mem_sections+0x40/0x60 [ 60.061395] [c018ee85fbb0] [c0f28c28] topology_init+0xa0/0x268 [ 60.061400] [c018ee85fc10] [c0010448] do_one_initcall+0x68/0x2c0 [ 60.061406] [c018ee85fce0] [c0f247dc] kernel_init_freeable+0x318/0x47c [ 60.061411] [c018ee85fdb0] [c00107c4] kernel_init+0x24/0x150 [ 60.061417] [c018ee85fe20] [c000ba54] ret_from_kernel_thread+0x5c/0x68 [ 88.016563] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] [ 88.016569] Modules linked in: Thanks -Sachin boot.log Description: Binary data
Re: [PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation
Michael Neuling writes:
> When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
> code currently sets:
>   CR0 <- 00 || MSR[TS]
> but according to the ISA it should be:
>   CR0 <- 0 || MSR[TS] || 0

Seems bad, what's the worst case impact?

Do we have a test case for this?

> This fixes the bit shift to put the bits in the correct location.

Fixes: ?

cheers

> diff --git a/arch/powerpc/kvm/book3s_hv_tm.c b/arch/powerpc/kvm/book3s_hv_tm.c
> index 888e2609e3..31cd0f327c 100644
> --- a/arch/powerpc/kvm/book3s_hv_tm.c
> +++ b/arch/powerpc/kvm/book3s_hv_tm.c
> @@ -131,7 +131,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
> 		}
> 		/* Set CR0 to indicate previous transactional state */
> 		vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> -			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> +			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
> 		/* L=1 => tresume, L=0 => tsuspend */
> 		if (instr & (1 << 21)) {
> 			if (MSR_TM_SUSPENDED(msr))
> @@ -175,7 +175,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
> 
> 		/* Set CR0 to indicate previous transactional state */
> 		vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> -			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> +			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
> 		vcpu->arch.shregs.msr &= ~MSR_TS_MASK;
> 		return RESUME_GUEST;
> 
> @@ -205,7 +205,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu)
> 
> 		/* Set CR0 to indicate previous transactional state */
> 		vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) |
> -			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
> +			(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
> 		vcpu->arch.shregs.msr = msr | MSR_TS_S;
> 		return RESUME_GUEST;
> 	}
> -- 
> 2.21.0
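The shift arithmetic is easy to check in isolation. In the 32-bit CR image, CR0 occupies the top nibble (bits 31:28), so placing the 2-bit MSR[TS] value in the middle of that field (0 || TS || 0) means shifting it to bits 30:29 — a shift of 29, not 28. A standalone sketch of the two versions; the constants mirror the kernel's MSR_TS_S_LG and MSR_TS_MASK but are redefined locally for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins: MSR[TS] is a 2-bit field whose low bit sits at
 * bit 33 of the MSR (as in the kernel's MSR_TS_S_LG). */
#define MSR_TS_S_LG	33
#define MSR_TS_MASK	(3ULL << MSR_TS_S_LG)

/* Fixed version: CR0 <- 0 || MSR[TS] || 0, i.e. TS lands in the
 * middle two bits (30:29) of the 4-bit CR0 field. */
static uint32_t set_cr0_from_ts(uint32_t ccr, uint64_t msr)
{
	return (ccr & 0x0fffffff) |
	       (uint32_t)(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29);
}

/* Buggy version: shifting by 28 produces CR0 <- 00 || MSR[TS],
 * putting TS in the bottom two bits of CR0. */
static uint32_t set_cr0_buggy(uint32_t ccr, uint64_t msr)
{
	return (ccr & 0x0fffffff) |
	       (uint32_t)(((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28);
}
```

With TS = 0b10, the fixed version sets CR0 to 0b0100 while the buggy one sets 0b0010 — which is why guest code testing CR0 after the emulated instruction sees corrupted condition bits.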
Re: [PATCH 1/2] powerpc/64s: Rename PPC_INVALIDATE_ERAT to PPC_ARCH_300_INVALIDATE_ERAT
Nicholas Piggin writes: > Segher Boessenkool's on June 23, 2019 10:03 pm: >> On Sun, Jun 23, 2019 at 08:41:51PM +1000, Nicholas Piggin wrote: >>> This makes it clear to the caller that it can only be used on POWER9 >>> and later CPUs. >> >>> -#define PPC_INVALIDATE_ERATPPC_SLBIA(7) >>> +#define PPC_ARCH_300_INVALIDATE_ERAT PPC_SLBIA(7) >> >> The architecture version is 3.0 (or 3.0B), not "300". This will work on >> implementations of later architecture versions as well, so maybe the name >> isn't so great anyway? > > Yeah... this is kernel convention for better or worse. ISA v3.0B > feature support is called CPU_FTR_ARCH_300, and later architectures > will advertise that support. For the most part we can use architected > features (incompatible changes would require additional code). I'd rather we used 3_0B or something inside the kernel, but I'm not sure it's worth the churn to rename the existing feature everywhere. And we can't rename the user visible feature. But if you're adding a new usage then I'd prefer: PPC_ISA_3_0B_INVALIDATE_ERAT I dislike "300" because it implies we support ISA v3.0 but we actually don't, we only support v3.0B. cheers
Re: [PATCH 4/5] x86: don't use asm-generic/ptrace.h
On Mon, 24 Jun 2019, Christoph Hellwig wrote: > Doing the indirection through macros for the regs accessors just > makes them harder to read, so implement the helpers directly. > > Note that only the helpers actually used are implemented now. > > Signed-off-by: Christoph Hellwig > Acked-by: Ingo Molnar > Acked-by: Oleg Nesterov Reviewed-by: Thomas Gleixner
Re: remove asm-generic/ptrace.h v3
On Mon, 24 Jun 2019, Christoph Hellwig wrote:
> asm-generic/ptrace.h is a little weird in that it doesn't actually
> implement any functionality, but it provides multiple layers of macros
> that just implement trivial inline functions. We implement those
> directly in the few architectures and are off with a much simpler
> design.
>
> I'm not sure which tree is the right place, but maybe this can go through
> the asm-generic tree since it removes an asm-generic header?

Makes sense.

Thanks, tglx
[PATCH v5 3/4] crypto: talitos - fix hash on SEC1.
On SEC1, hash provides a wrong result when hashing is performed in several steps and the input data SG list has more than one element. This was detected with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[ 44.185947] alg: hash: md5-talitos test failed (wrong result) on test vector 6, cfg="random: may_sleep use_finup src_divs=[25.88%@+8063, 24.19%@+9588, 28.63%@+16333, 4.60%@+6756, 16.70%@+16281] dst_divs=[71.61%@alignmask+16361, 14.36%@+7756, 14.3%@+"
[ 44.325122] alg: hash: sha1-talitos test failed (wrong result) on test vector 3, cfg="random: inplace use_final src_divs=[16.56%@+16378, 52.0%@+16329, 21.42%@alignmask+16380, 10.2%@alignmask+16380] iv_offset=39"
[ 44.493500] alg: hash: sha224-talitos test failed (wrong result) on test vector 4, cfg="random: use_final nosimd src_divs=[52.27%@+7401, 17.34%@+16285, 17.71%@+26, 12.68%@+10644] iv_offset=43"
[ 44.673262] alg: hash: sha256-talitos test failed (wrong result) on test vector 4, cfg="random: may_sleep use_finup src_divs=[60.6%@+12790, 17.86%@+1329, 12.64%@alignmask+16300, 8.29%@+15, 0.40%@+13506, 0.51%@+16322, 0.24%@+16339] dst_divs"

This is due to two issues:
- We have an overlap between the buffer used for copying the input data (SEC1 doesn't do scatter/gather) and the chained descriptor.
- Data copy is wrong when the previous hash left less than one blocksize of data to hash, implying a complement of the previous block with a few bytes from the new request.

Fix it by:
- Moving the second descriptor after the buffer, as moving the buffer after the descriptor would make it more complex for other cipher operations (AEAD, ABLKCIPHER).
- Skipping the bytes taken from the new request to complete the previous one by moving the SG list forward.
Fixes: 37b5e8897eb5 ("crypto: talitos - chain in buffered data for ahash on SEC1")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
 drivers/crypto/talitos.c | 69
 1 file changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 95dc3957f358..ab6bd45addf7 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -320,6 +320,21 @@ static int talitos_submit(struct device *dev, int ch, struct talitos_desc *desc,
 	return -EINPROGRESS;
 }
 
+static __be32 get_request_hdr(struct talitos_request *request, bool is_sec1)
+{
+	struct talitos_edesc *edesc;
+
+	if (!is_sec1)
+		return request->desc->hdr;
+
+	if (!request->desc->next_desc)
+		return request->desc->hdr1;
+
+	edesc = container_of(request->desc, struct talitos_edesc, desc);
+
+	return ((struct talitos_desc *)(edesc->buf + edesc->dma_len))->hdr1;
+}
+
 /*
  * process what was done, notify callback of error if not
  */
@@ -341,12 +356,7 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
 		/* descriptors with their done bits set don't get the error */
 		rmb();
-		if (!is_sec1)
-			hdr = request->desc->hdr;
-		else if (request->desc->next_desc)
-			hdr = (request->desc + 1)->hdr1;
-		else
-			hdr = request->desc->hdr1;
+		hdr = get_request_hdr(request, is_sec1);
 
 		if ((hdr & DESC_HDR_DONE) == DESC_HDR_DONE)
 			status = 0;
@@ -476,8 +486,14 @@ static u32 current_desc_hdr(struct device *dev, int ch)
 		}
 	}
 
-	if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc)
-		return (priv->chan[ch].fifo[iter].desc + 1)->hdr;
+	if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc) {
+		struct talitos_edesc *edesc;
+
+		edesc = container_of(priv->chan[ch].fifo[iter].desc,
+				     struct talitos_edesc, desc);
+		return ((struct talitos_desc *)
+			(edesc->buf + edesc->dma_len))->hdr;
+	}
 
 	return priv->chan[ch].fifo[iter].desc->hdr;
 }
@@ -1402,15 +1418,11 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	edesc->dst_nents = dst_nents;
 	edesc->iv_dma = iv_dma;
 	edesc->dma_len = dma_len;
-	if (dma_len) {
-		void *addr = &edesc->link_tbl[0];
-
-		if (is_sec1 && !dst)
-			addr += sizeof(struct talitos_desc);
-		edesc->dma_link_tbl = dma_map_single(dev, addr,
+	if (dma_len)
+		edesc->dma_link_tbl = dma_map_single(dev, &edesc->link_tbl[0],
 						     edesc->dma_len, DMA_BIDIRECTIONAL);
-	}
+
 	return edesc;
 }
 
@@ -1722,14 +1734,16 @@ static void common_nonsnoop_hash_unmap(struct device *dev,
 	struct talitos_private *priv = dev_get_drvdata(dev);
 	bool is_sec1 = has_ftr_sec1(priv);
 	struct
[PATCH v5 0/4] *** SUBJECT HERE ***
This series is the last set of fixes for the Talitos driver. We now get a fully clean boot on both SEC1 (SEC1.2 on mpc885) and SEC2 (SEC2.2 on mpc8321E) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[3.385197] bus: 'platform': really_probe: probing driver talitos with device ff02.crypto
[3.450982] random: fast init done
[ 12.252548] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos-hsna)
[ 12.262226] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos-hsna)
[ 43.310737] Bug in SEC1, padding ourself
[ 45.603318] random: crng init done
[ 54.612333] talitos ff02.crypto: fsl,sec1.2 algorithms registered in /proc/crypto
[ 54.620232] driver: 'talitos': driver_bound: bound to device 'ff02.crypto'

[1.193721] bus: 'platform': really_probe: probing driver talitos with device b003.crypto
[1.229197] random: fast init done
[2.714920] alg: No test for authenc(hmac(sha224),cbc(aes)) (authenc-hmac-sha224-cbc-aes-talitos)
[2.724312] alg: No test for authenc(hmac(sha224),cbc(aes)) (authenc-hmac-sha224-cbc-aes-talitos-hsna)
[4.482045] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos)
[4.490940] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos-hsna)
[4.500280] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos)
[4.509727] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos-hsna)
[6.631781] random: crng init done
[ 11.521795] talitos b003.crypto: fsl,sec2.2 algorithms registered in /proc/crypto
[ 11.529803] driver: 'talitos': driver_bound: bound to device 'b003.crypto'

v2: dropped patch 1, which was irrelevant due to a rebase weirdness. Added Cc to stable on the first 2 patches.

v3:
- removed stable reference in patch 1
- reworded patch 1 to include the name of patch 2 for the dependency.
- mentioned this dependency in patch 2 as well.
- corrected the Fixes: sha1 in patch 4

v4:
- using scatterwalk_ffwd() instead of open coding SG list forwarding.
- Added a patch to fix sg_copy_to_buffer() when sg->offset is greater than PAGE_SIZE, otherwise sg_copy_to_buffer() fails when the list has been forwarded with scatterwalk_ffwd().
- taken the patch "crypto: talitos - eliminate unneeded 'done' functions at build time" out of the series because it is independent.
- added a helper to find the header field associated with a request in flush_channel()

v5:
- Replaced the while loop with a direct shift/mask operation, as suggested by Herbert in patch 1.

Christophe Leroy (4):
  lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
  crypto: talitos - move struct talitos_edesc into talitos.h
  crypto: talitos - fix hash on SEC1.
  crypto: talitos - drop icv_ool

 drivers/crypto/talitos.c | 102 +++
 drivers/crypto/talitos.h |  28 +
 lib/scatterlist.c        |   9 +++--
 3 files changed, 74 insertions(+), 65 deletions(-)

-- 
2.13.3
[PATCH v5 1/4] lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
All mapping iterator logic is based on the assumption that sg->offset is always lower than PAGE_SIZE. But there are situations where sg->offset is such that the SG item is on the second page. In that case sg_copy_to_buffer() fails to properly copy the data into the buffer. One of the reasons is that the data will be outside the kmapped area used to access that data.

This patch fixes the issue by adjusting the mapping iterator offset and pgoffset fields such that offset is always lower than PAGE_SIZE.

Signed-off-by: Christophe Leroy
Fixes: 4225fc8555a9 ("lib/scatterlist: use page iterator in the mapping iterator")
Cc: sta...@vger.kernel.org
---
 lib/scatterlist.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 739dc9fe2c55..f0757a67affe 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -678,17 +678,18 @@ static bool sg_miter_get_next_page(struct sg_mapping_iter *miter)
 {
 	if (!miter->__remaining) {
 		struct scatterlist *sg;
-		unsigned long pgoffset;
 
 		if (!__sg_page_iter_next(&miter->piter))
 			return false;
 
 		sg = miter->piter.sg;
 
-		pgoffset = miter->piter.sg_pgoffset;
-		miter->__offset = pgoffset ? 0 : sg->offset;
+		miter->__offset = miter->piter.sg_pgoffset ? 0 : sg->offset;
+		miter->piter.sg_pgoffset += miter->__offset >> PAGE_SHIFT;
+		miter->__offset &= PAGE_SIZE - 1;
 		miter->__remaining = sg->offset + sg->length -
-				     (pgoffset << PAGE_SHIFT) - miter->__offset;
+				     (miter->piter.sg_pgoffset << PAGE_SHIFT) -
+				     miter->__offset;
 		miter->__remaining = min_t(unsigned long, miter->__remaining,
 					   PAGE_SIZE - miter->__offset);
 	}
-- 
2.13.3
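The normalization the patch performs can be sketched in isolation: fold any whole pages hidden in the SG entry's offset into the page index, so the residual in-page offset is always below PAGE_SIZE. The struct and names below are hypothetical simplifications mirroring miter->piter.sg_pgoffset and miter->__offset:

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Simplified mapping-iterator state (illustration only). */
struct iter {
	unsigned long pgoffset;	/* which page of the SG entry we're on */
	unsigned long offset;	/* offset within that page */
};

/* On the first page of an entry, take sg_offset as the starting
 * offset, then carry any whole pages it contains into pgoffset so
 * that offset always ends up < PAGE_SIZE (the kmap'able window). */
static void normalize(struct iter *it, unsigned int sg_offset)
{
	it->offset = it->pgoffset ? 0 : sg_offset;
	it->pgoffset += it->offset >> PAGE_SHIFT;
	it->offset &= PAGE_SIZE - 1;
}
```

For example, an SG entry with offset 5000 on a 4 KiB page system starts on page 1 of the entry at in-page offset 904, instead of page 0 at offset 5000 — which would have fallen outside the single kmapped page.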
[PATCH v5 4/4] crypto: talitos - drop icv_ool
icv_ool is not used anymore, drop it.

Fixes: e345177ded17 ("crypto: talitos - fix AEAD processing.")
Signed-off-by: Christophe Leroy
---
 drivers/crypto/talitos.c | 3 ---
 drivers/crypto/talitos.h | 2 --
 2 files changed, 5 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index ab6bd45addf7..c9d686a0e805 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1285,9 +1285,6 @@ static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
 			     is_ipsec_esp && !encrypt);
 	tbl_off += ret;
 
-	/* ICV data */
-	edesc->icv_ool = !encrypt;
-
 	if (!encrypt && is_ipsec_esp) {
 		struct talitos_ptr *tbl_ptr = &edesc->link_tbl[tbl_off];
 
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 95f78c6d9206..1469b956948a 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -46,7 +46,6 @@ struct talitos_desc {
  * talitos_edesc - s/w-extended descriptor
  * @src_nents: number of segments in input scatterlist
  * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
  * @iv_dma: dma address of iv for checking continuity and link table
  * @dma_len: length of dma mapped link_tbl space
  * @dma_link_tbl: bus physical address of link_tbl/buf
@@ -61,7 +60,6 @@ struct talitos_desc {
 struct talitos_edesc {
 	int src_nents;
 	int dst_nents;
-	bool icv_ool;
 	dma_addr_t iv_dma;
 	int dma_len;
 	dma_addr_t dma_link_tbl;
-- 
2.13.3
[PATCH v5 2/4] crypto: talitos - move struct talitos_edesc into talitos.h
Moves struct talitos_edesc into talitos.h so that it can be used from any place in talitos.c.

It will be required for the next patch ("crypto: talitos - fix hash on SEC1").

Signed-off-by: Christophe Leroy
Cc: sta...@vger.kernel.org
---
 drivers/crypto/talitos.c | 30 --
 drivers/crypto/talitos.h | 30 ++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index a7704b53529d..95dc3957f358 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -951,36 +951,6 @@ static int aead_des3_setkey(struct crypto_aead *authenc,
 	goto out;
 }
 
-/*
- * talitos_edesc - s/w-extended descriptor
- * @src_nents: number of segments in input scatterlist
- * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
- * @iv_dma: dma address of iv for checking continuity and link table
- * @dma_len: length of dma mapped link_tbl space
- * @dma_link_tbl: bus physical address of link_tbl/buf
- * @desc: h/w descriptor
- * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
- * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
- *
- * if decrypting (with authcheck), or either one of src_nents or dst_nents
- * is greater than 1, an integrity check value is concatenated to the end
- * of link_tbl data
- */
-struct talitos_edesc {
-	int src_nents;
-	int dst_nents;
-	bool icv_ool;
-	dma_addr_t iv_dma;
-	int dma_len;
-	dma_addr_t dma_link_tbl;
-	struct talitos_desc desc;
-	union {
-		struct talitos_ptr link_tbl[0];
-		u8 buf[0];
-	};
-};
-
 static void talitos_sg_unmap(struct device *dev,
 			     struct talitos_edesc *edesc,
 			     struct scatterlist *src,
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 32ad4fc679ed..95f78c6d9206 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -42,6 +42,36 @@ struct talitos_desc {
 
 #define TALITOS_DESC_SIZE	(sizeof(struct talitos_desc) - sizeof(__be32))
 
+/*
+ * talitos_edesc - s/w-extended descriptor
+ * @src_nents: number of segments in input scatterlist
+ * @dst_nents: number of segments in output scatterlist
+ * @icv_ool: whether ICV is out-of-line
+ * @iv_dma: dma address of iv for checking continuity and link table
+ * @dma_len: length of dma mapped link_tbl space
+ * @dma_link_tbl: bus physical address of link_tbl/buf
+ * @desc: h/w descriptor
+ * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
+ * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
+ *
+ * if decrypting (with authcheck), or either one of src_nents or dst_nents
+ * is greater than 1, an integrity check value is concatenated to the end
+ * of link_tbl data
+ */
+struct talitos_edesc {
+	int src_nents;
+	int dst_nents;
+	bool icv_ool;
+	dma_addr_t iv_dma;
+	int dma_len;
+	dma_addr_t dma_link_tbl;
+	struct talitos_desc desc;
+	union {
+		struct talitos_ptr link_tbl[0];
+		u8 buf[0];
+	};
+};
+
 /**
  * talitos_request - descriptor submission request
  * @desc: descriptor pointer (kernel virtual)
-- 
2.13.3
Re: [PATCH 3/3] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
On 06/23/2019 03:14 PM, Nicholas Piggin wrote: > vmalloc_to_page returns NULL for addresses mapped by larger pages[*]. > Whether or not a vmap is huge depends on the architecture details, > alignments, boot options, etc., which the caller can not be expected > to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page. > > This change teaches vmalloc_to_page about larger pages, and returns > the struct page that corresponds to the offset within the large page. > This makes the API agnostic to mapping implementation details. > > [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap: > fail gracefully on unexpected huge vmap mappings") > > Signed-off-by: Nicholas Piggin > --- > include/asm-generic/4level-fixup.h | 1 + > include/asm-generic/5level-fixup.h | 1 + > mm/vmalloc.c | 37 +++--- > 3 files changed, 26 insertions(+), 13 deletions(-) > > diff --git a/include/asm-generic/4level-fixup.h > b/include/asm-generic/4level-fixup.h > index e3667c9a33a5..3cc65a4dd093 100644 > --- a/include/asm-generic/4level-fixup.h > +++ b/include/asm-generic/4level-fixup.h > @@ -20,6 +20,7 @@ > #define pud_none(pud)0 > #define pud_bad(pud) 0 > #define pud_present(pud) 1 > +#define pud_large(pud) 0 > #define pud_ERROR(pud) do { } while (0) > #define pud_clear(pud) pgd_clear(pud) > #define pud_val(pud) pgd_val(pud) > diff --git a/include/asm-generic/5level-fixup.h > b/include/asm-generic/5level-fixup.h > index bb6cb347018c..c4377db09a4f 100644 > --- a/include/asm-generic/5level-fixup.h > +++ b/include/asm-generic/5level-fixup.h > @@ -22,6 +22,7 @@ > #define p4d_none(p4d)0 > #define p4d_bad(p4d) 0 > #define p4d_present(p4d) 1 > +#define p4d_large(p4d) 0 > #define p4d_ERROR(p4d) do { } while (0) > #define p4d_clear(p4d) pgd_clear(p4d) > #define p4d_val(p4d) pgd_val(p4d) > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 4c9e150e5ad3..4be98f700862 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -36,6 +36,7 @@ > #include > > #include > +#include > #include > #include 
> > @@ -284,26 +285,36 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) > > if (pgd_none(*pgd)) > return NULL; > + > p4d = p4d_offset(pgd, addr); > if (p4d_none(*p4d)) > return NULL; > - pud = pud_offset(p4d, addr); > + if (WARN_ON_ONCE(p4d_bad(*p4d))) > + return NULL; The warning here is a required addition but it needs to be moved after p4d_large() check. Please see the next comment below. > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP > + if (p4d_large(*p4d)) > + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT); > +#endif > > - /* > - * Don't dereference bad PUD or PMD (below) entries. This will also > - * identify huge mappings, which we may encounter on architectures > - * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be > - * identified as vmalloc addresses by is_vmalloc_addr(), but are > - * not [unambiguously] associated with a struct page, so there is > - * no correct value to return for them. > - */ > - WARN_ON_ONCE(pud_bad(*pud)); > - if (pud_none(*pud) || pud_bad(*pud)) > + pud = pud_offset(p4d, addr); > + if (pud_none(*pud)) > + return NULL; > + if (WARN_ON_ONCE(pud_bad(*pud))) > return NULL; > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP > + if (pud_large(*pud)) > + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); > +#endif > + pud_bad() on arm64 returns true when the PUD does not point to a next page table page implying the fact that it might be a large/huge entry. I am not sure if the semantics holds good for other architectures too. But on arm64 if pud_large() is true, then pud_bad() will be true as well. So pud_bad() check must happen after pud_large() check. So the sequence here should be 1. pud_none() --> Nothing is in here, return NULL 2. pud_large() --> Return offset page address from the huge page mapping 3. pud_bad()--> Return NULL as there is no more page table level left Checking pud_bad() first can return NULL for a valid huge mapping. 
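The ordering argument above can be made concrete with a toy model in which, as on arm64, a huge (leaf) entry also fails the "points to a next-level table" test, i.e. pud_bad() is true for it. The encodings below are invented for illustration and are not the real page-table bits:

```c
#include <assert.h>

/* Toy PUD states: empty, a pointer to a next-level table, or a leaf
 * (huge) mapping. */
enum pud { PUD_NONE, PUD_TABLE, PUD_HUGE };

static int pud_none(enum pud p)  { return p == PUD_NONE; }
static int pud_large(enum pud p) { return p == PUD_HUGE; }
/* As on arm64: anything that isn't a valid table pointer is "bad",
 * which includes a perfectly valid huge mapping. */
static int pud_bad(enum pud p)   { return p != PUD_TABLE; }

/* Returns 1 for "resolved to a huge page", 0 for "descend to the next
 * level", -1 for "no page here". */
static int walk_pud(enum pud p)
{
	if (pud_none(p))
		return -1;
	if (pud_large(p))	/* must be checked before pud_bad() */
		return 1;
	if (pud_bad(p))
		return -1;
	return 0;
}
```

Swapping the last two checks would make walk_pud() return -1 for PUD_HUGE, i.e. vmalloc_to_page() would return NULL for a valid huge mapping — exactly the regression being discussed.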
> pmd = pmd_offset(pud, addr); > - WARN_ON_ONCE(pmd_bad(*pmd)); > - if (pmd_none(*pmd) || pmd_bad(*pmd)) > + if (pmd_none(*pmd)) > + return NULL; > + if (WARN_ON_ONCE(pmd_bad(*pmd))) > return NULL; > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP > + if (pmd_large(*pmd)) > + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); > +#endif Ditto. I see that your previous proposal had this right which got changed in this manner after my comments. Sorry about
[PATCH v1] powerpc: Fix BUG_ON during memory unplug on radix
We hit the following BUG_ON when memory hotplugged before reboot is unplugged after reboot: kernel BUG at arch/powerpc/mm/pgtable-frag.c:113! remove_pagetable+0x594/0x6a0 (unreliable) remove_pagetable+0x94/0x6a0 vmemmap_free+0x394/0x410 sparse_remove_one_section+0x26c/0x2e8 __remove_pages+0x428/0x540 arch_remove_memory+0xd0/0x170 __remove_memory+0xd4/0x1a0 dlpar_remove_lmb+0xbc/0x110 dlpar_memory+0xa80/0xd20 handle_dlpar_errorlog+0xa8/0x160 pseries_hp_work_fn+0x2c/0x60 process_one_work+0x46c/0x860 worker_thread+0x364/0x5e0 kthread+0x1b0/0x1c0 ret_from_kernel_thread+0x5c/0x68 This occurs because, during reboot-after-hotplug, the hotplugged memory range gets initialized as regular memory and page tables are setup using memblock allocator. This means that we wouldn't have initialized the PMD or PTE fragment count for those PMD or PTE pages. Fixing this includes 3 parts: - Re-walk the init_mm page tables from mem_init() and initialize the PMD and PTE fragment counts appropriately. So PMD and PTE table pages allocated during early allocation will have a fragment count of 1. - Convert the pages from memblock pages to regular pages so that they can be freed back to buddy allocator seamlessly. However we do this for only PMD and PTE pages and not for PUD pages. PUD pages are freed using kmem_cache_free() and we need to identify memblock pages and free them differently. - When we do early memblock based allocation of PMD and PUD pages, allocate in PAGE_SIZE granularity so that we are sure the complete page is used as pagetable page. PAGE_SIZE allocations will have an implication on the amount of memory used for page tables, an example of which is shown below: Since we now do PAGE_SIZE allocations for both PUD table and PMD table (Note that PTE table allocation is already of PAGE_SIZE), we end up allocating more memory for the same amount of system RAM. Here is an example of how much more we end up allocating for page tables in case of 64T system RAM: 1. 
Mapping system RAM With each PGD entry spanning 512G, 64TB RAM would need 128 entries and hence 128 PUD tables. We use 1G mapping at PUD level (level 2) With default PUD_TABLE_SIZE(4K), 128*4K=512K (8 64K pages) With PAGE_SIZE(64K) allocations, 128*64K=8192K (128 64K pages) 2. Mapping struct pages (memmap) 64T RAM would need 64G for memmap with struct page size being 64B. Since memmap array is mapped using 64K mappings, we would need 64 PUD entries or 64 PMD tables (level 3) in total. With default PMD_TABLE_SIZE(4K), 64*4K=256K (4 64K pages) With PAGE_SIZE(64K) allocations, 64*64K=4096K (64 64K pages) There is no change in PTE table (level 4) allocation requirement as early page table allocation is already using PAGE_SIZE PTE tables. So essentially with this change we would use 180 64K pages more for 64T system. Reported-by: Srikanth Aithal Signed-off-by: Bharata B Rao --- v0 - https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-June/192242.html Changes in v1: - Handling PUD table freeing too. 
- Added details about how much extra memory we use up with this approach into the commit message - A few cleanups and renames arch/powerpc/include/asm/book3s/64/pgalloc.h | 7 +- arch/powerpc/include/asm/book3s/64/radix.h | 1 + arch/powerpc/include/asm/sparsemem.h | 1 + arch/powerpc/mm/book3s64/pgtable.c | 15 +++- arch/powerpc/mm/book3s64/radix_pgtable.c | 79 +++- arch/powerpc/mm/mem.c| 5 ++ 6 files changed, 104 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index d5a44912902f..9ae134f260be 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -111,7 +111,12 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) static inline void pud_free(struct mm_struct *mm, pud_t *pud) { - kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud); + struct page *page = virt_to_page(pud); + + if (PageReserved(page)) + free_reserved_page(page); + else + kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud); } static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 574eca33f893..4320f2790e8d 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -285,6 +285,7 @@ static inline unsigned long radix__get_tree_size(void) #ifdef CONFIG_MEMORY_HOTPLUG int radix__create_section_mapping(unsigned long start, unsigned long end, int nid); int radix__remove_section_mapping(unsigned long start, unsigned long end); +void radix__fixup_pgtable_fragments(void); #endif /* CONFIG_MEMORY_HOTPLUG */ #endif /* __ASSEMBLY__ */ #endif diff --git a/arch/powerpc/include/asm/sparsemem.h
Re: [PATCH 0/3] fix vmalloc_to_page for huge vmap mappings
Hello Nicholas, On 06/23/2019 03:14 PM, Nicholas Piggin wrote: > This is a change broken out from the huge vmap vmalloc series as > requested. There is a little bit of dependency juggling across Thanks for splitting up the previous series and sending this one out. > trees, but patches are pretty trivial. Ideally if Andrew accepts > this patch and queues it up for next, then the arch patches would > be merged through those trees then patch 3 gets sent by Andrew. Fair enough. > > I've tested this with other powerpc and vmalloc patches, with code > that explicitly tests vmalloc_to_page on vmalloced memory and > results look fine. - Anshuman