Re: [PATCH 0/3] cxlflash: Miscellaneous fixes

2017-08-25 Thread Martin K. Petersen

Uma,

> This patch series contains miscellaneous fixes. The first two address
> issues that were identified by smatch and the last patch fixes a
> regression introduced by Commit 565180723294 ("scsi: cxlflash: SISlite
> updates to support 4 ports").

Applied to 4.14/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH 3/3] cxlflash: Fix vlun resize failure in the shrink path

2017-08-25 Thread Uma Krishnan
The ioctl DK_CAPI_VLUN_RESIZE can fail if the allocated vlun size is
reduced from almost maximum capacity and then increased again.

The shrink_lxt() routine is currently using the SISL_ASTATUS_MASK to mask
the higher 48 bits of the lxt entry. This is unnecessary and incorrect as
it uses a mask designed for the asynchronous interrupt status register.
When the 4 port support was added to cxlflash, the SISL_ASTATUS_MASK was
updated to reflect the status bits for all 4 ports. This change indirectly
affected the shrink_lxt() code path.

To extract the base, simply shift the bits without masking.

Fixes: 565180723294 ("scsi: cxlflash: SISlite updates to support 4 ports")

Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/vlun.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/scsi/cxlflash/vlun.c b/drivers/scsi/cxlflash/vlun.c
index bdfb930..703bf1e 100644
--- a/drivers/scsi/cxlflash/vlun.c
+++ b/drivers/scsi/cxlflash/vlun.c
@@ -694,11 +694,7 @@ static int shrink_lxt(struct afu *afu,
/* Free LBAs allocated to freed chunks */
mutex_lock(&blka->mutex);
for (i = delta - 1; i >= 0; i--) {
-   /* Mask the higher 48 bits before shifting, even though
-* it is a noop
-*/
-   aun = (lxt_old[my_new_size + i].rlba_base & SISL_ASTATUS_MASK);
-   aun = (aun >> MC_CHUNK_SHIFT);
+   aun = lxt_old[my_new_size + i].rlba_base >> MC_CHUNK_SHIFT;
if (needs_ws)
write_same16(sdev, aun, MC_CHUNK_SIZE);
ba_free(&blka->ba_lun, aun);
-- 
2.1.0



[PATCH 2/3] cxlflash: Avoid double mutex unlock

2017-08-25 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The AFU recovery routine uses an interruptible mutex to control the
flow of in-flight recoveries. Upon receiving an interruptible signal
the code branches to a common exit path which wrongly assumes the
mutex is held. Add a local variable to track when the mutex should be
unlocked.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/superpipe.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c
index ad0f996..e9ee1d9 100644
--- a/drivers/scsi/cxlflash/superpipe.c
+++ b/drivers/scsi/cxlflash/superpipe.c
@@ -1650,6 +1650,7 @@ static int cxlflash_afu_recover(struct scsi_device *sdev,
u64 ctxid = DECODE_CTXID(recover->context_id),
rctxid = recover->context_id;
long reg;
+   bool locked = true;
int lretry = 20; /* up to 2 seconds */
int new_adap_fd = -1;
int rc = 0;
@@ -1658,8 +1659,11 @@ static int cxlflash_afu_recover(struct scsi_device *sdev,
up_read(&cfg->ioctl_rwsem);
rc = mutex_lock_interruptible(mutex);
down_read(&cfg->ioctl_rwsem);
-   if (rc)
+   if (rc) {
+   locked = false;
goto out;
+   }
+
rc = check_state(cfg);
if (rc) {
dev_err(dev, "%s: Failed state rc=%d\n", __func__, rc);
@@ -1693,8 +1697,10 @@ static int cxlflash_afu_recover(struct scsi_device *sdev,
mutex_unlock(mutex);
msleep(100);
rc = mutex_lock_interruptible(mutex);
-   if (rc)
+   if (rc) {
+   locked = false;
goto out;
+   }
goto retry_recover;
}
 
@@ -1738,7 +1744,8 @@ static int cxlflash_afu_recover(struct scsi_device *sdev,
 out:
if (likely(ctxi))
put_context(ctxi);
-   mutex_unlock(mutex);
+   if (locked)
+   mutex_unlock(mutex);
atomic_dec_if_positive(&cfg->recovery_threads);
return rc;
 }
-- 
2.1.0



[PATCH 1/3] cxlflash: Remove unnecessary existence check

2017-08-25 Thread Uma Krishnan
From: "Matthew R. Ochs" 

The AFU termination sequence has been refactored over time such that
the main tear down routine, term_afu(), can no longer be invoked
with a NULL AFU pointer. Remove the unnecessary existence check from
term_afu().

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 
---
 drivers/scsi/cxlflash/main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index 6a4367c..76b8b7ee 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -820,8 +820,7 @@ static void term_afu(struct cxlflash_cfg *cfg)
for (k = cfg->afu->num_hwqs - 1; k >= 0; k--)
term_intr(cfg, UNMAP_THREE, k);
 
-   if (cfg->afu)
-   stop_afu(cfg);
+   stop_afu(cfg);
 
for (k = cfg->afu->num_hwqs - 1; k >= 0; k--)
term_mc(cfg, k);
-- 
2.1.0



[PATCH 0/3] cxlflash: Miscellaneous fixes

2017-08-25 Thread Uma Krishnan
This patch series contains miscellaneous fixes. The first two address
issues that were identified by smatch and the last patch fixes a regression
introduced by Commit 565180723294 ("scsi: cxlflash: SISlite updates to
support 4 ports").

This series is intended for 4.14 and is bisectable.

Matthew R. Ochs (2):
  cxlflash: Remove unnecessary existence check
  cxlflash: Avoid double mutex unlock

Uma Krishnan (1):
  cxlflash: Fix vlun resize failure in the shrink path

 drivers/scsi/cxlflash/main.c  |  3 +--
 drivers/scsi/cxlflash/superpipe.c | 13 ++---
 drivers/scsi/cxlflash/vlun.c  |  6 +-
 3 files changed, 12 insertions(+), 10 deletions(-)

-- 
2.1.0



Re: [PATCH v2 1/1] futex: remove duplicated code and fix UB

2017-08-25 Thread Thomas Gleixner
On Thu, 24 Aug 2017, Will Deacon wrote:
> On Thu, Aug 24, 2017 at 09:31:05AM +0200, Jiri Slaby wrote:
> > +static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
> > +{
> > +   unsigned int op = (encoded_op & 0x70000000) >> 28;
> > +   unsigned int cmp = (encoded_op & 0x0f000000) >> 24;
> > +   int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
> > +   int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);
> > +   int oldval, ret;
> > +
> > +   if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
> > +   if (oparg < 0 || oparg > 31)
> > +   return -EINVAL;
> > +   oparg = 1 << oparg;
> > +   }
> > +
> > +   if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
> > +   return -EFAULT;
> > +
> > +   ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
> > +   if (ret)
> > +   return ret;
> 
> We could move the pagefault_{disable,enable} calls here, and then remove
> them from the futex_atomic_op_inuser callsites elsewhere in futex.c

Correct, but we can do that after getting this in.
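
For illustration, that follow-up would roughly turn the hunk above into the
sketch below (untested; it assumes the callers in kernel/futex.c currently
bracket the call with pagefault_disable()/pagefault_enable(), which could
then be dropped there):

        if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
                return -EFAULT;

        pagefault_disable();
        ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
        pagefault_enable();
        if (ret)
                return ret;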

Thanks,

tglx


Re: [PATCH 2/2] of: Restrict DMA configuration

2017-08-25 Thread Christoph Hellwig
On Tue, Aug 15, 2017 at 09:19:50AM -0500, Rob Herring wrote:
> > For the immediate issue at hand, I guess the alternative plan of attack
> > would be to stick a flag in struct bus_type for the bus drivers
> > themselves to opt into generic DMA configuration. That at least keeps
> > everything within the kernel (and come to think of it probably works
> > neatly for modular bus types as well).
> 
> I'm fine with the change as is, it's really just the commit text I'm
> commenting on.

Robin, can you resend this with an updated commit text?


Re: [RFC Part1 PATCH v3 14/17] x86/boot: Add early boot support when running with SEV active

2017-08-25 Thread Borislav Petkov
Btw,

I don't see our SEV-specific chicken bit which disables SEV only.

Do we need it? If so, maybe something like

mem_encrypt=sme_only

or so.

Or is the mem_encrypt=off chicken bit enough?

What about the use case where you want SME but no encrypted guests?

A bunch of hmmm.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


[bug report] powerpc/mm/cxl: Add the fault handling cpu to mm cpumask

2017-08-25 Thread Dan Carpenter
Hello Aneesh Kumar K.V,

This is a semi-automatic email about new static checker warnings.

The patch 0f4bc0932e51: "powerpc/mm/cxl: Add the fault handling cpu
to mm cpumask" from Jul 27, 2017, leads to the following Smatch
complaint:

drivers/misc/cxl/fault.c:161 cxl_handle_mm_fault()
warn: variable dereferenced before check 'mm' (see line 146)

drivers/misc/cxl/fault.c
   145   */
   146  cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
   ^^
The patch adds an unchecked dereference.

   147  if ((result = copro_handle_mm_fault(mm, dar, dsisr, &flt))) {
   148  pr_devel("copro_handle_mm_fault failed: %#x\n", result);
   149  return result;
   150  }
   151  
   152  if (!radix_enabled()) {
   153  /*
   154   * update_mmu_cache() will not have loaded the hash since current->trap
   155   * is not a 0x400 or 0x300, so just call hash_page_mm() here.
   156   */
   157  access = _PAGE_PRESENT | _PAGE_READ;
   158  if (dsisr & CXL_PSL_DSISR_An_S)
   159  access |= _PAGE_WRITE;
   160  
   161  if (!mm && (REGION_ID(dar) != USER_REGION_ID))
 ^^
But the existing code is careful to check "mm" for NULL.  The
copro_handle_mm_fault() and hash_page_mm() both have checks built in.

   162  access |= _PAGE_PRIVILEGED;
   163  
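
If mm really can be NULL when a fault is handled here, one way to address the
warning would be to make the new call conditional (just a sketch; the right
fix is for the authors to judge):

        if (mm)
                cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));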

regards,
dan carpenter


[GIT PULL] Please pull powerpc/linux.git powerpc-4.13-8 tag

2017-08-25 Thread Michael Ellerman
Hi Linus,

Please pull one more powerpc fix for 4.13:

The following changes since commit 5a69aec945d27e78abac9fd032533d3aaebf7c1e:

  powerpc: Fix VSX enabling/flushing to also test MSR_FP and MSR_VEC 
(2017-08-16 19:35:54 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.13-8

for you to fetch changes up to 1a92a80ad386a1a6e3b36d576d52a1a456394b70:

  powerpc/mm: Ensure cpumask update is ordered (2017-08-18 13:07:16 +1000)


powerpc fixes for 4.13 #8

Just one fix, to add a barrier in the switch_mm() code to make sure the mm
cpumask update is ordered vs the MMU starting to load translations. As far as we
know no one's actually hit the bug, but that's just luck.

Thanks to:
  Benjamin Herrenschmidt, Nicholas Piggin.


Benjamin Herrenschmidt (1):
  powerpc/mm: Ensure cpumask update is ordered

 arch/powerpc/include/asm/mmu_context.h  | 18 ++
 arch/powerpc/include/asm/pgtable-be-types.h |  1 +
 arch/powerpc/include/asm/pgtable-types.h|  1 +
 3 files changed, 20 insertions(+)




Re: [PATCH v7 12/12] powerpc/vas: Define copy/paste interfaces

2017-08-25 Thread Michael Ellerman
Hi Suka,

A few more things ...

Sukadev Bhattiprolu  writes:

> diff --git a/arch/powerpc/platforms/powernv/copy-paste.h b/arch/powerpc/platforms/powernv/copy-paste.h
> new file mode 100644
> index 000..7783bb8
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/copy-paste.h
> @@ -0,0 +1,74 @@
> +/*
> + * Copyright 2016 IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +/*
> + * Macros taken from tools/testing/selftests/powerpc/context_switch/cp_abort.c
> + */

These are both out of date, they're changed in v3.0B.

> +#define PASTE(RA, RB, L, RC) \
> + .long (0x7c00070c | (RA) << (31-15) | (RB) << (31-20) \
> +   | (L) << (31-10) | (RC) << (31-31))

You should define PPC_PASTE() in ppc-opcode.h

We already have PPC_INST_PASTE, so use that.

L and RC are gone.

> +
> +#define COPY(RA, RB, L) \
> + .long (0x7c00060c | (RA) << (31-15) | (RB) << (31-20) \
> +   | (L) << (31-10))

Use PPC_COPY().

> +
> +#define CR0_FXM  "0x80"

I don't think a #define for this helps readability.

> +#define CR0_SHIFT 28
> +#define CR0_MASK 0xF

Not used.

> +/*
> + * Copy/paste instructions:
> + *
> + *   copy RA,RB,L
> + *   Copy contents of address (RA) + effective_address(RB)
> + *   to internal copy-buffer.
> + *
> + *   L == 1 indicates this is the first copy.
> + *
> + *   L == 0 indicates its a continuation of a prior first copy.
> + *
> + *   paste RA,RB,L
> + *   Paste contents of internal copy-buffer to the address
> + *   (RA) + effective_address(RB)
> + *
> + *   L == 0 indicates its a continuation of a prior paste. i.e.
> + *   don't wait for the completion or update status.
> + *
> + *   L == 1 indicates this is the last paste in the group (i.e.
> + *   wait for the group to complete and update status in CR0).
> + *
> + *   For Power9, the L bit must be 'true' in both copy and paste.
> + */
> +
> +static inline int vas_copy(void *crb, int offset, int first)
> +{
> + WARN_ON_ONCE(!first);

Please change the API to not require unused parameters.

Same for offset.

> +
> + __asm__ __volatile(stringify_in_c(COPY(%0, %1, %2))";"

I've never seen __volatile before.

Just use: asm volatile


> + :
> + : "b" (offset), "b" (crb), "i" (1)
> + : "memory");
> +
> + return 0;
> +}
> +
> +static inline int vas_paste(void *paste_address, int offset, int last)
> +{
> + unsigned long long cr;

cr is 32-bits actually.

> + WARN_ON_ONCE(!last);
> +
> + cr = 0;
> + __asm__ __volatile(stringify_in_c(PASTE(%1, %2, 1, 1))";"
> + "mfocrf %0," CR0_FXM ";"
> + : "=r" (cr)
> + : "b" (paste_address), "b" (offset)
> + : "memory");

You need cr0 in the clobbers.
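
i.e. something along these lines (sketch):

        : "=r" (cr)
        : "b" (paste_address), "b" (offset)
        : "memory", "cr0");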

> +
> + return cr;

I think it would be more natural if you just returned CR0, i.e. do the
shift and mask with the CR0 constants you have above.
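
Something like this sketch:

        return (cr >> CR0_SHIFT) & CR0_MASK;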


> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 70762c3..73081b4 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -1040,6 +1041,57 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  }
>  EXPORT_SYMBOL_GPL(vas_tx_win_open);
>  
> +int vas_copy_crb(void *crb, int offset, bool first)
> +{
> + if (!vas_initialized())
> + return -1;
> +
> + return vas_copy(crb, offset, first);
> +}
> +EXPORT_SYMBOL_GPL(vas_copy_crb);
> +
> +#define RMA_LSMP_REPORT_ENABLE PPC_BIT(53)
> +int vas_paste_crb(struct vas_window *txwin, int offset, bool last, bool re)
> +{
> + int rc;
> + uint64_t val;
> + void *addr;
> +
> + if (!vas_initialized())
> + return -1;

This is in the fast path, or at least the runtime path. So I don't think
these checks are wanted, how would we have got this far if vas wasn't
initialised?



cheers


[PATCH] powerpc/44x: mask and shift to zero bug

2017-08-25 Thread Dan Carpenter
My static checker complains that 0x1800 >> 13 is zero.  Looking at
the context, it seems like a copy and paste bug from the line below and
probably 0x3 << 13 or 0x6000 was intended.
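
For reference: 0x1800 covers bits 11-12, so (pllmr & 0x1800) >> 13 can only
ever be 0, whereas 0x6000 covers bits 13-14 and (pllmr & 0x6000) >> 13 yields
the expected 0-3 range before the "+ 1".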

Fixes: 2af59f7d5c3e ("[POWERPC] 4xx: Add 405GPr and 405EP support in boot wrapper")
Signed-off-by: Dan Carpenter 
---
Not tested!

diff --git a/arch/powerpc/boot/4xx.c b/arch/powerpc/boot/4xx.c
index 9d3bd4c45a24..f7da65169124 100644
--- a/arch/powerpc/boot/4xx.c
+++ b/arch/powerpc/boot/4xx.c
@@ -564,7 +564,7 @@ void ibm405gp_fixup_clocks(unsigned int sys_clk, unsigned int ser_clk)
fbdv = 16;
cbdv = ((pllmr & 0x00060000) >> 17) + 1; /* CPU:PLB */
opdv = ((pllmr & 0x00018000) >> 15) + 1; /* PLB:OPB */
-   ppdv = ((pllmr & 0x1800) >> 13) + 1; /* PLB:PCI */
+   ppdv = ((pllmr & 0x6000) >> 13) + 1; /* PLB:PCI */
epdv = ((pllmr & 0x1800) >> 11) + 2; /* PLB:EBC */
udiv = ((cpc0_cr0 & 0x3e) >> 1) + 1;
 


Re: [PATCH v7 10/12] powerpc/vas: Define vas_win_close() interface

2017-08-25 Thread Michael Ellerman
Hi Suka,

More comments :)

Sukadev Bhattiprolu  writes:

> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 2dd4b63..24288dd 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -879,11 +887,92 @@ struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
>  }
>  EXPORT_SYMBOL_GPL(vas_rx_win_open);
>  
> -/* stub for now */
> +static void poll_window_busy_state(struct vas_window *window)
> +{
> + int busy;
> + uint64_t val;
> +
> +retry:
> + /*
> +  * Poll Window Busy flag
> +  */
> + val = read_hvwc_reg(window, VREG(WIN_STATUS));
> + busy = GET_FIELD(VAS_WIN_BUSY, val);
> + if (busy) {
> + val = 0;
> + schedule_timeout(2000);

What's 2000?

That's in jiffies, so it's not a fixed amount of time.

But on a typical config that will be 20 _seconds_ ?!

But you haven't set the task state, so AFAIK it will just return
instantly.

And if there's a software/hardware bug and it never stops being busy,
then we have a softlockup. The other option would be to print a big fat
warning and just not free the window. But maybe that doesn't work for
other reasons.
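
For reference, a bounded sleep here would normally look something like this
sketch (exact timeout aside):

        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_timeout(msecs_to_jiffies(10));

or simply schedule_timeout_uninterruptible(msecs_to_jiffies(10)).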

> + goto retry;
> + }
> +}
> +
> +static void poll_window_castout(struct vas_window *window)
> +{
> + int cached;
> + uint64_t val;
> +
> + /* Cast window context out of the cache */
> +retry:
> + val = read_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL));
> + cached = GET_FIELD(VAS_WIN_CACHE_STATUS, val);
> + if (cached) {
> + val = 0ULL;
> + val = SET_FIELD(VAS_CASTOUT_REQ, val, 1);
> + val = SET_FIELD(VAS_PUSH_TO_MEM, val, 0);
> + write_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL), val);

Sigh, I still don't like that macro :)

or:
write_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL), 1ull << 63);

> +
> + schedule_timeout(2000);
> + goto retry;
> + }
> +}
> +
> +/*
> + * Close a window.
> + *
> + * See Section 1.12.1 of VAS workbook v1.05 for details on closing window:
> + *   - Disable new paste operations (unmap paste address)
> + *   - Poll for the "Window Busy" bit to be cleared
> + *   - Clear the Open/Enable bit for the Window.
> + *   - Poll for return of window Credits (implies FIFO empty for Rx win?)
> + *   - Unpin and cast window context out of cache
> + *
> + * Besides the hardware, kernel has some bookkeeping of course.
> + */
>  int vas_win_close(struct vas_window *window)
>  {
> - return -1;
> + uint64_t val;
> +
> + if (!window)
> + return 0;
> +
> + if (!window->tx_win && atomic_read(&window->num_txwins) != 0) {
> + pr_devel("VAS: Attempting to close an active Rx window!\n");
> + WARN_ON_ONCE(1);
> + return -EAGAIN;

EAGAIN means "if you do the same thing again it might work".

I don't think that's right here. The window is not in a state where it
can be freed, the caller needs to do something to fix that.

EBUSY would probably be more appropriate.


cheers


Re: [PATCH v7 01/12] powerpc/vas: Define macros, register fields and structures

2017-08-25 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:

> diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
> new file mode 100644
> index 000..c66aaf0
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -0,0 +1,385 @@
...
> +
> +/*
> + * In-kernel state a VAS window. One per window.
> + */
> +struct vas_window {
> + /* Fields common to send and receive windows */
> + struct vas_instance *vinst;
> + int winid;
> + bool tx_win;/* True if send window */
> + bool nx_win;/* True if NX window */
> + bool user_win;  /* True if user space window */
> + void *hvwc_map; /* HV window context */
> + void *uwc_map;  /* OS/User window context */
> + pid_t pid;  /* Linux process id of owner */
> +
> + /* Fields applicable only to send windows */
> + void *paste_kaddr;
> + char *paste_addr_name;
> + struct vas_window *rxwin;
> +
> + /* Feilds applicable only to receive windows */
> + enum vas_cop_type cop;
> + atomic_t num_txwins;

Just noticed this. You should probably use the new refcount_t type,
because AFAICS you're using this as a refcount.

cheers


Re: [PATCH v2 00/20] Speculative page faults

2017-08-25 Thread Laurent Dufour
On 21/08/2017 08:28, Anshuman Khandual wrote:
> On 08/18/2017 03:34 AM, Laurent Dufour wrote:
>> This is a port on kernel 4.13 of the work done by Peter Zijlstra to
>> handle page fault without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>> process since the page fault handler will not wait for other threads memory
>> layout change to be done, assuming that this change is done in another part
>> of the process's memory space. This type page fault is named speculative
>> page fault. If the speculative page fault fails because of a concurrency is
>> detected or because underlying PMD or PTE tables are not yet allocating, it
>> is failing its processing and a classic page fault is then tried.
>>
>> The speculative page fault (SPF) has to look for the VMA matching the fault
>> address without holding the mmap_sem, so the VMA list is now managed using
>> SRCU allowing lockless walking. The only impact would be the deferred file
>> dereferencing in the case of a file mapping, since the file pointer is
>> released once the SRCU cleaning is done.  This patch relies on the change
>> done recently by Paul McKenney in SRCU which now runs a callback per CPU
>> instead of per SRCU structure [1].
>>
>> The VMA's attributes checked during the speculative page fault processing
>> have to be protected against parallel changes. This is done by using a per
>> VMA sequence lock. This sequence lock allows the speculative page fault
>> handler to fast check for parallel changes in progress and to abort the
>> speculative page fault in that case.
>>
>> Once the VMA is found, the speculative page fault handler would check for
>> the VMA's attributes to verify that the page fault has to be handled
>> correctly or not. Thus the VMA is protected through a sequence lock which
>> allows fast detection of concurrent VMA changes. If such a change is
>> detected, the speculative page fault is aborted and a *classic* page fault
>> is tried.  VMA sequence locks are added when VMA attributes which are
>> checked during the page fault are modified.
>>
>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>> so once the page table is locked, the VMA is valid, so any other changes
>> leading to touching this PTE will need to lock the page table, so no
>> parallel change is possible at this time.
>>
>> Compared to Peter's initial work, this series introduces a spin_trylock
>> when dealing with speculative page faults. This is required to avoid a
>> deadlock when handling a page fault while a TLB invalidate is requested by
>> another CPU holding the PTE. Another change is due to a lock dependency issue
>> with mapping->i_mmap_rwsem.
>>
>> In addition, some VMA field values which are used once the PTE is unlocked
>> at the end of the page fault path are saved into the vm_fault structure so
>> that the values used match the VMA at the time the PTE was locked.
>>
>> This series builds on top of v4.13-rc5 and is functional on x86 and
>> PowerPC.
>>
>> Tests have been made using a large commercial in-memory database on a
>> PowerPC system with 752 CPUs using RFC v5. The results are very encouraging
>> since the loading of the 2TB database was faster by 14% with the
>> speculative page fault.
>>
> 
> You specifically mention loading as most of the page faults will
> happen at that time and then the working set will settle down with
> very few page faults thereafter? That means unless there is
> another wave of page faults we won't notice any performance improvement
> during the runtime.

I just captured performance statistics during the database loading; since
the database was not stimulated afterwards, there were no page faults generated.
Further tests will be made while the database is running, but I don't have
the framework to do so right now.

> 
>> Using the ebizzy test [3], which spreads a lot of threads, the results are good
>> when running on either a large or a small system. When using kernbench, the
> 
> The performance improvements are greater as there is a lot of creation
> and destruction of anon mappings which generates constant flow of page
> faults to be handled.
> 
>> results are quite similar, which is expected as not many multi-threaded
>> processes are involved. But there is no performance degradation either,
>> which is good.
> 
> If we compile with 'make -j N' there would be a lot of threads but I
> guess the problem is SPF does not support handling file mapping IIUC
> which limits the performance improvement for some workloads.

Yes, but that test shows that there is no performance degradation, which
is good.

>>
>> --
>> Benchmarks results
>>
>> Note these test have been made on top of 4.13-rc3 with the following patch
>> from Paul McKenney applied: 
>>  "srcu: Provide ordering for CPU not involved in grace period" [5]
> 
> Is this patch an improvement for SRCU which we are using

Re: [PATCH v7 08/12] powerpc/vas: Define vas_win_id()

2017-08-25 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:

> Define an interface to return a system-wide unique id for a given VAS
> window.
>
> The vas_win_id() will be used in a follow-on patch to generate an unique
> handle for a user space receive window. Applications can use this handle
> to pair send and receive windows for fast thread-wakeup.
>
> The hardware refers to this system-wide unique id as a Partition Send
> Window ID which is expected to be used during fault handling. Hence the
> "pswid" in the function names.

Same comment as previous patch.

cheers


Re: [PATCH v7 07/12] powerpc/vas: Define vas_win_paste_addr()

2017-08-25 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:

> Define an interface that the NX drivers can use to find the physical
> paste address of a send window. This interface is expected to be used
> with the mmap() operation of the NX driver's device. i.e the user space
> process can use driver's mmap() operation to map the send window's paste
> address into their address space and then use copy and paste instructions
> to submit the CRBs to the NX engine.
>
> Note that kernel drivers will use vas_paste_crb() directly and don't need
> this interface.

AFAICS this is not used. And if it was that would be a problem because
there are implications in letting userspace do paste, so can you drop
this from the series for now.

cheers


Re: [PATCH v7 06/12] powerpc/vas: Define helpers to alloc/free windows

2017-08-25 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 3a50d6a..9c12919 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -490,6 +490,76 @@ int init_winctx_regs(struct vas_window *window, struct vas_winctx *winctx)
>   return 0;
>  }
>  
> +DEFINE_SPINLOCK(vas_ida_lock);

static.

> +
> +void vas_release_window_id(struct ida *ida, int winid)

static.

> +{
> + spin_lock(&vas_ida_lock);
> + ida_remove(ida, winid);
> + spin_unlock(&vas_ida_lock);
> +}
> +
> +int vas_assign_window_id(struct ida *ida)

static.

> +{
> + int rc, winid;
> +
> + rc = ida_pre_get(ida, GFP_KERNEL);
> + if (!rc)
> + return -EAGAIN;
> +
> + spin_lock(&vas_ida_lock);
> + rc = ida_get_new_above(ida, 0, &winid);

If you're passing 0 you can just use ida_get_new().

Or did you actually want to exclude 0? In which case you should pass 1.

> + spin_unlock(&vas_ida_lock);
> +
> + if (rc)
> + return rc;

You're supposed to handle EAGAIN I thought.
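
The usual idiom with the old ida API is a retry loop, roughly (sketch):

        do {
                if (!ida_pre_get(ida, GFP_KERNEL))
                        return -ENOMEM;
                spin_lock(&vas_ida_lock);
                rc = ida_get_new_above(ida, 0, &winid);
                spin_unlock(&vas_ida_lock);
        } while (rc == -EAGAIN);

        if (rc)
                return rc;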

> +
> + if (winid > VAS_WINDOWS_PER_CHIP) {
> + pr_err("VAS: Too many (%d) open windows\n", winid);
> + vas_release_window_id(ida, winid);
> + return -EAGAIN;
> + }
> +
> + return winid;
> +}
> +
> +void vas_window_free(struct vas_window *window)

static.

> +{
> + int winid = window->winid;
> + struct vas_instance *vinst = window->vinst;
> +
> + unmap_winctx_mmio_bars(window);
> + kfree(window);
> +
> + vas_release_window_id(&vinst->ida, winid);
> +}
> +
> +struct vas_window *vas_window_alloc(struct vas_instance *vinst)
> +{
> + int winid;
> + struct vas_window *window;
> +
> + winid = vas_assign_window_id(&vinst->ida);
> + if (winid < 0)
> + return ERR_PTR(winid);
> +
> + window = kzalloc(sizeof(*window), GFP_KERNEL);
> + if (!window)
> + return ERR_PTR(-ENOMEM);

You leak an id here.

The error handling would be easier in here if the caller did the alloc,
or if you split alloc and init, and alloc just did the kzalloc().

One of the callers even prints "unable to allocate memory" if this
function fails, but that's not accurate, there's several failure modes.
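
For instance, the minimal fix for the leak would be something like (sketch):

        window = kzalloc(sizeof(*window), GFP_KERNEL);
        if (!window) {
                vas_release_window_id(&vinst->ida, winid);
                return ERR_PTR(-ENOMEM);
        }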

> +
> + window->vinst = vinst;
> + window->winid = winid;
> +
> + if (map_winctx_mmio_bars(window))
> + goto out_free;
> +
> + return window;
> +
> +out_free:
> + kfree(window);

Leak an id here.

> + return ERR_PTR(-ENOMEM);
> +}
> +
>  /* stub for now */
>  int vas_win_close(struct vas_window *window)
>  {
> -- 
> 2.7.4


Re: [PATCH v7 05/12] powerpc/vas: Define helpers to init window context

2017-08-25 Thread Michael Ellerman
Sukadev Bhattiprolu  writes:
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index a3a705a..3a50d6a 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vas.h"
>  
> @@ -185,6 +186,310 @@ int map_winctx_mmio_bars(struct vas_window *window)
>   return 0;
>  }
>  
> +/*
> + * Reset all valid registers in the HV and OS/User Window Contexts for
> + * the window identified by @window.
> + *
> + * NOTE: We cannot really use a for loop to reset window context. Not all
> + *offsets in a window context are valid registers and the valid
> + *registers are not sequential. And, we can only write to offsets
> + *with valid registers (or is that only in Simics?).

I assume there's no "reset everything" register we can write to do this
for us?

Also if you can clean up the comment to not mention Simics, I would
assume that applies on real hardware too.

> + */
> +void reset_window_regs(struct vas_window *window)
> +{
> + write_hvwc_reg(window, VREG(LPID), 0ULL);
> + write_hvwc_reg(window, VREG(PID), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_MSR), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_LPCR), 0ULL);
> + write_hvwc_reg(window, VREG(XLATE_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(AMR), 0ULL);
> + write_hvwc_reg(window, VREG(SEIDR), 0ULL);
> + write_hvwc_reg(window, VREG(FAULT_TX_WIN), 0ULL);
> + write_hvwc_reg(window, VREG(OSU_INTR_SRC_RA), 0ULL);
> + write_hvwc_reg(window, VREG(HV_INTR_SRC_RA), 0ULL);
> + write_hvwc_reg(window, VREG(PSWID), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE1), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE2), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE3), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE4), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE5), 0ULL);
> + write_hvwc_reg(window, VREG(SPARE6), 0ULL);

Should we be writing to spare registers? Presumably in a future hardware
revision they might have some unknown purpose.

> + write_hvwc_reg(window, VREG(LFIFO_BAR), 0ULL);
> + write_hvwc_reg(window, VREG(LDATA_STAMP_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(LDMA_CACHE_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(LRFIFO_PUSH), 0ULL);
> + write_hvwc_reg(window, VREG(CURR_MSG_COUNT), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_AFTER_COUNT), 0ULL);
> + write_hvwc_reg(window, VREG(LRX_WCRED), 0ULL);
> + write_hvwc_reg(window, VREG(LRX_WCRED_ADDER), 0ULL);
> + write_hvwc_reg(window, VREG(TX_WCRED), 0ULL);
> + write_hvwc_reg(window, VREG(TX_WCRED_ADDER), 0ULL);
> + write_hvwc_reg(window, VREG(LFIFO_SIZE), 0ULL);
> + write_hvwc_reg(window, VREG(WINCTL), 0ULL);
> + write_hvwc_reg(window, VREG(WIN_STATUS), 0ULL);
> + write_hvwc_reg(window, VREG(WIN_CTX_CACHING_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(TX_RSVD_BUF_COUNT), 0ULL);
> + write_hvwc_reg(window, VREG(LRFIFO_WIN_PTR), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_CTL), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_PID), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_LPID), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_TID), 0ULL);
> + write_hvwc_reg(window, VREG(LNOTIFY_SCOPE), 0ULL);
> + write_hvwc_reg(window, VREG(NX_UTIL_ADDER), 0ULL);
> +
> + /* Skip read-only registers: NX_UTIL and NX_UTIL_SE */
> +
> + /*
> +  * The send and receive window credit adder registers are also
> +  * accessible from HVWC and have been initialized above. We don't
> +  * need to initialize from the OS/User Window Context, so skip
> +  * following calls:
> +  *
> +  *  write_uwc_reg(window, VREG(TX_WCRED_ADDER), 0ULL);
> +  *  write_uwc_reg(window, VREG(LRX_WCRED_ADDER), 0ULL);
> +  */
> +}
> +
> +/*
> + * Initialize window context registers related to Address Translation.
> + * These registers are common to send/receive windows although they
> + * differ for user/kernel windows. As we resolve the TODOs we may
> + * want to add fields to vas_winctx and move the initialization to
> + * init_vas_winctx_regs().
> + */
> +static void init_xlate_regs(struct vas_window *window, bool user_win)
> +{
> + uint64_t lpcr, val;
> +
> + /*
> +  * MSR_TA, MSR_US are false for both kernel and user.
> +  * MSR_DR and MSR_PR are false for kernel.
> +  */
> + val = 0ULL;
> + val = SET_FIELD(VAS_XLATE_MSR_HV, val, true);

Using a bool here presumably works, but if you actually wrote:

  ((u64)true << VAS_XLATE_MSR_HV)

It would look pretty weird. Using an int would be more normal.

> + val = SET_FIELD(VAS_XLATE_MSR_SF, val, true);
> + if (user_win) {
> + val = SET_FIELD(VAS_XLATE_MSR_DR, val, true);
> + val = SET_FIELD(VAS_XLATE_MSR_PR, val, true);
> + }
> + write_hvwc_reg(win

Re: [PATCH v2 18/20] perf tools: Add support for the SPF perf event

2017-08-25 Thread Laurent Dufour
On 21/08/2017 10:48, Anshuman Khandual wrote:
> On 08/18/2017 03:35 AM, Laurent Dufour wrote:
>> Add support for the new speculative faults event.
>>
>> Signed-off-by: Laurent Dufour 
>> ---
>>  tools/include/uapi/linux/perf_event.h | 1 +
>>  tools/perf/util/evsel.c   | 1 +
>>  tools/perf/util/parse-events.c| 4 
>>  tools/perf/util/parse-events.l| 1 +
>>  tools/perf/util/python.c  | 1 +
>>  5 files changed, 8 insertions(+)
>>
>> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
>> index b1c0b187acfe..3043ec0988e9 100644
>> --- a/tools/include/uapi/linux/perf_event.h
>> +++ b/tools/include/uapi/linux/perf_event.h
>> @@ -111,6 +111,7 @@ enum perf_sw_ids {
>>  PERF_COUNT_SW_EMULATION_FAULTS  = 8,
>>  PERF_COUNT_SW_DUMMY = 9,
>>  PERF_COUNT_SW_BPF_OUTPUT= 10,
>> +PERF_COUNT_SW_SPF_DONE  = 11,
> 
> Right, just one event for the success case. 'DONE' is redundant, only
> 'SPF' should be fine IMHO.
> 

Fair enough, I'll rename it PERF_COUNT_SW_SPF.

Thanks,
Laurent.



Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-25 Thread Laurent Dufour
On 20/08/2017 14:11, Sergey Senozhatsky wrote:
> On (08/18/17 00:05), Laurent Dufour wrote:
> [..]
>> +/*
>> + * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> + * are not compatible with the speculative page fault processing.
>> + */
>> +pol = __get_vma_policy(vma, address);
>> +if (!pol)
>> +pol = get_task_policy(current);
>> +if (pol && pol->mode == MPOL_INTERLEAVE)
>> +goto unlock;
> 
> include/linux/mempolicy.h defines
> 
> struct mempolicy *get_task_policy(struct task_struct *p);
> struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   unsigned long addr);
> 
> only for CONFIG_NUMA configs.

Thanks Sergey, I'll add #ifdef around this block.
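
Roughly, the intended change (sketch):

#ifdef CONFIG_NUMA
	/*
	 * MPOL_INTERLEAVE implies additional checks in mpol_misplaced()
	 * which are not compatible with the speculative page fault processing.
	 */
	pol = __get_vma_policy(vma, address);
	if (!pol)
		pol = get_task_policy(current);
	if (pol && pol->mode == MPOL_INTERLEAVE)
		goto unlock;
#endif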



Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

2017-08-25 Thread Frederic Barrat



On 25/08/2017 at 09:44, Benjamin Herrenschmidt wrote:

On Fri, 2017-08-25 at 06:53 +0200, Frederic Barrat wrote:


On 24/08/2017 at 20:47, Benjamin Herrenschmidt wrote:

On Thu, 2017-08-24 at 18:40 +0200, Frederic Barrat wrote:


The decrementing part is giving me troubles, and I think it makes sense:
if I decrement the counter when detaching the context from the capi
card, then the next TLBIs for the memory context may be back to local.


Yes, you need to flush the CAPI TLB first.


So when the process exits, the NPU wouldn't get the associated TLBIs,
which spells trouble the next time the same memory context ID is reused.
I believe this is the cause of the problem I'm seeing. As soon as I keep
the TLBIs global, even after I detach from the capi adapter, everything
is fine.

Does it sound right?

So to keep the checks minimal in mm_is_thread_local(), to just checking
the active_cpus count, I'm thinking of introducing a "copro enabled" bit
on the context, so that we can increment active_cpus only once. And
never decrement it.


You can decrement if you flush. Don't you have MMIOs to do directed
flushes ?


That's for the nMMU. Last I heard, we don't have MMIOs to flush anything
on the nMMU.

Side note: for the PSL, we do have MMIOs to flush, but they were
perceived as useful only for debug and we don't rely on them, precisely
because the nMMU would fall out of sync, so we have to rely on broadcast.


Well, you can always do a broadcast tlbi to flush the whole PID if you
decrement... that shouldn't be a very frequent operation.


Arg, yes!
Thanks!

  Fred




Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

2017-08-25 Thread Benjamin Herrenschmidt
On Fri, 2017-08-25 at 06:53 +0200, Frederic Barrat wrote:
> 
> On 24/08/2017 at 20:47, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-08-24 at 18:40 +0200, Frederic Barrat wrote:
> > > 
> > > The decrementing part is giving me troubles, and I think it makes sense:
> > > if I decrement the counter when detaching the context from the capi
> > > card, then the next TLBIs for the memory context may be back to local.
> > 
> > Yes, you need to flush the CAPI TLB first.
> > 
> > > So when the process exits, the NPU wouldn't get the associated TLBIs,
> > > which spells trouble the next time the same memory context ID is reused.
> > > I believe this is the cause of the problem I'm seeing. As soon as I keep
> > > the TLBIs global, even after I detach from the capi adapter, everything
> > > is fine.
> > > 
> > > Does it sound right?
> > > 
> > > So to keep the checks minimal in mm_is_thread_local(), to just checking
> > > the active_cpus count, I'm thinking of introducing a "copro enabled" bit
> > > on the context, so that we can increment active_cpus only once. And
> > > never decrement it.
> > 
> > You can decrement if you flush. Don't you have MMIOs to do directed
> > flushes ?
> 
> That's for the nMMU. Last I heard, we don't have MMIOs to flush anything 
> on the nMMU.
> 
> Side note: for the PSL, we do have MMIOs to flush, but they were 
> perceived as useful only for debug and we don't rely on them, precisely 
> because the nMMU would fall out of sync, so we have to rely on broadcast.

Well, you can always do a broadcast tlbi to flush the whole PID if you
decrement... that shouldn't be a very frequent operation.
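
For reference, the scheme being discussed could end up looking roughly like
the sketch below (entirely illustrative; the flag, its location and the
helper names are made up):

	/* attach: force TLBIs for this mm to stay global */
	static void mm_enable_copro(struct mm_struct *mm)
	{
		if (!test_and_set_bit(MM_COPRO_ENABLED, &mm->context.flags))
			inc_mm_active_cpus(mm);
	}

	/* detach: flush the whole PID first, then allow local TLBIs again */
	static void mm_disable_copro(struct mm_struct *mm)
	{
		if (test_and_clear_bit(MM_COPRO_ENABLED, &mm->context.flags)) {
			flush_tlb_mm(mm);	/* broadcast flush of the whole PID */
			dec_mm_active_cpus(mm);
		}
	}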

Ben.