Re: [EXTERNAL] [RFC PATCH] powernv/eeh: Fix oops when probing cxl devices

2019-10-14 Thread Sam Bobroff
On Fri, Sep 27, 2019 at 02:45:10PM +0200, Frederic Barrat wrote:
> Recent cleanup in the way EEH support is added to a device causes a
> kernel oops when the cxl driver probes a device and creates virtual
> devices discovered on the FPGA:
> 
> BUG: Kernel NULL pointer dereference at 0x00a0
> Faulting instruction address: 0xc0048070
> Oops: Kernel access of bad area, sig: 7 [#1]
> ...
> NIP [c0048070] eeh_add_device_late.part.9+0x50/0x1e0
> LR [c004805c] eeh_add_device_late.part.9+0x3c/0x1e0
> Call Trace:
> [c000200e43983900] [c079e250] _dev_info+0x5c/0x6c (unreliable)
> [c000200e43983980] [c00d1ad0] pnv_pcibios_bus_add_device+0x60/0xb0
> [c000200e439839f0] [c00606d0] pcibios_bus_add_device+0x40/0x60
> [c000200e43983a10] [c06aa3a0] pci_bus_add_device+0x30/0x100
> [c000200e43983a80] [c06aa4d4] pci_bus_add_devices+0x64/0xd0
> [c000200e43983ac0] [c0081c429118] cxl_pci_vphb_add+0xe0/0x130 [cxl]
> [c000200e43983b00] [c0081c4242ac] cxl_probe+0x504/0x5b0 [cxl]
> [c000200e43983bb0] [c06bba1c] local_pci_probe+0x6c/0x110
> [c000200e43983c30] [c0159278] work_for_cpu_fn+0x38/0x60
> 
> The root cause is that those cxl virtual devices don't have a
> representation in the device tree and therefore no associated pci_dn
> structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and
> we oops.
> 
> We never had explicit EEH support for those virtual
> devices. Instead, EEH events are reported to the (real) pci device and
> handled by the cxl driver, which can then forward them to the virtual
> devices and handle dependencies. The fact that we try adding EEH
> support for the virtual devices is new and a side-effect of the recent
> cleanup.
> 
> This patch fixes it by skipping adding EEH support on powernv for
> devices which don't have a pci_dn structure.
> 
> Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug")
> Signed-off-by: Frederic Barrat 
> ---
> 
> Sending as an RFC, as I'm afraid of hiding potential issues and would
> be interested in comments. The powernv eeh code expects a struct
> pci_dn, so the fix seems safe. I'm wondering if there could be cases
> (other than capi virtual devices) where we'd want to blow up and fix
> instead of going undetected with this patch.

Looks good to me.

I do think it would be good to detect a missing pci_dn (WARN_ONCE()
might be appropriate). However, to implement it,
pnv_pcibios_bus_add_device() would need a way to detect that a struct
pci_dev is a cxl virtual device. I don't see an easy way to do that; do
you know if it's possible?

One last thing: pseries_pcibios_bus_add_device() also requires a pci_dn.
Do you know if it's possible to trigger a similar issue there, or is it
not possible for some reason?

Cheers,
Sam.

>  arch/powerpc/platforms/powernv/eeh-powernv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 6bc24a47e9ef..6f300ab7f0e9 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -42,7 +42,7 @@ void pnv_pcibios_bus_add_device(struct pci_dev *pdev)
>  {
>   struct pci_dn *pdn = pci_get_pdn(pdev);
>  
> - if (eeh_has_flag(EEH_FORCE_DISABLED))
> + if (!pdn || eeh_has_flag(EEH_FORCE_DISABLED))
>   return;
>  
>   dev_dbg(&pdev->dev, "EEH: Setting up device\n");
> -- 
> 2.21.0
> 




[PATCH v2 0/3] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-14 Thread Eric Biggers
This series converts the glue code for the PowerPC SPE implementations
of AES-ECB, AES-CBC, AES-CTR, and AES-XTS from the deprecated
"blkcipher" API to the "skcipher" API.  This is needed in order for the
blkcipher API to be removed.

Patches 1-2 are fixes.  Patch 3 is the actual conversion.

Tested with:

export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
make mpc85xx_defconfig
cat >> .config << EOF
# CONFIG_MODULES is not set
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_XTS=y
CONFIG_CRYPTO_AES_PPC_SPE=y
EOF
make olddefconfig
make -j32
qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
-kernel arch/powerpc/boot/zImage \
-append cryptomgr.fuzz_iterations=1000

Note that xts-ppc-spe still fails the comparison tests due to the lack
of ciphertext stealing support.  This is not addressed by this series.

Changed since v1:

- Split fixes into separate patches.

- Made ppc_aes_setkey_skcipher() call ppc_aes_setkey(), rather than
  creating a separate expand_key() function.  This keeps the code
  shorter.

Eric Biggers (3):
  crypto: powerpc - don't unnecessarily use atomic scatterwalk
  crypto: powerpc - don't set ivsize for AES-ECB
  crypto: powerpc - convert SPE AES algorithms to skcipher API

 arch/powerpc/crypto/aes-spe-glue.c | 389 -
 crypto/Kconfig |   1 +
 2 files changed, 166 insertions(+), 224 deletions(-)

-- 
2.23.0



[PATCH v2 2/3] crypto: powerpc - don't set ivsize for AES-ECB

2019-10-14 Thread Eric Biggers
From: Eric Biggers 

Set the ivsize for the "ecb-ppc-spe" algorithm to 0, since ECB mode
doesn't take an IV.

This fixes a failure in the extra crypto self-tests:

alg: skcipher: ivsize for ecb-ppc-spe (16) doesn't match generic impl (0)

Signed-off-by: Eric Biggers 
---
 arch/powerpc/crypto/aes-spe-glue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/crypto/aes-spe-glue.c b/arch/powerpc/crypto/aes-spe-glue.c
index 319f1dbb3a70..4189d2644f74 100644
--- a/arch/powerpc/crypto/aes-spe-glue.c
+++ b/arch/powerpc/crypto/aes-spe-glue.c
@@ -415,7 +415,6 @@ static struct crypto_alg aes_algs[] = { {
.blkcipher = {
.min_keysize=   AES_MIN_KEY_SIZE,
.max_keysize=   AES_MAX_KEY_SIZE,
-   .ivsize =   AES_BLOCK_SIZE,
.setkey =   ppc_aes_setkey,
.encrypt=   ppc_ecb_encrypt,
.decrypt=   ppc_ecb_decrypt,
-- 
2.23.0



[PATCH v2 3/3] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-14 Thread Eric Biggers
From: Eric Biggers 

Convert the glue code for the PowerPC SPE implementations of AES-ECB,
AES-CBC, AES-CTR, and AES-XTS from the deprecated "blkcipher" API to the
"skcipher" API.  This is needed in order for the blkcipher API to be
removed.

Tested with:

export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
make mpc85xx_defconfig
cat >> .config << EOF
# CONFIG_MODULES is not set
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_XTS=y
CONFIG_CRYPTO_AES_PPC_SPE=y
EOF
make olddefconfig
make -j32
qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
-kernel arch/powerpc/boot/zImage \
-append cryptomgr.fuzz_iterations=1000

Note that xts-ppc-spe still fails the comparison tests due to the lack
of ciphertext stealing support.  This is not addressed by this patch.

This patch also cleans up the code by making ->encrypt() and ->decrypt()
call a common function for each of ECB, CBC, and XTS, and by using a
clearer way to compute the length to process at each step.

Signed-off-by: Eric Biggers 
---
 arch/powerpc/crypto/aes-spe-glue.c | 381 +
 crypto/Kconfig |   1 +
 2 files changed, 166 insertions(+), 216 deletions(-)

diff --git a/arch/powerpc/crypto/aes-spe-glue.c b/arch/powerpc/crypto/aes-spe-glue.c
index 4189d2644f74..f828f8bcd0c6 100644
--- a/arch/powerpc/crypto/aes-spe-glue.c
+++ b/arch/powerpc/crypto/aes-spe-glue.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -118,13 +119,19 @@ static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
return 0;
 }
 
-static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
+static int ppc_aes_setkey_skcipher(struct crypto_skcipher *tfm,
+  const u8 *in_key, unsigned int key_len)
+{
+   return ppc_aes_setkey(crypto_skcipher_tfm(tfm), in_key, key_len);
+}
+
+static int ppc_xts_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
   unsigned int key_len)
 {
-   struct ppc_xts_ctx *ctx = crypto_tfm_ctx(tfm);
+   struct ppc_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
int err;
 
-   err = xts_check_key(tfm, in_key, key_len);
+   err = xts_verify_key(tfm, in_key, key_len);
if (err)
return err;
 
@@ -133,7 +140,7 @@ static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
if (key_len != AES_KEYSIZE_128 &&
key_len != AES_KEYSIZE_192 &&
key_len != AES_KEYSIZE_256) {
-   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
+   crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
return -EINVAL;
}
 
@@ -178,201 +185,154 @@ static void ppc_aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
spe_end();
 }
 
-static int ppc_ecb_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
-  struct scatterlist *src, unsigned int nbytes)
+static int ppc_ecb_crypt(struct skcipher_request *req, bool enc)
 {
-   struct ppc_aes_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
-   struct blkcipher_walk walk;
-   unsigned int ubytes;
+   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+   struct ppc_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
+   struct skcipher_walk walk;
+   unsigned int nbytes;
int err;
 
-   blkcipher_walk_init(&walk, dst, src, nbytes);
-   err = blkcipher_walk_virt(desc, &walk);
+   err = skcipher_walk_virt(&walk, req, false);
 
-   while ((nbytes = walk.nbytes)) {
-   ubytes = nbytes > MAX_BYTES ?
-nbytes - MAX_BYTES : nbytes & (AES_BLOCK_SIZE - 1);
-   nbytes -= ubytes;
+   while ((nbytes = walk.nbytes) != 0) {
+   nbytes = min_t(unsigned int, nbytes, MAX_BYTES);
+   nbytes = round_down(nbytes, AES_BLOCK_SIZE);
 
spe_begin();
-   ppc_encrypt_ecb(walk.dst.virt.addr, walk.src.virt.addr,
-   ctx->key_enc, ctx->rounds, nbytes);
+   if (enc)
+   ppc_encrypt_ecb(walk.dst.virt.addr, walk.src.virt.addr,
+   ctx->key_enc, ctx->rounds, nbytes);
+   else
+   ppc_decrypt_ecb(walk.dst.virt.addr, walk.src.virt.addr,
+   ctx->key_dec, ctx->rounds, nbytes);
spe_end();
 
-   err = blkcipher_walk_done(desc, &walk, ubytes);
+   err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
}
 
return err;
 }
 
-static int ppc_ecb_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
-

[PATCH v2 1/3] crypto: powerpc - don't unnecessarily use atomic scatterwalk

2019-10-14 Thread Eric Biggers
From: Eric Biggers 

The PowerPC SPE implementations of AES modes only disable preemption
during the actual encryption/decryption, not during the scatterwalk
functions.  It's therefore unnecessary to request an atomic scatterwalk.
So don't do so.

Signed-off-by: Eric Biggers 
---
 arch/powerpc/crypto/aes-spe-glue.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/crypto/aes-spe-glue.c b/arch/powerpc/crypto/aes-spe-glue.c
index 3a4ca7d32477..319f1dbb3a70 100644
--- a/arch/powerpc/crypto/aes-spe-glue.c
+++ b/arch/powerpc/crypto/aes-spe-glue.c
@@ -186,7 +186,6 @@ static int ppc_ecb_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
unsigned int ubytes;
int err;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
 
@@ -214,7 +213,6 @@ static int ppc_ecb_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
unsigned int ubytes;
int err;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
 
@@ -242,7 +240,6 @@ static int ppc_cbc_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
unsigned int ubytes;
int err;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
 
@@ -270,7 +267,6 @@ static int ppc_cbc_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
unsigned int ubytes;
int err;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
 
@@ -298,7 +294,6 @@ static int ppc_ctr_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
unsigned int pbytes, ubytes;
int err;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt_block(desc, &walk, AES_BLOCK_SIZE);
 
@@ -329,7 +324,6 @@ static int ppc_xts_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
int err;
u32 *twk;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
twk = ctx->key_twk;
@@ -360,7 +354,6 @@ static int ppc_xts_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
int err;
u32 *twk;
 
-   desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt(desc, &walk);
twk = ctx->key_twk;
-- 
2.23.0



Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 10:03:03PM +0200, Jan Kara wrote:
> On Mon 14-10-19 08:23:39, Eric Sandeen wrote:
> > On 10/14/19 4:43 AM, Jan Kara wrote:
> > > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > > When using fadump (firmware assisted dump) mode on powerpc, a mismatch
> > > > > > between grub xfs driver and kernel xfs driver has been observed. Note:
> > > > > > fadump boots up in the following sequence: firmware -> grub reads kernel
> > > > > > and initramfs -> kernel boots.
> > > > > > 
> > > > > > The process to reproduce this mismatch:
> > > > > >- On powerpc, boot kernel with fadump=on and edit 
> > > > > > /etc/kdump.conf.
> > > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, 
> > > > > > "kdumpctl
> > > > > >  restart" to rebuild the initramfs. Detail about the rebuilding 
> > > > > > looks
> > > > > >  like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > > >mv /boot/initramfs-`uname -r`.img.tmp 
> > > > > > /boot/initramfs-`uname -r`.img
> > > > > >sync
> > > > > >- "echo c >/proc/sysrq-trigger".
> > > > > > 
> > > > > > The result:
> > > > > > The dump image will not be saved under /var/crashnew/* as expected, 
> > > > > > but
> > > > > > still saved under /var/crash.
> > > > > > 
> > > > > > The root cause:
> > > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by 
> > > > > > writing
> > > > > > back metadata to xlog, but not necessary to fsblock. This raises 
> > > > > > issue if
> > > > > > grub can not replay the xlog before accessing the xfs files. Since 
> > > > > > the
> > > > > > above dir entry of initramfs should be saved as inline data with 
> > > > > > xfs_inode,
> > > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > > > > 
> > > > > > umount can be used to write metadata fsblock, but the filesystem 
> > > > > > can not be
> > > > > > umounted if still in use.
> > > > > > 
> > > > > > There are two ways to fix this mismatch, either grub or xfs. It may 
> > > > > > be
> > > > > > easier to do this in xfs side by introducing an interface to flush 
> > > > > > metadata
> > > > > > to fsblock explicitly.
> > > > > > 
> > > > > > With this patch, metadata can be written to fsblock by:
> > > > > ># update AIL
> > > > > >sync
> > > > > ># new introduced interface to flush metadata to fsblock
> > > > > >mount -o remount,metasync mountpoint
> > > > > 
> > > > > I think this ought to be an ioctl or some sort of generic call since 
> > > > > the
> > > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > > is too dumb to recover logs but still wants to write to the fs"
> > > > > checkpointing problem.
> > > > Yes, a syscall sounds more reasonable.
> > > > > 
> > > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > > don't know...)
> > > > I think it is unavoidable to involve each fs' implementation. What
> > > > about introducing an interface sync_to_fsblock(struct super_block *sb) 
> > > > in
> > > > the struct super_operations, then let each fs manage its own case?
> > > 
> > > Well, we already have a way to achieve what you need: fsfreeze.
> > > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > > equivalent to the fs being unmounted and that seems to be what the
> > > bootloader wants so that it can access the filesystem without worrying
> > > about some recovery details. So do you see any problem with replacing
> > > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
> > > 
> > >   Honza
> > 
> > The problem with fsfreeze is that if the device you want to quiesce is, say,
> > the root fs, freeze isn't really a good option.
> 
> I agree you need to be really careful not to deadlock against yourself in
> that case. But this particular use actually has a chance to work.
> 
Yeah, normally there is a /boot partition in the system, and if so, fsfreeze
can work.
> > But the other thing I want to highlight about this approach is that it does 
> > not
> > solve the root problem: something is trying to read the block device without
> > first replaying the log.
> > 
> > A call such as the proposal here is only going to leave consistent metadata 
> > at
> > the time the call returns; at any time after that, all guarantees are off 
> > again,
> > so the problem hasn't been solved.
> 
> Oh, absolutely agreed. I was also thinking about this before sending my
> reply. Once you unfreeze, the log can start filling with changes and
> there's no guarantee that e.g. inode does not move as part of these
But just as with fsync, we only guarantee consistency up to the point of a
sync. If the involved files change again, we need another sync.
> changes. But to 

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 08:23:39AM -0500, Eric Sandeen wrote:
> On 10/14/19 4:43 AM, Jan Kara wrote:
> > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > When using fadump (firmware assisted dump) mode on powerpc, a mismatch
> > > > > between grub xfs driver and kernel xfs driver has been observed. Note:
> > > > > fadump boots up in the following sequence: firmware -> grub reads kernel
> > > > > and initramfs -> kernel boots.
> > > > > 
> > > > > The process to reproduce this mismatch:
> > > > >- On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, 
> > > > > "kdumpctl
> > > > >  restart" to rebuild the initramfs. Detail about the rebuilding 
> > > > > looks
> > > > >  like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > >mv /boot/initramfs-`uname -r`.img.tmp 
> > > > > /boot/initramfs-`uname -r`.img
> > > > >sync
> > > > >- "echo c >/proc/sysrq-trigger".
> > > > > 
> > > > > The result:
> > > > > The dump image will not be saved under /var/crashnew/* as expected, 
> > > > > but
> > > > > still saved under /var/crash.
> > > > > 
> > > > > The root cause:
> > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by 
> > > > > writing
> > > > > back metadata to xlog, but not necessary to fsblock. This raises 
> > > > > issue if
> > > > > grub can not replay the xlog before accessing the xfs files. Since the
> > > > > above dir entry of initramfs should be saved as inline data with 
> > > > > xfs_inode,
> > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > > > 
> > > > > umount can be used to write metadata fsblock, but the filesystem can 
> > > > > not be
> > > > > umounted if still in use.
> > > > > 
> > > > > There are two ways to fix this mismatch, either grub or xfs. It may be
> > > > > easier to do this in xfs side by introducing an interface to flush 
> > > > > metadata
> > > > > to fsblock explicitly.
> > > > > 
> > > > > With this patch, metadata can be written to fsblock by:
> > > > ># update AIL
> > > > >sync
> > > > ># new introduced interface to flush metadata to fsblock
> > > > >mount -o remount,metasync mountpoint
> > > > 
> > > > I think this ought to be an ioctl or some sort of generic call since the
> > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > is too dumb to recover logs but still wants to write to the fs"
> > > > checkpointing problem.
> > > Yes, a syscall sounds more reasonable.
> > > > 
> > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > don't know...)
> > > I think it is unavoidable to involve each fs' implementation. What
> > > about introducing an interface sync_to_fsblock(struct super_block *sb) in
> > > the struct super_operations, then let each fs manage its own case?
> > 
> > Well, we already have a way to achieve what you need: fsfreeze.
> > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > equivalent to the fs being unmounted and that seems to be what the
> > bootloader wants so that it can access the filesystem without worrying
> > about some recovery details. So do you see any problem with replacing
> > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
> > 
> > Honza
> 
> The problem with fsfreeze is that if the device you want to quiesce is, say,
> the root fs, freeze isn't really a good option.
Yes, that is the difference between my patch and fsfreeze.  But
honestly, it is a rare case for a system not to have a /boot partition. Since
the activity on /boot is very low, fsfreeze may meet the need, or one can
repeatedly retry fsfreeze until it succeeds.
> 
> But the other thing I want to highlight about this approach is that it does 
> not
> solve the root problem: something is trying to read the block device without
> first replaying the log.
> 
> A call such as the proposal here is only going to leave consistent metadata at
> the time the call returns; at any time after that, all guarantees are off 
> again,
My patch places assumption that grub only accesses limited files and ensures the
consistency only on those files (kernel,initramfs).
> so the problem hasn't been solved.
Agree. The perfect solution should be a log aware bootloader.

Thanks and regards,
Pingfan


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Mon, Oct 14, 2019 at 01:40:27AM -0700, Christoph Hellwig wrote:
> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > When using fadump (firmware assisted dump) mode on powerpc, a mismatch
> > between grub xfs driver and kernel xfs driver has been observed.  Note:
> > fadump boots up in the following sequence: firmware -> grub reads kernel
> > and initramfs -> kernel boots.
> 
> This isn't something new.  To fundamentally fix this you need to
> implement (in-memory) log recovery in grub.  That is the only really safe
> long-term solutioin.  But the equivalent of your patch you can already
Agree. For consistency of the whole fs, we need grub to be aware of the
log. This patch just assumes that the files accessed by grub are
known, and consistency is forced only on those files.
> get by freezing and unfreezing the file system using the FIFREEZE and
> FITHAW ioctls.  And if my memory is serving me correctly Dave has been
freeze will block any further modification to the fs. That is different
from my patch, which does not have such a limitation.
> preaching that to the bootloader folks for a long time, but apparently
> without visible results.
Yes, it is a pity. And maybe it is not easy to do.

Thanks and regards,
Pingfan


Re: [PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-14 Thread Catalin Marinas
On Mon, Oct 14, 2019 at 08:31:02PM +0200, Nicolas Saenz Julienne wrote:
> the Raspberry Pi 4 offers up to 4GB of memory, of which only the first
> is DMA capable device wide. This forces us to use of bounce buffers,
> which are currently not very well supported by ARM's custom DMA ops.
> Among other things the current mechanism (see dmabounce.c) isn't
> suitable for high memory. Instead of fixing it, this series introduces a
> way of selecting dma-direct as the default DMA ops provider which allows
> for the Raspberry Pi to make use of swiotlb.

I presume these patches go on top of this series:

http://lkml.kernel.org/r/20190911182546.17094-1-nsaenzjulie...@suse.de

which I queued here:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/zone-dma

-- 
Catalin


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Eric Sandeen




On 10/14/19 3:03 PM, Jan Kara wrote:

On Mon 14-10-19 08:23:39, Eric Sandeen wrote:

On 10/14/19 4:43 AM, Jan Kara wrote:

On Mon 14-10-19 16:33:15, Pingfan Liu wrote:

On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:

On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:

When using fadump (firmware assisted dump) mode on powerpc, a mismatch
between grub xfs driver and kernel xfs driver has been observed.  Note:
fadump boots up in the following sequence: firmware -> grub reads kernel
and initramfs -> kernel boots.

The process to reproduce this mismatch:
- On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
- Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
  restart" to rebuild the initramfs. Detail about the rebuilding looks
  like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
sync
- "echo c >/proc/sysrq-trigger".

The result:
The dump image will not be saved under /var/crashnew/* as expected, but
still saved under /var/crash.

The root cause:
As Eric pointed out that on xfs, 'sync' ensures the consistency by writing
back metadata to xlog, but not necessary to fsblock. This raises issue if
grub can not replay the xlog before accessing the xfs files. Since the
above dir entry of initramfs should be saved as inline data with xfs_inode,
so xfs_fs_sync_fs() does not guarantee it written to fsblock.

umount can be used to write metadata fsblock, but the filesystem can not be
umounted if still in use.

There are two ways to fix this mismatch, either grub or xfs. It may be
easier to do this in xfs side by introducing an interface to flush metadata
to fsblock explicitly.

With this patch, metadata can be written to fsblock by:
# update AIL
sync
# new introduced interface to flush metadata to fsblock
mount -o remount,metasync mountpoint


I think this ought to be an ioctl or some sort of generic call since the
jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
is too dumb to recover logs but still wants to write to the fs"
checkpointing problem.

Yes, a syscall sounds more reasonable.


(Or maybe we should just put all that stuff in a vfat filesystem, I
don't know...)

I think it is unavoidable to involve each fs' implementation. What
about introducing an interface sync_to_fsblock(struct super_block *sb) in
the struct super_operations, then let each fs manage its own case?


Well, we already have a way to achieve what you need: fsfreeze.
Traditionally, that is guaranteed to put fs into a "clean" state very much
equivalent to the fs being unmounted and that seems to be what the
bootloader wants so that it can access the filesystem without worrying
about some recovery details. So do you see any problem with replacing
'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?

Honza


The problem with fsfreeze is that if the device you want to quiesce is, say,
the root fs, freeze isn't really a good option.


I agree you need to be really careful not to deadlock against yourself in
that case. But this particular use actually has a chance to work.


But the other thing I want to highlight about this approach is that it does not
solve the root problem: something is trying to read the block device without
first replaying the log.

A call such as the proposal here is only going to leave consistent metadata at
the time the call returns; at any time after that, all guarantees are off again,
so the problem hasn't been solved.


Oh, absolutely agreed. I was also thinking about this before sending my
reply. Once you unfreeze, the log can start filling with changes and
there's no guarantee that e.g. inode does not move as part of these
changes. But to be fair, replaying the log isn't easy either, even more so
from a bootloader. You cannot write the changes from the log back into the
filesystem as e.g. in case of suspend-to-disk the resumed kernel gets
surprised and corrupts the fs under its hands (been there, tried that). So
you must keep changes only in memory and that's not really easy in the
constrained bootloader environment.

So I guess we are left with hacks that kind of mostly work and fsfreeze is
one of those. If you don't mess with the files after fsfreeze, you're
likely to find what you need even without replaying the log.


We're in agreement here.  ;)  I only worry about implementing things like this
which sound like guarantees, but aren't, and end up encouraging bad behavior
or promoting misconceptions.

More and more, I think we should reconsider Darrick's "bootfs" (ext2 by another
name, but with extra-sync-iness) proposal...

-Eric


Honza



Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Jan Kara
On Mon 14-10-19 08:23:39, Eric Sandeen wrote:
> On 10/14/19 4:43 AM, Jan Kara wrote:
> > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > When using fadump (firmware assisted dump) mode on powerpc, a mismatch
> > > > > between grub xfs driver and kernel xfs driver has been observed. Note:
> > > > > fadump boots up in the following sequence: firmware -> grub reads kernel
> > > > > and initramfs -> kernel boots.
> > > > > 
> > > > > The process to reproduce this mismatch:
> > > > >- On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, 
> > > > > "kdumpctl
> > > > >  restart" to rebuild the initramfs. Detail about the rebuilding 
> > > > > looks
> > > > >  like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > >mv /boot/initramfs-`uname -r`.img.tmp 
> > > > > /boot/initramfs-`uname -r`.img
> > > > >sync
> > > > >- "echo c >/proc/sysrq-trigger".
> > > > > 
> > > > > The result:
> > > > > The dump image will not be saved under /var/crashnew/* as expected, 
> > > > > but
> > > > > still saved under /var/crash.
> > > > > 
> > > > > The root cause:
> > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by 
> > > > > writing
> > > > > back metadata to xlog, but not necessary to fsblock. This raises 
> > > > > issue if
> > > > > grub can not replay the xlog before accessing the xfs files. Since the
> > > > > above dir entry of initramfs should be saved as inline data with 
> > > > > xfs_inode,
> > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > > > 
> > > > > umount can be used to write metadata fsblock, but the filesystem can 
> > > > > not be
> > > > > umounted if still in use.
> > > > > 
> > > > > There are two ways to fix this mismatch, either grub or xfs. It may be
> > > > > easier to do this in xfs side by introducing an interface to flush 
> > > > > metadata
> > > > > to fsblock explicitly.
> > > > > 
> > > > > With this patch, metadata can be written to fsblock by:
> > > > ># update AIL
> > > > >sync
> > > > ># new introduced interface to flush metadata to fsblock
> > > > >mount -o remount,metasync mountpoint
> > > > 
> > > > I think this ought to be an ioctl or some sort of generic call since the
> > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > is too dumb to recover logs but still wants to write to the fs"
> > > > checkpointing problem.
> > > Yes, a syscall sounds more reasonable.
> > > > 
> > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > don't know...)
> > > I think it is unavoidable to involve each fs' implementation. What
> > > about introducing an interface sync_to_fsblock(struct super_block *sb) in
> > > the struct super_operations, then let each fs manage its own case?
> > 
> > Well, we already have a way to achieve what you need: fsfreeze.
> > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > equivalent to the fs being unmounted and that seems to be what the
> > bootloader wants so that it can access the filesystem without worrying
> > about some recovery details. So do you see any problem with replacing
> > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
> > 
> > Honza
> 
> The problem with fsfreeze is that if the device you want to quiesce is, say,
> the root fs, freeze isn't really a good option.

I agree you need to be really careful not to deadlock against yourself in
that case. But this particular use actually has a chance to work.

> But the other thing I want to highlight about this approach is that it
> does not solve the root problem: something is trying to read the block
> device without first replaying the log.
> 
> A call such as the proposal here is only going to leave consistent
> metadata at the time the call returns; at any time after that, all
> guarantees are off again, so the problem hasn't been solved.

Oh, absolutely agreed. I was also thinking about this before sending my
reply. Once you unfreeze, the log can start filling with changes and
there's no guarantee that e.g. inode does not move as part of these
changes. But to be fair, replaying the log isn't easy either, even more so
from a bootloader. You cannot write the changes from the log back into the
filesystem as e.g. in case of suspend-to-disk the resumed kernel gets
surprised and corrupts the fs under its hands (been there, tried that). So
you must keep changes only in memory and that's not really easy in the
constrained bootloader environment.

So I guess we are left with hacks that kind of mostly work and fsfreeze is
one of those. If you don't mess with the files after 

Re: [PATCH v6 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()

2019-10-14 Thread Andrew Morton
On Mon, 14 Oct 2019 11:32:13 +0200 David Hildenbrand  wrote:

> > Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> 
> @Andrew, can you convert that to
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded 
> memory to zones until online") # visible after d0dc12e86b319
> 
> and add
> 
> Cc: sta...@vger.kernel.org # v4.13+

Done, thanks.


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-14 Thread Andrew Morton
On Mon, 14 Oct 2019 11:39:13 +0200 David Hildenbrand  wrote:

> > Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
> 
> @Andrew, can you convert that to
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
> zones until online") # visible after d0dc12e86b319

Done.

> While adding "Cc: sta...@vger.kernel.org # v4.13+" would be nice,
> I doubt it will be easily possible to backport, as we are missing
> some prereq patches (e.g., from Oscar, like 2c2a5af6fed2 ("mm,
> memory_hotplug: add nid parameter to arch_remove_memory")). But it could
> be done with some work.
> 
> I think "Cc: sta...@vger.kernel.org # v5.0+" could be done more
> easily. Maybe it's okay to not cc stable for this one. We usually
> online all memory (except on s390x); however, s390x never removes that
> memory. Devmem with driver-reserved memory would, however, make it
> worth backporting this.

I added 

Cc: [5.0+]


Re: [PATCH v4 3/3] powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp

2019-10-14 Thread Segher Boessenkool
On Mon, Oct 14, 2019 at 08:56:12AM -0700, Nick Desaulniers wrote:
> On Mon, Oct 14, 2019 at 2:35 AM Segher Boessenkool
>  wrote:
> >
> > On Sun, Oct 13, 2019 at 07:51:01PM -0700, Nathan Chancellor wrote:
> > > r374662 gives LLVM the ability to convert certain loops into a reference
> > > to bcmp as an optimization; this breaks prom_init_check.sh:
> >
> > When/why does LLVM think this is okay?  This function has been removed
> > from POSIX over a decade ago (and before that it always was marked as
> > legacy).
> 
> Segher, do you have links for any of the above? If so, that would be
> helpful to me.

Sure!

https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap03.html

Older versions are harder to find online, unfortunately.  But there is

https://kernel.org/pub/linux/docs/man-pages/man-pages-posix/

in which man3p/bcmp.3p says:

FUTURE DIRECTIONS
   This function may be withdrawn in a future version.

Finally, the Linux man pages say (man bcmp):

CONFORMING TO
   4.3BSD.   This  function   is   deprecated   (marked   as   LEGACY   in
   POSIX.1-2001): use memcmp(3) in new programs.  POSIX.1-2008 removes the
   specification of bcmp().


> I'm arguing against certain transforms that assume that
> one library function is faster than another, when such claims are
> based on measurements from one stdlib implementation.

Wow.  The difference between memcmp and bcmp is trivial (just the return
value is different, and that costs hardly anything to add).  And memcmp
is guaranteed to exist since C89/C90 at least.

> The rationale for why it was added was that memcmp takes a measurable
> amount of time in Google's fleet, and most calls to memcmp don't care
> about the position of the mismatch; bcmp is lower overhead (or at
> least for our libc implementation, not sure about others).

You just have to do the read of the last words you compare as big-endian,
and then you can just subtract the two words, convert that to "int" (which
is very inconvenient to do, but hardly expensive), and there you go.

Or on x86 use the bswap insn, or something like it.

Or, if you use GCC, it has __builtin_memcmp but also __builtin_memcmp_eq,
and those are automatically used, too.


Segher


[PATCH RFC 1/5] dma/direct: turn ARCH_ZONE_DMA_BITS into a variable

2019-10-14 Thread Nicolas Saenz Julienne
Some architectures, notably ARM, are interested in tweaking this
depending on their runtime DMA addressing limitations.

Signed-off-by: Nicolas Saenz Julienne 
---
 arch/arm64/include/asm/page.h   |  2 --
 arch/arm64/mm/init.c|  9 +++--
 arch/powerpc/include/asm/page.h |  9 -
 arch/powerpc/mm/mem.c   | 20 +++-
 arch/s390/include/asm/page.h|  2 --
 arch/s390/mm/init.c |  1 +
 include/linux/dma-direct.h  |  2 ++
 kernel/dma/direct.c | 13 ++---
 8 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 7b8c98830101..d39ddb258a04 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -38,6 +38,4 @@ extern int pfn_valid(unsigned long);
 
 #include 
 
-#define ARCH_ZONE_DMA_BITS 30
-
 #endif
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 44f07fdf7a59..ddd6a6ce158e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,6 +42,8 @@
 #include 
 #include 
 
+#define ARM64_ZONE_DMA_BITS30
+
 /*
  * We need to be able to catch inadvertent references to memstart_addr
  * that occur (potentially in generic code) before arm64_memblock_init()
@@ -440,8 +443,10 @@ void __init arm64_memblock_init(void)
 
early_init_fdt_scan_reserved_mem();
 
-   if (IS_ENABLED(CONFIG_ZONE_DMA))
-   arm64_dma_phys_limit = max_zone_phys(ARCH_ZONE_DMA_BITS);
+   if (IS_ENABLED(CONFIG_ZONE_DMA)) {
+   zone_dma_bits = ARM64_ZONE_DMA_BITS;
+   arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
+   }
 
if (IS_ENABLED(CONFIG_ZONE_DMA32))
arm64_dma32_phys_limit = max_zone_phys(32);
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index c8bb14ff4713..f6c562acc3f8 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -329,13 +329,4 @@ struct vm_area_struct;
 #endif /* __ASSEMBLY__ */
 #include 
 
-/*
- * Allow 30-bit DMA for very limited Broadcom wifi chips on many powerbooks.
- */
-#ifdef CONFIG_PPC32
-#define ARCH_ZONE_DMA_BITS 30
-#else
-#define ARCH_ZONE_DMA_BITS 31
-#endif
-
 #endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 97e5922cb52e..8bab4e8b6bae 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -200,10 +201,10 @@ static int __init mark_nonram_nosave(void)
  * everything else. GFP_DMA32 page allocations automatically fall back to
  * ZONE_DMA.
  *
- * By using 31-bit unconditionally, we can exploit ARCH_ZONE_DMA_BITS to
- * inform the generic DMA mapping code.  32-bit only devices (if not handled
- * by an IOMMU anyway) will take a first dip into ZONE_NORMAL and get
- * otherwise served by ZONE_DMA.
+ * By using 31-bit unconditionally, we can exploit zone_dma_bits to inform the
+ * generic DMA mapping code.  32-bit only devices (if not handled by an IOMMU
+ * anyway) will take a first dip into ZONE_NORMAL and get otherwise served by
+ * ZONE_DMA.
  */
 static unsigned long max_zone_pfns[MAX_NR_ZONES];
 
@@ -236,9 +237,18 @@ void __init paging_init(void)
printk(KERN_DEBUG "Memory hole size: %ldMB\n",
   (long int)((top_of_ram - total_ram) >> 20));
 
+   /*
+* Allow 30-bit DMA for very limited Broadcom wifi chips on many
+* powerbooks.
+*/
+   if (IS_ENABLED(CONFIG_PPC32))
+   zone_dma_bits = 30;
+   else
+   zone_dma_bits = 31;
+
 #ifdef CONFIG_ZONE_DMA
max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
- 1UL << (ARCH_ZONE_DMA_BITS - PAGE_SHIFT));
+ 1UL << (zone_dma_bits - PAGE_SHIFT));
 #endif
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
 #ifdef CONFIG_HIGHMEM
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 823578c6b9e2..a4d38092530a 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -177,8 +177,6 @@ static inline int devmem_is_allowed(unsigned long pfn)
 #define VM_DATA_DEFAULT_FLAGS  (VM_READ | VM_WRITE | \
 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
-#define ARCH_ZONE_DMA_BITS 31
-
 #include 
 #include 
 
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index c1d96e588152..ac44bd76db4b 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -118,6 +118,7 @@ void __init paging_init(void)
 
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
+   zone_dma_bits = 31;
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] = PFN_DOWN(MAX_DMA_ADDRESS);
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
diff 

[PATCH RFC 0/5] ARM: Raspberry Pi 4 DMA support

2019-10-14 Thread Nicolas Saenz Julienne
Hi all,
the Raspberry Pi 4 offers up to 4GB of memory, of which only the first
GB is DMA capable device wide. This forces us to use bounce buffers,
which are currently not very well supported by ARM's custom DMA ops.
Among other things, the current mechanism (see dmabounce.c) isn't
suitable for high memory. Instead of fixing it, this series introduces a
way of selecting dma-direct as the default DMA ops provider, which allows
the Raspberry Pi to make use of swiotlb.

Regards,
Nicolas

---

Nicolas Saenz Julienne (5):
  dma/direct: turn ARCH_ZONE_DMA_BITS into a variable
  ARM: introduce arm_dma_direct
  ARM: let machines select dma-direct over arch's DMA implementation
  dma/direct: check for overflows in ARM's dma_capable()
  ARM: bcm2711: use dma-direct

 arch/arm/include/asm/dma-direct.h  |  6 ++
 arch/arm/include/asm/dma-mapping.h |  3 ++-
 arch/arm/include/asm/dma.h |  2 ++
 arch/arm/include/asm/mach/arch.h   |  1 +
 arch/arm/mach-bcm/Kconfig  |  1 +
 arch/arm/mach-bcm/bcm2711.c|  1 +
 arch/arm/mm/dma-mapping.c  | 10 ++
 arch/arm/mm/init.c | 21 -
 arch/arm64/include/asm/page.h  |  2 --
 arch/arm64/mm/init.c   |  9 +++--
 arch/powerpc/include/asm/page.h|  9 -
 arch/powerpc/mm/mem.c  | 20 +++-
 arch/s390/include/asm/page.h   |  2 --
 arch/s390/mm/init.c|  1 +
 include/linux/dma-direct.h |  2 ++
 kernel/dma/direct.c| 13 ++---
 16 files changed, 66 insertions(+), 37 deletions(-)

-- 
2.23.0



Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Eric Sandeen

On 10/14/19 4:43 AM, Jan Kara wrote:

On Mon 14-10-19 16:33:15, Pingfan Liu wrote:

On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:

On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:

When using fadump (firmware assisted dump) mode on powerpc, a mismatch
between the grub xfs driver and the kernel xfs driver has been observed.  Note:
fadump boots up in the following sequence: firmware -> grub reads kernel
and initramfs -> kernel boots.

The process to reproduce this mismatch:
   - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
   - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
 restart" to rebuild the initramfs. Detail about the rebuilding looks
 like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
   mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
   sync
   - "echo c >/proc/sysrq-trigger".

The result:
The dump image will not be saved under /var/crashnew/* as expected, but
still saved under /var/crash.

The root cause:
As Eric pointed out, on xfs 'sync' ensures consistency by writing back
metadata to the xlog, but not necessarily to fsblock. This raises an issue if
grub cannot replay the xlog before accessing the xfs files. Since the
above dir entry of initramfs is saved as inline data in the xfs_inode,
xfs_fs_sync_fs() does not guarantee it is written to fsblock.

umount can be used to write metadata to fsblock, but the filesystem cannot be
unmounted if still in use.

There are two ways to fix this mismatch: in grub or in xfs. It may be
easier to do it on the xfs side by introducing an interface to flush metadata
to fsblock explicitly.

With this patch, metadata can be written to fsblock by:
   # update AIL
   sync
   # new introduced interface to flush metadata to fsblock
   mount -o remount,metasync mountpoint


I think this ought to be an ioctl or some sort of generic call since the
jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
is too dumb to recover logs but still wants to write to the fs"
checkpointing problem.

Yes, a syscall sounds more reasonable.


(Or maybe we should just put all that stuff in a vfat filesystem, I
don't know...)

I think it is unavoidable to involve each fs' implementation. What
about introducing an interface sync_to_fsblock(struct super_block *sb) in
the struct super_operations, then let each fs manage its own case?


Well, we already have a way to achieve what you need: fsfreeze.
Traditionally, that is guaranteed to put fs into a "clean" state very much
equivalent to the fs being unmounted and that seems to be what the
bootloader wants so that it can access the filesystem without worrying
about some recovery details. So do you see any problem with replacing
'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?

Honza


The problem with fsfreeze is that if the device you want to quiesce is, say,
the root fs, freeze isn't really a good option.

But the other thing I want to highlight about this approach is that it does not
solve the root problem: something is trying to read the block device without
first replaying the log.

A call such as the proposal here is only going to leave consistent metadata at
the time the call returns; at any time after that, all guarantees are off again,
so the problem hasn't been solved.

-Eric


Re: [PATCH] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-14 Thread Ard Biesheuvel
On Mon, 14 Oct 2019 at 19:38, Eric Biggers  wrote:
>
> On Mon, Oct 14, 2019 at 10:45:22AM +0200, Ard Biesheuvel wrote:
> > Hi Eric,
> >
> > On Sat, 12 Oct 2019 at 04:32, Eric Biggers  wrote:
> > >
> > > From: Eric Biggers 
> > >
> > > Convert the glue code for the PowerPC SPE implementations of AES-ECB,
> > > AES-CBC, AES-CTR, and AES-XTS from the deprecated "blkcipher" API to the
> > > "skcipher" API.
> > >
> > > Tested with:
> > >
> > > export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
> > > make mpc85xx_defconfig
> > > cat >> .config << EOF
> > > # CONFIG_MODULES is not set
> > > # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
> > > CONFIG_DEBUG_KERNEL=y
> > > CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
> > > CONFIG_CRYPTO_AES=y
> > > CONFIG_CRYPTO_CBC=y
> > > CONFIG_CRYPTO_CTR=y
> > > CONFIG_CRYPTO_ECB=y
> > > CONFIG_CRYPTO_XTS=y
> > > CONFIG_CRYPTO_AES_PPC_SPE=y
> > > EOF
> > > make olddefconfig
> > > make -j32
> > > qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
> > > -kernel arch/powerpc/boot/zImage \
> > > -append cryptomgr.fuzz_iterations=1000
> > >
> > > Note that xts-ppc-spe still fails the comparison tests due to the lack
> > > of ciphertext stealing support.  This is not addressed by this patch.
> > >
> > > Signed-off-by: Eric Biggers 
> > > ---
> > >  arch/powerpc/crypto/aes-spe-glue.c | 416 +
> > >  crypto/Kconfig |   1 +
> > >  2 files changed, 186 insertions(+), 231 deletions(-)
> > >
> > > diff --git a/arch/powerpc/crypto/aes-spe-glue.c 
> > > b/arch/powerpc/crypto/aes-spe-glue.c
> > > index 3a4ca7d32477..374e3e51e998 100644
> > > --- a/arch/powerpc/crypto/aes-spe-glue.c
> > > +++ b/arch/powerpc/crypto/aes-spe-glue.c
> > > @@ -17,6 +17,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >
> > >  /*
> > > @@ -86,17 +87,13 @@ static void spe_end(void)
> > > preempt_enable();
> > >  }
> > >
> > > -static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > > -   unsigned int key_len)
> > > +static int expand_key(struct ppc_aes_ctx *ctx,
> > > + const u8 *in_key, unsigned int key_len)
> > >  {
> > > -   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> > > -
> > > if (key_len != AES_KEYSIZE_128 &&
> > > key_len != AES_KEYSIZE_192 &&
> > > -   key_len != AES_KEYSIZE_256) {
> > > -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > > +   key_len != AES_KEYSIZE_256)
> > > return -EINVAL;
> > > -   }
> > >
> > > switch (key_len) {
> > > case AES_KEYSIZE_128:
> > > @@ -114,17 +111,40 @@ static int ppc_aes_setkey(struct crypto_tfm *tfm, 
> > > const u8 *in_key,
> > > }
> > >
> > > ppc_generate_decrypt_key(ctx->key_dec, ctx->key_enc, key_len);
> > > +   return 0;
> > > +}
> > >
> > > +static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > > +   unsigned int key_len)
> > > +{
> > > +   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> > > +
> > > +   if (expand_key(ctx, in_key, key_len) != 0) {
> > > +   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > > +   return -EINVAL;
> > > +   }
> > > +   return 0;
> > > +}
> > > +
> > > +static int ppc_aes_setkey_skcipher(struct crypto_skcipher *tfm,
> > > +  const u8 *in_key, unsigned int key_len)
> > > +{
> > > +   struct ppc_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
> > > +
> > > +   if (expand_key(ctx, in_key, key_len) != 0) {
> > > +   crypto_skcipher_set_flags(tfm, 
> > > CRYPTO_TFM_RES_BAD_KEY_LEN);
> > > +   return -EINVAL;
> > > +   }
> > > return 0;
> > >  }
> > >
> > > -static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > > +static int ppc_xts_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
> > >unsigned int key_len)
> > >  {
> > > -   struct ppc_xts_ctx *ctx = crypto_tfm_ctx(tfm);
> > > +   struct ppc_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
> > > int err;
> > >
> > > -   err = xts_check_key(tfm, in_key, key_len);
> > > +   err = xts_verify_key(tfm, in_key, key_len);
> > > if (err)
> > > return err;
> > >
> > > @@ -133,7 +153,7 @@ static int ppc_xts_setkey(struct crypto_tfm *tfm, 
> > > const u8 *in_key,
> > > if (key_len != AES_KEYSIZE_128 &&
> > > key_len != AES_KEYSIZE_192 &&
> > > key_len != AES_KEYSIZE_256) {
> > > -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > > +   crypto_skcipher_set_flags(tfm, 
> > > CRYPTO_TFM_RES_BAD_KEY_LEN);
> > > return -EINVAL;
> > > }
> > >
> > > @@ -178,208 +198,154 @@ static void 

Re: [PATCH] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-14 Thread Eric Biggers
On Mon, Oct 14, 2019 at 10:45:22AM +0200, Ard Biesheuvel wrote:
> Hi Eric,
> 
> On Sat, 12 Oct 2019 at 04:32, Eric Biggers  wrote:
> >
> > From: Eric Biggers 
> >
> > Convert the glue code for the PowerPC SPE implementations of AES-ECB,
> > AES-CBC, AES-CTR, and AES-XTS from the deprecated "blkcipher" API to the
> > "skcipher" API.
> >
> > Tested with:
> >
> > export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
> > make mpc85xx_defconfig
> > cat >> .config << EOF
> > # CONFIG_MODULES is not set
> > # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
> > CONFIG_DEBUG_KERNEL=y
> > CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
> > CONFIG_CRYPTO_AES=y
> > CONFIG_CRYPTO_CBC=y
> > CONFIG_CRYPTO_CTR=y
> > CONFIG_CRYPTO_ECB=y
> > CONFIG_CRYPTO_XTS=y
> > CONFIG_CRYPTO_AES_PPC_SPE=y
> > EOF
> > make olddefconfig
> > make -j32
> > qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
> > -kernel arch/powerpc/boot/zImage \
> > -append cryptomgr.fuzz_iterations=1000
> >
> > Note that xts-ppc-spe still fails the comparison tests due to the lack
> > of ciphertext stealing support.  This is not addressed by this patch.
> >
> > Signed-off-by: Eric Biggers 
> > ---
> >  arch/powerpc/crypto/aes-spe-glue.c | 416 +
> >  crypto/Kconfig |   1 +
> >  2 files changed, 186 insertions(+), 231 deletions(-)
> >
> > diff --git a/arch/powerpc/crypto/aes-spe-glue.c 
> > b/arch/powerpc/crypto/aes-spe-glue.c
> > index 3a4ca7d32477..374e3e51e998 100644
> > --- a/arch/powerpc/crypto/aes-spe-glue.c
> > +++ b/arch/powerpc/crypto/aes-spe-glue.c
> > @@ -17,6 +17,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >
> >  /*
> > @@ -86,17 +87,13 @@ static void spe_end(void)
> > preempt_enable();
> >  }
> >
> > -static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > -   unsigned int key_len)
> > +static int expand_key(struct ppc_aes_ctx *ctx,
> > + const u8 *in_key, unsigned int key_len)
> >  {
> > -   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> > -
> > if (key_len != AES_KEYSIZE_128 &&
> > key_len != AES_KEYSIZE_192 &&
> > -   key_len != AES_KEYSIZE_256) {
> > -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > +   key_len != AES_KEYSIZE_256)
> > return -EINVAL;
> > -   }
> >
> > switch (key_len) {
> > case AES_KEYSIZE_128:
> > @@ -114,17 +111,40 @@ static int ppc_aes_setkey(struct crypto_tfm *tfm, 
> > const u8 *in_key,
> > }
> >
> > ppc_generate_decrypt_key(ctx->key_dec, ctx->key_enc, key_len);
> > +   return 0;
> > +}
> >
> > +static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > +   unsigned int key_len)
> > +{
> > +   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> > +
> > +   if (expand_key(ctx, in_key, key_len) != 0) {
> > +   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > +   return -EINVAL;
> > +   }
> > +   return 0;
> > +}
> > +
> > +static int ppc_aes_setkey_skcipher(struct crypto_skcipher *tfm,
> > +  const u8 *in_key, unsigned int key_len)
> > +{
> > +   struct ppc_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
> > +
> > +   if (expand_key(ctx, in_key, key_len) != 0) {
> > +   crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
> > +   return -EINVAL;
> > +   }
> > return 0;
> >  }
> >
> > -static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> > +static int ppc_xts_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
> >unsigned int key_len)
> >  {
> > -   struct ppc_xts_ctx *ctx = crypto_tfm_ctx(tfm);
> > +   struct ppc_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
> > int err;
> >
> > -   err = xts_check_key(tfm, in_key, key_len);
> > +   err = xts_verify_key(tfm, in_key, key_len);
> > if (err)
> > return err;
> >
> > @@ -133,7 +153,7 @@ static int ppc_xts_setkey(struct crypto_tfm *tfm, const 
> > u8 *in_key,
> > if (key_len != AES_KEYSIZE_128 &&
> > key_len != AES_KEYSIZE_192 &&
> > key_len != AES_KEYSIZE_256) {
> > -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> > +   crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
> > return -EINVAL;
> > }
> >
> > @@ -178,208 +198,154 @@ static void ppc_aes_decrypt(struct crypto_tfm *tfm, 
> > u8 *out, const u8 *in)
> > spe_end();
> >  }
> >
> > -static int ppc_ecb_encrypt(struct blkcipher_desc *desc, struct scatterlist 
> > *dst,
> > -  struct scatterlist *src, unsigned int nbytes)
> > +static int ppc_ecb_crypt(struct skcipher_request *req, 

[PATCH] powerpc/32s: fix allow/prevent_user_access() when crossing segment boundaries.

2019-10-14 Thread Christophe Leroy
Make sure the starting addr is aligned to a segment boundary so that when
incrementing the segment, the starting address of the new segment is
below the end address. Otherwise the last segment might get missed.

Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access 
Protection")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index 677e9babef80..f9dc597b0b86 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -91,6 +91,7 @@
 
 static inline void kuap_update_sr(u32 sr, u32 addr, u32 end)
 {
+   addr &= 0xf0000000;  /* align addr to start of segment */
barrier();  /* make sure thread.kuap is updated before playing with SRs */
while (addr < end) {
mtsrin(sr, addr);
-- 
2.13.3



Re: [PATCH v4 3/3] powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp

2019-10-14 Thread Nick Desaulniers
On Mon, Oct 14, 2019 at 2:35 AM Segher Boessenkool
 wrote:
>
> On Sun, Oct 13, 2019 at 07:51:01PM -0700, Nathan Chancellor wrote:
> > r374662 gives LLVM the ability to convert certain loops into a reference
> > to bcmp as an optimization; this breaks prom_init_check.sh:
>
> When/why does LLVM think this is okay?  This function has been removed
> from POSIX over a decade ago (and before that it always was marked as
> legacy).

Segher, do you have links for any of the above? If so, that would be
helpful to me. I'm arguing against certain transforms that assume that
one library function is faster than another, when such claims are
based on measurements from one stdlib implementation. (There's others
in the pipeline I'm not too thrilled about, too).

The rationale for why it was added was that memcmp takes a measurable
amount of time in Google's fleet, and most calls to memcmp don't care
about the position of the mismatch; bcmp is lower overhead (or at
least for our libc implementation, not sure about others).
-- 
Thanks,
~Nick Desaulniers


Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-14 Thread Mark Rutland
On Tue, Oct 01, 2019 at 04:58:30PM +1000, Daniel Axtens wrote:
> Hook into vmalloc and vmap, and dynamically allocate real shadow
> memory to back the mappings.
> 
> Most mappings in vmalloc space are small, requiring less than a full
> page of shadow space. Allocating a full shadow page per mapping would
> therefore be wasteful. Furthermore, to ensure that different mappings
> use different shadow pages, mappings would have to be aligned to
> KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
> 
> Instead, share backing space across multiple mappings. Allocate a
> backing page when a mapping in vmalloc space uses a particular page of
> the shadow region. This page can be shared by other vmalloc mappings
> later on.
> 
> We hook in to the vmap infrastructure to lazily clean up unused shadow
> memory.
> 
> To avoid the difficulties around swapping mappings around, this code
> expects that the part of the shadow region that covers the vmalloc
> space will not be covered by the early shadow page, but will be left
> unmapped. This will require changes in arch-specific code.
> 
> This allows KASAN with VMAP_STACK, and may be helpful for architectures
> that do not have a separate module space (e.g. powerpc64, which I am
> currently working on). It also allows relaxing the module alignment
> back to PAGE_SIZE.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009
> Acked-by: Vasily Gorbik 
> Signed-off-by: Daniel Axtens 
> [Mark: rework shadow allocation]
> Signed-off-by: Mark Rutland 

Sorry to point this out so late, but your S-o-B should come last in the
chain per Documentation/process/submitting-patches.rst. Judging by the
rest of that, I think you want something like:

Co-developed-by: Mark Rutland 
Signed-off-by: Mark Rutland  [shadow rework]
Signed-off-by: Daniel Axtens 

... leaving yourself as the Author in the headers.

Sorry to have made that more complicated!

[...]

> +static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> + void *unused)
> +{
> + unsigned long page;
> +
> + page = (unsigned long)__va(pte_pfn(*ptep) << PAGE_SHIFT);
> +
> + spin_lock(&init_mm.page_table_lock);
> +
> + if (likely(!pte_none(*ptep))) {
> + pte_clear(&init_mm, addr, ptep);
> + free_page(page);
> + }

There should be TLB maintenance between clearing the PTE and freeing the
page here.

Thanks,
Mark.


Re: [PATCH 2/2] powerpc/powernv: ocxl move TL definition

2019-10-14 Thread christophe lombard

On 14/10/2019 12:21, Frederic Barrat wrote:



On 09/10/2019 at 17:11, christophe lombard wrote:
Specifies the templates in the Transaction Layer that the OpenCAPI
device/host support when transmitting/receiving DL/DLX frames to or from
the OpenCAPI device/host.
Update, rename and create a few new platform-specific calls which can be
used by drivers.

No functional change.

Signed-off-by: Christophe Lombard 
---
  arch/powerpc/include/asm/pnv-ocxl.h   |   5 +-
  arch/powerpc/platforms/powernv/ocxl.c | 103 --
  drivers/misc/ocxl/config.c    |  89 +-
  3 files changed, 99 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h 
b/arch/powerpc/include/asm/pnv-ocxl.h

index 8e516e339e6c..b8c68878b4ba 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -13,10 +13,7 @@ extern int pnv_ocxl_get_actag(struct pci_dev *dev, 
u16 *base, u16 *enabled,

  u16 *supported);
  extern int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);

-extern int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
-    char *rate_buf, int rate_buf_size);
-extern int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
-    uint64_t rate_buf_phys, int rate_buf_size);
+extern int pnv_ocxl_set_TL(struct pci_dev *dev, int tl_dvsec);

  extern int pnv_ocxl_platform_setup(struct pci_dev *dev,
 int PE_mask, int *hwirq,
diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c

index 4d26cba12b63..351324cffc2b 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -369,8 +369,8 @@ static void set_templ_rate(unsigned int templ, 
unsigned int rate, char *buf)

  buf[idx] |= rate << shift;
  }

-int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
-    char *rate_buf, int rate_buf_size)
+static int get_tl_cap(struct pci_dev *dev, long *cap,
+  char *rate_buf, int rate_buf_size)
  {
  if (rate_buf_size != PNV_OCXL_TL_RATE_BUF_SIZE)
  return -EINVAL;
@@ -390,10 +390,9 @@ int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long 
*cap,

  *cap = PNV_OCXL_TL_P9_RECV_CAP;
  return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_get_tl_cap);

-int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
-    uint64_t rate_buf_phys, int rate_buf_size)
+static int set_tl_conf(struct pci_dev *dev, long cap,
+   uint64_t rate_buf_phys, int rate_buf_size)
  {
  struct pci_controller *hose = pci_bus_to_host(dev->bus);
  struct pnv_phb *phb = hose->private_data;
@@ -410,7 +409,99 @@ int pnv_ocxl_set_tl_conf(struct pci_dev *dev, 
long cap,

  }
  return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_set_tl_conf);
+
+int pnv_ocxl_set_TL(struct pci_dev *dev, int tl_dvsec)
+{
+    u32 val;
+    __be32 *be32ptr;
+    u8 timers;
+    int i, rc;
+    long recv_cap;
+    char *recv_rate;
+
+    recv_rate = kzalloc(PNV_OCXL_TL_RATE_BUF_SIZE, GFP_KERNEL);
+    if (!recv_rate)
+    return -ENOMEM;
+    /*
+     * The spec defines 64 templates for messages in the
+     * Transaction Layer (TL).
+     *
+     * The host and device each support a subset, so we need to
+     * configure the transmitters on each side to send only
+     * templates the receiver understands, at a rate the receiver
+     * can process.  Per the spec, template 0 must be supported by
+     * everybody. That's the template which has been used by the
+     * host and device so far.
+     *
+     * The sending rate limit must be set before the template is
+     * enabled.
+     */
+
+    /*
+     * Device -> host
+     */
+    rc = get_tl_cap(dev, &recv_cap, recv_rate,
+    PNV_OCXL_TL_RATE_BUF_SIZE);
+    if (rc)
+    goto out;
+
+    for (i = 0; i < PNV_OCXL_TL_RATE_BUF_SIZE; i += 4) {
+    be32ptr = (__be32 *) &recv_rate[i];
+    pci_write_config_dword(dev,
+    tl_dvsec + OCXL_DVSEC_TL_SEND_RATE + i,
+    be32_to_cpu(*be32ptr));
+    }
+    val = recv_cap >> 32;
+    pci_write_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_SEND_CAP, val);
+    val = recv_cap & GENMASK(31, 0);
+    pci_write_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_SEND_CAP + 4, val);

+
+    /*
+     * Host -> device
+     */
+    for (i = 0; i < PNV_OCXL_TL_RATE_BUF_SIZE; i += 4) {
+    pci_read_config_dword(dev,
+    tl_dvsec + OCXL_DVSEC_TL_RECV_RATE + i,
+    &val);
+    be32ptr = (__be32 *) &recv_rate[i];
+    *be32ptr = cpu_to_be32(val);
+    }
+    pci_read_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_RECV_CAP, &val);
+    recv_cap = (long) val << 32;
+    pci_read_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_RECV_CAP + 4, &val);

+    recv_cap |= val;
+
+    rc = set_tl_conf(dev, recv_cap, __pa(recv_rate),
+ PNV_OCXL_TL_RATE_BUF_SIZE);
+    if (rc)
+    goto out;
+
+    /*
+     * Opencapi commands needing to be retried are classified per
+     * the TL in 

Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-14 Thread Mark Rutland
On Tue, Oct 15, 2019 at 12:57:44AM +1100, Daniel Axtens wrote:
> Hi Andrey,
> 
> 
> >> +  /*
> >> +   * Ensure poisoning is visible before the shadow is made visible
> >> +   * to other CPUs.
> >> +   */
> >> +  smp_wmb();
> >
> > I'm not quite understand what this barrier do and why it needed.
> > And if it's really needed there should be a pairing barrier
> > on the other side which I don't see.
> 
> Mark might be better able to answer this, but my understanding is that
> we want to make sure that we never have a situation where the writes are
> reordered so that PTE is installed before all the poisoning is written
> out. I think it follows the logic in __pte_alloc() in mm/memory.c:
> 
>   /*
>* Ensure all pte setup (eg. pte page lock and page clearing) are
>* visible before the pte is made visible to other CPUs by being
>* put into page tables.

Yup. We need to ensure that if a thread sees a populated shadow PTE, the
corresponding shadow memory has been zeroed. Thus, we need to ensure
that the zeroing is observed by other CPUs before we update the PTE.

We're relying on the absence of a TLB entry preventing another CPU from
loading the corresponding shadow memory until its PTE has been
populated (after the zeroing is visible). Consequently there is no
barrier on the other side, and just a control-dependency (which would be
insufficient on its own).

There is a potential problem here, as Will Deacon wrote up at:

  
https://lore.kernel.org/linux-arm-kernel/20190827131818.14724-1-w...@kernel.org/

... in the section starting:

| *** Other architecture maintainers -- start here! ***

... whereby the CPU can spuriously fault on an access after observing a
valid PTE.

For arm64 we handle the spurious fault, and it looks like x86 would need
something like its vmalloc_fault() applying to the shadow region to
cater for this.

Thanks,
Mark.


Re: [PATCH 1/2] powerpc/powernv: ocxl move SPA definition

2019-10-14 Thread christophe lombard

On 14/10/2019 12:17, Frederic Barrat wrote:


diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c

index 8c65aacda9c8..4d26cba12b63 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -12,11 +12,54 @@
  #define PNV_OCXL_PASID_BITS    15
  #define PNV_OCXL_PASID_MAX    ((1 << PNV_OCXL_PASID_BITS) - 1)

-#define AFU_PRESENT (1 << 31)
-#define AFU_INDEX_MASK 0x3F00
-#define AFU_INDEX_SHIFT 24
-#define ACTAG_MASK 0xFFF
+#define AFU_PRESENT    (1 << 31)
+#define AFU_INDEX_MASK    0x3F00
+#define AFU_INDEX_SHIFT    24
+#define ACTAG_MASK 0xFFF
+
+#define SPA_PASID_BITS    15
+#define SPA_PASID_MAX    ((1 << SPA_PASID_BITS) - 1)
+#define SPA_PE_MASK    SPA_PASID_MAX
+#define SPA_SPA_SIZE_LOG    22 /* Each SPA is 4 Mb */
+#define SPA_PE_VALID    0x8000
+
+#define SPA_CFG_SF    (1ull << (63-0))
+#define SPA_CFG_TA    (1ull << (63-1))
+#define SPA_CFG_HV    (1ull << (63-3))
+#define SPA_CFG_UV    (1ull << (63-4))
+#define SPA_CFG_XLAT_hpt    (0ull << (63-6)) /* Hashed page table (HPT) mode */
+#define SPA_CFG_XLAT_roh    (2ull << (63-6)) /* Radix on HPT mode */
+#define SPA_CFG_XLAT_ror    (3ull << (63-6)) /* Radix on Radix mode */
+#define SPA_CFG_PR    (1ull << (63-49))
+#define SPA_CFG_TC    (1ull << (63-54))
+#define SPA_CFG_DR    (1ull << (63-59))
+
+struct ocxl_process_element {
+    __be64 config_state;
+    __be32 reserved1[11];
+    __be32 lpid;
+    __be32 tid;
+    __be32 pid;
+    __be32 reserved2[10];
+    __be64 amr;
+    __be32 reserved3[3];
+    __be32 software_state;
+};
+
+struct spa {
+    struct ocxl_process_element *spa_mem;
+    int spa_order;
+};

+struct platform_data {
+    struct spa *spa;
+    u64 phb_opal_id;
+    u32 bdfn;
+    void __iomem *dsisr;
+    void __iomem *dar;
+    void __iomem *tfc;
+    void __iomem *pe_handle;
+};

  struct actag_range {
  u16 start;
@@ -369,7 +412,7 @@ int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long 
cap,

  }
  EXPORT_SYMBOL_GPL(pnv_ocxl_set_tl_conf);

-int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq)
+static int get_xsl_irq(struct pci_dev *dev, int *hwirq)
  {
  int rc;

@@ -381,19 +424,17 @@ int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, 
int *hwirq)

  }
  return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_get_xsl_irq);

-void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
-    void __iomem *tfc, void __iomem *pe_handle)
+static void unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
+   void __iomem *tfc, void __iomem *pe_handle)
  {
  iounmap(dsisr);
  iounmap(dar);
  iounmap(tfc);
  iounmap(pe_handle);
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_unmap_xsl_regs);

-int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
+static int map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
  void __iomem **dar, void __iomem **tfc,
  void __iomem **pe_handle)
  {
@@ -429,61 +470,144 @@ int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, 
void __iomem **dsisr,

  }
  return rc;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_map_xsl_regs);

-struct spa_data {
-    u64 phb_opal_id;
-    u32 bdfn;
-};
+static int alloc_spa(struct pci_dev *dev, struct platform_data *data)
+{
+    struct spa *spa;
+
+    spa = kzalloc(sizeof(struct spa), GFP_KERNEL);
+    if (!spa)
+    return -ENOMEM;
+
+    spa->spa_order = SPA_SPA_SIZE_LOG - PAGE_SHIFT;
+    spa->spa_mem = (struct ocxl_process_element *)
+    __get_free_pages(GFP_KERNEL | __GFP_ZERO, spa->spa_order);
+    if (!spa->spa_mem) {
+    dev_err(&dev->dev, "Can't allocate Shared Process Area\n");
+    kfree(spa);
+    return -ENOMEM;
+    }
+
+    data->spa = spa;
+    dev_dbg(&dev->dev, "Allocated SPA for %x:%x:%x at %p\n",
+   pci_domain_nr(dev->bus), dev->bus->number,
+   PCI_SLOT(dev->devfn), spa->spa_mem);



If using dev_dbg() then we get the domain, bus and device/fn for free as 
part of the message prefix. Which leaves the SPA address, which may be 
hidden with recent kernel changes since it's a pointer to a kernel 
address (I haven't checked). I guess the message could be reworked. The 
point was really to show that we're allocating a spa structure for the 
link and help debugging any ref count issue.




I think we can even delete this message, as it is not very interesting.




+
+    return 0;
+}
+
+static void free_spa(struct platform_data *data)
+{
+    struct spa *spa = data->spa;
+
+    if (spa && spa->spa_mem) {
+    free_pages((unsigned long) spa->spa_mem, spa->spa_order);
+    kfree(spa);
+    data->spa = NULL;
+    }
+}

-int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask,
-    void **platform_data)
+int pnv_ocxl_platform_setup(struct pci_dev *dev, int PE_mask,
+    int *hwirq, void **platform_data)
  {
  struct pci_controller *hose = pci_bus_to_host(dev->bus);

Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-14 Thread Daniel Axtens
Hi Andrey,


>> +/*
>> + * Ensure poisoning is visible before the shadow is made visible
>> + * to other CPUs.
>> + */
>> +smp_wmb();
>
> I'm not quite understand what this barrier do and why it needed.
> And if it's really needed there should be a pairing barrier
> on the other side which I don't see.

Mark might be better able to answer this, but my understanding is that
we want to make sure that we never have a situation where the writes are
reordered so that PTE is installed before all the poisoning is written
out. I think it follows the logic in __pte_alloc() in mm/memory.c:

/*
 * Ensure all pte setup (eg. pte page lock and page clearing) are
 * visible before the pte is made visible to other CPUs by being
 * put into page tables.
 *
 * The other side of the story is the pointer chasing in the page
 * table walking code (when walking the page table without locking;
 * ie. most of the time). Fortunately, these data accesses consist
 * of a chain of data-dependent loads, meaning most CPUs (alpha
 * being the notable exception) will already guarantee loads are
 * seen in-order. See the alpha page table accessors for the
 * smp_read_barrier_depends() barriers in page table walking code.
 */
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

I can clarify the comment.

>> +
>> +spin_lock(&init_mm.page_table_lock);
>> +if (likely(pte_none(*ptep))) {
>> +set_pte_at(&init_mm, addr, ptep, pte);
>> +page = 0;
>> +}
>> +spin_unlock(&init_mm.page_table_lock);
>> +if (page)
>> +free_page(page);
>> +return 0;
>> +}
>> +
>
>
> ...
>
>> @@ -754,6 +769,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
>>  }
>>  
>>  insert:
>> +kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
>> +
>>  if (!merged) {
>>  link_va(va, root, parent, link, head);
>>  augment_tree_propagate_from(va);
>> @@ -2068,6 +2085,22 @@ static struct vm_struct *__get_vm_area_node(unsigned 
>> long size,
>>  
>>  setup_vmalloc_vm(area, va, flags, caller);
>>  
>> +/*
>> + * For KASAN, if we are in vmalloc space, we need to cover the shadow
>> + * area with real memory. If we come here through VM_ALLOC, this is
>> + * done by a higher level function that has access to the true size,
>> + * which might not be a full page.
>> + *
>> + * We assume module space comes via VM_ALLOC path.
>> + */
>> +if (is_vmalloc_addr(area->addr) && !(area->flags & VM_ALLOC)) {
>> +if (kasan_populate_vmalloc(area->size, area)) {
>> +unmap_vmap_area(va);
>> +kfree(area);
>> +return NULL;
>> +}
>> +}
>> +
>>  return area;
>>  }
>>  
>> @@ -2245,6 +2278,9 @@ static void __vunmap(const void *addr, int 
>> deallocate_pages)
>>  debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>>  debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>>  
>> +if (area->flags & VM_KASAN)
>> +kasan_poison_vmalloc(area->addr, area->size);
>> +
>>  vm_remove_mappings(area, deallocate_pages);
>>  
>>  if (deallocate_pages) {
>> @@ -2497,6 +2533,9 @@ void *__vmalloc_node_range(unsigned long size, 
>> unsigned long align,
>>  if (!addr)
>>  return NULL;
>>  
>> +if (kasan_populate_vmalloc(real_size, area))
>> +return NULL;
>> +
>
> KASAN itself uses __vmalloc_node_range() to allocate and map shadow in memory 
> online callback.
> So we should either skip non-vmalloc and non-module addresses here or teach 
> kasan's memory online/offline
> callbacks to not use __vmalloc_node_range() (do something similar to 
> kasan_populate_vmalloc() perhaps?). 

Ah, right you are. I haven't been testing that.

I am a bit nervous about further restricting kasan_populate_vmalloc: I
seem to remember having problems with code using the vmalloc family of
functions to map memory that doesn't lie within vmalloc space but which
still has instrumented accesses.

On the other hand, I'm not keen on rewriting any of the memory
on/offline code if I can avoid it!

I'll have a look and get back you as soon as I can.

Thanks for catching this.

Kind regards,
Daniel

>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kasan-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/kasan-dev/352cb4fa-2e57-7e3b-23af-898e113bbe22%40virtuozzo.com.


[PATCH v5 3/7] Powerpc/Watchpoint: Fix ptrace code that muck around with address/len

2019-10-14 Thread Ravi Bangoria
ptrace_set_debugreg() does not consider the new length while overwriting
the watchpoint. Fix that. ppc_set_hwdebug() aligns the watchpoint address
to a doubleword boundary but does not change the length. If the address
range crosses a doubleword boundary and the length is less than 8, we
will lose samples from the second doubleword. So fix that as well.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/include/asm/hw_breakpoint.h | 4 ++--
 arch/powerpc/kernel/ptrace.c | 9 +++--
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_breakpoint.h 
b/arch/powerpc/include/asm/hw_breakpoint.h
index ea91ac7f5a27..27ac6f5d2891 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -34,6 +34,8 @@ struct arch_hw_breakpoint {
 #define HW_BRK_TYPE_PRIV_ALL   (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
 HW_BRK_TYPE_HYP)
 
+#define HW_BREAKPOINT_ALIGN 0x7
+
 #define DABR_MAX_LEN   8
 #define DAWR_MAX_LEN   512
 
@@ -48,8 +50,6 @@ struct pmu;
 struct perf_sample_data;
 struct task_struct;
 
-#define HW_BREAKPOINT_ALIGN 0x7
-
 extern int hw_breakpoint_slots(int type);
 extern int arch_bp_generic_fields(int type, int *gen_bp_type);
 extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw);
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index c861b12337bd..8ea6f01531f1 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2440,6 +2440,7 @@ static int ptrace_set_debugreg(struct task_struct *task, 
unsigned long addr,
if (bp) {
attr = bp->attr;
attr.bp_addr = hw_brk.address;
+   attr.bp_len = DABR_MAX_LEN;
arch_bp_generic_fields(hw_brk.type, &attr.bp_type);
 
/* Enable breakpoint */
@@ -2881,7 +2882,7 @@ static long ppc_set_hwdebug(struct task_struct *child,
if ((unsigned long)bp_info->addr >= TASK_SIZE)
return -EIO;
 
-   brk.address = bp_info->addr & ~7UL;
+   brk.address = bp_info->addr & ~HW_BREAKPOINT_ALIGN;
brk.type = HW_BRK_TYPE_TRANSLATE;
brk.len = DABR_MAX_LEN;
if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_READ)
@@ -2889,10 +2890,6 @@ static long ppc_set_hwdebug(struct task_struct *child,
if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
brk.type |= HW_BRK_TYPE_WRITE;
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
-   /*
-* Check if the request is for 'range' breakpoints. We can
-* support it if range < 8 bytes.
-*/
if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE)
len = bp_info->addr2 - bp_info->addr;
else if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_EXACT)
@@ -2905,7 +2902,7 @@ static long ppc_set_hwdebug(struct task_struct *child,
 
/* Create a new breakpoint request if one doesn't exist already */
hw_breakpoint_init(&attr);
-   attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
+   attr.bp_addr = (unsigned long)bp_info->addr;
attr.bp_len = len;
arch_bp_generic_fields(brk.type, &attr.bp_type);
 
-- 
2.21.0



[PATCH v5 7/7] Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest

2019-10-14 Thread Ravi Bangoria
On the 8xx, signals are generated after executing the instruction,
so there is no need to manually single-step on 8xx. Also, the 8xx
__set_dabr() currently ignores the length and hardcodes it to 8
bytes, so all unaligned and 512-byte testcases will fail on 8xx.
Ignore those testcases on 8xx.

Signed-off-by: Ravi Bangoria 
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 32 +--
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c 
b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 916e97f5f8b1..7deedbc16b0b 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,6 +22,11 @@
 #include 
 #include "ptrace.h"
 
+#define SPRN_PVR   0x11F
+#define PVR_8xx0x0050
+
+bool is_8xx;
+
 /*
  * Use volatile on all global var so that compiler doesn't
  * optimise their load/stores. Otherwise selftest can fail.
@@ -205,13 +210,15 @@ static void check_success(pid_t child_pid, const char 
*name, const char *type,
 
printf("%s, %s, len: %d: Ok\n", name, type, len);
 
-   /*
-* For ptrace registered watchpoint, signal is generated
-* before executing load/store. Singlestep the instruction
-* and then continue the test.
-*/
-   ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
-   wait(NULL);
+   if (!is_8xx) {
+   /*
+* For ptrace registered watchpoint, signal is generated
+* before executing load/store. Singlestep the instruction
+* and then continue the test.
+*/
+   ptrace(PTRACE_SINGLESTEP, child_pid, NULL, 0);
+   wait(NULL);
+   }
 }
 
 static void ptrace_set_debugreg(pid_t child_pid, unsigned long wp_addr)
@@ -447,8 +454,10 @@ run_tests(pid_t child_pid, struct ppc_debug_info *dbginfo, 
bool dawr)
test_set_debugreg(child_pid);
if (dbginfo->features & PPC_DEBUG_FEATURE_DATA_BP_RANGE) {
test_sethwdebug_exact(child_pid);
-   test_sethwdebug_range_aligned(child_pid);
-   if (dawr) {
+
+   if (!is_8xx)
+   test_sethwdebug_range_aligned(child_pid);
+   if (dawr && !is_8xx) {
test_sethwdebug_range_unaligned(child_pid);
test_sethwdebug_range_unaligned_dar(child_pid);
test_sethwdebug_dawr_max_range(child_pid);
@@ -489,5 +498,10 @@ static int ptrace_hwbreak(void)
 
 int main(int argc, char **argv, char **envp)
 {
+   int pvr = 0;
+   asm __volatile__ ("mfspr %0,%1" : "=r"(pvr) : "i"(SPRN_PVR));
+   if (pvr == PVR_8xx)
+   is_8xx = true;
+
return test_harness(ptrace_hwbreak, "ptrace-hwbreak");
 }
-- 
2.21.0



[PATCH v5 6/7] Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest

2019-10-14 Thread Ravi Bangoria
So far we used to ignore the exception if DAR pointed outside of the
user-specified range. But now we ignore it only if the actual load/
store range does not overlap with the user-specified range. Include
selftests for the same:

  # ./tools/testing/selftests/powerpc/ptrace/perf-hwbreak
  ...
  TESTED: No overlap
  TESTED: Partial overlap
  TESTED: Partial overlap
  TESTED: No overlap
  TESTED: Full overlap
  success: perf_hwbreak

Signed-off-by: Ravi Bangoria 
---
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +-
 1 file changed, 110 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c 
b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
index 200337daec42..389c545675c6 100644
--- a/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/perf-hwbreak.c
@@ -148,6 +148,113 @@ static int runtestsingle(int readwriteflag, int 
exclude_user, int arraytest)
return 0;
 }
 
+static int runtest_dar_outside(void)
+{
+   volatile char target[8];
+   volatile __u16 temp16;
+   volatile __u64 temp64;
+   struct perf_event_attr attr;
+   int break_fd;
+   unsigned long long breaks;
+   int fail = 0;
+   size_t res;
+
+   /* setup counters */
+   memset(&attr, 0, sizeof(attr));
+   attr.disabled = 1;
+   attr.type = PERF_TYPE_BREAKPOINT;
+   attr.exclude_kernel = 1;
+   attr.exclude_hv = 1;
+   attr.exclude_guest = 1;
+   attr.bp_type = HW_BREAKPOINT_RW;
+   /* watch middle half of target array */
+   attr.bp_addr = (__u64)(target + 2);
+   attr.bp_len = 4;
+   break_fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+   if (break_fd < 0) {
+   perror("sys_perf_event_open");
+   exit(1);
+   }
+
+   /* Shouldn't hit. */
+   ioctl(break_fd, PERF_EVENT_IOC_RESET);
+   ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+   temp16 = *((__u16 *)target);
+   *((__u16 *)target) = temp16;
+   ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+   res = read(break_fd, &breaks, sizeof(unsigned long long));
+   assert(res == sizeof(unsigned long long));
+   if (breaks == 0) {
+   printf("TESTED: No overlap\n");
+   } else {
+   printf("FAILED: No overlap: %lld != 0\n", breaks);
+   fail = 1;
+   }
+
+   /* Hit */
+   ioctl(break_fd, PERF_EVENT_IOC_RESET);
+   ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+   temp16 = *((__u16 *)(target + 1));
+   *((__u16 *)(target + 1)) = temp16;
+   ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+   res = read(break_fd, &breaks, sizeof(unsigned long long));
+   assert(res == sizeof(unsigned long long));
+   if (breaks == 2) {
+   printf("TESTED: Partial overlap\n");
+   } else {
+   printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+   fail = 1;
+   }
+
+   /* Hit */
+   ioctl(break_fd, PERF_EVENT_IOC_RESET);
+   ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+   temp16 = *((__u16 *)(target + 5));
+   *((__u16 *)(target + 5)) = temp16;
+   ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+   res = read(break_fd, &breaks, sizeof(unsigned long long));
+   assert(res == sizeof(unsigned long long));
+   if (breaks == 2) {
+   printf("TESTED: Partial overlap\n");
+   } else {
+   printf("FAILED: Partial overlap: %lld != 2\n", breaks);
+   fail = 1;
+   }
+
+   /* Shouldn't Hit */
+   ioctl(break_fd, PERF_EVENT_IOC_RESET);
+   ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+   temp16 = *((__u16 *)(target + 6));
+   *((__u16 *)(target + 6)) = temp16;
+   ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+   res = read(break_fd, &breaks, sizeof(unsigned long long));
+   assert(res == sizeof(unsigned long long));
+   if (breaks == 0) {
+   printf("TESTED: No overlap\n");
+   } else {
+   printf("FAILED: No overlap: %lld != 0\n", breaks);
+   fail = 1;
+   }
+
+   /* Hit */
+   ioctl(break_fd, PERF_EVENT_IOC_RESET);
+   ioctl(break_fd, PERF_EVENT_IOC_ENABLE);
+   temp64 = *((__u64 *)target);
+   *((__u64 *)target) = temp64;
+   ioctl(break_fd, PERF_EVENT_IOC_DISABLE);
+   res = read(break_fd, &breaks, sizeof(unsigned long long));
+   assert(res == sizeof(unsigned long long));
+   if (breaks == 2) {
+   printf("TESTED: Full overlap\n");
+   } else {
+   printf("FAILED: Full overlap: %lld != 2\n", breaks);
+   fail = 1;
+   }
+
+   close(break_fd);
+   return fail;
+}
+
 static int runtest(void)
 {
int rwflag;
@@ -172,7 +279,9 @@ static int runtest(void)
return ret;
}
}
-   return 0;
+
+   ret = runtest_dar_outside();
+   return ret;
 }
 
 
-- 
2.21.0



[PATCH v5 5/7] Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest

2019-10-14 Thread Ravi Bangoria
The ptrace-hwbreak.c selftest is logically broken. On powerpc, when a
watchpoint is created with ptrace, signals are generated before
executing the instruction, and the user has to manually single-step
the instruction with the watchpoint disabled, which the selftest never
does; thus it keeps getting the signal at the same instruction. If we
fix that, the selftest fails because the logical connection between
the tracer (parent) and tracee (child) is also broken. Rewrite the
selftest and add new tests for unaligned access.
With patch:
  $ ./tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak
  test: ptrace-hwbreak
  tags: git_version:powerpc-5.3-4-224-g218b868240c7-dirty
  PTRACE_SET_DEBUGREG, WO, len: 1: Ok
  PTRACE_SET_DEBUGREG, WO, len: 2: Ok
  PTRACE_SET_DEBUGREG, WO, len: 4: Ok
  PTRACE_SET_DEBUGREG, WO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RO, len: 1: Ok
  PTRACE_SET_DEBUGREG, RO, len: 2: Ok
  PTRACE_SET_DEBUGREG, RO, len: 4: Ok
  PTRACE_SET_DEBUGREG, RO, len: 8: Ok
  PTRACE_SET_DEBUGREG, RW, len: 1: Ok
  PTRACE_SET_DEBUGREG, RW, len: 2: Ok
  PTRACE_SET_DEBUGREG, RW, len: 4: Ok
  PTRACE_SET_DEBUGREG, RW, len: 8: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, WO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RO, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_EXACT, RW, len: 1: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW ALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, WO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RO, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, MODE_RANGE, DW UNALIGNED, DAR OUTSIDE, RW, len: 6: Ok
  PPC_PTRACE_SETHWDEBUG, DAWR_MAX_LEN, RW, len: 512: Ok
  success: ptrace-hwbreak

Signed-off-by: Ravi Bangoria 
---
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 571 +++---
 1 file changed, 361 insertions(+), 210 deletions(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c 
b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
index 3066d310f32b..916e97f5f8b1 100644
--- a/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
+++ b/tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
@@ -22,318 +22,469 @@
 #include 
 #include "ptrace.h"
 
-/* Breakpoint access modes */
-enum {
-   BP_X = 1,
-   BP_RW = 2,
-   BP_W = 4,
-};
-
-static pid_t child_pid;
-static struct ppc_debug_info dbginfo;
-
-static void get_dbginfo(void)
-{
-   int ret;
-
-   ret = ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, &dbginfo);
-   if (ret) {
-   perror("Can't get breakpoint info\n");
-   exit(-1);
-   }
-}
-
-static bool hwbreak_present(void)
-{
-   return (dbginfo.num_data_bps != 0);
-}
+/*
+ * Use volatile on all global var so that compiler doesn't
+ * optimise their load/stores. Otherwise selftest can fail.
+ */
+static volatile __u64 glvar;
 
-static bool dawr_present(void)
-{
-   return !!(dbginfo.features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
-}
+#define DAWR_MAX_LEN 512
+static volatile __u8 big_var[DAWR_MAX_LEN] __attribute__((aligned(512)));
 
-static void set_breakpoint_addr(void *addr)
-{
-   int ret;
+#define A_LEN 6
+#define B_LEN 6
+struct gstruct {
+   __u8 a[A_LEN]; /* double word aligned */
+   __u8 b[B_LEN]; /* double word unaligned */
+};
+static volatile struct gstruct gstruct __attribute__((aligned(512)));
 
-   ret = ptrace(PTRACE_SET_DEBUGREG, child_pid, 0, addr);
-   if (ret) {
-   perror("Can't set breakpoint addr\n");
-   exit(-1);
-   }
-}
 
-static int set_hwbreakpoint_addr(void *addr, int range)
+static void get_dbginfo(pid_t child_pid, struct ppc_debug_info *dbginfo)
 {
-   int ret;
-
-   struct ppc_hw_breakpoint info;
-
-   info.version = 1;
-   info.trigger_type = PPC_BREAKPOINT_TRIGGER_RW;
-   info.addr_mode = PPC_BREAKPOINT_MODE_EXACT;
-   if (range > 0)
-   info.addr_mode = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
-   info.condition_mode = PPC_BREAKPOINT_CONDITION_NONE;
-   info.addr = (__u64)addr;
-   info.addr2 = (__u64)addr + range;
-   info.condition_value = 0;
-
-   ret = ptrace(PPC_PTRACE_SETHWDEBUG, child_pid, 0, &info);
-   if (ret < 0) {
-   perror("Can't set breakpoint\n");
+   if (ptrace(PPC_PTRACE_GETHWDBGINFO, child_pid, NULL, dbginfo)) {
+   perror("Can't get breakpoint info");
exit(-1);
}
-   return ret;
 }
 
-static int del_hwbreakpoint_addr(int watchpoint_handle)
+static bool dawr_present(struct ppc_debug_info *dbginfo)
 {
-   int ret;
-
-   ret = ptrace(PPC_PTRACE_DELHWDEBUG, child_pid, 0, watchpoint_handle);
-   if (ret < 0) {
-   perror("Can't delete hw breakpoint\n");
-   exit(-1);
-   }
-   return ret;
+   return !!(dbginfo->features & PPC_DEBUG_FEATURE_DATA_BP_DAWR);
 }
 

[PATCH v5 4/7] Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly

2019-10-14 Thread Ravi Bangoria
On Powerpc, the watchpoint match range is doubleword granular. On a
watchpoint hit, DAR is set to the first byte of overlap between the
actual access and the watched range, so it's quite possible that DAR
does not point inside the user-specified range. For example, say the
user creates a watchpoint with address range 0x1004 to 0x1007, so hw
is configured to watch from 0x1000 to 0x1007. If there is a 4-byte
access from 0x1002 to 0x1005, DAR will point to 0x1002 and the
interrupt handler considers it extraneous, but it actually isn't,
because part of the access belongs to what the user asked for.

Instead of blindly ignoring the exception, get the actual address
range by analysing the instruction, and ignore it only if the actual
range does not overlap with the user-specified range.

Note: The behavior is unchanged for 8xx.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/kernel/hw_breakpoint.c | 52 +
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index f36274d426ed..58ce3d37c2a3 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -222,33 +222,49 @@ void thread_change_pc(struct task_struct *tsk, struct 
pt_regs *regs)
tsk->thread.last_hit_ubp = NULL;
 }
 
-static bool is_larx_stcx_instr(struct pt_regs *regs, unsigned int instr)
+static bool dar_within_range(unsigned long dar, struct arch_hw_breakpoint 
*info)
 {
-   int ret, type;
-   struct instruction_op op;
+   return ((info->address <= dar) && (dar - info->address < info->len));
+}
 
-   ret = analyse_instr(&op, regs, instr);
-   type = GETTYPE(op.type);
-   return (!ret && (type == LARX || type == STCX));
+static bool
+dar_range_overlaps(unsigned long dar, int size, struct arch_hw_breakpoint 
*info)
+{
+   return ((dar <= info->address + info->len - 1) &&
+   (dar + size - 1 >= info->address));
 }
 
 /*
  * Handle debug exception notifications.
  */
 static bool stepping_handler(struct pt_regs *regs, struct perf_event *bp,
-unsigned long addr)
+struct arch_hw_breakpoint *info)
 {
unsigned int instr = 0;
+   int ret, type, size;
+   struct instruction_op op;
+   unsigned long addr = info->address;
 
if (__get_user_inatomic(instr, (unsigned int *)regs->nip))
goto fail;
 
-   if (is_larx_stcx_instr(regs, instr)) {
+   ret = analyse_instr(&op, regs, instr);
+   type = GETTYPE(op.type);
+   size = GETSIZE(op.type);
+
+   if (!ret && (type == LARX || type == STCX)) {
printk_ratelimited("Breakpoint hit on instruction that can't be emulated."
   " Breakpoint at 0x%lx will be disabled.\n", addr);
goto disable;
}
 
+   /*
+* If it's extraneous event, we still need to emulate/single-
+* step the instruction, but we don't generate an event.
+*/
+   if (size && !dar_range_overlaps(regs->dar, size, info))
+   info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+
/* Do not emulate user-space instructions, instead single-step them */
if (user_mode(regs)) {
current->thread.last_hit_ubp = bp;
@@ -280,7 +296,6 @@ int hw_breakpoint_handler(struct die_args *args)
struct perf_event *bp;
struct pt_regs *regs = args->regs;
struct arch_hw_breakpoint *info;
-   unsigned long dar = regs->dar;
 
/* Disable breakpoints during exception handling */
hw_breakpoint_disable();
@@ -312,19 +327,14 @@ int hw_breakpoint_handler(struct die_args *args)
goto out;
}
 
-   /*
-* Verify if dar lies within the address range occupied by the symbol
-* being watched to filter extraneous exceptions.  If it doesn't,
-* we still need to single-step the instruction, but we don't
-* generate an event.
-*/
info->type &= ~HW_BRK_TYPE_EXTRANEOUS_IRQ;
-   if (!((bp->attr.bp_addr <= dar) &&
- (dar - bp->attr.bp_addr < bp->attr.bp_len)))
-   info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
-
-   if (!IS_ENABLED(CONFIG_PPC_8xx) && !stepping_handler(regs, bp, info->address))
-   goto out;
+   if (IS_ENABLED(CONFIG_PPC_8xx)) {
+   if (!dar_within_range(regs->dar, info))
+   info->type |= HW_BRK_TYPE_EXTRANEOUS_IRQ;
+   } else {
+   if (!stepping_handler(regs, bp, info))
+   goto out;
+   }
 
/*
 * As a policy, the callback is invoked in a 'trigger-after-execute'
-- 
2.21.0



[PATCH v5 2/7] Powerpc/Watchpoint: Fix length calculation for unaligned target

2019-10-14 Thread Ravi Bangoria
Watchpoint match range is always doubleword (8 bytes) aligned on
powerpc. If the given range crosses a doubleword boundary, we
need to increase the length so that the next doubleword is also
covered. Ex,

          address   len = 6 bytes
                |=========.
   |------------v--|--v--------------|
   | | | | | | | | | | | | | | | | | |
   |---------------|---------------|
    <---8 bytes--->

In such a case, the current code configures the hw as:
  start_addr = address & ~HW_BREAKPOINT_ALIGN
  len = 8 bytes

and thus reads/writes in the last 4 bytes of the given range are
ignored. Fix this by including the next doubleword in the length.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +
 arch/powerpc/kernel/dawr.c   |  6 +--
 arch/powerpc/kernel/hw_breakpoint.c  | 67 +---
 arch/powerpc/kernel/process.c|  3 ++
 arch/powerpc/kernel/ptrace.c |  1 +
 5 files changed, 56 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_breakpoint.h 
b/arch/powerpc/include/asm/hw_breakpoint.h
index 4a887e85a5f4..ea91ac7f5a27 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -14,6 +14,7 @@ struct arch_hw_breakpoint {
unsigned long   address;
u16 type;
u16 len; /* length of the target data symbol */
+   u16 hw_len; /* length programmed in hw */
 };
 
 /* Note: Don't change the the first 6 bits below as they are in the same order
@@ -73,6 +74,7 @@ static inline void hw_breakpoint_disable(void)
brk.address = 0;
brk.type = 0;
brk.len = 0;
+   brk.hw_len = 0;
if (ppc_breakpoint_available())
__set_breakpoint();
 }
diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c
index 5f66b95b6858..cc14aa6c4a1b 100644
--- a/arch/powerpc/kernel/dawr.c
+++ b/arch/powerpc/kernel/dawr.c
@@ -30,10 +30,10 @@ int set_dawr(struct arch_hw_breakpoint *brk)
 * DAWR length is stored in field MDR bits 48:53.  Matches range in
 * doublewords (64 bits) baised by -1 eg. 0b00=1DW and
 * 0b11=64DW.
-* brk->len is in bytes.
+* brk->hw_len is in bytes.
 * This aligns up to double word size, shifts and does the bias.
 */
-   mrd = ((brk->len + 7) >> 3) - 1;
+   mrd = ((brk->hw_len + 7) >> 3) - 1;
dawrx |= (mrd & 0x3f) << (63 - 53);
 
if (ppc_md.set_dawr)
@@ -54,7 +54,7 @@ static ssize_t dawr_write_file_bool(struct file *file,
const char __user *user_buf,
size_t count, loff_t *ppos)
 {
-   struct arch_hw_breakpoint null_brk = {0, 0, 0};
+   struct arch_hw_breakpoint null_brk = {0};
size_t rc;
 
/* Send error to user if they hypervisor won't allow us to write DAWR */
diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 677041cb3c3e..f36274d426ed 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -126,6 +126,49 @@ int arch_bp_generic_fields(int type, int *gen_bp_type)
return 0;
 }
 
+/*
+ * Watchpoint match range is always doubleword (8 bytes) aligned on
+ * powerpc. If the given range crosses a doubleword boundary, we
+ * need to increase the length so that the next doubleword is also
+ * covered. Ex,
+ *
+ *          address   len = 6 bytes
+ *                |=========.
+ *   |------------v--|--v--------------|
+ *   | | | | | | | | | | | | | | | | | |
+ *   |---------------|---------------|
+ *    <---8 bytes--->
+ *
+ * In this case, we should configure hw as:
+ *   start_addr = address & ~HW_BREAKPOINT_ALIGN
+ *   len = 16 bytes
+ *
+ * @start_addr and @end_addr are inclusive.
+ */
+static int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw)
+{
+   u16 max_len = DABR_MAX_LEN;
+   u16 hw_len;
+   unsigned long start_addr, end_addr;
+
+   start_addr = hw->address & ~HW_BREAKPOINT_ALIGN;
+   end_addr = (hw->address + hw->len - 1) | HW_BREAKPOINT_ALIGN;
+   hw_len = end_addr - start_addr + 1;
+
+   if (dawr_enabled()) {
+   max_len = DAWR_MAX_LEN;
+   /* DAWR region can't cross 512 bytes boundary */
+   if ((start_addr >> 9) != (end_addr >> 9))
+   return -EINVAL;
+   }
+
+   if (hw_len > max_len)
+   return -EINVAL;
+
+   hw->hw_len = hw_len;
+   return 0;
+}
+
 /*
  * Validate the arch-specific HW Breakpoint register settings
  */
@@ -133,9 +176,9 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 const struct perf_event_attr *attr,
 struct arch_hw_breakpoint *hw)
 {
-   int ret = -EINVAL, length_max;
+   int ret = -EINVAL;
 
-   if (!bp)
+   if (!bp || !attr->bp_len)
return ret;
 
hw->type = 

[PATCH v5 1/7] Powerpc/Watchpoint: Introduce macros for watchpoint length

2019-10-14 Thread Ravi Bangoria
We are hardcoding the length everywhere in the watchpoint code.
Introduce macros for the length and use them.

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/include/asm/hw_breakpoint.h | 3 +++
 arch/powerpc/kernel/hw_breakpoint.c  | 4 ++--
 arch/powerpc/kernel/ptrace.c | 6 +++---
 arch/powerpc/xmon/xmon.c | 2 +-
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_breakpoint.h 
b/arch/powerpc/include/asm/hw_breakpoint.h
index 67e2da195eae..4a887e85a5f4 100644
--- a/arch/powerpc/include/asm/hw_breakpoint.h
+++ b/arch/powerpc/include/asm/hw_breakpoint.h
@@ -33,6 +33,9 @@ struct arch_hw_breakpoint {
 #define HW_BRK_TYPE_PRIV_ALL   (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \
 HW_BRK_TYPE_HYP)
 
+#define DABR_MAX_LEN   8
+#define DAWR_MAX_LEN   512
+
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 #include 
 #include 
diff --git a/arch/powerpc/kernel/hw_breakpoint.c 
b/arch/powerpc/kernel/hw_breakpoint.c
index 1007ec36b4cb..677041cb3c3e 100644
--- a/arch/powerpc/kernel/hw_breakpoint.c
+++ b/arch/powerpc/kernel/hw_breakpoint.c
@@ -163,9 +163,9 @@ int hw_breakpoint_arch_parse(struct perf_event *bp,
 */
if (!ppc_breakpoint_available())
return -ENODEV;
-   length_max = 8; /* DABR */
+   length_max = DABR_MAX_LEN; /* DABR */
if (dawr_enabled()) {
-   length_max = 512 ; /* 64 doublewords */
+   length_max = DAWR_MAX_LEN; /* 64 doublewords */
/* DAWR region can't cross 512 boundary */
if ((attr->bp_addr >> 9) !=
((attr->bp_addr + attr->bp_len - 1) >> 9))
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8c92febf5f44..f22e773a416a 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2425,7 +2425,7 @@ static int ptrace_set_debugreg(struct task_struct *task, 
unsigned long addr,
return -EIO;
hw_brk.address = data & (~HW_BRK_TYPE_DABR);
hw_brk.type = (data & HW_BRK_TYPE_DABR) | HW_BRK_TYPE_PRIV_ALL;
-   hw_brk.len = 8;
+   hw_brk.len = DABR_MAX_LEN;
set_bp = (data) && (hw_brk.type & HW_BRK_TYPE_RDWR);
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
bp = thread->ptrace_bps[0];
@@ -2456,7 +2456,7 @@ static int ptrace_set_debugreg(struct task_struct *task, 
unsigned long addr,
/* Create a new breakpoint request if one doesn't exist already */
hw_breakpoint_init();
attr.bp_addr = hw_brk.address;
-   attr.bp_len = 8;
+   attr.bp_len = DABR_MAX_LEN;
arch_bp_generic_fields(hw_brk.type,
   _type);
 
@@ -2882,7 +2882,7 @@ static long ppc_set_hwdebug(struct task_struct *child,
 
brk.address = bp_info->addr & ~7UL;
brk.type = HW_BRK_TYPE_TRANSLATE;
-   brk.len = 8;
+   brk.len = DABR_MAX_LEN;
if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_READ)
brk.type |= HW_BRK_TYPE_READ;
if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index d83364ebc5c5..d547e540c230 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -884,7 +884,7 @@ static void insert_cpu_bpts(void)
if (dabr.enabled) {
brk.address = dabr.address;
brk.type = (dabr.enabled & HW_BRK_TYPE_DABR) | HW_BRK_TYPE_PRIV_ALL;
-   brk.len = 8;
+   brk.len = DABR_MAX_LEN;
__set_breakpoint();
}
 
-- 
2.21.0



[PATCH v5 0/7] Powerpc/Watchpoint: Few important fixes

2019-10-14 Thread Ravi Bangoria
v4: https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-September/197621.html

v4->v5:
 - patch 1,2,3/7: Split v4 patch1 into three different patches.
   * 1st patch to replace hardcoded watchpoint length with macros
   * 2nd patch that fixes the unaligned watchpoint issue
   * 3rd patch that fixes ptrace code that mucks around with address/len
 - patch 3/7: v4 patch1 was creating a regression in the watchpoint length
   calculation in ptrace code. Fixed that.
 - patch 7/7: Disabled MODE_RANGE and 512 byte testcases for 8xx. (Build
   tested only)
 - patch 7/7: Unaligned watchpoints are not supported with DABR. Test
   unaligned watchpoint only when DAWR is present.

Ravi Bangoria (7):
  Powerpc/Watchpoint: Introduce macros for watchpoint length
  Powerpc/Watchpoint: Fix length calculation for unaligned target
  Powerpc/Watchpoint: Fix ptrace code that muck around with address/len
  Powerpc/Watchpoint: Don't ignore extraneous exceptions blindly
  Powerpc/Watchpoint: Rewrite ptrace-hwbreak.c selftest
  Powerpc/Watchpoint: Add dar outside test in perf-hwbreak.c selftest
  Powerpc/Watchpoint: Support for 8xx in ptrace-hwbreak.c selftest

 arch/powerpc/include/asm/hw_breakpoint.h  |   9 +-
 arch/powerpc/kernel/dawr.c|   6 +-
 arch/powerpc/kernel/hw_breakpoint.c   | 119 ++--
 arch/powerpc/kernel/process.c |   3 +
 arch/powerpc/kernel/ptrace.c  |  16 +-
 arch/powerpc/xmon/xmon.c  |   2 +-
 .../selftests/powerpc/ptrace/perf-hwbreak.c   | 111 +++-
 .../selftests/powerpc/ptrace/ptrace-hwbreak.c | 581 +++---
 8 files changed, 582 insertions(+), 265 deletions(-)

-- 
2.21.0



Re: [PATCH 4/4] crypto: nx - convert AES-CTR to skcipher API

2019-10-14 Thread Ard Biesheuvel
On Sun, 13 Oct 2019 at 06:40, Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> Convert the PowerPC Nest (NX) implementation of AES-CTR from the
> deprecated "blkcipher" API to the "skcipher" API.  This is needed in
> order for the blkcipher API to be removed.
>
> Signed-off-by: Eric Biggers 

Reviewed-by: Ard Biesheuvel 

> ---
>  drivers/crypto/nx/nx-aes-ctr.c | 84 +++---
>  drivers/crypto/nx/nx.c | 25 +++---
>  drivers/crypto/nx/nx.h |  4 +-
>  3 files changed, 46 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/crypto/nx/nx-aes-ctr.c b/drivers/crypto/nx/nx-aes-ctr.c
> index 05e558cefe94..6d5ce1a66f1e 100644
> --- a/drivers/crypto/nx/nx-aes-ctr.c
> +++ b/drivers/crypto/nx/nx-aes-ctr.c
> @@ -19,11 +19,11 @@
>  #include "nx.h"
>
>
> -static int ctr_aes_nx_set_key(struct crypto_tfm *tfm,
> - const u8  *in_key,
> - unsigned int   key_len)
> +static int ctr_aes_nx_set_key(struct crypto_skcipher *tfm,
> + const u8   *in_key,
> + unsigned intkey_len)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(tfm);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = nx_ctx->csbcpb;
>
> nx_ctx_init(nx_ctx, HCOP_FC_AES);
> @@ -51,11 +51,11 @@ static int ctr_aes_nx_set_key(struct crypto_tfm *tfm,
> return 0;
>  }
>
> -static int ctr3686_aes_nx_set_key(struct crypto_tfm *tfm,
> - const u8  *in_key,
> - unsigned int   key_len)
> +static int ctr3686_aes_nx_set_key(struct crypto_skcipher *tfm,
> + const u8   *in_key,
> + unsigned intkey_len)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(tfm);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
>
> if (key_len < CTR_RFC3686_NONCE_SIZE)
> return -EINVAL;
> @@ -69,12 +69,10 @@ static int ctr3686_aes_nx_set_key(struct crypto_tfm *tfm,
> return ctr_aes_nx_set_key(tfm, in_key, key_len);
>  }
>
> -static int ctr_aes_nx_crypt(struct blkcipher_desc *desc,
> -   struct scatterlist*dst,
> -   struct scatterlist*src,
> -   unsigned int   nbytes)
> +static int ctr_aes_nx_crypt(struct skcipher_request *req, u8 *iv)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_blkcipher_ctx(desc->tfm);
> +   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = nx_ctx->csbcpb;
> unsigned long irq_flags;
> unsigned int processed = 0, to_process;
> @@ -83,9 +81,9 @@ static int ctr_aes_nx_crypt(struct blkcipher_desc *desc,
> spin_lock_irqsave(_ctx->lock, irq_flags);
>
> do {
> -   to_process = nbytes - processed;
> +   to_process = req->cryptlen - processed;
>
> -   rc = nx_build_sg_lists(nx_ctx, desc->info, dst, src,
> +   rc = nx_build_sg_lists(nx_ctx, iv, req->dst, req->src,
>_process, processed,
>csbcpb->cpb.aes_ctr.iv);
> if (rc)
> @@ -97,59 +95,51 @@ static int ctr_aes_nx_crypt(struct blkcipher_desc *desc,
> }
>
> rc = nx_hcall_sync(nx_ctx, _ctx->op,
> -  desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP);
> +  req->base.flags & 
> CRYPTO_TFM_REQ_MAY_SLEEP);
> if (rc)
> goto out;
>
> -   memcpy(desc->info, csbcpb->cpb.aes_cbc.cv, AES_BLOCK_SIZE);
> +   memcpy(iv, csbcpb->cpb.aes_cbc.cv, AES_BLOCK_SIZE);
>
> atomic_inc(&(nx_ctx->stats->aes_ops));
> atomic64_add(csbcpb->csb.processed_byte_count,
>  &(nx_ctx->stats->aes_bytes));
>
> processed += to_process;
> -   } while (processed < nbytes);
> +   } while (processed < req->cryptlen);
>  out:
> spin_unlock_irqrestore(_ctx->lock, irq_flags);
> return rc;
>  }
>
> -static int ctr3686_aes_nx_crypt(struct blkcipher_desc *desc,
> -   struct scatterlist*dst,
> -   struct scatterlist*src,
> -   unsigned int   nbytes)
> +static int ctr3686_aes_nx_crypt(struct skcipher_request *req)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_blkcipher_ctx(desc->tfm);
> +   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> u8 iv[16];
>
> memcpy(iv, 

Re: [PATCH 3/4] crypto: nx - convert AES-CBC to skcipher API

2019-10-14 Thread Ard Biesheuvel
On Sun, 13 Oct 2019 at 06:40, Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> Convert the PowerPC Nest (NX) implementation of AES-CBC from the
> deprecated "blkcipher" API to the "skcipher" API.  This is needed in
> order for the blkcipher API to be removed.
>
> Signed-off-by: Eric Biggers 

Reviewed-by: Ard Biesheuvel 

> ---
>  drivers/crypto/nx/nx-aes-cbc.c | 78 ++
>  drivers/crypto/nx/nx.c | 11 ++---
>  drivers/crypto/nx/nx.h |  4 +-
>  3 files changed, 41 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/crypto/nx/nx-aes-cbc.c b/drivers/crypto/nx/nx-aes-cbc.c
> index 482a203a9260..92e921eceed7 100644
> --- a/drivers/crypto/nx/nx-aes-cbc.c
> +++ b/drivers/crypto/nx/nx-aes-cbc.c
> @@ -18,11 +18,11 @@
>  #include "nx.h"
>
>
> -static int cbc_aes_nx_set_key(struct crypto_tfm *tfm,
> - const u8  *in_key,
> - unsigned int   key_len)
> +static int cbc_aes_nx_set_key(struct crypto_skcipher *tfm,
> + const u8   *in_key,
> + unsigned intkey_len)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(tfm);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = nx_ctx->csbcpb;
>
> nx_ctx_init(nx_ctx, HCOP_FC_AES);
> @@ -50,13 +50,11 @@ static int cbc_aes_nx_set_key(struct crypto_tfm *tfm,
> return 0;
>  }
>
> -static int cbc_aes_nx_crypt(struct blkcipher_desc *desc,
> -   struct scatterlist*dst,
> -   struct scatterlist*src,
> -   unsigned int   nbytes,
> -   intenc)
> +static int cbc_aes_nx_crypt(struct skcipher_request *req,
> +   int  enc)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_blkcipher_ctx(desc->tfm);
> +   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = nx_ctx->csbcpb;
> unsigned long irq_flags;
> unsigned int processed = 0, to_process;
> @@ -70,9 +68,9 @@ static int cbc_aes_nx_crypt(struct blkcipher_desc *desc,
> NX_CPB_FDM(csbcpb) &= ~NX_FDM_ENDE_ENCRYPT;
>
> do {
> -   to_process = nbytes - processed;
> +   to_process = req->cryptlen - processed;
>
> -   rc = nx_build_sg_lists(nx_ctx, desc->info, dst, src,
> +   rc = nx_build_sg_lists(nx_ctx, req->iv, req->dst, req->src,
>_process, processed,
>csbcpb->cpb.aes_cbc.iv);
> if (rc)
> @@ -84,56 +82,46 @@ static int cbc_aes_nx_crypt(struct blkcipher_desc *desc,
> }
>
> rc = nx_hcall_sync(nx_ctx, _ctx->op,
> -  desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP);
> +  req->base.flags & 
> CRYPTO_TFM_REQ_MAY_SLEEP);
> if (rc)
> goto out;
>
> -   memcpy(desc->info, csbcpb->cpb.aes_cbc.cv, AES_BLOCK_SIZE);
> +   memcpy(req->iv, csbcpb->cpb.aes_cbc.cv, AES_BLOCK_SIZE);
> atomic_inc(&(nx_ctx->stats->aes_ops));
> atomic64_add(csbcpb->csb.processed_byte_count,
>  &(nx_ctx->stats->aes_bytes));
>
> processed += to_process;
> -   } while (processed < nbytes);
> +   } while (processed < req->cryptlen);
>  out:
> spin_unlock_irqrestore(_ctx->lock, irq_flags);
> return rc;
>  }
>
> -static int cbc_aes_nx_encrypt(struct blkcipher_desc *desc,
> - struct scatterlist*dst,
> - struct scatterlist*src,
> - unsigned int   nbytes)
> +static int cbc_aes_nx_encrypt(struct skcipher_request *req)
>  {
> -   return cbc_aes_nx_crypt(desc, dst, src, nbytes, 1);
> +   return cbc_aes_nx_crypt(req, 1);
>  }
>
> -static int cbc_aes_nx_decrypt(struct blkcipher_desc *desc,
> - struct scatterlist*dst,
> - struct scatterlist*src,
> - unsigned int   nbytes)
> +static int cbc_aes_nx_decrypt(struct skcipher_request *req)
>  {
> -   return cbc_aes_nx_crypt(desc, dst, src, nbytes, 0);
> +   return cbc_aes_nx_crypt(req, 0);
>  }
>
> -struct crypto_alg nx_cbc_aes_alg = {
> -   .cra_name= "cbc(aes)",
> -   .cra_driver_name = "cbc-aes-nx",
> -   .cra_priority= 300,
> -   .cra_flags   = CRYPTO_ALG_TYPE_BLKCIPHER,
> -   .cra_blocksize   = AES_BLOCK_SIZE,
> -   .cra_ctxsize = sizeof(struct nx_crypto_ctx),
> -   .cra_type= _blkcipher_type,
> - 

Re: [PATCH 2/4] crypto: nx - convert AES-ECB to skcipher API

2019-10-14 Thread Ard Biesheuvel
On Sun, 13 Oct 2019 at 06:40, Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> Convert the PowerPC Nest (NX) implementation of AES-ECB from the
> deprecated "blkcipher" API to the "skcipher" API.  This is needed in
> order for the blkcipher API to be removed.
>
> Signed-off-by: Eric Biggers 

Reviewed-by: Ard Biesheuvel 

> ---
>  drivers/crypto/nx/nx-aes-ecb.c | 76 ++
>  drivers/crypto/nx/nx.c | 28 ++---
>  drivers/crypto/nx/nx.h |  5 ++-
>  3 files changed, 58 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/crypto/nx/nx-aes-ecb.c b/drivers/crypto/nx/nx-aes-ecb.c
> index 87183890d1ab..77e338dc33f1 100644
> --- a/drivers/crypto/nx/nx-aes-ecb.c
> +++ b/drivers/crypto/nx/nx-aes-ecb.c
> @@ -18,11 +18,11 @@
>  #include "nx.h"
>
>
> -static int ecb_aes_nx_set_key(struct crypto_tfm *tfm,
> - const u8  *in_key,
> - unsigned int   key_len)
> +static int ecb_aes_nx_set_key(struct crypto_skcipher *tfm,
> + const u8   *in_key,
> + unsigned intkey_len)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(tfm);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = (struct nx_csbcpb *)nx_ctx->csbcpb;
>
> nx_ctx_init(nx_ctx, HCOP_FC_AES);
> @@ -50,13 +50,11 @@ static int ecb_aes_nx_set_key(struct crypto_tfm *tfm,
> return 0;
>  }
>
> -static int ecb_aes_nx_crypt(struct blkcipher_desc *desc,
> -   struct scatterlist*dst,
> -   struct scatterlist*src,
> -   unsigned int   nbytes,
> -   intenc)
> +static int ecb_aes_nx_crypt(struct skcipher_request *req,
> +   int  enc)
>  {
> -   struct nx_crypto_ctx *nx_ctx = crypto_blkcipher_ctx(desc->tfm);
> +   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +   struct nx_crypto_ctx *nx_ctx = crypto_skcipher_ctx(tfm);
> struct nx_csbcpb *csbcpb = nx_ctx->csbcpb;
> unsigned long irq_flags;
> unsigned int processed = 0, to_process;
> @@ -70,10 +68,10 @@ static int ecb_aes_nx_crypt(struct blkcipher_desc *desc,
> NX_CPB_FDM(csbcpb) &= ~NX_FDM_ENDE_ENCRYPT;
>
> do {
> -   to_process = nbytes - processed;
> +   to_process = req->cryptlen - processed;
>
> -   rc = nx_build_sg_lists(nx_ctx, NULL, dst, src, _process,
> -  processed, NULL);
> +   rc = nx_build_sg_lists(nx_ctx, NULL, req->dst, req->src,
> +  _process, processed, NULL);
> if (rc)
> goto out;
>
> @@ -83,7 +81,7 @@ static int ecb_aes_nx_crypt(struct blkcipher_desc *desc,
> }
>
> rc = nx_hcall_sync(nx_ctx, _ctx->op,
> -  desc->flags & CRYPTO_TFM_REQ_MAY_SLEEP);
> +  req->base.flags & 
> CRYPTO_TFM_REQ_MAY_SLEEP);
> if (rc)
> goto out;
>
> @@ -92,46 +90,36 @@ static int ecb_aes_nx_crypt(struct blkcipher_desc *desc,
>  &(nx_ctx->stats->aes_bytes));
>
> processed += to_process;
> -   } while (processed < nbytes);
> +   } while (processed < req->cryptlen);
>
>  out:
> spin_unlock_irqrestore(_ctx->lock, irq_flags);
> return rc;
>  }
>
> -static int ecb_aes_nx_encrypt(struct blkcipher_desc *desc,
> - struct scatterlist*dst,
> - struct scatterlist*src,
> - unsigned int   nbytes)
> +static int ecb_aes_nx_encrypt(struct skcipher_request *req)
>  {
> -   return ecb_aes_nx_crypt(desc, dst, src, nbytes, 1);
> +   return ecb_aes_nx_crypt(req, 1);
>  }
>
> -static int ecb_aes_nx_decrypt(struct blkcipher_desc *desc,
> - struct scatterlist*dst,
> - struct scatterlist*src,
> - unsigned int   nbytes)
> +static int ecb_aes_nx_decrypt(struct skcipher_request *req)
>  {
> -   return ecb_aes_nx_crypt(desc, dst, src, nbytes, 0);
> +   return ecb_aes_nx_crypt(req, 0);
>  }
>
> -struct crypto_alg nx_ecb_aes_alg = {
> -   .cra_name= "ecb(aes)",
> -   .cra_driver_name = "ecb-aes-nx",
> -   .cra_priority= 300,
> -   .cra_flags   = CRYPTO_ALG_TYPE_BLKCIPHER,
> -   .cra_blocksize   = AES_BLOCK_SIZE,
> -   .cra_alignmask   = 0xf,
> -   .cra_ctxsize = sizeof(struct nx_crypto_ctx),
> -   .cra_type= _blkcipher_type,
> -   .cra_module  = THIS_MODULE,
> -   .cra_init= nx_crypto_ctx_aes_ecb_init,

Re: [PATCH 1/4] crypto: nx - don't abuse blkcipher_desc to pass iv around

2019-10-14 Thread Ard Biesheuvel
On Sun, 13 Oct 2019 at 06:40, Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> The NX crypto driver is using 'struct blkcipher_desc' to pass the IV
> around, even for AEADs (for which it creates the struct on the stack).
> This is not appropriate since this structure is part of the "blkcipher"
> API, which is deprecated and will be removed.
>
> Just pass around the IV directly instead.
>
> Signed-off-by: Eric Biggers 

Reviewed-by: Ard Biesheuvel 

> ---
>  drivers/crypto/nx/nx-aes-cbc.c |  5 +++--
>  drivers/crypto/nx/nx-aes-ccm.c | 40 --
>  drivers/crypto/nx/nx-aes-ctr.c |  5 +++--
>  drivers/crypto/nx/nx-aes-ecb.c |  4 ++--
>  drivers/crypto/nx/nx-aes-gcm.c | 24 +---
>  drivers/crypto/nx/nx.c | 16 +++---
>  drivers/crypto/nx/nx.h |  6 ++---
>  7 files changed, 43 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/crypto/nx/nx-aes-cbc.c b/drivers/crypto/nx/nx-aes-cbc.c
> index e631f9979127..482a203a9260 100644
> --- a/drivers/crypto/nx/nx-aes-cbc.c
> +++ b/drivers/crypto/nx/nx-aes-cbc.c
> @@ -72,8 +72,9 @@ static int cbc_aes_nx_crypt(struct blkcipher_desc *desc,
> do {
> to_process = nbytes - processed;
>
> -   rc = nx_build_sg_lists(nx_ctx, desc, dst, src, _process,
> -  processed, csbcpb->cpb.aes_cbc.iv);
> +   rc = nx_build_sg_lists(nx_ctx, desc->info, dst, src,
> +  _process, processed,
> +  csbcpb->cpb.aes_cbc.iv);
> if (rc)
> goto out;
>
> diff --git a/drivers/crypto/nx/nx-aes-ccm.c b/drivers/crypto/nx/nx-aes-ccm.c
> index 5be8f01c5da8..84fed736ed2e 100644
> --- a/drivers/crypto/nx/nx-aes-ccm.c
> +++ b/drivers/crypto/nx/nx-aes-ccm.c
> @@ -327,7 +327,7 @@ static int generate_pat(u8   *iv,
>  }
>
>  static int ccm_nx_decrypt(struct aead_request   *req,
> - struct blkcipher_desc *desc,
> + u8*iv,
>   unsigned int assoclen)
>  {
> struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(req->base.tfm);
> @@ -348,7 +348,7 @@ static int ccm_nx_decrypt(struct aead_request   *req,
>  req->src, nbytes + req->assoclen, authsize,
>  SCATTERWALK_FROM_SG);
>
> -   rc = generate_pat(desc->info, req, nx_ctx, authsize, nbytes, assoclen,
> +   rc = generate_pat(iv, req, nx_ctx, authsize, nbytes, assoclen,
>   csbcpb->cpb.aes_ccm.in_pat_or_b0);
> if (rc)
> goto out;
> @@ -367,7 +367,7 @@ static int ccm_nx_decrypt(struct aead_request   *req,
>
> NX_CPB_FDM(nx_ctx->csbcpb) &= ~NX_FDM_ENDE_ENCRYPT;
>
> -   rc = nx_build_sg_lists(nx_ctx, desc, req->dst, req->src,
> +   rc = nx_build_sg_lists(nx_ctx, iv, req->dst, req->src,
>_process, processed + req->assoclen,
>csbcpb->cpb.aes_ccm.iv_or_ctr);
> if (rc)
> @@ -381,7 +381,7 @@ static int ccm_nx_decrypt(struct aead_request   *req,
> /* for partial completion, copy following for next
>  * entry into loop...
>  */
> -   memcpy(desc->info, csbcpb->cpb.aes_ccm.out_ctr, 
> AES_BLOCK_SIZE);
> +   memcpy(iv, csbcpb->cpb.aes_ccm.out_ctr, AES_BLOCK_SIZE);
> memcpy(csbcpb->cpb.aes_ccm.in_pat_or_b0,
> csbcpb->cpb.aes_ccm.out_pat_or_mac, AES_BLOCK_SIZE);
> memcpy(csbcpb->cpb.aes_ccm.in_s0,
> @@ -405,7 +405,7 @@ static int ccm_nx_decrypt(struct aead_request   *req,
>  }
>
>  static int ccm_nx_encrypt(struct aead_request   *req,
> - struct blkcipher_desc *desc,
> + u8*iv,
>   unsigned int assoclen)
>  {
> struct nx_crypto_ctx *nx_ctx = crypto_tfm_ctx(req->base.tfm);
> @@ -418,7 +418,7 @@ static int ccm_nx_encrypt(struct aead_request   *req,
>
> spin_lock_irqsave(_ctx->lock, irq_flags);
>
> -   rc = generate_pat(desc->info, req, nx_ctx, authsize, nbytes, assoclen,
> +   rc = generate_pat(iv, req, nx_ctx, authsize, nbytes, assoclen,
>   csbcpb->cpb.aes_ccm.in_pat_or_b0);
> if (rc)
> goto out;
> @@ -436,7 +436,7 @@ static int ccm_nx_encrypt(struct aead_request   *req,
>
> NX_CPB_FDM(csbcpb) |= NX_FDM_ENDE_ENCRYPT;
>
> -   rc = nx_build_sg_lists(nx_ctx, desc, req->dst, req->src,
> +   rc = nx_build_sg_lists(nx_ctx, iv, req->dst, req->src,
>_process, processed + req->assoclen,
>csbcpb->cpb.aes_ccm.iv_or_ctr);
> if (rc)
> @@ -450,7 +450,7 @@ static 

Applied "ASoC: fsl_mqs: Move static keyword to the front of declarations" to the asoc tree

2019-10-14 Thread Mark Brown
The patch

   ASoC: fsl_mqs: Move static keyword to the front of declarations

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From dd79841ca66ff509660880237dc286d7f116a766 Mon Sep 17 00:00:00 2001
From: YueHaibing 
Date: Fri, 11 Oct 2019 22:35:38 +0800
Subject: [PATCH] ASoC: fsl_mqs: Move static keyword to the front of
 declarations

gcc warn about this:

sound/soc/fsl/fsl_mqs.c:146:1: warning:
 static is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: YueHaibing 
Acked-by: Nicolin Chen 
Link: https://lore.kernel.org/r/20191011143538.15300-1-yuehaib...@huawei.com
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/fsl_mqs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/fsl/fsl_mqs.c b/sound/soc/fsl/fsl_mqs.c
index f7fc44e8fb27..0c813a45bba7 100644
--- a/sound/soc/fsl/fsl_mqs.c
+++ b/sound/soc/fsl/fsl_mqs.c
@@ -143,7 +143,7 @@ static void fsl_mqs_shutdown(struct snd_pcm_substream 
*substream,
   MQS_EN_MASK, 0);
 }
 
-const static struct snd_soc_component_driver soc_codec_fsl_mqs = {
+static const struct snd_soc_component_driver soc_codec_fsl_mqs = {
.idle_bias_on = 1,
.non_legacy_dai_naming  = 1,
 };
-- 
2.20.1



Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()

2019-10-14 Thread Robin Murphy

On 14/10/2019 05:51, David Gibson wrote:

On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote:

From: Thiago Jung Bauermann 

In order to safely use the DMA API, virtio needs to know whether DMA
addresses are in fact physical addresses and for that purpose,
dma_addr_is_phys_addr() is introduced.

cc: Benjamin Herrenschmidt 
cc: David Gibson 
cc: Michael Ellerman 
cc: Paul Mackerras 
cc: Michael Roth 
cc: Alexey Kardashevskiy 
cc: Paul Burton 
cc: Robin Murphy 
cc: Bartlomiej Zolnierkiewicz 
cc: Marek Szyprowski 
cc: Christoph Hellwig 
Suggested-by: Michael S. Tsirkin 
Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 


The change itself looks ok, so

Reviewed-by: David Gibson 

However, I would like to see the commit message (and maybe the inline
comments) expanded a bit on what the distinction here is about.  Some
of the text from the next patch would be suitable, about DMA addresses
usually being in a different address space but not in the case of
bounce buffering.


Right, this needs a much tighter definition. "DMA address happens to be 
a valid physical address" is true of various IOMMU setups too, but I 
can't believe it's meaningful in such cases.


If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA 
address is physical address of DMA data (not necessarily the original 
buffer)" - wouldn't dma_is_direct() suffice?


Robin.


---
  arch/powerpc/include/asm/dma-mapping.h | 21 +
  arch/powerpc/platforms/pseries/Kconfig |  1 +
  include/linux/dma-mapping.h| 20 
  kernel/dma/Kconfig |  3 +++
  4 files changed, 45 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 565d6f7..f92c0a4b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -5,6 +5,8 @@
  #ifndef _ASM_DMA_MAPPING_H
  #define _ASM_DMA_MAPPING_H
  
+#include 

+
  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
  {
/* We don't handle the NULL dev case for ISA for now. We could
@@ -15,4 +17,23 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return NULL;
  }
  
+#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR

+/**
+ * dma_addr_is_phys_addr - check whether a device DMA address is a physical
+ * address
+ * @dev:   device to check
+ *
+ * Returns %true if any DMA address for this device happens to also be a valid
+ * physical address (not necessarily of the same page).
+ */
+static inline bool dma_addr_is_phys_addr(struct device *dev)
+{
+   /*
+* Secure guests always use the SWIOTLB, therefore DMA addresses are
+* actually the physical address of the bounce buffer.
+*/
+   return is_secure_guest();
+}
+#endif
+
  #endif/* _ASM_DMA_MAPPING_H */
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 9e35cdd..0108150 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -152,6 +152,7 @@ config PPC_SVM
select SWIOTLB
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
help
 There are certain POWER platforms which support secure guests using
 the Protected Execution Facility, with the help of an Ultravisor
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f7d1eea..6df5664 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device 
*dev)
dma_get_required_mask(dev);
  }
  
+#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR

+/**
+ * dma_addr_is_phys_addr - check whether a device DMA address is a physical
+ * address
+ * @dev:   device to check
+ *
+ * Returns %true if any DMA address for this device happens to also be a valid
+ * physical address (not necessarily of the same page).
+ */
+static inline bool dma_addr_is_phys_addr(struct device *dev)
+{
+   /*
+* Except in very specific setups, DMA addresses exist in a different
+* address space from CPU physical addresses and cannot be directly used
+* to reference system memory.
+*/
+   return false;
+}
+#endif
+
  #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
const struct iommu_ops *iommu, bool coherent);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 9decbba..6209b46 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT
  config ARCH_HAS_FORCE_DMA_UNENCRYPTED
bool
  
+config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
+   bool
+
  config DMA_NONCOHERENT_CACHE_SYNC
bool
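
As an aside for readers following the plumbing above: the pattern is a weak generic fallback in dma-mapping.h that an architecture overrides by selecting ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR in Kconfig. That split can be modeled in plain userspace C; everything below besides the config symbol name is invented for illustration and is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the kernel's struct device. */
struct device { int id; };

#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
/* Arch override (as pseries would provide): secure guests bounce all
 * DMA through the SWIOTLB, so DMA addresses are physical addresses of
 * bounce buffers. */
static bool is_secure_guest(void) { return true; } /* stub */
static inline bool dma_addr_is_phys_addr(struct device *dev)
{
	(void)dev;
	return is_secure_guest();
}
#else
/* Generic fallback: DMA addresses normally live in a separate address
 * space (e.g. behind an IOMMU) and are not usable as physical
 * addresses. */
static inline bool dma_addr_is_phys_addr(struct device *dev)
{
	(void)dev;
	return false;
}
#endif
```

Built without the config symbol defined, callers get the conservative `false` answer; an arch that selects the symbol swaps in its own policy without touching any caller.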
  




Re: [PATCH 2/2] powerpc/powernv: ocxl move TL definition

2019-10-14 Thread Frederic Barrat




Le 09/10/2019 à 17:11, christophe lombard a écrit :

Specifies the templates in the Transaction Layer that the OpenCAPI device/host
support when transmitting/receiving DL/DLX frames to or from the OpenCAPI
device/host.
Update, rename and create a few new platform-specific calls which can be used
by drivers.

No functional change.

Signed-off-by: Christophe Lombard 
---
  arch/powerpc/include/asm/pnv-ocxl.h   |   5 +-
  arch/powerpc/platforms/powernv/ocxl.c | 103 --
  drivers/misc/ocxl/config.c|  89 +-
  3 files changed, 99 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/pnv-ocxl.h b/arch/powerpc/include/asm/pnv-ocxl.h
index 8e516e339e6c..b8c68878b4ba 100644
--- a/arch/powerpc/include/asm/pnv-ocxl.h
+++ b/arch/powerpc/include/asm/pnv-ocxl.h
@@ -13,10 +13,7 @@ extern int pnv_ocxl_get_actag(struct pci_dev *dev, u16 *base, u16 *enabled,
u16 *supported);
  extern int pnv_ocxl_get_pasid_count(struct pci_dev *dev, int *count);

-extern int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
-   char *rate_buf, int rate_buf_size);
-extern int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
-   uint64_t rate_buf_phys, int rate_buf_size);
+extern int pnv_ocxl_set_TL(struct pci_dev *dev, int tl_dvsec);

  extern int pnv_ocxl_platform_setup(struct pci_dev *dev,
   int PE_mask, int *hwirq,
diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 4d26cba12b63..351324cffc2b 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -369,8 +369,8 @@ static void set_templ_rate(unsigned int templ, unsigned int rate, char *buf)
buf[idx] |= rate << shift;
  }

-int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
-   char *rate_buf, int rate_buf_size)
+static int get_tl_cap(struct pci_dev *dev, long *cap,
+ char *rate_buf, int rate_buf_size)
  {
if (rate_buf_size != PNV_OCXL_TL_RATE_BUF_SIZE)
return -EINVAL;
@@ -390,10 +390,9 @@ int pnv_ocxl_get_tl_cap(struct pci_dev *dev, long *cap,
*cap = PNV_OCXL_TL_P9_RECV_CAP;
return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_get_tl_cap);

-int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
-   uint64_t rate_buf_phys, int rate_buf_size)
+static int set_tl_conf(struct pci_dev *dev, long cap,
+  uint64_t rate_buf_phys, int rate_buf_size)
  {
struct pci_controller *hose = pci_bus_to_host(dev->bus);
struct pnv_phb *phb = hose->private_data;
@@ -410,7 +409,99 @@ int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
}
return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_set_tl_conf);
+
+int pnv_ocxl_set_TL(struct pci_dev *dev, int tl_dvsec)
+{
+   u32 val;
+   __be32 *be32ptr;
+   u8 timers;
+   int i, rc;
+   long recv_cap;
+   char *recv_rate;
+
+   recv_rate = kzalloc(PNV_OCXL_TL_RATE_BUF_SIZE, GFP_KERNEL);
+   if (!recv_rate)
+   return -ENOMEM;
+   /*
+* The spec defines 64 templates for messages in the
+* Transaction Layer (TL).
+*
+* The host and device each support a subset, so we need to
+* configure the transmitters on each side to send only
+* templates the receiver understands, at a rate the receiver
+* can process.  Per the spec, template 0 must be supported by
+* everybody. That's the template which has been used by the
+* host and device so far.
+*
+* The sending rate limit must be set before the template is
+* enabled.
+*/
+
+   /*
+* Device -> host
+*/
+   rc = get_tl_cap(dev, &recv_cap, recv_rate,
+   PNV_OCXL_TL_RATE_BUF_SIZE);
+   if (rc)
+   goto out;
+
+   for (i = 0; i < PNV_OCXL_TL_RATE_BUF_SIZE; i += 4) {
+   be32ptr = (__be32 *) &recv_rate[i];
+   pci_write_config_dword(dev,
+   tl_dvsec + OCXL_DVSEC_TL_SEND_RATE + i,
+   be32_to_cpu(*be32ptr));
+   }
+   val = recv_cap >> 32;
+   pci_write_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_SEND_CAP, val);
+   val = recv_cap & GENMASK(31, 0);
+   pci_write_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_SEND_CAP + 4, val);
+
+   /*
+* Host -> device
+*/
+   for (i = 0; i < PNV_OCXL_TL_RATE_BUF_SIZE; i += 4) {
+   pci_read_config_dword(dev,
+   tl_dvsec + OCXL_DVSEC_TL_RECV_RATE + i,
+   &val);
+   be32ptr = (__be32 *) &recv_rate[i];
+   *be32ptr = cpu_to_be32(val);
+   }
+   pci_read_config_dword(dev, tl_dvsec + OCXL_DVSEC_TL_RECV_CAP, &val);
+   recv_cap = (long) val << 32;
+   pci_read_config_dword(dev, 

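The negotiation described in the comment block of pnv_ocxl_set_TL() boils down to intersecting 64-bit template-capability masks, with template 0 forced on because the spec makes it mandatory. A minimal userspace model of that step (names invented for illustration, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Each side advertises a 64-bit receive-capability mask (one bit per
 * TL template).  A transmitter may only enable templates its peer can
 * receive, and template 0 must always be supported. */
#define TL_TEMPLATE_0 (1ULL << 0)

static uint64_t tl_negotiate_send_cap(uint64_t xmit_supported,
				      uint64_t peer_recv_cap)
{
	/* Enable only templates both sides understand... */
	uint64_t cap = xmit_supported & peer_recv_cap;

	/* ...and keep the mandatory template 0 enabled regardless. */
	return cap | TL_TEMPLATE_0;
}
```

The rate buffers written alongside the capability mask then throttle how fast each enabled template may be sent, which is why the rate must be configured before the template is enabled.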
Re: [PATCH 1/2] powerpc/powernv: ocxl move SPA definition

2019-10-14 Thread Frederic Barrat




diff --git a/arch/powerpc/platforms/powernv/ocxl.c b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aacda9c8..4d26cba12b63 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -12,11 +12,54 @@
  #define PNV_OCXL_PASID_BITS   15
  #define PNV_OCXL_PASID_MAX((1 << PNV_OCXL_PASID_BITS) - 1)

-#define AFU_PRESENT (1 << 31)
-#define AFU_INDEX_MASK 0x3F00
-#define AFU_INDEX_SHIFT 24
-#define ACTAG_MASK 0xFFF
+#define AFU_PRESENT(1 << 31)
+#define AFU_INDEX_MASK 0x3F00
+#define AFU_INDEX_SHIFT24
+#define ACTAG_MASK 0xFFF
+
+#define SPA_PASID_BITS 15
+#define SPA_PASID_MAX  ((1 << SPA_PASID_BITS) - 1)
+#define SPA_PE_MASKSPA_PASID_MAX
+#define SPA_SPA_SIZE_LOG   22 /* Each SPA is 4 Mb */
+#define SPA_PE_VALID   0x8000
+
+#define SPA_CFG_SF (1ull << (63-0))
+#define SPA_CFG_TA (1ull << (63-1))
+#define SPA_CFG_HV (1ull << (63-3))
+#define SPA_CFG_UV (1ull << (63-4))
+#define SPA_CFG_XLAT_hpt   (0ull << (63-6)) /* Hashed page table (HPT) mode */
+#define SPA_CFG_XLAT_roh   (2ull << (63-6)) /* Radix on HPT mode */
+#define SPA_CFG_XLAT_ror   (3ull << (63-6)) /* Radix on Radix mode */
+#define SPA_CFG_PR (1ull << (63-49))
+#define SPA_CFG_TC (1ull << (63-54))
+#define SPA_CFG_DR (1ull << (63-59))
+
+struct ocxl_process_element {
+   __be64 config_state;
+   __be32 reserved1[11];
+   __be32 lpid;
+   __be32 tid;
+   __be32 pid;
+   __be32 reserved2[10];
+   __be64 amr;
+   __be32 reserved3[3];
+   __be32 software_state;
+};
+
+struct spa {
+   struct ocxl_process_element *spa_mem;
+   int spa_order;
+};

+struct platform_data {
+   struct spa *spa;
+   u64 phb_opal_id;
+   u32 bdfn;
+   void __iomem *dsisr;
+   void __iomem *dar;
+   void __iomem *tfc;
+   void __iomem *pe_handle;
+};

  struct actag_range {
u16 start;
@@ -369,7 +412,7 @@ int pnv_ocxl_set_tl_conf(struct pci_dev *dev, long cap,
  }
  EXPORT_SYMBOL_GPL(pnv_ocxl_set_tl_conf);

-int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq)
+static int get_xsl_irq(struct pci_dev *dev, int *hwirq)
  {
int rc;

@@ -381,19 +424,17 @@ int pnv_ocxl_get_xsl_irq(struct pci_dev *dev, int *hwirq)
}
return 0;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_get_xsl_irq);

-void pnv_ocxl_unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
-   void __iomem *tfc, void __iomem *pe_handle)
+static void unmap_xsl_regs(void __iomem *dsisr, void __iomem *dar,
+  void __iomem *tfc, void __iomem *pe_handle)
  {
iounmap(dsisr);
iounmap(dar);
iounmap(tfc);
iounmap(pe_handle);
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_unmap_xsl_regs);

-int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
+static int map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
void __iomem **dar, void __iomem **tfc,
void __iomem **pe_handle)
  {
@@ -429,61 +470,144 @@ int pnv_ocxl_map_xsl_regs(struct pci_dev *dev, void __iomem **dsisr,
}
return rc;
  }
-EXPORT_SYMBOL_GPL(pnv_ocxl_map_xsl_regs);

-struct spa_data {
-   u64 phb_opal_id;
-   u32 bdfn;
-};
+static int alloc_spa(struct pci_dev *dev, struct platform_data *data)
+{
+   struct spa *spa;
+
+   spa = kzalloc(sizeof(struct spa), GFP_KERNEL);
+   if (!spa)
+   return -ENOMEM;
+
+   spa->spa_order = SPA_SPA_SIZE_LOG - PAGE_SHIFT;
+   spa->spa_mem = (struct ocxl_process_element *)
+   __get_free_pages(GFP_KERNEL | __GFP_ZERO, spa->spa_order);
+   if (!spa->spa_mem) {
+   dev_err(&dev->dev, "Can't allocate Shared Process Area\n");
+   kfree(spa);
+   return -ENOMEM;
+   }
+
+   data->spa = spa;
+   dev_dbg(&dev->dev, "Allocated SPA for %x:%x:%x at %p\n",
+  pci_domain_nr(dev->bus), dev->bus->number,
+  PCI_SLOT(dev->devfn), spa->spa_mem);



If using dev_dbg() then we get the domain, bus and device/fn for free as 
part of the message prefix. Which leaves the SPA address, which may be 
hidden with recent kernel changes since it's a pointer to a kernel 
address (I haven't checked). I guess the message could be reworked. The 
point was really to show that we're allocating a spa structure for the 
link and help debugging any ref count issue.




+
+   return 0;
+}
+
+static void free_spa(struct platform_data *data)
+{
+   struct spa *spa = data->spa;
+
+   if (spa && spa->spa_mem) {
+   free_pages((unsigned long) spa->spa_mem, spa->spa_order);
+   kfree(spa);
+   data->spa = NULL;
+   }
+}

-int pnv_ocxl_spa_setup(struct pci_dev *dev, void *spa_mem, int PE_mask,
-   void **platform_data)
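
A side note on the arithmetic in alloc_spa() above: the SPA is a fixed 4 MB region (SPA_SPA_SIZE_LOG = 22) allocated with __get_free_pages() as 2^order pages, so the order is simply the difference of the two logarithms. A sketch of that calculation, with the helper name invented for illustration:

```c
#include <assert.h>

/* 4 MB Shared Process Area, as in the quoted patch. */
#define SPA_SPA_SIZE_LOG 22

/* Number of pages to request is 2^(size_log - page_shift). */
static int spa_order(int page_shift)
{
	return SPA_SPA_SIZE_LOG - page_shift;
}
```

This is why the kernel code computes `SPA_SPA_SIZE_LOG - PAGE_SHIFT`: the same 4 MB area costs a different allocation order depending on the platform page size.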

[PATCH] powerpc: pseries: no need to check return value of debugfs_create functions

2019-10-14 Thread Greg Kroah-Hartman
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: "Naveen N. Rao" 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
Note, I can take this in my driver-core tree as I have some later
debugfs core cleanups that depend on this, if no one objects.

thanks,

greg k-h

 arch/powerpc/platforms/pseries/dtl.c | 38 
 arch/powerpc/platforms/pseries/hvCall_inst.c | 12 ++-
 arch/powerpc/platforms/pseries/lpar.c| 15 +---
 3 files changed, 10 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 2b87480f2837..eab8aa293743 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -19,7 +19,6 @@
 
 struct dtl {
struct dtl_entry*buf;
-   struct dentry   *file;
int cpu;
int buf_entries;
u64 last_idx;
@@ -320,46 +319,28 @@ static const struct file_operations dtl_fops = {
 
 static struct dentry *dtl_dir;
 
-static int dtl_setup_file(struct dtl *dtl)
+static void dtl_setup_file(struct dtl *dtl)
 {
char name[10];
 
sprintf(name, "cpu-%d", dtl->cpu);
 
-   dtl->file = debugfs_create_file(name, 0400, dtl_dir, dtl, &dtl_fops);
-   if (!dtl->file)
-   return -ENOMEM;
-
-   return 0;
+   debugfs_create_file(name, 0400, dtl_dir, dtl, &dtl_fops);
 }
 
 static int dtl_init(void)
 {
-   struct dentry *event_mask_file, *buf_entries_file;
-   int rc, i;
+   int i;
 
if (!firmware_has_feature(FW_FEATURE_SPLPAR))
return -ENODEV;
 
/* set up common debugfs structure */
 
-   rc = -ENOMEM;
dtl_dir = debugfs_create_dir("dtl", powerpc_debugfs_root);
-   if (!dtl_dir) {
-   printk(KERN_WARNING "%s: can't create dtl root dir\n",
-   __func__);
-   goto err;
-   }
 
-   event_mask_file = debugfs_create_x8("dtl_event_mask", 0600,
-   dtl_dir, &dtl_event_mask);
-   buf_entries_file = debugfs_create_u32("dtl_buf_entries", 0400,
-   dtl_dir, &dtl_buf_entries);
-
-   if (!event_mask_file || !buf_entries_file) {
-   printk(KERN_WARNING "%s: can't create dtl files\n", __func__);
-   goto err_remove_dir;
-   }
+   debugfs_create_x8("dtl_event_mask", 0600, dtl_dir, &dtl_event_mask);
+   debugfs_create_u32("dtl_buf_entries", 0400, dtl_dir, &dtl_buf_entries);
 
/* set up the per-cpu log structures */
for_each_possible_cpu(i) {
@@ -367,16 +348,9 @@ static int dtl_init(void)
spin_lock_init(&dtl->lock);
dtl->cpu = i;
 
-   rc = dtl_setup_file(dtl);
-   if (rc)
-   goto err_remove_dir;
+   dtl_setup_file(dtl);
}
 
return 0;
-
-err_remove_dir:
-   debugfs_remove_recursive(dtl_dir);
-err:
-   return rc;
 }
 machine_arch_initcall(pseries, dtl_init);
diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c b/arch/powerpc/platforms/pseries/hvCall_inst.c
index bcc1b67417a8..c40c62ec432e 100644
--- a/arch/powerpc/platforms/pseries/hvCall_inst.c
+++ b/arch/powerpc/platforms/pseries/hvCall_inst.c
@@ -129,7 +129,6 @@ static void probe_hcall_exit(void *ignored, unsigned long opcode, long retval,
 static int __init hcall_inst_init(void)
 {
struct dentry *hcall_root;
-   struct dentry *hcall_file;
char cpu_name_buf[CPU_NAME_BUF_SIZE];
int cpu;
 
@@ -145,17 +144,12 @@ static int __init hcall_inst_init(void)
}
 
hcall_root = debugfs_create_dir(HCALL_ROOT_DIR, NULL);
-   if (!hcall_root)
-   return -ENOMEM;
 
for_each_possible_cpu(cpu) {
snprintf(cpu_name_buf, CPU_NAME_BUF_SIZE, "cpu%d", cpu);
-   hcall_file = debugfs_create_file(cpu_name_buf, 0444,
-hcall_root,
-per_cpu(hcall_stats, cpu),
-&hcall_inst_seq_fops);
-   if (!hcall_file)
-   return -ENOMEM;
+   debugfs_create_file(cpu_name_buf, 0444, hcall_root,
+   per_cpu(hcall_stats, cpu),
+   &hcall_inst_seq_fops);
}
 
return 0;
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index b53359258d99..284b2eb4a669 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1995,24 +1995,11 @@ static int __init 

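For context on why dropping these checks is safe: debugfs creation functions accept an error-valued parent dentry and simply hand it back instead of dereferencing it, so a failed directory creation poisons the subtree harmlessly and callers never need to branch on the result. A toy userspace model of that propagation behavior (invented types and names, not the real debugfs API):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for dentry and an ERR_PTR-style error value. */
struct dentry { const char *name; struct dentry *parent; };

#define ERR_DIR ((struct dentry *)-1) /* stand-in for an ERR_PTR */

static int dentry_is_err(struct dentry *d)
{
	return d == NULL || d == ERR_DIR;
}

static struct dentry *fake_create_file(const char *name,
				       struct dentry *parent)
{
	static struct dentry file;

	/* Mirror debugfs: if the parent is an error value, return it
	 * unchanged rather than dereferencing it -- no crash, and the
	 * caller needed no check. */
	if (dentry_is_err(parent))
		return parent;

	file.name = name;
	file.parent = parent;
	return &file;
}
```

Since every creation call tolerates an error parent, and debugfs is purely a debugging aid, the code logic should never take a different path based on whether a file actually appeared.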
Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Greg KH
On Mon, Oct 14, 2019 at 11:49:12AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 14, 2019 at 11:25:09AM +0200, Greg KH wrote:
> > Good luck, I don't really think that most, if any, of this is needed,
> > but hey, it's nice to clean it up where it can be :)
> 
> Some of the virtual devices we have (that use devm) really ought to set
> the node too, like drivers/base/cpu.c and driver/base/node.c and
> arguably the cooling devices too (they create a device per cpu).
> 
> The patch I had here:
> 
>   
> https://lkml.kernel.org/r/20190925214526.ga4...@worktop.programming.kicks-ass.net
> 
> takes the more radical approach of requiring a node, except when
> explicitly marked not (the fake devices that don't use devm for
> example).

I like that patch :)

> But yes, PCI and other physical busses really should be having a node
> set, no excuses.

Agreed, at least just warning on the bus creation will make it a bit
less "noisy", in contrast to your patch.  But the messages in your patch
show people just how broken their bioses really are.  Which is always
fun...

> Anyway, I don't think non-physical devices actually use
> cpumask_of_node() much, a quick grep didn't show any.

That's good.

thanks,

greg k-h


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Peter Zijlstra
On Mon, Oct 14, 2019 at 11:25:09AM +0200, Greg KH wrote:
> Good luck, I don't really think that most, if any, of this is needed,
> but hey, it's nice to clean it up where it can be :)

Some of the virtual devices we have (that use devm) really ought to set
the node too, like drivers/base/cpu.c and driver/base/node.c and
arguably the cooling devices too (they create a device per cpu).

The patch I had here:

  
https://lkml.kernel.org/r/20190925214526.ga4...@worktop.programming.kicks-ass.net

takes the more radical approach of requiring a node, except when
explicitly marked not (the fake devices that don't use devm for
example).

But yes, PCI and other physical busses really should be having a node
set, no excuses.

Anyway, I don't think non-physical devices actually use
cpumask_of_node() much, a quick grep didn't show any.


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Jan Kara
On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > When using fadump (firmware-assisted dump) mode on powerpc, a mismatch
> > > between the grub xfs driver and the kernel xfs driver has been observed.  Note:
> > > fadump boots up in the following sequence: firmware -> grub reads kernel
> > > and initramfs -> kernel boots.
> > > 
> > > The process to reproduce this mismatch:
> > >   - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > >   - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
> > > restart" to rebuild the initramfs. Detail about the rebuilding looks
> > > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > >   mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname 
> > > -r`.img
> > >   sync
> > >   - "echo c >/proc/sysrq-trigger".
> > > 
> > > The result:
> > > The dump image will not be saved under /var/crashnew/* as expected, but
> > > still saved under /var/crash.
> > > 
> > > The root cause:
> > > As Eric pointed out that on xfs, 'sync' ensures the consistency by writing
> > > back metadata to xlog, but not necessary to fsblock. This raises issue if
> > > grub can not replay the xlog before accessing the xfs files. Since the
> > > above dir entry of initramfs should be saved as inline data with 
> > > xfs_inode,
> > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > 
> > > umount can be used to write metadata fsblock, but the filesystem can not 
> > > be
> > > umounted if still in use.
> > > 
> > > There are two ways to fix this mismatch, either grub or xfs. It may be
> > > easier to do this in xfs side by introducing an interface to flush 
> > > metadata
> > > to fsblock explicitly.
> > > 
> > > With this patch, metadata can be written to fsblock by:
> > >   # update AIL
> > >   sync
> > >   # new introduced interface to flush metadata to fsblock
> > >   mount -o remount,metasync mountpoint
> > 
> > I think this ought to be an ioctl or some sort of generic call since the
> > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > is too dumb to recover logs but still wants to write to the fs"
> > checkpointing problem.
> Yes, a syscall sounds more reasonable.
> > 
> > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > don't know...)
> I think it is unavoidable to involve in each fs' implementation. What
> about introducing an interface sync_to_fsblock(struct super_block *sb) in
> the struct super_operations, then let each fs manage its own case?

Well, we already have a way to achieve what you need: fsfreeze.
Traditionally, that is guaranteed to put fs into a "clean" state very much
equivalent to the fs being unmounted and that seems to be what the
bootloader wants so that it can access the filesystem without worrying
about some recovery details. So do you see any problem with replacing
'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-14 Thread David Hildenbrand
On 06.10.19 10:56, David Hildenbrand wrote:
> We currently try to shrink a single zone when removing memory. We use the
> zone of the first page of the memory we are removing. If that memmap was
> never initialized (e.g., memory was never onlined), we will read garbage
> and can trigger kernel BUGs (due to a stale pointer):
> 
> :/# [   23.912993] BUG: unable to handle page fault for address: 
> 353d
> [   23.914219] #PF: supervisor write access in kernel mode
> [   23.915199] #PF: error_code(0x0002) - not-present page
> [   23.916160] PGD 0 P4D 0
> [   23.916627] Oops: 0002 [#1] SMP PTI
> [   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 
> 5.3.0-rc5-next-20190820+ #317
> [   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
> [   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
> [   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 
> c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
> [   23.926876] RSP: 0018:ad2400043c98 EFLAGS: 00010246
> [   23.927928] RAX:  RBX: 0002 RCX: 
> 
> [   23.929458] RDX: 0020 RSI: 0014 RDI: 
> 2f40
> [   23.930899] RBP: 00014000 R08:  R09: 
> 0001
> [   23.932362] R10:  R11:  R12: 
> 0014
> [   23.933603] R13: 0014 R14: 2f40 R15: 
> 9e3e7aff3680
> [   23.934913] FS:  () GS:9e3e7bb0() 
> knlGS:
> [   23.936294] CS:  0010 DS:  ES:  CR0: 80050033
> [   23.937481] CR2: 353d CR3: 5861 CR4: 
> 06e0
> [   23.938687] DR0:  DR1:  DR2: 
> 
> [   23.939889] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   23.941168] Call Trace:
> [   23.941580]  __remove_pages+0x4b/0x640
> [   23.942303]  ? mark_held_locks+0x49/0x70
> [   23.943149]  arch_remove_memory+0x63/0x8d
> [   23.943921]  try_remove_memory+0xdb/0x130
> [   23.944766]  ? walk_memory_blocks+0x7f/0x9e
> [   23.945616]  __remove_memory+0xa/0x11
> [   23.946274]  acpi_memory_device_remove+0x70/0x100
> [   23.947308]  acpi_bus_trim+0x55/0x90
> [   23.947914]  acpi_device_hotplug+0x227/0x3a0
> [   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
> [   23.949433]  process_one_work+0x221/0x550
> [   23.950190]  worker_thread+0x50/0x3b0
> [   23.950993]  kthread+0x105/0x140
> [   23.951644]  ? process_one_work+0x550/0x550
> [   23.952508]  ? kthread_park+0x80/0x80
> [   23.953367]  ret_from_fork+0x3a/0x50
> [   23.954025] Modules linked in:
> [   23.954613] CR2: 353d
> [   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---
> 
> Instead, shrink the zones when offlining memory or when onlining failed.
> Introduce and use remove_pfn_range_from_zone(() for that. We now properly
> shrink the zones, even if we have DIMMs whereby
> - Some memory blocks fall into no zone (never onlined)
> - Some memory blocks fall into multiple zones (offlined+re-onlined)
> - Multiple memory blocks that fall into different zones
> 
> Drop the zone parameter (with a potential dubious value) from
> __remove_pages() and __remove_section().
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Tony Luck 
> Cc: Fenghua Yu 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Christian Borntraeger 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Andrew Morton 
> Cc: Mark Rutland 
> Cc: Steve Capper 
> Cc: Mike Rapoport 
> Cc: Anshuman Khandual 
> Cc: Yu Zhao 
> Cc: Jun Yao 
> Cc: Robin Murphy 
> Cc: Michal Hocko 
> Cc: Oscar Salvador 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: Pavel Tatashin 
> Cc: Gerald Schaefer 
> Cc: Halil Pasic 
> Cc: Tom Lendacky 
> Cc: Greg Kroah-Hartman 
> Cc: Masahiro Yamada 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Cc: Qian Cai 
> Cc: Jason Gunthorpe 
> Cc: Logan Gunthorpe 
> Cc: Ira Weiny 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-i...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

@Andrew, can you convert that to

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
zones until online") # visible after d0dc12e86b319

While adding cc'ing sta...@vger.kernel.org # v4.13+ would be nice,
I doubt it will be easily possible to backport, as we are missing
some prereq patches (e.g., from Oscar like 2c2a5af6fed2 ("mm,
memory_hotplug: add nid parameter to 

Re: [PATCH v4 3/3] powerpc/prom_init: Use -ffreestanding to avoid a reference to bcmp

2019-10-14 Thread Segher Boessenkool
On Sun, Oct 13, 2019 at 07:51:01PM -0700, Nathan Chancellor wrote:
> r374662 gives LLVM the ability to convert certain loops into a reference
> to bcmp as an optimization; this breaks prom_init_check.sh:

When/why does LLVM think this is okay?  This function has been removed
from POSIX over a decade ago (and before that it always was marked as
legacy).


Segher


Re: [PATCH v6 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()

2019-10-14 Thread David Hildenbrand

On 06.10.19 10:56, David Hildenbrand wrote:

Let's limit shrinking to !ZONE_DEVICE so we can fix the current code. We
should never try to touch the memmap of offline sections where we could
have uninitialized memmaps and could trigger BUGs when calling
page_to_nid() on poisoned pages.

There is no reliable way to distinguish an uninitialized memmap from an
initialized memmap that belongs to ZONE_DEVICE, as we don't have
anything like SECTION_IS_ONLINE we can use similar to
pfn_to_online_section() for !ZONE_DEVICE memory. E.g.,
set_zone_contiguous() similarly relies on pfn_to_online_section() and
will therefore never mark a ZONE_DEVICE zone contiguous. No longer
shrinking ZONE_DEVICE zones therefore results in no observable changes,
besides /proc/zoneinfo indicating different boundaries - something we
can totally live with.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized with 0 and the node with the
right value. So the zone might be wrong but not garbage. After that
commit, both the zone and the node will be garbage when touching
uninitialized memmaps.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")


@Andrew, can you convert that to

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded 
memory to zones until online") # visible after d0dc12e86b319


and add

Cc: sta...@vger.kernel.org # v4.13+


--

Thanks,

David / dhildenb


Re: [PATCH v6 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span()

2019-10-14 Thread David Hildenbrand
On 06.10.19 10:56, David Hildenbrand wrote:
> We might use the nid of memmaps that were never initialized. For
> example, if the memmap was poisoned, we will crash the kernel in
> pfn_to_nid() right now. Let's use the calculated boundaries of the separate
> zones instead. This now also avoids having to iterate over a whole bunch of
> subsections again, after shrinking one zone.
> 
> Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
> hotplug"), the memmap was initialized to 0 and the node was set to the
> right value. After that commit, the node might be garbage.
> 
> We'll have to fix shrink_zone_span() next.
> 
> Cc: Andrew Morton 
> Cc: Oscar Salvador 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Pavel Tatashin 
> Cc: Dan Williams 
> Cc: Wei Yang 
> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")

@Andrew, can you convert that to

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to 
zones until online") # visible after d0dc12e86b319

and add

Cc: sta...@vger.kernel.org # v4.13+

-- 

Thanks,

David / dhildenb


Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Greg KH
On Mon, Oct 14, 2019 at 04:00:46PM +0800, Yunsheng Lin wrote:
> On 2019/10/12 18:47, Greg KH wrote:
> > On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
> >> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
> >>> On 2019/10/12 15:40, Greg KH wrote:
>  On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> > add pci and acpi maintainer
> > cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
> >
> > On 2019/10/11 19:15, Peter Zijlstra wrote:
> >> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
> >>> But I failed to see why the above is related to making 
> >>> node_to_cpumask_map()
> >>> NUMA_NO_NODE aware?
> >>
> >> Your initial bug is for hns3, which is a PCI device, which really 
> >> _MUST_
> >> have a node assigned.
> >>
> >> It not having one, is a straight up bug. We must not silently accept
> >> NO_NODE there, ever.
> >>
> >
> > I suppose you mean reporting a lack of affinity when the node of a pcie
> > device is not set by "not silently accept NO_NODE".
> 
>  If the firmware of a pci device does not provide the node information,
>  then yes, warn about that.
> 
> > As Greg has asked about in [1]:
> > what is a user to do when the user sees the kernel reporting that?
> >
> > We may tell user to contact their vendor for info or updates about
> > that when they do not know about their system well enough, but their
> > vendor may get away with this by quoting ACPI spec as the spec
> > considering this optional. Should the user believe this is indeed a
> > fw bug or a misreport from the kernel?
> 
>  Say it is a firmware bug, if it is a firmware bug, that's simple.
> 
> > If this kind of reporting is common pratice and will not cause any
> > misunderstanding, then maybe we can report that.
> 
>  Yes, please do so, that's the only way those boxes are ever going to get
>  fixed.  And go add the test to the "firmware testing" tool that is based
>  on Linux that Intel has somewhere, to give vendors a chance to fix this
>  before they ship hardware.
> 
>  This shouldn't be a big deal, we warn of other hardware bugs all the
>  time.
> >>>
> >>> Ok, thanks for clarifying.
> >>>
> >>> Will send a patch to catch the case when a pcie device without numa node
> >>> being set and warn about it.
> >>>
> >>> Maybe use dev->bus to verify if it is a pci device?
> >>
> >> No, do that in the pci bus core code itself, when creating the devices
> >> as that is when you know, or do not know, the numa node, right?
> >>
> >> This can't be in the driver core only, as each bus type will have a
> >> different way of determining what the node the device is on.  For some
> >> reason, I thought the PCI core code already does this, right?
> > 
> > Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> > thing...
> > 
> > Anyway, it looks like the pci core code does call set_dev_node() based
> > on the PCI bridge, so if that is set up properly, all should be fine.
> > 
> > If not, well, you have buggy firmware and you need to warn about that at
> > the time you are creating the bridge.  Look at the call to
> > pcibus_to_node() in pci_register_host_bridge().
> 
> Thanks for pointing out the specific function.
> Maybe we do not need to warn about the case when the device has a parent,
> because we must have warned about the parent if the device has a parent
> and the parent also has a node of NO_NODE, so do not need to warn the child
> device anymore? Like below:
> 
> @@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct 
> pci_host_bridge *bridge)
> list_add_tail(>node, _root_buses);
> up_write(_bus_sem);
> 
> +   if (nr_node_ids > 1 && !parent &&

Why do you need to check this?  If you have a parent, it's your node
should be set, if not, that's an error, right?

> +   dev_to_node(bus->bridge) == NUMA_NO_NODE)
> +   dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable 
> HW. Please contact your vendor for updates.\n");
> +
> return 0;

Who set that bus->bridge node to NUMA_NO_NODE?
If that is set, the firmware is broken, as you say, but you need to tell
the user what firmware is broken.

Try something like this out and see what happens on your machine that
had things "broken".  What does it say?

> Also, we do not need to warn about that in pci_device_add(), Right?
> Because we must have warned about the pci host bridge of the pci device.

That should be true, yes.

> I may be wrong about above because I am not so familiar with the pci.
> 
> > 
> > And yes, you need to do this all on a per-bus-type basis, as has been
> > pointed out.  It's up to the bus to create the device and set this up
> > properly.
> 
> Thanks.
> Will do that on per-bus-type basis.

Good luck, I don't really think that most, if any, of this is needed,

Re: [PATCH v6 01/10] mm/memunmap: Don't access uninitialized memmap in memunmap_pages()

2019-10-14 Thread David Hildenbrand

On 06.10.19 10:56, David Hildenbrand wrote:

From: "Aneesh Kumar K.V" 

With an altmap, the memmap falling into the reserved altmap space are
not initialized and, therefore, contain a garbage NID and a garbage
zone. Make sure to read the NID/zone from a memmap that was initialized.

This fixes a kernel crash that is observed when destroying a namespace:

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
 pc: c04b9728: memunmap_pages+0x238/0x340
 lr: c04b9724: memunmap_pages+0x234/0x340
...
 pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c00274087ba0] c09e3500 devm_action_release+0x30/0x50
[c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
[c00274087c30] c09dd144 device_release_driver_internal+0x174/0x240
[c00274087c70] c09d9dfc unbind_store+0x13c/0x190
[c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
[c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
[c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
[c00274087d60] c04be45c __vfs_write+0x3c/0x70
[c00274087d80] c04c26e4 vfs_write+0xe4/0x200
[c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
[c00274087e20] c000bbd0 system_call+0x5c/0x68

Cc: Dan Williams 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Signed-off-by: Aneesh Kumar K.V 
[ minimize code changes, rephrase description ]
Signed-off-by: David Hildenbrand 
---
  mm/memremap.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..8c2fb44c3b4d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,6 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
  void memunmap_pages(struct dev_pagemap *pgmap)
  {
struct resource *res = &pgmap->res;
+   struct page *first_page;
unsigned long pfn;
int nid;
  
@@ -131,14 +132,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)

put_page(pfn_to_page(pfn));
dev_pagemap_cleanup(pgmap);
  
+	/* make sure to access a memmap that was actually initialized */

+   first_page = pfn_to_page(pfn_first(pgmap));
+
/* pages are dead and unused, undo the arch mapping */
-   nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+   nid = page_to_nid(first_page);
  
  	mem_hotplug_begin();

if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-   pfn = PHYS_PFN(res->start);
-   __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-PHYS_PFN(resource_size(res)), NULL);
+   __remove_pages(page_zone(first_page), PHYS_PFN(res->start),
+  PHYS_PFN(resource_size(res)), NULL);
} else {
arch_remove_memory(nid, res->start, resource_size(res),
pgmap_altmap(pgmap));



@Andrew, can you add

Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")


(which basically introduced the nid = page_to_nid(first_page))

The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm, 
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I 
think we will never have driver reserved memory with 
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).


Also, I think

Cc: sta...@vger.kernel.org # v5.0+

makes sense.

--

Thanks,

David / dhildenb


Re: [PATCH] crypto: powerpc - convert SPE AES algorithms to skcipher API

2019-10-14 Thread Ard Biesheuvel
Hi Eric,

On Sat, 12 Oct 2019 at 04:32, Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> Convert the glue code for the PowerPC SPE implementations of AES-ECB,
> AES-CBC, AES-CTR, and AES-XTS from the deprecated "blkcipher" API to the
> "skcipher" API.
>
> Tested with:
>
> export ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu-
> make mpc85xx_defconfig
> cat >> .config << EOF
> # CONFIG_MODULES is not set
> # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
> CONFIG_DEBUG_KERNEL=y
> CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
> CONFIG_CRYPTO_AES=y
> CONFIG_CRYPTO_CBC=y
> CONFIG_CRYPTO_CTR=y
> CONFIG_CRYPTO_ECB=y
> CONFIG_CRYPTO_XTS=y
> CONFIG_CRYPTO_AES_PPC_SPE=y
> EOF
> make olddefconfig
> make -j32
> qemu-system-ppc -M mpc8544ds -cpu e500 -nographic \
> -kernel arch/powerpc/boot/zImage \
> -append cryptomgr.fuzz_iterations=1000
>
> Note that xts-ppc-spe still fails the comparison tests due to the lack
> of ciphertext stealing support.  This is not addressed by this patch.
>
> Signed-off-by: Eric Biggers 
> ---
>  arch/powerpc/crypto/aes-spe-glue.c | 416 +
>  crypto/Kconfig |   1 +
>  2 files changed, 186 insertions(+), 231 deletions(-)
>
> diff --git a/arch/powerpc/crypto/aes-spe-glue.c b/arch/powerpc/crypto/aes-spe-glue.c
> index 3a4ca7d32477..374e3e51e998 100644
> --- a/arch/powerpc/crypto/aes-spe-glue.c
> +++ b/arch/powerpc/crypto/aes-spe-glue.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  /*
> @@ -86,17 +87,13 @@ static void spe_end(void)
> preempt_enable();
>  }
>
> -static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> -   unsigned int key_len)
> +static int expand_key(struct ppc_aes_ctx *ctx,
> + const u8 *in_key, unsigned int key_len)
>  {
> -   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> -
> if (key_len != AES_KEYSIZE_128 &&
> key_len != AES_KEYSIZE_192 &&
> -   key_len != AES_KEYSIZE_256) {
> -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> +   key_len != AES_KEYSIZE_256)
> return -EINVAL;
> -   }
>
> switch (key_len) {
> case AES_KEYSIZE_128:
> @@ -114,17 +111,40 @@ static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> }
>
> ppc_generate_decrypt_key(ctx->key_dec, ctx->key_enc, key_len);
> +   return 0;
> +}
>
> +static int ppc_aes_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> +   unsigned int key_len)
> +{
> +   struct ppc_aes_ctx *ctx = crypto_tfm_ctx(tfm);
> +
> +   if (expand_key(ctx, in_key, key_len) != 0) {
> +   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> +   return -EINVAL;
> +   }
> +   return 0;
> +}
> +
> +static int ppc_aes_setkey_skcipher(struct crypto_skcipher *tfm,
> +  const u8 *in_key, unsigned int key_len)
> +{
> +   struct ppc_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
> +
> +   if (expand_key(ctx, in_key, key_len) != 0) {
> +   crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
> +   return -EINVAL;
> +   }
> return 0;
>  }
>
> -static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> +static int ppc_xts_setkey(struct crypto_skcipher *tfm, const u8 *in_key,
>unsigned int key_len)
>  {
> -   struct ppc_xts_ctx *ctx = crypto_tfm_ctx(tfm);
> +   struct ppc_xts_ctx *ctx = crypto_skcipher_ctx(tfm);
> int err;
>
> -   err = xts_check_key(tfm, in_key, key_len);
> +   err = xts_verify_key(tfm, in_key, key_len);
> if (err)
> return err;
>
> @@ -133,7 +153,7 @@ static int ppc_xts_setkey(struct crypto_tfm *tfm, const u8 *in_key,
> if (key_len != AES_KEYSIZE_128 &&
> key_len != AES_KEYSIZE_192 &&
> key_len != AES_KEYSIZE_256) {
> -   tfm->crt_flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
> +   crypto_skcipher_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
> return -EINVAL;
> }
>
> @@ -178,208 +198,154 @@ static void ppc_aes_decrypt(struct crypto_tfm *tfm, u8 *out, const u8 *in)
> spe_end();
>  }
>
> -static int ppc_ecb_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
> -  struct scatterlist *src, unsigned int nbytes)
> +static int ppc_ecb_crypt(struct skcipher_request *req, bool enc)
>  {
> -   struct ppc_aes_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
> -   struct blkcipher_walk walk;
> -   unsigned int ubytes;
> +   struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +   struct ppc_aes_ctx *ctx = crypto_skcipher_ctx(tfm);
> +   struct skcipher_walk walk;
> +   unsigned int 

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Christoph Hellwig
On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> When using fadump (firmware assist dump) mode on powerpc, a mismatch
> between grub xfs driver and kernel xfs driver has been observed.  Note:
> fadump boots up in the following sequence: firmware -> grub reads kernel
> and initramfs -> kernel boots.

This isn't something new.  To fundamentally fix this you need to
implement (in-memory) log recovery in grub.  That is the only really safe
long-term solution.  But you can already get the equivalent of your patch
by freezing and unfreezing the file system using the FIFREEZE and
FITHAW ioctls.  And if my memory is serving me correctly Dave has been
preaching that to the bootloader folks for a long time, but apparently
without visible results.


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Pingfan Liu
On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > When using fadump (firmware assist dump) mode on powerpc, a mismatch
> > between grub xfs driver and kernel xfs driver has been observed.  Note:
> > fadump boots up in the following sequence: firmware -> grub reads kernel
> > and initramfs -> kernel boots.
> > 
> > The process to reproduce this mismatch:
> >   - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> >   - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
> > restart" to rebuild the initramfs. Detail about the rebuilding looks
> > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> >   mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname -r`.img
> >   sync
> >   - "echo c >/proc/sysrq-trigger".
> > 
> > The result:
> > The dump image will not be saved under /var/crashnew/* as expected, but
> > still saved under /var/crash.
> > 
> > The root cause:
> > As Eric pointed out, on xfs 'sync' ensures consistency by writing back
> > metadata to the xlog, but not necessarily to fsblock. This raises an
> > issue if grub cannot replay the xlog before accessing the xfs files.
> > Since the above dir entry of the initramfs is saved as inline data in
> > the xfs_inode, xfs_fs_sync_fs() does not guarantee it is written to fsblock.
> > 
> > umount can be used to write metadata fsblock, but the filesystem can not be
> > umounted if still in use.
> > 
> > There are two ways to fix this mismatch, either grub or xfs. It may be
> > easier to do this in xfs side by introducing an interface to flush metadata
> > to fsblock explicitly.
> > 
> > With this patch, metadata can be written to fsblock by:
> >   # update AIL
> >   sync
> >   # new introduced interface to flush metadata to fsblock
> >   mount -o remount,metasync mountpoint
> 
> I think this ought to be an ioctl or some sort of generic call since the
> jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> is too dumb to recover logs but still wants to write to the fs"
> checkpointing problem.
Yes, a syscall sounds more reasonable.
> 
> (Or maybe we should just put all that stuff in a vfat filesystem, I
> don't know...)
I think it is unavoidable to involve each fs' own implementation. What
about introducing an interface sync_to_fsblock(struct super_block *sb) in
struct super_operations, then letting each fs manage its own case?
> 
> --D
> 
> > Signed-off-by: Pingfan Liu 
> > Cc: "Darrick J. Wong" 
> > Cc: Dave Chinner 
> > Cc: Eric Sandeen 
> > Cc: Hari Bathini 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > To: linux-...@vger.kernel.org
> > ---
> >  fs/xfs/xfs_mount.h  |  1 +
> >  fs/xfs/xfs_super.c  | 15 ++-
> >  fs/xfs/xfs_trans.h  |  2 ++
> >  fs/xfs/xfs_trans_ail.c  | 26 +-
> >  fs/xfs/xfs_trans_priv.h |  1 +
> >  5 files changed, 43 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index fdb60e0..85f32e6 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -243,6 +243,7 @@ typedef struct xfs_mount {
> >  #define XFS_MOUNT_FILESTREAMS  (1ULL << 24)/* enable the filestreams allocator */
> >  #define XFS_MOUNT_NOATTR2  (1ULL << 25)/* disable use of attr2 format */
> > +#define XFS_MOUNT_METASYNC (1ull << 26)/* write meta to fsblock */
> >  
> >  #define XFS_MOUNT_DAX  (1ULL << 62)/* TEST ONLY! */
> >  
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 8d1df9f..41df810 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -59,7 +59,7 @@ enum {
> > Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
> > Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
> > Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
> > -   Opt_discard, Opt_nodiscard, Opt_dax, Opt_err,
> > +   Opt_discard, Opt_nodiscard, Opt_dax, Opt_metasync, Opt_err
> >  };
> >  
> >  static const match_table_t tokens = {
> > @@ -106,6 +106,7 @@ static const match_table_t tokens = {
> > {Opt_discard,   "discard"}, /* Discard unused blocks */
> > {Opt_nodiscard, "nodiscard"},   /* Do not discard unused blocks */
> > {Opt_dax,   "dax"}, /* Enable direct access to bdev pages */
> > +   {Opt_metasync,  "metasync"},/* one shot to write meta to fsblock */
> > {Opt_err,   NULL},
> >  };
> >  
> > @@ -338,6 +339,9 @@ xfs_parseargs(
> > mp->m_flags |= XFS_MOUNT_DAX;
> > break;
> >  #endif
> > +   case Opt_metasync:
> > +   mp->m_flags |= XFS_MOUNT_METASYNC;
> > +   break;
> > default:
> > xfs_warn(mp, "unknown mount option [%s].", p);
> > return -EINVAL;
> > @@ -1259,6 +1263,9 @@ 

Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-14 Thread Yunsheng Lin
On 2019/10/12 18:47, Greg KH wrote:
> On Sat, Oct 12, 2019 at 12:40:01PM +0200, Greg KH wrote:
>> On Sat, Oct 12, 2019 at 05:47:56PM +0800, Yunsheng Lin wrote:
>>> On 2019/10/12 15:40, Greg KH wrote:
 On Sat, Oct 12, 2019 at 02:17:26PM +0800, Yunsheng Lin wrote:
> add pci and acpi maintainer
> cc linux-...@vger.kernel.org and linux-a...@vger.kernel.org
>
> On 2019/10/11 19:15, Peter Zijlstra wrote:
>> On Fri, Oct 11, 2019 at 11:27:54AM +0800, Yunsheng Lin wrote:
>>> But I failed to see why the above is related to making 
>>> node_to_cpumask_map()
>>> NUMA_NO_NODE aware?
>>
>> Your initial bug is for hns3, which is a PCI device, which really _MUST_
>> have a node assigned.
>>
>> It not having one, is a straight up bug. We must not silently accept
>> NO_NODE there, ever.
>>
>
> I suppose you mean reporting a lack of affinity when the node of a pcie
> device is not set by "not silently accept NO_NODE".

 If the firmware of a pci device does not provide the node information,
 then yes, warn about that.

> As Greg has asked about in [1]:
> what is a user to do when the user sees the kernel reporting that?
>
> We may tell user to contact their vendor for info or updates about
> that when they do not know about their system well enough, but their
> vendor may get away with this by quoting ACPI spec as the spec
> considering this optional. Should the user believe this is indeed a
> fw bug or a misreport from the kernel?

 Say it is a firmware bug, if it is a firmware bug, that's simple.

> If this kind of reporting is common practice and will not cause any
> misunderstanding, then maybe we can report that.

 Yes, please do so, that's the only way those boxes are ever going to get
 fixed.  And go add the test to the "firmware testing" tool that is based
 on Linux that Intel has somewhere, to give vendors a chance to fix this
 before they ship hardware.

 This shouldn't be a big deal, we warn of other hardware bugs all the
 time.
>>>
>>> Ok, thanks for clarifying.
>>>
>>> Will send a patch to catch the case when a pcie device without numa node
>>> being set and warn about it.
>>>
>>> Maybe use dev->bus to verify if it is a pci device?
>>
>> No, do that in the pci bus core code itself, when creating the devices
>> as that is when you know, or do not know, the numa node, right?
>>
>> This can't be in the driver core only, as each bus type will have a
>> different way of determining what the node the device is on.  For some
>> reason, I thought the PCI core code already does this, right?
> 
> Yes, pci_irq_get_node(), which NO ONE CALLS!  I should go delete that
> thing...
> 
> Anyway, it looks like the pci core code does call set_dev_node() based
> on the PCI bridge, so if that is set up properly, all should be fine.
> 
> If not, well, you have buggy firmware and you need to warn about that at
> the time you are creating the bridge.  Look at the call to
> pcibus_to_node() in pci_register_host_bridge().

Thanks for pointing out the specific function.
Maybe we do not need to warn when the device has a parent, because if
the parent also has a node of NO_NODE we must already have warned about
the parent, so there is no need to warn about the child device as well?
Like below:

@@ -932,6 +932,10 @@ static int pci_register_host_bridge(struct pci_host_bridge *bridge)
list_add_tail(&bus->node, &pci_root_buses);
up_write(&pci_bus_sem);

+   if (nr_node_ids > 1 && !parent &&
+   dev_to_node(bus->bridge) == NUMA_NO_NODE)
+   dev_err(bus->bridge, FW_BUG "No node assigned on NUMA capable HW. Please contact your vendor for updates.\n");
+
return 0;


Also, we do not need to warn about that in pci_device_add(), right?
Because we must have warned about the pci host bridge of the pci device.

I may be wrong about above because I am not so familiar with the pci.

> 
> And yes, you need to do this all on a per-bus-type basis, as has been
> pointed out.  It's up to the bus to create the device and set this up
> properly.

Thanks.
Will do that on per-bus-type basis.

> 
> thanks,
> 
> greg k-h
> 
> .
> 



Re: [PATCH v2 25/29] xtensa: Move EXCEPTION_TABLE to RO_DATA segment

2019-10-14 Thread Max Filippov
On Thu, Oct 10, 2019 at 5:16 PM Kees Cook  wrote:
>
> Since the EXCEPTION_TABLE is read-only, collapse it into RO_DATA.
>
> Signed-off-by: Kees Cook 
> ---
>  arch/xtensa/kernel/vmlinux.lds.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Acked-by: Max Filippov 

-- 
Thanks.
-- Max