Re: [Resend PATCH V5 7/10] KVM: Make kvm_set_spte_hva() return int

2018-12-12 Thread Tianyu Lan
Hi Paul:
 Thanks for your review.
On Wed, Dec 12, 2018 at 1:03 PM Paul Mackerras  wrote:
>
> On Thu, Dec 06, 2018 at 09:21:10PM +0800, lantianyu1...@gmail.com wrote:
> > From: Lan Tianyu 
> >
> > The patch is to make kvm_set_spte_hva() return int and caller can
> > check return value to determine flush tlb or not.
>
> It would be helpful if the patch description told the reader which
> return value(s) mean that the caller should flush the tlb.  I would
> guess that non-zero means to do the flush, but you should make that
> explicit.

OK. Thanks for the suggestion; I will update it in the next version.

>
> > Signed-off-by: Lan Tianyu 
>
> For the powerpc bits:
>
> Acked-by: Paul Mackerras 



-- 
Best regards
Tianyu Lan


[PATCH v2] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK

2018-12-12 Thread Christophe Leroy
[2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 
dma_nommu_map_page+0x44/0xd4
[2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: GW 
4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
[2.384740] NIP:  c000c540 LR: c000c584 CTR: 
[2.389743] REGS: c95abab0 TRAP: 0700   Tainted: GW  
(4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[2.400042] MSR:  00029032   CR: 24042204  XER: 
[2.406669]
[2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 
0001 0001
[2.406669] GPR08:  2000 0010 0010 24042202  
0100 c95abd88
[2.406669] GPR16:  c05569d4 0001 0010 c95abc88 c0615664 
0004 
[2.406669] GPR24: 0010 c95abc88 c95abc88  c61ae210 c7ff6d40 
c61ae210 3d68
[2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
[2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
[2.451762] Call Trace:
[2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
[2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
[2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
[2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
[2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
[2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
[2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
[2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
[2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
[2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
[2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[2.515532] Instruction dump:
[2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 
7c84e850
[2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 
7c84fa14
[2.533960] ---[ end trace bf78d94af73fe3b8 ]---
[2.539123] talitos ff02.crypto: master data transfer error
[2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
[2.551625] alg: skcipher: encryption failed on test 1 for ecb-aes-talitos: 
ret=22

The IV cannot be on the stack when CONFIG_VMAP_STACK is selected because the
stack can no longer be DMA mapped.

This patch copies the IV from areq->info into the request context.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 v2: Using per-request context.

 drivers/crypto/talitos.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 6988012deca4..df289758755c 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1664,12 +1664,16 @@ static int common_nonsnoop(struct talitos_edesc *edesc,
 static struct talitos_edesc *ablkcipher_edesc_alloc(struct ablkcipher_request *
areq, bool encrypt)
 {
+   void *req_ctx = ablkcipher_request_ctx(areq);
struct crypto_ablkcipher *cipher = crypto_ablkcipher_reqtfm(areq);
struct talitos_ctx *ctx = crypto_ablkcipher_ctx(cipher);
unsigned int ivsize = crypto_ablkcipher_ivsize(cipher);
 
+   if (ivsize)
+   memcpy(req_ctx, areq->info, ivsize);
+
return talitos_edesc_alloc(ctx->dev, areq->src, areq->dst,
-  areq->info, 0, areq->nbytes, 0, ivsize, 0,
+  req_ctx, 0, areq->nbytes, 0, ivsize, 0,
   areq->base.flags, encrypt);
 }
 
@@ -3066,6 +3070,13 @@ static int talitos_cra_init_ahash(struct crypto_tfm *tfm)
return 0;
 }
 
+static int talitos_cra_init_ablkcipher(struct crypto_tfm *tfm)
+{
+   tfm->crt_ablkcipher.reqsize = TALITOS_MAX_IV_LENGTH;
+
+   return talitos_cra_init(tfm);
+}
+
 static void talitos_cra_exit(struct crypto_tfm *tfm)
 {
struct talitos_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -3149,7 +3160,7 @@ static struct talitos_crypto_alg 
*talitos_alg_alloc(struct device *dev,
switch (t_alg->algt.type) {
case CRYPTO_ALG_TYPE_ABLKCIPHER:
alg = &t_alg->algt.alg.crypto;
-   alg->cra_init = talitos_cra_init;
+   alg->cra_init = talitos_cra_init_ablkcipher;
alg->cra_exit = talitos_cra_exit;
alg->cra_type = &crypto_ablkcipher_type;
alg->cra_ablkcipher.setkey = ablkcipher_setkey;
-- 
2.13.3



Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-12 Thread kbuild test robot
Hi Rafael,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Rafael-David-Tinoco/mm-zsmalloc-c-Fix-zsmalloc-32-bit-PAE-support/20181211-020704
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 8.1.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

>> mm/zsmalloc.c:112:3: error: #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be 
>> defined by arch using zsmalloc";
 #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using 
zsmalloc";
  ^

vim +112 mm/zsmalloc.c

   103  
   104  /*
   105   * MAX_POSSIBLE_PHYSMEM_BITS should be defined by all archs using 
zsmalloc:
   106   * Trying to guess it from MAX_PHYSMEM_BITS, or considering it 
BITS_PER_LONG,
   107   * proved to be wrong by not considering PAE capabilities, or using 
SPARSEMEM
   108   * only headers, leading to bad object encoding due to object index 
overflow.
   109   */
   110  #ifndef MAX_POSSIBLE_PHYSMEM_BITS
   111   #define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
 > 112   #error "MAX_POSSIBLE_PHYSMEM_BITS HAS to be defined by arch using 
 > zsmalloc";
   113  #else
   114   #ifndef CONFIG_64BIT
   115#if (MAX_POSSIBLE_PHYSMEM_BITS >= (BITS_PER_LONG + PAGE_SHIFT - 
OBJ_TAG_BITS))
   116 #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
   117#endif
   118   #endif
   119  #endif
   120  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH kernel v5 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-12 Thread Alexey Kardashevskiy
POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
pluggable PCIe devices but still have PCIe links which are used
for config space and MMIO. In addition, the GPUs have 6 NVLinks each,
connected to other GPUs and to the POWER9 CPU. POWER9 chips
have a special on-die unit called an NPU which is an NVLink2 host bus
adapter with p2p connections to 2 or 3 GPUs, with 3 or 2 NVLinks to each.
These systems also support ATS (address translation services), which is
part of the NVLink2 protocol. Such GPUs also expose their on-board RAM
(16GB or 32GB) to the system via the same NVLink2, so the CPU has
cache-coherent access to the GPU RAM.

This exports GPU RAM to the userspace as a new VFIO device region. This
preregisters the new memory as device memory as it might be used for DMA.
This inserts pfns from the fault handler because the GPU memory is not onlined
until the vendor driver is loaded and has trained the NVLinks; doing this
earlier causes low-level errors which are fenced in the firmware so they
do not hurt the host system, but are still better avoided. For the same
reason this does not map GPU RAM into the host kernel (as would usually
be done for emulated access).

This exports an ATSD (Address Translation Shootdown) register of the NPU which
allows TLB invalidations inside the GPU for an operating system. The register
conveniently occupies a single 64k page. It is also presented to
the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
each of which can be used for TLB invalidation in a GPU linked to this NPU.
This allocates one ATSD register per NVLink bridge, allowing up to
6 registers to be passed. Due to a host firmware bug (just recently fixed),
only 1 ATSD register per NPU was actually advertised to the host system,
so this passes that lone register via the first NVLink bridge device in
the group, which is still enough as QEMU collects them all back and
presents them to the guest via vPHB to mimic the emulated NPU PHB on the host.

In order to provide the userspace with the information about GPU-to-NVLink
connections, this exports an additional capability called "tgt"
(which is an abbreviated host system bus address). The "tgt" property
tells the GPU its own system address and allows the guest driver to
conglomerate the routing information so each GPU knows how to get directly
to the other GPUs.

For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
know the LPID (a logical partition ID, or in other words a KVM guest
hardware ID) and the PID (a memory context ID of a userspace process, not to be
confused with a Linux pid). This assigns a GPU to the LPID in the NPU, which
is why this adds a listener for KVM on an IOMMU group. A PID comes
via NVLink from a GPU, and the NPU uses a PID wildcard to pass it through.

This requires coherent memory and ATSD to be available on the host as
the GPU vendor only supports configurations with both features enabled;
other configurations are known not to work. Because of this, and
because of the way the features are advertised to the host system
(a device tree with very platform-specific properties),
this requires the POWERNV platform to be enabled.

The V100 GPUs do not advertise any of these capabilities via the config
space, and there is more than one device ID, so this relies on
the platform to tell whether these GPUs have special abilities such as
NVLinks.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* do not memremap GPU RAM for emulation, map it only when it is needed
* allocate 1 ATSD register per NVLink bridge, if none left, then expose
the region with a zero size
* separate caps per device type
* addressed AW review comments

v4:
* added nvlink-speed to the NPU bridge capability as this turned out not
to be a constant value
* instead of looking at the exact device ID (which also changes from system
to system), now this (indirectly) looks at the device tree to know
if GPU and NPU support NVLink

v3:
* reworded the commit log about tgt
* added tracepoints (do we want them enabled for entire vfio-pci?)
* added code comments
* added write|mmap flags to the new regions
* auto enabled VFIO_PCI_NVLINK2 config option
* added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
references; these are required by the NVIDIA driver
* keep notifier registered only for short time
---
 drivers/vfio/pci/Makefile   |   1 +
 drivers/vfio/pci/trace.h| 102 ++
 drivers/vfio/pci/vfio_pci_private.h |  14 +
 include/uapi/linux/vfio.h   |  39 +++
 drivers/vfio/pci/vfio_pci.c |  27 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c | 473 
 drivers/vfio/pci/Kconfig|   6 +
 7 files changed, 660 insertions(+), 2 deletions(-)
 create mode 100644 drivers/vfio/pci/trace.h
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 76d8ec0..9662c06 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1

[PATCH kernel v5 19/20] vfio_pci: Allow regions to add own capabilities

2018-12-12 Thread Alexey Kardashevskiy
VFIO regions already support region capabilities with a limited set of
fields. However, the subdriver might have to report additional bits
to the userspace.

This adds an add_capability() hook to vfio_pci_regops.

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Alex Williamson 
---
Changes:
v3:
* removed confusing rationale for the patch, the next patch makes
use of it anyway
---
 drivers/vfio/pci/vfio_pci_private.h | 3 +++
 drivers/vfio/pci/vfio_pci.c | 6 ++
 2 files changed, 9 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 86aab05..93c1738 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -62,6 +62,9 @@ struct vfio_pci_regops {
int (*mmap)(struct vfio_pci_device *vdev,
struct vfio_pci_region *region,
struct vm_area_struct *vma);
+   int (*add_capability)(struct vfio_pci_device *vdev,
+ struct vfio_pci_region *region,
+ struct vfio_info_cap *caps);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 4a6f7c0..6cb70cf 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data,
if (ret)
return ret;
 
+   if (vdev->region[i].ops->add_capability) {
+   ret = vdev->region[i].ops->add_capability(vdev,
+   &vdev->region[i], &caps);
+   if (ret)
+   return ret;
+   }
}
}
 
-- 
2.17.1



[PATCH kernel v5 14/20] powerpc/powernv/npu: Add compound IOMMU groups

2018-12-12 Thread Alexey Kardashevskiy
At the moment the powernv platform registers an IOMMU group for each PE.
There is an exception though: an NVLink bridge is attached to
the corresponding GPU's IOMMU group, making the GPU a master.

Now we have POWER9 systems with GPUs connected to each other directly,
bypassing PCI. At the moment we do not control the state of these links, so
we have to put such interconnected GPUs into one IOMMU group, which
means that the old scheme with one GPU as a master won't work; there will
be up to 3 GPUs in such a group.

This introduces an npu_comp struct which represents a compound IOMMU
group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for NVLink
bridges). This converts the existing NVLink1 code to use the new scheme.
From now on, each PE must have a valid iommu_table_group_ops which will
either be called directly (for a single PE group) or indirectly from
the compound group handlers.

This moves IOMMU group registration for NVLink-connected GPUs to npu-dma.c.
For POWER8, this stores a new compound group pointer in the PE (so a GPU
is still a master); for POWER9 the new group pointer is stored in the NPU
(which is allocated per PCI host controller).

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* now read page sizes from the PHB NVLink to narrow down what the compound
PE can actually support (hint: 4K/64K only)
---
 arch/powerpc/include/asm/pci.h|   1 +
 arch/powerpc/platforms/powernv/pci.h  |   7 +
 arch/powerpc/platforms/powernv/npu-dma.c  | 291 --
 arch/powerpc/platforms/powernv/pci-ioda.c | 163 
 4 files changed, 325 insertions(+), 137 deletions(-)

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index baf2886..0c72f18 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev 
*gpdev, int index);
 extern int pnv_npu2_init(struct pci_controller *hose);
 extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
unsigned long msr);
+extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev);
 
 #endif /* __ASM_POWERPC_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index cf9f748..aef4bb5 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -62,6 +62,7 @@ struct pnv_ioda_pe {
 
/* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */
struct iommu_table_group table_group;
+   struct npu_comp *npucomp;
 
/* 64-bit TCE bypass region */
booltce_bypass_enabled;
@@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev);
 extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev);
 extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq);
 extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
+extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+   __u64 window_size, __u32 levels);
 extern int pnv_eeh_post_init(void);
 
 extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
@@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
+extern struct iommu_table_group *pnv_try_setup_npu_table_group(
+   struct pnv_ioda_pe *pe);
+extern struct iommu_table_group *pnv_npu_compound_attach(
+   struct pnv_ioda_pe *pe);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS   1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index dc629ee..3468eaa 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -328,31 +328,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = {
.unset_window = pnv_npu_unset_window,
.take_ownership = pnv_npu_take_ownership,
 };
-
-struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe)
-{
-   struct pnv_phb *phb = npe->phb;
-   struct pci_bus *pbus = phb->hose->bus;
-   struct pci_dev *npdev, *gpdev = NULL, *gptmp;
-   struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev);
-
-   if (!gpe || !gpdev)
-   return NULL;
-
-   npe->table_group.ops = &pnv_pci_npu_ops;
-
-   list_for_each_entry(npdev, &pbus->devices, bus_list) {
-   gptmp = pnv_pci_get_gpu_dev(npdev);
-
-   if (gptmp != gpdev)
-   continue;
-
-   pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev));
-   iommu_group_add_device(gpe->table_group.group, &npdev->dev);
-   }
-
-   return gpe;
-}
 #endif /* !CONFIG_IOMMU_API 

[PATCH kernel v5 12/20] powerpc/powernv/npu: Move single TVE handling to NPU PE

2018-12-12 Thread Alexey Kardashevskiy
Normal PCI PEs have 2 TVEs, one per DMA window; however, an NPU PE has only
one, which points to one of the two tables of the corresponding PCI PE.

So whenever a new DMA window is programmed into the PEs, the NPU PE needs to
release the old table in order to use the new one.

Commit d41ce7b1bcc3e ("powerpc/powernv/npu: Do not try invalidating 32bit
table when 64bit table is enabled") did just that but in pci-ioda.c
while it actually belongs to npu-dma.c.

This moves the single TVE handling to npu-dma.c. This does not implement
restoring though, as it is highly unlikely that we can set the table on the
PCI PE but not on the NPU PE; if that failed, we could only set the 32bit
table on the NPU PE, and this configuration is not really supported or wanted.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c  |  8 +++
 arch/powerpc/platforms/powernv/pci-ioda.c | 27 +++
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index ef1457f..26063fb 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -130,6 +130,11 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
tbl->it_level_size : tbl->it_size;
const __u64 start_addr = tbl->it_offset << tbl->it_page_shift;
const __u64 win_size = tbl->it_size << tbl->it_page_shift;
+   int num2 = (num == 0) ? 1 : 0;
+
+   /* NPU has just one TVE so if there is another table, remove it first */
+   if (npe->table_group.tables[num2])
+   pnv_npu_unset_window(npe, num2);
 
pe_info(npe, "Setting up window %llx..%llx pg=%lx\n",
start_addr, start_addr + win_size - 1,
@@ -160,6 +165,9 @@ long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
struct pnv_phb *phb = npe->phb;
int64_t rc;
 
+   if (!npe->table_group.tables[num])
+   return 0;
+
pe_info(npe, "Removing DMA window\n");
 
rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index d6b140f..3f998bd 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2669,23 +2669,14 @@ static struct pnv_ioda_pe *gpe_table_group_to_npe(
 static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group,
int num, struct iommu_table *tbl)
 {
-   struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-   int num2 = (num == 0) ? 1 : 0;
long ret = pnv_pci_ioda2_set_window(table_group, num, tbl);
 
if (ret)
return ret;
 
-   if (table_group->tables[num2])
-   pnv_npu_unset_window(npe, num2);
-
-   ret = pnv_npu_set_window(npe, num, tbl);
-   if (ret) {
+   ret = pnv_npu_set_window(gpe_table_group_to_npe(table_group), num, tbl);
+   if (ret)
pnv_pci_ioda2_unset_window(table_group, num);
-   if (table_group->tables[num2])
-   pnv_npu_set_window(npe, num2,
-   table_group->tables[num2]);
-   }
 
return ret;
 }
@@ -2694,24 +2685,12 @@ static long pnv_pci_ioda2_npu_unset_window(
struct iommu_table_group *table_group,
int num)
 {
-   struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group);
-   int num2 = (num == 0) ? 1 : 0;
long ret = pnv_pci_ioda2_unset_window(table_group, num);
 
if (ret)
return ret;
 
-   if (!npe->table_group.tables[num])
-   return 0;
-
-   ret = pnv_npu_unset_window(npe, num);
-   if (ret)
-   return ret;
-
-   if (table_group->tables[num2])
-   ret = pnv_npu_set_window(npe, num2, table_group->tables[num2]);
-
-   return ret;
+   return pnv_npu_unset_window(gpe_table_group_to_npe(table_group), num);
 }
 
 static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
-- 
2.17.1



[PATCH kernel v5 11/20] powerpc/powernv: Reference iommu_table while it is linked to a group

2018-12-12 Thread Alexey Kardashevskiy
The iommu_table pointer stored in iommu_table_group may get stale
by accident; this adds reference counting and removes a redundant comment
about this.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 3 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 4 
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 7639b21..697449a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -368,6 +368,7 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
found = false;
for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
if (table_group->tables[i] == tbl) {
+   iommu_tce_table_put(tbl);
table_group->tables[i] = NULL;
found = true;
break;
@@ -393,7 +394,7 @@ long pnv_pci_link_table_and_group(int node, int num,
tgl->table_group = table_group;
list_add_rcu(&tgl->next, &tbl->it_group_list);
 
-   table_group->tables[num] = tbl;
+   table_group->tables[num] = iommu_tce_table_get(tbl);
 
return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1168b185..d6b140f 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2716,10 +2716,6 @@ static long pnv_pci_ioda2_npu_unset_window(
 
 static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group)
 {
-   /*
-* Detach NPU first as pnv_ioda2_take_ownership() will destroy
-* the iommu_table if 32bit DMA is enabled.
-*/
pnv_npu_take_ownership(gpe_table_group_to_npe(table_group));
pnv_ioda2_take_ownership(table_group);
 }
-- 
2.17.1



[PATCH kernel v5 10/20] powerpc/iommu_api: Move IOMMU groups setup to a single place

2018-12-12 Thread Alexey Kardashevskiy
Registering new IOMMU groups and adding devices to them are separated in
the code, and the latter is buried in the DMA setup code where it does not
really belong.

This moves IOMMU group setup to a separate helper which registers a group
and adds devices as before. This does not make a difference as IOMMU
groups are not used anyway; the only dependency here is that
iommu_add_device() requires a valid pointer to an iommu_table
(set by set_iommu_table_base()).

To keep the old behaviour, this does not add new IOMMU groups for PEs
with no DMA weight and also skips NVLink bridges which do not have
pci_controller_ops::setup_bridge (the normal way of adding PEs).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 80 +++
 1 file changed, 66 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index b86a6e0..1168b185 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1269,6 +1269,8 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus)
pnv_ioda_setup_npu_PE(pdev);
 }
 
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe);
+
 static void pnv_pci_ioda_setup_PEs(void)
 {
struct pci_controller *hose;
@@ -1591,6 +1593,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, 
u16 num_vfs)
mutex_unlock(&phb->ioda.pe_list_mutex);
 
pnv_pci_ioda2_setup_dma_pe(phb, pe);
+   pnv_ioda_setup_bus_iommu_group(pe);
}
 }
 
@@ -1930,21 +1933,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct 
pci_dev *pdev)
return mask;
 }
 
-static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
-  struct pci_bus *bus,
-  bool add_to_group)
+static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
 {
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list) {
set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
set_dma_offset(&dev->dev, pe->tce_bypass_base);
-   if (add_to_group)
-   iommu_add_device(&pe->table_group, &dev->dev);
 
if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
-   pnv_ioda_setup_bus_dma(pe, dev->subordinate,
-   add_to_group);
+   pnv_ioda_setup_bus_dma(pe, dev->subordinate);
}
 }
 
@@ -2374,7 +2372,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
*phb,
iommu_init_table(tbl, phb->hose->node);
 
if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, true);
+   pnv_ioda_setup_bus_dma(pe, pe->pbus);
 
return;
  fail:
@@ -2607,7 +2605,7 @@ static void pnv_ioda2_take_ownership(struct 
iommu_table_group *table_group)
pnv_pci_ioda2_set_bypass(pe, false);
pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (pe->pbus)
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+   pnv_ioda_setup_bus_dma(pe, pe->pbus);
iommu_tce_table_put(tbl);
 }
 
@@ -2618,7 +2616,7 @@ static void pnv_ioda2_release_ownership(struct 
iommu_table_group *table_group)
 
pnv_pci_ioda2_setup_default_config(pe);
if (pe->pbus)
-   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
+   pnv_ioda_setup_bus_dma(pe, pe->pbus);
 }
 
 static struct iommu_table_group_ops pnv_pci_ioda2_ops = {
@@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops 
pnv_pci_ioda2_npu_ops = {
.release_ownership = pnv_ioda2_release_ownership,
 };
 
+static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe,
+   struct pci_bus *bus)
+{
+   struct pci_dev *dev;
+
+   list_for_each_entry(dev, &bus->devices, bus_list) {
+   iommu_add_device(&pe->table_group, &dev->dev);
+
+   if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
+   pnv_ioda_setup_bus_iommu_group_add_devices(pe,
+   dev->subordinate);
+   }
+}
+
+static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
+{
+   if (!pnv_pci_ioda_pe_dma_weight(pe))
+   return;
+
+   iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
+   pe->pe_number);
+
+   /*
+* set_iommu_table_base(&pe->pdev->dev, tbl) should have been called
+* by now
+*/
+   if (pe->flags & PNV_IODA_PE_DEV)
+   iommu_add_device(&pe->table_group, &pe->pdev->dev);
+   else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
+   pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
+}
+
 static void pnv_pci

[PATCH kernel v5 18/20] vfio_pci: Allow mapping extra regions

2018-12-12 Thread Alexey Kardashevskiy
So far we have only allowed mapping of MMIO BARs to the userspace. However
there are GPUs with on-board coherent RAM accessible via side
channels which we also want to map to the userspace. The first client
for this is the NVIDIA V100 GPU with NVLink2 direct links to a POWER9
NPU-enabled CPU; such GPUs have 16GB of RAM which is coherently mapped
into the system address space, and we are going to export it as an extra
PCI region.

We already support extra PCI regions and this adds support for mapping
them to the userspace.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
Acked-by: Alex Williamson 
---
Changes:
v2:
* reverted one of mistakenly removed error checks
---
 drivers/vfio/pci/vfio_pci_private.h | 3 +++
 drivers/vfio/pci/vfio_pci.c | 9 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index cde3b5d..86aab05 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -59,6 +59,9 @@ struct vfio_pci_regops {
  size_t count, loff_t *ppos, bool iswrite);
void(*release)(struct vfio_pci_device *vdev,
   struct vfio_pci_region *region);
+   int (*mmap)(struct vfio_pci_device *vdev,
+   struct vfio_pci_region *region,
+   struct vm_area_struct *vma);
 };
 
 struct vfio_pci_region {
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fef5002..4a6f7c0 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct 
vm_area_struct *vma)
return -EINVAL;
if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;
+   if (index >= VFIO_PCI_NUM_REGIONS) {
+   int regnum = index - VFIO_PCI_NUM_REGIONS;
+   struct vfio_pci_region *region = vdev->region + regnum;
+
+   if (region && region->ops && region->ops->mmap &&
+   (region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+   return region->ops->mmap(vdev, region, vma);
+   return -EINVAL;
+   }
if (index >= VFIO_PCI_ROM_REGION_INDEX)
return -EINVAL;
if (!vdev->bar_mmap_supported[index])
-- 
2.17.1



Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-12 Thread kbuild test robot
Hi Rafael,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.20-rc6 next-20181212]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Rafael-David-Tinoco/mm-zsmalloc-c-Fix-zsmalloc-32-bit-PAE-support/20181211-020704
config: mips-allmodconfig (attached as .config)
compiler: mips-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=mips 

All error/warnings (new ones prefixed by >>):

>> mm/zsmalloc.c:116:5: error: #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for 
>> this arch";
   #error "MAX_POSSIBLE_PHYSMEM_BITS is wrong for this arch";
^
   In file included from include/linux/cache.h:5:0,
from arch/mips/include/asm/cpu-info.h:15,
from arch/mips/include/asm/cpu-features.h:13,
from arch/mips/include/asm/bitops.h:21,
from include/linux/bitops.h:19,
from include/linux/kernel.h:11,
from include/linux/list.h:9,
from include/linux/module.h:9,
from mm/zsmalloc.c:33:
>> mm/zsmalloc.c:133:49: warning: right shift count is negative 
>> [-Wshift-count-negative]
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
^
   include/uapi/linux/kernel.h:13:40: note: in definition of macro 
'__KERNEL_DIV_ROUND_UP'
#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
   ^
>> mm/zsmalloc.c:133:2: note: in expansion of macro 'MAX'
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
 ^~~
>> mm/zsmalloc.c:151:59: note: in expansion of macro 'ZS_MIN_ALLOC_SIZE'
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - 
ZS_MIN_ALLOC_SIZE, \
  ^
>> mm/zsmalloc.c:256:32: note: in expansion of macro 'ZS_SIZE_CLASSES'
 struct size_class *size_class[ZS_SIZE_CLASSES];
   ^~~
>> mm/zsmalloc.c:133:49: warning: right shift count is negative 
>> [-Wshift-count-negative]
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
^
   include/uapi/linux/kernel.h:13:40: note: in definition of macro 
'__KERNEL_DIV_ROUND_UP'
#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
   ^
>> mm/zsmalloc.c:133:2: note: in expansion of macro 'MAX'
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
 ^~~
>> mm/zsmalloc.c:151:59: note: in expansion of macro 'ZS_MIN_ALLOC_SIZE'
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - 
ZS_MIN_ALLOC_SIZE, \
  ^
>> mm/zsmalloc.c:256:32: note: in expansion of macro 'ZS_SIZE_CLASSES'
 struct size_class *size_class[ZS_SIZE_CLASSES];
   ^~~
>> mm/zsmalloc.c:256:21: error: variably modified 'size_class' at file scope
 struct size_class *size_class[ZS_SIZE_CLASSES];
^~
   In file included from include/linux/kernel.h:10:0,
from include/linux/list.h:9,
from include/linux/module.h:9,
from mm/zsmalloc.c:33:
   mm/zsmalloc.c: In function 'get_size_class_index':
>> mm/zsmalloc.c:133:49: warning: right shift count is negative 
>> [-Wshift-count-negative]
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
^
   include/linux/compiler.h:76:40: note: in definition of macro 'likely'
# define likely(x) __builtin_expect(!!(x), 1)
   ^
>> mm/zsmalloc.c:133:2: note: in expansion of macro 'MAX'
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
 ^~~
   mm/zsmalloc.c:543:20: note: in expansion of macro 'ZS_MIN_ALLOC_SIZE'
 if (likely(size > ZS_MIN_ALLOC_SIZE))
   ^
>> mm/zsmalloc.c:133:49: warning: right shift count is negative 
>> [-Wshift-count-negative]
 MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))

[PATCH kernel v5 17/20] powerpc/powernv/npu: Fault user page into the hypervisor's pagetable

2018-12-12 Thread Alexey Kardashevskiy
When a page fault happens in a GPU, the GPU signals the OS and the GPU
driver calls the fault handler, which populates a page table; this allows
the GPU to complete an ATS request.

On bare metal, get_user_pages() is enough as it adds a pte to
the kernel page table, but under KVM the partition scope tree does not get
updated, so ATS will still fail.

This patch reads a byte from the effective address, which causes an HV
storage interrupt so that KVM updates the partition scope tree.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 2c405a4..ed81426 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -1133,6 +1133,8 @@ int pnv_npu2_handle_fault(struct npu_context *context, 
uintptr_t *ea,
u64 rc = 0, result = 0;
int i, is_write;
struct page *page[1];
+   const char __user *u;
+   char c;
 
/* mmap_sem should be held so the struct_mm must be present */
struct mm_struct *mm = context->mm;
@@ -1145,18 +1147,17 @@ int pnv_npu2_handle_fault(struct npu_context *context, 
uintptr_t *ea,
is_write ? FOLL_WRITE : 0,
page, NULL, NULL);
 
-   /*
-* To support virtualised environments we will have to do an
-* access to the page to ensure it gets faulted into the
-* hypervisor. For the moment virtualisation is not supported in
-* other areas so leave the access out.
-*/
if (rc != 1) {
status[i] = rc;
result = -EFAULT;
continue;
}
 
+   /* Make sure partition scoped tree gets a pte */
+   u = page_address(page[0]);
+   if (__get_user(c, u))
+   result = -EFAULT;
+
status[i] = 0;
put_page(page[0]);
}
-- 
2.17.1



[PATCH kernel v5 08/20] powerpc/pseries: Remove IOMMU API support for non-LPAR systems

2018-12-12 Thread Alexey Kardashevskiy
The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are
registered for the pseries platform when it does not have FW_FEATURE_LPAR;
these would be pre-powernv platforms for which we never supported PCI
pass-through anyway, so remove it.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/pseries/iommu.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index cbcc8ce..2783cb7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -645,7 +645,6 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
iommu_table_setparms(pci->phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
iommu_init_table(tbl, pci->phb->node);
-   iommu_register_group(pci->table_group, pci_domain_nr(bus), 0);
 
/* Divide the rest (1.75GB) among the children */
pci->phb->dma_window_size = 0x8000ul;
@@ -756,10 +755,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
iommu_table_setparms(phb, dn, tbl);
tbl->it_ops = &iommu_table_pseries_ops;
iommu_init_table(tbl, phb->node);
-   iommu_register_group(PCI_DN(dn)->table_group,
-   pci_domain_nr(phb->bus), 0);
set_iommu_table_base(&dev->dev, tbl);
-   iommu_add_device(&dev->dev);
return;
}
 
@@ -770,11 +766,10 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
while (dn && PCI_DN(dn) && PCI_DN(dn)->table_group == NULL)
dn = dn->parent;
 
-   if (dn && PCI_DN(dn)) {
+   if (dn && PCI_DN(dn))
set_iommu_table_base(&dev->dev,
PCI_DN(dn)->table_group->tables[0]);
-   iommu_add_device(&dev->dev);
-   } else
+   else
printk(KERN_WARNING "iommu: Device %s has no iommu table\n",
   pci_name(dev));
 }
-- 
2.17.1



[PATCH kernel v5 07/20] powerpc/pseries/npu: Enable platform support

2018-12-12 Thread Alexey Kardashevskiy
We already changed the NPU API for GPUs not to call OPAL and the remaining
bit is initializing the NPU structures.

This searches for POWER9 NVLinks attached to any device on a PHB and
initializes an NPU structure if any are found.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* added WARN_ON_ONCE

v4:
* dropped "IBM,npu-vphb" compatible type on PHB and use the type of NVLink
---
 arch/powerpc/platforms/pseries/pci.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pci.c 
b/arch/powerpc/platforms/pseries/pci.c
index 41d8a4d..7725825 100644
--- a/arch/powerpc/platforms/pseries/pci.c
+++ b/arch/powerpc/platforms/pseries/pci.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "pseries.h"
 
 #if 0
@@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void)
 
 void __init pSeries_final_fixup(void)
 {
+   struct pci_controller *hose;
+
pSeries_request_regions();
 
eeh_probe_devices();
@@ -246,6 +249,25 @@ void __init pSeries_final_fixup(void)
ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
 #endif
+   list_for_each_entry(hose, &hose_list, list_node) {
+   struct device_node *dn = hose->dn, *nvdn;
+
+   while (1) {
+   dn = of_find_all_nodes(dn);
+   if (!dn)
+   break;
+   nvdn = of_parse_phandle(dn, "ibm,nvlink", 0);
+   if (!nvdn)
+   continue;
+   if (!of_device_is_compatible(nvdn, "ibm,npu-link"))
+   continue;
+   if (!of_device_is_compatible(nvdn->parent,
+   "ibm,power9-npu"))
+   continue;
+   WARN_ON_ONCE(pnv_npu2_init(hose));
+   break;
+   }
+   }
 }
 
 /*
-- 
2.17.1



[PATCH kernel v5 16/20] powerpc/powernv/npu: Check mmio_atsd array bounds when populating

2018-12-12 Thread Alexey Kardashevskiy
A broken device tree might contain more than 8 values and introduce a
hard-to-debug memory corruption bug. This adds a bounds check.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 31dfc11..2c405a4 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -1179,8 +1179,9 @@ int pnv_npu2_init(struct pci_controller *hose)
 
npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush");
 
-   for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
-   i, &mmio_atsd); i++)
+   for (i = 0; i < ARRAY_SIZE(npu->mmio_atsd_regs) &&
+   !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd",
+   i, &mmio_atsd); i++)
npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32);
 
pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i);
-- 
2.17.1



[PATCH kernel v5 15/20] powerpc/powernv/npu: Add release_ownership hook

2018-12-12 Thread Alexey Kardashevskiy
In order to make ATS work and translate addresses for an arbitrary
LPID and PID, we need to program an NPU with the LPID and allow PID wildcard
matching with a specific MSR mask.

This implements a helper to assign a GPU to an LPAR and program the NPU
with a wildcard for the PID, and a helper to do the clean-up. The helper
takes an MSR (only the DR/HV/PR/SF bits are allowed) to program into the
NPU2 to support ATS checkout requests.

This exports pnv_npu2_unmap_lpar_dev() as following patches will use it
from the VFIO driver.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* removed opal_purge_cache as it is a part of reset in skiboot now
---
 arch/powerpc/platforms/powernv/npu-dma.c | 51 
 1 file changed, 51 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 3468eaa..31dfc11 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -300,6 +300,7 @@ static void pnv_npu_take_ownership(struct iommu_table_group 
*table_group)
table_group);
struct pnv_phb *phb = npe->phb;
int64_t rc;
+   struct pci_dev *gpdev = NULL;
 
/*
 * Note: NPU has just a single TVE in the hardware which means that
@@ -321,12 +322,28 @@ static void pnv_npu_take_ownership(struct 
iommu_table_group *table_group)
return;
}
pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false);
+
+   get_gpu_pci_dev_and_pe(npe, &gpdev);
+   if (gpdev)
+   pnv_npu2_unmap_lpar_dev(gpdev);
+}
+
+static void pnv_npu_release_ownership(struct iommu_table_group *table_group)
+{
+   struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+   table_group);
+   struct pci_dev *gpdev = NULL;
+
+   get_gpu_pci_dev_and_pe(npe, &gpdev);
+   if (gpdev)
+   pnv_npu2_map_lpar_dev(gpdev, 0, MSR_DR | MSR_PR | MSR_HV);
 }
 
 static struct iommu_table_group_ops pnv_pci_npu_ops = {
.set_window = pnv_npu_set_window,
.unset_window = pnv_npu_unset_window,
.take_ownership = pnv_npu_take_ownership,
+   .release_ownership = pnv_npu_release_ownership,
 };
 #endif /* !CONFIG_IOMMU_API */
 
@@ -1237,3 +1254,37 @@ void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned 
long msr)
list_for_each_entry(gpdev, &gpe->pbus->devices, bus_list)
pnv_npu2_map_lpar_dev(gpdev, 0, msr);
 }
+
+int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev)
+{
+   int ret;
+   struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
+   struct pci_controller *hose;
+   struct pnv_phb *nphb;
+
+   if (!npdev)
+   return -ENODEV;
+
+   hose = pci_bus_to_host(npdev->bus);
+   nphb = hose->private_data;
+
+   dev_dbg(&gpdev->dev, "destroy context opalid=%llu\n",
+   nphb->opal_id);
+   ret = opal_npu_destroy_context(nphb->opal_id, 0/*__unused*/,
+   PCI_DEVID(gpdev->bus->number, gpdev->devfn));
+   if (ret < 0) {
+   dev_err(&gpdev->dev, "Failed to destroy context: %d\n", ret);
+   return ret;
+   }
+
+   /* Set LPID to 0 anyway, just to be safe */
+   dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=0\n", nphb->opal_id);
+   ret = opal_npu_map_lpar(nphb->opal_id,
+   PCI_DEVID(gpdev->bus->number, gpdev->devfn), 0 /*LPID*/,
+   0 /* LPCR bits */);
+   if (ret)
+   dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(pnv_npu2_unmap_lpar_dev);
-- 
2.17.1



[PATCH kernel v5 05/20] powerpc/powernv/npu: Move OPAL calls away from context manipulation

2018-12-12 Thread Alexey Kardashevskiy
When introduced, the NPU context init/destroy helpers called OPAL which
enabled/disabled PID (a userspace memory context ID) filtering in an NPU
per GPU; this was a requirement for P9 DD1.0. However, a newer chip
revision added PID wildcard support so there is no longer a need to
call OPAL every time a new context is initialized. Also, since the PID
wildcard support was added, skiboot does not clear wildcard entries
in the NPU so these remain in the hardware until the system reboots.

This moves LPID and wildcard programming to the PE setup code which
executes once during the booting process so NPU2 context init/destroy
won't need to do additional configuration.

This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as
this is the way to tell if the NPU support is present and configured.

This moves pnv_npu2_init() declaration as pseries should be able to use it.
This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to
call that. This exports pnv_npu2_map_lpar_dev() as following patches
will use it from the VFIO driver.

While at it, replace redundant list_for_each_entry_safe() with
a simpler list_for_each_entry().

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* add few checks for npu!=NULL

v4:
* add flags check in pnv_npu2_init_context()
---
 arch/powerpc/include/asm/pci.h|   3 +
 arch/powerpc/platforms/powernv/pci.h  |   2 +-
 arch/powerpc/platforms/powernv/npu-dma.c  | 111 --
 arch/powerpc/platforms/powernv/pci-ioda.c |  15 ++-
 4 files changed, 77 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 2af9ded..baf2886 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -129,5 +129,8 @@ extern void pcibios_scan_phb(struct pci_controller *hose);
 
 extern struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev);
 extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index);
+extern int pnv_npu2_init(struct pci_controller *hose);
+extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid,
+   unsigned long msr);
 
 #endif /* __ASM_POWERPC_PCI_H */
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index f2d50974..ddb4f02 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -190,6 +190,7 @@ extern void pnv_pci_init_ioda_hub(struct device_node *np);
 extern void pnv_pci_init_ioda2_phb(struct device_node *np);
 extern void pnv_pci_init_npu_phb(struct device_node *np);
 extern void pnv_pci_init_npu2_opencapi_phb(struct device_node *np);
+extern void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr);
 extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev);
 extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option);
 
@@ -220,7 +221,6 @@ extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int 
num,
 extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
 extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
 extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
-extern int pnv_npu2_init(struct pnv_phb *phb);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS   1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 5e66439..ef1457f 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -512,6 +512,9 @@ static void acquire_atsd_reg(struct npu_context 
*npu_context,
continue;
 
npu = pci_bus_to_host(npdev->bus)->npu;
+   if (!npu)
+   continue;
+
mmio_atsd_reg[i].npu = npu;
mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
while (mmio_atsd_reg[i].reg < 0) {
@@ -676,7 +679,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev 
*gpdev,
u32 nvlink_index;
struct device_node *nvlink_dn;
struct mm_struct *mm = current->mm;
-   struct pnv_phb *nphb;
struct npu *npu;
struct npu_context *npu_context;
struct pci_controller *hose;
@@ -687,13 +689,14 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev 
*gpdev,
 */
struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0);
 
-   if (!firmware_has_feature(FW_FEATURE_OPAL))
-   return ERR_PTR(-ENODEV);
-
if (!npdev)
/* No nvlink associated with this GPU device */
return ERR_PTR(-ENODEV);
 
+   /* We only support DR/PR/HV in pnv_npu2_map_lpar_dev() */
+   if (flags & ~(MSR_DR | MSR_PR | MSR_HV))
+   return ERR_PTR(-EINVAL);
+
nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
 

[PATCH kernel v5 13/20] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops

2018-12-12 Thread Alexey Kardashevskiy
At the moment the NPU IOMMU is manipulated directly from the IODA2 PCI
PE code; the PCI PE acts as a master to the NPU PE. Soon we will have
compound IOMMU groups with several PEs from several different PHBs (such
as interconnected GPUs and NPUs) so there will be no single master but
one big IOMMU group.

This makes a first step and converts an NPU PE and its set of extern
functions to a table group.

This should cause no behavioral change. Note that
pnv_npu_release_ownership() has never been implemented.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/platforms/powernv/pci.h  |  5 
 arch/powerpc/platforms/powernv/npu-dma.c  | 34 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +--
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index ddb4f02..cf9f748 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -216,11 +216,6 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, 
const char *level,
 extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass);
 extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm);
 extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe);
-extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
-   struct iommu_table *tbl);
-extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num);
-extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe);
-extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe);
 
 /* pci-ioda-tce.c */
 #define POWERNV_IOMMU_DEFAULT_LEVELS   1
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 26063fb..dc629ee 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -121,9 +121,14 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct 
pnv_ioda_pe *npe,
return pe;
 }
 
-long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
+static long pnv_npu_unset_window(struct iommu_table_group *table_group,
+   int num);
+
+static long pnv_npu_set_window(struct iommu_table_group *table_group, int num,
struct iommu_table *tbl)
 {
+   struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+   table_group);
struct pnv_phb *phb = npe->phb;
int64_t rc;
const unsigned long size = tbl->it_indirect_levels ?
@@ -134,7 +139,7 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
 
/* NPU has just one TVE so if there is another table, remove it first */
if (npe->table_group.tables[num2])
-   pnv_npu_unset_window(npe, num2);
+   pnv_npu_unset_window(&npe->table_group, num2);
 
pe_info(npe, "Setting up window %llx..%llx pg=%lx\n",
start_addr, start_addr + win_size - 1,
@@ -160,8 +165,10 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num,
return 0;
 }
 
-long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num)
+static long pnv_npu_unset_window(struct iommu_table_group *table_group, int 
num)
 {
+   struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+   table_group);
struct pnv_phb *phb = npe->phb;
int64_t rc;
 
@@ -206,7 +213,8 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe)
if (!gpe)
return;
 
-   rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]);
+   rc = pnv_npu_set_window(&npe->table_group, 0,
+   gpe->table_group.tables[0]);
 
/*
 * NVLink devices use the same TCE table configuration as
@@ -231,7 +239,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe)
if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev)
return -EINVAL;
 
-   rc = pnv_npu_unset_window(npe, 0);
+   rc = pnv_npu_unset_window(&npe->table_group, 0);
if (rc != OPAL_SUCCESS)
return rc;
 
@@ -284,9 +292,12 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, 
bool bypass)
}
 }
 
+#ifdef CONFIG_IOMMU_API
 /* Switch ownership from platform code to external user (e.g. VFIO) */
-void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
+static void pnv_npu_take_ownership(struct iommu_table_group *table_group)
 {
+   struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe,
+   table_group);
struct pnv_phb *phb = npe->phb;
int64_t rc;
 
@@ -297,7 +308,7 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe)
 * if it was enabled at the moment of ownership change.
 */
if (npe->table_group.tables[0]) {
-   pnv_npu_unset_window(npe, 0);
+   pnv_npu_unset_window(&npe->table_group, 0);
return;
   

[PATCH kernel v5 03/20] powerpc/vfio/iommu/kvm: Do not pin device memory

2018-12-12 Thread Alexey Kardashevskiy
This new memory does not have page structs as it is not plugged into
the host, so gup() will fail anyway.

This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of the API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of these
new regions, which we must avoid unpinning.

This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().

This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
does delayed dirtying of pages.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* mm_iommu_is_devmem() now returns the actual size, which might be smaller
than the pageshift, so tce_page_is_contained() won't do pfn_to_page()
if @hpa..@hpa+64K is preregistered but page_shift is bigger than 16
* removed David's r-by because of the change in mm_iommu_is_devmem

v4:
* added device memory check in the real mode path
---
 arch/powerpc/include/asm/iommu.h   |  5 +-
 arch/powerpc/include/asm/mmu_context.h |  5 ++
 arch/powerpc/kernel/iommu.c| 11 ++-
 arch/powerpc/kvm/book3s_64_vio.c   | 18 ++---
 arch/powerpc/mm/mmu_context_iommu.c| 93 +++---
 drivers/vfio/vfio_iommu_spapr_tce.c| 29 +---
 6 files changed, 129 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 35db0cb..a8aeac0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -218,8 +218,9 @@ extern void iommu_register_group(struct iommu_table_group 
*table_group,
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
-extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-   unsigned long *hpa, enum dma_data_direction *direction);
+extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+   unsigned long entry, unsigned long *hpa,
+   enum dma_data_direction *direction);
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 268e112..4e4656d 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm);
 extern long mm_iommu_new(struct mm_struct *mm,
unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
+extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua,
+   unsigned long entries, unsigned long dev_hpa,
+   struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_init(struct mm_struct *mm);
@@ -39,6 +42,8 @@ extern long mm_iommu_ua_to_hpa(struct 
mm_iommu_table_group_mem_t *mem,
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua);
+extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
+   unsigned int pageshift, unsigned long *size);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index f0dc680..cbcc615 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DBG(...)
 
@@ -993,15 +994,19 @@ int iommu_tce_check_gpa(unsigned long page_shift, 
unsigned long gpa)
 }
 EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
 
-long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
-   unsigned long *hpa, enum dma_data_direction *direction)
+long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
+   unsigned long entry, unsigned long *hpa,
+   enum dma_data_direction *direction)
 {
long ret;
+   unsigned long size = 0;
 
ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
 
if (!ret && ((*direction == DMA_FROM_DEVICE) ||
-   (*direction == DMA_BIDIRECTIONAL)))
+   (*direction == DMA_BIDIRECTIONAL)) &&
+   !mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift,
+   &size))
SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT));
 
/* if (unlikely(ret))
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 62a8d03..532ab7

[PATCH kernel v5 02/20] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region

2018-12-12 Thread Alexey Kardashevskiy
Normally mm_iommu_get() should add a reference and mm_iommu_put() should
remove it. However, historically mm_iommu_find() has done the referencing
and mm_iommu_get() has done both allocation and referencing.

We are going to add another helper to preregister device memory so
instead of having mm_iommu_new() (which pre-registers the normal memory
and references the region), we need separate helpers for pre-registering
and referencing.

This renames:
- mm_iommu_get to mm_iommu_new;
- mm_iommu_find to mm_iommu_get.

This changes mm_iommu_get() to reference the region so the name now
reflects what it does.

This removes the check for exact match from mm_iommu_new() as we want it
to fail on existing regions; mm_iommu_get() should be used instead.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
Changes:
v5:
* fixed a bug with uninitialized @found in tce_iommu_unregister_pages()
* reworded the commit log

v4:
* squashed "powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions" 
into this

v2:
* merged 2 patches into one
---
 arch/powerpc/include/asm/mmu_context.h |  4 +--
 arch/powerpc/mm/mmu_context_iommu.c| 19 +++---
 drivers/vfio/vfio_iommu_spapr_tce.c| 35 +-
 3 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index c05efd2..268e112 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t;
 
 extern int isolate_lru_page(struct page *page);/* from internal.h */
 extern bool mm_iommu_preregistered(struct mm_struct *mm);
-extern long mm_iommu_get(struct mm_struct *mm,
+extern long mm_iommu_new(struct mm_struct *mm,
unsigned long ua, unsigned long entries,
struct mm_iommu_table_group_mem_t **pmem);
 extern long mm_iommu_put(struct mm_struct *mm,
@@ -32,7 +32,7 @@ extern struct mm_iommu_table_group_mem_t 
*mm_iommu_lookup(struct mm_struct *mm,
unsigned long ua, unsigned long size);
 extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
struct mm_struct *mm, unsigned long ua, unsigned long size);
-extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned int pageshift, unsigned long *hpa);
diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 0741d90..25a4b7f7 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -89,7 +89,7 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
+long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
@@ -100,12 +100,6 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
 
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list,
next) {
-   if ((mem->ua == ua) && (mem->entries == entries)) {
-   ++mem->used;
-   *pmem = mem;
-   goto unlock_exit;
-   }
-
/* Overlap? */
if ((mem->ua < (ua + (entries << PAGE_SHIFT))) &&
(ua < (mem->ua +
@@ -192,7 +186,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
 
return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_get);
+EXPORT_SYMBOL_GPL(mm_iommu_new);
 
 static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 {
@@ -308,21 +302,26 @@ struct mm_iommu_table_group_mem_t 
*mm_iommu_lookup_rm(struct mm_struct *mm,
return ret;
 }
 
-struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
+struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm,
unsigned long ua, unsigned long entries)
 {
struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
+   mutex_lock(&mem_list_mutex);
+
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
if ((mem->ua == ua) && (mem->entries == entries)) {
ret = mem;
+   ++mem->used;
break;
}
}
 
+   mutex_unlock(&mem_list_mutex);
+
return ret;
 }
-EXPORT_SYMBOL_GPL(mm_iommu_find);
+EXPORT_SYMBOL_GPL(mm_iommu_get);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
unsigned long ua, unsigned int pageshift

[PATCH kernel v5 01/20] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2

2018-12-12 Thread Alexey Kardashevskiy
The skiboot firmware has a hot reset handler which fences the NVIDIA V100
GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
https://github.com/open-power/skiboot/commit/fca2b2b839a67

Now we are going to pass the V100 through via VFIO, which most certainly
involves KVM guests; these are often terminated without getting a chance
to offline GPU RAM, so we end up with a running machine with misconfigured
memory. Accessing this memory produces hardware management interrupts
(HMIs) which bring the host down.

To suppress HMIs, this wires up the hot reset hook to vfio_pci_disable()
via pci_disable_device(), which switches the NPU2 to a safe mode and
prevents HMIs.

Signed-off-by: Alexey Kardashevskiy 
Acked-by: Alistair Popple 
Reviewed-by: David Gibson 
---
Changes:
v2:
* updated the commit log
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 9ee7a30..29c6837 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -3676,6 +3676,15 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
pnv_ioda_release_pe(pe);
 }
 
+static void pnv_npu_disable_device(struct pci_dev *pdev)
+{
+   struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
+   struct eeh_pe *eehpe = edev ? edev->pe : NULL;
+
+   if (eehpe && eeh_ops && eeh_ops->reset)
+   eeh_ops->reset(eehpe, EEH_RESET_HOT);
+}
+
 static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
 {
struct pnv_phb *phb = hose->private_data;
@@ -3720,6 +3729,7 @@ static const struct pci_controller_ops 
pnv_npu_ioda_controller_ops = {
.reset_secondary_bus= pnv_pci_reset_secondary_bus,
.dma_set_mask   = pnv_npu_dma_set_mask,
.shutdown   = pnv_pci_ioda_shutdown,
+   .disable_device = pnv_npu_disable_device,
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
-- 
2.17.1



[PATCH kernel v5 09/20] powerpc/powernv/pseries: Rework device adding to IOMMU groups

2018-12-12 Thread Alexey Kardashevskiy
The powernv platform registers IOMMU groups and adds devices to them
from the pci_controller_ops::setup_bridge() hook except one case when
virtual functions (SRIOV VFs) are added from a bus notifier.

The pseries platform registers IOMMU groups from
the pci_controller_ops::dma_bus_setup() hook and adds devices from
the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier
used for powernv does not add devices for pseries though, as
__of_scan_bus() adds devices first and only then does the bus/dev DMA setup.

Both platforms use iommu_add_device() which takes a device and expects
it to have a valid IOMMU table struct with an iommu_table_group pointer
which in turn points the iommu_group struct (which represents
an IOMMU group). Although the helper seems easy to use, it relies on
some pre-existing device configuration and associated data structures
which it does not really need.

This simplifies iommu_add_device() to take the table_group pointer
directly. Pseries already has a table_group pointer handy and the bus
notifier is not used anyway. For powernv, this copies the existing bus
notifier and makes it work for powernv only, which gives an easy way of
getting to the table_group pointer. This was tested on VFs but should
also support physical PCI hotplug.

Since iommu_add_device() receives the table_group pointer directly, and
since pseries neither does TCE cache invalidation (the hypervisor does)
nor allows multiple groups per VFIO container (in other words, sharing
an IOMMU table between partitionable endpoints), this removes
iommu_table_group_link from pseries.

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 
---
 arch/powerpc/include/asm/iommu.h  | 12 ++---
 arch/powerpc/kernel/iommu.c   | 58 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +---
 arch/powerpc/platforms/powernv/pci.c  | 43 -
 arch/powerpc/platforms/pseries/iommu.c| 46 +-
 5 files changed, 74 insertions(+), 95 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index a8aeac0..e847ff6 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,9 +215,9 @@ struct iommu_table_group {
 
 extern void iommu_register_group(struct iommu_table_group *table_group,
 int pci_domain_number, unsigned long pe_num);
-extern int iommu_add_device(struct device *dev);
+extern int iommu_add_device(struct iommu_table_group *table_group,
+   struct device *dev);
 extern void iommu_del_device(struct device *dev);
-extern int __init tce_iommu_bus_notifier_init(void);
 extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
unsigned long entry, unsigned long *hpa,
enum dma_data_direction *direction);
@@ -228,7 +228,8 @@ static inline void iommu_register_group(struct 
iommu_table_group *table_group,
 {
 }
 
-static inline int iommu_add_device(struct device *dev)
+static inline int iommu_add_device(struct iommu_table_group *table_group,
+   struct device *dev)
 {
return 0;
 }
@@ -236,11 +237,6 @@ static inline int iommu_add_device(struct device *dev)
 static inline void iommu_del_device(struct device *dev)
 {
 }
-
-static inline int __init tce_iommu_bus_notifier_init(void)
-{
-return 0;
-}
 #endif /* !CONFIG_IOMMU_API */
 
 int dma_iommu_mapping_error(struct device *dev, dma_addr_t dma_addr);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index cbcc615..9d5d109 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1078,11 +1078,8 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-int iommu_add_device(struct device *dev)
+int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
-   struct iommu_table *tbl;
-   struct iommu_table_group_link *tgl;
-
/*
 * The sysfs entries should be populated before
 * binding IOMMU group. If sysfs entries isn't
@@ -1098,32 +1095,10 @@ int iommu_add_device(struct device *dev)
return -EBUSY;
}
 
-   tbl = get_iommu_table_base(dev);
-   if (!tbl) {
-   pr_debug("%s: Skipping device %s with no tbl\n",
-__func__, dev_name(dev));
-   return 0;
-   }
-
-   tgl = list_first_entry_or_null(&tbl->it_group_list,
-   struct iommu_table_group_link, next);
-   if (!tgl) {
-   pr_debug("%s: Skipping device %s with no group\n",
-__func__, dev_name(dev));
-   return 0;
-   }
pr_debug("%s: Adding %s to iommu group %d\n",
-__func__, dev_name(dev),
-iommu_group_id(tgl->table_group->group));
+__func__, dev_name(dev),  iommu_group_id(table_group->group));
 
-   if (PAGE_SIZE < IOMMU_PAGE_SI

[PATCH kernel v5 06/20] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation

2018-12-12 Thread Alexey Kardashevskiy
We might have memory@ nodes with "linux,usable-memory" set to zero
(for example, to replicate powernv's behaviour for GPU coherent memory),
which means the memory needs extra initialization. Since it can be used
afterwards, the pseries platform will try mapping it for DMA, so the DMA
window needs to cover those memory regions too; if the window cannot
cover new memory regions, memory onlining fails.

This walks through the memory nodes to find the highest RAM address so
that a huge DMA window can cover it too in case this memory gets onlined
later.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* uses of_read_number directly instead of cut-n-pasted read_n_cells
---
 arch/powerpc/platforms/pseries/iommu.c | 33 +-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..cbcc8ce 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -964,6 +964,37 @@ struct failed_ddw_pdn {
 
 static LIST_HEAD(failed_ddw_pdn_list);
 
+static phys_addr_t ddw_memory_hotplug_max(void)
+{
+   phys_addr_t max_addr = memory_hotplug_max();
+   struct device_node *memory;
+
+   for_each_node_by_type(memory, "memory") {
+   unsigned long start, size;
+   int ranges, n_mem_addr_cells, n_mem_size_cells, len;
+   const __be32 *memcell_buf;
+
+   memcell_buf = of_get_property(memory, "reg", &len);
+   if (!memcell_buf || len <= 0)
+   continue;
+
+   n_mem_addr_cells = of_n_addr_cells(memory);
+   n_mem_size_cells = of_n_size_cells(memory);
+
+   /* ranges in cell */
+   ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
+
+   start = of_read_number(memcell_buf, n_mem_addr_cells);
+   memcell_buf += n_mem_addr_cells;
+   size = of_read_number(memcell_buf, n_mem_size_cells);
+   memcell_buf += n_mem_size_cells;
+
+   max_addr = max_t(phys_addr_t, max_addr, start + size);
+   }
+
+   return max_addr;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1053,7 +1084,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
}
/* verify the window * number of ptes will map the partition */
/* check largest block * page size > max memory hotplug addr */
-   max_addr = memory_hotplug_max();
+   max_addr = ddw_memory_hotplug_max();
if (query.largest_available_block < (max_addr >> page_shift)) {
dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
  "%llu-sized pages\n", max_addr,  
query.largest_available_block,
-- 
2.17.1



[PATCH kernel v5 04/20] powerpc/powernv: Move npu struct from pnv_phb to pci_controller

2018-12-12 Thread Alexey Kardashevskiy
The powernv PCI code stores NPU data in the pnv_phb struct. The latter
is referenced by pci_controller::private_data. We are going to have NPU2
support in the pseries platform as well but it does not store any
private_data in the pci_controller struct; and even if it did,
it would be a different data structure.

This makes npu a pointer and stores it one level higher in
the pci_controller struct.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v5:
* removed !npu checks as this is out of scope of this patch
* added WARN_ON_ONCE around the pnv_npu2_init(phb) call

v4:
* changed subj from "powerpc/powernv: Detach npu struct from pnv_phb"
* got rid of global list of npus - store them now in pci_controller
* got rid of npdev_to_npu() helper
---
 arch/powerpc/include/asm/pci-bridge.h |  1 +
 arch/powerpc/platforms/powernv/pci.h  | 16 -
 arch/powerpc/platforms/powernv/npu-dma.c  | 74 +--
 arch/powerpc/platforms/powernv/pci-ioda.c |  2 +-
 4 files changed, 58 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 94d4490..aee4fcc 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -129,6 +129,7 @@ struct pci_controller {
 #endif /* CONFIG_PPC64 */
 
void *private_data;
+   struct npu *npu;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 2131373..f2d50974 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -8,9 +8,6 @@
 
 struct pci_dn;
 
-/* Maximum possible number of ATSD MMIO registers per NPU */
-#define NV_NMMU_ATSD_REGS 8
-
 enum pnv_phb_type {
PNV_PHB_IODA1   = 0,
PNV_PHB_IODA2   = 1,
@@ -176,19 +173,6 @@ struct pnv_phb {
unsigned intdiag_data_size;
u8  *diag_data;
 
-   /* Nvlink2 data */
-   struct npu {
-   int index;
-   __be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS];
-   unsigned int mmio_atsd_count;
-
-   /* Bitmask for MMIO register usage */
-   unsigned long mmio_atsd_usage;
-
-   /* Do we need to explicitly flush the nest mmu? */
-   bool nmmu_flush;
-   } npu;
-
int p2p_target_count;
 };
 
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 91d488f..5e66439 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -327,6 +327,25 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct 
pnv_ioda_pe *npe)
return gpe;
 }
 
+/*
+ * NPU2 ATS
+ */
+/* Maximum possible number of ATSD MMIO registers per NPU */
+#define NV_NMMU_ATSD_REGS 8
+
+/* An NPU descriptor, valid for POWER9 only */
+struct npu {
+   int index;
+   __be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS];
+   unsigned int mmio_atsd_count;
+
+   /* Bitmask for MMIO register usage */
+   unsigned long mmio_atsd_usage;
+
+   /* Do we need to explicitly flush the nest mmu? */
+   bool nmmu_flush;
+};
+
 /* Maximum number of nvlinks per npu */
 #define NV_MAX_LINKS 6
 
@@ -478,7 +497,6 @@ static void acquire_atsd_reg(struct npu_context 
*npu_context,
int i, j;
struct npu *npu;
struct pci_dev *npdev;
-   struct pnv_phb *nphb;
 
for (i = 0; i <= max_npu2_index; i++) {
mmio_atsd_reg[i].reg = -1;
@@ -493,8 +511,7 @@ static void acquire_atsd_reg(struct npu_context 
*npu_context,
if (!npdev)
continue;
 
-   nphb = pci_bus_to_host(npdev->bus)->private_data;
-   npu = &nphb->npu;
+   npu = pci_bus_to_host(npdev->bus)->npu;
mmio_atsd_reg[i].npu = npu;
mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu);
while (mmio_atsd_reg[i].reg < 0) {
@@ -662,6 +679,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev 
*gpdev,
struct pnv_phb *nphb;
struct npu *npu;
struct npu_context *npu_context;
+   struct pci_controller *hose;
 
/*
 * At present we don't support GPUs connected to multiple NPUs and I'm
@@ -689,8 +707,9 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev 
*gpdev,
return ERR_PTR(-EINVAL);
}
 
-   nphb = pci_bus_to_host(npdev->bus)->private_data;
-   npu = &nphb->npu;
+   hose = pci_bus_to_host(npdev->bus);
+   nphb = hose->private_data;
+   npu = hose->npu;
 
/*
 * Setup the NPU context table for a particular GPU. These need to be
@@ -764,7 +783,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev 
*gpdev,
 */
WRITE_ONCE(npu_context->npdev[npu->index][nvli

[PATCH kernel v5 00/20] powerpc/powernv/npu, vfio: NVIDIA V100 + P9 passthrough

2018-12-12 Thread Alexey Kardashevskiy


This is for passing through NVIDIA V100 GPUs on POWER9 systems.
20/20 has the details of hardware setup.

This implements support for NVIDIA V100 GPU with coherent memory and
NPU/ATS support available in the POWER9 CPU. The aim is to support
unmodified vendor driver in the guest.

This is pushed to (both guest and host kernels):
https://github.com/aik/linux/tree/nv2

Matching qemu is pushed to github:
https://github.com/aik/qemu/tree/nv2

Skiboot bits are here:
https://github.com/aik/skiboot/tree/nv2

The individual patches have changelogs. Some were dropped as not required
or quite useless.


Please comment. Thanks.



Alexey Kardashevskiy (20):
  powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
  powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a
region
  powerpc/vfio/iommu/kvm: Do not pin device memory
  powerpc/powernv: Move npu struct from pnv_phb to pci_controller
  powerpc/powernv/npu: Move OPAL calls away from context manipulation
  powerpc/pseries/iommu: Use memory@ nodes in max RAM address
calculation
  powerpc/pseries/npu: Enable platform support
  powerpc/pseries: Remove IOMMU API support for non-LPAR systems
  powerpc/powernv/pseries: Rework device adding to IOMMU groups
  powerpc/iommu_api: Move IOMMU groups setup to a single place
  powerpc/powernv: Reference iommu_table while it is linked to a group
  powerpc/powernv/npu: Move single TVE handling to NPU PE
  powerpc/powernv/npu: Convert NPU IOMMU helpers to
iommu_table_group_ops
  powerpc/powernv/npu: Add compound IOMMU groups
  powerpc/powernv/npu: Add release_ownership hook
  powerpc/powernv/npu: Check mmio_atsd array bounds when populating
  powerpc/powernv/npu: Fault user page into the hypervisor's pagetable
  vfio_pci: Allow mapping extra regions
  vfio_pci: Allow regions to add own capabilities
  vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

 drivers/vfio/pci/Makefile |   1 +
 arch/powerpc/include/asm/iommu.h  |  17 +-
 arch/powerpc/include/asm/mmu_context.h|   9 +-
 arch/powerpc/include/asm/pci-bridge.h |   1 +
 arch/powerpc/include/asm/pci.h|   4 +
 arch/powerpc/platforms/powernv/pci.h  |  30 +-
 drivers/vfio/pci/trace.h  | 102 
 drivers/vfio/pci/vfio_pci_private.h   |  20 +
 include/uapi/linux/vfio.h |  39 ++
 arch/powerpc/kernel/iommu.c   |  69 +--
 arch/powerpc/kvm/book3s_64_vio.c  |  18 +-
 arch/powerpc/mm/mmu_context_iommu.c   | 110 +++-
 arch/powerpc/platforms/powernv/npu-dma.c  | 549 +++---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   3 +-
 arch/powerpc/platforms/powernv/pci-ioda.c | 237 
 arch/powerpc/platforms/powernv/pci.c  |  43 +-
 arch/powerpc/platforms/pseries/iommu.c|  88 ++-
 arch/powerpc/platforms/pseries/pci.c  |  22 +
 drivers/vfio/pci/vfio_pci.c   |  42 +-
 drivers/vfio/pci/vfio_pci_nvlink2.c   | 473 +++
 drivers/vfio/vfio_iommu_spapr_tce.c   |  64 +-
 drivers/vfio/pci/Kconfig  |   6 +
 22 files changed, 1556 insertions(+), 391 deletions(-)
 create mode 100644 drivers/vfio/pci/trace.h
 create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c

-- 
2.17.1



[PATCH] powerpc/powernv: Move opal_power_control_init() call in opal_init().

2018-12-12 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

opal_power_control_init() depends on the opal message notifier being
initialized, which is done in opal_init()->opal_message_init(). But both
of these initializations are invoked through machine initcalls, so
everything depends on the order in which they are called. So far they
have been called in the correct order (maybe we got lucky) and we never
saw any issue. But it is clearer to control the initialization order
explicitly by moving opal_power_control_init() into opal_init().

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/opal.h |1 +
 arch/powerpc/platforms/powernv/opal-power.c |3 +--
 arch/powerpc/platforms/powernv/opal.c   |3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ff3866473afe..a55b01c90bb1 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -347,6 +347,7 @@ extern int opal_async_comp_init(void);
 extern int opal_sensor_init(void);
 extern int opal_hmi_handler_init(void);
 extern int opal_event_init(void);
+int opal_power_control_init(void);
 
 extern int opal_machine_check(struct pt_regs *regs);
 extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
diff --git a/arch/powerpc/platforms/powernv/opal-power.c 
b/arch/powerpc/platforms/powernv/opal-power.c
index 58dc3308237f..89ab1da57657 100644
--- a/arch/powerpc/platforms/powernv/opal-power.c
+++ b/arch/powerpc/platforms/powernv/opal-power.c
@@ -138,7 +138,7 @@ static struct notifier_block opal_power_control_nb = {
.priority   = 0,
 };
 
-static int __init opal_power_control_init(void)
+int __init opal_power_control_init(void)
 {
int ret, supported = 0;
struct device_node *np;
@@ -176,4 +176,3 @@ static int __init opal_power_control_init(void)
 
return 0;
 }
-machine_subsys_initcall(powernv, opal_power_control_init);
diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index beed86f4224b..f6dfc0534969 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -960,6 +960,9 @@ static int __init opal_init(void)
/* Initialise OPAL sensor groups */
opal_sensor_groups_init();
 
+   /* Initialise OPAL Power control interface */
+   opal_power_control_init();
+
return 0;
 }
 machine_subsys_initcall(powernv, opal_init);



Re: [PATCH V2 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-12 Thread Paul Mackerras
On Mon, Dec 10, 2018 at 02:58:24PM +1100, Suraj Jitindar Singh wrote:
> A guest cannot access quadrants 1 or 2 as this would result in an
> exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
> guest when it wants to perform an access to quadrants 1 or 2, for
> example when it wants to access memory for one of its nested guests.
> 
> Also provide an implementation for the kvm-hv module.
> 
> Signed-off-by: Suraj Jitindar Singh 

[snip]

>  /*
> + * Handle the H_COPY_TOFROM_GUEST hcall.
> + * r4 = L1 lpid of nested guest
> + * r5 = pid
> + * r6 = eaddr to access
> + * r7 = to buffer (L1 gpa)
> + * r8 = from buffer (L1 gpa)

Comment says these are GPAs...

> + * r9 = n bytes to copy
> + */
> +long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_nested_guest *gp;
> + int l1_lpid = kvmppc_get_gpr(vcpu, 4);
> + int pid = kvmppc_get_gpr(vcpu, 5);
> + gva_t eaddr = kvmppc_get_gpr(vcpu, 6);
> + void *gp_to = (void *) kvmppc_get_gpr(vcpu, 7);
> + void *gp_from = (void *) kvmppc_get_gpr(vcpu, 8);
> + void *buf;
> + unsigned long n = kvmppc_get_gpr(vcpu, 9);
> + bool is_load = !!gp_to;
> + long rc;
> +
> + if (gp_to && gp_from) /* One must be NULL to determine the direction */
> + return H_PARAMETER;
> +
> + if (eaddr & (0xFFFUL << 52))
> + return H_PARAMETER;
> +
> + buf = kzalloc(n, GFP_KERNEL);
> + if (!buf)
> + return H_NO_MEM;
> +
> + gp = kvmhv_get_nested(vcpu->kvm, l1_lpid, false);
> + if (!gp) {
> + rc = H_PARAMETER;
> + goto out_free;
> + }
> +
> + mutex_lock(&gp->tlb_lock);
> +
> + if (is_load) {
> + /* Load from the nested guest into our buffer */
> + rc = __kvmhv_copy_tofrom_guest_radix(gp->shadow_lpid, pid,
> +  eaddr, buf, NULL, n);
> + if (rc)
> + goto not_found;
> +
> + /* Write what was loaded into our buffer back to the L1 guest */
> + rc = kvmppc_st(vcpu, (ulong *) &gp_to, n, buf, true);

but using kvmppc_st implies that it is an EA (and in fact when you
call it in the next patch you pass an EA).

It would be more like other hcalls to pass a GPA, meaning that you
would use kvm_write_guest() here.  On the other hand, with the
quadrant access, kvmppc_st() might well be faster than
kvm_write_guest.

So you need to decide which it is and either fix the comment or change
the code.

Paul.


[PATCH AUTOSEL 3.18 10/16] ide: pmac: add of_node_put()

2018-12-12 Thread Sasha Levin
From: Yangtao Li 

[ Upstream commit a51921c0db3fd26c4ed83dc0ec5d32988fa02aa5 ]

use of_node_put() to release the refcount.

Signed-off-by: Yangtao Li 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/ide/pmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index 2db803cd095c..57a0bd00789f 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -920,6 +920,7 @@ static u8 pmac_ide_cable_detect(ide_hwif_t *hwif)
struct device_node *root = of_find_node_by_path("/");
const char *model = of_get_property(root, "model", NULL);
 
+   of_node_put(root);
/* Get cable type from device-tree. */
if (cable && !strncmp(cable, "80-", 3)) {
/* Some drives fail to detect 80c cable in PowerBook */
-- 
2.19.1



[PATCH AUTOSEL 4.4 13/23] ide: pmac: add of_node_put()

2018-12-12 Thread Sasha Levin
From: Yangtao Li 

[ Upstream commit a51921c0db3fd26c4ed83dc0ec5d32988fa02aa5 ]

use of_node_put() to release the refcount.

Signed-off-by: Yangtao Li 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/ide/pmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index 96a345248224..0add5bb3cee8 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -920,6 +920,7 @@ static u8 pmac_ide_cable_detect(ide_hwif_t *hwif)
struct device_node *root = of_find_node_by_path("/");
const char *model = of_get_property(root, "model", NULL);
 
+   of_node_put(root);
/* Get cable type from device-tree. */
if (cable && !strncmp(cable, "80-", 3)) {
/* Some drives fail to detect 80c cable in PowerBook */
-- 
2.19.1



[PATCH AUTOSEL 4.9 14/34] ide: pmac: add of_node_put()

2018-12-12 Thread Sasha Levin
From: Yangtao Li 

[ Upstream commit a51921c0db3fd26c4ed83dc0ec5d32988fa02aa5 ]

use of_node_put() to release the refcount.

Signed-off-by: Yangtao Li 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/ide/pmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index 0c5d3a99468e..b20025a5a8d9 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -920,6 +920,7 @@ static u8 pmac_ide_cable_detect(ide_hwif_t *hwif)
struct device_node *root = of_find_node_by_path("/");
const char *model = of_get_property(root, "model", NULL);
 
+   of_node_put(root);
/* Get cable type from device-tree. */
if (cable && !strncmp(cable, "80-", 3)) {
/* Some drives fail to detect 80c cable in PowerBook */
-- 
2.19.1



[PATCH AUTOSEL 4.14 16/41] ide: pmac: add of_node_put()

2018-12-12 Thread Sasha Levin
From: Yangtao Li 

[ Upstream commit a51921c0db3fd26c4ed83dc0ec5d32988fa02aa5 ]

use of_node_put() to release the refcount.

Signed-off-by: Yangtao Li 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/ide/pmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index c5b902b86b44..203ed4adc04a 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -920,6 +920,7 @@ static u8 pmac_ide_cable_detect(ide_hwif_t *hwif)
struct device_node *root = of_find_node_by_path("/");
const char *model = of_get_property(root, "model", NULL);
 
+   of_node_put(root);
/* Get cable type from device-tree. */
if (cable && !strncmp(cable, "80-", 3)) {
/* Some drives fail to detect 80c cable in PowerBook */
-- 
2.19.1



[PATCH AUTOSEL 4.19 31/73] ide: pmac: add of_node_put()

2018-12-12 Thread Sasha Levin
From: Yangtao Li 

[ Upstream commit a51921c0db3fd26c4ed83dc0ec5d32988fa02aa5 ]

use of_node_put() to release the refcount.

Signed-off-by: Yangtao Li 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
---
 drivers/ide/pmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ide/pmac.c b/drivers/ide/pmac.c
index c5b902b86b44..203ed4adc04a 100644
--- a/drivers/ide/pmac.c
+++ b/drivers/ide/pmac.c
@@ -920,6 +920,7 @@ static u8 pmac_ide_cable_detect(ide_hwif_t *hwif)
struct device_node *root = of_find_node_by_path("/");
const char *model = of_get_property(root, "model", NULL);
 
+   of_node_put(root);
/* Get cable type from device-tree. */
if (cable && !strncmp(cable, "80-", 3)) {
/* Some drives fail to detect 80c cable in PowerBook */
-- 
2.19.1



Re: [PATCH kernel v4 03/19] powerpc/vfio/iommu/kvm: Do not pin device memory

2018-12-12 Thread Paul Mackerras
On Fri, Nov 23, 2018 at 04:52:48PM +1100, Alexey Kardashevskiy wrote:
> This new memory does not have page structs as it is not plugged to
> the host so gup() will fail anyway.
> 
> This adds 2 helpers:
> - mm_iommu_newdev() to preregister the "memory device" memory so
> the rest of API can still be used;
> - mm_iommu_is_devmem() to know if the physical address is one of these
> new regions which we must avoid unpinning of.
> 
> This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
> if the memory is device memory to avoid pfn_to_page().
> 
> This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
> does delayed pages dirtying.

This mostly looks good, but I have one concern:

> -static bool tce_page_is_contained(struct page *page, unsigned page_shift)
> +static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa,
> + unsigned int page_shift)
>  {
> + struct page *page;
> +
> + if (mm_iommu_is_devmem(mm, hpa, page_shift))
> + return true;
> +
> + page = pfn_to_page(hpa >> PAGE_SHIFT);

Is it possible for userspace or a guest to cause us to get here with
hpa value that is bogus?  If so what does pfn_to_page do with that
pfn, and do we handle that correctly?

(I realize that if there is a problem here, it's a problem that
already exists in the code without this patch.)

Paul.


Re: [PATCH] lib: fix build failure in CONFIG_DEBUG_VIRTUAL test

2018-12-12 Thread Michael Ellerman
Christophe Leroy  writes:

> On several arches, virt_to_phys() is in io.h
>
> Build fails without it:
>
>   CC  lib/test_debug_virtual.o
> lib/test_debug_virtual.c: In function 'test_debug_virtual_init':
> lib/test_debug_virtual.c:26:7: error: implicit declaration of function 
> 'virt_to_phys' [-Werror=implicit-function-declaration]
>   pa = virt_to_phys(va);
>^
>
> Fixes: e4dace361552 ("lib: add test module for CONFIG_DEBUG_VIRTUAL")
> CC: sta...@vger.kernel.org
> Signed-off-by: Christophe Leroy 
> ---
>  lib/test_debug_virtual.c | 1 +
>  1 file changed, 1 insertion(+)

I'm going to take this via the powerpc tree, because otherwise
Christophe's patch to implement CONFIG_DEBUG_VIRTUAL for powerpc will
break the build for us.

Hopefully no one minds :)

cheers


Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem

2018-12-12 Thread Michael Ellerman
Frank Rowand  writes:
> On 12/11/18 8:07 AM, Rob Herring wrote:
>> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman  wrote:
...
>>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>>> index 09692c9b32a7..d8e4534c0686 100644
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle 
>>> handle)
>>> if (phandle_cache[masked_handle] &&
>>> handle == phandle_cache[masked_handle]->phandle)
>>> np = phandle_cache[masked_handle];
>>> +
>>> +   /* If we find a detached node, remove it */
>>> +   if (of_node_check_flag(np, OF_DETACHED))
>>> +   np = phandle_cache[masked_handle] = NULL;
>
> The bug you found exposes a couple of different issues, a little bit
> deeper than the proposed fix.  I'll work on a fuller fix tonight or
> tomorrow.

OK thanks.

>> I'm wondering if we should explicitly remove the node from the cache
>> when we set OF_DETACHED. Otherwise, it could be possible that the node
>> pointer has been freed already. Or maybe we need both?
>
> Yes, it should be explicitly removed.  I may also add in a paranoia check in
> of_find_node_by_phandle().

That seems best to me.

cheers


Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem

2018-12-12 Thread Michael Ellerman
Rob Herring  writes:
> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman  wrote:
...
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index 09692c9b32a7..d8e4534c0686 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -1190,6 +1190,10 @@ struct device_node *of_find_node_by_phandle(phandle 
>> handle)
>> if (phandle_cache[masked_handle] &&
>> handle == phandle_cache[masked_handle]->phandle)
>> np = phandle_cache[masked_handle];
>> +
>> +   /* If we find a detached node, remove it */
>> +   if (of_node_check_flag(np, OF_DETACHED))
>> +   np = phandle_cache[masked_handle] = NULL;
>
> I'm wondering if we should explicitly remove the node from the cache
> when we set OF_DETACHED. Otherwise, it could be possible that the node
> pointer has been freed already.

Yeah good point.

> Or maybe we need both?

That's probably best, it could even be a WARN_ON() if we find one in
of_find_node_by_phandle().

cheers


Re: [PATCH] powerpc/mm/hash: Hand user access of kernel address gracefully

2018-12-12 Thread Michael Ellerman
Breno Leitao  writes:

> hi Aneesh,
>
> On 11/26/18 12:35 PM, Aneesh Kumar K.V wrote:
>> With commit 2865d08dd9ea ("powerpc/mm: Move the DSISR_PROTFAULT sanity 
>> check")
>> we moved the protection fault access check before vma lookup. That means we
>> hit that WARN_ON when user space access a kernel address.  Before the commit
>> this was handled by find_vma() not finding vma for the kernel address and
>> considering that access as bad area access.
>> 
>> Avoid the confusing WARN_ON and convert that to a ratelimited printk.
>> With the patch we now get
>> 
>> for load:
>> [  187.700294] a.out[5997]: User access of kernel address (c000dea0) 
>> - exploit attempt? (uid: 1000)
>> [  187.700344] a.out[5997]: segfault (11) at c000dea0 nip 1317c0798 
>> lr 7fff80d6441c code 1 in a.out[1317c+1]
>> [  187.700429] a.out[5997]: code: 6000 6042 3c4c0002 38427790 
>> 4b20 3c4c0002 38427784 fbe1fff8
>> [  187.700435] a.out[5997]: code: f821ffc1 7c3f0b78 6000 e9228030 
>> <8929> 993f002f 6000 383f0040
>> 
>> for exec:
>> [  225.100903] a.out[6067]: User access of kernel address (c000dea0) 
>> - exploit attempt? (uid: 1000)
>> [  225.100938] a.out[6067]: segfault (11) at c000dea0 nip 
>> c000dea0 lr 129d507b0 code 1
>> [  225.100943] a.out[6067]: Bad NIP, not dumping instructions.
>> 
>> Fixes: 2865d08dd9ea ("powerpc/mm: Move the DSISR_PROTFAULT sanity check")
>> Signed-off-by: Aneesh Kumar K.V 
>
> Tested-by: Breno Leitao 

Thanks.

>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 1697e903bbf2..46f280068c45 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -342,8 +342,21 @@ static inline void cmo_account_page_fault(void) { }
>>  #endif /* CONFIG_PPC_SMLPAR */
>>  
>>  #ifdef CONFIG_PPC_STD_MMU
>> -static void sanity_check_fault(bool is_write, unsigned long error_code)
>> +static void sanity_check_fault(bool is_write, bool is_user,
>> +   unsigned long error_code, unsigned long address)
>>  {
>> +/*
>> + * userspace trying to access kernel address, we get PROTFAULT for that.
>> + */
>> +if (is_user && address >= TASK_SIZE) {
>> +printk_ratelimited(KERN_CRIT "%s[%d]: "
>> +   "User access of kernel address (%lx) - "
>> +   "exploit attempt? (uid: %d)\n",
>> +   current->comm, current->pid, address,
>> +   from_kuid(&init_user_ns, current_uid()));
>> +return;
>
> Silly question: Is it OK to printk() and just return here? __do_page_fault
> will continue to execute independently of this return, right?

Yeah it is OK to just return.

I agree it's a bit of a strange way for the code to be structured, ie.
we detect a bad condition and print about it and then just return and
let it continue anyway.

I guess it's that way because it was added as an additional check, ie.
the code already handled those cases further down, but this was a check
in case anything weird happened.

If you look at the start of __do_page_fault() we have three separate
checks:

if (unlikely(page_fault_is_bad(error_code))) {
if (is_user) {
_exception(SIGBUS, regs, BUS_OBJERR, address);
return 0;
}
return SIGBUS;
}

/* Additional sanity check(s) */
sanity_check_fault(is_write, is_user, error_code, address);

/*
 * The kernel should never take an execute fault nor should it
 * take a page fault to a kernel address.
 */
if (unlikely(!is_user && bad_kernel_fault(is_exec, error_code, 
address)))
return SIGSEGV;


It seems like maybe we could simplify that somewhat.

We need to be careful though that we return the right signal (SEGV or
BUS), and also that user faults get counted (see PERF_COUNT_SW_PAGE_FAULTS).

So it's not as straightforward as usual :)

cheers


[PATCH v9 14/14] ima: Store the measurement again when appraising a modsig

2018-12-12 Thread Thiago Jung Bauermann
If the IMA template contains the 'sig' field, then the modsig should be
added to the measurement list when the file is appraised, and that is what
normally happens.

But if a measurement rule causes a file containing a modsig to be measured
before a different rule causes it to be appraised, the resulting
measurement entry will not contain the modsig because it is only fetched
during appraisal. When the appraisal rule triggers, it won't store a new
measurement containing the modsig because the file was already measured.

We need to detect that situation and store an additional measurement with
the modsig. This is done by defining the appraise subaction flag
IMA_READ_MEASURE and testing for it in process_measurement().

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 security/integrity/ima/ima.h  |  1 +
 security/integrity/ima/ima_api.c  |  9 +++-
 security/integrity/ima/ima_main.c | 17 ++--
 security/integrity/ima/ima_policy.c   | 59 ---
 security/integrity/ima/ima_template.c | 24 +++
 security/integrity/integrity.h|  9 ++--
 6 files changed, 107 insertions(+), 12 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 55f8ef65cab4..c163d9bf248c 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -147,6 +147,7 @@ int ima_init_crypto(void);
 void ima_putc(struct seq_file *m, void *data, int datalen);
 void ima_print_digest(struct seq_file *m, u8 *digest, u32 size);
 struct ima_template_desc *ima_template_desc_current(void);
+bool ima_template_has_sig(void);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_measurements_show(struct seq_file *m, void *v);
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 99dd1d53fc35..cb72c9b7d84b 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -289,7 +289,14 @@ void ima_store_measurement(struct integrity_iint_cache 
*iint,
xattr_len, NULL};
int violation = 0;
 
-   if (iint->measured_pcrs & (0x1 << pcr))
+   /*
+* We still need to store the measurement in the case of MODSIG because
+* we only have its contents to put in the list at the time of
+* appraisal, but a file measurement from earlier might already exist in
+* the measurement list.
+*/
+   if (iint->measured_pcrs & (0x1 << pcr) &&
+   (!xattr_value || xattr_value->type != IMA_MODSIG))
return;
 
result = ima_alloc_init_template(&event_data, &entry);
diff --git a/security/integrity/ima/ima_main.c 
b/security/integrity/ima/ima_main.c
index 448be1e00bab..072cfb061a29 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -289,9 +289,20 @@ static int process_measurement(struct file *file, const 
struct cred *cred,
 */
if (read_sig && iint->flags & IMA_MODSIG_ALLOWED &&
(xattr_len <= 0 || !ima_xattr_sig_known_key(func, xattr_value,
-   xattr_len)))
-   ima_read_collect_modsig(func, buf, size, &xattr_value,
-   &xattr_len);
+   xattr_len))) {
+   rc = ima_read_collect_modsig(func, buf, size, &xattr_value,
+&xattr_len);
+
+   /*
+* A file measurement might already exist in the measurement
+* list. Based on policy, include an additional file measurement
+* containing the appended signature and file hash, without the
+* appended signature (i.e., the 'd-sig' field).
+*/
+   if (!rc && iint->flags & IMA_READ_MEASURE &&
+   ima_template_has_sig())
+   action |= IMA_MEASURE;
+   }
 
rc = ima_collect_measurement(iint, file, buf, size, hash_algo);
if (rc != 0 && rc != -EBADF && rc != -EINVAL)
diff --git a/security/integrity/ima/ima_policy.c 
b/security/integrity/ima/ima_policy.c
index c38a63f56b7b..1cce69197235 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -10,6 +10,9 @@
  * - initialize default measure policy rules
  *
  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
@@ -369,7 +372,8 @@ static bool ima_match_rules(struct ima_rule_entry *rule, 
struct inode *inode,
  * In addition to knowing that we need to appraise the file in general,
  * we need to differentiate between calling hooks, for hook specific rules.
  */
-static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
+static int get_appraise_subaction(struct ima_rule_entry *rule,
+  

[PATCH v9 13/14] ima: Write modsig to the measurement list

2018-12-12 Thread Thiago Jung Bauermann
Add modsig support to the "sig" template field, allowing the contents
of the modsig to be included in the measurement list.

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 security/integrity/ima/ima.h  |  7 +++
 security/integrity/ima/ima_modsig.c   | 13 +
 security/integrity/ima/ima_template_lib.c | 15 ++-
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 40a6ddfdd9ea..55f8ef65cab4 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -318,6 +318,7 @@ int ima_read_collect_modsig(enum ima_hooks func, const void 
*buf,
int *xattr_len);
 int ima_get_modsig_hash(struct evm_ima_xattr_data *hdr, enum hash_algo *algo,
const u8 **hash, u8 *len);
+int ima_modsig_serialize_data(struct evm_ima_xattr_data **data, int *data_len);
 void ima_free_xattr_data(struct evm_ima_xattr_data *hdr);
 #else
 static inline bool ima_hook_supports_modsig(enum ima_hooks func)
@@ -340,6 +341,12 @@ static inline int ima_get_modsig_hash(struct 
evm_ima_xattr_data *hdr,
return -EOPNOTSUPP;
 }
 
+static inline int ima_modsig_serialize_data(struct evm_ima_xattr_data **data,
+   int *data_len)
+{
+   return -EOPNOTSUPP;
+}
+
 static inline void ima_free_xattr_data(struct evm_ima_xattr_data *hdr)
 {
kfree(hdr);
diff --git a/security/integrity/ima/ima_modsig.c 
b/security/integrity/ima/ima_modsig.c
index 587b79a9afef..0424f844c4c3 100644
--- a/security/integrity/ima/ima_modsig.c
+++ b/security/integrity/ima/ima_modsig.c
@@ -190,6 +190,19 @@ int ima_get_modsig_hash(struct evm_ima_xattr_data *hdr, 
enum hash_algo *algo,
return pkcs7_get_digest(modsig->pkcs7_msg, hash, len);
 }
 
+int ima_modsig_serialize_data(struct evm_ima_xattr_data **data, int *data_len)
+{
+   struct modsig_hdr *modsig = (struct modsig_hdr *) *data;
+
+   if (!*data || (*data)->type != IMA_MODSIG)
+   return -EINVAL;
+
+   *data = &modsig->raw_pkcs7;
+   *data_len = modsig->raw_pkcs7_len;
+
+   return 0;
+}
+
 int ima_modsig_verify(struct key *keyring, const void *hdr)
 {
const struct modsig_hdr *modsig = (const struct modsig_hdr *) hdr;
diff --git a/security/integrity/ima/ima_template_lib.c 
b/security/integrity/ima/ima_template_lib.c
index 36d175816894..417cd153ba60 100644
--- a/security/integrity/ima/ima_template_lib.c
+++ b/security/integrity/ima/ima_template_lib.c
@@ -411,10 +411,23 @@ int ima_eventsig_init(struct ima_event_data *event_data,
  struct ima_field_data *field_data)
 {
struct evm_ima_xattr_data *xattr_value = event_data->xattr_value;
+   int xattr_len = event_data->xattr_len;
 
if (!is_signed(xattr_value))
return 0;
 
-   return ima_write_template_field_data(xattr_value, event_data->xattr_len,
+   /*
+* The xattr_value for IMA_MODSIG is a runtime structure containing
+* pointers. Get its raw data instead.
+*/
+   if (xattr_value->type == IMA_MODSIG) {
+   int rc;
+
+   rc = ima_modsig_serialize_data(&xattr_value, &xattr_len);
+   if (rc)
+   return rc;
+   }
+
+   return ima_write_template_field_data(xattr_value, xattr_len,
 DATA_FMT_HEX, field_data);
 }



[PATCH v9 12/14] ima: Add new "d-sig" template field

2018-12-12 Thread Thiago Jung Bauermann
Define new "d-sig" template field which holds the digest that is expected
to match the one contained in the modsig.

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 Documentation/security/IMA-templates.rst  |  5 
 security/integrity/ima/ima.h  |  9 +++
 security/integrity/ima/ima_modsig.c   | 23 
 security/integrity/ima/ima_template.c |  4 ++-
 security/integrity/ima/ima_template_lib.c | 32 ++-
 security/integrity/ima/ima_template_lib.h |  2 ++
 6 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/Documentation/security/IMA-templates.rst 
b/Documentation/security/IMA-templates.rst
index 2cd0e273cc9a..f2a0f4225857 100644
--- a/Documentation/security/IMA-templates.rst
+++ b/Documentation/security/IMA-templates.rst
@@ -68,6 +68,11 @@ descriptors by adding their identifier to the format string
  - 'd-ng': the digest of the event, calculated with an arbitrary hash
algorithm (field format: [:]digest, where the digest
prefix is shown only if the hash algorithm is not SHA1 or MD5);
+ - 'd-sig': the digest of the event for files that have an appended modsig. 
This
+   field is calculated without including the modsig and thus will differ from
+   the total digest of the file, but it is what should match the digest
+   contained in the modsig (if it doesn't, the signature is invalid). It is
+   shown in the same format as 'd-ng';
  - 'n-ng': the name of the event, without size limitations;
  - 'sig': the file signature.
 
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 753d59352718..40a6ddfdd9ea 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -316,6 +316,8 @@ int ima_read_collect_modsig(enum ima_hooks func, const void 
*buf,
loff_t buf_len,
struct evm_ima_xattr_data **xattr_value,
int *xattr_len);
+int ima_get_modsig_hash(struct evm_ima_xattr_data *hdr, enum hash_algo *algo,
+   const u8 **hash, u8 *len);
 void ima_free_xattr_data(struct evm_ima_xattr_data *hdr);
 #else
 static inline bool ima_hook_supports_modsig(enum ima_hooks func)
@@ -331,6 +333,13 @@ static inline int ima_read_collect_modsig(enum ima_hooks 
func, const void *buf,
return -EOPNOTSUPP;
 }
 
+static inline int ima_get_modsig_hash(struct evm_ima_xattr_data *hdr,
+ enum hash_algo *algo, const u8 **hash,
+ u8 *len)
+{
+   return -EOPNOTSUPP;
+}
+
 static inline void ima_free_xattr_data(struct evm_ima_xattr_data *hdr)
 {
kfree(hdr);
diff --git a/security/integrity/ima/ima_modsig.c 
b/security/integrity/ima/ima_modsig.c
index f228f333509d..587b79a9afef 100644
--- a/security/integrity/ima/ima_modsig.c
+++ b/security/integrity/ima/ima_modsig.c
@@ -167,6 +167,29 @@ int ima_read_collect_modsig(enum ima_hooks func, const 
void *buf,
return rc;
 }
 
+int ima_get_modsig_hash(struct evm_ima_xattr_data *hdr, enum hash_algo *algo,
+   const u8 **hash, u8 *len)
+{
+   struct modsig_hdr *modsig = (typeof(modsig)) hdr;
+   const struct public_key_signature *pks;
+   int i;
+
+   if (!hdr || hdr->type != IMA_MODSIG)
+   return -EINVAL;
+
+   pks = pkcs7_get_message_sig(modsig->pkcs7_msg);
+   if (!pks)
+   return -EBADMSG;
+
+   for (i = 0; i < HASH_ALGO__LAST; i++)
+   if (!strcmp(hash_algo_name[i], pks->hash_algo))
+   break;
+
+   *algo = i;
+
+   return pkcs7_get_digest(modsig->pkcs7_msg, hash, len);
+}
+
 int ima_modsig_verify(struct key *keyring, const void *hdr)
 {
const struct modsig_hdr *modsig = (const struct modsig_hdr *) hdr;
diff --git a/security/integrity/ima/ima_template.c 
b/security/integrity/ima/ima_template.c
index b631b8bc7624..045ad508cbb8 100644
--- a/security/integrity/ima/ima_template.c
+++ b/security/integrity/ima/ima_template.c
@@ -43,8 +43,10 @@ static const struct ima_template_field supported_fields[] = {
 .field_show = ima_show_template_string},
{.field_id = "sig", .field_init = ima_eventsig_init,
 .field_show = ima_show_template_sig},
+   {.field_id = "d-sig", .field_init = ima_eventdigest_sig_init,
+.field_show = ima_show_template_digest_ng},
 };
-#define MAX_TEMPLATE_NAME_LEN 15
+#define MAX_TEMPLATE_NAME_LEN 24
 
 static struct ima_template_desc *ima_template;
 static struct ima_template_desc *lookup_template_desc(const char *name);
diff --git a/security/integrity/ima/ima_template_lib.c 
b/security/integrity/ima/ima_template_lib.c
index 300912914b17..36d175816894 100644
--- a/security/integrity/ima/ima_template_lib.c
+++ b/security/integrity/ima/ima_template_lib.c
@@ -222,7 +222,8 @@ int ima_parse_buf(void *bufstartp, void *bufendp, void 
**bufcurp,
return 0;
 }
 
-static int ima_eventdigest_init_common(u8 *dig

[PATCH v9 11/14] ima: Implement support for module-style appended signatures

2018-12-12 Thread Thiago Jung Bauermann
Implement the appraise_type=imasig|modsig option, allowing IMA to read and
verify modsig signatures.

In case a file has both an xattr signature and an appended modsig, IMA will
only use the appended signature if the key used by the xattr signature
isn't present in the IMA keyring.

Also enable building the sign-file tool when CONFIG_IMA_APPRAISE_MODSIG is
enabled, so that the user can sign files using this format.

Signed-off-by: Thiago Jung Bauermann 
---
 scripts/Makefile  |   4 +-
 security/integrity/digsig.c   |   3 +
 security/integrity/ima/Kconfig|   3 +
 security/integrity/ima/ima.h  |  31 -
 security/integrity/ima/ima_appraise.c |  68 ++-
 security/integrity/ima/ima_main.c |  18 ++-
 security/integrity/ima/ima_modsig.c   | 162 ++
 security/integrity/integrity.h|  10 ++
 8 files changed, 291 insertions(+), 8 deletions(-)

diff --git a/scripts/Makefile b/scripts/Makefile
index ece52ff20171..a2cf10661925 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -17,7 +17,9 @@ hostprogs-$(CONFIG_VT)   += conmakehash
 hostprogs-$(BUILD_C_RECORDMCOUNT) += recordmcount
 hostprogs-$(CONFIG_BUILDTIME_EXTABLE_SORT) += sortextable
 hostprogs-$(CONFIG_ASN1)+= asn1_compiler
-hostprogs-$(CONFIG_MODULE_SIG)  += sign-file
+ifneq ($(CONFIG_MODULE_SIG)$(CONFIG_IMA_APPRAISE_MODSIG),)
+hostprogs-y += sign-file
+endif
 hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
 hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
 
diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index bbfa3085d1b5..c5585e75d5d9 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -75,6 +75,9 @@ int integrity_digsig_verify(const unsigned int id, const char 
*sig, int siglen,
if (IS_ERR(keyring))
return PTR_ERR(keyring);
 
+   if (sig[0] == IMA_MODSIG)
+   return ima_modsig_verify(keyring, sig);
+
switch (sig[1]) {
case 1:
/* v1 API expect signature without xattr type */
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index bba19f9ea184..0fb542455698 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -234,6 +234,9 @@ config IMA_APPRAISE_BOOTPARAM
 config IMA_APPRAISE_MODSIG
bool "Support module-style signatures for appraisal"
depends on IMA_APPRAISE
+   depends on INTEGRITY_ASYMMETRIC_KEYS
+   select PKCS7_MESSAGE_PARSER
+   select MODULE_SIG_FORMAT
default n
help
   Adds support for signatures appended to files. The format of the
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 69c06e2d7bd6..753d59352718 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -156,7 +156,8 @@ void ima_init_template_list(void);
 
 static inline bool is_signed(const struct evm_ima_xattr_data *xattr_value)
 {
-   return xattr_value && xattr_value->type == EVM_IMA_XATTR_DIGSIG;
+   return xattr_value && (xattr_value->type == EVM_IMA_XATTR_DIGSIG ||
+  xattr_value->type == IMA_MODSIG);
 }
 
 /*
@@ -253,6 +254,9 @@ enum integrity_status ima_get_cache_status(struct 
integrity_iint_cache *iint,
   enum ima_hooks func);
 enum hash_algo ima_get_hash_algo(struct evm_ima_xattr_data *xattr_value,
 int xattr_len);
+bool ima_xattr_sig_known_key(enum ima_hooks func,
+const struct evm_ima_xattr_data *xattr_value,
+int xattr_len);
 int ima_read_xattr(struct dentry *dentry,
   struct evm_ima_xattr_data **xattr_value);
 
@@ -291,6 +295,13 @@ ima_get_hash_algo(struct evm_ima_xattr_data *xattr_value, 
int xattr_len)
return ima_hash_algo;
 }
 
+static inline bool ima_xattr_sig_known_key(enum ima_hooks func,
+  const struct evm_ima_xattr_data
+  *xattr_value, int xattr_len)
+{
+   return false;
+}
+
 static inline int ima_read_xattr(struct dentry *dentry,
 struct evm_ima_xattr_data **xattr_value)
 {
@@ -301,11 +312,29 @@ static inline int ima_read_xattr(struct dentry *dentry,
 
 #ifdef CONFIG_IMA_APPRAISE_MODSIG
 bool ima_hook_supports_modsig(enum ima_hooks func);
+int ima_read_collect_modsig(enum ima_hooks func, const void *buf,
+   loff_t buf_len,
+   struct evm_ima_xattr_data **xattr_value,
+   int *xattr_len);
+void ima_free_xattr_data(struct evm_ima_xattr_data *hdr);
 #else
 static inline bool ima_hook_supports_modsig(enum ima_hooks func)
 {
return false;
 }
+
+static inline int ima_read_collect_modsig(enum ima_hooks func, const void *buf,
+ 

[PATCH v9 10/14] ima: Add modsig appraise_type option for module-style appended signatures

2018-12-12 Thread Thiago Jung Bauermann
Introduce the modsig keyword to the IMA policy syntax to specify that
a given hook should expect the file to have the IMA signature appended
to it. Here is how it can be used in a rule:

appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig

With this rule, IMA will accept either a signature stored in the extended
attribute or an appended signature.

For now, the rule above will behave exactly the same as if
appraise_type=imasig was specified. The actual modsig implementation
will be introduced separately.

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 Documentation/ABI/testing/ima_policy |  6 +-
 security/integrity/ima/Kconfig   | 10 +
 security/integrity/ima/Makefile  |  1 +
 security/integrity/ima/ima.h |  9 
 security/integrity/ima/ima_modsig.c  | 31 
 security/integrity/ima/ima_policy.c  | 12 +--
 security/integrity/integrity.h   |  1 +
 7 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/Documentation/ABI/testing/ima_policy 
b/Documentation/ABI/testing/ima_policy
index 74c6702de74e..9d1dfd0a8891 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -37,7 +37,7 @@ Description:
euid:= decimal value
fowner:= decimal value
lsm:are LSM specific
-   option: appraise_type:= [imasig]
+   option: appraise_type:= [imasig] [imasig|modsig]
pcr:= decimal value
 
default policy:
@@ -103,3 +103,7 @@ Description:
 
measure func=KEXEC_KERNEL_CHECK pcr=4
measure func=KEXEC_INITRAMFS_CHECK pcr=5
+
+   Example of appraise rule allowing modsig appended signatures:
+
+   appraise func=KEXEC_KERNEL_CHECK 
appraise_type=imasig|modsig
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index a18f8c6d13b5..bba19f9ea184 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -231,6 +231,16 @@ config IMA_APPRAISE_BOOTPARAM
  This option enables the different "ima_appraise=" modes
  (eg. fix, log) from the boot command line.
 
+config IMA_APPRAISE_MODSIG
+   bool "Support module-style signatures for appraisal"
+   depends on IMA_APPRAISE
+   default n
+   help
+  Adds support for signatures appended to files. The format of the
+  appended signature is the same used for signed kernel modules.
+  The modsig keyword can be used in the IMA policy to allow a hook
+  to accept such signatures.
+
 config IMA_TRUSTED_KEYRING
bool "Require all keys on the .ima keyring be signed (deprecated)"
depends on IMA_APPRAISE && SYSTEM_TRUSTED_KEYRING
diff --git a/security/integrity/ima/Makefile b/security/integrity/ima/Makefile
index d921dc4f9eb0..31d57cdf2421 100644
--- a/security/integrity/ima/Makefile
+++ b/security/integrity/ima/Makefile
@@ -9,5 +9,6 @@ obj-$(CONFIG_IMA) += ima.o
 ima-y := ima_fs.o ima_queue.o ima_init.o ima_main.o ima_crypto.o ima_api.o \
 ima_policy.o ima_template.o ima_template_lib.o
 ima-$(CONFIG_IMA_APPRAISE) += ima_appraise.o
+ima-$(CONFIG_IMA_APPRAISE_MODSIG) += ima_modsig.o
 ima-$(CONFIG_HAVE_IMA_KEXEC) += ima_kexec.o
 obj-$(CONFIG_IMA_BLACKLIST_KEYRING) += ima_mok.o
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index f0bc2a182cbf..69c06e2d7bd6 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -299,6 +299,15 @@ static inline int ima_read_xattr(struct dentry *dentry,
 
 #endif /* CONFIG_IMA_APPRAISE */
 
+#ifdef CONFIG_IMA_APPRAISE_MODSIG
+bool ima_hook_supports_modsig(enum ima_hooks func);
+#else
+static inline bool ima_hook_supports_modsig(enum ima_hooks func)
+{
+   return false;
+}
+#endif /* CONFIG_IMA_APPRAISE_MODSIG */
+
 /* LSM based policy rules require audit */
 #ifdef CONFIG_IMA_LSM_RULES
 
diff --git a/security/integrity/ima/ima_modsig.c 
b/security/integrity/ima/ima_modsig.c
new file mode 100644
index ..08182bd7f445
--- /dev/null
+++ b/security/integrity/ima/ima_modsig.c
@@ -0,0 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * IMA support for appraising module-style appended signatures.
+ *
+ * Copyright (C) 2018  IBM Corporation
+ *
+ * Author:
+ * Thiago Jung Bauermann 
+ */
+
+#include "ima.h"
+
+/**
+ * ima_hook_supports_modsig - can the policy allow modsig for this hook?
+ *
+ * modsig is only supported by hooks using ima_post_read_file, because only 
they
+ * preload the contents of the file in a buffer. FILE_CHECK does that in some
+ * cases, but not when reached from vfs_open. POLICY_CHECK can support it, but
+ * it's not useful in practice because it's a text file so deny.
+ */
+bool ima_hook_supports_modsig(enum ima_hooks func)
+{
+   switch (func) {
+   case KEXEC_KERNEL_CHECK:
+   case KEXEC_INITRAMFS_C

[PATCH v9 09/14] ima: Export func_tokens

2018-12-12 Thread Thiago Jung Bauermann
ima_read_modsig() will need it so that it can show an error message.

Signed-off-by: Thiago Jung Bauermann 
---
 security/integrity/ima/ima.h|  2 ++
 security/integrity/ima/ima_policy.c | 12 ++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index e4f72b30cb28..f0bc2a182cbf 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -195,6 +195,8 @@ enum ima_hooks {
__ima_hooks(__ima_hook_enumify)
 };
 
+extern const char *const func_tokens[];
+
 /* LIM API function definitions */
 int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
   int mask, enum ima_hooks func, int *pcr);
diff --git a/security/integrity/ima/ima_policy.c 
b/security/integrity/ima/ima_policy.c
index d17a23b5c91d..b7ee342fbe4a 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -1138,6 +1138,12 @@ void ima_delete_rules(void)
}
 }
 
+#define __ima_hook_stringify(str)  (#str),
+
+const char *const func_tokens[] = {
+   __ima_hooks(__ima_hook_stringify)
+};
+
 #ifdef CONFIG_IMA_READ_POLICY
 enum {
mask_exec = 0, mask_write, mask_read, mask_append
@@ -1150,12 +1156,6 @@ static const char *const mask_tokens[] = {
"MAY_APPEND"
 };
 
-#define __ima_hook_stringify(str)  (#str),
-
-static const char *const func_tokens[] = {
-   __ima_hooks(__ima_hook_stringify)
-};
-
 void *ima_policy_start(struct seq_file *m, loff_t *pos)
 {
loff_t l = *pos;



[PATCH v9 08/14] ima: Introduce is_signed()

2018-12-12 Thread Thiago Jung Bauermann
With the introduction of another IMA signature type (modsig), some places
will need to check for both of them. It is cleaner to do that if there's a
helper function to tell whether an xattr_value represents an IMA
signature.

Suggested-by: Mimi Zohar 
Signed-off-by: Thiago Jung Bauermann 
---
 security/integrity/ima/ima.h  | 5 +
 security/integrity/ima/ima_appraise.c | 7 +++
 security/integrity/ima/ima_template_lib.c | 2 +-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index cc12f3449a72..e4f72b30cb28 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -154,6 +154,11 @@ unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 void ima_init_template_list(void);
 
+static inline bool is_signed(const struct evm_ima_xattr_data *xattr_value)
+{
+   return xattr_value && xattr_value->type == EVM_IMA_XATTR_DIGSIG;
+}
+
 /*
  * used to protect h_table and sha_table
  */
diff --git a/security/integrity/ima/ima_appraise.c 
b/security/integrity/ima/ima_appraise.c
index dcb8226972cf..085386c77b0b 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -335,15 +335,14 @@ int ima_appraise_measurement(enum ima_hooks func,
} else if (status != INTEGRITY_PASS) {
/* Fix mode, but don't replace file signatures. */
if ((ima_appraise & IMA_APPRAISE_FIX) &&
-   (!xattr_value ||
-xattr_value->type != EVM_IMA_XATTR_DIGSIG)) {
+   !is_signed(xattr_value)) {
if (!ima_fix_xattr(dentry, iint))
status = INTEGRITY_PASS;
}
 
/* Permit new files with file signatures, but without data. */
if (inode->i_size == 0 && iint->flags & IMA_NEW_FILE &&
-   xattr_value && xattr_value->type == EVM_IMA_XATTR_DIGSIG) {
+   is_signed(xattr_value)) {
status = INTEGRITY_PASS;
}
 
@@ -458,7 +457,7 @@ int ima_inode_setxattr(struct dentry *dentry, const char 
*xattr_name,
if (!xattr_value_len || (xvalue->type >= IMA_XATTR_LAST))
return -EINVAL;
ima_reset_appraise_flags(d_backing_inode(dentry),
-   xvalue->type == EVM_IMA_XATTR_DIGSIG);
+is_signed(xvalue));
result = 0;
}
return result;
diff --git a/security/integrity/ima/ima_template_lib.c 
b/security/integrity/ima/ima_template_lib.c
index 43752002c222..300912914b17 100644
--- a/security/integrity/ima/ima_template_lib.c
+++ b/security/integrity/ima/ima_template_lib.c
@@ -382,7 +382,7 @@ int ima_eventsig_init(struct ima_event_data *event_data,
 {
struct evm_ima_xattr_data *xattr_value = event_data->xattr_value;
 
-   if ((!xattr_value) || (xattr_value->type != EVM_IMA_XATTR_DIGSIG))
+   if (!is_signed(xattr_value))
return 0;
 
return ima_write_template_field_data(xattr_value, event_data->xattr_len,



[PATCH v9 07/14] integrity: Select CONFIG_KEYS instead of depending on it

2018-12-12 Thread Thiago Jung Bauermann
This avoids a dependency cycle in soon-to-be-introduced
CONFIG_IMA_APPRAISE_MODSIG: it will select CONFIG_MODULE_SIG_FORMAT
which in turn selects CONFIG_KEYS. Kconfig then complains that
CONFIG_INTEGRITY_SIGNATURE depends on CONFIG_KEYS.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Mimi Zohar 
---
 security/integrity/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
index 4b4d2aeef539..176905bef20a 100644
--- a/security/integrity/Kconfig
+++ b/security/integrity/Kconfig
@@ -17,8 +17,8 @@ if INTEGRITY
 
 config INTEGRITY_SIGNATURE
bool "Digital signature verification using multiple keyrings"
-   depends on KEYS
default n
+   select KEYS
select SIGNATURE
help
  This option enables digital signature verification support



[PATCH v9 06/14] integrity: Introduce asymmetric_sig_has_known_key()

2018-12-12 Thread Thiago Jung Bauermann
IMA will only look for a modsig if the xattr sig references a key which is
not in the expected kernel keyring. To that end, introduce
asymmetric_sig_has_known_key().

The logic of extracting the key used in the xattr sig is factored out from
asymmetric_verify() so that it can be used by the new function.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Mimi Zohar 
---
 security/integrity/digsig_asymmetric.c | 44 +++---
 security/integrity/integrity.h |  8 +
 2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/security/integrity/digsig_asymmetric.c 
b/security/integrity/digsig_asymmetric.c
index d775e03fbbcc..4c3c49f919f5 100644
--- a/security/integrity/digsig_asymmetric.c
+++ b/security/integrity/digsig_asymmetric.c
@@ -79,26 +79,48 @@ static struct key *request_asymmetric_key(struct key 
*keyring, uint32_t keyid)
return key;
 }
 
-int asymmetric_verify(struct key *keyring, const char *sig,
- int siglen, const char *data, int datalen)
+static struct key *asymmetric_key_from_sig(struct key *keyring, const char 
*sig,
+  int siglen)
 {
-   struct public_key_signature pks;
-   struct signature_v2_hdr *hdr = (struct signature_v2_hdr *)sig;
-   struct key *key;
-   int ret = -ENOMEM;
+   const struct signature_v2_hdr *hdr = (struct signature_v2_hdr *) sig;
 
if (siglen <= sizeof(*hdr))
-   return -EBADMSG;
+   return ERR_PTR(-EBADMSG);
 
siglen -= sizeof(*hdr);
 
if (siglen != be16_to_cpu(hdr->sig_size))
-   return -EBADMSG;
+   return ERR_PTR(-EBADMSG);
 
if (hdr->hash_algo >= HASH_ALGO__LAST)
-   return -ENOPKG;
+   return ERR_PTR(-ENOPKG);
+
+   return request_asymmetric_key(keyring, be32_to_cpu(hdr->keyid));
+}
+
+bool asymmetric_sig_has_known_key(struct key *keyring, const char *sig,
+ int siglen)
+{
+   struct key *key;
+
+   key = asymmetric_key_from_sig(keyring, sig, siglen);
+   if (IS_ERR_OR_NULL(key))
+   return false;
+
+   key_put(key);
+
+   return true;
+}
+
+int asymmetric_verify(struct key *keyring, const char *sig,
+ int siglen, const char *data, int datalen)
+{
+   struct public_key_signature pks;
+   struct signature_v2_hdr *hdr = (struct signature_v2_hdr *)sig;
+   struct key *key;
+   int ret = -ENOMEM;
 
-   key = request_asymmetric_key(keyring, be32_to_cpu(hdr->keyid));
+   key = asymmetric_key_from_sig(keyring, sig, siglen);
if (IS_ERR(key))
return PTR_ERR(key);
 
@@ -110,7 +132,7 @@ int asymmetric_verify(struct key *keyring, const char *sig,
pks.digest = (u8 *)data;
pks.digest_size = datalen;
pks.s = hdr->sig;
-   pks.s_size = siglen;
+   pks.s_size = siglen - sizeof(*hdr);
ret = verify_signature(key, &pks);
key_put(key);
pr_debug("%s() = %d\n", __func__, ret);
diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
index 6f657260a964..dec5ab8cf9e9 100644
--- a/security/integrity/integrity.h
+++ b/security/integrity/integrity.h
@@ -194,12 +194,20 @@ static inline int __init integrity_load_cert(const 
unsigned int id,
 #ifdef CONFIG_INTEGRITY_ASYMMETRIC_KEYS
 int asymmetric_verify(struct key *keyring, const char *sig,
  int siglen, const char *data, int datalen);
+bool asymmetric_sig_has_known_key(struct key *keyring, const char *sig,
+ int siglen);
 #else
 static inline int asymmetric_verify(struct key *keyring, const char *sig,
int siglen, const char *data, int datalen)
 {
return -EOPNOTSUPP;
 }
+
+static inline bool asymmetric_sig_has_known_key(struct key *keyring,
+   const char *sig, int siglen)
+{
+   return false;
+}
 #endif
 
 #ifdef CONFIG_IMA_LOAD_X509



[PATCH v9 05/14] integrity: Introduce integrity_keyring_from_id()

2018-12-12 Thread Thiago Jung Bauermann
IMA will need to obtain the keyring used to verify file signatures so that
it can verify the module-style signature appended to files.

Signed-off-by: Thiago Jung Bauermann 
Signed-off-by: Mimi Zohar 
---
 security/integrity/digsig.c| 28 +---
 security/integrity/integrity.h |  6 ++
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/security/integrity/digsig.c b/security/integrity/digsig.c
index 71c3200521d6..bbfa3085d1b5 100644
--- a/security/integrity/digsig.c
+++ b/security/integrity/digsig.c
@@ -44,11 +44,10 @@ static const char * const keyring_name[INTEGRITY_KEYRING_MAX] = {
 #define restrict_link_to_ima restrict_link_by_builtin_trusted
 #endif
 
-int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
-   const char *digest, int digestlen)
+struct key *integrity_keyring_from_id(const unsigned int id)
 {
-   if (id >= INTEGRITY_KEYRING_MAX || siglen < 2)
-   return -EINVAL;
+   if (id >= INTEGRITY_KEYRING_MAX)
+   return ERR_PTR(-EINVAL);
 
if (!keyring[id]) {
keyring[id] =
@@ -57,17 +56,32 @@ int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
int err = PTR_ERR(keyring[id]);
pr_err("no %s keyring: %d\n", keyring_name[id], err);
keyring[id] = NULL;
-   return err;
+   return ERR_PTR(err);
}
}
 
+   return keyring[id];
+}
+
+int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
+   const char *digest, int digestlen)
+{
+   struct key *keyring;
+
+   if (siglen < 2)
+   return -EINVAL;
+
+   keyring = integrity_keyring_from_id(id);
+   if (IS_ERR(keyring))
+   return PTR_ERR(keyring);
+
switch (sig[1]) {
case 1:
/* v1 API expect signature without xattr type */
-   return digsig_verify(keyring[id], sig + 1, siglen - 1,
+   return digsig_verify(keyring, sig + 1, siglen - 1,
 digest, digestlen);
case 2:
-   return asymmetric_verify(keyring[id], sig, siglen,
+   return asymmetric_verify(keyring, sig, siglen,
 digest, digestlen);
}
 
diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
index 2bf0fc51752b..6f657260a964 100644
--- a/security/integrity/integrity.h
+++ b/security/integrity/integrity.h
@@ -155,6 +155,7 @@ extern struct dentry *integrity_dir;
 
 #ifdef CONFIG_INTEGRITY_SIGNATURE
 
+struct key *integrity_keyring_from_id(const unsigned int id);
 int integrity_digsig_verify(const unsigned int id, const char *sig, int siglen,
const char *digest, int digestlen);
 
@@ -164,6 +165,11 @@ int __init integrity_load_cert(const unsigned int id, const char *source,
   const void *data, size_t len, key_perm_t perm);
 #else
 
+static inline struct key *integrity_keyring_from_id(const unsigned int id)
+{
+   return ERR_PTR(-EINVAL);
+}
+
 static inline int integrity_digsig_verify(const unsigned int id,
  const char *sig, int siglen,
  const char *digest, int digestlen)



[PATCH v9 04/14] integrity: Introduce struct evm_xattr

2018-12-12 Thread Thiago Jung Bauermann
Even though struct evm_ima_xattr_data includes a fixed-size array to hold a
SHA1 digest, most of the code ignores the array and uses the struct to mean
"type indicator followed by data of unspecified size" and tracks the real
size of what the struct represents in a separate length variable.

The only exception to that is the EVM code, which correctly uses the
definition of struct evm_ima_xattr_data.

So make this explicit in the code by removing the length specification from
the array in struct evm_ima_xattr_data. Also, change the name of the
element from digest to data since in most places the array doesn't hold a
digest.

A separate struct evm_xattr is introduced, with the original definition of
evm_ima_xattr_data to be used in the places that actually expect that
definition, specifically the EVM HMAC code.

Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Mimi Zohar 
---
 security/integrity/evm/evm_main.c | 8 
 security/integrity/ima/ima_appraise.c | 7 ---
 security/integrity/integrity.h| 6 ++
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/security/integrity/evm/evm_main.c b/security/integrity/evm/evm_main.c
index 7f3f54d89a6e..a1b42d10efc7 100644
--- a/security/integrity/evm/evm_main.c
+++ b/security/integrity/evm/evm_main.c
@@ -169,7 +169,7 @@ static enum integrity_status evm_verify_hmac(struct dentry *dentry,
/* check value type */
switch (xattr_data->type) {
case EVM_XATTR_HMAC:
-   if (xattr_len != sizeof(struct evm_ima_xattr_data)) {
+   if (xattr_len != sizeof(struct evm_xattr)) {
evm_status = INTEGRITY_FAIL;
goto out;
}
@@ -179,7 +179,7 @@ static enum integrity_status evm_verify_hmac(struct dentry *dentry,
   xattr_value_len, &digest);
if (rc)
break;
-   rc = crypto_memneq(xattr_data->digest, digest.digest,
+   rc = crypto_memneq(xattr_data->data, digest.digest,
   SHA1_DIGEST_SIZE);
if (rc)
rc = -EINVAL;
@@ -523,7 +523,7 @@ int evm_inode_init_security(struct inode *inode,
 const struct xattr *lsm_xattr,
 struct xattr *evm_xattr)
 {
-   struct evm_ima_xattr_data *xattr_data;
+   struct evm_xattr *xattr_data;
int rc;
 
if (!evm_key_loaded() || !evm_protected_xattr(lsm_xattr->name))
@@ -533,7 +533,7 @@ int evm_inode_init_security(struct inode *inode,
if (!xattr_data)
return -ENOMEM;
 
-   xattr_data->type = EVM_XATTR_HMAC;
+   xattr_data->data.type = EVM_XATTR_HMAC;
rc = evm_init_hmac(inode, lsm_xattr, xattr_data->digest);
if (rc < 0)
goto out;
diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
index f6ac405daabb..dcb8226972cf 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -167,7 +167,8 @@ enum hash_algo ima_get_hash_algo(struct evm_ima_xattr_data *xattr_value,
return sig->hash_algo;
break;
case IMA_XATTR_DIGEST_NG:
-   ret = xattr_value->digest[0];
+   /* first byte contains algorithm id */
+   ret = xattr_value->data[0];
if (ret < HASH_ALGO__LAST)
return ret;
break;
@@ -175,7 +176,7 @@ enum hash_algo ima_get_hash_algo(struct evm_ima_xattr_data *xattr_value,
/* this is for backward compatibility */
if (xattr_len == 21) {
unsigned int zero = 0;
-   if (!memcmp(&xattr_value->digest[16], &zero, 4))
+   if (!memcmp(&xattr_value->data[16], &zero, 4))
return HASH_ALGO_MD5;
else
return HASH_ALGO_SHA1;
@@ -274,7 +275,7 @@ int ima_appraise_measurement(enum ima_hooks func,
/* xattr length may be longer. md5 hash in previous
   version occupied 20 bytes in xattr, instead of 16
 */
-   rc = memcmp(&xattr_value->digest[hash_start],
+   rc = memcmp(&xattr_value->data[hash_start],
iint->ima_hash->digest,
iint->ima_hash->length);
else
diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
index 3517d2852a07..2bf0fc51752b 100644
--- a/security/integrity/integrity.h
+++ b/security/integrity/integrity.h
@@ -79,6 +79,12 @@ enum evm_ima_xattr_type {
 
 struct evm_ima_xattr_data {
u8 type;
+   u8 data[];
+} __packed;
+
+/* Only used in the EVM HMAC code. */
+struct evm_xattr {
+   struct evm_ima_xattr_data da

[PATCH v9 03/14] PKCS#7: Introduce pkcs7_get_digest()

2018-12-12 Thread Thiago Jung Bauermann
IMA will need to access the digest of the PKCS7 message (as calculated by
the kernel) before the signature is verified, so introduce
pkcs7_get_digest() for that purpose.

Also, modify pkcs7_digest() to detect when the digest was already
calculated so that it doesn't have to do redundant work. Verifying that
sinfo->sig->digest isn't NULL is sufficient because both places which
allocate sinfo->sig (pkcs7_parse_message() and pkcs7_note_signed_info())
use kzalloc() so sig->digest is always initialized to zero.

Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Mimi Zohar 
Cc: David Howells 
Cc: Herbert Xu 
Cc: "David S. Miller" 
---
 crypto/asymmetric_keys/pkcs7_verify.c | 27 +++
 include/crypto/pkcs7.h|  3 +++
 2 files changed, 30 insertions(+)

diff --git a/crypto/asymmetric_keys/pkcs7_verify.c b/crypto/asymmetric_keys/pkcs7_verify.c
index 97c77f66b20d..ccf80a7b7d9b 100644
--- a/crypto/asymmetric_keys/pkcs7_verify.c
+++ b/crypto/asymmetric_keys/pkcs7_verify.c
@@ -33,6 +33,10 @@ static int pkcs7_digest(struct pkcs7_message *pkcs7,
 
kenter(",%u,%s", sinfo->index, sinfo->sig->hash_algo);
 
+   /* The digest was calculated already. */
+   if (sig->digest)
+   return 0;
+
if (!sinfo->sig->hash_algo)
return -ENOPKG;
 
@@ -122,6 +126,29 @@ static int pkcs7_digest(struct pkcs7_message *pkcs7,
return ret;
 }
 
+int pkcs7_get_digest(struct pkcs7_message *pkcs7, const u8 **buf, u8 *len)
+{
+   struct pkcs7_signed_info *sinfo = pkcs7->signed_infos;
+   int ret;
+
+   /*
+* This function doesn't support messages with more than one signature.
+*/
+   if (sinfo == NULL || sinfo->next != NULL)
+   return -EBADMSG;
+
+   ret = pkcs7_digest(pkcs7, sinfo);
+   if (ret)
+   return ret;
+
+   if (buf)
+   *buf = sinfo->sig->digest;
+   if (len)
+   *len = sinfo->sig->digest_size;
+
+   return 0;
+}
+
 /*
  * Find the key (X.509 certificate) to use to verify a PKCS#7 message.  PKCS#7
  * uses the issuer's name and the issuing certificate serial number for
diff --git a/include/crypto/pkcs7.h b/include/crypto/pkcs7.h
index 6f51d0cb6d12..cfaea9c37f4a 100644
--- a/include/crypto/pkcs7.h
+++ b/include/crypto/pkcs7.h
@@ -46,4 +46,7 @@ extern int pkcs7_verify(struct pkcs7_message *pkcs7,
 extern int pkcs7_supply_detached_data(struct pkcs7_message *pkcs7,
  const void *data, size_t datalen);
 
+extern int pkcs7_get_digest(struct pkcs7_message *pkcs7, const u8 **buf,
+   u8 *len);
+
 #endif /* _CRYPTO_PKCS7_H */



[PATCH v9 02/14] PKCS#7: Refactor verify_pkcs7_signature() and add pkcs7_get_message_sig()

2018-12-12 Thread Thiago Jung Bauermann
IMA will need to verify a PKCS#7 message which has already been parsed. For
this reason, factor the verification code out of verify_pkcs7_signature()
into a new function which takes a struct pkcs7_message instead of a data
buffer.

In addition, IMA will need to know the key that signed a given PKCS#7
message, so add pkcs7_get_message_sig().

Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Mimi Zohar 
Cc: David Howells 
Cc: David Woodhouse 
Cc: Herbert Xu 
Cc: "David S. Miller" 
---
 certs/system_keyring.c| 61 ---
 crypto/asymmetric_keys/pkcs7_parser.c | 16 +++
 include/crypto/pkcs7.h|  2 +
 include/linux/verification.h  | 10 +
 4 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/certs/system_keyring.c b/certs/system_keyring.c
index 81728717523d..dd8c5ef941ce 100644
--- a/certs/system_keyring.c
+++ b/certs/system_keyring.c
@@ -191,33 +191,27 @@ late_initcall(load_system_certificate_list);
 #ifdef CONFIG_SYSTEM_DATA_VERIFICATION
 
 /**
- * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
+ * verify_pkcs7_message_sig - Verify a PKCS#7-based signature on system data.
  * @data: The data to be verified (NULL if expecting internal data).
  * @len: Size of @data.
- * @raw_pkcs7: The PKCS#7 message that is the signature.
- * @pkcs7_len: The size of @raw_pkcs7.
+ * @pkcs7: The PKCS#7 message that is the signature.
  * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
  * (void *)1UL for all trusted keys).
  * @usage: The use to which the key is being put.
  * @view_content: Callback to gain access to content.
  * @ctx: Context for callback.
  */
-int verify_pkcs7_signature(const void *data, size_t len,
-  const void *raw_pkcs7, size_t pkcs7_len,
-  struct key *trusted_keys,
-  enum key_being_used_for usage,
-  int (*view_content)(void *ctx,
-  const void *data, size_t len,
-  size_t asn1hdrlen),
-  void *ctx)
+int verify_pkcs7_message_sig(const void *data, size_t len,
+struct pkcs7_message *pkcs7,
+struct key *trusted_keys,
+enum key_being_used_for usage,
+int (*view_content)(void *ctx,
+const void *data, size_t len,
+size_t asn1hdrlen),
+void *ctx)
 {
-   struct pkcs7_message *pkcs7;
int ret;
 
-   pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
-   if (IS_ERR(pkcs7))
-   return PTR_ERR(pkcs7);
-
/* The data should be detached - so we need to supply it. */
if (data && pkcs7_supply_detached_data(pkcs7, data, len) < 0) {
pr_err("PKCS#7 signature with non-detached data\n");
@@ -259,6 +253,41 @@ int verify_pkcs7_signature(const void *data, size_t len,
}
 
 error:
+   pr_devel("<==%s() = %d\n", __func__, ret);
+   return ret;
+}
+
+/**
+ * verify_pkcs7_signature - Verify a PKCS#7-based signature on system data.
+ * @data: The data to be verified (NULL if expecting internal data).
+ * @len: Size of @data.
+ * @raw_pkcs7: The PKCS#7 message that is the signature.
+ * @pkcs7_len: The size of @raw_pkcs7.
+ * @trusted_keys: Trusted keys to use (NULL for builtin trusted keys only,
+ * (void *)1UL for all trusted keys).
+ * @usage: The use to which the key is being put.
+ * @view_content: Callback to gain access to content.
+ * @ctx: Context for callback.
+ */
+int verify_pkcs7_signature(const void *data, size_t len,
+  const void *raw_pkcs7, size_t pkcs7_len,
+  struct key *trusted_keys,
+  enum key_being_used_for usage,
+  int (*view_content)(void *ctx,
+  const void *data, size_t len,
+  size_t asn1hdrlen),
+  void *ctx)
+{
+   struct pkcs7_message *pkcs7;
+   int ret;
+
+   pkcs7 = pkcs7_parse_message(raw_pkcs7, pkcs7_len);
+   if (IS_ERR(pkcs7))
+   return PTR_ERR(pkcs7);
+
+   ret = verify_pkcs7_message_sig(data, len, pkcs7, trusted_keys, usage,
+  view_content, ctx);
+
pkcs7_free_message(pkcs7);
pr_devel("<==%s() = %d\n", __func__, ret);
return ret;
diff --git a/crypto/asymmetric_keys/pkcs7_parser.c b/crypto/asymmetric_keys/pkcs7_parser.c
index f0d56e1a8b7e..8df9693f659f 100644
--- a/crypto/asymmetric_keys/pkcs7_parser.c
+++ b/crypto/asymmetric_keys/pkcs7_parser.c
@@ -684,3 +684,19 @@ int pkcs7_note_signed_i

[PATCH v9 01/14] MODSIGN: Export module signature definitions

2018-12-12 Thread Thiago Jung Bauermann
IMA will use the module_signature format for appended signatures, so export
the relevant definitions and factor out the code which verifies that the
appended signature trailer is valid.

Also, create a CONFIG_MODULE_SIG_FORMAT option so that IMA can select it
and be able to use mod_check_sig() without having to depend on
CONFIG_MODULE_SIG.

Signed-off-by: Thiago Jung Bauermann 
Reviewed-by: Mimi Zohar 
Cc: Jessica Yu 
---
 include/linux/module.h   |  3 --
 include/linux/module_signature.h | 47 ++
 init/Kconfig |  6 ++-
 kernel/Makefile  |  2 +-
 kernel/module.c  |  1 +
 kernel/module_signing.c  | 82 ++--
 6 files changed, 91 insertions(+), 50 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..e49bbc5c66ef 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -25,9 +25,6 @@
 #include 
 #include 
 
-/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
-#define MODULE_SIG_STRING "~Module signature appended~\n"
-
 /* Not Yet Implemented */
 #define MODULE_SUPPORTED_DEVICE(name)
 
diff --git a/include/linux/module_signature.h b/include/linux/module_signature.h
new file mode 100644
index ..a3a629fc8c13
--- /dev/null
+++ b/include/linux/module_signature.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Module signature handling.
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowe...@redhat.com)
+ */
+
+#ifndef _LINUX_MODULE_SIGNATURE_H
+#define _LINUX_MODULE_SIGNATURE_H
+
+/* In stripped ARM and x86-64 modules, ~ is surprisingly rare. */
+#define MODULE_SIG_STRING "~Module signature appended~\n"
+
+enum pkey_id_type {
+   PKEY_ID_PGP,/* OpenPGP generated key ID */
+   PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
+   PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
+};
+
+/*
+ * Module signature information block.
+ *
+ * The constituents of the signature section are, in order:
+ *
+ * - Signer's name
+ * - Key identifier
+ * - Signature data
+ * - Information block
+ */
+struct module_signature {
+   u8  algo;   /* Public-key crypto algorithm [0] */
+   u8  hash;   /* Digest algorithm [0] */
+   u8  id_type;/* Key identifier type [PKEY_ID_PKCS7] */
+   u8  signer_len; /* Length of signer's name [0] */
+   u8  key_id_len; /* Length of key identifier [0] */
+   u8  __pad[3];
+   __be32  sig_len;/* Length of signature data */
+};
+
+struct load_info;
+
+int mod_check_sig(const struct module_signature *ms, size_t file_len,
+ const char *name);
+int mod_verify_sig(const void *mod, struct load_info *info);
+
+#endif /* _LINUX_MODULE_SIGNATURE_H */
diff --git a/init/Kconfig b/init/Kconfig
index a4112e95724a..cd31593525ee 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1864,7 +1864,7 @@ config MODULE_SRCVERSION_ALL
 config MODULE_SIG
bool "Module signature verification"
depends on MODULES
-   select SYSTEM_DATA_VERIFICATION
+   select MODULE_SIG_FORMAT
help
  Check modules for valid signatures upon load: the signature
  is simply appended to the module. For more information see
@@ -1879,6 +1879,10 @@ config MODULE_SIG
  debuginfo strip done by some packagers (such as rpmbuild) and
  inclusion into an initramfs that wants the module size reduced.
 
+config MODULE_SIG_FORMAT
+   def_bool n
+   select SYSTEM_DATA_VERIFICATION
+
 config MODULE_SIG_FORCE
bool "Require modules to be validly signed"
depends on MODULE_SIG
diff --git a/kernel/Makefile b/kernel/Makefile
index 7343b3a9bff0..e56842571348 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -59,7 +59,7 @@ obj-y += up.o
 endif
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
-obj-$(CONFIG_MODULE_SIG) += module_signing.o
+obj-$(CONFIG_MODULE_SIG_FORMAT) += module_signing.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_CRASH_CORE) += crash_core.o
diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..205c9eefd08d 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/kernel/module_signing.c b/kernel/module_signing.c
index f2075ce8e4b3..5624e59981b4 100644
--- a/kernel/module_signing.c
+++ b/kernel/module_signing.c
@@ -11,36 +11,44 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include "module-internal.h"
 
-enum pkey_id_type {
-   PKEY_ID_PGP,/* OpenPGP generated key ID */
-   PKEY_ID_X509,   /* X.509 arbitrary subjectKeyIdentifier */
-   PKEY_ID_PKCS7,  /* Signature in PKCS#7 message */
-};
-
-/*
- * Module si

[PATCH v9 00/14] Appended signatures support for IMA appraisal

2018-12-12 Thread Thiago Jung Bauermann
Hello,

This version is basically about tidying up the code to make it clearer. Most
of the changes are in patches 11 and 14.

There are two functional changes: one is modifying the list of hooks allowed
to use modsig to allow verifying signed modules and disallow verifying
firmware, and the other is to use the platform keyring as fallback for kexec
kernel verification.

The changelog below has the details. The patches apply on today's
linux-integrity/next-integrity.

Original cover letter:

On the OpenPOWER platform, secure boot and trusted boot are being
implemented using IMA for taking measurements and verifying signatures.
Since the kernel image on Power servers is an ELF binary, kernels are
signed using the scripts/sign-file tool and thus use the same signature
format as signed kernel modules.

This patch series adds support in IMA for verifying those signatures.
It adds flexibility to OpenPOWER secure boot, because it allows it to boot
kernels with the signature appended to them as well as kernels where the
signature is stored in the IMA extended attribute.

Changes since v8:
- Patch "MODSIGN: Export module signature definitions"
  - Renamed validate_module_sig() to mod_check_sig(). (Suggested by
Mimi Zohar).

- Patch "integrity: Introduce struct evm_xattr"
  - Added comment mentioning that the evm_xattr usage is limited to HMAC
before the structure definition. (Suggested by Mimi Zohar)

- Patch "ima: Add modsig appraise_type option for module-style appended
 signatures"
  - Added MODULE_CHECK to whitelist of hooks allowed to use modsig, and
removed FIRMWARE_CHECK. (Suggested by Mimi Zohar and James Morris)

- Patch "ima: Implement support for module-style appended signatures"
  - Moved call to ima_modsig_verify() from ima_appraise_measurement() to
integrity_digsig_verify(). (Suggested by Mimi Zohar)
  - Renamed ima_read_modsig() to ima_read_collect_modsig() and made it force
PKCS7 code to calculate the file hash. (Suggested by Mimi Zohar)
  - Build sign-file tool if IMA_APPRAISE_MODSIG is enabled.
  - Check whether the signing key is in the platform keyring as a fallback
for the KEXEC_KERNEL hook. (Suggested by Mimi Zohar)

- Patch "ima: Store the measurement again when appraising a modsig"
  - In process_measurement(), when a new measurement needs to be stored
re-add IMA_MEASURE flag when the modsig is read rather than changing the
if condition when calling ima_store_measurement(). (Suggested by Mimi
Zohar)
  - Check whether ima_template has "sig" and "d-sig" fields at
initialization rather than at the first time the check is needed.
(suggested by Mimi Zohar)

Changes since v7:
- Patch "MODSIGN: Export module signature definitions"
  - Added module name parameter to validate_module_sig() so that it can be
shown in error messages.

- Patch "integrity: Introduce struct evm_xattr"
  - Dropped use of struct evm_xattr in evm_update_evmxattr() and
evm_verify_hmac(). It's not needed there anymore because of changes
to support portable EVM signatures.

Changes since v6:

- Patch "PKCS#7: Introduce pkcs7_get_message_sig() and 
verify_pkcs7_message_sig()"
  - Retitled to "PKCS#7: Refactor verify_pkcs7_signature() and
add pkcs7_get_message_sig()"
  - Reworded description to clarify why the refactoring is needed.
The code is unchanged. (Suggested by Mimi Zohar)
  - Added Mimi Zohar's Reviewed-by.

- Patch "PKCS#7: Introduce pkcs7_get_digest()"
  - Added Mimi Zohar's Reviewed-by.

- Patch "integrity: Introduce integrity_keyring_from_id"
  - Added Mimi Zohar's Signed-off-by.

- Patch "integrity: Introduce asymmetric_sig_has_known_key()"
  - Added Mimi Zohar's Signed-off-by.

- Patch "integrity: Select CONFIG_KEYS instead of depending on it"
  - Added Mimi Zohar's Signed-off-by.

- Patch "ima: Introduce is_ima_sig()"
  - Renamed function to is_signed() (suggested by Mimi Zohar).

- Patch "ima: Add functions to read and verify a modsig signature"
  - Changed stubs for the !CONFIG_IMA_APPRAISE_MODSIG to return -EOPNOTSUPP
instead of -ENOTSUPP, since the latter isn't defined in uapi headers.
  - Moved functions to the patches which use them and dropped this patch
(suggested by Mimi Zohar).

- Patch "ima: Implement support for module-style appended signatures"
  - Prevent reading and writing of IMA_MODSIG xattr in ima_read_xattr()
and ima_inode_setxattr().
  - Simplify code in process_measurement() which decides whether to try
reading a modsig (suggested by Mimi Zohar).
  - Moved some functions from patch "ima: Add functions to read and verify
a modsig signature" into this patch.

- Patch "ima: Add new "d-sig" template field"
  - New patch containing code from patch "ima: Write modsig to the measurement 
list"
(Suggested by Mimi Zohar).

- Patch "ima: Write modsig to the measurement list"
  - Moved some functions from patch "ima: Add functions to read and verify
a modsig signature" into this patch.
  - Moved code relat

Re: [PATCH v03] powerpc/mobility: Fix node detach/rename problem

2018-12-12 Thread Frank Rowand
Hi Michael Bringmann,

On 12/11/18 8:07 AM, Rob Herring wrote:
> On Tue, Dec 11, 2018 at 7:29 AM Michael Ellerman  wrote:
>>
>> Hi Michael,
>>
>> Please Cc the device tree folks on device tree patches, and also the
>> original author of the patch that added the code you're modifying.
>>
>> So I've added:
>>   robh...@kernel.org
>>   frowand.l...@gmail.com
>>   devicet...@vger.kernel.org
>>   linux-ker...@vger.kernel.org
>>
>> Michael Bringmann  writes:
>>> The PPC mobility code receives RTAS requests to delete nodes with
>>> platform-/hardware-specific attributes when restarting the kernel
>>> after a migration.  My example is for migration between a P8 Alpine
>>> and a P8 Brazos.   Nodes to be deleted include 'ibm,random-v1',
>>> 'ibm,platform-facilities', 'ibm,sym-encryption-v1', and,
>>> 'ibm,compression-v1'.
>>>
>>> The mobility.c code calls 'of_detach_node' for the nodes and their
>>> children.  This makes calls to detach the properties and to remove
>>> the associated sysfs/kernfs files.
>>>
>>> Then new copies of the same nodes are next provided by the PHYP,
>>> local copies are built, and a pointer to the 'struct device_node'
>>> is passed to of_attach_node.  Before the call to of_attach_node,
>>> the phandle is initialized to 0 when the data structure is alloced.
>>> During the call to of_attach_node, it calls __of_attach_node which
>>> pulls the actual name and phandle from just created sub-properties
>>> named something like 'name' and 'ibm,phandle'.
>>>
>>> This is all fine for the first migration.  The problem occurs with
>>> the second and subsequent migrations when the PHYP on the new system
>>> wants to replace the same set of nodes again, referenced with the
>>> same names and phandle values.
>>>
>>> On the second and subsequent migrations, the PHYP tells the system
>>> to again delete the nodes 'ibm,platform-facilities', 'ibm,random-v1',
>>> 'ibm,compression-v1', 'ibm,sym-encryption-v1'.  It specifies these
>>> nodes by its known set of phandle values -- the same handles used
>>> by the PHYP on the source system are known on the target system.
>>> The mobility.c code calls of_find_node_by_phandle() with these values
>>> and ends up locating the first instance of each node that was added
>>> during the original boot, instead of the second instance of each node
>>> created after the first migration.  The detach during the second
>>> migration fails with errors like,
>>>
>>> [ 4565.030704] WARNING: CPU: 3 PID: 4787 at drivers/of/dynamic.c:252 
>>> __of_detach_node+0x8/0xa0
>>> [ 4565.030708] Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag 
>>> inet_diag unix_diag af_packet_diag netlink_diag lockd grace fscache sunrpc 
>>> xts vmx_crypto sg pseries_rng binfmt_misc ip_tables xfs libcrc32c sd_mod 
>>> ibmveth ibmvscsi scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
>>> [ 4565.030733] CPU: 3 PID: 4787 Comm: drmgr Tainted: GW 
>>> 4.18.0-rc1-wi107836-v05-120+ #201
>>> [ 4565.030737] NIP:  c07c1ea8 LR: c07c1fb4 CTR: 
>>> 00655170
>>> [ 4565.030741] REGS: c003f302b690 TRAP: 0700   Tainted: GW  
>>> (4.18.0-rc1-wi107836-v05-120+)
>>> [ 4565.030745] MSR:  80010282b033 
>>>   CR: 22288822  XER: 000a
>>> [ 4565.030757] CFAR: c07c1fb0 IRQMASK: 1
>>> [ 4565.030757] GPR00: c07c1fa4 c003f302b910 c114bf00 
>>> c0038e68
>>> [ 4565.030757] GPR04: 0001  80c008e0b4b8 
>>> 
>>> [ 4565.030757] GPR08:  0001 8003 
>>> 2843
>>> [ 4565.030757] GPR12: 8800 c0001ec9ae00 4000 
>>> 
>>> [ 4565.030757] GPR16:  0008  
>>> f6ff
>>> [ 4565.030757] GPR20: 0007  c003e9f1f034 
>>> 0001
>>> [ 4565.030757] GPR24:    
>>> 
>>> [ 4565.030757] GPR28: c1549d28 c1134828 c0038e68 
>>> c003f302b930
>>> [ 4565.030804] NIP [c07c1ea8] __of_detach_node+0x8/0xa0
>>> [ 4565.030808] LR [c07c1fb4] of_detach_node+0x74/0xd0
>>> [ 4565.030811] Call Trace:
>>> [ 4565.030815] [c003f302b910] [c07c1fa4] 
>>> of_detach_node+0x64/0xd0 (unreliable)
>>> [ 4565.030821] [c003f302b980] [c00c33c4] 
>>> dlpar_detach_node+0xb4/0x150
>>> [ 4565.030826] [c003f302ba10] [c00c3ffc] 
>>> delete_dt_node+0x3c/0x80
>>> [ 4565.030831] [c003f302ba40] [c00c4380] 
>>> pseries_devicetree_update+0x150/0x4f0
>>> [ 4565.030836] [c003f302bb70] [c00c479c] 
>>> post_mobility_fixup+0x7c/0xf0
>>> [ 4565.030841] [c003f302bbe0] [c00c4908] 
>>> migration_store+0xf8/0x130
>>> [ 4565.030847] [c003f302bc70] [c0998160] 
>>> kobj_attr_store+0x30/0x60
>>> [ 4565.030852] [c003f302bc90] [c0412f14] 
>>> sysfs_kf_write+0x64/0xa0
>>> [ 45

Re: [PATCH 0/2] sriov enablement on s390

2018-12-12 Thread Bjorn Helgaas
On Wed, Dec 05, 2018 at 02:45:14PM +0100, Sebastian Ott wrote:
> Hello Bjorn,
> 
> On Wed, 10 Oct 2018, Bjorn Helgaas wrote:
> > On Wed, Oct 10, 2018 at 02:55:07PM +0200, Sebastian Ott wrote:
> > > On Wed, 12 Sep 2018, Bjorn Helgaas wrote:
> > > > On Wed, Sep 12, 2018 at 02:34:09PM +0200, Sebastian Ott wrote:
> > > > > On s390 we currently handle SRIOV within firmware. Which means
> > > > > that the PF is under firmware control and not visible to operating
> > > > > systems. SRIOV enablement happens within firmware and VFs are
> > > > > passed through to logical partitions.
> > > > > 
> > > > > I'm working on a new mode where the PF is under operating system
> > > > > control (including SRIOV enablement). However we still need
> > > > > firmware support to access the VFs. The way this is supposed
> > > > > to work is that when firmware traps the SRIOV enablement it
> > > > > will present machine checks to the logical partition that
> > > > > triggered the SRIOV enablement and provide the VFs via hotplug
> > > > > events.
> > > > > 
> > > > > The problem I'm faced with is that the VF detection code in
> > > > > sriov_enable leads to unusable functions in s390.
> > > > 
> > > > We're moving away from the weak function implementation style.  Can
> > > > you take a look at Arnd's work here, which uses pci_host_bridge
> > > > callbacks instead?
> > > > 
> > > >   https://lkml.kernel.org/r/20180817102645.3839621-1-a...@arndb.de
> > > 
> > > What's the status of Arnd's patches - will they go upstream in the next
> > > couple of versions?
> > 
> > I hope so [1].  IIRC Arnd mentioned doing some minor updates, so I'm
> > waiting on that.
> > 
> > > What about my patches that I rebased on Arnd's branch
> > > will they be considered?
> > 
> > Definitely.  From my point of view they're just lined up behind Arnd's
> > patches.
> > 
> > [1] 
> > https://lore.kernel.org/linux-pci/20181002205903.gd120...@bhelgaas-glaptop.roam.corp.google.com
> 
> It appears like these patches are not in-line for the next merge window.
> Would it be possible to go with my original patches (using __weak
> functions)? (This would also make life easier with regards to backports)
> I can post patches to convert this to use function pointers once Arnd's
> patches make it to the kernel.

Yeah, sorry, I think we should just go with your original approach.

Can you repost those patches with minor changelog updates so that
"git log --oneline" on the files looks consistent?  Also, capitalize
"PCI", "VF", etc. consistently when used in English text.

Bjorn


Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags

2018-12-12 Thread Andrew Murray
On Wed, Dec 12, 2018 at 09:07:42AM +0100, Christoffer Dall wrote:
> On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote:
> > On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote:
> > > [ Reviving old thread. ]
> > > 
> > > Andrew Murray  writes:
> > > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote:
> > > >> Andrew Murray  writes:
> > > >> 
> > > >> > Update design.txt to reflect the presence of the exclude_host
> > > >> > and exclude_guest perf flags.
> > > >> >
> > > >> > Signed-off-by: Andrew Murray 
> > > >> > ---
> > > >> >  tools/perf/design.txt | 4 
> > > >> >  1 file changed, 4 insertions(+)
> > > >> >
> > > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> > > >> > index a28dca2..7de7d83 100644
> > > >> > --- a/tools/perf/design.txt
> > > >> > +++ b/tools/perf/design.txt
> > > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 
> > > >> > 'exclude_hv' bits provide a
> > > >> >  way to request that counting of events be restricted to times when 
> > > >> > the
> > > >> >  CPU is in user, kernel and/or hypervisor mode.
> > > >> >  
> > > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a 
> > > >> > way
> > > >> > +to request counting of events restricted to guest and host contexts 
> > > >> > when
> > > >> > +using virtualisation.
> > > >> 
> > > >> How does exclude_host differ from exclude_hv ?
> > > >
> > > > I believe exclude_host / exclude_guest are intended to distinguish
> > > > between host and guest in the hosted hypervisor context (KVM).
> > > 
> > > OK yeah, from the perf-list man page:
> > > 
> > >u - user-space counting
> > >k - kernel counting
> > >h - hypervisor counting
> > >I - non idle counting
> > >G - guest counting (in KVM guests)
> > >H - host counting (not in KVM guests)
> > > 
> > > > Whereas exclude_hv allows to distinguish between guest and
> > > > hypervisor in the bare-metal type hypervisors.
> > > 
> > > Except that's exactly not how we use them on powerpc :)
> > > 
> > > We use exclude_hv to exclude "the hypervisor", regardless of whether
> > > it's KVM or PowerVM (which is a bare-metal hypervisor).
> > > 
> > > We don't use exclude_host / exclude_guest at all, which I guess is a
> > > bug, except I didn't know they existed until this thread.
> > > 
> > > eg, in a KVM guest:
> > > 
> > >   $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done"
> > >   $ perf report -D | grep -Fc "dso: [hypervisor]"
> > >   16
> > > 
> > > 
> > > > In the case of arm64 - if VHE extensions are present then the host
> > > > kernel will run at a higher privilege to the guest kernel, in which
> > > > case there is no distinction between hypervisor and host so we ignore
> > > > exclude_hv. But where VHE extensions are not present then the host
> > > > kernel runs at the same privilege level as the guest and we use a
> > > > higher privilege level to switch between them - in this case we can
> > > > use exclude_hv to discount that hypervisor role of switching between
> > > > guests.
> > > 
> > > I couldn't find any arm64 perf code using exclude_host/guest at all?
> > 
> > Correct - but this is in flight as I am currently adding support for this
> > see [1].
> > 
> > > 
> > > And I don't see any x86 code using exclude_hv.
> > 
> > I can't find any either.
> > 
> > > 
> > > But maybe that's OK, I just worry this is confusing for users.
> > 
> > There is some extra context regarding this where exclude_guest/exclude_host
> > was added, see [2] and where exclude_hv was added, see [3]
> > 
> > Generally it seems that exclude_guest/exclude_host relies upon switching
> > counters off/on on guest/host switch code (which works well in the nested
> > virt case). Whereas exclude_hv tends to rely solely on hardware capability
> > based on privilege level (which works well in the bare metal case where
> > the guest doesn't run at same privilege as the host).
> > 
> > I think from the user perspective exclude_hv allows you to see your overhead
> > if you are a guest (i.e. work done by bare metal hypervisor associated with
> > you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to
> > see events above you (i.e. the kernel hypervisor) if you are the guest...
> > 
> > At least that's how I read this, I've copied in others that may provide
> > more authoritative feedback.
> > 
> > [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html
> > [2] https://www.spinics.net/lists/kvm/msg53996.html
> > [3] https://lore.kernel.org/patchwork/patch/143918/
> > 
> 
> I'll try to answer this in a different way, based on previous
> discussions with Joerg et al. who introduced these flags.  Assume no
> support for nested virtualization as a first approximation:
> 
>   If you are running as a guest:
> - exclude_hv: stop counting events when the hypervisor runs
> - exclude_host: has no effect
>  
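Christoffer's mapping of the flags can be sketched against `struct perf_event_attr` (a Linux-only illustration; the helper name `guest_only_cycles` is made up here, and only the exclusion bits themselves come from the thread):

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Build an attr that counts CPU cycles only while a KVM guest runs
 * (perf's "cycles:G" modifier).  exclude_hv is left clear; per the
 * discussion above, setting it would additionally stop counting while
 * a bare-metal hypervisor (e.g. PowerVM) runs. */
static struct perf_event_attr guest_only_cycles(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_host = 1;	/* "G": drop events from host context */
	return attr;
}
```

The attr would then be handed to perf_event_open(); what each bit actually excludes on a given architecture is what the rest of this thread debates.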

Re: [PATCH] PCI/AER: only insert one element into kfifo

2018-12-12 Thread Keith Busch
On Wed, Dec 12, 2018 at 04:32:30PM +0800, Yanjiang Jin wrote:
> 'commit ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to
> insert locked elements")' replaced kfifo_put() with kfifo_in_spinlocked().
> 
> But as "kfifo_in(fifo, buf, n)" describes:
> " * @n: number of elements to be added".
> 
> We want to insert only one element into kfifo, not "sizeof(entry) = 16".
> Without this patch, we would get 15 uninitialized elements.
> 
> Signed-off-by: Yanjiang Jin 

My bad. I had trouble testing the GHES path for this.  Thanks for the fix.

Reviewed-by: Keith Busch 

> ---
>  drivers/pci/pcie/aer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a90a919..fed29de 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> .regs   = aer_regs,
> };
> 
> -   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry),
> +   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1,
>  &aer_recover_ring_lock))
> schedule_work(&aer_recover_work);
> else
> --
> 1.8.3.1
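The element-count semantics behind the fix can be sketched with a small userspace model — `fifo_in()` and `struct entry` are hypothetical stand-ins, not the kernel kfifo API; the 16-byte element mirrors the "sizeof(entry) = 16" quoted above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element type, sized like struct aer_recover_entry
 * (16 bytes on 64-bit), purely for illustration. */
struct entry {
	uint8_t  bus;
	uint8_t  devfn;
	uint16_t domain;
	int      severity;
	void    *regs;
};

#define RING_SIZE 16
static struct entry ring[RING_SIZE];
static size_t ring_len;

/* Like kfifo_in(), @n counts ELEMENTS, not bytes: the element size was
 * fixed when the fifo was declared.  Passing sizeof(entry) here would
 * request 16 elements and copy 15 uninitialized ones past buf[0]. */
static size_t fifo_in(const struct entry *buf, size_t n)
{
	size_t copied = 0;

	while (copied < n && ring_len < RING_SIZE)
		ring[ring_len++] = buf[copied++];
	return copied;
}
```

With `n = sizeof(entry)` the loop would keep copying past the single element the caller provided, which is exactly the over-read the patch removes.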


Re: use generic DMA mapping code in powerpc V4

2018-12-12 Thread Christian Zigotzky
Hi Christoph,

Thanks a lot for your reply. I will test your patches tomorrow.

Cheers,
Christian

Sent from my iPhone

> On 12. Dec 2018, at 15:15, Christoph Hellwig  wrote:
> 
> Thanks for bisecting.  I've spent some time going over the conversion
> but can't really pinpoint it.  I have three little patches that switch
> parts of the code to the generic version.  This is on top of the
> last good commit (977706f9755d2d697aa6f45b4f9f0e07516efeda).
> 
> Can you check with which one things stop working?
> 
> 
> <0001-get_required_mask.patch>
> <0002-swiotlb-dma_supported.patch>
> <0003-nommu-dma_supported.patch>
> <0004-alloc-free.patch>


Re: [PATCH 12/34] powerpc/cell: move dma direct window setup out of dma_configure

2018-12-12 Thread Christoph Hellwig
On Sun, Dec 09, 2018 at 09:23:39PM +1100, Michael Ellerman wrote:
> Christoph Hellwig  writes:
> 
> > Configure the dma settings at device setup time, and stop playing games
> > with get_pci_dma_ops.  This prepares for using the common dma_configure
> > code later on.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  arch/powerpc/platforms/cell/iommu.c | 20 +++-
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> This one's crashing, haven't dug into why yet:

Can you provide a gdb disassembly of the exact crash site?  It looks
like, for some odd reason, the DT structures aren't fully set up by the
time we are probing the device.

Either way, something like the patch below would ensure we call
cell_iommu_get_fixed_address() from a similar context as before.  Can
you check if that fixes the issue?

diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 93c7e4aef571..4891b338bf9f 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -569,19 +569,12 @@ static struct iommu_table *cell_get_iommu_table(struct device *dev)
return &window->table;
 }
 
-static u64 cell_iommu_get_fixed_address(struct device *dev);
-
 static void cell_dma_dev_setup(struct device *dev)
 {
-   if (cell_iommu_enabled) {
-   u64 addr = cell_iommu_get_fixed_address(dev);
-
-   if (addr != OF_BAD_ADDR)
-   set_dma_offset(dev, addr + dma_iommu_fixed_base);
+   if (cell_iommu_enabled)
set_iommu_table_base(dev, cell_get_iommu_table(dev));
-   } else {
+   else
set_dma_offset(dev, cell_dma_nommu_offset);
-   }
 }
 
 static void cell_pci_dma_dev_setup(struct pci_dev *dev)
@@ -865,8 +858,16 @@ static u64 cell_iommu_get_fixed_address(struct device *dev)
 
 static bool cell_pci_iommu_bypass_supported(struct pci_dev *pdev, u64 mask)
 {
-   return mask == DMA_BIT_MASK(64) &&
-   cell_iommu_get_fixed_address(&pdev->dev) != OF_BAD_ADDR;
+   if (mask == DMA_BIT_MASK(64)) {
+   u64 addr = cell_iommu_get_fixed_address(&pdev->dev);
+
+   if (addr != OF_BAD_ADDR) {
+   set_dma_offset(&pdev->dev, dma_iommu_fixed_base + addr);
+   return true;
+   }
+   }
+
+   return false;
 }
 
 static void insert_16M_pte(unsigned long addr, unsigned long *ptab,


Re: use generic DMA mapping code in powerpc V4

2018-12-12 Thread Christoph Hellwig
Thanks for bisecting.  I've spent some time going over the conversion
but can't really pinpoint it.  I have three little patches that switch
parts of the code to the generic version.  This is on top of the
last good commit (977706f9755d2d697aa6f45b4f9f0e07516efeda).

Can you check with which one things stop working?


>From 83a4b87de6bc6a75b500c9959de88e2157fbcd7c Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Wed, 12 Dec 2018 15:07:49 +0100
Subject: get_required_mask

---
 arch/powerpc/kernel/dma-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 5b15e53ee43d..2e682004959f 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -152,7 +152,7 @@ u64 dma_iommu_get_required_mask(struct device *dev)
 		return 0;
 
 	if (dev_is_pci(dev)) {
-		u64 bypass_mask = dma_nommu_get_required_mask(dev);
+		u64 bypass_mask = dma_direct_get_required_mask(dev);
 
 		if (dma_iommu_bypass_supported(dev, bypass_mask))
 			return bypass_mask;
-- 
2.19.2

>From c2579a3619575397929781a14895966cbc1d217b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Wed, 12 Dec 2018 15:08:52 +0100
Subject: swiotlb dma_supported

---
 arch/powerpc/kernel/dma-swiotlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index aa11625c6691..52ee531c1a0d 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -36,7 +36,7 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = {
 	.free = __dma_nommu_free_coherent,
 	.map_sg = swiotlb_map_sg_attrs,
 	.unmap_sg = swiotlb_unmap_sg_attrs,
-	.dma_supported = swiotlb_dma_supported,
+	.dma_supported = dma_direct_supported,
 	.map_page = swiotlb_map_page,
 	.unmap_page = swiotlb_unmap_page,
 	.sync_single_for_cpu = swiotlb_sync_single_for_cpu,
-- 
2.19.2

>From 0105db9e6d8d031b4295116630fd0318fd146737 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Wed, 12 Dec 2018 15:10:36 +0100
Subject: nommu dma_supported

---
 arch/powerpc/kernel/dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index a6590aa77181..f53d11d35230 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -179,7 +179,7 @@ const struct dma_map_ops dma_nommu_ops = {
 	.alloc= __dma_nommu_alloc_coherent,
 	.free= __dma_nommu_free_coherent,
 	.map_sg= dma_nommu_map_sg,
-	.dma_supported			= dma_nommu_dma_supported,
+	.dma_supported			= dma_direct_supported,
 	.map_page			= dma_nommu_map_page,
 #ifdef CONFIG_NOT_COHERENT_CACHE
 	.sync_single_for_cpu 		= dma_nommu_sync_single,
-- 
2.19.2

>From 4c5dd4d4a4b4e63be722fd29ada896c5962072b8 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Wed, 12 Dec 2018 15:11:38 +0100
Subject: alloc/free

---
 arch/powerpc/kernel/dma.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index f53d11d35230..d3db6d879559 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -176,8 +176,13 @@ static inline void dma_nommu_sync_single(struct device *dev,
 #endif
 
 const struct dma_map_ops dma_nommu_ops = {
+#ifdef CONFIG_NOT_COHERENT_CACHE
 	.alloc= __dma_nommu_alloc_coherent,
 	.free= __dma_nommu_free_coherent,
+#else
+	.alloc= dma_direct_alloc,
+	.free= dma_direct_free,
+#endif
 	.map_sg= dma_nommu_map_sg,
 	.dma_supported			= dma_direct_supported,
 	.map_page			= dma_nommu_map_page,
-- 
2.19.2



[PATCH 07/11] powerpc/fsl: Flush the branch predictor at each kernel entry (32 bit)

2018-12-12 Thread Diana Craciun
In order to protect against speculation attacks on
indirect branches, the branch predictor is flushed at
kernel entry to protect for the following situations:
- userspace process attacking another userspace process
- userspace process attacking the kernel
Basically, when the privilege level changes (i.e. the kernel
is entered), the branch predictor state is flushed.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kernel/head_booke.h |  6 ++
 arch/powerpc/kernel/head_fsl_booke.S | 15 +++
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index d0862a1..15ac510 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -43,6 +43,9 @@
andi.   r11, r11, MSR_PR;   /* check whether user or kernel*/\
mr  r11, r1; \
beq 1f;  \
+START_BTB_FLUSH_SECTION\
+   BTB_FLUSH(r11)  \
+END_BTB_FLUSH_SECTION  \
/* if from user, start at top of this thread's kernel stack */   \
lwz r11, THREAD_INFO-THREAD(r10);\
ALLOC_STACK_FRAME(r11, THREAD_SIZE); \
@@ -128,6 +131,9 @@
stw r9,_CCR(r8);/* save CR on stack*/\
mfspr   r11,exc_level_srr1; /* check whether user or kernel*/\
DO_KVM  BOOKE_INTERRUPT_##intno exc_level_srr1;  \
+START_BTB_FLUSH_SECTION\
+   BTB_FLUSH(r10)  \
+END_BTB_FLUSH_SECTION  \
andi.   r11,r11,MSR_PR;  \
mfspr   r11,SPRN_SPRG_THREAD;   /* if from user, start at top of   */\
lwz r11,THREAD_INFO-THREAD(r11); /* this thread's kernel stack */\
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index e2750b8..2386ce2 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -453,6 +453,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
mfcrr13
stw r13, THREAD_NORMSAVE(3)(r10)
DO_KVM  BOOKE_INTERRUPT_DTLB_MISS SPRN_SRR1
+START_BTB_FLUSH_SECTION
+   mfspr r11, SPRN_SRR1
+   andi. r10,r11,MSR_PR
+   beq 1f
+   BTB_FLUSH(r10)
+1:
+END_BTB_FLUSH_SECTION
mfspr   r10, SPRN_DEAR  /* Get faulting address */
 
/* If we are faulting a kernel address, we have to use the
@@ -547,6 +554,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
mfcrr13
stw r13, THREAD_NORMSAVE(3)(r10)
DO_KVM  BOOKE_INTERRUPT_ITLB_MISS SPRN_SRR1
+START_BTB_FLUSH_SECTION
+   mfspr r11, SPRN_SRR1
+   andi. r10,r11,MSR_PR
+   beq 1f
+   BTB_FLUSH(r10)
+1:
+END_BTB_FLUSH_SECTION
+
mfspr   r10, SPRN_SRR0  /* Get faulting address */
 
/* If we are faulting a kernel address, we have to use the
-- 
2.5.5



[PATCH 04/11] powerpc/fsl: Emulate SPRN_BUCSR register

2018-12-12 Thread Diana Craciun
In order to flush the branch predictor, the guest kernel
performs writes to the BUCSR register, which is hypervisor
privileged. However, the branch predictor is already flushed
at each KVM entry, so just return to the guest as soon as
possible.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kvm/e500_emulate.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 3f8189e..d0eb670 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -276,6 +276,11 @@ int kvmppc_core_emulate_mtspr_e500(struct kvm_vcpu *vcpu, int sprn, ulong spr_va
 */
vcpu->arch.pwrmgtcr0 = spr_val;
break;
+   /* if we are here, it means that we have already flushed the
+* branch predictor, so just return to guest
+*/
+   case SPRN_BUCSR:
+   break;
 
/* extra exceptions */
 #ifdef CONFIG_SPE_POSSIBLE
-- 
2.5.5



[PATCH 00/11] powerpc/fsl: NXP PowerPC Spectre variant 2 workarounds

2018-12-12 Thread Diana Craciun
Implement Spectre variant 2 workarounds for NXP PowerPC Book3E
processors.

Diana Craciun (11):
  Add infrastructure to fixup branch predictor flush
  Add macro to flush the branch predictor
  Fix spectre_v2 mitigations reporting
  Emulate SPRN_BUCSR register
  Add nospectre_v2 command line argument
  Flush the branch predictor at each kernel entry (64bit)
  Flush the branch predictor at each kernel entry (32 bit)
  Flush branch predictor when entering KVM
  Enable runtime patching if nospectre_v2 boot arg is used
  Update Spectre v2 reporting
  Add FSL_PPC_BOOK3E as supported arch for nospectre_v2 boot arg

 Documentation/admin-guide/kernel-parameters.txt |  2 +-
 arch/powerpc/include/asm/feature-fixups.h   | 12 +++
 arch/powerpc/include/asm/ppc_asm.h  | 10 +
 arch/powerpc/include/asm/setup.h|  7 +++
 arch/powerpc/kernel/entry_64.S  |  5 +
 arch/powerpc/kernel/exceptions-64e.S| 26 ++-
 arch/powerpc/kernel/head_booke.h|  6 ++
 arch/powerpc/kernel/head_fsl_booke.S| 15 +
 arch/powerpc/kernel/security.c  | 28 +++--
 arch/powerpc/kernel/setup-common.c  |  1 +
 arch/powerpc/kernel/vmlinux.lds.S   |  8 +++
 arch/powerpc/kvm/bookehv_interrupts.S   |  4 
 arch/powerpc/kvm/e500_emulate.c |  5 +
 arch/powerpc/lib/feature-fixups.c   | 21 +++
 arch/powerpc/mm/tlb_low_64e.S   |  7 +++
 15 files changed, 153 insertions(+), 4 deletions(-)

-- 
2.5.5



[PATCH 02/11] powerpc/fsl: Add macro to flush the branch predictor

2018-12-12 Thread Diana Craciun
The BUCSR register can be used to invalidate the entries in the
branch prediction mechanisms.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/include/asm/ppc_asm.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index b5d0236..5c901bf 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -821,4 +821,14 @@ END_FTR_SECTION_IFCLR(CPU_FTR_601)
stringify_in_c(.long (_target) - . ;)   \
stringify_in_c(.previous)
 
+#ifdef CONFIG_PPC_FSL_BOOK3E
+#define BTB_FLUSH(reg) \
+   lis reg,BUCSR_INIT@h;   \
+   ori reg,reg,BUCSR_INIT@l;   \
+   mtspr SPRN_BUCSR,reg;   \
+   isync;
+#else
+#define BTB_FLUSH(reg)
+#endif /* CONFIG_PPC_FSL_BOOK3E */
+
 #endif /* _ASM_POWERPC_PPC_ASM_H */
-- 
2.5.5



[PATCH 10/11] powerpc/fsl: Update Spectre v2 reporting

2018-12-12 Thread Diana Craciun
Report branch predictor state flush as a mitigation for
Spectre variant 2.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kernel/security.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 4393a38..861fab3 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -212,8 +212,11 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, c
 
if (count_cache_flush_type == COUNT_CACHE_FLUSH_HW)
seq_buf_printf(&s, "(hardware accelerated)");
-   } else
+   } else if (btb_flush_enabled) {
+   seq_buf_printf(&s, "Mitigation: Branch predictor state flush");
+   } else {
seq_buf_printf(&s, "Vulnerable");
+   }
 
seq_buf_printf(&s, "\n");
 
-- 
2.5.5



[PATCH 09/11] powerpc/fsl: Enable runtime patching if nospectre_v2 boot arg is used

2018-12-12 Thread Diana Craciun
If the user chooses not to use the mitigations, replace
the code sequence with nops.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kernel/setup-common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 93ee370..f27eeda 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -974,6 +974,7 @@ void __init setup_arch(char **cmdline_p)
ppc_md.setup_arch();
 
setup_barrier_nospec();
+   setup_spectre_v2();
 
paging_init();
 
-- 
2.5.5



[PATCH 11/11] powerpc/fsl: Add FSL_PPC_BOOK3E as supported arch for nospectre_v2 boot arg

2018-12-12 Thread Diana Craciun
Signed-off-by: Diana Craciun 
---
 Documentation/admin-guide/kernel-parameters.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aefd358..cf6b4c5 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2827,7 +2827,7 @@
check bypass). With this option data leaks are possible
in the system.
 
-   nospectre_v2[X86] Disable all mitigations for the Spectre variant 2
+   nospectre_v2[X86,PPC_FSL_BOOK3E] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.
-- 
2.5.5



[PATCH 01/11] powerpc/fsl: Add infrastructure to fixup branch predictor flush

2018-12-12 Thread Diana Craciun
In order to protect against speculation attacks (Spectre
variant 2) on NXP PowerPC platforms, the branch predictor
should be flushed when the privilege level is changed.
This patch adds the infrastructure to fix up at runtime
the code sections that perform the branch predictor flush,
depending on a boot argument which is added later in a
separate patch.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/include/asm/feature-fixups.h | 12 
 arch/powerpc/include/asm/setup.h  |  2 ++
 arch/powerpc/kernel/vmlinux.lds.S |  8 
 arch/powerpc/lib/feature-fixups.c | 21 +
 4 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h
index 33b6f9c..40a6c926 100644
--- a/arch/powerpc/include/asm/feature-fixups.h
+++ b/arch/powerpc/include/asm/feature-fixups.h
@@ -221,6 +221,17 @@ label##3:  \
FTR_ENTRY_OFFSET 953b-954b; \
.popsection;
 
+#define START_BTB_FLUSH_SECTION\
+955:   \
+
+#define END_BTB_FLUSH_SECTION  \
+956:   \
+   .pushsection __btb_flush_fixup,"a"; \
+   .align 2;   \
+957:   \
+   FTR_ENTRY_OFFSET 955b-957b; \
+   FTR_ENTRY_OFFSET 956b-957b; \
+   .popsection;
 
 #ifndef __ASSEMBLY__
 #include 
@@ -230,6 +241,7 @@ extern long __start___stf_entry_barrier_fixup, __stop___stf_entry_barrier_fixup;
 extern long __start___stf_exit_barrier_fixup, __stop___stf_exit_barrier_fixup;
 extern long __start___rfi_flush_fixup, __stop___rfi_flush_fixup;
 extern long __start___barrier_nospec_fixup, __stop___barrier_nospec_fixup;
+extern long __start__btb_flush_fixup, __stop__btb_flush_fixup;
 
 void apply_feature_fixups(void);
 void setup_feature_keys(void);
diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 1fffbba..c941c8c 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -67,6 +67,8 @@ void do_barrier_nospec_fixups_range(bool enable, void *start, void *end);
 static inline void do_barrier_nospec_fixups_range(bool enable, void *start, void *end) { };
 #endif
 
+void do_btb_flush_fixups(void);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_SETUP_H */
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 434581b..254b757 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -170,6 +170,14 @@ SECTIONS
}
 #endif /* CONFIG_PPC_BARRIER_NOSPEC */
 
+#ifdef CONFIG_PPC_FSL_BOOK3E
+   . = ALIGN(8);
+   __spec_btb_flush_fixup : AT(ADDR(__spec_btb_flush_fixup) - LOAD_OFFSET) {
+   __start__btb_flush_fixup = .;
+   *(__btb_flush_fixup)
+   __stop__btb_flush_fixup = .;
+   }
+#endif
EXCEPTION_TABLE(0)
 
NOTES :kernel :notes
diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c
index e613b02..02a213c 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -347,6 +347,27 @@ void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_
 
printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
+static void patch_btb_flush_section(long *curr)
+{
+   unsigned int *start, *end;
+
+   start = (void *)curr + *curr;
+   end = (void *)curr + *(curr + 1);
+   for (; start < end; start++) {
+   pr_devel("patching dest %lx\n", (unsigned long)start);
+   patch_instruction(start, PPC_INST_NOP);
+   }
+}
+void do_btb_flush_fixups(void)
+{
+   long *start, *end;
+
+   start = PTRRELOC(&__start__btb_flush_fixup);
+   end = PTRRELOC(&__stop__btb_flush_fixup);
+
+   for (; start < end; start += 2)
+   patch_btb_flush_section(start);
+}
 #endif /* CONFIG_PPC_FSL_BOOK3E */
 
 void do_lwsync_fixups(unsigned long value, void *fixup_start, void *fixup_end)
-- 
2.5.5
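The fixup walk in this patch can be modelled in userspace — a sketch only, with `patch_instruction()` replaced by a plain store and a hand-built one-entry table standing in for the `__btb_flush_fixup` linker section:

```c
#include <assert.h>
#include <stdint.h>

#define PPC_INST_NOP 0x60000000u	/* "ori 0,0,0" */

/* Fake "text" to patch, and a one-entry fixup table.  As in the patch,
 * each table entry is a pair of offsets relative to the entry itself,
 * giving the [start, end) instruction range to overwrite with nops. */
static uint32_t text[4] = { 0x11111111, 0x22222222, 0x33333333, 0x44444444 };
static long fixup[2];

/* Same walk as the kernel's patch_btb_flush_section(), with
 * patch_instruction() replaced by a plain store. */
static void patch_btb_flush_section(long *curr)
{
	uint32_t *start = (uint32_t *)((char *)curr + curr[0]);
	uint32_t *end   = (uint32_t *)((char *)curr + curr[1]);

	for (; start < end; start++)
		*start = PPC_INST_NOP;
}

static void do_btb_flush_fixups(void)
{
	long *p;

	/* Entries are pairs of longs, so advance by 2 per entry. */
	for (p = fixup; p < fixup + 2; p += 2)
		patch_btb_flush_section(p);
}
```

In the real kernel the offsets are emitted by the END_BTB_FLUSH_SECTION macro (`FTR_ENTRY_OFFSET 955b-957b` / `956b-957b`) and collected into the section by the linker; here they are computed by hand.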



[PATCH 03/11] powerpc/fsl: Fix spectre_v2 mitigations reporting

2018-12-12 Thread Diana Craciun
Currently for CONFIG_PPC_FSL_BOOK3E
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2 reports:
"Mitigation: Software count cache flush" which is wrong. Fix it
to report vulnerable for now.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kernel/security.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index f6f469f..1b395b8 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -22,7 +22,7 @@ enum count_cache_flush_type {
COUNT_CACHE_FLUSH_SW= 0x2,
COUNT_CACHE_FLUSH_HW= 0x4,
 };
-static enum count_cache_flush_type count_cache_flush_type;
+static enum count_cache_flush_type count_cache_flush_type = COUNT_CACHE_FLUSH_NONE;
 
 bool barrier_nospec_enabled;
 static bool no_nospec;
-- 
2.5.5



[PATCH 08/11] powerpc/fsl: Flush branch predictor when entering KVM

2018-12-12 Thread Diana Craciun
Switching from the guest to the host is another place
where speculative accesses can be exploited.
Flush the branch predictor when entering KVM.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kvm/bookehv_interrupts.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index 051af7d..4e5081e 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -75,6 +75,10 @@
PPC_LL  r1, VCPU_HOST_STACK(r4)
PPC_LL  r2, HOST_R2(r1)
 
+START_BTB_FLUSH_SECTION
+   BTB_FLUSH(r10)
+END_BTB_FLUSH_SECTION
+
mfspr   r10, SPRN_PID
lwz r8, VCPU_HOST_PID(r4)
PPC_LL  r11, VCPU_SHARED(r4)
-- 
2.5.5



[PATCH 06/11] powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)

2018-12-12 Thread Diana Craciun
In order to protect against speculation attacks on
indirect branches, the branch predictor is flushed at
kernel entry to protect for the following situations:
- userspace process attacking another userspace process
- userspace process attacking the kernel
Basically, when the privilege level changes (i.e. the
kernel is entered), the branch predictor state is flushed.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/kernel/entry_64.S   |  5 +
 arch/powerpc/kernel/exceptions-64e.S | 26 +-
 arch/powerpc/mm/tlb_low_64e.S|  7 +++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 7b1693a..7c2032e 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -80,6 +80,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
std r0,GPR0(r1)
std r10,GPR1(r1)
beq 2f  /* if from kernel mode */
+#ifdef CONFIG_PPC_FSL_BOOK3E
+START_BTB_FLUSH_SECTION
+   BTB_FLUSH(r10)
+END_BTB_FLUSH_SECTION
+#endif
ACCOUNT_CPU_USER_ENTRY(r13, r10, r11)
 2: std r2,GPR2(r1)
std r3,GPR3(r1)
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 6d6e144..afb6387 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -296,7 +296,8 @@ ret_from_mc_except:
andi.   r10,r11,MSR_PR; /* save stack pointer */\
beq 1f; /* branch around if supervisor */   \
ld  r1,PACAKSAVE(r13);  /* get kernel stack coming from usr */\
-1: cmpdi   cr1,r1,0;   /* check if SP makes sense */   \
+1: type##_BTB_FLUSH\
+   cmpdi   cr1,r1,0;   /* check if SP makes sense */   \
bge-cr1,exc_##n##_bad_stack;/* bad stack (TODO: out of line) */ \
mfspr   r10,SPRN_##type##_SRR0; /* read SRR0 before touching stack */
 
@@ -328,6 +329,29 @@ ret_from_mc_except:
 #define SPRN_MC_SRR0   SPRN_MCSRR0
 #define SPRN_MC_SRR1   SPRN_MCSRR1
 
+#ifdef CONFIG_PPC_FSL_BOOK3E
+#define GEN_BTB_FLUSH  \
+   START_BTB_FLUSH_SECTION \
+   beq 1f; \
+   BTB_FLUSH(r10)  \
+   1:  \
+   END_BTB_FLUSH_SECTION
+
+#define CRIT_BTB_FLUSH \
+   START_BTB_FLUSH_SECTION \
+   BTB_FLUSH(r10)  \
+   END_BTB_FLUSH_SECTION
+
+#define DBG_BTB_FLUSH CRIT_BTB_FLUSH
+#define MC_BTB_FLUSH CRIT_BTB_FLUSH
+#define GDBELL_BTB_FLUSH GEN_BTB_FLUSH
+#else
+#define GEN_BTB_FLUSH
+#define CRIT_BTB_FLUSH
+#define DBG_BTB_FLUSH
+#define GDBELL_BTB_FLUSH
+#endif
+
 #define NORMAL_EXCEPTION_PROLOG(n, intnum, addition)   \
EXCEPTION_PROLOG(n, intnum, GEN, addition##_GEN(n))
 
diff --git a/arch/powerpc/mm/tlb_low_64e.S b/arch/powerpc/mm/tlb_low_64e.S
index 7fd20c5..9ed9006 100644
--- a/arch/powerpc/mm/tlb_low_64e.S
+++ b/arch/powerpc/mm/tlb_low_64e.S
@@ -70,6 +70,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
std r15,EX_TLB_R15(r12)
std r10,EX_TLB_CR(r12)
 #ifdef CONFIG_PPC_FSL_BOOK3E
+START_BTB_FLUSH_SECTION
+   mfspr r11, SPRN_SRR1
+   andi. r10,r11,MSR_PR
+   beq 1f
+   BTB_FLUSH(r10)
+1:
+END_BTB_FLUSH_SECTION
std r7,EX_TLB_R7(r12)
 #endif
TLB_MISS_PROLOG_STATS
-- 
2.5.5



[PATCH 05/11] powerpc/fsl: Add nospectre_v2 command line argument

2018-12-12 Thread Diana Craciun
When the command line argument is present, the Spectre variant 2
mitigations are disabled.

Signed-off-by: Diana Craciun 
---
 arch/powerpc/include/asm/setup.h |  5 +
 arch/powerpc/kernel/security.c   | 21 +
 2 files changed, 26 insertions(+)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index c941c8c..65676e2 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -67,6 +67,11 @@ void do_barrier_nospec_fixups_range(bool enable, void *start, void *end);
 static inline void do_barrier_nospec_fixups_range(bool enable, void *start, void *end) { };
 #endif
 
+#ifdef CONFIG_PPC_FSL_BOOK3E
+void setup_spectre_v2(void);
+#else
+static inline void setup_spectre_v2(void) {};
+#endif
 void do_btb_flush_fixups(void);
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index 1b395b8..4393a38 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -26,6 +26,10 @@ static enum count_cache_flush_type count_cache_flush_type = COUNT_CACHE_FLUSH_NO
 
 bool barrier_nospec_enabled;
 static bool no_nospec;
+static bool btb_flush_enabled;
+#ifdef CONFIG_PPC_FSL_BOOK3E
+static bool no_spectrev2;
+#endif
 
 static void enable_barrier_nospec(bool enable)
 {
@@ -101,6 +105,23 @@ static __init int barrier_nospec_debugfs_init(void)
 device_initcall(barrier_nospec_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
+#ifdef CONFIG_PPC_FSL_BOOK3E
+static int __init handle_nospectre_v2(char *p)
+{
+   no_spectrev2 = true;
+
+   return 0;
+}
+early_param("nospectre_v2", handle_nospectre_v2);
+void setup_spectre_v2(void)
+{
+   if (no_spectrev2)
+   do_btb_flush_fixups();
+   else
+   btb_flush_enabled = true;
+}
+#endif /* CONFIG_PPC_FSL_BOOK3E */
+
 #ifdef CONFIG_PPC_BOOK3S_64
ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf)
 {
-- 
2.5.5
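The control flow this patch sets up can be modelled in a few lines of userspace C — illustrative names only; `parse_early_params()` stands in for the kernel's early_param machinery, and `fixups_patched_to_nops` for the do_btb_flush_fixups() call:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Model of the patch's flow: the early param only records the user's
 * choice; setup_spectre_v2() later either patches the flush sequences
 * out to nops or marks the mitigation enabled for reporting. */
static bool no_spectrev2;
static bool btb_flush_enabled;
static bool fixups_patched_to_nops;	/* stands in for do_btb_flush_fixups() */

static void parse_early_params(const char *cmdline)
{
	if (strstr(cmdline, "nospectre_v2"))
		no_spectrev2 = true;
}

static void setup_spectre_v2(void)
{
	if (no_spectrev2)
		fixups_patched_to_nops = true;
	else
		btb_flush_enabled = true;
}
```

Splitting "record the choice" from "act on it" mirrors why setup_spectre_v2() is called from setup_arch() in a later patch: the fixup sections must exist before they can be patched.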



Re: [PATCH] powerpc/8xx: hide itlbie and dtlbie symbols

2018-12-12 Thread Michael Ellerman
Christophe Leroy  writes:

> When disassembling InstructionTLBError we get the following messy code:
>
> c000138c:   7d 84 63 78 mr  r4,r12
> c0001390:   75 25 58 00 andis.  r5,r9,22528
> c0001394:   75 2a 40 00 andis.  r10,r9,16384
> c0001398:   41 a2 00 08 beq c00013a0 
> c000139c:   7c 00 22 64 tlbie   r4,r0
>
> c00013a0 :
> c00013a0:   39 40 04 01 li  r10,1025
> c00013a4:   91 4b 00 b0 stw r10,176(r11)
> c00013a8:   39 40 10 32 li  r10,4146
> c00013ac:   48 00 cc 59 bl  c000e004 
>
> For a cleaner code dump, this patch replaces itlbie and dtlbie
> symbols by numeric symbols.
>
> c000138c:   7d 84 63 78 mr  r4,r12
> c0001390:   75 25 58 00 andis.  r5,r9,22528
> c0001394:   75 2a 40 00 andis.  r10,r9,16384
> c0001398:   41 a2 00 08 beq c00013a0 
> c000139c:   7c 00 22 64 tlbie   r4,r0
> c00013a0:   39 40 04 01 li  r10,1025
> c00013a4:   91 4b 00 b0 stw r10,176(r11)
> c00013a8:   39 40 10 32 li  r10,4146
> c00013ac:   48 00 cc 59 bl  c000e004 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/head_8xx.S | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
> index 3b67b9533c82..8c848acfe249 100644
> --- a/arch/powerpc/kernel/head_8xx.S
> +++ b/arch/powerpc/kernel/head_8xx.S
> @@ -552,11 +552,10 @@ InstructionTLBError:
>   mr  r4,r12
>   andis.  r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>   andis.  r10,r9,SRR1_ISI_NOPT@h
> - beq+1f
> + beq+1301f
>   tlbie   r4
> -itlbie:
>   /* 0x400 is InstructionAccess exception, needed by bad_page_fault() */
> -1:   EXC_XFER_LITE(0x400, handle_page_fault)
> +1301:EXC_XFER_LITE(0x400, handle_page_fault)

You could use a local symbol, something like:

beq+1f
tlbie   r4
.Litlbie:
/* 0x400 is InstructionAccess exception, needed by bad_page_fault() */
1:  EXC_XFER_LITE(0x400, handle_page_fault)


cheers


Re: [PATCH v3] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-12 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 12/12/2018 à 01:23, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>> 
>>> This patch implements CONFIG_DEBUG_VIRTUAL to warn about
>>> incorrect use of virt_to_phys() and page_to_phys()
>>>
>>> Below is the result of test_debug_virtual:
>>>
>>> [1.438746] WARNING: CPU: 0 PID: 1 at 
>>> ./arch/powerpc/include/asm/io.h:808 test_debug_virtual_init+0x3c/0xd4
>>> [1.448156] CPU: 0 PID: 1 Comm: swapper Not tainted 
>>> 4.20.0-rc5-00560-g6bfb52e23a00-dirty #532
>>> [1.457259] NIP:  c066c550 LR: c0650ccc CTR: c066c514
>>> [1.462257] REGS: c900bdb0 TRAP: 0700   Not tainted  
>>> (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
>>> [1.471184] MSR:  00029032   CR: 48000422  XER: 2000
>>> [1.477811]
>>> [1.477811] GPR00: c0650ccc c900be60 c60d  006000c0 c900 
>>> 9032 c7fa0020
>>> [1.477811] GPR08: 2400 0001 0900  c07b5d04  
>>> c00037d8 
>>> [1.477811] GPR16:     c076 c074 
>>> 0092 c0685bb0
>>> [1.477811] GPR24: c065042c c068a734 c0685b8c 0006  c076 
>>> c075c3c0 
>>> [1.512711] NIP [c066c550] test_debug_virtual_init+0x3c/0xd4
>>> [1.518315] LR [c0650ccc] do_one_initcall+0x8c/0x1cc
>>> [1.523163] Call Trace:
>>> [1.525595] [c900be60] [c0567340] 0xc0567340 (unreliable)
>>> [1.530954] [c900be90] [c0650ccc] do_one_initcall+0x8c/0x1cc
>>> [1.536551] [c900bef0] [c0651000] kernel_init_freeable+0x1f4/0x2cc
>>> [1.542658] [c900bf30] [c00037ec] kernel_init+0x14/0x110
>>> [1.547913] [c900bf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
>>> [1.553971] Instruction dump:
>>> [1.556909] 3ca50100 bfa10024 54a5000e 3fa0c076 7c0802a6 3d454000 
>>> 813dc204 554893be
>>> [1.564566] 7d294010 7d294910 90010034 39290001 <0f09> 7c3e0b78 
>>> 955e0008 3fe0c062
>>> [1.572425] ---[ end trace 6f6984225b280ad6 ]---
>>> [1.577467] PA: 0x0900 for VA: 0xc900
>>> [1.581799] PA: 0x061e8f50 for VA: 0xc61e8f50
>>>
>>> Signed-off-by: Christophe Leroy 
>>> ---
>>>   v3: Added missing linux/mm.h
>>>   I realised that a driver may use DMA on the stack after checking with
>>>   virt_addr_valid(), so the new verification might induce false positives.
>>>   I removed it for now, and will add it again later in a more controlled way.
>> 
>> What is this comment referring to?
>> 
>> I can't see any difference to v2 except the linux/mm.h include.
>
> v2 was:
>
>
> @@ -804,6 +806,11 @@ extern void __iounmap_at(void *ea, unsigned long size);
>*/
>   static inline unsigned long virt_to_phys(volatile void * address)
>   {
> + if (IS_ENABLED(CONFIG_DEBUG_VIRTUAL) &&
> + !WARN_ON(IS_ENABLED(CONFIG_HAVE_ARCH_VMAP_STACK) && current->pid &&
> +  object_is_on_stack((const void*)address)))
> + WARN_ON(!virt_addr_valid(address));
> +
>   return __pa((unsigned long)address);
>   }
>
>
> v3 is: (same as v1)
>
>
> @@ -804,6 +806,8 @@ extern void __iounmap_at(void *ea, unsigned long size);
>*/
>   static inline unsigned long virt_to_phys(volatile void * address)
>   {
> + WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && !virt_addr_valid(address));
> +
>   return __pa((unsigned long)address);
>   }

Right, sorry I must have been looking at v1 (which was already applied
in my tree).

> The idea in v2 was to detect objects on stack used for DMA before 
> activating CONFIG_VMAP_STACK, but if the driver uses virt_addr_valid() 
> to decide if it can DMA map it, then we'll get false positives.
> So I think this should be added with a dedicated DEBUG CONFIG option, 
> not implicitly.

Sounds good. I'll take v3.

cheers


[PATCH] Cover letter for (PCI/AER: only insert one element into kfifo)

2018-12-12 Thread Yanjiang Jin
Without this patch, if we have multiple PCIe devices and one of them reports
an AER error, aer_recover_work_func() -> kfifo_get() will traverse the whole
kfifo, which contains the wrong number of elements (16).
If the uninitialized memory of one of those stale elements happens to match
another PCIe device (:01:00.0), we may get the call trace below.
It is unusual, but it did happen on my board: QDF2400.

# lspci
:00:00.0 PCI bridge:
:01:00.0 Ethernet controller:
0004:00:00.0 PCI bridge:
0004:01:00.0 Ethernet controller:
0005:00:00.0 PCI bridge:
0005:01:00.0 Ethernet controller:

Call trace:

[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[Hardware Error]: It has been corrected by h/w and requires no further action
[Hardware Error]: event severity: corrected
[Hardware Error]:  precise tstamp: 2018-11-29 09:23:16
[Hardware Error]:  Error 0, type: corrected
[Hardware Error]:   section_type: PCIe error
[Hardware Error]:   port_type: 4, root port
[Hardware Error]:   version: 3.0
[Hardware Error]:   command: 0x0407, status: 0x0010
[Hardware Error]:   device_id: 0004:00:00.0
[Hardware Error]:   slot: 0
[Hardware Error]:   secondary_bus: 0x01
[Hardware Error]:   vendor_id: 0x17cb, device_id: 0x0401
[Hardware Error]:   class_code: 000406
[Hardware Error]:   bridge: secondary_status: 0x, control: 0x
AER recover: find pci_dev for 0004:00:00:0
pcieport 0004:00:00.0: aer_status: 0x0001, aer_mask: 0xe000
pcieport 0004:00:00.0:[ 0] RxErr  (First)
pcieport 0004:00:00.0: aer_layer=Physical Layer, aer_agent=Receiver ID
AER recover: Can not find pci_dev for a38f:00:18:2
AER recover: Can not find pci_dev for 0857:1c:03:5
AER recover: Can not find pci_dev for 62d2:80:19:6
AER recover: Can not find pci_dev for 0857:f8:03:4
AER recover: Can not find pci_dev for 0907:78:07:1
AER recover: Can not find pci_dev for :00:00:1
AER recover: Can not find pci_dev for 0907:00:00:0
AER recover: Can not find pci_dev for :00:00:1
AER recover: find pci_dev for :01:00:0
Unable to handle kernel paging request at virtual address 00813004
Mem abort info:
  ESR = 0x9607
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0007
  CM = 0, WnR = 0
user pgtable: 64k pages, 48-bit VAs, pgdp = 0dce9024
[00813004] pgd=001727260003, pud=001727260003
pmd=001727290003, pte=
Internal error: Oops: 9607 [#1] SMP
Workqueue: events aer_recover_work_func
pstate: 2045 (nzCv daif +PAN -UAO)
pc : cper_print_aer+0x4c/0x290
lr : aer_recover_work_func+0x110/0x150
sp : 8017ca59fca0
x29: 8017ca59fca0 x28: 8017ca841000
x27: 8017ca841000 x26: 0001
x25: 00813000 x24: 0040
x23: 0040 x22: 08d5f830
x21: 090f1f10 x20: 090f1e98
x19:  x18: 
x17: 0001 x16: 0007
x15: 09073708 x14: 891e8faf
x13: 091e8fbd x12: 2c726579614c206c
x11: 0909b000 x10: 05f5e0ff
x9 : 8017ca59fa10 x8 : 09073978
x7 : 091e8a40 x6 : 0518
x5 : 0001 x4 : 8017ff9710b8
x3 : 8017ff9710b8 x2 : 00813000
x1 :  x0 : 09073708
Process kworker/11:1 (pid: 232, stack limit = 0x060ad7e1)
Call trace:
 cper_print_aer+0x4c/0x290
 aer_recover_work_func+0x110/0x150
 process_one_work+0x1ac/0x3f0
 worker_thread+0x54/0x430
 kthread+0x104/0x130
 ret_from_fork+0x10/0x18
Code: f941 f90057a1 d281 54000f40 (2940e334)
SMP: stopping secondary CPUs
Starting crashdump kernel...
Bye!



This email is intended only for the named addressee. It may contain information 
that is confidential/private, legally privileged, or copyright-protected, and 
you should handle it accordingly. If you are not the intended recipient, you do 
not have legal rights to retain, copy, or distribute this email or its 
contents, and should promptly delete the email and all electronic copies in 
your system; do not retain copies in any media. If you have received this email 
in error, please notify the sender promptly. Thank you.




[PATCH] PCI/AER: only insert one element into kfifo

2018-12-12 Thread Yanjiang Jin
Commit ecae65e133f2 ("PCI/AER: Use kfifo_in_spinlocked() to insert
locked elements") replaced kfifo_put() with kfifo_in_spinlocked().

But as the kfifo_in(fifo, buf, n) documentation describes:
" * @n: number of elements to be added".

We want to insert only one element into the kfifo, not sizeof(entry) = 16
of them. Without this patch, we would enqueue 15 uninitialized elements.

Signed-off-by: Yanjiang Jin 
---
 drivers/pci/pcie/aer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a90a919..fed29de 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1064,7 +1064,7 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
.regs   = aer_regs,
};

-   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, sizeof(entry),
+   if (kfifo_in_spinlocked(&aer_recover_ring, &entry, 1,
 &aer_recover_ring_lock))
schedule_work(&aer_recover_work);
else
--
1.8.3.1








Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags

2018-12-12 Thread Christoffer Dall
On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote:
> On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote:
> > [ Reviving old thread. ]
> > 
> > Andrew Murray  writes:
> > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote:
> > >> Andrew Murray  writes:
> > >> 
> > >> > Update design.txt to reflect the presence of the exclude_host
> > >> > and exclude_guest perf flags.
> > >> >
> > >> > Signed-off-by: Andrew Murray 
> > >> > ---
> > >> >  tools/perf/design.txt | 4 
> > >> >  1 file changed, 4 insertions(+)
> > >> >
> > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> > >> > index a28dca2..7de7d83 100644
> > >> > --- a/tools/perf/design.txt
> > >> > +++ b/tools/perf/design.txt
> > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
> > >> >  way to request that counting of events be restricted to times when the
> > >> >  CPU is in user, kernel and/or hypervisor mode.
> > >> >  
> > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
> > >> > +to request counting of events restricted to guest and host contexts when
> > >> > +using virtualisation.
> > >> 
> > >> How does exclude_host differ from exclude_hv ?
> > >
> > > I believe exclude_host / exclude_guest are intended to distinguish
> > > between host and guest in the hosted hypervisor context (KVM).
> > 
> > OK yeah, from the perf-list man page:
> > 
> >u - user-space counting
> >k - kernel counting
> >h - hypervisor counting
> >I - non idle counting
> >G - guest counting (in KVM guests)
> >H - host counting (not in KVM guests)
> > 
> > > Whereas exclude_hv allows to distinguish between guest and
> > > hypervisor in the bare-metal type hypervisors.
> > 
> > Except that's exactly not how we use them on powerpc :)
> > 
> > We use exclude_hv to exclude "the hypervisor", regardless of whether
> > it's KVM or PowerVM (which is a bare-metal hypervisor).
> > 
> > We don't use exclude_host / exclude_guest at all, which I guess is a
> > bug, except I didn't know they existed until this thread.
> > 
> > eg, in a KVM guest:
> > 
> >   $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done"
> >   $ perf report -D | grep -Fc "dso: [hypervisor]"
> >   16
> > 
> > 
> > > In the case of arm64 - if VHE extensions are present then the host
> > > kernel will run at a higher privilege to the guest kernel, in which
> > > case there is no distinction between hypervisor and host so we ignore
> > > exclude_hv. But where VHE extensions are not present then the host
> > > kernel runs at the same privilege level as the guest and we use a
> > > higher privilege level to switch between them - in this case we can
> > > use exclude_hv to discount that hypervisor role of switching between
> > > guests.
> > 
> > I couldn't find any arm64 perf code using exclude_host/guest at all?
> 
> Correct - but this is in flight as I am currently adding support for this
> see [1].
> 
> > 
> > And I don't see any x86 code using exclude_hv.
> 
> I can't find any either.
> 
> > 
> > But maybe that's OK, I just worry this is confusing for users.
> 
> There is some extra context regarding this where exclude_guest/exclude_host
> was added, see [2] and where exclude_hv was added, see [3]
> 
> Generally it seems that exclude_guest/exclude_host relies upon switching
> counters off/on on guest/host switch code (which works well in the nested
> virt case). Whereas exclude_hv tends to rely solely on hardware capability
> based on privilege level (which works well in the bare metal case where
> the guest doesn't run at same privilege as the host).
> 
> I think from the user perspective exclude_hv allows you to see your overhead
> if you are a guest (i.e. work done by bare metal hypervisor associated with
> you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to
> see events above you (i.e. the kernel hypervisor) if you are the guest...
> 
> At least that's how I read this, I've copied in others that may provide
> more authoritative feedback.
> 
> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html
> [2] https://www.spinics.net/lists/kvm/msg53996.html
> [3] https://lore.kernel.org/patchwork/patch/143918/
> 

I'll try to answer this in a different way, based on previous
discussions with Joerg et al. who introduced these flags.  Assume no
support for nested virtualization as a first approximation:

  If you are running as a guest:
- exclude_hv: stop counting events when the hypervisor runs
- exclude_host: has no effect
- exclude_guest: has no effect
  
  If you are running as a host/hypervisor:
   - exclude_hv: has no effect
   - exclude_host: only count events when the guest is running
   - exclude_guest: only count events when the host is running

With nested virtualization, you get the natural union of the above.

**This has 

Re: [PATCH v4 5/5] powerpc: generate uapi header and system call table files

2018-12-12 Thread kbuild test robot
Hi Firoz,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.20-rc6]
[cannot apply to next-20181211]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Firoz-Khan/powerpc-system-call-table-generation-support/20181209-185907
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allnoconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   ./arch/powerpc/include/generated/asm/syscall_table_32.h: Assembler messages:
>> ./arch/powerpc/include/generated/asm/syscall_table_32.h:1: Error: junk at 
>> end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:2: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:3: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:4: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:5: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:6: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:7: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:8: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:9: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:10: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:11: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:12: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:13: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:14: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:15: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:16: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:17: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:18: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:19: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:20: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:21: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:22: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:23: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:24: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:25: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:26: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:27: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:28: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:29: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:30: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:31: Error: junk at 
end of line, first unrecognized character is `('
   ./arch/powerpc/include/generated/asm/syscall_table_32.h:32: Error: junk at