date:20200306

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Christophe Leroy





Le 07/03/2020 à 01:56, Anshuman Khandual a écrit :



On 03/07/2020 06:04 AM, Qian Cai wrote:




On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  wrote:

Hmm, set_pte_at() function is not preferred here for these tests. The idea
is to avoid or atleast minimize TLB/cache flushes triggered from these sort
of 'static' tests. set_pte_at() is platform provided and could/might trigger
these flushes or some other platform specific synchronization stuff. Just


Why is that important for this debugging option?


Primarily reason is to avoid TLB/cache flush instructions on the system
during these tests that only involve transforming different page table
level entries through helpers. Unless really necessary, why should it
emit any TLB/cache flush instructions ?


What's the problem with thoses flushes ?






wondering is there specific reason with respect to the soft lock up problem
making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?


Looks at the s390 version of set_pte_at(), it has this comment,
vmaddr);

/*
  * Certain architectures need to do special things when PTEs
  * within a page table are directly modified.  Thus, the following
  * hook is made available.
  */

I can only guess that powerpc  could be the same here.


This comment is present in multiple platforms while defining set_pte_at().
Is not 'barrier()' here alone good enough ? Else what exactly set_pte_at()
does as compared to WRITE_ONCE() that avoids the soft lock up, just trying
to understand.




Argh ! I didn't realise that you were writing directly into the page 
tables. When it works, that's only by chance I guess.


To properly set the page table entries, set_pte_at() has to be used:
- On powerpc 8xx, with 16k pages, the page table entry must be copied 
four times. set_pte_at() does it, WRITE_ONCE() doesn't.
- On powerpc book3s/32 (hash MMU), the flag _PAGE_HASHPTE must be 
preserved among writes. set_pte_at() preserves it, WRITE_ONCE() doesn't.


set_pte_at() also does a few other mandatory things, like calling 
pte_mkpte()


So, the WRITE_ONCE() must definitely become a set_pte_at()

Christophe

Re: [PATCH v3 07/14] powerpc/32: drop get_pteptr()

2020-03-06 Thread Andrew Morton

On Thu, 27 Feb 2020 10:46:01 +0200 Mike Rapoport  wrote:

> Commit 8d30c14cab30 ("powerpc/mm: Rework I$/D$ coherency (v3)") and
> commit 90ac19a8b21b ("[POWERPC] Abolish iopa(), mm_ptov(),
> io_block_mapping() from arch/powerpc") removed the use of get_pteptr()
> outside of mm/pgtable_32.c
> 
> In mm/pgtable_32.c, the only user of get_pteptr() is __change_page_attr()
> which operates on kernel context and on lowmem pages only.
> 
> Move page table traversal to __change_page_attr() and drop get_pteptr().

People have been changing things in linux-next and the powerpc patches
are hurting.

I'll disable this patch series for now.  Can you please redo
powerpc-32-drop-get_pteptr.patch and
powerpc-add-support-for-folded-p4d-page-tables.patch (and
powerpc-add-support-for-folded-p4d-page-tables-fix.patch)?

Thanks.

Re: [PATCH v4 1/8] ASoC: dt-bindings: fsl_asrc: Change asrc-width to asrc-format

2020-03-06 Thread Shengjiu Wang

Hi

On Tue, Mar 3, 2020 at 8:47 PM Mark Brown  wrote:
>
> On Tue, Mar 03, 2020 at 11:59:30AM +0800, Shengjiu Wang wrote:
> > On Tue, Mar 3, 2020 at 9:43 AM Rob Herring  wrote:
>
> > > > -   - fsl,asrc-width  : Defines a mutual sample width used by DPCM Back 
> > > > Ends.
> > > > +   - fsl,asrc-format : Defines a mutual sample format used by DPCM Back
> > > > +   Ends. The value is one of SNDRV_PCM_FORMAT_XX in
> > > > +   "include/uapi/sound/asound.h"
>
> > > You can't just change properties. They are an ABI.
>
> > I have updated all the things related with this ABI in this patch series.
> > What else should I do?
>
> Like Nicolin says you should continue to support the old stuff.  The
> kernel should work with people's out of tree DTs too so simply updating
> everything in the tree isn't enough.

Thanks for review, I will update patch in next version.

best regards
wang shengjiu

[PATCH] powerpc/pseries: fix of_read_drc_info_cell() to point at next record

2020-03-06 Thread Tyrel Datwyler

The expectation is that when calling of_read_drc_info_cell()
repeatedly to parse multiple drc-info records that the in/out curval
parameter points at the start of the next record on return. However,
the current behavior has curval still pointing at the final value of
the record just parsed. The result of which is that if the
ibm,drc-info property contains multiple properties the parsed value
of the drc_type for any record after the first has the power_domain
value of the previous record appended to the type string.

Ex: observed the following 0x prepended to PHB

[   69.485037] drc-info: type: \xff\xff\xff\xffPHB, prefix: PHB , index_start: 
0x2001
[   69.485038] drc-info: suffix_start: 1, sequential_elems: 3072, 
sequential_inc: 1
[   69.485038] drc-info: power-domain: 0x, last_index: 0x2c00

Fix by incrementing curval past the power_domain value to point at
drc_type string of next record.

Fixes: a29396653b8bf ("pseries/drc-info: Search DRC properties for CPU indexes")
Signed-off-by: Tyrel Datwyler 
---
 arch/powerpc/platforms/pseries/of_helpers.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
b/arch/powerpc/platforms/pseries/of_helpers.c
index 66dfd8256712..23241c71ef37 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -88,7 +88,7 @@ int of_read_drc_info_cell(struct property **prop, const 
__be32 **curval,
return -EINVAL;
 
/* Should now know end of current entry */
-   (*curval) = (void *)p2;
+   (*curval) = (void *)(++p2);
data->last_drc_index = data->drc_index_start +
((data->num_sequential_elems - 1) * data->sequential_inc);
 
-- 
2.16.4

[PATCH v5 01/11] PCI: designware-ep: Add multiple PFs support for DWC

2020-03-06 Thread Xiaowei Bao

Add multiple PFs support for DWC, due to different PF have different
config space, we use func_conf_select callback function to access
the different PF's config space, the different chip company need to
implement this callback function when use the DWC IP core and intend
to support multiple PFs feature.

Signed-off-by: Xiaowei Bao 
Acked-by: Gustavo Pimentel 
---
v2:
 - Remove duplicate redundant code.
 - Reimplement the PF config space access way.
v3:
 - Integrate duplicate code for func_select.
 - Move PCIE_ATU_FUNC_NUM(pf) (pf << 20) to ((pf) << 20).
 - Add the comments for func_conf_select function.
v4:
 - Correct the commit message.
v5:
 - No change.

 drivers/pci/controller/dwc/pcie-designware-ep.c | 123 
 drivers/pci/controller/dwc/pcie-designware.c|  59 
 drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
 3 files changed, 142 insertions(+), 58 deletions(-)

diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c 
b/drivers/pci/controller/dwc/pcie-designware-ep.c
index cfeccd7..58d8556 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -19,12 +19,26 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
pci_epc_linkup(epc);
 }
 
-static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar,
-  int flags)
+static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
+{
+   unsigned int func_offset = 0;
+
+   if (ep->ops->func_conf_select)
+   func_offset = ep->ops->func_conf_select(ep, func_no);
+
+   return func_offset;
+}
+
+static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, u8 func_no,
+  enum pci_barno bar, int flags)
 {
u32 reg;
+   unsigned int func_offset = 0;
+   struct dw_pcie_ep *ep = >ep;
+
+   func_offset = dw_pcie_ep_func_select(ep, func_no);
 
-   reg = PCI_BASE_ADDRESS_0 + (4 * bar);
+   reg = func_offset + PCI_BASE_ADDRESS_0 + (4 * bar);
dw_pcie_dbi_ro_wr_en(pci);
dw_pcie_writel_dbi2(pci, reg, 0x0);
dw_pcie_writel_dbi(pci, reg, 0x0);
@@ -37,7 +51,12 @@ static void __dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum 
pci_barno bar,
 
 void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
 {
-   __dw_pcie_ep_reset_bar(pci, bar, 0);
+   u8 func_no, funcs;
+
+   funcs = pci->ep.epc->max_functions;
+
+   for (func_no = 0; func_no < funcs; func_no++)
+   __dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
 }
 
 static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
@@ -45,28 +64,31 @@ static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 
func_no,
 {
struct dw_pcie_ep *ep = epc_get_drvdata(epc);
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   unsigned int func_offset = 0;
+
+   func_offset = dw_pcie_ep_func_select(ep, func_no);
 
dw_pcie_dbi_ro_wr_en(pci);
-   dw_pcie_writew_dbi(pci, PCI_VENDOR_ID, hdr->vendorid);
-   dw_pcie_writew_dbi(pci, PCI_DEVICE_ID, hdr->deviceid);
-   dw_pcie_writeb_dbi(pci, PCI_REVISION_ID, hdr->revid);
-   dw_pcie_writeb_dbi(pci, PCI_CLASS_PROG, hdr->progif_code);
-   dw_pcie_writew_dbi(pci, PCI_CLASS_DEVICE,
+   dw_pcie_writew_dbi(pci, func_offset + PCI_VENDOR_ID, hdr->vendorid);
+   dw_pcie_writew_dbi(pci, func_offset + PCI_DEVICE_ID, hdr->deviceid);
+   dw_pcie_writeb_dbi(pci, func_offset + PCI_REVISION_ID, hdr->revid);
+   dw_pcie_writeb_dbi(pci, func_offset + PCI_CLASS_PROG, hdr->progif_code);
+   dw_pcie_writew_dbi(pci, func_offset + PCI_CLASS_DEVICE,
   hdr->subclass_code | hdr->baseclass_code << 8);
-   dw_pcie_writeb_dbi(pci, PCI_CACHE_LINE_SIZE,
+   dw_pcie_writeb_dbi(pci, func_offset + PCI_CACHE_LINE_SIZE,
   hdr->cache_line_size);
-   dw_pcie_writew_dbi(pci, PCI_SUBSYSTEM_VENDOR_ID,
+   dw_pcie_writew_dbi(pci, func_offset + PCI_SUBSYSTEM_VENDOR_ID,
   hdr->subsys_vendor_id);
-   dw_pcie_writew_dbi(pci, PCI_SUBSYSTEM_ID, hdr->subsys_id);
-   dw_pcie_writeb_dbi(pci, PCI_INTERRUPT_PIN,
+   dw_pcie_writew_dbi(pci, func_offset + PCI_SUBSYSTEM_ID, hdr->subsys_id);
+   dw_pcie_writeb_dbi(pci, func_offset + PCI_INTERRUPT_PIN,
   hdr->interrupt_pin);
dw_pcie_dbi_ro_wr_dis(pci);
 
return 0;
 }
 
-static int dw_pcie_ep_inbound_atu(struct dw_pcie_ep *ep, enum pci_barno bar,
- dma_addr_t cpu_addr,
+static int dw_pcie_ep_inbound_atu(struct dw_pcie_ep *ep, u8 func_no,
+ enum pci_barno bar, dma_addr_t cpu_addr,
  enum dw_pcie_as_type as_type)
 {
int ret;
@@ -79,7 +101,7 @@ static int dw_pcie_ep_inbound_atu(struct dw_pcie_ep *ep, 
enum pci_barno bar,
return -EINVAL;
}
 
-   ret =

[PATCH v5 03/11] PCI: designware-ep: Move the function of getting MSI capability forward

2020-03-06 Thread Xiaowei Bao

Move the function of getting MSI capability to the front of init
function, because the init function of the EP platform driver will use
the return value by the function of getting MSI capability.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2:
 - No change.
v3:
 - No change.
v4:
 - No change.
v5:
 - No change.

 drivers/pci/controller/dwc/pcie-designware-ep.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c 
b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 44ece33..933bb89 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -632,6 +632,10 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
if (ret < 0)
epc->max_functions = 1;
 
+   ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
+
+   ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
+
if (ep->ops->ep_init)
ep->ops->ep_init(ep);
 
@@ -648,9 +652,6 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
dev_err(dev, "Failed to reserve memory for MSI/MSI-X\n");
return -ENOMEM;
}
-   ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
-
-   ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
 
offset = dw_pcie_ep_find_ext_capability(pci, PCI_EXT_CAP_ID_REBAR);
if (offset) {
-- 
2.9.5

[PATCH v5 11/11] misc: pci_endpoint_test: Add LS1088a in pci_device_id table

2020-03-06 Thread Xiaowei Bao

Add LS1088a in pci_device_id table so that pci-epf-test can be used
for testing PCIe EP in LS1088a.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2:
 - No change.
v3:
 - No change.
v4:
 - Use a maco to define the LS1088a device ID.
v5:
 - No change.
 
 drivers/misc/pci_endpoint_test.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index a5e3170..72d694f 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -65,6 +65,7 @@
 #define PCI_ENDPOINT_TEST_IRQ_NUMBER   0x28
 
 #define PCI_DEVICE_ID_TI_AM654 0xb00c
+#define PCI_DEVICE_ID_LS1088A  0x80c0
 
 #define is_am654_pci_dev(pdev) \
((pdev)->device == PCI_DEVICE_ID_TI_AM654)
@@ -793,6 +794,7 @@ static const struct pci_device_id pci_endpoint_test_tbl[] = 
{
{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA74x) },
{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA72x) },
{ PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x81c0) },
+   { PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, PCI_DEVICE_ID_LS1088A) },
{ PCI_DEVICE_DATA(SYNOPSYS, EDDA, NULL) },
{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_AM654),
  .driver_data = (kernel_ulong_t)_data
-- 
2.9.5

[PATCH v5 02/11] PCI: designware-ep: Add the doorbell mode of MSI-X in EP mode

2020-03-06 Thread Xiaowei Bao

Add the doorbell mode of MSI-X in DWC EP driver.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2:
 - Remove the macro of no used.
v3:
 - No change.
v4:
 - Modify the commit message.
v5:
 - No change.

 drivers/pci/controller/dwc/pcie-designware-ep.c | 14 ++
 drivers/pci/controller/dwc/pcie-designware.h| 12 
 2 files changed, 26 insertions(+)

diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c 
b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 58d8556..44ece33 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -449,6 +449,20 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 
func_no,
return 0;
 }
 
+int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8 func_no,
+  u16 interrupt_num)
+{
+   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   u32 msg_data;
+
+   msg_data = (func_no << PCIE_MSIX_DOORBELL_PF_SHIFT) |
+  (interrupt_num - 1);
+
+   dw_pcie_writel_dbi(pci, PCIE_MSIX_DOORBELL, msg_data);
+
+   return 0;
+}
+
 int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
  u16 interrupt_num)
 {
diff --git a/drivers/pci/controller/dwc/pcie-designware.h 
b/drivers/pci/controller/dwc/pcie-designware.h
index 00d2d31..cb32afa 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -97,6 +97,9 @@
 #define PCIE_MISC_CONTROL_1_OFF0x8BC
 #define PCIE_DBI_RO_WR_EN  BIT(0)
 
+#define PCIE_MSIX_DOORBELL 0x948
+#define PCIE_MSIX_DOORBELL_PF_SHIFT24
+
 #define PCIE_PL_CHK_REG_CONTROL_STATUS 0xB20
 #define PCIE_PL_CHK_REG_CHK_REG_START  BIT(0)
 #define PCIE_PL_CHK_REG_CHK_REG_CONTINUOUS BIT(1)
@@ -431,6 +434,8 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 
func_no,
 u8 interrupt_num);
 int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
 u16 interrupt_num);
+int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8 func_no,
+  u16 interrupt_num);
 void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
 #else
 static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
@@ -463,6 +468,13 @@ static inline int dw_pcie_ep_raise_msix_irq(struct 
dw_pcie_ep *ep, u8 func_no,
return 0;
 }
 
+static inline int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep,
+u8 func_no,
+u16 interrupt_num)
+{
+   return 0;
+}
+
 static inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno 
bar)
 {
 }
-- 
2.9.5

[PATCH v5 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding

2020-03-06 Thread Xiaowei Bao

Each PF of EP device should have it's own MSI or MSIX capabitily
struct, so create a dw_pcie_ep_func struct and remove the msi_cap
and msix_cap to this struct from dw_pcie_ep, and manage the PFs
with a list.

Signed-off-by: Xiaowei Bao 
---
v3:
 - This is a new patch, to fix the issue of MSI and MSIX CAP way of
   finding.
v4:
 - Correct some word of commit message.
v5:
 - No change.

 drivers/pci/controller/dwc/pcie-designware-ep.c | 135 +---
 drivers/pci/controller/dwc/pcie-designware.h|  18 +++-
 2 files changed, 134 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c 
b/drivers/pci/controller/dwc/pcie-designware-ep.c
index 933bb89..fb915f2 100644
--- a/drivers/pci/controller/dwc/pcie-designware-ep.c
+++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
@@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
pci_epc_linkup(epc);
 }
 
+struct dw_pcie_ep_func *
+dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
+{
+   struct dw_pcie_ep_func *ep_func;
+
+   list_for_each_entry(ep_func, >func_list, list) {
+   if (ep_func->func_no == func_no)
+   return ep_func;
+   }
+
+   return NULL;
+}
+
 static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
 {
unsigned int func_offset = 0;
@@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum 
pci_barno bar)
__dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
 }
 
+static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8 func_no,
+   u8 cap_ptr, u8 cap)
+{
+   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   unsigned int func_offset = 0;
+   u8 cap_id, next_cap_ptr;
+   u16 reg;
+
+   if (!cap_ptr)
+   return 0;
+
+   func_offset = dw_pcie_ep_func_select(ep, func_no);
+
+   reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
+   cap_id = (reg & 0x00ff);
+
+   if (cap_id > PCI_CAP_ID_MAX)
+   return 0;
+
+   if (cap_id == cap)
+   return cap_ptr;
+
+   next_cap_ptr = (reg & 0xff00) >> 8;
+   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
+}
+
+static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8 func_no, u8 cap)
+{
+   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   unsigned int func_offset = 0;
+   u8 next_cap_ptr;
+   u16 reg;
+
+   func_offset = dw_pcie_ep_func_select(ep, func_no);
+
+   reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
+   next_cap_ptr = (reg & 0x00ff);
+
+   return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
+}
+
 static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
   struct pci_epf_header *hdr)
 {
@@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc *epc, u8 
func_no)
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
u32 val, reg;
unsigned int func_offset = 0;
+   struct dw_pcie_ep_func *ep_func;
 
-   if (!ep->msi_cap)
+   ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
+   if (!ep_func)
+   return -EINVAL;
+
+   if (!ep_func->msi_cap)
return -EINVAL;
 
func_offset = dw_pcie_ep_func_select(ep, func_no);
 
-   reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
+   reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
val = dw_pcie_readw_dbi(pci, reg);
if (!(val & PCI_MSI_FLAGS_ENABLE))
return -EINVAL;
@@ -268,13 +327,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc *epc, u8 
func_no, u8 interrupts)
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
u32 val, reg;
unsigned int func_offset = 0;
+   struct dw_pcie_ep_func *ep_func;
+
+   ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
+   if (!ep_func)
+   return -EINVAL;
 
-   if (!ep->msi_cap)
+   if (!ep_func->msi_cap)
return -EINVAL;
 
func_offset = dw_pcie_ep_func_select(ep, func_no);
 
-   reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
+   reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
val = dw_pcie_readw_dbi(pci, reg);
val &= ~PCI_MSI_FLAGS_QMASK;
val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK;
@@ -291,13 +355,18 @@ static int dw_pcie_ep_get_msix(struct pci_epc *epc, u8 
func_no)
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
u32 val, reg;
unsigned int func_offset = 0;
+   struct dw_pcie_ep_func *ep_func;
+
+   ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
+   if (!ep_func)
+   return -EINVAL;
 
-   if (!ep->msix_cap)
+   if (!ep_func->msix_cap)
return -EINVAL;
 
func_offset = dw_pcie_ep_func_select(ep, func_no);
 
-   reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
+   reg = ep_func->msix_cap + func_offset +

[PATCH v5 06/11] PCI: layerscape: Fix some format issue of the code

2020-03-06 Thread Xiaowei Bao

Fix some format issue of the code in EP driver.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2:
 - No change.
v3:
 - No change.
v4:
 - No change.
v5:
 - No change.

 drivers/pci/controller/dwc/pci-layerscape-ep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
b/drivers/pci/controller/dwc/pci-layerscape-ep.c
index 0d151ce..0691d9a 100644
--- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
+++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
@@ -63,7 +63,7 @@ static void ls_pcie_ep_init(struct dw_pcie_ep *ep)
 }
 
 static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
- enum pci_epc_irq_type type, u16 interrupt_num)
+   enum pci_epc_irq_type type, u16 interrupt_num)
 {
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
 
@@ -87,7 +87,7 @@ static const struct dw_pcie_ep_ops pcie_ep_ops = {
 };
 
 static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
-   struct platform_device *pdev)
+struct platform_device *pdev)
 {
struct dw_pcie *pci = pcie->pci;
struct device *dev = pci->dev;
-- 
2.9.5

[PATCH v4 00/11] Add the multiple PF support for DWC and Layerscape

2020-03-06 Thread Xiaowei Bao

Add the PCIe EP multiple PF support for DWC and Layerscape, add
the doorbell MSIX function for DWC, use list to manage the PF of
one PCIe controller, and refactor the Layerscape EP driver due to
some platforms difference.

Xiaowei Bao (11):
  PCI: designware-ep: Add multiple PFs support for DWC
  PCI: designware-ep: Add the doorbell mode of MSI-X in EP mode
  PCI: designware-ep: Move the function of getting MSI capability
forward
  PCI: designware-ep: Modify MSI and MSIX CAP way of finding
  dt-bindings: pci: layerscape-pci: Add compatible strings for ls1088a
and ls2088a
  PCI: layerscape: Fix some format issue of the code
  PCI: layerscape: Modify the way of getting capability with different
PEX
  PCI: layerscape: Modify the MSIX to the doorbell mode
  PCI: layerscape: Add EP mode support for ls1088a and ls2088a
  arm64: dts: layerscape: Add PCIe EP node for ls1088a
  misc: pci_endpoint_test: Add LS1088a in pci_device_id table

 .../devicetree/bindings/pci/layerscape-pci.txt |   2 +
 arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi |  31 +++
 drivers/misc/pci_endpoint_test.c   |   2 +
 drivers/pci/controller/dwc/pci-layerscape-ep.c | 100 ++--
 drivers/pci/controller/dwc/pcie-designware-ep.c| 255 +
 drivers/pci/controller/dwc/pcie-designware.c   |  59 +++--
 drivers/pci/controller/dwc/pcie-designware.h   |  48 +++-
 7 files changed, 404 insertions(+), 93 deletions(-)

-- 
2.9.5

[PATCH v5 05/11] dt-bindings: pci: layerscape-pci: Add compatible strings for ls1088a and ls2088a

2020-03-06 Thread Xiaowei Bao

Add compatible strings for ls1088a and ls2088a.

Signed-off-by: Xiaowei Bao 
Acked-by: Rob Herring 
---
v2:
 - No change.
v3:
 - Use one valid combination of compatible strings.
v4:
 - Add the comma between the two compatible.
v5:
 - No change.

 Documentation/devicetree/bindings/pci/layerscape-pci.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt 
b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
index 99a386e..daa99f7 100644
--- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt
+++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt
@@ -24,6 +24,8 @@ Required properties:
 "fsl,ls1028a-pcie"
   EP mode:
"fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"
+   "fsl,ls1088a-pcie-ep", "fsl,ls-pcie-ep"
+   "fsl,ls2088a-pcie-ep", "fsl,ls-pcie-ep"
 - reg: base addresses and lengths of the PCIe controller register blocks.
 - interrupts: A list of interrupt outputs of the controller. Must contain an
   entry for each entry in the interrupt-names property.
-- 
2.9.5

[PATCH v5 10/11] arm64: dts: layerscape: Add PCIe EP node for ls1088a

2020-03-06 Thread Xiaowei Bao

Add PCIe EP node for ls1088a to support EP mode.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2:
 - Remove the pf-offset proparty.
v3:
 - No change.
v4:
 - No change.
v5:
 - No change.
 
 arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi | 31 ++
 1 file changed, 31 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi 
b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
index ec6013a..cb0805b 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls1088a.dtsi
@@ -497,6 +497,17 @@
status = "disabled";
};
 
+   pcie_ep@340 {
+   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
+   reg = <0x00 0x0340 0x0 0x0010
+  0x20 0x 0x8 0x>;
+   reg-names = "regs", "addr_space";
+   num-ib-windows = <24>;
+   num-ob-windows = <128>;
+   max-functions = /bits/ 8 <2>;
+   status = "disabled";
+   };
+
pcie@350 {
compatible = "fsl,ls1088a-pcie";
reg = <0x00 0x0350 0x0 0x0010   /* controller 
registers */
@@ -522,6 +533,16 @@
status = "disabled";
};
 
+   pcie_ep@350 {
+   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
+   reg = <0x00 0x0350 0x0 0x0010
+  0x28 0x 0x8 0x>;
+   reg-names = "regs", "addr_space";
+   num-ib-windows = <6>;
+   num-ob-windows = <8>;
+   status = "disabled";
+   };
+
pcie@360 {
compatible = "fsl,ls1088a-pcie";
reg = <0x00 0x0360 0x0 0x0010   /* controller 
registers */
@@ -547,6 +568,16 @@
status = "disabled";
};
 
+   pcie_ep@360 {
+   compatible = "fsl,ls1088a-pcie-ep","fsl,ls-pcie-ep";
+   reg = <0x00 0x0360 0x0 0x0010
+  0x30 0x 0x8 0x>;
+   reg-names = "regs", "addr_space";
+   num-ib-windows = <6>;
+   num-ob-windows = <8>;
+   status = "disabled";
+   };
+
smmu: iommu@500 {
compatible = "arm,mmu-500";
reg = <0 0x500 0 0x80>;
-- 
2.9.5

[PATCH v5 07/11] PCI: layerscape: Modify the way of getting capability with different PEX

2020-03-06 Thread Xiaowei Bao

The different PCIe controller in one board may be have different
capability of MSI or MSIX, so change the way of getting the MSI
capability, make it more flexible.

Signed-off-by: Xiaowei Bao 
---
v2:
 - Remove the repeated assignment code.
v3:
 - Use ep_func msi_cap and msix_cap to decide the msi_capable and
   msix_capable of pci_epc_features struct.
v4:
 - No change.
v5:
 - No change.

 drivers/pci/controller/dwc/pci-layerscape-ep.c | 31 +++---
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
b/drivers/pci/controller/dwc/pci-layerscape-ep.c
index 0691d9a..9601f9c 100644
--- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
+++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
@@ -22,6 +22,7 @@
 
 struct ls_pcie_ep {
struct dw_pcie  *pci;
+   struct pci_epc_features *ls_epc;
 };
 
 #define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
@@ -40,26 +41,31 @@ static const struct of_device_id ls_pcie_ep_of_match[] = {
{ },
 };
 
-static const struct pci_epc_features ls_pcie_epc_features = {
-   .linkup_notifier = false,
-   .msi_capable = true,
-   .msix_capable = false,
-   .bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
-};
-
 static const struct pci_epc_features*
 ls_pcie_ep_get_features(struct dw_pcie_ep *ep)
 {
-   return _pcie_epc_features;
+   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
+
+   return pcie->ls_epc;
 }
 
 static void ls_pcie_ep_init(struct dw_pcie_ep *ep)
 {
struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
+   struct dw_pcie_ep_func *ep_func;
enum pci_barno bar;
 
+   ep_func = dw_pcie_ep_get_func_from_ep(ep, 0);
+   if (!ep_func)
+   return;
+
for (bar = 0; bar < PCI_STD_NUM_BARS; bar++)
dw_pcie_ep_reset_bar(pci, bar);
+
+   pcie->ls_epc->msi_capable = ep_func->msi_cap ? true : false;
+   pcie->ls_epc->msix_capable = ep_func->msix_cap ? true : false;
 }
 
 static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no,
@@ -119,6 +125,7 @@ static int __init ls_pcie_ep_probe(struct platform_device 
*pdev)
struct device *dev = >dev;
struct dw_pcie *pci;
struct ls_pcie_ep *pcie;
+   struct pci_epc_features *ls_epc;
struct resource *dbi_base;
int ret;
 
@@ -130,6 +137,10 @@ static int __init ls_pcie_ep_probe(struct platform_device 
*pdev)
if (!pci)
return -ENOMEM;
 
+   ls_epc = devm_kzalloc(dev, sizeof(*ls_epc), GFP_KERNEL);
+   if (!ls_epc)
+   return -ENOMEM;
+
dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
if (IS_ERR(pci->dbi_base))
@@ -140,6 +151,10 @@ static int __init ls_pcie_ep_probe(struct platform_device 
*pdev)
pci->ops = _pcie_ep_ops;
pcie->pci = pci;
 
+   ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
+
+   pcie->ls_epc = ls_epc;
+
platform_set_drvdata(pdev, pcie);
 
ret = ls_add_pcie_ep(pcie, pdev);
-- 
2.9.5

[PATCH v5 09/11] PCI: layerscape: Add EP mode support for ls1088a and ls2088a

2020-03-06 Thread Xiaowei Bao

Add PCIe EP mode support for ls1088a and ls2088a, there are some
difference between LS1 and LS2 platform, so refactor the code of
the EP driver.

Signed-off-by: Xiaowei Bao 
---
v2: 
 - This is a new patch for supporting the ls1088a and ls2088a platform.
v3:
 - Adjust the some struct assignment order in probe function.
v4:
 - No change.
v5:
 - No change.

 drivers/pci/controller/dwc/pci-layerscape-ep.c | 72 +++---
 1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
b/drivers/pci/controller/dwc/pci-layerscape-ep.c
index bfab1c6..84206f2 100644
--- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
+++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
@@ -20,27 +20,29 @@
 
 #define PCIE_DBI2_OFFSET   0x1000  /* DBI2 base address*/
 
-struct ls_pcie_ep {
-   struct dw_pcie  *pci;
-   struct pci_epc_features *ls_epc;
+#define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
+
+struct ls_pcie_ep_drvdata {
+   u32 func_offset;
+   const struct dw_pcie_ep_ops *ops;
+   const struct dw_pcie_ops*dw_pcie_ops;
 };
 
-#define to_ls_pcie_ep(x)   dev_get_drvdata((x)->dev)
+struct ls_pcie_ep {
+   struct dw_pcie  *pci;
+   struct pci_epc_features *ls_epc;
+   const struct ls_pcie_ep_drvdata *drvdata;
+};
 
 static int ls_pcie_establish_link(struct dw_pcie *pci)
 {
return 0;
 }
 
-static const struct dw_pcie_ops ls_pcie_ep_ops = {
+static const struct dw_pcie_ops dw_ls_pcie_ep_ops = {
.start_link = ls_pcie_establish_link,
 };
 
-static const struct of_device_id ls_pcie_ep_of_match[] = {
-   { .compatible = "fsl,ls-pcie-ep",},
-   { },
-};
-
 static const struct pci_epc_features*
 ls_pcie_ep_get_features(struct dw_pcie_ep *ep)
 {
@@ -87,10 +89,39 @@ static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 
func_no,
}
 }
 
-static const struct dw_pcie_ep_ops pcie_ep_ops = {
+static unsigned int ls_pcie_ep_func_conf_select(struct dw_pcie_ep *ep,
+   u8 func_no)
+{
+   struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
+   struct ls_pcie_ep *pcie = to_ls_pcie_ep(pci);
+
+   WARN_ON(func_no && !pcie->drvdata->func_offset);
+   return pcie->drvdata->func_offset * func_no;
+}
+
+static const struct dw_pcie_ep_ops ls_pcie_ep_ops = {
.ep_init = ls_pcie_ep_init,
.raise_irq = ls_pcie_ep_raise_irq,
.get_features = ls_pcie_ep_get_features,
+   .func_conf_select = ls_pcie_ep_func_conf_select,
+};
+
+static const struct ls_pcie_ep_drvdata ls1_ep_drvdata = {
+   .ops = _pcie_ep_ops,
+   .dw_pcie_ops = _ls_pcie_ep_ops,
+};
+
+static const struct ls_pcie_ep_drvdata ls2_ep_drvdata = {
+   .func_offset = 0x2,
+   .ops = _pcie_ep_ops,
+   .dw_pcie_ops = _ls_pcie_ep_ops,
+};
+
+static const struct of_device_id ls_pcie_ep_of_match[] = {
+   { .compatible = "fsl,ls1046a-pcie-ep", .data = _ep_drvdata },
+   { .compatible = "fsl,ls1088a-pcie-ep", .data = _ep_drvdata },
+   { .compatible = "fsl,ls2088a-pcie-ep", .data = _ep_drvdata },
+   { },
 };
 
 static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
@@ -103,7 +134,7 @@ static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie,
int ret;
 
ep = >ep;
-   ep->ops = _ep_ops;
+   ep->ops = pcie->drvdata->ops;
 
res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "addr_space");
if (!res)
@@ -142,20 +173,23 @@ static int __init ls_pcie_ep_probe(struct platform_device 
*pdev)
if (!ls_epc)
return -ENOMEM;
 
-   dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
-   pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
-   if (IS_ERR(pci->dbi_base))
-   return PTR_ERR(pci->dbi_base);
+   pcie->drvdata = of_device_get_match_data(dev);
 
-   pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET;
pci->dev = dev;
-   pci->ops = _pcie_ep_ops;
-   pcie->pci = pci;
+   pci->ops = pcie->drvdata->dw_pcie_ops;
 
ls_epc->bar_fixed_64bit = (1 << BAR_2) | (1 << BAR_4),
 
+   pcie->pci = pci;
pcie->ls_epc = ls_epc;
 
+   dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
+   pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base);
+   if (IS_ERR(pci->dbi_base))
+   return PTR_ERR(pci->dbi_base);
+
+   pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET;
+
platform_set_drvdata(pdev, pcie);
 
ret = ls_add_pcie_ep(pcie, pdev);
-- 
2.9.5

[PATCH v5 08/11] PCI: layerscape: Modify the MSIX to the doorbell mode

2020-03-06 Thread Xiaowei Bao

dw_pcie_ep_raise_msix_irq was never called in the exisitng driver
before, because the ls1046a platform don't support the MSIX feature
and msix_capable was always set to false.
Now that add the ls1088a platform with MSIX support, use the doorbell
method to support the MSIX feature.

Signed-off-by: Xiaowei Bao 
Reviewed-by: Andrew Murray 
---
v2: 
 - No change
v3:
 - Modify the commit message make it clearly.
v4: 
 - No change
v5: 
 - Modify the commit message.

 drivers/pci/controller/dwc/pci-layerscape-ep.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c 
b/drivers/pci/controller/dwc/pci-layerscape-ep.c
index 9601f9c..bfab1c6 100644
--- a/drivers/pci/controller/dwc/pci-layerscape-ep.c
+++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c
@@ -79,7 +79,8 @@ static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 
func_no,
case PCI_EPC_IRQ_MSI:
return dw_pcie_ep_raise_msi_irq(ep, func_no, interrupt_num);
case PCI_EPC_IRQ_MSIX:
-   return dw_pcie_ep_raise_msix_irq(ep, func_no, interrupt_num);
+   return dw_pcie_ep_raise_msix_irq_doorbell(ep, func_no,
+ interrupt_num);
default:
dev_err(pci->dev, "UNKNOWN IRQ type\n");
return -EINVAL;
-- 
2.9.5

Re: [PATCH v3] powerpc: setup_64: set up PACA earlier to avoid kcov problems

2020-03-06 Thread Nicholas Piggin

Daniel Axtens's on March 6, 2020 5:30 pm:
> kcov instrumentation is collected the __sanitizer_cov_trace_pc hook in
> kernel/kcov.c. The compiler inserts these hooks into every basic block
> unless kcov is disabled for that file.
> 
> We then have a deep call-chain:
>  - __sanitizer_cov_trace_pc calls to check_kcov_mode()
>  - check_kcov_mode() (kernel/kcov.c) calls in_task()
>  - in_task() (include/linux/preempt.h) calls preempt_count().
>  - preempt_count() (include/asm-generic/preempt.h) calls
>  current_thread_info()
>  - because powerpc has THREAD_INFO_IN_TASK, current_thread_info()
>  (include/linux/thread_info.h) is defined to 'current'
>  - current (arch/powerpc/include/asm/current.h) is defined to
>  get_current().
>  - get_current (same file) loads an offset of r13.
>  - arch/powerpc/include/asm/paca.h makes r13 a register variable
>  called local_paca - it is the PACA for the current CPU, so
>  this has the effect of loading the current task from PACA.
>  - get_current returns the current task from PACA,
>  - current_thread_info returns the task cast to a thread_info
>  - preempt_count dereferences the thread_info to load preempt_count
>  - that value is used by in_task and so on up the chain
> 
> The problem is:
> 
>  - kcov instrumentation is enabled for arch/powerpc/kernel/dt_cpu_ftrs.c
> 
>  - even if it were not, dt_cpu_ftrs_init calls generic dt parsing code
>which should definitely have instrumentation enabled.
> 
>  - setup_64.c calls dt_cpu_ftrs_init before it sets up a PACA.
> 
>  - If we don't set up a paca, r13 will contain unpredictable data.
> 
>  - In a zImage compiled with kcov and KASAN, we see r13 containing a value
>that leads to dereferencing invalid memory (something like
>912a72603d420015).
> 
>  - Weirdly, the same kernel as a vmlinux loaded directly by qemu does not
>crash. Investigating with gdb, it seems that in the vmlinux boot case,
>r13 is near enough to zero that we just happen to be able to read that
>part of memory (we're operating with translation off at this point) and
>the current pointer also happens to land in readable memory and
>everything just works.
> 
>  - PACA setup refers to CPU features - setup_paca() looks at
>early_cpu_has_feature(CPU_FTR_HVMODE)
> 
> There's no generic kill switch for kcov (as far as I can tell), and we
> don't want to have to turn off instrumentation in the generic dt parsing
> code (which lives outside arch/powerpc/) just because we don't have a real
> paca or task yet.
> 
> So:
>  - change the test when setting up a PACA to consider the actual value of
>the MSR rather than the CPU feature.
> 
>  - move the PACA setup to before the cpu feature parsing.

Hmm. Problem is that equally we want PACA to be sane before we call too
far into the rest of the kernel ("generic dt parsing code").

Does KCOV really need to instrument code this early? If not, then
change

if (!in_task())
return false;

to

if (system_state != SYSTEM_RUNNING)
return false;
if (!in_task())
return false;

If it does need to instrument early, then something like

if (system_state >= SYSTEM_SCHEDULING && !in_task())
return false;

Thanks,
Nick

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai




> On Mar 6, 2020, at 7:56 PM, Anshuman Khandual  
> wrote:
> 
> 
> 
> On 03/07/2020 06:04 AM, Qian Cai wrote:
>> 
>> 
>>> On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  
>>> wrote:
>>> 
>>> Hmm, set_pte_at() function is not preferred here for these tests. The idea
>>> is to avoid or atleast minimize TLB/cache flushes triggered from these sort
>>> of 'static' tests. set_pte_at() is platform provided and could/might trigger
>>> these flushes or some other platform specific synchronization stuff. Just
>> 
>> Why is that important for this debugging option?
> 
> Primarily reason is to avoid TLB/cache flush instructions on the system
> during these tests that only involve transforming different page table
> level entries through helpers. Unless really necessary, why should it
> emit any TLB/cache flush instructions ?
> 
>> 
>>> wondering is there specific reason with respect to the soft lock up problem
>>> making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?
>> 
>> Looks at the s390 version of set_pte_at(), it has this comment,
>> vmaddr);
>> 
>> /*
>> * Certain architectures need to do special things when PTEs
>> * within a page table are directly modified.  Thus, the following
>> * hook is made available.
>> */
>> 
>> I can only guess that powerpc  could be the same here.
> 
> This comment is present in multiple platforms while defining set_pte_at().
> Is not 'barrier()' here alone good enough ? Else what exactly set_pte_at()

No, barrier() is not enough.

> does as compared to WRITE_ONCE() that avoids the soft lock up, just trying
> to understand.

I surely can spend hours to figure which exact things in set_pte_at() is 
necessary for
pte_clear() not to stuck, and then propose a solution and possible need to 
retest on
multiple arches. I am not sure if that is a good use of my time just to saving
a few TLB/cache flush on a debug kernel?

Re: [PATCH] powerpc/64: BE option to use ELFv2 ABI for big endian kernels

2020-03-06 Thread Nicholas Piggin

Segher Boessenkool's on March 5, 2020 8:55 pm:
> On Thu, Mar 05, 2020 at 01:34:22PM +1000, Nicholas Piggin wrote:
>> Segher Boessenkool's on March 4, 2020 9:09 am:
>> >> +override flavour := linux-ppc64v2
>> > 
>> > That isn't a good name, heh.  This isn't "v2" of anything...  Spell out
>> > the name "ELFv2"?  Or as "elfv2"?  It is just a name after all, it is
>> > version 1 in all three version fields in the ELF headers!
>> 
>> Yeah okay. This part is only for some weird little perl asm generator
>> script, but probably better to be careful. linux-ppc64-elfv2 ?
> 
> That generator is from openssl, or inspired by it, it is everywhere.
> So it is more important to get it right than it would seem at first
> glance ;-)
> 
> That name looks perfect to me.  You'll have to update REs expecting the
> arch at the end (like /le$/), but you had to already I think?

le$ is still okay for testing ppc64le, unless you wanted me to add the 
-elfv2 suffix on there as well? If that was the case, for consistency
we'd also have to add -elfv1 for the BE v1 case. I was just going to add
-elfv2 for the new variant.

Thanks,
Nick

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Anshuman Khandual

On 03/07/2020 06:04 AM, Qian Cai wrote:
> 
> 
>> On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  
>> wrote:
>>
>> Hmm, set_pte_at() function is not preferred here for these tests. The idea
>> is to avoid or atleast minimize TLB/cache flushes triggered from these sort
>> of 'static' tests. set_pte_at() is platform provided and could/might trigger
>> these flushes or some other platform specific synchronization stuff. Just
> 
> Why is that important for this debugging option?

Primarily reason is to avoid TLB/cache flush instructions on the system
during these tests that only involve transforming different page table
level entries through helpers. Unless really necessary, why should it
emit any TLB/cache flush instructions ?

> 
>> wondering is there specific reason with respect to the soft lock up problem
>> making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?
> 
> Looks at the s390 version of set_pte_at(), it has this comment,
> vmaddr);
> 
> /*
>  * Certain architectures need to do special things when PTEs
>  * within a page table are directly modified.  Thus, the following
>  * hook is made available.
>  */
> 
> I can only guess that powerpc  could be the same here.

This comment is present in multiple platforms while defining set_pte_at().
Is not 'barrier()' here alone good enough ? Else what exactly set_pte_at()
does as compared to WRITE_ONCE() that avoids the soft lock up, just trying
to understand.

[PATCH] Fix powerpc/64: system call zero volatile registers when returning

2020-03-06 Thread Nicholas Piggin

Here's an incremental fix that can be folded into the patch.

Segher Boessenkool's on February 26, 2020 7:20 am:
> Hi!
> 
> On Wed, Feb 26, 2020 at 03:35:35AM +1000, Nicholas Piggin wrote:
>> Kernel addresses and potentially other sensitive data could be leaked
>> in volatile registers after a syscall.
> 
>>  cmpdi   r3,0
>>  bne .Lsyscall_restore_regs
>> +li  r0,0
>> +li  r4,0
>> +li  r5,0
>> +li  r6,0
>> +li  r7,0
>> +li  r8,0
>> +li  r9,0
>> +li  r10,0
>> +li  r11,0
>> +li  r12,0
>> +mtctr   r0
>> +mtspr   SPRN_XER,r0
>>  .Lsyscall_restore_regs_cont:
> 
> What about LR?  Is that taken care of later?
> 
> This also deserves a big fat comment imo, it is very important after
> all, and not so obvious.
> 
> 
> Segher
> 

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 0e2c56573a41..ea534375250b 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -135,6 +135,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 
cmpdi   r3,0
bne .Lsyscall_restore_regs
+   /* Zero volatile regs that may contain sensitive kernel data */
li  r0,0
li  r4,0
li  r5,0
-- 
2.23.0

[PATCH v3 9/9] Documentation/powerpc: VAS API

2020-03-06 Thread Haren Myneni



Power9 introduced Virtual Accelerator Switchboard (VAS) which allows
userspace to communicate with Nest Accelerator (NX) directly. But
kernel has to establish channel to NX for userspace. This document
describes user space API that application can use to establish
communication channel.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/vas-api.rst | 246 ++
 2 files changed, 247 insertions(+)
 create mode 100644 Documentation/powerpc/vas-api.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 0d45f0f..afe2d5e 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -30,6 +30,7 @@ powerpc
 syscall64-abi
 transactional_memory
 ultravisor
+vas-api
 
 .. only::  subproject and html
 
diff --git a/Documentation/powerpc/vas-api.rst 
b/Documentation/powerpc/vas-api.rst
new file mode 100644
index 000..13ce4e7
--- /dev/null
+++ b/Documentation/powerpc/vas-api.rst
@@ -0,0 +1,246 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _VAS-API:
+
+===
+Virtual Accelerator Switchboard (VAS) userspace API
+===
+
+Introduction
+
+
+Power9 processor introduced Virtual Accelerator Switchboard (VAS) which
+allows both userspace and kernel communicate to co-processor
+(hardware accelerator) referred to as the Nest Accelerator (NX). The NX
+unit comprises of one or more hardware engines or co-processor types
+such as 842 compression, GZIP compression and encryption. On power9,
+userspace applications will have access to only GZIP Compression engine
+which supports ZLIB and GZIP compression algorithms in the hardware.
+
+To communicate with NX, kernel has to establish a channel or window and
+then requests can be submitted directly without kernel involvement.
+Requests to the GZIP engine must be formatted as a co-processor Request
+Block (CRB) and these CRBs must be submitted to the NX using COPY/PASTE
+instructions to paste the CRB to hardware address that is associated with
+the engine's request queue.
+
+The GZIP engine provides two priority levels of requests: Normal and
+High. Only Normal requests are supported from userspace right now.
+
+This document explains userspace API that is used to interact with
+kernel to setup channel / window which can be used to send compression
+requests directly to NX accelerator.
+
+
+Overview
+
+
+Application access to the GZIP engine is provided through
+/dev/crypto/nx-gzip device node implemented by the VAS/NX device driver.
+An application must open the /dev/crypto/nx-gzip device to obtain a file
+descriptor (fd). Then should issue VAS_TX_WIN_OPEN ioctl with this fd to
+establish connection to the engine. It means send window is opened on GZIP
+engine for this process. Once a connection is established, the application
+should use the mmap() system call to map the hardware address of engine's
+request queue into the application's virtual address space.
+
+The application can then submit one or more requests to the the engine by
+using copy/paste instructions and pasting the CRBs to the virtual address
+(aka paste_address) returned by mmap(). User space can close the
+established connection or send window by closing the file descriptior
+(close(fd)) or upon the process exit.
+
+Note that applications can send several requests with the same window or
+can establish multiple windows, but one window for each file descriptor.
+
+Following sections provide additional details and references about the
+individual steps.
+
+NX-GZIP Device Node
+===
+
+There is one /dev/crypto/nx-gzip node in the system and it provides
+access to all GZIP engines in the system. The only valid operations on
+/dev/crypto/nx-gzip are:
+
+   * open() the device for read and write.
+   * issue VAS_TX_WIN_OPEN ioctl
+   * mmap() the engine's request queue into application's virtual
+ address space (i.e. get a paste_address for the co-processor
+ engine).
+   * close the device node.
+
+Other file operations on this device node are undefined.
+
+Note that the copy and paste operations go directly to the hardware and
+do not go through this device. Refer COPY/PASTE document for more
+details.
+
+Although a system may have several instances of the NX co-processor
+engines (typically, one per P9 chip) there is just one
+/dev/crypto/nx-gzip device node in the system. When the nx-gzip device
+node is opened, Kernel opens send window on a suitable instance of NX
+accelerator. It finds CPU on which the user process is executing and
+determine the NX instance for the corresponding chip on which this CPU
+belongs.
+
+Applications may chose a specific instance of the NX co-processor using
+the vas_id field in the VAS_TX_WIN_OPEN ioctl as detailed below.
+
+A userspace

[PATCH v3 8/9] crypto/nx: Remove 'pid' in vas_tx_win_attr struct

2020-03-06 Thread Haren Myneni



When window is opened, pid reference is taken for user space
windows. Not needed for kernel windows. So remove 'pid' in
vas_tx_win_attr struct.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h| 1 -
 drivers/crypto/nx/nx-common-powernv.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index e064953..994db6f 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -86,7 +86,6 @@ struct vas_tx_win_attr {
int wcreds_max;
int lpid;
int pidr;   /* hardware PID (from SPRN_PID) */
-   int pid;/* linux process id */
int pswid;
int rsvd_txbuf_count;
int tc_mode;
diff --git a/drivers/crypto/nx/nx-common-powernv.c 
b/drivers/crypto/nx/nx-common-powernv.c
index f570691..38333e4 100644
--- a/drivers/crypto/nx/nx-common-powernv.c
+++ b/drivers/crypto/nx/nx-common-powernv.c
@@ -692,7 +692,6 @@ static struct vas_window *nx_alloc_txwin(struct nx_coproc 
*coproc)
 */
vas_init_tx_win_attr(, coproc->ct);
txattr.lpid = 0;/* lpid is 0 for kernel requests */
-   txattr.pid = 0; /* pid is 0 for kernel requests */
 
/*
 * Open a VAS send window which is used to send request to NX.
-- 
1.8.3.1

[PATCH v3 7/9] crypto/nx: Enable and setup GZIP compresstion type

2020-03-06 Thread Haren Myneni



Changes to probe GZIP device-tree nodes, open RX windows and setup
GZIP compression type. No plans to provide GZIP usage in kernel right
now, but this patch enables GZIP for user space usage.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-common-powernv.c | 43 ++-
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/crypto/nx/nx-common-powernv.c 
b/drivers/crypto/nx/nx-common-powernv.c
index 82dfa60..f570691 100644
--- a/drivers/crypto/nx/nx-common-powernv.c
+++ b/drivers/crypto/nx/nx-common-powernv.c
@@ -65,6 +65,7 @@ struct nx_coproc {
  * Using same values as in skiboot or coprocessor type representing
  * in NX workbook.
  */
+#define NX_CT_GZIP (2) /* on P9 and later */
 #define NX_CT_842  (3)
 
 static int (*nx842_powernv_exec)(const unsigned char *in,
@@ -819,6 +820,9 @@ static int __init vas_cfg_coproc_info(struct device_node 
*dn, int chip_id,
if (type == NX_CT_842)
ret = nx_set_ct(coproc, priority, VAS_COP_TYPE_842_HIPRI,
VAS_COP_TYPE_842);
+   else if (type == NX_CT_GZIP)
+   ret = nx_set_ct(coproc, priority, VAS_COP_TYPE_GZIP_HIPRI,
+   VAS_COP_TYPE_GZIP);
 
if (ret)
goto err_out;
@@ -867,12 +871,16 @@ static int __init vas_cfg_coproc_info(struct device_node 
*dn, int chip_id,
return ret;
 }
 
-static int __init nx_coproc_init(int chip_id, int ct_842)
+static int __init nx_coproc_init(int chip_id, int ct_842, int ct_gzip)
 {
int ret = 0;
 
if (opal_check_token(OPAL_NX_COPROC_INIT)) {
ret = opal_nx_coproc_init(chip_id, ct_842);
+
+   if (!ret)
+   ret = opal_nx_coproc_init(chip_id, ct_gzip);
+
if (ret) {
ret = opal_error_code(ret);
pr_err("Failed to initialize NX for chip(%d): %d\n",
@@ -902,8 +910,8 @@ static int __init find_nx_device_tree(struct device_node 
*dn, int chip_id,
 static int __init nx_powernv_probe_vas(struct device_node *pn)
 {
int chip_id, vasid, ret = 0;
+   int ct_842 = 0, ct_gzip = 0;
struct device_node *dn;
-   int ct_842 = 0;
 
chip_id = of_get_ibm_chip_id(pn);
if (chip_id < 0) {
@@ -920,19 +928,24 @@ static int __init nx_powernv_probe_vas(struct device_node 
*pn)
for_each_child_of_node(pn, dn) {
ret = find_nx_device_tree(dn, chip_id, vasid, NX_CT_842,
"ibm,p9-nx-842", _842);
+
+   if (!ret)
+   ret = find_nx_device_tree(dn, chip_id, vasid,
+   NX_CT_GZIP, "ibm,p9-nx-gzip", _gzip);
+
if (ret)
return ret;
}
 
-   if (!ct_842) {
-   pr_err("NX842 FIFO nodes are missing\n");
+   if (!ct_842 || !ct_gzip) {
+   pr_err("NX FIFO nodes are missing\n");
return -EINVAL;
}
 
/*
 * Initialize NX instance for both high and normal priority FIFOs.
 */
-   ret = nx_coproc_init(chip_id, ct_842);
+   ret = nx_coproc_init(chip_id, ct_842, ct_gzip);
 
return ret;
 }
@@ -1072,10 +1085,19 @@ static __init int nx_compress_powernv_init(void)
nx842_powernv_exec = nx842_exec_icswx;
} else {
/*
+* Register VAS user space API for NX GZIP so
+* that user space can use GZIP engine.
+* 842 compression is supported only in kernel.
+*/
+   ret = vas_register_coproc_api(THIS_MODULE);
+
+   /*
 * GZIP is not supported in kernel right now.
 * So open tx windows only for 842.
 */
-   ret = nx_open_percpu_txwins();
+   if (!ret)
+   ret = nx_open_percpu_txwins();
+
if (ret) {
nx_delete_coprocs();
return ret;
@@ -1096,6 +1118,15 @@ static __init int nx_compress_powernv_init(void)
 
 static void __exit nx_compress_powernv_exit(void)
 {
+   /*
+* GZIP engine is supported only in power9 or later and nx842_ct
+* is used on power8 (icswx).
+* VAS API for NX GZIP is registered during init for user space
+* use. So delete this API use for GZIP engine.
+*/
+   if (!nx842_ct)
+   vas_unregister_coproc_api();
+
crypto_unregister_alg(_powernv_alg);
 
nx_delete_coprocs();
-- 
1.8.3.1

[PATCH v3 6/9] crypto/NX: Make enable code generic to add new GZIP compression type

2020-03-06 Thread Haren Myneni



Make setup and enable code generic to support new GZIP compression type.
Changed nx842 reference to nx and moved some code to new functions.
Functionality is not changed except sparse warning fix - setting NULL
instead of 0 for per_cpu send window in nx_delete_coprocs().

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-common-powernv.c | 161 +-
 1 file changed, 101 insertions(+), 60 deletions(-)

diff --git a/drivers/crypto/nx/nx-common-powernv.c 
b/drivers/crypto/nx/nx-common-powernv.c
index f42881f..82dfa60 100644
--- a/drivers/crypto/nx/nx-common-powernv.c
+++ b/drivers/crypto/nx/nx-common-powernv.c
@@ -40,9 +40,9 @@ struct nx842_workmem {
char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
 } __packed __aligned(WORKMEM_ALIGN);
 
-struct nx842_coproc {
+struct nx_coproc {
unsigned int chip_id;
-   unsigned int ct;
+   unsigned int ct;/* Can be 842 or GZIP high/normal*/
unsigned int ci;/* Coprocessor instance, used with icswx */
struct {
struct vas_window *rxwin;
@@ -58,9 +58,15 @@ struct nx842_coproc {
 static DEFINE_PER_CPU(struct vas_window *, cpu_txwin);
 
 /* no cpu hotplug on powernv, so this list never changes after init */
-static LIST_HEAD(nx842_coprocs);
+static LIST_HEAD(nx_coprocs);
 static unsigned int nx842_ct;  /* used in icswx function */
 
+/*
+ * Using same values as in skiboot or coprocessor type representing
+ * in NX workbook.
+ */
+#define NX_CT_842  (3)
+
 static int (*nx842_powernv_exec)(const unsigned char *in,
unsigned int inlen, unsigned char *out,
unsigned int *outlenp, void *workmem, int fc);
@@ -666,15 +672,15 @@ static int nx842_powernv_decompress(const unsigned char 
*in, unsigned int inlen,
  wmem, CCW_FC_842_DECOMP_CRC);
 }
 
-static inline void nx842_add_coprocs_list(struct nx842_coproc *coproc,
+static inline void nx_add_coprocs_list(struct nx_coproc *coproc,
int chipid)
 {
coproc->chip_id = chipid;
INIT_LIST_HEAD(>list);
-   list_add(>list, _coprocs);
+   list_add(>list, _coprocs);
 }
 
-static struct vas_window *nx842_alloc_txwin(struct nx842_coproc *coproc)
+static struct vas_window *nx_alloc_txwin(struct nx_coproc *coproc)
 {
struct vas_window *txwin = NULL;
struct vas_tx_win_attr txattr;
@@ -704,9 +710,9 @@ static struct vas_window *nx842_alloc_txwin(struct 
nx842_coproc *coproc)
  * cpu_txwin is used in copy/paste operation for each compression /
  * decompression request.
  */
-static int nx842_open_percpu_txwins(void)
+static int nx_open_percpu_txwins(void)
 {
-   struct nx842_coproc *coproc, *n;
+   struct nx_coproc *coproc, *n;
unsigned int i, chip_id;
 
for_each_possible_cpu(i) {
@@ -714,17 +720,18 @@ static int nx842_open_percpu_txwins(void)
 
chip_id = cpu_to_chip_id(i);
 
-   list_for_each_entry_safe(coproc, n, _coprocs, list) {
+   list_for_each_entry_safe(coproc, n, _coprocs, list) {
/*
 * Kernel requests use only high priority FIFOs. So
 * open send windows for these FIFOs.
+* GZIP is not supported in kernel right now.
 */
 
if (coproc->ct != VAS_COP_TYPE_842_HIPRI)
continue;
 
if (coproc->chip_id == chip_id) {
-   txwin = nx842_alloc_txwin(coproc);
+   txwin = nx_alloc_txwin(coproc);
if (IS_ERR(txwin))
return PTR_ERR(txwin);
 
@@ -743,13 +750,28 @@ static int nx842_open_percpu_txwins(void)
return 0;
 }
 
+static int __init nx_set_ct(struct nx_coproc *coproc, const char *priority,
+   int high, int normal)
+{
+   if (!strcmp(priority, "High"))
+   coproc->ct = high;
+   else if (!strcmp(priority, "Normal"))
+   coproc->ct = normal;
+   else {
+   pr_err("Invalid RxFIFO priority value\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int __init vas_cfg_coproc_info(struct device_node *dn, int chip_id,
-   int vasid, int *ct)
+   int vasid, int type, int *ct)
 {
struct vas_window *rxwin = NULL;
struct vas_rx_win_attr rxattr;
-   struct nx842_coproc *coproc;
u32 lpid, pid, tid, fifo_size;
+   struct nx_coproc *coproc;
u64 rx_fifo;
const char *priority;
int ret;
@@ -794,15 +816,12 @@ static int __init vas_cfg_coproc_info(struct device_node 
*dn, int chip_id,
if (!coproc)
return -ENOMEM;
 
-   if

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai

> On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  
> wrote:
> 
> Hmm, set_pte_at() function is not preferred here for these tests. The idea
> is to avoid or atleast minimize TLB/cache flushes triggered from these sort
> of 'static' tests. set_pte_at() is platform provided and could/might trigger
> these flushes or some other platform specific synchronization stuff. Just

Why is that important for this debugging option?

> wondering is there specific reason with respect to the soft lock up problem
> making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?

Looks at the s390 version of set_pte_at(), it has this comment,
vmaddr);

/*
 * Certain architectures need to do special things when PTEs
 * within a page table are directly modified.  Thus, the following
 * hook is made available.
 */

I can only guess that powerpc  could be the same here.

[PATCH v3 5/9] crypto/nx: Rename nx-842-powernv file name to nx-common-powernv

2020-03-06 Thread Haren Myneni



Rename nx-842-powernv.c to nx-common-powernv.c to add code for setup
and enable new GZIP compression type. The actual functionality is not
changed in this patch.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/Makefile|2 +-
 drivers/crypto/nx/nx-842-powernv.c| 1062 -
 drivers/crypto/nx/nx-common-powernv.c | 1062 +
 3 files changed, 1063 insertions(+), 1063 deletions(-)
 delete mode 100644 drivers/crypto/nx/nx-842-powernv.c
 create mode 100644 drivers/crypto/nx/nx-common-powernv.c

diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index 015155d..bc89a20 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -15,4 +15,4 @@ obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += 
nx-compress-pseries.o nx-compres
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o 
nx-compress.o
 nx-compress-objs := nx-842.o
 nx-compress-pseries-objs := nx-842-pseries.o
-nx-compress-powernv-objs := nx-842-powernv.o
+nx-compress-powernv-objs := nx-common-powernv.o
diff --git a/drivers/crypto/nx/nx-842-powernv.c 
b/drivers/crypto/nx/nx-842-powernv.c
deleted file mode 100644
index 8e63326..000
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ /dev/null
@@ -1,1062 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * Driver for IBM PowerNV 842 compression accelerator
- *
- * Copyright (C) 2015 Dan Streetman, IBM Corp
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include "nx-842.h"
-
-#include 
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Dan Streetman ");
-MODULE_DESCRIPTION("842 H/W Compression driver for IBM PowerNV processors");
-MODULE_ALIAS_CRYPTO("842");
-MODULE_ALIAS_CRYPTO("842-nx");
-
-#define WORKMEM_ALIGN  (CRB_ALIGN)
-#define CSB_WAIT_MAX   (5000) /* ms */
-#define VAS_RETRIES(10)
-
-struct nx842_workmem {
-   /* Below fields must be properly aligned */
-   struct coprocessor_request_block crb; /* CRB_ALIGN align */
-   struct data_descriptor_entry ddl_in[DDL_LEN_MAX]; /* DDE_ALIGN align */
-   struct data_descriptor_entry ddl_out[DDL_LEN_MAX]; /* DDE_ALIGN align */
-   /* Above fields must be properly aligned */
-
-   ktime_t start;
-
-   char padding[WORKMEM_ALIGN]; /* unused, to allow alignment */
-} __packed __aligned(WORKMEM_ALIGN);
-
-struct nx842_coproc {
-   unsigned int chip_id;
-   unsigned int ct;
-   unsigned int ci;/* Coprocessor instance, used with icswx */
-   struct {
-   struct vas_window *rxwin;
-   int id;
-   } vas;
-   struct list_head list;
-};
-
-/*
- * Send the request to NX engine on the chip for the corresponding CPU
- * where the process is executing. Use with VAS function.
- */
-static DEFINE_PER_CPU(struct vas_window *, cpu_txwin);
-
-/* no cpu hotplug on powernv, so this list never changes after init */
-static LIST_HEAD(nx842_coprocs);
-static unsigned int nx842_ct;  /* used in icswx function */
-
-static int (*nx842_powernv_exec)(const unsigned char *in,
-   unsigned int inlen, unsigned char *out,
-   unsigned int *outlenp, void *workmem, int fc);
-
-/**
- * setup_indirect_dde - Setup an indirect DDE
- *
- * The DDE is setup with the the DDE count, byte count, and address of
- * first direct DDE in the list.
- */
-static void setup_indirect_dde(struct data_descriptor_entry *dde,
-  struct data_descriptor_entry *ddl,
-  unsigned int dde_count, unsigned int byte_count)
-{
-   dde->flags = 0;
-   dde->count = dde_count;
-   dde->index = 0;
-   dde->length = cpu_to_be32(byte_count);
-   dde->address = cpu_to_be64(nx842_get_pa(ddl));
-}
-
-/**
- * setup_direct_dde - Setup single DDE from buffer
- *
- * The DDE is setup with the buffer and length.  The buffer must be properly
- * aligned.  The used length is returned.
- * Returns:
- *   NSuccessfully set up DDE with N bytes
- */
-static unsigned int setup_direct_dde(struct data_descriptor_entry *dde,
-unsigned long pa, unsigned int len)
-{
-   unsigned int l = min_t(unsigned int, len, LEN_ON_PAGE(pa));
-
-   dde->flags = 0;
-   dde->count = 0;
-   dde->index = 0;
-   dde->length = cpu_to_be32(l);
-   dde->address = cpu_to_be64(pa);
-
-   return l;
-}
-
-/**
- * setup_ddl - Setup DDL from buffer
- *
- * Returns:
- *   0 Successfully set up DDL
- */
-static int setup_ddl(struct data_descriptor_entry *dde,
-struct data_descriptor_entry *ddl,
-unsigned char *buf, unsigned int len,
-bool in)
-{
-   unsigned long pa = nx842_get_pa(buf);
-   int i, ret, total_len = len;
-
-   if (!IS_ALIGNED(pa, DDE_BUFFER_ALIGN)) {
-   pr_debug("%s buffer pa 0x%lx not

[PATCH v3 4/9] crypto/nx: Initialize coproc entry with kzalloc

2020-03-06 Thread Haren Myneni



coproc entry is initialized during NX probe on power9, but not on P8.
nx842_delete_coprocs() is used for both and frees receive window if it
is allocated. Getting crash for rmmod on P8 since coproc->vas.rxwin
is not initialized.

This patch replaces kmalloc with kzalloc in nx842_powernv_probe()

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-842-powernv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx-842-powernv.c 
b/drivers/crypto/nx/nx-842-powernv.c
index c037a24..8e63326 100644
--- a/drivers/crypto/nx/nx-842-powernv.c
+++ b/drivers/crypto/nx/nx-842-powernv.c
@@ -922,7 +922,7 @@ static int __init nx842_powernv_probe(struct device_node 
*dn)
return -EINVAL;
}
 
-   coproc = kmalloc(sizeof(*coproc), GFP_KERNEL);
+   coproc = kzalloc(sizeof(*coproc), GFP_KERNEL);
if (!coproc)
return -ENOMEM;
 
-- 
1.8.3.1

[PATCH v3 3/9] powerpc/vas: Add VAS user space API

2020-03-06 Thread Haren Myneni



On power9, userspace can send GZIP compression requests directly to NX
once kernel establishes NX channel / window with VAS. This patch provides
user space API which allows user space to establish channel using open
VAS_TX_WIN_OPEN ioctl, mmap and close operations.

Each window corresponds to file descriptor and application can open
multiple windows. After the window is opened, VAS_TX_WIN_OPEN icoctl to
open a window on specific VAS instance, mmap() system call to map
the hardware address of engine's request queue into the application's
virtual address space.

Then the application can then submit one or more requests to the the
engine by using the copy/paste instructions and pasting the CRBs to
the virtual address (aka paste_address) returned by mmap().

Only NX GZIP coprocessor type is supported right now and allow GZIP
engine access via /dev/crypto/nx-gzip device node.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h  |  11 ++
 arch/powerpc/platforms/powernv/Makefile |   2 +-
 arch/powerpc/platforms/powernv/vas-api.c| 290 
 arch/powerpc/platforms/powernv/vas-window.c |   6 +-
 arch/powerpc/platforms/powernv/vas.h|   2 +
 5 files changed, 307 insertions(+), 4 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/vas-api.c

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index f93e6b0..e064953 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -163,4 +163,15 @@ struct vas_window *vas_tx_win_open(int vasid, enum 
vas_cop_type cop,
  */
 int vas_paste_crb(struct vas_window *win, int offset, bool re);
 
+/*
+ * Register / unregister coprocessor type to VAS API which will be exported
+ * to user space. Applications can use this API to open / close window
+ * which can be used to send / receive requests directly to cooprcessor.
+ *
+ * Only NX GZIP coprocessor type is supported now, but this API can be
+ * used for others in future.
+ */
+int vas_register_coproc_api(struct module *mod);
+void vas_unregister_coproc_api(void);
+
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 395789f..fe3f0fb 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_MEMORY_FAILURE)  += opal-memory-errors.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
 obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
 obj-$(CONFIG_PPC_MEMTRACE) += memtrace.o
-obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o vas-fault.o
+obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o vas-fault.o vas-api.o
 obj-$(CONFIG_OCXL_BASE)+= ocxl.o
 obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
 obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
diff --git a/arch/powerpc/platforms/powernv/vas-api.c 
b/arch/powerpc/platforms/powernv/vas-api.c
new file mode 100644
index 000..3473a4a
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/vas-api.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * VAS user space API for its accelerators (Only NX-GZIP is supported now)
+ * Copyright (C) 2019 Haren Myneni, IBM Corp
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "vas.h"
+
+/*
+ * The driver creates the device node that can be used as follows:
+ * For NX-GZIP
+ *
+ * fd = open("/dev/crypto/nx-gzip", O_RDWR);
+ * rc = ioctl(fd, VAS_TX_WIN_OPEN, );
+ * paste_addr = mmap(NULL, PAGE_SIZE, prot, MAP_SHARED, fd, 0ULL).
+ * vas_copy(, 0, 1);
+ * vas_paste(paste_addr, 0, 1);
+ * close(fd) or exit process to close window.
+ *
+ * where "vas_copy" and "vas_paste" are defined in copy-paste.h.
+ * copy/paste returns to the user space directly. So refer NX hardware
+ * documententation for excat copy/paste usage and completion / error
+ * conditions.
+ */
+
+static char*coproc_dev_name = "nx-gzip";
+static atomic_tcoproc_instid = ATOMIC_INIT(0);
+
+/*
+ * Wrapper object for the nx-gzip device - there is just one instance of
+ * this node for the whole system.
+ */
+static struct coproc_dev {
+   struct cdev cdev;
+   struct device *device;
+   char *name;
+   dev_t devt;
+   struct class *class;
+} coproc_device;
+
+/*
+ * One instance per open of a nx-gzip device. Each coproc_instance is
+ * associated with a VAS window after the caller issues
+ * VAS_GZIP_TX_WIN_OPEN ioctl.
+ */
+struct coproc_instance {
+   int id;
+   struct vas_window *txwin;
+};
+
+static char *coproc_devnode(struct device *dev, umode_t *mode)
+{
+   return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));
+}
+
+static int coproc_open(struct inode *inode, struct file *fp)
+{
+   struct coproc_instance *instance;
+
+   instance = kzalloc(sizeof(*instance), GFP_KERNEL);
+   if (!instance)
+   return -ENOMEM;
+

[PATCH v3 2/9] powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API

2020-03-06 Thread Haren Myneni



Define the VAS_TX_WIN_OPEN ioctl interface for NX GZIP access
from user space. This interface is used to open GZIP send window and
mmap region which can be used by userspace to send requests to NX
directly with copy/paste instructions.

Signed-off-by: Haren Myneni 
---
 Documentation/userspace-api/ioctl/ioctl-number.rst |  1 +
 arch/powerpc/include/uapi/asm/vas-api.h| 22 ++
 2 files changed, 23 insertions(+)
 create mode 100644 arch/powerpc/include/uapi/asm/vas-api.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 2e91370..deabc73 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -287,6 +287,7 @@ Code  Seq#Include File  
 Comments
 'v'   00-1F  linux/fs.h  conflict!
 'v'   00-0F  linux/sonypi.h  conflict!
 'v'   00-0F  media/v4l2-subdev.h conflict!
+'v'   20-27  arch/powerpc/include/uapi/asm/vas-api.hVAS API
 'v'   C0-FF  linux/meye.hconflict!
 'w'   allCERN SCI 
driver
 'y'   00-1F  packet 
based user level communications
diff --git a/arch/powerpc/include/uapi/asm/vas-api.h 
b/arch/powerpc/include/uapi/asm/vas-api.h
new file mode 100644
index 000..fe95d67
--- /dev/null
+++ b/arch/powerpc/include/uapi/asm/vas-api.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright 2019 IBM Corp.
+ */
+
+#ifndef _UAPI_MISC_VAS_H
+#define _UAPI_MISC_VAS_H
+
+#include 
+
+#define VAS_MAGIC  'v'
+#define VAS_TX_WIN_OPEN_IOW(VAS_MAGIC, 0x20, struct 
vas_tx_win_open_attr)
+
+struct vas_tx_win_open_attr {
+   __u32   version;
+   __s16   vas_id; /* specific instance of vas or -1 for default */
+   __u16   reserved1;
+   __u64   flags;  /* Future use */
+   __u64   reserved2[6];
+};
+
+#endif /* _UAPI_MISC_VAS_H */
-- 
1.8.3.1

[PATCH v3 1/9] powerpc/vas: Initialize window attributes for GZIP coprocessor type

2020-03-06 Thread Haren Myneni



Initialize send and receive window attributes for GZIP high and
normal priority types.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index c60accd..e9ab851 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -817,7 +817,8 @@ void vas_init_rx_win_attr(struct vas_rx_win_attr *rxattr, 
enum vas_cop_type cop)
 {
memset(rxattr, 0, sizeof(*rxattr));
 
-   if (cop == VAS_COP_TYPE_842 || cop == VAS_COP_TYPE_842_HIPRI) {
+   if (cop == VAS_COP_TYPE_842 || cop == VAS_COP_TYPE_842_HIPRI ||
+   cop == VAS_COP_TYPE_GZIP || cop == VAS_COP_TYPE_GZIP_HIPRI) {
rxattr->pin_win = true;
rxattr->nx_win = true;
rxattr->fault_win = false;
@@ -892,7 +893,8 @@ void vas_init_tx_win_attr(struct vas_tx_win_attr *txattr, 
enum vas_cop_type cop)
 {
memset(txattr, 0, sizeof(*txattr));
 
-   if (cop == VAS_COP_TYPE_842 || cop == VAS_COP_TYPE_842_HIPRI) {
+   if (cop == VAS_COP_TYPE_842 || cop == VAS_COP_TYPE_842_HIPRI ||
+   cop == VAS_COP_TYPE_GZIP || cop == VAS_COP_TYPE_GZIP_HIPRI) {
txattr->rej_no_credit = false;
txattr->rx_wcred_mode = true;
txattr->tx_wcred_mode = true;
@@ -976,9 +978,14 @@ static bool tx_win_args_valid(enum vas_cop_type cop,
if (attr->wcreds_max > VAS_TX_WCREDS_MAX)
return false;
 
-   if (attr->user_win &&
-   (cop != VAS_COP_TYPE_FTW || attr->rsvd_txbuf_count))
-   return false;
+   if (attr->user_win) {
+   if (attr->rsvd_txbuf_count)
+   return false;
+
+   if (cop != VAS_COP_TYPE_FTW && cop != VAS_COP_TYPE_GZIP &&
+   cop != VAS_COP_TYPE_GZIP_HIPRI)
+   return false;
+   }
 
return true;
 }
-- 
1.8.3.1

[PATCH v3 0/9] crypto/nx: Enable GZIP engine and provide userpace API

2020-03-06 Thread Haren Myneni



Power9 processor supports Virtual Accelerator Switchboard (VAS) which
allows kernel and userspace to send compression requests to Nest
Accelerator (NX) directly. The NX unit comprises of 2 842 compression
engines and 1 GZIP engine. Linux kernel already has 842 compression
support on kernel. This patch series adds GZIP compression support
from user space. The GZIP Compression engine implements the ZLIB and
GZIP compression algorithms. No plans of adding NX-GZIP compression
support in kernel right now.

Applications can send requests to NX directly with COPY/PASTE
instructions. But kernel has to establish channel / window on NX-GZIP
device for the userspace. So userspace access to the GZIP engine is
provided through /dev/crypto/nx-gzip device with several operations.

An application must open the this device to obtain a file descriptor (fd).
Using the fd, application should issue the VAS_TX_WIN_OPEN ioctl to
establish a connection to the engine. Once window is opened, should use
mmap() system call to map the hardware address of engine's request queue
into the application's virtual address space. Then user space forms the
request as co-processor Request Block (CRB) and paste this CRB on the
mapped HW address using COPY/PASTE instructions. Application can poll
on status flags (part of CRB) with timeout for request completion.

For VAS_TX_WIN_OPEN ioctl, if user space passes vas_id = -1 (struct
vas_tx_win_open_attr), kernel determines the VAS instance on the
corresponding chip based on the CPU on which the process is executing.
Otherwise, the specified VAS instance is used if application passes the
proper VAS instance (vas_id listed in /proc/device-tree/vas@*/ibm,vas_id).

Process can open multiple windows with different FDs or can send several
requests to NX on the same window at the same time.

A userspace library libnxz is available:
https://github.com/abalib/power-gzip

Applications that use inflate/deflate calls can link with libNXz and use
NX GZIP compression without any modification.

Tested the available 842 compression on power8 and power9 system to make
sure no regression and tested GZIP compression on power9 with tests
available in the above link.

Thanks to Bulent Abali for nxz library and tests development.

Changelog:
V2:
  - Move user space API code to powerpc as suggested. Also this API
can be extended to any other coprocessor type that VAS can support
in future. Example: Fast thread wakeup feature from VAS
  - Rebased to 5.6-rc3

V3:
  - Fix sparse warnings (patches 3&6)

Haren Myneni (9):
  powerpc/vas: Initialize window attributes for GZIP coprocessor type
  powerpc/vas: Define VAS_TX_WIN_OPEN ioctl API
  powerpc/vas: Add VAS user space API
  crypto/nx: Initialize coproc entry with kzalloc
  crypto/nx: Rename nx-842-powernv file name to nx-common-powernv
  crypto/NX: Make enable code generic to add new GZIP compression type
  crypto/nx: Enable and setup GZIP compresstion type
  crypto/nx: Remove 'pid' in vas_tx_win_attr struct
  Documentation/powerpc: VAS API

 Documentation/powerpc/index.rst|1 +
 Documentation/powerpc/vas-api.rst  |  246 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
 arch/powerpc/include/asm/vas.h |   12 +-
 arch/powerpc/include/uapi/asm/vas-api.h|   22 +
 arch/powerpc/platforms/powernv/Makefile|2 +-
 arch/powerpc/platforms/powernv/vas-api.c   |  290 +
 arch/powerpc/platforms/powernv/vas-window.c|   23 +-
 arch/powerpc/platforms/powernv/vas.h   |2 +
 drivers/crypto/nx/Makefile |2 +-
 drivers/crypto/nx/nx-842-powernv.c | 1062 --
 drivers/crypto/nx/nx-common-powernv.c  | 1133 
 12 files changed, 1723 insertions(+), 1073 deletions(-)
 create mode 100644 Documentation/powerpc/vas-api.rst
 create mode 100644 arch/powerpc/include/uapi/asm/vas-api.h
 create mode 100644 arch/powerpc/platforms/powernv/vas-api.c
 delete mode 100644 drivers/crypto/nx/nx-842-powernv.c
 create mode 100644 drivers/crypto/nx/nx-common-powernv.c

-- 
1.8.3.1

Re: [PATCH] Add OPAL_GET_SYMBOL / OPAL_LOOKUP_SYMBOL

2020-03-06 Thread Nicholas Piggin

Oliver O'Halloran's on March 6, 2020 12:42 pm:
> On Fri, Feb 28, 2020 at 2:09 PM Nicholas Piggin  wrote:
>>
>> These calls can be used by Linux to annotate BUG addresses with symbols,
>> look up symbol addresses in xmon, etc.
>>
>> This is preferable over having Linux parse the OPAL symbol map itself,
>> because OPAL's parsing code already exists for its own symbol printing,
>> and it can support other code regions than the skiboot symbols, e.g.,
>> the wake-up code in the HOMER (where CPUs have been seen to get stuck).
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  core/opal.c |  2 +
>>  core/utils.c| 92 +++--
>>  doc/opal-api/opal-get-symbol-181.rst| 42 +++
>>  doc/opal-api/opal-lookup-symbol-182.rst | 35 ++
>>  include/opal-api.h  |  4 +-
>>  5 files changed, 168 insertions(+), 7 deletions(-)
>>  create mode 100644 doc/opal-api/opal-get-symbol-181.rst
>>  create mode 100644 doc/opal-api/opal-lookup-symbol-182.rst
>>
>> diff --git a/core/opal.c b/core/opal.c
>> index d6ff6027b..d9fc4fe05 100644
>> --- a/core/opal.c
>> +++ b/core/opal.c
>> @@ -142,6 +142,8 @@ int64_t opal_entry_check(struct stack_frame *eframe)
>> case OPAL_CEC_REBOOT:
>> case OPAL_CEC_REBOOT2:
>> case OPAL_SIGNAL_SYSTEM_RESET:
>> +   case OPAL_GET_SYMBOL:
>> +   case OPAL_LOOKUP_SYMBOL:
> 
> These names are still awful :|

Ah yeah sorry I said I'd change them didn't I? Anyway don't merge before
the kernel is on board with the idea.

Thanks,
Nick

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Anshuman Khandual




On 03/07/2020 02:14 AM, Qian Cai wrote:
> On Fri, 2020-03-06 at 05:27 +0530, Anshuman Khandual wrote:
>> This adds tests which will validate architecture page table helpers and
>> other accessors in their compliance with expected generic MM semantics.
>> This will help various architectures in validating changes to existing
>> page table helpers or addition of new ones.
>>
>> This test covers basic page table entry transformations including but not
>> limited to old, young, dirty, clean, write, write protect etc at various
>> level along with populating intermediate entries with next page table page
>> and validating them.
>>
>> Test page table pages are allocated from system memory with required size
>> and alignments. The mapped pfns at page table levels are derived from a
>> real pfn representing a valid kernel text symbol. This test gets called
>> inside kernel_init() right after async_synchronize_full().
>>
>> This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any
>> architecture, which is willing to subscribe this test will need to select
>> ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390
>> and ppc32 platforms where the test is known to build and run successfully.
>> Going forward, other architectures too can subscribe the test after fixing
>> any build or runtime problems with their page table helpers. Meanwhile for
>> better platform coverage, the test can also be enabled with CONFIG_EXPERT
>> even without ARCH_HAS_DEBUG_VM_PGTABLE.
>>
>> Folks interested in making sure that a given platform's page table helpers
>> conform to expected generic MM semantics should enable the above config
>> which will just trigger this test during boot. Any non conformity here will
>> be reported as an warning which would need to be fixed. This test will help
>> catch any changes to the agreed upon semantics expected from generic MM and
>> enable platforms to accommodate it thereafter.
> 
> OK, I get this working on powerpc hash MMU as well, so this?
> 
> diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> index 64d0f9b15c49..c527d05c0459 100644
> --- a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> +++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> @@ -22,8 +22,7 @@
>  |   nios2: | TODO |
>  |openrisc: | TODO |
>  |  parisc: | TODO |
> -|  powerpc/32: |  ok  |
> -|  powerpc/64: | TODO |
> +| powerpc: |  ok  |
>  |   riscv: | TODO |
>  |s390: |  ok  |
>  |  sh: | TODO |
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2e7eee523ba1..176930f40e07 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -116,7 +116,7 @@ config PPC
>   #
>   select ARCH_32BIT_OFF_T if PPC32
>   select ARCH_HAS_DEBUG_VIRTUAL
> - select ARCH_HAS_DEBUG_VM_PGTABLE if PPC32
> + select ARCH_HAS_DEBUG_VM_PGTABLE
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_FORTIFY_SOURCE
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 96a91bda3a85..98990a515268 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -256,7 +256,8 @@ static void __init pte_clear_tests(struct mm_struct *mm,
> pte_t *ptep,
>   pte_t pte = READ_ONCE(*ptep);
>  
>   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> - WRITE_ONCE(*ptep, pte);
> + set_pte_at(mm, vaddr, ptep, pte);

Hmm, set_pte_at() function is not preferred here for these tests. The idea
is to avoid or atleast minimize TLB/cache flushes triggered from these sort
of 'static' tests. set_pte_at() is platform provided and could/might trigger
these flushes or some other platform specific synchronization stuff. Just
wondering is there specific reason with respect to the soft lock up problem
making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?

> + barrier();
>   pte_clear(mm, vaddr, ptep);
>   pte = READ_ONCE(*ptep);
>   WARN_ON(!pte_none(pte));
>

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai

On Fri, 2020-03-06 at 05:27 +0530, Anshuman Khandual wrote:
> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> inside kernel_init() right after async_synchronize_full().
> 
> This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any
> architecture, which is willing to subscribe this test will need to select
> ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390
> and ppc32 platforms where the test is known to build and run successfully.
> Going forward, other architectures too can subscribe the test after fixing
> any build or runtime problems with their page table helpers. Meanwhile for
> better platform coverage, the test can also be enabled with CONFIG_EXPERT
> even without ARCH_HAS_DEBUG_VM_PGTABLE.
> 
> Folks interested in making sure that a given platform's page table helpers
> conform to expected generic MM semantics should enable the above config
> which will just trigger this test during boot. Any non conformity here will
> be reported as an warning which would need to be fixed. This test will help
> catch any changes to the agreed upon semantics expected from generic MM and
> enable platforms to accommodate it thereafter.

OK, I get this working on powerpc hash MMU as well, so this?

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
index 64d0f9b15c49..c527d05c0459 100644
--- a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -22,8 +22,7 @@
 |   nios2: | TODO |
 |openrisc: | TODO |
 |  parisc: | TODO |
-|  powerpc/32: |  ok  |
-|  powerpc/64: | TODO |
+| powerpc: |  ok  |
 |   riscv: | TODO |
 |s390: |  ok  |
 |  sh: | TODO |
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2e7eee523ba1..176930f40e07 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -116,7 +116,7 @@ config PPC
    #
    select ARCH_32BIT_OFF_T if PPC32
    select ARCH_HAS_DEBUG_VIRTUAL
-   select ARCH_HAS_DEBUG_VM_PGTABLE if PPC32
+   select ARCH_HAS_DEBUG_VM_PGTABLE
    select ARCH_HAS_DEVMEM_IS_ALLOWED
    select ARCH_HAS_ELF_RANDOMIZE
    select ARCH_HAS_FORTIFY_SOURCE
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 96a91bda3a85..98990a515268 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -256,7 +256,8 @@ static void __init pte_clear_tests(struct mm_struct *mm,
pte_t *ptep,
    pte_t pte = READ_ONCE(*ptep);
 
    pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
-   WRITE_ONCE(*ptep, pte);
+   set_pte_at(mm, vaddr, ptep, pte);
+   barrier();
    pte_clear(mm, vaddr, ptep);
    pte = READ_ONCE(*ptep);
    WARN_ON(!pte_none(pte));

[PATCH 3/4] powerpc/xmon: Add source flags to output of XIVE interrupts

2020-03-06 Thread Cédric Le Goater

Some firmwares or hypervisors can advertise different source
characteristics. Track their value under XMON. What we are mostly
interested in is the StoreEOI flag.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 8155adc2225a..c865ae554605 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -283,7 +283,10 @@ int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data 
*d)
struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
u64 val = xive_esb_read(xd, XIVE_ESB_GET);
 
-   xmon_printf("PQ=%c%c",
+   xmon_printf("flags=%c%c%c PQ=%c%c",
+   xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
+   xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
+   xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
val & XIVE_ESB_VAL_P ? 'P' : '-',
val & XIVE_ESB_VAL_Q ? 'Q' : '-');
}
-- 
2.21.1

[PATCH V7 13/14] powerpc/vas: Display process stuck message

2020-03-06 Thread Haren Myneni



Process can not close send window until all requests are processed.
Means wait until window state is not busy and send credits are
returned. Display debug messages in case taking longer to close the
window.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 1439a6f..40c303a 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1182,6 +1182,7 @@ static void poll_window_credits(struct vas_window *window)
 {
u64 val;
int creds, mode;
+   int count = 0;
 
val = read_hvwc_reg(window, VREG(WINCTL));
if (window->tx_win)
@@ -1200,10 +1201,27 @@ static void poll_window_credits(struct vas_window 
*window)
creds = GET_FIELD(VAS_LRX_WCRED, val);
}
 
+   /*
+* Takes around few milliseconds to complete all pending requests
+* and return credits.
+* TODO: Scan fault FIFO and invalidate CRBs points to this window
+*   and issue CRB Kill to stop all pending requests. Need only
+*   if there is a bug in NX or fault handling in kernel.
+*/
if (creds < window->wcreds_max) {
val = 0;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(msecs_to_jiffies(10));
+   count++;
+   /*
+* Process can not close send window until all credits are
+* returned.
+*/
+   if (!(count % 1))
+   pr_debug("VAS: pid %d stuck. Waiting for credits 
returned for Window(%d). creds %d, Retries %d\n",
+   vas_window_pid(window), window->winid,
+   creds, count);
+
goto retry;
}
 }
@@ -1217,6 +1235,7 @@ static void poll_window_busy_state(struct vas_window 
*window)
 {
int busy;
u64 val;
+   int count = 0;
 
 retry:
val = read_hvwc_reg(window, VREG(WIN_STATUS));
@@ -1225,6 +1244,15 @@ static void poll_window_busy_state(struct vas_window 
*window)
val = 0;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(msecs_to_jiffies(5));
+   count++;
+   /*
+* Takes around few milliseconds to process all pending
+* requests.
+*/
+   if (!(count % 1))
+   pr_debug("VAS: pid %d stuck. Window (ID=%d) is in busy 
state. Retries %d\n",
+   vas_window_pid(window), window->winid, count);
+
goto retry;
}
 }
-- 
1.8.3.1

[PATCH V7 14/14] powerpc/vas: Free send window in VAS instance after credits returned

2020-03-06 Thread Haren Myneni



NX may be processing requests while trying to close window. Wait until
all credits are returned and then free send window from VAS instance.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 40c303a..c60accd 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1317,14 +1317,14 @@ int vas_win_close(struct vas_window *window)
 
unmap_paste_region(window);
 
-   clear_vinst_win(window);
-
poll_window_busy_state(window);
 
unpin_close_window(window);
 
poll_window_credits(window);
 
+   clear_vinst_win(window);
+
poll_window_castout(window);
 
/* if send window, drop reference to matching receive window */
-- 
1.8.3.1

[PATCH V7 12/14] powerpc/vas: Return credits after handling fault

2020-03-06 Thread Haren Myneni



NX expects OS to return credit for send window after processing each
fault. Also credit has to be returned even for fault window.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c  |  9 +
 arch/powerpc/platforms/powernv/vas-window.c | 17 +
 arch/powerpc/platforms/powernv/vas.h|  1 +
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
b/arch/powerpc/platforms/powernv/vas-fault.c
index 940adc5..cb8c51d 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -237,6 +237,10 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
memcpy(crb, fifo, CRB_SIZE);
entry->stamp.nx.pswid = cpu_to_be32(FIFO_INVALID_ENTRY);
entry->ccw |= cpu_to_be32(CCW0_INVALID);
+   /*
+* Return credit for the fault window.
+*/
+   vas_return_credit(vinst->fault_win, 0);
mutex_unlock(>mutex);
 
pr_devel("VAS[%d] fault_fifo %p, fifo %p, fault_crbs %d\n",
@@ -266,6 +270,11 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
}
 
update_csb(window, crb);
+   /*
+* Return credit for send window after processing
+* fault CRB.
+*/
+   vas_return_credit(window, 1);
} while (true);
 }
 
diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 427a884..1439a6f 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -1318,6 +1318,23 @@ int vas_win_close(struct vas_window *window)
 }
 EXPORT_SYMBOL_GPL(vas_win_close);
 
+/*
+ * Return credit for the given window.
+ */
+void vas_return_credit(struct vas_window *window, bool tx)
+{
+   uint64_t val;
+
+   val = 0ULL;
+   if (tx) { /* send window */
+   val = SET_FIELD(VAS_TX_WCRED, val, 1);
+   write_hvwc_reg(window, VREG(TX_WCRED_ADDER), val);
+   } else {
+   val = SET_FIELD(VAS_LRX_WCRED, val, 1);
+   write_hvwc_reg(window, VREG(LRX_WCRED_ADDER), val);
+   }
+}
+
 struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
uint32_t pswid)
 {
diff --git a/arch/powerpc/platforms/powernv/vas.h 
b/arch/powerpc/platforms/powernv/vas.h
index bc728d7..8c39a7d 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -428,6 +428,7 @@ struct vas_winctx {
 extern void vas_window_free_dbgdir(struct vas_window *win);
 extern int vas_setup_fault_window(struct vas_instance *vinst);
 extern irqreturn_t vas_fault_thread_fn(int irq, void *data);
+extern void vas_return_credit(struct vas_window *window, bool tx);
 extern struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
uint32_t pswid);
 
-- 
1.8.3.1

[PATCH V7 11/14] powerpc/vas: Do not use default credits for receive window

2020-03-06 Thread Haren Myneni



System checkstops if RxFIFO overruns with more requests than the
maximum possible number of CRBs allowed in FIFO at any time. So
max credits value (rxattr.wcreds_max) is set and is passed to
vas_rx_win_open() by the the driver.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 4 ++--
 arch/powerpc/platforms/powernv/vas.h| 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 7587258..427a884 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -772,7 +772,7 @@ static bool rx_win_args_valid(enum vas_cop_type cop,
if (attr->rx_fifo_size > VAS_RX_FIFO_SIZE_MAX)
return false;
 
-   if (attr->wcreds_max > VAS_RX_WCREDS_MAX)
+   if (!attr->wcreds_max)
return false;
 
if (attr->nx_win) {
@@ -877,7 +877,7 @@ struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
rxwin->nx_win = rxattr->nx_win;
rxwin->user_win = rxattr->user_win;
rxwin->cop = cop;
-   rxwin->wcreds_max = rxattr->wcreds_max ?: VAS_WCREDS_DEFAULT;
+   rxwin->wcreds_max = rxattr->wcreds_max;
 
init_winctx_for_rxwin(rxwin, rxattr, );
init_winctx_regs(rxwin, );
diff --git a/arch/powerpc/platforms/powernv/vas.h 
b/arch/powerpc/platforms/powernv/vas.h
index 16aa8ec..bc728d7 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -101,11 +101,9 @@
 /*
  * Initial per-process credits.
  * Max send window credits:4K-1 (12-bits in VAS_TX_WCRED)
- * Max receive window credits: 64K-1 (16 bits in VAS_LRX_WCRED)
  *
  * TODO: Needs tuning for per-process credits
  */
-#define VAS_RX_WCREDS_MAX  ((64 << 10) - 1)
 #define VAS_TX_WCREDS_MAX  ((4 << 10) - 1)
 #define VAS_WCREDS_DEFAULT (1 << 10)
 
-- 
1.8.3.1

[PATCH V7 10/14] powerpc/vas: Print CRB and FIFO values

2020-03-06 Thread Haren Myneni



Dump FIFO entries if could not find send window and print CRB
for debugging.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
b/arch/powerpc/platforms/powernv/vas-fault.c
index 751ce48..940adc5 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -26,6 +26,28 @@
  */
 #define VAS_FAULT_WIN_FIFO_SIZE(4 << 20)
 
+static void dump_crb(struct coprocessor_request_block *crb)
+{
+   struct data_descriptor_entry *dde;
+   struct nx_fault_stamp *nx;
+
+   dde = >source;
+   pr_devel("SrcDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
+   be64_to_cpu(dde->address), be32_to_cpu(dde->length),
+   dde->count, dde->index, dde->flags);
+
+   dde = >target;
+   pr_devel("TgtDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
+   be64_to_cpu(dde->address), be32_to_cpu(dde->length),
+   dde->count, dde->index, dde->flags);
+
+   nx = >stamp.nx;
+   pr_devel("NX Stamp: PSWID 0x%x, FSA 0x%llx, flags 0x%x, FS 0x%x\n",
+   be32_to_cpu(nx->pswid),
+   be64_to_cpu(crb->stamp.nx.fault_storage_addr),
+   nx->flags, nx->fault_status);
+}
+
 /*
  * Update the CSB to indicate a translation error.
  *
@@ -138,6 +160,23 @@ static void update_csb(struct vas_window *window,
pid_vnr(pid), rc);
 }
 
+static void dump_fifo(struct vas_instance *vinst, void *entry)
+{
+   int i;
+   unsigned long *fifo = entry;
+   unsigned long *end = vinst->fault_fifo + vinst->fault_fifo_size;
+
+   pr_err("Fault fifo size %d, Max crbs %d\n", vinst->fault_fifo_size,
+   vinst->fault_fifo_size / CRB_SIZE);
+
+   /* Dump 10 CRB entries or until end of FIFO */
+   pr_err("Fault FIFO Dump:\n");
+   for (i = 0; i < 10*(CRB_SIZE/8) && fifo < end; i += 4, fifo += 4) {
+   pr_err("[%.3d, %p]: 0x%.16lx 0x%.16lx 0x%.16lx 0x%.16lx\n",
+   i, fifo, *fifo, *(fifo+1), *(fifo+2), *(fifo+3));
+   }
+}
+
 /*
  * Process valid CRBs in fault FIFO.
  */
@@ -204,6 +243,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
vinst->vas_id, vinst->fault_fifo, fifo,
vinst->fault_crbs);
 
+   dump_crb(crb);
window = vas_pswid_to_window(vinst,
be32_to_cpu(crb->stamp.nx.pswid));
 
@@ -214,6 +254,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
 * even clean it up (return credit).
 * But we should not get here.
 */
+   dump_fifo(vinst, (void *)entry);
pr_err("VAS[%d] fault_fifo %p, fifo %p, pswid 0x%x, 
fault_crbs %d bad CRB?\n",
vinst->vas_id, vinst->fault_fifo, fifo,
be32_to_cpu(crb->stamp.nx.pswid),
-- 
1.8.3.1

[PATCH V7 09/14] powerpc/vas: Update CSB and notify process for fault CRBs

2020-03-06 Thread Haren Myneni



For each fault CRB, update fault address in CRB (fault_storage_addr)
and translation error status in CSB so that user space can touch the
fault address and resend the request. If the user space passed invalid
CSB address send signal to process with SIGSEGV.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c | 114 +
 1 file changed, 114 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
b/arch/powerpc/platforms/powernv/vas-fault.c
index 1c6d5cc..751ce48 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -26,6 +27,118 @@
 #define VAS_FAULT_WIN_FIFO_SIZE(4 << 20)
 
 /*
+ * Update the CSB to indicate a translation error.
+ *
+ * If we are unable to update the CSB means copy_to_user failed due to
+ * invalid csb_addr, send a signal to the process.
+ *
+ * Remaining settings in the CSB are based on wait_for_csb() of
+ * NX-GZIP.
+ */
+static void update_csb(struct vas_window *window,
+   struct coprocessor_request_block *crb)
+{
+   int rc;
+   struct pid *pid;
+   void __user *csb_addr;
+   struct task_struct *tsk;
+   struct kernel_siginfo info;
+   struct coprocessor_status_block csb;
+
+   /*
+* NX user space windows can not be opened for task->mm=NULL
+* and faults will not be generated for kernel requests.
+*/
+   if (!window->mm || !window->user_win)
+   return;
+
+   csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
+
+   csb.cc = CSB_CC_TRANSLATION;
+   csb.ce = CSB_CE_TERMINATION;
+   csb.cs = 0;
+   csb.count = 0;
+
+   /*
+* NX operates and returns in BE format as defined CRB struct.
+* So return fault_storage_addr in BE as NX pastes in FIFO and
+* expects user space to convert to CPU format.
+*/
+   csb.address = crb->stamp.nx.fault_storage_addr;
+   csb.flags = 0;
+
+   pid = window->pid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Send window will be closed after processing all NX requests
+* and process exits after closing all windows. In multi-thread
+* applications, thread may not exists, but does not close FD
+* (means send window) upon exit. Parent thread (tgid) can use
+* and close the window later.
+* pid and mm references are taken when window is opened by
+* process (pid). So tgid is used only when child thread opens
+* a window and exits without closing it in multithread tasks.
+*/
+   if (!tsk) {
+   pid = window->tgid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Parent thread will be closing window during its exit.
+* So should not get here.
+*/
+   if (!tsk)
+   return;
+   }
+
+   /* Return if the task is exiting. */
+   if (tsk->flags & PF_EXITING) {
+   put_task_struct(tsk);
+   return;
+   }
+
+   use_mm(window->mm);
+   rc = copy_to_user(csb_addr, , sizeof(csb));
+   /*
+* User space polls on csb.flags (first byte). So add barrier
+* then copy first byte with csb flags update.
+*/
+   smp_mb();
+   if (!rc) {
+   csb.flags = CSB_V;
+   rc = copy_to_user(csb_addr, , sizeof(u8));
+   }
+   unuse_mm(window->mm);
+   put_task_struct(tsk);
+
+   /* Success */
+   if (!rc)
+   return;
+
+   pr_debug("Invalid CSB address 0x%p signalling pid(%d)\n",
+   csb_addr, pid_vnr(pid));
+
+   clear_siginfo();
+   info.si_signo = SIGSEGV;
+   info.si_errno = EFAULT;
+   info.si_code = SEGV_MAPERR;
+   info.si_addr = csb_addr;
+
+   /*
+* process will be polling on csb.flags after request is sent to
+* NX. So generally CSB update should not fail except when an
+* application does not follow the process properly. So an error
+* message will be displayed and leave it to user space whether
+* to ignore or handle this signal.
+*/
+   rcu_read_lock();
+   rc = kill_pid_info(SIGSEGV, , pid);
+   rcu_read_unlock();
+
+   pr_devel("%s(): pid %d kill_proc_info() rc %d\n", __func__,
+   pid_vnr(pid), rc);
+}
+
+/*
  * Process valid CRBs in fault FIFO.
  */
 irqreturn_t vas_fault_thread_fn(int irq, void *data)
@@ -111,6 +224,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
return IRQ_HANDLED;
}
 
+   update_csb(window, crb);
} while (true);
 }
 
-- 
1.8.3.1

[PATCH V7 08/14] powerpc/vas: Take reference to PID and mm for user space windows

2020-03-06 Thread Haren Myneni



Process close windows after its requests are completed. In multi-thread
applications, child can open a window but release FD will not be called
upon its exit. Parent thread will be closing it later upon its exit.

The parent can also send NX requests with this window and NX can
generate page faults. After kernel handles the page fault, send
signal to process by using PID if CSB address is invalid. Parent
thread will not receive signal since its PID is different from the one
saved in vas_window. So use tgid in case if the task for the pid saved
in window is not running and send signal to its parent.

To prevent reusing the pid until the window closed, take reference to
pid and task mm.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-debug.c  |  2 +-
 arch/powerpc/platforms/powernv/vas-window.c | 53 ++---
 arch/powerpc/platforms/powernv/vas.h|  9 -
 3 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-debug.c 
b/arch/powerpc/platforms/powernv/vas-debug.c
index 09e63df..ef9a717 100644
--- a/arch/powerpc/platforms/powernv/vas-debug.c
+++ b/arch/powerpc/platforms/powernv/vas-debug.c
@@ -38,7 +38,7 @@ static int info_show(struct seq_file *s, void *private)
 
seq_printf(s, "Type: %s, %s\n", cop_to_str(window->cop),
window->tx_win ? "Send" : "Receive");
-   seq_printf(s, "Pid : %d\n", window->pid);
+   seq_printf(s, "Pid : %d\n", vas_window_pid(window));
 
 unlock:
mutex_unlock(_mutex);
diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index a45d81d..7587258 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "vas.h"
@@ -876,8 +878,6 @@ struct vas_window *vas_rx_win_open(int vasid, enum 
vas_cop_type cop,
rxwin->user_win = rxattr->user_win;
rxwin->cop = cop;
rxwin->wcreds_max = rxattr->wcreds_max ?: VAS_WCREDS_DEFAULT;
-   if (rxattr->user_win)
-   rxwin->pid = task_pid_vnr(current);
 
init_winctx_for_rxwin(rxwin, rxattr, );
init_winctx_regs(rxwin, );
@@ -1027,7 +1027,6 @@ struct vas_window *vas_tx_win_open(int vasid, enum 
vas_cop_type cop,
txwin->tx_win = 1;
txwin->rxwin = rxwin;
txwin->nx_win = txwin->rxwin->nx_win;
-   txwin->pid = attr->pid;
txwin->user_win = attr->user_win;
txwin->wcreds_max = attr->wcreds_max ?: VAS_WCREDS_DEFAULT;
 
@@ -1068,8 +1067,43 @@ struct vas_window *vas_tx_win_open(int vasid, enum 
vas_cop_type cop,
goto free_window;
}
 
-   set_vinst_win(vinst, txwin);
+   if (txwin->user_win) {
+   /*
+* Window opened by child thread may not be closed when
+* it exits. So take reference to its pid and release it
+* when the window is free by parent thread.
+* Acquire a reference to the task's pid to make sure
+* pid will not be re-used - needed only for multithread
+* applications.
+*/
+   txwin->pid = get_task_pid(current, PIDTYPE_PID);
+   /*
+* Acquire a reference to the task's mm.
+*/
+   txwin->mm = get_task_mm(current);
 
+   if (!txwin->mm) {
+   put_pid(txwin->pid);
+   pr_err("VAS: pid(%d): mm_struct is not found\n",
+   current->pid);
+   rc = -EPERM;
+   goto free_window;
+   }
+
+   mmgrab(txwin->mm);
+   mmput(txwin->mm);
+   mm_context_add_copro(txwin->mm);
+   /*
+* Process closes window during exit. In the case of
+* multithread application, child can open window and
+* can exit without closing it. Expects parent thread
+* to use and close the window. So do not need to take
+* pid reference for parent thread.
+*/
+   txwin->tgid = find_get_pid(task_tgid_vnr(current));
+   }
+
+   set_vinst_win(vinst, txwin);
return txwin;
 
 free_window:
@@ -1266,8 +1300,17 @@ int vas_win_close(struct vas_window *window)
poll_window_castout(window);
 
/* if send window, drop reference to matching receive window */
-   if (window->tx_win)
+   if (window->tx_win) {
+   if (window->user_win) {
+   /* Drop references to pid and mm */
+   put_pid(window->pid);
+   if (window->mm) {
+   mmdrop(window->mm);
+

[PATCH V7 07/14] powerpc/vas: Register NX with fault window ID and IRQ port value

2020-03-06 Thread Haren Myneni



For each user space send window, register NX with fault window ID
and port value so that NX paste CRBs in this fault FIFO when it
sees fault on the request buffer.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-window.c | 15 +--
 arch/powerpc/platforms/powernv/vas.h| 15 +++
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 7c6f55f..a45d81d 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -373,7 +373,7 @@ int init_winctx_regs(struct vas_window *window, struct 
vas_winctx *winctx)
init_xlate_regs(window, winctx->user_win);
 
val = 0ULL;
-   val = SET_FIELD(VAS_FAULT_TX_WIN, val, 0);
+   val = SET_FIELD(VAS_FAULT_TX_WIN, val, winctx->fault_win_id);
write_hvwc_reg(window, VREG(FAULT_TX_WIN), val);
 
/* In PowerNV, interrupts go to HV. */
@@ -748,6 +748,8 @@ static void init_winctx_for_rxwin(struct vas_window *rxwin,
 
winctx->min_scope = VAS_SCOPE_LOCAL;
winctx->max_scope = VAS_SCOPE_VECTORED_GROUP;
+   if (rxwin->vinst->virq)
+   winctx->irq_port = rxwin->vinst->irq_port;
 }
 
 static bool rx_win_args_valid(enum vas_cop_type cop,
@@ -944,13 +946,22 @@ static void init_winctx_for_txwin(struct vas_window 
*txwin,
winctx->lpid = txattr->lpid;
winctx->pidr = txattr->pidr;
winctx->rx_win_id = txwin->rxwin->winid;
+   /*
+* IRQ and fault window setup is successful. Set fault window
+* for the send window so that ready to handle faults.
+*/
+   if (txwin->vinst->virq)
+   winctx->fault_win_id = txwin->vinst->fault_win->winid;
 
winctx->dma_type = VAS_DMA_TYPE_INJECT;
winctx->tc_mode = txattr->tc_mode;
winctx->min_scope = VAS_SCOPE_LOCAL;
winctx->max_scope = VAS_SCOPE_VECTORED_GROUP;
+   if (txwin->vinst->virq)
+   winctx->irq_port = txwin->vinst->irq_port;
 
-   winctx->pswid = 0;
+   winctx->pswid = txattr->pswid ? txattr->pswid :
+   encode_pswid(txwin->vinst->vas_id, txwin->winid);
 }
 
 static bool tx_win_args_valid(enum vas_cop_type cop,
diff --git a/arch/powerpc/platforms/powernv/vas.h 
b/arch/powerpc/platforms/powernv/vas.h
index ecae7cd..310b8a0 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -468,6 +468,21 @@ static inline u64 read_hvwc_reg(struct vas_window *win,
return in_be64(win->hvwc_map+reg);
 }
 
+/*
+ * Encode/decode the Partition Send Window ID (PSWID) for a window in
+ * a way that we can uniquely identify any window in the system. i.e.
+ * we should be able to locate the 'struct vas_window' given the PSWID.
+ *
+ * BitsUsage
+ * 0:7 VAS id (8 bits)
+ * 8:15Unused, 0 (3 bits)
+ * 16:31   Window id (16 bits)
+ */
+static inline u32 encode_pswid(int vasid, int winid)
+{
+   return ((u32)winid | (vasid << (31 - 7)));
+}
+
 static inline void decode_pswid(u32 pswid, int *vasid, int *winid)
 {
if (vasid)
-- 
1.8.3.1

[PATCH V7 06/14] powerpc/vas: Setup thread IRQ handler per VAS instance

2020-03-06 Thread Haren Myneni



Setup thread IRQ handler per each VAS instance. When NX sees a fault
on CRB, kernel gets an interrupt and vas_fault_handler will be
executed to process fault CRBs. Read all valid CRBs from fault FIFO,
determine the corresponding send window from CRB and process fault
requests.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas-fault.c  | 90 +
 arch/powerpc/platforms/powernv/vas-window.c | 60 +++
 arch/powerpc/platforms/powernv/vas.c| 49 +++-
 arch/powerpc/platforms/powernv/vas.h|  6 ++
 4 files changed, 204 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
b/arch/powerpc/platforms/powernv/vas-fault.c
index 4044998..1c6d5cc 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "vas.h"
@@ -25,6 +26,95 @@
 #define VAS_FAULT_WIN_FIFO_SIZE(4 << 20)
 
 /*
+ * Process valid CRBs in fault FIFO.
+ */
+irqreturn_t vas_fault_thread_fn(int irq, void *data)
+{
+   struct vas_instance *vinst = data;
+   struct coprocessor_request_block *crb, *entry;
+   struct coprocessor_request_block buf;
+   struct vas_window *window;
+   unsigned long flags;
+   void *fifo;
+
+   crb = 
+
+   /*
+* VAS can interrupt with multiple page faults. So process all
+* valid CRBs within fault FIFO until reaches invalid CRB.
+* NX updates nx_fault_stamp in CRB and pastes in fault FIFO.
+* kernel retrives send window from parition send window ID
+* (pswid) in nx_fault_stamp. So pswid should be valid and
+* ccw[0] (in be) should be zero since this bit is reserved.
+* If user space touches this bit, NX returns with "CRB format
+* error".
+*
+* After reading CRB entry, invalidate it with pswid (set
+* 0x) and ccw[0] (set to 1).
+*
+* In case kernel receives another interrupt with different page
+* fault, CRBs are already processed by the previous handling. So
+* will be returned from this function when it sees invalid CRB.
+*/
+   do {
+   mutex_lock(>mutex);
+
+   spin_lock_irqsave(>fault_lock, flags);
+   /*
+* Advance the fault fifo pointer to next CRB.
+* Use CRB_SIZE rather than sizeof(*crb) since the latter is
+* aligned to CRB_ALIGN (256) but the CRB written to by VAS is
+* only CRB_SIZE in len.
+*/
+   fifo = vinst->fault_fifo + (vinst->fault_crbs * CRB_SIZE);
+   entry = fifo;
+
+   if ((entry->stamp.nx.pswid == cpu_to_be32(FIFO_INVALID_ENTRY))
+   || (entry->ccw & cpu_to_be32(CCW0_INVALID))) {
+   atomic_set(>faults_in_progress, 0);
+   spin_unlock_irqrestore(>fault_lock, flags);
+   mutex_unlock(>mutex);
+   return IRQ_HANDLED;
+   }
+
+   spin_unlock_irqrestore(>fault_lock, flags);
+   vinst->fault_crbs++;
+   if (vinst->fault_crbs == (vinst->fault_fifo_size / CRB_SIZE))
+   vinst->fault_crbs = 0;
+
+   memcpy(crb, fifo, CRB_SIZE);
+   entry->stamp.nx.pswid = cpu_to_be32(FIFO_INVALID_ENTRY);
+   entry->ccw |= cpu_to_be32(CCW0_INVALID);
+   mutex_unlock(>mutex);
+
+   pr_devel("VAS[%d] fault_fifo %p, fifo %p, fault_crbs %d\n",
+   vinst->vas_id, vinst->fault_fifo, fifo,
+   vinst->fault_crbs);
+
+   window = vas_pswid_to_window(vinst,
+   be32_to_cpu(crb->stamp.nx.pswid));
+
+   if (IS_ERR(window)) {
+   /*
+* We got an interrupt about a specific send
+* window but we can't find that window and we can't
+* even clean it up (return credit).
+* But we should not get here.
+*/
+   pr_err("VAS[%d] fault_fifo %p, fifo %p, pswid 0x%x, 
fault_crbs %d bad CRB?\n",
+   vinst->vas_id, vinst->fault_fifo, fifo,
+   be32_to_cpu(crb->stamp.nx.pswid),
+   vinst->fault_crbs);
+
+   WARN_ON_ONCE(1);
+   atomic_set(>faults_in_progress, 0);
+   return IRQ_HANDLED;
+   }
+
+   } while (true);
+}
+
+/*
  * Fault window is opened per VAS instance. NX pastes fault CRB in fault
  * FIFO upon page faults.
  */
diff --git a/arch/powerpc/platforms/powernv/vas-window.c

[PATCH V7 05/14] powerpc/vas: Setup fault window per VAS instance

2020-03-06 Thread Haren Myneni



Setup fault window for each VAS instance. When NX gets a fault on
request buffer, write fault CRBs in the corresponding fault FIFO and
then sends an interrupt to the OS.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/Makefile |  2 +-
 arch/powerpc/platforms/powernv/vas-fault.c  | 77 +
 arch/powerpc/platforms/powernv/vas-window.c |  4 +-
 arch/powerpc/platforms/powernv/vas.c| 20 
 arch/powerpc/platforms/powernv/vas.h| 16 ++
 5 files changed, 116 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/vas-fault.c

diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index c0f8120..395789f 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -17,7 +17,7 @@ obj-$(CONFIG_MEMORY_FAILURE)  += opal-memory-errors.o
 obj-$(CONFIG_OPAL_PRD) += opal-prd.o
 obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
 obj-$(CONFIG_PPC_MEMTRACE) += memtrace.o
-obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o
+obj-$(CONFIG_PPC_VAS)  += vas.o vas-window.o vas-debug.o vas-fault.o
 obj-$(CONFIG_OCXL_BASE)+= ocxl.o
 obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
 obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
b/arch/powerpc/platforms/powernv/vas-fault.c
new file mode 100644
index 000..4044998
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * VAS Fault handling.
+ * Copyright 2019, IBM Corporation
+ */
+
+#define pr_fmt(fmt) "vas: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vas.h"
+
+/*
+ * The maximum FIFO size for fault window can be 8MB
+ * (VAS_RX_FIFO_SIZE_MAX). Using 4MB FIFO since each VAS
+ * instance will be having fault window.
+ * 8MB FIFO can be used if expects more faults for each VAS
+ * instance.
+ */
+#define VAS_FAULT_WIN_FIFO_SIZE(4 << 20)
+
+/*
+ * Fault window is opened per VAS instance. NX pastes fault CRB in fault
+ * FIFO upon page faults.
+ */
+int vas_setup_fault_window(struct vas_instance *vinst)
+{
+   struct vas_rx_win_attr attr;
+
+   vinst->fault_fifo_size = VAS_FAULT_WIN_FIFO_SIZE;
+   vinst->fault_fifo = kzalloc(vinst->fault_fifo_size, GFP_KERNEL);
+   if (!vinst->fault_fifo) {
+   pr_err("Unable to alloc %d bytes for fault_fifo\n",
+   vinst->fault_fifo_size);
+   return -ENOMEM;
+   }
+
+   /*
+* Invalidate all CRB entries. NX pastes valid entry for each fault.
+*/
+   memset(vinst->fault_fifo, FIFO_INVALID_ENTRY, vinst->fault_fifo_size);
+   vas_init_rx_win_attr(, VAS_COP_TYPE_FAULT);
+
+   attr.rx_fifo_size = vinst->fault_fifo_size;
+   attr.rx_fifo = vinst->fault_fifo;
+
+   /*
+* Max creds is based on number of CRBs can fit in the FIFO.
+* (fault_fifo_size/CRB_SIZE). If 8MB FIFO is used, max creds
+* will be 0x since the receive creds field is 16bits wide.
+*/
+   attr.wcreds_max = vinst->fault_fifo_size / CRB_SIZE;
+   attr.lnotify_lpid = 0;
+   attr.lnotify_pid = mfspr(SPRN_PID);
+   attr.lnotify_tid = mfspr(SPRN_PID);
+
+   vinst->fault_win = vas_rx_win_open(vinst->vas_id, VAS_COP_TYPE_FAULT,
+   );
+
+   if (IS_ERR(vinst->fault_win)) {
+   pr_err("VAS: Error %ld opening FaultWin\n",
+   PTR_ERR(vinst->fault_win));
+   kfree(vinst->fault_fifo);
+   return PTR_ERR(vinst->fault_win);
+   }
+
+   pr_devel("VAS: Created FaultWin %d, LPID/PID/TID [%d/%d/%d]\n",
+   vinst->fault_win->winid, attr.lnotify_lpid,
+   attr.lnotify_pid, attr.lnotify_tid);
+
+   return 0;
+}
diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
b/arch/powerpc/platforms/powernv/vas-window.c
index 0c0d27d..1783fa9 100644
--- a/arch/powerpc/platforms/powernv/vas-window.c
+++ b/arch/powerpc/platforms/powernv/vas-window.c
@@ -827,9 +827,9 @@ void vas_init_rx_win_attr(struct vas_rx_win_attr *rxattr, 
enum vas_cop_type cop)
rxattr->fault_win = true;
rxattr->notify_disable = true;
rxattr->rx_wcred_mode = true;
-   rxattr->tx_wcred_mode = true;
rxattr->rx_win_ord_mode = true;
-   rxattr->tx_win_ord_mode = true;
+   rxattr->rej_no_credit = true;
+   rxattr->tc_mode = VAS_THRESH_DISABLED;
} else if (cop == VAS_COP_TYPE_FTW) {
rxattr->user_win = true;
rxattr->intr_disable = true;
diff --git a/arch/powerpc/platforms/powernv/vas.c 
b/arch/powerpc/platforms/powernv/vas.c
index 168ab68..557c8e4 100644
--- a/arch/powerpc/platforms/powernv/vas.c
+++

[PATCH V7 04/14] powerpc/vas: Alloc and setup IRQ and trigger port address

2020-03-06 Thread Haren Myneni



Alloc IRQ and get trigger port address for each VAS instance. Kernel
register this IRQ per VAS instance and sets this port for each send
window. NX interrupts the kernel when it sees page fault.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/powernv/vas.c | 34 --
 arch/powerpc/platforms/powernv/vas.h |  2 ++
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/vas.c 
b/arch/powerpc/platforms/powernv/vas.c
index ed9cc6d..168ab68 100644
--- a/arch/powerpc/platforms/powernv/vas.c
+++ b/arch/powerpc/platforms/powernv/vas.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "vas.h"
 
@@ -25,10 +26,12 @@
 
 static int init_vas_instance(struct platform_device *pdev)
 {
-   int rc, cpu, vasid;
-   struct resource *res;
-   struct vas_instance *vinst;
struct device_node *dn = pdev->dev.of_node;
+   struct vas_instance *vinst;
+   uint32_t chipid, irq;
+   struct resource *res;
+   int rc, cpu, vasid;
+   uint64_t port;
 
rc = of_property_read_u32(dn, "ibm,vas-id", );
if (rc) {
@@ -36,6 +39,12 @@ static int init_vas_instance(struct platform_device *pdev)
return -ENODEV;
}
 
+   rc = of_property_read_u32(dn, "ibm,chip-id", );
+   if (rc) {
+   pr_err("No ibm,chip-id property for %s?\n", pdev->name);
+   return -ENODEV;
+   }
+
if (pdev->num_resources != 4) {
pr_err("Unexpected DT configuration for [%s, %d]\n",
pdev->name, vasid);
@@ -69,9 +78,22 @@ static int init_vas_instance(struct platform_device *pdev)
 
vinst->paste_win_id_shift = 63 - res->end;
 
-   pr_devel("Initialized instance [%s, %d], paste_base 0x%llx, "
-   "paste_win_id_shift 0x%llx\n", pdev->name, vasid,
-   vinst->paste_base_addr, vinst->paste_win_id_shift);
+   rc = xive_native_alloc_get_irq_info(chipid, , );
+   if (rc)
+   return rc;
+
+   vinst->virq = irq_create_mapping(NULL, irq);
+   if (!vinst->virq) {
+   pr_err("Inst%d: Unable to map global irq %d\n",
+   vinst->vas_id, irq);
+   return -EINVAL;
+   }
+
+   vinst->irq_port = port;
+   pr_devel("Initialized instance [%s, %d] paste_base 0x%llx 
paste_win_id_shift 0x%llx IRQ %d Port 0x%llx\n",
+   pdev->name, vasid, vinst->paste_base_addr,
+   vinst->paste_win_id_shift, vinst->virq,
+   vinst->irq_port);
 
for_each_possible_cpu(cpu) {
if (cpu_to_chip_id(cpu) == of_get_ibm_chip_id(dn))
diff --git a/arch/powerpc/platforms/powernv/vas.h 
b/arch/powerpc/platforms/powernv/vas.h
index 5574aec..598608b 100644
--- a/arch/powerpc/platforms/powernv/vas.h
+++ b/arch/powerpc/platforms/powernv/vas.h
@@ -313,6 +313,8 @@ struct vas_instance {
u64 paste_base_addr;
u64 paste_win_id_shift;
 
+   u64 irq_port;
+   int virq;
struct mutex mutex;
struct vas_window *rxwin[VAS_COP_TYPE_MAX];
struct vas_window *windows[VAS_WINDOWS_PER_CHIP];
-- 
1.8.3.1

[PATCH V7 03/14] powerpc/vas: Define nx_fault_stamp in coprocessor_request_block

2020-03-06 Thread Haren Myneni



Kernel sets fault address and status in CRB for NX page fault on user
space address after processing page fault. User space gets the signal
and handles the fault mentioned in CRB by bringing the page in to
memory and send NX request again.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/icswx.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
index 9872f85..b233d1e 100644
--- a/arch/powerpc/include/asm/icswx.h
+++ b/arch/powerpc/include/asm/icswx.h
@@ -108,6 +108,17 @@ struct data_descriptor_entry {
__be64 address;
 } __packed __aligned(DDE_ALIGN);
 
+/* 4.3.2 NX-stamped Fault CRB */
+
+#define NX_STAMP_ALIGN  (0x10)
+
+struct nx_fault_stamp {
+   __be64 fault_storage_addr;
+   __be16 reserved;
+   __u8   flags;
+   __u8   fault_status;
+   __be32 pswid;
+} __packed __aligned(NX_STAMP_ALIGN);
 
 /* Chapter 6.5.2 Coprocessor-Request Block (CRB) */
 
@@ -135,7 +146,12 @@ struct coprocessor_request_block {
 
struct coprocessor_completion_block ccb;
 
-   u8 reserved[48];
+   union {
+   struct nx_fault_stamp nx;
+   u8 reserved[16];
+   } stamp;
+
+   u8 reserved[32];
 
struct coprocessor_status_block csb;
 } __packed __aligned(CRB_ALIGN);
-- 
1.8.3.1

[PATCH V7 02/14] powerpc/xive: Define xive_native_alloc_get_irq_info()

2020-03-06 Thread Haren Myneni



pnv_ocxl_alloc_xive_irq() in ocxl.c allocates IRQ and gets trigger port
address. VAS also needs this function, but based on chip ID. So moved
this common function to xive/native.c.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/xive.h   |  2 ++
 arch/powerpc/platforms/powernv/ocxl.c | 20 ++--
 arch/powerpc/sysdev/xive/native.c | 23 +++
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index d08ea11..fd337da 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -139,6 +139,8 @@ int xive_native_set_queue_state(u32 vp_id, uint32_t prio, 
u32 qtoggle,
 int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
 bool xive_native_has_queue_state_support(void);
 extern u32 xive_native_alloc_irq_on_chip(u32 chip_id);
+extern int xive_native_alloc_get_irq_info(u32 chip_id, u32 *irq,
+   u64 *trigger_addr);
 
 static inline u32 xive_native_alloc_irq(void)
 {
diff --git a/arch/powerpc/platforms/powernv/ocxl.c 
b/arch/powerpc/platforms/powernv/ocxl.c
index 8c65aac..fb8f99a 100644
--- a/arch/powerpc/platforms/powernv/ocxl.c
+++ b/arch/powerpc/platforms/powernv/ocxl.c
@@ -487,24 +487,8 @@ int pnv_ocxl_spa_remove_pe_from_cache(void *platform_data, 
int pe_handle)
 
 int pnv_ocxl_alloc_xive_irq(u32 *irq, u64 *trigger_addr)
 {
-   __be64 flags, trigger_page;
-   s64 rc;
-   u32 hwirq;
-
-   hwirq = xive_native_alloc_irq();
-   if (!hwirq)
-   return -ENOENT;
-
-   rc = opal_xive_get_irq_info(hwirq, , NULL, _page, NULL,
-   NULL);
-   if (rc || !trigger_page) {
-   xive_native_free_irq(hwirq);
-   return -ENOENT;
-   }
-   *irq = hwirq;
-   *trigger_addr = be64_to_cpu(trigger_page);
-   return 0;
-
+   return xive_native_alloc_get_irq_info(OPAL_XIVE_ANY_CHIP, irq,
+   trigger_addr);
 }
 EXPORT_SYMBOL_GPL(pnv_ocxl_alloc_xive_irq);
 
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 14d4406..abdd892 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -295,6 +295,29 @@ u32 xive_native_alloc_irq_on_chip(u32 chip_id)
 }
 EXPORT_SYMBOL_GPL(xive_native_alloc_irq_on_chip);
 
+int xive_native_alloc_get_irq_info(u32 chip_id, u32 *irq, u64 *trigger_addr)
+{
+   __be64 flags, trigger_page;
+   u32 hwirq;
+   s64 rc;
+
+   hwirq = xive_native_alloc_irq_on_chip(chip_id);
+   if (!hwirq)
+   return -ENOENT;
+
+   rc = opal_xive_get_irq_info(hwirq, , NULL, _page, NULL,
+   NULL);
+   if (rc || !trigger_page) {
+   xive_native_free_irq(hwirq);
+   return -ENOENT;
+   }
+   *irq = hwirq;
+   *trigger_addr = be64_to_cpu(trigger_page);
+
+   return 0;
+}
+EXPORT_SYMBOL(xive_native_alloc_get_irq_info);
+
 void xive_native_free_irq(u32 irq)
 {
for (;;) {
-- 
1.8.3.1

[PATCH V7 01/14] powerpc/xive: Define xive_native_alloc_irq_on_chip()

2020-03-06 Thread Haren Myneni



This function allocates IRQ on a specific chip. VAS needs per chip
IRQ allocation and will have IRQ handler per VAS instance.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/xive.h   | 9 -
 arch/powerpc/sysdev/xive/native.c | 6 +++---
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 93f982db..d08ea11 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -5,6 +5,8 @@
 #ifndef _ASM_POWERPC_XIVE_H
 #define _ASM_POWERPC_XIVE_H
 
+#include 
+
 #define XIVE_INVALID_VP0x
 
 #ifdef CONFIG_PPC_XIVE
@@ -108,7 +110,6 @@ struct xive_q {
 int xive_native_populate_irq_data(u32 hw_irq,
  struct xive_irq_data *data);
 void xive_cleanup_irq_data(struct xive_irq_data *xd);
-u32 xive_native_alloc_irq(void);
 void xive_native_free_irq(u32 irq);
 int xive_native_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq);
 
@@ -137,6 +138,12 @@ int xive_native_set_queue_state(u32 vp_id, uint32_t prio, 
u32 qtoggle,
u32 qindex);
 int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
 bool xive_native_has_queue_state_support(void);
+extern u32 xive_native_alloc_irq_on_chip(u32 chip_id);
+
+static inline u32 xive_native_alloc_irq(void)
+{
+   return xive_native_alloc_irq_on_chip(OPAL_XIVE_ANY_CHIP);
+}
 
 #else
 
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 0ff6b73..14d4406 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -279,12 +279,12 @@ static int xive_native_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 }
 #endif /* CONFIG_SMP */
 
-u32 xive_native_alloc_irq(void)
+u32 xive_native_alloc_irq_on_chip(u32 chip_id)
 {
s64 rc;
 
for (;;) {
-   rc = opal_xive_allocate_irq(OPAL_XIVE_ANY_CHIP);
+   rc = opal_xive_allocate_irq(chip_id);
if (rc != OPAL_BUSY)
break;
msleep(OPAL_BUSY_DELAY_MS);
@@ -293,7 +293,7 @@ u32 xive_native_alloc_irq(void)
return 0;
return rc;
 }
-EXPORT_SYMBOL_GPL(xive_native_alloc_irq);
+EXPORT_SYMBOL_GPL(xive_native_alloc_irq_on_chip);
 
 void xive_native_free_irq(u32 irq)
 {
-- 
1.8.3.1

[PATCH V7 00/14] powerpc/vas: Page fault handling for user space NX requests

2020-03-06 Thread Haren Myneni



On power9, Virtual Accelerator Switchboard (VAS) allows user space or
kernel to communicate with Nest Accelerator (NX) directly using COPY/PASTE
instructions. NX provides various functionalities such as compression,
encryption and etc. But only compression (842 and GZIP formats) is
supported in Linux kernel on power9.

842 compression driver (drivers/crypto/nx/nx-842-powernv.c)
is already included in Linux. Only GZIP support will be available from
user space.

Applications can issue GZIP compression / decompression requests to NX with
COPY/PASTE instructions. When NX is processing these requests, can hit
fault on the request buffer (not in memory). It issues an interrupt and
pastes fault CRB in fault FIFO. Expects kernel to handle this fault and
return credits for both send and fault windows after processing.

This patch series adds IRQ and fault window setup, and NX fault handling:
- Alloc IRQ and trigger port address, and configure IRQ per VAS instance.
- Set port# for each window to generate an interrupt when noticed fault.
- Set fault window and FIFO on which NX paste fault CRB.
- Setup IRQ thread fault handler per VAS instance.
- When receiving an interrupt, Read CRBs from fault FIFO and update
  coprocessor_status_block (CSB) in the corresponding CRB with translation
  failure (CSB_CC_TRANSLATION). After issuing NX requests, process polls
  on CSB address. When it sees translation error, can touch the request
  buffer to bring the page in to memory and reissue NX request.
- If copy_to_user fails on user space CSB address, OS sends SEGV signal.

Tested these patches with NX-GZIP support and will be posting this series
soon.

Patches 1 & 2: Define alloc IRQ and get port address per chip which are needed
   to alloc IRQ per VAS instance.
Patch 3: Define nx_fault_stamp on which NX writes fault status for the fault
 CRB
Patch 4: Alloc and setup IRQ and trigger port address for each VAS instance
Patch 5: Setup fault window per each VAS instance. This window is used for
 NX to paste fault CRB in FIFO.
Patches 6 & 7: Setup threaded IRQ per VAS and register NX with fault window
 ID and port number for each send window so that NX paste fault CRB
 in this window.
Patch 8: Reference to pid and mm so that pid is not used until window closed.
 Needed for multi thread application where child can open a window
 and can be used by parent later.
Patches 9 and 10: Process CRBs from fault FIFO and notify tasks by
 updating CSB or through signals.
Patches 11 and 12: Return credits for send and fault windows after handling
faults.
Patch 14:Fix closing send window after all credits are returned. This issue
 happens only for user space requests. No page faults on kernel
 request buffer.

Changelog:
V2:
  - Use threaded IRQ instead of own kernel thread handler
  - Use pswid instead of user space CSB address to find valid CRB
  - Removed unused macros and other changes as suggested by Christoph Hellwig

V3:
  - Rebased to 5.5-rc2
  - Use struct pid * instead of pid_t for vas_window tgid
  - Code cleanup as suggested by Christoph Hellwig

V4:
  - Define xive alloc and get IRQ info based on chip ID and use these
functions for IRQ setup per VAS instance. It eliminates skiboot
dependency as suggested by Oliver.

V5:
  - Do not update CSB if the process is exiting (patch9)

V6:
  - Add interrupt handler instead of default one and return IRQ_HANDLED
if the fault handling thread is already in progress. (Patch6)
  - Use platform send window ID and CCW[0] bit to find valid CRB in
fault FIFO (Patch6).
  - Return fault address to user space in BE and other changes as
suggested by Michael Neuling. (patch9)
  - Rebased to 5.6-rc4

V7:
  - Fix sparse warnings (patches 6,9 and 10) 

Haren Myneni (14):
  powerpc/xive: Define xive_native_alloc_irq_on_chip()
  powerpc/xive: Define xive_native_alloc_get_irq_info()
  powerpc/vas: Define nx_fault_stamp in coprocessor_request_block
  powerpc/vas: Alloc and setup IRQ and trigger port address
  powerpc/vas: Setup fault window per VAS instance
  powerpc/vas: Setup thread IRQ handler per VAS instance
  powerpc/vas: Register NX with fault window ID and IRQ port value
  powerpc/vas: Take reference to PID and mm for user space windows
  powerpc/vas: Update CSB and notify process for fault CRBs
  powerpc/vas: Print CRB and FIFO values
  powerpc/vas: Do not use default credits for receive window
  powerpc/vas: Return credits after handling fault
  powerpc/vas: Display process stuck message
  powerpc/vas: Free send window in VAS instance after credits returned

 arch/powerpc/include/asm/icswx.h|  18 +-
 arch/powerpc/include/asm/xive.h |  11 +-
 arch/powerpc/platforms/powernv/Makefile |   2 +-
 arch/powerpc/platforms/powernv/ocxl.c   |  20 +-
 arch/powerpc/platforms/powernv/vas-debug.c  |   2 +-
 arch/powerpc/platforms/powernv/vas-fault.c  | 331

[PATCH 2/4] powerpc/xive: Fix xmon support on the PowerNV platform

2020-03-06 Thread Cédric Le Goater

The PowerNV platform has multiple IRQ chips and the xmon command
dumping the state of the XIVE interrupt should only operate on the
XIVE IRQ chip.

Fixes: 5896163f7f91 ("powerpc/xmon: Improve output of XIVE interrupts")
Cc: sta...@vger.kernel.org # v5.4+
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 550baba98ec9..8155adc2225a 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -261,11 +261,15 @@ notrace void xmon_xive_do_dump(int cpu)
 
 int xmon_xive_get_irq_config(u32 hw_irq, struct irq_data *d)
 {
+   struct irq_chip *chip = irq_data_get_irq_chip(d);
int rc;
u32 target;
u8 prio;
u32 lirq;
 
+   if (!is_xive_irq(chip))
+   return -EINVAL;
+
rc = xive_ops->get_irq_config(hw_irq, , , );
if (rc) {
xmon_printf("IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
-- 
2.21.1

Re: [PATCHv3 1/2] powerpc/of: split out new_property() for reusing

2020-03-06 Thread Nathan Lynch

Hi,

Pingfan Liu  writes:
> Splitting out new_property() for coming reusing and moving it to
> of_helpers.c.

[...]

> +struct property *new_property(const char *name, const int length,
> + const unsigned char *value, struct property *last)
> +{
> + struct property *new = kzalloc(sizeof(*new), GFP_KERNEL);
> +
> + if (!new)
> + return NULL;
> +
> + new->name = kstrdup(name, GFP_KERNEL);
> + if (!new->name)
> + goto cleanup;
> + new->value = kmalloc(length + 1, GFP_KERNEL);
> + if (!new->value)
> + goto cleanup;
> +
> + memcpy(new->value, value, length);
> + *(((char *)new->value) + length) = 0;
> + new->length = length;
> + new->next = last;
> + return new;
> +
> +cleanup:
> + kfree(new->name);
> + kfree(new->value);
> + kfree(new);
> + return NULL;
> +}

This function in its current form isn't suitable for more general use:

* It appears to be tailored to string properties - note the char * value
  parameter, the length + 1 allocation and nul termination.

* Most code shouldn't need the 'last' argument. The code where this
  currently resides builds a list of properties and attaches it to a new
  node, bypassing of_add_property().

Let's look at the call site you add in your next patch:

+   big = cpu_to_be64(p->bound_addr);
+   property = new_property("bound-addr", sizeof(u64), (const 
unsigned char *),
+   NULL);
+   of_add_property(dn, property);

So you have to use a cast, and this is going to allocate (sizeof(u64) + 1)
for the value, is that what you want?

I think you should leave that legacy pseries reconfig code undisturbed
(frankly that stuff should get deprecated and removed) and if you want a
generic helper it should look more like:

struct property *of_property_new(const char *name, size_t length,
 const void *value, gfp_t allocflags)

__of_prop_dup() looks like a good model/guide here.

Re: [PATCH] selftests/powerpc: Add a test of sigreturn vs VDSO

2020-03-06 Thread Nathan Lynch

Nathan Lynch  writes:
> Michael Ellerman  writes:
>
>> +static int search_proc_maps(char *needle, unsigned long *low, unsigned long 
>> *high)
>
>^^ const?
>
>> +{
>> +unsigned long start, end;
>> +static char buf[4096];
>> +char name[128];
>> +FILE *f;
>> +int rc = -1;
>> +
>> +f = fopen("/proc/self/maps", "r");
>> +if (!f) {
>> +perror("fopen");
>> +return -1;
>> +}
>> +
>> +while (fgets(buf, sizeof(buf), f)) {
>> +rc = sscanf(buf, "%lx-%lx %*c%*c%*c%*c %*x %*d:%*d %*d %127s\n",
>> +, , name);
>
> I suspect it doesn't matter in practice for this particular test, but
> since this looks like a generally useful function that could gain users
> in the future: does this spuriously fail if the matching line straddles
> a 4096-byte boundary? Maybe fscanf(3) should be used instead?

Or maybe I should read the fgets man page more closely :-)

  "Reading stops after an EOF or a newline."

Sorry for the noise.

Re: [PATCH] selftests/powerpc: Add a test of sigreturn vs VDSO

2020-03-06 Thread Nathan Lynch

Michael Ellerman  writes:

> +static int search_proc_maps(char *needle, unsigned long *low, unsigned long 
> *high)

   ^^ const?
   
> +{
> + unsigned long start, end;
> + static char buf[4096];
> + char name[128];
> + FILE *f;
> + int rc = -1;
> +
> + f = fopen("/proc/self/maps", "r");
> + if (!f) {
> + perror("fopen");
> + return -1;
> + }
> +
> + while (fgets(buf, sizeof(buf), f)) {
> + rc = sscanf(buf, "%lx-%lx %*c%*c%*c%*c %*x %*d:%*d %*d %127s\n",
> + , , name);

I suspect it doesn't matter in practice for this particular test, but
since this looks like a generally useful function that could gain users
in the future: does this spuriously fail if the matching line straddles
a 4096-byte boundary? Maybe fscanf(3) should be used instead?


> + if (rc == 2)
> + continue;
> +
> + if (rc != 3) {
> + printf("sscanf errored\n");
> + rc = -1;
> + break;
> + }
> +
> + if (strstr(name, needle)) {
> + *low = start;
> + *high = end - 1;
> + rc = 0;
> + break;
> + }
> + }
> +
> + fclose(f);
> +
> + return rc;
> +}
> +
> +static volatile sig_atomic_t took_signal = 0;
> +
> +static void sigusr1_handler(int sig)
> +{
> + took_signal++;
> +}
> +
> +int test_sigreturn_vdso(void)
> +{
> + unsigned long low, high, size;
> + struct sigaction act;
> + char *p;
> +
> + act.sa_handler = sigusr1_handler;
> + act.sa_flags = 0;
> + sigemptyset(_mask);
> +
> + assert(sigaction(SIGUSR1, , NULL) == 0);
> +
> + // Confirm the VDSO is mapped, and work out where it is
> + assert(search_proc_maps("[vdso]", , ) == 0);
> + size = high - low + 1;
> + printf("VDSO is at 0x%lx-0x%lx (%lu bytes)\n", low, high, size);
> +
> + kill(getpid(), SIGUSR1);
> + assert(took_signal == 1);
> + printf("Signal delivered OK with VDSO mapped\n");

I haven't looked at the test harness in detail but this should be
reliable if the program is a single thread - lgtm.

Re: [PATCH v3] ima: add a new CONFIG for loading arch-specific policies

2020-03-06 Thread Nayna


Oops,  Please ignore this patch.

By mistake I posted the wrong version. I am sorry for the confusion,  I 
will resend the right version.


Thanks & Regards,

 - Nayna

On 3/6/20 12:39 PM, Nayna Jain wrote:

Every time a new architecture defines the IMA architecture specific
functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA
include file needs to be updated. To avoid this "noise", this patch
defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing
the different architectures to select it.

Suggested-by: Linus Torvalds 
Signed-off-by: Nayna Jain 
Cc: Ard Biesheuvel 
Cc: Philipp Rudo 
Cc: Michael Ellerman 
---
v3:
* Updated and tested the patch with improvements suggested by Michael.
It now uses "imply" instead of "select". Thanks Michael.
* Have missed replacing the CONFIG_IMA in x86 and s390 with new config,
that was resulting in redefinition when the IMA_SECURE_AND_OR_TRUSTED_BOOT
is not enabled. Thanks to Mimi for recognizing the problem.

v2:
* Fixed the issue identified by Mimi. Thanks Mimi, Ard, Heiko and Michael for
discussing the fix.

  arch/powerpc/Kconfig   | 1 +
  arch/s390/Kconfig  | 1 +
  arch/s390/kernel/Makefile  | 2 +-
  arch/x86/Kconfig   | 1 +
  arch/x86/kernel/Makefile   | 2 +-
  include/linux/ima.h| 3 +--
  security/integrity/ima/Kconfig | 8 
  7 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..a5cfde432983 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -979,6 +979,7 @@ config PPC_SECURE_BOOT
bool
depends on PPC_POWERNV
depends on IMA_ARCH_POLICY
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 8abe77536d9d..4a502fbcb800 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -195,6 +195,7 @@ config S390
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select SWIOTLB
select GENERIC_ALLOCATOR
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT if IMA_ARCH_POLICY
  
  
  config SCHED_OMIT_FRAME_POINTER

diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 2b1203cf7be6..578a6fa82ea4 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -70,7 +70,7 @@ obj-$(CONFIG_JUMP_LABEL)  += jump_label.o
  obj-$(CONFIG_KEXEC_FILE)  += machine_kexec_file.o kexec_image.o
  obj-$(CONFIG_KEXEC_FILE)  += kexec_elf.o
  
-obj-$(CONFIG_IMA)		+= ima_arch.o

+obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)   += ima_arch.o
  
  obj-$(CONFIG_PERF_EVENTS)	+= perf_event.o perf_cpum_cf_common.o

  obj-$(CONFIG_PERF_EVENTS) += perf_cpum_cf.o perf_cpum_sf.o
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..7f5bfaf0cbd2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,7 @@ config X86
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
select PROC_PID_ARCH_STATUS if PROC_FS
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT   if EFI && IMA_ARCH_POLICY
  
  config INSTRUCTION_DECODER

def_bool y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9b294c13809a..7f131ceba136 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -155,5 +155,5 @@ ifeq ($(CONFIG_X86_64),y)
  endif
  
  ifdef CONFIG_EFI

-obj-$(CONFIG_IMA)  += ima_arch.o
+obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)   += ima_arch.o
  endif
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 1659217e9b60..aefe758f4466 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -30,8 +30,7 @@ extern void ima_kexec_cmdline(const void *buf, int size);
  extern void ima_add_kexec_buffer(struct kimage *image);
  #endif
  
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \

-   || defined(CONFIG_PPC_SECURE_BOOT)
+#ifdef CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT
  extern bool arch_ima_get_secureboot(void);
  extern const char * const *arch_get_ima_policy(void);
  #else
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 3f3ee4e2eb0d..2baaf196c6d8 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -327,3 +327,11 @@ config IMA_QUEUE_EARLY_BOOT_KEYS
depends on IMA_MEASURE_ASYMMETRIC_KEYS
depends on SYSTEM_TRUSTED_KEYRING
default y
+
+config IMA_SECURE_AND_OR_TRUSTED_BOOT
+   bool
+   depends on IMA_ARCH_POLICY
+   default n
+   help
+  This option is selected by architectures to enable secure and/or
+  trusted boot based on IMA runtime policies.

[PATCH v3] ima: add a new CONFIG for loading arch-specific policies

2020-03-06 Thread Nayna Jain

Every time a new architecture defines the IMA architecture specific
functions - arch_ima_get_secureboot() and arch_ima_get_policy(), the IMA
include file needs to be updated. To avoid this "noise", this patch
defines a new IMA Kconfig IMA_SECURE_AND_OR_TRUSTED_BOOT option, allowing
the different architectures to select it.

Suggested-by: Linus Torvalds 
Signed-off-by: Nayna Jain 
Cc: Ard Biesheuvel 
Cc: Philipp Rudo 
Cc: Michael Ellerman 
---
v3:
* Updated and tested the patch with improvements suggested by Michael.
It now uses "imply" instead of "select". Thanks Michael.
* Have missed replacing the CONFIG_IMA in x86 and s390 with new config,
that was resulting in redefinition when the IMA_SECURE_AND_OR_TRUSTED_BOOT
is not enabled. Thanks to Mimi for recognizing the problem.

v2:
* Fixed the issue identified by Mimi. Thanks Mimi, Ard, Heiko and Michael for
discussing the fix.

 arch/powerpc/Kconfig   | 1 +
 arch/s390/Kconfig  | 1 +
 arch/s390/kernel/Makefile  | 2 +-
 arch/x86/Kconfig   | 1 +
 arch/x86/kernel/Makefile   | 2 +-
 include/linux/ima.h| 3 +--
 security/integrity/ima/Kconfig | 8 
 7 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 497b7d0b2d7e..a5cfde432983 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -979,6 +979,7 @@ config PPC_SECURE_BOOT
bool
depends on PPC_POWERNV
depends on IMA_ARCH_POLICY
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 8abe77536d9d..4a502fbcb800 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -195,6 +195,7 @@ config S390
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select SWIOTLB
select GENERIC_ALLOCATOR
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT if IMA_ARCH_POLICY
 
 
 config SCHED_OMIT_FRAME_POINTER
diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 2b1203cf7be6..578a6fa82ea4 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -70,7 +70,7 @@ obj-$(CONFIG_JUMP_LABEL)  += jump_label.o
 obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file.o kexec_image.o
 obj-$(CONFIG_KEXEC_FILE)   += kexec_elf.o
 
-obj-$(CONFIG_IMA)  += ima_arch.o
+obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)   += ima_arch.o
 
 obj-$(CONFIG_PERF_EVENTS)  += perf_event.o perf_cpum_cf_common.o
 obj-$(CONFIG_PERF_EVENTS)  += perf_cpum_cf.o perf_cpum_sf.o
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index beea77046f9b..7f5bfaf0cbd2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -230,6 +230,7 @@ config X86
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
select PROC_PID_ARCH_STATUS if PROC_FS
+   select IMA_SECURE_AND_OR_TRUSTED_BOOT   if EFI && IMA_ARCH_POLICY
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 9b294c13809a..7f131ceba136 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -155,5 +155,5 @@ ifeq ($(CONFIG_X86_64),y)
 endif
 
 ifdef CONFIG_EFI
-obj-$(CONFIG_IMA)  += ima_arch.o
+obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT)   += ima_arch.o
 endif
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 1659217e9b60..aefe758f4466 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -30,8 +30,7 @@ extern void ima_kexec_cmdline(const void *buf, int size);
 extern void ima_add_kexec_buffer(struct kimage *image);
 #endif
 
-#if (defined(CONFIG_X86) && defined(CONFIG_EFI)) || defined(CONFIG_S390) \
-   || defined(CONFIG_PPC_SECURE_BOOT)
+#ifdef CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT
 extern bool arch_ima_get_secureboot(void);
 extern const char * const *arch_get_ima_policy(void);
 #else
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 3f3ee4e2eb0d..2baaf196c6d8 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -327,3 +327,11 @@ config IMA_QUEUE_EARLY_BOOT_KEYS
depends on IMA_MEASURE_ASYMMETRIC_KEYS
depends on SYSTEM_TRUSTED_KEYRING
default y
+
+config IMA_SECURE_AND_OR_TRUSTED_BOOT
+   bool
+   depends on IMA_ARCH_POLICY
+   default n
+   help
+  This option is selected by architectures to enable secure and/or
+  trusted boot based on IMA runtime policies.
-- 
2.18.1

Re: [PATCH v2 4/5] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-03-06 Thread Naveen N. Rao


Nathan Lynch wrote:

"Naveen N. Rao"  writes:

Gautham R Shenoy wrote:

On Fri, Feb 21, 2020 at 10:50:12AM -0600, Nathan Lynch wrote:

It's regrettable that we have to wake up potentially idle CPUs in order
to derive correct idle statistics for them, but I suppose the main user
(lparstat) of these interfaces already is causing this to happen by
polling the existing per-cpu purr and spurr attributes.

So now lparstat will incur at minimum four syscalls and four IPIs per
CPU per polling interval -- one for each of purr, spurr, idle_purr and
idle_spurr. Correct?


Yes, it is unforunate that we will end up making four syscalls and
generating IPI noise, and this is something that I discussed with
Naveen and Kamalesh. We have the following two constraints:

1) These values of PURR and SPURR required are per-cpu. Hence putting
them in lparcfg is not an option.

2) sysfs semantics encourages a single value per key, the key being
the sysfs-file. Something like the following would have made far more
sense.

cat /sys/devices/system/cpu/cpuX/purr_spurr_accounting
purr:A
idle_purr:B
spurr:C
idle_spurr:D

There are some sysfs files which allow something like this. Eg: 
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state


Thoughts on any other alternatives?


Umm... procfs?
/me ducks


I had wondered about perf events but I'm not sure that's any more suitable.


Yes, we considered that, but it looks like the event reads are not 
"batched" in any manner. So, the IPI overhead will be similar.





At some point it's going to make sense to batch sampling of remote CPUs'
SPRs.


How did you mean this? It looks like we first need to provide a separate 
user interface, since with the existing sysfs interface providing 
separate files, I am not sure if we can batch such reads.


I mean in order to minimize IPI traffic something like: sample/calculate
all of a CPU's purr, idle_purr, spurr, idle_spurr in a single IPI upon a
read of any of the attributes, and cache the result for some time, so
that the anticipated subsequent reads of the other attributes use the
cached results instead of generating more IPIs.

That would keep the current sysfs interface at the cost of imposing a
certain coarseness in the results.


Thanks for clarifying, that makes sense. Though we need to be careful in 
ensuring the sysfs semantics work as expected.




Anyway, that's a mitigation that could be considered if the
implementation in this patch is found to be too expensive in practice.


That's a good point. We can optimize later if this turns out to be a 
problem in practice, if we end up using this approach.



- Naveen

[PATCH v4 7/7] mm/memremap: Set caching mode for PCI P2PDMA memory to WC

2020-03-06 Thread Logan Gunthorpe

PCI BAR IO memory should never be mapped as WB, however prior to this
the PAT bits were set WB and it was typically overridden by MTRR
registers set by the firmware.

Set PCI P2PDMA memory to be UC as this is what it currently, typically,
ends up being mapped as on x86 after the MTRR registers override the
cache setting.

Future use-cases may need to generalize this by adding flags to
select the caching type, as some P2PDMA cases may not want UC.
However, those use-cases are not upstream yet and this can be changed
when they arrive.

Cc: Christoph Hellwig 
Cc: Jason Gunthorpe 
Signed-off-by: Logan Gunthorpe 
Reviewed-by: Dan Williams 
---
 mm/memremap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index 06742372a203..9033ae401448 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -190,7 +190,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_DEVDAX:
+   need_devmap_managed = false;
+   break;
case MEMORY_DEVICE_PCI_P2PDMA:
+   params.pgprot = pgprot_noncached(params.pgprot);
need_devmap_managed = false;
break;
default:
-- 
2.20.1

[PATCH v4 2/7] mm/memory_hotplug: Rename mhp_restrictions to mhp_params

2020-03-06 Thread Logan Gunthorpe

The mhp_restrictions struct really doesn't specify anything resembling
a restriction anymore so rename it to be mhp_params as it is a list
of extended parameters.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: David Hildenbrand 
Reviewed-by: Dan Williams 
Acked-by: Michal Hocko 
---
 arch/arm64/mm/mmu.c|  4 ++--
 arch/ia64/mm/init.c|  4 ++--
 arch/powerpc/mm/mem.c  |  4 ++--
 arch/s390/mm/init.c|  6 +++---
 arch/sh/mm/init.c  |  4 ++--
 arch/x86/mm/init_32.c  |  4 ++--
 arch/x86/mm/init_64.c  |  8 
 include/linux/memory_hotplug.h | 16 
 mm/memory_hotplug.c|  8 
 mm/memremap.c  |  8 
 10 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 128f70852bf3..ee37bca8aba8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1050,7 +1050,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions)
+   struct mhp_params *params)
 {
int flags = 0;
 
@@ -1063,7 +1063,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
memblock_clear_nomap(start, size);
 
return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
-  restrictions);
+  params);
 }
 void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b01d68a2d5d9..97bbc23ea1e3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -670,13 +670,13 @@ mem_init (void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions)
+   struct mhp_params *params)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
-   ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
+   ret = __add_pages(nid, start_pfn, nr_pages, params);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
   __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ef7b1119b2e2..b4bece53bec0 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -128,7 +128,7 @@ static void flush_dcache_range_chunked(unsigned long start, 
unsigned long stop,
 }
 
 int __ref arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -144,7 +144,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
return -EFAULT;
}
 
-   return __add_pages(nid, start_pfn, nr_pages, restrictions);
+   return __add_pages(nid, start_pfn, nr_pages, params);
 }
 
 void __ref arch_remove_memory(int nid, u64 start, u64 size,
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ac44bd76db4b..e9e4a7abd0cc 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -268,20 +268,20 @@ device_initcall(s390_cma_mem_init);
 #endif /* CONFIG_CMA */
 
 int arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions)
+   struct mhp_params *params)
 {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
int rc;
 
-   if (WARN_ON_ONCE(restrictions->altmap))
+   if (WARN_ON_ONCE(params->altmap))
return -EINVAL;
 
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
 
-   rc = __add_pages(nid, start_pfn, size_pages, restrictions);
+   rc = __add_pages(nid, start_pfn, size_pages, params);
if (rc)
vmem_remove_mapping(start, size);
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index d1b1ff2be17a..e5114c053364 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -406,14 +406,14 @@ void __init mem_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 int arch_add_memory(int nid, u64 start, u64 size,
-   struct mhp_restrictions *restrictions)
+   struct mhp_params *params)
 {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
/* We only have ZONE_NORMAL, so this is easy.. */
-   ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
+   ret = __add_pages(nid, start_pfn, nr_pages, params);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
 
diff --git

[PATCH v4 6/7] mm/memory_hotplug: Add pgprot_t to mhp_params

2020-03-06 Thread Logan Gunthorpe

devm_memremap_pages() is currently used by the PCI P2PDMA code to create
struct page mappings for IO memory. At present, these mappings are created
with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
x86, an mtrr register will typically override this and force the cache
type to be UC-. In the case firmware doesn't set this register it is
effectively WB and will typically result in a machine check exception
when it's accessed.

Other arches are not currently likely to function correctly seeing they
don't have any MTRR registers to fall back on.

To solve this, provide a way to specify the pgprot value explicitly to
arch_add_memory().

Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
change to pass the pgprot_t down to their respective functions which set
up the page tables. For x86_32, set the page tables explicitly using
_set_memory_prot() (seeing they are already mapped). For ia64, s390 and
sh, reject anything but PAGE_KERNEL settings -- this should be fine,
for now, seeing these architectures don't support ZONE_DEVICE.

A check in __add_pages() is also added to ensure the pgprot parameter was
set for all arches.

Signed-off-by: Logan Gunthorpe 
Acked-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Dan Williams 
---
 arch/arm64/mm/mmu.c|  3 ++-
 arch/ia64/mm/init.c|  3 +++
 arch/powerpc/mm/mem.c  |  3 ++-
 arch/s390/mm/init.c|  3 +++
 arch/sh/mm/init.c  |  3 +++
 arch/x86/mm/init_32.c  | 12 
 arch/x86/mm/init_64.c  |  2 +-
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c|  5 -
 mm/memremap.c  |  6 +++---
 10 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ee37bca8aba8..ea3fa844a8a2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
-size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
+size, params->pgprot, __pgd_pgtable_alloc,
+flags);
 
memblock_clear_nomap(start, size);
 
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 97bbc23ea1e3..d637b4ea3147 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
+   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
+   return -EINVAL;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 19b1da5d7eca..832412bc7fad 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
resize_hpt_for_hotplug(memblock_phys_mem_size());
 
start = (unsigned long)__va(start);
-   rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
+   rc = create_section_mapping(start, start + size, nid,
+   params->pgprot);
if (rc) {
pr_warn("Unable to create mapping for hot added memory 
0x%llx..0x%llx: %d\n",
start, start + size, rc);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e9e4a7abd0cc..87b2d024e75a 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
if (WARN_ON_ONCE(params->altmap))
return -EINVAL;
 
+   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
+   return -EINVAL;
+
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index e5114c053364..b9de2d4fa57e 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
 
+   if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)
+   return -EINVAL;
+
/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, start_pfn, nr_pages, params);
if (unlikely(ret))
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index e25a4218e6ff..69128f1a22ac 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -858,6 +858,18 @@ int arch_add_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+   int ret;
+
+   /*
+* The

[PATCH v4 4/7] x86/mm: Introduce __set_memory_prot()

2020-03-06 Thread Logan Gunthorpe

For use in the 32bit arch_add_memory() to set the pgprot type of the
memory to add.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Signed-off-by: Logan Gunthorpe 
Reviewed-by: Dan Williams 
---
 arch/x86/include/asm/set_memory.h |  1 +
 arch/x86/mm/pat/set_memory.c  | 13 +
 2 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/set_memory.h 
b/arch/x86/include/asm/set_memory.h
index 64c3dce374e5..034358da4837 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -34,6 +34,7 @@
  * The caller is required to take care of these.
  */
 
+int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);
 int _set_memory_uc(unsigned long addr, int numpages);
 int _set_memory_wc(unsigned long addr, int numpages);
 int _set_memory_wt(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index c4aedd00c1ba..a7b14dffeb0b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1792,6 +1792,19 @@ static inline int cpa_clear_pages_array(struct page 
**pages, int numpages,
CPA_PAGES_ARRAY, pages);
 }
 
+/*
+ * _set_memory_prot is an internal helper for callers that have been passed
+ * a pgprot_t value from upper layers and a reservation has already been taken.
+ * If you want to set the pgprot to a specific page protocol, use the
+ * set_memory_xx() functions.
+ */
+int __set_memory_prot(unsigned long addr, int numpages, pgprot_t prot)
+{
+   return change_page_attr_set_clr(, numpages, prot,
+   __pgprot(~pgprot_val(prot)), 0, 0,
+   NULL);
+}
+
 int _set_memory_uc(unsigned long addr, int numpages)
 {
/*
-- 
2.20.1

[PATCH v4 5/7] powerpc/mm: Thread pgprot_t through create_section_mapping()

2020-03-06 Thread Logan Gunthorpe

In prepartion to support a pgprot_t argument for arch_add_memory().

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Signed-off-by: Logan Gunthorpe 
---
 arch/powerpc/include/asm/book3s/64/hash.h  |  3 ++-
 arch/powerpc/include/asm/book3s/64/radix.h |  3 ++-
 arch/powerpc/include/asm/sparsemem.h   |  3 ++-
 arch/powerpc/mm/book3s64/hash_utils.c  |  5 +++--
 arch/powerpc/mm/book3s64/pgtable.c |  7 ---
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 18 +++---
 arch/powerpc/mm/mem.c  |  5 +++--
 7 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index 2781ebf6add4..6fc4520092c7 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -251,7 +251,8 @@ extern int __meminit hash__vmemmap_create_mapping(unsigned 
long start,
 extern void hash__vmemmap_remove_mapping(unsigned long start,
 unsigned long page_size);
 
-int hash__create_section_mapping(unsigned long start, unsigned long end, int 
nid);
+int hash__create_section_mapping(unsigned long start, unsigned long end,
+int nid, pgprot_t prot);
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index d97db3ad9aae..46799f3c3d1d 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -289,7 +289,8 @@ static inline unsigned long radix__get_tree_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int radix__create_section_mapping(unsigned long start, unsigned long end, int 
nid);
+int radix__create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..c89b32443cff 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -13,7 +13,8 @@
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-extern int create_section_mapping(unsigned long start, unsigned long end, int 
nid);
+extern int create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 523d4d39d11e..201738e07a1d 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -809,7 +809,8 @@ int resize_hpt_for_hotplug(unsigned long new_mem_size)
return 0;
 }
 
-int hash__create_section_mapping(unsigned long start, unsigned long end, int 
nid)
+int hash__create_section_mapping(unsigned long start, unsigned long end,
+int nid, pgprot_t prot)
 {
int rc;
 
@@ -819,7 +820,7 @@ int hash__create_section_mapping(unsigned long start, 
unsigned long end, int nid
}
 
rc = htab_bolt_mapping(start, end, __pa(start),
-  pgprot_val(PAGE_KERNEL), mmu_linear_psize,
+  pgprot_val(prot), mmu_linear_psize,
   mmu_kernel_ssize);
 
if (rc < 0) {
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 2bf7e1b4fd82..e0bb69c616e4 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -171,12 +171,13 @@ void mmu_cleanup_all(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int __meminit create_section_mapping(unsigned long start, unsigned long end, 
int nid)
+int __meminit create_section_mapping(unsigned long start, unsigned long end,
+int nid, pgprot_t prot)
 {
if (radix_enabled())
-   return radix__create_section_mapping(start, end, nid);
+   return radix__create_section_mapping(start, end, nid, prot);
 
-   return hash__create_section_mapping(start, end, nid);
+   return hash__create_section_mapping(start, end, nid, prot);
 }
 
 int __meminit remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index dd1bea45325c..0ef10b5c26ba 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -253,7 +253,7 @@ static unsigned long next_boundary(unsigned long addr, 
unsigned long end)
 
 static int __meminit create_physical_mapping(unsigned long start,

[PATCH v4 3/7] x86/mm: Thread pgprot_t through init_memory_mapping()

2020-03-06 Thread Logan Gunthorpe

In prepartion to support a pgprot_t argument for arch_add_memory().

It's required to move the prototype of init_memory_mapping() seeing
the original location came before the definition of pgprot_t.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Signed-off-by: Logan Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Michal Hocko 
---
 arch/x86/include/asm/page_types.h |  3 ---
 arch/x86/include/asm/pgtable.h|  3 +++
 arch/x86/kernel/amd_gart_64.c |  3 ++-
 arch/x86/mm/init.c|  9 +
 arch/x86/mm/init_32.c |  3 ++-
 arch/x86/mm/init_64.c | 32 +--
 arch/x86/mm/mm_internal.h |  3 ++-
 arch/x86/platform/uv/bios_uv.c|  3 ++-
 8 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h 
b/arch/x86/include/asm/page_types.h
index c85e15010f48..bf7aa2e290ef 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -73,9 +73,6 @@ static inline phys_addr_t get_max_mapped(void)
 
 bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
 
-extern unsigned long init_memory_mapping(unsigned long start,
-unsigned long end);
-
 extern void initmem_init(void);
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7e118660bbd9..48d6a5960f28 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1046,6 +1046,9 @@ static inline void __meminit init_trampoline_default(void)
 
 void __init poking_init(void);
 
+unsigned long init_memory_mapping(unsigned long start,
+ unsigned long end, pgprot_t prot);
+
 # ifdef CONFIG_RANDOMIZE_MEMORY
 void __meminit init_trampoline(void);
 # else
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 4e5f50236048..16133819415c 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -744,7 +744,8 @@ int __init gart_iommu_init(void)
 
start_pfn = PFN_DOWN(aper_base);
if (!pfn_range_is_mapped(start_pfn, end_pfn))
-   init_memory_mapping(start_pfn<> PAGE_SHIFT, ret >> PAGE_SHIFT);
 
@@ -521,7 +522,7 @@ static unsigned long __init init_range_memory_mapping(
 */
can_use_brk_pgt = max(start, (u64)pgt_buf_end<=
min(end, (u64)pgt_buf_top<> PAGE_SHIFT,
-PAGE_KERNEL_LARGE),
+prot),
 init);
spin_unlock(_mm.page_table_lock);
paddr_last = paddr_next;
@@ -669,7 +672,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, 
unsigned long paddr_end,
 
 static unsigned long __meminit
 phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
- unsigned long page_size_mask, bool init)
+ unsigned long page_size_mask, pgprot_t prot, bool init)
 {
unsigned long vaddr, vaddr_end, vaddr_next, paddr_next, paddr_last;
 
@@ -679,7 +682,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
 
if (!pgtable_l5_enabled())
return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end,
-page_size_mask, init);
+page_size_mask, prot, init);
 
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
p4d_t *p4d = p4d_page + p4d_index(vaddr);
@@ -702,13 +705,13 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, 
unsigned long paddr_end,
if (!p4d_none(*p4d)) {
pud = pud_offset(p4d, 0);
paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
-   page_size_mask, init);
+   page_size_mask, prot, init);
continue;
}
 
pud = alloc_low_page();
paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
-  page_size_mask, init);
+  page_size_mask, prot, init);
 
spin_lock(_mm.page_table_lock);
p4d_populate_init(_mm, p4d, pud, init);
@@ -722,7 +725,7 @@ static unsigned long __meminit
 __kernel_physical_mapping_init(unsigned long paddr_start,
   unsigned long paddr_end,
   unsigned long page_size_mask,
-  bool init)
+  pgprot_t prot, bool init)
 {
bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
@@ -743,13 +746,13

[PATCH v4 1/7] mm/memory_hotplug: Drop the flags field from struct mhp_restrictions

2020-03-06 Thread Logan Gunthorpe

This variable is not used anywhere and should therefore be removed
from the structure.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: David Hildenbrand 
Reviewed-by: Dan Williams 
Acked-by: Michal Hocko 
---
 include/linux/memory_hotplug.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index f4d59155f3d4..69ff3037528d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -55,11 +55,9 @@ enum {
 
 /*
  * Restrictions for the memory hotplug:
- * flags:  MHP_ flags
  * altmap: alternative allocator for memmap array
  */
 struct mhp_restrictions {
-   unsigned long flags;
struct vmem_altmap *altmap;
 };
 
-- 
2.20.1

[PATCH v4 0/7] Allow setting caching mode in arch_add_memory() for P2PDMA

2020-03-06 Thread Logan Gunthorpe

Hi,

This is v4 of the patchset which cleans up a number of minor issues
from the feedback of v3 and rebases onto v5.6-rc4. Additional feedback
is welcome.

Also worth noting, is that the kernel test robot reports[1] that Patch 3
in this series improves will-it-scale.per_process_ops by 36%. Though,
for the life of me, I can't understand why that would be. But it's
reported the same thing twice now for different versions of the series.

Thanks,

Logan

[1] 
https://lists.01.org/hyperkitty/list/l...@lists.01.org/thread/5APDKNBEJGVJTJRTI2IIA3P4OC2OEYPS/

--

Changes in v4:
 * Rebased onto v5.6-rc4
 * Collected tags form David, Dan and Michal
 * Minor changes to the new _set_memory_prot() function and added some
   comments as requested by Dan.
 * Changed the default caching type for P2PDMA memory to UC instead of
   WC per Jason's concerns that WC might be more generally unsafe.

Changes in v3:
 * Rebased onto v5.6-rc2
 * Rename mhp_modifiers to mhp_params per David with an updated kernel
   doc per Dan
 * Drop support for s390 per David seeing it does not support
   ZONE_DEVICE yet and there was a potential problem with huge pages.
 * Added WARN_ON_ONCE in cases where arches recieve non PAGE_KERNEL
   parameters
 * Collected David and Micheal's Reviewed-By and Acked-by Tags

Changes in v2:
 * Rebased onto v5.5-rc5
 * Renamed mhp_restrictions to mhp_modifiers and added the pgprot field
   to that structure instead of using an argument for
   arch_add_memory().
 * Add patch to drop the unused flags field in mhp_restrictions

A git branch is available here:

https://github.com/sbates130272/linux-p2pmem remap_pages_cache_v4

--

Currently, the page tables created using memremap_pages() are always
created with the PAGE_KERNEL cacheing mode. However, the P2PDMA code
is creating pages for PCI BAR memory which should never be accessed
through the cache and instead use either WC or UC. This still works in
most cases, on x86, because the MTRR registers typically override the
caching settings in the page tables for all of the IO memory to be
UC-. However, this tends not to work so well on other arches or
some rare x86 machines that have firmware which does not setup the
MTRR registers in this way.

Instead of this, this series proposes a change to arch_add_memory()
to take the pgprot required by the mapping which allows us to
explicitly set pagetable entries for P2PDMA memory to UC.

This changes is pretty routine for most of the arches: x86_64, arm64
and powerpc simply need to thread the pgprot through to where the page
tables are setup. x86_32 unfortunately sets up the page tables at boot so
must use _set_memory_prot() to change their caching mode. ia64, s390 and sh
don't appear to have an easy way to change the page tables so, for now
at least, we just return -EINVAL on such mappings and thus they will
not support P2PDMA memory until the work for this is done. This should
be fine as they don't yet support ZONE_DEVICE.

--

Logan Gunthorpe (7):
  mm/memory_hotplug: Drop the flags field from struct mhp_restrictions
  mm/memory_hotplug: Rename mhp_restrictions to mhp_params
  x86/mm: Thread pgprot_t through init_memory_mapping()
  x86/mm: Introduce __set_memory_prot()
  powerpc/mm: Thread pgprot_t through create_section_mapping()
  mm/memory_hotplug: Add pgprot_t to mhp_params
  mm/memremap: Set caching mode for PCI P2PDMA memory to WC

 arch/arm64/mm/mmu.c|  7 ++--
 arch/ia64/mm/init.c|  7 ++--
 arch/powerpc/include/asm/book3s/64/hash.h  |  3 +-
 arch/powerpc/include/asm/book3s/64/radix.h |  3 +-
 arch/powerpc/include/asm/sparsemem.h   |  3 +-
 arch/powerpc/mm/book3s64/hash_utils.c  |  5 +--
 arch/powerpc/mm/book3s64/pgtable.c |  7 ++--
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 18 ++
 arch/powerpc/mm/mem.c  | 10 +++---
 arch/s390/mm/init.c|  9 +++--
 arch/sh/mm/init.c  |  7 ++--
 arch/x86/include/asm/page_types.h  |  3 --
 arch/x86/include/asm/pgtable.h |  3 ++
 arch/x86/include/asm/set_memory.h  |  1 +
 arch/x86/kernel/amd_gart_64.c  |  3 +-
 arch/x86/mm/init.c |  9 ++---
 arch/x86/mm/init_32.c  | 19 --
 arch/x86/mm/init_64.c  | 40 --
 arch/x86/mm/mm_internal.h  |  3 +-
 arch/x86/mm/pat/set_memory.c   | 13 +++
 arch/x86/platform/uv/bios_uv.c |  3 +-
 include/linux/memory_hotplug.h | 21 ++--
 mm/memory_hotplug.c| 11 +++---
 mm/memremap.c  | 17 +
 24 files changed, 144 insertions(+), 81 deletions(-)


base-commit: 98d54f81e36ba3bf92172791eba5ca5bd813989b
--
2.20.1

Re: [PATCH v2 1/5] powerpc: Move idle_loop_prolog()/epilog() functions to header file

2020-03-06 Thread Nathan Lynch

Gautham R Shenoy  writes:
> On Fri, Feb 21, 2020 at 09:03:16AM -0600, Nathan Lynch wrote:
>> Looks fine and correct as a cleanup, but asm/include/idle.h and
>> idle_loop_prolog, idle_loop_epilog, strike me as too generic for
>> pseries-specific code.
>
> Should it be prefixed with pseries , i.e pseries_idle_prolog()
> and pseries_idle_epilog() ?

Yes that seems appropriate to me.

Re: [PATCH v2 4/5] powerpc/sysfs: Show idle_purr and idle_spurr for every CPU

2020-03-06 Thread Nathan Lynch

"Naveen N. Rao"  writes:
> Gautham R Shenoy wrote:
>> On Fri, Feb 21, 2020 at 10:50:12AM -0600, Nathan Lynch wrote:
>>> It's regrettable that we have to wake up potentially idle CPUs in order
>>> to derive correct idle statistics for them, but I suppose the main user
>>> (lparstat) of these interfaces already is causing this to happen by
>>> polling the existing per-cpu purr and spurr attributes.
>>> 
>>> So now lparstat will incur at minimum four syscalls and four IPIs per
>>> CPU per polling interval -- one for each of purr, spurr, idle_purr and
>>> idle_spurr. Correct?
>> 
>> Yes, it is unforunate that we will end up making four syscalls and
>> generating IPI noise, and this is something that I discussed with
>> Naveen and Kamalesh. We have the following two constraints:
>> 
>> 1) These values of PURR and SPURR required are per-cpu. Hence putting
>> them in lparcfg is not an option.
>> 
>> 2) sysfs semantics encourages a single value per key, the key being
>> the sysfs-file. Something like the following would have made far more
>> sense.
>> 
>> cat /sys/devices/system/cpu/cpuX/purr_spurr_accounting
>> purr:A
>> idle_purr:B
>> spurr:C
>> idle_spurr:D
>> 
>> There are some sysfs files which allow something like this. Eg: 
>> /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
>> 
>> Thoughts on any other alternatives?
>
> Umm... procfs?
> /me ducks

I had wondered about perf events but I'm not sure that's any more suitable.


>>> At some point it's going to make sense to batch sampling of remote CPUs'
>>> SPRs.
>
> How did you mean this? It looks like we first need to provide a separate 
> user interface, since with the existing sysfs interface providing 
> separate files, I am not sure if we can batch such reads.

I mean in order to minimize IPI traffic something like: sample/calculate
all of a CPU's purr, idle_purr, spurr, idle_spurr in a single IPI upon a
read of any of the attributes, and cache the result for some time, so
that the anticipated subsequent reads of the other attributes use the
cached results instead of generating more IPIs.

That would keep the current sysfs interface at the cost of imposing a
certain coarseness in the results.

Anyway, that's a mitigation that could be considered if the
implementation in this patch is found to be too expensive in practice.

[PATCH v3] powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC

2020-03-06 Thread Christophe Leroy

With CONFIG_KASAN_VMALLOC, new page tables are created at the time
shadow memory for vmalloc area in unmapped. If some parts of the
page table still has entries to the zero page shadow memory, the
entries are wrongly marked RW.

With CONFIG_KASAN_VMALLOC, almost the entire kernel address space
is managed by KASAN. To make it simple, just create KASAN page tables
for the entire kernel space at kasan_init(). That doesn't use much
more space, and that's anyway already done for hash platforms.

Fixes: 3d4247fcc938 ("powerpc/32: Add support of KASAN_VMALLOC")
Signed-off-by: Christophe Leroy 
---
v3: Split a too long line

v2: Allocate all tables at init instead of doing it when
unmapping vmalloc space KASAN pages.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/kasan/kasan_init_32.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index 1a29cf469903..cbcad369fcb2 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -120,12 +120,6 @@ static void __init kasan_unmap_early_shadow_vmalloc(void)
unsigned long k_cur;
phys_addr_t pa = __pa(kasan_early_shadow_page);
 
-   if (!early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) {
-   int ret = kasan_init_shadow_page_tables(k_start, k_end);
-
-   if (ret)
-   panic("kasan: kasan_init_shadow_page_tables() failed");
-   }
for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), 
k_cur);
pte_t *ptep = pte_offset_kernel(pmd, k_cur);
@@ -143,7 +137,8 @@ void __init kasan_mmu_init(void)
int ret;
struct memblock_region *reg;
 
-   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) {
+   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE) ||
+   IS_ENABLED(CONFIG_KASAN_VMALLOC)) {
ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, 
KASAN_SHADOW_END);
 
if (ret)
-- 
2.25.0

[PATCH 4/4] powerpc/xive: Add a debugfs file to dump internal XIVE state

2020-03-06 Thread Cédric Le Goater

As does XMON, the debugfs file /sys/kernel/debug/powerpc/xive exposes
the XIVE internal state of the machine CPUs and interrupts. Available
on the PowerNV and sPAPR platforms.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/xive-internal.h |   2 +
 arch/powerpc/sysdev/xive/common.c| 105 +++
 arch/powerpc/sysdev/xive/native.c|   3 +
 arch/powerpc/sysdev/xive/spapr.c |  19 
 4 files changed, 129 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index 382980f4de2d..b7b901da2168 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -57,12 +57,14 @@ struct xive_ops {
int (*get_ipi)(unsigned int cpu, struct xive_cpu *xc);
void(*put_ipi)(unsigned int cpu, struct xive_cpu *xc);
 #endif
+   int (*debug_show)(struct seq_file *m, void *private);
const char *name;
 };
 
 bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
u8 max_prio);
 __be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift);
+int xive_core_debug_init(void);
 
 static inline u32 xive_alloc_order(u32 queue_shift)
 {
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index c865ae554605..e192077481a1 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -1558,3 +1559,107 @@ static int __init xive_off(char *arg)
return 0;
 }
 __setup("xive=off", xive_off);
+
+void xive_debug_show_cpu(struct seq_file *m, int cpu)
+{
+   struct xive_cpu *xc = per_cpu(xive_cpu, cpu);
+
+   seq_printf(m, "CPU %d:", cpu);
+   if (xc) {
+   seq_printf(m, "pp=%02x CPPR=%02x ", xc->pending_prio, xc->cppr);
+
+#ifdef CONFIG_SMP
+   {
+   u64 val = xive_esb_read(>ipi_data, XIVE_ESB_GET);
+
+   seq_printf(m, "IPI=0x%08x PQ=%c%c ", xc->hw_ipi,
+  val & XIVE_ESB_VAL_P ? 'P' : '-',
+  val & XIVE_ESB_VAL_Q ? 'Q' : '-');
+   }
+#endif
+   {
+   struct xive_q *q = >queue[xive_irq_priority];
+   u32 i0, i1, idx;
+
+   if (q->qpage) {
+   idx = q->idx;
+   i0 = be32_to_cpup(q->qpage + idx);
+   idx = (idx + 1) & q->msk;
+   i1 = be32_to_cpup(q->qpage + idx);
+   seq_printf(m, "EQ idx=%d T=%d %08x %08x ...",
+  q->idx, q->toggle, i0, i1);
+   }
+   }
+   }
+   seq_puts(m, "\n");
+}
+
+void xive_debug_show_irq(struct seq_file *m, u32 hw_irq, struct irq_data *d)
+{
+   struct irq_chip *chip = irq_data_get_irq_chip(d);
+   int rc;
+   u32 target;
+   u8 prio;
+   u32 lirq;
+
+   if (!is_xive_irq(chip))
+   return;
+
+   rc = xive_ops->get_irq_config(hw_irq, , , );
+   if (rc) {
+   seq_printf(m, "IRQ 0x%08x : no config rc=%d\n", hw_irq, rc);
+   return;
+   }
+
+   seq_printf(m, "IRQ 0x%08x : target=0x%x prio=%02x lirq=0x%x ",
+  hw_irq, target, prio, lirq);
+
+   if (d) {
+   struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
+   u64 val = xive_esb_read(xd, XIVE_ESB_GET);
+
+   seq_printf(m, "flags=%c%c%c PQ=%c%c",
+  xd->flags & XIVE_IRQ_FLAG_STORE_EOI ? 'S' : ' ',
+  xd->flags & XIVE_IRQ_FLAG_LSI ? 'L' : ' ',
+  xd->flags & XIVE_IRQ_FLAG_H_INT_ESB ? 'H' : ' ',
+  val & XIVE_ESB_VAL_P ? 'P' : '-',
+  val & XIVE_ESB_VAL_Q ? 'Q' : '-');
+   }
+   seq_puts(m, "\n");
+}
+
+static int xive_core_debug_show(struct seq_file *m, void *private)
+{
+   unsigned int i;
+   struct irq_desc *desc;
+   int cpu;
+
+   if (xive_ops->debug_show)
+   xive_ops->debug_show(m, private);
+
+   for_each_possible_cpu(cpu)
+   xive_debug_show_cpu(m, cpu);
+
+   for_each_irq_desc(i, desc) {
+   struct irq_data *d = irq_desc_get_irq_data(desc);
+   unsigned int hw_irq;
+
+   if (!d)
+   continue;
+
+   hw_irq = (unsigned int)irqd_to_hwirq(d);
+
+   /* IPIs are special (HW number 0) */
+   if (hw_irq)
+   xive_debug_show_irq(m, hw_irq, d);
+   }
+   return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(xive_core_debug);
+
+int xive_core_debug_init(void)
+{
+   debugfs_create_file("xive", 0444, powerpc_debugfs_root,
+

[PATCH 1/4] powerpc/xive: Use XIVE_BAD_IRQ instead of zero to catch non configured IPIs

2020-03-06 Thread Cédric Le Goater

When a CPU is brought up, an IPI number is allocated and recorded
under the XIVE CPU structure. Invalid IPI numbers are tracked with
interrupt number 0x0.

On the PowerNV platform, the interrupt number space starts at 0x10 and
this works fine. However, on the sPAPR platform, it is possible to
allocate the interrupt number 0x0 and this raises an issue when CPU 0
is unplugged. The XIVE spapr driver tracks allocated interrupt numbers
in a bitmask and it is not correctly updated when interrupt number 0x0
is freed. It stays allocated and it is then impossible to reallocate.

Fix by using the XIVE_BAD_IRQ value instead of zero on both platforms.

Reported-by: David Gibson 
Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt 
controller")
Cc: sta...@vger.kernel.org # v4.14+
Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/xive-internal.h |  7 +++
 arch/powerpc/sysdev/xive/common.c| 12 +++-
 arch/powerpc/sysdev/xive/native.c|  4 ++--
 arch/powerpc/sysdev/xive/spapr.c |  4 ++--
 4 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index 59cd366e7933..382980f4de2d 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -5,6 +5,13 @@
 #ifndef __XIVE_INTERNAL_H
 #define __XIVE_INTERNAL_H
 
+/*
+ * A "disabled" interrupt should never fire, to catch problems
+ * we set its logical number to this
+ */
+#define XIVE_BAD_IRQ   0x7fff
+#define XIVE_MAX_IRQ   (XIVE_BAD_IRQ - 1)
+
 /* Each CPU carry one of these with various per-CPU state */
 struct xive_cpu {
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index fa49193206b6..550baba98ec9 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -68,13 +68,6 @@ static u32 xive_ipi_irq;
 /* Xive state for each CPU */
 static DEFINE_PER_CPU(struct xive_cpu *, xive_cpu);
 
-/*
- * A "disabled" interrupt should never fire, to catch problems
- * we set its logical number to this
- */
-#define XIVE_BAD_IRQ   0x7fff
-#define XIVE_MAX_IRQ   (XIVE_BAD_IRQ - 1)
-
 /* An invalid CPU target */
 #define XIVE_INVALID_TARGET(-1)
 
@@ -1153,7 +1146,7 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
xc = per_cpu(xive_cpu, cpu);
 
/* Check if we are already setup */
-   if (xc->hw_ipi != 0)
+   if (xc->hw_ipi != XIVE_BAD_IRQ)
return 0;
 
/* Grab an IPI from the backend, this will populate xc->hw_ipi */
@@ -1190,7 +1183,7 @@ static void xive_cleanup_cpu_ipi(unsigned int cpu, struct 
xive_cpu *xc)
/* Disable the IPI and free the IRQ data */
 
/* Already cleaned up ? */
-   if (xc->hw_ipi == 0)
+   if (xc->hw_ipi == XIVE_BAD_IRQ)
return;
 
/* Mask the IPI */
@@ -1346,6 +1339,7 @@ static int xive_prepare_cpu(unsigned int cpu)
if (np)
xc->chip_id = of_get_ibm_chip_id(np);
of_node_put(np);
+   xc->hw_ipi = XIVE_BAD_IRQ;
 
per_cpu(xive_cpu, cpu) = xc;
}
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 0ff6b739052c..50e1a8e02497 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -312,7 +312,7 @@ static void xive_native_put_ipi(unsigned int cpu, struct 
xive_cpu *xc)
s64 rc;
 
/* Free the IPI */
-   if (!xc->hw_ipi)
+   if (xc->hw_ipi == XIVE_BAD_IRQ)
return;
for (;;) {
rc = opal_xive_free_irq(xc->hw_ipi);
@@ -320,7 +320,7 @@ static void xive_native_put_ipi(unsigned int cpu, struct 
xive_cpu *xc)
msleep(OPAL_BUSY_DELAY_MS);
continue;
}
-   xc->hw_ipi = 0;
+   xc->hw_ipi = XIVE_BAD_IRQ;
break;
}
 }
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 55dc61cb4867..3f15615712b5 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -560,11 +560,11 @@ static int xive_spapr_get_ipi(unsigned int cpu, struct 
xive_cpu *xc)
 
 static void xive_spapr_put_ipi(unsigned int cpu, struct xive_cpu *xc)
 {
-   if (!xc->hw_ipi)
+   if (xc->hw_ipi == XIVE_BAD_IRQ)
return;
 
xive_irq_bitmap_free(xc->hw_ipi);
-   xc->hw_ipi = 0;
+   xc->hw_ipi = XIVE_BAD_IRQ;
 }
 #endif /* CONFIG_SMP */
 
-- 
2.21.1

[PATCH v2] powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC

2020-03-06 Thread Christophe Leroy

With CONFIG_KASAN_VMALLOC, new page tables are created at the time
shadow memory for vmalloc area in unmapped. If some parts of the
page table still has entries to the zero page shadow memory, the
entries are wrongly marked RW.

With CONFIG_KASAN_VMALLOC, almost the entire kernel address space
is managed by KASAN. To make it simple, just create KASAN page tables
for the entire kernel space at kasan_init(). That doesn't use much
more space, and that's anyway already done for hash platforms.

Fixes: 3d4247fcc938 ("powerpc/32: Add support of KASAN_VMALLOC")
Signed-off-by: Christophe Leroy 
---
v2: Allocate all tables at init instead of doing it when
unmapping vmalloc space KASAN pages.
---
 arch/powerpc/mm/kasan/kasan_init_32.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index 1a29cf469903..c9174d645652 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -120,12 +120,6 @@ static void __init kasan_unmap_early_shadow_vmalloc(void)
unsigned long k_cur;
phys_addr_t pa = __pa(kasan_early_shadow_page);
 
-   if (!early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) {
-   int ret = kasan_init_shadow_page_tables(k_start, k_end);
-
-   if (ret)
-   panic("kasan: kasan_init_shadow_page_tables() failed");
-   }
for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), 
k_cur);
pte_t *ptep = pte_offset_kernel(pmd, k_cur);
@@ -143,7 +137,7 @@ void __init kasan_mmu_init(void)
int ret;
struct memblock_region *reg;
 
-   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) {
+   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE) || 
IS_ENABLED(CONFIG_KASAN_VMALLOC)) {
ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, 
KASAN_SHADOW_END);
 
if (ret)
-- 
2.25.0

[PATCH] powerpc/kasan: Fix kasan_remap_early_shadow_ro()

2020-03-06 Thread Christophe Leroy

At the moment kasan_remap_early_shadow_ro() does nothing, because
k_end is 0 and k_cur < 0 is always true.

Change the test to k_cur != k_end, as done in
kasan_init_shadow_page_tables()

Signed-off-by: Christophe Leroy 
Fixes: cbd18991e24f ("powerpc/mm: Fix an Oops in kasan_mmu_init()")
Cc: sta...@vger.kernel.org
---
 arch/powerpc/mm/kasan/kasan_init_32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index f19526e7d3dc..1a29cf469903 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -101,7 +101,7 @@ static void __init kasan_remap_early_shadow_ro(void)
 
kasan_populate_pte(kasan_early_shadow_pte, prot);
 
-   for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
+   for (k_cur = k_start & PAGE_MASK; k_cur != k_end; k_cur += PAGE_SIZE) {
pmd_t *pmd = pmd_ptr_k(k_cur);
pte_t *ptep = pte_offset_kernel(pmd, k_cur);
 
-- 
2.25.0

[PATCH 0/4] powerpc/xive: fixes and debug extensions

2020-03-06 Thread Cédric Le Goater

Hello,

First two patches are fixes for non-critical issues. I checked that
they applied on stable.

Last two are debug extensions, one for xmon and the other to dump XIVE
internal state under debugfs, which is easier than xmon.

Cheers,

C.

Cédric Le Goater (4):
  powerpc/xive: Use XIVE_BAD_IRQ instead of zero to catch non configured
IPIs
  powerpc/xive: Fix xmon support on the PowerNV platform
  powerpc/xmon: Add source flags to output of XIVE interrupts
  powerpc/xive: Add a debugfs file to dump internal XIVE state

 arch/powerpc/sysdev/xive/xive-internal.h |   9 ++
 arch/powerpc/sysdev/xive/common.c| 126 +--
 arch/powerpc/sysdev/xive/native.c|   7 +-
 arch/powerpc/sysdev/xive/spapr.c |  23 -
 4 files changed, 151 insertions(+), 14 deletions(-)

-- 
2.21.1

[PATCH v8 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2020-03-06 Thread Daniel Axtens

Implement a limited form of KASAN for Book3S 64-bit machines running under
the Radix MMU:

 - Set aside the last 1/8th of the first contiguous block of physical
   memory to provide writable shadow for the linear map. For annoying
   reasons documented below, the memory size must be specified at compile
   time.

 - Enable the compiler instrumentation to check addresses and maintain the
   shadow region. (This is the guts of KASAN which we can easily reuse.)

 - Require kasan-vmalloc support to handle modules and anything else in
   vmalloc space.

 - KASAN needs to be able to validate all pointer accesses, but we can't
   back all kernel addresses with real memory - only linear map and
   vmalloc. On boot, set up a single page of read-only shadow that marks
   all these accesses as valid.

 - Make our stack-walking code KASAN-safe by using READ_ONCE_NOCHECK -
   generic code, arm64, s390 and x86 all do this for similar sorts of
   reasons: when unwinding a stack, we might touch memory that KASAN has
   marked as being out-of-bounds. In our case we often get this when
   checking for an exception frame because we're checking an arbitrary
   offset into the stack frame.

   See commit 20955746320e ("s390/kasan: avoid false positives during stack
   unwind"), commit bcaf669b4bdb ("arm64: disable kasan when accessing
   frame->fp in unwind_frame"), commit 91e08ab0c851 ("x86/dumpstack:
   Prevent KASAN false positive warnings") and commit 6e22c8366416
   ("tracing, kasan: Silence Kasan warning in check_stack of stack_tracer")

 - Document KASAN in both generic and powerpc docs.

Background
--

KASAN support on Book3S is a bit tricky to get right:

 - It would be good to support inline instrumentation so as to be able to
   catch stack issues that cannot be caught with outline mode.

 - Inline instrumentation requires a fixed offset.

 - Book3S runs code in real mode after booting. Most notably a lot of KVM
   runs in real mode, and it would be good to be able to instrument it.

 - Because code runs in real mode after boot, the offset has to point to
   valid memory both in and out of real mode.

[ppc64 mm note: The kernel installs a linear mapping at effective
address c000... onward. This is a one-to-one mapping with physical
memory from ... onward. Because of how memory accesses work on
powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the
same memory both with translations on (accessing as an 'effective
address'), and with translations off (accessing as a 'real
address'). This works in both guests and the hypervisor. For more
details, see s5.7 of Book III of version 3 of the ISA, in particular
the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this
KASAN implementation currently only supports Radix.]

One approach is just to give up on inline instrumentation. This way all
checks can be delayed until after everything set is up correctly, and the
address-to-shadow calculations can be overridden. However, the features and
speed boost provided by inline instrumentation are worth trying to do
better.

If _at compile time_ it is known how much contiguous physical memory a
system has, the top 1/8th of the first block of physical memory can be set
aside for the shadow. This is a big hammer and comes with 3 big
consequences:

 - there's no nice way to handle physically discontiguous memory, so only
   the first physical memory block can be used.

 - kernels will simply fail to boot on machines with less memory than
   specified when compiling.

 - kernels running on machines with more memory than specified when
   compiling will simply ignore the extra memory.

Despite the limitations, it can still find bugs,
e.g. http://patchwork.ozlabs.org/patch/1103775/

At the moment, this physical memory limit must be set _even for outline
mode_. This may be changed in a later series - a different implementation
could be added for outline mode that dynamically allocates shadow at a
fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/

Suggested-by: Michael Ellerman 
Cc: Balbir Singh  # ppc64 out-of-line radix version
Cc: Christophe Leroy  # ppc32 version
Reviewed-by:  # focussed mainly on Documentation
 and things impacting PPC32
Signed-off-by: Daniel Axtens 

---
Changes since v7:

 - Don't instrument arch/powerpc/kernel/paca.c, it can lead to hangs. Also
   don't instrument setup_64.c; it's too early to be safe.

 - Reword this commit message, thanks Mikey.

 - Reinsert some tidier stack walking code, with hopefully some better
   justification. Certainly it is a common, cross-platform sort of issue.

 - Fix a stupid bug in early printing where I multiplied by SZ_1M rather than
   divided.

 - Make the memory reservation code FLATMEM friendly.

 - Kconfig tweaks, thanks Christophe.

Changes since v6:
 - rework kasan_late_init support, which also fixes book3e problem that 
snowpatch

[PATCH v8 3/4] powerpc/mm/kasan: rename kasan_init_32.c to init_32.c

2020-03-06 Thread Daniel Axtens

kasan is already implied by the directory name, we don't need to
repeat it.

Suggested-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 
---
 arch/powerpc/mm/kasan/Makefile   | 2 +-
 arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename arch/powerpc/mm/kasan/{kasan_init_32.c => init_32.c} (100%)

diff --git a/arch/powerpc/mm/kasan/Makefile b/arch/powerpc/mm/kasan/Makefile
index 6577897673dd..36a4e1b10b2d 100644
--- a/arch/powerpc/mm/kasan/Makefile
+++ b/arch/powerpc/mm/kasan/Makefile
@@ -2,4 +2,4 @@
 
 KASAN_SANITIZE := n
 
-obj-$(CONFIG_PPC32)   += kasan_init_32.o
+obj-$(CONFIG_PPC32)   += init_32.o
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/init_32.c
similarity index 100%
rename from arch/powerpc/mm/kasan/kasan_init_32.c
rename to arch/powerpc/mm/kasan/init_32.c
-- 
2.20.1

[PATCH v8 2/4] kasan: Document support on 32-bit powerpc

2020-03-06 Thread Daniel Axtens

KASAN is supported on 32-bit powerpc and the docs should reflect this.

Document s390 support while we're at it.

Suggested-by: Christophe Leroy 
Reviewed-by: Christophe Leroy 
Signed-off-by: Daniel Axtens 

---

Changes since v5:
 - rebase - riscv has now got support.
 - document s390 support while we're at it
 - clarify when kasan_vmalloc support is required
---
 Documentation/dev-tools/kasan.rst |  7 +--
 Documentation/powerpc/kasan.txt   | 12 
 2 files changed, 17 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/powerpc/kasan.txt

diff --git a/Documentation/dev-tools/kasan.rst 
b/Documentation/dev-tools/kasan.rst
index c652d740735d..012ef3d91d1f 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -22,7 +22,8 @@ global variables yet.
 Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
-riscv architectures, and tag-based KASAN is supported only for arm64.
+riscv architectures. It is also supported on 32-bit powerpc kernels. Tag-based 
+KASAN is supported only on arm64.
 
 Usage
 -
@@ -255,7 +256,9 @@ CONFIG_KASAN_VMALLOC
 
 
 With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
-cost of greater memory usage. Currently this is only supported on x86.
+cost of greater memory usage. Currently this supported on x86, s390
+and 32-bit powerpc. It is optional, except on 32-bit powerpc kernels
+with module support, where it is required.
 
 This works by hooking into vmalloc and vmap, and dynamically
 allocating real shadow memory to back the mappings.
diff --git a/Documentation/powerpc/kasan.txt b/Documentation/powerpc/kasan.txt
new file mode 100644
index ..26bb0e8bb18c
--- /dev/null
+++ b/Documentation/powerpc/kasan.txt
@@ -0,0 +1,12 @@
+KASAN is supported on powerpc on 32-bit only.
+
+32 bit support
+==
+
+KASAN is supported on both hash and nohash MMUs on 32-bit.
+
+The shadow area sits at the top of the kernel virtual memory space above the
+fixmap area and occupies one eighth of the total kernel virtual memory space.
+
+Instrumentation of the vmalloc area is optional, unless built with modules,
+in which case it is required.
-- 
2.20.1

[PATCH v8 1/4] kasan: define and use MAX_PTRS_PER_* for early shadow tables

2020-03-06 Thread Daniel Axtens

powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer constants, and therefore
breaks the build.

Define default MAX_PTRS_PER_*s in the same style as MAX_PTRS_PER_P4D.
As KASAN is the only user at the moment, just define them in the kasan
header, and have them default to PTRS_PER_* unless overridden in arch
code.

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Reviewed-by: Christophe Leroy 
Reviewed-by: Balbir Singh 
Signed-off-by: Daniel Axtens 
---
 include/linux/kasan.h | 18 +++---
 mm/kasan/init.c   |  6 +++---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 5cde9e7c2664..b3a4500633f5 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -14,10 +14,22 @@ struct task_struct;
 #include 
 #include 
 
+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
 
 int kasan_populate_early_shadow(const void *shadow_start,
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491ebcd..8b54a96d3b3e 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -46,7 +46,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -58,7 +58,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -69,7 +69,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE] __page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
 {
-- 
2.20.1

[PATCH v8 0/4] KASAN for powerpc64 radix

2020-03-06 Thread Daniel Axtens

Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU.

This provides full inline instrumentation on radix, but does require
that you be able to specify the amount of physically contiguous memory
on the system at compile time. More details in patch 4.

One quirk that I've noticed is that detection of invalid accesses to
module globals are currently broken - everything is permitted. I'm
sure this used to work, but it doesn't atm and this is why: gcc puts
the ASAN init code in a section called '.init_array'. Powerpc64 module
loading code goes through and _renames_ any section beginning with
'.init' to begin with '_init' in order to avoid some complexities
around our 24-bit indirect jumps. This means it renames '.init_array'
to '_init_array', and the generic module loading code then fails to
recognise the section as a constructor and thus doesn't run it. This
hack dates back to 2003 and so I'm not going to try to unpick it in
this series. (I suspect this may have previously worked if the code
ended up in .ctors rather than .init_array but I don't keep my old
binaries around so I have no real way of checking.)

v8: Rejig patch 4 commit message, thanks Mikey.
Various tweaks to patch 4: fix some potential hangs, clean up
some code, fix a trivial bug, and also have another crack at
correct stack-walking based on what other arches do. Some very
minor tweaks, and a review from Christophe.

v7: Tweaks from Christophe, fix issues detected by snowpatch.

v6: Rebase on the latest changes in powerpc/merge. Minor tweaks
  to the documentation. Small tweaks to the header to work
  with the kasan_late_init() function that Christophe added
  for 32-bit kasan-vmalloc support.
No functional change.

v5: ptdump support. More cleanups, tweaks and fixes, thanks
Christophe. Details in patch 4.

I have seen another stack walk splat, but I don't think it's
related to the patch set, I think there's a bug somewhere else,
probably in stack frame manipulation in the kernel or (more
unlikely) in the compiler.

v4: More cleanups, split renaming out, clarify bits and bobs.
Drop the stack walk disablement, that isn't needed. No other
functional change.

v3: Reduce the overly ambitious scope of the MAX_PTRS change.
Document more things, including around why some of the
restrictions apply.
Clean up the code more, thanks Christophe.

v2: The big change is the introduction of tree-wide(ish)
MAX_PTRS_PER_{PTE,PMD,PUD} macros in preference to the previous
approach, which was for the arch to override the page table array
definitions with their own. (And I squashed the annoying
intermittent crash!)

Apart from that there's just a lot of cleanup. Christophe, I've
addressed most of what you asked for and I will reply to your v1
emails to clarify what remains unchanged.

Re: [PATCH v3] powerpc: setup_64: set up PACA earlier to avoid kcov problems

2020-03-06 Thread Daniel Axtens

Andrew Donnellan  writes:

> On 6/3/20 6:30 pm, Daniel Axtens wrote:
>> kcov instrumentation is collected the __sanitizer_cov_trace_pc hook in
>> kernel/kcov.c. The compiler inserts these hooks into every basic block
>> unless kcov is disabled for that file.
>> 
>> We then have a deep call-chain:
>>   - __sanitizer_cov_trace_pc calls to check_kcov_mode()
>>   - check_kcov_mode() (kernel/kcov.c) calls in_task()
>>   - in_task() (include/linux/preempt.h) calls preempt_count().
>>   - preempt_count() (include/asm-generic/preempt.h) calls
>>   current_thread_info()
>>   - because powerpc has THREAD_INFO_IN_TASK, current_thread_info()
>>   (include/linux/thread_info.h) is defined to 'current'
>>   - current (arch/powerpc/include/asm/current.h) is defined to
>>   get_current().
>>   - get_current (same file) loads an offset of r13.
>>   - arch/powerpc/include/asm/paca.h makes r13 a register variable
>>   called local_paca - it is the PACA for the current CPU, so
>>   this has the effect of loading the current task from PACA.
>>   - get_current returns the current task from PACA,
>>   - current_thread_info returns the task cast to a thread_info
>>   - preempt_count dereferences the thread_info to load preempt_count
>>   - that value is used by in_task and so on up the chain
>> 
>> The problem is:
>> 
>>   - kcov instrumentation is enabled for arch/powerpc/kernel/dt_cpu_ftrs.c
>> 
>>   - even if it were not, dt_cpu_ftrs_init calls generic dt parsing code
>> which should definitely have instrumentation enabled.
>> 
>>   - setup_64.c calls dt_cpu_ftrs_init before it sets up a PACA.
>> 
>>   - If we don't set up a paca, r13 will contain unpredictable data.
>> 
>>   - In a zImage compiled with kcov and KASAN, we see r13 containing a value
>> that leads to dereferencing invalid memory (something like
>> 912a72603d420015).
>> 
>>   - Weirdly, the same kernel as a vmlinux loaded directly by qemu does not
>> crash. Investigating with gdb, it seems that in the vmlinux boot case,
>> r13 is near enough to zero that we just happen to be able to read that
>> part of memory (we're operating with translation off at this point) and
>> the current pointer also happens to land in readable memory and
>> everything just works.
>> 
>>   - PACA setup refers to CPU features - setup_paca() looks at
>> early_cpu_has_feature(CPU_FTR_HVMODE)
>> 
>> There's no generic kill switch for kcov (as far as I can tell), and we
>> don't want to have to turn off instrumentation in the generic dt parsing
>> code (which lives outside arch/powerpc/) just because we don't have a real
>> paca or task yet.
>> 
>> So:
>>   - change the test when setting up a PACA to consider the actual value of
>> the MSR rather than the CPU feature.
>> 
>>   - move the PACA setup to before the cpu feature parsing.
>> 
>> Translations get switched on once we leave early_setup, so I think we'd
>> already catch any other cases where the PACA or task aren't set up.
>> 
>> Boot tested on a P9 guest and host.
>> 
>> Fixes: fb0b0a73b223 ("powerpc: Enable kcov")
>> Cc: Andrew Donnellan 
>> Suggested-by: Michael Ellerman 
>> Signed-off-by: Daniel Axtens 
>> 
>> ---
>> 
>> Regarding moving the comment about printk()-safety:
>> I am about 75% sure that the thing that makes printk() safe is the PACA,
>> not the CPU features. That's what commit 24d9649574fb ("[POWERPC] Document
>> when printk is useable") seems to indicate, but as someone wise recently
>> told me, "bootstrapping is hard", so I may be totally wrong.
>> 
>> v3: Update comment, thanks Christophe Leroy.
>>  Remove a comment in dt_cpu_ftrs.c that is no longer accurate - thanks
>>Andrew. I think we want to retain all the code still, but I'm open to
>>being told otherwise.
>
> Thanks for doing that.
>
> This patch and the justification doesn't seem obviously wrong, and is 
> snowpatch-clean.
>
> Reviewed-by: Andrew Donnellan 
>
> (Is it worth cc'ing this to stable in case there are other situations we 
> haven't foreseen where we hit the unpredictable r13 data? Few people use 
> kcov...)

I did briefly consider it but didn't believe it reached the stable
criteria:

| It must fix a real bug that bothers people (not a, “This could be a
| problem...” type thing).

On reflection it's a real bug (boot hang), it bothers me, and presumably
also you due to the syzkaller interaction, and I am led to believe we
are both people, so I guess I'll do a v3 with cc: stable. Thanks!

Regards,
Daniel

>
> -- 
> Andrew Donnellan  OzLabs, ADL Canberra
> a...@linux.ibm.com IBM Australia Limited

Re: [PATCH v7 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support

2020-03-06 Thread Daniel Axtens

Christophe Leroy  writes:

> Le 13/02/2020 à 01:47, Daniel Axtens a écrit :
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 497b7d0b2d7e..f1c54c08a88e 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -169,7 +169,9 @@ config PPC
>>  select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
>> PPC_RADIX_MMU
>>  select HAVE_ARCH_JUMP_LABEL
>>  select HAVE_ARCH_KASAN  if PPC32
>> +select HAVE_ARCH_KASAN  if PPC_BOOK3S_64 && 
>> PPC_RADIX_MMU
>
> That's probably detail, but as it is necessary to deeply define the HW 
> when selecting that (I mean giving the exact amount of memory and with 
> restrictions like having a wholeblock memory), should it also depend of 
> EXPERT ?

If it weren't a debug feature I would definitely agree with you, but I
think being a debug feature it's not so necessary. Also anything with
more memory than the config option specifies will still boot - it's just
less memory that won't boot. I've set the default to 1024MB: I know
that's a lot of memory for an embedded system but I think for anything
with the Radix MMU it's an OK default.

I'm sure if mpe disagrees he can add EXPERT when he's merging :)

> Maybe we could have
>
> - select HAVE_ARCH_KASAN_VMALLOC  if PPC32
> + select HAVE_ARCH_KASAN_VMALLOC  if HAVE_ARCH_KASAN

That's a good idea. Done.

Thanks for the review!

Regards,
Daniel

>
>>  select HAVE_ARCH_KGDB
>>  select HAVE_ARCH_MMAP_RND_BITS
>>  select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if COMPAT
>
>
> Christophe

[PATCH] Fixes: 227942809b52 ("cpufreq: powernv: Restore cpu frequency to policy->cur on unthrottling")

2020-03-06 Thread Pratik Rajesh Sampat

The patch avoids allocating cpufreq_policy on stack hence fixing frame
size overflow in 'powernv_cpufreq_work_fn'

Signed-off-by: Pratik Rajesh Sampat 
---
 drivers/cpufreq/powernv-cpufreq.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 56f4bc0d209e..20ee0661555a 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -902,6 +902,7 @@ static struct notifier_block powernv_cpufreq_reboot_nb = {
 void powernv_cpufreq_work_fn(struct work_struct *work)
 {
struct chip *chip = container_of(work, struct chip, throttle);
+   struct cpufreq_policy *policy;
unsigned int cpu;
cpumask_t mask;
 
@@ -916,12 +917,14 @@ void powernv_cpufreq_work_fn(struct work_struct *work)
chip->restore = false;
for_each_cpu(cpu, ) {
int index;
-   struct cpufreq_policy policy;
 
-   cpufreq_get_policy(, cpu);
-   index = cpufreq_table_find_index_c(, policy.cur);
-   powernv_cpufreq_target_index(, index);
-   cpumask_andnot(, , policy.cpus);
+   policy = cpufreq_cpu_get(cpu);
+   if (!policy)
+   continue;
+   index = cpufreq_table_find_index_c(policy, policy->cur);
+   powernv_cpufreq_target_index(policy, index);
+   cpumask_andnot(, , policy->cpus);
+   cpufreq_cpu_put(policy);
}
 out:
put_online_cpus();
-- 
2.17.1

[Bug 206695] kmemleak reports leaks in drivers/macintosh/windfarm

2020-03-06 Thread bugzilla-daemon

https://bugzilla.kernel.org/show_bug.cgi?id=206695

--- Comment #6 from Erhard F. (erhar...@mailbox.org) ---
(In reply to mpe from comment #5)
> Can you try this one instead, it changes the order of operations to make
> the code flow a bit nicer.
2nd patch works equally well.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v3] powerpc: setup_64: set up PACA earlier to avoid kcov problems

2020-03-06 Thread Andrew Donnellan


On 6/3/20 6:30 pm, Daniel Axtens wrote:

kcov instrumentation is collected the __sanitizer_cov_trace_pc hook in
kernel/kcov.c. The compiler inserts these hooks into every basic block
unless kcov is disabled for that file.

We then have a deep call-chain:
  - __sanitizer_cov_trace_pc calls to check_kcov_mode()
  - check_kcov_mode() (kernel/kcov.c) calls in_task()
  - in_task() (include/linux/preempt.h) calls preempt_count().
  - preempt_count() (include/asm-generic/preempt.h) calls
  current_thread_info()
  - because powerpc has THREAD_INFO_IN_TASK, current_thread_info()
  (include/linux/thread_info.h) is defined to 'current'
  - current (arch/powerpc/include/asm/current.h) is defined to
  get_current().
  - get_current (same file) loads an offset of r13.
  - arch/powerpc/include/asm/paca.h makes r13 a register variable
  called local_paca - it is the PACA for the current CPU, so
  this has the effect of loading the current task from PACA.
  - get_current returns the current task from PACA,
  - current_thread_info returns the task cast to a thread_info
  - preempt_count dereferences the thread_info to load preempt_count
  - that value is used by in_task and so on up the chain

The problem is:

  - kcov instrumentation is enabled for arch/powerpc/kernel/dt_cpu_ftrs.c

  - even if it were not, dt_cpu_ftrs_init calls generic dt parsing code
which should definitely have instrumentation enabled.

  - setup_64.c calls dt_cpu_ftrs_init before it sets up a PACA.

  - If we don't set up a paca, r13 will contain unpredictable data.

  - In a zImage compiled with kcov and KASAN, we see r13 containing a value
that leads to dereferencing invalid memory (something like
912a72603d420015).

  - Weirdly, the same kernel as a vmlinux loaded directly by qemu does not
crash. Investigating with gdb, it seems that in the vmlinux boot case,
r13 is near enough to zero that we just happen to be able to read that
part of memory (we're operating with translation off at this point) and
the current pointer also happens to land in readable memory and
everything just works.

  - PACA setup refers to CPU features - setup_paca() looks at
early_cpu_has_feature(CPU_FTR_HVMODE)

There's no generic kill switch for kcov (as far as I can tell), and we
don't want to have to turn off instrumentation in the generic dt parsing
code (which lives outside arch/powerpc/) just because we don't have a real
paca or task yet.

So:
  - change the test when setting up a PACA to consider the actual value of
the MSR rather than the CPU feature.

  - move the PACA setup to before the cpu feature parsing.

Translations get switched on once we leave early_setup, so I think we'd
already catch any other cases where the PACA or task aren't set up.

Boot tested on a P9 guest and host.

Fixes: fb0b0a73b223 ("powerpc: Enable kcov")
Cc: Andrew Donnellan 
Suggested-by: Michael Ellerman 
Signed-off-by: Daniel Axtens 

---

Regarding moving the comment about printk()-safety:
I am about 75% sure that the thing that makes printk() safe is the PACA,
not the CPU features. That's what commit 24d9649574fb ("[POWERPC] Document
when printk is useable") seems to indicate, but as someone wise recently
told me, "bootstrapping is hard", so I may be totally wrong.

v3: Update comment, thanks Christophe Leroy.
 Remove a comment in dt_cpu_ftrs.c that is no longer accurate - thanks
   Andrew. I think we want to retain all the code still, but I'm open to
   being told otherwise.


Thanks for doing that.

This patch and the justification doesn't seem obviously wrong, and is 
snowpatch-clean.


Reviewed-by: Andrew Donnellan 

(Is it worth cc'ing this to stable in case there are other situations we 
haven't foreseen where we hit the unpredictable r13 data? Few people use 
kcov...)



--
Andrew Donnellan  OzLabs, ADL Canberra
a...@linux.ibm.com IBM Australia Limited

84 matches

Mail list logo