Re: [PATCH] powerpc/pseries: Remove confusing warning message.

2019-10-01 Thread Stephen Rothwell
Hi Laurent,

On Tue,  1 Oct 2019 15:29:28 +0200 Laurent Dufour  wrote:
>
> Fixes: 1211ee61b4a8 ("powerpc/pseries: Read TLB Block Invalidate Characteristics")
> Reported-by: Stephen Rothwell 

Please use my external email address , thanks.

-- 
Cheers,
Stephen Rothwell




[PATCH RFC 14/15] powerpc/eeh: Sync eeh_force_recover_write()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index cba16ca0694a..26d9367c41a1 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1971,7 +1971,7 @@ static ssize_t eeh_force_recover_write(struct file *filp,
 * from an odd state (e.g. PE removed, or recovery of a
 * non-isolated PE)
 */
-   __eeh_send_failure_event(pe);
+   __eeh_send_failure_event(pe); /* Give ref */
 
return ret < 0 ? ret : count;
 }
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 05/15] powerpc/eeh: Sync eeh_pe_get_parent()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh_pe.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index b89ed46f14e6..0486d3c6ff20 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -469,10 +469,12 @@ struct eeh_pe *eeh_pe_find(struct pci_controller *phb,
  * The whole PEs existing in the system are organized as hierarchy
  * tree. The function is used to retrieve the parent PE according
  * to the parent EEH device.
+ * Returns a referenced PE.
  */
 static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev)
 {
struct eeh_dev *parent;
+   struct eeh_pe *pe;
struct pci_dn *pdn = eeh_dev_to_pdn(edev);
 
/*
@@ -490,8 +492,14 @@ static struct eeh_pe *eeh_pe_get_parent(struct eeh_dev *edev)
if (!parent)
return NULL;
 
-   if (parent->pe)
-   return parent->pe;
+   if (parent->pe) {
+   /* TODO: Unsafe until eeh_dev can be synchronized
+* with eeh_pe.
+*/
+   pe = parent->pe;
+   eeh_get_pe(pe);
+   return pe;
+   }
 
pdn = pdn->parent;
}
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 15/15] powerpc/eeh: Sync pcibios_set_pcie_reset_state()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 26d9367c41a1..c61bfaf4ca26 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -924,6 +924,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
__func__, pci_name(dev));
return -EINVAL;
}
+   eeh_get_pe(pe);
 
switch (state) {
case pcie_deassert_reset:
@@ -957,6 +958,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state stat
return -EINVAL;
};
 
+   eeh_put_pe(pe);
return 0;
 }
 
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 00/15] powerpc/eeh: Synchronize access to struct eeh_pe

2019-10-01 Thread Sam Bobroff
Hi Everyone,

It's currently possible to cause a kernel crash by being (un)lucky while adding
or removing PCI devices during EEH recovery.  The cause of the problem is that
there are trees of objects (struct eeh_pe, implemented with pointers and kernel
linked lists) that are manipulated concurrently by the PCI core and the EEH
recovery system, without synchronization.  Unsurprisingly, this allows the
kernel to hit list poison values or freed memory, and crash.

Here's a first draft of a way to address this, and I would welcome feedback at
this stage: Does it make sense? Is it simple enough? Is it maintainable? Are
there other options I should explore?  etc...

The design:

We have to consider that:
- eeh_dev_check_failure() can be called in any context, so the only methods
  that are safe to use there are a spinlock or atomics.
- Some recovery operations are blocking (and can take a long time), so they
  can't be protected by a spinlock. These operations are performed while
  traversing lists of PEs.

The simplest solution might initially seem to be to hold the pci_rescan_remove
lock during EEH recovery, but it seems to have some significant problems:
- It's a mutex, and can't be used in eeh_dev_check_failure().
- If there is a crash during recovery, either in the core kernel or a driver's
  call-back function, the lock would be held forever and that would block a lot
  of the PCI core.
- The EEH code is quite recursive and it's very difficult to work out where to
  take the lock without causing deadlocks. If there are bugs, they would likely
  be deadlocks.

So instead, this proposal uses a spinlock (eeh_pe_lock) to protect access to
the list-membership fields of eeh_pe in combination with reference counting for
blocking operations.  Because the spinlock is only protecting the list fields
(and no state), the lock only needs to be used over small sections that can be
inspected easily.

For blocking operations I'm using the standard kernel refcounting. It only
involves one extra field on each PE but it does require tweaks to maintain the
refcount in most functions that use PEs and sometimes that can be tricky.

The way the refcount is used is fairly standard: it starts at 1 and is
incremented for each pointer pointing to the PE. The initial value of 1
represents the 'existence' of the PE and its membership in the 'tree' under
its PHB, regardless of where in that tree it is, or if it has children. A PE
is removed from those lists before its initial reference is dropped.

The interaction of the spinlock and refcounting provides this property: If the
spinlock is held and a ref is held on a PE then it is safe to start from that
PE and traverse anywhere along the list fields or parent pointers to other PEs
without taking references to them.  (To keep a reference to a PE for use
outside of the spinlock, a ref must be taken before the spinlock is released.)

That property can be used to perform blocking operations while traversing the
lists by 'sliding' the reference along, only holding the spinlock while moving
to the next iteration.  It seems to work well but is not perfect: when the lock
is released, the current PE may be removed and there isn't any simple, safe way
to continue the traversal. The situation can be detected, so there won't be a
crash, but traversal can be left incomplete. Perhaps this could be fixed by
using a temporary list of struct eeh_pe that is collected while holding the
spinlock (I've experimented with this, and it works but seems a bit
complicated), or perhaps we can just work around it where it matters.  It
hasn't happened at all during my testing. I'll look at it if this proposal goes
ahead.

For clarity, I have also kept the scope very tightly on the lists of struct
eeh_pe and the parent pointer. This patchset does not protect the list of
eeh_dev in each eeh_pe or the pdev member of struct eeh_dev, both of which are
also affected by races. I've left a few 'this is unsafe' comments in the code
in those areas.  I'll look more at it if this proposal goes ahead; I don't
think they will be difficult compared to eeh_pe.

Lastly, I've included an "orphan tracking system" that I used during
development to verify that reference counting was acting as expected, but I have
no idea if it should be included in the final version.  It keeps track of PEs
that have been removed from the PHB tree, but not yet freed and makes that list
available in debugfs.  Any PEs that remain orphans for very long are going to
be the result of bugs.  It's extra risk because it itself could contain bugs,
but it could also be useful during debugging.

Cheers,
Sam.

Sam Bobroff (15):
  powerpc/eeh: Introduce refcounting for struct eeh_pe
  powerpc/eeh: Rename eeh_pe_get() to eeh_pe_find()
  powerpc/eeh: Track orphaned struct eeh_pe
  powerpc/eeh: Sync eeh_pe_next(), eeh_pe_find() and early-out
traversals
  powerpc/eeh: Sync eeh_pe_get_parent()
  powerpc/eeh: Sync eeh_phb_pe_get()
  powerpc/eeh: Sync eeh_add_to_p

[PATCH RFC 13/15] powerpc/eeh: Sync pnv_eeh_ei_write()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index c56a796dd894..12367ed2083b 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -148,6 +148,7 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
 
/* Do error injection */
ret = eeh_ops->err_inject(pe, type, func, addr, mask);
+   eeh_put_pe(pe); /* Release ref */
return ret < 0 ? ret : count;
 }
 
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 02/15] powerpc/eeh: Rename eeh_pe_get() to eeh_pe_find()

2019-10-01 Thread Sam Bobroff
There are now functions eeh_get_pe() and eeh_pe_get(), which seem
likely to cause confusion.

Keep eeh_get_pe() because "get" is commonly used to refer to acquiring
a reference (which it does), and rename eeh_pe_get() to eeh_pe_find()
because it performs a search.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/include/asm/eeh.h   |  2 +-
 arch/powerpc/kernel/eeh.c|  2 +-
 arch/powerpc/kernel/eeh_pe.c | 12 ++--
 arch/powerpc/platforms/powernv/eeh-powernv.c |  4 ++--
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 54ba958760f2..e6f461d07426 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -288,7 +288,7 @@ int eeh_phb_pe_create(struct pci_controller *phb);
 int eeh_wait_state(struct eeh_pe *pe, int max_wait);
 struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
 struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, struct eeh_pe *root);
-struct eeh_pe *eeh_pe_get(struct pci_controller *phb,
+struct eeh_pe *eeh_pe_find(struct pci_controller *phb,
  int pe_no, int config_addr);
 int eeh_add_to_parent_pe(struct eeh_dev *edev);
 int eeh_rmv_from_parent_pe(struct eeh_dev *edev);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index d613e7ea7794..7e36ad0cfa2b 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1940,7 +1940,7 @@ static ssize_t eeh_force_recover_write(struct file *filp,
return -ENODEV;
 
/* Retrieve PE */
-   pe = eeh_pe_get(hose, pe_no, 0);
+   pe = eeh_pe_find(hose, pe_no, 0);
if (!pe)
return -ENODEV;
 
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d455df7b4928..fda52d1989c3 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -315,7 +315,7 @@ void eeh_pe_dev_traverse(struct eeh_pe *root,
 }
 
 /**
- * __eeh_pe_get - Check the PE address
+ * __eeh_pe_find - Check the PE address
  * @data: EEH PE
  * @flag: EEH device
  *
@@ -329,7 +329,7 @@ struct eeh_pe_get_flag {
int config_addr;
 };
 
-static void *__eeh_pe_get(struct eeh_pe *pe, void *flag)
+static void *__eeh_pe_find(struct eeh_pe *pe, void *flag)
 {
struct eeh_pe_get_flag *tmp = (struct eeh_pe_get_flag *) flag;
 
@@ -371,14 +371,14 @@ static void *__eeh_pe_get(struct eeh_pe *pe, void *flag)
  * which is composed of PCI bus/device/function number, or unified
  * PE address.
  */
-struct eeh_pe *eeh_pe_get(struct pci_controller *phb,
-   int pe_no, int config_addr)
+struct eeh_pe *eeh_pe_find(struct pci_controller *phb,
+  int pe_no, int config_addr)
 {
struct eeh_pe *root = eeh_phb_pe_get(phb);
struct eeh_pe_get_flag tmp = { pe_no, config_addr };
struct eeh_pe *pe;
 
-   pe = eeh_pe_traverse(root, __eeh_pe_get, &tmp);
+   pe = eeh_pe_traverse(root, __eeh_pe_find, &tmp);
 
return pe;
 }
@@ -447,7 +447,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
 * PE should be composed of PCI bus and its subordinate
 * components.
 */
-   pe = eeh_pe_get(pdn->phb, edev->pe_config_addr, config_addr);
+   pe = eeh_pe_find(pdn->phb, edev->pe_config_addr, config_addr);
if (pe) {
if (pe->type & EEH_PE_INVALID) {
list_add_tail(&edev->entry, &pe->edevs);
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 25d1712b519d..e477e0b70968 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -142,7 +142,7 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
return -EINVAL;
 
/* Retrieve PE */
-   pe = eeh_pe_get(hose, pe_no, 0);
+   pe = eeh_pe_find(hose, pe_no, 0);
if (!pe)
return -ENODEV;
 
@@ -1425,7 +1425,7 @@ static int pnv_eeh_get_pe(struct pci_controller *hose,
}
 
/* Find the PE according to PE# */
-   dev_pe = eeh_pe_get(hose, pe_no, 0);
+   dev_pe = eeh_pe_find(hose, pe_no, 0);
if (!dev_pe)
return -EEXIST;
 
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 03/15] powerpc/eeh: Track orphaned struct eeh_pe

2019-10-01 Thread Sam Bobroff
To facilitate debugging of (the inevitable) refcounting problems with
struct eeh_pe, detect when a struct eeh_pe has been removed from its
global PHB list but not yet freed ("orphaned"), and collect these
PEs in a list.

They should only remain in the list briefly during processing, so any
PEs that stay longer will be the result of bugs.

The list can be examined by reading from the "eeh_pe_debug" file in
debugfs.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/include/asm/eeh.h |  4 +++
 arch/powerpc/kernel/eeh.c  | 21 ++
 arch/powerpc/kernel/eeh_pe.c   | 53 +-
 3 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index e6f461d07426..df843d04268d 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -66,6 +67,7 @@ struct pci_dn;
 #define EEH_PE_RECOVERING  (1 << 1)/* Recovering PE*/
 #define EEH_PE_CFG_BLOCKED (1 << 2)/* Block config access  */
 #define EEH_PE_RESET   (1 << 3)/* PE reset in progress */
+#define EEH_PE_ORPHAN  (1 << 4)/* Removal pending  */
 
 #define EEH_PE_KEEP(1 << 8)/* Keep PE on hotplug   */
 #define EEH_PE_CFG_RESTRICTED  (1 << 9)/* Block config on error */
@@ -103,6 +105,7 @@ struct eeh_pe {
unsigned long stack_trace[64];
int trace_entries;
 #endif /* CONFIG_STACKTRACE */
+   struct list_head orphan_entry;  /* Memb. eeh_orphan_pes */
 };
 
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
@@ -280,6 +283,7 @@ typedef void (*eeh_edev_traverse_func)(struct eeh_dev *edev, void *flag);
 typedef void *(*eeh_pe_traverse_func)(struct eeh_pe *pe, void *flag);
 void eeh_lock_pes(unsigned long *flags);
 void eeh_unlock_pes(unsigned long flags);
+void eeh_pe_seq_show_orphans(struct seq_file *s);
 void eeh_set_pe_aux_size(int size);
 void eeh_get_pe(struct eeh_pe *pe);
 void eeh_put_pe(struct eeh_pe *pe);
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 7e36ad0cfa2b..7eb6ca1ab72b 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -2153,6 +2153,24 @@ static const struct file_operations eeh_dev_break_fops = {
.read   = eeh_debugfs_dev_usage,
 };
 
+static int eeh_pe_debug_seq_file(struct seq_file *s, void *unused)
+{
+   eeh_pe_seq_show_orphans(s);
+   return 0;
+}
+
+static int eeh_pe_debug_dbgfs_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, eeh_pe_debug_seq_file, inode->i_private);
+}
+
+static const struct file_operations eeh_pe_debug_fops = {
+   .owner  = THIS_MODULE,
+   .open   = eeh_pe_debug_dbgfs_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
 #endif
 
 static int __init eeh_init_proc(void)
@@ -2177,6 +2195,9 @@ static int __init eeh_init_proc(void)
debugfs_create_file_unsafe("eeh_force_recover", 0600,
powerpc_debugfs_root, NULL,
&eeh_force_recover_fops);
+   debugfs_create_file_unsafe("eeh_pe_debug", 0400,
+   powerpc_debugfs_root, NULL,
+   &eeh_pe_debug_fops);
eeh_cache_debugfs_init();
 #endif
}
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index fda52d1989c3..aa279474a928 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -21,6 +21,7 @@
 
 static int eeh_pe_aux_size = 0;
 static LIST_HEAD(eeh_phb_pe);
+static LIST_HEAD(eeh_orphan_pes);
 
 /*
  * Lock protecting the parent and child list fields of struct eeh_pe (this does
@@ -30,6 +31,7 @@ static LIST_HEAD(eeh_phb_pe);
 * via its parent or child fields will be freed.
  */
 static DEFINE_RAW_SPINLOCK(eeh_pe_lock);
+static DEFINE_RAW_SPINLOCK(eeh_orphan_lock);
 
 void eeh_lock_pes(unsigned long *flags)
 {
@@ -89,9 +91,24 @@ static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
return pe;
 }
 
+void eeh_pe_seq_show_orphans(struct seq_file *s)
+{
+   unsigned long flags;
+   struct eeh_pe *pe;
+
+   seq_puts(s, "Orphaned PEs awaiting free:\n");
+   raw_spin_lock_irqsave(&eeh_orphan_lock, flags);
+   list_for_each_entry(pe, &eeh_orphan_pes, orphan_entry)
+   seq_printf(s, "PHB#%x PE#%x Refcount:%d\n",
+  pe->phb->global_number, pe->addr,
+  kref_read(&pe->kref));
+   raw_spin_unlock_irqrestore(&eeh_orphan_lock, flags);
+}
+
 static void eeh_pe_free(struct kref *kref)
 {
struct eeh_pe *pe = container_of(kref, struct eeh_pe, kref);
+   unsigned long flags;
 
if (WARN_ONCE(pe->type & EEH_PE_PHB,
  "EEH: PHB#%x-PE#%x: Attempt to

[PATCH RFC 06/15] powerpc/eeh: Sync eeh_phb_pe_get()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh_pe.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 0486d3c6ff20..e89a30de2e7e 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -269,20 +269,27 @@ int eeh_wait_state(struct eeh_pe *pe, int max_wait)
  * The overall PEs form hierarchy tree. The first layer of the
  * hierarchy tree is composed of PHB PEs. The function is used
  * to retrieve the corresponding PHB PE according to the given PHB.
+ * Returns a referenced PE.
  */
 struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb)
 {
struct eeh_pe *pe;
+   unsigned long flags;
 
+   eeh_lock_pes(&flags);
list_for_each_entry(pe, &eeh_phb_pe, child) {
/*
 * Actually, we needn't check the type since
 * the PE for PHB has been determined when that
 * was created.
 */
-   if ((pe->type & EEH_PE_PHB) && pe->phb == phb)
-   return pe;
+   if ((pe->type & EEH_PE_PHB) && pe->phb == phb) {
+   eeh_get_pe(pe); /* Acquire ref */
+   eeh_unlock_pes(flags);
+   return pe; /* Give ref */
+   }
}
+   eeh_unlock_pes(flags);
 
return NULL;
 }
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 12/15] powerpc/eeh: Sync eeh_pe_get_state()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 171be70b34d8..cba16ca0694a 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1689,6 +1689,7 @@ int eeh_pe_get_state(struct eeh_pe *pe)
 {
int result, ret = 0;
bool rst_active, dma_en, mmio_en;
+   unsigned long flags;
 
/* Existing PE ? */
if (!pe)
@@ -1703,10 +1704,14 @@ int eeh_pe_get_state(struct eeh_pe *pe)
 * unavailable so that the error recovery on the guest is suspended
 * until the recovery completes on the host.
 */
+   eeh_lock_pes(&flags);
if (pe->parent &&
!(pe->state & EEH_PE_REMOVED) &&
-   (pe->parent->state & (EEH_PE_ISOLATED | EEH_PE_RECOVERING)))
+   (pe->parent->state & (EEH_PE_ISOLATED | EEH_PE_RECOVERING))) {
+   eeh_unlock_pes(flags);
return EEH_PE_STATE_UNAVAIL;
+   }
+   eeh_unlock_pes(flags);
 
result = eeh_ops->get_state(pe, NULL);
rst_active = !!(result & EEH_STATE_RESET_ACTIVE);
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 07/15] powerpc/eeh: Sync eeh_add_to_parent_pe() and eeh_rmv_from_parent_pe()

2019-10-01 Thread Sam Bobroff
Note that even though there is currently only one place where a PE can
be removed from the parent/child tree (eeh_rmv_from_parent_pe()), it
is still protected against concurrent removal in case that changes in
the future.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh_pe.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index e89a30de2e7e..c9780b7109ec 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -528,6 +528,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
struct eeh_pe *pe, *parent;
struct pci_dn *pdn = eeh_dev_to_pdn(edev);
int config_addr = (pdn->busno << 8) | (pdn->devfn);
+   unsigned long flags;
 
/* Check if the PE number is valid */
if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr) {
@@ -541,6 +542,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
 * PE should be composed of PCI bus and its subordinate
 * components.
 */
+   /* Acquire ref */
pe = eeh_pe_find(pdn->phb, edev->pe_config_addr, config_addr);
if (pe) {
if (pe->type & EEH_PE_INVALID) {
@@ -557,7 +559,6 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
parent->type &= ~EEH_PE_INVALID;
parent = parent->parent;
}
-
eeh_edev_dbg(edev,
 "Added to device PE (parent: PE#%x)\n",
 pe->parent->addr);
@@ -570,10 +571,12 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
list_add_tail(&edev->entry, &pe->edevs);
eeh_edev_dbg(edev, "Added to bus PE\n");
}
+   eeh_put_pe(pe); /* Release ref */
return 0;
}
 
/* Create a new EEH PE */
+   /* Acquire existence ref */
if (edev->physfn)
pe = eeh_pe_alloc(pdn->phb, EEH_PE_VF);
else
@@ -591,9 +594,9 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
 * to PHB directly. Otherwise, we have to associate the
 * PE with its parent.
 */
-   parent = eeh_pe_get_parent(edev);
+   parent = eeh_pe_get_parent(edev); /* Acquire ref */
if (!parent) {
-   parent = eeh_phb_pe_get(pdn->phb);
+   parent = eeh_phb_pe_get(pdn->phb); /* Acquire ref */
if (!parent) {
pr_err("%s: No PHB PE is found (PHB Domain=%d)\n",
__func__, pdn->phb->global_number);
@@ -602,6 +605,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
return -EEXIST;
}
}
+   eeh_lock_pes(&flags);
pe->parent = parent;
 
/*
@@ -609,10 +613,13 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
 * link the EEH device accordingly.
 */
list_add_tail(&pe->child, &parent->child_list);
+   eeh_unlock_pes(flags);
list_add_tail(&edev->entry, &pe->edevs);
edev->pe = pe;
eeh_edev_dbg(edev, "Added to device PE (parent: PE#%x)\n",
-pe->parent->addr);
+parent->addr);
+   eeh_put_pe(parent); /* Release ref */
+   /* Retain existence ref */
 
return 0;
 }
@@ -631,12 +638,15 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
struct eeh_pe *pe, *parent, *child;
bool keep, recover;
int cnt;
+   unsigned long flags;
 
+   /* TODO: Unsafe until eeh_dev can be synchronized with eeh_pe. */
pe = eeh_dev_to_pe(edev);
if (!pe) {
eeh_edev_dbg(edev, "No PE found for device.\n");
return -EEXIST;
}
+   eeh_get_pe(pe); /* Acquire ref */
 
/* Remove the EEH device */
edev->pe = NULL;
@@ -648,6 +658,7 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
 * delete the parent PE if it doesn't have associated
 * child PEs and EEH devices.
 */
+   eeh_lock_pes(&flags);
while (1) {
parent = pe->parent;
 
@@ -699,8 +710,15 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
}
}
 
+   if (!parent)
+   break; /* PE has been removed */
+
+   eeh_get_pe(parent); /* Acquire ref */
+   eeh_put_pe(pe); /* Release ref */
pe = parent;
}
+   eeh_unlock_pes(flags);
+   eeh_put_pe(pe); /* Release ref */
 
return 0;
 }
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 11/15] powerpc/eeh: Sync eeh_dev_check_failure()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index eb37cb384ff4..171be70b34d8 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -447,7 +447,7 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 int eeh_dev_check_failure(struct eeh_dev *edev)
 {
int ret;
-   unsigned long flags;
+   unsigned long flags, pe_flags;
struct device_node *dn;
struct pci_dev *dev;
struct eeh_pe *pe, *parent_pe;
@@ -464,7 +464,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
return 0;
}
dev = eeh_dev_to_pci_dev(edev);
+   /* TODO: Unsafe until eeh_dev can be synchronized with eeh_pe. */
pe = eeh_dev_to_pe(edev);
+   eeh_get_pe(pe);
 
/* Access to IO BARs might get this far and still not want checking. */
if (!pe) {
@@ -475,6 +477,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 
if (!pe->addr && !pe->config_addr) {
eeh_stats.no_cfg_addr++;
+   eeh_put_pe(pe); /* Release ref */
return 0;
}
 
@@ -482,17 +485,21 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 * On PowerNV platform, we might already have fenced PHB
 * there and we need take care of that firstly.
 */
-   ret = eeh_phb_check_failure(pe);
-   if (ret > 0)
+   ret = eeh_phb_check_failure(pe); /* Acquire ref */
+   if (ret > 0) {
+   eeh_put_pe(pe); /* Release ref */
return ret;
+   }
 
/*
 * If the PE isn't owned by us, we shouldn't check the
 * state. Instead, let the owner handle it if the PE has
 * been frozen.
 */
-   if (eeh_pe_passed(pe))
+   if (eeh_pe_passed(pe)) {
+   eeh_put_pe(pe); /* Release ref */
return 0;
+   }
 
/* If we already have a pending isolation event for this
 * slot, we know it's bad already, we don't need to check.
@@ -548,7 +555,10 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 * put into frozen state as well. We should take care
 * that at first.
 */
+   eeh_lock_pes(&pe_flags);
parent_pe = pe->parent;
+   eeh_get_pe(parent_pe); /* Acquire ref */
+   eeh_unlock_pes(pe_flags);
while (parent_pe) {
/* Hit the ceiling ? */
if (parent_pe->type & EEH_PE_PHB)
@@ -557,15 +567,18 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
/* Frozen parent PE ? */
ret = eeh_ops->get_state(parent_pe, NULL);
if (ret > 0 && !eeh_state_active(ret)) {
+   eeh_put_pe(pe); /* Release ref */
pe = parent_pe;
+   eeh_get_pe(pe); /* Acquire ref */
pr_err("EEH: Failure of PHB#%x-PE#%x will be handled at parent PHB#%x-PE#%x.\n",
   pe->phb->global_number, pe->addr,
   pe->phb->global_number, parent_pe->addr);
}
 
/* Next parent level */
-   parent_pe = parent_pe->parent;
+   eeh_pe_move_to_parent(&parent_pe);
}
+   eeh_put_pe(parent_pe); /* Release ref */
 
eeh_stats.slot_resets++;
 
@@ -582,11 +595,12 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 */
pr_debug("EEH: %s: Frozen PHB#%x-PE#%x detected\n",
__func__, pe->phb->global_number, pe->addr);
-   eeh_send_failure_event(pe);
+   eeh_send_failure_event(pe); /* Give ref */
 
return 1;
 
 dn_unlock:
+   eeh_put_pe(pe); /* Release ref */
eeh_serialize_unlock(flags);
return rc;
 }
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 08/15] powerpc/eeh: Sync eeh_handle_normal_event()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh_driver.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index b3245d0cfb22..c9d73070793e 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -879,6 +879,7 @@ static void eeh_clear_slot_attention(struct pci_dev *pdev)
  * & devices under this slot, and then finally restarting the device
  * drivers (which cause a second set of hotplug events to go out to
  * userspace).
+ * Consumes the reference on 'pe'.
  */
 void eeh_handle_normal_event(struct eeh_pe *pe)
 {
@@ -898,6 +899,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
if (!bus) {
pr_err("%s: Cannot find PCI bus for PHB#%x-PE#%x\n",
__func__, pe->phb->global_number, pe->addr);
+   eeh_put_pe(pe); /* Release ref */
return;
}
 
@@ -1141,6 +1143,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
pci_hp_remove_devices(bus);
pci_unlock_rescan_remove();
/* The passed PE should no longer be used */
+   eeh_put_pe(pe); /* Release ref */
return;
}
}
@@ -1160,6 +1163,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
eeh_pe_state_clear(pe, EEH_PE_RECOVERING, true);
pr_info("PE state after recovery:\n");
eeh_tree_state_dump_kprintf(pe);
+   eeh_put_pe(pe); /* Release ref */
 }
 
 /**
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 09/15] powerpc/eeh: Sync eeh_handle_special_event(), pnv_eeh_get_pe(), pnv_eeh_next_error()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh_driver.c | 15 +---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 38 
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index c9d73070793e..bc5d58bf3904 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -1184,6 +1184,7 @@ void eeh_handle_special_event(void)
 
 
do {
+   /* Acquire ref if rc == _FROZEN_PE, _FENCED_PHB or _DEAD_PHB */
rc = eeh_ops->next_error(&pe);
 
switch (rc) {
@@ -1195,10 +1196,11 @@ void eeh_handle_special_event(void)
eeh_remove_event(NULL, true);
 
list_for_each_entry(hose, &hose_list, list_node) {
-   phb_pe = eeh_phb_pe_get(hose);
+   phb_pe = eeh_phb_pe_get(hose); /* Acquire ref */
if (!phb_pe) continue;
 
eeh_pe_mark_isolated(phb_pe);
+   eeh_put_pe(phb_pe); /* Release ref */
}
 
eeh_serialize_unlock(flags);
@@ -1236,15 +1238,17 @@ void eeh_handle_special_event(void)
if (rc == EEH_NEXT_ERR_FROZEN_PE ||
rc == EEH_NEXT_ERR_FENCED_PHB) {
eeh_pe_state_mark(pe, EEH_PE_RECOVERING);
-   eeh_handle_normal_event(pe);
+   eeh_handle_normal_event(pe); /* Give ref */
} else {
pci_lock_rescan_remove();
list_for_each_entry(hose, &hose_list, list_node) {
-   phb_pe = eeh_phb_pe_get(hose);
+   phb_pe = eeh_phb_pe_get(hose); /* Acquire ref */
if (!phb_pe ||
!(phb_pe->state & EEH_PE_ISOLATED) ||
-   (phb_pe->state & EEH_PE_RECOVERING))
+   (phb_pe->state & EEH_PE_RECOVERING)) {
+   eeh_put_pe(phb_pe); /* Release ref */
continue;
+   }
 
eeh_for_each_pe(pe, tmp_pe)
eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev)
@@ -1263,11 +1267,14 @@ void eeh_handle_special_event(void)
   __func__,
   pe->phb->global_number,
   pe->addr);
+   eeh_put_pe(phb_pe); /* Release ref */
break;
}
pci_hp_remove_devices(bus);
+   eeh_put_pe(phb_pe); /* Release ref */
}
pci_unlock_rescan_remove();
+   eeh_put_pe(pe); /* Release ref */
}
 
/*
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index e477e0b70968..c56a796dd894 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1404,6 +1404,7 @@ static void pnv_eeh_get_and_dump_hub_diag(struct pci_controller *hose)
}
 }
 
+/* A return of 0 indicates that *pe is set, and referenced. */
 static int pnv_eeh_get_pe(struct pci_controller *hose,
  u16 pe_no, struct eeh_pe **pe)
 {
@@ -1431,6 +1432,7 @@ static int pnv_eeh_get_pe(struct pci_controller *hose,
 
/* Freeze the (compound) PE */
*pe = dev_pe;
+   eeh_get_pe(*pe); /* Acquire ref */
if (!(dev_pe->state & EEH_PE_ISOLATED))
phb->freeze_pe(phb, pe_no);
 
@@ -1439,23 +1441,26 @@ static int pnv_eeh_get_pe(struct pci_controller *hose,
 * have been frozen. However, we still need poke until
 * hitting the frozen PE on top level.
 */
-   dev_pe = dev_pe->parent;
+   eeh_pe_move_to_parent(&dev_pe);
while (dev_pe && !(dev_pe->type & EEH_PE_PHB)) {
int ret;
ret = eeh_ops->get_state(dev_pe, NULL);
if (ret <= 0 || eeh_state_active(ret)) {
-   dev_pe = dev_pe->parent;
+   eeh_pe_move_to_parent(&dev_pe);
continue;
}
 
/* Frozen parent PE */
+   eeh_put_pe(*pe); /* Release ref */
*pe = dev_pe;
+   eeh_get_pe(*pe); /* Acquire ref */
if (!(dev_pe->state & EEH_PE_ISOLATED))
phb->freeze_pe(phb, dev_pe->addr);
 
/* Next one */
-   de

[PATCH RFC 10/15] powerpc/eeh: Sync eeh_phb_check_failure()

2019-10-01 Thread Sam Bobroff
Synchronize access to eeh_pe.

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/kernel/eeh.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 7eb6ca1ab72b..eb37cb384ff4 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -394,7 +394,7 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
return -EPERM;
 
/* Find the PHB PE */
-   phb_pe = eeh_phb_pe_get(pe->phb);
+   phb_pe = eeh_phb_pe_get(pe->phb); /* Acquire ref */
if (!phb_pe) {
pr_warn("%s Can't find PE for PHB#%x\n",
__func__, pe->phb->global_number);
@@ -422,9 +422,10 @@ static int eeh_phb_check_failure(struct eeh_pe *pe)
 
pr_debug("EEH: PHB#%x failure detected, location: %s\n",
phb_pe->phb->global_number, eeh_pe_loc_get(phb_pe));
-   eeh_send_failure_event(phb_pe);
+   eeh_send_failure_event(phb_pe); /* Give ref */
return 1;
 out:
+   eeh_put_pe(phb_pe); /* Release ref */
eeh_serialize_unlock(flags);
return ret;
 }
-- 
2.22.0.216.g00a2a96fc9



[PATCH RFC 04/15] powerpc/eeh: Sync eeh_pe_next(), eeh_pe_find() and early-out traversals

2019-10-01 Thread Sam Bobroff
Reference counting for the in-line loop macro "eeh_for_each_pe()" can
be done by having eeh_pe_next() both get and put references; "rolling"
a reference along the list. This allows most loops to work without
change, although ones that use an "early-out" must manually release
the final reference.

While reference counting will keep the current iteration's PE from
being freed while it is in use, it's also necessary to check that it
hasn't been removed from the list that's being traversed.  This is
done by checking the parent pointer, which is set to NULL on removal
(see eeh_rmv_from_parent_pe()). PHB-type PEs never have their parent
set, but they aren't a problem: they can't be removed. If a removal is
detected, the traversal is terminated. This may leave the traversal
incomplete, but that is preferable to crashing.
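The rolling-reference pattern described above can be sketched in plain userspace C. Everything here is illustrative (struct node, node_next(), for_each_node(), a bare int in place of the kernel's kref, a singly linked list in place of the PE tree, and no locking); it only demonstrates how the iterator consumes the reference on the current element and hands back a referenced next element, so each loop iteration holds exactly one reference:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	int refcnt;
	struct node *next;
};

static int freed_count; /* counts frees, for demonstration only */

static void node_get(struct node *n)
{
	if (n)
		n->refcnt++;
}

static void node_put(struct node *n)
{
	if (n && --n->refcnt == 0) {
		freed_count++;
		free(n);
	}
}

/* Consumes the caller's reference on 'cur' and returns the next node
 * with a reference already held (or NULL at the end of the list). */
static struct node *node_next(struct node *cur)
{
	struct node *next = cur->next; /* read before dropping our ref */

	node_get(next);	/* acquire ref for the next iteration */
	node_put(cur);	/* release ref from the last iteration */
	return next;
}

#define for_each_node(head, n) \
	for ((n) = (head), node_get(n); (n); (n) = node_next(n))
```

As with eeh_for_each_pe(), a loop that exits early must release the reference on the current element by hand.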

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/include/asm/eeh.h   |  7 -
 arch/powerpc/kernel/eeh_driver.c |  4 ++-
 arch/powerpc/kernel/eeh_pe.c | 50 +---
 3 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index df843d04268d..3ab03d407eb1 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -111,8 +111,13 @@ struct eeh_pe {
 #define eeh_pe_for_each_dev(pe, edev, tmp) \
list_for_each_entry_safe(edev, tmp, &pe->edevs, entry)
 
+/*
+ * Note that eeh_pe_next() maintains a reference on 'pe' for each
+ * iteration and that it must be manually released if the loop is
+ * exited early (i.e. before eeh_pe_next() returns NULL).
+ */
 #define eeh_for_each_pe(root, pe) \
-   for (pe = root; pe; pe = eeh_pe_next(pe, root))
+   for (pe = root, eeh_get_pe(pe); pe; pe = eeh_pe_next(pe, root))
 
 static inline bool eeh_pe_passed(struct eeh_pe *pe)
 {
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 28e54fe3ac6c..b3245d0cfb22 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -590,8 +590,10 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *root, 
bool include_passed)
for (i = 0; i < 3; i++)
if (!eeh_unfreeze_pe(pe))
break;
-   if (i >= 3)
+   if (i >= 3) {
+   eeh_put_pe(pe); /* Release loop ref */
return -EIO;
+   }
}
}
eeh_pe_state_clear(root, EEH_PE_ISOLATED, include_passed);
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index aa279474a928..b89ed46f14e6 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -294,23 +294,44 @@ struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb)
  *
  * The function is used to retrieve the next PE in the
  * hierarchy PE tree.
+ *
+ * Consumes the ref on 'pe', returns a referenced PE (if not null).
  */
-struct eeh_pe *eeh_pe_next(struct eeh_pe *pe, struct eeh_pe *root)
+struct eeh_pe *eeh_pe_next(struct eeh_pe *prev_pe, struct eeh_pe *root)
 {
-   struct list_head *next = pe->child_list.next;
+   struct list_head *next;
+   struct eeh_pe *next_pe = NULL, *pe = prev_pe;
+   unsigned long flags;
 
+   eeh_lock_pes(&flags);
+   if (!(pe->type & EEH_PE_PHB) && !pe->parent) {
+   /* Current PE has been removed since the last iteration.
+* There's no way to recover so bail out. The traversal
+* may be incomplete.
+*/
+   eeh_unlock_pes(flags);
+   pr_warn("EEH: Warning: PHB#%x-PE%x: Traversal possibly incomplete.\n",
+   pe->phb->global_number, pe->addr);
+   eeh_put_pe(pe); /* Release ref from last iter */
+   return NULL;
+   }
+   next = pe->child_list.next;
if (next == &pe->child_list) {
while (1) {
if (pe == root)
-   return NULL;
+   goto out;
next = pe->child.next;
if (next != &pe->parent->child_list)
break;
pe = pe->parent;
}
}
-
-   return list_entry(next, struct eeh_pe, child);
+   next_pe = list_entry(next, struct eeh_pe, child);
+   eeh_get_pe(next_pe); /* Acquire ref for next iter */
+out:
+   eeh_unlock_pes(flags);
+   eeh_put_pe(prev_pe); /* Release ref from last iter */
+   return next_pe;
 }
 
 /**
@@ -332,7 +353,10 @@ void *eeh_pe_traverse(struct eeh_pe *root,
 
eeh_for_each_pe(root, pe) {
ret = fn(pe, flag);
-   if (ret) return ret;
+   if (ret) {
+   eeh_put_pe(pe); /* Early-out: release last ref */
+   return ret;
+   }

[PATCH RFC 01/15] powerpc/eeh: Introduce refcounting for struct eeh_pe

2019-10-01 Thread Sam Bobroff
Introduce infrastructure supporting refcounting for struct eeh_pe.

This will provide protection of the list members and struct memory so
that crashes related to accessing freed memory or poisoned list
pointers can be avoided (for example, when EEH is triggered during
device removal).

While this provides no additional synchronization of the other EEH
state, it seems to be an effective way of providing the necessary
safety with a very low risk of introducing deadlocks.
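The core of the kref idiom this patch applies can be modelled in a few lines of userspace C. This is a sketch of the idiom only (struct obj and the freed_flag bookkeeping are illustrative, not the kernel's kref implementation, and there is no concurrency here): a get that refuses to resurrect an object whose count already reached zero, and a put that frees on the final reference:

```c
#include <assert.h>
#include <stdlib.h>

struct obj {
	int refcnt;
	int *freed_flag; /* set when the object is released */
};

/* Returns 1 on success, 0 if the count already reached zero (the
 * object is being torn down and must not be resurrected). */
static int obj_get_unless_zero(struct obj *o)
{
	if (o->refcnt == 0)
		return 0;
	o->refcnt++;
	return 1;
}

static void obj_put(struct obj *o)
{
	if (--o->refcnt == 0) {
		if (o->freed_flag)
			*o->freed_flag = 1;
		free(o);
	}
}
```

The "get unless zero" check is what eeh_get_pe() warns about: taking a reference on a PE whose count has already dropped to zero would hand out a pointer to memory that is about to be freed.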

Signed-off-by: Sam Bobroff 
---
 arch/powerpc/include/asm/eeh.h |  7 
 arch/powerpc/kernel/eeh_pe.c   | 70 +-
 2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 5448b68ff260..54ba958760f2 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -72,6 +73,7 @@ struct pci_dn;
 #define EEH_PE_PRI_BUS (1 << 11)   /* Cached primary bus   */
 
 struct eeh_pe {
+   struct kref kref;   /* Reference count  */
int type;   /* PE type: PHB/Bus/Device  */
int state;  /* PE EEH dependent mode*/
int config_addr;/* Traditional PCI address  */
@@ -276,7 +278,12 @@ static inline bool eeh_state_active(int state)
 
 typedef void (*eeh_edev_traverse_func)(struct eeh_dev *edev, void *flag);
 typedef void *(*eeh_pe_traverse_func)(struct eeh_pe *pe, void *flag);
+void eeh_lock_pes(unsigned long *flags);
+void eeh_unlock_pes(unsigned long flags);
 void eeh_set_pe_aux_size(int size);
+void eeh_get_pe(struct eeh_pe *pe);
+void eeh_put_pe(struct eeh_pe *pe);
+void eeh_pe_move_to_parent(struct eeh_pe **pe);
 int eeh_phb_pe_create(struct pci_controller *phb);
 int eeh_wait_state(struct eeh_pe *pe, int max_wait);
 struct eeh_pe *eeh_phb_pe_get(struct pci_controller *phb);
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 177852e39a25..d455df7b4928 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -22,6 +22,25 @@
 static int eeh_pe_aux_size = 0;
 static LIST_HEAD(eeh_phb_pe);
 
+/*
+ * Lock protecting the parent and child list fields of struct eeh_pe (this does
+ * not include the edev list).  The lock does not, in general, prevent struct
+ * eeh_pe objects from being freed, but it does provide that when held and a
+ * reference is held on a PE, that neither that PE or any other PE reachable
+ * via it's parent or child fields will be freed.
+ */
+static DEFINE_RAW_SPINLOCK(eeh_pe_lock);
+
+void eeh_lock_pes(unsigned long *flags)
+{
+   raw_spin_lock_irqsave(&eeh_pe_lock, *flags);
+}
+
+void eeh_unlock_pes(unsigned long flags)
+{
+   raw_spin_unlock_irqrestore(&eeh_pe_lock, flags);
+}
+
 /**
  * eeh_set_pe_aux_size - Set PE auxillary data size
  * @size: PE auxillary data size
@@ -59,6 +78,7 @@ static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
if (!pe) return NULL;
 
/* Initialize PHB PE */
+   kref_init(&pe->kref); /* Acquire existence ref */
pe->type = type;
pe->phb = phb;
INIT_LIST_HEAD(&pe->child_list);
@@ -69,6 +89,51 @@ static struct eeh_pe *eeh_pe_alloc(struct pci_controller *phb, int type)
return pe;
 }
 
+static void eeh_pe_free(struct kref *kref)
+{
+   struct eeh_pe *pe = container_of(kref, struct eeh_pe, kref);
+
+   if (WARN_ONCE(pe->type & EEH_PE_PHB,
+ "EEH: PHB#%x-PE#%x: Attempt to free PHB PE!\n",
+ pe->phb->global_number, pe->addr))
+   return;
+
+   if (WARN_ONCE(pe->parent,
+ "EEH: PHB#%x-PE#%x: Attempt to free in-use PE!\n",
+ pe->phb->global_number, pe->addr))
+   return;
+
+   kfree(pe);
+}
+
+void eeh_get_pe(struct eeh_pe *pe)
+{
+   if (!pe)
+   return;
+   WARN_ONCE(!kref_get_unless_zero(&pe->kref),
+ "EEH: PHB#%x-PE#%x: Attempt to get when refcount 0!\n",
+ pe->phb->global_number, pe->addr);
+}
+
+void eeh_put_pe(struct eeh_pe *pe)
+{
+   if (!pe)
+   return;
+   kref_put(&pe->kref, eeh_pe_free);
+}
+
+void eeh_pe_move_to_parent(struct eeh_pe **pe)
+{
+   unsigned long flags;
+   struct eeh_pe *old_pe = *pe;
+
+   eeh_lock_pes(&flags);
+   *pe = (*pe)->parent;
+   eeh_get_pe(*pe);
+   eeh_put_pe(old_pe);
+   eeh_unlock_pes(flags);
+}
+
 /**
  * eeh_phb_pe_create - Create PHB PE
  * @phb: PCI controller
@@ -439,7 +504,7 @@ int eeh_add_to_parent_pe(struct eeh_dev *edev)
pr_err("%s: No PHB PE is found (PHB Domain=%d)\n",
__func__, pdn->phb->global_number);
edev->pe = NULL;
-   kfree(pe);
+   eeh_put_pe(pe); /* Release exi

Re: [RFC PATCH 00/27] current interrupt series plus scv syscall

2019-10-01 Thread Nicholas Piggin
Michal Suchánek wrote on September 24, 2019 7:33 pm:
> Hello,
> 
> can you mark the individual patches with RFC rather than the wole
> series?

Hey, thanks for the reviews. I'll resend all but the last two patches
soon for merge in the next window.

scv needs a bit more work but hopefully it will be mergeable as well
but I will submit that afterward.

Thanks,
Nick


Re: [RFC PATCH 23/27] powerpc/64: system call implement the bulk of the logic in C

2019-10-01 Thread Nicholas Piggin
Michal Suchánek wrote on September 23, 2019 10:47 pm:
> On Sun, Sep 15, 2019 at 11:28:09AM +1000, Nicholas Piggin wrote:
>> System call entry and particularly exit code is beyond the limit of what
>> is reasonable to implement in asm.
>> 
>> This conversion moves all conditional branches out of the asm code,
>> except for the case that all GPRs should be restored at exit.
>> 
>> Null syscall test is about 5% faster after this patch, because the exit
>> work is handled under local_irq_disable, and the hard mask and pending
>> interrupt replay is handled after that, which avoids games with MSR.
>> 
>> Signed-off-by: Nicholas Piggin 
>> 
>> v3:
>> - Fix !KUAP build [mpe]
>> - Fix BookE build/boot [mpe]
>> - Don't trace irqs with MSR[RI]=0
>> - Don't allow syscall_exit_prepare to be ftraced, because function
>>   graph tracing which traces exits barfs after the IRQ state is
>>   prepared for kernel exit.
>> - Fix BE syscall table to use normal function descriptors now that they
>>   are called from C.
>> - Comment syscall_exit_prepare.
> ...
>> -#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) && defined(CONFIG_PPC_SPLPAR)
>> -BEGIN_FW_FTR_SECTION
>> -/* see if there are any DTL entries to process */
>> -ld  r10,PACALPPACAPTR(r13)  /* get ptr to VPA */
>> -ld  r11,PACA_DTL_RIDX(r13)  /* get log read index */
>> -addir10,r10,LPPACA_DTLIDX
>> -LDX_BE  r10,0,r10   /* get log write index */
>> -cmpdr11,r10
>> -beq+33f
>> -bl  accumulate_stolen_time
>> -REST_GPR(0,r1)
>> -REST_4GPRS(3,r1)
>> -REST_2GPRS(7,r1)
>> -addir9,r1,STACK_FRAME_OVERHEAD
>> -33:
>> -END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>> -#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE && CONFIG_PPC_SPLPAR */
> ...
>> diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
>> new file mode 100644
>> index ..1d2529824588
>> --- /dev/null
>> +++ b/arch/powerpc/kernel/syscall_64.c
>> @@ -0,0 +1,195 @@
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +extern void __noreturn tabort_syscall(void);
>> +
>> +typedef long (*syscall_fn)(long, long, long, long, long, long);
>> +
>> +long system_call_exception(long r3, long r4, long r5, long r6, long r7, long r8, unsigned long r0, struct pt_regs *regs)
>> +{
>> +unsigned long ti_flags;
>> +syscall_fn f;
>> +
>> +BUG_ON(!(regs->msr & MSR_PR));
>> +
>> +if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
>> +unlikely(regs->msr & MSR_TS_T))
>> +tabort_syscall();
>> +
>> +account_cpu_user_entry();
>> +
>> +#ifdef CONFIG_PPC_SPLPAR
>> +if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) &&
>> +firmware_has_feature(FW_FEATURE_SPLPAR)) {
>> +struct lppaca *lp = get_lppaca();
>> +
>> +if (unlikely(local_paca->dtl_ridx != lp->dtl_idx))
> 
> sparse complains about this, and in time.c this it converted like this:
> if (unlikely(local_paca->dtl_ridx != be64_to_cpu(lp->dtl_idx)))
> The removed code has a LDX_BE there so there should be some conversion
> involved, right?

Ah yes, thanks good catch.

Thanks,
Nick


Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-01 Thread Daniel Axtens
Hi,

>>  /*
>>   * Find a place in the tree where VA potentially will be
>>   * inserted, unless it is merged with its sibling/siblings.
>> @@ -741,6 +752,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
>>  if (sibling->va_end == va->va_start) {
>>  sibling->va_end = va->va_end;
>>  
>> +kasan_release_vmalloc(orig_start, orig_end,
>> +  sibling->va_start,
>> +  sibling->va_end);
>> +
> The same.

The call to kasan_release_vmalloc() is a static inline no-op if
CONFIG_KASAN_VMALLOC is not defined, which I thought was the preferred
way to do things rather than sprinkling the code with ifdefs?

The compiler should be smart enough to eliminate all the
orig_start/orig_end stuff at compile time because it can see that it's
not used, so there's no cost in the binary.
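A minimal sketch of that pattern, with an illustrative config symbol and function name (not the real kasan API): when the option is off, the stub is a static inline no-op, so callers need no #ifdef and the arguments become dead code the optimizer can drop:

```c
#include <assert.h>

#ifdef CONFIG_KASAN_VMALLOC_DEMO
/* real implementation provided elsewhere when the option is on */
int kasan_release_demo(unsigned long start, unsigned long end);
#else
static inline int kasan_release_demo(unsigned long start, unsigned long end)
{
	(void)start;	/* arguments become dead code the compiler can drop */
	(void)end;
	return 0;	/* no-op when the feature is compiled out */
}
#endif
```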

Regards,
Daniel


Re: [PATCH v6 3/9] powerpc: add support to initialize ima policy rules

2019-10-01 Thread Thiago Jung Bauermann


Hi Nayna,

Nayna  writes:

> On 09/30/2019 09:04 PM, Thiago Jung Bauermann wrote:
>>> diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
>>> new file mode 100644
>>> index ..39401b67f19e
>>> --- /dev/null
>>> +++ b/arch/powerpc/kernel/ima_arch.c
>>> @@ -0,0 +1,33 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Copyright (C) 2019 IBM Corporation
>>> + * Author: Nayna Jain
>>> + */
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +bool arch_ima_get_secureboot(void)
>>> +{
>>> +   return is_powerpc_os_secureboot_enabled();
>>> +}
>>> +
>>> +/* Defines IMA appraise rules for secureboot */
>>> +static const char *const arch_rules[] = {
>>> +   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
>>> +#if !IS_ENABLED(CONFIG_MODULE_SIG)
>>> +   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
>>> +#endif
>>> +   NULL
>>> +};
>>> +
>>> +/*
>>> + * Returns the relevant IMA arch policies based on the system secureboot state.
>>> + */
>>> +const char *const *arch_get_ima_policy(void)
>>> +{
>>> +   if (is_powerpc_os_secureboot_enabled())
>>> +   return arch_rules;
>>> +
>>> +   return NULL;
>>> +}
>> If CONFIG_MODULE_SIG is enabled but module signatures aren't enforced,
>> then IMA won't enforce module signature either. x86's
>> arch_get_ima_policy() calls set_module_sig_enforced(). Doesn't the
>> powerpc version need to do that as well?
>>
>> On the flip side, if module signatures are enforced by the module
>> subsystem then IMA will verify the signature a second time since there's
>> no sharing of signature verification results between the module
>> subsystem and IMA (this was observed by Mimi).
>>
>> IMHO this is a minor issue, since module loading isn't a hot path and
>> the duplicate work shouldn't impact anything. But it could be avoided by
>> having a NULL entry in arch_rules, which arch_get_ima_policy() would
>> dynamically update with the "appraise func=MODULE_CHECK" rule if
>> is_module_sig_enforced() is true.
>
> Thanks Thiago for reviewing.  I am wondering whether this will give two meanings
> for NULL.

What are the two meanings? My understanding is that it only means "end
of array". The additional NULL just allows arch_get_ima_policy() to
dynamically append one item to the array.
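The spare-slot idea can be sketched as follows. This is a hypothetical userspace model, not the kernel's IMA code: the rule strings and the helper name are placeholders, and the enforcement check stands in for is_module_sig_enforced():

```c
#include <assert.h>
#include <stddef.h>

static const char *arch_rules[] = {
	"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
	NULL,	/* spare slot, optionally filled at runtime */
	NULL	/* terminator: NULL always means "end of array" */
};

/* Fills the spare slot when module signatures are enforced, so the
 * consumer still just walks the array until the first NULL. */
static const char **arch_get_rules(int module_sig_enforced)
{
	if (module_sig_enforced)
		arch_rules[1] =
			"appraise func=MODULE_CHECK appraise_type=imasig|modsig";
	return arch_rules;
}
```

Because the consumer stops at the first NULL, the unfilled spare slot and the terminator are indistinguishable, which is why NULL keeps a single meaning.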

But I hadn't thought of your other alternatives. They should work just
as well. Among those, I think option 1 is cleaner.

This addresses the second issue I mentioned, but not the first.

Also, one other thing I just noticed is that x86's arch policy has
measure rules but powerpc's policy doesn't. What is different in our
case?

> Can we do something like below, there are possibly two options ?
>
> 1. Set IMA_APPRAISED in the iint->flags if is_module_sig_enforced().
>
> OR
>
> 2. Let ima_get_action() check for is_module_sig_enforced() when policy is
> appraise and func is MODULE_CHECK.
>
> Thanks & Regards,
>- Nayna


--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH v2 00/21] Refine memblock API

2019-10-01 Thread Adam Ford
On Sun, Sep 29, 2019 at 8:33 AM Adam Ford  wrote:
>
> I am attaching two logs.  I know the mailing lists will be unhappy, but
> I don't want to spam a bunch of logs through the mailing list.
> The two logs show the differences between the working and non-working
> imx6q 3D accelerator when trying to run a simple glmark2-es2-drm demo.
>
> The only change between them is the 2 line code change you suggested.
>
> In both cases, I have cma=128M set in my bootargs.  Historically this
> has been sufficient, but cma=256M has not made a difference.
>

Mike, any suggestions on how to move forward?
I was hoping to get the fixes tested and pushed before 5.4 is released
if at all possible

> adam
>
> On Sat, Sep 28, 2019 at 2:33 AM Mike Rapoport  wrote:
> >
> > On Thu, Sep 26, 2019 at 02:35:53PM -0500, Adam Ford wrote:
> > > On Thu, Sep 26, 2019 at 11:04 AM Mike Rapoport  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Sep 26, 2019 at 08:09:52AM -0500, Adam Ford wrote:
> > > > > On Wed, Sep 25, 2019 at 10:17 AM Fabio Estevam  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Sep 25, 2019 at 9:17 AM Adam Ford  
> > > > > > wrote:
> > > > > >
> > > > > > > I tried cma=256M and noticed the cma dump at the beginning didn't
> > > > > > > change.  Do we need to setup a reserved-memory node like
> > > > > > > imx6ul-ccimx6ulsom.dtsi did?
> > > > > >
> > > > > > I don't think so.
> > > > > >
> > > > > > Were you able to identify what was the exact commit that caused 
> > > > > > such regression?
> > > > >
> > > > > I was able to narrow it down to 92d12f9544b7 ("memblock: refactor
> > > > > internal allocation functions") that caused the regression with
> > > > > Etnaviv.
> > > >
> > > >
> > > > Can you please test with this change:
> > > >
> > >
> > > That appears to have fixed my issue.  I am not sure what the impact
> > > is, but is this a safe option?
> >
> > It's not really a fix, I just wanted to see how exactly 92d12f9544b7 
> > ("memblock:
> > refactor internal allocation functions") broke your setup.
> >
> > Can you share the dts you are using and the full kernel log?
> >
> > > adam
> > >
> > > > diff --git a/mm/memblock.c b/mm/memblock.c
> > > > index 7d4f61a..1f5a0eb 100644
> > > > --- a/mm/memblock.c
> > > > +++ b/mm/memblock.c
> > > > @@ -1356,9 +1356,6 @@ static phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
> > > > align = SMP_CACHE_BYTES;
> > > > }
> > > >
> > > > -   if (end > memblock.current_limit)
> > > > -   end = memblock.current_limit;
> > > > -
> > > >  again:
> > > > found = memblock_find_in_range_node(size, align, start, end, 
> > > > nid,
> > > > flags);
> > > >
> > > > > I also noticed that if I create a reserved memory node as was done one
> > > > > imx6ul-ccimx6ulsom.dtsi the 3D seems to work again, but without it, I
> > > > > was getting errors regardless of the 'cma=256M' or not.
> > > > > I don't have a problem using the reserved memory, but I guess I am not
> > > > > sure what the amount should be.  I know for the video decoding 1080p,
> > > > > I have historically used cma=128M, but with the 3D also needing some
> > > > > memory allocation, is that enough or should I use 256M?
> > > > >
> > > > > adam
> > > >
> > > > --
> > > > Sincerely yours,
> > > > Mike.
> > > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >


Re: [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()

2019-10-01 Thread kbuild test robot
Hi David,

I love your patch! Perhaps something to improve:

[auto build test WARNING on mmotm/master]

url:
https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-Shrink-zones-before-removing-memory/20191002-054310
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-b002-201939 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:5:0,
from arch/x86/include/asm/bug.h:83,
from include/linux/bug.h:5,
from include/linux/mmdebug.h:5,
from include/linux/mm.h:9,
from mm/memory_hotplug.c:9:
   mm/memory_hotplug.c: In function '__remove_zone':
   mm/memory_hotplug.c:471:24: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
 if (zone_idx(zone) == ZONE_DEVICE)
   ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
 #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
   ^~~~
>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
 if (zone_idx(zone) == ZONE_DEVICE)
 ^~
   mm/memory_hotplug.c:471:24: note: each undeclared identifier is reported only once for each function it appears in
 if (zone_idx(zone) == ZONE_DEVICE)
   ^
   include/linux/compiler.h:58:52: note: in definition of macro '__trace_if_var'
 #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
   ^~~~
>> mm/memory_hotplug.c:471:2: note: in expansion of macro 'if'
 if (zone_idx(zone) == ZONE_DEVICE)
 ^~

vim +/if +471 mm/memory_hotplug.c

   459  
   460  static void __remove_zone(struct zone *zone, unsigned long start_pfn,
   461  unsigned long nr_pages)
   462  {
   463  struct pglist_data *pgdat = zone->zone_pgdat;
   464  unsigned long flags;
   465  
   466  /*
   467   * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
   468   * we will not try to shrink the zones - which is okay as
   469   * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
   470   */
 > 471  if (zone_idx(zone) == ZONE_DEVICE)
   472  return;
   473  
   474  pgdat_resize_lock(zone->zone_pgdat, &flags);
   475  shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   476  update_pgdat_span(pgdat);
   477  pgdat_resize_unlock(zone->zone_pgdat, &flags);
   478  }
   479  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()

2019-10-01 Thread kbuild test robot
Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on mmotm/master]

url:
https://github.com/0day-ci/linux/commits/David-Hildenbrand/mm-memory_hotplug-Shrink-zones-before-removing-memory/20191002-054310
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-g004-201939 (attached as .config)
compiler: gcc-7 (Debian 7.4.0-13) 7.4.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   mm/memory_hotplug.c: In function '__remove_zone':
>> mm/memory_hotplug.c:471:24: error: 'ZONE_DEVICE' undeclared (first use in this function); did you mean 'ZONE_MOVABLE'?
 if (zone_idx(zone) == ZONE_DEVICE)
   ^~~
   ZONE_MOVABLE
   mm/memory_hotplug.c:471:24: note: each undeclared identifier is reported only once for each function it appears in

vim +471 mm/memory_hotplug.c

   459  
   460  static void __remove_zone(struct zone *zone, unsigned long start_pfn,
   461  unsigned long nr_pages)
   462  {
   463  struct pglist_data *pgdat = zone->zone_pgdat;
   464  unsigned long flags;
   465  
   466  /*
   467   * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
   468   * we will not try to shrink the zones - which is okay as
   469   * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
   470   */
 > 471  if (zone_idx(zone) == ZONE_DEVICE)
   472  return;
   473  
   474  pgdat_resize_lock(zone->zone_pgdat, &flags);
   475  shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
   476  update_pgdat_span(pgdat);
   477  pgdat_resize_unlock(zone->zone_pgdat, &flags);
   478  }
   479  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH v4 4/4] powerpc: load firmware trusted keys/hashes into kernel keyring

2019-10-01 Thread Nayna Jain
The keys used to verify the Host OS kernel are managed by firmware as
secure variables. This patch loads the verification keys into the .platform
keyring and the revocation hashes into the .blacklist keyring. This enables
verification and loading of kernels signed by the boot-time keys that are
trusted by firmware.

Signed-off-by: Nayna Jain 
Reviewed-by: Mimi Zohar 
---
 security/integrity/Kconfig|  8 ++
 security/integrity/Makefile   |  3 +
 .../integrity/platform_certs/load_powerpc.c   | 86 +++
 3 files changed, 97 insertions(+)
 create mode 100644 security/integrity/platform_certs/load_powerpc.c

diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
index 0bae6adb63a9..26abee23e4e3 100644
--- a/security/integrity/Kconfig
+++ b/security/integrity/Kconfig
@@ -72,6 +72,14 @@ config LOAD_IPL_KEYS
depends on S390
def_bool y
 
+config LOAD_PPC_KEYS
+   bool "Enable loading of platform and blacklisted keys for POWER"
+   depends on INTEGRITY_PLATFORM_KEYRING
+   depends on PPC_SECURE_BOOT
+   help
+ Enable loading of keys to the .platform keyring and blacklisted
+ hashes to the .blacklist keyring for powerpc based platforms.
+
 config INTEGRITY_AUDIT
bool "Enables integrity auditing support "
depends on AUDIT
diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 525bf1d6e0db..9eeb6b053de3 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
  platform_certs/load_uefi.o \
  platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
+integrity-$(CONFIG_LOAD_PPC_KEYS) += platform_certs/efi_parser.o \
+platform_certs/load_powerpc.o \
+platform_certs/keyring_handler.o
 $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
 
 subdir-$(CONFIG_IMA)   += ima
diff --git a/security/integrity/platform_certs/load_powerpc.c b/security/integrity/platform_certs/load_powerpc.c
new file mode 100644
index ..83d99cde5376
--- /dev/null
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ *
+ *  - loads keys and hashes stored and controlled by the firmware.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "keyring_handler.h"
+
+/*
+ * Get a certificate list blob from the named secure variable.
+ */
+static __init void *get_cert_list(u8 *key, unsigned long keylen, uint64_t *size)
+{
+   int rc;
+   void *db;
+
+   rc = secvar_ops->get(key, keylen, NULL, size);
+   if (rc) {
+   pr_err("Couldn't get size: %d\n", rc);
+   return NULL;
+   }
+
+   db = kmalloc(*size, GFP_KERNEL);
+   if (!db)
+   return NULL;
+
+   rc = secvar_ops->get(key, keylen, db, size);
+   if (rc) {
+   kfree(db);
+   pr_err("Error reading db var: %d\n", rc);
+   return NULL;
+   }
+
+   return db;
+}
+
+/*
+ * Load the certs contained in the keys databases into the platform trusted
+ * keyring and the blacklisted X.509 cert SHA256 hashes into the blacklist
+ * keyring.
+ */
+static int __init load_powerpc_certs(void)
+{
+   void *db = NULL, *dbx = NULL;
+   uint64_t dbsize = 0, dbxsize = 0;
+   int rc = 0;
+
+   if (!secvar_ops)
+   return -ENODEV;
+
+   /* Get db, and dbx.  They might not exist, so it isn't
+* an error if we can't get them.
+*/
+   db = get_cert_list("db", 3, &dbsize);
+   if (!db) {
+   pr_err("Couldn't get db list from firmware\n");
+   } else {
+   rc = parse_efi_signature_list("powerpc:db", db, dbsize,
+ get_handler_for_db);
+   if (rc)
+   pr_err("Couldn't parse db signatures: %d\n", rc);
+   kfree(db);
+   }
+
+   dbx = get_cert_list("dbx", 3,  &dbxsize);
+   if (!dbx) {
+   pr_info("Couldn't get dbx list from firmware\n");
+   } else {
+   rc = parse_efi_signature_list("powerpc:dbx", dbx, dbxsize,
+ get_handler_for_dbx);
+   if (rc)
+   pr_err("Couldn't parse dbx signatures: %d\n", rc);
+   kfree(dbx);
+   }
+
+   return rc;
+}
+late_initcall(load_powerpc_certs);
-- 
2.20.1



[PATCH v4 3/4] x86/efi: move common keyring handler functions to new file

2019-10-01 Thread Nayna Jain
The handlers that add keys to the .platform keyring and blacklisted
hashes to the .blacklist keyring are common to both the uefi and powerpc
mechanisms of loading the keys/hashes from the firmware.

This patch moves the common code from load_uefi.c to keyring_handler.c

Signed-off-by: Nayna Jain 
Acked-by: Mimi Zohar 
---
 security/integrity/Makefile   |  3 +-
 .../platform_certs/keyring_handler.c  | 80 +++
 .../platform_certs/keyring_handler.h  | 32 
 security/integrity/platform_certs/load_uefi.c | 67 +---
 4 files changed, 115 insertions(+), 67 deletions(-)
 create mode 100644 security/integrity/platform_certs/keyring_handler.c
 create mode 100644 security/integrity/platform_certs/keyring_handler.h

diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 19faace69644..525bf1d6e0db 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
@@ -11,7 +11,8 @@ integrity-$(CONFIG_INTEGRITY_SIGNATURE) += digsig.o
 integrity-$(CONFIG_INTEGRITY_ASYMMETRIC_KEYS) += digsig_asymmetric.o
integrity-$(CONFIG_INTEGRITY_PLATFORM_KEYRING) += platform_certs/platform_keyring.o
 integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
-   platform_certs/load_uefi.o
+ platform_certs/load_uefi.o \
+ platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
 $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
 
diff --git a/security/integrity/platform_certs/keyring_handler.c b/security/integrity/platform_certs/keyring_handler.c
new file mode 100644
index ..c5ba695c10e3
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../integrity.h"
+
+static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID;
+static efi_guid_t efi_cert_x509_sha256_guid __initdata =
+   EFI_CERT_X509_SHA256_GUID;
+static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID;
+
+/*
+ * Blacklist a hash.
+ */
+static __init void uefi_blacklist_hash(const char *source, const void *data,
+  size_t len, const char *type,
+  size_t type_len)
+{
+   char *hash, *p;
+
+   hash = kmalloc(type_len + len * 2 + 1, GFP_KERNEL);
+   if (!hash)
+   return;
+   p = memcpy(hash, type, type_len);
+   p += type_len;
+   bin2hex(p, data, len);
+   p += len * 2;
+   *p = 0;
+
+   mark_hash_blacklisted(hash);
+   kfree(hash);
+}
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+static __init void uefi_blacklist_x509_tbs(const char *source,
+  const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "tbs:", 4);
+}
+
+/*
+ * Blacklist the hash of an executable.
+ */
+static __init void uefi_blacklist_binary(const char *source,
+const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "bin:", 4);
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI db and MokListRT tables.
+ */
+__init efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_guid) == 0)
+   return add_to_platform_keyring;
+   return 0;
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI dbx and MokListXRT tables.
+ */
+__init efi_element_handler_t get_handler_for_dbx(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_sha256_guid) == 0)
+   return uefi_blacklist_x509_tbs;
+   if (efi_guidcmp(*sig_type, efi_cert_sha256_guid) == 0)
+   return uefi_blacklist_binary;
+   return 0;
+}
diff --git a/security/integrity/platform_certs/keyring_handler.h b/security/integrity/platform_certs/keyring_handler.h
new file mode 100644
index ..2462bfa08fe3
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef PLATFORM_CERTS_INTERNAL_H
+#define PLATFORM_CERTS_INTERNAL_H
+
+#include 
+
+void blacklist_hash(const char *source, const void *data,
+   size_t len, const char *type,
+   size_t type_len);
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+void blacklist_x509_tbs(const char *source, const void *data, size_t len);
+
+/*
+ * Blacklist the hash of an executable.
+ */
+void blacklist_binary(const char *source, const void *data, size_t len);
+
+/*
+ * Return the handler for particular signature list types found in the db.
+ */
+efi_element_handler_t get_handl

[PATCH v4 2/4] powerpc: expose secure variables to userspace via sysfs

2019-10-01 Thread Nayna Jain
PowerNV secure variables, which store the keys used for OS kernel
verification, are managed by the firmware. These secure variables need to
be accessible from userspace so that certificates can be added or deleted.

This patch adds the sysfs interface to expose secure variables for PowerNV
secure boot. Users can use this interface to manipulate the keys stored in
the secure variables.

Signed-off-by: Nayna Jain 
Reviewed-by: Greg Kroah-Hartman 
---
 Documentation/ABI/testing/sysfs-secvar |  37 +
 arch/powerpc/Kconfig   |  10 ++
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/secvar-sysfs.c | 198 +
 4 files changed, 246 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-secvar
 create mode 100644 arch/powerpc/kernel/secvar-sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-secvar b/Documentation/ABI/testing/sysfs-secvar
new file mode 100644
index ..815bd8ec4d5e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-secvar
@@ -0,0 +1,37 @@
+What:  /sys/firmware/secvar
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   This directory is created if the POWER firmware supports OS
+   secureboot, thereby secure variables. It exposes interface
+   for reading/writing the secure variables
+
+What:  /sys/firmware/secvar/vars
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   This directory lists all the secure variables that are supported
+   by the firmware.
+
+What:  /sys/firmware/secvar/vars/
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   Each secure variable is represented as a directory named as
+   . The variable name is unique and is in ASCII
+   representation. The data and size can be determined by reading
+   their respective attribute files.
+
+What:  /sys/firmware/secvar/vars//size
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   An integer representation of the size of the content of the
+   variable. In other words, it represents the size of the data.
+
+What:  /sys/firmware/secvar/vars//data
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   A read-only file containing the value of the variable
+
+What:  /sys/firmware/secvar/vars//update
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:   A write-only file that is used to submit the new value for the
+   variable.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index deb19ec6ba3d..89084e4e5054 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -946,6 +946,16 @@ config PPC_SECURE_BOOT
  to enable OS secure boot on systems that have firmware support for
  it. If in doubt say N.
 
+config SECVAR_SYSFS
+   tristate "Enable sysfs interface for POWER secure variables"
+   depends on PPC_SECURE_BOOT
+   depends on SYSFS
+   help
+ POWER secure variables are managed and controlled by firmware.
+ These variables are exposed to userspace via sysfs to enable
+ read/write operations on these variables. Say Y if you have
+ secure boot enabled and want to expose variables to userspace.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 3cf26427334f..116a3a5c0557 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -162,6 +162,7 @@ obj-y   += ucall.o
 endif
 
 obj-$(CONFIG_PPC_SECURE_BOOT)  += secure_boot.o ima_arch.o secvar-ops.o
+obj-$(CONFIG_SECVAR_SYSFS) += secvar-sysfs.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-sysfs.c b/arch/powerpc/kernel/secvar-sysfs.c
new file mode 100644
index ..87a7cea41523
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -0,0 +1,198 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 IBM Corporation 
+ *
+ * This code exposes secure variables to user via sysfs
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Since firmware checks the maximum allowed size, currently, it is default to
+ * 0. In future, it will be read from the device tree.
+ */
+#define VARIABLE_MAX_SIZE  0
+/* Approximate value */
+#define NAME_MAX_SIZE 1024
+
+static struct kobject *secvar_kobj;
+static struct kset *secvar_kset;
+
+static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
+char *buf)
+{
+   uint64_t dsize;
+   int rc;
+
+   rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
+   if (rc) {
+   pr_err("Error retrieving variable size %d\n", rc);
+   return rc;
+   }
+
+   rc = sprintf(buf, "%llu\n", dsize);
+
+   retur

[PATCH v4 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-10-01 Thread Nayna Jain
The X.509 certificates trusted by the platform and required to secure boot
the OS kernel are wrapped in secure variables, which are controlled by
OPAL.

This patch adds firmware/kernel interface to read and write OPAL secure
variables based on the unique key.

This support can be enabled using CONFIG_OPAL_SECVAR.

Signed-off-by: Claudio Carvalho 
Signed-off-by: Nayna Jain 
---
 arch/powerpc/include/asm/opal-api.h  |   5 +-
 arch/powerpc/include/asm/opal.h  |   8 ++
 arch/powerpc/include/asm/powernv.h   |   2 +
 arch/powerpc/include/asm/secvar.h|  35 +
 arch/powerpc/kernel/Makefile |   2 +-
 arch/powerpc/kernel/secvar-ops.c |  19 +++
 arch/powerpc/platforms/powernv/Kconfig   |   6 +
 arch/powerpc/platforms/powernv/Makefile  |   1 +
 arch/powerpc/platforms/powernv/opal-call.c   |   3 +
 arch/powerpc/platforms/powernv/opal-secvar.c | 137 +++
 arch/powerpc/platforms/powernv/opal.c|   5 +
 11 files changed, 221 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/secvar.h
 create mode 100644 arch/powerpc/kernel/secvar-ops.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 378e3997845a..c1f25a760eb1 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -211,7 +211,10 @@
 #define OPAL_MPIPL_UPDATE  173
 #define OPAL_MPIPL_REGISTER_TAG174
 #define OPAL_MPIPL_QUERY_TAG   175
-#define OPAL_LAST  175
+#define OPAL_SECVAR_GET176
+#define OPAL_SECVAR_GET_NEXT   177
+#define OPAL_SECVAR_ENQUEUE_UPDATE 178
+#define OPAL_LAST  178
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a0cf8fba4d12..03392dc3f5e2 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -298,6 +298,13 @@ int opal_sensor_group_clear(u32 group_hndl, int token);
 int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
 
+int opal_secvar_get(const char *key, uint64_t key_len, u8 *data,
+   uint64_t *data_size);
+int opal_secvar_get_next(const char *key, uint64_t *key_len,
+uint64_t key_buf_size);
+int opal_secvar_enqueue_update(const char *key, uint64_t key_len, u8 *data,
+  uint64_t data_size);
+
 s64 opal_mpipl_update(enum opal_mpipl_ops op, u64 src, u64 dest, u64 size);
 s64 opal_mpipl_register_tag(enum opal_mpipl_tags tag, u64 addr);
 s64 opal_mpipl_query_tag(enum opal_mpipl_tags tag, u64 *addr);
@@ -392,6 +399,7 @@ void opal_wake_poller(void);
 void opal_powercap_init(void);
 void opal_psr_init(void);
 void opal_sensor_groups_init(void);
+void opal_secvar_init(void);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/powerpc/include/asm/powernv.h b/arch/powerpc/include/asm/powernv.h
index e1a858718716..cff980a85dd2 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -12,10 +12,12 @@ extern void powernv_set_nmmu_ptcr(unsigned long ptcr);
 void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val);
 
 void pnv_tm_init(void);
+
 #else
 static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
 
 static inline void pnv_tm_init(void) { }
+
 #endif
 
 #endif /* _ASM_POWERNV_H */
diff --git a/arch/powerpc/include/asm/secvar.h b/arch/powerpc/include/asm/secvar.h
new file mode 100644
index ..4cc35b58b986
--- /dev/null
+++ b/arch/powerpc/include/asm/secvar.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ *
+ * PowerPC secure variable operations.
+ */
+#ifndef SECVAR_OPS_H
+#define SECVAR_OPS_H
+
+#include 
+#include 
+
+extern const struct secvar_operations *secvar_ops;
+
+struct secvar_operations {
+   int (*get)(const char *key, uint64_t key_len, u8 *data,
+  uint64_t *data_size);
+   int (*get_next)(const char *key, uint64_t *key_len,
+   uint64_t keybufsize);
+   int (*set)(const char *key, uint64_t key_len, u8 *data,
+  uint64_t data_size);
+};
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+
+extern void set_secvar_ops(const struct secvar_operations *ops);
+
+#else
+
+static inline void set_secvar_ops(const struct secvar_operations *ops) { }
+
+#endif
+
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index e8eb2955b7d5..3cf26427334f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -161,7 +161,7 @@ ifneq 

[PATCH v4 0/4] powerpc: expose secure variables to the kernel and userspace

2019-10-01 Thread Nayna Jain
In order to verify the OS kernel on PowerNV systems, secure boot requires
X.509 certificates trusted by the platform. These are stored in secure
variables controlled by OPAL, called OPAL secure variables. In order to
enable users to manage the keys, the secure variables need to be exposed
to userspace.

OPAL provides the runtime services for the kernel to be able to access the
secure variables[1]. This patchset defines the kernel interface for the
OPAL APIs. These APIs are used by the hooks, which load these variables
to the keyring and expose them to the userspace for reading/writing.

The previous version[2] of the patchset added support only for the sysfs
interface. This version adds two more patches that involve loading the
firmware trusted keys into the kernel keyring. This patchset depends on
the base config PPC_SECURE_BOOT added by the IMA arch-specific patches
for POWER[3].

Overall, this patchset adds the following support:

* expose secure variables to the kernel via OPAL Runtime API interface
* expose secure variables to the userspace via kernel sysfs interface
* load kernel verification and revocation keys to .platform and
.blacklist keyring respectively.

The secure variables can be read/written using simple linux utilities
cat/hexdump.

For example:
Path to the secure variables is:
/sys/firmware/secvar/vars

Each secure variable is listed as a directory.
$ ls -l
total 0
drwxr-xr-x. 2 root root 0 Aug 20 21:20 db
drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK
drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK

The attributes of each of the secure variables are(for example: PK):
[db]$ ls -l
total 0
-r--r--r--. 1 root root 0 Oct  1 15:10 data
-r--r--r--. 1 root root 65536 Oct  1 15:10 size
--w---. 1 root root 0 Oct  1 15:12 update

The "data" file is used to read the existing variable value using hexdump;
the data is stored in ESL format. The "update" file is used to write a new
value using cat; the update must be submitted as an AUTH file.
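A minimal shell sketch of that read/write flow, assuming the sysfs layout listed above (the SECVAR override and the helper names are purely illustrative):

```shell
# Base path of the secvar sysfs interface; overridable for testing.
SECVAR="${SECVAR:-/sys/firmware/secvar/vars}"

# Print the stored size of one variable, e.g. "secvar_size PK".
secvar_size() {
    cat "$SECVAR/$1/size"
}

# Dump a variable's ESL-format data as hex.
secvar_dump() {
    hexdump -C "$SECVAR/$1/data"
}

# Queue a signed (AUTH-format) update for a variable.
secvar_update() {
    cat "$2" > "$SECVAR/$1/update"
}
```

For example, `secvar_dump PK` reads the current PK, and `secvar_update PK new_pk.auth` (file name hypothetical) queues a new value.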

[1] Depends on skiboot OPAL API changes which removes metadata from
the API. https://lists.ozlabs.org/pipermail/skiboot/2019-September/015203.html.
[2] https://lkml.org/lkml/2019/6/13/1644
[3] https://lkml.org/lkml/2019/9/27/407

Changelog:
v4:
* rebased to v5.4-rc1 
* uses __BIN_ATTR_WO macro to create binary attribute as suggested by
  Greg
* removed email id from the file header
* renamed argument keysize to keybufsize in get_next() function
* updated default binary file sizes to 0, as firmware handles checking
against the maximum size
* fixed minor formatting issues in Patch 4/4
* added Greg's and Mimi's Reviewed-by and Ack-by

v3:
* includes Greg's feedbacks:
 * fixes in Patch 2/4
   * updates the Documentation.
   * fixes code feedbacks
* adds SYSFS Kconfig dependency for SECVAR_SYSFS
* fixes mixed tabs and spaces
* removes "name" attribute for each of the variable name based
directories
* fixes using __ATTR_RO() and __BIN_ATTR_RO() and statics and const
* fixes the racing issue by using kobj_type default groups. Also,
fixes the kobject leakage.
* removes extra print messages
  * updates patch description for Patch 3/4
  * removes file name from Patch 4/4 file header comment and removed
  def_bool y from the LOAD_PPC_KEYS Kconfig

* includes Oliver's feedbacks:
  * fixes Patch 1/2
   * moves OPAL API wrappers after opal_nx_proc_init(), fixed the
   naming, types and removed extern.
   * fixes spaces
   * renames get_variable() to get(), get_next_variable() to get_next()
   and set_variable() to set()
   * removed get_secvar_ops() and defined secvar_ops as global
   * fixes consts and statics
   * removes generic secvar_init() and defined platform specific
   opal_secar_init()
   * updates opal_secvar_supported() to check for secvar support even
   before checking the OPAL APIs support and also fixed the error codes.
   * addes function that converts OPAL return codes to linux errno
   * moves secvar check support in the opal_secvar_init() and defined its
   prototype in opal.h
  * fixes Patch 2/2
   * fixes static/const
   * defines macro for max name size
   * replaces OPAL error codes with linux errno and also updated error
   handling
   * moves secvar support check before creating sysfs kobjects in 
   secvar_sysfs_init()
   * fixes spaces  

v2:
* removes complete efi-sms from the sysfs implementation and is simplified
* includes Greg's and Oliver's feedbacks:
 * adds sysfs documentation
 * moves sysfs code to arch/powerpc
 * other code related feedbacks.
* adds two new patches to load keys to .platform and .blacklist keyring.
These patches are added to this series as they are also dependent on
OPAL APIs.

Nayna Jain (4):
  powerpc/powernv: Add OPAL API interface to access secure variable
  powerpc: expose secure variables to userspace via sysfs
  x86/efi: move common keyring handler functions to new file
  powerpc: load firmware trusted keys/hashes into kernel keyring

 Documentation/ABI/testing/sysfs-secvar|  37 
 ar

[PATCH] sysfs: Fixes __BIN_ATTR_WO() macro

2019-10-01 Thread Nayna Jain
This patch fixes the size and write parameter for the macro
__BIN_ATTR_WO().

Fixes: 7f905761e15a8 ("sysfs: add BIN_ATTR_WO() macro")
Signed-off-by: Nayna Jain 
---
 include/linux/sysfs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 5420817ed317..fa7ee503fb76 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -196,9 +196,9 @@ struct bin_attribute {
.size   = _size,\
 }
 
-#define __BIN_ATTR_WO(_name) { \
+#define __BIN_ATTR_WO(_name, _size) {  \
.attr   = { .name = __stringify(_name), .mode = 0200 }, \
-   .store  = _name##_store,\
+   .write  = _name##_write,\
.size   = _size,\
 }
 
-- 
2.20.1



Re: [RFC PATCH 0/9] Fixes and Enablement of ibm,drc-info property

2019-10-01 Thread Bjorn Helgaas
On Tue, Oct 01, 2019 at 01:12:05AM -0500, Tyrel Datwyler wrote:
> There was an initial previous effort yo add support for the PAPR
> architected ibm,drc-info property. This property provides a more
> memory compact representation of a paritions Dynamic Reconfig
> Connectors (DRC). These can otherwise be thought of the currently
> partitioned, or available, but yet to be partitioned, system resources
> such as cpus, memory, and physical/logical IOA devices.
> 
> The initial implementation proved buggy and was fully turned of by
> disabling the bit in the appropriate CAS support vector. We now have
> PowerVM firmware in the field that supports this new property, and 
> further to suppport partitions with 24TB+ or possible memory this
> property is required to perform platform migration.
> 
> This serious fixup the short comings of the previous implementation
> in the areas of general implementation, cpu hotplug, and IOA hotplug.
> 
> Tyrel Datwyler (9):
>   powerpc/pseries: add cpu DLPAR support for drc-info property
>   powerpc/pseries: fix bad drc_index_start value parsing of drc-info
> entry
>   powerpc/pseries: fix drc-info mappings of logical cpus to drc-index
>   PCI: rpaphp: fix up pointer to first drc-info entry
>   PCI: rpaphp: don't rely on firmware feature to imply drc-info support
>   PCI: rpaphp: add drc-info support for hotplug slot registration
>   PCI: rpaphp: annotate and correctly byte swap DRC properties
>   PCI: rpaphp: correctly match ibm,my-drc-index to drc-name when using
> drc-info
>   powerpc: Enable support for ibm,drc-info property
> 
>  arch/powerpc/kernel/prom_init.c |   2 +-
>  arch/powerpc/platforms/pseries/hotplug-cpu.c| 117 --
>  arch/powerpc/platforms/pseries/of_helpers.c |   8 +-
>  arch/powerpc/platforms/pseries/pseries_energy.c |  23 ++---
>  drivers/pci/hotplug/rpaphp_core.c   | 124 +---
>  5 files changed, 191 insertions(+), 83 deletions(-)

Michael, I assume you'll take care of this.  If I were applying, I
would capitalize the commit subjects and fix the typos in the commit
logs, e.g.,

  s/the this/the/
  s/the the/that the/
  s/short coming/shortcoming/
  s/seperate/separate/
  s/bid endian/big endian/
  s/were appropriate/where appropriate/
  s/name form/name from/

etc.  git am also complains about space before tab whitespace errors.
And it adds a few lines >80 chars.


Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-10-01 Thread Leonardo Bras
On Tue, 2019-10-01 at 12:04 -0700, John Hubbard wrote:
> On 10/1/19 10:56 AM, Leonardo Bras wrote:
> > On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
> > > On 9/27/19 4:40 PM, Leonardo Bras wrote:
> ...
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index 98f13ab37bac..7105c829cf44 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long 
> > > > start, unsigned long end)
> > > >  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > > >   struct page **pages)
> > > >  {
> > > > +   struct mm_struct *mm;
> > > 
> > > I don't think that this local variable adds any value, so let's not use 
> > > it.
> > > Similar point in a few other patches too.
> > 
> > It avoids 1 dereference of current->mm, it's a little performance gain.
> > 
> 
> No, it isn't. :) 
> 
> Longer answer: at this level (by which I mean, "wrote the C code, haven't 
> looked
> at the generated asm yet, and haven't done a direct perf test yet"), none of 
> us
> C programmers are entitled to imagine that we can second guess both the 
> compiler 
> and the CPU well enough to claim that  declaring a local pointer variable on 
> the
> stack will even *affect* performance, much less know which way it will go!
> 

I did this based on how costly 'current' can be, and I noticed a
reduction in assembly size most of the time (powerpc).
But I get what you mean; maybe the (possible) performance gain isn't
worth the extra work.

> The compiler at -O2 will *absolutely* optimize away any local variables that
> it doesn't need.
> 
> And that leads to how kernel programmers routinely decide about that kind of 
> variable: "does the variable's added clarity compensate for the extra visual 
> noise and for the need to manage the variable?"

That's a good way to decide it. :)

> 
> Here, and in most (all?) other points in the patchset where you've added an
> mm local variable, the answer is no.
> 

Well, IMHO it's cleaner that way. But I get that other people may
disagree. 

> 
> ...   start_lockless_pgtbl_walk(mm);
> > > Minor: I'd like to rename this register_lockless_pgtable_walker().
> > > 
> > > > local_irq_disable();
> > > > gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > > > local_irq_enable();
> > > > +   end_lockless_pgtbl_walk(mm);
> > > 
> > > ...and deregister_lockless_pgtable_walker().
> > > 
> > 
> > I have no problem changing the name, but I don't think register/deregister
> > are good terms for this. 
> > 
> > I would rather use start/finish, begin/end, and so on. Register sounds
> > like something more complicated than what we are trying to achieve
> > here. 
> > 
> 
> OK, well, I don't want to bikeshed on naming more than I usually do, and 
> what you have is reasonable, so I'll leave that alone. :)
> 
> thanks,

Thank for the feedback,





Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-10-01 Thread John Hubbard
On 10/1/19 10:56 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
>> On 9/27/19 4:40 PM, Leonardo Bras wrote:
...
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 98f13ab37bac..7105c829cf44 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, 
>>> unsigned long end)
>>>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>>   struct page **pages)
>>>  {
>>> +   struct mm_struct *mm;
>>
>> I don't think that this local variable adds any value, so let's not use it.
>> Similar point in a few other patches too.
> 
> It avoids 1 dereference of current->mm, it's a little performance gain.
> 

No, it isn't. :) 

Longer answer: at this level (by which I mean, "wrote the C code, haven't looked
at the generated asm yet, and haven't done a direct perf test yet"), none of us
C programmers are entitled to imagine that we can second guess both the 
compiler 
and the CPU well enough to claim that  declaring a local pointer variable on the
stack will even *affect* performance, much less know which way it will go!

The compiler at -O2 will *absolutely* optimize away any local variables that
it doesn't need.

And that leads to how kernel programmers routinely decide about that kind of 
variable: "does the variable's added clarity compensate for the extra visual 
noise and for the need to manage the variable?"

Here, and in most (all?) other points in the patchset where you've added an
mm local variable, the answer is no.


>>
... start_lockless_pgtbl_walk(mm);
>>
>> Minor: I'd like to rename this register_lockless_pgtable_walker().
>>
>>> local_irq_disable();
>>> gup_pgd_range(addr, end, gup_flags, pages, &nr);
>>> local_irq_enable();
>>> +   end_lockless_pgtbl_walk(mm);
>>
>> ...and deregister_lockless_pgtable_walker().
>>
> 
> I have no problem changing the name, but I don't think register/deregister
> are good terms for this. 
> 
> I would rather use start/finish, begin/end, and so on. Register sounds
> like something more complicated than what we are trying to achieve
> here. 
> 

OK, well, I don't want to bikeshed on naming more than I usually do, and 
what you have is reasonable, so I'll leave that alone. :)

thanks,
-- 
John Hubbard
NVIDIA



Re: [PATCH] sysfs: add BIN_ATTR_WO() macro

2019-10-01 Thread Nayna




On 10/01/2019 02:16 PM, Greg Kroah-Hartman wrote:

On Tue, Oct 01, 2019 at 02:08:53PM -0400, Nayna wrote:

Hi Greg,


On 08/26/2019 11:01 AM, Greg Kroah-Hartman wrote:

This variant was missing from sysfs.h, I guess no one noticed it before.

Turns out the powerpc secure variable code can use it, so add it to the
tree for it, and potentially others to take advantage of, instead of
open-coding it.

Reported-by: Nayna Jain 
Signed-off-by: Greg Kroah-Hartman 
---

I'll queue this up to my tree for 5.4-rc1, but if you want to take this
in your tree earlier, feel free to do so.

   include/linux/sysfs.h | 9 +
   1 file changed, 9 insertions(+)

diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 965236795750..5420817ed317 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -196,6 +196,12 @@ struct bin_attribute {
.size   = _size,\
   }
+#define __BIN_ATTR_WO(_name) { \
+   .attr   = { .name = __stringify(_name), .mode = 0200 }, \
+   .store  = _name##_store,\
+   .size   = _size,\
+}
+
   #define __BIN_ATTR_RW(_name, _size)  \
__BIN_ATTR(_name, 0644, _name##_read, _name##_write, _size)
@@ -208,6 +214,9 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR(_name, 
_mode, _read, \
   #define BIN_ATTR_RO(_name, _size)\
   struct bin_attribute bin_attr_##_name = __BIN_ATTR_RO(_name, _size)
+#define BIN_ATTR_WO(_name, _size)  \
+struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
+
   #define BIN_ATTR_RW(_name, _size)\
   struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)

I am sorry. I didn't notice it via inspection but there is a bug in this
macro. When I actually try using it, compilation fails. Here's a likely
patch:

diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 5420817ed317..fa7ee503fb76 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -196,9 +196,9 @@ struct bin_attribute {
 .size   = _size,\
  }
-#define __BIN_ATTR_WO(_name) { \
+#define __BIN_ATTR_WO(_name, _size) {  \
 .attr   = { .name = __stringify(_name), .mode = 0200 }, \
-   .store  = _name##_store,\
+   .write  = _name##_write,\
 .size   = _size,\
  }


Heh, good catch.  Can you send a real patch for this that I can apply to
give you the proper credit for finding and fixing this?


Sure.. Thanks Greg !!

Thanks & Regards,
  - Nayna


Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

2019-10-01 Thread John Hubbard
On 10/1/19 11:39 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
>> On 9/30/19 11:42 AM, Leonardo Bras wrote:
>>> On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
...
> 
> I am failing to understand the issue.
> I mean, smp_call_function_many() will issue an IPI to each CPU in
> CPUmask and wait for it to run before returning. 
> If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
> the IPI will not run on that CPU, and the wait part will make sure to
> lock the thread until the interrupts are enabled again. 
> 
> Could you please point the issue there?

The biggest problem here is evidently my not knowing much about ppc. :) 
So if that's how it behaves, then all is well, sorry it took me a while
to understand the MSR_EE=0 behavior.

> 
 Simply skipping that means that an additional mechanism is required...which
 btw might involve a new, ppc-specific routine, so maybe this is going to 
 end
 up pretty close to what I pasted in after all...
> Of course, if we really need that, we can add a bool parameter to the
> function to choose about disabling/enabling irqs.
>> * This is really a core mm function, so don't hide it away in arch 
>> layers.
>> (If you're changing mm/ files, that's a big hint.)
>
> My idea here is to let the arch decide on how this 'register' is going
> to work, as archs may have different needs (in powerpc for example, we
> can't always disable irqs, since we may be in realmode).
>>
>> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
>> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
>> that.
>>
>> So, I would like core mm/ functions that guide the way, but the interrupt
>> behavior complicates it. I think your original passing of just struct_mm
>> is probably the right balance, assuming that I'm wrong about interrupts.
>>
> 
> I think, for the generic function, that including {en,dis}abling the
> interrupt is fine. I mean, if disabling the interrupt is the generic
> behavior, it's ok. 
> I will just make sure to explain that the interrupt {en,dis}abling is
> part of the sync process. If an arch doesn't like it, it can write a
> specific function that does the sync in a better way. (and defining
> __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to ignore the generic function)
> 

Tentatively, that sounds good. We still end up with the counter variable
directly in struct mm_struct, and the generic function in mm/gup.c 
(or mm/somewhere), where it's easy to find and see what's going on. sure.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks

2019-10-01 Thread Leonardo Bras
On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
> On 9/30/19 11:42 AM, Leonardo Bras wrote:
> > On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
> > > > As I told before, there are cases where this function is called from
> > > > 'real mode' in powerpc, which doesn't disable irqs and may have a
> > > > tricky behavior if we do. So, encapsulate the irq disable in this
> > > > function can be a bad choice.
> > > 
> > > You still haven't explained how this works in that case. So far, the
> > > synchronization we've discussed has depended upon interrupt disabling
> > > as part of the solution, in order to hold off page splitting and page
> > > table freeing.
> > 
> > The irqs are already disabled by another mechanism (hw): MSR_EE=0.
> > So, serialize will work as expected.
> 
> I get that they're disabled. But will this interlock with the code that
> issues IPIs?? Because it's not just disabling interrupts that matters, but
> rather, synchronizing with the code (TLB flushing) that *happens* to 
> require issuing IPIs, which in turn interact with disabling interrupts.
> 
> So I'm still not seeing how that could work here, unless there is something
> interesting about the smp_call_function_many() on ppc with MSR_EE=0 mode...?
> 

I am failing to understand the issue.
I mean, smp_call_function_many() will issue an IPI to each CPU in the
cpumask and wait for it to run before returning.
If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
the IPI will not run on that CPU, and the wait part will block the
calling thread until interrupts are enabled again.

Could you please point out the issue there?

> > > Simply skipping that means that an additional mechanism is 
> > > required...which
> > > btw might involve a new, ppc-specific routine, so maybe this is going to 
> > > end
> > > up pretty close to what I pasted in after all...
> > > > Of course, if we really need that, we can add a bool parameter to the
> > > > function to choose about disabling/enabling irqs.
> > > > > * This is really a core mm function, so don't hide it away in arch 
> > > > > layers.
> > > > > (If you're changing mm/ files, that's a big hint.)
> > > > 
> > > > My idea here is to let the arch decide on how this 'register' is going
> > > > to work, as archs may have different needs (in powerpc for example, we
> > > > can't always disable irqs, since we may be in realmode).
> 
> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
> that.
> 
> So, I would like core mm/ functions that guide the way, but the interrupt
> behavior complicates it. I think your original passing of just struct_mm
> is probably the right balance, assuming that I'm wrong about interrupts.
> 

I think, for the generic function, that including {en,dis}abling the
interrupt is fine. I mean, if disabling the interrupt is the generic
behavior, it's ok. 
I will just make sure to explain that the interrupt {en,dis}abling is
part of the sync process. If an arch doesn't like it, it can write a
specific function that does the sync in a better way (and define
__HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to ignore the generic function).

In this case, the generic function would also include the ifdef'ed
atomic inc and the memory barrier. 

> 
> > > > Maybe we can create a generic function instead of a dummy, and let it
> > > > be replaced in case the arch needs to do so.
> > > 
> > > Yes, that might be what we need, if it turns out that ppc can't use this
> > > approach (although let's see about that).
> > > 
> > 
> > I initially used the dummy approach because I did not see anything like
> > serialize in other archs. 
> > 
> > I mean, even if I put some generic function here, if there is no
> > function to use the 'lockless_pgtbl_walk_count', it becomes only a
> > overhead.
> > 
> 
> Not really: the memory barrier is required in all cases, and this code
> would be good I think:
> 
> +void register_lockless_pgtable_walker(struct mm_struct *mm)
> +{
> +#ifdef LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +   atomic_inc(&mm->lockless_pgtbl_nr_walkers);
> +#endif
> +   /*
> +* This memory barrier pairs with any code that is either trying to
> +* delete page tables, or split huge pages.
> +*/
> +   smp_mb();
> +}
> +EXPORT_SYMBOL_GPL(gup_fast_lock_acquire);
> 
> And this is the same as your original patch, with just a minor name change:
> 
> @@ -2341,9 +2395,11 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>  
> if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
> gup_fast_permitted(start, end)) {
> +   register_lockless_pgtable_walker(current->mm);
> local_irq_save(flags);
> gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
> local_irq_restore(flags);
> +   deregister_lockless_pgtable_walker(current->mm);
> 

Re: [PATCH] sysfs: add BIN_ATTR_WO() macro

2019-10-01 Thread Greg Kroah-Hartman
On Tue, Oct 01, 2019 at 02:08:53PM -0400, Nayna wrote:
> Hi Greg,
> 
> 
> On 08/26/2019 11:01 AM, Greg Kroah-Hartman wrote:
> > This variant was missing from sysfs.h, I guess no one noticed it before.
> > 
> > Turns out the powerpc secure variable code can use it, so add it to the
> > tree for it, and potentially others to take advantage of, instead of
> > open-coding it.
> > 
> > Reported-by: Nayna Jain 
> > Signed-off-by: Greg Kroah-Hartman 
> > ---
> > 
> > I'll queue this up to my tree for 5.4-rc1, but if you want to take this
> > in your tree earlier, feel free to do so.
> > 
> >   include/linux/sysfs.h | 9 +
> >   1 file changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> > index 965236795750..5420817ed317 100644
> > --- a/include/linux/sysfs.h
> > +++ b/include/linux/sysfs.h
> > @@ -196,6 +196,12 @@ struct bin_attribute {
> > .size   = _size,\
> >   }
> > +#define __BIN_ATTR_WO(_name) { 
> > \
> > +   .attr   = { .name = __stringify(_name), .mode = 0200 }, \
> > +   .store  = _name##_store,\
> > +   .size   = _size,\
> > +}
> > +
> >   #define __BIN_ATTR_RW(_name, _size)   
> > \
> > __BIN_ATTR(_name, 0644, _name##_read, _name##_write, _size)
> > @@ -208,6 +214,9 @@ struct bin_attribute bin_attr_##_name = 
> > __BIN_ATTR(_name, _mode, _read, \
> >   #define BIN_ATTR_RO(_name, _size) \
> >   struct bin_attribute bin_attr_##_name = __BIN_ATTR_RO(_name, _size)
> > +#define BIN_ATTR_WO(_name, _size)  \
> > +struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
> > +
> >   #define BIN_ATTR_RW(_name, _size) \
> >   struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
> 
> I am sorry. I didn't notice it via inspection but there is a bug in this
> macro. When I actually try using it, compilation fails. Here's a likely
> patch:
> 
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index 5420817ed317..fa7ee503fb76 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -196,9 +196,9 @@ struct bin_attribute {
> .size   = _size,\
>  }
> -#define __BIN_ATTR_WO(_name) { \
> +#define __BIN_ATTR_WO(_name, _size) {  \
> .attr   = { .name = __stringify(_name), .mode = 0200 }, \
> -   .store  = _name##_store,\
> +   .write  = _name##_write,\
> .size   = _size,\
>  }
> 

Heh, good catch.  Can you send a real patch for this that I can apply to
give you the proper credit for finding and fixing this?

thanks,

greg k-h


Re: [PATCH] sysfs: add BIN_ATTR_WO() macro

2019-10-01 Thread Nayna

Hi Greg,


On 08/26/2019 11:01 AM, Greg Kroah-Hartman wrote:

This variant was missing from sysfs.h, I guess no one noticed it before.

Turns out the powerpc secure variable code can use it, so add it to the
tree for it, and potentially others to take advantage of, instead of
open-coding it.

Reported-by: Nayna Jain 
Signed-off-by: Greg Kroah-Hartman 
---

I'll queue this up to my tree for 5.4-rc1, but if you want to take this
in your tree earlier, feel free to do so.

  include/linux/sysfs.h | 9 +
  1 file changed, 9 insertions(+)

diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 965236795750..5420817ed317 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -196,6 +196,12 @@ struct bin_attribute {
.size   = _size,\
  }
  
+#define __BIN_ATTR_WO(_name) {		\

+   .attr   = { .name = __stringify(_name), .mode = 0200 }, \
+   .store  = _name##_store,\
+   .size   = _size,\
+}
+
  #define __BIN_ATTR_RW(_name, _size)   \
__BIN_ATTR(_name, 0644, _name##_read, _name##_write, _size)
  
@@ -208,6 +214,9 @@ struct bin_attribute bin_attr_##_name = __BIN_ATTR(_name, _mode, _read,	\

  #define BIN_ATTR_RO(_name, _size) \
  struct bin_attribute bin_attr_##_name = __BIN_ATTR_RO(_name, _size)
  
+#define BIN_ATTR_WO(_name, _size)	\

+struct bin_attribute bin_attr_##_name = __BIN_ATTR_WO(_name, _size)
+
  #define BIN_ATTR_RW(_name, _size) \
  struct bin_attribute bin_attr_##_name = __BIN_ATTR_RW(_name, _size)
  


I am sorry. I didn't notice it via inspection but there is a bug in this 
macro. When I actually try using it, compilation fails. Here's a likely 
patch:


diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 5420817ed317..fa7ee503fb76 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -196,9 +196,9 @@ struct bin_attribute {
.size   = _size,\
 }
 
-#define __BIN_ATTR_WO(_name) { \

+#define __BIN_ATTR_WO(_name, _size) {  \
.attr   = { .name = __stringify(_name), .mode = 0200 }, \
-   .store  = _name##_store,\
+   .write  = _name##_write,\
.size   = _size,\
 }


Thanks & Regards,
- Nayna



Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range

2019-10-01 Thread Leonardo Bras
On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
> On 9/27/19 4:40 PM, Leonardo Bras wrote:
> > As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
> > monitor against THP split/collapse with the couting method, it's necessary
> 
> s/couting/counting/
> 

Thanks, fixed for v5.

> > to bound it with {start,end}_lockless_pgtbl_walk.
> > 
> > There are dummy functions, so it is not going to add any overhead on archs
> > that don't use this method.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  mm/gup.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 98f13ab37bac..7105c829cf44 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, 
> > unsigned long end)
> >  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >   struct page **pages)
> >  {
> > +   struct mm_struct *mm;
> 
> I don't think that this local variable adds any value, so let's not use it.
> Similar point in a few other patches too.

It avoids one dereference of current->mm; it's a small performance gain.

> 
> > unsigned long len, end;
> > unsigned long flags;
> > int nr = 0;
> > @@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int 
> > nr_pages, int write,
> >  
> > if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
> > gup_fast_permitted(start, end)) {
> > +   mm = current->mm;
> > +   start_lockless_pgtbl_walk(mm);
> > local_irq_save(flags);
> > gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
> > local_irq_restore(flags);
> > +   end_lockless_pgtbl_walk(mm);
> > }
> >  
> > return nr;
> > @@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int 
> > nr_pages,
> > unsigned int gup_flags, struct page **pages)
> >  {
> > unsigned long addr, len, end;
> > +   struct mm_struct *mm;
> 
> Same here.
> 
> > int nr = 0, ret = 0;
> >  
> > if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> > @@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int 
> > nr_pages,
> >  
> > if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
> > gup_fast_permitted(start, end)) {
> > +   mm = current->mm;
> > +   start_lockless_pgtbl_walk(mm);
> 
> Minor: I'd like to rename this register_lockless_pgtable_walker().
> 
> > local_irq_disable();
> > gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > local_irq_enable();
> > +   end_lockless_pgtbl_walk(mm);
> 
> ...and deregister_lockless_pgtable_walker().
> 

I have no problem changing the name, but I don't think register/deregister
are good terms for this.

I would rather use start/finish, begin/end, and so on. Register sounds
like something more complicated than what we are trying to achieve
here. 

> 
> thanks,

Thank you!


signature.asc
Description: This is a digitally signed message part


Re: [PATCH v6 1/9] dt-bindings: ibm, secureboot: secure boot specific properties for PowerNV

2019-10-01 Thread Nayna




On 10/01/2019 09:33 AM, Rob Herring wrote:

On Fri, Sep 27, 2019 at 10:25:52AM -0400, Nayna Jain wrote:

PowerNV represents both the firmware and Host OS secureboot state of the
system via device tree. This patch adds the documentation to give
the definition of the nodes and the properties.

Signed-off-by: Nayna Jain 
---
  .../bindings/powerpc/ibm,secureboot.rst   | 76 
  .../devicetree/bindings/powerpc/secvar.rst| 89 +++
  2 files changed, 165 insertions(+)
  create mode 100644 
Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
  create mode 100644 Documentation/devicetree/bindings/powerpc/secvar.rst

diff --git a/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst 
b/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
new file mode 100644
index ..03d32099d2eb
--- /dev/null
+++ b/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: GPL-2.0

Not the right form for reST files.


+*** NOTE ***
+This document is copied from OPAL firmware
+(skiboot/doc/device-tree/ibm,secureboot.rst)

Why copy into the kernel?


Do you mean we do not need the device-tree documentation in the kernel
when it already exists in the skiboot tree?


I think I am ok with that. Michael, what do you think?

Thanks & Regards,
 - Nayna


Re: [PATCH v6 3/9] powerpc: add support to initialize ima policy rules

2019-10-01 Thread Nayna




On 09/30/2019 09:04 PM, Thiago Jung Bauermann wrote:

Hello,


Hi,




diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index ..39401b67f19e
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain
+ */
+
+#include 
+#include 
+
+bool arch_ima_get_secureboot(void)
+{
+   return is_powerpc_os_secureboot_enabled();
+}
+
+/* Defines IMA appraise rules for secureboot */
+static const char *const arch_rules[] = {
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#if !IS_ENABLED(CONFIG_MODULE_SIG)
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif
+   NULL
+};
+
+/*
+ * Returns the relevant IMA arch policies based on the system secureboot state.
+ */
+const char *const *arch_get_ima_policy(void)
+{
+   if (is_powerpc_os_secureboot_enabled())
+   return arch_rules;
+
+   return NULL;
+}

If CONFIG_MODULE_SIG is enabled but module signatures aren't enforced,
then IMA won't enforce module signature either. x86's
arch_get_ima_policy() calls set_module_sig_enforced(). Doesn't the
powerpc version need to do that as well?

On the flip side, if module signatures are enforced by the module
subsystem then IMA will verify the signature a second time since there's
no sharing of signature verification results between the module
subsystem and IMA (this was observed by Mimi).

IMHO this is a minor issue, since module loading isn't a hot path and
the duplicate work shouldn't impact anything. But it could be avoided by
having a NULL entry in arch_rules, which arch_get_ima_policy() would
dynamically update with the "appraise func=MODULE_CHECK" rule if
is_module_sig_enforced() is true.


Thanks Thiago for reviewing. I am wondering whether this will give two 
meanings for NULL. Can we do something like below? There are possibly 
two options:


1. Set IMA_APPRAISED in the iint->flags if is_module_sig_enforced().

OR

2. Let ima_get_action() check for is_module_sig_enforced() when policy 
is appraise and func is MODULE_CHECK.


Thanks & Regards,
   - Nayna


[Bug 204789] Boot failure with more than 256G of memory on Power9 with 4K pages & Hash MMU

2019-10-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=204789

--- Comment #11 from Cameron (c...@neo-zeon.de) ---
grep RAM /proc/iomem
-3f : System RAM

The system has 16 DIMM slots, all are populated. Unfortunately, I will 
not have physical access to the box in the foreseeable future.

Aneesh appears to be correct in that this issue started with 0034d395f89d.

On 10/1/19 3:04 AM, bugzilla-dae...@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=204789
>
> --- Comment #10 from Michael Ellerman (mich...@ellerman.id.au) ---
> Can you boot a good kernel and do:
>
> $ sudo grep RAM /proc/iomem
>
> And paste the output. Just to confirm what your memory layout is.
>
> What arrangement of DIMMs do you have? It's possible you could work around
> the
> bug by changing that, depending on how many DIMMs and slots you have.
>

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH 18/29] arm64: Move EXCEPTION_TABLE to RO_DATA segment

2019-10-01 Thread Kees Cook
On Tue, Oct 01, 2019 at 10:03:56AM +0100, Will Deacon wrote:
> Hi Kees,
> 
> On Thu, Sep 26, 2019 at 10:55:51AM -0700, Kees Cook wrote:
> > The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.
> > 
> > Signed-off-by: Kees Cook 
> > ---
> >  arch/arm64/kernel/vmlinux.lds.S | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/vmlinux.lds.S 
> > b/arch/arm64/kernel/vmlinux.lds.S
> > index 81d94e371c95..c6ba2eee0ee8 100644
> > --- a/arch/arm64/kernel/vmlinux.lds.S
> > +++ b/arch/arm64/kernel/vmlinux.lds.S
> > @@ -5,6 +5,8 @@
> >   * Written by Martin Mares 
> >   */
> >  
> > +#define RO_DATA_EXCEPTION_TABLE_ALIGN  8
> > +
> >  #include 
> >  #include 
> >  #include 
> > @@ -135,8 +137,8 @@ SECTIONS
> > . = ALIGN(SEGMENT_ALIGN);
> > _etext = .; /* End of text section */
> >  
> > -   RO_DATA(PAGE_SIZE)  /* everything from this point to */
> > -   EXCEPTION_TABLE(8)  /* __init_begin will be marked RO NX */
> > +   /* everything from this point to __init_begin will be marked RO NX */
> > +   RO_DATA(PAGE_SIZE)
> >  
> > . = ALIGN(PAGE_SIZE);
> 
> Do you reckon it would be worth merging this last ALIGN directive into the
> RO_DATA definition too? Given that we want to map the thing read-only, it
> really has to be aligned either side.

Actually, taking a closer look, this appears to be redundant: RO_DATA()
ends with:

. = ALIGN(align)

(where "align" is the "PAGE_SIZE" argument to RO_DATA())

> Anyway, that's only a nit, so:
> 
> Acked-by: Will Deacon 

Thanks!

> P.S. Please CC the arm64 maintainers on arm64 patches -- I nearly missed
> this one!

Okay, I can re-expand my list. I originally had done this but it was
getting to be a rather large set of people. :)

-- 
Kees Cook


Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone

2019-10-01 Thread David Hildenbrand
On 01.10.19 16:57, David Hildenbrand wrote:
> On 01.10.19 16:40, David Hildenbrand wrote:
>> From: "Aneesh Kumar K.V" 
>>
>> With altmap, all the resource pfns are not initialized. While initializing
>> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
>> skip pfns that were never initialized.
>>
>> Update memunmap_pages to calculate start and end pfn based on altmap
>> values. This fixes a kernel crash that is observed when destroying
>> a namespace.
>>
>> [   81.356173] kernel BUG at include/linux/mm.h:1107!
>> cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
>> pc: c04b9728: memunmap_pages+0x238/0x340
>> lr: c04b9724: memunmap_pages+0x234/0x340
>> ...
>> pid   = 3669, comm = ndctl
>> kernel BUG at include/linux/mm.h:1107!
>> [c00274087ba0] c09e3500 devm_action_release+0x30/0x50
>> [c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
>> [c00274087c30] c09dd144 
>> device_release_driver_internal+0x174/0x240
>> [c00274087c70] c09d9dfc unbind_store+0x13c/0x190
>> [c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
>> [c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
>> [c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
>> [c00274087d60] c04be45c __vfs_write+0x3c/0x70
>> [c00274087d80] c04c26e4 vfs_write+0xe4/0x200
>> [c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
>> [c00274087e20] c000bbd0 system_call+0x5c/0x68
>>
>> Cc: Dan Williams 
>> Cc: Andrew Morton 
>> Cc: Jason Gunthorpe 
>> Cc: Logan Gunthorpe 
>> Cc: Ira Weiny 
>> Reviewed-by: Pankaj Gupta 
>> Signed-off-by: Aneesh Kumar K.V 
>> [ move all pfn-related declarations into a single line ]
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memremap.c | 13 -
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 557e53c6fb46..026788b2ac69 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap 
>> *pgmap)
>>  void memunmap_pages(struct dev_pagemap *pgmap)
>>  {
>>  struct resource *res = &pgmap->res;
>> -unsigned long pfn;
>> +unsigned long pfn, nr_pages, start_pfn, end_pfn;
>>  int nid;
>>  
>>  dev_pagemap_kill(pgmap);
>> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>>  put_page(pfn_to_page(pfn));
>>  dev_pagemap_cleanup(pgmap);
>>  
>> +start_pfn = pfn_first(pgmap);
>> +end_pfn = pfn_end(pgmap);
>> +nr_pages = end_pfn - start_pfn;
>> +
>>  /* pages are dead and unused, undo the arch mapping */
>> -nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
>> +nid = page_to_nid(pfn_to_page(start_pfn));
>>  
>>  mem_hotplug_begin();
>>  if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
>> -pfn = PHYS_PFN(res->start);
>> -__remove_pages(page_zone(pfn_to_page(pfn)), pfn,
>> - PHYS_PFN(resource_size(res)), NULL);
>> +__remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
>> +   nr_pages, NULL);
>>  } else {
>>  arch_remove_memory(nid, res->start, resource_size(res),
>>  pgmap_altmap(pgmap));
>>
> 
> Aneesh, I was wondering why the use of "res->start" is correct (and
> whether we shouldn't also switch to start_pfn/nr_pages here). It would
> be good if Dan could review.
> 

To be more precise, I wonder if it should actually be

__remove_pages(page_zone(pfn_to_page(start_pfn)), res->start,
   resource_size(res))

IOW, keep calling __remove_pages() with the same parameters but read
nid/zone from the offset one.

Hope some memunmap_pages() expert can clarify.

-- 

Thanks,

David / dhildenb


Re: [PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone

2019-10-01 Thread David Hildenbrand
On 01.10.19 16:40, David Hildenbrand wrote:
> From: "Aneesh Kumar K.V" 
> 
> With altmap, all the resource pfns are not initialized. While initializing
> pfn, altmap reserve space is skipped. Hence when removing pfn from zone
> skip pfns that were never initialized.
> 
> Update memunmap_pages to calculate start and end pfn based on altmap
> values. This fixes a kernel crash that is observed when destroying
> a namespace.
> 
> [   81.356173] kernel BUG at include/linux/mm.h:1107!
> cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
> pc: c04b9728: memunmap_pages+0x238/0x340
> lr: c04b9724: memunmap_pages+0x234/0x340
> ...
> pid   = 3669, comm = ndctl
> kernel BUG at include/linux/mm.h:1107!
> [c00274087ba0] c09e3500 devm_action_release+0x30/0x50
> [c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
> [c00274087c30] c09dd144 device_release_driver_internal+0x174/0x240
> [c00274087c70] c09d9dfc unbind_store+0x13c/0x190
> [c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
> [c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
> [c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
> [c00274087d60] c04be45c __vfs_write+0x3c/0x70
> [c00274087d80] c04c26e4 vfs_write+0xe4/0x200
> [c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
> [c00274087e20] c000bbd0 system_call+0x5c/0x68
> 
> Cc: Dan Williams 
> Cc: Andrew Morton 
> Cc: Jason Gunthorpe 
> Cc: Logan Gunthorpe 
> Cc: Ira Weiny 
> Reviewed-by: Pankaj Gupta 
> Signed-off-by: Aneesh Kumar K.V 
> [ move all pfn-related declarations into a single line ]
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memremap.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 557e53c6fb46..026788b2ac69 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
>  void memunmap_pages(struct dev_pagemap *pgmap)
>  {
>   struct resource *res = &pgmap->res;
> - unsigned long pfn;
> + unsigned long pfn, nr_pages, start_pfn, end_pfn;
>   int nid;
>  
>   dev_pagemap_kill(pgmap);
> @@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
>   put_page(pfn_to_page(pfn));
>   dev_pagemap_cleanup(pgmap);
>  
> + start_pfn = pfn_first(pgmap);
> + end_pfn = pfn_end(pgmap);
> + nr_pages = end_pfn - start_pfn;
> +
>   /* pages are dead and unused, undo the arch mapping */
> - nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
> + nid = page_to_nid(pfn_to_page(start_pfn));
>  
>   mem_hotplug_begin();
>   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> - pfn = PHYS_PFN(res->start);
> - __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
> -  PHYS_PFN(resource_size(res)), NULL);
> + __remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
> +nr_pages, NULL);
>   } else {
>   arch_remove_memory(nid, res->start, resource_size(res),
>   pgmap_altmap(pgmap));
> 

Aneesh, I was wondering why the use of "res->start" is correct (and
whether we shouldn't also switch to start_pfn/nr_pages here). It would
be good if Dan could review.

-- 

Thanks,

David / dhildenb


[PATCH v5 10/10] mm/memory_hotplug: Cleanup __remove_pages()

2019-10-01 Thread David Hildenbrand
Let's drop the basically unused section stuff and simplify.

Also, let's use a shorter variant to calculate the number of pages to
the next section boundary.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ffb514e3b090..0fa99e5a657e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -488,25 +488,20 @@ static void __remove_section(unsigned long pfn, unsigned 
long nr_pages,
 void __remove_pages(unsigned long pfn, unsigned long nr_pages,
struct vmem_altmap *altmap)
 {
+   const unsigned long end_pfn = pfn + nr_pages;
+   unsigned long cur_nr_pages;
unsigned long map_offset = 0;
-   unsigned long nr, start_sec, end_sec;
 
map_offset = vmem_altmap_offset(altmap);
 
if (check_pfn_span(pfn, nr_pages, "remove"))
return;
 
-   start_sec = pfn_to_section_nr(pfn);
-   end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-   for (nr = start_sec; nr <= end_sec; nr++) {
-   unsigned long pfns;
-
+   for (; pfn < end_pfn; pfn += cur_nr_pages) {
cond_resched();
-   pfns = min(nr_pages, PAGES_PER_SECTION
-   - (pfn & ~PAGE_SECTION_MASK));
-   __remove_section(pfn, pfns, map_offset, altmap);
-   pfn += pfns;
-   nr_pages -= pfns;
+   /* Select all remaining pages up to the next section boundary */
+   cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK));
+   __remove_section(pfn, cur_nr_pages, map_offset, altmap);
map_offset = 0;
}
 }
-- 
2.21.0



[PATCH v5 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()

2019-10-01 Thread David Hildenbrand
Get rid of the unnecessary local variables.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 22b3623768c8..ffb514e3b090 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, 
struct zone *zone,
 static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 unsigned long end_pfn)
 {
-   unsigned long zone_start_pfn = zone->zone_start_pfn;
-   unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
-   unsigned long zone_end_pfn = z;
unsigned long pfn;
int nid = zone_to_nid(zone);
 
zone_span_writelock(zone);
-   if (zone_start_pfn == start_pfn) {
+   if (zone->zone_start_pfn == start_pfn) {
/*
 * If the section is smallest section in the zone, it need
 * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
@@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
 * for shrinking zone.
 */
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-   zone_end_pfn);
+   zone_end_pfn(zone));
if (pfn) {
+   zone->spanned_pages = zone_end_pfn(zone) - pfn;
zone->zone_start_pfn = pfn;
-   zone->spanned_pages = zone_end_pfn - pfn;
} else {
zone->zone_start_pfn = 0;
zone->spanned_pages = 0;
}
-   } else if (zone_end_pfn == end_pfn) {
+   } else if (zone_end_pfn(zone) == end_pfn) {
/*
 * If the section is biggest section in the zone, it need
 * shrink zone->spanned_pages.
 * In this case, we find second biggest valid mem_section for
 * shrinking zone.
 */
-   pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+   pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn,
   start_pfn);
if (pfn)
-   zone->spanned_pages = pfn - zone_start_pfn + 1;
+   zone->spanned_pages = pfn - zone->zone_start_pfn + 1;
else {
zone->zone_start_pfn = 0;
zone->spanned_pages = 0;
-- 
2.21.0



[PATCH v5 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2019-10-01 Thread David Hildenbrand
If we have holes, the holes will automatically get detected and removed
once we remove the next bigger/smaller section. The extra checks can
go.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: David Hildenbrand 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 34 +++---
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 273833612774..22b3623768c8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
if (pfn) {
zone->zone_start_pfn = pfn;
zone->spanned_pages = zone_end_pfn - pfn;
+   } else {
+   zone->zone_start_pfn = 0;
+   zone->spanned_pages = 0;
}
} else if (zone_end_pfn == end_pfn) {
/*
@@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
   start_pfn);
if (pfn)
zone->spanned_pages = pfn - zone_start_pfn + 1;
+   else {
+   zone->zone_start_pfn = 0;
+   zone->spanned_pages = 0;
+   }
}
-
-   /*
-* The section is not biggest or smallest mem_section in the zone, it
-* only creates a hole in the zone. So in this case, we need not
-* change the zone. But perhaps, the zone has only hole data. Thus
-* it check the zone has only hole or not.
-*/
-   pfn = zone_start_pfn;
-   for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-   if (unlikely(!pfn_to_online_page(pfn)))
-   continue;
-
-   if (page_zone(pfn_to_page(pfn)) != zone)
-   continue;
-
-   /* Skip range to be removed */
-   if (pfn >= start_pfn && pfn < end_pfn)
-   continue;
-
-   /* If we find valid section, we have nothing to do */
-   zone_span_writeunlock(zone);
-   return;
-   }
-
-   /* The zone has no valid section */
-   zone->zone_start_pfn = 0;
-   zone->spanned_pages = 0;
zone_span_writeunlock(zone);
 }
 
-- 
2.21.0



[PATCH v5 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn

2019-10-01 Thread David Hildenbrand
With shrink_pgdat_span() out of the way, we now always have a valid
zone.

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 640309236a58..273833612774 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, 
struct zone *zone,
if (unlikely(pfn_to_nid(start_pfn) != nid))
continue;
 
-   if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+   if (zone != page_zone(pfn_to_page(start_pfn)))
continue;
 
return start_pfn;
@@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, 
struct zone *zone,
if (unlikely(pfn_to_nid(pfn) != nid))
continue;
 
-   if (zone && zone != page_zone(pfn_to_page(pfn)))
+   if (zone != page_zone(pfn_to_page(pfn)))
continue;
 
return pfn;
-- 
2.21.0



[PATCH v5 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()

2019-10-01 Thread David Hildenbrand
Let's poison the pages similarly to when adding new memory in
sparse_add_section(). Also call remove_pfn_range_from_zone() from
memunmap_pages(), so we can poison the memmap from there as well.

While at it, calculate the pfn in memunmap_pages() only once.
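For readers unfamiliar with memmap poisoning, a tiny userspace model of the idea may help (the struct layout and the 0xaa poison byte are invented for illustration; they are not the kernel's actual values):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of page_init_poison(): fill the memmap with a
 * recognizable poison byte so that any later use of the stale struct
 * pages faults loudly instead of silently reading garbage. */
struct page { unsigned long flags; void *mapping; };

static void page_init_poison_model(struct page *pages, size_t nr_pages)
{
    memset(pages, 0xaa, nr_pages * sizeof(struct page));
}
```

The point is only that a poisoned memmap is detectably invalid, which is what makes bugs like the ones fixed in this series surface quickly.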

Cc: Andrew Morton 
Cc: David Hildenbrand 
Cc: Oscar Salvador 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 3 +++
 mm/memremap.c   | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index cef909ebd807..640309236a58 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -464,6 +464,9 @@ void __ref remove_pfn_range_from_zone(struct zone *zone,
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long flags;
 
+   /* Poison struct pages because they are now uninitialized again. */
+   page_init_poison(pfn_to_page(start_pfn), sizeof(struct page) * 
nr_pages);
+
/*
 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
 * we will not try to shrink the zones - which is okay as
diff --git a/mm/memremap.c b/mm/memremap.c
index 734afeaad811..371939f92b69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -139,6 +139,8 @@ void memunmap_pages(struct dev_pagemap *pgmap)
nid = page_to_nid(pfn_to_page(start_pfn));
 
mem_hotplug_begin();
+   remove_pfn_range_from_zone(page_zone(pfn_to_page(start_pfn)),
+  start_pfn, nr_pages);
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
__remove_pages(start_pfn, nr_pages, NULL);
} else {
-- 
2.21.0



[PATCH v5 05/10] mm/memory_hotplug: Shrink zones when offlining memory

2019-10-01 Thread David Hildenbrand
We currently try to shrink a single zone when removing memory. We use the
zone of the first page of the memory we are removing. If that memmap was
never initialized (e.g., memory was never onlined), we will read garbage
and can trigger kernel BUGs (due to a stale pointer):

:/# [   23.912993] BUG: unable to handle page fault for address: 
353d
[   23.914219] #PF: supervisor write access in kernel mode
[   23.915199] #PF: error_code(0x0002) - not-present page
[   23.916160] PGD 0 P4D 0
[   23.916627] Oops: 0002 [#1] SMP PTI
[   23.917256] CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 
5.3.0-rc5-next-20190820+ #317
[   23.918900] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
[   23.921194] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   23.922249] RIP: 0010:clear_zone_contiguous+0x5/0x10
[   23.923173] Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 
c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
[   23.926876] RSP: 0018:ad2400043c98 EFLAGS: 00010246
[   23.927928] RAX:  RBX: 0002 RCX: 
[   23.929458] RDX: 0020 RSI: 0014 RDI: 2f40
[   23.930899] RBP: 00014000 R08:  R09: 0001
[   23.932362] R10:  R11:  R12: 0014
[   23.933603] R13: 0014 R14: 2f40 R15: 9e3e7aff3680
[   23.934913] FS:  () GS:9e3e7bb0() 
knlGS:
[   23.936294] CS:  0010 DS:  ES:  CR0: 80050033
[   23.937481] CR2: 353d CR3: 5861 CR4: 06e0
[   23.938687] DR0:  DR1:  DR2: 
[   23.939889] DR3:  DR6: fffe0ff0 DR7: 0400
[   23.941168] Call Trace:
[   23.941580]  __remove_pages+0x4b/0x640
[   23.942303]  ? mark_held_locks+0x49/0x70
[   23.943149]  arch_remove_memory+0x63/0x8d
[   23.943921]  try_remove_memory+0xdb/0x130
[   23.944766]  ? walk_memory_blocks+0x7f/0x9e
[   23.945616]  __remove_memory+0xa/0x11
[   23.946274]  acpi_memory_device_remove+0x70/0x100
[   23.947308]  acpi_bus_trim+0x55/0x90
[   23.947914]  acpi_device_hotplug+0x227/0x3a0
[   23.948714]  acpi_hotplug_work_fn+0x1a/0x30
[   23.949433]  process_one_work+0x221/0x550
[   23.950190]  worker_thread+0x50/0x3b0
[   23.950993]  kthread+0x105/0x140
[   23.951644]  ? process_one_work+0x550/0x550
[   23.952508]  ? kthread_park+0x80/0x80
[   23.953367]  ret_from_fork+0x3a/0x50
[   23.954025] Modules linked in:
[   23.954613] CR2: 353d
[   23.955248] ---[ end trace 93d982b1fb3e1a69 ]---

Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone() for that. We now properly
shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones

Drop the zone parameter (with a potentially dubious value) from
__remove_pages() and __remove_section().

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: Mark Rutland 
Cc: Steve Capper 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Yu Zhao 
Cc: Jun Yao 
Cc: Robin Murphy 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Pavel Tatashin 
Cc: Gerald Schaefer 
Cc: Halil Pasic 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Masahiro Yamada 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Signed-off-by: David Hildenbrand 
---
 arch/arm64/mm/mmu.c|  4 +---
 arch/ia64/mm/init.c|  4 +---
 arch/powerpc/mm/mem.c  |  3 +--
 arch/s390/mm/init.c|  4 +---
 arch/sh/mm/init.c  |  4 +---
 arch/x86/mm/init_32.c  |  4 +---
 arch/x86/mm/init_64.c  |  4 +---
 include/linux/memory_hotplug.h |  7 +--
 mm/memory_hotplug.c| 31 ---
 mm/memremap.c  |  3 +--
 10 files changed, 29 insertions(+), 39 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 60c929f3683b..d10247fab0fd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1069,7 +10

[PATCH v5 04/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()

2019-10-01 Thread David Hildenbrand
Let's limit shrinking to !ZONE_DEVICE so we can fix the current code. We
should never try to touch the memmap of offline sections where we could
have uninitialized memmaps and could trigger BUGs when calling
page_to_nid() on poisoned pages.

There is no reliable way to distinguish an uninitialized memmap from an
initialized memmap that belongs to ZONE_DEVICE, as we don't have
anything like SECTION_IS_ONLINE that we could use, similar to
pfn_to_online_section() for !ZONE_DEVICE memory. E.g.,
set_zone_contiguous() similarly relies on pfn_to_online_section() and
will therefore never set a ZONE_DEVICE zone consecutive. Stopping to
shrink the ZONE_DEVICE therefore results in no observable changes,
besides /proc/zoneinfo indicating different boundaries - something we
can totally live with.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized to 0 and the node was set to the
right value. So the zone might have been wrong, but not garbage. After that
commit, both the zone and the node will be garbage when touching
uninitialized memmaps.
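A rough userspace model of why the online check is the only safe filter (all names here are invented for illustration; this is not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: system-RAM sections carry an ONLINE flag, so an
 * offline section's memmap is never dereferenced; ZONE_DEVICE sections
 * have no such flag, which is why their memmap cannot be trusted either
 * way and shrinking them has to be skipped entirely. */
#define SECTION_IS_ONLINE 0x1UL

struct page { int nid; };
struct mem_section { unsigned long flags; struct page *memmap; };

static struct page *section_online_page(const struct mem_section *ms)
{
    if (!ms || !(ms->flags & SECTION_IS_ONLINE))
        return NULL;   /* memmap may be uninitialized: refuse to hand it out */
    return ms->memmap;
}
```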

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Reported-by: Aneesh Kumar K.V 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 86b4dc18e831..afed8331332b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -331,7 +331,7 @@ static unsigned long find_smallest_section_pfn(int nid, 
struct zone *zone,
 unsigned long end_pfn)
 {
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SUBSECTION) {
-   if (unlikely(!pfn_valid(start_pfn)))
+   if (unlikely(!pfn_to_online_page(start_pfn)))
continue;
 
if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -356,7 +356,7 @@ static unsigned long find_biggest_section_pfn(int nid, 
struct zone *zone,
/* pfn is the end pfn of a memory section. */
pfn = end_pfn - 1;
for (; pfn >= start_pfn; pfn -= PAGES_PER_SUBSECTION) {
-   if (unlikely(!pfn_valid(pfn)))
+   if (unlikely(!pfn_to_online_page(pfn)))
continue;
 
if (unlikely(pfn_to_nid(pfn) != nid))
@@ -415,7 +415,7 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
 */
pfn = zone_start_pfn;
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-   if (unlikely(!pfn_valid(pfn)))
+   if (unlikely(!pfn_to_online_page(pfn)))
continue;
 
if (page_zone(pfn_to_page(pfn)) != zone)
@@ -463,6 +463,14 @@ static void __remove_zone(struct zone *zone, unsigned long 
start_pfn,
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long flags;
 
+   /*
+* Zone shrinking code cannot properly deal with ZONE_DEVICE. So
+* we will not try to shrink the zones - which is okay as
+* set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
+*/
+   if (zone_idx(zone) == ZONE_DEVICE)
+   return;
+
pgdat_resize_lock(zone->zone_pgdat, &flags);
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
update_pgdat_span(pgdat);
-- 
2.21.0



[PATCH v5 03/10] mm/memory_hotplug: Don't access uninitialized memmaps in shrink_pgdat_span()

2019-10-01 Thread David Hildenbrand
We might use the nid of memmaps that were never initialized. For
example, if the memmap was poisoned, we will crash the kernel in
pfn_to_nid() right now. Let's use the calculated boundaries of the separate
zones instead. This now also avoids having to iterate over a whole bunch of
subsections again, after shrinking one zone.

Before commit d0dc12e86b31 ("mm/memory_hotplug: optimize memory
hotplug"), the memmap was initialized to 0 and the node was set to the
right value. After that commit, the node might be garbage.

We'll have to fix shrink_zone_span() next.
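The replacement logic can be sketched in userspace roughly like this (types invented for illustration; the idea is that the node span is recomputed purely from zone boundaries, so no possibly-poisoned memmap is ever touched):

```c
#include <assert.h>

/* Hypothetical model of recomputing a node's span from its zones. */
struct zone { unsigned long start_pfn, spanned_pages; };

static void node_span_from_zones(const struct zone *zones, int nr_zones,
                                 unsigned long *node_start_pfn,
                                 unsigned long *node_spanned_pages)
{
    unsigned long start = 0, end = 0;
    int have_start = 0;

    for (int i = 0; i < nr_zones; i++) {
        unsigned long zone_end = zones[i].start_pfn + zones[i].spanned_pages;

        if (!zones[i].spanned_pages)
            continue;               /* empty zones contribute nothing */
        if (zone_end > end)
            end = zone_end;
        if (!have_start || zones[i].start_pfn < start) {
            start = zones[i].start_pfn;
            have_start = 1;
        }
    }
    *node_start_pfn = start;
    *node_spanned_pages = end - start;
}
```

Note this walks a handful of zones instead of iterating over every subsection, which is also why the new code is so much shorter.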

Cc: Andrew Morton 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Michal Hocko 
Cc: Pavel Tatashin 
Cc: Dan Williams 
Cc: Wei Yang 
Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
Reported-by: Aneesh Kumar K.V 
Signed-off-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 72 ++---
 1 file changed, 15 insertions(+), 57 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 680b4b3e57d9..86b4dc18e831 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -436,67 +436,25 @@ static void shrink_zone_span(struct zone *zone, unsigned 
long start_pfn,
zone_span_writeunlock(zone);
 }
 
-static void shrink_pgdat_span(struct pglist_data *pgdat,
- unsigned long start_pfn, unsigned long end_pfn)
+static void update_pgdat_span(struct pglist_data *pgdat)
 {
-   unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
-   unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace 
clash */
-   unsigned long pgdat_end_pfn = p;
-   unsigned long pfn;
-   int nid = pgdat->node_id;
-
-   if (pgdat_start_pfn == start_pfn) {
-   /*
-* If the section is smallest section in the pgdat, it need
-* shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
-* In this case, we find second smallest valid mem_section
-* for shrinking zone.
-*/
-   pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
-   pgdat_end_pfn);
-   if (pfn) {
-   pgdat->node_start_pfn = pfn;
-   pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
-   }
-   } else if (pgdat_end_pfn == end_pfn) {
-   /*
-* If the section is biggest section in the pgdat, it need
-* shrink pgdat->node_spanned_pages.
-* In this case, we find second biggest valid mem_section for
-* shrinking zone.
-*/
-   pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
-  start_pfn);
-   if (pfn)
-   pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
-   }
-
-   /*
-* If the section is not biggest or smallest mem_section in the pgdat,
-* it only creates a hole in the pgdat. So in this case, we need not
-* change the pgdat.
-* But perhaps, the pgdat has only hole data. Thus it check the pgdat
-* has only hole or not.
-*/
-   pfn = pgdat_start_pfn;
-   for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SUBSECTION) {
-   if (unlikely(!pfn_valid(pfn)))
-   continue;
-
-   if (pfn_to_nid(pfn) != nid)
-   continue;
+   unsigned long node_start_pfn = 0, node_end_pfn = 0;
+   struct zone *zone;
 
-   /* Skip range to be removed */
-   if (pfn >= start_pfn && pfn < end_pfn)
-   continue;
+   for (zone = pgdat->node_zones;
+zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
+   unsigned long zone_end_pfn = zone->zone_start_pfn +
+zone->spanned_pages;
 
-   /* If we find valid section, we have nothing to do */
-   return;
+   /* No need to lock the zones, they can't change. */
+   if (zone_end_pfn > node_end_pfn)
+   node_end_pfn = zone_end_pfn;
+   if (zone->zone_start_pfn < node_start_pfn)
+   node_start_pfn = zone->zone_start_pfn;
}
 
-   /* The pgdat has no valid section */
-   pgdat->node_start_pfn = 0;
-   pgdat->node_spanned_pages = 0;
+   pgdat->node_start_pfn = node_start_pfn;
+   pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
 }
 
 static void __remove_zone(struct zone *zone, unsigned long start_pfn,
@@ -507,7 +465,7 @@ static void __remove_zone(struct zone *zone, unsigned long 
start_pfn,
 
pgdat_resize_lock(zone->zone_pgdat, &flags);
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-   shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+   update_pgdat_span(pgdat);
pgda

[PATCH v5 02/10] mm/memmap_init: Update variable name in memmap_init_zone

2019-10-01 Thread David Hildenbrand
From: "Aneesh Kumar K.V" 

The third argument is actually the number of pages. Change the variable
name from size to nr_pages to indicate this better.

No functional change in this patch.

Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Oscar Salvador 
Cc: Mel Gorman 
Cc: Mike Rapoport 
Cc: Dan Williams 
Cc: Alexander Duyck 
Cc: Pavel Tatashin 
Cc: Alexander Potapenko 
Reviewed-by: Pankaj Gupta 
Reviewed-by: David Hildenbrand 
Signed-off-by: Aneesh Kumar K.V 
Signed-off-by: David Hildenbrand 
---
 mm/page_alloc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..b0b2d5464000 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5936,10 +5936,10 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
 #ifdef CONFIG_ZONE_DEVICE
 void __ref memmap_init_zone_device(struct zone *zone,
   unsigned long start_pfn,
-  unsigned long size,
+  unsigned long nr_pages,
   struct dev_pagemap *pgmap)
 {
-   unsigned long pfn, end_pfn = start_pfn + size;
+   unsigned long pfn, end_pfn = start_pfn + nr_pages;
struct pglist_data *pgdat = zone->zone_pgdat;
struct vmem_altmap *altmap = pgmap_altmap(pgmap);
unsigned long zone_idx = zone_idx(zone);
@@ -5956,7 +5956,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 */
if (altmap) {
start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
-   size = end_pfn - start_pfn;
+   nr_pages = end_pfn - start_pfn;
}
 
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
@@ -6003,7 +6003,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
}
 
pr_info("%s initialised %lu pages in %ums\n", __func__,
-   size, jiffies_to_msecs(jiffies - start));
+   nr_pages, jiffies_to_msecs(jiffies - start));
 }
 
 #endif
-- 
2.21.0



[PATCH v5 01/10] mm/memunmap: Use the correct start and end pfn when removing pages from zone

2019-10-01 Thread David Hildenbrand
From: "Aneesh Kumar K.V" 

With an altmap, not all of the resource's pfns are initialized: the altmap
reserve space is skipped during pfn initialization. Hence, when removing
pfns from a zone, skip the pfns that were never initialized.

Update memunmap_pages to calculate start and end pfn based on altmap
values. This fixes a kernel crash that is observed when destroying
a namespace.
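The range calculation being fixed can be sketched like this (a simplified userspace model with an invented struct; not the kernel's actual API):

```c
#include <assert.h>

/* Hypothetical sketch of the fix: with an altmap, the first 'reserve'
 * pfns of the resource back the memmap itself and were never
 * initialized, so the removal range must start past them. */
struct vmem_altmap { unsigned long base_pfn, reserve; };

static unsigned long first_valid_pfn(unsigned long res_start_pfn,
                                     const struct vmem_altmap *altmap)
{
    if (altmap)
        return altmap->base_pfn + altmap->reserve;
    return res_start_pfn;
}
```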

[   81.356173] kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c00274087890]
pc: c04b9728: memunmap_pages+0x238/0x340
lr: c04b9724: memunmap_pages+0x234/0x340
...
pid   = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
[c00274087ba0] c09e3500 devm_action_release+0x30/0x50
[c00274087bc0] c09e4758 release_nodes+0x268/0x2d0
[c00274087c30] c09dd144 device_release_driver_internal+0x174/0x240
[c00274087c70] c09d9dfc unbind_store+0x13c/0x190
[c00274087cb0] c09d8a24 drv_attr_store+0x44/0x60
[c00274087cd0] c05a7470 sysfs_kf_write+0x70/0xa0
[c00274087d10] c05a5cac kernfs_fop_write+0x1ac/0x290
[c00274087d60] c04be45c __vfs_write+0x3c/0x70
[c00274087d80] c04c26e4 vfs_write+0xe4/0x200
[c00274087dd0] c04c2a6c ksys_write+0x7c/0x140
[c00274087e20] c000bbd0 system_call+0x5c/0x68

Cc: Dan Williams 
Cc: Andrew Morton 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Reviewed-by: Pankaj Gupta 
Signed-off-by: Aneesh Kumar K.V 
[ move all pfn-related declarations into a single line ]
Signed-off-by: David Hildenbrand 
---
 mm/memremap.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 557e53c6fb46..026788b2ac69 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -123,7 +123,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
struct resource *res = &pgmap->res;
-   unsigned long pfn;
+   unsigned long pfn, nr_pages, start_pfn, end_pfn;
int nid;
 
dev_pagemap_kill(pgmap);
@@ -131,14 +131,17 @@ void memunmap_pages(struct dev_pagemap *pgmap)
put_page(pfn_to_page(pfn));
dev_pagemap_cleanup(pgmap);
 
+   start_pfn = pfn_first(pgmap);
+   end_pfn = pfn_end(pgmap);
+   nr_pages = end_pfn - start_pfn;
+
/* pages are dead and unused, undo the arch mapping */
-   nid = page_to_nid(pfn_to_page(PHYS_PFN(res->start)));
+   nid = page_to_nid(pfn_to_page(start_pfn));
 
mem_hotplug_begin();
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-   pfn = PHYS_PFN(res->start);
-   __remove_pages(page_zone(pfn_to_page(pfn)), pfn,
-PHYS_PFN(resource_size(res)), NULL);
+   __remove_pages(page_zone(pfn_to_page(start_pfn)), start_pfn,
+  nr_pages, NULL);
} else {
arch_remove_memory(nid, res->start, resource_size(res),
pgmap_altmap(pgmap));
-- 
2.21.0



[PATCH v5 00/10] mm/memory_hotplug: Shrink zones before removing memory

2019-10-01 Thread David Hildenbrand
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.

We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).

We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.

--

Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.

Example:

:/# cat /proc/zoneinfo
Node 1, zone  Movable
spanned  0
present  0
managed  0
:/# echo "online_movable" > /sys/devices/system/memory/memory41/state
:/# echo "online_movable" > /sys/devices/system/memory/memory43/state
:/# cat /proc/zoneinfo
Node 1, zone  Movable
spanned  98304
present  65536
managed  65536
:/# echo 0 > /sys/devices/system/memory/memory43/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
spanned  32768
present  32768
managed  32768
:/# echo 0 > /sys/devices/system/memory/memory41/online
:/# cat /proc/zoneinfo
Node 1, zone  Movable
spanned  0
present  0
managed  0

--

I tested this with DIMMs on x86. It sounded like Aneesh tested the
ZONE_DEVICE part :)

v4 -> v5:
- "mm/memory_hotplug: Don't access uninitialized memmaps in shrink_zone_span()"
-- Add more details why ZONE_DEVICE is special
- Include two patches from Aneesh
-- "mm/memunmap: Use the correct start and end pfn when removing pages
from zone"
-- "mm/memmap_init: Update variable name in memmap_init_zone"

v3 -> v4:
- Drop "mm/memremap: Get rid of memmap_init_zone_device()"
-- As Alexander noticed, it was messy either way
- Drop "mm/memory_hotplug: Exit early in __remove_pages() on BUGs"
- Drop "mm: Exit early in set_zone_contiguous() if already contiguous"
- Drop "mm/memory_hotplug: Optimize zone shrinking code when checking for
  holes"
- Merged "mm/memory_hotplug: Remove pages from a zone before removing
  memory" and "mm/memory_hotplug: Remove zone parameter from
  __remove_pages()" into "mm/memory_hotplug: Shrink zones when offlining
  memory"
- Added "mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()"
- Stop shrinking ZONE_DEVICE
- Reshuffle patches, moving all fixes to the front. Add Fixes: tags.
- Change subject/description of various patches
- Minor changes (too many to mention)

Cc: Aneesh Kumar K.V 
Cc: Andrew Morton 
Cc: Dan Williams 
Cc: Michal Hocko 

Aneesh Kumar K.V (2):
  mm/memunmap: Use the correct start and end pfn when removing pages
from zone
  mm/memmap_init: Update variable name in memmap_init_zone

David Hildenbrand (8):
  mm/memory_hotplug: Don't access uninitialized memmaps in
shrink_pgdat_span()
  mm/memory_hotplug: Don't access uninitialized memmaps in
shrink_zone_span()
  mm/memory_hotplug: Shrink zones when offlining memory
  mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()
  mm/memory_hotplug: We always have a zone in
find_(smallest|biggest)_section_pfn
  mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()
  mm/memory_hotplug: Drop local variables in shrink_zone_span()
  mm/memory_hotplug: Cleanup __remove_pages()

 arch/arm64/mm/mmu.c|   4 +-
 arch/ia64/mm/init.c|   4 +-
 arch/powerpc/mm/mem.c  |   3 +-
 arch/s390/mm/init.c|   4 +-
 arch/sh/mm/init.c  |   4 +-
 arch/x86/mm/init_32.c  |   4 +-
 arch/x86/mm/init_64.c  |   4 +-
 include/linux/memory_hotplug.h |   7 +-
 mm/memory_hotplug.c| 184 +++--
 mm/memremap.c  |  14 ++-
 mm/page_alloc.c|   8 +-
 11 files changed, 88 insertions(+), 152 deletions(-)

-- 
2.21.0



Re: [PATCH v6 1/9] dt-bindings: ibm, secureboot: secure boot specific properties for PowerNV

2019-10-01 Thread Rob Herring
On Fri, Sep 27, 2019 at 10:25:52AM -0400, Nayna Jain wrote:
> PowerNV represents both the firmware and Host OS secureboot state of the
> system via device tree. This patch adds the documentation to give
> the definition of the nodes and the properties.
> 
> Signed-off-by: Nayna Jain 
> ---
>  .../bindings/powerpc/ibm,secureboot.rst   | 76 
>  .../devicetree/bindings/powerpc/secvar.rst| 89 +++
>  2 files changed, 165 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
>  create mode 100644 Documentation/devicetree/bindings/powerpc/secvar.rst
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst 
> b/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
> new file mode 100644
> index ..03d32099d2eb
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/ibm,secureboot.rst
> @@ -0,0 +1,76 @@
> +# SPDX-License-Identifier: GPL-2.0

Not the right form for reST files.

> +*** NOTE ***
> +This document is copied from OPAL firmware
> +(skiboot/doc/device-tree/ibm,secureboot.rst)

Why copy into the kernel?

Plus, the bindings are in the process of being converted to schema. What 
would I do with these files?

> +
> +.. _device-tree/ibm,secureboot:
> +
> +ibm,secureboot
> +==
> +
> +The ``ibm,secureboot`` node provides secure boot and trusted boot information
> +up to the target OS. Further information can be found in :ref:`stb-overview`.
> +
> +Required properties
> +---
> +
> +.. code-block:: none
> +
> +compatible: Either one of the following values:
> +
> +ibm,secureboot-v1  :  The container-verification-code
> +  is stored in a secure ROM 
> memory.
> +
> +ibm,secureboot-v2  :  The container-verification-code
> +  is stored in a reserved memory.
> +  It is described by the ibm,cvc 
> child
> +  node.
> +
> +ibm,secureboot-v3  :  The container-verification-code
> +  is stored in a reserved memory.
> +  It is described by the ibm,cvc 
> child
> +  node. Secure variables are
> +  supported. `secvar` node should
> +  be created.
> +
> +secure-enabled: this property exists when the firmware stack is 
> booting
> +in secure mode (hardware secure boot jumper 
> asserted).
> +
> +trusted-enabled:this property exists when the firmware stack is 
> booting
> +in trusted mode.
> +
> +hw-key-hash:hash of the three hardware public keys trusted by the
> +platform owner. This is used to verify if a firmware
> +code is signed with trusted keys.
> +
> +hw-key-hash-size:   hw-key-hash size
> +
> +secvar: this node is created if the platform supports secure
> +variables. Contains information about the current
> +secvar status, see 'secvar.rst'.
> +
> +Obsolete properties
> +---
> +
> +.. code-block:: none
> +
> +hash-algo:  Superseded by the hw-key-hash-size property in
> +'ibm,secureboot-v2'.
> +
> +Example
> +---
> +
> +.. code-block:: dts
> +
> +ibm,secureboot {
> +compatible = "ibm,secureboot-v2";
> +secure-enabled;
> +trusted-enabled;
> +hw-key-hash-size = <0x40>;
> +hw-key-hash = <0x40d487ff 0x7380ed6a 0xd54775d5 0x795fea0d 0xe2f541fe
> +   0xa9db06b8 0x466a42a3 0x20e65f75 0xb4866546 0x0017d907
> +   0x515dc2a5 0xf9fc5095 0x4d6ee0c9 0xb67d219d 0xfb708535
> +   0x1d01d6d1>;
> +phandle = <0x10fd>;
> +linux,phandle = <0x10fd>;
> +};
> diff --git a/Documentation/devicetree/bindings/powerpc/secvar.rst 
> b/Documentation/devicetree/bindings/powerpc/secvar.rst
> new file mode 100644
> index ..47793ab9c2a7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/powerpc/secvar.rst
> @@ -0,0 +1,89 @@
> +# SPDX-License-Identifier: GPL-2.0
> +*** NOTE ***
> +This document is copied from OPAL firmware
> +(skiboot/doc/device-tree/secvar.rst)
> +
> +.. _device-tree/ibm,secureboot/secvar:
> +
> +secvar
> +==
> +
> +The ``secvar`` node provides secure variable information for the secure
> +boot of the target OS.
> +
> +Required properties
> +---
> +
> +.. code-block:: none
> +
> +compatible: this property is set based on the current secure
> +va

[PATCH] powerpc/pseries: Remove confusing warning message.

2019-10-01 Thread Laurent Dufour
Since the commit 1211ee61b4a8 ("powerpc/pseries: Read TLB Block Invalidate
Characteristics"), a warning message is displayed when booting a guest on
top of KVM:

lpar: arch/powerpc/platforms/pseries/lpar.c 
pseries_lpar_read_hblkrm_characteristics Error calling get-system-parameter 
(0xfffd)

This message is displayed because the hypervisor does not support the
H_BLOCK_REMOVE hcall and thus does not expose the corresponding feature.

Reading the TLB Block Invalidate Characteristics should not be done if the
feature is not exposed.

Fixes: 1211ee61b4a8 ("powerpc/pseries: Read TLB Block Invalidate 
Characteristics")
Reported-by: Stephen Rothwell 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/lpar.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index b53359258d99..f87a5c64e24d 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1419,6 +1419,9 @@ void __init pseries_lpar_read_hblkrm_characteristics(void)
unsigned char local_buffer[SPLPAR_TLB_BIC_MAXLENGTH];
int call_status, len, idx, bpsize;
 
+   if (!firmware_has_feature(FW_FEATURE_BLOCK_REMOVE))
+   return;
+
spin_lock(&rtas_data_buf_lock);
memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
-- 
2.23.0



Applied "ASoC: fsl_asrc: Use in(out)put_format instead of in(out)put_word_width" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: fsl_asrc: Use in(out)put_format instead of in(out)put_word_width

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 4bf62571070dd1021556e275d9221f736b2ffcf3 Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 27 Sep 2019 09:46:09 +0800
Subject: [PATCH] ASoC: fsl_asrc: Use in(out)put_format instead of
 in(out)put_word_width

snd_pcm_format_t is more formal than enum asrc_word_width: it carries
two properties, width and physical width, which makes it more accurate
than enum asrc_word_width. So it is better to use in(out)put_format
instead of in(out)put_word_width.

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
Link: 
https://lore.kernel.org/r/7937c1404ee327ce141cb03b3575b02ea01a740c.1569493933.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/fsl_asrc.c | 56 +++-
 sound/soc/fsl/fsl_asrc.h |  4 +--
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index cfa40ef6b1ca..4d3804a1ea55 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -265,6 +265,8 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
struct asrc_config *config = pair->config;
struct fsl_asrc *asrc_priv = pair->asrc_priv;
enum asrc_pair_index index = pair->index;
+   enum asrc_word_width input_word_width;
+   enum asrc_word_width output_word_width;
u32 inrate, outrate, indiv, outdiv;
u32 clk_index[2], div[2];
int in, out, channels;
@@ -283,9 +285,32 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
return -EINVAL;
}
 
-   /* Validate output width */
-   if (config->output_word_width == ASRC_WIDTH_8_BIT) {
-   pair_err("does not support 8bit width output\n");
+   switch (snd_pcm_format_width(config->input_format)) {
+   case 8:
+   input_word_width = ASRC_WIDTH_8_BIT;
+   break;
+   case 16:
+   input_word_width = ASRC_WIDTH_16_BIT;
+   break;
+   case 24:
+   input_word_width = ASRC_WIDTH_24_BIT;
+   break;
+   default:
+   pair_err("does not support this input format, %d\n",
+config->input_format);
+   return -EINVAL;
+   }
+
+   switch (snd_pcm_format_width(config->output_format)) {
+   case 16:
+   output_word_width = ASRC_WIDTH_16_BIT;
+   break;
+   case 24:
+   output_word_width = ASRC_WIDTH_24_BIT;
+   break;
+   default:
+   pair_err("does not support this output format, %d\n",
+config->output_format);
return -EINVAL;
}
 
@@ -383,8 +408,8 @@ static int fsl_asrc_config_pair(struct fsl_asrc_pair *pair)
/* Implement word_width configurations */
regmap_update_bits(asrc_priv->regmap, REG_ASRMCR1(index),
   ASRMCR1i_OW16_MASK | ASRMCR1i_IWD_MASK,
-  ASRMCR1i_OW16(config->output_word_width) |
-  ASRMCR1i_IWD(config->input_word_width));
+  ASRMCR1i_OW16(output_word_width) |
+  ASRMCR1i_IWD(input_word_width));
 
/* Enable BUFFER STALL */
regmap_update_bits(asrc_priv->regmap, REG_ASRMCR(index),
@@ -497,13 +522,13 @@ static int fsl_asrc_dai_hw_params(struct 
snd_pcm_substream *substream,
  struct snd_soc_dai *dai)
 {
struct fsl_asrc *asrc_priv = snd_soc_dai_get_drvdata(dai);
-   int width = params_width(params);
struct snd_pcm_runtime *runtime = substream->runtime;
struct fsl_asrc_pair *pair = runtime->private_data;
unsigned int channels = params_channels(params);
unsigned int rate = params_rate(params);
struct asrc_config config;
-   int word_width, ret;
+   snd_pcm_format_t format;
+   int ret;
 
ret = fsl_asrc_request_pair(channels, pair);
if (ret) {
@@ -513,15 +538,10 @@ static int fsl_asrc_dai_hw_params(struct snd

Applied "ASoC: fsl_asrc: update supported sample format" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: fsl_asrc: update supported sample format

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

From 109539c986cee525e5ff9ae98793f23c2b29e54d Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 27 Sep 2019 09:46:10 +0800
Subject: [PATCH] ASoC: fsl_asrc: update supported sample format

The ASRC supports 24-bit/16-bit/8-bit input widths, which are
data widths, not slot widths.

For the S20_3LE format, the data width is 20 bits and the slot
width is 24 bits. If we set ASRMCR1n.IWD to 24 bits, the resulting
volume is lower than expected, as if the 24-bit data had been
right-shifted by 4 bits.

So replace S20_3LE with S24_3LE in the supported list and add the
S8 format to the TX supported list.

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
Link: https://lore.kernel.org/r/45a7c383f43cc1dd9d0934846447aee653278c03.1569493933.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/fsl_asrc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index 4d3804a1ea55..584badf956d2 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -624,7 +624,7 @@ static int fsl_asrc_dai_probe(struct snd_soc_dai *dai)
 
 #define FSL_ASRC_FORMATS   (SNDRV_PCM_FMTBIT_S24_LE | \
 SNDRV_PCM_FMTBIT_S16_LE | \
-SNDRV_PCM_FMTBIT_S20_3LE)
+SNDRV_PCM_FMTBIT_S24_3LE)
 
 static struct snd_soc_dai_driver fsl_asrc_dai = {
.probe = fsl_asrc_dai_probe,
@@ -635,7 +635,8 @@ static struct snd_soc_dai_driver fsl_asrc_dai = {
.rate_min = 5512,
.rate_max = 192000,
.rates = SNDRV_PCM_RATE_KNOT,
-   .formats = FSL_ASRC_FORMATS,
+   .formats = FSL_ASRC_FORMATS |
+  SNDRV_PCM_FMTBIT_S8,
},
.capture = {
.stream_name = "ASRC-Capture",
-- 
2.20.1



Applied "ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_refine_runtime_hwparams" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: pcm_dmaengine: Extract snd_dmaengine_pcm_refine_runtime_hwparams

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5


Thanks,
Mark

From e957204e732bc2916a241dc61dd7dd14e9a98350 Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 27 Sep 2019 09:46:11 +0800
Subject: [PATCH] ASoC: pcm_dmaengine: Extract
 snd_dmaengine_pcm_refine_runtime_hwparams

When setting the runtime hardware parameters, we may need to query
the DMA capabilities to complete the parameters.

This patch extracts this operation from the
dmaengine_pcm_set_runtime_hwparams function into a separate function,
snd_dmaengine_pcm_refine_runtime_hwparams, so that other components
which need this feature can call it.

Signed-off-by: Shengjiu Wang 
Reviewed-by: Nicolin Chen 
Link: https://lore.kernel.org/r/d728f65194e9978cbec4132b522d4fed420d704a.1569493933.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 include/sound/dmaengine_pcm.h |  5 ++
 sound/core/pcm_dmaengine.c| 83 +++
 sound/soc/soc-generic-dmaengine-pcm.c | 61 ++--
 3 files changed, 94 insertions(+), 55 deletions(-)

diff --git a/include/sound/dmaengine_pcm.h b/include/sound/dmaengine_pcm.h
index c679f6116580..b65220685920 100644
--- a/include/sound/dmaengine_pcm.h
+++ b/include/sound/dmaengine_pcm.h
@@ -83,6 +83,11 @@ void snd_dmaengine_pcm_set_config_from_dai_data(
const struct snd_dmaengine_dai_dma_data *dma_data,
struct dma_slave_config *config);
 
+int snd_dmaengine_pcm_refine_runtime_hwparams(
+   struct snd_pcm_substream *substream,
+   struct snd_dmaengine_dai_dma_data *dma_data,
+   struct snd_pcm_hardware *hw,
+   struct dma_chan *chan);
 
 /*
  * Try to request the DMA channel using compat_request_channel or
diff --git a/sound/core/pcm_dmaengine.c b/sound/core/pcm_dmaengine.c
index 89a05926ac73..5749a8a49784 100644
--- a/sound/core/pcm_dmaengine.c
+++ b/sound/core/pcm_dmaengine.c
@@ -369,4 +369,87 @@ int snd_dmaengine_pcm_close_release_chan(struct snd_pcm_substream *substream)
 }
 EXPORT_SYMBOL_GPL(snd_dmaengine_pcm_close_release_chan);
 
+/**
+ * snd_dmaengine_pcm_refine_runtime_hwparams - Refine runtime hw params
+ * @substream: PCM substream
+ * @dma_data: DAI DMA data
+ * @hw: PCM hw params
+ * @chan: DMA channel to use for data transfers
+ *
+ * Returns 0 on success, a negative error code otherwise.
+ *
+ * This function will query DMA capability, then refine the pcm hardware
+ * parameters.
+ */
+int snd_dmaengine_pcm_refine_runtime_hwparams(
+   struct snd_pcm_substream *substream,
+   struct snd_dmaengine_dai_dma_data *dma_data,
+   struct snd_pcm_hardware *hw,
+   struct dma_chan *chan)
+{
+   struct dma_slave_caps dma_caps;
+   u32 addr_widths = BIT(DMA_SLAVE_BUSWIDTH_1_BYTE) |
+ BIT(DMA_SLAVE_BUSWIDTH_2_BYTES) |
+ BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
+   snd_pcm_format_t i;
+   int ret = 0;
+
+   if (!hw || !chan || !dma_data)
+   return -EINVAL;
+
+   ret = dma_get_slave_caps(chan, &dma_caps);
+   if (ret == 0) {
+   if (dma_caps.cmd_pause && dma_caps.cmd_resume)
+   hw->info |= SNDRV_PCM_INFO_PAUSE | SNDRV_PCM_INFO_RESUME;
+   if (dma_caps.residue_granularity <= DMA_RESIDUE_GRANULARITY_SEGMENT)
+   hw->info |= SNDRV_PCM_INFO_BATCH;
+
+   if (substream->stream == SNDRV_PCM_STREAM_PLAYBACK)
+   addr_widths = dma_caps.dst_addr_widths;
+   else
+   addr_widths = dma_caps.src_addr_widths;
+   }
+
+   /*
+* If SND_DMAENGINE_PCM_DAI_FLAG_PACK is set keep
+* hw.formats set to 0, meaning no restrictions are in place.
+* In this case it's the responsibility of the DAI driver to
+* provide the supported format information.
+*/
+   if (!(dma_data->flags & SND_DMAENGINE_PCM_DAI_FLAG_PACK))
+   /*
+* Prepare formats mask for valid/allowed sample types. If the
+* dma does not have support for the given physical word size,
+

Applied "ASoC: fsl_mqs: add DT binding documentation" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: fsl_mqs: add DT binding documentation

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5


Thanks,
Mark

From 75234212c446cef3272a025b588b2e418158ed30 Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 13 Sep 2019 17:42:13 +0800
Subject: [PATCH] ASoC: fsl_mqs: add DT binding documentation

Add the DT binding documentation for NXP MQS driver

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
Reviewed-by: Rob Herring 
Link: https://lore.kernel.org/r/65e1f035aea2951aacda54aa3a751bc244f72f6a.1568367274.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 .../devicetree/bindings/sound/fsl,mqs.txt | 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/sound/fsl,mqs.txt

diff --git a/Documentation/devicetree/bindings/sound/fsl,mqs.txt b/Documentation/devicetree/bindings/sound/fsl,mqs.txt
new file mode 100644
index ..40353fc30255
--- /dev/null
+++ b/Documentation/devicetree/bindings/sound/fsl,mqs.txt
@@ -0,0 +1,36 @@
+fsl,mqs audio CODEC
+
+Required properties:
+  - compatible : Must contain one of "fsl,imx6sx-mqs", "fsl,codec-mqs"
+   "fsl,imx8qm-mqs", "fsl,imx8qxp-mqs".
+  - clocks : A list of phandles + clock-specifiers, one for each entry in
+clock-names
+  - clock-names : "mclk" - must required.
+ "core" - required if compatible is "fsl,imx8qm-mqs", it
+  is for register access.
+  - gpr : A phandle of General Purpose Registers in IOMUX Controller.
+ Required if compatible is "fsl,imx6sx-mqs".
+
+Required if compatible is "fsl,imx8qm-mqs":
+  - power-domains: A phandle of PM domain provider node.
+  - reg: Offset and length of the register set for the device.
+
+Example:
+
+mqs: mqs {
+   compatible = "fsl,imx6sx-mqs";
+   gpr = <&gpr>;
+   clocks = <&clks IMX6SX_CLK_SAI1>;
+   clock-names = "mclk";
+   status = "disabled";
+};
+
+mqs: mqs@5985 {
+   compatible = "fsl,imx8qm-mqs";
+   reg = <0x5985 0x1>;
+   clocks = <&clk IMX8QM_AUD_MQS_IPG>,
+<&clk IMX8QM_AUD_MQS_HMCLK>;
+   clock-names = "core", "mclk";
+   power-domains = <&pd_mqs0>;
+   status = "disabled";
+};
-- 
2.20.1



Applied "ASoC: fsl_mqs: Add MQS component driver" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: fsl_mqs: Add MQS component driver

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5


Thanks,
Mark

From 9e28f6532c611c0c3fa759d2101aba9f0d41e860 Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 13 Sep 2019 17:42:14 +0800
Subject: [PATCH] ASoC: fsl_mqs: Add MQS component driver

MQS (medium quality sound), is used to generate medium quality
audio via a standard digital output pin. It can be used to
connect stereo speakers or headphones simply via power amplifier
stages without an additional DAC chip. It only accepts 2-channel,
LSB-valid 16bit, MSB shift-out first, frame sync asserting with
the first bit of the frame, data shifted with the posedge of
bit clock, 44.1 kHz or 48 kHz signals from SAI1 in left justified
format; and it provides the SNR target as no more than 20dB for
the signals below 10 kHz. The signals above 10 kHz will have
worse THD+N values.

MQS provides only simple audio reproduction. No internal pop,
click or distortion artifact reduction methods are provided.

The MQS receives the audio data from the SAI1 Tx section.

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
Link: https://lore.kernel.org/r/74dfc73a92d2df4213225abe7d2a3db82672fe0f.1568367274.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/Kconfig   |  10 ++
 sound/soc/fsl/Makefile  |   2 +
 sound/soc/fsl/fsl_mqs.c | 333 
 3 files changed, 345 insertions(+)
 create mode 100644 sound/soc/fsl/fsl_mqs.c

diff --git a/sound/soc/fsl/Kconfig b/sound/soc/fsl/Kconfig
index aa99c008a925..65e8cd4be930 100644
--- a/sound/soc/fsl/Kconfig
+++ b/sound/soc/fsl/Kconfig
@@ -25,6 +25,16 @@ config SND_SOC_FSL_SAI
  This option is only useful for out-of-tree drivers since
  in-tree drivers select it automatically.
 
+config SND_SOC_FSL_MQS
+   tristate "Medium Quality Sound (MQS) module support"
+   depends on SND_SOC_FSL_SAI
+   select REGMAP_MMIO
+   help
+ Say Y if you want to add Medium Quality Sound (MQS)
+ support for the Freescale CPUs.
+ This option is only useful for out-of-tree drivers since
+ in-tree drivers select it automatically.
+
 config SND_SOC_FSL_AUDMIX
tristate "Audio Mixer (AUDMIX) module support"
select REGMAP_MMIO
diff --git a/sound/soc/fsl/Makefile b/sound/soc/fsl/Makefile
index c0dd04422fe9..8cde88c72d93 100644
--- a/sound/soc/fsl/Makefile
+++ b/sound/soc/fsl/Makefile
@@ -23,6 +23,7 @@ snd-soc-fsl-esai-objs := fsl_esai.o
 snd-soc-fsl-micfil-objs := fsl_micfil.o
 snd-soc-fsl-utils-objs := fsl_utils.o
 snd-soc-fsl-dma-objs := fsl_dma.o
+snd-soc-fsl-mqs-objs := fsl_mqs.o
 
 obj-$(CONFIG_SND_SOC_FSL_AUDMIX) += snd-soc-fsl-audmix.o
 obj-$(CONFIG_SND_SOC_FSL_ASOC_CARD) += snd-soc-fsl-asoc-card.o
@@ -33,6 +34,7 @@ obj-$(CONFIG_SND_SOC_FSL_SPDIF) += snd-soc-fsl-spdif.o
 obj-$(CONFIG_SND_SOC_FSL_ESAI) += snd-soc-fsl-esai.o
 obj-$(CONFIG_SND_SOC_FSL_MICFIL) += snd-soc-fsl-micfil.o
 obj-$(CONFIG_SND_SOC_FSL_UTILS) += snd-soc-fsl-utils.o
+obj-$(CONFIG_SND_SOC_FSL_MQS) += snd-soc-fsl-mqs.o
 obj-$(CONFIG_SND_SOC_POWERPC_DMA) += snd-soc-fsl-dma.o
 
 # MPC5200 Platform Support
diff --git a/sound/soc/fsl/fsl_mqs.c b/sound/soc/fsl/fsl_mqs.c
new file mode 100644
index ..c1619a553514
--- /dev/null
+++ b/sound/soc/fsl/fsl_mqs.c
@@ -0,0 +1,333 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+// ALSA SoC IMX MQS driver
+//
+// Copyright (C) 2014-2015 Freescale Semiconductor, Inc.
+// Copyright 2019 NXP
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define REG_MQS_CTRL   0x00
+
+#define MQS_EN_MASK            (0x1 << 28)
+#define MQS_EN_SHIFT           (28)
+#define MQS_SW_RST_MASK        (0x1 << 24)
+#define MQS_SW_RST_SHIFT       (24)
+#define MQS_OVERSAMPLE_MASK    (0x1 << 20)
+#define MQS_OVERSAMPLE_SHIFT   (20)
+#define MQS_CLK_DIV_MASK   (0xFF << 0)
+#define MQS_CLK_DIV_SHIFT  (0)
+
+/* codec private data */
+struct fsl_mqs {
+   struct regmap *regmap;
+   struct clk *mclk;
+ 

Applied "ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8" to the asoc tree

2019-10-01 Thread Mark Brown
The patch

   ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in i.MX8

has been applied to the asoc tree at

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-5.5


Thanks,
Mark

From 703df4413ff6cf1812922522daa7c0610f087910 Mon Sep 17 00:00:00 2001
From: Shengjiu Wang 
Date: Fri, 27 Sep 2019 09:46:12 +0800
Subject: [PATCH] ASoC: fsl_asrc: Fix error with S24_3LE format bitstream in
 i.MX8

There is error "aplay: pcm_write:2023: write error: Input/output error"
on i.MX8QM/i.MX8QXP platform for S24_3LE format.

In i.MX8QM/i.MX8QXP, the DMA is EDMA, which doesn't support 24-bit
samples, but we didn't add any constraint, which caused issues.

So we need to query the DMA capabilities, then update the hw parameters
according to those capabilities.

Signed-off-by: Shengjiu Wang 
Acked-by: Nicolin Chen 
Link: https://lore.kernel.org/r/b6a4de2bbf960ef291ee902afe4388bd0fc1d347.1569493933.git.shengjiu.w...@nxp.com
Signed-off-by: Mark Brown 
---
 sound/soc/fsl/fsl_asrc.c |  4 +--
 sound/soc/fsl/fsl_asrc.h |  3 ++
 sound/soc/fsl/fsl_asrc_dma.c | 64 
 3 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index 584badf956d2..0bf91a6f54b9 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -115,7 +115,7 @@ static void fsl_asrc_sel_proc(int inrate, int outrate,
  * within range [ANCA, ANCA+ANCB-1], depends on the channels of pair A
  * while pair A and pair C are comparatively independent.
  */
-static int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair)
+int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair)
 {
enum asrc_pair_index index = ASRC_INVALID_PAIR;
struct fsl_asrc *asrc_priv = pair->asrc_priv;
@@ -158,7 +158,7 @@ static int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair)
  *
  * It clears the resource from asrc_priv and releases the occupied channels.
  */
-static void fsl_asrc_release_pair(struct fsl_asrc_pair *pair)
+void fsl_asrc_release_pair(struct fsl_asrc_pair *pair)
 {
struct fsl_asrc *asrc_priv = pair->asrc_priv;
enum asrc_pair_index index = pair->index;
diff --git a/sound/soc/fsl/fsl_asrc.h b/sound/soc/fsl/fsl_asrc.h
index 38af485bdd22..2b57e8c53728 100644
--- a/sound/soc/fsl/fsl_asrc.h
+++ b/sound/soc/fsl/fsl_asrc.h
@@ -462,4 +462,7 @@ struct fsl_asrc {
 #define DRV_NAME "fsl-asrc-dai"
 extern struct snd_soc_component_driver fsl_asrc_component;
 struct dma_chan *fsl_asrc_get_dma_channel(struct fsl_asrc_pair *pair, bool dir);
+int fsl_asrc_request_pair(int channels, struct fsl_asrc_pair *pair);
+void fsl_asrc_release_pair(struct fsl_asrc_pair *pair);
+
 #endif /* _FSL_ASRC_H */
diff --git a/sound/soc/fsl/fsl_asrc_dma.c b/sound/soc/fsl/fsl_asrc_dma.c
index 01052a0808b0..2a60fc6142b1 100644
--- a/sound/soc/fsl/fsl_asrc_dma.c
+++ b/sound/soc/fsl/fsl_asrc_dma.c
@@ -16,13 +16,11 @@
 
 #define FSL_ASRC_DMABUF_SIZE   (256 * 1024)
 
-static const struct snd_pcm_hardware snd_imx_hardware = {
+static struct snd_pcm_hardware snd_imx_hardware = {
.info = SNDRV_PCM_INFO_INTERLEAVED |
SNDRV_PCM_INFO_BLOCK_TRANSFER |
SNDRV_PCM_INFO_MMAP |
-   SNDRV_PCM_INFO_MMAP_VALID |
-   SNDRV_PCM_INFO_PAUSE |
-   SNDRV_PCM_INFO_RESUME,
+   SNDRV_PCM_INFO_MMAP_VALID,
.buffer_bytes_max = FSL_ASRC_DMABUF_SIZE,
.period_bytes_min = 128,
.period_bytes_max = 65535, /* Limited by SDMA engine */
@@ -270,12 +268,25 @@ static int fsl_asrc_dma_hw_free(struct snd_pcm_substream *substream)
 
 static int fsl_asrc_dma_startup(struct snd_pcm_substream *substream)
 {
+   bool tx = substream->stream == SNDRV_PCM_STREAM_PLAYBACK;
struct snd_soc_pcm_runtime *rtd = substream->private_data;
struct snd_pcm_runtime *runtime = substream->runtime;
struct snd_soc_component *component = snd_soc_rtdcom_lookup(rtd, DRV_NAME);
+   struct snd_dmaengine_dai_dma_data *dma_data;
struct device *dev = component->dev;
struct fsl_asrc *asrc_priv = dev_get_drvdata(dev);
struct fsl_asrc_pair *pair;
+   struct dma_chan *tmp_chan = NULL;
+   u8 dir =

Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-01 Thread Uladzislau Rezki
Hello, Daniel.

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a3c70e275f4e..9fb7a16f42ae 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -690,8 +690,19 @@ merge_or_add_vmap_area(struct vmap_area *va,
>   struct list_head *next;
>   struct rb_node **link;
>   struct rb_node *parent;
> + unsigned long orig_start, orig_end;
Shouldn't that be wrapped around #ifdef CONFIG_KASAN_VMALLOC?

>   bool merged = false;
>  
> + /*
> +  * To manage KASAN vmalloc memory usage, we use this opportunity to
> +  * clean up the shadow memory allocated to back this allocation.
> +  * Because a vmalloc shadow page covers several pages, the start or end
> +  * of an allocation might not align with a shadow page. Use the merging
> +  * opportunities to try to extend the region we can release.
> +  */
> + orig_start = va->va_start;
> + orig_end = va->va_end;
> +
The same.

>   /*
>* Find a place in the tree where VA potentially will be
>* inserted, unless it is merged with its sibling/siblings.
> @@ -741,6 +752,10 @@ merge_or_add_vmap_area(struct vmap_area *va,
>   if (sibling->va_end == va->va_start) {
>   sibling->va_end = va->va_end;
>  
> + kasan_release_vmalloc(orig_start, orig_end,
> +   sibling->va_start,
> +   sibling->va_end);
> +
The same.

>   /* Check and update the tree if needed. */
>   augment_tree_propagate_from(sibling);
>  
> @@ -754,6 +769,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
>   }
>  
>  insert:
> + kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
> +
The same + all further changes in this file.
>   if (!merged) {
>   link_va(va, root, parent, link, head);
>   augment_tree_propagate_from(va);
> @@ -2068,6 +2085,22 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>  
>   setup_vmalloc_vm(area, va, flags, caller);
>  
> + /*
> +  * For KASAN, if we are in vmalloc space, we need to cover the shadow
> +  * area with real memory. If we come here through VM_ALLOC, this is
> +  * done by a higher level function that has access to the true size,
> +  * which might not be a full page.
> +  *
> +  * We assume module space comes via VM_ALLOC path.
> +  */
> + if (is_vmalloc_addr(area->addr) && !(area->flags & VM_ALLOC)) {
> + if (kasan_populate_vmalloc(area->size, area)) {
> + unmap_vmap_area(va);
> + kfree(area);
> + return NULL;
> + }
> + }
> +
>   return area;
>  }
>  
> @@ -2245,6 +2278,9 @@ static void __vunmap(const void *addr, int deallocate_pages)
>   debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>   debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>  
> + if (area->flags & VM_KASAN)
> + kasan_poison_vmalloc(area->addr, area->size);
> +
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> @@ -2497,6 +2533,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   if (!addr)
>   return NULL;
>  
> + if (kasan_populate_vmalloc(real_size, area))
> + return NULL;
> +
>   /*
>* In this function, newly allocated vm_struct has VM_UNINITIALIZED
>* flag. It means that vm_struct is not fully initialized.
> @@ -3351,10 +3390,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>   spin_unlock(&vmap_area_lock);
>  
>   /* insert all vm's */
> - for (area = 0; area < nr_vms; area++)
> + for (area = 0; area < nr_vms; area++) {
>   setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
>pcpu_get_vm_areas);
>  
> + /* assume success here */
> + kasan_populate_vmalloc(sizes[area], vms[area]);
> + }
> +
>   kfree(vas);
>   return vms;
>  


--
Vlad Rezki


[PATCH] powerpc/perf: Add functionality to check PERF_EVENTS config option

2019-10-01 Thread Kajol Jain
Perf is the primary interface to program the performance monitoring
unit (PMU) and collect counter data in the system.
But currently the PMU register files are created in
/sys/devices/system/cpu/cpu* without checking the CONFIG_PERF_EVENTS
option. These include the PMC* and MMCR* SPRs.
This patch ties sysfs PMU SPR file creation to the CONFIG_PERF_EVENTS option.

Tested this patch with CONFIG_PERF_EVENTS enabled and disabled
on powernv and pseries machines.
Also did compilation testing with book3s_32.config.

Signed-off-by: Kajol Jain 
---
 arch/powerpc/kernel/sysfs.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index e2147d7c9e72..263023cc6308 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -456,16 +456,21 @@ static ssize_t __used \
 
 #if defined(CONFIG_PPC64)
 #define HAS_PPC_PMC_CLASSIC 1
 +#if defined(CONFIG_PERF_EVENTS)
 #define HAS_PPC_PMC_IBM 1
 +#endif
 #define HAS_PPC_PMC_PA6T   1
 #elif defined(CONFIG_PPC_BOOK3S_32)
 #define HAS_PPC_PMC_CLASSIC 1
 +#if defined(CONFIG_PERF_EVENTS)
 #define HAS_PPC_PMC_IBM 1
 #define HAS_PPC_PMC_G4 1
 #endif
+#endif
 
 
 #ifdef HAS_PPC_PMC_CLASSIC
+#ifdef HAS_PPC_PMC_IBM
 SYSFS_PMCSETUP(mmcr0, SPRN_MMCR0);
 SYSFS_PMCSETUP(mmcr1, SPRN_MMCR1);
 SYSFS_PMCSETUP(pmc1, SPRN_PMC1);
@@ -484,6 +489,10 @@ SYSFS_PMCSETUP(pmc7, SPRN_PMC7);
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
 
 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
+#endif /* CONFIG_PPC64 */
+#endif /* HAS_PPC_PMC_IBM */
+
+#ifdef CONFIG_PPC64
 SYSFS_SPRSETUP(purr, SPRN_PURR);
 SYSFS_SPRSETUP(spurr, SPRN_SPURR);
 SYSFS_SPRSETUP(pir, SPRN_PIR);
@@ -494,7 +503,9 @@ SYSFS_SPRSETUP(tscr, SPRN_TSCR);
   enable write when needed with a separate function.
   Lets be conservative and default to pseries.
 */
+#ifdef HAS_PPC_PMC_IBM
 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
+#endif /* HAS_PPC_PMC_IBM */
 static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
 static DEVICE_ATTR(purr, 0400, show_purr, store_purr);
 static DEVICE_ATTR(pir, 0400, show_pir, NULL);
@@ -605,12 +616,14 @@ static void sysfs_create_dscr_default(void)
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
+#ifdef HAS_PPC_PMC_IBM
 SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
 SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
 SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
 SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
+#endif /* HAS_PPC_PMC_IBM */
 #ifdef CONFIG_DEBUG_MISC
 SYSFS_SPRSETUP(hid0, SPRN_HID0);
 SYSFS_SPRSETUP(hid1, SPRN_HID1);
@@ -648,7 +661,6 @@ static struct device_attribute ibm_common_attrs[] = {
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
 };
-#endif /* HAS_PPC_PMC_G4 */
 
 #ifdef HAS_PPC_PMC_G4
 static struct device_attribute g4_common_attrs[] = {
@@ -670,9 +682,11 @@ static struct device_attribute classic_pmc_attrs[] = {
__ATTR(pmc8, 0600, show_pmc8, store_pmc8),
 #endif
 };
+#endif /* HAS_PPC_PMC_IBM */
 
 #ifdef HAS_PPC_PMC_PA6T
 static struct device_attribute pa6t_attrs[] = {
+#ifdef HAS_PPC_PMC_IBM
__ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
__ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
__ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
@@ -681,6 +695,7 @@ static struct device_attribute pa6t_attrs[] = {
__ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
__ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
__ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
+#endif /* HAS_PPC_PMC_IBM */
 #ifdef CONFIG_DEBUG_MISC
__ATTR(hid0, 0600, show_hid0, store_hid0),
__ATTR(hid1, 0600, show_hid1, store_hid1),
@@ -769,8 +784,10 @@ static int register_cpu_online(unsigned int cpu)
device_create_file(s, &pmc_attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_create_file(s, &dev_attr_mmcra);
+#endif
 
if (cpu_has_feature(CPU_FTR_PURR)) {
if (!firmware_has_feature(FW_FEATURE_LPAR))
@@ -858,8 +875,10 @@ static int unregister_cpu_online(unsigned int cpu)
device_remove_file(s, &pmc_attrs[i]);
 
 #ifdef CONFIG_PPC64
+#ifdef HAS_PPC_PMC_IBM
if (cpu_has_feature(CPU_FTR_MMCRA))
device_remove_file(s, &dev_attr_mmcra);
+#endif
 
if (cpu_has_feature(CPU_FTR_PURR))
device_remove_file(s, &dev_attr_purr);
-- 
2.21.0



[Bug 204789] Boot failure with more than 256G of memory on Power9 with 4K pages & Hash MMU

2019-10-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=204789

--- Comment #10 from Michael Ellerman (mich...@ellerman.id.au) ---
Can you boot a good kernel and do:

$ sudo grep RAM /proc/iomem

And paste the output. Just to confirm what your memory layout is.

What arrangement of DIMMs do you have? It's possible you could work around the
bug by changing that, depending on how many DIMMs and slots you have.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 204789] Boot failure with more than 256G of memory on POWER/ppc64

2019-10-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=204789

Michael Ellerman (mich...@ellerman.id.au) changed:

   What|Removed |Added

 Status|NEW |ASSIGNED


[Bug 204789] Boot failure with more than 256G of memory on POWER/ppc64

2019-10-01 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=204789

Michael Ellerman (mich...@ellerman.id.au) changed:

   What|Removed |Added

 CC||mich...@ellerman.id.au
  Component|Other   |PPC-64
   Assignee|a...@linux-foundation.org   |platform_ppc-64@kernel-bugs
   ||.osdl.org
Product|Memory Management   |Platform Specific/Hardware


Re: [PATCH 14/29] vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA

2019-10-01 Thread Will Deacon
On Thu, Sep 26, 2019 at 10:55:47AM -0700, Kees Cook wrote:
> Many architectures have an EXCEPTION_TABLE that needs only to be
> read-only. As such, it should live in RO_DATA. This creates a macro to
> identify this case for the architectures that can move EXCEPTION_TABLE
> into RO_DATA.
> 
> Signed-off-by: Kees Cook 
> ---
>  include/asm-generic/vmlinux.lds.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index d57a28786bb8..35a6cba39d9f 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -69,6 +69,17 @@
>  #define NOTES_HEADERS_RESTORE
>  #endif
>  
> +/*
> + * Some architectures have non-executable read-only exception tables.
> + * They can be added to the RO_DATA segment by specifying their desired
> + * alignment.
> + */
> +#ifdef RO_DATA_EXCEPTION_TABLE_ALIGN
> +#define RO_DATA_EXCEPTION_TABLE  EXCEPTION_TABLE(RO_DATA_EXCEPTION_TABLE_ALIGN)
> +#else
> +#define RO_DATA_EXCEPTION_TABLE
> +#endif
> +
>  /* Align . to a 8 byte boundary equals to maximum function alignment. */
>  #define ALIGN_FUNCTION()  . = ALIGN(8)
>  
> @@ -508,6 +519,7 @@
>   __stop___modver = .;\
>   }   \
>   \
> + RO_DATA_EXCEPTION_TABLE \
>   NOTES   \
>   \
>   . = ALIGN((align)); \

I had to read this one to understand the later arm64 change. It looks
fine to me, so:

Acked-by: Will Deacon 

Will


Re: [PATCH 18/29] arm64: Move EXCEPTION_TABLE to RO_DATA segment

2019-10-01 Thread Will Deacon
Hi Kees,

On Thu, Sep 26, 2019 at 10:55:51AM -0700, Kees Cook wrote:
> The EXCEPTION_TABLE is read-only, so collapse it into RO_DATA.
> 
> Signed-off-by: Kees Cook 
> ---
>  arch/arm64/kernel/vmlinux.lds.S | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 81d94e371c95..c6ba2eee0ee8 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -5,6 +5,8 @@
>   * Written by Martin Mares 
>   */
>  
> +#define RO_DATA_EXCEPTION_TABLE_ALIGN8
> +
>  #include 
>  #include 
>  #include 
> @@ -135,8 +137,8 @@ SECTIONS
>   . = ALIGN(SEGMENT_ALIGN);
>   _etext = .; /* End of text section */
>  
> - RO_DATA(PAGE_SIZE)  /* everything from this point to */
> - EXCEPTION_TABLE(8)  /* __init_begin will be marked RO NX */
> + /* everything from this point to __init_begin will be marked RO NX */
> + RO_DATA(PAGE_SIZE)
>  
>   . = ALIGN(PAGE_SIZE);

Do you reckon it would be worth merging this last ALIGN directive into the
RO_DATA definition too? Given that we want to map the thing read-only, it
really has to be aligned either side.

Anyway, that's only a nit, so:

Acked-by: Will Deacon 

Will

P.S. Please CC the arm64 maintainers on arm64 patches -- I nearly missed
this one!


[PATCH] powerpc/book3s64/hash: Add cond_resched to avoid soft lockup warning

2019-10-01 Thread Aneesh Kumar K.V
With large memory (8TB and more) hotplug, we can get soft lockup warnings
as below. These were caused by a long loop without any explicit cond_resched
which is a problem for !PREEMPT kernels.

Avoid this by using cond_resched() while inserting hash page table entries.
We already do a similar cond_resched() in __add_pages() (commit f64ac5e6e306
("mm, memory_hotplug: add scheduling point to __add_pages")).

 rcu: 3-: (24002 ticks this GP) idle=13e/1/0x4002 softirq=722/722 fqs=12001
  (t=24003 jiffies g=4285 q=2002)
 NMI backtrace for cpu 3
 CPU: 3 PID: 3870 Comm: ndctl Not tainted 5.3.0-197.18-default+ #2
 Call Trace:
 [c000692b3040] [c0f759fc] dump_stack+0xb0/0xf4 (unreliable)
 [c000692b3080] [c0f80ee4] nmi_cpu_backtrace+0x124/0x130
 [c000692b3100] [c0f8109c] nmi_trigger_cpumask_backtrace+0x1ac/0x1f0
 [c000692b31a0] [c006c2f8] arch_trigger_cpumask_backtrace+0x28/0x3c
 [c000692b31c0] [c01eb1e4] rcu_dump_cpu_stacks+0xf8/0x154
 [c000692b3210] [c01e9e88] rcu_sched_clock_irq+0x878/0xb40
 [c000692b32f0] [c01fe878] update_process_times+0x48/0x90
 [c000692b3320] [c0217a6c] tick_sched_handle.isra.16+0x4c/0x80
 [c000692b3340] [c0217f08] tick_sched_timer+0x68/0xe0
 [c000692b3380] [c01ffee0] __hrtimer_run_queues+0x180/0x430
 [c000692b3400] [c0200bd0] hrtimer_interrupt+0x110/0x300
 [c000692b34b0] [c002c128] timer_interrupt+0x108/0x2f0
 [c000692b3510] [c0009194] decrementer_common+0x114/0x120
 --- interrupt: 901 at arch_add_memory+0xc0/0x130
 LR = arch_add_memory+0x74/0x130
 [c000692b38b0] [c04646b4] memremap_pages+0x494/0x650
 [c000692b3960] [c04648ac] devm_memremap_pages+0x3c/0xa0
 [c000692b39a0] [c0b3f4e8] pmem_attach_disk+0x188/0x750
 [c000692b3aa0] [c0b2506c] nvdimm_bus_probe+0xac/0x2c0
 [c000692b3b30] [c0aeb218] really_probe+0x148/0x570
 [c000692b3bc0] [c0aeba1c] driver_probe_device+0x19c/0x1d0
 [c000692b3c40] [c0aebe5c] device_driver_attach+0xcc/0x100
 [c000692b3c80] [c0ae8a14] bind_store+0x134/0x1c0
 [c000692b3cd0] [c0ae7824] drv_attr_store+0x44/0x60
 [c000692b3cf0] [c055e3f4] sysfs_kf_write+0x64/0x90
 [c000692b3d10] [c055d8f0] kernfs_fop_write+0x1a0/0x270
 [c000692b3d60] [c04698ec] __vfs_write+0x3c/0x70
 [c000692b3d80] [c046d260] vfs_write+0xd0/0x260
 [c000692b3dd0] [c046d6ac] ksys_write+0xdc/0x130
 [c000692b3e20] [c000b278] system_call+0x5c/0x68

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index a9d1f72de848..b30435c7d804 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -316,6 +316,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
if (ret < 0)
break;
 
+   cond_resched();
 #ifdef CONFIG_DEBUG_PAGEALLOC
if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
-- 
2.21.0



Re: IOMMU group creation for pseries hotplug, and powernv VFs

2019-10-01 Thread Shawn Anastasio

On 9/29/19 9:08 PM, Oliver O'Halloran wrote:

A couple of extra patches on top of Shawn's existing re-ordering patch.
This seems to fix the problem Alexey noted with Shawn's change causing
VFs to lose their IOMMU group. I've tried pretty hard to make this a
minimal fix, but it's still a bit large.

If mpe is happy to take this as a fix for 5.4 then I'll leave it,
otherwise we might want to look at different approaches.



Thanks for fixing this Oliver!

Reviewed-by: Shawn Anastasio 


[PATCH v8 5/5] kasan debug: track pages allocated for vmalloc shadow

2019-10-01 Thread Daniel Axtens
Provide the current number of vmalloc shadow pages in
/sys/kernel/debug/kasan/vmalloc_shadow_pages.

Signed-off-by: Daniel Axtens 

---

v8: rename kasan_vmalloc/shadow_pages -> kasan/vmalloc_shadow_pages

On v4 (no dynamic freeing), I saw the following approximate figures
on my test VM:

 - fresh boot: 720
 - after test_vmalloc: ~14000

With v5 (lazy dynamic freeing):

 - boot: ~490-500
 - running modprobe test_vmalloc pushes the figures up to sometimes
as high as ~14000, but they drop down to ~560 after the test ends.
I'm not sure where the extra sixty pages are from, but running the
test repeatedly doesn't cause the number to keep growing, so I don't
think we're leaking.
 - with vmap_stack, spawning tasks pushes the figure up to ~4200, then
some clearing kicks in and drops it down to previous levels again.
---
 mm/kasan/common.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index e33cbab83309..5b924f860a32 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -750,6 +751,8 @@ core_initcall(kasan_memhotplug_init);
 #endif
 
 #ifdef CONFIG_KASAN_VMALLOC
+static u64 vmalloc_shadow_pages;
+
 static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
  void *unused)
 {
@@ -776,6 +779,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(pte_none(*ptep))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
+   vmalloc_shadow_pages++;
}
spin_unlock(&init_mm.page_table_lock);
if (page)
@@ -829,6 +833,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!pte_none(*ptep))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
+   vmalloc_shadow_pages--;
}
spin_unlock(&init_mm.page_table_lock);
 
@@ -947,4 +952,25 @@ void kasan_release_vmalloc(unsigned long start, unsigned long end,
   (unsigned long)shadow_end);
}
 }
+
+static __init int kasan_init_debugfs(void)
+{
+   struct dentry *root, *count;
+
+   root = debugfs_create_dir("kasan", NULL);
+   if (IS_ERR(root)) {
+   if (PTR_ERR(root) == -ENODEV)
+   return 0;
+   return PTR_ERR(root);
+   }
+
+   count = debugfs_create_u64("vmalloc_shadow_pages", 0444, root,
+  &vmalloc_shadow_pages);
+
+   if (IS_ERR(count))
+   return PTR_ERR(count);
+
+   return 0;
+}
+late_initcall(kasan_init_debugfs);
 #endif
-- 
2.20.1



[PATCH v8 4/5] x86/kasan: support KASAN_VMALLOC

2019-10-01 Thread Daniel Axtens
In the case where KASAN directly allocates memory to back vmalloc
space, don't map the early shadow page over it.

We prepopulate pgds/p4ds for the range that would otherwise be empty.
This is required to get it synced to hardware on boot, allowing the
lower levels of the page tables to be filled dynamically.

Acked-by: Dmitry Vyukov 
Signed-off-by: Daniel Axtens 

---
v5: fix some checkpatch CHECK warnings. There are some that remain
around lines ending with '(': I have not changed these because
it's consistent with the rest of the file and it's not easy to
see how to fix it without creating an overlong line or lots of
temporary variables.

v2: move from faulting in shadow pgds to prepopulating
---
 arch/x86/Kconfig|  1 +
 arch/x86/mm/kasan_init_64.c | 60 +
 2 files changed, 61 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 96ea2c7449ef..3590651e95f5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -135,6 +135,7 @@ config X86
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN  if X86_64
+   select HAVE_ARCH_KASAN_VMALLOC  if X86_64
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS  if MMU
select HAVE_ARCH_MMAP_RND_COMPAT_BITS   if MMU && COMPAT
diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 296da58f3013..8f00f462709e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -245,6 +245,51 @@ static void __init kasan_map_early_shadow(pgd_t *pgd)
} while (pgd++, addr = next, addr != end);
 }
 
+static void __init kasan_shallow_populate_p4ds(pgd_t *pgd,
+  unsigned long addr,
+  unsigned long end,
+  int nid)
+{
+   p4d_t *p4d;
+   unsigned long next;
+   void *p;
+
+   p4d = p4d_offset(pgd, addr);
+   do {
+   next = p4d_addr_end(addr, end);
+
+   if (p4d_none(*p4d)) {
+   p = early_alloc(PAGE_SIZE, nid, true);
+   p4d_populate(&init_mm, p4d, p);
+   }
+   } while (p4d++, addr = next, addr != end);
+}
+
+static void __init kasan_shallow_populate_pgds(void *start, void *end)
+{
+   unsigned long addr, next;
+   pgd_t *pgd;
+   void *p;
+   int nid = early_pfn_to_nid((unsigned long)start);
+
+   addr = (unsigned long)start;
+   pgd = pgd_offset_k(addr);
+   do {
+   next = pgd_addr_end(addr, (unsigned long)end);
+
+   if (pgd_none(*pgd)) {
+   p = early_alloc(PAGE_SIZE, nid, true);
+   pgd_populate(&init_mm, pgd, p);
+   }
+
+   /*
+* we need to populate p4ds to be synced when running in
+* four level mode - see sync_global_pgds_l4()
+*/
+   kasan_shallow_populate_p4ds(pgd, addr, next, nid);
+   } while (pgd++, addr = next, addr != (unsigned long)end);
+}
+
 #ifdef CONFIG_KASAN_INLINE
 static int kasan_die_handler(struct notifier_block *self,
 unsigned long val,
@@ -352,9 +397,24 @@ void __init kasan_init(void)
shadow_cpu_entry_end = (void *)round_up(
(unsigned long)shadow_cpu_entry_end, PAGE_SIZE);
 
+   /*
+* If we're in full vmalloc mode, don't back vmalloc space with early
+* shadow pages. Instead, prepopulate pgds/p4ds so they are synced to
+* the global table and we can populate the lower levels on demand.
+*/
+#ifdef CONFIG_KASAN_VMALLOC
+   kasan_shallow_populate_pgds(
+   kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
+   kasan_mem_to_shadow((void *)VMALLOC_END));
+
+   kasan_populate_early_shadow(
+   kasan_mem_to_shadow((void *)VMALLOC_END + 1),
+   shadow_cpu_entry_begin);
+#else
kasan_populate_early_shadow(
kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
shadow_cpu_entry_begin);
+#endif
 
kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
  (unsigned long)shadow_cpu_entry_end, 0);
-- 
2.20.1



[PATCH v8 3/5] fork: support VMAP_STACK with KASAN_VMALLOC

2019-10-01 Thread Daniel Axtens
Supporting VMAP_STACK with KASAN_VMALLOC is straightforward:

 - clear the shadow region of vmapped stacks when swapping them in
 - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN

Reviewed-by: Dmitry Vyukov 
Signed-off-by: Daniel Axtens 
---
 arch/Kconfig  | 9 +
 kernel/fork.c | 4 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 5f8a5d84dbbe..2d914990402f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -843,16 +843,17 @@ config HAVE_ARCH_VMAP_STACK
 config VMAP_STACK
default y
bool "Use a virtually-mapped stack"
-   depends on HAVE_ARCH_VMAP_STACK && !KASAN
+   depends on HAVE_ARCH_VMAP_STACK
+   depends on !KASAN || KASAN_VMALLOC
---help---
  Enable this if you want the use virtually-mapped kernel stacks
  with guard pages.  This causes kernel stack overflows to be
  caught immediately rather than causing difficult-to-diagnose
  corruption.
 
- This is presently incompatible with KASAN because KASAN expects
- the stack to map directly to the KASAN shadow map using a formula
- that is incorrect if the stack is in vmalloc space.
+ To use this with KASAN, the architecture must support backing
+ virtual mappings with real shadow memory, and KASAN_VMALLOC must
+ be enabled.
 
 config ARCH_OPTIONAL_KERNEL_RWX
def_bool n
diff --git a/kernel/fork.c b/kernel/fork.c
index 6adbbcf448c3..0c9e6478ba85 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -229,6 +230,9 @@ static unsigned long *alloc_thread_stack_node(struct 
task_struct *tsk, int node)
if (!s)
continue;
 
+   /* Clear the KASAN shadow of the stack. */
+   kasan_unpoison_shadow(s->addr, THREAD_SIZE);
+
/* Clear stale pointers from reused stack. */
memset(s->addr, 0, THREAD_SIZE);
 
-- 
2.20.1



[PATCH v8 2/5] kasan: add test for vmalloc

2019-10-01 Thread Daniel Axtens
Test kasan vmalloc support by adding a new test to the module.

Signed-off-by: Daniel Axtens 

--

v5: split out per Christophe Leroy
---
 lib/test_kasan.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/lib/test_kasan.c b/lib/test_kasan.c
index 49cc4d570a40..328d33beae36 100644
--- a/lib/test_kasan.c
+++ b/lib/test_kasan.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -748,6 +749,30 @@ static noinline void __init kmalloc_double_kzfree(void)
kzfree(ptr);
 }
 
+#ifdef CONFIG_KASAN_VMALLOC
+static noinline void __init vmalloc_oob(void)
+{
+   void *area;
+
+   pr_info("vmalloc out-of-bounds\n");
+
+   /*
+* We have to be careful not to hit the guard page.
+* The MMU will catch that and crash us.
+*/
+   area = vmalloc(3000);
+   if (!area) {
+   pr_err("Allocation failed\n");
+   return;
+   }
+
+   ((volatile char *)area)[3100];
+   vfree(area);
+}
+#else
+static void __init vmalloc_oob(void) {}
+#endif
+
 static int __init kmalloc_tests_init(void)
 {
/*
@@ -793,6 +818,7 @@ static int __init kmalloc_tests_init(void)
kasan_strings();
kasan_bitops();
kmalloc_double_kzfree();
+   vmalloc_oob();
 
kasan_restore_multi_shot(multishot);
 
-- 
2.20.1



[PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-01 Thread Daniel Axtens
Hook into vmalloc and vmap, and dynamically allocate real shadow
memory to back the mappings.

Most mappings in vmalloc space are small, requiring less than a full
page of shadow space. Allocating a full shadow page per mapping would
therefore be wasteful. Furthermore, to ensure that different mappings
use different shadow pages, mappings would have to be aligned to
KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.

Instead, share backing space across multiple mappings. Allocate a
backing page when a mapping in vmalloc space uses a particular page of
the shadow region. This page can be shared by other vmalloc mappings
later on.

We hook into the vmap infrastructure to lazily clean up unused shadow
memory.

To avoid the difficulties around swapping mappings around, this code
expects that the part of the shadow region that covers the vmalloc
space will not be covered by the early shadow page, but will be left
unmapped. This will require changes in arch-specific code.

This allows KASAN with VMAP_STACK, and may be helpful for architectures
that do not have a separate module space (e.g. powerpc64, which I am
currently working on). It also allows relaxing the module alignment
back to PAGE_SIZE.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202009
Acked-by: Vasily Gorbik 
Signed-off-by: Daniel Axtens 
[Mark: rework shadow allocation]
Signed-off-by: Mark Rutland 

--

v2: let kasan_unpoison_shadow deal with ranges that do not use a
full shadow byte.

v3: relax module alignment
rename to kasan_populate_vmalloc which is a much better name
deal with concurrency correctly

v4: Mark's rework
    Poison pages on vfree
Handle allocation failures

v5: Per Christophe Leroy, split out test and dynamically free pages.

v6: Guard freeing page properly. Drop WARN_ON_ONCE(pte_none(*ptep)),
 on reflection it's unnecessary debugging cruft with too high a
 false positive rate.

v7: tlb flush, thanks Mark.
explain more clearly how freeing works and is concurrency-safe.
---
 Documentation/dev-tools/kasan.rst |  63 +
 include/linux/kasan.h |  31 +
 include/linux/moduleloader.h  |   2 +-
 include/linux/vmalloc.h   |  12 ++
 lib/Kconfig.kasan |  16 +++
 mm/kasan/common.c | 204 ++
 mm/kasan/generic_report.c |   3 +
 mm/kasan/kasan.h  |   1 +
 mm/vmalloc.c  |  45 ++-
 9 files changed, 375 insertions(+), 2 deletions(-)

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index b72d07d70239..bdb92c3de7a5 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -215,3 +215,66 @@ brk handler is used to print bug reports.
 A potential expansion of this mode is a hardware tag-based mode, which would
 use hardware memory tagging support instead of compiler instrumentation and
 manual shadow memory manipulation.
+
+What memory accesses are sanitised by KASAN?
+--------------------------------------------
+
+The kernel maps memory in a number of different parts of the address
+space. This poses something of a problem for KASAN, which requires
+that all addresses accessed by instrumented code have a valid shadow
+region.
+
+The range of kernel virtual addresses is large: there is not enough
+real memory to support a real shadow region for every address that
+could be accessed by the kernel.
+
+By default
+~~~~~~~~~~
+
+By default, architectures only map real memory over the shadow region
+for the linear mapping (and potentially other small areas). For all
+other areas - such as vmalloc and vmemmap space - a single read-only
+page is mapped over the shadow area. This read-only shadow page
+declares all memory accesses as permitted.
+
+This presents a problem for modules: they do not live in the linear
+mapping, but in a dedicated module space. By hooking in to the module
+allocator, KASAN can temporarily map real shadow memory to cover
+them. This allows detection of invalid accesses to module globals, for
+example.
+
+This also creates an incompatibility with ``VMAP_STACK``: if the stack
+lives in vmalloc space, it will be shadowed by the read-only page, and
+the kernel will fault when trying to set up the shadow data for stack
+variables.
+
+CONFIG_KASAN_VMALLOC
+~~~~~~~~~~~~~~~~~~~~
+
+With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
+cost of greater memory usage. Currently this is only supported on x86.
+
+This works by hooking into vmalloc and vmap, and dynamically
+allocating real shadow memory to back the mappings.
+
+Most mappings in vmalloc space are small, requiring less than a full
+page of shadow space. Allocating a full shadow page per mapping would
+therefore be wasteful. Furthermore, to ensure that different mappings
+use different shadow pages, mappings would have to be aligned to
+``KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE``.
+
+Instead, we share backing space across multiple mappings. We a