Re: [PATCH] modpost: support arbitrary symbol length in modversion

2023-01-17 Thread Masahiro Yamada
On Wed, Jan 18, 2023 at 4:23 AM Lucas De Marchi wrote:
>
> On Tue, Jan 17, 2023 at 06:51:44PM +0100, Michal Suchánek wrote:
> >Hello,
> >
> >On Fri, Jan 13, 2023 at 06:18:41PM +, Gary Guo wrote:
> >> On Thu, 12 Jan 2023 14:40:59 -0700
> >> Lucas De Marchi  wrote:
> >>
> >> > On Wed, Jan 11, 2023 at 04:11:51PM +, Gary Guo wrote:
> >> > >
> >> > > struct modversion_info {
> >> > >- unsigned long crc;
> >> > >- char name[MODULE_NAME_LEN];
> >> > >+ /* Offset of the next modversion entry in relation to this one. */
> >> > >+ u32 next;
> >> > >+ u32 crc;
> >> > >+ char name[0];
> >> >
> >> > although not really exported as uapi, this will break userspace as
> >> > this is used in the elf file generated for the modules. I think
> >> > this change must be made in a backward compatible way and kmod updated
> >> > to deal with the variable name length:
> >> >
> >> > kmod $ git grep "\[64"
> >> > libkmod/libkmod-elf.c:  char name[64 - sizeof(uint32_t)];
> >> > libkmod/libkmod-elf.c:  char name[64 - sizeof(uint64_t)];
> >> >
> >> > in kmod we have both 32 and 64 because a 64-bit kmod can read both 32
> >> > and 64 bit module, and vice versa.
> >> >
> >>
> >> Hi Lucas,
> >>
> >> Thanks for the information.
> >>
> >> The change can't be "truly" backward compatible, in the sense that
> >> regardless of the new format we choose, kmod would not be able to decode
> >> symbols longer than "64 - sizeof(long)" bytes. So the list it retrieves
> >> is going to be incomplete, isn't it?
> >>
> >> What kind of backward compatibility should be expected? It could be:
> >> * short symbols can still be found by old versions of kmod, but not
> >>   long symbols;
> >
> >That sounds good. Not everyone is using rust, and with this option
> >people who do will need to upgrade tooling, and people who don't care
> >don't need to do anything.
>
> that could be it indeed. My main worry here is:
>
> "After the support is added in kmod, kmod needs to be able to output the
> correct information regardless if the module is from before/after the
> change in the kernel and also without relying on kernel version."
> Just changing the struct modversion_info doesn't make that possible.
>
> Maybe adding the long symbols in another section? Or maybe just
> increase to 512 and add the size to a "__versions_hdr" section.
> If we then output a max size per module, this would offset a little
> bit the additional size gained for the modules using rust. And the
> additional 0's should compress well, so I'm not sure the additional
> size is that relevant here.
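
For illustration only (nothing like this exists today; the struct and
field names below are made up), the "__versions_hdr" idea could look
roughly like this, with the header recording the per-module record size
so tools can still walk fixed-size entries:

  /* Hypothetical layout sketch, not an implemented interface */
  struct modversion_hdr {
          u32 entry_size;         /* size of each record in __versions */
  };

  struct modversion_entry {
          unsigned long crc;      /* as in struct modversion_info today */
          char name[];            /* entry_size - sizeof(crc) bytes, padded */
  };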

I also thought of new section(s) for long symbols.

One idea is to have separate sections for CRCs and symbol names.

section __version_crc:
   0x12345678
   0x23456789
   0x34567890


section __version_sym:
  "very_very_very_very_long_symbol"
  "another_very_very_very_very_very_long_symbol"
  "yet_another_very_very_very_very_very_long_symbol"

You can iterate over each section like this:

   crc += sizeof(u32);
   name += strlen(name) + 1;


Benefits:
  - No next pointer
  - No padding
- *.mod.c is kept human readable.
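
A minimal userspace-style sketch of that parallel walk (illustrative
only; the function name and buffer origin are assumptions, not an
implemented interface):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* One fixed-size CRC per entry in __version_crc, one NUL-terminated
   * name per entry in __version_sym, no padding in between. */
  static void for_each_modversion(const uint32_t *crc, size_t nr_entries,
                                  const char *name)
  {
          for (size_t i = 0; i < nr_entries; i++) {
                  printf("0x%08x  %s\n", (unsigned int)crc[i], name);
                  name += strlen(name) + 1;
          }
  }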

BTW, is the following impossible because the pointer reference to
.rodata is not available at this point?

struct modversion_info {
 u32 crc;
 const char *name;
};



-- 
Best Regards
Masahiro Yamada


[PATCH v3 21/24] powerpc/pseries: Pass PLPKS password on kexec

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

Before interacting with the PLPKS, we ask the hypervisor to generate a
password for the current boot, which is then required for most further
PLPKS operations.

If we kexec into a new kernel, the new kernel will try and fail to
generate a new password, as the password has already been set.

Pass the password through to the new kernel via the device tree, in
the /chosen/ibm,plpks-pw property. Check for the presence of this property
before trying to generate a new password - if it exists, use the existing
password and remove it from the device tree.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 arch/powerpc/kexec/file_load_64.c  | 17 -
 arch/powerpc/platforms/pseries/plpks.c | 18 +-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index af8854f9eae3..f0e25700ae77 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct umem_info {
u64 *buf;   /* data buffer for usable-memory property */
@@ -1155,7 +1156,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
unsigned long initrd_len, const char *cmdline)
 {
struct crash_mem *umem = NULL, *rmem = NULL;
-   int i, nr_ranges, ret;
+   int i, nr_ranges, ret, chosen_node;
 
/*
 * Restrict memory usage for kdump kernel by setting up
@@ -1230,6 +1231,20 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
}
}
 
+#ifdef CONFIG_PSERIES_PLPKS
+   // If we have PLPKS active, we need to provide the password
+   if (plpks_is_available()) {
+   chosen_node = fdt_path_offset(fdt, "/chosen");
+   if (!chosen_node) {
+   pr_err("Can't find chosen node: %s\n",
+  fdt_strerror(chosen_node));
+   goto out;
+   }
+   ret = fdt_setprop(fdt, chosen_node, "ibm,plpks-pw",
+ plpks_get_password(), 
plpks_get_passwordlen());
+   }
+#endif // CONFIG_PSERIES_PLPKS
+
 out:
kfree(rmem);
kfree(umem);
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index b3c7410a4f13..0350f10e1755 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -126,7 +127,22 @@ static int plpks_gen_password(void)
 {
unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
u8 *password, consumer = PLPKS_OS_OWNER;
-   int rc;
+   struct property *prop;
+   int rc, len;
+
+   // Before we generate the password, we may have been booted by kexec and
+   // provided with a previous password.  Check for that first.
+   prop = of_find_property(of_chosen, "ibm,plpks-pw", &len);
+   if (prop) {
+   ospasswordlength = (u16)len;
+   ospassword = kzalloc(ospasswordlength, GFP_KERNEL);
+   if (!ospassword) {
+   of_remove_property(of_chosen, prop);
+   return -ENOMEM;
+   }
+   memcpy(ospassword, prop->value, len);
+   return of_remove_property(of_chosen, prop);
+   }
 
// The password must not cross a page boundary, so we align to the next 
power of 2
password = kzalloc(roundup_pow_of_two(maxpwsize), GFP_KERNEL);
-- 
2.39.0



[PATCH v3 07/24] powerpc/secvar: Extend sysfs to include config vars

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The forthcoming pseries consumer of the secvar API wants to expose a
number of config variables.  Allowing secvar implementations to provide
their own sysfs attributes makes it easy for consumers to expose what
they need to.

This is not being used by the OPAL secvar implementation at present, and
the config directory will not be created if no attributes are set.
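
As a rough sketch (not part of this patch), a backend with a single
read-only config attribute might wire it up like this; the "version"
attribute and the backend_get_version() helper are made-up examples:

  static ssize_t version_show(struct kobject *kobj,
                              struct kobj_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%hhu\n", backend_get_version());
  }

  static struct kobj_attribute attr_version = __ATTR_RO(version);

  static const struct attribute *backend_config_attrs[] = {
          &attr_version.attr,
          NULL,
  };

  static const struct secvar_operations backend_secvar_ops = {
          /* .get/.set/.format/.max_size as usual */
          .config_attrs = backend_config_attrs,
  };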

Signed-off-by: Russell Currey 
Co-developed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 

---

v3: Remove unnecessary "secvar:" prefix from error messages (ajd)

Merge config attributes into secvar_operations (mpe)
---
 arch/powerpc/include/asm/secvar.h  |  2 ++
 arch/powerpc/kernel/secvar-sysfs.c | 33 +-
 2 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
index b2cb9bb7c540..ebf95386d720 100644
--- a/arch/powerpc/include/asm/secvar.h
+++ b/arch/powerpc/include/asm/secvar.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 #define SECVAR_MAX_FORMAT_LEN  30 // max length of string returned by ->format()
 
@@ -21,6 +22,7 @@ struct secvar_operations {
int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
ssize_t (*format)(char *buf);
int (*max_size)(u64 *max_size);
+   const struct attribute **config_attrs;
 };
 
 #ifdef CONFIG_PPC_SECURE_BOOT
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 9f0e49bf3903..b82e95a2e415 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -140,6 +140,19 @@ static int update_kobj_size(void)
return 0;
 }
 
+static int secvar_sysfs_config(struct kobject *kobj)
+{
+   struct attribute_group config_group = {
+   .name = "config",
+   .attrs = (struct attribute **)secvar_ops->config_attrs,
+   };
+
+   if (secvar_ops->config_attrs)
+   return sysfs_create_group(kobj, &config_group);
+
+   return 0;
+}
+
 static int secvar_sysfs_load(void)
 {
struct kobject *kobj;
@@ -202,26 +215,36 @@ static int secvar_sysfs_init(void)
 
 rc = sysfs_create_file(secvar_kobj, &format_attr.attr);
if (rc) {
-   kobject_put(secvar_kobj);
-   return -ENOMEM;
+   pr_err("Failed to create format object\n");
+   rc = -ENOMEM;
+   goto err;
}
 
secvar_kset = kset_create_and_add("vars", NULL, secvar_kobj);
if (!secvar_kset) {
pr_err("sysfs kobject registration failed\n");
-   kobject_put(secvar_kobj);
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto err;
}
 
rc = update_kobj_size();
if (rc) {
pr_err("Cannot read the size of the attribute\n");
-   return rc;
+   goto err;
+   }
+
+   rc = secvar_sysfs_config(secvar_kobj);
+   if (rc) {
+   pr_err("Failed to create config directory\n");
+   goto err;
}
 
secvar_sysfs_load();
 
return 0;
+err:
+   kobject_put(secvar_kobj);
+   return rc;
 }
 
 late_initcall(secvar_sysfs_init);
-- 
2.39.0



[PATCH v3 23/24] integrity/powerpc: Improve error handling & reporting when loading certs

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

A few improvements to load_powerpc.c:

 - include integrity.h for the pr_fmt()
 - move all error reporting out of get_cert_list()
 - use ERR_PTR() to better preserve error detail
 - don't use pr_err() for missing keys

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 .../integrity/platform_certs/load_powerpc.c   | 26 ++-
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
index 1e4f80a4e71c..dee51606d5f4 100644
--- a/security/integrity/platform_certs/load_powerpc.c
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -14,9 +14,15 @@
 #include 
 #include 
 #include "keyring_handler.h"
+#include "../integrity.h"
 
 /*
  * Get a certificate list blob from the named secure variable.
+ *
+ * Returns:
+ *  - a pointer to a kmalloc'd buffer containing the cert list on success
+ *  - NULL if the key does not exist
+ *  - an ERR_PTR on error
  */
 static __init void *get_cert_list(u8 *key, unsigned long keylen, u64 *size)
 {
@@ -25,19 +31,19 @@ static __init void *get_cert_list(u8 *key, unsigned long 
keylen, u64 *size)
 
rc = secvar_ops->get(key, keylen, NULL, size);
if (rc) {
-   pr_err("Couldn't get size: %d\n", rc);
-   return NULL;
+   if (rc == -ENOENT)
+   return NULL;
+   return ERR_PTR(rc);
}
 
db = kmalloc(*size, GFP_KERNEL);
if (!db)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
rc = secvar_ops->get(key, keylen, db, size);
if (rc) {
kfree(db);
-   pr_err("Error reading %s var: %d\n", key, rc);
-   return NULL;
+   return ERR_PTR(rc);
}
 
return db;
@@ -69,7 +75,11 @@ static int __init load_powerpc_certs(void)
 */
 db = get_cert_list("db", 3, &dbsize);
if (!db) {
-   pr_err("Couldn't get db list from firmware\n");
+   pr_info("Couldn't get db list from firmware\n");
+   } else if (IS_ERR(db)) {
+   rc = PTR_ERR(db);
+   pr_err("Error reading db from firmware: %d\n", rc);
+   return rc;
} else {
rc = parse_efi_signature_list("powerpc:db", db, dbsize,
  get_handler_for_db);
@@ -81,6 +91,10 @@ static int __init load_powerpc_certs(void)
 dbx = get_cert_list("dbx", 4, &dbxsize);
if (!dbx) {
pr_info("Couldn't get dbx list from firmware\n");
+   } else if (IS_ERR(dbx)) {
+   rc = PTR_ERR(dbx);
+   pr_err("Error reading dbx from firmware: %d\n", rc);
+   return rc;
} else {
rc = parse_efi_signature_list("powerpc:dbx", dbx, dbxsize,
  get_handler_for_dbx);
-- 
2.39.0



[PATCH v3 16/24] powerpc/pseries: Implement signed update for PLPKS objects

2023-01-17 Thread Andrew Donnellan
From: Nayna Jain 

The Platform Keystore provides a signed update interface which can be used
to create, replace or append to certain variables in the PKS in a secure
fashion, with the hypervisor requiring that the update be signed using the
Platform Key.

Implement an interface to the H_PKS_SIGNED_UPDATE hcall in the plpks
driver to allow signed updates to PKS objects.

(The plpks driver doesn't need to do any cryptography or otherwise handle
the actual signed variable contents - that will be handled by userspace
tooling.)
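
For illustration (not part of the patch), a kernel-side caller would fill
in a plpks_var and pass it by reference along with the flags word; the
example_* name and the field values below are placeholders:

  static int example_signed_write(u8 *name, u16 namelen,
                                  u8 *blob, u16 bloblen, u64 flags)
  {
          struct plpks_var var = {
                  .component = NULL,      /* signed updates expect NULL */
                  .os        = PLPKS_VAR_LINUX,
                  .name      = name,
                  .namelen   = namelen,
                  .policy    = PLPKS_SIGNEDUPDATE,
                  .data      = blob,
                  .datalen   = bloblen,
          };

          return plpks_signed_update_var(&var, flags);
  }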

Signed-off-by: Nayna Jain 
[ajd: split patch, add timeout handling and misc cleanups]
Co-developed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: Merge plpks fixes and signed update series with secvar series

Fix error code handling in plpks_confirm_object_flushed() (ruscur)

Pass plpks_var struct to plpks_signed_update_var() by reference (mpe)

Consistent constant naming scheme (ruscur)
---
 arch/powerpc/include/asm/hvcall.h  |  3 +-
 arch/powerpc/include/asm/plpks.h   |  5 ++
 arch/powerpc/platforms/pseries/plpks.c | 71 --
 3 files changed, 73 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 95fd7f9485d5..33b26c0cb69b 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -336,7 +336,8 @@
 #define H_SCM_FLUSH            0x44C
 #define H_GET_ENERGY_SCALE_INFO        0x450
 #define H_WATCHDOG             0x45C
-#define MAX_HCALL_OPCODE       H_WATCHDOG
+#define H_PKS_SIGNED_UPDATE    0x454
+#define MAX_HCALL_OPCODE       H_PKS_SIGNED_UPDATE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
diff --git a/arch/powerpc/include/asm/plpks.h b/arch/powerpc/include/asm/plpks.h
index 7c5f51a9af7c..e7204e6c0ca4 100644
--- a/arch/powerpc/include/asm/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -68,6 +68,11 @@ struct plpks_var_name_list {
struct plpks_var_name varlist[];
 };
 
+/**
+ * Updates the authenticated variable. It expects NULL as the component.
+ */
+int plpks_signed_update_var(struct plpks_var *var, u64 flags);
+
 /**
  * Writes the specified var and its data to PKS.
  * Any caller of PKS driver should present a valid component type for
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 1189246b03dc..796ed5544ee5 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -81,6 +81,12 @@ static int pseries_status_to_err(int rc)
err = -ENOENT;
break;
case H_BUSY:
+   case H_LONG_BUSY_ORDER_1_MSEC:
+   case H_LONG_BUSY_ORDER_10_MSEC:
+   case H_LONG_BUSY_ORDER_100_MSEC:
+   case H_LONG_BUSY_ORDER_1_SEC:
+   case H_LONG_BUSY_ORDER_10_SEC:
+   case H_LONG_BUSY_ORDER_100_SEC:
err = -EBUSY;
break;
case H_AUTHORITY:
@@ -184,14 +190,17 @@ static struct label *construct_label(char *component, u8 
varos, u8 *name,
 u16 namelen)
 {
struct label *label;
-   size_t slen;
+   size_t slen = 0;
 
if (!name || namelen > PLPKS_MAX_NAME_SIZE)
return ERR_PTR(-EINVAL);
 
-   slen = strlen(component);
-   if (component && slen > sizeof(label->attr.prefix))
-   return ERR_PTR(-EINVAL);
+   // Support NULL component for signed updates
+   if (component) {
+   slen = strlen(component);
+   if (slen > sizeof(label->attr.prefix))
+   return ERR_PTR(-EINVAL);
+   }
 
// The label structure must not cross a page boundary, so we align to 
the next power of 2
label = kzalloc(roundup_pow_of_two(sizeof(*label)), GFP_KERNEL);
@@ -397,6 +406,58 @@ static int plpks_confirm_object_flushed(struct label 
*label,
return pseries_status_to_err(rc);
 }
 
+int plpks_signed_update_var(struct plpks_var *var, u64 flags)
+{
+   unsigned long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
+   int rc;
+   struct label *label;
+   struct plpks_auth *auth;
+   u64 continuetoken = 0;
+   u64 timeout = 0;
+
+   if (!var->data || var->datalen <= 0 || var->namelen > 
PLPKS_MAX_NAME_SIZE)
+   return -EINVAL;
+
+   if (!(var->policy & PLPKS_SIGNEDUPDATE))
+   return -EINVAL;
+
+   auth = construct_auth(PLPKS_OS_OWNER);
+   if (IS_ERR(auth))
+   return PTR_ERR(auth);
+
+   label = construct_label(var->component, var->os, var->name, 
var->namelen);
+   if (IS_ERR(label)) {
+   rc = PTR_ERR(label);
+   goto out;
+   }
+
+   do {
+   rc = plpar_hcall9(H_PKS_SIGNED_UPDATE, retbuf,
+ virt_to_phys(auth), virt_to_phys(label),
+ label->size, var->policy, flags,
+ 

[PATCH v3 22/24] powerpc/pseries: Implement secvars for dynamic secure boot

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The pseries platform can support dynamic secure boot (i.e. secure boot
using user-defined keys) using variables contained within the PowerVM LPAR
Platform KeyStore (PLPKS).  Using the powerpc secvar API, expose the
relevant variables for pseries dynamic secure boot through the existing
secvar filesystem layout.

The relevant variables for dynamic secure boot are signed in the
keystore, and can only be modified using the H_PKS_SIGNED_UPDATE hcall.
Object labels in the keystore are encoded using ucs2 format.  With our
fixed variable names we don't have to care about encoding outside of the
necessary byte padding.

When a user writes to a variable, the first 8 bytes of data must contain
the signed update flags as defined by the hypervisor.

When a user reads a variable, the first 4 bytes of data contain the
policies defined for the object.

Limitations exist due to the underlying implementation of sysfs binary
attributes, as is the case for the OPAL secvar implementation -
partial writes are unsupported and writes cannot be larger than PAGE_SIZE.
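
To make the write format concrete, here is a hypothetical userspace sketch
(not part of this series; the path and the caller-supplied 8-byte flags
value are placeholders) that prepends the flags header and submits
everything in one write():

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  /* Write <8-byte signed-update flags><payload> to e.g.
   * /sys/firmware/secvar/vars/db/update in a single write(), since
   * partial writes are unsupported and writes are capped at PAGE_SIZE. */
  static int write_secvar_update(const char *path, const unsigned char flags[8],
                                 const void *payload, size_t payload_len)
  {
          unsigned char buf[4096];
          ssize_t written;
          int fd;

          if (payload_len + 8 > sizeof(buf))
                  return -1;

          memcpy(buf, flags, 8);
          memcpy(buf + 8, payload, payload_len);

          fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;

          written = write(fd, buf, payload_len + 8);
          close(fd);
          return written == (ssize_t)(payload_len + 8) ? 0 : -1;
  }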

Co-developed-by: Nayna Jain 
Signed-off-by: Nayna Jain 
Co-developed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v2: Remove unnecessary config vars from sysfs and document the others,
thanks to review from Greg.  If we end up needing to expose more, we
can add them later and update the docs.

Use sysfs_emit() instead of sprintf(), thanks to Greg.

Change the size of the sysfs binary attributes to include the 8-byte
flags header, preventing truncation of large writes.

v3: plpks_set_variable(): pass var to plpks_signed_update_var() as a
pointer (mpe)

Update copyright date (ajd)

Consistent comment style (ajd)

Change device_initcall() to machine_arch_initcall(pseries...) so we
don't try to load on powernv and kill the machine (mpe)

Add config attributes into plpks_secvar_ops (mpe)

Get rid of PLPKS_SECVAR_COUNT macro (mpe)

Reworded descriptions in ABI documentation (mpe)

Switch to using secvar_ops->var_names rather than
secvar_ops->get_next() (ajd/mpe)

Optimise allocation/copying of buffers (mpe)

Elaborate the comment documenting the "format" string (mpe)

Return -EIO on errors in the read case (mpe)

Add "grubdbx" variable (Sudhakar Kuppusamy)

Use utf8s_to_utf16s() rather than our own "UCS-2" conversion code (mpe)

Change uint64_t to u64 (mpe)

Fix SB_VERSION data length (ruscur)

Stop prepending policy data on read (ruscur)

Enforce max format length on format string (not strictly needed, but
makes the length limit clear) (ajd)

Update include of plpks.h to reflect new path (ruscur)

Consistent constant naming scheme (ruscur)
---
 Documentation/ABI/testing/sysfs-secvar|  75 +-
 arch/powerpc/platforms/pseries/Makefile   |   4 +-
 arch/powerpc/platforms/pseries/plpks-secvar.c | 214 ++
 3 files changed, 290 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/platforms/pseries/plpks-secvar.c

diff --git a/Documentation/ABI/testing/sysfs-secvar 
b/Documentation/ABI/testing/sysfs-secvar
index feebb8c57294..a19f4d5fcec6 100644
--- a/Documentation/ABI/testing/sysfs-secvar
+++ b/Documentation/ABI/testing/sysfs-secvar
@@ -18,6 +18,14 @@ Description: A string indicating which backend is in use by 
the firmware.
This determines the format of the variable and the accepted
format of variable updates.
 
+   On powernv/OPAL, this value is provided by the OPAL firmware
+   and is expected to be "ibm,edk2-compat-v1".
+
+   On pseries/PLPKS, this is generated by the kernel based on the
+   version number in the SB_VERSION variable in the keystore, and
+   has the form "ibm,plpks-sb-v<version>", or
+   "ibm,plpks-sb-unknown" if there is no SB_VERSION variable.
+
 What:  /sys/firmware/secvar/vars/
 Date:  August 2019
 Contact:   Nayna Jain 
@@ -34,7 +42,7 @@ Description:  An integer representation of the size of the 
content of the
 
 What:  /sys/firmware/secvar/vars//data
 Date:  August 2019
-Contact:   Nayna Jain h
+Contact:   Nayna Jain 
 Description:   A read-only file containing the value of the variable. The size
of the file represents the maximum size of the variable data.
 
@@ -44,3 +52,68 @@ Contact: Nayna Jain 
 Description:   A write-only file that is used to submit the new value for the
variable. The size of the file represents the maximum size of
the variable data that can be written.
+
+What:  /sys/firmware/secvar/config
+Date:  December 2022
+Contact:   Nayna Jain 
+Description:   This optional directory contains read-only config attributes as
+   defined by the secure variable implementation.  All data is in
+   

[PATCH v3 20/24] powerpc/pseries: Add helpers to get PLPKS password

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

Add helper functions to get the PLPKS password. This will be used in a
later patch to support passing the password between kernels over kexec.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 arch/powerpc/include/asm/plpks.h   | 11 +++
 arch/powerpc/platforms/pseries/plpks.c | 10 ++
 2 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/plpks.h b/arch/powerpc/include/asm/plpks.h
index 0c49969b0864..08355c89f5fd 100644
--- a/arch/powerpc/include/asm/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -171,6 +171,17 @@ u32 plpks_get_maxlargeobjectsize(void);
  */
 u64 plpks_get_signedupdatealgorithms(void);
 
+/**
+ * Returns the PLPKS password generated by the hypervisor.
+ * Should only be used to prepare a different OS to use the PLPKS, i.e. kexec.
+ */
+u8 *plpks_get_password(void);
+
+/**
+ * Returns the length of the PLPKS password in bytes.
+ */
+u16 plpks_get_passwordlen(void);
+
 #endif // CONFIG_PSERIES_PLPKS
 
 #endif // _ASM_POWERPC_PLPKS_H
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 5d9c6a3b2014..b3c7410a4f13 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -359,6 +359,16 @@ u64 plpks_get_signedupdatealgorithms(void)
return signedupdatealgorithms;
 }
 
+u8 *plpks_get_password(void)
+{
+   return ospassword;
+}
+
+u16 plpks_get_passwordlen(void)
+{
+   return ospasswordlength;
+}
+
 bool plpks_is_available(void)
 {
int rc;
-- 
2.39.0



[PATCH v3 24/24] integrity/powerpc: Support loading keys from pseries secvar

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The secvar object format is only in the device tree under powernv.
We now have an API call to retrieve it in a generic way, so we should
use that instead of having to handle the DT here.

Add support for pseries secvar, with the "ibm,plpks-sb-v1" format.
The object format is expected to be the same, so there shouldn't be any
functional differences between objects retrieved from powernv and
pseries.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 .../integrity/platform_certs/load_powerpc.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
index dee51606d5f4..8fa899616d92 100644
--- a/security/integrity/platform_certs/load_powerpc.c
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include "keyring_handler.h"
@@ -59,16 +58,22 @@ static int __init load_powerpc_certs(void)
void *db = NULL, *dbx = NULL;
u64 dbsize = 0, dbxsize = 0;
int rc = 0;
-   struct device_node *node;
+   ssize_t len;
+   char buf[SECVAR_MAX_FORMAT_LEN];
 
if (!secvar_ops)
return -ENODEV;
 
-   /* The following only applies for the edk2-compat backend. */
-   node = of_find_compatible_node(NULL, NULL, "ibm,edk2-compat-v1");
-   if (!node)
+   len = secvar_ops->format(buf);
+   if (len <= 0)
return -ENODEV;
 
+   // Check for known secure boot implementations from OPAL or PLPKS
+   if (strcmp("ibm,edk2-compat-v1", buf) && strcmp("ibm,plpks-sb-v1", 
buf)) {
+   pr_err("Unsupported secvar implementation \"%s\", not loading 
certs\n", buf);
+   return -ENODEV;
+   }
+
/*
 * Get db, and dbx. They might not exist, so it isn't an error if we
 * can't get them.
@@ -103,8 +108,6 @@ static int __init load_powerpc_certs(void)
kfree(dbx);
}
 
-   of_node_put(node);
-
return rc;
 }
 late_initcall(load_powerpc_certs);
-- 
2.39.0



[PATCH v3 15/24] powerpc/pseries: Expose PLPKS config values, support additional fields

2023-01-17 Thread Andrew Donnellan
From: Nayna Jain 

The plpks driver uses the H_PKS_GET_CONFIG hcall to retrieve configuration
and status information about the PKS from the hypervisor.

Update _plpks_get_config() to handle some additional fields. Add getter
functions to allow the PKS configuration information to be accessed from
other files. Validate that the values we're getting comply with the spec.

While we're here, move the config struct in _plpks_get_config() off the
stack - it's getting large and we also need to make sure it doesn't cross
a page boundary.

Signed-off-by: Nayna Jain 
[ajd: split patch, extend to support additional v3 API fields, minor fixes]
Co-developed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: Merge plpks fixes and signed update series with secvar series

Refresh config values in plpks_get_usedspace() (ajd)

Validate the config values being returned comply with spec (ruscur)

Return maxobjlabelsize as is (ruscur)

Move plpks.h to include/asm (ruscur)

Fix checkpatch checks (ruscur)
---
 arch/powerpc/include/asm/plpks.h   |  58 ++
 arch/powerpc/platforms/pseries/plpks.c | 149 +++--
 2 files changed, 195 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/plpks.h b/arch/powerpc/include/asm/plpks.h
index 6466aadd7145..7c5f51a9af7c 100644
--- a/arch/powerpc/include/asm/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -96,6 +96,64 @@ int plpks_read_fw_var(struct plpks_var *var);
  */
 int plpks_read_bootloader_var(struct plpks_var *var);
 
+/**
+ * Returns if PKS is available on this LPAR.
+ */
+bool plpks_is_available(void);
+
+/**
+ * Returns version of the Platform KeyStore.
+ */
+u8 plpks_get_version(void);
+
+/**
+ * Returns hypervisor storage overhead per object, not including the size of
+ * the object or label. Only valid for config version >= 2
+ */
+u16 plpks_get_objoverhead(void);
+
+/**
+ * Returns maximum password size. Must be >= 32 bytes
+ */
+u16 plpks_get_maxpwsize(void);
+
+/**
+ * Returns maximum object size supported by Platform KeyStore.
+ */
+u16 plpks_get_maxobjectsize(void);
+
+/**
+ * Returns maximum object label size supported by Platform KeyStore.
+ */
+u16 plpks_get_maxobjectlabelsize(void);
+
+/**
+ * Returns total size of the configured Platform KeyStore.
+ */
+u32 plpks_get_totalsize(void);
+
+/**
+ * Returns used space from the total size of the Platform KeyStore.
+ */
+u32 plpks_get_usedspace(void);
+
+/**
+ * Returns bitmask of policies supported by the hypervisor.
+ */
+u32 plpks_get_supportedpolicies(void);
+
+/**
+ * Returns maximum byte size of a single object supported by the hypervisor.
+ * Only valid for config version >= 3
+ */
+u32 plpks_get_maxlargeobjectsize(void);
+
+/**
+ * Returns bitmask of signature algorithms supported for signed updates.
+ * Only valid for config version >= 3
+ */
+u64 plpks_get_signedupdatealgorithms(void);
+
 #endif // CONFIG_PSERIES_PLPKS
 
 #endif // _ASM_POWERPC_PLPKS_H
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 91f3f623a2c7..1189246b03dc 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -24,8 +24,16 @@ static u8 *ospassword;
 static u16 ospasswordlength;
 
 // Retrieved with H_PKS_GET_CONFIG
+static u8 version;
+static u16 objoverhead;
 static u16 maxpwsize;
 static u16 maxobjsize;
+static s16 maxobjlabelsize;
+static u32 totalsize;
+static u32 usedspace;
+static u32 supportedpolicies;
+static u32 maxlargeobjectsize;
+static u64 signedupdatealgorithms;
 
 struct plpks_auth {
u8 version;
@@ -206,32 +214,149 @@ static struct label *construct_label(char *component, u8 
varos, u8 *name,
 static int _plpks_get_config(void)
 {
unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
-   struct {
+   struct config {
u8 version;
u8 flags;
-   __be32 rsvd0;
+   __be16 rsvd0;
+   __be16 objoverhead;
__be16 maxpwsize;
__be16 maxobjlabelsize;
__be16 maxobjsize;
__be32 totalsize;
__be32 usedspace;
__be32 supportedpolicies;
-   __be64 rsvd1;
-   } __packed config;
+   __be32 maxlargeobjectsize;
+   __be64 signedupdatealgorithms;
+   u8 rsvd1[476];
+   } __packed * config;
size_t size;
-   int rc;
+   int rc = 0;
+
+   size = sizeof(*config);
+
+   // Config struct must not cross a page boundary. So long as the struct
+   // size is a power of 2, this should be fine as alignment is guaranteed
+   config = kzalloc(size, GFP_KERNEL);
+   if (!config) {
+   rc = -ENOMEM;
+   goto err;
+   }
+
+   rc = plpar_hcall(H_PKS_GET_CONFIG, retbuf, virt_to_phys(config), size);
+
+   if (rc != H_SUCCESS) {
+   rc = 

[PATCH v3 18/24] powerpc/pseries: Make caller pass buffer to plpks_read_var()

2023-01-17 Thread Andrew Donnellan
Currently, plpks_read_var() allocates a buffer to pass to the
H_PKS_READ_OBJECT hcall, then allocates another buffer, of the caller's
preferred size if necessary, into which the data is copied, and returns
that buffer to the caller.

This is a bit over the top - while we probably still want to allocate a
separate buffer to pass to the hypervisor in the hcall, we can let the
caller allocate the final buffer and specify the size.

Don't allocate var->data in plpks_read_var(), instead expect the caller to
allocate it. If the caller needs to discover the size, it can set
var->data to NULL and var->datalen will be populated. Update header file
to document this.
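
For illustration, a caller following the new contract might look like this
(a sketch, not part of the patch; the example_* name is made up and error
handling is trimmed):

  /* Discover the object size first, then read into our own buffer. */
  static u8 *example_read_os_var(u8 *name, u16 namelen, u16 *lenp)
  {
          struct plpks_var var = {
                  .os      = PLPKS_VAR_LINUX,
                  .name    = name,
                  .namelen = namelen,
                  .data    = NULL,        /* first call only fetches the size */
          };
          u8 *buf;

          if (plpks_read_os_var(&var))
                  return NULL;

          buf = kzalloc(var.datalen, GFP_KERNEL);
          if (!buf)
                  return NULL;

          var.data = buf;                 /* second call fills our buffer */
          if (plpks_read_os_var(&var)) {
                  kfree(buf);
                  return NULL;
          }

          *lenp = var.datalen;
          return buf;
  }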

Suggested-by: Michael Ellerman 
Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: New patch (mpe)
---
 arch/powerpc/include/asm/plpks.h   | 12 
 arch/powerpc/platforms/pseries/plpks.c | 11 ---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/plpks.h b/arch/powerpc/include/asm/plpks.h
index e7204e6c0ca4..0c49969b0864 100644
--- a/arch/powerpc/include/asm/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -88,16 +88,28 @@ int plpks_remove_var(char *component, u8 varos,
 
 /**
  * Returns the data for the specified os variable.
+ *
+ * Caller must allocate a buffer in var->data with length in var->datalen.
+ * If no buffer is provided, var->datalen will be populated with the object's
+ * size.
  */
 int plpks_read_os_var(struct plpks_var *var);
 
 /**
  * Returns the data for the specified firmware variable.
+ *
+ * Caller must allocate a buffer in var->data with length in var->datalen.
+ * If no buffer is provided, var->datalen will be populated with the object's
+ * size.
  */
 int plpks_read_fw_var(struct plpks_var *var);
 
 /**
  * Returns the data for the specified bootloader variable.
+ *
+ * Caller must allocate a buffer in var->data with length in var->datalen.
+ * If no buffer is provided, var->datalen will be populated with the object's
+ * size.
  */
 int plpks_read_bootloader_var(struct plpks_var *var);
 
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 96a026a37285..5d9c6a3b2014 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -578,17 +578,14 @@ static int plpks_read_var(u8 consumer, struct plpks_var 
*var)
goto out_free_output;
}
 
-   if (var->datalen == 0 || var->datalen > retbuf[0])
+   if (!var->data || var->datalen > retbuf[0])
var->datalen = retbuf[0];
 
-   var->data = kzalloc(var->datalen, GFP_KERNEL);
-   if (!var->data) {
-   rc = -ENOMEM;
-   goto out_free_output;
-   }
var->policy = retbuf[1];
 
-   memcpy(var->data, output, var->datalen);
+   if (var->data)
+   memcpy(var->data, output, var->datalen);
+
rc = 0;
 
 out_free_output:
-- 
2.39.0



[PATCH v3 14/24] powerpc/pseries: Fix alignment of PLPKS structures and buffers

2023-01-17 Thread Andrew Donnellan
A number of structures and buffers passed to PKS hcalls have alignment
requirements, which could on occasion cause problems:

- Authorisation structures must be 16-byte aligned and must not cross a
  page boundary

- Label structures must not cross page boundaries

- Password output buffers must not cross page boundaries

Round up the allocations of these structures/buffers to the next power of
2 to make sure this happens: kmalloc() allocations with a power-of-two size
are naturally aligned to that size, so any such allocation of up to
PAGE_SIZE cannot cross a page boundary.

Reported-by: Benjamin Gray 
Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore")
Signed-off-by: Andrew Donnellan 
Reviewed-by: Russell Currey 
Signed-off-by: Russell Currey 

---

v3: Merge plpks fixes and signed update series with secvar series
---
 arch/powerpc/platforms/pseries/plpks.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 6d1303e4862d..91f3f623a2c7 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -112,7 +112,8 @@ static int plpks_gen_password(void)
u8 *password, consumer = PLPKS_OS_OWNER;
int rc;
 
-   password = kzalloc(maxpwsize, GFP_KERNEL);
+   // The password must not cross a page boundary, so we align to the next 
power of 2
+   password = kzalloc(roundup_pow_of_two(maxpwsize), GFP_KERNEL);
if (!password)
return -ENOMEM;
 
@@ -148,7 +149,9 @@ static struct plpks_auth *construct_auth(u8 consumer)
if (consumer > PLPKS_OS_OWNER)
return ERR_PTR(-EINVAL);
 
-   auth = kzalloc(struct_size(auth, password, maxpwsize), GFP_KERNEL);
+   // The auth structure must not cross a page boundary and must be
+   // 16 byte aligned. We align to the next largest power of 2
+   auth = kzalloc(roundup_pow_of_two(struct_size(auth, password, 
maxpwsize)), GFP_KERNEL);
if (!auth)
return ERR_PTR(-ENOMEM);
 
@@ -182,7 +185,8 @@ static struct label *construct_label(char *component, u8 
varos, u8 *name,
if (component && slen > sizeof(label->attr.prefix))
return ERR_PTR(-EINVAL);
 
-   label = kzalloc(sizeof(*label), GFP_KERNEL);
+   // The label structure must not cross a page boundary, so we align to 
the next power of 2
+   label = kzalloc(roundup_pow_of_two(sizeof(*label)), GFP_KERNEL);
if (!label)
return ERR_PTR(-ENOMEM);
 
-- 
2.39.0



[PATCH v3 19/24] powerpc/pseries: Turn PSERIES_PLPKS into a hidden option

2023-01-17 Thread Andrew Donnellan
It seems a bit unnecessary for the PLPKS code to have a user-visible
config option when it doesn't do anything on its own, and there are
existing options for enabling Secure Boot-related features.

It should be enabled by PPC_SECURE_BOOT, which will eventually be what
uses PLPKS to populate keyrings.

However, we can't get rid of the separate option completely, because it will
also be used for SED Opal purposes.

Change PSERIES_PLPKS into a hidden option, which is selected by
PPC_SECURE_BOOT.

Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: New patch
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/platforms/pseries/Kconfig | 11 +--
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..d4ed46101bec 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1029,6 +1029,7 @@ config PPC_SECURE_BOOT
depends on PPC_POWERNV || PPC_PSERIES
depends on IMA_ARCH_POLICY
imply IMA_SECURE_AND_OR_TRUSTED_BOOT
+   select PSERIES_PLPKS if PPC_PSERIES
help
  Systems with firmware secure boot enabled need to define security
  policies to extend secure boot to the OS. This config allows a user
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..82b6f993be0f 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -151,16 +151,7 @@ config IBMEBUS
 
 config PSERIES_PLPKS
depends on PPC_PSERIES
-   bool "Support for the Platform Key Storage"
-   help
- PowerVM provides an isolated Platform Keystore(PKS) storage
- allocation for each LPAR with individually managed access
- controls to store sensitive information securely. It can be
- used to store asymmetric public keys or secrets as required
- by different usecases. Select this config to enable
- operating system interface to hypervisor to access this space.
-
- If unsure, select N.
+   bool
 
 config PAPR_SCM
depends on PPC_PSERIES && MEMORY_HOTPLUG && LIBNVDIMM
-- 
2.39.0



[PATCH v3 17/24] powerpc/pseries: Log hcall return codes for PLPKS debug

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The plpks code converts hypervisor return codes into their Linux
equivalents so that users can understand them.  Having access to the
original return codes is really useful for debugging, so add a
pr_debug() so we don't lose information from the conversion.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 
---
 arch/powerpc/platforms/pseries/plpks.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 796ed5544ee5..96a026a37285 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -117,6 +117,8 @@ static int pseries_status_to_err(int rc)
err = -EINVAL;
}
 
+   pr_debug("Converted hypervisor code %d to Linux %d\n", rc, err);
+
return err;
 }
 
-- 
2.39.0



[PATCH v3 12/24] powerpc/pseries: Move PLPKS constants to header file

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

Move the constants defined in plpks.c to plpks.h, and standardise their
naming, so that PLPKS consumers can make use of them later on.

Signed-off-by: Russell Currey 
Co-developed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 arch/powerpc/include/asm/plpks.h   | 36 +---
 arch/powerpc/platforms/pseries/plpks.c | 57 ++
 2 files changed, 53 insertions(+), 40 deletions(-)

diff --git a/arch/powerpc/include/asm/plpks.h b/arch/powerpc/include/asm/plpks.h
index 8295502ee93b..6466aadd7145 100644
--- a/arch/powerpc/include/asm/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -14,14 +14,40 @@
 #include 
 #include 
 
-#define OSSECBOOTAUDIT 0x4000
-#define OSSECBOOTENFORCE 0x2000
-#define WORLDREADABLE 0x0800
-#define SIGNEDUPDATE 0x0100
+// Object policy flags from supported_policies
+#define PLPKS_OSSECBOOTAUDIT   PPC_BIT32(1) // OS secure boot must be audit/enforce
+#define PLPKS_OSSECBOOTENFORCE PPC_BIT32(2) // OS secure boot must be enforce
+#define PLPKS_PWSET            PPC_BIT32(3) // No access without password set
+#define PLPKS_WORLDREADABLE    PPC_BIT32(4) // Readable without authentication
+#define PLPKS_IMMUTABLE        PPC_BIT32(5) // Once written, object cannot be removed
+#define PLPKS_TRANSIENT        PPC_BIT32(6) // Object does not persist through reboot
+#define PLPKS_SIGNEDUPDATE     PPC_BIT32(7) // Object can only be modified by signed updates
+#define PLPKS_HVPROVISIONED    PPC_BIT32(28) // Hypervisor has provisioned this object
 
-#define PLPKS_VAR_LINUX        0x02
+// Signature algorithm flags from signed_update_algorithms
+#define PLPKS_ALG_RSA2048      PPC_BIT(0)
+#define PLPKS_ALG_RSA4096      PPC_BIT(1)
+
+// Object label OS metadata flags
+#define PLPKS_VAR_LINUX        0x02
 #define PLPKS_VAR_COMMON       0x04
 
+// Flags for which consumer owns an object
+#define PLPKS_FW_OWNER         0x1
+#define PLPKS_BOOTLOADER_OWNER 0x2
+#define PLPKS_OS_OWNER         0x3
+
+// Flags for label metadata fields
+#define PLPKS_LABEL_VERSION    0
+#define PLPKS_MAX_LABEL_ATTR_SIZE      16
+#define PLPKS_MAX_NAME_SIZE    239
+#define PLPKS_MAX_DATA_SIZE    4000
+
+// Timeouts for PLPKS operations
+#define PLPKS_MAX_TIMEOUT      5000 // msec
+#define PLPKS_FLUSH_SLEEP      10 // msec
+#define PLPKS_FLUSH_SLEEP_RANGE        400
+
 struct plpks_var {
char *component;
u8 *name;
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 955117f81e50..5bdc093de6fb 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -20,19 +20,6 @@
 #include 
 #include 
 
-#define PKS_FW_OWNER           0x1
-#define PKS_BOOTLOADER_OWNER   0x2
-#define PKS_OS_OWNER           0x3
-
-#define LABEL_VERSION  0
-#define MAX_LABEL_ATTR_SIZE 16
-#define MAX_NAME_SIZE  239
-#define MAX_DATA_SIZE  4000
-
-#define PKS_FLUSH_MAX_TIMEOUT 5000 //msec
-#define PKS_FLUSH_SLEEP  10 //msec
-#define PKS_FLUSH_SLEEP_RANGE 400
-
 static u8 *ospassword;
 static u16 ospasswordlength;
 
@@ -59,7 +46,7 @@ struct label_attr {
 
 struct label {
struct label_attr attr;
-   u8 name[MAX_NAME_SIZE];
+   u8 name[PLPKS_MAX_NAME_SIZE];
size_t size;
 };
 
@@ -122,7 +109,7 @@ static int pseries_status_to_err(int rc)
 static int plpks_gen_password(void)
 {
unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
-   u8 *password, consumer = PKS_OS_OWNER;
+   u8 *password, consumer = PLPKS_OS_OWNER;
int rc;
 
password = kzalloc(maxpwsize, GFP_KERNEL);
@@ -158,7 +145,7 @@ static struct plpks_auth *construct_auth(u8 consumer)
 {
struct plpks_auth *auth;
 
-   if (consumer > PKS_OS_OWNER)
+   if (consumer > PLPKS_OS_OWNER)
return ERR_PTR(-EINVAL);
 
auth = kzalloc(struct_size(auth, password, maxpwsize), GFP_KERNEL);
@@ -168,7 +155,7 @@ static struct plpks_auth *construct_auth(u8 consumer)
auth->version = 1;
auth->consumer = consumer;
 
-   if (consumer == PKS_FW_OWNER || consumer == PKS_BOOTLOADER_OWNER)
+   if (consumer == PLPKS_FW_OWNER || consumer == PLPKS_BOOTLOADER_OWNER)
return auth;
 
memcpy(auth->password, ospassword, ospasswordlength);
@@ -188,7 +175,7 @@ static struct label *construct_label(char *component, u8 
varos, u8 *name,
struct label *label;
size_t slen;
 
-   if (!name || namelen > MAX_NAME_SIZE)
+   if (!name || namelen > PLPKS_MAX_NAME_SIZE)
return ERR_PTR(-EINVAL);
 
slen = strlen(component);
@@ -202,9 +189,9 @@ static struct label *construct_label(char *component, u8 
varos, u8 *name,
if (component)
 memcpy(&label->attr.prefix, component, slen);
 
-   label->attr.version = LABEL_VERSION;
+   

[PATCH v3 05/24] powerpc/secvar: Handle max object size in the consumer

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

Currently the max object size is handled in the core secvar code with an
entirely OPAL-specific implementation, so create a new max_size() op and
move the existing implementation into the powernv platform.  Should be
no functional change.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: Change uint64_t type to u64 (mpe)
---
 arch/powerpc/include/asm/secvar.h|  1 +
 arch/powerpc/kernel/secvar-sysfs.c   | 17 +++--
 arch/powerpc/platforms/powernv/opal-secvar.c | 19 +++
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
index 8b6475589120..b2cb9bb7c540 100644
--- a/arch/powerpc/include/asm/secvar.h
+++ b/arch/powerpc/include/asm/secvar.h
@@ -20,6 +20,7 @@ struct secvar_operations {
int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
ssize_t (*format)(char *buf);
+   int (*max_size)(u64 *max_size);
 };
 
 #ifdef CONFIG_PPC_SECURE_BOOT
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index d3858eedd72c..031ef37bca99 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -128,27 +128,16 @@ static struct kobj_type secvar_ktype = {
 static int update_kobj_size(void)
 {
 
-   struct device_node *node;
u64 varsize;
-   int rc = 0;
+   int rc = secvar_ops->max_size(&varsize);
 
-   node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node)) {
-   rc = -ENODEV;
-   goto out;
-   }
-
-   rc = of_property_read_u64(node, "max-var-size", &varsize);
if (rc)
-   goto out;
+   return rc;
 
data_attr.size = varsize;
update_attr.size = varsize;
 
-out:
-   of_node_put(node);
-
-   return rc;
+   return 0;
 }
 
 static int secvar_sysfs_load(void)
diff --git a/arch/powerpc/platforms/powernv/opal-secvar.c 
b/arch/powerpc/platforms/powernv/opal-secvar.c
index 623c6839e66c..c9b9fd3730df 100644
--- a/arch/powerpc/platforms/powernv/opal-secvar.c
+++ b/arch/powerpc/platforms/powernv/opal-secvar.c
@@ -122,11 +122,30 @@ static ssize_t opal_secvar_format(char *buf)
return rc;
 }
 
+static int opal_secvar_max_size(u64 *max_size)
+{
+   int rc;
+   struct device_node *node;
+
+   node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
+
+   rc = of_property_read_u64(node, "max-var-size", max_size);
+
+out:
+   of_node_put(node);
+   return rc;
+}
+
 static const struct secvar_operations opal_secvar_ops = {
.get = opal_get_variable,
.get_next = opal_get_next_variable,
.set = opal_set_variable,
.format = opal_secvar_format,
+   .max_size = opal_secvar_max_size,
 };
 
 static int opal_secvar_probe(struct platform_device *pdev)
-- 
2.39.0



[PATCH v3 10/24] powerpc/secvar: Don't print error on ENOENT when reading variables

2023-01-17 Thread Andrew Donnellan
If attempting to read the size or data attributes of a non-existent
variable (which will be possible after a later patch to expose the PLPKS
via the secvar interface), don't spam the kernel log with error messages.
Only print errors for return codes that aren't ENOENT.

Reported-by: Sudhakar Kuppusamy 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 arch/powerpc/kernel/secvar-sysfs.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 68fb0b857442..2499bfd04fad 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -39,8 +39,8 @@ static ssize_t size_show(struct kobject *kobj, struct 
kobj_attribute *attr,
 
 rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
if (rc) {
-   pr_err("Error retrieving %s variable size %d\n", kobj->name,
-  rc);
+   if (rc != -ENOENT)
+   pr_err("Error retrieving %s variable size %d\n", 
kobj->name, rc);
return rc;
}
 
@@ -57,7 +57,8 @@ static ssize_t data_read(struct file *filep, struct kobject 
*kobj,
 
 rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
if (rc) {
-   pr_err("Error getting %s variable size %d\n", kobj->name, rc);
+   if (rc != -ENOENT)
+   pr_err("Error getting %s variable size %d\n", 
kobj->name, rc);
return rc;
}
pr_debug("dsize is %llu\n", dsize);
-- 
2.39.0



[PATCH v3 09/24] powerpc/secvar: Warn when PAGE_SIZE is smaller than max object size

2023-01-17 Thread Andrew Donnellan
Due to sysfs constraints, when writing to a variable, we can only handle
writes of up to PAGE_SIZE.

It's possible that the maximum object size is larger than PAGE_SIZE, in
which case, print a warning on boot so that the user is aware.

Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: New patch (ajd)
---
 arch/powerpc/kernel/secvar-sysfs.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index d9352d4be87b..68fb0b857442 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -217,6 +217,7 @@ static int secvar_sysfs_load_static(void)
 
 static int secvar_sysfs_init(void)
 {
+   u64 max_size;
int rc;
 
if (!secvar_ops) {
@@ -266,6 +267,14 @@ static int secvar_sysfs_init(void)
goto err;
}
 
+   // Due to sysfs limitations, we will only ever get a write buffer of
+   // up to 1 page in size. Print a warning if this is potentially going
+   // to cause problems, so that the user is aware.
+   secvar_ops->max_size(&max_size);
+   if (max_size > PAGE_SIZE)
+   pr_warn_ratelimited("PAGE_SIZE (%lu) is smaller than maximum 
object size (%llu), writes are limited to PAGE_SIZE\n",
+   PAGE_SIZE, max_size);
+
return 0;
 err:
kobject_put(secvar_kobj);
-- 
2.39.0



[PATCH v3 08/24] powerpc/secvar: Allow backend to populate static list of variable names

2023-01-17 Thread Andrew Donnellan
Currently, the list of variables is populated by calling
secvar_ops->get_next() repeatedly, which is explicitly modelled on the
OPAL API (including the keylen parameter).

For the upcoming PLPKS backend, we have a static list of variable names.
It is messy to fit that into get_next(), so instead, let the backend put
a NULL-terminated array of variable names into secvar_ops->var_names,
which will be used if get_next() is undefined.
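
As a sketch, a backend with fixed names would provide something like the
following (the variable names and the example_* identifiers are
placeholders; the real PLPKS list comes later in the series):

  static const char * const example_var_names[] = {
          "PK", "KEK", "db", "dbx",
          NULL,                   /* terminator, required */
  };

  static const struct secvar_operations example_secvar_ops = {
          /* .get, .set, .format, .max_size as usual; no .get_next needed */
          .var_names = example_var_names,
  };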

Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: New patch (ajd/mpe)
---
 arch/powerpc/include/asm/secvar.h  |  4 ++
 arch/powerpc/kernel/secvar-sysfs.c | 67 --
 2 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
index ebf95386d720..c8bee1834b54 100644
--- a/arch/powerpc/include/asm/secvar.h
+++ b/arch/powerpc/include/asm/secvar.h
@@ -23,6 +23,10 @@ struct secvar_operations {
ssize_t (*format)(char *buf);
int (*max_size)(u64 *max_size);
const struct attribute **config_attrs;
+
+   // NULL-terminated array of fixed variable names
+   // Only used if get_next() isn't provided
+   const char * const *var_names;
 };
 
 #ifdef CONFIG_PPC_SECURE_BOOT
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index b82e95a2e415..d9352d4be87b 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -153,9 +153,31 @@ static int secvar_sysfs_config(struct kobject *kobj)
return 0;
 }
 
-static int secvar_sysfs_load(void)
+static int add_var(const char *name)
 {
struct kobject *kobj;
+   int rc;
+
+   kobj = kzalloc(sizeof(*kobj), GFP_KERNEL);
+   if (!kobj)
+   return -ENOMEM;
+
+   kobject_init(kobj, _ktype);
+
+   rc = kobject_add(kobj, &secvar_kset->kobj, "%s", name);
+   if (rc) {
+   pr_warn("kobject_add error %d for attribute: %s\n", rc,
+   name);
+   kobject_put(kobj);
+   return rc;
+   }
+
+   kobject_uevent(kobj, KOBJ_ADD);
+   return 0;
+}
+
+static int secvar_sysfs_load(void)
+{
u64 namesize = 0;
char *name;
int rc;
@@ -173,31 +195,26 @@ static int secvar_sysfs_load(void)
break;
}
 
-   kobj = kzalloc(sizeof(*kobj), GFP_KERNEL);
-   if (!kobj) {
-   rc = -ENOMEM;
-   break;
-   }
-
-   kobject_init(kobj, _ktype);
-
-   rc = kobject_add(kobj, &secvar_kset->kobj, "%s", name);
-   if (rc) {
-   pr_warn("kobject_add error %d for attribute: %s\n", rc,
-   name);
-   kobject_put(kobj);
-   kobj = NULL;
-   }
-
-   if (kobj)
-   kobject_uevent(kobj, KOBJ_ADD);
-
+   rc = add_var(name);
} while (!rc);
 
kfree(name);
return rc;
 }
 
+static int secvar_sysfs_load_static(void)
+{
+   const char * const *name_ptr = secvar_ops->var_names;
+   int rc;
+   while (*name_ptr) {
+   rc = add_var(*name_ptr);
+   if (rc)
+   return rc;
+   name_ptr++;
+   }
+   return 0;
+}
+
 static int secvar_sysfs_init(void)
 {
int rc;
@@ -239,7 +256,15 @@ static int secvar_sysfs_init(void)
goto err;
}
 
-   secvar_sysfs_load();
+   if (secvar_ops->get_next)
+   rc = secvar_sysfs_load();
+   else
+   rc = secvar_sysfs_load_static();
+
+   if (rc) {
+   pr_err("Failed to create variable attributes\n");
+   goto err;
+   }
 
return 0;
 err:
-- 
2.39.0



[PATCH v3 11/24] powerpc/pseries: Move plpks.h to include directory

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

Move plpks.h from platforms/pseries/ to include/asm/. This is necessary
for later patches to make use of the PLPKS from code in other subsystems.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v3: New patch
---
 .../powerpc/{platforms/pseries => include/asm}/plpks.h | 10 +++---
 arch/powerpc/platforms/pseries/plpks.c |  3 +--
 2 files changed, 8 insertions(+), 5 deletions(-)
 rename arch/powerpc/{platforms/pseries => include/asm}/plpks.h (89%)

diff --git a/arch/powerpc/platforms/pseries/plpks.h 
b/arch/powerpc/include/asm/plpks.h
similarity index 89%
rename from arch/powerpc/platforms/pseries/plpks.h
rename to arch/powerpc/include/asm/plpks.h
index 275ccd86bfb5..8295502ee93b 100644
--- a/arch/powerpc/platforms/pseries/plpks.h
+++ b/arch/powerpc/include/asm/plpks.h
@@ -6,8 +6,10 @@
  * Platform keystore for pseries LPAR(PLPKS).
  */
 
-#ifndef _PSERIES_PLPKS_H
-#define _PSERIES_PLPKS_H
+#ifndef _ASM_POWERPC_PLPKS_H
+#define _ASM_POWERPC_PLPKS_H
+
+#ifdef CONFIG_PSERIES_PLPKS
 
 #include 
 #include 
@@ -68,4 +70,6 @@ int plpks_read_fw_var(struct plpks_var *var);
  */
 int plpks_read_bootloader_var(struct plpks_var *var);
 
-#endif
+#endif // CONFIG_PSERIES_PLPKS
+
+#endif // _ASM_POWERPC_PLPKS_H
diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 4edd1585e245..955117f81e50 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -18,8 +18,7 @@
 #include 
 #include 
 #include 
-
-#include "plpks.h"
+#include 
 
 #define PKS_FW_OWNER           0x1
 #define PKS_BOOTLOADER_OWNER 0x2
-- 
2.39.0



[PATCH v3 13/24] powerpc/pseries: Fix handling of PLPKS object flushing timeout

2023-01-17 Thread Andrew Donnellan
plpks_confirm_object_flushed() uses the H_PKS_CONFIRM_OBJECT_FLUSHED hcall
to check whether changes to an object in the Platform KeyStore have been
flushed to non-volatile storage.

The hcall returns two output values, the return code and the flush status.
plpks_confirm_object_flushed() polls the hcall until either the flush
status has updated, the return code is an error, or a timeout has been
exceeded.

While we're still polling, the hcall is returning H_SUCCESS (0) as the
return code. In the timeout case, this means that upon exiting the polling
loop, rc is 0, and therefore 0 is returned to the user.

Handle the timeout case separately and return ETIMEDOUT if triggered.

Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore")
Reported-by: Benjamin Gray 
Signed-off-by: Andrew Donnellan 
Tested-by: Russell Currey 
Reviewed-by: Russell Currey 
Signed-off-by: Russell Currey 

---

v3: Merge plpks fixes and signed update series with secvar series

Neaten how we return at the end of the function (ruscur)
---
 arch/powerpc/platforms/pseries/plpks.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/plpks.c 
b/arch/powerpc/platforms/pseries/plpks.c
index 5bdc093de6fb..6d1303e4862d 100644
--- a/arch/powerpc/platforms/pseries/plpks.c
+++ b/arch/powerpc/platforms/pseries/plpks.c
@@ -234,6 +234,7 @@ static int plpks_confirm_object_flushed(struct label *label,
struct plpks_auth *auth)
 {
unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
+   bool timed_out = true;
u64 timeout = 0;
u8 status;
int rc;
@@ -245,22 +246,26 @@ static int plpks_confirm_object_flushed(struct label 
*label,
 
status = retbuf[0];
if (rc) {
+   timed_out = false;
if (rc == H_NOT_FOUND && status == 1)
rc = 0;
break;
}
 
-   if (!rc && status == 1)
+   if (!rc && status == 1) {
+   timed_out = false;
break;
+   }
 
usleep_range(PLPKS_FLUSH_SLEEP,
 PLPKS_FLUSH_SLEEP + PLPKS_FLUSH_SLEEP_RANGE);
timeout = timeout + PLPKS_FLUSH_SLEEP;
} while (timeout < PLPKS_MAX_TIMEOUT);
 
-   rc = pseries_status_to_err(rc);
+   if (timed_out)
+   return -ETIMEDOUT;
 
-   return rc;
+   return pseries_status_to_err(rc);
 }
 
 int plpks_write_var(struct plpks_var var)
-- 
2.39.0



[PATCH v3 06/24] powerpc/secvar: Clean up init error messages

2023-01-17 Thread Andrew Donnellan
Remove unnecessary prefixes from error messages in secvar_sysfs_init()
(the file defines pr_fmt, so putting "secvar:" in every message is
unnecessary). Make capitalisation and punctuation more consistent.

Signed-off-by: Andrew Donnellan 
Signed-off-by: Russell Currey 

---

v3: New patch (ajd)
---
 arch/powerpc/kernel/secvar-sysfs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 031ef37bca99..9f0e49bf3903 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -190,13 +190,13 @@ static int secvar_sysfs_init(void)
int rc;
 
if (!secvar_ops) {
-   pr_warn("secvar: failed to retrieve secvar operations.\n");
+   pr_warn("Failed to retrieve secvar operations\n");
return -ENODEV;
}
 
secvar_kobj = kobject_create_and_add("secvar", firmware_kobj);
if (!secvar_kobj) {
-   pr_err("secvar: Failed to create firmware kobj\n");
+   pr_err("Failed to create firmware kobj\n");
return -ENOMEM;
}
 
@@ -208,7 +208,7 @@ static int secvar_sysfs_init(void)
 
secvar_kset = kset_create_and_add("vars", NULL, secvar_kobj);
if (!secvar_kset) {
-   pr_err("secvar: sysfs kobject registration failed.\n");
+   pr_err("sysfs kobject registration failed\n");
kobject_put(secvar_kobj);
return -ENOMEM;
}
-- 
2.39.0



[PATCH v3 04/24] powerpc/secvar: Handle format string in the consumer

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The code that handles the format string in secvar-sysfs.c is entirely
OPAL specific, so create a new "format" op in secvar_operations to make
the secvar code more generic.  No functional change.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v2: Use sysfs_emit() instead of sprintf() (gregkh)

v3: Enforce format string size limit (ruscur)
---
 arch/powerpc/include/asm/secvar.h|  3 +++
 arch/powerpc/kernel/secvar-sysfs.c   | 23 --
 arch/powerpc/platforms/powernv/opal-secvar.c | 25 
 3 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
index 07ba36f868a7..8b6475589120 100644
--- a/arch/powerpc/include/asm/secvar.h
+++ b/arch/powerpc/include/asm/secvar.h
@@ -11,12 +11,15 @@
 #include 
 #include 
 
+#define SECVAR_MAX_FORMAT_LEN  30 // max length of string returned by 
->format()
+
 extern const struct secvar_operations *secvar_ops;
 
 struct secvar_operations {
int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size);
int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
+   ssize_t (*format)(char *buf);
 };
 
 #ifdef CONFIG_PPC_SECURE_BOOT
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 462cacc0ca60..d3858eedd72c 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -21,26 +21,13 @@ static struct kset *secvar_kset;
 static ssize_t format_show(struct kobject *kobj, struct kobj_attribute *attr,
   char *buf)
 {
-   ssize_t rc = 0;
-   struct device_node *node;
-   const char *format;
-
-   node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node)) {
-   rc = -ENODEV;
-   goto out;
-   }
+   char tmp[SECVAR_MAX_FORMAT_LEN];
+   ssize_t len = secvar_ops->format(tmp);
 
-   rc = of_property_read_string(node, "format", &format);
-   if (rc)
-   goto out;
+   if (len <= 0)
+   return -EIO;
 
-   rc = sysfs_emit(buf, "%s\n", format);
-
-out:
-   of_node_put(node);
-
-   return rc;
+   return sysfs_emit(buf, "%s\n", tmp);
 }
 
 
diff --git a/arch/powerpc/platforms/powernv/opal-secvar.c 
b/arch/powerpc/platforms/powernv/opal-secvar.c
index ef89861569e0..623c6839e66c 100644
--- a/arch/powerpc/platforms/powernv/opal-secvar.c
+++ b/arch/powerpc/platforms/powernv/opal-secvar.c
@@ -98,10 +98,35 @@ static int opal_set_variable(const char *key, u64 ksize, u8 
*data, u64 dsize)
return opal_status_to_err(rc);
 }
 
+static ssize_t opal_secvar_format(char *buf)
+{
+   ssize_t rc = 0;
+   struct device_node *node;
+   const char *format;
+
+   node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
+
+   rc = of_property_read_string(node, "format", &format);
+   if (rc)
+   goto out;
+
+   rc = snprintf(buf, SECVAR_MAX_FORMAT_LEN, "%s", format);
+
+out:
+   of_node_put(node);
+
+   return rc;
+}
+
 static const struct secvar_operations opal_secvar_ops = {
.get = opal_get_variable,
.get_next = opal_get_next_variable,
.set = opal_set_variable,
+   .format = opal_secvar_format,
 };
 
 static int opal_secvar_probe(struct platform_device *pdev)
-- 
2.39.0



[PATCH v3 00/24] pSeries dynamic secure boot secvar interface + platform keyring loading

2023-01-17 Thread Andrew Donnellan
This series exposes an interface to userspace for reading and writing
secure variables contained within the PowerVM LPAR Platform KeyStore
(PLPKS) for the purpose of configuring dynamic secure boot, and adds
the glue required to load keys from the PLPKS into the platform keyring.

This series builds on past work by Nayna Jain[0] in exposing PLPKS
variables to userspace.  Rather than being a generic interface for
interacting with the keystore, however, we use the existing powerpc
secvar infrastructure to only expose objects in the keystore used
for dynamic secure boot.  This has the benefit of leveraging an
existing interface and making the implementation relatively minimal.

This series integrates a previous series to fix some bugs in PLPKS
and implement support for signed updates[1], and a cleanup patch from
Michael Ellerman[2].

There are a few relevant details to note about the implementation:

 * New additions to the secvar API: format(), max_size(), config_attrs,
   var_names

 * New optional sysfs directory "config/" for arbitrary ASCII variables

 * Some OPAL-specific code has been relocated from secvar-sysfs.c to
   powernv platform code.  Would appreciate any powernv testing!

 * Variable names are fixed and only those used for secure boot are
   exposed.  This is not a generic PLPKS interface, but also doesn't
   preclude one being added in future.

With this series, both powernv and pseries platforms support dynamic
secure boot through the same interface.

[0]: 
https://lore.kernel.org/linuxppc-dev/20221106210744.603240-1-na...@linux.ibm.com/
[1]: 
https://lore.kernel.org/linuxppc-dev/20221220071626.1426786-1-...@linux.ibm.com/
[2]: 
https://lore.kernel.org/linuxppc-dev/20230112023819.1692452-1-...@ellerman.id.au/

v1: 
https://lore.kernel.org/linuxppc-dev/20221228072943.429266-1-rus...@russell.cc/
v2: 
https://lore.kernel.org/linuxppc-dev/20221230042014.154483-1-rus...@russell.cc/

=

Changes in v3:

Integrate Andrew's PLPKS bugfixes and enhancements series and Michael
Ellerman's u64 cleanup patch into this series (and update the other
patches to use u64)

New patches to load keys from the PLPKS into the kernel's platform
keyring (ruscur)

New patches to pass PLPKS password to new kernels when kexecing
(ruscur)

Improve handling of format strings (ruscur)

Clean up secvar error messages (ajd)

Merge config attributes into secvar_operations (mpe)

Add a new static variable names API rather than (ab)using get_next()
(ajd/mpe)

Warning message when PAGE_SIZE is smaller than the max object size
(ajd)

Move plpks.h to the include directory, and move a bunch of constants
in there with a consistent naming scheme

Refresh PLPKS config values whenever plpks_get_usedspace() is called
(ajd)

Extra validation on PLPKS config values (ruscur)

Return maxobjlabelsize to userspace as is without subtracting overhead 
(ruscur)

Fix error code handling in plpks_confirm_object_flushed() (ruscur)

Pass plpks_var struct to plpks_signed_update_var() by reference (mpe)

Make the data buffer in plpks_read_var() caller-allocated to reduce
number of allocations/copies (mpe)

Rework the Kconfig options so that PSERIES_PLPKS is a hidden option,
turned on by enabling PPC_SECURE_BOOT, and the PLPKS secvar code is
activated by PPC_SECVAR_SYSFS to match powernv (ajd)

Use machine_arch_initcall() rather than device_initcall() so we don't
break powernv (mpe)

Improve ABI documentation (mpe)

Return -EIO on most read errors (mpe)

Add "grubdbx" variable (Sudhakar)

Use utf8s_to_utf16s() rather than our own "UCS-2" conversion code (mpe)

Fix SB_VERSION data length (ruscur)

Stop prepending policy data on read (ruscur)

Don't print errors to the kernel log when reading non-existent
variables (Sudhakar)

Miscellaneous code style, checkpatch cleanups

Changes in v2:

Remove unnecessary config vars from sysfs and document the others,
thanks to review from Greg.  If we end up needing to expose more, we
can add them later and update the docs.

Use sysfs_emit() instead of sprintf() for all sysfs strings

Change the size of the sysfs binary attributes to include the 8-byte
flags header, preventing truncation of large writes.

Andrew Donnellan (8):
  powerpc/secvar: Clean up init error messages
  powerpc/secvar: Allow backend to populate static list of variable
names
  powerpc/secvar: Warn when PAGE_SIZE is smaller than max object size
  powerpc/secvar: Don't print error on ENOENT when reading variables
  powerpc/pseries: Fix handling of PLPKS object flushing timeout
  powerpc/pseries: Fix alignment of PLPKS structures and buffers
  powerpc/pseries: Make caller pass buffer to plpks_read_var()
  powerpc/pseries: Turn PSERIES_PLPKS into a hidden option

Michael Ellerman (1):
  powerpc/secvar: Use u64 in secvar_operations

Nayna Jain (2):
  

[PATCH v3 02/24] powerpc/secvar: WARN_ON_ONCE() if multiple secvar ops are set

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The secvar code only supports one consumer at a time.

Multiple consumers aren't possible at this point in time, but we'd want
it to be obvious if it ever could happen.

Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 
---
 arch/powerpc/kernel/secvar-ops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/secvar-ops.c b/arch/powerpc/kernel/secvar-ops.c
index 6a29777d6a2d..aa1b2adc2710 100644
--- a/arch/powerpc/kernel/secvar-ops.c
+++ b/arch/powerpc/kernel/secvar-ops.c
@@ -8,10 +8,12 @@
 
 #include 
 #include 
+#include 
 
-const struct secvar_operations *secvar_ops __ro_after_init;
+const struct secvar_operations *secvar_ops __ro_after_init = NULL;
 
 void set_secvar_ops(const struct secvar_operations *ops)
 {
+   WARN_ON_ONCE(secvar_ops);
secvar_ops = ops;
 }
-- 
2.39.0



[PATCH v3 01/24] powerpc/secvar: Use u64 in secvar_operations

2023-01-17 Thread Andrew Donnellan
From: Michael Ellerman 

There's no reason for secvar_operations to use uint64_t vs the more
common kernel type u64.

The types are compatible, but they require different printk format
strings which can lead to confusion.

Change all the secvar related routines to use u64.

Signed-off-by: Michael Ellerman 
Reviewed-by: Russell Currey 
Reviewed-by: Andrew Donnellan 
Signed-off-by: Andrew Donnellan 

---

v3: Include new patch
---
 arch/powerpc/include/asm/secvar.h| 9 +++--
 arch/powerpc/kernel/secvar-sysfs.c   | 8 
 arch/powerpc/platforms/powernv/opal-secvar.c | 9 +++--
 security/integrity/platform_certs/load_powerpc.c | 4 ++--
 4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
index 4cc35b58b986..07ba36f868a7 100644
--- a/arch/powerpc/include/asm/secvar.h
+++ b/arch/powerpc/include/asm/secvar.h
@@ -14,12 +14,9 @@
 extern const struct secvar_operations *secvar_ops;
 
 struct secvar_operations {
-   int (*get)(const char *key, uint64_t key_len, u8 *data,
-  uint64_t *data_size);
-   int (*get_next)(const char *key, uint64_t *key_len,
-   uint64_t keybufsize);
-   int (*set)(const char *key, uint64_t key_len, u8 *data,
-  uint64_t data_size);
+   int (*get)(const char *key, u64 key_len, u8 *data, u64 *data_size);
+   int (*get_next)(const char *key, u64 *key_len, u64 keybufsize);
+   int (*set)(const char *key, u64 key_len, u8 *data, u64 data_size);
 };
 
 #ifdef CONFIG_PPC_SECURE_BOOT
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 1ee4640a2641..001cdbcdb9d2 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -47,7 +47,7 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
 static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
 char *buf)
 {
-   uint64_t dsize;
+   u64 dsize;
int rc;
 
rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
@@ -64,8 +64,8 @@ static ssize_t data_read(struct file *filep, struct kobject 
*kobj,
 struct bin_attribute *attr, char *buf, loff_t off,
 size_t count)
 {
-   uint64_t dsize;
char *data;
+   u64 dsize;
int rc;
 
rc = secvar_ops->get(kobj->name, strlen(kobj->name) + 1, NULL, &dsize);
@@ -166,9 +166,9 @@ static int update_kobj_size(void)
 
 static int secvar_sysfs_load(void)
 {
-   char *name;
-   uint64_t namesize = 0;
struct kobject *kobj;
+   u64 namesize = 0;
+   char *name;
int rc;
 
name = kzalloc(NAME_MAX_SIZE, GFP_KERNEL);
diff --git a/arch/powerpc/platforms/powernv/opal-secvar.c 
b/arch/powerpc/platforms/powernv/opal-secvar.c
index 14133e120bdd..ef89861569e0 100644
--- a/arch/powerpc/platforms/powernv/opal-secvar.c
+++ b/arch/powerpc/platforms/powernv/opal-secvar.c
@@ -54,8 +54,7 @@ static int opal_status_to_err(int rc)
return err;
 }
 
-static int opal_get_variable(const char *key, uint64_t ksize,
-u8 *data, uint64_t *dsize)
+static int opal_get_variable(const char *key, u64 ksize, u8 *data, u64 *dsize)
 {
int rc;
 
@@ -71,8 +70,7 @@ static int opal_get_variable(const char *key, uint64_t ksize,
return opal_status_to_err(rc);
 }
 
-static int opal_get_next_variable(const char *key, uint64_t *keylen,
- uint64_t keybufsize)
+static int opal_get_next_variable(const char *key, u64 *keylen, u64 keybufsize)
 {
int rc;
 
@@ -88,8 +86,7 @@ static int opal_get_next_variable(const char *key, uint64_t 
*keylen,
return opal_status_to_err(rc);
 }
 
-static int opal_set_variable(const char *key, uint64_t ksize, u8 *data,
-uint64_t dsize)
+static int opal_set_variable(const char *key, u64 ksize, u8 *data, u64 dsize)
 {
int rc;
 
diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
index a2900cb85357..1e4f80a4e71c 100644
--- a/security/integrity/platform_certs/load_powerpc.c
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -18,7 +18,7 @@
 /*
  * Get a certificate list blob from the named secure variable.
  */
-static __init void *get_cert_list(u8 *key, unsigned long keylen, uint64_t 
*size)
+static __init void *get_cert_list(u8 *key, unsigned long keylen, u64 *size)
 {
int rc;
void *db;
@@ -51,7 +51,7 @@ static __init void *get_cert_list(u8 *key, unsigned long 
keylen, uint64_t *size)
 static int __init load_powerpc_certs(void)
 {
void *db = NULL, *dbx = NULL;
-   uint64_t dbsize = 0, dbxsize = 0;
+   u64 dbsize = 0, dbxsize = 0;
int rc = 0;
struct device_node *node;
 
-- 
2.39.0



[PATCH v3 03/24] powerpc/secvar: Use sysfs_emit() instead of sprintf()

2023-01-17 Thread Andrew Donnellan
From: Russell Currey 

The secvar format string and object size sysfs files are both ASCII
text, and should use sysfs_emit().  No functional change.

Suggested-by: Greg Kroah-Hartman 
Signed-off-by: Russell Currey 
Signed-off-by: Andrew Donnellan 

---

v2: New patch (gregkh)
---
 arch/powerpc/kernel/secvar-sysfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index 001cdbcdb9d2..462cacc0ca60 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -35,7 +35,7 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
if (rc)
goto out;
 
-   rc = sprintf(buf, "%s\n", format);
+   rc = sysfs_emit(buf, "%s\n", format);
 
 out:
of_node_put(node);
@@ -57,7 +57,7 @@ static ssize_t size_show(struct kobject *kobj, struct 
kobj_attribute *attr,
return rc;
}
 
-   return sprintf(buf, "%llu\n", dsize);
+   return sysfs_emit(buf, "%llu\n", dsize);
 }
 
 static ssize_t data_read(struct file *filep, struct kobject *kobj,
-- 
2.39.0



[PATCH 4/8] perf/core: Add perf_sample_save_brstack() helper

2023-01-17 Thread Namhyung Kim
When saving the branch stack to the perf sample data, the sample flags
and the dynamic size also need to be updated.  To ensure this, add the
perf_sample_save_brstack() helper and convert all call sites.
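
For reference, a minimal sketch of what such a helper might look like,
inferred from the call sites converted below (this is not a verbatim copy
of the definition added to include/linux/perf_event.h, and the dyn_size
bookkeeping shown here is an assumption):

    static inline void perf_sample_save_brstack(struct perf_sample_data *data,
                                                struct perf_event *event,
                                                struct perf_branch_stack *brs)
    {
            int size = sizeof(u64);                 /* nr */

            if (branch_sample_hw_index(event))
                    size += sizeof(u64);            /* hw_idx */
            size += brs->nr * sizeof(struct perf_branch_entry);

            /* what every converted call site used to do by hand */
            data->br_stack = brs;
            data->dyn_size += size;
            data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
    }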

Cc: linuxppc-dev@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Peter Zijlstra 
Acked-by: Jiri Olsa 
Tested-by: Jiri Olsa 
Signed-off-by: Namhyung Kim 
---
 arch/powerpc/perf/core-book3s.c |  3 +-
 arch/x86/events/amd/core.c  |  6 +--
 arch/x86/events/intel/core.c|  6 +--
 arch/x86/events/intel/ds.c  |  9 ++---
 include/linux/perf_event.h  | 66 -
 kernel/events/core.c| 16 +++-
 6 files changed, 53 insertions(+), 53 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index bf318dd9b709..8c1f7def596e 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2313,8 +2313,7 @@ static void record_and_restart(struct perf_event *event, 
unsigned long val,
struct cpu_hw_events *cpuhw;
cpuhw = this_cpu_ptr(&cpu_hw_events);
power_pmu_bhrb_read(event, cpuhw);
-   data.br_stack = &cpuhw->bhrb_stack;
-   data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
+   perf_sample_save_brstack(&data, event, &cpuhw->bhrb_stack);
}
 
if (event->attr.sample_type & PERF_SAMPLE_DATA_SRC &&
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index d6f3703e4119..463f3eb8bbd7 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -928,10 +928,8 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
if (!x86_perf_event_set_period(event))
continue;
 
-   if (has_branch_stack(event)) {
-   data.br_stack = &cpuc->lbr_stack;
-   data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
-   }
+   if (has_branch_stack(event))
+   perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
 
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 29d2d0411caf..14f0a746257d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3036,10 +3036,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
status)
 
perf_sample_data_init(, 0, event->hw.last_period);
 
-   if (has_branch_stack(event)) {
-   data.br_stack = &cpuc->lbr_stack;
-   data.sample_flags |= PERF_SAMPLE_BRANCH_STACK;
-   }
+   if (has_branch_stack(event))
+   perf_sample_save_brstack(&data, event, &cpuc->lbr_stack);
 
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 158cf845fc80..07c8a2cdc3ee 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1720,10 +1720,8 @@ static void setup_pebs_fixed_sample_data(struct 
perf_event *event,
data->sample_flags |= PERF_SAMPLE_TIME;
}
 
-   if (has_branch_stack(event)) {
-   data->br_stack = &cpuc->lbr_stack;
-   data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
-   }
+   if (has_branch_stack(event))
+   perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
 }
 
 static void adaptive_pebs_save_regs(struct pt_regs *regs,
@@ -1883,8 +1881,7 @@ static void setup_pebs_adaptive_sample_data(struct 
perf_event *event,
 
if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
-   data->br_stack = &cpuc->lbr_stack;
-   data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
+   perf_sample_save_brstack(data, event, &cpuc->lbr_stack);
}
}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 569dfac5887f..7db0e9cc2682 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1102,6 +1102,31 @@ extern u64 perf_event_read_value(struct perf_event 
*event,
 
 extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, 
struct pt_regs *regs);
 
+static inline bool branch_sample_no_flags(const struct perf_event *event)
+{
+   return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_NO_FLAGS;
+}
+
+static inline bool branch_sample_no_cycles(const struct perf_event *event)
+{
+   return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_NO_CYCLES;
+}
+
+static inline bool branch_sample_type(const struct perf_event *event)
+{
+   return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_TYPE_SAVE;
+}
+
+static inline bool branch_sample_hw_index(const struct perf_event *event)
+{
+   return 

[PATCH V2 3/3] skiboot: Update IMC PMU node names for power10

2023-01-17 Thread Athira Rajeev
The nest IMC (In Memory Collection) Performance Monitoring
Unit (PMU) node names are kept in the nest_pmus[] array in the
"hw/imc.c" IMC code. Not all of the IMC PMUs listed in the
device tree may be available. Nest IMC PMU names along with
their bit values are represented in the IMC availability vector,
and the array is used to remove the unavailable nodes by checking
that vector.

For power10, the imc_chip_avl_vector, i.e. the IMC availability
vector (which is part of the IMC control block structure), has a
different mapping of units to bit positions. Hence rename the
existing nest_pmus array to nest_pmus_p9 and add an entry for
power10 as nest_pmus_p10.

The avl_vector also changes in bit positions 11:34. These bits
describe the availability of Xlink/Alink/CAPI: there are eight
links in total, and a three-bit field per link indicates which
link type is available. The patch implements all of these changes
to handle nest_pmus_p10.
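
As an illustration only, decoding the per-link three-bit fields could look
roughly like the sketch below. The loop bounds follow the 11:34 bit range
described above (big-endian bit numbering, as used by PPC_BITMASK); the
meaning assigned to each three-bit value is a placeholder assumption, not
the actual encoding used by the microcode:

	/* bits 11:34 of avl_vec: 8 links, 3 bits per link */
	for (j = 0; j < 8; j++) {
		int shift = 63 - (13 + j * 3);	/* LSB of link j's field */
		int val = (avl_vec >> shift) & 0x7;

		if (val == 0) {
			/* link j unavailable: remove its device node */
		} else {
			/* hypothetical: val selects xlink/alink/CAPI for link j */
		}
	}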

Signed-off-by: Athira Rajeev 
---
Changelog:
v1 -> v2:
- Addressed review comment from Dan to update
  the utility function to search and compare
  up to "@". Renamed it as dt_find_by_name_substr.

 hw/imc.c | 195 ---
 1 file changed, 185 insertions(+), 10 deletions(-)

diff --git a/hw/imc.c b/hw/imc.c
index 805e6cc1..20bc51b4 100644
--- a/hw/imc.c
+++ b/hw/imc.c
@@ -49,7 +49,7 @@ static unsigned int *htm_scom_index;
  * imc_chip_avl_vector(in struct imc_chip_cb, look at include/imc.h).
  * nest_pmus[] is an array containing all the possible nest IMC PMU node names.
  */
-static char const *nest_pmus[] = {
+static char const *nest_pmus_p9[] = {
"powerbus0@",
"mcs0@",
"mcs1@",
@@ -104,6 +104,67 @@ static char const *nest_pmus[] = {
/* reserved bits : 51 - 63 */
 };
 
+static char const *nest_pmus_p10[] = {
+   "pb@",
+   "mcs0@",
+   "mcs1@",
+   "mcs2@",
+   "mcs3@",
+   "mcs4@",
+   "mcs5@",
+   "mcs6@",
+   "mcs7@",
+   "pec0@",
+   "pec1@",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "NA",
+   "phb0@",
+   "phb1@",
+   "phb2@",
+   "phb3@",
+   "phb4@",
+   "phb5@",
+   "ocmb0@",
+   "ocmb1@",
+   "ocmb2@",
+   "ocmb3@",
+   "ocmb4@",
+   "ocmb5@",
+   "ocmb6@",
+   "ocmb7@",
+   "ocmb8@",
+   "ocmb9@",
+   "ocmb10@",
+   "ocmb11@",
+   "ocmb12@",
+   "ocmb13@",
+   "ocmb14@",
+   "ocmb15@",
+   "nx@",
+};
+
 /*
  * Due to Nest HW/OCC restriction, microcode will not support individual unit
  * events for these nest units mcs0, mcs1 ... mcs7 in the accumulation mode.
@@ -371,7 +432,7 @@ static void disable_unavailable_units(struct dt_node *dev)
uint64_t avl_vec;
struct imc_chip_cb *cb;
struct dt_node *target;
-   int i;
+   int i, j;
bool disable_all_nests = false;
struct proc_chip *chip;
 
@@ -409,14 +470,128 @@ static void disable_unavailable_units(struct dt_node 
*dev)
avl_vec = (0xffULL) << 56;
}
 
-   for (i = 0; i < ARRAY_SIZE(nest_pmus); i++) {
-   if (!(PPC_BITMASK(i, i) & avl_vec)) {
-   /* Check if the device node exists */
-   target = dt_find_by_name_substr(dev, nest_pmus[i]);
-   if (!target)
-   continue;
-   /* Remove the device node */
-   dt_free(target);
+   if (proc_gen == proc_gen_p9) {
+   for (i = 0; i < ARRAY_SIZE(nest_pmus_p9); i++) {
+   if (!(PPC_BITMASK(i, i) & avl_vec)) {
+   /* Check if the device node exists */
+   target = dt_find_by_name_substr(dev, 
nest_pmus_p9[i]);
+   if (!target)
+   continue;
+   /* Remove the device node */
+   dt_free(target);
+   }
+   }
+   } else if (proc_gen == proc_gen_p10) {
+   int val;
+   char al[8], xl[8], otl[8], phb[8];
+   for (i = 0; i < 11; i++) {
+   if (!(PPC_BITMASK(i, i) & avl_vec)) {
+   /* Check if the device node exists */
+   target = dt_find_by_name_substr(dev, 
nest_pmus_p10[i]);
+   if (!target)
+   continue;
+   /* Remove the device node */
+   dt_free(target);
+   }
+   }
+
+   for 

[PATCH V2 2/3] skiboot: Update IMC code to use dt_find_by_name_substr for checking dt nodes

2023-01-17 Thread Athira Rajeev
The nest IMC (In Memory Collection) Performance Monitoring
Unit (PMU) node names are saved in the nest_pmus[] array in the
"hw/imc.c" IMC code. Not all of the IMC PMUs listed in the device
tree may be available. Nest IMC PMU names along with their
bit values are represented in the IMC availability vector.
The nest_pmus[] array is used to remove the unavailable nodes
by checking this vector.

To check node availability, the code was using dt_find_by_name().
But since the node names have a format like "name@offset",
dt_find_by_name() doesn't return the expected result. Fix this
by using dt_find_by_name_substr(). Also, update the char array
to use the correct node names with the "@" suffix.

Signed-off-by: Athira Rajeev 
---
Changelog:
v1 -> v2:
- Addressed review comment from Dan to update
  the utility function to search and compare
  up to "@". Renamed it as dt_find_by_name_substr.
  Hence used the existing nest_pmus char array to
  update node names with the "@" suffix.

 hw/imc.c | 104 +++
 1 file changed, 52 insertions(+), 52 deletions(-)

diff --git a/hw/imc.c b/hw/imc.c
index 97e0809f..805e6cc1 100644
--- a/hw/imc.c
+++ b/hw/imc.c
@@ -50,57 +50,57 @@ static unsigned int *htm_scom_index;
  * nest_pmus[] is an array containing all the possible nest IMC PMU node names.
  */
 static char const *nest_pmus[] = {
-   "powerbus0",
-   "mcs0",
-   "mcs1",
-   "mcs2",
-   "mcs3",
-   "mcs4",
-   "mcs5",
-   "mcs6",
-   "mcs7",
-   "mba0",
-   "mba1",
-   "mba2",
-   "mba3",
-   "mba4",
-   "mba5",
-   "mba6",
-   "mba7",
-   "cen0",
-   "cen1",
-   "cen2",
-   "cen3",
-   "cen4",
-   "cen5",
-   "cen6",
-   "cen7",
-   "xlink0",
-   "xlink1",
-   "xlink2",
-   "mcd0",
-   "mcd1",
-   "phb0",
-   "phb1",
-   "phb2",
-   "phb3",
-   "phb4",
-   "phb5",
-   "nx",
-   "capp0",
-   "capp1",
-   "vas",
-   "int",
-   "alink0",
-   "alink1",
-   "alink2",
-   "alink3",
-   "nvlink0",
-   "nvlink1",
-   "nvlink2",
-   "nvlink3",
-   "nvlink4",
-   "nvlink5",
+   "powerbus0@",
+   "mcs0@",
+   "mcs1@",
+   "mcs2@",
+   "mcs3@",
+   "mcs4@",
+   "mcs5@",
+   "mcs6@",
+   "mcs7@",
+   "mba0@",
+   "mba1@",
+   "mba2@",
+   "mba3@",
+   "mba4@",
+   "mba5@",
+   "mba6@",
+   "mba7@",
+   "centaur0@",
+   "centaur1@",
+   "centaur2@",
+   "centaur3@",
+   "centaur4@",
+   "centaur5@",
+   "centaur6@",
+   "centaur7@",
+   "xlink0@",
+   "xlink1@",
+   "xlink2@",
+   "mcd0@",
+   "mcd1@",
+   "phb0@",
+   "phb1@",
+   "phb2@",
+   "phb3@",
+   "phb4@",
+   "phb5@",
+   "nx@",
+   "capp0@",
+   "capp1@",
+   "vas@",
+   "int@",
+   "alink0@",
+   "alink1@",
+   "alink2@",
+   "alink3@",
+   "nvlink0@",
+   "nvlink1@",
+   "nvlink2@",
+   "nvlink3@",
+   "nvlink4@",
+   "nvlink5@",
/* reserved bits : 51 - 63 */
 };
 
@@ -412,7 +412,7 @@ static void disable_unavailable_units(struct dt_node *dev)
for (i = 0; i < ARRAY_SIZE(nest_pmus); i++) {
if (!(PPC_BITMASK(i, i) & avl_vec)) {
/* Check if the device node exists */
-   target = dt_find_by_name(dev, nest_pmus[i]);
+   target = dt_find_by_name_substr(dev, nest_pmus[i]);
if (!target)
continue;
/* Remove the device node */
-- 
2.27.0



[PATCH V2 1/3] core/device: Add function to return child node using name at substring "@"

2023-01-17 Thread Athira Rajeev
Add a function dt_find_by_name_substr() that returns the child node
if its name matches the given name up to the first occurrence of "@",
otherwise NULL. This is helpful for cases with node names like
"name@addr". In scenarios where nodes are added in the "name@addr"
format and the value of "addr" is not known, such a node can't be
matched by node name or addr alone. Matching on the portion of the
name before "@" returns the expected result. The patch adds the
dt_find_by_name_substr() function and a testcase for it in
core/test/run-device.c.

Signed-off-by: Athira Rajeev 
---
Changelog:
v1 -> v2:
- Addressed review comment from Dan to update
  the utility function to search and compare
  up to "@". Renamed it as dt_find_by_name_substr.
  
 core/device.c  | 18 ++
 core/test/run-device.c | 11 +++
 include/device.h   |  3 +++
 3 files changed, 32 insertions(+)

diff --git a/core/device.c b/core/device.c
index 2de37c74..df3a5775 100644
--- a/core/device.c
+++ b/core/device.c
@@ -395,6 +395,24 @@ struct dt_node *dt_find_by_name(struct dt_node *root, 
const char *name)
 }
 
 
+struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name)
+{
+   struct dt_node *child, *match;
+   char *pos;
+
+   list_for_each(&root->children, child, list) {
+   pos = strchr(child->name, '@');
+   if (!strncmp(child->name, name, pos - child->name))
+   return child;
+
+   match = dt_find_by_name_substr(child, name);
+   if (match)
+   return match;
+   }
+
+   return NULL;
+}
+
 struct dt_node *dt_new_check(struct dt_node *parent, const char *name)
 {
struct dt_node *node = dt_find_by_name(parent, name);
diff --git a/core/test/run-device.c b/core/test/run-device.c
index 4a12382b..0e463e58 100644
--- a/core/test/run-device.c
+++ b/core/test/run-device.c
@@ -466,6 +466,17 @@ int main(void)
new_prop_ph = dt_prop_get_u32(ut2, "something");
assert(!(new_prop_ph == ev1_ph));
dt_free(subtree);
+
+   /* Test dt_find_by_name_substr */
+   root = dt_new_root("");
+   addr1 = dt_new_addr(root, "node", 0x1);
+   addr2 = dt_new_addr(root, "node0_1", 0x2);
+   assert(dt_find_by_name(root, "node@1") == addr1);
+   assert(dt_find_by_name(root, "node0_1@2") == addr2);
+   assert(dt_find_by_name_substr(root, "node@1") == addr1);
+   assert(dt_find_by_name_substr(root, "node0_1@2") == addr2);
+   dt_free(root);
+
return 0;
 }
 
diff --git a/include/device.h b/include/device.h
index 93fb90ff..b6a1a813 100644
--- a/include/device.h
+++ b/include/device.h
@@ -184,6 +184,9 @@ struct dt_node *dt_find_by_path(struct dt_node *root, const 
char *path);
 /* Find a child node by name */
 struct dt_node *dt_find_by_name(struct dt_node *root, const char *name);
 
+/* Find a child node by name and substring */
+struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name);
+
 /* Find a node by phandle */
 struct dt_node *dt_find_by_phandle(struct dt_node *root, u32 phandle);
 
-- 
2.27.0



Re: crypto: p10-aes-gcm - Add asm markings necessary for kernel code

2023-01-17 Thread Stephen Rothwell
Hi Herbert,

On Tue, 17 Jan 2023 15:26:24 +0800 Herbert Xu  
wrote:
>
> On Tue, Jan 17, 2023 at 02:47:47PM +1100, Stephen Rothwell wrote:
> > Hi all,
> > 
> > After merging the crypto tree, today's linux-next build (powerpc
> > pseries_le_defconfig) failed like this:
> > 
> > arch/powerpc/crypto/p10_aes_gcm.o: warning: objtool: .text+0x884: 
> > unannotated intra-function call
> > arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: 
> > aes_p8_set_encrypt_key+0x44: unannotated intra-function call
> > ld: arch/powerpc/crypto/p10_aes_gcm.o: ABI version 1 is not compatible with 
> > ABI version 2 output
> > ld: failed to merge target specific data of file 
> > arch/powerpc/crypto/p10_aes_gcm.o
> > 
> > Caused by commit
> > 
> >   ca68a96c37eb ("crypto: p10-aes-gcm - An accelerated AES/GCM stitched 
> > implementation")
> > 
> > I have applied the following hack for today.  
> 
> Thanks Stephen, I'm going to update the previous fix as follows:

I still get:

arch/powerpc/crypto/aesp8-ppc.o: warning: objtool: aes_p8_set_encrypt_key+0x44: 
unannotated intra-function call

from the powerpc pseries_le_defconfig build (which is otherwise ok).

Now I also get (from a powerpc allyesconfig build):

tmp/cc8g6b4E.s: Assembler messages:
tmp/cc8g6b4E.s: Error: .size expression for gcm_init_p8 does not evaluate to a 
constant
tmp/cc8g6b4E.s: Error: .size expression for .gcm_init_p8 does not evaluate to a 
constant
tmp/cc8g6b4E.s: Error: .size expression for gcm_init_htable does not evaluate 
to a constant
tmp/cc8g6b4E.s: Error: .size expression for .gcm_init_htable does not evaluate 
to a constant
tmp/cc8g6b4E.s: Error: .size expression for gcm_gmult_p8 does not evaluate to a 
constant
tmp/cc8g6b4E.s: Error: .size expression for .gcm_gmult_p8 does not evaluate to 
a constant
tmp/cc8g6b4E.s: Error: .size expression for gcm_ghash_p8 does not evaluate to a 
constant
tmp/cc8g6b4E.s: Error: .size expression for .gcm_ghash_p8 does not evaluate to 
a constant
make[4]: *** [next/scripts/Makefile.build:374: 
arch/powerpc/crypto/ghashp8-ppc.o] Error 1
tmp/ccNrBtc1.s: Assembler messages:
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_set_encrypt_key does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_set_encrypt_key does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_set_decrypt_key does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_set_decrypt_key does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_encrypt does not evaluate to 
a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_encrypt does not evaluate 
to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_decrypt does not evaluate to 
a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_decrypt does not evaluate 
to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_cbc_encrypt does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_cbc_encrypt does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_ctr32_encrypt_blocks does 
not evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_ctr32_encrypt_blocks does 
not evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_xts_encrypt does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_xts_encrypt does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for aes_p8_xts_decrypt does not 
evaluate to a constant
tmp/ccNrBtc1.s: Error: .size expression for .aes_p8_xts_decrypt does not 
evaluate to a constant
make[4]: *** [next/scripts/Makefile.build:374: arch/powerpc/crypto/aesp8-ppc.o] 
Error 1
make[4]: Target 'arch/powerpc/crypto/' not remade because of errors.

$ grep gcm_init_p8 arch/powerpc/crypto/ghashp8-ppc.s
.align 2 ; .type gcm_init_p8,@function; .globl gcm_init_p8; gcm_init_p8:
.size gcm_init_p8,.-.gcm_init_p8
.size .gcm_init_p8,.-.gcm_init_p8

I have just marked CRYPTO_P10_AES_GCM as BROKEN for today.

-- 
Cheers,
Stephen Rothwell


pgp6zTiBfGOeP.pgp
Description: OpenPGP digital signature


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Matthew Wilcox
On Tue, Jan 17, 2023 at 05:06:57PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko  wrote:
> >
> > On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > > page fault handling. When VMA is not found, can't be locked or changes
> > > after being locked, the function returns NULL. The lookup is performed
> > > under RCU protection to prevent the found VMA from being destroyed before
> > > the VMA lock is acquired. VMA lock statistics are updated according to
> > > the results.
> > > For now only anonymous VMAs can be searched this way. In other cases the
> > > function returns NULL.
> >
> > Could you describe why only anonymous vmas are handled at this stage and
> > what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> > doesn't seem to have any anonymous vma specific requirements AFAICS.
> 
> TBH I haven't spent too much time looking into file-backed page faults
> yet but a couple of tasks I can think of are:
> - Ensure that all vma->vm_ops->fault() handlers do not rely on
> mmap_lock being read-locked;

I think this way lies madness.  There are just too many device drivers
that implement ->fault.  My plan is to call the ->map_pages() method
under RCU without even read-locking the VMA.  If that doesn't satisfy
the fault, then drop all the way back to taking the mmap_sem for read
before calling into ->fault.
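
A very rough sketch of the flow being described, purely for illustration
(the ->map_pages()/->fault() methods are real, but the surrounding
structure and the return-value handling here are assumptions, not an
actual implementation):

	/* hypothetical shape of a file-backed fault path */
	rcu_read_lock();
	ret = vma->vm_ops->map_pages(vmf, vmf->pgoff, vmf->pgoff); /* no VMA/mmap lock */
	rcu_read_unlock();

	if (!ret) {
		/* ->map_pages() couldn't satisfy the fault: fall back */
		mmap_read_lock(mm);
		ret = vma->vm_ops->fault(vmf);
		mmap_read_unlock(mm);
	}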



Re: [PATCH 09/41] mm: rcu safe VMA freeing

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 6:25 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
> [...]
> >  void vm_area_free(struct vm_area_struct *vma)
> >  {
> >   free_anon_vma_name(vma);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + call_rcu(&vma->vm_rcu, __vm_area_free);
> > +#else
> >   kmem_cache_free(vm_area_cachep, vma);
> > +#endif
>
> Is it safe to have vma with already freed vma_name? I suspect this is
> safe because of mmap_lock but is there any reason to split the freeing
> process and have this potential UAF lurking?

It should be safe because VMA is either locked or has been isolated
while locked, so no page fault handlers should have access to it. But
you are right, moving free_anon_vma_name() into __vm_area_free() does
seem safer. Will make the change in the next rev.
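
(A rough sketch of that change, for clarity only:

	static void __vm_area_free(struct rcu_head *head)
	{
		struct vm_area_struct *vma = container_of(head,
					struct vm_area_struct, vm_rcu);

		free_anon_vma_name(vma);
		kmem_cache_free(vm_area_cachep, vma);
	}

so vm_area_free() would only queue the call_rcu() callback.)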

>
> >  }
> >
> >  static void account_kernel_stack(struct task_struct *tsk, int account)
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:15 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > > To keep vma locking correctness when vm_flags are modified, add modifier
> > > functions to be used whenever flags are updated.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  include/linux/mm.h   | 38 ++
> > >  include/linux/mm_types.h |  8 +++-
> > >  2 files changed, 45 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ec2c4c227d51..35cf0a6cbcc2 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct 
> > > *vma, struct mm_struct *mm)
> > > vma_init_lock(vma);
> > >  }
> > >
> > > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > > +static inline
> > > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   WRITE_ONCE(vma->vm_flags, flags);
> > > +}
> >
> > Why do we need WRITE_ONCE here? Isn't vma invisible during its
> > initialization?

Ack. Will change to a simple assignment.

> >
> > > +
> > > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > > +static inline
> > > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   init_vm_flags(vma, flags);
> > > +}
> > > +
> > > +static inline
> > > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= flags;
> > > +}
> > > +
> > > +static inline
> > > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags &= ~flags;
> > > +}
> > > +
> > > +static inline
> > > +void mod_vm_flags(struct vm_area_struct *vma,
> > > + unsigned long set, unsigned long clear)
> > > +{
> > > +   vma_write_lock(vma);
> > > +   vma->vm_flags |= set;
> > > +   vma->vm_flags &= ~clear;
> > > +}
> > > +
> >
> > This is rather unusual pattern. There is no note about locking involved
> > in the naming and also why is the locking part of this interface in the
> > first place? I can see reason for access functions to actually check for
> > lock asserts.
>
> OK, it took me a while but it is clear to me now. The confusion comes
> from the naming: vma_write_lock is not a lock in the usual sense. It is
> more of a vma_mark_modified with side effects on read locking, which is a
> real lock. With that it makes more sense to have this done in these
> helpers rather than requiring all users to keep this subtlety in mind.

If renaming vma-locking primitives the way Matthew suggested in
https://lore.kernel.org/all/y8czmt01z1fvv...@casper.infradead.org/
makes it easier to read/understand, I'm all for it. Let's discuss the
naming in that email thread because that's where these functions are
introduced.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:16 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> > Move VMA flag modification (which now implies VMA locking) before
> > anon_vma_lock_write to match the locking order of page fault handler.
>
> Does this changelog assumes per vma locking in the #PF?

Hmm, you are right. Page fault handlers do not use per-vma locks yet
but the changelog already talks about that. Maybe I should change it
to simply:
```
Move VMA flag modification (which now implies VMA locking) before
vma_adjust_trans_huge() to ensure the modifications are done after VMA
has been locked.
```
Is that better?

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:42 AM 'Michal Hocko' via kernel-team
 wrote:
>
> On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> > Assert there are no holders of VMA lock for reading when it is about to be
> > destroyed.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h | 8 
> >  kernel/fork.c  | 2 ++
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 594e835bad9c..c464fc8a514c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> > vm_area_struct *vma)
> >   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
> > vma);
> >  }
> >
> > +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> > +{
> > + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> > +   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> > +   vma);
>
> Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
> us something is wrong on its own, no? This could be somebody racing with
> the vma destruction and using the write lock. Unlikely but I do not see
> why to narrow debugging scope.

I wanted to ensure there are no page fault handlers (read-lockers)
when we are destroying the VMA and rwsem_is_locked(&vma->lock) alone
could trigger if someone is concurrently calling vma_write_lock(). But
I don't think we expect someone to be write-locking the VMA while we
are destroying it, so you are right, I'm overcomplicating things here.
I think I can get rid of vma_assert_no_reader() and add
VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock)) directly in
__vm_area_free(). WDYT?


> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:57 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> > call_rcu() can take a long time when callback offloading is enabled.
> > Its use in the vm_area_free can cause regressions in the exit path when
> > multiple VMAs are being freed.
>
> What kind of regressions.
>
> > To minimize that impact, place VMAs into
> > a list and free them in groups using one call_rcu() call per group.
>
> Please add some data to justify this additional complexity.

Sorry, should have done that in the first place. A 4.3% regression was
noticed when running the execl test from the unixbench suite. The spawn
test also showed a 1.6% regression. Profiling revealed that vma freeing was taking
longer due to call_rcu() which is slow when RCU callback offloading is
enabled. I asked Paul McKenney and he explained to me that because the
callbacks are offloaded to some other kthread, possibly running on
some other CPU, it is necessary to use explicit locking.  Locking on a
per-call_rcu() basis would result in excessive contention during
callback flooding. So, by batching call_rcu() work we cut that
overhead and reduce this lock contention.
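
A minimal sketch of the batching idea, for illustration only (the
vm_free_list member and VMA_FREE_BATCH constant are assumptions, this is
not the exact implementation in the patch, and details such as draining a
partial batch at process exit are omitted):

	#define VMA_FREE_BATCH	32	/* arbitrary batch size */

	static LIST_HEAD(vma_free_list);
	static int vma_free_count;
	static DEFINE_SPINLOCK(vma_free_lock);

	static void vm_area_free_batch_rcu(struct rcu_head *head)
	{
		struct vm_area_struct *batch_head, *vma, *next;

		batch_head = container_of(head, struct vm_area_struct, vm_rcu);
		/* free the queued VMAs, then the one carrying the rcu_head */
		list_for_each_entry_safe(vma, next, &batch_head->vm_free_list,
					 vm_free_list)
			kmem_cache_free(vm_area_cachep, vma);
		kmem_cache_free(vm_area_cachep, batch_head);
	}

	void vm_area_free(struct vm_area_struct *vma)
	{
		struct vm_area_struct *batch_head = NULL;

		spin_lock(&vma_free_lock);
		list_add_tail(&vma->vm_free_list, &vma_free_list);
		if (++vma_free_count >= VMA_FREE_BATCH) {
			batch_head = list_first_entry(&vma_free_list,
					struct vm_area_struct, vm_free_list);
			list_del_init(&batch_head->vm_free_list);
			/* hand the rest of the batch to the RCU callback */
			list_splice_init(&vma_free_list,
					 &batch_head->vm_free_list);
			vma_free_count = 0;
		}
		spin_unlock(&vma_free_lock);

		if (batch_head)
			call_rcu(&batch_head->vm_rcu, vm_area_free_batch_rcu);
	}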


> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:47 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
>
> Could you describe why only anonymous vmas are handled at this stage and
> what (roughly) has to be done to support other vmas? lock_vma_under_rcu
> doesn't seem to have any anonymous vma specific requirements AFAICS.

TBH I haven't spent too much time looking into file-backed page faults
yet but a couple of tasks I can think of are:
- Ensure that all vma->vm_ops->fault() handlers do not rely on
mmap_lock being read-locked;
- vma->vm_file freeing like VMA freeing will need to be done after RCU
grace period since page fault handlers use it. This will require some
caution because simply adding it into __vm_area_free() called via
call_rcu()  will cause corresponding fops->release() to be called
asynchronously. I had to solve this issue with out-of-tree SPF
implementation when asynchronously called snd_pcm_release() was
problematic.

I'm sure I'm missing more potential issues and maybe Matthew and
Michel can pinpoint more things to resolve here?

>
> Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
> the naming is really the most important part but the rcu locking is
> internal to the function so why should we spread this implementation
> detail to the world...

I wanted the name to indicate that the lookup is done with no locks
held. But I'm open to suggestions.

>
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h |  3 +++
> >  mm/memory.c| 51 ++
> >  2 files changed, 54 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c464fc8a514c..d0fddf6a1de9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
> > vm_area_struct *vma)
> > vma);
> >  }
> >
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +   unsigned long address);
> > +
> >  #else /* CONFIG_PER_VMA_LOCK */
> >
> >  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9ece18548db1..a658e26d965d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct 
> > *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL_GPL(handle_mm_fault);
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +/*
> > + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed 
> > to be
> > + * stable and not isolated. If the VMA is not found or is being modified 
> > the
> > + * function returns NULL.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > +   unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma, *validate;
> > +
> > + rcu_read_lock();
> > + vma = mas_walk(&mas);
> > +retry:
> > + if (!vma)
> > + goto inval;
> > +
> > + /* Only anonymous vmas are supported for now */
> > + if (!vma_is_anonymous(vma))
> > + goto inval;
> > +
> > + if (!vma_read_trylock(vma))
> > + goto inval;
> > +
> > + /* Check since vm_start/vm_end might change before we lock the VMA */
> > + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > + vma_read_unlock(vma);
> > + goto inval;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + mas.index = address;
> > + validate = mas_walk(&mas);
> > + if (validate != vma) {
> > + vma_read_unlock(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one. */
> > + vma = validate;
> > + goto retry;
> > + }
> > +
> > + rcu_read_unlock();
> > + return vma;
> > +inval:
> > + rcu_read_unlock();
> > + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > + return NULL;
> > +}
> > +#endif /* CONFIG_PER_VMA_LOCK */
> > +
> >  #ifndef __PAGETABLE_P4D_FOLDED
> >  /*
> >   * Allocate p4d page table.
> > --
> > 2.39.0
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Liam Howlett
* Jann Horn  [230117 16:04]:
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> > page fault handling. When VMA is not found, can't be locked or changes
> > after being locked, the function returns NULL. The lookup is performed
> > under RCU protection to prevent the found VMA from being destroyed before
> > the VMA lock is acquired. VMA lock statistics are updated according to
> > the results.
> > For now only anonymous VMAs can be searched this way. In other cases the
> > function returns NULL.
> [...]
> > +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > +   MA_STATE(mas, &mm->mm_mt, address, address);
> > +   struct vm_area_struct *vma, *validate;
> > +
> > +   rcu_read_lock();
> > +   vma = mas_walk(&mas);
> > +retry:
> > +   if (!vma)
> > +   goto inval;
> > +
> > +   /* Only anonymous vmas are supported for now */
> > +   if (!vma_is_anonymous(vma))
> > +   goto inval;
> > +
> > +   if (!vma_read_trylock(vma))
> > +   goto inval;
> > +
> > +   /* Check since vm_start/vm_end might change before we lock the VMA 
> > */
> > +   if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> > +   vma_read_unlock(vma);
> > +   goto inval;
> > +   }
> > +
> > +   /* Check if the VMA got isolated after we found it */
> > +   mas.index = address;
> > +   validate = mas_walk(&mas);
> 
> Question for Maple Tree experts:
> 
> Are you allowed to use mas_walk() like this? If the first mas_walk()
> call encountered a single-entry tree, it would store mas->node =
> MAS_ROOT, right? And then the second call would go into
> mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
> true, and then mas_state_walk() would return the result of
> mas_start(), which is NULL? And we'd end up with mas_walk() returning
> NULL on the second run even though the tree hasn't changed?

This is safe for VMAs.  There might be a bug in the tree regarding
re-walking with a pointer, but it won't matter here.

A single entry tree will be a pointer if the entry is of the range 0 - 0
(mas.index == 0, mas.last == 0).  This would be a zero sized VMA - which
is not valid.

The second walk will check if the maple node is dead and restart the
walk if it is dead.  If the node isn't dead (almost always the case),
then it will be a very quick walk.

After a mas_walk(), the maple state has mas.index = vma->vm_start
and mas.last = (vma->vm_end - 1). The address is set prior to the second
walk in case of a vma split where mas.index from the first walk
is on the other side of the split than address.

> 
> > +   if (validate != vma) {
> > +   vma_read_unlock(vma);
> > +   count_vm_vma_lock_event(VMA_LOCK_MISS);
> > +   /* The area was replaced with another one. */
> > +   vma = validate;
> > +   goto retry;
> > +   }
> > +
> > +   rcu_read_unlock();
> > +   return vma;
> > +inval:
> > +   rcu_read_unlock();
> > +   count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > +   return NULL;
> > +}

Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Matthew Wilcox
On Tue, Jan 17, 2023 at 02:36:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 1:46 PM Jann Horn  wrote:
> > On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  
> > wrote:
> > > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else. So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > >
> > > But for deleting VMA one would need to write-lock the vma->lock first,
> > > which I assume can't happen until this up_read() is complete. Is that
> > > assumption wrong?
> >
> > __up_read() does:
> >
> > rwsem_clear_reader_owned(sem);
> > tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, >count);
> > DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> > if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
> >   RWSEM_FLAG_WAITERS)) {
> >   clear_nonspinnable(sem);
> >   rwsem_wake(sem);
> > }
> >
> > The atomic_long_add_return_release() is the point where we are doing
> > the main lock-releasing.
> >
> > So if a reader dropped the read-lock while someone else was waiting on
> > the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> > lock together with it, the reader also does clear_nonspinnable() and
> > rwsem_wake() afterwards.
> > But in rwsem_down_write_slowpath(), after we've set
> > RWSEM_FLAG_WAITERS, we can return successfully immediately once
> > rwsem_try_write_lock() sees that there are no active readers or
> > writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> > succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> > the wake.
> >
> > So yeah, I think down_write() can return successfully before up_read()
> > is done with its memory accesses.
> >
> > (Spinlocks are different - the kernel relies on being able to drop
> > references via spin_unlock() in some places.)
> 
> Thanks for bringing this up. I can add rcu_read_{lock/unlock) as you
> suggested and that would fix the issue because we free VMAs from
> call_rcu(). However this feels to me as an issue of rw_semaphore
> design that this locking pattern is unsafe and might lead to UAF.

We have/had this problem with normal mutexes too.  It was the impetus
for adding the struct completion which is very careful to not touch
anything after the completion is, well, completed.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 1:46 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  wrote:
> > On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> > >
> > > +locking maintainers
> >
> > Thanks! I'll CC the locking maintainers in the next posting.
> >
> > >
> > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  
> > > wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > > [...]
> > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > +{
> > > > +   up_read(&vma->lock);
> > > > +}
> > >
> > > One thing that might be gnarly here is that I think you might not be
> > > allowed to use up_read() to fully release ownership of an object -
> > > from what I remember, I think that up_read() (unlike something like
> > > spin_unlock()) can access the lock object after it's already been
> > > acquired by someone else. So if you want to protect against concurrent
> > > deletion, this might have to be something like:
> > >
> > > rcu_read_lock(); /* keeps vma alive */
> > > up_read(&vma->lock);
> > > rcu_read_unlock();
> >
> > But for deleting VMA one would need to write-lock the vma->lock first,
> > which I assume can't happen until this up_read() is complete. Is that
> > assumption wrong?
>
> __up_read() does:
>
> rwsem_clear_reader_owned(sem);
> tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
> DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
> if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
>   RWSEM_FLAG_WAITERS)) {
>   clear_nonspinnable(sem);
>   rwsem_wake(sem);
> }
>
> The atomic_long_add_return_release() is the point where we are doing
> the main lock-releasing.
>
> So if a reader dropped the read-lock while someone else was waiting on
> the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
> lock together with it, the reader also does clear_nonspinnable() and
> rwsem_wake() afterwards.
> But in rwsem_down_write_slowpath(), after we've set
> RWSEM_FLAG_WAITERS, we can return successfully immediately once
> rwsem_try_write_lock() sees that there are no active readers or
> writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
> succeeds). We're not necessarily waiting for the "nonspinnable" bit or
> the wake.
>
> So yeah, I think down_write() can return successfully before up_read()
> is done with its memory accesses.
>
> (Spinlocks are different - the kernel relies on being able to drop
> references via spin_unlock() in some places.)

Thanks for bringing this up. I can add rcu_read_{lock,unlock}() as you
suggested and that would fix the issue because we free VMAs from
call_rcu(). However, this feels to me like an issue with the rw_semaphore
design: this locking pattern is unsafe and might lead to UAF.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 1:54 PM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
> > >
> > > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA 
> > > > > lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. 
> > > > > When
> > > > > mmap_write_lock holder is done with all modifications and drops 
> > > > > mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked 
> > > > > as
> > > > > locked.
> > > >
> > > > I have to say I was struggling a bit with the above and only understood
> > > > what you mean by reading the patch several times. I would phrase it like
> > > > this (feel free to use if you consider this to be an improvement).
> > > >
> > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > > per-vma and per-mm sequence counters to note exclusive locking:
> > > > - read lock - (implemented by vma_read_trylock) requires the
> > > >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > > >   differ. If they match then there must be a vma exclusive lock
> > > >   held somewhere.
> > > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > > >   vma->lock unlock.
> > > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > > >   held exclusively and the current mm counter is noted to the 
> > > > vma
> > > >   side. This will allow multiple vmas to be locked under a 
> > > > single
> > > >   mmap_lock write lock (e.g. during vma merging). The vma 
> > > > counter
> > > >   is modified under exclusive vma lock.
> > >
> > > Didn't realize one more thing.
> > > Unlike standard write lock this implementation allows to be
> > > called multiple times under a single mmap_lock. In a sense
> > > it is more of mark_vma_potentially_modified than a lock.
> >
> > In the RFC it was called vma_mark_locked() originally and renames were
> > discussed in the email thread ending here:
> > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
> > If other names are preferable I'm open to changing them.
>
> I don't want to bikeshed this, but rather than locking it seems to be
> more:
>
> vma_start_read()
> vma_end_read()
> vma_start_write()
> vma_end_write()
> vma_downgrade_write()

A couple of corrections: we would have to have vma_start_tryread() and
vma_end_write_all(). Also there is no vma_downgrade_write().
mmap_write_downgrade() simply does vma_end_write_all().

>
> ... and that these are _implemented_ with locks (in part) is an
> implementation detail?
>
> Would that reduce people's confusion?
>
> > >
> > > > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > > >   vma locks held. It doesn't pair with a specific
> > > >   vma_write_lock! It is done before exclusive mmap_lock is
> > > >   released by incrementing mm sequence counter (mm_lock_seq).
> > > >   - write downgrade - if the mmap_lock is downgraded to the read
> > > > lock all vma write locks are released as well (effectively
> > > > same as write unlock).
> > > --
> > > Michal Hocko
> > > SUSE Labs
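
As an aside, the batch release described above (vma_write_unlock_mm, i.e. the proposed vma_end_write_all) amounts to bumping the mm-wide sequence counter. A minimal sketch, assuming the vm_lock_seq/mm_lock_seq scheme discussed in this thread rather than quoting the literal patch:

    static inline void vma_write_unlock_mm(struct mm_struct *mm)
    {
            mmap_assert_write_locked(mm);
            /*
             * Writers are serialized by mmap_lock, so a plain increment is
             * enough. Every vma whose vm_lock_seq was set to the old value
             * no longer matches mm_lock_seq and becomes read-lockable again.
             */
            WRITE_ONCE(mm->mm_lock_seq, mm->mm_lock_seq + 1);
    }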


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Matthew Wilcox
On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
> >
> > On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > mmap_write_lock holder is done with all modifications and drops 
> > > > mmap_lock,
> > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > locked.
> > >
> > > I have to say I was struggling a bit with the above and only understood
> > > what you mean by reading the patch several times. I would phrase it like
> > > this (feel free to use if you consider this to be an improvement).
> > >
> > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > > per-vma and per-mm sequence counters to note exclusive locking:
> > > - read lock - (implemented by vma_read_trylock) requires the
> > >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> > >   differ. If they match then there must be a vma exclusive lock
> > >   held somewhere.
> > > - read unlock - (implemented by vma_read_unlock) is a trivial
> > >   vma->lock unlock.
> > > - write lock - (vma_write_lock) requires the mmap_lock to be
> > >   held exclusively and the current mm counter is noted to the vma
> > >   side. This will allow multiple vmas to be locked under a single
> > >   mmap_lock write lock (e.g. during vma merging). The vma counter
> > >   is modified under exclusive vma lock.
> >
> > Didn't realize one more thing.
> > Unlike standard write lock this implementation allows to be
> > called multiple times under a single mmap_lock. In a sense
> > it is more of mark_vma_potentially_modified than a lock.
> 
> In the RFC it was called vma_mark_locked() originally and renames were
> discussed in the email thread ending here:
> https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
> If other names are preferable I'm open to changing them.

I don't want to bikeshed this, but rather than locking it seems to be
more:

vma_start_read()
vma_end_read()
vma_start_write()
vma_end_write()
vma_downgrade_write()

... and that these are _implemented_ with locks (in part) is an
implementation detail?

Would that reduce people's confusion?

> >
> > > - write unlock - (vma_write_unlock_mm) is a batch release of all
> > >   vma locks held. It doesn't pair with a specific
> > >   vma_write_lock! It is done before exclusive mmap_lock is
> > >   released by incrementing mm sequence counter (mm_lock_seq).
> > >   - write downgrade - if the mmap_lock is downgraded to the read
> > > lock all vma write locks are released as well (effectively
> > > same as write unlock).
> > --
> > Michal Hocko
> > SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  wrote:
> On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> >
> > +locking maintainers
>
> Thanks! I'll CC the locking maintainers in the next posting.
>
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +   up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else. So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
>
> But for deleting VMA one would need to write-lock the vma->lock first,
> which I assume can't happen until this up_read() is complete. Is that
> assumption wrong?

__up_read() does:

rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
  RWSEM_FLAG_WAITERS)) {
  clear_nonspinnable(sem);
  rwsem_wake(sem);
}

The atomic_long_add_return_release() is the point where we are doing
the main lock-releasing.

So if a reader dropped the read-lock while someone else was waiting on
the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
lock together with it, the reader also does clear_nonspinnable() and
rwsem_wake() afterwards.
But in rwsem_down_write_slowpath(), after we've set
RWSEM_FLAG_WAITERS, we can return successfully immediately once
rwsem_try_write_lock() sees that there are no active readers or
writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
succeeds). We're not necessarily waiting for the "nonspinnable" bit or
the wake.

So yeah, I think down_write() can return successfully before up_read()
is done with its memory accesses.

(Spinlocks are different - the kernel relies on being able to drop
references via spin_unlock() in some places.)


Re: [PATCH linux-next] powerpc/cell/axon_msi: Use dma_zalloc_coherent()

2023-01-17 Thread Christophe JAILLET

On 17/01/2023 at 10:06, ye.xingc...@zte.com.cn wrote:

From: ye xingchen 

Instead of using dma_alloc_coherent() and memset() directly use
dma_zalloc_coherent().



Hi,

dma_zalloc_coherent() has been removed at the very beginning of 2019 in 
commit dfd32cad146e.


It has not existed since v5.0-rc2.

\o/

CJ



Signed-off-by: ye xingchen 
---
  arch/powerpc/platforms/cell/axon_msi.c | 5 ++---
  1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/cell/axon_msi.c 
b/arch/powerpc/platforms/cell/axon_msi.c
index 0c11aad896c7..8a4c522c8e67 100644
--- a/arch/powerpc/platforms/cell/axon_msi.c
+++ b/arch/powerpc/platforms/cell/axon_msi.c
@@ -358,8 +358,8 @@ static int axon_msi_probe(struct platform_device *device)
goto out_free_msic;
}

-   msic->fifo_virt = dma_alloc_coherent(&device->dev, MSIC_FIFO_SIZE_BYTES,
-        &msic->fifo_phys, GFP_KERNEL);
+   msic->fifo_virt = dma_zalloc_coherent(&device->dev, MSIC_FIFO_SIZE_BYTES,
+         &msic->fifo_phys, GFP_KERNEL);
if (!msic->fifo_virt) {
printk(KERN_ERR "axon_msi: couldn't allocate fifo for %pOF\n",
   dn);
@@ -372,7 +372,6 @@ static int axon_msi_probe(struct platform_device *device)
   dn);
goto out_free_fifo;
}
-   memset(msic->fifo_virt, 0xff, MSIC_FIFO_SIZE_BYTES);

/* We rely on being able to stash a virq in a u16, so limit irqs to < 65536 */
msic->irq_domain = irq_domain_add_nomap(dn, 65536, &msic_host_ops, msic);




Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
>
> +locking maintainers

Thanks! I'll CC the locking maintainers in the next posting.

>
> On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> [...]
> > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > +{
> > +   up_read(&vma->lock);
> > +}
>
> One thing that might be gnarly here is that I think you might not be
> allowed to use up_read() to fully release ownership of an object -
> from what I remember, I think that up_read() (unlike something like
> spin_unlock()) can access the lock object after it's already been
> acquired by someone else. So if you want to protect against concurrent
> deletion, this might have to be something like:
>
> rcu_read_lock(); /* keeps vma alive */
> up_read(&vma->lock);
> rcu_read_unlock();

But for deleting VMA one would need to write-lock the vma->lock first,
which I assume can't happen until this up_read() is complete. Is that
assumption wrong?

>
> But I'm not entirely sure about that, the locking folks might know better.
>
> Also, it might not matter given that the rw_semaphore part is removed
> in the current patch 41/41 anyway...

This does matter because Michal suggested dropping that last 41/41
patch for now.


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 02:21:40PM +, Sudeep Holla wrote:
> On Tue, Jan 17, 2023 at 01:16:21PM +, Mark Rutland wrote:
> > On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> > > 
> > > > I'm sorry to have to bear some bad news on that front. :(
> > > 
> > > Moo, something had to give..
> > > 
> > > 
> > > > IIUC what's happening here is the PSCI cpuidle driver has entered idle 
> > > > and RCU
> > > > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > > > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > > > trace_hardirqs_off() and the RCU usage.
> > > 
> > > Right, strictly speaking not needed at this point, IRQs should have been
> > > traced off a long time ago.
> > 
> > True, but there are some other calls around here that *might* end up 
> > invoking
> > RCU stuff (e.g. the MTE code).
> > 
> > That all needs a noinstr cleanup too, which I'll sort out as a follow-up.
> > 
> > > > I think we need RCU to be watching all the way down to cpu_suspend(), 
> > > > and it's
> > > > cpu_suspend() that should actually enter/exit idle context. That and we 
> > > > need to
> > > > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > > > 
> > > > I'm not sure whether 32-bit will have a similar issue or not.
> > > 
> > > I'm not seeing 32bit or RISC-V have similar issues here, but who knows,
> > > maybe I missed something.
> > 
> > I reckon if they do, the core changes here give us the infrastructure to fix
> > them if/when we get reports.
> > 
> > > In any case, the below ought to cure the ARM64 case and remove that last
> > > known RCU_NONIDLE() user as a bonus.
> > 
> > The below works for me testing on a Juno R1 board with PSCI, using 
> > defconfig +
> > CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + 
> > CONFIG_DEBUG_ATOMIC_SLEEP=y.
> > I'm not sure how to test the LPI / FFH part, but it looks good to me.
> > 
> > FWIW:
> > 
> > Reviewed-by: Mark Rutland 
> > Tested-by: Mark Rutland 
> > 
> > Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
> > options above?
> > 
> 
> Not sure if I have messed up something in my mail setup, but I did reply
> earlier.

Sorry, that was my bad; I had been drafting my reply for a while and forgot to
re-check prior to sending.

> I did test both the DT/cpuidle-psci driver and the ACPI/LPI+FFH driver
> with the fix Peter sent. I was seeing the same splat as you in both DT and
> ACPI boot, which the patch fixed. I used the same config as described by
> you above.

Perfect; thanks!

Mark.


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Sudeep Holla
On Tue, Jan 17, 2023 at 01:16:21PM +, Mark Rutland wrote:
> On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> > 
> > > I'm sorry to have to bear some bad news on that front. :(
> > 
> > Moo, something had to give..
> > 
> > 
> > > IIUC what's happening here is the PSCI cpuidle driver has entered idle 
> > > and RCU
> > > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > > trace_hardirqs_off() and the RCU usage.
> > 
> > Right, strictly speaking not needed at this point, IRQs should have been
> > traced off a long time ago.
> 
> True, but there are some other calls around here that *might* end up invoking
> RCU stuff (e.g. the MTE code).
> 
> That all needs a noinstr cleanup too, which I'll sort out as a follow-up.
> 
> > > I think we need RCU to be watching all the way down to cpu_suspend(), and 
> > > it's
> > > cpu_suspend() that should actually enter/exit idle context. That and we 
> > > need to
> > > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > > 
> > > I'm not sure whether 32-bit will have a similar issue or not.
> > 
> > I'm not seeing 32bit or RISC-V have similar issues here, but who knows,
> > maybe I missed something.
> 
> I reckon if they do, the core changes here give us the infrastructure to fix
> them if/when we get reports.
> 
> > In any case, the below ought to cure the ARM64 case and remove that last
> > known RCU_NONIDLE() user as a bonus.
> 
> The below works for me testing on a Juno R1 board with PSCI, using defconfig +
> CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + CONFIG_DEBUG_ATOMIC_SLEEP=y.
> I'm not sure how to test the LPI / FFH part, but it looks good to me.
> 
> FWIW:
> 
> Reviewed-by: Mark Rutland 
> Tested-by: Mark Rutland 
> 
> Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
> options above?
> 

Not sure if I have messed up something in my mail setup, but I did reply
earlier. I did test both the DT/cpuidle-psci driver and the ACPI/LPI+FFH driver
with the fix Peter sent. I was seeing the same splat as you in both DT and
ACPI boot, which the patch fixed. I used the same config as described by
you above.

-- 
Regards,
Sudeep


Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> 
> > I'm sorry to have to bear some bad news on that front. :(
> 
> Moo, something had to give..
> 
> 
> > IIUC what's happening here is the PSCI cpuidle driver has entered idle and 
> > RCU
> > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > trace_hardirqs_off() and the RCU usage.
> 
> Right, strictly speaking not needed at this point, IRQs should have been
> traced off a long time ago.

True, but there are some other calls around here that *might* end up invoking
RCU stuff (e.g. the MTE code).

That all needs a noinstr cleanup too, which I'll sort out as a follow-up.

> > I think we need RCU to be watching all the way down to cpu_suspend(), and 
> > it's
> > cpu_suspend() that should actually enter/exit idle context. That and we 
> > need to
> > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > 
> > I'm not sure whether 32-bit will have a similar issue or not.
> 
> I'm not seeing 32bit or RISC-V have similar issues here, but who knows,
> maybe I missed something.

I reckon if they do, the core changes here give us the infrastructure to fix
them if/when we get reports.

> In any case, the below ought to cure the ARM64 case and remove that last
> known RCU_NONIDLE() user as a bonus.

The below works for me testing on a Juno R1 board with PSCI, using defconfig +
CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + CONFIG_DEBUG_ATOMIC_SLEEP=y.
I'm not sure how to test the LPI / FFH part, but it looks good to me.

FWIW:

Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland 

Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
options above?

Thanks,
Mark.

> 
> ---
> diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c
> index 41974a1a229a..42e19fff40ee 100644
> --- a/arch/arm64/kernel/cpuidle.c
> +++ b/arch/arm64/kernel/cpuidle.c
> @@ -67,10 +67,10 @@ __cpuidle int acpi_processor_ffh_lpi_enter(struct 
> acpi_lpi_state *lpi)
>   u32 state = lpi->address;
>  
>   if (ARM64_LPI_IS_RETENTION_STATE(lpi->arch_flags))
> - return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(psci_cpu_suspend_enter,
> + return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(psci_cpu_suspend_enter,
>   lpi->index, state);
>   else
> - return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter,
> + return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter,
>lpi->index, state);
>  }
>  #endif
> diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
> index e7163f31f716..0fbdf5fe64d8 100644
> --- a/arch/arm64/kernel/suspend.c
> +++ b/arch/arm64/kernel/suspend.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -104,6 +105,10 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>* From this point debug exceptions are disabled to prevent
>* updates to mdscr register (saved and restored along with
>* general purpose registers) from kernel debuggers.
> +  *
> +  * Strictly speaking the trace_hardirqs_off() here is superfluous,
> +  * hardirqs should be firmly off by now. This really ought to use
> +  * something like raw_local_daif_save().
>*/
>   flags = local_daif_save();
>  
> @@ -120,6 +125,8 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   arm_cpuidle_save_irq_context();
>  
> + ct_cpuidle_enter();
> +
>   if (__cpu_suspend_enter()) {
>   /* Call the suspend finisher */
>   ret = fn(arg);
> @@ -133,8 +140,11 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   if (!ret)
>   ret = -EOPNOTSUPP;
> +
> + ct_cpuidle_exit();
>   } else {
> - RCU_NONIDLE(__cpu_suspend_exit());
> + ct_cpuidle_exit();
> + __cpu_suspend_exit();
>   }
>  
>   arm_cpuidle_restore_irq_context();
> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
> index 4fc4e0381944..312a34ef28dc 100644
> --- a/drivers/cpuidle/cpuidle-psci.c
> +++ b/drivers/cpuidle/cpuidle-psci.c
> @@ -69,16 +69,12 @@ static __cpuidle int 
> __psci_enter_domain_idle_state(struct cpuidle_device *dev,
>   else
>   pm_runtime_put_sync_suspend(pd_dev);
>  
> - ct_cpuidle_enter();
> -
>   state = psci_get_domain_state();
>   if (!state)
>   state = states[idx];
>  
>   ret = psci_cpu_suspend_enter(state) ? -1 : idx;
>  
> - ct_cpuidle_exit();
> -
>   if (s2idle)
>   dev_pm_genpd_resume(pd_dev);
>   else
> @@ -192,7 +188,7 @@ static __cpuidle int 

Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Sudeep Holla
On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> 
> > I'm sorry to have to bear some bad news on that front. :(
> 
> Moo, something had to give..
> 
> 
> > IIUC what's happening here is the PSCI cpuidle driver has entered idle and 
> > RCU
> > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > trace_hardirqs_off() and the RCU usage.
> 
> Right, strictly speaking not needed at this point, IRQs should have been
> traced off a long time ago.
> 
> > I think we need RCU to be watching all the way down to cpu_suspend(), and 
> > it's
> > cpu_suspend() that should actually enter/exit idle context. That and we 
> > need to
> > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > 
> > I'm not sure whether 32-bit will have a similar issue or not.
> 
> I'm not seeing 32bit or RISC-V have similar issues here, but who knows,
> maybe I missed something.
> 
> In any case, the below ought to cure the ARM64 case and remove that last
> known RCU_NONIDLE() user as a bonus.
>

Thanks for the fix. I tested the series and did observe the same splat
with both DT and ACPI boot (they enter idle in different code paths). Thanks
to Mark for reminding me about ACPI. With this fix, I see the splat is
gone in both DT (cpuidle-psci.c) and ACPI (acpi_processor_idle.c).

You can add:

Tested-by: Sudeep Holla 

--
Regards,
Sudeep


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko  wrote:
>
> On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> >
> > I have to say I was struggling a bit with the above and only understood
> > what you mean by reading the patch several times. I would phrase it like
> > this (feel free to use if you consider this to be an improvement).
> >
> > Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> > per-vma and per-mm sequence counters to note exclusive locking:
> > - read lock - (implemented by vma_read_trylock) requires the
> >   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
> >   differ. If they match then there must be a vma exclusive lock
> >   held somewhere.
> > - read unlock - (implemented by vma_read_unlock) is a trivial
> >   vma->lock unlock.
> > - write lock - (vma_write_lock) requires the mmap_lock to be
> >   held exclusively and the current mm counter is noted to the vma
> >   side. This will allow multiple vmas to be locked under a single
> >   mmap_lock write lock (e.g. during vma merging). The vma counter
> >   is modified under exclusive vma lock.
>
> Didn't realize one more thing.
> Unlike standard write lock this implementation allows to be
> called multiple times under a single mmap_lock. In a sense
> it is more of mark_vma_potentially_modified than a lock.

In the RFC it was called vma_mark_locked() originally and renames were
discussed in the email thread ending here:
https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43...@suse.cz/.
If other names are preferable I'm open to changing them.

>
> > - write unlock - (vma_write_unlock_mm) is a batch release of all
> >   vma locks held. It doesn't pair with a specific
> >   vma_write_lock! It is done before exclusive mmap_lock is
> >   released by incrementing mm sequence counter (mm_lock_seq).
> >   - write downgrade - if the mmap_lock is downgraded to the read
> > lock all vma write locks are released as well (effectively
> > same as write unlock).
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:07 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 5986817f393c..c026d75108b3 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct 
> > vm_area_struct *orig)
> >*/
> >   *new = data_race(*orig);
> >   INIT_LIST_HEAD(&new->anon_vma_chain);
> > + vma_init_lock(new);
> >   dup_anon_vma_name(orig, new);
> >   }
> >   return new;
> > @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct 
> > *mm, struct task_struct *p,
> >   seqcount_init(&mm->write_protect_seq);
> >   mmap_init_lock(mm);
> >   INIT_LIST_HEAD(&mm->mmlist);
> > +#ifdef CONFIG_PER_VMA_LOCK
> > + WRITE_ONCE(mm->mm_lock_seq, 0);
> > +#endif
>
> The mm shouldn't be visible so why WRITE_ONCE?

True. Will change to a simple assignment.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 7:04 AM Michal Hocko  wrote:
>
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
>
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
>
> Introduce a per-VMA rw_semaphore. The lock implementation relies on a
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the
>   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>   differ. If they match then there must be a vma exclusive lock
>   held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
>   vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
>   held exclusively and the current mm counter is noted to the vma
>   side. This will allow multiple vmas to be locked under a single
>   mmap_lock write lock (e.g. during vma merging). The vma counter
>   is modified under exclusive vma lock.
> - write unlock - (vma_write_unlock_mm) is a batch release of all
>   vma locks held. It doesn't pair with a specific
>   vma_write_lock! It is done before exclusive mmap_lock is
>   released by incrementing mm sequence counter (mm_lock_seq).
> - write downgrade - if the mmap_lock is downgraded to the read
>   lock all vma write locks are released as well (effectively
>   same as write unlock).

Thanks for the suggestion, Michal. I'll definitely reuse your description.

>
> > VMA lock is placed on the cache line boundary so that its 'count' field
> > falls into the first cache line while the rest of the fields fall into
> > the second cache line. This lets the 'count' field to be cached with
> > other frequently accessed fields and used quickly in uncontended case
> > while 'owner' and other fields used in the contended case will not
> > invalidate the first cache line while waiting on the lock.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h| 80 +++
> >  include/linux/mm_types.h  |  8 
> >  include/linux/mmap_lock.h | 13 +++
> >  kernel/fork.c |  4 ++
> >  mm/init-mm.c  |  3 ++
> >  5 files changed, 108 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..ec2c4c227d51 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -612,6 +612,85 @@ struct vm_operations_struct {
> > unsigned long addr);
> >  };
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static inline void vma_init_lock(struct vm_area_struct *vma)
> > +{
> > + init_rwsem(&vma->lock);
> > + vma->vm_lock_seq = -1;
> > +}
> > +
> > +static inline void vma_write_lock(struct vm_area_struct *vma)
> > +{
> > + int mm_lock_seq;
> > +
> > + mmap_assert_write_locked(vma->vm_mm);
> > +
> > + /*
> > +  * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> > +  * mm->mm_lock_seq can't be concurrently modified.
> > +  */
> > + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> > + if (vma->vm_lock_seq == mm_lock_seq)
> > + return;
> > +
> > + down_write(&vma->lock);
> > + vma->vm_lock_seq = mm_lock_seq;
> > + up_write(&vma->lock);
> > +}
> > +
> > +/*
> > + * Try to read-lock a vma. The function is allowed to occasionally yield 
> > false
> > + * locked result to avoid performance overhead, in which case we fall back 
> > to
> > + * using mmap_lock. The function should never yield false unlocked result.
> > + */
> > +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > +{
> > + /* Check before locking. A race might cause false locked result. */
> > + if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > + return false;
> > +
> > + if (unlikely(down_read_trylock(&vma->lock) == 0))
> > + return false;
> > +
> > + /*
> > +  * Overflow might produce false locked result.
> > +  * False unlocked result is impossible because we modify and check
> > +  * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> > +  * modification invalidates all existing locks.
> > +  */
> > + 
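
The quoted hunk is truncated above. For context, a minimal sketch of how such a trylock typically finishes, inferred from the quoted comment rather than copied from the patch (so treat it as an assumption): re-check the sequence counters under vma->lock and back off if an exclusive lock is actually held.

    /* assumed remainder of vma_read_trylock(), see note above */
    if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) {
            up_read(&vma->lock);
            return false;
    }
    return true;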

Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:28 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > not be detected by a page fault handler without proper locking.
> >
> > I am struggling with this changelog. Maybe because my recollection of
> > the THP collapsing subtleties is weak. But aren't you just trying to say
> > that the current #PF handling and THP collapsing need to be mutually
> > exclusive currently so in order to keep that assumption you have mark
> > the vma write locked?
> >
> > Also it is not really clear to me how that handles other vmas which can
> > share the same thp?
>
> It's not about the hugepage itself, it's about how the THP collapse
> operation frees page tables.
>
> Before this series, page tables can be walked under any one of the
> mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> unlinks and frees page tables, it must ensure that all of those either
> are locked or don't exist. This series adds a fourth lock under which
> page tables can be traversed, and so khugepaged must also lock out that one.
>
> There is a codepath in khugepaged that iterates through all mappings
> of a file to zap page tables (retract_page_tables()), which locks each
> visited mm with mmap_write_trylock() and now also does
> vma_write_lock().
>
>
> I think one aspect of this patch that might cause trouble later on, if
> support for non-anonymous VMAs is added, is that retract_page_tables()
> now does vma_write_lock() while holding the mapping lock; the page
> fault handling path would probably take the locks the other way
> around, leading to a deadlock? So the vma_write_lock() in
> retract_page_tables() might have to become a trylock later on.
>
> Related: Please add the new VMA lock to the big lock ordering comments
> at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
> add file VMA support.)

Thanks for the clarifications and the warning. I'll add appropriate
comments and will take this deadlocking scenario into account when
later implementing support for file-backed page faults.
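
To make the ordering concern concrete, here is a rough sketch of the retract_page_tables() step under discussion. vma_write_trylock() is hypothetical (the series only provides vma_write_lock()) and is shown purely to illustrate how a trylock would sidestep the inverted lock order once file-backed VMAs start taking the per-VMA lock in the fault path:

    /* sketch: called with the file's i_mmap lock already held */
    if (mmap_write_trylock(mm)) {
            if (vma_write_trylock(vma)) {   /* hypothetical trylock variant */
                    collapse_and_free_pmd(mm, vma, addr, pmd);
                    /* the write mark is dropped when mm_lock_seq is bumped */
            }
            mmap_write_unlock(mm);
    }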


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
[...]
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> +   MA_STATE(mas, &mm->mm_mt, address, address);
> +   struct vm_area_struct *vma, *validate;
> +
> +   rcu_read_lock();
> +   vma = mas_walk(&mas);
> +retry:
> +   if (!vma)
> +   goto inval;
> +
> +   /* Only anonymous vmas are supported for now */
> +   if (!vma_is_anonymous(vma))
> +   goto inval;
> +
> +   if (!vma_read_trylock(vma))
> +   goto inval;
> +
> +   /* Check since vm_start/vm_end might change before we lock the VMA */
> +   if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +   vma_read_unlock(vma);
> +   goto inval;
> +   }
> +
> +   /* Check if the VMA got isolated after we found it */
> +   mas.index = address;
> +   validate = mas_walk(&mas);

Question for Maple Tree experts:

Are you allowed to use mas_walk() like this? If the first mas_walk()
call encountered a single-entry tree, it would store mas->node =
MAS_ROOT, right? And then the second call would go into
mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
true, and then mas_state_walk() would return the result of
mas_start(), which is NULL? And we'd end up with mas_walk() returning
NULL on the second run even though the tree hasn't changed?

> +   if (validate != vma) {
> +   vma_read_unlock(vma);
> +   count_vm_vma_lock_event(VMA_LOCK_MISS);
> +   /* The area was replaced with another one. */
> +   vma = validate;
> +   goto retry;
> +   }
> +
> +   rcu_read_unlock();
> +   return vma;
> +inval:
> +   rcu_read_unlock();
> +   count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +   return NULL;
> +}
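
For reference, a sketch of how an arch page fault handler is expected to use the helper above, before falling back to the mmap_lock path. This is an illustration only; FAULT_FLAG_VMA_LOCK and the surrounding labels are assumptions, not text quoted from the series:

    vma = lock_vma_under_rcu(mm, address);
    if (vma) {
            fault = handle_mm_fault(vma, address,
                                    flags | FAULT_FLAG_VMA_LOCK, regs);
            vma_read_unlock(vma);
            if (!(fault & VM_FAULT_RETRY))
                    goto done;
            /* otherwise fall back to the usual mmap_lock path */
    }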


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:31 PM Michal Hocko  wrote:
>
> On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
> [...]
> > > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > > sem was the original motivation for making rwsems non-starvable, so
> > > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > > suggest that a waiting writer set the top bit of the counter.  That way,
> > > all new readers will back off without needing to check a second variable
> > > and old readers will know that they *may* need to do the wakeup when
> > > atomic_sub_return_release() is negative.
> > >
> > > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > > that far; the important point is that the waiting writer indicates its
> > > presence in the count field so that readers can modify their behaviour)
> >
> > Got it. Ok, I think we can figure something out to check if there are
> > waiting write-lockers and prevent new readers from taking the lock.
>
> Reinventing locking primitives is a ticket to weird bugs. I would stick
> with the rwsem and deal with performance fallouts after it is clear that
> the core idea is generally acceptable and based on actual real life
> numbers. This whole thing is quite big enough that we do not have to go
> through "is this new synchronization primitive correct and behaving
> reasonably" exercise.

Point taken. That's one of the reasons I kept this patch separate.
I'll drop this last patch from the series for now. One correction
though, this will not be a performance fallout but memory consumption
fallout.

>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 12:36 PM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 8:51 PM Jann Horn  wrote:
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > Suggested-by: Peter Xu 
> > > ---
> > >  mm/memory.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 20806bc8b4eb..12508f4d845a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> > > mm_struct *mm,
> > > if (!vma->anon_vma)
> > > goto inval;
> > >
> > > +   /*
> > > +* Due to the possibility of userfault handler dropping 
> > > mmap_lock, avoid
> > > +* it for now and fall back to page fault handling under 
> > > mmap_lock.
> > > +*/
> > > +   if (userfaultfd_armed(vma))
> > > +   goto inval;
> >
> > This looks racy wrt concurrent userfaultfd_register(). I think you'll
> > want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
>
> I still think this change is needed...

Yes, I think you are right. I'll move the check after locking the VMA. Thanks!

>
> > and ensure that the userfaultfd code write-locks the VMA before
> > changing the __VM_UFFD_FLAGS in vma->vm_flags.
>
> Ah, but now I see you already took care of this half of the issue with
> the reset_vm_flags() change in
> https://lore.kernel.org/linux-mm/20230109205336.3665937-16-sur...@google.com/
> .
>
>
> > > if (!vma_read_trylock(vma))
> > > goto inval;
> > >
> > > --
> > > 2.39.0
> > >
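
A minimal sketch of the reordering agreed on above (illustrative only): perform the userfaultfd_armed() check only after the per-VMA read lock is taken, so it cannot race with a concurrent userfaultfd_register() that write-locks the VMA before updating vm_flags:

    if (!vma_read_trylock(vma))
            goto inval;

    /* re-checked under the per-VMA lock, closing the registration race */
    if (userfaultfd_armed(vma)) {
            vma_read_unlock(vma);
            goto inval;
    }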


Re: Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-17 Thread John Paul Adrian Glaubitz

Hi!

On 1/17/23 21:05, Geert Uytterhoeven wrote:

Isn't this supposed to be caught by this check:

 a, __same_type(a, NULL)

?


Yeah, but gcc thinks it is smarter than us...
Probably it drops the test, assuming UB cannot happen.


Hmm, sounds like a GCC bug to me, then. Not sure how to fix this.

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913
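
For readers following along, this is roughly the construct under discussion, reconstructed here for illustration (it may not be the verbatim kernel macro). The __same_type() guard is meant to make the sizeof division dead code when NULL is passed, but as Geert suggests, GCC appears to drop the test on the assumption that undefined behaviour cannot happen, so the NULL arguments still trip it:

    /* illustrative reconstruction of the guard being discussed */
    #define _INTC_ARRAY(a)  a, __same_type(a, NULL) ? 0 : sizeof(a)/sizeof(*a)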



Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 8:51 PM Jann Horn  wrote:
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > handling under VMA lock and retry holding mmap_lock. This can be handled
> > more gracefully in the future.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > Suggested-by: Peter Xu 
> > ---
> >  mm/memory.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 20806bc8b4eb..12508f4d845a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> > mm_struct *mm,
> > if (!vma->anon_vma)
> > goto inval;
> >
> > +   /*
> > +* Due to the possibility of userfault handler dropping mmap_lock, 
> > avoid
> > +* it for now and fall back to page fault handling under mmap_lock.
> > +*/
> > +   if (userfaultfd_armed(vma))
> > +   goto inval;
>
> This looks racy wrt concurrent userfaultfd_register(). I think you'll
> want to do the userfaultfd_armed(vma) check _after_ locking the VMA,

I still think this change is needed...

> and ensure that the userfaultfd code write-locks the VMA before
> changing the __VM_UFFD_FLAGS in vma->vm_flags.

Ah, but now I see you already took care of this half of the issue with
the reset_vm_flags() change in
https://lore.kernel.org/linux-mm/20230109205336.3665937-16-sur...@google.com/
.


> > if (!vma_read_trylock(vma))
> > goto inval;
> >
> > --
> > 2.39.0
> >


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 10:28:40, Suren Baghdasaryan wrote:
[...]
> > Then yes, that's a starvable lock.  Preventing starvation on the mmap
> > sem was the original motivation for making rwsems non-starvable, so
> > changing that behaviour now seems like a bad idea.  For efficiency, I'd
> > suggest that a waiting writer set the top bit of the counter.  That way,
> > all new readers will back off without needing to check a second variable
> > and old readers will know that they *may* need to do the wakeup when
> > atomic_sub_return_release() is negative.
> >
> > (rwsem.c has a more complex bitfield, but I don't think we need to go
> > that far; the important point is that the waiting writer indicates its
> > presence in the count field so that readers can modify their behaviour)
> 
> Got it. Ok, I think we can figure something out to check if there are
> waiting write-lockers and prevent new readers from taking the lock.

Reinventing locking primitives is a ticket to weird bugs. I would stick
with the rwsem and deal with performance fallouts after it is clear that
the core idea is generally acceptable and based on actual real life
numbers. This whole thing is quite big enough that we do not have to go
through "is this new synchronization primitive correct and behaving
reasonably" exercise.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while collapsing a huge
> > page. Page fault handler needs a stable PMD to use PTL and relies on
> > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > not be detected by a page fault handler without proper locking.
>
> I am struggling with this changelog. Maybe because my recollection of
> the THP collapsing subtleties is weak. But aren't you just trying to say
> that the current #PF handling and THP collapsing need to be mutually
> exclusive currently so in order to keep that assumption you have mark
> the vma write locked?
>
> Also it is not really clear to me how that handles other vmas which can
> share the same thp?

It's not about the hugepage itself, it's about how the THP collapse
operation frees page tables.

Before this series, page tables can be walked under any one of the
mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This series adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that one.

There is a codepath in khugepaged that iterates through all mappings
of a file to zap page tables (retract_page_tables()), which locks each
visited mm with mmap_write_trylock() and now also does
vma_write_lock().


I think one aspect of this patch that might cause trouble later on, if
support for non-anonymous VMAs is added, is that retract_page_tables()
now does vma_write_lock() while holding the mapping lock; the page
fault handling path would probably take the locks the other way
around, leading to a deadlock? So the vma_write_lock() in
retract_page_tables() might have to become a trylock later on.

Related: Please add the new VMA lock to the big lock ordering comments
at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
add file VMA support.)


Re: Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-17 Thread Geert Uytterhoeven
Hi Adrian,

On Tue, Jan 17, 2023 at 6:06 PM John Paul Adrian Glaubitz
 wrote:
> On 1/17/23 18:01, Geert Uytterhoeven wrote:
> > The issue is that some of the parameters are not arrays, but
> > NULL. E.g.:
> >
> > arch/sh/kernel/cpu/sh2/setup-sh7619.c:static
> > DECLARE_INTC_DESC(intc_desc, "sh7619", vectors, NULL,
> > arch/sh/kernel/cpu/sh2/setup-sh7619.c-   NULL,
> > prio_registers, NULL);
>
> Isn't this supposed to be caught by this check:
>
> a, __same_type(a, NULL)
>
> ?

Yeah, but gcc thinks it is smarter than us...
Probably it drops the test, assuming UB cannot happen.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
>
> Signed-off-by: Suren Baghdasaryan 
> Suggested-by: Peter Xu 
> ---
>  mm/memory.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 20806bc8b4eb..12508f4d845a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> mm_struct *mm,
> if (!vma->anon_vma)
> goto inval;
>
> +   /*
> +* Due to the possibility of userfault handler dropping mmap_lock, 
> avoid
> +* it for now and fall back to page fault handling under mmap_lock.
> +*/
> +   if (userfaultfd_armed(vma))
> +   goto inval;

This looks racy wrt concurrent userfaultfd_register(). I think you'll
want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
and ensure that the userfaultfd code write-locks the VMA before
changing the __VM_UFFD_FLAGS in vma->vm_flags.

> if (!vma_read_trylock(vma))
> goto inval;
>
> --
> 2.39.0
>


Re: [PATCH] modpost: support arbitrary symbol length in modversion

2023-01-17 Thread Lucas De Marchi

On Tue, Jan 17, 2023 at 06:51:44PM +0100, Michal Suchánek wrote:

Hello,

On Fri, Jan 13, 2023 at 06:18:41PM +, Gary Guo wrote:

On Thu, 12 Jan 2023 14:40:59 -0700
Lucas De Marchi  wrote:

> On Wed, Jan 11, 2023 at 04:11:51PM +, Gary Guo wrote:
> >
> > struct modversion_info {
> >-  unsigned long crc;
> >-  char name[MODULE_NAME_LEN];
> >+  /* Offset of the next modversion entry in relation to this one. */
> >+  u32 next;
> >+  u32 crc;
> >+  char name[0];
>
> although not really exported as uapi, this will break userspace as this is
> used in the  elf file generated for the modules. I think
> this change must be made in a backward compatible way and kmod updated
> to deal with the variable name length:
>
> kmod $ git grep "\[64"
> libkmod/libkmod-elf.c:  char name[64 - sizeof(uint32_t)];
> libkmod/libkmod-elf.c:  char name[64 - sizeof(uint64_t)];
>
> in kmod we have both 32 and 64 because a 64-bit kmod can read both 32
> and 64 bit module, and vice versa.
>

Hi Lucas,

Thanks for the information.

The change can't be "truly" backward compatible, in a sense that
regardless of the new format we choose, kmod would not be able to decode
symbols longer than "64 - sizeof(long)" bytes. So the list it retrieves
is going to be incomplete, isn't it?

What kind of backward compatibility should be expected? It could be:
* short symbols can still be found by old versions of kmod, but not
  long symbols;


That sounds good. Not everyone is using rust, and with this option
people who do will need to upgrade tooling, and people who don't care
don't need to do anything.


that could be it indeed. My main worry here is:

"After the support is added in kmod, kmod needs to be able to output the
correct information regardless if the module is from before/after the
change in the kernel and also without relying on kernel version."
Just changing the struct modversion_info doesn't make that possible.

Maybe adding the long symbols in another section? Or ble
just increase to 512 and add the size to a
"__versions_hdr" section. If we then output a max size per module,
this would offset a little bit the additional size gained for the
modules using rust. And the additional 0's should compress well
so I'm not sure the additional size is that much relevant here.

Lucas De Marchi



Thanks

Michal
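
As a rough illustration of the "__versions_hdr" direction above (the structure and field names here are assumptions, not the actual proposal), a small per-module header describing the record size would let tools parse both the old 64-byte entries and larger fixed-size ones without guessing from the kernel version:

    /* sketch only -- illustrative names and layout */
    struct modversion_hdr {
            u32 record_size;        /* 64 today, e.g. 512 for long Rust symbols */
    };

    struct modversion_record {
            unsigned long crc;
            char name[];            /* record_size - sizeof(long) bytes, NUL-padded */
    };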


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 11:00 AM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan  wrote:
> > On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > > > >
> > > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > > two important specifics which can be used to replace 
> > > > > > > > rw_semaphore
> > > > > > > > with a simpler structure:
> > > > > > > [...]
> > > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > > >  {
> > > > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > > >  }
> > > > > > >
> > > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > > use-after-free because you're accessing the vma after dropping 
> > > > > > > your
> > > > > > > reference on it. You'd have to first look up the vma->vm_mm, then 
> > > > > > > do
> > > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > > >
> > > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > > counter of how many readers took the lock or it's negative if the 
> > > > > > lock
> > > > > > is write-locked.
> > > > >
> > > > > Yes, but ...
> > > > >
> > > > > Task A:
> > > > > atomic_dec_and_test(&vma->vm_lock->count)
> > > > > Task B:
> > > > > munmap()
> > > > > write lock
> > > > > free VMA
> > > > > synchronize_rcu()
> > > > > VMA is really freed
> > > > > wake_up(&vma->vm_mm->vma_writer_wait);
> > > > >
> > > > > ... vma is freed.
> > > > >
> > > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > > have that assertion?
> > > >
> > > > Yep, that's what this patch is doing
> > > > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > > > by calling vma_assert_no_reader() from __vm_area_free().
> > >
> > > That's not enough though.  Task A still has a pointer to vma after it
> > > has called atomic_dec_and_test(), even after vma has been freed by
> > > Task B, and before Task A dereferences vma->vm_mm.
> >
> > Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> > local variable and call mmgrab() before atomic_dec_and_test(), then
> > use it in wake_up() and call mmdrop(). Is that what you are thinking?
>
> You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
> for page fault handling must be keeping the mm_struct alive.

Good point. Will update in the next revision to store mm before
dropping the count. Thanks for all the comments folks!
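
A minimal sketch of that rework (illustrative only, not the actual next
revision of the patch): the mm pointer is read before the decrement so the
wakeup never touches the possibly-freed vma.

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	/* Cache vm_mm first: after the decrement the vma may already be freed. */
	struct mm_struct *mm = vma->vm_mm;

	if (atomic_dec_and_test(&vma->vm_lock->count))
		wake_up(&mm->vma_writer_wait);
}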


Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:34 AM Jann Horn  wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > vma->lock being part of the vm_area_struct causes performance regression
> > during page faults because during contention its count and owner fields
> > are constantly updated and having other parts of vm_area_struct used
> > during page fault handling next to them causes constant cache line
> > bouncing. Fix that by moving the lock outside of the vm_area_struct.
> > All attempts to keep vma->lock inside vm_area_struct in a separate
> > cache line still produce performance regression especially on NUMA
> > machines. Smallest regression was achieved when lock is placed in the
> > fourth cache line but that bloats vm_area_struct to 256 bytes.
>
> Just checking: When you tested putting the lock in different cache
> lines, did you force the slab allocator to actually store the
> vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
> on the slab or with a cacheline_aligned_in_smp on the struct
> definition)?

Yep, I tried all these combinations and still saw noticeable regression.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan  wrote:
> On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
> >
> > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan 
> > > > > >  wrote:
> > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > with a simpler structure:
> > > > > > [...]
> > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > -   up_read(>vm_lock->lock);
> > > > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > > > >  }
> > > > > >
> > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > >
> > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > is write-locked.
> > > >
> > > > Yes, but ...
> > > >
> > > > Task A:
> > > > atomic_dec_and_test(>vm_lock->count)
> > > > Task B:
> > > > munmap()
> > > > write lock
> > > > free VMA
> > > > synchronize_rcu()
> > > > VMA is really freed
> > > > wake_up(>vm_mm->vma_writer_wait);
> > > >
> > > > ... vma is freed.
> > > >
> > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > have that assertion?
> > >
> > > Yep, that's what this patch is doing
> > > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > > by calling vma_assert_no_reader() from __vm_area_free().
> >
> > That's not enough though.  Task A still has a pointer to vma after it
> > has called atomic_dec_and_test(), even after vma has been freed by
> > Task B, and before Task A dereferences vma->vm_mm.
>
> Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> local variable and call mmgrab() before atomic_dec_and_test(), then
> use it in wake_up() and call mmdrop(). Is that what you are thinking?

You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
for page fault handling must be keeping the mm_struct alive.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  wrote:
> > >
> > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > >
> > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > > > wrote:
> > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > with a simpler structure:
> > > > > [...]
> > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > > -   up_read(>vm_lock->lock);
> > > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > > >  }
> > > > >
> > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > use-after-free because you're accessing the vma after dropping your
> > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > read-side critical section if the VMA is freed with RCU delay.
> > > >
> > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > counter of how many readers took the lock or it's negative if the lock
> > > > is write-locked.
> > >
> > > Yes, but ...
> > >
> > > Task A:
> > > atomic_dec_and_test(>vm_lock->count)
> > > Task B:
> > > munmap()
> > > write lock
> > > free VMA
> > > synchronize_rcu()
> > > VMA is really freed
> > > wake_up(>vm_mm->vma_writer_wait);
> > >
> > > ... vma is freed.
> > >
> > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > have that assertion?
> >
> > Yep, that's what this patch is doing
> > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > by calling vma_assert_no_reader() from __vm_area_free().
>
> That's not enough though.  Task A still has a pointer to vma after it
> has called atomic_dec_and_test(), even after vma has been freed by
> Task B, and before Task A dereferences vma->vm_mm.

Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
local variable and call mmgrab() before atomic_dec_and_test(), then
use it in wake_up() and call mmdrop(). Is that what you are thinking?


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:36 AM Jann Horn  wrote:
>
> On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox  wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > > wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -   up_read(>vm_lock->lock);
> > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> > Task A:
> > atomic_dec_and_test(>vm_lock->count)
> > Task B:
> > munmap()
> > write lock
> > free VMA
> > synchronize_rcu()
> > VMA is really freed
> > wake_up(>vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
>
> I don't see that. When do_user_addr_fault() is calling
> vma_read_unlock(), there's no RCU read lock held, right?

We free VMAs using call_rcu() after removing them from VMA tree. OTOH
page fault handlers are searching for VMAs from inside RCU read
section and calling vma_read_unlock() from there, see
https://lore.kernel.org/all/20230109205336.3665937-29-sur...@google.com/.
Once we take the VMA read-lock, it ensures that it can't be
write-locked and if someone is destroying or isolating the VMA, it
needs to write-lock it first.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Matthew Wilcox
On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  wrote:
> >
> > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > >
> > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > > wrote:
> > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > two important specifics which can be used to replace rw_semaphore
> > > > > with a simpler structure:
> > > > [...]
> > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > >  {
> > > > > -   up_read(>vm_lock->lock);
> > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > >  }
> > > >
> > > > I haven't properly reviewed this, but this bit looks like a
> > > > use-after-free because you're accessing the vma after dropping your
> > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > read-side critical section if the VMA is freed with RCU delay.
> > >
> > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > counter of how many readers took the lock or it's negative if the lock
> > > is write-locked.
> >
> > Yes, but ...
> >
> > Task A:
> > atomic_dec_and_test(>vm_lock->count)
> > Task B:
> > munmap()
> > write lock
> > free VMA
> > synchronize_rcu()
> > VMA is really freed
> > wake_up(>vm_mm->vma_writer_wait);
> >
> > ... vma is freed.
> >
> > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > have that assertion?
> 
> Yep, that's what this patch is doing
> https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> by calling vma_assert_no_reader() from __vm_area_free().

That's not enough though.  Task A still has a pointer to vma after it
has called atomic_dec_and_test(), even after vma has been freed by
Task B, and before Task A dereferences vma->vm_mm.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -   up_read(>vm_lock->lock);
> > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(>vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(>vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

Yep, that's what this patch is doing
https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
by calling vma_assert_no_reader() from __vm_area_free().

>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  
> > > wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -   up_read(>vm_lock->lock);
> > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(>vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(>vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

I don't see that. When do_user_addr_fault() is calling
vma_read_unlock(), there's no RCU read lock held, right?


Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> vma->lock being part of the vm_area_struct causes performance regression
> during page faults because during contention its count and owner fields
> are constantly updated and having other parts of vm_area_struct used
> during page fault handling next to them causes constant cache line
> bouncing. Fix that by moving the lock outside of the vm_area_struct.
> All attempts to keep vma->lock inside vm_area_struct in a separate
> cache line still produce performance regression especially on NUMA
> machines. Smallest regression was achieved when lock is placed in the
> fourth cache line but that bloats vm_area_struct to 256 bytes.

Just checking: When you tested putting the lock in different cache
lines, did you force the slab allocator to actually store the
vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
on the slab or with a cacheline_aligned_in_smp on the struct
definition)?


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Matthew Wilcox
On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> >
> > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > [...]
> > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > >  {
> > > -   up_read(&vma->vm_lock->lock);
> > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > >  }
> >
> > I haven't properly reviewed this, but this bit looks like a
> > use-after-free because you're accessing the vma after dropping your
> > reference on it. You'd have to first look up the vma->vm_mm, then do
> > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > touching the vma. Or alternatively wrap the whole thing in an RCU
> > read-side critical section if the VMA is freed with RCU delay.
> 
> vm_lock->count does not control the lifetime of the VMA, it's a
> counter of how many readers took the lock or it's negative if the lock
> is write-locked.

Yes, but ...

Task A:
atomic_dec_and_test(&vma->vm_lock->count)
Task B:
munmap()
write lock
free VMA
synchronize_rcu()
VMA is really freed
wake_up(&vma->vm_mm->vma_writer_wait);

... vma is freed.

Now, I think this doesn't occur.  I'm pretty sure that every caller of
vma_read_unlock() is holding the RCU read lock.  But maybe we should
have that assertion?



Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:23 AM Matthew Wilcox  wrote:
>
> On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> > On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox  wrote:
> > >
> > > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox  
> > > > wrote:
> > > > >
> > > > > On Mon, Jan 16, 2023 at 11:14:38AM +, Hyeonggon Yoo wrote:
> > > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct 
> > > > > > > vm_area_struct *vma)
> > > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > /* Check before locking. A race might cause false locked 
> > > > > > > result. */
> > > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > +   if (vma->vm_lock->lock_seq == 
> > > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > > return false;
> > > > > > >
> > > > > > > -   if (unlikely(down_read_trylock(>vm_lock->lock) == 0))
> > > > > > > +   if 
> > > > > > > (unlikely(!atomic_inc_unless_negative(>vm_lock->count)))
> > > > > > > return false;
> > > > > > >
> > > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > > +   if (unlikely(atomic_read(>vm_lock->count) < 0)) {
> > > > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > > > > +   return false;
> > > > > > > +   }
> > > > > > > +
> > > > > > > /*
> > > > > > >  * Overflow might produce false locked result.
> > > > > > >  * False unlocked result is impossible because we modify and 
> > > > > > > check
> > > > > > >  * vma->vm_lock_seq under vma->vm_lock protection and 
> > > > > > > mm->mm_lock_seq
> > > > > > >  * modification invalidates all existing locks.
> > > > > > >  */
> > > > > > > -   if (unlikely(vma->vm_lock_seq == 
> > > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > -   up_read(>vm_lock->lock);
> > > > > > > +   if (unlikely(vma->vm_lock->lock_seq == 
> > > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > > +   if (atomic_dec_and_test(>vm_lock->count))
> > > > > > > +   wake_up(>vm_mm->vma_writer_wait);
> > > > > > > return false;
> > > > > > > }
> > > > > >
> > > > > > With this change readers can cause writers to starve.
> > > > > > What about checking waitqueue_active() before or after increasing
> > > > > > vma->vm_lock->count?
> > > > >
> > > > > I don't understand how readers can starve a writer.  Readers do
> > > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > > to fail.
> > > >
> > > > I think the point here was that if page faults keep occurring and they
> > > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > > and there is no reader throttling mechanism (no max time that writer
> > > > will be waiting).
> > >
> > > Perhaps I misunderstood your description; I thought that a _waiting_
> > > writer would make the count negative, not a successfully acquiring
> > > writer.
> >
> > A waiting writer does not modify the counter, instead it's placed on
> > the wait queue and the last reader which sets the count to 0 while
> > releasing its read lock will wake it up. Once the writer is woken it
> > will try to set the count to negative and if successful will own the
> > lock, otherwise it goes back to sleep.
>
> Then yes, that's a starvable lock.  Preventing starvation on the mmap
> sem was the original motivation for making rwsems non-starvable, so
> changing that behaviour now seems like a bad idea.  For efficiency, I'd
> suggest that a waiting writer set the top bit of the counter.  That way,
> all new readers will back off without needing to check a second variable
> and old readers will know that they *may* need to do the wakeup when
> atomic_sub_return_release() is negative.
>
> (rwsem.c has a more complex bitfield, but I don't think we need to go
> that far; the important point is that the waiting writer indicates its
> presence in the count field so that readers can modify their behaviour)

Got it. Ok, I think we can figure something out to check if there are
waiting write-lockers and prevent new readers from taking the lock.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Suren Baghdasaryan
On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
>
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > rw_semaphore is a sizable structure of 40 bytes and consumes
> > considerable space for each vm_area_struct. However vma_lock has
> > two important specifics which can be used to replace rw_semaphore
> > with a simpler structure:
> [...]
> >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> >  {
> > -   up_read(&vma->vm_lock->lock);
> > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > +   wake_up(&vma->vm_mm->vma_writer_wait);
> >  }
>
> I haven't properly reviewed this, but this bit looks like a
> use-after-free because you're accessing the vma after dropping your
> reference on it. You'd have to first look up the vma->vm_mm, then do
> the atomic_dec_and_test(), and afterwards do the wake_up() without
> touching the vma. Or alternatively wrap the whole thing in an RCU
> read-side critical section if the VMA is freed with RCU delay.

vm_lock->count does not control the lifetime of the VMA, it's a
counter of how many readers took the lock or it's negative if the lock
is write-locked.


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Matthew Wilcox
On Mon, Jan 16, 2023 at 09:58:35PM -0800, Suren Baghdasaryan wrote:
> On Mon, Jan 16, 2023 at 9:46 PM Matthew Wilcox  wrote:
> >
> > On Mon, Jan 16, 2023 at 08:34:36PM -0800, Suren Baghdasaryan wrote:
> > > On Mon, Jan 16, 2023 at 8:14 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Mon, Jan 16, 2023 at 11:14:38AM +, Hyeonggon Yoo wrote:
> > > > > > @@ -643,20 +647,28 @@ static inline void vma_write_lock(struct 
> > > > > > vm_area_struct *vma)
> > > > > >  static inline bool vma_read_trylock(struct vm_area_struct *vma)
> > > > > >  {
> > > > > > /* Check before locking. A race might cause false locked 
> > > > > > result. */
> > > > > > -   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > +   if (vma->vm_lock->lock_seq == 
> > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))
> > > > > > return false;
> > > > > >
> > > > > > -   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
> > > > > > +   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
> > > > > > return false;
> > > > > >
> > > > > > +   /* If atomic_t overflows, restore and fail to lock. */
> > > > > > +   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
> > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > +   return false;
> > > > > > +   }
> > > > > > +
> > > > > > /*
> > > > > >  * Overflow might produce false locked result.
> > > > > >  * False unlocked result is impossible because we modify and 
> > > > > > check
> > > > > >  * vma->vm_lock_seq under vma->vm_lock protection and 
> > > > > > mm->mm_lock_seq
> > > > > >  * modification invalidates all existing locks.
> > > > > >  */
> > > > > > -   if (unlikely(vma->vm_lock_seq == 
> > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > +   if (unlikely(vma->vm_lock->lock_seq == 
> > > > > > READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > return false;
> > > > > > }
> > > > >
> > > > > With this change readers can cause writers to starve.
> > > > > What about checking waitqueue_active() before or after increasing
> > > > > vma->vm_lock->count?
> > > >
> > > > I don't understand how readers can starve a writer.  Readers do
> > > > atomic_inc_unless_negative() so a writer can always force readers
> > > > to fail.
> > >
> > > I think the point here was that if page faults keep occurring and they
> > > prevent vm_lock->count from reaching 0 then a writer will be blocked
> > > and there is no reader throttling mechanism (no max time that writer
> > > will be waiting).
> >
> > Perhaps I misunderstood your description; I thought that a _waiting_
> > writer would make the count negative, not a successfully acquiring
> > writer.
> 
> A waiting writer does not modify the counter, instead it's placed on
> the wait queue and the last reader which sets the count to 0 while
> releasing its read lock will wake it up. Once the writer is woken it
> will try to set the count to negative and if successful will own the
> lock, otherwise it goes back to sleep.

Then yes, that's a starvable lock.  Preventing starvation on the mmap
sem was the original motivation for making rwsems non-starvable, so
changing that behaviour now seems like a bad idea.  For efficiency, I'd
suggest that a waiting writer set the top bit of the counter.  That way,
all new readers will back off without needing to check a second variable
and old readers will know that they *may* need to do the wakeup when
atomic_sub_return_release() is negative.

(rwsem.c has a more complex bitfield, but I don't think we need to go
that far; the important point is that the waiting writer indicates its
presence in the count field so that readers can modify their behaviour)
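
To make the reader side of that suggestion concrete, a rough sketch (assuming
a waiting writer has already set the top bit of vm_lock->count; the lock_seq
checks and the writer side are omitted):

static inline bool vma_read_trylock(struct vm_area_struct *vma)
{
	/* New readers back off as soon as a waiting writer makes the count negative. */
	return atomic_inc_unless_negative(&vma->vm_lock->count);
}

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;

	/* Old readers may owe the waiting writer a wakeup; a spurious wake_up() is harmless. */
	if (atomic_sub_return_release(1, &vma->vm_lock->count) < 0)
		wake_up(&mm->vma_writer_wait);
}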


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
[...]
>  static inline void vma_read_unlock(struct vm_area_struct *vma)
>  {
> -   up_read(&vma->vm_lock->lock);
> +   if (atomic_dec_and_test(&vma->vm_lock->count))
> +   wake_up(&vma->vm_mm->vma_writer_wait);
>  }

I haven't properly reviewed this, but this bit looks like a
use-after-free because you're accessing the vma after dropping your
reference on it. You'd have to first look up the vma->vm_mm, then do
the atomic_dec_and_test(), and afterwards do the wake_up() without
touching the vma. Or alternatively wrap the whole thing in an RCU
read-side critical section if the VMA is freed with RCU delay.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Jann Horn
+locking maintainers

On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
[...]
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> +   up_read(&vma->lock);
> +}

One thing that might be gnarly here is that I think you might not be
allowed to use up_read() to fully release ownership of an object -
from what I remember, I think that up_read() (unlike something like
spin_unlock()) can access the lock object after it's already been
acquired by someone else. So if you want to protect against concurrent
deletion, this might have to be something like:

rcu_read_lock(); /* keeps vma alive */
up_read(&vma->lock);
rcu_read_unlock();

But I'm not entirely sure about that, the locking folks might know better.

Also, it might not matter given that the rw_semaphore part is removed
in the current patch 41/41 anyway...


Re: [PATCH] modpost: support arbitrary symbol length in modversion

2023-01-17 Thread Michal Suchánek
Hello,

On Fri, Jan 13, 2023 at 06:18:41PM +, Gary Guo wrote:
> On Thu, 12 Jan 2023 14:40:59 -0700
> Lucas De Marchi  wrote:
> 
> > On Wed, Jan 11, 2023 at 04:11:51PM +, Gary Guo wrote:
> > >
> > > struct modversion_info {
> > >-  unsigned long crc;
> > >-  char name[MODULE_NAME_LEN];
> > >+  /* Offset of the next modversion entry in relation to this one. */
> > >+  u32 next;
> > >+  u32 crc;
> > >+  char name[0];  
> > 
> > although not really exported as uapi, this will break userspace as this is
> > used in the  elf file generated for the modules. I think
> > this change must be made in a backward compatible way and kmod updated
> > to deal with the variable name length:
> > 
> > kmod $ git grep "\[64"
> > libkmod/libkmod-elf.c:  char name[64 - sizeof(uint32_t)];
> > libkmod/libkmod-elf.c:  char name[64 - sizeof(uint64_t)];
> > 
> > in kmod we have both 32 and 64 because a 64-bit kmod can read both 32
> > and 64 bit module, and vice versa.
> > 
> 
> Hi Lucas,
> 
> Thanks for the information.
> 
> The change can't be "truly" backward compatible, in a sense that
> regardless of the new format we choose, kmod would not be able to decode
> symbols longer than "64 - sizeof(long)" bytes. So the list it retrieves
> is going to be incomplete, isn't it?
> 
> What kind of backward compatibility should be expected? It could be:
> * short symbols can still be found by old versions of kmod, but not
>   long symbols;

That sounds good. Not everyone is using rust, and with this option
people who do will need to upgrade tooling, and people who don't care
don't need to do anything.

Thanks

Michal


[Bug 216095] sysfs: cannot create duplicate filename '/devices/platform/of-display'

2023-01-17 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=216095

--- Comment #15 from Michal Suchánek (msucha...@suse.de) ---
You do have two outputs defined in the device tree:

/pci@f000/ATY,AlteracParent@10/ATY,Alterac_A@0
/pci@f000/ATY,AlteracParent@10/ATY,Alterac_B@1

Whether they correspond to physical outputs is another question. After all, if
one of the outputs exists only on some card models, it would simply never
detect a connected screen on the cards where the output does not physically
exist, and everything will work just fine.

Here is a patch that aims to resolve the problem
https://lore.kernel.org/lkml/20230117165804.18036-1-msucha...@suse.de/


Re: Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-17 Thread John Paul Adrian Glaubitz

Hi!

On 1/17/23 18:01, Geert Uytterhoeven wrote:

The issue is that some of the parameters are not arrays, but
NULL. E.g.:

arch/sh/kernel/cpu/sh2/setup-sh7619.c:static
DECLARE_INTC_DESC(intc_desc, "sh7619", vectors, NULL,
arch/sh/kernel/cpu/sh2/setup-sh7619.c-   NULL,
prio_registers, NULL);


Isn't this supposed to be caught by this check:

a, __same_type(a, NULL)

?

Adrian

--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-17 Thread Geert Uytterhoeven
Hi Adrian,

On Tue, Jan 17, 2023 at 5:42 PM John Paul Adrian Glaubitz
 wrote:
> On 1/6/23 16:17, Geert Uytterhoeven wrote:
> >> I'm not seeing this one, but I am getting this one instead:
> >>
> >> In file included from ./arch/sh/include/asm/hw_irq.h:6,
> >>from ./include/linux/irq.h:596,
> >>from ./include/asm-generic/hardirq.h:17,
> >>from ./arch/sh/include/asm/hardirq.h:9,
> >>from ./include/linux/hardirq.h:11,
> >>from ./include/linux/interrupt.h:11,
> >>from ./include/linux/serial_core.h:13,
> >>from ./include/linux/serial_sci.h:6,
> >>from arch/sh/kernel/cpu/sh2/setup-sh7619.c:11:
> >> ./include/linux/sh_intc.h:100:63: error: division 'sizeof (void *) / 
> >> sizeof (void)' does not compute the number of array elements 
> >> [-Werror=sizeof-pointer-div]
> >> 100 | #define _INTC_ARRAY(a) a, __same_type(a, NULL) ? 0 : 
> >> sizeof(a)/sizeof(*a)
> >> |   ^
> >> ./include/linux/sh_intc.h:105:31: note: in expansion of macro '_INTC_ARRAY'
> >> 105 | _INTC_ARRAY(vectors), _INTC_ARRAY(groups),  \
> >> |   ^~~
> >
> > The easiest fix for the latter is to disable CONFIG_WERROR.
> > Unfortunately I don't know a simple solution to get rid of the warning.
>
> I did some research and it seems that what the macro _INT_ARRAY() does with 
> "sizeof(a)/sizeof(*a)"
> is a commonly used way to calculate array sizes and the kernel has even its 
> own macro for that
> called ARRAY_SIZE() which Linus asks people to use here [1].
>
> So, I replaced _INTC_ARRAY() with ARRAY_SIZE() (see below), however the 
> kernel's own ARRAY_SIZE()
> macro triggers the same compiler warning. I'm CC'ing Michael Karcher who has 
> more knowledge on
> writing proper C code than me and maybe an idea how to fix this warning.
>
> Thanks,
> Adrian
>
> > [1] https://lkml.org/lkml/2015/9/3/428
>
> diff --git a/include/linux/sh_intc.h b/include/linux/sh_intc.h
> index c255273b0281..07a187686a84 100644
> --- a/include/linux/sh_intc.h
> +++ b/include/linux/sh_intc.h
> @@ -97,14 +97,12 @@ struct intc_hw_desc {
>  unsigned int nr_subgroups;
>   };
>
> -#define _INTC_ARRAY(a) a, __same_type(a, NULL) ? 0 : sizeof(a)/sizeof(*a)
> -
>   #define INTC_HW_DESC(vectors, groups, mask_regs,   \
>   prio_regs, sense_regs, ack_regs)   \
>   {  \
> -   _INTC_ARRAY(vectors), _INTC_ARRAY(groups),  \
> -   _INTC_ARRAY(mask_regs), _INTC_ARRAY(prio_regs), \
> -   _INTC_ARRAY(sense_regs), _INTC_ARRAY(ack_regs), \
> +   ARRAY_SIZE(vectors), ARRAY_SIZE(groups),\
> +   ARRAY_SIZE(mask_regs), ARRAY_SIZE(prio_regs),   \
> +   ARRAY_SIZE(sense_regs), ARRAY_SIZE(ack_regs),   \
>   }

The issue is that some of the parameters are not arrays, but
NULL. E.g.:

arch/sh/kernel/cpu/sh2/setup-sh7619.c:static
DECLARE_INTC_DESC(intc_desc, "sh7619", vectors, NULL,
arch/sh/kernel/cpu/sh2/setup-sh7619.c-   NULL,
prio_registers, NULL);
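
For reference, a small stand-alone reproducer of why the __same_type() check
does not silence the warning: gcc diagnoses the sizeof division in the
never-taken arm of the ?: at compile time. (User-space sketch only, with
__same_type redefined locally; build with gcc -Wall -c.)

#include <stddef.h>

#define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
#define _INTC_NR(a) (__same_type(a, NULL) ? 0 : sizeof(a)/sizeof(*a))

static int vectors[4];

unsigned long nr_ok  = _INTC_NR(vectors); /* fine: 4 */
unsigned long nr_bad = _INTC_NR(NULL);    /* warns: sizeof(void *)/sizeof(void) */
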
--

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH] of: Make of framebuffer devices unique

2023-01-17 Thread Michal Suchanek
Since Linux 5.19 this error is observed:

sysfs: cannot create duplicate filename '/devices/platform/of-display'

This is because multiple devices with the same name 'of-display' are
created on the same bus.

Update the code to create numbered device names for the non-boot
displays.

cc: linuxppc-dev@lists.ozlabs.org
References: https://bugzilla.kernel.org/show_bug.cgi?id=216095
Fixes: 52b1b46c39ae ("of: Create platform devices for OF framebuffers")
Reported-by: Erhard F. 
Suggested-by: Thomas Zimmermann 
Signed-off-by: Michal Suchanek 
---
 drivers/of/platform.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 81c8c227ab6b..f2a5d679a324 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -525,6 +525,7 @@ static int __init of_platform_default_populate_init(void)
if (IS_ENABLED(CONFIG_PPC)) {
struct device_node *boot_display = NULL;
struct platform_device *dev;
+   int display_number = 1;
int ret;
 
/* Check if we have a MacOS display without a node spec */
@@ -561,10 +562,15 @@ static int __init of_platform_default_populate_init(void)
boot_display = node;
break;
}
+
for_each_node_by_type(node, "display") {
+   char buf[14];
if (!of_get_property(node, "linux,opened", NULL) || 
node == boot_display)
continue;
-   of_platform_device_create(node, "of-display", NULL);
+   ret = snprintf(buf, sizeof(buf), "of-display-%d", display_number++);
+   if (ret >= sizeof(buf))
+   continue;
+   of_platform_device_create(node, buf, NULL);
}
 
} else {
-- 
2.35.3



Re: [PATCH v9 00/10] phy: Add support for Lynx 10G SerDes

2023-01-17 Thread Sean Anderson
On 12/29/22 19:01, Sean Anderson wrote:
> This adds support for the Lynx 10G SerDes found on the QorIQ T-series
> and Layerscape series. Due to limited time and hardware, only support
> for the LS1046ARDB and LS1088ARDB is added in this initial series.
> 
> This series is based on phy/next, but it requires phylink support. This
> is already present for the LS1088A, and it was recently added for the
> LS1046A in net-next/master.
> 
> Major reconfiguration of baud rate (e.g. 1G->10G) does not work. From my
> testing, SerDes register settings appear identical. The issue appears to
> be between the PCS and the MAC. The link itself comes up at both ends,
> and a mac loopback succeeds. However, a PCS loopback results in dropped
> packets. Perhaps there is some undocumented register in the PCS?
> 
> I suspect this driver is around 95% complete, but I don't have the
> documentation to make it work completely. At the very least it is useful
> for two cases:
> 
> - Although this is untested, it should support 2.5G SGMII as well as
>   1000BASE-KX. The latter needs MAC and PCS support, but the former
>   should work out of the box.
> - It allows for clock configurations not supported by the RCW. This is
>   very useful if you want to use e.g. SRDS_PRTCL_S1=0x and =0x1133
>   on the same board. This is because the former setting will use PLL1
>   as the 1G reference, but the latter will use PLL1 as the 10G
>   reference. Because we can reconfigure the PLLs, it is possible to
>   always use PLL1 as the 1G reference.
> 
> The final patch in this series depends on [1].
> 
> [1] 
> https://lore.kernel.org/netdev/20221227230918.2440351-1-sean.ander...@seco.com/
> 
> Changes in v9:
> - Add fsl,unused-lanes-reserved to allow for a gradual transition
>   between firmware and Linux control of the SerDes
> - Change phy-type back to fsl,type, as I was getting the error
> '#phy-cells' is a dependency of 'phy-type'
> - Convert some u32s to unsigned long to match arguments
> - Switch from round_rate to determine_rate
> - Drop explicit reference to reference clock
> - Use .parent_names when requesting parents
> - Use devm_clk_hw_get_clk to pass clocks back to serdes
> - Fix indentation
> - Split off clock "driver" into its own patch to allow for better
>   review.
> - Add ability to defer lane initialization to phy_init. This allows
>   for easier transitioning between firmware-managed serdes and Linux-
>   managed serdes, as the consumer (such as dpaa2, which knows what the
>   firmware is doing) has the last say on who gets control.
> - Fix name of phy mode node
> - Add fsl,unused-lanes-reserved to allow a gradual transition, depending
>   on the mac link type.
> - Remove unused clocks
> - Fix some phy mode node names
> 
> Changes in v8:
> - Remove unused variable from lynx_ls_mode_init
> - Rename serdes phy handles to use _A, _B, etc. instead of _0, _1, etc.
>   This should help remind readers that the numbering corresponds to the
>   physical layout of the registers, and not the lane (pin) number.
> - Prevent PCSs from probing as phys
> - Rename serdes phy handles like the LS1046A
> - Add SFP slot binding
> - Fix incorrect lane ordering (it's backwards on the LS1088A just like it is 
> in
>   the LS1046A).
> - Fix duplicated lane 2 (it should have been lane 3).
> - Fix incorrectly-documented value for XFI1.
> - Remove interrupt for aquantia phy. It never fired for whatever reason,
>   preventing the link from coming up.
> - Add GPIOs for QIXIS FPGA.
> - Enable MAC1 PCS
> - Remove si5341 binding
> 
> Changes in v7:
> - Use double quotes everywhere in yaml
> - Break out call order into generic documentation
> - Refuse to switch "major" protocols
> - Update Kconfig to reflect restrictions
> - Remove set/clear of "pcs reset" bit, since it doesn't seem to fix
>   anything.
> 
> Changes in v6:
> - Bump PHY_TYPE_2500BASEX to 13, since PHY_TYPE_USXGMII was added in the
>   meantime
> - fsl,type -> phy-type
> - frequence -> frequency
> - Update MAINTAINERS to include new files
> - Include bitfield.h and slab.h to allow compilation on non-arm64
>   arches.
> - Depend on COMMON_CLK and either layerscape/ppc
> - XGI.9 -> XFI.9
> 
> Changes in v5:
> - Update commit description
> - Dual id header
> - Remove references to PHY_INTERFACE_MODE_1000BASEKX to allow this
>   series to be applied directly to linux/master.
> - Add fsl,lynx-10g.h to MAINTAINERS
> 
> Changes in v4:
> - Add 2500BASE-X and 10GBASE-R phy types
> - Use subnodes to describe lane configuration, instead of describing
>   PCCRs. This is the same style used by phy-cadence-sierra et al.
> - Add ids for Lynx 10g PLLs
> - Rework all debug statements to remove use of __func__. Additional
>   information has been provided as necessary.
> - Consider alternative parent rates in round_rate and not in set_rate.
>   Trying to modify out parent's rate in set_rate will deadlock.
> - Explicitly perform a stop/reset sequence in set_rate. This way we
>   always ensure that the PLL is 

Re: [PATCH] powerpc/rtas: upgrade internal arch spinlocks

2023-01-17 Thread Nathan Lynch
Michael Ellerman  writes:
> Nathan Lynch  writes:
>> Laurent Dufour  writes:
>>> On 10/01/2023 05:42:55, Nathan Lynch wrote:
 --- a/arch/powerpc/include/asm/rtas-types.h
 +++ b/arch/powerpc/include/asm/rtas-types.h
 @@ -18,7 +18,7 @@ struct rtas_t {
unsigned long entry;/* physical address pointer */
unsigned long base; /* physical address pointer */
unsigned long size;
 -  arch_spinlock_t lock;
 +  raw_spinlock_t lock;
struct rtas_args args;
struct device_node *dev;/* virtual address pointer */
  };
 diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
 index deded51a7978..a834726f18e3 100644
 --- a/arch/powerpc/kernel/rtas.c
 +++ b/arch/powerpc/kernel/rtas.c
 @@ -61,7 +61,7 @@ static inline void do_enter_rtas(unsigned long args)
  }
  
  struct rtas_t rtas = {
 -  .lock = __ARCH_SPIN_LOCK_UNLOCKED
 +  .lock = __RAW_SPIN_LOCK_UNLOCKED(rtas.lock),
  };
  EXPORT_SYMBOL(rtas);
>>>
>>> This is not the scope of this patch, but the RTAS's lock is externalized
>>> through the structure rtas_t, while it is only used in that file.
>>>
>>> I think, this would be good, in case of future change about that lock, and
>>> in order to not break KABI, to move it out of that structure, and to define
>>> it statically in that file.
>>
>> Thanks for pointing this out.
>>
>> /* rtas-types.h */
>> struct rtas_t {
>>  unsigned long entry;/* physical address pointer */
>>  unsigned long base; /* physical address pointer */
>>  unsigned long size;
>>  raw_spinlock_t lock;
>>  struct rtas_args args;
>>  struct device_node *dev;/* virtual address pointer */
>> };
>>
>> /* rtas.h */
>> extern struct rtas_t rtas;
>>
>> There's C and asm code outside of rtas.c that accesses rtas.entry,
>> rtas.base, rtas.size, and rtas.dev. But as you say, rtas.lock is used
>> only in rtas.c, and it's hard to imagine any legitimate external
>> use. This applies to the args member as well, since accesses must occur
>> under the lock.
>>
>> Making the lock and args private to rtas.c seems desirable on its own,
>> so I think that should be done first as a cleanup, followed by the
>> riskier arch -> raw lock conversion.
>
> I don't see any reason why `rtas` is exported at all.
>
> There might have been in the past, but I don't see one any more.
>
> So I'd be happy if we removed the EXPORT entirely. If it breaks
> something we can always put it back.

Agreed, I see no accesses of the rtas struct from code that can be built
as a module, and we can introduce nicer accessor functions in the future
if need arises. I will incorporate removal of EXPORT_SYMBOL(rtas).
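
A rough sketch of that cleanup direction (illustrative only, not the actual
series):

/* arch/powerpc/kernel/rtas.c */

/* The lock and args become private to this file ... */
static DEFINE_RAW_SPINLOCK(rtas_lock);
static struct rtas_args rtas_args;

/* ... and drop out of the shared struct, which is no longer exported. */
struct rtas_t rtas = {
	/* entry, base, size, dev are still set up from the device tree */
};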


Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-17 Thread John Paul Adrian Glaubitz

Hi Geert!

On 1/6/23 16:17, Geert Uytterhoeven wrote:

I'm not seeing this one, but I am getting this one instead:

In file included from ./arch/sh/include/asm/hw_irq.h:6,
   from ./include/linux/irq.h:596,
   from ./include/asm-generic/hardirq.h:17,
   from ./arch/sh/include/asm/hardirq.h:9,
   from ./include/linux/hardirq.h:11,
   from ./include/linux/interrupt.h:11,
   from ./include/linux/serial_core.h:13,
   from ./include/linux/serial_sci.h:6,
   from arch/sh/kernel/cpu/sh2/setup-sh7619.c:11:
./include/linux/sh_intc.h:100:63: error: division 'sizeof (void *) / sizeof 
(void)' does not compute the number of array elements 
[-Werror=sizeof-pointer-div]
100 | #define _INTC_ARRAY(a) a, __same_type(a, NULL) ? 0 : 
sizeof(a)/sizeof(*a)
|   ^
./include/linux/sh_intc.h:105:31: note: in expansion of macro '_INTC_ARRAY'
105 | _INTC_ARRAY(vectors), _INTC_ARRAY(groups),  \
|   ^~~


The easiest fix for the latter is to disable CONFIG_WERROR.
Unfortunately I don't know a simple solution to get rid of the warning.


I did some research and it seems that what the macro _INTC_ARRAY() does with
"sizeof(a)/sizeof(*a)" is a commonly used way to calculate array sizes, and the
kernel even has its own macro for that, called ARRAY_SIZE(), which Linus asks
people to use here [1].

So I replaced _INTC_ARRAY() with ARRAY_SIZE() (see below); however, the
kernel's own ARRAY_SIZE() macro triggers the same compiler warning. I'm CC'ing
Michael Karcher, who has more knowledge of writing proper C code than me and
maybe an idea how to fix this warning.

Thanks,
Adrian


[1] https://lkml.org/lkml/2015/9/3/428


diff --git a/include/linux/sh_intc.h b/include/linux/sh_intc.h
index c255273b0281..07a187686a84 100644
--- a/include/linux/sh_intc.h
+++ b/include/linux/sh_intc.h
@@ -97,14 +97,12 @@ struct intc_hw_desc {
unsigned int nr_subgroups;
 };
 
-#define _INTC_ARRAY(a) a, __same_type(a, NULL) ? 0 : sizeof(a)/sizeof(*a)
-
 #define INTC_HW_DESC(vectors, groups, mask_regs,   \
 prio_regs, sense_regs, ack_regs)   \
 {  \
-   _INTC_ARRAY(vectors), _INTC_ARRAY(groups),  \
-   _INTC_ARRAY(mask_regs), _INTC_ARRAY(prio_regs), \
-   _INTC_ARRAY(sense_regs), _INTC_ARRAY(ack_regs), \
+   ARRAY_SIZE(vectors), ARRAY_SIZE(groups),\
+   ARRAY_SIZE(mask_regs), ARRAY_SIZE(prio_regs),   \
+   ARRAY_SIZE(sense_regs), ARRAY_SIZE(ack_regs),   \
 }
 
 struct intc_desc {


--
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: [PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:34, Suren Baghdasaryan wrote:
> call_rcu() can take a long time when callback offloading is enabled.
> Its use in the vm_area_free can cause regressions in the exit path when
> multiple VMAs are being freed.

What kind of regressions?

> To minimize that impact, place VMAs into
> a list and free them in groups using one call_rcu() call per group.

Please add some data to justify this additional complexity.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:23, Suren Baghdasaryan wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.

Could you describe why only anonymous vmas are handled at this stage and
what (roughly) has to be done to support other vmas? lock_vma_under_rcu
doesn't seem to have any anonymous vma specific requirements AFAICS.

Also isn't lock_vma_under_rcu effectively find_read_lock_vma? Not that
the naming is really the most important part but the rcu locking is
internal to the function so why should we spread this implementation
detail to the world...

> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h |  3 +++
>  mm/memory.c| 51 ++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c464fc8a514c..d0fddf6a1de9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
> vm_area_struct *vma)
> vma);
>  }
>  
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +   unsigned long address);
> +
>  #else /* CONFIG_PER_VMA_LOCK */
>  
>  static inline void vma_init_lock(struct vm_area_struct *vma) {}
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ece18548db1..a658e26d965d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
> unsigned long address,
>  }
>  EXPORT_SYMBOL_GPL(handle_mm_fault);
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +/*
> + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to 
> be
> + * stable and not isolated. If the VMA is not found or is being modified the
> + * function returns NULL.
> + */
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> +   unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma, *validate;
> +
> + rcu_read_lock();
> + vma = mas_walk(&mas);
> +retry:
> + if (!vma)
> + goto inval;
> +
> + /* Only anonymous vmas are supported for now */
> + if (!vma_is_anonymous(vma))
> + goto inval;
> +
> + if (!vma_read_trylock(vma))
> + goto inval;
> +
> + /* Check since vm_start/vm_end might change before we lock the VMA */
> + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> + vma_read_unlock(vma);
> + goto inval;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + mas.index = address;
> + validate = mas_walk(&mas);
> + if (validate != vma) {
> + vma_read_unlock(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one. */
> + vma = validate;
> + goto retry;
> + }
> +
> + rcu_read_unlock();
> + return vma;
> +inval:
> + rcu_read_unlock();
> + count_vm_vma_lock_event(VMA_LOCK_ABORT);
> + return NULL;
> +}
> +#endif /* CONFIG_PER_VMA_LOCK */
> +
>  #ifndef __PAGETABLE_P4D_FOLDED
>  /*
>   * Allocate p4d page table.
> -- 
> 2.39.0
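
For context, roughly how an arch page fault handler is expected to call this,
going by the later patches in the series (sketch only; flag and label names
are approximate):

	/* before falling back to mmap_lock */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		fault = handle_mm_fault(vma, address,
					flags | FAULT_FLAG_VMA_LOCK, regs);
		vma_read_unlock(vma);
		if (!(fault & VM_FAULT_RETRY))
			goto done;	/* handled without taking mmap_lock */
	}
	/* otherwise take mmap_lock and retry the usual way */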

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:21, Suren Baghdasaryan wrote:
> Assert there are no holders of VMA lock for reading when it is about to be
> destroyed.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h | 8 
>  kernel/fork.c  | 2 ++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 594e835bad9c..c464fc8a514c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
> vm_area_struct *vma)
>   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
> vma);
>  }
>  
> +static inline void vma_assert_no_reader(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
> +   vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
> +   vma);

Do we really need to check for vm_lock_seq? rwsem_is_locked should tell
us something is wrong on its own, no? This could be somebody racing with
the vma destruction and using the write lock. Unlikely but I do not see
why to narrow debugging scope.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> Protect VMA from concurrent page fault handler while collapsing a huge
> page. Page fault handler needs a stable PMD to use PTL and relies on
> per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> not be detected by a page fault handler without proper locking.

I am struggling with this changelog, maybe because my recollection of
the THP collapsing subtleties is weak. But aren't you just trying to say
that #PF handling and THP collapsing currently need to be mutually
exclusive, so in order to keep that assumption you have to mark
the vma write-locked?

Also it is not really clear to me how that handles other vmas which can
share the same thp?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:12, Suren Baghdasaryan wrote:
> Move VMA flag modification (which now implies VMA locking) before
> anon_vma_lock_write to match the locking order of page fault handler.

Does this changelog assumes per vma locking in the #PF?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 16:09:03, Michal Hocko wrote:
> On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> > To keep vma locking correctness when vm_flags are modified, add modifier
> > functions to be used whenever flags are updated.
> > 
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  include/linux/mm.h   | 38 ++
> >  include/linux/mm_types.h |  8 +++-
> >  2 files changed, 45 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ec2c4c227d51..35cf0a6cbcc2 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct 
> > *vma, struct mm_struct *mm)
> > vma_init_lock(vma);
> >  }
> >  
> > +/* Use when VMA is not part of the VMA tree and needs no locking */
> > +static inline
> > +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   WRITE_ONCE(vma->vm_flags, flags);
> > +}
> 
> Why do we need WRITE_ONCE here? Isn't vma invisible during its
> initialization?
> 
> > +
> > +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> > +static inline
> > +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   init_vm_flags(vma, flags);
> > +}
> > +
> > +static inline
> > +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags |= flags;
> > +}
> > +
> > +static inline
> > +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags &= ~flags;
> > +}
> > +
> > +static inline
> > +void mod_vm_flags(struct vm_area_struct *vma,
> > + unsigned long set, unsigned long clear)
> > +{
> > +   vma_write_lock(vma);
> > +   vma->vm_flags |= set;
> > +   vma->vm_flags &= ~clear;
> > +}
> > +
> 
> This is a rather unusual pattern. There is no note about the locking
> involved in the naming, and also why is the locking part of this interface
> in the first place? I can see a reason for access functions to actually
> check lock asserts.

OK, it took me a while but it is clear to me now. The confusion comes
from the naming: vma_write_lock is not a lock in the usual sense. It is
more of a vma_mark_modified with side effects on read locking, which is a
real lock. With that it makes more sense to have this done in these
helpers rather than requiring all users to keep this subtlety in mind.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
On Tue 17-01-23 16:04:26, Michal Hocko wrote:
> On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > instead of mmap_lock. Because there are cases when multiple VMAs need
> > to be exclusively locked during VMA tree modifications, instead of the
> > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > locked.
> 
> I have to say I was struggling a bit with the above and only understood
> what you mean by reading the patch several times. I would phrase it like
> this (feel free to use if you consider this to be an improvement).
> 
> Introduce a per-VMA rw_semaphore. The lock implementation relies on
> per-vma and per-mm sequence counters to note exclusive locking:
> - read lock - (implemented by vma_read_trylock) requires the
>   vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
>   differ. If they match then there must be a vma exclusive lock
>   held somewhere.
> - read unlock - (implemented by vma_read_unlock) is a trivial
>   vma->lock unlock.
> - write lock - (vma_write_lock) requires the mmap_lock to be
>   held exclusively and the current mm counter is noted to the vma
>   side. This will allow multiple vmas to be locked under a single
>   mmap_lock write lock (e.g. during vma merging). The vma counter
>   is modified under exclusive vma lock.

I didn't realize one more thing.
Unlike a standard write lock, this implementation allows vma_write_lock
to be called multiple times under a single mmap_lock. In a sense
it is more of a mark_vma_potentially_modified than a lock.

> - write unlock - (vma_write_unlock_mm) is a batch release of all
>   vma locks held. It doesn't pair with a specific
>   vma_write_lock! It is done before exclusive mmap_lock is
>   released by incrementing mm sequence counter (mm_lock_seq).
>   - write downgrade - if the mmap_lock is downgraded to the read
> lock all vma write locks are released as well (effectively
> same as write unlock).
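
To illustrate the batching described above, an editor's sketch of the
intended usage (helper names are from this patch; per the description,
dropping the mmap_lock performs the batch release via vma_write_unlock_mm):

static void modify_two_vmas(struct mm_struct *mm,
			    struct vm_area_struct *vma1,
			    struct vm_area_struct *vma2)
{
	mmap_write_lock(mm);
	vma_write_lock(vma1);	/* mark vma1 as being modified */
	vma_write_lock(vma2);	/* several vmas under one mmap_lock ... */
	vma_write_lock(vma1);	/* ... and repeated calls are fine */
	/* ... modify the vmas / the vma tree here ... */
	mmap_write_unlock(mm);	/* all vma write "locks" released in one go */
}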
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:08, Suren Baghdasaryan wrote:
> To keep vma locking correctness when vm_flags are modified, add modifier
> functions to be used whenever flags are updated.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h   | 38 ++
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ec2c4c227d51..35cf0a6cbcc2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   vma_init_lock(vma);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline
> +void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + WRITE_ONCE(vma->vm_flags, flags);
> +}

Why do we need WRITE_ONCE here? Isn't vma invisible during its
initialization?

> +
> +/* Use when VMA is part of the VMA tree and needs appropriate locking */
> +static inline
> +void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline
> +void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline
> +void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline
> +void mod_vm_flags(struct vm_area_struct *vma,
> +   unsigned long set, unsigned long clear)
> +{
> + vma_write_lock(vma);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +

This is a rather unusual pattern. There is no note about the locking
involved in the naming, and also why is the locking part of this interface
in the first place? I can see a reason for access functions to actually
check lock asserts.
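
For illustration of the alternative hinted at above, an editor's sketch
that keeps the locking at call sites and only asserts it in the accessor
(vma_assert_write_locked is from this series; the function name is made up):

static inline void set_vm_flags_asserted(struct vm_area_struct *vma,
					 unsigned long flags)
{
	/* Caller is expected to already hold the vma write lock. */
	vma_assert_write_locked(vma);
	vma->vm_flags |= flags;
}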
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 5986817f393c..c026d75108b3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -474,6 +474,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
> *orig)
>*/
>   *new = data_race(*orig);
>   INIT_LIST_HEAD(&new->anon_vma_chain);
> + vma_init_lock(new);
>   dup_anon_vma_name(orig, new);
>   }
>   return new;
> @@ -1145,6 +1146,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
> struct task_struct *p,
>   seqcount_init(&mm->write_protect_seq);
>   mmap_init_lock(mm);
>   INIT_LIST_HEAD(&mm->mmlist);
> +#ifdef CONFIG_PER_VMA_LOCK
> + WRITE_ONCE(mm->mm_lock_seq, 0);
> +#endif

The mm shouldn't be visible yet, so why WRITE_ONCE?
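
For illustration, an editor's sketch of the plain initialization the
question implies (assuming the new mm is indeed not yet reachable by other
threads at this point; mm_lock_seq is the field added by this series and
the helper name is made up):

static inline void mm_lock_seq_init(struct mm_struct *mm)
{
#ifdef CONFIG_PER_VMA_LOCK
	mm->mm_lock_seq = 0;	/* plain store; no concurrent readers yet */
#endif
}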

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.

I have to say I was struggling a bit with the above and only understood
what you mean by reading the patch several times. I would phrase it like
this (feel free to use if you consider this to be an improvement).

Introduce a per-VMA rw_semaphore. The lock implementation relies on
per-vma and per-mm sequence counters to note exclusive locking:
- read lock - (implemented by vma_read_trylock) requires the
  vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to
  differ. If they match then there must be a vma exclusive lock
  held somewhere.
- read unlock - (implemented by vma_read_unlock) is a trivial
  vma->lock unlock.
- write lock - (vma_write_lock) requires the mmap_lock to be
  held exclusively and the current mm counter is noted to the vma
  side. This will allow multiple vmas to be locked under a single
  mmap_lock write lock (e.g. during vma merging). The vma counter
  is modified under exclusive vma lock.
- write unlock - (vma_write_unlock_mm) is a batch release of all
  vma locks held. It doesn't pair with a specific
  vma_write_lock! It is done before exclusive mmap_lock is
  released by incrementing mm sequence counter (mm_lock_seq).
- write downgrade - if the mmap_lock is downgraded to the read
  lock all vma write locks are released as well (effectively
  same as write unlock).

> VMA lock is placed on the cache line boundary so that its 'count' field
> falls into the first cache line while the rest of the fields fall into
> the second cache line. This lets the 'count' field to be cached with
> other frequently accessed fields and used quickly in uncontended case
> while 'owner' and other fields used in the contended case will not
> invalidate the first cache line while waiting on the lock.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  include/linux/mm.h| 80 +++
>  include/linux/mm_types.h  |  8 
>  include/linux/mmap_lock.h | 13 +++
>  kernel/fork.c |  4 ++
>  mm/init-mm.c  |  3 ++
>  5 files changed, 108 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..ec2c4c227d51 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -612,6 +612,85 @@ struct vm_operations_struct {
> unsigned long addr);
>  };
>  
> +#ifdef CONFIG_PER_VMA_LOCK
> +static inline void vma_init_lock(struct vm_area_struct *vma)
> +{
> + init_rwsem(&vma->lock);
> + vma->vm_lock_seq = -1;
> +}
> +
> +static inline void vma_write_lock(struct vm_area_struct *vma)
> +{
> + int mm_lock_seq;
> +
> + mmap_assert_write_locked(vma->vm_mm);
> +
> + /*
> +  * current task is holding mmap_write_lock, both vma->vm_lock_seq and
> +  * mm->mm_lock_seq can't be concurrently modified.
> +  */
> + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
> + if (vma->vm_lock_seq == mm_lock_seq)
> + return;
> +
> + down_write(&vma->lock);
> + vma->vm_lock_seq = mm_lock_seq;
> + up_write(&vma->lock);
> +}
> +
> +/*
> + * Try to read-lock a vma. The function is allowed to occasionally yield 
> false
> + * locked result to avoid performance overhead, in which case we fall back to
> + * using mmap_lock. The function should never yield false unlocked result.
> + */
> +static inline bool vma_read_trylock(struct vm_area_struct *vma)
> +{
> + /* Check before locking. A race might cause false locked result. */
> + if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
> + return false;
> +
> + if (unlikely(down_read_trylock(&vma->lock) == 0))
> + return false;
> +
> + /*
> +  * Overflow might produce false locked result.
> +  * False unlocked result is impossible because we modify and check
> +  * vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
> +  * modification invalidates all existing locks.
> +  */
> + if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
> + up_read(&vma->lock);
> + return false;
> + }
> + return true;
> +}
> +
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> + up_read(&vma->lock);
> +}
> +
> +static inline void vma_assert_write_locked(struct 

Re: [PATCH v3 01/10] dt-bindings: soc: fsl: cpm_qe: Add TSA controller

2023-01-17 Thread Rob Herring
On Fri, Jan 13, 2023 at 11:37:50AM +0100, Herve Codina wrote:
> Add support for the time slot assigner (TSA)
> available in some PowerQUICC SoC such as MPC885
> or MPC866.

An odd line wrap length... 

> 
> Signed-off-by: Herve Codina 
> ---
>  .../bindings/soc/fsl/cpm_qe/fsl,tsa.yaml  | 260 ++
>  include/dt-bindings/soc/fsl,tsa.h |  13 +
>  2 files changed, 273 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,tsa.yaml
>  create mode 100644 include/dt-bindings/soc/fsl,tsa.h
> 
> diff --git a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,tsa.yaml 
> b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,tsa.yaml
> new file mode 100644
> index ..eb17b6119abd
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,tsa.yaml
> @@ -0,0 +1,260 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/soc/fsl/cpm_qe/fsl,tsa.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: PowerQUICC CPM Time-slot assigner (TSA) controller
> +
> +maintainers:
> +  - Herve Codina 
> +
> +description: |

Don't need '|' if no formatting.

> +  The TSA is the time-slot assigner that can be found on some
> +  PowerQUICC SoC.
> +  Its purpose is to route some TDM time-slots to other internal
> +  serial controllers.

Wrap at 80.

> +
> +properties:
> +  compatible:
> +items:
> +  - enum:
> +  - fsl,mpc885-tsa
> +  - fsl,mpc866-tsa
> +  - const: fsl,cpm1-tsa
> +
> +  reg:
> +items:
> +  - description: SI (Serial Interface) register base
> +  - description: SI RAM base
> +
> +  reg-names:
> +items:
> +  - const: si_regs
> +  - const: si_ram
> +
> +  '#address-cells':
> +const: 1
> +
> +  '#size-cells':
> +const: 0
> +
> +patternProperties:
> +  '^tdm@[0-1]$':
> +description:
> +  The TDM managed by this controller
> +type: object

   additionalProperties: false

> +
> +properties:
> +  reg:
> +minimum: 0
> +maximum: 1
> +description:
> +  The TDM number for this TDM, 0 for TDMa and 1 for TDMb
> +
> +  fsl,common-rxtx-pins:
> +$ref: /schemas/types.yaml#/definitions/flag
> +description:
> +  The hardware can use four dedicated pins for Tx clock,
> +  Tx sync, Rx clock and Rx sync or use only two pins,
> +  Tx/Rx clock and Rx/Rx sync.
> +  Without the 'fsl,common-rxtx-pins' property, the four
> +  pins are used. With the 'fsl,common-rxtx-pins' property,
> +  two pins are used.
> +
> +  clocks:
> +minItems: 2
> +maxItems: 4
> +
> +  clock-names:
> +minItems: 2
> +maxItems: 4
> +
> +  fsl,mode:

'mode' is a bit vague. It's also already used elsewhere, which can be a
problem if there are differing types. (There aren't in this case.)

> +$ref: /schemas/types.yaml#/definitions/string
> +enum: [normal, echo, internal-loopback, control-loopback]
> +default: normal
> +description: |
> +  Operational mode:
> +- normal:
> +Normal operation
> +- echo:
> +Automatic echo. Rx data is resent on Tx
> +- internal-loopback:
> +The TDM transmitter is connected to the receiver.
> +Data appears on Tx pin.
> +- control-loopback:
> +The TDM transmitter is connected to the receiver.
> +The Tx pin is disconnected.
> +
> +  fsl,rx-frame-sync-delay-bits:
> +enum: [0, 1, 2, 3]
> +default: 0
> +description: |
> +  Receive frame sync delay in number of bits.
> +  Indicates the delay between the Rx sync and the first bit of the
> +  Rx frame. 0 for no bit delay. 1, 2 or 3 for 1, 2 or 3 bits delay.
> +
> +  fsl,tx-frame-sync-delay-bits:
> +enum: [0, 1, 2, 3]
> +default: 0
> +description: |
> +  Transmit frame sync delay in number of bits.
> +  Indicates the delay between the Tx sync and the first bit of the
> +  Tx frame. 0 for no bit delay. 1, 2 or 3 for 1, 2 or 3 bits delay.
> +
> +  fsl,clock-falling-edge:
> +$ref: /schemas/types.yaml#/definitions/flag
> +description: |
> +  Data is sent on falling edge of the clock (and received on the
> +  rising edge).
> +  If 'clock-falling-edge' is not present, data is sent on the
> +  rising edge (and received on the falling edge).
> +
> +  fsl,fsync-rising-edge:
> +$ref: /schemas/types.yaml#/definitions/flag
> +description:
> +  Frame sync pulses are sampled with the rising edge of the channel
> +  clock. If 'fsync-rising-edge' is not present, pulses are sampled
> +  with the falling edge.
> +
> +  fsl,double-speed-clock:
> +

Re: [PASEMI] Nemo board doesn't reboot anymore after the commit "HID: usbhid: Add ALWAYS_POLL quirk for some mice"

2023-01-17 Thread Jiri Kosina
On Tue, 17 Jan 2023, Christian Zigotzky wrote:

> >> The reboot issue is still present in the RC2 of kernel 6.2. We still need
> >> the
> >> usbhid.patch. [1]
> >>
> >> Please check the bad commit. [2]
> > Ankit,
> >
> > have you tested with all the devices that you added the quirk for in your
> > original patch?
> >
> > Unless I hear otherwise, I will just drop
> > the quirk for USB_DEVICE_ID_CHERRY_MOUSE_000C before this gets clarified.
> >
> > Thanks,
> >
> The issue also affects the RC4.

Given the lack of input from Ankit, I have just queued the patch below in 
for-6.2/upstream-fixes.


From: Jiri Kosina 
Subject: [PATCH] HID: revert CHERRY_MOUSE_000C quirk

This partially reverts commit f6d910a89a2391 ("HID: usbhid: Add ALWAYS_POLL
quirk for some mice"), as it turns out to break reboot on some platforms for
reasons yet to be understood.

Fixes: f6d910a89a2391 ("HID: usbhid: Add ALWAYS_POLL quirk for some mice")
Reported-by: Christian Zigotzky 
Signed-off-by: Jiri Kosina 
---
 drivers/hid/hid-ids.h| 1 -
 drivers/hid/hid-quirks.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h
index 82713ef3aaa6..c3735848ed5d 100644
--- a/drivers/hid/hid-ids.h
+++ b/drivers/hid/hid-ids.h
@@ -274,7 +274,6 @@
 #define USB_DEVICE_ID_CH_AXIS_295  0x001c
 
 #define USB_VENDOR_ID_CHERRY   0x046a
-#define USB_DEVICE_ID_CHERRY_MOUSE_000C	0x000c
 #define USB_DEVICE_ID_CHERRY_CYMOTION  0x0023
#define USB_DEVICE_ID_CHERRY_CYMOTION_SOLAR	0x0027
 
diff --git a/drivers/hid/hid-quirks.c b/drivers/hid/hid-quirks.c
index 0e9702c7f7d6..be3ad02573de 100644
--- a/drivers/hid/hid-quirks.c
+++ b/drivers/hid/hid-quirks.c
@@ -54,7 +54,6 @@ static const struct hid_device_id hid_quirks[] = {
{ HID_USB_DEVICE(USB_VENDOR_ID_CH, USB_DEVICE_ID_CH_FLIGHT_SIM_YOKE), 
HID_QUIRK_NOGET },
{ HID_USB_DEVICE(USB_VENDOR_ID_CH, USB_DEVICE_ID_CH_PRO_PEDALS), 
HID_QUIRK_NOGET },
{ HID_USB_DEVICE(USB_VENDOR_ID_CH, USB_DEVICE_ID_CH_PRO_THROTTLE), 
HID_QUIRK_NOGET },
-   { HID_USB_DEVICE(USB_VENDOR_ID_CHERRY, 
USB_DEVICE_ID_CHERRY_MOUSE_000C), HID_QUIRK_ALWAYS_POLL },
{ HID_USB_DEVICE(USB_VENDOR_ID_CORSAIR, USB_DEVICE_ID_CORSAIR_K65RGB), 
HID_QUIRK_NO_INIT_REPORTS },
{ HID_USB_DEVICE(USB_VENDOR_ID_CORSAIR, 
USB_DEVICE_ID_CORSAIR_K65RGB_RAPIDFIRE), HID_QUIRK_NO_INIT_REPORTS | 
HID_QUIRK_ALWAYS_POLL },
{ HID_USB_DEVICE(USB_VENDOR_ID_CORSAIR, USB_DEVICE_ID_CORSAIR_K70RGB), 
HID_QUIRK_NO_INIT_REPORTS },

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 09/41] mm: rcu safe VMA freeing

2023-01-17 Thread Michal Hocko
On Mon 09-01-23 12:53:04, Suren Baghdasaryan wrote:
[...]
>  void vm_area_free(struct vm_area_struct *vma)
>  {
>   free_anon_vma_name(vma);
> +#ifdef CONFIG_PER_VMA_LOCK
> + call_rcu(&vma->vm_rcu, __vm_area_free);
> +#else
>   kmem_cache_free(vm_area_cachep, vma);
> +#endif

Is it safe to have a vma with an already freed vma_name? I suspect this is
safe because of mmap_lock, but is there any reason to split the freeing
process and have this potential UAF lurking?

>  }
>  
>  static void account_kernel_stack(struct task_struct *tsk, int account)
> -- 
> 2.39.0
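
For illustration, one way to avoid the window described above would be to
free the anon name from the RCU callback as well. An editor's sketch of a
modified __vm_area_free under that assumption (not the patch author's code;
vm_rcu is the field added by this series):

static void __vm_area_free(struct rcu_head *head)
{
	struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
						  vm_rcu);

	/* Free the anon name together with the vma so a not-yet-freed vma
	 * never points at an already freed name. */
	free_anon_vma_name(vma);
	kmem_cache_free(vm_area_cachep, vma);
}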

-- 
Michal Hocko
SUSE Labs

