Re: [Skiboot] [PATCH 1/3] core/device: Add function to return child node using name and length

2023-01-09 Thread Athira Rajeev



> On 09-Jan-2023, at 8:59 PM, Dan Horák  wrote:
> 
> Hi Athira,
> 
> On Thu, 5 Jan 2023 12:41:33 +0530
> Athira Rajeev  wrote:
> 
>> 
>> 
>>> On 05-Jan-2023, at 12:35 PM, Madhavan Srinivasan  
>>> wrote:
>>> 
>>> 
>>> On Mon,  2 Jan 2023 08:45:22 +0530
>>> Athira Rajeev  wrote:
>>> 
 Add a function dt_find_by_name_len() that returns the child node if
 it matches the first "n" characters of a given name, otherwise NULL.
 This is helpful for nodes with names like "name@addr". In
 scenarios where nodes are added in "name@addr" format and the
 value of "addr" is not known, such a node can't be matched by full
 name or address. Matching on a leading substring of the node name
 returns the expected result. The patch adds dt_find_by_name_len()
 and a testcase for it in core/test/run-device.c.
>>> 
>>> wouldn't it be better to automatically compare the name up to the "@"
>>> character in the node name when searching for the match instead of
>>> having to hard-code the lengths? I think it should be good enough for
>>> the use case described above.
>>> 
>>> something like
>>> ...
>>> pos = strchr(child->name, '@');
>>> if (!strncmp(child->name, name, pos - child->name))
>>> ...
>>> 
>>> 
>>> Dan
>> 
>> Hi Dan,
>> 
>> Thanks for checking the patch.
>> 
>> Comparing up to "@" while searching for the match would restrict the
>> search to patterns containing "@".
>> Having dt_find_by_name_len() take a length keeps it useful as a generic
>> substring search across different patterns.
>> So we preferred a length parameter over hard-coding the separator character.
>> 
>> Please let us know your thoughts.
> 
> I understand the presented solution is a pretty generic one, but I think
> the question is whether the added complexity brings the benefits
> compared to the simpler "separator char" solution.
> 
> And thinking more about the generic "length" approach, it might
> produce some false positives. Imagine nodes abc@1 and abcd@2, and you
> are looking for "abc". A search for (abc, 3) will also match the
> "abcd" one. And if the search string always contains the "@"
> character, then specifying the length is not required. I believe the
> length parameter is redundant altogether, because it can be derived
> from the search string, and the new function would be something like
> "dt_find_by_name_substr()".
> 
> 
>   With regards,
> 
>   Dan


Hi Dan,

Thanks for the response. It makes sense to name the new function
"dt_find_by_name_substr" and compare up to "@".
I will rework the changes and post a v2 for this.

Thanks
Athira

> 
>> Thanks
>> Athira
>> 
>>> 
 Signed-off-by: Athira Rajeev 
 ---
 core/device.c  | 20 
 core/test/run-device.c | 11 +++
 include/device.h   |  4 
 3 files changed, 35 insertions(+)
 diff --git a/core/device.c b/core/device.c
 index 2de37c74..72c54e85 100644
 --- a/core/device.c
 +++ b/core/device.c
 @@ -395,6 +395,26 @@ struct dt_node *dt_find_by_name(struct dt_node *root, 
 const char *name)
 }
 +struct dt_node *dt_find_by_name_len(struct dt_node *root,
 +  const char *name, int len)
 +{
 +  struct dt_node *child, *match;
 +
 +  if (len <= 0)
 +  return NULL;
 +
 +  list_for_each(&root->children, child, list) {
 +  if (!strncmp(child->name, name, len))
 +  return child;
 +
 +  match = dt_find_by_name_len(child, name, len);
 +  if (match)
 +  return match;
 +  }
 +
 +  return NULL;
 +}
 +
 struct dt_node *dt_new_check(struct dt_node *parent, const char *name)
 {
struct dt_node *node = dt_find_by_name(parent, name);
 diff --git a/core/test/run-device.c b/core/test/run-device.c
 index 4a12382b..8c552103 100644
 --- a/core/test/run-device.c
 +++ b/core/test/run-device.c
 @@ -466,6 +466,17 @@ int main(void)
new_prop_ph = dt_prop_get_u32(ut2, "something");
assert(!(new_prop_ph == ev1_ph));
dt_free(subtree);
 +
 +  /* Test dt_find_by_name_len */
 +  root = dt_new_root("");
 +  addr1 = dt_new_addr(root, "node", 0x1);
 +  addr2 = dt_new_addr(root, "node0_1", 0x2);
 +  assert(dt_find_by_name(root, "node@1") == addr1);
 +  assert(dt_find_by_name(root, "node0_1@2") == addr2);
 +  assert(dt_find_by_name_len(root, "node@", 5) == addr1);
 +  assert(dt_find_by_name_len(root, "node0_1@", 8) == addr2);
 +  dt_free(root);
 +
return 0;
 }
 diff --git a/include/device.h b/include/device.h
 index 93fb90ff..f5e0fb79 100644
 --- a/include/device.h
 +++ b/include/device.h
 @@ -184,6 +184,10 @@ struct dt_node *dt_find_by_path(struct dt_node *root, 
 const char *path);
 /* Find a child node by name */
 struct dt_node *dt_find_by_name(struct dt_node 

[PATCH] powerpc/rtas: upgrade internal arch spinlocks

2023-01-09 Thread Nathan Lynch
At the time commit f97bb36f705d ("powerpc/rtas: Turn rtas lock into a
raw spinlock") was written, the spinlock lockup detection code called
__delay(), which will not make progress if the timebase is not
advancing. Since the interprocessor timebase synchronization sequence
for chrp, cell, and some now-unsupported Power models can temporarily
freeze the timebase through an RTAS function (freeze-time-base), the
lock that serializes most RTAS calls was converted to arch_spinlock_t
to prevent kernel hangs in the lockup detection code.

However, commit bc88c10d7e69 ("locking/spinlock/debug: Remove spinlock
lockup detection code") removed that inconvenient property from the
lock debug code several years ago. So now it should be safe to
reintroduce generic locks into the RTAS support code, primarily to
increase lockdep coverage.

Making rtas.lock a spinlock_t would violate lock type nesting rules
because it can be acquired while holding raw locks, e.g. pci_lock and
irq_desc->lock. So convert it to raw_spinlock_t. There's no apparent
reason not to upgrade timebase_lock as well.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/include/asm/rtas-types.h |  2 +-
 arch/powerpc/kernel/rtas.c| 52 ---
 2 files changed, 15 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas-types.h 
b/arch/powerpc/include/asm/rtas-types.h
index 8df6235d64d1..a58f96eb2d19 100644
--- a/arch/powerpc/include/asm/rtas-types.h
+++ b/arch/powerpc/include/asm/rtas-types.h
@@ -18,7 +18,7 @@ struct rtas_t {
unsigned long entry;/* physical address pointer */
unsigned long base; /* physical address pointer */
unsigned long size;
-   arch_spinlock_t lock;
+   raw_spinlock_t lock;
struct rtas_args args;
struct device_node *dev;/* virtual address pointer */
 };
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index deded51a7978..a834726f18e3 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -61,7 +61,7 @@ static inline void do_enter_rtas(unsigned long args)
 }
 
 struct rtas_t rtas = {
-   .lock = __ARCH_SPIN_LOCK_UNLOCKED
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(rtas.lock),
 };
 EXPORT_SYMBOL(rtas);
 
@@ -80,28 +80,6 @@ unsigned long rtas_rmo_buf;
 void (*rtas_flash_term_hook)(int);
 EXPORT_SYMBOL(rtas_flash_term_hook);
 
-/* RTAS use home made raw locking instead of spin_lock_irqsave
- * because those can be called from within really nasty contexts
- * such as having the timebase stopped which would lockup with
- * normal locks and spinlock debugging enabled
- */
-static unsigned long lock_rtas(void)
-{
-   unsigned long flags;
-
-   local_irq_save(flags);
-   preempt_disable();
-   arch_spin_lock(&rtas.lock);
-   return flags;
-}
-
-static void unlock_rtas(unsigned long flags)
-{
-   arch_spin_unlock(&rtas.lock);
-   local_irq_restore(flags);
-   preempt_enable();
-}
-
 /*
  * call_rtas_display_status and call_rtas_display_status_delay
  * are designed only for very early low-level debugging, which
@@ -109,14 +87,14 @@ static void unlock_rtas(unsigned long flags)
  */
 static void call_rtas_display_status(unsigned char c)
 {
-   unsigned long s;
+   unsigned long flags;
 
if (!rtas.base)
return;
 
-   s = lock_rtas();
+   raw_spin_lock_irqsave(&rtas.lock, flags);
rtas_call_unlocked(&rtas.args, 10, 1, 1, NULL, c);
-   unlock_rtas(s);
+   raw_spin_unlock_irqrestore(&rtas.lock, flags);
 }
 
 static void call_rtas_display_status_delay(char c)
@@ -534,7 +512,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
 {
va_list list;
int i;
-   unsigned long s;
+   unsigned long flags;
struct rtas_args *rtas_args;
char *buff_copy = NULL;
int ret;
@@ -557,8 +535,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
return -1;
}
 
-   s = lock_rtas();
-
+   raw_spin_lock_irqsave(&rtas.lock, flags);
/* We use the global rtas args buffer */
rtas_args = &rtas.args;
 
@@ -576,7 +553,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
outputs[i] = be32_to_cpu(rtas_args->rets[i+1]);
ret = (nret > 0)? be32_to_cpu(rtas_args->rets[0]): 0;
 
-   unlock_rtas(s);
+   raw_spin_unlock_irqrestore(&rtas.lock, flags);
 
if (buff_copy) {
log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
@@ -1268,7 +1245,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 
buff_copy = get_errorlog_buffer();
 
-   flags = lock_rtas();
+   raw_spin_lock_irqsave(&rtas.lock, flags);
 
rtas.args = args;
do_enter_rtas(__pa(&rtas.args));
@@ -1279,7 +1256,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
if (be32_to_cpu(args.rets[0]) == -1)
errbuf = __fetch_rtas_last_error(buff_copy);
 
-   unlock_rtas(flags);
+   raw_spin_unlock_irqrestore(&rtas.lock, flags);

[PATCH] powerpc/pseries: drop RTAS-based timebase synchronization

2023-01-09 Thread Nathan Lynch
The pseries platform has been LPAR-only for several generations, and
the PAPR spec:

* Guarantees that timebase synchronization is performed by
  the platform ("The timebase registers are synchronized by the
  platform before CPUs are given to the OS" - 7.3.8 SMP Support).

* Completely omits the freeze-time-base and thaw-time-base RTAS
  functions, which are CHRP artifacts.

This code is effectively unused on currently supported models, so drop
it.

Signed-off-by: Nathan Lynch 
---
 arch/powerpc/platforms/pseries/smp.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index fd2174edfa1d..2bcfee86ff87 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -278,11 +278,5 @@ void __init smp_init_pseries(void)
cpumask_clear_cpu(boot_cpuid, of_spin_mask);
}
 
-   /* Non-lpar has additional take/give timebase */
-   if (rtas_token("freeze-time-base") != RTAS_UNKNOWN_SERVICE) {
-   smp_ops->give_timebase = rtas_give_timebase;
-   smp_ops->take_timebase = rtas_take_timebase;
-   }
-
pr_debug(" <- smp_init_pSeries()\n");
 }
-- 
2.37.1



Re: [PATCH v2 7/7] powerpc/pseries: Implement secvars for dynamic secure boot

2023-01-09 Thread Andrew Donnellan
On Fri, 2023-01-06 at 21:49 +1100, Michael Ellerman wrote:
> 
> > +static int plpks_get_variable(const char *key, uint64_t key_len,
> > + u8 *data, uint64_t *data_size)
> > +{
> > +   struct plpks_var var = {0};
> > +   u16 ucs2_namelen;
> > +   u8 *ucs2_name;
> > +   int rc = 0;
> > +
> > +   ucs2_namelen = get_ucs2name(key, &ucs2_name);
> > +   if (!ucs2_namelen)
> > +   return -ENOMEM;
> > +
> > +   var.name = ucs2_name;
> > +   var.namelen = ucs2_namelen;
> > +   var.os = PLPKS_VAR_LINUX;
> > +   rc = plpks_read_os_var(&var);
> > +
> > +   if (rc)
> > +   goto err;
> > +
> > +   *data_size = var.datalen + sizeof(var.policy);
> > +
> > +   // We can be called with data = NULL to just get the object size.
> > +   if (data) {
> > +   memcpy(data, &var.policy, sizeof(var.policy));
> > +   memcpy(data + sizeof(var.policy), var.data, var.datalen);
> > +   }
> 
> There's a lot of allocation and copying going on. The secvar-sysfs.c
> data_read() has kzalloc'ed data, but only after already doing the
> hcall
> to get the size. Then plpks_read_os_var() does an allocation for the
> hcall and then another allocation of the exact data size. Then
> data_read()
> does another copy into the sysfs supplied buffer.
> 
> So to read a single variable we do the hcall twice, and allocate/copy
> the content of the variable 4 times?
> 
>  - Hypervisor into "output" in plpks_read_var().
>  - "output" into "var->data" in plpks_read_var().
>  - "var.data" into "data" in plpks_get_variable().
>  - "data" into "buf" in data_read().
> 
> As long as maxobjsize is < PAGE_SIZE I think we could do the hcall
> directly into "buf". Maybe we want to avoid writing into "buf"
> directly
> in case the hcall fails or something, but the other 3 copies seem
> unnecessary.

In the general case, I don't like passing buffer pointers straight from
parameters into hcalls, since the address has to be in the linear map,
and that's a detail I'd rather hide from callers. But otherwise, yes I
think we can probably shift to having the caller allocate the buffers.

-- 
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com   IBM Australia Limited


Re: [PATCH v2 7/7] powerpc/pseries: Implement secvars for dynamic secure boot

2023-01-09 Thread Russell Currey
On Mon, 2023-01-09 at 16:20 +1100, Andrew Donnellan wrote:
> On Mon, 2023-01-09 at 14:34 +1100, Russell Currey wrote:
> > 
> > > > +static int plpks_secvar_init(void)
> > > > +{
> > > > +   if (!plpks_is_available())
> > > > +   return -ENODEV;
> > > > +
> > > > +   set_secvar_ops(&plpks_secvar_ops);
> > > > +   set_secvar_config_attrs(config_attrs);
> > > > +   return 0;
> > > > +}
> > > > +device_initcall(plpks_secvar_init);
> > > 
> > > That must be a machine_device_initcall(pseries, ...), otherwise
> > > we
> > > will
> > > blow up doing a hcall on powernv in plpks_is_available().
> > 
> > OK, can do.  I don't understand your case of how powernv could hit
> > this, but I think I have to move plpks_is_available() into
> > include/,
> > so it's going to be even more possible anyway.
> 
> Kernels can be compiled with both pseries and powernv support, in
> which
> case plpks_secvar_init() will be called unconditionally even when
> booting on a powernv machine.
> 
> I can confirm that as it is, booting this on powernv qemu causes a
> panic.

Of course, I'm not sure why I thought an initcall in a platform that
wasn't active would magically not run on other platforms.

> 



Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Segher Boessenkool
On Tue, Jan 10, 2023 at 01:22:38AM +0100, Andreas Schwab wrote:
> On Jan 09 2023, Segher Boessenkool wrote:
> 
> > It is required by POSIX (for the c99 command, anyway).  It *also* is
> > required to be supported when producing object files (so when no linking
> > is done).
> >
> > It is a GCC flag, and documented just fine:
> > https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s
> 
> Most assembler flags are unrelated to the flags passed to the compiler
> driver, and -s is no exception.  POSIX has nothing to say about the
> sub-commands of the compiler anyway.

But this is not an assembler flag!

quiet_cmd_vdso32as = VDSO32A $@
  cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) $(AS32FLAGS) -c -o $@ $<

where

ifdef CROSS32_COMPILE
VDSOCC := $(CROSS32_COMPILE)gcc
else
VDSOCC := $(CC)
endif


The name of the make variable AS32FLAGS is a bit misleading.  This
variable does not hold arguments to as, it holds arguments to the
compiler driver, "gcc".


Segher


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nathan Chancellor
On Mon, Jan 09, 2023 at 03:14:33PM -0800, Nick Desaulniers wrote:
> On Mon, Jan 9, 2023 at 2:29 PM Segher Boessenkool
>  wrote:
> >
> > Hi!  Happy new year all.
> 
> HNY Segher! :)
> 
> >
> > On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> > > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  
> > > wrote:
> > > >
> > > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > > > is unused.
> > > >
> > > >   clang-16: error: argument unused during compilation: '-s' 
> > > > [-Werror,-Wunused-command-line-argument]
> > > >
> > > > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > > > and it is ignored for the powerpc target so just drop the flag
> > > > altogether, as it is not needed.
> > >
> > > Do you have any more info where you found this?  I don't see -s
> > > documented as an assembler flag.
> > > https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> > > https://sourceware.org/binutils/docs/as/Invoking.html
> >
> > It is required by POSIX (for the c99 command, anyway).  It *also* is
> > required to be supported when producing object files (so when no linking
> > is done).
> >
> > It is a GCC flag, and documented just fine:
> > https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s
> >
> > (Yes, that says it is for linking; but the option is allowed without
> > error of any kind always).
> >
> > (ASFLAGS sounds like it is for assembler commands, but it really is
> > for compiler commands that just happen to get .S input files).
> >
> > > The patch seems fine to me, but what was this ever supposed to be?
> > > FWICT it predates git history (looking at
> > > arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)
> >
> > Yeah, good question.  This compiler flag does the moral equivalent of
> > strip -s (aka --strip-all).  Maybe this was needed at some point, or
> > the symbol or debug info was just annoying (during bringup or similar)?
> 
> Ah right! Ok then, I think we might keep the patch's diff, but update
> the commit message to mention this is a linker flag that's unused
> since the compiler is being invoked but not the linker (the compiler
> is being used as the driver to assemble a single assembler source
> without linking it; linking is then driven by the linker in a separate
> make rule).

Yes, sorry, I thought that was clear with the "which is a linking phase
option" comment in the commit message but clearly not :)

> Then we might want to revisit that s390 patch, too?
> https://lore.kernel.org/llvm/20221228-drop-qunused-arguments-v1-9-658cbc8fc...@kernel.org/

So for this patch, I have

  When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
  warns:

clang-16: error: argument unused during compilation: '-s' 
[-Werror,-Wunused-command-line-argument]

  The compiler's '-s' flag is a linking option (it is passed along to the
  linker directly), which means it does nothing when the linker is not
  invoked by the compiler. The kernel builds all .o files with either '-c'
  or '-S', which do not run the linker, so '-s' can be safely dropped from
  ASFLAGS.

as a new commit message. Is that sufficient for everyone? If so, I'll
adjust the s390 commit to match, as it is the same exact problem.

Alternatively, if '-s' should actually remain around, we could move it
to ldflags-y, which is added in patch 7. However, I assume that nobody
has noticed that it has not been doing its job for a while, so it should
be safe to remove.

Cheers,
Nathan


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Andreas Schwab
On Jan 09 2023, Segher Boessenkool wrote:

> It is required by POSIX (for the c99 command, anyway).  It *also* is
> required to be supported when producing object files (so when no linking
> is done).
>
> It is a GCC flag, and documented just fine:
> https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s

Most assembler flags are unrelated to the flags passed to the compiler
driver, and -s is no exception.  POSIX has nothing to say about the
sub-commands of the compiler anyway.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH 08/14] powerpc/vdso: Remove an unsupported flag from vgettimeofday-32.o with clang

2023-01-09 Thread Nick Desaulniers
On Mon, Jan 9, 2023 at 2:38 PM Nathan Chancellor  wrote:
>
> On Mon, Jan 09, 2023 at 02:12:55PM -0800, Nick Desaulniers wrote:
> > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> > >
> > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > warns:
> > >
> > >   clang-16: error: argument unused during compilation: 
> > > '-fno-stack-clash-protection' [-Werror,-Wunused-command-line-argument]
> > >
> > > This flag is supported for 64-bit powerpc but not 32-bit, hence the 
> > > warning.
> > > Just remove the flag from vgettimeofday-32.o's CFLAGS when using clang, 
> > > as has
> > > been done for other flags previously.
> > >
> > > Signed-off-by: Nathan Chancellor 
> >
> > Hmm...so this was added by the top level Makefile doing a cc-option
> > test.  How did the test pass if this was unsupported? That worries me
> > that perhaps other cc-option tests are passing erroneously for certain
> > ppc -m32/-m64 configs?
>
> Sure, that is a reasonable concern. I should have expanded upon this a
> little bit more in the commit message. Is this any better?
>
>   When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
>   warns:
>
> clang-16: error: argument unused during compilation: 
> '-fno-stack-clash-protection' [-Werror,-Wunused-command-line-argument]
>
>   This warning happens because vgettimeofday-32.c gets its base CFLAGS
>   from the main kernel, which may contain flags that are only supported
>   on a 64-bit target but not a 32-bit one, which is the case here.
  -fstack-clash-protection and its negation are only supported by the
>   64-bit powerpc target but that flag is included in an invocation for a
>   32-bit powerpc target, so clang points out that while the flag is one
>   that it recognizes, it is not actually used by this compiler job.
>
>   To eliminate the warning, remove -fno-stack-clash-protection from
>   vgettimeofday-32.c's CFLAGS when using clang, as has been done for
>   other flags previously.

Ah, ok that makes more sense. Sorry for my confusion, but if you
wouldn't mind adding the additional info in v3 it might help others
(or just me!)

You may add my RB to such a v3 w/ updated commit message.
-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nick Desaulniers
On Mon, Jan 9, 2023 at 2:29 PM Segher Boessenkool
 wrote:
>
> Hi!  Happy new year all.

HNY Segher! :)

>
> On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> > >
> > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > > is unused.
> > >
> > >   clang-16: error: argument unused during compilation: '-s' 
> > > [-Werror,-Wunused-command-line-argument]
> > >
> > > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > > and it is ignored for the powerpc target so just drop the flag
> > > altogether, as it is not needed.
> >
> > Do you have any more info where you found this?  I don't see -s
> > documented as an assembler flag.
> > https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> > https://sourceware.org/binutils/docs/as/Invoking.html
>
> It is required by POSIX (for the c99 command, anyway).  It *also* is
> required to be supported when producing object files (so when no linking
> is done).
>
> It is a GCC flag, and documented just fine:
> https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s
>
> (Yes, that says it is for linking; but the option is allowed without
> error of any kind always).
>
> (ASFLAGS sounds like it is for assembler commands, but it really is
> for compiler commands that just happen to get .S input files).
>
> > The patch seems fine to me, but what was this ever supposed to be?
> > FWICT it predates git history (looking at
> > arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)
>
> Yeah, good question.  This compiler flag does the moral equivalent of
> strip -s (aka --strip-all).  Maybe this was needed at some point, or
> the symbol or debug info was just annoying (during bringup or similar)?

Ah right! Ok then, I think we might keep the patch's diff, but update
the commit message to mention this is a linker flag that's unused
since the compiler is being invoked but not the linker (the compiler
is being used as the driver to assemble a single assembler source
without linking it; linking is then driven by the linker in a separate
make rule).

Then we might want to revisit that s390 patch, too?
https://lore.kernel.org/llvm/20221228-drop-qunused-arguments-v1-9-658cbc8fc...@kernel.org/

>
> > Reviewed-by: Nick Desaulniers 
> Reviewed-by: Segher Boessenkool 
>
>
> Segher



-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Segher Boessenkool
On Mon, Jan 09, 2023 at 03:37:53PM -0700, Nathan Chancellor wrote:
> On Mon, Jan 09, 2023 at 04:23:37PM -0600, Segher Boessenkool wrote:
> > Hi!  Happy new year all.
> > 
> > On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> > > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  
> > > wrote:
> > > >
> > > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > > > is unused.
> > > >
> > > >   clang-16: error: argument unused during compilation: '-s' 
> > > > [-Werror,-Wunused-command-line-argument]
> > > >
> > > > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > > > and it is ignored for the powerpc target so just drop the flag
> > > > altogether, as it is not needed.
> > > 
> > > Do you have any more info where you found this?  I don't see -s
> > > documented as an assembler flag.
> > > https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> > > https://sourceware.org/binutils/docs/as/Invoking.html
> > 
> > It is required by POSIX (for the c99 command, anyway).  It *also* is
> > required to be supported when producing object files (so when no linking
> > is done).
> > 
> > It is a GCC flag, and documented just fine:
> > https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s
> > 
> > (Yes, that says it is for linking; but the option is allowed without
> > error of any kind always).
> > 
> > (ASFLAGS sounds like it is for assembler commands, but it really is
> > for compiler commands that just happen to get .S input files).
> > 
> > > The patch seems fine to me, but what was this ever supposed to be?
> > > FWICT it predates git history (looking at
> > > arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)
> > 
> > Yeah, good question.  This compiler flag does the moral equivalent of
> > strip -s (aka --strip-all).  Maybe this was needed at some point, or
> > the symbol or debug info was just annoying (during bringup or similar)?
> > 
> > > Reviewed-by: Nick Desaulniers 
> > Reviewed-by: Segher Boessenkool 
> 
> Thank you for the review and extra explanation! I had kind of expanded
> on this in the s390 version of this patch [1], I will go ahead and do
> the same thing for the powerpc version too.
> 
> [1]: 
> https://lore.kernel.org/llvm/20221228-drop-qunused-arguments-v1-9-658cbc8fc...@kernel.org/

The underwhelming GCC source code for this is
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/gcc.cc;h=d629ca5e424ad3120be13e82b9f38b7bf8f1cdf2;hb=HEAD#l1148
which says that the -s flag is passed through unmodified to the linker
when invoking the linker.  See #l603 for the %{s} specs syntax.

This is not a flag to the assembler at all.  It is a flag to the
compiler, which passes it on to the linker :-)

> > (ASFLAGS sounds like it is for assembler commands, but it really is
> > for compiler commands that just happen to get .S input files).


Segher


Re: [PATCH 08/14] powerpc/vdso: Remove an unsupported flag from vgettimeofday-32.o with clang

2023-01-09 Thread Nathan Chancellor
On Mon, Jan 09, 2023 at 02:12:55PM -0800, Nick Desaulniers wrote:
> On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> >
> > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > warns:
> >
> >   clang-16: error: argument unused during compilation: 
> > '-fno-stack-clash-protection' [-Werror,-Wunused-command-line-argument]
> >
> > This flag is supported for 64-bit powerpc but not 32-bit, hence the warning.
> > Just remove the flag from vgettimeofday-32.o's CFLAGS when using clang, as 
> > has
> > been done for other flags previously.
> >
> > Signed-off-by: Nathan Chancellor 
> 
> Hmm...so this was added by the top level Makefile doing a cc-option
> test.  How did the test pass if this was unsupported? That worries me
> that perhaps other cc-option tests are passing erroneously for certain
> ppc -m32/-m64 configs?

Sure, that is a reasonable concern. I should have expanded upon this a
little bit more in the commit message. Is this any better?

  When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
  warns:

clang-16: error: argument unused during compilation: 
'-fno-stack-clash-protection' [-Werror,-Wunused-command-line-argument]

  This warning happens because vgettimeofday-32.c gets its base CFLAGS
  from the main kernel, which may contain flags that are only supported
  on a 64-bit target but not a 32-bit one, which is the case here.
  -fstack-clash-protection and its negation are only supported by the
  64-bit powerpc target but that flag is included in an invocation for a
  32-bit powerpc target, so clang points out that while the flag is one
  that it recognizes, it is not actually used by this compiler job.

  To eliminate the warning, remove -fno-stack-clash-protection from
  vgettimeofday-32.c's CFLAGS when using clang, as has been done for
  other flags previously.

Cheers,
Nathan

> > ---
> > Cc: m...@ellerman.id.au
> > Cc: npig...@gmail.com
> > Cc: christophe.le...@csgroup.eu
> > Cc: linuxppc-dev@lists.ozlabs.org
> > ---
> >  arch/powerpc/kernel/vdso/Makefile | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/vdso/Makefile 
> > b/arch/powerpc/kernel/vdso/Makefile
> > index 769b62832b38..4ee7d36ce752 100644
> > --- a/arch/powerpc/kernel/vdso/Makefile
> > +++ b/arch/powerpc/kernel/vdso/Makefile
> > @@ -16,6 +16,11 @@ ifneq ($(c-gettimeofday-y),)
> >CFLAGS_vgettimeofday-32.o += -ffreestanding -fasynchronous-unwind-tables
> >CFLAGS_REMOVE_vgettimeofday-32.o = $(CC_FLAGS_FTRACE)
> >CFLAGS_REMOVE_vgettimeofday-32.o += -mcmodel=medium -mabi=elfv1 
> > -mabi=elfv2 -mcall-aixdesc
> > +  # This flag is supported by clang for 64-bit but not 32-bit so it will 
> > cause
> > +  # an unused command line flag warning for this file.
> > +  ifdef CONFIG_CC_IS_CLANG
> > +  CFLAGS_REMOVE_vgettimeofday-32.o += -fno-stack-clash-protection
> > +  endif
> >CFLAGS_vgettimeofday-64.o += -include $(c-gettimeofday-y)
> >CFLAGS_vgettimeofday-64.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
> >CFLAGS_vgettimeofday-64.o += $(call cc-option, -fno-stack-protector)
> >
> > --
> > 2.39.0
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nathan Chancellor
On Mon, Jan 09, 2023 at 04:23:37PM -0600, Segher Boessenkool wrote:
> Hi!  Happy new year all.
> 
> On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> > >
> > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > > is unused.
> > >
> > >   clang-16: error: argument unused during compilation: '-s' 
> > > [-Werror,-Wunused-command-line-argument]
> > >
> > > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > > and it is ignored for the powerpc target so just drop the flag
> > > altogether, as it is not needed.
> > 
> > Do you have any more info where you found this?  I don't see -s
> > documented as an assembler flag.
> > https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> > https://sourceware.org/binutils/docs/as/Invoking.html
> 
> It is required by POSIX (for the c99 command, anyway).  It *also* is
> required to be supported when producing object files (so when no linking
> is done).
> 
> It is a GCC flag, and documented just fine:
> https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s
> 
> (Yes, that says it is for linking; but the option is allowed without
> error of any kind always).
> 
> (ASFLAGS sounds like it is for assembler commands, but it really is
> for compiler commands that just happen to get .S input files).
> 
> > The patch seems fine to me, but what was this ever supposed to be?
> > FWICT it predates git history (looking at
> > arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)
> 
> Yeah, good question.  This compiler flag does the moral equivalent of
> strip -s (aka --strip-all).  Maybe this was needed at some point, or
> the symbol or debug info was just annoying (during bringup or similar)?
> 
> > Reviewed-by: Nick Desaulniers 
> Reviewed-by: Segher Boessenkool 

Thank you for the review and extra explanation! I had kind of expanded
on this in the s390 version of this patch [1], I will go ahead and do
the same thing for the powerpc version too.

[1]: 
https://lore.kernel.org/llvm/20221228-drop-qunused-arguments-v1-9-658cbc8fc...@kernel.org/

Cheers,
Nathan


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Segher Boessenkool
Hi!  Happy new year all.

On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> >
> > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > is unused.
> >
> >   clang-16: error: argument unused during compilation: '-s' 
> > [-Werror,-Wunused-command-line-argument]
> >
> > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > and it is ignored for the powerpc target so just drop the flag
> > altogether, as it is not needed.
> 
> Do you have any more info where you found this?  I don't see -s
> documented as an assembler flag.
> https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> https://sourceware.org/binutils/docs/as/Invoking.html

It is required by POSIX (for the c99 command, anyway).  It *also* is
required to be supported when producing object files (so when no linking
is done).

It is a GCC flag, and documented just fine:
https://gcc.gnu.org/onlinedocs/gcc/Link-Options.html#index-s

(Yes, that says it is for linking; but the option is allowed without
error of any kind always).

(ASFLAGS sounds like it is for assembler commands, but it really is
for compiler commands that just happen to get .S input files).

> The patch seems fine to me, but what was this ever supposed to be?
> FWICT it predates git history (looking at
> arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)

Yeah, good question.  This compiler flag does the moral equivalent of
strip -s (aka --strip-all).  Maybe this was needed at some point, or
the symbol or debug info was just annoying (during bringup or similar)?

> Reviewed-by: Nick Desaulniers 
Reviewed-by: Segher Boessenkool 


Segher


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nick Desaulniers
On Mon, Jan 9, 2023 at 2:15 PM Nathan Chancellor  wrote:
>
> On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> > On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> > >
> > > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > > is unused.
> > >
> > >   clang-16: error: argument unused during compilation: '-s' 
> > > [-Werror,-Wunused-command-line-argument]
> > >
> > > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > > and it is ignored for the powerpc target so just drop the flag
> > > altogether, as it is not needed.
> >
> > Do you have any more info where you found this?  I don't see -s
> > documented as an assembler flag.
> > https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> > https://sourceware.org/binutils/docs/as/Invoking.html
>
> Sure thing, sorry I did not include it initially. See the section from
> line 1284 to 1291 or search for "case 's':":
>
> https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gas/config/tc-ppc.c;h=9450fa74de1b61542c9a18babf8c8f621ef904fb;hb=HEAD
>
> > The patch seems fine to me, but what was this ever supposed to be?
> > FWICT it predates git history (looking at
> > arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)
>
> Right, I could not figure it out either, it has been there since the
> vDSO was introduced back in 2005 (I was three days away from 10!) and
> there is no comment about it so *shrug*:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=054eb7153aeb84cc92da84210cf93b0e2a34811b
>
> If someone else's archeological skills are better and can provide more
> information, I am happy to include that.
>
> > Reviewed-by: Nick Desaulniers 
>
> Thanks as always for the review! I'll include this and a note about
> where in binutils I found that information for v2.

Yeah, I think the comment from binutils sources would be good to add
to a v2 commit message. Thanks!


-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 07/14] powerpc/vdso: Improve linker flags

2023-01-09 Thread Nathan Chancellor
On Mon, Jan 09, 2023 at 02:08:41PM -0800, Nick Desaulniers wrote:
> On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> >
> > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, there
> > are several warnings in the PowerPC vDSO:
> >
> >   clang-16: error: -Wl,-soname=linux-vdso32.so.1: 'linker' input unused 
> > [-Werror,-Wunused-command-line-argument]
> >   clang-16: error: -Wl,--hash-style=both: 'linker' input unused 
> > [-Werror,-Wunused-command-line-argument]
> >   clang-16: error: argument unused during compilation: '-shared' 
> > [-Werror,-Wunused-command-line-argument]
> >
> >   clang-16: error: argument unused during compilation: '-nostdinc' 
> > [-Werror,-Wunused-command-line-argument]
> >   clang-16: error: argument unused during compilation: '-Wa,-maltivec' 
> > [-Werror,-Wunused-command-line-argument]
> >
> > The first group of warnings point out that linker flags were being added
> > to all invocations of $(CC), even though they will only be used during
> > the final vDSO link. Move those flags to ldflags-y.
> >
> > The second group of warnings are compiler or assembler flags that will
> > be unused during linking. Filter them out from KBUILD_CFLAGS so that
> > they are not used during linking.
> >
> > Signed-off-by: Nathan Chancellor 
> > ---
> > Cc: m...@ellerman.id.au
> > Cc: npig...@gmail.com
> > Cc: christophe.le...@csgroup.eu
> > Cc: linuxppc-dev@lists.ozlabs.org
> > ---
> >  arch/powerpc/kernel/vdso/Makefile | 18 +++---
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/vdso/Makefile 
> > b/arch/powerpc/kernel/vdso/Makefile
> > index 45c0cc5d34b6..769b62832b38 100644
> > --- a/arch/powerpc/kernel/vdso/Makefile
> > +++ b/arch/powerpc/kernel/vdso/Makefile
> > @@ -47,13 +47,17 @@ KCOV_INSTRUMENT := n
> >  UBSAN_SANITIZE := n
> >  KASAN_SANITIZE := n
> >
> > -ccflags-y := -shared -fno-common -fno-builtin -nostdlib 
> > -Wl,--hash-style=both
> > -ccflags-$(CONFIG_LD_IS_LLD) += $(call 
> > cc-option,--ld-path=$(LD),-fuse-ld=lld)
> > -
> > -CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32
> > +ccflags-y := -fno-common -fno-builtin
> > +ldflags-y := -Wl,--hash-style=both -nostdlib -shared
> > +ldflags-$(CONFIG_LD_IS_LLD) += $(call 
> > cc-option,--ld-path=$(LD),-fuse-ld=lld)
> > +# Filter flags that clang will warn are unused for linking
> > +ldflags-y += $(filter-out $(CC_FLAGS_FTRACE) -Wa$(comma)%, 
> > $(KBUILD_CFLAGS))
> > +
> > +CC32FLAGS := -m32
> > +LD32FLAGS := -Wl,-soname=linux-vdso32.so.1
> >  AS32FLAGS := -D__VDSO32__
> >
> > -CC64FLAGS := -Wl,-soname=linux-vdso64.so.1
> > +LD64FLAGS := -Wl,-soname=linux-vdso64.so.1
> >  AS64FLAGS := -D__VDSO64__
> >
> >  targets += vdso32.lds
> > @@ -92,14 +96,14 @@ include/generated/vdso64-offsets.h: 
> > $(obj)/vdso64.so.dbg FORCE
> >
> >  # actual build commands
> >  quiet_cmd_vdso32ld_and_check = VDSO32L $@
> > -  cmd_vdso32ld_and_check = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ 
> > -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; $(cmd_vdso_check)
> > +  cmd_vdso32ld_and_check = $(VDSOCC) $(ldflags-y) $(CC32FLAGS) 
> > $(LD32FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack 
> > ; $(cmd_vdso_check)
> >  quiet_cmd_vdso32as = VDSO32A $@
> >cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) $(AS32FLAGS) -c -o 
> > $@ $<
> >  quiet_cmd_vdso32cc = VDSO32C $@
> >cmd_vdso32cc = $(VDSOCC) $(c_flags) $(CC32FLAGS) -c -o $@ $<
> >
> >  quiet_cmd_vdso64ld_and_check = VDSO64L $@
> > -  cmd_vdso64ld_and_check = $(VDSOCC) $(c_flags) $(CC64FLAGS) -o $@ 
> > -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; $(cmd_vdso_check)
> > +  cmd_vdso64ld_and_check = $(VDSOCC) $(ldflags-y) $(CC64FLAGS) 
> > $(LD64FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack 
> > ; $(cmd_vdso_check)
> 
> Let's move `-z noexecstack` up into ldflags-y? (you may add my RB with
> that modification)

Ack, done locally, will be included in v2.

> >  quiet_cmd_vdso64as = VDSO64A $@
> >cmd_vdso64as = $(VDSOCC) $(a_flags) $(CC64FLAGS) $(AS64FLAGS) -c -o 
> > $@ $<
> >
> >
> > --
> > 2.39.0
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nathan Chancellor
On Mon, Jan 09, 2023 at 01:58:32PM -0800, Nick Desaulniers wrote:
> On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
> >
> > When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> > warns that ASFLAGS contains '-s', which is a linking phase option, so it
> > is unused.
> >
> >   clang-16: error: argument unused during compilation: '-s' 
> > [-Werror,-Wunused-command-line-argument]
> >
> > Looking at the GAS sources, '-s' is only useful when targeting Solaris
> > and it is ignored for the powerpc target so just drop the flag
> > altogether, as it is not needed.
> 
> Do you have any more info where you found this?  I don't see -s
> documented as an assembler flag.
> https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
> https://sourceware.org/binutils/docs/as/Invoking.html

Sure thing, sorry I did not include it initially. See the section from
line 1284 to 1291 or search for "case 's':":

https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gas/config/tc-ppc.c;h=9450fa74de1b61542c9a18babf8c8f621ef904fb;hb=HEAD

> The patch seems fine to me, but what was this ever supposed to be?
> FWICT it predates git history (looking at
> arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)

Right, I could not figure it out either, it has been there since the
vDSO was introduced back in 2005 (I was three days away from 10!) and
there is no comment about it so *shrug*:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=054eb7153aeb84cc92da84210cf93b0e2a34811b

If someone else's archeological skills are better and can provide more
information, I am happy to include that.

> Reviewed-by: Nick Desaulniers 

Thanks as always for the review! I'll include this and a note about
where in binutils I found that information for v2.

> >
> > Signed-off-by: Nathan Chancellor 
> > ---
> > Cc: m...@ellerman.id.au
> > Cc: npig...@gmail.com
> > Cc: christophe.le...@csgroup.eu
> > Cc: linuxppc-dev@lists.ozlabs.org
> > ---
> >  arch/powerpc/kernel/vdso/Makefile | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/vdso/Makefile 
> > b/arch/powerpc/kernel/vdso/Makefile
> > index 6a977b0d8ffc..45c0cc5d34b6 100644
> > --- a/arch/powerpc/kernel/vdso/Makefile
> > +++ b/arch/powerpc/kernel/vdso/Makefile
> > @@ -51,10 +51,10 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib 
> > -Wl,--hash-style=both
> >  ccflags-$(CONFIG_LD_IS_LLD) += $(call 
> > cc-option,--ld-path=$(LD),-fuse-ld=lld)
> >
> >  CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32
> > -AS32FLAGS := -D__VDSO32__ -s
> > +AS32FLAGS := -D__VDSO32__
> >
> >  CC64FLAGS := -Wl,-soname=linux-vdso64.so.1
> > -AS64FLAGS := -D__VDSO64__ -s
> > +AS64FLAGS := -D__VDSO64__
> >
> >  targets += vdso32.lds
> >  CPPFLAGS_vdso32.lds += -P -C -Upowerpc
> >
> > --
> > 2.39.0
> 
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers


Re: [PATCH 08/14] powerpc/vdso: Remove an unsupported flag from vgettimeofday-32.o with clang

2023-01-09 Thread Nick Desaulniers
On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
>
> When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> warns:
>
>   clang-16: error: argument unused during compilation: 
> '-fno-stack-clash-protection' [-Werror,-Wunused-command-line-argument]
>
> This flag is supported for 64-bit powerpc but not 32-bit, hence the warning.
> Just remove the flag from vgettimeofday-32.o's CFLAGS when using clang, as has
> been done for other flags previously.
>
> Signed-off-by: Nathan Chancellor 

Hmm...so this was added by the top level Makefile doing a cc-option
test.  How did the test pass if this was unsupported? That worries me
that perhaps other cc-option tests are passing erroneously for certain
ppc -m32/-m64 configs?

> ---
> Cc: m...@ellerman.id.au
> Cc: npig...@gmail.com
> Cc: christophe.le...@csgroup.eu
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/vdso/Makefile | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/arch/powerpc/kernel/vdso/Makefile 
> b/arch/powerpc/kernel/vdso/Makefile
> index 769b62832b38..4ee7d36ce752 100644
> --- a/arch/powerpc/kernel/vdso/Makefile
> +++ b/arch/powerpc/kernel/vdso/Makefile
> @@ -16,6 +16,11 @@ ifneq ($(c-gettimeofday-y),)
>CFLAGS_vgettimeofday-32.o += -ffreestanding -fasynchronous-unwind-tables
>CFLAGS_REMOVE_vgettimeofday-32.o = $(CC_FLAGS_FTRACE)
>CFLAGS_REMOVE_vgettimeofday-32.o += -mcmodel=medium -mabi=elfv1 
> -mabi=elfv2 -mcall-aixdesc
> +  # This flag is supported by clang for 64-bit but not 32-bit so it will 
> cause
> +  # an unused command line flag warning for this file.
> +  ifdef CONFIG_CC_IS_CLANG
> +  CFLAGS_REMOVE_vgettimeofday-32.o += -fno-stack-clash-protection
> +  endif
>CFLAGS_vgettimeofday-64.o += -include $(c-gettimeofday-y)
>CFLAGS_vgettimeofday-64.o += $(DISABLE_LATENT_ENTROPY_PLUGIN)
>CFLAGS_vgettimeofday-64.o += $(call cc-option, -fno-stack-protector)
>
> --
> 2.39.0



-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 07/14] powerpc/vdso: Improve linker flags

2023-01-09 Thread Nick Desaulniers
On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
>
> When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, there
> are several warnings in the PowerPC vDSO:
>
>   clang-16: error: -Wl,-soname=linux-vdso32.so.1: 'linker' input unused 
> [-Werror,-Wunused-command-line-argument]
>   clang-16: error: -Wl,--hash-style=both: 'linker' input unused 
> [-Werror,-Wunused-command-line-argument]
>   clang-16: error: argument unused during compilation: '-shared' 
> [-Werror,-Wunused-command-line-argument]
>
>   clang-16: error: argument unused during compilation: '-nostdinc' 
> [-Werror,-Wunused-command-line-argument]
>   clang-16: error: argument unused during compilation: '-Wa,-maltivec' 
> [-Werror,-Wunused-command-line-argument]
>
> The first group of warnings point out that linker flags were being added
> to all invocations of $(CC), even though they will only be used during
> the final vDSO link. Move those flags to ldflags-y.
>
> The second group of warnings are compiler or assembler flags that will
> be unused during linking. Filter them out from KBUILD_CFLAGS so that
> they are not used during linking.
>
> Signed-off-by: Nathan Chancellor 
> ---
> Cc: m...@ellerman.id.au
> Cc: npig...@gmail.com
> Cc: christophe.le...@csgroup.eu
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/vdso/Makefile | 18 +++---
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/kernel/vdso/Makefile 
> b/arch/powerpc/kernel/vdso/Makefile
> index 45c0cc5d34b6..769b62832b38 100644
> --- a/arch/powerpc/kernel/vdso/Makefile
> +++ b/arch/powerpc/kernel/vdso/Makefile
> @@ -47,13 +47,17 @@ KCOV_INSTRUMENT := n
>  UBSAN_SANITIZE := n
>  KASAN_SANITIZE := n
>
> -ccflags-y := -shared -fno-common -fno-builtin -nostdlib -Wl,--hash-style=both
> -ccflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld)
> -
> -CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32
> +ccflags-y := -fno-common -fno-builtin
> +ldflags-y := -Wl,--hash-style=both -nostdlib -shared
> +ldflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld)
> +# Filter flags that clang will warn are unused for linking
> +ldflags-y += $(filter-out $(CC_FLAGS_FTRACE) -Wa$(comma)%, $(KBUILD_CFLAGS))
> +
> +CC32FLAGS := -m32
> +LD32FLAGS := -Wl,-soname=linux-vdso32.so.1
>  AS32FLAGS := -D__VDSO32__
>
> -CC64FLAGS := -Wl,-soname=linux-vdso64.so.1
> +LD64FLAGS := -Wl,-soname=linux-vdso64.so.1
>  AS64FLAGS := -D__VDSO64__
>
>  targets += vdso32.lds
> @@ -92,14 +96,14 @@ include/generated/vdso64-offsets.h: $(obj)/vdso64.so.dbg 
> FORCE
>
>  # actual build commands
>  quiet_cmd_vdso32ld_and_check = VDSO32L $@
> -  cmd_vdso32ld_and_check = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ 
> -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; $(cmd_vdso_check)
> +  cmd_vdso32ld_and_check = $(VDSOCC) $(ldflags-y) $(CC32FLAGS) 
> $(LD32FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; 
> $(cmd_vdso_check)
>  quiet_cmd_vdso32as = VDSO32A $@
>cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) $(AS32FLAGS) -c -o $@ 
> $<
>  quiet_cmd_vdso32cc = VDSO32C $@
>cmd_vdso32cc = $(VDSOCC) $(c_flags) $(CC32FLAGS) -c -o $@ $<
>
>  quiet_cmd_vdso64ld_and_check = VDSO64L $@
> -  cmd_vdso64ld_and_check = $(VDSOCC) $(c_flags) $(CC64FLAGS) -o $@ 
> -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; $(cmd_vdso_check)
> +  cmd_vdso64ld_and_check = $(VDSOCC) $(ldflags-y) $(CC64FLAGS) 
> $(LD64FLAGS) -o $@ -Wl,-T$(filter %.lds,$^) $(filter %.o,$^) -z noexecstack ; 
> $(cmd_vdso_check)

Let's move `-z noexecstack` up into ldflags-y? (you may add my RB with
that modification)

>  quiet_cmd_vdso64as = VDSO64A $@
>cmd_vdso64as = $(VDSOCC) $(a_flags) $(CC64FLAGS) $(AS64FLAGS) -c -o $@ 
> $<
>
>
> --
> 2.39.0



-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 06/14] powerpc/vdso: Remove unused '-s' flag from ASFLAGS

2023-01-09 Thread Nick Desaulniers
On Wed, Jan 4, 2023 at 11:55 AM Nathan Chancellor  wrote:
>
> When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> warns that ASFLAGS contains '-s', which is a linking phase option, so it
> is unused.
>
>   clang-16: error: argument unused during compilation: '-s' 
> [-Werror,-Wunused-command-line-argument]
>
> Looking at the GAS sources, '-s' is only useful when targeting Solaris
> and it is ignored for the powerpc target so just drop the flag
> altogether, as it is not needed.

Do you have any more info where you found this?  I don't see -s
documented as an assembler flag.
https://sourceware.org/binutils/docs/as/PowerPC_002dOpts.html
https://sourceware.org/binutils/docs/as/Invoking.html

The patch seems fine to me, but what was this ever supposed to be?
FWICT it predates git history (looking at
arch/powerpc/kernel/vdso32/Makefile at fc15351d9d63)

Reviewed-by: Nick Desaulniers 

>
> Signed-off-by: Nathan Chancellor 
> ---
> Cc: m...@ellerman.id.au
> Cc: npig...@gmail.com
> Cc: christophe.le...@csgroup.eu
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/vdso/Makefile | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/vdso/Makefile 
> b/arch/powerpc/kernel/vdso/Makefile
> index 6a977b0d8ffc..45c0cc5d34b6 100644
> --- a/arch/powerpc/kernel/vdso/Makefile
> +++ b/arch/powerpc/kernel/vdso/Makefile
> @@ -51,10 +51,10 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib 
> -Wl,--hash-style=both
>  ccflags-$(CONFIG_LD_IS_LLD) += $(call cc-option,--ld-path=$(LD),-fuse-ld=lld)
>
>  CC32FLAGS := -Wl,-soname=linux-vdso32.so.1 -m32
> -AS32FLAGS := -D__VDSO32__ -s
> +AS32FLAGS := -D__VDSO32__
>
>  CC64FLAGS := -Wl,-soname=linux-vdso64.so.1
> -AS64FLAGS := -D__VDSO64__ -s
> +AS64FLAGS := -D__VDSO64__
>
>  targets += vdso32.lds
>  CPPFLAGS_vdso32.lds += -P -C -Upowerpc
>
> --
> 2.39.0



-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 12/15] auxdisplay: ht16k33: Introduce backlight_get_brightness()

2023-01-09 Thread Robin van der Gracht

Hi Sam,

On 2023-01-08 10:29, Sam Ravnborg wrote:

Hi Robin.

On Sat, Jan 07, 2023 at 10:02:38PM +0100, Miguel Ojeda wrote:

On Sat, Jan 7, 2023 at 7:26 PM Sam Ravnborg via B4 Submission Endpoint
 wrote:
>
> Introduce backlight_get_brightness() to simplify logic
> and avoid direct access to backlight properties.

Note: Stephen sent this one too a while ago (with some more details in
the commit message, which is always nice); and then he sent yesterday
v2 [1] (to mention the functional change with `BL_CORE_SUSPENDED`
[2]).

Thanks for the pointers. I will try to move forward with Stephen's
patches.


Anyway, if it goes via drm-misc, feel free to have my:

Acked-by: Miguel Ojeda 

Though it would be nice to have Robin test the change.


Robin - can I get your ack to apply Stephen's original v2 patch to
drm-misc?


done! see: 
https://lore.kernel.org/lkml/0b16391f997e6ed005a326e4e48f2...@protonic.nl/


- Robin


Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-09 Thread Barry Song
On Tue, Jan 10, 2023 at 1:19 AM Catalin Marinas  wrote:
>
> On Sun, Jan 08, 2023 at 06:48:41PM +0800, Barry Song wrote:
> > On Fri, Jan 6, 2023 at 2:15 AM Catalin Marinas  
> > wrote:
> > > On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> > > > It is tested on 4,8,128 CPU platforms and shows to be beneficial on
> > > > large systems but may not have improvement on small systems like on
> > > > a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
> > > > on CONFIG_EXPERT for this stage and make this disabled on systems
> > > > with less than 8 CPUs. User can modify this threshold according to
> > > > their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.
> > >
> > > What's the overhead of such batching on systems with 4 or fewer CPUs? If
> > > it isn't noticeable, I'd rather have it always on than some number
> > > chosen on whichever SoC you tested.
> >
> > On the one hand, tlb flush is cheap on a small system. so batching tlb flush
> > helps very minorly.
>
> Yes, it probably won't help on small systems but I don't like config
> options choosing the threshold, which may be different from system to
> system even if they have the same number of CPUs. A run-time tunable
> would be a better option.
>
> > On the other hand, since we have batched the tlb flush, new PTEs might be
> > invisible to others before the final broadcast is done and Ack-ed.
>
> The new PTEs could indeed be invisible at the TLB level but not at the
> memory (page table) level since this is done under the PTL IIUC.
>
> > thus, there
> > is a risk someone else might do mprotect or similar things  on those 
> > deferred
> > pages which will ask for read-modify-write on those deferred PTEs.
>
> And this should be fine, we have things like the PTL in place for the
> actual memory access to the page table.
>
> > in this
> > case, mm will do an explicit flush by flush_tlb_batched_pending which is
> > not required if tlb flush is not deferred.
>
> I don't fully understand why it's needed, or at least why it would be
> needed on arm64. At the end of an mprotect(), we have the final PTEs in
> place and we just need to issue a TLBI for that range.
> change_pte_range() for example has a tlb_flush_pte_range() if the PTE
> was present and that won't be done lazily. If there are other TLBIs
> pending for the same range, they'll be done later though likely
> unnecessarily but still cheaper than issuing a flush_tlb_mm().

Thanks! I'd like to ask for some comments from Nadav and Mel from the x86 side.
Revisiting the code of flush_tlb_batched_pending shows we still have races even
under PTL.

/*
 * Reclaim unmaps pages under the PTL but do not flush the TLB prior to
 * releasing the PTL if TLB flushes are batched. It's possible for a parallel
 * operation such as mprotect or munmap to race between reclaim unmapping
 * the page and flushing the page. If this race occurs, it potentially allows
 * access to data via a stale TLB entry. Tracking all mm's that have TLB
 * batching in flight would be expensive during reclaim so instead track
 * whether TLB batching occurred in the past and if so then do a flush here
 * if required. This will cost one additional flush per reclaim cycle paid
 * by the first operation at risk such as mprotect and munmap.
 *
 * This must be called under the PTL so that an access to tlb_flush_batched
 * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
 * via the PTL.
 */
void flush_tlb_batched_pending(struct mm_struct *mm)
{
}

According to Catalin's comment, it seems over-cautious since we can make sure
people see updated TLB after mprotect and munmap are done as they have tlb
flush.  We can also make sure mprotect see updated "memory" of PTEs from
reclamation though pte is not visible in TLB level.

Hi Mel, Nadav, would you please help clarify the exact sequence of how this race
is going to happen?

>
> > void flush_tlb_batched_pending(struct mm_struct *mm)
> > {
> >int batch = atomic_read(&mm->tlb_flush_batched);
> >int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
> >int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> >
> >if (pending != flushed) {
> >flush_tlb_mm(mm);
> > /*
> >  * If the new TLB flushing is pending during flushing, leave
> >  * mm->tlb_flush_batched as is, to avoid losing flushing.
> > */
> >   atomic_cmpxchg(&mm->tlb_flush_batched, batch,
> >pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
> >  }
> > }
>
> I guess this works on x86 better as it avoids the IPIs if this flush
> already happened. But on arm64 we already issued the TLBI, we just
> didn't wait for it to complete via a DSB.
>
> > I believe Anshuman has contributed many points on this in those previous
> > discussions.
>
> Yeah, I should re-read the old threads.
>
> --
> Catalin

Thanks
Barry


Re: [PATCH 05/14] powerpc: Remove linker flag from KBUILD_AFLAGS

2023-01-09 Thread Nick Desaulniers
On Wed, Jan 4, 2023 at 11:54 AM Nathan Chancellor  wrote:
>
> When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
> points out that KBUILD_AFLAGS contains a linker flag, which will be
> used:
>
>   clang: error: -Wl,-a32: 'linker' input unused 
> [-Werror,-Wunused-command-line-argument]
>
> This was likely supposed to be '-Wa,-a$(BITS)'. However, this change is
> unnecessary, as all supported versions of clang and gcc will pass '-a64'
> or '-a32' to GNU as based on the value of '-m'; the behavior of the
> latest stable release of the oldest supported major version of each
> compiler is shown below and each compiler's latest release exhibits the
> same behavior (GCC 12.2.0 and Clang 15.0.6).
>
>   $ powerpc64-linux-gcc --version | head -1
>   powerpc64-linux-gcc (GCC) 5.5.0
>
>   $ powerpc64-linux-gcc -m64 -### -x assembler-with-cpp -c -o /dev/null 
> /dev/null &| grep 'as '
>   .../as -a64 -mppc64 -many -mbig -o /dev/null /tmp/cctwuBzZ.s
>
>   $ powerpc64-linux-gcc -m32 -### -x assembler-with-cpp -c -o /dev/null 
> /dev/null &| grep 'as '
>   .../as -a32 -mppc -many -mbig -o /dev/null /tmp/ccaZP4mF.sg
>
>   $ clang --version | head -1
>   Ubuntu clang version 
> 11.1.0-++20211011094159+1fdec59bffc1-1~exp1~20211011214622.5
>
>   $ clang --target=powerpc64-linux-gnu -fno-integrated-as -m64 -### \
> -x assembler-with-cpp -c -o /dev/null /dev/null &| grep gnu-as
>"/usr/bin/powerpc64-linux-gnu-as" "-a64" "-mppc64" "-many" "-o" 
> "/dev/null" "/tmp/null-80267c.s"
>
>   $ clang --target=powerpc64-linux-gnu -fno-integrated-as -m32 -### \
> -x assembler-with-cpp -c -o /dev/null /dev/null &| grep gnu-as
>"/usr/bin/powerpc64-linux-gnu-as" "-a32" "-mppc" "-many" "-o" "/dev/null" 
> "/tmp/null-ab8f8d.s"
>
> Remove this flag altogether to avoid future issues.
>
> Fixes: 1421dc6d4829 ("powerpc/kbuild: Use flags variables rather than 
> overriding LD/CC/AS")
> Signed-off-by: Nathan Chancellor 

Thanks for the patch!
Reviewed-by: Nick Desaulniers 

> ---
> Cc: m...@ellerman.id.au
> Cc: npig...@gmail.com
> Cc: christophe.le...@csgroup.eu
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index dc4cbf0a5ca9..4fd630efe39d 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -90,7 +90,7 @@ aflags-$(CONFIG_CPU_LITTLE_ENDIAN)+= -mlittle-endian
>
>  ifeq ($(HAS_BIARCH),y)
>  KBUILD_CFLAGS  += -m$(BITS)
> -KBUILD_AFLAGS  += -m$(BITS) -Wl,-a$(BITS)
> +KBUILD_AFLAGS  += -m$(BITS)
>  KBUILD_LDFLAGS += -m elf$(BITS)$(LDEMULATION)
>  endif
>
>
> --
> 2.39.0



-- 
Thanks,
~Nick Desaulniers


Re: [PATCH 01/15] video: fbdev: atmel_lcdfb: Rework backlight handling

2023-01-09 Thread Stephen Kitt
Hi Sam,

On Sun, 8 Jan 2023 21:24:20 +0100, Sam Ravnborg  wrote:
> > Here are my pending patches from last June on lore:
> >   
> 
> All patches are handled I think except this:
> > * https://lore.kernel.org/lkml/20220608205623.2106113-1-st...@sk2.org/  
> 
> Can I ask you to drop the assignment that is not needed, and resend with
> the collected acks/r-b.
> 
> With this, then all fbdev patches are handled.

Ah yes, it was revised as
https://lore.kernel.org/lkml/20220609180440.3138625-1-st...@sk2.org/ which
only got one ack (from you,
https://lore.kernel.org/lkml/yqjckqmqeuvsb...@ravnborg.org/). Should I
resend that or is that usable as-is? Or would it be better if I sent all the
fbdev patches again (collecting all the acks and reviews)?

Regards,

Stephen




[PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-09 Thread Suren Baghdasaryan
rw_semaphore is a sizable structure of 40 bytes and consumes
considerable space for each vm_area_struct. However vma_lock has
two important specifics which can be used to replace rw_semaphore
with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to
mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock
because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not
needed and we can replace rw_semaphore with an atomic variable.
When a reader takes read lock, it increments the atomic unless the
value is negative. If that fails read-locking is aborted and mmap_lock
is used instead.
When writer takes write lock, it resets atomic value to -1 if the
current value is 0 (no readers). Since all writers take mmap_lock in
write mode first, there can be only one writer at a time. If there
are readers, writer will place itself into a wait queue using new
mm_struct.vma_writer_wait waitqueue head. The last reader to release
the vma_lock will signal the writer to wake up.
vm_lock_seq is also moved into vma_lock and along with atomic_t they
are nicely packed and consume 8 bytes, bringing the overhead of
vma_lock from 44 to 16 bytes:

slabinfo before the changes:
 ...: ...
vm_area_struct...152   532 : ...

slabinfo with vma_lock:
 ...: ...
rw_semaphore  ...  8  5121 : ...
vm_area_struct...160   512 : ...

Assuming 4 vm_area_structs, memory consumption would be:
baseline: 6040kB
vma_lock (vm_area_structs+vma_lock): 6280kB+316kB=6596kB
Total increase: 556kB

atomic_t might overflow if there are many competing readers, therefore
vma_read_trylock() implements an overflow check and if that occurs it
restores the previous value and exits with a failure to lock.
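The reader/writer protocol described above can be sketched as a minimal
userspace model using C11 atomics. This is an illustration only, not the
kernel implementation: the writer waitqueue, memory barriers, and the
vm_lock_seq handling are omitted, and all names here are made up for the
sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the patch's lock word: 0 = unlocked, >0 = reader count,
 * -1 = write-locked. */
typedef struct { atomic_int count; } vma_lock_t;

/* Reader: increment unless negative (mirrors atomic_inc_unless_negative). */
static bool vma_read_trylock(vma_lock_t *vl)
{
	int c = atomic_load(&vl->count);
	do {
		if (c < 0)
			return false;	/* a writer holds the lock */
	} while (!atomic_compare_exchange_weak(&vl->count, &c, c + 1));
	return true;
}

static void vma_read_unlock(vma_lock_t *vl)
{
	/* kernel version additionally wakes a waiting writer at count == 0 */
	atomic_fetch_sub(&vl->count, 1);
}

/* Writer: swing 0 -> -1; the kernel version sleeps on a waitqueue instead
 * of failing when readers are present. */
static bool vma_write_trylock(vma_lock_t *vl)
{
	int expected = 0;
	return atomic_compare_exchange_strong(&vl->count, &expected, -1);
}

static void vma_write_unlock(vma_lock_t *vl)
{
	atomic_store(&vl->count, 0);
}
```

A reader blocks the writer and vice versa, matching the invariants the
commit message relies on (readers never wait; at most one writer).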

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 37 +
 include/linux/mm_types.h | 10 --
 kernel/fork.c|  6 +++---
 mm/init-mm.c |  2 ++
 4 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d40bf8a5e19e..294dd44b2198 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -627,12 +627,16 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
 * mm->mm_lock_seq can't be concurrently modified.
 */
mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
-   if (vma->vm_lock_seq == mm_lock_seq)
+   if (vma->vm_lock->lock_seq == mm_lock_seq)
return;
 
-   down_write(&vma->vm_lock->lock);
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->vm_lock->lock);
+   if (atomic_cmpxchg(&vma->vm_lock->count, 0, -1))
+   wait_event(vma->vm_mm->vma_writer_wait,
+  atomic_cmpxchg(&vma->vm_lock->count, 0, -1) == 0);
+   vma->vm_lock->lock_seq = mm_lock_seq;
+   /* Write barrier to ensure lock_seq change is visible before count */
+   smp_wmb();
+   atomic_set(&vma->vm_lock->count, 0);
 }
 
 /*
@@ -643,20 +647,28 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
 {
/* Check before locking. A race might cause false locked result. */
-   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+   if (vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
 
-   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
+   if (unlikely(!atomic_inc_unless_negative(&vma->vm_lock->count)))
return false;
 
+   /* If atomic_t overflows, restore and fail to lock. */
+   if (unlikely(atomic_read(&vma->vm_lock->count) < 0)) {
+   if (atomic_dec_and_test(&vma->vm_lock->count))
+   wake_up(&vma->vm_mm->vma_writer_wait);
+   return false;
+   }
+
/*
 * Overflow might produce false locked result.
 * False unlocked result is impossible because we modify and check
 * vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 * modification invalidates all existing locks.
 */
-   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-   up_read(&vma->vm_lock->lock);
+   if (unlikely(vma->vm_lock->lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+   if (atomic_dec_and_test(&vma->vm_lock->count))
+   wake_up(&vma->vm_mm->vma_writer_wait);
return false;
}
return true;
@@ -664,7 +676,8 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-   up_read(&vma->vm_lock->lock);
+   if (atomic_dec_and_test(&vma->vm_lock->count))
+   wake_up(&vma->vm_mm->vma_writer_wait);
 }
 
 static inline void 

[PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-09 Thread Suren Baghdasaryan
vma->lock being part of the vm_area_struct causes performance regression
during page faults because during contention its count and owner fields
are constantly updated and having other parts of vm_area_struct used
during page fault handling next to them causes constant cache line
bouncing. Fix that by moving the lock outside of the vm_area_struct.
All attempts to keep vma->lock inside vm_area_struct in a separate
cache line still produce performance regression especially on NUMA
machines. Smallest regression was achieved when lock is placed in the
fourth cache line but that bloats vm_area_struct to 256 bytes.
Considering performance and memory impact, separate lock looks like
the best option. It increases memory footprint of each VMA but that
will be addressed in the next patch.
Note that after this change vma_init() does not allocate or
initialize vma->lock anymore. A number of drivers allocate a pseudo
VMA on the stack but they never use the VMA's lock, therefore it does
not need to be allocated. Future drivers that need the VMA
lock should use vm_area_alloc()/vm_area_free() to allocate it.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 25 ++--
 include/linux/mm_types.h |  6 ++-
 kernel/fork.c| 82 
 3 files changed, 74 insertions(+), 39 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50c7a6dd9c7a..d40bf8a5e19e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -615,11 +615,6 @@ struct vm_operations_struct {
 };
 
 #ifdef CONFIG_PER_VMA_LOCK
-static inline void vma_init_lock(struct vm_area_struct *vma)
-{
-   init_rwsem(&vma->lock);
-   vma->vm_lock_seq = -1;
-}
 
 static inline void vma_write_lock(struct vm_area_struct *vma)
 {
@@ -635,9 +630,9 @@ static inline void vma_write_lock(struct vm_area_struct 
*vma)
if (vma->vm_lock_seq == mm_lock_seq)
return;
 
-   down_write(&vma->lock);
+   down_write(&vma->vm_lock->lock);
 vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->lock);
+   up_write(&vma->vm_lock->lock);
 }
 
 /*
@@ -651,17 +646,17 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
return false;
 
-   if (unlikely(down_read_trylock(&vma->lock) == 0))
+   if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
return false;
 
/*
 * Overflow might produce false locked result.
 * False unlocked result is impossible because we modify and check
-* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq
 * modification invalidates all existing locks.
 */
if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
-   up_read(&vma->lock);
+   up_read(&vma->vm_lock->lock);
return false;
}
return true;
@@ -669,7 +664,7 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
 
 static inline void vma_read_unlock(struct vm_area_struct *vma)
 {
-   up_read(&vma->lock);
+   up_read(&vma->vm_lock->lock);
 }
 
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
@@ -684,7 +679,7 @@ static inline void vma_assert_write_locked(struct 
vm_area_struct *vma)
 
 static inline void vma_assert_no_reader(struct vm_area_struct *vma)
 {
-   VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+   VM_BUG_ON_VMA(rwsem_is_locked(&vma->vm_lock->lock) &&
  vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
  vma);
 }
@@ -694,7 +689,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct 
*mm,
 
 #else /* CONFIG_PER_VMA_LOCK */
 
-static inline void vma_init_lock(struct vm_area_struct *vma) {}
 static inline void vma_write_lock(struct vm_area_struct *vma) {}
 static inline bool vma_read_trylock(struct vm_area_struct *vma)
{ return false; }
@@ -704,6 +698,10 @@ static inline void vma_assert_no_reader(struct 
vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
+/*
+ * WARNING: vma_init does not initialize vma->vm_lock.
+ * Use vm_area_alloc()/vm_area_free() if vma needs locking.
+ */
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -712,7 +710,6 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(>anon_vma_chain);
-   vma_init_lock(vma);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c0e6c8e4700b..faa61b400f9b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -526,6 +526,10 @@ struct anon_vma_name {
char name[];
 };
 
+struct 

[PATCH 39/41] kernel/fork: throttle call_rcu() calls in vm_area_free

2023-01-09 Thread Suren Baghdasaryan
call_rcu() can take a long time when callback offloading is enabled.
Its use in the vm_area_free can cause regressions in the exit path when
multiple VMAs are being freed. To minimize that impact, place VMAs into
a list and free them in groups using one call_rcu() call per group.
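The grouping idea above can be illustrated with a toy single-threaded
userspace model. A plain counter stands in for call_rcu() invocations, a
singly linked list stands in for vm_free_list, and the threshold mirrors
VM_AREA_FREE_LIST_MAX from the patch; everything here is a sketch, not
kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Freed objects are parked on a list and released in one batch once the
 * list exceeds a threshold, i.e. one "call_rcu()" per group. */
#define FREE_LIST_MAX 32

struct obj { struct obj *next; };

static struct obj *free_list;
static int free_list_size;
static int batch_flushes;	/* counts simulated call_rcu() invocations */

static void drain_free_objs(void)
{
	struct obj *o = free_list;

	free_list = NULL;
	free_list_size = 0;
	batch_flushes++;	/* one deferred callback for the whole group */
	while (o) {
		struct obj *next = o->next;

		free(o);
		o = next;
	}
}

static void obj_free(struct obj *o)
{
	o->next = free_list;
	free_list = o;
	if (++free_list_size > FREE_LIST_MAX)
		drain_free_objs();
}
```

Freeing 33 objects triggers exactly one batched flush instead of 33
individual deferred callbacks, which is the regression the patch targets.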

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   |  1 +
 include/linux/mm_types.h | 19 +--
 kernel/fork.c| 68 +++-
 mm/init-mm.c |  3 ++
 mm/mmap.c|  1 +
 5 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3158f33e268c..50c7a6dd9c7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -250,6 +250,7 @@ void setup_initial_init_mm(void *start_code, void *end_code,
 struct vm_area_struct *vm_area_alloc(struct mm_struct *);
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *);
 void vm_area_free(struct vm_area_struct *);
+void drain_free_vmas(struct mm_struct *mm);
 
 #ifndef CONFIG_MMU
 extern struct rb_root nommu_region_tree;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fce9113d979c..c0e6c8e4700b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -592,8 +592,18 @@ struct vm_area_struct {
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
   units */
-   struct file * vm_file;  /* File we map to (can be NULL). */
-   void * vm_private_data; /* was vm_pte (shared mem) */
+   union {
+   struct {
+   /* File we map to (can be NULL). */
+   struct file *vm_file;
+
+   /* was vm_pte (shared mem) */
+   void *vm_private_data;
+   };
+#ifdef CONFIG_PER_VMA_LOCK
+   struct list_head vm_free_list;
+#endif
+   };
 
 #ifdef CONFIG_ANON_VMA_NAME
/*
@@ -693,6 +703,11 @@ struct mm_struct {
  */
 #ifdef CONFIG_PER_VMA_LOCK
int mm_lock_seq;
+   struct {
+   struct list_head head;
+   spinlock_t lock;
+   int size;
+   } vma_free_list;
 #endif
 
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d9f14e55ecf..97f2b751f88d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -481,26 +481,75 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
 }
 
 #ifdef CONFIG_PER_VMA_LOCK
-static void __vm_area_free(struct rcu_head *head)
+static inline void __vm_area_free(struct vm_area_struct *vma)
 {
-   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
- vm_rcu);
/* The vma should either have no lock holders or be write-locked. */
vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
 }
-#endif
+
+static void vma_free_rcu_callback(struct rcu_head *head)
+{
+   struct vm_area_struct *first_vma;
+   struct vm_area_struct *vma, *vma2;
+
+   first_vma = container_of(head, struct vm_area_struct, vm_rcu);
+   list_for_each_entry_safe(vma, vma2, &first_vma->vm_free_list, vm_free_list)
+   __vm_area_free(vma);
+   __vm_area_free(first_vma);
+}
+
+void drain_free_vmas(struct mm_struct *mm)
+{
+   struct vm_area_struct *first_vma;
+   LIST_HEAD(to_destroy);
+
+   spin_lock(&mm->vma_free_list.lock);
+   list_splice_init(&mm->vma_free_list.head, &to_destroy);
+   mm->vma_free_list.size = 0;
+   spin_unlock(&mm->vma_free_list.lock);
+
+   if (list_empty(&to_destroy))
+   return;
+
+   first_vma = list_first_entry(&to_destroy, struct vm_area_struct, vm_free_list);
+   /* Remove the head which is allocated on the stack */
+   list_del(&to_destroy);
+
+   call_rcu(&first_vma->vm_rcu, vma_free_rcu_callback);
+}
+
+#define VM_AREA_FREE_LIST_MAX  32
+
+void vm_area_free(struct vm_area_struct *vma)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   bool drain;
+
+   free_anon_vma_name(vma);
+
+   spin_lock(&mm->vma_free_list.lock);
+   list_add(&vma->vm_free_list, &mm->vma_free_list.head);
+   mm->vma_free_list.size++;
+   drain = mm->vma_free_list.size > VM_AREA_FREE_LIST_MAX;
+   spin_unlock(&mm->vma_free_list.lock);
+
+   if (drain)
+   drain_free_vmas(mm);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+void drain_free_vmas(struct mm_struct *mm) {}
 
 void vm_area_free(struct vm_area_struct *vma)
 {
free_anon_vma_name(vma);
-#ifdef CONFIG_PER_VMA_LOCK
-   call_rcu(&vma->vm_rcu, __vm_area_free);
-#else
kmem_cache_free(vm_area_cachep, vma);
-#endif
 }
 
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static void account_kernel_stack(struct task_struct *tsk, int account)
 {
if (IS_ENABLED(CONFIG_VMAP_STACK)) {
@@ -1150,6 +1199,9 @@ static struct 

[PATCH 38/41] mm: avoid assertion in untrack_pfn

2023-01-09 Thread Suren Baghdasaryan
untrack_pfn can be called after VMA was isolated and mmap_lock downgraded.
An attempt to lock affected VMA would cause an assertion, therefore
use mod_vm_flags_nolock in such situations.

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/mm/pat/memtype.c | 10 +++---
 include/linux/mm.h|  2 +-
 include/linux/pgtable.h   |  5 +++--
 mm/memory.c   | 15 ---
 mm/memremap.c |  4 ++--
 mm/mmap.c |  4 ++--
 6 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 9e490a372896..f71c8381430b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -1045,7 +1045,7 @@ void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot, pfn_t pfn)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-unsigned long size)
+unsigned long size, bool lock_vma)
 {
resource_size_t paddr;
unsigned long prot;
@@ -1064,8 +1064,12 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned 
long pfn,
size = vma->vm_end - vma->vm_start;
}
free_pfn_range(paddr, size);
-   if (vma)
-   clear_vm_flags(vma, VM_PAT);
+   if (vma) {
+   if (lock_vma)
+   clear_vm_flags(vma, VM_PAT);
+   else
+   mod_vm_flags_nolock(vma, 0, VM_PAT);
+   }
 }
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7d436a5027cc..3158f33e268c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2135,7 +2135,7 @@ void zap_page_range_single(struct vm_area_struct *vma, 
unsigned long address,
   unsigned long size, struct zap_details *details);
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *start_vma, unsigned long start,
-   unsigned long end);
+   unsigned long end, bool lock_vma);
 
 struct mmu_notifier_range;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..eaa831bd675d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1214,7 +1214,8 @@ static inline int track_pfn_copy(struct vm_area_struct 
*vma)
  * can be for the entire vma (in which case pfn, size are zero).
  */
 static inline void untrack_pfn(struct vm_area_struct *vma,
-  unsigned long pfn, unsigned long size)
+  unsigned long pfn, unsigned long size,
+  bool lock_vma)
 {
 }
 
@@ -1232,7 +1233,7 @@ extern void track_pfn_insert(struct vm_area_struct *vma, 
pgprot_t *prot,
 pfn_t pfn);
 extern int track_pfn_copy(struct vm_area_struct *vma);
 extern void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
-   unsigned long size);
+   unsigned long size, bool lock_vma);
 extern void untrack_pfn_moved(struct vm_area_struct *vma);
 #endif
 
diff --git a/mm/memory.c b/mm/memory.c
index 12508f4d845a..5c7d5eaa60d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1610,7 +1610,7 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr,
-   struct zap_details *details)
+   struct zap_details *details, bool lock_vma)
 {
unsigned long start = max(vma->vm_start, start_addr);
unsigned long end;
@@ -1625,7 +1625,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
uprobe_munmap(vma, start, end);
 
if (unlikely(vma->vm_flags & VM_PFNMAP))
-   untrack_pfn(vma, 0, 0);
+   untrack_pfn(vma, 0, 0, lock_vma);
 
if (start != end) {
if (unlikely(is_vm_hugetlb_page(vma))) {
@@ -1672,7 +1672,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  */
 void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
struct vm_area_struct *vma, unsigned long start_addr,
-   unsigned long end_addr)
+   unsigned long end_addr, bool lock_vma)
 {
struct mmu_notifier_range range;
struct zap_details details = {
@@ -1686,7 +1686,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree 
*mt,
start_addr, end_addr);
mmu_notifier_invalidate_range_start(&range);
do {
-   unmap_single_vma(tlb, vma, start_addr, end_addr, &details);
+   unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
+lock_vma);
} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
 }
@@ -1715,7 +1716,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned 
long start,

[PATCH 37/41] mm: introduce mod_vm_flags_nolock

2023-01-09 Thread Suren Baghdasaryan
In cases when VMA flags are modified after VMA was isolated and mmap_lock
was downgraded, flags modifications do not require per-VMA locking and
an attempt to lock the VMA would result in an assertion because mmap
write lock is not held.
Introduce mod_vm_flags_nolock to be used in such situation.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e3be1d45371..7d436a5027cc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -743,6 +743,14 @@ void clear_vm_flags(struct vm_area_struct *vma, unsigned 
long flags)
vma->vm_flags &= ~flags;
 }
 
+static inline
+void mod_vm_flags_nolock(struct vm_area_struct *vma,
+ unsigned long set, unsigned long clear)
+{
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline
 void mod_vm_flags(struct vm_area_struct *vma,
  unsigned long set, unsigned long clear)
-- 
2.39.0



[PATCH 36/41] powerpc/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
From: Laurent Dufour 

Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
Copied from "x86/mm: try VMA lock-based page fault handling first"

Signed-off-by: Laurent Dufour 
Signed-off-by: Suren Baghdasaryan 
---
 arch/powerpc/mm/fault.c| 41 ++
 arch/powerpc/platforms/powernv/Kconfig |  1 +
 arch/powerpc/platforms/pseries/Kconfig |  1 +
 3 files changed, 43 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 2bef19cc1b98..f92f8956d5f2 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, address);
+   if (!vma)
+   goto lock_mmap;
+
+   if (unlikely(access_pkey_error(is_write, is_exec,
+  (error_code & DSISR_KEYFAULT), vma))) {
+   int rc = bad_access_pkey(regs, address, vma);
+
+   vma_read_unlock(vma);
+   return rc;
+   }
+
+   if (unlikely(access_error(is_write, is_exec, vma))) {
+   int rc = bad_access(regs, address);
+
+   vma_read_unlock(vma);
+   return rc;
+   }
+
+   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   if (fault_signal_pending(fault, regs))
+   return user_mode(regs) ? 0 : SIGBUS;
+
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
mmap_read_unlock(current->mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
diff --git a/arch/powerpc/platforms/powernv/Kconfig 
b/arch/powerpc/platforms/powernv/Kconfig
index ae248a161b43..70a46acc70d6 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -16,6 +16,7 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
+   select ARCH_SUPPORTS_PER_VMA_LOCK
default y
 
 config OPAL_PRD
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index a3b4d99567cb..e036a04ff1ca 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -21,6 +21,7 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
+   select ARCH_SUPPORTS_PER_VMA_LOCK
default y
 
 config PARAVIRT
-- 
2.39.0



[PATCH 35/41] arm64/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.

Signed-off-by: Suren Baghdasaryan 
---
 arch/arm64/Kconfig|  1 +
 arch/arm64/mm/fault.c | 36 
 2 files changed, 37 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 03934808b2ed..829fa6d14a36 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -95,6 +95,7 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
+   select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 596f46dabe4e..833fa8bab291 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
unsigned long addr = untagged_addr(far);
+#ifdef CONFIG_PER_VMA_LOCK
+   struct vm_area_struct *vma;
+#endif
 
if (kprobe_page_fault(regs, esr))
return 0;
@@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(mm_flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, addr);
+   if (!vma)
+   goto lock_mmap;
+
+   if (!(vma->vm_flags & vm_flags)) {
+   vma_read_unlock(vma);
+   goto lock_mmap;
+   }
+   fault = handle_mm_fault(vma, addr & PAGE_MASK,
+   mm_flags | FAULT_FLAG_VMA_LOCK, regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   goto no_context;
+   return 0;
+   }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
/*
 * As per x86, we may deadlock here. However, since the kernel only
 * validly references user space from well defined areas of the code,
@@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, 
unsigned long esr,
}
mmap_read_unlock(mm);
 
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
/*
 * Handle the "normal" (no error) case first.
 */
-- 
2.39.0



[PATCH 34/41] x86/mm: try VMA lock-based page fault handling first

2023-01-09 Thread Suren Baghdasaryan
Attempt VMA lock-based page fault handling first, and fall back to the
existing mmap_lock-based handling if that fails.
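The shape of this arch change, shared by the x86, arm64, and powerpc
patches in the series, is a fast path that is tried first and a slow path
taken on any miss or retry. A minimal sketch, with made-up names and plain
counters standing in for the VMA_LOCK_SUCCESS/VMA_LOCK_RETRY statistics:

```c
#include <assert.h>
#include <stdbool.h>

enum fault_result { FAULT_DONE, FAULT_RETRY };

static int fast_path_calls;	/* stands in for VMA_LOCK_SUCCESS/RETRY accounting */
static int slow_path_calls;	/* stands in for the mmap_lock fallback */

/* Hypothetical per-VMA-lock handler: succeeds only when fast_ok is set,
 * modeling "VMA found, locked, and fault handled without retry". */
static enum fault_result fast_fault(bool fast_ok)
{
	fast_path_calls++;
	return fast_ok ? FAULT_DONE : FAULT_RETRY;
}

/* Hypothetical mmap_lock-based handler: always completes the fault. */
static enum fault_result slow_fault(void)
{
	slow_path_calls++;
	return FAULT_DONE;
}

static enum fault_result handle_fault(bool fast_ok)
{
	if (fast_fault(fast_ok) == FAULT_DONE)
		return FAULT_DONE;	/* fast path won; slow path skipped */
	return slow_fault();		/* retry under mmap_lock */
}
```

The point of the structure is that the slow path is only entered when the
fast path could not complete, so uncontended faults never touch mmap_lock.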

Signed-off-by: Suren Baghdasaryan 
---
 arch/x86/Kconfig|  1 +
 arch/x86/mm/fault.c | 36 
 2 files changed, 37 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3604074a878b..3647f7bdb110 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
# Options that are inherently 64-bit kernel only:
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+   select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 7b0d4ab894c8..983266e7c49b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -19,6 +19,7 @@
#include <linux/uaccess.h>  /* faulthandler_disabled()  */
#include <linux/efi.h>  /* efi_crash_gracefully_on_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/mm.h>  /* find_and_lock_vma() */

#include <asm/cpufeature.h> /* boot_cpu_has, ...*/
#include <asm/traps.h>  /* dotraplinkage, ...   */
@@ -1354,6 +1355,38 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK
+   if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+   goto lock_mmap;
+
+   vma = lock_vma_under_rcu(mm, address);
+   if (!vma)
+   goto lock_mmap;
+
+   if (unlikely(access_error(error_code, vma))) {
+   vma_read_unlock(vma);
+   goto lock_mmap;
+   }
+   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
+   vma_read_unlock(vma);
+
+   if (!(fault & VM_FAULT_RETRY)) {
+   count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
+   goto done;
+   }
+   count_vm_vma_lock_event(VMA_LOCK_RETRY);
+
+   /* Quick path to respond to signals */
+   if (fault_signal_pending(fault, regs)) {
+   if (!user_mode(regs))
+   kernelmode_fixup_or_oops(regs, error_code, address,
+SIGBUS, BUS_ADRERR,
+ARCH_DEFAULT_PKEY);
+   return;
+   }
+lock_mmap:
+#endif /* CONFIG_PER_VMA_LOCK */
+
/*
 * Kernel-mode access to the user address space should only occur
 * on well-defined single instructions listed in the exception
@@ -1454,6 +1487,9 @@ void do_user_addr_fault(struct pt_regs *regs,
}
 
mmap_read_unlock(mm);
+#ifdef CONFIG_PER_VMA_LOCK
+done:
+#endif
if (likely(!(fault & VM_FAULT_ERROR)))
return;
 
-- 
2.39.0



[PATCH 33/41] mm: introduce per-VMA lock statistics

2023-01-09 Thread Suren Baghdasaryan
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra
statistics about handling page faults under VMA lock.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/vm_event_item.h | 6 ++
 include/linux/vmstat.h| 6 ++
 mm/Kconfig.debug  | 8 
 mm/vmstat.c   | 6 ++
 4 files changed, 26 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 7f5d1caf5890..8abfa1240040 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+#endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+   VMA_LOCK_SUCCESS,
+   VMA_LOCK_ABORT,
+   VMA_LOCK_RETRY,
+   VMA_LOCK_MISS,
 #endif
NR_VM_EVENT_ITEMS
 };
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 19cf5b6892ce..fed855bae6d8 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
 #endif
 
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+#define count_vm_vma_lock_event(x) count_vm_event(x)
+#else
+#define count_vm_vma_lock_event(x) do {} while (0)
+#endif
+
 #define __count_zid_vm_events(item, zid, delta) \
__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..32a93b064590 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -207,3 +207,11 @@ config PTDUMP_DEBUGFS
  kernel.
 
  If in doubt, say N.
+
+
+config PER_VMA_LOCK_STATS
+   bool "Statistics for per-vma locks"
+   depends on PER_VMA_LOCK
+   default y
+   help
+ Statistics for per-vma locks.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1ea6a5ce1c41..4f1089a1860e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = {
"direct_map_level2_splits",
"direct_map_level3_splits",
 #endif
+#ifdef CONFIG_PER_VMA_LOCK_STATS
+   "vma_lock_success",
+   "vma_lock_abort",
+   "vma_lock_retry",
+   "vma_lock_miss",
+#endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
-- 
2.39.0



[PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-09 Thread Suren Baghdasaryan
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan 
Suggested-by: Peter Xu 
---
 mm/memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 20806bc8b4eb..12508f4d845a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
mm_struct *mm,
if (!vma->anon_vma)
goto inval;
 
+   /*
+* Due to the possibility of userfault handler dropping mmap_lock, avoid
+* it for now and fall back to page fault handling under mmap_lock.
+*/
+   if (userfaultfd_armed(vma))
+   goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
 
-- 
2.39.0



[PATCH 31/41] mm: prevent do_swap_page from handling page faults under VMA lock

2023-01-09 Thread Suren Baghdasaryan
Due to the possibility of do_swap_page dropping mmap_lock, abort fault
handling under VMA lock and retry holding mmap_lock. This can be handled
more gracefully in the future.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 mm/memory.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 2560524ad7f4..20806bc8b4eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3707,6 +3707,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
 
+   if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+   ret = VM_FAULT_RETRY;
+   goto out;
+   }
+
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
-- 
2.39.0



[PATCH 30/41] mm: add FAULT_FLAG_VMA_LOCK flag

2023-01-09 Thread Suren Baghdasaryan
Add a new flag to distinguish page faults handled under protection of
per-vma lock.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 include/linux/mm.h   | 3 ++-
 include/linux/mm_types.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d0fddf6a1de9..2e3be1d45371 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -467,7 +467,8 @@ static inline bool fault_flag_allow_retry_first(enum 
fault_flag flags)
{ FAULT_FLAG_USER,  "USER" }, \
{ FAULT_FLAG_REMOTE,"REMOTE" }, \
{ FAULT_FLAG_INSTRUCTION,   "INSTRUCTION" }, \
-   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+   { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+   { FAULT_FLAG_VMA_LOCK,  "VMA_LOCK" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0d27edd3e63a..fce9113d979c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1103,6 +1103,7 @@ enum fault_flag {
FAULT_FLAG_INTERRUPTIBLE =  1 << 9,
FAULT_FLAG_UNSHARE =1 << 10,
FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
+   FAULT_FLAG_VMA_LOCK =   1 << 12,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
-- 
2.39.0



[PATCH 29/41] mm: fall back to mmap_lock if vma->anon_vma is not yet set

2023-01-09 Thread Suren Baghdasaryan
When vma->anon_vma is not set, pagefault handler will set it by either
reusing anon_vma of an adjacent VMA if VMAs are compatible or by
allocating a new one. find_mergeable_anon_vma() walks VMA tree to find
a compatible adjacent VMA and that requires not only the faulting VMA
to be stable but also the tree structure and other VMAs inside that tree.
Therefore locking just the faulting VMA is not enough for this search.
Fall back to taking mmap_lock when vma->anon_vma is not set. This
situation happens only on the first page fault and should not affect
overall performance.

Signed-off-by: Suren Baghdasaryan 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index a658e26d965d..2560524ad7f4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5264,6 +5264,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
mm_struct *mm,
if (!vma_is_anonymous(vma))
goto inval;
 
+   /* find_mergeable_anon_vma uses adjacent vmas which are not locked */
+   if (!vma->anon_vma)
+   goto inval;
+
if (!vma_read_trylock(vma))
goto inval;
 
-- 
2.39.0



[PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-09 Thread Suren Baghdasaryan
Introduce lock_vma_under_rcu function to lookup and lock a VMA during
page fault handling. When the VMA is not found, cannot be locked, or changes
after being locked, the function returns NULL. The lookup is performed
under RCU protection to prevent the found VMA from being destroyed before
the VMA lock is acquired. VMA lock statistics are updated according to
the results.
For now only anonymous VMAs can be searched this way. In other cases the
function returns NULL.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h |  3 +++
 mm/memory.c| 51 ++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c464fc8a514c..d0fddf6a1de9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -687,6 +687,9 @@ static inline void vma_assert_no_reader(struct 
vm_area_struct *vma)
  vma);
 }
 
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address);
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
diff --git a/mm/memory.c b/mm/memory.c
index 9ece18548db1..a658e26d965d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5242,6 +5242,57 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, 
unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
+ * stable and not isolated. If the VMA is not found or is being modified the
+ * function returns NULL.
+ */
+struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+   MA_STATE(mas, &mm->mm_mt, address, address);
+   struct vm_area_struct *vma, *validate;
+
+   rcu_read_lock();
+   vma = mas_walk(&mas);
+retry:
+   if (!vma)
+   goto inval;
+
+   /* Only anonymous vmas are supported for now */
+   if (!vma_is_anonymous(vma))
+   goto inval;
+
+   if (!vma_read_trylock(vma))
+   goto inval;
+
+   /* Check since vm_start/vm_end might change before we lock the VMA */
+   if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
+   vma_read_unlock(vma);
+   goto inval;
+   }
+
+   /* Check if the VMA got isolated after we found it */
+   mas.index = address;
+   validate = mas_walk(&mas);
+   if (validate != vma) {
+   vma_read_unlock(vma);
+   count_vm_vma_lock_event(VMA_LOCK_MISS);
+   /* The area was replaced with another one. */
+   vma = validate;
+   goto retry;
+   }
+
+   rcu_read_unlock();
+   return vma;
+inval:
+   rcu_read_unlock();
+   count_vm_vma_lock_event(VMA_LOCK_ABORT);
+   return NULL;
+}
+#endif /* CONFIG_PER_VMA_LOCK */
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.39.0



[PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

2023-01-09 Thread Suren Baghdasaryan
Page fault handlers might need to fire MMU notifications while a new
notifier is being registered. Modify mm_take_all_locks to write-lock all
VMAs and prevent this race with fault handlers that would hold VMA locks.
VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
locking order as in page fault handlers.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 30c7d1c5206e..a256deca0bc0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct 
address_space *mapping)
  * of mm/rmap.c:
  *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
  * hugetlb mapping);
+ *   - all vmas marked locked
  *   - all i_mmap_rwsem locks;
  *   - all anon_vma->rwseml
  *
@@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
mas_for_each(&mas, vma, ULONG_MAX) {
if (signal_pending(current))
goto out_unlock;
+   vma_write_lock(vma);
if (vma->vm_file && vma->vm_file->f_mapping &&
is_vm_hugetlb_page(vma))
vm_lock_mapping(mm, vma->vm_file->f_mapping);
@@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
if (vma->vm_file && vma->vm_file->f_mapping)
vm_unlock_mapping(vma->vm_file->f_mapping);
}
+   vma_write_unlock_mm(mm);
 
mutex_unlock(&mm_all_locks_mutex);
 }
-- 
2.39.0



[PATCH 26/41] kernel/fork: assert no VMA readers during its destruction

2023-01-09 Thread Suren Baghdasaryan
Assert that there are no holders of the VMA lock for reading when the VMA
is about to be destroyed.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 8 
 kernel/fork.c  | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 594e835bad9c..c464fc8a514c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -680,6 +680,13 @@ static inline void vma_assert_write_locked(struct 
vm_area_struct *vma)
VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), 
vma);
 }
 
+static inline void vma_assert_no_reader(struct vm_area_struct *vma)
+{
+   VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock) &&
+ vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+ vma);
+}
+
 #else /* CONFIG_PER_VMA_LOCK */
 
 static inline void vma_init_lock(struct vm_area_struct *vma) {}
@@ -688,6 +695,7 @@ static inline bool vma_read_trylock(struct vm_area_struct 
*vma)
{ return false; }
 static inline void vma_read_unlock(struct vm_area_struct *vma) {}
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_no_reader(struct vm_area_struct *vma) {}
 
 #endif /* CONFIG_PER_VMA_LOCK */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1591dd8a0745..6d9f14e55ecf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -485,6 +485,8 @@ static void __vm_area_free(struct rcu_head *head)
 {
struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
  vm_rcu);
+   /* The vma should either have no lock holders or be write-locked. */
+   vma_assert_no_reader(vma);
kmem_cache_free(vm_area_cachep, vma);
 }
 #endif
-- 
2.39.0



[PATCH 25/41] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area

2023-01-09 Thread Suren Baghdasaryan
While unmapping VMAs, adjacent VMAs might be able to grow into the area
being unmapped. In such cases write-lock adjacent VMAs to prevent this
growth.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 0d767ce043af..30c7d1c5206e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2461,11 +2461,13 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
 */
if (downgrade) {
-   if (next && (next->vm_flags & VM_GROWSDOWN))
+   if (next && (next->vm_flags & VM_GROWSDOWN)) {
+   vma_write_lock(next);
downgrade = false;
-   else if (prev && (prev->vm_flags & VM_GROWSUP))
+   } else if (prev && (prev->vm_flags & VM_GROWSUP)) {
+   vma_write_lock(prev);
downgrade = false;
-   else
+   } else
mmap_write_downgrade(mm);
}
 
-- 
2.39.0



[PATCH 24/41] mm: conditionally write-lock VMA in free_pgtables

2023-01-09 Thread Suren Baghdasaryan
Normally free_pgtables() needs to lock the affected VMAs, except when the
VMAs were already isolated under the VMA write-lock. munmap() does just
that: it isolates the VMAs while holding the appropriate locks, then
downgrades mmap_lock and drops the per-VMA locks before freeing page tables.
Add a parameter to free_pgtables() and unmap_region() for that scenario.

Signed-off-by: Suren Baghdasaryan 
---
 mm/internal.h |  2 +-
 mm/memory.c   |  6 +-
 mm/mmap.c | 18 --
 3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..5ea4ff1a70e7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -87,7 +87,7 @@ void folio_activate(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
   struct vm_area_struct *start_vma, unsigned long floor,
-  unsigned long ceiling);
+  unsigned long ceiling, bool lock_vma);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
diff --git a/mm/memory.c b/mm/memory.c
index 2fabf89b2be9..9ece18548db1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb,
 
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
   struct vm_area_struct *vma, unsigned long floor,
-  unsigned long ceiling)
+  unsigned long ceiling, bool lock_vma)
 {
MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
@@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct 
maple_tree *mt,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   if (lock_vma)
+   vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
 
@@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct 
maple_tree *mt,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(&mas, ceiling - 1);
+   if (lock_vma)
+   vma_write_lock(vma);
unlink_anon_vmas(vma);
unlink_file_vma(vma);
}
diff --git a/mm/mmap.c b/mm/mmap.c
index be289e0b693b..0d767ce043af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -78,7 +78,7 @@ core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 
0644);
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next, unsigned long start,
-   unsigned long end);
+   unsigned long end, bool lock_vma);
 
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
@@ -2202,7 +2202,7 @@ static inline void remove_mt(struct mm_struct *mm, struct 
ma_state *mas)
 static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next,
-   unsigned long start, unsigned long end)
+   unsigned long start, unsigned long end, bool lock_vma)
 {
struct mmu_gather tlb;
 
@@ -2211,7 +2211,8 @@ static void unmap_region(struct mm_struct *mm, struct 
maple_tree *mt,
update_hiwater_rss(mm);
unmap_vmas(&tlb, mt, vma, start, end);
free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-next ? next->vm_start : USER_PGTABLES_CEILING);
+next ? next->vm_start : USER_PGTABLES_CEILING,
+lock_vma);
tlb_finish_mmu(&tlb);
 }
 
@@ -2468,7 +2469,11 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
mmap_write_downgrade(mm);
}
 
-   unmap_region(mm, &mas_detach, vma, prev, next, start, end);
+   /*
+* We can free page tables without locking the vmas because they were
+* isolated before we downgraded mmap_lock and dropped per-vma locks.
+*/
+   unmap_region(mm, &mas_detach, vma, prev, next, start, end, !downgrade);
/* Statistics and freeing VMAs */
mas_set(&mas_detach, start);
remove_mt(mm, &mas_detach);
@@ -2785,7 +2790,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vma->vm_file = NULL;
 
/* Undo any partial mapping done by a device driver. */
-   unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);
+   unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end,
+true);
if (file && (vm_flags & VM_SHARED))
mapping_unmap_writable(file->f_mapping);
 free_vma:
@@ -3130,7 +3136,7 @@ void exit_mmap(struct mm_struct *mm)
  

[PATCH 23/41] mm: write-lock VMAs before removing them from VMA tree

2023-01-09 Thread Suren Baghdasaryan
Write-locking VMAs before isolating them ensures that page fault
handlers don't operate on isolated VMAs.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c  | 2 ++
 mm/nommu.c | 5 +
 2 files changed, 7 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index da1908730828..be289e0b693b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -448,6 +448,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct 
ma_state *mas)
  */
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+   vma_write_lock(vma);
trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
@@ -2300,6 +2301,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct 
*vma,
 static inline int munmap_sidetree(struct vm_area_struct *vma,
   struct ma_state *mas_detach)
 {
+   vma_write_lock(vma);
mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;
diff --git a/mm/nommu.c b/mm/nommu.c
index b3154357ced5..7ae91337ef14 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -552,6 +552,7 @@ void vma_mas_store(struct vm_area_struct *vma, struct 
ma_state *mas)
 
 void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
 {
+   vma_write_lock(vma);
mas->index = vma->vm_start;
mas->last = vma->vm_end - 1;
mas_store_prealloc(mas, NULL);
@@ -1551,6 +1552,10 @@ void exit_mmap(struct mm_struct *mm)
mmap_write_lock(mm);
for_each_vma(vmi, vma) {
cleanup_vma_from_mm(vma);
+   /*
+* No need to lock the VMA because this is the only mm user and no
+* page fault handler can race with it.
+*/
delete_vma(mm, vma);
cond_resched();
}
-- 
2.39.0



[PATCH 22/41] mm/mremap: write-lock VMA while remapping it to a new address range

2023-01-09 Thread Suren Baghdasaryan
Write-lock the VMA before copying it, and write-lock the new VMA when
copy_vma() produces one.

Signed-off-by: Suren Baghdasaryan 
Reviewed-by: Laurent Dufour 
---
 mm/mmap.c   | 1 +
 mm/mremap.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index ff02cb51e7e7..da1908730828 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3261,6 +3261,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct 
**vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+   vma_write_lock(new_vma);
if (vma_link(mm, new_vma))
goto out_vma_link;
*need_rmap_locks = false;
diff --git a/mm/mremap.c b/mm/mremap.c
index 2ccdd1561f5b..d24a79bcb1a1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -622,6 +622,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
return -ENOMEM;
}
 
+   vma_write_lock(vma);
new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
   &need_rmap_locks);
-- 
2.39.0



[PATCH 21/41] mm/mmap: write-lock VMAs affected by VMA expansion

2023-01-09 Thread Suren Baghdasaryan
vma_expand changes VMA boundaries and might result in freeing an adjacent
VMA. Write-lock affected VMAs to prevent concurrent page faults.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1e2154137631..ff02cb51e7e7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -544,6 +544,7 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto nomem;
 
+   vma_write_lock(vma);
vma_adjust_trans_huge(vma, start, end, 0);
 
if (file) {
@@ -590,6 +591,7 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
}
 
if (remove_next) {
+   vma_write_lock(next);
if (file) {
uprobe_munmap(next, next->vm_start, next->vm_end);
fput(file);
-- 
2.39.0



[PATCH 20/41] mm/mmap: write-lock VMAs in vma_adjust

2023-01-09 Thread Suren Baghdasaryan
vma_adjust modifies a VMA and possibly its neighbors. Write-lock them
before making the modifications.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f6ca4a87f9e2..1e2154137631 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -614,6 +614,12 @@ inline int vma_expand(struct ma_state *mas, struct 
vm_area_struct *vma,
  * The following helper function should be used when such adjustments
  * are necessary.  The "insert" vma (if any) is to be inserted
  * before we drop the necessary locks.
+ * 'expand' vma is always locked before it's passed to __vma_adjust()
+ * from vma_merge() because vma should not change from the moment
+ * can_vma_merge_{before|after} decision is made.
+ * 'insert' vma is used only by __split_vma() and it's always a brand
+ * new vma which is not yet added into mm's vma tree, therefore no need
+ * to lock it.
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
@@ -633,6 +639,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
MA_STATE(mas, &mm->mm_mt, 0, 0);
struct vm_area_struct *exporter = NULL, *importer = NULL;
 
+   vma_write_lock(vma);
+   if (next)
+   vma_write_lock(next);
+
if (next && !insert) {
if (end >= next->vm_end) {
/*
@@ -676,8 +686,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
 * If next doesn't have anon_vma, import from vma after
 * next, if the vma overlaps with it.
 */
-   if (remove_next == 2 && !next->anon_vma)
+   if (remove_next == 2 && !next->anon_vma) {
exporter = next_next;
+   if (exporter)
+   vma_write_lock(exporter);
+   }
 
} else if (end > next->vm_start) {
/*
@@ -850,6 +863,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
if (remove_next == 2) {
remove_next = 1;
next = next_next;
+   if (next)
+   vma_write_lock(next);
goto again;
}
}
-- 
2.39.0



[PATCH 19/41] mm/mmap: write-lock VMAs before merging, splitting or expanding them

2023-01-09 Thread Suren Baghdasaryan
Decisions about whether VMAs can be merged, split or expanded must be
made while the VMAs are protected from changes which can affect that
decision. For example, vma_merge() uses vma->anon_vma when deciding
whether a VMA can be merged, while the page fault handler changes
vma->anon_vma during a COW operation.
Write-lock all VMAs which might be affected by a merge or split operation
before deciding how such operations should be performed.

Not sure if expansion really needs this, just being paranoid. Otherwise
mmap_region() and vm_brk_flags() might not be locking correctly.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 53d885e70a54..f6ca4a87f9e2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -254,8 +254,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 */
mas_set(&mas, oldbrk);
next = mas_find(&mas, newbrk - 1 + PAGE_SIZE + stack_guard_gap);
-   if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
-   goto out;
+   if (next) {
+   vma_write_lock(next);
+   if (newbrk + PAGE_SIZE > vm_start_gap(next))
+   goto out;
+   }
 
brkvma = mas_prev(&mas, mm->start_brk);
/* Ok, looks good - let it rip. */
@@ -1017,10 +1020,17 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
if (vm_flags & VM_SPECIAL)
return NULL;
 
+   if (prev)
+   vma_write_lock(prev);
next = find_vma(mm, prev ? prev->vm_end : 0);
mid = next;
-   if (next && next->vm_end == end)/* cases 6, 7, 8 */
+   if (next)
+   vma_write_lock(next);
+   if (next && next->vm_end == end) {  /* cases 6, 7, 8 */
next = find_vma(mm, next->vm_end);
+   if (next)
+   vma_write_lock(next);
+   }
 
/* verify some invariant that must be enforced by the caller */
VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -2198,6 +2208,7 @@ int __split_vma(struct mm_struct *mm, struct 
vm_area_struct *vma,
int err;
validate_mm_mt(mm);
 
+   vma_write_lock(vma);
if (vma->vm_ops && vma->vm_ops->may_split) {
err = vma->vm_ops->may_split(vma, addr);
if (err)
@@ -2564,6 +2575,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
 
/* Attempt to expand an old mapping */
/* Check next */
+   if (next)
+   vma_write_lock(next);
if (next && next->vm_start == end && !vma_policy(next) &&
can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
 NULL_VM_UFFD_CTX, NULL)) {
@@ -2573,6 +2586,8 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
}
 
/* Check prev */
+   if (prev)
+   vma_write_lock(prev);
if (prev && prev->vm_end == addr && !vma_policy(prev) &&
(vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
   pgoff, vma->vm_userfaultfd_ctx, NULL) :
@@ -2942,6 +2957,8 @@ static int do_brk_flags(struct ma_state *mas, struct 
vm_area_struct *vma,
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;
 
+   if (vma)
+   vma_write_lock(vma);
/*
 * Expand the existing vma if possible; Note that singular lists do not
 * occur after forking, so the expand will only happen on new VMAs.
-- 
2.39.0



[PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-09 Thread Suren Baghdasaryan
Protect the VMA from the concurrent page fault handler while collapsing
a huge page. The page fault handler needs a stable PMD to use the PTL and
relies on the per-VMA lock to prevent concurrent PMD changes.
pmdp_collapse_flush(), set_huge_pmd() and collapse_and_free_pmd() can
modify a PMD, which will not be detected by a page fault handler without
proper locking.

Signed-off-by: Suren Baghdasaryan 
---
 mm/khugepaged.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5376246a3052..d8d0647f0c2c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1032,6 +1032,7 @@ static int collapse_huge_page(struct mm_struct *mm, 
unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
 
+   vma_write_lock(vma);
anon_vma_lock_write(vma->anon_vma);
 
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1503,6 +1504,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
goto drop_hpage;
}
 
+   /* Lock the vma before taking i_mmap and page table locks */
+   vma_write_lock(vma);
+
/*
 * We need to lock the mapping so that from here on, only GUP-fast and
 * hardware page walks can access the parts of the page tables that
@@ -1690,6 +1694,7 @@ static int retract_page_tables(struct address_space 
*mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+   vma_write_lock(vma);
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
-- 
2.39.0



[PATCH 17/41] mm/mmap: move VMA locking before anon_vma_lock_write call

2023-01-09 Thread Suren Baghdasaryan
Move VMA flag modification (which now implies VMA locking) before
anon_vma_lock_write to match the locking order of page fault handler.

Signed-off-by: Suren Baghdasaryan 
---
 mm/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index fa994ae903d9..53d885e70a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2953,13 +2953,13 @@ static int do_brk_flags(struct ma_state *mas, struct 
vm_area_struct *vma,
if (mas_preallocate(mas, vma, GFP_KERNEL))
goto unacct_fail;
 
+   set_vm_flags(vma, VM_SOFTDIRTY);
vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
if (vma->anon_vma) {
anon_vma_lock_write(vma->anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}
vma->vm_end = addr + len;
-   set_vm_flags(vma, VM_SOFTDIRTY);
mas_store_prealloc(mas, vma);
 
if (vma->anon_vma) {
-- 
2.39.0



[PATCH 16/41] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-09 Thread Suren Baghdasaryan
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
vm_flags modification attempts.

Signed-off-by: Suren Baghdasaryan 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 5 -
 arch/s390/mm/gmap.c| 5 -
 mm/khugepaged.c| 2 ++
 mm/ksm.c   | 2 ++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1d67baa5557a..325a7a47d348 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -393,6 +393,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 {
unsigned long gfn = memslot->base_gfn;
unsigned long end, start = gfn_to_hva(kvm, gfn);
+   unsigned long vm_flags;
int ret = 0;
struct vm_area_struct *vma;
int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
@@ -409,12 +410,14 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
ret = H_STATE;
break;
}
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- merge_flag, &vma->vm_flags);
+ merge_flag, &vm_flags);
if (ret) {
ret = H_STATE;
break;
}
+   reset_vm_flags(vma, vm_flags);
start = vma->vm_end;
} while (end > vma->vm_end);
 
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3811d6c86d09..e47387f8be6d 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2587,14 +2587,17 @@ int gmap_mark_unmergeable(void)
 {
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+   unsigned long vm_flags;
int ret;
VMA_ITERATOR(vmi, mm, 0);
 
for_each_vma(vmi, vma) {
+   vm_flags = vma->vm_flags;
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
- MADV_UNMERGEABLE, &vma->vm_flags);
+ MADV_UNMERGEABLE, &vm_flags);
if (ret)
return ret;
+   reset_vm_flags(vma, vm_flags);
}
mm->def_flags &= ~VM_MERGEABLE;
return 0;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..5376246a3052 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -352,6 +352,8 @@ struct attribute_group khugepaged_attr_group = {
 int hugepage_madvise(struct vm_area_struct *vma,
 unsigned long *vm_flags, int advice)
 {
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_HUGEPAGE:
 #ifdef CONFIG_S390
diff --git a/mm/ksm.c b/mm/ksm.c
index dd02780c387f..d05c41b289db 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2471,6 +2471,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long 
start,
struct mm_struct *mm = vma->vm_mm;
int err;
 
+   /* vma->vm_flags can be changed only using modifier functions */
+   BUG_ON(vm_flags == &vma->vm_flags);
switch (advice) {
case MADV_MERGEABLE:
/*
-- 
2.39.0



[PATCH 15/41] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-09 Thread Suren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

Signed-off-by: Suren Baghdasaryan 
---
 arch/arm/kernel/process.c  |  2 +-
 arch/ia64/mm/init.c|  8 
 arch/loongarch/include/asm/tlb.h   |  2 +-
 arch/powerpc/kvm/book3s_xive_native.c  |  2 +-
 arch/powerpc/mm/book3s64/subpage_prot.c|  2 +-
 arch/powerpc/platforms/book3s/vas-api.c|  2 +-
 arch/powerpc/platforms/cell/spufs/file.c   | 14 +++---
 arch/s390/mm/gmap.c|  3 +--
 arch/x86/entry/vsyscall/vsyscall_64.c  |  2 +-
 arch/x86/kernel/cpu/sgx/driver.c   |  2 +-
 arch/x86/kernel/cpu/sgx/virt.c |  2 +-
 arch/x86/mm/pat/memtype.c  |  6 +++---
 arch/x86/um/mem_32.c   |  2 +-
 drivers/acpi/pfr_telemetry.c   |  2 +-
 drivers/android/binder.c   |  3 +--
 drivers/char/mspec.c   |  2 +-
 drivers/crypto/hisilicon/qm.c  |  2 +-
 drivers/dax/device.c   |  2 +-
 drivers/dma/idxd/cdev.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c|  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_doorbell.c  |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_events.c|  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c   |  4 ++--
 drivers/gpu/drm/drm_gem.c  |  2 +-
 drivers/gpu/drm/drm_gem_dma_helper.c   |  3 +--
 drivers/gpu/drm/drm_gem_shmem_helper.c |  2 +-
 drivers/gpu/drm/drm_vm.c   |  8 
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |  2 +-
 drivers/gpu/drm/exynos/exynos_drm_gem.c|  4 ++--
 drivers/gpu/drm/gma500/framebuffer.c   |  2 +-
 drivers/gpu/drm/i810/i810_dma.c|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c   |  4 ++--
 drivers/gpu/drm/mediatek/mtk_drm_gem.c |  2 +-
 drivers/gpu/drm/msm/msm_gem.c  |  2 +-
 drivers/gpu/drm/omapdrm/omap_gem.c |  3 +--
 drivers/gpu/drm/rockchip/rockchip_drm_gem.c|  3 +--
 drivers/gpu/drm/tegra/gem.c|  5 ++---
 drivers/gpu/drm/ttm/ttm_bo_vm.c|  3 +--
 drivers/gpu/drm/virtio/virtgpu_vram.c  |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_glue.c   |  2 +-
 drivers/gpu/drm/xen/xen_drm_front_gem.c|  3 +--
 drivers/hsi/clients/cmt_speech.c   |  2 +-
 drivers/hwtracing/intel_th/msu.c   |  2 +-
 drivers/hwtracing/stm/core.c   |  2 +-
 drivers/infiniband/hw/hfi1/file_ops.c  |  4 ++--
 drivers/infiniband/hw/mlx5/main.c  |  4 ++--
 drivers/infiniband/hw/qib/qib_file_ops.c   | 13 ++---
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c   |  2 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_verbs.c|  2 +-
 .../media/common/videobuf2/videobuf2-dma-contig.c  |  2 +-
 drivers/media/common/videobuf2/videobuf2-vmalloc.c |  2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c  |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c  |  4 ++--
 drivers/media/v4l2-core/videobuf-vmalloc.c |  2 +-
 drivers/misc/cxl/context.c |  2 +-
 drivers/misc/habanalabs/common/memory.c|  2 +-
 drivers/misc/habanalabs/gaudi/gaudi.c  |  4 ++--
 drivers/misc/habanalabs/gaudi2/gaudi2.c|  8 
 drivers/misc/habanalabs/goya/goya.c|  4 ++--
 drivers/misc/ocxl/context.c|  4 ++--
 drivers/misc/ocxl/sysfs.c  |  2 +-
 drivers/misc/open-dice.c   |  6 +++---
 drivers/misc/sgi-gru/grufile.c |  4 ++--
 drivers/misc/uacce/uacce.c |  2 +-
 drivers/sbus/char/oradax.c |  2 +-
 drivers/scsi/cxlflash/ocxl_hw.c|  2 +-
 drivers/scsi/sg.c  |  2 +-
 drivers/staging/media/atomisp/pci/hmm/hmm_bo.c |  2 +-
 drivers/staging/media/deprecated/meye/meye.c   |  4 ++--
 .../media/deprecated/stkwebcam/stk-webcam.c|  2 +-
 drivers/target/target_core_user.c  |  2 +-
 drivers/uio/uio.c  |  2 +-
 drivers/usb/core/devio.c   |  3 +--
 drivers/usb/mon/mon_bin.c  |  3 +--
 drivers/vdpa/vdpa_user/iova_domain.c   |  2 +-
 drivers/vfio/pci/vfio_pci_core.c   |  2 +-
 drivers/vhost/vdpa.c   |  2 +-
 drivers/video/fbdev/68328fb.c   

[PATCH 14/41] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-09 Thread Suren Baghdasaryan
To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
replace it with VM_LOCKED_MASK bitmask and convert all users.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h | 4 ++--
 kernel/fork.c  | 2 +-
 mm/hugetlb.c   | 4 ++--
 mm/mlock.c | 6 +++---
 mm/mmap.c  | 6 +++---
 mm/mremap.c| 2 +-
 6 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35cf0a6cbcc2..2b16d45b75a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,8 +416,8 @@ extern unsigned int kobjsize(const void *objp);
 /* This mask defines which mm->def_flags a process can inherit its parent */
 #define VM_INIT_DEF_MASK   VM_NOHUGEPAGE
 
-/* This mask is used to clear all the VMA flags used by mlock */
-#define VM_LOCKED_CLEAR_MASK   (~(VM_LOCKED | VM_LOCKONFAULT))
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
 
 /* Arch-specific flags to clear when updating VM flags on protection change */
 #ifndef VM_ARCH_CLEAR
diff --git a/kernel/fork.c b/kernel/fork.c
index c026d75108b3..1591dd8a0745 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,7 +674,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
tmp->anon_vma = NULL;
} else if (anon_vma_fork(tmp, mpnt))
goto fail_nomem_anon_vma_fork;
-   tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
+   clear_vm_flags(tmp, VM_LOCKED_MASK);
file = tmp->vm_file;
if (file) {
struct address_space *mapping = file->f_mapping;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db895230ee7e..24861cbfa2b1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6950,8 +6950,8 @@ static unsigned long page_table_shareable(struct 
vm_area_struct *svma,
unsigned long s_end = sbase + PUD_SIZE;
 
/* Allow segments to share if only one is marked locked */
-   unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
-   unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
+   unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
 
/*
 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6dd0ce1..06aa9e204fac 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -490,7 +490,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t 
len,
prev = mas_prev(&mas, 0);
 
for (nstart = start ; ; ) {
-   vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   vm_flags_t newflags = vma->vm_flags & ~VM_LOCKED_MASK;
 
newflags |= flags;
 
@@ -662,7 +662,7 @@ static int apply_mlockall_flags(int flags)
struct vm_area_struct *vma, *prev = NULL;
vm_flags_t to_add = 0;
 
-   current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
+   current->mm->def_flags &= ~VM_LOCKED_MASK;
if (flags & MCL_FUTURE) {
current->mm->def_flags |= VM_LOCKED;
 
@@ -682,7 +682,7 @@ static int apply_mlockall_flags(int flags)
mas_for_each(, vma, ULONG_MAX) {
vm_flags_t newflags;
 
-   newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
+   newflags = vma->vm_flags & ~VM_LOCKED_MASK;
newflags |= to_add;
 
/* Ignore errors */
diff --git a/mm/mmap.c b/mm/mmap.c
index 9db37adfc00a..5c4b608edde9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2721,7 +2721,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   clear_vm_flags(vma, VM_LOCKED_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -3392,8 +3392,8 @@ static struct vm_area_struct *__install_special_mapping(
vma->vm_start = addr;
vma->vm_end = addr + len;
 
-   vma->vm_flags = vm_flags | mm->def_flags | VM_DONTEXPAND | VM_SOFTDIRTY;
-   vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+   init_vm_flags(vma, (vm_flags | mm->def_flags |
+ VM_DONTEXPAND | VM_SOFTDIRTY) & ~VM_LOCKED_MASK);
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 
vma->vm_ops = ops;
diff --git a/mm/mremap.c b/mm/mremap.c
index fe587c5d6591..5f6f9931bff1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -686,7 +686,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 
if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
/* We always clear VM_LOCKED[ONFAULT] on the old vma */
-   

[PATCH 13/41] mm: introduce vma->vm_flags modifier functions

2023-01-09 Thread Suren Baghdasaryan
To keep VMA locking correct when vm_flags are modified, add modifier
functions to be used whenever the flags are updated.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h   | 38 ++
 include/linux/mm_types.h |  8 +++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec2c4c227d51..35cf0a6cbcc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -702,6 +702,44 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma_init_lock(vma);
 }
 
+/* Use when VMA is not part of the VMA tree and needs no locking */
+static inline
+void init_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   WRITE_ONCE(vma->vm_flags, flags);
+}
+
+/* Use when VMA is part of the VMA tree and needs appropriate locking */
+static inline
+void reset_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   init_vm_flags(vma, flags);
+}
+
+static inline
+void set_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   vma->vm_flags |= flags;
+}
+
+static inline
+void clear_vm_flags(struct vm_area_struct *vma, unsigned long flags)
+{
+   vma_write_lock(vma);
+   vma->vm_flags &= ~flags;
+}
+
+static inline
+void mod_vm_flags(struct vm_area_struct *vma,
+ unsigned long set, unsigned long clear)
+{
+   vma_write_lock(vma);
+   vma->vm_flags |= set;
+   vma->vm_flags &= ~clear;
+}
+
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
 {
vma->vm_ops = NULL;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5f7c5ca89931..0d27edd3e63a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -553,7 +553,13 @@ struct vm_area_struct {
 * See vmf_insert_mixed_prot() for discussion.
 */
pgprot_t vm_page_prot;
-   unsigned long vm_flags; /* Flags, see mm.h. */
+
+   /*
+* Flags, see mm.h.
+* WARNING! Do not modify directly to keep correct VMA locking.
+* Use {init|reset|set|clear|mod}_vm_flags() functions instead.
+*/
+   unsigned long vm_flags;
 
 #ifdef CONFIG_PER_VMA_LOCK
int vm_lock_seq;
-- 
2.39.0



[PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-09 Thread Suren Baghdasaryan
Introduce a per-VMA rw_semaphore to be used during page fault handling
instead of mmap_lock. Because there are cases when multiple VMAs need
to be exclusively locked during VMA tree modifications, instead of the
usual lock/unlock pattern we mark a VMA as locked by taking the per-VMA lock
exclusively and setting vma->vm_lock_seq to the current mm->mm_lock_seq. When
the mmap_write_lock holder is done with all modifications and drops mmap_lock,
it will increment mm->mm_lock_seq, effectively unlocking all VMAs marked as
locked.
The VMA lock is placed on a cache line boundary so that its 'count' field
falls into the first cache line while the rest of the fields fall into
the second cache line. This lets the 'count' field be cached with
other frequently accessed fields and used quickly in the uncontended
case, while 'owner' and the other fields used in the contended case will
not invalidate the first cache line while waiting on the lock.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm.h| 80 +++
 include/linux/mm_types.h  |  8 
 include/linux/mmap_lock.h | 13 +++
 kernel/fork.c |  4 ++
 mm/init-mm.c  |  3 ++
 5 files changed, 108 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..ec2c4c227d51 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -612,6 +612,85 @@ struct vm_operations_struct {
  unsigned long addr);
 };
 
+#ifdef CONFIG_PER_VMA_LOCK
+static inline void vma_init_lock(struct vm_area_struct *vma)
+{
+   init_rwsem(&vma->lock);
+   vma->vm_lock_seq = -1;
+}
+
+static inline void vma_write_lock(struct vm_area_struct *vma)
+{
+   int mm_lock_seq;
+
+   mmap_assert_write_locked(vma->vm_mm);
+
+   /*
+* current task is holding mmap_write_lock, both vma->vm_lock_seq and
+* mm->mm_lock_seq can't be concurrently modified.
+*/
+   mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq);
+   if (vma->vm_lock_seq == mm_lock_seq)
+   return;
+
+   down_write(&vma->lock);
+   vma->vm_lock_seq = mm_lock_seq;
+   up_write(&vma->lock);
+}
+
+/*
+ * Try to read-lock a vma. The function is allowed to occasionally yield false
+ * locked result to avoid performance overhead, in which case we fall back to
+ * using mmap_lock. The function should never yield false unlocked result.
+ */
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+{
+   /* Check before locking. A race might cause false locked result. */
+   if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))
+   return false;
+
+   if (unlikely(down_read_trylock(&vma->lock) == 0))
+   return false;
+
+   /*
+* Overflow might produce false locked result.
+* False unlocked result is impossible because we modify and check
+* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq
+* modification invalidates all existing locks.
+*/
+   if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
+   up_read(&vma->lock);
+   return false;
+   }
+   return true;
+}
+
+static inline void vma_read_unlock(struct vm_area_struct *vma)
+{
+   up_read(&vma->lock);
+}
+
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+{
+   mmap_assert_write_locked(vma->vm_mm);
+   /*
+* current task is holding mmap_write_lock, both vma->vm_lock_seq and
+* mm->mm_lock_seq can't be concurrently modified.
+*/
+   VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq),
+ vma);
+}
+
+#else /* CONFIG_PER_VMA_LOCK */
+
+static inline void vma_init_lock(struct vm_area_struct *vma) {}
+static inline void vma_write_lock(struct vm_area_struct *vma) {}
+static inline bool vma_read_trylock(struct vm_area_struct *vma)
+   { return false; }
+static inline void vma_read_unlock(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+
+#endif /* CONFIG_PER_VMA_LOCK */
+
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
static const struct vm_operations_struct dummy_vm_ops = {};
@@ -620,6 +699,7 @@ static inline void vma_init(struct vm_area_struct *vma, 
struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = _vm_ops;
INIT_LIST_HEAD(>anon_vma_chain);
+   vma_init_lock(vma);
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d5cdec1314fe..5f7c5ca89931 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -555,6 +555,11 @@ struct vm_area_struct {
pgprot_t vm_page_prot;
unsigned long vm_flags; /* Flags, see mm.h. */
 
+#ifdef CONFIG_PER_VMA_LOCK
+   int vm_lock_seq;
+   struct rw_semaphore lock;
+#endif
+
/*
 * For areas with an 

[PATCH 11/41] mm: export dump_mm()

2023-01-09 Thread Suren Baghdasaryan
mmap_assert_write_locked() will be used in the next patch to ensure
vma write lock is taken only under mmap_lock exclusive lock. Because
mmap_assert_write_locked() uses dump_mm() and there are cases when
vma write lock is taken from inside a module, it's necessary to export
dump_mm() function.

Signed-off-by: Suren Baghdasaryan 
---
 mm/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..b6e9e53469d1 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
mm->def_flags, &mm->def_flags
);
 }
+EXPORT_SYMBOL(dump_mm);
 
 static bool page_init_poisoning __read_mostly = true;
 
-- 
2.39.0



[PATCH 10/41] mm: move mmap_lock assert function definitions

2023-01-09 Thread Suren Baghdasaryan
Move mmap_lock assert function definitions up so that they can be used
by other mmap_lock routines.

Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mmap_lock.h | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 96e113e23d04..e49ba91bb1f0 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct 
mm_struct *mm, bool write)
 
 #endif /* CONFIG_TRACING */
 
+static inline void mmap_assert_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held(&mm->mmap_lock);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
+static inline void mmap_assert_write_locked(struct mm_struct *mm)
+{
+   lockdep_assert_held_write(&mm->mmap_lock);
+   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
+}
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
init_rwsem(&mm->mmap_lock);
@@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct 
mm_struct *mm)
up_read_non_owner(&mm->mmap_lock);
 }
 
-static inline void mmap_assert_locked(struct mm_struct *mm)
-{
-   lockdep_assert_held(&mm->mmap_lock);
-   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
-static inline void mmap_assert_write_locked(struct mm_struct *mm)
-{
-   lockdep_assert_held_write(&mm->mmap_lock);
-   VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
-}
-
 static inline int mmap_lock_is_contended(struct mm_struct *mm)
 {
return rwsem_is_contended(&mm->mmap_lock);
-- 
2.39.0



[PATCH 09/41] mm: rcu safe VMA freeing

2023-01-09 Thread Suren Baghdasaryan
From: Michel Lespinasse 

This prepares for page faults handling under VMA lock, looking up VMAs
under protection of an rcu read lock, instead of the usual mmap read lock.

Signed-off-by: Michel Lespinasse 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm_types.h | 13 ++---
 kernel/fork.c| 13 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b6bce73fbb4..d5cdec1314fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -535,9 +535,16 @@ struct anon_vma_name {
 struct vm_area_struct {
/* The first cache line has the info for VMA tree walking. */
 
-   unsigned long vm_start; /* Our start address within vm_mm. */
-   unsigned long vm_end;   /* The first byte after our end address
-  within vm_mm. */
+   union {
+   struct {
+   /* VMA covers [vm_start; vm_end) addresses within mm */
+   unsigned long vm_start;
+   unsigned long vm_end;
+   };
+#ifdef CONFIG_PER_VMA_LOCK
+   struct rcu_head vm_rcu; /* Used for deferred freeing. */
+#endif
+   };
 
struct mm_struct *vm_mm;/* The address space we belong to. */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 58aab6c889a4..5986817f393c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -479,10 +479,23 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct 
*orig)
return new;
 }
 
+#ifdef CONFIG_PER_VMA_LOCK
+static void __vm_area_free(struct rcu_head *head)
+{
+   struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+   kmem_cache_free(vm_area_cachep, vma);
+}
+#endif
+
 void vm_area_free(struct vm_area_struct *vma)
 {
free_anon_vma_name(vma);
+#ifdef CONFIG_PER_VMA_LOCK
+   call_rcu(&vma->vm_rcu, __vm_area_free);
+#else
kmem_cache_free(vm_area_cachep, vma);
+#endif
 }
 
 static void account_kernel_stack(struct task_struct *tsk, int account)
-- 
2.39.0



[PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-09 Thread Suren Baghdasaryan
This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled by default on supported architectures with SMP and MMU
set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.

Signed-off-by: Suren Baghdasaryan 
---
 mm/Kconfig | 13 +
 1 file changed, 13 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..0aeca3794972 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1183,6 +1183,19 @@ config LRU_GEN_STATS
  This option has a per-memcg and per-node memory overhead.
 # }
 
+config ARCH_SUPPORTS_PER_VMA_LOCK
+   def_bool n
+
+config PER_VMA_LOCK
+   bool "Per-vma locking support"
+   default y
+   depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+   help
+ Allow per-vma locking during page fault handling.
+
+ This feature allows locking each virtual memory area separately when
+ handling page faults instead of taking mmap_lock.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.39.0



[PATCH 07/41] mm: Enable maple tree RCU mode by default.

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

Use the maple tree in RCU mode for VMA tracking.  This is necessary for
the use of per-VMA locking.  RCU mode is enabled by default but disabled
when exiting an mm and for the new tree during a fork.

Also enable RCU for the tree used in munmap operations to ensure the
nodes remain valid for readers.

Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 include/linux/mm_types.h | 3 ++-
 kernel/fork.c| 3 +++
 mm/mmap.c| 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..4b6bce73fbb4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -810,7 +810,8 @@ struct mm_struct {
unsigned long cpu_bitmap[];
 };
 
-#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
+#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
+MT_FLAGS_USE_RCU)
 extern struct mm_struct init_mm;
 
 /* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..58aab6c889a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (retval)
goto out;
 
+   mt_clear_in_rcu(mas.tree);
mas_for_each(&old_mas, mpnt, ULONG_MAX) {
struct file *file;
 
@@ -703,6 +704,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
retval = arch_dup_mmap(oldmm, mm);
 loop_out:
mas_destroy(&mas);
+   if (!retval)
+   mt_set_in_rcu(mas.tree);
 out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
diff --git a/mm/mmap.c b/mm/mmap.c
index 87d929316d57..9db37adfc00a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2304,7 +2304,8 @@ do_mas_align_munmap(struct ma_state *mas, struct 
vm_area_struct *vma,
int count = 0;
int error = -ENOMEM;
MA_STATE(mas_detach, &mt_detach, 0, 0);
-   mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
+   mt_init_flags(&mt_detach, mas->tree->ma_flags &
+ (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU));
mt_set_external_lock(&mt_detach, &mm->mmap_lock);
 
if (mas_preallocate(mas, vma, GFP_KERNEL))
@@ -3091,6 +3092,7 @@ void exit_mmap(struct mm_struct *mm)
 */
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
+   mt_clear_in_rcu(&mm->mm_mt);
free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
  USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb);
-- 
2.39.0



[PATCH 06/41] maple_tree: Add smp_rmb() to dead node detection

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check.

This is necessary for RCU mode.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 8066fb1e8ec9..80ca28b656d3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -535,9 +535,11 @@ static inline struct maple_node *mte_parent(const struct 
maple_enode *enode)
  */
 static inline bool ma_dead_node(const struct maple_node *node)
 {
-   struct maple_node *parent = (void *)((unsigned long)
-node->parent & ~MAPLE_NODE_MASK);
+   struct maple_node *parent;
 
+   /* Do not reorder reads from the node prior to the parent check */
+   smp_rmb();
+   parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
return (parent == node);
 }
 
@@ -552,6 +554,8 @@ static inline bool mte_dead_node(const struct maple_enode 
*enode)
struct maple_node *parent, *node;
 
node = mte_to_node(enode);
+   /* Do not reorder reads from the node prior to the parent check */
+   smp_rmb();
parent = mte_parent(enode);
return (parent == node);
 }
-- 
2.39.0



[PATCH 05/41] maple_tree: Fix write memory barrier of nodes once dead for RCU mode

2023-01-09 Thread Suren Baghdasaryan
From: "Liam R. Howlett" 

During the development of the maple tree, the strategy of freeing
multiple nodes changed and, in the process, the pivots were reused to
store pointers to dead nodes.  To ensure the readers see accurate
pivots, the writers need to mark the nodes as dead and call smp_wmb() to
ensure any readers can identify the node as dead before using the pivot
values.

There were two places where the old method of marking the node as dead
without smp_wmb() was still being used, which resulted in RCU readers seeing
the wrong pivot value before seeing the node was dead.  Fix this race
condition by using mte_set_node_dead() which has the smp_wmb() call to
ensure the race is closed.

Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being
freed are marked as dead to ensure there are no other call paths besides
the two updated paths.

This is necessary for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c |  7 +--
 tools/testing/radix-tree/maple.c | 16 
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d85291b19f86..8066fb1e8ec9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -179,7 +179,7 @@ static void mt_free_rcu(struct rcu_head *head)
  */
 static void ma_free_rcu(struct maple_node *node)
 {
-   node->parent = ma_parent_ptr(node);
+   WARN_ON(node->parent != ma_parent_ptr(node));
call_rcu(&node->rcu, mt_free_rcu);
 }
 
@@ -1775,8 +1775,10 @@ static inline void mas_replace(struct ma_state *mas, 
bool advanced)
rcu_assign_pointer(slots[offset], mas->node);
}
 
-   if (!advanced)
+   if (!advanced) {
+   mte_set_node_dead(old_enode);
mas_free(mas, old_enode);
+   }
 }
 
 /*
@@ -4217,6 +4219,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state 
*wr_mas)
 done:
mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end);
if (in_rcu) {
+   mte_set_node_dead(mas->node);
mas->node = mt_mk_node(newnode, wr_mas->type);
mas_replace(mas, false);
} else {
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 81fa7ec2e66a..2539ad6c4777 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mn->slot[1] != NULL);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
 
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
mas.node = MAS_START;
mas_nomem(&mas, GFP_KERNEL);
@@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i);
MT_BUG_ON(mt, !mn);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
 
@@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
MT_BUG_ON(mt, !mn);
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
 
@@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
MT_BUG_ON(mt, mas_allocated() != i - j - 1);
}
@@ -269,6 +274,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
}
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
@@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt)
mn = mas_pop_node(&mas); /* get the next node. */
MT_BUG_ON(mt, mn == NULL);
MT_BUG_ON(mt, not_empty(mn));
+   mn->parent = ma_parent_ptr(mn);

[PATCH 04/41] maple_tree: remove extra smp_wmb() from mas_dead_leaves()

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

The call to mte_set_dead_node() before the smp_wmb() already calls
smp_wmb() so this is not needed.  This is an optimization for the RCU
mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a11eea943f8d..d85291b19f86 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5510,7 +5510,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void 
__rcu **slots,
break;
 
mte_set_node_dead(entry);
-   smp_wmb(); /* Needed for RCU */
node->type = type;
rcu_assign_pointer(slots[offset], node);
}
-- 
2.39.0



[PATCH 03/41] maple_tree: Fix freeing of nodes in rcu mode

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

The walk to destroy the nodes was not always setting the node type and
would result in a destroy method potentially using the values as nodes.
Avoid this by setting the correct node types.  This is necessary for the
RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 73 
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a748938ad2e9..a11eea943f8d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -897,6 +897,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum 
maple_type mt,
meta->end = end;
 }
 
+/*
+ * mas_clear_meta() - clear the metadata information of a node, if it exists
+ * @mas: The maple state
+ * @mn: The maple node
+ * @mt: The maple node type
+ * @offset: The offset of the highest sub-gap in this node.
+ * @end: The end of the data in this node.
+ */
+static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn,
+ enum maple_type mt)
+{
+   struct maple_metadata *meta;
+   unsigned long *pivots;
+   void __rcu **slots;
+   void *next;
+
+   switch (mt) {
+   case maple_range_64:
+   pivots = mn->mr64.pivot;
+   if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) {
+   slots = mn->mr64.slot;
+   next = mas_slot_locked(mas, slots,
+  MAPLE_RANGE64_SLOTS - 1);
+   if (unlikely((mte_to_node(next) &&
+ mte_node_type(next))))
+   return; /* The last slot is a node, no metadata */
+   }
+   fallthrough;
+   case maple_arange_64:
+   meta = ma_meta(mn, mt);
+   break;
+   default:
+   return;
+   }
+
+   meta->gap = 0;
+   meta->end = 0;
+}
+
 /*
  * ma_meta_end() - Get the data end of a node from the metadata
  * @mn: The maple node
@@ -5448,20 +5486,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, 
unsigned long min,
  * mas_dead_leaves() - Mark all leaves of a node as dead.
  * @mas: The maple state
  * @slots: Pointer to the slot array
+ * @type: The maple node type
  *
  * Must hold the write lock.
  *
  * Return: The number of leaves marked as dead.
  */
 static inline
-unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots)
+unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
+ enum maple_type mt)
 {
struct maple_node *node;
enum maple_type type;
void *entry;
int offset;
 
-   for (offset = 0; offset < mt_slot_count(mas->node); offset++) {
+   for (offset = 0; offset < mt_slots[mt]; offset++) {
entry = mas_slot_locked(mas, slots, offset);
type = mte_node_type(entry);
node = mte_to_node(entry);
@@ -5480,14 +5520,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, 
void __rcu **slots)
 
 static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset)
 {
-   struct maple_node *node, *next;
+   struct maple_node *next;
void __rcu **slots = NULL;
 
next = mas_mn(mas);
do {
-   mas->node = ma_enode_ptr(next);
-   node = mas_mn(mas);
-   slots = ma_slots(node, node->type);
+   mas->node = mt_mk_node(next, next->type);
+   slots = ma_slots(next, next->type);
next = mas_slot_locked(mas, slots, offset);
offset = 0;
} while (!ma_is_leaf(next->type));
@@ -5551,11 +5590,14 @@ static inline void __rcu **mas_destroy_descend(struct 
ma_state *mas,
node = mas_mn(mas);
slots = ma_slots(node, mte_node_type(mas->node));
next = mas_slot_locked(mas, slots, 0);
-   if ((mte_dead_node(next)))
+   if ((mte_dead_node(next))) {
+   mte_to_node(next)->type = mte_node_type(next);
next = mas_slot_locked(mas, slots, 1);
+   }
 
mte_set_node_dead(mas->node);
node->type = mte_node_type(mas->node);
+   mas_clear_meta(mas, node, node->type);
node->piv_parent = prev;
node->parent_slot = offset;
offset = 0;
@@ -5575,13 +5617,18 @@ static void mt_destroy_walk(struct maple_enode *enode, 
unsigned char ma_flags,
 
MA_STATE(mas, &mt, 0, 0);
 
-   if (mte_is_leaf(enode))
+   mas.node = enode;
+   if (mte_is_leaf(enode)) {
+   node->type = mte_node_type(enode);
goto free_leaf;
+   }
 
+   ma_flags &= ~MT_FLAGS_LOCK_MASK;
mt_init_flags(, ma_flags);
mas_lock(&mas);
 
-   

[PATCH 02/41] maple_tree: Detect dead nodes in mas_start()

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

When initially starting a search, the root node may already be in the
process of being replaced in RCU mode.  Detect and restart the walk if
this is the case.  This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ff9f04e0150d..a748938ad2e9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1359,11 +1359,15 @@ static inline struct maple_enode *mas_start(struct 
ma_state *mas)
mas->depth = 0;
mas->offset = 0;
 
+retry:
root = mas_root(mas);
/* Tree with nodes */
if (likely(xa_is_node(root))) {
mas->depth = 1;
mas->node = mte_safe_root(root);
+   if (mte_dead_node(mas->node))
+   goto retry;
+
return NULL;
}
 
-- 
2.39.0



[PATCH 00/41] Per-VMA locks

2023-01-09 Thread Suren Baghdasaryan
This is v1 of the per-VMA locks patchset, originally posted as an RFC at
[1] and described in an LWN article at [2]. The per-VMA locks idea was
discussed during the SPF [3] session at LSF/MM this year [4], which
concluded with the suggestion that “a reader/writer semaphore could be put
into the VMA itself; that would have the effect of using the VMA as a
sort of range lock. There would still be contention at the VMA level, but
it would be an improvement.” This patchset implements the suggested
approach.

When handling page faults we lookup the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.

One notable way the implementation deviates from the proposal is the way
VMAs are read-locked. During some of mm updates, multiple VMAs need to be
locked until the end of the update (e.g. vma_merge, split_vma, etc).
Tracking all the locked VMAs, avoiding recursive locks, figuring out when
it's safe to unlock previously locked VMAs would make the code more
complex. So, instead of the usual lock/unlock pattern, the proposed
solution marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.
We also postpone unlocking the locked VMAs until the end of the update,
when we do mmap_write_unlock. Potentially this keeps a VMA locked for
longer than is absolutely necessary but it results in a big reduction of
code complexity.
Read-locking a VMA is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. VMA is considered read-locked
when these sequence numbers are equal. To read-lock a VMA we set the
sequence number in vm_area_struct to be equal to the sequence number in
mm_struct. To unlock all VMAs we increment mm_struct's seq number. This
allows for an efficient way to track locked VMAs and to drop the locks on
all VMAs at the end of the update.

The patchset implements per-VMA locking only for anonymous pages which
are not in swap and avoids userfaultfd, as its implementation is more
complex. Additional support for file-backed page faults, swapped and user
pages can be added incrementally.

Performance benchmarks show similar although slightly smaller benefits as
with SPF patchset (~75% of SPF benefits). Still, with lower complexity
this approach might be more desirable.

Since RFC was posted in September 2022, two separate Google teams outside
of Android evaluated the patchset and confirmed positive results. Here are
the known usecases when per-VMA locks show benefits:

Android:
Apps with high number of threads (~100) launch times improve by up to 20%.
Each thread mmaps several areas upon startup (Stack and Thread-local
storage (TLS), thread signal stack, indirect ref table), which requires
taking mmap_lock in write mode. Page faults take mmap_lock in read mode.
During app launch, both thread creation and page faults establishing the
active working set are happening in parallel, and that causes lock
contention between mm writers and readers even if updates and page faults
are happening in different VMAs. Per-VMA locks prevent this contention by
providing a more granular lock.

Google Fibers:
We have several dynamically sized thread pools that spawn new threads
under increased load and reduce their number when idling. For example,
Google's in-process scheduling/threading framework, UMCG/Fibers, is backed
by such a thread pool. When idling, only a small number of idle worker
threads are available; when a spike of incoming requests arrive, each
request is handled in its own "fiber", which is a work item posted onto a
UMCG worker thread; quite often these spikes lead to a number of new
threads spawning. Each new thread needs to allocate and register an RSEQ
section on its TLS, then register itself with the kernel as a UMCG worker
thread, and only after that it can be considered by the in-process
UMCG/Fiber scheduler as available to do useful work. In short, during an
incoming workload spike new threads have to be spawned, and they perform
several syscalls (RSEQ registration, UMCG worker registration, memory
allocations) before they can actually start doing useful work. Removing
any bottlenecks on this thread startup path will greatly improve our
services' latencies when faced with request/workload spikes.
At high scale, mmap_lock contention during thread creation and stack page
faults leads to user-visible multi-second serving latencies in a similar
pattern to Android app startup. Per-VMA locking patchset has been run
successfully in limited experiments with user-facing production workloads.
In these experiments, we observed that the peak thread creation rate was
high enough that thread creation is no longer a bottleneck.

TCP zerocopy receive:
From the point of view of TCP zerocopy receive, the per-vma lock patch is
massively beneficial.
In today's implementation, a process with N threads where N - 1 are
performing zerocopy 

[PATCH 01/41] maple_tree: Be more cautious about dead nodes

2023-01-09 Thread Suren Baghdasaryan
From: Liam Howlett 

ma_pivots() and ma_data_end() may be called with a dead node.  Ensure that
the node isn't dead before using the returned values.

This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett 
Signed-off-by: Suren Baghdasaryan 
---
 lib/maple_tree.c | 53 +++-
 1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 26e2045d3cda..ff9f04e0150d 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -540,6 +540,7 @@ static inline bool ma_dead_node(const struct maple_node 
*node)
 
return (parent == node);
 }
+
 /*
  * mte_dead_node() - check if the @enode is dead.
  * @enode: The encoded maple node
@@ -621,6 +622,8 @@ static inline unsigned int mas_alloc_req(const struct 
ma_state *mas)
  * @node - the maple node
  * @type - the node type
  *
+ * In the event of a dead node, this array may be %NULL
+ *
  * Return: A pointer to the maple node pivots
  */
 static inline unsigned long *ma_pivots(struct maple_node *node,
@@ -1091,8 +1094,11 @@ static int mas_ascend(struct ma_state *mas)
a_type = mas_parent_enum(mas, p_enode);
a_node = mte_parent(p_enode);
a_slot = mte_parent_slot(p_enode);
-   pivots = ma_pivots(a_node, a_type);
a_enode = mt_mk_node(a_node, a_type);
+   pivots = ma_pivots(a_node, a_type);
+
+   if (unlikely(ma_dead_node(a_node)))
+   return 1;
 
if (!set_min && a_slot) {
set_min = true;
@@ -1398,6 +1404,9 @@ static inline unsigned char ma_data_end(struct maple_node 
*node,
 {
unsigned char offset;
 
+   if (!pivots)
+   return 0;
+
if (type == maple_arange_64)
return ma_meta_end(node, type);
 
@@ -1433,6 +1442,9 @@ static inline unsigned char mas_data_end(struct ma_state 
*mas)
return ma_meta_end(node, type);
 
pivots = ma_pivots(node, type);
+   if (unlikely(ma_dead_node(node)))
+   return 0;
+
offset = mt_pivots[type] - 1;
if (likely(!pivots[offset]))
return ma_meta_end(node, type);
@@ -4504,6 +4516,9 @@ static inline int mas_prev_node(struct ma_state *mas, 
unsigned long min)
node = mas_mn(mas);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
mas->max = pivots[offset];
if (offset)
mas->min = pivots[offset - 1] + 1;
@@ -4525,6 +4540,9 @@ static inline int mas_prev_node(struct ma_state *mas, 
unsigned long min)
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
offset = ma_data_end(node, mt, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
if (offset)
mas->min = pivots[offset - 1] + 1;
 
@@ -4573,6 +4591,7 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
struct maple_enode *enode;
int level = 0;
unsigned char offset;
+   unsigned char node_end;
enum maple_type mt;
void __rcu **slots;
 
@@ -4596,7 +4615,11 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
node = mas_mn(mas);
mt = mte_node_type(mas->node);
pivots = ma_pivots(node, mt);
-   } while (unlikely(offset == ma_data_end(node, mt, pivots, mas->max)));
+   node_end = ma_data_end(node, mt, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
+   } while (unlikely(offset == node_end));
 
slots = ma_slots(node, mt);
pivot = mas_safe_pivot(mas, pivots, ++offset, mt);
@@ -4612,6 +4635,9 @@ static inline int mas_next_node(struct ma_state *mas, 
struct maple_node *node,
mt = mte_node_type(mas->node);
slots = ma_slots(node, mt);
pivots = ma_pivots(node, mt);
+   if (unlikely(ma_dead_node(node)))
+   return 1;
+
offset = 0;
pivot = pivots[0];
}
@@ -4658,16 +4684,18 @@ static inline void *mas_next_nentry(struct ma_state 
*mas,
return NULL;
}
 
-   pivots = ma_pivots(node, type);
slots = ma_slots(node, type);
-   mas->index = mas_safe_min(mas, pivots, mas->offset);
-   if (ma_dead_node(node))
+   pivots = ma_pivots(node, type);
+   count = ma_data_end(node, type, pivots, mas->max);
+   if (unlikely(ma_dead_node(node)))
return NULL;
 
+   mas->index = mas_safe_min(mas, pivots, mas->offset);
+   if (unlikely(ma_dead_node(node)))
+   return NULL;
   

Re: [PATCH 01/15] video: fbdev: atmel_lcdfb: Rework backlight handling

2023-01-09 Thread Helge Deller

On 1/9/23 21:18, Stephen Kitt wrote:

On Sun, 8 Jan 2023 18:26:12 +0100, Helge Deller  wrote:


On 1/7/23 21:53, Sam Ravnborg wrote:

Hi Stephen.

On Sat, Jan 07, 2023 at 09:36:47PM +0100, Stephen Kitt wrote:

On 7 January 2023 19:26:15 CET, Sam Ravnborg via B4 Submission Endpoint
 wrote:

From: Sam Ravnborg 

The atmel_lcdfb had code to save/restore power state.
This is not needed so drop it.

Introduce backlight_is_brightness() to make logic simpler.

Signed-off-by: Sam Ravnborg 
Cc: Nicolas Ferre 
Cc: Alexandre Belloni 
Cc: Ludovic Desroches 
Cc: linux-fb...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
drivers/video/fbdev/atmel_lcdfb.c | 24 +++-
1 file changed, 3 insertions(+), 21 deletions(-)

...


Hi Sam,

I’d submitted quite a few more of these previously (and you’d reviewed
them), see e.g. the thread starting at
https://lkml.org/lkml/2022/6/7/4365, and yesterday,
https://lkml.org/lkml/2023/1/6/520, https://lkml.org/lkml/2023/1/6/656,
https://lkml.org/lkml/2023/1/6/970, https://lkml.org/lkml/2023/1/6/643,
and https://lkml.org/lkml/2023/1/6/680. There are a few more, I can find
them if it’s any use.


The patches from yesterday were what triggered me to resurrect an old
branch of mine where I had done something similar. I had lost all
memory of reviewing similar patches from you.


Helge - could you pick the reviewed patches from:
https://lore.kernel.org/all/20220607192335.1137249-1-st...@sk2.org/
[This is the same mail as Stephen refers to above - looked up via lore].


I just pulled those 7 patches into fbdev/for-next.
If you need more, please let me know,


Please pull
https://lore.kernel.org/lkml/20230109200239.1850611-1-st...@sk2.org/ too, it
completes the fbdev set. (It’s a re-send of
https://lore.kernel.org/lkml/20220609180440.3138625-1-st...@sk2.org/).


Done.

Thanks!
Helge



Re: [PATCH 01/15] video: fbdev: atmel_lcdfb: Rework backlight handling

2023-01-09 Thread Stephen Kitt
On Sun, 8 Jan 2023 18:26:12 +0100, Helge Deller  wrote:

> On 1/7/23 21:53, Sam Ravnborg wrote:
> > Hi Stephen.
> >
> > On Sat, Jan 07, 2023 at 09:36:47PM +0100, Stephen Kitt wrote:  
> >> On 7 January 2023 19:26:15 CET, Sam Ravnborg via B4 Submission Endpoint
> >>  wrote:  
> >>> From: Sam Ravnborg 
> >>>
> >>> The atmel_lcdfb had code to save/restore power state.
> >>> This is not needed so drop it.
> >>>
> >>> Introduce backlight_is_brightness() to make logic simpler.
> >>>
> >>> Signed-off-by: Sam Ravnborg 
> >>> Cc: Nicolas Ferre 
> >>> Cc: Alexandre Belloni 
> >>> Cc: Ludovic Desroches 
> >>> Cc: linux-fb...@vger.kernel.org
> >>> Cc: linux-arm-ker...@lists.infradead.org
> >>> ---
> >>> drivers/video/fbdev/atmel_lcdfb.c | 24 +++-
> >>> 1 file changed, 3 insertions(+), 21 deletions(-)  
> > ...  
> >>
> >> Hi Sam,
> >>
> >> I’d submitted quite a few more of these previously (and you’d reviewed
> >> them), see e.g. the thread starting at
> >> https://lkml.org/lkml/2022/6/7/4365, and yesterday,
> >> https://lkml.org/lkml/2023/1/6/520, https://lkml.org/lkml/2023/1/6/656,
> >> https://lkml.org/lkml/2023/1/6/970, https://lkml.org/lkml/2023/1/6/643,
> >> and https://lkml.org/lkml/2023/1/6/680. There are a few more, I can find
> >> them if it’s any use.  
> >
> > The patches from yesterday were what triggered me to resurrect an old
> > branch of mine where I had done something similar. I had lost all
> > memory of reviewing similar patches from you.
> >
> >
> > Helge - could you pick the reviewed patches from:
> > https://lore.kernel.org/all/20220607192335.1137249-1-st...@sk2.org/
> > [This is the same mail as Stephen refers to above - looked up via lore].
> 
> I just pulled those 7 patches into fbdev/for-next.
> If you need more, please let me know,

Please pull
https://lore.kernel.org/lkml/20230109200239.1850611-1-st...@sk2.org/ too, it
completes the fbdev set. (It’s a re-send of
https://lore.kernel.org/lkml/20220609180440.3138625-1-st...@sk2.org/).

Thanks,

Stephen


pgpZRyHS3C8SI.pgp
Description: OpenPGP digital signature


Re: [PATCH 01/15] video: fbdev: atmel_lcdfb: Rework backlight handling

2023-01-09 Thread Stephen Kitt
On Mon, 9 Jan 2023 20:53:44 +0100, Stephen Kitt  wrote:
> On Sun, 8 Jan 2023 21:24:20 +0100, Sam Ravnborg  wrote:
> > > Here are my pending patches from last June on lore:
> > > 
> > 
> > All patches are handled I think except this:  
> > > * https://lore.kernel.org/lkml/20220608205623.2106113-1-st...@sk2.org/
> > >   
> > 
> > Can I ask you to drop the assignment that is not needed, and resend with
> > the collected acks/r-b.
> > 
> > With this, then all fbdev patches are handled.  
> 
> Ah yes, it was revised as
> https://lore.kernel.org/lkml/20220609180440.3138625-1-st...@sk2.org/ which
> only got one ack (from you,
> https://lore.kernel.org/lkml/yqjckqmqeuvsb...@ravnborg.org/). Should I
> resend that or is that usable as-is? Or would it be better if I sent all the
> fbdev patches again (collecting all the acks and reviews)?

Since the others are already in fbdev/for-next, I’ve just resent v2 of this
patch with your ack.

Regards,

Stephen


pgpqRo2Vm6zc7.pgp
Description: OpenPGP digital signature


Re: [PATCH v3 3/8] sched, smp: Trace IPIs sent via send_call_function_single_ipi()

2023-01-09 Thread Valentin Schneider
On 07/01/23 12:04, Ingo Molnar wrote:
> * Valentin Schneider  wrote:
>
>> send_call_function_single_ipi() is the thing that sends IPIs at the bottom
>> of smp_call_function*() via either generic_exec_single() or
>> smp_call_function_many_cond(). Give it an IPI-related tracepoint.
>> 
>> Note that this ends up tracing any IPI sent via __smp_call_single_queue(),
>> which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free".
>> 
>> Signed-off-by: Valentin Schneider 
>> Reviewed-by: Steven Rostedt (Google) 
>
> Acked-by: Ingo Molnar 
>
> Patch series logistics:
>
>  - No objections from the scheduler side, this feature looks pretty useful.
>

Thanks!

>  - Certain patches are incomplete, others are noted as being merged 
>separately, so I presume you'll send an updated/completed series 
>eventually?
>

The first patch from Steve is now in, so can drop it.

The other patches are complete, though I need to rebase them and regenerate
the treewide patch to catch any changes that came with 6.2. I'll do that
this week.

The "incompleteness" pointed out in the cover letter is about the types of
IPIs that can be traced. This series covers the ones that end up invoking
some core code (coincidentally those are the most common ones), others such
as e.g. tick_broadcast() for arm, arm64, riscv and hexagon remain
unaffected.

I'm not that much interested in these (other than maybe the tick broadcast
one; they are all fairly infrequent), but I'm happy to have a shot at them
for the sake of completeness - either in that series or in a followup, up
to you.

>  - We can merge this via the scheduler tree I suspect, as most callbacks 
>affected relate to tip:sched/core and tmp:smp/core - but if you have 
>some other preferred tree that's fine too.
>

Either sounds good to me.

> Thanks,
>
>   Ingo



Re: [PATCH v3 6/8] treewide: Trace IPIs sent via smp_send_reschedule()

2023-01-09 Thread Valentin Schneider
On 08/01/23 20:17, Huacai Chen wrote:
> Hi, Valentin,
>
> On Fri, Dec 2, 2022 at 11:59 PM Valentin Schneider  
> wrote:
>> @@ -83,7 +83,7 @@ extern void show_ipi_list(struct seq_file *p, int prec);
>>   * it goes straight through and wastes no time serializing
>>   * anything. Worst case is that we lose a reschedule ...
>>   */
>> -static inline void smp_send_reschedule(int cpu)
>> +static inline void arch_smp_send_reschedule(int cpu)
>>  {
>> loongson_send_ipi_single(cpu, SMP_RESCHEDULE);
>>  }
> This function has been moved to arch/loongarch/kernel/smp.c since 6.2.
>

Thanks! I'll make sure to rerun the coccinelle script for the next version.



Re: [PATCH 02/15] video: fbdev: atyfb: Introduce backlight_get_brightness()

2023-01-09 Thread Sam Ravnborg
Hi Christophe,
On Mon, Jan 09, 2023 at 05:44:46PM +, Christophe Leroy wrote:
> 
> 
> On 07/01/2023 at 19:26, Sam Ravnborg via B4 Submission Endpoint wrote:
> > From: Sam Ravnborg 
> > 
> > Introduce backlight_get_brightness() to simplify logic
> > and avoid direct access to backlight properties.
> 
> When I read 'introduce' I understand that you are adding a new function.
> 
> In fact backlight_get_brightness() already exists, so maybe replace 
> 'introduce' by 'use'

Thanks for your feedback. A similar patch is already applied to the
fbdev tree, so this patch can be ignored.

Sam


Re: [PATCH 02/15] video: fbdev: atyfb: Introduce backlight_get_brightness()

2023-01-09 Thread Christophe Leroy


On 07/01/2023 at 19:26, Sam Ravnborg via B4 Submission Endpoint wrote:
> From: Sam Ravnborg 
> 
> Introduce backlight_get_brightness() to simplify logic
> and avoid direct access to backlight properties.

When I read 'introduce' I understand that you are adding a new function.

In fact backlight_get_brightness() already exists, so maybe replace 
'introduce' by 'use'


> 
> Signed-off-by: Sam Ravnborg 
> Cc: Bartlomiej Zolnierkiewicz 
> Cc: Sam Ravnborg 
> Cc: Daniel Vetter 
> Cc: Souptick Joarder 
> Cc: Maarten Lankhorst 
> Cc: Jason Yan 
> Cc: Jani Nikula 
> Cc: Arnd Bergmann 
> ---
>   drivers/video/fbdev/aty/atyfb_base.c | 8 +---
>   1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/video/fbdev/aty/atyfb_base.c 
> b/drivers/video/fbdev/aty/atyfb_base.c
> index 0ccf5d401ecb..ca361e215904 100644
> --- a/drivers/video/fbdev/aty/atyfb_base.c
> +++ b/drivers/video/fbdev/aty/atyfb_base.c
> @@ -2219,13 +2219,7 @@ static int aty_bl_update_status(struct 
> backlight_device *bd)
>   {
>  struct atyfb_par *par = bl_get_data(bd);
>  unsigned int reg = aty_ld_lcd(LCD_MISC_CNTL, par);
> -   int level;
> -
> -   if (bd->props.power != FB_BLANK_UNBLANK ||
> -   bd->props.fb_blank != FB_BLANK_UNBLANK)
> -   level = 0;
> -   else
> -   level = bd->props.brightness;
> +   int level = backlight_get_brightness(bd);
> 
>  reg |= (BLMOD_EN | BIASMOD_EN);
>  if (level > 0) {
> 
> --
> 2.34.1


Re: [PATCH v2 1/2] powerpc/ps3: Change updateboltedpp panic to info

2023-01-09 Thread Christophe Leroy


On 03/01/2023 at 18:51, Geoff Levand wrote:
> Commit fdacae8a84024474afff234bdd1dbe19ad597a10 (powerpc: Activate
> CONFIG_STRICT_KERNEL_RWX by default) causes ps3_hpte_updateboltedpp()
> to be called.  Change the panic statement in ps3_hpte_updateboltedpp()
> to a pr_info statement so that bootup can continue.

But if I understand correctly, it means that CONFIG_STRICT_KERNEL_RWX 
won't work then.

So, shouldn't we keep the panic and forbid CONFIG_STRICT_KERNEL_RWX on PS3?


> 
> Signed-off-by: Geoff Levand 
> ---
>   arch/powerpc/platforms/ps3/htab.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/ps3/htab.c 
> b/arch/powerpc/platforms/ps3/htab.c
> index c27e6cf85272..9de62bd52650 100644
> --- a/arch/powerpc/platforms/ps3/htab.c
> +++ b/arch/powerpc/platforms/ps3/htab.c
> @@ -146,7 +146,7 @@ static long ps3_hpte_updatepp(unsigned long slot, 
> unsigned long newpp,
>   static void ps3_hpte_updateboltedpp(unsigned long newpp, unsigned long ea,
>   int psize, int ssize)
>   {
> - panic("ps3_hpte_updateboltedpp() not implemented");
> + pr_info("ps3_hpte_updateboltedpp() not implemented");
>   }
>   
>   static void ps3_hpte_invalidate(unsigned long slot, unsigned long vpn,


Re: [PATCH] objtool: continue if find_insn() fails in decode_instructions()

2023-01-09 Thread Sathvika Vasireddy



On 09/01/23 22:23, Ingo Molnar wrote:

* Sathvika Vasireddy  wrote:


Hi Ingo, Happy New Year!

Happy New Year to you too! :-)


On 07/01/23 15:51, Ingo Molnar wrote:

* Sathvika Vasireddy  wrote:


Currently, decode_instructions() fails if it is not able to find an
instruction, and this has been happening since commit dbcdbdfdf137b4
("objtool: Rework instruction -> symbol mapping"), because it
expects an instruction for STT_NOTYPE symbols.

Due to this, the following objtool warnings are seen:
   [1] arch/powerpc/kernel/optprobes_head.o: warning: objtool: 
optprobe_template_end(): can't find starting instruction
   [2] arch/powerpc/kernel/kvm_emul.o: warning: objtool: kvm_template_end(): 
can't find starting instruction
   [3] arch/powerpc/kernel/head_64.o: warning: objtool: end_first_256B(): can't 
find starting instruction

The warnings are thrown because find_insn() is failing for symbols that
are at the end of the file, or at the end of the section. Given how
STT_NOTYPE symbols are currently handled in decode_instructions(),
continue if the instruction is not found, instead of throwing a warning
and returning.

Signed-off-by: Naveen N. Rao 
Signed-off-by: Sathvika Vasireddy 

The SOB chain doesn't look valid: is Naveen N. Rao, the first SOB line, the
author of the patch? If yes then a matching From: line is needed.

Or if two people developed the patch, then Co-developed-by should be used:

  Co-developed-by: First Co-Author 
  Signed-off-by: First Co-Author 
  Co-developed-by: Second Co-Author 
  Signed-off-by: Second Co-Author 

[ In this SOB sequence "Second Co-Author" is the one who submits the patch. ]

[ Please only use Co-developed-by if actual lines of code were written by
the co-author that created copyrightable material - it's not a courtesy
tag. Reviewed-by/Acked-by/Tested-by can be used to credit non-code
contributions. ]

Thank you for the clarification, and for bringing these points to my
attention. I'll keep these things in mind. In this case, since both Naveen
N. Rao and I developed the patch, the below tags
are applicable.

     Co-developed-by: First Co-Author 
     Signed-off-by: First Co-Author 
     Co-developed-by: Second Co-Author 
     Signed-off-by: Second Co-Author 

... while filling in your real names & email addresses I suppose. ;-)

Indeed :-)



However, I would be dropping this particular patch, since I think Nick's
patch [1] is better to fix the objtool issue.

[1] - 
https://lore.kernel.org/linuxppc-dev/20221220101323.3119939-1-npig...@gmail.com/

Ok, I'll pick up Nick's fix, with these tags added for the PowerPC
regression aspect and your review:

   Reported-by: Naveen N. Rao 
   Reported-by: Sathvika Vasireddy 
   Acked-by: Sathvika Vasireddy 

To document & credit the efforts of your patch.


Sure, thank you!

- Sathvika



Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2023-01-09 Thread Catalin Marinas
On Sun, Jan 08, 2023 at 06:48:41PM +0800, Barry Song wrote:
> On Fri, Jan 6, 2023 at 2:15 AM Catalin Marinas  
> wrote:
> > On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> > > It is tested on 4, 8 and 128 CPU platforms and is shown to be beneficial on
> > > large systems, but may not bring improvement on small systems like a
> > > 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
> > > on CONFIG_EXPERT for this stage and keep it disabled on systems
> > > with fewer than 8 CPUs. Users can modify this threshold according to
> > > their own platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.
> >
> > What's the overhead of such batching on systems with 4 or fewer CPUs? If
> > it isn't noticeable, I'd rather have it always on than some number
> > chosen on whichever SoC you tested.
> 
> On the one hand, TLB flush is cheap on a small system, so batching TLB flushes
> helps only marginally.

Yes, it probably won't help on small systems but I don't like config
options choosing the threshold, which may be different from system to
system even if they have the same number of CPUs. A run-time tunable
would be a better option.

> On the other hand, since we have batched the tlb flush, new PTEs might be
> invisible to others before the final broadcast is done and Ack-ed.

The new PTEs could indeed be invisible at the TLB level but not at the
memory (page table) level since this is done under the PTL IIUC.

> thus, there
> is a risk someone else might do mprotect or similar things on those deferred
> pages, which will ask for read-modify-write on those deferred PTEs.

And this should be fine, we have things like the PTL in place for the
actual memory access to the page table.

> in this
> case, mm will do an explicit flush by flush_tlb_batched_pending which is
> not required if tlb flush is not deferred.

I don't fully understand why it's needed, or at least why it would be
needed on arm64. At the end of an mprotect(), we have the final PTEs in
place and we just need to issue a TLBI for that range.
change_pte_range() for example has a tlb_flush_pte_range() if the PTE
was present and that won't be done lazily. If there are other TLBIs
pending for the same range, they'll be done later though likely
unnecessarily but still cheaper than issuing a flush_tlb_mm().

> void flush_tlb_batched_pending(struct mm_struct *mm)
> {
>     int batch = atomic_read(&mm->tlb_flush_batched);
>     int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
>     int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> 
>     if (pending != flushed) {
>         flush_tlb_mm(mm);
>         /*
>          * If the new TLB flushing is pending during flushing, leave
>          * mm->tlb_flush_batched as is, to avoid losing flushing.
>          */
>         atomic_cmpxchg(&mm->tlb_flush_batched, batch,
>                        pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
>     }
> }

I guess this works on x86 better as it avoids the IPIs if this flush
already happened. But on arm64 we already issued the TLBI, we just
didn't wait for it to complete via a DSB.

> I believe Anshuman has contributed many points on this in those previous
> discussions.

Yeah, I should re-read the old threads.

-- 
Catalin


Re: [PATCH] objtool: continue if find_insn() fails in decode_instructions()

2023-01-09 Thread Ingo Molnar


* Sathvika Vasireddy  wrote:

> Hi Ingo, Happy New Year!

Happy New Year to you too! :-)

> On 07/01/23 15:51, Ingo Molnar wrote:
> > * Sathvika Vasireddy  wrote:
> > 
> > > Currently, decode_instructions() fails if it is not able to find an
> > > instruction, and this has been happening since commit dbcdbdfdf137b4
> > > ("objtool: Rework instruction -> symbol mapping"), because it
> > > expects an instruction for STT_NOTYPE symbols.
> > > 
> > > Due to this, the following objtool warnings are seen:
> > >   [1] arch/powerpc/kernel/optprobes_head.o: warning: objtool: 
> > > optprobe_template_end(): can't find starting instruction
> > >   [2] arch/powerpc/kernel/kvm_emul.o: warning: objtool: 
> > > kvm_template_end(): can't find starting instruction
> > >   [3] arch/powerpc/kernel/head_64.o: warning: objtool: end_first_256B(): 
> > > can't find starting instruction
> > > 
> > > The warnings are thrown because find_insn() is failing for symbols that
> > > are at the end of the file, or at the end of the section. Given how
> > > STT_NOTYPE symbols are currently handled in decode_instructions(),
> > > continue if the instruction is not found, instead of throwing a warning
> > > and returning.
> > > 
> > > Signed-off-by: Naveen N. Rao 
> > > Signed-off-by: Sathvika Vasireddy 
> > The SOB chain doesn't look valid: is Naveen N. Rao, the first SOB line, the
> > author of the patch? If yes then a matching From: line is needed.
> > 
> > Or if two people developed the patch, then Co-developed-by should be used:
> > 
> >  Co-developed-by: First Co-Author 
> >  Signed-off-by: First Co-Author 
> >  Co-developed-by: Second Co-Author 
> >  Signed-off-by: Second Co-Author 
> > 
> > [ In this SOB sequence "Second Co-Author" is the one who submits the patch. 
> > ]
> > 
> > [ Please only use Co-developed-by if actual lines of code were written by
> >the co-author that created copyrightable material - it's not a courtesy
> >tag. Reviewed-by/Acked-by/Tested-by can be used to credit non-code
> >contributions. ]
> Thank you for the clarification, and for bringing these points to my
> attention. I'll keep these things in mind. In this case, since both Naveen
> N. Rao and I developed the patch, the below tags
> are applicable.
> 
>     Co-developed-by: First Co-Author 
>     Signed-off-by: First Co-Author 
>     Co-developed-by: Second Co-Author 
>     Signed-off-by: Second Co-Author 

... while filling in your real names & email addresses I suppose. ;-)

> 
> However, I would be dropping this particular patch, since I think Nick's
> patch [1] is better to fix the objtool issue.
> 
> [1] - 
> https://lore.kernel.org/linuxppc-dev/20221220101323.3119939-1-npig...@gmail.com/

Ok, I'll pick up Nick's fix, with these tags added for the PowerPC 
regression aspect and your review:

  Reported-by: Naveen N. Rao 
  Reported-by: Sathvika Vasireddy 
  Acked-by: Sathvika Vasireddy 

To document & credit the efforts of your patch.

Thanks,

Ingo


Re: [PATCH 15/15] backlight: backlight: Drop the deprecated fb_blank property

2023-01-09 Thread Sam Ravnborg
Hi Daniel.

On Mon, Jan 09, 2023 at 11:06:35AM +, Daniel Thompson wrote:
> On Sat, Jan 07, 2023 at 07:26:29PM +0100, Sam Ravnborg via B4 Submission 
> Endpoint wrote:
> > From: Sam Ravnborg 
> >
> > With all users gone remove the deprecated fb_blank member in
> > backlight_properties.
> >
> > Signed-off-by: Sam Ravnborg 
> > Cc: Lee Jones 
> > Cc: Daniel Thompson 
> > Cc: Jingoo Han 
> 
> 
> Reviewed-by: Daniel Thompson 

Thanks for the follow-up on all the backlight related patches.

> 
> 
> PS Please don't treat this like a maintainer Acked-by: and merge it
>(Lee's not on holiday so work with Lee to figure out the merge
>strategy ;-) ).
Nope, I am aware of the usual pattern here and will wait for Lee to show
up.
For this patch there is a bug, as I need to update a comment.
I will fix this when I resend, after all the patches in flight have
landed. So likely after the next merge window.

Sam



Re: [PATCH] net-wan: Add check for NULL for utdm in ucc_hdlc_probe

2023-01-09 Thread Christophe Leroy


On 23/12/2022 at 15:32, Ekaterina Esina wrote:
> 
> If uhdlc_priv_tsa != 1 then utdm is not initialized.
> And if ret != NULL then goto undo_uhdlc_init, where utdm is dereferenced.
> Same if dev == NULL.
> 
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
> 
> Signed-off-by: Ekaterina Esina 
> ---
>   drivers/net/wan/fsl_ucc_hdlc.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
> index 22edea6ca4b8..2ddb0f71e648 100644
> --- a/drivers/net/wan/fsl_ucc_hdlc.c
> +++ b/drivers/net/wan/fsl_ucc_hdlc.c
> @@ -1243,7 +1243,9 @@ static int ucc_hdlc_probe(struct platform_device *pdev)
>   free_dev:
>  free_netdev(dev);
>   undo_uhdlc_init:
> -   iounmap(utdm->siram);
> +   if (utdm != NULL) {
> +   iounmap(utdm->siram);
> +   }

If utdm being NULL is a problem here, isn't it also a problem in the
iounmap below?


>   unmap_si_regs:
>  iounmap(utdm->si_regs);
>   free_utdm:
> --
> 2.30.2
> 


RE: [PATCH 1/1] PCI: layerscape: Add EP mode support for ls1028a

2023-01-09 Thread Frank Li
> 
> From: Xiaowei Bao 
> 
> Add PCIe EP mode support for ls1028a.
> 
> Signed-off-by: Xiaowei Bao 
> Signed-off-by: Hou Zhiqiang 
> ---
> 
> All other patches were already accepted by the maintainer in
> https://lore.kernel.org/lkml/2022223457.10599-1-leoyang...@nxp.com/
> 
> But missed this one.
> 
> Re-post.
> 

Ping.



Re: [Skiboot] [PATCH 1/3] core/device: Add function to return child node using name and length

2023-01-09 Thread Dan Horák
Hi Athira,

On Thu, 5 Jan 2023 12:41:33 +0530
Athira Rajeev  wrote:

> 
> 
> > On 05-Jan-2023, at 12:35 PM, Madhavan Srinivasan  
> > wrote:
> > 
> > 
> > On Mon,  2 Jan 2023 08:45:22 +0530
> > Athira Rajeev  wrote:
> > 
> >> Add a function dt_find_by_name_len() that returns the child node if
> >> it matches the first "n" characters of a given name, otherwise NULL.
> >> This is helpful for cases with node name like: "name@addr". In
> >> scenarios where nodes are added with "name@addr" format and if the
> >> value of "addr" is not known, that node can't be matched with node
> >> name or addr. Hence matching with substring as node name will return
> >> the expected result. Patch adds dt_find_by_name_len() function
> >> and testcase for the same in core/test/run-device.c
> > 
> > wouldn't it be better to automatically compare the name up to the "@"
> > character in the node name when searching for the match instead of
> > having to hard-code the lengths? I think it should be good enough for
> > the use case described above.
> > 
> > something like
> > ...
> > pos = strchr(child->name, '@')
> > if (!strncmp(child->name, name, pos - child->name))
> >  ...
> > 
> > 
> > Dan
> 
> Hi Dan,
> 
> Thanks for checking the patch.
> 
> Comparing up to "@" while searching for the match would restrict the string
> search to patterns containing "@".
> Having dt_find_by_name_len() take a length is useful for generic substring
> searches across different patterns.
> So we preferred to use a length instead of hard-coding the character.
> 
> Please let us know your thoughts.

I understand the presented solution is a pretty generic one, but I think
the question is whether the added complexity brings the benefits
compared to the simpler "separator char" solution.

And thinking even more about the generic "length" approach, it might
bring some false positive hits. Imagine nodes abc@1 and abcd@2, and you are
looking for "abc". A search for (abc, 3) will also match the "abcd"
one. And if the search string always contains the "@" character,
then specifying the length is not required. And I believe the length
parameter might be totally redundant, because it can be derived from
the search string and the new function would be like
"dt_find_by_name_substr()".


With regards,

Dan

> Thanks
> Athira
> 
> > 
> >> Signed-off-by: Athira Rajeev 
> >> ---
> >> core/device.c  | 20 
> >> core/test/run-device.c | 11 +++
> >> include/device.h   |  4 
> >> 3 files changed, 35 insertions(+)
> >> diff --git a/core/device.c b/core/device.c
> >> index 2de37c74..72c54e85 100644
> >> --- a/core/device.c
> >> +++ b/core/device.c
> >> @@ -395,6 +395,26 @@ struct dt_node *dt_find_by_name(struct dt_node *root, 
> >> const char *name)
> >> }
> >>  +struct dt_node *dt_find_by_name_len(struct dt_node *root,
> >> +  const char *name, int len)
> >> +{
> >> +  struct dt_node *child, *match;
> >> +
> >> +  if (len <= 0)
> >> +  return NULL;
> >> +
> >> +  list_for_each(&root->children, child, list) {
> >> +  if (!strncmp(child->name, name, len))
> >> +  return child;
> >> +
> >> +  match = dt_find_by_name_len(child, name, len);
> >> +  if (match)
> >> +  return match;
> >> +  }
> >> +
> >> +  return NULL;
> >> +}
> >> +
> >> struct dt_node *dt_new_check(struct dt_node *parent, const char *name)
> >> {
> >>struct dt_node *node = dt_find_by_name(parent, name);
> >> diff --git a/core/test/run-device.c b/core/test/run-device.c
> >> index 4a12382b..8c552103 100644
> >> --- a/core/test/run-device.c
> >> +++ b/core/test/run-device.c
> >> @@ -466,6 +466,17 @@ int main(void)
> >>new_prop_ph = dt_prop_get_u32(ut2, "something");
> >>assert(!(new_prop_ph == ev1_ph));
> >>dt_free(subtree);
> >> +
> >> +  /* Test dt_find_by_name_len */
> >> +  root = dt_new_root("");
> >> +  addr1 = dt_new_addr(root, "node", 0x1);
> >> +  addr2 = dt_new_addr(root, "node0_1", 0x2);
> >> +  assert(dt_find_by_name(root, "node@1") == addr1);
> >> +  assert(dt_find_by_name(root, "node0_1@2") == addr2);
> >> +  assert(dt_find_by_name_len(root, "node@", 5) == addr1);
> >> +  assert(dt_find_by_name_len(root, "node0_1@", 8) == addr2);
> >> +  dt_free(root);
> >> +
> >>return 0;
> >> }
> >> diff --git a/include/device.h b/include/device.h
> >> index 93fb90ff..f5e0fb79 100644
> >> --- a/include/device.h
> >> +++ b/include/device.h
> >> @@ -184,6 +184,10 @@ struct dt_node *dt_find_by_path(struct dt_node *root, 
> >> const char *path);
> >> /* Find a child node by name */
> >> struct dt_node *dt_find_by_name(struct dt_node *root, const char *name);
> >> +/* Find a child node by name and len */
> >> +struct dt_node *dt_find_by_name_len(struct dt_node *root,
> >> +const char *name, int len);
> >> +
> >> /* Find a node by phandle */
> >> struct dt_node *dt_find_by_phandle(struct dt_node 

Re: [PATCH 2/2] perf test bpf: Skip test if kernel-debuginfo is not present

2023-01-09 Thread Athira Rajeev



> On 05-Jan-2023, at 6:24 PM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Jan 05, 2023 at 05:47:42PM +0530, Athira Rajeev escreveu:
>> Perf BPF filter test fails in environment where "kernel-debuginfo"
>> is not installed.
> 
> I'll apply this to perf/core, for the next merge window, as its more an
> improvement than a fix, i.e. we know why it fails, we're just improving
> the user reporting to make that clear at first sight.
> 
> - Arnaldo

Hi Arnaldo,

Sure, Thanks for checking

Athira
> 
>> Test failure logs:
>> <<>>
>> 42: BPF filter:
>> 42.1: Basic BPF filtering : Ok
>> 42.2: BPF pinning : Ok
>> 42.3: BPF prologue generation : FAILED!
>> <<>>
>> 
>> Enabling verbose option provided debug logs, which says debuginfo
>> needs to be installed. Snippet of verbose logs:
>> 
>> <<>>
>> 42.3: BPF prologue generation   :
>> --- start ---
>> test child forked, pid 28218
>> <<>>
>> Rebuild with CONFIG_DEBUG_INFO=y, or install an appropriate debuginfo
>> package.
>> bpf_probe: failed to convert perf probe events
>> Failed to add events selected by BPF
>> test child finished with -1
>>  end 
>> BPF filter subtest 3: FAILED!
>> <<>>
>> 
>> Here subtest, "BPF prologue generation" failed and
>> logs shows debuginfo is needed. After installing
>> kernel-debuginfo package, testcase passes.
>> 
>> Subtest "BPF prologue generation" failed because the "do_test"
>> function returns "TEST_FAIL" without checking the error type
>> returned by the "parse_events_load_bpf_obj" function.
>> parse_events_load_bpf_obj can also return an error of type
>> "-ENODATA" in case the kernel-debuginfo package is not installed. Fix this
>> by adding a check for -ENODATA error.
>> 
>> Test result after the patch changes:
>> 
>> Test failure logs:
>> <<>>
>> 42: BPF filter :
>> 42.1: Basic BPF filtering  : Ok
>> 42.2: BPF pinning  : Ok
>> 42.3: BPF prologue generation  : Skip (clang/debuginfo isn't
>> installed or environment missing BPF support)
>> 
>> Fixes: ba1fae431e74 ("perf test: Add 'perf test BPF'")
>> Signed-off-by: Athira Rajeev 
>> ---
>> Note: This is dependent on patch 1:
>> tools/perf: Update the exit error codes in function
>> try_to_find_probe_trace_event
>> 
>> tools/perf/tests/bpf.c | 6 +-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
>> index 17c023823713..6a4235a9cf57 100644
>> --- a/tools/perf/tests/bpf.c
>> +++ b/tools/perf/tests/bpf.c
>> @@ -126,6 +126,10 @@ static int do_test(struct bpf_object *obj, int 
>> (*func)(void),
>> 
>>  err = parse_events_load_bpf_obj(&parse_state, &parse_state.list, obj, 
>> NULL);
>>  parse_events_error__exit(&parse_error);
>> +if (err == -ENODATA) {
>> +pr_debug("Failed to add events selected by BPF, debuginfo 
>> package not installed\n");
>> +return TEST_SKIP;
>> +}
>>  if (err || list_empty(&parse_state.list)) {
>>  pr_debug("Failed to add events selected by BPF\n");
>>  return TEST_FAIL;
>> @@ -368,7 +372,7 @@ static struct test_case bpf_tests[] = {
>>  "clang isn't installed or environment missing BPF 
>> support"),
>> #ifdef HAVE_BPF_PROLOGUE
>>  TEST_CASE_REASON("BPF prologue generation", bpf_prologue_test,
>> -"clang isn't installed or environment missing BPF 
>> support"),
>> +"clang/debuginfo isn't installed or environment missing 
>> BPF support"),
>> #else
>>  TEST_CASE_REASON("BPF prologue generation", bpf_prologue_test, "not 
>> compiled in"),
>> #endif
>> -- 
>> 2.31.1
> 
> -- 
> 
> - Arnaldo



Re: [PATCH] tools/perf: Fix bpf-script-test-prologue test compile issue with clang

2023-01-09 Thread Athira Rajeev



> On 05-Jan-2023, at 6:23 PM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Thu, Jan 05, 2023 at 05:34:36PM +0530, Athira Rajeev escreveu:
>> While running perf test for bpf, observed that "BPF prologue
>> generation" test case fails to compile with clang. Logs below
>> from powerpc:
>> 
>> <stdin>:33:2: error: use of undeclared identifier 'fmode_t'
>>fmode_t f_mode = (fmode_t)_f_mode;
>>^
>> <stdin>:37:6: error: use of undeclared identifier 'f_mode'; did you mean 
>> '_f_mode'?
>>if (f_mode & FMODE_WRITE)
>>^~
>>_f_mode
>> <stdin>:30:60: note: '_f_mode' declared here
>> int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
>>   ^
>> 2 errors generated.
> 
> Thanks for fixing this, I noticed the problem but didn't got around to
> investigate it.
> 
> Tested and applied.

Thanks Arnaldo for checking the patch.

Athira
> 
> - Arnaldo
> 
>> The test code tests/bpf-script-test-prologue.c uses fmode_t.
>> And the error above is for "fmode_t" which is defined in
>> include/linux/types.h as part of kernel build directory:
>> "/lib/modules//build" that comes from
>> kernel devel [ soft link to /usr/src/ ].
>> Clang picks this header file from "-working-directory" build
>> option that specifies this build folder.
>> 
>> But the 'commit 14e4b9f4289a ("perf trace: Raw augmented
>> syscalls fix libbpf 1.0+ compatibility")', changed the
>> include directory to use: "/usr/include".  Post this change,
>> types.h from /usr/include/ is getting picked up, which doesn't
>> contain a definition of "fmode_t" and hence fails to compile.
>> 
>> Compilation command before this commit:
>> /usr/bin/clang -D__KERNEL__ -D__NR_CPUS__=72 -DLINUX_VERSION_CODE=0x50e00 
>> -xc  -I/root/lib/perf/include/bpf -nostdinc -I./arch/powerpc/include 
>> -I./arch/powerpc/include/generated  -I./include 
>> -I./arch/powerpc/include/uapi -I./arch/powerpc/include/generated/uapi 
>> -I./include/uapi -I./include/generated/uapi -include 
>> ./include/linux/compiler-version.h -include ./include/linux/kconfig.h  
>> -Wno-unused-value -Wno-pointer-sign -working-directory 
>> /lib/modules//build -c - -target bpf  -g -O2 -o -
>> 
>> Compilation command after this commit:
>> /usr/bin/clang -D__KERNEL__ -D__NR_CPUS__=72 -DLINUX_VERSION_CODE=0x50e00 
>> -xc  -I/usr/include/ -nostdinc -I./arch/powerpc/include 
>> -I./arch/powerpc/include/generated  -I./include 
>> -I./arch/powerpc/include/uapi -I./arch/powerpc/include/generated/uapi 
>> -I./include/uapi -I./include/generated/uapi -include 
>> ./include/linux/compiler-version.h -include ./include/linux/kconfig.h  
>> -Wno-unused-value -Wno-pointer-sign -working-directory 
>> /lib/modules//build -c - -target bpf  -g -O2 -o -
>> 
>> The difference is the addition of -I/usr/include/ in the latter command,
>> which is causing the error. Fix this by adding a typedef for "fmode_t"
>> in the testcase to solve the compile issue.
>> 
>> Fixes: 14e4b9f4289a ("perf trace: Raw augmented syscalls fix libbpf 1.0+ 
>> compatibility")
>> Signed-off-by: Athira Rajeev 
>> ---
>> tools/perf/tests/bpf-script-test-prologue.c | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git a/tools/perf/tests/bpf-script-test-prologue.c 
>> b/tools/perf/tests/bpf-script-test-prologue.c
>> index bd83d364cf30..91778b5c6125 100644
>> --- a/tools/perf/tests/bpf-script-test-prologue.c
>> +++ b/tools/perf/tests/bpf-script-test-prologue.c
>> @@ -20,6 +20,8 @@
>> # undef if
>> #endif
>> 
>> +typedef unsigned int __bitwise fmode_t;
>> +
>> #define FMODE_READ   0x1
>> #define FMODE_WRITE  0x2
>> 
>> -- 
>> 2.31.1
> 
> -- 
> 
> - Arnaldo



Re: [PATCH] objtool: continue if find_insn() fails in decode_instructions()

2023-01-09 Thread Sathvika Vasireddy

Hi Ingo, Happy New Year!

On 07/01/23 15:51, Ingo Molnar wrote:

* Sathvika Vasireddy  wrote:


Currently, decode_instructions() fails if it is not able to find an
instruction, and this has been happening since commit dbcdbdfdf137b4
("objtool: Rework instruction -> symbol mapping"), because it
expects an instruction for STT_NOTYPE symbols.

Due to this, the following objtool warnings are seen:
  [1] arch/powerpc/kernel/optprobes_head.o: warning: objtool: 
optprobe_template_end(): can't find starting instruction
  [2] arch/powerpc/kernel/kvm_emul.o: warning: objtool: kvm_template_end(): 
can't find starting instruction
  [3] arch/powerpc/kernel/head_64.o: warning: objtool: end_first_256B(): can't 
find starting instruction

The warnings are thrown because find_insn() is failing for symbols that
are at the end of the file, or at the end of the section. Given how
STT_NOTYPE symbols are currently handled in decode_instructions(),
continue if the instruction is not found, instead of throwing a warning
and returning.

Signed-off-by: Naveen N. Rao 
Signed-off-by: Sathvika Vasireddy 

The SOB chain doesn't look valid: is Naveen N. Rao, the first SOB line, the
author of the patch? If yes then a matching From: line is needed.

Or if two people developed the patch, then Co-developed-by should be used:

 Co-developed-by: First Co-Author 
 Signed-off-by: First Co-Author 
 Co-developed-by: Second Co-Author 
 Signed-off-by: Second Co-Author 

[ In this SOB sequence "Second Co-Author" is the one who submits the patch. ]

[ Please only use Co-developed-by if actual lines of code were written by
   the co-author that created copyrightable material - it's not a courtesy
   tag. Reviewed-by/Acked-by/Tested-by can be used to credit non-code
   contributions. ]
Thank you for the clarification, and for bringing these points to my 
attention. I'll keep these things in mind. In this case, since both 
Naveen N. Rao and I developed the patch, the below tags are applicable.

    Co-developed-by: First Co-Author 
    Signed-off-by: First Co-Author 
    Co-developed-by: Second Co-Author 
    Signed-off-by: Second Co-Author 

However, I would be dropping this particular patch, since I think Nick's 
patch [1] is better to fix the objtool issue.


[1] - 
https://lore.kernel.org/linuxppc-dev/20221220101323.3119939-1-npig...@gmail.com/ 




Thanks for reviewing!

- Sathvika



Re: [PATCH net-next] Remove DECnet support from kernel

2023-01-09 Thread Arnd Bergmann
On Mon, Jan 9, 2023, at 09:34, Jiri Slaby wrote:
> On 09. 01. 23, 9:14, Jan Engelhardt wrote:
>> On Monday 2023-01-09 08:04, Jiri Slaby wrote:
>
> Right, we used to keep providing also defines and structs in uapi 
> headers of removed functionality. So that the above socket would 
> compile, but fail during runtime.
>
> I am not biased to any solution. In fact, I found out trinity was fixed 
> already. So either path networking takes, it's fine by me. I'm not sure 
> about the chromium users, though (and I don't care).

Chromium and some of the others look like automatically generated
lists of files and the rest seem to have compile-time checks.

From a brief look at all the packages in the debian codesearch
link you provided, I don't see any that are likely to cause
problems aside from trinity.

Arnd


Re: [PATCH net-next] Remove DECnet support from kernel

2023-01-09 Thread Jiri Slaby

On 09. 01. 23, 9:14, Jan Engelhardt wrote:


On Monday 2023-01-09 08:04, Jiri Slaby wrote:

On 18. 08. 22, 2:43, Stephen Hemminger wrote:

DECnet is an obsolete network protocol


this breaks userspace. Some projects include linux/dn.h:

  https://codesearch.debian.net/search?q=include.*linux%2Fdn.h=0

I found Trinity fails to build:
net/proto-decnet.c:5:10: fatal error: linux/dn.h: No such file or directory
 5 | #include <linux/dn.h>

Should we provide the above as empty files?


Not a good idea. There may be configure tests / code that merely checks for
dn.h existence without checking for specific contents/defines. If you provide
empty files, this would fail to build:

#include "config.h"
#ifdef HAVE_LINUX_DN_H
#   include <linux/dn.h>
#endif
int main() {
#ifdef HAVE_LINUX_DN_H
socket(AF_DECNET, 0, DNPROTO_NSP); // or whatever
#else
...
#endif
}

So, with my distro hat on, outright removing header files feels like the
slightly lesser of two evils. Given the task to port $arbitrary software
between operating systems, absent header files is something more or less
"regularly" encountered, so one could argue we are "trained" to deal with it.
But missing individual defines is a much deeper dive into the APIs and
software to patch it out.


Right, we used to keep providing also defines and structs in uapi 
headers of removed functionality. So that the above socket would 
compile, but fail during runtime.


I am not biased to any solution. In fact, I found out trinity was fixed 
already. So either path networking takes, it's fine by me. I'm not sure 
about the chromium users, though (and I don't care).


thanks,
--
js
suse labs



Re: [PATCH net-next] Remove DECnet support from kernel

2023-01-09 Thread Jan Engelhardt


On Monday 2023-01-09 08:04, Jiri Slaby wrote:
> On 18. 08. 22, 2:43, Stephen Hemminger wrote:
>> DECnet is an obsolete network protocol
>
> this breaks userspace. Some projects include linux/dn.h:
>
>  https://codesearch.debian.net/search?q=include.*linux%2Fdn.h=0
>
> I found Trinity fails to build:
> net/proto-decnet.c:5:10: fatal error: linux/dn.h: No such file or directory
> 5 | #include <linux/dn.h>
>
> Should we provide the above as empty files?

Not a good idea. There may be configure tests / code that merely checks for
dn.h existence without checking for specific contents/defines. If you provide
empty files, this would fail to build:

#include "config.h"
#ifdef HAVE_LINUX_DN_H
#   include <linux/dn.h>
#endif
int main() {
#ifdef HAVE_LINUX_DN_H
socket(AF_DECNET, 0, DNPROTO_NSP); // or whatever
#else
...
#endif
}

So, with my distro hat on, outright removing header files feels like the
slightly lesser of two evils. Given the task to port $arbitrary software
between operating systems, absent header files is something more or less
"regularly" encountered, so one could argue we are "trained" to deal with it.
But missing individual defines is a much deeper dive into the APIs and
software to patch it out.


Re: [PATCH 09/15] staging: fbtft: fb_ssd1351.c: Introduce backlight_is_blank()

2023-01-09 Thread Daniel Thompson
On Sun, Jan 08, 2023 at 09:29:43PM +0100, Sam Ravnborg wrote:
> Hi Stephen,
>
> On Sun, Jan 08, 2023 at 08:28:17PM +0100, Stephen Kitt wrote:
> > On Sat, 07 Jan 2023 19:26:23 +0100, Sam Ravnborg via B4 Submission Endpoint
> >  wrote:
> >
> > > From: Sam Ravnborg 
> > >
> > > Avoiding direct access to backlight_properties.props.
> > >
> > > Access to the deprecated props.fb_blank replaced by backlight_is_blank().
> > > Access to props.power is dropped - it was only used for debug.
> > >
> > > Signed-off-by: Sam Ravnborg 
> > > Cc: Stephen Kitt 
> > > Cc: Greg Kroah-Hartman 
> > > Cc: Daniel Thompson 
> > > Cc: Andy Shevchenko 
> > > Cc: linux-fb...@vger.kernel.org
> > > ---
> > >  drivers/staging/fbtft/fb_ssd1351.c | 9 +++--
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/staging/fbtft/fb_ssd1351.c
> > > b/drivers/staging/fbtft/fb_ssd1351.c index b8d55aa8c5c7..995fbd2f3dc6 
> > > 100644
> > > --- a/drivers/staging/fbtft/fb_ssd1351.c
> > > +++ b/drivers/staging/fbtft/fb_ssd1351.c
> > > @@ -190,15 +190,12 @@ static struct fbtft_display display = {
> > >  static int update_onboard_backlight(struct backlight_device *bd)
> > >  {
> > >   struct fbtft_par *par = bl_get_data(bd);
> > > - bool on;
> > > + bool blank = backlight_is_blank(bd);
> > >
> > > - fbtft_par_dbg(DEBUG_BACKLIGHT, par,
> > > -   "%s: power=%d, fb_blank=%d\n",
> > > -   __func__, bd->props.power, bd->props.fb_blank);
> > > + fbtft_par_dbg(DEBUG_BACKLIGHT, par, "%s: blank=%d\n", __func__,
> > > blank);
> > > - on = !backlight_is_blank(bd);
> > >   /* Onboard backlight connected to GPIO0 on SSD1351, GPIO1 unused */
> > > - write_reg(par, 0xB5, on ? 0x03 : 0x02);
> > > + write_reg(par, 0xB5, !blank ? 0x03 : 0x02);
> > >
> > >   return 0;
> > >  }
> > >
> > > --
> > > 2.34.1
> >
> > For debugging purposes here, would there be any point in logging 
> > props.state?
> > As in
> >
> > fbtft_par_dbg(DEBUG_BACKLIGHT, par,
> > - "%s: power=%d, fb_blank=%d\n",
> > - __func__, bd->props.power, bd->props.fb_blank);
> > + "%s: power=%d, state=%u\n",
> > + __func__, bd->props.power, bd->props.state);
>
> Thanks for the suggestion - and the reviews!
>
> I was tempted to just remove the debugging.
> If we require debugging, then this could be added in the backlight core,
> thus everyone would benefit from it.

I had the same instinct to remove the debug print here (esp. ones with
__func__ in them) but I think that's probably a much more widely scoped
clean up for fbtft ;-).

On that basis:
Reviewed-by: Daniel Thompson 


Daniel.


Re: [PATCH 10/15] staging: fbtft: core: Introduce backlight_is_blank()

2023-01-09 Thread Andy Shevchenko
On Sat, Jan 07, 2023 at 07:26:24PM +0100, Sam Ravnborg via B4 Submission 
Endpoint wrote:
> From: Sam Ravnborg 
> 
> Avoiding direct access to backlight_properties.props.
> 
> Access to the deprecated props.fb_blank replaced by backlight_is_blank().
> Access to props.power is dropped - it was only used for debug.

...

> + if (blank)
>   gpiod_set_value(par->gpio.led[0], !polarity);
> + else
> + gpiod_set_value(par->gpio.led[0], polarity);

if (blank)
polarity = !polarity;

gpiod_set_value(par->gpio.led[0], polarity);

?

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 09/15] staging: fbtft: fb_ssd1351.c: Introduce backlight_is_blank()

2023-01-09 Thread Andy Shevchenko
On Sat, Jan 07, 2023 at 07:26:23PM +0100, Sam Ravnborg via B4 Submission 
Endpoint wrote:
> From: Sam Ravnborg 
> 
> Avoiding direct access to backlight_properties.props.
> 
> Access to the deprecated props.fb_blank replaced by backlight_is_blank().
> Access to props.power is dropped - it was only used for debug.

> Signed-off-by: Sam Ravnborg 
> Cc: Stephen Kitt 
> Cc: Greg Kroah-Hartman 
> Cc: Daniel Thompson 
> Cc: Andy Shevchenko 

> Cc: linux-fb...@vger.kernel.org

Not sure why you have this (at least) explicitly mentioned as get_maintainer.pl
can generate it and git send-email can utilize the script output...

...

> - write_reg(par, 0xB5, on ? 0x03 : 0x02);
> + write_reg(par, 0xB5, !blank ? 0x03 : 0x02);

Why not positive conditional?

write_reg(par, 0xB5, blank ? 0x02 : 0x03);

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 15/15] backlight: backlight: Drop the deprecated fb_blank property

2023-01-09 Thread Daniel Thompson
On Sat, Jan 07, 2023 at 07:26:29PM +0100, Sam Ravnborg via B4 Submission 
Endpoint wrote:
> From: Sam Ravnborg 
>
> With all users gone remove the deprecated fb_blank member in
> backlight_properties.
>
> Signed-off-by: Sam Ravnborg 
> Cc: Lee Jones 
> Cc: Daniel Thompson 
> Cc: Jingoo Han 


Reviewed-by: Daniel Thompson 


PS Please don't treat this like a maintainer Acked-by: and merge it
   (Lee's not on holiday so work with Lee to figure out the merge
   strategy ;-) ).


Re: [PATCH 14/15] backlight: tosa: Use backlight helper

2023-01-09 Thread Daniel Thompson
On Sat, Jan 07, 2023 at 07:26:28PM +0100, Sam Ravnborg via B4 Submission 
Endpoint wrote:
> From: Stephen Kitt 
>
> Instead of retrieving the backlight brightness in struct
> backlight_properties manually, and then checking whether the backlight
> should be on at all, use backlight_get_brightness() which does all
> this and insulates this from future changes.
>
> Signed-off-by: Stephen Kitt 
> Signed-off-by: Sam Ravnborg 

Just for completeness...
Reviewed-by: Daniel Thompson 


Daniel.


Re: [PATCH 13/15] backlight: omap1: Use backlight helpers

2023-01-09 Thread Daniel Thompson
On Sat, Jan 07, 2023 at 07:26:27PM +0100, Sam Ravnborg via B4 Submission 
Endpoint wrote:
> From: Sam Ravnborg 
>
> Rework backlight handling to avoid access to the deprecated
> backlight_properties.fb_blank member.
>
> The rework includes removal of get_brightness() operation,
> because there was no read back from HW so no use for it.
>
> Signed-off-by: Sam Ravnborg 
> Cc: Lee Jones 
> Cc: Daniel Thompson 
> Cc: Jingoo Han 

This rework will result in additional calls to omapbl_send_enable()
during updates so, if anyone who cares about omap1 is filtering LKML,
then a Tested-by: would be nice.

However, I doubt the additional calls will do any harm so even with
that absent:
Reviewed-by: Daniel Thompson 


Daniel.


[Bug 216095] sysfs: cannot create duplicate filename '/devices/platform/of-display'

2023-01-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=216095

--- Comment #12 from Michal Suchánek (msucha...@suse.de) ---
Thanks for the DT dumps.

So you really do have two outputs but the problem is likely different:

The hardware-specific driver gets initialized earlier, and then the offb/ofdrm
driver cannot get the resources for the framebuffer because the native driver
is already using them.

That should be fine but the message it prints is not exactly clear.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH 01/15] video: fbdev: atmel_lcdfb: Rework backlight handling

2023-01-09 Thread Jani Nikula
On Sat, 07 Jan 2023, Sam Ravnborg via B4 Submission Endpoint 
 wrote:
> From: Sam Ravnborg 
>
> The atmel_lcdfb had code to save/restore power state.
> This is not needed so drop it.
>
> Introduce backlight_is_brightness() to make logic simpler.

backlight_is_brightness?

BR,
Jani.


>
> Signed-off-by: Sam Ravnborg 
> Cc: Nicolas Ferre 
> Cc: Alexandre Belloni 
> Cc: Ludovic Desroches 
> Cc: linux-fb...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> ---
>  drivers/video/fbdev/atmel_lcdfb.c | 24 +++-
>  1 file changed, 3 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/video/fbdev/atmel_lcdfb.c 
> b/drivers/video/fbdev/atmel_lcdfb.c
> index 1fc8de4ecbeb..d297b3892637 100644
> --- a/drivers/video/fbdev/atmel_lcdfb.c
> +++ b/drivers/video/fbdev/atmel_lcdfb.c
> @@ -49,7 +49,6 @@ struct atmel_lcdfb_info {
>   struct clk  *lcdc_clk;
>  
>   struct backlight_device *backlight;
> - u8  bl_power;
>   u8  saved_lcdcon;
>  
>   u32 pseudo_palette[16];
> @@ -109,32 +108,18 @@ static u32 contrast_ctr = ATMEL_LCDC_PS_DIV8
>  static int atmel_bl_update_status(struct backlight_device *bl)
>  {
>   struct atmel_lcdfb_info *sinfo = bl_get_data(bl);
> - int power = sinfo->bl_power;
> - int brightness = bl->props.brightness;
> + int brightness;
>  
> - /* REVISIT there may be a meaningful difference between
> -  * fb_blank and power ... there seem to be some cases
> -  * this doesn't handle correctly.
> -  */
> - if (bl->props.fb_blank != sinfo->bl_power)
> - power = bl->props.fb_blank;
> - else if (bl->props.power != sinfo->bl_power)
> - power = bl->props.power;
> -
> - if (brightness < 0 && power == FB_BLANK_UNBLANK)
> - brightness = lcdc_readl(sinfo, ATMEL_LCDC_CONTRAST_VAL);
> - else if (power != FB_BLANK_UNBLANK)
> - brightness = 0;
> + brightness = backlight_get_brightness(bl);
>  
>   lcdc_writel(sinfo, ATMEL_LCDC_CONTRAST_VAL, brightness);
> +
>   if (contrast_ctr & ATMEL_LCDC_POL_POSITIVE)
>   lcdc_writel(sinfo, ATMEL_LCDC_CONTRAST_CTR,
>   brightness ? contrast_ctr : 0);
>   else
>   lcdc_writel(sinfo, ATMEL_LCDC_CONTRAST_CTR, contrast_ctr);
>  
> - bl->props.fb_blank = bl->props.power = sinfo->bl_power = power;
> -
>   return 0;
>  }
>  
> @@ -155,8 +140,6 @@ static void init_backlight(struct atmel_lcdfb_info *sinfo)
>   struct backlight_properties props;
>   struct backlight_device *bl;
>  
> - sinfo->bl_power = FB_BLANK_UNBLANK;
> -
>   if (sinfo->backlight)
>   return;
>  
> @@ -173,7 +156,6 @@ static void init_backlight(struct atmel_lcdfb_info *sinfo)
>   sinfo->backlight = bl;
>  
>   bl->props.power = FB_BLANK_UNBLANK;
> - bl->props.fb_blank = FB_BLANK_UNBLANK;
>   bl->props.brightness = atmel_bl_get_brightness(bl);
>  }

-- 
Jani Nikula, Intel Open Source Graphics Center