Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Eugenio Perez Martin
On Thu, Dec 21, 2023 at 4:07 PM Dragos Tatulea  wrote:
>
> On Thu, 2023-12-21 at 15:55 +0100, Eugenio Perez Martin wrote:
> > On Thu, Dec 21, 2023 at 3:38 PM Dragos Tatulea  wrote:
> > >
> > > On Thu, 2023-12-21 at 13:08 +0100, Eugenio Perez Martin wrote:
> > > > On Thu, Dec 21, 2023 at 12:52 PM Dragos Tatulea  
> > > > wrote:
> > > > >
> > > > > On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> > > > > > On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > The virtio spec doesn't allow changing virtqueue 
> > > > > > > > > > > addresses after
> > > > > > > > > > > DRIVER_OK. Some devices do support this operation when 
> > > > > > > > > > > the device is
> > > > > > > > > > > suspended. The 
> > > > > > > > > > > VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > > > > > > advertises this support as a backend features.
> > > > > > > > > >
> > > > > > > > > > There's an ongoing effort in virtio spec to introduce the 
> > > > > > > > > > suspend state.
> > > > > > > > > >
> > > > > > > > > > So I wonder if it's better to just allow such behaviour?
> > > > > > > > >
> > > > > > > > > Actually I mean, allow drivers to modify the parameters 
> > > > > > > > > during suspend
> > > > > > > > > without a new feature.
> > > > > > > > >
> > > > > > > >
> > > > > > > > That would be ideal, but how do userland checks if it can 
> > > > > > > > suspend +
> > > > > > > > change properties + resume?
> > > > > > >
> > > > > > > As discussed, it looks to me the only device that supports 
> > > > > > > suspend is
> > > > > > > simulator and it supports change properties.
> > > > > > >
> > > > > > > E.g:
> > > > > > >
> > > > > > > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 
> > > > > > > idx,
> > > > > > >   u64 desc_area, u64 driver_area,
> > > > > > >   u64 device_area)
> > > > > > > {
> > > > > > > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > > > > > > struct vdpasim_virtqueue *vq = >vqs[idx];
> > > > > > >
> > > > > > > vq->desc_addr = desc_area;
> > > > > > > vq->driver_addr = driver_area;
> > > > > > > vq->device_addr = device_area;
> > > > > > >
> > > > > > > return 0;
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > > So in the current kernel master it is valid to set a different vq
> > > > > > address while the device is suspended in vdpa_sim. But it is not 
> > > > > > valid
> > > > > > in mlx5, as the FW will not be updated in resume (Dragos, please
> > > > > > correct me if I'm wrong). Both of them return success.
> > > > > >
> > > > > In the current state, there is no resume. HW Virtqueues will just get 
> > > > > re-created
> > > > > with the new address.
> > > > >
> > > >
> > > > Oh, then all of this is effectively transparent to the userspace
> > > > except for the time it takes?
> > > >
> > > Not quite: mlx5_vdpa_set_vq_address will save the vq address only on the 
> > > SW vq
> > > representation. Only later will it will call into the FW to update the 
> > > FW. Later
> > > means:
> > > - On DRIVER_OK state, when the VQs get created.
> > > - On .set_map when the VQs get re-created (before this series) / updated 
> > > (after
> > > this series)
> > > - On .resume (after this series).
> > >
> > > So if the .set_vq_address is called when the VQ is in DRIVER_OK but not
> > > suspended those addresses will be set later for later.
> > >
> >
> > Ouch, that is more in the line of my thoughts :(.
> >
> > > > In that case you're right, we don't need feature flags. But I think it
> > > > would be great to also move the error return in case userspace tries
> > > > to modify vq parameters out of suspend state.
> > > >
> > > On the driver side or on the core side?
> > >
> >
> > Core side.
> >
> Checking my understanding: instead of the feature flags there would be a check
> (for .set_vq_addr and .set_vq_state) to return an error if they are called 
> under
> DRIVER_OK and not suspended state?
>

Yes, correct. Per Jason's message, it should be enough with two
independent series:
* Patches 6, 7 and 8 of this series, just checking for suspend state
and not feature flags.
* Your v2.

Thanks!

> > It does not have to be part of this series, I meant it can be proposed
> > in a separate series and applied before the parent driver one.
> >
> > > Thanks
> > > > Thanks!
> > > >
> > > >
> > > > > > How can we know in the destination QEMU if it is valid to suspend &
> > > > > > set address? 

Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Masahiro Yamada
On Fri, Dec 22, 2023 at 3:08 PM Luis Chamberlain  wrote:
>
> On Thu, Dec 21, 2023 at 10:07:13PM -0800, Luis Chamberlain wrote:
> >
> > If we want to go bananas we could even get a graph of size of modules
>
> Sorry I meant size of number of symbols Vs cost.
>
>  Luis



But, 1/4 is really a bug-fix, isn't it?


ksymtab was previously 8-byte aligned for CONFIG_64BIT,
but now is only 4-byte aligned.


$ git show ddb5cdbafaaa^:include/linux/export.h |
  head -n66 | tail -n5
struct kernel_symbol {
unsigned long value;
const char *name;
const char *namespace;
};


$ git show ddb5cdbafaaa^:include/asm-generic/export.h |
   head -23 | tail -8
#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
#define KSYM_ALIGN 4
#elif defined(CONFIG_64BIT)
#define KSYM_ALIGN 8
#else
#define KSYM_ALIGN 4
#endif




In the old behavior,  used C code
for producing ksymtab, hence it was naturally
aligned by the compiler. (unsigned long and pointer
require 8-byte alignment for CONFIG_64BIT)


 explicitly required
8-byte alignment for CONFIG_64BIT.





In the current behavior, 
produces all ksymtab by using inline assembler,
but it hard-codes ".balign 4".











-- 
Best Regards
Masahiro Yamada



Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 10:07:13PM -0800, Luis Chamberlain wrote:
> 
> If we want to go bananas we could even get a graph of size of modules

Sorry I meant size of number of symbols Vs cost.

 Luis



Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Luis Chamberlain
On Fri, Dec 22, 2023 at 01:01:23AM +0900, Masahiro Yamada wrote:
> On Thu, Dec 21, 2023 at 7:22 PM Masahiro Yamada  wrote:
> >
> > On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> > >
> > > From: Helge Deller 
> > >
> > > An alignment of 4 bytes is wrong for 64-bit platforms which don't define
> > > CONFIG_HAVE_ARCH_PREL32_RELOCATIONS (which then store 64-bit pointers).
> > > Fix their alignment to 8 bytes.
> > >
> > > Signed-off-by: Helge Deller 
> >
> >
> > This is correct.
> >
> > Acked-by: Masahiro Yamada 
> >
> > Please add
> >
> >
> > Fixes: ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
> >
> >
> 
> 
> If there is no objection, I will pick this up
> to linux-kbuild/fixes.

The new selftests I've suggested should help get perf data to cover both
modules and built-in kernel symbols given find_symbol() will first hit
built-in symbols first before modules with one caveat: we'd want to extend
the selftest with a part which builds a module built-in with also tons
of other symbols.

So I'm all for you taking this but I don't think we need to rush for the
same reasons I mentioned in my reply to Helge.

I think it would be nice to get real perf data with perf stat as I
suggested, and include that in the commit logs. I think it would also be
useful to include a description about the fact that there is no real fix
and that the performance hit is all that happens as the architecture
just emulates the aligment. In the worst case, if exception handlers
are broken we could crash but that is rare although it does happen.

If we want to go bananas we could even get a graph of size of modules
Vs cost on misaligment as a relationship with time. Without this, frankly
cost on "performance" is artificial.

Thoughts?

  Luis



Re: [PATCH 2/4] modules: Ensure 64-bit alignment on __ksymtab_* sections

2023-12-21 Thread Luis Chamberlain
On Wed, Nov 22, 2023 at 11:18:12PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> 
> On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> 64-bit pointers into the __ksymtab* sections.
> Make sure that those sections will be correctly aligned at module link time,
> otherwise unaligned memory accesses may happen at runtime.

The ramifications are not explained there. You keep sending me patches
with this and we keep doing a nose dive on this. It means I have to do
more work. So as I had suggested with your patch which I merged in
commit 87c482bdfa79 ("modules: Ensure natural alignment for
.altinstructions and __bug_table sections") please clarify the
impact of not merging this patch. Also last time you noticed the
misalignment due to a faulty exception handler, please mention how
you found this out now.

And since this is not your first patch on the exact related subject
I'd appreciate if you can show me perf stat results differences between
having and not having this patch merged. Why? Because we talk about
a performance penalthy, but we are not saying how much, and since this
is an ongoing thing, we might as well have a tool belt with ways to
measure such performance impact to bring clarity and value to this
and future related patches.

> The __kcrctab* sections store 32-bit entities, so use ALIGN(4) for those.

I've given some thought about how to test this. Sadly perf kallsysms
just opens the /proc/kallsysms file, but that's fine, we need our own
test.

I think a 3 new simple modules selftest would do it and running perf
stat on it. One module, let us call it module A which constructs its own
name space prefix for its exported symbols and has tons of silly symbols
for arbitrary data, whatever. We then have module B which refers to a
few arbitrary symbols from module A, hopefully spread out linearly, so
if module A had 10,000 symbols, we'd have module A refer to a symbol
ever 1,000 symbols. Finally we want a symbol C which has say, 50,000
symbols all of which will not be used at all by the first two modules,
but the selftest will load module C first, prior to calling modprobe B.

We'll stress test this way two calls which use find_symbol():

1) Upon load of B it will trigger simplify_symbols() to look for the
symbol it uses from the module A with tons of symbols. That's an
indirect way for us to call resolve_symbol_wait() from module A without
having to use symbol_get() which want to remove as exported as it is
just a hack which should go away. Our goal is for us to test
resolve_symbol() which will call find_symbol() and that will eventually
look for the symbol on module A with:

  find_exported_symbol_in_section()

That uses bsearch() so a binary search for the symbol and we'd end up
hitting the misalignments here. Binary search will at worst be O(log(n))
and so the only way to aggreviate the search will be to add tons of
symbols to A, and have B use a few of them.

2) When you load B, userspace will at first load A as depmod will inform
userspace A goes before B. Upon B's load towards the end right before
we call module B's init routine we get complete_formation() called on
the module. That will first check for duplicate symbols with the call
to verify_exported_symbols(). That is when we'll force iteration on
module C's insane symbol list.

The selftests just runs

perf stat -e pick-your-poison-for-misalignments 
tools/testing/selftests/kmod/ksymtab.sh

Where ksymtab.sh is your new script which calls:

modprobe C
modprobe B

I say pick-your-poison-for-misalignments because I am not sure what is
best here.

Thoughts?

> Signed-off-by: Helge Deller 
> Cc:  # v6.0+

That's a stretch without any data, don't you think?

 Luis



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-21 Thread Kees Cook



On December 21, 2023 4:16:56 AM PST, Michael Ellerman  
wrote:
>Cc +Kees
>
>Christophe Leroy  writes:
>> Declaring rodata_enabled and mark_rodata_ro() at all time
>> helps removing related #ifdefery in C files.
>>
>> Signed-off-by: Christophe Leroy 
>> ---
>>  include/linux/init.h |  4 
>>  init/main.c  | 21 +++--
>>  2 files changed, 7 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/init.h b/include/linux/init.h
>> index 01b52c9c7526..d2b47be38a07 100644
>> --- a/include/linux/init.h
>> +++ b/include/linux/init.h
>> @@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
>>  
>>  extern struct file_system_type rootfs_fs_type;
>>  
>> -#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
>>  extern bool rodata_enabled;
>> -#endif
>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>  void mark_rodata_ro(void);
>> -#endif
>>  
>>  extern void (*late_time_init)(void);
>>  
>> diff --git a/init/main.c b/init/main.c
>> index e24b0780fdff..807df08c501f 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
>>  early_param("rodata", set_debug_rodata);
>>  #endif
>>  
>> -#ifdef CONFIG_STRICT_KERNEL_RWX
>>  static void mark_readonly(void)
>>  {
>> -if (rodata_enabled) {
>> +if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {

I think this will break without rodata_enabled actual existing on other 
architectures. (Only declaration was made visible, not the definition, which is 
above here and still behind ifdefs?)

-Kees

>>  /*
>>   * load_module() results in W+X mappings, which are cleaned
>>   * up with call_rcu().  Let's make sure that queued work is
>> @@ -1409,20 +1408,14 @@ static void mark_readonly(void)
>>  rcu_barrier();
>>  mark_rodata_ro();
>>  rodata_test();
>> -} else
>> +} else if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) {
>>  pr_info("Kernel memory protection disabled.\n");
>> +} else if (IS_ENABLED(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)) {
>> +pr_warn("Kernel memory protection not selected by kernel 
>> config.\n");
>> +} else {
>> +pr_warn("This architecture does not have kernel memory 
>> protection.\n");
>> +}
>>  }
>> -#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
>> -static inline void mark_readonly(void)
>> -{
>> -pr_warn("Kernel memory protection not selected by kernel config.\n");
>> -}
>> -#else
>> -static inline void mark_readonly(void)
>> -{
>> -pr_warn("This architecture does not have kernel memory protection.\n");
>> -}
>> -#endif
>>  
>>  void __weak free_initmem(void)
>>  {
>> -- 
>> 2.41.0

-- 
Kees Cook



Re: [PATCH v2] eventfs: Fix file and directory uid and gid ownership

2023-12-21 Thread Google
On Thu, 21 Dec 2023 19:07:57 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> It was reported that when mounting the tracefs file system with a gid
> other than root, the ownership did not carry down to the eventfs directory
> due to the dynamic nature of it.
> 
> A fix was done to solve this, but it had two issues.
> 
> (a) if the attr passed into update_inode_attr() was NULL, it didn't do
> anything. This is true for files that have not had a chown or chgrp
> done to itself or any of its sibling files, as the attr is allocated
> for all children when any one needs it.
> 
>  # umount /sys/kernel/tracing
>  # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt
> 
>  # ls -ld /mnt/events/sched
> drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/
> 
>  # ls -ld /mnt/events/sched/sched_switch
> drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/
> 
> But when checking the files:
> 
>  # ls -l /mnt/events/sched/sched_switch
> total 0
> -rw-r- 1 root root 0 Dec 21 13:12 enable
> -rw-r- 1 root root 0 Dec 21 13:12 filter
> -r--r- 1 root root 0 Dec 21 13:12 format
> -r--r- 1 root root 0 Dec 21 13:12 hist
> -r--r- 1 root root 0 Dec 21 13:12 id
> -rw-r- 1 root root 0 Dec 21 13:12 trigger
> 
> (b) When the attr does not denote the UID or GID, it defaulted to using
> the parent uid or gid. This is incorrect as changing the parent
> uid or gid will automatically change all its children.
> 
>  # chgrp tracing /mnt/events/timer
> 
>  # ls -ld /mnt/events/timer
> drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer
> 
>  # ls -l /mnt/events/timer
> total 0
> -rw-r- 1 root root0 Dec 21 14:35 enable
> -rw-r- 1 root root0 Dec 21 14:35 filter
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
> drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start
> 
> At first it was thought that this could be easily fixed by just making the
> default ownership of the superblock when it was mounted. But this does not
> handle the case of:
> 
>  # chgrp tracing instances
>  # mkdir instances/foo
> 
> If the superblock was used, then the group ownership would be that of what
> it was when it was mounted, when it should instead be "tracing".
> 
> Instead, set a flag for the top level eventfs directory ("events") to flag
> which eventfs_inode belongs to it.
> 
> Since the "events" directory's dentry and inode are never freed, it does
> not need to use its attr field to restore its mode and ownership. Use the
> this eventfs_inode's attr as the default ownership for all the files and
> directories underneath it.
> 
> When the events eventfs_inode is created, it sets its ownership to its
> parent uid and gid. As the events directory is created at boot up before
> it gets mounted, this will always be uid=0 and gid=0. If it's created via
> an instance, then it will take the ownership of the instance directory.
> 
> When the file system is mounted, it will update all the gids if one is
> specified. This will have a callback to update the events evenfs_inode's
> default entries.
> 
> When a file or directory is created under the events directory, it will
> walk the ei->dentry parents until it finds the evenfs_inode that belongs
> to the events directory to retrieve the default uid and gid values.
> 
> Link: 
> https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/
> 

Looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 
Tested-by: Masami Hiramatsu (Google) 

Thank you,

> Cc: sta...@vger.kernel.org
> Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to 
> parent uid and gid")
> Reported-by: Linus Torvalds 
> Signed-off-by: Steven Rostedt (Google) 
> ---
> Changes since v1: 
> https://lore.kernel.org/linux-trace-kernel/20231221173812.51fe6...@gandalf.local.home
> 
> - While writing kselftests to test ownership, I found a bug. The remount
>   option allows for an update of the gid. If it is specified, then all
>   dentries are traversed in tracefs and eventfs and the gid is updated.
> 
>   The bug is that a eventfs that does not have a dentry, but had its gid
>   updated, the old gid is still stored, and when the dentry is created, it
>   will use the stored gid.
> 
>   Instead, check all eventfs_inodes that do not have a 

Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Jason Wang
On Thu, Dec 21, 2023 at 3:47 PM Eugenio Perez Martin
 wrote:
>
> On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> >
> > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> >  wrote:
> > >
> > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  
> > > > > wrote:
> > > > > >
> > > > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > > > DRIVER_OK. Some devices do support this operation when the device is
> > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > advertises this support as a backend features.
> > > > >
> > > > > There's an ongoing effort in virtio spec to introduce the suspend 
> > > > > state.
> > > > >
> > > > > So I wonder if it's better to just allow such behaviour?
> > > >
> > > > Actually I mean, allow drivers to modify the parameters during suspend
> > > > without a new feature.
> > > >
> > >
> > > That would be ideal, but how do userland checks if it can suspend +
> > > change properties + resume?
> >
> > As discussed, it looks to me the only device that supports suspend is
> > simulator and it supports change properties.
> >
> > E.g:
> >
> > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >   u64 desc_area, u64 driver_area,
> >   u64 device_area)
> > {
> > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > struct vdpasim_virtqueue *vq = >vqs[idx];
> >
> > vq->desc_addr = desc_area;
> > vq->driver_addr = driver_area;
> > vq->device_addr = device_area;
> >
> > return 0;
> > }
> >
>
> So in the current kernel master it is valid to set a different vq
> address while the device is suspended in vdpa_sim. But it is not valid
> in mlx5, as the FW will not be updated in resume (Dragos, please
> correct me if I'm wrong). Both of them return success.
>
> How can we know in the destination QEMU if it is valid to suspend &
> set address? Should we handle this as a bugfix and backport the
> change?

Good point.

We probably need to do backport, this seems to be the easiest way.
Theoretically, userspace may assume this behavior (though I don't
think there would be a user that depends on the simulator).

>
> > >
> > > The only way that comes to my mind is to make sure all parents return
> > > error if userland tries to do it, and then fallback in userland.
> >
> > Yes.
> >
> > > I'm
> > > ok with that, but I'm not sure if the current master & previous kernel
> > > has a coherent behavior. Do they return error? Or return success
> > > without changing address / vq state?
> >
> > We probably don't need to worry too much here, as e.g set_vq_address
> > could fail even without suspend (just at uAPI level).
> >
>
> I don't get this, sorry. I rephrased my point with an example earlier
> in the mail.

I mean currently, VHOST_SET_VRING_ADDR can fail. So userspace should
not assume it will always succeed.

Thanks

>




[PATCH v2] eventfs: Fix file and directory uid and gid ownership

2023-12-21 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

It was reported that when mounting the tracefs file system with a gid
other than root, the ownership did not carry down to the eventfs directory
due to the dynamic nature of it.

A fix was done to solve this, but it had two issues.

(a) if the attr passed into update_inode_attr() was NULL, it didn't do
anything. This is true for files that have not had a chown or chgrp
done to itself or any of its sibling files, as the attr is allocated
for all children when any one needs it.

 # umount /sys/kernel/tracing
 # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt

 # ls -ld /mnt/events/sched
drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/

 # ls -ld /mnt/events/sched/sched_switch
drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/

But when checking the files:

 # ls -l /mnt/events/sched/sched_switch
total 0
-rw-r- 1 root root 0 Dec 21 13:12 enable
-rw-r- 1 root root 0 Dec 21 13:12 filter
-r--r- 1 root root 0 Dec 21 13:12 format
-r--r- 1 root root 0 Dec 21 13:12 hist
-r--r- 1 root root 0 Dec 21 13:12 id
-rw-r- 1 root root 0 Dec 21 13:12 trigger

(b) When the attr does not denote the UID or GID, it defaulted to using
the parent uid or gid. This is incorrect as changing the parent
uid or gid will automatically change all its children.

 # chgrp tracing /mnt/events/timer

 # ls -ld /mnt/events/timer
drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer

 # ls -l /mnt/events/timer
total 0
-rw-r- 1 root root0 Dec 21 14:35 enable
-rw-r- 1 root root0 Dec 21 14:35 filter
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start

At first it was thought that this could be easily fixed by just making the
default ownership of the superblock when it was mounted. But this does not
handle the case of:

 # chgrp tracing instances
 # mkdir instances/foo

If the superblock was used, then the group ownership would be that of what
it was when it was mounted, when it should instead be "tracing".

Instead, set a flag for the top level eventfs directory ("events") to flag
which eventfs_inode belongs to it.

Since the "events" directory's dentry and inode are never freed, it does
not need to use its attr field to restore its mode and ownership. Use the
this eventfs_inode's attr as the default ownership for all the files and
directories underneath it.

When the events eventfs_inode is created, it sets its ownership to its
parent uid and gid. As the events directory is created at boot up before
it gets mounted, this will always be uid=0 and gid=0. If it's created via
an instance, then it will take the ownership of the instance directory.

When the file system is mounted, it will update all the gids if one is
specified. This will have a callback to update the events evenfs_inode's
default entries.

When a file or directory is created under the events directory, it will
walk the ei->dentry parents until it finds the evenfs_inode that belongs
to the events directory to retrieve the default uid and gid values.

Link: 
https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to 
parent uid and gid")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231221173812.51fe6...@gandalf.local.home

- While writing kselftests to test ownership, I found a bug. The remount
  option allows for an update of the gid. If it is specified, then all
  dentries are traversed in tracefs and eventfs and the gid is updated.

  The bug is that a eventfs that does not have a dentry, but had its gid
  updated, the old gid is still stored, and when the dentry is created, it
  will use the stored gid.

  Instead, check all eventfs_inodes that do not have a dentry, and traverse
  its children to make sure their gids are updated as well.

 fs/tracefs/event_inode.c | 105 +++
 fs/tracefs/inode.c   |   6 +++
 fs/tracefs/internal.h|   2 +
 3 files changed, 103 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2ccc849a5bda..6e005a83114f 100644
--- 

Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2023-12-21 Thread Mina Almasry
On Thu, Dec 21, 2023 at 3:23 PM Shakeel Butt  wrote:
>
> On Wed, Dec 20, 2023 at 01:45:01PM -0800, Mina Almasry wrote:
> > Add the netmem_ref type, an abstraction for network memory.
> >
> > To add support for new memory types to the net stack, we must first
> > abstract the current memory type. Currently parts of the net stack
> > use struct page directly:
> >
> > - page_pool
> > - drivers
> > - skb_frag_t
> >
> > Originally the plan was to reuse struct page* for the new memory types,
> > and to set the LSB on the page* to indicate it's not really a page.
> > However, for compiler type checking we need to introduce a new type.
> >
> > netmem_ref is introduced to abstract the underlying memory type. Currently
> > it's a no-op abstraction that is always a struct page underneath. In
> > parallel there is an undergoing effort to add support for devmem to the
> > net stack:
> >
> > https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >
> > v3:
> >
> > - Modify struct netmem from a union of struct page + new types to an opaque
> >   netmem_ref type.  I went with:
> >
> >   +typedef void *__bitwise netmem_ref;
> >
> >   rather than this that Jakub recommended:
> >
> >   +typedef unsigned long __bitwise netmem_ref;
> >
> >   Because with the latter the compiler issues warnings to cast NULL to
> >   netmem_ref. I hope that's ok.
> >
>
> Can you share what the warning was? You might just need __force
> attribute. However you might need this __force a lot. I wonder if you
> can just follow struct encoded_page example verbatim here.
>

The warning is like so:

./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’:
./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a
function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes
integer from pointer without a cast [-Wint-conversion]
8 | #define NULL ((void *)0)
  |  ^
./include/net/page_pool/helpers.h:132:24: note: in expansion of macro
‘NULL’
  132 | return NULL;
  |^~~~

And happens in all the code where:

netmem_ref func()
{
return NULL;
}

It's fixable by changing the return to `return (netmem_ref NULL);` or
`return 0;`, but I feel like netmem_ref should be some type which
allows a cast from NULL implicitly.

Also as you (and patchwork) noticed, __bitwise should not be used with
void*; it's only meant for integer types. Sorry I missed that in the
docs and was not running make C=2.

-- 
Thanks,
Mina



Re: [PATCH 2/3] nvdimm/dimm_devs: fix kernel-doc for function params

2023-12-21 Thread Randy Dunlap
Hi Ira,

On 12/21/23 14:34, Ira Weiny wrote:
> Ira Weiny wrote:
>> Randy Dunlap wrote:
> 
> [snip]
> 
>>> diff -- a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
>>> --- a/drivers/nvdimm/dimm_devs.c
>>> +++ b/drivers/nvdimm/dimm_devs.c
>>> @@ -53,7 +53,10 @@ static int validate_dimm(struct nvdimm_d
>>>  
>>>  /**
>>>   * nvdimm_init_nsarea - determine the geometry of a dimm's namespace area
>>> - * @nvdimm: dimm to initialize
>>> + * @ndd: dimm to initialize
>>> + *
>>> + * Returns: %0 if the area is already valid, -errno on error,
>>
>> This is good...
>>
>>> ...otherwise an
>>> + * ND_CMD_* status code.
>>
>> I don't see where any of the ->ndctl() calls return an ND_CMD_* status
>> code.  What am I missing?

Probably nothing.
Yes, please drop that/fix that when you merge it.
Thanks.

> If you agree that this should be dropped I can clean it up as I'm trying
> to prep the nvdimm tree now and was hoping to take this series.
> 
> Ira
> 
>>
>> The rest looks good,
>> Ira

-- 
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html



Re: [PATCH 3/3] nvdimm/namespace: fix kernel-doc for function params

2023-12-21 Thread Ira Weiny
Randy Dunlap wrote:
> 
> 
> On 12/21/23 14:32, Ira Weiny wrote:
> > Randy Dunlap wrote:
> > 
> > [snip]
> > 
> >> @@ -1656,8 +1664,10 @@ static int select_pmem_id(struct nd_regi
> >>  /**
> >>   * create_namespace_pmem - validate interleave set labelling, retrieve 
> >> label0
> >>   * @nd_region: region with mappings to validate
> >> - * @nspm: target namespace to create
> >> + * @nd_mapping: container of dpa-resource-root + labels
> >>   * @nd_label: target pmem namespace label to evaluate
> >> + *
> >> + * Returns: the created  device on success or -errno on error
> > 
> > NIT: should this be ERR_PTR(-errno) on error?
> 
> Oh, for sure. Thanks for catching that.

I'll clean this up as well if you are good with me changing patch 2/3 as
well.

Ira



Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 01:45:02PM -0800, Mina Almasry wrote:
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index 65d1f6755f98..3180a54b2c68 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -636,9 +636,15 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>   msize += skb_shinfo(skb)->frags[i].bv_len;

Don't you need the above to cast to bio_vec to get bv_len? skb_frag_t
does not have bv_len anymore.

>  
> + /* The cast to struct bio_vec* here assumes the frags are
> +  * struct page based. WARN if there is no page in this skb.
> +  */
> + DEBUG_NET_WARN_ON_ONCE(
> + !skb_frag_page(_shinfo(skb)->frags[0]));
> +
>   iov_iter_bvec(_iter, ITER_SOURCE,
> -   skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> -   msize);
> +   (const struct bio_vec *)skb_shinfo(skb)->frags,
> +   skb_shinfo(skb)->nr_frags, msize);
>   iov_iter_advance(_iter, txm->frag_offset);
>  
>   do {
> -- 
> 2.43.0.472.g3155946c3a-goog
> 



Re: [PATCH 3/3] nvdimm/namespace: fix kernel-doc for function params

2023-12-21 Thread Randy Dunlap



On 12/21/23 14:32, Ira Weiny wrote:
> Randy Dunlap wrote:
> 
> [snip]
> 
>> @@ -1656,8 +1664,10 @@ static int select_pmem_id(struct nd_regi
>>  /**
>>   * create_namespace_pmem - validate interleave set labelling, retrieve 
>> label0
>>   * @nd_region: region with mappings to validate
>> - * @nspm: target namespace to create
>> + * @nd_mapping: container of dpa-resource-root + labels
>>   * @nd_label: target pmem namespace label to evaluate
>> + *
>> + * Returns: the created  device on success or -errno on error
> 
> NIT: should this be ERR_PTR(-errno) on error?

Oh, for sure. Thanks for catching that.

> Generally good to me though.
> 
> Reviewed-by: Ira Weiny 
> 
>>   */
>>  static struct device *create_namespace_pmem(struct nd_region *nd_region,
>>  struct nd_mapping *nd_mapping,

-- 
#Randy
https://people.kernel.org/tglx/notes-about-netiquette
https://subspace.kernel.org/etiquette.html



Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 01:45:01PM -0800, Mina Almasry wrote:
> Add the netmem_ref type, an abstraction for network memory.
> 
> To add support for new memory types to the net stack, we must first
> abstract the current memory type. Currently parts of the net stack
> use struct page directly:
> 
> - page_pool
> - drivers
> - skb_frag_t
> 
> Originally the plan was to reuse struct page* for the new memory types,
> and to set the LSB on the page* to indicate it's not really a page.
> However, for compiler type checking we need to introduce a new type.
> 
> netmem_ref is introduced to abstract the underlying memory type. Currently
> it's a no-op abstraction that is always a struct page underneath. In
> parallel there is an undergoing effort to add support for devmem to the
> net stack:
> 
> https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v3:
> 
> - Modify struct netmem from a union of struct page + new types to an opaque
>   netmem_ref type.  I went with:
> 
>   +typedef void *__bitwise netmem_ref;
> 
>   rather than this that Jakub recommended:
> 
>   +typedef unsigned long __bitwise netmem_ref;
> 
>   Because with the latter the compiler issues warnings to cast NULL to
>   netmem_ref. I hope that's ok.
> 

Can you share what the warning was? You might just need __force
attribute. However you might need this __force a lot. I wonder if you
can just follow struct encoded_page example verbatim here.




[PATCH] eventfs: Fix file and directory uid and gid ownership

2023-12-21 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

It was reported that when mounting the tracefs file system with a gid
other than root, the ownership did not carry down to the eventfs directory
due to the dynamic nature of it.

A fix was done to solve this, but it had two issues.

(a) if the attr passed into update_inode_attr() was NULL, it didn't do
anything. This is true for files that have not had a chown or chgrp
done to itself or any of its sibling files, as the attr is allocated
for all children when any one needs it.

 # umount /sys/kernel/tracing
 # mount -o rw,seclabel,relatime,gid=1000 -t tracefs nodev /mnt

 # ls -ld /mnt/events/sched
drwxr-xr-x 28 root rostedt 0 Dec 21 13:12 /mnt/events/sched/

 # ls -ld /mnt/events/sched/sched_switch
drwxr-xr-x 2 root rostedt 0 Dec 21 13:12 /mnt/events/sched/sched_switch/

But when checking the files:

 # ls -l /mnt/events/sched/sched_switch
total 0
-rw-r- 1 root root 0 Dec 21 13:12 enable
-rw-r- 1 root root 0 Dec 21 13:12 filter
-r--r- 1 root root 0 Dec 21 13:12 format
-r--r- 1 root root 0 Dec 21 13:12 hist
-r--r- 1 root root 0 Dec 21 13:12 id
-rw-r- 1 root root 0 Dec 21 13:12 trigger

(b) When the attr does not denote the UID or GID, it defaulted to using
the parent uid or gid. This is incorrect as changing the parent
uid or gid will automatically change all its children.

 # chgrp tracing /mnt/events/timer

 # ls -ld /mnt/events/timer
drwxr-xr-x 2 root tracing 0 Dec 21 14:34 /mnt/events/timer

 # ls -l /mnt/events/timer
total 0
-rw-r- 1 root root0 Dec 21 14:35 enable
-rw-r- 1 root root0 Dec 21 14:35 filter
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 hrtimer_start
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_expire
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 itimer_state
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 tick_stop
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_cancel
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_entry
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_expire_exit
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_init
drwxr-xr-x 2 root tracing 0 Dec 21 14:35 timer_start

At first it was thought that this could be easily fixed by just making the
default ownership of the superblock when it was mounted. But this does not
handle the case of:

 # chgrp tracing instances
 # mkdir instances/foo

If the superblock was used, then the group ownership would be that of what
it was when it was mounted, when it should instead be "tracing".

Instead, set a flag for the top level eventfs directory ("events") to flag
which eventfs_inode belongs to it.

Since the "events" directory's dentry and inode are never freed, it does
not need to use its attr field to restore its mode and ownership. Use the
this eventfs_inode's attr as the default ownership for all the files and
directories underneath it.

When the events eventfs_inode is created, it sets its ownership to its
parent uid and gid. As the events directory is created at boot up before
it gets mounted, this will always be uid=0 and gid=0. If it's created via
an instance, then it will take the ownership of the instance directory.

When the file system is mounted, it will update all the gids if one is
specified. This will have a callback to update the events evenfs_inode's
default entries.

When a file or directory is created under the events directory, it will
walk the ei->dentry parents until it finds the evenfs_inode that belongs
to the events directory to retrieve the default uid and gid values.

Link: 
https://lore.kernel.org/all/CAHk-=wiwQtUHvzwyZucDq8=gtw+anwscylhpfswrq84pjho...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: 0dfc852b6fe3 ("eventfs: Have event files and directories default to 
parent uid and gid")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 74 ++--
 fs/tracefs/inode.c   |  6 
 fs/tracefs/internal.h|  2 ++
 3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2ccc849a5bda..2d6b11195420 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -113,7 +113,14 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * determined by the parent directory.
 */
if (dentry->d_inode->i_mode & S_IFDIR) {
-   update_attr(>attr, iattr);
+   /*
+* The events directory dentry is never freed, unless its
+* part of an instance that is deleted. It's attr is the
+* default for its child files and directores.
+* Do not update it. It's not used for its own mode or ownership
+*/
+ 

Re: [PATCH 2/3] nvdimm/dimm_devs: fix kernel-doc for function params

2023-12-21 Thread Ira Weiny
Ira Weiny wrote:
> Randy Dunlap wrote:

[snip]

> > diff -- a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> > --- a/drivers/nvdimm/dimm_devs.c
> > +++ b/drivers/nvdimm/dimm_devs.c
> > @@ -53,7 +53,10 @@ static int validate_dimm(struct nvdimm_d
> >  
> >  /**
> >   * nvdimm_init_nsarea - determine the geometry of a dimm's namespace area
> > - * @nvdimm: dimm to initialize
> > + * @ndd: dimm to initialize
> > + *
> > + * Returns: %0 if the area is already valid, -errno on error,
> 
> This is good...
> 
> > ...otherwise an
> > + * ND_CMD_* status code.
> 
> I don't see where any of the ->ndctl() calls return an ND_CMD_* status
> code.  What am I missing?

If you agree that this should be dropped I can clean it up as I'm trying
to prep the nvdimm tree now and was hoping to take this series.

Ira

> 
> The rest looks good,
> Ira



Re: [PATCH 3/3] nvdimm/namespace: fix kernel-doc for function params

2023-12-21 Thread Ira Weiny
Randy Dunlap wrote:

[snip]

> @@ -1656,8 +1664,10 @@ static int select_pmem_id(struct nd_regi
>  /**
>   * create_namespace_pmem - validate interleave set labelling, retrieve label0
>   * @nd_region: region with mappings to validate
> - * @nspm: target namespace to create
> + * @nd_mapping: container of dpa-resource-root + labels
>   * @nd_label: target pmem namespace label to evaluate
> + *
> + * Returns: the created  device on success or -errno on error

NIT: should this be ERR_PTR(-errno) on error?

Generally good to me though.

Reviewed-by: Ira Weiny 

>   */
>  static struct device *create_namespace_pmem(struct nd_region *nd_region,
>   struct nd_mapping *nd_mapping,



Re: [PATCH 2/3] nvdimm/dimm_devs: fix kernel-doc for function params

2023-12-21 Thread Ira Weiny
Randy Dunlap wrote:
> Adjust kernel-doc notation to prevent warnings when using -Wall.
> 
> dimm_devs.c:59: warning: Function parameter or member 'ndd' not described in 
> 'nvdimm_init_nsarea'
> dimm_devs.c:59: warning: Excess function parameter 'nvdimm' description in 
> 'nvdimm_init_nsarea'
> dimm_devs.c:59: warning: No description found for return value of 
> 'nvdimm_init_nsarea'
> dimm_devs.c:728: warning: No description found for return value of 
> 'nd_pmem_max_contiguous_dpa'
> dimm_devs.c:773: warning: No description found for return value of 
> 'nd_pmem_available_dpa'
> dimm_devs.c:844: warning: Function parameter or member 'ndd' not described in 
> 'nvdimm_allocated_dpa'
> dimm_devs.c:844: warning: Excess function parameter 'nvdimm' description in 
> 'nvdimm_allocated_dpa'
> dimm_devs.c:844: warning: No description found for return value of 
> 'nvdimm_allocated_dpa'
> 
> Signed-off-by: Randy Dunlap 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/nvdimm/dimm_devs.c |   14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff -- a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -53,7 +53,10 @@ static int validate_dimm(struct nvdimm_d
>  
>  /**
>   * nvdimm_init_nsarea - determine the geometry of a dimm's namespace area
> - * @nvdimm: dimm to initialize
> + * @ndd: dimm to initialize
> + *
> + * Returns: %0 if the area is already valid, -errno on error,

This is good...

> ...otherwise an
> + * ND_CMD_* status code.

I don't see where any of the ->ndctl() calls return an ND_CMD_* status
code.  What am I missing?

The rest looks good,
Ira

>   */
>  int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd)
>  {
> @@ -722,6 +725,9 @@ static unsigned long dpa_align(struct nd
>   *  contiguous unallocated dpa range.
>   * @nd_region: constrain available space check to this reference region
>   * @nd_mapping: container of dpa-resource-root + labels
> + *
> + * Returns: %0 if there is an alignment error, otherwise the max
> + *   unallocated dpa range
>   */
>  resource_size_t nd_pmem_max_contiguous_dpa(struct nd_region *nd_region,
>  struct nd_mapping *nd_mapping)
> @@ -767,6 +773,8 @@ resource_size_t nd_pmem_max_contiguous_d
>   *
>   * Validate that a PMEM label, if present, aligns with the start of an
>   * interleave set.
> + *
> + * Returns: %0 if there is an alignment error, otherwise the unallocated dpa
>   */
>  resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
> struct nd_mapping *nd_mapping)
> @@ -836,8 +844,10 @@ struct resource *nvdimm_allocate_dpa(str
>  
>  /**
>   * nvdimm_allocated_dpa - sum up the dpa currently allocated to this label_id
> - * @nvdimm: container of dpa-resource-root + labels
> + * @ndd: container of dpa-resource-root + labels
>   * @label_id: dpa resource name of the form pmem-
> + *
> + * Returns: sum of the dpa allocated to the label_id
>   */
>  resource_size_t nvdimm_allocated_dpa(struct nvdimm_drvdata *ndd,
>   struct nd_label_id *label_id)



Re: [PATCH 1/3] nvdimm/btt: fix btt_blk_cleanup() kernel-doc

2023-12-21 Thread Ira Weiny
Randy Dunlap wrote:
> Correct the function parameters to prevent kernel-doc warnings:
> 
> btt.c:1567: warning: Function parameter or member 'nd_region' not described 
> in 'btt_init'
> btt.c:1567: warning: Excess function parameter 'maxlane' description in 
> 'btt_init'
> 
> Signed-off-by: Randy Dunlap 
> Cc: Vishal Verma 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Ira Weiny 

Reviewed-by: Ira Weiny 

> Cc: nvd...@lists.linux.dev
> ---
>  drivers/nvdimm/btt.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -- a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -1550,7 +1550,7 @@ static void btt_blk_cleanup(struct btt *
>   * @rawsize: raw size in bytes of the backing device
>   * @lbasize: lba size of the backing device
>   * @uuid:A uuid for the backing device - this is stored on media
> - * @maxlane: maximum number of parallel requests the device can handle
> + * @nd_region:nd_region for the REGION device
>   *
>   * Initialize a Block Translation Table on a backing device to provide
>   * single sector power fail atomicity.





Re: [PATCH net-next v3 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 1:45 PM Mina Almasry  wrote:
>
> Minor fix for virtio: code wanting to access the fields inside an skb
> frag should use the skb_frag_*() helpers, instead of accessing the
> fields directly. This allows for extensions where the underlying
> memory is not a page.
>
> Signed-off-by: Mina Almasry 

Reviewed-by: Shakeel Butt 



Re: [PATCH v3] ACPI: NFIT: Use cleanup.h helpers instead of devm_*()

2023-12-21 Thread Ira Weiny
Rafael J. Wysocki wrote:
> On Wed, Oct 18, 2023 at 6:28 AM Dan Williams  wrote:
> >
> > Michal Wilczynski wrote:
> > > The new cleanup.h facilities that arrived in v6.5-rc1 can replace the
> > > the usage of devm semantics in acpi_nfit_init_interleave_set(). That
> > > routine appears to only be using devm to avoid goto statements. The
> > > new __free() annotation at variable declaration time can achieve the same
> > > effect more efficiently.
> > >
> > > There is no end user visible side effects of this patch, I was
> > > motivated to send this cleanup to practice using the new helpers.
> > >
> > > Suggested-by: Dave Jiang 
> > > Suggested-by: Andy Shevchenko 
> > > Reviewed-by: Dave Jiang 
> > > Reviewed-by: Andy Shevchenko 
> > > Signed-off-by: Michal Wilczynski 
> > > ---
> > >
> > > Dan, would you like me to give you credit for the changelog changes
> > > with Co-developed-by tag ?
> >
> > Nope, this looks good to me, thanks for fixing it up.
> >
> > Reviewed-by: Dan Williams 
> 
> Are you going to apply it too, or should I take it?

I'm prepping the nvdimm tree for 6.8.  I will take it.

Ira



Re: [PATCH v7] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-12-21 Thread Alessandro Carminati
Hello Francis,
Thanks for the review.

Il giorno gio 21 dic 2023 alle ore 18:23 Francis Laniel
 ha scritto:
>
> Hi!
>
>
> Le lundi 4 décembre 2023, 22:46:35 CET Alessandro Carminati (Red Hat) a
> écrit :
> > In the kernel environment, situations frequently arise where identical
> > names are shared among symbols within both the core image and modules.
> > While this doesn't pose issues for the kernel's binary itself, it
> > complicates trace or probe operations using tools like kprobe.
> >
> > This patch introduces "kas_alias" to address this challenge.
> >
> > During the kernel's build process, just before linking the vmlinux
> > image in the "scripts/link-vmlinux.sh", symbol name frequencies are
> > collected.
> > This collection includes both the core kernel components and modules.
> > Subsequently, within the same action, the nm data relative to vmlinux
> > is modified by adding aliases based on the comprehensive symbol
> > information gathered.
> >
> > The collection process occurs in two phases:
> >
> > 1. First phase: Executed during the linking of vmlinux, "kas_alias" scans
> >all symbols provided by the 'nm' data against the vmlinux core image
> >and all objects used for module linkage. This phase requires all
> >modules objects to be produced at this stage, thereby adding a vmlinux
> >dependency for linking modules in 'scripts/Makefile.modfinal'.
> >
> > 2. Second phase: In a subsequent run in the same build, "kas_alias"
> >processes module objects and injects aliases into the objects' symbol
> >tables where necessary. This operation is done by modifying
> >'scripts/Makefile.modfinal' to include an action for each processed
> >module.
> >
> > Example:
> >
> > Consider the symbol "device_show", you can expect an output like the
> > following:
> >
> >  ~ # cat /proc/kallsyms | grep " name_show"
> > caa2bb4f01c8 t name_show
> > caa2bb4f01c8 t name_show@kernel_irq_irqdesc_c_264
> > caa2bb9c1a30 t name_show
> > caa2bb9c1a30 t name_show@drivers_pnp_card_c_186
> > caa2bbac4754 t name_show
> > caa2bbac4754 t name_show@drivers_regulator_core_c_678
> > caa2bbba4900 t name_show
> > caa2bbba4900 t name_show@drivers_base_power_wakeup_stats_c_93
> > caa2bbec4038 t name_show
> > caa2bbec4038 t name_show@drivers_rtc_sysfs_c_26
> > caa2bbecc920 t name_show
> > caa2bbecc920 t name_show@drivers_i2c_i2c_core_base_c_660
> > caa2bbed3840 t name_show
> > caa2bbed3840 t name_show@drivers_i2c_i2c_dev_c_100
> > caa2bbef7210 t name_show
> > caa2bbef7210 t name_show@drivers_pps_sysfs_c_66
> > caa2bbf03328 t name_show
> > caa2bbf03328 t name_show@drivers_hwmon_hwmon_c_72
> > caa2bbff6f3c t name_show
> > caa2bbff6f3c t name_show@drivers_remoteproc_remoteproc_sysfs_c_215
> > caa2bbff8d78 t name_show
> > caa2bbff8d78 t name_show@drivers_rpmsg_rpmsg_core_c_455
> > caa2bbfff7a4 t name_show
> > caa2bbfff7a4 t name_show@drivers_devfreq_devfreq_c_1395
> > caa2bc001f60 t name_show
> > caa2bc001f60 t name_show@drivers_extcon_extcon_c_389
> > caa2bc009890 t name_show
> > caa2bc009890 t name_show@drivers_iio_industrialio_core_c_1396
> > caa2bc01212c t name_show
> > caa2bc01212c t name_show@drivers_iio_industrialio_trigger_c_51
> > caa2bc025e2c t name_show
> > caa2bc025e2c t name_show@drivers_fpga_fpga_mgr_c_618
> > caa2a052102c t name_show  [hello]
> > caa2a052102c t name_show@hello_hello_c_8  [hello]
> > caa2a051955c t name_show  [rpmsg_char]
> > caa2a051955c t name_show@drivers_rpmsg_rpmsg_char_c_365   [rpmsg_char]
> >
> > where hello, is a plain helloworld module built OOT.
>
> Thank you for sending the v7 and sorry for the delay!
> I tested it and it works fine:
> root@vm-amd64:~# uname -r
> 6.7.0-rc6-00079-g6162f0ee7bcb
> root@vm-amd64:~# grep name_show /proc/kallsyms | head -5
> b400f230 t __pfx_pmu_name_show
> b400f240 t pmu_name_show
> b40fb1d0 t __pfx_name_show
> b40fb1e0 t name_show
> b40fb1e0 t name_show@kernel_irq_irqdesc_c_264
>
> I am curious about the module part, can you please indicate me how you tested
> it?

Are you referring to the process of building an OOT module and including
aliases?

If that's the query, the current build now generates an additional file
named modules.symbfreq, which contains the frequencies of symbols.
Here's an example:

~ # cat modules.symbfreq
[...]
bcm2835_handle_irq:1
bcm2836_arm_irqchip_handle_irq:1
dw_apb_ictl_handle_irq:1
gic_handle_irq:2
aic_handle_irq:1
[...]

To proceed, this file needs to be copied to the root directory of the OOT
module, following which the module can undergo the regular build process.

If this file is not found during the build, no aliases will be added to that
particular module.

>
> I found some nits in the code but nothing critical:

I experimented with several approaches to tackle this problem before
discovering one that proved satisfactory. 

Re: [PATCH v2 1/5] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-21 Thread Rafael J. Wysocki
On Thu, Dec 21, 2023 at 8:57 PM Rafael J. Wysocki  wrote:
>
> On Thu, Dec 21, 2023 at 4:24 PM Vincent Guittot
>  wrote:
> >
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
> >
> > Signed-off-by: Vincent Guittot 
> > ---
> >  drivers/cpufreq/cpufreq.c | 34 ++
> >  include/linux/cpufreq.h   | 10 ++
> >  2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..15bd41f9bb5e 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,38 @@ int cpufreq_get_policy(struct cpufreq_policy 
> > *policy, unsigned int cpu)
> >  }
> >  EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +
> > +/**
> > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> > + * @policy: cpufreq policy of the CPUs.
> > + *
> > + * Update the value of cpufreq pressure for all @cpus in the policy.
> > + */
> > +static void cpufreq_update_pressure(struct cpufreq_policy *policy)
> > +{
> > +   unsigned long max_capacity, capped_freq, pressure;
> > +   u32 max_freq;
> > +   int cpu;
> > +
> > +   cpu = cpumask_first(policy->related_cpus);
> > +   pressure = max_capacity = arch_scale_cpu_capacity(cpu);
>
> I would prefer two separate statements instead of the above.
>
> Other than this
>
> Acked-by: Rafael J. Wysocki 
>
> > +   capped_freq = policy->max;
> > +   max_freq = arch_scale_freq_ref(cpu);
> > +
> > +   /*
> > +* Handle properly the boost frequencies, which should simply clean
> > +* the thermal pressure value.
> > +*/
> > +   if (max_freq <= capped_freq)
> > +   pressure -= max_capacity;

Actually, it would be somewhat cleaner to do

pressure = 0;

here and

> > +   else
> > +   pressure -= mult_frac(max_capacity, capped_freq, max_freq);

pressure = max_capacity - mult_frac(max_capacity, capped_freq, max_freq);

and it would not be necessary to initialize pressure.

> > +
> > +   for_each_cpu(cpu, policy->related_cpus)
> > +   WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> > +}
> > +
> >  /**
> >   * cpufreq_set_policy - Modify cpufreq policy parameters.
> >   * @policy: Policy object to modify.
> > @@ -2618,6 +2650,8 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> > *policy,
> > policy->max = __resolve_freq(policy, policy->max, 
> > CPUFREQ_RELATION_H);
> > trace_cpu_frequency_limits(policy);
> >
> > +   cpufreq_update_pressure(policy);
> > +
> > policy->cached_target_freq = UINT_MAX;
> >
> > pr_debug("new min and max freqs are %u - %u kHz\n",
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index afda5f24d3dd..b1d97edd3253 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct 
> > cpufreq_policy *policy);
> >  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
> >  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
> >  bool has_target_index(void);
> > +
> > +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +   return per_cpu(cpufreq_pressure, cpu);
> > +}
> >  #else
> >  static inline unsigned int cpufreq_get(unsigned int cpu)
> >  {
> > @@ -263,6 +269,10 @@ static inline bool 
> > cpufreq_supports_freq_invariance(void)
> > return false;
> >  }
> >  static inline void disable_cpufreq(void) { }
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +   return 0;
> > +}
> >  #endif
> >
> >  #ifdef CONFIG_CPU_FREQ_STAT
> > --
> > 2.34.1
> >



Re: [PATCH v2 1/5] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-21 Thread Rafael J. Wysocki
On Thu, Dec 21, 2023 at 4:24 PM Vincent Guittot
 wrote:
>
> Provide to the scheduler a feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> filtered as the pressure will happen for dozens ms or more.
>
> Signed-off-by: Vincent Guittot 
> ---
>  drivers/cpufreq/cpufreq.c | 34 ++
>  include/linux/cpufreq.h   | 10 ++
>  2 files changed, 44 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..15bd41f9bb5e 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,38 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, 
> unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @policy: cpufreq policy of the CPUs.
> + *
> + * Update the value of cpufreq pressure for all @cpus in the policy.
> + */
> +static void cpufreq_update_pressure(struct cpufreq_policy *policy)
> +{
> +   unsigned long max_capacity, capped_freq, pressure;
> +   u32 max_freq;
> +   int cpu;
> +
> +   cpu = cpumask_first(policy->related_cpus);
> +   pressure = max_capacity = arch_scale_cpu_capacity(cpu);

I would prefer two separate statements instead of the above.

Other than this

Acked-by: Rafael J. Wysocki 

> +   capped_freq = policy->max;
> +   max_freq = arch_scale_freq_ref(cpu);
> +
> +   /*
> +* Handle properly the boost frequencies, which should simply clean
> +* the thermal pressure value.
> +*/
> +   if (max_freq <= capped_freq)
> +   pressure -= max_capacity;
> +   else
> +   pressure -= mult_frac(max_capacity, capped_freq, max_freq);
> +
> +   for_each_cpu(cpu, policy->related_cpus)
> +   WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> +}
> +
>  /**
>   * cpufreq_set_policy - Modify cpufreq policy parameters.
>   * @policy: Policy object to modify.
> @@ -2618,6 +2650,8 @@ static int cpufreq_set_policy(struct cpufreq_policy 
> *policy,
> policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> trace_cpu_frequency_limits(policy);
>
> +   cpufreq_update_pressure(policy);
> +
> policy->cached_target_freq = UINT_MAX;
>
> pr_debug("new min and max freqs are %u - %u kHz\n",
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index afda5f24d3dd..b1d97edd3253 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct 
> cpufreq_policy *policy);
>  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
>  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
>  bool has_target_index(void);
> +
> +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +   return per_cpu(cpufreq_pressure, cpu);
> +}
>  #else
>  static inline unsigned int cpufreq_get(unsigned int cpu)
>  {
> @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
> return false;
>  }
>  static inline void disable_cpufreq(void) { }
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +   return 0;
> +}
>  #endif
>
>  #ifdef CONFIG_CPU_FREQ_STAT
> --
> 2.34.1
>



Re: [PATCH] kernel/module: improve documentation for try_module_get()

2023-12-21 Thread Luis Chamberlain
On Thu, Dec 21, 2023 at 05:58:47PM +0100, Marco Pagani wrote:
> The sentence "this call will fail if the module is already being
> removed" is potentially confusing and may contradict the rest of the
> documentation. If one tries to get a module that has already been
> removed using a stale pointer, the kernel will crash.
> 
> Signed-off-by: Marco Pagani 

Thanks, patch applied!

  Luis



Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 17:35:22 +
Vincent Donnefort  wrote:

> @@ -5999,6 +6078,307 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
> *buffer, int order)
>  }
>  EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
>  

The kernel developers have agreed to allow loop variables to be declared in
loops. This will simplify these macros:



> +#define subbuf_page(off, start) \
> + virt_to_page((void *)(start + (off << PAGE_SHIFT)))
> +
> +#define foreach_subbuf_page(off, sub_order, start, page) \
> + for (off = 0, page = subbuf_page(0, start); \
> +  off < (1 << sub_order);\
> +  off++, page = subbuf_page(off, start))

#define foreach_subbuf_page(sub_order, start, page) \
for (int __off = 0, page = subbuf_page(0, (start)); \
 __off < (1 << (sub_order));\
 __off++, page = subbuf_page(__off, (start)))

And parameters as r-values should always be wrapped in parenthesis.

> +
> +static inline void subbuf_map_prepare(unsigned long subbuf_start, int order)
> +{
> + struct page *page;
> + int subbuf_off;

Then you can remove the "subbuf_off".

> +
> + /*
> +  * When allocating order > 0 pages, only the first struct page has a
> +  * refcount > 1. Increasing the refcount here ensures none of the struct
> +  * page composing the sub-buffer is freeed when the mapping is closed.

Nice catch by the way ;-)

-- Steve

> +  */
> + foreach_subbuf_page(subbuf_off, order, subbuf_start, page)
> + page_ref_inc(page);
> +}
> +
> +static inline void subbuf_unmap(unsigned long subbuf_start, int order)
> +{
> + struct page *page;
> + int subbuf_off;
> +
> + foreach_subbuf_page(subbuf_off, order, subbuf_start, page) {
> + page_ref_dec(page);
> + page->mapping = NULL;
> + }
> +}
> +



Re: [PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 17:35:22 +
Vincent Donnefort  wrote:

> @@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer 
> *buffer, int cpu, int f
>   return (dirty * 100) > (full * nr_pages);
>  }
>  
> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + if (unlikely(READ_ONCE(cpu_buffer->mapped))) {
> + /* Ensure the meta_page is ready */
> + smp_rmb();
> + WRITE_ONCE(cpu_buffer->meta_page->entries,
> +local_read(_buffer->entries));
> + WRITE_ONCE(cpu_buffer->meta_page->overrun,
> +local_read(_buffer->overrun));
> + WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched,
> +local_read(_buffer->pages_touched));
> + WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost,
> +local_read(_buffer->pages_lost));
> + }
> +}
> +
>  /*
>   * rb_wake_up_waiters - wake up tasks waiting for ring buffer input
>   *
> @@ -749,6 +773,18 @@ static void rb_wake_up_waiters(struct irq_work *work)
>  {
>   struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
> work);
>  
> + if (rbwork->is_cpu_buffer) {
> + struct ring_buffer_per_cpu *cpu_buffer;
> +
> + cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
> +   irq_work);
> + /*
> +  * If the waiter is a cpu_buffer, this might be due to a
> +  * userspace mapping. Let's update the meta-page.
> +  */
> + rb_update_meta_page(cpu_buffer);
> + }
> +
>   wake_up_all(>waiters);
>   if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
>   rbwork->wakeup_full = false;


I think this code would be cleaner if we did:

static void rb_update_meta_page(strucrt rb_irq_work *rbwork)
{
struct ring_buffer_per_cpu *cpu_buffer;

if (!rbwork->is_cpu_buffer)
return;

/*
 * If the waiter is a cpu_buffer, this might be due to a
 * userspace mapping. Let's update the meta-page.
 */
cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
  irq_work);

if (unlikely(READ_ONCE(cpu_buffer->mapped))) {

// I don't think we need the "unlikely"

/* Ensure the meta_page is ready */
smp_rmb();
WRITE_ONCE(cpu_buffer->meta_page->entries,
   local_read(_buffer->entries));
WRITE_ONCE(cpu_buffer->meta_page->overrun,
   local_read(_buffer->overrun));
WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched,
   local_read(_buffer->pages_touched));
WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost,
   local_read(_buffer->pages_lost));
}
}

/*
 * rb_wake_up_waiters - wake up tasks waiting for ring buffer input
 *
 * Schedules a delayed work to wake up any task that is blocked on the
 * ring buffer waiters queue.
 */
static void rb_wake_up_waiters(struct irq_work *work)
{
struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
work);

rb_update_meta_page(cpu_buffer);

wake_up_all(>waiters);
if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
rbwork->wakeup_full = false;
rbwork->full_waiters_pending = false;
wake_up_all(>full_waiters);
}
}


-- Steve



[PATCH v9 2/2] tracing: Allow user-space mapping of the ring-buffer

2023-12-21 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index f950648b0ba9..8c49489c5867 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -26,4 +26,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;/* Len of this struct */
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 46dbe22121e9..cfeaf2cd204e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = >iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(_types_lock);
 
iter->wait_index++;
@@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = >iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = >iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = >iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = >iter;
+
+   if (vma->vm_flags & VM_WRITE)
+   return -EPERM;
+
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+   vma->vm_ops = _buffers_vmops;
+
+   return ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+}
+
 static const struct file_operations tracing_buffers_fops = {
.open   = tracing_buffers_open,
.read   = tracing_buffers_read,
@@ 

[PATCH v9 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-21 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include 
+
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
+   __u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
+   __u32   read;   /* Number of bytes read on the 
reader subbuf */
+   } reader;
+
+   __u32   subbuf_size;/* Size of each subbuf 
including the header */
+   __u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */
+
+   __u32   meta_page_size; /* Size of the meta-page */
+   __u32   meta_struct_len;/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..2f3e0260db88 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -388,6 +389,7 @@ struct rb_irq_work {
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
+   boolis_cpu_buffer;
 };
 
 /*
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -739,6 +747,22 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
return (dirty * 100) > (full * nr_pages);
 }
 
+static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   if (unlikely(READ_ONCE(cpu_buffer->mapped))) {
+   /* Ensure the meta_page is ready */
+   smp_rmb();
+   WRITE_ONCE(cpu_buffer->meta_page->entries,
+  local_read(_buffer->entries));
+   WRITE_ONCE(cpu_buffer->meta_page->overrun,
+  

[PATCH v9 0/2] Introducing trace buffer mapping by user-space

2023-12-21 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However the later doesn't allow real time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Attached to this cover letter an example of consuming read for a
ring-buffer, using libtracefs.

Vincent

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

---

/* Need to access private struct to save counters */
struct kbuffer {
unsigned long long  timestamp;
long long   lost_events;
unsigned long   flags;
void*subbuffer;
void*data;
unsigned intindex;
unsigned intcurr;
unsigned intnext;
unsigned intsize;
unsigned intstart;
unsigned intfirst;

unsigned int (*read_4)(void *ptr);
unsigned long long (*read_8)(void *ptr);
unsigned long long (*read_long)(struct kbuffer *kbuf, void *ptr);
int (*next_event)(struct kbuffer *kbuf);
};

struct trace_buffer_meta {
unsigned long   entries;
unsigned long   overrun;
unsigned long   read;

unsigned long   subbufs_touched;
unsigned long   subbufs_lost;
unsigned long   subbufs_read;

struct {
unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
__u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
__u32   read;   /* Number of bytes read on the 
reader subbuf */
} reader;

__u32   subbuf_size;
__u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */

__u32   meta_page_size; /* Size of the meta-page */
__u32   meta_struct_len;/* Len of this struct */
};

static char *argv0;
static bool exit_requested;

static char *get_this_name(void)
{
static char *this_name;
char *arg;
char *p;

if (this_name)
return this_name;

arg = argv0;
p = arg+strlen(arg);

while (p >= arg && *p != '/')
p--;
p++;

this_name = p;
return p;
}

static void __vdie(const char *fmt, va_list ap, int err)
{
int ret = errno;
char *p = get_this_name();

if (err && errno)
perror(p);
else
ret = -1;

fprintf(stderr, "  ");
vfprintf(stderr, fmt, ap);

fprintf(stderr, "\n");
exit(ret);
}

void pdie(const char *fmt, ...)
{
va_list ap;

va_start(ap, fmt);
__vdie(fmt, ap, 1);
va_end(ap);
}

static void read_subbuf(struct tep_handle *tep, struct kbuffer *kbuf)
{
static struct trace_seq seq;
struct tep_record record;

if (seq.buffer)
trace_seq_reset();
else
trace_seq_init();

while ((record.data = kbuffer_read_event(kbuf, ))) {
record.size = kbuffer_event_size(kbuf);

Re: [PATCH v7] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-12-21 Thread Francis Laniel
Hi!


Le lundi 4 décembre 2023, 22:46:35 CET Alessandro Carminati (Red Hat) a 
écrit :
> In the kernel environment, situations frequently arise where identical
> names are shared among symbols within both the core image and modules.
> While this doesn't pose issues for the kernel's binary itself, it
> complicates trace or probe operations using tools like kprobe.
> 
> This patch introduces "kas_alias" to address this challenge.
> 
> During the kernel's build process, just before linking the vmlinux
> image in the "scripts/link-vmlinux.sh", symbol name frequencies are
> collected.
> This collection includes both the core kernel components and modules.
> Subsequently, within the same action, the nm data relative to vmlinux
> is modified by adding aliases based on the comprehensive symbol
> information gathered.
> 
> The collection process occurs in two phases:
> 
> 1. First phase: Executed during the linking of vmlinux, "kas_alias" scans
>all symbols provided by the 'nm' data against the vmlinux core image
>and all objects used for module linkage. This phase requires all
>modules objects to be produced at this stage, thereby adding a vmlinux
>dependency for linking modules in 'scripts/Makefile.modfinal'.
> 
> 2. Second phase: In a subsequent run in the same build, "kas_alias"
>processes module objects and injects aliases into the objects' symbol
>tables where necessary. This operation is done by modifying
>'scripts/Makefile.modfinal' to include an action for each processed
>module.
> 
> Example:
> 
> Consider the symbol "device_show", you can expect an output like the
> following:
> 
>  ~ # cat /proc/kallsyms | grep " name_show"
> caa2bb4f01c8 t name_show
> caa2bb4f01c8 t name_show@kernel_irq_irqdesc_c_264
> caa2bb9c1a30 t name_show
> caa2bb9c1a30 t name_show@drivers_pnp_card_c_186
> caa2bbac4754 t name_show
> caa2bbac4754 t name_show@drivers_regulator_core_c_678
> caa2bbba4900 t name_show
> caa2bbba4900 t name_show@drivers_base_power_wakeup_stats_c_93
> caa2bbec4038 t name_show
> caa2bbec4038 t name_show@drivers_rtc_sysfs_c_26
> caa2bbecc920 t name_show
> caa2bbecc920 t name_show@drivers_i2c_i2c_core_base_c_660
> caa2bbed3840 t name_show
> caa2bbed3840 t name_show@drivers_i2c_i2c_dev_c_100
> caa2bbef7210 t name_show
> caa2bbef7210 t name_show@drivers_pps_sysfs_c_66
> caa2bbf03328 t name_show
> caa2bbf03328 t name_show@drivers_hwmon_hwmon_c_72
> caa2bbff6f3c t name_show
> caa2bbff6f3c t name_show@drivers_remoteproc_remoteproc_sysfs_c_215
> caa2bbff8d78 t name_show
> caa2bbff8d78 t name_show@drivers_rpmsg_rpmsg_core_c_455
> caa2bbfff7a4 t name_show
> caa2bbfff7a4 t name_show@drivers_devfreq_devfreq_c_1395
> caa2bc001f60 t name_show
> caa2bc001f60 t name_show@drivers_extcon_extcon_c_389
> caa2bc009890 t name_show
> caa2bc009890 t name_show@drivers_iio_industrialio_core_c_1396
> caa2bc01212c t name_show
> caa2bc01212c t name_show@drivers_iio_industrialio_trigger_c_51
> caa2bc025e2c t name_show
> caa2bc025e2c t name_show@drivers_fpga_fpga_mgr_c_618
> caa2a052102c t name_show  [hello]
> caa2a052102c t name_show@hello_hello_c_8  [hello]
> caa2a051955c t name_show  [rpmsg_char]
> caa2a051955c t name_show@drivers_rpmsg_rpmsg_char_c_365   [rpmsg_char]
> 
> where hello, is a plain helloworld module built OOT.

Thank you for sending the v7 and sorry for the delay!
I tested it and it works fine:
root@vm-amd64:~# uname -r
6.7.0-rc6-00079-g6162f0ee7bcb
root@vm-amd64:~# grep name_show /proc/kallsyms | head -5
b400f230 t __pfx_pmu_name_show
b400f240 t pmu_name_show
b40fb1d0 t __pfx_name_show
b40fb1e0 t name_show
b40fb1e0 t name_show@kernel_irq_irqdesc_c_264

I am curious about the module part, can you please indicate me how you tested 
it?

I found some nits in the code but nothing critical:

> Tested-by: Masami Hiramatsu (Google) 
> Signed-off-by: Alessandro Carminati (Red Hat)
> 
> 
> ---
> 
> NOTE1:
> About the symbols name duplication that happens as consequence of the
> inclusion compat_binfmt_elf.c does, it is evident that this corner is
> inherently challenging the addr2line approach.
> Attempting to conceal this limitation would be counterproductive.
> 
> compat_binfmt_elf.c includes directly binfmt_elf.c, addr2line can't help
> but report all functions and data declared by that file, coming from
> binfmt_elf.c.
> 
> My position is that, rather than producing a more complicated pipeline
> to handle this corner case, it is better to fix the compat_binfmt_elf.c
> anomaly.
> 
> This patch does not deal with the two potentially problematic symbols
> defined by compat_binfmt_elf.c
> 
> Changes from v1:
> * Integrated changes requested by Masami to exclude symbols with prefixes
>   "_cfi" and "_pfx".
> * Introduced a small framework to handle patterns that need to be excluded
>   from the alias production.
> 

Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-21 Thread Willem de Bruijn
Mina Almasry wrote:
> Use netmem_ref instead of page in skb_frag_t. Currently netmem_ref
> is always a struct page underneath, but the abstraction allows efforts
> to add support for skb frags not backed by pages.
> 
> There is unfortunately 1 instance where the skb_frag_t is assumed to be
> a bio_vec in kcm. For this case, add a debug assert that the skb frag is
> indeed backed by a page, and do a cast.
> 
> Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
> that the API can be used to create netmem skbs.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v3;
> - Renamed the fields in skb_frag_t.
> 
> v2:
> - Add skb frag filling helpers.
> 
> ---
>  include/linux/skbuff.h | 92 +-
>  net/core/skbuff.c  | 22 +++---
>  net/kcm/kcmsock.c  | 10 -
>  3 files changed, 89 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 7ce38874dbd1..729c95e97be1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -37,6 +37,7 @@
>  #endif
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * DOC: skb checksums
> @@ -359,7 +360,11 @@ extern int sysctl_max_skb_frags;
>   */
>  #define GSO_BY_FRAGS 0x
>  
> -typedef struct bio_vec skb_frag_t;
> +typedef struct skb_frag {
> + netmem_ref netmem;
> + unsigned int len;
> + unsigned int offset;
> +} skb_frag_t;
>  
>  /**
>   * skb_frag_size() - Returns the size of a skb fragment
> @@ -367,7 +372,7 @@ typedef struct bio_vec skb_frag_t;
>   */
>  static inline unsigned int skb_frag_size(const skb_frag_t *frag)
>  {
> - return frag->bv_len;
> + return frag->len;
>  }
>  
>  /**
> @@ -377,7 +382,7 @@ static inline unsigned int skb_frag_size(const skb_frag_t 
> *frag)
>   */
>  static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
>  {
> - frag->bv_len = size;
> + frag->len = size;
>  }
>  
>  /**
> @@ -387,7 +392,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, 
> unsigned int size)
>   */
>  static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
>  {
> - frag->bv_len += delta;
> + frag->len += delta;
>  }
>  
>  /**
> @@ -397,7 +402,7 @@ static inline void skb_frag_size_add(skb_frag_t *frag, 
> int delta)
>   */
>  static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
>  {
> - frag->bv_len -= delta;
> + frag->len -= delta;
>  }
>  
>  /**
> @@ -417,7 +422,7 @@ static inline bool skb_frag_must_loop(struct page *p)
>   *   skb_frag_foreach_page - loop over pages in a fragment
>   *
>   *   @f: skb frag to operate on
> - *   @f_off: offset from start of f->bv_page
> + *   @f_off: offset from start of f->netmem
>   *   @f_len: length from f_off to loop over
>   *   @p: (temp var) current page
>   *   @p_off: (temp var) offset from start of current page,
> @@ -2431,22 +2436,37 @@ static inline unsigned int skb_pagelen(const struct 
> sk_buff *skb)
>   return skb_headlen(skb) + __skb_pagelen(skb);
>  }
>  
> +static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
> +  netmem_ref netmem, int off,
> +  int size)
> +{
> + frag->netmem = netmem;
> + frag->offset = off;
> + skb_frag_size_set(frag, size);
> +}
> +
>  static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
>  struct page *page,
>  int off, int size)
>  {
> - frag->bv_page = page;
> - frag->bv_offset = off;
> - skb_frag_size_set(frag, size);
> + skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
> +}
> +
> +static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info 
> *shinfo,
> + int i, netmem_ref netmem,
> + int off, int size)
> +{
> + skb_frag_t *frag = >frags[i];
> +
> + skb_frag_fill_netmem_desc(frag, netmem, off, size);
>  }
>  
>  static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
> int i, struct page *page,
> int off, int size)
>  {
> - skb_frag_t *frag = >frags[i];
> -
> - skb_frag_fill_page_desc(frag, page, off, size);
> + __skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
> +  size);
>  }
>  
>  /**
> @@ -2462,10 +2482,10 @@ static inline void skb_len_add(struct sk_buff *skb, 
> int delta)
>  }
>  
>  /**
> - * __skb_fill_page_desc - initialise a paged fragment in an skb
> + * __skb_fill_netmem_desc - initialise a fragment in an skb
>   * @skb: buffer containing fragment to be initialised
> - * @i: paged fragment index to initialise
> - * @page: the page to use for this fragment
> + * @i: fragment index to initialise
> + * 

Re: [PATCH net-next v3 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-21 Thread Willem de Bruijn
Mina Almasry wrote:
> Minor fix for virtio: code wanting to access the fields inside an skb
> frag should use the skb_frag_*() helpers, instead of accessing the
> fields directly. This allows for extensions where the underlying
> memory is not a page.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v2:
> 
> - Also fix skb_frag_off() + skb_frag_size() (David)
> - Did not apply the reviewed-by from Stefano since the patch changed
> relatively much.
> 
> ---
>  net/vmw_vsock/virtio_transport.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport.c 
> b/net/vmw_vsock/virtio_transport.c
> index f495b9e5186b..1748268e0694 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -153,10 +153,10 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>* 'virt_to_phys()' later to fill the buffer 
> descriptor.
>* We don't touch memory at "virtual" address 
> of this page.
>*/
> - va = page_to_virt(skb_frag->bv_page);
> + va = page_to_virt(skb_frag_page(skb_frag));
>   sg_init_one(sgs[out_sg],
> - va + skb_frag->bv_offset,
> - skb_frag->bv_len);
> + va + skb_frag_off(skb_frag),
> + skb_frag_size(skb_frag));
>   out_sg++;
>   }
>   }

If there are requests for further revision in the series, can send
this virtio cleanup on its own to get it off the stack.

> -- 
> 2.43.0.472.g3155946c3a-goog
> 





Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-21 Thread Simon Horman
On Wed, Dec 20, 2023 at 01:45:02PM -0800, Mina Almasry wrote:
> Use netmem_ref instead of page in skb_frag_t. Currently netmem_ref
> is always a struct page underneath, but the abstraction allows efforts
> to add support for skb frags not backed by pages.
> 
> There is unfortunately 1 instance where the skb_frag_t is assumed to be
> a bio_vec in kcm. For this case, add a debug assert that the skb frag is
> indeed backed by a page, and do a cast.
> 
> Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
> that the API can be used to create netmem skbs.
> 
> Signed-off-by: Mina Almasry 

...

> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index 65d1f6755f98..3180a54b2c68 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -636,9 +636,15 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>   msize += skb_shinfo(skb)->frags[i].bv_len;
>  
> + /* The cast to struct bio_vec* here assumes the frags are
> +  * struct page based. WARN if there is no page in this skb.
> +  */
> + DEBUG_NET_WARN_ON_ONCE(
> + !skb_frag_page(_shinfo(skb)->frags[0]));
> +
>   iov_iter_bvec(_iter, ITER_SOURCE,
> -   skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> -   msize);
> +   (const struct bio_vec *)skb_shinfo(skb)->frags,
> +   skb_shinfo(skb)->nr_frags, msize);
>   iov_iter_advance(_iter, txm->frag_offset);
>  
>   do {

Hi Mina,

something isn't quite right here.

  ...//kcmsock.c:637:39: error: no member named 'bv_len' in 'struct skb_frag'
  637 | msize += skb_shinfo(skb)->frags[i].bv_len;
  |  ~ ^

-- 
pw-bot: changes-requested



[PATCH] kernel/module: improve documentation for try_module_get()

2023-12-21 Thread Marco Pagani
The sentence "this call will fail if the module is already being
removed" is potentially confusing and may contradict the rest of the
documentation. If one tries to get a module that has already been
removed using a stale pointer, the kernel will crash.

Signed-off-by: Marco Pagani 
---
 include/linux/module.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index a98e188cf37b..08364d5cbc07 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -668,7 +668,7 @@ extern void __module_get(struct module *module);
  * @module: the module we should check for
  *
  * Only try to get a module reference count if the module is not being removed.
- * This call will fail if the module is already being removed.
+ * This call will fail if the module is in the process of being removed.
  *
  * Care must also be taken to ensure the module exists and is alive prior to
  * usage of this call. This can be gauranteed through two means:
-- 
2.43.0




Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Masahiro Yamada
On Thu, Dec 21, 2023 at 7:22 PM Masahiro Yamada  wrote:
>
> On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> >
> > From: Helge Deller 
> >
> > An alignment of 4 bytes is wrong for 64-bit platforms which don't define
> > CONFIG_HAVE_ARCH_PREL32_RELOCATIONS (which then store 64-bit pointers).
> > Fix their alignment to 8 bytes.
> >
> > Signed-off-by: Helge Deller 
>
>
> This is correct.
>
> Acked-by: Masahiro Yamada 
>
> Please add
>
>
> Fixes: ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
>
>


If there is no objection, I will pick this up
to linux-kbuild/fixes.


Thanks.





-- 
Best Regards
Masahiro Yamada



Re: [PATCH 0/4] Section alignment issues?

2023-12-21 Thread Masahiro Yamada
On Thu, Dec 21, 2023 at 10:40 PM Masahiro Yamada  wrote:
>
> On Thu, Nov 23, 2023 at 7:18 AM  wrote:
> >
> > From: Helge Deller 
> >
> > While working on the 64-bit parisc kernel, I noticed that the __ksymtab[]
> > table was not correctly 64-bit aligned in many modules.
> > The following patches do fix some of those issues in the generic code.
> >
> > But further investigation shows that multiple sections in the kernel and in
> > modules are possibly not correctly aligned, and thus may lead to performance
> > degregations at runtime (small on x86, huge on parisc, sparc and others 
> > which
> > need exception handlers). Sometimes wrong alignments may also be simply 
> > hidden
> > by the linker or kernel module loader which pulls in the sections by luck 
> > with
> > a correct alignment (e.g. because the previous section was aligned already).
> >
> > An objdump on a x86 module shows e.g.:
> >
> > ./kernel/net/netfilter/nf_log_syslog.ko: file format elf64-x86-64
> > Sections:
> > Idx Name  Size  VMA   LMA   File off  
> > Algn
> >   0 .text 1fdf      0040  
> > 2**4
> >   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >   1 .init.text00f6      2020  
> > 2**4
> >   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >   2 .exit.text005c      2120  
> > 2**4
> >   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
> >   3 .rodata.str1.8 00dc      2180  
> > 2**3
> >   CONTENTS, ALLOC, LOAD, READONLY, DATA
> >   4 .rodata.str1.1 030a      225c  
> > 2**0
> >   CONTENTS, ALLOC, LOAD, READONLY, DATA
> >   5 .rodata   00b0      2580  
> > 2**5
> >   CONTENTS, ALLOC, LOAD, READONLY, DATA
> >   6 .modinfo  019e      2630  
> > 2**0
> >   CONTENTS, ALLOC, LOAD, READONLY, DATA
> >   7 .return_sites 0034      27ce  
> > 2**0
> >   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
> >   8 .call_sites   029c      2802  
> > 2**0
> >   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
> >
> > In this example I believe the ".return_sites" and ".call_sites" should have
> > an alignment of at least 32-bit (4 bytes).
> >
> > On other architectures or modules other sections like ".altinstructions" or
> > "__ex_table" may show up wrongly instead.
> >
> > In general I think it would be beneficial to search for wrong alignments at
> > link time, and maybe at runtime.
> >
> > The patch at the end of this cover letter
> > - adds compile time checks to the "modpost" tool, and
> > - adds a runtime check to the kernel module loader at runtime.
> > And it will possibly show false positives too (!!!)
> > I do understand that some of those sections are not performce critical
> > and thus any alignment is OK.
> >
> > The modpost patch will emit at compile time such warnings (on x86-64 kernel 
> > build):
> >
> > WARNING: modpost: vmlinux: section .initcall7.init (type 1, flags 2) has 
> > alignment of 1, expected at least 4.
> > Maybe you need to add ALIGN() to the modules.lds file (or fix modpost) ?
> > WARNING: modpost: vmlinux: section .altinstructions (type 1, flags 2) has 
> > alignment of 1, expected at least 2.
> > WARNING: modpost: vmlinux: section .initcall6.init (type 1, flags 2) has 
> > alignment of 1, expected at least 4.
> > WARNING: modpost: vmlinux: section .initcallearly.init (type 1, flags 2) 
> > has alignment of 1, expected at least 4.
> > WARNING: modpost: vmlinux: section .rodata.cst2 (type 1, flags 18) has 
> > alignment of 2, expected at least 64.
> > WARNING: modpost: vmlinux: section .static_call_tramp_key (type 1, flags 2) 
> > has alignment of 1, expected at least 8.
> > WARNING: modpost: vmlinux: section .con_initcall.init (type 1, flags 2) has 
> > alignment of 1, expected at least 8.
> > WARNING: modpost: vmlinux: section __bug_table (type 1, flags 3) has 
> > alignment of 1, expected at least 4.
> > ...
>
>
>
>
> modpost acts on vmlinux.o instead of vmlinux.
>
>
> vmlinux.o is a relocatable ELF, which is not a real layout
> because no linker script has been considered yet at this
> point.
>
>
> vmlinux is an executable ELF, produced by a linker,
> with the linker script taken into consideration
> to determine the final section/symbol layout.
>
>
> So, checking this in modpost is meaningless.
>
>
>
> I did not check the module checking code, but
> modules are also relocatable ELF.



Sorry, I replied too early.
(Actually I replied without reading your modpost code).

Now, I understand what your checker is doing.


I did not test how many false positives are produced,

[PATCH v2 5/5] sched/pelt: Remove shift of thermal clock

2023-12-21 Thread Vincent Guittot
The optional shift of the clock used by thermal/hw load avg has been
introduced to handle case where the signal was not always a high frequency
hw signal. Now that cpufreq provides a signal for firmware and
SW pressure, we can remove this exception and always keep this PELT signal
aligned with other signals.
Mark deprecated sched_thermal_decay_shift boot parameter.

Signed-off-by: Vincent Guittot 
---
 .../admin-guide/kernel-parameters.txt  |  1 +
 kernel/sched/core.c|  2 +-
 kernel/sched/fair.c| 10 ++
 kernel/sched/sched.h   | 18 --
 4 files changed, 4 insertions(+), 27 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..2ee15522b15d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5722,6 +5722,7 @@
but is useful for debugging and performance tuning.
 
sched_thermal_decay_shift=
+   [Deprecated]
[KNL, SMP] Set a decay shift for scheduler thermal
pressure signal. Thermal pressure signal follows the
default decay period of other scheduler pelt
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a6f084bdf1c5..c68e47bfd5ae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5670,7 +5670,7 @@ void scheduler_tick(void)
 
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
-   update_hw_load_avg(rq_clock_hw(rq), rq, hw_pressure);
+   update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
curr->sched_class->task_tick(rq, curr, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce0d32f441a8..16d71e764131 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -78,15 +78,9 @@ static unsigned int normalized_sysctl_sched_base_slice   
= 75ULL;
 
 const_debug unsigned int sysctl_sched_migration_cost   = 50UL;
 
-int sched_hw_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
-   int _shift = 0;
-
-   if (kstrtoint(str, 0, &_shift))
-   pr_warn("Unable to set scheduler thermal pressure decay shift 
parameter\n");
-
-   sched_hw_decay_shift = clamp(_shift, 0, 10);
+   pr_warn("Ignoring the deprecated sched_thermal_decay_shift= option\n");
return 1;
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
@@ -9271,7 +9265,7 @@ static bool __update_blocked_others(struct rq *rq, bool 
*done)
 
decayed = update_rt_rq_load_avg(now, rq, curr_class == _sched_class) 
|
  update_dl_rq_load_avg(now, rq, curr_class == _sched_class) 
|
- update_hw_load_avg(rq_clock_hw(rq), rq, hw_pressure) |
+ update_hw_load_avg(now, rq, hw_pressure) |
  update_irq_load_avg(rq, 0);
 
if (others_have_blocked(rq))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 677d24202eec..6fc6718a1060 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1520,24 +1520,6 @@ static inline u64 rq_clock_task(struct rq *rq)
return rq->clock_task;
 }
 
-/**
- * By default the decay is the default pelt decay period.
- * The decay shift can change the decay period in
- * multiples of 32.
- *  Decay shiftDecay period(ms)
- * 0   32
- * 1   64
- * 2   128
- * 3   256
- * 4   512
- */
-extern int sched_hw_decay_shift;
-
-static inline u64 rq_clock_hw(struct rq *rq)
-{
-   return rq_clock_task(rq) >> sched_hw_decay_shift;
-}
-
 static inline void rq_clock_skip_update(struct rq *rq)
 {
lockdep_assert_rq_held(rq);
-- 
2.34.1




[PATCH v2 4/5] sched: Rename arch_update_thermal_pressure into arch_update_hw_pressure

2023-12-21 Thread Vincent Guittot
Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into HW pressure to reflect that it returns
a pressure applied by HW (i.e. with a high frequency change) and not
always related to thermal mitigation but also generated by max current
limitation as an example. Such high frequency signal needs filtering to be
smoothed and provide an value that reflects the average available capacity
into the scheduler time scale.

Signed-off-by: Vincent Guittot 
---
 arch/arm/include/asm/topology.h   |  6 ++---
 arch/arm64/include/asm/topology.h |  6 ++---
 drivers/base/arch_topology.c  | 26 +--
 drivers/cpufreq/qcom-cpufreq-hw.c |  4 +--
 include/linux/arch_topology.h |  8 +++---
 include/linux/sched/topology.h|  8 +++---
 .../{thermal_pressure.h => hw_pressure.h} | 14 +-
 include/trace/events/sched.h  |  2 +-
 init/Kconfig  | 12 -
 kernel/sched/core.c   |  8 +++---
 kernel/sched/fair.c   | 16 ++--
 kernel/sched/pelt.c   | 18 ++---
 kernel/sched/pelt.h   | 16 ++--
 kernel/sched/sched.h  | 10 +++
 14 files changed, 77 insertions(+), 77 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 853c4f81ba4a..ad36b6570067 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -22,9 +22,9 @@
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure   topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressuretopology_update_hw_pressure
 
 #else
 
diff --git a/arch/arm64/include/asm/topology.h 
b/arch/arm64/include/asm/topology.h
index a323b109b9c4..0f6ef432fb84 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -35,9 +35,9 @@ void update_freq_counters_refs(void);
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
-/* Replace task scheduler's default thermal pressure API */
-#define arch_scale_thermal_pressure topology_get_thermal_pressure
-#define arch_update_thermal_pressure   topology_update_thermal_pressure
+/* Replace task scheduler's default HW pressure API */
+#define arch_scale_hw_pressure topology_get_hw_pressure
+#define arch_update_hw_pressuretopology_update_hw_pressure
 
 #include 
 
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 0906114963ff..405af7a87008 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -22,7 +22,7 @@
 #include 
 
 #define CREATE_TRACE_POINTS
-#include 
+#include 
 
 static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
 static struct cpumask scale_freq_counters_mask;
@@ -160,26 +160,26 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned 
long capacity)
per_cpu(cpu_scale, cpu) = capacity;
 }
 
-DEFINE_PER_CPU(unsigned long, thermal_pressure);
+DEFINE_PER_CPU(unsigned long, hw_pressure);
 
 /**
- * topology_update_thermal_pressure() - Update thermal pressure for CPUs
+ * topology_update_hw_pressure() - Update HW pressure for CPUs
  * @cpus: The related CPUs for which capacity has been reduced
  * @capped_freq : The maximum allowed frequency that CPUs can run at
  *
- * Update the value of thermal pressure for all @cpus in the mask. The
+ * Update the value of HW pressure for all @cpus in the mask. The
  * cpumask should include all (online+offline) affected CPUs, to avoid
  * operating on stale data when hot-plug is used for some CPUs. The
  * @capped_freq reflects the currently allowed max CPUs frequency due to
- * thermal capping. It might be also a boost frequency value, which is bigger
+ * HW capping. It might be also a boost frequency value, which is bigger
  * than the internal 'capacity_freq_ref' max frequency. In such case the
  * pressure value should simply be removed, since this is an indication that
- * there is no thermal throttling. The @capped_freq must be provided in kHz.
+ * there is no HW throttling. The @capped_freq must be provided in kHz.
  */
-void topology_update_thermal_pressure(const struct cpumask *cpus,
+void topology_update_hw_pressure(const struct cpumask *cpus,
  unsigned long capped_freq)
 {
-   unsigned long max_capacity, capacity, th_pressure;
+   unsigned long max_capacity, capacity, hw_pressure;

[PATCH v2 2/5] sched: Take cpufreq feedback into account

2023-12-21 Thread Vincent Guittot
Aggregate the different pressures applied on the capacity of CPUs and
create a new function that returns the actual capacity of the CPU:
  get_actual_cpu_capacity()

Signed-off-by: Vincent Guittot 
Reviewed-by: Lukasz Luba 
---
 kernel/sched/fair.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..0235081defa5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4932,13 +4932,22 @@ static inline void util_est_update(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_se_tp(>se);
 }
 
+static inline unsigned long get_actual_cpu_capacity(int cpu)
+{
+   unsigned long capacity = arch_scale_cpu_capacity(cpu);
+
+   capacity -= max(thermal_load_avg(cpu_rq(cpu)), 
cpufreq_get_pressure(cpu));
+
+   return capacity;
+}
+
 static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_min,
unsigned long uclamp_max,
int cpu)
 {
-   unsigned long capacity_orig, capacity_orig_thermal;
unsigned long capacity = capacity_of(cpu);
+   unsigned long capacity_orig;
bool fits, uclamp_max_fits;
 
/*
@@ -4970,7 +4979,6 @@ static inline int util_fits_cpu(unsigned long util,
 * goal is to cap the task. So it's okay if it's getting less.
 */
capacity_orig = arch_scale_cpu_capacity(cpu);
-   capacity_orig_thermal = capacity_orig - 
arch_scale_thermal_pressure(cpu);
 
/*
 * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -5045,7 +5053,8 @@ static inline int util_fits_cpu(unsigned long util,
 * handle the case uclamp_min > uclamp_max.
 */
uclamp_min = min(uclamp_min, uclamp_max);
-   if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+   if (fits && (util < uclamp_min) &&
+   (uclamp_min > get_actual_cpu_capacity(cpu)))
return -1;
 
return fits;
@@ -7426,7 +7435,7 @@ select_idle_capacity(struct task_struct *p, struct 
sched_domain *sd, int target)
 * Look for the CPU with best capacity.
 */
else if (fits < 0)
-   cpu_cap = arch_scale_cpu_capacity(cpu) - 
thermal_load_avg(cpu_rq(cpu));
+   cpu_cap = get_actual_cpu_capacity(cpu);
 
/*
 * First, select CPU which fits better (-1 being better than 0).
@@ -7919,8 +7928,8 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
int prev_fits = -1, best_fits = -1;
-   unsigned long best_thermal_cap = 0;
-   unsigned long prev_thermal_cap = 0;
+   unsigned long best_actual_cap = 0;
+   unsigned long prev_actual_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -7950,7 +7959,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
 
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
-   unsigned long cpu_cap, cpu_thermal_cap, util;
+   unsigned long cpu_cap, cpu_actual_cap, util;
long prev_spare_cap = -1, max_spare_cap = -1;
unsigned long rq_util_min, rq_util_max;
unsigned long cur_delta, base_energy;
@@ -7962,18 +7971,17 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (cpumask_empty(cpus))
continue;
 
-   /* Account thermal pressure for the energy estimation */
+   /* Account external pressure for the energy estimation */
cpu = cpumask_first(cpus);
-   cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
-   cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+   cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-   eenv.cpu_cap = cpu_thermal_cap;
+   eenv.cpu_cap = cpu_actual_cap;
eenv.pd_cap = 0;
 
for_each_cpu(cpu, cpus) {
struct rq *rq = cpu_rq(cpu);
 
-   eenv.pd_cap += cpu_thermal_cap;
+   eenv.pd_cap += cpu_actual_cap;
 
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
@@ -8044,7 +8052,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
-   prev_thermal_cap = cpu_thermal_cap;
+   prev_actual_cap = cpu_actual_cap;
best_delta = 

[PATCH v2 3/5] thermal/cpufreq: Remove arch_update_thermal_pressure()

2023-12-21 Thread Vincent Guittot
arch_update_thermal_pressure() aims to update fast changing signal which
should be averaged using PELT filtering before being provided to the
scheduler which can't make smart use of fast changing signal.
cpufreq now provides the maximum freq_qos pressure on the capacity to the
scheduler, which includes cpufreq cooling device. Remove the call to
arch_update_thermal_pressure() in cpufreq cooling device as this is
handled by cpufreq_get_pressure().

Signed-off-by: Vincent Guittot 
Reviewed-by: Lukasz Luba 
---
 drivers/thermal/cpufreq_cooling.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c 
b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..e77d3b44903e 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -448,7 +448,6 @@ static int cpufreq_set_cur_state(struct 
thermal_cooling_device *cdev,
 unsigned long state)
 {
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
-   struct cpumask *cpus;
unsigned int frequency;
int ret;
 
@@ -465,8 +464,6 @@ static int cpufreq_set_cur_state(struct 
thermal_cooling_device *cdev,
ret = freq_qos_update_request(_cdev->qos_req, frequency);
if (ret >= 0) {
cpufreq_cdev->cpufreq_state = state;
-   cpus = cpufreq_cdev->policy->related_cpus;
-   arch_update_thermal_pressure(cpus, frequency);
ret = 0;
}
 
-- 
2.34.1




[PATCH v2 1/5] cpufreq: Add a cpufreq pressure feedback for the scheduler

2023-12-21 Thread Vincent Guittot
Provide to the scheduler a feedback about the temporary max available
capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
filtered as the pressure will happen for dozens ms or more.

Signed-off-by: Vincent Guittot 
---
 drivers/cpufreq/cpufreq.c | 34 ++
 include/linux/cpufreq.h   | 10 ++
 2 files changed, 44 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 44db4f59c4cc..15bd41f9bb5e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2563,6 +2563,38 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, 
unsigned int cpu)
 }
 EXPORT_SYMBOL(cpufreq_get_policy);
 
+DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
+
+/**
+ * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
+ * @policy: cpufreq policy of the CPUs.
+ *
+ * Update the value of cpufreq pressure for all @cpus in the policy.
+ */
+static void cpufreq_update_pressure(struct cpufreq_policy *policy)
+{
+   unsigned long max_capacity, capped_freq, pressure;
+   u32 max_freq;
+   int cpu;
+
+   cpu = cpumask_first(policy->related_cpus);
+   pressure = max_capacity = arch_scale_cpu_capacity(cpu);
+   capped_freq = policy->max;
+   max_freq = arch_scale_freq_ref(cpu);
+
+   /*
+* Handle properly the boost frequencies, which should simply clean
+* the thermal pressure value.
+*/
+   if (max_freq <= capped_freq)
+   pressure -= max_capacity;
+   else
+   pressure -= mult_frac(max_capacity, capped_freq, max_freq);
+
+   for_each_cpu(cpu, policy->related_cpus)
+   WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
+}
+
 /**
  * cpufreq_set_policy - Modify cpufreq policy parameters.
  * @policy: Policy object to modify.
@@ -2618,6 +2650,8 @@ static int cpufreq_set_policy(struct cpufreq_policy 
*policy,
policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
trace_cpu_frequency_limits(policy);
 
+   cpufreq_update_pressure(policy);
+
policy->cached_target_freq = UINT_MAX;
 
pr_debug("new min and max freqs are %u - %u kHz\n",
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index afda5f24d3dd..b1d97edd3253 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct 
cpufreq_policy *policy);
 void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
 void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
 bool has_target_index(void);
+
+DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+   return per_cpu(cpufreq_pressure, cpu);
+}
 #else
 static inline unsigned int cpufreq_get(unsigned int cpu)
 {
@@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
return false;
 }
 static inline void disable_cpufreq(void) { }
+static inline unsigned long cpufreq_get_pressure(int cpu)
+{
+   return 0;
+}
 #endif
 
 #ifdef CONFIG_CPU_FREQ_STAT
-- 
2.34.1




[PATCH v2 0/5] Rework system pressure interface to the scheduler

2023-12-21 Thread Vincent Guittot
Following the consolidation and cleanup of CPU capacity in [1], this serie
reworks how the scheduler gets the pressures on CPUs. We need to take into
account all pressures applied by cpufreq on the compute capacity of a CPU
for dozens of ms or more and not only cpufreq cooling device or HW
mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
- one from cpufreq and freq_qos
- one from HW high freq mitigiation.

The next step will be to add a dedicated interface for long standing
capping of the CPU capacity (i.e. for seconds or more) like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this serie but as a temporary pressure which is not always the
best choice when we know that it will happen for seconds or more.

[1] 
https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guit...@linaro.org/

Change since v1:
- Use struct cpufreq_policy as parameter of cpufreq_update_pressure()
- Fix typos and comments
- Make sched_thermal_decay_shift boot param as deprecated

Vincent Guittot (5):
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched: Take cpufreq feedback into account
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched: Rename arch_update_thermal_pressure into
arch_update_hw_pressure
  sched/pelt: Remove shift of thermal clock

 .../admin-guide/kernel-parameters.txt |  1 +
 arch/arm/include/asm/topology.h   |  6 +-
 arch/arm64/include/asm/topology.h |  6 +-
 drivers/base/arch_topology.c  | 26 
 drivers/cpufreq/cpufreq.c | 34 ++
 drivers/cpufreq/qcom-cpufreq-hw.c |  4 +-
 drivers/thermal/cpufreq_cooling.c |  3 -
 include/linux/arch_topology.h |  8 +--
 include/linux/cpufreq.h   | 10 +++
 include/linux/sched/topology.h|  8 +--
 .../{thermal_pressure.h => hw_pressure.h} | 14 ++---
 include/trace/events/sched.h  |  2 +-
 init/Kconfig  | 12 ++--
 kernel/sched/core.c   |  8 +--
 kernel/sched/fair.c   | 63 +--
 kernel/sched/pelt.c   | 18 +++---
 kernel/sched/pelt.h   | 16 ++---
 kernel/sched/sched.h  | 22 +--
 18 files changed, 142 insertions(+), 119 deletions(-)
 rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)

-- 
2.34.1




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Dragos Tatulea
On Thu, 2023-12-21 at 15:55 +0100, Eugenio Perez Martin wrote:
> On Thu, Dec 21, 2023 at 3:38 PM Dragos Tatulea  wrote:
> > 
> > On Thu, 2023-12-21 at 13:08 +0100, Eugenio Perez Martin wrote:
> > > On Thu, Dec 21, 2023 at 12:52 PM Dragos Tatulea  
> > > wrote:
> > > > 
> > > > On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> > > > > On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> > > > > >  wrote:
> > > > > > > 
> > > > > > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  
> > > > > > > wrote:
> > > > > > > > 
> > > > > > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang 
> > > > > > > >  wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea 
> > > > > > > > >  wrote:
> > > > > > > > > > 
> > > > > > > > > > The virtio spec doesn't allow changing virtqueue addresses 
> > > > > > > > > > after
> > > > > > > > > > DRIVER_OK. Some devices do support this operation when the 
> > > > > > > > > > device is
> > > > > > > > > > suspended. The 
> > > > > > > > > > VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > > > > > advertises this support as a backend features.
> > > > > > > > > 
> > > > > > > > > There's an ongoing effort in virtio spec to introduce the 
> > > > > > > > > suspend state.
> > > > > > > > > 
> > > > > > > > > So I wonder if it's better to just allow such behaviour?
> > > > > > > > 
> > > > > > > > Actually I mean, allow drivers to modify the parameters during 
> > > > > > > > suspend
> > > > > > > > without a new feature.
> > > > > > > > 
> > > > > > > 
> > > > > > > That would be ideal, but how do userland checks if it can suspend 
> > > > > > > +
> > > > > > > change properties + resume?
> > > > > > 
> > > > > > As discussed, it looks to me the only device that supports suspend 
> > > > > > is
> > > > > > simulator and it supports change properties.
> > > > > > 
> > > > > > E.g:
> > > > > > 
> > > > > > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > > > > >   u64 desc_area, u64 driver_area,
> > > > > >   u64 device_area)
> > > > > > {
> > > > > > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > > > > > struct vdpasim_virtqueue *vq = >vqs[idx];
> > > > > > 
> > > > > > vq->desc_addr = desc_area;
> > > > > > vq->driver_addr = driver_area;
> > > > > > vq->device_addr = device_area;
> > > > > > 
> > > > > > return 0;
> > > > > > }
> > > > > > 
> > > > > 
> > > > > So in the current kernel master it is valid to set a different vq
> > > > > address while the device is suspended in vdpa_sim. But it is not valid
> > > > > in mlx5, as the FW will not be updated in resume (Dragos, please
> > > > > correct me if I'm wrong). Both of them return success.
> > > > > 
> > > > In the current state, there is no resume. HW Virtqueues will just get 
> > > > re-created
> > > > with the new address.
> > > > 
> > > 
> > > Oh, then all of this is effectively transparent to the userspace
> > > except for the time it takes?
> > > 
> > Not quite: mlx5_vdpa_set_vq_address will save the vq address only on the SW 
> > vq
> > representation. Only later will it will call into the FW to update the FW. 
> > Later
> > means:
> > - On DRIVER_OK state, when the VQs get created.
> > - On .set_map when the VQs get re-created (before this series) / updated 
> > (after
> > this series)
> > - On .resume (after this series).
> > 
> > So if the .set_vq_address is called when the VQ is in DRIVER_OK but not
> > suspended those addresses will be set later for later.
> > 
> 
> Ouch, that is more in the line of my thoughts :(.
> 
> > > In that case you're right, we don't need feature flags. But I think it
> > > would be great to also move the error return in case userspace tries
> > > to modify vq parameters out of suspend state.
> > > 
> > On the driver side or on the core side?
> > 
> 
> Core side.
> 
Checking my understanding: instead of the feature flags there would be a check
(for .set_vq_addr and .set_vq_state) to return an error if they are called under
DRIVER_OK and not suspended state?

> It does not have to be part of this series, I meant it can be proposed
> in a separate series and applied before the parent driver one.
> 
> > Thanks
> > > Thanks!
> > > 
> > > 
> > > > > How can we know in the destination QEMU if it is valid to suspend &
> > > > > set address? Should we handle this as a bugfix and backport the
> > > > > change?
> > > > > 
> > > > > > > 
> > > > > > > The only way that comes to my mind is to make sure all parents 
> > > > > > > return
> > > > > > > error if userland tries to do it, and then fallback in userland.
> > > > > > 
> > > > > > Yes.
> > > > > > 
> > > > > > > I'm
> > > > > > > ok with that, but I'm not sure if the current master & previous 
> > > > > > > kernel
> > > > > > > has a coherent behavior. Do they return 

Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 14:51:29 +
David Laight  wrote:

> > I think 1kb units is perfectly fine (patch 15 changes to kb units). The
> > interface says its to define the minimal size of the sub-buffer, not the
> > actual size.  
> 
> I didn't read that far through :-(
> 

Well, this isn't a normal patch series as I took the work from Tzvetomir
back from 2021, and started massaging them. I wanted to keep Tzvetomir's
original authorship even though he's not working on it, so I just applied
his patches as-is and then added patches on top of them, to denote what I
did and what he did.

This is why I never modified the first 5 patches of this series, except for
subject lines and change logs.

-- Steve



Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Eugenio Perez Martin
On Thu, Dec 21, 2023 at 3:38 PM Dragos Tatulea  wrote:
>
> On Thu, 2023-12-21 at 13:08 +0100, Eugenio Perez Martin wrote:
> > On Thu, Dec 21, 2023 at 12:52 PM Dragos Tatulea  wrote:
> > >
> > > On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> > > > On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > The virtio spec doesn't allow changing virtqueue addresses 
> > > > > > > > > after
> > > > > > > > > DRIVER_OK. Some devices do support this operation when the 
> > > > > > > > > device is
> > > > > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND 
> > > > > > > > > flag
> > > > > > > > > advertises this support as a backend features.
> > > > > > > >
> > > > > > > > There's an ongoing effort in virtio spec to introduce the 
> > > > > > > > suspend state.
> > > > > > > >
> > > > > > > > So I wonder if it's better to just allow such behaviour?
> > > > > > >
> > > > > > > Actually I mean, allow drivers to modify the parameters during 
> > > > > > > suspend
> > > > > > > without a new feature.
> > > > > > >
> > > > > >
> > > > > > That would be ideal, but how do userland checks if it can suspend +
> > > > > > change properties + resume?
> > > > >
> > > > > As discussed, it looks to me the only device that supports suspend is
> > > > > simulator and it supports change properties.
> > > > >
> > > > > E.g:
> > > > >
> > > > > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > > > >   u64 desc_area, u64 driver_area,
> > > > >   u64 device_area)
> > > > > {
> > > > > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > > > > struct vdpasim_virtqueue *vq = >vqs[idx];
> > > > >
> > > > > vq->desc_addr = desc_area;
> > > > > vq->driver_addr = driver_area;
> > > > > vq->device_addr = device_area;
> > > > >
> > > > > return 0;
> > > > > }
> > > > >
> > > >
> > > > So in the current kernel master it is valid to set a different vq
> > > > address while the device is suspended in vdpa_sim. But it is not valid
> > > > in mlx5, as the FW will not be updated in resume (Dragos, please
> > > > correct me if I'm wrong). Both of them return success.
> > > >
> > > In the current state, there is no resume. HW Virtqueues will just get 
> > > re-created
> > > with the new address.
> > >
> >
> > Oh, then all of this is effectively transparent to the userspace
> > except for the time it takes?
> >
> Not quite: mlx5_vdpa_set_vq_address will save the vq address only on the SW vq
> representation. Only later will it will call into the FW to update the FW. 
> Later
> means:
> - On DRIVER_OK state, when the VQs get created.
> - On .set_map when the VQs get re-created (before this series) / updated 
> (after
> this series)
> - On .resume (after this series).
>
> So if the .set_vq_address is called when the VQ is in DRIVER_OK but not
> suspended those addresses will be set later for later.
>

Ouch, that is more in the line of my thoughts :(.

> > In that case you're right, we don't need feature flags. But I think it
> > would be great to also move the error return in case userspace tries
> > to modify vq parameters out of suspend state.
> >
> On the driver side or on the core side?
>

Core side.

It does not have to be part of this series, I meant it can be proposed
in a separate series and applied before the parent driver one.

> Thanks
> > Thanks!
> >
> >
> > > > How can we know in the destination QEMU if it is valid to suspend &
> > > > set address? Should we handle this as a bugfix and backport the
> > > > change?
> > > >
> > > > > >
> > > > > > The only way that comes to my mind is to make sure all parents 
> > > > > > return
> > > > > > error if userland tries to do it, and then fallback in userland.
> > > > >
> > > > > Yes.
> > > > >
> > > > > > I'm
> > > > > > ok with that, but I'm not sure if the current master & previous 
> > > > > > kernel
> > > > > > has a coherent behavior. Do they return error? Or return success
> > > > > > without changing address / vq state?
> > > > >
> > > > > We probably don't need to worry too much here, as e.g set_vq_address
> > > > > could fail even without suspend (just at uAPI level).
> > > > >
> > > >
> > > > I don't get this, sorry. I rephrased my point with an example earlier
> > > > in the mail.
> > > >
> > >
> >
>




RE: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread David Laight
> I think 1kb units is perfectly fine (patch 15 changes to kb units). The
> interface says its to define the minimal size of the sub-buffer, not the
> actual size.

I didn't read that far through :-(

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Dragos Tatulea
On Thu, 2023-12-21 at 13:08 +0100, Eugenio Perez Martin wrote:
> On Thu, Dec 21, 2023 at 12:52 PM Dragos Tatulea  wrote:
> > 
> > On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> > > On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> > > > 
> > > > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> > > >  wrote:
> > > > > 
> > > > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  
> > > > > wrote:
> > > > > > 
> > > > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea 
> > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > > > > > DRIVER_OK. Some devices do support this operation when the 
> > > > > > > > device is
> > > > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND 
> > > > > > > > flag
> > > > > > > > advertises this support as a backend features.
> > > > > > > 
> > > > > > > There's an ongoing effort in virtio spec to introduce the suspend 
> > > > > > > state.
> > > > > > > 
> > > > > > > So I wonder if it's better to just allow such behaviour?
> > > > > > 
> > > > > > Actually I mean, allow drivers to modify the parameters during 
> > > > > > suspend
> > > > > > without a new feature.
> > > > > > 
> > > > > 
> > > > > That would be ideal, but how do userland checks if it can suspend +
> > > > > change properties + resume?
> > > > 
> > > > As discussed, it looks to me the only device that supports suspend is
> > > > simulator and it supports change properties.
> > > > 
> > > > E.g:
> > > > 
> > > > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > > >   u64 desc_area, u64 driver_area,
> > > >   u64 device_area)
> > > > {
> > > > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > > > struct vdpasim_virtqueue *vq = >vqs[idx];
> > > > 
> > > > vq->desc_addr = desc_area;
> > > > vq->driver_addr = driver_area;
> > > > vq->device_addr = device_area;
> > > > 
> > > > return 0;
> > > > }
> > > > 
> > > 
> > > So in the current kernel master it is valid to set a different vq
> > > address while the device is suspended in vdpa_sim. But it is not valid
> > > in mlx5, as the FW will not be updated in resume (Dragos, please
> > > correct me if I'm wrong). Both of them return success.
> > > 
> > In the current state, there is no resume. HW Virtqueues will just get 
> > re-created
> > with the new address.
> > 
> 
> Oh, then all of this is effectively transparent to the userspace
> except for the time it takes?
> 
Not quite: mlx5_vdpa_set_vq_address will save the vq address only on the SW vq
representation. Only later will it will call into the FW to update the FW. Later
means:
- On DRIVER_OK state, when the VQs get created.
- On .set_map when the VQs get re-created (before this series) / updated (after
this series)
- On .resume (after this series).

So if the .set_vq_address is called when the VQ is in DRIVER_OK but not
suspended those addresses will be set later for later.

> In that case you're right, we don't need feature flags. But I think it
> would be great to also move the error return in case userspace tries
> to modify vq parameters out of suspend state.
> 
On the driver side or on the core side?

Thanks
> Thanks!
> 
> 
> > > How can we know in the destination QEMU if it is valid to suspend &
> > > set address? Should we handle this as a bugfix and backport the
> > > change?
> > > 
> > > > > 
> > > > > The only way that comes to my mind is to make sure all parents return
> > > > > error if userland tries to do it, and then fallback in userland.
> > > > 
> > > > Yes.
> > > > 
> > > > > I'm
> > > > > ok with that, but I'm not sure if the current master & previous kernel
> > > > > has a coherent behavior. Do they return error? Or return success
> > > > > without changing address / vq state?
> > > > 
> > > > We probably don't need to worry too much here, as e.g set_vq_address
> > > > could fail even without suspend (just at uAPI level).
> > > > 
> > > 
> > > I don't get this, sorry. I rephrased my point with an example earlier
> > > in the mail.
> > > 
> > 
> 



Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 11:06:38 +0100
Alexander Graf  wrote:

> Thanks a bunch for the super quick turnaround time for the fix! I can 
> confirm that I'm no longer seeing the warning :)
> 
> Tested-by: Alexander Graf 

Thanks Alex,

> 
> 
> Do we need another similar patch for the kprobe self tests? The below is 
> with 55cb5f43689d7 plus an unrelated initrd patch plus this patch and 
> the following .config: http://csgraf.de/tmp2/config-ftrace.xz

Sure, now you tell me ;-)

I just finished all my tests and will be sending Linus a pull request soon.

I'll let Masami handle this one, as it's under his department.

-- Steve

> 
> [  919.717134] Testing all events: OK
> [  924.418194] Testing ftrace filter: OK
> [  924.418887] trace_kprobe: Testing kprobe tracing:
> [  924.424244] [ cut here ]
> [  924.424952] WARNING: CPU: 2 PID: 1 at 
> kernel/trace/trace_kprobe.c:2073 kprobe_trace_self_tests_init+0x192/0x540
> [  924.425659] Modules linked in:
> [  924.425886] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
> 6.7.0-rc6-00024-gc10698ad3c9a #298
> [  924.426448] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
> [  924.427230] RIP: 0010:kprobe_trace_self_tests_init+0x192/0x540
> [  924.427639] Code: 7e 10 74 3b 48 8b 36 48 39 d6 75 f2 0f 0b 48 c7 c7 
> 58 71 79 a5 e8 ee 3e 5a fe 48 c7 c7 20 38 b7 a5 e8 a2 51 68 fe 85 c0 74 
> 33 <0f> 0b 48 c7 c7 38 73 79 a5 e8 d0 3e 5a fe e8 4b 64 62 fe eb 23 48
> [  924.428922] RSP: :ab508001be58 EFLAGS: 00010286
> [  924.429288] RAX: fff0 RBX: 005a RCX: 
> 0202
> [  924.429800] RDX:  RSI: 0002e970 RDI: 
> a5b708a0
> [  924.430295] RBP:  R08: 0001 R09: 
> a4855a90
> [  924.430794] R10: 0007 R11: 028a R12: 
> 0001
> [  924.431286] R13: a5cc9a00 R14: 8cec81bebe00 R15: 
> a619f0f0
> [  924.431785] FS:  () GS:8cecf910() 
> knlGS:
> [  924.432346] CS:  0010 DS:  ES:  CR0: 80050033
> [  924.432748] CR2:  CR3: 4883e001 CR4: 
> 003706f0
> [  924.433246] DR0:  DR1:  DR2: 
> 
> [  924.433739] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  924.434233] Call Trace:
> [  924.434418]  
> [  924.434569]  ? __warn+0x7d/0x140
> [  924.434807]  ? kprobe_trace_self_tests_init+0x192/0x540
> [  924.435172]  ? report_bug+0xf8/0x1e0
> [  924.435430]  ? handle_bug+0x3f/0x70
> [  924.435677]  ? exc_invalid_op+0x13/0x60
> [  924.435954]  ? asm_exc_invalid_op+0x16/0x20
> [  924.436249]  ? rdinit_setup+0x40/0x40
> [  924.436509]  ? trace_kprobe_release+0x70/0xb0
> [  924.436822]  ? kprobe_trace_self_tests_init+0x192/0x540
> [  924.437189]  ? kprobe_trace_self_tests_init+0x421/0x540
> [  924.437551]  ? init_kprobe_trace+0x1b0/0x1b0
> [  924.437855]  do_one_initcall+0x44/0x200
> [  924.438131]  kernel_init_freeable+0x1e8/0x330
> [  924.438439]  ? rest_init+0xd0/0xd0
> [  924.438682]  kernel_init+0x16/0x130
> [  924.438943]  ret_from_fork+0x30/0x50
> [  924.439202]  ? rest_init+0xd0/0xd0
> [  924.439445]  ret_from_fork_asm+0x11/0x20
> [  924.439734]  
> [  924.439893] ---[ end trace  ]---
> [  924.440217] trace_kprobe: error on cleaning up probes.
> [  924.440575] NG: Some tests are failed. Please check them.




Re: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 09:17:55 +
David Laight  wrote:

> > Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to
> > make masking easy). It's used for splice and will also be used for memory
> > mapping with user space.  
> 
> Perhaps then the sysctl to set the size should be powers of 4k

It's not a sysctl but a file in tracefs

> with a minimum size of PAGE_SIZE.
> Then you don't have to know the page size when setting things up.

The user shouldn't need to know either. But the size of the sub-buffer
limits the biggest size of an event, so the user only needs to make sure
the sub-buffer is bigger than their biggest event.

> 
> I'm also guessing that no Linux kernels have a PAGE_SIZE of 2k?
> IIRC some old mmu (maybe 68020 era) used 2k pages.

I think 1kb units is perfectly fine (patch 15 changes to kb units). The
interface says its to define the minimal size of the sub-buffer, not the
actual size.

-- Steve



Re: [PATCH 0/3] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2023-12-21 Thread Luca Weiss
On Thu Dec 21, 2023 at 1:53 PM CET, Konrad Dybcio wrote:
> On 21.12.2023 11:34, Dmitry Baryshkov wrote:
> > On Thu, 21 Dec 2023 at 09:33, Luca Weiss  wrote:
> >>
> >> On Wed Dec 20, 2023 at 1:32 PM CET, Konrad Dybcio wrote:
> >>> On 20.12.2023 11:02, Luca Weiss wrote:
>  This series adds all the necessary bits to enable USB-C role switching,
>  charger and fuel gauge (all via pmic-glink) on Fairphone 5.
> 
>  One thing that could be made different is the pmic-glink compatible.
>  I've chosen to use qcm6490 compatible for it and not sc7280 since
>  there's plenty of firmware variety on sc7280-based platforms and they
>  might require different quirks in the future, so limit this PDOS quirk
>  to just qcm6490 for now.
> 
>  If someone thinks it should be qcom,sc7280-pmic-glink, please let me
>  know :)
> >>> IMO it's best to continue using the "base soc" (which just so happened
> >>> to fall onto sc7280 this time around) for all compatibles, unless the
> >>> derivatives actually had changes
> >>
> >> Hi Konrad,
> >>
> >> I think at some point I asked Dmitry what he thought and he mentioned
> >> qcm6490. Even found the message again:
> >>
> >>> well, since it is a firmware thing, you might want to emphasise that.
> >>> So from my POV qcm6490 makes more sense
> >>
> >> But yeah since it's likely that sc7280 firmware behaves the same as
> >> qcm6490 firmware it's probably okay to use sc7280 compatible, worst case
> >> we change it later :) I'll send a v2 with those changes.
> > 
> > Worst case we end up with sc7280 which has yet another slightly
> > different UCSI / PMIC GLINK implementation, but the compatible string
> > is already taken.
> > I still suppose that this should be a qcm6490-related string.
> Right, let's keep qcm then

Ack from my side also. Thanks for the feedback!

>
> Konrad




Re: [PATCH 0/4] Section alignment issues?

2023-12-21 Thread Masahiro Yamada
On Thu, Nov 23, 2023 at 7:18 AM  wrote:
>
> From: Helge Deller 
>
> While working on the 64-bit parisc kernel, I noticed that the __ksymtab[]
> table was not correctly 64-bit aligned in many modules.
> The following patches do fix some of those issues in the generic code.
>
> But further investigation shows that multiple sections in the kernel and in
> modules are possibly not correctly aligned, and thus may lead to performance
> degregations at runtime (small on x86, huge on parisc, sparc and others which
> need exception handlers). Sometimes wrong alignments may also be simply hidden
> by the linker or kernel module loader which pulls in the sections by luck with
> a correct alignment (e.g. because the previous section was aligned already).
>
> An objdump on a x86 module shows e.g.:
>
> ./kernel/net/netfilter/nf_log_syslog.ko: file format elf64-x86-64
> Sections:
> Idx Name  Size  VMA   LMA   File off  Algn
>   0 .text 1fdf      0040  2**4
>   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
>   1 .init.text00f6      2020  2**4
>   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
>   2 .exit.text005c      2120  2**4
>   CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
>   3 .rodata.str1.8 00dc      2180  
> 2**3
>   CONTENTS, ALLOC, LOAD, READONLY, DATA
>   4 .rodata.str1.1 030a      225c  
> 2**0
>   CONTENTS, ALLOC, LOAD, READONLY, DATA
>   5 .rodata   00b0      2580  2**5
>   CONTENTS, ALLOC, LOAD, READONLY, DATA
>   6 .modinfo  019e      2630  2**0
>   CONTENTS, ALLOC, LOAD, READONLY, DATA
>   7 .return_sites 0034      27ce  2**0
>   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
>   8 .call_sites   029c      2802  2**0
>   CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
>
> In this example I believe the ".return_sites" and ".call_sites" should have
> an alignment of at least 32-bit (4 bytes).
>
> On other architectures or modules other sections like ".altinstructions" or
> "__ex_table" may show up wrongly instead.
>
> In general I think it would be beneficial to search for wrong alignments at
> link time, and maybe at runtime.
>
> The patch at the end of this cover letter
> - adds compile time checks to the "modpost" tool, and
> - adds a runtime check to the kernel module loader at runtime.
> And it will possibly show false positives too (!!!)
> I do understand that some of those sections are not performce critical
> and thus any alignment is OK.
>
> The modpost patch will emit at compile time such warnings (on x86-64 kernel 
> build):
>
> WARNING: modpost: vmlinux: section .initcall7.init (type 1, flags 2) has 
> alignment of 1, expected at least 4.
> Maybe you need to add ALIGN() to the modules.lds file (or fix modpost) ?
> WARNING: modpost: vmlinux: section .altinstructions (type 1, flags 2) has 
> alignment of 1, expected at least 2.
> WARNING: modpost: vmlinux: section .initcall6.init (type 1, flags 2) has 
> alignment of 1, expected at least 4.
> WARNING: modpost: vmlinux: section .initcallearly.init (type 1, flags 2) has 
> alignment of 1, expected at least 4.
> WARNING: modpost: vmlinux: section .rodata.cst2 (type 1, flags 18) has 
> alignment of 2, expected at least 64.
> WARNING: modpost: vmlinux: section .static_call_tramp_key (type 1, flags 2) 
> has alignment of 1, expected at least 8.
> WARNING: modpost: vmlinux: section .con_initcall.init (type 1, flags 2) has 
> alignment of 1, expected at least 8.
> WARNING: modpost: vmlinux: section __bug_table (type 1, flags 3) has 
> alignment of 1, expected at least 4.
> ...




modpost acts on vmlinux.o instead of vmlinux.


vmlinux.o is a relocatable ELF, which is not a real layout
because no linker script has been considered yet at this
point.


vmlinux is an executable ELF, produced by a linker,
with the linker script taken into consideration
to determine the final section/symbol layout.


So, checking this in modpost is meaningless.



I did not check the module checking code, but
modules are also relocatable ELF.











>
> while the kernel module loader may show at runtime:
>
> ** WARNING **:   module: efivarfs, dest=0xc02d08d2, 
> section=.retpoline_sites possibly not correctly aligned.
> ** WARNING **:   module: efivarfs, dest=0xc02d0bae, 
> section=.ibt_endbr_seal possibly not correctly aligned.
> ** WARNING **:   module: efivarfs, dest=0xc02d0be6, 
> section=.orc_unwind possibly not correctly aligned.
> ** WARNING **:   module: efivarfs, dest=0xc02d12a6, 
> 

Re: [PATCH 3/4] vmlinux.lds.h: Fix alignment for __ksymtab*, __kcrctab_* and .pci_fixup sections

2023-12-21 Thread Masahiro Yamada
On Thu, Nov 23, 2023 at 7:18 AM  wrote:
>
> From: Helge Deller 
>
> On 64-bit architectures without CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> (e.g. ppc64, ppc64le, parisc, s390x,...) the __KSYM_REF() macro stores
> 64-bit pointers into the __ksymtab* sections.
> Make sure that the start of those sections is 64-bit aligned in the vmlinux
> executable, otherwise unaligned memory accesses may happen at runtime.


Are you solving a real problem?


1/4 already ensures the proper alignment of __ksymtab*, doesn't it?



I applied the following hack to attempt to
break the alignment intentionally.


diff --git a/include/asm-generic/vmlinux.lds.h
b/include/asm-generic/vmlinux.lds.h
index bae0fe4d499b..e2b5c9acee97 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -482,7 +482,7 @@
TRACEDATA   \
\
PRINTK_INDEX\
-   \
+   . = . + 1;  \
/* Kernel symbol table: Normal symbols */   \
__ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
__start___ksymtab = .;  \




The __ksymtab section and __start___ksymtab symbol
are still properly aligned due to the '.balign'
in 



So, my understanding is this patch is unneeded.


Or, does the behaviour depend on toolchains?








> The __kcrctab* sections store 32-bit entities, so make those sections
> 32-bit aligned.
>
> The pci fixup routines want to be 64-bit aligned on 64-bit platforms
> which don't define CONFIG_HAVE_ARCH_PREL32_RELOCATIONS. An alignment
> of 8 bytes is sufficient to guarantee aligned accesses at runtime.
>
> Signed-off-by: Helge Deller 
> Cc:  # v6.0+


















> ---
>  include/asm-generic/vmlinux.lds.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index bae0fe4d499b..fa4335346e7d 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -467,6 +467,7 @@
> }   \
> \
> /* PCI quirks */\
> +   . = ALIGN(8);   \
> .pci_fixup: AT(ADDR(.pci_fixup) - LOAD_OFFSET) {\
> BOUNDED_SECTION_PRE_LABEL(.pci_fixup_early,  
> _pci_fixups_early,  __start, __end) \
> BOUNDED_SECTION_PRE_LABEL(.pci_fixup_header, 
> _pci_fixups_header, __start, __end) \
> @@ -484,6 +485,7 @@
> PRINTK_INDEX\
> \
> /* Kernel symbol table: Normal symbols */   \
> +   . = ALIGN(8);   \
> __ksymtab : AT(ADDR(__ksymtab) - LOAD_OFFSET) { \
> __start___ksymtab = .;  \
> KEEP(*(SORT(___ksymtab+*))) \
> @@ -491,6 +493,7 @@
> }   \
> \
> /* Kernel symbol table: GPL-only symbols */ \
> +   . = ALIGN(8);   \
> __ksymtab_gpl : AT(ADDR(__ksymtab_gpl) - LOAD_OFFSET) { \
> __start___ksymtab_gpl = .;  \
> KEEP(*(SORT(___ksymtab_gpl+*))) \
> @@ -498,6 +501,7 @@
> }   \
> \
> /* Kernel symbol table: Normal symbols */   \
> +   . = ALIGN(4);   \
> __kcrctab : AT(ADDR(__kcrctab) - LOAD_OFFSET) { \
> __start___kcrctab = .;  \
> KEEP(*(SORT(___kcrctab+*))) \
> @@ -505,6 +509,7 @@
> }   \
> \
> /* Kernel symbol table: GPL-only symbols */ \
> +   . = ALIGN(4);   \
> __kcrctab_gpl : AT(ADDR(__kcrctab_gpl) - LOAD_OFFSET) { \
> 

Re: [PATCH 0/3] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2023-12-21 Thread Konrad Dybcio
On 21.12.2023 11:34, Dmitry Baryshkov wrote:
> On Thu, 21 Dec 2023 at 09:33, Luca Weiss  wrote:
>>
>> On Wed Dec 20, 2023 at 1:32 PM CET, Konrad Dybcio wrote:
>>> On 20.12.2023 11:02, Luca Weiss wrote:
 This series adds all the necessary bits to enable USB-C role switching,
 charger and fuel gauge (all via pmic-glink) on Fairphone 5.

 One thing that could be made different is the pmic-glink compatible.
 I've chosen to use qcm6490 compatible for it and not sc7280 since
 there's plenty of firmware variety on sc7280-based platforms and they
 might require different quirks in the future, so limit this PDOS quirk
 to just qcm6490 for now.

 If someone thinks it should be qcom,sc7280-pmic-glink, please let me
 know :)
>>> IMO it's best to continue using the "base soc" (which just so happened
>>> to fall onto sc7280 this time around) for all compatibles, unless the
>>> derivatives actually had changes
>>
>> Hi Konrad,
>>
>> I think at some point I asked Dmitry what he thought and he mentioned
>> qcm6490. Even found the message again:
>>
>>> well, since it is a firmware thing, you might want to emphasise that.
>>> So from my POV qcm6490 makes more sense
>>
>> But yeah since it's likely that sc7280 firmware behaves the same as
>> qcm6490 firmware it's probably okay to use sc7280 compatible, worst case
>> we change it later :) I'll send a v2 with those changes.
> 
> Worst case we end up with sc7280 which has yet another slightly
> different UCSI / PMIC GLINK implementation, but the compatible string
> is already taken.
> I still suppose that this should be a qcm6490-related string.
Right, let's keep qcm then

Konrad



Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-21 Thread Michael Ellerman
Cc +Kees

Christophe Leroy  writes:
> Declaring rodata_enabled and mark_rodata_ro() at all time
> helps removing related #ifdefery in C files.
>
> Signed-off-by: Christophe Leroy 
> ---
>  include/linux/init.h |  4 
>  init/main.c  | 21 +++--
>  2 files changed, 7 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/init.h b/include/linux/init.h
> index 01b52c9c7526..d2b47be38a07 100644
> --- a/include/linux/init.h
> +++ b/include/linux/init.h
> @@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
>  
>  extern struct file_system_type rootfs_fs_type;
>  
> -#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
>  extern bool rodata_enabled;
> -#endif
> -#ifdef CONFIG_STRICT_KERNEL_RWX
>  void mark_rodata_ro(void);
> -#endif
>  
>  extern void (*late_time_init)(void);
>  
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..807df08c501f 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
>  early_param("rodata", set_debug_rodata);
>  #endif
>  
> -#ifdef CONFIG_STRICT_KERNEL_RWX
>  static void mark_readonly(void)
>  {
> - if (rodata_enabled) {
> + if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {
>   /*
>* load_module() results in W+X mappings, which are cleaned
>* up with call_rcu().  Let's make sure that queued work is
> @@ -1409,20 +1408,14 @@ static void mark_readonly(void)
>   rcu_barrier();
>   mark_rodata_ro();
>   rodata_test();
> - } else
> + } else if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) {
>   pr_info("Kernel memory protection disabled.\n");
> + } else if (IS_ENABLED(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)) {
> + pr_warn("Kernel memory protection not selected by kernel 
> config.\n");
> + } else {
> + pr_warn("This architecture does not have kernel memory 
> protection.\n");
> + }
>  }
> -#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
> -static inline void mark_readonly(void)
> -{
> - pr_warn("Kernel memory protection not selected by kernel config.\n");
> -}
> -#else
> -static inline void mark_readonly(void)
> -{
> - pr_warn("This architecture does not have kernel memory protection.\n");
> -}
> -#endif
>  
>  void __weak free_initmem(void)
>  {
> -- 
> 2.41.0



Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-12-21 Thread Alexander Gordeev
On Thu, Dec 14, 2023 at 12:24:48AM +0100, Ilya Leoshkevich wrote:
> The pages for the KMSAN metadata associated with most kernel mappings
> are taken from memblock by the common code. However, vmalloc and module
> metadata needs to be defined by the architectures.
> 
> Be a little bit more careful than x86: allocate exactly MODULES_LEN
> for the module shadow and origins, and then take 2/3 of vmalloc for
> the vmalloc shadow and origins. This ensures that users passing small
> vmalloc= values on the command line do not cause module metadata
> collisions.
> 
> Reviewed-by: Alexander Potapenko 
> Signed-off-by: Ilya Leoshkevich 
> ---
>  arch/s390/boot/startup.c|  8 
>  arch/s390/include/asm/pgtable.h | 10 ++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
> index 8104e0e3d188..e37e7ffda430 100644
> --- a/arch/s390/boot/startup.c
> +++ b/arch/s390/boot/startup.c
> @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
>   MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
>   MODULES_VADDR = MODULES_END - MODULES_LEN;
>   VMALLOC_END = MODULES_VADDR;
> +#ifdef CONFIG_KMSAN
> + VMALLOC_END -= MODULES_LEN * 2;
> +#endif
>  
>   /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> space left */
>   vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> _REGION3_SIZE));

Since commit 2a65c6e1ad06 ("s390/boot: always align vmalloc area on segment 
boundary")
vmalloc_size is aligned on _SEGMENT_SIZE boundary.

> +#ifdef CONFIG_KMSAN
> + /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> + vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);

And thus, the alignment here should be _SEGMENT_SIZE as well.

> + VMALLOC_END -= vmalloc_size * 2;
> +#endif
>   VMALLOC_START = VMALLOC_END - vmalloc_size;
>  
>   /* split remaining virtual space between 1:1 mapping & vmemmap array */

...

With the above fixup:
Acked-by: Alexander Gordeev 

Thanks!



Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Eugenio Perez Martin
On Thu, Dec 21, 2023 at 12:52 PM Dragos Tatulea  wrote:
>
> On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> > On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> > >
> > > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> > >  wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea 
> > > > > >  wrote:
> > > > > > >
> > > > > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > > > > DRIVER_OK. Some devices do support this operation when the device 
> > > > > > > is
> > > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > > advertises this support as a backend features.
> > > > > >
> > > > > > There's an ongoing effort in virtio spec to introduce the suspend 
> > > > > > state.
> > > > > >
> > > > > > So I wonder if it's better to just allow such behaviour?
> > > > >
> > > > > Actually I mean, allow drivers to modify the parameters during suspend
> > > > > without a new feature.
> > > > >
> > > >
> > > > That would be ideal, but how do userland checks if it can suspend +
> > > > change properties + resume?
> > >
> > > As discussed, it looks to me the only device that supports suspend is
> > > simulator and it supports change properties.
> > >
> > > E.g:
> > >
> > > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> > >   u64 desc_area, u64 driver_area,
> > >   u64 device_area)
> > > {
> > > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > > struct vdpasim_virtqueue *vq = >vqs[idx];
> > >
> > > vq->desc_addr = desc_area;
> > > vq->driver_addr = driver_area;
> > > vq->device_addr = device_area;
> > >
> > > return 0;
> > > }
> > >
> >
> > So in the current kernel master it is valid to set a different vq
> > address while the device is suspended in vdpa_sim. But it is not valid
> > in mlx5, as the FW will not be updated in resume (Dragos, please
> > correct me if I'm wrong). Both of them return success.
> >
> In the current state, there is no resume. HW Virtqueues will just get 
> re-created
> with the new address.
>

Oh, then all of this is effectively transparent to the userspace
except for the time it takes?

In that case you're right, we don't need feature flags. But I think it
would be great to also move the error return in case userspace tries
to modify vq parameters out of suspend state.

Thanks!


> > How can we know in the destination QEMU if it is valid to suspend &
> > set address? Should we handle this as a bugfix and backport the
> > change?
> >
> > > >
> > > > The only way that comes to my mind is to make sure all parents return
> > > > error if userland tries to do it, and then fallback in userland.
> > >
> > > Yes.
> > >
> > > > I'm
> > > > ok with that, but I'm not sure if the current master & previous kernel
> > > > has a coherent behavior. Do they return error? Or return success
> > > > without changing address / vq state?
> > >
> > > We probably don't need to worry too much here, as e.g set_vq_address
> > > could fail even without suspend (just at uAPI level).
> > >
> >
> > I don't get this, sorry. I rephrased my point with an example earlier
> > in the mail.
> >
>




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Dragos Tatulea
On Thu, 2023-12-21 at 08:46 +0100, Eugenio Perez Martin wrote:
> On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> > 
> > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> >  wrote:
> > > 
> > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  wrote:
> > > > 
> > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
> > > > > 
> > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  
> > > > > wrote:
> > > > > > 
> > > > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > > > DRIVER_OK. Some devices do support this operation when the device is
> > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > advertises this support as a backend features.
> > > > > 
> > > > > There's an ongoing effort in virtio spec to introduce the suspend 
> > > > > state.
> > > > > 
> > > > > So I wonder if it's better to just allow such behaviour?
> > > > 
> > > > Actually I mean, allow drivers to modify the parameters during suspend
> > > > without a new feature.
> > > > 
> > > 
> > > That would be ideal, but how do userland checks if it can suspend +
> > > change properties + resume?
> > 
> > As discussed, it looks to me the only device that supports suspend is
> > simulator and it supports change properties.
> > 
> > E.g:
> > 
> > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >   u64 desc_area, u64 driver_area,
> >   u64 device_area)
> > {
> > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > struct vdpasim_virtqueue *vq = >vqs[idx];
> > 
> > vq->desc_addr = desc_area;
> > vq->driver_addr = driver_area;
> > vq->device_addr = device_area;
> > 
> > return 0;
> > }
> > 
> 
> So in the current kernel master it is valid to set a different vq
> address while the device is suspended in vdpa_sim. But it is not valid
> in mlx5, as the FW will not be updated in resume (Dragos, please
> correct me if I'm wrong). Both of them return success.
> 
In the current state, there is no resume. HW Virtqueues will just get re-created
with the new address. 

> How can we know in the destination QEMU if it is valid to suspend &
> set address? Should we handle this as a bugfix and backport the
> change?
> 
> > > 
> > > The only way that comes to my mind is to make sure all parents return
> > > error if userland tries to do it, and then fallback in userland.
> > 
> > Yes.
> > 
> > > I'm
> > > ok with that, but I'm not sure if the current master & previous kernel
> > > has a coherent behavior. Do they return error? Or return success
> > > without changing address / vq state?
> > 
> > We probably don't need to worry too much here, as e.g set_vq_address
> > could fail even without suspend (just at uAPI level).
> > 
> 
> I don't get this, sorry. I rephrased my point with an example earlier
> in the mail.
> 



Re: [PATCH v7] scripts/link-vmlinux.sh: Add alias to duplicate symbols for kallsyms

2023-12-21 Thread Alessandro Carminati
gentle ping

Il giorno lun 4 dic 2023 alle ore 22:47 Alessandro Carminati (Red Hat)
 ha scritto:
>
> In the kernel environment, situations frequently arise where identical
> names are shared among symbols within both the core image and modules.
> While this doesn't pose issues for the kernel's binary itself, it
> complicates trace or probe operations using tools like kprobe.
>
> This patch introduces "kas_alias" to address this challenge.
>
> During the kernel's build process, just before linking the vmlinux
> image in the "scripts/link-vmlinux.sh", symbol name frequencies are
> collected.
> This collection includes both the core kernel components and modules.
> Subsequently, within the same action, the nm data relative to vmlinux
> is modified by adding aliases based on the comprehensive symbol
> information gathered.
>
> The collection process occurs in two phases:
>
> 1. First phase: Executed during the linking of vmlinux, "kas_alias" scans
>all symbols provided by the 'nm' data against the vmlinux core image
>and all objects used for module linkage. This phase requires all
>modules objects to be produced at this stage, thereby adding a vmlinux
>dependency for linking modules in 'scripts/Makefile.modfinal'.
>
> 2. Second phase: In a subsequent run in the same build, "kas_alias"
>processes module objects and injects aliases into the objects' symbol
>tables where necessary. This operation is done by modifying
>'scripts/Makefile.modfinal' to include an action for each processed
>module.
>
> Example:
>
> Consider the symbol "device_show", you can expect an output like the
> following:
>
>  ~ # cat /proc/kallsyms | grep " name_show"
> caa2bb4f01c8 t name_show
> caa2bb4f01c8 t name_show@kernel_irq_irqdesc_c_264
> caa2bb9c1a30 t name_show
> caa2bb9c1a30 t name_show@drivers_pnp_card_c_186
> caa2bbac4754 t name_show
> caa2bbac4754 t name_show@drivers_regulator_core_c_678
> caa2bbba4900 t name_show
> caa2bbba4900 t name_show@drivers_base_power_wakeup_stats_c_93
> caa2bbec4038 t name_show
> caa2bbec4038 t name_show@drivers_rtc_sysfs_c_26
> caa2bbecc920 t name_show
> caa2bbecc920 t name_show@drivers_i2c_i2c_core_base_c_660
> caa2bbed3840 t name_show
> caa2bbed3840 t name_show@drivers_i2c_i2c_dev_c_100
> caa2bbef7210 t name_show
> caa2bbef7210 t name_show@drivers_pps_sysfs_c_66
> caa2bbf03328 t name_show
> caa2bbf03328 t name_show@drivers_hwmon_hwmon_c_72
> caa2bbff6f3c t name_show
> caa2bbff6f3c t name_show@drivers_remoteproc_remoteproc_sysfs_c_215
> caa2bbff8d78 t name_show
> caa2bbff8d78 t name_show@drivers_rpmsg_rpmsg_core_c_455
> caa2bbfff7a4 t name_show
> caa2bbfff7a4 t name_show@drivers_devfreq_devfreq_c_1395
> caa2bc001f60 t name_show
> caa2bc001f60 t name_show@drivers_extcon_extcon_c_389
> caa2bc009890 t name_show
> caa2bc009890 t name_show@drivers_iio_industrialio_core_c_1396
> caa2bc01212c t name_show
> caa2bc01212c t name_show@drivers_iio_industrialio_trigger_c_51
> caa2bc025e2c t name_show
> caa2bc025e2c t name_show@drivers_fpga_fpga_mgr_c_618
> caa2a052102c t name_show[hello]
> caa2a052102c t name_show@hello_hello_c_8[hello]
> caa2a051955c t name_show[rpmsg_char]
> caa2a051955c t name_show@drivers_rpmsg_rpmsg_char_c_365 [rpmsg_char]
>
> where hello, is a plain helloworld module built OOT.
>
> Tested-by: Masami Hiramatsu (Google) 
> Signed-off-by: Alessandro Carminati (Red Hat) 
>
> ---
>
> NOTE1:
> About the symbols name duplication that happens as consequence of the
> inclusion compat_binfmt_elf.c does, it is evident that this corner is
> inherently challenging the addr2line approach.
> Attempting to conceal this limitation would be counterproductive.
>
> compat_binfmt_elf.c includes directly binfmt_elf.c, addr2line can't help
> but report all functions and data declared by that file, coming from
> binfmt_elf.c.
>
> My position is that, rather than producing a more complicated pipeline
> to handle this corner case, it is better to fix the compat_binfmt_elf.c
> anomaly.
>
> This patch does not deal with the two potentially problematic symbols
> defined by compat_binfmt_elf.c
>
> Changes from v1:
> * Integrated changes requested by Masami to exclude symbols with prefixes
>   "_cfi" and "_pfx".
> * Introduced a small framework to handle patterns that need to be excluded
>   from the alias production.
> * Excluded other symbols using the framework.
> * Introduced the ability to discriminate between text and data symbols.
> * Added two new config symbols in this version:
>   CONFIG_KALLSYMS_ALIAS_DATA, which allows data for data, and
>   CONFIG_KALLSYMS_ALIAS_DATA_ALL, which excludes all filters and provides
>   an alias for each duplicated symbol.
>
> https://lore.kernel.org/all/20230711151925.1092080-1-alessandro.carmin...@gmail.com/
>
> Changes from v2:
> * Alias tags are created by querying DWARF information from 

Re: [PATCH 0/3] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2023-12-21 Thread Dmitry Baryshkov
On Thu, 21 Dec 2023 at 09:33, Luca Weiss  wrote:
>
> On Wed Dec 20, 2023 at 1:32 PM CET, Konrad Dybcio wrote:
> > On 20.12.2023 11:02, Luca Weiss wrote:
> > > This series adds all the necessary bits to enable USB-C role switching,
> > > charger and fuel gauge (all via pmic-glink) on Fairphone 5.
> > >
> > > One thing that could be made different is the pmic-glink compatible.
> > > I've chosen to use qcm6490 compatible for it and not sc7280 since
> > > there's plenty of firmware variety on sc7280-based platforms and they
> > > might require different quirks in the future, so limit this PDOS quirk
> > > to just qcm6490 for now.
> > >
> > > If someone thinks it should be qcom,sc7280-pmic-glink, please let me
> > > know :)
> > IMO it's best to continue using the "base soc" (which just so happened
> > to fall onto sc7280 this time around) for all compatibles, unless the
> > derivatives actually had changes
>
> Hi Konrad,
>
> I think at some point I asked Dmitry what he thought and he mentioned
> qcm6490. Even found the message again:
>
> > well, since it is a firmware thing, you might want to emphasise that.
> > So from my POV qcm6490 makes more sense
>
> But yeah since it's likely that sc7280 firmware behaves the same as
> qcm6490 firmware it's probably okay to use sc7280 compatible, worst case
> we change it later :) I'll send a v2 with those changes.

Worst case we end up with sc7280 which has yet another slightly
different UCSI / PMIC GLINK implementation, but the compatible string
is already taken.
I still suppose that this should be a qcm6490-related string.

>
> Regards
> Luca
>
> >
> > as far as firmware goes, I *think* CrOS doesn't even have PMIC_GLINK?
> > There are however WoA 7280 laptops which totally should have it.. Would
> > be nice to hunt some down and see if they report different stuff to
> > what's there on android firmware
> >
> > Konrad
>


-- 
With best wishes
Dmitry



Re: [PATCH 1/4] linux/export: Fix alignment for 64-bit ksymtab entries

2023-12-21 Thread Masahiro Yamada
On Thu, Nov 23, 2023 at 7:18 AM  wrote:
>
> From: Helge Deller 
>
> An alignment of 4 bytes is wrong for 64-bit platforms which don't define
> CONFIG_HAVE_ARCH_PREL32_RELOCATIONS (which then store 64-bit pointers).
> Fix their alignment to 8 bytes.
>
> Signed-off-by: Helge Deller 


This is correct.

Acked-by: Masahiro Yamada 

Please add


Fixes: ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")





> ---
>  include/linux/export-internal.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/export-internal.h b/include/linux/export-internal.h
> index 69501e0ec239..cd253eb51d6c 100644
> --- a/include/linux/export-internal.h
> +++ b/include/linux/export-internal.h
> @@ -16,10 +16,13 @@
>   * and eliminates the need for absolute relocations that require runtime
>   * processing on relocatable kernels.
>   */
> +#define __KSYM_ALIGN   ".balign 4"
>  #define __KSYM_REF(sym)".long " #sym "- ."
>  #elif defined(CONFIG_64BIT)
> +#define __KSYM_ALIGN   ".balign 8"
>  #define __KSYM_REF(sym)".quad " #sym
>  #else
> +#define __KSYM_ALIGN   ".balign 4"
>  #define __KSYM_REF(sym)".long " #sym
>  #endif
>
> @@ -42,7 +45,7 @@
> "   .asciz \"" ns "\""  "\n"  
>   \
> "   .previous"  "\n"  
>   \
> "   .section \"___ksymtab" sec "+" #name "\", \"a\"""\n"  
>   \
> -   "   .balign 4"  "\n"  
>   \
> +   __KSYM_ALIGN"\n"  
>   \
> "__ksymtab_" #name ":"  "\n"  
>   \
> __KSYM_REF(sym) "\n"  
>   \
> __KSYM_REF(__kstrtab_ ##name)   "\n"  
>   \
> --
> 2.41.0
>


-- 
Best Regards
Masahiro Yamada



[PATCH V1] net: qrtr: ns: Return 0 if server port is not present

2023-12-21 Thread Sarannya S
When a 'DEL_CLIENT' message is received from the remote, the corresponding
server port gets deleted. A DEL_SERVER message is then announced for this
server. As part of handling the subsequent DEL_SERVER message, the name-
server attempts to delete the server port which results in a '-ENOENT' error.
The return value from server_del() is then propagated back to qrtr_ns_worker,
causing excessive error prints.
To address this, return 0 from control_cmd_del_server() without checking the
return value of server_del(), since the above scenario is not an error case
and hence server_del() doesn't have any other error return value.

Signed-off-by: Sarannya Sasikumar 
---
 net/qrtr/ns.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
index b1db0b5..abb0c70 100644
--- a/net/qrtr/ns.c
+++ b/net/qrtr/ns.c
@@ -512,7 +512,9 @@ static int ctrl_cmd_del_server(struct sockaddr_qrtr *from,
if (!node)
return -ENOENT;
 
-   return server_del(node, port, true);
+   server_del(node, port, true);
+
+   return 0;
 }
 
 static int ctrl_cmd_new_lookup(struct sockaddr_qrtr *from,
-- 
2.7.4




[PATCH V1] net: qrtr: ns: Ignore ENODEV failures in ns

2023-12-21 Thread Sarannya S
From: Chris Lew 

Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
indicate that either the local port has been closed or the remote has
gone down. Neither of these scenarios are fatal and will eventually be
handled through packets that are later queued on the control port.

Signed-off-by: Chris Lew 
Signed-off-by: Sarannya Sasikumar 
---
 net/qrtr/ns.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
index abb0c70..8234339 100644
--- a/net/qrtr/ns.c
+++ b/net/qrtr/ns.c
@@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr *dest,
msg.msg_namelen = sizeof(*dest);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0)
+   if (ret < 0 && ret != -ENODEV)
pr_err("failed to announce del service\n");
 
return ret;
@@ -188,7 +188,7 @@ static void lookup_notify(struct sockaddr_qrtr *to, struct 
qrtr_server *srv,
msg.msg_namelen = sizeof(*to);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0)
+   if (ret < 0 && ret != -ENODEV)
pr_err("failed to send lookup notification\n");
 }
 
@@ -207,6 +207,9 @@ static int announce_servers(struct sockaddr_qrtr *sq)
xa_for_each(>servers, index, srv) {
ret = service_announce_new(sq, srv);
if (ret < 0) {
+   if (ret == -ENODEV)
+   continue;
+
pr_err("failed to announce new service\n");
return ret;
}
@@ -369,7 +372,7 @@ static int ctrl_cmd_bye(struct sockaddr_qrtr *from)
msg.msg_namelen = sizeof(sq);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0) {
+   if (ret < 0 && ret != -ENODEV) {
pr_err("failed to send bye cmd\n");
return ret;
}
@@ -443,7 +446,7 @@ static int ctrl_cmd_del_client(struct sockaddr_qrtr *from,
msg.msg_namelen = sizeof(sq);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0) {
+   if (ret < 0 && ret != -ENODEV) {
pr_err("failed to send del client cmd\n");
return ret;
}
-- 
2.7.4




Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-21 Thread Alexander Graf

Hi Steve,

On 20.12.23 17:15, Steven Rostedt wrote:


From: "Steven Rostedt (Google)" 

The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.

The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.

The synth_event_gen_test_exit() disables the events it created and
destroys the events.

If the module is builtin, the events are never disabled. The issue is, the
events should be disable after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.

When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:

  Running tests on trace events:
  Testing event create_synth_test:
  Enabled event during self test!
  [ cut here ]
  WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 
event_trace_self_tests+0x1c2/0x480
  Modules linked in:
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-debian-1.16.2-1 04/01/2014
  RIP: 0010:event_trace_self_tests+0x1c2/0x480
  Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 
fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 
fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
  RSP: :c901fdc0 EFLAGS: 00010246
  RAX: 0029 RBX: 88810399ca80 RCX: 
  RDX:  RSI: b9f19478 RDI: 88823c734e64
  RBP: 88810399f300 R08:  R09: fbfff79eb32a
  R10: bcf59957 R11: 0001 R12: 888104068090
  R13: bc89f0a0 R14: bc8a0f08 R15: 0078
  FS:  () GS:88823c70() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2:  CR3: 0001f6282001 CR4: 00170ef0
  Call Trace:
   
   ? __warn+0xa5/0x200
   ? event_trace_self_tests+0x1c2/0x480
   ? report_bug+0x1f6/0x220
   ? handle_bug+0x6f/0x90
   ? exc_invalid_op+0x17/0x50
   ? asm_exc_invalid_op+0x1a/0x20
   ? tracer_preempt_on+0x78/0x1c0
   ? event_trace_self_tests+0x1c2/0x480
   ? __pfx_event_trace_self_tests_init+0x10/0x10
   event_trace_self_tests_init+0x27/0xe0
   do_one_initcall+0xd6/0x3c0
   ? __pfx_do_one_initcall+0x10/0x10
   ? kasan_set_track+0x25/0x30
   ? rcu_is_watching+0x38/0x60
   kernel_init_freeable+0x324/0x450
   ? __pfx_kernel_init+0x10/0x10
   kernel_init+0x1f/0x1e0
   ? _raw_spin_unlock_irq+0x33/0x50
   ret_from_fork+0x34/0x60
   ? __pfx_kernel_init+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
   

This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.

Cc: sta...@vger.kernel.org
Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module")
Reported-by: Alexander Graf 
Signed-off-by: Steven Rostedt (Google) 



Thanks a bunch for the super quick turnaround time for the fix! I can 
confirm that I'm no longer seeing the warning :)


Tested-by: Alexander Graf 


Do we need another similar patch for the kprobe self tests? The below is 
with 55cb5f43689d7 plus an unrelated initrd patch plus this patch and 
the following .config: http://csgraf.de/tmp2/config-ftrace.xz


[  919.717134] Testing all events: OK
[  924.418194] Testing ftrace filter: OK
[  924.418887] trace_kprobe: Testing kprobe tracing:
[  924.424244] [ cut here ]
[  924.424952] WARNING: CPU: 2 PID: 1 at 
kernel/trace/trace_kprobe.c:2073 kprobe_trace_self_tests_init+0x192/0x540

[  924.425659] Modules linked in:
[  924.425886] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
6.7.0-rc6-00024-gc10698ad3c9a #298
[  924.426448] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014

[  924.427230] RIP: 0010:kprobe_trace_self_tests_init+0x192/0x540
[  924.427639] Code: 7e 10 74 3b 48 8b 36 48 39 d6 75 f2 0f 0b 48 c7 c7 
58 71 79 a5 e8 ee 3e 5a fe 48 c7 c7 20 38 b7 a5 e8 a2 51 68 fe 85 c0 74 
33 <0f> 0b 48 c7 c7 38 73 79 a5 e8 d0 3e 5a fe e8 4b 64 62 fe eb 23 48

[  924.428922] RSP: :ab508001be58 EFLAGS: 00010286
[  924.429288] RAX: fff0 RBX: 005a RCX: 
0202
[  924.429800] RDX:  RSI: 0002e970 RDI: 
a5b708a0
[  924.430295] RBP:  R08: 0001 R09: 
a4855a90
[  924.430794] R10: 0007 R11: 028a R12: 
0001
[  924.431286] R13: a5cc9a00 R14: 8cec81bebe00 R15: 
a619f0f0
[  924.431785] FS:  () GS:8cecf910() 
knlGS:

[  924.432346] CS:  0010 DS:  ES:  CR0: 80050033
[  924.432748] CR2: 

[PATCH V1 1/1] net: qrtr: ns: Ignore ENODEV failures in ns

2023-12-21 Thread Sarannya S
From: Chris Lew 

Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
indicate that either the local port has been closed or the remote has
gone down. Neither of these scenarios are fatal and will eventually be
handled through packets that are later queued on the control port.

Signed-off-by: Chris Lew 
Signed-off-by: Sarannya Sasikumar 
---
 net/qrtr/ns.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
index abb0c70..8234339 100644
--- a/net/qrtr/ns.c
+++ b/net/qrtr/ns.c
@@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr *dest,
msg.msg_namelen = sizeof(*dest);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0)
+   if (ret < 0 && ret != -ENODEV)
pr_err("failed to announce del service\n");
 
return ret;
@@ -188,7 +188,7 @@ static void lookup_notify(struct sockaddr_qrtr *to, struct 
qrtr_server *srv,
msg.msg_namelen = sizeof(*to);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0)
+   if (ret < 0 && ret != -ENODEV)
pr_err("failed to send lookup notification\n");
 }
 
@@ -207,6 +207,9 @@ static int announce_servers(struct sockaddr_qrtr *sq)
xa_for_each(>servers, index, srv) {
ret = service_announce_new(sq, srv);
if (ret < 0) {
+   if (ret == -ENODEV)
+   continue;
+
pr_err("failed to announce new service\n");
return ret;
}
@@ -369,7 +372,7 @@ static int ctrl_cmd_bye(struct sockaddr_qrtr *from)
msg.msg_namelen = sizeof(sq);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0) {
+   if (ret < 0 && ret != -ENODEV) {
pr_err("failed to send bye cmd\n");
return ret;
}
@@ -443,7 +446,7 @@ static int ctrl_cmd_del_client(struct sockaddr_qrtr *from,
msg.msg_namelen = sizeof(sq);
 
ret = kernel_sendmsg(qrtr_ns.sock, , , 1, sizeof(pkt));
-   if (ret < 0) {
+   if (ret < 0 && ret != -ENODEV) {
pr_err("failed to send del client cmd\n");
return ret;
}
-- 
2.7.4




Re: [PATCH 1/3] dt-bindings: soc: qcom: qcom,pmic-glink: document QCM6490 compatible

2023-12-21 Thread Krzysztof Kozlowski
On 20/12/2023 11:02, Luca Weiss wrote:
> Document the QCM6490 compatible used to describe the pmic glink on this
> platform.
> 
> Signed-off-by: Luca Weiss 
> ---

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




RE: [PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-21 Thread David Laight
From: Steven Rostedt
> Sent: 20 December 2023 13:01
> 
> On Wed, 20 Dec 2023 08:48:02 +
> David Laight  wrote:
> 
> > From: Steven Rostedt
> > > Sent: 19 December 2023 18:54
> > > From: "Tzvetomir Stoyanov (VMware)" 
> > >
> > > Currently the size of one sub buffer page is global for all buffers and
> > > it is hard coded to one system page. In order to introduce configurable
> > > ring buffer sub page size, the internal logic should be refactored to
> > > work with sub page size per ring buffer.
> > >
> > ...
> > > - nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
> > > + /* Default buffer page size - one system page */
> > > + buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
> > > +
> > > + /* Max payload is buffer page size - header (8bytes) */
> > > + buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
> > > +
> > > + nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
> >
> > While not new, does this really make any sense for systems with 64k pages?
> > Wouldn't it be better to have units of 4k?
> 
> Unfortunately, it has to be PAGE_SIZE (and for now it's a power of 2 to
> make masking easy). It's used for splice and will also be used for memory
> mapping with user space.

Perhaps then the sysctl to set the size should be powers of 4k
with a minimum size of PAGE_SIZE.
Then you don't have to know the page size when setting things up.

I'm also guessing that no Linux kernels have a PAGE_SIZE of 2k?
IIRC some old mmu (maybe 68020 era) used 2k pages.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)




[PATCH 3/3] powerpc: Simplify strict_kernel_rwx_enabled()

2023-12-21 Thread Christophe Leroy
Now that rodata_enabled is always declared, remove #ifdef
and define a single version of strict_kernel_rwx_enabled().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/mmu.h | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index d8b7e246a32f..24241995f740 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -330,17 +330,10 @@ static __always_inline bool early_radix_enabled(void)
return early_mmu_has_feature(MMU_FTR_TYPE_RADIX);
 }
 
-#ifdef CONFIG_STRICT_KERNEL_RWX
 static inline bool strict_kernel_rwx_enabled(void)
 {
-   return rodata_enabled;
+   return IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled;
 }
-#else
-static inline bool strict_kernel_rwx_enabled(void)
-{
-   return false;
-}
-#endif
 
 static inline bool strict_module_rwx_enabled(void)
 {
-- 
2.41.0




[PATCH 2/3] modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled

2023-12-21 Thread Christophe Leroy
Now that rodata_enabled is declared at all time, the #ifdef
CONFIG_STRICT_MODULE_RWX can be removed.

Signed-off-by: Christophe Leroy 
---
 kernel/module/strict_rwx.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/module/strict_rwx.c b/kernel/module/strict_rwx.c
index a2b656b4e3d2..eadff63b6e80 100644
--- a/kernel/module/strict_rwx.c
+++ b/kernel/module/strict_rwx.c
@@ -34,12 +34,8 @@ void module_enable_x(const struct module *mod)
 
 void module_enable_ro(const struct module *mod, bool after_init)
 {
-   if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
-   return;
-#ifdef CONFIG_STRICT_MODULE_RWX
-   if (!rodata_enabled)
+   if (!IS_ENABLED(CONFIG_STRICT_MODULE_RWX) || !rodata_enabled)
return;
-#endif
 
module_set_memory(mod, MOD_TEXT, set_memory_ro);
module_set_memory(mod, MOD_INIT_TEXT, set_memory_ro);
-- 
2.41.0




[PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time

2023-12-21 Thread Christophe Leroy
Declaring rodata_enabled and mark_rodata_ro() at all time
helps removing related #ifdefery in C files.

Signed-off-by: Christophe Leroy 
---
 include/linux/init.h |  4 
 init/main.c  | 21 +++--
 2 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/include/linux/init.h b/include/linux/init.h
index 01b52c9c7526..d2b47be38a07 100644
--- a/include/linux/init.h
+++ b/include/linux/init.h
@@ -168,12 +168,8 @@ extern initcall_entry_t __initcall_end[];
 
 extern struct file_system_type rootfs_fs_type;
 
-#if defined(CONFIG_STRICT_KERNEL_RWX) || defined(CONFIG_STRICT_MODULE_RWX)
 extern bool rodata_enabled;
-#endif
-#ifdef CONFIG_STRICT_KERNEL_RWX
 void mark_rodata_ro(void);
-#endif
 
 extern void (*late_time_init)(void);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..807df08c501f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1396,10 +1396,9 @@ static int __init set_debug_rodata(char *str)
 early_param("rodata", set_debug_rodata);
 #endif
 
-#ifdef CONFIG_STRICT_KERNEL_RWX
 static void mark_readonly(void)
 {
-   if (rodata_enabled) {
+   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) && rodata_enabled) {
/*
 * load_module() results in W+X mappings, which are cleaned
 * up with call_rcu().  Let's make sure that queued work is
@@ -1409,20 +1408,14 @@ static void mark_readonly(void)
rcu_barrier();
mark_rodata_ro();
rodata_test();
-   } else
+   } else if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)) {
pr_info("Kernel memory protection disabled.\n");
+   } else if (IS_ENABLED(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)) {
+   pr_warn("Kernel memory protection not selected by kernel 
config.\n");
+   } else {
+   pr_warn("This architecture does not have kernel memory 
protection.\n");
+   }
 }
-#elif defined(CONFIG_ARCH_HAS_STRICT_KERNEL_RWX)
-static inline void mark_readonly(void)
-{
-   pr_warn("Kernel memory protection not selected by kernel config.\n");
-}
-#else
-static inline void mark_readonly(void)
-{
-   pr_warn("This architecture does not have kernel memory protection.\n");
-}
-#endif
 
 void __weak free_initmem(void)
 {
-- 
2.41.0




Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-21 Thread Google
On Wed, 20 Dec 2023 11:15:25 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The synth_event_gen_test module can be built in, if someone wants to run
> the tests at boot up and not have to load them.
> 
> The synth_event_gen_test_init() function creates and enables the synthetic
> events and runs its tests.
> 
> The synth_event_gen_test_exit() disables the events it created and
> destroys the events.
> 
> If the module is builtin, the events are never disabled. The issue is, the
> events should be disable after the tests are run. This could be an issue
> if the rest of the boot up tests are enabled, as they expect the events to
> be in a known state before testing. That known state happens to be
> disabled.
> 
> When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
> a warning will trigger:
> 
>  Running tests on trace events:
>  Testing event create_synth_test:
>  Enabled event during self test!
>  [ cut here ]
>  WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 
> event_trace_self_tests+0x1c2/0x480
>  Modules linked in:
>  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
> 6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.2-debian-1.16.2-1 04/01/2014
>  RIP: 0010:event_trace_self_tests+0x1c2/0x480
>  Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 
> 0f 84 d6 fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df 
> e8 d3 3d 99 fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
>  RSP: :c901fdc0 EFLAGS: 00010246
>  RAX: 0029 RBX: 88810399ca80 RCX: 
>  RDX:  RSI: b9f19478 RDI: 88823c734e64
>  RBP: 88810399f300 R08:  R09: fbfff79eb32a
>  R10: bcf59957 R11: 0001 R12: 888104068090
>  R13: bc89f0a0 R14: bc8a0f08 R15: 0078
>  FS:  () GS:88823c70() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2:  CR3: 0001f6282001 CR4: 00170ef0
>  Call Trace:
>   
>   ? __warn+0xa5/0x200
>   ? event_trace_self_tests+0x1c2/0x480
>   ? report_bug+0x1f6/0x220
>   ? handle_bug+0x6f/0x90
>   ? exc_invalid_op+0x17/0x50
>   ? asm_exc_invalid_op+0x1a/0x20
>   ? tracer_preempt_on+0x78/0x1c0
>   ? event_trace_self_tests+0x1c2/0x480
>   ? __pfx_event_trace_self_tests_init+0x10/0x10
>   event_trace_self_tests_init+0x27/0xe0
>   do_one_initcall+0xd6/0x3c0
>   ? __pfx_do_one_initcall+0x10/0x10
>   ? kasan_set_track+0x25/0x30
>   ? rcu_is_watching+0x38/0x60
>   kernel_init_freeable+0x324/0x450
>   ? __pfx_kernel_init+0x10/0x10
>   kernel_init+0x1f/0x1e0
>   ? _raw_spin_unlock_irq+0x33/0x50
>   ret_from_fork+0x34/0x60
>   ? __pfx_kernel_init+0x10/0x10
>   ret_from_fork_asm+0x1b/0x30
>   
> 
> This is because the synth_event_gen_test_init() left the synthetic events
> that it created enabled. By having it disable them after testing, the
> other selftests will run fine.
> 

Ah, OK. It has to clean up the testing events.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Cc: sta...@vger.kernel.org
> Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module")
> Reported-by: Alexander Graf 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/synth_event_gen_test.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/trace/synth_event_gen_test.c 
> b/kernel/trace/synth_event_gen_test.c
> index 8dfe85499d4a..354c2117be43 100644
> --- a/kernel/trace/synth_event_gen_test.c
> +++ b/kernel/trace/synth_event_gen_test.c
> @@ -477,6 +477,17 @@ static int __init synth_event_gen_test_init(void)
>  
>   ret = test_trace_synth_event();
>   WARN_ON(ret);
> +
> + /* Disable when done */
> + trace_array_set_clr_event(gen_synth_test->tr,
> +   "synthetic",
> +   "gen_synth_test", false);
> + trace_array_set_clr_event(empty_synth_test->tr,
> +   "synthetic",
> +   "empty_synth_test", false);
> + trace_array_set_clr_event(create_synth_test->tr,
> +   "synthetic",
> +   "create_synth_test", false);
>   out:
>   return ret;
>  }
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] eventfs: Have event files and directories default to parent uid and gid

2023-12-21 Thread Google
On Wed, 20 Dec 2023 10:50:17 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Dongliang reported:
> 
>   I found that in the latest version, the nodes of tracefs have been
>   changed to dynamically created.
> 
>   This has caused me to encounter a problem where the gid I specified in
>   the mounting parameters cannot apply to all files, as in the following
>   situation:
> 
>   /data/tmp/events # mount | grep tracefs
>   tracefs on /data/tmp type tracefs (rw,seclabel,relatime,gid=3012)
> 
>   gid 3012 = readtracefs
> 
>   /data/tmp # ls -lh
>   total 0
>   -r--r-   1 root readtracefs 0 1970-01-01 08:00 README
>   -r--r-   1 root readtracefs 0 1970-01-01 08:00 available_events
> 
>   ums9621_1h10:/data/tmp/events # ls -lh
>   total 0
>   drwxr-xr-x 2 root root 0 2023-12-19 00:56 alarmtimer
>   drwxr-xr-x 2 root root 0 2023-12-19 00:56 asoc
> 
>   It will prevent certain applications from accessing tracefs properly, I
>   try to avoid this issue by making the following modifications.
> 
> To fix this, have the files created default to taking the ownership of
> the parent dentry unless the ownership was previously set by the user.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/1703063706-30539-1-git-send-email-dongliang@unisoc.com/
> 

This looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Thank you!

> Reported-by: Dongliang Cui 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 43e237864a42..2ccc849a5bda 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -148,7 +148,8 @@ static const struct file_operations 
> eventfs_file_operations = {
>   .release= eventfs_release,
>  };
>  
> -static void update_inode_attr(struct inode *inode, struct eventfs_attr 
> *attr, umode_t mode)
> +static void update_inode_attr(struct dentry *dentry, struct inode *inode,
> +   struct eventfs_attr *attr, umode_t mode)
>  {
>   if (!attr) {
>   inode->i_mode = mode;
> @@ -162,9 +163,13 @@ static void update_inode_attr(struct inode *inode, 
> struct eventfs_attr *attr, um
>  
>   if (attr->mode & EVENTFS_SAVE_UID)
>   inode->i_uid = attr->uid;
> + else
> + inode->i_uid = d_inode(dentry->d_parent)->i_uid;
>  
>   if (attr->mode & EVENTFS_SAVE_GID)
>   inode->i_gid = attr->gid;
> + else
> + inode->i_gid = d_inode(dentry->d_parent)->i_gid;
>  }
>  
>  /**
> @@ -206,7 +211,7 @@ static struct dentry *create_file(const char *name, 
> umode_t mode,
>   return eventfs_failed_creating(dentry);
>  
>   /* If the user updated the directory's attributes, use them */
> - update_inode_attr(inode, attr, mode);
> + update_inode_attr(dentry, inode, attr, mode);
>  
>   inode->i_op = _file_inode_operations;
>   inode->i_fop = fop;
> @@ -242,7 +247,8 @@ static struct dentry *create_dir(struct eventfs_inode 
> *ei, struct dentry *parent
>   return eventfs_failed_creating(dentry);
>  
>   /* If the user updated the directory's attributes, use them */
> - update_inode_attr(inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | 
> S_IXUGO);
> + update_inode_attr(dentry, inode, >attr,
> +   S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
>  
>   inode->i_op = _root_dir_inode_operations;
>   inode->i_fop = _file_operations;
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google)