Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-10-04 Thread Mike Rapoport
On Wed, Oct 04, 2023 at 12:29:36AM +, Edgecombe, Rick P wrote:
> On Mon, 2023-09-18 at 10:29 +0300, Mike Rapoport wrote:
> > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > index 5f71a0cf4399..9d37375e2f05 100644
> > --- a/arch/x86/kernel/module.c
> > +++ b/arch/x86/kernel/module.c
> >
> > -void *module_alloc(unsigned long size)
> > +struct execmem_params __init *execmem_arch_params(void)
> >  {
> > -   gfp_t gfp_mask = GFP_KERNEL;
> > -   void *p;
> > -
> > -   if (PAGE_ALIGN(size) > MODULES_LEN)
> > -   return NULL;
> > +   unsigned long module_load_offset = 0;
> > +   unsigned long start;
> >  
> > -   p = __vmalloc_node_range(size, MODULE_ALIGN,
> > -    MODULES_VADDR +
> > get_module_load_offset(),
> > -    MODULES_END, gfp_mask, PAGE_KERNEL,
> > -    VM_FLUSH_RESET_PERMS |
> > VM_DEFER_KMEMLEAK,
> > -    NUMA_NO_NODE,
> > __builtin_return_address(0));
> > +   if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && kaslr_enabled())
> > +   module_load_offset =
> > +   get_random_u32_inclusive(1, 1024) *
> > PAGE_SIZE;
> 
> Minor:
> I think you can skip the IS_ENABLED(CONFIG_RANDOMIZE_BASE) part because
> CONFIG_RANDOMIZE_MEMORY depends on CONFIG_RANDOMIZE_BASE (which is
> checked in kaslr_enabled()).

Thanks, I'll look into it.
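[Editor's illustration] The simplification Rick suggests can be sketched in user space as follows. This is only a sketch: kaslr_enabled() and get_random_u32_inclusive() are stubs standing in for the kernel functions, and PAGE_SIZE plus the 1..1024 page range are taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* Stub: in the kernel, kaslr_enabled() already checks
 * CONFIG_RANDOMIZE_BASE, which is why the IS_ENABLED() test is redundant. */
static bool kaslr_enabled(void)
{
	return true;
}

/* Stub for the kernel's get_random_u32_inclusive(lo, hi). */
static unsigned long get_random_u32_inclusive(unsigned long lo, unsigned long hi)
{
	return lo + (unsigned long)rand() % (hi - lo + 1);
}

/* With the redundant IS_ENABLED() dropped, the check reduces to
 * kaslr_enabled() alone. */
static unsigned long module_load_offset(void)
{
	if (kaslr_enabled())
		return get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
	return 0;
}
```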

-- 
Sincerely yours,
Mike.



Re: [PATCH v3 03/13] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2023-10-04 Thread Mike Rapoport
On Wed, Oct 04, 2023 at 03:39:26PM +, Edgecombe, Rick P wrote:
> On Tue, 2023-10-03 at 17:29 -0700, Rick Edgecombe wrote:
> > It seems a bit weird to copy all of this. Is it trying to be faster or
> > something?
> > 
> > Couldn't it just check r->start in execmem_text/data_alloc() path and
> > switch to EXECMEM_DEFAULT if needed then? The execmem_range_is_data()
> > part that comes later could be added to the logic there too. So this
> > seems like unnecessary complexity to me or I don't see the reason.
> 
> I guess this is a bad idea because if you have the full size array
> sitting around anyway you might as well use it and reduce the
> exec_mem_alloc() logic.

That was the idea, indeed. :)

> Just looking at it from the x86 side (and
> similar) though, where there is actually only one execmem_range,
> building this whole array with identical data seems weird.

Right, most architectures have only one range, but to support all variants
that we have, execmem has to maintain the whole array.
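[Editor's illustration] The trade-off being discussed can be sketched with a toy model. The type and field names below are assumptions (only EXECMEM_DEFAULT and execmem_range appear in the thread, and the real layout differs): with a fully populated array, the allocation path is a plain index, while the alternative Rick floated needs a fallback branch to EXECMEM_DEFAULT for types that were never filled in.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the execmem types under discussion;
 * these names and fields are illustrative, not the actual patch. */
enum execmem_type {
	EXECMEM_DEFAULT,
	EXECMEM_MODULE_DATA,
	EXECMEM_KPROBES,
	EXECMEM_TYPE_MAX,
};

struct execmem_range {
	unsigned long start;
	unsigned long end;
};

static struct execmem_range ranges[EXECMEM_TYPE_MAX];

/* With a fully populated array, lookup is a plain index ... */
static struct execmem_range *execmem_get_range(enum execmem_type type)
{
	return &ranges[type];
}

/* ... whereas a sparse array needs a fallback branch to EXECMEM_DEFAULT
 * whenever a given type was never filled in. */
static struct execmem_range *execmem_get_range_fallback(enum execmem_type type)
{
	if (!ranges[type].start)
		type = EXECMEM_DEFAULT;
	return &ranges[type];
}
```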

-- 
Sincerely yours,
Mike.



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Alan Maguire
On 04/10/2023 22:43, Steven Rostedt wrote:
> On Wed, 4 Oct 2023 22:35:07 +0100
> Alan Maguire  wrote:
> 
>> One thing we've heard from some embedded folks [1] is that having
>> kernel BTF loadable as a separate module (rather than embedded in
>> vmlinux) would help, as there are size limits on vmlinux that they can
>> work around by having modules on a different partition. We're hoping
>> to get that working soon. I was wondering if you see other issues around
>> BTF adoption for embedded systems that we could put on the to-do list?
>> Not necessarily for this particular use-case (since there are
>> complications with trace data as you describe), but just trying to make
>> sure we can remove barriers to BTF adoption where possible.
> 
> I wonder how easy it is to create subsets of BTF. For one thing, in the
> future we want to be able to trace the arguments of all functions. That is,
> tracing all functions at the same time (function tracer) and getting the
> arguments within the trace.
> 
> This would only require information about functions and their arguments,
> which would be very useful. Is BTF easy to break apart? That is, just
> generate the information needed for function arguments?
>

There has been a fair bit of effort around this from the userspace side;
the BTF gen efforts were focused around applications carrying the
minimum BTF for their needs, so just the structures needed by the
particular BPF programs rather than the full set of vmlinux structures
for example [1].

Parsing BTF in-kernel to pull out the BTF functions (BTF_KIND_FUNC),
their prototypes (BTF_KIND_FUNC_PROTO) and all associated parameters
would be pretty straightforward I think, especially if you don't need
the structures that are passed via pointers. So if you're starting with
the full BTF, creating a subset for use in tracing would be reasonably
straightforward. My personal preference would always be to have the
full BTF where possible, but if that wasn't feasible on some systems
we'd need to add some options to pahole/libbpf to support such trimming
during the DWARF->BTF translation process.
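[Editor's illustration] As a rough sketch of that kind of trimming (a toy model only; real BTF is a binary section parsed with libbpf via btf__load_vmlinux_btf(), btf__type_by_id(), and friends), filtering type records by kind is the core of it:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for BTF type records.  The kind values below only mirror
 * BTF_KIND_FUNC / BTF_KIND_FUNC_PROTO conceptually. */
enum toy_kind { TOY_STRUCT, TOY_FUNC, TOY_FUNC_PROTO, TOY_PTR };

struct toy_type {
	enum toy_kind kind;
	const char *name;
};

/* Keep only the records a function-argument tracer needs: the FUNCs and
 * their prototypes, skipping struct layouts that are only reached
 * through pointers. */
static size_t trim_for_func_args(const struct toy_type *in, size_t n,
				 const struct toy_type **out)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++)
		if (in[i].kind == TOY_FUNC || in[i].kind == TOY_FUNC_PROTO)
			out[kept++] = &in[i];
	return kept;
}
```

The real work, as noted above, would live in pahole/libbpf during the DWARF->BTF step, where references between records also have to be rewritten.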

Alan

[1] https://lore.kernel.org/bpf/20220209222646.348365-7-mauri...@kinvolk.io/

> Note, pretty much no functions pass structures by value, and this
> would not need to know the contents of a pointer to a structure. This would
> mean that structure layout information is not needed.
> 
> -- Steve
> 



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 22:35:07 +0100
Alan Maguire  wrote:

> One thing we've heard from some embedded folks [1] is that having
> kernel BTF loadable as a separate module (rather than embedded in
> vmlinux) would help, as there are size limits on vmlinux that they can
> work around by having modules on a different partition. We're hoping
> to get that working soon. I was wondering if you see other issues around
> BTF adoption for embedded systems that we could put on the to-do list?
> Not necessarily for this particular use-case (since there are
> complications with trace data as you describe), but just trying to make
> sure we can remove barriers to BTF adoption where possible.

I wonder how easy it is to create subsets of BTF. For one thing, in the
future we want to be able to trace the arguments of all functions. That is,
tracing all functions at the same time (function tracer) and getting the
arguments within the trace.

This would only require information about functions and their arguments,
which would be very useful. Is BTF easy to break apart? That is, just
generate the information needed for function arguments?

Note, pretty much no functions pass structures by value, and this
would not need to know the contents of a pointer to a structure. This would
mean that structure layout information is not needed.

-- Steve



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Alan Maguire
On 04/10/2023 18:29, Steven Rostedt wrote:
> On Wed, 4 Oct 2023 09:54:31 -0700
> Jakub Kicinski  wrote:
> 
>> On Wed, 4 Oct 2023 12:35:24 -0400 Steven Rostedt wrote:
>>>> Potentially naive question - the trace point holds enum skb_drop_reason.
>>>> The user space can get the names from BTF. Can we not teach user space
>>>> to generically look up names of enums in BTF?
>>>
>>> That puts a hard requirement to include BTF in builds where it was not
>>> needed before. I really do not want to build with BTF just to get access to
>>> these symbols. And since this is used by the embedded world, and BTF is
>>> extremely bloated, the short answer is "No".  
>>
>> Dunno. BTF is there most of the time. It could make the life of
>> the majority of users far more pleasant.
> 
> BTF isn't there for a lot of developers working in embedded who use this
> code. Most of my users that I deal with have minimal environments, so BTF is a
> showstopper.

One thing we've heard from some embedded folks [1] is that having
kernel BTF loadable as a separate module (rather than embedded in
vmlinux) would help, as there are size limits on vmlinux that they can
work around by having modules on a different partition. We're hoping
to get that working soon. I was wondering if you see other issues around
BTF adoption for embedded systems that we could put on the to-do list?
Not necessarily for this particular use-case (since there are
complications with trace data as you describe), but just trying to make
sure we can remove barriers to BTF adoption where possible.

Thanks!

Alan

[1]
https://lore.kernel.org/bpf/CAHBbfcUkr6fTm2X9GNsFNqV75fTG=abqxfx_8ayk+4hk7he...@mail.gmail.com/

> 
>>
>> I hope we can at least agree that the current methods of generating 
>> the string arrays at C level are... aesthetically displeasing.
> 
> I don't know, I kinda like it ;-)
> 
> -- Steve
> 



Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages

2023-10-04 Thread Huang, Kai
On Wed, 2023-10-04 at 10:03 -0500, Haitao Huang wrote:
> On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai  wrote:
> 
> > On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:
> > > > 
> > > > Btw, probably a dumb question:
> > > > 
> > > > Theoretically if you only need to find a victim enclave you don't need
> > > > to put VA pages to the unreclaimable list, because those VA pages will
> > > > be freed anyway when the enclave is killed.  So keeping VA pages in the
> > > > list is for accounting all the pages that the cgroup is having?
> > > 
> > > Yes, basically tracking them in cgroups as they are allocated.
> > > 
> > > VAs and SECS may also come and go as swapping/unswapping happens. But
> > > if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
> > > have to reclaim VAs/SECS in the same cgroup starting from the front of
> > > the LRU list. To reclaim a VA/SECS page, it identifies the enclave from
> > > the owner of the VA/SECS page and kills it, as killing the enclave is
> > > the only way to reclaim VA/SECS pages.
> > 
> > To kill an enclave you just need to track SECS in the unreclaimable list.
> > Only when you want to account for the total EPC pages via some list do you
> > _probably_ need to track VA as well.  But I am not quite sure about this
> > either.
> 
> There is a case where even SECS is paged out for an enclave with all  
> reclaimables out. 
> 

Yes.  But this essentially means these enclaves are not active, thus shouldn't
be the victim of OOM?

> So the cgroup needs to track each page used by an enclave
> and kill the enclave when the cgroup needs to lower usage by evicting a VA
> or SECS page.

Let's discuss more on tracking SECS on unreclaimable list only.

Could we assume that when the OOM wants to pick up a victim to serve the new
enclave, there must be at least one other *active* enclave which still has the
SECS page in EPC?

If yes, that enclave will be selected as victim.

If not, then no other enclave will be selected as victim.  Instead, only the new
enclave which is requesting more EPC will be selected, because its SECS is on
the unreclaimable list.

Somehow this is unacceptable, thus we need to track VA pages too in order to
kill other inactive enclaves?
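[Editor's illustration] The scan being described can be sketched with a toy model (all names illustrative, not kernel code): if only SECS pages sit on the unreclaimable list, the victim search can only see enclaves whose SECS is still resident in EPC, and otherwise degenerates to the requester itself, which is the case flagged as unacceptable above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the design question; field names are illustrative. */
struct toy_enclave {
	int id;
	int secs_in_epc;	/* SECS page still on the unreclaimable list */
};

static struct toy_enclave *
pick_victim(struct toy_enclave **list, size_t n, struct toy_enclave *requester)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] != requester && list[i]->secs_in_epc)
			return list[i];
	/* No other active enclave visible: the scan degenerates to the
	 * requester itself -- the motivation for also tracking VA pages. */
	return requester;
}
```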

> There were some discussion on paging out VAs without killing enclaves but  
> it'd be complicated and not implemented yet.

No we don't involve swapping VA pages now.  It's a separate topic.

> 
> BTW, I need to clarify tracking pages, which is done by LRUs, vs usage
> accounting, which is done by charge/uncharge to misc. To me, tracking is
> for reclaiming, not accounting. Also, vEPCs are not tracked at all, but
> they are accounted for.

I'll review the rest patches.  Thanks.


Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states

2023-10-04 Thread Huang, Kai
On Wed, 2023-10-04 at 10:24 -0500, Haitao Huang wrote:
> On Tue, 03 Oct 2023 15:03:48 -0500, Huang, Kai  wrote:
> 
> > On Mon, 2023-10-02 at 23:49 -0500, Haitao Huang wrote:
> > > On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai   
> > > wrote:
> > > 
> > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > Use the lower 3 bits in the flags field of sgx_epc_page struct to
> > > > > track EPC states in its life cycle and define an enum for possible
> > > > > states. More state(s) will be added later.
> > > > 
> > > > This patch does more than what the changelog claims to do.  AFAICT it
> > > > does
> > > > below:
> > > > 
> > > >  1) Use the lower 3 bits to track EPC page status
> > > >  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RERCLAIMABLE
> > > >  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
> > > >  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
> > > > 
> > > > The changelog only says 1) IIUC.
> > > > 
> > > I don't quite get why you would view 3) as a separate item from 1).
> > 
> > 1) is about using some method to track EPC page status, 3) is adding a  
> > new
> > state.
> > 
> > Why cannot they be separated?
> > 
> > > In my view, 4) is not done as long as there is no separate list to
> > > track it.
> > 
> > You are literally doing below:
> > 
> > @@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl,  
> > struct
> > sgx_secs *secs)
> > encl->attributes = secs->attributes;
> > encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
> > +   sgx_record_epc_page(encl->secs.epc_page,
> > +   SGX_EPC_PAGE_UNRECLAIMABLE);
> > +
> > 
> > Which obviously is tracking SECS as unreclaimable page here.
> > 
> > The only thing you are not doing now is putting it on the actual list,
> > which you introduced in a later patch.
> > 
> > But why not just doing them together?
> > 
> > 
> I see where the problem is now.  Initially these states were bit masks, so
> UNTRACKED and UNRECLAIMABLE are both unmasked (set to zero). I'll change
> these "record" calls to use UNTRACKED instead, and later replace it with
> UNRECLAIMABLE when the pages are actually added to the list. So the
> UNRECLAIMABLE state can also be delayed until the patch that adds the list.

I am not sure whether I am following, but could we just delay introducing the
"untracked" or "unreclaimable" until the list is added?

Why do we need to call sgx_record_epc_page() for SECS and VA pages in _this_
patch?

Reading again, I _think_ the reason why you added these new states is that you
want to justify using the low 3 bits as EPC page states, i.e., the code below ...

+#define SGX_EPC_PAGE_STATE_MASK GENMASK(2, 0)

But for now we only have two valid states:

- SGX_EPC_PAGE_IS_FREE
- SGX_EPC_PAGE_RECLAIMER_TRACKED

Thus you added two more states: NOT_TRACKED/UNRECLAIMABLE.  And more
confusingly, you added calling sgx_record_epc_page() for SECS and VA pages in
this patch to try to actually use these new states, while the changelog says:

Use the lower 3 bits in the flags field of sgx_epc_page struct to
track EPC states in its life cycle and define an enum for possible
states. More state(s) will be added later.

... which doesn't mention any of above.

But this doesn't stand either, because you only need 2 bits, not 3, for the
four states.  So I don't see how adding the new states could help here.
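[Editor's illustration] The two-bit encoding being argued for can be sketched like this (names and values are illustrative, not the actual patch; GENMASK(1, 0) in the kernel is simply 0x3):

```c
#include <assert.h>

/* Four states fit in two low bits of the flags word. */
#define EPC_PAGE_STATE_MASK	0x3UL

enum epc_state {
	EPC_PAGE_IS_FREE		= 0,
	EPC_PAGE_NOT_TRACKED		= 1,
	EPC_PAGE_RECLAIMER_TRACKED	= 2,
	EPC_PAGE_UNRECLAIMABLE		= 3,
};

/* Replace the state bits while preserving the other flag bits. */
static unsigned long set_state(unsigned long flags, enum epc_state s)
{
	return (flags & ~EPC_PAGE_STATE_MASK) | s;
}

static enum epc_state get_state(unsigned long flags)
{
	return flags & EPC_PAGE_STATE_MASK;
}
```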

So I would suggest two options:

1) 

In this patch, you only change the way to track EPC states to reflect your
changelog of this patch (maybe you can add NOT_TRACKED, but I am not sure /
don't care, just give a justification if you do).

And then you have a patch to introduce the new unreclaimable list, the new EPC
state, and call sgx_record_epc_page() *AND* sgx_drop_epc_page() for SECS/VA
pages.  The patch could be titled such as:

x86/sgx: Store SECS/VA pages in the unreclaimable list

2)

You do the opposite to option 1): Introduce the patch 

x86/sgx: Store SECS/VA pages in the unreclaimable list

... first, and then convert the EPC states to using the lower 3 bits (actually 2
bits are enough; you can extend to 3 bits in a later patch when needed).

Does the above make more sense?



[PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

2023-10-04 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of having a descriptor for every file represented in the eventfs
directory, only have the directory itself represented. Change the API to
send in a list of entries that represent all the files in the directory
(but not other directories). The entry list contains a name and a callback
function that will be used to create the files when they are accessed.

struct eventfs_inode *eventfs_create_events_dir(const char *name,
						struct dentry *parent,
						const struct eventfs_entry *entries,
						int size, void *data);

is used for the top level eventfs directory, and returns an eventfs_inode
that will be used by:

struct eventfs_inode *eventfs_create_dir(const char *name,
					 struct eventfs_inode *parent,
					 const struct eventfs_entry *entries,
					 int size, void *data);

where both of the above take an array of struct eventfs_entry entries for
every file that is in the directory.

The entries are defined by:

typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);

struct eventfs_entry {
	const char		*name;
	eventfs_callback	callback;
};

Where the name is the name of the file and the callback gets called when
the file is being created. The callback passes in the name (in case the
same callback is used for multiple files), a pointer to the mode, data and
fops. The data will be pointing to the data that was passed in
eventfs_create_dir() or eventfs_create_events_dir() but may be overridden
to point to something else, as it will be used to point to the
inode->i_private that is created. The information passed back from the
callback is used to create the dentry/inode.

If the callback fills the data and the file should be created, it must
return a positive number. On zero or negative, the file is ignored.

This logic may also be used as a prototype to convert entire pseudo file
systems into just-in-time allocation.
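[Editor's illustration] A minimal user-space sketch of how an entry array and its callback fit together. The kernel types are stubbed out here so the shape compiles standalone; the "enable" file name, mode value, and fops object are illustrative, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins so the API shape compiles outside the kernel. */
typedef unsigned short umode_t;
struct file_operations { int dummy; };

typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
				const struct file_operations **fops);

struct eventfs_entry {
	const char		*name;
	eventfs_callback	callback;
};

static const struct file_operations enable_fops;

/* Example callback: fill in mode/fops when the "enable" file is being
 * created.  Returning a positive number creates the file; zero or
 * negative ignores it, as described above. */
static int events_callback(const char *name, umode_t *mode, void **data,
			   const struct file_operations **fops)
{
	if (strcmp(name, "enable") == 0) {
		*mode = 0644;
		*fops = &enable_fops;
		return 1;
	}
	return 0;
}

static const struct eventfs_entry events_entries[] = {
	{ .name = "enable", .callback = events_callback },
};
```

In the real API, this array would be handed to eventfs_create_dir() or eventfs_create_events_dir(), and the callback would only run when a file is first accessed.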

The "show_events_dentry" file has been updated to show the directories,
and any files they have.

With just the eventfs_file allocations:

 Before after deltas for meminfo (in kB):

   MemFree: -14360
   MemAvailable:-14260
   Buffers: 40
   Cached:  24
   Active:  44
   Inactive:48
   Inactive(anon):  28
   Active(file):44
   Inactive(file):  20
   Dirty:   -4
   AnonPages:   28
   Mapped:  4
   KReclaimable:132
   Slab:1604
   SReclaimable:132
   SUnreclaim:  1472
   Committed_AS:12

 Before after deltas for slabinfo:

   <slab name>: <count> [ * <object size> = <total bytes> ]

   ext4_inode_cache 27  [* 1184 = 31968 ]
   extent_status102 [*   40 = 4080 ]
   tracefs_inode_cache  144 [*  656 = 94464 ]
   buffer_head  39  [*  104 = 4056 ]
   shmem_inode_cache49  [*  800 = 39200 ]
   filp -53 [*  256 = -13568 ]
   dentry   251 [*  192 = 48192 ]
   lsm_file_cache   277 [*   32 = 8864 ]
   vm_area_struct   -14 [*  184 = -2576 ]
   trace_event_file 1748[*   88 = 153824 ]
   kmalloc-1k   35  [* 1024 = 35840 ]
   kmalloc-256  49  [*  256 = 12544 ]
   kmalloc-192  -28 [*  192 = -5376 ]
   kmalloc-128  -30 [*  128 = -3840 ]
   kmalloc-96   10581   [*   96 = 1015776 ]
   kmalloc-64   3056[*   64 = 195584 ]
   kmalloc-32   1291[*   32 = 41312 ]
   kmalloc-16   2310[*   16 = 36960 ]
   kmalloc-8 9216[*8 = 73728 ]

 Free memory dropped by 14,360 kB
 Available memory dropped by 14,260 kB
 Total slab additions in size: 1,771,032 bytes

With this change:

 Before after deltas for meminfo (in kB):

   MemFree: -12084
   MemAvailable:-11976
   Buffers: 32
   Cached:  32
   Active:  72
   Inactive:168
   Inactive(anon):  176
   Active(file):72
   Inactive(file):  -8
   Dirty:   24
   AnonPages:   196
   Mapped:  8
   KReclaimable:148
   Slab:836
   SReclaimable:148
   SUnreclaim:  688
   Committed_AS:324

 Before after deltas for slabinfo:

   <slab name>: <count> [ * <object size> = <total bytes> ]

   tracefs_inode_cache  144 [* 656 = 94464 ]
   shmem_inode_cache-23 [* 800 = -18400 ]
   filp -92 [* 256 = -23552 ]
   dentry   179 [* 192 = 34368 ]
   lsm_file_cache   -3   

Re: [PATCH] net: appletalk: remove cops support

2023-10-04 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Wed, 27 Sep 2023 11:00:30 +0200 you wrote:
> The COPS Appletalk support is very old, never said to actually work
> properly, and the firmware code for the devices are under a very suspect
> license.  Remove it all to clear up the license issue, if it is still
> needed and actually used by anyone, we can add it back later once the
> license is cleared up.
> 
> Reported-by: Prarit Bhargava 
> Cc: Christoph Hellwig 
> Cc: Vitaly Kuznetsov 
> Cc: jsch...@samba.org
> Signed-off-by: Greg Kroah-Hartman 
> 
> [...]

Here is the summary with links:
  - net: appletalk: remove cops support
https://git.kernel.org/netdev/net-next/c/00f3696f7555

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [PATCH v1 1/9] ACPI: bus: Make notify wrappers more generic

2023-10-04 Thread Rafael J. Wysocki
On Mon, Sep 25, 2023 at 6:31 PM Michal Wilczynski
 wrote:
>
> acpi_dev_install_notify_handler() and acpi_dev_remove_notify_handler()
> are wrappers around ACPICA installers. They are meant to save some
> duplicated code from drivers. However, as we're moving towards drivers
> operating on platform_device, they become a bit inconvenient to use, as
> inside the driver code we mostly want to use the driver data of the
> platform device instead of the ACPI device.

That's fair enough, but ->

> Make notify handlers installer wrappers more generic, while still
> saving some code that would be duplicated otherwise.
>
> Reviewed-by: Andy Shevchenko 
> Signed-off-by: Michal Wilczynski 
> ---
>
> Notes:
> So one solution could be to just replace acpi_device with
> platform_device as an argument in those functions. However, I don't
> believe this is a correct solution, as it is very often the case that
> drivers declare their own private structures, which get allocated during
> the .probe() callback and become the heart of the driver. When drivers
> do that, it makes much more sense to just pass the private structure
> to the notify handler instead of forcing the user to dance with the
> platform_device or acpi_device.
>
>  drivers/acpi/ac.c |  6 +++---
>  drivers/acpi/acpi_video.c |  6 +++---
>  drivers/acpi/battery.c|  6 +++---
>  drivers/acpi/bus.c| 14 ++
>  drivers/acpi/hed.c|  6 +++---
>  drivers/acpi/nfit/core.c  |  6 +++---
>  drivers/acpi/thermal.c|  6 +++---
>  include/acpi/acpi_bus.h   |  9 -
>  8 files changed, 28 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/acpi/ac.c b/drivers/acpi/ac.c
> index 225dc6818751..0b245f9f7ec8 100644
> --- a/drivers/acpi/ac.c
> +++ b/drivers/acpi/ac.c
> @@ -256,8 +256,8 @@ static int acpi_ac_add(struct acpi_device *device)
> ac->battery_nb.notifier_call = acpi_ac_battery_notify;
> register_acpi_notifier(&ac->battery_nb);
>
> -   result = acpi_dev_install_notify_handler(device, ACPI_ALL_NOTIFY,
> -acpi_ac_notify);
> +   result = acpi_dev_install_notify_handler(device->handle, 
> ACPI_ALL_NOTIFY,
> +acpi_ac_notify, device);
> if (result)
> goto err_unregister;
>
> @@ -306,7 +306,7 @@ static void acpi_ac_remove(struct acpi_device *device)
>
> ac = acpi_driver_data(device);
>
> -   acpi_dev_remove_notify_handler(device, ACPI_ALL_NOTIFY,
> +   acpi_dev_remove_notify_handler(device->handle, ACPI_ALL_NOTIFY,
>acpi_ac_notify);
> power_supply_unregister(ac->charger);
> unregister_acpi_notifier(&ac->battery_nb);
> diff --git a/drivers/acpi/acpi_video.c b/drivers/acpi/acpi_video.c
> index 948e31f7ce6e..025c17890127 100644
> --- a/drivers/acpi/acpi_video.c
> +++ b/drivers/acpi/acpi_video.c
> @@ -2059,8 +2059,8 @@ static int acpi_video_bus_add(struct acpi_device 
> *device)
>
> acpi_video_bus_add_notify_handler(video);
>
> -   error = acpi_dev_install_notify_handler(device, ACPI_DEVICE_NOTIFY,
> -   acpi_video_bus_notify);
> +   error = acpi_dev_install_notify_handler(device->handle, 
> ACPI_DEVICE_NOTIFY,
> +   acpi_video_bus_notify, 
> device);
> if (error)
> goto err_remove;
>
> @@ -2092,7 +2092,7 @@ static void acpi_video_bus_remove(struct acpi_device 
> *device)
>
> video = acpi_driver_data(device);
>
> -   acpi_dev_remove_notify_handler(device, ACPI_DEVICE_NOTIFY,
> +   acpi_dev_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
>acpi_video_bus_notify);
>
> mutex_lock(&video_list_lock);
> diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
> index 969bf81e8d54..45dae32a8646 100644
> --- a/drivers/acpi/battery.c
> +++ b/drivers/acpi/battery.c
> @@ -1213,8 +1213,8 @@ static int acpi_battery_add(struct acpi_device *device)
>
> device_init_wakeup(&device->dev, 1);
>
> -   result = acpi_dev_install_notify_handler(device, ACPI_ALL_NOTIFY,
> -acpi_battery_notify);
> +   result = acpi_dev_install_notify_handler(device->handle, 
> ACPI_ALL_NOTIFY,
> +acpi_battery_notify, device);
> if (result)
> goto fail_pm;
>
> @@ -1241,7 +1241,7 @@ static void acpi_battery_remove(struct acpi_device 
> *device)
>
> battery = acpi_driver_data(device);
>
> -   acpi_dev_remove_notify_handler(device, ACPI_ALL_NOTIFY,
> +   acpi_dev_remove_notify_handler(device->handle, ACPI_ALL_NOTIFY,
>acpi_battery_notify);
>
> device_init_wakeup(&device->dev, 0);
> diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
> index f41dda2d3493..479fe888d629 100644
> --- 

[PATCH v2 72/89] tracefs: convert to new timestamp accessors

2023-10-04 Thread Jeff Layton
Convert to using the new inode timestamp accessor functions.

Signed-off-by: Jeff Layton 
---
 fs/tracefs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 891653ba9cf3..429603d865a9 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -152,7 +152,7 @@ struct inode *tracefs_get_inode(struct super_block *sb)
struct inode *inode = new_inode(sb);
if (inode) {
inode->i_ino = get_next_ino();
-   inode->i_atime = inode->i_mtime = 
inode_set_ctime_current(inode);
+   simple_inode_init_ts(inode);
}
return inode;
 }
-- 
2.41.0




Re: [PATCH] net: appletalk: remove cops support

2023-10-04 Thread Jakub Kicinski
On Wed, 27 Sep 2023 11:00:30 +0200 Greg Kroah-Hartman wrote:
> The COPS Appletalk support is very old, never said to actually work
> properly, and the firmware code for the devices are under a very suspect
> license.  Remove it all to clear up the license issue, if it is still
> needed and actually used by anyone, we can add it back later once the
> license is cleared up.

Nice, Doug and Arnd also mentioned this in the past so let me add
them to the CC as I apply this...


Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Steven Rostedt
On Wed, 04 Oct 2023 20:38:46 +0200
Johannes Berg  wrote:

> On Wed, 2023-10-04 at 09:22 -0700, Jakub Kicinski wrote:
> > 
> > Potentially naive question - the trace point holds enum skb_drop_reason.
> > The user space can get the names from BTF. Can we not teach user space
> > to generically look up names of enums in BTF?  
> 
> I'll note that, unrelated to the discussion about whether or not we
> could use BTF, we couldn't do it in this case anyway since the full set of
> drop reasons isn't captured in enum skb_drop_reason, which contains only
> the core ones, and now other subsystems are adding their own somewhat
> dynamically later.

Another issue with using BTF is that the BTF would need to be saved in the
trace.dat or perf.data file, as many times the trace data is moved off to
another machine for offline analysis. And using the vmlinux would not be
useful, because often you have multiple trace files for various versions of
a build, and that would require mapping which vmlinux/BTF file goes with
which trace data.

Right now, the conversions can easily be saved in the trace file directly.
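[Editor's illustration] What "saved in the trace file" amounts to is a plain value-to-name table that travels with the data, so neither BTF nor a matching vmlinux is needed at analysis time. The sketch below is illustrative; the names echo a few skb drop reasons, but the values are not the exact kernel numbering.

```c
#include <assert.h>
#include <string.h>

/* A value->name conversion table, as could be emitted as text lines
 * into the trace file's metadata. */
struct eval_map {
	const char *name;
	unsigned long val;
};

static const struct eval_map drop_reasons[] = {
	{ "NOT_SPECIFIED",	0 },
	{ "NO_SOCKET",		1 },
	{ "PKT_TOO_SMALL",	2 },
};

static const char *drop_reason_name(unsigned long val)
{
	for (size_t i = 0; i < sizeof(drop_reasons) / sizeof(drop_reasons[0]); i++)
		if (drop_reasons[i].val == val)
			return drop_reasons[i].name;
	return "UNKNOWN";
}
```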

-- Steve



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Johannes Berg
On Wed, 2023-10-04 at 09:22 -0700, Jakub Kicinski wrote:
> 
> Potentially naive question - the trace point holds enum skb_drop_reason.
> The user space can get the names from BTF. Can we not teach user space
> to generically look up names of enums in BTF?

I'll note that, unrelated to the discussion about whether or not we
could use BTF, we couldn't do it in this case anyway since the full set of
drop reasons isn't captured in enum skb_drop_reason, which contains only
the core ones, and now other subsystems are adding their own somewhat
dynamically later.

johannes




Re: [PATCH] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2023-10-04 Thread Mathieu Poirier
On Wed, 4 Oct 2023 at 07:51, Apurva Nandan  wrote:
>
> Hi Mathieu,
>
> On 11/09/23 22:15, Mathieu Poirier wrote:
> > Hi Apurva,
> >
> > On Wed, Sep 06, 2023 at 06:17:56PM +0530, Apurva Nandan wrote:
>> PSC controller has a limitation that it can only power up the second core
> >> when the first core is in ON state. Power-state for core0 should be equal
> >> to or higher than core1, else the kernel is seen hanging during rproc
> >> loading.
> >>
> >> Make the powering up of cores sequential, by waiting for the current core
>> to power up before proceeding to the next core, with a timeout of 2 sec.
> >> Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
> >> for the current core to be released from reset before proceeding with the
> >> next core.
> >>
> >> Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F 
> >> subsystem")
> >>
> >> Signed-off-by: Apurva Nandan 
> >> ---
> >>
> >>   kpv report: 
> >> https://gist.githubusercontent.com/apurvanandan1997/feb3b304121c265b7827be43752b7ae8/raw
> >>
> >>   drivers/remoteproc/ti_k3_r5_remoteproc.c | 23 +++
> >>   1 file changed, 23 insertions(+)
> >>
> >> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
> >> b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> >> index ad3415a3851b..ba5e503f7c9c 100644
> >> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> >> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> >> @@ -103,12 +103,14 @@ struct k3_r5_soc_data {
> >>* @dev: cached device pointer
> >>* @mode: Mode to configure the Cluster - Split or LockStep
> >>* @cores: list of R5 cores within the cluster
> >> + * @core_transition: wait queue to sync core state changes
> >>* @soc_data: SoC-specific feature data for a R5FSS
> >>*/
> >>   struct k3_r5_cluster {
> >>  struct device *dev;
> >>  enum cluster_mode mode;
> >>  struct list_head cores;
> >> +wait_queue_head_t core_transition;
> >>  const struct k3_r5_soc_data *soc_data;
> >>   };
> >>
> >> @@ -128,6 +130,7 @@ struct k3_r5_cluster {
> >>* @atcm_enable: flag to control ATCM enablement
> >>* @btcm_enable: flag to control BTCM enablement
> >>* @loczrama: flag to dictate which TCM is at device address 0x0
> >> + * @released_from_reset: flag to signal when core is out of reset
> >>*/
> >>   struct k3_r5_core {
> >>  struct list_head elem;
> >> @@ -144,6 +147,7 @@ struct k3_r5_core {
> >>  u32 atcm_enable;
> >>  u32 btcm_enable;
> >>  u32 loczrama;
> >> +bool released_from_reset;
> >>   };
> >>
> >>   /**
> >> @@ -460,6 +464,8 @@ static int k3_r5_rproc_prepare(struct rproc *rproc)
> >>  ret);
> >>  return ret;
> >>  }
> >> +core->released_from_reset = true;
> >> +wake_up_interruptible(&cluster->core_transition);
> >>
> >>  /*
> >>   * Newer IP revisions like on J7200 SoCs support h/w 
> >> auto-initialization
> >> @@ -1140,6 +1146,7 @@ static int k3_r5_rproc_configure_mode(struct 
> >> k3_r5_rproc *kproc)
> >>  return ret;
> >>  }
> >>
> >> +core->released_from_reset = c_state;
> >> ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl,
> >> &stat);
> >>  if (ret < 0) {
> >> @@ -1280,6 +1287,21 @@ static int k3_r5_cluster_rproc_init(struct 
> >> platform_device *pdev)
> >>  cluster->mode == CLUSTER_MODE_SINGLECPU ||
> >>  cluster->mode == CLUSTER_MODE_SINGLECORE)
> >>  break;
> >> +
> >> +/* R5 cores require to be powered on sequentially, core0
> >> + * should be in higher power state than core1 in a cluster
> >> + * So, wait for current core to power up before proceeding
> >> + * to next core and put timeout of 2sec for each core.
> >> + */
> > Wrong multi-line comment format.
> Okay will fix this.
> >> +ret = 
> >> wait_event_interruptible_timeout(cluster->core_transition,
> >> +   
> >> core->released_from_reset,
> >> +   
> >> msecs_to_jiffies(2000));
> >> +if (ret <= 0) {
> >> +dev_err(dev,
> >> +"Timed out waiting for %s core to power 
> >> up!\n",
> >> +rproc->name);
> >> +return ret;
> >> +}
> >  From my perspective, this is needed because rproc_auto_boot_callback()
> > for core1 can be called before core0 due to thread execution order.
> > Am I correct?
> Yes
> > If so please add this explanation to the comment you have above.  Also,
> > let's say a user decides to switch both cores off after reboot.  At that
> > time, what prevents a user from switching on core1 before core0 via sysfs?
> Okay, will add the explanation.
> Currently, adding support for graceful shutdown is in progress. As of
> now, in order to stop/start a core or change firmware, we recommend users
> to restart the OS.

Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 09:54:31 -0700
Jakub Kicinski  wrote:

> On Wed, 4 Oct 2023 12:35:24 -0400 Steven Rostedt wrote:
> > > Potentially naive question - the trace point holds enum skb_drop_reason.
> > > The user space can get the names from BTF. Can we not teach user space
> > > to generically look up names of enums in BTF?
> > 
> > That puts a hard requirement to include BTF in builds where it was not
> > needed before. I really do not want to build with BTF just to get access to
> > these symbols. And since this is used by the embedded world, and BTF is
> > extremely bloated, the short answer is "No".  
> 
> Dunno. BTF is there most of the time. It could make the life of
> majority of the users far more pleasant.

BTF isn't there for a lot of developers working in embedded who use this
code. Most of my users that I deal with have minimal environments, so BTF is
a showstopper.

> 
> I hope we can at least agree that the current methods of generating 
> the string arrays at C level are... aesthetically displeasing.

I don't know, I kinda like it ;-)

-- Steve



Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-10-04 Thread Tejun Heo
Hello,

On Wed, Oct 04, 2023 at 10:45:08AM -0500, Haitao Huang wrote:
> So I will update to something like the following. Let me know if that's the
> correct understanding.
> @tj, I'd appreciate your input on whether this is acceptable from the
> cgroups side.

Yeah, that's fine by me and I can't tell what actual differences the two
would have in practice.

Thanks.

-- 
tejun


Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Jakub Kicinski
On Wed, 4 Oct 2023 12:35:24 -0400 Steven Rostedt wrote:
> > Potentially naive question - the trace point holds enum skb_drop_reason.
> > The user space can get the names from BTF. Can we not teach user space
> > to generically look up names of enums in BTF?  
> 
> That puts a hard requirement to include BTF in builds where it was not
> needed before. I really do not want to build with BTF just to get access to
> these symbols. And since this is used by the embedded world, and BTF is
> extremely bloated, the short answer is "No".

Dunno. BTF is there most of the time. It could make the life of
majority of the users far more pleasant.

I hope we can at least agree that the current methods of generating 
the string arrays at C level are... aesthetically displeasing.



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 09:22:05 -0700
Jakub Kicinski  wrote:

> Potentially naive question - the trace point holds enum skb_drop_reason.
> The user space can get the names from BTF. Can we not teach user space
> to generically look up names of enums in BTF?

That puts a hard requirement to include BTF in builds where it was not
needed before. I really do not want to build with BTF just to get access to
these symbols. And since this is used by the embedded world, and BTF is
extremely bloated, the short answer is "No".

-- Steve



Re: [PATCH 0/4] tracing: improve symbolic printing

2023-10-04 Thread Jakub Kicinski
On Thu, 21 Sep 2023 10:51:30 +0200 Johannes Berg wrote:
> So I was frustrated with not seeing the names of SKB dropreasons
> for all but the core reasons, and then while looking into this
> all, realized, that the current __print_symbolic() is pretty bad
> anyway.
> 
> So I came up with a new approach, using a separate declaration
> of the symbols, and __print_sym() in there, but to userspace it
> all doesn't matter, it shows it the same way, just dynamically
> instead of munging with the strings all the time.
> 
> This is a huge .data savings as far as I can tell, with a modest
> amount (~4k) of .text addition, while making it all dynamic and
> in the SKB dropreason case even reusing the existing list that
> dropmonitor uses today. Surely patch 3 isn't needed here, but it
> felt right.
> 
> Anyway, I think it's a pretty reasonable approach overall, and
> it does work.
> 
> I've listed a number of open questions in the first patch since
> that's where the real changes for this are.

Potentially naive question - the trace point holds enum skb_drop_reason.
The user space can get the names from BTF. Can we not teach user space
to generically look up names of enums in BTF?



Re: [PATCH v5 01/18] cgroup/misc: Add per resource callbacks for CSS events

2023-10-04 Thread Haitao Huang

Hi Jarkko

On Mon, 02 Oct 2023 17:55:14 -0500, Jarkko Sakkinen   
wrote:

...

> >> I noticed this one later:
> >>
> >> It would be better to create a separate ops struct and declare the
> >> instance as const at minimum.
> >>
> >> Then there is no need for dynamic assignment of ops and all that is
> >> in rodata. This improves both security and also allows static
> >> analysis a bit better.
> >>
> >> Now you have to dynamically trace the struct instance, e.g. in case
> >> of a bug. If this is done, it would already be in the vmlinux.
> > I.e. then in the driver you can have a static const struct declaration
> > with *all* pointers pre-assigned.
> >
> > Not sure if cgroups follows this or not but it is *objectively*
> > better. Previous work is not always best possible work...
> >
>
> IIUC, like vm_ops field in vma structs. Although function pointers in
> vm_ops are assigned statically, but you still need dynamically assign
> vm_ops for each instance of vma.
>
> So the code will look like this:
>
> if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc)
> {
> ...
> }
>
> I don't see this is the pattern used in cgroups and no strong opinion
> either way.
>
> TJ, do you have preference on this?

I do have strong opinion on this. In the client side we want as much
things declared statically as we can because it gives more tools for
statical analysis.

I don't want to see dynamic assignments in the SGX driver, when they
are not actually needed, no matter things are done in cgroups.


I.e. I don't really even care what crazy things cgroups subsystem
might do or not do. It's not my problem.

All I care is that we *do not* have any use for assigning those
pointers at run-time. So do whatever you want with cgroups side
as long as this is not the case.




So I will update to something like the following. Let me know if that's the
correct understanding.
@tj, I'd appreciate your input on whether this is acceptable from the
cgroups side.


--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -31,22 +31,26 @@ struct misc_cg;

 #include 

+/* per resource callback ops */
+struct misc_operations_struct {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+   void (*max_write)(struct misc_cg *cg);
+};
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
  * @usage: Current usage of the resource.
  * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
+ * @misc_ops: resource specific operations.
  */
 struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
void *priv;
-
-   /* per resource callback ops */
-   int (*alloc)(struct misc_cg *cg);
-   void (*free)(struct misc_cg *cg);
-   void (*max_write)(struct misc_cg *cg);
+   const struct misc_operations_struct *misc_ops;
 };

...
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 4633b8629e63..500415087643 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -277,8 +277,8 @@ static ssize_t misc_cg_max_write(struct kernfs_open_file *of, char *buf,

	if (READ_ONCE(misc_res_capacity[type])) {
		WRITE_ONCE(cg->res[type].max, max);
-		if (cg->res[type].max_write)
-			cg->res[type].max_write(cg);
+		if (cg->res[type].misc_ops && cg->res[type].misc_ops->max_write)
+			cg->res[type].misc_ops->max_write(cg);

[skip other similar changes in misc.c]

And on SGX side, it'll be updated like this:

--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -376,6 +376,14 @@ static void sgx_epc_cgroup_max_write(struct misc_cg *cg)

	queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
 }

+static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_operations_struct sgx_epc_cgroup_ops = {
+.alloc = sgx_epc_cgroup_alloc,
+.free = sgx_epc_cgroup_free,
+.max_write = sgx_epc_cgroup_max_write,
+};
+
 static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
 {
struct sgx_epc_cgroup *epc_cg;
@@ -386,12 +394,7 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)

	sgx_lru_init(&epc_cg->lru);
	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
-   cg->res[MISC_CG_RES_SGX_EPC].alloc = sgx_epc_cgroup_alloc;
-   cg->res[MISC_CG_RES_SGX_EPC].free = sgx_epc_cgroup_free;
-   cg->res[MISC_CG_RES_SGX_EPC].max_write = sgx_epc_cgroup_max_write;
-   cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
-   epc_cg->cg = cg;
-
+   cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
return 0;
 }


Thanks again to all of you for feedback.

Haitao


Re: [PATCH v3 03/13] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2023-10-04 Thread Edgecombe, Rick P
On Tue, 2023-10-03 at 17:29 -0700, Rick Edgecombe wrote:
> It seems a bit weird to copy all of this. Is it trying to be faster
> or
> something?
> 
> Couldn't it just check r->start in execmem_text/data_alloc() path and
> switch to EXECMEM_DEFAULT if needed then? The execmem_range_is_data()
> part that comes later could be added to the logic there too. So this
> seems like unnecessary complexity to me or I don't see the reason.

I guess this is a bad idea because if you have the full size array
sitting around anyway you might as well use it and reduce the
exec_mem_alloc() logic. Just looking at it from the x86 side (and
similar) though, there is actually only one execmem_range, and building
this whole array with identical data seems weird.



Re: [PATCH v5 06/18] x86/sgx: Introduce EPC page states

2023-10-04 Thread Haitao Huang

On Tue, 03 Oct 2023 15:03:48 -0500, Huang, Kai  wrote:


On Mon, 2023-10-02 at 23:49 -0500, Haitao Huang wrote:
On Wed, 27 Sep 2023 05:28:36 -0500, Huang, Kai   
wrote:


> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > Use the lower 3 bits in the flags field of sgx_epc_page struct to
> > track EPC states in its life cycle and define an enum for possible
> > states. More state(s) will be added later.
>
> This patch does more than what the changelog claims to do.  AFAICT it
> does
> below:
>
>  1) Use the lower 3 bits to track EPC page status
>  2) Rename SGX_EPC_PAGE_RECLAIMER_TRACKED to SGX_EPC_PAGE_RERCLAIMABLE
>  3) Introduce a new state SGX_EPC_PAGE_UNRECLAIMABLE
>  4) Track SECS and VA pages as SGX_EPC_PAGE_UNRECLAIMABLE
>
> The changelog only says 1) IIUC.
>
I don't quite get why you would view 3) as a separate item from 1).


1) is about using some method to track EPC page status, 3) is adding a
new state.

Why can they not be separated?

In my view, 4) is not done as long as there is no separate list to
track it.


You are literally doing below:

@@ -113,6 +113,9 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
encl->attributes = secs->attributes;
encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
+   sgx_record_epc_page(encl->secs.epc_page,
+   SGX_EPC_PAGE_UNRECLAIMABLE);
+

Which obviously is tracking SECS as unreclaimable page here.

The only thing you are not doing now is putting it on the actual list,
which you introduced in a later patch.

But why not just doing them together?


I see where the problem is now. Initially these states are bit masks, so
UNTRACKED and UNRECLAIMABLE are both unmasked (set to zero). I'll change
these "record" calls to use UNTRACKED instead, and later replace it with
UNRECLAIMABLE when the pages are actually added to the list. So the
UNRECLAIMABLE state can also be delayed until the patch that adds the list.


Thanks.
Haitao


Re: [PATCH v5 09/18] x86/sgx: Store struct sgx_encl when allocating new VA pages

2023-10-04 Thread Haitao Huang

On Tue, 03 Oct 2023 15:07:42 -0500, Huang, Kai  wrote:


On Tue, 2023-10-03 at 01:45 -0500, Haitao Huang wrote:

>
> Btw, probably a dumb question:
>
> Theoretically if you only need to find a victim enclave you don't need
> to put VA pages to the unreclaimable list, because those VA pages will
> be freed anyway when the enclave is killed.  So keeping VA pages in the
> list is for accounting all the pages that the cgroup is having?

Yes basically tracking them in cgroups as they are allocated.

VAs and SECS may also come and go as swapping/unswapping happens. But
if a cgroup is OOM, and all reclaimables are gone (swapped out), it'd
have to reclaim VAs/SECs in the same cgroup starting from the front of
the LRU list. To reclaim a VA/SECS, it identifies the enclave from the
owner of the VA/SECS page and kills it, as killing the enclave is the
only way to reclaim VA/SECS pages.


To kill an enclave you just need to track SECS in the unreclaimable list.
Only when you want to account the total EPC pages via some list do you
_probably_ need to track VA as well.  But I am not quite sure about this
either.


There is a case where even SECS is paged out for an enclave with all
reclaimables out. So the cgroup needs to track each page used by an
enclave and kill the enclave when the cgroup needs to lower usage by
evicting a VA or SECS page.
There was some discussion on paging out VAs without killing enclaves,
but it'd be complicated and is not implemented yet.


BTW, I need to clarify the difference between tracking pages, which is
done by the LRUs, vs usage accounting, which is done by charge/uncharge
to misc. To me, tracking is for reclaiming, not accounting. Also, vEPCs
are not tracked at all, but they are accounted for.


Haitao


Re: [PATCH v2] trace: tracing_event_filter: fast path when no subsystem filters

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 10:39:33 -0400
Nick Lowell  wrote:

> > [ cut here ]
> >  WARNING: CPU: 5 PID: 944 at kernel/trace/trace_events_filter.c:2423 
> > apply_subsystem_event_filter+0x18c/0x5e0
> >  Modules linked in:
> >  CPU: 5 PID: 944 Comm: trace-cmd Not tainted 
> > 6.6.0-rc4-test-9-gff7cd7446fe5 #102
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.16.2-debian-1.16.2-1 04/01/2014
> >  RIP: 0010:apply_subsystem_event_filter+0x18c/0x5e0
> >  Code: 44 24 08 00 00 00 00 48 8b 6d 00 4c 39 f5 75 bc 48 8b 44 24 18 4c 8b 
> > 60 18 4c 89 e5 45 84 ff 75 14 48 85 ed 0f 84 37 ff ff ff <0f> 0b eb 10 e8 
> > 4b be fd ff eb b0 4d 85 e4 0f 84 a3 02 00 00 48 8b
> >  RSP: 0018:9b4941607db8 EFLAGS: 00010286
> >  RAX: 8b2780a77280 RBX: 8b2780a77400 RCX: 
> >  RDX:  RSI: 8b2781c11c38 RDI: 8b2781c11c38
> >  RBP: 8b28df449030 R08: 8b2781c11c38 R09: 
> >  R10: 8b2781c11c38 R11:  R12: 8b28df449030
> >  R13: aaf64de0 R14: aaf66bb8 R15: 
> >  FS:  7fd221def3c0() GS:8b28f7d4() 
> > knlGS:
> >  CS:  0010 DS:  ES:  CR0: 80050033
> >  CR2: 56117c93e160 CR3: 00010173a003 CR4: 00170ee0
> >  Call Trace:
> >   
> >   ? apply_subsystem_event_filter+0x18c/0x5e0
> >   ? __warn+0x81/0x130
> >   ? apply_subsystem_event_filter+0x18c/0x5e0
> >   ? report_bug+0x191/0x1c0
> >   ? handle_bug+0x3c/0x80
> >   ? exc_invalid_op+0x17/0x70
> >   ? asm_exc_invalid_op+0x1a/0x20
> >   ? apply_subsystem_event_filter+0x18c/0x5e0
> >   ? apply_subsystem_event_filter+0x5b/0x5e0
> >   ? __check_object_size+0x25b/0x2c0
> >   subsystem_filter_write+0x41/0x70
> >   vfs_write+0xf2/0x440
> >   ? kmem_cache_free+0x22/0x350
> >   ksys_write+0x6f/0xf0
> >   do_syscall_64+0x3f/0xc0
> >   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> >  RIP: 0033:0x7fd221ee7ae0
> >
> > -- Steve  
> 
> Is this just informative indicating that there are issues with how
> filters are being used or are you saying there is something else I
> need to do before this patch is approved?
> What version of trace-cmd is that using?

Not sure if it matters, but the above was with trace-cmd v3.2.

So, I guess we need to look a bit deeper at the change.

> @@ -2411,7 +2418,12 @@ int apply_subsystem_event_filter(struct 
> trace_subsystem_dir *dir,
>   }
>  
>   if (!strcmp(strstrip(filter_string), "0")) {
> - filter_free_subsystem_preds(dir, tr);
> + /* If nothing was freed, we do not need to sync */
> + if (!filter_free_subsystem_preds(dir, tr)) {
> + if(!(WARN_ON_ONCE(system->filter)))
> + goto out_unlock;

When do we want to skip the below?

The original version just did the "goto out_unlock" before the
"system->filter" check, and that would have caused a memory leak, or just
left the "system->filter" around when unneeded.

In other words, the patch is incorrect in general then.

> + }
> +
>   remove_filter_string(system->filter);
>   filter = system->filter;
>   system->filter = NULL;

I believe, what you really want here is simply:

bool sync;

[..]

if (!strcmp(strstrip(filter_string), "0")) {
+   sync = filter_free_subsystem_preds(dir, tr);
remove_filter_string(system->filter);
filter = system->filter;
system->filter = NULL;
-   /* Ensure all filters are no longer used */
-   tracepoint_synchronize_unregister();
+   if (sync) {
+   /* Ensure all filters are no longer used */
+   tracepoint_synchronize_unregister();
+   }
filter_free_subsystem_filters(dir, tr);

Maybe even pass in "sync" to the filter_free_subsystem_filters() to make
sure there were nothing to be freed, and do the WARN_ON_ONCE() then.

__free_filter(filter);
goto out_unlock;
}

-- Steve



Re: [PATCH v2] trace: tracing_event_filter: fast path when no subsystem filters

2023-10-04 Thread Nick Lowell
On Tue, Oct 3, 2023 at 10:29 PM Steven Rostedt  wrote:
>
> On Mon,  2 Oct 2023 10:41:48 -0400
> Nicholas Lowell  wrote:
>
> > @@ -2411,7 +2418,12 @@ int apply_subsystem_event_filter(struct 
> > trace_subsystem_dir *dir,
> >   }
> >
> >   if (!strcmp(strstrip(filter_string), "0")) {
> > - filter_free_subsystem_preds(dir, tr);
> > + /* If nothing was freed, we do not need to sync */
> > + if (!filter_free_subsystem_preds(dir, tr)) {
> > + if(!(WARN_ON_ONCE(system->filter)))
> > + goto out_unlock;
> > + }
> > +
> >   remove_filter_string(system->filter);
> >   filter = system->filter;
> >   system->filter = NULL;
> > --
>
> This is why I asked for the warning:
>
> trace-cmd record -o /tmp/trace.dat -e sched -f "(common_pid == $$) || 
> ((common_pid > 10) && common_pid < 100) || (common_pid >= 1000 && common_pid 
> <= 1050) || (common_pid > 1 && common_pid < 2)" sleep 5
>
>
> Causes:
>
> [ cut here ]
>  WARNING: CPU: 5 PID: 944 at kernel/trace/trace_events_filter.c:2423 
> apply_subsystem_event_filter+0x18c/0x5e0
>  Modules linked in:
>  CPU: 5 PID: 944 Comm: trace-cmd Not tainted 
> 6.6.0-rc4-test-9-gff7cd7446fe5 #102
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.2-debian-1.16.2-1 04/01/2014
>  RIP: 0010:apply_subsystem_event_filter+0x18c/0x5e0
>  Code: 44 24 08 00 00 00 00 48 8b 6d 00 4c 39 f5 75 bc 48 8b 44 24 18 4c 8b 
> 60 18 4c 89 e5 45 84 ff 75 14 48 85 ed 0f 84 37 ff ff ff <0f> 0b eb 10 e8 4b 
> be fd ff eb b0 4d 85 e4 0f 84 a3 02 00 00 48 8b
>  RSP: 0018:9b4941607db8 EFLAGS: 00010286
>  RAX: 8b2780a77280 RBX: 8b2780a77400 RCX: 
>  RDX:  RSI: 8b2781c11c38 RDI: 8b2781c11c38
>  RBP: 8b28df449030 R08: 8b2781c11c38 R09: 
>  R10: 8b2781c11c38 R11:  R12: 8b28df449030
>  R13: aaf64de0 R14: aaf66bb8 R15: 
>  FS:  7fd221def3c0() GS:8b28f7d4() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 56117c93e160 CR3: 00010173a003 CR4: 00170ee0
>  Call Trace:
>   
>   ? apply_subsystem_event_filter+0x18c/0x5e0
>   ? __warn+0x81/0x130
>   ? apply_subsystem_event_filter+0x18c/0x5e0
>   ? report_bug+0x191/0x1c0
>   ? handle_bug+0x3c/0x80
>   ? exc_invalid_op+0x17/0x70
>   ? asm_exc_invalid_op+0x1a/0x20
>   ? apply_subsystem_event_filter+0x18c/0x5e0
>   ? apply_subsystem_event_filter+0x5b/0x5e0
>   ? __check_object_size+0x25b/0x2c0
>   subsystem_filter_write+0x41/0x70
>   vfs_write+0xf2/0x440
>   ? kmem_cache_free+0x22/0x350
>   ksys_write+0x6f/0xf0
>   do_syscall_64+0x3f/0xc0
>   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>  RIP: 0033:0x7fd221ee7ae0
>
> -- Steve

Is this just informative indicating that there are issues with how
filters are being used or are you saying there is something else I
need to do before this patch is approved?
What version of trace-cmd is that using?



Re: [PATCH] visl: use canonical ftrace path

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 16:00:34 +0200
Hans Verkuil  wrote:

> > The original patch did have linux-me...@vger.kernel.org Cc'd. Was it 
> > dropped?  
> 
> You are correct, it was sitting in patchwork, I somehow missed it.
> 
> It's now delegated to me, so it is in my pipeline.

Thanks a lot! I appreciate it.

-- Steve



Re: [PATCH] visl: use canonical ftrace path

2023-10-04 Thread Hans Verkuil
On 10/4/23 15:57, Steven Rostedt wrote:
> On Wed, 4 Oct 2023 08:33:12 +0200
> Hans Verkuil  wrote:
> 
>> On 03/10/2023 23:41, Steven Rostedt wrote:
>>>
>>> Could this go through the linux-media tree, or if you give it an Ack, I'll
>>> take it through the tracing tree.  
>>
>> I prefer to take this through the media subsystem. Ross, can you post this
> 
> Thanks!
> 
>> patch again, this time including linux-me...@vger.kernel.org?
> 
> The original patch did have linux-me...@vger.kernel.org Cc'd. Was it dropped?

You are correct, it was sitting in patchwork, I somehow missed it.

It's now delegated to me, so it is in my pipeline.

Regards,

Hans

> 
> -- Steve
> 
> 
>>
>> The patch looks fine, so I'll pick it up.
>>
>> Regards,
>>
>>  Hans




Re: [PATCH] visl: use canonical ftrace path

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 08:33:12 +0200
Hans Verkuil  wrote:

> On 03/10/2023 23:41, Steven Rostedt wrote:
> > 
> > Could this go through the linux-media tree, or if you give it an Ack, I'll
> > take it through the tracing tree.  
> 
> I prefer to take this through the media subsystem. Ross, can you post this

Thanks!

> patch again, this time including linux-me...@vger.kernel.org?

The original patch did have linux-me...@vger.kernel.org Cc'd. Was it dropped?

-- Steve


> 
> The patch looks fine, so I'll pick it up.
> 
> Regards,
> 
>   Hans



Re: [PATCH] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2023-10-04 Thread Apurva Nandan

Hi Mathieu,

On 11/09/23 22:15, Mathieu Poirier wrote:

Hi Apurva,

On Wed, Sep 06, 2023 at 06:17:56PM +0530, Apurva Nandan wrote:

PSC controller has a limitation that it can only power-up the second core
when the first core is in ON state. Power-state for core0 should be equal
to or higher than core1, else the kernel is seen hanging during rproc
loading.

Make the powering up of cores sequential, by waiting for the current core
to power-up before proceeding to the next core, with a timeout of 2sec.
Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
for the current core to be released from reset before proceeding with the
next core.

Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F 
subsystem")

Signed-off-by: Apurva Nandan 
---

  kpv report: 
https://gist.githubusercontent.com/apurvanandan1997/feb3b304121c265b7827be43752b7ae8/raw

  drivers/remoteproc/ti_k3_r5_remoteproc.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index ad3415a3851b..ba5e503f7c9c 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -103,12 +103,14 @@ struct k3_r5_soc_data {
   * @dev: cached device pointer
   * @mode: Mode to configure the Cluster - Split or LockStep
   * @cores: list of R5 cores within the cluster
+ * @core_transition: wait queue to sync core state changes
   * @soc_data: SoC-specific feature data for a R5FSS
   */
  struct k3_r5_cluster {
struct device *dev;
enum cluster_mode mode;
struct list_head cores;
+   wait_queue_head_t core_transition;
const struct k3_r5_soc_data *soc_data;
  };
  
@@ -128,6 +130,7 @@ struct k3_r5_cluster {

   * @atcm_enable: flag to control ATCM enablement
   * @btcm_enable: flag to control BTCM enablement
   * @loczrama: flag to dictate which TCM is at device address 0x0
+ * @released_from_reset: flag to signal when core is out of reset
   */
  struct k3_r5_core {
struct list_head elem;
@@ -144,6 +147,7 @@ struct k3_r5_core {
u32 atcm_enable;
u32 btcm_enable;
u32 loczrama;
+   bool released_from_reset;
  };
  
  /**

@@ -460,6 +464,8 @@ static int k3_r5_rproc_prepare(struct rproc *rproc)
ret);
return ret;
}
+   core->released_from_reset = true;
+   wake_up_interruptible(&cluster->core_transition);
  
  	/*

 * Newer IP revisions like on J7200 SoCs support h/w auto-initialization
@@ -1140,6 +1146,7 @@ static int k3_r5_rproc_configure_mode(struct k3_r5_rproc 
*kproc)
return ret;
}
  
+	core->released_from_reset = c_state;

ret = ti_sci_proc_get_status(core->tsp, _vec, , ,
 );
if (ret < 0) {
@@ -1280,6 +1287,21 @@ static int k3_r5_cluster_rproc_init(struct 
platform_device *pdev)
cluster->mode == CLUSTER_MODE_SINGLECPU ||
cluster->mode == CLUSTER_MODE_SINGLECORE)
break;
+
+   /* R5 cores require to be powered on sequentially, core0
+* should be in higher power state than core1 in a cluster
+* So, wait for current core to power up before proceeding
+* to next core and put timeout of 2sec for each core.
+*/

Wrong multi-line comment format.

Okay will fix this.

+   ret = wait_event_interruptible_timeout(cluster->core_transition,
+  core->released_from_reset,
+  msecs_to_jiffies(2000));
+   if (ret <= 0) {
+   dev_err(dev,
+   "Timed out waiting for %s core to power up!\n",
+   rproc->name);
+   return ret;
+   }

 From my perspective, this is needed because rproc_auto_boot_callback() for 
core1
can be called before core0 due to thread execution order.  Am I correct?

Yes

If so please add this explanation to the comment you have above.  Also, let's
say a user decides to switch both cores off after reboot.  At that time, what
prevents a user from switching on core1 before core0 via sysfs?

Okay, will add the explanation.
Currently, adding support for graceful shutdown is in progress. As of 
now in order
to stop/start core or change firmware, we recommend users to restart the 
OS.

Thanks,
Mathieu


}
  
  	return 0;

@@ -1709,6 +1731,7 @@ static int k3_r5_probe(struct platform_device *pdev)
cluster->dev = dev;
cluster->soc_data = data;
INIT_LIST_HEAD(>cores);
+   init_waitqueue_head(&cluster->core_transition);
  
	ret = of_property_read_u32(np, "ti,cluster-mode", &cluster->mode);

if (ret < 0 && ret != -EINVAL) {
--
2.34.1



Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*

2023-10-04 Thread Artem Savkov
On Tue, Oct 03, 2023 at 09:38:44PM -0400, Steven Rostedt wrote:
> On Mon,  2 Oct 2023 15:52:42 +0200
> Artem Savkov  wrote:
> 
> > linux-rt-devel tree contains a patch that adds an extra member to struct
> > trace_entry. This causes the offset of args field in struct
> > trace_event_raw_sys_enter be different from the one in struct
> > syscall_trace_enter:
> 
> This patch looks like it's fixing the symptom and not the issue. No code
> should rely on the two event structures to be related. That's an unwanted
> coupling, that will likely cause issues down the road (like the RT patch
> you mentioned).

I agree, but I didn't see a better solution and that was my way of
starting the conversation, thus the RFC.

> > 
> > struct trace_event_raw_sys_enter {
> > struct trace_entry ent;  /* 012 */
> > 
> > /* XXX last struct has 3 bytes of padding */
> > /* XXX 4 bytes hole, try to pack */
> > 
> > long int   id;   /*16 8 */
> > long unsigned int  args[6];  /*2448 */
> > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> > char   __data[]; /*72 0 */
> > 
> > /* size: 72, cachelines: 2, members: 4 */
> > /* sum members: 68, holes: 1, sum holes: 4 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 8 bytes */
> > };
> > 
> > struct syscall_trace_enter {
> > struct trace_entry ent;  /* 012 */
> > 
> > /* XXX last struct has 3 bytes of padding */
> > 
> > intnr;   /*12 4 */
> > long unsigned int  args[];   /*16 0 */
> > 
> > /* size: 16, cachelines: 1, members: 3 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 16 bytes */
> > };
> > 
> > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> 
> The above appears to be pointing to the real bug. The "is calculated based
> on the former struct while off on the latter" Why are the two being used
> together? They are supposed to be *unrelated*!
> 
> 
> > 
> >   10488 if (is_tracepoint || is_syscall_tp) {
> >   10489 int off = trace_event_get_offsets(event->tp_event);
> 
> So basically this is clumping together the raw_syscalls with the syscalls
> events as if they are the same. But the are not. They are created
> differently. It's basically like using one structure to get the offsets of
> another structure. That would be a bug anyplace else in the kernel. Sounds
> like it's a bug here too.
> 
> I think the issue is with this code, not the tracing code.
> 
> We could expose the struct syscall_trace_enter and syscall_trace_exit if
> the offsets to those are needed.

I don't think we need syscall_trace_* offsets; it looks like
trace_event_get_offsets() should return the offset of
trace_event_raw_sys_enter instead. I am still trying to figure out how
all of this works together. Maybe Alexei or Andrii have more context here.

-- 
 Artem




Re: [PATCH V3] tracing/timerlat: Hotplug support for the user-space interface

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 08:17:31 -0400
Steven Rostedt  wrote:

> #!/bin/bash
> 
> find_debugfs() {
> debugfs=`cat /proc/mounts | while read mount dir type opts a b; do
>   if [ $mount == "debugfs" ]; then

I guess I should update this to look for tracefs. This script is actually
older than tracefs.

-- Steve


>   echo $dir;
>   break
>   fi
> done`
> if [ -z "$debugfs" ]; then
>   if ! mount -t debugfs nodev /sys/kernel/debug; then
>   echo "FAILED to mount debugfs"
>   exit -1
>   fi
>   echo "/sys/kernel/debug"
> else
>   echo $debugfs
> fi
> }
> 
> debugfs=`find_debugfs`
> tracedir="$debugfs/tracing"
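Following up on the note above about looking for tracefs first, a sketch of what the updated helper might look like — preferring a native tracefs mount and falling back to debugfs + /tracing (the function name and the optional mounts-table argument are illustrative; the argument just makes it testable against a sample table instead of the live /proc/mounts):

```shell
# find_tracefs: print the tracing directory by scanning a mounts table.
# Prefers a native tracefs mount, falls back to debugfs + "/tracing".
find_tracefs() {
    mounts="${1:-/proc/mounts}"
    dir=$(awk '$3 == "tracefs" { print $2; exit }' "$mounts")
    if [ -z "$dir" ]; then
        dir=$(awk '$3 == "debugfs" { print $2 "/tracing"; exit }' "$mounts")
    fi
    printf '%s\n' "$dir"
}

# Demo against a sample mounts table instead of the live /proc/mounts.
sample=$(mktemp)
cat > "$sample" <<'EOF'
nodev /sys/kernel/debug debugfs rw 0 0
nodev /sys/kernel/tracing tracefs rw 0 0
EOF
find_tracefs "$sample"     # -> /sys/kernel/tracing
rm -f "$sample"
```

On kernels new enough to have tracefs mounted at /sys/kernel/tracing this avoids depending on debugfs at all, which matches the script being older than tracefs.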




Re: [PATCH V3] tracing/timerlat: Hotplug support for the user-space interface

2023-10-04 Thread Steven Rostedt
On Wed, 4 Oct 2023 12:02:52 +0200
Daniel Bristot de Oliveira  wrote:

> On 10/4/23 03:03, Steven Rostedt wrote:
> > On Fri, 29 Sep 2023 17:02:46 +0200
> > Daniel Bristot de Oliveira  wrote:
> >   
> >> The osnoise/per_cpu/CPU$/timerlat_fd is create for each possible
> >> CPU, but it might create confusion if the CPU is not online.
> >>
> >> Create the file only for online CPUs, also follow hotplug by
> >> creating and deleting as CPUs come and go.
> >>
> >> Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")  
> > 
> > Is this a fix that needs to go in now and Cc'd to stable? Or is this
> > something that can wait till the next merge window?  
> 
> We can wait for the next merge window... it is a non-trivial fix.
> 

The requirement is whether it's a fix, not really how "trivial" it is.

That said, I'm able to consistently trigger this:

BUG: kernel NULL pointer dereference, address: 00a0
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0 
 Oops: 0002 [#1] PREEMPT SMP PTI
 CPU: 1 PID: 20 Comm: cpuhp/1 Not tainted 6.6.0-rc4-test-8-g2df8f295b0e2 
#103
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-debian-1.16.2-1 04/01/2014
 RIP: 0010:down_write+0x23/0x70
 Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 fb e8 2e bc ff 
ff bf 01 00 00 00 e8 24 14 31 ff 31 c0 ba 01 00 00 00  48 0f b1 13 75 33 65 
48 8b 04 25 00 36 03 00 48 89 43 08 bf 01
 RSP: 0018:b17f800e3d98 EFLAGS: 00010246
 RAX:  RBX: 00a0 RCX: ff81
 RDX: 0001 RSI: 0064 RDI: b6edd5cc
 RBP: b17f800e3df8 R08: 8c6237c61188 R09: 8020001b
 R10: 8c6237c61160 R11: 0001 R12: 0002da30
 R13:  R14: b6314080 R15: 8c6237c61188
 FS:  () GS:8c6237c4() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00a0 CR3: 000102412001 CR4: 00170ee0
 Call Trace:
  
  ? __die+0x23/0x70
  ? page_fault_oops+0x17d/0x4c0
  ? exc_page_fault+0x7f/0x180
  ? asm_exc_page_fault+0x26/0x30
  ? __pfx_osnoise_cpu_die+0x10/0x10
  ? down_write+0x1c/0x70
  ? down_write+0x23/0x70
  ? down_write+0x1c/0x70
  simple_recursive_removal+0xef/0x280
  ? __pfx_remove_one+0x10/0x10
  ? __pfx_osnoise_cpu_die+0x10/0x10
  tracefs_remove+0x44/0x70
  timerlat_rm_per_cpu_interface+0x28/0x70
  osnoise_cpu_die+0xf/0x20
  cpuhp_invoke_callback+0xf8/0x460
  ? __pfx_smpboot_thread_fn+0x10/0x10
  cpuhp_thread_fun+0xf3/0x190
  smpboot_thread_fn+0x18c/0x230
  kthread+0xf7/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  
 Modules linked in:
 CR2: 00a0
 ---[ end trace  ]---
 RIP: 0010:down_write+0x23/0x70
 Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 48 89 fb e8 2e bc ff 
ff bf 01 00 00 00 e8 24 14 31 ff 31 c0 ba 01 00 00 00  48 0f b1 13 75 33 65 
48 8b 04 25 00 36 03 00 48 89 43 08 bf 01
 RSP: 0018:b17f800e3d98 EFLAGS: 00010246
 RAX:  RBX: 00a0 RCX: ff81
 RDX: 0001 RSI: 0064 RDI: b6edd5cc
 RBP: b17f800e3df8 R08: 8c6237c61188 R09: 8020001b
 R10: 8c6237c61160 R11: 0001 R12: 0002da30
 R13:  R14: b6314080 R15: 8c6237c61188
 FS:  () GS:8c6237c4() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 00a0 CR3: 000102412001 CR4: 00170ee0
 note: cpuhp/1[20] exited with irqs disabled
 note: cpuhp/1[20] exited with preempt_count 1


With running the attached script as:

 # ./ftrace-test-tracers sleep 1

-- Steve


ftrace-test-tracers
Description: Binary data


Re: [PATCH V3] tracing/timerlat: Hotplug support for the user-space interface

2023-10-04 Thread Daniel Bristot de Oliveira
On 10/4/23 03:03, Steven Rostedt wrote:
> On Fri, 29 Sep 2023 17:02:46 +0200
> Daniel Bristot de Oliveira  wrote:
> 
>> The osnoise/per_cpu/CPU$/timerlat_fd is created for each possible
>> CPU, but it might create confusion if the CPU is not online.
>>
>> Create the file only for online CPUs, also follow hotplug by
>> creating and deleting as CPUs come and go.
>>
>> Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")
> 
> Is this a fix that needs to go in now and Cc'd to stable? Or is this
> something that can wait till the next merge window?

We can wait for the next merge window... it is a non-trivial fix.

-- Daniel
> -- Steve




[GIT PULL] rtla: Fixes for 6.6

2023-10-04 Thread Daniel Bristot de Oliveira
Steven,

Timerlat auto-analysis:

  - Timerlat was reporting thread interference time without any thread noise
    events occurring. This was caused by the thread interference variable not
    being reset after the analysis of a timerlat activation that did not
    hit the threshold.

  - The IRQ handler delay is estimated from the delta between the IRQ latency
    reported by timerlat and the timestamp of the IRQ handler start event.
    If the delta is near zero, drift between the external clock and the
    trace event and/or the overhead can make the value negative.
    If the value is negative, print a zero delay.

  - IRQ handlers happening after the timerlat thread event but before the
    stop tracing were being reported as IRQs that happened before the
    *current* IRQ occurrence. Ignore previous IRQ noise in this condition,
    because it is valid only for the *next* timerlat activation.

Timerlat user-space:

  - Timerlat was stopping all user-space threads if a CPU became
    offline. Do not stop the entire tool if a CPU is or becomes offline;
    stop only the thread of the unavailable CPU. Stop the tool only
    if all threads leave because the CPUs are or become offline.

man-pages:

  - Fix command line example in timerlat hist man page.


Please pull the latest rtla-v6.6-fixes tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
rtla-v6.6-fixes

Tag SHA1: 6a0df51b5c0a075a553a5ad73b1fd421f559eb6b
Head SHA1: 81ec384b80ffbda752c230778d39ea620c7e3bcf


Daniel Bristot de Oliveira (4):
  rtla/timerlat_aa: Zero thread sum after every sample analysis
  rtla/timerlat_aa: Fix negative IRQ delay
  rtla/timerlat_aa: Fix previous IRQ delay for IRQs that happens after 
thread sample
  rtla/timerlat: Do not stop user-space if a cpu is offline

Xie XiuQi (1):
  rtla: fix a example in rtla-timerlat-hist.rst


 Documentation/tools/rtla/rtla-timerlat-hist.rst |  4 ++--
 tools/tracing/rtla/src/timerlat_aa.c| 32 -
 tools/tracing/rtla/src/timerlat_u.c |  6 +++--
 3 files changed, 32 insertions(+), 10 deletions(-)
---
diff --git a/Documentation/tools/rtla/rtla-timerlat-hist.rst 
b/Documentation/tools/rtla/rtla-timerlat-hist.rst
index 057db78d4095..03b7f3deb069 100644
--- a/Documentation/tools/rtla/rtla-timerlat-hist.rst
+++ b/Documentation/tools/rtla/rtla-timerlat-hist.rst
@@ -36,11 +36,11 @@ EXAMPLE
 In the example below, **rtla timerlat hist** is set to run for *10* minutes,
 in the cpus *0-4*, *skipping zero* only lines. Moreover, **rtla timerlat
 hist** will change the priority of the *timerlat* threads to run under
-*SCHED_DEADLINE* priority, with a *10us* runtime every *1ms* period. The
+*SCHED_DEADLINE* priority, with a *100us* runtime every *1ms* period. The
 *1ms* period is also passed to the *timerlat* tracer. Auto-analysis is disabled
 to reduce overhead ::
 
-  [root@alien ~]# timerlat hist -d 10m -c 0-4 -P d:100us:1ms -p 1ms --no-aa
+  [root@alien ~]# timerlat hist -d 10m -c 0-4 -P d:100us:1ms -p 1000 --no-aa
   # RTLA timerlat histogram
   # Time unit is microseconds (us)
   # Duration:   0 00:10:00
diff --git a/tools/tracing/rtla/src/timerlat_aa.c 
b/tools/tracing/rtla/src/timerlat_aa.c
index e0ffe69c271c..7093fd5333be 100644
--- a/tools/tracing/rtla/src/timerlat_aa.c
+++ b/tools/tracing/rtla/src/timerlat_aa.c
@@ -159,6 +159,7 @@ static int timerlat_aa_irq_latency(struct timerlat_aa_data 
*taa_data,
taa_data->thread_nmi_sum = 0;
taa_data->thread_irq_sum = 0;
taa_data->thread_softirq_sum = 0;
+   taa_data->thread_thread_sum = 0;
taa_data->thread_blocking_duration = 0;
taa_data->timer_irq_start_time = 0;
taa_data->timer_irq_duration = 0;
@@ -337,7 +338,23 @@ static int timerlat_aa_irq_handler(struct trace_seq *s, 
struct tep_record *recor
taa_data->timer_irq_start_time = start;
taa_data->timer_irq_duration = duration;
 
-   taa_data->timer_irq_start_delay = 
taa_data->timer_irq_start_time - expected_start;
+   /*
+* We are dealing with two different clock sources: the
+* external clock source that timerlat uses as a reference
+* and the clock used by the tracer. There are also two
+* moments: the time reading the clock and the timer in
+* which the event is placed in the buffer (the trace
+* event timestamp). If the processor is slow or there
+* is some hardware noise, the difference between the
+* timestamp and the external clock read can be longer
+* than the IRQ handler delay, resulting in a negative
+* time. If so, set IRQ start delay as 0. In the end,
+* it is less relevant than the noise.
+*/
+   if (expected_start < taa_data->timer_irq_start_time)
+

Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*

2023-10-04 Thread Artem Savkov
On Tue, Oct 03, 2023 at 03:11:15PM -0700, Andrii Nakryiko wrote:
> On Mon, Oct 2, 2023 at 6:53 AM Artem Savkov  wrote:
> >
> > linux-rt-devel tree contains a patch that adds an extra member to struct
> 
> can you please point to the patch itself that makes that change?

Of course, some context would be useful. The patch in question is b1773eac3f29c
("sched: Add support for lazy preemption") from rt-devel tree [0]. It came up
a couple of times before: [1] [2] [3] [4].

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71
[1] 
https://lore.kernel.org/linux-rt-users/20200221153541.681468-1-jo...@kernel.org/t/#u
[2] 
https://github.com/iovisor/bpftrace/commit/a2e3d5dbc03ceb49b776cf5602d31896158844a7
[3] https://lore.kernel.org/bpf/xunyjzy64q9b@redhat.com/t/#u
[4] https://lore.kernel.org/bpf/20230727150647.397626-1-ykali...@redhat.com/t/#u

> > trace_entry. This causes the offset of args field in struct
> > trace_event_raw_sys_enter be different from the one in struct
> > syscall_trace_enter:
> >
> > struct trace_event_raw_sys_enter {
> > struct trace_entry ent;  /* 012 */
> >
> > /* XXX last struct has 3 bytes of padding */
> > /* XXX 4 bytes hole, try to pack */
> >
> > long int   id;   /*16 8 */
> > long unsigned int  args[6];  /*2448 */
> > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> > char   __data[]; /*72 0 */
> >
> > /* size: 72, cachelines: 2, members: 4 */
> > /* sum members: 68, holes: 1, sum holes: 4 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 8 bytes */
> > };
> >
> > struct syscall_trace_enter {
> > struct trace_entry ent;  /* 012 */
> >
> > /* XXX last struct has 3 bytes of padding */
> >
> > intnr;   /*12 4 */
> > long unsigned int  args[];   /*16 0 */
> >
> > /* size: 16, cachelines: 1, members: 3 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 16 bytes */
> > };
> >
> > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> >
> >   10488 if (is_tracepoint || is_syscall_tp) {
> >   10489 int off = trace_event_get_offsets(event->tp_event);
> >   10490
> >   10491 if (prog->aux->max_ctx_offset > off)
> >   10492 return -EACCES;
> >   10493 }
> >
> > This patch changes the type of nr member in syscall_trace_* structs to
> > be long so that "args" offset is equal to that in struct
> > trace_event_raw_sys_enter.
> >
> > Signed-off-by: Artem Savkov 
> > ---
> >  kernel/trace/trace.h  | 4 ++--
> >  kernel/trace/trace_syscalls.c | 7 ---
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 77debe53f07cf..cd1d24df85364 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -135,13 +135,13 @@ enum trace_type {
> >   */
> >  struct syscall_trace_enter {
> > struct trace_entry  ent;
> > -   int nr;
> > +   longnr;
> > unsigned long   args[];
> >  };
> >
> >  struct syscall_trace_exit {
> > struct trace_entry  ent;
> > -   int nr;
> > +   longnr;
> > longret;
> >  };
> >
> > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> > index de753403cdafb..c26939119f2e4 100644
> > --- a/kernel/trace/trace_syscalls.c
> > +++ b/kernel/trace/trace_syscalls.c
> > @@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall)
> > return NULL;
> >  }
> >
> > -static struct syscall_metadata *syscall_nr_to_meta(int nr)
> > +static struct syscall_metadata *syscall_nr_to_meta(long nr)
> >  {
> > if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
> > return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
> > @@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int 
> > flags,
> > struct trace_entry *ent = iter->ent;
> > struct syscall_trace_enter *trace;
> > struct syscall_metadata *entry;
> > -   int i, syscall;
> > +   int i;
> > +   long syscall;
> >
> > trace = (typeof(trace))ent;
> > syscall = trace->nr;
> > @@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int 
> > flags,
> > struct trace_seq *s = &iter->seq;
> > struct trace_entry *ent = iter->ent;
> > struct 

Re: [PATCH] visl: use canonical ftrace path

2023-10-04 Thread Hans Verkuil
On 03/10/2023 23:41, Steven Rostedt wrote:
> 
> Could this go through the linux-media tree, or if you give it an Ack, I'll
> take it through the tracing tree.

I prefer to take this through the media subsystem. Ross, can you post this
patch again, this time including linux-me...@vger.kernel.org?

The patch looks fine, so I'll pick it up.

Regards,

Hans

> 
> -- Steve
> 
> 
> On Tue, 29 Aug 2023 14:46:01 -0600
> Ross Zwisler  wrote:
> 
>> From: Ross Zwisler 
>>
>> The canonical location for the tracefs filesystem is at /sys/kernel/tracing.
>>
>> But, from Documentation/trace/ftrace.rst:
>>
>>   Before 4.1, all ftrace tracing control files were within the debugfs
>>   file system, which is typically located at /sys/kernel/debug/tracing.
>>   For backward compatibility, when mounting the debugfs file system,
>>   the tracefs file system will be automatically mounted at:
>>
>>   /sys/kernel/debug/tracing
>>
>> Update the visl decoder driver documentation to use this tracefs path.
>>
>> Signed-off-by: Ross Zwisler 
>> ---
>>  Documentation/admin-guide/media/visl.rst | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/media/visl.rst 
>> b/Documentation/admin-guide/media/visl.rst
>> index 7d2dc78341c9..4328c6c72d30 100644
>> --- a/Documentation/admin-guide/media/visl.rst
>> +++ b/Documentation/admin-guide/media/visl.rst
>> @@ -78,7 +78,7 @@ The trace events are defined on a per-codec basis, e.g.:
>>  
>>  .. code-block:: bash
>>  
>> -$ ls /sys/kernel/debug/tracing/events/ | grep visl
>> +$ ls /sys/kernel/tracing/events/ | grep visl
>>  visl_fwht_controls
>>  visl_h264_controls
>>  visl_hevc_controls
>> @@ -90,13 +90,13 @@ For example, in order to dump HEVC SPS data:
>>  
>>  .. code-block:: bash
>>  
>> -$ echo 1 >  
>> /sys/kernel/debug/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
>> +$ echo 1 >  
>> /sys/kernel/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
>>  
>>  The SPS data will be dumped to the trace buffer, i.e.:
>>  
>>  .. code-block:: bash
>>  
>> -$ cat /sys/kernel/debug/tracing/trace
>> +$ cat /sys/kernel/tracing/trace
>>  video_parameter_set_id 0
>>  seq_parameter_set_id 0
>>  pic_width_in_luma_samples 1920
>