Re: [PATCH 0/4] apply page shift to PFN instead of VA in pfn_to_virt

2024-02-01 Thread Arnd Bergmann
On Thu, Feb 1, 2024, at 11:46, Geert Uytterhoeven wrote:
> Hi Arnd,
>
> On Thu, Feb 1, 2024 at 11:11 AM Arnd Bergmann  wrote:
>> I think it's fair to assume we won't need asm-generic/page.h any
>> more, as we likely won't be adding new NOMMU architectures.
>
> So you think riscv-nommu (k210) was the last one we will ever see?

Yes. We've already removed half of the nommu architectures
(blackfin, avr32, h8300, m32r, microblaze-nommu) over
the past couple of years, and the remaining ones are pretty
much only there to support existing users.

The only platform that I see getting real work is
esp32 [1], but that is not a new architecture.

 Arnd

[1] https://github.com/jcmvbkbc/linux-xtensa/tree/xtensa-6.8-rc2-esp32



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-02-01 Thread Christoph Hellwig
On Thu, Feb 01, 2024 at 12:29:46PM +0300, Andrew Kanner wrote:
> Of course not, no new users needed.
> 
> I haven't discussed it directly. I found the unused __symbol_get_gpl()
> myself, but during investigation of whether it was ever used somewhere
> I found the old patch series suggested by Mauro Carvalho Chehab (in Cc).

Ah, ok.

> 
> Link: 
> https://lore.kernel.org/lkml/5f001015990a76c0da35a4c3cf08e457ec353ab2.1652113087.git.mche...@kernel.org/
> 
> The patch series is from 2022 and was not merged. You can take [PATCH v6
> 1/4], which removes the unused symbol, from the link.
> 
> Or I can resend v2 with my commit msg. But I'm not sure how it works
> in such a case: will adding a Suggested-by tag (if no objections from
> Mauro) along with the Link be OK?

Either is fine.  I actually have a patch removing it somewhere in an
unused tree as well :)



Re: [PATCH 0/4] apply page shift to PFN instead of VA in pfn_to_virt

2024-02-01 Thread Arnd Bergmann
On Fri, Feb 2, 2024, at 02:02, Yan Zhao wrote:
> On Thu, Feb 01, 2024 at 06:46:46AM +0100, Arnd Bergmann wrote:
>> 
>> I think it's fair to assume we won't need asm-generic/page.h any
>> more, as we likely won't be adding new NOMMU architectures.
>> I can have a look myself at removing any such unused headers in
>> include/asm-generic/, it's probably not the only one.
>> 
>> Can you just send a patch to remove the unused pfn_to_virt()
>> functions?
> Ok. I'll do it!
> BTW: do you think it's also good to keep this fixing series even though
> we'll remove the unused function later?
> That way, if people want to revert the removal some day, they can get the
> right one.
>
> Or maybe I'm just paranoid, and an explanation of the fix in the commit
> message of the removal patch is enough.
>
> What's your preference? :)

I would just remove it, there is no point in having both
pfn_to_kaddr() and pfn_to_virt() when they do the exact
same thing aside from this bug.
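For context, the miscomputation being removed can be sketched in plain C. This is a simplified model: PAGE_SHIFT, the direct-map base, and the __va() stand-in here are placeholder values for illustration, not the real kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_OFFSET 0xffff888000000000ULL  /* placeholder direct-map base */

/* Simplified __va(): physical address -> kernel virtual address */
static inline uint64_t my_va(uint64_t paddr)
{
	return paddr + PAGE_OFFSET;
}

/* Buggy variant: the page shift is applied to the VA, not the PFN */
static inline uint64_t pfn_to_virt_buggy(uint64_t pfn)
{
	return my_va(pfn) << PAGE_SHIFT;
}

/* Fixed variant (what pfn_to_kaddr() already does): shift the PFN
 * into a physical address first, then convert to a VA */
static inline uint64_t pfn_to_virt_fixed(uint64_t pfn)
{
	return my_va(pfn << PAGE_SHIFT);
}
```

With PFN 1, the fixed variant yields PAGE_OFFSET + 0x1000 as expected, while the buggy variant shifts the direct-map offset itself and produces an unrelated address.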

Just do a single patch for all architectures, no need to
have three or four identical ones when I'm going to merge
them all through the same tree anyway.

Just make sure you explain in the changelog what the bug was
and how you noticed it, in case anyone is ever tempted to
bring the function back.

Arnd



Re: Boot-time dumping of ftrace functiongraph buffer

2024-02-01 Thread Google
Hi Ahmad,

On Thu, 1 Feb 2024 13:21:37 +0100
Ahmad Fatoum  wrote:

> Hello,
> 
> I semi-regularly debug probe failures. For drivers that use dev_err_probe
> rigorously, this is a quick matter: The probe function records a deferral
> reason and if the deferral persists, deferred_probe_timeout_work_func() will
> print the collected reasons, even if PID 1 is never started.
> 
> For drivers that don't call dev_err_probe, I find myself sometimes doing
> printf debugging inside the probe function.
> 
> I would like to replace this with the function graph tracer:
> 
>   - record the probe function, configured over kernel command line
> (The device indefinitely deferring probe is printed to the console,
>  so I know what I am looking for on the next boot)
> 
>   - Dump the function graph trace
> 
>   - See if the last call before (non-devm) cleanup is getting a clock, a GPIO,
> a regulator or w/e.

What kind of information do you print with printk()?
If the target (the suspicious driver probe function) is obvious, you can use a
kprobe event and tp_printk. Or, even if you don't know, if you are sure which
function is the starting/ending point, you can use bootconfig to record the
specific part of execution in the ring buffer, and dump it as Steve said.

In Documentation/trace/boottime-trace.rst, there is an example.
-
With the trigger action and kprobes, you can trace function-graph while
a function is called. For example, this will trace all function calls in
the pci_proc_init()::

  ftrace {
        tracing_on = 0
        tracer = function_graph
        event.kprobes {
                start_event {
                        probes = "pci_proc_init"
                        actions = "traceon"
                }
                end_event {
                        probes = "pci_proc_init%return"
                        actions = "traceoff"
                }
        }
  }
-

Thank you,

> 
> For this to be maximally useful, I need to configure this not only at
> boot-time, but also dump the ftrace buffer at boot time. Probe deferral can
> hinder the kernel from calling init and providing a shell, where I could
> read /sys/kernel/tracing/trace.
> 
> I found the following two mechanisms that looked relevant, but they seem
> not to do exactly what I want:
>
>   - tp_printk: seems to be related to trace points only and not usable
>     for the function graph output
>
>   - dump_on_oops: I don't get an Oops if probe deferral times out, but maybe
>     one could patch the kernel to check an oops_on_probe_deferral or
>     dump_on_probe_deferral kernel command line parameter in
>     deferred_probe_timeout_work_func()?
> 
> 
> Is there existing support that I am missing? Any input on whether this
> would be a welcome feature to have?
> 
> Thanks!
> 
> Cheers,
> Ahmad
> 
> -- 
> Pengutronix e.K.   | |
> Steuerwalder Str. 21   | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
> Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |
>  


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 0/4] tracing/user_events: Introduce multi-format events

2024-02-01 Thread Google
On Tue, 30 Jan 2024 10:25:49 -0800
Beau Belgrave  wrote:

> On Tue, Jan 30, 2024 at 11:09:33AM +0900, Masami Hiramatsu wrote:
> > Hi Beau,
> > 
> > On Tue, 23 Jan 2024 22:08:40 +
> > Beau Belgrave  wrote:
> > 
> > > Currently user_events supports 1 event with the same name and must have
> > > the exact same format when referenced by multiple programs. This opens
> > > an opportunity for malicious or poorly thought-through programs to
> > > create events that others use with different formats. Another scenario
> > > is user programs wishing to use the same event name but add more fields
> > > later when the software updates. Various versions of a program may be
> > > running side-by-side, which is prevented by the current single format
> > > requirement.
> > > 
> > > Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates
> > > the user program wishes to use the same user_event name, but may have
> > > several different formats of the event in the future. When this flag is
> > > used, create the underlying tracepoint backing the user_event with a
> > > unique name per-version of the format. It's important that existing ABI
> > > users do not get this logic automatically, even if one of the multi
> > > format events matches the format. This ensures existing programs that
> > > create events and assume the tracepoint name will match exactly continue
> > > to work as expected. Add logic to only check multi-format events with
> > > other multi-format events and single-format events to only check
> > > single-format events during find.
> > 
> > Thanks for this work! This will allow many instances to use the same
> > user-events at the same time.
> > 
> > BTW, can we force this flag to be set by default? My concern is that a
> > user program may use this user-event interface inside a container (maybe
> > it is possible if we bind-mount it). In this case, the user program can
> > detect that another program is using the event if this flag is not set.
> > Moreover, if there is a malicious program running in the container,
> > it can prevent other programs from using the event name even though it
> > is isolated by the namespace.
> > 
> 
> Multi-format events use a different system name (user_events_multi). So you
> cannot use the single-format flag to detect multi-format names, etc. You
> can only use it to find other single-format names like you could always do.
> 
> Likewise, you cannot use the single-event flag to block or prevent
> multi-format events from being created.

Hmm, got it.

> 
> Changing this behavior to default would break all of our environments.
> So I'm pretty sure it would break others. The current environment
> expects tracepoints to show up as their registered name under the
> user_events system name. If this changed out from under us on a specific
> kernel version, that would be bad.
> 
> I'd like eventually to have a tracer namespace concept for containers.
> Then we would have a user_event_group per tracer namespace. Then all
> user_events within the container have a unique system name which fully
> isolates them. However, even with that isolation, we still need a way to
> allow programs in the same container to have different versions of the
> same event name.

Agreed.

> 
> Multi-format events fix this problem. I think isolation should be
> dealt with via true namespace isolation at the tracing level.
> 
> > Steve suggested that if a user program which is running in a namespace
> > uses user-events without this flag, we can reject that by default.
> >
> > What do you think about that?
> > 
> 
> This would break all of our environments. It would prevent previously
> compiled programs using the existing ABI from working as expected.
> 
> I'd much prefer that level of isolation to happen at the namespace level,
> which is why user_events has plumbing for different groups to achieve this.
> It's also why the ABI does not allow programs to state a system name.
> I'm trying to reserve the system name for the group/tracer/namespace
> isolation work.

OK, that's reasonable enough.

Thank you!

> 
> Thanks,
> -Beau
> 
> > Thank you,
> > 
> > 
> > > 
> > > Add a register_name (reg_name) to the user_event struct which allows for
> > > split naming of events. We now have the name that was used to register
> > > within user_events as well as the unique name for the tracepoint. Upon
> > > registering events, matching is done first on the reg_name, followed
> > > by the fields and format of the event. This allows for multiple events
> > > with the same registered name to have different formats. The underlying
> > > tracepoint will have a unique name in the format of 
> > > {reg_name}:[unique_id].
> > > The unique_id is the time, in nanoseconds, of the event creation converted
> > > to hex. Since this is done under the register mutex, it is extremely
> > > unlikely for these IDs to ever match. It's also very unlikely a malicious
> > > program could consistently guess what the name would be and attempt to
> > > 
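The naming scheme described in the cover letter above can be sketched as follows. This is illustrative only: the event name and the sample timestamp are hypothetical, and the kernel derives unique_id from the creation time in nanoseconds taken under the register mutex.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the tracepoint name {reg_name}:[unique_id]; unique_id is the
 * event-creation time in nanoseconds rendered as hex. */
static void multi_format_name(char *buf, size_t len,
			      const char *reg_name,
			      unsigned long long creation_ns)
{
	snprintf(buf, len, "%s:[%llx]", reg_name, creation_ns);
}
```

Registering a hypothetical "myevent" whose creation time was 0x17a3b5 ns would yield the tracepoint name "myevent:[17a3b5]"; two versions of the same registered name get distinct tracepoints.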

Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test

2024-02-01 Thread Jason Wang
On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin  wrote:
>
> introduce vhost_net_test based on virtio_test to test
> vhost_net changes in the kernel.

Let's describe what kind of test is being done and how it is done here.

>
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/virtio/.gitignore   |   1 +
>  tools/virtio/Makefile |   8 +-
>  tools/virtio/vhost_net_test.c | 576 ++
>  3 files changed, 582 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
> index 9934d48d9a55..7e47b281c442 100644
> --- a/tools/virtio/.gitignore
> +++ b/tools/virtio/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  *.d
>  virtio_test
> +vhost_net_test
>  vringh_test
>  virtio-trace/trace-agent
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;  \
> if ($(1)) >/dev/null 2>&1;  \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -  vhost_test/Module.symvers vhost_test/modules.order *.d
> +   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +  vhost_test/.*.cmd vhost_test/Module.symvers \
> +  vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index ..e336792a0d77
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,576 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RANDOM_BATCH   -1
> +#define HDR_LEN   12
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE ETH_P_LOOPBACK
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +   int kick;
> +   int call;
> +   int idx;
> +   long started;
> +   long completed;
> +   struct pollfd fds;
> +   void *ring;
> +   /* copy used for control */
> +   struct vring vring;
> +   struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +   struct virtio_device vdev;
> +   int control;
> +   struct vq_info vqs[2];
> +   int nvqs;
> +   void *buf;
> +   size_t buf_size;
> +   char *test_buf;
> +   char *res_buf;
> +   struct vhost_memory *mem;
> +   int sock;
> +   int ifindex;
> +   unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +   int len = HDR_LEN;

Any reason you can't just use the virtio_net uapi?

> +   int fd, e;
> +
> +   fd = open("/dev/net/tun", O_RDWR);
> +   if (fd < 0) {
> +   perror("Cannot open /dev/net/tun");
> +   return fd;
> +   }
> +
> +   memset(&ifr, 0, sizeof(ifr));
> +
> +   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +
> +   e = ioctl(fd, TUNSETIFF, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETIFF]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETVNETHDRSZ]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[SIOCGIFHWADDR]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +   return fd;
> +}
> +
> +static void vdev_create_socket(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +
> +   dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> +   assert(dev->sock != -1);
> +
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());

Nit: it might be better to accept the device name instead of repeating
the snprintf trick here; this would facilitate future changes.

> +   assert(ioctl(dev->sock, 

Re: [PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-02-01 Thread Peter Collingbourne
On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
 wrote:
>
> Linux restores tags when a page is swapped in and there are tags associated
> with the swap entry which the new page will replace. The saved tags are
> restored even if the page will not be mapped as tagged, to protect against
> cases where the page is shared between different VMAs, and is tagged in
> some, but untagged in others. By using this approach, the process can still
> access the correct tags following an mprotect(PROT_MTE) on the non-MTE
> enabled VMA.
>
> But this poses a challenge for managing tag storage: in the scenario above,
> when a new page is allocated to be swapped in for the process where it will
> be mapped as untagged, the corresponding tag storage block is not reserved.
> mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
> overwrite data in the tag storage block associated with the new page,
> leading to data corruption if the block is in use by a process.
>
> Get around this issue by saving the tags in a new xarray, this time indexed
> by the page pfn, and then restoring them when tag storage is reserved for
> the page.
>
> Signed-off-by: Alexandru Elisei 
> ---
>
> Changes since rfc v2:
>
> * Restore saved tags **before** setting the PG_tag_storage_reserved bit to
> eliminate a brief window of opportunity where userspace can access 
> uninitialized
> tags (Peter Collingbourne).
>
>  arch/arm64/include/asm/mte_tag_storage.h |   8 ++
>  arch/arm64/include/asm/pgtable.h |  11 +++
>  arch/arm64/kernel/mte_tag_storage.c  |  12 ++-
>  arch/arm64/mm/mteswap.c  | 110 +++
>  4 files changed, 140 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h 
> b/arch/arm64/include/asm/mte_tag_storage.h
> index 50bdae94cf71..40590a8c3748 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
>
>  vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct 
> vm_fault *vmf,
> bool *map_pte);
> +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
> +
> +void tags_by_pfn_lock(void);
> +void tags_by_pfn_unlock(void);
> +
> +void *mte_erase_tags_for_pfn(unsigned long pfn);
> +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
> +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
>  #else
>  static inline bool tag_storage_enabled(void)
>  {
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 0174e292f890..87ae59436162 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
> mte_invalidate_tags_area_by_swp_entry(type);
>  }
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
> +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
> + struct folio *folio)
> +{
> +   if (tag_storage_enabled())
> +   return mte_try_transfer_swap_tags(entry, &folio->page);
> +   return 0;
> +}
> +#endif
> +
>  #define __HAVE_ARCH_SWAP_RESTORE
>  static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>  {
> diff --git a/arch/arm64/kernel/mte_tag_storage.c 
> b/arch/arm64/kernel/mte_tag_storage.c
> index afe2bb754879..ac7b9c9c585c 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, 
> gfp_t gfp)
> }
> }
>
> +   mte_restore_tags_for_pfn(page_to_pfn(page), order);
> page_set_tag_storage_reserved(page, order);
>  out_unlock:
> mutex_unlock(&tag_blocks_lock);
> @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
> struct tag_region *region;
> unsigned long page_va;
> unsigned long flags;
> -   int ret;
> +   void *tags;
> +   int i, ret;
>
> ret = tag_storage_find_block(page, &start_block, &region);
> if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", 
> page_to_pfn(page)))
> @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
> /* Avoid writeback of dirty tag cache lines corrupting data. */
> dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
>
> +   tags_by_pfn_lock();
> +   for (i = 0; i < (1 << order); i++) {
> +   tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
> +   if (unlikely(tags))
> +   mte_free_tag_buf(tags);
> +   }
> +   tags_by_pfn_unlock();
> +
> end_block = start_block + order_to_num_blocks(order, 
> region->block_size_pages);
>
> xa_lock_irqsave(_blocks_reserved, 

Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation

2024-02-01 Thread Yunsheng Lin
On 2024/2/1 21:16, Paolo Abeni wrote:
> On Tue, 2024-01-30 at 19:37 +0800, Yunsheng Lin wrote:
>> Currently there seem to be three page frag implementations
>> which all try to allocate an order 3 page; if that fails, they
>> then fall back to allocating an order 0 page, and each of them
>> allows the order 3 page allocation to fail under certain
>> conditions by using specific gfp bits.
>>
>> The gfp bits for order 3 page allocation differ
>> between the implementations: __GFP_NOMEMALLOC is
>> or'd to forbid access to emergency reserve memory for
>> __page_frag_cache_refill(), but it is not or'd in the other
>> implementations; __GFP_DIRECT_RECLAIM is masked off to avoid
>> direct reclaim in skb_page_frag_refill(), but it is not
>> masked off in __page_frag_cache_refill().
>>
>> This patch unifies the gfp bits used between the different
>> implementations by or'ing __GFP_NOMEMALLOC and masking off
>> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
>> possible pressure on mm.
>>
>> Signed-off-by: Yunsheng Lin 
>> Reviewed-by: Alexander Duyck 
>> CC: Alexander Duyck 
>> ---
>>  drivers/vhost/net.c | 2 +-
>>  mm/page_alloc.c | 4 ++--
>>  net/core/sock.c | 2 +-
>>  3 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index f2ed7167c848..e574e21cc0ca 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
>> *net, unsigned int sz,
>>  /* Avoid direct reclaim but allow kswapd to wake */
>>  pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>__GFP_COMP | __GFP_NOWARN |
>> -  __GFP_NORETRY,
>> +  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>SKB_FRAG_PAGE_ORDER);
> 
>>  if (likely(pfrag->page)) {
>>  pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c0f7e67c4250..636145c29f70 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct 
>> page_frag_cache *nc,
>>  gfp_t gfp = gfp_mask;
>>  
>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>> -gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
>> -__GFP_NOMEMALLOC;
>> +gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
>> +   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>>  page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>>  PAGE_FRAG_CACHE_MAX_ORDER);
>>  nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 88bf810394a5..8289a3d8c375 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
>> page_frag *pfrag, gfp_t gfp)
>>  /* Avoid direct reclaim but allow kswapd to wake */
>>  pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
>>__GFP_COMP | __GFP_NOWARN |
>> -  __GFP_NORETRY,
>> +  __GFP_NORETRY | __GFP_NOMEMALLOC,
>>SKB_FRAG_PAGE_ORDER);
> 
> This will prevent memory reserve usage when allocating order 3 pages,
> but not when allocating a single page as a fallback. Still different

More accurately, the above ensures the memory reserve is never used
for order 3 pages; whether the memory reserve is used for order 0 pages
depends on the original 'gfp' flags: if 'gfp' does not have the
__GFP_NOMEMALLOC bit set, the memory reserve may still be used for
order 0 pages.
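The flag handling being discussed can be sketched like this. The bit values are illustrative placeholders, not the real kernel gfp constants; only the shape of the masking matters.

```c
#include <assert.h>

/* Placeholder gfp bit values -- illustrative only */
#define __GFP_DIRECT_RECLAIM 0x001u
#define __GFP_COMP           0x002u
#define __GFP_NOWARN         0x004u
#define __GFP_NORETRY        0x008u
#define __GFP_NOMEMALLOC     0x010u

/* Order-3 attempt: never touch the reserves, never direct-reclaim */
static unsigned int order3_gfp(unsigned int gfp)
{
	return (gfp & ~__GFP_DIRECT_RECLAIM) |
	       __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
}

/* Order-0 fallback: keep the caller's flags, so reserve access depends
 * on whether the caller itself passed __GFP_NOMEMALLOC */
static unsigned int order0_gfp(unsigned int gfp)
{
	return gfp;
}
```

Only the order 3 attempt gets the stricter bits; the order 0 fallback keeps whatever the caller asked for, which is exactly the distinction being drawn in this exchange.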

> from the __page_frag_cache_refill() allocator - which never accesses
> the memory reserves.

I am not really sure I understand the above comment.
As I understand it, the semantics are the same as skb_page_frag_refill(),
as explained above. Note that __page_frag_cache_refill() uses 'gfp_mask'
for allocating order 3 pages and the original 'gfp' for allocating
order 0 pages.

> 
> I'm unsure we want to propagate the __page_frag_cache_refill behavior
> here, the current behavior could be required by some systems.
> 
> It looks like this series still leave the skb_page_frag_refill()
> allocator alone, what about dropping this chunk, too? 

As explained above, I would prefer to keep it as it is, since it seems
quite obvious that we can avoid possible pressure on mm by not using
the memory reserve for order 3 pages, given that we have the fallback
to order 0 pages.

Please let me know if there is anything obvious I missed.

> 
> Thanks!
> 
> Paolo
> 
> 
> .
> 



Re: Boot-time dumping of ftrace functiongraph buffer

2024-02-01 Thread Steven Rostedt
On Thu, 1 Feb 2024 13:21:37 +0100
Ahmad Fatoum  wrote:

> Hello,
> 
> I semi-regularly debug probe failures. For drivers that use dev_err_probe
> rigorously, this is a quick matter: The probe function records a deferral
> reason and if the deferral persists, deferred_probe_timeout_work_func() will
> print the collected reasons, even if PID 1 is never started.
> 
> For drivers that don't call dev_err_probe, I find myself sometimes doing
> printf debugging inside the probe function.

Is the driver built in or started after init?

> 
> I would like to replace this with the function graph tracer:
> 
>   - record the probe function, configured over kernel command line
> (The device indefinitely deferring probe is printed to the console,
>  so I know what I am looking for on the next boot)
> 
>   - Dump the function graph trace
> 
>   - See if the last call before (non-devm) cleanup is getting a clock, a GPIO,
> a regulator or w/e.
> 
> For this to be maximally useful, I need to configure this not only at
> boot-time, but also dump the ftrace buffer at boot time. Probe deferral can
> hinder the kernel from calling init and providing a shell, where I could
> read /sys/kernel/tracing/trace.

OK so the driver is built in.

> 
> I found following two mechanisms that looked relevant, but seem not to
> do exactly what I want:
> 
>   - tp_printk: seems to be related to trace points only and not usable
> for the function graph output
> 
>   - dump_on_oops: I don't get an Oops if probe deferral times out, but maybe
>     one could patch the kernel to check an oops_on_probe_deferral or
>     dump_on_probe_deferral kernel command line parameter in
>     deferred_probe_timeout_work_func()?
> 
> 
> Is there existing support that I am missing? Any input on whether this
> would be a welcome feature to have?

Well, you can start function_graph on the kernel command line and even
filter on a given function:

 ftrace=function_graph function_graph_filter=probe_func

You can add your own ftrace_dump() call on some kind of detected error and
control it from the kernel command line. For example, RCU has:

  rcupdate.rcu_cpu_stall_ftrace_dump=

Which will do a ftrace dump when a RCU stall is triggered.

-- Steve




Re: [PATCH 0/4] apply page shift to PFN instead of VA in pfn_to_virt

2024-02-01 Thread Yan Zhao
On Thu, Feb 01, 2024 at 06:46:46AM +0100, Arnd Bergmann wrote:
> On Thu, Feb 1, 2024, at 01:01, Yan Zhao wrote:
> > On Wed, Jan 31, 2024 at 12:48:38PM +0100, Arnd Bergmann wrote:
> >> On Wed, Jan 31, 2024, at 06:51, Yan Zhao wrote:
> >> 
> >> How exactly did you notice the function being wrong,
> >> did you try to add a user somewhere, or just read through
> >> the code?
> > I came across them when I was debugging an unexpected kernel page fault
> > on x86, and I was not sure whether page_to_virt() was compiled in
> > asm-generic/page.h or linux/mm.h.
> > Though finally it turned out that the one in linux/mm.h was used, which
> > yielded the right result, and the unexpected kernel page fault in my case
> > was not related to page_to_virt(). Still, it led me to notice that the
> > pfn_to_virt() in asm-generic/page.h and 3 other archs did not look right.
> >
> > Yes, unlike virt_to_pfn(), which still has a caller in openrisc (among
> > csky, Hexagon, and openrisc), pfn_to_virt() now has no callers in the
> > 3 archs. Though both virt_to_pfn() and pfn_to_virt() are referenced
> > in asm-generic/page.h, I am also not sure whether we need to remove
> > asm-generic/page.h, which may serve as a template for future archs?
> >
> > So, either way looks good to me :)
> 
> I think it's fair to assume we won't need asm-generic/page.h any
> more, as we likely won't be adding new NOMMU architectures.
> I can have a look myself at removing any such unused headers in
> include/asm-generic/, it's probably not the only one.
> 
> Can you just send a patch to remove the unused pfn_to_virt()
> functions?
Ok. I'll do it!
BTW: do you think it's also good to keep this fixing series even though
we'll remove the unused function later?
That way, if people want to revert the removal some day, they can get the
right one.

Or maybe I'm just paranoid, and an explanation of the fix in the commit
message of the removal patch is enough.

What's your preference? :)



Re: [PATCH v8 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> Implement the reclamation flow for cgroup, encapsulated in the top-level
> function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk on its
> subtree, and makes calls to sgx_reclaim_pages() at each node, passing in
> the LRU of that node. It keeps track of the total reclaimed pages and the
> pages left to attempt. It stops the walk once the desired number of pages
> has been attempted.
>
> In some contexts, e.g. page fault handling, only asynchronous
> reclamation is allowed. Create a work-queue, corresponding work item and
> function definitions to support the asynchronous reclamation. Both
> synchronous and asynchronous flows invoke the same top level reclaim
> function, and will be triggered later by sgx_epc_cgroup_try_charge()
> when usage of the cgroup is at or near its limit.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V8:
> - Remove alignment for substructure variables. (Jarkko)
>
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 174 ++-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |   3 +
>  2 files changed, 176 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> index eac8548164de..8858a0850f8a 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -7,9 +7,173 @@
>  
>  static struct sgx_epc_cgroup epc_cg_root;
>  
> +static struct workqueue_struct *sgx_epc_cg_wq;

I'd document this with a comment.

Missed this for "epc_cg_root", which should have the same treatment.

Just brief 1-2 line thing, no need for prose here. Something to remind
what it is later on.

> +
> +static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / 
> PAGE_SIZE;
> +}
> +
> +static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
> +{
> + return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
> +}
> +
> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors.  Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup 
> is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + struct misc_cg *i = epc_cg->cg;
> + u64 m = U64_MAX;
> +
> + while (i) {
> + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> + i = misc_cg_parent(i);
> + }
> +
> + return m / PAGE_SIZE;
> +}
> +
>  /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its 
> LRUs
> + * @root: Root of the tree to check
>   *
> + * Return: %true if all cgroups under the specified root have empty LRU 
> lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + bool ret = true;
> +
> + /*
> +  * Caller ensure css_root ref acquired
> +  */
> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&epc_cg->lru.lock);
> + ret = list_empty(&epc_cg->lru.reclaimable);
> + spin_unlock(&epc_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to 
> reclaim pages
> + * @root: Root of the tree to start walking
> + * Return:   Number of pages reclaimed.
> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> + /*
> +  * Attempting to reclaim only a few pages will often fail and is
> +  * inefficient, while reclaiming a huge number of pages can result in
> +  * soft lockups due to holding various locks for an extended duration.
> +  */
> + unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> +

Re: [PATCH v8 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Sean Christopherson 
>
> Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
> When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
> from its LRU or LRUs of its descendants to make room for any new
> allocations.
>
> To prepare for reclamation per cgroup, expose the top level reclamation
> function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
> to the function to pass in an LRU so cgroups can pass in different
> tracking LRUs later.  Add another parameter for passing in the number of
> pages to scan and make the function return the number of pages reclaimed
> as a cgroup reclaimer may need to track reclamation progress from its
> descendants, change number of pages to scan in subsequent calls.
>
> Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(),
> to just call this function with the global LRU passed in. When
> per-cgroup LRU is added later, the wrapper will perform global
> reclamation from the root cgroup.
>
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V8:
> - Use width of 80 characters in text paragraphs. (Jarkko)
>
> V7:
> - Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
> function". Do not split the top level function (Kai)
> - Dropped patches 7 and 8 of V6.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 53 +++---
>  arch/x86/kernel/cpu/sgx/sgx.h  |  1 +
>  2 files changed, 37 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index a131aa985c95..4f5824c4751d 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -286,11 +286,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
> *epc_page,
>   mutex_unlock(&encl->lock);
>  }
>  
> -/*
> - * Take a fixed number of pages from the head of the active page pool and
> - * reclaim them to the enclave's private shmem files. Skip the pages, which 
> have
> - * been accessed since the last scan. Move those pages to the tail of active
> - * page pool so that the pages get scanned in LRU like fashion.
> +/**
> + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
> + *
> + * Take a fixed number of pages from the head of a given LRU and reclaim 
> them to
> + * the enclave's private shmem files. Skip the pages, which have been 
> accessed
> + * since the last scan. Move those pages to the tail of the list so that the
> + * pages get scanned in LRU like fashion.
>   *
>   * Batch process a chunk of pages (at the moment 16) in order to degrade 
> amount
>   * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a 
> bit
> @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
> *epc_page,
>   * + EWB) but not sufficiently. Reclaiming one page at a time would also be
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
> + *
> + * @lru: The LRU from which pages are reclaimed.
> + * @nr_to_scan: Pointer to the target number of pages to scan, must be less 
> than
> + *   SGX_NR_TO_SCAN.
> + * Return:   Number of pages reclaimed.
>   */
> -static void sgx_reclaim_pages(void)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int 
> *nr_to_scan)
>  {
>   struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>   struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
>   int ret;
>   int i;
>  
> - spin_lock(&sgx_global_lru.lock);
> - for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> - struct sgx_epc_page, list);
> + spin_lock(&lru->lock);
> +
> + for (; *nr_to_scan > 0; --(*nr_to_scan)) {
> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct 
> sgx_epc_page, list);
>   if (!epc_page)
>   break;
>  
> @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
>*/
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>   }
> - spin_unlock(&sgx_global_lru.lock);
> +
> + spin_unlock(&lru->lock);
>  
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
> @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
>   continue;
>  
>  skip:
> - spin_lock(&sgx_global_lru.lock);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> - spin_unlock(&sgx_global_lru.lock);
> + spin_lock(&lru->lock);
> + list_add_tail(&epc_page->list, &lru->reclaimable);
> + spin_unlock(&lru->lock);
>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);

Re: [PATCH v8 06/15] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
> of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
> allocated page into the global LRU list while
> sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
> global LRU references in these functions to make them reusable when
> pages are tracked in per-cgroup LRUs.
>
> Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
> EPC page. It simply returns the global LRU now, and will later return
> the LRU of the cgroup within which the EPC page was allocated. Replace
> the hard coded global LRU with a call to this helper.
>
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V7:
> - Split this out from the big patch, #10 in V6. (Dave, Kai)
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 912959c7ecc9..a131aa985c95 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>   */
>  static struct sgx_epc_lru_list sgx_global_lru;
>  
> +static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
> *epc_page)
> +{
> + return &sgx_global_lru;
> +}
> +
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
>  /* Nodes with one or more EPC sections. */
> @@ -500,25 +505,24 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  }
>  
>  /**
> - * sgx_mark_page_reclaimable() - Mark a page as reclaimable
> + * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in 
> a LRU.
>   * @page: EPC page
> - *
> - * Mark a page as reclaimable and add it to the active page list. Pages
> - * are automatically removed from the active list when freed.
>   */
>  void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  {
> - spin_lock(&sgx_global_lru.lock);
> + struct sgx_epc_lru_list *lru = sgx_lru_list(page);
> +
> + spin_lock(&lru->lock);
>   page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> - list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> - spin_unlock(&sgx_global_lru.lock);
> + list_add_tail(&page->list, &lru->reclaimable);
> + spin_unlock(&lru->lock);
>  }
>  
>  /**
> - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
> + * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
>   * @page: EPC page
>   *
> - * Clear the reclaimable flag and remove the page from the active page list.
> + * Clear the reclaimable flag if set and remove the page from its LRU.
>   *
>   * Return:
>   *   0 on success,
> @@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page 
> *page)
>   */
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
> - spin_lock(&sgx_global_lru.lock);
> + struct sgx_epc_lru_list *lru = sgx_lru_list(page);
> +
> + spin_lock(&lru->lock);
>   if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
>   /* The page is being reclaimed. */
>   if (list_empty(&page->list)) {
> - spin_unlock(&sgx_global_lru.lock);
> + spin_unlock(&lru->lock);
>   return -EBUSY;
>   }
>  
>   list_del(&page->list);
>   page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>   }
> - spin_unlock(&sgx_global_lru.lock);
> + spin_unlock(&lru->lock);
>  
>   return 0;
>  }

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko



Re: [PATCH v8 05/15] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Sean Christopherson 
>
> Introduce a data structure to wrap the existing reclaimable list and its
> spinlock. Each cgroup later will have one instance of this structure to
> track EPC pages allocated for processes associated with the same cgroup.
> Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
> from the reclaimable list in this structure when its usage reaches near
> its limit.
>
> Use this structure to encapsulate the LRU list and its lock used by the
> global reclaimer.
>
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 

I'd put author as last sob but that said I'm not sure if there is rigid
rule to do so. Thus not saying must here.

> Cc: Sean Christopherson 
> ---
> V6:
> - removed introduction to unreclaimables in commit message.
>
> V4:
> - Removed unneeded comments for the spinlock and the non-reclaimables.
> (Kai, Jarkko)
> - Revised the commit to add introduction comments for unreclaimables and
> multiple LRU lists.(Kai)
> - Reordered the patches: delay all changes for unreclaimables to
> later, and this one becomes the first change in the SGX subsystem.
>
> V3:
> - Removed the helper functions and revised commit messages.
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 39 +-
>  arch/x86/kernel/cpu/sgx/sgx.h  | 15 +
>  2 files changed, 35 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index c32f18b70c73..912959c7ecc9 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
>  
>  /*
>   * These variables are part of the state of the reclaimer, and must be 
> accessed
> - * with sgx_reclaimer_lock acquired.
> + * with sgx_global_lru.lock acquired.
>   */
> -static LIST_HEAD(sgx_active_page_list);
> -static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> +static struct sgx_epc_lru_list sgx_global_lru;
>  
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>  
> @@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
>   int ret;
>   int i;
>  
> - spin_lock(&sgx_reclaimer_lock);
> + spin_lock(&sgx_global_lru.lock);
>   for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> - if (list_empty(&sgx_active_page_list))
> + epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> + struct sgx_epc_page, list);
> + if (!epc_page)
>   break;
>  
> - epc_page = list_first_entry(&sgx_active_page_list,
> - struct sgx_epc_page, list);
>   list_del_init(&epc_page->list);
>   encl_page = epc_page->owner;
>  
> @@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
>*/
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>   }
> - spin_unlock(&sgx_reclaimer_lock);
> + spin_unlock(&sgx_global_lru.lock);
>  
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
> @@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
>   continue;
>  
>  skip:
> - spin_lock(&sgx_reclaimer_lock);
> - list_add_tail(&epc_page->list, &sgx_active_page_list);
> - spin_unlock(&sgx_reclaimer_lock);
> + spin_lock(&sgx_global_lru.lock);
> + list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> + spin_unlock(&sgx_global_lru.lock);
>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  
> @@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
>  static bool sgx_should_reclaim(unsigned long watermark)
>  {
>   return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> -!list_empty(&sgx_active_page_list);
> +!list_empty(&sgx_global_lru.reclaimable);
>  }
>  
>  /*
> @@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
>  
>   ksgxd_tsk = tsk;
>  
> + sgx_lru_init(&sgx_global_lru);
> +
>   return true;
>  }
>  
> @@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>   */
>  void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  {
> - spin_lock(&sgx_reclaimer_lock);
> + spin_lock(&sgx_global_lru.lock);
>   page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> - list_add_tail(&page->list, &sgx_active_page_list);
> - spin_unlock(&sgx_reclaimer_lock);
> + list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> + spin_unlock(&sgx_global_lru.lock);
>  }
>  
>  /**
> @@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page 
> *page)
>   */
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
> - spin_lock(&sgx_reclaimer_lock);
> + spin_lock(&sgx_global_lru.lock);
>   if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> 

Re: [PATCH v8 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments.  For
> example, in a Kubernetes environment, a user can request certain EPC
> quota for a pod but the orchestrator can not enforce the quota to limit
> runtime EPC usage of the pod without an EPC cgroup controller.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
>
> With these changes, the misc cgroup controller enables user to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V8:
> - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
> - Remove extra space, '_INTEL'. (Jarkko)
>
> V7:
> - Use a static for root cgroup (Kai)
> - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
> - Correct check for charge API return (Kai)
> - Start initialization in SGX device driver init (Kai)
> - Remove unneeded BUG_ON (Kai)
> - Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
>
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller"  and restructure it (Kai)
> ---
>  arch/x86/Kconfig | 13 +
>  arch/x86/kernel/cpu/sgx/Makefile |  1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 73 
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 
>  arch/x86/kernel/cpu/sgx/main.c   | 52 +++-
>  arch/x86/kernel/cpu/sgx/sgx.h|  5 ++
>  include/linux/misc_cgroup.h  |  2 +
>  7 files changed, 217 insertions(+), 2 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..10c3d1d099b2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1947,6 +1947,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for 
> Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile 
> b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>   ioctl.o \
>   main.o
>  obj-$(CONFIG_X86_SGX_KVM)+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index ..eac8548164de
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,73 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +static struct sgx_epc_cgroup epc_cg_root;
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> 

Re: [PATCH v8 01/15] cgroup/misc: Add per resource callbacks for CSS events

2024-02-01 Thread Jarkko Sakkinen
On Tue Jan 30, 2024 at 4:09 AM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.
>
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V8:
> - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)
> V7:
> - Make ops one per resource type and store them in array (Michal)
> - Rename the ops struct to misc_res_ops, and enforce the constraints of 
> required callback
> functions (Jarkko)
> - Moved addition of priv field to patch 4 where it was used first. (Jarkko)
>
> V6:
> - Create ops struct for per resource callbacks (Jarkko)
> - Drop max_write callback (Dave, Michal)
> - Style fixes (Kai)

This version looks nice and smooth:

Reviewed-by: Jarkko Sakkinen 

BR, Jarkko



Re: [PATCH v7 04/15] x86/sgx: Implement basic EPC misc cgroup functionality

2024-02-01 Thread Jarkko Sakkinen
On Wed Jan 24, 2024 at 5:29 AM EET, Haitao Huang wrote:
> On Mon, 22 Jan 2024 14:25:53 -0600, Jarkko Sakkinen   
> wrote:
>
> > On Mon Jan 22, 2024 at 7:20 PM EET, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi 
> >>
> >> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> >> RAM allocations, and are managed solely by the SGX subsystem. The
> >> existing cgroup memory controller cannot be used to limit or account for
> >> SGX EPC memory, which is a desirable feature in some environments.  For
> >> example, in a Kubernetes environment, a user can request certain EPC
> >> quota for a pod but the orchestrator can not enforce the quota to limit
> >> runtime EPC usage of the pod without an EPC cgroup controller.
> >>
> >> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> >> limit and track EPC allocations per cgroup. Earlier patches have added
> >> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> >> support in SGX driver as the "sgx_epc" resource provider:
> >>
> >> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> >> - Update EPC usage counter, "current", by calling charge and uncharge
> >> APIs for EPC allocation and deallocation, respectively.
> >> - Setup sgx_epc resource type specific callbacks, which perform
> >> initialization and cleanup during cgroup allocation and deallocation,
> >> respectively.
> >>
> >> With these changes, the misc cgroup controller enables user to set a  
> >> hard
> >> limit for EPC usage in the "misc.max" interface file. It reports current
> >> usage in "misc.current", the total EPC memory available in
> >> "misc.capacity", and the number of times EPC usage reached the max limit
> >> in "misc.events".
> >>
> >> For now, the EPC cgroup simply blocks additional EPC allocation in
> >> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> >> still tracked in the global active list, only reclaimed by the global
> >> reclaimer when the total free page count is lower than a threshold.
> >>
> >> Later patches will reorganize the tracking and reclamation code in the
> >> global reclaimer and implement per-cgroup tracking and reclaiming.
> >>
> >> Co-developed-by: Sean Christopherson 
> >> Signed-off-by: Sean Christopherson 
> >> Signed-off-by: Kristen Carlson Accardi 
> >> Co-developed-by: Haitao Huang 
> >> Signed-off-by: Haitao Huang 
> >
> > For consistency sake I'd also add co-developed-by for Kristen. This is
> > at least the format suggested by kernel documentation.
> >
> She is the "From Author", so only Signed-off-by is needed for her  
> according to the second example in the doc[1]?

Right but okay then Kristen afaik  be the last sob so only re-ordering
is needed I guess.

>
> Thanks
> Haitao
> [1]https://docs.kernel.org/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by

BR, Jarkko



Re: [PATCH v3 00/47] filelock: split file leases out of struct file_lock

2024-02-01 Thread NeilBrown
On Thu, 01 Feb 2024, Jeff Layton wrote:
> I'm not sure this is much prettier than the last, but contracting
> "fl_core" to "c", as Neil suggested is a bit easier on the eyes.
> 
> I also added a few small helpers and converted several users over to
> them. That reduces the size of the per-fs conversion patches later in
> the series. I played with some others too, but they were too awkward
> or not frequently used enough to make it worthwhile.
> 
> Many thanks to Chuck and Neil for the earlier R-b's and comments. I've
> dropped those for now since this set is a bit different from the last.
> 
> I'd like to get this into linux-next soon and we can see about merging
> it for v6.9, unless anyone has major objections.

For all patches:
  Reviewed-by: NeilBrown 

Thanks Jeff - I think this is a good and useful change and while it
might not all be as pretty as I might like, I don't see any way to
improve it and think it is certainly good enough to merge.

I think the conversion from "fl_core" to "c" does work well enough.
I particularly like how the removal of the IS_* macros (patch 16) turned
out.  The inline expansion of IS_LEASE() in particular makes the code
clearer to me.

Thanks,
NeilBrown


> 
> Thanks!
> 
> Signed-off-by: Jeff Layton 
> ---
> Changes in v3:
> - Rename "flc_core" fields in file_lock and file_lease to "c"
> - new helpers: locks_wake_up, for_each_file_lock, and 
> lock_is_{unlock,read,write}
> - Link to v2: 
> https://lore.kernel.org/r/20240125-flsplit-v2-0-7485322b6...@kernel.org
> 
> Changes in v2:
> - renamed file_lock_core fields to have "flc_" prefix
> - used macros to more easily do the change piecemeal
> - broke up patches into per-subsystem ones
> - Link to v1: 
> https://lore.kernel.org/r/20240116-flsplit-v1-0-c9d0f4370...@kernel.org
> 
> ---
> Jeff Layton (47):
>   filelock: fl_pid field should be signed int
>   filelock: rename some fields in tracepoints
>   filelock: rename fl_pid variable in lock_get_status
>   filelock: add some new helper functions
>   9p: rename fl_type variable in v9fs_file_do_lock
>   afs: convert to using new filelock helpers
>   ceph: convert to using new filelock helpers
>   dlm: convert to using new filelock helpers
>   gfs2: convert to using new filelock helpers
>   lockd: convert to using new filelock helpers
>   nfs: convert to using new filelock helpers
>   nfsd: convert to using new filelock helpers
>   ocfs2: convert to using new filelock helpers
>   smb/client: convert to using new filelock helpers
>   smb/server: convert to using new filelock helpers
>   filelock: drop the IS_* macros
>   filelock: split common fields into struct file_lock_core
>   filelock: have fs/locks.c deal with file_lock_core directly
>   filelock: convert more internal functions to use file_lock_core
>   filelock: make posix_same_owner take file_lock_core pointers
>   filelock: convert posix_owner_key to take file_lock_core arg
>   filelock: make locks_{insert,delete}_global_locks take file_lock_core 
> arg
>   filelock: convert locks_{insert,delete}_global_blocked
>   filelock: make __locks_delete_block and __locks_wake_up_blocks take 
> file_lock_core
>   filelock: convert __locks_insert_block, conflict and deadlock checks to 
> use file_lock_core
>   filelock: convert fl_blocker to file_lock_core
>   filelock: clean up locks_delete_block internals
>   filelock: reorganize locks_delete_block and __locks_insert_block
>   filelock: make assign_type helper take a file_lock_core pointer
>   filelock: convert locks_wake_up_blocks to take a file_lock_core pointer
>   filelock: convert locks_insert_lock_ctx and locks_delete_lock_ctx
>   filelock: convert locks_translate_pid to take file_lock_core
>   filelock: convert seqfile handling to use file_lock_core
>   9p: adapt to breakup of struct file_lock
>   afs: adapt to breakup of struct file_lock
>   ceph: adapt to breakup of struct file_lock
>   dlm: adapt to breakup of struct file_lock
>   gfs2: adapt to breakup of struct file_lock
>   fuse: adapt to breakup of struct file_lock
>   lockd: adapt to breakup of struct file_lock
>   nfs: adapt to breakup of struct file_lock
>   nfsd: adapt to breakup of struct file_lock
>   ocfs2: adapt to breakup of struct file_lock
>   smb/client: adapt to breakup of struct file_lock
>   smb/server: adapt to breakup of struct file_lock
>   filelock: remove temporary compatibility macros
>   filelock: split leases out of struct file_lock
> 
>  fs/9p/vfs_file.c|  40 +-
>  fs/afs/flock.c  |  60 +--
>  fs/ceph/locks.c |  74 ++--
>  fs/dlm/plock.c  |  44 +--
>  fs/fuse/file.c  |  14 +-
>  fs/gfs2/file.c  |  16 +-
>  fs/libfs.c  |   2 +-
>  fs/lockd/clnt4xdr.c |  14 +-
>  

Re: [PATCH v3 34/47] 9p: adapt to breakup of struct file_lock

2024-02-01 Thread NeilBrown
On Thu, 01 Feb 2024, Jeff Layton wrote:
> Most of the existing APIs have remained the same, but subsystems that
> access file_lock fields directly need to reach into struct
> file_lock_core now.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/9p/vfs_file.c | 39 +++
>  1 file changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index a1dabcf73380..abdbbaee5184 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -9,7 +9,6 @@
>  #include 
>  #include 
>  #include 
> -#define _NEED_FILE_LOCK_FIELD_MACROS
>  #include 
>  #include 
>  #include 
> @@ -108,7 +107,7 @@ static int v9fs_file_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>  
>   p9_debug(P9_DEBUG_VFS, "filp: %p lock: %p\n", filp, fl);
>  
> - if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
> + if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->c.flc_type != F_UNLCK) {

Should this have already been changed to lock_is_unlock() ???

NeilBrown


>   filemap_write_and_wait(inode->i_mapping);
>   invalidate_mapping_pages(&inode->i_data, 0, -1);
>   }
> @@ -127,7 +126,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   fid = filp->private_data;
>   BUG_ON(fid == NULL);
>  
> - BUG_ON((fl->fl_flags & FL_POSIX) != FL_POSIX);
> + BUG_ON((fl->c.flc_flags & FL_POSIX) != FL_POSIX);
>  
>   res = locks_lock_file_wait(filp, fl);
>   if (res < 0)
> @@ -136,7 +135,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   /* convert posix lock to p9 tlock args */
>   memset(&flock, 0, sizeof(flock));
>   /* map the lock type */
> - switch (fl->fl_type) {
> + switch (fl->c.flc_type) {
>   case F_RDLCK:
>   flock.type = P9_LOCK_TYPE_RDLCK;
>   break;
> @@ -152,7 +151,7 @@ static int v9fs_file_do_lock(struct file *filp, int cmd, 
> struct file_lock *fl)
>   flock.length = 0;
>   else
>   flock.length = fl->fl_end - fl->fl_start + 1;
> - flock.proc_id = fl->fl_pid;
> + flock.proc_id = fl->c.flc_pid;
>   flock.client_id = fid->clnt->name;
>   if (IS_SETLKW(cmd))
>   flock.flags = P9_LOCK_FLAGS_BLOCK;
> @@ -207,13 +206,13 @@ static int v9fs_file_do_lock(struct file *filp, int 
> cmd, struct file_lock *fl)
>* incase server returned error for lock request, revert
>* it locally
>*/
> - if (res < 0 && fl->fl_type != F_UNLCK) {
> - unsigned char type = fl->fl_type;
> + if (res < 0 && fl->c.flc_type != F_UNLCK) {
> + unsigned char type = fl->c.flc_type;
>  
> - fl->fl_type = F_UNLCK;
> + fl->c.flc_type = F_UNLCK;
>   /* Even if this fails we want to return the remote error */
>   locks_lock_file_wait(filp, fl);
> - fl->fl_type = type;
> + fl->c.flc_type = type;
>   }
>   if (flock.client_id != fid->clnt->name)
>   kfree(flock.client_id);
> @@ -235,7 +234,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>* if we have a conflicting lock locally, no need to validate
>* with server
>*/
> - if (fl->fl_type != F_UNLCK)
> + if (fl->c.flc_type != F_UNLCK)
>   return res;
>  
>   /* convert posix lock to p9 tgetlock args */
> @@ -246,7 +245,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>   glock.length = 0;
>   else
>   glock.length = fl->fl_end - fl->fl_start + 1;
> - glock.proc_id = fl->fl_pid;
> + glock.proc_id = fl->c.flc_pid;
>   glock.client_id = fid->clnt->name;
>  
>   res = p9_client_getlock_dotl(fid, );
> @@ -255,13 +254,13 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>   /* map 9p lock type to os lock type */
>   switch (glock.type) {
>   case P9_LOCK_TYPE_RDLCK:
> - fl->fl_type = F_RDLCK;
> + fl->c.flc_type = F_RDLCK;
>   break;
>   case P9_LOCK_TYPE_WRLCK:
> - fl->fl_type = F_WRLCK;
> + fl->c.flc_type = F_WRLCK;
>   break;
>   case P9_LOCK_TYPE_UNLCK:
> - fl->fl_type = F_UNLCK;
> + fl->c.flc_type = F_UNLCK;
>   break;
>   }
>   if (glock.type != P9_LOCK_TYPE_UNLCK) {
> @@ -270,7 +269,7 @@ static int v9fs_file_getlock(struct file *filp, struct 
> file_lock *fl)
>   fl->fl_end = OFFSET_MAX;
>   else
>   fl->fl_end = glock.start + glock.length - 1;
> - fl->fl_pid = -glock.proc_id;
> + fl->c.flc_pid = -glock.proc_id;
>   }
>  out:
>   if (glock.client_id != fid->clnt->name)
> @@ -294,7 +293,7 @@ static int v9fs_file_lock_dotl(struct file *filp, int 
> cmd, struct file_lock *fl)
>   

Re: [PATCH] ARM: dts: qcom: apq8026-lg-lenok: Add vibrator support

2024-02-01 Thread Bjorn Andersson


On Sun, 21 Jan 2024 11:09:57 +0100, Luca Weiss wrote:
> This device has a vibrator attached to the CAMSS_GP0_CLK, use clk-pwm
> and pwm-vibrator to make the vibrator work.
> 
> 

Applied, thanks!

[1/1] ARM: dts: qcom: apq8026-lg-lenok: Add vibrator support
  commit: 4d679e3c29e3609962de43e51c8c1abda34e

Best regards,
-- 
Bjorn Andersson 



Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-02-01 Thread Arnaud POULIQUEN



On 2/1/24 17:02, Mathieu Poirier wrote:
> On Thu, Feb 01, 2024 at 04:06:37PM +0100, Arnaud POULIQUEN wrote:
>> hello Mathieu,
>>
>> On 1/31/24 19:52, Mathieu Poirier wrote:
>>> On Tue, Jan 30, 2024 at 10:13:48AM +0100, Arnaud POULIQUEN wrote:


 On 1/26/24 18:11, Mathieu Poirier wrote:
> On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote:
>> The new TEE remoteproc device is used to manage remote firmware in a
>> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>> introduced to delegate the loading of the firmware to the trusted
>> execution context. In such cases, the firmware should be signed and
>> adhere to the image format defined by the TEE.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> V1 to V2 update:
>> - remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected by
>>   the kernel test robot:
>>  WARNING: unmet direct dependencies detected for TEE_REMOTEPROC
>>  Depends on [n]: REMOTEPROC [=y] && OPTEE [=n]
>>  Selected by [y]:
>>  - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && 
>> REMOTEPROC [=y]
>> - Fix initialized trproc variable in  stm32_rproc_probe
>> ---
>>  drivers/remoteproc/stm32_rproc.c | 149 +--
>>  1 file changed, 144 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/remoteproc/stm32_rproc.c 
>> b/drivers/remoteproc/stm32_rproc.c
>> index fcc0001e2657..cf6a21bac945 100644
>> --- a/drivers/remoteproc/stm32_rproc.c
>> +++ b/drivers/remoteproc/stm32_rproc.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include "remoteproc_internal.h"
>> @@ -49,6 +50,9 @@
>>  #define M4_STATE_STANDBY4
>>  #define M4_STATE_CRASH  5
>>  
>> +/* Remote processor unique identifier aligned with the Trusted 
>> Execution Environment definitions */
>> +#define STM32_MP1_M4_PROC_ID0
>> +
>>  struct stm32_syscon {
>>  struct regmap *map;
>>  u32 reg;
>> @@ -90,6 +94,8 @@ struct stm32_rproc {
>>  struct stm32_mbox mb[MBOX_NB_MBX];
>>  struct workqueue_struct *workqueue;
>>  bool hold_boot_smc;
>> +bool fw_loaded;
>> +struct tee_rproc *trproc;
>>  void __iomem *rsc_va;
>>  };
>>  
>> @@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc)
>>  return err;
>>  }
>>  
>> +static int stm32_rproc_tee_elf_sanity_check(struct rproc *rproc,
>> +const struct firmware *fw)
>> +{
>> +struct stm32_rproc *ddata = rproc->priv;
>> +unsigned int ret = 0;
>> +
>> +if (rproc->state == RPROC_DETACHED)
>> +return 0;
>> +
>> +ret = tee_rproc_load_fw(ddata->trproc, fw);
>> +if (!ret)
>> +ddata->fw_loaded = true;
>> +
>> +return ret;
>> +}
>> +
>> +static int stm32_rproc_tee_elf_load(struct rproc *rproc,
>> +const struct firmware *fw)
>> +{
>> +struct stm32_rproc *ddata = rproc->priv;
>> +unsigned int ret;
>> +
>> +/*
>> + * This function can be called by remote proc for recovery
>> + * without the sanity check. In this case we need to load the 
>> firmware
>> + * else nothing done here as the firmware has been preloaded 
>> for the
>> + * sanity check to be able to parse it for the resource table.
>> + */
>
> This comment is very confusing - please consider refactoring.  
>
>> +if (ddata->fw_loaded)
>> +return 0;
>> +
>
> I'm not sure about keeping a flag to indicate the status of the loaded 
> firmware.
> It is not done for the non-secure method, I don't see why it would be 
> needed for
> the secure one.
>

 The difference is in the sanity check.
 - in rproc_elf_sanity_check we parse the ELF file to verify that it is
 valid.
 - in stm32_rproc_tee_elf_sanity_check we have to do the same, which means
 authenticating it. The authentication is done during the load.

 So this flag is used to avoid loading it twice.
 Refactoring the comment should help to understand this flag.


 An alternative would be to bypass the sanity check. But this leads to the
 same limitation.
 Before loading the firmware in remoteproc_core, we call rproc_parse_fw(),
 which is used to get the resource table address. To get it from the TEE we
 need to authenticate the firmware, so load it...

>>>
>>> I spent a long time thinking about this patchset.  Looking at the code 

Re: [PATCH v5] can: virtio: Initial virtio CAN driver.

2024-02-01 Thread Harald Mommer

Hello,

I thought there would be some more comments coming and I could address
everything in one chunk. Not the case; besides your comments, silence.


On 08.01.24 20:34, Christophe JAILLET wrote:


Hi,
a few nits below, should there be a v6.



I'm sure there will be but not so soon. Probably after acceptance of the 
virtio CAN specification or after change requests to the specification 
are received and the driver has to be adapted to an updated draft.






+static int virtio_can_alloc_tx_idx(struct virtio_can_priv *priv)
+{
+    int tx_idx;
+
+    tx_idx = ida_alloc_range(>tx_putidx_ida, 0,
+ priv->can.echo_skb_max - 1, GFP_KERNEL);
+    if (tx_idx >= 0)
+    atomic_add(1, >tx_inflight);


atomic_inc() ?



Yes, will be done, already in my local code base.





+
+    return tx_idx;
+}
+
+static void virtio_can_free_tx_idx(struct virtio_can_priv *priv,
+   unsigned int idx)
+{
+    ida_free(>tx_putidx_ida, idx);
+    atomic_sub(1, >tx_inflight);


atomic_dec() ?



See above.





+}
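[Editorial note: a rough userspace analogue of the pattern under
discussion — ida_alloc_range() paired with an inflight counter. The
sketch substitutes a tiny bitmap for the IDA and a plain int for
atomic_t; the struct and function names are invented for illustration:]

```c
#include <assert.h>

#define TX_POOL_MAX 8

/* Toy stand-in for the driver's IDA + tx_inflight pair */
struct tx_pool {
	unsigned char used[TX_POOL_MAX]; /* stand-in for the IDA */
	int inflight;                    /* stand-in for atomic_t tx_inflight */
};

/* Mirrors virtio_can_alloc_tx_idx(): grab the lowest free index and
 * bump the inflight count on success (atomic_inc() in the driver). */
static int tx_idx_alloc(struct tx_pool *p)
{
	for (int i = 0; i < TX_POOL_MAX; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			p->inflight++; /* atomic_inc(&priv->tx_inflight) */
			return i;
		}
	}
	return -1; /* pool exhausted */
}

/* Mirrors virtio_can_free_tx_idx(): release the index and drop the
 * count (atomic_dec() in the driver). */
static void tx_idx_free(struct tx_pool *p, int idx)
{
	p->used[idx] = 0;
	p->inflight--; /* atomic_dec(&priv->tx_inflight) */
}
```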


...


+static int virtio_can_probe(struct virtio_device *vdev)
+{
+    struct virtio_can_priv *priv;
+    struct net_device *dev;
+    int err;
+
+    dev = alloc_candev(sizeof(struct virtio_can_priv),
+   VIRTIO_CAN_ECHO_SKB_MAX);
+    if (!dev)
+    return -ENOMEM;
+
+    priv = netdev_priv(dev);
+
+    ida_init(>tx_putidx_ida);
+
+    netif_napi_add(dev, >napi, virtio_can_rx_poll);
+    netif_napi_add(dev, >napi_tx, virtio_can_tx_poll);
+
+    SET_NETDEV_DEV(dev, >dev);
+
+    priv->dev = dev;
+    priv->vdev = vdev;
+    vdev->priv = priv;
+
+    priv->can.do_set_mode = virtio_can_set_mode;
+    /* Set Virtio CAN supported operations */
+    priv->can.ctrlmode_supported = CAN_CTRLMODE_BERR_REPORTING;
+    if (virtio_has_feature(vdev, VIRTIO_CAN_F_CAN_FD)) {
+    err = can_set_static_ctrlmode(dev, CAN_CTRLMODE_FD);
+    if (err != 0)
+    goto on_failure;
+    }
+
+    /* Initialize virtqueues */
+    err = virtio_can_find_vqs(priv);
+    if (err != 0)
+    goto on_failure;
+
+    INIT_LIST_HEAD(>tx_list);
+
+    spin_lock_init(>tx_lock);
+    mutex_init(>ctrl_lock);
+
+    init_completion(>ctrl_done);
+
+    virtio_can_populate_vqs(vdev);
+
+    register_virtio_can_dev(dev);


Check for error?



    err = register_virtio_can_dev(dev);
    if (err) {
    virtio_can_del_vq(vdev);
    dev_err(>dev, "Couldn't register candev (err=%d)\n", err);
    goto on_failure;
    }




CJ


+
+    napi_enable(>napi);
+    napi_enable(>napi_tx);
+
+    /* Request device going live */
+    virtio_device_ready(vdev); /* Optionally done by 
virtio_dev_probe() */



The virtio_device_ready() will also be removed. The caller will make the
device live anyway after return. There are no operations between the
virtio_device_ready() and the return 0 that would make it necessary to have
the device live early, before the return.




+
+    return 0;
+
+on_failure:
+    virtio_can_free_candev(dev);
+    return err;
+}


...






Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Steven Rostedt
On Thu, 1 Feb 2024 10:12:50 -0800
Greg KH  wrote:

> And cc: stable properly?  thanks!

The script I use automatically did that ;-)

-- Steve



Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-02-01 Thread Luis Chamberlain
On Thu, Feb 01, 2024 at 12:29:46PM +0300, Andrew Kanner wrote:
> On Thu, Feb 01, 2024 at 06:29:58AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 31, 2024 at 10:02:52PM +0300, Andrew Kanner wrote:
> > > Prototype for __symbol_get_gpl() was introduced in the initial git
> > > commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), but was not used after that.
> > > 
> > > In commit 9011e49d54dc ("modules: only allow symbol_get of
> > > EXPORT_SYMBOL_GPL modules") Christoph Hellwig switched __symbol_get()
> > > to process GPL symbols only, most likely this is what
> > > __symbol_get_gpl() was designed to do.
> > > 
> > > We might either define __symbol_get_gpl() as __symbol_get() or remove
> > > it completely as suggested by Mauro Carvalho Chehab.
> > 
> > Just remove it, there is no need to keep unused funtionality around.
> > 
> > Btw, where did the discussion start?  I hope you're not trying to
> > add new symbol_get users?
> > 
> 
> Of course not, no new users needed.
> 
> I haven't discussed it directly. I found the unused __symbol_get_gpl()
> myself, but during investigation of whether it was ever used somewhere
> found the old patch series suggested by Mauro Carvalho Chehab (in Cc).
> 
> Link: 
> https://lore.kernel.org/lkml/5f001015990a76c0da35a4c3cf08e457ec353ab2.1652113087.git.mche...@kernel.org/
> 
> The patch series is from 2022 and not merged. You can take [PATCH v6
> 1/4] which removes the unused symbol from the link.
> 
> Or I can resend v2 with my commit msg. But not sure about how it works
> in such a case - will adding Suggested-by tag (if no objections from
> Mauro) with the Link be ok?

While you're at it, if you want to try it, you could see if you can
improve the situation more by looking at symbol_get() users that remain
and seeing if you can instead fix it with proper Kconfig dependency and
at build time. Then we can just remove it as well.

  Luis
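[Editorial note: the kind of conversion being suggested typically looks
like the hypothetical sketch below — FOO, BAR and foo_do_thing() are
placeholder names, not real kernel symbols. The consumer stops resolving
the symbol at runtime and instead declares a build-time Kconfig
dependency and calls the export directly:]

```diff
 # drivers/bar/Kconfig (hypothetical example, not a real driver)
 config BAR
 	tristate "Bar driver"
+	depends on FOO

 # drivers/bar/bar.c
-	fn = symbol_get(foo_do_thing);
-	if (fn)
-		err = fn();
-	symbol_put(foo_do_thing);
+	err = foo_do_thing();	/* direct call; FOO is now a hard dependency */
```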



Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Greg KH
On Thu, Feb 01, 2024 at 01:08:23PM -0500, Steven Rostedt wrote:
> On Thu, 1 Feb 2024 10:05:59 -0800
> Greg KH  wrote:
> 
> > > timerlat_fd = 
> > > open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
> > > timerlat_fd.close();
> > > 
> > > # ./taskset -c 0 ./timerlat_load.py
> > >   
> > 
> > Then obviously, this is a real, functional, change, so say so in the
> > kernel changelog :)
> 
> I've updated the change log to remove that comment and add the reproducer.

And cc: stable properly?  thanks!



Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Steven Rostedt
On Thu, 1 Feb 2024 10:05:59 -0800
Greg KH  wrote:

> > timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 
> > 'r')
> > timerlat_fd.close();
> > 
> > # ./taskset -c 0 ./timerlat_load.py
> >   
> 
> Then obviously, this is a real, functional, change, so say so in the
> kernel changelog :)

I've updated the change log to remove that comment and add the reproducer.

-- Steve



Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Greg KH
On Thu, Feb 01, 2024 at 05:02:56PM +0100, Daniel Bristot de Oliveira wrote:
> On 2/1/24 16:44, Greg KH wrote:
> > On Thu, Feb 01, 2024 at 04:13:39PM +0100, Daniel Bristot de Oliveira wrote:
> >> Currently, the timerlat's hrtimer is initialized at the first read of
> >> timerlat_fd, and destroyed at close(). It works, but it causes an error
> >> if the user program open() and close() the file without reading.
> > 
> > What error exactly happens?  Userspace, or the kernel crashes?
> 
> sorry, kernel crash:
> 
> # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
> # echo timerlat > /sys/kernel/debug/tracing/current_tracer
> 
> # cat ./timerlat_load.py
> #!/usr/bin/env python3
> 
> timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 
> 'r')
> timerlat_fd.close();
> 
> # ./taskset -c 0 ./timerlat_load.py
> 

Then obviously, this is a real, functional, change, so say so in the
kernel changelog :)

thanks,

greg k-h



Re: [PATCH] lib/test_kmod: fix kernel-doc warnings

2024-02-01 Thread Luis Chamberlain
On Fri, Nov 03, 2023 at 09:20:44PM -0700, Randy Dunlap wrote:
> Fix all kernel-doc warnings in test_kmod.c:
> - Mark some enum values as private so that kernel-doc is not needed
>   for them
> - s/thread_mutex/thread_lock/ in a struct's kernel-doc comments
> - add kernel-doc info for @task_sync
> 
> test_kmod.c:67: warning: Enum value '__TEST_KMOD_INVALID' not described in 
> enum 'kmod_test_case'
> test_kmod.c:67: warning: Enum value '__TEST_KMOD_MAX' not described in enum 
> 'kmod_test_case'
> test_kmod.c:100: warning: Function parameter or member 'task_sync' not 
> described in 'kmod_test_device_info'
> test_kmod.c:134: warning: Function parameter or member 'thread_mutex' not 
> described in 'kmod_test_device'
> 
> Signed-off-by: Randy Dunlap 
> Cc: Luis Chamberlain 
> Cc: linux-modu...@vger.kernel.org

Applied and pushed, thanks!

  Luis



Re: [PATCH v2 0/3] modules: few of alignment fixes

2024-02-01 Thread Luis Chamberlain
On Wed, Jan 31, 2024 at 02:11:44PM -0800, Luis Chamberlain wrote:
> On Mon, Jan 29, 2024 at 11:26:39AM -0800, Luis Chamberlain wrote:
> > Masahiro, if there no issues feel free to take this or I can take them in
> > too via the modules-next tree. Lemme know!
> 
> I've queued this onto modules-testing to get wider testing [0]

I've moved it to modules-next as I've found no issues.

  Luis



Re: [PATCH RFC v3 31/35] khugepaged: arm64: Don't collapse MTE enabled VMAs

2024-02-01 Thread Alexandru Elisei
On Thu, Feb 01, 2024 at 01:42:08PM +0530, Anshuman Khandual wrote:
> 
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > copy_user_highpage() will do memory allocation if there are saved tags for
> > the destination page, and the page is missing tag storage.
> > 
> > After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
> > rcu_read_unlock()s"), collapse_huge_page() calls
> > __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock
> > held, which means that copy_user_highpage() can only allocate memory using
> > GFP_ATOMIC or equivalent.
> > 
> > Get around this by refusing to collapse pages into a transparent huge page
> > if the VMA is MTE-enabled.
> 
> Makes sense when copy_user_highpage() will allocate memory for tag storage.
> 
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> > 
> > Changes since rfc v2:
> > 
> > * New patch. I think an agreement on whether copy*_user_highpage() should be
> > always allowed to sleep, or should not be allowed, would be useful.
> 
> This is a good question ! Even after preventing the collapse of MTE VMA here,
> there still might be more paths where a sleeping (i.e memory allocating)
> copy*_user_highpage() becomes problematic ?

Exactly!

> 
> > 
> >  arch/arm64/include/asm/pgtable.h| 3 +++
> >  arch/arm64/kernel/mte_tag_storage.c | 5 +
> >  include/linux/khugepaged.h  | 5 +
> >  mm/khugepaged.c | 4 
> >  4 files changed, 17 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/pgtable.h 
> > b/arch/arm64/include/asm/pgtable.h
> > index 87ae59436162..d0473538c926 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask)
> > return true;
> >  }
> >  
> > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned 
> > long address);
> > +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate
> > +
> >  #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> >  #endif /* CONFIG_ARM64_MTE */
> >  
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c 
> > b/arch/arm64/kernel/mte_tag_storage.c
> > index ac7b9c9c585c..a99959b70573 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, 
> > gfp_t gfp)
> > if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
> > reserve_tag_storage(page, order, gfp);
> >  }
> > +
> > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned 
> > long address)
> > +{
> > +   return !(vma->vm_flags & VM_MTE);
> > +}
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index f68865e19b0b..461e4322dff2 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm)
> > if (test_bit(MMF_VM_HUGEPAGE, >flags))
> > __khugepaged_exit(mm);
> >  }
> > +
> > +#ifndef arch_hugepage_vma_revalidate
> > +#define arch_hugepage_vma_revalidate(vma, address) 1
> 
> Please replace s/1/true as arch_hugepage_vma_revalidate() returns bool ?

Yeah, that's strange, I don't know why I used 1 there. Will change it to true,
thanks for spotting it.

> 
> > +#endif
> 
> Right, above construct is much better than __HAVE_ARCH_ based one.

Thanks!

Alex

> 
> > +
> >  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct 
> > *oldmm)
> >  {
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..cb9a9ddb4d86 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct 
> > *mm, unsigned long address,
> >  */
> > if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
> > return SCAN_PAGE_ANON;
> > +
> > +   if (!arch_hugepage_vma_revalidate(vma, address))
> > +   return SCAN_VMA_CHECK;
> > +
> > return SCAN_SUCCEED;
> >  }
> >  
> 
> Otherwise this LGTM.



Re: [PATCH RFC v3 30/35] arm64: mte: ptrace: Handle pages with missing tag storage

2024-02-01 Thread Alexandru Elisei
Hi,

On Thu, Feb 01, 2024 at 02:51:39PM +0530, Anshuman Khandual wrote:
> 
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > A page can end up mapped in a MTE enabled VMA without the corresponding tag
> > storage block reserved. Tag accesses made by ptrace in this case can lead
> > to the wrong tags being read or memory corruption for the process that is
> > using the tag storage memory as data.
> > 
> > Reserve tag storage by treating ptrace accesses like a fault.
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> > 
> > Changes since rfc v2:
> > 
> > * New patch, issue reported by Peter Collingbourne.
> > 
> >  arch/arm64/kernel/mte.c | 26 --
> >  1 file changed, 24 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index faf09da3400a..b1fa02dad4fd 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, 
> > unsigned long addr,
> > while (len) {
> > struct vm_area_struct *vma;
> > unsigned long tags, offset;
> > +   unsigned int fault_flags;
> > +   struct page *page;
> > +   vm_fault_t ret;
> > void *maddr;
> > -   struct page *page = get_user_page_vma_remote(mm, addr,
> > -gup_flags, );
> >  
> > +get_page:
> > +   page = get_user_page_vma_remote(mm, addr, gup_flags, );
> 
> But if there is valid page returned here in the first GUP attempt, will there
> still be a subsequent handle_mm_fault() on the same vma and addr ?

Only if it's missing tag storage. If it's missing tag storage, the page has
been mapped as arch_fault_on_access_pte(), and
handle_mm_fault()->..->arch_handle_folio_fault_on_access() will either
reserve tag storage, or migrate it.

> 
> > if (IS_ERR(page)) {
> > err = PTR_ERR(page);
> > break;
> > @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, 
> > unsigned long addr,
> > put_page(page);
> > break;
> > }
> > +
> > +   if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
> 
> Should not '!page' be checked here as well ?

I was under the impression that get_user_page_vma_remote() returns an error
pointer if gup couldn't pin the page.

Thanks,
Alex

> 
> > +   fault_flags = FAULT_FLAG_DEFAULT | \
> > + FAULT_FLAG_USER | \
> > + FAULT_FLAG_REMOTE | \
> > + FAULT_FLAG_ALLOW_RETRY | \
> > + FAULT_FLAG_RETRY_NOWAIT;
> > +   if (write)
> > +   fault_flags |= FAULT_FLAG_WRITE;
> > +
> > +   put_page(page);
> > +   ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> > +   if (ret & VM_FAULT_ERROR) {
> > +   err = -EFAULT;
> > +   break;
> > +   }
> > +   goto get_page;
> > +   }
> > +
> > WARN_ON_ONCE(!page_mte_tagged(page));
> >  
> > /* limit access to the end of the page */



Re: [PATCH RFC v3 13/35] mm: memory: Introduce fault-on-access mechanism for pages

2024-02-01 Thread Alexandru Elisei
Hi,

On Thu, Feb 01, 2024 at 11:22:13AM +0530, Anshuman Khandual wrote:
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Introduce a mechanism that allows an architecture to trigger a page fault,
> > and add the infrastructure to handle that fault accordingly. To make
> > use of this, an arch is expected to mark the table entry as PAGE_NONE (which
> > will cause a fault next time it is accessed) and to implement an
> > arch-specific method (like a software bit) for recognizing that the fault
> > needs to be handled by the arch code.
> > 
> > arm64 will use of this approach to reserve tag storage for pages which are
> > mapped in an MTE enabled VMA, but the storage needed to store tags isn't
> > reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with
> > existing pages).
> 
> Just to summerize -
> 
> So platform will create NUMA balancing like page faults - via marking existing
> mappings with PAGE_NONE permission, when the subsequent fault happens identify
> such cases via a software bit in the page table entry and then route the fault
> to the platform code itself for special purpose page fault handling where page
> might come from some reserved areas instead.

Indeed. In the tag storage scenario, the page is a page that will be mapped
as tagged; if it's missing tag storage, the tag storage needs to be
reserved before it can be mapped as tagged (and tags can be accessed).

> 
> Some questions
> 
> - How often PAGE_NONE is to be marked for applicable MTE VMA based mappings 
> 
>   - Is it periodic like NUMA balancing or just one time for tag storage

It's deterministic, and only for tag storage. It's done in
set_ptes()/__set_pte_at()->..->mte_sync_tags(), if the page is going to be
mapped as tagged, but is missing tag storage. See patch #26 ("arm64: mte:
Use fault-on-access to reserve missing tag storage") [1] for the code.

[1] 
https://lore.kernel.org/linux-arm-kernel/20240125164256.4147-27-alexandru.eli...@arm.com/

> 
> - How this is going to interact with NUMA balancing given both use PAGE_NONE
> 
>   - How to differentiate these mappings from standard pte_protnone()

The only place where the difference matters is in do_numa_page(), here
renamed to handle_pte_protnone(), and in the huge page equivalent.

Userspace can access tags only if set_ptes()/__set_pte_at() maps the pte
with the PT_NORMAL_TAGGED attribute, but those functions will always map
the page as arch_fault_on_access_pte() if it's missing tag storage. That
makes it impossible for the kernel to map it as tagged behind our back.

Unless you had other concerns.

Thanks,
Alex

> 
> > 
> > Signed-off-by: Alexandru Elisei 
> > ---
> > 
> > Changes since rfc v2:
> > 
> > * New patch. Split from patch #19 ("mm: mprotect: Introduce 
> > PAGE_FAULT_ON_ACCESS
> > for mprotect(PROT_MTE)") (David Hildenbrand).
> > 
> >  include/linux/huge_mm.h |  4 ++--
> >  include/linux/pgtable.h | 47 +++--
> >  mm/Kconfig  |  3 +++
> >  mm/huge_memory.c| 36 +
> >  mm/memory.c | 51 ++---
> >  5 files changed, 109 insertions(+), 32 deletions(-)
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..4678a0a5e6a8 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct 
> > *vma, unsigned long addr,
> >  struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long 
> > addr,
> > pud_t *pud, int flags, struct dev_pagemap **pgmap);
> >  
> > -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
> > +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf);
> >  
> >  extern struct page *huge_zero_page;
> >  extern unsigned long huge_zero_pfn;
> > @@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t 
> > *pud,
> > return NULL;
> >  }
> >  
> > -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > +static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
> >  {
> > return 0;
> >  }
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 2d0f04042f62..81a21be855a2 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud)
> > return 0;
> >  }
> >  
> > -#ifndef CONFIG_NUMA_BALANCING
> > +#if !defined(CONFIG_NUMA_BALANCING) && 
> > !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS)
> >  /*
> >   * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". 
> > It is
> >   * perfectly valid to indicate "no" in that case, which is why our default
> > @@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd)
> >  {
> > return 0;
> >  }
> > -#endif /* CONFIG_NUMA_BALANCING */
> > +#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> > +
> > +#ifndef 

[PATCH v2] eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's a couple of if statements in eventfs_root_lookup() that should
never be true. Instead of removing them, add WARN_ON_ONCE() around them.

  One is a tracefs_inode not being for eventfs.

  The other is a child being freed but still on the parent's children
  list. When a child is freed, it is removed from the list under the
  same mutex that is held during the iteration.

Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/

Reported-by: Al Viro 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240201161617.499712...@goodmis.org

- Fixed missing open parenthesis in first WARN_ON_ONCE()

 fs/tracefs/event_inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 110e8a272189..9d9c7dc3114b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -483,7 +483,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
struct dentry *result = NULL;
 
ti = get_tracefs(dir);
-   if (!(ti->flags & TRACEFS_EVENT_INODE))
+   if (WARN_ON_ONCE(!(ti->flags & TRACEFS_EVENT_INODE)))
return ERR_PTR(-EIO);
 
mutex_lock(_mutex);
@@ -495,7 +495,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
list_for_each_entry(ei_child, >children, list) {
if (strcmp(ei_child->name, name) != 0)
continue;
-   if (ei_child->is_freed)
+   /* A child is freed and removed from the list at the same time 
*/
+   if (WARN_ON_ONCE(ei_child->is_freed))
goto out;
result = lookup_dir_entry(dentry, ei, ei_child);
goto out;
-- 
2.43.0




Re: [PATCH RFC v3 12/35] mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()

2024-02-01 Thread Alexandru Elisei
Hi,

On Thu, Feb 01, 2024 at 09:00:23AM +0530, Anshuman Khandual wrote:
> 
> 
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > arm64 uses arch_swap_restore() to restore saved tags before the page is
> > swapped in and it's called in atomic context (with the ptl lock held).
> > 
> > Introduce arch_swap_prepare_to_restore() that will allow an architecture to
> > perform extra work during swap in and outside of a critical section.
> > This will be used by arm64 to allocate a buffer in memory where to
> > temporarily save tags if tag storage is not available for the page being
> > swapped in.
> 
> Just wondering if tag storage will always be unavailable for tagged pages
> being swapped in ? OR there are cases where allocation might not even be

In some (probably most) situations, tag storage will be available for the
page that will be swapped in. That's because either the page will have been
taken from the swap cache (which means it hasn't been freed, so it still
has tag storage reserved) or it has been allocated with vma_alloc_folio()
(when it's swapped back in in a VMA with VM_MTE set).

I've explained a scenario where tags will be restored for a page without
tag storage in patch #28 ("mte: swap: Handle tag restoring when missing tag
storage") [1]. Basically, it's because tagged pages can be mapped as tagged
in one VMA and untagged in another VMA; and swap tags are restored the
first time a page is swapped back in, even if it's swapped in a VMA with
MTE disabled.

[1] 
https://lore.kernel.org/linux-arm-kernel/20240125164256.4147-29-alexandru.eli...@arm.com/

> required ? This prepare phase needs to be outside the critical section -
> only because there might be memory allocations ?

Yes, exactly. See patch above :)

Thanks,
Alex



Re: [PATCH 5/6] eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()

2024-02-01 Thread Steven Rostedt
On Thu, 01 Feb 2024 10:34:51 -0500
Steven Rostedt  wrote:

> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -483,7 +483,7 @@ static struct dentry *eventfs_root_lookup(struct inode 
> *dir,
>   struct dentry *result = NULL;
>  
>   ti = get_tracefs(dir);
> - if (!(ti->flags & TRACEFS_EVENT_INODE))
> + if (WARN_ON_ONCE!(ti->flags & TRACEFS_EVENT_INODE)))

I added this patch at the end but never tested it :-p

I then did a rebase to move it ahead of patch 6, which was tested.

Will resend this patch.

-- Steve


>   return ERR_PTR(-EIO);
>  
>   mutex_lock(_mutex);



Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry

On 01/02/2024 16:09, Jakub Kicinski wrote:

On Thu, 1 Feb 2024 14:20:23 +0100 Jiri Pirko wrote:

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version


Yup, just to be clear - you can send this independently from the series,


Sure, and I think rocker and staging/octeon have this unnecessary 
code too.



tag is as

  [PATCH net-next]


ah, yes



we'll take it via the networking tree.


Thanks. I assume Greg - the staging maintainer - would take the octeon 
patch.



I'm not sure which tree the other
patches will go thru..


I think that the best thing to do is get a minimal set in for 6.9 and 
then merge the rest in the next cycle. I've got about 22 patches in 
total now, but I think that there will be more. We'll see who can pick 
up the first set when I send it officially.


Thanks,
John




[PATCH 6/6] eventfs: Create eventfs_root_inode to store dentry

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Only the root "events" directory stores a dentry. There's no reason to
hold a dentry pointer for every eventfs_inode as it is never set except
for the root "events" eventfs_inode.

Create a eventfs_root_inode structure that holds the events_dir dentry.
The "events" eventfs_inode *is* special, let it have its own descriptor.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 65 +---
 fs/tracefs/internal.h|  2 --
 2 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1a831ba1042b..c50d089c9a7f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -35,6 +35,17 @@ static DEFINE_MUTEX(eventfs_mutex);
 /* Choose something "unique" ;-) */
 #define EVENTFS_FILE_INODE_INO 0x12c4e37
 
+struct eventfs_root_inode {
+   struct eventfs_inode    ei;
+   struct dentry   *events_dir;
+};
+
+static struct eventfs_root_inode *get_root_inode(struct eventfs_inode *ei)
+{
+   WARN_ON_ONCE(!ei->is_events);
+   return container_of(ei, struct eventfs_root_inode, ei);
+}
+
 /* Just try to make something consistent and unique */
 static int eventfs_dir_ino(struct eventfs_inode *ei)
 {
@@ -73,12 +84,18 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
kfree(ei->entry_attrs);
kfree_const(ei->name);
-   kfree_rcu(ei, rcu);
+   if (ei->is_events) {
+   rei = get_root_inode(ei);
+   kfree_rcu(rei, ei.rcu);
+   } else {
+   kfree_rcu(ei, rcu);
+   }
 }
 
 static inline void put_ei(struct eventfs_inode *ei)
@@ -408,19 +425,43 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
return NULL;
 }
 
+static inline struct eventfs_inode *init_ei(struct eventfs_inode *ei, const 
char *name)
+{
+   ei->name = kstrdup_const(name, GFP_KERNEL);
+   if (!ei->name)
+   return NULL;
+   kref_init(&ei->kref);
+   return ei;
+}
+
 static inline struct eventfs_inode *alloc_ei(const char *name)
 {
struct eventfs_inode *ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+   struct eventfs_inode *result;
 
if (!ei)
return NULL;
 
-   ei->name = kstrdup_const(name, GFP_KERNEL);
-   if (!ei->name) {
+   result = init_ei(ei, name);
+   if (!result)
kfree(ei);
+
+   return result;
+}
+
+static inline struct eventfs_inode *alloc_root_ei(const char *name)
+{
+   struct eventfs_root_inode *rei = kzalloc(sizeof(*rei), GFP_KERNEL);
+   struct eventfs_inode *ei;
+
+   if (!rei)
return NULL;
-   }
-   kref_init(&ei->kref);
+
+   rei->ei.is_events = 1;
+   ei = init_ei(&rei->ei, name);
+   if (!ei)
+   kfree(rei);
+
return ei;
 }
 
@@ -710,6 +751,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
int size, void *data)
 {
struct dentry *dentry = tracefs_start_creating(name, parent);
+   struct eventfs_root_inode *rei;
struct eventfs_inode *ei;
struct tracefs_inode *ti;
struct inode *inode;
@@ -722,7 +764,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
if (IS_ERR(dentry))
return ERR_CAST(dentry);
 
-   ei = alloc_ei(name);
+   ei = alloc_root_ei(name);
if (!ei)
goto fail;
 
@@ -731,10 +773,11 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
goto fail;
 
// Note: we have a ref to the dentry from tracefs_start_creating()
-   ei->events_dir = dentry;
+   rei = get_root_inode(ei);
+   rei->events_dir = dentry;
+
ei->entries = entries;
ei->nr_entries = size;
-   ei->is_events = 1;
ei->data = data;
 
/* Save the ownership of this directory */
@@ -845,13 +888,15 @@ void eventfs_remove_dir(struct eventfs_inode *ei)
  */
 void eventfs_remove_events_dir(struct eventfs_inode *ei)
 {
+   struct eventfs_root_inode *rei;
struct dentry *dentry;
 
-   dentry = ei->events_dir;
+   rei = get_root_inode(ei);
+   dentry = rei->events_dir;
if (!dentry)
return;
 
-   ei->events_dir = NULL;
+   rei->events_dir = NULL;
eventfs_remove_dir(ei);
 
/*
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index beb3dcd0e434..15c26f9aaad4 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -36,7 +36,6 @@ struct eventfs_attr {
  * @children:  link list into the child eventfs_inode
  * @entries:   the array of entries representing the files in the directory
  * @name:  the 

[PATCH 5/6] eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's a couple of if statements in eventfs_root_lookup() that should
never be true. Instead of removing them, add WARN_ON_ONCE() around them.

  One is a tracefs_inode not being for eventfs.

  The other is a child being freed but still on the parent's children
  list. When a child is freed, it is removed from the list under the
  same mutex that is held during the iteration.

Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/

Reported-by: Al Viro 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 110e8a272189..1a831ba1042b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -483,7 +483,7 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
struct dentry *result = NULL;
 
ti = get_tracefs(dir);
-   if (!(ti->flags & TRACEFS_EVENT_INODE))
+   if (WARN_ON_ONCE(!(ti->flags & TRACEFS_EVENT_INODE)))
return ERR_PTR(-EIO);
 
mutex_lock(&eventfs_mutex);
@@ -495,7 +495,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
list_for_each_entry(ei_child, &ei->children, list) {
if (strcmp(ei_child->name, name) != 0)
continue;
-   if (ei_child->is_freed)
+   /* A child is freed and removed from the list at the same time 
*/
+   if (WARN_ON_ONCE(ei_child->is_freed))
goto out;
result = lookup_dir_entry(dentry, ei, ei_child);
goto out;
-- 
2.43.0





[PATCH 4/6] eventfs: Keep all directory links at 1

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The directory link count in eventfs was somewhat bogus. It was only being
updated when a directory child was being looked up and not on creation.

One solution would be to update in get_attr() the link count by iterating
the ei->children list and then adding 2. But that could slow down simple
stat() calls, especially if it's done on all directories in eventfs.

Another solution would be to add a parent pointer to the eventfs_inode
and keep track of the number of sub directories it has on creation. But
this adds overhead for something not really worthwhile.

The solution decided upon is to keep all directory links in eventfs as 1.
This tells user space not to rely on the hard links of directories. Which
in this case it shouldn't.

Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/

Cc: sta...@vger.kernel.org
Fixes: c1504e510238 ("eventfs: Implement eventfs dir creation functions")
Suggested-by: Al Viro 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 9e031e5a2713..110e8a272189 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -404,9 +404,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
 
dentry->d_fsdata = get_ei(ei);
 
-   inc_nlink(inode);
d_add(dentry, inode);
-   inc_nlink(dentry->d_parent->d_inode);
return NULL;
 }
 
@@ -769,9 +767,17 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
 
dentry->d_fsdata = get_ei(ei);
 
-   /* directory inodes start off with i_nlink == 2 (for "." entry) */
-   inc_nlink(inode);
+   /*
+* Keep all eventfs directories with i_nlink == 1.
+* Due to the dynamic nature of the dentry creations and not
+* wanting to add a pointer to the parent eventfs_inode in the
+* eventfs_inode structure, keeping the i_nlink in sync with the
+* number of directories would cause too much complexity for
+* something not worth much. Keeping directory links at 1
+* tells userspace not to trust the link number.
+*/
d_instantiate(dentry, inode);
+   /* The dentry of the "events" parent does keep track though */
inc_nlink(dentry->d_parent->d_inode);
fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
tracefs_end_creating(dentry);
-- 
2.43.0





[PATCH 3/6] eventfs: Remove fsnotify*() functions from lookup()

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The dentries and inodes are created when referenced in the lookup code.
There's no reason to call fsnotify_*() functions when they are created by
a reference. It doesn't make any sense.

Link: https://lore.kernel.org/linux-trace-kernel/20240201002719.GS2087318@ZenIV/

Cc: sta...@vger.kernel.org
Fixes: a376007917776 ("eventfs: Implement functions to create files and dirs 
when accessed");
Suggested-by: Al Viro 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ca7daee7c811..9e031e5a2713 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -366,7 +366,6 @@ static struct dentry *lookup_file(struct eventfs_inode 
*parent_ei,
dentry->d_fsdata = get_ei(parent_ei);
 
d_add(dentry, inode);
-   fsnotify_create(dentry->d_parent->d_inode, dentry);
return NULL;
 };
 
@@ -408,7 +407,6 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
inc_nlink(inode);
d_add(dentry, inode);
inc_nlink(dentry->d_parent->d_inode);
-   fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
return NULL;
 }
 
-- 
2.43.0





[PATCH 2/6] eventfs: Restructure eventfs_inode structure to be more condensed

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Some of the eventfs_inode structure has holes in it. Rework the structure
to be a bit more condensed, and also remove the no longer used llist
field.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/internal.h | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 1886f1826cd8..beb3dcd0e434 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -32,40 +32,37 @@ struct eventfs_attr {
 /*
  * struct eventfs_inode - hold the properties of the eventfs directories.
  * @list:  link list into the parent directory
+ * @rcu:   Union with @list for freeing
+ * @children:  link list into the child eventfs_inode
  * @entries:   the array of entries representing the files in the directory
  * @name:  the name of the directory to create
- * @children:  link list into the child eventfs_inode
  * @events_dir: the dentry of the events directory
  * @entry_attrs: Saved mode and ownership of the @d_children
- * @attr:  Saved mode and ownership of eventfs_inode itself
  * @data:  The private data to pass to the callbacks
+ * @attr:  Saved mode and ownership of eventfs_inode itself
  * @is_freed:  Flag set if the eventfs is on its way to be freed
  *Note if is_freed is set, then dentry is corrupted.
+ * @is_events: Flag set for only the top level "events" directory
  * @nr_entries: The number of items in @entries
+ * @ino:   The saved inode number
  */
 struct eventfs_inode {
-   struct kref kref;
-   struct list_head    list;
+   union {
+   struct list_head    list;
+   struct rcu_head rcu;
+   };
+   struct list_head    children;
const struct eventfs_entry  *entries;
const char  *name;
-   struct list_head    children;
struct dentry   *events_dir;
struct eventfs_attr *entry_attrs;
-   struct eventfs_attr attr;
void*data;
+   struct eventfs_attr attr;
+   struct kref kref;
unsigned intis_freed:1;
unsigned intis_events:1;
unsigned intnr_entries:30;
unsigned intino;
-   /*
-* Union - used for deletion
-* @llist:  for calling dput() if needed after RCU
-* @rcu:eventfs_inode to delete in RCU
-*/
-   union {
-   struct llist_node   llist;
-   struct rcu_head rcu;
-   };
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.43.0





[PATCH 0/6] eventfs: More fixes and clean ups

2024-02-01 Thread Steven Rostedt
Al Viro reviewed the latest patch set from Linus and had some comments.
One was on the return code of the eventfs_root_lookup() which I already
sent a patch for:
  
https://lore.kernel.org/linux-trace-kernel/20240131233227.73db5...@gandalf.local.home/

The other comments had to do with code that was there before Linus's
updates. Those were:

 - It made no sense to have the fsnotify*() functions triggering
   from file/directory creation in the lookup()

 - The directory inode count wasn't accurate. The updates to
   the link nodes for the parent directory was also happening in
   lookup. Al Viro told me to just set them all to 1, as that
   tells user space not to trust the hard link counts.

I added a WARN_ON_ONCE() in case of the eventfs_inode being freed without
is_freed being set.

I restructured the eventfs_inode structure to be a bit more compact.

The last two changes I included here but do not plan on pushing for
v6.8.  Those are:

 - Adding WARN_ON_ONCE() to the conditionals that Al asked about
   as the logic should prevent them from being true.

 - Moving the dentry pointer out of eventfs_inode and creating a new
   eventfs_root_inode that contains the eventfs_inode and the dentry
   pointer. This only gets used by the "events" directory.

Steven Rostedt (Google) (6):
  eventfs: Warn if an eventfs_inode is freed without is_freed being set
  eventfs: Restructure eventfs_inode structure to be more condensed
  eventfs: Remove fsnotify*() functions from lookup()
  eventfs: Keep all directory links at 1
  eventfs: Add WARN_ON_ONCE() to checks in eventfs_root_lookup()
  eventfs: Create eventfs_root_inode to store dentry


 fs/tracefs/event_inode.c | 104 +--
 fs/tracefs/internal.h|  29 ++---
 2 files changed, 94 insertions(+), 39 deletions(-)



[PATCH 1/6] eventfs: Warn if an eventfs_inode is freed without is_freed being set

2024-02-01 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There should never be a case where an eventfs_inode is being freed without
is_freed being set. Add a WARN_ON_ONCE() if it ever happens. That would
mean there was one too many put_ei()s.

Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 515fdace1eea..ca7daee7c811 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -73,6 +73,9 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+
+   WARN_ON_ONCE(!ei->is_freed);
+
kfree(ei->entry_attrs);
kfree_const(ei->name);
kfree_rcu(ei, rcu);
@@ -84,6 +87,14 @@ static inline void put_ei(struct eventfs_inode *ei)
kref_put(&ei->kref, release_ei);
 }
 
+static inline void free_ei(struct eventfs_inode *ei)
+{
+   if (ei) {
+   ei->is_freed = 1;
+   put_ei(ei);
+   }
+}
+
 static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
 {
if (ei)
@@ -679,7 +690,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, 
struct eventfs_inode
 
/* Was the parent freed? */
if (list_empty(&ei->list)) {
-   put_ei(ei);
+   free_ei(ei);
ei = NULL;
}
return ei;
@@ -770,7 +781,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
return ei;
 
  fail:
-   put_ei(ei);
+   free_ei(ei);
tracefs_failed_creating(dentry);
return ERR_PTR(-ENOMEM);
 }
@@ -801,9 +812,8 @@ static void eventfs_remove_rec(struct eventfs_inode *ei, 
int level)
list_for_each_entry(ei_child, &ei->children, list)
eventfs_remove_rec(ei_child, level + 1);
 
-   ei->is_freed = 1;
list_del(&ei->list);
-   put_ei(ei);
+   free_ei(ei);
 }
 
 /**
-- 
2.43.0





Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-02-01 Thread Mathieu Poirier
On Thu, Feb 01, 2024 at 04:06:37PM +0100, Arnaud POULIQUEN wrote:
> hello Mathieu,
> 
> On 1/31/24 19:52, Mathieu Poirier wrote:
> > On Tue, Jan 30, 2024 at 10:13:48AM +0100, Arnaud POULIQUEN wrote:
> >>
> >>
> >> On 1/26/24 18:11, Mathieu Poirier wrote:
> >>> On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote:
>  The new TEE remoteproc device is used to manage remote firmware in a
>  secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>  introduced to delegate the loading of the firmware to the trusted
>  execution context. In such cases, the firmware should be signed and
>  adhere to the image format defined by the TEE.
> 
>  Signed-off-by: Arnaud Pouliquen 
>  ---
>  V1 to V2 update:
>  - remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected by
>    the kernel test robot:
>   WARNING: unmet direct dependencies detected for TEE_REMOTEPROC
>   Depends on [n]: REMOTEPROC [=y] && OPTEE [=n]
>   Selected by [y]:
>   - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && 
>  REMOTEPROC [=y]
>  - Fix initialized trproc variable in  stm32_rproc_probe
>  ---
>   drivers/remoteproc/stm32_rproc.c | 149 +--
>   1 file changed, 144 insertions(+), 5 deletions(-)
> 
>  diff --git a/drivers/remoteproc/stm32_rproc.c 
>  b/drivers/remoteproc/stm32_rproc.c
>  index fcc0001e2657..cf6a21bac945 100644
>  --- a/drivers/remoteproc/stm32_rproc.c
>  +++ b/drivers/remoteproc/stm32_rproc.c
>  @@ -20,6 +20,7 @@
>   #include 
>   #include 
>   #include 
>  +#include 
>   #include 
>   
>   #include "remoteproc_internal.h"
>  @@ -49,6 +50,9 @@
>   #define M4_STATE_STANDBY4
>   #define M4_STATE_CRASH  5
>   
>  +/* Remote processor unique identifier aligned with the Trusted 
>  Execution Environment definitions */
>  +#define STM32_MP1_M4_PROC_ID0
>  +
>   struct stm32_syscon {
>   struct regmap *map;
>   u32 reg;
>  @@ -90,6 +94,8 @@ struct stm32_rproc {
>   struct stm32_mbox mb[MBOX_NB_MBX];
>   struct workqueue_struct *workqueue;
>   bool hold_boot_smc;
>  +bool fw_loaded;
>  +struct tee_rproc *trproc;
>   void __iomem *rsc_va;
>   };
>   
>  @@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc)
>   return err;
>   }
>   
>  +static int stm32_rproc_tee_elf_sanity_check(struct rproc *rproc,
>  +const struct firmware *fw)
>  +{
>  +struct stm32_rproc *ddata = rproc->priv;
>  +unsigned int ret = 0;
>  +
>  +if (rproc->state == RPROC_DETACHED)
>  +return 0;
>  +
>  +ret = tee_rproc_load_fw(ddata->trproc, fw);
>  +if (!ret)
>  +ddata->fw_loaded = true;
>  +
>  +return ret;
>  +}
>  +
>  +static int stm32_rproc_tee_elf_load(struct rproc *rproc,
>  +const struct firmware *fw)
>  +{
>  +struct stm32_rproc *ddata = rproc->priv;
>  +unsigned int ret;
>  +
>  +/*
>  + * This function can be called by remote proc for recovery
>  + * without the sanity check. In this case we need to load the 
>  firmware
>  + * else nothing done here as the firmware has been preloaded 
>  for the
>  + * sanity check to be able to parse it for the resource table.
>  + */
> >>>
> >>> This comment is very confusing - please consider refactoring.  
> >>>
>  +if (ddata->fw_loaded)
>  +return 0;
>  +
> >>>
> >>> I'm not sure about keeping a flag to indicate the status of the loaded 
> >>> firmware.
> >>> It is not done for the non-secure method, I don't see why it would be 
> >>> needed for
> >>> the secure one.
> >>>
> >>
> >> The difference is on the sanity check.
> >> - in rproc_elf_sanity_check we  parse the elf file to verify that it is
> >> valid.
> >> - in stm32_rproc_tee_elf_sanity_check we have to do the same, that means to
> >> authenticate it. the authentication is done during the load.
> >>
> >> So this flag is used to avoid to reload it twice time.
> >> refactoring the comment should help to understand this flag
> >>
> >>
> >> An alternative would be to bypass the sanity check. But this lead to same
> >> limitation.
> >> Before loading the firmware in remoteproc_core, we call rproc_parse_fw() 
> >> that is
> >> used to get the resource table address. To get it from tee we need to
> >> authenticate the firmware so load it...
> >>
> > 
> > I spent a long time thinking about this patchset.  Looking at the code as it
> > is now, request_firmware() in 

Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Steven Rostedt
On Thu, 1 Feb 2024 17:02:56 +0100
Daniel Bristot de Oliveira  wrote:

> On 2/1/24 16:44, Greg KH wrote:
> > On Thu, Feb 01, 2024 at 04:13:39PM +0100, Daniel Bristot de Oliveira wrote: 
> >  
> >> Currently, the timerlat's hrtimer is initialized at the first read of
> >> timerlat_fd, and destroyed at close(). It works, but it causes an error
> >> if the user program open() and close() the file without reading.  
> > 
> > What error exactly happens?  Userspace, or the kernel crashes?  
> 
> sorry, kernel crash:
> 
> # echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
> # echo timerlat > /sys/kernel/debug/tracing/current_tracer
> 
> # cat ./timerlat_load.py
> #!/usr/bin/env python3
> 
> timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 
> 'r')
> timerlat_fd.close();
> 
> # ./taskset -c 0 ./timerlat_load.py
> 
> 
> [ 6401.611374] BUG: kernel NULL pointer dereference, address: 0010
> [ 6401.611786] #PF: supervisor read access in kernel mode
> [ 6401.612081] #PF: error_code(0x) - not-present page
> [ 6401.612376] PGD 0 P4D 0
> [ 6401.612495] Oops:  [#1] PREEMPT SMP NOPTI
> [ 6401.612690] CPU: 1 PID: 2673 Comm: python3 Not tainted 
> 6.6.13-200.fc39.x86_64 #1
> [ 6401.613011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.3-1.fc39 04/01/2014
> [ 6401.613379] RIP: 0010:hrtimer_active+0xd/0x50
> [ 6401.613577] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 
> 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 
> 10 a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
> [ 6401.614374] RSP: 0018:b031009b7e10 EFLAGS: 00010286
> [ 6401.614604] RAX: 0002db00 RBX: 9118f786db08 RCX: 
> 
> [ 6401.614914] RDX:  RSI: 9117a0e64400 RDI: 
> 9118f786db08
> [ 6401.615225] RBP: 9118f786db80 R08: 9117a0ddd420 R09: 
> 9117804d4f70
> [ 6401.615534] R10:  R11:  R12: 
> 9118f786db08
> [ 6401.615846] R13: 91178fdd5e20 R14: 9117840978c0 R15: 
> 
> [ 6401.616157] FS:  7f2ffbab1740() GS:9118f784() 
> knlGS:
> [ 6401.616508] CS:  0010 DS:  ES:  CR0: 80050033
> [ 6401.616765] CR2: 0010 CR3: 0001b402e000 CR4: 
> 00750ee0
> [ 6401.617075] PKRU: 5554
> [ 6401.617197] Call Trace:
> [ 6401.617309]  
> [ 6401.617407]  ? __die+0x23/0x70
> [ 6401.617548]  ? page_fault_oops+0x171/0x4e0
> [ 6401.617983]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.618389]  ? avc_has_extended_perms+0x237/0x520
> [ 6401.618800]  ? exc_page_fault+0x7f/0x180
> [ 6401.619176]  ? asm_exc_page_fault+0x26/0x30
> [ 6401.619563]  ? hrtimer_active+0xd/0x50
> [ 6401.619926]  hrtimer_cancel+0x15/0x40
> [ 6401.620286]  timerlat_fd_release+0x48/0xe0
> [ 6401.620666]  __fput+0xf5/0x290
> [ 6401.621004]  __x64_sys_close+0x3d/0x80
> [ 6401.621370]  do_syscall_64+0x60/0x90
> [ 6401.621730]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.622129]  ? __x64_sys_ioctl+0x72/0xd0
> [ 6401.622503]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.622902]  ? syscall_exit_to_user_mode+0x2b/0x40
> [ 6401.623309]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.623703]  ? do_syscall_64+0x6c/0x90
> [ 6401.624063]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.624457]  ? exit_to_user_mode_prepare+0x142/0x1f0
> [ 6401.624868]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.625262]  ? syscall_exit_to_user_mode+0x2b/0x40
> [ 6401.625663]  ? srso_alias_return_thunk+0x5/0x7f
> [ 6401.626051]  ? do_syscall_64+0x6c/0x90
> [ 6401.626404]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [ 6401.626810] RIP: 0033:0x7f2ffb321594
> [ 6401.627164] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 
> 00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 
> 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
> [ 6401.628345] RSP: 002b:7ffe8d8eef18 EFLAGS: 0202 ORIG_RAX: 
> 0003
> [ 6401.628867] RAX: ffda RBX: 7f2ffba4e668 RCX: 
> 7f2ffb321594
> [ 6401.629372] RDX:  RSI:  RDI: 
> 0003
> [ 6401.629879] RBP: 7ffe8d8eef40 R08:  R09: 
> 
> [ 6401.630384] R10: 55c926e3167eae79 R11: 0202 R12: 
> 0003
> [ 6401.630889] R13: 7ffe8d8ef030 R14:  R15: 
> 7f2ffba4e668
> [ 6401.631394]  
> [ 6401.631691] Modules linked in: tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set 
> nf_tables nfnetlink qrtr sunrpc pktcdvd intel_rapl_msr snd_hda_codec_generic 
> intel_rapl_common ledtrig_audio snd_hda_intel snd_intel_dspcfg 
> snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm kvm_amd 
> iTCO_wdt snd_timer intel_pmc_bxt ccp joydev iTCO_vendor_support kvm irqbypass 

Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread Jakub Kicinski
On Thu, 1 Feb 2024 14:20:23 +0100 Jiri Pirko wrote:
> >BTW, I assume that changes like this are also ok:
> >
> >8<-
> >
> >   net: team: Don't bother filling in ethtool driver version

Yup, just to be clear - you can send this independently from the series,
tag is as 

 [PATCH net-next]

we'll take it via the networking tree. I'm not sure which tree the other
patches will go thru..



Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Daniel Bristot de Oliveira
On 2/1/24 16:44, Greg KH wrote:
> On Thu, Feb 01, 2024 at 04:13:39PM +0100, Daniel Bristot de Oliveira wrote:
>> Currently, the timerlat's hrtimer is initialized at the first read of
>> timerlat_fd, and destroyed at close(). It works, but it causes an error
>> if the user program open() and close() the file without reading.
> 
> What error exactly happens?  Userspace, or the kernel crashes?

sorry, kernel crash:

# echo NO_OSNOISE_WORKLOAD > /sys/kernel/debug/tracing/osnoise/options
# echo timerlat > /sys/kernel/debug/tracing/current_tracer

# cat ./timerlat_load.py
#!/usr/bin/env python3

timerlat_fd = open("/sys/kernel/tracing/osnoise/per_cpu/cpu0/timerlat_fd", 'r')
timerlat_fd.close();

# ./taskset -c 0 ./timerlat_load.py


[ 6401.611374] BUG: kernel NULL pointer dereference, address: 0010
[ 6401.611786] #PF: supervisor read access in kernel mode
[ 6401.612081] #PF: error_code(0x) - not-present page
[ 6401.612376] PGD 0 P4D 0
[ 6401.612495] Oops:  [#1] PREEMPT SMP NOPTI
[ 6401.612690] CPU: 1 PID: 2673 Comm: python3 Not tainted 
6.6.13-200.fc39.x86_64 #1
[ 6401.613011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.3-1.fc39 04/01/2014
[ 6401.613379] RIP: 0010:hrtimer_active+0xd/0x50
[ 6401.613577] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 57 30 <8b> 42 10 
a8 01 74 09 f3 90 8b 42 10 a8 01 75 f7 80 7f 38 00 75 1d
[ 6401.614374] RSP: 0018:b031009b7e10 EFLAGS: 00010286
[ 6401.614604] RAX: 0002db00 RBX: 9118f786db08 RCX: 
[ 6401.614914] RDX:  RSI: 9117a0e64400 RDI: 9118f786db08
[ 6401.615225] RBP: 9118f786db80 R08: 9117a0ddd420 R09: 9117804d4f70
[ 6401.615534] R10:  R11:  R12: 9118f786db08
[ 6401.615846] R13: 91178fdd5e20 R14: 9117840978c0 R15: 
[ 6401.616157] FS:  7f2ffbab1740() GS:9118f784() 
knlGS:
[ 6401.616508] CS:  0010 DS:  ES:  CR0: 80050033
[ 6401.616765] CR2: 0010 CR3: 0001b402e000 CR4: 00750ee0
[ 6401.617075] PKRU: 5554
[ 6401.617197] Call Trace:
[ 6401.617309]  
[ 6401.617407]  ? __die+0x23/0x70
[ 6401.617548]  ? page_fault_oops+0x171/0x4e0
[ 6401.617983]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.618389]  ? avc_has_extended_perms+0x237/0x520
[ 6401.618800]  ? exc_page_fault+0x7f/0x180
[ 6401.619176]  ? asm_exc_page_fault+0x26/0x30
[ 6401.619563]  ? hrtimer_active+0xd/0x50
[ 6401.619926]  hrtimer_cancel+0x15/0x40
[ 6401.620286]  timerlat_fd_release+0x48/0xe0
[ 6401.620666]  __fput+0xf5/0x290
[ 6401.621004]  __x64_sys_close+0x3d/0x80
[ 6401.621370]  do_syscall_64+0x60/0x90
[ 6401.621730]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.622129]  ? __x64_sys_ioctl+0x72/0xd0
[ 6401.622503]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.622902]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 6401.623309]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.623703]  ? do_syscall_64+0x6c/0x90
[ 6401.624063]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.624457]  ? exit_to_user_mode_prepare+0x142/0x1f0
[ 6401.624868]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.625262]  ? syscall_exit_to_user_mode+0x2b/0x40
[ 6401.625663]  ? srso_alias_return_thunk+0x5/0x7f
[ 6401.626051]  ? do_syscall_64+0x6c/0x90
[ 6401.626404]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 6401.626810] RIP: 0033:0x7f2ffb321594
[ 6401.627164] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 
00 00 90 f3 0f 1e fa 80 3d d5 cd 0d 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
[ 6401.628345] RSP: 002b:7ffe8d8eef18 EFLAGS: 0202 ORIG_RAX: 
0003
[ 6401.628867] RAX: ffda RBX: 7f2ffba4e668 RCX: 7f2ffb321594
[ 6401.629372] RDX:  RSI:  RDI: 0003
[ 6401.629879] RBP: 7ffe8d8eef40 R08:  R09: 
[ 6401.630384] R10: 55c926e3167eae79 R11: 0202 R12: 0003
[ 6401.630889] R13: 7ffe8d8ef030 R14:  R15: 7f2ffba4e668
[ 6401.631394]  
[ 6401.631691] Modules linked in: tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 
nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set 
nf_tables nfnetlink qrtr sunrpc pktcdvd intel_rapl_msr snd_hda_codec_generic 
intel_rapl_common ledtrig_audio snd_hda_intel snd_intel_dspcfg 
snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm kvm_amd 
iTCO_wdt snd_timer intel_pmc_bxt ccp joydev iTCO_vendor_support kvm irqbypass 
i2c_i801 snd i2c_smbus soundcore lpc_ich virtio_balloon loop zram 
crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic 
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net virtio_gpu 
virtio_console virtio_blk net_failover failover virtio_dma_buf serio_raw 

Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Greg KH
On Thu, Feb 01, 2024 at 04:13:39PM +0100, Daniel Bristot de Oliveira wrote:
> Currently, the timerlat's hrtimer is initialized at the first read of
> timerlat_fd, and destroyed at close(). It works, but it causes an error
> if the user program open() and close() the file without reading.

What error exactly happens?  Userspace, or the kernel crashes?

thanks,

greg k-h



Re: [RFC PATCH 2/5] mfd: add 88pm88x driver

2024-02-01 Thread Karel Balej
Lee Jones, 2024-01-31T11:03:11+00:00:
> On Sun, 28 Jan 2024, Karel Balej wrote:
> > > > +   /* GPIO1: DVC, GPIO0: input */
> > > > +   REG_SEQ0(PM88X_REG_GPIO_CTRL1, 0x40),
> > >
> > > Shouldn't you set these up using Pintrl?
> > 
> > You mean to add a new MFD cell for the pins and write the respective
> > driver? The downstream implementation has no such thing so I'm not sure
> > if I would be able to do that from scratch.
>
> This is not a Pinctrl driver.
>
> Isn't there a generic API you can use?

I'm sorry, I don't think I understand what you mean.

Thank you,
K. B.



Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Daniel Bristot de Oliveira
On 2/1/24 16:25, Steven Rostedt wrote:
> On Thu,  1 Feb 2024 16:13:39 +0100
> Daniel Bristot de Oliveira  wrote:
> 
>> Currently, the timerlat's hrtimer is initialized at the first read of
>> timerlat_fd, and destroyed at close(). It works, but it causes an error
>> if the user program open() and close() the file without reading.
>>
>> Move hrtimer_init to timerlat_fd open() to avoid this problem.
>>
>> No functional changes.
> 
> It can't be fixing something and not have any functional changes.
> 
> No functional changes means the code is restructured but the resulting
> assembly would be the same.
> 
> Like moving functions around in a file so that you don't need extra
> prototype declarations.
> 
> Please only add "No functional changes" if the function's assembly would be
> the same.

ok

>>
>> Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")
> 
> With a fixes tag, I'm assuming this should go into v6.8 with a Cc stable?

right, I added it on Cc, but did not include the Cc:.. tag. It seems that I 
should have.

-- Daniel

> -- Steve
> 
> 
>> Signed-off-by: Daniel Bristot de Oliveira 
>> ---
>>  kernel/trace/trace_osnoise.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
>> index bd0d01d00fb9..a8e28f9b9271 100644
>> --- a/kernel/trace/trace_osnoise.c
>> +++ b/kernel/trace/trace_osnoise.c
>> @@ -2444,6 +2444,9 @@ static int timerlat_fd_open(struct inode *inode, 
>> struct file *file)
>>  tlat = this_cpu_tmr_var();
>>  tlat->count = 0;
>>  
>> +hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
>> +tlat->timer.function = timerlat_irq;
>> +
>>  migrate_enable();
>>  return 0;
>>  };
>> @@ -2526,9 +2529,6 @@ timerlat_fd_read(struct file *file, char __user *ubuf, 
>> size_t count,
>>  tlat->tracing_thread = false;
>>  tlat->kthread = current;
>>  
>> -hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
>> -tlat->timer.function = timerlat_irq;
>> -
>>  /* Annotate now to drift new period */
>>  tlat->abs_period = hrtimer_cb_get_time(&tlat->timer);
>>  
> 




Re: [PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Steven Rostedt
On Thu,  1 Feb 2024 16:13:39 +0100
Daniel Bristot de Oliveira  wrote:

> Currently, the timerlat's hrtimer is initialized at the first read of
> timerlat_fd, and destroyed at close(). It works, but it causes an error
> if the user program open() and close() the file without reading.
> 
> Move hrtimer_init to timerlat_fd open() to avoid this problem.
> 
> No functional changes.

It can't be fixing something and not have any functional changes.

No functional changes means the code is restructured but the resulting
assembly would be the same.

Like moving functions around in a file so that you don't need extra
prototype declarations.

Please only add "No functional changes" if the function's assembly would be
the same.

> 
> Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")

With a fixes tag, I'm assuming this should go into v6.8 with a Cc stable?

-- Steve


> Signed-off-by: Daniel Bristot de Oliveira 
> ---
>  kernel/trace/trace_osnoise.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index bd0d01d00fb9..a8e28f9b9271 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -2444,6 +2444,9 @@ static int timerlat_fd_open(struct inode *inode, struct 
> file *file)
>   tlat = this_cpu_tmr_var();
>   tlat->count = 0;
>  
> + hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
> + tlat->timer.function = timerlat_irq;
> +
>   migrate_enable();
>   return 0;
>  };
> @@ -2526,9 +2529,6 @@ timerlat_fd_read(struct file *file, char __user *ubuf, 
> size_t count,
>   tlat->tracing_thread = false;
>   tlat->kthread = current;
>  
> - hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
> - tlat->timer.function = timerlat_irq;
> -
>   /* Annotate now to drift new period */
>   tlat->abs_period = hrtimer_cb_get_time(&tlat->timer);
>  




[PATCH] tracing/timerlat: Move hrtimer_init to timerlat_fd open()

2024-02-01 Thread Daniel Bristot de Oliveira
Currently, the timerlat's hrtimer is initialized at the first read of
timerlat_fd, and destroyed at close(). It works, but it causes an error
if the user program open() and close() the file without reading.

Move hrtimer_init to timerlat_fd open() to avoid this problem.

No functional changes.

Fixes: e88ed227f639 ("tracing/timerlat: Add user-space interface")
Signed-off-by: Daniel Bristot de Oliveira 
---
 kernel/trace/trace_osnoise.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index bd0d01d00fb9..a8e28f9b9271 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2444,6 +2444,9 @@ static int timerlat_fd_open(struct inode *inode, struct 
file *file)
tlat = this_cpu_tmr_var();
tlat->count = 0;
 
+   hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
+   tlat->timer.function = timerlat_irq;
+
migrate_enable();
return 0;
 };
@@ -2526,9 +2529,6 @@ timerlat_fd_read(struct file *file, char __user *ubuf, 
size_t count,
tlat->tracing_thread = false;
tlat->kthread = current;
 
-   hrtimer_init(&tlat->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
-   tlat->timer.function = timerlat_irq;
-
/* Annotate now to drift new period */
tlat->abs_period = hrtimer_cb_get_time(&tlat->timer);
 
-- 
2.43.0




Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-02-01 Thread Arnaud POULIQUEN
Hello Mathieu,

On 1/31/24 19:52, Mathieu Poirier wrote:
> On Tue, Jan 30, 2024 at 10:13:48AM +0100, Arnaud POULIQUEN wrote:
>>
>>
>> On 1/26/24 18:11, Mathieu Poirier wrote:
>>> On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote:
 The new TEE remoteproc device is used to manage remote firmware in a
 secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
 introduced to delegate the loading of the firmware to the trusted
 execution context. In such cases, the firmware should be signed and
 adhere to the image format defined by the TEE.

 Signed-off-by: Arnaud Pouliquen 
 ---
 V1 to V2 update:
 - remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected by
   the kernel test robot:
  WARNING: unmet direct dependencies detected for TEE_REMOTEPROC
  Depends on [n]: REMOTEPROC [=y] && OPTEE [=n]
  Selected by [y]:
  - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && REMOTEPROC 
 [=y]
 - Fix initialized trproc variable in  stm32_rproc_probe
 ---
  drivers/remoteproc/stm32_rproc.c | 149 +--
  1 file changed, 144 insertions(+), 5 deletions(-)

 diff --git a/drivers/remoteproc/stm32_rproc.c 
 b/drivers/remoteproc/stm32_rproc.c
 index fcc0001e2657..cf6a21bac945 100644
 --- a/drivers/remoteproc/stm32_rproc.c
 +++ b/drivers/remoteproc/stm32_rproc.c
 @@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
 +#include 
  #include 
  
  #include "remoteproc_internal.h"
 @@ -49,6 +50,9 @@
  #define M4_STATE_STANDBY  4
  #define M4_STATE_CRASH5
  
 +/* Remote processor unique identifier aligned with the Trusted Execution 
 Environment definitions */
 +#define STM32_MP1_M4_PROC_ID 0
 +
  struct stm32_syscon {
struct regmap *map;
u32 reg;
 @@ -90,6 +94,8 @@ struct stm32_rproc {
struct stm32_mbox mb[MBOX_NB_MBX];
struct workqueue_struct *workqueue;
bool hold_boot_smc;
 +  bool fw_loaded;
 +  struct tee_rproc *trproc;
void __iomem *rsc_va;
  };
  
 @@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc)
return err;
  }
  
 +static int stm32_rproc_tee_elf_sanity_check(struct rproc *rproc,
 +  const struct firmware *fw)
 +{
 +  struct stm32_rproc *ddata = rproc->priv;
 +  unsigned int ret = 0;
 +
 +  if (rproc->state == RPROC_DETACHED)
 +  return 0;
 +
 +  ret = tee_rproc_load_fw(ddata->trproc, fw);
 +  if (!ret)
 +  ddata->fw_loaded = true;
 +
 +  return ret;
 +}
 +
 +static int stm32_rproc_tee_elf_load(struct rproc *rproc,
 +  const struct firmware *fw)
 +{
 +  struct stm32_rproc *ddata = rproc->priv;
 +  unsigned int ret;
 +
 +  /*
 +   * This function can be called by remote proc for recovery
 +   * without the sanity check. In this case we need to load the firmware
 +   * else nothing done here as the firmware has been preloaded for the
 +   * sanity check to be able to parse it for the resource table.
 +   */
>>>
>>> This comment is very confusing - please consider refactoring.  
>>>
 +  if (ddata->fw_loaded)
 +  return 0;
 +
>>>
>>> I'm not sure about keeping a flag to indicate the status of the loaded 
>>> firmware.
>>> It is not done for the non-secure method, I don't see why it would be 
>>> needed for
>>> the secure one.
>>>
>>
>> The difference is on the sanity check.
>> - in rproc_elf_sanity_check we  parse the elf file to verify that it is
>> valid.
>> - in stm32_rproc_tee_elf_sanity_check we have to do the same, which means
>> authenticating it. The authentication is done during the load.
>>
>> So this flag is used to avoid reloading it twice.
>> Refactoring the comment should help to understand this flag.
>>
>>
>> An alternative would be to bypass the sanity check. But this leads to the same
>> limitation.
>> Before loading the firmware in remoteproc_core, we call rproc_parse_fw() 
>> that is
>> used to get the resource table address. To get it from the TEE we need to
>> authenticate the firmware, and hence load it...
>>
> 
> I spent a long time thinking about this patchset.  Looking at the code as it
> is now, request_firmware() in rproc_boot() is called even when the TEE is
> responsible for loading the firmware.  There should be some conditional code
> that calls either request_firmware() or tee_rproc_load_fw().  The latter 
> should
> also be renamed to tee_rproc_request_firmware() to avoid confusion.


The request_firmware() call is needed in both cases to get the image from the
filesystem. The tee_rproc_load_fw() gets, as input, the struct firmware provided
by request_firmware().

If we want to integrate in remoteproc_core the solution 

Re: [PATCH] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-02-01 Thread Greg KH
On Thu, Feb 01, 2024 at 11:41:56AM +0100, Pavel Machek wrote:
> +#define DEBUG

Please don't.  This is dynamic, use the dynamic debugging and set it
from userspace if you want to debug the driver.

> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/* firmware regs */
> +
> +#define ANX7688_REG_VBUS_OFF_DELAY_TIME 0x22
> +#define ANX7688_REG_FEATURE_CTRL0x27
> +#define ANX7688_REG_EEPROM_LOAD_STATUS1 0x11
> +#define ANX7688_REG_EEPROM_LOAD_STATUS0 0x12
> +#define ANX7688_REG_FW_VERSION1 0x15
> +#define ANX7688_REG_FW_VERSION0 0x16
> +
> +#define ANX7688_EEPROM_FW_LOADED 0x01

Mix of tabs and spaces, please just use tabs.

> +static const char * const anx7688_supply_names[] = {
> +"avdd33",
> +"avdd18",
> +"dvdd18",
> +"avdd10",
> +"dvdd10",
> + "i2c",
> +"hdmi_vt",
> +
> +"vconn", // power for VCONN1/VCONN2 switches
> +"vbus", // vbus power
> +};

Again, tabs vs. spaces, please use checkpatch.

> +#define ANX7688_NUM_SUPPLIES ARRAY_SIZE(anx7688_supply_names)
> +#define ANX7688_NUM_ALWAYS_ON_SUPPLIES (ANX7688_NUM_SUPPLIES - 1)
> +
> +#define ANX7688_I2C_INDEX (ANX7688_NUM_SUPPLIES - 4)
> +#define ANX7688_VCONN_INDEX (ANX7688_NUM_SUPPLIES - 2)
> +#define ANX7688_VBUS_INDEX (ANX7688_NUM_SUPPLIES - 1)
> +
> +enum {
> + ANX7688_F_POWERED,
> + ANX7688_F_CONNECTED,
> + ANX7688_F_FW_FAILED,
> + ANX7688_F_PWRSUPPLY_CHANGE,
> + ANX7688_F_CURRENT_UPDATE,
> +};
> +
> +struct anx7688 {
> +struct device *dev;
> +struct i2c_client *client;
> +struct i2c_client *client_tcpc;
> +struct regulator_bulk_data supplies[ANX7688_NUM_SUPPLIES];
> + struct power_supply *vbus_in_supply;
> + struct notifier_block vbus_in_nb;
> + int input_current_limit; // mA
> +struct gpio_desc *gpio_enable;
> +struct gpio_desc *gpio_reset;
> +struct gpio_desc *gpio_cabledet;

I'm stopping here, again, tabs, you know this :(

greg k-h



Re: [PATCH] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-02-01 Thread Greg KH
On Thu, Feb 01, 2024 at 11:41:56AM +0100, Pavel Machek wrote:
> --- /dev/null
> +++ b/drivers/usb/typec/anx7688.c
> @@ -0,0 +1,1866 @@
> +/*
> + * ANX7688 USB-C HDMI bridge/PD driver
> + *



Did this pass checkpatch?  I need a spdx line for new files please,
don't force us to go back and guess in the future, that isn't nice.

thanks,

greg k-h



Re: [v1] trace/hwlat: stop worker if !is_percpu_thread due to hotplug event

2024-02-01 Thread Steven Rostedt
On Thu,  1 Feb 2024 14:24:55 +0800
Andy Chiu  wrote:

> If the task happens to run after cpu hot-plug offline, then it would not
> be running in a percpu_thread. Instead, it would be re-queued into a
> UNBOUND workqueue. This would trigger a warning if we enable kernel
> preemption.
> 
> Signed-off-by: Andy Chiu 
> ---
>  kernel/trace/trace_hwlat.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
> index b791524a6536..87258ddc2141 100644
> --- a/kernel/trace/trace_hwlat.c
> +++ b/kernel/trace/trace_hwlat.c
> @@ -511,7 +511,16 @@ static int start_cpu_kthread(unsigned int cpu)
>  static void hwlat_hotplug_workfn(struct work_struct *dummy)
>  {
>   struct trace_array *tr = hwlat_trace;
> - unsigned int cpu = smp_processor_id();
> + unsigned int cpu;
> +
> + /*
> +  * If the work is scheduled after CPU hotplug offline being invoked,
> +  * then it would be queued into UNBOUNDED workqueue

Is it due to a CPU going to online and then right back to offline? This
function is called by the online path.

From the queue_work_on() comment:

/**
 * queue_work_on - queue work on specific cpu
 * @cpu: CPU number to execute work on
 * @wq: workqueue to use
 * @work: work to queue
 *
 * We queue the work to a specific CPU, the caller must ensure it
 * can't go away.  Callers that fail to ensure that the specified
 * CPU cannot go away will execute on a randomly chosen CPU.
 * But note well that callers specifying a CPU that never has been
 * online will get a splat.
 *
 * Return: %false if @work was already on a queue, %true otherwise.
 */

So the work gets queued on the CPU but then the CPU immediately goes
offline. I wonder what happens if it comes back online?

The above states that if the work is already queued on a CPU it won't queue
it again. Hmm, reading the comments, I'm not sure you can run the same
workqueue on multiple CPUs. It looks to do only one and all the others will
fail.

I think we need to change this to a global work queue and use CPU masks.
Where in the *cpu_init() function, it just sets the current CPU bit in a
bit mask and calls the workqueue, and that workqueue is responsible for all
CPUs.

Then the workqueue function will call cpus_read_lock() and just iterate the
CPU mask starting the threads for each of the bits that are set and clear
the bit (possibly with __test_and_clear_bit).

I don't think the schedule_on_cpu() is doing what we think it should be
doing.


-- Steve


> +  */
> + if (!is_percpu_thread())
> + return;
> +
> + cpu = smp_processor_id();
>  
>   mutex_lock(&trace_types_lock);
>   mutex_lock(&hwlat_data.lock);




Re: [v1] trace/osnoise: prevent osnoise hotplug worker running in UNBOUND workqueue

2024-02-01 Thread Steven Rostedt
On Thu,  1 Feb 2024 14:18:45 +0800
Andy Chiu  wrote:

> smp_processor_id() should be called with migration disabled. This means
> we may safely call smp_processor_id() in a percpu thread. However, this is
> not the case if the work is (re-)queued into an unbound workqueue during
> cpu-hotplug. So, detect and return early if this work happens to run on
> an unbound wq.
> 

Yeah I triggered this too, but never made it a priority to fix.


> Signed-off-by: Andy Chiu 
> ---
>  kernel/trace/trace_osnoise.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
> index bd0d01d00fb9..cf7f716d3f35 100644
> --- a/kernel/trace/trace_osnoise.c
> +++ b/kernel/trace/trace_osnoise.c
> @@ -2068,7 +2068,12 @@ static int start_per_cpu_kthreads(void)
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void osnoise_hotplug_workfn(struct work_struct *dummy)
>  {
> - unsigned int cpu = smp_processor_id();
> + unsigned int cpu;
> +
> + if (!is_percpu_thread())
> + return;

But doesn't this then fail to register the CPU thread?

I wonder if this has some race with schedule_work_on() and hotplug?

-- Steve



> +
> + cpu = smp_processor_id();
>  
>   mutex_lock(&trace_types_lock);
>  




Re: [RFC PATCH] kernel/module: add a safer implementation of try_module_get()

2024-02-01 Thread Marco Pagani



On 2024-01-30 21:47, Luis Chamberlain wrote:
> On Tue, Jan 30, 2024 at 08:36:14PM +0100, Marco Pagani wrote:
>> The current implementation of try_module_get() requires the module to
>> exist and be live as a precondition. While this may seem intuitive at
>> first glance, enforcing the precondition can be tricky, considering that
>> modules can be unloaded at any time if not previously taken. For
>> instance, the caller could be preempted just before calling
>> try_module_get(), and while preempted, the module could be unloaded and
>> freed. More subtly, the module could also be unloaded at any point while
>> executing try_module_get() before incrementing the refcount with
>> atomic_inc_not_zero().
>>
>> Neglecting the precondition that the module must exist and be live can
>> cause unexpected race conditions that can lead to crashes. However,
>> ensuring that the precondition is met may require additional locking
>> that increases the complexity of the code and can make it more
>> error-prone.
>>
>> This patch adds a slower yet safer implementation of try_module_get()
>> that checks if the module is valid by looking into the mod_tree before
>> taking the module's refcount. This new function can be safely called on
>> stale and invalid module pointers, relieving developers from the burden
>> of ensuring that the module exists and is live before attempting to take
>> it.
>>
>> The tree lookup and refcount increment are executed after taking the
>> module_mutex to prevent the module from being unloaded after looking up
>> the tree.
>>
>> Signed-off-by: Marco Pagani 
> 
> It very much sounds like there is a desire to have this but without a
> user, there is no justification.

I was working on a set of patches to fix an issue in the fpga subsystem
when I came across your commit 557aafac1153 ("kernel/module: add
documentation for try_module_get()") that made me realize we also had a
safety problem. 

To solve this problem for the fpga manager, we had to add a mutex to
ensure the low-level module still exists before calling
try_module_get(). However, having a safer version of try_module_get()
would have simplified the code and made it more robust against changes.

https://lore.kernel.org/linux-fpga/2024060242.149265-1-marpa...@redhat.com/

I suspect there may be other cases where try_module_get() is
inadvertently called without ensuring that the module still exists
that may benefit from a safer implementation.

>> +bool try_module_get_safe(struct module *module)
>> +{
>> +struct module *mod;
>> +bool ret = true;
>> +
>> +if (!module)
>> +goto out;
>> +
>> +mutex_lock(&module_mutex);
> 
> If a user comes around then this should be mutex_lock_interruptible(),
> and add might_sleep()

Would it be okay to return false if it gets interrupted, or should I
change the return type to int to propagate -EINTR? My concern with
changing the signature is that it would be less straightforward to
use the function in place of try_module_get().

>> +
>> +/*
>> + * Check if the address points to a valid live module and take
>> + * the refcount only if it points to the module struct.
>> + */
>> +mod = __module_address((unsigned long)module);
>> +if (mod && mod == module && module_is_live(mod))
>> +__module_get(mod);
>> +else
>> +ret = false;
>> +
>> +mutex_unlock(&module_mutex);
>> +
>> +out:
>> +return ret;
>> +}
>> +EXPORT_SYMBOL(try_module_get_safe);
> 
> And EXPORT_SYMBOL_GPL() would need to be used.

Okay, I initially used EXPORT_SYMBOL() to be compatible with
try_module_get().

> 
> I'd also expect selftests to be expanded for this case, but again,
> without a user, this is just trying to resolve a problem which does not
> exist.

I can add selftests in the next versions.
Thanks,
Marco




[ANNOUNCE] 4.14.336-rt159

2024-02-01 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 4.14.336-rt159 stable release.

Please remember that this is the LAST 4.14.y-rt kernel to be released.
Refer to Greg Kroah-Hartman's announce of 4.14.336 for more information:

  https://lore.kernel.org/lkml/2024011046-ecology-tiptoeing-ce50@gregkh/

Again, 4.14.y is officially end-of-life. So is the RT counterpart.

This release is just an update to the new stable 4.14.336 version and
no RT-specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v4.14-rt
  Head SHA1: 998ff9c899891ef962ca1cb42bf807dc5d223ebf

Or to build 4.14.336-rt159 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.14.tar.xz

  https://www.kernel.org/pub/linux/kernel/v4.x/patch-4.14.336.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/4.14/older/patch-4.14.336-rt159.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH v2] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

2024-02-01 Thread Paul E. McKenney
On Sat, Jan 27, 2024 at 06:09:05PM +0800, Chen Zhongjin wrote:
> On 2024/1/20 23:30, Paul E. McKenney wrote:

(Apologies for the delay, despite my attempts to make it otherwise,
your email still got dumped into my spam folder.)

> Hi Paul,
> This patch works for my reproduce test case.

Thank you!!!

> Just a small question, if you don't mind: this problem exists on LTS versions,
> but we only had struct rcu_tasks_percpu after v5.17. We can't backport this patch
> to LTS 5.10 or 4.19, which are still under maintenance.
> If you have any idea or plan to apply this patch to older versions, please
> tell me, thanks very much!

It should be possible to hand-apply the patch.  Or to backport additional
patches to make this one apply cleanly.   Note that in v4.19, the code
is in kernel/rcu/update.c rather than its new home in kernel/rcu/tasks.h.

> Anyway, it's ok to apply this patch to mainline.

May I have your Tested-by?

Thanx, Paul

> Best,
> Chen
> 
> > > Again, that comment reads in full as follows:
> > > 
> > >   /*
> > >* Step 2: Wait for quiesence period to ensure all potentially
> > >* preempted tasks to have normally scheduled. Because optprobe
> > >* may modify multiple instructions, there is a chance that Nth
> > >* instruction is preempted. In that case, such tasks can return
> > >* to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
> > >* Note that on non-preemptive kernel, this is transparently converted
> > >* to synchronize_sched() to wait for all interrupts to have completed.
> > >*/
> > > 
> > > Please note well that first sentence.
> > > 
> > > Unless that first sentence no longer holds, this patch cannot work
> > > because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
> > > preempted tasks.
> > > 
> > > So how to safely break this deadlock?  Reproducing Chen Zhongjin's
> > > diagram:
> > > 
> > > pid A                      pid B                   pid C
> > > kprobe_optimizer()         do_exit()               perf_kprobe_init()
> > > mutex_lock(&kprobe_mutex)  exit_tasks_rcu_start()  mutex_lock(&kprobe_mutex)
> > > synchronize_rcu_tasks()    zap_pid_ns_processes()  // waiting kprobe_mutex
> > > // waiting tasks_rcu_exit_srcu   kernel_wait4()
> > >                                  // waiting pid C exit
> > > 
> > > We need to stop synchronize_rcu_tasks() from waiting on tasks like
> > > pid B that are voluntarily blocked.  One way to do that is to replace
> > > SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
> > > current task to this list and does ...
> > > 
> > > OK, this is getting a bit involved.  If you would like to follow along,
> > > please feel free to look here:
> > > 
> > > https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing
> > 
> > And please see below for a prototype patch, which passes moderate
> > rcutorture testing.
> > 
> > Chen Zhongjin, does this prevent the deadlock you have been seeing?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 113fe0eeabe7a83e87d638d44b9e1d0f9691b146
> > Author: Paul E. McKenney 
> > Date:   Sat Jan 20 07:07:08 2024 -0800
> > 
> >  rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
> >  Holding a mutex across synchronize_rcu_tasks() and acquiring
> >  that same mutex in code called from do_exit() after its call to
> >  exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
> >  results in deadlock.  This is by design, because tasks that are far
> >  enough into do_exit() are no longer present on the tasks list, making
> >  it a bit difficult for RCU Tasks to find them, let alone wait on them
> >  to do a voluntary context switch.  However, such deadlocks are becoming
> >  more frequent.  In addition, lockdep currently does not detect such
> >  deadlocks and they can be difficult to reproduce.
> >  In addition, if a task voluntarily context switches during that time
> >  (for example, if it blocks acquiring a mutex), then this task is in an
> >  RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
> >  just as well take advantage of that fact.
> >  This commit therefore eliminates these deadlocks by replacing the
> >  SRCU-based wait for do_exit() completion with per-CPU lists of tasks
> >  currently exiting.  A given task will be on one of these per-CPU lists 
> > for
> >  the same period of time that this task would previously have been in 
> > the
> >  previous SRCU read-side critical section.  These lists enable RCU Tasks
> >  to find the tasks that have already been removed from the tasks list,
> >  but that must nevertheless be waited upon.
> >  The RCU Tasks grace period gathers any of these 

Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread Jiri Pirko
Thu, Feb 01, 2024 at 01:57:16PM CET, john.g.ga...@oracle.com wrote:
>On 31/01/2024 19:24, Jakub Kicinski wrote:
>> On Wed, 31 Jan 2024 10:48:50 + John Garry wrote:
>> > Instead of using UTS_RELEASE, use uts_release, which means that we don't
>> > need to rebuild the code just for the git head commit changing.
>> > 
>> > Signed-off-by: John Garry
>> Yes, please!
>> 
>> Acked-by: Jakub Kicinski
>
>Cheers
>
>BTW, I assume that changes like this are also ok:
>
>8<-
>
>   net: team: Don't bother filling in ethtool driver version
>
>   The version is the same as the default, so don't bother filling it in.
>
>   Signed-off-by: John Garry 
>
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index f575f225d417..0a44bbdcfb7b 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -25,7 +25,6 @@
>#include 
>#include 
>#include 
>-#include 
>#include 
>
>#define DRV_NAME "team"
>@@ -2074,7 +2073,6 @@ static void team_ethtool_get_drvinfo(struct
>net_device *dev,
>struct ethtool_drvinfo *drvinfo)
>{
>   strscpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
>-   strscpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));

Yeah. I don't see why anyone should care about this "version"...
Let's remove it.

>}
>
>>8-
>
>right?
>
>John
>
>
>
>
>



[ANNOUNCE] 5.10.209-rt101

2024-02-01 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.209-rt101 stable release.

This release is just an update to the new stable 5.10.209
version and no RT-specific changes have been made.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 2ff8470513b9651ecd42aaa0983ed13841b2fb1b

Or to build 5.10.209-rt101 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.209.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.209-rt101.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation

2024-02-01 Thread Paolo Abeni
On Tue, 2024-01-30 at 19:37 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order 3 page; if that fails,
> they fall back to allocating an order 0 page, and each of
> them allows the order 3 allocation to fail under certain
> conditions by using specific gfp bits.
> 
> The gfp bits for order 3 page allocation differ between
> the implementations: __GFP_NOMEMALLOC is or'd in to forbid
> access to the emergency memory reserves in
> __page_frag_cache_refill(), but it is not or'd in the other
> implementations; __GFP_DIRECT_RECLAIM is masked off to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> masked off in __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between the different
> implementations by or'ing __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order 3 page allocation, to avoid
> possible pressure on mm.
> 
> Signed-off-by: Yunsheng Lin 
> Reviewed-by: Alexander Duyck 
> CC: Alexander Duyck 
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c | 4 ++--
>  net/core/sock.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
> *net, unsigned int sz,
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);

>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0f7e67c4250..636145c29f70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct 
> page_frag_cache *nc,
>   gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> - __GFP_NOMEMALLOC;
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>   page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>   PAGE_FRAG_CACHE_MAX_ORDER);
>   nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 88bf810394a5..8289a3d8c375 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
> page_frag *pfrag, gfp_t gfp)
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);

This will prevent memory reserve usage when allocating order 3 pages,
but not when allocating a single page as a fallback. Still different
from the __page_frag_cache_refill() allocator - which never accesses
the memory reserves.

I'm unsure we want to propagate the __page_frag_cache_refill() behavior
here; the current behavior could be required by some systems.

It looks like this series still leaves the skb_page_frag_refill()
allocator alone; what about dropping this chunk, too?

Thanks!

Paolo




Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry

On 31/01/2024 19:24, Jakub Kicinski wrote:

On Wed, 31 Jan 2024 10:48:50 + John Garry wrote:

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

Yes, please!

Acked-by: Jakub Kicinski


Cheers

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version

   The version is the same as the default, so don't bother.

   Signed-off-by: John Garry 

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f575f225d417..0a44bbdcfb7b 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -25,7 +25,6 @@
#include 
#include 
#include 
-#include 
#include 

#define DRV_NAME "team"
@@ -2074,7 +2073,6 @@ static void team_ethtool_get_drvinfo(struct
net_device *dev,
struct ethtool_drvinfo *drvinfo)
{
   strscpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
-   strscpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
}

>8-

right?

John







Boot-time dumping of ftrace function_graph buffer

2024-02-01 Thread Ahmad Fatoum
Hello,

I semi-regularly debug probe failures. For drivers that use dev_err_probe
rigorously, this is a quick matter: The probe function records a deferral reason
and if the deferral persists, deferred_probe_timeout_work_func() will print
the collected reasons, even if PID 1 is never started.

For drivers that don't call dev_err_probe, I find myself sometimes doing printf
debugging inside the probe function.

I would like to replace this with the function graph tracer:

  - record the probe function, configured over kernel command line
(The device indefinitely deferring probe is printed to the console,
 so I know what I am looking for on the next boot)

  - Dump the function graph trace

  - See if the last call before (non-devm) cleanup is getting a clock, a GPIO,
a regulator or w/e.

For this to be maximally useful, I need to configure this not only at boot-time,
but also dump the ftrace buffer at boot time. Probe deferral can hinder the
kernel from calling init and providing a shell, where I could read
/sys/kernel/tracing/trace.

I found following two mechanisms that looked relevant, but seem not to
do exactly what I want:

  - tp_printk: seems to be related to trace points only and not usable
for the function graph output

  - dump_on_oops: I don't get an Oops if probe deferral times out, but maybe
    one could patch the kernel to check an oops_on_probe_deferral or
    dump_on_probe_deferral kernel command line parameter in
    deferred_probe_timeout_work_func()?


Is there existing support that I am missing? Any input on whether this
would be a welcome feature to have?
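For the recording half, the stock boot parameters from
Documentation/trace/ftrace.rst should already arm the graph tracer at boot
(the probe function name below is just a made-up example):

```
ftrace=function_graph ftrace_graph_filter=anx7688_i2c_probe
```

It's the dump path when init never runs that I don't see covered.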

Thanks!

Cheers,
Ahmad

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |
 



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost experience) opinion the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check if this causes problems with
> > > other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The schedule of the kworker was not caused by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

I don't really understand what is special about vhost though.
Wouldn't it apply to any kernel code?

> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).

I am guessing once you post it you will see feedback.

> For patch 2, I would prefer to do some more testing first to see if this
> might have a negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Tobias Huschle
On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> >  Summary 
> > 
> > In my (non-vhost experience) opinion the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if a data transfer was successfully
> > initiated. It will be necessary to check if this causes problems with
> > other workloads/benchmarks.
> 
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice to
> others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept.
But, in this particular scenario we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag 
1. vhost initiates data transfer
2. kworker is woken up
-3a. EEVDF does not schedule kworker if it has negative lag
-4a. vhost continues running, kworker on same CPU starves
--
-3b. EEVDF schedules kworker if it has positive or no lag
-4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the
necessary flag. The flag can only be set by another CPU now.
The schedule of the kworker was not caused by cond_resched, but
rather by the wakeup path of the scheduler.

cond_resched works successfully once the load balancer (I suppose) 
decides to migrate the vhost off to another CPU. In that case, the
load balancer on another CPU sets that flag and we are good.
That then eventually allows the scheduler to pick kworker, but very
late.

> I propose a two patch series then:
> 
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> replace cond_resched -> schedule
> 
> 
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
> 
> patch 2: make this change for vhost.
> 
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone else
from the scheduler maintainers).
For patch 2, I would prefer to do some more testing first to see if this might
have a negative effect on other benchmarks.

I also stumbled upon something in the scheduler code that I want to verify.
Maybe a cgroup thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix, check the
cgroup thing and wait for Peter then.
Will get back if any of the above yields some results.

> 
> -- 
> MST
> 
> 



Re: [PATCH 0/4] apply page shift to PFN instead of VA in pfn_to_virt

2024-02-01 Thread Geert Uytterhoeven
Hi Arnd,

On Thu, Feb 1, 2024 at 11:11 AM Arnd Bergmann  wrote:
> I think it's fair to assume we won't need asm-generic/page.h any
> more, as we likely won't be adding new NOMMU architectures.

So you think riscv-nommu (k210) was the last one we will ever see?

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



[PATCH] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-02-01 Thread Pavel Machek
From: Ondrej Jirman 

This is a driver for the ANX7688 USB-C HDMI bridge, with flashing and
debugging features removed. The ANX7688 is a rather critical piece on the
PinePhone; there's no display and no battery charging without it.

There's likely more work to be done here, but having basic support
in mainline is needed to be able to work on the other stuff
(networking, cameras, power management).

Signed-off-by: Ondrej Jirman 
Signed-off-by: Martijn Braam 
Signed-off-by: Samuel Holland 
Signed-off-by: Pavel Machek 

diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig
index 2f80c2792dbd..982b7c444a1f 100644
--- a/drivers/usb/typec/Kconfig
+++ b/drivers/usb/typec/Kconfig
@@ -64,6 +65,17 @@ config TYPEC_ANX7411
  If you choose to build this driver as a dynamically linked module, the
  module will be called anx7411.ko.
 
+config TYPEC_ANX7688
+   tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver"
+   depends on I2C
+   depends on USB_ROLE_SWITCH
+   help
+ Say Y or M here if your system has an Analogix ANX7688 Type-C Bridge
+ controller.
+
+ If you choose to build this driver as a dynamically linked module, the
+ module will be called anx7688.ko.
+
 config TYPEC_RT1719
tristate "Richtek RT1719 Sink Only Type-C controller driver"
depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH
diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile
index 7a368fea61bc..3f8ff94ad294 100644
--- a/drivers/usb/typec/Makefile
+++ b/drivers/usb/typec/Makefile
@@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM)+= tcpm/
 obj-$(CONFIG_TYPEC_UCSI)   += ucsi/
 obj-$(CONFIG_TYPEC_TPS6598X)   += tipd/
 obj-$(CONFIG_TYPEC_ANX7411)+= anx7411.o
+obj-$(CONFIG_TYPEC_ANX7688)+= anx7688.o
 obj-$(CONFIG_TYPEC_HD3SS3220)  += hd3ss3220.o
 obj-$(CONFIG_TYPEC_STUSB160X)  += stusb160x.o
 obj-$(CONFIG_TYPEC_RT1719) += rt1719.o
diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c
new file mode 100644
index ..9b693c9fd085
--- /dev/null
+++ b/drivers/usb/typec/anx7688.c
@@ -0,0 +1,1866 @@
+/*
+ * ANX7688 USB-C HDMI bridge/PD driver
+ *
+ * Warning, this driver is somewhat PinePhone specific.
+ *
+ * How this works:
+ * - this driver allows to program firmware into ANX7688 EEPROM, and
+ *   initialize it
+ * - it then communicates with the firmware running on the OCM (on-chip
+ *   microcontroller)
+ * - it detects whether there is a cable plugged in or not and powers
+ *   up or down the ANX7688 based on that
+ * - when the cable is connected the firmware on the OCM will handle
+ *   the detection of the nature of the device on the other end
+ *   of the USB-C cable
+ * - this driver then communicates with the USB phy to let it swap
+ *   data roles accordingly
+ * - it also enables VBUS and VCONN regulators as appropriate
+ * - USB phy driver (Allwinner) needs to know whether to switch to
+ *   device or host mode, or whether to turn off
+ * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins
+ *   or something else via PD, it notifies this driver via software
+ *   interrupt and this driver will determine how to update the TypeC
+ *   port status and what input current limit is appropriate
+ * - input current limit determination happens 500ms after cable
+ *   insertion or hard reset (delay is necessary to determine whether
+ *   the remote end is PD capable or not)
+ * - this driver tells to the PMIC driver that the input current limit
+ *   needs to be changed
+ * - this driver also monitors PMIC status and re-sets the input current
+ *   limit if it changes for some reason (due to PMIC internal decision
+ *   making) (this is disabled for now)
+ *
+ * ANX7688 FW behavior as observed:
+ *
+ * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what
+ *   you set and send hardcoded PDO_BATT 5-21V 30W message!
+ *
+ * Product brief:
+ * 
https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf
+ * Development notes:
+ * https://xnux.eu/devices/feature/anx7688.html
+ */
+
+#define DEBUG
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* firmware regs */
+
+#define ANX7688_REG_VBUS_OFF_DELAY_TIME 0x22
+#define ANX7688_REG_FEATURE_CTRL0x27
+#define ANX7688_REG_EEPROM_LOAD_STATUS1 0x11
+#define ANX7688_REG_EEPROM_LOAD_STATUS0 0x12
+#define ANX7688_REG_FW_VERSION1 0x15
+#define ANX7688_REG_FW_VERSION0 0x16
+
+#define ANX7688_EEPROM_FW_LOADED   0x01
+
+#define ANX7688_REG_STATUS_INT_MASK 0x17
+#define ANX7688_REG_STATUS_INT  0x28
+#define ANX7688_IRQS_RECEIVED_MSG   BIT(0)
+#define ANX7688_IRQS_RECEIVED_ACK   BIT(1)
+#define ANX7688_IRQS_VCONN_CHANGE   BIT(2)
+#define ANX7688_IRQS_VBUS_CHANGEBIT(3)
+#define ANX7688_IRQS_CC_STATUS_CHANGE   BIT(4)
+#define 

Re: [PATCH v1] module.h: define __symbol_get_gpl() as a regular __symbol_get()

2024-02-01 Thread Andrew Kanner
On Thu, Feb 01, 2024 at 06:29:58AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 31, 2024 at 10:02:52PM +0300, Andrew Kanner wrote:
> > Prototype for __symbol_get_gpl() was introduced in the initial git
> > commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), but was not used after that.
> > 
> > In commit 9011e49d54dc ("modules: only allow symbol_get of
> > EXPORT_SYMBOL_GPL modules") Christoph Hellwig switched __symbol_get()
> > to process GPL symbols only, most likely this is what
> > __symbol_get_gpl() was designed to do.
> > 
> > We might either define __symbol_get_gpl() as __symbol_get() or remove
> > it completely as suggested by Mauro Carvalho Chehab.
> 
> Just remove it, there is no need to keep unused functionality around.
> 
> Btw, where did the discussion start?  I hope you're not trying to
> add new symbol_get users?
> 

Of course not, no new users needed.

I haven't discussed it directly. I found the unused __symbol_get_gpl()
myself, but while investigating whether it was ever used anywhere I
found the old patch series suggested by Mauro Carvalho Chehab (in Cc).

Link: 
https://lore.kernel.org/lkml/5f001015990a76c0da35a4c3cf08e457ec353ab2.1652113087.git.mche...@kernel.org/

The patch series is from 2022 and not merged. You can take [PATCH v6
1/4] which removes the unused symbol from the link.

Or I can resend v2 with my commit msg. But I'm not sure how it works
in such a case - will adding a Suggested-by tag (if no objections from
Mauro) along with the Link be OK?

-- 
Andrew Kanner



Re: [PATCH RFC v3 30/35] arm64: mte: ptrace: Handle pages with missing tag storage

2024-02-01 Thread Anshuman Khandual



On 1/25/24 22:12, Alexandru Elisei wrote:
> A page can end up mapped in a MTE enabled VMA without the corresponding tag
> storage block reserved. Tag accesses made by ptrace in this case can lead
> to the wrong tags being read or memory corruption for the process that is
> using the tag storage memory as data.
> 
> Reserve tag storage by treating ptrace accesses like a fault.
> 
> Signed-off-by: Alexandru Elisei 
> ---
> 
> Changes since rfc v2:
> 
> * New patch, issue reported by Peter Collingbourne.
> 
>  arch/arm64/kernel/mte.c | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index faf09da3400a..b1fa02dad4fd 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, 
> unsigned long addr,
>   while (len) {
>   struct vm_area_struct *vma;
>   unsigned long tags, offset;
> + unsigned int fault_flags;
> + struct page *page;
> + vm_fault_t ret;
>   void *maddr;
> - struct page *page = get_user_page_vma_remote(mm, addr,
> -  gup_flags, );
>  
> +get_page:
> + page = get_user_page_vma_remote(mm, addr, gup_flags, );

But if there is a valid page returned here in the first GUP attempt, will
there still be a subsequent handle_mm_fault() on the same vma and addr?

>   if (IS_ERR(page)) {
>   err = PTR_ERR(page);
>   break;
> @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, 
> unsigned long addr,
>   put_page(page);
>   break;
>   }
> +
> + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {

Shouldn't '!page' be checked here as well?

> + fault_flags = FAULT_FLAG_DEFAULT | \
> +   FAULT_FLAG_USER | \
> +   FAULT_FLAG_REMOTE | \
> +   FAULT_FLAG_ALLOW_RETRY | \
> +   FAULT_FLAG_RETRY_NOWAIT;
> + if (write)
> + fault_flags |= FAULT_FLAG_WRITE;
> +
> + put_page(page);
> + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> + if (ret & VM_FAULT_ERROR) {
> + err = -EFAULT;
> + break;
> + }
> + goto get_page;
> + }
> +
>   WARN_ON_ONCE(!page_mte_tagged(page));
>  
>   /* limit access to the end of the page */



Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-02-01 Thread Maxime Coquelin




On 2/1/24 09:40, Michael S. Tsirkin wrote:

On Thu, Feb 01, 2024 at 09:34:11AM +0100, Maxime Coquelin wrote:

Hi Jason,

It looks like all patches got acked by you.
Any blocker to queue the series for next release?

Thanks,
Maxime


I think it's good enough at this point. Will put it in
linux-next shortly.



Thanks Michael!




Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 09:34:11AM +0100, Maxime Coquelin wrote:
> Hi Jason,
> 
> It looks like all patches got acked by you.
> Any blocker to queue the series for next release?
> 
> Thanks,
> Maxime

I think it's good enough at this point. Will put it in
linux-next shortly.




Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-02-01 Thread Maxime Coquelin

Hi Jason,

It looks like all patches got acked by you.
Any blocker to queue the series for next release?

Thanks,
Maxime

On 1/9/24 12:10, Maxime Coquelin wrote:

This small series enables the virtio-net device type in VDUSE.
With it, basic operation has been tested, both with
virtio-vdpa and vhost-vdpa, using the DPDK Vhost library series
adding VDUSE support with split rings layout (merged in
DPDK v23.07-rc1).

Control queue support (and so multiqueue) has also been
tested, but requires a kernel series from Jason Wang
relaxing control queue polling [1] to function reliably,
so until Jason's rework is done, a patch is added to fail init
if the control queue feature is requested (also tested with DPDK
v23.11).


Changes in v7:
==
- Fail init only if VIRTIO_NET_F_CTRL_VQ is requested.
- Convert to use BIT_ULL() macro
- Rebased

Changes in v6:
==
- Remove SELinux support from the series, will be handled
   in a dedicated one.
- Require CAP_NET_ADMIN for Net devices creation (Jason).
- Fail init if control queue features are requested for
   Net device type (Jason).
- Rebased on latest master.

Changes in v5:
==
- Move control queue disablement patch before Net
   devices enablement (Jason).
- Unify operations LSM hooks into a single hook.
- Rebase on latest master.

Maxime Coquelin (3):
   vduse: validate block features only with block devices
   vduse: Temporarily fail if control queue feature requested
   vduse: enable Virtio-net device type

  drivers/vdpa/vdpa_user/vduse_dev.c | 24 
  1 file changed, 20 insertions(+), 4 deletions(-)






Re: [PATCH RFC v3 31/35] khugepaged: arm64: Don't collapse MTE enabled VMAs

2024-02-01 Thread Anshuman Khandual



On 1/25/24 22:12, Alexandru Elisei wrote:
> copy_user_highpage() will do memory allocation if there are saved tags for
> the destination page, and the page is missing tag storage.
> 
> After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
> rcu_read_unlock()s"), collapse_huge_page() calls
> __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock
> held, which means that copy_user_highpage() can only allocate memory using
> GFP_ATOMIC or equivalent.
> 
> Get around this by refusing to collapse pages into a transparent huge page
> if the VMA is MTE-enabled.

Makes sense when copy_user_highpage() will allocate memory for tag storage.

> 
> Signed-off-by: Alexandru Elisei 
> ---
> 
> Changes since rfc v2:
> 
> * New patch. I think an agreement on whether copy*_user_highpage() should be
> always allowed to sleep, or should not be allowed, would be useful.

This is a good question! Even after preventing the collapse of MTE VMAs here,
there might still be more paths where a sleeping (i.e. memory-allocating)
copy*_user_highpage() becomes problematic?

> 
>  arch/arm64/include/asm/pgtable.h| 3 +++
>  arch/arm64/kernel/mte_tag_storage.c | 5 +
>  include/linux/khugepaged.h  | 5 +
>  mm/khugepaged.c | 4 
>  4 files changed, 17 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 87ae59436162..d0473538c926 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask)
>   return true;
>  }
>  
> +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long 
> address);
> +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate
> +
>  #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
>  #endif /* CONFIG_ARM64_MTE */
>  
> diff --git a/arch/arm64/kernel/mte_tag_storage.c 
> b/arch/arm64/kernel/mte_tag_storage.c
> index ac7b9c9c585c..a99959b70573 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t 
> gfp)
>   if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
>   reserve_tag_storage(page, order, gfp);
>  }
> +
> +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long 
> address)
> +{
> + return !(vma->vm_flags & VM_MTE);
> +}
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index f68865e19b0b..461e4322dff2 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm)
>   if (test_bit(MMF_VM_HUGEPAGE, >flags))
>   __khugepaged_exit(mm);
>  }
> +
> +#ifndef arch_hugepage_vma_revalidate
> +#define arch_hugepage_vma_revalidate(vma, address) 1

Please replace s/1/true, as arch_hugepage_vma_revalidate() returns bool?

> +#endif

Right, the above construct is much better than a __HAVE_ARCH_-based one.

> +
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct 
> *oldmm)
>  {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..cb9a9ddb4d86 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, 
> unsigned long address,
>*/
>   if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
>   return SCAN_PAGE_ANON;
> +
> + if (!arch_hugepage_vma_revalidate(vma, address))
> + return SCAN_VMA_CHECK;
> +
>   return SCAN_SUCCEED;
>  }
>  

Otherwise this LGTM.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > - Along with the wakeup of the kworker, need_resched needs to
> > >   be set, such that cond_resched() triggers a reschedule.
> > 
> > Let's try this? Does not look like discussing vhost itself will
> > draw attention from scheduler guys but posting a scheduling
> > patch probably will? Can you post a patch?
> 
> As a baseline, I verified that the following two options fix
> the regression:
> 
> - replacing the cond_resched in the vhost_worker function with a hard
>   schedule 
> - setting the need_resched flag using set_tsk_need_resched(current)
>   right before calling cond_resched
> 
> I then tried to find a better spot to put the set_tsk_need_resched
> call. 
> 
> One approach I found to be working is setting the need_resched flag 
> at the end of handle_tx and hande_rx.
> This would be after data has been actually passed to the socket, so 
> the originally blocked kworker has something to do and will profit
> from the reschedule. 
> It might be possible to go deeper and place the set_tsk_need_resched
> call to the location right after actually passing the data, but this
> might leave us with sprinkling that call in multiple places and
> might be too intrusive.
> Furthermore, it might be possible to check if an error occurred when
> preparing the transmission and then skip the setting of the flag.
> 
> This would require a conceptual decision on the vhost side.
> This solution would not touch the scheduler, only incentivise it to
> do the right thing for this particular regression.
> 
> Another idea could be to find the counterpart that initiates the
> actual data transfer, which I assume wakes up the kworker. From
> what I gather it seems to be an eventfd notification that ends up
> somewhere in the qemu code. Not sure if that context would allow
> to set the need_resched flag, nor whether this would be a good idea.
> 
> > 
> > > - On cond_resched(), verify if the consumed runtime of the caller
> > >   is outweighing the negative lag of another process (e.g. the 
> > >   kworker) and schedule the other process. Introduces overhead
> > >   to cond_resched.
> > 
> > Or this last one.
> 
> On cond_resched itself, this will probably only be possible in a very 
> very hacky way. That is because currently, there is no immediate access
> to the necessary data available, which would make it necessary to 
> bloat up the cond_resched function quite a bit, with a probably 
> non-negligible amount of overhead.
> 
> Changing other aspects in the scheduler might get us in trouble as
> they all would probably resolve back to the question "What is the magic
> value that determines whether a small task not being scheduled justifies
> setting the need_resched flag for a currently running task or adjusting 
> its lag?". As this would then also have to work for all non-vhost related
> cases, this looks like a dangerous path to me on second thought.
> 
> 
>  Summary 
> 
> In my (non-vhost experience) opinion the way to go would be either
> replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if a data transfer was successfully
> initiated. It will be necessary to check if this causes problems with
> other workloads/benchmarks.

Yes but conceptually I am still in the dark on whether the fact that
periodically invoking cond_resched is no longer sufficient to be nice to
others is a bug, or intentional.  So you feel it is intentional?
I propose a two patch series then:

patch 1: in this text in Documentation/kernel-hacking/hacking.rst

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


replace cond_resched -> schedule


Since apparently cond_resched is no longer sufficient to
make the scheduler check whether you need to give up the CPU.

patch 2: make this change for vhost.

WDYT?

-- 
MST