[PATCH] modules: Drop the .export_symbol section from the final modules

2024-04-16 Thread wangyao
From: Wang Yao 

Commit ddb5cdbafaaa ("kbuild: generate KSYMTAB entries by modpost")
forget drop the .export_symbol section from the final modules.

Signed-off-by: Wang Yao 
---
 scripts/module.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index bf5bcf2836d8..89ff01a22634 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S
@@ -13,6 +13,7 @@ SECTIONS {
/DISCARD/ : {
*(.discard)
*(.discard.*)
+   *(.export_symbol)
}
 
__ksymtab   0 : { *(SORT(___ksymtab+*)) }
-- 
2.27.0




Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space

2024-04-16 Thread Mike Rapoport
(added linux-mm)

On Wed, Apr 10, 2024 at 01:56:12PM -0400, Steven Rostedt wrote:
> 
> Hi Andrew, et.al.
> 
> Linus said it's a hard requirement that this code gets an Acked-by (or
> Reviewed-by) from the memory sub-maintainers before he will accept it.
> He was upset that we faulted in pages one at a time instead of mapping it
> in one go:
> 
>   
> https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w...@mail.gmail.com/
> 
> Could you take a look at patches 1-3 to make sure they look sane from a
> memory management point of view?
> 
> I really want this applied in the next merge window.
> 
> Thanks!
> 
> -- Steve
> 
> 
> On Sat,  6 Apr 2024 18:36:44 +0100
> Vincent Donnefort  wrote:
> 
> > The tracing ring-buffers can be stored on disk or sent to network
> > without any copy via splice. However the later doesn't allow real time
> > processing of the traces. A solution is to give userspace direct access
> > to the ring-buffer pages via a mapping. An application can now become a
> > consumer of the ring-buffer, in a similar fashion to what trace_pipe
> > offers.
> > 
> > Support for this new feature can already be found in libtracefs from
> > version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.
> > 
> > Vincent
> > 
> > v19 -> v20:
> >   * Fix typos in documentation.
> >   * Remove useless mmap open and fault callbacks.
> >   * add mm.h include for vm_insert_pages
> > 
> > v18 -> v19:
> >   * Use VM_PFNMAP and vm_insert_pages
> >   * Allocate ring-buffer subbufs with __GFP_COMP
> >   * Pad the meta-page with the zero-page to align on the subbuf_order
> >   * Extend the ring-buffer test with mmap() dedicated suite
> > 
> > v17 -> v18:
> >   * Fix lockdep_assert_held
> >   * Fix spin_lock_init typo
> >   * Fix CONFIG_TRACER_MAX_TRACE typo
> > 
> > v16 -> v17:
> >   * Documentation and comments improvements.
> >   * Create get/put_snapshot_map() for clearer code.
> >   * Replace kzalloc with kcalloc.
> >   * Fix -ENOMEM handling in rb_alloc_meta_page().
> >   * Move flush(cpu_buffer->reader_page) behind the reader lock.
> >   * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
> > mutex. (removes READ_ONCE/WRITE_ONCE accesses).
> > 
> > v15 -> v16:
> >   * Add comment for the dcache flush.
> >   * Remove now unnecessary WRITE_ONCE for the meta-page.
> > 
> > v14 -> v15:
> >   * Add meta-page and reader-page flush. Intends to fix the mapping
> > for VIVT and aliasing-VIPT data caches.
> >   * -EPERM on VM_EXEC.
> >   * Fix build warning !CONFIG_TRACER_MAX_TRACE.
> > 
> > v13 -> v14:
> >   * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
> >   * on unmap, sync meta-page teardown with the reader_lock instead of
> > the synchronize_rcu.
> >   * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
> > (intends to fix a lockdep issue)
> >   * Add kerneldoc for flags and Reserved fields.
> >   * Add kselftest for snapshot/map mutual exclusion.
> > 
> > v12 -> v13:
> >   * Swap subbufs_{touched,lost} for Reserved fields.
> >   * Add a flag field in the meta-page.
> >   * Fix CONFIG_TRACER_MAX_TRACE.
> >   * Rebase on top of trace/urgent.
> >   * Add a comment for try_unregister_trigger()
> > 
> > v11 -> v12:
> >   * Fix code sample mmap bug.
> >   * Add logging in sample code.
> >   * Reset tracer in selftest.
> >   * Add a refcount for the snapshot users.
> >   * Prevent mapping when there are snapshot users and vice versa.
> >   * Refine the meta-page.
> >   * Fix types in the meta-page.
> >   * Collect Reviewed-by.
> > 
> > v10 -> v11:
> >   * Add Documentation and code sample.
> >   * Add a selftest.
> >   * Move all the update to the meta-page into a single
> > rb_update_meta_page().
> >   * rb_update_meta_page() is now called from
> > ring_buffer_map_get_reader() to fix NOBLOCK callers.
> >   * kerneldoc for struct trace_meta_page.
> >   * Add a patch to zero all the ring-buffer allocations.
> > 
> > v9 -> v10:
> >   * Refactor rb_update_meta_page()
> >   * In-loop declaration for foreach_subbuf_page()
> >   * Check for cpu_buffer->mapped overflow
> > 
> > v8 -> v9:
> >   * Fix the unlock path in ring_buffer_map()
> >   * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
> >   * Rebase on linux-trace/for-next 
> > (3cb3091138ca0921c4569bcf7ffa062519639b6a)
> > 
> > v7 -> v8:
> >   * Drop the subbufs renaming into bpages
> >   * Use subbuf as a name when relevant
> > 
> > v6 -> v7:
> >   * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
> >   * Support for subbufs
> >   * Rename subbufs into bpages
> > 
> > v5 -> v6:
> >   * Rebase on next-20230802.
> >   * (unsigned long) -> (void *) cast for virt_to_page().
> >   * Add a wait for the GET_READER_PAGE ioctl.
> >   * Move writer fields update (overrun/pages_lost/entries/pages_touched)
> > in the irq_work.
> >   * Rearrange id in struct buffer_page.
> >   * Rearrange the meta-page.
> >   * ring_buffer_meta_page -> 

[PATCH v3] ftrace: Fix possible use-after-free issue in ftrace_location()

2024-04-16 Thread Zheng Yejian
KASAN reports a bug:

  BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
  Read of size 8 at addr 888141d40010 by task insmod/424
  CPU: 8 PID: 424 Comm: insmod Tainted: GW  6.9.0-rc2+
  [...]
  Call Trace:
   
   dump_stack_lvl+0x68/0xa0
   print_report+0xcf/0x610
   kasan_report+0xb5/0xe0
   ftrace_location+0x90/0x120
   register_kprobe+0x14b/0xa40
   kprobe_init+0x2d/0xff0 [kprobe_example]
   do_one_initcall+0x8f/0x2d0
   do_init_module+0x13a/0x3c0
   load_module+0x3082/0x33d0
   init_module_from_file+0xd2/0x130
   __x64_sys_finit_module+0x306/0x440
   do_syscall_64+0x68/0x140
   entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause is that, in lookup_rec(), ftrace record of some address
is being searched in ftrace pages of some module, but those ftrace pages
at the same time is being freed in ftrace_release_mod() as the
corresponding module is being deleted:

   CPU1   |  CPU2
  register_kprobes() {| delete_module() {
check_kprobe_address_safe() { |
  arch_check_ftrace_location() {  |
ftrace_location() {   |
  lookup_rec() // USE!|   ftrace_release_mod() // Free!

To fix this issue:
  1. Hold rcu lock as accessing ftrace pages in ftrace_location_range();
  2. Use ftrace_location_range() instead of lookup_rec() in
 ftrace_location();
  3. Call synchronize_rcu() before freeing any ftrace pages both in
 ftrace_process_locs()/ftrace_release_mod()/ftrace_free_mem().

Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization")
Suggested-by: Steven Rostedt 
Signed-off-by: Zheng Yejian 
---
 kernel/trace/ftrace.c | 46 ---
 1 file changed, 30 insertions(+), 16 deletions(-)

v3:
 - Complete the commit description and add Suggested-by tag
 - Add comments around where synchronize_rcu() is called

v2:
 - Link: 
https://lore.kernel.org/all/20240416112459.1444612-1-zhengyeji...@huawei.com/
 - Use RCU lock instead of holding ftrace_lock as suggested by Steve.
   Link: https://lore.kernel.org/all/20240410112823.1d084...@gandalf.local.home/

v1:
 - Link: 
https://lore.kernel.org/all/20240401125543.1282845-1-zhengyeji...@huawei.com/

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index da1710499698..e05d3e3dc06a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
 }
 
 /**
- * ftrace_location_range - return the first address of a traced location
+ * ftrace_location_range_rcu - return the first address of a traced location
  * if it touches the given ip range
  * @start: start of range to search.
  * @end: end of range to search (inclusive). @end points to the last byte
@@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
  * that is either a NOP or call to the function tracer. It checks the ftrace
  * internal tables to determine if the address belongs or not.
  */
-unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+static unsigned long ftrace_location_range_rcu(unsigned long start, unsigned 
long end)
 {
struct dyn_ftrace *rec;
 
@@ -1603,6 +1603,16 @@ unsigned long ftrace_location_range(unsigned long start, 
unsigned long end)
return 0;
 }
 
+unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+{
+   unsigned long loc;
+
+   rcu_read_lock();
+   loc = ftrace_location_range_rcu(start, end);
+   rcu_read_unlock();
+   return loc;
+}
+
 /**
  * ftrace_location - return the ftrace location
  * @ip: the instruction pointer to check
@@ -1614,25 +1624,22 @@ unsigned long ftrace_location_range(unsigned long 
start, unsigned long end)
  */
 unsigned long ftrace_location(unsigned long ip)
 {
-   struct dyn_ftrace *rec;
+   unsigned long loc;
unsigned long offset;
unsigned long size;
 
-   rec = lookup_rec(ip, ip);
-   if (!rec) {
+   loc = ftrace_location_range(ip, ip);
+   if (!loc) {
if (!kallsyms_lookup_size_offset(ip, , ))
goto out;
 
/* map sym+0 to __fentry__ */
if (!offset)
-   rec = lookup_rec(ip, ip + size - 1);
+   loc = ftrace_location_range(ip, ip + size - 1);
}
 
-   if (rec)
-   return rec->ip;
-
 out:
-   return 0;
+   return loc;
 }
 
 /**
@@ -6596,6 +6603,8 @@ static int ftrace_process_locs(struct module *mod,
/* We should have used all pages unless we skipped some */
if (pg_unuse) {
WARN_ON(!skipped);
+   /* Need to synchronize with ftrace_location_range() */
+   synchronize_rcu();
ftrace_free_pages(pg_unuse);
}
return ret;
@@ -6809,6 +6818,9 @@ void ftrace_release_mod(struct module *mod)
  out_unlock:

Re: [PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Liang Chen
On Tue, Apr 16, 2024 at 3:35 PM Heng Qi  wrote:
>
>
>
> 在 2024/4/16 下午2:19, Liang Chen 写道:
> > The RSS hash report is a feature that's part of the virtio specification.
> > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> > (still a work in progress as per [1]) support this feature. While the
> > capability to obtain the RSS hash has been enabled in the normal path,
> > it's currently missing in the XDP path. Therefore, we are introducing
> > XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> >
> > 1.
> > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
> >
> > Signed-off-by: Liang Chen 
> > ---
> >Changes from v7:
> > - use table lookup for rss hash type
> >Changes from v6:
> > - fix a coding style issue
> >Changes from v5:
> > - Preservation of the hash value has been dropped, following the conclusion
> >from discussions in V3 reviews. The virtio_net driver doesn't
> >accessing/using the virtio_net_hdr after the XDP program execution, so
> >nothing tragic should happen. As to the xdp program, if it smashes the
> >entry in virtio header, it is likely buggy anyways. Additionally, looking
> >up the Intel IGC driver,  it also does not bother with this particular
> >aspect.
> > ---
> >   drivers/net/virtio_net.c| 42 +
> >   include/uapi/linux/virtio_net.h |  1 +
> >   2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c22d1118a133..1d750009f615 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct 
> > virtnet_info *vi, const int mtu)
> >   }
> >   }
> >
> > +static enum xdp_rss_hash_type
> > +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
> > + [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
> > + [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
> > + [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
> > + [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
> > + [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
> > + [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
> > + [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
> > + [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
> > + [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
> > + [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
> > +};
> > +
> > +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> > +enum xdp_rss_hash_type *rss_type)
> > +{
> > + const struct xdp_buff *xdp = (void *)_ctx;
> > + struct virtio_net_hdr_v1_hash *hdr_hash;
> > + struct virtnet_info *vi;
> > + u16 hash_report;
> > +
> > + if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> > + return -ENODATA;
> > +
> > + vi = netdev_priv(xdp->rxq->dev);
> > + hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
> > + hash_report = __le16_to_cpu(hdr_hash->hash_report);
> > +
> > + if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
> > + hash_report = VIRTIO_NET_HASH_REPORT_NONE;
>
> When this happens, it may mean an error or user modification of the
> header occurred.
> Should the following *hash* value be cleared to 0?
>

Referring to the igc and mlx5 drivers, the hash value is not cleared
to 0 in such cases either. This is because if the rss type is
XDP_RSS_TYPE_NONE, the hash value is considered to be not relevant.
Thanks!

> Thanks,
> Heng
>
> > +
> > + *rss_type = virtnet_xdp_rss_type[hash_report];
> > + *hash = __le32_to_cpu(hdr_hash->hash_value);
> > + return 0;
> > +}
> > +
> > +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> > + .xmo_rx_hash= virtnet_xdp_rx_hash,
> > +};
> > +
> >   static int virtnet_probe(struct virtio_device *vdev)
> >   {
> >   int i, err = -ENOMEM;
> > @@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> > VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> >
> >   dev->hw_features |= NETIF_F_RXHASH;
> > + dev->xdp_metadata_ops = _xdp_metadata_ops;
> >   }
> >
> >   if (vi->has_rss_hash_report)
> > diff --git a/include/uapi/linux/virtio_net.h 
> > b/include/uapi/linux/virtio_net.h
> > index cc65ef0f3c3e..3ee695450096 100644
> > --- a/include/uapi/linux/virtio_net.h
> > +++ b/include/uapi/linux/virtio_net.h
> > @@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
> >   #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
> >   #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
> >   #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> > +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10
> >   __le16 hash_report;
> >   __le16 padding;
> >   };
>



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Haitao Huang
On Tue, 16 Apr 2024 17:21:24 -0500, Haitao Huang  
 wrote:


On Tue, 16 Apr 2024 17:04:21 -0500, Haitao Huang  
 wrote:


On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
wrote:



On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
I did declare the configs in the config file but I missed it in my  
patch

as stated earlier. IIUC, that would not cause this error though.

Maybe I should exit with the skip code if no CGROUP_MISC (no more
CGROUP_SGX_EPC) is configured?


OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.


tools/testing/selftests$ find . -name README
./futex/README
./tc-testing/README
./net/forwarding/README
./powerpc/nx-gzip/README
./ftrace/README
./arm64/signal/README
./arm64/fp/README
./arm64/README
./zram/README
./livepatch/README
./resctrl/README


So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?

BR, Jarkko




 From the docs, I think we are supposed to use override.
See:  
https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests


Thanks
Haitao



Maybe you are suggesting we add settings file? I can do that.
README also explains what the tests do though. Do you still think they  
should not exist?

I was mostly following resctrl as example.

Thanks
Haitao



With the settings I shortened the README quite bit. Now I also lean  
towards removing it. Let me know your preference. You can check the latest  
at my branch for reference:

https://github.com/haitaohuang/linux/tree/sgx_cg_upstream_v12_plus

Thanks
Haitao



Re: [PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Liang Chen
On Tue, Apr 16, 2024 at 3:20 PM Jason Wang  wrote:
>
> On Tue, Apr 16, 2024 at 2:20 PM Liang Chen  wrote:
> >
> > The RSS hash report is a feature that's part of the virtio specification.
> > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> > (still a work in progress as per [1]) support this feature. While the
> > capability to obtain the RSS hash has been enabled in the normal path,
> > it's currently missing in the XDP path. Therefore, we are introducing
> > XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> >
> > 1.
> > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
> >
> > Signed-off-by: Liang Chen 
> > ---
> >   Changes from v7:
> > - use table lookup for rss hash type
> >   Changes from v6:
> > - fix a coding style issue
> >   Changes from v5:
> > - Preservation of the hash value has been dropped, following the conclusion
> >   from discussions in V3 reviews. The virtio_net driver doesn't
> >   accessing/using the virtio_net_hdr after the XDP program execution, so
> >   nothing tragic should happen. As to the xdp program, if it smashes the
> >   entry in virtio header, it is likely buggy anyways. Additionally, looking
> >   up the Intel IGC driver,  it also does not bother with this particular
> >   aspect.
> > ---
> >  drivers/net/virtio_net.c| 42 +
> >  include/uapi/linux/virtio_net.h |  1 +
> >  2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c22d1118a133..1d750009f615 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct 
> > virtnet_info *vi, const int mtu)
> > }
> >  }
> >
> > +static enum xdp_rss_hash_type
> > +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
> > +   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
> > +   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
> > +   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
> > +   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
> > +   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
> > +   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
> > +   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
> > +   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
> > +   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
> > +   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
> > +};
> > +
> > +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> > +  enum xdp_rss_hash_type *rss_type)
> > +{
> > +   const struct xdp_buff *xdp = (void *)_ctx;
> > +   struct virtio_net_hdr_v1_hash *hdr_hash;
> > +   struct virtnet_info *vi;
> > +   u16 hash_report;
> > +
> > +   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> > +   return -ENODATA;
> > +
> > +   vi = netdev_priv(xdp->rxq->dev);
> > +   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - 
> > vi->hdr_len);
> > +   hash_report = __le16_to_cpu(hdr_hash->hash_report);
> > +
> > +   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
> > +   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
> > +
> > +   *rss_type = virtnet_xdp_rss_type[hash_report];
> > +   *hash = __le32_to_cpu(hdr_hash->hash_value);
> > +   return 0;
> > +}
> > +
> > +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> > +   .xmo_rx_hash= virtnet_xdp_rx_hash,
> > +};
> > +
> >  static int virtnet_probe(struct virtio_device *vdev)
> >  {
> > int i, err = -ENOMEM;
> > @@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> >
> > dev->hw_features |= NETIF_F_RXHASH;
> > +   dev->xdp_metadata_ops = _xdp_metadata_ops;
> > }
> >
> > if (vi->has_rss_hash_report)
> > diff --git a/include/uapi/linux/virtio_net.h 
> > b/include/uapi/linux/virtio_net.h
> > index cc65ef0f3c3e..3ee695450096 100644
> > --- a/include/uapi/linux/virtio_net.h
> > +++ b/include/uapi/linux/virtio_net.h
> > @@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
> >  #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
> >  #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
> >  #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> > +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10
>
> This should not be part of uAPI. It may confuse the userspace.
>

Sure. I will just move it to virtio_net.c right above the table
definition. Thanks!

> Others look good.
>
> Thanks
>
> > __le16 hash_report;
> > __le16 padding;
> >  };
> > --
> > 2.40.1
> >
>



Re: [PATCH] ASoC: tracing: Export SND_SOC_DAPM_DIR_OUT to its value

2024-04-16 Thread Mark Brown
On Tue, 16 Apr 2024 00:03:03 -0400, Steven Rostedt wrote:
> The string SND_SOC_DAPM_DIR_OUT is printed in the snd_soc_dapm_path trace
> event instead of its value:
> 
>(((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-")
> 
> User space cannot parse this, as it has no idea what SND_SOC_DAPM_DIR_OUT
> is. Use TRACE_DEFINE_ENUM() to convert it to its value:
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: tracing: Export SND_SOC_DAPM_DIR_OUT to its value
  commit: 58300f8d6a48e58d1843199be743f819e2791ea3

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark




Re: [PATCH][next] dax: remove redundant assignment to variable rc

2024-04-16 Thread Verma, Vishal L
On Mon, 2024-04-15 at 11:19 +0100, Colin Ian King wrote:
> The variable rc is being assigned an value and then is being re-assigned
> a new value in the next statement. The assignment is redundant and can
> be removed.
> 
> Cleans up clang scan build warning:
> drivers/dax/bus.c:1207:2: warning: Value stored to 'rc' is never
> read [deadcode.DeadStores]
> 
> Signed-off-by: Colin Ian King 

Reviewed-by: Vishal Verma 

> ---
>  drivers/dax/bus.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 797e1ebff299..f758afbf8f09 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1204,7 +1204,6 @@ static ssize_t mapping_store(struct device *dev, struct 
> device_attribute *attr,
>   if (rc)
>   return rc;
>  
> - rc = -ENXIO;
>   rc = down_write_killable(_region_rwsem);
>   if (rc)
>   return rc;



Re: [PATCH v12 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-04-16 Thread Haitao Huang

On Tue, 16 Apr 2024 09:07:33 -0500, Huang, Kai  wrote:


On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
allocated page into the global LRU list while
sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
global LRU references in these functions to make them reusable when
pages are tracked in per-cgroup LRUs.

Create a helper, sgx_lru_list(), that returns the LRU that tracks a  
given

EPC page. It simply returns the global LRU now, and will later return
the LRU of the cgroup within which the EPC page was allocated. Replace
the hard coded global LRU with a call to this helper.

Next patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd before we
make the switch in the end for sgx_lru_list() to return per-cgroup
LRU.


I found the first paragraph hard to read.  Provide my version below for
your reference:

"
The SGX driver tracks reclaimable EPC pages via
sgx_mark_page_reclaimable(), which adds the newly allocated page into the
global LRU list.  sgx_unmark_page_reclaimable() does the opposite.

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and the new allocated EPC page will need to be  
added

to the LRU of associated cgroup, but not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the new
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_lru_list(), to return the LRU that the EPC page should
be/is added to for the given EPC page.  Currently it just returns the
global LRU.  Change sgx_{mark|unmark}_page_reclaimable() to use the  
helper

function to get the LRU from the EPC page instead of referring to the
global LRU directly.

This allows EPC page being able to be tracked in "per-cgroup" LRU when
that becomes ready.
"


Thanks. Will use


Nit:

That being said, is sgx_epc_page_lru() better than sgx_lru_list()?



Sure



Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 
---



Feel free to add:

Reviewed-by: Kai Huang 


Thanks
Haitao



[PATCH 2/2] selftests/user_events: Add non-spacing separator check

2024-04-16 Thread Beau Belgrave
The ABI documentation indicates that field separators do not need a
space between them, only a ';'. When no spacing is used, the register
must work. Any subsequent register, with or without spaces, must match
and not return -EADDRINUSE.

Add a non-spacing separator case to our self-test register case to ensure
it works going forward.

Signed-off-by: Beau Belgrave 
---
 tools/testing/selftests/user_events/ftrace_test.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/testing/selftests/user_events/ftrace_test.c 
b/tools/testing/selftests/user_events/ftrace_test.c
index dcd7509fe2e0..0bb46793dcd4 100644
--- a/tools/testing/selftests/user_events/ftrace_test.c
+++ b/tools/testing/selftests/user_events/ftrace_test.c
@@ -261,6 +261,12 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, ));
ASSERT_EQ(0, reg.write_index);
 
+   /* Register without separator spacing should still match */
+   reg.enable_bit = 29;
+   reg.name_args = (__u64)"__test_event u32 field1;u32 field2";
+   ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSREG, ));
+   ASSERT_EQ(0, reg.write_index);
+
/* Multiple registers to same name but different args should fail */
reg.enable_bit = 29;
reg.name_args = (__u64)"__test_event u32 field1;";
@@ -288,6 +294,8 @@ TEST_F(user, register_events) {
ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, ));
unreg.disable_bit = 30;
ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, ));
+   unreg.disable_bit = 29;
+   ASSERT_EQ(0, ioctl(self->data_fd, DIAG_IOCSUNREG, ));
 
/* Delete should have been auto-done after close and unregister */
close(self->data_fd);
-- 
2.34.1




[PATCH 1/2] tracing/user_events: Fix non-spaced field matching

2024-04-16 Thread Beau Belgrave
When the ABI was updated to prevent same name w/different args, it
missed an important corner case when fields don't end with a space.
Typically, space is used for fields to help separate them, like
"u8 field1; u8 field2". If no spaces are used, like
"u8 field1;u8 field2", then the parsing works for the first time.
However, the match check fails on a subsequent register, leading to
confusion.

This is because the match check uses argv_split() and assumes that all
fields will be split upon the space. When spaces are used, we get back
{ "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
This causes a mismatch, and the user program gets back -EADDRINUSE.

Add a method to detect this case before calling argv_split(). If found
force a space after the field separator character ';'. This ensures all
cases work properly for matching.

With this fix, the following are all treated as matching:
u8 field1;u8 field2
u8 field1; u8 field2
u8 field1;\tu8 field2
u8 field1;\nu8 field2

Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args 
event")
Signed-off-by: Beau Belgrave 
---
 kernel/trace/trace_events_user.c | 88 +++-
 1 file changed, 87 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 70d428c394b6..9184d3962b2a 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -1989,6 +1989,92 @@ static int user_event_set_tp_name(struct user_event 
*user)
return 0;
 }
 
+/*
+ * Counts how many ';' without a trailing space are in the args.
+ */
+static int count_semis_no_space(char *args)
+{
+   int count = 0;
+
+   while ((args = strchr(args, ';'))) {
+   args++;
+
+   if (!isspace(*args))
+   count++;
+   }
+
+   return count;
+}
+
+/*
+ * Copies the arguments while ensuring all ';' have a trailing space.
+ */
+static char *fix_semis_no_space(char *args, int count)
+{
+   char *fixed, *pos;
+   char c, last;
+   int len;
+
+   len = strlen(args) + count;
+   fixed = kmalloc(len + 1, GFP_KERNEL);
+
+   if (!fixed)
+   return NULL;
+
+   pos = fixed;
+   last = '\0';
+
+   while (len > 0) {
+   c = *args++;
+
+   if (last == ';' && !isspace(c)) {
+   *pos++ = ' ';
+   len--;
+   }
+
+   if (len > 0) {
+   *pos++ = c;
+   len--;
+   }
+
+   last = c;
+   }
+
+   /*
+* len is the length of the copy excluding the null.
+* This ensures we always have room for a null.
+*/
+   *pos = '\0';
+
+   return fixed;
+}
+
+static char **user_event_argv_split(char *args, int *argc)
+{
+   /* Count how many ';' without a trailing space */
+   int count = count_semis_no_space(args);
+
+   if (count) {
+   /* We must fixup 'field;field' to 'field; field' */
+   char *fixed = fix_semis_no_space(args, count);
+   char **split;
+
+   if (!fixed)
+   return NULL;
+
+   /* We do a normal split afterwards */
+   split = argv_split(GFP_KERNEL, fixed, argc);
+
+   /* We can free since argv_split makes a copy */
+   kfree(fixed);
+
+   return split;
+   }
+
+   /* No fixup is required */
+   return argv_split(GFP_KERNEL, args, argc);
+}
+
 /*
  * Parses the event name, arguments and flags then registers if successful.
  * The name buffer lifetime is owned by this method for success cases only.
@@ -2012,7 +2098,7 @@ static int user_event_parse(struct user_event_group 
*group, char *name,
return -EPERM;
 
if (args) {
-   argv = argv_split(GFP_KERNEL, args, );
+   argv = user_event_argv_split(args, );
 
if (!argv)
return -ENOMEM;
-- 
2.34.1




[PATCH 0/2] tracing/user_events: Fix non-spaced field matching

2024-04-16 Thread Beau Belgrave
When the ABI was updated to prevent same name w/different args, it
missed an important corner case when fields don't end with a space.
Typically, space is used for fields to help separate them, like
"u8 field1; u8 field2". If no spaces are used, like
"u8 field1;u8 field2", then the parsing works for the first time.
However, the match check fails on a subsequent register, leading to
confusion.

This is because the match check uses argv_split() and assumes that all
fields will be split upon the space. When spaces are used, we get back
{ "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
This causes a mismatch, and the user program gets back -EADDRINUSE.

Add a method to detect this case before calling argv_split(). If found
force a space after the field separator character ';'. This ensures all
cases work properly for matching.

I could not find an existing function to accomplish this, so I had to
hand code a copy with this logic. If there is a better way to achieve
this, I'm all ears.

This series also adds a selftest to ensure this doesn't break again.

With this fix, the following are all treated as matching:
u8 field1;u8 field2
u8 field1; u8 field2
u8 field1;\tu8 field2
u8 field1;\nu8 field2

Beau Belgrave (2):
  tracing/user_events: Fix non-spaced field matching
  selftests/user_events: Add non-spacing separator check

 kernel/trace/trace_events_user.c  | 88 ++-
 .../selftests/user_events/ftrace_test.c   |  8 ++
 2 files changed, 95 insertions(+), 1 deletion(-)


base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
-- 
2.34.1




Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-16 Thread Haitao Huang
On Mon, 15 Apr 2024 22:20:02 -0500, Haitao Huang  
 wrote:


diff --git a/arch/x86/kernel/cpu/sgx/Makefile  
b/arch/x86/kernel/cpu/sgx/Makefile

index 9c1656779b2a..400baa7cfb69 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -1,6 +1,7 @@
 obj-y += \
driver.o \
encl.o \
+   epc_cgroup.o \


It should be:

+obj-$(CONFIG_CGROUP_MISC)   += epc_cgroup.o

Haitao




Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Haitao Huang
On Tue, 16 Apr 2024 17:04:21 -0500, Haitao Huang  
 wrote:


On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
wrote:



On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
I did declare the configs in the config file but I missed it in my  
patch

as stated earlier. IIUC, that would not cause this error though.

Maybe I should exit with the skip code if no CGROUP_MISC (no more
CGROUP_SGX_EPC) is configured?


OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.


tools/testing/selftests$ find . -name README
./futex/README
./tc-testing/README
./net/forwarding/README
./powerpc/nx-gzip/README
./ftrace/README
./arm64/signal/README
./arm64/fp/README
./arm64/README
./zram/README
./livepatch/README
./resctrl/README


So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?

BR, Jarkko




 From the docs, I think we are supposed to use override.
See:  
https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests


Thanks
Haitao



Maybe you are suggesting we add settings file? I can do that.
README also explains what the tests do though. Do you still think they  
should not exist?

I was mostly following resctrl as example.

Thanks
Haitao



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Haitao Huang
On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
wrote:



On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:

I did declare the configs in the config file but I missed it in my patch
as stated earlier. IIUC, that would not cause this error though.

Maybe I should exit with the skip code if no CGROUP_MISC (no more
CGROUP_SGX_EPC) is configured?


OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.


tools/testing/selftests$ find . -name README
./futex/README
./tc-testing/README
./net/forwarding/README
./powerpc/nx-gzip/README
./ftrace/README
./arm64/signal/README
./arm64/fp/README
./arm64/README
./zram/README
./livepatch/README
./resctrl/README


So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?

BR, Jarkko




From the docs, I think we are supposed to use override.
See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests

Thanks
Haitao



[PATCH v2 3/4] dax/bus.c: Don't use down_write_killable for non-user processes

2024-04-16 Thread Vishal Verma
Change an instance of down_write_killable() to a simple down_write() where
there is no user process that might want to interrupt the operation.

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Reported-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 4e04b228b080..db183eb5ce3a 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1542,12 +1542,8 @@ static struct dev_dax *__devm_create_dev_dax(struct 
dev_dax_data *data)
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dev_dax *dev_dax;
-   int rc;
-
-   rc = down_write_killable(_region_rwsem);
-   if (rc)
-   return ERR_PTR(rc);
 
+   down_write(_region_rwsem);
dev_dax = __devm_create_dev_dax(data);
up_write(_region_rwsem);
 

-- 
2.44.0




[PATCH v2 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show

2024-04-16 Thread Vishal Verma
In size_show(), the dax_dev_rwsem only needs a read lock, but was
acquiring a write lock. Change it to down_read_interruptible() so it
doesn't unnecessarily hold a write lock.

Cc: Dan Williams 
Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index db183eb5ce3a..66095e60a279 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -939,11 +939,11 @@ static ssize_t size_show(struct device *dev,
unsigned long long size;
int rc;
 
-   rc = down_write_killable(_dev_rwsem);
+   rc = down_read_interruptible(_dev_rwsem);
if (rc)
return rc;
size = dev_dax_size(dev_dax);
-   up_write(_dev_rwsem);
+   up_read(_dev_rwsem);
 
return sysfs_emit(buf, "%llu\n", size);
 }

-- 
2.44.0




[PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-16 Thread Vishal Verma
Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
was a bit overzealous in eliminating device_lock() usage, and ended up
removing a couple of lock acquisitions which were needed, and as a
result, fix some of the conditional locking missteps that the above
commit introduced in unregister_dax_dev() and unregister_dax_mapping().

Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Reported-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 44 ++--
 1 file changed, 10 insertions(+), 34 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 7924dd542a13..4e04b228b080 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
trim_dev_dax_range(dev_dax);
 }
 
-static void __unregister_dev_dax(void *dev)
+static void unregister_dev_dax(void *dev)
 {
struct dev_dax *dev_dax = to_dev_dax(dev);
 
dev_dbg(dev, "%s\n", __func__);
 
+   down_write(_region_rwsem);
kill_dev_dax(dev_dax);
device_del(dev);
free_dev_dax_ranges(dev_dax);
put_device(dev);
-}
-
-static void unregister_dev_dax(void *dev)
-{
-   if (rwsem_is_locked(_region_rwsem))
-   return __unregister_dev_dax(dev);
-
-   if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0))
-   return;
-   __unregister_dev_dax(dev);
up_write(_region_rwsem);
 }
 
@@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct 
device_attribute *attr,
if (!victim)
return -ENXIO;
 
-   rc = down_write_killable(_region_rwsem);
-   if (rc)
-   return rc;
-   rc = down_write_killable(_dev_rwsem);
-   if (rc) {
-   up_write(_region_rwsem);
-   return rc;
-   }
+   device_lock(dev);
+   device_lock(victim);
dev_dax = to_dev_dax(victim);
+   rc = down_write_killable(_dev_rwsem);
+   if (rc)
+   return rc;
if (victim->driver || dev_dax_size(dev_dax))
rc = -EBUSY;
else {
@@ -589,11 +577,12 @@ static ssize_t delete_store(struct device *dev, struct 
device_attribute *attr,
rc = -EBUSY;
}
up_write(_dev_rwsem);
+   device_unlock(victim);
 
/* won the race to invalidate the device, clean it up */
if (do_del)
devm_release_action(dev, unregister_dev_dax, victim);
-   up_write(_region_rwsem);
+   device_unlock(dev);
put_device(victim);
 
return rc;
@@ -705,7 +694,7 @@ static void dax_mapping_release(struct device *dev)
put_device(parent);
 }
 
-static void __unregister_dax_mapping(void *data)
+static void unregister_dax_mapping(void *data)
 {
struct device *dev = data;
struct dax_mapping *mapping = to_dax_mapping(dev);
@@ -713,25 +702,12 @@ static void __unregister_dax_mapping(void *data)
 
dev_dbg(dev, "%s\n", __func__);
 
-   lockdep_assert_held_write(_region_rwsem);
-
dev_dax->ranges[mapping->range_id].mapping = NULL;
mapping->range_id = -1;
 
device_unregister(dev);
 }
 
-static void unregister_dax_mapping(void *data)
-{
-   if (rwsem_is_locked(_region_rwsem))
-   return __unregister_dax_mapping(data);
-
-   if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0))
-   return;
-   __unregister_dax_mapping(data);
-   up_write(_region_rwsem);
-}
-
 static struct dev_dax_range *get_dax_range(struct device *dev)
 {
struct dax_mapping *mapping = to_dax_mapping(dev);

-- 
2.44.0




[PATCH v2 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-16 Thread Vishal Verma
In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
referenced patch should be replaced with lockdep_assert_held, or
lockdep_held_assert_write(). Replace these as appropriate.

Link: 
https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch
 [1]
Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
Cc: Andrew Morton 
Reported-by: Dan Williams 
Signed-off-by: Vishal Verma 
---
 drivers/dax/bus.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 797e1ebff299..7924dd542a13 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -192,7 +192,7 @@ static u64 dev_dax_size(struct dev_dax *dev_dax)
u64 size = 0;
int i;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_dev_rwsem));
+   lockdep_assert_held(_dev_rwsem);
 
for (i = 0; i < dev_dax->nr_range; i++)
size += range_len(_dax->ranges[i].range);
@@ -302,7 +302,7 @@ static unsigned long long dax_region_avail_size(struct 
dax_region *dax_region)
resource_size_t size = resource_size(_region->res);
struct resource *res;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held(_region_rwsem);
 
for_each_dax_region_resource(dax_region, res)
size -= resource_size(res);
@@ -447,7 +447,7 @@ static void trim_dev_dax_range(struct dev_dax *dev_dax)
struct range *range = _dax->ranges[i].range;
struct dax_region *dax_region = dev_dax->region;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
dev_dbg(_dax->dev, "delete range[%d]: %#llx:%#llx\n", i,
(unsigned long long)range->start,
(unsigned long long)range->end);
@@ -507,7 +507,7 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
struct dax_region *dax_region;
int rc = dev_dax->id;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_dev_rwsem));
+   lockdep_assert_held_write(_dev_rwsem);
 
if (!dev_dax->dyn_id || dev_dax->id < 0)
return -1;
@@ -713,7 +713,7 @@ static void __unregister_dax_mapping(void *data)
 
dev_dbg(dev, "%s\n", __func__);
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
dev_dax->ranges[mapping->range_id].mapping = NULL;
mapping->range_id = -1;
@@ -830,7 +830,7 @@ static int devm_register_dax_mapping(struct dev_dax 
*dev_dax, int range_id)
struct device *dev;
int rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
if (dev_WARN_ONCE(_dax->dev, !dax_region->dev->driver,
"region disabled\n"))
@@ -876,7 +876,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 
start,
struct resource *alloc;
int i, rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
/* handle the seed alloc special case */
if (!size) {
@@ -935,7 +935,7 @@ static int adjust_dev_dax_range(struct dev_dax *dev_dax, 
struct resource *res, r
struct device *dev = _dax->dev;
int rc;
 
-   WARN_ON_ONCE(!rwsem_is_locked(_region_rwsem));
+   lockdep_assert_held_write(_region_rwsem);
 
if (dev_WARN_ONCE(dev, !size, "deletion is handled by 
dev_dax_shrink\n"))
return -EINVAL;

-- 
2.44.0




Re: [PATCH v15 0/4] add zynqmp TCM bindings

2024-04-16 Thread Mathieu Poirier
On Fri, Apr 12, 2024 at 11:37:04AM -0700, Tanmay Shah wrote:
> Tightly-Coupled Memories(TCMs) are low-latency memory that provides
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains exclusive two 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
> In lockstep mode, both 128KB memory is accessible to the cluster.
> 
> As per ZynqMP Ultrascale+ Technical Reference Manual UG1085, following
> is address space of TCM memory. The bindings in this patch series
> introduces properties to accommodate following address space with
> address translation between Linux and Cortex-R5 views.
> 
> | | | |
> | --- | --- | --- |
> |  *Mode*|   *R5 View* | *Linux view* |  Notes   |
> | *Split Mode*   | *start addr*| *start addr* |  |
> | R5_0 ATCM (64 KB)  | 0x_ | 0xFFE0_  |  |
> | R5_0 BTCM (64 KB)  | 0x0002_ | 0xFFE2_  |  |
> | R5_1 ATCM (64 KB)  | 0x_ | 0xFFE9_  | alias of 0xFFE1_ |
> | R5_1 BTCM (64 KB)  | 0x0002_ | 0xFFEB_  | alias of 0xFFE3_ |
> |  ___   | ___ |___   |  |
> | *Lockstep Mode*| |  |  |
> | R5_0 ATCM (128 KB) | 0x_ | 0xFFE0_  |  |
> | R5_0 BTCM (128 KB) | 0x0002_ | 0xFFE2_  |  |
> 
> References:
> UG1085 TCM address space:
> https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm/Tightly-Coupled-Memory-Address-Map
>

I have applied patches 1, 2 and 4 of this set.  Patch 3 should to through
Michal's tree.

Thanks,
Mathieu

> ---
> 
> prerequisite-patch-link: 
> https://lore.kernel.org/all/d4556268-8274-4089-949f-3b97d6779...@gmail.com/
> Base Branch: 6.9.rc2
> 
> Changes in v15:
>   - Use hardcode TCM addresses as fallback method if "reg" unavailable
>   - Use new bindings for r5fss subsystem
> 
> Changes in v14:
>   - Add xlnx,tcm-mode property and use it for TCM configuration
>   - Add Versal and Versal-NET platform support
>   - Maintain backward compatibility for ZynqMP platform and use hardcode
> TCM addresses
> 
> Changes in v13:
>   - Have power-domains property for lockstep case instead of
> keeping it flexible.
>   - Add "items:" list in power-domains property
> 
> 
> Radhey Shyam Pandey (1):
>   dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
> 
> Tanmay Shah (3):
>   remoteproc: zynqmp: fix lockstep mode memory region
>   dts: zynqmp: add properties for TCM in remoteproc
>   remoteproc: zynqmp: parse TCM from device tree
> 
>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 279 --
>  .../boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts  |   8 +
>  arch/arm64/boot/dts/xilinx/zynqmp.dtsi|  67 -
>  drivers/remoteproc/xlnx_r5_remoteproc.c   | 273 +
>  4 files changed, 459 insertions(+), 168 deletions(-)
> 
> 
> base-commit: 4d5aabb6843939fad36912be8bf109adf9af0848
> -- 
> 2.25.1
> 



Re: [PATCH v15 3/4] dts: zynqmp: add properties for TCM in remoteproc

2024-04-16 Thread Mathieu Poirier
On Fri, Apr 12, 2024 at 11:37:07AM -0700, Tanmay Shah wrote:
> Add properties as per new bindings in zynqmp remoteproc node
> to represent TCM address and size.
> 
> This patch also adds alternative remoteproc node to represent
> remoteproc cluster in split mode. By default lockstep mode is
> enabled and users should disable it before using split mode
> dts. Both device-tree nodes can't be used simultaneously one
> of them must be disabled. For zcu102-1.0 and zcu102-1.1 board
> remoteproc split mode dts node is enabled and lockstep mode
> dts is disabled.
> 
> Signed-off-by: Tanmay Shah 
> ---
>  .../boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts  |  8 +++
>  arch/arm64/boot/dts/xilinx/zynqmp.dtsi| 67 +--
>  2 files changed, 70 insertions(+), 5 deletions(-)
>

Reviewed-by: Mathieu Poirier 

> diff --git a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts 
> b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
> index c8f71a1aec89..495ca94b45db 100644
> --- a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
> +++ b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
> @@ -14,6 +14,14 @@ / {
>   compatible = "xlnx,zynqmp-zcu102-rev1.0", "xlnx,zynqmp-zcu102", 
> "xlnx,zynqmp";
>  };
>  
> +_split {
> + status = "okay";
> +};
> +
> +_lockstep {
> + status = "disabled";
> +};
> +
>   {
>   #address-cells = <1>;
>   #size-cells = <1>;
> diff --git a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi 
> b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
> index 25d20d803230..ef31b0fc73d1 100644
> --- a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
> +++ b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
> @@ -260,19 +260,76 @@ fpga_full: fpga-full {
>   ranges;
>   };
>  
> - remoteproc {
> + rproc_lockstep: remoteproc@ffe0 {
>   compatible = "xlnx,zynqmp-r5fss";
>   xlnx,cluster-mode = <1>;
> + xlnx,tcm-mode = <1>;
>  
> - r5f-0 {
> + #address-cells = <2>;
> + #size-cells = <2>;
> +
> + ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
> +  <0x0 0x2 0x0 0xffe2 0x0 0x1>,
> +  <0x0 0x1 0x0 0xffe1 0x0 0x1>,
> +  <0x0 0x3 0x0 0xffe3 0x0 0x1>;
> +
> + r5f@0 {
> + compatible = "xlnx,zynqmp-r5f";
> + reg = <0x0 0x0 0x0 0x1>,
> +   <0x0 0x2 0x0 0x1>,
> +   <0x0 0x1 0x0 0x1>,
> +   <0x0 0x3 0x0 0x1>;
> + reg-names = "atcm0", "btcm0", "atcm1", "btcm1";
> + power-domains = <_firmware PD_RPU_0>,
> + <_firmware PD_R5_0_ATCM>,
> + <_firmware PD_R5_0_BTCM>,
> + <_firmware PD_R5_1_ATCM>,
> + <_firmware PD_R5_1_BTCM>;
> + memory-region = <_0_fw_image>;
> + };
> +
> + r5f@1 {
> + compatible = "xlnx,zynqmp-r5f";
> + reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
> + reg-names = "atcm0", "btcm0";
> + power-domains = <_firmware PD_RPU_1>,
> + <_firmware PD_R5_1_ATCM>,
> + <_firmware PD_R5_1_BTCM>;
> + memory-region = <_1_fw_image>;
> + };
> + };
> +
> + rproc_split: remoteproc-split@ffe0 {
> + status = "disabled";
> + compatible = "xlnx,zynqmp-r5fss";
> + xlnx,cluster-mode = <0>;
> + xlnx,tcm-mode = <0>;
> +
> + #address-cells = <2>;
> + #size-cells = <2>;
> +
> + ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
> +  <0x0 0x2 0x0 0xffe2 0x0 0x1>,
> +  <0x1 0x0 0x0 0xffe9 0x0 0x1>,
> +  <0x1 0x2 0x0 0xffeb 0x0 0x1>;
> +
> + r5f@0 {
>   compatible = "xlnx,zynqmp-r5f";
> - power-domains = <_firmware PD_RPU_0>;
> + reg = <0x0 0x0 0x0 0x1>, <0x0 0x2 0x0 0x1>;
> + reg-names = "atcm0", "btcm0";
> + power-domains = <_firmware PD_RPU_0>,
> + <_firmware PD_R5_0_ATCM>,
> + <_firmware PD_R5_0_BTCM>;
>   memory-region = <_0_fw_image>;
>   };
>  
> - r5f-1 {
> + r5f@1 {
>   compatible = "xlnx,zynqmp-r5f";
> - power-domains = <_firmware PD_RPU_1>;
> + reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
> + reg-names = "atcm0", "btcm0";
> + 

Re: [PATCH net-next v2 07/15] mm: page_frag: add '_va' suffix to page_frag API

2024-04-16 Thread Alexander H Duyck
On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
> Currently most of the API for page_frag API is returning
> 'virtual address' as output or expecting 'virtual address'
> as input, in order to differentiate the API handling between
> 'virtual address' and 'struct page', add '_va' suffix to the
> corresponding API mirroring the page_pool_alloc_va() API of
> the page_pool.
> 
> Signed-off-by: Yunsheng Lin 

This patch is a total waste of time. By that logic we should be
renaming __get_free_pages since it essentially does the same thing.

This just seems like more code changes for the sake of adding code
changes rather than fixing anything. In my opinion it should be dropped
from the set.




Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen
On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> I did declare the configs in the config file but I missed it in my patch  
> as stated earlier. IIUC, that would not cause this error though.
>
> Maybe I should exit with the skip code if no CGROUP_MISC (no more  
> CGROUP_SGX_EPC) is configured?

OK, so I wanted to do a distro kernel test here, and used the default
OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.

> tools/testing/selftests$ find . -name README
> ./futex/README
> ./tc-testing/README
> ./net/forwarding/README
> ./powerpc/nx-gzip/README
> ./ftrace/README
> ./arm64/signal/README
> ./arm64/fp/README
> ./arm64/README
> ./zram/README
> ./livepatch/README
> ./resctrl/README

So is there a README because of override timeout parameter? Maybe it
should be just set to a high enough value?

BR, Jarkko



Re: [PATCH v2] ftrace: Fix possible use-after-free issue in ftrace_location()

2024-04-16 Thread Markus Elfring
…
> To fix it, we hold rcu lock as lookuping ftrace record, and call
> synchronize_rcu() before freeing any ftrace pages.

I suggest to convert this description into an imperative wording.

Regards,
Markus



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Haitao Huang

On Tue, 16 Apr 2024 00:42:41 -0500, Huang, Kai  wrote:



I'll send a fixup for this patch or another version of the series if  
more

changes are needed.


Hi Haitao,

I don't like to say but in general I think you are sending too  
frequently.  The
last version was sent April, 11th (my time), so considering the weekend  
it has
only been 3 or at most 4 days.  
Please slow down a little bit to give people more time.


More information please also see:

https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders


Thanks for the feedback. I'll slow down sending new versions.

Haitao



[PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-16 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls visible and RO even when
network namespace is owned by non-initial user namespace.

Let's make a few sysctls to be writable:
- conntrack
- conn_reuse_mode
- expire_nodest_conn
- expire_quiescent_template

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This list of sysctls was chosen because I can't
see any security risks allowing them and also
Kubernetes uses [2] these specific sysctls.

This patch is motivated by user request in the LXC
project [1].

[1] https://github.com/lxc/lxc/issues/4278
[2] 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..92a818c2f783 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4285,10 +4285,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
if (tbl == NULL)
return -ENOMEM;
 
-   /* Don't export sysctls to unprivileged users */
+   /* Let's show all sysctls in non-init user namespace-owned
+* net namespaces, but make them read-only.
+*
+* Allow only a few specific sysctls to be writable.
+*/
if (net->user_ns != _user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
+   for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
+   if (!tbl[idx].procname)
+   continue;
+
+   if (!((strcmp(tbl[idx].procname, "conntrack") 
== 0) ||
+ (strcmp(tbl[idx].procname, 
"conn_reuse_mode") == 0) ||
+ (strcmp(tbl[idx].procname, 
"expire_nodest_conn") == 0) ||
+ (strcmp(tbl[idx].procname, 
"expire_quiescent_template") == 0)))
+   tbl[idx].mode = 0444;
+   }
}
} else
tbl = vs_vars;
-- 
2.34.1




Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Haitao Huang
On Tue, 16 Apr 2024 09:10:12 -0500, Jarkko Sakkinen   
wrote:



On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote:

On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent  
SGX

> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC  
capacity

> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some  
if

> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run  
lots of

> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE  
sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is  
charged

> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang 

jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C  
tools/testing/selftests/sgx run_tests
make: Entering directory  
'/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -c main.c -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -c load.c -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -c sigstruct.c -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -c call.S -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -c sign_key.S -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
-fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z  
noexecstack -lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE  
-fno-stack-protector -mrdrnd  
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include  
test_encl.c test_encl_bootstrap.S -o  
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf  
-Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:  
warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies  
executable stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld:  
NOTE: This behaviour is deprecated and will be removed in a future  
version of the linker

TAP version 13
1..2
# timeout set to 45
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# #  RUN   enclave.unclobbered_vdso ...
# #OK  enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# #  RUN   enclave.unclobbered_vdso_oversubscribed ...
# #OK  enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an  
enclave with 98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of  
98566144 bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave  
to run EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144  
bytes from enclave may take a while ...

# #OK  enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# #  RUN   

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen
On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> > 
> > I'll send a fixup for this patch or another version of the series if more  
> > changes are needed.
>
> Hi Haitao,
>
> I don't like to say but in general I think you are sending too frequently.  
> The
> last version was sent April, 11th (my time), so considering the weekend it has
> only been 3 or at most 4 days.  
>
> Please slow down a little bit to give people more time.
>
> More information please also see:
>
> https://www.kernel.org/doc/html/next/process/submitting-patches.html#resend-reminders

+1

Yes, exactly. I'd take one week break and cycle the kselftest part
internally a bit as I said my previous response. I'm sure that there
is experise inside Intel how to implement it properly. I.e. take some
time to find the right person, and wait as long as that person has a
bit of bandwidth to go through the test and suggest modifications.

Cannot blame, as I've done the same mistake a few times in past but
yeah this would be the best possible corrective action to take.

BR, Jarkko



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen
On Tue Apr 16, 2024 at 5:05 PM EEST, Jarkko Sakkinen wrote:
> On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> > With different cgroups, the script starts one or multiple concurrent SGX
> > selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> > test case, which loads an enclave of EPC size equal to the EPC capacity
> > available on the platform. The script checks results against the
> > expectation set for each cgroup and reports success or failure.
> >
> > The script creates 3 different cgroups at the beginning with following
> > expectations:
> >
> > 1) SMALL - intentionally small enough to fail the test loading an
> > enclave of size equal to the capacity.
> > 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> > more than 4 concurrent tests are run. The script starts 4 expecting at
> > least one test to pass, and then starts 5 expecting at least one test
> > to fail.
> > 3) LARGER - limit is the same as the capacity, large enough to run lots of
> > concurrent tests. The script starts 8 of them and expects all pass.
> > Then it reruns the same test with one process randomly killed and
> > usage checked to be zero after all processes exit.
> >
> > The script also includes a test with low mem_cg limit and LARGE sgx_epc
> > limit to verify that the RAM used for per-cgroup reclamation is charged
> > to a proper mem_cg. For this test, it turns off swapping before start,
> > and turns swapping back on afterwards.
> >
> > Add README to document how to run the tests.
> >
> > Signed-off-by: Haitao Huang 
>
> jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C 
> tools/testing/selftests/sgx run_tests
> make: Entering directory 
> '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c sigstruct.c -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -c sign_key.S -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
> gcc -Wall -Werror -g 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> -fPIC -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z 
> noexecstack -lcrypto
> gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
> -fno-stack-protector -mrdrnd 
> -I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
> test_encl.c test_encl_bootstrap.S -o 
> /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
> -Wl,-T,test_encl.lds,--build-id=none
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
> warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable 
> stack
> /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
> NOTE: This behaviour is deprecated and will be removed in a future version of 
> the linker
> TAP version 13
> 1..2
> # timeout set to 45
> # selftests: sgx: test_sgx
> # TAP version 13
> # 1..16
> # # Starting 16 tests from 1 test cases.
> # #  RUN   enclave.unclobbered_vdso ...
> # #OK  enclave.unclobbered_vdso
> # ok 1 enclave.unclobbered_vdso
> # #  RUN   enclave.unclobbered_vdso_oversubscribed ...
> # #OK  enclave.unclobbered_vdso_oversubscribed
> # ok 2 enclave.unclobbered_vdso_oversubscribed
> # #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
> # # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave 
> with 98566144 bytes heap may take a while ...
> # # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 
> 98566144 bytes to trimmed may take a while ...
> # # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run 
> EACCEPT for each page of 98566144 bytes may take a while ...
> # # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes 
> from enclave may take a while ...
> # #OK  enclave.unclobbered_vdso_oversubscribed_remove
> # ok 3 

Re: [PATCH v12 07/14] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-04-16 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> The functions, sgx_{mark,unmark}_page_reclaimable(), manage the tracking
> of reclaimable EPC pages: sgx_mark_page_reclaimable() adds a newly
> allocated page into the global LRU list while
> sgx_unmark_page_reclaimable() does the opposite. Abstract the hard coded
> global LRU references in these functions to make them reusable when
> pages are tracked in per-cgroup LRUs.
> 
> Create a helper, sgx_lru_list(), that returns the LRU that tracks a given
> EPC page. It simply returns the global LRU now, and will later return
> the LRU of the cgroup within which the EPC page was allocated. Replace
> the hard coded global LRU with a call to this helper.
> 
> Next patches will first get the cgroup reclamation flow ready while
> keeping pages tracked in the global LRU and reclaimed by ksgxd before we
> make the switch in the end for sgx_lru_list() to return per-cgroup
> LRU.

I found the first paragraph hard to read.  Provide my version below for
your reference:

"
The SGX driver tracks reclaimable EPC pages via
sgx_mark_page_reclaimable(), which adds the newly allocated page into the
global LRU list.  sgx_unmark_page_reclaimable() does the opposite.

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and the new allocated EPC page will need to be added
to the LRU of associated cgroup, but not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the new
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_lru_list(), to return the LRU that the EPC page should
be/is added to for the given EPC page.  Currently it just returns the
global LRU.  Change sgx_{mark|unmark}_page_reclaimable() to use the helper
function to get the LRU from the EPC page instead of referring to the
global LRU directly.

This allows EPC page being able to be tracked in "per-cgroup" LRU when
that becomes ready.
"

Nit:

That being said, is sgx_epc_page_lru() better than sgx_lru_list()?

> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Tested-by: Jarkko Sakkinen 
> ---
> 

Feel free to add:

Reviewed-by: Kai Huang 


Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-16 Thread Jarkko Sakkinen
On Tue Apr 16, 2024 at 6:20 AM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and LARGE sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang 

jarkko@mustatorvisieni:~/linux-tpmdd> sudo make -C tools/testing/selftests/sgx 
run_tests
make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sigstruct.c -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack 
-lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
-fno-stack-protector -mrdrnd 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
test_encl.c test_encl_bootstrap.S -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
-Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
warning: /tmp/ccqvDJVg.o: missing .note.GNU-stack section implies executable 
stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: 
This behaviour is deprecated and will be removed in a future version of the 
linker
TAP version 13
1..2
# timeout set to 45
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# #  RUN   enclave.unclobbered_vdso ...
# #OK  enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# #  RUN   enclave.unclobbered_vdso_oversubscribed ...
# #OK  enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 
98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 
bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run 
EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes 
from enclave may take a while ...
# #OK  enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# #  RUN   enclave.clobbered_vdso ...
# #OK  enclave.clobbered_vdso
# ok 4 enclave.clobbered_vdso
# #  RUN   enclave.clobbered_vdso_and_user_function ...
# #OK  

Re: [PATCH v12 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-16 Thread Huang, Kai
On Mon, 2024-04-15 at 20:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
> 
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> instance, within a Kubernetes environment, while a user may specify a
> particular EPC quota for a pod, the orchestrator requires a mechanism to
> enforce that the pod's actual runtime EPC usage does not exceed the
> allocated quota.
> 
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
> 
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
> 
> With these changes, the misc cgroup controller enables users to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
> 
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
> 
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
> 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> Reviewed-by: Jarkko Sakkinen 
> Reviewed-by: Tejun Heo 
> Tested-by: Jarkko Sakkinen 

I don't see any big issue, so feel free to add:

Reviewed-by: Kai Huang 

Nitpickings below:

[...]


> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,72 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright(c) 2022-2024 Intel Corporation. */
> +
> +#include 
> +#include 

It doesn't seem you need the above two here.

Probably they are needed in later patches, in that case we can move to the
relevant patch(es) that they got used.

However I think it's better to explicitly include  since
kzalloc()/kfree() are used.

Btw, I am not sure whether you want to use  because looks
it contains a lot of unrelated staff.  Anyway I guess nobody cares.

> +#include "epc_cgroup.h"
> +
> +/* The root SGX EPC cgroup */
> +static struct sgx_cgroup sgx_cg_root;

The comment isn't necessary (sorry didn't notice before), because the code
is pretty clear saying that IMHO.

[...]

> 
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SGX_EPC_CGROUP_H_
> +#define _SGX_EPC_CGROUP_H_
> +
> +#include 

I don't see why you need  here.  Also, ...

> +#include 
> +#include 
> +
> +#include "sgx.h"

... "sgx.h" already includes 

[...]

> 
> +static inline struct sgx_cgroup *sgx_get_current_cg(void)
> +{
> + /* get_current_misc_cg() never returns NULL when Kconfig enabled */
> + return sgx_cgroup_from_misc_cg(get_current_misc_cg());
> +}

I spent some time looking into this.  And yes if I was reading code
correctly the get_current_misc_cg() should never return NULL when Kconfig
is on.

I typed my analysis below in [*].  And it would be helpful if any cgroup
expert can have a second eye on this.

[...]


> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

Is this needed?  I believe SGX variants in "epc_cgroup.h" should be enough
for sgx/main.c?

[...]


[*] IIUC get_current_misc_cg() should never return NULL when Kconfig is on
(code indent slight adjusted for text wrap).

Firstly, during kernel boot there's always a valid @css allocated for MISC
cgroup, regardless whether it is disabled in kernel command line.

int __init cgroup_init(void)
{
...

for_each_subsys(ss, ssid) {
if (ss->early_init) {   
  
...
} else {
cgroup_init_subsys(ss, false);  
   
}

...


[PATCH v2] ftrace: Fix possible use-after-free issue in ftrace_location()

2024-04-16 Thread Zheng Yejian
KASAN reports a bug:

  BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
  Read of size 8 at addr 888141d40010 by task insmod/424
  CPU: 8 PID: 424 Comm: insmod Tainted: GW  6.9.0-rc2+
  [...]
  Call Trace:
   
   dump_stack_lvl+0x68/0xa0
   print_report+0xcf/0x610
   kasan_report+0xb5/0xe0
   ftrace_location+0x90/0x120
   register_kprobe+0x14b/0xa40
   kprobe_init+0x2d/0xff0 [kprobe_example]
   do_one_initcall+0x8f/0x2d0
   do_init_module+0x13a/0x3c0
   load_module+0x3082/0x33d0
   init_module_from_file+0xd2/0x130
   __x64_sys_finit_module+0x306/0x440
   do_syscall_64+0x68/0x140
   entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause is that when lookup_rec() is lookuping ftrace record of
an address in ftrace pages of some module, and those ftrace pages may
at the same time being freed in ftrace_release_mod() as the corresponding
module is being deleted:

  register_kprobes() {
check_kprobe_address_safe() {
  arch_check_ftrace_location() {
ftrace_location() {
  lookup_rec()  // access memory that has been freed by
// ftrace_release_mod() !!!

To fix it, we hold rcu lock as lookuping ftrace record, and call
synchronize_rcu() before freeing any ftrace pages.

Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization")
Signed-off-by: Zheng Yejian 
---
 kernel/trace/ftrace.c | 43 +++
 1 file changed, 27 insertions(+), 16 deletions(-)

v2:
 - Use RCU lock instead of holding ftrace_lock as suggested by Steve.
   Link: https://lore.kernel.org/all/20240410112823.1d084...@gandalf.local.home/

v1:
 - Link: 
https://lore.kernel.org/all/20240401125543.1282845-1-zhengyeji...@huawei.com/

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index da1710499698..2b41837a2fac 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
 }
 
 /**
- * ftrace_location_range - return the first address of a traced location
+ * ftrace_location_range_rcu - return the first address of a traced location
  * if it touches the given ip range
  * @start: start of range to search.
  * @end: end of range to search (inclusive). @end points to the last byte
@@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
  * that is either a NOP or call to the function tracer. It checks the ftrace
  * internal tables to determine if the address belongs or not.
  */
-unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+static unsigned long ftrace_location_range_rcu(unsigned long start, unsigned 
long end)
 {
struct dyn_ftrace *rec;
 
@@ -1603,6 +1603,16 @@ unsigned long ftrace_location_range(unsigned long start, 
unsigned long end)
return 0;
 }
 
+unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+{
+   unsigned long loc;
+
+   rcu_read_lock();
+   loc = ftrace_location_range_rcu(start, end);
+   rcu_read_unlock();
+   return loc;
+}
+
 /**
  * ftrace_location - return the ftrace location
  * @ip: the instruction pointer to check
@@ -1614,25 +1624,22 @@ unsigned long ftrace_location_range(unsigned long 
start, unsigned long end)
  */
 unsigned long ftrace_location(unsigned long ip)
 {
-   struct dyn_ftrace *rec;
+   unsigned long loc;
unsigned long offset;
unsigned long size;
 
-   rec = lookup_rec(ip, ip);
-   if (!rec) {
+   loc = ftrace_location_range(ip, ip);
+   if (!loc) {
if (!kallsyms_lookup_size_offset(ip, , ))
goto out;
 
/* map sym+0 to __fentry__ */
if (!offset)
-   rec = lookup_rec(ip, ip + size - 1);
+   loc = ftrace_location_range(ip, ip + size - 1);
}
 
-   if (rec)
-   return rec->ip;
-
 out:
-   return 0;
+   return loc;
 }
 
 /**
@@ -6596,6 +6603,7 @@ static int ftrace_process_locs(struct module *mod,
/* We should have used all pages unless we skipped some */
if (pg_unuse) {
WARN_ON(!skipped);
+   synchronize_rcu();
ftrace_free_pages(pg_unuse);
}
return ret;
@@ -6809,6 +6817,8 @@ void ftrace_release_mod(struct module *mod)
  out_unlock:
mutex_unlock(_lock);
 
+   if (tmp_page)
+   synchronize_rcu();
for (pg = tmp_page; pg; pg = tmp_page) {
 
/* Needs to be called outside of ftrace_lock */
@@ -7142,6 +7152,7 @@ void ftrace_free_mem(struct module *mod, void *start_ptr, 
void *end_ptr)
unsigned long start = (unsigned long)(start_ptr);
unsigned long end = (unsigned long)(end_ptr);
struct ftrace_page **last_pg = _pages_start;
+   struct ftrace_page *tmp_page = NULL;
struct ftrace_page *pg;
struct dyn_ftrace *rec;

Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-16 Thread Nadav Amit



> On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> 
> @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> struct load_info *info)
>   add_kallsyms(mod, info);
> 
>   /* Arch-specific module finalizing. */
> - return module_finalize(info->hdr, info->sechdrs, mod);
> + ret = module_finalize(info->hdr, info->sechdrs, mod);
> + if (ret)
> + return ret;
> +
> + for_each_mod_mem_type(type) {
> + struct module_memory *mem = >mem[type];
> +
> + if (mem->is_rox) {
> + if (!execmem_update_copy(mem->base, mem->rw_copy,
> +  mem->size))
> + return -ENOMEM;
> +
> + vfree(mem->rw_copy);
> + mem->rw_copy = NULL;
> + }
> + }
> +
> + return 0;
> }

I might be missing something, but it seems a bit racy.

IIUC, module_finalize() calls alternatives_smp_module_add(). At this
point, since you don’t hold the text_mutex, some might do text_poke(),
e.g., by enabling/disabling static-key, and the update would be
overwritten. No?


Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-16 Thread Peter Zijlstra
On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote:
> On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote:
> > 
> > > To populate the cache, a writable large page is allocated from vmalloc 
> > > with
> > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as
> > > ROX.
> > 
> > > +static void execmem_invalidate(void *ptr, size_t size, bool writable)
> > > +{
> > > + if (execmem_info->invalidate)
> > > + execmem_info->invalidate(ptr, size, writable);
> > > + else
> > > + memset(ptr, 0, size);
> > > +}
> > 
> > +static void execmem_invalidate(void *ptr, size_t size, bool writeable)
> > +{
> > +   /* fill memory with INT3 instructions */
> > +   if (writeable)
> > +   memset(ptr, 0xcc, size);
> > +   else
> > +   text_poke_set(ptr, 0xcc, size);
> > +}
> > 
> > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction.
> > It raises #BP not #UD.
> 
> Do you mean that _invalidate is a poor name choice or that it's necessary
> to use an instruction that raises #UD?

Poor naming, mostly. #BP handler will still scream bloody murder if the
site is otherwise unclaimed.

It just isn't an invalid instruction.



Re: [PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Heng Qi




在 2024/4/16 下午2:19, Liang Chen 写道:

The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r

Signed-off-by: Liang Chen 
---
   Changes from v7:
- use table lookup for rss hash type
   Changes from v6:
- fix a coding style issue
   Changes from v5:
- Preservation of the hash value has been dropped, following the conclusion
   from discussions in V3 reviews. The virtio_net driver doesn't
   accessing/using the virtio_net_hdr after the XDP program execution, so
   nothing tragic should happen. As to the xdp program, if it smashes the
   entry in virtio header, it is likely buggy anyways. Additionally, looking
   up the Intel IGC driver,  it also does not bother with this particular
   aspect.
---
  drivers/net/virtio_net.c| 42 +
  include/uapi/linux/virtio_net.h |  1 +
  2 files changed, 43 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..1d750009f615 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct virtnet_info 
*vi, const int mtu)
}
  }
  
+static enum xdp_rss_hash_type

+virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
+   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
+   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
+   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
+   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
+   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
+   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
+};
+
+static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
+  enum xdp_rss_hash_type *rss_type)
+{
+   const struct xdp_buff *xdp = (void *)_ctx;
+   struct virtio_net_hdr_v1_hash *hdr_hash;
+   struct virtnet_info *vi;
+   u16 hash_report;
+
+   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
+   return -ENODATA;
+
+   vi = netdev_priv(xdp->rxq->dev);
+   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
+   hash_report = __le16_to_cpu(hdr_hash->hash_report);
+
+   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
+   hash_report = VIRTIO_NET_HASH_REPORT_NONE;


When this happens, it may mean an error or user modification of the 
header occurred.

Should the following *hash* value be cleared to 0?

Thanks,
Heng


+
+   *rss_type = virtnet_xdp_rss_type[hash_report];
+   *hash = __le32_to_cpu(hdr_hash->hash_value);
+   return 0;
+}
+
+static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
+   .xmo_rx_hash= virtnet_xdp_rx_hash,
+};
+
  static int virtnet_probe(struct virtio_device *vdev)
  {
int i, err = -ENOMEM;
@@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
  
  		dev->hw_features |= NETIF_F_RXHASH;

+   dev->xdp_metadata_ops = _xdp_metadata_ops;
}
  
  	if (vi->has_rss_hash_report)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index cc65ef0f3c3e..3ee695450096 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
  #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
  #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
  #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
+#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10
__le16 hash_report;
__le16 padding;
  };





Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-16 Thread Heikki Krogerus
On Tue, Apr 09, 2024 at 01:04:12PM +0200, Pavel Machek wrote:
> Hi!
> 
> > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > > features removed. ANX7688 is rather criticial piece on PinePhone,
> > > there's no display and no battery charging without it.
> > > 
> > > There's likely more work to be done here, but having basic support
> > > in mainline is needed to be able to work on the other stuff
> > > (networking, cameras, power management).
> > > 
> > > Signed-off-by: Ondrej Jirman 
> > > Co-developed-by: Martijn Braam 
> > > Co-developed-by: Samuel Holland 
> > > Signed-off-by: Pavel Machek 
> > 
> > Just couple of quick comments below - I did not have time to go over
> > this very thoroughly, but I think you need to make a new version in
> > any case because of comments in 1/2.
> 
> Yes, there will be new version.
> 
> There is ton of sleep primitives, and existing ones are okay. I can
> change everything to fdelay or whatever primitive-of-the-day is, but
> I'd rather not do pointless changes.
> 
> You can ask for major changes, or complain about extra newlines, but
> doing both in the same email does not make sense.

I'm not telling you to do any major changes, yet. Right now I'm trying
to understand why you are doing thing the way you are doing them.

> > Btw, Co-developed-by comes before Signed-off-by I think.
> 
> I believe this order is fine, too.
> 
> > > +++ b/drivers/usb/typec/anx7688.c
> > > @@ -0,0 +1,1830 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * ANX7688 USB-C HDMI bridge/PD driver
> > > + *
> > > + * Warning, this driver is somewhat PinePhone specific.
> > 
> > So why not split this into core part and PinePhone specific glue
> > part?
> 
> Potentially a lot of work I can't really test and would rather not do.

Glue layer is usually an easy way to organise your driver into smaller
more manageable chunks, and most importantly, keep the core parts
generic. But just to be clear here, it is not always necessary - maybe
that's the case here.

> > > + struct delayed_work work;
> > > + struct timer_list work_timer;
> > > +
> > > + struct mutex lock;
> > 
> > Undocumented lock.
> 
> This is simple driver. How do you expect me to document it? Protects
> this data structure, not exactly uncommon.

You use a comment to describe what the lock protects. You can use
"scripts/checkpatch.pl --strict" to spot this kind of stuff for you.

> > > +
> > > + /* wait for power to stabilize and release reset */
> > > + msleep(10);
> > > + gpiod_set_value(anx7688->gpio_reset, 0);
> > > + usleep_range(2, 4);
> > 
> > Why not just use usleep_range() in both cases.
> 
> It does not really make code any better. Can do if you insist.

Consistency makes any code better.

> > > +static int anx7688_connect(struct anx7688 *anx7688)
> > > +{
> > > + struct typec_partner_desc desc = {};
> > > + int ret, i;
> > > + u8 fw[2];
> > > + const u8 dp_snk_identity[16] = {
> > > + 0x00, 0x00, 0x00, 0xec, /* id header */
> > > + 0x00, 0x00, 0x00, 0x00, /* cert stat */
> > > + 0x00, 0x00, 0x00, 0x00, /* product type */
> > > + 0x39, 0x00, 0x00, 0x51  /* alt mode adapter */
> > > + };
> > > + const u8 svid[4] = {
> > > + 0x00, 0x00, 0x01, 0xff,
> > > + };
> > 
> > Why not get those from DT?
> 
> Are you sure it belongs to the DT (and that DT people will agree)?

Yes, they belong to the DT, and there are already bindings. Dmitry
gave you the link to the bindings I think (thanks for that Dmitry).

> > > + u32 caps[8];
> > > +
> > > + dev_dbg(anx7688->dev, "cable inserted\n");
> > > +
> > > + anx7688->last_status = -1;
> > > + anx7688->last_cc_status = -1;
> > > + anx7688->last_dp_state = -1;
> > > +
> > > + msleep(10);
> > 
> > Please make a comment here why you have to wait, and use
> > usleep_range().
> 
> /* Dunno because working code does that and waiting for hardware to
> settle down after cable insertion kind of looks like a good thing */
> 
> I did not write the driver, and there's no good documentation for this
> chip. I can try to invent something, but...
> 
> > > + i = 0;
> > > + while (1) {
> > > + ret = anx7688_reg_read(anx7688, 
> > > ANX7688_REG_EEPROM_LOAD_STATUS0);
> > > +
> > > + if (ret >= 0 && (ret & ANX7688_EEPROM_FW_LOADED) == 
> > > ANX7688_EEPROM_FW_LOADED) {
> > > + dev_dbg(anx7688->dev, "eeprom0 = 0x%02x\n", ret);
> > > + dev_dbg(anx7688->dev, "fw loaded after %d ms\n", i * 
> > > 10);
> > > + break;
> > > + }
> > > +
> > > + if (i > 99) {
> > > + set_bit(ANX7688_F_FW_FAILED, anx7688->flags);
> > > + dev_err(anx7688->dev,
> > > + "boot firmware load failed (you may need to 
> > > flash FW to anx7688 first)\n");
> > > + ret = -ETIMEDOUT;
> > > + goto err_vconoff;
> > > + }
> > > + msleep(5);
> > > + i++;
> > > + }
> > 
> > You need to use 

Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-16 Thread Mike Rapoport
On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > +/**
> > > + * enum execmem_type - types of executable memory ranges
> > > + *
> > > + * There are several subsystems that allocate executable memory.
> > > + * Architectures define different restrictions on placement,
> > > + * permissions, alignment and other parameters for memory that can be 
> > > used
> > > + * by these subsystems.
> > > + * Types in this enum identify subsystems that allocate executable memory
> > > + * and let architectures define parameters for ranges suitable for
> > > + * allocations by each subsystem.
> > > + *
> > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > > + * are not explcitly defined.
> > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > + * @EXECMEM_BPF: parameters for BPF
> > > + * @EXECMEM_TYPE_MAX:
> > > + */
> > > +enum execmem_type {
> > > + EXECMEM_DEFAULT,
> > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > + EXECMEM_KPROBES,
> > > + EXECMEM_FTRACE,
> > > + EXECMEM_BPF,
> > > + EXECMEM_TYPE_MAX,
> > > +};
> > 
> > Can we please get a break-down of how all these types are actually
> > different from one another?
> > 
> > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > mind) and has less strict placement constraints for some of them?
> 
> Yeah, and really I'd *much* rather deal with that in arch code, as I have said
> several times.
> 
> For arm64 we have two bsaic restrictions: 
> 
> 1) Direct branches can go +/-128M
>We can expand this range by having direct branches go to PLTs, at a
>performance cost.
> 
> 2) PREL32 relocations can go +/-2G
>We cannot expand this further.
> 
> * We don't need to allocate memory for ftrace. We do not use trampolines.
> 
> * Kprobes XOL areas don't care about either of those; we don't place any
>   PC-relative instructions in those. Maybe we want to in future.
> 
> * Modules care about both; we'd *prefer* to place them within +/-128M of all
>   other kernel/module code, but if there's no space we can use PLTs and expand
>   that to +/-2G. Since modules can refreence other modules, that ends up
>   actually being halved, and modules have to fit within some 2G window that
>   also covers the kernel.
> 
> * I'm not sure about BPF's requirements; it seems happy doing the same as
>   modules.

BPF are happy with vmalloc().
 
> So if we *must* use a common execmem allocator, what we'd reall want is our 
> own
> types, e.g.
> 
>   EXECMEM_ANYWHERE
>   EXECMEM_NOPLT
>   EXECMEM_PREL32
> 
> ... and then we use those in arch code to implement module_alloc() and 
> friends.

I'm looking at execmem_types more as definition of the consumers, maybe I
should have named the enum execmem_consumer at the first place.

And the arch constrains defined in struct execmem_range describe how memory
should be allocated for each consumer.

These constraints are defined early at boot and remain static, so
initializing them once and letting a common allocator use them makes
perfect sense to me.

I agree that fallback_{start,end} are not ideal, but we have 3
architectures that have preferred and secondary range for modules. And arm
and powerpc use the same logic for kprobes as well, and I don't see why this
code should be duplicated.

And, for instance, if you decide to place PC-relative instructions if
kprobes XOL areas, you'd only need to update execmem_range for kprobes to
be more like the range for modules.

With central allocator it's easier to deal with the things like
VM_FLUSH_RESET_PERMS and caching of ROX memory and I think it will be more
maintainable that module_alloc(), alloc_insn_page() and
bpf_jit_alloc_exec() spread all over the place.
 
> Mark.

-- 
Sincerely yours,
Mike.



Re: [PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Jason Wang
On Tue, Apr 16, 2024 at 2:20 PM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> ---
>   Changes from v7:
> - use table lookup for rss hash type
>   Changes from v6:
> - fix a coding style issue
>   Changes from v5:
> - Preservation of the hash value has been dropped, following the conclusion
>   from discussions in V3 reviews. The virtio_net driver doesn't
>   accessing/using the virtio_net_hdr after the XDP program execution, so
>   nothing tragic should happen. As to the xdp program, if it smashes the
>   entry in virtio header, it is likely buggy anyways. Additionally, looking
>   up the Intel IGC driver,  it also does not bother with this particular
>   aspect.
> ---
>  drivers/net/virtio_net.c| 42 +
>  include/uapi/linux/virtio_net.h |  1 +
>  2 files changed, 43 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..1d750009f615 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct 
> virtnet_info *vi, const int mtu)
> }
>  }
>
> +static enum xdp_rss_hash_type
> +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
> +   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
> +   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
> +   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
> +   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
> +   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
> +   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
> +   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
> +   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
> +   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
> +   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
> +};
> +
> +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> +  enum xdp_rss_hash_type *rss_type)
> +{
> +   const struct xdp_buff *xdp = (void *)_ctx;
> +   struct virtio_net_hdr_v1_hash *hdr_hash;
> +   struct virtnet_info *vi;
> +   u16 hash_report;
> +
> +   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> +   return -ENODATA;
> +
> +   vi = netdev_priv(xdp->rxq->dev);
> +   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
> +   hash_report = __le16_to_cpu(hdr_hash->hash_report);
> +
> +   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
> +   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
> +
> +   *rss_type = virtnet_xdp_rss_type[hash_report];
> +   *hash = __le32_to_cpu(hdr_hash->hash_value);
> +   return 0;
> +}
> +
> +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> +   .xmo_rx_hash= virtnet_xdp_rx_hash,
> +};
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> int i, err = -ENOMEM;
> @@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
>
> dev->hw_features |= NETIF_F_RXHASH;
> +   dev->xdp_metadata_ops = _xdp_metadata_ops;
> }
>
> if (vi->has_rss_hash_report)
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index cc65ef0f3c3e..3ee695450096 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
>  #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
>  #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
>  #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10

This should not be part of uAPI. It may confuse the userspace.

Others look good.

Thanks

> __le16 hash_report;
> __le16 padding;
>  };
> --
> 2.40.1
>




[PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Liang Chen
The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r

Signed-off-by: Liang Chen 
---
  Changes from v7:
- use table lookup for rss hash type 
  Changes from v6:
- fix a coding style issue
  Changes from v5:
- Preservation of the hash value has been dropped, following the conclusion
  from discussions in V3 reviews. The virtio_net driver doesn't
  accessing/using the virtio_net_hdr after the XDP program execution, so
  nothing tragic should happen. As to the xdp program, if it smashes the
  entry in virtio header, it is likely buggy anyways. Additionally, looking
  up the Intel IGC driver,  it also does not bother with this particular
  aspect.
---
 drivers/net/virtio_net.c| 42 +
 include/uapi/linux/virtio_net.h |  1 +
 2 files changed, 43 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..1d750009f615 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct virtnet_info 
*vi, const int mtu)
}
 }
 
+static enum xdp_rss_hash_type
+virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
+   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
+   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
+   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
+   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
+   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
+   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
+};
+
+static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
+  enum xdp_rss_hash_type *rss_type)
+{
+   const struct xdp_buff *xdp = (void *)_ctx;
+   struct virtio_net_hdr_v1_hash *hdr_hash;
+   struct virtnet_info *vi;
+   u16 hash_report;
+
+   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
+   return -ENODATA;
+
+   vi = netdev_priv(xdp->rxq->dev);
+   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
+   hash_report = __le16_to_cpu(hdr_hash->hash_report);
+
+   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
+   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
+
+   *rss_type = virtnet_xdp_rss_type[hash_report];
+   *hash = __le32_to_cpu(hdr_hash->hash_value);
+   return 0;
+}
+
+static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
+   .xmo_rx_hash= virtnet_xdp_rx_hash,
+};
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err = -ENOMEM;
@@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
 
dev->hw_features |= NETIF_F_RXHASH;
+   dev->xdp_metadata_ops = _xdp_metadata_ops;
}
 
if (vi->has_rss_hash_report)
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index cc65ef0f3c3e..3ee695450096 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
 #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
 #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
 #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
+#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10
__le16 hash_report;
__le16 padding;
 };
-- 
2.40.1




Re: [PATCH net-next v7] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Liang Chen
On Mon, Apr 15, 2024 at 3:36 PM Jesper Dangaard Brouer  wrote:
>
>
>
> On 13/04/2024 06.10, Liang Chen wrote:
> > The RSS hash report is a feature that's part of the virtio specification.
> > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> > (still a work in progress as per [1]) support this feature. While the
> > capability to obtain the RSS hash has been enabled in the normal path,
> > it's currently missing in the XDP path. Therefore, we are introducing
> > XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> >
> > 1.
> > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
> >
> > Signed-off-by: Liang Chen 
> > ---
> >Changes from v6:
> > - fix a coding style issue
> >Changes from v5:
> > - Preservation of the hash value has been dropped, following the conclusion
> >from discussions in V3 reviews. The virtio_net driver doesn't
> >accessing/using the virtio_net_hdr after the XDP program execution, so
> >nothing tragic should happen. As to the xdp program, if it smashes the
> >entry in virtio header, it is likely buggy anyways. Additionally, looking
> >up the Intel IGC driver,  it also does not bother with this particular
> >aspect.
> > ---
> >   drivers/net/virtio_net.c | 55 
> >   1 file changed, 55 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c22d1118a133..2a1892b7b8d3 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -4621,6 +4621,60 @@ static void virtnet_set_big_packets(struct 
> > virtnet_info *vi, const int mtu)
> >   }
> >   }
> >
> > +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> > +enum xdp_rss_hash_type *rss_type)
> > +{
> > + const struct xdp_buff *xdp = (void *)_ctx;
> > + struct virtio_net_hdr_v1_hash *hdr_hash;
> > + struct virtnet_info *vi;
> > +
> > + if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> > + return -ENODATA;
> > +
> > + vi = netdev_priv(xdp->rxq->dev);
> > + hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
> > +
> > + switch (__le16_to_cpu(hdr_hash->hash_report)) {
> > + case VIRTIO_NET_HASH_REPORT_TCPv4:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV4_TCP;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_UDPv4:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV4_UDP;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_TCPv6:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_UDPv6:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_TCPv6_EX:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP_EX;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_UDPv6_EX:
> > + *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP_EX;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_IPv4:
> > + *rss_type = XDP_RSS_TYPE_L3_IPV4;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_IPv6:
> > + *rss_type = XDP_RSS_TYPE_L3_IPV6;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_IPv6_EX:
> > + *rss_type = XDP_RSS_TYPE_L3_IPV6_EX;
> > + break;
> > + case VIRTIO_NET_HASH_REPORT_NONE:
> > + default:
> > + *rss_type = XDP_RSS_TYPE_NONE;
> > + }
>
> Why is this not implemented as a table lookup?
>

Sure. Thanks for the suggestion!

> Like:
>
> https://elixir.bootlin.com/linux/v6.9-rc4/source/drivers/net/ethernet/intel/igc/igc_main.c#L6652
>   https://elixir.bootlin.com/linux/latest/A/ident/xdp_rss_hash_type
>
> --Jesper
>
> > +
> > + *hash = __le32_to_cpu(hdr_hash->hash_value);
> > + return 0;
> > +}
> > +
> > +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> > + .xmo_rx_hash= virtnet_xdp_rx_hash,
> > +};
> > +
> >   static int virtnet_probe(struct virtio_device *vdev)
> >   {
> >   int i, err = -ENOMEM;
> > @@ -4747,6 +4801,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> > VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> >
> >   dev->hw_features |= NETIF_F_RXHASH;
> > + dev->xdp_metadata_ops = _xdp_metadata_ops;
> >   }
> >
> >   if (vi->has_rss_hash_report)