[PATCH -next] VDUSE: fix another doc underline warning

2024-02-22 Thread Randy Dunlap
Extend the underline for a heading to prevent a documentation
build warning. Also spell "reconnection" correctly.

Documentation/userspace-api/vduse.rst:236: WARNING: Title underline too short.
HOW VDUSE devices reconnectoin works


Fixes: 2b3fd606c662 ("Documentation: Add reconnect process for VDUSE")
Signed-off-by: Randy Dunlap 
Cc: Cindy Lu 
Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Cc: virtualizat...@lists.linux.dev
Cc: Jonathan Corbet 
---
 Documentation/userspace-api/vduse.rst |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
--- a/Documentation/userspace-api/vduse.rst
+++ b/Documentation/userspace-api/vduse.rst
@@ -232,8 +232,8 @@ able to start the dataplane processing a
 
 For more details on the uAPI, please see include/uapi/linux/vduse.h.
 
-HOW VDUSE devices reconnectoin works
-------------------------
+HOW VDUSE devices reconnection works
+------------------------------------
 0. Userspace APP checks if the device /dev/vduse/vduse_name exists.
If it does not exist, need to create the instance.goto step 1
If it does exist, it means this is a reconnect and goto step 3.
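
The documented flow boils down to an existence check; a minimal userspace
sketch (device name taken from the doc above, the create/reconnect actions
are hypothetical placeholders):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Step 0: does the VDUSE instance already exist? */
		if (access("/dev/vduse/vduse_name", F_OK) != 0)
			printf("no instance: create it (step 1)\n");
		else
			printf("instance exists: reconnect (step 3)\n");
		return 0;
	}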



[PATCH] ftrace: fix most kernel-doc warnings

2024-02-22 Thread Randy Dunlap
Reduce the number of kernel-doc warnings from 52 down to 10, i.e.,
fix 42 kernel-doc warnings by (a) using the Returns: format for
function return values or (b) using "@var:" instead of "@var -"
for function parameter descriptions.
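
For reference, the two kernel-doc forms being fixed look like this (a
sketch using a hypothetical function):

	/**
	 * foo_lookup - look up a foo entry
	 * @ip: the instruction pointer to check ("@var:", not "@var -")
	 *
	 * Returns: the matching entry if found, NULL otherwise.
	 */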

Fix one return values list so that it is formatted correctly when
rendered for output.

Spell "non-zero" with a hyphen in several places.

Signed-off-by: Randy Dunlap 
Reported-by: kernel test robot 
Link: https://lore.kernel.org/oe-kbuild-all/202312180518.x6frydsn-...@intel.com/
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Mathieu Desnoyers 
Cc: Mark Rutland 
Cc: linux-trace-ker...@vger.kernel.org
---
This patch addresses most of the reported kernel-doc warnings but does
not fix all of them, so I did not use "Closes:" for the Link: tag.

 kernel/trace/ftrace.c |   90 
 1 file changed, 46 insertions(+), 44 deletions(-)

diff -- a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1160,7 +1160,7 @@ __ftrace_lookup_ip(struct ftrace_hash *h
  * Search a given @hash to see if a given instruction pointer (@ip)
  * exists in it.
  *
- * Returns the entry that holds the @ip if found. NULL otherwise.
+ * Returns: the entry that holds the @ip if found. NULL otherwise.
  */
 struct ftrace_func_entry *
 ftrace_lookup_ip(struct ftrace_hash *hash, unsigned long ip)
@@ -1282,7 +1282,7 @@ static void free_ftrace_hash_rcu(struct
 
 /**
  * ftrace_free_filter - remove all filters for an ftrace_ops
- * @ops - the ops to remove the filters from
+ * @ops: the ops to remove the filters from
  */
 void ftrace_free_filter(struct ftrace_ops *ops)
 {
@@ -1587,7 +1587,7 @@ static struct dyn_ftrace *lookup_rec(uns
  * @end: end of range to search (inclusive). @end points to the last byte
  * to check.
  *
- * Returns rec->ip if the related ftrace location is a least partly within
+ * Returns: rec->ip if the related ftrace location is a least partly within
  * the given address range. That is, the first address of the instruction
  * that is either a NOP or call to the function tracer. It checks the ftrace
  * internal tables to determine if the address belongs or not.
@@ -1607,9 +1607,10 @@ unsigned long ftrace_location_range(unsi
  * ftrace_location - return the ftrace location
  * @ip: the instruction pointer to check
  *
- * If @ip matches the ftrace location, return @ip.
- * If @ip matches sym+0, return sym's ftrace location.
- * Otherwise, return 0.
+ * Returns:
+ * * If @ip matches the ftrace location, return @ip.
+ * * If @ip matches sym+0, return sym's ftrace location.
+ * * Otherwise, return 0.
  */
 unsigned long ftrace_location(unsigned long ip)
 {
@@ -1639,7 +1640,7 @@ out:
  * @start: start of range to search
 * @end: end of range to search (inclusive). @end points to the last byte to check.
  *
- * Returns 1 if @start and @end contains a ftrace location.
+ * Returns: 1 if @start and @end contains a ftrace location.
  * That is, the instruction that is either a NOP or call to
  * the function tracer. It checks the ftrace internal tables to
  * determine if the address belongs or not.
@@ -2574,7 +2575,7 @@ static void call_direct_funcs(unsigned l
  * wants to convert to a callback that saves all regs. If FTRACE_FL_REGS
  * is not set, then it wants to convert to the normal callback.
  *
- * Returns the address of the trampoline to set to
+ * Returns: the address of the trampoline to set to
  */
 unsigned long ftrace_get_addr_new(struct dyn_ftrace *rec)
 {
@@ -2615,7 +2616,7 @@ unsigned long ftrace_get_addr_new(struct
  * a function that saves all the regs. Basically the '_EN' version
  * represents the current state of the function.
  *
- * Returns the address of the trampoline that is currently being called
+ * Returns: the address of the trampoline that is currently being called
  */
 unsigned long ftrace_get_addr_curr(struct dyn_ftrace *rec)
 {
@@ -2719,7 +2720,7 @@ struct ftrace_rec_iter {
 /**
  * ftrace_rec_iter_start - start up iterating over traced functions
  *
- * Returns an iterator handle that is used to iterate over all
+ * Returns: an iterator handle that is used to iterate over all
  * the records that represent address locations where functions
  * are traced.
  *
@@ -2751,7 +2752,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_
  * ftrace_rec_iter_next - get the next record to process.
  * @iter: The handle to the iterator.
  *
- * Returns the next iterator after the given iterator @iter.
+ * Returns: the next iterator after the given iterator @iter.
  */
 struct ftrace_rec_iter *ftrace_rec_iter_next(struct ftrace_rec_iter *iter)
 {
@@ -2776,7 +2777,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_
  * ftrace_rec_iter_record - get the record at the iterator location
  * @iter: The current iterator location
  *
- * Returns the record that the current @iter is at.
+ * Returns: the record that the current @iter is at.
  */
 struct dyn_ftrace *ftrace_rec_iter_record(struct 
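
Taken together, the kernel-doc above describes a simple iteration pattern;
a sketch of a typical caller (loop shape only, as arch ftrace code uses it):

	struct ftrace_rec_iter *iter;
	struct dyn_ftrace *rec;

	for (iter = ftrace_rec_iter_start(); iter;
	     iter = ftrace_rec_iter_next(iter)) {
		rec = ftrace_rec_iter_record(iter);
		/* rec->ip is an address location where a function is traced */
	}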

Re: [RFC PATCH 11/14] csky/thread_info: Introduce TIF_NOTIFY_IPI flag

2024-02-22 Thread Guo Ren
On Wed, Feb 21, 2024 at 1:20 AM K Prateek Nayak  wrote:
>
> Add support for TIF_NOTIFY_IPI on C-SKY. With TIF_NOTIFY_IPI, a sender
> sending an IPI to an idle CPU in TIF_POLLING mode will set the
> TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
> CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
> avoids spurious calls to schedule_idle() in cases where an IPI does not
> necessarily wake up a task on the idle CPU.
>
> Cc: Guo Ren 
> Cc: "Rafael J. Wysocki" 
> Cc: Daniel Lezcano 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> Cc: Daniel Bristot de Oliveira 
> Cc: Valentin Schneider 
> Cc: K Prateek Nayak 
> Cc: linux-c...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Signed-off-by: K Prateek Nayak 
> ---
>  arch/csky/include/asm/thread_info.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/csky/include/asm/thread_info.h b/arch/csky/include/asm/thread_info.h
> index b5ed788f0c68..9bc7a037c476 100644
> --- a/arch/csky/include/asm/thread_info.h
> +++ b/arch/csky/include/asm/thread_info.h
> @@ -61,6 +61,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define TIF_SYSCALL_TRACEPOINT 5   /* syscall tracepoint instrumentation */
>  #define TIF_SYSCALL_AUDIT  6   /* syscall auditing */
>  #define TIF_NOTIFY_SIGNAL  7   /* signal notifications exist */
> +#define TIF_NOTIFY_IPI 8   /* Pending IPI on TIF_POLLLING idle CPU */
>  #define TIF_POLLING_NRFLAG 16  /* poll_idle() is TIF_NEED_RESCHED */
>  #define TIF_MEMDIE 18  /* is terminating due to OOM killer */
>  #define TIF_RESTORE_SIGMASK    20  /* restore signal mask in do_signal() */
> @@ -73,6 +74,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define _TIF_SYSCALL_TRACEPOINT    (1 << TIF_SYSCALL_TRACEPOINT)
>  #define _TIF_SYSCALL_AUDIT         (1 << TIF_SYSCALL_AUDIT)
>  #define _TIF_NOTIFY_SIGNAL         (1 << TIF_NOTIFY_SIGNAL)
> +#define _TIF_NOTIFY_IPI            (1 << TIF_NOTIFY_IPI)
>  #define _TIF_UPROBE                (1 << TIF_UPROBE)
>  #define _TIF_POLLING_NRFLAG        (1 << TIF_POLLING_NRFLAG)
>  #define _TIF_MEMDIE                (1 << TIF_MEMDIE)
> --
> 2.34.1
>
LGTM

Acked-by: Guo Ren 

-- 
Best Regards
 Guo Ren
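
For context, a minimal sketch of the sender-side idea from the quoted
commit message (the helper name is hypothetical; the actual series wires
this through the generic idle/IPI code):

	/* Pull a polling idle CPU out of idle without forcing a reschedule. */
	static void notify_idle_cpu(struct task_struct *idle_task)
	{
		/* Previously: set_tsk_need_resched(idle_task); */
		set_tsk_thread_flag(idle_task, TIF_NOTIFY_IPI);
	}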



Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-22 Thread Ankit Agrawal
>> Ankit Agrawal (3):
>>   vfio/pci: rename and export do_io_rw()
>>   vfio/pci: rename and export range_intersect_range
>>   vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
>>
>>  MAINTAINERS   |  16 +-
>>  drivers/vfio/pci/Kconfig  |   2 +
>>  drivers/vfio/pci/Makefile |   2 +
>>  drivers/vfio/pci/nvgrace-gpu/Kconfig  |  10 +
>>  drivers/vfio/pci/nvgrace-gpu/Makefile |   3 +
>>  drivers/vfio/pci/nvgrace-gpu/main.c   | 879 ++
>>  drivers/vfio/pci/vfio_pci_config.c    |  42 ++
>>  drivers/vfio/pci/vfio_pci_rdwr.c  |  16 +-
>>  drivers/vfio/pci/virtio/main.c    |  72 +--
>>  include/linux/vfio_pci_core.h |  10 +-
>>  10 files changed, 993 insertions(+), 59 deletions(-)
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
>>  create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c
>>
>
> Applied to vfio next branch for v6.9.  Thanks,
>
> Alex

Thanks Alex! Appreciate this along with your guidance and help in the reviews.


Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-22 Thread Xuan Zhuo
On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  wrote:
> On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > >
> > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > >
> > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > >  wrote:
> > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > [...]
> > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > > > > >   }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp,
> > > > > > > > +                                     struct net_device *dev,
> > > > > > > > +                                     struct virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > +{
> > > > > > > > +       if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > +               virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > +               virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > +       }
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > Would it be possible to store a pointer to hdr_hash in virtnet_xdp_buff,
> > > > > > > with the purpose of delaying extracting this, until and only if XDP
> > > > > > > bpf_prog calls the kfunc?
> > > > > > >
> > > > > >
> > > > > > That seems to be the way v1 works,
> > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > But it was pointed out that the inline header may be overwritten by
> > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > >
> > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > contents/buffer?
> > > > >
> > > >
> > > > Normally, an XDP program accesses only the packet data. However,
> > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > program access to the virtio header located immediately before the
> > >
> > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > xdp->data_hard_start:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > >
> > > and virtio net set such field after the virtio_net_hdr:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > >
> > > I don't see how the virtio hdr could be touched? Possibly even more
> > > important: if such a thing is possible, I think it should be somewhat
> > > denied (for the same reason an H/W nic should prevent XDP from
> > > modifying its own buffer descriptor).
> >
> > Thank you for highlighting this concern. The header layout differs
> > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > would be as depicted in the diagram below,
> >
> >   buf
> >    |
> >    v
> > +--------------+--------------+---------+
> > | xdp headroom | virtio header| packet  |
> > | (256 bytes)  | (20 bytes)   | content |
> > +--------------+--------------+---------+
> > ^                             ^
> > |                             |
> > data_hard_start              data
> >                              data_meta
> >
> > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > towards 'data_hard_start', it would point to the inline header, thus
> > potentially allowing the XDP program to access the inline header.
>
> I see. That layout was completely unexpected to me.
>
> AFAICS the virtio_net driver tries to avoid accessing/using the
> virtio_net_hdr after the XDP program execution, so nothing tragic
> should happen.
>
> @Michael, @Jason, I guess the above is like that by design? Isn't it a
> bit fragile?

Yes. We handle it carefully. That brings some trouble; we hope to put the
virtio-net header into the vring descriptor like other NICs do, but that is
a big project.

I think this patch is ok, this can be merged to net-next firstly.

Thanks.
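
For reference, the consumer side of this hint is the XDP RX metadata
kfunc; a minimal BPF-side sketch (assuming a vmlinux.h build environment;
the kfunc name and signature follow the kernel's XDP RX metadata support):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
					    enum xdp_rss_hash_type *rss_type) __ksym;

	SEC("xdp")
	int read_rx_hash(struct xdp_md *ctx)
	{
		__u32 hash;
		enum xdp_rss_hash_type rss_type;

		/* Only succeeds when the driver implements the hint (this patch). */
		if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
			bpf_printk("rx hash 0x%x", hash);
		return XDP_PASS;
	}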


>
> Thanks!
>
> Paolo
>



[PATCH 1/2] tracing: Fix snapshot counter going between two tracers that use it

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a tracer that uses the snapshot is enabled even if the snapshot
was used by the previous tracer.

That is:

 # cd /sys/kernel/tracing
 # echo wakeup_rt > current_tracer
 # echo wakeup_dl > current_tracer
 # echo nop > current_tracer

would leave the snapshot counter at 1 and not zero. That's because
enabling wakeup_dl would increment the counter again, but setting the
tracer to nop would only decrement it once.

Do not arm the snapshot for a tracer if the previous tracer already had it
armed.

Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b7a870c8ae2a..480201c3b36e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6164,7 +6164,7 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
tracing_disarm_snapshot(tr);
}
 
-   if (t->use_max_tr) {
+   if (!had_max_tr && t->use_max_tr) {
ret = tracing_arm_snapshot_locked(tr);
if (ret)
goto out;
-- 
2.43.0





[PATCH 0/2] tracing: Fix snapshot accounting

2024-02-22 Thread Steven Rostedt


The ring buffer mapping test failed after running the ftrace tests.
This was due to some mismatched snapshot accounting that left the
snapshot counter enabled when it was not, which prevents the ring buffer
from being mapped.

Steven Rostedt (Google) (2):
  tracing: Fix snapshot counter going between two tracers that use it
  tracing: Decrement the snapshot if the snapshot trigger fails to register


 kernel/trace/trace.c| 2 +-
 kernel/trace/trace_events_trigger.c | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)



[PATCH 2/2] tracing: Decrement the snapshot if the snapshot trigger fails to register

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Running the ftrace selftests caused the ring buffer mapping test to fail.
Investigating, I found that the snapshot counter would be incremented
every time a snapshot trigger was added, even if that snapshot trigger
failed.

 # cd /sys/kernel/tracing
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 # echo "snapshot" > events/sched/sched_process_fork/trigger
 -bash: echo: write error: File exists

The second write fails but still increments the snapshot counter without
decrementing it again. The counter needs to be decremented when
registering the snapshot trigger fails.

Fixes: 16f7e48ffc53a ("tracing: Add snapshot refcount")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_events_trigger.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c
index 62e4f58b8671..4bec043c8690 100644
--- a/kernel/trace/trace_events_trigger.c
+++ b/kernel/trace/trace_events_trigger.c
@@ -1491,7 +1491,10 @@ register_snapshot_trigger(char *glob,
if (ret < 0)
return ret;
 
-   return register_trigger(glob, data, file);
+   ret = register_trigger(glob, data, file);
+   if (ret < 0)
+   tracing_disarm_snapshot(file->tr);
+   return ret;
 }
 
 static void unregister_snapshot_trigger(char *glob,
-- 
2.43.0





Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-22 Thread Hou Tao
Hi,

On 2/23/2024 3:49 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote:
>> From: Hou Tao 
>>
>> When trying to insert a 10MB kernel module kept in a virtiofs with cache
>> disabled, the following warning was reported:
>>

SNIP

>>
>> A feasible solution is to limit the value of max_read for virtiofs, so
>> the length passed to kmalloc() will be limited. However it will affect
>> the max read size for ITER_IOVEC io and the value of max_write also needs
>> limitation. So instead of limiting the values of max_read and max_write,
>> introducing max_nopage_rw to cap both the values of max_read and
>> max_write when the fuse dio read/write request is initiated from kernel.
>>
>> Considering that fuse read/write requests from the kernel are uncommon and to
>> decrease the demand for large contiguous pages, set max_nopage_rw as
>> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
>>
>> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
>> Signed-off-by: Hou Tao 
>
> So what should I do with this patch? It includes fuse changes
> but of course I can merge too if no one wants to bother either way...

The patch got some feedback from Bernd Schubert, and I will post v2
before next Thursday.
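
For illustration, a minimal sketch of the capping idea from the quoted
commit message (the 256KB value comes from the message; the constant and
helper names and their placement are hypothetical, pending v2):

	#define FUSE_MAX_NOPAGE_RW	(256 * 1024)	/* cap for kernel-initiated dio */

	static size_t fuse_dio_max_bytes(struct fuse_conn *fc,
					 struct iov_iter *iter, bool write)
	{
		size_t max = write ? fc->max_write : fc->max_read;

		/* ITER_KVEC means the request came from the kernel, not userspace */
		if (iov_iter_is_kvec(iter))
			max = min_t(size_t, max, FUSE_MAX_NOPAGE_RW);
		return max;
	}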




Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-22 Thread Haitao Huang

On Thu, 22 Feb 2024 15:26:05 -0600, Huang, Kai  wrote:




On 23/02/2024 6:09 am, Haitao Huang wrote:
On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai wrote:





-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
-	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+			/* All pages were too young to reclaim, try again a little later */
+			schedule();
+	}
+
+	return 0;
 }



Seems this code change is 90% similar to the existing code in the
sgx_alloc_epc_page():

...
	for ( ; ; ) {
		page = __sgx_alloc_epc_page();
		if (!IS_ERR(page)) {
			page->owner = owner;
			break;
		}

		if (list_empty(&sgx_active_page_list))
			return ERR_PTR(-ENOMEM);

		if (!reclaim) {
			page = ERR_PTR(-EBUSY);
			break;
		}

		if (signal_pending(current)) {
			page = ERR_PTR(-ERESTARTSYS);
			break;
		}

		sgx_reclaim_pages();
		cond_resched();
	}
...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: when you either failed to allocate one
page or failed to charge one page, you try to reclaim EPC page(s) from
the current EPC cgroup, either directly or indirectly.

No?

Only these lines are the same:

	if (!reclaim) {
		page = ERR_PTR(-EBUSY);
		break;
	}

	if (signal_pending(current)) {
		page = ERR_PTR(-ERESTARTSYS);
		break;
	}

In sgx_alloc_epc_page() we do global reclamation, but here we do
per-cgroup reclamation.


But why?  If we failed to allocate, shouldn't we try to reclaim from the  
_current_ EPC cgroup instead of global?  E.g., I thought one enclave in  
one EPC cgroup requesting an insane amount of EPC shouldn't impact enclaves
inside other cgroups?


Right. When the code reaches here, we have already done per-cgroup
reclaim. The cgroup may not be at or have reached its limit, but the
system has run out of physical EPC.


Thanks
Haitao



Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation

2024-02-22 Thread Huang, Kai




On 23/02/2024 5:36 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai  wrote:


On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., sum of cgroup limits
greater than the total capacity, cgroups may never reclaim but the total
usage can still be near the capacity. Therefore global reclamation is
still needed in those cases and it should reclaim from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU.

Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c

index 6b0c26cac621..d4265a390ba9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;

 static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
 {
+#ifdef CONFIG_CGROUP_SGX_EPC
+	if (epc_page->epc_cg)
+		return &epc_page->epc_cg->lru;
+
+	/* This should not happen if kernel is configured correctly */
+	WARN_ON_ONCE(1);
+#endif
 	return &sgx_global_lru;
 }


How about when the EPC cgroup is enabled, but one enclave doesn't belong
to any EPC cgroup?  Is it OK to track EPC pages for these enclaves in the
root EPC cgroup's LRU list together with the enclaves that belong to the
root cgroup?

This should be a valid case, right?


There is no such case. Each page is in the root by default.



Is it guaranteed by the (misc) cgroup design/implementation?  If so,
please add this information to the changelog and/or comments.  It helps
non-cgroup experts like me to understand.




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Huang, Kai




On 23/02/2024 6:20 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:

[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are isolated,
> but not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks like you want to make this function do "each cycle" as what you
> mentioned in the v8 [1]:
>
> I tested with that approach and found we can only target the number of
> pages attempted to reclaim, not pages actually reclaimed, due to the
> uncertainty of how long it takes to reclaim pages. Besides, targeting
> the number of scanned pages for each cycle is also what the ksgxd does.
>
> If we target the actual number of pages, sometimes it just takes too
> long. I saw more timeouts with the default time limit when running
> parallel selftests.
>
> I am not sure what "sometimes it just takes too long" means, but what I
> am thinking is that you are trying to do some perfect yet complicated
> code here.

I think what I observed was that try_charge() would block too long before
schedule() got a chance to yield, causing more timeouts than necessary.
I'll re-test to be sure.


Looks like this is valid information that can be used to justify whatever
you are implementing in the EPC cgroup reclaiming function(s).

I'll add some comments. I was assuming this just follows the old design
of ksgxd. There were some comments at the beginning of
sgx_epc_cgroup_reclaim_pages():

	/*
	 * Attempting to reclaim only a few pages will often fail and is
	 * inefficient, while reclaiming a huge number of pages can result in
	 * soft lockups due to holding various locks for an extended duration.
	 */
	unsigned int nr_to_scan = SGX_NR_TO_SCAN;

I think it can be improved to emphasize that we only "attempt" to finish
scanning a fixed number of pages for reclamation, not enforce a number of
pages successfully reclaimed.


Not sure it needs to be in this comment, but somewhere just state that you
are trying to follow ksgxd() (the current sgx_reclaim_pages()), but trying
to do it "_across_ a given cgroup and all the descendants".

That's the reason you made @nr_to_scan a pointer.

Also add some text to explain why you follow ksgxd() -- not wanting to
block longer due to looping over descendants, etc. -- so we can focus on
discussing whether such justification is reasonable.





Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Huang, Kai




On 23/02/2024 9:12 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:

Hi Kai,

On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote:

[...]
>
> So you introduced the work/workqueue here but there's no place which
> actually queues the work.  IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC cgroup w/o workqueue and introduce the work/workqueue in later patch.
>
> Makes sense?
>

Starting in v7, I was trying to split the big patch, #10 in v6, as you and
others suggested. My thought process was to put the infrastructure needed
for per-cgroup reclaim in the front, then turn on per-cgroup reclaim in
[v9 13/15] at the end.


That's reasonable for sure.



Thanks for the confirmation :-)



Before that, all reclaimables are tracked in the global LRU, so really
there is no "reclaim certain pages from a given EPC cgroup w/o workqueue"
or reclaim through the workqueue before that point, as suggested in #2.
This patch puts down the implementation for both flows, but neither is
used yet, as stated in the commit message.


I know it's not used yet.  The point is how to split patches to make them
more self-contained and easy to review.


I would think this patch is already self-contained in that all of it is
implementation of cgroup reclamation building blocks utilized later. But
I'll try to follow your suggestions below to split further (I would prefer
not to merge back in general unless there is a strong reason).




For #2, sorry for not being explicit -- I meant it seems more reasonable
to split in this way:

Patch 1)
  a). change to sgx_reclaim_pages();


I'd still prefer this to be a separate patch. It is self-contained, IMHO.
We were splitting the original patch because it was too big. I don't want
to merge back unless there is a strong reason.



  b). introduce sgx_epc_cgroup_reclaim_pages();


Ok.


If I got you right, I believe you want to have a cgroup variant function 
following the same behaviour as the one for global reclaim, i.e., the 
_current_ sgx_reclaim_pages(), which always tries to scan and reclaim 
SGX_NR_TO_SCAN pages each time.


And this cgroup variant function, sgx_epc_cgroup_reclaim_pages(), tries 
to scan and reclaim SGX_NR_TO_SCAN pages each time "_across_ the cgroup 
and all the descendants".


And you want to implement sgx_epc_cgroup_reclaim_pages() in this way due 
to WHATEVER reasons.


In that case, the change to sgx_reclaim_pages() and the introduction of 
sgx_epc_cgroup_reclaim_pages() should really be together because they 
are completely tied together in terms of implementation.


In this way you can just explain clearly in _ONE_ patch why you chose
this implementation, and for the reviewer it's also easier because we
can just discuss it in one patch.


Makes sense?



  c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better 
name), which just takes an EPC cgroup as input w/o involving any 
work/workqueue.


This is for the workqueue use only. So I think it'd better be with patch
#2 below?


There are multiple levels of logic here IMHO:

  1. a) and b) above focus on "each reclaim" of a given EPC cgroup
  2. c) is about a loop of the above to bring a given cgroup's usage to its limit
  3. workqueue is one (probably the best) way to do c) in an async way
  4. the logic where 1) (direct reclaim) and 3) (indirect) are triggered

To me, it's clear 1) should be in one patch as stated above.

Also, to me 3) and 4) are better together since they give you a clear
view on how the direct/indirect reclaim is triggered.


2) could be flexible depending on how you see it.  If you prefer viewing
it from the low-level implementation of reclaiming pages from a cgroup,
then it's also OK for it to be together with 1).  If you want to treat it
as part of the _async_ way of bringing usage down to the limit, then
_MAYBE_ it's also OK for it to be with 3) and 4).


But to me 2) can be together with 1) or even a separate patch because
it's still kind of a low-level reclaiming detail.  3) and 4) shouldn't
contain such detail but should focus on how direct/indirect reclaim is done.


[...]



To be honest, the part I find most confusing is this self-containedness.
It seems to depend on how you look at things.


Completely understand.  But I think our discussion should be helpful to 
both of us and others.




Re: [PATCH v4 2/4] tracing/user_events: Introduce multi-format events

2024-02-22 Thread Steven Rostedt
On Thu, 22 Feb 2024 00:18:05 +
Beau Belgrave  wrote:

> Currently user_events supports 1 event with the same name and must have
> the exact same format when referenced by multiple programs. This opens
> an opportunity for malicous or poorly thought through programs to

  malicious? ;-)

-- Steve

> create events that others use with different formats. Another scenario
> is user programs wishing to use the same event name but add more fields
> later when the software updates. Various versions of a program may be
> running side-by-side, which is prevented by the current single format
> requirement.
> 
> Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates
> the user program wishes to use the same user_event name, but may have
> several different formats of the event. When this flag is used, create
> the underlying tracepoint backing the user_event with a unique name
> per-version of the format. It's important that existing ABI users do
> not get this logic automatically, even if one of the multi format
> events matches the format. This ensures existing programs that create
> events and assume the tracepoint name will match exactly continue to
> work as expected. Add logic to only check multi-format events with
> other multi-format events and single-format events to only check
> single-format events during find.
> 
> Change system name of the multi-format event tracepoint to ensure that
> multi-format events are isolated completely from single-format events.
> This prevents single-format names from conflicting with multi-format
> events if they end with the same suffix as the multi-format events.
> 
> Add a register_name (reg_name) to the user_event struct which allows for
> split naming of events. We now have the name that was used to register
> within user_events as well as the unique name for the tracepoint. Upon
> registering events ensure matches based on first the reg_name, followed
> by the fields and format of the event. This allows for multiple events
> with the same registered name to have different formats. The underlying
> tracepoint will have a unique name in the format of {reg_name}.{unique_id}.
> 
> For example, if both "test u32 value" and "test u64 value" are used with
> the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique
> tracepoints. The dynamic_events file would then show the following:
>   u:test u64 count
>   u:test u32 count
> 
> The actual tracepoint names look like this:
>   test.0
>   test.1
> 
> Both would be under the new user_events_multi system name to prevent the
> older ABI from being used to squat on multi-formatted events and block
> their use.
> 
> Deleting events via "!u:test u64 count" would only delete the first
> tracepoint that matched that format. When the delete ABI is used all
> events with the same name will be attempted to be deleted. If
> per-version deletion is required, user programs should either not use
> persistent events or delete them via dynamic_events.
> 
> Signed-off-by: Beau Belgrave 
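
For illustration, registration with the new flag would look roughly like
this (a sketch against the user_events uapi; USER_EVENT_REG_MULTI_FORMAT
is the flag added by this series):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/user_events.h>

	static long enable_var;		/* kernel sets a bit here when enabled */

	static int register_multi(int data_fd)
	{
		struct user_reg reg = {0};

		reg.size = sizeof(reg);
		reg.enable_bit = 0;
		reg.enable_size = sizeof(enable_var);
		reg.enable_addr = (__u64)(uintptr_t)&enable_var;
		reg.name_args = (__u64)(uintptr_t)"test u64 count";
		reg.flags = USER_EVENT_REG_MULTI_FORMAT;

		return ioctl(data_fd, DIAG_IOCSREG, &reg);	/* 0 on success */
	}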



Re: [PATCH v4 1/4] tracing/user_events: Prepare find/delete for same name events

2024-02-22 Thread Steven Rostedt
On Thu, 22 Feb 2024 00:18:04 +
Beau Belgrave  wrote:

> The current code for finding and deleting events assumes that there will
> never be cases when user_events are registered with the same name, but
> different formats. Scenarios exist where programs want to use the same
> name but have different formats. An example is multiple versions of a
> program running side-by-side using the same event name, but with updated
> formats in each version.
> 
> This change does not yet allow for multi-format events. If user_events
> are registered with the same name but different arguments the programs
> see the same return values as before. This change simply makes it
> possible to easily accomodate for this.

 accommodate? ;-)

> 
> Update find_user_event() to take in argument parameters and register
> flags to accomodate future multi-format event scenarios. Have find

  accommodate

> validate argument matching and return error pointers to cover when
> an existing event has the same name but different format. Update
> callers to handle error pointer logic.
> 
> Move delete_user_event() to use hash walking directly now that
> find_user_event() has changed. Delete all events found that match the
> register name, stop if an error occurs and report back to the user.
> 
> Update user_fields_match() to cover list_empty() scenarios now that
> find_user_event() uses it directly. This makes the logic consistent
> accross several callsites.

  across ?

I'll update this.

-- Steve

> 
> Signed-off-by: Beau Belgrave 



Re: [PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-22 Thread Alex Williamson
On Tue, 20 Feb 2024 17:20:52 +0530
 wrote:

> From: Ankit Agrawal 
> 
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
> 
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the
> chip-to-chip interconnect that is exposed as a contiguous physically
> addressable region on the host. Since the device memory is cache
> coherent with the CPU, it can be mmaped into the user VMA with a
> cacheable mapping and used like a regular RAM. The device memory is
> not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
> 
> There is also a requirement of a minimum reserved 1G uncached region
> (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1].
> This is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets
> the Qemu VMA page properties (pgprot) as NORMAL_NC.
> 
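
To make the cached/uncached split concrete, a simplified sketch of the
mapping step (illustrative only; the function name is hypothetical and
the real logic lives in the new nvgrace-gpu variant driver):

	/* Map resmem write-combined (NORMAL_NC via the VMA pgprot) and
	 * usemem with normal cacheable attributes. */
	static int nvgrace_map_region(struct vm_area_struct *vma,
				      unsigned long pfn, bool is_resmem)
	{
		if (is_resmem)
			vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

		return remap_pfn_range(vma, vma->vm_start, pfn,
				       vma->vm_end - vma->vm_start,
				       vma->vm_page_prot);
	}
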
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
> 
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> The decision for a different approach was taken in view of the fact that
> it would necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring grace hopper support.
> 
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising of region 2 and 3). The cached device memory aperture is
> assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
> 
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 reads
> and drops writes on such accesses. So there is no need to support the
> disablement/enablement of BAR through PCI_COMMAND config space register.
> 
> The memory layout on the host looks like the following:
>devmem (memlength)
> |--|
> |-cached|--NC--|
> |   |
> usemem.memphys  resmem.memphys
> 
> PCI BARs need to be aligned to the power-of-2, but the actual memory on the
> device may not. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
> 
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and
> VFIO driver. The VM device driver makes use of the same value for its
> calculation to determine USEMEM size.
> 
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [3] and [4], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
> 
> If the device memory properties are not present, the driver registers the
> vfio-pci-core function pointers. Since there are no ACPI memory properties
> generated for the VM, the 

Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-22 Thread Huang, Kai




On 23/02/2024 6:09 am, Haitao Huang wrote:

On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai  wrote:




-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
-	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+	for (;;) {
+		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+					PAGE_SIZE))
+			break;
+
+		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+			return -ENOMEM;
+
+		if (signal_pending(current))
+			return -ERESTARTSYS;
+
+		if (!reclaim) {
+			queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+			return -EBUSY;
+		}
+
+		if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+			/* All pages were too young to reclaim, try again a little later */
+			schedule();
+	}
+
+	return 0;
 }



Seems this code change is 90% similar to the existing code in the
sgx_alloc_epc_page():

...
	for ( ; ; ) {
		page = __sgx_alloc_epc_page();
		if (!IS_ERR(page)) {
			page->owner = owner;
			break;
		}

		if (list_empty(&sgx_active_page_list))
			return ERR_PTR(-ENOMEM);

		if (!reclaim) {
			page = ERR_PTR(-EBUSY);
			break;
		}

		if (signal_pending(current)) {
			page = ERR_PTR(-ERESTARTSYS);
			break;
		}

		sgx_reclaim_pages();
		cond_resched();
	}
...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: when you either failed to allocate one
page or failed to charge one page, you try to reclaim EPC page(s) from
the current EPC cgroup, either directly or indirectly.

No?


Only these lines are the same:

	if (!reclaim) {
		page = ERR_PTR(-EBUSY);
		break;
	}

	if (signal_pending(current)) {
		page = ERR_PTR(-ERESTARTSYS);
		break;
	}

In sgx_alloc_epc_page() we do global reclamation, but here we do
per-cgroup reclamation.


But why?  If we failed to allocate, shouldn't we try to reclaim from the 
_current_ EPC cgroup instead of global?  E.g., I thought one enclave in 
one EPC cgroup requesting an insane amount of EPC shouldn't impact enclaves 
inside other cgroups?


That's why the logic of the other lines is different, though they look
similar due to the similar function names. For the global reclamation we
need to consider the case in which cgroups are not enabled. Similarly,
list_empty(&sgx_active_page_list) would have to be changed to check the
root cgroup if cgroups are enabled, otherwise check the global LRU. The
(!reclaim) case is also different.


W/o getting clear on my above question, so far I am not convinced why
such difference cannot be hidden inside wrapper function(s).


So I don't see an obviously good way to abstract those to get meaningful
savings.

Thanks
Haitao




[PATCH v2 4/4] tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The TRACE_EVENT() macros have a dependency when a __string() field is NULL:
they save "(null)" as the string. This string is also used by
__assign_str(). It's better to create a single macro than to open code a
string literal that will not be caught by the compiler if there is an
unfortunate typo.

Suggested-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/trace_events.h |  3 +++
 include/trace/events/sunrpc.h| 12 ++--
 include/trace/stages/stage5_get_offsets.h|  4 ++--
 include/trace/stages/stage6_event_callback.h |  8 
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index d68ff9b1247f..b9d88aced7e1 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -17,6 +17,9 @@ struct dentry;
 struct bpf_prog;
 union bpf_attr;
 
+/* Used for event string fields when they are NULL */
+#define EVENT_NULL_STR "(null)"
+
 const char *trace_print_flags_seq(struct trace_seq *p, const char *delim,
  unsigned long flags,
  const struct trace_print_flags *flag_array);
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index cdd3a45e6003..ce6a85b82afa 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1327,18 +1327,18 @@ TRACE_EVENT(xs_stream_read_data,
__field(ssize_t, err)
__field(size_t, total)
__string(addr, xprt ? xprt->address_strings[RPC_DISPLAY_ADDR] :
-   "(null)")
+   EVENT_NULL_STR)
__string(port, xprt ? xprt->address_strings[RPC_DISPLAY_PORT] :
-   "(null)")
+   EVENT_NULL_STR)
),
 
TP_fast_assign(
__entry->err = err;
__entry->total = total;
__assign_str(addr, xprt ?
-   xprt->address_strings[RPC_DISPLAY_ADDR] : "(null)");
+   xprt->address_strings[RPC_DISPLAY_ADDR] : EVENT_NULL_STR);
__assign_str(port, xprt ?
-   xprt->address_strings[RPC_DISPLAY_PORT] : "(null)");
+   xprt->address_strings[RPC_DISPLAY_PORT] : EVENT_NULL_STR);
),
 
TP_printk("peer=[%s]:%s err=%zd total=%zu", __get_str(addr),
@@ -1783,7 +1783,7 @@ TRACE_EVENT(svc_process,
__string(service, name)
__string(procedure, svc_proc_name(rqst))
__string(addr, rqst->rq_xprt ?
-rqst->rq_xprt->xpt_remotebuf : "(null)")
+rqst->rq_xprt->xpt_remotebuf : EVENT_NULL_STR)
),
 
TP_fast_assign(
@@ -1793,7 +1793,7 @@ TRACE_EVENT(svc_process,
__assign_str(service, name);
__assign_str(procedure, svc_proc_name(rqst));
__assign_str(addr, rqst->rq_xprt ?
-rqst->rq_xprt->xpt_remotebuf : "(null)");
+rqst->rq_xprt->xpt_remotebuf : EVENT_NULL_STR);
),
 
TP_printk("addr=%s xid=0x%08x service=%s vers=%u proc=%s",
diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h
index 20b801ed3fd4..e6b96608f452 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -47,7 +47,7 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,		\
-   strlen((const char *)(src) ? : "(null)") + 1)   \
+   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
__data_offsets->item##_ptr_ = src;
 
 #undef __string_len
@@ -70,7 +70,7 @@
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item,	\
-   strlen((const char *)(src) ? : "(null)") + 1)   \
+   strlen((const char *)(src) ? : EVENT_NULL_STR) + 1) \
__data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index 38732855eadb..2bfd49713b42 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,14 +32,14 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : "(null)",  \
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : EVENT_NULL_STR, \
   __get_dynamic_array_len(dst))
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)				\
do {\

[PATCH v2 3/4] tracing: Use ? : shortcut in trace macros

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Instead of having:

  #define __assign_str(dst, src)\
memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
__data_offsets.dst##_ptr_ : "(null)",   \
__get_dynamic_array_len(dst))

Use the ? : shortcut and compact it down to:

  #define __assign_str(dst, src)\
memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : "(null)",  \
   __get_dynamic_array_len(dst))

Suggested-by: Mathieu Desnoyers 
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage5_get_offsets.h|  4 ++--
 include/trace/stages/stage6_event_callback.h | 14 ++
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h
index 45f8151cf622..20b801ed3fd4 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -47,7 +47,7 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,		\
-   strlen((src) ? (const char *)(src) : "(null)") + 1) \
+   strlen((const char *)(src) ? : "(null)") + 1)   \
__data_offsets->item##_ptr_ = src;
 
 #undef __string_len
@@ -70,7 +70,7 @@
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item,	\
-   strlen((src) ? (const char *)(src) : "(null)") + 1) \
+   strlen((const char *)(src) ? : "(null)") + 1)   \
__data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index c0e5d097324e..38732855eadb 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,15 +32,14 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)",\
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ? : "(null)",  \
   __get_dynamic_array_len(dst))
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)				\
do {\
-   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)", len);  \
+   memcpy(__get_str(dst),  \
+  __data_offsets.dst##_ptr_ ? : "(null)", len);\
__get_str(dst)[len] = '\0'; \
} while(0)
 
@@ -95,15 +94,14 @@
 
 #undef __assign_rel_str
 #define __assign_rel_str(dst, src) \
-   memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)",\
+   memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ? : "(null)", \
   __get_rel_dynamic_array_len(dst))
 
 #undef __assign_rel_str_len
 #define __assign_rel_str_len(dst, src, len)\
do {\
-   memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)", len);  \
+   memcpy(__get_rel_str(dst),  \
+  __data_offsets.dst##_ptr_ ? : "(null)", len);\
__get_rel_str(dst)[len] = '\0'; \
} while (0)
 
-- 
2.43.0





[PATCH v2 2/4] tracing: Do not calculate strlen() twice for __string() fields

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The TRACE_EVENT() macro handles dynamic strings by having:

  TP_PROTO(struct some_struct *s),
  TP_ARGS(s),
  TP_STRUCT__entry(
__string(my_string, s->string)
 ),
 TP_fast_assign(
__assign_str(my_string, s->string);
 )
 TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
s->string is done twice. Once at the __string() and again in the
__assign_str().

The length of the string is calculated via strlen(), not once, but
twice: once during the __string() macro and again in __assign_str(). But
the length is actually already recorded in the data location, and there's
no reason to call strlen() again.

Just use the saved length that was saved in the __string() code for the
__assign_str() code.

Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index b3e2f321e787..c0e5d097324e 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,8 +32,9 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)")
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)",\
+  __get_dynamic_array_len(dst))
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)				\
@@ -94,8 +95,9 @@
 
 #undef __assign_rel_str
 #define __assign_rel_str(dst, src) \
-   strcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)")
+   memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)",\
+  __get_rel_dynamic_array_len(dst))
 
 #undef __assign_rel_str_len
 #define __assign_rel_str_len(dst, src, len)\
-- 
2.43.0





[PATCH v2 1/4] tracing: Rework __assign_str() and __string() to not duplicate getting the string

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The TRACE_EVENT() macro handles dynamic strings by having:

  TP_PROTO(struct some_struct *s),
  TP_ARGS(s),
  TP_STRUCT__entry(
__string(my_string, s->string)
 ),
 TP_fast_assign(
__assign_str(my_string, s->string);
 )
 TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
s->string is done twice. Once at the __string() and again in the
__assign_str().

But the __string() uses dynamic_array() which has a helper structure that
is created holding the offsets and length of the string fields. Instead of
finding the string twice, just save it off in another field from that
helper structure, and have __assign_str() use that instead.

Note, this also means that the second parameter of __assign_str() isn't
even used anymore, and may be removed in the future.
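
To illustrate with a hypothetical event definition: after this change,
__assign_str() pulls the pointer that __string() stashed in the
data-offsets helper, so its second argument is ignored:

	TRACE_EVENT(foo_bar,
		TP_PROTO(const char *name),
		TP_ARGS(name),
		TP_STRUCT__entry(
			__string(name, name)	/* saves the pointer in __data_offsets */
		),
		TP_fast_assign(
			__assign_str(name, name);	/* src here is no longer used */
		),
		TP_printk("%s", __get_str(name))
	);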

Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage2_data_offsets.h   |  4 ++--
 include/trace/stages/stage5_get_offsets.h| 15 ++-
 include/trace/stages/stage6_event_callback.h | 12 
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/trace/stages/stage2_data_offsets.h b/include/trace/stages/stage2_data_offsets.h
index 469b6a64293d..8b0cff06d346 100644
--- a/include/trace/stages/stage2_data_offsets.h
+++ b/include/trace/stages/stage2_data_offsets.h
@@ -24,7 +24,7 @@
 #define __array(type, item, len)
 
 #undef __dynamic_array
-#define __dynamic_array(type, item, len)   u32 item;
+#define __dynamic_array(type, item, len)   u32 item; const void *item##_ptr_;
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
@@ -45,7 +45,7 @@
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
 #undef __rel_dynamic_array
-#define __rel_dynamic_array(type, item, len)   u32 item;
+#define __rel_dynamic_array(type, item, len)   u32 item; const void *item##_ptr_;
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item, -1)
diff --git a/include/trace/stages/stage5_get_offsets.h b/include/trace/stages/stage5_get_offsets.h
index e30a13be46ba..45f8151cf622 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -47,10 +47,12 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,		\
-   strlen((src) ? (const char *)(src) : "(null)") + 1)
+   strlen((src) ? (const char *)(src) : "(null)") + 1) \
+   __data_offsets->item##_ptr_ = src;
 
 #undef __string_len
-#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)
+#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)\
+   __data_offsets->item##_ptr_ = src;
 
 #undef __vstring
 #define __vstring(item, fmt, ap) __dynamic_array(char, item,   \
@@ -67,11 +69,14 @@
__data_size += __item_length;
 
 #undef __rel_string
-#define __rel_string(item, src) __rel_dynamic_array(char, item,	\
-	strlen((src) ? (const char *)(src) : "(null)") + 1)
+#define __rel_string(item, src) __rel_dynamic_array(char, item,	\
+   strlen((src) ? (const char *)(src) : "(null)") + 1) \
+   __data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
-#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)
+#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)\
+   __data_offsets->item##_ptr_ = src;
+
 /*
  * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
  * num_possible_cpus().
diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index 919b1a4da980..b3e2f321e787 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,12 +32,14 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
+   strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)")
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)
\
do {\
-   memcpy(__get_str(dst), (src), (len));   \
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)", len);  \
__get_str(dst)[len] = '\0'; \
} while(0)
 
@@ -92,12 +94,14 @@
 
 #undef __assign_rel_str
 #define __assign_rel_str(dst, src) \
-   strcpy(__get_rel_str(dst), (src) ? (const char 

[PATCH v2 0/4] tracing: Optimize __string()/__assign_str() processing

2024-02-22 Thread Steven Rostedt
The TRACE_EVENT() macro handles dynamic strings by having:

	TP_PROTO(struct some_struct *s),
	TP_ARGS(s),
	TP_STRUCT__entry(
		__string(my_string, s->string)
	),
	TP_fast_assign(
		__assign_str(my_string, s->string);
	)
	TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
s->string is done twice. Once at the __string() and again in the
__assign_str().

The length of the string is calculated via strlen() not once but twice:
once during the __string() macro and again, implicitly, via the strcpy() in
__assign_str(). The length is already recorded in the data location from
__string(), so there's no reason for __assign_str() to compute it again.

The __string() macro uses __dynamic_array(), which creates a helper
structure holding the offsets and lengths of the string fields. Instead of
finding the string twice, just save it off in another field in that helper
structure, and have __assign_str() use that instead.

Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240222195111.139824...@goodmis.org/

- Both the __dynamic_array() and __rel_dynamic_array() macros end with
  a semicolon and are used by __string() and __rel_string() respectively.
  __string() had no extra semicolon after __dynamic_array(), but
  __rel_string() had an unneeded one after __rel_dynamic_array(). Remove
  the unnecessary semicolon to keep the two consistent.

- Fixed __rel_string_len(), which was incorrectly using __get_str()
  instead of __get_rel_str().

- Added two cleanup patches. One to use the ?: shortcut and the other
  to use the new macro EVENT_NULL_STR instead of open coding "(null)"
  as the string must be consistent between __string() and __assign_str().

Steven Rostedt (Google) (4):
  tracing: Rework __assign_str() and __string() to not duplicate getting 
the string
  tracing: Do not calculate strlen() twice for __string() fields
  tracing: Use ? : shortcut in trace macros
  tracing: Use EVENT_NULL_STR macro instead of open coding "(null)"


 include/linux/trace_events.h |  3 +++
 include/trace/events/sunrpc.h| 12 ++--
 include/trace/stages/stage2_data_offsets.h   |  4 ++--
 include/trace/stages/stage5_get_offsets.h| 15 ++-
 include/trace/stages/stage6_event_callback.h | 12 
 5 files changed, 29 insertions(+), 17 deletions(-)



Re: [PATCH] vduse: implement DMA sync callbacks

2024-02-22 Thread Michael S. Tsirkin
On Tue, Feb 20, 2024 at 01:01:36AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 19, 2024 at 06:06:06PM +0100, Maxime Coquelin wrote:
> > Since commit 295525e29a5b ("virtio_net: merge dma
> > operations when filling mergeable buffers"), VDUSE device
> > require support for DMA's .sync_single_for_cpu() operation
> > as the memory is non-coherent between the device and CPU
> > because of the use of a bounce buffer.
> > 
> > This patch implements both .sync_single_for_cpu() and
> > sync_single_for_device() callbacks, and also skip bounce
> > buffer copies during DMA map and unmap operations if the
> > DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra
> > copies of the same buffer.
> 
> vduse really needs to get out of implementing fake DMA operations for
> something that is not DMA.

In a sense ... but on the other hand, the "fake DMA" metaphor seems to
work surprisingly well, like in this instance - the internal bounce buffer
looks a bit like non-coherent DMA.  A way to make this all prettier would,
I guess, be to actually wrap all of DMA with virtio wrappers which would
all do if () dma_... else vduse_...; or something to this end.  A lot of
work for sure, and is it really worth it? If the only crazy driver is
vduse, I'd maybe rather keep the crazy hacks local there ...
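
A very rough sketch of that wrapper idea, purely to make the shape concrete -
virtio_is_vduse() and vduse_sync_single_for_cpu() are assumptions here, not
existing APIs:

	/* hypothetical dispatch helper; not proposed code */
	static inline void
	virtio_map_sync_single_for_cpu(struct virtio_device *vdev,
				       dma_addr_t addr, size_t size,
				       enum dma_data_direction dir)
	{
		if (virtio_is_vduse(vdev))	/* assumed helper */
			vduse_sync_single_for_cpu(vdev, addr, size, dir);
		else
			dma_sync_single_for_cpu(vdev->dev.parent, addr,
						size, dir);
	}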

-- 
MST




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Haitao Huang

On Wed, 21 Feb 2024 04:48:58 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:23 -0600, Haitao Huang wrote:

Hi Kai,
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai   
wrote:

[...]
>
> So you introduced the work/workqueue here but there's no place which
> actually
> queues the work.  IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC
> cgroup w/o workqueue and introduce the work/workqueue in later patch.
>
> Makes sense?
>

Starting in v7, I was trying to split the big patch, #10 in v6, as you and
others suggested. My thought process was to put the infrastructure needed
for per-cgroup reclaim in front, then turn on per-cgroup reclaim in
[v9 13/15] at the end.


That's reasonable for sure.



Thanks for the confirmation :-)



Before that, all reclaimables are tracked in the global LRU, so really
there is no "reclaim certain pages from a given EPC cgroup w/o workqueue"
or reclaim through the workqueue before that point, as suggested in #2.
This patch puts down the implementation for both flows, but neither is used
yet, as stated in the commit message.


I know it's not used yet.  The point is how to split patches to make them
more self-contained and easy to review.


I would think this patch is already self-contained in that everything in it
implements cgroup reclamation building blocks utilized later. But I'll try
to follow your suggestions below to split further (I would prefer not to
merge patches back in general unless there are strong reasons).




For #2, sorry for not being explicit -- I meant it seems more reasonable to
split in this way:

Patch 1)
  a). change to sgx_reclaim_pages();


I'll still prefer this to be a separate patch. It is self-contained IMHO.
We were splitting the original patch because it was too big. I don't want  
to merge back unless there is a strong reason.



  b). introduce sgx_epc_cgroup_reclaim_pages();


Ok.

  c). introduce sgx_epc_cgroup_reclaim_work_func() (use a better name), 
 which just takes an EPC cgroup as input w/o involving any  
work/workqueue.


This is for the workqueue use only. So I think it'd better be placed with
patch #2 below?




These functions are all related to how to implement reclaiming pages from a
given EPC cgroup, and they are logically related in terms of implementation,
thus they are easier to review together.



This is pretty much the current patch + sgx_reclaim_pages() - workqueue.

Then you just need to justify the design/implementation in  
changelog/comments.




How about just doing b) in patch #1, and stating that the new function is
the building block and will be used for both direct and indirect
reclamation?



Patch 2)
  - Introduce work/workqueue, and implement the logic to queue the work.

Now we all know there's a function to reclaim pages for a given EPC cgroup;
then we can focus on when that is called, either directly or indirectly.



The try_charge() function will do both, actually.
For indirect, it queues the work to the wq. For direct, it just calls
sgx_epc_cgroup_reclaim_pages().
That part is already in separate (I think self-contained) patch [v9,  
10/15].


So for this patch, I'll add sgx_epc_cgroup_reclaim_work_func() and
introduce the work/workqueue so that later work can be queued?




#1 would force me to go back and merge the patches again.


I don't think so.  I am not asking to put all things together, but only
asking to split in a better way (as I see it).



Okay.

You mentioned some function is "Scheduled by sgx_epc_cgroup_try_charge() to
reclaim pages", but I am not seeing any code doing that in this patch.  This
needs fixing, either by moving relevant code here, or removing these
not-done-yet comments.



Yes. The comments will be fixed.


For instance (I am just giving an example), if after review we find the
queue_work() shouldn't be done in try_charge(), you will need to go back to
this patch and remove these comments.

That's not the best way.  Each patch needs to be self-contained.



Sorry, I feel kind of lost on this whole thing by now. It seems so random
to me. Are there hard rules on this?


See above.  I am only offering my opinion on how to split patches in a
better way.




To be honest, the part I find most confusing is this self-contained-ness.
It seems to depend on how you look at things.




I was hoping these statements would help reviewers on the flow of the
patches.

At the end of [v9 04/15]:

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

At the end of [v9 06/15]:

Next patches will first get the cgroup 

Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-22 Thread Michael S. Tsirkin
On Fri, Feb 09, 2024 at 01:57:25PM +0100, Paolo Abeni wrote:
> On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > > 
> > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > > 
> > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > >  wrote:
> > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > [...]
> > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct 
> > > > > > > > xdp_buff *xdp)
> > > > > > > >   }
> > > > > > > >   }
> > > > > > > > 
> > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > > *virtnet_xdp,
> > > > > > > > +  struct net_device *dev,
> > > > > > > > +  struct 
> > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > +{
> > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > + }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > 
> > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > virtnet_xdp_buff,
> > > > > > > with the purpose of delaying extracting this, until and only if 
> > > > > > > XDP
> > > > > > > bpf_prog calls the kfunc?
> > > > > > > 
> > > > > > 
> > > > > > That seems to be the way v1 works,
> > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > . But it was pointed out that the inline header may be overwritten 
> > > > > > by
> > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > > 
> > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > contents/buffer?
> > > > > 
> > > > 
> > > > Normally, an XDP program accesses only the packet data. However,
> > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > program access to the virtio header located immediately before the
> > > 
> > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > xdp->data_hard_start:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > > 
> > > and virtio net set such field after the virtio_net_hdr:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > > 
> > > I don't see how the virtio hdr could be touched? Possibly even more
> > > important: if such thing is possible, I think is should be somewhat
> > > denied (for the same reason an H/W nic should prevent XDP from
> > > modifying its own buffer descriptor).
> > 
> > Thank you for highlighting this concern. The header layout differs
> > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > would be as depicted in the diagram below,
> > 
> >   buf
> >    |
> >    v
> > +--------------+---------------+---------+
> > | xdp headroom | virtio header | packet  |
> > | (256 bytes)  | (20 bytes)    | content |
> > +--------------+---------------+---------+
> > ^                              ^
> > |                              |
> > data_hard_start                data
> >                                data_meta
> > 
> > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > towards 'data_hard_start', it would point to the inline header, thus
> > potentially allowing the XDP program to access the inline header.
> 
> I see. That layout was completely unexpected to me.
> 
> AFAICS the virtio_net driver tries to avoid accessing/using the
> virtio_net_hdr after the XDP program execution, so nothing tragic
> should happen.
> 
> @Michael, @Jason, I guess the above is like that by design? Isn't it a
> bit fragile?
> 
> Thanks!
> 
> Paolo

I agree it is all a bit fragile, not sure how to do better without extra
copies though ...
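
To make the layout discussion above concrete, here is a hedged sketch of how
an XDP program could reach the inline header via bpf_xdp_adjust_meta(); the
20-byte header size is taken from the diagram above and is an assumption, not
a stable ABI:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("xdp")
	int peek_virtio_hdr(struct xdp_md *ctx)
	{
		void *meta, *data;

		/* grow the metadata area backwards over the inline header */
		if (bpf_xdp_adjust_meta(ctx, -20))
			return XDP_PASS;

		meta = (void *)(long)ctx->data_meta;
		data = (void *)(long)ctx->data;

		/* bounds check required by the verifier */
		if (meta + 20 > data)
			return XDP_PASS;

		/* meta now aliases the virtio_net_hdr bytes just before the packet */
		return XDP_PASS;
	}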

-- 
MST




Re: [PATCH 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-02-22 Thread Michael S. Tsirkin
On Mon, Jan 15, 2024 at 05:32:25PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 15, 2024 at 09:48:40PM +0200, Andrew Melnychenko wrote:
> > When Qemu is launched with vhost but without tap vnet_hdr,
> > vhost tries to copy the vnet_hdr from the socket iter with size 0
> > to a page that may contain some trash.
> > That trash can be interpreted as unpredictable values for
> > the vnet_hdr.
> > That leads to dropping some packets and, in some cases, to
> > stalling the vhost routine when vhost_net tries to process
> > packets and fails in a loop.
> > 
> > Qemu options:
> >   -netdev tap,vhost=on,vnet_hdr=off,...
> > 
> > Signed-off-by: Andrew Melnychenko 
> > ---
> >  drivers/vhost/net.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f2ed7167c848..57411ac2d08b 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct 
> > vhost_net_virtqueue *nvq,
> > hdr = buf;
> > gso = &hdr->gso;
> >  
> > +   if (!sock_hlen)
> > +   memset(buf, 0, pad);
> > +
> > if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> > vhost16_to_cpu(vq, gso->csum_start) +
> > vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> 
> 
> Hmm need to analyse it to make sure there are no cases where we leak
> some data to guest here in case where sock_hlen is set ...


Could you post this analysis pls?

> > -- 
> > 2.43.0
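
For clarity, a hedged sketch of the buffer layout involved (offsets
simplified relative to the real vhost_net_build_xdp()):

	/*
	 *   buf                            buf + pad
	 *    |<----------- pad ----------->|
	 *    | vnet/tun header area        | packet data copied here
	 *
	 * With vnet_hdr=off, sock_hlen == 0, so nothing is copied into the
	 * header area and the gso fields are read from stale page contents;
	 * zeroing the pad makes them predictable:
	 */
	if (!sock_hlen)
		memset(buf, 0, pad);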




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.

As I predicted, if you want attention from sched guys you need to
send a patch in their area.

-- 
MST




Re: [PATCH] virtiofs: limit the length of ITER_KVEC dio by max_nopage_rw

2024-02-22 Thread Michael S. Tsirkin
On Wed, Jan 03, 2024 at 06:59:29PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> When trying to insert a 10MB kernel module kept in a virtiofs with cache
> disabled, the following warning was reported:
> 
>   [ cut here ]
>   WARNING: CPU: 2 PID: 439 at mm/page_alloc.c:4544 ..
>   Modules linked in:
>   CPU: 2 PID: 439 Comm: insmod Not tainted 6.7.0-rc7+ #33
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ..
>   RIP: 0010:__alloc_pages+0x2c4/0x360
>   ..
>   Call Trace:
>
>? __warn+0x8f/0x150
>? __alloc_pages+0x2c4/0x360
>__kmalloc_large_node+0x86/0x160
>__kmalloc+0xcd/0x140
>virtio_fs_enqueue_req+0x240/0x6d0
>virtio_fs_wake_pending_and_unlock+0x7f/0x190
>queue_request_and_unlock+0x58/0x70
>fuse_simple_request+0x18b/0x2e0
>fuse_direct_io+0x58a/0x850
>fuse_file_read_iter+0xdb/0x130
>__kernel_read+0xf3/0x260
>kernel_read+0x45/0x60
>kernel_read_file+0x1ad/0x2b0
>init_module_from_file+0x6a/0xe0
>idempotent_init_module+0x179/0x230
>__x64_sys_finit_module+0x5d/0xb0
>do_syscall_64+0x36/0xb0
>entry_SYSCALL_64_after_hwframe+0x6e/0x76
>..
>
>   ---[ end trace  ]---
> 
> The warning happened as follow. In copy_args_to_argbuf(), virtiofs uses
> kmalloc-ed memory as bound buffer for fuse args, but
> fuse_get_user_pages() only limits the length of fuse arg by max_read or
> max_write for IOV_KVEC io (e.g., kernel_read_file from finit_module()).
> For virtiofs, max_read is UINT_MAX, so a big read request which is about
> 10MB is passed to copy_args_to_argbuf(), kmalloc() is called in turn
> with len=10MB, and triggers the warning in __alloc_pages():
> WARN_ON_ONCE_GFP(order > MAX_ORDER, gfp)).
> 
> A feasible solution is to limit the value of max_read for virtiofs, so
> the length passed to kmalloc() will be limited. However, that would also
> affect the max read size for ITER_IOVEC io, and the value of max_write
> would need a similar limit. So instead of limiting the values of max_read
> and max_write, introduce max_nopage_rw to cap both max_read and max_write
> when the fuse dio read/write request is initiated from the kernel.
> 
> Considering that fuse read/write request from kernel is uncommon and to
> decrease the demand for large contiguous pages, set max_nopage_rw as
> 256KB instead of KMALLOC_MAX_SIZE - 4096 or similar.
> 
> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem")
> Signed-off-by: Hou Tao 


So what should I do with this patch? It includes fuse changes
but of course I can merge too if no one wants to bother either way...


> ---
>  fs/fuse/file.c  | 12 +++-
>  fs/fuse/fuse_i.h|  3 +++
>  fs/fuse/inode.c |  1 +
>  fs/fuse/virtio_fs.c |  6 ++
>  4 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a660f1f21540..f1beb7c0b782 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1422,6 +1422,16 @@ static int fuse_get_user_pages(struct fuse_args_pages 
> *ap, struct iov_iter *ii,
>   return ret < 0 ? ret : 0;
>  }
>  
> +static size_t fuse_max_dio_rw_size(const struct fuse_conn *fc,
> +const struct iov_iter *iter, int write)
> +{
> + unsigned int nmax = write ? fc->max_write : fc->max_read;
> +
> + if (iov_iter_is_kvec(iter))
> + nmax = min(nmax, fc->max_nopage_rw);
> + return nmax;
> +}
> +
>  ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  loff_t *ppos, int flags)
>  {
> @@ -1432,7 +1442,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct 
> iov_iter *iter,
>   struct inode *inode = mapping->host;
>   struct fuse_file *ff = file->private_data;
>   struct fuse_conn *fc = ff->fm->fc;
> - size_t nmax = write ? fc->max_write : fc->max_read;
> + size_t nmax = fuse_max_dio_rw_size(fc, iter, write);
>   loff_t pos = *ppos;
>   size_t count = iov_iter_count(iter);
>   pgoff_t idx_from = pos >> PAGE_SHIFT;
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 1df83eebda92..fc753cd34211 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -594,6 +594,9 @@ struct fuse_conn {
>   /** Constrain ->max_pages to this value during feature negotiation */
>   unsigned int max_pages_limit;
>  
> + /** Maximum read/write size when there is no page in request */
> + unsigned int max_nopage_rw;
> +
>   /** Input queue */
>   struct fuse_iqueue iq;
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 2a6d44f91729..4cbbcb4a4b71 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -923,6 +923,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
> fuse_mount *fm,
>   fc->user_ns = get_user_ns(user_ns);
>   fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
>   fc->max_pages_limit = FUSE_MAX_MAX_PAGES;
> + fc->max_nopage_rw = UINT_MAX;
>  
> + INIT_LIST_HEAD(&fc->mounts);
>   

[PATCH 2/2] tracing: Do not calculate strlen() twice for __string() fields

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The TRACE_EVENT() macro handles dynamic strings by having:

	TP_PROTO(struct some_struct *s),
	TP_ARGS(s),
	TP_STRUCT__entry(
		__string(my_string, s->string)
	),
	TP_fast_assign(
		__assign_str(my_string, s->string);
	)
	TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
length of s->string is done twice. Once at the __string() and again in the
__assign_str().

The length of the string is calculated via strlen() not once but twice:
once during the __string() macro and again, implicitly, via the strcpy() in
__assign_str(). But the length is already recorded in the data location, so
there's no reason to compute it again.

Just use the saved length that was saved in the __string() code for the
__assign_str() code.
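
For reference, a hedged sketch of where that saved length lives: a dynamic
field stores a single u32 whose low 16 bits hold the data offset and whose
high 16 bits hold the length, so the strlen() result computed at __string()
time can be read back later. Simplified (the real macros differ in naming):

	/* pack/unpack of the __data_loc-style word, simplified */
	#define SAMPLE_DATA_LOC(offset, len)	(((len) << 16) | ((offset) & 0xffff))
	#define SAMPLE_DATA_LEN(word)		(((word) >> 16) & 0xffff)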

Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index e09e3d019dcc..d389725937e9 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,8 +32,9 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)")
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)",\
+  __get_dynamic_array_len(dst))
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)
\
@@ -94,8 +95,9 @@
 
 #undef __assign_rel_str
 #define __assign_rel_str(dst, src) \
-   strcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
-  __data_offsets.dst##_ptr_ : "(null)")
+   memcpy(__get_rel_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)",\
+  __get_rel_dynamic_array_len(dst))
 
 #undef __assign_rel_str_len
 #define __assign_rel_str_len(dst, src, len)\
-- 
2.43.0





[PATCH 0/2] tracing: Optimize __string()/__assign_str() processing

2024-02-22 Thread Steven Rostedt


The TRACE_EVENT() macro handles dynamic strings by having:

	TP_PROTO(struct some_struct *s),
	TP_ARGS(s),
	TP_STRUCT__entry(
		__string(my_string, s->string)
	),
	TP_fast_assign(
		__assign_str(my_string, s->string);
	)
	TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
s->string is done twice. Once at the __string() and again in the
__assign_str().

The length of the string is calculated via strlen() not once but twice:
once during the __string() macro and again, implicitly, via the strcpy() in
__assign_str(). The length is already recorded in the data location from
__string(), so there's no reason for __assign_str() to compute it again.

The __string() macro uses __dynamic_array(), which creates a helper
structure holding the offsets and lengths of the string fields. Instead of
finding the string twice, just save it off in another field in that helper
structure, and have __assign_str() use that instead.

Steven Rostedt (Google) (2):
  tracing: Rework __assign_str() and __string() to not duplicate getting 
the string
  tracing: Do not calculate strlen() twice for __string() fields


 include/trace/stages/stage2_data_offsets.h   |  4 ++--
 include/trace/stages/stage5_get_offsets.h| 15 ++-
 include/trace/stages/stage6_event_callback.h | 14 ++
 3 files changed, 22 insertions(+), 11 deletions(-)



[PATCH 1/2] tracing: Rework __assign_str() and __string() to not duplicate getting the string

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The TRACE_EVENT() macro handles dynamic strings by having:

	TP_PROTO(struct some_struct *s),
	TP_ARGS(s),
	TP_STRUCT__entry(
		__string(my_string, s->string)
	),
	TP_fast_assign(
		__assign_str(my_string, s->string);
	)
	TP_printk("%s", __get_str(my_string))

There's even some code that may call a function helper to find the
s->string value. The problem with the above is that the work to get the
s->string is done twice. Once at the __string() and again in the
__assign_str().

The __string() macro uses __dynamic_array(), which creates a helper
structure holding the offsets and lengths of the string fields. Instead of
finding the string twice, just save it off in another field in that helper
structure, and have __assign_str() use that instead.

Note, this also means that the second parameter of __assign_str() isn't
even used anymore, and may be removed in the future.

Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage2_data_offsets.h   |  4 ++--
 include/trace/stages/stage5_get_offsets.h| 15 ++-
 include/trace/stages/stage6_event_callback.h | 12 
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/trace/stages/stage2_data_offsets.h 
b/include/trace/stages/stage2_data_offsets.h
index 469b6a64293d..8b0cff06d346 100644
--- a/include/trace/stages/stage2_data_offsets.h
+++ b/include/trace/stages/stage2_data_offsets.h
@@ -24,7 +24,7 @@
 #define __array(type, item, len)
 
 #undef __dynamic_array
-#define __dynamic_array(type, item, len)   u32 item;
+#define __dynamic_array(type, item, len)   u32 item; const void *item##_ptr_;
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item, -1)
@@ -45,7 +45,7 @@
 #define __sockaddr(field, len) __dynamic_array(u8, field, len)
 
 #undef __rel_dynamic_array
-#define __rel_dynamic_array(type, item, len)   u32 item;
+#define __rel_dynamic_array(type, item, len)   u32 item; const void *item##_ptr_;
 
 #undef __rel_string
 #define __rel_string(item, src) __rel_dynamic_array(char, item, -1)
diff --git a/include/trace/stages/stage5_get_offsets.h 
b/include/trace/stages/stage5_get_offsets.h
index e30a13be46ba..c71f38e10419 100644
--- a/include/trace/stages/stage5_get_offsets.h
+++ b/include/trace/stages/stage5_get_offsets.h
@@ -47,10 +47,12 @@
 
 #undef __string
 #define __string(item, src) __dynamic_array(char, item,	\
-   strlen((src) ? (const char *)(src) : "(null)") + 1)
+   strlen((src) ? (const char *)(src) : "(null)") + 1) \
+   __data_offsets->item##_ptr_ = src;
 
 #undef __string_len
-#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)
+#define __string_len(item, src, len) __dynamic_array(char, item, (len) + 1)\
+   __data_offsets->item##_ptr_ = src;
 
 #undef __vstring
 #define __vstring(item, fmt, ap) __dynamic_array(char, item,   \
@@ -67,11 +69,14 @@
__data_size += __item_length;
 
 #undef __rel_string
-#define __rel_string(item, src) __rel_dynamic_array(char, item,	\
-   strlen((src) ? (const char *)(src) : "(null)") + 1)
+#define __rel_string(item, src) __rel_dynamic_array(char, item,	\
+   strlen((src) ? (const char *)(src) : "(null)") + 1); \
+   __data_offsets->item##_ptr_ = src;
 
 #undef __rel_string_len
-#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)
+#define __rel_string_len(item, src, len) __rel_dynamic_array(char, item, (len) + 1)\
+   __data_offsets->item##_ptr_ = src;
+
 /*
  * __bitmask_size_in_bytes_raw is the number of bytes needed to hold
  * num_possible_cpus().
diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 919b1a4da980..e09e3d019dcc 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -32,12 +32,14 @@
 
 #undef __assign_str
 #define __assign_str(dst, src) \
-   strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");
+   strcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)")
 
 #undef __assign_str_len
 #define __assign_str_len(dst, src, len)
\
do {\
-   memcpy(__get_str(dst), (src), (len));   \
+   memcpy(__get_str(dst), __data_offsets.dst##_ptr_ ?  \
+  __data_offsets.dst##_ptr_ : "(null)", len);  \
__get_str(dst)[len] = '\0'; \
} while(0)
 
@@ -92,12 +94,14 @@
 
 #undef __assign_rel_str
 #define __assign_rel_str(dst, src) \
-   strcpy(__get_rel_str(dst), (src) ? (const char 

Re: [PATCH v2] PM: runtime: add tracepoint for runtime_status changes

2024-02-22 Thread Rafael J. Wysocki
On Wed, Feb 21, 2024 at 8:49 PM Vilas Bhat  wrote:
>
> Existing runtime PM ftrace events (`rpm_suspend`, `rpm_resume`,
> `rpm_return_int`) offer limited visibility into the exact timing of device
> runtime power state transitions, particularly when asynchronous operations
> are involved. When the `rpm_suspend` or `rpm_resume` functions are invoked
> with the `RPM_ASYNC` flag, a return value of 0 (i.e., success) merely
> indicates that the device power state request has been queued, not that
> the device has yet transitioned.
>
> A new ftrace event, `rpm_status`, is introduced. This event directly logs
> the `power.runtime_status` value of a device whenever it changes providing
> granular tracking of runtime power state transitions regardless of
> synchronous or asynchronous `rpm_suspend` / `rpm_resume` usage.
>
> Signed-off-by: Vilas Bhat 
> ---
> V1 -> V2: Modified enum value definition as per reviewer comments.
> ---
>  drivers/base/power/runtime.c |  1 +
>  include/trace/events/rpm.h   | 42 
>  2 files changed, 43 insertions(+)
>
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index 05793c9fbb84..d10354847878 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -94,6 +94,7 @@ static void update_pm_runtime_accounting(struct device *dev)
>  static void __update_runtime_status(struct device *dev, enum rpm_status 
> status)
>  {
> update_pm_runtime_accounting(dev);
> +   trace_rpm_status(dev, status);
> dev->power.runtime_status = status;
>  }
>
> diff --git a/include/trace/events/rpm.h b/include/trace/events/rpm.h
> index 3c716214dab1..bd120e23ce12 100644
> --- a/include/trace/events/rpm.h
> +++ b/include/trace/events/rpm.h
> @@ -101,6 +101,48 @@ TRACE_EVENT(rpm_return_int,
> __entry->ret)
>  );
>
> +#define RPM_STATUS_STRINGS \
> +   EM(RPM_INVALID, "RPM_INVALID") \
> +   EM(RPM_ACTIVE, "RPM_ACTIVE") \
> +   EM(RPM_RESUMING, "RPM_RESUMING") \
> +   EM(RPM_SUSPENDED, "RPM_SUSPENDED") \
> +   EMe(RPM_SUSPENDING, "RPM_SUSPENDING")
> +
> +/* Enums require being exported to userspace, for user tool parsing. */
> +#undef EM
> +#undef EMe
> +#define EM(a, b)   TRACE_DEFINE_ENUM(a);
> +#define EMe(a, b)  TRACE_DEFINE_ENUM(a);
> +
> +RPM_STATUS_STRINGS
> +
> +/*
> + * Now redefine the EM() and EMe() macros to map the enums to the strings 
> that
> + * will be printed in the output.
> + */
> +#undef EM
> +#undef EMe
> +#define EM(a, b)   { a, b },
> +#define EMe(a, b)  { a, b }
> +
> +TRACE_EVENT(rpm_status,
> +   TP_PROTO(struct device *dev, enum rpm_status status),
> +   TP_ARGS(dev, status),
> +
> +   TP_STRUCT__entry(
> +   __string(name,  dev_name(dev))
> +   __field(int,status)
> +   ),
> +
> +   TP_fast_assign(
> +   __assign_str(name, dev_name(dev));
> +   __entry->status = status;
> +   ),
> +
> +   TP_printk("%s status=%s", __get_str(name),
> +   __print_symbolic(__entry->status, RPM_STATUS_STRINGS))
> +);
> +
>  #endif /* _TRACE_RUNTIME_POWER_H */
>
>  /* This part must be outside protection */
> --

Applied as 6.9 material, thanks!
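
For context, a minimal sketch of the asynchronous case the changelog
describes (illustrative only, not taken from the patch):

	/* async resume: a 0 return only means the request was queued */
	ret = pm_runtime_get(dev);

	/*
	 * dev->power.runtime_status may still be RPM_SUSPENDED here; the
	 * new rpm_status tracepoint fires later, from
	 * __update_runtime_status(), when the device actually moves to
	 * RPM_RESUMING and then RPM_ACTIVE.
	 */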



Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2024-02-22 Thread Michael S. Tsirkin
On Tue, Jul 25, 2023 at 11:03:11AM +0800, Jason Wang wrote:
> On Mon, Jul 24, 2023 at 3:18 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jul 24, 2023 at 02:52:49PM +0800, Jason Wang wrote:
> > > On Mon, Jul 24, 2023 at 2:46 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Fri, Jul 21, 2023 at 10:18:03PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > >
> > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Adding cond_resched() to the command waiting loop for a 
> > > > > > > > > > > > better
> > > > > > > > > > > > co-operation with the scheduler. This allows to give 
> > > > > > > > > > > > CPU a breath to
> > > > > > > > > > > > run other task(workqueue) instead of busy looping when 
> > > > > > > > > > > > preemption is
> > > > > > > > > > > > not allowed on a device whose CVQ might be slow.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > > > > >
> > > > > > > > > > > This still leaves hung processes, but at least it doesn't 
> > > > > > > > > > > pin the CPU any
> > > > > > > > > > > more.  Thanks.
> > > > > > > > > > > Reviewed-by: Shannon Nelson 
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'd like to see a full solution
> > > > > > > > > > 1- block until interrupt
> > > > > > > > >
> > > > > > > > > Would it make sense to also have a timeout?
> > > > > > > > > And when timeout expires, set FAILED bit in device status?
> > > > > > > >
> > > > > > > > virtio spec does not set any limits on the timing of vq
> > > > > > > > processing.
> > > > > > >
> > > > > > > Indeed, but I thought the driver could decide it is too long for 
> > > > > > > it.
> > > > > > >
> > > > > > > The issue is we keep waiting with rtnl locked, it can quickly 
> > > > > > > make the
> > > > > > > system unusable.
> > > > > >
> > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > locked indefinitely.
> > > > >
> > > > > From the tests I have done, I think it is. With OVS, a 
> > > > > reconfiguration is
> > > > > performed when the VDUSE device is added, and when a MLX5 device is
> > > > > in the same bridge, it ends up doing an ioctl() that tries to take the
> > > > > rtnl lock. In this configuration, it is not possible to kill OVS 
> > > > > because
> > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by 
> > > > > virtio-
> > > > > net.
> > > >
> > > > So for sure, we can queue up the work and process it later.
> > > > The somewhat tricky part is limiting the memory consumption.
> > >
> > > And it needs to sync with rtnl somehow, e.g device unregistering which
> > > seems not easy.
> > >
> > > Thanks
> >
> > since when does device unregister need to send cvq commands?
> 
> It doesn't do this now. But if we don't process the work under rtnl,
> we need to synchronize with device unregistering.
> 
> Thanks

But what's not easy about it?

> >
> > > >
> > > >
> > > > > >
> > > > > > > > > > 2- still handle surprise removal correctly by waking in 
> > > > > > > > > > that case
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > ---
> > > > > > > > > > > >  drivers/net/virtio_net.c | 4 +++-
> > > > > > > > > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/drivers/net/virtio_net.c 
> > > > > > > > > > > > b/drivers/net/virtio_net.c
> > > > > > > > > > > > index 9f3b1d6ac33d..e7533f29b219 100644
> > > > > > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > > > > > @@ -2314,8 +2314,10 @@ static bool 
> > > > > > > > > > > > virtnet_send_command(struct virtnet_info *vi, u8 class, 
> > > > > > > > > > > > u8 cmd,
> > > > > > > > > > > >  * into the hypervisor, so the request 
> > > > > > > > > > > > should be handled immediately.
> > > > > > > > > > > >  */
> > > > > > > > > > > > while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> > > > > > > > > > > > -  !virtqueue_is_broken(vi->cvq))
> > > > > > > > > > > > +  !virtqueue_is_broken(vi->cvq)) {
> > > > > > > > > > > > +   cond_resched();
> > > > > > > > > > > > cpu_relax();
> > > > > > > > > > > > +   }
> > > > > > > > > > > >
> > > > > > > > > > > > return vi->ctrl->status == VIRTIO_NET_OK;
> > > > > > > > > > > >  }
> > > > > > > > > > > > --
> > > > > > > > > > 

Re: [PATCH v3 7/7] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-02-22 Thread Mathieu Poirier
Hi,

On Wed, Feb 14, 2024 at 06:21:27PM +0100, Arnaud Pouliquen wrote:
> The new TEE remoteproc device is used to manage remote firmware in a
> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
> introduced to delegate the loading of the firmware to the trusted
> execution context. In such cases, the firmware should be signed and
> adhere to the image format defined by the TEE.
> 
> A new "to_attach" field is introduced to differentiate the use cases
> "firmware loaded by the boot stage" and "firmware loaded by the TEE".
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> V2 to V3 update:
> - remove stm32_rproc_tee_elf_sanity_check(), stm32_rproc_tee_elf_load(),
>   stm32_rproc_tee_elf_find_loaded_rsc_table() and stm32_rproc_tee_start(),
>   which are now unused
> - use the new rproc::alt_boot field to specify that the alternate boot
>   method is used
> - use the stm32_rproc::to_attach field to differentiate attach mode from
>   remoteproc tee boot mode
> - remove the use of stm32_rproc::fw_loaded
> ---
>  drivers/remoteproc/stm32_rproc.c | 85 +---
>  1 file changed, 79 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/remoteproc/stm32_rproc.c 
> b/drivers/remoteproc/stm32_rproc.c
> index fcc0001e2657..9cfcf66462e0 100644
> --- a/drivers/remoteproc/stm32_rproc.c
> +++ b/drivers/remoteproc/stm32_rproc.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "remoteproc_internal.h"
> @@ -49,6 +50,9 @@
>  #define M4_STATE_STANDBY 4
>  #define M4_STATE_CRASH   5
>  
> +/* Remote processor unique identifier aligned with the Trusted Execution 
> Environment definitions */
> +#define STM32_MP1_M4_PROC_ID	0
> +
>  struct stm32_syscon {
>   struct regmap *map;
>   u32 reg;
> @@ -90,6 +94,8 @@ struct stm32_rproc {
>   struct stm32_mbox mb[MBOX_NB_MBX];
>   struct workqueue_struct *workqueue;
>   bool hold_boot_smc;
> + bool to_attach;
> + struct tee_rproc *trproc;
>   void __iomem *rsc_va;
>  };
>  
> @@ -253,10 +259,30 @@ static int stm32_rproc_release(struct rproc *rproc)
>   return err;
>   }
>   }
> + ddata->to_attach = false;
>  
>   return err;
>  }
>  
> +static int stm32_rproc_tee_attach(struct rproc *rproc)
> +{
> + /* Nothing to do, remote proc already started by the secured context. */
> + return 0;
> +}
> +
> +static int stm32_rproc_tee_stop(struct rproc *rproc)
> +{
> + int err;
> +
> + stm32_rproc_request_shutdown(rproc);
> +
> + err = tee_rproc_stop(rproc);
> + if (err)
> + return err;
> +
> + return stm32_rproc_release(rproc);
> +}
> +
>  static int stm32_rproc_prepare(struct rproc *rproc)
>  {
>   struct device *dev = rproc->dev.parent;
> @@ -637,10 +663,14 @@ stm32_rproc_get_loaded_rsc_table(struct rproc *rproc, 
> size_t *table_sz)
>  {
>   struct stm32_rproc *ddata = rproc->priv;
>   struct device *dev = rproc->dev.parent;
> + struct tee_rproc *trproc = ddata->trproc;
>   phys_addr_t rsc_pa;
>   u32 rsc_da;
>   int err;
>  
> + if (trproc && !ddata->to_attach)
> + return tee_rproc_get_loaded_rsc_table(rproc, table_sz);
> +

Why do we need a flag at all?  Why can't st_rproc_tee_ops::get_loaded_rsc_table
be set to tee_rproc_get_loaded_rsc_table()?

>   /* The resource table has already been mapped, nothing to do */
>   if (ddata->rsc_va)
>   goto done;
> @@ -693,8 +723,20 @@ static const struct rproc_ops st_rproc_ops = {
>   .get_boot_addr  = rproc_elf_get_boot_addr,
>  };
>  
> +static const struct rproc_ops st_rproc_tee_ops = {
> + .prepare= stm32_rproc_prepare,
> + .start  = tee_rproc_start,
> + .stop   = stm32_rproc_tee_stop,
> + .attach = stm32_rproc_tee_attach,
> + .kick   = stm32_rproc_kick,
> + .get_loaded_rsc_table = stm32_rproc_get_loaded_rsc_table,
> + .find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
> + .load   = tee_rproc_load_fw,
> +};
> +
>  static const struct of_device_id stm32_rproc_match[] = {
> - { .compatible = "st,stm32mp1-m4" },
> + {.compatible = "st,stm32mp1-m4",},
> + {.compatible = "st,stm32mp1-m4-tee",},
>   {},
>  };
>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
> @@ -853,6 +895,7 @@ static int stm32_rproc_probe(struct platform_device *pdev)
>   struct device *dev = >dev;
>   struct stm32_rproc *ddata;
>   struct device_node *np = dev->of_node;
> + struct tee_rproc *trproc = NULL;
>   struct rproc *rproc;
>   unsigned int state;
>   int ret;
> @@ -861,12 +904,33 @@ static int stm32_rproc_probe(struct platform_device 
> *pdev)
>   if (ret)
>   return ret;
>  
> - rproc = rproc_alloc(dev, np->name, &st_rproc_ops, NULL, sizeof(*ddata));
> - if (!rproc)
> - return -ENOMEM;

This patch doesn't apply to rproc-next - please rebase.


Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro

2024-02-22 Thread Steven Rostedt
On Thu, 22 Feb 2024 13:25:34 -0500
Chuck Lever  wrote:


> Do you want me to take this through the nfsd tree, or would you like
> an Ack from me so you can handle it as part of your clean up? Just
> in case:
> 
> Acked-by: Chuck Lever 
> 

As my patches depend on this, I can take it with your ack.

Thanks!

-- Steve



Re: [PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro

2024-02-22 Thread Chuck Lever
On Thu, Feb 22, 2024 at 12:28:28PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> I'm working on restructuring the __string* macros so that they don't need
> to recalculate the string twice. That is, the string will be saved off when
> processing __string(), and __assign_str() will not need to do the work
> again as it currently does.
> 
> Currently __string_len(item, src, len) doesn't actually use "src", but my
> changes will require src to be correct as that is where the __assign_str()
> will get its value from.
> 
> The event class nfsd_clid_class has:
> 
>   __string_len(name, name, clp->cl_name.len)
> 
> But the second "name" does not exist and causes my changes to fail to
> build. That second parameter should be: clp->cl_name.data.
> 
> Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for 
> nfsd_clid_class")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/nfsd/trace.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index d1e8cf079b0f..2cd57033791f 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class,
>   __array(unsigned char, addr, sizeof(struct sockaddr_in6))
>   __field(unsigned long, flavor)
>   __array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
> - __string_len(name, name, clp->cl_name.len)
> + __string_len(name, clp->cl_name.data, clp->cl_name.len)
>   ),
>   TP_fast_assign(
>   __entry->cl_boot = clp->cl_clientid.cl_boot;
> -- 
> 2.43.0
> 

Do you want me to take this through the nfsd tree, or would you like
an Ack from me so you can handle it as part of your clean up? Just
in case:

Acked-by: Chuck Lever 

-- 
Chuck Lever



Re: [PATCH v3 5/7] dt-bindings: remoteproc: Add compatibility for TEE support

2024-02-22 Thread Rob Herring


On Wed, 14 Feb 2024 18:21:25 +0100, Arnaud Pouliquen wrote:
> The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration
> where the Cortex-M4 firmware is loaded by the Trusted execution Environment
> (TEE).
> For instance, this compatible is used in both the Linux and OP-TEE
> device-tree:
> - In OP-TEE, a node is defined in the device tree with the
>   st,stm32mp1-m4-tee to support signed remoteproc firmware.
>   Based on DT properties, OP-TEE authenticates, loads, starts, and stops
>   the firmware.
> - On Linux, when the compatibility is set, the Cortex-M resets should not
>   be declared in the device tree.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> V1 to V2 updates
> - update "st,stm32mp1-m4" compatible description to generalize
> - remove the 'reset-names' requirement in one conditional branch, as the
>   property is already part of the condition test.
> ---
>  .../bindings/remoteproc/st,stm32-rproc.yaml   | 51 ---
>  1 file changed, 43 insertions(+), 8 deletions(-)
> 

Reviewed-by: Rob Herring 




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Haitao Huang

On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai  wrote:


+/**
+ * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
+ * @root:  Root of the tree to start walking from.
+ * Return: Number of pages reclaimed.


Just wondering, do you need to return @cnt given this function is called
w/o checking the return value?

Yes. Will add an explicit comment that we scan a fixed number of pages for
attempted reclamation.

+ */
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+{
+   /*
+* Attempting to reclaim only a few pages will often fail and is
+* inefficient, while reclaiming a huge number of pages can result in
+* soft lockups due to holding various locks for an extended duration.
+*/


Not sure we need this comment, given it's already implied in
sgx_reclaim_pages().  You cannot pass a value > SGX_NR_TO_SCAN anyway.


Will rework on these comments to make them more meaningful.




[other comments/questions addressed in separate email threads]
[...]

+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup
+ * when the cgroup is at/near its maximum capacity
+ */


I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here.
Does it make more sense to move that code change to this patch for better
review?



Right. This comment was left-over when I split the old patch.


+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+   struct sgx_epc_cgroup *epc_cg;
+   u64 cur, max;
+
+   epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+   for (;;) {
+   max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+   /*
+* Adjust the limit down by one page, the goal is to free up
+* pages for fault allocations, not to simply obey the limit.
+* Conditionally decrementing max also means the cur vs. max
+* check will correctly handle the case where both are zero.
+*/
+   if (max)
+   max--;


With the below max -= SGX_NR_TO_SCAN/2 staff, do you still need this one?



Logically still needed for case max <= SGX_NR_TO_SCAN * 2


+
+   /*
+* Unless the limit is extremely low, in which case forcing
+* reclaim will likely cause thrashing, force the cgroup to
+* reclaim at least once if it's operating *near* its maximum
+* limit by adjusting @max down by half the min reclaim size.


OK.  But why choose "SGX_NR_TO_SCAN * 2" as "extremely low"? E.g., could we
choose SGX_NR_TO_SCAN instead?
IMHO we should at least put a comment to mention this.

And maybe you can have a dedicated macro for that, in which case I believe
the code would be easier to understand?


Good point. I think the value is kind of arbitrary. We consider
enclaves/cgroups of 64K size to be very small. If such a cgroup ever reaches
the limit, then we don't aggressively reclaim to optimize #PF handling. The
user might as well just raise the limit if it is not performant.





+* This work func is scheduled by sgx_epc_cgroup_try_charge


This has been mentioned in the function comment already.


+* when it cannot directly reclaim due to being in an atomic
+* context, e.g. EPC allocation in a fault handler.


Why is a fault handler an "atomic context"?  Just say when it cannot
directly reclaim.



Sure.


+* Waiting to reclaim until the cgroup is actually at its limit is less
+* performant as it means the faulting task is effectively
+* blocked until a worker makes its way through the global work
+* queue.
+*/
+   if (max > SGX_NR_TO_SCAN * 2)
+   max -= (SGX_NR_TO_SCAN / 2);
+
+   cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+
+   if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
+   break;
+
+   /* Keep reclaiming until above condition is met. */
+   sgx_epc_cgroup_reclaim_pages(epc_cg->cg);


Also, each loop here calls sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty(); both loop over the given EPC cgroup and its
descendants.  If we still make sgx_reclaim_pages() always scan
SGX_NR_TO_SCAN pages, it seems we can reduce the number of loops here?



[We already scan SGX_NR_TO_SCAN pages for the cgroup at the level of  
sgx_epc_cgroup_reclaim_pages().]


I think you mean that we keep scanning and reclaiming until at least
SGX_NR_TO_SCAN pages are reclaimed, as your code suggested above. We
probably can make that a version for this background thread as an
optimization. But sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty() are not that bad unless we have very deep and
wide cgroup trees. So 

Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-22 Thread Carlos Galo
On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko  wrote:
>
> On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> > >
> > > Hi,
> > > sorry I have missed this before.
> > >
> > > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > > The current implementation of the mark_victim tracepoint provides only
> > > > the process ID (pid) of the victim process. This limitation poses
> > > > challenges for userspace tools that need additional information
> > > > about the OOM victim. The association between pid and the additional
> > > > data may be lost after the kill, making it difficult for userspace to
> > > > correlate the OOM event with the specific process.
> > >
> > > You are correct that post OOM all per-process information is lost. On
> > > the other hand we do dump all this information to the kernel log. Could
> > > you explain why that is not suitable for your purpose?
> >
> > Userspace tools often need real-time visibility into OOM situations
> > for userspace intervention. Our use case involves utilizing BPF
> > programs, along with BPF ring buffers, to provide OOM notification to
> > userspace. Parsing kernel logs would be significant overhead as
> > opposed to the event based BPF approach.
>
> Please put that into the changelog.

Will do.

> > > > In order to mitigate this limitation, add the following fields:
> > > >
> > > > - UID
> > > >In Android each installed application has a unique UID. Including
> > > >the `uid` assists in correlating OOM events with specific apps.
> > > >
> > > > - Process Name (comm)
> > > >Enables identification of the affected process.
> > > >
> > > > - OOM Score
> > > >Allows userspace to get additional insights of the relative kill
> > > >priority of the OOM victim.
> > >
> > > What is the oom score useful for?
> > >
> > The OOM score provides us a measure of the victim's importance. On the
> > android side, it allows us to identify if top or foreground apps are
> > killed, which have user perceptible impact.
>
> But the value on its own (wihtout knowing scores of other tasks) doesn't
> really tell you anything, does it?

Android uses the oom_score_adj ranges to categorize app state
(foreground, background, ...). I'll resend a v3 with the commit text
updated to include details about this.

> > > Is there any reason to provide a different information from the one
> > > reported to the kernel log?
> > > __oom_kill_process:
> > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB 
> > > oom_score_adj:%hd\n",
> > > message, task_pid_nr(victim), victim->comm, 
> > > K(mm->total_vm),
> > > K(get_mm_counter(mm, MM_ANONPAGES)),
> > > K(get_mm_counter(mm, MM_FILEPAGES)),
> > > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > > from_kuid(_user_ns, task_uid(victim)),
> > > mm_pgtables_bytes(mm) >> 10, 
> > > victim->signal->oom_score_adj);
> > >
> >
> > We added these fields we need (UID, process name, and OOM score), but
> > we're open to adding the others if you prefer that for consistency
> > with the kernel log.
>
> yes, I think the consistency would be better here. For one it reports
> numbers which can tell quite a lot about the killed victim. It is a
> superset of what you already asking for. With a notable exception of the
> oom_score which is really dubious without a wider context.

Sounds good, I'll resend a v3 that includes these fields as well.

Thanks,
Carlos

> --
> Michal Hocko
> SUSE Labs
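
A hedged sketch of what the extended event might look like in v3; the exact
field set and names are assumptions based on the discussion, not the final
patch:

	TRACE_EVENT(mark_victim,
		TP_PROTO(struct task_struct *task, uid_t uid),
		TP_ARGS(task, uid),
		TP_STRUCT__entry(
			__field(int, pid)
			__string(comm, task->comm)
			__field(uid_t, uid)
			__field(short, oom_score_adj)
		),
		TP_fast_assign(
			__entry->pid = task->pid;
			__assign_str(comm, task->comm);
			__entry->uid = uid;
			__entry->oom_score_adj = task->signal->oom_score_adj;
		),
		TP_printk("pid=%d comm=%s uid=%u oom_score_adj=%hd",
			  __entry->pid, __get_str(comm), __entry->uid,
			  __entry->oom_score_adj)
	);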



[PATCH] NFSD: Fix nfsd_clid_class use of __string_len() macro

2024-02-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

I'm working on restructuring the __string* macros so that they don't need
to recalculate the string twice. That is, the string will be saved off when
processing __string(), and __assign_str() will not need to do the work
again as it currently does.

Currently __string_len(item, src, len) doesn't actually use "src", but my
changes will require src to be correct as that is where the __assign_str()
will get its value from.

The event class nfsd_clid_class has:

  __string_len(name, name, clp->cl_name.len)

But the second "name" does not exist and causes my changes to fail to
build. That second parameter should be: clp->cl_name.data.

Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for nfsd_clid_class")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/nfsd/trace.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index d1e8cf079b0f..2cd57033791f 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -843,7 +843,7 @@ DECLARE_EVENT_CLASS(nfsd_clid_class,
__array(unsigned char, addr, sizeof(struct sockaddr_in6))
__field(unsigned long, flavor)
__array(unsigned char, verifier, NFS4_VERIFIER_SIZE)
-   __string_len(name, name, clp->cl_name.len)
+   __string_len(name, clp->cl_name.data, clp->cl_name.len)
),
TP_fast_assign(
__entry->cl_boot = clp->cl_clientid.cl_boot;
-- 
2.43.0




Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-22 Thread Haitao Huang

On Wed, 21 Feb 2024 05:00:27 -0600, Huang, Kai  wrote:


On Wed, 2024-02-21 at 00:44 -0600, Haitao Huang wrote:

[...]
>
> Here the @nr_to_scan is reduced by the number of pages that are isolated,
> but not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks like you want to make this function do "each cycle" as what you
> mentioned in the v8 [1]:
>
>	I tested with that approach and found we can only target the number
>	of pages attempted to reclaim, not pages actually reclaimed, due to
>	the uncertainty of how long it takes to reclaim pages. Besides,
>	targeting the number of scanned pages for each cycle is also what
>	ksgxd does.
>
>	If we target the actual number of pages, sometimes it just takes too
>	long. I saw more timeouts with the default time limit when running
>	parallel selftests.
>selftests.
>
> I am not sure what "sometimes it just takes too long" means, but what I am
> thinking is you are trying to do some perfect yet complicated code here.

I think what I observed was that try_charge() would block too long before
getting a chance to schedule() and yield, causing more timeouts than
necessary.
I'll re-test to be sure.


Looks like this is valid information that can be used to justify whatever
you are implementing in the EPC cgroup reclaiming function(s).

I'll add some comments. I was assuming this just follows the old ksgxd
design. There were some comments at the beginning of
sgx_epc_cgroup_reclaim_pages().

	/*
	 * Attempting to reclaim only a few pages will often fail and is
	 * inefficient, while reclaiming a huge number of pages can result in
	 * soft lockups due to holding various locks for an extended duration.
	 */
	unsigned int nr_to_scan = SGX_NR_TO_SCAN;

I think it can be improved to emphasize that we only "attempt" to finish
scanning a fixed number of pages for reclamation, not enforce a number
of pages successfully reclaimed.
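
(A possible rewording along those lines; illustrative only, not text
taken from the series:)

/*
 * Attempt to scan, but not necessarily reclaim, up to SGX_NR_TO_SCAN
 * pages in each invocation: scanning only a few pages often fails and
 * is inefficient, while insisting on an exact reclaim count can hold
 * various locks long enough to trigger soft lockups.
 */
unsigned int nr_to_scan = SGX_NR_TO_SCAN;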




>
> For instance, I don't think the selftests reflect the real workload,
> and I believe adjusting the limit of a given EPC cgroup shouldn't be
> a frequent operation, thus it is acceptable to use some
> easy-to-maintain but less perfect code.
>
> Here I still think having @nr_to_scan as a pointer is
> over-complicated. For example, we can still let sgx_reclaim_pages()
> always scan SGX_NR_TO_SCAN pages, but give up when there are enough
> pages reclaimed or when the EPC cgroup and its descendants have been
> looped:
>
> unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> {
>	unsigned int cnt = 0;
>	...
>
>	css_for_each_descendant_pre(pos, css_root) {
>		...
>		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
>		cnt += sgx_reclaim_pages(&epc_cg->lru);
>
>		if (cnt >= SGX_NR_TO_SCAN)
>			break;
>	}
>
>	...
>	return cnt;
> }
>
> Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually
> reaches any descendants, but that should be rare and we don't care
> that much, do we?
>
I assume you meant @cnt here to be the number of pages actually
reclaimed.


Yes.


This could cause sgx_epc_cgroup_reclaim_pages() to block for too long,
as @cnt may always be zero (all pages are too young) and you would have
to loop over all descendants.


I am not sure whether this is a valid point.

For example, your change in patch 10 "x86/sgx: Add EPC reclamation in
cgroup try_charge()" already loops over all descendants in the code
below:

+   if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+   	return -ENOMEM;



I meant looping over all descendants for reclamation, which is expensive
and something we want to avoid, not just checking the emptiness of the
LRUs.


Anyway, I can see all these can be justifications for your
design/implementation. My point is: please put these justifications in
the changelog/comments so that we can actually understand them.


Yes, will add clarifying comments.
Thanks



Re: [PATCH v9 10/15] x86/sgx: Add EPC reclamation in cgroup try_charge()

2024-02-22 Thread Haitao Huang

On Wed, 21 Feb 2024 05:06:02 -0600, Huang, Kai  wrote:




-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
 {
-   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+   for (;;) {
+   if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+   PAGE_SIZE))
+   break;
+
+   if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+   return -ENOMEM;
+
+   if (signal_pending(current))
+   return -ERESTARTSYS;
+
+   if (!reclaim) {
+   queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+   return -EBUSY;
+   }
+
+   if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+   /* All pages were too young to reclaim, try again a little later */
+   schedule();
+   }
+
+   return 0;
 }



Seems this code change is 90% similar to the existing code in
sgx_alloc_epc_page():

...
for ( ; ; ) {
page = __sgx_alloc_epc_page();
if (!IS_ERR(page)) {
page->owner = owner;
break;
}

	if (list_empty(&sgx_active_page_list))
return ERR_PTR(-ENOMEM);

if (!reclaim) {
page = ERR_PTR(-EBUSY);
break;
}

if (signal_pending(current)) {
page = ERR_PTR(-ERESTARTSYS);
break;
}

sgx_reclaim_pages();
cond_resched();
}
...

Is it better to move the logic/code change in try_charge() out to
sgx_alloc_epc_page() to unify them?

IIUC, the logic is quite similar: when you either fail to allocate one
page or fail to charge one page, you try to reclaim EPC page(s) from
the current EPC cgroup, either directly or indirectly.

No?


Only these lines are the same:
if (!reclaim) {
page = ERR_PTR(-EBUSY);
break;
}

if (signal_pending(current)) {
page = ERR_PTR(-ERESTARTSYS);
break;
}

In sgx_alloc_epc_page() we do global reclamation, but here we do
per-cgroup reclamation. That's why the logic of the other lines is
different even though they look similar due to similar function names.
For the global reclamation we need to consider the case in which the
cgroup is not enabled. Similarly, list_empty(&sgx_active_page_list)
would have to be changed to check the root cgroup if cgroups are
enabled, and otherwise check the global LRU. The (!reclaim) case is
also different. So I don't see an obvious good way to abstract those
to get meaningful savings.
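
(A rough sketch of the divergence described above, put side by side
using the checks already quoted in this thread:)

	/* global path, sgx_alloc_epc_page(): give up when the global
	 * LRU has nothing reclaimable */
	if (list_empty(&sgx_active_page_list))
		return ERR_PTR(-ENOMEM);

	/* per-cgroup path, sgx_epc_cgroup_try_charge(): give up when
	 * this cgroup's subtree has nothing reclaimable */
	if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
		return -ENOMEM;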


Thanks
Haitao



Re: [PATCH v9 13/15] x86/sgx: Turn on per-cgroup EPC reclamation

2024-02-22 Thread Haitao Huang

On Wed, 21 Feb 2024 05:23:00 -0600, Huang, Kai  wrote:


On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.

Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.

This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., sum of cgroup limits
greater than the total capacity, cgroups may never reclaim but the total
usage can still be near the capacity. Therefore global reclamation is
still needed in those cases and it should reclaim from the root cgroup.

Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU.

Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 6b0c26cac621..d4265a390ba9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;

 static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
 {
+#ifdef CONFIG_CGROUP_SGX_EPC
+   if (epc_page->epc_cg)
+   	return &epc_page->epc_cg->lru;
+
+   /* This should not happen if kernel is configured correctly */
+   WARN_ON_ONCE(1);
+#endif
	return &sgx_global_lru;
 }


How about when the EPC cgroup is enabled, but one enclave doesn't belong
to any EPC cgroup? Is it OK to track EPC pages for these enclaves in the
root EPC cgroup's LRU list together with other enclaves that belong to
the root cgroup?

This should be a valid case, right?


There is no such case. Each page is in the root by default.

Thanks

Haitao



Re: [PATCH v9 12/15] x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer

2024-02-22 Thread Haitao Huang

On Wed, 21 Feb 2024 05:10:36 -0600, Huang, Kai  wrote:


On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:

From: Kristen Carlson Accardi 

When cgroup is enabled, all reclaimable pages will be tracked in cgroup
LRUs. The global reclaimer needs to start reclamation from the root
cgroup. Expose the top level cgroup reclamation function so the global
reclaimer can reuse it.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V8:
- Remove unneeded breaks in function declarations. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index abf74fdb12b4..6e31f8727b8a 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -96,7 +96,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
  * @indirect:   In ksgxd or EPC cgroup work queue context.
  * Return: Number of pages reclaimed.
  */
-static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
 {
/*
 * Attempting to reclaim only a few pages will often fail and is
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index d061cd807b45..5b3e8e1b8630 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -31,6 +31,11 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
 static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }

 static inline void sgx_epc_cgroup_init(void) { }
+
+static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
+{
+   return 0;
+}
 #else
 struct sgx_epc_cgroup {
struct misc_cg *cg;
@@ -69,6 +74,8 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
 int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
 void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
 bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect);
+
 void sgx_epc_cgroup_init(void);

 #endif


I'd just prefer to merge a patch like this into the one that actually
uses the exposed function. It's just a couple of LOC, and we shouldn't
have to read these repeated changelogs and move back and forth between
patches during review.


IIRC, Jarkko prefers exposing functions first in a separate patch.
Jarkko, right?


Also, I find your definition/expectation of self-contained patches
confusing, or at least too constrained. I usually review each patch
separately, without going back and forth, and then review them together
to see if they all make sense in terms of breakdown. My minimal
expectation is that a patch should not depend on future changes and
should not break the current state of the function.


For this one, I thought the idea was that you verify whether the exposed
API makes sense without looking at how it is used in the future. Then,
when you review the usage patch, you see if the usage is reasonable.


I would really hesitate to merge patches at this point unless we all
agree and have good/strong reasons, or there is a hard rule about this.


Thanks
Haitao



Re: [syzbot] [virtualization?] linux-next boot error: WARNING: refcount bug in __free_pages_ok

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 22, 2024 at 11:06:55AM +0800, Lei Yang wrote:
> Hi All
> 
> I hit a similar issue when doing regression testing on my side.
> For the error messages please help review the attachment.
> 
> The latest commit:
> commit c02197fc9076e7d991c8f6adc11759c5ba52ddc6 (HEAD -> master,
> origin/master, origin/HEAD)
> Merge: f2667e0c3240 0846dd77c834
> Author: Linus Torvalds 
> Date:   Sat Feb 17 16:59:31 2024 -0800
> 
> Merge tag 'powerpc-6.8-3' of
> git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
> 
> Pull powerpc fixes from Michael Ellerman:
>  "This is a bit of a big batch for rc4, but just due to holiday hangover
>   and because I didn't send any fixes last week due to a late revert
>   request. I think next week should be back to normal.
> 
> Regards
> Lei

It all looks like a generic bug dealing with some refcounting in the
allocator. So, any chance of a bisect there?

-- 
MST




Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-22 Thread Michal Hocko
On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> >
> > Hi,
> > sorry I have missed this before.
> >
> > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > The current implementation of the mark_victim tracepoint provides only
> > > the process ID (pid) of the victim process. This limitation poses
> > > challenges for userspace tools that need additional information
> > > about the OOM victim. The association between pid and the additional
> > > data may be lost after the kill, making it difficult for userspace to
> > > correlate the OOM event with the specific process.
> >
> > You are correct that post OOM all per-process information is lost. On
> > the other hand we do dump all this information to the kernel log. Could
> > you explain why that is not suitable for your purpose?
> 
> Userspace tools often need real-time visibility into OOM situations
> for userspace intervention. Our use case involves utilizing BPF
> programs, along with BPF ring buffers, to provide OOM notification to
> userspace. Parsing kernel logs would be significant overhead as
> opposed to the event based BPF approach.

Please put that into the changelog.

> > > In order to mitigate this limitation, add the following fields:
> > >
> > > - UID
> > >In Android each installed application has a unique UID. Including
> > >the `uid` assists in correlating OOM events with specific apps.
> > >
> > > - Process Name (comm)
> > >Enables identification of the affected process.
> > >
> > > - OOM Score
> > >Allows userspace to get additional insights of the relative kill
> > >priority of the OOM victim.
> >
> > What is the oom score useful for?
> >
> The OOM score provides us a measure of the victim's importance. On the
> android side, it allows us to identify if top or foreground apps are
> killed, which have user perceptible impact.

But the value on its own (without knowing the scores of other tasks)
doesn't really tell you anything, does it?
 
> > Is there any reason to provide a different information from the one
> > reported to the kernel log?
> > __oom_kill_process:
> > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> > message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
> > K(get_mm_counter(mm, MM_ANONPAGES)),
> > K(get_mm_counter(mm, MM_FILEPAGES)),
> > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > from_kuid(&init_user_ns, task_uid(victim)),
> > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
> >
> 
> We added the fields we need (UID, process name, and OOM score), but
> we're open to adding the others if you prefer that for consistency
> with the kernel log.

Yes, I think consistency would be better here. For one, it reports
numbers which can tell quite a lot about the killed victim. It is a
superset of what you are already asking for, with the notable exception
of the oom_score, which is really dubious without a wider context.
-- 
Michal Hocko
SUSE Labs
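
(As an aside on the event-based BPF approach mentioned above: a minimal
sketch of a program hooking the oom/mark_victim tracepoint and pushing
the victim pid into a BPF ring buffer. The struct name
trace_event_raw_mark_victim and the section name follow common libbpf
conventions and are assumptions, not taken from the patch.)

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tracepoint/oom/mark_victim")
int handle_mark_victim(struct trace_event_raw_mark_victim *ctx)
{
	int *pid = bpf_ringbuf_reserve(&events, sizeof(*pid), 0);

	if (!pid)
		return 0;

	*pid = ctx->pid;	/* only the pid is available before this patch */
	bpf_ringbuf_submit(pid, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";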



[PATCH] tracing: Use init_utsname()->release

2024-02-22 Thread John Garry
Instead of using UTS_RELEASE, use init_utsname()->release, which means
that we don't need to rebuild the code just because the git HEAD commit
changed.
Signed-off-by: John Garry 
---
I originally sent an RFC using a new string uts_release, but that
string is not needed as we can use init_utsname()->release instead.
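
(For illustration, a minimal sketch of the replacement pattern; the
function and pr_info() usage are made up, not from the patch:)

	#include <linux/utsname.h>

	static void example_print_release(void)
	{
		/* init_utsname() returns the init UTS namespace's name
		 * struct, so ->release is read at run time instead of
		 * being baked in at build time via UTS_RELEASE. */
		pr_info("running on kernel %s\n", init_utsname()->release);
	}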

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..1416bf72f3f4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  *  Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include 
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include 
 #include 
 #include 
@@ -4368,7 +4368,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
	get_total_entries(buf, &total, &entries);
 
seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-  name, UTS_RELEASE);
+  name, init_utsname()->release);
seq_puts(m, "# ---"
 "-\n");
seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
-- 
2.31.1




Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM

2024-02-22 Thread WANG Xuerui

On 2/22/24 18:06, maobibo wrote:



On 2024/2/22 5:34 PM, WANG Xuerui wrote:

On 2/17/24 11:15, maobibo wrote:

On 2024/2/15 6:25 PM, WANG Xuerui wrote:

On 2/15/24 18:11, WANG Xuerui wrote:
Sorry for the late reply (and Happy Chinese New Year), and thanks 
for providing microbenchmark numbers! But it seems the more 
comprehensive CoreMark results were omitted (that's also absent in 
v3)? While the 


Of course the benchmark suite should be UnixBench instead of 
CoreMark. Lesson: don't multi-task code reviews, especially not 
after consuming beer -- a cup of coffee won't fully cancel the 
influence. ;-)


Where is the rule about benchmark choices like UnixBench/CoreMark for
IPI improvement?


Sorry for the late reply. The rules are mostly unwritten, but in
general you can think of the preference of benchmark suites as a
matter of "effectiveness" -- the closer it is to some real workload in
the wild, the better. Micro-benchmarks are okay for illustrating the
points, but without demonstrating the impact on realistic workloads, a
change could be "useless" in practice or even decrease various
performance metrics (be that throughput or latency or anything that
matters in the certain case), yet get accepted without notice.
Yes, a micro-benchmark cannot represent the real world; however, that
does not mean that UnixBench/CoreMark should be run. You need to point
out the negative effect of the code, or a possible real scenario which
may benefit, and point out a reasonable benchmark sensitive to IPIs
rather than blindly saying UnixBench/CoreMark.


I was not meaning to argue with you, nor was I implying that your
changes "must be regressing things even though I didn't check myself"
-- my point is, *any* comparison with a realistic workload that shows
the performance mostly unaffected inside/outside KVM would give
reviewers (and yourself too) much more confidence in accepting the
change.


For me, personally I think a microbenchmark could be enough, because the 
only externally-visible change is the IPI mechanism overhead, but please 
consider other reviewers that may or may not be familiar enough with 
LoongArch to be able to notice the "triviality". Also, given the 6-patch 
size of the series, it could hardly be considered "trivial".


--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/




Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM

2024-02-22 Thread maobibo




On 2024/2/22 5:34 PM, WANG Xuerui wrote:

On 2/17/24 11:15, maobibo wrote:

On 2024/2/15 6:25 PM, WANG Xuerui wrote:

On 2/15/24 18:11, WANG Xuerui wrote:
Sorry for the late reply (and Happy Chinese New Year), and thanks 
for providing microbenchmark numbers! But it seems the more 
comprehensive CoreMark results were omitted (that's also absent in 
v3)? While the 


Of course the benchmark suite should be UnixBench instead of 
CoreMark. Lesson: don't multi-task code reviews, especially not after 
consuming beer -- a cup of coffee won't fully cancel the influence. ;-)


Where is the rule about benchmark choices like UnixBench/CoreMark for
IPI improvement?


Sorry for the late reply. The rules are mostly unwritten, but in general
you can think of the preference of benchmark suites as a matter of
"effectiveness" -- the closer it is to some real workload in the wild,
the better. Micro-benchmarks are okay for illustrating the points, but
without demonstrating the impact on realistic workloads, a change could
be "useless" in practice or even decrease various performance metrics
(be that throughput or latency or anything that matters in the certain
case), yet get accepted without notice.
Yes, a micro-benchmark cannot represent the real world; however, that
does not mean that UnixBench/CoreMark should be run. You need to point
out the negative effect of the code, or a possible real scenario which
may benefit, and point out a reasonable benchmark sensitive to IPIs
rather than blindly saying UnixBench/CoreMark.


Regards
Bibo Mao








Re: [PATCH v3 0/7] Introduction of a remoteproc tee to load signed firmware

2024-02-22 Thread Naman Jain

On 2/22/2024 2:17 PM, Arnaud POULIQUEN wrote:

Hello Naman,

On 2/22/24 06:43, Naman Jain wrote:

On 2/14/2024 10:51 PM, Arnaud Pouliquen wrote:

Updates from the previous version [1]:

This version proposes another approach based on an alternate load and boot
of the coprocessor. Therefore, the constraint introduced by tee_remoteproc
is that the firmware has to be authenticated and loaded before the resource
table can be obtained.

>> The existing boot sequence is:
    1) Get the resource table and store it in a cache,
   calling rproc->ops->parse_fw().
    2) Parse the resource table and handle resources,
   calling rproc_handle_resources.
    3) Load the firmware, calling rproc->ops->load().
    4) Start the firmware, calling rproc->ops->start().
   => Steps 1 and 2 are executed in rproc_fw_boot(), while steps 3 and 4 are
     executed in rproc_start().
=> the use of rproc->ops->load() ops is mandatory

The boot sequence needed for TEE boot is:

    1) Load the firmware.
    2) Get the loaded resource, no cache.
    3) Parse the resource table and handle resources.
    4) Start the firmware.


Hi,
What problem are we really addressing here by reordering the load and
parse of FW resources?


The feature introduced in TEE is the signature of the firmware images.
That means that before getting the resource table, we first need to
authenticate the firmware images.
Authenticating a firmware image means that we have to copy the firmware
into protected memory that cannot be corrupted by the non-secure world,
and then verify the signature.
The strategy implemented in OP-TEE is to load the firmware into the
destination memory and then authenticate it.
This strategy avoids having a temporary copy of the whole image in
secure memory.
This strategy imposes loading the firmware images before retrieving the
resource table.


Basically, what are the limitations of the current design you are referring to?
I understood that TEE is designed that way.


The limitation of the current design is that we obtain the resource
table before loading the firmware. Following the current design would
impose constraints in TEE that are not straightforward. Step 1 (getting
the resource table and storing it in a cache) would require having a
copy of the resource table in TEE after authenticating the images.
However, authenticating the firmware, as explained before, depends on
the strategy implemented. In the TEE implementation, we load the
firmware into the destination memory to authenticate it.

Regards,
Arnaud



Hello Arnaud,
I think I got your point now. In TEE, you don't want to do anything
(read the resource table) with the FW image until it is loaded and
authenticated. Since the current design did not allow that, you had to
reorganize the code so that this can be achieved.


Generally speaking, in the current design, if authentication fails for
some reason later, one can handle it, but whether the damage is already
done depends on the implementation of the parse_fw op.

Please correct me if this is wrong assumption.
Patch looks good to me.

Regards,
Naman Jain



Re: [PATCH v4 0/6] LoongArch: Add pv ipi support on LoongArch VM

2024-02-22 Thread WANG Xuerui

On 2/17/24 11:15, maobibo wrote:

On 2024/2/15 6:25 PM, WANG Xuerui wrote:

On 2/15/24 18:11, WANG Xuerui wrote:
Sorry for the late reply (and Happy Chinese New Year), and thanks for 
providing microbenchmark numbers! But it seems the more comprehensive 
CoreMark results were omitted (that's also absent in v3)? While the 


Of course the benchmark suite should be UnixBench instead of CoreMark. 
Lesson: don't multi-task code reviews, especially not after consuming 
beer -- a cup of coffee won't fully cancel the influence. ;-)


Where is the rule about benchmark choices like UnixBench/CoreMark for
IPI improvement?


Sorry for the late reply. The rules are mostly unwritten, but in general
you can think of the preference of benchmark suites as a matter of
"effectiveness" -- the closer it is to some real workload in the wild,
the better. Micro-benchmarks are okay for illustrating the points, but
without demonstrating the impact on realistic workloads, a change could
be "useless" in practice or even decrease various performance metrics
(be that throughput or latency or anything that matters in the certain
case), yet get accepted without notice.


--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/




Re: [PATCH v3] modules: wait do_free_init correctly

2024-02-22 Thread Changbin Du
On Wed, Feb 21, 2024 at 09:40:48AM -0800, Luis Chamberlain wrote:
> + live-patching folks,
> 
> Finally, things are starting to be much clearer. Thanks for the time
> for working on this, some more comments below and a question which
> I think deserves some attention.
> 
> On Sat, Feb 17, 2024 at 04:18:10PM +0800, Changbin Du wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking. 
> 
> Some nits, this should read instead:
> 
> Fix the ordering of freeing of a module init so that it happens before
> W+X checking.
> 
> > But the commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moves do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() cannot ensure that do_free_init
> > has completed. We should wait for it via flush_work().
> 
> Remove "But" and adjust as:
> 
> Commit 1a7b7d922081 ("modules: Use vmalloc special flag") moved
> calling do_free_init() into a global workqueue instead of relying on it
> being called through call_rcu(..., do_free_init), which used to allow us
> to call do_free_init() asynchronously after the end of a subsequent grace
> period. The move to a global workqueue broke the guarantees for code
> which needed to be sure that do_free_init() would complete with rcu_barrier().
> To fix this callers which used to rely on rcu_barrier() must now instead
> use flush_work(_free_wq).
>
Sure, thanks!

> > Without this fix, we still could encounter false positive reports in
> > W+X checking,
> 
> This is good thanks for the clarification.
> 
> I think it would be useful for the commit log then to describe also that
> it is not that the freeing was not happening, it is just that our sanity
> checkers raced against the permission checkers which assume init memory
> is already gone.
> 
Okay, I'll append this detailed explanation.

> > and the rcu synchronization is unnecessary which can
> > introduce significant delay.
> 
> While this can be true, I am not sure if we can remove it. See below.
> 
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
> 
> That's a separate issue.
> 
> >   [0.291444] Freeing unused kernel memory: 5568K
> >   [0.402442] Run /sbin/init as init process
> > 
> > With this fix, the above delay can be eliminated.
> > 
> > Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
> > Signed-off-by: Changbin Du 
> > Cc: Xiaoyi Su 
> > Cc: Eric Chanudet 
> > 
> > ---
> > v3:
> >   - amend comment in do_init_module() and update commit msg.
> > v2:
> >   - fix compilation issue for no CONFIG_MODULES found by 0-DAY.
> > ---
> >  include/linux/moduleloader.h | 8 
> >  init/main.c  | 5 +++--
> >  kernel/module/main.c | 9 +++--
> >  3 files changed, 18 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
> > index 001b2ce83832..89b1e0ed9811 100644
> > --- a/include/linux/moduleloader.h
> > +++ b/include/linux/moduleloader.h
> > @@ -115,6 +115,14 @@ int module_finalize(const Elf_Ehdr *hdr,
> > const Elf_Shdr *sechdrs,
> > struct module *mod);
> >  
> > +#ifdef CONFIG_MODULES
> > +void flush_module_init_free_work(void);
> > +#else
> > +static inline void flush_module_init_free_work(void)
> > +{
> > +}
> > +#endif
> > +
> >  /* Any cleanup needed when module leaves. */
> >  void module_arch_cleanup(struct module *mod);
> >  
> > diff --git a/init/main.c b/init/main.c
> > index e24b0780fdff..f0b7e21ac67f 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -99,6 +99,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  
> >  #include 
> > @@ -1402,11 +1403,11 @@ static void mark_readonly(void)
> > if (rodata_enabled) {
> > /*
> >  * load_module() results in W+X mappings, which are cleaned
> > -* up with call_rcu().  Let's make sure that queued work is
> > +* up with init_free_wq. Let's make sure that queued work is
> >  * flushed so that we don't hit false positives looking for
> >  * insecure pages which are W+X.
> >  */
> > -   rcu_barrier();
> 
> Was this the only source of waiters that used rcu_barrier() to sync ?
> What about kallsyms, live-patching ?
> 
> The original source of the addition of this rcu_barrier() (in a slightly
> older modified form with rcu_barrier_sched()) was commit
> ae646f0b9ca13 ("init: fix false positives in W+X checking"), in
> v4.17 in 2018, 6 years ago. So I'm hoping we don't have any other
> new users which have grown dependent on this rcu_barrier() for
> other call_rcu()'s they may have used, but it is hard to tell.
> 
Per the condition 'rodata_enabled' and the comments, I think the
rcu_barrier() is only used to synchronize with freeing module init
memory.
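
(For reference, a plausible definition of the new helper; the
kernel/module/main.c hunk is not shown in this excerpt, so this is an
assumption based on the declaration above:)

void flush_module_init_free_work(void)
{
	flush_work(&init_free_wq);
}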

> So while I agree that flush work is the right solution, removing the
> 

Re: [PATCH v3 0/7] Introduction of a remoteproc tee to load signed firmware

2024-02-22 Thread Arnaud POULIQUEN
Hello Naman,

On 2/22/24 06:43, Naman Jain wrote:
> On 2/14/2024 10:51 PM, Arnaud Pouliquen wrote:
>> Updates from the previous version [1]:
>>
>> This version proposes another approach based on an alternate load and boot
>> of the coprocessor. Therefore, the constraint introduced by tee_remoteproc
>> is that the firmware has to be authenticated and loaded before the resource
>> table can be obtained.
>>
>> The existing boot sequence is:
>>    1) Get the resource table and store it in a cache,
>>   calling rproc->ops->parse_fw().
>>    2) Parse the resource table and handle resources,
>>   calling rproc_handle_resources.
>>    3) Load the firmware, calling rproc->ops->load().
>>    4) Start the firmware, calling rproc->ops->start().
>>   => Steps 1 and 2 are executed in rproc_fw_boot(), while steps 3 and 4 are
>>     executed in rproc_start().
>> => the use of rproc->ops->load() ops is mandatory
>>
>> The boot sequence needed for TEE boot is:
>>
>>    1) Load the firmware.
>>    2) Get the loaded resource, no cache.
>>    3) Parse the resource table and handle resources.
>>    4) Start the firmware.
> 
> Hi,
> What problem are we really addressing here by reordering the load and
> parse of FW resources?

The feature introduced in TEE is the signature of the firmware images.
That means that before getting the resource table, we first need to
authenticate the firmware images.
Authenticating a firmware image means that we have to copy the firmware
into protected memory that cannot be corrupted by the non-secure world,
and then verify the signature.
The strategy implemented in OP-TEE is to load the firmware into the
destination memory and then authenticate it.
This strategy avoids having a temporary copy of the whole image in
secure memory.
This strategy imposes loading the firmware images before retrieving the
resource table.

> Basically, what are the limitations of the current design you are referring 
> to?
> I understood that TEE is designed that way.

The limitation of the current design is that we obtain the resource
table before loading the firmware. Following the current design would
impose constraints in TEE that are not straightforward. Step 1 (getting
the resource table and storing it in a cache) would require having a
copy of the resource table in TEE after authenticating the images.
However, authenticating the firmware, as explained before, depends on
the strategy implemented. In the TEE implementation, we load the
firmware into the destination memory to authenticate it.

Regards,
Arnaud

> 
>>
>> Then the crash recovery also has to be managed. For recovery, the cache is
>> used to temporarily save the resource table and then reapply it on
>> restart:
>>    1) Stop the remote processor, calling rproc->ops->stop().
>>    2) Load the firmware, calling rproc->ops->load().
>>    3) Copy cached resource table.
>>    4) Start the remote processor, calling rproc->ops->start().
>>
>> => This sequence is also needed when TEE manages the boot of the remote
>>     processor.
>> => The rproc->ops->load() is also used in recovery sequence.
>>
>> Based on the sequences described above, the proposal is to:
>>
>> - Rework tee_rproc API to better match the rproc_ops structure.
>>    This allows simply mapping the function to implement the load ops,
>>    which is not optional. The tee_rproc_load_fw() is updated accordingly.
>> - Remove the call of rproc_load_segments from rproc_start() to dissociate
>>    the load and the start. This is necessary to implement the boot sequence
>>    requested for the TEE remote proc support.
>> - Introduce an rproc_alt_fw_boot() function that is an alternative boot
>>    sequence, which implements the sequence requested for the TEE remoteproc
>>    support.
>>
>>
>> [1]
>> https://lore.kernel.org/lkml/20240118100433.3984196-1-arnaud.pouliq...@foss.st.com/T/
>>
>>
>> Description of the feature:
>>
>> This series proposes the implementation of a remoteproc tee driver to
>> communicate with a TEE trusted application responsible for authenticating and
>> loading the remoteproc firmware image in an Arm secure context.
>>
>> 1) Principle:
>>
>> The remoteproc tee driver provides services to communicate with the OP-TEE
>> trusted application running on the Trusted Execution Context (TEE).
> 
> s/Context/Environment?
> 
>> The trusted application in TEE manages the remote processor lifecycle:
>>
>> - authenticating and loading firmware images,
>> - isolating and securing the remote processor memories,
>> - supporting multi-firmware (e.g., TF-M + Zephyr on a Cortex-M33),
>> - managing the start and stop of the firmware by the TEE.
>>
> 
> Regards,
> Naman Jain
> 



Re: [PATCH] dt-bindings: usb: typec: anx7688: start a binding document

2024-02-22 Thread Krzysztof Kozlowski
On 21/02/2024 23:45, Pavel Machek wrote:
>>> +properties:
>>> +  compatible:
>>> +const: usb-c-connector
>>> +
>>> +  power-role: true
>>> +
>>> +  data-role: true
>>> +
>>> +  try-power-role: true
>>
>> I don't understand why you have these properties here. Do you see any
>> binding like this?
> 
> Well, it looks like I copied these mistakes from analogix,anx7411.yaml.
> 
>>> +
>>> +required:
>>> +  - compatible
>>
>> Drop, why is it needed?
> 
> Again, copied from analogix,anx7411.yaml.
> 
> Should I try to fix up analogix,anx7411.yaml and submit that first,
> or is it easier if you do that?

I'll fix it up, thanks.

Best regards,
Krzysztof