Re: [RFC PATCH 1/5] mfd: 88pm88x: differences with respect to the PMIC RFC series

2024-01-11 Thread Lee Jones
On Thu, 11 Jan 2024, Karel Balej wrote:

> On Thu Jan 11, 2024 at 4:25 PM CET, Lee Jones wrote:
> 
> [...]
> 
> > > > > diff --git a/include/linux/mfd/88pm88x.h b/include/linux/mfd/88pm88x.h
> > > > > index a34c57447827..9a335f6b9c07 100644
> > > > > --- a/include/linux/mfd/88pm88x.h
> > > > > +++ b/include/linux/mfd/88pm88x.h
> > > > > @@ -49,6 +49,8 @@ struct pm88x_data {
> > > > >   unsigned int whoami;
> > > > >   struct reg_sequence *presets;
> > > > >   unsigned int num_presets;
> > > > > + struct mfd_cell *devs;
> > > > > + unsigned int num_devs;
> > > >
> > > > Why are you adding extra abstraction?
> > > 
> > > Right, this is probably not necessary now since I'm only implementing
> > > support for one of the chips - it's just that I keep thinking about it
> > > as a driver for both of them and thus tend to write it a bit more
> > > abstractly. Shall I then drop this and also the `presets` member which
> > > is also chip-specific?
> >
> > Even if you were to support multiple devices, this strategy is unusual
> > and isn't likely to be accepted.
> 
> May I please ask what the recommended strategy is then? `switch`ing on
> the chip ID? I have taken this approach because it seemed to produce a
> cleaner/more straightforward code in comparison to that. Or are you only
> talking about the chip cells/subdevices in particular?

I'd have to see the implementation, but the general exception I'm taking
here is storing the use-once MFD cell data inside a data structure.  If
you were to match on device ID it's *sometimes* acceptable to store the
pointers in local variables.
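
For illustration, a minimal sketch of the kind of approach suggested above:
match on the chip ID in probe and keep the per-chip cell tables as local
static data instead of carrying them in the driver data. The chip ID value,
cell names and helper below are hypothetical, not taken from the 88pm88x
series (needs <linux/mfd/core.h> and <linux/platform_device.h>):

static const struct mfd_cell pm886_devs[] = {
	{ .name = "88pm886-onkey" },		/* hypothetical cell names */
	{ .name = "88pm886-regulator" },
};

static int pm88x_add_subdevices(struct device *dev, unsigned int chip_id)
{
	const struct mfd_cell *cells;
	int n_cells;

	switch (chip_id) {
	case 0x00:				/* hypothetical 88PM886 ID */
		cells = pm886_devs;
		n_cells = ARRAY_SIZE(pm886_devs);
		break;
	default:
		return -ENODEV;
	}

	/* cells are consumed here once; nothing needs to live in pm88x_data */
	return devm_mfd_add_devices(dev, PLATFORM_DEVID_NONE, cells, n_cells,
				    NULL, 0, NULL);
}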

-- 
Lee Jones [李琼斯]



Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

2024-01-11 Thread Jason Wang
On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
 wrote:
>
> On 1/10/24 9:09 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  
> > wrote:
> >>
> >> To pass ownership of a live vdpa device to a new process, the user
> >> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> >> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> >> mm.  Flush workers in suspend to guarantee that no worker sees the new
> >> mm and old VA in between.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>  drivers/vhost/vdpa.c | 4 
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 8fe1562d24af..9673e8e20d11 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>  {
> >> struct vdpa_device *vdpa = v->vdpa;
> >> const struct vdpa_config_ops *ops = vdpa->config;
> >> +   struct vhost_dev *vdev = &v->vdev;
> >>
> >> if (!ops->suspend)
> >> return -EOPNOTSUPP;
> >>
> >> +   if (vdev->use_worker)
> >> +   vhost_dev_flush(vdev);
> >
> > It looks to me like it's better to check use_worker in vhost_dev_flush.
> >
>
> You can now just call vhost_dev_flush and it will do the right thing.
> The xa_for_each loop will only flush workers if they have been setup,
> so for vdpa it will not find/flush anything.

Right.

Thanks

>
>
>




Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Steven Rostedt
On Thu, 11 Jan 2024 11:34:58 -0500
Mathieu Desnoyers  wrote:


> The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1],
> and has a lot of similarities with this patch series.
> 
> LTTng has the notion of "subbuffer id" to allow atomically exchanging a
> "reader" extra subbuffer with the subbuffer to be read. It implements
> "get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader)
> to move the currently read subbuffer position. [2]
> 
> It would not hurt to compare your approach to LTTng and highlight
> similarities/differences, and the rationale for the differences.
> 
> Especially when it comes to designing kernel ABIs, it's good to make sure
> that all bases are covered, because those choices will have lasting impacts.
> 
> Thanks,
> 
> Mathieu
> 
> [1] 
> https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c
> [2] 
> https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c
> 

Hi Mathieu,

Thanks for sharing!

As we discussed a little bit in the tracing meeting, we do some things
differently, but they are very similar too ;-)

The similarity is that all the sub-buffers are mapped. You have a
reader-sub-buffer as well.

The main difference is that you use an ioctl() that returns where to find
the reader-sub-buffer, where our ioctl() is just "I'm done, get me a new
reader-sub-buffer". Ours will update the meta page.

Our meta page looks like this:

> +struct trace_buffer_meta {
> + unsigned long   entries;
> + unsigned long   overrun;
> + unsigned long   read;

If you start tracing with: trace-cmd start -e sched_switch and do:

 ~ cat /sys/kernel/tracing/per_cpu/cpu0/stats 
entries: 14
overrun: 0
commit overrun: 0
bytes: 992
oldest event ts: 84844.825372
now ts: 84847.102075
dropped events: 0
read events: 0

You'll see something similar to the above.

 entries = entries
 overrun = overrun
 read = read

The above "read" is total number of events read.

Pretty straightforward ;-)


> +
> + unsigned long   subbufs_touched;
> + unsigned long   subbufs_lost;
> + unsigned long   subbufs_read;

Now I'm thinking we may not want this exported, as I'm not sure it's useful.

Vincent, talking with Mathieu, he was suggesting that we only export what
we really need, and I don't think we need the above. Especially since they
are not exposed in the stats file.


> +
> + struct {
> + unsigned long   lost_events;
> + __u32   id;
> + __u32   read;
> + } reader;

The above is definitely needed, as all of it is used to read the
reader-page of the sub-buffer.

lost_events is the number of lost events that happened before this
sub-buffer was swapped out.
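
For reference, a minimal user-space sketch (not part of this series) of how a
consumer of the mapping might use those reader fields; it assumes 'map' and
'data' are the mmap()ed meta-page and sub-buffers, as set up in the example of
patch 4/5, and that the uapi header from patch 2/5 is available:

/* sketch only -- needs <stdio.h> and <linux/trace_mmap.h> */
static void show_reader(void *map, void *data)
{
	struct trace_buffer_meta *meta = map;
	unsigned int id = meta->reader.id;	/* which sub-buffer to read */
	void *reader = (char *)data + (size_t)id * meta->subbuf_size;

	printf("reader subbuf %u at %p: %u bytes already read, %lu events lost\n",
	       id, reader, meta->reader.read, meta->reader.lost_events);
}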

Hmm, Vincent, I just notice that you update the lost_events as:

> +static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
> +{
> + struct trace_buffer_meta *meta = cpu_buffer->meta_page;
> +
> + WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries));
> + WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun));
> + WRITE_ONCE(meta->read, cpu_buffer->read);
> +
> + WRITE_ONCE(meta->subbufs_touched, local_read(&cpu_buffer->pages_touched));
> + WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost));
> + WRITE_ONCE(meta->subbufs_read, local_read(&cpu_buffer->pages_read));
> +
> + WRITE_ONCE(meta->reader.read, cpu_buffer->reader_page->read);
> + WRITE_ONCE(meta->reader.id, cpu_buffer->reader_page->id);
> + WRITE_ONCE(meta->reader.lost_events, cpu_buffer->lost_events);
> +}

The lost_events may need to be handled differently, as it doesn't always
get cleared. So it may be stale data.


> +
> + __u32   subbuf_size;
> + __u32   nr_subbufs;

This is the information needed to read the mapped ring buffer.

> +
> + __u32   meta_page_size;
> + __u32   meta_struct_len;

The ring buffer gets mapped after "meta_page_size", and the length of this
structure is "meta_struct_len", which will change if we add new data to it.

> +};

-- Steve



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Steven Rostedt
On Thu, 11 Jan 2024 22:01:32 +0100
Christian Brauner  wrote:

> What I'm pointing out in the current logic is that the caller is
> taxed twice:
> 
> (1) Once when the VFS has done inode_permission(MAY_EXEC, "xfs")
> (2) And again when you call lookup_one_len() in eventfs_start_creating()
> _because_ the permission check in lookup_one_len() is the exact
> same permission check again that the vfs has done
> inode_permission(MAY_EXEC, "xfs").

As I described in: 
https://lore.kernel.org/all/20240110133154.6e18f...@gandalf.local.home/

The eventfs files below "events" don't need the .permission callback at
all. It's only there because the "events" inode uses it.

The .permission callback for eventfs is:

static int eventfs_permission(struct mnt_idmap *idmap,
  struct inode *inode, int mask)
{
set_top_events_ownership(inode);
return generic_permission(idmap, inode, mask);
}

Where the "set_top_events_ownership() is a nop for everything but the
"events" directory.

I guess I could have two ops:

static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
.permission = eventfs_permission,
};

static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
};

And use the second one for all dentries below the root, but I figured it's
not that big of a deal if I called the permissions on all. Perhaps I should
do it with two?

Anyway, the issue is with the "events" directory and remounting, because,
like the tracefs system, the inode and dentry for "events" are created at
boot up, before the mount happens. The VFS layer is going to check the
permissions of its inode and dentry, which will be incorrect if the mount
was mounted with a "gid" option.


-- Steve



Re: [PATCH] mm: Update mark_victim tracepoints fields

2024-01-11 Thread Carlos Galo
On Thu, Jan 11, 2024 at 9:08 AM kernel test robot  wrote:
>
> Hi Carlos,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on 0dd3ee31125508cd67f7e7172247f05b7fd1753a]
>
> url:
> https://github.com/intel-lab-lkp/linux/commits/Carlos-Galo/mm-Update-mark_victim-tracepoints-fields/20240111-081635
> base:   0dd3ee31125508cd67f7e7172247f05b7fd1753a
> patch link:
> https://lore.kernel.org/r/20240111001155.746-1-carlosgalo%40google.com
> patch subject: [PATCH] mm: Update mark_victim tracepoints fields
> config: x86_64-defconfig 
> (https://download.01.org/0day-ci/archive/20240112/202401120052.rdfjpivg-...@intel.com/config)
> compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
> reproduce (this is a W=1 build): 
> (https://download.01.org/0day-ci/archive/20240112/202401120052.rdfjpivg-...@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202401120052.rdfjpivg-...@intel.com/
>

Sorry, I missed a comma in my final editing.
I posted a V2 here:
https://lore.kernel.org/lkml/20240111210539.636607-1-carlosg...@google.com/

Thanks,
Carlos

> All errors (new ones prefixed by >>):
>
>In file included from include/trace/define_trace.h:102,
> from include/trace/events/oom.h:206,
> from mm/oom_kill.c:54:
>include/trace/events/oom.h: In function 'trace_raw_output_mark_victim':
> >> include/trace/stages/stage3_trace_output.h:6:17: error: called object is 
> >> not a function or function pointer
>6 | #define __entry field
>  | ^
>include/trace/trace_events.h:203:34: note: in definition of macro 
> 'DECLARE_EVENT_CLASS'
>  203 | trace_event_printf(iter, print);   
>  \
>  |  ^
>include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
>   45 |  PARAMS(print));   \
>  |  ^~
>include/trace/events/oom.h:74:1: note: in expansion of macro 'TRACE_EVENT'
>   74 | TRACE_EVENT(mark_victim,
>  | ^~~
>include/trace/events/oom.h:93:9: note: in expansion of macro 'TP_printk'
>   93 | TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
>  | ^
>include/trace/events/oom.h:95:17: note: in expansion of macro '__entry'
>   95 | __entry->uid
>  | ^~~
> >> include/trace/trace_events.h:203:39: error: expected expression before ')' 
> >> token
>  203 | trace_event_printf(iter, print);   
>  \
>  |   ^
>include/trace/trace_events.h:40:9: note: in expansion of macro 
> 'DECLARE_EVENT_CLASS'
>   40 | DECLARE_EVENT_CLASS(name,  \
>  | ^~~
>include/trace/events/oom.h:74:1: note: in expansion of macro 'TRACE_EVENT'
>   74 | TRACE_EVENT(mark_victim,
>  | ^~~
>
>
> vim +6 include/trace/stages/stage3_trace_output.h
>
> af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
> 2022-03-03  4)
> af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
> 2022-03-03  5) #undef __entry
> af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
> 2022-03-03 @6) #define __entry field
> af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
> 2022-03-03  7)
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki



[PATCH v2] mm: Update mark_victim tracepoints fields

2024-01-11 Thread Carlos Galo
The current implementation of the mark_victim tracepoint provides only
the process ID (pid) of the victim process. This limitation poses
challenges for userspace tools that need additional information
about the OOM victim. The association between pid and the additional
data may be lost after the kill, making it difficult for userspace to
correlate the OOM event with the specific process.

In order to mitigate this limitation, add the following fields:

- UID
   In Android each installed application has a unique UID. Including
   the `uid` assists in correlating OOM events with specific apps.

- Process Name (comm)
   Enables identification of the affected process.

- OOM Score
   Allows userspace to get additional insight into the relative kill
   priority of the OOM victim.
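
For illustration, a hedged user-space sketch (not part of this patch) of how a
tool could parse the new fields out of a trace line, based only on the
TP_printk() format used below; it assumes comm contains no spaces, which is
usually but not always the case:

/* sketch only -- parses "... pid=%d uid=%u comm=%s oom_score_adj=%hd" */
#include <stdio.h>
#include <string.h>

static int parse_mark_victim(const char *line)
{
	int pid;
	unsigned int uid;
	char comm[64];
	short adj;
	const char *p = strstr(line, "pid=");

	if (!p)
		return -1;
	if (sscanf(p, "pid=%d uid=%u comm=%63s oom_score_adj=%hd",
		   &pid, &uid, comm, &adj) != 4)
		return -1;

	printf("victim: pid=%d uid=%u comm=%s oom_score_adj=%hd\n",
	       pid, uid, comm, adj);
	return 0;
}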

Cc: Steven Rostedt 
Cc: Andrew Morton 
Cc: Suren Baghdasaryan 
Signed-off-by: Carlos Galo 
---
v2: Fixed build error. Added missing comma when printing `__entry->uid`.

 include/trace/events/oom.h | 19 +++
 mm/oom_kill.c  |  6 +-
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 26a11e4a2c36..fb8a5d1b8a0a 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -72,19 +72,30 @@ TRACE_EVENT(reclaim_retry_zone,
 );
 
 TRACE_EVENT(mark_victim,
-   TP_PROTO(int pid),
+   TP_PROTO(struct task_struct *task, uid_t uid),
 
-   TP_ARGS(pid),
+   TP_ARGS(task, uid),
 
TP_STRUCT__entry(
__field(int, pid)
+   __field(uid_t, uid)
+   __string(comm, task->comm)
+   __field(short, oom_score_adj)
),
 
TP_fast_assign(
-   __entry->pid = pid;
+   __entry->pid = task->pid;
+   __entry->uid = uid;
+   __assign_str(comm, task->comm);
+   __entry->oom_score_adj = task->signal->oom_score_adj;
),
 
-   TP_printk("pid=%d", __entry->pid)
+   TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
+   __entry->pid,
+   __entry->uid,
+   __get_str(comm),
+   __entry->oom_score_adj
+   )
 );
 
 TRACE_EVENT(wake_reaper,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e6071fde34a..0698c00c5da6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "internal.h"
@@ -753,6 +754,7 @@ static inline void queue_oom_reaper(struct task_struct *tsk)
  */
 static void mark_oom_victim(struct task_struct *tsk)
 {
+   const struct cred *cred;
struct mm_struct *mm = tsk->mm;
 
WARN_ON(oom_killer_disabled);
@@ -772,7 +774,9 @@ static void mark_oom_victim(struct task_struct *tsk)
 */
__thaw_task(tsk);
atomic_inc(&oom_victims);
-   trace_mark_victim(tsk->pid);
+   cred = get_task_cred(tsk);
+   trace_mark_victim(tsk, cred->uid.val);
+   put_cred(cred);
 }
 
 /**

base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
-- 
2.43.0.275.g3460e3d667-goog




Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-11 Thread Christian Brauner
On Wed, Jan 10, 2024 at 08:07:46AM -0500, Steven Rostedt wrote:
> On Wed, 10 Jan 2024 12:45:36 +0100
> Christian Brauner  wrote:
> 
> > So say you do:
> > 
> > mkdir /sys/kernel/tracing/instances/foo
> > 
> > After this has returned we know everything we need to know about the new
> > tracefs instance including the ownership and the mode of all inodes in
> > /sys/kernel/tracing/instances/foo/events/* and below precisely because
> > ownership is always inherited from the parent dentry and recorded in the
> > metadata struct eventfs_inode.
> > 
> > So say someone does:
> > 
> > open("/sys/kernel/tracing/instances/foo/events/xfs");
> > 
> > and say this is the first time that someone accesses that events/
> > directory.
> > 
> > When the open pathwalk is done, the vfs will determine via
> > 
> > [1] may_lookup(inode_of(events))
> > 
> > whether you are able to list entries such as "xfs" in that directory.
> > The vfs checks inode_permission(MAY_EXEC) on "events" and if that holds
> > it ends up calling i_op->eventfs_root_lookup(events).
> > 
> > At this point tracefs/eventfs adds the inodes for all entries in that
> > "events" directory including "xfs" based on the metadata it recorded
> > during the mkdir. Since now someone is actually interested in them. And
> > it initializes the inodes with ownership and everything and adds the
> > dentries that belong into that directory.
> > 
> > Nothing here depends on the permissions of the caller. The only
> > permission that mattered was done in the VFS in [1]. If the caller has
> > permissions to enter a directory they can lookup and list its contents.
> > And its contents were determined/fixed etc when mkdir was called.
> > 
> > So we just need to add the required objects into the caches (inode,
> > dentry) whose addition we intentionally defered until someone actually
> > needed them.
> > 
> > So, eventfs_root_lookup() now initializes the inodes with the ownership
> > from the stored metadata or from the parent dentry and splices in inodes
> > and dentries. No permission checking is needed for this because it is
> > always a recheck of what the vfs did in [1].
> > 
> > We now return to the vfs and path walk continues to the final component
> > that you actually want to open which is that "xfs" directory in this
> > example. We check the permissions on that inode via may_open("xfs") and
> > we open that directory returning an fd to userspace ultimately.
> > 
> > (I'm going by memory since I need to step out the door.)
> 
> So, let's say we do:
> 
>  chgrp -R rostedt /sys/kernel/tracing/

The rostedt group needs exec permissions and "other" cannot have exec
permissions otherwise you can trivially list the entries even if it's
owned by root:

chmod 750 /sys/kernel/tracing

user1@localhost:~$ ls -aln /sys/kernel/ | grep tracing
drwxr-x---   6 0 10000 Jan 11 18:23 tracing

> 
> But I don't want rostedt to have access to xfs
> 
>  chgrp -R root /sys/kernel/tracing/events/xfs

chmod 750 /sys/kernel/tracing/events/xfs

user1@localhost:~$ ls -aln /sys/kernel/tracing/events/ | grep xfs
drwxr-x--- 601 0 0 0 Jan 11 18:24 xfs

This ensures that if a user is in the group and the group has exec perms,
lookup is possible (for root this will usually work because
CAP_DAC_READ_SEARCH overrides the exec requirement).

> 
> Both actions will create the inodes and dentries of all files and
> directories (because of "-R"). But once that is done, the ref counts go to
> zero. They stay around until reclaim. But then I open Chrome ;-) and it
> reclaims all the dentries and inodes, so we are back to where we were on
> boot.
> 
> Now as rostedt I do:
> 
>  ls /sys/kernel/tracing/events/xfs
> 
> The VFS layer doesn't know if I have permission to that or not, because all
> the inodes and dentries have been freed. It has to call back to eventfs to
> find out. Which the eventfs_root_lookup() and eventfs_iterate_shared() will
> recreate the inodes with the proper permissions.

Very roughly, ignoring most of the complexity of lookup and focussing on
the permission checking:

When a caller looks up an entry in a directory then the VFS will call
inode_permission(MAY_EXEC) on the directory the caller is trying to
perform that lookup in.

If the caller wants to lookup the "events" entry in the "tracing"
directory then the VFS will call inode_permission(MAY_EXEC, "tracing")
and then - assuming it's not in the cache - call into the lookup method
of the filesystem.

After the VFS has determined that the caller has the permissions to
lookup the "events" entry in the "tracing" directory no further
permission checks are needed.

Path lookup now moves on to the next part which is looking up the "xfs"
entry in the "events" directory. So again, VFS calls
inode_permission(MAY_EXEC, "events") finds that caller has the permissions
and then goes on to call into eventfs_root_lookup().

The VFS has already given the caller a dentry for "xfs" and is passing
that down to tracefs/eventfs. So eventfs gets

Re: [PATCH] mm: Update mark_victim tracepoints fields

2024-01-11 Thread kernel test robot
Hi Carlos,

kernel test robot noticed the following build errors:

[auto build test ERROR on 0dd3ee31125508cd67f7e7172247f05b7fd1753a]

url:
https://github.com/intel-lab-lkp/linux/commits/Carlos-Galo/mm-Update-mark_victim-tracepoints-fields/20240111-081635
base:   0dd3ee31125508cd67f7e7172247f05b7fd1753a
patch link:
https://lore.kernel.org/r/20240111001155.746-1-carlosgalo%40google.com
patch subject: [PATCH] mm: Update mark_victim tracepoints fields
config: x86_64-defconfig 
(https://download.01.org/0day-ci/archive/20240112/202401120052.rdfjpivg-...@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240112/202401120052.rdfjpivg-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202401120052.rdfjpivg-...@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/trace/define_trace.h:102,
from include/trace/events/oom.h:206,
from mm/oom_kill.c:54:
   include/trace/events/oom.h: In function 'trace_raw_output_mark_victim':
>> include/trace/stages/stage3_trace_output.h:6:17: error: called object is not 
>> a function or function pointer
   6 | #define __entry field
 | ^
   include/trace/trace_events.h:203:34: note: in definition of macro 
'DECLARE_EVENT_CLASS'
 203 | trace_event_printf(iter, print); 
   \
 |  ^
   include/trace/trace_events.h:45:30: note: in expansion of macro 'PARAMS'
  45 |  PARAMS(print));   \
 |  ^~
   include/trace/events/oom.h:74:1: note: in expansion of macro 'TRACE_EVENT'
  74 | TRACE_EVENT(mark_victim,
 | ^~~
   include/trace/events/oom.h:93:9: note: in expansion of macro 'TP_printk'
  93 | TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
 | ^
   include/trace/events/oom.h:95:17: note: in expansion of macro '__entry'
  95 | __entry->uid
 | ^~~
>> include/trace/trace_events.h:203:39: error: expected expression before ')' 
>> token
 203 | trace_event_printf(iter, print); 
   \
 |   ^
   include/trace/trace_events.h:40:9: note: in expansion of macro 
'DECLARE_EVENT_CLASS'
  40 | DECLARE_EVENT_CLASS(name,  \
 | ^~~
   include/trace/events/oom.h:74:1: note: in expansion of macro 'TRACE_EVENT'
  74 | TRACE_EVENT(mark_victim,
 | ^~~


vim +6 include/trace/stages/stage3_trace_output.h

af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
2022-03-03  4) 
af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
2022-03-03  5) #undef __entry
af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
2022-03-03 @6) #define __entry field
af6b9668e85ffd include/trace/stages/stage3_defines.h Steven Rostedt (Google  
2022-03-03  7) 

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Mathieu Desnoyers

On 2024-01-11 11:17, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

   ring_buffer_{map,unmap}()
   ring_buffer_map_fault()

And controls on the ring-buffer:

   ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

   A unique ID for each subbuf of the ring-buffer, currently they are
   only identified through their in-kernel VA.

   A meta-page, where are stored ring-buffer statistics and a
   description for the current reader



Hi Vincent,

The LTTng kernel tracer has supported mmap'd buffers for nearly 15 years [1],
and has a lot of similarities with this patch series.

LTTng has the notion of "subbuffer id" to allow atomically exchanging a
"reader" extra subbuffer with the subbuffer to be read. It implements
"get subbuffer" / "put subbuffer" ioctls to allow the consumer (reader)
to move the currently read subbuffer position. [2]

It would not hurt to compare your approach to LTTng and highlight
similarities/differences, and the rationale for the differences.

Especially when it comes to designing kernel ABIs, it's good to make sure
that all bases are covered, because those choices will have lasting impacts.

Thanks,

Mathieu

[1] 
https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_mmap.c
[2] 
https://git.lttng.org/?p=lttng-modules.git;a=blob;f=src/lib/ringbuffer/ring_buffer_vfs.c

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

2024-01-11 Thread Mike Christie
On 1/10/24 9:09 PM, Jason Wang wrote:
> On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  
> wrote:
>>
>> To pass ownership of a live vdpa device to a new process, the user
>> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
>> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
>> mm.  Flush workers in suspend to guarantee that no worker sees the new
>> mm and old VA in between.
>>
>> Signed-off-by: Steve Sistare 
>> ---
>>  drivers/vhost/vdpa.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 8fe1562d24af..9673e8e20d11 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>>  {
>> struct vdpa_device *vdpa = v->vdpa;
>> const struct vdpa_config_ops *ops = vdpa->config;
>> +   struct vhost_dev *vdev = &v->vdev;
>>
>> if (!ops->suspend)
>> return -EOPNOTSUPP;
>>
>> +   if (vdev->use_worker)
>> +   vhost_dev_flush(vdev);
> 
> It looks to me like it's better to check use_worker in vhost_dev_flush.
> 

You can now just call vhost_dev_flush and it will do the right thing.
The xa_for_each loop will only flush workers if they have been setup,
so for vdpa it will not find/flush anything.
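
For reference, roughly what that flush path looks like after the worker
rework -- a sketch from memory, so the field and helper names may not match
the tree exactly:

/* sketch; names approximate */
void vhost_dev_flush(struct vhost_dev *dev)
{
	struct vhost_worker *worker;
	unsigned long i;

	/*
	 * Only workers that were actually set up are in the xarray, so for
	 * vhost-vdpa (which never creates one) the loop body never runs.
	 */
	xa_for_each(&dev->worker_xa, i, worker)
		__vhost_worker_flush(worker);
}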






[PATCH v11 4/5] Documentation: tracing: Add ring-buffer mapping

2024-01-11 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..2ba7b5339178
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,105 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort 
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer of that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the subbuf ID which can be
+safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the subbufs, ordered by ascending ID. It is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+reader_id = meta->reader.id;
+reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also
+updates the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or of
+each subbuf). It is also not possible to use snapshot or splice.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable.
+
+Example
+=======
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+#include 
+
+#define TRACE_PIPE_RAW 
"/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+int main(void)
+{
+int page_size = getpagesize(), fd, reader_id;
+unsigned long meta_len, data_len;
+struct trace_buffer_meta *meta;
+void *map, *reader, *data;
+
+fd = open(TRACE_PIPE_RAW, O_RDONLY);
+if (fd < 0)
+exit(EXIT_FAILURE);
+
+map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+if (map == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+meta = (struct trace_buffer_meta *)map;
+meta_len = meta->meta_page_size;
+
+printf("entries:%lu\n", meta->entries);
+printf("overrun:%lu\n", meta->overrun);
+printf("read:   %lu\n", meta->read);
+printf("subbufs_touched:%lu\n", meta->subbufs_touched);
+printf("subbufs_lost:   %lu\n", meta->subbufs_lost);
+printf("subbufs_read:   %lu\n", meta->subbufs_read);
+printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+data_len = meta->subbuf_size * meta->nr_subbufs;
+data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, 
data_len);
+if (data == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+exit(EXIT_FAILURE);
+
+reader_id = meta->reader.id;
+reader = data + meta->subbuf_size * reader_id;
+
+munmap(data, data_len);
+munmap(meta, meta_len);
+close (fd);
+
+return 0;
+}
-- 
2.43.0.275.g3460e3d667-goog




[PATCH v11 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-01-11 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- see include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader.id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.
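
For illustration (not part of this patch), a minimal consumer loop built on
top of that ioctl. It assumes the meta-page and the sub-buffers have already
been mmap()ed as described above, and omits error handling:

/* sketch only -- needs <sys/ioctl.h> and <linux/trace_mmap.h> */
static void consume(int fd, struct trace_buffer_meta *meta, void *data)
{
	for (;;) {
		void *reader;

		/* Waits for data (unless O_NONBLOCK), then swaps in a new
		 * reader sub-buffer and refreshes the meta-page. */
		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
			break;

		reader = (char *)data + (size_t)meta->reader.id * meta->subbuf_size;
		/* ... parse the events in 'reader' here ... */
	}
}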

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index bde39a73ce65..a797891e3ba0 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -42,4 +42,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 46dbe22121e9..7bf6c2942aea 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6472,7 +6472,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
if (tr->current_trace == &nop_trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
+
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   if (vma->vm_flags & VM_WRITE)
+   return -EPERM;
+
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+   vma->vm_ops = &tracing_buffers_vmops;
+
+   return 

[PATCH v11 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where the ring-buffer statistics and a description
  of the current reader are stored

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..bde39a73ce65
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @subbufs_touched:   Number of subbufs that have been filled.
+ * @subbufs_lost:  Number of subbufs lost to overrun.
+ * @subbufs_read:  Number of subbufs that have been read.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. From 0 to @nr_subbufs 
- 1
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @subbuf_size:   Size of each subbuf, including the header.
+ * @nr_subbufs:Number of subbufs in the ring-buffer.
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ */
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index db73e326fa04..e9ff1c95e896 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +485,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1542,6 +1549,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+   

[PATCH v11 1/5] ring-buffer: Zero ring-buffer sub-buffers

2024-01-11 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..db73e326fa04 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1466,7 +1466,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
 
list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+   mflags | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1551,7 +1552,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
if (bpage->data)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   page = alloc_pages_node(cpu_to_node(cpu),
+   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.43.0.275.g3460e3d667-goog




[PATCH v11 0/5] Introducing trace buffer mapping by user-space

2024-01-11 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from the
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (5):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 Documentation/trace/ring-buffer-map.rst   | 105 ++
 include/linux/ring_buffer.h   |   7 +
 include/uapi/linux/trace_mmap.h   |  47 +++
 kernel/trace/ring_buffer.c| 339 +-
 kernel/trace/trace.c  |  81 -
 tools/testing/selftests/ring-buffer/Makefile  |   8 +
 tools/testing/selftests/ring-buffer/config|   1 +
 .../testing/selftests/ring-buffer/map_test.c  | 188 ++
 9 files changed, 769 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/trace/ring-buffer-map.rst
 create mode 100644 include/uapi/linux/trace_mmap.h
 create mode 100644 tools/testing/selftests/ring-buffer/Makefile
 create mode 100644 tools/testing/selftests/ring-buffer/config
 create mode 100644 tools/testing/selftests/ring-buffer/map_test.c


base-commit: 4f1991a92cfe89096b2d1f5583a2e093bdd55c37
-- 
2.43.0.275.g3460e3d667-goog




Re: [RFC PATCH 1/5] mfd: 88pm88x: differences with respect to the PMIC RFC series

2024-01-11 Thread Karel Balej
On Thu Jan 11, 2024 at 4:25 PM CET, Lee Jones wrote:

[...]

> > > > diff --git a/include/linux/mfd/88pm88x.h b/include/linux/mfd/88pm88x.h
> > > > index a34c57447827..9a335f6b9c07 100644
> > > > --- a/include/linux/mfd/88pm88x.h
> > > > +++ b/include/linux/mfd/88pm88x.h
> > > > @@ -49,6 +49,8 @@ struct pm88x_data {
> > > > unsigned int whoami;
> > > > struct reg_sequence *presets;
> > > > unsigned int num_presets;
> > > > +   struct mfd_cell *devs;
> > > > +   unsigned int num_devs;
> > >
> > > Why are you adding extra abstraction?
> > 
> > Right, this is probably not necessary now since I'm only implementing
> > support for one of the chips - it's just that I keep thinking about it
> > as a driver for both of them and thus tend to write it a bit more
> > abstractly. Shall I then drop this and also the `presets` member which
> > is also chip-specific?
>
> Even if you were to support multiple devices, this strategy is unusual
> and isn't likely to be accepted.

May I please ask what the recommended strategy is then? `switch`ing on
the chip ID? I have taken this approach because it seemed to produce a
cleaner/more straightforward code in comparison to that. Or are you only
talking about the chip cells/subdevices in particular?

Thank you,
K. B.



Re: [RFC PATCH 1/5] mfd: 88pm88x: differences with respect to the PMIC RFC series

2024-01-11 Thread Lee Jones
On Thu, 11 Jan 2024, Karel Balej wrote:

> Lee,
> 
> On Thu Jan 11, 2024 at 11:54 AM CET, Lee Jones wrote:
> > The subject needs work.  Please tell us what the patch is doing.
> >
> > On Thu, 28 Dec 2023, Karel Balej wrote:
> >
> > > From: Karel Balej 
> >
> > A full and complete commit message is a must.
> 
> I have not provided a detailed description here because as I have noted
> in the cover letter, this patch will be squashed into the MFD series. I
> sent it only as a bridge between the two series, sorry for the
> confusion.
> 
> > > diff --git a/include/linux/mfd/88pm88x.h b/include/linux/mfd/88pm88x.h
> > > index a34c57447827..9a335f6b9c07 100644
> > > --- a/include/linux/mfd/88pm88x.h
> > > +++ b/include/linux/mfd/88pm88x.h
> > > @@ -49,6 +49,8 @@ struct pm88x_data {
> > >   unsigned int whoami;
> > >   struct reg_sequence *presets;
> > >   unsigned int num_presets;
> > > + struct mfd_cell *devs;
> > > + unsigned int num_devs;
> >
> > Why are you adding extra abstraction?
> 
> Right, this is probably not necessary now since I'm only implementing
> support for one of the chips - it's just that I keep thinking about it
> as a driver for both of them and thus tend to write it a bit more
> abstractly. Shall I then drop this and also the `presets` member which
> is also chip-specific?

Even if you were to support multiple devices, this strategy is unusual
and isn't likely to be accepted.

With respect to the other variables, you are in a better position to
know if they are still required.  By the sounds of it, I'd suggest it
might be better to remove them.

-- 
Lee Jones [李琼斯]



Re: [RFC PATCH 1/5] mfd: 88pm88x: differences with respect to the PMIC RFC series

2024-01-11 Thread Karel Balej
Lee,

On Thu Jan 11, 2024 at 11:54 AM CET, Lee Jones wrote:
> The subject needs work.  Please tell us what the patch is doing.
>
> On Thu, 28 Dec 2023, Karel Balej wrote:
>
> > From: Karel Balej 
>
> A full and complete commit message is a must.

I have not provided a detailed description here because as I have noted
in the cover letter, this patch will be squashed into the MFD series. I
sent it only as a bridge between the two series, sorry for the
confusion.

> > diff --git a/include/linux/mfd/88pm88x.h b/include/linux/mfd/88pm88x.h
> > index a34c57447827..9a335f6b9c07 100644
> > --- a/include/linux/mfd/88pm88x.h
> > +++ b/include/linux/mfd/88pm88x.h
> > @@ -49,6 +49,8 @@ struct pm88x_data {
> > unsigned int whoami;
> > struct reg_sequence *presets;
> > unsigned int num_presets;
> > +   struct mfd_cell *devs;
> > +   unsigned int num_devs;
>
> Why are you adding extra abstraction?

Right, this is probably not necessary now since I'm only implementing
support for one of the chips - it's just that I keep thinking about it
as a driver for both of them and thus tend to write it a bit more
abstractly. Shall I then drop this and also the `presets` member which
is also chip-specific?

Thank you, best regards,
K. B.



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-11 Thread Google
On Mon, 8 Jan 2024 15:03:21 +
Mark Rutland  wrote:

> On Mon, Jan 08, 2024 at 02:21:03PM +, Mark Rutland wrote:
> > On Mon, Jan 08, 2024 at 12:25:55PM +, Mark Rutland wrote:
> > > We also have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, but since the return 
> > > address is
> > > not on the stack at the point function-entry is intercepted we use the FP 
> > > as
> > > the retp value -- in the absence of tail calls this will be different 
> > > between a
> > > caller and callee.
> > 
> > Ah; I just spotted that this patch changed that in ftrace_graph_func(), 
> > which
> > is the source of the bug. 
> > 
> > As of this patch, we use the address of fregs->lr as the retp value, but the
> > unwinder still uses the FP value, and so when 
> > unwind_recover_return_address()
> > calls ftrace_graph_ret_addr(), the retp value won't match the expected 
> > entry on
> > the fgraph ret_stack, resulting in failing to find the expected entry.
> > 
> > Since the ftrace_regs only exist transiently during function entry/exit, 
> > it's
> > possible for a stackframe to reuse that same address on the stack, which 
> > would
> > result in finding a different entry by mistake.
> > 
> > The diff below restores the existing behaviour and fixes the issue for me.
> > Could you please fold that into this patch?
> > 
> > On a separate note, looking at how this patch changed arm64's
> > ftrace_graph_func(), do we need similar changes to arm64's
> > prepare_ftrace_return() for the old-style mcount based ftrace?
> > 
> > Mark.
> > 
> > >8
> > diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> > index 205937e04ece..329092ce06ba 100644
> > --- a/arch/arm64/kernel/ftrace.c
> > +++ b/arch/arm64/kernel/ftrace.c
> > @@ -495,7 +495,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
> > parent_ip,
> > if (bit < 0)
> > return;
> >  
> > -   if (!function_graph_enter_ops(*parent, ip, fregs->fp, parent, gops))
> > +   if (!function_graph_enter_ops(*parent, ip, fregs->fp, (void 
> > *)fregs->fp, gops))
> > *parent = (unsigned long)&return_to_handler;
> >  
> > ftrace_test_recursion_unlock(bit);
> 
> Thinking some more, this line gets excessively long when we pass the fregs 
> too,
> so it's probably worth adding a local variable for fp, i.e. the diff below.

Yeah, that will be better to keep the line short.

Thank you,

> 
> Mark.
> 
> >8
> diff --git a/arch/arm64/kernel/ftrace.c b/arch/arm64/kernel/ftrace.c
> index 205937e04ece..d4e142ef4686 100644
> --- a/arch/arm64/kernel/ftrace.c
> +++ b/arch/arm64/kernel/ftrace.c
> @@ -481,8 +481,9 @@ void prepare_ftrace_return(unsigned long self_addr, 
> unsigned long *parent,
>  void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
>struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {
> -   unsigned long *parent = &fregs->lr;
> struct fgraph_ops *gops = container_of(op, struct fgraph_ops, ops);
> +   unsigned long *parent = &fregs->lr;
> +   unsigned long fp = fregs->fp;
> int bit;
>  
> if (unlikely(ftrace_graph_is_dead()))
> @@ -495,7 +496,7 @@ void ftrace_graph_func(unsigned long ip, unsigned long 
> parent_ip,
> if (bit < 0)
> return;
>  
> -   if (!function_graph_enter_ops(*parent, ip, fregs->fp, parent, gops))
> +   if (!function_graph_enter_ops(*parent, ip, fp, (void *)fp, gops))
> *parent = (unsigned long)&return_to_handler;
>  
> ftrace_test_recursion_unlock(bit);
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-11 Thread Google
Hi Mark,

On Thu, 11 Jan 2024 11:01:56 +
Mark Rutland  wrote:

> On Thu, Jan 11, 2024 at 11:15:33AM +0900, Masami Hiramatsu wrote:
> > Hi Mark,
> > 
> > Thanks for the investigation.
> 
> Hi!
> 
> As a heads-up, I already figured out the problem and sent a fixup at:
> 
>   https://lore.kernel.org/lkml/ZZwEz8HsTa2IZE3L@FVFF77S0Q05N/
> 
> ... and a more refined fix at:
> 
>   https://lore.kernel.org/lkml/ZZwOubTSbB_FucVz@FVFF77S0Q05N/

Oops, I missed it. And I also confirmed that.

> 
> The gist was that before this patch, arm64 used the FP as the 'retp' value, 
> but
> this patch changed that to the address of fregs->lr. This meant that the 
> fgraph
> ret_stack contained all of the correct return addresses, but when the unwinder
> called ftrace_graph_ret_addr() with FP as the 'retp' value, it failed to match
> any entry in the ret_stack.

Yeah, this patch introduced a new arm64 ftrace_graph_func(), and I missed
passing the 'parent'... OK, let me fix that.

> 
> Since the fregs only exist transiently at function entry and exit, I'd prefer
> that we still use the FP as the 'retp' value, which is what I proposed in the
> fixups above.

OK. Let me add it.

Thank you!

> 
> Thanks,
> Mark.
> 
> > On Mon, 8 Jan 2024 12:25:55 +
> > Mark Rutland  wrote:
> > 
> > > Hi,
> > > 
> > > There's a bit more of an info-dump below; I'll go try to dump the fgraph 
> > > shadow
> > > stack so that we can analyse this in more detail.
> > > 
> > > On Mon, Jan 08, 2024 at 10:14:36AM +0900, Masami Hiramatsu wrote:
> > > > On Fri, 5 Jan 2024 17:09:10 +
> > > > Mark Rutland  wrote:
> > > > 
> > > > > On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) 
> > > > > wrote:
> > > > > > From: Steven Rostedt (VMware) 
> > > > > > 
> > > > > > Allow for instances to have their own ftrace_ops part of the 
> > > > > > fgraph_ops
> > > > > > that makes the function_graph tracer filter on the set_ftrace_filter 
> > > > > > file
> > > > > > of the instance and not the top instance.
> > > > > > 
> > > > > > This also changes how the function_graph handles multiple instances 
> > > > > > on the
> > > > > > shadow stack. Previously we use ARRAY type entries to record which 
> > > > > > one
> > > > > > is enabled, and this makes it a bitmap of the fgraph_array's 
> > > > > > indexes.
> > > > > > Previous function_graph_enter() expects calling back from
> > > > > > prepare_ftrace_return() function which is called back only once if 
> > > > > > it is
> > > > > > enabled. But this introduces different ftrace_ops for each fgraph
> > > > > > instance and those are called from ftrace_graph_func() one by one. 
> > > > > > Thus
> > > > > > we can not loop on the fgraph_array(), and need to reuse the 
> > > > > > ret_stack
> > > > > > pushed by the previous instance. Finding the ret_stack is easy 
> > > > > > because
> > > > > > we can check the ret_stack->func. But that is not enough for the 
> > > > > > self-
> > > > > > recursive tail-call case. Thus fgraph uses the bitmap entry to find 
> > > > > > it
> > > > > > is already set (this means that entry is for previous tail call).
> > > > > > 
> > > > > > Signed-off-by: Steven Rostedt (VMware) 
> > > > > > Signed-off-by: Masami Hiramatsu (Google) 
> > > > > 
> > > > > As a heads-up, while testing the topic/fprobe-on-fgraph branch on 
> > > > > arm64, I get
> > > > > a warning which bisets down to this commit:
> > > > 
> > > > Hmm, so does this happen when enabling function graph tracer?
> > > 
> > > Yes; I see it during the function_graph boot-time self-test if I also 
> > > enable
> > > CONFIG_IRQSOFF_TRACER=y. I can also trigger it regardless of
> > > CONFIG_IRQSOFF_TRACER if I cat /proc/self/stack with the function_graph 
> > > tracer
> > > enabled (note that I hacked the unwinder to continue after failing to 
> > > recover a
> > > return address):
> > > 
> > > | # mount -t tracefs none /sys/kernel/tracing/
> > > | # echo function_graph > /sys/kernel/tracing/current_tracer
> > > | # cat /proc/self/stack
> > > | [   37.469980] [ cut here ]
> > > | [   37.471503] WARNING: CPU: 2 PID: 174 at 
> > > arch/arm64/kernel/stacktrace.c:84 arch_stack_walk+0x2d8/0x338
> > > | [   37.474381] Modules linked in:
> > > | [   37.475501] CPU: 2 PID: 174 Comm: cat Not tainted 
> > > 6.7.0-rc2-00026-gea1e68a341c2-dirty #15
> > > | [   37.478133] Hardware name: linux,dummy-virt (DT)
> > > | [   37.479670] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > > BTYPE=--)
> > > | [   37.481923] pc : arch_stack_walk+0x2d8/0x338
> > > | [   37.483373] lr : arch_stack_walk+0x1bc/0x338
> > > | [   37.484818] sp : 8000835f3a90
> > > | [   37.485974] x29: 8000835f3a90 x28: 8000835f3b80 x27: 
> > > 8000835f3b38
> > > | [   37.488405] x26: 04341e00 x25: 8000835f4000 x24: 
> > > 80008002df18
> > > | [   37.490842] x23: 80008002df18 x22: 8000835f3b60 x21: 
> > > 80008015d240
> > > | [   37.493269] x20: 8000835f3b50 x19: 8000835f3b40 

Re: [RESEND PATCH 0/1] ALSA: virtio: add support for audio controls

2024-01-11 Thread Marcin Radomski
Hi Aiswarya,

I was looking into VirtIO audio controls support in Linux and came across this 
patch series, which seems to be marked "archived" on patchwork [0]. I would 
love to be able to use this with mainline Linux. I'm wondering about the status 
of this series - is the feature still under development, or are there some 
concerns that need to be addressed?

I'd be more than happy to help with testing.

Thanks for any insights or updates you can offer.

Regards,
Marcin Radomski

[0] 
https://patchwork.kernel.org/project/alsa-devel/patch/20230209115916.917569-2-aiswarya.cyr...@opensynergy.com/



Re: [PATCH] mm: Update mark_victim tracepoints fields

2024-01-11 Thread kernel test robot
Hi Carlos,

kernel test robot noticed the following build errors:

[auto build test ERROR on 0dd3ee31125508cd67f7e7172247f05b7fd1753a]

url:
https://github.com/intel-lab-lkp/linux/commits/Carlos-Galo/mm-Update-mark_victim-tracepoints-fields/20240111-081635
base:   0dd3ee31125508cd67f7e7172247f05b7fd1753a
patch link:
https://lore.kernel.org/r/20240111001155.746-1-carlosgalo%40google.com
patch subject: [PATCH] mm: Update mark_victim tracepoints fields
config: x86_64-rhel-8.3-rust 
(https://download.01.org/0day-ci/archive/20240111/202401112057.ljyusjje-...@intel.com/config)
compiler: ClangBuiltLinux clang version 17.0.6 
(https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240111/202401112057.ljyusjje-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202401112057.ljyusjje-...@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/oom_kill.c:54:
   In file included from include/trace/events/oom.h:206:
   In file included from include/trace/define_trace.h:102:
   In file included from include/trace/trace_events.h:237:
>> include/trace/events/oom.h:96:3: error: called object type 'uid_t' (aka 
>> 'unsigned int') is not a function or function pointer
  74 | ),
 | ~~
  75 | 
  76 | TP_fast_assign(
 | ~~~
  77 | __entry->pid = task->pid;
 | ~
  78 | __entry->uid = uid;
 | ~~~
  79 | __assign_str(comm, task->comm);
 | ~~~
  80 | __entry->oom_score_adj = task->signal->oom_score_adj;
 | ~
  81 | ),
 | ~~
  82 | 
  83 | TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
 | 
  84 | __entry->pid,
 | ~
  85 | __entry->uid
 | 
  86 | __get_str(comm),
 | ^~~~
  87 | __entry->oom_score_adj,
 | ~~~
  88 | )
 | ~
  89 | );
 | ~
   include/trace/stages/stage3_trace_output.h:20:26: note: expanded from macro 
'__get_str'
  20 | #define __get_str(field) ((char *)__get_dynamic_array(field))
 |  ^
   include/trace/stages/stage3_trace_output.h:9:43: note: expanded from macro 
'TP_printk'
   9 | #define TP_printk(fmt, args...) fmt "\n", args
 |   ^
   include/trace/trace_events.h:45:16: note: expanded from macro 'TRACE_EVENT'
  40 | DECLARE_EVENT_CLASS(name,  \
 | 
  41 |  PARAMS(proto),\
 |  ~~~
  42 |  PARAMS(args), \
 |  ~~~
  43 |  PARAMS(tstruct),  \
 |  ~~~
  44 |  PARAMS(assign),   \
 |  ~~~
  45 |  PARAMS(print));   \
 |  ~~~^~~
   include/linux/tracepoint.h:107:25: note: expanded from macro 'PARAMS'
 107 | #define PARAMS(args...) args
 | ^
   include/trace/trace_events.h:203:27: note: expanded from macro 
'DECLARE_EVENT_CLASS'
 203 | trace_event_printf(iter, print); 
   \
 |  ^
   In file included from mm/oom_kill.c:54:
   In file included from include/trace/events/oom.h:206:
   In file included from include/trace/define_trace.h:102:
   In file included from include/trace/trace_events.h:237:
>> include/trace/events/oom.h:74:1: error: expected expression
  74 | TRACE_EVENT(mark_victim,
 | ^
   include/trace/trace_events.h:40:2: note: expanded from macro 'TRACE_EVENT'
  40 | DECLARE_EVENT_CLASS(
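
Not part of the robot output, but reading the expansion above, the failure looks
like a missing comma after '__entry->uid' plus a stray trailing comma before the
closing ')' in TP_printk(). A sketch of what the corrected block in
include/trace/events/oom.h would probably look like, using only the fields
visible in the diagnostics:

	TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
		__entry->pid,
		__entry->uid,			/* comma was missing here */
		__get_str(comm),
		__entry->oom_score_adj)		/* no trailing comma before ')' */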

[PATCH RESEND v2] hwspinlock: qcom: Remove IPQ6018 SOC specific compatible

2024-01-11 Thread Vignesh Viswanathan
IPQ6018 has 32 tcsr_mutex hwlock registers with a stride of 0x1000.
The compatible string qcom,ipq6018-tcsr-mutex is mapped to
of_msm8226_tcsr_mutex, which has 32 locks configured with a stride of
0x80 and doesn't match the hardware present in IPQ6018.

Remove the IPQ6018-specific compatible string so that it falls back to
the of_tcsr_mutex data, which maps to the correct configuration for IPQ6018.

Cc: sta...@vger.kernel.org
Fixes: 5d4753f741d8 ("hwspinlock: qcom: add support for MMIO on older SoCs")
Signed-off-by: Vignesh Viswanathan 
Reviewed-by: Konrad Dybcio 
---

This patch was already posted [2] and applied [3], but is missing in the
linux-next tip. Resending with R-b tags so that it can be picked up
again.

[2] 
https://lore.kernel.org/all/20230905095535.1263113-3-quic_viswa...@quicinc.com/
[3] 
https://lore.kernel.org/all/169522934567.2501390.111220106132298.b4...@kernel.org/

Changes in v2:
 - Updated commit message
 - Added Fixes and stable tags
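
For reviewers' context (not part of the patch): the two driver-data entries
named in the commit message look roughly like this in qcom_hwspinlock.c. The
field values are quoted from memory, so treat this as a sketch:

	static const struct qcom_hwspinlock_of_data of_msm8226_tcsr_mutex = {
		.offset = 0,
		.stride = 0x80,		/* does not match IPQ6018 */
		.regmap_config = &tcsr_mutex_config,
	};

	static const struct qcom_hwspinlock_of_data of_tcsr_mutex = {
		.offset = 0,
		.stride = 0x1000,	/* what the IPQ6018 hardware uses */
		.regmap_config = &tcsr_mutex_config,
	};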

 drivers/hwspinlock/qcom_hwspinlock.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/hwspinlock/qcom_hwspinlock.c 
b/drivers/hwspinlock/qcom_hwspinlock.c
index a0fd67fd2934..814dfe8697bf 100644
--- a/drivers/hwspinlock/qcom_hwspinlock.c
+++ b/drivers/hwspinlock/qcom_hwspinlock.c
@@ -115,7 +115,6 @@ static const struct of_device_id qcom_hwspinlock_of_match[] = {
 	{ .compatible = "qcom,sfpb-mutex", .data = &of_sfpb_mutex },
 	{ .compatible = "qcom,tcsr-mutex", .data = &of_tcsr_mutex },
 	{ .compatible = "qcom,apq8084-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
-	{ .compatible = "qcom,ipq6018-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
 	{ .compatible = "qcom,msm8226-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
 	{ .compatible = "qcom,msm8974-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
 	{ .compatible = "qcom,msm8994-tcsr-mutex", .data = &of_msm8226_tcsr_mutex },
-- 
2.41.0




Re: [PATCH 2/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals

2024-01-11 Thread AngeloGioacchino Del Regno

Il 10/01/24 20:16, Konrad Dybcio ha scritto:



On 1/9/24 12:24, Luca Weiss wrote:

On Tue Jan 9, 2024 at 11:09 AM CET, Konrad Dybcio wrote:



On 1/5/24 15:54, Luca Weiss wrote:

Configure the thermals for the PA_THERM1, MSM_THERM, PA_THERM0,
RFC_CAM_THERM, CAM_FLASH_THERM and QUIET_THERM thermistors connected to
PM6150L.

Due to hardware constraints we can only register 4 zones with
pm6150l_adc_tm, the other 2 we can register via generic-adc-thermal.


Ugh.. so the ADC can support more inputs than the ADC_TM that was
designed to ship alongside it can?

And that's why the "generic-adc-thermal"-provided zones need to
be polled?


This part of the code from qcom-spmi-adc-tm5.c was triggering if I
define more than 4 channels, and looking at downstream I can also see
that only 4 zones are registered properly with adc_tm, the rest is
registered with "qcom,adc-tm5-iio" which skips from what I could tell
basically all the HW bits and only registering the thermal zone.


ret = adc_tm5_read(chip, ADC_TM5_NUM_BTM,
   &channels_available, sizeof(channels_available));
if (ret) {
    dev_err(chip->dev, "read failed for BTM channels\n");
    return ret;
}

for (i = 0; i < chip->nchannels; i++) {
    if (chip->channels[i].channel >= channels_available) {
    dev_err(chip->dev, "Invalid channel %d\n", 
chip->channels[i].channel);
    return -EINVAL;
    }
}






The trip points can really only be considered as placeholders, more
configuration with cooling etc. can be added later.

Signed-off-by: Luca Weiss 
---

[...]

I've read the sentence above, but..

+    sdm-skin-thermal {
+    polling-delay-passive = <1000>;
+    polling-delay = <5000>;
+    thermal-sensors = <_therm_sensor>;
+
+    trips {
+    active-config0 {
+    temperature = <125000>;
+    hysteresis = <1000>;
+    type = "passive";


I don't fancy burnt fingers for dinner!


With passive trip point it wouldn't even do anything now, but at what
temp do you think it should do what? I'd definitely need more time to
understand more of how the thermal setup works in downstream Android,
and then replicate a sane configuration for mainline with proper
temperatures, cooling, etc.

If "skin therm" means "the temperature of some part of the phone's
body that can be felt with a human hand", then definitely some
throttling should happen at 40ish with heavy throttling at 50
and crit at 55 or so..

We should probably make this a broader topic and keep a single
policy for all supported phones.

+ CC AGdR, may be interested in where this leads

Konrad


A thermal trip at 125°C for *skin temperature* is useless... if a device's skin
temperature (be it a smartphone, a SBC, a Chromebook, a non-specially-identified
laptop, a car head unit, or whatever else you can imagine) reaches that kind of
temperature, this means that something inside likely reached something along the
lines of 150°C for a prolonged period of time.

You will definitely agree with me that if something reached that temperature for
a certain period of time, it is *highly unlikely* (not to say impossible) that
Linux is even still running and that the green smoke that is naturally trapped 
in
any chip didn't get released :-)

Besides, keep in mind that if the SKIN temperature is 55°C and your device has
a -> lithium <- battery (li-ion/LiFePO4/others), your hands are "probably" in a
kind of danger that I wouldn't be comfortable with (and neither would you, I'm sure).

Strictly related to the trip temperature setting for "SKIN", for me this is a
strong NAK.

I'd go for stricter ranges too, something like the following (a rough devicetree sketch is included below)...
- critical: ~52/53
- heavy throttling: 49/50
- throttle: ~45
NOTE: this would be valid only if there are other throttling points for 
CPU/GPU/etc
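
For illustration only, a rough devicetree sketch of what such trips could look
like (node and sensor label names are made up, the temperatures are the ones
suggested above, and cooling maps would still be needed for the passive trips
to actually throttle anything):

	sdm-skin-thermal {
		polling-delay-passive = <1000>;
		polling-delay = <5000>;
		thermal-sensors = <&skin_temp_sensor>;	/* placeholder label */

		trips {
			skin-throttle {
				temperature = <45000>;
				hysteresis = <2000>;
				type = "passive";
			};

			skin-hot {
				temperature = <50000>;
				hysteresis = <2000>;
				type = "hot";
			};

			skin-crit {
				temperature = <53000>;
				hysteresis = <1000>;
				type = "critical";
			};
		};
	};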


Anyway, something else I have in mind:

Konrad: "standardizing" skin temperature is too big of a bet, and will be wrong,
because industrial-grade devices are permitted to reach higher skin 
temperatures.
Some industrial grade smartphones (example: rugged stuff) may have sensors in
the underside of the ruggedized shell... so in that case you want to set the 
skin
temperature limit with delta-T ideally...

Though!

Making this easier for everyone: we can somehow dictate (unofficially, for the
many reasons I already explained) an acceptable temperature range for "skin"
temperature, outside of which reviews should follow manual testing.

As for the concept of "skin temperature", and for some thermal framework work
(sorry for the word game) that I'm bringing on the table, in this case we can
imagine it as:

Thermal zone type: SKIN
Thermal zone name: shell-upper/shell-rightmost/shellsomething

Type SKIN would be defined as
"a zone defining the temperature of the outer shell of a device"...

...and for example differently from type PCB, which fits different temperature
probe points.


Feel free 

Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-11 Thread Mark Rutland
On Thu, Jan 11, 2024 at 11:15:33AM +0900, Masami Hiramatsu wrote:
> Hi Mark,
> 
> Thanks for the investigation.

Hi!

As a heads-up, I already figured out the problem and sent a fixup at:

  https://lore.kernel.org/lkml/ZZwEz8HsTa2IZE3L@FVFF77S0Q05N/

... and a more refined fix at:

  https://lore.kernel.org/lkml/ZZwOubTSbB_FucVz@FVFF77S0Q05N/

The gist was that before this patch, arm64 used the FP as the 'retp' value, but
this patch changed that to the address of fregs->lr. This meant that the fgraph
ret_stack contained all of the correct return addresses, but when the unwinder
called ftrace_graph_ret_addr() with FP as the 'retp' value, it failed to match
any entry in the ret_stack.

Since the fregs only exist transiently at function entry and exit, I'd prefer
that we still use the FP as the 'retp' value, which is what I proposed in the
fixups above.
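
To make that concrete, a condensed sketch of the pre-series arm64 entry path
(folded together from ftrace_graph_func()/prepare_ftrace_return(); a sketch,
not the exact code). The FP is both the frame_pointer and the 'retp' recorded
on the ret_stack, so it matches what the unwinder later passes back in:

	void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
			       struct ftrace_ops *op, struct ftrace_regs *fregs)
	{
		unsigned long frame_pointer = fregs->fp;
		unsigned long *parent = &fregs->lr;
		unsigned long return_hooker = (unsigned long)&return_to_handler;

		if (unlikely(atomic_read(&current->tracing_graph_pause)))
			return;

		/* FP doubles as 'retp', matching the unwinder's lookup key */
		if (!function_graph_enter(*parent, ip, frame_pointer,
					  (void *)frame_pointer))
			*parent = return_hooker;
	}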

Thanks,
Mark.

> On Mon, 8 Jan 2024 12:25:55 +
> Mark Rutland  wrote:
> 
> > Hi,
> > 
> > There's a bit more of an info-dump below; I'll go try to dump the fgraph 
> > shadow
> > stack so that we can analyse this in more detail.
> > 
> > On Mon, Jan 08, 2024 at 10:14:36AM +0900, Masami Hiramatsu wrote:
> > > On Fri, 5 Jan 2024 17:09:10 +
> > > Mark Rutland  wrote:
> > > 
> > > > On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) 
> > > > wrote:
> > > > > From: Steven Rostedt (VMware) 
> > > > > 
> > > > > Allow for instances to have their own ftrace_ops part of the 
> > > > > fgraph_ops
> > > > > that makes the function_graph tracer filter on the set_ftrace_filter 
> > > > > file
> > > > > of the instance and not the top instance.
> > > > > 
> > > > > This also changes how the function_graph handles multiple instances on 
> > > > > the
> > > > > shadow stack. Previously we use ARRAY type entries to record which one
> > > > > is enabled, and this makes it a bitmap of the fgraph_array's indexes.
> > > > > Previous function_graph_enter() expects calling back from
> > > > > prepare_ftrace_return() function which is called back only once if it 
> > > > > is
> > > > > enabled. But this introduces different ftrace_ops for each fgraph
> > > > > instance and those are called from ftrace_graph_func() one by one. 
> > > > > Thus
> > > > > we can not loop on the fgraph_array(), and need to reuse the ret_stack
> > > > > pushed by the previous instance. Finding the ret_stack is easy because
> > > > > we can check the ret_stack->func. But that is not enough for the self-
> > > > > recursive tail-call case. Thus fgraph uses the bitmap entry to find it
> > > > > is already set (this means that entry is for previous tail call).
> > > > > 
> > > > > Signed-off-by: Steven Rostedt (VMware) 
> > > > > Signed-off-by: Masami Hiramatsu (Google) 
> > > > 
> > > > As a heads-up, while testing the topic/fprobe-on-fgraph branch on 
> > > > arm64, I get
> > > > a warning which bisets down to this commit:
> > > 
> > > Hmm, so does this happen when enabling function graph tracer?
> > 
> > Yes; I see it during the function_graph boot-time self-test if I also enable
> > CONFIG_IRQSOFF_TRACER=y. I can also trigger it regardless of
> > CONFIG_IRQSOFF_TRACER if I cat /proc/self/stack with the function_graph 
> > tracer
> > enabled (note that I hacked the unwinder to continue after failing to 
> > recover a
> > return address):
> > 
> > | # mount -t tracefs none /sys/kernel/tracing/
> > | # echo function_graph > /sys/kernel/tracing/current_tracer
> > | # cat /proc/self/stack
> > | [   37.469980] [ cut here ]
> > | [   37.471503] WARNING: CPU: 2 PID: 174 at 
> > arch/arm64/kernel/stacktrace.c:84 arch_stack_walk+0x2d8/0x338
> > | [   37.474381] Modules linked in:
> > | [   37.475501] CPU: 2 PID: 174 Comm: cat Not tainted 
> > 6.7.0-rc2-00026-gea1e68a341c2-dirty #15
> > | [   37.478133] Hardware name: linux,dummy-virt (DT)
> > | [   37.479670] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > | [   37.481923] pc : arch_stack_walk+0x2d8/0x338
> > | [   37.483373] lr : arch_stack_walk+0x1bc/0x338
> > | [   37.484818] sp : 8000835f3a90
> > | [   37.485974] x29: 8000835f3a90 x28: 8000835f3b80 x27: 
> > 8000835f3b38
> > | [   37.488405] x26: 04341e00 x25: 8000835f4000 x24: 
> > 80008002df18
> > | [   37.490842] x23: 80008002df18 x22: 8000835f3b60 x21: 
> > 80008015d240
> > | [   37.493269] x20: 8000835f3b50 x19: 8000835f3b40 x18: 
> > 
> > | [   37.495704] x17:  x16:  x15: 
> > 
> > | [   37.498144] x14:  x13:  x12: 
> > 
> > | [   37.500579] x11: 800082b4d920 x10: 8000835f3a70 x9 : 
> > 8000800e55a0
> > | [   37.503021] x8 : 80008002df18 x7 : 04341e00 x6 : 
> > 
> > | [   37.505452] x5 :  x4 : 8000835f3e48 x3 : 
> > 8000835f3b80
> > | [   37.507888] x2 : 80008002df18 x1 : 07f7b000 x0 : 
> > 

Re: [RFC PATCH 1/5] mfd: 88pm88x: differences with respect to the PMIC RFC series

2024-01-11 Thread Lee Jones
The subject needs work.  Please tell us what the patch is doing.

On Thu, 28 Dec 2023, Karel Balej wrote:

> From: Karel Balej 

A full and complete commit message is a must.

> Signed-off-by: Karel Balej 
> ---
>  drivers/mfd/88pm88x.c   | 14 --
>  include/linux/mfd/88pm88x.h |  2 ++
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/mfd/88pm88x.c b/drivers/mfd/88pm88x.c
> index 5db6c65b667d..3424d88a58f6 100644
> --- a/drivers/mfd/88pm88x.c
> +++ b/drivers/mfd/88pm88x.c
> @@ -57,16 +57,16 @@ static struct reg_sequence pm886_presets[] = {
>   REG_SEQ0(PM88X_REG_BK_OSC_CTRL3, 0xc0),
>  };
>  
> -static struct resource onkey_resources[] = {
> +static struct resource pm88x_onkey_resources[] = {
>   DEFINE_RES_IRQ_NAMED(PM88X_IRQ_ONKEY, "88pm88x-onkey"),
>  };
>  
> -static struct mfd_cell pm88x_devs[] = {
> +static struct mfd_cell pm886_devs[] = {
>   {
>   .name = "88pm88x-onkey",
> - .num_resources = ARRAY_SIZE(onkey_resources),
> - .resources = onkey_resources,
> - .id = -1,
> + .of_compatible = "marvell,88pm88x-onkey",
> + .num_resources = ARRAY_SIZE(pm88x_onkey_resources),
> + .resources = pm88x_onkey_resources,
>   },
>  };
>  
> @@ -74,6 +74,8 @@ static struct pm88x_data pm886_a1_data = {
>   .whoami = PM886_A1_WHOAMI,
>   .presets = pm886_presets,
>   .num_presets = ARRAY_SIZE(pm886_presets),
> + .devs = pm886_devs,
> + .num_devs = ARRAY_SIZE(pm886_devs),
>  };
>  
>  static const struct regmap_config pm88x_i2c_regmap = {
> @@ -157,7 +159,7 @@ static int pm88x_probe(struct i2c_client *client)
>   if (ret)
>   return ret;
>  
> - ret = devm_mfd_add_devices(&client->dev, 0, pm88x_devs, ARRAY_SIZE(pm88x_devs),
> + ret = devm_mfd_add_devices(&client->dev, 0, chip->data->devs, chip->data->num_devs,
>   NULL, 0, regmap_irq_get_domain(chip->irq_data));
>   if (ret) {
>   dev_err(&client->dev, "Failed to add devices: %d\n", ret);
> diff --git a/include/linux/mfd/88pm88x.h b/include/linux/mfd/88pm88x.h
> index a34c57447827..9a335f6b9c07 100644
> --- a/include/linux/mfd/88pm88x.h
> +++ b/include/linux/mfd/88pm88x.h
> @@ -49,6 +49,8 @@ struct pm88x_data {
>   unsigned int whoami;
>   struct reg_sequence *presets;
>   unsigned int num_presets;
> + struct mfd_cell *devs;
> + unsigned int num_devs;

Why are you adding extra abstraction?

>  };
>  
>  struct pm88x_chip {
> -- 
> 2.43.0
> 

-- 
Lee Jones [李琼斯]



Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Vincent Donnefort
[...]

> > > +  */
> > > + smp_wmb();
> > > + WRITE_ONCE(cpu_buffer->mapped, 1);
> > > +
> > > + /* Init meta_page values unless the writer did it already */
> > > + cmpxchg(&cpu_buffer->meta_page->entries, 0,
> > > + local_read(&cpu_buffer->entries));
> > > + cmpxchg(&cpu_buffer->meta_page->overrun, 0,
> > > + local_read(&cpu_buffer->overrun));
> > > + cmpxchg(&cpu_buffer->meta_page->subbufs_touched, 0,
> > > + local_read(&cpu_buffer->pages_touched));
> > > + cmpxchg(&cpu_buffer->meta_page->subbufs_lost, 0,
> > > + local_read(&cpu_buffer->pages_lost));
> > 
> > Instead of using these cmpxchg, can we move this initialization before
> > smp_wmb()? Thus we can avoid conflict with rb_update_meta_page()
> 
Good point. Not sure why I made this more complicated than it should be.

Ha, I know, that was before v6, where the writer was updating all of that. From
v6 the update has been moved to the irq_rb_work.

> 
> > 
> > Thank you,
> >
> 
> [...]



Re: [PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-11 Thread Vincent Donnefort
On Tue, Jan 09, 2024 at 06:58:13PM -0500, Steven Rostedt wrote:
> On Wed, 10 Jan 2024 08:42:05 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > On Tue, 9 Jan 2024 15:13:51 +
> > Vincent Donnefort  wrote:
> > 
> > > > > @@ -388,6 +389,7 @@ struct rb_irq_work {
> > > > >   boolwaiters_pending;
> > > > >   boolfull_waiters_pending;
> > > > >   boolwakeup_full;
> > > > > + boolis_cpu_buffer;  
> > > > 
> > > > I think 'is_cpu_buffer' is a bit unclear (or generic),
> > > > what about 'meta_page_update'?  
> > > 
> > > Hum not sure about that change. This was really to identify if parent of
> > > rb_irq_work is a cpu_buffer or a trace_buffer. It can be a cpu_buffer 
> > > regardless
> > > of the need to update the meta-page.  
> > 
> > Yeah, I just meant that it is "for_cpu_buffer", not "rb_irq_work 
> > is_cpu_buffer".
> > So when reading the code, I just felt uncomfortable.
> > 
> 
> How about "in_cpu_buffer" as that is what it is.
> 
> struct ring_buffer_per_cpu {
>   struct rb_irq_work {
>   boolin_cpu_buffer;
>   }
> }
> 
> Would that make you feel more comfortable? ;-)
> 
> -- Steve

I'll actually solve that by moving that update from the rb_irq_work to
ring_buffer_map_get_reader().

The reason is that the rb_irq_work is only triggered when !O_NONBLOCK is set.
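
For the archive, a sketch of that direction (reusing the rb_update_meta_page()
name from the quoted discussion; the exact shape may well differ in the next
version): ring_buffer_map_get_reader() would refresh the meta-page itself after
swapping the reader page, instead of relying on the irq work:

	static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
	{
		struct trace_buffer_meta *meta = cpu_buffer->meta_page;

		if (!meta)
			return;

		WRITE_ONCE(meta->entries, local_read(&cpu_buffer->entries));
		WRITE_ONCE(meta->overrun, local_read(&cpu_buffer->overrun));
		WRITE_ONCE(meta->subbufs_touched, local_read(&cpu_buffer->pages_touched));
		WRITE_ONCE(meta->subbufs_lost, local_read(&cpu_buffer->pages_lost));
	}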




Re: [RFC v1 0/8] vhost-vdpa: add support for iommufd

2024-01-11 Thread Cindy Lu
On Thu, Jan 11, 2024 at 6:25 AM Michael S. Tsirkin  wrote:
>
> On Sat, Nov 04, 2023 at 01:16:33AM +0800, Cindy Lu wrote:
> >
> > Hi All
> > This code provides the iommufd support for vdpa device
> > This code fixes the bugs from the last version and also add the asid 
> > support. rebase on kernel
> > v6,6-rc3
> > Test passed in the physical device (vp_vdpa), but  there are still some 
> > problems in the emulated device (vdpa_sim_net),
> > I will continue working on it
> >
> > The kernel code is
> > https://gitlab.com/lulu6/vhost/-/tree/iommufdRFC_v1
> >
> > Signed-off-by: Cindy Lu 
>
> Was this abandoned?
>
Thanks Michael. I'm really sorry for the delay. I will continue working on this.
Thanks
Cindy
> >
> > Cindy Lu (8):
> >   vhost/iommufd: Add the functions support iommufd
> >   Kconfig: Add the new file vhost/iommufd
> >   vhost: Add 3 new uapi to support iommufd
> >   vdpa: Add new vdpa_config_ops to support iommufd
> >   vdpa_sim :Add support for iommufd
> >   vdpa: change the map/unmap process to support iommufd
> >   vp_vdpa::Add support for iommufd
> >   iommu: expose the function iommu_device_use_default_domain
> >
> >  drivers/iommu/iommu.c |   2 +
> >  drivers/vdpa/vdpa_sim/vdpa_sim.c  |   8 ++
> >  drivers/vdpa/virtio_pci/vp_vdpa.c |   4 +
> >  drivers/vhost/Kconfig |   1 +
> >  drivers/vhost/Makefile|   1 +
> >  drivers/vhost/iommufd.c   | 178 +
> >  drivers/vhost/vdpa.c  | 210 +-
> >  drivers/vhost/vhost.h |  21 +++
> >  include/linux/vdpa.h  |  38 +-
> >  include/uapi/linux/vhost.h|  66 ++
> >  10 files changed, 525 insertions(+), 4 deletions(-)
> >  create mode 100644 drivers/vhost/iommufd.c
> >
> > --
> > 2.34.3
>




Re: [PATCH 2/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals

2024-01-11 Thread Luca Weiss
On Wed Jan 10, 2024 at 8:16 PM CET, Konrad Dybcio wrote:
>
>
> On 1/9/24 12:24, Luca Weiss wrote:
> > On Tue Jan 9, 2024 at 11:09 AM CET, Konrad Dybcio wrote:
> >>
> >>
> >> On 1/5/24 15:54, Luca Weiss wrote:
> >>> Configure the thermals for the PA_THERM1, MSM_THERM, PA_THERM0,
> >>> RFC_CAM_THERM, CAM_FLASH_THERM and QUIET_THERM thermistors connected to
> >>> PM6150L.
> >>>
> >>> Due to hardware constraints we can only register 4 zones with
> >>> pm6150l_adc_tm, the other 2 we can register via generic-adc-thermal.
> >>
> >> Ugh.. so the ADC can support more inputs than the ADC_TM that was
> >> designed to ship alongside it can?
> >>
> >> And that's why the "generic-adc-thermal"-provided zones need to
> >> be polled?
> > 
> > This part of the code from qcom-spmi-adc-tm5.c was triggering if I
> > define more than 4 channels, and looking at downstream I can also see
> > that only 4 zones are registered properly with adc_tm, the rest is
> > registered with "qcom,adc-tm5-iio" which skips from what I could tell
> > basically all the HW bits and only registering the thermal zone.
> > 
> > 
> > ret = adc_tm5_read(chip, ADC_TM5_NUM_BTM,
> > &channels_available, sizeof(channels_available));
> > if (ret) {
> > dev_err(chip->dev, "read failed for BTM channels\n");
> > return ret;
> > }
> > 
> > for (i = 0; i < chip->nchannels; i++) {
> > if (chip->channels[i].channel >= channels_available) {
> > dev_err(chip->dev, "Invalid channel %d\n", 
> > chip->channels[i].channel);
> > return -EINVAL;
> > }
> > }
> > 
> > 
> >>
> >>>
> >>> The trip points can really only be considered as placeholders, more
> >>> configuration with cooling etc. can be added later.
> >>>
> >>> Signed-off-by: Luca Weiss 
> >>> ---
> >> [...]
> >>
> >> I've read the sentence above, but..
> >>> + sdm-skin-thermal {
> >>> + polling-delay-passive = <1000>;
> >>> + polling-delay = <5000>;
> >>> + thermal-sensors = <_therm_sensor>;
> >>> +
> >>> + trips {
> >>> + active-config0 {
> >>> + temperature = <125000>;
> >>> + hysteresis = <1000>;
> >>> + type = "passive";
> >>
> >> I don't fancy burnt fingers for dinner!
> > 
> > With passive trip point it wouldn't even do anything now, but at what
> > temp do you think it should do what? I'd definitely need more time to
> > understand more of how the thermal setup works in downstream Android,
> > and then replicate a sane configuration for mainline with proper
> > temperatures, cooling, etc.
> If "skin therm" means "the temperature of some part of the phone's
> body that can be felt with a human hand", then definitely some
> throttling should happen at 40ish with heavy throttling at 50
> and crit at 55 or so..
>
> We should probably make this a broader topic and keep a single
> policy for all supported phones.

I agree that this shouldn't be implemented differently per device since
it's really more a question of "what should Linux do", which is quite a
general question and not device-specific. Of course some device-specific
tweaks could be here and there, like whether the phone has a metal back or
a plastic back, but that's only minor.

Based on the config here
https://gerrit-public.fairphone.software/plugins/gitiles/platform/hardware/qcom/thermal/+/refs/heads/odm/dev/target/13/fp5/thermalConfig.cpp#946
it looks like throttling starts for internal components at 95degC with
a shutdown threshold of 115degC.
The skin sensor here has a throttling threshold of 40degC and a shutdown
threshold of 95degC.

But actually I'm not even sure this config is active for QCM6490 with
socid=497. So yeah, I need more time digging into the thermal code to see
what it's actually doing... Not that it would/should be much different
for socid=497, I guess, though.

There's also plenty of thermal code in qcom proprietary.

Regards
Luca

>
> + CC AGdR, may be interested in where this leads
>
> Konrad




Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES

2024-01-11 Thread David Hildenbrand

On 10.01.24 07:22, Zheyun Shen wrote:

For now, SEV pins guest's memory to avoid swapping or
moving ciphertext, but leading to the inhibition of
Memory Ballooning.

In Memory Ballooning, only guest's free pages will be relocated
in balloon inflation and deflation, so the difference of plaintext
doesn't matter to guest.


A Linux hypervisor will always give you a fresh, zeroed page. I don't 
recall what the spec says; it could be that this is a guarantee.




Memory Ballooning is a nice memory overcommitment technology can
be used in CVM based on SEV and SEV-ES, so userspace tools can
provide an option to allow SEV not to pin memory and enable
Memory Ballooning. Guest kernel may not inhibit Balloon and
should set shared memory for Balloon decrypted.


Two points:

1) Memory overcommit means that you promise to have more memory than you 
actually have.


To be able to use that in a *safe* way in the hypervisor, to fulfill 
that promise, you need some backup strategy, which is usually swap space 
in the hypervisor. Further one might apply other techniques (ram 
compression, memory deduplication) in the hypervisor to make that 
swapping unlikely to ever happen when overcommitting (because nobody 
wants to swap).


Assume you run a lot of VMs that mostly have private/encrypted memory 
(which is the default). Imagine you previously inflated the balloon on 
VM0, and that VM needs more memory (you promised it could have more!). 
You reach out to other VMs to inflate the balloon so you get memory 
back, but they cannot give up memory safely.


In that scenario, (a) you cannot swap something out because all pages are 
pinned, (b) memory compression cannot be applied because pages are pinned, 
and (c) memory deduplication cannot be applied because pages are pinned.


Pinned memory is a resource that cannot be overcommitted.

So I am not convinced the use case you are targeting can be considered 
any sane form of memory overcommit. It would be better to call it resizing VM 
memory instead. Then it's clearer that the hypervisor cannot promise to 
ever give you that memory when you are in need.



2) What about other features?

What if the hypervisor enabled free-page-reporting? Would that work 
(unlikely, I assume)? Don't we have to block that?


--
Cheers,

David / dhildenb




Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES

2024-01-11 Thread David Hildenbrand

For now, SEV pins guest's memory to avoid swapping or
moving ciphertext, but leading to the inhibition of
Memory Ballooning.

In Memory Ballooning, only guest's free pages will be relocated
in balloon inflation and deflation, so the difference of plaintext
doesn't matter to guest.



This seems only true if the page is zeroed, is this true here?


Sorry, I cannot figure out why the pages should be zeroed. I think
both host kernel and guest kernel assume that the pages are not
zeroed and will use kzalloc or manually zero them in real applications,
which is the same as in non-SEV environments.


balloon_page_alloc() will not zero the memory (no __GFP_ZERO set). Only 
in some configurations (zero-on-alloc, zero-on-free), the kernel would 
do that implicitly.
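
For reference, the allocation path is roughly the following (paraphrased from
mm/balloon_compaction.c, so double-check against the current tree); note there
is no __GFP_ZERO in the mask, so any zeroing comes from options like
init_on_alloc/init_on_free:

	struct page *balloon_page_alloc(void)
	{
		/* GFP_HIGHUSER[_MOVABLE] plus NOMEMALLOC/NORETRY/NOWARN, no __GFP_ZERO */
		struct page *page = alloc_page(balloon_mapping_gfp_mask() |
					       __GFP_NOMEMALLOC | __GFP_NORETRY |
					       __GFP_NOWARN);
		return page;
	}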


So we'd eventually be leaking secrets to the untrusted hypervisor?



I have tested in SEV-ES: reclaiming memory by balloon inflation and reusing
it after balloon deflation both work well with the patch. The hypervisor
can normally give the reclaimed memory from one CVM to another, or give it
back to the original CVM.


I'll comment on your misconception of memory overcommit separately.

--
Cheers,

David / dhildenb