[PATCH] tracing/kprobes: Fix the order of argument descriptions

2023-10-30 Thread Yujie Liu
The order of descriptions should be consistent with the argument list of
the function, so "kretprobe" should be the second one.

int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe,
 const char *name, const char *loc, ...)

Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions")
Suggested-by: Mukesh Ojha 
Signed-off-by: Yujie Liu 
---
 kernel/trace/trace_kprobe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index e834f149695b..47812aa16bb5 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1020,9 +1020,9 @@ EXPORT_SYMBOL_GPL(kprobe_event_cmd_init);
 /**
  * __kprobe_event_gen_cmd_start - Generate a kprobe event command from arg list
  * @cmd: A pointer to the dynevent_cmd struct representing the new event
+ * @kretprobe: Is this a return probe?
  * @name: The name of the kprobe event
  * @loc: The location of the kprobe event
- * @kretprobe: Is this a return probe?
  * @...: Variable number of arg (pairs), one pair for each field
  *
  * NOTE: Users normally won't want to call this function directly, but
-- 
2.34.1




Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-30 Thread Verma, Vishal L
On Mon, 2023-10-30 at 11:20 +0100, David Hildenbrand wrote:
> On 26.10.23 00:44, Vishal Verma wrote:
> > 
[..]

> > @@ -2146,11 +2186,69 @@ void try_offline_node(int nid)
> >   }
> >   EXPORT_SYMBOL(try_offline_node);
> >   
> > -static int __ref try_remove_memory(u64 start, u64 size)
> > +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> >   {
> > -   struct memory_block *mem;
> > -   int rc = 0, nid = NUMA_NO_NODE;
> > +   unsigned long memblock_size = memory_block_size_bytes();
> > struct vmem_altmap *altmap = NULL;
> > +   struct memory_block *mem;
> > +   u64 cur_start;
> > +   int rc;
> > +
> > +   /*
> > +    * For memmap_on_memory, the altmaps could have been added on
> > +    * a per-memblock basis. Loop through the entire range if so,
> > +    * and remove each memblock and its altmap.
> > +    */
> 
> /*
>   * altmaps where added on a per-memblock basis; we have to process
>   * each individual memory block.
>   */
> 
> > +   for (cur_start = start; cur_start < start + size;
> > +    cur_start += memblock_size) {
> > +   rc = walk_memory_blocks(cur_start, memblock_size, &mem,
> > +   test_has_altmap_cb);
> > +   if (rc) {
> > +   altmap = mem->altmap;
> > +   /*
> > +    * Mark altmap NULL so that we can add a debug
> > +    * check on memblock free.
> > +    */
> > +   mem->altmap = NULL;
> > +   }
> 
> Simpler (especially, we know that there must be an altmap):
> 
> mem = find_memory_block(pfn_to_section_nr(cur_start));
> altmap = mem->altmap;
> mem->altmap = NULL;
> 
> I think we might be able to remove test_has_altmap_cb() then.
> 
> > +
> > +   remove_memory_block_devices(cur_start, memblock_size);
> > +
> > +   arch_remove_memory(cur_start, memblock_size, altmap);
> > +
> > +   /* Verify that all vmemmap pages have actually been freed. */
> > +   if (altmap) {
> 
> There must be an altmap, so this can be done unconditionally.

Hi David,

All other comments make sense, making those changes now.

However for this one, does the WARN() below go away then?

I was wondering if maybe arch_remove_memory() is responsible for
freeing the altmap here, and at this stage we're just checking if that
happened. If it didn't, WARN and then free it.

I drilled down the path, and I don't see altmap actually getting freed
in vmem_altmap_free(), but I wasn't sure if  was meant
to free it as altmap->alloc went down to 0.

> 
> > +   WARN(altmap->alloc, "Altmap not fully unmapped");
> > +   kfree(altmap);
> > +   }
> > +   }
> > +}
> 
> 



Re: [PATCH] eventfs: Hold eventfs_mutex when calling callback functions

2023-10-30 Thread Naresh Kamboju
On Mon, 30 Oct 2023 at 21:10, Steven Rostedt  wrote:
>
> From: "Steven Rostedt (Google)" 
>
> The callback function that is used to create inodes and dentries is not
> protected by anything and the data that is passed to it could become
> stale. After eventfs_remove_dir() is called by the tracing system, it is
> free to remove the events that are associated to that directory.
> Unfortunately, that means the callbacks must not be called after that.
>
>  CPU0   CPU1
>     
>  eventfs_root_lookup() {
>  eventfs_remove_dir() {
>   mutex_lock(&eventfs_mutex);
>   ei->is_freed = set;
>   mutex_unlock(&eventfs_mutex);
>  }
>  kfree(event_call);
>
> for (...) {
>   entry = &ei->entries[i];
>   r = entry->callback() {
>   call = data;  // call == event_call above
>   if (call->flags ...)
>
>  [ USE AFTER FREE BUG ]
>
> The safest way to protect this is to wrap the callback with:
>
>  mutex_lock(&eventfs_mutex);
>  if (!ei->is_freed)
>  r = entry->callback();
>  else
>  r = -1;
>  mutex_unlock(&eventfs_mutex);
>
> This will make sure that the callback will not be called after it is
> freed. But now it needs to be known that the callback is called while
> holding internal eventfs locks, and that it must not call back into the
> eventfs / tracefs system. There's no reason it should anyway, but document
> that as well.
>
> Link: https://lore.kernel.org/all/CA+G9fYu9GOEbD=rR5eMR-=HJ8H6rMsbzDC2ZY5=y50wpwae...@mail.gmail.com/
>
> Reported-by: Linux Kernel Functional Testing 
> Reported-by: Naresh Kamboju 
> Signed-off-by: Steven Rostedt (Google) 

Tested-by: Linux Kernel Functional Testing 
Tested-by: Naresh Kamboju 

> ---
>  fs/tracefs/event_inode.c | 22 ++--
>  include/linux/tracefs.h  | 43 


--
Linaro LKFT
https://lkft.linaro.org



Re: [PATCH v2] eventfs: Test for ei->is_freed when accessing ei->dentry

2023-10-30 Thread Naresh Kamboju
On Mon, 30 Oct 2023 at 20:11, Steven Rostedt  wrote:
>
> On Mon, 30 Oct 2023 12:37:08 +0530
> Naresh Kamboju  wrote:
>
> >
> > I have tested the linux-trace.git trace/core and run selftests ftrace
> > the reported kernel panic [1] & [2] has been fixed but found
>
> Good to know. Can I add "Tested-by" from you for that bug fix?

Please add,
Tested-by: Linux Kernel Functional Testing 
Tested-by: Naresh Kamboju 

>
> > "general protection fault"  at  kernel/trace/trace_events.c:2439.
>
> Can you test with the below patch?

Applied this patch on top of linux-trace.git trace/core and the test passed.
All the reported issues have been resolved.

>
> Also, can I ask what are you testing this on that makes it trigger so
> easily? As I'm not able to trigger these in my tests, even though they are
> indeed bugs.

I use the following build artifacts and run the selftest ftrace: ftracetest-ktap

kernel:
  url: https://storage.tuxsuite.com/public/linaro/naresh/builds/2XT84xYJIMmKApmqOOtKhnLdCyz/bzImage
modules:
  url: https://storage.tuxsuite.com/public/linaro/naresh/builds/2XT84xYJIMmKApmqOOtKhnLdCyz/modules.tar.xz
  compression: xz
nfsrootfs:
  url: https://storage.tuxboot.com/debian/bookworm/amd64/rootfs.tar.xz
kselftest:
  url: https://storage.tuxsuite.com/public/linaro/naresh/builds/2XT84xYJIMmKApmqOOtKhnLdCyz/kselftest.tar.xz


Steps to test:
==

1) Boot x86 machine with above bzImage, modules, rootfs and
copy kselftest.tar.xz from above url and unzip.
2) cd /opt/kselftests/default-in-kernel/ftrace
3) ./run_kselftest.sh  -c ftrace


# selftests: ftrace: ftracetest-ktap
# unlink: cannot unlink
'/opt/kselftests/default-in-kernel/ftrace/logs/latest': No such file
or directory
# TAP version 13
# 1..129
# ok 1 Basic trace file check
...
# # Totals: pass:126 fail:0 xfail:1 xpass:0 skip:1 error:1


Test logs:
https://lkft.validation.linaro.org/scheduler/job/6985018#L4742




>
> -- Steve
>
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 7ad7496bd597..7a0b54ddda24 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -609,7 +609,13 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
> entry = &ei->entries[i];
> if (strcmp(name, entry->name) == 0) {
> void *cdata = data;
> -   r = entry->callback(name, &mode, &cdata, &fops);
> +   mutex_lock(&eventfs_mutex);
> +   /* If ei->is_freed, then the event itself may be too */
> +   if (!ei->is_freed)
> +   r = entry->callback(name, &mode, &cdata, &fops);
> +   else
> +   r = -1;
> +   mutex_unlock(&eventfs_mutex);
> if (r <= 0)
> continue;
> ret = simple_lookup(dir, dentry, flags);
> @@ -743,7 +749,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
> void *cdata = data;
> entry = &ei->entries[i];
> name = entry->name;
> -   r = entry->callback(name, &mode, &cdata, &fops);
> +   mutex_lock(&eventfs_mutex);
> +   /* If ei->is_freed, then the event itself may be too */
> +   if (!ei->is_freed)
> +   r = entry->callback(name, &mode, &cdata, &fops);
> +   else
> +   r = -1;
> +   mutex_unlock(&eventfs_mutex);
> if (r <= 0)
> continue;
> d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, false);



Re: [PATCH 9/9] arm64: dts: qcom: qcm6490-fairphone-fp5: Enable WiFi

2023-10-30 Thread Konrad Dybcio
On 27.10.2023 16:20, Luca Weiss wrote:
> Now that the WPSS remoteproc is enabled, enable wifi so we can use it.
> 
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts 
> b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> index d65eef30091b..e7e20f73cbe6 100644
> --- a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> +++ b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> @@ -713,3 +713,7 @@ &venus {
>   firmware-name = "qcom/qcm6490/fairphone5/venus.mbn";
>   status = "okay";
>  };
> +
> +&wifi {
> + status = "okay";
qcom,ath11k-calibration-variant?

Konrad


Re: [PATCH 8/9] arm64: dts: qcom: qcm6490-fairphone-fp5: Enable various remoteprocs

2023-10-30 Thread Konrad Dybcio
On 27.10.2023 16:20, Luca Weiss wrote:
> Enable the ADSP, CDSP, MPSS and WPSS that are found on the SoC.
> 
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts 
> b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> index cc092735ce17..d65eef30091b 100644
> --- a/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> +++ b/arch/arm64/boot/dts/qcom/qcm6490-fairphone-fp5.dts
> @@ -490,6 +490,26 @@ _id_1 {
>   status = "okay";
>  };
>  
> +&remoteproc_adsp {
> + firmware-name = "qcom/qcm6490/fairphone5/adsp.mdt";
> + status = "okay";
> +};
> +
> +&remoteproc_cdsp {
> + firmware-name = "qcom/qcm6490/fairphone5/cdsp.mdt";
> + status = "okay";
> +};
> +
> +&remoteproc_mpss {
> + firmware-name = "qcom/qcm6490/fairphone5/modem.mdt";
> + status = "okay";
> +};
> +
> +&remoteproc_wpss {
> + firmware-name = "qcom/qcm6490/fairphone5/wpss.mdt";
mbn?

Konrad


[GIT PULL] tracing/tools: Updates for 6.7

2023-10-30 Thread Daniel Bristot de Oliveira
Steven,

Tracing tools changes for 6.7:

RTLA:
  - On rtla/utils.c, initialize the 'found' variable to avoid garbage
when a mount point is not found.

Verification:
  - Remove duplicated imports on dot2k python script


Please pull the latest tracing-tools-v6.7 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
tracing-tools-v6.7

Tag SHA1: eadb7e76bc7ce4174d4905a29a4a07cd835508fe
Head SHA1: 696444a544ecd6d62c1edc89516b376cefb28929


Alessandro Carminati (Red Hat) (1):
  verification/dot2k: Delete duplicate imports

Colin Ian King (1):
  rtla: Fix uninitialized variable found


 tools/tracing/rtla/src/utils.c | 2 +-
 tools/verification/dot2/dot2k  | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)
---
diff --git a/tools/tracing/rtla/src/utils.c b/tools/tracing/rtla/src/utils.c
index 623a38908ed5..c769d7b3842c 100644
--- a/tools/tracing/rtla/src/utils.c
+++ b/tools/tracing/rtla/src/utils.c
@@ -538,7 +538,7 @@ static const int find_mount(const char *fs, char *mp, int sizeof_mp)
 {
char mount_point[MAX_PATH];
char type[100];
-   int found;
+   int found = 0;
FILE *fp;
 
fp = fopen("/proc/mounts", "r");
diff --git a/tools/verification/dot2/dot2k b/tools/verification/dot2/dot2k
index 9dcd38abe20a..d4d7e52d549e 100644
--- a/tools/verification/dot2/dot2k
+++ b/tools/verification/dot2/dot2k
@@ -15,8 +15,6 @@ if __name__ == '__main__':
 import os
 import platform
 import sys
-import sys
-import argparse
 
parser = argparse.ArgumentParser(description='transform .dot file into kernel rv monitor')
 parser.add_argument('-d', "--dot", dest="dot_file", required=True)



[PATCH v6 11/12] Docs/x86/sgx: Add description for cgroup support

2023-10-30 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
---
V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..dfc8fac13ab2 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,77 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates 
distribution of SGX
+EPC memory, which is a subset of system RAM that is used to provide 
SGX-enabled applications
+with protected memory, and is otherwise inaccessible, i.e. shows up as 
reserved in /proc/iomem
+and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all 
intents and
+purposes the EPC is independent from normal system memory, e.g. must be 
reserved at boot from
+RAM and cannot be converted between EPC and normal memory while the system is 
running.  The EPC
+is managed by the SGX subsystem and is not accounted by the memory controller. 
 Note that this
+is true only for EPC memory itself, i.e.  normal memory allocations related to 
SGX and EPC
+memory, e.g. the backing memory for evicted EPC pages, are accounted, limited 
and protected by
+the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual 
memory techniques
+and pages can be swapped out of the EPC to their backing store (normal system 
memory allocated
+via shmem).  The SGX EPC subsystem is analogous to the memory subsystem, and 
it implements
+limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+---
+
+For a generic description of the Miscellaneous controller interface files, 
please see
+Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise.  
If a value which
+is not PAGE_SIZE aligned is written, the actual value used by the controller 
will be rounded
+down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup.  The 
sgx_epc resource will
+show the total amount of EPC memory available on the platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups.  The 
sgx_epc resource will
+show the current active EPC memory usage of the cgroup and its 
descendants. EPC pages
+that are swapped out to backing RAM are not included in the current 
count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The 
sgx_epc resource
+will show the EPC usage hard limit. The default is "max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g. for 
page fault
+handling, will be blocked until EPC can be reclaimed from the cgroup. 
If there are no
+pages left that are reclaimable within the same group, the kernel 
returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not reclaimable by
+the host kernel. In case the guest cgroup's limit is reached and no 
reclaimable pages
+left in the same cgroup, the virtual EPC driver returns SIGBUS to the 
user space
+process to indicate failure on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower 
than the current
+usage to this file, the cgroup will not preemptively deallocate pages 
currently in use,
+and will only start blocking the next allocation and reclaiming EPC at 
that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root cgroups.
+A value change in this file generates a file modified event.
+
+  max
+The number of times the cgroup has triggered a reclaim
+due to its EPC usage approaching (or 

[PATCH v6 10/12] x86/sgx: Implement EPC reclamation for cgroup

2023-10-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Currently all reclaimable pages are tracked only in the global LRU list,
and only the global reclaimer(ksgxd) performs reclamation when the
global free page counts are lower than a threshold. When a cgroup limit
is reached, the cgroup also needs to try to reclaim pages allocated
within the group. This patch enables per-cgroup reclamation.

Add a helper function sgx_lru_list(), that for a given EPC page, returns
the LRU list of the cgroup that is assigned to the EPC page at
allocation time. This helper is used to replace the hard coded global
LRU wherever appropriate: modify sgx_mark/unmark_page_reclaimable() to
track EPCs in the LRU list of the appropriate cgroup; modify
sgx_do_epc_reclamation() to return unreclaimed pages back to proper
cgroup.

Implement the reclamation flow for cgroup, encapsulated in the top-level
function sgx_epc_cgroup_reclaim_pages(). Just like the global reclaimer,
the cgroup reclaimer first isolates candidate pages for reclaim, then
invokes sgx_do_epc_reclamation(). The only difference is that a cgroup
does a pre-order walk on its subtree to scan for candidate pages from
its own LRU and LRUs in its descendants.

In some contexts, e.g. page fault handling, only asynchronous
reclamation is allowed. Create a workqueue, 'sgx_epc_cg_wq',
corresponding work item and function definitions to support the
asynchronous reclamation. Add a Boolean parameter for
sgx_epc_cgroup_try_charge() to indicate whether synchronous reclaim is
allowed or not. Both synchronous and asynchronous flows invoke the same
top level reclaim function, sgx_epc_cgroup_reclaim_pages().

All reclaimable pages are tracked in per-cgroup LRUs when cgroup is
enabled. Update the original global reclaimer to reclaim from the root
cgroup when cgroup is enabled, also calling
sgx_epc_cgroup_reclaim_pages().

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Drop EPC OOM killer.(Dave, Michal)
- Patch restructuring: this includes part split from the patch, "Limit process 
EPC usage with
misc cgroup controller", and combined with "Prepare for multiple LRUs"
- Removed force reclamation ignoring 'youngness' of the pages
- Removed checking for capacity in reclamation loop.
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  19 ++-
 arch/x86/kernel/cpu/sgx/main.c   |  71 ++---
 3 files changed, 289 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 500627d0563f..110d44c0ef7c 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,6 +5,38 @@
 #include 
 #include "epc_cgroup.h"
 
+#define SGX_EPC_RECLAIM_MIN_PAGES  16U
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+   return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+   return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors.  Used in
+ * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is over its limit
+ * or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+   struct misc_cg *i = epc_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+}
+
 static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
 {
return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
@@ -15,12 +47,188 @@ static inline bool sgx_epc_cgroup_disabled(void)
return !cgroup_subsys_enabled(misc_cgrp_subsys);
 }
 
+/**
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root:  root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+   struct cgroup_subsys_state *css_root;
+   struct cgroup_subsys_state *pos;
+   struct sgx_epc_cgroup *epc_cg;
+   bool ret = true;
+
+   /*
+* Caller ensure css_root ref acquired
+*/
+   css_root = &root->css;
+
+   rcu_read_lock();
+   css_for_each_descendant_pre(pos, css_root) {
+   if 

[PATCH v6 12/12] selftests/sgx: Add scripts for EPC cgroup testing

2023-10-30 Thread Haitao Huang
The scripts rely on cgroup-tools package from libcgroup [1].

To run selftests for epc cgroup:

sudo ./run_epc_cg_selftests.sh

With different cgroups, the script starts one or multiple concurrent SGX
selftests, each running one unclobbered_vdso_oversubscribed test.  Each
such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) LARGER - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 10 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all process exit.

To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:

./watch_misc_for_tests.sh current

[1] https://github.com/libcgroup/libcgroup/blob/main/README

Signed-off-by: Haitao Huang 
---
V5:

- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 .../selftests/sgx/run_epc_cg_selftests.sh | 196 ++
 .../selftests/sgx/watch_misc_for_tests.sh |  13 ++
 2 files changed, 209 insertions(+)
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index ..72b93f694753
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,196 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+echo "# Please make sure cgroup-tools is installed, and misc cgroup is 
mounted."
+exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_ROOT=/sys/fs/cgroup
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+echo "# cgroup V2 is in use."
+else
+echo "# cgroup V1 is in use."
+CG_ROOT=/sys/fs/cgroup/misc
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
+# This is below number of VA pages needed for enclave of capacity size. So
+# should fail oversubscribed cases
+SMALL=$(( CAPACITY / 512 ))
+
+# At least load one enclave of capacity size successfully, maybe up to 4.
+# But some may fail if we run more than 4 concurrent enclaves of capacity size.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" | tee $CG_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" | tee $CG_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" | tee $CG_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting 
failure..."
+# Always use leaf node of misc cgroups so it works for both v1 and v2
+# these may fail on OOM
+cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [[ $? -eq 0 ]]; then
+echo "# Fail on SMALL limit, not expecting any test passes."
+cgdelete -r -g misc:$TEST_ROOT_CG
+exit 1
+else
+echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE 
limit,
+expecting at least one success"
+pids=()
+for i in {1..4}; do
+(
+cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+) &
+pids+=($!)
+done
+
+any_success=0
+for pid in "${pids[@]}"; do
+wait "$pid"
+status=$?
+if [[ $status -eq 0 ]]; then
+any_success=1
+   echo "# Process $pid returned successfully."
+fi
+done
+
+if [[ $any_success -eq 0 ]]; then
+echo "# Failed on LARGE limit positive testing, no test passes."
+cgdelete -r -g misc:$TEST_ROOT_CG
+exit 1
+fi
+
+echo "# PASSED LARGE limit positive testing."
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE 
limit,
+expecting at least one failure"
+pids=()
+for i in {1..5}; do
+(
+cgexec -g misc:$TEST_CG_SUB2 $test_cmd 

[PATCH v6 08/12] x86/sgx: Use a list to track to-be-reclaimed pages

2023-10-30 Thread Haitao Huang
From: Sean Christopherson 

Change sgx_reclaim_pages() to use a list rather than an array for
storing the epc_pages which will be reclaimed. This change is needed
to transition to the LRU implementation for EPC cgroup support.

When the EPC cgroup is implemented, the reclaiming process will do a
pre-order tree walk for the subtree starting from the limit-violating
cgroup.  When each node is visited, candidate pages are selected from
its "reclaimable" LRU list and moved into this temporary list. Passing a
list from node to node for temporary storage in this walk is more
straightforward than using an array.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
---
V6:
- Remove extra list_del_init and style fix (Kai)

V4:
- Changes needed for patch reordering
- Revised commit message

V3:
- Removed list wrappers
---
 arch/x86/kernel/cpu/sgx/main.c | 35 +++---
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e27ac73d8843..33bcba313d40 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -296,12 +296,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  */
 static void sgx_reclaim_pages(void)
 {
-   struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
+   struct sgx_epc_page *epc_page, *tmp;
struct sgx_encl_page *encl_page;
-   struct sgx_epc_page *epc_page;
pgoff_t page_index;
-   int cnt = 0;
+   LIST_HEAD(iso);
int ret;
int i;
 
@@ -317,7 +316,7 @@ static void sgx_reclaim_pages(void)
 
if (kref_get_unless_zero(&epc_page->encl->refcount) != 0) {
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-   chunk[cnt++] = epc_page;
+   list_move_tail(&epc_page->list, &iso);
} else
/* The owner is freeing the page. No need to add the
 * page back to the list of reclaimable pages.
@@ -326,8 +325,11 @@ static void sgx_reclaim_pages(void)
}
spin_unlock(&sgx_global_lru.lock);
 
-   for (i = 0; i < cnt; i++) {
-   epc_page = chunk[i];
+   if (list_empty(&iso))
+   return;
+
+   i = 0;
+   list_for_each_entry_safe(epc_page, tmp, &iso, list) {
encl_page = epc_page->owner;
 
if (!sgx_reclaimer_age(epc_page))
@@ -342,6 +344,7 @@ static void sgx_reclaim_pages(void)
goto skip;
}
 
+   i++;
encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
mutex_unlock(&encl_page->encl->lock);
continue;
@@ -349,27 +352,19 @@ static void sgx_reclaim_pages(void)
 skip:
spin_lock(&sgx_global_lru.lock);
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
-   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+   list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
spin_unlock(&sgx_global_lru.lock);
 
kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
-   chunk[i] = NULL;
-   }
-
-   for (i = 0; i < cnt; i++) {
-   epc_page = chunk[i];
-   if (epc_page)
-   sgx_reclaimer_block(epc_page);
}
 
-   for (i = 0; i < cnt; i++) {
-   epc_page = chunk[i];
-   if (!epc_page)
-   continue;
+   list_for_each_entry(epc_page, &iso, list)
+   sgx_reclaimer_block(epc_page);
 
+   i = 0;
+   list_for_each_entry_safe(epc_page, tmp, &iso, list) {
encl_page = epc_page->owner;
-   sgx_reclaimer_write(epc_page, &backing[i]);
+   sgx_reclaimer_write(epc_page, &backing[i++]);
 
kref_put(&encl_page->encl->refcount, sgx_encl_release);
sgx_epc_page_reset_state(epc_page);
-- 
2.25.1



[PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-10-30 Thread Haitao Huang
From: Sean Christopherson 

To prepare for per-cgroup reclamation, separate the top-level reclaim
function, sgx_reclaim_epc_pages(), into two separate functions:

- sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
  given LRU list.
- sgx_do_epc_reclamation() performs the real reclamation for the already
  isolated pages.

Create a new function, sgx_reclaim_epc_pages_global(), calling those two
in succession, to replace the original sgx_reclaim_epc_pages(). The
above two functions will serve as building blocks for the reclamation
flows in later EPC cgroup implementation.

sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
cgroup will use the result to track reclaiming progress.

sgx_isolate_epc_pages() returns the additional number of pages to scan
for the current epoch of reclamation. The EPC cgroup will use the result
to determine whether more scanning is needed in the LRUs of its children
groups.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V6:
- Restructure patches to make it easier to review. (Kai)
- Fix unused nr_to_scan (Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 97 ++
 arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++
 2 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 33bcba313d40..e8848b493eb7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -281,33 +281,23 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
	mutex_unlock(&encl->lock);
 }
 }
 
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru:	LRU from which to reclaim
+ * @nr_to_scan:	Number of pages to scan for reclaim
+ * @dst:	Destination list to hold the isolated pages
  *
- * Batch process a chunk of pages (at the moment 16) in order to degrade amount
- * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
- * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
- * + EWB) but not sufficiently. Reclaiming one page at a time would also be
- * problematic as it would increase the lock contention too much, which would
- * halt forward progress.
+ * Return: remaining pages to scan, i.e., @nr_to_scan minus the number of pages scanned.
  */
-static void sgx_reclaim_pages(void)
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+				   struct list_head *dst)
 {
-	struct sgx_backing backing[SGX_NR_TO_SCAN];
-	struct sgx_epc_page *epc_page, *tmp;
 	struct sgx_encl_page *encl_page;
-	pgoff_t page_index;
-	LIST_HEAD(iso);
-	int ret;
-	int i;
+	struct sgx_epc_page *epc_page;
 
-	spin_lock(&sgx_global_lru.lock);
-	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-						    struct sgx_epc_page, list);
+	spin_lock(&lru->lock);
+	for (; nr_to_scan > 0; --nr_to_scan) {
+		epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
 		if (!epc_page)
 			break;
 
@@ -316,23 +306,53 @@ static void sgx_reclaim_pages(void)
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
 			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
-			list_move_tail(&epc_page->list, &iso);
+			list_move_tail(&epc_page->list, dst);
 		} else
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
 			sgx_epc_page_reset_state(epc_page);
 	}
-	spin_unlock(&sgx_global_lru.lock);
+	spin_unlock(&lru->lock);
+
+	return nr_to_scan;
+}
+
+/**
+ * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
+ * @iso:	List of isolated pages for reclamation
+ *
+ * Take a list of EPC pages and reclaim them to the enclave's private shmem files.  Do not
+ * reclaim the pages that have been accessed since the last scan, and move each of those pages
+ * to the tail of its tracking LRU list.
+ *
+ * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
+ * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ 

[PATCH v6 07/12] x86/sgx: Introduce EPC page states

2023-10-30 Thread Haitao Huang
Use the lower 2 bits in the flags field of sgx_epc_page struct to track
EPC states and define an enum for possible states for EPC pages tracked
for reclamation.

Add the RECLAIM_IN_PROGRESS state to explicitly indicate a page that is
identified as a candidate for reclaiming, but has not yet been
reclaimed, instead of relying on list_empty(_page->list). A later
patch will replace the array on stack with a temporary list to store the
candidate pages, so list_empty() should no longer be used for this
purpose.
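The 2-bit state encoding can be sketched in plain C as below; the macro and helper names here are illustrative stand-ins for the patch's SGX_EPC_PAGE_* flags and helpers, and only show the bit manipulation, not the locking around it:

```c
#include <assert.h>

/* States live in the low 2 bits of a flags word; all other bits are
 * unrelated flags that the state helpers must preserve. */
#define PG_STATE_MASK		0x3UL
#define PG_STATE_NONE		0x0UL	/* not tracked for reclaim */
#define PG_STATE_RECLAIMABLE	0x1UL	/* on an LRU, candidate for reclaim */
#define PG_STATE_IN_PROGRESS	0x2UL	/* isolated, reclaim under way */

/* Replace only the state bits, keeping the rest of the flags intact. */
static unsigned long pg_set_state(unsigned long flags, unsigned long state)
{
	return (flags & ~PG_STATE_MASK) | state;
}

/* Explicit predicate instead of the old list_empty() heuristic. */
static int pg_reclaim_in_progress(unsigned long flags)
{
	return (flags & PG_STATE_MASK) == PG_STATE_IN_PROGRESS;
}
```

The explicit IN_PROGRESS predicate is what lets a later patch drop the list_empty(&epc_page->list) check.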

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V6:
- Drop UNRECLAIMABLE and use only 2 bits for states (Kai)
- Combine the patch for RECLAIM_IN_PROGRESS
- Style fixes (Jarkko and Kai)
---
 arch/x86/kernel/cpu/sgx/encl.c |  2 +-
 arch/x86/kernel/cpu/sgx/main.c | 33 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 62 +++---
 3 files changed, 76 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..17dc108d3ff7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1315,7 +1315,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
 {
int ret;
 
-   WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+   WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
 
ret = __eremove(sgx_get_epc_virt_addr(page));
if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d347acd717fd..e27ac73d8843 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -315,13 +315,14 @@ static void sgx_reclaim_pages(void)
list_del_init(_page->list);
encl_page = epc_page->owner;
 
-		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+			sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
 			chunk[cnt++] = epc_page;
-		else
+		} else
 			/* The owner is freeing the page. No need to add the
 			 * page back to the list of reclaimable pages.
 			 */
-			epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+			sgx_epc_page_reset_state(epc_page);
}
	spin_unlock(&sgx_global_lru.lock);
 
@@ -347,6 +348,7 @@ static void sgx_reclaim_pages(void)
 
 skip:
		spin_lock(&sgx_global_lru.lock);
+		sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
		spin_unlock(&sgx_global_lru.lock);
 
@@ -370,7 +372,7 @@ static void sgx_reclaim_pages(void)
		sgx_reclaimer_write(epc_page, &backing[i]);
 
		kref_put(&encl_page->encl->refcount, sgx_encl_release);
-		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+		sgx_epc_page_reset_state(epc_page);
 
sgx_free_epc_page(epc_page);
}
@@ -509,7 +511,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
	spin_lock(&sgx_global_lru.lock);
-	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
+	page->flags |= SGX_EPC_PAGE_RECLAIMABLE;
	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
	spin_unlock(&sgx_global_lru.lock);
 }
@@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
-		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		list_del(&page->list);
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
	}
+
+	list_del(&page->list);
+	sgx_epc_page_reset_state(page);
	spin_unlock(&sgx_global_lru.lock);
 
return 0;
@@ -623,6 +623,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;
 
+   WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
if (page->epc_cg) {
sgx_epc_cgroup_uncharge(page->epc_cg);
page->epc_cg = NULL;
@@ -635,7 +636,7 @@ void sgx_free_epc_page(struct 

[PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-10-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem).  The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".

This patch was modified from the previous version to only add the basic
EPC cgroup structure, account allocations for cgroup usage
(charge/uncharge), set up misc cgroup callbacks, and set the total EPC
capacity.

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
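The blocking behavior at the limit can be sketched as below. This is a simplified, single-threaded mock of the accounting (the real code uses atomic counters via misc_cg_try_charge()); the mock_* names are invented for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Toy per-cgroup EPC accounting: usage vs. a hard max. */
struct mock_epc_cg {
	unsigned long usage;	/* analogous to the 'sgx_epc' misc.current entry */
	unsigned long max;	/* analogous to the 'sgx_epc' misc.max entry */
};

/* Deny the allocation with -ENOMEM once the charge would exceed max,
 * which is the non-preemptive enforcement this patch implements. */
static int mock_epc_charge(struct mock_epc_cg *cg, unsigned long bytes)
{
	if (cg->usage + bytes > cg->max)
		return -ENOMEM;		/* limit reached: allocation blocked */
	cg->usage += bytes;
	return 0;
}

static void mock_epc_uncharge(struct mock_epc_cg *cg, unsigned long bytes)
{
	cg->usage -= bytes;
}
```

Once per-cgroup reclamation lands in the later patches, the failing charge triggers reclaim from the cgroup's own LRU before giving up, instead of failing immediately as in this sketch.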

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/Kconfig |  13 
 arch/x86/kernel/cpu/sgx/Makefile |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
 arch/x86/kernel/cpu/sgx/main.c   |  28 
 arch/x86/kernel/cpu/sgx/sgx.h|   3 +
 6 files changed, 184 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..e17c5dc3aea4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX
 
  If unsure, say N.
 
+config CGROUP_SGX_EPC
+	bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+   depends on X86_SGX && CGROUP_MISC
+   help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
 config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..500627d0563f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include 
+#include 
+#include "epc_cgroup.h"
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+   return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+   return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
+ *
+ * Returns EPC cgroup or NULL on success, -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
+{
+   struct sgx_epc_cgroup *epc_cg;
+   int ret;
+
+   if (sgx_epc_cgroup_disabled())
+   return NULL;
+
+   epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+   ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+	if (ret) {
+   /* No epc_cg returned, release ref from get_current_misc_cg() */
+   put_misc_cg(epc_cg->cg);
+   return ERR_PTR(-ENOMEM);
+   }
+
+   /* Ref released in sgx_epc_cgroup_uncharge() */
+   return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - 

[PATCH v6 05/12] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2023-10-30 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.
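A minimal userspace sketch of the encapsulation idea, with stand-ins for the kernel types (a flag replaces spinlock_t and a pointer replaces struct list_head; all names here are invented):

```c
#include <assert.h>
#include <stddef.h>

struct mock_node { struct mock_node *next; };

/* Bundle the reclaimable list with its lock in one struct, so each
 * cgroup can later own a private instance instead of sharing globals. */
struct mock_lru_list {
	int locked;			/* stand-in for spinlock_t */
	struct mock_node *reclaimable;	/* stand-in for struct list_head */
};

static void mock_lru_init(struct mock_lru_list *lru)
{
	lru->locked = 0;
	lru->reclaimable = NULL;	/* empty list */
}
```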

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/sgx.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index b1786774b8d2..0fbe6a2a159b 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -86,6 +86,21 @@ static inline void *sgx_get_epc_virt_addr(struct 
sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
 }
 
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+   spinlock_t lock;
+   struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->reclaimable);
+}
+
 struct sgx_epc_page *__sgx_alloc_epc_page(void);
 void sgx_free_epc_page(struct sgx_epc_page *page);
 
-- 
2.25.1



[PATCH v6 06/12] x86/sgx: Use sgx_epc_lru_list for existing active page list

2023-10-30 Thread Haitao Huang
From: Sean Christopherson 

In future each cgroup needs a LRU list to track reclaimable pages.  For
now just replace the existing sgx_active_page_list in the reclaimer and
its spinlock with a global sgx_epc_lru_list struct.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
---
V5:
- Spelled out SECS, VA (Jarkko)

V4:
- No change, only reordered the patch.

V3:
- Remove usage of list wrapper
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 07606f391540..d347acd717fd 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
	for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-		if (list_empty(&sgx_active_page_list))
+		epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+						    struct sgx_epc_page, list);
+		if (!epc_page)
			break;
 
-		epc_page = list_first_entry(&sgx_active_page_list,
-					    struct sgx_epc_page, list);
		list_del_init(&epc_page->list);
		encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
		 */
		epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-		spin_lock(&sgx_reclaimer_lock);
-		list_add_tail(&epc_page->list, &sgx_active_page_list);
-		spin_unlock(&sgx_reclaimer_lock);
+		spin_lock(&sgx_global_lru.lock);
+		list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+		spin_unlock(&sgx_global_lru.lock);
 
		kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-	       !list_empty(&sgx_active_page_list);
+	       !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+	sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
	page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-	list_add_tail(&page->list, &sgx_active_page_list);
-	spin_unlock(&sgx_reclaimer_lock);
+	list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+	spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-	spin_lock(&sgx_reclaimer_lock);
+	spin_lock(&sgx_global_lru.lock);
	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
		/* The page is being reclaimed. */
		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_reclaimer_lock);
+			spin_unlock(&sgx_global_lru.lock);
			return -EBUSY;
		}
 
		list_del(&page->list);
		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
	}
-	spin_unlock(&sgx_reclaimer_lock);
+	spin_unlock(&sgx_global_lru.lock);
 
return 0;
 }
@@ -574,7 +575,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool 
reclaim)
break;
}
 
-	if (list_empty(&sgx_active_page_list))
+	if (list_empty(&sgx_global_lru.reclaimable))
return ERR_PTR(-ENOMEM);
 
if (!reclaim) {
-- 
2.25.1



[PATCH v6 02/12] cgroup/misc: Export APIs for SGX driver

2023-10-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Export misc_cg_root() so the SGX EPC cgroup can access and do extra
setup during initialization, e.g., set callbacks and private data
previously defined.

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup
reaches its own or an ancestor's limit. This requires a walk from the
current cgroup up to the root, similar to misc_cg_try_charge(). Export
misc_cg_parent() to enable this walk.
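The root-ward walk that misc_cg_parent() enables can be sketched as below; the mock_* struct and function names are invented here, but the charge/unwind shape mirrors misc_cg_try_charge():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy cgroup node: a parent pointer plus usage accounting. */
struct mock_cg {
	struct mock_cg *parent;		/* NULL at the root */
	unsigned long usage, max;
};

/* Charge every level from @cg up to the root; on failure, unwind the
 * levels already charged so usage stays consistent. */
static int mock_try_charge(struct mock_cg *cg, unsigned long amount)
{
	struct mock_cg *i, *j;

	for (i = cg; i; i = i->parent) {
		if (i->usage + amount > i->max)
			goto unwind;
		i->usage += amount;
	}
	return 0;

unwind:
	for (j = cg; j != i; j = j->parent)	/* undo partial charges */
		j->usage -= amount;
	return -EBUSY;
}
```

The reclaim path described in the commit message does the same parent walk, but instead of charging it scans each level's LRU.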

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 5dc509c27c3d..2a3b1f8dc669 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -68,6 +68,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
@@ -87,6 +88,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -111,6 +126,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index d971ede44ebf..fa464324ccf8 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -40,18 +40,13 @@ static struct misc_cg root_cg;
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+	return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -150,7 +145,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
		res = &i->res[type];
 
		new_usage = atomic64_add_return(amount, &res->usage);
@@ -163,12 +158,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
return 0;
 
 err_charge:
-   for (j = i; j; j = parent_misc(j)) {
+   for (j = i; j; j = misc_cg_parent(j)) {
		atomic64_inc(&j->res[type].events);
		cgroup_file_notify(&j->events_file);
}
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -190,7 +185,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.25.1



[PATCH v6 03/12] cgroup/misc: Add SGX EPC resource type

2023-10-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 2a3b1f8dc669..368f6c5fccae 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
MISC_CG_RES_TYPES
 };
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index fa464324ccf8..a22500851fe8 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_CGROUP_SGX_EPC
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.25.1



[PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events

2023-10-30 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

Also add per resource type private data for those callbacks to store and
access resource specific data.
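A minimal sketch of the callback wiring, with invented mock_* names: it shows per-resource ops being consulted from the allocation path, as the patch does in misc_cg_alloc(), with the first hook failure reported to the caller:

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_RES_TYPES 2

struct mock_cg;

/* Per-resource hooks, analogous to the patch's misc_operations_struct. */
struct mock_res_ops {
	int (*alloc)(struct mock_cg *cg);
	void (*free)(struct mock_cg *cg);
};

struct mock_res {
	void *priv;			/* resource-private data */
	const struct mock_res_ops *ops;
};

struct mock_cg {
	struct mock_res res[MOCK_RES_TYPES];
};

static int mock_allocs;		/* counts hook invocations (demo only) */

static int mock_count_alloc(struct mock_cg *cg)
{
	(void)cg;
	mock_allocs++;
	return 0;
}

static const struct mock_res_ops mock_ops = { .alloc = mock_count_alloc };

/* Run every registered alloc hook; report the first failure. */
static int mock_cg_alloc(struct mock_cg *cg)
{
	for (size_t i = 0; i < MOCK_RES_TYPES; i++)
		if (cg->res[i].ops && cg->res[i].ops->alloc) {
			int ret = cg->res[i].ops->alloc(cg);
			if (ret)
				return ret;
		}
	return 0;
}
```

Resource providers (here, the SGX driver in a later patch) register their ops once at init time; slots without ops are simply skipped.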

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
---
V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 14 ++
 kernel/cgroup/misc.c| 27 ---
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..5dc509c27c3d 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,16 +27,30 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_operations_struct: per resource callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_operations_struct {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
  * @usage: Current usage of the resource.
  * @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
+ * @misc_ops: resource specific operations.
  */
 struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+   void *priv;
+   const struct misc_operations_struct *misc_ops;
 };
 
 /**
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..d971ede44ebf 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -383,23 +383,37 @@ static struct cftype misc_cg_files[] = {
 static struct cgroup_subsys_state *
 misc_cg_alloc(struct cgroup_subsys_state *parent_css)
 {
+   struct misc_cg *parent_cg, *cg;
enum misc_res_type i;
-   struct misc_cg *cg;
+   int ret;
 
if (!parent_css) {
-		cg = &root_cg;
+		parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+   parent_cg = css_misc(parent_css);
}
 
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
WRITE_ONCE(cg->res[i].max, MAX_NUM);
		atomic64_set(&cg->res[i].usage, 0);
+		if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc) {
+			ret = parent_cg->res[i].misc_ops->alloc(cg);
+			if (ret)
+				goto alloc_err;
+		}
}
 
	return &cg->css;
+
+alloc_err:
+	for (i = 0; i < MISC_CG_RES_TYPES; i++)
+		if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->free)
+			cg->res[i].misc_ops->free(cg);
+   kfree(cg);
+   return ERR_PTR(ret);
 }
 
 /**
@@ -410,7 +424,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
  */
 static void misc_cg_free(struct cgroup_subsys_state *css)
 {
-   kfree(css_misc(css));
+   struct misc_cg *cg = css_misc(css);
+   enum misc_res_type i;
+
+   for (i = 0; i < MISC_CG_RES_TYPES; i++)
+   if (cg->res[i].misc_ops && cg->res[i].misc_ops->free)
+   cg->res[i].misc_ops->free(cg);
+
+   kfree(cg);
 }
 
 /* Cgroup controller callbacks */
-- 
2.25.1



[PATCH v6 00/12] Add Cgroup support for SGX EPC memory

2023-10-30 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM 
allocations, and
are managed solely by the SGX subsystem. The existing cgroup memory controller 
cannot be used
to limit or account for SGX EPC memory, which is a desirable feature in
some environments, e.g., support for pod-level control in a Kubernetes
cluster on a VM or bare-metal host [1,2].
 
This patchset implements the support for sgx_epc memory within the misc cgroup 
controller. The
user can use the misc cgroup controller to set and enforce a max limit on total 
EPC usage per
cgroup. The implementation reports current usage and events of reaching the 
limit per cgroup as
well as the total system capacity.
 
With the EPC misc controller enabled, every EPC page allocation is accounted 
for a cgroup's
usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface file of 
the cgroup.
Much like normal system memory, EPC memory can be overcommitted via virtual 
memory techniques
and pages can be swapped out of the EPC to their backing store (normal system 
memory allocated
via shmem, accounted by the memory controller). When the EPC usage of a cgroup 
reaches its hard
limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a reclamation 
process to swap
out some EPC pages within the same cgroup and its descendant to their backing 
store. Although
the SGX architecture supports swapping for all pages, to avoid extra 
complexities, this
implementation does not support swapping for certain page types, e.g., Version
Array (VA) pages, and treats them as unreclaimable pages.  When the limit is
reached but nothing is left in the cgroup for reclamation, i.e., only
unreclaimable pages are left, any new EPC allocation in the cgroup will result
in an ENOMEM error.

The EPC pages allocated for guest VMs by the virtual EPC driver are not 
reclaimable by the host
kernel [5]. Therefore they are also treated as unreclaimable from cgroup's 
point of view.  And
the virtual EPC driver translates an ENOMEM error resulted from an EPC 
allocation request into
a SIGBUS to the user process.

This work was originally authored by Sean Christopherson a few years ago, and 
previously
modified by Kristen C. Accardi to utilize the misc cgroup controller rather 
than a custom
controller. I have been updating the patches based on review comments since V2 
[3, 4, 10],
simplified the implementation/design and fixed some stability issues found from 
testing.
 
The patches are organized as follows:
- Patches 1-3 are prerequisite misc cgroup changes for adding new APIs, 
structs, resource
  types.
- Patch 4 implements basic misc controller for EPC without reclamation.
- Patches 5-9 prepare for per-cgroup reclamation.
* Separate out the existing infrastructure of tracking reclaimable pages
  from the global reclaimer(ksgxd) to a newly created LRU list struct.
* Separate out reusable top-level functions for reclamation.
- Patch 10 adds support for per-cgroup reclamation.
- Patch 11 adds documentation for the EPC cgroup.
- Patch 12 adds test scripts.

I appreciate your review and providing tags if appropriate.

---
V6:
- Dropped OOM killing path, only implement non-preemptive enforcement of max 
limit (Dave, Michal)
- Simplified reclamation flow by taking out sgx_epc_reclaim_control, forced 
reclamation by
  ignoring "age".
- Restructured patches: split misc API + resource types patch and the big EPC 
cgroup patch
  (Kai, Michal)
- Dropped some Tested-by/Reviewed-by tags due to significant changes
- Added more selftests

v5:
- Replace the manual test script with a selftest script.
- Restore the "From" tag for some patches to Sean (Kai)
- Style fixes (Jarkko)

v4:
- Collected "Tested-by" from Mikko. I kept it for now as no functional changes 
in v4.
- Rebased on to v6.6_rc1 and reordered patches as described above.
- Separated out the bug fixes [7,8,9]. This series depend on those patches. 
(Dave, Jarkko)
- Added comments in commit message to give more preview what's to come next. 
(Jarkko)
- Fixed some documentation error, gap, style (Mikko, Randy)
- Fixed some comments, typo, style in code (Mikko, Kai)
- Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
- Fixed typo (Pavel)
- Exclude the previous fixes/enhancements for self-tests. Patch 18 now depends 
on series [6]
- Use the same to list for cover and all patches. (Solo)
 
v3:
 
- Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
- Unrolled wrappers for cond_resched, list (Dave)
- Separate patches for adding reclaimable and unreclaimable lists. (Dave)
- Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
- Simplified the cgroup tree walking with plain
  css_for_each_descendant_pre.
- Fixed race conditions and crashes.
- OOM killer to wait for the victim enclave pages being reclaimed.
- Unblock the user by handling misc_max_write callback asynchronously.
- Rebased onto 6.4 and no longer base this series on the MCA 

Re: [PATCH v2 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS

2023-10-30 Thread Krzysztof Kozlowski
On 30/10/2023 11:03, Neil Armstrong wrote:
> Document the DSP Peripheral Authentication Service on the SM8650 Platform.
> 
> Signed-off-by: Neil Armstrong 
> ---
>  .../bindings/remoteproc/qcom,sm8550-pas.yaml   | 44 
> +-
>  1 file changed, 43 insertions(+), 1 deletion(-)

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof



Re: [PATCH] eventfs: Fix kerneldoc of eventfs_remove_rec()

2023-10-30 Thread Mukesh Ojha




On 10/30/2023 9:45 PM, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

The eventfs_remove_rec() had some missing parameters in the kerneldoc
comment above it. Also, rephrase the description a bit more to have a bit
more correct grammar.

Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use 
eventfs_inode")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202310052216.4sgqaswo-...@intel.com/
Signed-off-by: Steven Rostedt (Google) 


Reviewed-by: Mukesh Ojha 

-Mukesh

---
  fs/tracefs/event_inode.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5a3cc5394294..1c28e013201f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -977,9 +977,11 @@ static void free_rcu_ei(struct rcu_head *head)
  /**
   * eventfs_remove_rec - remove eventfs dir or file from list
   * @ei: eventfs_inode to be removed.
+ * @head: the list head to place the deleted @ei and children
+ * @level: prevent recursion from going more than 3 levels deep.
   *
- * This function recursively remove eventfs_inode which
- * contains info of file or dir.
+ * This function recursively removes eventfs_inodes which
+ * contains info of files and/or directories.
   */
  static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head 
*head, int level)
  {




[PATCH] eventfs: Fix kerneldoc of eventfs_remove_rec()

2023-10-30 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The eventfs_remove_rec() had some missing parameters in the kerneldoc
comment above it. Also, rephrase the description a bit more to have a bit
more correct grammar.

Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use 
eventfs_inode")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202310052216.4sgqaswo-...@intel.com/
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5a3cc5394294..1c28e013201f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -977,9 +977,11 @@ static void free_rcu_ei(struct rcu_head *head)
 /**
  * eventfs_remove_rec - remove eventfs dir or file from list
  * @ei: eventfs_inode to be removed.
+ * @head: the list head to place the deleted @ei and children
+ * @level: prevent recursion from going more than 3 levels deep.
  *
- * This function recursively remove eventfs_inode which
- * contains info of file or dir.
+ * This function recursively removes eventfs_inodes which
+ * contains info of files and/or directories.
  */
 static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head 
*head, int level)
 {
-- 
2.42.0




Re: [PATCH v6 3/4] remoteproc: zynqmp: add pm domains support

2023-10-30 Thread Tanmay Shah
Hi Mathieu,

I agree to all the comments, I will address them in next revision.

Thanks,

Tanmay

On 10/18/23 12:38 PM, Mathieu Poirier wrote:
> Good morning,
>
> On Thu, Oct 12, 2023 at 09:22:28PM -0700, Tanmay Shah wrote:
> > Use TCM pm domains extracted from device-tree
> > to power on/off TCM using general pm domain framework.
> > 
> > Signed-off-by: Tanmay Shah 
> > ---
> > 
> > Changes in v6:
> >   - Remove spurious change
> >   - Handle errors in add_pm_domains function
> >   - Remove redundant code to handle errors from remove_pm_domains
> > 
> >  drivers/remoteproc/xlnx_r5_remoteproc.c | 262 ++--
> >  1 file changed, 243 insertions(+), 19 deletions(-)
> > 
> > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> > b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > index 4395edea9a64..04e95d880184 100644
> > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "remoteproc_internal.h"
> >  
> > @@ -102,6 +103,12 @@ static const struct mem_bank_data 
> > zynqmp_tcm_banks_lockstep[] = {
> >   * @rproc: rproc handle
> >   * @pm_domain_id: RPU CPU power domain id
> >   * @ipi: pointer to mailbox information
> > + * @num_pm_dev: number of tcm pm domain devices for this core
> > + * @pm_dev1: pm domain virtual devices for power domain framework
> > + * @pm_dev_link1: pm domain device links after registration
> > + * @pm_dev2: used only in lockstep mode. second core's pm domain virtual 
> > devices
> > + * @pm_dev_link2: used only in lockstep mode. second core's pm device 
> > links after
> > + * registration
> >   */
> >  struct zynqmp_r5_core {
> > struct device *dev;
> > @@ -111,6 +118,11 @@ struct zynqmp_r5_core {
> > struct rproc *rproc;
> > u32 pm_domain_id;
> > struct mbox_info *ipi;
> > +   int num_pm_dev;
> > +   struct device **pm_dev1;
>
> s/pm_dev1/pm_dev_core0
>
> > +   struct device_link **pm_dev_link1;
>
> s/pm_dev_link1/pm_dev_core0_link;
>
> > +   struct device **pm_dev2;
>
> s/pm_dev2/pm_dev_core1
>
> > +   struct device_link **pm_dev_link2;
>
> s/pm_dev_link2/pm_dev_core1_link;
>
> >  };
> >  
> >  /**
> > @@ -575,12 +587,21 @@ static int add_tcm_carveout_split_mode(struct rproc 
> > *rproc)
> > bank_size = r5_core->tcm_banks[i]->size;
> > pm_domain_id = r5_core->tcm_banks[i]->pm_domain_id;
> >  
> > -   ret = zynqmp_pm_request_node(pm_domain_id,
> > -ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > -ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > -   if (ret < 0) {
> > -   dev_err(dev, "failed to turn on TCM 0x%x", 
> > pm_domain_id);
> > -   goto release_tcm_split;
> > +   /*
> > +* If TCM information is available in device-tree then
> > +* in that case, pm domain framework will power on/off TCM.
> > +* In that case pm_domain_id is set to 0. If hardcode
> > +* bindings from driver is used, then only this driver will
> > +* use pm_domain_id.
> > +*/
> > +   if (pm_domain_id) {
> > +   ret = zynqmp_pm_request_node(pm_domain_id,
> > +
> > ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > +
> > ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > +   if (ret < 0) {
> > +   dev_err(dev, "failed to turn on TCM 0x%x", 
> > pm_domain_id);
> > +   goto release_tcm_split;
> > +   }
>
> This should go in the next patch.
>
> > }
> >  
> > dev_dbg(dev, "TCM carveout split mode %s addr=%llx, da=0x%x, 
> > size=0x%lx",
> > @@ -646,13 +667,16 @@ static int add_tcm_carveout_lockstep_mode(struct 
> > rproc *rproc)
> > for (i = 0; i < num_banks; i++) {
> > pm_domain_id = r5_core->tcm_banks[i]->pm_domain_id;
> >  
> > -   /* Turn on each TCM bank individually */
> > -   ret = zynqmp_pm_request_node(pm_domain_id,
> > -ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > -ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > -   if (ret < 0) {
> > -   dev_err(dev, "failed to turn on TCM 0x%x", 
> > pm_domain_id);
> > -   goto release_tcm_lockstep;
> > +   if (pm_domain_id) {
> > +   /* Turn on each TCM bank individually */
> > +   ret = zynqmp_pm_request_node(pm_domain_id,
> > +
> > ZYNQMP_PM_CAPABILITY_ACCESS, 0,
> > +
> > ZYNQMP_PM_REQUEST_ACK_BLOCKING);
> > +   if (ret < 0) {
> > +   dev_err(dev, "failed to turn on TCM 0x%x",
> > +   

[PATCH v2] eventfs: Save ownership and mode

2023-10-30 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Now that inodes and dentries are created on the fly, they are also
reclaimed on memory pressure. Since the ownership and file mode are saved
in the inode, if they are freed, any changes to the ownership and mode
will be lost.

To counter this, if the user changes the permissions or ownership, save
them, and when creating the inodes again, restore those changes.

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/all/20231029182345.0803e...@rorschach.local.home/

  - Fixed kerneldoc reported by 
https://lore.kernel.org/all/202310301225.vjemupal-...@intel.com/

 fs/tracefs/event_inode.c | 148 +++
 fs/tracefs/internal.h|  15 
 2 files changed, 150 insertions(+), 13 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 45bddce7c747..7ad7496bd597 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -33,6 +33,15 @@
  */
 static DEFINE_MUTEX(eventfs_mutex);
 
+/* Mode is unsigned short, use the upper bits for flags */
+enum {
+   EVENTFS_SAVE_MODE   = BIT(16),
+   EVENTFS_SAVE_UID= BIT(17),
+   EVENTFS_SAVE_GID= BIT(18),
+};
+
+#define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
+
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
  * its parent's list and will have is_freed set (under eventfs_mutex).
@@ -47,8 +56,89 @@ static int dcache_dir_open_wrapper(struct inode *inode, 
struct file *file);
 static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx);
 static int eventfs_release(struct inode *inode, struct file *file);
 
+static void update_attr(struct eventfs_attr *attr, struct iattr *iattr)
+{
+   unsigned int ia_valid = iattr->ia_valid;
+
+   if (ia_valid & ATTR_MODE) {
+   attr->mode = (attr->mode & ~EVENTFS_MODE_MASK) |
+   (iattr->ia_mode & EVENTFS_MODE_MASK) |
+   EVENTFS_SAVE_MODE;
+   }
+   if (ia_valid & ATTR_UID) {
+   attr->mode |= EVENTFS_SAVE_UID;
+   attr->uid = iattr->ia_uid;
+   }
+   if (ia_valid & ATTR_GID) {
+   attr->mode |= EVENTFS_SAVE_GID;
+   attr->gid = iattr->ia_gid;
+   }
+}
+
+static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
+   struct iattr *iattr)
+{
+   const struct eventfs_entry *entry;
+   struct eventfs_inode *ei;
+   const char *name;
+   int ret;
+
+   mutex_lock(&eventfs_mutex);
+   ei = dentry->d_fsdata;
+   /* The LSB is set when the eventfs_inode is being freed */
+   if (((unsigned long)ei & 1UL) || ei->is_freed) {
+   /* Do not allow changes if the event is about to be removed. */
+   mutex_unlock(&eventfs_mutex);
+   return -ENODEV;
+   }
+
+   /* Preallocate the children mode array if necessary */
+   if (!(dentry->d_inode->i_mode & S_IFDIR)) {
+   if (!ei->entry_attrs) {
+   ei->entry_attrs = kzalloc(sizeof(*ei->entry_attrs) * 
ei->nr_entries,
+ GFP_KERNEL);
+   if (!ei->entry_attrs) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   }
+   }
+
+   ret = simple_setattr(idmap, dentry, iattr);
+   if (ret < 0)
+   goto out;
+
+   /*
+* If this is a dir, then update the ei cache, only the file
+* mode is saved in the ei->m_children, and the ownership is
+* determined by the parent directory.
+*/
+   if (dentry->d_inode->i_mode & S_IFDIR) {
+   update_attr(&ei->attr, iattr);
+
+   } else {
+   name = dentry->d_name.name;
+
+   for (int i = 0; i < ei->nr_entries; i++) {
+   entry = &ei->entries[i];
+   if (strcmp(name, entry->name) == 0) {
+   update_attr(&ei->entry_attrs[i], iattr);
+   break;
+   }
+   }
+   }
+ out:
+   mutex_unlock(&eventfs_mutex);
+   return ret;
+}
+
 static const struct inode_operations eventfs_root_dir_inode_operations = {
.lookup = eventfs_root_lookup,
+   .setattr= eventfs_set_attr,
+};
+
+static const struct inode_operations eventfs_file_inode_operations = {
+   .setattr= eventfs_set_attr,
 };
 
 static const struct file_operations eventfs_file_operations = {
@@ -59,10 +149,30 @@ static const struct file_operations 
eventfs_file_operations = {
.release= eventfs_release,
 };
 
+static void update_inode_attr(struct inode *inode, struct eventfs_attr *attr, 
umode_t mode)
+{
+   if (!attr) {
+   inode->i_mode = mode;
+   return;
+   }
+
+   if (attr->mode & EVENTFS_SAVE_MODE)

Re: [PATCH] eventfs: Hold eventfs_mutex when calling callback functions

2023-10-30 Thread Steven Rostedt


I think I figured out why claws-mail adds a backslash to quotes when
sending. It allows you to add more than one Cc on a line, and I think (I
haven't tested it yet) if you add multiple names with quotes on the
same line, it thinks that it's a single name and will backslash internal
quotes. Hmm.

Anyway, I had to remove all quoted names to reply to this email. Hopefully
claws-mail doesn't screw it up again :-p

I may also need to upgrade claws-mail, as I build my own because I have
some things enabled that the distro version does not include.

-- Steve


On Mon, 30 Oct 2023 11:40:47 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The callback function that is used to create inodes and dentries is not
> protected by anything and the data that is passed to it could become
> stale. After eventfs_remove_dir() is called by the tracing system, it is
> free to remove the events that are associated to that directory.
> Unfortunately, that means the callbacks must not be called after that.
> 
>  CPU0 CPU1
>   
>  eventfs_root_lookup() {
>eventfs_remove_dir() {
> mutex_lock(&eventfs_mutex);
> ei->is_freed = set;
> mutex_unlock(&eventfs_mutex);
>}
>kfree(event_call);
> 
> for (...) {
>   entry = &ei->entries[i];
>   r = entry->callback() {
>   call = data;// call == event_call above
>   if (call->flags ...)
> 
>  [ USE AFTER FREE BUG ]
> 
> The safest way to protect this is to wrap the callback with:
> 
>  mutex_lock(&eventfs_mutex);
>  if (!ei->is_freed)
>  r = entry->callback();
>  else
>  r = -1;
>  mutex_unlock(&eventfs_mutex);
> 
> This will make sure that the callback will not be called after it is
> freed. But now it needs to be known that the callback is called while
> holding internal eventfs locks, and that it must not call back into the
> eventfs / tracefs system. There's no reason it should anyway, but document
> that as well.
> 
> Link: 
> https://lore.kernel.org/all/CA+G9fYu9GOEbD=rR5eMR-=HJ8H6rMsbzDC2ZY5=y50wpwae...@mail.gmail.com/
> 
> Reported-by: Linux Kernel Functional Testing 
> Reported-by: Naresh Kamboju 
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  fs/tracefs/event_inode.c | 22 ++--
>  include/linux/tracefs.h  | 43 
>  2 files changed, 63 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 7ad7496bd597..5a3cc5394294 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -609,7 +609,13 @@ static struct dentry *eventfs_root_lookup(struct inode 
> *dir,
>   entry = &ei->entries[i];
>   if (strcmp(name, entry->name) == 0) {
>   void *cdata = data;
> - r = entry->callback(name, &mode, &cdata, &fops);
> + mutex_lock(&eventfs_mutex);
> + /* If ei->is_freed, then the event itself may be too */
> + if (!ei->is_freed)
> + r = entry->callback(name, &mode, &cdata, &fops);
> + else
> + r = -1;
> + mutex_unlock(&eventfs_mutex);
>   if (r <= 0)
>   continue;
>   ret = simple_lookup(dir, dentry, flags);
> @@ -743,7 +749,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, 
> struct file *file)
>   void *cdata = data;
>   entry = &ei->entries[i];
>   name = entry->name;
> - r = entry->callback(name, &mode, &cdata, &fops);
> + mutex_lock(&eventfs_mutex);
> + /* If ei->is_freed, then the event itself may be too */
> + if (!ei->is_freed)
> + r = entry->callback(name, &mode, &cdata, &fops);
> + else
> + r = -1;
> + mutex_unlock(&eventfs_mutex);
>   if (r <= 0)
>   continue;
>   d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, 
> false);
> @@ -821,6 +833,10 @@ static void free_ei(struct eventfs_inode *ei)
>   *   data = A pointer to @data, and the callback may replace it, which will
>   * cause the file created to pass the new data to the open() call.
>   *   fops = the fops to use for the created file.
> + *
> + * NB. @callback is called while holding internal locks of the eventfs
> + * system. The callback must not call any code that might also call into
> + * the tracefs or eventfs system or it will risk creating a deadlock.
>   */
>  struct eventfs_inode *eventfs_create_dir(const char *name, struct 
> eventfs_inode *parent,
>const struct eventfs_entry *entries,
> @@ -880,6 +896,8 @@ struct eventfs_inode 

[PATCH] eventfs: Hold eventfs_mutex when calling callback functions

2023-10-30 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The callback function that is used to create inodes and dentries is not
protected by anything and the data that is passed to it could become
stale. After eventfs_remove_dir() is called by the tracing system, it is
free to remove the events that are associated to that directory.
Unfortunately, that means the callbacks must not be called after that.

 CPU0   CPU1
    
 eventfs_root_lookup() {
 eventfs_remove_dir() {
  mutex_lock(&eventfs_mutex);
  ei->is_freed = set;
  mutex_unlock(&eventfs_mutex);
 }
 kfree(event_call);

for (...) {
  entry = &ei->entries[i];
  r = entry->callback() {
  call = data;  // call == event_call above
  if (call->flags ...)

 [ USE AFTER FREE BUG ]

The safest way to protect this is to wrap the callback with:

 mutex_lock(&eventfs_mutex);
 if (!ei->is_freed)
 r = entry->callback();
 else
 r = -1;
 mutex_unlock(&eventfs_mutex);

This will make sure that the callback will not be called after it is
freed. But now it needs to be known that the callback is called while
holding internal eventfs locks, and that it must not call back into the
eventfs / tracefs system. There's no reason it should anyway, but document
that as well.

Link: 
https://lore.kernel.org/all/CA+G9fYu9GOEbD=rR5eMR-=HJ8H6rMsbzDC2ZY5=y50wpwae...@mail.gmail.com/

Reported-by: Linux Kernel Functional Testing 
Reported-by: Naresh Kamboju 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 22 ++--
 include/linux/tracefs.h  | 43 
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 7ad7496bd597..5a3cc5394294 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -609,7 +609,13 @@ static struct dentry *eventfs_root_lookup(struct inode 
*dir,
	entry = &ei->entries[i];
	if (strcmp(name, entry->name) == 0) {
	void *cdata = data;
-   r = entry->callback(name, &mode, &cdata, &fops);
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed, then the event itself may be too */
+   if (!ei->is_freed)
+   r = entry->callback(name, &mode, &cdata, &fops);
+   else
+   r = -1;
+   mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
ret = simple_lookup(dir, dentry, flags);
@@ -743,7 +749,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, 
struct file *file)
void *cdata = data;
	entry = &ei->entries[i];
	name = entry->name;
-   r = entry->callback(name, &mode, &cdata, &fops);
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed, then the event itself may be too */
+   if (!ei->is_freed)
+   r = entry->callback(name, &mode, &cdata, &fops);
+   else
+   r = -1;
+   mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, 
false);
@@ -821,6 +833,10 @@ static void free_ei(struct eventfs_inode *ei)
  *   data = A pointer to @data, and the callback may replace it, which will
  * cause the file created to pass the new data to the open() call.
  *   fops = the fops to use for the created file.
+ *
+ * NB. @callback is called while holding internal locks of the eventfs
+ * system. The callback must not call any code that might also call into
+ * the tracefs or eventfs system or it will risk creating a deadlock.
  */
 struct eventfs_inode *eventfs_create_dir(const char *name, struct 
eventfs_inode *parent,
 const struct eventfs_entry *entries,
@@ -880,6 +896,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, 
struct eventfs_inode
  * @data: The default data to pass to the files (an entry may override it).
  *
  * This function creates the top of the trace event directory.
+ *
+ * See eventfs_create_dir() for use of @entries.
  */
 struct eventfs_inode *eventfs_create_events_dir(const char *name, struct 
dentry *parent,
const struct eventfs_entry 
*entries,
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 13359b1a35d1..7a5fe17b6bf9 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -23,9 +23,52 @@ struct file_operations;
 
 struct eventfs_file;
 
+/**
+ * eventfs_callback - A callback function to create dynamic files in eventfs
+ * @name: The 

Re: [PATCH 7/9] arm64: dts: qcom: sc7280: Add CDSP node

2023-10-30 Thread Doug Anderson
Hi,

On Mon, Oct 30, 2023 at 7:43 AM Luca Weiss  wrote:
>
> On Mon Oct 30, 2023 at 3:11 PM CET, Doug Anderson wrote:
> > Hi,
> >
> > On Mon, Oct 30, 2023 at 2:12 AM Luca Weiss  wrote:
> > >
> > > On Mon Oct 30, 2023 at 10:04 AM CET, Mukesh Ojha wrote:
> > > >
> > > >
> > > > On 10/27/2023 7:50 PM, Luca Weiss wrote:
> > > > > Add the node for the ADSP found on the SC7280 SoC, using standard
> > > > > Qualcomm firmware.
> > > > >
> > > > > The memory region for sc7280-chrome-common.dtsi is taken from msm-5.4
> > > > > yupik.dtsi since the other areas also seem to match that file there,
> > > > > though I cannot be sure there.
> > > > >
> > > > > Signed-off-by: Luca Weiss 
> > > > > ---
> > > > >   arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi |   5 +
> > > > >   arch/arm64/boot/dts/qcom/sc7280.dtsi   | 138 
> > > > > +
> > > > >   2 files changed, 143 insertions(+)
> > > > >
> > > > > diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
> > > > > b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > > index eb55616e0892..6e5a9d4c1fda 100644
> > > > > --- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > > +++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > > @@ -29,6 +29,11 @@ adsp_mem: memory@8670 {
> > > > > no-map;
> > > > > };
> > > > >
> > > > > +   cdsp_mem: memory@88f0 {
> > > > > +   reg = <0x0 0x88f0 0x0 0x1e0>;
> > > > > +   no-map;
> > > > > +   };
> > > > > +
> > > >
> > > > Just a question, why to do it here, if chrome does not use this ?
> > >
> > > Other memory regions in sc7280.dtsi also get referenced but not actually
> > > defined in that file, like mpss_mem and wpss_mem. Alternatively we can
> > > also try and solve this differently, but then we should probably also
> > > adjust mpss and wpss to be consistent.
> > >
> > > Apart from either declaring cdsp_mem in sc7280.dtsi or
> > > "/delete-property/ memory-region;" for CDSP I don't really have better
> > > ideas though.
> > >
> > > I also imagine these ChromeOS devices will want to enable cdsp at some
> > > point but I don't know any plans there.
> >
> > Given that "remoteproc_cdsp" has status "disabled" in the dtsi, it
> > feels like the dtsi shouldn't be reserving memory. I guess maybe
> > memory regions can't be status "disabled"?
>
> Hi Doug,
>
> That's how it works in really any qcom dtsi though. I think in most/all
> cases normally the reserved-memory is already declared in the SoC dtsi
> file and also used with the memory-region property.
>
> I wouldn't be against adjusting sc7280.dtsi to match the way it's done
> in the other dtsi files though, so to have all the required labels
> already defined in the dtsi so it doesn't rely on these labels being
> defined in the device dts.
>
> In other words, currently if you include sc7280.dtsi and try to build,
> you first have to define the labels mpss_mem and wpss_mem (after this
> patch series also cdsp_mem and adsp_mem) for it to build.
>
> I'm quite neutral either way, let me know :)

I haven't done a ton of thinking about this, so if I'm spouting
gibberish then feel free to ignore me. :-P It just feels weird that
when all the "dtsi" files are combined and you look at what you end up
on a sc7280 Chrome board that you'll be reserving 32MB of memory for a
device that's set (in the same device tree) to be "disabled", right?
...the 32MB is completely wasted, I think. If we wanted to enable the
CDSP we'd have to modify the device tree anyway, so it seems like that
same modification would set the CDSP to "okay" and also reserve the
memory...

In that vein, it seems like maybe you could move the "cdsp_mem" to the
SoC .dsti file with a status of "disabled". . I guess we don't do that
elsewhere, but maybe we should be? As far as I can tell without
testing it (just looking at fdt_scan_reserved_mem()) this should
work...

-Doug


Re: [PATCH 7/9] arm64: dts: qcom: sc7280: Add CDSP node

2023-10-30 Thread Luca Weiss
On Mon Oct 30, 2023 at 3:11 PM CET, Doug Anderson wrote:
> Hi,
>
> On Mon, Oct 30, 2023 at 2:12 AM Luca Weiss  wrote:
> >
> > On Mon Oct 30, 2023 at 10:04 AM CET, Mukesh Ojha wrote:
> > >
> > >
> > > On 10/27/2023 7:50 PM, Luca Weiss wrote:
> > > > Add the node for the ADSP found on the SC7280 SoC, using standard
> > > > Qualcomm firmware.
> > > >
> > > > The memory region for sc7280-chrome-common.dtsi is taken from msm-5.4
> > > > yupik.dtsi since the other areas also seem to match that file there,
> > > > though I cannot be sure there.
> > > >
> > > > Signed-off-by: Luca Weiss 
> > > > ---
> > > >   arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi |   5 +
> > > >   arch/arm64/boot/dts/qcom/sc7280.dtsi   | 138 
> > > > +
> > > >   2 files changed, 143 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
> > > > b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > index eb55616e0892..6e5a9d4c1fda 100644
> > > > --- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > +++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > > @@ -29,6 +29,11 @@ adsp_mem: memory@8670 {
> > > > no-map;
> > > > };
> > > >
> > > > +   cdsp_mem: memory@88f0 {
> > > > +   reg = <0x0 0x88f0 0x0 0x1e0>;
> > > > +   no-map;
> > > > +   };
> > > > +
> > >
> > > Just a question, why to do it here, if chrome does not use this ?
> >
> > Other memory regions in sc7280.dtsi also get referenced but not actually
> > defined in that file, like mpss_mem and wpss_mem. Alternatively we can
> > also try and solve this differently, but then we should probably also
> > adjust mpss and wpss to be consistent.
> >
> > Apart from either declaring cdsp_mem in sc7280.dtsi or
> > "/delete-property/ memory-region;" for CDSP I don't really have better
> > ideas though.
> >
> > I also imagine these ChromeOS devices will want to enable cdsp at some
> > point but I don't know any plans there.
>
> Given that "remoteproc_cdsp" has status "disabled" in the dtsi, it
> feels like the dtsi shouldn't be reserving memory. I guess maybe
> memory regions can't be status "disabled"?

Hi Doug,

That's how it works in really any qcom dtsi though. I think in most/all
cases normally the reserved-memory is already declared in the SoC dtsi
file and also used with the memory-region property.

I wouldn't be against adjusting sc7280.dtsi to match the way it's done
in the other dtsi files though, so to have all the required labels
already defined in the dtsi so it doesn't rely on these labels being
defined in the device dts.

In other words, currently if you include sc7280.dtsi and try to build,
you first have to define the labels mpss_mem and wpss_mem (after this
patch series also cdsp_mem and adsp_mem) for it to build.

I'm quite neutral either way, let me know :)

Regards
Luca

>
> -Doug



Re: [PATCH v2] eventfs: Test for ei->is_freed when accessing ei->dentry

2023-10-30 Thread Steven Rostedt
On Mon, 30 Oct 2023 12:37:08 +0530
Naresh Kamboju  wrote:

> 
> I have tested the linux-trace.git trace/core and run selftests ftrace
> the reported kernel panic [1] & [2] has been fixed but found

Good to know. Can I add "Tested-by" from you for that bug fix?

> "general protection fault"  at  kernel/trace/trace_events.c:2439.

Can you test with the below patch?

Also, can I ask what are you testing this on that makes it trigger so
easily? As I'm not able to trigger these in my tests, even though they are
indeed bugs.

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 7ad7496bd597..7a0b54ddda24 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -609,7 +609,13 @@ static struct dentry *eventfs_root_lookup(struct inode 
*dir,
	entry = &ei->entries[i];
	if (strcmp(name, entry->name) == 0) {
	void *cdata = data;
-   r = entry->callback(name, &mode, &cdata, &fops);
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed, then the event itself may be too */
+   if (!ei->is_freed)
+   r = entry->callback(name, &mode, &cdata, &fops);
+   else
+   r = -1;
+   mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
ret = simple_lookup(dir, dentry, flags);
@@ -743,7 +749,13 @@ static int dcache_dir_open_wrapper(struct inode *inode, 
struct file *file)
void *cdata = data;
	entry = &ei->entries[i];
	name = entry->name;
-   r = entry->callback(name, &mode, &cdata, &fops);
+   mutex_lock(&eventfs_mutex);
+   /* If ei->is_freed, then the event itself may be too */
+   if (!ei->is_freed)
+   r = entry->callback(name, &mode, &cdata, &fops);
+   else
+   r = -1;
+   mutex_unlock(&eventfs_mutex);
if (r <= 0)
continue;
d = create_file_dentry(ei, i, parent, name, mode, cdata, fops, 
false);



Re: [PATCH 7/9] arm64: dts: qcom: sc7280: Add CDSP node

2023-10-30 Thread Doug Anderson
Hi,

On Mon, Oct 30, 2023 at 2:12 AM Luca Weiss  wrote:
>
> On Mon Oct 30, 2023 at 10:04 AM CET, Mukesh Ojha wrote:
> >
> >
> > On 10/27/2023 7:50 PM, Luca Weiss wrote:
> > > Add the node for the ADSP found on the SC7280 SoC, using standard
> > > Qualcomm firmware.
> > >
> > > The memory region for sc7280-chrome-common.dtsi is taken from msm-5.4
> > > yupik.dtsi since the other areas also seem to match that file there,
> > > though I cannot be sure there.
> > >
> > > Signed-off-by: Luca Weiss 
> > > ---
> > >   arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi |   5 +
> > >   arch/arm64/boot/dts/qcom/sc7280.dtsi   | 138 
> > > +
> > >   2 files changed, 143 insertions(+)
> > >
> > > diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
> > > b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > index eb55616e0892..6e5a9d4c1fda 100644
> > > --- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > +++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > > @@ -29,6 +29,11 @@ adsp_mem: memory@8670 {
> > > no-map;
> > > };
> > >
> > > +   cdsp_mem: memory@88f0 {
> > > +   reg = <0x0 0x88f0 0x0 0x1e0>;
> > > +   no-map;
> > > +   };
> > > +
> >
> > Just a question, why to do it here, if chrome does not use this ?
>
> Other memory regions in sc7280.dtsi also get referenced but not actually
> defined in that file, like mpss_mem and wpss_mem. Alternatively we can
> also try and solve this differently, but then we should probably also
> adjust mpss and wpss to be consistent.
>
> Apart from either declaring cdsp_mem in sc7280.dtsi or
> "/delete-property/ memory-region;" for CDSP I don't really have better
> ideas though.
>
> I also imagine these ChromeOS devices will want to enable cdsp at some
> point but I don't know any plans there.

Given that "remoteproc_cdsp" has status "disabled" in the dtsi, it
feels like the dtsi shouldn't be reserving memory. I guess maybe
memory regions can't be status "disabled"?

-Doug


Re: [PATCH for-next] tracing/kprobes: Add symbol counting check when module loads

2023-10-30 Thread Steven Rostedt
On Sun, 29 Oct 2023 12:10:46 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Check the number of probe target symbols in the target module when
> the module is loaded. If the probe target symbol is not unique
> within the module, it will be rejected at that point.
> 
> Note that a symbol which has a unique name in the target module
> will be accepted even if there are same-name symbols in the
> kernel or other modules.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  kernel/trace/trace_kprobe.c |  112 
> ++-
>  1 file changed, 68 insertions(+), 44 deletions(-)

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v2 2/3] remoteproc: qcom: pas: make region assign more generic

2023-10-30 Thread Mukesh Ojha




On 10/30/2023 3:33 PM, Neil Armstrong wrote:

The current memory region assign only supports a single
memory region.

But new platforms introduce more regions to make the
memory requirements more flexible for various use cases.
Those new platforms also share the memory region between the
DSP and HLOS.

To handle this, make the region assign more generic in order
to support more than a single memory region and also permit
setting the region permissions as shared.

Signed-off-by: Neil Armstrong 
---
  drivers/remoteproc/qcom_q6v5_pas.c | 102 -
  1 file changed, 66 insertions(+), 36 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 913a5d2068e8..4829fd26e17d 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -33,6 +33,8 @@
  
  #define ADSP_DECRYPT_SHUTDOWN_DELAY_MS	100
  
+#define MAX_ASSIGN_COUNT 2
+
  struct adsp_data {
int crash_reason_smem;
const char *firmware_name;
@@ -51,6 +53,9 @@ struct adsp_data {
int ssctl_id;
  
  	int region_assign_idx;

+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
  };
  
  struct qcom_adsp {

@@ -87,15 +92,18 @@ struct qcom_adsp {
phys_addr_t dtb_mem_phys;
phys_addr_t mem_reloc;
phys_addr_t dtb_mem_reloc;
-   phys_addr_t region_assign_phys;
+   phys_addr_t region_assign_phys[MAX_ASSIGN_COUNT];
void *mem_region;
void *dtb_mem_region;
size_t mem_size;
size_t dtb_mem_size;
-   size_t region_assign_size;
+   size_t region_assign_size[MAX_ASSIGN_COUNT];
  
  	int region_assign_idx;

-   u64 region_assign_perms;
+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
+   u64 region_assign_perms[MAX_ASSIGN_COUNT];
  
  	struct qcom_rproc_glink glink_subdev;

struct qcom_rproc_subdev smd_subdev;
@@ -590,37 +598,52 @@ static int adsp_alloc_memory_region(struct qcom_adsp 
*adsp)
  
  static int adsp_assign_memory_region(struct qcom_adsp *adsp)

  {
-   struct reserved_mem *rmem = NULL;
-   struct qcom_scm_vmperm perm;
+   struct qcom_scm_vmperm perm[MAX_ASSIGN_COUNT];
+   unsigned int perm_size = 1;


AFAICS, no need for the initialization.


struct device_node *node;
-   int ret;
+   int offset, ret;


Nit: one variable per line.

  
  	if (!adsp->region_assign_idx)


Not related to this patch..
Shouldn't this be valid only for > 1?



return 0;
  
-	node = of_parse_phandle(adsp->dev->of_node, "memory-region", adsp->region_assign_idx);

-   if (node)
-   rmem = of_reserved_mem_lookup(node);
-   of_node_put(node);
-   if (!rmem) {
-   dev_err(adsp->dev, "unable to resolve shareable 
memory-region\n");
-   return -EINVAL;
-   }
+   for (offset = 0; offset < adsp->region_assign_count; ++offset) {
+   struct reserved_mem *rmem = NULL;
+
+   node = of_parse_phandle(adsp->dev->of_node, "memory-region",
+   adsp->region_assign_idx + offset);
+   if (node)
+   rmem = of_reserved_mem_lookup(node);
+   of_node_put(node);
+   if (!rmem) {
+   dev_err(adsp->dev, "unable to resolve shareable 
memory-region index %d\n",
+   offset);
+   return -EINVAL;
+   }



  
-	perm.vmid = QCOM_SCM_VMID_MSS_MSA;

-   perm.perm = QCOM_SCM_PERM_RW;
+   if (adsp->region_assign_shared)  {
+   perm[0].vmid = QCOM_SCM_VMID_HLOS;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm[1].vmid = adsp->region_assign_vmid;
+   perm[1].perm = QCOM_SCM_PERM_RW;
+   perm_size = 2;
+   } else {
+   perm[0].vmid = adsp->region_assign_vmid;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm_size = 1;
+   }
  
-	adsp->region_assign_phys = rmem->base;

-   adsp->region_assign_size = rmem->size;
-   adsp->region_assign_perms = BIT(QCOM_SCM_VMID_HLOS);
+   adsp->region_assign_phys[offset] = rmem->base;
+   adsp->region_assign_size[offset] = rmem->size;
+   adsp->region_assign_perms[offset] = BIT(QCOM_SCM_VMID_HLOS);


Do we need an array for this? Is this changing?

  
-	ret = qcom_scm_assign_mem(adsp->region_assign_phys,

- adsp->region_assign_size,
- &adsp->region_assign_perms,
- &perm, 1);
-   if (ret < 0) {
-   dev_err(adsp->dev, "assign memory failed\n");
-   return ret;
+   ret = qcom_scm_assign_mem(adsp->region_assign_phys[offset],
+   

Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks

2023-10-30 Thread David Hildenbrand

On 26.10.23 00:44, Vishal Verma wrote:

The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added. Adding a larger span of
memory precludes memmap_on_memory semantics.

For users of hotplug such as kmem, large amounts of memory might get
added from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the
memory being added, even if it means splitting the addition into
memblock-sized chunks.

Change add_memory_resource() to loop over memblock-sized chunks of
memory if caller requested memmap_on_memory, and if other conditions for
it are met. Teach try_remove_memory() to also expect that a memory
range being removed might have been split up into memblock sized chunks,
and to loop through those as needed.

This does preclude being able to use PUD mappings in the direct map; a
proposal for how this could be optimized in the future is laid out
here[1].



Almost there, I think :)

  
+static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,

+   u64 start, u64 size)
+{
+   unsigned long memblock_size = memory_block_size_bytes();
+   u64 cur_start;
+   int ret;
+
+   for (cur_start = start; cur_start < start + size;
+cur_start += memblock_size) {
+   struct mhp_params params = { .pgprot =
+pgprot_mhp(PAGE_KERNEL) };
+   struct vmem_altmap mhp_altmap = {
+   .base_pfn = PHYS_PFN(cur_start),
+   .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
+   };
+
+   mhp_altmap.free = memory_block_memmap_on_memory_pages();
+   params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
+   GFP_KERNEL);
+   if (!params.altmap)
+   return -ENOMEM;


Best to cleanup here instead of handling it in the caller [as noted by 
Vishal, we might not be doing that yet]. Using 
remove_memory_blocks_and_altmaps() on the fully processed range sounds 
reasonable.


maybe something like

ret = arch_add_memory(nid, cur_start, memblock_size, &params);
if (ret) {
kfree(params.altmap);
goto out;
}

ret = create_memory_block_devices(cur_start, memblock_size,
   params.altmap, group);
if (ret) {
arch_remove_memory(cur_start, memblock_size, NULL);
kfree(params.altmap);
goto out;
}

if (ret && cur_start != start)
remove_memory_blocks_and_altmaps(start, cur_start - start);
return ret;


+
+   /* call arch's memory hotadd */
+   ret = arch_add_memory(nid, cur_start, memblock_size, &params);
+   if (ret < 0) {
+   kfree(params.altmap);
+   return ret;
+   }
+
+   /* create memory block devices after memory was added */
+   ret = create_memory_block_devices(cur_start, memblock_size,
+ params.altmap, group);
+   if (ret) {
+   arch_remove_memory(cur_start, memblock_size, NULL);
+   kfree(params.altmap);
+   return ret;
+   }
+   }
+
+   return 0;
+}
+


[...]


  static int check_cpu_on_node(int nid)
  {
int cpu;
@@ -2146,11 +2186,69 @@ void try_offline_node(int nid)
  }
  EXPORT_SYMBOL(try_offline_node);
  
-static int __ref try_remove_memory(u64 start, u64 size)

+static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
  {
-   struct memory_block *mem;
-   int rc = 0, nid = NUMA_NO_NODE;
+   unsigned long memblock_size = memory_block_size_bytes();
struct vmem_altmap *altmap = NULL;
+   struct memory_block *mem;
+   u64 cur_start;
+   int rc;
+
+   /*
+* For memmap_on_memory, the altmaps could have been added on
+* a per-memblock basis. Loop through the entire range if so,
+* and remove each memblock and its altmap.
+*/


/*
 * altmaps were added on a per-memblock basis; we have to process
 * each individual memory block.
 */


+   for (cur_start = start; cur_start < start + size;
+cur_start += memblock_size) {
+   rc = walk_memory_blocks(cur_start, memblock_size, &mem,
+   test_has_altmap_cb);
+   if (rc) {
+   altmap = mem->altmap;
+   /*
+* Mark altmap NULL so that we can add a debug
+* check on memblock free.
+*/
+   mem->altmap = NULL;
+   }


Simpler (especially, we know that there must be an altmap):

mem = 

[PATCH v2 2/3] remoteproc: qcom: pas: make region assign more generic

2023-10-30 Thread Neil Armstrong
The current memory region assign only supports a single
memory region.

But new platforms introduce more regions to make the
memory requirements more flexible for various use cases.
Those new platforms also share the memory region between the
DSP and HLOS.

To handle this, make the region assign more generic in order
to support more than a single memory region and also permit
setting the region permissions as shared.

Signed-off-by: Neil Armstrong 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 102 -
 1 file changed, 66 insertions(+), 36 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 913a5d2068e8..4829fd26e17d 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -33,6 +33,8 @@
 
 #define ADSP_DECRYPT_SHUTDOWN_DELAY_MS 100
 
+#define MAX_ASSIGN_COUNT 2
+
 struct adsp_data {
int crash_reason_smem;
const char *firmware_name;
@@ -51,6 +53,9 @@ struct adsp_data {
int ssctl_id;
 
int region_assign_idx;
+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
 };
 
 struct qcom_adsp {
@@ -87,15 +92,18 @@ struct qcom_adsp {
phys_addr_t dtb_mem_phys;
phys_addr_t mem_reloc;
phys_addr_t dtb_mem_reloc;
-   phys_addr_t region_assign_phys;
+   phys_addr_t region_assign_phys[MAX_ASSIGN_COUNT];
void *mem_region;
void *dtb_mem_region;
size_t mem_size;
size_t dtb_mem_size;
-   size_t region_assign_size;
+   size_t region_assign_size[MAX_ASSIGN_COUNT];
 
int region_assign_idx;
-   u64 region_assign_perms;
+   int region_assign_count;
+   bool region_assign_shared;
+   int region_assign_vmid;
+   u64 region_assign_perms[MAX_ASSIGN_COUNT];
 
struct qcom_rproc_glink glink_subdev;
struct qcom_rproc_subdev smd_subdev;
@@ -590,37 +598,52 @@ static int adsp_alloc_memory_region(struct qcom_adsp 
*adsp)
 
 static int adsp_assign_memory_region(struct qcom_adsp *adsp)
 {
-   struct reserved_mem *rmem = NULL;
-   struct qcom_scm_vmperm perm;
+   struct qcom_scm_vmperm perm[MAX_ASSIGN_COUNT];
+   unsigned int perm_size = 1;
struct device_node *node;
-   int ret;
+   int offset, ret;
 
if (!adsp->region_assign_idx)
return 0;
 
-   node = of_parse_phandle(adsp->dev->of_node, "memory-region", 
adsp->region_assign_idx);
-   if (node)
-   rmem = of_reserved_mem_lookup(node);
-   of_node_put(node);
-   if (!rmem) {
-   dev_err(adsp->dev, "unable to resolve shareable 
memory-region\n");
-   return -EINVAL;
-   }
+   for (offset = 0; offset < adsp->region_assign_count; ++offset) {
+   struct reserved_mem *rmem = NULL;
+
+   node = of_parse_phandle(adsp->dev->of_node, "memory-region",
+   adsp->region_assign_idx + offset);
+   if (node)
+   rmem = of_reserved_mem_lookup(node);
+   of_node_put(node);
+   if (!rmem) {
+   dev_err(adsp->dev, "unable to resolve shareable 
memory-region index %d\n",
+   offset);
+   return -EINVAL;
+   }
 
-   perm.vmid = QCOM_SCM_VMID_MSS_MSA;
-   perm.perm = QCOM_SCM_PERM_RW;
+   if (adsp->region_assign_shared)  {
+   perm[0].vmid = QCOM_SCM_VMID_HLOS;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm[1].vmid = adsp->region_assign_vmid;
+   perm[1].perm = QCOM_SCM_PERM_RW;
+   perm_size = 2;
+   } else {
+   perm[0].vmid = adsp->region_assign_vmid;
+   perm[0].perm = QCOM_SCM_PERM_RW;
+   perm_size = 1;
+   }
 
-   adsp->region_assign_phys = rmem->base;
-   adsp->region_assign_size = rmem->size;
-   adsp->region_assign_perms = BIT(QCOM_SCM_VMID_HLOS);
+   adsp->region_assign_phys[offset] = rmem->base;
+   adsp->region_assign_size[offset] = rmem->size;
+   adsp->region_assign_perms[offset] = BIT(QCOM_SCM_VMID_HLOS);
 
-   ret = qcom_scm_assign_mem(adsp->region_assign_phys,
- adsp->region_assign_size,
- &adsp->region_assign_perms,
- &perm, 1);
-   if (ret < 0) {
-   dev_err(adsp->dev, "assign memory failed\n");
-   return ret;
+   ret = qcom_scm_assign_mem(adsp->region_assign_phys[offset],
+ adsp->region_assign_size[offset],
+ &adsp->region_assign_perms[offset],
+ perm, perm_size);
+   if (ret < 0) {

[PATCH v2 0/3] remoteproc: qcom: Introduce DSP support for SM8650

2023-10-30 Thread Neil Armstrong
Add the bindings and driver changes for DSP support on the
SM8650 platform in order to enable the aDSP, cDSP and MPSS
subsystems to boot.

Compared to SM8550, SM8650 uses the same dual firmware files
(dtb file and main firmware), but the memory zone requirements have changed:
- cDSP: now requires 2 memory zones to be configured as shared
  between the cDSP and the HLOS subsystem
- MPSS: In addition to the memory zone required for the SM8550
  MPSS, another one is required to be configured for MPSS
  usage only.

In order to handle this and avoid code duplication, the region_assign_*
code path has been made more generic and is able to handle multiple
DSP-only memory zones (for MPSS) or DSP-HLOS shared memory zones (cDSP)
in the same region_assign functions.

Dependencies: None

For convenience, a regularly refreshed linux-next based git tree containing
all the SM8650 related work is available at:
https://git.codelinaro.org/neil.armstrong/linux/-/tree/topic/sm8650/upstream/integ

Signed-off-by: Neil Armstrong 
---
Changes in v2:
- Fixed sm8650 entries in allOf:if:then to match Krzysztof's comments
- Collected reviewed-by on patch 3
- Link to v1: 
https://lore.kernel.org/r/20231025-topic-sm8650-upstream-remoteproc-v1-0-a8d20e4ce...@linaro.org

---
Neil Armstrong (3):
  dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS
  remoteproc: qcom: pas: make region assign more generic
  remoteproc: qcom: pas: Add SM8650 remoteproc support

 .../bindings/remoteproc/qcom,sm8550-pas.yaml   |  44 +-
 drivers/remoteproc/qcom_q6v5_pas.c | 152 -
 2 files changed, 159 insertions(+), 37 deletions(-)
---
base-commit: fe1998aa935b44ef873193c0772c43bce74f17dc
change-id: 20231016-topic-sm8650-upstream-remoteproc-66a87eeb6fee

Best regards,
-- 
Neil Armstrong 



[PATCH v2 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS

2023-10-30 Thread Neil Armstrong
Document the DSP Peripheral Authentication Service on the SM8650 Platform.

Signed-off-by: Neil Armstrong 
---
 .../bindings/remoteproc/qcom,sm8550-pas.yaml   | 44 +-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
index 58120829fb06..4e8ce9e7e9fa 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
@@ -19,6 +19,9 @@ properties:
   - qcom,sm8550-adsp-pas
   - qcom,sm8550-cdsp-pas
   - qcom,sm8550-mpss-pas
+  - qcom,sm8650-adsp-pas
+  - qcom,sm8650-cdsp-pas
+  - qcom,sm8650-mpss-pas
 
   reg:
 maxItems: 1
@@ -49,6 +52,7 @@ properties:
   - description: Memory region for main Firmware authentication
   - description: Memory region for Devicetree Firmware authentication
   - description: DSM Memory region
+  - description: DSM Memory region 2
 
 required:
   - compatible
@@ -63,6 +67,7 @@ allOf:
   enum:
 - qcom,sm8550-adsp-pas
 - qcom,sm8550-cdsp-pas
+- qcom,sm8650-adsp-pas
 then:
   properties:
 interrupts:
@@ -71,7 +76,26 @@ allOf:
   maxItems: 5
 memory-region:
   maxItems: 2
-else:
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-cdsp-pas
+then:
+  properties:
+interrupts:
+  maxItems: 5
+interrupt-names:
+  maxItems: 5
+memory-region:
+  minItems: 3
+  maxItems: 3
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8550-mpss-pas
+then:
   properties:
 interrupts:
   minItems: 6
@@ -79,12 +103,28 @@ allOf:
   minItems: 6
 memory-region:
   minItems: 3
+  maxItems: 3
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-mpss-pas
+then:
+  properties:
+interrupts:
+  minItems: 6
+interrupt-names:
+  minItems: 6
+memory-region:
+  minItems: 4
+  maxItems: 4
 
   - if:
   properties:
 compatible:
   enum:
 - qcom,sm8550-adsp-pas
+- qcom,sm8650-adsp-pas
 then:
   properties:
 power-domains:
@@ -101,6 +141,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-mpss-pas
+- qcom,sm8650-mpss-pas
 then:
   properties:
 power-domains:
@@ -116,6 +157,7 @@ allOf:
 compatible:
   enum:
 - qcom,sm8550-cdsp-pas
+- qcom,sm8650-cdsp-pas
 then:
   properties:
 power-domains:

-- 
2.34.1



[PATCH v2 3/3] remoteproc: qcom: pas: Add SM8650 remoteproc support

2023-10-30 Thread Neil Armstrong
Add DSP Peripheral Authentication Service support for the SM8650 platform.

Reviewed-by: Dmitry Baryshkov 
Signed-off-by: Neil Armstrong 
---
 drivers/remoteproc/qcom_q6v5_pas.c | 50 ++
 1 file changed, 50 insertions(+)

diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 4829fd26e17d..c593e6d529b3 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -1195,6 +1195,53 @@ static const struct adsp_data sm8550_mpss_resource = {
.region_assign_vmid = QCOM_SCM_VMID_MSS_MSA,
 };
 
+static const struct adsp_data sm8650_cdsp_resource = {
+   .crash_reason_smem = 601,
+   .firmware_name = "cdsp.mdt",
+   .dtb_firmware_name = "cdsp_dtb.mdt",
+   .pas_id = 18,
+   .dtb_pas_id = 0x25,
+   .minidump_id = 7,
+   .auto_boot = true,
+   .proxy_pd_names = (char*[]){
+   "cx",
+   "mxc",
+   "nsp",
+   NULL
+   },
+   .load_state = "cdsp",
+   .ssr_name = "cdsp",
+   .sysmon_name = "cdsp",
+   .ssctl_id = 0x17,
+   .region_assign_idx = 2,
+   .region_assign_count = 1,
+   .region_assign_shared = true,
+   .region_assign_vmid = QCOM_SCM_VMID_CDSP,
+};
+
+static const struct adsp_data sm8650_mpss_resource = {
+   .crash_reason_smem = 421,
+   .firmware_name = "modem.mdt",
+   .dtb_firmware_name = "modem_dtb.mdt",
+   .pas_id = 4,
+   .dtb_pas_id = 0x26,
+   .minidump_id = 3,
+   .auto_boot = false,
+   .decrypt_shutdown = true,
+   .proxy_pd_names = (char*[]){
+   "cx",
+   "mss",
+   NULL
+   },
+   .load_state = "modem",
+   .ssr_name = "mpss",
+   .sysmon_name = "modem",
+   .ssctl_id = 0x12,
+   .region_assign_idx = 2,
+   .region_assign_count = 2,
+   .region_assign_vmid = QCOM_SCM_VMID_MSS_MSA,
+};
+
 static const struct of_device_id adsp_of_match[] = {
{ .compatible = "qcom,msm8226-adsp-pil", .data = &adsp_resource_init},
{ .compatible = "qcom,msm8953-adsp-pil", .data = 
_adsp_resource},
@@ -1247,6 +1294,9 @@ static const struct of_device_id adsp_of_match[] = {
{ .compatible = "qcom,sm8550-adsp-pas", .data = &sm8550_adsp_resource},
{ .compatible = "qcom,sm8550-cdsp-pas", .data = &sm8550_cdsp_resource},
{ .compatible = "qcom,sm8550-mpss-pas", .data = &sm8550_mpss_resource},
+   { .compatible = "qcom,sm8650-adsp-pas", .data = &sm8550_adsp_resource},
+   { .compatible = "qcom,sm8650-cdsp-pas", .data = &sm8650_cdsp_resource},
+   { .compatible = "qcom,sm8650-mpss-pas", .data = &sm8650_mpss_resource},
{ },
 };
 MODULE_DEVICE_TABLE(of, adsp_of_match);

-- 
2.34.1



Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS

2023-10-30 Thread Krzysztof Kozlowski
On 30/10/2023 09:29, Neil Armstrong wrote:
> Ok, I fixed all that
> 
>>
>>> +  - if:
>>> +  properties:
>>> +compatible:
>>> +  enum:
>>> +- qcom,sm8650-mpss-pas
>>> +then:
>>
>> I am not sure if keeping it in the same binding as sm8550 avoids that
>> much duplication.
> 
> Yes it does, 70% of the bindings would be the same; still, if it's
> preferable I can duplicate.
> 

Then let's keep how you propose - in one binding.

Best regards,
Krzysztof



Re: [PATCH 7/9] arm64: dts: qcom: sc7280: Add CDSP node

2023-10-30 Thread Luca Weiss
On Mon Oct 30, 2023 at 10:04 AM CET, Mukesh Ojha wrote:
>
>
> On 10/27/2023 7:50 PM, Luca Weiss wrote:
> > Add the node for the CDSP found on the SC7280 SoC, using standard
> > Qualcomm firmware.
> > 
> > The memory region for sc7280-chrome-common.dtsi is taken from msm-5.4
> > yupik.dtsi since the other areas also seem to match that file there,
> > though I cannot be sure there.
> > 
> > Signed-off-by: Luca Weiss 
> > ---
> >   arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi |   5 +
> >   arch/arm64/boot/dts/qcom/sc7280.dtsi   | 138 
> > +
> >   2 files changed, 143 insertions(+)
> > 
> > diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
> > b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > index eb55616e0892..6e5a9d4c1fda 100644
> > --- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > +++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> > @@ -29,6 +29,11 @@ adsp_mem: memory@8670 {
> > no-map;
> > };
> >   
> > +   cdsp_mem: memory@88f0 {
> > +   reg = <0x0 0x88f0 0x0 0x1e0>;
> > +   no-map;
> > +   };
> > +
>
> Just a question: why do it here if chrome does not use this?

Other memory regions in sc7280.dtsi also get referenced but not actually
defined in that file, like mpss_mem and wpss_mem. Alternatively we can
also try and solve this differently, but then we should probably also
adjust mpss and wpss to be consistent.

Apart from either declaring cdsp_mem in sc7280.dtsi or
"/delete-property/ memory-region;" for CDSP I don't really have better
ideas though.

I also imagine these ChromeOS devices will want to enable cdsp at some
point but I don't know any plans there.

Regards
Luca

>
> -Mukesh
>
> > camera_mem: memory@8ad0 {
> > reg = <0x0 0x8ad0 0x0 0x50>;
> > no-map;
> > diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi 
> > b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > index cc153f4e6979..e15646289bf7 100644
> > --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > @@ -3815,6 +3815,144 @@ nsp_noc: interconnect@a0c {
> > qcom,bcm-voters = <&apps_bcm_voter>;
> > };
> >   
> > +   remoteproc_cdsp: remoteproc@a30 {
> > +   compatible = "qcom,sc7280-cdsp-pas";
> > +   reg = <0 0x0a30 0 0x1>;
> > +
> > +   interrupts-extended = <&intc GIC_SPI 578 IRQ_TYPE_LEVEL_HIGH>,
> > + <&cdsp_smp2p_in 0 IRQ_TYPE_EDGE_RISING>,
> > + <&cdsp_smp2p_in 1 IRQ_TYPE_EDGE_RISING>,
> > + <&cdsp_smp2p_in 2 IRQ_TYPE_EDGE_RISING>,
> > + <&cdsp_smp2p_in 3 IRQ_TYPE_EDGE_RISING>,
> > + <&cdsp_smp2p_in 7 IRQ_TYPE_EDGE_RISING>;
> > +   interrupt-names = "wdog", "fatal", "ready", "handover",
> > + "stop-ack", "shutdown-ack";
> > +
> > +   clocks = <&rpmhcc RPMH_CXO_CLK>;
> > +   clock-names = "xo";
> > +
> > +   power-domains = <&rpmhpd SC7280_CX>,
> > +   <&rpmhpd SC7280_MX>;
> > +   power-domain-names = "cx", "mx";
> > +
> > +   interconnects = <&nsp_noc MASTER_CDSP_PROC 0 &mc_virt SLAVE_EBI1 0>;
> > +
> > +   memory-region = <&cdsp_mem>;
> > +
> > +   qcom,qmp = <&aoss_qmp>;
> > +
> > +   qcom,smem-states = <&cdsp_smp2p_out 0>;
> > +   qcom,smem-state-names = "stop";
> > +
> > +   status = "disabled";
> > +
> > +   glink-edge {
> > +   interrupts-extended = <&ipcc IPCC_CLIENT_CDSP
> > +  IPCC_MPROC_SIGNAL_GLINK_QMP
> > +  IRQ_TYPE_EDGE_RISING>;
> > +   mboxes = <&ipcc IPCC_CLIENT_CDSP
> > +  IPCC_MPROC_SIGNAL_GLINK_QMP>;
> > +
> > +   label = "cdsp";
> > +   qcom,remote-pid = <5>;
> > +
> > +   fastrpc {
> > +   compatible = "qcom,fastrpc";
> > +   qcom,glink-channels = 
> > "fastrpcglink-apps-dsp";
> > +   label = "cdsp";
> > +   qcom,non-secure-domain;
> > +   #address-cells = <1>;
> > +   #size-cells = <0>;
> > +
> > +   compute-cb@1 {
> > +   compatible = 
> > "qcom,fastrpc-compute-cb";
> > + 

Re: [PATCH 7/9] arm64: dts: qcom: sc7280: Add CDSP node

2023-10-30 Thread Mukesh Ojha




On 10/27/2023 7:50 PM, Luca Weiss wrote:

Add the node for the CDSP found on the SC7280 SoC, using standard
Qualcomm firmware.

The memory region for sc7280-chrome-common.dtsi is taken from msm-5.4
yupik.dtsi since the other areas also seem to match that file,
though I cannot be sure.

Signed-off-by: Luca Weiss 
---
  arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi |   5 +
  arch/arm64/boot/dts/qcom/sc7280.dtsi   | 138 +
  2 files changed, 143 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
index eb55616e0892..6e5a9d4c1fda 100644
--- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
@@ -29,6 +29,11 @@ adsp_mem: memory@8670 {
no-map;
};
  
+		cdsp_mem: memory@88f0 {

+   reg = <0x0 0x88f0 0x0 0x1e0>;
+   no-map;
+   };
+


Just a question: why do it here if chrome does not use this?

-Mukesh


camera_mem: memory@8ad0 {
reg = <0x0 0x8ad0 0x0 0x50>;
no-map;
diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi 
b/arch/arm64/boot/dts/qcom/sc7280.dtsi
index cc153f4e6979..e15646289bf7 100644
--- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
@@ -3815,6 +3815,144 @@ nsp_noc: interconnect@a0c {
qcom,bcm-voters = <&apps_bcm_voter>;
};
  
+		remoteproc_cdsp: remoteproc@a30 {

+   compatible = "qcom,sc7280-cdsp-pas";
+   reg = <0 0x0a30 0 0x1>;
+
+   interrupts-extended = <&intc GIC_SPI 578 IRQ_TYPE_LEVEL_HIGH>,
+ <&cdsp_smp2p_in 0 IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 1 IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 2 IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 3 IRQ_TYPE_EDGE_RISING>,
+ <&cdsp_smp2p_in 7 IRQ_TYPE_EDGE_RISING>;
+   interrupt-names = "wdog", "fatal", "ready", "handover",
+ "stop-ack", "shutdown-ack";
+
+   clocks = <&rpmhcc RPMH_CXO_CLK>;
+   clock-names = "xo";
+
+   power-domains = <&rpmhpd SC7280_CX>,
+   <&rpmhpd SC7280_MX>;
+   power-domain-names = "cx", "mx";
+
+   interconnects = <&nsp_noc MASTER_CDSP_PROC 0 &mc_virt SLAVE_EBI1 0>;
+
+   memory-region = <&cdsp_mem>;
+
+   qcom,qmp = <&aoss_qmp>;
+
+   qcom,smem-states = <&cdsp_smp2p_out 0>;
+   qcom,smem-state-names = "stop";
+
+   status = "disabled";
+
+   glink-edge {
+   interrupts-extended = <&ipcc IPCC_CLIENT_CDSP
+  IPCC_MPROC_SIGNAL_GLINK_QMP
+  IRQ_TYPE_EDGE_RISING>;
+   mboxes = <&ipcc IPCC_CLIENT_CDSP
+  IPCC_MPROC_SIGNAL_GLINK_QMP>;
+
+   label = "cdsp";
+   qcom,remote-pid = <5>;
+
+   fastrpc {
+   compatible = "qcom,fastrpc";
+   qcom,glink-channels = 
"fastrpcglink-apps-dsp";
+   label = "cdsp";
+   qcom,non-secure-domain;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   compute-cb@1 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <1>;
+   iommus = <&apps_smmu 0x11a1 0x0420>,
+<&apps_smmu 0x1181 0x0420>;
+   };
+
+   compute-cb@2 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+   reg = <2>;
+   iommus = <&apps_smmu 0x11a2 0x0420>,
+<&apps_smmu 0x1182 0x0420>;
+   };
+
+   compute-cb@3 {
+   compatible = 
"qcom,fastrpc-compute-cb";
+  

Re: [PATCH v2 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-30 Thread Heikki Krogerus
On Fri, Oct 20, 2023 at 11:33:19AM +0200, Luca Weiss wrote:
> On some hardware designs the AUX+/- lanes are connected reversed to
> SBU1/2 compared to the expected design by FSA4480.
> 
> Made more complicated, the otherwise compatible Orient-Chip OCP96011
> expects the lanes to be connected reversed compared to FSA4480.
> 
> * FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
> * OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.
> 
> So if OCP96011 is used as drop-in for FSA4480 then the orientation
> handling in the driver needs to be reversed to match the expectation of
> the OCP96011 hardware.
> 
> Support parsing the data-lanes parameter in the endpoint node to swap
> this in the driver.
> 
> The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.
> 
> Reviewed-by: Neil Armstrong 
> Signed-off-by: Luca Weiss 

Reviewed-by: Heikki Krogerus 

> ---
>  drivers/usb/typec/mux/fsa4480.c | 71 
> +
>  1 file changed, 71 insertions(+)
> 
> diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
> index e0ee1f621abb..cb7cdf90cb0a 100644
> --- a/drivers/usb/typec/mux/fsa4480.c
> +++ b/drivers/usb/typec/mux/fsa4480.c
> @@ -60,6 +60,7 @@ struct fsa4480 {
>   unsigned int svid;
>  
>   u8 cur_enable;
> + bool swap_sbu_lanes;
>  };
>  
>  static const struct regmap_config fsa4480_regmap_config = {
> @@ -76,6 +77,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
>   u8 enable = FSA4480_ENABLE_DEVICE;
>   u8 sel = 0;
>  
> + if (fsa->swap_sbu_lanes)
> + reverse = !reverse;
> +
>   /* USB Mode */
>   if (fsa->mode < TYPEC_STATE_MODAL ||
>   (!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
> @@ -179,12 +183,75 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, 
> struct typec_mux_state *st
>   return ret;
>  }
>  
> +enum {
> + NORMAL_LANE_MAPPING,
> + INVERT_LANE_MAPPING,
> +};
> +
> +#define DATA_LANES_COUNT 2
> +
> +static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
> + [NORMAL_LANE_MAPPING] = { 0, 1 },
> + [INVERT_LANE_MAPPING] = { 1, 0 },
> +};
> +
> +static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
> +{
> + struct fwnode_handle *ep;
> + u32 data_lanes[DATA_LANES_COUNT];
> + int ret, i, j;
> +
> + ep = fwnode_graph_get_next_endpoint(dev_fwnode(&fsa->client->dev), NULL);
> + if (!ep)
> + return 0;
> +
> + ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes, 
> DATA_LANES_COUNT);
> + if (ret == -EINVAL)
> + /* Property isn't here, consider default mapping */
> + goto out_done;
> + if (ret) {
> + dev_err(&fsa->client->dev, "invalid data-lanes property: %d\n",
> ret);
> + goto out_error;
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
> + for (j = 0; j < DATA_LANES_COUNT; j++) {
> + if (data_lanes[j] != supported_data_lane_mapping[i][j])
> + break;
> + }
> +
> + if (j == DATA_LANES_COUNT)
> + break;
> + }
> +
> + switch (i) {
> + case NORMAL_LANE_MAPPING:
> + break;
> + case INVERT_LANE_MAPPING:
> + fsa->swap_sbu_lanes = true;
> + break;
> + default:
> + dev_err(&fsa->client->dev, "invalid data-lanes mapping\n");
> + ret = -EINVAL;
> + goto out_error;
> + }
> +
> +out_done:
> + ret = 0;
> +
> +out_error:
> + fwnode_handle_put(ep);
> +
> + return ret;
> +}
> +
>  static int fsa4480_probe(struct i2c_client *client)
>  {
> struct device *dev = &client->dev;
>   struct typec_switch_desc sw_desc = { };
>   struct typec_mux_desc mux_desc = { };
>   struct fsa4480 *fsa;
> + int ret;
>  
>   fsa = devm_kzalloc(dev, sizeof(*fsa), GFP_KERNEL);
>   if (!fsa)
> @@ -193,6 +260,10 @@ static int fsa4480_probe(struct i2c_client *client)
>   fsa->client = client;
> mutex_init(&fsa->lock);
>  
> + ret = fsa4480_parse_data_lanes_mapping(fsa);
> + if (ret)
> + return ret;
> +
> fsa->regmap = devm_regmap_init_i2c(client, &fsa4480_regmap_config);
>   if (IS_ERR(fsa->regmap))
>   return dev_err_probe(dev, PTR_ERR(fsa->regmap), "failed to 
> initialize regmap\n");
> 
> -- 
> 2.42.0

-- 
heikki


Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the SM8650 PAS

2023-10-30 Thread Neil Armstrong

On 27/10/2023 09:36, Krzysztof Kozlowski wrote:

On 25/10/2023 09:35, Neil Armstrong wrote:

Document the DSP Peripheral Authentication Service on the SM8650 Platform.

Signed-off-by: Neil Armstrong 
---
  .../bindings/remoteproc/qcom,sm8550-pas.yaml   | 41 +-
  1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
index 58120829fb06..316371c8ee6e 100644
--- a/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml
@@ -19,6 +19,9 @@ properties:
- qcom,sm8550-adsp-pas
- qcom,sm8550-cdsp-pas
- qcom,sm8550-mpss-pas
+  - qcom,sm8650-adsp-pas
+  - qcom,sm8650-cdsp-pas
+  - qcom,sm8650-mpss-pas
  
reg:

  maxItems: 1
@@ -49,6 +52,7 @@ properties:
- description: Memory region for main Firmware authentication
- description: Memory region for Devicetree Firmware authentication
- description: DSM Memory region
+  - description: DSM Memory region 2
  
  required:

- compatible
@@ -63,6 +67,7 @@ allOf:
enum:
  - qcom,sm8550-adsp-pas
  - qcom,sm8550-cdsp-pas
+- qcom,sm8650-adsp-pas
  then:
properties:
  interrupts:
@@ -71,7 +76,25 @@ allOf:
maxItems: 5
  memory-region:
maxItems: 2
-else:
+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-cdsp-pas
+then:
+  properties:
+interrupts:
+  minItems: 5


maxItems



+interrupt-names:
+  minItems: 5


maxItems


+memory-region:
+  minItems: 3


maxItems: 3


+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8550-mpss-pas
+then:
properties:
  interrupts:
minItems: 6
@@ -79,12 +102,26 @@ allOf:
minItems: 6
  memory-region:
minItems: 3


You need to add here maxItems.


Ok, I fixed all that




+  - if:
+  properties:
+compatible:
+  enum:
+- qcom,sm8650-mpss-pas
+then:


I am not sure if keeping it in the same binding as sm8550 avoids that
much duplication.


Yes it does; about 70% of the bindings would be the same. Still, if it's 
preferable I can duplicate.

Thanks,
Neil



Best regards,
Krzysztof





Re: [PATCH v2] bus: mhi: host: Add tracing support

2023-10-30 Thread Krishna Chaitanya Chundru



On 10/27/2023 8:59 PM, Jeffrey Hugo wrote:

On 10/23/2023 1:11 AM, Krishna Chaitanya Chundru wrote:


On 10/20/2023 8:33 PM, Jeffrey Hugo wrote:

On 10/13/2023 3:52 AM, Krishna chaitanya chundru wrote:

This change adds ftrace support for the following functions, which
helps in debugging issues when there is a channel state or MHI
state change, and also when we receive data and control events:
1. mhi_intvec_threaded_handler
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Signed-off-by: Krishna chaitanya chundru 
---
Changes in v2:
- Passing the raw state into the trace event and using 
__print_symbolic() as suggested by bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by 
bjorn.

- Fixed the kernel test robot issues.
- Link to v1: 
https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com 


---
  MAINTAINERS |   1 +
  drivers/bus/mhi/host/init.c |   3 +
  drivers/bus/mhi/host/internal.h |   1 +
  drivers/bus/mhi/host/main.c |  32 +++--
  drivers/bus/mhi/host/pm.c   |   6 +-
  include/trace/events/mhi_host.h | 287 


  6 files changed, 317 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 35977b269d5e..4339c668a6ab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13862,6 +13862,7 @@ F:    Documentation/mhi/
  F:    drivers/bus/mhi/
  F:    drivers/pci/endpoint/functions/pci-epf-mhi.c
  F:    include/linux/mhi.h
+F:    include/trace/events/mhi_host.h
    MICROBLAZE ARCHITECTURE
  M:    Michal Simek 
diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..3afa90a204fd 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
  #include 
  #include "internal.h"
  +#define CREATE_TRACE_POINTS
+#include <trace/events/mhi_host.h>


This feels redundant to me.  A few lines ago we included internal.h, 
and internal.h includes trace/events/mhi_host.h


As Steve mentioned, this is a mandatory step for creating trace points 
& trace events.


I understand this creates the trace points, and that needs to be done 
in C code. It still seems redundant because we are including the 
header twice (and I am aware trace has the special multi-header read 
functionality for this).


The duplicate include still feels weird, but I have not come up with a 
better way to structure this.
We will use this approach for now; in parallel, we will check if there 
is a way to avoid this and change it in the future.







+
  static DEFINE_IDA(mhi_controller_ida);
    const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/internal.h 
b/drivers/bus/mhi/host/internal.h

index 2e139e76de4c..a80a317a59a9 100644
--- a/drivers/bus/mhi/host/internal.h
+++ b/drivers/bus/mhi/host/internal.h
@@ -7,6 +7,7 @@
  #ifndef _MHI_INT_H
  #define _MHI_INT_H
  +#include <trace/events/mhi_host.h>
  #include "../common.h"
    extern struct bus_type mhi_bus_type;
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..fcdb728ba49f 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -246,6 +246,11 @@ static void *mhi_to_virtual(struct mhi_ring 
*ring, dma_addr_t addr)

  return (addr - ring->iommu_base) + ring->base;
  }
  +dma_addr_t mhi_to_physical(struct mhi_ring *ring, void *addr)
+{
+    return (addr - ring->base) + ring->iommu_base;
+}


This seems to be poorly named since we are using the iommu_base 
which suggests we are converting to an IOVA.


Why do we need this though?  This seems like it might be a security 
issue, or at the very least, not preferred, and I'm struggling to 
figure out what value this provides to you or me when looking at the 
log.



I will rename the function to reflect it is converting to IOVA.

In the MHI TRE we write the IOVA address; to correlate the TRE events 
in the MHI ring with the event we are processing, we want to log the 
IOVA address.


As we are logging only the IOVA address, which is provided in the 
devicetree, and not the original physical address, we are not expecting 
any security issues here.


Correct me if I was wrong.


The IOVA is not provided by DT, it is a runtime allocated value 
provided by the IOMMU, if present.  If not present, then it is a 
physical address.


Remember, x86 does not use devicetree.

While the IOVA (with an IOMMU) is not technically a physical address, 
it is treated as such by the device.  I can imagine an attacker doing 
bad things if they get hold of the value.

Sure we will remove it.


Still, you haven't indicated why this is useful.


The TRE ring elements have addresses in the IOVA format; when we want to 
correlate an address with the TRE elements in the dumps, it will be 
easier this way.


Anyway, we will not expose this as you suggested, since it might expose 
the physical address on some platforms.







+
  static void mhi_add_ring_element(struct 

Re: [PATCH v2] eventfs: Test for ei->is_freed when accessing ei->dentry

2023-10-30 Thread Naresh Kamboju
Hi Steven,

> Are you sure it was applied correctly?

Please ignore the build warnings / errors; they were due to the patch
not applying successfully.

> Perhaps check out the branch I
> have and let me know if you get the same errors.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace/core

I have tested the linux-trace.git trace/core branch and run the ftrace
selftests. The reported kernel panic [1] & [2] has been fixed, but I found
a "general protection fault" at kernel/trace/trace_events.c:2439.

[1] 
https://lore.kernel.org/all/CA+G9fYt6pY+tMZEOg=soeywqoe19fgp3ur15sgowkdk+_x8...@mail.gmail.com/
[2] 
https://lore.kernel.org/linux-trace-kernel/20231028164650.4f5ea...@rorschach.local.home/

Reported-by: Linux Kernel Functional Testing 

Test log:
==
kselftest: Running tests in ftrace
TAP version 13
1..1
# timeout set to 0
# selftests: ftrace: ftracetest-ktap

...

# ok 46 Test creation and deletion of trace instances while setting an event
[  332.783872] general protection fault, probably for non-canonical
address 0x6b6b6b6b6b6b6bcb:  [#1] PREEMPT SMP PTI
[  332.794585] CPU: 1 PID: 5165 Comm: ls Not tainted 6.6.0-rc4 #1
[  332.800429] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.7 12/07/2021
[  332.807820] RIP: 0010:event_callback (kernel/trace/trace_events.c:2439)
[ 332.812353] Code: f6 48 c7 c6 09 78 6b a0 41 55 49 89 cd 41 54 49 89
d4 53 48 8b 02 48 89 fb 4c 8b 78 10 e8 10 8b 19 01 85 c0 0f 84 d2 00
00 00 <41> 8b 47 60 a8 08 0f 85 9c 00 00 00 49 8b 47 10 48 83 78 18 00
74
All code

   0: f6 48 c7 c6  testb  $0xc6,-0x39(%rax)
   4: 09 78 6b  or %edi,0x6b(%rax)
   7: a0 41 55 49 89 cd 41 movabs 0x495441cd89495541,%al
   e: 54 49
  10: 89 d4mov%edx,%esp
  12: 53push   %rbx
  13: 48 8b 02  mov(%rdx),%rax
  16: 48 89 fb  mov%rdi,%rbx
  19: 4c 8b 78 10  mov0x10(%rax),%r15
  1d: e8 10 8b 19 01callq  0x1198b32
  22: 85 c0test   %eax,%eax
  24: 0f 84 d2 00 00 00je 0xfc
  2a:* 41 8b 47 60  mov0x60(%r15),%eax <-- trapping instruction
  2e: a8 08test   $0x8,%al
  30: 0f 85 9c 00 00 00jne0xd2
  36: 49 8b 47 10  mov0x10(%r15),%rax
  3a: 48 83 78 18 00cmpq   $0x0,0x18(%rax)
  3f: 74.byte 0x74

Code starting with the faulting instruction
===
   0: 41 8b 47 60  mov0x60(%r15),%eax
   4: a8 08test   $0x8,%al
   6: 0f 85 9c 00 00 00jne0xa8
   c: 49 8b 47 10  mov0x10(%r15),%rax
  10: 48 83 78 18 00cmpq   $0x0,0x18(%rax)
  15: 74.byte 0x74
[  332.831089] RSP: 0018:ba5700967b50 EFLAGS: 00010202
[  332.836315] RAX: 0001 RBX: a06fb5d7 RCX: ba5700967bc8
[  332.843439] RDX: 0069 RSI: a06b7809 RDI: a06fb5d7
[  332.850563] RBP: ba5700967b78 R08: 8dbe12250134 R09: 
[  332.857722] R10: 8dbe122500e0 R11: 8dbe0d8f8fc8 R12: ba5700967bd0
[  332.864846] R13: ba5700967bc8 R14: ba5700967bc6 R15: 6b6b6b6b6b6b6b6b
[  332.871971] FS:  7fc5a5062d00() GS:8dc167a8()
knlGS:
[  332.880057] CS:  0010 DS:  ES:  CR0: 80050033
[  332.885801] CR2: 7f7f8dc50a1c CR3: 000113c84003 CR4: 003706e0
[  332.892928] DR0:  DR1:  DR2: 
[  332.900058] DR3:  DR6: fffe0ff0 DR7: 0400
[  332.907183] Call Trace:
[  332.909658]  
[  332.911760] ? show_regs (arch/x86/kernel/dumpstack.c:479)
[  332.915166] ? die_addr (arch/x86/kernel/dumpstack.c:421
arch/x86/kernel/dumpstack.c:460)
[  332.918486] ? exc_general_protection (arch/x86/kernel/traps.c:697
arch/x86/kernel/traps.c:642)
[  332.923191] ? asm_exc_general_protection
(arch/x86/include/asm/idtentry.h:564)
[  332.928073] ? event_callback (kernel/trace/trace_events.c:2439)
[  332.931996] dcache_dir_open_wrapper (fs/tracefs/event_inode.c:745)
[  332.936636] ? __pfx_dcache_dir_open_wrapper (fs/tracefs/event_inode.c:683)
[  332.941772] do_dentry_open (fs/open.c:929)
[  332.945646] vfs_open (fs/open.c:1064)
[  332.948800] path_openat (fs/namei.c:3639 fs/namei.c:3796)
[  332.952465] ? ___slab_alloc (mm/slub.c:810 mm/slub.c:3265)
[  332.956384] ? nfs_ctx_key_to_expire (fs/nfs/write.c:1281)
[  332.960917] ? trace_preempt_on (kernel/trace/trace_preemptirq.c:105)
[  332.964929] do_filp_open (fs/namei.c:3823)
[  332.968511] do_sys_openat2 (fs/open.c:1422)
[  332.972173] __x64_sys_openat (fs/open.c:1448)
[  332.976014] do_syscall_64 (arch/x86/entry/common.c:50
arch/x86/entry/common.c:80)
[  332.979601] ? trace_hardirqs_on_prepare
(kernel/trace/trace_preemptirq.c:47 (discriminator 16)
kernel/trace/trace_preemptirq.c:42 (discriminator 16))
[  332.984436] ? syscall_exit_to_user_mode (kernel/entry/common.c:299)
[  332.989219] ? 

Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-10-30 Thread Mike Rapoport
On Thu, Oct 26, 2023 at 11:24:39AM +0100, Will Deacon wrote:
> On Thu, Oct 26, 2023 at 11:58:00AM +0300, Mike Rapoport wrote:
> > On Mon, Oct 23, 2023 at 06:14:20PM +0100, Will Deacon wrote:
> > > On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> > > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > > > index dd851297596e..cd6320de1c54 100644
> > > > --- a/arch/arm64/kernel/module.c
> > > > +++ b/arch/arm64/kernel/module.c

...

> > > > -   if (module_direct_base) {
> > > > -   p = __vmalloc_node_range(size, MODULE_ALIGN,
> > > > -module_direct_base,
> > > > -module_direct_base + SZ_128M,
> > > > -GFP_KERNEL | __GFP_NOWARN,
> > > > -PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > > -__builtin_return_address(0));
> > > > -   }
> > > > +   module_init_limits();
> > > 
> > > Hmm, this used to be run from subsys_initcall(), but now you're running
> > > it _really_ early, before random_init(), so randomization of the module
> > > space is no longer going to be very random if we don't have early entropy
> > > from the firmware or the CPU, which is likely to be the case on most SoCs.
> > 
> > Well, it will be as random as KASLR. Won't that be enough?
> 
> I don't think that's true -- we have the 'kaslr-seed' property for KASLR,
> but I'm not seeing anything like that for the module randomisation and I
> also don't see why we need to set these limits so early.

x86 needs execmem initialized before ftrace_init(), so I thought it would be
best to set up execmem along with most of MM in mm_core_init().

I'll move execmem initialization for !x86 to a later point, say
core_initcall.
 
> Will

-- 
Sincerely yours,
Mike.