Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-04-01 Thread Xi Ruoyao
On Tue, 2024-04-02 at 11:34 +0800, maobibo wrote:


> Are you sure that it's impossible to read some data used by the kernel
> internally?

Yes.

> There is another issue, since kernel restore T0-T7 registers and user
> space save T0-T7. Why T0-T7 is scratch registers rather than preserve
> registers like other architecture? What is the advantage if it is
> scratch registers?

I'd say "MIPS legacy."  Note that MIPS also does not preserve the temp
registers, and MIPS does not have the "info leak" issue either (or it
would have been assigned a CVE at some point in all these years).

I do agree maybe it's the time to move away from MIPS legacy and be more
similar to RISC-V etc now...

In Glibc we can condition __SYSCALL_CLOBBERS with #if
__LINUX_KERNEL_VERSION > xxx to take advantage of it.

Huacai, Xuerui, how do you think?

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University



Re: [PATCH 7/9] mm: Free up PG_slab

2024-04-01 Thread Matthew Wilcox
On Sun, Mar 31, 2024 at 11:11:10PM +0800, kernel test robot wrote:
> kernel test robot noticed "UBSAN:shift-out-of-bounds_in_fs/proc/page.c" on:
> 
> commit: 30e5296811312a13938b83956a55839ac1e3aa40 ("[PATCH 7/9] mm: Free up 
> PG_slab")

Quite right.  Spotted another one while I was at it.  Not able to test
right now, but this should do the trick:

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 5bc82828c6aa..55b01535eb22 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -175,6 +175,8 @@ u64 stable_page_flags(const struct page *page)
u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
+   if (folio_test_slab(folio))
+   u |= 1 << KPF_SLAB;
 
 #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
u |= kpf_copy_bit(k, KPF_IDLE,  PG_idle);
@@ -184,7 +186,6 @@ u64 stable_page_flags(const struct page *page)
 #endif
 
u |= kpf_copy_bit(k, KPF_LOCKED,PG_locked);
-   u |= kpf_copy_bit(k, KPF_SLAB,  PG_slab);
u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
u |= kpf_copy_bit(k, KPF_UPTODATE,  PG_uptodate);
diff --git a/tools/cgroup/memcg_slabinfo.py b/tools/cgroup/memcg_slabinfo.py
index 1d3a90d93fe2..270c28a0d098 100644
--- a/tools/cgroup/memcg_slabinfo.py
+++ b/tools/cgroup/memcg_slabinfo.py
@@ -146,12 +146,11 @@ def detect_kernel_config():
 
 
 def for_each_slab(prog):
-PGSlab = 1 << prog.constant('PG_slab')
-PGHead = 1 << prog.constant('PG_head')
+PGSlab = ~prog.constant('PG_slab')
 
 for page in for_each_page(prog):
 try:
-if page.flags.value_() & PGSlab:
+if page.page_type.value_() == PGSlab:
 yield cast('struct slab *', page)
 except FaultError:
 pass



Re: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

2024-04-01 Thread Yu Zhao
On Mon, Apr 1, 2024 at 7:30 PM James Houghton  wrote:
>
> Participate in bitmap-based aging while grabbing the KVM MMU lock for
> reading. Ideally we wouldn't need to grab this lock at all, but that
> would require a more intrustive and risky change.
   ^^ intrusive
This sounds subjective -- I'd just present the challenges and let
reviewers make their own judgements.

> Also pass
> KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in
> parallel with other walkers.
>
> It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable
> is destroyed while holding the lock for writing, and freeing of the page
> table pages is either done while holding the MMU lock for writing or
> after an RCU grace period.
>
> When mkold == false, record the young pages in the passed-in bitmap.
>
> When mkold == true, only age the pages that need aging according to the
> passed-in bitmap.
>
> Suggested-by: Yu Zhao 

Thanks but I did not suggest this.

What I have in v2 is RCU based. I hope Oliver or someone else can help
make that work. Otherwise we can just drop this for now and revisit
later.

(I have no problems with this patch if the Arm folks think the
RCU-based version doesn't have a good ROI.)



Re: [PATCH v5 2/3] arm64: dts: qcom: sc7280: Add UFS nodes for sc7280 soc

2024-04-01 Thread Manivannan Sadhasivam
On Fri, Mar 22, 2024 at 08:59:12AM +0100, Luca Weiss wrote:
> On Mon Dec 4, 2023 at 6:28 PM CET, Manivannan Sadhasivam wrote:
> > On Mon, Dec 04, 2023 at 01:21:42PM +0100, Luca Weiss wrote:
> > > On Mon Dec 4, 2023 at 1:15 PM CET, Nitin Rawat wrote:
> > > >
> > > >
> > > > On 12/4/2023 3:54 PM, Luca Weiss wrote:
> > > > > From: Nitin Rawat 
> > > > > 
> > > > > Add UFS host controller and PHY nodes for sc7280 soc.
> > > > > 
> > > > > Signed-off-by: Nitin Rawat 
> > > > > Reviewed-by: Konrad Dybcio 
> > > > > Tested-by: Konrad Dybcio  # QCM6490 FP5
> > > > > [luca: various cleanups and additions as written in the cover letter]
> > > > > Signed-off-by: Luca Weiss 
> > > > > ---
> > > > >   arch/arm64/boot/dts/qcom/sc7280.dtsi | 74 
> > > > > +++-
> > > > >   1 file changed, 73 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi 
> > > > > b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > index 04bf85b0399a..8b08569f2191 100644
> > > > > --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > @@ -15,6 +15,7 @@
> > > > >   #include 
> > > > >   #include 
> > > > >   #include 
> > > > > +#include 
> > > > >   #include 
> > > > >   #include 
> > > > >   #include 
> > > > > @@ -906,7 +907,7 @@ gcc: clock-controller@10 {
> > > > >   clocks = < RPMH_CXO_CLK>,
> > > > >< RPMH_CXO_CLK_A>, <_clk>,
> > > > ><0>, <_phy>,
> > > > > -  <0>, <0>, <0>,
> > > > > +  <_mem_phy 0>, <_mem_phy 1>, 
> > > > > <_mem_phy 2>,
> > > > ><_1_qmpphy 
> > > > > QMP_USB43DP_USB3_PIPE_CLK>;
> > > > >   clock-names = "bi_tcxo", "bi_tcxo_ao", 
> > > > > "sleep_clk",
> > > > > "pcie_0_pipe_clk", 
> > > > > "pcie_1_pipe_clk",
> > > > > @@ -2238,6 +2239,77 @@ pcie1_phy: phy@1c0e000 {
> > > > >   status = "disabled";
> > > > >   };
> > > > >   
> > > > > + ufs_mem_hc: ufs@1d84000 {
> > > > > + compatible = "qcom,sc7280-ufshc", "qcom,ufshc",
> > > > > +  "jedec,ufs-2.0";
> > > > > + reg = <0x0 0x01d84000 0x0 0x3000>;
> > > > > + interrupts = ;
> > > > > + phys = <_mem_phy>;
> > > > > + phy-names = "ufsphy";
> > > > > + lanes-per-direction = <2>;
> > > > > + #reset-cells = <1>;
> > > > > + resets = < GCC_UFS_PHY_BCR>;
> > > > > + reset-names = "rst";
> > > > > +
> > > > > + power-domains = < GCC_UFS_PHY_GDSC>;
> > > > > + required-opps = <_opp_nom>;
> > > > > +
> > > > > + iommus = <_smmu 0x80 0x0>;
> > > > > + dma-coherent;
> > > > > +
> > > > > + interconnects = <_noc MASTER_UFS_MEM 
> > > > > QCOM_ICC_TAG_ALWAYS
> > > > > +  _virt SLAVE_EBI1 
> > > > > QCOM_ICC_TAG_ALWAYS>,
> > > > > + <_noc MASTER_APPSS_PROC 
> > > > > QCOM_ICC_TAG_ALWAYS
> > > > > +   SLAVE_UFS_MEM_CFG 
> > > > > QCOM_ICC_TAG_ALWAYS>;
> > > > > + interconnect-names = "ufs-ddr", "cpu-ufs";
> > > > > +
> > > > > + clocks = < GCC_UFS_PHY_AXI_CLK>,
> > > > > +  < GCC_AGGRE_UFS_PHY_AXI_CLK>,
> > > > > +  < GCC_UFS_PHY_AHB_CLK>,
> > > > > +  < GCC_UFS_PHY_UNIPRO_CORE_CLK>,
> > > > > +  < RPMH_CXO_CLK>,
> > > > > +  < GCC_UFS_PHY_TX_SYMBOL_0_CLK>,
> > > > > +  < GCC_UFS_PHY_RX_SYMBOL_0_CLK>,
> > > > > +  < GCC_UFS_PHY_RX_SYMBOL_1_CLK>;
> > > > > + clock-names = "core_clk",
> > > > > +   "bus_aggr_clk",
> > > > > +   "iface_clk",
> > > > > +   "core_clk_unipro",
> > > > > +   "ref_clk",
> > > > > +   "tx_lane0_sync_clk",
> > > > > +   "rx_lane0_sync_clk",
> > > > > +   "rx_lane1_sync_clk";
> > > > > + freq-table-hz =
> > > > > + <7500 3>,
> > > > > + <0 0>,
> > > > > + <0 0>,
> > > > > + <7500 3>,
> > > > > + <0 0>,
> > > > > + <0 0>,
> > > > > + <0 0>,
> > > > > +   

Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-04-01 Thread maobibo




On 2024/4/2 10:49 AM, Xi Ruoyao wrote:

On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:

Sorry for the late reply, but I think it may be a bit non-constructive
to repeatedly submit the same code without due explanation in our
previous review threads. Let me try to recollect some of the details
though...

Because your review comments about hypercall method is wrong, I need not
adopt it.


Again it's unfair to say so considering the lack of LVZ documentation.

/* snip */



1. T0-T7 are scratch registers during SYSCALL ABI, this is what you
suggest, does there exist information leaking to user space from T0-T7
registers?


It's not a problem.  When syscall returns RESTORE_ALL_AND_RET is invoked
despite T0-T7 are not saved.  So a "junk" value will be read from the
leading PT_SIZE bytes of the kernel stack for this thread.

The leading PT_SIZE bytes of the kernel stack is dedicated for storing
the struct pt_regs representing the reg file of the thread in the
userspace.
Not all syscalls use the leading PT_SIZE bytes of the kernel stack. It gets 
complicated if a syscall is combined with interrupts and signals.




Thus we may only read out the userspace T0-T7 value stored when the same
thread was interrupted or trapped last time, or 0 (if the thread was
never interrupted or trapped before).

And it's impossible to read some data used by the kernel internally, or
some data of another thread.
Are you sure that it's impossible to read some data used by the kernel 
internally?


Regards
Bibo Mao


But indeed there is some improvement here.  Zeroing these registers
seems cleaner than reading out the junk values, and also faster (move
$t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's worthy
to violate Huacai's "keep things simple" aspiration though.






[PATCH] livepatch: Add KLP_IDLE state

2024-04-01 Thread zhangwarden
From: Wardenjohn 

In livepatch, using KLP_UNDEFINED seems confusing.
When the kernel is ready, livepatch is ready too; that state is
idle, not undefined. What's more, once a livepatch transition
is finished, the klp state should be idle rather than undefined.

Therefore, replacing KLP_UNDEFINED with KLP_IDLE is much better
for reading and understanding.
---
 include/linux/livepatch.h |  1 +
 kernel/livepatch/patch.c  |  2 +-
 kernel/livepatch/transition.c | 24 
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9b9b38e89563..c1c53cd5b227 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -19,6 +19,7 @@
 
 /* task patch states */
 #define KLP_UNDEFINED  -1
+#define KLP_IDLE   -1
 #define KLP_UNPATCHED   0
 #define KLP_PATCHED 1
 
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 4152c71507e2..01d3219289ee 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -95,7 +95,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
patch_state = current->patch_state;
 
-   WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(patch_state == KLP_IDLE);
 
if (patch_state == KLP_UNPATCHED) {
/*
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index e54c3d60a904..73f8f98dba84 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -23,7 +23,7 @@ static DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], 
klp_stack_entries);
 
 struct klp_patch *klp_transition_patch;
 
-static int klp_target_state = KLP_UNDEFINED;
+static int klp_target_state = KLP_IDLE;
 
 static unsigned int klp_signals_cnt;
 
@@ -123,21 +123,21 @@ static void klp_complete_transition(void)
klp_for_each_func(obj, func)
func->transition = false;
 
-   /* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */
+   /* Prevent klp_ftrace_handler() from seeing KLP_IDLE state */
if (klp_target_state == KLP_PATCHED)
klp_synchronize_transition();
 
read_lock(_lock);
for_each_process_thread(g, task) {
WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-   task->patch_state = KLP_UNDEFINED;
+   task->patch_state = KLP_IDLE;
}
read_unlock(_lock);
 
for_each_possible_cpu(cpu) {
task = idle_task(cpu);
WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-   task->patch_state = KLP_UNDEFINED;
+   task->patch_state = KLP_IDLE;
}
 
klp_for_each_object(klp_transition_patch, obj) {
@@ -152,7 +152,7 @@ static void klp_complete_transition(void)
pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
-   klp_target_state = KLP_UNDEFINED;
+   klp_target_state = KLP_IDLE;
klp_transition_patch = NULL;
 }
 
@@ -455,7 +455,7 @@ void klp_try_complete_transition(void)
struct klp_patch *patch;
bool complete = true;
 
-   WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
/*
 * Try to switch the tasks to the target patch state by walking their
@@ -532,7 +532,7 @@ void klp_start_transition(void)
struct task_struct *g, *task;
unsigned int cpu;
 
-   WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
pr_notice("'%s': starting %s transition\n",
  klp_transition_patch->mod->name,
@@ -578,7 +578,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
struct klp_func *func;
int initial_state = !state;
 
-   WARN_ON_ONCE(klp_target_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state != KLP_IDLE);
 
klp_transition_patch = patch;
 
@@ -597,7 +597,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
 */
read_lock(_lock);
for_each_process_thread(g, task) {
-   WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(task->patch_state != KLP_IDLE);
task->patch_state = initial_state;
}
read_unlock(_lock);
@@ -607,19 +607,19 @@ void klp_init_transition(struct klp_patch *patch, int 
state)
 */
for_each_possible_cpu(cpu) {
task = idle_task(cpu);
-   WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(task->patch_state != KLP_IDLE);
task->patch_state = initial_state;
}
 
/*
 * Enforce the order of the task->patch_state initializations and the
 * func->transition updates to 

[PATCH] livepatch: Add KLP_IDLE state

2024-04-01 Thread zhangwarden
From: Wardenjohn 

---
 include/linux/livepatch.h |  1 +
 kernel/livepatch/patch.c  |  2 +-
 kernel/livepatch/transition.c | 24 
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9b9b38e89563..c1c53cd5b227 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -19,6 +19,7 @@
 
 /* task patch states */
 #define KLP_UNDEFINED  -1
+#define KLP_IDLE   -1
 #define KLP_UNPATCHED   0
 #define KLP_PATCHED 1
 
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 4152c71507e2..01d3219289ee 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -95,7 +95,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
patch_state = current->patch_state;
 
-   WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(patch_state == KLP_IDLE);
 
if (patch_state == KLP_UNPATCHED) {
/*
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index e54c3d60a904..73f8f98dba84 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -23,7 +23,7 @@ static DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], 
klp_stack_entries);
 
 struct klp_patch *klp_transition_patch;
 
-static int klp_target_state = KLP_UNDEFINED;
+static int klp_target_state = KLP_IDLE;
 
 static unsigned int klp_signals_cnt;
 
@@ -123,21 +123,21 @@ static void klp_complete_transition(void)
klp_for_each_func(obj, func)
func->transition = false;
 
-   /* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */
+   /* Prevent klp_ftrace_handler() from seeing KLP_IDLE state */
if (klp_target_state == KLP_PATCHED)
klp_synchronize_transition();
 
read_lock(_lock);
for_each_process_thread(g, task) {
WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-   task->patch_state = KLP_UNDEFINED;
+   task->patch_state = KLP_IDLE;
}
read_unlock(_lock);
 
for_each_possible_cpu(cpu) {
task = idle_task(cpu);
WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-   task->patch_state = KLP_UNDEFINED;
+   task->patch_state = KLP_IDLE;
}
 
klp_for_each_object(klp_transition_patch, obj) {
@@ -152,7 +152,7 @@ static void klp_complete_transition(void)
pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
-   klp_target_state = KLP_UNDEFINED;
+   klp_target_state = KLP_IDLE;
klp_transition_patch = NULL;
 }
 
@@ -455,7 +455,7 @@ void klp_try_complete_transition(void)
struct klp_patch *patch;
bool complete = true;
 
-   WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
/*
 * Try to switch the tasks to the target patch state by walking their
@@ -532,7 +532,7 @@ void klp_start_transition(void)
struct task_struct *g, *task;
unsigned int cpu;
 
-   WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
pr_notice("'%s': starting %s transition\n",
  klp_transition_patch->mod->name,
@@ -578,7 +578,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
struct klp_func *func;
int initial_state = !state;
 
-   WARN_ON_ONCE(klp_target_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(klp_target_state != KLP_IDLE);
 
klp_transition_patch = patch;
 
@@ -597,7 +597,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
 */
read_lock(_lock);
for_each_process_thread(g, task) {
-   WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(task->patch_state != KLP_IDLE);
task->patch_state = initial_state;
}
read_unlock(_lock);
@@ -607,19 +607,19 @@ void klp_init_transition(struct klp_patch *patch, int 
state)
 */
for_each_possible_cpu(cpu) {
task = idle_task(cpu);
-   WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+   WARN_ON_ONCE(task->patch_state != KLP_IDLE);
task->patch_state = initial_state;
}
 
/*
 * Enforce the order of the task->patch_state initializations and the
 * func->transition updates to ensure that klp_ftrace_handler() doesn't
-* see a func in transition with a task->patch_state of KLP_UNDEFINED.
+* see a func in transition with a task->patch_state of KLP_IDLE.
 *
 * Also enforce the order of the klp_target_state write and future
 * TIF_PATCH_PENDING writes to ensure 

Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-04-01 Thread maobibo




On 2024/4/2 10:49 AM, Xi Ruoyao wrote:

On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:

Sorry for the late reply, but I think it may be a bit non-constructive
to repeatedly submit the same code without due explanation in our
previous review threads. Let me try to recollect some of the details
though...

Because your review comments about hypercall method is wrong, I need not
adopt it.


Again it's unfair to say so considering the lack of LVZ documentation.

/* snip */



1. T0-T7 are scratch registers during SYSCALL ABI, this is what you
suggest, does there exist information leaking to user space from T0-T7
registers?


It's not a problem.  When syscall returns RESTORE_ALL_AND_RET is invoked
despite T0-T7 are not saved.  So a "junk" value will be read from the
leading PT_SIZE bytes of the kernel stack for this thread.

For you it is a "junk" value; some people may think it is useful.

There is another issue: the kernel restores T0-T7 registers while user 
space saves T0-T7. Why are T0-T7 scratch registers rather than preserved 
registers as on other architectures? What is the advantage of making them 
scratch registers?


Regards
Bibo Mao


The leading PT_SIZE bytes of the kernel stack is dedicated for storing
the struct pt_regs representing the reg file of the thread in the
userspace.

Thus we may only read out the userspace T0-T7 value stored when the same
thread was interrupted or trapped last time, or 0 (if the thread was
never interrupted or trapped before).

And it's impossible to read some data used by the kernel internally, or
some data of another thread.

But indeed there is some improvement here.  Zeroing these registers
seems cleaner than reading out the junk values, and also faster (move
$t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's worthy
to violate Huacai's "keep things simple" aspiration though.






Re: Re: general protection fault in refill_obj_stock

2024-04-01 Thread Roman Gushchin
On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> > On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> > Hello.
> > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> > Recently, our team has discovered a issue in Linux kernel 6.7. Attached to 
> > the email were a PoC file of the issue.
> 
> > Thank you for the report!
> 
> > I tried to compile and run your test program for about half an hour
> > on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> > to reproduce the problem.
> 
> > Can you, please, share a bit more information? How long does it take
> > to reproduce? Do you mind sharing your kernel config? Is there anything 
> > special
> > about your setup? What are exact steps to reproduce the problem?
> > Is this problem reproducible on 6.6?
> 
> Hi. 
>The .config of linux kernel 6.7 has send to you as attachment.

Thanks!

How long does it take to reproduce the problem? Do you just start your reproducer 
and wait?

> And The problem is reproducible on 6.6.

Hm, it rules out my recent changes.
Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
If it's fast to reproduce, it might be the best option.

Also, are you running vanilla kernels or do you have some custom changes on top?

Thanks!



Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-04-01 Thread Xi Ruoyao
On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:
> > Sorry for the late reply, but I think it may be a bit non-constructive 
> > to repeatedly submit the same code without due explanation in our 
> > previous review threads. Let me try to recollect some of the details
> > though...
> Because your review comments about hypercall method is wrong, I need not 
> adopt it.

Again it's unfair to say so considering the lack of LVZ documentation.

/* snip */

> 
> 1. T0-T7 are scratch registers during SYSCALL ABI, this is what you 
> suggest, does there exist information leaking to user space from T0-T7
> registers?

It's not a problem.  When a syscall returns, RESTORE_ALL_AND_RET is
invoked even though T0-T7 were not saved.  So "junk" values will be read
from the leading PT_SIZE bytes of the kernel stack for this thread.

The leading PT_SIZE bytes of the kernel stack are dedicated to storing
the struct pt_regs representing the register file of the thread in
userspace.

Thus we can only read out the userspace T0-T7 values stored when the same
thread was interrupted or trapped last time, or 0 (if the thread was
never interrupted or trapped before).

And it's impossible to read some data used by the kernel internally, or
data of another thread.

But indeed there is some room for improvement here.  Zeroing these
registers seems cleaner than reading out the junk values, and also faster
(move $t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's
worth violating Huacai's "keep things simple" aspiration though.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt  wrote:
> >  
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > >  
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.  
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >  
> > > > >
> > > > > We could have other options like PROVE_LOCKING enable it.  
> > > >
> > > > Agreed (but it should say this is a debug option)  
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?  
> >
> > I think the help message should have "This is for debugging ftrace."
> >  
> 
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates if it is
executed properly. In other words, I never want to be asked "How can I use
this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

that makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

Which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a
1 in 100,000,000,000 chance of causing an actual bug, as the race window is
extremely small. 

Now if it is also called outside of instrumentation, that will likely trigger
other warnings even without this code, and this will not be needed to debug
that.

ftrace has all sorts of "verifiers" that are used to make sure things are
working properly. And yes, you can consider that "debugging". But I would
not consider this an option to enable if ftrace were broken and you were
looking into why it is broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
are, it can cause a race. But it also has a noticeable overhead when enabled.

-- Steve





Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt  wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."
>

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> >
> > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level debugging of the ftrace subsystem. For most users it should
probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This is a feature useful for debugging ftrace. This will add more
+ overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor

2024-04-01 Thread maobibo




On 2024/3/24 3:02 AM, WANG Xuerui wrote:

On 3/15/24 16:07, Bibo Mao wrote:

The cpucfg instruction can be used to get processor features, and it
triggers a trap exception when executed in VM mode, so it can also be
used to provide CPU features to a VM. On real hardware, cpucfg areas
0 - 20 are used.  Here one specified area 0x4000 -- 0x40ff is reserved
for the KVM hypervisor to provide PV features, and the area can be
extended for other hypervisors in the future. This area will never be
used by real HW; it is only used by software.

Signed-off-by: Bibo Mao 
---
  arch/loongarch/include/asm/inst.h  |  1 +
  arch/loongarch/include/asm/loongarch.h | 10 +
  arch/loongarch/kvm/exit.c  | 59 +++---
  3 files changed, 54 insertions(+), 16 deletions(-)



Sorry for the late reply, but I think it may be a bit non-constructive 
to repeatedly submit the same code without due explanation in our 
previous review threads. Let me try to recollect some of the details 
though...
Because your review comments about the hypercall method are wrong, I need 
not adopt them.


If I remember correctly, during the previous reviews, it was mentioned 
that the only upsides of using CPUCFG were:


- it was exactly identical to the x86 approach,
- it would not require access to the LoongArch Reference Manual Volume 3 
to use, and

- it was plain old data.

But for the first point: we don't have to follow the x86 convention, even 
though x86 virtualization has been successfully and widely applied in real 
products. It is normal to follow it when there are no obvious issues.


all. The second reason might be compelling, but on the one hand that's 
another problem orthogonal to the current one, and on the other hand 
HVCL is:


- already effectively public because of the fact that this very patchset 
is public,
- its semantics is trivial to implement even without access to the LVZ 
manual, because of its striking similarity with SYSCALL, and
- by being a function call, we reserve the possibility for hypervisors 
to invoke logic for self-identification purposes, even if this is likely 
overkill from today's perspective.


And, even if we decide that using HVCL for self-identification is 
overkill after all, we still have another choice that's IOCSR. We 
already read LOONGARCH_IOCSR_FEATURES (0x8) for its bit 11 (IOCSRF_VM) 
to populate the CPU_FEATURE_HYPERVISOR bit, and it's only natural that 
we put the identification word in the IOCSR space. As far as I can see, 
the IOCSR space is plenty and equally available for making reservations; 
it can only be even easier when it's done by a Loongson team.
The IOCSR method is also possible; in the chip design, CPUCFG is used for 
CPU features and IOCSR for device features. Here the CPUCFG method is 
selected. I am the KVM LoongArch maintainer, and I can decide which method 
to select as long as it works well. Is that right?


If you are interested in KVM LoongArch, you can submit more patches and 
become a maintainer, or write support for new hypervisors such as Xen/Xvisor 
and use your method there.


Also, since you are interested in the Linux kernel, there are some 
issues. Can you help to improve them?


1. T0-T7 are scratch registers in the SYSCALL ABI, which is what you 
suggested. Does any information leak to user space through the T0-T7 
registers?


2. LoongArch KVM depends on AS_HAS_LVZ_EXTENSION, which requires the 
latest binutils; this is also what you suggested. Some kernel developers 
do not have the latest binutils, so when common KVM code is modified and 
LoongArch KVM fails to compile, they cannot notice it, since their 
LoongArch cross-compiler is old and LoongArch KVM is therefore disabled. 
This issue can be found at https://lkml.org/lkml/2023/11/15/828.


Regards
Bibo Mao


Finally, I've mentioned multiple times that varying CPUCFG behavior 
based on PLV is not something well documented on the manuals, hence not 
friendly to low-level developers. Devs of third-party firmware and/or 
kernels do exist, I've personally spoken to some of them on the 
2023-11-18 3A6000 release event; in order for the varying CPUCFG 
behavior approach to pass for me, at the very least, the LoongArch 
reference manual must be amended to explicitly include an explanation of 
it, and a reference to potential use cases.







[PATCH v2] selftests/sgx: Improve cgroup test scripts

2024-04-01 Thread Haitao Huang
Make cgroup test scripts ash compatible.
Remove cg-tools dependency.
Add documentation for functions.

Tested with busybox on Ubuntu.

Signed-off-by: Haitao Huang 
---
v2:
- Fixes for v2 cgroup
- Turn off swapping before memcontrol tests and back on after
- Add comments and reformat
---
 tools/testing/selftests/sgx/ash_cgexec.sh |  57 ++
 .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++---
 .../selftests/sgx/watch_misc_for_tests.sh |  13 +-
 3 files changed, 179 insertions(+), 78 deletions(-)
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh

diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh 
b/tools/testing/selftests/sgx/ash_cgexec.sh
new file mode 100755
index ..9607784378df
--- /dev/null
+++ b/tools/testing/selftests/sgx/ash_cgexec.sh
@@ -0,0 +1,57 @@
+#!/usr/bin/env sh
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2024 Intel Corporation.
+
+# Move the current shell process to the specified cgroup
+# Arguments:
+#  $1 - The cgroup controller name, e.g., misc, memory.
+#  $2 - The path of the cgroup,
+#  relative to /sys/fs/cgroup for cgroup v2,
+#  relative to /sys/fs/cgroup/$1 for v1.
+move_to_cgroup() {
+controllers="$1"
+path="$2"
+
+# Check if cgroup v2 is in use
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+# Cgroup v2 logic
+cgroup_full_path="/sys/fs/cgroup/${path}"
+echo $$ > "${cgroup_full_path}/cgroup.procs"
+else
+# Cgroup v1 logic
+OLD_IFS="$IFS"
+IFS=','
+for controller in $controllers; do
+cgroup_full_path="/sys/fs/cgroup/${controller}/${path}"
+echo $$ > "${cgroup_full_path}/tasks"
+done
+IFS="$OLD_IFS"
+fi
+}
+
+if [ "$#" -lt 3 ] || [ "$1" != "-g" ]; then
+echo "Usage: $0 -g <controller:path> [-g <controller:path> ...] <command> [args...]"
+exit 1
+fi
+
+while [ "$#" -gt 0 ]; do
+case "$1" in
+-g)
+# Ensure that a controller:path pair is provided after -g
+if [ -z "$2" ]; then
+echo "Error: Missing controller:path argument after -g"
+exit 1
+fi
+IFS=':' read CONTROLLERS CGROUP_PATH <  $CG_MISC_ROOT/cgroup.subtree_control
+echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
+echo "+misc" >  $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+echo "+memory" > $CG_MEM_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+echo "+misc" >  $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
+fi
 
 CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}')
 # This is below number of VA pages needed for enclave of capacity size. So
@@ -48,34 +51,67 @@ echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max
 echo "sgx_epc $LARGE" >  $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max
 echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max
 
+if [ $? -ne 0 ]; then
+echo "# Failed setting up misc limits, make sure misc cgroup is mounted."
+exit 1
+fi
+
+clean_up_misc()
+{
+sleep 2
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB2
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB3
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB4
+rmdir $CG_MISC_ROOT/$TEST_CG_SUB1
+rmdir $CG_MISC_ROOT/$TEST_ROOT_CG
+}
+
 timestamp=$(date +%Y%m%d_%H%M%S)
 
 test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
 
+# Wait for a process and check for expected exit status.
+#
+# Arguments:
+#  $1 - the pid of the process to wait and check.
+#  $2 - 1 if expecting success, 0 for failure.
+#
+# Return:
+#  0 if the exit status of the process matches the expectation.
+#  1 otherwise.
 wait_check_process_status() {
-local pid=$1
-local check_for_success=$2  # If 1, check for success;
-# If 0, check for failure
+pid=$1
+check_for_success=$2  # If 1, check for success;
+  # If 0, check for failure
 wait "$pid"
-local status=$?
+status=$?
 
-if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then
+if [ $check_for_success -eq 1 ] && [ $status -eq 0 ]; then
 echo "# Process $pid succeeded."
 return 0
-elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then
+elif [ $check_for_success -eq 0 ] && [ $status -ne 0 ]; then
 echo "# Process $pid returned failure."
 return 0
 fi
 return 1
 }
 
+# Wait for a set of processes and check for expected exit status
+#
+# Arguments:
+#  $1 - 1 if expecting success, 0 for failure.
+#  remaining args - The pids of the processes
+#
+# Return:
+#  0 if exit status of any process matches the expectation.
+#  1 otherwise.
 wait_and_detect_for_any() {
-local pids=("$@")
-local check_for_success=$1  # If 1, check for success;
-# If 0, check for failure
-local detected=1 # 0 for success detection
+check_for_success=$1  # If 1, check for success;
+  # If 0, check for failure
+shift
+ 

Re: [PATCH v7 7/7] Documentation: KVM: Add hypercall for LoongArch

2024-04-01 Thread maobibo




On 2024/3/24 上午2:40, WANG Xuerui wrote:

On 3/15/24 16:11, Bibo Mao wrote:

[snip]
+KVM hypercall ABI
+=================
+
+The KVM hypercall ABI is simple: only one scratch register (a0) and at
+most five general-purpose registers are used as input parameters. FP and
+vector registers are not used as input registers and should not be
+modified during a hypercall. The hypercall function can be inlined since
+there is only one scratch register.


Maybe it's better to describe the list of preserved registers with an 
expression such as "all non-GPR registers shall remain unmodified after 
returning from the hypercall", to guard ourselves against future ISA 
state additions.
Sorry, I do not understand. What is the meaning of "all non-GPR 
registers"?  Can you give an example?


Regards
Bibo Mao


But I still maintain that it's better to promise less here, and only 
hint on the extensive preservation of context as an implementation 
detail. It is for not losing our ability to save/restore less in the 
future, should we decide to do so.







Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Mon, 1 Apr 2024 12:09:18 -0400
Steven Rostedt  wrote:

> On Mon, 1 Apr 2024 20:25:52 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > Masami,
> > > 
> > > Are you OK with just keeping it set to N.  
> > 
> > OK, if it is only for the debugging, I'm OK to set N this.
> > 
> > > 
> > > We could have other options like PROVE_LOCKING enable it.  
> > 
> > Agreed (but it should say this is a debug option)
> 
> It does say "Validate" which to me is a debug option. What would you
> suggest?

I think the help message should have "This is for debugging ftrace."

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



[PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-01 Thread Ho-Ren (Jack) Chuang
The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUlist nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids deadlock
but also prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
Reviewed-by: "Huang, Ying" 
---
 include/linux/memory-tiers.h |  5 +-
 mm/memory-tiers.c| 95 +---
 2 files changed, 81 insertions(+), 19 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a44c03c2ba3a..16769552a338 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
return -EIO;
 }
 
-struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head *memory_types)
 {
return NULL;
 }
 
-void mt_put_memory_types(struct list_head *memory_types)
+static inline void mt_put_memory_types(struct list_head *memory_types)
 {
 
 }
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..44fa10980d37 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {
 
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
 
@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
 
 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
 
+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct 
memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
struct memory_tier *memtier;
-   struct memory_dev_type *memtype;
+   struct memory_dev_type *mtype = default_dram_type;
+   int adist = MEMTIER_ADISTANCE_DRAM;
pg_data_t *pgdat = NODE_DATA(node);
 
 
@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
 
-   __init_node_memory_type(node, default_dram_type);
+   mt_calc_adistance(node, &adist);
+   if (node_memory_types[node].memtype == NULL) {
+   mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+   if (IS_ERR(mtype)) {
+   mtype = 

[PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch,
enhancing the scalability and conciseness of the code.

Signed-off-by: Ho-Ren (Jack) Chuang 
Reviewed-by: "Huang, Ying" 
---
 drivers/dax/kmem.c   | 20 ++--
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c| 32 
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
	mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry(mtype, &kmem_memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(&mtype->list, &kmem_memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
	mutex_unlock(&kmem_memory_type_lock);
 
return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type 
*kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
	mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-   list_del(&mtype->list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(&kmem_memory_types);
	mutex_unlock(&kmem_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head 
*memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct 
memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   bool found = false;
+   struct memory_dev_type *mtype;
+
+   list_for_each_entry(mtype, memory_types, list) {
+   if (mtype->adistance == adist) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mtype = alloc_memory_type(adist);
+   if (!IS_ERR(mtype))
+   list_add(&mtype->list, memory_types);
+   }
+
+   return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(&mtype->list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char 
*prefix)
 {
pr_info(
-- 
Ho-Ren (Jack) Chuang




[PATCH v10 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-04-01 Thread Ho-Ren (Jack) Chuang
When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable from
normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes to
the same DRAM tier. This results in normal memory devices with different
attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different
types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified,
it falls back to using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance
attributes through QEMU, and the guest now sees memory nodes with
performance attributes in HMAT. With this change, we enable the
guest kernel to construct the correct memory tiering for the memory nodes.

- v10:
 Thanks to Andrew's and SeongJae's comments,
 * Address kunit compilation errors
 * Resolve the bug of not returning the correct error code in
   `mt_perf_to_adistance`
-v9:
 * Address corner cases in `memory_tier_late_init`. Thank Ying's comments.
 * 
https://lore.kernel.org/lkml/20240329053353.309557-1-horenchu...@bytedance.com/T/#u
-v8:
 * Fix email format
 * 
https://lore.kernel.org/lkml/20240329004815.195476-1-horenchu...@bytedance.com/T/#u
-v7:
 * Add Reviewed-by: "Huang, Ying" 
-v6:
 Thanks to Ying's comments,
 * Move `default_dram_perf_lock` to the function's beginning for clarity
 * Fix double unlocking at v5
 * 
https://lore.kernel.org/lkml/20240327072729.3381685-1-horenchu...@bytedance.com/T/#u
-v5:
 Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
 * 
https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u
-v4:
 Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * 
https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * 
https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * 
https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * 
https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c   |  20 +-
 include/linux/memory-tiers.h |  14 
 mm/memory-tiers.c| 127 ++-
 3 files changed, 126 insertions(+), 35 deletions(-)

-- 
Ho-Ren (Jack) Chuang




Re: general protection fault in refill_obj_stock

2024-04-01 Thread Roman Gushchin
On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached to 
> the email was a PoC file for the issue.

Thank you for the report!

I tried to compile and run your test program for about half an hour
on a virtual machine running 6.7 with enabled KASAN, but wasn't able
to reproduce the problem.

Can you, please, share a bit more information? How long does it take
to reproduce? Do you mind sharing your kernel config? Is there anything special
about your setup? What are exact steps to reproduce the problem?
Is this problem reproducible on 6.6?

It's interesting that the problem looks like use-after-free for the objcg 
pointer
but happens in the context of udev-systemd, which I believe should be fairly 
stable
and its cgroup is not going anywhere.

Thanks!



Re: [PATCH 13/13] mailbox: omap: Remove kernel FIFO message queuing

2024-04-01 Thread Andrew Davis

On 4/1/24 6:39 PM, Hari Nagalla wrote:

On 3/25/24 12:20, Andrew Davis wrote:

The kernel FIFO queue has a couple issues. The biggest issue is that
it causes extra latency in a path that can be used in real-time tasks,
such as communication with real-time remote processors.

The whole FIFO idea itself looks to be a leftover from before the
unified mailbox framework. The current mailbox framework expects
mbox_chan_received_data() to be called with data immediately as it
arrives. Remove the FIFO and pass the messages to the mailbox
framework directly.

Yes, this would definitely speed up the message receive path. With RT Linux, 
the IRQ runs in thread context, so that is OK. But with non-RT the whole 
receive path runs in interrupt context. So, I think it would be appropriate to 
use a threaded_irq()?


I was thinking the same at first, but it seems some mailbox drivers use a 
threaded context while others use a non-threaded one. Since all we do in the 
IRQ context anymore is call mbox_chan_received_data(), which is supposed to be 
IRQ safe, it should be fine either way. So for now I just kept this using the 
regular IRQ context as before.

If that does turn out to be an issue then let's switch to threaded.

Andrew



Re: [PATCH 12/13] mailbox: omap: Reverse FIFO busy check logic

2024-04-01 Thread Andrew Davis

On 4/1/24 6:31 PM, Hari Nagalla wrote:

On 3/25/24 12:20, Andrew Davis wrote:

  static int omap_mbox_chan_send_noirq(struct omap_mbox *mbox, u32 msg)
  {
-    int ret = -EBUSY;
+    if (mbox_fifo_full(mbox))
+    return -EBUSY;
-    if (!mbox_fifo_full(mbox)) {
-    omap_mbox_enable_irq(mbox, IRQ_RX);
-    mbox_fifo_write(mbox, msg);
-    ret = 0;
-    omap_mbox_disable_irq(mbox, IRQ_RX);
+    omap_mbox_enable_irq(mbox, IRQ_RX);
+    mbox_fifo_write(mbox, msg);
+    omap_mbox_disable_irq(mbox, IRQ_RX);
-    /* we must read and ack the interrupt directly from here */
-    mbox_fifo_read(mbox);
-    ack_mbox_irq(mbox, IRQ_RX);
-    }
+    /* we must read and ack the interrupt directly from here */
+    mbox_fifo_read(mbox);
+    ack_mbox_irq(mbox, IRQ_RX);
-    return ret;
+    return 0;
  }

Isn't the interrupt supposed to be IRQ_TX above, i.e., the TX ready signal?


Hmm, could be, but this patch doesn't actually change anything, only moves code
around for readability. So if we are ack'ing the wrong interrupt, then it
was wrong before. We should check that and fix it if needed in a follow-up 
patch.

Andrew



Re: [PATCH 13/13] mailbox: omap: Remove kernel FIFO message queuing

2024-04-01 Thread Hari Nagalla

On 3/25/24 12:20, Andrew Davis wrote:

The kernel FIFO queue has a couple issues. The biggest issue is that
it causes extra latency in a path that can be used in real-time tasks,
such as communication with real-time remote processors.

The whole FIFO idea itself looks to be a leftover from before the
unified mailbox framework. The current mailbox framework expects
mbox_chan_received_data() to be called with data immediately as it
arrives. Remove the FIFO and pass the messages to the mailbox
framework directly.
Yes, this would definitely speed up the message receive path. With RT 
Linux, the IRQ runs in thread context, so that is OK. But with non-RT 
the whole receive path runs in interrupt context. So, I think it would 
be appropriate to use a threaded_irq()?




Re: [PATCH 12/13] mailbox: omap: Reverse FIFO busy check logic

2024-04-01 Thread Hari Nagalla

On 3/25/24 12:20, Andrew Davis wrote:
  
  static int omap_mbox_chan_send_noirq(struct omap_mbox *mbox, u32 msg)

  {
-   int ret = -EBUSY;
+   if (mbox_fifo_full(mbox))
+   return -EBUSY;
  
-	if (!mbox_fifo_full(mbox)) {

-   omap_mbox_enable_irq(mbox, IRQ_RX);
-   mbox_fifo_write(mbox, msg);
-   ret = 0;
-   omap_mbox_disable_irq(mbox, IRQ_RX);
+   omap_mbox_enable_irq(mbox, IRQ_RX);
+   mbox_fifo_write(mbox, msg);
+   omap_mbox_disable_irq(mbox, IRQ_RX);
  
-		/* we must read and ack the interrupt directly from here */

-   mbox_fifo_read(mbox);
-   ack_mbox_irq(mbox, IRQ_RX);
-   }
+   /* we must read and ack the interrupt directly from here */
+   mbox_fifo_read(mbox);
+   ack_mbox_irq(mbox, IRQ_RX);
  
-	return ret;

+   return 0;
  }

Isn't the interrupt supposed to be IRQ_TX above, i.e., the TX ready signal?



[PATCH v3 7/7] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

2024-04-01 Thread James Houghton
From: Yu Zhao 

Use mmu_notifier_{test,clear}_young_bitmap() to handle KVM PTEs in
batches when the fast path is supported. This reduces the contention on
kvm->mmu_lock when the host is under heavy memory pressure.

An existing selftest can quickly demonstrate the effectiveness of
this patch. On a generic workstation equipped with 128 CPUs and 256GB
DRAM:

  $ sudo max_guest_memory_test -c 64 -m 250 -s 250

  MGLRU run2
  --
  Before [1]~64s
  After ~51s

  kswapd (MGLRU before)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.71%  evict_folios
  97.29%  shrink_folio_list
  ==>>  13.05%  folio_referenced
  12.83%  rmap_walk_file
12.31%  folio_referenced_one
  7.90%  __mmu_notifier_clear_young
7.72%  kvm_mmu_notifier_clear_young
  7.34%  _raw_write_lock

  kswapd (MGLRU after)
100.00%  balance_pgdat
  100.00%  shrink_node
100.00%  shrink_one
  99.99%  try_to_shrink_lruvec
99.59%  evict_folios
  80.37%  shrink_folio_list
  ==>>  3.74%  folio_referenced
  3.59%  rmap_walk_file
3.19%  folio_referenced_one
  2.53%  lru_gen_look_around
1.06%  __mmu_notifier_test_clear_young

[1] "mm: rmap: Don't flush TLB after checking PTE young for page
reference" was included so that the comparison is apples to
apples.
https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/

Signed-off-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 include/linux/mmzone.h|   6 +-
 mm/rmap.c |   9 +-
 mm/vmscan.c   | 183 ++
 4 files changed, 159 insertions(+), 45 deletions(-)

diff --git a/Documentation/admin-guide/mm/multigen_lru.rst 
b/Documentation/admin-guide/mm/multigen_lru.rst
index 33e068830497..0ae2a6d4d94c 100644
--- a/Documentation/admin-guide/mm/multigen_lru.rst
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -48,6 +48,10 @@ Values Components
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
+0x0008 Clearing the accessed bit in KVM page table entries in large
+   batches, when KVM MMU sets it (e.g., on x86_64). This can
+   improve the performance of guests when the host is under memory
+   pressure.
 [yYnN] Apply to all the components above.
 == ===
 
@@ -56,7 +60,7 @@ E.g.,
 
 echo y >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
-0x0007
+0x000f
 echo 5 >/sys/kernel/mm/lru_gen/enabled
 cat /sys/kernel/mm/lru_gen/enabled
 0x0005
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..a98de5106990 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -397,6 +397,7 @@ enum {
LRU_GEN_CORE,
LRU_GEN_MM_WALK,
LRU_GEN_NONLEAF_YOUNG,
+   LRU_GEN_KVM_MMU_WALK,
NR_LRU_GEN_CAPS
 };
 
@@ -554,7 +555,7 @@ struct lru_gen_memcg {
 
 void lru_gen_init_pgdat(struct pglist_data *pgdat);
 void lru_gen_init_lruvec(struct lruvec *lruvec);
-void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
+bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
 void lru_gen_exit_memcg(struct mem_cgroup *memcg);
@@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec 
*lruvec)
 {
 }
 
-static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 {
+   return false;
 }
 
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
diff --git a/mm/rmap.c b/mm/rmap.c
index 56b313aa2ebf..41e9fc25684e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -871,13 +871,10 @@ static bool folio_referenced_one(struct folio *folio,
continue;
}
 
-   if (pvmw.pte) {
-   if (lru_gen_enabled() &&
-   pte_young(ptep_get(pvmw.pte))) {
-   lru_gen_look_around(&pvmw);
+   if (lru_gen_enabled() && pvmw.pte) {
+   if (lru_gen_look_around(&pvmw))
referenced++;
-   }
-
+   } else if (pvmw.pte) {
if (ptep_clear_flush_young_notify(vma, address,
pvmw.pte))
referenced++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 293120fe54f3..fd65f3466dfc 

[PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging

2024-04-01 Thread James Houghton
Participate in bitmap-based aging while grabbing the KVM MMU lock for
reading. Ideally we wouldn't need to grab this lock at all, but that
would require a more intrusive and risky change. Also pass
KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in
parallel with other walkers.

It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable
is destroyed while holding the lock for writing, and freeing of the page
table pages is either done while holding the MMU lock for writing or
after an RCU grace period.

When mkold == false, record the young pages in the passed-in bitmap.

When mkold == true, only age the pages that need aging according to the
passed-in bitmap.

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 arch/arm64/include/asm/kvm_host.h|  5 +
 arch/arm64/include/asm/kvm_pgtable.h |  4 +++-
 arch/arm64/kvm/hyp/pgtable.c | 21 ++---
 arch/arm64/kvm/mmu.c | 23 +--
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 9e8a496fb284..e503553cb356 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1331,4 +1331,9 @@ bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
(get_idreg_field((kvm), id, fld) >= expand_field_sign(id, fld, min) && \
 get_idreg_field((kvm), id, fld) <= expand_field_sign(id, fld, max))
 
+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn);
+#define kvm_arch_finish_bitmap_age kvm_arch_finish_bitmap_age
+void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn);
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h 
b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe7978..1976b4e26188 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -644,6 +644,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable 
*pgt, u64 addr);
  * @addr:  Intermediate physical address to identify the page-table entry.
  * @size:  Size of the address range to visit.
  * @mkold: True if the access flag should be cleared.
+ * @range: The kvm_gfn_range that is being used for the memslot walker.
  *
  * The offset of @addr within a page is ignored.
  *
@@ -657,7 +658,8 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr);
  * Return: True if any of the visited PTEs had the access flag set.
  */
 bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
-u64 size, bool mkold);
+u64 size, bool mkold,
+struct kvm_gfn_range *range);
 
 /**
  * kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3fae5830f8d2..e881d3595aca 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1281,6 +1281,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr)
 }
 
 struct stage2_age_data {
+   struct kvm_gfn_range *range;
boolmkold;
boolyoung;
 };
@@ -1290,20 +1291,24 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 {
kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF;
struct stage2_age_data *data = ctx->arg;
+   gfn_t gfn = ctx->addr / PAGE_SIZE;
 
if (!kvm_pte_valid(ctx->old) || new == ctx->old)
return 0;
 
data->young = true;
 
+
/*
-* stage2_age_walker() is always called while holding the MMU lock for
-* write, so this will always succeed. Nonetheless, this deliberately
-* follows the race detection pattern of the other stage-2 walkers in
-* case the locking mechanics of the MMU notifiers is ever changed.
+* stage2_age_walker() may not be holding the MMU lock for write, so
+* follow the race detection pattern of the other stage-2 walkers.
 */
-   if (data->mkold && !stage2_try_set_pte(ctx, new))
-   return -EAGAIN;
+   if (data->mkold) {
+   if (kvm_gfn_should_age(data->range, gfn) &&
+   !stage2_try_set_pte(ctx, new))
+   return -EAGAIN;
+   } else
+   kvm_gfn_record_young(data->range, gfn);
 
/*
 * "But where's the TLBI?!", you scream.
@@ -1315,10 +1320,12 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
 }
 
 bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr,
-u64 size, bool mkold)
+u64 size, bool mkold,
+struct kvm_gfn_range *range)
 {
struct stage2_age_data 

[PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-01 Thread James Houghton
Only handle the TDP MMU case for now. In other cases, if a bitmap was
not provided, fall back to the slowpath that takes mmu_lock, or, if a
bitmap was provided, inform the caller that the bitmap is unreliable.

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 arch/x86/include/asm/kvm_host.h | 14 ++
 arch/x86/kvm/mmu/mmu.c  | 16 ++--
 arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
 3 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b58e2306621..c30918d0887e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
  */
 #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
 
+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+   /*
+* Indicate that we support bitmap-based aging when using the TDP MMU
+* and the accessed bit is available in the TDP page tables.
+*
+* We have no other preparatory work to do here, so we do not need to
+* redefine kvm_arch_finish_bitmap_age().
+*/
+   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
+&& shadow_accessed_mask;
+}
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 992e651540e8..fae1a75750bb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
bool young = false;
 
-   if (kvm_memslots_have_rmaps(kvm))
+   if (kvm_memslots_have_rmaps(kvm)) {
+   if (range->lockless) {
+   kvm_age_set_unreliable(range);
+   return false;
+   }
+
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
+   }
 
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
@@ -1687,8 +1693,14 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
bool young = false;
 
-   if (kvm_memslots_have_rmaps(kvm))
+   if (kvm_memslots_have_rmaps(kvm)) {
+   if (range->lockless) {
+   kvm_age_set_unreliable(range);
+   return false;
+   }
+
young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
+   }
 
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d078157e62aa..edea01bc145f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1217,6 +1217,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
if (!is_accessed_spte(iter->old_spte))
return false;
 
+   if (!kvm_gfn_should_age(range, iter->gfn))
+   return false;
+
if (spte_ad_enabled(iter->old_spte)) {
iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
 iter->old_spte,
@@ -1250,7 +1253,12 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 struct kvm_gfn_range *range)
 {
-   return is_accessed_spte(iter->old_spte);
+   bool young = is_accessed_spte(iter->old_spte);
+
+   if (young)
+   kvm_gfn_record_young(range, iter->gfn);
+
+   return young;
 }
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-- 
2.44.0.478.gd926399ef9-goog




[PATCH v3 4/7] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask

2024-04-01 Thread James Houghton
From: Yu Zhao 

tdp_mmu_enabled and shadow_accessed_mask are needed to implement
kvm_arch_prepare_bitmap_age().

Signed-off-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 arch/x86/include/asm/kvm_host.h | 6 ++
 arch/x86/kvm/mmu.h  | 6 --
 arch/x86/kvm/mmu/spte.h | 1 -
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16e07a2eee19..3b58e2306621 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1847,6 +1847,7 @@ struct kvm_arch_async_pf {
 
 extern u32 __read_mostly kvm_nr_uret_msrs;
 extern u64 __read_mostly host_efer;
+extern u64 __read_mostly shadow_accessed_mask;
 extern bool __read_mostly allow_smaller_maxphyaddr;
 extern bool __read_mostly enable_apicv;
 extern struct kvm_x86_ops kvm_x86_ops;
@@ -1952,6 +1953,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 bool mask);
 
 extern bool tdp_enabled;
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..8ae279035900 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -270,12 +270,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
-#ifdef CONFIG_X86_64
-extern bool tdp_mmu_enabled;
-#else
-#define tdp_mmu_enabled false
-#endif
-
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index a129951c9a88..f791fe045c7d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -154,7 +154,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
 extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
-extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
 extern u64 __read_mostly shadow_mmio_value;
 extern u64 __read_mostly shadow_mmio_mask;
-- 
2.44.0.478.gd926399ef9-goog




[PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

2024-04-01 Thread James Houghton
Add kvm_arch_prepare_bitmap_age() for architectures to indicate that
they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
and that they do not need KVM to grab the MMU lock for writing. This
function allows architectures to do any other locking or preparatory
work that they need.

If an architecture does not implement kvm_arch_prepare_bitmap_age() or
is unable to do bitmap-based aging at runtime (and marks the bitmap as
unreliable):
 1. If a bitmap was provided, we inform the caller that the bitmap is
unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
 2. If a bitmap was not provided, fall back to the old logic.

Also add logic for architectures to easily use the provided bitmap if
they are able. The expectation is that the architecture's implementation
of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
kvm_gfn_age() will use kvm_gfn_should_age().

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 include/linux/kvm_host.h | 60 ++
 virt/kvm/kvm_main.c  | 92 +---
 2 files changed, 127 insertions(+), 25 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1800d03a06a9..5862fd7b5f9b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[];
 extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
+/*
+ * Architectures that support using bitmaps for kvm_age_gfn() and
+ * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
+ * and do any work they need to prepare. The subsequent walk will not
+ * automatically grab the KVM MMU lock, so some architectures may opt
+ * to grab it.
+ *
+ * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
+ * guaranteed.
+ */
+#ifndef kvm_arch_prepare_bitmap_age
+static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
+{
+   return false;
+}
+#endif
+#ifndef kvm_arch_finish_bitmap_age
+static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
+#endif
+
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
@@ -2076,9 +2096,16 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
 }
 
+struct test_clear_young_metadata {
+   unsigned long *bitmap;
+   unsigned long bitmap_offset_end;
+   unsigned long end;
+   bool unreliable;
+};
 union kvm_mmu_notifier_arg {
pte_t pte;
unsigned long attributes;
+   struct test_clear_young_metadata *metadata;
 };
 
 struct kvm_gfn_range {
@@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
gfn_t end;
union kvm_mmu_notifier_arg arg;
bool may_block;
+   bool lockless;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+
+static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
+{
+   struct test_clear_young_metadata *args = range->arg.metadata;
+
+   args->unreliable = true;
+}
+static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range *range,
+   gfn_t gfn)
+{
+   struct test_clear_young_metadata *args = range->arg.metadata;
+
+   return hva_to_gfn_memslot(args->end - 1, range->slot) - gfn;
+}
+static inline void kvm_gfn_record_young(struct kvm_gfn_range *range, gfn_t gfn)
+{
+   struct test_clear_young_metadata *args = range->arg.metadata;
+
+   WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+   if (args->bitmap)
+   __set_bit(kvm_young_bitmap_offset(range, gfn), args->bitmap);
+}
+static inline bool kvm_gfn_should_age(struct kvm_gfn_range *range, gfn_t gfn)
+{
+   struct test_clear_young_metadata *args = range->arg.metadata;
+
+   WARN_ON_ONCE(gfn < range->start || gfn >= range->end);
+   if (args->bitmap)
+   return test_bit(kvm_young_bitmap_offset(range, gfn),
+   args->bitmap);
+   return true;
+}
 #endif
 
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0545d88c802..7d80321e2ece 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -550,6 +550,7 @@ struct kvm_mmu_notifier_range {
on_lock_fn_t on_lock;
bool flush_on_ret;
bool may_block;
+   bool lockless;
 };
 
 /*
@@ -598,6 +599,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_memslots *slots;
int i, idx;
 
+   BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.pte));
+
  

[PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-01 Thread James Houghton
The bitmap is provided for secondary MMUs to use if they support it. For
test_young(), after it returns, the bitmap represents the pages that
were young in the interval [start, end). For clear_young, it represents
the pages that we wish the secondary MMU to clear the accessed/young bit
for.

If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
should be unchanged except that if young PTEs are found and the
architecture supports passing in a bitmap, instead of returning 1,
MMU_NOTIFIER_YOUNG_FAST is returned.

This allows MGLRU's look-around logic to work faster, resulting in a 4%
improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
to indicate to main mm that doing look-around is likely to be
beneficial.

If the secondary MMU doesn't support the bitmap, it must return
an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

[1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
 include/linux/mmu_notifier.h | 93 +---
 include/trace/events/kvm.h   | 13 +++--
 mm/mmu_notifier.c| 20 +---
 virt/kvm/kvm_main.c  | 19 ++--
 4 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
 
+#define MMU_NOTIFIER_YOUNG (1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE   (1 << 1)
+#define MMU_NOTIFIER_YOUNG_FAST(1 << 2)
+
 struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
 * clear_young is a lightweight version of clear_flush_young. Like the
 * latter, it is supposed to test-and-clear the young/accessed bitflag
 * in the secondary pte, but it may omit flushing the secondary tlb.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*clear_young)(struct mmu_notifier *subscription,
   struct mm_struct *mm,
   unsigned long start,
-  unsigned long end);
+  unsigned long end,
+  unsigned long *bitmap);
 
/*
 * test_young is called to check the young/accessed bitflag in
 * the secondary pte. This is used to know if the page is
 * frequently used without actually clearing the flag or tearing
 * down the secondary mapping on the page.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*test_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
- unsigned long address);
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap);
 
/*
 * change_pte is called in cases that pte mapping to page is changed:
@@ -388,10 +407,11 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long start,
  unsigned long end);
 extern int __mmu_notifier_clear_young(struct mm_struct *mm,
- unsigned long start,
- unsigned long end);
+ unsigned long start, unsigned long end,
+ unsigned long *bitmap);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
-unsigned long address);
+unsigned long start, unsigned long end,
+unsigned long *bitmap);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
  unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
@@ -427,7 +447,25 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm,
   unsigned long end)
 {
if (mm_has_notifiers(mm))
-   return __mmu_notifier_clear_young(mm, start, end);
+   return __mmu_notifier_clear_young(mm, start, end, NULL);
+   return 0;
+}
+
+/*
+ * When @bitmap is not provided, 

[PATCH v3 2/7] KVM: Move MMU notifier function declarations

2024-04-01 Thread James Houghton
To allow new MMU-notifier-related functions to use gfn_to_hva_memslot(),
move some declarations around.

Also move mmu_notifier_to_kvm() for wider use later.

Signed-off-by: James Houghton 
---
 include/linux/kvm_host.h | 41 +---
 virt/kvm/kvm_main.c  |  5 -
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..1800d03a06a9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -257,25 +257,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
-union kvm_mmu_notifier_arg {
-   pte_t pte;
-   unsigned long attributes;
-};
-
-struct kvm_gfn_range {
-   struct kvm_memory_slot *slot;
-   gfn_t start;
-   gfn_t end;
-   union kvm_mmu_notifier_arg arg;
-   bool may_block;
-};
-bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-#endif
-
 enum {
OUTSIDE_GUEST_MODE,
IN_GUEST_MODE,
@@ -2012,6 +1993,11 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header;
 extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+   return container_of(mn, struct kvm, mmu_notifier);
+}
+
 static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
if (unlikely(kvm->mmu_invalidate_in_progress))
@@ -2089,6 +2075,23 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
 
return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
 }
+
+union kvm_mmu_notifier_arg {
+   pte_t pte;
+   unsigned long attributes;
+};
+
+struct kvm_gfn_range {
+   struct kvm_memory_slot *slot;
+   gfn_t start;
+   gfn_t end;
+   union kvm_mmu_notifier_arg arg;
+   bool may_block;
+};
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 #endif
 
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ca4b1ef9dfc2..d0545d88c802 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -534,11 +534,6 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
-static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
-{
-   return container_of(mn, struct kvm, mmu_notifier);
-}
-
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
-- 
2.44.0.478.gd926399ef9-goog




[PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

2024-04-01 Thread James Houghton
This patchset adds a fast path in KVM to test and clear access bits on
sptes without taking the mmu_lock. It also adds support for using a
bitmap to (1) test the access bits for many sptes in a single call to
mmu_notifier_test_young, and to (2) clear the access bits for many ptes
in a single call to mmu_notifier_clear_young.

With Yu's permission, I'm now working on getting this series into a
mergeable state.

I'm posting this as an RFC because I'm not sure if the arm64 bits are
correct, and I haven't done complete performance testing. I want to do
broader experimentation to see how much this improves VM performance in
a cloud environment, but I want to be sure that the code is mergeable
first.

Yu has posted other performance results[1], [2]. This v3 shouldn't
significantly change the x86 results, but the arm64 results may have
changed.

The most important changes since v2[3]:

- Split the test_clear_young MMU notifier back into test_young and
  clear_young. I did this because the bitmap passed in has a distinct
  meaning for each of them, and I felt that this was cleaner.

- The return value of test_young / clear_young now indicates if the
  bitmap was used.

- Removed the custom spte walker to implement the lockless path. This
  was important for arm64 to be functionally correct (thanks Oliver),
  and it avoids a lot of problems brought up in review of v2 (for
  example[4]).

- Add kvm_arch_prepare_bitmap_age and kvm_arch_finish_bitmap_age to
  allow for arm64 to implement its bitmap-based aging to grab the MMU
  lock for reading while allowing x86 to be lockless.

- The powerpc changes have been dropped.

- The logic to inform architectures how to use the bitmap has been
  cleaned up (kvm_should_clear_young has been split into
  kvm_gfn_should_age and kvm_gfn_record_young) (thanks Nicolas).

There were some smaller changes too:
- Added test_clear_young_metadata (thanks Sean).
- MMU_NOTIFIER_RANGE_LOCKLESS has been renamed to
  MMU_NOTIFIER_YOUNG_FAST, to indicate to the caller that passing a
  bitmap for MGLRU look-around is likely to be beneficial.
- Cleaned up comments that describe the changes to
  mmu_notifier_test_young / mmu_notifier_clear_young (thanks Nicolas).

[1]: https://lore.kernel.org/all/20230609005943.43041-1-yuz...@google.com/
[2]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/
[3]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuz...@google.com/
[4]: https://lore.kernel.org/all/zitx64bbx5vdj...@google.com/

James Houghton (5):
  mm: Add a bitmap into mmu_notifier_{clear,test}_young
  KVM: Move MMU notifier function declarations
  KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young
  KVM: x86: Participate in bitmap-based PTE aging
  KVM: arm64: Participate in bitmap-based PTE aging

Yu Zhao (2):
  KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask
  mm: multi-gen LRU: use mmu_notifier_test_clear_young()

 Documentation/admin-guide/mm/multigen_lru.rst |   6 +-
 arch/arm64/include/asm/kvm_host.h |   5 +
 arch/arm64/include/asm/kvm_pgtable.h  |   4 +-
 arch/arm64/kvm/hyp/pgtable.c  |  21 +-
 arch/arm64/kvm/mmu.c  |  23 ++-
 arch/x86/include/asm/kvm_host.h   |  20 ++
 arch/x86/kvm/mmu.h|   6 -
 arch/x86/kvm/mmu/mmu.c|  16 +-
 arch/x86/kvm/mmu/spte.h   |   1 -
 arch/x86/kvm/mmu/tdp_mmu.c|  10 +-
 include/linux/kvm_host.h  | 101 --
 include/linux/mmu_notifier.h  |  93 -
 include/linux/mmzone.h|   6 +-
 include/trace/events/kvm.h|  13 +-
 mm/mmu_notifier.c |  20 +-
 mm/rmap.c |   9 +-
 mm/vmscan.c   | 183 ++
 virt/kvm/kvm_main.c   | 100 +++---
 18 files changed, 509 insertions(+), 128 deletions(-)


base-commit: 0cef2c0a2a356137b170c3cb46cb9c1dd2ca3e6b
-- 
2.44.0.478.gd926399ef9-goog




Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Hi SeongJae,

On Mon, Apr 1, 2024 at 11:27 AM Ho-Ren (Jack) Chuang
 wrote:
>
> Hi SeongJae,
>
> On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park  wrote:
> >
> > Hi Ho-Ren,
> >
> > On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" 
> >  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > > ---
> > >  drivers/dax/kmem.c   | 20 ++--
> > >  include/linux/memory-tiers.h | 13 +
> > >  mm/memory-tiers.c| 32 
> > >  3 files changed, 47 insertions(+), 18 deletions(-)
> > >
> > [...]
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > index 69e781900082..a44c03c2ba3a 100644
> > > --- a/include/linux/memory-tiers.h
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
> > >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> > >const char *source);
> > >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > > + struct list_head 
> > > *memory_types);
> > > +void mt_put_memory_types(struct list_head *memory_types);
> > >  #ifdef CONFIG_MIGRATION
> > >  int next_demotion_node(int node);
> > >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> > > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> > > access_coordinate *perf, int *adis
> > >  {
> > >   return -EIO;
> > >  }
> > > +
> > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > > list_head *memory_types)
> > > +{
> > > + return NULL;
> > > +}
> > > +
> > > +void mt_put_memory_types(struct list_head *memory_types)
> > > +{
> > > +
> > > +}
> >
> > I found latest mm-unstable tree is failing kunit as below, and 'git bisect'
> > says it happens from this patch.
> >
> > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> > [11:56:40] Configuring KUnit Kernel ...
> > [11:56:40] Building KUnit Kernel ...
> > Populating config with:
> > $ make ARCH=um O=../kunit.out/ olddefconfig
> > Building with:
> > $ make ARCH=um O=../kunit.out/ --jobs=36
> > ERROR:root:In file included from .../mm/memory.c:71:
> > .../include/linux/memory-tiers.h:143:25: warning: no previous prototype 
> > for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes]
> >   143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> > struct list_head *memory_types)
> >   | ^
> > .../include/linux/memory-tiers.h:148:6: warning: no previous prototype 
> > for ‘mt_put_memory_types’ [-Wmissing-prototypes]
> >   148 | void mt_put_memory_types(struct list_head *memory_types)
> >   |  ^~~
> > [...]
> >
> > Maybe we should set these as 'static inline', like below?  I confirmed this
> > fixes the kunit error.  May I ask your opinion?
> >
>
> Thanks for catching this. I'm trying to figure out this problem. Will get 
> back.
>

These kunit compilation errors can be solved by adding `static inline`
to the two complaining functions, the same solution you mentioned
earlier.

I've also tested on my end and I will send out a V10 soon. Thank you again!

> >
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index a44c03c2ba3a..ee6e53144156 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> > return -EIO;
> >  }
> >
> > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> > struct list_head *memory_types)
> >  {
> > return NULL;
> >  }
> >
> > -void mt_put_memory_types(struct list_head *memory_types)
> > +static inline void mt_put_memory_types(struct list_head *memory_types)
> >  {
> >
> >  }
> >
> >
> > Thanks,
> > SJ
>
>
>
> --
> Best regards,
> Ho-Ren (Jack) Chuang
> 莊賀任



-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



Re: [PATCH] selftests/sgx: Improve cgroup test scripts

2024-04-01 Thread Haitao Huang
On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen  wrote:



On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:

Make cgroup test scripts ash compatible.
Remove cg-tools dependency.
Add documentation for functions.

Tested with busybox on Ubuntu.

Signed-off-by: Haitao Huang 


I'll run this next week on good old NUC7. Thank you.

I really wish that either (hopefully both) Intel or AMD would bring up
for developers home use meant platform to develop on TDX and SNP. It is
a shame that the latest and greatest is from 2018.

BR, Jarkko



Argh, missed a few changes for v2 cgroup:

--- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup
 CG_V1=0
 if [ ! -d "/sys/fs/cgroup/misc" ]; then
 echo "# cgroup V2 is in use."
+echo "+misc" >  $CG_MISC_ROOT/cgroup.subtree_control
+echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control
 else
 echo "# cgroup V1 is in use."
 CG_MISC_ROOT=/sys/fs/cgroup/misc
@@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2
 mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3
 mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4

+if [ $CG_V1 -eq 0 ]; then
+echo "+misc" >  $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control
+echo "+misc" >  $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control
+fi



[PATCH 3/3] Documentation/smatch: fix typo in submitting-patches.md

2024-04-01 Thread Javier Carrasco
Fix a small typo in the smatch documentation about the patch submission
process.

Signed-off-by: Javier Carrasco 
---
 Documentation/submitting-patches.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/submitting-patches.md b/Documentation/submitting-patches.md
index 5c4191bd..3f4c548f 100644
--- a/Documentation/submitting-patches.md
+++ b/Documentation/submitting-patches.md
@@ -20,7 +20,7 @@ Kernel submitting process.
Notice that sparse uses the MIT License.

 4. Smatch is built on top of Sparse but it is licensed under the GPLv2+ the
-   git repostories are:
+   git repositories are:

https://github.com/error27/smatch
https://repo.or.cz/w/smatch.git
--
2.40.1




[PATCH 2/3] Documentation/smatch: convert to RST

2024-04-01 Thread Javier Carrasco
Convert existing smatch documentation to RST, and add it to the index
accordingly.

Signed-off-by: Javier Carrasco 
---
 Documentation/index.rst  |  1 +
 Documentation/{smatch.txt => smatch.rst} | 56 +---
 2 files changed, 31 insertions(+), 26 deletions(-)
 rename Documentation/{smatch.txt => smatch.rst} (72%)

diff --git a/Documentation/index.rst b/Documentation/index.rst
index e29a5643..761acbae 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -86,6 +86,7 @@ Some interesting external documentation:
test-suite
doc-guide
TODO
+   smatch

 .. toctree::
:caption: Release Notes
diff --git a/Documentation/smatch.txt b/Documentation/smatch.rst
similarity index 72%
rename from Documentation/smatch.txt
rename to Documentation/smatch.rst
index b2c3ac4e..f209c8fb 100644
--- a/Documentation/smatch.txt
+++ b/Documentation/smatch.rst
@@ -1,43 +1,46 @@
+==
 Smatch
+==

-0.  Introduction
-1.  Building Smatch
-2.  Using Smatch
-3.  Smatch vs Sparse
+.. Table of Contents:

-Section 0: Introduction
+.. contents:: :local:
+
+
+0. Introduction
+===

 The Smatch mailing list is .

-Section 1:  Building Smatch

+1. Building Smatch
+==

 Smatch needs some dependencies to build:

-In Debian run:
-apt-get install gcc make sqlite3 libsqlite3-dev libdbd-sqlite3-perl libssl-dev libtry-tiny-perl
+In Debian run::

-Or in Fedora run:
-yum install gcc make sqlite3 sqlite-devel sqlite perl-DBD-SQLite openssl-devel perl-Try-Tiny
+   apt-get install gcc make sqlite3 libsqlite3-dev libdbd-sqlite3-perl libssl-dev libtry-tiny-perl

-Smatch is easy to build.  Just type `make`.  There isn't an install process
-right now so just run it from the build directory.
+Or in Fedora run::
+
+   yum install gcc make sqlite3 sqlite-devel sqlite perl-DBD-SQLite openssl-devel perl-Try-Tiny

+Smatch is easy to build.  Just type ``make``.  There isn't an install process
+right now so just run it from the build directory.

-Section 2:  Using Smatch
-
+2. Using Smatch
+===

 Smatch can be used with a cross function database. It's not mandatory to
 build the database but it's a useful thing to do.  Building the database
 for the kernel takes 2-3 hours on my computer.  For the kernel you build
-the database with:
+the database with::

-   cd ~/path/to/kernel_dir
-   ~/path/to/smatch_dir/smatch_scripts/build_kernel_data.sh
+   cd ~/path/to/kernel_dir
+   ~/path/to/smatch_dir/smatch_scripts/build_kernel_data.sh

 For projects other than the kernel you run Smatch with the options
 "--call-tree --info --param-mapper --spammy" and finish building the
-database by running the script:
+database by running the script::

~/path/to/smatch_dir/smatch_data/db/create_db.sh

@@ -45,21 +48,23 @@ Each time you rebuild the cross function database it becomes more accurate. I
 normally rebuild the database every morning.

 If you are running Smatch over the whole kernel you can use the following
-command:
+command::

~/path/to/smatch_dir/smatch_scripts/test_kernel.sh

 The test_kernel.sh script will create a .c.smatch file for every file it tests
 and a combined smatch_warns.txt file with all the warnings.

-If you are running Smatch just over one kernel file:
+If you are running Smatch just over one kernel file::

~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/file.c

-You can also build a directory like this:
+You can also build a directory like this::
+

~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/

+
 The kchecker script prints its warnings to stdout.

 The above scripts will ensure that any ARCH or CROSS_COMPILE environment
@@ -67,7 +72,7 @@ variables are passed to kernel build system - thus allowing for the use of
 Smatch with kernels that are normally built with cross-compilers.

 If you are building something else (which is not the Linux kernel) then use
-something like:
+something like::

make CHECK="~/path/to/smatch_dir/smatch --full-path" \
CC=~/path/to/smatch_dir/smatch/cgcc | tee smatch_warns.txt
@@ -75,9 +80,8 @@ something like:
 The makefile has to let people set the CC with an environment variable for that
 to work, of course.

-
-Section 3:  Smatch vs Sparse
-
+3. Smatch vs Sparse
+===

 Smatch uses Sparse as a C parser.  I have made a few hacks to Sparse so I
 have to distribute the two together.  Sparse is released under the MIT license
--
2.40.1
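The converted document's closing remark, that "the makefile has to let people set the CC with an environment variable for that to work", is the crux of the whole `CHECK=.../CC=.../cgcc` workflow. A minimal shell sketch of the same pattern, using an invented stand-in wrapper rather than the real smatch/cgcc binaries, shows why a hard-coded compiler would defeat the checker:

```shell
#!/bin/sh
# Toy illustration: a CHECK-style wrapper only runs if the build invokes
# "$CC" rather than a hard-coded compiler. "fake-cgcc" is an invented
# stand-in for smatch's cgcc; no real compilation happens here.
set -e
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

cat > "$workdir/fake-cgcc" <<'EOF'
#!/bin/sh
echo "checker saw: $*"
EOF
chmod +x "$workdir/fake-cgcc"

build() {
	# A build step that respects the CC environment variable,
	# falling back to the system compiler when CC is unset.
	: "${CC:=cc}"
	"$CC" -c file.c
}

CC="$workdir/fake-cgcc" build
```

Running it prints the arguments the "checker" received, confirming that overriding `CC` in the environment is enough to interpose the wrapper on every compiler invocation.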




[PATCH 1/3] Documentation/smatch: fix paths in the examples

2024-04-01 Thread Javier Carrasco
A few examples use the '~/progs/smatch/devel/smatch_scripts/' path,
which seems to be a local reference that does not reflect the real
paths in the project (one would not expect 'devel' inside 'smatch').

Use the generic '~/path/to/smatch_dir/' path, which is already used
in some examples.

Signed-off-by: Javier Carrasco 
---
 Documentation/smatch.txt | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/smatch.txt b/Documentation/smatch.txt
index 59106d49..b2c3ac4e 100644
--- a/Documentation/smatch.txt
+++ b/Documentation/smatch.txt
@@ -39,7 +39,7 @@ For projects other than the kernel you run Smatch with the options
 "--call-tree --info --param-mapper --spammy" and finish building the
 database by running the script:

-   ~/progs/smatch/devel/smatch_data/db/create_db.sh
+   ~/path/to/smatch_dir/smatch_data/db/create_db.sh

 Each time you rebuild the cross function database it becomes more accurate. I
 normally rebuild the database every morning.
@@ -47,18 +47,18 @@ normally rebuild the database every morning.
 If you are running Smatch over the whole kernel you can use the following
 command:

-   ~/progs/smatch/devel/smatch_scripts/test_kernel.sh
+   ~/path/to/smatch_dir/smatch_scripts/test_kernel.sh

 The test_kernel.sh script will create a .c.smatch file for every file it tests
 and a combined smatch_warns.txt file with all the warnings.

 If you are running Smatch just over one kernel file:

-   ~/progs/smatch/devel/smatch_scripts/kchecker drivers/whatever/file.c
+   ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/file.c

 You can also build a directory like this:

-   ~/progs/smatch/devel/smatch_scripts/kchecker drivers/whatever/
+   ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/

 The kchecker script prints its warnings to stdout.

@@ -69,8 +69,8 @@ Smatch with kernels that are normally built with cross-compilers.
 If you are building something else (which is not the Linux kernel) then use
 something like:

-   make CHECK="~/progs/smatch/devel/smatch --full-path" \
-   CC=~/progs/smatch/devel/smatch/cgcc | tee smatch_warns.txt
+   make CHECK="~/path/to/smatch_dir/smatch --full-path" \
+   CC=~/path/to/smatch_dir/smatch/cgcc | tee smatch_warns.txt

 The makefile has to let people set the CC with an environment variable for that
 to work, of course.
--
2.40.1




[PATCH 0/3] Documentation/smatch: RST conversion and fixes

2024-04-01 Thread Javier Carrasco
This series converts the existing smatch.txt to RST and adds it to the
index, so it can be built together with the sparse documentation.

While at it, a couple of small fixes have been included.

Signed-off-by: Javier Carrasco 

Javier Carrasco (3):
  Documentation/smatch: fix paths in the examples
  Documentation/smatch: convert to RST
  Documentation/smatch: fix typo in submitting-patches.md

 Documentation/index.rst  |  1 +
 Documentation/{smatch.txt => smatch.rst} | 68 +---
 Documentation/submitting-patches.md  |  2 +-
 3 files changed, 38 insertions(+), 33 deletions(-)
 rename Documentation/{smatch.txt => smatch.rst} (60%)

--
2.40.1




[PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-01 Thread Kui-Feng Lee
rethook_find_ret_addr() prints a warning message and returns 0 when the
target task is running and is not the "current" task, to prevent returning
an incorrect return address. However, this check is incomplete: the target
task can still transition to the running state while the return address is
being found, although doing so is safe under RCU.

The issue we encounter is that the kernel frequently prints warning
messages when BPF profiling programs call to bpf_get_task_stack() on
running tasks.

The callers should be aware of, and willing to take, the risk of receiving
an incorrect return address from a task that is currently running and is
not the "current" one. A warning is not needed here, as the callers
knowingly accept that risk.

Signed-off-by: Kui-Feng Lee 
---
 kernel/trace/rethook.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
if (WARN_ON_ONCE(!cur))
return 0;
 
-   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+   if (tsk != current && task_is_running(tsk))
return 0;
 
do {
-- 
2.34.1




Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-01 Thread Ho-Ren (Jack) Chuang
Hi SeongJae,

On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park  wrote:
>
> Hi Ho-Ren,
>
> On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" 
>  wrote:
>
> > Since different memory devices require finding, allocating, and putting
> > memory types, these common steps are abstracted in this patch,
> > enhancing the scalability and conciseness of the code.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Reviewed-by: "Huang, Ying" 
> > ---
> >  drivers/dax/kmem.c   | 20 ++--
> >  include/linux/memory-tiers.h | 13 +
> >  mm/memory-tiers.c| 32 
> >  3 files changed, 47 insertions(+), 18 deletions(-)
> >
> [...]
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > index 69e781900082..a44c03c2ba3a 100644
> > --- a/include/linux/memory-tiers.h
> > +++ b/include/linux/memory-tiers.h
> > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
> >  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
> >const char *source);
> >  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist,
> > + struct list_head 
> > *memory_types);
> > +void mt_put_memory_types(struct list_head *memory_types);
> >  #ifdef CONFIG_MIGRATION
> >  int next_demotion_node(int node);
> >  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct 
> > access_coordinate *perf, int *adis
> >  {
> >   return -EIO;
> >  }
> > +
> > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> > list_head *memory_types)
> > +{
> > + return NULL;
> > +}
> > +
> > +void mt_put_memory_types(struct list_head *memory_types)
> > +{
> > +
> > +}
>
> I found latest mm-unstable tree is failing kunit as below, and 'git bisect'
> says it happens from this patch.
>
> $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/
> [11:56:40] Configuring KUnit Kernel ...
> [11:56:40] Building KUnit Kernel ...
> Populating config with:
> $ make ARCH=um O=../kunit.out/ olddefconfig
> Building with:
> $ make ARCH=um O=../kunit.out/ --jobs=36
> ERROR:root:In file included from .../mm/memory.c:71:
> .../include/linux/memory-tiers.h:143:25: warning: no previous prototype 
> for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes]
>   143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> struct list_head *memory_types)
>   | ^
> .../include/linux/memory-tiers.h:148:6: warning: no previous prototype 
> for ‘mt_put_memory_types’ [-Wmissing-prototypes]
>   148 | void mt_put_memory_types(struct list_head *memory_types)
>   |  ^~~
> [...]
>
> Maybe we should set these as 'static inline', like below?  I confirmed this
> fixes the kunit error.  May I ask your opinion?
>

Thanks for catching this. I'm trying to figure out this problem. Will get back.

>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a44c03c2ba3a..ee6e53144156 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct 
> access_coordinate *perf, int *adis
> return -EIO;
>  }
>
> -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct 
> list_head *memory_types)
> +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, 
> struct list_head *memory_types)
>  {
> return NULL;
>  }
>
> -void mt_put_memory_types(struct list_head *memory_types)
> +static inline void mt_put_memory_types(struct list_head *memory_types)
>  {
>
>  }
>
>
> Thanks,
> SJ



-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



[PATCH v2 4/4] arm64: dts: qcom: msm8976: Add WCNSS node

2024-04-01 Thread Adam Skladowski
Add a node describing the wireless connectivity subsystem.

Signed-off-by: Adam Skladowski 
---
 arch/arm64/boot/dts/qcom/msm8976.dtsi | 104 ++
 1 file changed, 104 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
b/arch/arm64/boot/dts/qcom/msm8976.dtsi
index 77670fce9b8f..41c748c78347 100644
--- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
@@ -771,6 +771,36 @@ blsp2_i2c4_sleep: blsp2-i2c4-sleep-state {
drive-strength = <2>;
bias-disable;
};
+
+   wcss_wlan_default: wcss-wlan-default-state  {
+   wcss-wlan2-pins {
+   pins = "gpio40";
+   function = "wcss_wlan2";
+   drive-strength = <6>;
+   bias-pull-up;
+   };
+
+   wcss-wlan1-pins {
+   pins = "gpio41";
+   function = "wcss_wlan1";
+   drive-strength = <6>;
+   bias-pull-up;
+   };
+
+   wcss-wlan0-pins {
+   pins = "gpio42";
+   function = "wcss_wlan0";
+   drive-strength = <6>;
+   bias-pull-up;
+   };
+
+   wcss-wlan-pins {
+   pins = "gpio43", "gpio44";
+   function = "wcss_wlan";
+   drive-strength = <6>;
+   bias-pull-up;
+   };
+   };
};
 
gcc: clock-controller@180 {
@@ -1446,6 +1476,80 @@ blsp2_i2c4: i2c@7af8000 {
status = "disabled";
};
 
+   wcnss: remoteproc@a204000 {
+   compatible = "qcom,pronto-v3-pil", "qcom,pronto";
+   reg = <0x0a204000 0x2000>,
+ <0x0a202000 0x1000>,
+ <0x0a21b000 0x3000>;
+   reg-names = "ccu",
+   "dxe",
+   "pmu";
+
+   memory-region = <_fw_mem>;
+
+   interrupts-extended = < GIC_SPI 149 
IRQ_TYPE_EDGE_RISING>,
+ <_smp2p_in 0 
IRQ_TYPE_EDGE_RISING>,
+ <_smp2p_in 1 
IRQ_TYPE_EDGE_RISING>,
+ <_smp2p_in 2 
IRQ_TYPE_EDGE_RISING>,
+ <_smp2p_in 3 
IRQ_TYPE_EDGE_RISING>;
+   interrupt-names = "wdog",
+ "fatal",
+ "ready",
+ "handover",
+ "stop-ack";
+
+   power-domains = < MSM8976_VDDCX>,
+   < MSM8976_VDDMX>;
+   power-domain-names = "cx", "mx";
+
+   qcom,smem-states = <_smp2p_out 0>;
+   qcom,smem-state-names = "stop";
+
+   pinctrl-0 = <_wlan_default>;
+   pinctrl-names = "default";
+
+   status = "disabled";
+
+   wcnss_iris: iris {
+   /* Separate chip, compatible is board-specific 
*/
+   clocks = < RPM_SMD_RF_CLK2>;
+   clock-names = "xo";
+   };
+
+   smd-edge {
+   interrupts = ;
+
+   qcom,ipc = < 8 17>;
+   qcom,smd-edge = <6>;
+   qcom,remote-pid = <4>;
+
+   label = "pronto";
+
+   wcnss_ctrl: wcnss {
+   compatible = "qcom,wcnss";
+   qcom,smd-channels = "WCNSS_CTRL";
+
+   qcom,mmio = <>;
+
+   wcnss_bt: bluetooth {
+   compatible = "qcom,wcnss-bt";
+   };
+
+   wcnss_wifi: wifi {
+   compatible = "qcom,wcnss-wlan";
+
+   interrupts = ,
+  

[PATCH v2 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU

2024-04-01 Thread Adam Skladowski
Add Adreno GPU node.

Signed-off-by: Adam Skladowski 
---
 arch/arm64/boot/dts/qcom/msm8976.dtsi | 65 +++
 1 file changed, 65 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
b/arch/arm64/boot/dts/qcom/msm8976.dtsi
index 6be310079f5b..77670fce9b8f 100644
--- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
@@ -1074,6 +1074,71 @@ mdss_dsi1_phy: phy@1a96a00 {
};
};
 
+   adreno_gpu: gpu@1c0 {
+   compatible = "qcom,adreno-510.0", "qcom,adreno";
+
+   reg = <0x01c0 0x4>;
+   reg-names = "kgsl_3d0_reg_memory";
+
+   interrupts = ;
+   interrupt-names = "kgsl_3d0_irq";
+
+   clocks = < GCC_GFX3D_OXILI_CLK>,
+< GCC_GFX3D_OXILI_AHB_CLK>,
+< GCC_GFX3D_OXILI_GMEM_CLK>,
+< GCC_GFX3D_BIMC_CLK>,
+< GCC_GFX3D_OXILI_TIMER_CLK>,
+< GCC_GFX3D_OXILI_AON_CLK>;
+   clock-names = "core",
+ "iface",
+ "mem",
+ "mem_iface",
+ "rbbmtimer",
+ "alwayson";
+
+   power-domains = < OXILI_GX_GDSC>;
+
+   iommus = <_iommu 0>;
+
+   status = "disabled";
+
+   operating-points-v2 = <_opp_table>;
+
+   gpu_opp_table: opp-table {
+   compatible = "operating-points-v2";
+
+   opp-2 {
+   opp-hz = /bits/ 64 <2>;
+   required-opps = <_opp_low_svs>;
+   };
+
+   opp-3 {
+   opp-hz = /bits/ 64 <3>;
+   required-opps = <_opp_svs>;
+   };
+
+   opp-4 {
+   opp-hz = /bits/ 64 <4>;
+   required-opps = <_opp_nom>;
+   };
+
+   opp-48000 {
+   opp-hz = /bits/ 64 <48000>;
+   required-opps = <_opp_nom_plus>;
+   };
+
+   opp-54000 {
+   opp-hz = /bits/ 64 <54000>;
+   required-opps = <_opp_turbo>;
+   };
+
+   opp-6 {
+   opp-hz = /bits/ 64 <6>;
+   required-opps = <_opp_turbo>;
+   };
+   };
+   };
+
apps_iommu: iommu@1ee {
compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2";
reg = <0x01ee 0x3000>;
-- 
2.44.0




[PATCH v2 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes

2024-04-01 Thread Adam Skladowski
Add MDSS nodes to support displays on MSM8976 SoC.

Signed-off-by: Adam Skladowski 
---
 arch/arm64/boot/dts/qcom/msm8976.dtsi | 274 +-
 1 file changed, 270 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
b/arch/arm64/boot/dts/qcom/msm8976.dtsi
index 8bdcc1438177..6be310079f5b 100644
--- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
@@ -785,10 +785,10 @@ gcc: clock-controller@180 {
 
clocks = < RPM_SMD_XO_CLK_SRC>,
 < RPM_SMD_XO_A_CLK_SRC>,
-<0>,
-<0>,
-<0>,
-<0>;
+<_dsi0_phy 1>,
+<_dsi0_phy 0>,
+<_dsi1_phy 1>,
+<_dsi1_phy 0>;
clock-names = "xo",
  "xo_a",
  "dsi0pll",
@@ -808,6 +808,272 @@ tcsr: syscon@1937000 {
reg = <0x01937000 0x3>;
};
 
+   mdss: display-subsystem@1a0 {
+   compatible = "qcom,mdss";
+
+   reg = <0x01a0 0x1000>,
+ <0x01ab 0x3000>;
+   reg-names = "mdss_phys", "vbif_phys";
+
+   power-domains = < MDSS_GDSC>;
+   interrupts = ;
+
+   interrupt-controller;
+   #interrupt-cells = <1>;
+
+   clocks = < GCC_MDSS_AHB_CLK>,
+< GCC_MDSS_AXI_CLK>,
+< GCC_MDSS_VSYNC_CLK>,
+< GCC_MDSS_MDP_CLK>;
+   clock-names = "iface",
+ "bus",
+ "vsync",
+ "core";
+
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   status = "disabled";
+
+   mdss_mdp: display-controller@1a01000 {
+   compatible = "qcom,msm8976-mdp5", "qcom,mdp5";
+   reg = <0x01a01000 0x89000>;
+   reg-names = "mdp_phys";
+
+   interrupt-parent = <>;
+   interrupts = <0>;
+
+   clocks = < GCC_MDSS_AHB_CLK>,
+< GCC_MDSS_AXI_CLK>,
+< GCC_MDSS_MDP_CLK>,
+< GCC_MDSS_VSYNC_CLK>,
+< GCC_MDP_TBU_CLK>,
+< GCC_MDP_RT_TBU_CLK>;
+   clock-names = "iface",
+ "bus",
+ "core",
+ "vsync",
+ "tbu",
+ "tbu_rt";
+
+   operating-points-v2 = <_opp_table>;
+   power-domains = < MDSS_GDSC>;
+
+   iommus = <_iommu 22>;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   port@0 {
+   reg = <0>;
+
+   mdss_mdp5_intf1_out: endpoint {
+   remote-endpoint = 
<_dsi0_in>;
+   };
+   };
+
+   port@1 {
+   reg = <1>;
+
+   mdss_mdp5_intf2_out: endpoint {
+   remote-endpoint = 
<_dsi1_in>;
+   };
+   };
+   };
+
+   mdp_opp_table: opp-table {
+   compatible = "operating-points-v2";
+
+   opp-17778 {
+   opp-hz = /bits/ 64 <17778>;
+   required-opps = 
<_opp_svs>;
+   };
+
+   opp-27000 {
+   opp-hz = /bits/ 64 <27000>;
+

[PATCH v2 1/4] arm64: dts: qcom: msm8976: Add IOMMU nodes

2024-04-01 Thread Adam Skladowski
Add the nodes describing the apps and GPU IOMMUs and their context banks
found on msm8976 SoCs.

Signed-off-by: Adam Skladowski 
---
 arch/arm64/boot/dts/qcom/msm8976.dtsi | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
b/arch/arm64/boot/dts/qcom/msm8976.dtsi
index d2bb1ada361a..8bdcc1438177 100644
--- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
@@ -808,6 +808,87 @@ tcsr: syscon@1937000 {
reg = <0x01937000 0x3>;
};
 
+   apps_iommu: iommu@1ee {
+   compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2";
+   reg = <0x01ee 0x3000>;
+   ranges  = <0 0x01e2 0x2>;
+
+   clocks = < GCC_SMMU_CFG_CLK>,
+< GCC_APSS_TCU_CLK>;
+   clock-names = "iface", "bus";
+
+   qcom,iommu-secure-id = <17>;
+
+   #address-cells = <1>;
+   #size-cells = <1>;
+   #iommu-cells = <1>;
+
+   /* VFE */
+   iommu-ctx@15000 {
+   compatible = "qcom,msm-iommu-v2-ns";
+   reg = <0x15000 0x1000>;
+   qcom,ctx-asid = <20>;
+   interrupts = ;
+   };
+
+   /* VENUS NS */
+   iommu-ctx@16000 {
+   compatible = "qcom,msm-iommu-v2-ns";
+   reg = <0x16000 0x1000>;
+   qcom,ctx-asid = <21>;
+   interrupts = ;
+   };
+
+   /* MDP0 */
+   iommu-ctx@17000 {
+   compatible = "qcom,msm-iommu-v2-ns";
+   reg = <0x17000 0x1000>;
+   qcom,ctx-asid = <22>;
+   interrupts = ;
+   };
+   };
+
+   gpu_iommu: iommu@1f08000 {
+   compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2";
+   ranges = <0 0x01f08000 0x8000>;
+
+   clocks = < GCC_SMMU_CFG_CLK>,
+< GCC_GFX3D_TCU_CLK>;
+   clock-names = "iface", "bus";
+
+   power-domains = < OXILI_CX_GDSC>;
+
+   qcom,iommu-secure-id = <18>;
+
+   #address-cells = <1>;
+   #size-cells = <1>;
+   #iommu-cells = <1>;
+
+   /* gfx3d user */
+   iommu-ctx@0 {
+   compatible = "qcom,msm-iommu-v2-ns";
+   reg = <0x0 0x1000>;
+   qcom,ctx-asid = <0>;
+   interrupts = ;
+   };
+
+   /* gfx3d secure */
+   iommu-ctx@1000 {
+   compatible = "qcom,msm-iommu-v2-sec";
+   reg = <0x1000 0x1000>;
+   qcom,ctx-asid = <2>;
+   interrupts = ;
+   };
+
+   /* gfx3d priv */
+   iommu-ctx@2000 {
+   compatible = "qcom,msm-iommu-v2-sec";
+   reg = <0x2000 0x1000>;
+   qcom,ctx-asid = <1>;
+   interrupts = ;
+   };
+   };
+
spmi_bus: spmi@200f000 {
compatible = "qcom,spmi-pmic-arb";
reg = <0x0200f000 0x1000>,
-- 
2.44.0




[PATCH v2 0/4] MSM8976 MDSS/GPU/WCNSS support

2024-04-01 Thread Adam Skladowski
This patch series provides support for the display subsystem and GPU,
and also adds wireless connectivity subsystem support.

Changes since v1

1. Addressed feedback
2. Dropped already applied dt-bindings patches
3. Dropped sdc patch as it was submitted as part of other series
4. Dropped dt-bindings patch for Adreno, also separate now

Adam Skladowski (4):
  arm64: dts: qcom: msm8976: Add IOMMU nodes
  arm64: dts: qcom: msm8976: Add MDSS nodes
  arm64: dts: qcom: msm8976: Add Adreno GPU
  arm64: dts: qcom: msm8976: Add WCNSS node

 arch/arm64/boot/dts/qcom/msm8976.dtsi | 524 +-
 1 file changed, 520 insertions(+), 4 deletions(-)

-- 
2.44.0




[PATCH 1/1] clk: qcom: smd-rpm: Restore msm8976 num_clk

2024-04-01 Thread Adam Skladowski
During the rework, the msm8976 num_clks assignment somehow got removed; restore it.

Fixes: d6edc31f3a68 ("clk: qcom: smd-rpm: Separate out interconnect bus clocks")
Signed-off-by: Adam Skladowski 
---
 drivers/clk/qcom/clk-smd-rpm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/clk/qcom/clk-smd-rpm.c b/drivers/clk/qcom/clk-smd-rpm.c
index 8602c02047d0..45c5255bcd11 100644
--- a/drivers/clk/qcom/clk-smd-rpm.c
+++ b/drivers/clk/qcom/clk-smd-rpm.c
@@ -768,6 +768,7 @@ static struct clk_smd_rpm *msm8976_clks[] = {
 
 static const struct rpm_smd_clk_desc rpm_clk_msm8976 = {
.clks = msm8976_clks,
+   .num_clks = ARRAY_SIZE(msm8976_clks),
.icc_clks = bimc_pcnoc_snoc_smmnoc_icc_clks,
.num_icc_clks = ARRAY_SIZE(bimc_pcnoc_snoc_smmnoc_icc_clks),
 };
-- 
2.44.0




[PATCH v6 2/2] tracing: Include Microcode Revision in mce_record tracepoint

2024-04-01 Thread Avadhut Naik
Currently, the microcode field (Microcode Revision) of struct mce is not
exported to userspace through the mce_record tracepoint.

Knowing the microcode version on which the MCE was received is critical
information for debugging. If the version is not recorded, later attempts
to acquire the version might result in discrepancies since it can be
changed at runtime.

Export microcode version through the tracepoint to prevent ambiguity over
the active version on the system when the MCE was received.

Signed-off-by: Avadhut Naik 
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck 
---
 include/trace/events/mce.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 294fccc329c1..f0f7b3cb2041 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -42,6 +42,7 @@ TRACE_EVENT(mce_record,
__field(u8, cs  )
__field(u8, bank)
__field(u8, cpuvendor   )
+   __field(u32,microcode   )
),
 
TP_fast_assign(
@@ -63,9 +64,10 @@ TRACE_EVENT(mce_record,
__entry->cs = m->cs;
__entry->bank   = m->bank;
__entry->cpuvendor  = m->cpuvendor;
+   __entry->microcode  = m->microcode;
),
 
-   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x",
+   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x, microcode: %x",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -80,7 +82,8 @@ TRACE_EVENT(mce_record,
__entry->cpuid,
__entry->walltime,
__entry->socketid,
-   __entry->apicid)
+   __entry->apicid,
+   __entry->microcode)
 );
 
 #endif /* _TRACE_MCE_H */
-- 
2.34.1




[PATCH v6 1/2] tracing: Include PPIN in mce_record tracepoint

2024-04-01 Thread Avadhut Naik
Machine Check Error information from struct mce is exported to userspace
through the mce_record tracepoint.

Currently, however, the PPIN (Protected Processor Inventory Number) field
of struct mce is not exported through the tracepoint.

Export PPIN through the tracepoint as it provides a unique identifier for
the system (or socket in case of multi-socket systems) on which the MCE
has been received.

Also, add a comment explaining the kind of information that can be and
should be added to the tracepoint.

Signed-off-by: Avadhut Naik 
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck 
---
 include/trace/events/mce.h | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 9c4e12163996..294fccc329c1 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -9,6 +9,14 @@
 #include 
 #include 
 
+/*
+ * MCE Event Record.
+ *
+ * Only very relevant and transient information which cannot be
+ * gathered from a system by any other means or which can only be
+ * acquired arduously should be added to this record.
+ */
+
 TRACE_EVENT(mce_record,
 
TP_PROTO(struct mce *m),
@@ -25,6 +33,7 @@ TRACE_EVENT(mce_record,
__field(u64,ipid)
__field(u64,ip  )
__field(u64,tsc )
+   __field(u64,ppin)
__field(u64,walltime)
__field(u32,cpu )
__field(u32,cpuid   )
@@ -45,6 +54,7 @@ TRACE_EVENT(mce_record,
__entry->ipid   = m->ipid;
__entry->ip = m->ip;
__entry->tsc= m->tsc;
+   __entry->ppin   = m->ppin;
__entry->walltime   = m->time;
__entry->cpu= m->extcpu;
__entry->cpuid  = m->cpuid;
@@ -55,7 +65,7 @@ TRACE_EVENT(mce_record,
__entry->cpuvendor  = m->cpuvendor;
),
 
-   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, 
vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x",
+   TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, 
ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: 
%llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x",
__entry->cpu,
__entry->mcgcap, __entry->mcgstatus,
__entry->bank, __entry->status,
@@ -65,6 +75,7 @@ TRACE_EVENT(mce_record,
__entry->synd,
__entry->cs, __entry->ip,
__entry->tsc,
+   __entry->ppin,
__entry->cpuvendor,
__entry->cpuid,
__entry->walltime,
-- 
2.34.1




[PATCH v6 0/2] Update mce_record tracepoint

2024-04-01 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoint in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

Changes in v5:
 - Changed "MICROCODE REVISION" to just "MICROCODE".
 - Changed words which are not acronyms from ALL CAPS to no caps.
 - Added Reviewed-by: Tony Luck 

Changes in v6:
 - Rebased on top of Ingo's changes to the MCE tracepoint

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/include/trace/events/mce.h?id=ac5e80e94f5c67d7053f50fc3faddab931707f0f

[NOTE:
 - Since changes in this version are very minor, have retained the below
   tags received for previous versions:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck ]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 65d1240b6728b38e4d2068d6738a17e4ee4351f5
-- 
2.34.1




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 20:25:52 +0900
Masami Hiramatsu (Google)  wrote:

> > Masami,
> > 
> > Are you OK with just keeping it set to N.  
> 
> OK, if it is only for the debugging, I'm OK to set N this.
> 
> > 
> > We could have other options like PROVE_LOCKING enable it.  
> 
> Agreed (but it should say this is a debug option)

It does say "Validate", which to me indicates a debug option. What would
you suggest?

-- Steve



Re: [PATCH 1/3] remoteproc: k3-dsp: Fix usage of omap_mbox_message and mbox_msg_t

2024-04-01 Thread Mathieu Poirier
On Thu, Mar 28, 2024 at 11:26:24AM -0500, Andrew Davis wrote:
> On 3/28/24 10:28 AM, Mathieu Poirier wrote:
> > Hi Andrew,
> > 
> > On Mon, Mar 25, 2024 at 11:58:06AM -0500, Andrew Davis wrote:
> > > The type of message sent using omap-mailbox is always u32. The definition
> > > of mbox_msg_t is uintptr_t which is wrong as that type changes based on
> > > the architecture (32bit vs 64bit). Use u32 unconditionally and remove
> > > the now unneeded omap-mailbox.h include.
> > > 
> > > Signed-off-by: Andrew Davis 
> > > ---
> > >   drivers/remoteproc/ti_k3_dsp_remoteproc.c | 7 +++
> > >   1 file changed, 3 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c 
> > > b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
> > > index 3555b535b1683..33b30cfb86c9d 100644
> > > --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c
> > > +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c
> > > @@ -11,7 +11,6 @@
> > >   #include 
> > >   #include 
> > >   #include 
> > > -#include 
> > >   #include 
> > >   #include 
> > >   #include 
> > > @@ -113,7 +112,7 @@ static void k3_dsp_rproc_mbox_callback(struct 
> > > mbox_client *client, void *data)
> > > client);
> > >   struct device *dev = kproc->rproc->dev.parent;
> > >   const char *name = kproc->rproc->name;
> > > - u32 msg = omap_mbox_message(data);
> > > + u32 msg = (u32)(uintptr_t)(data);
> > 
> > Looking at omap-mailbox.h and unless I'm missing something, the end result 
> > is
> > the same.
> > 
> > 
> > >   dev_dbg(dev, "mbox msg: 0x%x\n", msg);
> > > @@ -152,11 +151,11 @@ static void k3_dsp_rproc_kick(struct rproc *rproc, 
> > > int vqid)
> > >   {
> > >   struct k3_dsp_rproc *kproc = rproc->priv;
> > >   struct device *dev = rproc->dev.parent;
> > > - mbox_msg_t msg = (mbox_msg_t)vqid;
> > > + u32 msg = vqid;
> > >   int ret;
> > > 
> > 
> > Here @vqid becomes a 'u32' rather than a 'uintptr'...
> > 
> 
> u32 is the correct type for messages sent with OMAP mailbox. It
> only sends 32bit messages, uintptr is 64bit when compiled on
> 64bit hardware (like our ARM64 cores on K3). mbox_msg_t should
> have been defined as u32, this was a mistake we missed as we only
> ever used to compile it for 32bit cores (where uintptr is 32bit).
> 
> > >   /* send the index of the triggered virtqueue in the mailbox 
> > > payload */
> > > - ret = mbox_send_message(kproc->mbox, (void *)msg);
> > > + ret = mbox_send_message(kproc->mbox, (void *)(uintptr_t)msg);
> > 
> > ... but here it is casted as a 'uintptr_t', which yields the same result.
> > 
> 
> The function mbox_send_message() takes a void*, so we need to cast our 32bit
> message to that first, it is cast back to u32 inside the OMAP mailbox driver.
> Doing that in one step (u32 -> void*) causes a warning when void* is 64bit
> (cast from int to pointer of different size).
> 
> > 
> > I am puzzled - other than getting rid of a header file I don't see what else
> > this patch does.
> > 
> 
> Getting rid of the header is the main point of this patch (I have a later
> series that needs that header gone). But the difference this patch makes is 
> that
> before we passed a pointer to a 64bit int to OMAP mailbox which takes a 
> pointer
> to a 32bit int. Sure, the result is the same in little-endian systems, but 
> that
> isn't a strictly correct in general.

From your explanation above this patchset is about two things:

1) Getting rid of a compilation warning when void* is 64bit wide
2) Getting rid of omap-mailbox.h

This is what the changelog should describe.  And next time, please add a cover
letter to your work.

Thanks,
Mathieu

> > >   if (ret < 0)
> > >   dev_err(dev, "failed to send mailbox message (%pe)\n",
> > >   ERR_PTR(ret));
> > > -- 
> > > 2.39.2
> > > 



Re: [PATCH v4 1/4] remoteproc: Add TEE support

2024-04-01 Thread Mathieu Poirier
On Fri, Mar 29, 2024 at 09:58:11AM +0100, Arnaud POULIQUEN wrote:
> Hello Mathieu,
> 
> On 3/27/24 18:07, Mathieu Poirier wrote:
> > On Tue, Mar 26, 2024 at 08:18:23PM +0100, Arnaud POULIQUEN wrote:
> >> Hello Mathieu,
> >>
> >> On 3/25/24 17:46, Mathieu Poirier wrote:
> >>> On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote:
>  Add a remoteproc TEE (Trusted Execution Environment) driver
>  that will be probed by the TEE bus. If the associated Trusted
>  application is supported on secure part this device offers a client
> >>>
> >>> Device or driver?  I thought I touched on that before.
> >>
> >> Right, I changed the first instance and missed this one
> >>
> >>>
>  interface to load a firmware in the secure part.
>  This firmware could be authenticated by the secure trusted application.
> 
>  Signed-off-by: Arnaud Pouliquen 
>  ---
>  Updates from V3:
>  - rework TEE_REMOTEPROC description in Kconfig
>  - fix some namings
>  - add tee_rproc_parse_fw  to support rproc_ops::parse_fw
>  - add proc::tee_interface;
>  - add rproc struct as parameter of the tee_rproc_register() function
>  ---
>   drivers/remoteproc/Kconfig  |  10 +
>   drivers/remoteproc/Makefile |   1 +
>   drivers/remoteproc/tee_remoteproc.c | 434 
>   include/linux/remoteproc.h  |   4 +
>   include/linux/tee_remoteproc.h  | 112 +++
>   5 files changed, 561 insertions(+)
>   create mode 100644 drivers/remoteproc/tee_remoteproc.c
>   create mode 100644 include/linux/tee_remoteproc.h
> 
>  diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
>  index 48845dc8fa85..2cf1431b2b59 100644
>  --- a/drivers/remoteproc/Kconfig
>  +++ b/drivers/remoteproc/Kconfig
>  @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
>   
> It's safe to say N if not interested in using RPU r5f cores.
>   
>  +
>  +config TEE_REMOTEPROC
>  +tristate "remoteproc support by a TEE application"
> >>>
> >>> s/remoteproc/Remoteproc
> >>>
>  +depends on OPTEE
>  +help
>  +  Support a remote processor with a TEE application. The Trusted
>  +  Execution Context is responsible for loading the trusted 
>  firmware
>  +  image and managing the remote processor's lifecycle.
>  +  This can be either built-in or a loadable module.
>  +
>   endif # REMOTEPROC
>   
>   endmenu
>  diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
>  index 91314a9b43ce..fa8daebce277 100644
>  --- a/drivers/remoteproc/Makefile
>  +++ b/drivers/remoteproc/Makefile
>  @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)  += rcar_rproc.o
>   obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o
>   obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o
>   obj-$(CONFIG_STM32_RPROC)   += stm32_rproc.o
>  +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o
>   obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)  += ti_k3_dsp_remoteproc.o
>   obj-$(CONFIG_TI_K3_R5_REMOTEPROC)   += ti_k3_r5_remoteproc.o
>   obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o
>  diff --git a/drivers/remoteproc/tee_remoteproc.c 
>  b/drivers/remoteproc/tee_remoteproc.c
>  new file mode 100644
>  index ..c855210e52e3
>  --- /dev/null
>  +++ b/drivers/remoteproc/tee_remoteproc.c
>  @@ -0,0 +1,434 @@
>  +// SPDX-License-Identifier: GPL-2.0-or-later
>  +/*
>  + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
>  + * Author: Arnaud Pouliquen 
>  + */
>  +
>  +#include 
>  +#include 
>  +#include 
>  +#include 
>  +#include 
>  +#include 
>  +#include 
>  +
>  +#include "remoteproc_internal.h"
>  +
>  +#define MAX_TEE_PARAM_ARRY_MEMBER   4
>  +
>  +/*
>  + * Authentication of the firmware and load in the remote processor 
>  memory
>  + *
>  + * [in]  params[0].value.a: unique 32bit identifier of the remote 
>  processor
>  + * [in]  params[1].memref:  buffer containing the image of the 
>  buffer
>  + */
>  +#define TA_RPROC_FW_CMD_LOAD_FW 1
>  +
>  +/*
>  + * Start the remote processor
>  + *
>  + * [in]  params[0].value.a: unique 32bit identifier of the remote 
>  processor
>  + */
>  +#define TA_RPROC_FW_CMD_START_FW2
>  +
>  +/*
>  + * Stop the remote processor
>  + *
>  + * [in]  params[0].value.a: unique 32bit identifier of the remote 
>  processor
>  + */
>  +#define TA_RPROC_FW_CMD_STOP_FW 3
>  +
>  +/*
>  + * Return the address of the resource table, or 0 if not found
>  + * No check is done to verify that the 

Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-04-01 Thread Mathieu Poirier
On Fri, Mar 29, 2024 at 11:57:43AM +0100, Arnaud POULIQUEN wrote:
> 
> 
> On 3/27/24 18:14, Mathieu Poirier wrote:
> > On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote:
> >>
> >>
> >> On 3/25/24 17:51, Mathieu Poirier wrote:
> >>> On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
>  The new TEE remoteproc device is used to manage remote firmware in a
>  secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>  introduced to delegate the loading of the firmware to the trusted
>  execution context. In such cases, the firmware should be signed and
>  adhere to the image format defined by the TEE.
> 
>  Signed-off-by: Arnaud Pouliquen 
>  ---
>  Updates from V3:
>  - remove support of the attach use case. Will be addressed in a separate
>    thread,
>  - add st_rproc_tee_ops::parse_fw ops,
>  - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage 
>  cross
>    reference between the rproc struct and the tee_rproc struct in 
>  tee_rproc.c.
>  ---
>   drivers/remoteproc/stm32_rproc.c | 60 +---
>   1 file changed, 56 insertions(+), 4 deletions(-)
> 
>  diff --git a/drivers/remoteproc/stm32_rproc.c 
>  b/drivers/remoteproc/stm32_rproc.c
>  index 8cd838df4e92..13df33c78aa2 100644
>  --- a/drivers/remoteproc/stm32_rproc.c
>  +++ b/drivers/remoteproc/stm32_rproc.c
>  @@ -20,6 +20,7 @@
>   #include 
>   #include 
>   #include 
>  +#include 
>   #include 
>   
>   #include "remoteproc_internal.h"
>  @@ -49,6 +50,9 @@
>   #define M4_STATE_STANDBY4
>   #define M4_STATE_CRASH  5
>   
>  +/* Remote processor unique identifier aligned with the Trusted 
>  Execution Environment definitions */
> >>>
> >>> Why is this the case?  At least from the kernel side it is possible to 
> >>> call
> >>> tee_rproc_register() with any kind of value, why is there a need to be any
> >>> kind of alignment with the TEE?
> >>
> >>
> >> The use of the proc_id is to identify a processor in case of multi 
> >> co-processors.
> >>
> > 
> > That is well understood.
> > 
> >> For instance we can have a system with A DSP and a modem. We would use the 
> >> same
> >> TEE service, but
> > 
> > That too.
> > 
> >> the TEE driver will probably be different, same for the signature key.
> > 
> > What TEE driver are we talking about here?
> 
> In OP-TEE the remoteproc framework is divided into 2 or 3 layers:
> 
> - the remoteproc Trusted Application (TA) [1] which is platform agnostic
> - The remoteproc Pseudo Trusted Application (PTA) [2] which is platform
> dependent and can rely on the proc ID to retrieve the context.
> - the remoteproc driver (optional for some platforms) [3], which is in charge
>  of DT parsing and platform configuration.
> 

That part makes sense.

> Here TEE driver can be interpreted by remote PTA and/or platform driver.
>

I have to guess PTA means "Platform Trusted Application" but I have no
guarantee, adding to the level of (already high) confusion brought on by this
patchset.

> [1]
> https://elixir.bootlin.com/op-tee/latest/source/ta/remoteproc/src/remoteproc_core.c
> [2]
> https://elixir.bootlin.com/op-tee/latest/source/core/pta/stm32mp/remoteproc_pta.c
> [3]
> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c
> 
> > 
> >> In such case the proc ID allows to identify the the processor you want to 
> >> address.
> >>
> > 
> > That too is well understood, but there is no alignment needed with the TEE, 
> > i.e
> > the TEE application is not expecting a value of '0'.  We could set
> > STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work.  This driver 
> > won't go
> > anywhere for as long as it is not the case.
> 
> 
> Here I suppose that you do not challenge the rproc_ID use in general, but for
> the stm32mp1 platform as we have only one remote processor. Am I right?

That is correct - I understand the need for an rproc_ID.  The problem is with
the comment that states that '0' is aligned with the TEE definitions, which in
my head means hard coded value and a big red flag.  What it should say is
"aligned with the TEE device tree definition". 

> 
> In OP-TEE the check is done here:
> https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c#L64
> 
> If driver does not register the proc ID an error is returned indicating that 
> the
> feature is not supported.
> 
> In case of stm32mp1 yes we could consider it as useless as we have only one
> remote proc.
> 
> Nevertheless I can not guaranty that a customer will not add an external
> companion processor that uses OP-TEE to authenticate the associated firmware. 
> As
> the trusted Application is the unique entry point. he will need the proc_id to
> identify the target at PTA level.
> 
> So from my point of view having a proc ID on stm32MP1 (and on stm32mp2 

Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-04-01 Thread Jakub Kicinski
On Sun, 31 Mar 2024 16:20:30 -0400 Michael S. Tsirkin wrote:
> > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> > Cc: sta...@vger.kernel.org  
> 
> net has its own stable process, don't CC stable on net patches.

Not any more, FWIW:

  1.5.7. Stable tree

  While it used to be the case that netdev submissions were not
  supposed to carry explicit CC: sta...@vger.kernel.org tags that is no
  longer the case today. Please follow the standard stable rules in
  Documentation/process/stable-kernel-rules.rst, and make sure you
  include appropriate Fixes tags!

https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#stable-tree



Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,msm8996-mss-pil: allow glink-edge on msm8996

2024-04-01 Thread Rob Herring


On Mon, 01 Apr 2024 00:10:42 +0300, Dmitry Baryshkov wrote:
> MSM8996 has limited glink support, allow glink-edge node on MSM8996
> platform.
> 
> Signed-off-by: Dmitry Baryshkov 
> ---
>  Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml | 1 -
>  1 file changed, 1 deletion(-)
> 

Acked-by: Rob Herring 




Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-01 Thread Jarkko Sakkinen
On Mon Apr 1, 2024 at 12:29 PM EEST, Huang, Kai wrote:
> On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> > > 
> > > > --- /dev/null
> > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > > @@ -0,0 +1,74 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +// Copyright(c) 2022 Intel Corporation.
> > > 
> > > It's 2024 now.
> > > 
> > > And looks you need to use C style comment for /* Copyright ... */, after 
> > > looking
> > > at some other C files.
> > 
> > To be fair, this happens *all the time* to everyone :-)
> > 
> > I've proposed this few times in SGX context and going to say it now.
> > Given the nature of Git copyrights would anyway need to be sorted by
> > the Git log not possibly incorrect copyright platters in the header
> > and source files.
> > 
>
> Sure fine to me either way.  Thanks for pointing out.
>
> I have some vague memory that we should update the year but I guess I was 
> wrong.

I think updating year makes sense!

I'd be fine not having copyright platter at all since the commit is from
Intel domain anyway but if it is kept then the year needs to be
corrected.

I mean Git commit stores all the data, including exact date.

BR, Jarkko




Re: [PATCH] selftests/sgx: Improve cgroup test scripts

2024-04-01 Thread Jarkko Sakkinen
On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote:
> Make cgroup test scripts ash compatible.
> Remove cg-tools dependency.
> Add documentation for functions.
>
> Tested with busybox on Ubuntu.
>
> Signed-off-by: Haitao Huang 

I'll run this next week on good old NUC7. Thank you.

I really wish that either (hopefully both) Intel or AMD would bring up
for developers home use meant platform to develop on TDX and SNP. It is
a shame that the latest and greatest is from 2018.

BR, Jarkko



Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-01 Thread Jason Xing
On Mon, Apr 1, 2024 at 8:34 PM  wrote:
>
> From: hepeilin 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. Giving a use case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> v3->v4:
> Some fixes according to
> https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
> 1.Add legality check for UDP header in SKB.

I think my understanding based on what Eric depicted differs from you:
we're supposed to filter out those many invalid cases and only trace
the valid action of sending a icmp, so where to add a new tracepoint
is important instead of adding more checks in the tracepoint itself.
Please refer to what trace_tcp_retransmit_skb() does :)

Thanks,
Jason

> 2.Target this patch for net-next.
>
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracking directory to /sys/kernel/tracking.
> 2. Adjust the layout of the TP-STRUCT_entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
> 1. adjust the trace_icmp_send() to more protocols than UDP.
> 2. move the calling of trace_icmp_send after sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 65 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 69 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..7d5190f48a28
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   struct udphdr *uh = udp_hdr(skb);
> +   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> +   (u8 *)uh + sizeof(struct udphdr) > 
> skb_tail_pointer(skb)) {
> +   __entry->sport = 0;
> +   __entry->dport = 0;
> +   __entry->ulen = 0;
> +   } else {
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);
> +   }
> +
> +   p32 = (__be32 *) __entry->saddr;
> +   *p32 = iph->saddr;
> +
> +   p32 = (__be32 *) __entry->daddr;
> +   *p32 = 

Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-01 Thread Jarkko Sakkinen
On Sun Mar 31, 2024 at 8:35 PM EEST, Haitao Huang wrote:
> On Sun, 31 Mar 2024 11:19:04 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote:
> >> On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen 
> >> wrote:
> >>
> >> > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote:
> >> >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen  
> >> 
> >> >> wrote:
> >> >>
> >> >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> >> >> >> The scripts rely on cgroup-tools package from libcgroup [1].
> >> >> >>
> >> >> >> To run selftests for epc cgroup:
> >> >> >>
> >> >> >> sudo ./run_epc_cg_selftests.sh
> >> >> >>
> >> >> >> To watch misc cgroup 'current' changes during testing, run this  
> >> in a
> >> >> >> separate terminal:
> >> >> >>
> >> >> >> ./watch_misc_for_tests.sh current
> >> >> >>
> >> >> >> With different cgroups, the script starts one or multiple  
> >> concurrent
> >> >> >> SGX
> >> >> >> selftests, each to run one unclobbered_vdso_oversubscribed  
> >> test.Each
> >> >> >> of such test tries to load an enclave of EPC size equal to the EPC
> >> >> >> capacity available on the platform. The script checks results  
> >> against
> >> >> >> the expectation set for each cgroup and reports success or  
> >> failure.
> >> >> >>
> >> >> >> The script creates 3 different cgroups at the beginning with
> >> >> >> following
> >> >> >> expectations:
> >> >> >>
> >> >> >> 1) SMALL - intentionally small enough to fail the test loading an
> >> >> >> enclave of size equal to the capacity.
> >> >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail  
> >> some
> >> >> >> if
> >> >> >> more than 4 concurrent tests are run. The script starts 4  
> >> expecting
> >> >> >> at
> >> >> >> least one test to pass, and then starts 5 expecting at least one  
> >> test
> >> >> >> to fail.
> >> >> >> 3) LARGER - limit is the same as the capacity, large enough to run
> >> >> >> lots of
> >> >> >> concurrent tests. The script starts 8 of them and expects all  
> >> pass.
> >> >> >> Then it reruns the same test with one process randomly killed and
> >> >> >> usage checked to be zero after all process exit.
> >> >> >>
> >> >> >> The script also includes a test with low mem_cg limit and LARGE
> >> >> >> sgx_epc
> >> >> >> limit to verify that the RAM used for per-cgroup reclamation is
> >> >> >> charged
> >> >> >> to a proper mem_cg.
> >> >> >>
> >> >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README
> >> >> >>
> >> >> >> Signed-off-by: Haitao Huang 
> >> >> >> ---
> >> >> >> V7:
> >> >> >> - Added memcontrol test.
> >> >> >>
> >> >> >> V5:
> >> >> >> - Added script with automatic results checking, remove the
> >> >> >> interactive
> >> >> >> script.
> >> >> >> - The script can run independent from the series below.
> >> >> >> ---
> >> >> >>  .../selftests/sgx/run_epc_cg_selftests.sh | 246
> >> >> >> ++
> >> >> >>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
> >> >> >>  2 files changed, 259 insertions(+)
> >> >> >>  create mode 100755
> >> >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >>  create mode 100755
> >> >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh
> >> >> >>
> >> >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >> new file mode 100755
> >> >> >> index ..e027bf39f005
> >> >> >> --- /dev/null
> >> >> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> >> >> >> @@ -0,0 +1,246 @@
> >> >> >> +#!/bin/bash
> >> >> >
> >> >> > This is not portable and neither does hold in the wild.
> >> >> >
> >> >> > It does not even often hold as it is not uncommon to place bash
> >> >> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has
> >> >> > a path that is neither of those two.
> >> >> >
> >> >> > Should be #!/usr/bin/env bash
> >> >> >
> >> >> > That is POSIX compatible form.
> >> >> >
> >> >>
> >> >> Sure
> >> >>
> >> >> > Just got around trying to test this in NUC7 so looking into this in
> >> >> > more detail.
> >> >>
> >> >> Thanks. Could you please check if this version works for you?
> >> >>
> >> >>  
> >> https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f
> >> >>
> >> >> >
> >> >> > That said can you make the script work with just "#!/usr/bin/env  
> >> sh"
> >> >> > and make sure that it is busybox ash compatible?
> >> >>
> >> >> Yes.
> >> >>
> >> >> >
> >> >> > I don't see any necessity to make this bash only and it adds to the
> >> >> > compilation time of the image. Otherwise lot of this could be  
> >> tested
> >> >> > just with qemu+bzImage+busybox(inside initramfs).
> >> >> >
> >> >>
> >> >> will still need cgroup-tools as you pointed out later. Compiling from
> >> >> its
> >> >> upstream code OK?
> >> >
> >> > Can you explain why you need it?
> >> >
> >> > What is the thing you cannot do without it?
> >> >
> >> > 

[PATCH] ftrace: Fix use-after-free issue in ftrace_location()

2024-04-01 Thread Zheng Yejian
KASAN reports a bug:

  BUG: KASAN: use-after-free in ftrace_location+0x90/0x120
  Read of size 8 at addr 888141d40010 by task insmod/424
  CPU: 8 PID: 424 Comm: insmod Tainted: GW  6.9.0-rc2+ #213
  [...]
  Call Trace:
   
   dump_stack_lvl+0x68/0xa0
   print_report+0xcf/0x610
   kasan_report+0xb5/0xe0
   ftrace_location+0x90/0x120
   register_kprobe+0x14b/0xa40
   kprobe_init+0x2d/0xff0 [kprobe_example]
   do_one_initcall+0x8f/0x2d0
   do_init_module+0x13a/0x3c0
   load_module+0x3082/0x33d0
   init_module_from_file+0xd2/0x130
   __x64_sys_finit_module+0x306/0x440
   do_syscall_64+0x68/0x140
   entry_SYSCALL_64_after_hwframe+0x71/0x79

The root cause is that when lookup_rec() is lookuping ftrace record of
an address in some module, and at the same time in ftrace_release_mod(),
the memory that saving ftrace records has been freed as that module is
being deleted.

  register_kprobes() {
check_kprobe_address_safe() {
  arch_check_ftrace_location() {
ftrace_location() {
  lookup_rec()  // access memory that has been freed by
// ftrace_release_mod() !!!

It seems that the ftrace_lock is required when lookuping records in
ftrace_location(), so is ftrace_location_range().

Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization")
Signed-off-by: Zheng Yejian 
---
 kernel/trace/ftrace.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index da1710499698..838d175709c1 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
 }
 
 /**
- * ftrace_location_range - return the first address of a traced location
+ * ftrace_location_range_locked - return the first address of a traced location
  * if it touches the given ip range
  * @start: start of range to search.
  * @end: end of range to search (inclusive). @end points to the last byte
@@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, 
unsigned long end)
  * that is either a NOP or call to the function tracer. It checks the ftrace
  * internal tables to determine if the address belongs or not.
  */
-unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+static unsigned long ftrace_location_range_locked(unsigned long start, 
unsigned long end)
 {
struct dyn_ftrace *rec;
 
@@ -1603,6 +1603,17 @@ unsigned long ftrace_location_range(unsigned long start, 
unsigned long end)
return 0;
 }
 
+unsigned long ftrace_location_range(unsigned long start, unsigned long end)
+{
+   unsigned long loc;
+
+   mutex_lock(&ftrace_lock);
+   loc = ftrace_location_range_locked(start, end);
+   mutex_unlock(&ftrace_lock);
+
+   return loc;
+}
+
 /**
  * ftrace_location - return the ftrace location
  * @ip: the instruction pointer to check
@@ -1614,25 +1625,22 @@ unsigned long ftrace_location_range(unsigned long 
start, unsigned long end)
  */
 unsigned long ftrace_location(unsigned long ip)
 {
-   struct dyn_ftrace *rec;
+   unsigned long loc;
unsigned long offset;
unsigned long size;
 
-   rec = lookup_rec(ip, ip);
-   if (!rec) {
+   loc = ftrace_location_range(ip, ip);
+   if (!loc) {
if (!kallsyms_lookup_size_offset(ip, &size, &offset))
goto out;
 
/* map sym+0 to __fentry__ */
if (!offset)
-   rec = lookup_rec(ip, ip + size - 1);
+   loc = ftrace_location_range(ip, ip + size - 1);
}
 
-   if (rec)
-   return rec->ip;
-
 out:
-   return 0;
+   return loc;
 }
 
 /**
-- 
2.25.1




Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send

2024-04-01 Thread xu.xin16
From: hepeilin 

Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.

1. Giving a use case example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a

v3->v4:
Some fixes according to
https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
1.Add legality check for UDP header in SKB.
2.Target this patch for net-next.

v2->v3:
Some fixes according to
https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracking directory to /sys/kernel/tracking.
2. Adjust the layout of the TP-STRUCT_entry parameter structure.

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
in __icmp_send().

Signed-off-by: Peilin He
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 65 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 69 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..7d5190f48a28
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   struct udphdr *uh = udp_hdr(skb);
+   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
+   (u8 *)uh + sizeof(struct udphdr) > 
skb_tail_pointer(skb)) {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   } else {
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 = iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include 
\ No newline at end of file
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8cebb476b3ab..224551d75c02 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -92,6 +92,8 @@
 #include 
 #include 
 

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Tue, 26 Mar 2024 15:01:21 -0400
Steven Rostedt  wrote:

> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko  wrote:
> 
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >  
> > 
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
> 
> Masami,
> 
> Are you OK with just keeping it set to N.

OK, if it is only for the debugging, I'm OK to set N this.

> 
> We could have other options like PROVE_LOCKING enable it.

Agreed (but it should say this is a debug option)

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality

2024-04-01 Thread Huang, Kai
On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote:
> On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote:
> > 
> > > --- /dev/null
> > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> > > @@ -0,0 +1,74 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +// Copyright(c) 2022 Intel Corporation.
> > 
> > It's 2024 now.
> > 
> > And looks you need to use C style comment for /* Copyright ... */, after 
> > looking
> > at some other C files.
> 
> To be fair, this happens *all the time* to everyone :-)
> 
> I've proposed this few times in SGX context and going to say it now.
> Given the nature of Git copyrights would anyway need to be sorted by
> the Git log not possibly incorrect copyright platters in the header
> and source files.
> 

Sure, fine to me either way.  Thanks for pointing that out.

I have some vague memory that we should update the year, but I guess I was wrong.


general protection fault in __fib6_update_sernum_upto_root

2024-04-01 Thread Ubisectech Sirius
Hello.
We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
Recently, our team discovered an issue in Linux kernel 6.7. Attached to the 
email is a PoC file for the issue.

Stack dump:
general protection fault, probably for non-canonical address 
0xff1f1b1f1f1f1f24:  [#1] PREEMPT SMP KASAN NOPTI
KASAN: maybe wild-memory-access in range [0xf8f8f8f8f8f8f920-0xf8f8f8f8f8f8f927]
CPU: 1 PID: 9367 Comm: kworker/1:5 Not tainted 6.7.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358
Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 
00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 
83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RSP: 0018:c9000631f7c8 EFLAGS: 00010a07
RAX: 1f1f1f1f1f1f1f24 RBX: f8f8f8f8f8f8f8f8 RCX: 89508644
RDX: 888051d78000 RSI: 895085dc RDI: f8f8f8f8f8f8f924
RBP: 0001 R08: 0005 R09: 
R10: 0001 R11:  R12: dc00
R13: 0186 R14: 888052396c00 R15: ed100a472d80
FS:  () GS:88807ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f42c8487d00 CR3: 4b42c000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 
 __list_add include/linux/list.h:153 [inline]
 list_add include/linux/list.h:169 [inline]
 fib6_add+0x16c4/0x4410 net/ipv6/ip6_fib.c:1490
 __ip6_ins_rt net/ipv6/route.c:1313 [inline]
 ip6_ins_rt+0xb6/0x110 net/ipv6/route.c:1323
 __ipv6_ifa_notify+0xab3/0xd30 net/ipv6/addrconf.c:6266
 ipv6_ifa_notify net/ipv6/addrconf.c:6303 [inline]
 addrconf_dad_completed+0x15f/0xef0 net/ipv6/addrconf.c:4317
 addrconf_dad_work+0x785/0x14e0 net/ipv6/addrconf.c:4260
 process_one_work+0x87b/0x15c0 kernel/workqueue.c:3226
 worker_thread+0x855/0x1200 kernel/workqueue.c:3380
 kthread+0x2cc/0x3b0 kernel/kthread.c:388
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:256
 
Modules linked in:
---[ end trace  ]---
RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358
Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 
00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 
83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RSP: 0018:c9000631f7c8 EFLAGS: 00010a07
RAX: 1f1f1f1f1f1f1f24 RBX: f8f8f8f8f8f8f8f8 RCX: 89508644
RDX: 888051d78000 RSI: 895085dc RDI: f8f8f8f8f8f8f924
RBP: 0001 R08: 0005 R09: 
R10: 0001 R11:  R12: dc00
R13: 0186 R14: 888052396c00 R15: ed100a472d80
FS:  () GS:88807ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f42c8487d00 CR3: 4b42c000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554

Code disassembly (best guess):
   0:   c1 e8 03shr$0x3,%eax
   3:   42 80 3c 20 00  cmpb   $0x0,(%rax,%r12,1)
   8:   0f 85 9b 01 00 00   jne0x1a9
   e:   48 8b 1bmov(%rbx),%rbx
  11:   48 85 dbtest   %rbx,%rbx
  14:   0f 84 d9 00 00 00   je 0xf3
  1a:   e8 74 70 39 f8  call   0xf8397093
  1f:   48 8d 7b 2c lea0x2c(%rbx),%rdi
  23:   48 89 f8mov%rdi,%rax
  26:   48 c1 e8 03 shr$0x3,%rax
* 2a:   42 0f b6 14 20  movzbl (%rax,%r12,1),%edx <-- trapping 
instruction
  2f:   48 89 f8mov%rdi,%rax
  32:   83 e0 07and$0x7,%eax
  35:   83 c0 03add$0x3,%eax
  38:   38 d0   cmp%dl,%al
  3a:   7c 08   jl 0x44
  3c:   84 d2   test   %dl,%dl
  3e:   0f  .byte 0xf
  3f:   85  .byte 0x85

Thank you for taking the time to read this email and we look forward to working 
with you further.






poc.c
Description: Binary data


[PATCH net-next v4 2/2] trace: tcp: fully support trace_tcp_send_reset

2024-04-01 Thread Jason Xing
From: Jason Xing 

Prior to this patch, what we can see by enabling trace_tcp_send_reset
only happens under two circumstances:
1) active rst mode
2) non-active rst mode and based on the full socket

That means the inconsistency occurs if we use tcpdump and trace
simultaneously to see how rst happens.

It's necessary that we take other cases into consideration as well,
say:
1) time-wait socket
2) no socket
...

By parsing the incoming skb and reversing its 4-tuple, we can know the
exact 'flow', which might not exist.

Samples after applied this patch:
1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
state=TCP_ESTABLISHED
2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
state=UNKNOWN
Note:
1) UNKNOWN means we cannot extract the right information from skb.
2) skbaddr/skaddr could be 0

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 40 --
 net/ipv4/tcp_ipv4.c|  7 +++
 net/ipv6/tcp_ipv6.c|  3 ++-
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index cf14b6fcbeed..5c04a61a11c2 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -78,11 +78,47 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
-DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
+TRACE_EVENT(tcp_send_reset,
 
TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
 
-   TP_ARGS(sk, skb)
+   TP_ARGS(sk, skb),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(const void *, skaddr)
+   __field(int, state)
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
+   ),
+
+   TP_fast_assign(
+   __entry->skbaddr = skb;
+   __entry->skaddr = sk;
+   /* Zero means unknown state. */
+   __entry->state = sk ? sk->sk_state : 0;
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   if (sk && sk_fullsock(sk)) {
+   const struct inet_sock *inet = inet_sk(sk);
+
+   TP_STORE_ADDR_PORTS(__entry, inet, sk);
+   } else if (skb) {
+   const struct tcphdr *th = (const struct tcphdr 
*)skb->data;
+   /*
+* We should reverse the 4-tuple of skb, so later
+* it can print the right flow direction of rst.
+*/
+   TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
+   }
+   ),
+
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+ __entry->skbaddr, __entry->skaddr,
+ __entry->saddr, __entry->daddr,
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
 );
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a22ee5838751..0d47b48f8cfd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -866,11 +866,10 @@ static void tcp_v4_send_reset(const struct sock *sk, 
struct sk_buff *skb)
 * routing might fail in this case. No choice here, if we choose to 
force
 * input interface, we will misroute in case of asymmetric route.
 */
-   if (sk) {
+   if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
-   if (sk_fullsock(sk))
-   trace_tcp_send_reset(sk, skb);
-   }
+
+   trace_tcp_send_reset(sk, skb);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3f4cba49e9ee..8e9c59b6c00c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
if (sk) {
oif = sk->sk_bound_dev_if;
if (sk_fullsock(sk)) {
-   trace_tcp_send_reset(sk, skb);
if (inet6_test_bit(REPFLOW, sk))
label = ip6_flowlabel(ipv6h);
priority = READ_ONCE(sk->sk_priority);
@@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
label = ip6_flowlabel(ipv6h);
}
 
+   trace_tcp_send_reset(sk, skb);
+
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
 );
-- 
2.37.3




[PATCH net-next v4 1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters

2024-04-01 Thread Jason Xing
From: Jason Xing 

Introduce entry_saddr and entry_daddr parameters in this macro for
later use; they let us record the reversed 4-tuple by analyzing the
4-tuple of the incoming skb on receive.

Signed-off-by: Jason Xing 
Reviewed-by: Eric Dumazet 
---
 include/trace/events/net_probe_common.h | 20 +++-
 include/trace/events/tcp.h  |  2 +-
 include/trace/events/udp.h  |  2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/net_probe_common.h 
b/include/trace/events/net_probe_common.h
index 5e33f91bdea3..976a58364bff 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -70,14 +70,14 @@
TP_STORE_V4MAPPED(__entry, saddr, daddr)
 #endif
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)   \
+#define TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr) \
do {\
-   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   struct sockaddr_in *v4 = (void *)entry_saddr;   \
\
v4->sin_family = AF_INET;   \
v4->sin_port = protoh->source;  \
v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
-   v4 = (void *)__entry->daddr;\
+   v4 = (void *)entry_daddr;   \
v4->sin_family = AF_INET;   \
v4->sin_port = protoh->dest;\
v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
@@ -85,28 +85,30 @@
 
 #if IS_ENABLED(CONFIG_IPV6)
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
do {\
const struct iphdr *iph = ip_hdr(skb);  \
\
if (iph->version == 6) {\
-   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   struct sockaddr_in6 *v6 = (void *)entry_saddr;  \
\
v6->sin6_family = AF_INET6; \
v6->sin6_port = protoh->source; \
v6->sin6_addr = ipv6_hdr(skb)->saddr;   \
-   v6 = (void *)__entry->daddr;\
+   v6 = (void *)entry_daddr;   \
v6->sin6_family = AF_INET6; \
v6->sin6_port = protoh->dest;   \
v6->sin6_addr = ipv6_hdr(skb)->daddr;   \
} else  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, \
+  entry_saddr, \
+  entry_daddr);\
} while (0)
 
 #else
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr)
 
 #endif
 
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1db95175c1e5..cf14b6fcbeed 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -295,7 +295,7 @@ DECLARE_EVENT_CLASS(tcp_event_skb,
memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-   TP_STORE_ADDR_PORTS_SKB(__entry, skb, th);
+   TP_STORE_ADDR_PORTS_SKB(skb, th, __entry->saddr, 
__entry->daddr);
),
 
TP_printk("skbaddr=%p src=%pISpc dest=%pISpc",
diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 62bebe2a6ece..6142be4068e2 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -38,7 +38,7 @@ TRACE_EVENT(udp_fail_queue_rcv_skb,
memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-   TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
+   TP_STORE_ADDR_PORTS_SKB(skb, uh, __entry->saddr, 
__entry->daddr);
),
 
TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc", 

[PATCH net-next v4 0/2] tcp: make trace of reset logic complete

2024-04-01 Thread Jason Xing
From: Jason Xing 

Before this, we missed some cases where the TCP layer could send an RST
but we could not trace it. So I decided to complete it :)

v4
Link: 
https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/
1. rebased against latest net-next
2. remove {} and add skb test statement (Eric)
3. drop v3 patch [3/3] temporarily because 1) the location is not that
useful since we can use perf or something else to trace, and 2) Eric
said we could use drop_reason to show why we have to RST, which is good,
but this does not seem to work well for the ->send_reset() logic. I need
more time to investigate this part.

v3
1. fix a format problem in patch [3/3]

v2
1. fix spelling mistakes

Jason Xing (2):
  trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
  trace: tcp: fully support trace_tcp_send_reset

 include/trace/events/net_probe_common.h | 20 ++--
 include/trace/events/tcp.h  | 42 +++--
 include/trace/events/udp.h  |  2 +-
 net/ipv4/tcp_ipv4.c |  7 ++---
 net/ipv6/tcp_ipv6.c |  3 +-
 5 files changed, 56 insertions(+), 18 deletions(-)

-- 
2.37.3




general protection fault in refill_obj_stock

2024-04-01 Thread Ubisectech Sirius
Hello.
We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
Recently, our team discovered an issue in Linux kernel 6.7. Attached to the 
email is a PoC file for the issue.

Stack dump:
general protection fault, probably for non-canonical address 
0xdc001cc6:  [#1] PREEMPT SMP KASAN NOPTI
KASAN: probably user-memory-access in range 
[0xe630-0xe637]
CPU: 0 PID: 8041 Comm: systemd-udevd Not tainted 6.7.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ref_is_percpu include/linux/percpu-refcount.h:174 [inline]
RIP: 0010:percpu_ref_get_many include/linux/percpu-refcount.h:204 [inline]
RIP: 0010:percpu_ref_get include/linux/percpu-refcount.h:222 [inline]
RIP: 0010:obj_cgroup_get include/linux/memcontrol.h:810 [inline]
RIP: 0010:refill_obj_stock+0x135/0x500 mm/memcontrol.c:3535
Code: c7 c7 60 9f 3a 8d e8 fa ca 81 ff e8 d5 4e b2 08 5a 85 c0 0f 85 52 02 00 
00 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 86 03 
00 00 48 8b 45 00 a8 03 0f 85 76 02 00 00
RSP: 0018:c900088bf898 EFLAGS: 00010006
RAX: dc00 RBX: 000380a0 RCX: 192001117edd
RDX: 1cc6 RSI: 0001 RDI: 8cddfa60
RBP: e633 R08:  R09: fbfff27147e0
R10: 938a3f07 R11:  R12: 0148
R13: 0200 R14: 88802c6380a0 R15: 88802c6380e0
FS:  7f774934e8c0() GS:88802c60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 566127e8 CR3: 48fe8000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 
 memcg_slab_free_hook+0x157/0x2c0
 slab_free_hook mm/slub.c:2075 [inline]
 slab_free mm/slub.c:4280 [inline]
 kmem_cache_free+0xe1/0x350 mm/slub.c:4344
 kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:1159
 __kfree_skb net/core/skbuff.c:1217 [inline]
 consume_skb net/core/skbuff.c:1432 [inline]
 consume_skb+0xdf/0x170 net/core/skbuff.c:1426
 netlink_recvmsg+0x5cb/0xf10 net/netlink/af_netlink.c:1983
 sock_recvmsg_nosec net/socket.c:1046 [inline]
 sock_recvmsg+0x1de/0x240 net/socket.c:1068
 sys_recvmsg+0x216/0x670 net/socket.c:2803
 ___sys_recvmsg+0xff/0x190 net/socket.c:2845
 __sys_recvmsg+0xfb/0x1d0 net/socket.c:2875
 current_top_of_stack arch/x86/include/asm/processor.h:532 [inline]
 on_thread_stack arch/x86/include/asm/processor.h:537 [inline]
 arch_enter_from_user_mode arch/x86/include/asm/entry-common.h:41 [inline]
 enter_from_user_mode include/linux/entry-common.h:108 [inline]
 syscall_enter_from_user_mode include/linux/entry-common.h:194 [inline]
 do_syscall_64+0x43/0x120 arch/x86/entry/common.c:79
 entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f7749601d73
Code: 8b 15 59 a2 00 00 f7 d8 64 89 02 b8 ff ff ff ff eb b7 0f 1f 44 00 00 64 
8b 04 25 18 00 00 00 85 c0 75 14 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 
c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
RSP: 002b:7fff81586858 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fff81588a20 RCX: 7f7749601d73
RDX:  RSI: 7fff815868f0 RDI: 000f
RBP: 7fff815869d0 R08: 46d4 R09: 7fff815e5080
R10: 0007 R11: 0246 R12: 
R13: 55824edb2ef0 R14: 0100 R15: 
 
Modules linked in:
---[ end trace  ]---
RIP: 0010:__ref_is_percpu include/linux/percpu-refcount.h:174 [inline]
RIP: 0010:percpu_ref_get_many include/linux/percpu-refcount.h:204 [inline]
RIP: 0010:percpu_ref_get include/linux/percpu-refcount.h:222 [inline]
RIP: 0010:obj_cgroup_get include/linux/memcontrol.h:810 [inline]
RIP: 0010:refill_obj_stock+0x135/0x500 mm/memcontrol.c:3535
Code: c7 c7 60 9f 3a 8d e8 fa ca 81 ff e8 d5 4e b2 08 5a 85 c0 0f 85 52 02 00 
00 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 86 03 
00 00 48 8b 45 00 a8 03 0f 85 76 02 00 00
RSP: 0018:c900088bf898 EFLAGS: 00010006
RAX: dc00 RBX: 000380a0 RCX: 192001117edd
RDX: 1cc6 RSI: 0001 RDI: 8cddfa60
RBP: e633 R08:  R09: fbfff27147e0
R10: 938a3f07 R11:  R12: 0148
R13: 0200 R14: 88802c6380a0 R15: 88802c6380e0
FS:  7f774934e8c0() GS:88802c60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 566127e8 CR3: 48fe8000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554

Code disassembly (best guess):
   0:   c7 c7 60 9f 3a 8d   mov$0x8d3a9f60,%edi
   6:   e8 fa ca 81 ff  call   0xff81cb05
   b:   e8 d5 4e b2 
