Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor
On Tue, 2024-04-02 at 11:34 +0800, maobibo wrote:

> Are you sure that it's impossible to read some data used by the kernel
> internally?

Yes.

> There is another issue, since kernel restore T0-T7 registers and user
> space save T0-T7. Why T0-T7 is scratch registers rather than preserve
> registers like other architecture? What is the advantage if it is
> scratch registers?

I'd say "MIPS legacy."  Note that MIPS also does not preserve temp
registers, and MIPS does not have the "info leak" issue either (or it
should have been assigned a CVE, in all these years).

I do agree maybe it's time to move away from the MIPS legacy and be more
similar to RISC-V etc. now...  In glibc we can condition
__SYSCALL_CLOBBERS with #if __LINUX_KERNEL_VERSION > xxx to take
advantage of it.

Huacai, Xuerui, what do you think?

-- 
Xi Ruoyao
School of Aerospace Science and Technology, Xidian University
Re: [PATCH 7/9] mm: Free up PG_slab
On Sun, Mar 31, 2024 at 11:11:10PM +0800, kernel test robot wrote:
> kernel test robot noticed "UBSAN:shift-out-of-bounds_in_fs/proc/page.c" on:
>
> commit: 30e5296811312a13938b83956a55839ac1e3aa40 ("[PATCH 7/9] mm: Free up
> PG_slab")

Quite right.  Spotted another one while I was at it.  Not able to test
right now, but this should do the trick:

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 5bc82828c6aa..55b01535eb22 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -175,6 +175,8 @@ u64 stable_page_flags(const struct page *page)
 		u |= 1 << KPF_OFFLINE;
 	if (PageTable(page))
 		u |= 1 << KPF_PGTABLE;
+	if (folio_test_slab(folio))
+		u |= 1 << KPF_SLAB;
 
 #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
 	u |= kpf_copy_bit(k, KPF_IDLE, PG_idle);
@@ -184,7 +186,6 @@ u64 stable_page_flags(const struct page *page)
 #endif
 
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
-	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
 	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
 	u |= kpf_copy_bit(k, KPF_DIRTY,		PG_dirty);
 	u |= kpf_copy_bit(k, KPF_UPTODATE,	PG_uptodate);
diff --git a/tools/cgroup/memcg_slabinfo.py b/tools/cgroup/memcg_slabinfo.py
index 1d3a90d93fe2..270c28a0d098 100644
--- a/tools/cgroup/memcg_slabinfo.py
+++ b/tools/cgroup/memcg_slabinfo.py
@@ -146,12 +146,11 @@ def detect_kernel_config():
 
 
 def for_each_slab(prog):
-    PGSlab = 1 << prog.constant('PG_slab')
-    PGHead = 1 << prog.constant('PG_head')
+    PGSlab = ~prog.constant('PG_slab')
 
     for page in for_each_page(prog):
         try:
-            if page.flags.value_() & PGSlab:
+            if page.page_type.value_() == PGSlab:
                 yield cast('struct slab *', page)
         except FaultError:
             pass
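[Editor's note] The user-visible effect of the fs/proc/page.c hunk is nil: KPF_SLAB keeps its bit position in /proc/kpageflags, it is merely derived from the page type now instead of from PG_slab. A small standalone sketch of how such a flags word decodes — the bit numbers are taken from Documentation/admin-guide/mm/pagemap.rst, but the helper itself is ours, not kernel code:

```python
# Decode a /proc/kpageflags word the way pagemap tooling does.
# Bit numbers per Documentation/admin-guide/mm/pagemap.rst
# (KPF_SLAB is bit 7 and stays bit 7 after the patch above).
KPF = {
    0: "LOCKED",
    1: "ERROR",
    3: "UPTODATE",
    4: "DIRTY",
    7: "SLAB",
    10: "BUDDY",
}

def decode_kpf(word):
    """Return the names of the known KPF bits set in a kpageflags word."""
    return [name for bit, name in sorted(KPF.items()) if word & (1 << bit)]

# A dirty slab page reports both bits, regardless of how the kernel
# derives KPF_SLAB internally.
print(decode_kpf((1 << 7) | (1 << 4)))  # ['DIRTY', 'SLAB']
```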
Re: [PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging
On Mon, Apr 1, 2024 at 7:30 PM James Houghton wrote:
>
> Participate in bitmap-based aging while grabbing the KVM MMU lock for
> reading. Ideally we wouldn't need to grab this lock at all, but that
> would require a more intrustive and risky change.

                                 ^^ intrusive

This sounds subjective -- I'd just present the challenges and let
reviewers make their own judgements.

> Also pass
> KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in
> parallel with other walkers.
>
> It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable
> is destroyed while holding the lock for writing, and freeing of the page
> table pages is either done while holding the MMU lock for writing or
> after an RCU grace period.
>
> When mkold == false, record the young pages in the passed-in bitmap.
>
> When mkold == true, only age the pages that need aging according to the
> passed-in bitmap.
>
> Suggested-by: Yu Zhao

Thanks but I did not suggest this.  What I have in v2 is RCU based.  I
hope Oliver or someone else can help make that work.  Otherwise we can
just drop this for now and revisit later.  (I have no problems with this
patch if the Arm folks think the RCU-based version doesn't have a good
ROI.)
Re: [PATCH v5 2/3] arm64: dts: qcom: sc7280: Add UFS nodes for sc7280 soc
On Fri, Mar 22, 2024 at 08:59:12AM +0100, Luca Weiss wrote:
> On Mon Dec 4, 2023 at 6:28 PM CET, Manivannan Sadhasivam wrote:
> > On Mon, Dec 04, 2023 at 01:21:42PM +0100, Luca Weiss wrote:
> > > On Mon Dec 4, 2023 at 1:15 PM CET, Nitin Rawat wrote:
> > > >
> > > > On 12/4/2023 3:54 PM, Luca Weiss wrote:
> > > > > From: Nitin Rawat
> > > > >
> > > > > Add UFS host controller and PHY nodes for sc7280 soc.
> > > > >
> > > > > Signed-off-by: Nitin Rawat
> > > > > Reviewed-by: Konrad Dybcio
> > > > > Tested-by: Konrad Dybcio # QCM6490 FP5
> > > > > [luca: various cleanups and additions as written in the cover letter]
> > > > > Signed-off-by: Luca Weiss
> > > > > ---
> > > > >  arch/arm64/boot/dts/qcom/sc7280.dtsi | 74 +++-
> > > > >  1 file changed, 73 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > index 04bf85b0399a..8b08569f2191 100644
> > > > > --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include
> > > > >  #include
> > > > >  #include
> > > > > +#include
> > > > >  #include
> > > > >  #include
> > > > >  #include
> > > > > @@ -906,7 +907,7 @@ gcc: clock-controller@10 {
> > > > > 	clocks = < RPMH_CXO_CLK>,
> > > > > 		 < RPMH_CXO_CLK_A>, <_clk>,
> > > > > 		 <0>, <_phy>,
> > > > > -		 <0>, <0>, <0>,
> > > > > +		 <_mem_phy 0>, <_mem_phy 1>, <_mem_phy 2>,
> > > > > 		 <_1_qmpphy QMP_USB43DP_USB3_PIPE_CLK>;
> > > > > 	clock-names = "bi_tcxo", "bi_tcxo_ao", "sleep_clk",
> > > > > 		      "pcie_0_pipe_clk", "pcie_1_pipe_clk",
> > > > > @@ -2238,6 +2239,77 @@ pcie1_phy: phy@1c0e000 {
> > > > > 	status = "disabled";
> > > > > 	};
> > > > >
> > > > > +	ufs_mem_hc: ufs@1d84000 {
> > > > > +		compatible = "qcom,sc7280-ufshc", "qcom,ufshc",
> > > > > +			     "jedec,ufs-2.0";
> > > > > +		reg = <0x0 0x01d84000 0x0 0x3000>;
> > > > > +		interrupts = ;
> > > > > +		phys = <_mem_phy>;
> > > > > +		phy-names = "ufsphy";
> > > > > +		lanes-per-direction = <2>;
> > > > > +		#reset-cells = <1>;
> > > > > +		resets = < GCC_UFS_PHY_BCR>;
> > > > > +		reset-names = "rst";
> > > > > +
> > > > > +		power-domains = < GCC_UFS_PHY_GDSC>;
> > > > > +		required-opps = <_opp_nom>;
> > > > > +
> > > > > +		iommus = <_smmu 0x80 0x0>;
> > > > > +		dma-coherent;
> > > > > +
> > > > > +		interconnects = <_noc MASTER_UFS_MEM QCOM_ICC_TAG_ALWAYS
> > > > > +				 _virt SLAVE_EBI1 QCOM_ICC_TAG_ALWAYS>,
> > > > > +				<_noc MASTER_APPSS_PROC QCOM_ICC_TAG_ALWAYS
> > > > > +				 SLAVE_UFS_MEM_CFG QCOM_ICC_TAG_ALWAYS>;
> > > > > +		interconnect-names = "ufs-ddr", "cpu-ufs";
> > > > > +
> > > > > +		clocks = < GCC_UFS_PHY_AXI_CLK>,
> > > > > +			 < GCC_AGGRE_UFS_PHY_AXI_CLK>,
> > > > > +			 < GCC_UFS_PHY_AHB_CLK>,
> > > > > +			 < GCC_UFS_PHY_UNIPRO_CORE_CLK>,
> > > > > +			 < RPMH_CXO_CLK>,
> > > > > +			 < GCC_UFS_PHY_TX_SYMBOL_0_CLK>,
> > > > > +			 < GCC_UFS_PHY_RX_SYMBOL_0_CLK>,
> > > > > +			 < GCC_UFS_PHY_RX_SYMBOL_1_CLK>;
> > > > > +		clock-names = "core_clk",
> > > > > +			      "bus_aggr_clk",
> > > > > +			      "iface_clk",
> > > > > +			      "core_clk_unipro",
> > > > > +			      "ref_clk",
> > > > > +			      "tx_lane0_sync_clk",
> > > > > +			      "rx_lane0_sync_clk",
> > > > > +			      "rx_lane1_sync_clk";
> > > > > +		freq-table-hz =
> > > > > +			<7500 3>,
> > > > > +			<0 0>,
> > > > > +			<0 0>,
> > > > > +			<7500 3>,
> > > > > +			<0 0>,
> > > > > +			<0 0>,
> > > > > +			<0 0>,
> > > > > +
Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor
On 2024/4/2 10:49 AM, Xi Ruoyao wrote:
> On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:
> > > Sorry for the late reply, but I think it may be a bit non-constructive
> > > to repeatedly submit the same code without due explanation in our
> > > previous review threads. Let me try to recollect some of the details
> > > though...
> > Because your review comments about hypercall method is wrong, I need
> > not adopt it.
>
> Again it's unfair to say so considering the lack of LVZ documentation.
>
> /* snip */
>
> > 1. T0-T7 are scratch registers during SYSCALL ABI, this is what you
> > suggest, does there exist information leaking to user space from T0-T7
> > registers?
>
> It's not a problem.  When syscall returns RESTORE_ALL_AND_RET is invoked
> despite T0-T7 are not saved.  So a "junk" value will be read from the
> leading PT_SIZE bytes of the kernel stack for this thread.
>
> The leading PT_SIZE bytes of the kernel stack is dedicated for storing
> the struct pt_regs representing the reg file of the thread in the
> userspace.

Not all syscalls use the leading PT_SIZE bytes of the kernel stack. It
is complicated if a syscall is combined with interrupts and signals.

> Thus we may only read out the userspace T0-T7 value stored when the
> same thread was interrupted or trapped last time, or 0 (if the thread
> was never interrupted or trapped before).  And it's impossible to read
> some data used by the kernel internally, or some data of another
> thread.

Are you sure that it's impossible to read some data used by the kernel
internally?

Regards
Bibo Mao

> But indeed there is some improvement here.  Zeroing these registers
> seems cleaner than reading out the junk values, and also faster (move
> $t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's
> worthy to violate Huacai's "keep things simple" aspiration though.
[PATCH] livepatch: Add KLP_IDLE state
From: Wardenjohn

In livepatch, using KLP_UNDEFINED seems confusing. When the kernel is
ready, livepatch is ready too; that state is idle, not undefined. What's
more, once a livepatch transition is finished, the klp state should be
idle rather than undefined. Therefore, using KLP_IDLE in place of
KLP_UNDEFINED is much better for reading and understanding.
---
 include/linux/livepatch.h     |  1 +
 kernel/livepatch/patch.c      |  2 +-
 kernel/livepatch/transition.c | 24
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9b9b38e89563..c1c53cd5b227 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -19,6 +19,7 @@
 
 /* task patch states */
 #define KLP_UNDEFINED	-1
+#define KLP_IDLE	-1
 #define KLP_UNPATCHED	 0
 #define KLP_PATCHED	 1
 
diff --git a/kernel/livepatch/patch.c b/kernel/livepatch/patch.c
index 4152c71507e2..01d3219289ee 100644
--- a/kernel/livepatch/patch.c
+++ b/kernel/livepatch/patch.c
@@ -95,7 +95,7 @@ static void notrace klp_ftrace_handler(unsigned long ip,
 
 	patch_state = current->patch_state;
 
-	WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
+	WARN_ON_ONCE(patch_state == KLP_IDLE);
 
 	if (patch_state == KLP_UNPATCHED) {
 		/*
diff --git a/kernel/livepatch/transition.c b/kernel/livepatch/transition.c
index e54c3d60a904..73f8f98dba84 100644
--- a/kernel/livepatch/transition.c
+++ b/kernel/livepatch/transition.c
@@ -23,7 +23,7 @@ static DEFINE_PER_CPU(unsigned long[MAX_STACK_ENTRIES], klp_stack_entries);
 
 struct klp_patch *klp_transition_patch;
 
-static int klp_target_state = KLP_UNDEFINED;
+static int klp_target_state = KLP_IDLE;
 
 static unsigned int klp_signals_cnt;
 
@@ -123,21 +123,21 @@ static void klp_complete_transition(void)
 			klp_for_each_func(obj, func)
 				func->transition = false;
 
-	/* Prevent klp_ftrace_handler() from seeing KLP_UNDEFINED state */
+	/* Prevent klp_ftrace_handler() from seeing KLP_IDLE state */
 	if (klp_target_state == KLP_PATCHED)
 		klp_synchronize_transition();
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task) {
 		WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-		task->patch_state = KLP_UNDEFINED;
+		task->patch_state = KLP_IDLE;
 	}
 	read_unlock(&tasklist_lock);
 
 	for_each_possible_cpu(cpu) {
 		task = idle_task(cpu);
 		WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_PATCH_PENDING));
-		task->patch_state = KLP_UNDEFINED;
+		task->patch_state = KLP_IDLE;
 	}
 
 	klp_for_each_object(klp_transition_patch, obj) {
@@ -152,7 +152,7 @@ static void klp_complete_transition(void)
 	pr_notice("'%s': %s complete\n", klp_transition_patch->mod->name,
 		  klp_target_state == KLP_PATCHED ? "patching" : "unpatching");
 
-	klp_target_state = KLP_UNDEFINED;
+	klp_target_state = KLP_IDLE;
 	klp_transition_patch = NULL;
 }
 
@@ -455,7 +455,7 @@ void klp_try_complete_transition(void)
 	struct klp_patch *patch;
 	bool complete = true;
 
-	WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+	WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
 	/*
 	 * Try to switch the tasks to the target patch state by walking their
@@ -532,7 +532,7 @@ void klp_start_transition(void)
 	struct task_struct *g, *task;
 	unsigned int cpu;
 
-	WARN_ON_ONCE(klp_target_state == KLP_UNDEFINED);
+	WARN_ON_ONCE(klp_target_state == KLP_IDLE);
 
 	pr_notice("'%s': starting %s transition\n",
 		  klp_transition_patch->mod->name,
@@ -578,7 +578,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
 	struct klp_func *func;
 	int initial_state = !state;
 
-	WARN_ON_ONCE(klp_target_state != KLP_UNDEFINED);
+	WARN_ON_ONCE(klp_target_state != KLP_IDLE);
 
 	klp_transition_patch = patch;
 
@@ -597,7 +597,7 @@ void klp_init_transition(struct klp_patch *patch, int state)
 	 */
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, task) {
-		WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+		WARN_ON_ONCE(task->patch_state != KLP_IDLE);
 		task->patch_state = initial_state;
 	}
 	read_unlock(&tasklist_lock);
 
@@ -607,19 +607,19 @@ void klp_init_transition(struct klp_patch *patch, int state)
 	 */
 	for_each_possible_cpu(cpu) {
 		task = idle_task(cpu);
-		WARN_ON_ONCE(task->patch_state != KLP_UNDEFINED);
+		WARN_ON_ONCE(task->patch_state != KLP_IDLE);
 		task->patch_state = initial_state;
 	}
 
 	/*
	 * Enforce the order of the task->patch_state initializations and the
	 * func->transition updates to
Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor
On 2024/4/2 10:49 AM, Xi Ruoyao wrote:
> On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:
> > > Sorry for the late reply, but I think it may be a bit non-constructive
> > > to repeatedly submit the same code without due explanation in our
> > > previous review threads. Let me try to recollect some of the details
> > > though...
> > Because your review comments about hypercall method is wrong, I need
> > not adopt it.
>
> Again it's unfair to say so considering the lack of LVZ documentation.
>
> /* snip */
>
> > 1. T0-T7 are scratch registers during SYSCALL ABI, this is what you
> > suggest, does there exist information leaking to user space from T0-T7
> > registers?
>
> It's not a problem.  When syscall returns RESTORE_ALL_AND_RET is invoked
> despite T0-T7 are not saved.  So a "junk" value will be read from the
> leading PT_SIZE bytes of the kernel stack for this thread.

For you it is a "junk" value; some may think it is useful.

There is another issue: since the kernel restores T0-T7 registers and
user space saves T0-T7, why are T0-T7 scratch registers rather than
preserved registers like on other architectures? What is the advantage
of them being scratch registers?

Regards
Bibo Mao

> The leading PT_SIZE bytes of the kernel stack is dedicated for storing
> the struct pt_regs representing the reg file of the thread in the
> userspace.  Thus we may only read out the userspace T0-T7 value stored
> when the same thread was interrupted or trapped last time, or 0 (if the
> thread was never interrupted or trapped before).  And it's impossible
> to read some data used by the kernel internally, or some data of
> another thread.
>
> But indeed there is some improvement here.  Zeroing these registers
> seems cleaner than reading out the junk values, and also faster (move
> $t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's
> worthy to violate Huacai's "keep things simple" aspiration though.
Re: general protection fault in refill_obj_stock
On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> > On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> > > Hello.
> > > We are Ubisectech Sirius Team, the vulnerability lab of China
> > > ValiantSec. Recently, our team has discovered an issue in Linux
> > > kernel 6.7. Attached to the email were a PoC file of the issue.
> >
> > Thank you for the report!
> >
> > I tried to compile and run your test program for about half an hour
> > on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> > to reproduce the problem.
> >
> > Can you, please, share a bit more information? How long does it take
> > to reproduce? Do you mind sharing your kernel config? Is there
> > anything special about your setup? What are exact steps to reproduce
> > the problem? Is this problem reproducible on 6.6?
>
> Hi.
> The .config of Linux kernel 6.7 has been sent to you as an attachment.

Thanks!

How long does it take to reproduce the problem? Do you just start your
reproducer and wait?

> And the problem is reproducible on 6.6.

Hm, it rules out my recent changes. Did you try any older kernels? 6.5?
6.0?

Did you try to bisect the problem? If it's fast to reproduce, it might
be the best option.

Also, are you running vanilla kernels or do you have some custom changes
on top?

Thanks!
Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor
On Tue, 2024-04-02 at 09:43 +0800, maobibo wrote:
> > Sorry for the late reply, but I think it may be a bit non-constructive
> > to repeatedly submit the same code without due explanation in our
> > previous review threads. Let me try to recollect some of the details
> > though...
> Because your review comments about hypercall method is wrong, I need not
> adopt it.

Again it's unfair to say so considering the lack of LVZ documentation.

/* snip */

> 1. T0-T7 are scratch registers during SYSCALL ABI, this is what you
> suggest, does there exist information leaking to user space from T0-T7
> registers?

It's not a problem.  When syscall returns RESTORE_ALL_AND_RET is invoked
despite T0-T7 are not saved.  So a "junk" value will be read from the
leading PT_SIZE bytes of the kernel stack for this thread.

The leading PT_SIZE bytes of the kernel stack is dedicated for storing
the struct pt_regs representing the reg file of the thread in the
userspace.  Thus we may only read out the userspace T0-T7 value stored
when the same thread was interrupted or trapped last time, or 0 (if the
thread was never interrupted or trapped before).  And it's impossible to
read some data used by the kernel internally, or some data of another
thread.

But indeed there is some improvement here.  Zeroing these registers
seems cleaner than reading out the junk values, and also faster (move
$t0, $r0 is faster than ld.d $t0, $sp, PT_R12).  Not sure if it's worthy
to violate Huacai's "keep things simple" aspiration though.

-- 
Xi Ruoyao
School of Aerospace Science and Technology, Xidian University
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt wrote:
> >
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google) wrote:
> > >
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >
> > > > > We could have other options like PROVE_LOCKING enable it.
> > > >
> > > > Agreed (but it should say this is a debug option)
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?
> >
> > I think the help message should have "This is for debugging ftrace."
>
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates if it
is executed properly. In other words, I never want to be asked "How can
I use this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

that makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a 1
in 100,000,000,000 chance of causing an actual bug, as the race window
is extremely small. Now if it is also called outside of instrumentation,
that will likely trigger other warnings even without this code, and this
will not be needed to debug that.

ftrace has all sorts of "verifiers" that are used to make sure things
are working properly. And yes, you can consider that "debugging". But I
would not consider this an option to enable if ftrace was broken, and
you are looking into why it is broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if
they are, it can cause a race. But it also has a noticeable overhead
when enabled.

-- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google) wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> > -- Steve
>
> --
> Masami Hiramatsu (Google)
[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional
Introduce a CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level debugging of the ftrace subsystem. For most users it should
probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Paul E. McKenney
Signed-off-by: Andrii Nakryiko
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig            | 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)					\
 	({								\
 		bool __ret = !rcu_is_watching();			\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
 	  This file can be reset, but the limit can not change in
 	  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+	bool "Validate RCU is on during ftrace recursion check"
+	depends on FUNCTION_TRACER
+	depends on ARCH_WANTS_NO_INSTR
+	help
+	  All callbacks that attach to the function tracing have some sort
+	  of protection against recursion. This option performs additional
+	  checks to make sure RCU is on when ftrace callbacks recurse.
+
+	  This is a feature useful for debugging ftrace. This will add more
+	  overhead to all ftrace-based invocations.
+
+	  If unsure, say N.
+
 config RING_BUFFER_RECORD_RECURSION
 	bool "Record functions that recurse in the ring buffer"
 	depends on FTRACE_RECORD_RECURSION
-- 
2.43.0
Re: [PATCH v7 3/7] LoongArch: KVM: Add cpucfg area for kvm hypervisor
On 2024/3/24 3:02 AM, WANG Xuerui wrote:
> On 3/15/24 16:07, Bibo Mao wrote:
> > Instruction cpucfg can be used to get processor features. And there
> > is trap exception when it is executed in VM mode, and also it is to
> > provide cpu features to VM. On real hardware cpucfg area 0 - 20 is
> > used. Here one specified area 0x4000 -- 0x40ff is used for KVM
> > hypervisor to provide PV features, and the area can be extended for
> > other hypervisors in future. This area will never be used for real
> > HW, it is only used by software.
> >
> > Signed-off-by: Bibo Mao
> > ---
> >  arch/loongarch/include/asm/inst.h      |  1 +
> >  arch/loongarch/include/asm/loongarch.h | 10 +
> >  arch/loongarch/kvm/exit.c              | 59 +++---
> >  3 files changed, 54 insertions(+), 16 deletions(-)
>
> Sorry for the late reply, but I think it may be a bit non-constructive
> to repeatedly submit the same code without due explanation in our
> previous review threads. Let me try to recollect some of the details
> though...

Because your review comments about the hypercall method are wrong, I
need not adopt them.

> If I remember correctly, during the previous reviews, it was mentioned
> that the only upsides of using CPUCFG were:
>
> - it was exactly identical to the x86 approach,
> - it would not require access to the LoongArch Reference Manual
>   Volume 3 to use, and
> - it was plain old data.
>
> But, for the first point, we don't have to follow x86 convention after
> all.

x86 virtualization has been successfully and widely applied in our lives
and products. It is normal to follow it if there are no obvious issues.

> The second reason might be compelling, but on the one hand that's
> another problem orthogonal to the current one, and on the other hand
> HVCL is:
>
> - already effectively public because of the fact that this very
>   patchset is public,
> - its semantics is trivial to implement even without access to the LVZ
>   manual, because of its striking similarity with SYSCALL, and
> - by being a function call, we reserve the possibility for hypervisors
>   to invoke logic for self-identification purposes, even if this is
>   likely overkill from today's perspective.
>
> And, even if we decide that using HVCL for self-identification is
> overkill after all, we still have another choice that's IOCSR. We
> already read LOONGARCH_IOCSR_FEATURES (0x8) for its bit 11 (IOCSRF_VM)
> to populate the CPU_FEATURE_HYPERVISOR bit, and it's only natural that
> we put the identification word in the IOCSR space. As far as I can
> see, the IOCSR space is plenty and equally available for making
> reservations; it can only be even easier when it's done by a Loongson
> team.

The IOCSR method is possible also; in chip design CPUCFG is used for CPU
features and IOCSR is for device features. Here the CPUCFG method is
selected. I am the KVM LoongArch maintainer and I can decide to select
methods if the method works well. Is that right? If you are interested
in KVM LoongArch, you can submit more patches and become a maintainer,
or write new hypervisor support such as Xen/Xvisor etc. and use your
method.

Also, if you are interested in the Linux kernel, there are some issues.
Can you help to improve them?

1. T0-T7 are scratch registers during the SYSCALL ABI; this is what you
   suggested. Does there exist information leaking to user space from
   the T0-T7 registers?

2. LoongArch KVM depends on AS_HAS_LVZ_EXTENSION, which requires the
   latest binutils. This is also what you suggested. Some kernel
   developers do not have the latest binutils, so when common KVM code
   is modified, LoongArch KVM fails to compile. But they cannot find
   this out, since their LoongArch cross-compiler is old and LoongArch
   KVM is disabled. This issue can be found at
   https://lkml.org/lkml/2023/11/15/828.

Regards
Bibo Mao

> Finally, I've mentioned multiple times, that varying CPUCFG behavior
> based on PLV is not something well documented on the manuals, hence
> not friendly to low-level developers. Devs of third-party firmware
> and/or kernels do exist, I've personally spoken to some of them on the
> 2023-11-18 3A6000 release event; in order for the varying CPUCFG
> behavior approach to pass for me, at the very least, the LoongArch
> reference manual must be amended to explicitly include an explanation
> of it, and a reference to potential use cases.
[PATCH v2] selftests/sgx: Improve cgroup test scripts
Make cgroup test scripts ash compatible. Remove cg-tools dependency. Add documentation for functions. Tested with busybox on Ubuntu. Signed-off-by: Haitao Huang --- v2: - Fixes for v2 cgroup - Turn off swapping before memcontrol tests and back on after - Add comments and reformat --- tools/testing/selftests/sgx/ash_cgexec.sh | 57 ++ .../selftests/sgx/run_epc_cg_selftests.sh | 187 +++--- .../selftests/sgx/watch_misc_for_tests.sh | 13 +- 3 files changed, 179 insertions(+), 78 deletions(-) create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh diff --git a/tools/testing/selftests/sgx/ash_cgexec.sh b/tools/testing/selftests/sgx/ash_cgexec.sh new file mode 100755 index ..9607784378df --- /dev/null +++ b/tools/testing/selftests/sgx/ash_cgexec.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env sh +# SPDX-License-Identifier: GPL-2.0 +# Copyright(c) 2024 Intel Corporation. + +# Move the current shell process to the specified cgroup +# Arguments: +# $1 - The cgroup controller name, e.g., misc, memory. +# $2 - The path of the cgroup, +# relative to /sys/fs/cgroup for cgroup v2, +# relative to /sys/fs/cgroup/$1 for v1. +move_to_cgroup() { +controllers="$1" +path="$2" + +# Check if cgroup v2 is in use +if [ ! -d "/sys/fs/cgroup/misc" ]; then +# Cgroup v2 logic +cgroup_full_path="/sys/fs/cgroup/${path}" +echo $$ > "${cgroup_full_path}/cgroup.procs" +else +# Cgroup v1 logic +OLD_IFS="$IFS" +IFS=',' +for controller in $controllers; do +cgroup_full_path="/sys/fs/cgroup/${controller}/${path}" +echo $$ > "${cgroup_full_path}/tasks" +done +IFS="$OLD_IFS" +fi +} + +if [ "$#" -lt 3 ] || [ "$1" != "-g" ]; then +echo "Usage: $0 -g [-g ...] 
[args...]" +exit 1 +fi + +while [ "$#" -gt 0 ]; do +case "$1" in +-g) +# Ensure that a controller:path pair is provided after -g +if [ -z "$2" ]; then +echo "Error: Missing controller:path argument after -g" +exit 1 +fi +IFS=':' read CONTROLLERS CGROUP_PATH < $CG_MISC_ROOT/cgroup.subtree_control +echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control +echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control +echo "+memory" > $CG_MEM_ROOT/$TEST_ROOT_CG/cgroup.subtree_control +echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control +fi CAPACITY=$(grep "sgx_epc" "$CG_MISC_ROOT/misc.capacity" | awk '{print $2}') # This is below number of VA pages needed for enclave of capacity size. So @@ -48,34 +51,67 @@ echo "sgx_epc $SMALL" > $CG_MISC_ROOT/$TEST_CG_SUB1/misc.max echo "sgx_epc $LARGE" > $CG_MISC_ROOT/$TEST_CG_SUB2/misc.max echo "sgx_epc $LARGER" > $CG_MISC_ROOT/$TEST_CG_SUB4/misc.max +if [ $? -ne 0 ]; then +echo "# Failed setting up misc limits, make sure misc cgroup is mounted." +exit 1 +fi + +clean_up_misc() +{ +sleep 2 +rmdir $CG_MISC_ROOT/$TEST_CG_SUB2 +rmdir $CG_MISC_ROOT/$TEST_CG_SUB3 +rmdir $CG_MISC_ROOT/$TEST_CG_SUB4 +rmdir $CG_MISC_ROOT/$TEST_CG_SUB1 +rmdir $CG_MISC_ROOT/$TEST_ROOT_CG +} + timestamp=$(date +%Y%m%d_%H%M%S) test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed" +# Wait for a process and check for expected exit status. +# +# Arguments: +# $1 - the pid of the process to wait and check. +# $2 - 1 if expecting success, 0 for failure. +# +# Return: +# 0 if the exit status of the process matches the expectation. +# 1 otherwise. wait_check_process_status() { -local pid=$1 -local check_for_success=$2 # If 1, check for success; -# If 0, check for failure +pid=$1 +check_for_success=$2 # If 1, check for success; + # If 0, check for failure wait "$pid" -local status=$? +status=$? -if [[ $check_for_success -eq 1 && $status -eq 0 ]]; then +if [ $check_for_success -eq 1 ] && [ $status -eq 0 ]; then echo "# Process $pid succeeded." 
return 0 -elif [[ $check_for_success -eq 0 && $status -ne 0 ]]; then +elif [ $check_for_success -eq 0 ] && [ $status -ne 0 ]; then echo "# Process $pid returned failure." return 0 fi return 1 } +# Wait for a set of processes and check for expected exit status +# +# Arguments: +# $1 - 1 if expecting success, 0 for failure. +# remaining args - The pids of the processes +# +# Return: +# 0 if exit status of any process matches the expectation. +# 1 otherwise. wait_and_detect_for_any() { -local pids=("$@") -local check_for_success=$1 # If 1, check for success; -# If 0, check for failure -local detected=1 # 0 for success detection +check_for_success=$1 # If 1, check for success; + # If 0, check for failure +shift +
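The `wait_check_process_status` helper in the patch relies only on the POSIX `wait` built-in and `$?`, which is why it works under busybox ash. A condensed, standalone sketch of the same pattern (the function body mirrors the patch; the driver lines at the bottom are just a demonstration, not part of the selftest):

```shell
#!/usr/bin/env sh
# Wait for a pid and compare its exit status to an expectation:
# $1 - pid, $2 - 1 if expecting success, 0 if expecting failure.
wait_check_process_status() {
    pid=$1
    check_for_success=$2
    wait "$pid"
    status=$?
    if [ "$check_for_success" -eq 1 ] && [ "$status" -eq 0 ]; then
        echo "# Process $pid succeeded."
        return 0
    elif [ "$check_for_success" -eq 0 ] && [ "$status" -ne 0 ]; then
        echo "# Process $pid returned failure."
        return 0
    fi
    return 1
}

true &
wait_check_process_status $! 1 || echo "unexpected"
( exit 3 ) &
wait_check_process_status $! 0 || echo "unexpected"
echo done
```

Note that `wait` on an already-exited child still returns its status, so there is no race between the background job finishing and the check.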
Re: [PATCH v7 7/7] Documentation: KVM: Add hypercall for LoongArch
On 2024/3/24 2:40 AM, WANG Xuerui wrote:
On 3/15/24 16:11, Bibo Mao wrote:
[snip]
+KVM hypercall ABI
+=================
+
+Hypercall ABI on KVM is simple, only one scratch register a0 and at most
+five generic registers used as input parameter. FP register and vector register
+is not used for input register and should not be modified during hypercall.
+Hypercall function can be inlined since there is only one scratch register.

Maybe it's better to describe the list of preserved registers with an expression such as "all non-GPR registers shall remain unmodified after returning from the hypercall", to guard ourselves against future ISA state additions.

Sorry, I do not understand. What is the meaning of "all non-GPR registers"? Can you give an example?

Regards
Bibo Mao

But I still maintain that it's better to promise less here, and only hint at the extensive preservation of context as an implementation detail. That way we don't lose our ability to save/restore less in the future, should we decide to do so.
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 12:09:18 -0400 Steven Rostedt wrote: > On Mon, 1 Apr 2024 20:25:52 +0900 > Masami Hiramatsu (Google) wrote: > > > > Masami, > > > > > > Are you OK with just keeping it set to N. > > > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > > > > We could have other options like PROVE_LOCKING enable it. > > > > Agreed (but it should say this is a debug option) > > It does say "Validate" which to me is a debug option. What would you > suggest? I think the help message should have "This is for debugging ftrace." Thank you, > > -- Steve -- Masami Hiramatsu (Google)
[PATCH v10 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL 1.1 type 3 memory, as normal DRAM when they are emulated as normal memory (E820_TYPE_RAM). However, these emulated devices have different characteristics than traditional DRAM, making it important to distinguish them. Thus, we modify the tiered memory initialization process to introduce a delay specifically for CPUless NUMA nodes. This delay ensures that the memory tier initialization for these nodes is deferred until HMAT information is obtained during the boot process. Finally, demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing online memory nodes and configuring memory tiers. They should be excluded from the late init.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information. If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up trying to acquire the same lock, leading to a potential deadlock. Therefore, we propose introducing a standalone `default_dram_perf_lock` to protect `default_dram_perf_*`. This approach not only avoids the deadlock but also avoids holding one large lock for the whole operation.

* Upgrade `set_node_memory_tier` to support additional cases, including default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to handle cases where memtype is not initialized and where HMAT information is available.
* Introduce `default_memory_types` for those memory types that are not initialized by device drivers. Because late initialized memory and default DRAM memory need to be managed, a default memory type is created for storing all memory types that are not initialized by device drivers and as a fallback. Signed-off-by: Ho-Ren (Jack) Chuang Signed-off-by: Hao Xiang Reviewed-by: "Huang, Ying" --- include/linux/memory-tiers.h | 5 +- mm/memory-tiers.c| 95 +--- 2 files changed, 81 insertions(+), 19 deletions(-) diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index a44c03c2ba3a..16769552a338 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -140,12 +140,13 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis return -EIO; } -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types) +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, + struct list_head *memory_types) { return NULL; } -void mt_put_memory_types(struct list_head *memory_types) +static inline void mt_put_memory_types(struct list_head *memory_types) { } diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 974af10cfdd8..44fa10980d37 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -36,6 +36,11 @@ struct node_memory_type_map { static DEFINE_MUTEX(memory_tier_lock); static LIST_HEAD(memory_tiers); +/* + * The list is used to store all memory types that are not created + * by a device driver. + */ +static LIST_HEAD(default_memory_types); static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; struct memory_dev_type *default_dram_type; @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly; static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); +/* The lock is used to protect `default_dram_perf*` info and nid. 
*/ +static DEFINE_MUTEX(default_dram_perf_lock); static bool default_dram_perf_error; static struct access_coordinate default_dram_perf; static int default_dram_perf_ref_nid = NUMA_NO_NODE; @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem static struct memory_tier *set_node_memory_tier(int node) { struct memory_tier *memtier; - struct memory_dev_type *memtype; + struct memory_dev_type *mtype = default_dram_type; + int adist = MEMTIER_ADISTANCE_DRAM; pg_data_t *pgdat = NODE_DATA(node); @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node) if (!node_state(node, N_MEMORY)) return ERR_PTR(-EINVAL); - __init_node_memory_type(node, default_dram_type); + mt_calc_adistance(node, ); + if (node_memory_types[node].memtype == NULL) { + mtype = mt_find_alloc_memory_type(adist, _memory_types); + if (IS_ERR(mtype)) { + mtype =
[PATCH v10 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Since different memory devices require finding, allocating, and putting memory types, these common steps are abstracted in this patch, enhancing the scalability and conciseness of the code. Signed-off-by: Ho-Ren (Jack) Chuang Reviewed-by: "Huang, Ying" --- drivers/dax/kmem.c | 20 ++-- include/linux/memory-tiers.h | 13 + mm/memory-tiers.c| 32 3 files changed, 47 insertions(+), 18 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 42ee360cf4e3..01399e5b53b2 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types); static struct memory_dev_type *kmem_find_alloc_memory_type(int adist) { - bool found = false; struct memory_dev_type *mtype; mutex_lock(_memory_type_lock); - list_for_each_entry(mtype, _memory_types, list) { - if (mtype->adistance == adist) { - found = true; - break; - } - } - if (!found) { - mtype = alloc_memory_type(adist); - if (!IS_ERR(mtype)) - list_add(>list, _memory_types); - } + mtype = mt_find_alloc_memory_type(adist, _memory_types); mutex_unlock(_memory_type_lock); return mtype; @@ -77,13 +66,8 @@ static struct memory_dev_type *kmem_find_alloc_memory_type(int adist) static void kmem_put_memory_types(void) { - struct memory_dev_type *mtype, *mtn; - mutex_lock(_memory_type_lock); - list_for_each_entry_safe(mtype, mtn, _memory_types, list) { - list_del(>list); - put_memory_type(mtype); - } + mt_put_memory_types(_memory_types); mutex_unlock(_memory_type_lock); } diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h index 69e781900082..a44c03c2ba3a 100644 --- a/include/linux/memory-tiers.h +++ b/include/linux/memory-tiers.h @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist); int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, const char *source); int mt_perf_to_adistance(struct access_coordinate *perf, int *adist); +struct memory_dev_type *mt_find_alloc_memory_type(int adist, + struct list_head *memory_types); +void 
mt_put_memory_types(struct list_head *memory_types); #ifdef CONFIG_MIGRATION int next_demotion_node(int node); void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis { return -EIO; } + +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types) +{ + return NULL; +} + +void mt_put_memory_types(struct list_head *memory_types) +{ + +} #endif /* CONFIG_NUMA */ #endif /* _LINUX_MEMORY_TIERS_H */ diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c index 0537664620e5..974af10cfdd8 100644 --- a/mm/memory-tiers.c +++ b/mm/memory-tiers.c @@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype) } EXPORT_SYMBOL_GPL(clear_node_memory_type); +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types) +{ + bool found = false; + struct memory_dev_type *mtype; + + list_for_each_entry(mtype, memory_types, list) { + if (mtype->adistance == adist) { + found = true; + break; + } + } + if (!found) { + mtype = alloc_memory_type(adist); + if (!IS_ERR(mtype)) + list_add(>list, memory_types); + } + + return mtype; +} +EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type); + +void mt_put_memory_types(struct list_head *memory_types) +{ + struct memory_dev_type *mtype, *mtn; + + list_for_each_entry_safe(mtype, mtn, memory_types, list) { + list_del(>list); + put_memory_type(mtype); + } +} +EXPORT_SYMBOL_GPL(mt_put_memory_types); + static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix) { pr_info( -- Ho-Ren (Jack) Chuang
[PATCH v10 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes
When a memory device, such as CXL 1.1 type 3 memory, is emulated as normal memory (E820_TYPE_RAM), the memory device is indistinguishable from normal DRAM in terms of memory tiering with the current implementation. The current memory tiering assigns all detected normal memory nodes to the same DRAM tier. This results in normal memory devices with different attributes being unable to be assigned to the correct memory tier, leading to the inability to migrate pages between different types of memory.

https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization of memory tiers for CPUless NUMA nodes until they obtain HMAT information and after all devices are initialized at boot time, eliminating the need for user intervention. If no HMAT is specified, it falls back to using `default_dram_type`.

Example use case: We have CXL memory on the host, and we create VMs with a new system memory device backed by host CXL memory. We inject CXL memory performance attributes through QEMU, and the guest now sees memory nodes with performance attributes in HMAT. With this change, we enable the guest kernel to construct the correct memory tiering for the memory nodes.

- v10: Thanks to Andrew's and SeongJae's comments,
  * Address kunit compilation errors
  * Resolve the bug of not returning the correct error code in `mt_perf_to_adistance`
- v9:
  * Address corner cases in `memory_tier_late_init`. Thanks to Ying's comments.
  * https://lore.kernel.org/lkml/20240329053353.309557-1-horenchu...@bytedance.com/T/#u
- v8:
  * Fix email format
  * https://lore.kernel.org/lkml/20240329004815.195476-1-horenchu...@bytedance.com/T/#u
- v7:
  * Add Reviewed-by: "Huang, Ying"
- v6: Thanks to Ying's comments,
  * Move `default_dram_perf_lock` to the function's beginning for clarity
  * Fix double unlocking at v5
  * https://lore.kernel.org/lkml/20240327072729.3381685-1-horenchu...@bytedance.com/T/#u
- v5: Thanks to Ying's comments,
  * Add comments about what is protected by `default_dram_perf_lock`
  * Fix an uninitialized pointer mtype
  * Slightly shorten the time holding `default_dram_perf_lock`
  * Fix a deadlock bug in `mt_perf_to_adistance`
  * https://lore.kernel.org/lkml/20240327041646.3258110-1-horenchu...@bytedance.com/T/#u
- v4: Thanks to Ying's comments,
  * Remove redundant code
  * Reorganize patches accordingly
  * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
- v3: Thanks to Ying's comments,
  * Make the newly added code independent of HMAT
  * Upgrade set_node_memory_tier to support more cases
  * Put all non-driver-initialized memory types into default_memory_types instead of using hmat_memory_types
  * find_alloc_memory_type -> mt_find_alloc_memory_type
  * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
- v2: Thanks to Ying's comments,
  * Rewrite cover letter & patch description
  * Rename functions, don't use _hmat
  * Abstract common functions into find_alloc_memory_type()
  * Use the expected way to use set_node_memory_tier instead of modifying it
  * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
- v1:
  * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c | 20 +-
include/linux/memory-tiers.h | 14 mm/memory-tiers.c| 127 ++- 3 files changed, 126 insertions(+), 35 deletions(-) -- Ho-Ren (Jack) Chuang
Re: general protection fault in refill_obj_stock
On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> Hello. We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. Recently, our team has discovered a issue in Linux kernel 6.7. Attached to the email were a PoC file of the issue.

Thank you for the report!

I tried to compile and run your test program for about half an hour on a virtual machine running 6.7 with KASAN enabled, but wasn't able to reproduce the problem. Can you please share a bit more information? How long does it take to reproduce? Do you mind sharing your kernel config? Is there anything special about your setup? What are the exact steps to reproduce the problem? Is this problem reproducible on 6.6?

It's interesting that the problem looks like a use-after-free on the objcg pointer, but happens in the context of udev-systemd, which I believe should be fairly stable, and its cgroup is not going anywhere.

Thanks!
Re: [PATCH 13/13] mailbox: omap: Remove kernel FIFO message queuing
On 4/1/24 6:39 PM, Hari Nagalla wrote:
> On 3/25/24 12:20, Andrew Davis wrote:
>> The kernel FIFO queue has a couple issues. The biggest issue is that it causes extra latency in a path that can be used in real-time tasks, such as communication with real-time remote processors. The whole FIFO idea itself looks to be a leftover from before the unified mailbox framework. The current mailbox framework expects mbox_chan_received_data() to be called with data immediately as it arrives. Remove the FIFO and pass the messages to the mailbox framework directly.
> Yes, this would definitely speed up the message receive path. With RT Linux, the irq runs in thread context, so that is OK. But with non-RT, the whole receive path runs in interrupt context. So, I think it would be appropriate to use a threaded_irq()?

I was thinking the same at first, but it seems some mailbox drivers use threaded and others use non-threaded context. Since all we do in the IRQ context anymore is call mbox_chan_received_data(), which is supposed to be IRQ-safe, it should be fine either way. So for now I just kept this using the regular IRQ context as before. If that does turn out to be an issue, then let's switch to threaded.

Andrew
Re: [PATCH 12/13] mailbox: omap: Reverse FIFO busy check logic
On 4/1/24 6:31 PM, Hari Nagalla wrote:
> On 3/25/24 12:20, Andrew Davis wrote:
>> static int omap_mbox_chan_send_noirq(struct omap_mbox *mbox, u32 msg)
>> {
>> -	int ret = -EBUSY;
>> +	if (mbox_fifo_full(mbox))
>> +		return -EBUSY;
>>
>> -	if (!mbox_fifo_full(mbox)) {
>> -		omap_mbox_enable_irq(mbox, IRQ_RX);
>> -		mbox_fifo_write(mbox, msg);
>> -		ret = 0;
>> -		omap_mbox_disable_irq(mbox, IRQ_RX);
>> +	omap_mbox_enable_irq(mbox, IRQ_RX);
>> +	mbox_fifo_write(mbox, msg);
>> +	omap_mbox_disable_irq(mbox, IRQ_RX);
>>
>> -	/* we must read and ack the interrupt directly from here */
>> -	mbox_fifo_read(mbox);
>> -	ack_mbox_irq(mbox, IRQ_RX);
>> -	}
>> +	/* we must read and ack the interrupt directly from here */
>> +	mbox_fifo_read(mbox);
>> +	ack_mbox_irq(mbox, IRQ_RX);
>>
>> -	return ret;
>> +	return 0;
>> }
> Isn't the interrupt supposed to be IRQ_TX above? i.e. the TX ready signal?

Hmm, could be, but this patch doesn't actually change anything, it only moves code around for readability. So if we are ack'ing the wrong interrupt, then it was wrong before. We should check that and fix it if needed in a follow-up patch.

Andrew
Re: [PATCH 13/13] mailbox: omap: Remove kernel FIFO message queuing
On 3/25/24 12:20, Andrew Davis wrote:
> The kernel FIFO queue has a couple issues. The biggest issue is that it causes extra latency in a path that can be used in real-time tasks, such as communication with real-time remote processors. The whole FIFO idea itself looks to be a leftover from before the unified mailbox framework. The current mailbox framework expects mbox_chan_received_data() to be called with data immediately as it arrives. Remove the FIFO and pass the messages to the mailbox framework directly.

Yes, this would definitely speed up the message receive path. With RT Linux, the irq runs in thread context, so that is OK. But with non-RT, the whole receive path runs in interrupt context. So, I think it would be appropriate to use a threaded_irq()?
Re: [PATCH 12/13] mailbox: omap: Reverse FIFO busy check logic
On 3/25/24 12:20, Andrew Davis wrote:
> static int omap_mbox_chan_send_noirq(struct omap_mbox *mbox, u32 msg)
> {
> -	int ret = -EBUSY;
> +	if (mbox_fifo_full(mbox))
> +		return -EBUSY;
>
> -	if (!mbox_fifo_full(mbox)) {
> -		omap_mbox_enable_irq(mbox, IRQ_RX);
> -		mbox_fifo_write(mbox, msg);
> -		ret = 0;
> -		omap_mbox_disable_irq(mbox, IRQ_RX);
> +	omap_mbox_enable_irq(mbox, IRQ_RX);
> +	mbox_fifo_write(mbox, msg);
> +	omap_mbox_disable_irq(mbox, IRQ_RX);
>
> -	/* we must read and ack the interrupt directly from here */
> -	mbox_fifo_read(mbox);
> -	ack_mbox_irq(mbox, IRQ_RX);
> -	}
> +	/* we must read and ack the interrupt directly from here */
> +	mbox_fifo_read(mbox);
> +	ack_mbox_irq(mbox, IRQ_RX);
>
> -	return ret;
> +	return 0;
> }

Isn't the interrupt supposed to be IRQ_TX above? i.e. the TX ready signal?
[PATCH v3 7/7] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
From: Yu Zhao Use mmu_notifier_{test,clear}_young_bitmap() to handle KVM PTEs in batches when the fast path is supported. This reduces the contention on kvm->mmu_lock when the host is under heavy memory pressure. An existing selftest can quickly demonstrate the effectiveness of this patch. On a generic workstation equipped with 128 CPUs and 256GB DRAM: $ sudo max_guest_memory_test -c 64 -m 250 -s 250 MGLRU run2 -- Before [1]~64s After ~51s kswapd (MGLRU before) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.71% evict_folios 97.29% shrink_folio_list ==>> 13.05% folio_referenced 12.83% rmap_walk_file 12.31% folio_referenced_one 7.90% __mmu_notifier_clear_young 7.72% kvm_mmu_notifier_clear_young 7.34% _raw_write_lock kswapd (MGLRU after) 100.00% balance_pgdat 100.00% shrink_node 100.00% shrink_one 99.99% try_to_shrink_lruvec 99.59% evict_folios 80.37% shrink_folio_list ==>> 3.74% folio_referenced 3.59% rmap_walk_file 3.19% folio_referenced_one 2.53% lru_gen_look_around 1.06% __mmu_notifier_test_clear_young [1] "mm: rmap: Don't flush TLB after checking PTE young for page reference" was included so that the comparison is apples to apples. https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/ Signed-off-by: Yu Zhao Signed-off-by: James Houghton --- Documentation/admin-guide/mm/multigen_lru.rst | 6 +- include/linux/mmzone.h| 6 +- mm/rmap.c | 9 +- mm/vmscan.c | 183 ++ 4 files changed, 159 insertions(+), 45 deletions(-) diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..0ae2a6d4d94c 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -48,6 +48,10 @@ Values Components verified on x86 varieties other than Intel and AMD. If it is disabled, the multi-gen LRU will suffer a negligible performance degradation. 
+0x0008 Clearing the accessed bit in KVM page table entries in large + batches, when KVM MMU sets it (e.g., on x86_64). This can + improve the performance of guests when the host is under memory + pressure. [yYnN] Apply to all the components above. == === @@ -56,7 +60,7 @@ E.g., echo y >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled -0x0007 +0x000f echo 5 >/sys/kernel/mm/lru_gen/enabled cat /sys/kernel/mm/lru_gen/enabled 0x0005 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c11b7cde81ef..a98de5106990 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -397,6 +397,7 @@ enum { LRU_GEN_CORE, LRU_GEN_MM_WALK, LRU_GEN_NONLEAF_YOUNG, + LRU_GEN_KVM_MMU_WALK, NR_LRU_GEN_CAPS }; @@ -554,7 +555,7 @@ struct lru_gen_memcg { void lru_gen_init_pgdat(struct pglist_data *pgdat); void lru_gen_init_lruvec(struct lruvec *lruvec); -void lru_gen_look_around(struct page_vma_mapped_walk *pvmw); +bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw); void lru_gen_init_memcg(struct mem_cgroup *memcg); void lru_gen_exit_memcg(struct mem_cgroup *memcg); @@ -573,8 +574,9 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec) { } -static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw) +static inline bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { + return false; } static inline void lru_gen_init_memcg(struct mem_cgroup *memcg) diff --git a/mm/rmap.c b/mm/rmap.c index 56b313aa2ebf..41e9fc25684e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -871,13 +871,10 @@ static bool folio_referenced_one(struct folio *folio, continue; } - if (pvmw.pte) { - if (lru_gen_enabled() && - pte_young(ptep_get(pvmw.pte))) { - lru_gen_look_around(); + if (lru_gen_enabled() && pvmw.pte) { + if (lru_gen_look_around()) referenced++; - } - + } else if (pvmw.pte) { if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) referenced++; diff --git a/mm/vmscan.c b/mm/vmscan.c index 293120fe54f3..fd65f3466dfc
[PATCH v3 6/7] KVM: arm64: Participate in bitmap-based PTE aging
Participate in bitmap-based aging while grabbing the KVM MMU lock for reading. Ideally we wouldn't need to grab this lock at all, but that would require a more intrusive and risky change. Also pass KVM_PGTABLE_WALK_SHARED, as this software walker is safe to run in parallel with other walkers.

It is safe only to grab the KVM MMU lock for reading as the kvm_pgtable is destroyed while holding the lock for writing, and freeing of the page table pages is either done while holding the MMU lock for writing or after an RCU grace period.

When mkold == false, record the young pages in the passed-in bitmap. When mkold == true, only age the pages that need aging according to the passed-in bitmap.

Suggested-by: Yu Zhao
Signed-off-by: James Houghton
---
 arch/arm64/include/asm/kvm_host.h    | 5 +
 arch/arm64/include/asm/kvm_pgtable.h | 4 +++-
 arch/arm64/kvm/hyp/pgtable.c         | 21 ++---
 arch/arm64/kvm/mmu.c                 | 23 +--
 4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 9e8a496fb284..e503553cb356 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1331,4 +1331,9 @@ bool kvm_arm_vcpu_stopped(struct kvm_vcpu *vcpu);
 (get_idreg_field((kvm), id, fld) >= expand_field_sign(id, fld, min) && \
 get_idreg_field((kvm), id, fld) <= expand_field_sign(id, fld, max))
+#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
+bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn);
+#define kvm_arch_finish_bitmap_age kvm_arch_finish_bitmap_age
+void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn);
+
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 19278dfe7978..1976b4e26188 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -644,6 +644,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr);
 * @addr: Intermediate physical
address to identify the page-table entry. * @size: Size of the address range to visit. * @mkold: True if the access flag should be cleared. + * @range: The kvm_gfn_range that is being used for the memslot walker. * * The offset of @addr within a page is ignored. * @@ -657,7 +658,8 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr); * Return: True if any of the visited PTEs had the access flag set. */ bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, -u64 size, bool mkold); +u64 size, bool mkold, +struct kvm_gfn_range *range); /** * kvm_pgtable_stage2_relax_perms() - Relax the permissions enforced by a diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index 3fae5830f8d2..e881d3595aca 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -1281,6 +1281,7 @@ kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr) } struct stage2_age_data { + struct kvm_gfn_range *range; boolmkold; boolyoung; }; @@ -1290,20 +1291,24 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx, { kvm_pte_t new = ctx->old & ~KVM_PTE_LEAF_ATTR_LO_S2_AF; struct stage2_age_data *data = ctx->arg; + gfn_t gfn = ctx->addr / PAGE_SIZE; if (!kvm_pte_valid(ctx->old) || new == ctx->old) return 0; data->young = true; + /* -* stage2_age_walker() is always called while holding the MMU lock for -* write, so this will always succeed. Nonetheless, this deliberately -* follows the race detection pattern of the other stage-2 walkers in -* case the locking mechanics of the MMU notifiers is ever changed. +* stage2_age_walker() may not be holding the MMU lock for write, so +* follow the race detection pattern of the other stage-2 walkers. 
*/ - if (data->mkold && !stage2_try_set_pte(ctx, new)) - return -EAGAIN; + if (data->mkold) { + if (kvm_gfn_should_age(data->range, gfn) && + !stage2_try_set_pte(ctx, new)) + return -EAGAIN; + } else + kvm_gfn_record_young(data->range, gfn); /* * "But where's the TLBI?!", you scream. @@ -1315,10 +1320,12 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx, } bool kvm_pgtable_stage2_test_clear_young(struct kvm_pgtable *pgt, u64 addr, -u64 size, bool mkold) +u64 size, bool mkold, +struct kvm_gfn_range *range) { struct stage2_age_data
[PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging
Only handle the TDP MMU case for now. In other cases, if a bitmap was not provided, fallback to the slowpath that takes mmu_lock, or, if a bitmap was provided, inform the caller that the bitmap is unreliable. Suggested-by: Yu Zhao Signed-off-by: James Houghton --- arch/x86/include/asm/kvm_host.h | 14 ++ arch/x86/kvm/mmu/mmu.c | 16 ++-- arch/x86/kvm/mmu/tdp_mmu.c | 10 +- 3 files changed, 37 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 3b58e2306621..c30918d0887e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages); */ #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1) +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn) +{ + /* +* Indicate that we support bitmap-based aging when using the TDP MMU +* and the accessed bit is available in the TDP page tables. +* +* We have no other preparatory work to do here, so we do not need to +* redefine kvm_arch_finish_bitmap_age(). 
+*/ + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled +&& shadow_accessed_mask; +} + #endif /* _ASM_X86_KVM_HOST_H */ diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 992e651540e8..fae1a75750bb 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; - if (kvm_memslots_have_rmaps(kvm)) + if (kvm_memslots_have_rmaps(kvm)) { + if (range->lockless) { + kvm_age_set_unreliable(range); + return false; + } + young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap); + } if (tdp_mmu_enabled) young |= kvm_tdp_mmu_age_gfn_range(kvm, range); @@ -1687,8 +1693,14 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range) { bool young = false; - if (kvm_memslots_have_rmaps(kvm)) + if (kvm_memslots_have_rmaps(kvm)) { + if (range->lockless) { + kvm_age_set_unreliable(range); + return false; + } + young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap); + } if (tdp_mmu_enabled) young |= kvm_tdp_mmu_test_age_gfn(kvm, range); diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index d078157e62aa..edea01bc145f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1217,6 +1217,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter, if (!is_accessed_spte(iter->old_spte)) return false; + if (!kvm_gfn_should_age(range, iter->gfn)) + return false; + if (spte_ad_enabled(iter->old_spte)) { iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep, iter->old_spte, @@ -1250,7 +1253,12 @@ bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range) static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter, struct kvm_gfn_range *range) { - return is_accessed_spte(iter->old_spte); + bool young = is_accessed_spte(iter->old_spte); + + if (young) + kvm_gfn_record_young(range, iter->gfn); + + return young; } bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range 
*range) -- 2.44.0.478.gd926399ef9-goog
[PATCH v3 4/7] KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask
From: Yu Zhao tdp_mmu_enabled and shadow_accessed_mask are needed to implement kvm_arch_prepare_bitmap_age(). Signed-off-by: Yu Zhao Signed-off-by: James Houghton --- arch/x86/include/asm/kvm_host.h | 6 ++ arch/x86/kvm/mmu.h | 6 -- arch/x86/kvm/mmu/spte.h | 1 - 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 16e07a2eee19..3b58e2306621 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1847,6 +1847,7 @@ struct kvm_arch_async_pf { extern u32 __read_mostly kvm_nr_uret_msrs; extern u64 __read_mostly host_efer; +extern u64 __read_mostly shadow_accessed_mask; extern bool __read_mostly allow_smaller_maxphyaddr; extern bool __read_mostly enable_apicv; extern struct kvm_x86_ops kvm_x86_ops; @@ -1952,6 +1953,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin, bool mask); extern bool tdp_enabled; +#ifdef CONFIG_X86_64 +extern bool tdp_mmu_enabled; +#else +#define tdp_mmu_enabled false +#endif u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 60f21bb4c27b..8ae279035900 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -270,12 +270,6 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm) return smp_load_acquire(>arch.shadow_root_allocated); } -#ifdef CONFIG_X86_64 -extern bool tdp_mmu_enabled; -#else -#define tdp_mmu_enabled false -#endif - static inline bool kvm_memslots_have_rmaps(struct kvm *kvm) { return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm); diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h index a129951c9a88..f791fe045c7d 100644 --- a/arch/x86/kvm/mmu/spte.h +++ b/arch/x86/kvm/mmu/spte.h @@ -154,7 +154,6 @@ extern u64 __read_mostly shadow_mmu_writable_mask; extern u64 __read_mostly shadow_nx_mask; extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */ extern u64 __read_mostly shadow_user_mask; -extern u64 
__read_mostly shadow_accessed_mask; extern u64 __read_mostly shadow_dirty_mask; extern u64 __read_mostly shadow_mmio_value; extern u64 __read_mostly shadow_mmio_mask; -- 2.44.0.478.gd926399ef9-goog
[PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young
Add kvm_arch_prepare_bitmap_age() for architectures to indicate that they support bitmap-based aging in kvm_mmu_notifier_test_clear_young() and that they do not need KVM to grab the MMU lock for writing. This function allows architectures to do any locking or other preparatory work that they need. If an architecture does not implement kvm_arch_prepare_bitmap_age() or is unable to do bitmap-based aging at runtime (and marks the bitmap as unreliable): 1. If a bitmap was provided, we inform the caller that the bitmap is unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE). 2. If a bitmap was not provided, fall back to the old logic. Also add logic for architectures to easily use the provided bitmap if they are able. The expectation is that the architecture's implementation of kvm_gfn_test_age() will use kvm_gfn_record_young(), and kvm_gfn_age() will use kvm_gfn_should_age(). Suggested-by: Yu Zhao Signed-off-by: James Houghton --- include/linux/kvm_host.h | 60 ++ virt/kvm/kvm_main.c | 92 +--- 2 files changed, 127 insertions(+), 25 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 1800d03a06a9..5862fd7b5f9b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[]; extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; +/* + * Architectures that support using bitmaps for kvm_age_gfn() and + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age() + * and do any work they need to prepare. The subsequent walk will not + * automatically grab the KVM MMU lock, so some architectures may opt + * to grab it. + * + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is + * guaranteed. 
+ */ +#ifndef kvm_arch_prepare_bitmap_age +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn) +{ + return false; +} +#endif +#ifndef kvm_arch_finish_bitmap_age +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {} +#endif + #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) { @@ -2076,9 +2096,16 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq; } +struct test_clear_young_metadata { + unsigned long *bitmap; + unsigned long bitmap_offset_end; + unsigned long end; + bool unreliable; +}; union kvm_mmu_notifier_arg { pte_t pte; unsigned long attributes; + struct test_clear_young_metadata *metadata; }; struct kvm_gfn_range { @@ -2087,11 +2114,44 @@ struct kvm_gfn_range { gfn_t end; union kvm_mmu_notifier_arg arg; bool may_block; + bool lockless; }; bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); + +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range) +{ + struct test_clear_young_metadata *args = range->arg.metadata; + + args->unreliable = true; +} +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range *range, + gfn_t gfn) +{ + struct test_clear_young_metadata *args = range->arg.metadata; + + return hva_to_gfn_memslot(args->end - 1, range->slot) - gfn; +} +static inline void kvm_gfn_record_young(struct kvm_gfn_range *range, gfn_t gfn) +{ + struct test_clear_young_metadata *args = range->arg.metadata; + + WARN_ON_ONCE(gfn < range->start || gfn >= range->end); + if (args->bitmap) + __set_bit(kvm_young_bitmap_offset(range, gfn), args->bitmap); +} +static inline bool kvm_gfn_should_age(struct kvm_gfn_range *range, gfn_t gfn) +{ + struct 
test_clear_young_metadata *args = range->arg.metadata; + + WARN_ON_ONCE(gfn < range->start || gfn >= range->end); + if (args->bitmap) + return test_bit(kvm_young_bitmap_offset(range, gfn), + args->bitmap); + return true; +} #endif #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d0545d88c802..7d80321e2ece 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -550,6 +550,7 @@ struct kvm_mmu_notifier_range { on_lock_fn_t on_lock; bool flush_on_ret; bool may_block; + bool lockless; }; /* @@ -598,6 +599,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, struct kvm_memslots *slots; int i, idx; + BUILD_BUG_ON(sizeof(gfn_range.arg) != sizeof(gfn_range.arg.pte)); +
[PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
The bitmap is provided for secondary MMUs to use if they support it. For test_young(), after it returns, the bitmap represents the pages that were young in the interval [start, end). For clear_young, it represents the pages that we wish the secondary MMU to clear the accessed/young bit for. If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API should be unchanged except that if young PTEs are found and the architecture supports passing in a bitmap, instead of returning 1, MMU_NOTIFIER_YOUNG_FAST is returned. This allows MGLRU's look-around logic to work faster, resulting in a 4% improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST to indicate to main mm that doing look-around is likely to be beneficial. If the secondary MMU doesn't support the bitmap, it must return an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/ Suggested-by: Yu Zhao Signed-off-by: James Houghton --- include/linux/mmu_notifier.h | 93 +--- include/trace/events/kvm.h | 13 +++-- mm/mmu_notifier.c| 20 +--- virt/kvm/kvm_main.c | 19 ++-- 4 files changed, 123 insertions(+), 22 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) +#define MMU_NOTIFIER_YOUNG_FAST(1 << 2) + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. 
+* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, - unsigned long end); + unsigned long end, + unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. +* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); /* * change_pte is called in cases that pte mapping to page is changed: @@ -388,10 +407,11 @@ extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, unsigned long start, unsigned long end); extern int __mmu_notifier_clear_young(struct mm_struct *mm, - unsigned long start, - unsigned long end); + unsigned long start, unsigned long end, + unsigned long *bitmap); extern int __mmu_notifier_test_young(struct mm_struct *mm, -unsigned long address); +unsigned long start, unsigned long end, +unsigned long *bitmap); extern void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address, pte_t pte); extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r); @@ -427,7 +447,25 @@ static inline int mmu_notifier_clear_young(struct mm_struct *mm, unsigned long end) { if (mm_has_notifiers(mm)) - return __mmu_notifier_clear_young(mm, start, end); + return __mmu_notifier_clear_young(mm, start, end, NULL); + return 0; +} + +/* + * When @bitmap is not provided,
[PATCH v3 2/7] KVM: Move MMU notifier function declarations
To allow new MMU-notifier-related functions to use gfn_to_hva_memslot(), move some declarations around. Also move mmu_notifier_to_kvm() for wider use later. Signed-off-by: James Houghton --- include/linux/kvm_host.h | 41 +--- virt/kvm/kvm_main.c | 5 - 2 files changed, 22 insertions(+), 24 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 48f31dcd318a..1800d03a06a9 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -257,25 +257,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #endif -#ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER -union kvm_mmu_notifier_arg { - pte_t pte; - unsigned long attributes; -}; - -struct kvm_gfn_range { - struct kvm_memory_slot *slot; - gfn_t start; - gfn_t end; - union kvm_mmu_notifier_arg arg; - bool may_block; -}; -bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); -bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); -bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); -#endif - enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, @@ -2012,6 +1993,11 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + return container_of(mn, struct kvm, mmu_notifier); +} + static inline int mmu_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) { if (unlikely(kvm->mmu_invalidate_in_progress)) @@ -2089,6 +2075,23 @@ static inline bool mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm, return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq; } + +union kvm_mmu_notifier_arg { + pte_t pte; + unsigned long attributes; +}; + +struct kvm_gfn_range { + struct kvm_memory_slot *slot; + gfn_t start; + gfn_t end; + union kvm_mmu_notifier_arg arg; + bool 
may_block; +}; +bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); +bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); #endif #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ca4b1ef9dfc2..d0545d88c802 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -534,11 +534,6 @@ void kvm_destroy_vcpus(struct kvm *kvm) EXPORT_SYMBOL_GPL(kvm_destroy_vcpus); #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER -static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) -{ - return container_of(mn, struct kvm, mmu_notifier); -} - typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range); typedef void (*on_lock_fn_t)(struct kvm *kvm); -- 2.44.0.478.gd926399ef9-goog
[PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting
This patchset adds a fast path in KVM to test and clear access bits on sptes without taking the mmu_lock. It also adds support for using a bitmap to (1) test the access bits for many sptes in a single call to mmu_notifier_test_young, and to (2) clear the access bits for many ptes in a single call to mmu_notifier_clear_young. With Yu's permission, I'm now working on getting this series into a mergeable state. I'm posting this as an RFC because I'm not sure if the arm64 bits are correct, and I haven't done complete performance testing. I want to do broader experimentation to see how much this improves VM performance in a cloud environment, but I want to be sure that the code is mergeable first. Yu has posted other performance results[1], [2]. This v3 shouldn't significantly change the x86 results, but the arm64 results may have changed. The most important changes since v2[3]: - Split the test_clear_young MMU notifier back into test_young and clear_young. I did this because the bitmap passed in has a distinct meaning for each of them, and I felt that this was cleaner. - The return value of test_young / clear_young now indicates if the bitmap was used. - Removed the custom spte walker to implement the lockless path. This was important for arm64 to be functionally correct (thanks Oliver), and it avoids a lot of problems brought up in review of v2 (for example[4]). - Add kvm_arch_prepare_bitmap_age and kvm_arch_finish_bitmap_age to allow for arm64 to implement its bitmap-based aging to grab the MMU lock for reading while allowing x86 to be lockless. - The powerpc changes have been dropped. - The logic to inform architectures how to use the bitmap has been cleaned up (kvm_should_clear_young has been split into kvm_gfn_should_age and kvm_gfn_record_young) (thanks Nicolas). There were some smaller changes too: - Added test_clear_young_metadata (thanks Sean). 
- MMU_NOTIFIER_RANGE_LOCKLESS has been renamed to MMU_NOTIFIER_YOUNG_FAST, to indicate to the caller that passing a bitmap for MGLRU look-around is likely to be beneficial. - Cleaned up comments that describe the changes to mmu_notifier_test_young / mmu_notifier_clear_young (thanks Nicolas). [1]: https://lore.kernel.org/all/20230609005943.43041-1-yuz...@google.com/ [2]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/ [3]: https://lore.kernel.org/kvmarm/20230526234435.662652-1-yuz...@google.com/ [4]: https://lore.kernel.org/all/zitx64bbx5vdj...@google.com/ James Houghton (5): mm: Add a bitmap into mmu_notifier_{clear,test}_young KVM: Move MMU notifier function declarations KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young KVM: x86: Participate in bitmap-based PTE aging KVM: arm64: Participate in bitmap-based PTE aging Yu Zhao (2): KVM: x86: Move tdp_mmu_enabled and shadow_accessed_mask mm: multi-gen LRU: use mmu_notifier_test_clear_young() Documentation/admin-guide/mm/multigen_lru.rst | 6 +- arch/arm64/include/asm/kvm_host.h | 5 + arch/arm64/include/asm/kvm_pgtable.h | 4 +- arch/arm64/kvm/hyp/pgtable.c | 21 +- arch/arm64/kvm/mmu.c | 23 ++- arch/x86/include/asm/kvm_host.h | 20 ++ arch/x86/kvm/mmu.h| 6 - arch/x86/kvm/mmu/mmu.c| 16 +- arch/x86/kvm/mmu/spte.h | 1 - arch/x86/kvm/mmu/tdp_mmu.c| 10 +- include/linux/kvm_host.h | 101 -- include/linux/mmu_notifier.h | 93 - include/linux/mmzone.h| 6 +- include/trace/events/kvm.h| 13 +- mm/mmu_notifier.c | 20 +- mm/rmap.c | 9 +- mm/vmscan.c | 183 ++ virt/kvm/kvm_main.c | 100 +++--- 18 files changed, 509 insertions(+), 128 deletions(-) base-commit: 0cef2c0a2a356137b170c3cb46cb9c1dd2ca3e6b -- 2.44.0.478.gd926399ef9-goog
Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Hi SeongJae, On Mon, Apr 1, 2024 at 11:27 AM Ho-Ren (Jack) Chuang wrote: > > Hi SeongJae, > > On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park wrote: > > > > Hi Ho-Ren, > > > > On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" > > wrote: > > > > > Since different memory devices require finding, allocating, and putting > > > memory types, these common steps are abstracted in this patch, > > > enhancing the scalability and conciseness of the code. > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > Reviewed-by: "Huang, Ying" > > > --- > > > drivers/dax/kmem.c | 20 ++-- > > > include/linux/memory-tiers.h | 13 + > > > mm/memory-tiers.c| 32 > > > 3 files changed, 47 insertions(+), 18 deletions(-) > > > > > [...] > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > > index 69e781900082..a44c03c2ba3a 100644 > > > --- a/include/linux/memory-tiers.h > > > +++ b/include/linux/memory-tiers.h > > > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist); > > > int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, > > >const char *source); > > > int mt_perf_to_adistance(struct access_coordinate *perf, int *adist); > > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > > + struct list_head > > > *memory_types); > > > +void mt_put_memory_types(struct list_head *memory_types); > > > #ifdef CONFIG_MIGRATION > > > int next_demotion_node(int node); > > > void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > > > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct > > > access_coordinate *perf, int *adis > > > { > > > return -EIO; > > > } > > > + > > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > > list_head *memory_types) > > > +{ > > > + return NULL; > > > +} > > > + > > > +void mt_put_memory_types(struct list_head *memory_types) > > > +{ > > > + > > > +} > > > > I found latest mm-unstable tree is failing kunit as below, and 'git bisect' > > says it 
happens from this patch. > > > > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/ > > [11:56:40] Configuring KUnit Kernel ... > > [11:56:40] Building KUnit Kernel ... > > Populating config with: > > $ make ARCH=um O=../kunit.out/ olddefconfig > > Building with: > > $ make ARCH=um O=../kunit.out/ --jobs=36 > > ERROR:root:In file included from .../mm/memory.c:71: > > .../include/linux/memory-tiers.h:143:25: warning: no previous prototype > > for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes] > > 143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > struct list_head *memory_types) > > | ^ > > .../include/linux/memory-tiers.h:148:6: warning: no previous prototype > > for ‘mt_put_memory_types’ [-Wmissing-prototypes] > > 148 | void mt_put_memory_types(struct list_head *memory_types) > > | ^~~ > > [...] > > > > Maybe we should set these as 'static inline', like below? I confirmed this > > fixes the kunit error. May I ask your opinion? > > > > Thanks for catching this. I'm trying to figure out this problem. Will get > back. > These kunit compilation errors can be solved by adding `static inline` to the two complaining functions, the same solution you mentioned earlier. I've also tested on my end and I will send out a V10 soon. Thank you again! 
> > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > index a44c03c2ba3a..ee6e53144156 100644 > > --- a/include/linux/memory-tiers.h > > +++ b/include/linux/memory-tiers.h > > @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct > > access_coordinate *perf, int *adis > > return -EIO; > > } > > > > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > list_head *memory_types) > > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > struct list_head *memory_types) > > { > > return NULL; > > } > > > > -void mt_put_memory_types(struct list_head *memory_types) > > +static inline void mt_put_memory_types(struct list_head *memory_types) > > { > > > > } > > > > > > Thanks, > > SJ > > > > -- > Best regards, > Ho-Ren (Jack) Chuang > 莊賀任 -- Best regards, Ho-Ren (Jack) Chuang 莊賀任
Re: [PATCH] selftests/sgx: Improve cgroup test scripts
On Mon, 01 Apr 2024 09:22:21 -0500, Jarkko Sakkinen wrote: On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote: Make cgroup test scripts ash compatible. Remove cg-tools dependency. Add documentation for functions. Tested with busybox on Ubuntu. Signed-off-by: Haitao Huang I'll run this next week on good old NUC7. Thank you. I really wish that either (hopefully both) Intel or AMD would bring up for developers home use meant platform to develop on TDX and SNP. It is a shame that the latest and greatest is from 2018. BR, Jarkko Argh, missed a few changes for v2 cgroup: --- a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh @@ -15,6 +15,8 @@ CG_MEM_ROOT=/sys/fs/cgroup CG_V1=0 if [ ! -d "/sys/fs/cgroup/misc" ]; then echo "# cgroup V2 is in use." +echo "+misc" > $CG_MISC_ROOT/cgroup.subtree_control +echo "+memory" > $CG_MEM_ROOT/cgroup.subtree_control else echo "# cgroup V1 is in use." CG_MISC_ROOT=/sys/fs/cgroup/misc @@ -26,6 +28,11 @@ mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB2 mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB3 mkdir -p $CG_MISC_ROOT/$TEST_CG_SUB4 +if [ $CG_V1 -eq 0 ]; then +echo "+misc" > $CG_MISC_ROOT/$TEST_ROOT_CG/cgroup.subtree_control +echo "+misc" > $CG_MISC_ROOT/$TEST_CG_SUB1/cgroup.subtree_control +fi
[PATCH 3/3] Documentation/smatch: fix typo in submitting-patches.md
Fix a small typo in the smatch documentation about the patch submission process. Signed-off-by: Javier Carrasco --- Documentation/submitting-patches.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/submitting-patches.md b/Documentation/submitting-patches.md index 5c4191bd..3f4c548f 100644 --- a/Documentation/submitting-patches.md +++ b/Documentation/submitting-patches.md @@ -20,7 +20,7 @@ Kernel submitting process. Notice that sparse uses the MIT License. 4. Smatch is built on top of Sparse but it is licensed under the GPLv2+ the - git repostories are: + git repositories are: https://github.com/error27/smatch https://repo.or.cz/w/smatch.git -- 2.40.1
[PATCH 2/3] Documentation/smatch: convert to RST
Convert existing smatch documentation to RST, and add it to the index accordingly. Signed-off-by: Javier Carrasco --- Documentation/index.rst | 1 + Documentation/{smatch.txt => smatch.rst} | 56 +--- 2 files changed, 31 insertions(+), 26 deletions(-) rename Documentation/{smatch.txt => smatch.rst} (72%) diff --git a/Documentation/index.rst b/Documentation/index.rst index e29a5643..761acbae 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -86,6 +86,7 @@ Some interesting external documentation: test-suite doc-guide TODO + smatch .. toctree:: :caption: Release Notes diff --git a/Documentation/smatch.txt b/Documentation/smatch.rst similarity index 72% rename from Documentation/smatch.txt rename to Documentation/smatch.rst index b2c3ac4e..f209c8fb 100644 --- a/Documentation/smatch.txt +++ b/Documentation/smatch.rst @@ -1,43 +1,46 @@ +== Smatch +== -0. Introduction -1. Building Smatch -2. Using Smatch -3. Smatch vs Sparse +.. Table of Contents: -Section 0: Introduction +.. contents:: :local: + + +0. Introduction +=== The Smatch mailing list is . -Section 1: Building Smatch +1. Building Smatch +== Smatch needs some dependencies to build: -In Debian run: -apt-get install gcc make sqlite3 libsqlite3-dev libdbd-sqlite3-perl libssl-dev libtry-tiny-perl +In Debian run:: -Or in Fedora run: -yum install gcc make sqlite3 sqlite-devel sqlite perl-DBD-SQLite openssl-devel perl-Try-Tiny + apt-get install gcc make sqlite3 libsqlite3-dev libdbd-sqlite3-perl libssl-dev libtry-tiny-perl -Smatch is easy to build. Just type `make`. There isn't an install process -right now so just run it from the build directory. +Or in Fedora run:: + + yum install gcc make sqlite3 sqlite-devel sqlite perl-DBD-SQLite openssl-devel perl-Try-Tiny +Smatch is easy to build. Just type ``make``. There isn't an install process +right now so just run it from the build directory. -Section 2: Using Smatch - +2. Using Smatch +=== Smatch can be used with a cross function database. 
It's not mandatory to build the database but it's a useful thing to do. Building the database for the kernel takes 2-3 hours on my computer. For the kernel you build -the database with: +the database with:: - cd ~/path/to/kernel_dir - ~/path/to/smatch_dir/smatch_scripts/build_kernel_data.sh + cd ~/path/to/kernel_dir ~/path/to/smatch_dir/smatch_scripts/build_kernel_data.sh For projects other than the kernel you run Smatch with the options "--call-tree --info --param-mapper --spammy" and finish building the -database by running the script: +database by running the script:: ~/path/to/smatch_dir/smatch_data/db/create_db.sh @@ -45,21 +48,23 @@ Each time you rebuild the cross function database it becomes more accurate. I normally rebuild the database every morning. If you are running Smatch over the whole kernel you can use the following -command: +command:: ~/path/to/smatch_dir/smatch_scripts/test_kernel.sh The test_kernel.sh script will create a .c.smatch file for every file it tests and a combined smatch_warns.txt file with all the warnings. -If you are running Smatch just over one kernel file: +If you are running Smatch just over one kernel file:: ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/file.c -You can also build a directory like this: +You can also build a directory like this:: + ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/ + The kchecker script prints its warnings to stdout. The above scripts will ensure that any ARCH or CROSS_COMPILE environment @@ -67,7 +72,7 @@ variables are passed to kernel build system - thus allowing for the use of Smatch with kernels that are normally built with cross-compilers. 
If you are building something else (which is not the Linux kernel) then use -something like: +something like:: make CHECK="~/path/to/smatch_dir/smatch --full-path" \ CC=~/path/to/smatch_dir/smatch/cgcc | tee smatch_warns.txt @@ -75,9 +80,8 @@ something like: The makefile has to let people set the CC with an environment variable for that to work, of course. - -Section 3: Smatch vs Sparse - +3. Smatch vs Sparse +=== Smatch uses Sparse as a C parser. I have made a few hacks to Sparse so I have to distribute the two together. Sparse is released under the MIT license -- 2.40.1
[PATCH 1/3] Documentation/smatch: fix paths in the examples
A few examples use the '~/progs/smatch/devel/smatch_scripts/' path, which seems to be a local reference that does not reflect the real paths in the project (one would not expect 'devel' inside 'smatch'). Use the generic '~/path/to/smatch_dir/' path, which is already used in some examples. Signed-off-by: Javier Carrasco --- Documentation/smatch.txt | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Documentation/smatch.txt b/Documentation/smatch.txt index 59106d49..b2c3ac4e 100644 --- a/Documentation/smatch.txt +++ b/Documentation/smatch.txt @@ -39,7 +39,7 @@ For projects other than the kernel you run Smatch with the options "--call-tree --info --param-mapper --spammy" and finish building the database by running the script: - ~/progs/smatch/devel/smatch_data/db/create_db.sh + ~/path/to/smatch_dir/smatch_data/db/create_db.sh Each time you rebuild the cross function database it becomes more accurate. I normally rebuild the database every morning. @@ -47,18 +47,18 @@ normally rebuild the database every morning. If you are running Smatch over the whole kernel you can use the following command: - ~/progs/smatch/devel/smatch_scripts/test_kernel.sh + ~/path/to/smatch_dir/smatch_scripts/test_kernel.sh The test_kernel.sh script will create a .c.smatch file for every file it tests and a combined smatch_warns.txt file with all the warnings. If you are running Smatch just over one kernel file: - ~/progs/smatch/devel/smatch_scripts/kchecker drivers/whatever/file.c + ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/file.c You can also build a directory like this: - ~/progs/smatch/devel/smatch_scripts/kchecker drivers/whatever/ + ~/path/to/smatch_dir/smatch_scripts/kchecker drivers/whatever/ The kchecker script prints its warnings to stdout. @@ -69,8 +69,8 @@ Smatch with kernels that are normally built with cross-compilers. 
If you are building something else (which is not the Linux kernel) then use something like: - make CHECK="~/progs/smatch/devel/smatch --full-path" \ - CC=~/progs/smatch/devel/smatch/cgcc | tee smatch_warns.txt + make CHECK="~/path/to/smatch_dir/smatch --full-path" \ + CC=~/path/to/smatch_dir/smatch/cgcc | tee smatch_warns.txt The makefile has to let people set the CC with an environment variable for that to work, of course. -- 2.40.1
[PATCH 0/3] Documentation/smatch: RST conversion and fixes
This series converts the existing smatch.txt to RST and adds it to the index, so it can be built together with the sparse documentation. While at it, a couple of small fixes have been included. Signed-off-by: Javier Carrasco Javier Carrasco (3): Documentation/smatch: fix paths in the examples Documentation/smatch: convert to RST Documentation/smatch: fix typo in submitting-patches.md Documentation/index.rst | 1 + Documentation/{smatch.txt => smatch.rst} | 68 +--- Documentation/submitting-patches.md | 2 +- 3 files changed, 38 insertions(+), 33 deletions(-) rename Documentation/{smatch.txt => smatch.rst} (60%) -- 2.40.1
[PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
rethook_find_ret_addr() prints a warning message and returns 0 when the target task is running and not the "current" task, to prevent returning an incorrect return address. However, this check is incomplete, as the target task can still transition to the running state while the return address is being found, although this is safe under RCU. The issue we encounter is that the kernel frequently prints warning messages when BPF profiling programs call bpf_get_task_stack() on running tasks. The callers should be aware of, and willing to take, the risk of receiving an incorrect return address from a task that is running and is not the "current" one. A warning is not needed here, as the callers knowingly accept that risk. Signed-off-by: Kui-Feng Lee --- kernel/trace/rethook.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..4297a132a7ae 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame if (WARN_ON_ONCE(!cur)) return 0; - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk))) + if (tsk != current && task_is_running(tsk)) return 0; do { -- 2.34.1
Re: [External] Re: [PATCH v9 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Hi SeongJae, On Sun, Mar 31, 2024 at 12:09 PM SeongJae Park wrote: > > Hi Ho-Ren, > > On Fri, 29 Mar 2024 05:33:52 + "Ho-Ren (Jack) Chuang" > wrote: > > > Since different memory devices require finding, allocating, and putting > > memory types, these common steps are abstracted in this patch, > > enhancing the scalability and conciseness of the code. > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > Reviewed-by: "Huang, Ying" > > --- > > drivers/dax/kmem.c | 20 ++-- > > include/linux/memory-tiers.h | 13 + > > mm/memory-tiers.c| 32 > > 3 files changed, 47 insertions(+), 18 deletions(-) > > > [...] > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > > index 69e781900082..a44c03c2ba3a 100644 > > --- a/include/linux/memory-tiers.h > > +++ b/include/linux/memory-tiers.h > > @@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist); > > int mt_set_default_dram_perf(int nid, struct access_coordinate *perf, > >const char *source); > > int mt_perf_to_adistance(struct access_coordinate *perf, int *adist); > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, > > + struct list_head > > *memory_types); > > +void mt_put_memory_types(struct list_head *memory_types); > > #ifdef CONFIG_MIGRATION > > int next_demotion_node(int node); > > void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets); > > @@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct > > access_coordinate *perf, int *adis > > { > > return -EIO; > > } > > + > > +struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > > list_head *memory_types) > > +{ > > + return NULL; > > +} > > + > > +void mt_put_memory_types(struct list_head *memory_types) > > +{ > > + > > +} > > I found latest mm-unstable tree is failing kunit as below, and 'git bisect' > says it happens from this patch. > > $ ./tools/testing/kunit/kunit.py run --build_dir ../kunit.out/ > [11:56:40] Configuring KUnit Kernel ... > [11:56:40] Building KUnit Kernel ... 
> Populating config with: > $ make ARCH=um O=../kunit.out/ olddefconfig > Building with: > $ make ARCH=um O=../kunit.out/ --jobs=36 > ERROR:root:In file included from .../mm/memory.c:71: > .../include/linux/memory-tiers.h:143:25: warning: no previous prototype > for ‘mt_find_alloc_memory_type’ [-Wmissing-prototypes] > 143 | struct memory_dev_type *mt_find_alloc_memory_type(int adist, > struct list_head *memory_types) > | ^ > .../include/linux/memory-tiers.h:148:6: warning: no previous prototype > for ‘mt_put_memory_types’ [-Wmissing-prototypes] > 148 | void mt_put_memory_types(struct list_head *memory_types) > | ^~~ > [...] > > Maybe we should set these as 'static inline', like below? I confirmed this > fixes the kunit error. May I ask your opinion? > Thanks for catching this. I'm trying to figure out this problem. Will get back. > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h > index a44c03c2ba3a..ee6e53144156 100644 > --- a/include/linux/memory-tiers.h > +++ b/include/linux/memory-tiers.h > @@ -140,12 +140,12 @@ static inline int mt_perf_to_adistance(struct > access_coordinate *perf, int *adis > return -EIO; > } > > -struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct > list_head *memory_types) > +static inline struct memory_dev_type *mt_find_alloc_memory_type(int adist, > struct list_head *memory_types) > { > return NULL; > } > > -void mt_put_memory_types(struct list_head *memory_types) > +static inline void mt_put_memory_types(struct list_head *memory_types) > { > > } > > > Thanks, > SJ -- Best regards, Ho-Ren (Jack) Chuang 莊賀任
[PATCH v2 4/4] arm64: dts: qcom: msm8976: Add WCNSS node
Add node describing wireless connectivity subsystem. Signed-off-by: Adam Skladowski --- arch/arm64/boot/dts/qcom/msm8976.dtsi | 104 ++ 1 file changed, 104 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi b/arch/arm64/boot/dts/qcom/msm8976.dtsi index 77670fce9b8f..41c748c78347 100644 --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi @@ -771,6 +771,36 @@ blsp2_i2c4_sleep: blsp2-i2c4-sleep-state { drive-strength = <2>; bias-disable; }; + + wcss_wlan_default: wcss-wlan-default-state { + wcss-wlan2-pins { + pins = "gpio40"; + function = "wcss_wlan2"; + drive-strength = <6>; + bias-pull-up; + }; + + wcss-wlan1-pins { + pins = "gpio41"; + function = "wcss_wlan1"; + drive-strength = <6>; + bias-pull-up; + }; + + wcss-wlan0-pins { + pins = "gpio42"; + function = "wcss_wlan0"; + drive-strength = <6>; + bias-pull-up; + }; + + wcss-wlan-pins { + pins = "gpio43", "gpio44"; + function = "wcss_wlan"; + drive-strength = <6>; + bias-pull-up; + }; + }; }; gcc: clock-controller@180 { @@ -1446,6 +1476,80 @@ blsp2_i2c4: i2c@7af8000 { status = "disabled"; }; + wcnss: remoteproc@a204000 { + compatible = "qcom,pronto-v3-pil", "qcom,pronto"; + reg = <0x0a204000 0x2000>, + <0x0a202000 0x1000>, + <0x0a21b000 0x3000>; + reg-names = "ccu", + "dxe", + "pmu"; + + memory-region = <_fw_mem>; + + interrupts-extended = < GIC_SPI 149 IRQ_TYPE_EDGE_RISING>, + <_smp2p_in 0 IRQ_TYPE_EDGE_RISING>, + <_smp2p_in 1 IRQ_TYPE_EDGE_RISING>, + <_smp2p_in 2 IRQ_TYPE_EDGE_RISING>, + <_smp2p_in 3 IRQ_TYPE_EDGE_RISING>; + interrupt-names = "wdog", + "fatal", + "ready", + "handover", + "stop-ack"; + + power-domains = < MSM8976_VDDCX>, + < MSM8976_VDDMX>; + power-domain-names = "cx", "mx"; + + qcom,smem-states = <_smp2p_out 0>; + qcom,smem-state-names = "stop"; + + pinctrl-0 = <_wlan_default>; + pinctrl-names = "default"; + + status = "disabled"; + + wcnss_iris: iris { + /* Separate chip, compatible is board-specific */ + clocks = < RPM_SMD_RF_CLK2>; + clock-names 
= "xo"; + }; + + smd-edge { + interrupts = ; + + qcom,ipc = < 8 17>; + qcom,smd-edge = <6>; + qcom,remote-pid = <4>; + + label = "pronto"; + + wcnss_ctrl: wcnss { + compatible = "qcom,wcnss"; + qcom,smd-channels = "WCNSS_CTRL"; + + qcom,mmio = <>; + + wcnss_bt: bluetooth { + compatible = "qcom,wcnss-bt"; + }; + + wcnss_wifi: wifi { + compatible = "qcom,wcnss-wlan"; + + interrupts = , +
[PATCH v2 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU
Add Adreno GPU node. Signed-off-by: Adam Skladowski --- arch/arm64/boot/dts/qcom/msm8976.dtsi | 65 +++ 1 file changed, 65 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi b/arch/arm64/boot/dts/qcom/msm8976.dtsi index 6be310079f5b..77670fce9b8f 100644 --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi @@ -1074,6 +1074,71 @@ mdss_dsi1_phy: phy@1a96a00 { }; }; + adreno_gpu: gpu@1c0 { + compatible = "qcom,adreno-510.0", "qcom,adreno"; + + reg = <0x01c0 0x4>; + reg-names = "kgsl_3d0_reg_memory"; + + interrupts = ; + interrupt-names = "kgsl_3d0_irq"; + + clocks = < GCC_GFX3D_OXILI_CLK>, +< GCC_GFX3D_OXILI_AHB_CLK>, +< GCC_GFX3D_OXILI_GMEM_CLK>, +< GCC_GFX3D_BIMC_CLK>, +< GCC_GFX3D_OXILI_TIMER_CLK>, +< GCC_GFX3D_OXILI_AON_CLK>; + clock-names = "core", + "iface", + "mem", + "mem_iface", + "rbbmtimer", + "alwayson"; + + power-domains = < OXILI_GX_GDSC>; + + iommus = <_iommu 0>; + + status = "disabled"; + + operating-points-v2 = <_opp_table>; + + gpu_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-2 { + opp-hz = /bits/ 64 <2>; + required-opps = <_opp_low_svs>; + }; + + opp-3 { + opp-hz = /bits/ 64 <3>; + required-opps = <_opp_svs>; + }; + + opp-4 { + opp-hz = /bits/ 64 <4>; + required-opps = <_opp_nom>; + }; + + opp-48000 { + opp-hz = /bits/ 64 <48000>; + required-opps = <_opp_nom_plus>; + }; + + opp-54000 { + opp-hz = /bits/ 64 <54000>; + required-opps = <_opp_turbo>; + }; + + opp-6 { + opp-hz = /bits/ 64 <6>; + required-opps = <_opp_turbo>; + }; + }; + }; + apps_iommu: iommu@1ee { compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2"; reg = <0x01ee 0x3000>; -- 2.44.0
[PATCH v2 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes
Add MDSS nodes to support displays on MSM8976 SoC. Signed-off-by: Adam Skladowski --- arch/arm64/boot/dts/qcom/msm8976.dtsi | 274 +- 1 file changed, 270 insertions(+), 4 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi b/arch/arm64/boot/dts/qcom/msm8976.dtsi index 8bdcc1438177..6be310079f5b 100644 --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi @@ -785,10 +785,10 @@ gcc: clock-controller@180 { clocks = < RPM_SMD_XO_CLK_SRC>, < RPM_SMD_XO_A_CLK_SRC>, -<0>, -<0>, -<0>, -<0>; +<_dsi0_phy 1>, +<_dsi0_phy 0>, +<_dsi1_phy 1>, +<_dsi1_phy 0>; clock-names = "xo", "xo_a", "dsi0pll", @@ -808,6 +808,272 @@ tcsr: syscon@1937000 { reg = <0x01937000 0x3>; }; + mdss: display-subsystem@1a0 { + compatible = "qcom,mdss"; + + reg = <0x01a0 0x1000>, + <0x01ab 0x3000>; + reg-names = "mdss_phys", "vbif_phys"; + + power-domains = < MDSS_GDSC>; + interrupts = ; + + interrupt-controller; + #interrupt-cells = <1>; + + clocks = < GCC_MDSS_AHB_CLK>, +< GCC_MDSS_AXI_CLK>, +< GCC_MDSS_VSYNC_CLK>, +< GCC_MDSS_MDP_CLK>; + clock-names = "iface", + "bus", + "vsync", + "core"; + + #address-cells = <1>; + #size-cells = <1>; + ranges; + + status = "disabled"; + + mdss_mdp: display-controller@1a01000 { + compatible = "qcom,msm8976-mdp5", "qcom,mdp5"; + reg = <0x01a01000 0x89000>; + reg-names = "mdp_phys"; + + interrupt-parent = <>; + interrupts = <0>; + + clocks = < GCC_MDSS_AHB_CLK>, +< GCC_MDSS_AXI_CLK>, +< GCC_MDSS_MDP_CLK>, +< GCC_MDSS_VSYNC_CLK>, +< GCC_MDP_TBU_CLK>, +< GCC_MDP_RT_TBU_CLK>; + clock-names = "iface", + "bus", + "core", + "vsync", + "tbu", + "tbu_rt"; + + operating-points-v2 = <_opp_table>; + power-domains = < MDSS_GDSC>; + + iommus = <_iommu 22>; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + + mdss_mdp5_intf1_out: endpoint { + remote-endpoint = <_dsi0_in>; + }; + }; + + port@1 { + reg = <1>; + + mdss_mdp5_intf2_out: endpoint { + remote-endpoint = <_dsi1_in>; + }; + }; + }; + + 
mdp_opp_table: opp-table { + compatible = "operating-points-v2"; + + opp-17778 { + opp-hz = /bits/ 64 <17778>; + required-opps = <_opp_svs>; + }; + + opp-27000 { + opp-hz = /bits/ 64 <27000>; +
[PATCH v2 1/4] arm64: dts: qcom: msm8976: Add IOMMU nodes
Add the nodes describing the apps and gpu iommu and its context banks that are found on msm8976 SoCs. Signed-off-by: Adam Skladowski --- arch/arm64/boot/dts/qcom/msm8976.dtsi | 81 +++ 1 file changed, 81 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi b/arch/arm64/boot/dts/qcom/msm8976.dtsi index d2bb1ada361a..8bdcc1438177 100644 --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi @@ -808,6 +808,87 @@ tcsr: syscon@1937000 { reg = <0x01937000 0x3>; }; + apps_iommu: iommu@1ee { + compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2"; + reg = <0x01ee 0x3000>; + ranges = <0 0x01e2 0x2>; + + clocks = < GCC_SMMU_CFG_CLK>, +< GCC_APSS_TCU_CLK>; + clock-names = "iface", "bus"; + + qcom,iommu-secure-id = <17>; + + #address-cells = <1>; + #size-cells = <1>; + #iommu-cells = <1>; + + /* VFE */ + iommu-ctx@15000 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x15000 0x1000>; + qcom,ctx-asid = <20>; + interrupts = ; + }; + + /* VENUS NS */ + iommu-ctx@16000 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x16000 0x1000>; + qcom,ctx-asid = <21>; + interrupts = ; + }; + + /* MDP0 */ + iommu-ctx@17000 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x17000 0x1000>; + qcom,ctx-asid = <22>; + interrupts = ; + }; + }; + + gpu_iommu: iommu@1f08000 { + compatible = "qcom,msm8976-iommu", "qcom,msm-iommu-v2"; + ranges = <0 0x01f08000 0x8000>; + + clocks = < GCC_SMMU_CFG_CLK>, +< GCC_GFX3D_TCU_CLK>; + clock-names = "iface", "bus"; + + power-domains = < OXILI_CX_GDSC>; + + qcom,iommu-secure-id = <18>; + + #address-cells = <1>; + #size-cells = <1>; + #iommu-cells = <1>; + + /* gfx3d user */ + iommu-ctx@0 { + compatible = "qcom,msm-iommu-v2-ns"; + reg = <0x0 0x1000>; + qcom,ctx-asid = <0>; + interrupts = ; + }; + + /* gfx3d secure */ + iommu-ctx@1000 { + compatible = "qcom,msm-iommu-v2-sec"; + reg = <0x1000 0x1000>; + qcom,ctx-asid = <2>; + interrupts = ; + }; + + /* gfx3d priv */ + iommu-ctx@2000 { + compatible = 
"qcom,msm-iommu-v2-sec"; + reg = <0x2000 0x1000>; + qcom,ctx-asid = <1>; + interrupts = ; + }; + }; + spmi_bus: spmi@200f000 { compatible = "qcom,spmi-pmic-arb"; reg = <0x0200f000 0x1000>, -- 2.44.0
[PATCH v2 0/4] MSM8976 MDSS/GPU/WCNSS support
This patch series provide support for display subsystem, gpu and also adds wireless connectivity subsystem support. Changes since v1 1. Addressed feedback 2. Dropped already applied dt-bindings patches 3. Dropped sdc patch as it was submitted as part of other series 4. Dropped dt-bindings patch for Adreno, also separate now Adam Skladowski (4): arm64: dts: qcom: msm8976: Add IOMMU nodes arm64: dts: qcom: msm8976: Add MDSS nodes arm64: dts: qcom: msm8976: Add Adreno GPU arm64: dts: qcom: msm8976: Add WCNSS node arch/arm64/boot/dts/qcom/msm8976.dtsi | 524 +- 1 file changed, 520 insertions(+), 4 deletions(-) -- 2.44.0
[PATCH 1/1] clk: qcom: smd-rpm: Restore msm8976 num_clk
During rework somehow msm8976 num_clk got removed, restore it. Fixes: d6edc31f3a68 ("clk: qcom: smd-rpm: Separate out interconnect bus clocks") Signed-off-by: Adam Skladowski --- drivers/clk/qcom/clk-smd-rpm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/clk/qcom/clk-smd-rpm.c b/drivers/clk/qcom/clk-smd-rpm.c index 8602c02047d0..45c5255bcd11 100644 --- a/drivers/clk/qcom/clk-smd-rpm.c +++ b/drivers/clk/qcom/clk-smd-rpm.c @@ -768,6 +768,7 @@ static struct clk_smd_rpm *msm8976_clks[] = { static const struct rpm_smd_clk_desc rpm_clk_msm8976 = { .clks = msm8976_clks, + .num_clks = ARRAY_SIZE(msm8976_clks), .icc_clks = bimc_pcnoc_snoc_smmnoc_icc_clks, .num_icc_clks = ARRAY_SIZE(bimc_pcnoc_snoc_smmnoc_icc_clks), }; -- 2.44.0
[PATCH v6 2/2] tracing: Include Microcode Revision in mce_record tracepoint
Currently, the microcode field (Microcode Revision) of struct mce is not exported to userspace through the mce_record tracepoint. Knowing the microcode version on which the MCE was received is critical information for debugging. If the version is not recorded, later attempts to acquire the version might result in discrepancies since it can be changed at runtime. Export microcode version through the tracepoint to prevent ambiguity over the active version on the system when the MCE was received. Signed-off-by: Avadhut Naik Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck --- include/trace/events/mce.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 294fccc329c1..f0f7b3cb2041 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -42,6 +42,7 @@ TRACE_EVENT(mce_record, __field(u8, cs ) __field(u8, bank) __field(u8, cpuvendor ) + __field(u32,microcode ) ), TP_fast_assign( @@ -63,9 +64,10 @@ TRACE_EVENT(mce_record, __entry->cs = m->cs; __entry->bank = m->bank; __entry->cpuvendor = m->cpuvendor; + __entry->microcode = m->microcode; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x, microcode: %x", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -80,7 +82,8 @@ TRACE_EVENT(mce_record, __entry->cpuid, __entry->walltime, __entry->socketid, - __entry->apicid) + __entry->apicid, + __entry->microcode) ); #endif /* _TRACE_MCE_H */ -- 2.34.1
[PATCH v6 1/2] tracing: Include PPIN in mce_record tracepoint
Machine Check Error information from struct mce is exported to userspace through the mce_record tracepoint. Currently, however, the PPIN (Protected Processor Inventory Number) field of struct mce is not exported through the tracepoint. Export PPIN through the tracepoint as it provides a unique identifier for the system (or socket in case of multi-socket systems) on which the MCE has been received. Also, add a comment explaining the kind of information that can be and should be added to the tracepoint. Signed-off-by: Avadhut Naik Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck --- include/trace/events/mce.h | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 9c4e12163996..294fccc329c1 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -9,6 +9,14 @@ #include #include +/* + * MCE Event Record. + * + * Only very relevant and transient information which cannot be + * gathered from a system by any other means or which can only be + * acquired arduously should be added to this record. 
+ */ + TRACE_EVENT(mce_record, TP_PROTO(struct mce *m), @@ -25,6 +33,7 @@ TRACE_EVENT(mce_record, __field(u64,ipid) __field(u64,ip ) __field(u64,tsc ) + __field(u64,ppin) __field(u64,walltime) __field(u32,cpu ) __field(u32,cpuid ) @@ -45,6 +54,7 @@ TRACE_EVENT(mce_record, __entry->ipid = m->ipid; __entry->ip = m->ip; __entry->tsc= m->tsc; + __entry->ppin = m->ppin; __entry->walltime = m->time; __entry->cpu= m->extcpu; __entry->cpuid = m->cpuid; @@ -55,7 +65,7 @@ TRACE_EVENT(mce_record, __entry->cpuvendor = m->cpuvendor; ), - TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x", + TP_printk("CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, IPID: %016Lx, ADDR: %016Lx, MISC: %016Lx, SYND: %016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PPIN: %llx, vendor: %u, CPUID: %x, time: %llu, socket: %u, APIC: %x", __entry->cpu, __entry->mcgcap, __entry->mcgstatus, __entry->bank, __entry->status, @@ -65,6 +75,7 @@ TRACE_EVENT(mce_record, __entry->synd, __entry->cs, __entry->ip, __entry->tsc, + __entry->ppin, __entry->cpuvendor, __entry->cpuid, __entry->walltime, -- 2.34.1
[PATCH v6 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoint in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) Changes in v5: - Changed "MICROCODE REVISION" to just "MICROCODE". - Changed words which are not acronyms from ALL CAPS to no caps. - Added Reviewed-by: Tony Luck Changes in v6: - Rebased on top of Ingo's changes to the MCE tracepoint https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/include/trace/events/mce.h?id=ac5e80e94f5c67d7053f50fc3faddab931707f0f [NOTE: - Since changes in this version are very minor, have retained the below tags received for previous versions: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck ] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 65d1240b6728b38e4d2068d6738a17e4ee4351f5 -- 2.34.1
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 20:25:52 +0900 Masami Hiramatsu (Google) wrote: > > Masami, > > > > Are you OK with just keeping it set to N. > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > We could have other options like PROVE_LOCKING enable it. > > Agreed (but it should say this is a debug option) It does say "Validate" which to me is a debug option. What would you suggest? -- Steve
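For illustration only, the shape being discussed (default N, pulled in automatically by PROVE_LOCKING, clearly marked as a debug option) might look like the sketch below; the option name, dependency, and help text are assumptions, not the actual kernel Kconfig entry.

```kconfig
# Hypothetical sketch, not the real entry.
config FTRACE_VALIDATE_RCU_IS_WATCHING
	bool "Validate RCU state on function trace entry"
	depends on FUNCTION_TRACER
	default PROVE_LOCKING
	help
	  Perform an extra rcu_is_watching() check in the function tracing
	  entry path. This is a debug option; say N unless you are
	  diagnosing missing RCU protection in tracing callbacks.
```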
Re: [PATCH 1/3] remoteproc: k3-dsp: Fix usage of omap_mbox_message and mbox_msg_t
On Thu, Mar 28, 2024 at 11:26:24AM -0500, Andrew Davis wrote: > On 3/28/24 10:28 AM, Mathieu Poirier wrote: > > Hi Andrew, > > > > On Mon, Mar 25, 2024 at 11:58:06AM -0500, Andrew Davis wrote: > > > The type of message sent using omap-mailbox is always u32. The definition > > > of mbox_msg_t is uintptr_t which is wrong as that type changes based on > > > the architecture (32bit vs 64bit). Use u32 unconditionally and remove > > > the now unneeded omap-mailbox.h include. > > > > > > Signed-off-by: Andrew Davis > > > --- > > > drivers/remoteproc/ti_k3_dsp_remoteproc.c | 7 +++ > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/remoteproc/ti_k3_dsp_remoteproc.c > > > b/drivers/remoteproc/ti_k3_dsp_remoteproc.c > > > index 3555b535b1683..33b30cfb86c9d 100644 > > > --- a/drivers/remoteproc/ti_k3_dsp_remoteproc.c > > > +++ b/drivers/remoteproc/ti_k3_dsp_remoteproc.c > > > @@ -11,7 +11,6 @@ > > > #include > > > #include > > > #include > > > -#include > > > #include > > > #include > > > #include > > > @@ -113,7 +112,7 @@ static void k3_dsp_rproc_mbox_callback(struct > > > mbox_client *client, void *data) > > > client); > > > struct device *dev = kproc->rproc->dev.parent; > > > const char *name = kproc->rproc->name; > > > - u32 msg = omap_mbox_message(data); > > > + u32 msg = (u32)(uintptr_t)(data); > > > > Looking at omap-mailbox.h and unless I'm missing something, the end result > > is > > the same. > > > > > > > dev_dbg(dev, "mbox msg: 0x%x\n", msg); > > > @@ -152,11 +151,11 @@ static void k3_dsp_rproc_kick(struct rproc *rproc, > > > int vqid) > > > { > > > struct k3_dsp_rproc *kproc = rproc->priv; > > > struct device *dev = rproc->dev.parent; > > > - mbox_msg_t msg = (mbox_msg_t)vqid; > > > + u32 msg = vqid; > > > int ret; > > > > > > > Here @vqid becomes a 'u32' rather than a 'uintptr'... > > > > u32 is the correct type for messages sent with OMAP mailbox. 
It > only sends 32bit messages, uintptr is 64bit when compiled on > 64bit hardware (like our ARM64 cores on K3). mbox_msg_t should > have been defined as u32, this was a mistake we missed as we only > ever used to compile it for 32bit cores (where uintptr is 32bit). > > > /* send the index of the triggered virtqueue in the mailbox > > > payload */ > > > - ret = mbox_send_message(kproc->mbox, (void *)msg); > > > + ret = mbox_send_message(kproc->mbox, (void *)(uintptr_t)msg); > > > > ... but here it is cast as a 'uintptr_t', which yields the same result. > > > > The function mbox_send_message() takes a void*, so we need to cast our 32bit > message to that first, it is cast back to u32 inside the OMAP mailbox driver. > Doing that in one step (u32 -> void*) causes a warning when void* is 64bit > (cast from int to pointer of different size). > > > > > I am puzzled - other than getting rid of a header file I don't see what else > > this patch does. > > > > Getting rid of the header is the main point of this patch (I have a later > series that needs that header gone). But the difference this patch makes is > that > before we passed a pointer to a 64bit int to OMAP mailbox which takes a > pointer > to a 32bit int. Sure, the result is the same in little-endian systems, but > that > isn't strictly correct in general. From your explanation above this patchset is about two things: 1) Getting rid of a compilation warning when void* is 64bit wide 2) Getting rid of omap-mailbox.h This is what the changelog should describe. And next time, please add a cover letter to your work. Thanks, Mathieu > > > if (ret < 0) > > > dev_err(dev, "failed to send mailbox message (%pe)\n", > > > ERR_PTR(ret)); > > > -- > > > 2.39.2 > > >
Re: [PATCH v4 1/4] remoteproc: Add TEE support
On Fri, Mar 29, 2024 at 09:58:11AM +0100, Arnaud POULIQUEN wrote: > Hello Mathieu, > > On 3/27/24 18:07, Mathieu Poirier wrote: > > On Tue, Mar 26, 2024 at 08:18:23PM +0100, Arnaud POULIQUEN wrote: > >> Hello Mathieu, > >> > >> On 3/25/24 17:46, Mathieu Poirier wrote: > >>> On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote: > Add a remoteproc TEE (Trusted Execution Environment) driver > that will be probed by the TEE bus. If the associated Trusted > application is supported on secure part this device offers a client > >>> > >>> Device or driver? I thought I touched on that before. > >> > >> Right, I changed the first instance and missed this one > >> > >>> > interface to load a firmware in the secure part. > This firmware could be authenticated by the secure trusted application. > > Signed-off-by: Arnaud Pouliquen > --- > Updates from V3: > - rework TEE_REMOTEPROC description in Kconfig > - fix some namings > - add tee_rproc_parse_fw to support rproc_ops::parse_fw > - add proc::tee_interface; > - add rproc struct as parameter of the tee_rproc_register() function > --- > drivers/remoteproc/Kconfig | 10 + > drivers/remoteproc/Makefile | 1 + > drivers/remoteproc/tee_remoteproc.c | 434 > include/linux/remoteproc.h | 4 + > include/linux/tee_remoteproc.h | 112 +++ > 5 files changed, 561 insertions(+) > create mode 100644 drivers/remoteproc/tee_remoteproc.c > create mode 100644 include/linux/tee_remoteproc.h > > diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig > index 48845dc8fa85..2cf1431b2b59 100644 > --- a/drivers/remoteproc/Kconfig > +++ b/drivers/remoteproc/Kconfig > @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC > > It's safe to say N if not interested in using RPU r5f cores. > > + > +config TEE_REMOTEPROC > +tristate "remoteproc support by a TEE application" > >>> > >>> s/remoteproc/Remoteproc > >>> > +depends on OPTEE > +help > + Support a remote processor with a TEE application. 
The Trusted > + Execution Context is responsible for loading the trusted > firmware > + image and managing the remote processor's lifecycle. > + This can be either built-in or a loadable module. > + > endif # REMOTEPROC > > endmenu > diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile > index 91314a9b43ce..fa8daebce277 100644 > --- a/drivers/remoteproc/Makefile > +++ b/drivers/remoteproc/Makefile > @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o > obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o > obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o > obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o > +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o > obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o > obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o > obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o > diff --git a/drivers/remoteproc/tee_remoteproc.c > b/drivers/remoteproc/tee_remoteproc.c > new file mode 100644 > index ..c855210e52e3 > --- /dev/null > +++ b/drivers/remoteproc/tee_remoteproc.c > @@ -0,0 +1,434 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved > + * Author: Arnaud Pouliquen > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "remoteproc_internal.h" > + > +#define MAX_TEE_PARAM_ARRY_MEMBER 4 > + > +/* > + * Authentication of the firmware and load in the remote processor > memory > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote > processor > + * [in] params[1].memref: buffer containing the image of the > buffer > + */ > +#define TA_RPROC_FW_CMD_LOAD_FW 1 > + > +/* > + * Start the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote > processor > + */ > +#define TA_RPROC_FW_CMD_START_FW2 > + > +/* > + * Stop the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote > 
processor > + */ > +#define TA_RPROC_FW_CMD_STOP_FW 3 > + > +/* > + * Return the address of the resource table, or 0 if not found > + * No check is done to verify that the
Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On Fri, Mar 29, 2024 at 11:57:43AM +0100, Arnaud POULIQUEN wrote: > > > On 3/27/24 18:14, Mathieu Poirier wrote: > > On Tue, Mar 26, 2024 at 08:31:33PM +0100, Arnaud POULIQUEN wrote: > >> > >> > >> On 3/25/24 17:51, Mathieu Poirier wrote: > >>> On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote: > The new TEE remoteproc device is used to manage remote firmware in a > secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is > introduced to delegate the loading of the firmware to the trusted > execution context. In such cases, the firmware should be signed and > adhere to the image format defined by the TEE. > > Signed-off-by: Arnaud Pouliquen > --- > Updates from V3: > - remove support of the attach use case. Will be addressed in a separate > thread, > - add st_rproc_tee_ops::parse_fw ops, > - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage > cross > reference between the rproc struct and the tee_rproc struct in > tee_rproc.c. > --- > drivers/remoteproc/stm32_rproc.c | 60 +--- > 1 file changed, 56 insertions(+), 4 deletions(-) > > diff --git a/drivers/remoteproc/stm32_rproc.c > b/drivers/remoteproc/stm32_rproc.c > index 8cd838df4e92..13df33c78aa2 100644 > --- a/drivers/remoteproc/stm32_rproc.c > +++ b/drivers/remoteproc/stm32_rproc.c > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > #include > > #include "remoteproc_internal.h" > @@ -49,6 +50,9 @@ > #define M4_STATE_STANDBY4 > #define M4_STATE_CRASH 5 > > +/* Remote processor unique identifier aligned with the Trusted > Execution Environment definitions */ > >>> > >>> Why is this the case? At least from the kernel side it is possible to > >>> call > >>> tee_rproc_register() with any kind of value, why is there a need to be any > >>> kind of alignment with the TEE? > >> > >> > >> The use of the proc_id is to identify a processor in case of multi > >> co-processors. > >> > > > > That is well understood. 
> > > >> For instance we can have a system with A DSP and a modem. We would use the > >> same > >> TEE service, but > > > > That too. > > > >> the TEE driver will probably be different, same for the signature key. > > > > What TEE driver are we talking about here? > > In OP-TEE the remoteproc framework is divided into 2 or 3 layers: > > - the remoteproc Trusted Application (TA) [1] which is platform agnostic > - The remoteproc Pseudo Trusted Application (PTA) [2] which is platform > dependent and can rely on the proc ID to retrieve the context. > - the remoteproc driver (optional for some platforms) [3], which is in charge > of DT parsing and platform configuration. > That part makes sense. > Here TEE driver can be interpreted by remote PTA and/or platform driver. > I have to guess PTA means "Platform Trusted Application" but I have no guarantee, adding to the level of (already high) confusion brought on by this patchset. > [1] > https://elixir.bootlin.com/op-tee/latest/source/ta/remoteproc/src/remoteproc_core.c > [2] > https://elixir.bootlin.com/op-tee/latest/source/core/pta/stm32mp/remoteproc_pta.c > [3] > https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c > > > > >> In such a case the proc ID allows to identify the processor you want to > >> address. > >> > > > > That too is well understood, but there is no alignment needed with the TEE, > > i.e > > the TEE application is not expecting a value of '0'. We could set > > STM32_MP1_M4_PROC_ID to 0xDEADBEEF and things would work. This driver > > won't go > > anywhere for as long as it is not the case. > > > Here I suppose that you do not challenge the rproc_ID use in general, but for > the stm32mp1 platform as we have only one remote processor. Am I right? That is correct - I understand the need for an rproc_ID. The problem is with the comment that states that '0' is aligned with the TEE definitions, which in my head means hard-coded value and a big red flag.
What it should say is "aligned with the TEE device tree definition". > > In OP-TEE the check is done here: > https://elixir.bootlin.com/op-tee/latest/source/core/drivers/remoteproc/stm32_remoteproc.c#L64 > > If the driver does not register the proc ID an error is returned indicating that > the > feature is not supported. > > In case of stm32mp1 yes we could consider it as useless as we have only one > remote proc. > > Nevertheless I cannot guarantee that a customer will not add an external > companion processor that uses OP-TEE to authenticate the associated firmware. > As > the trusted Application is the unique entry point, he will need the proc_id to > identify the target at PTA level. > > So from my point of view having a proc ID on stm32MP1 (and on stm32mp2
Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported
On Sun, 31 Mar 2024 16:20:30 -0400 Michael S. Tsirkin wrote: > > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.") > > Cc: sta...@vger.kernel.org > > net has its own stable process, don't CC stable on net patches. Not any more, FWIW: 1.5.7. Stable tree While it used to be the case that netdev submissions were not supposed to carry explicit CC: sta...@vger.kernel.org tags that is no longer the case today. Please follow the standard stable rules in Documentation/process/stable-kernel-rules.rst, and make sure you include appropriate Fixes tags! https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#stable-tree
Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,msm8996-mss-pil: allow glink-edge on msm8996
On Mon, 01 Apr 2024 00:10:42 +0300, Dmitry Baryshkov wrote: > MSM8996 has limited glink support, allow glink-edge node on MSM8996 > platform. > > Signed-off-by: Dmitry Baryshkov > --- > Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml | 1 - > 1 file changed, 1 deletion(-) > Acked-by: Rob Herring
Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
On Mon Apr 1, 2024 at 12:29 PM EEST, Huang, Kai wrote: > On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote: > > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote: > > > > > > > --- /dev/null > > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > > > > @@ -0,0 +1,74 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +// Copyright(c) 2022 Intel Corporation. > > > > > > It's 2024 now. > > > > > > And looks you need to use C style comment for /* Copyright ... */, after > > > looking > > > at some other C files. > > > > To be fair, this happens *all the time* to everyone :-) > > > > I've proposed this few times in SGX context and going to say it now. > > Given the nature of Git copyrights would anyway need to be sorted by > > the Git log not possibly incorrect copyright platters in the header > > and source files. > > > > Sure fine to me either way. Thanks for pointing out. > > I have some vague memory that we should update the year but I guess I was > wrong. I think updating year makes sense! I'd be fine not having copyright platter at all since the commit is from Intel domain anyway but if it is kept then the year needs to be corrected. I mean Git commit stores all the data, including exact date. BR, Jarkko
Re: [PATCH] selftests/sgx: Improve cgroup test scripts
On Sun Mar 31, 2024 at 8:44 PM EEST, Haitao Huang wrote: > Make cgroup test scripts ash compatible. > Remove cg-tools dependency. > Add documentation for functions. > > Tested with busybox on Ubuntu. > > Signed-off-by: Haitao Huang I'll run this next week on good old NUC7. Thank you. I really wish that either (hopefully both) Intel or AMD would bring up a platform meant for developers' home use to develop on TDX and SNP. It is a shame that the latest and greatest is from 2018. BR, Jarkko
Re: Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send
On Mon, Apr 1, 2024 at 8:34 PM wrote: > > From: hepeilin > > Introduce a tracepoint for icmp_send, which can help users to get more > detail information conveniently when icmp abnormal events happen. > > 1. Giving an usecase example: > = > When an application experiences packet loss due to an unreachable UDP > destination port, the kernel will send an exception message through the > icmp_send function. By adding a trace point for icmp_send, developers or > system administrators can obtain detailed information about the UDP > packet loss, including the type, code, source address, destination address, > source port, and destination port. This facilitates the trouble-shooting > of UDP packet loss issues especially for those network-service > applications. > > 2. Operation Instructions: > == > Switch to the tracing directory. > cd /sys/kernel/tracing > Filter for destination port unreachable. > echo "type==3 && code==3" > events/icmp/icmp_send/filter > Enable trace event. > echo 1 > events/icmp/icmp_send/enable > > 3. Result View: > > udp_client_erro-11370 [002] ...s.12 124.728002: > icmp_send: icmp_send: type=3, code=3. > From 127.0.0.1:41895 to 127.0.0.1: ulen=23 > skbaddr=589b167a > > v3->v4: > Some fixes according to > https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/ > 1.Add legality check for UDP header in SKB. I think my understanding based on what Eric depicted differs from you: we're supposed to filter out those many invalid cases and only trace the valid action of sending a icmp, so where to add a new tracepoint is important instead of adding more checks in the tracepoint itself. Please refer to what trace_tcp_retransmit_skb() does :) Thanks, Jason > 2.Target this patch for net-next. > > v2->v3: > Some fixes according to > https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/ > 1. Change the tracking directory to/sys/kernel/tracking. > 2. 
Adjust the layout of the TP-STRUCT_entry parameter structure. > > v1->v2: > Some fixes according to > https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/ > 1. adjust the trace_icmp_send() to more protocols than UDP. > 2. move the calling of trace_icmp_send after sanity checks > in __icmp_send(). > > Signed-off-by: Peilin He > Reviewed-by: xu xin > Reviewed-by: Yunkai Zhang > Cc: Yang Yang > Cc: Liu Chun > Cc: Xuexin Jiang > --- > include/trace/events/icmp.h | 65 + > net/ipv4/icmp.c | 4 +++ > 2 files changed, 69 insertions(+) > create mode 100644 include/trace/events/icmp.h > > diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h > new file mode 100644 > index ..7d5190f48a28 > --- /dev/null > +++ b/include/trace/events/icmp.h > @@ -0,0 +1,65 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM icmp > + > +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_ICMP_H > + > +#include > +#include > + > +TRACE_EVENT(icmp_send, > + > + TP_PROTO(const struct sk_buff *skb, int type, int code), > + > + TP_ARGS(skb, type, code), > + > + TP_STRUCT__entry( > + __field(const void *, skbaddr) > + __field(int, type) > + __field(int, code) > + __array(__u8, saddr, 4) > + __array(__u8, daddr, 4) > + __field(__u16, sport) > + __field(__u16, dport) > + __field(unsigned short, ulen) > + ), > + > + TP_fast_assign( > + struct iphdr *iph = ip_hdr(skb); > + int proto_4 = iph->protocol; > + __be32 *p32; > + > + __entry->skbaddr = skb; > + __entry->type = type; > + __entry->code = code; > + > + struct udphdr *uh = udp_hdr(skb); > + if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head || > + (u8 *)uh + sizeof(struct udphdr) > > skb_tail_pointer(skb)) { > + __entry->sport = 0; > + __entry->dport = 0; > + __entry->ulen = 0; > + } else { > + __entry->sport = ntohs(uh->source); > + __entry->dport = ntohs(uh->dest); > + __entry->ulen = ntohs(uh->len); > + } > + > + p32 = 
(__be32 *) __entry->saddr; > + *p32 = iph->saddr; > + > + p32 = (__be32 *) __entry->daddr; > + *p32 =
Re: [PATCH v9 15/15] selftests/sgx: Add scripts for EPC cgroup testing
On Sun Mar 31, 2024 at 8:35 PM EEST, Haitao Huang wrote: > On Sun, 31 Mar 2024 11:19:04 -0500, Jarkko Sakkinen > wrote: > > > On Sat Mar 30, 2024 at 5:32 PM EET, Haitao Huang wrote: > >> On Sat, 30 Mar 2024 06:15:14 -0500, Jarkko Sakkinen > >> wrote: > >> > >> > On Thu Mar 28, 2024 at 5:54 AM EET, Haitao Huang wrote: > >> >> On Wed, 27 Mar 2024 07:55:34 -0500, Jarkko Sakkinen > >> > >> >> wrote: > >> >> > >> >> > On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > >> >> >> The scripts rely on cgroup-tools package from libcgroup [1]. > >> >> >> > >> >> >> To run selftests for epc cgroup: > >> >> >> > >> >> >> sudo ./run_epc_cg_selftests.sh > >> >> >> > >> >> >> To watch misc cgroup 'current' changes during testing, run this > >> in a > >> >> >> separate terminal: > >> >> >> > >> >> >> ./watch_misc_for_tests.sh current > >> >> >> > >> >> >> With different cgroups, the script starts one or multiple > >> concurrent > >> >> >> SGX > >> >> >> selftests, each to run one unclobbered_vdso_oversubscribed > >> test.Each > >> >> >> of such test tries to load an enclave of EPC size equal to the EPC > >> >> >> capacity available on the platform. The script checks results > >> against > >> >> >> the expectation set for each cgroup and reports success or > >> failure. > >> >> >> > >> >> >> The script creates 3 different cgroups at the beginning with > >> >> >> following > >> >> >> expectations: > >> >> >> > >> >> >> 1) SMALL - intentionally small enough to fail the test loading an > >> >> >> enclave of size equal to the capacity. > >> >> >> 2) LARGE - large enough to run up to 4 concurrent tests but fail > >> some > >> >> >> if > >> >> >> more than 4 concurrent tests are run. The script starts 4 > >> expecting > >> >> >> at > >> >> >> least one test to pass, and then starts 5 expecting at least one > >> test > >> >> >> to fail. > >> >> >> 3) LARGER - limit is the same as the capacity, large enough to run > >> >> >> lots of > >> >> >> concurrent tests. 
The script starts 8 of them and expects all > >> pass. > >> >> >> Then it reruns the same test with one process randomly killed and > >> >> >> usage checked to be zero after all process exit. > >> >> >> > >> >> >> The script also includes a test with low mem_cg limit and LARGE > >> >> >> sgx_epc > >> >> >> limit to verify that the RAM used for per-cgroup reclamation is > >> >> >> charged > >> >> >> to a proper mem_cg. > >> >> >> > >> >> >> [1] https://github.com/libcgroup/libcgroup/blob/main/README > >> >> >> > >> >> >> Signed-off-by: Haitao Huang > >> >> >> --- > >> >> >> V7: > >> >> >> - Added memcontrol test. > >> >> >> > >> >> >> V5: > >> >> >> - Added script with automatic results checking, remove the > >> >> >> interactive > >> >> >> script. > >> >> >> - The script can run independent from the series below. > >> >> >> --- > >> >> >> .../selftests/sgx/run_epc_cg_selftests.sh | 246 > >> >> >> ++ > >> >> >> .../selftests/sgx/watch_misc_for_tests.sh | 13 + > >> >> >> 2 files changed, 259 insertions(+) > >> >> >> create mode 100755 > >> >> >> tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> create mode 100755 > >> >> >> tools/testing/selftests/sgx/watch_misc_for_tests.sh > >> >> >> > >> >> >> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> new file mode 100755 > >> >> >> index ..e027bf39f005 > >> >> >> --- /dev/null > >> >> >> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh > >> >> >> @@ -0,0 +1,246 @@ > >> >> >> +#!/bin/bash > >> >> > > >> >> > This is not portable and neither does hold in the wild. > >> >> > > >> >> > It does not even often hold as it is not uncommon to place bash > >> >> > to the path /usr/bin/bash. If I recall correctly, e.g. NixOS has > >> >> > a path that is neither of those two. > >> >> > > >> >> > Should be #!/usr/bin/env bash > >> >> > > >> >> > That is POSIX compatible form. 
> >> >> > > >> >> > >> >> Sure > >> >> > >> >> > Just got around trying to test this in NUC7 so looking into this in > >> >> > more detail. > >> >> > >> >> Thanks. Could you please check if this version works for you? > >> >> > >> >> > >> https://github.com/haitaohuang/linux/commit/3c424b841cf3cf66b085a424f4b537fbc3bbff6f > >> >> > >> >> > > >> >> > That said can you make the script work with just "#!/usr/bin/env > >> sh" > >> >> > and make sure that it is busybox ash compatible? > >> >> > >> >> Yes. > >> >> > >> >> > > >> >> > I don't see any necessity to make this bash only and it adds to the > >> >> > compilation time of the image. Otherwise lot of this could be > >> tested > >> >> > just with qemu+bzImage+busybox(inside initramfs). > >> >> > > >> >> > >> >> will still need cgroup-tools as you pointed out later. Compiling from > >> >> its > >> >> upstream code OK? > >> > > >> > Can you explain why you need it? > >> > > >> > What is the thing you cannot do without it? > >> > > >> >
[PATCH] ftrace: Fix use-after-free issue in ftrace_location()
KASAN reports a bug: BUG: KASAN: use-after-free in ftrace_location+0x90/0x120 Read of size 8 at addr 888141d40010 by task insmod/424 CPU: 8 PID: 424 Comm: insmod Tainted: GW 6.9.0-rc2+ #213 [...] Call Trace: dump_stack_lvl+0x68/0xa0 print_report+0xcf/0x610 kasan_report+0xb5/0xe0 ftrace_location+0x90/0x120 register_kprobe+0x14b/0xa40 kprobe_init+0x2d/0xff0 [kprobe_example] do_one_initcall+0x8f/0x2d0 do_init_module+0x13a/0x3c0 load_module+0x3082/0x33d0 init_module_from_file+0xd2/0x130 __x64_sys_finit_module+0x306/0x440 do_syscall_64+0x68/0x140 entry_SYSCALL_64_after_hwframe+0x71/0x79 The root cause is that while lookup_rec() is looking up the ftrace record of an address in some module, ftrace_release_mod() can run at the same time and free the memory holding the ftrace records, as that module is being deleted. register_kprobes() { check_kprobe_address_safe() { arch_check_ftrace_location() { ftrace_location() { lookup_rec() // access memory that has been freed by // ftrace_release_mod() !!! It seems that ftrace_lock is required when looking up records in ftrace_location(), and the same holds for ftrace_location_range(). Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") Signed-off-by: Zheng Yejian --- kernel/trace/ftrace.c | 28 ++-- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index da1710499698..838d175709c1 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end) } /** - * ftrace_location_range - return the first address of a traced location + * ftrace_location_range_locked - return the first address of a traced location * if it touches the given ip range * @start: start of range to search. * @end: end of range to search (inclusive).
@end points to the last byte @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end) * that is either a NOP or call to the function tracer. It checks the ftrace * internal tables to determine if the address belongs or not. */ -unsigned long ftrace_location_range(unsigned long start, unsigned long end) +static unsigned long ftrace_location_range_locked(unsigned long start, unsigned long end) { struct dyn_ftrace *rec; @@ -1603,6 +1603,17 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end) return 0; } +unsigned long ftrace_location_range(unsigned long start, unsigned long end) +{ + unsigned long loc; + + mutex_lock(&ftrace_lock); + loc = ftrace_location_range_locked(start, end); + mutex_unlock(&ftrace_lock); + + return loc; +} + /** * ftrace_location - return the ftrace location * @ip: the instruction pointer to check @@ -1614,25 +1625,22 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end) */ unsigned long ftrace_location(unsigned long ip) { - struct dyn_ftrace *rec; + unsigned long loc; unsigned long offset; unsigned long size; - rec = lookup_rec(ip, ip); - if (!rec) { + loc = ftrace_location_range(ip, ip); + if (!loc) { if (!kallsyms_lookup_size_offset(ip, &size, &offset)) goto out; /* map sym+0 to __fentry__ */ if (!offset) - rec = lookup_rec(ip, ip + size - 1); + loc = ftrace_location_range(ip, ip + size - 1); } - if (rec) - return rec->ip; - out: - return 0; + return loc; } /** -- 2.25.1
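The essence of the fix above — take the lock, copy the result out, and return the copy rather than a pointer into memory that a concurrent ftrace_release_mod() may free — can be sketched in userspace. This is a toy Python analogue with hypothetical names, not kernel code:

```python
import threading

class RecordTable:
    """Toy analogue of the ftrace records: lookups copy the value out
    while holding the lock, because another thread (think
    ftrace_release_mod()) may free the records at any moment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._recs = {}  # start address -> traced ip

    def add(self, start, ip):
        with self._lock:
            self._recs[start] = ip

    def release_mod(self, start):
        # Frees the record, as ftrace_release_mod() frees module records.
        with self._lock:
            self._recs.pop(start, None)

    def location(self, start):
        # Like the fixed ftrace_location(): take the lock, copy the
        # value out, return the copy -- never a live reference.
        with self._lock:
            return self._recs.get(start, 0)

table = RecordTable()
table.add(0x1000, 0x1004)
loc = table.location(0x1000)   # safe copy taken under the lock
table.release_mod(0x1000)      # a concurrent unload can no longer invalidate loc
print(hex(loc), hex(table.location(0x1000)))
```

Returning `loc` (a plain value) instead of `rec` (a pointer into the table) is exactly what makes dropping the lock safe.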
Subject: [PATCH net-next v4] net/ipv4: add tracepoint for icmp_send
From: hepeilin Introduce a tracepoint for icmp_send, which can help users to get more detailed information conveniently when ICMP abnormal events happen. 1. Giving a use-case example: = When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the troubleshooting of UDP packet loss issues especially for those network-service applications. 2. Operation Instructions: == Switch to the tracing directory. cd /sys/kernel/tracing Filter for destination port unreachable. echo "type==3 && code==3" > events/icmp/icmp_send/filter Enable trace event. echo 1 > events/icmp/icmp_send/enable 3. Result View: udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1: ulen=23 skbaddr=589b167a v3->v4: Some fixes according to https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/ 1. Add legality check for UDP header in SKB. 2. Target this patch for net-next. v2->v3: Some fixes according to https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/ 1. Change the tracing directory to /sys/kernel/tracing. 2. Adjust the layout of the TP_STRUCT__entry parameter structure. v1->v2: Some fixes according to https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/ 1. adjust the trace_icmp_send() to more protocols than UDP. 2. move the calling of trace_icmp_send after sanity checks in __icmp_send().
Signed-off-by: Peilin He Reviewed-by: xu xin Reviewed-by: Yunkai Zhang Cc: Yang Yang Cc: Liu Chun Cc: Xuexin Jiang --- include/trace/events/icmp.h | 65 + net/ipv4/icmp.c | 4 +++ 2 files changed, 69 insertions(+) create mode 100644 include/trace/events/icmp.h diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h new file mode 100644 index ..7d5190f48a28 --- /dev/null +++ b/include/trace/events/icmp.h @@ -0,0 +1,65 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM icmp + +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_ICMP_H + +#include +#include + +TRACE_EVENT(icmp_send, + + TP_PROTO(const struct sk_buff *skb, int type, int code), + + TP_ARGS(skb, type, code), + + TP_STRUCT__entry( + __field(const void *, skbaddr) + __field(int, type) + __field(int, code) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __field(__u16, sport) + __field(__u16, dport) + __field(unsigned short, ulen) + ), + + TP_fast_assign( + struct iphdr *iph = ip_hdr(skb); + int proto_4 = iph->protocol; + __be32 *p32; + + __entry->skbaddr = skb; + __entry->type = type; + __entry->code = code; + + struct udphdr *uh = udp_hdr(skb); + if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head || + (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) { + __entry->sport = 0; + __entry->dport = 0; + __entry->ulen = 0; + } else { + __entry->sport = ntohs(uh->source); + __entry->dport = ntohs(uh->dest); + __entry->ulen = ntohs(uh->len); + } + + p32 = (__be32 *) __entry->saddr; + *p32 = iph->saddr; + + p32 = (__be32 *) __entry->daddr; + *p32 = iph->daddr; + ), + + TP_printk("icmp_send: type=%d, code=%d. 
From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p", + __entry->type, __entry->code, + __entry->saddr, __entry->sport, __entry->daddr, + __entry->dport, __entry->ulen, __entry->skbaddr) +); + +#endif /* _TRACE_ICMP_H */ + +/* This part must be outside protection */ +#include \ No newline at end of file diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 8cebb476b3ab..224551d75c02 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -92,6 +92,8 @@ #include #include
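The legality check in the TP_fast_assign() above — only read the UDP ports when the inner packet really is UDP and the whole UDP header lies inside the buffer, otherwise record zeros — can be sketched outside the kernel. This is a hypothetical Python analogue; parse_udp_ports() and the sample packet are made up for illustration:

```python
import struct

IPPROTO_UDP = 17

def parse_udp_ports(pkt: bytes):
    """Mirror the tracepoint's bounds check: return (sport, dport, ulen)
    only when the protocol is UDP and the 8-byte UDP header fits within
    the buffer; otherwise fall back to zeros, like the else branch in
    TP_fast_assign()."""
    if len(pkt) < 20:                      # no room for an IPv4 header
        return (0, 0, 0)
    ihl = (pkt[0] & 0x0F) * 4              # IPv4 header length in bytes
    proto = pkt[9]
    if proto != IPPROTO_UDP or len(pkt) < ihl + 8:
        return (0, 0, 0)                   # not UDP / header out of bounds
    sport, dport, ulen = struct.unpack_from("!HHH", pkt, ihl)
    return (sport, dport, ulen)

# Minimal IPv4+UDP packet: 20-byte header, proto 17, ports 41895 -> 6000
hdr = bytearray(20)
hdr[0] = 0x45                              # version 4, ihl 5
hdr[9] = IPPROTO_UDP
udp = struct.pack("!HHHH", 41895, 6000, 23, 0)
print(parse_udp_ports(bytes(hdr) + udp))   # (41895, 6000, 23)
print(parse_udp_ports(bytes(hdr)))         # truncated -> (0, 0, 0)
```

Zeroing the fields on a failed check keeps the trace record well-defined instead of reading past skb_tail_pointer().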
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 26 Mar 2024 15:01:21 -0400 Steven Rostedt wrote: > On Tue, 26 Mar 2024 09:16:33 -0700 > Andrii Nakryiko wrote: > > > > It's no different than lockdep. Test boxes should have it enabled, but > > > there's no reason to have this enabled in a production system. > > > > > > > I tend to agree with Steven here (which is why I sent this patch as it > > is), but I'm happy to do it as an opt-out, if Masami insists. Please > > do let me know if I need to send v2 or this one is actually the one > > we'll end up using. Thanks! > > Masami, > > Are you OK with just keeping it set to N. OK, if it is only for the debugging, I'm OK with setting this to N. > > We could have other options like PROVE_LOCKING enable it. Agreed (but it should say this is a debug option) Thank you, > > -- Steve -- Masami Hiramatsu (Google)
Re: [PATCH v10 05/14] x86/sgx: Implement basic EPC misc cgroup functionality
On Sat, 2024-03-30 at 13:17 +0200, Jarkko Sakkinen wrote: > On Thu Mar 28, 2024 at 2:53 PM EET, Huang, Kai wrote: > > > > > --- /dev/null > > > +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c > > > @@ -0,0 +1,74 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +// Copyright(c) 2022 Intel Corporation. > > > > It's 2024 now. > > > > And looks you need to use C style comment for /* Copyright ... */, after > > looking > > at some other C files. > > To be fair, this happens *all the time* to everyone :-) > > I've proposed this few times in SGX context and going to say it now. > Given the nature of Git copyrights would anyway need to be sorted by > the Git log not possibly incorrect copyright platters in the header > and source files. > Sure fine to me either way. Thanks for pointing out. I have some vague memory that we should update the year but I guess I was wrong.
general protection fault in __fib6_update_sernum_upto_root
Hello. We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. Recently, our team has discovered a issue in Linux kernel 6.7. Attached to the email were a PoC file of the issue. Stack dump: general protection fault, probably for non-canonical address 0xff1f1b1f1f1f1f24: [#1] PREEMPT SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0xf8f8f8f8f8f8f920-0xf8f8f8f8f8f8f927] CPU: 1 PID: 9367 Comm: kworker/1:5 Not tainted 6.7.0 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358 Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 RSP: 0018:c9000631f7c8 EFLAGS: 00010a07 RAX: 1f1f1f1f1f1f1f24 RBX: f8f8f8f8f8f8f8f8 RCX: 89508644 RDX: 888051d78000 RSI: 895085dc RDI: f8f8f8f8f8f8f924 RBP: 0001 R08: 0005 R09: R10: 0001 R11: R12: dc00 R13: 0186 R14: 888052396c00 R15: ed100a472d80 FS: () GS:88807ec0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f42c8487d00 CR3: 4b42c000 CR4: 00750ef0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: __list_add include/linux/list.h:153 [inline] list_add include/linux/list.h:169 [inline] fib6_add+0x16c4/0x4410 net/ipv6/ip6_fib.c:1490 __ip6_ins_rt net/ipv6/route.c:1313 [inline] ip6_ins_rt+0xb6/0x110 net/ipv6/route.c:1323 __ipv6_ifa_notify+0xab3/0xd30 net/ipv6/addrconf.c:6266 ipv6_ifa_notify net/ipv6/addrconf.c:6303 [inline] addrconf_dad_completed+0x15f/0xef0 net/ipv6/addrconf.c:4317 addrconf_dad_work+0x785/0x14e0 net/ipv6/addrconf.c:4260 process_one_work+0x87b/0x15c0 kernel/workqueue.c:3226 worker_thread+0x855/0x1200 kernel/workqueue.c:3380 kthread+0x2cc/0x3b0 kernel/kthread.c:388 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:256 Modules linked in: ---[ 
end trace ]--- RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358 Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 RSP: 0018:c9000631f7c8 EFLAGS: 00010a07 RAX: 1f1f1f1f1f1f1f24 RBX: f8f8f8f8f8f8f8f8 RCX: 89508644 RDX: 888051d78000 RSI: 895085dc RDI: f8f8f8f8f8f8f924 RBP: 0001 R08: 0005 R09: R10: 0001 R11: R12: dc00 R13: 0186 R14: 888052396c00 R15: ed100a472d80 FS: () GS:88807ec0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f42c8487d00 CR3: 4b42c000 CR4: 00750ef0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Code disassembly (best guess): 0: c1 e8 03shr$0x3,%eax 3: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) 8: 0f 85 9b 01 00 00 jne0x1a9 e: 48 8b 1bmov(%rbx),%rbx 11: 48 85 dbtest %rbx,%rbx 14: 0f 84 d9 00 00 00 je 0xf3 1a: e8 74 70 39 f8 call 0xf8397093 1f: 48 8d 7b 2c lea0x2c(%rbx),%rdi 23: 48 89 f8mov%rdi,%rax 26: 48 c1 e8 03 shr$0x3,%rax * 2a: 42 0f b6 14 20 movzbl (%rax,%r12,1),%edx <-- trapping instruction 2f: 48 89 f8mov%rdi,%rax 32: 83 e0 07and$0x7,%eax 35: 83 c0 03add$0x3,%eax 38: 38 d0 cmp%dl,%al 3a: 7c 08 jl 0x44 3c: 84 d2 test %dl,%dl 3e: 0f .byte 0xf 3f: 85 .byte 0x85 Thank you for taking the time to read this email and we look forward to working with you further. poc.c Description: Binary data
[PATCH net-next v4 2/2] trace: tcp: fully support trace_tcp_send_reset
From: Jason Xing Prior to this patch, what we can see by enabling trace_tcp_send is only happening under two circumstances: 1) active rst mode 2) non-active rst mode and based on the full socket That means the inconsistency occurs if we use tcpdump and trace simultaneously to see how rst happens. It's necessary that we should take into other cases into considerations, say: 1) time-wait socket 2) no socket ... By parsing the incoming skb and reversing its 4-tuple can we know the exact 'flow' which might not exist. Samples after applied this patch: 1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port state=TCP_ESTABLISHED 2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port state=UNKNOWN Note: 1) UNKNOWN means we cannot extract the right information from skb. 2) skbaddr/skaddr could be 0 Signed-off-by: Jason Xing --- include/trace/events/tcp.h | 40 -- net/ipv4/tcp_ipv4.c| 7 +++ net/ipv6/tcp_ipv6.c| 3 ++- 3 files changed, 43 insertions(+), 7 deletions(-) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index cf14b6fcbeed..5c04a61a11c2 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -78,11 +78,47 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, * skb of trace_tcp_send_reset is the skb that caused RST. In case of * active reset, skb should be NULL */ -DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset, +TRACE_EVENT(tcp_send_reset, TP_PROTO(const struct sock *sk, const struct sk_buff *skb), - TP_ARGS(sk, skb) + TP_ARGS(sk, skb), + + TP_STRUCT__entry( + __field(const void *, skbaddr) + __field(const void *, skaddr) + __field(int, state) + __array(__u8, saddr, sizeof(struct sockaddr_in6)) + __array(__u8, daddr, sizeof(struct sockaddr_in6)) + ), + + TP_fast_assign( + __entry->skbaddr = skb; + __entry->skaddr = sk; + /* Zero means unknown state. */ + __entry->state = sk ? 
sk->sk_state : 0; + + memset(__entry->saddr, 0, sizeof(struct sockaddr_in6)); + memset(__entry->daddr, 0, sizeof(struct sockaddr_in6)); + + if (sk && sk_fullsock(sk)) { + const struct inet_sock *inet = inet_sk(sk); + + TP_STORE_ADDR_PORTS(__entry, inet, sk); + } else if (skb) { + const struct tcphdr *th = (const struct tcphdr *)skb->data; + /* +* We should reverse the 4-tuple of skb, so later +* it can print the right flow direction of rst. +*/ + TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr); + } + ), + + TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s", + __entry->skbaddr, __entry->skaddr, + __entry->saddr, __entry->daddr, + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN") ); /* diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a22ee5838751..0d47b48f8cfd 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -866,11 +866,10 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb) * routing might fail in this case. No choice here, if we choose to force * input interface, we will misroute in case of asymmetric route. 
*/ - if (sk) { + if (sk) arg.bound_dev_if = sk->sk_bound_dev_if; - if (sk_fullsock(sk)) - trace_tcp_send_reset(sk, skb); - } + + trace_tcp_send_reset(sk, skb); BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) != offsetof(struct inet_timewait_sock, tw_bound_dev_if)); diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 3f4cba49e9ee..8e9c59b6c00c 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb) if (sk) { oif = sk->sk_bound_dev_if; if (sk_fullsock(sk)) { - trace_tcp_send_reset(sk, skb); if (inet6_test_bit(REPFLOW, sk)) label = ip6_flowlabel(ipv6h); priority = READ_ONCE(sk->sk_priority); @@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb) label = ip6_flowlabel(ipv6h); } + trace_tcp_send_reset(sk, skb); + tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1, ipv6_get_dsfield(ipv6h), label, priority, txhash, ); -- 2.37.3
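The fallback path in the tracepoint above — reversing the incoming skb's 4-tuple so the trace prints the direction the RST actually travels — can be sketched with plain data. Here sk and skb are hypothetical dicts standing in for the kernel structures, not real APIs:

```python
def rst_trace_tuple(sk, skb):
    """Toy analogue of the tcp_send_reset address logic: prefer the
    full socket's own 4-tuple; otherwise fall back to the incoming skb
    and *reverse* it, since the RST flows back toward the sender of the
    offending segment; with neither, record zeros (the UNKNOWN case)."""
    if sk is not None and sk.get("fullsock"):
        return (sk["saddr"], sk["sport"], sk["daddr"], sk["dport"])
    if skb is not None:
        # Reverse the incoming segment's tuple: its destination is our source.
        return (skb["daddr"], skb["dport"], skb["saddr"], skb["sport"])
    return ("0.0.0.0", 0, "0.0.0.0", 0)

# An incoming segment to a closed/time-wait port: no full socket exists,
# so the RST direction must be derived from the skb itself.
skb = {"saddr": "10.0.0.2", "sport": 54321, "daddr": "10.0.0.1", "dport": 80}
print(rst_trace_tuple(None, skb))   # RST leaves 10.0.0.1:80 toward 10.0.0.2:54321
```

This is why the patch moves trace_tcp_send_reset() out of the sk_fullsock() branch: the skb-based reversal covers the time-wait and no-socket cases too.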
[PATCH net-next v4 1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
From: Jason Xing

Introduce entry_saddr and entry_daddr parameters in this macro for later
use. They let us record the reverse 4-tuple by analyzing the 4-tuple of
the incoming skb on the receive path.

Signed-off-by: Jason Xing
Reviewed-by: Eric Dumazet
---
 include/trace/events/net_probe_common.h | 20 +++-
 include/trace/events/tcp.h              |  2 +-
 include/trace/events/udp.h              |  2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/net_probe_common.h b/include/trace/events/net_probe_common.h
index 5e33f91bdea3..976a58364bff 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -70,14 +70,14 @@
 	TP_STORE_V4MAPPED(__entry, saddr, daddr)
 #endif
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)		\
+#define TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr) \
 	do {								\
-		struct sockaddr_in *v4 = (void *)__entry->saddr;	\
+		struct sockaddr_in *v4 = (void *)entry_saddr;		\
 									\
 		v4->sin_family = AF_INET;				\
 		v4->sin_port = protoh->source;				\
 		v4->sin_addr.s_addr = ip_hdr(skb)->saddr;		\
-		v4 = (void *)__entry->daddr;				\
+		v4 = (void *)entry_daddr;				\
 		v4->sin_family = AF_INET;				\
 		v4->sin_port = protoh->dest;				\
 		v4->sin_addr.s_addr = ip_hdr(skb)->daddr;		\
@@ -85,28 +85,30 @@
 
 #if IS_ENABLED(CONFIG_IPV6)
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)			\
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr)	\
 	do {								\
 		const struct iphdr *iph = ip_hdr(skb);			\
 									\
 		if (iph->version == 6) {				\
-			struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+			struct sockaddr_in6 *v6 = (void *)entry_saddr;	\
 									\
 			v6->sin6_family = AF_INET6;			\
 			v6->sin6_port = protoh->source;			\
 			v6->sin6_addr = ipv6_hdr(skb)->saddr;		\
-			v6 = (void *)__entry->daddr;			\
+			v6 = (void *)entry_daddr;			\
 			v6->sin6_family = AF_INET6;			\
 			v6->sin6_port = protoh->dest;			\
 			v6->sin6_addr = ipv6_hdr(skb)->daddr;		\
 		} else							\
-			TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+			TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh,		\
+						   entry_saddr,		\
+						   entry_daddr);	\
 	} while (0)
 
 #else
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)			\
-	TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr)	\
+	TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr)
 
 #endif
 
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1db95175c1e5..cf14b6fcbeed 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -295,7 +295,7 @@ DECLARE_EVENT_CLASS(tcp_event_skb,
 		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
 		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-		TP_STORE_ADDR_PORTS_SKB(__entry, skb, th);
+		TP_STORE_ADDR_PORTS_SKB(skb, th, __entry->saddr, __entry->daddr);
 	),
 
 	TP_printk("skbaddr=%p src=%pISpc dest=%pISpc",
diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 62bebe2a6ece..6142be4068e2 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -38,7 +38,7 @@ TRACE_EVENT(udp_fail_queue_rcv_skb,
 		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
 		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-		TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
+		TP_STORE_ADDR_PORTS_SKB(skb, uh, __entry->saddr, __entry->daddr);
 	),
 
 	TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc",
[PATCH net-next v4 0/2] tcp: make trace of reset logic complete
From: Jason Xing

Before this series, there were cases where the TCP layer could send an RST
that we could not trace. This series completes that coverage :)

v4
Link: https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/
1. rebased against latest net-next
2. removed {} and added an skb test statement (Eric)
3. dropped v3 patch [3/3] temporarily because 1) the location is not that
   useful, since we can use perf or something else to trace it, and
   2) Eric said we could use drop_reason to show why we have to RST,
   which is good, but this does not seem to work well for the
   ->send_reset() logic. I need more time to investigate this part.

v3
1. fixed a format problem in patch [3/3]

v2
1. fixed spelling mistakes

Jason Xing (2):
  trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
  trace: tcp: fully support trace_tcp_send_reset

 include/trace/events/net_probe_common.h | 20 ++--
 include/trace/events/tcp.h              | 42 +++--
 include/trace/events/udp.h              |  2 +-
 net/ipv4/tcp_ipv4.c                     |  7 ++---
 net/ipv6/tcp_ipv6.c                     |  3 +-
 5 files changed, 56 insertions(+), 18 deletions(-)

-- 
2.37.3
general protection fault in refill_obj_stock
Hello.

We are the Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
Recently, our team discovered an issue in Linux kernel 6.7. A PoC file for
the issue is attached to this email.

Stack dump:

general protection fault, probably for non-canonical address 0xdc001cc6: [#1] PREEMPT SMP KASAN NOPTI
KASAN: probably user-memory-access in range [0xe630-0xe637]
CPU: 0 PID: 8041 Comm: systemd-udevd Not tainted 6.7.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__ref_is_percpu include/linux/percpu-refcount.h:174 [inline]
RIP: 0010:percpu_ref_get_many include/linux/percpu-refcount.h:204 [inline]
RIP: 0010:percpu_ref_get include/linux/percpu-refcount.h:222 [inline]
RIP: 0010:obj_cgroup_get include/linux/memcontrol.h:810 [inline]
RIP: 0010:refill_obj_stock+0x135/0x500 mm/memcontrol.c:3535
Code: c7 c7 60 9f 3a 8d e8 fa ca 81 ff e8 d5 4e b2 08 5a 85 c0 0f 85 52 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 86 03 00 00 48 8b 45 00 a8 03 0f 85 76 02 00 00
RSP: 0018:c900088bf898 EFLAGS: 00010006
RAX: dc00 RBX: 000380a0 RCX: 192001117edd
RDX: 1cc6 RSI: 0001 RDI: 8cddfa60
RBP: e633 R08: R09: fbfff27147e0
R10: 938a3f07 R11: R12: 0148
R13: 0200 R14: 88802c6380a0 R15: 88802c6380e0
FS: 7f774934e8c0() GS:88802c60() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 566127e8 CR3: 48fe8000 CR4: 00750ef0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 memcg_slab_free_hook+0x157/0x2c0
 slab_free_hook mm/slub.c:2075 [inline]
 slab_free mm/slub.c:4280 [inline]
 kmem_cache_free+0xe1/0x350 mm/slub.c:4344
 kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:1159
 __kfree_skb net/core/skbuff.c:1217 [inline]
 consume_skb net/core/skbuff.c:1432 [inline]
 consume_skb+0xdf/0x170 net/core/skbuff.c:1426
 netlink_recvmsg+0x5cb/0xf10 net/netlink/af_netlink.c:1983
 sock_recvmsg_nosec net/socket.c:1046 [inline]
 sock_recvmsg+0x1de/0x240 net/socket.c:1068
 sys_recvmsg+0x216/0x670 net/socket.c:2803
 ___sys_recvmsg+0xff/0x190 net/socket.c:2845
 __sys_recvmsg+0xfb/0x1d0 net/socket.c:2875
 current_top_of_stack arch/x86/include/asm/processor.h:532 [inline]
 on_thread_stack arch/x86/include/asm/processor.h:537 [inline]
 arch_enter_from_user_mode arch/x86/include/asm/entry-common.h:41 [inline]
 enter_from_user_mode include/linux/entry-common.h:108 [inline]
 syscall_enter_from_user_mode include/linux/entry-common.h:194 [inline]
 do_syscall_64+0x43/0x120 arch/x86/entry/common.c:79
 entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f7749601d73
Code: 8b 15 59 a2 00 00 f7 d8 64 89 02 b8 ff ff ff ff eb b7 0f 1f 44 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 89 54 24 1c 48
RSP: 002b:7fff81586858 EFLAGS: 0246 ORIG_RAX: 002f
RAX: ffda RBX: 7fff81588a20 RCX: 7f7749601d73
RDX: RSI: 7fff815868f0 RDI: 000f
RBP: 7fff815869d0 R08: 46d4 R09: 7fff815e5080
R10: 0007 R11: 0246 R12:
R13: 55824edb2ef0 R14: 0100 R15:
Modules linked in:
---[ end trace ]---
RIP: 0010:__ref_is_percpu include/linux/percpu-refcount.h:174 [inline]
RIP: 0010:percpu_ref_get_many include/linux/percpu-refcount.h:204 [inline]
RIP: 0010:percpu_ref_get include/linux/percpu-refcount.h:222 [inline]
RIP: 0010:obj_cgroup_get include/linux/memcontrol.h:810 [inline]
RIP: 0010:refill_obj_stock+0x135/0x500 mm/memcontrol.c:3535
Code: c7 c7 60 9f 3a 8d e8 fa ca 81 ff e8 d5 4e b2 08 5a 85 c0 0f 85 52 02 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 86 03 00 00 48 8b 45 00 a8 03 0f 85 76 02 00 00
Code disassembly (best guess):
   0:	c7 c7 60 9f 3a 8d    	mov    $0x8d3a9f60,%edi
   6:	e8 fa ca 81 ff       	call   0xff81cb05
   b:	e8 d5 4e b2
general protection fault in __fib6_update_sernum_upto_root
Hello.

We are the Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
Recently, our team discovered an issue in Linux kernel 6.7. A PoC file for
the issue is attached to this email.

Stack dump:

general protection fault, probably for non-canonical address 0xff1f1b1f1f1f1f24: [#1] PREEMPT SMP KASAN NOPTI
KASAN: maybe wild-memory-access in range [0xf8f8f8f8f8f8f920-0xf8f8f8f8f8f8f927]
CPU: 1 PID: 9367 Comm: kworker/1:5 Not tainted 6.7.0 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358
Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
RSP: 0018:c9000631f7c8 EFLAGS: 00010a07
RAX: 1f1f1f1f1f1f1f24 RBX: f8f8f8f8f8f8f8f8 RCX: 89508644
RDX: 888051d78000 RSI: 895085dc RDI: f8f8f8f8f8f8f924
RBP: 0001 R08: 0005 R09:
R10: 0001 R11: R12: dc00
R13: 0186 R14: 888052396c00 R15: ed100a472d80
FS: () GS:88807ec0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7f42c8487d00 CR3: 4b42c000 CR4: 00750ef0
DR0: DR1: DR2: DR3:
DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 __list_add include/linux/list.h:153 [inline]
 list_add include/linux/list.h:169 [inline]
 fib6_add+0x16c4/0x4410 net/ipv6/ip6_fib.c:1490
 __ip6_ins_rt net/ipv6/route.c:1313 [inline]
 ip6_ins_rt+0xb6/0x110 net/ipv6/route.c:1323
 __ipv6_ifa_notify+0xab3/0xd30 net/ipv6/addrconf.c:6266
 ipv6_ifa_notify net/ipv6/addrconf.c:6303 [inline]
 addrconf_dad_completed+0x15f/0xef0 net/ipv6/addrconf.c:4317
 addrconf_dad_work+0x785/0x14e0 net/ipv6/addrconf.c:4260
 process_one_work+0x87b/0x15c0 kernel/workqueue.c:3226
 worker_thread+0x855/0x1200 kernel/workqueue.c:3380
 kthread+0x2cc/0x3b0 kernel/kthread.c:388
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:256
Modules linked in:
---[ end trace ]---
RIP: 0010:__fib6_update_sernum_upto_root+0xa7/0x270 net/ipv6/ip6_fib.c:1358
Code: c1 e8 03 42 80 3c 20 00 0f 85 9b 01 00 00 48 8b 1b 48 85 db 0f 84 d9 00 00 00 e8 74 70 39 f8 48 8d 7b 2c 48 89 f8 48 c1 e8 03 <42> 0f b6 14 20 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85
Code disassembly (best guess):
   0:	c1 e8 03             	shr    $0x3,%eax
   3:	42 80 3c 20 00       	cmpb   $0x0,(%rax,%r12,1)
   8:	0f 85 9b 01 00 00    	jne    0x1a9
   e:	48 8b 1b             	mov    (%rbx),%rbx
  11:	48 85 db             	test   %rbx,%rbx
  14:	0f 84 d9 00 00 00    	je     0xf3
  1a:	e8 74 70 39 f8       	call   0xf8397093
  1f:	48 8d 7b 2c          	lea    0x2c(%rbx),%rdi
  23:	48 89 f8             	mov    %rdi,%rax
  26:	48 c1 e8 03          	shr    $0x3,%rax
* 2a:	42 0f b6 14 20       	movzbl (%rax,%r12,1),%edx	<-- trapping instruction
  2f:	48 89 f8             	mov    %rdi,%rax
  32:	83 e0 07             	and    $0x7,%eax
  35:	83 c0 03             	add    $0x3,%eax
  38:	38 d0                	cmp    %dl,%al
  3a:	7c 08                	jl     0x44
  3c:	84 d2                	test   %dl,%dl
  3e:	0f                   	.byte 0xf
  3f:	85                   	.byte 0x85

Thank you for taking the time to read this email; we look forward to working
with you further.

poc.c
Description: Binary data