Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86
on 2021/4/13 22:53, Borislav Petkov wrote:
> I thought "should be better" too last time when I measured rep; movs vs
> NT stores but actual measurements showed no real difference.
Maybe the NT stores make a difference when storing to slow DIMMs, like the
persistent memory I just tested. NT stores also likely reduce unnecessary
cache loads and flushes, which benefits running processes that have data
cached.

--
Best wishes
Kemeng Shi
Re: Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86
on 2021/4/13 19:01, Borislav Petkov wrote:
> + linux-nvdimm
>
> Original mail at
> https://lkml.kernel.org/r/3f28adee-8214-fa8e-b368-eaf8b1934...@huawei.com
>
> On Tue, Apr 13, 2021 at 02:25:58PM +0800, Kemeng Shi wrote:
>> I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node in
>
> What is AEP?
>
AEP is a type of persistent memory produced by Intel. It's slower than
normal memory but is persistent.
>> my system. I will move cold pages from the DRAM node to the AEP node with
>> the move_pages system call. With the old "rep movsq", it costs 2030ms to
>> move 1GB of pages. With "movnti", it only costs about 890ms to move 1GB
>> of pages.
>
> So there's __copy_user_nocache() which does NT stores.
>
>> -ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
>> +ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
>> +	"jmp copy_page_nt", X86_FEATURE_XMM2
>
> This makes every machine which has sse2 do NT stores now. Which means
> *every* machine practically.
>
Yes. And NT stores should be better for copy_page, especially when copying
a lot of pages, since only part of the copied pages' memory will be
accessed again soon.
> The folks on linux-nvdimm@ should be able to give you a better idea what
> to do.
>
> HTH.
>
Thanks for the response and help.
[PATCH] x86: Accelerate copy_page with non-temporal in X86
I'm using AEP with the dax_kmem driver, and AEP is exported as a NUMA node
in my system. I will move cold pages from the DRAM node to the AEP node
with the move_pages system call. With the old "rep movsq", it costs 2030ms
to move 1GB of pages. With "movnti", it only costs about 890ms to move 1GB
of pages.

I also tested moving 1GB of pages from the AEP node to the DRAM node, but
the result is unexpected: "rep movsq" costs about 372ms while "movnti"
costs about 477ms, even though, as the X86 manuals say, "movnti" could
avoid "polluting the caches" in this situation. I don't know if this is a
general result or just happens on my machine.

Hardware information is as follows:

CPU:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz

DRAM:
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 64 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM130 J40
	Bank Locator: _Node1_Channel3_Dimm0
	Type: DDR4
	Type Detail: Synchronous Registered (Buffered)
	Speed: 2933 MT/s
	Manufacturer: Samsung
	Serial Number: 03B71EB0
	Asset Tag: 1950
	Part Number: M393A8G40MB2-CVF
	Rank: 2
	Configured Memory Speed: 2666 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version:
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 64 GB
	Cache Size: None
	Logical Size: None

AEP:
Memory Device
	Array Handle: 0x0035
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 128 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM131 J41
	Bank Locator: _Node1_Channel3_Dimm1
	Type: Logical non-volatile device
	Type Detail: Synchronous Non-Volatile LRDIMM
	Speed: 2666 MT/s
	Manufacturer: Intel
	Serial Number: 6803
	Asset Tag: 1949
	Part Number: NMA1XXD128GPS
	Rank: 1
	Configured Memory Speed: 2666 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: Intel persistent memory
	Memory Operating Mode Capability: Volatile memory Byte-accessible persistent memory
	Firmware Version: 5355
	Module Manufacturer ID: Bank 1, Hex 0x89
	Module Product ID: 0x0556
	Memory Subsystem Controller Manufacturer ID: Bank 1, Hex 0x89
	Memory Subsystem Controller Product ID: 0x097A
	Non-Volatile Size: 126 GB
	Volatile Size: None
	Cache Size: None
	Logical Size: None

Memory DIMM topology:
	      AEP
	       |
	DRAM DRAM DRAM
	  |    |    |
	|---|---| CPU |---|---|
	  |    |    |
	DRAM DRAM DRAM

Signed-off-by: Kemeng Shi
---
 arch/x86/lib/copy_page_64.S | 73 +-
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 2402d4c489d2..69389b4aeeed 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -14,7 +14,8 @@
  */
 	ALIGN
 SYM_FUNC_START(copy_page)
-	ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
+	ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
+		"jmp copy_page_nt", X86_FEATURE_XMM2
 	movl	$4096/8, %ecx
 	rep	movsq
 	ret
@@ -87,3 +88,73 @@ SYM_FUNC_START_LOCAL(copy_page_regs)
 	addq	$2*8, %rsp
 	ret
 SYM_FUNC_END(copy_page_regs)
+
+SYM_FUNC_START_LOCAL(copy_page_nt)
+	subq	$2*8, %rsp
+	movq	%rbx, (%rsp)
+	movq	%r12, 1*8(%rsp)
+
+	movl	$(4096/64)-5, %ecx
+	.p2align 4
+.LoopNT64:
+	decl	%ecx
+
+	movq	0x8*0(%rsi), %rax
+	movq	0x8*1(%rsi), %rbx
+	movq	0x8*2(%rsi), %rdx
+	movq	0x8*3(%rsi), %r8
+	movq	0x8*4(%rsi), %r9
+	movq	0x8*5(%rsi), %r10
+	movq	0x8*6(%rsi), %r11
+	movq	0x8*7(%rsi), %r12
+
+	prefetcht0 5*64(%rsi)
+
+	movnti	%rax, 0x8*0(%rdi)
+	movnti	%rbx, 0x8*1(%rdi)
+	movnti	%rdx, 0x8*2(%rdi)
+	movnti	%r8, 0x8*3(%rdi)
+	movnti	%r9, 0x8*4(%rdi)
+	movnti	%r10, 0x8*5(%rdi)
+	movnti	%r11, 0x8*6(%rdi)
+	movnti	%r12, 0x8*7(%rdi)
+
+	leaq	64(%rdi), %rdi
+	leaq	64(%rsi), %rsi
+	jnz	.LoopNT64
+
+	movl	$5, %ecx
+	.p2align 4
+.LoopNT2:
+	decl	%ecx
+
+	movq	0x8*0(%
[tip: sched/urgent] sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     714e501e16cd473538b609b3e351b2cc9f7f09ed
Gitweb:        https://git.kernel.org/tip/714e501e16cd473538b609b3e351b2cc9f7f09ed
Author:        KeMeng Shi
AuthorDate:    Mon, 16 Sep 2019 06:53:28
Committer:     Ingo Molnar
CommitterDate: Wed, 25 Sep 2019 17:42:31 +02:00

sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()

An oops can be triggered in the scheduler when running qemu on arm64:

 Unable to handle kernel paging request at virtual address 08effe40
 Internal error: Oops: 9607 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x084e3736)
 pstate: 2085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity
mask to migrate the process to if the process is not currently running on
any of the CPUs specified in the affinity mask. It will choose an invalid
dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the CPUs
in the affinity mask are deactivated by cpu_down after the
cpumask_intersects() check. The subsequent cpumask_test_cpu() of dest_cpu
then indexes past the end of the cpumask and may pass if the corresponding
bit is coincidentally set. As a consequence, the kernel will access an
invalid rq address associated with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

To reproduce the crash:

 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
    sched_setaffinity.
 2) A shell script repeatedly does
    "echo 0 > /sys/devices/system/cpu/cpu1/online" and
    "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
 3) The Oops appears if the invalid CPU bit is set in memory past the
    tested cpumask.

Signed-off-by: KeMeng Shi
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikem...@huawei.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d9a394..83ea23e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1656,7 +1656,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_equal(p->cpus_ptr, new_mask))
 		goto out;

-	if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1677,7 +1678,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;

-	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
 	if (task_running(rq, p) || p->state == TASK_WAKING) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
Re: Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm
On 2019/9/18 at 1:08, Will Deacon wrote:
>On Tue, Sep 17, 2019 at 09:56:28AM -0400, KeMeng Shi wrote:
>>on 2019/9/17 at 5:10, Will Deacon wrote:
>>>The rough idea looks ok to me but I have two concerns:
>>>
>>> (1) This looks like it will be visible to userspace, and this changes
>>>     the behaviour after ~8 years of not reporting this event.
>>This does bother users who only care about comm changes made via prctl,
>>but it also benefits users who want all comm changes. Maybe the best way
>>is to add something like a config option or switch to satisfy both
>>groups. In my opinion, users care about the comm change event rather than
>>how the change was made.
>
>I was really just looking for some intuition as to how this event is
>currently used and why extending it like this is unlikely to break those
>existing users.

By listening to these comm change events, a user is able to monitor and
control the specific threads they are interested in. For instance, a
process control daemon listening to the proc connector and following comm
value policies can place specific threads into assigned cgroup partitions
(quoted from commit f786ecba415888 ("connector: add comm change event
report to proc connector")). It's harmless because users can ignore
threads whose names they are not interested in.

>>>(2) What prevents proc_comm_connector(p) running concurrently with
>>>    itself via the prctl()? The locking seems to be confined to
>>>    set_task_comm().
>>To be honest, I did not consider the concurrency problem at the
>>beginning. Some comm change events may be lost or reported repeatedly,
>>as follows:
>>set name via procfs         set name via prctl
>>set_task_comm
>>                            set_task_comm
>>proc_comm_connector
>>                            proc_comm_connector
>>The comm change event belonging to procfs is lost and the fresh comm
>>change belonging to prctl is reported twice.
>>Actually, there is also a concurrency problem without this update, as
>>follows:
>>set name via procfs         set name via prctl
>>                            set_task_comm
>>set_task_comm
>>                            proc_comm_connector
>>The comm change event from procfs is reported instead of the one from
>>prctl; this may bother users who only care about comm changes made via
>>prctl.
>
>Perhaps, although given that proc_comm_connector() is currently only
>called on the prctl() path, then it does at least provide the comm
>from the most recent prctl() invocation. With your patch, the calls
>can go out of order, so I think that probably needs fixing.

It's indeed necessary to fix the concurrency problem. I will submit a v2
patch when I fix this. Thanks for your review and advice.

KeMeng Shi
Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm
on 2019/9/17 at 5:10, Will Deacon wrote:
>The rough idea looks ok to me but I have two concerns:
>
> (1) This looks like it will be visible to userspace, and this changes
>     the behaviour after ~8 years of not reporting this event.

This does bother users who only care about comm changes made via prctl,
but it also benefits users who want all comm changes. Maybe the best way
is to add something like a config option or switch to satisfy both groups.
In my opinion, users care about the comm change event rather than how the
change was made.

>(2) What prevents proc_comm_connector(p) running concurrently with itself
>    via the prctl()? The locking seems to be confined to set_task_comm().

To be honest, I did not consider the concurrency problem at the beginning.
Some comm change events may be lost or reported repeatedly, as follows:

set name via procfs         set name via prctl
set_task_comm
                            set_task_comm
proc_comm_connector
                            proc_comm_connector

The comm change event belonging to procfs is lost and the fresh comm
change belonging to prctl is reported twice. Actually, there is also a
concurrency problem without this update, as follows:

set name via procfs         set name via prctl
                            set_task_comm
set_task_comm
                            proc_comm_connector

The comm change event from procfs is reported instead of the one from
prctl, which may bother users who only care about comm changes made via
prctl. This does not matter if users only care about the latest comm
change; otherwise, putting the name setting and the event reporting in a
critical region may be the only answer.

I'm sorry the response to concern (1) was missing in my last mail; this is
the full version. Apologies again for my mistake. Thanks for the review.

KeMeng Shi
Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm
>(2) What prevents proc_comm_connector(p) running concurrently with itself
>    via the prctl()? The locking seems to be confined to set_task_comm().

To be honest, I did not consider the concurrency problem at the beginning.
Some comm change events may be lost or reported repeatedly, as follows:

set name via procfs         set name via prctl
set_task_comm
                            set_task_comm
proc_comm_connector
                            proc_comm_connector

The comm change event belonging to procfs is lost and the fresh comm
change belonging to prctl is reported twice. Actually, there is also a
concurrency problem without this update, as follows:

set name via procfs         set name via prctl
                            set_task_comm
set_task_comm
                            proc_comm_connector

The comm change event from procfs is reported instead of the one from
prctl, which may bother users who only care about comm changes made via
prctl. This does not matter if users only care about the latest comm
change; otherwise, putting the name setting and the event reporting in a
critical region may be the only answer.

Thanks for the review.

KeMeng Shi
[PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm
Commit f786ecba41588 ("connector: add comm change event report to proc
connector") added proc_comm_connector to report comm change events, and
prctl reports a comm change event when handling the PR_SET_NAME case.

prctl can only set the name of the calling thread. In order to set the
name of another thread in a process, modifying /proc/self/task/tid/comm
is the usual way; it's exactly what pthread_setname_np does to set the
name of a thread. Currently, no comm change event is generated when a
thread's name is set by another thread via pthread_setname_np. This
update gives applications a chance to monitor and control threads whose
names are set by other threads.

Signed-off-by: KeMeng Shi
---
 fs/proc/base.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..34ffe572ac69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "internal.h"
 #include "fd.h"
@@ -1549,10 +1550,12 @@ static ssize_t comm_write(struct file *file, const char __user *buf,
 	if (!p)
 		return -ESRCH;

-	if (same_thread_group(current, p))
+	if (same_thread_group(current, p)) {
 		set_task_comm(p, buffer);
-	else
+		proc_comm_connector(p);
+	} else {
 		count = -EINVAL;
+	}

 	put_task_struct(p);
--
2.19.1
[PATCH v2] sched: fix migration to invalid cpu in __set_cpus_allowed_ptr
An Oops occurs when running qemu on arm64:

 Unable to handle kernel paging request at virtual address 08effe40
 Internal error: Oops: 9607 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x084e3736)
 pstate: 2085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr will choose an active dest_cpu in the affinity mask
to migrate the process to if the process is not currently running on any
of the CPUs specified in the affinity mask. It will choose an invalid
dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the CPUs
in the affinity mask are deactivated by cpu_down after the
cpumask_intersects check. The subsequent cpumask_test_cpu of dest_cpu then
indexes past the end of the cpumask and may pass if the corresponding bit
is coincidentally set. As a consequence, the kernel will access an invalid
rq address associated with the invalid cpu in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The following process may trigger the Oops:
1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
   sched_setaffinity.
2) A shell script repeatedly does
   "echo 0 > /sys/devices/system/cpu/cpu1/online" and
   "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
3) The Oops appears if the invalid cpu bit is set in memory past the
   tested cpumask.

Signed-off-by: KeMeng Shi
Reviewed-by: Valentin Schneider
---
Changes in v2:
-solve format problems in log

 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7b90bcbe4e..087f4ac30b60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1656,7 +1656,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_equal(p->cpus_ptr, new_mask))
 		goto out;

-	if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+	if (dest_cpu >= nr_cpu_ids) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -1677,7 +1678,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
 	if (cpumask_test_cpu(task_cpu(p), new_mask))
 		goto out;

-	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
 	if (task_running(rq, p) || p->state == TASK_WAKING) {
 		struct migration_arg arg = { p, dest_cpu };
 		/* Need help from migration thread: drop lock and wait. */
--
2.19.1