Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi



on 2021/4/13 22:53, Borislav Petkov wrote:
> I thought "should be better" too last time when I measured rep; movs vs
> NT stores but actual measurements showed no real difference.
Mabye the NT stores make difference when store to slow dimms, like the
persistent memory I just tested. Also, it likely reduces unnecessary cache
load and flush, and benifits the running processes which have data cached.

-- 
Best wishes
Kemeng Shi


Re:Re: [PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi



on 2021/4/13 19:01, Borislav Petkov wrote:
> + linux-nvdimm
> 
> Original mail at 
> https://lkml.kernel.org/r/3f28adee-8214-fa8e-b368-eaf8b1934...@huawei.com
> 
> On Tue, Apr 13, 2021 at 02:25:58PM +0800, Kemeng Shi wrote:
>> I'm using AEP with dax_kmem drvier, and AEP is export as a NUMA node in
> 
> What is AEP?
> 
AEP is a type of persistent memory produced by Intel. It's slower than
normal memory but is persistent.
>> my system. I will move cold pages from DRAM node to AEP node with
>> move_pages system call. With old "rep movsq', it costs 2030ms to move
>> 1 GB pages. With "movnti", it only cost about 890ms to move 1GB pages.
> 
> So there's __copy_user_nocache() which does NT stores.
> 
>> -ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
>> +ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
>> +  "jmp copy_page_nt", X86_FEATURE_XMM2
> 
> This makes every machine which has sse2 do NT stores now. Which means
> *every* machine practically.
> 
Yes. And NT stores should be better for copy_page especially copying a lot
of pages as only partial memory of copied page will be access recently.
> The folks on linux-nvdimm@ should be able to give you a better idea what
> to do.
> 
> HTH.
> 
Thanks for response and help.


[PATCH] x86: Accelerate copy_page with non-temporal in X86

2021-04-13 Thread Kemeng Shi
I'm using AEP with dax_kmem drvier, and AEP is export as a NUMA node in
my system. I will move cold pages from DRAM node to AEP node with
move_pages system call. With old "rep movsq', it costs 2030ms to move
1 GB pages. With "movnti", it only cost about 890ms to move 1GB pages.
I also test move 1GB pages from AEP node to DRAM node. But the result is
unexpected. "rep movesq" cost about 372 ms while "movnti" cost about
477ms. As said in X86 , "movnti" could avoid "polluting the caches" in
this situaction. I don't know if it's general result or just happening
in my machine. Hardware information is as follow:
CPU:
Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
DRAM:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 64 GB
Form Factor: DIMM
Set: None
Locator: DIMM130 J40
Bank Locator: _Node1_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2933 MT/s
Manufacturer: Samsung
Serial Number: 03B71EB0
Asset Tag: 1950
Part Number: M393A8G40MB2-CVF
Rank: 2
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: 
Module Manufacturer ID: Bank 1, Hex 0xCE
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 64 GB
Cache Size: None
Logical Size: None
AEP:
Memory Device
Array Handle: 0x0035
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 128 GB
Form Factor: DIMM
Set: None
Locator: DIMM131 J41
Bank Locator: _Node1_Channel3_Dimm1
Type: Logical non-volatile device
Type Detail: Synchronous Non-Volatile LRDIMM
Speed: 2666 MT/s
Manufacturer: Intel
Serial Number: 6803
Asset Tag: 1949
Part Number: NMA1XXD128GPS
Rank: 1
Configured Memory Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: Intel persistent memory
Memory Operating Mode Capability: Volatile memory
Byte-accessible persistent memory
Firmware Version: 5355
Module Manufacturer ID: Bank 1, Hex 0x89
Module Product ID: 0x0556
Memory Subsystem Controller Manufacturer ID: Bank 1, Hex 0x89
Memory Subsystem Controller Product ID: 0x097A
Non-Volatile Size: 126 GB
Volatile Size: None
Cache Size: None
Logical Size: None
Memory dimm topoloygy:
AEP
 |
DRAMDRAMDRAM
 |   |   |
 |---|---|
CPU
 |---|---|
 |   |   |
DRAMDRAMDRAM

Signed-off-by: Kemeng Shi 
---
 arch/x86/lib/copy_page_64.S | 73 -
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/lib/copy_page_64.S b/arch/x86/lib/copy_page_64.S
index 2402d4c489d2..69389b4aeeed 100644
--- a/arch/x86/lib/copy_page_64.S
+++ b/arch/x86/lib/copy_page_64.S
@@ -14,7 +14,8 @@
  */
ALIGN
 SYM_FUNC_START(copy_page)
-   ALTERNATIVE "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD
+   ALTERNATIVE_2 "jmp copy_page_regs", "", X86_FEATURE_REP_GOOD, \
+  "jmp copy_page_nt", X86_FEATURE_XMM2
movl$4096/8, %ecx
rep movsq
ret
@@ -87,3 +88,73 @@ SYM_FUNC_START_LOCAL(copy_page_regs)
addq$2*8, %rsp
ret
 SYM_FUNC_END(copy_page_regs)
+
+SYM_FUNC_START_LOCAL(copy_page_nt)
+   subq$2*8,   %rsp
+   movq%rbx,   (%rsp)
+   movq%r12,   1*8(%rsp)
+
+   movl$(4096/64)-5, %ecx
+   .p2align 4
+.LoopNT64:
+   decl%ecx
+
+   movq0x8*0(%rsi), %rax
+   movq0x8*1(%rsi), %rbx
+   movq0x8*2(%rsi), %rdx
+   movq0x8*3(%rsi), %r8
+   movq0x8*4(%rsi), %r9
+   movq0x8*5(%rsi), %r10
+   movq0x8*6(%rsi), %r11
+   movq0x8*7(%rsi), %r12
+
+   prefetcht0 5*64(%rsi)
+
+   movnti  %rax, 0x8*0(%rdi)
+   movnti  %rbx, 0x8*1(%rdi)
+   movnti  %rdx, 0x8*2(%rdi)
+   movnti  %r8,  0x8*3(%rdi)
+   movnti  %r9,  0x8*4(%rdi)
+   movnti  %r10, 0x8*5(%rdi)
+   movnti  %r11, 0x8*6(%rdi)
+   movnti  %r12, 0x8*7(%rdi)
+
+   leaq64(%rdi), %rdi
+   leaq64(%rsi), %rsi
+   jnz .LoopNT64
+
+   movl$5, %ecx
+   .p2align 4
+.LoopNT2:
+   decl%ecx
+
+   movq0x8*0(%

[tip: sched/urgent] sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()

2019-09-27 Thread tip-bot2 for KeMeng Shi
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: 714e501e16cd473538b609b3e351b2cc9f7f09ed
Gitweb:
https://git.kernel.org/tip/714e501e16cd473538b609b3e351b2cc9f7f09ed
Author:KeMeng Shi 
AuthorDate:Mon, 16 Sep 2019 06:53:28 
Committer: Ingo Molnar 
CommitterDate: Wed, 25 Sep 2019 17:42:31 +02:00

sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()

An oops can be triggered in the scheduler when running qemu on arm64:

 Unable to handle kernel paging request at virtual address 08effe40
 Internal error: Oops: 9607 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x084e3736)
 pstate: 2085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

The reproduce the crash:

  1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
  sched_setaffinity.

  2) A shell script repeatedly does "echo 0 > 
/sys/devices/system/cpu/cpu1/online"
  and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

  3) Oops appears if the invalid CPU is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Valentin Schneider 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikem...@huawei.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d9a394..83ea23e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1656,7 +1656,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_equal(p->cpus_ptr, new_mask))
goto out;
 
-   if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+   dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+   if (dest_cpu >= nr_cpu_ids) {
ret = -EINVAL;
goto out;
}
@@ -1677,7 +1678,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_test_cpu(task_cpu(p), new_mask))
goto out;
 
-   dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
if (task_running(rq, p) || p->state == TASK_WAKING) {
struct migration_arg arg = { p, dest_cpu };
/* Need help from migration thread: drop lock and wait. */


Re: Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm

2019-09-18 Thread KeMeng Shi
On 2019/9/18 at 1:08, Will Deacon wrote:
>On Tue, Sep 17, 2019 at 09:56:28AM -0400, KeMeng Shi wrote:
>>on 2019/9/17 at 5:10, Will Deacon wrote:
>>>The rough idea looks ok to me but I have two concerns:
>>>
>>>  (1) This looks like it will be visible to userspace, and this changes
>>>  the behaviour after ~8 years of not reporting this event.
>>This do bother for users who only care the comm change via prctl, but 
>>it also benefits users who want all comm changes. Maybe the best way 
>>is add something like config or switch to meet the both conditions 
>>above. In my opinion, users cares comm change event rather than how it
>>change.
>
>I was really just looking for some intuition as to how this event is currently
>used and why extending it like this is unlikely to break those existing users.

By listening these comm change events, user is able to monitor and control
specific threads that they are interested. For instance, a process control
daemon listening to proc connector and following comm value policies can
place specific threads to assigned cgroup partitions (quota from commit
 f786ecba415888 ("connector: add comm change event report to proc
 connector")).  It's harmless as user ignore the threads with names that
they are not interested.

>>>(2) What prevents proc_comm_connector(p) running concurrently with 
>>>itself  via the prctl()? The locking seems to be confined to set_task_comm().
>>To be honest, I did not consider the concurrence problem at beginning. 
>>And some comm change events may lost or are reported repeatly as follows:
>>set name via procfs   set name via prctl
>>set_task_comm
>>  set_task_comm
>>proc_comm_connector
>>  proc_comm_connector
>>Comm change event belong to procfs losts and the fresh comm change 
>>belong to prctl is reported twice. Actually, there is also concurrence 
>>problem without this update as follows:
>>set name via procfs   set name via prctl
>>  set_task_comm
>>set_task_comm
>>  proc_comm_connector
>>Comm change event from procfs is reported instead of prctl, this may 
>>bothers user who only care the comm change via prctl.
>
>Perhaps, although given that proc_comm_connector() is currently only
>called on the prctl() path, then it does at least provide the comm
>from the most recent prctl() invocation. With your path, the calls
>can go out of order, so I think that probably needs fixing.

It's indeed necessary to fix the concurrence problem. I will submit
a v2 patch when I fix this.

Thanks for your review and advise.
KeMeng Shi


Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm

2019-09-17 Thread KeMeng Shi
on 2019/9/17 at 5:10, Will Deacon wrote:
>The rough idea looks ok to me but I have two concerns:
>
>  (1) This looks like it will be visible to userspace, and this changes
>  the behaviour after ~8 years of not reporting this event.
This do bother for users who only care the comm change via prctl, but it
also benefits users who want all comm changes. Maybe the best way is add
something like config or switch to meet the both conditions above. In my
opinion, users cares comm change event rather than how it change.

>(2) What prevents proc_comm_connector(p) running concurrently with itself
> via the prctl()? The locking seems to be confined to set_task_comm().
To be honest, I did not consider the concurrence problem at beginning. And
some comm change events may lost or are reported repeatly as follows:
set name via procfs set name via prctl
set_task_comm
set_task_comm
proc_comm_connector
proc_comm_connector
Comm change event belong to procfs losts and the fresh comm change belong
to prctl is reported twice. Actually, there is also concurrence problem
without this update as follows:
set name via procfs set name via prctl
set_task_comm
set_task_comm
proc_comm_connector
Comm change event from procfs is reported instead of prctl, this may
bothers user who only care the comm change via prctl.

There is no matter if user only care the latest comm change otherwise, put
setting name and reporting event in crtical region may be the only answer.

I'm sorry the response to (1) concern is missing in last mail. This is a
full version. Apologize again for my mistake.

Thanks for review.
KeMeng Shi


Re: Re: [PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm

2019-09-16 Thread KeMeng Shi
>(2) What prevents proc_comm_connector(p) running concurrently with itself
> via the prctl()? The locking seems to be confined to set_task_comm().
To be honest, I did not consider the concurrence problem at beginning. And
some comm change events may lost or are reported repeatly as follows:
set name via procfs set name via prctl
set_task_comm
set_task_comm
proc_comm_connector
proc_comm_connector
Comm change event belong to procfs losts and the fresh comm change belong
to prctl is reported twice. Actually, there is also concurrence problem
without this update as follows:
set name via procfs set name via prctl
set_task_comm
set_task_comm
proc_comm_connector
Comm change event from procfs is reported instead of prctl, this may
bothers user who only care the comm change via prctl.

There is no matter if user only care the latest comm change otherwise, put
setting name and reporting event in crtical region may be the only answer.

Thanks for review.
KeMeng Shi


[PATCH] connector: report comm change event when modifying /proc/pid/task/tid/comm

2019-09-16 Thread KeMeng Shi
Commit f786ecba41588 ("connector: add comm change event report to proc
 connector") added proc_comm_connector to report comm change event, and
prctl will report comm change event when dealing with PR_SET_NAME case.

prctl can only set the name of the calling thread. In order to set the name
of other threads in a process, modifying /proc/self/task/tid/comm is a
general way.It's exactly how pthread_setname_np do to set name of a thread.

It's unable to get comm change event of thread if the name of thread is set
by other thread via pthread_setname_np. This update provides a chance for
application to monitor and control threads whose name is set by other
threads.

Signed-off-by: KeMeng Shi 
---
 fs/proc/base.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..34ffe572ac69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 #include "fd.h"
@@ -1549,10 +1550,12 @@ static ssize_t comm_write(struct file *file, const char 
__user *buf,
if (!p)
return -ESRCH;
 
-   if (same_thread_group(current, p))
+   if (same_thread_group(current, p)) {
set_task_comm(p, buffer);
-   else
+   proc_comm_connector(p);
+   } else {
count = -EINVAL;
+   }
 
put_task_struct(p);
 
-- 
2.19.1



[PATCH v2] sched: fix migration to invalid cpu in __set_cpus_allowed_ptr

2019-09-16 Thread KeMeng Shi
Oops occur when running qemu on arm64:
 Unable to handle kernel paging request at virtual address 08effe40
 Internal error: Oops: 9607 [#1] SMP
 Process migration/0 (pid: 12, stack limit = 0x084e3736)
 pstate: 2085 (nzCv daIf -PAN -UAO)
 pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
 lr : move_queued_task.isra.21+0x124/0x298
 ...
 Call trace:
  __ll_sc___cmpxchg_case_acq_4+0x4/0x20
  __migrate_task+0xc8/0xe0
  migration_cpu_stop+0x170/0x180
  cpu_stopper_thread+0xec/0x178
  smpboot_thread_fn+0x1ac/0x1e8
  kthread+0x134/0x138
  ret_from_fork+0x10/0x18

__set_cpus_allowed_ptr will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if 
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu of dest_cpu afterwards is overflow and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid cpu in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.

Process as follows may trigger the Oops:
1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
sched_setaffinity.
2) A shell script repeatedly "echo 0 > /sys/devices/system/cpu/cpu1/online"
and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
3) Oops appears if the invalid cpu is set in memory after tested cpumask.

Signed-off-by: KeMeng Shi 
Reviewed-by: Valentin Schneider 
---
Changes in v2:
-solve format problems in log

 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7b90bcbe4e..087f4ac30b60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1656,7 +1656,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_equal(p->cpus_ptr, new_mask))
goto out;
 
-   if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
+   dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
+   if (dest_cpu >= nr_cpu_ids) {
ret = -EINVAL;
goto out;
}
@@ -1677,7 +1678,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
if (cpumask_test_cpu(task_cpu(p), new_mask))
goto out;
 
-   dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
if (task_running(rq, p) || p->state == TASK_WAKING) {
struct migration_arg arg = { p, dest_cpu };
/* Need help from migration thread: drop lock and wait. */
-- 
2.19.1