[PATCH] mm/page_alloc: fix documentation error
Cc: Andrew Morton
Cc: Matthew Wilcox
Cc: Rafael Aquini
Cc: Fabrizio D'Angelo
Cc: linux...@kvack.org
Cc: triv...@kernel.org

When I increased the upper bound of the min_free_kbytes value in
ee8eb9a5fe863, I forgot to tweak the above comment to reflect the new
value. This patch fixes that mistake.

Signed-off-by: Joel Savitz

---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..e028b87ce294 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min). For large machines
- * we want it large (64MB max). But it is not linear, because network
+ * we want it large (256MB max). But it is not linear, because network
  * bandwidth does not increase linearly with machine size. We use
  *
  *	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
-- 
2.23.0
[PATCH v2] mm/page_alloc: fix documentation error and remove magic numbers
When I increased the upper bound of the min_free_kbytes value in
ee8eb9a5fe863, I forgot to tweak the above comment to reflect the new
value. This patch fixes that mistake.

In addition, this patch replaces the magic number bounds with symbolic
constants to clarify the logic.

changes from v1:
- declare constants via enum instead of separate integers

Suggested-by: John Hubbard
Suggested-by: Vlastimil Babka
Signed-off-by: Joel Savitz

---
 mm/page_alloc.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..733c81678b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min). For large machines
- * we want it large (64MB max). But it is not linear, because network
+ * we want it large (256MB max). But it is not linear, because network
  * bandwidth does not increase linearly with machine size. We use
  *
  *	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
@@ -7852,6 +7852,8 @@ void setup_per_zone_wmarks(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
+static enum { MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7, MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18 };
+
 int __meminit init_per_zone_wmark_min(void)
 {
 	unsigned long lowmem_kbytes;
@@ -7862,10 +7864,10 @@ int __meminit init_per_zone_wmark_min(void)
 
 	if (new_min_free_kbytes > user_min_free_kbytes) {
 		min_free_kbytes = new_min_free_kbytes;
-		if (min_free_kbytes < 128)
-			min_free_kbytes = 128;
-		if (min_free_kbytes > 262144)
-			min_free_kbytes = 262144;
+		if (min_free_kbytes < MIN_FREE_KBYTES_LOWER_BOUND)
+			min_free_kbytes = MIN_FREE_KBYTES_LOWER_BOUND;
+		if (min_free_kbytes > MIN_FREE_KBYTES_UPPER_BOUND)
+			min_free_kbytes = MIN_FREE_KBYTES_UPPER_BOUND;
 	} else {
 		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
 			new_min_free_kbytes, user_min_free_kbytes);
-- 
2.23.0
[PATCH] mm/page_alloc: fix documentation error and remove magic numbers
When I increased the upper bound of the min_free_kbytes value in
ee8eb9a5fe863, I forgot to tweak the above comment to reflect the new
value. This patch fixes that mistake.

In addition, this patch replaces the magic number bounds with symbolic
constants to clarify the logic.

Suggested-by: John Hubbard
Suggested-by: Vlastimil Babka
Signed-off-by: Joel Savitz

---
 mm/page_alloc.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..f725addc2748 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min). For large machines
- * we want it large (64MB max). But it is not linear, because network
+ * we want it large (256MB max). But it is not linear, because network
  * bandwidth does not increase linearly with machine size. We use
  *
  *	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
@@ -7852,6 +7852,9 @@ void setup_per_zone_wmarks(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
+static const int MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7;
+static const int MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18;
+
 int __meminit init_per_zone_wmark_min(void)
 {
 	unsigned long lowmem_kbytes;
@@ -7862,10 +7865,10 @@ int __meminit init_per_zone_wmark_min(void)
 
 	if (new_min_free_kbytes > user_min_free_kbytes) {
 		min_free_kbytes = new_min_free_kbytes;
-		if (min_free_kbytes < 128)
-			min_free_kbytes = 128;
-		if (min_free_kbytes > 262144)
-			min_free_kbytes = 262144;
+		if (min_free_kbytes < MIN_FREE_KBYTES_LOWER_BOUND)
+			min_free_kbytes = MIN_FREE_KBYTES_LOWER_BOUND;
+		if (min_free_kbytes > MIN_FREE_KBYTES_UPPER_BOUND)
+			min_free_kbytes = MIN_FREE_KBYTES_UPPER_BOUND;
 	} else {
 		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
 			new_min_free_kbytes, user_min_free_kbytes);
-- 
2.23.0
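The initialisation logic that these bounds clamp can be sketched in userspace. This is a hedged re-creation, not the kernel code: the function names (`int_sqrt_ul`, `compute_min_free_kbytes`) are illustrative, and the constants mirror the patched bounds of 128 kB and 256 MB (262144 kB).

```c
#include <assert.h>

/* Bounds mirrored from the patch: 128 kB floor, 256 MB ceiling. */
enum {
	MIN_FREE_KBYTES_FLOOR   = 1 << 7,	/* 128 kB */
	MIN_FREE_KBYTES_CEILING = 1 << 18,	/* 262144 kB == 256 MB */
};

/* Bit-by-bit integer square root, analogous to the kernel's int_sqrt(). */
static unsigned long int_sqrt_ul(unsigned long x)
{
	unsigned long r = 0;
	unsigned long bit = 1UL << (sizeof(unsigned long) * 8 - 2);

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* min_free_kbytes = 4 * sqrt(lowmem_kbytes) == sqrt(16 * lowmem_kbytes),
 * clamped to the bounds above, as documented in the patched comment. */
unsigned long compute_min_free_kbytes(unsigned long lowmem_kbytes)
{
	unsigned long v = int_sqrt_ul(lowmem_kbytes * 16);

	if (v < MIN_FREE_KBYTES_FLOOR)
		v = MIN_FREE_KBYTES_FLOOR;
	if (v > MIN_FREE_KBYTES_CEILING)
		v = MIN_FREE_KBYTES_CEILING;
	return v;
}
```

For a 16 MB lowmem machine (16384 kB) this yields 512 kB and for 128 MB it yields 1448 kB, matching the table in the comment block the patch touches.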
[PATCH] fs: proc: Clarify warnings for invalid proc dir names
When one attempts to create a directory in /proc with an invalid name,
such as one in a subdirectory that doesn't exist, one with a name beyond
256 characters, or a reserved name such as '.' or '..', the kernel
throws a warning message that looks like this:

[ 7913.252558] name 'invalid_name'

This warning message is nearly the same for all invalid cases, including
the removal of a nonexistent directory.

This patch clarifies the warning message and differentiates the invalid
creation/removal cases so as to allow the user to more quickly
understand their mistake.

Signed-off-by: Fabrizio D'Angelo
Signed-off-by: Joel Savitz

---
 fs/proc/generic.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 64e9ee1b129e..df04fd4f02af 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -173,7 +173,7 @@ static int __xlate_proc_name(const char *name, struct proc_dir_entry **ret,
 		len = next - cp;
 		de = pde_subdir_find(de, cp, len);
 		if (!de) {
-			WARN(1, "name '%s'\n", name);
+			WARN(1, "invalid proc dir name '%s'\n", name);
 			return -ENOENT;
 		}
 		cp += len + 1;
@@ -386,15 +386,15 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 	qstr.name = fn;
 	qstr.len = strlen(fn);
 	if (qstr.len == 0 || qstr.len >= 256) {
-		WARN(1, "name len %u\n", qstr.len);
+		WARN(1, "invalid proc dir name len %u\n", qstr.len);
 		return NULL;
 	}
 	if (qstr.len == 1 && fn[0] == '.') {
-		WARN(1, "name '.'\n");
+		WARN(1, "invalid proc dir name '.'\n");
 		return NULL;
 	}
 	if (qstr.len == 2 && fn[0] == '.' && fn[1] == '.') {
-		WARN(1, "name '..'\n");
+		WARN(1, "invalid proc dir name '..'\n");
 		return NULL;
 	}
 	if (*parent == &proc_root && name_to_int(&qstr) != ~0U) {
@@ -402,7 +402,7 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
 		return NULL;
 	}
 	if (is_empty_pde(*parent)) {
-		WARN(1, "attempt to add to permanently empty directory");
+		WARN(1, "attempt to add to permanently empty directory in proc");
 		return NULL;
 	}
@@ -670,7 +670,7 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
 		rb_erase(&de->subdir_node, &parent->subdir);
 	write_unlock(&proc_subdir_lock);
 	if (!de) {
-		WARN(1, "name '%s'\n", name);
+		WARN(1, "unable to remove nonexistent proc dir '%s'\n", name);
 		return;
 	}
-- 
2.23.0
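The validity checks the patch touches can be mirrored in a small userspace sketch. This is hypothetical illustration code, not `__proc_create()` itself: `proc_name_error` is an invented name, and it returns the (patched) diagnostic each invalid case would log, or NULL for an acceptable name component.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the name checks in __proc_create(): empty or
 * over-long names, and the reserved '.' and '..' entries, are rejected
 * with a message that now says what was actually wrong. */
const char *proc_name_error(const char *fn)
{
	size_t len = strlen(fn);

	if (len == 0 || len >= 256)
		return "invalid proc dir name len";
	if (len == 1 && fn[0] == '.')
		return "invalid proc dir name '.'";
	if (len == 2 && fn[0] == '.' && fn[1] == '.')
		return "invalid proc dir name '..'";
	return NULL;	/* name component is acceptable */
}
```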
[PATCH] mm: fix typo in comment
Fix spelling of successful (currently spelled successfull in kernel
source).

Signed-off-by: Joel Savitz

---
 include/linux/compaction.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..4122df65eb44 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -38,12 +38,12 @@ enum compact_result {
 	COMPACT_CONTINUE,
 	/*
-	 * The full zone was compacted scanned but wasn't successfull to compact
+	 * The full zone was compacted scanned but wasn't successful to compact
 	 * suitable pages.
 	 */
 	COMPACT_COMPLETE,
 	/*
-	 * direct compaction has scanned part of the zone but wasn't successfull
+	 * direct compaction has scanned part of the zone but wasn't successful
 	 * to compact suitable pages.
 	 */
 	COMPACT_PARTIAL_SKIPPED,
-- 
2.18.1
Re: [PATCH v2] PCI: rpaphp: Avoid a sometimes-uninitialized warning
> On Mon, Jun 03, 2019 at 03:11:58PM -0700, Nathan Chancellor wrote:
> > When building with -Wsometimes-uninitialized, clang warns:
> >
> > drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is
> > used uninitialized whenever 'for' loop exits because its condition is
> > false [-Wsometimes-uninitialized]
> >         for (j = 0; j < entries; j++) {
> >                     ^~~
> > drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs
> > here
> >         if (fndit)
> >             ^
> > drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if
> > it is always true
> >         for (j = 0; j < entries; j++) {
> >                     ^~~
> > drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable
> > 'fndit' to silence this warning
> >         int j, fndit;
> >                     ^
> >                      = 0
> >
> > fndit is only used to gate a sprintf call, which can be moved into the
> > loop to simplify the code and eliminate the local variable, which will
> > fix this warning.
> >
> > Link: https://github.com/ClangBuiltLinux/linux/issues/504
> > Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property")
> > Suggested-by: Nick Desaulniers
> > Signed-off-by: Nathan Chancellor
> > ---
> >
> > v1 -> v2:
> >
> > * Eliminate fndit altogether by shuffling the sprintf call into the for
> >   loop and changing the if conditional, as suggested by Nick.
> >
> >  drivers/pci/hotplug/rpaphp_core.c | 18 +++++++-----------
> >  1 file changed, 7 insertions(+), 11 deletions(-)
>
> Gentle ping, can someone pick this up?

Looks like a good simplification of somewhat convoluted control flow.

Acked-by: Joel Savitz
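The refactor being acked follows a general pattern worth noting: instead of recording a hit in a "found" flag and testing the flag after the loop (the shape that trips -Wsometimes-uninitialized, since the flag is only written on the hit path), act at the match site and make fall-through the explicit not-found path. A minimal sketch of the pattern with invented data — not the rpaphp_core.c code itself:

```c
#include <assert.h>

/* Return the index of the first element divisible by k, or -1.
 * There is no 'fndit'-style flag: the match site returns directly,
 * and exhausting the loop is the unambiguous not-found case. */
int index_of_first_multiple(const int *a, int n, int k)
{
	for (int i = 0; i < n; i++) {
		if (a[i] % k == 0)
			return i;	/* act where the match happens */
	}
	return -1;	/* loop exhausted: explicit not-found result */
}
```

The compiler can no longer see a path where an uninitialized value is read, and the control flow is shorter as a side effect.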
Re: [PATCH v4] fs/proc: add VmTaskSize field to /proc/$$/status
The most immediate use case is the optimization of an internal test,
but upon closer examination neither this patch nor the test itself
turn out to be worth pursuing.

Thank you for your time and constructive comments.

Best,
Joel Savitz

On Thu, Jun 13, 2019 at 3:30 PM Andrew Morton wrote:
>
> On Thu, 13 Jun 2019 10:54:50 -0400 Joel Savitz wrote:
>
> > The kernel provides no architecture-independent mechanism to get the
> > size of the virtual address space of a task (userspace process) without
> > brute-force calculation. This patch allows a user to easily retrieve
> > this value via a new VmTaskSize entry in /proc/$$/status.
>
> Why is access to ->task_size required? Please fully describe the
> use case.
>
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> >    VmLib:	1412 kB
> >    VmPTE:	20 kb
> >    VmSwap:	0 kB
> > +  VmTaskSize:	137438953468 kB
> >    HugetlbPages:	0 kB
> >    CoreDumping:	0
> >    THP_enabled:	1
> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> >  VmPTE                       size of page table entries
> >  VmSwap                      amount of swap used by anonymous private data
> >                              (shmem swap usage is not included)
> > + VmTaskSize                  size of task (userspace process) vm space
>
> This is rather vague. Is it the total amount of physical memory? The
> sum of all vma sizes, populated or otherwise? Something else?
Re: [PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
I just did a quick test on a patched kernel to check on that "no longer
affine to..." message:

# nproc
64
# taskset -p 4 $$
pid 2261's current affinity mask:
pid 2261's new affinity mask: 4
# echo off > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask: fffb
# echo on > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask:
# dmesg | tail -5
[  143.996375] process 2261 (bash) no longer affine to cpu2
[  143.996657] IRQ 114: no longer affine to CPU2
[  144.007472] IRQ 227: no longer affine to CPU2
[  144.013460] smpboot: CPU 2 is now offline
[  162.685519] smpboot: Booting Node 0 Processor 2 APIC 0x4

dmesg output is observably the same on patched and unpatched kernels in
this case. The only difference in output is that on an unpatched
kernel, the last `taskset -p $$` outputs:

pid 2274's current affinity mask: fffb

which is the behavior that this patch aims to modify.

This case, which I believe is generalizable, demonstrates that we
retain the "no longer affine to..." output on a kernel with this patch.

Best,
Joel Savitz

On Thu, Jun 13, 2019 at 8:02 AM Michal Koutný wrote:
>
> On Tue, May 28, 2019 at 02:10:37PM +0200, Michal Koutný wrote:
> > Although, on v1 we will lose the "no longer affine to..." message
> > (which is what happens in your demo IIUC).
> FWIW, I was wrong, off by one 'state' transition. So the patch doesn't
> cause change in messaging (not tested though).
[PATCH v4] fs/proc: add VmTaskSize field to /proc/$$/status
The kernel provides no architecture-independent mechanism to get the
size of the virtual address space of a task (userspace process) without
brute-force calculation. This patch allows a user to easily retrieve
this value via a new VmTaskSize entry in /proc/$$/status.

Signed-off-by: Joel Savitz

---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c                 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..1c6a912e3975 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -187,6 +187,7 @@ read the file /proc/PID/status:
   VmLib:	1412 kB
   VmPTE:	20 kb
   VmSwap:	0 kB
+  VmTaskSize:	137438953468 kB
   HugetlbPages:	0 kB
   CoreDumping:	0
   THP_enabled:	1
@@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  VmPTE                       size of page table entries
  VmSwap                      amount of swap used by anonymous private data
                             (shmem swap usage is not included)
+ VmTaskSize                  size of task (userspace process) vm space
 HugetlbPages                size of hugetlb memory portions
 CoreDumping                 process's memory is currently being dumped
                             (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95ca1fe7283c..0af7081f7b19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	seq_put_decimal_ull_width(m,
 		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
 	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+	seq_put_decimal_ull_width(m,
+		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
 	seq_puts(m, " kB\n");
 	hugetlb_report_usage(m, mm);
 }
-- 
2.18.1
[RESEND PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill
In the event of an oom kill, useful information about the killed
process is printed to dmesg. Users, especially system administrators,
will find it useful to immediately see the UID of the process.

In the following example, abuse_the_ram is the name of a program that
attempts to iteratively allocate all available memory until it is
stopped by force.

Current message:

Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB,
anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB

Patched message:

Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB,
anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID 0

Suggested-by: David Rientjes
Signed-off-by: Joel Savitz

---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..af2e3faa72a0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
-	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID %d\n",
 		message, task_pid_nr(victim), victim->comm,
 		K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+		from_kuid(&init_user_ns, task_uid(victim)));
 	task_unlock(victim);
 
 	/*
-- 
2.18.1
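The patched format string can be exercised in userspace with snprintf. This is a sketch only: `format_oom_msg` is an invented helper, and the numbers fed to it below are made up rather than taken from a real oom kill.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the patched oom-kill message into buf; mirrors the pr_err()
 * format string from the patch, with the UID appended at the end. */
int format_oom_msg(char *buf, size_t n, int pid, const char *comm,
		   unsigned long total_vm_kb, unsigned long anon_rss_kb,
		   unsigned long file_rss_kb, unsigned long shmem_rss_kb,
		   int uid)
{
	return snprintf(buf, n,
		"Out of memory: Killed process %d (%s) total-vm:%lukB, "
		"anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID %d",
		pid, comm, total_vm_kb, anon_rss_kb, file_rss_kb,
		shmem_rss_kb, uid);
}
```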
[RESEND PATCH v3] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those
are subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

# grep -i cpu /proc/$$/status
Cpus_allowed:	,fff
Cpus_allowed_list:	0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

# taskset -p 4 $$
pid 2272's current affinity mask:
pid 2272's new affinity mask: 4

# echo off > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -3
[ 2195.866089] process 2272 (bash) no longer affine to cpu2
[ 2195.872700] IRQ 114: no longer affine to CPU2
[ 2195.879128] smpboot: CPU 2 is now offline

# echo on > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -1
[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4

We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

# grep -i cpu /proc/$$/status
Cpus_allowed:	,fffb
Cpus_allowed_list:	0-1,3-63

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on the
cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for
instance a process with an affinity mask containing only cpus provided
by smt at the moment that smt is toggled, in a configuration such as
the following:

# taskset -p f0 $$
# grep -i cpu /proc/$$/status
Cpus_allowed:	00f0,
Cpus_allowed_list:	36-39

A double toggle of smt results in the following behavior:

# echo off > /sys/devices/system/cpu/smt/control
# echo on > /sys/devices/system/cpu/smt/control
# grep -i cpus /proc/$$/status
Cpus_allowed:	ff00,
Cpus_allowed_list:	0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

# grep -i cpu /proc/$$/status
Cpus_allowed:	,
Cpus_allowed_list:	0-63

The original policy is discarded. Though not ideal, it is the simplest
way to restore sanity to this fallback case without reinventing the
cpuset wheel that rolls down the kernel just fine in cgroup v2. A user
who wishes for the previous affinity mask to be restored in this
fallback case can use that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long
Suggested-by: Phil Auld
Signed-off-by: Joel Savitz

---
 kernel/cgroup/cpuset.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
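The one-line behavioral change at the heart of the patch — pick the cpuset's stored policy on cgroup v2, but every possible cpu in legacy mode where effective_cpus is untrustworthy after hotplug — can be modeled as a toy function. The names and the `unsigned long` stand-in for a cpumask are illustrative, not kernel types.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fallback selection: on v2, the cpuset's own
 * cpus_allowed policy is a sane fallback; on v1/legacy, effective_cpus
 * tracks hotplug and may exclude the policy cpus entirely, so the mask
 * of all possible cpus is used instead. */
unsigned long fallback_cpus(bool in_v2_mode,
			    unsigned long cpuset_cpus_allowed,
			    unsigned long cpu_possible_mask)
{
	return in_v2_mode ? cpuset_cpus_allowed : cpu_possible_mask;
}
```

With the demo from the commit message (a task pinned to cpu 2, mask 0x4, on a machine whose possible mask is 0xff), v2 mode restores the pinned mask while legacy mode widens to all cpus — exactly the two post-patch end states described above.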
Re: [PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
On Tue, May 21, 2019 at 10:35 AM Michal Koutný wrote:
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:	ff
> > Cpus_allowed_list:	0-7
> (a)
>
> > $ taskset -p 4 $$
> > pid 19202's current affinity mask: f
> I'm confused where this value comes from, I must be missing something.
>
> Joel, is the task in question put into a cpuset with 0xf CPUs only (at
> point (a))? Or are the CPUs 4-7 offlined as well?

Good point. It is a bit ambiguous, but I performed no action on the
task's cpuset nor did I offline any cpus at point (a).

After a bit of research, I am fairly certain that the observed
discrepancy is due to differing mechanisms used to acquire the cpuset
mask value.

The first mechanism, via `grep Cpus /proc/$$/status`, has its value
populated by the expression (task->cpus_allowed) in
fs/proc/array.c:sched_getaffinity(), whereas the taskset utility
(https://github.com/karelzak/util-linux/blob/master/schedutils/taskset.c)
uses sched_getaffinity(2) to determine the "current affinity mask"
value from the expression (task->cpus_allowed & cpu_active_mask) in
kernel/sched/core.c:sched_getaffinity().

I do not know if there is an explicit reason for this discrepancy or
whether the two mechanisms were simply built independently, perhaps for
different purposes. I think the /proc/$$/status value is intended to
simply reflect the user-specified policy stating which cpus the task is
allowed to run on without consideration for hardware state, whereas the
taskset value is representative of the cpus that the task can actually
be run on given the restriction policy specified by the user via the
cpuset mechanism.

By the way, I posted a v2 of this patch that correctly handles cgroup
v2 behavior.
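The syscall side of the discrepancy discussed here can be probed from userspace. This sketch (the function name is invented) queries sched_getaffinity(2) for the calling thread — the same source taskset(1) reads — and reports how many cpus it may currently run on, which reflects cpu_active_mask and thus can differ from the Cpus_allowed line in /proc/$$/status.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Count the cpus the calling thread may run on right now, per
 * sched_getaffinity(2); returns -1 on error. This is the
 * (cpus_allowed & cpu_active_mask) view, not the raw /proc view. */
int runnable_cpu_count(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return -1;
	return CPU_COUNT(&set);
}
```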
[RESEND PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus
field in a call to do_set_cpus_allowed() in
cpuset_cpus_allowed_fallback(). This argument's value does not make
sense for this case, because task_cs(tsk)->effective_cpus is modified
by cpuset_hotplug_workfn() to reflect the new value of cpu_active_mask
after cpu N is removed from the mask.

While this may make sense for the cgroup affinity mask, it does not
make sense on a per-task basis, as a task that was previously limited
to only be run on cpu N will be limited to every cpu _except_ for cpu N
after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:	ff
Cpus_allowed_list:	0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:	04
Cpus_allowed_list:	2

# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:	0b
Cpus_allowed_list:	0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:	0b
Cpus_allowed_list:	0-1,3

On a patched system, the final grep produces the following output
instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:	ff
Cpus_allowed_list:	0-7

This patch changes the above behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. This fallback mechanism is only triggered if _every_ other valid
avenue has been traveled, and it is the last resort before calling
BUG().

Signed-off-by: Joel Savitz

---
 Makefile               |  2 +-
 kernel/cgroup/cpuset.c | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 	spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+	do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+		task_cs(tsk)->cpus_allowed : cpu_possible_mask);
 	rcu_read_unlock();
 
 	/*
-- 
2.18.1
[PATCH v3] fs/proc: add VmTaskSize field to /proc/$$/status
There is currently no easy and architecture-independent way to find the
lowest unusable virtual address available to a process without
brute-force calculation. This patch allows a user to easily retrieve
this value via /proc/<pid>/status.

Using this patch, any program that previously needed to waste cpu
cycles recalculating a non-sensitive process-dependent value already
known to the kernel can now be optimized to use this mechanism.

Signed-off-by: Joel Savitz

---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c                 | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..1c6a912e3975 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -187,6 +187,7 @@ read the file /proc/PID/status:
   VmLib:	1412 kB
   VmPTE:	20 kb
   VmSwap:	0 kB
+  VmTaskSize:	137438953468 kB
   HugetlbPages:	0 kB
   CoreDumping:	0
   THP_enabled:	1
@@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  VmPTE                       size of page table entries
  VmSwap                      amount of swap used by anonymous private data
                             (shmem swap usage is not included)
+ VmTaskSize                  lowest unusable address in process virtual memory
 HugetlbPages                size of hugetlb memory portions
 CoreDumping                 process's memory is currently being dumped
                             (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95ca1fe7283c..0af7081f7b19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	seq_put_decimal_ull_width(m,
 		    " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
 	SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+	seq_put_decimal_ull_width(m,
+		    " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
 	seq_puts(m, " kB\n");
 	hugetlb_report_usage(m, mm);
 }
-- 
2.18.1
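A userspace consumer of the new entry would parse it out of /proc/self/status like any other field. A sketch of such a parser follows; since VmTaskSize only exists on a patched kernel, the generic field reader here (an invented helper, `read_status_field`) is demonstrated against a field that always exists, and simply returns -1 when a field is absent.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Scan /proc/self/status for "<field>:" and return the first numeric
 * value on that line, or -1 if the field is not present. A consumer of
 * the patch would call this with "VmTaskSize" to get the size in kB. */
long read_status_field(const char *field)
{
	char line[256];
	long val = -1;
	size_t flen = strlen(field);
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* exact-match the field name up to the colon */
		if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
			val = strtol(line + flen + 1, NULL, 10);
			break;
		}
	}
	fclose(f);
	return val;
}
```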
[PATCH v3 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit of future generations. Suggested-by: David Laight Signed-off-by: Joel Savitz --- man2/prctl.2 | 10 ++ 1 file changed, 10 insertions(+) diff --git a/man2/prctl.2 b/man2/prctl.2 index 06d8e13c7..cae582726 100644 --- a/man2/prctl.2 +++ b/man2/prctl.2 @@ -49,6 +49,7 @@ .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT +.\" 2019-05-03 Joel Savitz, document PR_GET_TASK_SIZE .\" .\" .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual" @@ -1375,6 +1376,15 @@ system call on Tru64). for information on versions and architectures) Return unaligned access control bits, in the location pointed to by .IR "(unsigned int\ *) arg2" . +.TP +.B PR_GET_TASK_SIZE +Copy the value of TASK_SIZE to the userspace address in +.IR "(unsigned long\ *) arg2" . +This value represents the highest virtual address available +to the current process. Return +.B EFAULT +if this operation fails. + .SH RETURN VALUE On success, .BR PR_GET_DUMPABLE , -- 2.18.1
[PATCH v3 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to copy the value of TASK_SIZE to the userspace address in arg2. It is important that we account for the case of the userspace task running in 32-bit compat mode on a 64-bit kernel. As such, we must be careful to copy the correct number of bytes to userspace to avoid stack corruption. Suggested-by: Yury Norov Suggested-by: Alexey Dobriyan Signed-off-by: Joel Savitz --- include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 23 +++ 2 files changed, 26 insertions(+) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 094bb03b9cc2..2c261c461952 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -229,4 +229,7 @@ struct prctl_mm_map { # define PR_PAC_APDBKEY(1UL << 3) # define PR_PAC_APGAKEY(1UL << 4) +/* Get the process virtual memory size (i.e. the highest usable VM address) */ +#define PR_GET_TASK_SIZE 55 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 12df0e5434b8..709584400070 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2252,6 +2252,26 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data) return 1; } +static int prctl_get_tasksize(void __user *uaddr) +{ + unsigned long current_task_size, current_word_size; + + current_task_size = TASK_SIZE; + current_word_size = sizeof(unsigned long); + +#ifdef CONFIG_64BIT + /* On 64-bit architecture, we must check whether the current thread +* is running in 32-bit compat mode. If it is, we can simply cut +* the size in half. This avoids corruption of the userspace stack. +*/ + if (test_thread_flag(TIF_ADDR32)) + current_word_size >>= 1; +#endif + + return copy_to_user(uaddr, &current_task_size, current_word_size) ? -EFAULT : 0; +} + int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which) { return -EINVAL; } @@ -2486,6 +2506,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, return -EINVAL; error = PAC_RESET_KEYS(me, arg2); break; + case PR_GET_TASK_SIZE: + error = prctl_get_tasksize((void *)arg2); + break; default: error = -EINVAL; break; -- 2.18.1
[PATCH v3 0/2] sys/prctl: expose TASK_SIZE value to userspace
In the mainline kernel, there is no quick mechanism to get the virtual memory size of the current process from userspace. Despite the current state of affairs, this information is available to the user through several means, one being a linear search of the entire address space. This is an inefficient use of cpu cycles. A component of the libhugetlb kernel test does exactly this, and as systems' address spaces increase beyond 32-bits, this method becomes exceedingly tedious. For example, on a ppc64le system with a 47-bit address space, the linear search causes the test to hang for some unknown amount of time. I couldn't give you an exact number because I just ran it for about 10-20 minutes and went to go do something else, probably to get coffee or something, and when I came back, I just killed the test and patched it to use this new mechanism. I re-ran my new version of the test using a kernel with this patch, and of course it passed through the previously bottlenecking codepath nearly instantaneously. As such, I propose that the prctl syscall be extended to include the option to retrieve TASK_SIZE from the kernel. This patch will allow us to upgrade an O(n) codepath to O(1) in an architecture-independent manner, and provide a mechanism for future generations to do the same. Changes from v2: We now account for the case of 32-bit compat userspace on a 64-bit kernel More detail about the nature of TASK_SIZE in documentation Joel Savitz(2): sys/prctl: add PR_GET_TASK_SIZE option to prctl(2) prctl.2: Document the new PR_GET_TASK_SIZE option include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 23 +++ 2 files changed, 26 insertions(+) man2/prctl.2 | 10 ++ 1 file changed, 10 insertions(+) -- 2.18.1
Re: [PATCH 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option
On Fri, May 3, 2019 at 7:20 AM David Laight wrote: > Shouldn't this say what the value is? > ISTR a recent patch to change the way the 'used to be constant' TASK_SIZE is > defined. > I think it might be 'The highest userspace virtual address the current > process can use.' But I might be wrong. I believe you are correct David. I will add this information to the manpage in the upcoming v3. Best, Joel Savitz
Re: [PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)
> Won't be possible to use put_user here? Something like > > static int prctl_get_tasksize(unsigned long __user *uaddr) > { > return put_user(TASK_SIZE, uaddr) ? -EFAULT : 0; > } What would be the benefit of using put_user() over copy_to_user() in this context?
Re: [PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace
Yes, this is the change, thanks to the suggestion of Yury Norov. I also now explicitly mention the expected userspace destination type in the manpage patch. Best, Joel Savitz On Thu, May 2, 2019 at 5:10 PM Cyrill Gorcunov wrote: > > On Thu, May 02, 2019 at 05:01:38PM -0400, Waiman Long wrote: > > > > What did you change in v2 versus v1? > > Seems unsigned long long has been changed to unsigned long.
[PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to copy the value of TASK_SIZE to the userspace address in arg2. Suggested-by: Alexey Dobriyan Signed-off-by: Joel Savitz --- include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 10 ++ 2 files changed, 13 insertions(+) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 094bb03b9cc2..2335fe0a8db8 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -229,4 +229,7 @@ struct prctl_mm_map { # define PR_PAC_APDBKEY(1UL << 3) # define PR_PAC_APGAKEY(1UL << 4) +/* Get the process virtual memory size */ +#define PR_GET_TASK_SIZE 55 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 12df0e5434b8..7ced7dbd035d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2252,6 +2252,13 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data) return 1; } +static int prctl_get_tasksize(void __user *uaddr) +{ + unsigned long task_size = TASK_SIZE; + return copy_to_user(uaddr, &task_size, sizeof(unsigned long)) + ? -EFAULT : 0; +} + int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which) { return -EINVAL; } @@ -2486,6 +2493,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, return -EINVAL; error = PAC_RESET_KEYS(me, arg2); break; + case PR_GET_TASK_SIZE: + error = prctl_get_tasksize((void *)arg2); + break; default: error = -EINVAL; break; -- 2.18.1
[PATCH v2 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit of future generations. Signed-off-by: Joel Savitz --- man2/prctl.2 | 9 + 1 file changed, 9 insertions(+) diff --git a/man2/prctl.2 b/man2/prctl.2 index 06d8e13c7..35a6a3919 100644 --- a/man2/prctl.2 +++ b/man2/prctl.2 @@ -49,6 +49,7 @@ .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT +.\" 2019-05-02 Joel Savitz, document PR_GET_TASK_SIZE .\" .\" .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual" @@ -1375,6 +1376,14 @@ system call on Tru64). for information on versions and architectures) Return unaligned access control bits, in the location pointed to by .IR "(unsigned int\ *) arg2" . +.TP +.B PR_GET_TASK_SIZE +Copy the value of TASK_SIZE to the userspace address in +.IR "(unsigned long\ *) arg2" . +Return +.B EFAULT +if this operation fails. + .SH RETURN VALUE On success, .BR PR_GET_DUMPABLE , -- 2.18.1
[PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace
In the mainline kernel, there is no quick mechanism to get the virtual memory size of the current process from userspace. Despite the current state of affairs, this information is available to the user through several means, one being a linear search of the entire address space. This is an inefficient use of cpu cycles. A component of the libhugetlb kernel test does exactly this, and as systems' address spaces increase beyond 32-bits, this method becomes exceedingly tedious. For example, on a ppc64le system with a 47-bit address space, the linear search causes the test to hang for some unknown amount of time. I couldn't give you an exact number because I just ran it for about 10-20 minutes and went to go do something else, probably to get coffee or something, and when I came back, I just killed the test and patched it to use this new mechanism. I re-ran my new version of the test using a kernel with this patch, and of course it passed through the previously bottlenecking codepath nearly instantaneously. As such, I propose that the prctl syscall be extended to include the option to retrieve TASK_SIZE from the kernel. This patch will allow us to upgrade an O(n) codepath to O(1) in an architecture-independent manner, and provide a mechanism for others to do the same. Joel Savitz(2): sys/prctl: add PR_GET_TASK_SIZE option to prctl(2) prctl.2: Document the new PR_GET_TASK_SIZE option include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 10 ++ 2 files changed, 13 insertions(+) man2/prctl.2 | 9 + 1 file changed, 9 insertions(+) -- 2.18.1
[PATCH 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit of future generations. Signed-off-by: Joel Savitz --- man2/prctl.2 | 9 + 1 file changed, 9 insertions(+) diff --git a/man2/prctl.2 b/man2/prctl.2 index 06d8e13c7..35a6a3919 100644 --- a/man2/prctl.2 +++ b/man2/prctl.2 @@ -49,6 +49,7 @@ .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT +.\" 2019-05-02 Joel Savitz, document PR_GET_TASK_SIZE .\" .\" .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual" @@ -1375,6 +1376,14 @@ system call on Tru64). for information on versions and architectures) Return unaligned access control bits, in the location pointed to by .IR "(unsigned int\ *) arg2" . +.TP +.B PR_GET_TASK_SIZE +Copy the value of TASK_SIZE to the userspace address in +.IR "arg2" . +Return +.B EFAULT +if this operation fails. + .SH RETURN VALUE On success, .BR PR_GET_DUMPABLE , -- 2.18.1
[PATCH 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to copy the value of TASK_SIZE to the userspace address in arg2. Suggested-by: Alexey Dobriyan Signed-off-by: Joel Savitz --- include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 10 ++ 2 files changed, 13 insertions(+) diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index 094bb03b9cc2..2335fe0a8db8 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -229,4 +229,7 @@ struct prctl_mm_map { # define PR_PAC_APDBKEY(1UL << 3) # define PR_PAC_APGAKEY(1UL << 4) +/* Get the process virtual memory size */ +#define PR_GET_TASK_SIZE 55 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/sys.c b/kernel/sys.c index 12df0e5434b8..7ced7dbd035d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2252,6 +2252,13 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data) return 1; } +static int prctl_get_tasksize(void __user *uaddr) +{ + unsigned long long task_size = TASK_SIZE; + return copy_to_user(uaddr, &task_size, sizeof(unsigned long long)) + ? -EFAULT : 0; +} + int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which) { return -EINVAL; } @@ -2486,6 +2493,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, return -EINVAL; error = PAC_RESET_KEYS(me, arg2); break; + case PR_GET_TASK_SIZE: + error = prctl_get_tasksize((void *)arg2); + break; default: error = -EINVAL; break; -- 2.18.1
[PATCH 0/2] sys/prctl: expose TASK_SIZE value to userspace
In the mainline kernel, there is no quick mechanism to get the virtual memory size of the current process from userspace. Despite the current state of affairs, this information is available to the user through several means, one being a linear search of the entire address space. This is an inefficient use of cpu cycles. A component of the libhugetlb kernel test does exactly this, and as systems' address spaces increase beyond 32-bits, this method becomes exceedingly tedious. For example, on a ppc64le system with a 47-bit address space, the linear search causes the test to hang for some unknown amount of time. I couldn't give you an exact number because I just ran it for about 10-20 minutes and went to go do something else, probably to get coffee or something, and when I came back, I just killed the test and patched it to use this new mechanism. I re-ran my new version of the test using a kernel with this patch, and of course it passed through the previously bottlenecking codepath nearly instantaneously. As such, I propose that the prctl syscall be extended to include the option to retrieve TASK_SIZE from the kernel. This patch will allow us to upgrade an O(n) codepath to O(1) in an architecture-independent manner, and provide a mechanism for others to do the same. Joel Savitz(2): sys/prctl: add PR_GET_TASK_SIZE option to prctl(2) prctl.2: Document the new PR_GET_TASK_SIZE option include/uapi/linux/prctl.h | 3 +++ kernel/sys.c | 10 ++ 2 files changed, 13 insertions(+) man2/prctl.2 | 9 + 1 file changed, 9 insertions(+) -- 2.18.1
Re: [PATCH v2] fs/proc: add VmTaskSize field to /proc/$$/status
Good point Alexey. Expect v3 shortly. Best, Joel Savitz On Sat, Apr 27, 2019 at 5:45 PM Alexey Dobriyan wrote: > > On Fri, Apr 26, 2019 at 03:02:08PM -0400, Joel Savitz wrote: > > In the mainline kernel, there is no quick mechanism to get the virtual > > memory size of the current process from userspace. > > > > Despite the current state of affairs, this information is available to the > > user through several means, one being a linear search of the entire address > > space. This is an inefficient use of cpu cycles. > > You can test only a few known per arch values. Linear search is a self > inflicted wound. > > prctl(2) is more natural place and will also be arch neutral. > > > A component of the libhugetlb kernel test does exactly this, and as > > systems' address spaces increase beyond 32-bits, this method becomes > > exceedingly tedious. > > > For example, on a ppc64le system with a 47-bit address space, the linear > > search causes the test to hang for some unknown amount of time. I > > couldn't give you an exact number because I just ran it for about 10-20 > > minutes and went to go do something else, probably to get coffee or > > something, and when I came back, I just killed the test and patched it > > to use this new mechanism. I re-ran my new version of the test using a > > kernel with this patch, and of course it passed through the previously > > bottlenecking codepath nearly instantaneously. > > > > This patched enabled me to upgrade an O(n) codepath to O(1) in an > > architecture-independent manner. 
> > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) > > seq_put_decimal_ull_width(m, > > " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8); > > SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); > > - seq_puts(m, " kB\n"); > > + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); > > + seq_put_decimal_ull_width(m, > > + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8); > > + seq_puts(m, " kB\n"); > > All fields in this file are related to the task. New field related > to "current" will stick like an eyesore.
[PATCH v2] fs/proc: add VmTaskSize field to /proc/$$/status
In the mainline kernel, there is no quick mechanism to get the virtual memory size of the current process from userspace. Despite the current state of affairs, this information is available to the user through several means, one being a linear search of the entire address space. This is an inefficient use of cpu cycles. A component of the libhugetlb kernel test does exactly this, and as systems' address spaces increase beyond 32-bits, this method becomes exceedingly tedious. For example, on a ppc64le system with a 47-bit address space, the linear search causes the test to hang for some unknown amount of time. I couldn't give you an exact number because I just ran it for about 10-20 minutes and went to go do something else, probably to get coffee or something, and when I came back, I just killed the test and patched it to use this new mechanism. I re-ran my new version of the test using a kernel with this patch, and of course it passed through the previously bottlenecking codepath nearly instantaneously. This patch enabled me to upgrade an O(n) codepath to O(1) in an architecture-independent manner.
Signed-off-by: Joel Savitz --- Documentation/filesystems/proc.txt | 2 ++ fs/proc/task_mmu.c | 5 - 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 66cad5c86171..1c6a912e3975 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -187,6 +187,7 @@ read the file /proc/PID/status: VmLib: 1412 kB VmPTE:20 kb VmSwap:0 kB + VmTaskSize: 137438953468 kB HugetlbPages: 0 kB CoreDumping:0 THP_enabled: 1 @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19) VmPTE size of page table entries VmSwap amount of swap used by anonymous private data (shmem swap usage is not included) + VmTaskSize size of entire virtual address space of a process HugetlbPagessize of hugetlb memory portions CoreDumping process's memory is currently being dumped (killing the process may lead to a corrupted core) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 95ca1fe7283c..0ddd51479f90 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) seq_put_decimal_ull_width(m, " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8); SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); - seq_puts(m, " kB\n"); + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); + seq_put_decimal_ull_width(m, + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8); + seq_puts(m, " kB\n"); hugetlb_report_usage(m, mm); } #undef SEQ_PUT_DEC -- 2.18.1
[PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill
In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID 0 Signed-off-by: Joel Savitz --- mm/oom_kill.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3a2484884cfd..af2e3faa72a0 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) */ do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID); mark_oom_victim(victim); - pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", + pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID %d\n", message, task_pid_nr(victim), victim->comm, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), K(get_mm_counter(victim->mm, MM_FILEPAGES)), K(get_mm_counter(victim->mm, MM_SHMEMPAGES)), + from_kuid(&init_user_ns, task_uid(victim))); task_unlock(victim); /* -- 2.18.1
[PATCH] fs/proc: add VmTaskSize field to /proc/$$/status
Currently, there is no fast mechanism to get the virtual memory size of the current process from userspace. This information is available to the user through several means, one being a linear search of the entire address space. This is the method used by a component of the libhugetlb kernel test, and using the mechanism proposed in this patch, the time complexity of that test would be upgraded to constant time from linear time. This is especially relevant on 64-bit architectures where a linear search of the address space could take an absurd amount of time. Using this mechanism, the modification to the test component would be portable across all architectures. Signed-off-by: Joel Savitz --- fs/proc/task_mmu.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 92a91e7816d8..f64b9a949624 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) seq_put_decimal_ull_width(m, " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8); SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); - seq_puts(m, " kB\n"); + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap); + seq_put_decimal_ull_width(m, + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8); + seq_puts(m, " kB\n"); hugetlb_report_usage(m, mm); } #undef SEQ_PUT_DEC -- 2.18.1
[PATCH] mm/oom_killer: Add task UID to info message on an oom kill
In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), UID 0, total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB Signed-off-by: Joel Savitz --- mm/oom_kill.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 3a2484884cfd..22972648b758 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -874,9 +874,9 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) */ do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID); mark_oom_victim(victim); - pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", + pr_err("%s: Killed process %d (%s), UID %d, total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", message, task_pid_nr(victim), victim->comm, - K(victim->mm->total_vm), + task_uid(victim).val, K(victim->mm->total_vm), K(get_mm_counter(victim->mm, MM_ANONPAGES)), K(get_mm_counter(victim->mm, MM_FILEPAGES)), K(get_mm_counter(victim->mm, MM_SHMEMPAGES))); -- 2.18.1
[tip:sched/core] sched/core: Fix typo in comment
Commit-ID: bee9853932e90ce94bce4276ec6b7b06bc48070b Gitweb: https://git.kernel.org/tip/bee9853932e90ce94bce4276ec6b7b06bc48070b Author: Joel Savitz AuthorDate: Wed, 6 Mar 2019 20:13:33 -0500 Committer: Ingo Molnar CommitDate: Fri, 19 Apr 2019 12:22:16 +0200 sched/core: Fix typo in comment Signed-off-by: Joel Savitz Cc: Andrew Morton Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: triv...@kernel.org Link: http://lkml.kernel.org/r/1551921213-813-1-git-send-email-jsav...@redhat.com Signed-off-by: Ingo Molnar --- kernel/sched/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8b64ef0d5589..fb09eaad1d3a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -924,7 +924,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p) } /* - * Per-CPU kthreads are allowed to run on !actie && online CPUs, see + * Per-CPU kthreads are allowed to run on !active && online CPUs, see * __set_cpus_allowed_ptr() and select_fallback_rq(). */ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
[PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
If a process is limited by taskset (i.e. cpuset) to only be allowed to run on cpu N, and then cpu N is offlined via hotplug, the process will be assigned the current value of its cpuset cgroup's effective_cpus field in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback(). This argument's value does not make sense for this case, because task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn() to reflect the new value of cpu_active_mask after cpu N is removed from the mask. While this may make sense for the cgroup affinity mask, it does not make sense on a per-task basis, as a task that was previously limited to only be run on cpu N will be limited to every cpu _except_ for cpu N after it is offlined/onlined via hotplug. Pre-patch behavior: $ grep Cpus /proc/$$/status Cpus_allowed: ff Cpus_allowed_list: 0-7 $ taskset -p 4 $$ pid 19202's current affinity mask: f pid 19202's new affinity mask: 4 $ grep Cpus /proc/self/status Cpus_allowed: 04 Cpus_allowed_list: 2 # echo off > /sys/devices/system/cpu/cpu2/online $ grep Cpus /proc/$$/status Cpus_allowed: 0b Cpus_allowed_list: 0-1,3 # echo on > /sys/devices/system/cpu/cpu2/online $ grep Cpus /proc/$$/status Cpus_allowed: 0b Cpus_allowed_list: 0-1,3 On a patched system, the final grep produces the following output instead: $ grep Cpus /proc/$$/status Cpus_allowed: ff Cpus_allowed_list: 0-7 This patch changes the above behavior by instead resetting the mask to task_cs(tsk)->cpus_allowed by default, and cpu_possible_mask in legacy mode. This fallback mechanism is only triggered if _every_ other valid avenue has been traveled, and it is the last resort before calling BUG().
Signed-off-by: Joel Savitz --- kernel/cgroup/cpuset.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 4834c4214e9c..6c9deb2cc687 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3255,10 +3255,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) spin_unlock_irqrestore(&callback_lock, flags); } +/** + * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe. + * @tsk: pointer to task_struct with which the scheduler is struggling + * + * Description: In the case that the scheduler cannot find an allowed cpu in + * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy + * mode however, this value is the same as task_cs(tsk)->effective_cpus, + * which will not contain a sane cpumask during cases such as cpu hotplugging. + * This is the absolute last resort for the scheduler and it is only used if + * _every_ other avenue has been traveled. + **/ + void cpuset_cpus_allowed_fallback(struct task_struct *tsk) { rcu_read_lock(); - do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus); + do_set_cpus_allowed(tsk, is_in_v2_mode() ? + task_cs(tsk)->cpus_allowed : cpu_possible_mask); rcu_read_unlock(); /* -- 2.18.1
Fwd: [RESEND PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
-- Forwarded message - From: Joel Savitz Date: Fri, Apr 5, 2019 at 11:37 AM Subject: [RESEND PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback() To: Cc: Joel Savitz , Li Zefan , If a process is limited by taskset (i.e. cpuset) to only be allowed to run on cpu N, and then cpu N is offlined via hotplug, the process will be assigned the current value of its cpuset cgroup's effective_cpus field in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback(). This argument's value does not make sense for this case, because task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn() to reflect the new value of cpu_active_mask after cpu N is removed from the mask. While this may make sense for the cgroup affinity mask, it does not make sense on a per-task basis, as a task that was previously limited to only be run on cpu N will be limited to every cpu _except_ for cpu N after it is offlined/onlined via hotplug. Pre-patch behavior: $ grep Cpus /proc/$$/status Cpus_allowed: ff Cpus_allowed_list: 0-7 $ taskset -p 4 $$ pid 19202's current affinity mask: f pid 19202's new affinity mask: 4 $ grep Cpus /proc/self/status Cpus_allowed: 04 Cpus_allowed_list: 2 # echo off > /sys/devices/system/cpu/cpu2/online $ grep Cpus /proc/$$/status Cpus_allowed: 0b Cpus_allowed_list: 0-1,3 # echo on > /sys/devices/system/cpu/cpu2/online $ grep Cpus /proc/$$/status Cpus_allowed: 0b Cpus_allowed_list: 0-1,3 On a patched system, the final grep produces the following output instead: $ grep Cpus /proc/$$/status Cpus_allowed: ff Cpus_allowed_list: 0-7 This patch changes the above behavior by instead simply resetting the mask to cpu_possible_mask.
Signed-off-by: Joel Savitz --- kernel/cgroup/cpuset.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 479743db6c37..5f65a2167bdf 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask) void cpuset_cpus_allowed_fallback(struct task_struct *tsk) { rcu_read_lock(); - do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus); + do_set_cpus_allowed(tsk, cpu_possible_mask); rcu_read_unlock(); /* -- 2.20.1
Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
Forgot to add cc's... my bad.

Best,
Joel Savitz

On Tue, Mar 26, 2019 at 1:31 PM Joel Savitz wrote:
>
> Ping!
>
> Does anyone have any comments or concerns about this patch?
>
> Best,
> Joel Savitz
>
> On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz wrote:
> >
> > On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz wrote:
> > >
> > > [patch text trimmed; quoted in full in the original posting below]
Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
Ping!

Does anyone have any comments or concerns about this patch?

Best,
Joel Savitz

On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz wrote:
>
> On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz wrote:
> >
> > [patch text trimmed; quoted in full in the original posting below]
Re: [PATCH] sched: fix spelling of active in comment on is_cpu_allowed()
Best,
Joel Savitz

On Wed, Mar 6, 2019 at 8:13 PM Joel Savitz wrote:
>
> trivial fix of documentation typo
>
> [patch text trimmed; quoted in full in the original posting below]
Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz wrote:
>
> [patch text trimmed; quoted in full in the original posting below]
[PATCH] sched: fix spelling of active in comment on is_cpu_allowed()
trivial fix of documentation typo

Signed-off-by: Joel Savitz
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ead464a..2e2e19b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -920,7 +920,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 }

 /*
- * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
--
1.8.3.1
[PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus field
in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
This argument's value does not make sense for this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis, as a task that was previously
limited to run only on cpu N will be limited to every cpu _except_
cpu N after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:      0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:   04
Cpus_allowed_list:      2

# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:      0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:      0-1,3

On a patched system, the final grep produces the following
output instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:      0-7

This patch changes the above behavior by instead simply resetting the mask
to cpu_possible_mask.
Signed-off-by: Joel Savitz
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 479743db6c37..5f65a2167bdf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
 	rcu_read_lock();
-	do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+	do_set_cpus_allowed(tsk, cpu_possible_mask);
 	rcu_read_unlock();

 	/*
--
2.20.1
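The unprivileged half of the scenario above (pinning a shell and reading the mask back) can be exercised on any kernel; only the hotplug cycle needs root and a patched/unpatched kernel to show a difference. A sketch, assuming util-linux `taskset` is installed — CPU 0 is used here only because it is always present, while the commit message pins to cpu 2:

```shell
#!/bin/sh
# Sketch of the commit-message scenario (assumes util-linux taskset).
# Pin this shell to CPU 0 (mask 0x1), then read the mask back from
# /proc exactly as the commit message does.
taskset -p 1 $$
grep Cpus_allowed_list /proc/$$/status   # should now list just CPU 0
# Root-only hotplug cycle from the commit message (use a non-boot CPU):
#   echo 0 > /sys/devices/system/cpu/cpu2/online
#   echo 1 > /sys/devices/system/cpu/cpu2/online
#   grep Cpus_allowed_list /proc/$$/status
# On an unpatched kernel the pinned task comes back allowed on every
# CPU *except* the one it was pinned to; the patch restores
# cpu_possible_mask instead.
```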