[PATCH] mm/page_alloc: fix documentation error

2020-06-24 Thread Joel Savitz
Cc: Andrew Morton 
Cc: Matthew Wilcox 
Cc: Rafael Aquini 
Cc: Fabrizio D'Angelo 
Cc: linux...@kvack.org
Cc: triv...@kernel.org

When I increased the upper bound of the min_free_kbytes value in
commit ee8eb9a5fe863, I forgot to update the comment above
init_per_zone_wmark_min() to reflect the new value. This patch fixes that
mistake.

Signed-off-by: Joel Savitz 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..e028b87ce294 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min).  For large machines
- * we want it large (64MB max).  But it is not linear, because network
+ * we want it large (256MB max).  But it is not linear, because network
  * bandwidth does not increase linearly with machine size.  We use
  *
  * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
-- 
2.23.0
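
A quick way to sanity-check the numbers in that comment is to reproduce the
heuristic in userspace. The following standalone sketch is illustrative only,
not kernel code; the 128 kB floor and 256 MB ceiling come from the patch
above, and 4 * sqrt(x) is computed as sqrt(16 * x) to mirror the kernel's
int_sqrt(lowmem_kbytes * 16):

/* Illustrative userspace sketch, not kernel code.
 * Build with: gcc -O2 -o minfree minfree.c -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* lowmem sizes in kB: 16 MB, 1 GB, 8 GB, 16 GB */
	unsigned long lowmem_kbytes[] = { 16UL << 10, 1UL << 20, 8UL << 20, 16UL << 20 };

	for (unsigned int i = 0; i < sizeof(lowmem_kbytes) / sizeof(lowmem_kbytes[0]); i++) {
		unsigned long v = (unsigned long)sqrt((double)lowmem_kbytes[i] * 16);

		if (v < 128)		/* small machines: 128 kB floor */
			v = 128;
		if (v > 262144)		/* large machines: 256 MB ceiling */
			v = 262144;
		printf("%8lu MB lowmem -> min_free_kbytes = %lu kB\n",
		       lowmem_kbytes[i] >> 10, v);
	}
	return 0;
}

The output lands on (or within a few kB of) the table in the kernel comment,
e.g. 16 MB of lowmem gives 512 kB and 16 GB gives 16384 kB.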



[PATCH v2] mm/page_alloc: fix documentation error and remove magic numbers

2020-06-24 Thread Joel Savitz
When I increased the upper bound of the min_free_kbytes value in
commit ee8eb9a5fe863, I forgot to update the comment above
init_per_zone_wmark_min() to reflect the new value. This patch fixes that
mistake.

In addition, this patch replaces the magic number bounds with symbolic
constants to clarify the logic.

changes from v1:
- declare constants via enum instead of separate integers

Suggested-by: John Hubbard 
Suggested-by: Vlastimil Babka 
Signed-off-by: Joel Savitz 
---
 mm/page_alloc.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..733c81678b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min).  For large machines
- * we want it large (64MB max).  But it is not linear, because network
+ * we want it large (256MB max).  But it is not linear, because network
  * bandwidth does not increase linearly with machine size.  We use
  *
  * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
@@ -7852,6 +7852,8 @@ void setup_per_zone_wmarks(void)
  * 8192MB: 11584k
  * 16384MB:16384k
  */
+static enum { MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7, MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18 };
+
 int __meminit init_per_zone_wmark_min(void)
 {
unsigned long lowmem_kbytes;
@@ -7862,10 +7864,10 @@ int __meminit init_per_zone_wmark_min(void)
 
if (new_min_free_kbytes > user_min_free_kbytes) {
min_free_kbytes = new_min_free_kbytes;
-   if (min_free_kbytes < 128)
-   min_free_kbytes = 128;
-   if (min_free_kbytes > 262144)
-   min_free_kbytes = 262144;
+   if (min_free_kbytes < MIN_FREE_KBYTES_LOWER_BOUND)
+   min_free_kbytes = MIN_FREE_KBYTES_LOWER_BOUND;
+   if (min_free_kbytes > MIN_FREE_KBYTES_UPPER_BOUND)
+   min_free_kbytes = MIN_FREE_KBYTES_UPPER_BOUND;
} else {
pr_warn("min_free_kbytes is not updated to %d because user 
defined value %d is preferred\n",
new_min_free_kbytes, user_min_free_kbytes);
-- 
2.23.0



[PATCH] mm/page_alloc: fix documentation error and remove magic numbers

2020-06-23 Thread Joel Savitz
When I increased the upper bound of the min_free_kbytes value in
commit ee8eb9a5fe863, I forgot to update the comment above
init_per_zone_wmark_min() to reflect the new value. This patch fixes that
mistake.

In addition, this patch replaces the magic number bounds with symbolic
constants to clarify the logic.

Suggested-by: John Hubbard 
Suggested-by: Vlastimil Babka 
Signed-off-by: Joel Savitz 
---
 mm/page_alloc.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..f725addc2748 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7832,7 +7832,7 @@ void setup_per_zone_wmarks(void)
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min).  For large machines
- * we want it large (64MB max).  But it is not linear, because network
+ * we want it large (256MB max).  But it is not linear, because network
  * bandwidth does not increase linearly with machine size.  We use
  *
  * min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
@@ -7852,6 +7852,9 @@ void setup_per_zone_wmarks(void)
  * 8192MB: 11584k
  * 16384MB:16384k
  */
+static const int MIN_FREE_KBYTES_LOWER_BOUND = 1 << 7;
+static const int MIN_FREE_KBYTES_UPPER_BOUND = 1 << 18;
+
 int __meminit init_per_zone_wmark_min(void)
 {
unsigned long lowmem_kbytes;
@@ -7862,10 +7865,10 @@ int __meminit init_per_zone_wmark_min(void)
 
if (new_min_free_kbytes > user_min_free_kbytes) {
min_free_kbytes = new_min_free_kbytes;
-   if (min_free_kbytes < 128)
-   min_free_kbytes = 128;
-   if (min_free_kbytes > 262144)
-   min_free_kbytes = 262144;
+   if (min_free_kbytes < MIN_FREE_KBYTES_LOWER_BOUND)
+   min_free_kbytes = MIN_FREE_KBYTES_LOWER_BOUND;
+   if (min_free_kbytes > MIN_FREE_KBYTES_UPPER_BOUND)
+   min_free_kbytes = MIN_FREE_KBYTES_UPPER_BOUND;
} else {
pr_warn("min_free_kbytes is not updated to %d because user 
defined value %d is preferred\n",
new_min_free_kbytes, user_min_free_kbytes);
-- 
2.23.0



[PATCH] fs: proc: Clarify warnings for invalid proc dir names

2019-10-20 Thread Joel Savitz
When one attempts to create a directory in /proc with an invalid name,
such as one in a subdirectory that doesn't exist, one with a name beyond
256 characters, or a reserved name such as '.' or '..', the kernel
throws a warning message that looks like this:

[ 7913.252558] name 'invalid_name'

This warning message is nearly the same for all invalid cases, including
the removal of a nonexistent directory. This patch clarifies the warning
message and differentiates the invalid creation/removal cases so as to
allow the user to more quickly understand their mistake.

Signed-off-by: Fabrizio D'Angelo 
Signed-off-by: Joel Savitz 
---
 fs/proc/generic.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 64e9ee1b129e..df04fd4f02af 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -173,7 +173,7 @@ static int __xlate_proc_name(const char *name, struct proc_dir_entry **ret,
len = next - cp;
de = pde_subdir_find(de, cp, len);
if (!de) {
-   WARN(1, "name '%s'\n", name);
+   WARN(1, "invalid proc dir name '%s'\n", name);
return -ENOENT;
}
cp += len + 1;
@@ -386,15 +386,15 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
qstr.name = fn;
qstr.len = strlen(fn);
if (qstr.len == 0 || qstr.len >= 256) {
-   WARN(1, "name len %u\n", qstr.len);
+   WARN(1, "invalid proc dir name len %u\n", qstr.len);
return NULL;
}
if (qstr.len == 1 && fn[0] == '.') {
-   WARN(1, "name '.'\n");
+   WARN(1, "invalid proc dir name '.'\n");
return NULL;
}
if (qstr.len == 2 && fn[0] == '.' && fn[1] == '.') {
-   WARN(1, "name '..'\n");
+   WARN(1, "invalid proc dir name '..'\n");
return NULL;
}
if (*parent == &proc_root && name_to_int(&qstr) != ~0U) {
@@ -402,7 +402,7 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent,
return NULL;
}
if (is_empty_pde(*parent)) {
-   WARN(1, "attempt to add to permanently empty directory");
+   WARN(1, "attempt to add to permanently empty directory in 
proc");
return NULL;
}
 
@@ -670,7 +670,7 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
rb_erase(&de->subdir_node, &parent->subdir);
write_unlock(&proc_subdir_lock);
if (!de) {
-   WARN(1, "name '%s'\n", name);
+   WARN(1, "unable to remove nonexistent proc dir '%s'\n", name);
return;
}
 
-- 
2.23.0
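
To see when the reworded warnings actually fire, a minimal demo module can
trigger two of them on purpose. This is a hypothetical sketch, not part of
the patch; the entry names are made up, and proc_mkdir()/remove_proc_entry()
are the ordinary procfs helpers:

#include <linux/module.h>
#include <linux/proc_fs.h>

static int __init proc_name_demo_init(void)
{
	/* "no_such_dir" was never created, so the lookup in
	 * __xlate_proc_name() should fail and warn
	 * "invalid proc dir name 'no_such_dir/child'". */
	if (!proc_mkdir("no_such_dir/child", NULL))
		pr_info("proc_name_demo: creation rejected as expected\n");

	/* Nothing named "never_created" is registered, so this should warn
	 * "unable to remove nonexistent proc dir 'never_created'". */
	remove_proc_entry("never_created", NULL);

	return 0;
}

static void __exit proc_name_demo_exit(void)
{
}

module_init(proc_name_demo_init);
module_exit(proc_name_demo_exit);
MODULE_LICENSE("GPL");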



[PATCH] mm: fix typo in comment

2019-08-01 Thread Joel Savitz
Fix spelling of successful (currently spelled successfull in kernel
source)

Signed-off-by: Joel Savitz 
---
 include/linux/compaction.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..4122df65eb44 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -38,12 +38,12 @@ enum compact_result {
COMPACT_CONTINUE,
 
/*
-* The full zone was compacted scanned but wasn't successfull to compact
+* The full zone was compacted scanned but wasn't successful to compact
 * suitable pages.
 */
COMPACT_COMPLETE,
/*
-* direct compaction has scanned part of the zone but wasn't successfull
+* direct compaction has scanned part of the zone but wasn't successful
 * to compact suitable pages.
 */
COMPACT_PARTIAL_SKIPPED,
-- 
2.18.1



Re: [PATCH v2] PCI: rpaphp: Avoid a sometimes-uninitialized warning

2019-06-27 Thread Joel Savitz
>On Mon, Jun 03, 2019 at 03:11:58PM -0700, Nathan Chancellor wrote:
>> When building with -Wsometimes-uninitialized, clang warns:
>> 
>> drivers/pci/hotplug/rpaphp_core.c:243:14: warning: variable 'fndit' is
>> used uninitialized whenever 'for' loop exits because its condition is
>> false [-Wsometimes-uninitialized]
>> for (j = 0; j < entries; j++) {
>> ^~~
>> drivers/pci/hotplug/rpaphp_core.c:256:6: note: uninitialized use occurs
>> here
>> if (fndit)
>> ^
>> drivers/pci/hotplug/rpaphp_core.c:243:14: note: remove the condition if
>> it is always true
>> for (j = 0; j < entries; j++) {
>> ^~~
>> drivers/pci/hotplug/rpaphp_core.c:233:14: note: initialize the variable
>> 'fndit' to silence this warning
>> int j, fndit;
>> ^
>>  = 0
>> 
>> fndit is only used to gate a sprintf call, which can be moved into the
>> loop to simplify the code and eliminate the local variable, which will
>> fix this warning.
>> 
>> Link: https://github.com/ClangBuiltLinux/linux/issues/504
>> Fixes: 2fcf3ae508c2 ("hotplug/drc-info: Add code to search ibm,drc-info property")
>> Suggested-by: Nick Desaulniers 
>> Signed-off-by: Nathan Chancellor 
>> ---
>> 
>> v1 -> v2:
>> 
>> * Eliminate fndit altogether by shuffling the sprintf call into the for
>>   loop and changing the if conditional, as suggested by Nick.
> 
>>  drivers/pci/hotplug/rpaphp_core.c | 18 +++---
>>  1 file changed, 7 insertions(+), 11 deletions(-)

>> Gentle ping, can someone pick this up?

Looks like a good simplification of somewhat convoluted control flow.

Acked-by: Joel Savitz 
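
For readers skimming the thread, the acked cleanup is the usual "found" flag
elimination. A generic sketch of the pattern (plain C, not the actual
rpaphp_core.c code) looks like this:

#include <stdio.h>

/* Doing the work at the point of the match removes the flag that clang
 * warned could be read uninitialized when the loop exits normally. */
static int report_match(const int *entries, int n, int wanted, char *buf)
{
	/* Before: "int found;" set inside the loop, tested after it. */
	for (int i = 0; i < n; i++) {
		if (entries[i] == wanted) {
			sprintf(buf, "entry %d matches %d", i, wanted);
			return 0;	/* success handled where the match is known */
		}
	}
	return -1;			/* loop exhausted: no flag left to misread */
}

int main(void)
{
	int entries[] = { 3, 7, 42 };
	char buf[64];

	if (report_match(entries, 3, 42, buf) == 0)
		puts(buf);
	return 0;
}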


Re: [PATCH v4] fs/proc: add VmTaskSize field to /proc/$$/status

2019-06-15 Thread Joel Savitz
The most immediate use case is the optimization of an internal test,
but upon closer examination neither this patch nor the test itself
turn out to be worth pursuing.

Thank you for your time and constructive comments.

Best,
Joel Savitz


On Thu, Jun 13, 2019 at 3:30 PM Andrew Morton  wrote:
>
> On Thu, 13 Jun 2019 10:54:50 -0400 Joel Savitz  wrote:
>
> > The kernel provides no architecture-independent mechanism to get the
> > size of the virtual address space of a task (userspace process) without
> > brute-force calculation. This patch allows a user to easily retrieve
> > this value via a new VmTaskSize entry in /proc/$$/status.
>
> Why is access to ->task_size required?  Please fully describe the
> use case.
>
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -187,6 +187,7 @@ read the file /proc/PID/status:
> >VmLib:  1412 kB
> >VmPTE:20 kb
> >VmSwap:0 kB
> > +  VmTaskSize:137438953468 kB
> >HugetlbPages:  0 kB
> >CoreDumping:0
> >THP_enabled: 1
> > @@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
> >   VmPTE   size of page table entries
> >   VmSwap  amount of swap used by anonymous private data
> >   (shmem swap usage is not included)
> > + VmTaskSize  size of task (userspace process) vm space
>
> This is rather vague.  Is it the total amount of physical memory?  The
> sum of all vma sizes, populated or otherwise?  Something else?
>
>


Re: [PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-06-13 Thread Joel Savitz
I just did a quick test on a patched kernel to check on that "no
longer affine to..." message:

# nproc
64

# taskset -p 4 $$
pid 2261's current affinity mask: 
pid 2261's new affinity mask: 4
# echo off > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask: fffb
# echo on > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask: 

# dmesg | tail -5
[  143.996375] process 2261 (bash) no longer affine to cpu2
[  143.996657] IRQ 114: no longer affine to CPU2
[  144.007472] IRQ 227: no longer affine to CPU2
[  144.013460] smpboot: CPU 2 is now offline
[  162.685519] smpboot: Booting Node 0 Processor 2 APIC 0x4

dmesg output is observably the same on patched and unpatched kernels
in this case.

The only difference in output is that on an unpatched kernel, the last
`taskset -p $$` outputs:
pid 2274's current affinity mask: fffb
Which is the behavior that this patch aims to modify

This case, which I believe is generalizable, demonstrates that we
retain the "no longer affine to..." output on a kernel with this
patch.

Best,
Joel Savitz



On Thu, Jun 13, 2019 at 8:02 AM Michal Koutný  wrote:
>
> On Tue, May 28, 2019 at 02:10:37PM +0200, Michal Koutný wrote:
> > Although, on v1 we will lose the "no longer affine to..." message
> > (which is what happens in your demo IIUC).
> FWIW, I was wrong, off by one 'state' transition. So the patch doesn't
> cause change in messaging (not tested though).


[PATCH v4] fs/proc: add VmTaskSize field to /proc/$$/status

2019-06-13 Thread Joel Savitz
The kernel provides no architecture-independent mechanism to get the
size of the virtual address space of a task (userspace process) without
brute-force calculation. This patch allows a user to easily retrieve
this value via a new VmTaskSize entry in /proc/$$/status.

Signed-off-by: Joel Savitz 
---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..1c6a912e3975 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -187,6 +187,7 @@ read the file /proc/PID/status:
   VmLib:  1412 kB
   VmPTE:20 kb
   VmSwap:0 kB
+  VmTaskSize:  137438953468 kB
   HugetlbPages:  0 kB
   CoreDumping:0
   THP_enabled:   1
@@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  VmPTE   size of page table entries
  VmSwap  amount of swap used by anonymous private data
  (shmem swap usage is not included)
+ VmTaskSize  size of task (userspace process) vm space
  HugetlbPagessize of hugetlb memory portions
  CoreDumping process's memory is currently being dumped
  (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95ca1fe7283c..0af7081f7b19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
seq_put_decimal_ull_width(m,
" kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+   seq_put_decimal_ull_width(m,
+   " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
seq_puts(m, " kB\n");
hugetlb_report_usage(m, mm);
 }
-- 
2.18.1
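
A consumer such as the libhugetlb test component discussed earlier in this
series could then read the field like this (sketch only; the VmTaskSize line
exists only on a kernel carrying this unmerged patch, and task_size_kb stays
0 elsewhere):

#include <stdio.h>

int main(void)
{
	char line[128];
	unsigned long task_size_kb = 0;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmTaskSize: %lu kB", &task_size_kb) == 1)
			break;
	}
	fclose(f);

	printf("VmTaskSize: %lu kB\n", task_size_kb);
	return 0;
}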



[RESEND PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill

2019-06-12 Thread Joel Savitz
In the event of an oom kill, useful information about the killed
process is printed to dmesg. Users, especially system administrators,
will find it useful to immediately see the UID of the process.

In the following example, abuse_the_ram is the name of a program
that attempts to iteratively allocate all available memory until it is
stopped by force.

Current message:

Out of memory: Killed process 35389 (abuse_the_ram)
total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
shmem-rss:0kB

Patched message:

Out of memory: Killed process 2739 (abuse_the_ram),
total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
shmem-rss:0kB, UID 0


Suggested-by: David Rientjes 
Signed-off-by: Joel Savitz 
---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..af2e3faa72a0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 */
do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
mark_oom_victim(victim);
-   pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",
+   pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB, UID %d\n",
message, task_pid_nr(victim), victim->comm,
K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-   K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+   K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+   from_kuid(&init_user_ns, task_uid(victim)));
task_unlock(victim);
 
/*
-- 
2.18.1



[RESEND PATCH v3] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-06-12 Thread Joel Savitz
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of 
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,fff
Cpus_allowed_list:  0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and reonline it:

# taskset -p 4 $$
pid 2272's current affinity mask: 
pid 2272's new affinity mask: 4

# echo off > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -3
[ 2195.866089] process 2272 (bash) no longer affine to cpu2
[ 2195.872700] IRQ 114: no longer affine to CPU2
[ 2195.879128] smpboot: CPU 2 is now offline

# echo on > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -1
[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4


We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,fffb
Cpus_allowed_list:  0-1,3-63 

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

# taskset -p f0 $$
# grep -i cpu /proc/$$/status
Cpus_allowed:   00f0,
Cpus_allowed_list:  36-39

A double toggle of smt results in the following behavior:

# echo off > /sys/devices/system/cpu/smt/control
# echo on > /sys/devices/system/cpu/smt/control
# grep -i cpus /proc/$$/status
Cpus_allowed:   ff00,
Cpus_allowed_list:  0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,
Cpus_allowed_list:  0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long 
Suggested-by: Phil Auld 
Signed-off-by: Joel Savitz 
---
 kernel/cgroup/cpuset.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+   task_cs(tsk)->cpus_allowed : cpu_possible_mask);

[PATCH v3] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-06-12 Thread Joel Savitz
In the case that a process is constrained by taskset(1) (i.e.
sched_setaffinity(2)) to a subset of available cpus, and all of those are
subsequently offlined, the scheduler will set tsk->cpus_allowed to
the current value of task_cs(tsk)->effective_cpus.

This is done via a call to do_set_cpus_allowed() in the context of 
cpuset_cpus_allowed_fallback() made by the scheduler when this case is
detected. This is the only call made to cpuset_cpus_allowed_fallback()
in the latest mainline kernel.

However, this is not sane behavior.

I will demonstrate this on a system running the latest upstream kernel
with the following initial configuration:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,fff
Cpus_allowed_list:  0-63

(Where cpus 32-63 are provided via smt.)

If we limit our current shell process to cpu2 only and then offline it
and bring it back online:

# taskset -p 4 $$
pid 2272's current affinity mask: 
pid 2272's new affinity mask: 4

# echo off > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -3
[ 2195.866089] process 2272 (bash) no longer affine to cpu2
[ 2195.872700] IRQ 114: no longer affine to CPU2
[ 2195.879128] smpboot: CPU 2 is now offline

# echo on > /sys/devices/system/cpu/cpu2/online
# dmesg | tail -1
[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4


We see that our current process now has an affinity mask containing
every cpu available on the system _except_ the one we originally
constrained it to:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,fffb
Cpus_allowed_list:  0-1,3-63 

This is not sane behavior, as the scheduler can now not only place the
process on previously forbidden cpus, it can't even schedule it on
the cpu it was originally constrained to!

Other cases result in even more exotic affinity masks. Take for instance
a process with an affinity mask containing only cpus provided by smt at
the moment that smt is toggled, in a configuration such as the following:

# taskset -p f0 $$
# grep -i cpu /proc/$$/status
Cpus_allowed:   00f0,
Cpus_allowed_list:  36-39

A double toggle of smt results in the following behavior:

# echo off > /sys/devices/system/cpu/smt/control
# echo on > /sys/devices/system/cpu/smt/control
# grep -i cpus /proc/$$/status
Cpus_allowed:   ff00,
Cpus_allowed_list:  0-31,40-63

This is even less sane than the previous case, as the new affinity mask
excludes all smt-provided cpus with ids less than those that were
previously in the affinity mask, as well as those that were actually in
the mask.

With this patch applied, both of these cases end in the following state:

# grep -i cpu /proc/$$/status
Cpus_allowed:   ,
Cpus_allowed_list:  0-63

The original policy is discarded. Though not ideal, it is the simplest way
to restore sanity to this fallback case without reinventing the cpuset
wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
for the previous affinity mask to be restored in this fallback case can use
that mechanism instead.

This patch modifies scheduler behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode. I tested the cases above on both modes.

Note that the scheduler uses this fallback mechanism if and only if
_every_ other valid avenue has been traveled, and it is the last resort
before calling BUG().

Suggested-by: Waiman Long 
Suggested-by: Phil Auld 
Signed-off-by: Joel Savitz 
---
 kernel/cgroup/cpuset.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+   task_cs(tsk)->cpus_allowed : cpu_possible_mask);

Re: [PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-05-24 Thread Joel Savitz
On Tue, May 21, 2019 at 10:35 AM Michal Koutný  wrote:
> >   $ grep Cpus /proc/$$/status
> >   Cpus_allowed:   ff
> >   Cpus_allowed_list:  0-7
>
> (a)
>
> >   $ taskset -p 4 $$
> >   pid 19202's current affinity mask: f

> I'm confused where this value comes from, I must be missing something.
>
> Joel, is the task in question put into a cpuset with 0xf CPUs only (at
> point (a))? Or are the CPUs 4-7 offlined as well?

Good point.

It is a bit ambiguous, but I performed no action on the task's cpuset
nor did I offline any cpus at point (a).

After a bit of research, I am fairly certain that the observed
discrepancy is due to differing mechanisms used to acquire the cpuset
mask value.

The first mechanism, via `grep Cpus /proc/$$/status`, has its value
populated by the expression (task->cpus_allowed) in
fs/proc/array.c:sched_getaffinity(), whereas the taskset utility
(https://github.com/karelzak/util-linux/blob/master/schedutils/taskset.c)
uses sched_getaffinity(2) to determine the "current affinity mask"
value from the expression (task->cpus_allowed & cpu_active_mask) in
kernel/sched/core.c:sched_getaffinity().

I do not know if there is an explicit reason for this discrepancy or
whether the two mechanisms were simply built independently, perhaps
for different purposes.

I think the /proc/$$/status value is intended to simply reflect the
user-specified policy stating which cpus the task is allowed to run on
without consideration for hardware state, whereas the taskset value is
representative of the cpus that the task can actually be run on given
the restriction policy specified by the user via the cpuset mechanism.

By the way, I posted a v2 of this patch that correctly handles cgroup
v2 behavior.
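
To make the two code paths concrete, a small userspace sketch (assuming
Linux/glibc) can print both views for the calling process: the syscall view
that taskset uses and the policy view exposed via procfs.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	cpu_set_t set;
	char line[256];
	FILE *f;

	/* The view taskset reports: cpus_allowed & cpu_active_mask. */
	if (sched_getaffinity(0, sizeof(set), &set) == 0)
		printf("sched_getaffinity(2): %d runnable cpus\n", CPU_COUNT(&set));

	/* The policy view from /proc, which does not mask out offline cpus. */
	f = fopen("/proc/self/status", "r");
	while (f && fgets(line, sizeof(line), f)) {
		if (strncmp(line, "Cpus_allowed", 12) == 0)
			fputs(line, stdout);
	}
	if (f)
		fclose(f);
	return 0;
}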


[RESEND PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-05-07 Thread Joel Savitz
If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus field
in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
This argument's value does not make sense for this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis, as a task that was previously
limited to only be run on cpu N will be limited to every cpu _except_ for
cpu N after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:   04
Cpus_allowed_list:  2

# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

On a patched system, the final grep produces the following
output instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

This patch changes the above behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
mode.

This fallback mechanism is only triggered if _every_ other valid avenue
has been traveled, and it is the last resort before calling BUG().

Signed-off-by: Joel Savitz 
---
 Makefile   |  2 +-
 kernel/cgroup/cpuset.c | 15 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6a1942ed781c..515525ff1cfd 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+   task_cs(tsk)->cpus_allowed : cpu_possible_mask);
rcu_read_unlock();
 
/*
-- 
2.18.1



[PATCH v3] fs/proc: add VmTaskSize field to /proc/$$/status

2019-05-06 Thread Joel Savitz
There is currently no easy and architecture-independent way to find the
lowest unusable virtual address available to a process without
brute-force calculation. This patch allows a user to easily retrieve
this value via /proc//status.

Using this patch, any program that previously needed to waste cpu cycles
recalculating a non-sensitive process-dependent value already known to
the kernel can now be optimized to use this mechanism.

Signed-off-by: Joel Savitz 
---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..1c6a912e3975 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -187,6 +187,7 @@ read the file /proc/PID/status:
   VmLib:  1412 kB
   VmPTE:20 kb
   VmSwap:0 kB
+  VmTaskSize:  137438953468 kB
   HugetlbPages:  0 kB
   CoreDumping:0
   THP_enabled:   1
@@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  VmPTE   size of page table entries
  VmSwap  amount of swap used by anonymous private data
  (shmem swap usage is not included)
+ VmTaskSize  lowest unusable address in process virtual memory
  HugetlbPagessize of hugetlb memory portions
  CoreDumping process's memory is currently being dumped
  (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95ca1fe7283c..0af7081f7b19 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,6 +74,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
seq_put_decimal_ull_width(m,
" kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+   seq_put_decimal_ull_width(m,
+   " kB\nVmTaskSize:\t", mm->task_size >> 10, 8);
seq_puts(m, " kB\n");
hugetlb_report_usage(m, mm);
 }
-- 
2.18.1



[PATCH v3 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option

2019-05-03 Thread Joel Savitz
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit
of future generations.

Suggested-by: David Laight 
Signed-off-by: Joel Savitz 
---
 man2/prctl.2 | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 06d8e13c7..cae582726 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -49,6 +49,7 @@
 .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER
 .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER
 .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT
+.\" 2019-05-03 Joel Savitz, document PR_GET_TASK_SIZE
 .\"
 .\"
 .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual"
@@ -1375,6 +1376,15 @@ system call on Tru64).
 for information on versions and architectures)
 Return unaligned access control bits, in the location pointed to by
 .IR "(unsigned int\ *) arg2" .
+.TP
+.B PR_GET_TASK_SIZE
+Copy the value of TASK_SIZE to the userspace address in
+.IR "(unsigned long\ *) arg2" .
+This value represents the highest virtual address available
+to the current process. Return
+.B EFAULT
+if this operation fails.
+
 .SH RETURN VALUE
 On success,
 .BR PR_GET_DUMPABLE ,
--
2.18.1



[PATCH v3 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-03 Thread Joel Savitz
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
copy the value of TASK_SIZE to the userspace address in arg2.

It is important that we account for the case of the userspace task
running in 32-bit compat mode on a 64-bit kernel. As such, we must be
careful to copy the correct number of bytes to userspace to avoid stack
corruption.

Suggested-by: Yuri Norov 
Suggested-by: Alexey Dobriyan 
Signed-off-by: Joel Savitz 
---
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 23 +++
 2 files changed, 26 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..2c261c461952 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,7 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY(1UL << 3)
 # define PR_PAC_APGAKEY(1UL << 4)

+/* Get the process virtual memory size (i.e. the highest usable VM address) */
+#define PR_GET_TASK_SIZE   55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 12df0e5434b8..709584400070 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2252,6 +2252,26 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
return 1;
 }

+static int prctl_get_tasksize(void __user *uaddr)
+{
+   unsigned long current_task_size, current_word_size;
+
+   current_task_size = TASK_SIZE;
+   current_word_size = sizeof(unsigned long);
+
+#ifdef CONFIG_64BIT
+   /* On 64-bit architecture, we must check whether the current thread
+* is running in 32-bit compat mode. If it is, we can simply cut
+* the size in half. This avoids corruption of the userspace stack.
+*/
+   if (test_thread_flag(TIF_ADDR32))
+   current_word_size >>= 1;
+#endif
+
+   return copy_to_user(uaddr, &current_task_size, current_word_size) ? -EFAULT : 0;
+}
+
 int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which)
 {
return -EINVAL;
@@ -2486,6 +2506,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+   case PR_GET_TASK_SIZE:
+   error = prctl_get_tasksize((void *)arg2);
+   break;
default:
error = -EINVAL;
break;
--
2.18.1



[PATCH v3 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-03 Thread Joel Savitz
In the mainline kernel, there is no quick mechanism to get the virtual
memory size of the current process from userspace.

Despite the current state of affairs, this information is available to the
user through several means, one being a linear search of the entire address
space. This is an inefficient use of cpu cycles.

A component of the libhugetlb kernel test does exactly this, and as
systems' address spaces increase beyond 32-bits, this method becomes
exceedingly tedious.

For example, on a ppc64le system with a 47-bit address space, the linear
search causes the test to hang for some unknown amount of time. I
couldn't give you an exact number because I just ran it for about 10-20
minutes and went to go do something else, probably to get coffee or
something, and when I came back, I just killed the test and patched it
to use this new mechanism. I re-ran my new version of the test using a
kernel with this patch, and of course it passed through the previously
bottlenecking codepath nearly instantaneously.

As such, I propose that the prctl syscall be extended to include the
option to retrieve TASK_SIZE from the kernel.

This patch will allow us to upgrade an O(n) codepath to O(1) in an
architecture-independent manner, and provide a mechanism for future
generations to do the same.

Changes from v2:
 We now account for the case of 32-bit compat userspace on a 64-bit kernel
 More detail about the nature of TASK_SIZE in documentation

Joel Savitz(2):
  sys/prctl: add PR_GET_TASK_SIZE option to prctl(2)
  prctl.2: Document the new PR_GET_TASK_SIZE option

 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 23 +++
 2 files changed, 26 insertions(+)

 man2/prctl.2 | 10 ++
 1 file changed, 10 insertions(+)
--
2.18.1
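
For completeness, a caller of the proposed interface would look roughly like
the following hypothetical sketch; PR_GET_TASK_SIZE = 55 is the value
proposed by this series and only a patched kernel implements it:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_GET_TASK_SIZE
#define PR_GET_TASK_SIZE 55	/* value proposed by this series */
#endif

int main(void)
{
	unsigned long task_size = 0;

	/* An unpatched kernel returns -1 with errno set to EINVAL. */
	if (prctl(PR_GET_TASK_SIZE, &task_size, 0, 0, 0) != 0) {
		perror("prctl(PR_GET_TASK_SIZE)");
		return 1;
	}
	printf("TASK_SIZE: %lu kB\n", task_size >> 10);
	return 0;
}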



Re: [PATCH 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option

2019-05-03 Thread Joel Savitz
On Fri, May 3, 2019 at 7:20 AM David Laight  wrote:
> Shouldn't this say what the value is?
> ISTR a recent patch to change the way the 'used to be constant' TASK_SIZE is
> defined.
> I think it might be 'The highest userspace virtual address the current
> process can use.' But I might be wrong.

I believe you are correct David. I will add this information to the
manpage in the upcoming v3.
Best,
Joel Savitz


Re: [PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-02 Thread Joel Savitz
> Won't be possible to use put_user here? Something like
>
> static int prctl_get_tasksize(unsigned long __user *uaddr)
> {
> return put_user(TASK_SIZE, uaddr) ? -EFAULT : 0;
> }

What would be the benefit of using put_user() over copy_to_user() in
this context?
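
For anyone following along, the two variants differ mainly in how the
transfer size is determined. A kernel-style sketch of both, for comparison
only (not code from this series):

#include <linux/uaccess.h>

/* put_user() derives the size from the pointer's type and expands to a
 * single optimized access for simple scalar values. */
static int prctl_get_tasksize_put(unsigned long __user *uaddr)
{
	return put_user(TASK_SIZE, uaddr) ? -EFAULT : 0;
}

/* copy_to_user() takes an explicit byte count and is meant for
 * arbitrary-length copies; for one word it is simply more verbose. */
static int prctl_get_tasksize_copy(void __user *uaddr)
{
	unsigned long task_size = TASK_SIZE;

	return copy_to_user(uaddr, &task_size, sizeof(task_size)) ? -EFAULT : 0;
}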


Re: [PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-02 Thread Joel Savitz
Yes, this the change, thanks to the suggestion of Yury Norov.

I also now explicitly mention the expected userspace destination type
in the manpage patch.

Best,
Joel Savitz


On Thu, May 2, 2019 at 5:10 PM Cyrill Gorcunov  wrote:
>
> On Thu, May 02, 2019 at 05:01:38PM -0400, Waiman Long wrote:
> >
> > What did you change in v2 versus v1?
>
> Seems unsigned long long has been changed to unsigned long.


[PATCH v2 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-02 Thread Joel Savitz
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
copy the value of TASK_SIZE to the userspace address in arg2.

Suggested-by: Alexey Dobriyan 
Signed-off-by: Joel Savitz 
---
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..2335fe0a8db8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,7 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY(1UL << 3)
 # define PR_PAC_APGAKEY(1UL << 4)
 
+/* Get the process virtual memory size */
+#define PR_GET_TASK_SIZE   55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 12df0e5434b8..7ced7dbd035d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2252,6 +2252,13 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
return 1;
 }
 
+static int prctl_get_tasksize(void __user * uaddr)
+{
+   unsigned long task_size = TASK_SIZE;
+   return copy_to_user(uaddr, &task_size, sizeof(unsigned long)) ? -EFAULT : 0;
+}
+
 int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which)
 {
return -EINVAL;
@@ -2486,6 +2493,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+   case PR_GET_TASK_SIZE:
+   error = prctl_get_tasksize((void *)arg2) ;
+   break;
default:
error = -EINVAL;
break;
-- 
2.18.1



[PATCH v2 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option

2019-05-02 Thread Joel Savitz
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit
of future generations.

Signed-off-by: Joel Savitz 
---
 man2/prctl.2 | 9 +
 1 file changed, 9 insertions(+)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 06d8e13c7..35a6a3919 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -49,6 +49,7 @@
 .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER
 .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER
 .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT
+.\" 2019-05-02 Joel Savitz, document PR_GET_TASK_SIZE
 .\"
 .\"
 .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual"
@@ -1375,6 +1376,14 @@ system call on Tru64).
 for information on versions and architectures)
 Return unaligned access control bits, in the location pointed to by
 .IR "(unsigned int\ *) arg2" .
+.TP
+.B PR_GET_TASK_SIZE
+Copy the value of TASK_SIZE to the userspace address in
+.IR "(unsigned long\ *) arg2" .
+Return
+.B EFAULT
+if this operation fails.
+
 .SH RETURN VALUE
 On success,
 .BR PR_GET_DUMPABLE ,
-- 
2.18.1



[PATCH v2 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-02 Thread Joel Savitz
In the mainline kernel, there is no quick mechanism to get the virtual
memory size of the current process from userspace.

Despite the current state of affairs, this information is available to the
user through several means, one being a linear search of the entire address
space. This is an inefficient use of cpu cycles.

A component of the libhugetlb kernel test does exactly this, and as
systems' address spaces increase beyond 32-bits, this method becomes
exceedingly tedious.

For example, on a ppc64le system with a 47-bit address space, the linear
search causes the test to hang for some unknown amount of time. I
couldn't give you an exact number because I just ran it for about 10-20
minutes and went to go do something else, probably to get coffee or
something, and when I came back, I just killed the test and patched it
to use this new mechanism. I re-ran my new version of the test using a
kernel with this patch, and of course it passed through the previously
bottlenecking codepath nearly instantaneously.

As such, I propose that the prctl syscall be extended to include the
option to retrieve TASK_SIZE from the kernel.

This patch will allow us to upgrade an O(n) codepath to O(1) in an
architecture-independent manner, and provide a mechanism for others
to do the same.

Joel Savitz(2):
  sys/prctl: add PR_GET_TASK_SIZE option to prctl(2)
  prctl.2: Document the new PR_GET_TASK_SIZE option

 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 10 ++
 2 files changed, 13 insertions(+)

 man2/prctl.2 | 9 +
 1 file changed, 9 insertions(+)

--
2.18.1



[PATCH 2/2] prctl.2: Document the new PR_GET_TASK_SIZE option

2019-05-02 Thread Joel Savitz
Add a short explanation of the new PR_GET_TASK_SIZE option for the benefit
of future generations.

Signed-off-by: Joel Savitz 
---
 man2/prctl.2 | 9 +
 1 file changed, 9 insertions(+)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 06d8e13c7..35a6a3919 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -49,6 +49,7 @@
 .\" 2013-01-10 Kees Cook, document PR_SET_PTRACER
 .\" 2012-02-04 Michael Kerrisk, document PR_{SET,GET}_CHILD_SUBREAPER
 .\" 2014-11-10 Dave Hansen, document PR_MPX_{EN,DIS}ABLE_MANAGEMENT
+.\" 2019-05-02 Joel Savitz, document PR_GET_TASK_SIZE
 .\"
 .\"
 .TH PRCTL 2 2019-03-06 "Linux" "Linux Programmer's Manual"
@@ -1375,6 +1376,14 @@ system call on Tru64).
 for information on versions and architectures)
 Return unaligned access control bits, in the location pointed to by
 .IR "(unsigned int\ *) arg2" .
+.TP
+.B PR_GET_TASK_SIZE
+Copy the value of TASK_SIZE to the userspace address in
+.IR "arg2" .
+Return
+.B EFAULT
+if this operation fails.
+
 .SH RETURN VALUE
 On success,
 .BR PR_GET_DUMPABLE ,
-- 
2.18.1



[PATCH 1/2] kernel/sys: add PR_GET_TASK_SIZE option to prctl(2)

2019-05-02 Thread Joel Savitz
When PR_GET_TASK_SIZE is passed to prctl, the kernel will attempt to
copy the value of TASK_SIZE to the userspace address in arg2.

Suggested-by: Alexey Dobriyan 
Signed-off-by: Joel Savitz 
---
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 094bb03b9cc2..2335fe0a8db8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -229,4 +229,7 @@ struct prctl_mm_map {
 # define PR_PAC_APDBKEY(1UL << 3)
 # define PR_PAC_APGAKEY(1UL << 4)
 
+/* Get the process virtual memory size */
+#define PR_GET_TASK_SIZE   55
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 12df0e5434b8..7ced7dbd035d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2252,6 +2252,13 @@ static int propagate_has_child_subreaper(struct task_struct *p, void *data)
return 1;
 }
 
+static int prctl_get_tasksize(void __user * uaddr)
+{
+   unsigned long long task_size = TASK_SIZE;
+   return copy_to_user(uaddr, &task_size, sizeof(unsigned long long)) ? -EFAULT : 0;
+}
+
 int __weak arch_prctl_spec_ctrl_get(struct task_struct *t, unsigned long which)
 {
return -EINVAL;
@@ -2486,6 +2493,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
return -EINVAL;
error = PAC_RESET_KEYS(me, arg2);
break;
+   case PR_GET_TASK_SIZE:
+   error = prctl_get_tasksize((void *)arg2) ;
+   break;
default:
error = -EINVAL;
break;
-- 
2.18.1



[PATCH 0/2] sys/prctl: expose TASK_SIZE value to userspace

2019-05-02 Thread Joel Savitz
In the mainline kernel, there is no quick mechanism to get the virtual
memory size of the current process from userspace.

Despite the current state of affairs, this information is available to the
user through several means, one being a linear search of the entire address
space. This is an inefficient use of cpu cycles.

A component of the libhugetlb kernel test does exactly this, and as
systems' address spaces increase beyond 32-bits, this method becomes
exceedingly tedious.

For example, on a ppc64le system with a 47-bit address space, the linear
search causes the test to hang for some unknown amount of time. I
couldn't give you an exact number because I just ran it for about 10-20
minutes and went to go do something else, probably to get coffee or
something, and when I came back, I just killed the test and patched it
to use this new mechanism. I re-ran my new version of the test using a
kernel with this patch, and of course it passed through the previously
bottlenecking codepath nearly instantaneously.

As such, I propose that the prctl syscall be extended to include the
option to retrieve TASK_SIZE from the kernel.

This patch will allow us to upgrade an O(n) codepath to O(1) in an
architecture-independent manner, and provide a mechanism for others
to do the same.

Joel Savitz(2):
  sys/prctl: add PR_GET_TASK_SIZE option to prctl(2)
  prctl.2: Document the new PR_GET_TASK_SIZE option

 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c   | 10 ++
 2 files changed, 13 insertions(+)

 man2/prctl.2 | 9 +
 1 file changed, 9 insertions(+)

--
2.18.1



Re: [PATCH v2] fs/proc: add VmTaskSize field to /proc/$$/status

2019-04-30 Thread Joel Savitz
Good point Alexey.

Expect v3 shortly.

Best,
Joel Savitz


On Sat, Apr 27, 2019 at 5:45 PM Alexey Dobriyan  wrote:
>
> On Fri, Apr 26, 2019 at 03:02:08PM -0400, Joel Savitz wrote:
> > In the mainline kernel, there is no quick mechanism to get the virtual
> > memory size of the current process from userspace.
> >
> > Despite the current state of affairs, this information is available to the
> > user through several means, one being a linear search of the entire address
> > space. This is an inefficient use of cpu cycles.
>
> You can test only a few known per arch values. Linear search is a self
> inflicted wound.
>
> prctl(2) is more natural place and will also be arch neutral.
>
> > A component of the libhugetlb kernel test does exactly this, and as
> > systems' address spaces increase beyond 32-bits, this method becomes
> > exceedingly tedious.
>
> > For example, on a ppc64le system with a 47-bit address space, the linear
> > search causes the test to hang for some unknown amount of time. I
> > couldn't give you an exact number because I just ran it for about 10-20
> > minutes and went to go do something else, probably to get coffee or
> > something, and when I came back, I just killed the test and patched it
> > to use this new mechanism. I re-ran my new version of the test using a
> > kernel with this patch, and of course it passed through the previously
> > bottlenecking codepath nearly instantaneously.
> >
> > This patched enabled me to upgrade an O(n) codepath to O(1) in an
> > architecture-independent manner.
>
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
> >   seq_put_decimal_ull_width(m,
> >   " kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
> >   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> > - seq_puts(m, " kB\n");
> > + SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
> > + seq_put_decimal_ull_width(m,
> > + " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8);
> > + seq_puts(m, " kB\n");
>
> All fields in this file are related to the task. New field related
> to "current" will stick like an eyesore.


[PATCH v2] fs/proc: add VmTaskSize field to /proc/$$/status

2019-04-26 Thread Joel Savitz
In the mainline kernel, there is no quick mechanism to get the virtual
memory size of the current process from userspace.

Despite the current state of affairs, this information is available to the
user through several means, one being a linear search of the entire address
space. This is an inefficient use of cpu cycles.

A component of the libhugetlb kernel test does exactly this, and as
systems' address spaces increase beyond 32-bits, this method becomes
exceedingly tedious.

For example, on a ppc64le system with a 47-bit address space, the linear
search causes the test to hang for some unknown amount of time. I
couldn't give you an exact number because I just ran it for about 10-20
minutes and went to go do something else, probably to get coffee or
something, and when I came back, I just killed the test and patched it
to use this new mechanism. I re-ran my new version of the test using a
kernel with this patch, and of course it passed through the previously
bottlenecking codepath nearly instantaneously.

This patch enabled me to upgrade an O(n) codepath to O(1) in an
architecture-independent manner.

Signed-off-by: Joel Savitz 
---
 Documentation/filesystems/proc.txt | 2 ++
 fs/proc/task_mmu.c | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66cad5c86171..1c6a912e3975 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -187,6 +187,7 @@ read the file /proc/PID/status:
   VmLib:  1412 kB
   VmPTE:20 kb
   VmSwap:0 kB
+  VmTaskSize:  137438953468 kB
   HugetlbPages:  0 kB
   CoreDumping:0
   THP_enabled:   1
@@ -263,6 +264,7 @@ Table 1-2: Contents of the status files (as of 4.19)
  VmPTE   size of page table entries
  VmSwap  amount of swap used by anonymous private data
  (shmem swap usage is not included)
+ VmTaskSize  size of entire virtual address space of a process
  HugetlbPagessize of hugetlb memory portions
  CoreDumping process's memory is currently being dumped
  (killing the process may lead to a corrupted core)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95ca1fe7283c..0ddd51479f90 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
seq_put_decimal_ull_width(m,
" kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
-   seq_puts(m, " kB\n");
+   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+   seq_put_decimal_ull_width(m,
+   " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8);
+   seq_puts(m, " kB\n");
hugetlb_report_usage(m, mm);
 }
 #undef SEQ_PUT_DEC
-- 
2.18.1



[PATCH v2] mm/oom_killer: Add task UID to info message on an oom kill

2019-04-26 Thread Joel Savitz
In the event of an oom kill, useful information about the killed
process is printed to dmesg. Users, especially system administrators,
will find it useful to immediately see the UID of the process.

In the following example, abuse_the_ram is the name of a program
that attempts to iteratively allocate all available memory until it is
stopped by force.

Current message:

Out of memory: Killed process 35389 (abuse_the_ram)
total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
shmem-rss:0kB

Patched message:

Out of memory: Killed process 2739 (abuse_the_ram),
total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
shmem-rss:0kB, UID 0

Signed-off-by: Joel Savitz 
---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..af2e3faa72a0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -874,12 +874,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 */
do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
mark_oom_victim(victim);
-   pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",
+   pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB, UID %d\n",
message, task_pid_nr(victim), victim->comm,
K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-   K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+   K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+   from_kuid(&init_user_ns, task_uid(victim)));
task_unlock(victim);
 
/*
-- 
2.18.1



[PATCH] fs/proc: add VmTaskSize field to /proc/$$/status

2019-04-25 Thread Joel Savitz
Currently, there is no fast mechanism to get the virtual memory size of
the current process from userspace. This information is available to the
user through several means, one being a linear search of the entire address
space. This is the method used by a component of the libhugetlb kernel
test, and using the mechanism proposed in this patch, the time complexity
of that test would be upgraded to constant time from linear time. This is
especially relevant on 64-bit architectures where a linear search of
the address space could take an absurd amount of time. Using this
mechanism, the modification to the test component would be portable
across all architectures.

Signed-off-by: Joel Savitz 
---
 fs/proc/task_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 92a91e7816d8..f64b9a949624 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -74,7 +74,10 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
seq_put_decimal_ull_width(m,
" kB\nVmPTE:\t", mm_pgtables_bytes(mm) >> 10, 8);
SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
-   seq_puts(m, " kB\n");
+   SEQ_PUT_DEC(" kB\nVmSwap:\t", swap);
+   seq_put_decimal_ull_width(m,
+   " kB\nVmTaskSize:\t", TASK_SIZE >> 10, 8);
+   seq_puts(m, " kB\n");
hugetlb_report_usage(m, mm);
 }
 #undef SEQ_PUT_DEC
-- 
2.18.1



[PATCH] mm/oom_killer: Add task UID to info message on an oom kill

2019-04-24 Thread Joel Savitz
In the event of an oom kill, useful information about the killed
process is printed to dmesg. Users, especially system administrators,
will find it useful to immediately see the UID of the process.

In the following example, abuse_the_ram is the name of a program
that attempts to iteratively allocate all available memory until it is
stopped by force.

Current message:

Out of memory: Killed process 35389 (abuse_the_ram)
total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
shmem-rss:0kB

Patched message:

Out of memory: Killed process 2739 (abuse_the_ram), UID 0,
total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
shmem-rss:0kB

Signed-off-by: Joel Savitz 
---
 mm/oom_kill.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..22972648b758 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -874,9 +874,9 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 */
do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
mark_oom_victim(victim);
-   pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+   pr_err("%s: Killed process %d (%s), UID %d, total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
message, task_pid_nr(victim), victim->comm,
-   K(victim->mm->total_vm),
+   task_uid(victim).val, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
K(get_mm_counter(victim->mm, MM_FILEPAGES)),
K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
-- 
2.18.1



[tip:sched/core] sched/core: Fix typo in comment

2019-04-19 Thread tip-bot for Joel Savitz
Commit-ID:  bee9853932e90ce94bce4276ec6b7b06bc48070b
Gitweb: https://git.kernel.org/tip/bee9853932e90ce94bce4276ec6b7b06bc48070b
Author: Joel Savitz 
AuthorDate: Wed, 6 Mar 2019 20:13:33 -0500
Committer:  Ingo Molnar 
CommitDate: Fri, 19 Apr 2019 12:22:16 +0200

sched/core: Fix typo in comment

Signed-off-by: Joel Savitz 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: triv...@kernel.org
Link: http://lkml.kernel.org/r/1551921213-813-1-git-send-email-jsav...@redhat.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b64ef0d5589..fb09eaad1d3a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -924,7 +924,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 }
 
 /*
- * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)


[PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-04-09 Thread Joel Savitz
If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus field
in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
This argument's value does not make sense in this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis, as a task that was previously
limited to only be run on cpu N will be limited to every cpu _except_ for
cpu N after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:   04
Cpus_allowed_list:  2

# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

On a patched system, the final grep produces the following
output instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

This patch changes the above behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and to cpu_possible_mask in
legacy mode.

This fallback mechanism is only triggered if _every_ other valid avenue
has been traveled, and it is the last resort before calling BUG().
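
As a side note, the Cpus_allowed values grepped above can also be read
programmatically. The sketch below is illustrative only and not part of
this patch; it prints the calling task's affinity via sched_getaffinity(2),
which reports the same mask that /proc/$$/status exposes:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Print the CPUs the current task is allowed to run on. */
int main(void)
{
        cpu_set_t set;
        int cpu;

        if (sched_getaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_getaffinity");
                return 1;
        }
        printf("allowed cpus:");
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &set))
                        printf(" %d", cpu);
        printf("\n");
        return 0;
}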

Signed-off-by: Joel Savitz 
---
 kernel/cgroup/cpuset.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4834c4214e9c..6c9deb2cc687 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3255,10 +3255,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
spin_unlock_irqrestore(&callback_lock, flags);
 }
 
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: In the case that the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode however, this value is the same as task_cs(tsk)->effective_cpus,
+ * which will not contain a sane cpumask during cases such as cpu hotplugging.
+ * This is the absolute last resort for the scheduler and it is only used if
+ * _every_ other avenue has been traveled.
+ **/
+
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+   task_cs(tsk)->cpus_allowed : cpu_possible_mask);
rcu_read_unlock();
 
/*
-- 
2.18.1



Fwd: [RESEND PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-04-05 Thread Joel Savitz
-- Forwarded message -
From: Joel Savitz 
Date: Fri, Apr 5, 2019 at 11:37 AM
Subject: [RESEND PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()
To: 
Cc: Joel Savitz , Li Zefan ,



If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus field
in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
This argument's value does not make sense in this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis, as a task that was previously
limited to only be run on cpu N will be limited to every cpu _except_ for
cpu N after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:   04
Cpus_allowed_list:  2

# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

On a patched system, the final grep produces the following
output instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

This patch changes the above behavior by instead simply resetting the mask
to cpu_possible_mask.

Signed-off-by: Joel Savitz 
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 479743db6c37..5f65a2167bdf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, cpu_possible_mask);
rcu_read_unlock();

/*
--
2.20.1


Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-26 Thread Joel Savitz
Forgot to add cc's... my bad.

Best,
Joel Savitz

On Tue, Mar 26, 2019 at 1:31 PM Joel Savitz  wrote:
>
> Ping!
>
> Does anyone have any comments or concerns about this patch?
>
> Best,
> Joel Savitz
>
>
>
> On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz  wrote:
> >
> > On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz  wrote:
> > >
> > > If a process is limited by taskset (i.e. cpuset) to only be allowed to
> > > run on cpu N, and then cpu N is offlined via hotplug, the process will
> > > be assigned the current value of its cpuset cgroup's effective_cpus field
> > > in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> > > This argument's value does not make sense in this case, because
> > > task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> > > to reflect the new value of cpu_active_mask after cpu N is removed from
> > > the mask. While this may make sense for the cgroup affinity mask, it
> > > does not make sense on a per-task basis, as a task that was previously
> > > limited to only be run on cpu N will be limited to every cpu _except_ for
> > > cpu N after it is offlined/onlined via hotplug.
> > >
> > > Pre-patch behavior:
> > >
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   ff
> > > Cpus_allowed_list:  0-7
> > >
> > > $ taskset -p 4 $$
> > > pid 19202's current affinity mask: f
> > > pid 19202's new affinity mask: 4
> > >
> > > $ grep Cpus /proc/self/status
> > > Cpus_allowed:   04
> > > Cpus_allowed_list:  2
> > >
> > > # echo off > /sys/devices/system/cpu/cpu2/online
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   0b
> > > Cpus_allowed_list:  0-1,3
> > >
> > > # echo on > /sys/devices/system/cpu/cpu2/online
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   0b
> > > Cpus_allowed_list:  0-1,3
> > >
> > > On a patched system, the final grep produces the following
> > > output instead:
> > >
> > > $ grep Cpus /proc/$$/status
> > > Cpus_allowed:   ff
> > > Cpus_allowed_list:  0-7
> > >
> > > This patch changes the above behavior by instead simply resetting the mask
> > > to cpu_possible_mask.
> > >
> > > Signed-off-by: Joel Savitz 
> > > ---
> > >  kernel/cgroup/cpuset.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > > index 479743db6c37..5f65a2167bdf 100644
> > > --- a/kernel/cgroup/cpuset.c
> > > +++ b/kernel/cgroup/cpuset.c
> > > @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> > >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> > >  {
> > > rcu_read_lock();
> > > -   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > > +   do_set_cpus_allowed(tsk, cpu_possible_mask);
> > > rcu_read_unlock();
> > >
> > > /*
> > > --
> > > 2.20.1
> > >


Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-26 Thread Joel Savitz
Ping!

Does anyone have any comments or concerns about this patch?

Best,
Joel Savitz



On Thu, Mar 7, 2019 at 9:42 AM Joel Savitz  wrote:
>
> On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz  wrote:
> >
> > If a process is limited by taskset (i.e. cpuset) to only be allowed to
> > run on cpu N, and then cpu N is offlined via hotplug, the process will
> > be assigned the current value of its cpuset cgroup's effective_cpus field
> > in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> > This argument's value does not make sense in this case, because
> > task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> > to reflect the new value of cpu_active_mask after cpu N is removed from
> > the mask. While this may make sense for the cgroup affinity mask, it
> > does not make sense on a per-task basis, as a task that was previously
> > limited to only be run on cpu N will be limited to every cpu _except_ for
> > cpu N after it is offlined/onlined via hotplug.
> >
> > Pre-patch behavior:
> >
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   ff
> > Cpus_allowed_list:  0-7
> >
> > $ taskset -p 4 $$
> > pid 19202's current affinity mask: f
> > pid 19202's new affinity mask: 4
> >
> > $ grep Cpus /proc/self/status
> > Cpus_allowed:   04
> > Cpus_allowed_list:  2
> >
> > # echo off > /sys/devices/system/cpu/cpu2/online
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   0b
> > Cpus_allowed_list:  0-1,3
> >
> > # echo on > /sys/devices/system/cpu/cpu2/online
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   0b
> > Cpus_allowed_list:  0-1,3
> >
> > On a patched system, the final grep produces the following
> > output instead:
> >
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed:   ff
> > Cpus_allowed_list:  0-7
> >
> > This patch changes the above behavior by instead simply resetting the mask
> > to cpu_possible_mask.
> >
> > Signed-off-by: Joel Savitz 
> > ---
> >  kernel/cgroup/cpuset.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 479743db6c37..5f65a2167bdf 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
> >  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
> >  {
> > rcu_read_lock();
> > -   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> > +   do_set_cpus_allowed(tsk, cpu_possible_mask);
> > rcu_read_unlock();
> >
> > /*
> > --
> > 2.20.1
> >


Re: [PATCH] sched: fix spelling of active in comment on is_cpu_allowed()

2019-03-07 Thread Joel Savitz
Best,
Joel Savitz


On Wed, Mar 6, 2019 at 8:13 PM Joel Savitz  wrote:
>
> Trivial fix of a documentation typo.
>
> Signed-off-by: Joel Savitz 
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ead464a..2e2e19b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -920,7 +920,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
>  }
>
>  /*
> - * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
> + * Per-CPU kthreads are allowed to run on !active && online CPUs, see
>   * __set_cpus_allowed_ptr() and select_fallback_rq().
>   */
>  static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
> --
> 1.8.3.1
>


Re: [PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-07 Thread Joel Savitz
On Wed, Mar 6, 2019 at 7:55 PM Joel Savitz  wrote:
>
> If a process is limited by taskset (i.e. cpuset) to only be allowed to
> run on cpu N, and then cpu N is offlined via hotplug, the process will
> be assigned the current value of its cpuset cgroup's effective_cpus field
> in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> This argument's value does not make sense in this case, because
> task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> to reflect the new value of cpu_active_mask after cpu N is removed from
> the mask. While this may make sense for the cgroup affinity mask, it
> does not make sense on a per-task basis, as a task that was previously
> limited to only be run on cpu N will be limited to every cpu _except_ for
> cpu N after it is offlined/onlined via hotplug.
>
> Pre-patch behavior:
>
> $ grep Cpus /proc/$$/status
> Cpus_allowed:   ff
> Cpus_allowed_list:  0-7
>
> $ taskset -p 4 $$
> pid 19202's current affinity mask: f
> pid 19202's new affinity mask: 4
>
> $ grep Cpus /proc/self/status
> Cpus_allowed:   04
> Cpus_allowed_list:  2
>
> # echo off > /sys/devices/system/cpu/cpu2/online
> $ grep Cpus /proc/$$/status
> Cpus_allowed:   0b
> Cpus_allowed_list:  0-1,3
>
> # echo on > /sys/devices/system/cpu/cpu2/online
> $ grep Cpus /proc/$$/status
> Cpus_allowed:   0b
> Cpus_allowed_list:  0-1,3
>
> On a patched system, the final grep produces the following
> output instead:
>
> $ grep Cpus /proc/$$/status
> Cpus_allowed:   ff
> Cpus_allowed_list:  0-7
>
> This patch changes the above behavior by instead simply resetting the mask
> to cpu_possible_mask.
>
> Signed-off-by: Joel Savitz 
> ---
>  kernel/cgroup/cpuset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 479743db6c37..5f65a2167bdf 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
> rcu_read_lock();
> -   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> +   do_set_cpus_allowed(tsk, cpu_possible_mask);
> rcu_read_unlock();
>
> /*
> --
> 2.20.1
>


[PATCH] sched: fix spelling of active in comment on is_cpu_allowed()

2019-03-06 Thread Joel Savitz
Trivial fix of a documentation typo.

Signed-off-by: Joel Savitz 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ead464a..2e2e19b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -920,7 +920,7 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
 }
 
 /*
- * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
+ * Per-CPU kthreads are allowed to run on !active && online CPUs, see
  * __set_cpus_allowed_ptr() and select_fallback_rq().
  */
 static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
-- 
1.8.3.1



[PATCH] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-03-06 Thread Joel Savitz
If a process is limited by taskset (i.e. cpuset) to only be allowed to
run on cpu N, and then cpu N is offlined via hotplug, the process will
be assigned the current value of its cpuset cgroup's effective_cpus field
in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
This argument's value does not make sense in this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis, as a task that was previously
limited to only be run on cpu N will be limited to every cpu _except_ for
cpu N after it is offlined/onlined via hotplug.

Pre-patch behavior:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4

$ grep Cpus /proc/self/status
Cpus_allowed:   04
Cpus_allowed_list:  2

# echo off > /sys/devices/system/cpu/cpu2/online 
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

# echo on > /sys/devices/system/cpu/cpu2/online 
$ grep Cpus /proc/$$/status
Cpus_allowed:   0b
Cpus_allowed_list:  0-1,3

On a patched system, the final grep produces the following
output instead:

$ grep Cpus /proc/$$/status
Cpus_allowed:   ff
Cpus_allowed_list:  0-7

This patch changes the above behavior by instead simply resetting the mask
to cpu_possible_mask.

Signed-off-by: Joel Savitz 
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 479743db6c37..5f65a2167bdf 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3243,7 +3243,7 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
 void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
 {
rcu_read_lock();
-   do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+   do_set_cpus_allowed(tsk, cpu_possible_mask);
rcu_read_unlock();
 
/*
-- 
2.20.1