Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Thu, 20 Aug 2020 10:41:39 -0700 Andrei Vagin wrote: > Unfotunatly, I don't have enough time to lead a process of pushing > task_diag into the upstream. So if it is interesting for you, you can > restart this process and I am ready to help as much as time will > permit. > > I think the main blocking issue was a lack of interest from the wide > audience to this. The slow proc is the problem just for a few users, > but task_diag is a big subsystem that repeats functionality of another > subsystem with all derived problems like code duplication. Unfortunately I don't have much time either and yes it sounds like upstreaming a new interface like this will require input & enthusiasm from more of those who are monitoring large numbers of processes, which is not really me.. A related issue is that task_diag doesn't currently support the cgroup filesystem which has the same issues as /proc and is accessed very heavily by e.g. the Kubernetes kubelet cadvisor. Perhaps more interest in tackling this could come from the Kubernetes community. > > Another blocking issue is a new interface. There was no consensus on > this. Initially, I suggested to use netlink sockets, but developers > from non-network subsystem objected on this, so the transaction file > interface was introduced. The main idea similar to netlink sockets is > that we write a request and read a response. > > There were some security concerns but I think I fixed them. There's currently a lot of momentum behind io_uring which could not only enable efficient enumeration and retrieval of small files but maybe it would also be a more natural place for an API like task_diag.. Best Wishes, Eugene
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Mon, 10 Aug 2020 17:41:32 +0200 Greg KH wrote: > On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote: > > On Mon, 10 Aug 2020 17:04:53 +0200 > > Greg KH wrote: > And have you benchmarked any of this? Try working with the common > tools that want this information and see if it actually is noticeable > (hint, I have been doing that with the readfile work and it's > surprising what the results are in places...) Apologies for the delay. Here are some benchmarks with atop. Patch to atop at: https://github.com/eug48/atop/commits/proc-all Patch to add /proc/all/schedstat & cpuset below. atop not collecting threads & cmdline as /proc/all/ doesn't support it. 10,000 processes, kernel 5.8, nested KVM, 2 cores of i7-6700HQ @ 2.60GHz # USE_PROC_ALL=0 ./atop -w test 1 & # pidstat -p $(pidof atop) 1 01:33:05 %usr %system %guest %wait%CPU CPU Command 01:33:06 33.66 33.660.000.99 67.33 1 atop 01:33:07 33.00 32.000.002.00 65.00 0 atop 01:33:08 34.00 31.000.001.00 65.00 0 atop ... Average: 33.15 32.790.001.09 65.94 - atop # USE_PROC_ALL=1 ./atop -w test 1 & # pidstat -p $(pidof atop) 1 01:33:33 %usr %system %guest %wait%CPU CPU Command 01:33:34 28.00 14.000.001.00 42.00 1 atop 01:33:35 28.00 14.000.000.00 42.00 1 atop 01:33:36 26.00 13.000.000.00 39.00 1 atop ... Average: 27.08 12.860.000.35 39.94 - atop So CPU usage goes down from ~65% to ~40%. Data collection times in milliseconds are: # xsv cat columns proc.csv procall.csv \ > | xsv stats \ > | xsv select field,min,max,mean,stddev \ > | xsv table field min max mean stddev /proc time 558 625 586.59 18.29 /proc/all time 231 262 243.56 8.02 Much performance optimisation can still be done, e.g. the modified atop uses fgets which is reading 1KB at a time, and seq_file seems to only return 4KB pages. task_diag should be much faster still. I'd imagine this sort of thing would be useful for daemons monitoring large numbers of processes. I don't run such systems myself; my initial motivation was frustration with the Kubernetes kubelet having ~2-4% CPU usage even with a couple of containers. Basic profiling suggests syscalls have a lot to do with it - it's actually reading loads of tiny cgroup files and enumerating many directories every 10 seconds, but /proc has similar issues and seemed easier to start with. Anyway, I've read that io_uring could also help here in the near future, which would be really cool especially if there was a way to enumerate directories and read many files regex-style in a single operation, e.g. /proc/[0-9].*/(stat|statm|io) > > Currently I'm trying to re-use the existing code in fs/proc that > > controls which PIDs are visible, but may well be missing > > something.. > > Try it out and see if it works correctly. And pid namespaces are not > the only thing these days from what I call :) > I've tried `unshare --fork --pid --mount-proc cat /proc/all/stat` which seems to behave correctly. ptrace flags are handled by the existing code. Best Wishes, Eugene From 2ffc2e388f7ce4e3f182c2442823e5f13bae03dd Mon Sep 17 00:00:00 2001 From: Eugene Lubarsky Date: Tue, 25 Aug 2020 12:36:41 +1000 Subject: [RFC PATCH] fs/proc: /proc/all: add schedstat and cpuset Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 42 ++ 1 file changed, 42 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 0bba4b3a985e..44d73f1ade4a 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3944,6 +3944,36 @@ static int proc_all_io(struct seq_file *m, void *v) } #endif +#ifdef CONFIG_PROC_PID_CPUSET +static int proc_all_cpuset(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + + return proc_cpuset_show(m, ns, pid, task); +} +#endif + +#ifdef CONFIG_SCHED_INFO +static int proc_all_schedstat(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + + return proc_pid_schedstat(m, ns, pid, task); +} +#endif + static int proc_all_statx(struct seq_file *m, void *v) { struct all_iter *iter = (struct all_iter *) v; @@ -3990,6 +4020,12 @@ PROC_ALL_OPS(status); #ifdef CONFIG_TASK_IO_ACCOUNTING PROC_ALL_OPS(io); #e
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Wed, 12 Aug 2020 00:51:35 -0700 Andrei Vagin wrote: > Maybe we need resurrect the task_diag series instead of inventing > another less-effective interface... I would certainly welcome the resurrection of task_diag - it is clearly more efficient than this /proc/all/ idea. It would be good to find out if there's anything in particular that's currently blocking it. This RFC is mainly meant to check whether such an addition would be acceptable from an API point of view. It currently has an obvious performance issue in that seq_file seems to only return one page at a time so lots of read syscalls are still required. However I may not have the time to figure out a proposed fix for this by myself. Regardless, text-based formats can't match the efficiency of task_diag, but binary ones are also possible. Best Wishes, Eugene
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Mon, 10 Aug 2020 17:04:53 +0200 Greg KH wrote: > How many syscalls does this save on? > > Perhaps you want my proposed readfile(2) syscall: > > https://lore.kernel.org/r/20200704140250.423345-1-gre...@linuxfoundation.org > to help out with things like this? :) The proposed readfile sounds great and would help, but if there are 1000 processes wouldn't that require 1000 readfile calls to read their proc files? With something like this the stats for 1000 processes could be retrieved with an open, a few reads and a close. > > > The proposed files in this proof-of-concept patch set are: > > > > * /proc/all/stat > > I think the problem will be defining "all" in the case of the specific > namespace you are dealing with, right? How will this handle all of > those issues properly for all of these different statisics? > Currently I'm trying to re-use the existing code in fs/proc that controls which PIDs are visible, but may well be missing something.. Best Wishes, Eugene
[RFC PATCH 4/5] fs/proc: Introduce /proc/all/io
Returns io info for all visible processes. The data is the same as /proc/[pid]/io but formatted as a series of numbers on a single line. A PID column is also prepended. Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 66 -- 1 file changed, 59 insertions(+), 7 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 5982fd43dd21..03d48225b6d1 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2910,9 +2910,8 @@ static const struct file_operations proc_coredump_filter_operations = { #endif #ifdef CONFIG_TASK_IO_ACCOUNTING -static int do_io_accounting(struct task_struct *task, struct seq_file *m, int whole) +static int calc_io_accounting(struct task_struct *task, struct task_io_accounting *acct, int whole) { - struct task_io_accounting acct = task->ioac; unsigned long flags; int result; @@ -2928,12 +2927,27 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh if (whole && lock_task_sighand(task, )) { struct task_struct *t = task; - task_io_accounting_add(, >signal->ioac); + task_io_accounting_add(acct, >signal->ioac); while_each_thread(task, t) - task_io_accounting_add(, >ioac); + task_io_accounting_add(acct, >ioac); unlock_task_sighand(task, ); } + result = 0; + +out_unlock: + mutex_unlock(>signal->exec_update_mutex); + return result; +} +static int do_io_accounting(struct task_struct *task, struct seq_file *m, int whole) +{ + struct task_io_accounting acct = task->ioac; + int result; + + result = calc_io_accounting(task, , whole); + if (result) + return result; + seq_printf(m, "rchar: %llu\n" "wchar: %llu\n" @@ -2949,10 +2963,7 @@ static int do_io_accounting(struct task_struct *task, struct seq_file *m, int wh (unsigned long long)acct.read_bytes, (unsigned long long)acct.write_bytes, (unsigned long long)acct.cancelled_write_bytes); - result = 0; -out_unlock: - mutex_unlock(>signal->exec_update_mutex); return result; } @@ -3896,7 +3907,42 @@ static int proc_all_statm(struct seq_file *m, void *v) return proc_pid_statm(m, ns, pid, task); } +#ifdef CONFIG_TASK_IO_ACCOUNTING +static int proc_all_io_print_one(struct seq_file *m, struct task_struct *task) +{ + struct task_io_accounting acct = task->ioac; + int result; + result = calc_io_accounting(task, , 1); + if (result) + return result; + + seq_printf(m, + "%llu %llu %llu %llu %llu %llu %llu\n", + (unsigned long long)acct.rchar, + (unsigned long long)acct.wchar, + (unsigned long long)acct.syscr, + (unsigned long long)acct.syscw, + (unsigned long long)acct.read_bytes, + (unsigned long long)acct.write_bytes, + (unsigned long long)acct.cancelled_write_bytes); + + return result; +} + +static int proc_all_io(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + + return proc_all_io_print_one(m, task); +} +#endif static int proc_all_status(struct seq_file *m, void *v) { @@ -3915,6 +3961,9 @@ static int proc_all_status(struct seq_file *m, void *v) PROC_ALL_OPS(stat); PROC_ALL_OPS(statm); PROC_ALL_OPS(status); +#ifdef CONFIG_TASK_IO_ACCOUNTING + PROC_ALL_OPS(io); +#endif #define PROC_ALL_CREATE(NAME) \ do { \ @@ -3932,4 +3981,7 @@ void __init proc_all_init(void) PROC_ALL_CREATE(stat); PROC_ALL_CREATE(statm); PROC_ALL_CREATE(status); +#ifdef CONFIG_TASK_IO_ACCOUNTING + PROC_ALL_CREATE(io); +#endif } -- 2.25.1
[RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
This is an idea for substantially reducing the number of syscalls needed by monitoring tools whilst mostly re-using the existing API. The proposed files in this proof-of-concept patch set are: * /proc/all/stat A stat line for each process in the existing format. * /proc/all/statm statm lines but starting with a PID column. * /proc/all/status status info for all processes in the existing format. * /proc/all/io The existing /proc/pid/io data but formatted as a single line for each process, similarly to stat/statm, with a PID column added. * /proc/all/statx Gathers info from stat, statm and io; the purpose is actually not so much to reduce syscalls but to help userspace be more efficient by not having to store data in e.g. hashtables in order to gather it from separate /proc/all/ files. The format proposed here starts with the unchanged stat line and begins the other info with a few characters, repeating for each process: ... 25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ... m 662 188 167 5 0 112 0 io 4292 0 12 0 0 0 0 ... There has been a proposal with some overlapping goals: /proc/task-diag (https://github.com/avagin/linux-task-diag), but I'm not sure about its current status. Best Wishes, Eugene Eugene Lubarsky (5): fs/proc: Introduce /proc/all/stat fs/proc: Introduce /proc/all/statm fs/proc: Introduce /proc/all/status fs/proc: Introduce /proc/all/io fs/proc: Introduce /proc/all/statx fs/proc/base.c | 215 +++-- fs/proc/internal.h | 1 + fs/proc/root.c | 1 + 3 files changed, 210 insertions(+), 7 deletions(-) -- 2.25.1
[RFC PATCH 2/5] fs/proc: Introduce /proc/all/statm
Returns statm lines for all visible processes with prepended PIDs. Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index e0f60a1528b7..8396a38ba7d2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3884,6 +3884,18 @@ static int proc_all_stat(struct seq_file *m, void *v) return proc_tgid_stat(m, iter->ns, iter->tgid_iter.task->thread_pid, iter->tgid_iter.task); } +static int proc_all_statm(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + return proc_pid_statm(m, ns, pid, task); +} + #define PROC_ALL_OPS(NAME) static const struct seq_operations proc_all_##NAME##_ops = { \ .start = proc_all_start, \ @@ -3893,6 +3905,7 @@ static int proc_all_stat(struct seq_file *m, void *v) } PROC_ALL_OPS(stat); +PROC_ALL_OPS(statm); #define PROC_ALL_CREATE(NAME) \ do { \ @@ -3908,4 +3921,5 @@ void __init proc_all_init(void) return; PROC_ALL_CREATE(stat); + PROC_ALL_CREATE(statm); } -- 2.25.1
[RFC PATCH 3/5] fs/proc: Introduce /proc/all/status
Returns status lines for all visible processes in the existing format. Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 8396a38ba7d2..5982fd43dd21 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3897,6 +3897,14 @@ static int proc_all_statm(struct seq_file *m, void *v) } + +static int proc_all_status(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + + return proc_pid_status(m, iter->ns, iter->tgid_iter.task->thread_pid, iter->tgid_iter.task); +} + #define PROC_ALL_OPS(NAME) static const struct seq_operations proc_all_##NAME##_ops = { \ .start = proc_all_start, \ .next = proc_all_next, \ @@ -3906,6 +3914,7 @@ static int proc_all_statm(struct seq_file *m, void *v) PROC_ALL_OPS(stat); PROC_ALL_OPS(statm); +PROC_ALL_OPS(status); #define PROC_ALL_CREATE(NAME) \ do { \ @@ -3922,4 +3931,5 @@ void __init proc_all_init(void) PROC_ALL_CREATE(stat); PROC_ALL_CREATE(statm); + PROC_ALL_CREATE(status); } -- 2.25.1
[RFC PATCH 5/5] fs/proc: Introduce /proc/all/statx
Gathers info from stat, statm and io files. The purpose is not so much to reduce syscall numbers but to help userspace not have to store data in e.g. hashtables in order to gather it from separate /proc/all files. The format starts with an unchanged stat line and begins the statm & io lines with "m" or "io", repeating these for each process. e.g. ... 25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ... m 662 188 167 5 0 112 0 io 4292 0 12 0 0 0 0 ... Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 03d48225b6d1..5c6010c2ea1c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3944,6 +3944,31 @@ static int proc_all_io(struct seq_file *m, void *v) } #endif +static int proc_all_statx(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct pid *pid = iter->tgid_iter.task->thread_pid; + struct task_struct *task = iter->tgid_iter.task; + int err; + + err = proc_tgid_stat(m, ns, pid, task); + if (err) + return err; + + seq_puts(m, "m "); + err = proc_pid_statm(m, ns, pid, task); + if (err) + return err; + +#ifdef CONFIG_TASK_IO_ACCOUNTING + seq_puts(m, "io "); + err = proc_all_io_print_one(m, task); +#endif + + return err; +} + static int proc_all_status(struct seq_file *m, void *v) { struct all_iter *iter = (struct all_iter *) v; @@ -3960,6 +3985,7 @@ static int proc_all_status(struct seq_file *m, void *v) PROC_ALL_OPS(stat); PROC_ALL_OPS(statm); +PROC_ALL_OPS(statx); PROC_ALL_OPS(status); #ifdef CONFIG_TASK_IO_ACCOUNTING PROC_ALL_OPS(io); @@ -3980,6 +4006,7 @@ void __init proc_all_init(void) PROC_ALL_CREATE(stat); PROC_ALL_CREATE(statm); + PROC_ALL_CREATE(statx); PROC_ALL_CREATE(status); #ifdef CONFIG_TASK_IO_ACCOUNTING PROC_ALL_CREATE(io); -- 2.25.1
[RFC PATCH 1/5] fs/proc: Introduce /proc/all/stat
Returns stat lines for all visible processes in the existing format, aiming to substantially reduce the number of syscalls that are needed for this common task. Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 98 ++ fs/proc/internal.h | 1 + fs/proc/root.c | 1 + 3 files changed, 100 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index a333caeca291..e0f60a1528b7 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3811,3 +3811,101 @@ void __init set_proc_pid_nlink(void) nlink_tid = pid_entry_nlink(tid_base_stuff, ARRAY_SIZE(tid_base_stuff)); nlink_tgid = pid_entry_nlink(tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff)); } + + +/* + * /proc/all/ + */ + +struct all_iter { + struct tgid_iter tgid_iter; + struct proc_fs_info *fs_info; + struct pid_namespace *ns; +}; + +static void *proc_all_start(struct seq_file *m, loff_t *pos) +{ + struct all_iter *iter = kmalloc(sizeof(struct all_iter), GFP_KERNEL); + + iter->fs_info = proc_sb_info(file_inode(m->file)->i_sb); + iter->ns = proc_pid_ns(file_inode(m->file)->i_sb); + + iter->tgid_iter.tgid = *pos; + iter->tgid_iter.task = NULL; + iter->tgid_iter = next_tgid(iter->ns, iter->tgid_iter); + + if (!iter->tgid_iter.task) { + kfree(iter); + return NULL; + } + + return iter; +} + +static void *proc_all_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct all_iter *iter = (struct all_iter *) v; + struct proc_fs_info *fs_info = iter->fs_info; + struct tgid_iter *tgid_iter = >tgid_iter; + + do { + tgid_iter->tgid += 1; + *tgid_iter = next_tgid(iter->ns, *tgid_iter); + } while (tgid_iter->task && + !has_pid_permissions(fs_info, tgid_iter->task, HIDEPID_INVISIBLE)); + + *pos = tgid_iter->tgid; + + if (!tgid_iter->task) { + kfree(v); + return NULL; + } + + return iter; +} + +static void proc_all_stop(struct seq_file *m, void *v) +{ + if (v) { + struct all_iter *iter = (struct all_iter *) v; + struct task_struct *task = iter->tgid_iter.task; + + if (task) + put_task_struct(task); + + kfree(v); + } +} + +static int proc_all_stat(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + + return proc_tgid_stat(m, iter->ns, iter->tgid_iter.task->thread_pid, iter->tgid_iter.task); +} + + +#define PROC_ALL_OPS(NAME) static const struct seq_operations proc_all_##NAME##_ops = { \ + .start = proc_all_start, \ + .next = proc_all_next, \ + .stop = proc_all_stop, \ + .show = proc_all_##NAME \ +} + +PROC_ALL_OPS(stat); + +#define PROC_ALL_CREATE(NAME) \ + do { \ + if (!proc_create_seq(#NAME, 0, all_dir, _all_##NAME##_ops)) \ + return; \ + } while (0) + +void __init proc_all_init(void) +{ + struct proc_dir_entry *all_dir = proc_mkdir("all", NULL); + + if (!all_dir) + return; + + PROC_ALL_CREATE(stat); +} diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 917cc85e3466..b22d9cb619bf 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -171,6 +171,7 @@ extern int pid_delete_dentry(const struct dentry *); extern int proc_pid_readdir(struct file *, struct dir_context *); struct dentry *proc_pid_lookup(struct dentry *, unsigned int); extern loff_t mem_lseek(struct file *, loff_t, int); +extern void proc_all_init(void); /* Lookups */ typedef struct dentry *instantiate_t(struct dentry *, diff --git a/fs/proc/root.c b/fs/proc/root.c index 5e444d4f9717..4b5cfd2cdc0a 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -291,6 +291,7 @@ void __init proc_root_init(void) set_proc_pid_nlink(); proc_self_init(); proc_thread_self_init(); + proc_all_init(); proc_symlink("mounts", NULL, "self/mounts"); proc_net_init(); -- 2.25.1