Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Thu, 20 Aug 2020 10:41:39 -0700 Andrei Vagin wrote: > Unfotunatly, I don't have enough time to lead a process of pushing > task_diag into the upstream. So if it is interesting for you, you can > restart this process and I am ready to help as much as time will > permit. > > I think the main blocking issue was a lack of interest from the wide > audience to this. The slow proc is the problem just for a few users, > but task_diag is a big subsystem that repeats functionality of another > subsystem with all derived problems like code duplication. Unfortunately I don't have much time either and yes it sounds like upstreaming a new interface like this will require input & enthusiasm from more of those who are monitoring large numbers of processes, which is not really me.. A related issue is that task_diag doesn't currently support the cgroup filesystem which has the same issues as /proc and is accessed very heavily by e.g. the Kubernetes kubelet cadvisor. Perhaps more interest in tackling this could come from the Kubernetes community. > > Another blocking issue is a new interface. There was no consensus on > this. Initially, I suggested to use netlink sockets, but developers > from non-network subsystem objected on this, so the transaction file > interface was introduced. The main idea similar to netlink sockets is > that we write a request and read a response. > > There were some security concerns but I think I fixed them. There's currently a lot of momentum behind io_uring which could not only enable efficient enumeration and retrieval of small files but maybe it would also be a more natural place for an API like task_diag.. Best Wishes, Eugene
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Mon, 10 Aug 2020 17:41:32 +0200 Greg KH wrote: > On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote: > > On Mon, 10 Aug 2020 17:04:53 +0200 > > Greg KH wrote: > And have you benchmarked any of this? Try working with the common > tools that want this information and see if it actually is noticeable > (hint, I have been doing that with the readfile work and it's > surprising what the results are in places...) Apologies for the delay. Here are some benchmarks with atop. Patch to atop at: https://github.com/eug48/atop/commits/proc-all Patch to add /proc/all/schedstat & cpuset below. atop not collecting threads & cmdline as /proc/all/ doesn't support it. 10,000 processes, kernel 5.8, nested KVM, 2 cores of i7-6700HQ @ 2.60GHz # USE_PROC_ALL=0 ./atop -w test 1 & # pidstat -p $(pidof atop) 1 01:33:05 %usr %system %guest %wait%CPU CPU Command 01:33:06 33.66 33.660.000.99 67.33 1 atop 01:33:07 33.00 32.000.002.00 65.00 0 atop 01:33:08 34.00 31.000.001.00 65.00 0 atop ... Average: 33.15 32.790.001.09 65.94 - atop # USE_PROC_ALL=1 ./atop -w test 1 & # pidstat -p $(pidof atop) 1 01:33:33 %usr %system %guest %wait%CPU CPU Command 01:33:34 28.00 14.000.001.00 42.00 1 atop 01:33:35 28.00 14.000.000.00 42.00 1 atop 01:33:36 26.00 13.000.000.00 39.00 1 atop ... Average: 27.08 12.860.000.35 39.94 - atop So CPU usage goes down from ~65% to ~40%. Data collection times in milliseconds are: # xsv cat columns proc.csv procall.csv \ > | xsv stats \ > | xsv select field,min,max,mean,stddev \ > | xsv table field min max mean stddev /proc time 558 625 586.59 18.29 /proc/all time 231 262 243.56 8.02 Much performance optimisation can still be done, e.g. the modified atop uses fgets which is reading 1KB at a time, and seq_file seems to only return 4KB pages. task_diag should be much faster still. I'd imagine this sort of thing would be useful for daemons monitoring large numbers of processes. I don't run such systems myself; my initial motivation was frustration with the Kubernetes kubelet having ~2-4% CPU usage even with a couple of containers. Basic profiling suggests syscalls have a lot to do with it - it's actually reading loads of tiny cgroup files and enumerating many directories every 10 seconds, but /proc has similar issues and seemed easier to start with. Anyway, I've read that io_uring could also help here in the near future, which would be really cool especially if there was a way to enumerate directories and read many files regex-style in a single operation, e.g. /proc/[0-9].*/(stat|statm|io) > > Currently I'm trying to re-use the existing code in fs/proc that > > controls which PIDs are visible, but may well be missing > > something.. > > Try it out and see if it works correctly. And pid namespaces are not > the only thing these days from what I call :) > I've tried `unshare --fork --pid --mount-proc cat /proc/all/stat` which seems to behave correctly. ptrace flags are handled by the existing code. Best Wishes, Eugene From 2ffc2e388f7ce4e3f182c2442823e5f13bae03dd Mon Sep 17 00:00:00 2001 From: Eugene Lubarsky Date: Tue, 25 Aug 2020 12:36:41 +1000 Subject: [RFC PATCH] fs/proc: /proc/all: add schedstat and cpuset Signed-off-by: Eugene Lubarsky --- fs/proc/base.c | 42 ++ 1 file changed, 42 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 0bba4b3a985e..44d73f1ade4a 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3944,6 +3944,36 @@ static int proc_all_io(struct seq_file *m, void *v) } #endif +#ifdef CONFIG_PROC_PID_CPUSET +static int proc_all_cpuset(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + + return proc_cpuset_show(m, ns, pid, task); +} +#endif + +#ifdef CONFIG_SCHED_INFO +static int proc_all_schedstat(struct seq_file *m, void *v) +{ + struct all_iter *iter = (struct all_iter *) v; + struct pid_namespace *ns = iter->ns; + struct task_struct *task = iter->tgid_iter.task; + struct pid *pid = task->thread_pid; + + seq_put_decimal_ull(m, "", pid_nr_ns(pid, ns)); + seq_puts(m, " "); + + return proc_pid_schedstat(m, ns, pid, task); +} +#endif + static int proc_all_statx(struct seq_file *m, void *v) { struct all_iter *iter = (struct all_iter *) v; @@ -3990,6 +4020,12 @@ PROC_ALL_OPS(status); #ifdef CONFIG_TASK_IO_ACCOUNTING PROC_ALL_OPS(io); #endif +#ifdef CONFIG_SCHED_INFO + PROC_ALL_OPS(schedstat); +#endif +#ifdef CONFIG_PROC_PID_CPUSET + PROC_ALL_OPS(cpuset); +#endif
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Fri, Aug 14, 2020 at 01:01:00AM +1000, Eugene Lubarsky wrote: > On Wed, 12 Aug 2020 00:51:35 -0700 > Andrei Vagin wrote: > > > Maybe we need resurrect the task_diag series instead of inventing > > another less-effective interface... > > I would certainly welcome the resurrection of task_diag - it is clearly > more efficient than this /proc/all/ idea. It would be good to find out > if there's anything in particular that's currently blocking it. Unfotunatly, I don't have enough time to lead a process of pushing task_diag into the upstream. So if it is interesting for you, you can restart this process and I am ready to help as much as time will permit. I think the main blocking issue was a lack of interest from the wide audience to this. The slow proc is the problem just for a few users, but task_diag is a big subsystem that repeats functionality of another subsystem with all derived problems like code duplication. Another blocking issue is a new interface. There was no consensus on this. Initially, I suggested to use netlink sockets, but developers from non-network subsystem objected on this, so the transaction file interface was introduced. The main idea similar to netlink sockets is that we write a request and read a response. There were some security concerns but I think I fixed them. > > This RFC is mainly meant to check whether such an addition would > be acceptable from an API point of view. It currently has an obvious > performance issue in that seq_file seems to only return one page at a > time so lots of read syscalls are still required. However I may not > have the time to figure out a proposed fix for this by myself. > Regardless, text-based formats can't match the efficiency of task_diag, > but binary ones are also possible. I don't have objections to this series. It can be an option if we will decide that we don't want to do a major rework here. Thanks, Andrei
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Wed, 12 Aug 2020 00:51:35 -0700 Andrei Vagin wrote: > Maybe we need resurrect the task_diag series instead of inventing > another less-effective interface... I would certainly welcome the resurrection of task_diag - it is clearly more efficient than this /proc/all/ idea. It would be good to find out if there's anything in particular that's currently blocking it. This RFC is mainly meant to check whether such an addition would be acceptable from an API point of view. It currently has an obvious performance issue in that seq_file seems to only return one page at a time so lots of read syscalls are still required. However I may not have the time to figure out a proposed fix for this by myself. Regardless, text-based formats can't match the efficiency of task_diag, but binary ones are also possible. Best Wishes, Eugene
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Wed, Aug 12, 2020 at 10:47:32PM -0600, David Ahern wrote: > On 8/12/20 1:51 AM, Andrei Vagin wrote: > > > > I rebased the task_diag patches on top of v5.8: > > https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag > > Thanks for updating the patches. > > > > > /proc/pid files have three major limitations: > > * Requires at least three syscalls per process per file > > open(), read(), close() > > * Variety of formats, mostly text based > > The kernel spent time to encode binary data into a text format and > > then tools like top and ps spent time to decode them back to a binary > > format. > > * Sometimes slow due to extra attributes > > For example, /proc/PID/smaps contains a lot of useful informations > > about memory mappings and memory consumption for each of them. But > > even if we don't need memory consumption fields, the kernel will > > spend time to collect this information. > > that's what I recall as well. > > > > > More details and numbers are in this article: > > https://avagin.github.io/how-fast-is-procfs > > > > This new interface doesn't have only one of these limitations, but > > task_diag doesn't have all of them. > > > > And I compared how fast each of these interfaces: > > > > The test environment: > > CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz > > RAM: 16GB > > kernel: v5.8 with task_diag and /proc/all patches. > > 100K processes: > > $ ps ax | wc -l > > 10228 > > 100k processes but showing 10k here?? I'm sure that one zero has been escaped from here. task_proc_all shows a number of tasks too and it shows 100230. > > > > > $ time cat /proc/all/status > /dev/null > > > > real0m0.577s > > user0m0.017s > > sys 0m0.559s > > > > task_proc_all is used to read /proc/pid/status for all tasks: > > https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c > > > > $ time ./task_proc_all status > > tasks: 100230 > > Thanks, Andrei
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On 8/12/20 1:51 AM, Andrei Vagin wrote: > > I rebased the task_diag patches on top of v5.8: > https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag Thanks for updating the patches. > > /proc/pid files have three major limitations: > * Requires at least three syscalls per process per file > open(), read(), close() > * Variety of formats, mostly text based > The kernel spent time to encode binary data into a text format and > then tools like top and ps spent time to decode them back to a binary > format. > * Sometimes slow due to extra attributes > For example, /proc/PID/smaps contains a lot of useful informations > about memory mappings and memory consumption for each of them. But > even if we don't need memory consumption fields, the kernel will > spend time to collect this information. that's what I recall as well. > > More details and numbers are in this article: > https://avagin.github.io/how-fast-is-procfs > > This new interface doesn't have only one of these limitations, but > task_diag doesn't have all of them. > > And I compared how fast each of these interfaces: > > The test environment: > CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz > RAM: 16GB > kernel: v5.8 with task_diag and /proc/all patches. > 100K processes: > $ ps ax | wc -l > 10228 100k processes but showing 10k here?? > > $ time cat /proc/all/status > /dev/null > > real 0m0.577s > user 0m0.017s > sys 0m0.559s > > task_proc_all is used to read /proc/pid/status for all tasks: > https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c > > $ time ./task_proc_all status > tasks: 100230 > > real 0m0.924s > user 0m0.054s > sys 0m0.858s > > > /proc/all/status is about 40% faster than /proc/*/status. > > Now let's take a look at the perf output: > > $ time perf record -g cat /proc/all/status > /dev/null > $ perf report > - 98.08% 1.38% cat [kernel.vmlinux] [k] entry_SYSCALL_64 >- 96.70% entry_SYSCALL_64 > - do_syscall_64 > - 94.97% ksys_read > - 94.80% vfs_read >- 94.58% proc_reg_read > - seq_read > - 87.95% proc_pid_status > + 13.10% seq_put_decimal_ull_width > - 11.69% task_mem >+ 9.48% seq_put_decimal_ull_width > + 10.63% seq_printf > - 10.35% cpuset_task_status_allowed >+ seq_printf > - 9.84% render_sigset_t > 1.61% seq_putc >+ 1.61% seq_puts > + 4.99% proc_task_name > + 4.11% seq_puts > - 3.76% render_cap_t > 2.38% seq_put_hex_ll >+ 1.25% seq_puts > 2.64% __task_pid_nr_ns > + 1.54% get_task_mm > + 1.34% __lock_task_sighand > + 0.70% from_kuid_munged > 0.61% get_task_cred > 0.56% seq_putc > 0.52% hugetlb_report_usage > 0.52% from_kgid_munged > + 4.30% proc_all_next > + 0.82% _copy_to_user > > We can see that the kernel spent more than 50% of the time to encode binary > data into a text format. > > Now let's see how fast task_diag: > > $ time ./task_diag_all all -c -q > > real 0m0.087s > user 0m0.001s > sys 0m0.082s > > Maybe we need resurrect the task_diag series instead of inventing > another less-effective interface... I think the netlink message design is the better way to go. As system sizes continue to increase (> 100 cpus is common now) you need to be able to pass the right data to userspace as fast as possible to keep up with what can be a very dynamic userspace and set of processes. When you first proposed this idea I was working on systems with >= 1k cpus and the netlink option was able to keep up with a 'make -j N' on those systems. `perf record` walking /proc would never finish initializing - I had to add a "done initializing" message to know when to start a test. With the task_diag approach, perf could collect the data in short order and move on to recording data.
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Tue, Aug 11, 2020 at 12:58:47AM +1000, Eugene Lubarsky wrote: > This is an idea for substantially reducing the number of syscalls needed > by monitoring tools whilst mostly re-using the existing API. > > The proposed files in this proof-of-concept patch set are: > > * /proc/all/stat > A stat line for each process in the existing format. > > * /proc/all/statm > statm lines but starting with a PID column. > > * /proc/all/status > status info for all processes in the existing format. > > * /proc/all/io > The existing /proc/pid/io data but formatted as a single line for > each process, similarly to stat/statm, with a PID column added. > > * /proc/all/statx > Gathers info from stat, statm and io; the purpose is actually > not so much to reduce syscalls but to help userspace be more > efficient by not having to store data in e.g. hashtables in order > to gather it from separate /proc/all/ files. > > The format proposed here starts with the unchanged stat line > and begins the other info with a few characters, repeating for > each process: > > ... > 25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ... > m 662 188 167 5 0 112 0 > io 4292 0 12 0 0 0 0 > ... > > > There has been a proposal with some overlapping goals: /proc/task-diag > (https://github.com/avagin/linux-task-diag), but I'm not sure about > its current status. I rebased the task_diag patches on top of v5.8: https://github.com/avagin/linux-task-diag/tree/v5.8-task-diag /proc/pid files have three major limitations: * Requires at least three syscalls per process per file open(), read(), close() * Variety of formats, mostly text based The kernel spent time to encode binary data into a text format and then tools like top and ps spent time to decode them back to a binary format. * Sometimes slow due to extra attributes For example, /proc/PID/smaps contains a lot of useful informations about memory mappings and memory consumption for each of them. But even if we don't need memory consumption fields, the kernel will spend time to collect this information. More details and numbers are in this article: https://avagin.github.io/how-fast-is-procfs This new interface doesn't have only one of these limitations, but task_diag doesn't have all of them. And I compared how fast each of these interfaces: The test environment: CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz RAM: 16GB kernel: v5.8 with task_diag and /proc/all patches. 100K processes: $ ps ax | wc -l 10228 $ time cat /proc/all/status > /dev/null real0m0.577s user0m0.017s sys 0m0.559s task_proc_all is used to read /proc/pid/status for all tasks: https://github.com/avagin/linux-task-diag/blob/master/tools/testing/selftests/task_diag/task_proc_all.c $ time ./task_proc_all status tasks: 100230 real0m0.924s user0m0.054s sys 0m0.858s /proc/all/status is about 40% faster than /proc/*/status. Now let's take a look at the perf output: $ time perf record -g cat /proc/all/status > /dev/null $ perf report - 98.08% 1.38% cat [kernel.vmlinux] [k] entry_SYSCALL_64 - 96.70% entry_SYSCALL_64 - do_syscall_64 - 94.97% ksys_read - 94.80% vfs_read - 94.58% proc_reg_read - seq_read - 87.95% proc_pid_status + 13.10% seq_put_decimal_ull_width - 11.69% task_mem + 9.48% seq_put_decimal_ull_width + 10.63% seq_printf - 10.35% cpuset_task_status_allowed + seq_printf - 9.84% render_sigset_t 1.61% seq_putc + 1.61% seq_puts + 4.99% proc_task_name + 4.11% seq_puts - 3.76% render_cap_t 2.38% seq_put_hex_ll + 1.25% seq_puts 2.64% __task_pid_nr_ns + 1.54% get_task_mm + 1.34% __lock_task_sighand + 0.70% from_kuid_munged 0.61% get_task_cred 0.56% seq_putc 0.52% hugetlb_report_usage 0.52% from_kgid_munged + 4.30% proc_all_next + 0.82% _copy_to_user We can see that the kernel spent more than 50% of the time to encode binary data into a text format. Now let's see how fast task_diag: $ time ./task_diag_all all -c -q real0m0.087s user0m0.001s sys 0m0.082s Maybe we need resurrect the task_diag series instead of inventing another less-effective interface... Thanks, Andrei > > > > Best Wishes, > > Eugene > > > Eugene Lubarsky (5): > fs/proc: Introduce /proc/all/stat >
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Tue, Aug 11, 2020 at 01:27:00AM +1000, Eugene Lubarsky wrote: > On Mon, 10 Aug 2020 17:04:53 +0200 > Greg KH wrote: > > How many syscalls does this save on? > > > > Perhaps you want my proposed readfile(2) syscall: > > > > https://lore.kernel.org/r/20200704140250.423345-1-gre...@linuxfoundation.org > > to help out with things like this? :) > > The proposed readfile sounds great and would help, but if there are > 1000 processes wouldn't that require 1000 readfile calls to read their > proc files? Yes, but that should be better than 1000 open, 1000 read, and then 1000 close calls, right? :) > With something like this the stats for 1000 processes could be > retrieved with an open, a few reads and a close. And have you benchmarked any of this? Try working with the common tools that want this information and see if it actually is noticeable (hint, I have been doing that with the readfile work and it's surprising what the results are in places...) > > > > > > The proposed files in this proof-of-concept patch set are: > > > > > > * /proc/all/stat > > > > I think the problem will be defining "all" in the case of the specific > > namespace you are dealing with, right? How will this handle all of > > those issues properly for all of these different statisics? > > > > Currently I'm trying to re-use the existing code in fs/proc that > controls which PIDs are visible, but may well be missing something.. Try it out and see if it works correctly. And pid namespaces are not the only thing these days from what I call :) thanks, greg k-h
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Mon, 10 Aug 2020 17:04:53 +0200 Greg KH wrote: > How many syscalls does this save on? > > Perhaps you want my proposed readfile(2) syscall: > > https://lore.kernel.org/r/20200704140250.423345-1-gre...@linuxfoundation.org > to help out with things like this? :) The proposed readfile sounds great and would help, but if there are 1000 processes wouldn't that require 1000 readfile calls to read their proc files? With something like this the stats for 1000 processes could be retrieved with an open, a few reads and a close. > > > The proposed files in this proof-of-concept patch set are: > > > > * /proc/all/stat > > I think the problem will be defining "all" in the case of the specific > namespace you are dealing with, right? How will this handle all of > those issues properly for all of these different statisics? > Currently I'm trying to re-use the existing code in fs/proc that controls which PIDs are visible, but may well be missing something.. Best Wishes, Eugene
Re: [RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
On Tue, Aug 11, 2020 at 12:58:47AM +1000, Eugene Lubarsky wrote: > This is an idea for substantially reducing the number of syscalls needed > by monitoring tools whilst mostly re-using the existing API. How many syscalls does this save on? Perhaps you want my proposed readfile(2) syscall: https://lore.kernel.org/r/20200704140250.423345-1-gre...@linuxfoundation.org to help out with things like this? :) > The proposed files in this proof-of-concept patch set are: > > * /proc/all/stat I think the problem will be defining "all" in the case of the specific namespace you are dealing with, right? How will this handle all of those issues properly for all of these different statisics? thanks, greg k-h
[RFC PATCH 0/5] Introduce /proc/all/ to gather stats from all processes
This is an idea for substantially reducing the number of syscalls needed by monitoring tools whilst mostly re-using the existing API. The proposed files in this proof-of-concept patch set are: * /proc/all/stat A stat line for each process in the existing format. * /proc/all/statm statm lines but starting with a PID column. * /proc/all/status status info for all processes in the existing format. * /proc/all/io The existing /proc/pid/io data but formatted as a single line for each process, similarly to stat/statm, with a PID column added. * /proc/all/statx Gathers info from stat, statm and io; the purpose is actually not so much to reduce syscalls but to help userspace be more efficient by not having to store data in e.g. hashtables in order to gather it from separate /proc/all/ files. The format proposed here starts with the unchanged stat line and begins the other info with a few characters, repeating for each process: ... 25 (cat) R 1 1 0 0 -1 4194304 185 0 16 0 2 0 0 0 20 ... m 662 188 167 5 0 112 0 io 4292 0 12 0 0 0 0 ... There has been a proposal with some overlapping goals: /proc/task-diag (https://github.com/avagin/linux-task-diag), but I'm not sure about its current status. Best Wishes, Eugene Eugene Lubarsky (5): fs/proc: Introduce /proc/all/stat fs/proc: Introduce /proc/all/statm fs/proc: Introduce /proc/all/status fs/proc: Introduce /proc/all/io fs/proc: Introduce /proc/all/statx fs/proc/base.c | 215 +++-- fs/proc/internal.h | 1 + fs/proc/root.c | 1 + 3 files changed, 210 insertions(+), 7 deletions(-) -- 2.25.1