Re: severe proc dentry lock contention
Junxiao Bi writes:

> On 6/18/20 5:02 PM, ebied...@xmission.com wrote:
>> Matthew Wilcox writes:
>>> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
>>>> When debugging a performance issue, I found that thousands of threads
>>>> exiting around the same time can cause severe spin lock contention on
>>>> the proc dentry "/proc/$parent_process_pid/task/", because each thread
>>>> needs to clean up its pid file from that directory when it exits. See
>>>> the following standalone test case that simulates the problem, and the
>>>> perf top result on a v5.7 kernel. Any idea on how to fix this?
>>> Thanks, Junxiao.
>>>
>>> We've looked at a few different ways of fixing this problem.
>>>
>>> Even though the contention is within the dcache, it seems like a usecase
>>> that the dcache shouldn't be optimised for -- generally we do not have
>>> hundreds of CPUs removing dentries from a single directory in parallel.
>>>
>>> We could fix this within procfs.  We don't have a great patch yet, but
>>> the current approach we're looking at allows only one thread at a time
>>> to call dput() on any /proc/*/task directory.
>>>
>>> We could also look at fixing this within the scheduler.  Only allowing
>>> one CPU to run the threads of an exiting process would fix this
>>> particular problem, but might have other consequences.
>>>
>>> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
>>> so that hope is ruled out.
>>
>> Does anyone know if this problem is new in v5.7?  I am wondering if I
>> introduced it when I refactored the code, or if I simply churned the code
>> and the issue remains effectively the same.
>
> It's not a new issue; we see it in old kernels like v4.14.
>
>> Can you try only flushing entries when the last thread of the process is
>> reaped?  I think in practice we would want to be a little more
>> sophisticated, but it is a good test case to see if it solves the issue.
>
> Thank you. I will try and let you know.
Assuming this works, we can remove the hack of always doing this by only
doing it when SIGNAL_GROUP_EXIT is set, i.e. when we know this thundering
herd will appear.  We still need to do something with the list, however.

If your testing works out performance-wise, I will figure out what we need
to make the hack general and safe.

Eric

> Thanks,
>
> Junxiao.
>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index cebae77a9664..d56e4eb60bdd 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
>>  void release_task(struct task_struct *p)
>>  {
>>  	struct task_struct *leader;
>> -	struct pid *thread_pid;
>> +	struct pid *thread_pid = NULL;
>>  	int zap_leader;
>>  repeat:
>>  	/* don't need to get the RCU readlock here - the process is dead and
>> @@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
>>  	write_lock_irq(&tasklist_lock);
>>  	ptrace_release_task(p);
>> -	thread_pid = get_pid(p->thread_pid);
>> +	if (p == p->group_leader)
>> +		thread_pid = get_pid(p->thread_pid);
>>  	__exit_signal(p);
>>
>>  	/*
>> @@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
>>  	}
>>
>>  	write_unlock_irq(&tasklist_lock);
>> -	proc_flush_pid(thread_pid);
>> -	put_pid(thread_pid);
>> +	if (thread_pid) {
>> +		proc_flush_pid(thread_pid);
>> +		put_pid(thread_pid);
>> +	}
>>  	release_thread(p);
>>  	put_task_struct_rcu_user(p);
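[Editor's note: to make the SIGNAL_GROUP_EXIT idea above concrete, the gating could look roughly like the fragment below. This is only an untested sketch against the release_task() hunk quoted in this thread, not a patch; the exact flag check and its placement are assumptions.]

```c
/* Sketch only (untested): keep the per-task flush in the normal case,
 * but when the whole thread group is exiting (SIGNAL_GROUP_EXIT),
 * defer the proc flush to the group leader so each exiting thread in
 * the thundering herd skips the dcache work. */
bool group_exit = p->signal->flags & SIGNAL_GROUP_EXIT;

if (!group_exit || thread_group_leader(p))
	thread_pid = get_pid(p->thread_pid);
```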
Re: severe proc dentry lock contention
On 6/18/20 5:02 PM, ebied...@xmission.com wrote:
> Matthew Wilcox writes:
>> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
>>> When debugging a performance issue, I found that thousands of threads
>>> exiting around the same time can cause severe spin lock contention on
>>> the proc dentry "/proc/$parent_process_pid/task/", because each thread
>>> needs to clean up its pid file from that directory when it exits. See
>>> the following standalone test case that simulates the problem, and the
>>> perf top result on a v5.7 kernel. Any idea on how to fix this?
>> Thanks, Junxiao.
>>
>> We've looked at a few different ways of fixing this problem.
>>
>> Even though the contention is within the dcache, it seems like a usecase
>> that the dcache shouldn't be optimised for -- generally we do not have
>> hundreds of CPUs removing dentries from a single directory in parallel.
>>
>> We could fix this within procfs.  We don't have a great patch yet, but
>> the current approach we're looking at allows only one thread at a time
>> to call dput() on any /proc/*/task directory.
>>
>> We could also look at fixing this within the scheduler.  Only allowing
>> one CPU to run the threads of an exiting process would fix this
>> particular problem, but might have other consequences.
>>
>> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
>> so that hope is ruled out.
> Does anyone know if this problem is new in v5.7?  I am wondering if I
> introduced it when I refactored the code, or if I simply churned the code
> and the issue remains effectively the same.

It's not a new issue; we see it in old kernels like v4.14.

> Can you try only flushing entries when the last thread of the process is
> reaped?  I think in practice we would want to be a little more
> sophisticated, but it is a good test case to see if it solves the issue.

Thank you. I will try and let you know.

Thanks,

Junxiao.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index cebae77a9664..d56e4eb60bdd 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
>  void release_task(struct task_struct *p)
>  {
>  	struct task_struct *leader;
> -	struct pid *thread_pid;
> +	struct pid *thread_pid = NULL;
>  	int zap_leader;
>  repeat:
>  	/* don't need to get the RCU readlock here - the process is dead and
> @@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
>  	write_lock_irq(&tasklist_lock);
>  	ptrace_release_task(p);
> -	thread_pid = get_pid(p->thread_pid);
> +	if (p == p->group_leader)
> +		thread_pid = get_pid(p->thread_pid);
>  	__exit_signal(p);
>
>  	/*
> @@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
>  	}
>
>  	write_unlock_irq(&tasklist_lock);
> -	proc_flush_pid(thread_pid);
> -	put_pid(thread_pid);
> +	if (thread_pid) {
> +		proc_flush_pid(thread_pid);
> +		put_pid(thread_pid);
> +	}
>  	release_thread(p);
>  	put_task_struct_rcu_user(p);
Re: severe proc dentry lock contention
Matthew Wilcox writes:

> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
>> When debugging a performance issue, I found that thousands of threads
>> exiting around the same time can cause severe spin lock contention on
>> the proc dentry "/proc/$parent_process_pid/task/", because each thread
>> needs to clean up its pid file from that directory when it exits. See
>> the following standalone test case that simulates the problem, and the
>> perf top result on a v5.7 kernel. Any idea on how to fix this?
>
> Thanks, Junxiao.
>
> We've looked at a few different ways of fixing this problem.
>
> Even though the contention is within the dcache, it seems like a usecase
> that the dcache shouldn't be optimised for -- generally we do not have
> hundreds of CPUs removing dentries from a single directory in parallel.
>
> We could fix this within procfs.  We don't have a great patch yet, but
> the current approach we're looking at allows only one thread at a time
> to call dput() on any /proc/*/task directory.
>
> We could also look at fixing this within the scheduler.  Only allowing
> one CPU to run the threads of an exiting process would fix this
> particular problem, but might have other consequences.
>
> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
> so that hope is ruled out.

Does anyone know if this problem is new in v5.7?  I am wondering if I
introduced it when I refactored the code, or if I simply churned the code
and the issue remains effectively the same.

Can you try only flushing entries when the last thread of the process is
reaped?  I think in practice we would want to be a little more
sophisticated, but it is a good test case to see if it solves the issue.
diff --git a/kernel/exit.c b/kernel/exit.c
index cebae77a9664..d56e4eb60bdd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
 void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
-	struct pid *thread_pid;
+	struct pid *thread_pid = NULL;
 	int zap_leader;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	thread_pid = get_pid(p->thread_pid);
+	if (p == p->group_leader)
+		thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);

 	/*
@@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
 	}

 	write_unlock_irq(&tasklist_lock);
-	proc_flush_pid(thread_pid);
-	put_pid(thread_pid);
+	if (thread_pid) {
+		proc_flush_pid(thread_pid);
+		put_pid(thread_pid);
+	}
 	release_thread(p);
 	put_task_struct_rcu_user(p);
Re: severe proc dentry lock contention
On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
> When debugging a performance issue, I found that thousands of threads
> exiting around the same time can cause severe spin lock contention on
> the proc dentry "/proc/$parent_process_pid/task/", because each thread
> needs to clean up its pid file from that directory when it exits. See
> the following standalone test case that simulates the problem, and the
> perf top result on a v5.7 kernel. Any idea on how to fix this?

Thanks, Junxiao.

We've looked at a few different ways of fixing this problem.

Even though the contention is within the dcache, it seems like a usecase
that the dcache shouldn't be optimised for -- generally we do not have
hundreds of CPUs removing dentries from a single directory in parallel.

We could fix this within procfs.  We don't have a great patch yet, but
the current approach we're looking at allows only one thread at a time
to call dput() on any /proc/*/task directory.

We could also look at fixing this within the scheduler.  Only allowing
one CPU to run the threads of an exiting process would fix this
particular problem, but might have other consequences.

I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
so that hope is ruled out.
>    PerfTop: 48891 irqs/sec  kernel:95.6%  exact: 100.0%  lost: 0/0  drop: 0/0  [4000Hz cycles], (all, 72 CPUs)
> ---
>
>    66.10%  [kernel]      [k] native_queued_spin_lock_slowpath
>     1.13%  [kernel]      [k] _raw_spin_lock
>     0.84%  [kernel]      [k] clear_page_erms
>     0.82%  [kernel]      [k] queued_write_lock_slowpath
>     0.64%  [kernel]      [k] proc_task_readdir
>     0.61%  [kernel]      [k] find_idlest_group.isra.95
>     0.61%  [kernel]      [k] syscall_return_via_sysret
>     0.55%  [kernel]      [k] entry_SYSCALL_64
>     0.49%  [kernel]      [k] memcpy_erms
>     0.46%  [kernel]      [k] update_cfs_group
>     0.41%  [kernel]      [k] get_pid_task
>     0.39%  [kernel]      [k] _raw_spin_lock_irqsave
>     0.37%  [kernel]      [k] __list_del_entry_valid
>     0.34%  [kernel]      [k] get_page_from_freelist
>     0.34%  [kernel]      [k] __d_lookup
>     0.32%  [kernel]      [k] update_load_avg
>     0.31%  libc-2.17.so  [.] get_next_seq
>     0.27%  [kernel]      [k] avc_has_perm_noaudit
>     0.26%  [kernel]      [k] __sched_text_start
>     0.25%  [kernel]      [k] selinux_inode_permission
>     0.25%  [kernel]      [k] __slab_free
>     0.24%  [kernel]      [k] detach_entity_cfs_rq
>     0.23%  [kernel]      [k] zap_pte_range
>     0.22%  [kernel]      [k] _find_next_bit.constprop.1
>     0.22%  libc-2.17.so  [.] vfprintf
>     0.20%  libc-2.17.so  [.] _int_malloc
>     0.19%  [kernel]      [k] _raw_spin_lock_irq
>     0.18%  [kernel]      [k] rb_erase
>     0.18%  [kernel]      [k] pid_revalidate
>     0.18%  [kernel]      [k] lockref_get_not_dead
>     0.18%  [kernel]      [k] __alloc_pages_nodemask
>     0.17%  [kernel]      [k] set_task_cpu
>     0.17%  libc-2.17.so  [.] __strcoll_l
>     0.17%  [kernel]      [k] do_syscall_64
>     0.17%  [kernel]      [k] __vmalloc_node_range
>     0.17%  libc-2.17.so  [.] _IO_vfscanf
>     0.17%  [kernel]      [k] refcount_dec_not_one
>     0.15%  [kernel]      [k] __task_pid_nr_ns
>     0.15%  [kernel]      [k] native_irq_return_iret
>     0.15%  [kernel]      [k] free_pcppages_bulk
>     0.14%  [kernel]      [k] kmem_cache_alloc
>     0.14%  [kernel]      [k] link_path_walk
>     0.14%  libc-2.17.so  [.] _int_free
>     0.14%  [kernel]      [k] __update_load_avg_cfs_rq
>     0.14%  perf.5.7.0-master.20200601.ol7.x86_64  [.] 0x000eac29
>     0.13%  [kernel]      [k] kmem_cache_free
severe proc dentry lock contention
Hi,

When debugging a performance issue, I found that thousands of threads
exiting around the same time can cause severe spin lock contention on the
proc dentry "/proc/$parent_process_pid/task/", because each thread needs
to clean up its pid file from that directory when it exits. See the
following standalone test case that simulates the problem, and the perf
top result on a v5.7 kernel. Any idea on how to fix this?

   PerfTop: 48891 irqs/sec  kernel:95.6%  exact: 100.0%  lost: 0/0  drop: 0/0  [4000Hz cycles], (all, 72 CPUs)
---

   66.10%  [kernel]      [k] native_queued_spin_lock_slowpath
    1.13%  [kernel]      [k] _raw_spin_lock
    0.84%  [kernel]      [k] clear_page_erms
    0.82%  [kernel]      [k] queued_write_lock_slowpath
    0.64%  [kernel]      [k] proc_task_readdir
    0.61%  [kernel]      [k] find_idlest_group.isra.95
    0.61%  [kernel]      [k] syscall_return_via_sysret
    0.55%  [kernel]      [k] entry_SYSCALL_64
    0.49%  [kernel]      [k] memcpy_erms
    0.46%  [kernel]      [k] update_cfs_group
    0.41%  [kernel]      [k] get_pid_task
    0.39%  [kernel]      [k] _raw_spin_lock_irqsave
    0.37%  [kernel]      [k] __list_del_entry_valid
    0.34%  [kernel]      [k] get_page_from_freelist
    0.34%  [kernel]      [k] __d_lookup
    0.32%  [kernel]      [k] update_load_avg
    0.31%  libc-2.17.so  [.] get_next_seq
    0.27%  [kernel]      [k] avc_has_perm_noaudit
    0.26%  [kernel]      [k] __sched_text_start
    0.25%  [kernel]      [k] selinux_inode_permission
    0.25%  [kernel]      [k] __slab_free
    0.24%  [kernel]      [k] detach_entity_cfs_rq
    0.23%  [kernel]      [k] zap_pte_range
    0.22%  [kernel]      [k] _find_next_bit.constprop.1
    0.22%  libc-2.17.so  [.] vfprintf
    0.20%  libc-2.17.so  [.] _int_malloc
    0.19%  [kernel]      [k] _raw_spin_lock_irq
    0.18%  [kernel]      [k] rb_erase
    0.18%  [kernel]      [k] pid_revalidate
    0.18%  [kernel]      [k] lockref_get_not_dead
    0.18%  [kernel]      [k] __alloc_pages_nodemask
    0.17%  [kernel]      [k] set_task_cpu
    0.17%  libc-2.17.so  [.] __strcoll_l
    0.17%  [kernel]      [k] do_syscall_64
    0.17%  [kernel]      [k] __vmalloc_node_range
    0.17%  libc-2.17.so  [.] _IO_vfscanf
    0.17%  [kernel]      [k] refcount_dec_not_one
    0.15%  [kernel]      [k] __task_pid_nr_ns
    0.15%  [kernel]      [k] native_irq_return_iret
    0.15%  [kernel]      [k] free_pcppages_bulk
    0.14%  [kernel]      [k] kmem_cache_alloc
    0.14%  [kernel]      [k] link_path_walk
    0.14%  libc-2.17.so  [.] _int_free
    0.14%  [kernel]      [k] __update_load_avg_cfs_rq
    0.14%  perf.5.7.0-master.20200601.ol7.x86_64  [.] 0x000eac29
    0.13%  [kernel]      [k] kmem_cache_free
    0.13%  [kernel]      [k] number
    0.13%  [kernel]      [k] memset_erms
    0.12%  [kernel]      [k] proc_pid_status
    0.12%  [kernel]      [k] __d_lookup_rcu

=== runme.sh ===
#!/bin/bash

threads=${1:-1}
prog=proc_race

while [ 1 ]; do ./$prog $threads; done &

while [ 1 ]; do
	pid=`ps aux | grep $prog | grep -v grep | awk '{print $2}'`
	if [ -z $pid ]; then continue; fi
	threadnum=`ls -l /proc/$pid/task | wc -l`
	if [ $threadnum -gt $threads ]; then
		echo kill $pid
		kill -9 $pid
	fi
done

=== proc_race.c ===
#include <pthread.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <ctype.h>

#define handle_error_en(en, msg) \
	do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

#define