Re: severe proc dentry lock contention

2020-06-18 Thread Eric W. Biederman
Junxiao Bi  writes:

> On 6/18/20 5:02 PM, ebied...@xmission.com wrote:
>
>> Matthew Wilcox  writes:
>>
>>> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
>>>> When debugging a performance issue, I found that thousands of threads
>>>> exiting at around the same time can cause severe spin lock contention on
>>>> the proc dentry "/proc/$parent_process_pid/task/". That is because each
>>>> thread needs to clean up its pid file from that directory when it exits.
>>>> See the following standalone test case, which simulates the problem, and
>>>> the perf top result on a v5.7 kernel. Any idea on how to fix this?
>>> Thanks, Junxiao.
>>>
>>> We've looked at a few different ways of fixing this problem.
>>>
>>> Even though the contention is within the dcache, it seems like a usecase
>>> that the dcache shouldn't be optimised for -- generally we do not have
>>> hundreds of CPUs removing dentries from a single directory in parallel.
>>>
>>> We could fix this within procfs.  We don't have a great patch yet, but
>>> the current approach we're looking at allows only one thread at a time
>>> to call dput() on any /proc/*/task directory.
>>>
>>> We could also look at fixing this within the scheduler.  Only allowing
>>> one CPU to run the threads of an exiting process would fix this particular
>>> problem, but might have other consequences.
>>>
>>> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
>>> so that hope is ruled out.
>> Does anyone know if this problem is new in v5.7?  I am wondering if I
>> introduced this problem when I refactored the code, or if I simply
>> churned the code and the issue remains effectively the same.
> It's not a new issue; we see it in old kernels, like v4.14.
>>
>> Can you try only flushing entries when the last thread of the process is
>> reaped?  I think in practice we would want to be a little more
>> sophisticated but it is a good test case to see if it solves the issue.
>
> Thank you. I will try and let you know.

Assuming this works, we can remove the hackish part of always doing this
and only do it when SIGNAL_GROUP_EXIT is set, i.e. when we know this
thundering herd will appear.
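
Something like the following on top of the patch quoted below (an untested
sketch of the idea only; where exactly the check lives and how it interacts
with de_thread() still needs thought):

	write_lock_irq(&tasklist_lock);
	ptrace_release_task(p);
	/*
	 * Sketch: keep flushing immediately for ordinary reaping, and only
	 * defer the proc dentry flush to the group leader when the whole
	 * thread group is exiting, i.e. when the thundering herd is expected.
	 */
	if (p == p->group_leader ||
	    !(p->signal->flags & SIGNAL_GROUP_EXIT))
		thread_pid = get_pid(p->thread_pid);
	__exit_signal(p);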

We still need to do something with the list, however.

If your testing works out performance-wise, I will figure out what we
need to make the hack general and safe.

Eric



> Thanks,
>
> Junxiao.
>
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index cebae77a9664..d56e4eb60bdd 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
>>  void release_task(struct task_struct *p)
>>  {
>>  	struct task_struct *leader;
>> -	struct pid *thread_pid;
>> +	struct pid *thread_pid = NULL;
>>  	int zap_leader;
>>  repeat:
>>  	/* don't need to get the RCU readlock here - the process is dead and
>> @@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
>>  
>>  	write_lock_irq(&tasklist_lock);
>>  	ptrace_release_task(p);
>> -	thread_pid = get_pid(p->thread_pid);
>> +	if (p == p->group_leader)
>> +		thread_pid = get_pid(p->thread_pid);
>>  	__exit_signal(p);
>>  
>>  	/*
>> @@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
>>  	}
>>  
>>  	write_unlock_irq(&tasklist_lock);
>> -	proc_flush_pid(thread_pid);
>> -	put_pid(thread_pid);
>> +	if (thread_pid) {
>> +		proc_flush_pid(thread_pid);
>> +		put_pid(thread_pid);
>> +	}
>>  	release_thread(p);
>>  	put_task_struct_rcu_user(p);
>>  


Re: severe proc dentry lock contention

2020-06-18 Thread Junxiao Bi

On 6/18/20 5:02 PM, ebied...@xmission.com wrote:


Matthew Wilcox  writes:


On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:

When debugging a performance issue, I found that thousands of threads
exiting at around the same time can cause severe spin lock contention on
the proc dentry "/proc/$parent_process_pid/task/". That is because each
thread needs to clean up its pid file from that directory when it exits.
See the following standalone test case, which simulates the problem, and
the perf top result on a v5.7 kernel. Any idea on how to fix this?

Thanks, Junxiao.

We've looked at a few different ways of fixing this problem.

Even though the contention is within the dcache, it seems like a usecase
that the dcache shouldn't be optimised for -- generally we do not have
hundreds of CPUs removing dentries from a single directory in parallel.

We could fix this within procfs.  We don't have a great patch yet, but
the current approach we're looking at allows only one thread at a time
to call dput() on any /proc/*/task directory.

We could also look at fixing this within the scheduler.  Only allowing
one CPU to run the threads of an exiting process would fix this particular
problem, but might have other consequences.

I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
so that hope is ruled out.

Does anyone know if this problem is new in v5.7?  I am wondering if I
introduced this problem when I refactored the code, or if I simply
churned the code and the issue remains effectively the same.

It's not a new issue; we see it in old kernels, like v4.14.


Can you try only flushing entries when the last thread of the process is
reaped?  I think in practice we would want to be a little more
sophisticated but it is a good test case to see if it solves the issue.


Thank you. I will try and let you know.

Thanks,

Junxiao.



diff --git a/kernel/exit.c b/kernel/exit.c
index cebae77a9664..d56e4eb60bdd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
 void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
-	struct pid *thread_pid;
+	struct pid *thread_pid = NULL;
 	int zap_leader;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	thread_pid = get_pid(p->thread_pid);
+	if (p == p->group_leader)
+		thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
 
 	/*
@@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
-	proc_flush_pid(thread_pid);
-	put_pid(thread_pid);
+	if (thread_pid) {
+		proc_flush_pid(thread_pid);
+		put_pid(thread_pid);
+	}
 	release_thread(p);
 	put_task_struct_rcu_user(p);
  


Re: severe proc dentry lock contention

2020-06-18 Thread Eric W. Biederman
Matthew Wilcox  writes:

> On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
>> When debugging a performance issue, I found that thousands of threads
>> exiting at around the same time can cause severe spin lock contention on
>> the proc dentry "/proc/$parent_process_pid/task/". That is because each
>> thread needs to clean up its pid file from that directory when it exits.
>> See the following standalone test case, which simulates the problem, and
>> the perf top result on a v5.7 kernel. Any idea on how to fix this?

>
> Thanks, Junxiao.
>
> We've looked at a few different ways of fixing this problem.
>
> Even though the contention is within the dcache, it seems like a usecase
> that the dcache shouldn't be optimised for -- generally we do not have
> hundreds of CPUs removing dentries from a single directory in parallel.
>
> We could fix this within procfs.  We don't have a great patch yet, but
> the current approach we're looking at allows only one thread at a time
> to call dput() on any /proc/*/task directory.
>
> We could also look at fixing this within the scheduler.  Only allowing
> one CPU to run the threads of an exiting process would fix this particular
> problem, but might have other consequences.
>
> I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
> so that hope is ruled out.

Does anyone know if this problem is new in v5.7?  I am wondering if I
introduced this problem when I refactored the code, or if I simply
churned the code and the issue remains effectively the same.

Can you try only flushing entries when the last thread of the process is
reaped?  I think in practice we would want to be a little more
sophisticated but it is a good test case to see if it solves the issue.

diff --git a/kernel/exit.c b/kernel/exit.c
index cebae77a9664..d56e4eb60bdd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -152,7 +152,7 @@ void put_task_struct_rcu_user(struct task_struct *task)
 void release_task(struct task_struct *p)
 {
 	struct task_struct *leader;
-	struct pid *thread_pid;
+	struct pid *thread_pid = NULL;
 	int zap_leader;
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
@@ -165,7 +165,8 @@ void release_task(struct task_struct *p)
 
 	write_lock_irq(&tasklist_lock);
 	ptrace_release_task(p);
-	thread_pid = get_pid(p->thread_pid);
+	if (p == p->group_leader)
+		thread_pid = get_pid(p->thread_pid);
 	__exit_signal(p);
 
 	/*
@@ -188,8 +189,10 @@ void release_task(struct task_struct *p)
 	}
 
 	write_unlock_irq(&tasklist_lock);
-	proc_flush_pid(thread_pid);
-	put_pid(thread_pid);
+	if (thread_pid) {
+		proc_flush_pid(thread_pid);
+		put_pid(thread_pid);
+	}
 	release_thread(p);
 	put_task_struct_rcu_user(p);
 


Re: severe proc dentry lock contention

2020-06-18 Thread Matthew Wilcox
On Thu, Jun 18, 2020 at 03:17:33PM -0700, Junxiao Bi wrote:
> When debugging a performance issue, I found that thousands of threads
> exiting at around the same time can cause severe spin lock contention on
> the proc dentry "/proc/$parent_process_pid/task/". That is because each
> thread needs to clean up its pid file from that directory when it exits.
> See the following standalone test case, which simulates the problem, and
> the perf top result on a v5.7 kernel. Any idea on how to fix this?

Thanks, Junxiao.

We've looked at a few different ways of fixing this problem.

Even though the contention is within the dcache, it seems like a usecase
that the dcache shouldn't be optimised for -- generally we do not have
hundreds of CPUs removing dentries from a single directory in parallel.

We could fix this within procfs.  We don't have a great patch yet, but
the current approach we're looking at allows only one thread at a time
to call dput() on any /proc/*/task directory.
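
To sketch the shape of that idea (illustration only, not the patch we have;
the mutex and helper names below are invented for the example), the final
dput() of those dentries would be funnelled through a single lock:

/*
 * Illustration only: serialise teardown of /proc/<pid>/task/<tid>
 * dentries so that only one CPU at a time contends on the parent
 * directory's d_lock.
 */
static DEFINE_MUTEX(proc_task_dput_mutex);

static void proc_task_dput(struct dentry *dentry)
{
	mutex_lock(&proc_task_dput_mutex);
	dput(dentry);
	mutex_unlock(&proc_task_dput_mutex);
}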

We could also look at fixing this within the scheduler.  Only allowing
one CPU to run the threads of an exiting process would fix this particular
problem, but might have other consequences.
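
Purely to illustrate that direction (a sketch, not something that has been
tried; the hook point in the exit path is a guess), it could look like
pinning exiting threads onto the group leader's CPU:

	/*
	 * Illustration only: once the thread group starts exiting, confine
	 * its threads to one CPU so their exits (and the dentry teardown)
	 * are serialised by the scheduler instead of by a spin lock.
	 */
	if (tsk->signal->flags & SIGNAL_GROUP_EXIT)
		set_cpus_allowed_ptr(tsk, cpumask_of(task_cpu(tsk->group_leader)));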

I was hoping that 7bc3e6e55acf would fix this, but that patch is in 5.7,
so that hope is ruled out.

> 
>    PerfTop:   48891 irqs/sec  kernel:95.6%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 72 CPUs)
> ---
> 
>  66.10%  [kernel]  [k] native_queued_spin_lock_slowpath
>   1.13%  [kernel]  [k] _raw_spin_lock
>   0.84%  [kernel]  [k] clear_page_erms
>   0.82%  [kernel]  [k] queued_write_lock_slowpath
>   0.64%  [kernel]  [k] proc_task_readdir
>   0.61%  [kernel]  [k] find_idlest_group.isra.95
>   0.61%  [kernel]  [k] syscall_return_via_sysret
>   0.55%  [kernel]  [k] entry_SYSCALL_64
>   0.49%  [kernel]  [k] memcpy_erms
>   0.46%  [kernel]  [k] update_cfs_group
>   0.41%  [kernel]  [k] get_pid_task
>   0.39%  [kernel]  [k] _raw_spin_lock_irqsave
>   0.37%  [kernel]  [k] __list_del_entry_valid
>   0.34%  [kernel]  [k] get_page_from_freelist
>   0.34%  [kernel]  [k] __d_lookup
>   0.32%  [kernel]  [k] update_load_avg
>   0.31%  libc-2.17.so  [.] get_next_seq
>   0.27%  [kernel]  [k] avc_has_perm_noaudit
>   0.26%  [kernel]  [k] __sched_text_start
>   0.25%  [kernel]  [k] selinux_inode_permission
>   0.25%  [kernel]  [k] __slab_free
>   0.24%  [kernel]  [k] detach_entity_cfs_rq
>   0.23%  [kernel]  [k] zap_pte_range
>   0.22%  [kernel]  [k] _find_next_bit.constprop.1
>   0.22%  libc-2.17.so  [.] vfprintf
>   0.20%  libc-2.17.so  [.] _int_malloc
>   0.19%  [kernel]  [k] _raw_spin_lock_irq
>   0.18%  [kernel]  [k] rb_erase
>   0.18%  [kernel]  [k] pid_revalidate
>   0.18%  [kernel]  [k] lockref_get_not_dead
>   0.18%  [kernel]  [k] __alloc_pages_nodemask
>   0.17%  [kernel]  [k] set_task_cpu
>   0.17%  libc-2.17.so  [.] __strcoll_l
>   0.17%  [kernel]  [k] do_syscall_64
>   0.17%  [kernel]  [k] __vmalloc_node_range
>   0.17%  libc-2.17.so  [.] _IO_vfscanf
>   0.17%  [kernel]  [k] refcount_dec_not_one
>   0.15%  [kernel]  [k] __task_pid_nr_ns
>   0.15%  [kernel]  [k] native_irq_return_iret
>   0.15%  [kernel]  [k] free_pcppages_bulk
>   0.14%  [kernel]  [k] kmem_cache_alloc
>   0.14%  [kernel]  [k] link_path_walk
>   0.14%  libc-2.17.so  [.] _int_free
>   0.14%  [kernel]  [k] __update_load_avg_cfs_rq
>   0.14%  perf.5.7.0-master.20200601.ol7.x86_64  [.] 0x000eac29
>   0.13% 

severe proc dentry lock contention

2020-06-18 Thread Junxiao Bi

Hi,

When debugging a performance issue, I found that thousands of threads
exiting at around the same time can cause severe spin lock contention on
the proc dentry "/proc/$parent_process_pid/task/". That is because each
thread needs to clean up its pid file from that directory when it exits.
See the following standalone test case, which simulates the problem, and
the perf top result on a v5.7 kernel. Any idea on how to fix this?



   PerfTop:   48891 irqs/sec  kernel:95.6%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 72 CPUs)
---

 66.10%  [kernel]  [k] native_queued_spin_lock_slowpath
  1.13%  [kernel]  [k] _raw_spin_lock
  0.84%  [kernel]  [k] clear_page_erms
  0.82%  [kernel]  [k] queued_write_lock_slowpath
  0.64%  [kernel]  [k] proc_task_readdir
  0.61%  [kernel]  [k] find_idlest_group.isra.95
  0.61%  [kernel]  [k] syscall_return_via_sysret
  0.55%  [kernel]  [k] entry_SYSCALL_64
  0.49%  [kernel]  [k] memcpy_erms
  0.46%  [kernel]  [k] update_cfs_group
  0.41%  [kernel]  [k] get_pid_task
  0.39%  [kernel]  [k] _raw_spin_lock_irqsave
  0.37%  [kernel]  [k] __list_del_entry_valid
  0.34%  [kernel]  [k] get_page_from_freelist
  0.34%  [kernel]  [k] __d_lookup
  0.32%  [kernel]  [k] update_load_avg
  0.31%  libc-2.17.so  [.] get_next_seq
  0.27%  [kernel]  [k] avc_has_perm_noaudit
  0.26%  [kernel]  [k] __sched_text_start
  0.25%  [kernel]  [k] selinux_inode_permission
  0.25%  [kernel]  [k] __slab_free
  0.24%  [kernel]  [k] detach_entity_cfs_rq
  0.23%  [kernel]  [k] zap_pte_range
  0.22%  [kernel]  [k] _find_next_bit.constprop.1
  0.22%  libc-2.17.so  [.] vfprintf
  0.20%  libc-2.17.so  [.] _int_malloc
  0.19%  [kernel]  [k] _raw_spin_lock_irq
  0.18%  [kernel]  [k] rb_erase
  0.18%  [kernel]  [k] pid_revalidate
  0.18%  [kernel]  [k] lockref_get_not_dead
  0.18%  [kernel]  [k] __alloc_pages_nodemask
  0.17%  [kernel]  [k] set_task_cpu
  0.17%  libc-2.17.so  [.] __strcoll_l
  0.17%  [kernel]  [k] do_syscall_64
  0.17%  [kernel]  [k] __vmalloc_node_range
  0.17%  libc-2.17.so  [.] _IO_vfscanf
  0.17%  [kernel]  [k] refcount_dec_not_one
  0.15%  [kernel]  [k] __task_pid_nr_ns
  0.15%  [kernel]  [k] native_irq_return_iret
  0.15%  [kernel]  [k] free_pcppages_bulk
  0.14%  [kernel]  [k] kmem_cache_alloc
  0.14%  [kernel]  [k] link_path_walk
  0.14%  libc-2.17.so  [.] _int_free
  0.14%  [kernel]  [k] __update_load_avg_cfs_rq
  0.14%  perf.5.7.0-master.20200601.ol7.x86_64  [.] 0x000eac29
  0.13%  [kernel]  [k] kmem_cache_free
  0.13%  [kernel]  [k] number
  0.13%  [kernel]  [k] memset_erms
  0.12%  [kernel]  [k] proc_pid_status
  0.12%  [kernel]  [k] __d_lookup_rcu


=== runme.sh ==

#!/bin/bash

threads=${1:-1}
prog=proc_race
while [ 1 ]; do ./$prog $threads; done &

while [ 1 ]; do
    pid=`ps aux | grep $prog | grep -v grep| awk '{print $2}'`
    if [ -z $pid ]; then continue; fi
    threadnum=`ls -l /proc/$pid/task | wc -l`
    if [ $threadnum -gt $threads ]; then
        echo kill $pid
        kill -9 $pid
    fi
done


===proc_race.c=


#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

#define
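
The archived source above is cut off; for reference, a minimal
self-contained sketch of this kind of reproducer (the default thread
count, the small stack size and the pause()-until-killed structure are
assumptions, not the original proc_race.c) looks like this. Build with
"gcc -O2 -o proc_race proc_race.c -pthread"; runme.sh then kills the
whole process with SIGKILL once all threads exist, so thousands of
threads exit at the same time and tear down /proc/<pid>/task/<tid>
dentries in parallel.

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

/* Each thread just sleeps until the process is killed from outside. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;)
        pause();
    return NULL;
}

int main(int argc, char *argv[])
{
    long i, nthreads = argc > 1 ? atol(argv[1]) : 1000;
    pthread_attr_t attr;
    pthread_t tid;
    int err;

    err = pthread_attr_init(&attr);
    if (err)
        handle_error_en(err, "pthread_attr_init");
    /* Small stacks keep thousands of threads cheap. */
    pthread_attr_setstacksize(&attr, 64 * 1024);

    for (i = 0; i < nthreads; i++) {
        err = pthread_create(&tid, &attr, worker, NULL);
        if (err)
            handle_error_en(err, "pthread_create");
    }

    /* Wait here until runme.sh sends SIGKILL to the whole process. */
    for (;;)
        pause();
    return 0;
}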