Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
在 2020/12/31 下午5:22, Greg Kroah-Hartman 写道: On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote: 在 2020/12/4 上午2:31, Wen Yang 写道: The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This problem occurred in our cluster environment (Linux 4.9 LTS). We could reproduce it by manually constructing a test program + adding some debugging switches in the kernel: * A test program to open the directory (/proc//ns) [1] * Adding some debugging switches to the kernel, adding a delay between proc_flush_task and __exit_signal in release_task() [2] The test process is as follows: A, terminal #1 Turn on the debug switch: echo 1> /proc/sys/vm/dentry_debug_trace Execute the following unshare command: sudo unshare --pid --fork --mount-proc bash B, terminal #2 Find the pid of the unshare process: # pstree -p | grep unshare | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818) Find the corresponding dentry: # dmesg | grep pid=818 [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/8802c7b670e8 C, terminal #3 Execute the opendir program, it will always open the /proc/818/ns/ directory: # ./a.out /proc/818/ns/ pid: 876 . .. net uts ipc pid user mnt cgroup D, go back to terminal #2 Turn on the debugging switches to construct the race: # echo 818> /proc/sys/vm/dentry_debug_pid # echo 1> /proc/sys/vm/dentry_debug_delay Kill the unshare process (pid 818). Since the debugging switches have been turned on, it will get stuck in release_task(): # kill -9 818 Then kill the process that opened the /proc/818/ns/ directory: # kill -9 876 Then turn off these debugging switches to allow the 818 process to exit: # echo 0> /proc/sys/vm/dentry_debug_delay # echo 0> /proc/sys/vm/dentry_debug_pid Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0, and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still cached: # dmesg | grep 8802a3999548 … [565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, pinode=8802c7812010, keywords: be cached It could also be verified via the crash tool: crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528 d_flags = 0x2800cc d_iname = "ns\000" d_inode = 0x8802b38c2010 d_lockref = { { lock_count = 0x0, { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, count = 0x0 } } } crash> kmem 8802bea7b528 CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME 8802dd5f5900 192 23663 2613087116k dentry SLAB MEMORYNODE TOTAL ALLOCATED FREE ea000afa9e00 8802bea78000 0 30 25 5 FREE / [ALLOCATED] [8802bea7b520] PAGEPHYSICAL MAPPING INDEX CNT FLAGS ea000afa9ec0 2bea7b000 dead04000 0 2f8000 crash> This series of patches is to fix this issue. Regards, Wen Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode Christian Brauner (1): clone: add CLONE_PIDFD Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes
Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote: > > > 在 2020/12/4 上午2:31, Wen Yang 写道: > > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they > > should be deleted when the process exits. > > > > Suppose the following race appears: > > > > release_task dput > > -> proc_flush_task > > -> dentry->d_op->d_delete(dentry) > > -> __exit_signal > > -> dentry->d_lockref.count-- and return. > > > > In the proc_flush_task(), if another process is using this dentry, it will > > not be deleted. At the same time, in dput(), d_op->d_delete() can be > > executed > > before __exit_signal(pid has not been hashed), d_delete returns false, so > > this dentry still cannot be deleted. > > > > This dentry will always be cached (although its count is 0 and the > > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and > > these dentries can only be deleted when drop_caches is manually triggered. > > > > This will result in wasted memory. What's more troublesome is that these > > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a > > limit on the number of pid namespaces"), if the pid cannot be released, it > > may result in the inability to create a new pid_ns. > > > > This problem occurred in our cluster environment (Linux 4.9 LTS). > > We could reproduce it by manually constructing a test program + adding some > > debugging switches in the kernel: > > * A test program to open the directory (/proc//ns) [1] > > * Adding some debugging switches to the kernel, adding a delay between > > proc_flush_task and __exit_signal in release_task() [2] > > > > The test process is as follows: > > > > A, terminal #1 > > > > Turn on the debug switch: > > echo 1> /proc/sys/vm/dentry_debug_trace > > > > Execute the following unshare command: > > sudo unshare --pid --fork --mount-proc bash > > > > > > B, terminal #2 > > > > Find the pid of the unshare process: > > > > # pstree -p | grep unshare > > | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818) > > > > > > Find the corresponding dentry: > > # dmesg | grep pid=818 > > [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 > > entry=818/8802c7b670e8 > > > > > > C, terminal #3 > > > > Execute the opendir program, it will always open the /proc/818/ns/ > > directory: > > > > # ./a.out /proc/818/ns/ > > pid: 876 > > . > > .. > > net > > uts > > ipc > > pid > > user > > mnt > > cgroup > > > > D, go back to terminal #2 > > > > Turn on the debugging switches to construct the race: > > # echo 818> /proc/sys/vm/dentry_debug_pid > > # echo 1> /proc/sys/vm/dentry_debug_delay > > > > Kill the unshare process (pid 818). Since the debugging switches have been > > turned on, it will get stuck in release_task(): > > # kill -9 818 > > > > Then kill the process that opened the /proc/818/ns/ directory: > > # kill -9 876 > > > > Then turn off these debugging switches to allow the 818 process to exit: > > # echo 0> /proc/sys/vm/dentry_debug_delay > > # echo 0> /proc/sys/vm/dentry_debug_pid > > > > Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is > > 0, > > and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is > > still > > cached: > > # dmesg | grep 8802a3999548 > > … > > [565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, > > inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, > > pinode=8802c7812010, keywords: be cached > > > > > > It could also be verified via the crash tool: > > > > crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528 > >d_flags = 0x2800cc > >d_iname = "ns\000" > >d_inode = 0x8802b38c2010 > >d_lockref = { > > { > >lock_count = 0x0, > >{ > > lock = { > >{ > > rlock = { > >raw_lock = { > > { > >val = { > > counter = 0x0 > >}, > >{ > > locked = 0x0, > > pending = 0x0 > >}, > >{ > > locked_pending = 0x0, > > tail = 0x0 > >} > > } > >} > > } > >} > > }, > > count = 0x0 > >} > > } > >} > > crash> kmem 8802bea7b528 > > CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME > > 8802dd5f5900 192 23663 2613087116k dentry > >SLAB MEMORYNODE TOTAL ALLOCATED FREE > >ea000afa9e00 8802bea78000 0 30 25 5 > >FREE / [ALLOCATED] > >[8802bea7b520] > > > >PAGEPHYSICAL MAPPING INDEX CNT
Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
在 2020/12/4 上午2:31, Wen Yang 写道: The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears: release_task dput -> proc_flush_task -> dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count-- and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This problem occurred in our cluster environment (Linux 4.9 LTS). We could reproduce it by manually constructing a test program + adding some debugging switches in the kernel: * A test program to open the directory (/proc//ns) [1] * Adding some debugging switches to the kernel, adding a delay between proc_flush_task and __exit_signal in release_task() [2] The test process is as follows: A, terminal #1 Turn on the debug switch: echo 1> /proc/sys/vm/dentry_debug_trace Execute the following unshare command: sudo unshare --pid --fork --mount-proc bash B, terminal #2 Find the pid of the unshare process: # pstree -p | grep unshare | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818) Find the corresponding dentry: # dmesg | grep pid=818 [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/8802c7b670e8 C, terminal #3 Execute the opendir program, it will always open the /proc/818/ns/ directory: # ./a.out /proc/818/ns/ pid: 876 . .. net uts ipc pid user mnt cgroup D, go back to terminal #2 Turn on the debugging switches to construct the race: # echo 818> /proc/sys/vm/dentry_debug_pid # echo 1> /proc/sys/vm/dentry_debug_delay Kill the unshare process (pid 818). Since the debugging switches have been turned on, it will get stuck in release_task(): # kill -9 818 Then kill the process that opened the /proc/818/ns/ directory: # kill -9 876 Then turn off these debugging switches to allow the 818 process to exit: # echo 0> /proc/sys/vm/dentry_debug_delay # echo 0> /proc/sys/vm/dentry_debug_pid Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0, and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still cached: # dmesg | grep 8802a3999548 … [565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, pinode=8802c7812010, keywords: be cached It could also be verified via the crash tool: crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528 d_flags = 0x2800cc d_iname = "ns\000" d_inode = 0x8802b38c2010 d_lockref = { { lock_count = 0x0, { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, count = 0x0 } } } crash> kmem 8802bea7b528 CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME 8802dd5f5900 192 23663 2613087116k dentry SLAB MEMORYNODE TOTAL ALLOCATED FREE ea000afa9e00 8802bea78000 0 30 25 5 FREE / [ALLOCATED] [8802bea7b520] PAGEPHYSICAL MAPPING INDEX CNT FLAGS ea000afa9ec0 2bea7b000 dead04000 0 2f8000 crash> This series of patches is to fix this issue. Regards, Wen Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode Christian Brauner (1): clone: add CLONE_PIDFD Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes sibling_inodes proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache proc: Clear the pieces of proc_inode that proc_evict_inode cares
[PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry
The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they should be deleted when the process exits. Suppose the following race appears??? release_task?? dput -> proc_flush_task ->??dentry->d_op->d_delete(dentry) -> __exit_signal -> dentry->d_lockref.count--?? and return. In the proc_flush_task(), if another process is using this dentry, it will not be deleted. At the same time, in dput(), d_op->d_delete() can be executed before __exit_signal(pid has not been hashed), d_delete returns false, so this dentry still cannot be deleted. This dentry will always be cached (although its count is 0 and the DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and these dentries can only be deleted when drop_caches is manually triggered. This will result in wasted memory. What's more troublesome is that these dentries reference pid, according to the commit f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), if the pid cannot be released, it may result in the inability to create a new pid_ns. This problem occurred in our cluster environment (Linux 4.9 LTS). We could reproduce it by manually constructing a test program + adding some debugging switches in the kernel: * A test program to open the directory (/proc//ns) [1] * Adding some debugging switches to the kernel, adding a delay between proc_flush_task and __exit_signal in release_task() [2] The test process is as follows: A, terminal #1 Turn on the debug switch: echo 1> /proc/sys/vm/dentry_debug_trace Execute the following unshare command: sudo unshare --pid --fork --mount-proc bash B, terminal #2 Find the pid of the unshare process: # pstree -p | grep unshare | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818) Find the corresponding dentry: # dmesg | grep pid=818 [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/8802c7b670e8 C, terminal #3 Execute the opendir program, it will always open the /proc/818/ns/ directory: # ./a.out /proc/818/ns/ pid: 876 . .. net uts ipc pid user mnt cgroup D, go back to terminal #2 Turn on the debugging switches to construct the race: # echo 818> /proc/sys/vm/dentry_debug_pid # echo 1> /proc/sys/vm/dentry_debug_delay Kill the unshare process (pid 818). Since the debugging switches have been turned on, it will get stuck in release_task(): # kill -9 818 Then kill the process that opened the /proc/818/ns/ directory: # kill -9 876 Then turn off these debugging switches to allow the 818 process to exit: # echo 0> /proc/sys/vm/dentry_debug_delay # echo 0> /proc/sys/vm/dentry_debug_pid Checking the dmesg, we will find that the dentry(/proc/818/ns) ???s count is 0, and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still cached: # dmesg | grep 8802a3999548 ??? [565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, pinode=8802c7812010, keywords: be cached It could also be verified via the crash tool: crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528 d_flags = 0x2800cc d_iname = "ns\000" d_inode = 0x8802b38c2010 d_lockref = { { lock_count = 0x0, { lock = { { rlock = { raw_lock = { { val = { counter = 0x0 }, { locked = 0x0, pending = 0x0 }, { locked_pending = 0x0, tail = 0x0 } } } } } }, count = 0x0 } } } crash> kmem 8802bea7b528 CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME 8802dd5f5900 192 23663 2613087116k dentry SLAB MEMORYNODE TOTAL ALLOCATED FREE ea000afa9e00 8802bea78000 0 30 25 5 FREE / [ALLOCATED] [8802bea7b520] PAGEPHYSICAL MAPPING INDEX CNT FLAGS ea000afa9ec0 2bea7b000 dead04000 0 2f8000 crash> This series of patches is to fix this issue. Regards, Wen Alexey Dobriyan (1): proc: use %u for pid printing and slightly less stack Andreas Gruenbacher (1): proc: Pass file mode to proc_pid_make_inode Christian Brauner (1): clone: add CLONE_PIDFD Eric W. Biederman (6): proc: Better ownership of files for non-dumpable tasks in user namespaces proc: Rename in proc_inode rename sysctl_inodes sibling_inodes proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache proc: Clear the pieces of proc_inode that proc_evict_inode cares