Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2021-01-03 Thread Wen Yang




在 2020/12/31 下午5:22, Greg Kroah-Hartman 写道:

On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:



在 2020/12/4 上午2:31, Wen Yang 写道:

The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they
should be deleted when the process exits.

Suppose the following race appears:

release_task     dput
-> proc_flush_task
       -> dentry->d_op->d_delete(dentry)
-> __exit_signal
   -> dentry->d_lockref.count--  and return.

In the proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal(pid has not been hashed), d_delete returns false, so
this dentry still cannot be deleted.

This dentry will always be cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc//ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
 proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1> /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
 | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
entry=818/8802c7b670e8


C, terminal #3

Execute the opendir program, it will always open the /proc/818/ns/ directory:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818> /proc/sys/vm/dentry_debug_pid
# echo 1> /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0> /proc/sys/vm/dentry_debug_delay
# echo 0> /proc/sys/vm/dentry_debug_pid

Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still
cached:
# dmesg | grep 8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
pinode=8802c7812010, keywords: be cached


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  8802bea7b528
d_flags = 0x2800cc
d_iname = "ns\000"
d_inode = 0x8802b38c2010
d_lockref = {
  {
lock_count = 0x0,
{
  lock = {
{
  rlock = {
raw_lock = {
  {
val = {
  counter = 0x0
},
{
  locked = 0x0,
  pending = 0x0
},
{
  locked_pending = 0x0,
  tail = 0x0
}
  }
}
  }
}
  },
  count = 0x0
}
  }
}
crash> kmem  8802bea7b528
CACHE OBJSIZE  ALLOCATED TOTAL  SLABS  SSIZE  NAME
8802dd5f5900  192  23663 2613087116k  dentry
SLAB  MEMORYNODE  TOTAL  ALLOCATED  FREE
ea000afa9e00  8802bea78000 0 30 25 5
FREE / [ALLOCATED]
[8802bea7b520]

PAGEPHYSICAL  MAPPING   INDEX CNT FLAGS
ea000afa9ec0 2bea7b000 dead04000  0 2f8000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
clone: add CLONE_PIDFD

Eric W. Biederman (6):
proc: Better ownership of files for non-dumpable tasks in user
  namespaces
proc: Rename in proc_inode rename sysctl_inodes 

Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2020-12-31 Thread Greg Kroah-Hartman
On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:
> 
> 
> 在 2020/12/4 上午2:31, Wen Yang 写道:
> > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they
> > should be deleted when the process exits.
> > 
> > Suppose the following race appears:
> > 
> > release_task     dput
> > -> proc_flush_task
> >       -> dentry->d_op->d_delete(dentry)
> > -> __exit_signal
> >   -> dentry->d_lockref.count--  and return.
> > 
> > In the proc_flush_task(), if another process is using this dentry, it will
> > not be deleted. At the same time, in dput(), d_op->d_delete() can be 
> > executed
> > before __exit_signal(pid has not been hashed), d_delete returns false, so
> > this dentry still cannot be deleted.
> > 
> > This dentry will always be cached (although its count is 0 and the
> > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
> > these dentries can only be deleted when drop_caches is manually triggered.
> > 
> > This will result in wasted memory. What's more troublesome is that these
> > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
> > limit on the number of pid namespaces"), if the pid cannot be released, it
> > may result in the inability to create a new pid_ns.
> > 
> > This problem occurred in our cluster environment (Linux 4.9 LTS).
> > We could reproduce it by manually constructing a test program + adding some
> > debugging switches in the kernel:
> > * A test program to open the directory (/proc//ns) [1]
> > * Adding some debugging switches to the kernel, adding a delay between
> > proc_flush_task and __exit_signal in release_task() [2]
> > 
> > The test process is as follows:
> > 
> > A, terminal #1
> > 
> > Turn on the debug switch:
> > echo 1> /proc/sys/vm/dentry_debug_trace
> > 
> > Execute the following unshare command:
> > sudo unshare --pid --fork --mount-proc bash
> > 
> > 
> > B, terminal #2
> > 
> > Find the pid of the unshare process:
> > 
> > # pstree -p | grep unshare
> > | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)
> > 
> > 
> > Find the corresponding dentry:
> > # dmesg | grep pid=818
> > [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
> > entry=818/8802c7b670e8
> > 
> > 
> > C, terminal #3
> > 
> > Execute the opendir program, it will always open the /proc/818/ns/ 
> > directory:
> > 
> > # ./a.out /proc/818/ns/
> > pid: 876
> > .
> > ..
> > net
> > uts
> > ipc
> > pid
> > user
> > mnt
> > cgroup
> > 
> > D, go back to terminal #2
> > 
> > Turn on the debugging switches to construct the race:
> > # echo 818> /proc/sys/vm/dentry_debug_pid
> > # echo 1> /proc/sys/vm/dentry_debug_delay
> > 
> > Kill the unshare process (pid 818). Since the debugging switches have been
> > turned on, it will get stuck in release_task():
> > # kill -9 818
> > 
> > Then kill the process that opened the /proc/818/ns/ directory:
> > # kill -9 876
> > 
> > Then turn off these debugging switches to allow the 818 process to exit:
> > # echo 0> /proc/sys/vm/dentry_debug_delay
> > # echo 0> /proc/sys/vm/dentry_debug_pid
> > 
> > Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 
> > 0,
> > and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is 
> > still
> > cached:
> > # dmesg | grep 8802a3999548
> > …
> > [565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
> > inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
> > pinode=8802c7812010, keywords: be cached
> > 
> > 
> > It could also be verified via the crash tool:
> > 
> > crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  8802bea7b528
> >d_flags = 0x2800cc
> >d_iname = "ns\000"
> >d_inode = 0x8802b38c2010
> >d_lockref = {
> >  {
> >lock_count = 0x0,
> >{
> >  lock = {
> >{
> >  rlock = {
> >raw_lock = {
> >  {
> >val = {
> >  counter = 0x0
> >},
> >{
> >  locked = 0x0,
> >  pending = 0x0
> >},
> >{
> >  locked_pending = 0x0,
> >  tail = 0x0
> >}
> >  }
> >}
> >  }
> >}
> >  },
> >  count = 0x0
> >}
> >  }
> >}
> > crash> kmem  8802bea7b528
> > CACHE OBJSIZE  ALLOCATED TOTAL  SLABS  SSIZE  NAME
> > 8802dd5f5900  192  23663 2613087116k  dentry
> >SLAB  MEMORYNODE  TOTAL  ALLOCATED  FREE
> >ea000afa9e00  8802bea78000 0 30 25 5
> >FREE / [ALLOCATED]
> >[8802bea7b520]
> > 
> >PAGEPHYSICAL  MAPPING   INDEX CNT 

Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2020-12-16 Thread Wen Yang




在 2020/12/4 上午2:31, Wen Yang 写道:

The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they
should be deleted when the process exits.

Suppose the following race appears:

release_task     dput
-> proc_flush_task
      -> dentry->d_op->d_delete(dentry)
-> __exit_signal
  -> dentry->d_lockref.count--  and return.

In the proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal(pid has not been hashed), d_delete returns false, so
this dentry still cannot be deleted.

This dentry will always be cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc//ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1> /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
| `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
entry=818/8802c7b670e8


C, terminal #3

Execute the opendir program, it will always open the /proc/818/ns/ directory:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818> /proc/sys/vm/dentry_debug_pid
# echo 1> /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0> /proc/sys/vm/dentry_debug_delay
# echo 0> /proc/sys/vm/dentry_debug_pid

Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0,
and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still
cached:
# dmesg | grep 8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
pinode=8802c7812010, keywords: be cached


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  8802bea7b528
   d_flags = 0x2800cc
   d_iname = "ns\000"
   d_inode = 0x8802b38c2010
   d_lockref = {
 {
   lock_count = 0x0,
   {
 lock = {
   {
 rlock = {
   raw_lock = {
 {
   val = {
 counter = 0x0
   },
   {
 locked = 0x0,
 pending = 0x0
   },
   {
 locked_pending = 0x0,
 tail = 0x0
   }
 }
   }
 }
   }
 },
 count = 0x0
   }
 }
   }
crash> kmem  8802bea7b528
CACHE OBJSIZE  ALLOCATED TOTAL  SLABS  SSIZE  NAME
8802dd5f5900  192  23663 2613087116k  dentry
   SLAB  MEMORYNODE  TOTAL  ALLOCATED  FREE
   ea000afa9e00  8802bea78000 0 30 25 5
   FREE / [ALLOCATED]
   [8802bea7b520]

   PAGEPHYSICAL  MAPPING   INDEX CNT FLAGS
ea000afa9ec0 2bea7b000 dead04000  0 2f8000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
   proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
   proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
   clone: add CLONE_PIDFD

Eric W. Biederman (6):
   proc: Better ownership of files for non-dumpable tasks in user
 namespaces
   proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
   proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
   proc: Clear the pieces of proc_inode that proc_evict_inode cares 

[PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2020-12-03 Thread Wen Yang
The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they 
should be deleted when the process exits. 

Suppose the following race appears??? 

release_task??  dput 
-> proc_flush_task 
   
->??dentry->d_op->d_delete(dentry) 
-> __exit_signal 
 -> 
dentry->d_lockref.count--?? and return. 

In the proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal(pid has not been hashed), d_delete returns false, so
this dentry still cannot be deleted.

This dentry will always be cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc//ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
   proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1> /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
   | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
entry=818/8802c7b670e8


C, terminal #3

Execute the opendir program, it will always open the /proc/818/ns/ directory:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818> /proc/sys/vm/dentry_debug_pid
# echo 1> /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0> /proc/sys/vm/dentry_debug_delay
# echo 0> /proc/sys/vm/dentry_debug_pid

Checking the dmesg, we will find that the dentry(/proc/818/ns) ???s count is 0,
and the flag is 2800cc (#define DCACHE_OP_DELETE 0x0008), but it is still
cached:
# dmesg | grep 8802a3999548
???
[565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
pinode=8802c7812010, keywords: be cached


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x  8802bea7b528
  d_flags = 0x2800cc
  d_iname = "ns\000"
  d_inode = 0x8802b38c2010
  d_lockref = {
{
  lock_count = 0x0,
  {
lock = {
  {
rlock = {
  raw_lock = {
{
  val = {
counter = 0x0
  },
  {
locked = 0x0,
pending = 0x0
  },
  {
locked_pending = 0x0,
tail = 0x0
  }
}
  }
}
  }
},
count = 0x0
  }
}
  }
crash> kmem  8802bea7b528
CACHE OBJSIZE  ALLOCATED TOTAL  SLABS  SSIZE  NAME
8802dd5f5900  192  23663 2613087116k  dentry
  SLAB  MEMORYNODE  TOTAL  ALLOCATED  FREE
  ea000afa9e00  8802bea78000 0 30 25 5
  FREE / [ALLOCATED]
  [8802bea7b520]

  PAGEPHYSICAL  MAPPING   INDEX CNT FLAGS
ea000afa9ec0 2bea7b000 dead04000  0 2f8000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
  proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
  proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
  clone: add CLONE_PIDFD

Eric W. Biederman (6):
  proc: Better ownership of files for non-dumpable tasks in user
namespaces
  proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  proc: Clear the pieces of proc_inode that proc_evict_inode cares