Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote: > > I have tested thes changes by booting with the code compiled in and > by killing "bpfilter_umh" and running iptables -vnL to restart > the userspace driver. > > I have compiled tested each change with and without CONFIG_BPFILTER > enabled. Sounds like grounds for a selftests driver and respective selftest? And if so, has the other issues Tetsuo reported be hacked into one? Luis
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Just to make certain I understand what is going on I instrumented a kernel with some print statements. a) The workqueues and timers start before populate_rootfs. b) populate_rootfs does indeed happen long before the bpfilter module is intialized. c) What prevents populate_rootfs and the umd_load_blob from having problems when they call flush_delayed_put is the fact that fput_many does: "schedule_delayed_work(&delayed_fput_work,1)". That 1 requests a delay of at least 1 jiffy. A jiffy is between 1ms and 10ms depending on how Linux is configured. In my test configuration running a kernel in kvm printing to a serial console I measured 0.8ms between the fput in blob_to_mnt and flush_delayed_fput which immediately follows it. So unless the fput becomes incredibly slow there is nothing to worry about in blob_to_mnt. d) As the same mechanism is used by populate_rootfs. A but in the mechanism applies to both. e) No one appears to have reported a problem executing files out of initramfs these last several years since the flush_delayed_fput was introduced. f) The code works for me. There is real reason to believe the code will work for everyone else, as the exact same logic is used by initramfs. So it should be perfectly fine for the patchset and the usermode_driver code to go ahead as written. h) If there is something to be fixed it is flush_delayed_fput as that is much more important than anything in the usermode driver code. Eric p.s.) When I talked of restarts of the usermode driver code ealier I was referring to the code that restarts the usermode driver if it is killed, the next time the kernel tries to talk to it. That could mask an -ETXTBUSY except if it happens on the first exec the net/bfilter/bpfilter_kern.c:load_umh() will return an error.
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On 2020/07/04 7:25, Eric W. Biederman wrote: > Tetsuo Handa writes: > >> On 2020/07/02 22:08, Eric W. Biederman wrote: By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says that use of flush_delayed_fput() has to be careful. Al, is it safe to call flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be called from both kernel thread and from process context (e.g. init_module() syscall by /sbin/insmod )) ? >>> >>> And __fput_sync needs to be even more careful. >>> umd_load_blob is called in these changes without any locks held. >> >> But where is the guarantee that a thread which called flush_delayed_fput() >> waits for >> the completion of processing _all_ "struct file" linked into >> delayed_fput_list ? >> If some other thread or delayed_fput_work (scheduled by fput_many()) called >> flush_delayed_fput() between blob_to_mnt()'s fput(file) and >> flush_delayed_fput() >> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" >> which >> needs to be processed before execve(), can't it? > > As a module the guarantee is we call task_work_run. No. It is possible that blob_to_mnt() is called by a kernel thread which was started by init_module() syscall by /sbin/insmod . > Built into the kernel the guarantee as best I can trace it is that > kthreadd hasn't started, and as such nothing that is scheduled has run > yet. Have you ever checked how early the kthreadd (PID=2) gets started? -- --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process( trace_task_newtask(p, clone_flags); uprobe_copy_process(p, clone_flags); + printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm); return p; bad_fork_cancel_cgroup: -- -- [0.090757][T0] pid_max: default: 65536 minimum: 512 [0.090890][T0] LSM: Security Framework initializing [0.090890][T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear) [0.090890][T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear) [0.090890][T0] Disabled fast string operations [0.090890][T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024 [0.090890][T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4 [0.090890][T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization [0.090890][T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available! [0.090890][T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp [0.090890][T0] SRBDS: Unknown: Dependent on hypervisor status [0.090890][T0] MDS: Mitigation: Clear CPU buffers [0.090890][T0] Freeing SMP alternatives memory: 24K [0.090890][T0] Created PID: 1 Comm: swapper/0 [0.090890][T0] Created PID: 2 Comm: swapper/0 [0.090890][T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3) [0.091000][T2] Created PID: 3 Comm: kthreadd [0.091995][T2] Created PID: 4 Comm: kthreadd [0.093028][T2] Created PID: 5 Comm: kthreadd [0.093997][T2] Created PID: 6 Comm: kthreadd [0.094995][T2] Created PID: 7 Comm: kthreadd [0.096037][T2] Created PID: 8 Comm: kthreadd (...snipped...) [0.135716][T2] Created PID: 13 Comm: kthreadd [0.135716][T1] smp: Bringing up secondary CPUs ... [0.135716][T2] Created PID: 14 Comm: kthreadd [0.135716][T2] Created PID: 15 Comm: kthreadd [0.135716][T2] Created PID: 16 Comm: kthreadd [0.135716][T2] Created PID: 17 Comm: kthreadd [0.135716][T2] Created PID: 18 Comm: kthreadd [0.135716][T1] x86: Booting SMP configuration: (...snipped...) [0.901990][T1] pci :00:00.0: Limiting direct PCI/PCI transfers [0.902145][T1] pci :00:0f.0: Video device with shadowed ROM at [mem 0x000c-0x000d] [0.902213][T1] pci :02:00.0: CLS mismatch (32 != 64), using 64 bytes [0.902224][T1] Trying to unpack rootfs image as initramfs... [1.107993][T1] Freeing initrd memory: 18876K [1.109049][T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB) [1.111003][T1] software IO TLB: mapped [mem 0xab00-0xaf00] (64MB) [1.112136][T1] check: Scanning for low memory corruption every 60 seconds [1.115040][T2] Created PID: 52 Comm: kthreadd [1.116110][T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0 [1.120936][T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled [1.129626][T2] Created PID: 53 Comm: kthreadd [1.131403][T2] Created PID: 54 Comm: kthreadd -- kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Tetsuo Handa writes: > On 2020/07/02 22:08, Eric W. Biederman wrote: >>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says >>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call >>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be >>> called from both kernel thread and from process context (e.g. init_module() >>> syscall by /sbin/insmod )) ? >> >> And __fput_sync needs to be even more careful. >> umd_load_blob is called in these changes without any locks held. > > But where is the guarantee that a thread which called flush_delayed_fput() > waits for > the completion of processing _all_ "struct file" linked into > delayed_fput_list ? > If some other thread or delayed_fput_work (scheduled by fput_many()) called > flush_delayed_fput() between blob_to_mnt()'s fput(file) and > flush_delayed_fput() > sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" > which > needs to be processed before execve(), can't it? As a module the guarantee is we call task_work_run. Built into the kernel the guarantee as best I can trace it is that kthreadd hasn't started, and as such nothing that is scheduled has run yet. > Also, I don't know how convoluted the dependency of all "struct file" linked > into > delayed_fput_list might be, for there can be "struct file" which will not be a > simple close of tmpfs file created by blob_to_mnt()'s file_open_root() > request. > > On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD > threads, > there is a guarantee that __fput_sync() waits for the completion of "struct > file" > which needs to be flushed before execve(), isn't there? There is really not a good helper or helpers, and this code suggests we have something better. Right now I have used the existing helpers to the best of my ability. If you or someone else wants to write a better version of flushing so that exec can happen be my guest. As far as I can tell what I have is good enough. >> We fundamentally AKA in any correct version of this code need to flush >> the file descriptor before we call exec or exec can not open it a >> read-only denying all writes from any other opens. >> >> The use case of flush_delayed_fput is exactly the same as that used >> when loading the initramfs. > > When loading the initramfs, the number of threads is quite few (which > means that the possibility of hitting the race window and convoluted > dependency is small). But the reality is the code run very early, before the initramfs is initialized in practice. > But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s > flush_delayed_fput() might be called after many number of threads already > started running. At which point the code probably won't be runnig from a kernel thread but instead will be running in a thread where task_work_run is relevant. At worst it is a very small race, where someone else in another thread starts flushing the file. Which means the file could still be completely close before exec. Even that is not necessarily fatal, as the usermode driver code has a respawn capability. Code that is used enough that it hits that race sounds like a very good problem to have from the perspective of the usermode driver code. > Do we really need to mount upon umd_load_blob() and unmount upon > umd_unload_blob() ? > LSM modules might prefer only one instance of filesystem for umd > blobs. It is simple. People are free to change it, but a single filesystem seems like a very good place to start with this functionality. > For pathname based LSMs, since that filesystem is not visible from mount > tree, only > info->driver_name can be used for distinction. Therefore, one instance of > filesystem > with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) > might be preferable. I took a quick look and the creation and removal of files with the in-kernel helpers is not particularly easy. Certainly it is more work and thus a higher likelyhood of bugs than what I have done. A directory per driver does sound tempting. Just more work that I am willing to do. > For inode based LSMs, reusing one instance of filesystem created upon early > boot might > be convenient for labeling. > > Also, we might want a dedicated filesystem (say, "umdfs") instead of regular > tmpfs in > order to implement protections without labeling files. Then, we might also be > able to > implement minimal protections without LSMs. All valid points. Nothing sets this design in stone. Nothing says this is the endpoint of the evolution of this code. The entire point of this patchset for me is that I remove the unnecessary special cases from exec and do_exit, so I don't have to deal with the usermode driver code anymore. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On 2020/07/02 22:08, Eric W. Biederman wrote: >> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says >> that use of flush_delayed_fput() has to be careful. Al, is it safe to call >> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be >> called from both kernel thread and from process context (e.g. init_module() >> syscall by /sbin/insmod )) ? > > And __fput_sync needs to be even more careful. > umd_load_blob is called in these changes without any locks held. But where is the guarantee that a thread which called flush_delayed_fput() waits for the completion of processing _all_ "struct file" linked into delayed_fput_list ? If some other thread or delayed_fput_work (scheduled by fput_many()) called flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput() sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which needs to be processed before execve(), can't it? Also, I don't know how convoluted the dependency of all "struct file" linked into delayed_fput_list might be, for there can be "struct file" which will not be a simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request. On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads, there is a guarantee that __fput_sync() waits for the completion of "struct file" which needs to be flushed before execve(), isn't there? > > We fundamentally AKA in any correct version of this code need to flush > the file descriptor before we call exec or exec can not open it a > read-only denying all writes from any other opens. > > The use case of flush_delayed_fput is exactly the same as that used > when loading the initramfs. When loading the initramfs, the number of threads is quite few (which means that the possibility of hitting the race window and convoluted dependency is small). But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s flush_delayed_fput() might be called after many number of threads already started running. On 2020/07/03 1:02, Eric W. Biederman wrote: On 2020/06/30 21:29, Eric W. Biederman wrote: > Hmm. The wake up happens just of tgid->wait_pidfd happens just before > release_task is called so there is a race. As it is possible to wake > up and then go back to sleep before pid_has_task becomes false. What is the reason we want to wait until pid_has_task() becomes false? - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1)); >>> >>> So that it is safe to call bpfilter_umh_cleanup. The previous code >>> performed the wait by having a callback in do_exit. >> >> But bpfilter_umh_cleanup() does only >> >> fput(info->pipe_to_umh); >> fput(info->pipe_from_umh); >> put_pid(info->tgid); >> info->tgid = NULL; >> >> which is (I think) already safe regardless of the usermode process because >> bpfilter_umh_cleanup() merely closes one side of two pipes used between >> two processes and forgets about the usermode process. > > It is not safe. > > Baring bugs there is only one use of shtudown_umh that matters. The one > in fini_umh. The use of the file by the mm must be finished before > umd_unload_blob. AKA unmount. Which completely frees the filesystem. Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ? LSM modules might prefer only one instance of filesystem for umd blobs. For pathname based LSMs, since that filesystem is not visible from mount tree, only info->driver_name can be used for distinction. Therefore, one instance of filesystem with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable. For inode based LSMs, reusing one instance of filesystem created upon early boot might be convenient for labeling. Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in order to implement protections without labeling files. Then, we might also be able to implement minimal protections without LSMs.
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Tetsuo Handa writes: > On 2020/07/02 22:08, Eric W. Biederman wrote: >> Tetsuo Handa writes: >> >>> On 2020/06/30 21:29, Eric W. Biederman wrote: Hmm. The wake up happens just of tgid->wait_pidfd happens just before release_task is called so there is a race. As it is possible to wake up and then go back to sleep before pid_has_task becomes false. >>> >>> What is the reason we want to wait until pid_has_task() becomes false? >>> >>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); >>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, >>> PIDTYPE_TGID), 1)); >> >> So that it is safe to call bpfilter_umh_cleanup. The previous code >> performed the wait by having a callback in do_exit. > > But bpfilter_umh_cleanup() does only > > fput(info->pipe_to_umh); > fput(info->pipe_from_umh); > put_pid(info->tgid); > info->tgid = NULL; > > which is (I think) already safe regardless of the usermode process because > bpfilter_umh_cleanup() merely closes one side of two pipes used between > two processes and forgets about the usermode process. It is not safe. Baring bugs there is only one use of shtudown_umh that matters. The one in fini_umh. The use of the file by the mm must be finished before umd_unload_blob. AKA unmount. Which completely frees the filesystem. >> It might be possible to call bpf_umh_cleanup early but I have not done >> that analysis. >> >> To perform the test correctly what I have right now is: > > Waiting for the termination of a SIGKILLed usermode process is not > such simple. The waiting is that simple. You are correct it might not be a quick process. A good general principle is to start with something simple and correct for what it does, and then to make it more complicated when real world cases show up, and it can be understood what the real challenges are. I am not going to merge known broken code but I am also not going to overcomplicate it. Dealing with very rare and pathological cases that are not handled or considered today is out of scope for my patchset. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On 2020/07/02 22:08, Eric W. Biederman wrote: > Tetsuo Handa writes: > >> On 2020/06/30 21:29, Eric W. Biederman wrote: >>> Hmm. The wake up happens just of tgid->wait_pidfd happens just before >>> release_task is called so there is a race. As it is possible to wake >>> up and then go back to sleep before pid_has_task becomes false. >> >> What is the reason we want to wait until pid_has_task() becomes false? >> >> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); >> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, >> PIDTYPE_TGID), 1)); > > So that it is safe to call bpfilter_umh_cleanup. The previous code > performed the wait by having a callback in do_exit. But bpfilter_umh_cleanup() does only fput(info->pipe_to_umh); fput(info->pipe_from_umh); put_pid(info->tgid); info->tgid = NULL; which is (I think) already safe regardless of the usermode process because bpfilter_umh_cleanup() merely closes one side of two pipes used between two processes and forgets about the usermode process. > > It might be possible to call bpf_umh_cleanup early but I have not done > that analysis. > > To perform the test correctly what I have right now is: Waiting for the termination of a SIGKILLed usermode process is not such simple. If a usermode process was killed by the OOM killer, it might take minutes for the killed process to reach do_exit() due to invisible memory allocation dependency chain. Since the OOM killer kicks the OOM reaper, and the OOM reaper forgets about the killed process after one second if mmap_sem could not be held (in order to avoid OOM deadlock), the OOM situation will be eventually solved; but there is no guarantee that the killed process can reach do_exit() in a short period.
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Tetsuo Handa writes: > On 2020/06/30 21:29, Eric W. Biederman wrote: >> Hmm. The wake up happens just of tgid->wait_pidfd happens just before >> release_task is called so there is a race. As it is possible to wake >> up and then go back to sleep before pid_has_task becomes false. > > What is the reason we want to wait until pid_has_task() becomes false? > > - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); > + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, > PIDTYPE_TGID), 1)); So that it is safe to call bpfilter_umh_cleanup. The previous code performed the wait by having a callback in do_exit. It might be possible to call bpf_umh_cleanup early but I have not done that analysis. To perform the test correctly what I have right now is: bool thread_group_exited(struct pid *pid) { struct task_struct *tsk; bool exited; rcu_read_lock(); tsk = pid_task(pid, PIDTYPE_PID); exited = !tsk || (READ_ONCE(tsk->exit_state) && thread_group_empty(tsk)); rcu_read_unlock(); return exited; } Which is factored out of pidfd_poll. Which means that this won't be something that the bpfilter code has to maintain. That seems to be a fundamentally good facility to have regardless of bpfilter. I will post the whole thing in a bit once I have a chance to dot my i's and cross my t's. > By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says > that use of flush_delayed_fput() has to be careful. Al, is it safe to call > flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be > called from both kernel thread and from process context (e.g. init_module() > syscall by /sbin/insmod )) ? And __fput_sync needs to be even more careful. umd_load_blob is called in these changes without any locks held. We fundamentally AKA in any correct version of this code need to flush the file descriptor before we call exec or exec can not open it a read-only denying all writes from any other opens. The use case of flush_delayed_fput is exactly the same as that used when loading the initramfs. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Alexei Starovoitov writes: > On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote: >> >> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c >> index 91474884ddb7..3e1874030daa 100644 >> --- a/net/bpfilter/bpfilter_kern.c >> +++ b/net/bpfilter/bpfilter_kern.c >> @@ -19,8 +19,8 @@ static void shutdown_umh(void) >> struct pid *tgid = info->tgid; >> >> if (tgid) { >> - kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); >> - wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); >> + kill_pid(tgid, SIGKILL, 1); >> + wait_event(tgid->wait_pidfd, !pid_has_task(tgid, >> PIDTYPE_TGID)); >> bpfilter_umh_cleanup(info); >> } >> } >> >> > And then did: >> > while true; do iptables -L;rmmod bpfilter; done >> > >> > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). >> >> Hmm. The wake up happens just of tgid->wait_pidfd happens just before >> release_task is called so there is a race. As it is possible to wake >> up and then go back to sleep before pid_has_task becomes false. >> >> So I think I need a friendly helper that does: >> >> bool task_has_exited(struct pid *tgid) >> { >> bool exited = false; >> >> rcu_read_lock(); >> tsk = pid_task(tgid, PIDTYPE_TGID); >> exited = !!tsk; >> if (tsk) { >> exited = !!tsk->exit_state; >> out: >> rcu_unlock(); >> return exited; >> } > > All makes sense to me. > If I understood the race condition such helper should indeed solve it. > Are you going to add such patch to your series? > I'll proceed with my work on top of your series and will ignore this > race for now, but I think it should be fixed before we land this set > into multiple trees. Yes. I am just finishing it up now. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote: > > diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c > index 91474884ddb7..3e1874030daa 100644 > --- a/net/bpfilter/bpfilter_kern.c > +++ b/net/bpfilter/bpfilter_kern.c > @@ -19,8 +19,8 @@ static void shutdown_umh(void) > struct pid *tgid = info->tgid; > > if (tgid) { > - kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); > - wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); > + kill_pid(tgid, SIGKILL, 1); > + wait_event(tgid->wait_pidfd, !pid_has_task(tgid, > PIDTYPE_TGID)); > bpfilter_umh_cleanup(info); > } > } > > > And then did: > > while true; do iptables -L;rmmod bpfilter; done > > > > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). > > Hmm. The wake up happens just of tgid->wait_pidfd happens just before > release_task is called so there is a race. As it is possible to wake > up and then go back to sleep before pid_has_task becomes false. > > So I think I need a friendly helper that does: > > bool task_has_exited(struct pid *tgid) > { > bool exited = false; > > rcu_read_lock(); > tsk = pid_task(tgid, PIDTYPE_TGID); > exited = !!tsk; > if (tsk) { > exited = !!tsk->exit_state; > out: > rcu_unlock(); > return exited; > } All makes sense to me. If I understood the race condition such helper should indeed solve it. Are you going to add such patch to your series? I'll proceed with my work on top of your series and will ignore this race for now, but I think it should be fixed before we land this set into multiple trees.
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On 2020/06/30 21:29, Eric W. Biederman wrote: > Hmm. The wake up happens just of tgid->wait_pidfd happens just before > release_task is called so there is a race. As it is possible to wake > up and then go back to sleep before pid_has_task becomes false. What is the reason we want to wait until pid_has_task() becomes false? - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1)); By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says that use of flush_delayed_fput() has to be careful. Al, is it safe to call flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be called from both kernel thread and from process context (e.g. init_module() syscall by /sbin/insmod )) ?
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Alexei Starovoitov writes: 2> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote: >> >> I have tested thes changes by booting with the code compiled in and >> by killing "bpfilter_umh" and running iptables -vnL to restart >> the userspace driver. >> >> I have compiled tested each change with and without CONFIG_BPFILTER >> enabled. > > With > CONFIG_BPFILTER=y > CONFIG_BPFILTER_UMH=m > it doesn't build: > > ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined! > > I've added: > +EXPORT_SYMBOL(kill_pid_info); > to continue testing... I am rather surprised I thought Tetsuo had already compile tested modules. > I suspect patch 13 is somehow responsible: > + if (tgid) { > + kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); > + wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); > + bpfilter_umh_cleanup(info); > + } > > I cannot figure out why it hangs. Some sort of race ? > Since adding short delay between kill and wait makes it work. Having had a chance to sleep kill_pid_info was a thinko, as was !pid_task. It should have been !pid_has_task as that takes the proper rcu locking. I don't know if that is going to be enough to fix the wait_event but those are obvious bugs that need to be fixed. diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c index 91474884ddb7..3e1874030daa 100644 --- a/net/bpfilter/bpfilter_kern.c +++ b/net/bpfilter/bpfilter_kern.c @@ -19,8 +19,8 @@ static void shutdown_umh(void) struct pid *tgid = info->tgid; if (tgid) { - kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); - wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); + kill_pid(tgid, SIGKILL, 1); + wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID)); bpfilter_umh_cleanup(info); } } > And then did: > while true; do iptables -L;rmmod bpfilter; done > > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). Hmm. The wake up happens just of tgid->wait_pidfd happens just before release_task is called so there is a race. As it is possible to wake up and then go back to sleep before pid_has_task becomes false. So I think I need a friendly helper that does: bool task_has_exited(struct pid *tgid) { bool exited = false; rcu_read_lock(); tsk = pid_task(tgid, PIDTYPE_TGID); exited = !!tsk; if (tsk) { exited = !!tsk->exit_state; out: rcu_unlock(); return exited; } There should be a sensible way to do that. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On 2020/06/30 10:13, Eric W. Biederman wrote: > Alexei Starovoitov writes: > >> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote: >>> >>> I have tested thes changes by booting with the code compiled in and >>> by killing "bpfilter_umh" and running iptables -vnL to restart >>> the userspace driver. >>> >>> I have compiled tested each change with and without CONFIG_BPFILTER >>> enabled. >> >> With >> CONFIG_BPFILTER=y >> CONFIG_BPFILTER_UMH=m >> it doesn't build: >> >> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined! >> >> I've added: >> +EXPORT_SYMBOL(kill_pid_info); >> to continue testing... kill_pid() is already exported. >> >> And then did: >> while true; do iptables -L;rmmod bpfilter; done >> >> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). >> >> I suspect patch 13 is somehow responsible: >> +if (tgid) { >> +kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); >> +wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); >> +bpfilter_umh_cleanup(info); >> +} >> >> I cannot figure out why it hangs. Some sort of race ? >> Since adding short delay between kill and wait makes it work. Because there is a race window that detach_pid() from __unhash_process() from __exit_signal() from release_task() from exit_notify() from do_exit() is called some time after wake_up_all(&pid->wait_pidfd) from do_notify_pidfd() from do_notify_parent() from exit_notify() from do_exit() was called (in other words, we can't use pid->wait_pidfd when pid_task() is used at wait_event()) ? Below are changes I suggest. diff --git a/kernel/umd.c b/kernel/umd.c index ff79fb16d738..f688813b8830 100644 --- a/kernel/umd.c +++ b/kernel/umd.c @@ -26,7 +26,7 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na if (IS_ERR(mnt)) return mnt; - file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY, 0700); + file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY | O_EXCL, 0700); if (IS_ERR(file)) { mntput(mnt); return ERR_CAST(file); @@ -52,16 +52,18 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t len, const char *na /** * umd_load_blob - Remember a blob of bytes for fork_usermode_driver - * @info: information about usermode driver - * @data: a blob of bytes that can be executed as a file - * @len: The lentgh of the blob + * @info: information about usermode driver (shouldn't be NULL) + * @data: a blob of bytes that can be executed as a file (shouldn't be NULL) + * @len: The lentgh of the blob (shouldn't be 0) * */ int umd_load_blob(struct umd_info *info, const void *data, size_t len) { struct vfsmount *mnt; - if (WARN_ON_ONCE(info->wd.dentry || info->wd.mnt)) + if (!info || !info->driver_name || !data || !len) + return -EINVAL; + if (info->wd.dentry || info->wd.mnt) return -EBUSY; mnt = blob_to_mnt(data, len, info->driver_name); @@ -76,15 +78,14 @@ EXPORT_SYMBOL_GPL(umd_load_blob); /** * umd_unload_blob - Disassociate @info from a previously loaded blob - * @info: information about usermode driver + * @info: information about usermode driver (shouldn't be NULL) * */ int umd_unload_blob(struct umd_info *info) { - if (WARN_ON_ONCE(!info->wd.mnt || -!info->wd.dentry || -info->wd.mnt->mnt_root != info->wd.dentry)) + if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt) return -EINVAL; + BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry); kern_unmount(info->wd.mnt); info->wd.mnt = NULL; @@ -138,7 +139,7 @@ static void umd_cleanup(struct subprocess_info *info) { struct umd_info *umd_info = info->data; - /* cleanup if umh_setup() was successful but exec failed */ + /* cleanup if umd_setup() was successful but exec failed */ if (info->retval) { fput(umd_info->pipe_to_umh); fput(umd_info->pipe_from_umh); @@ -163,7 +164,10 @@ int fork_usermode_driver(struct umd_info *info) const char *argv[] = { info->driver_name, NULL }; int err; - if (WARN_ON_ONCE(info->tgid)) + if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt) + return -EINVAL; + BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry); + if (info->tgid) return -EBUSY; err = -ENOMEM; diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c index 91474884ddb7..9dd70aacb81a 100644 --- a/net/bpfilter/bpfilter_kern.c +++ b/net/bpfilter/bpfilter_kern.c @@ -19,8 +19,13 @@ static void shutdown_umh(void) struct pid *tgid = info->tgid; if (tgid) { - kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); - wait_event(tgid->wait_pidfd, !pid_task(tgid, P
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
Alexei Starovoitov writes: > On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote: >> >> I have tested thes changes by booting with the code compiled in and >> by killing "bpfilter_umh" and running iptables -vnL to restart >> the userspace driver. >> >> I have compiled tested each change with and without CONFIG_BPFILTER >> enabled. > > With > CONFIG_BPFILTER=y > CONFIG_BPFILTER_UMH=m > it doesn't build: > > ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined! > > I've added: > +EXPORT_SYMBOL(kill_pid_info); > to continue testing... > > And then did: > while true; do iptables -L;rmmod bpfilter; done > > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). > > I suspect patch 13 is somehow responsible: > + if (tgid) { > + kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); > + wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); > + bpfilter_umh_cleanup(info); > + } > > I cannot figure out why it hangs. Some sort of race ? > Since adding short delay between kill and wait makes it work. Thanks. I will take a look. Eric
Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote: > > I have tested thes changes by booting with the code compiled in and > by killing "bpfilter_umh" and running iptables -vnL to restart > the userspace driver. > > I have compiled tested each change with and without CONFIG_BPFILTER > enabled. With CONFIG_BPFILTER=y CONFIG_BPFILTER_UMH=m it doesn't build: ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined! I've added: +EXPORT_SYMBOL(kill_pid_info); to continue testing... And then did: while true; do iptables -L;rmmod bpfilter; done Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event(). I suspect patch 13 is somehow responsible: + if (tgid) { + kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid); + wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID)); + bpfilter_umh_cleanup(info); + } I cannot figure out why it hangs. Some sort of race ? Since adding short delay between kill and wait makes it work.
[PATCH v2 00/15] Make the user mode driver code a better citizen
This is the second round of my changeset to split the user mode driver code from the user mode helper code, and to make the code use common facilities to get things done instead of recreating them just for the user mode driver code. I have split the changes into small enough pieces so they should be easily readable and testable. The changes lean into the preexisting interfaces in the kernel and remove special cases for user mode driver code in favor of solutions that don't need special cases. This results in smaller code with fewer bugs. At a practical level this removes the maintenance burden of the user mode drivers from the user mode helper code and from exec as the special cases are removed. Similarly the LSM interaction bugs are fixed by not having unnecessary special cases for user mode drivers. I have tested thes changes by booting with the code compiled in and by killing "bpfilter_umh" and running iptables -vnL to restart the userspace driver. I have compiled tested each change with and without CONFIG_BPFILTER enabled. I made a few very small changes from v1 to v2: - Updated the function name in a comment when the function is renamed - Moved some more code so that the the !CONFIG_BPFILTER case continues to compile when I moved the code into umd.c - A fix for the module loading case to really flush the file descriptor. - Removed split_argv entirely from fork_usermode_driver. There was nothing to split so it was just confusing. Please let me know if you see any bugs. Once the code review is finished I plan to place the code in a non-rebasing branch so I can pull it into my tree and so it can also be pulled into the bpf-next tree. Eric W. Biederman (15): umh: Capture the pid in umh_pipe_setup umh: Move setting PF_UMH into umh_pipe_setup umh: Rename the user mode driver helpers for clarity umh: Remove call_usermodehelper_setup_file. umh: Separate the user mode driver and the user mode helper support umd: For clarity rename umh_info umd_info umd: Rename umd_info.cmdline umd_info.driver_name umd: Transform fork_usermode_blob into fork_usermode_driver umh: Stop calling do_execve_file exec: Remove do_execve_file bpfilter: Move bpfilter_umh back into init data umd: Track user space drivers with struct pid bpfilter: Take advantage of the facilities of struct pid umd: Remove exit_umh umd: Stop using split_argv fs/exec.c| 38 ++-- include/linux/binfmts.h | 1 - include/linux/bpfilter.h | 7 +- include/linux/sched.h| 9 -- include/linux/umd.h | 18 include/linux/umh.h | 15 kernel/Makefile | 1 + kernel/exit.c| 1 - kernel/umd.c | 182 +++ kernel/umh.c | 171 +--- net/bpfilter/bpfilter_kern.c | 38 net/bpfilter/bpfilter_umh_blob.S | 2 +- net/ipv4/bpfilter/sockopt.c | 20 +++-- 13 files changed, 248 insertions(+), 255 deletions(-) v1: https://lkml.kernel.org/r/87pn9mgfc2.fsf...@x220.int.ebiederm.org --- git range-diff master v1 v2 1: 2b76f9b3158d ! 1: d8fb851fa3d8 umh: Capture the pid in umh_pipe_setup @@ Commit message code that is specific to user mode drivers from the common user path of user mode helpers. +Link: https://lkml.kernel.org/r/87h7uygf9i.fsf...@x220.int.ebiederm.org +Reviewed-by: Greg Kroah-Hartman Signed-off-by: "Eric W. Biederman" ## include/linux/umh.h ## 2: d853e933ae32 ! 2: b191c5df43ec umh: Move setting PF_UMH into umh_pipe_setup @@ Commit message Setting PF_UMH unconditionally is harmless as an action will only happen if it is paired with an entry on umh_list. +Link: https://lkml.kernel.org/r/87bll6gf8t.fsf...@x220.int.ebiederm.org +Reviewed-by: Greg Kroah-Hartman Signed-off-by: "Eric W. Biederman" ## kernel/umh.c ## 3: 92d2550f0d6a ! 3: 74e8c0bf3076 umh: Rename the user mode driver helpers for clarity @@ Commit message don't make much sense. Instead name them umd_setup and umd_cleanup for the functional role in setting up user mode drivers. +Link: https://lkml.kernel.org/r/875zbegf82.fsf...@x220.int.ebiederm.org +Reviewed-by: Greg Kroah-Hartman Signed-off-by: "Eric W. Biederman" ## kernel/umh.c ## @@ kernel/umh.c: static int umh_pipe_setup(struct subprocess_info *info, struct cre { struct umh_info *umh_info = info->data; +- /* cleanup if umh_pipe_setup() was successful but exec failed */ ++ /* cleanup if umh_setup() was successful but exec failed */ + if (info->retval) { + fput(umh_info->pipe_to_umh); + fput(umh_info->pipe_from_um