Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-07 Thread Luis Chamberlain
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
> 
> I have tested thes changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
> 
> I have compiled tested each change with and without CONFIG_BPFILTER
> enabled.

Sounds like grounds for a selftests driver and respective selftest?
And if so, has the other issues Tetsuo reported be hacked into one?

  Luis


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-07 Thread Eric W. Biederman


Just to make certain I understand what is going on I instrumented a
kernel with some print statements.

a) The workqueues and timers start before populate_rootfs.

b) populate_rootfs does indeed happen long before the bpfilter
   module is intialized.

c) What prevents populate_rootfs and the umd_load_blob from
   having problems when they call flush_delayed_put is the
   fact that fput_many does:
   "schedule_delayed_work(&delayed_fput_work,1)".

   That 1 requests a delay of at least 1 jiffy.  A jiffy is between
   1ms and 10ms depending on how Linux is configured.

   In my test configuration running a kernel in kvm printing to a serial
   console I measured 0.8ms between the fput in blob_to_mnt and
   flush_delayed_fput which immediately follows it.

   So unless the fput becomes incredibly slow there is nothing to worry
   about in blob_to_mnt.

d) As the same mechanism is used by populate_rootfs.  A but in the
   mechanism applies to both.

e) No one appears to have reported a problem executing files out of
   initramfs these last several years since the flush_delayed_fput was
   introduced.
 
f) The code works for me.  There is real reason to believe the code will
   work for everyone else, as the exact same logic is used by initramfs.
   So it should be perfectly fine for the patchset and the
   usermode_driver code to go ahead as written.

h) If there is something to be fixed it is flush_delayed_fput as that is
   much more important than anything in the usermode driver code.

Eric

p.s.) When I talked of restarts of the usermode driver code ealier I was
   referring to the code that restarts the usermode driver if it is
   killed, the next time the kernel tries to talk to it.

   That could mask an -ETXTBUSY except if it happens on the first exec
   the net/bfilter/bpfilter_kern.c:load_umh() will return an error.



Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-03 Thread Tetsuo Handa
On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa  writes:
> 
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
 By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
 that use of flush_delayed_fput() has to be careful. Al, is it safe to call
 flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might 
 be
 called from both kernel thread and from process context (e.g. init_module()
 syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() 
>> waits for
>> the completion of processing _all_ "struct file" linked into 
>> delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and 
>> flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" 
>> which
>> needs to be processed before execve(), can't it?
> 
> As a module the guarantee is we call task_work_run.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.

Have you ever checked how early the kthreadd (PID=2) gets started?

--
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
trace_task_newtask(p, clone_flags);
uprobe_copy_process(p, clone_flags);

+   printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
return p;

 bad_fork_cancel_cgroup:
--

--
[0.090757][T0] pid_max: default: 65536 minimum: 512
[0.090890][T0] LSM: Security Framework initializing
[0.090890][T0] Mount-cache hash table entries: 8192 (order: 4, 65536 
bytes, linear)
[0.090890][T0] Mountpoint-cache hash table entries: 8192 (order: 4, 
65536 bytes, linear)
[0.090890][T0] Disabled fast string operations
[0.090890][T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[0.090890][T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 
1GB 4
[0.090890][T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and 
__user pointer sanitization
[0.090890][T0] Spectre V2 : Spectre mitigation: kernel not compiled 
with retpoline; no mitigation available!
[0.090890][T0] Speculative Store Bypass: Mitigation: Speculative Store 
Bypass disabled via prctl and seccomp
[0.090890][T0] SRBDS: Unknown: Dependent on hypervisor status
[0.090890][T0] MDS: Mitigation: Clear CPU buffers
[0.090890][T0] Freeing SMP alternatives memory: 24K
[0.090890][T0] Created PID: 1 Comm: swapper/0
[0.090890][T0] Created PID: 2 Comm: swapper/0
[0.090890][T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz 
(family: 0x6, model: 0x3c, stepping: 0x3)
[0.091000][T2] Created PID: 3 Comm: kthreadd
[0.091995][T2] Created PID: 4 Comm: kthreadd
[0.093028][T2] Created PID: 5 Comm: kthreadd
[0.093997][T2] Created PID: 6 Comm: kthreadd
[0.094995][T2] Created PID: 7 Comm: kthreadd
[0.096037][T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[0.135716][T2] Created PID: 13 Comm: kthreadd
[0.135716][T1] smp: Bringing up secondary CPUs ...
[0.135716][T2] Created PID: 14 Comm: kthreadd
[0.135716][T2] Created PID: 15 Comm: kthreadd
[0.135716][T2] Created PID: 16 Comm: kthreadd
[0.135716][T2] Created PID: 17 Comm: kthreadd
[0.135716][T2] Created PID: 18 Comm: kthreadd
[0.135716][T1] x86: Booting SMP configuration:
(...snipped...)
[0.901990][T1] pci :00:00.0: Limiting direct PCI/PCI transfers
[0.902145][T1] pci :00:0f.0: Video device with shadowed ROM at [mem 
0x000c-0x000d]
[0.902213][T1] pci :02:00.0: CLS mismatch (32 != 64), using 64 bytes
[0.902224][T1] Trying to unpack rootfs image as initramfs...
[1.107993][T1] Freeing initrd memory: 18876K
[1.109049][T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[1.111003][T1] software IO TLB: mapped [mem 0xab00-0xaf00] 
(64MB)
[1.112136][T1] check: Scanning for low memory corruption every 60 
seconds
[1.115040][T2] Created PID: 52 Comm: kthreadd
[1.116110][T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[1.120936][T1] SGI XFS with ACLs, security attributes, verbose 
warnings, quota, no debug enabled
[1.129626][T2] Created PID: 53 Comm: kthreadd
[1.131403][T2] Created PID: 54 Comm: kthreadd
--

kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after init

Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-03 Thread Eric W. Biederman
Tetsuo Handa  writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>> called from both kernel thread and from process context (e.g. init_module()
>>> syscall by /sbin/insmod )) ?
>> 
>> And __fput_sync needs to be even more careful.
>> umd_load_blob is called in these changes without any locks held.
>
> But where is the guarantee that a thread which called flush_delayed_fput() 
> waits for
> the completion of processing _all_ "struct file" linked into 
> delayed_fput_list ?
> If some other thread or delayed_fput_work (scheduled by fput_many()) called
> flush_delayed_fput() between blob_to_mnt()'s fput(file) and 
> flush_delayed_fput()
> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" 
> which
> needs to be processed before execve(), can't it?

As a module the guarantee is we call task_work_run.
Built into the kernel the guarantee as best I can trace it is that
kthreadd hasn't started, and as such nothing that is scheduled has run
yet.

> Also, I don't know how convoluted the dependency of all "struct file" linked 
> into
> delayed_fput_list might be, for there can be "struct file" which will not be a
> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() 
> request.
>
> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD 
> threads,
> there is a guarantee that __fput_sync() waits for the completion of "struct 
> file"
> which needs to be flushed before execve(), isn't there?

There is really not a good helper or helpers, and this code suggests we
have something better.  Right now I have used the existing helpers to
the best of my ability.  If you or someone else wants to write a better
version of flushing so that exec can happen be my guest.

As far as I can tell what I have is good enough.

>> We fundamentally AKA in any correct version of this code need to flush
>> the file descriptor before we call exec or exec can not open it a
>> read-only denying all writes from any other opens.
>> 
>> The use case of flush_delayed_fput is exactly the same as that used
>> when loading the initramfs.
>
> When loading the initramfs, the number of threads is quite few (which
> means that the possibility of hitting the race window and convoluted
> dependency is small).

But the reality is the code run very early, before the initramfs is
initialized in practice.

> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
> flush_delayed_fput() might be called after many number of threads already
> started running.

At which point the code probably won't be runnig from a kernel thread
but instead will be running in a thread where task_work_run is relevant.

At worst it is a very small race, where someone else in another thread
starts flushing the file.  Which means the file could still be
completely close before exec.   Even that is not necessarily fatal,
as the usermode driver code has a respawn capability.

Code that is used enough that it hits that race sounds like a very
good problem to have from the perspective of the usermode driver code.

> Do we really need to mount upon umd_load_blob() and unmount upon 
> umd_unload_blob() ?
> LSM modules might prefer only one instance of filesystem for umd
> blobs.

It is simple. People are free to change it, but a single filesystem
seems like a very good place to start with this functionality.

> For pathname based LSMs, since that filesystem is not visible from mount 
> tree, only
> info->driver_name can be used for distinction. Therefore, one instance of 
> filesystem
> with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL)
> might be preferable.

I took a quick look and the creation and removal of files with the
in-kernel helpers is not particularly easy.  Certainly it is more work
and thus a higher likelyhood of bugs than what I have done.

A directory per driver does sound tempting.  Just more work that I am
willing to do.

> For inode based LSMs, reusing one instance of filesystem created upon early 
> boot might
> be convenient for labeling.
>
> Also, we might want a dedicated filesystem (say, "umdfs") instead of regular 
> tmpfs in
> order to implement protections without labeling files. Then, we might also be 
> able to
> implement minimal protections without LSMs.

All valid points.  Nothing sets this design in stone.
Nothing says this is the endpoint of the evolution of this code.

The entire point of this patchset for me is that I remove the
unnecessary special cases from exec and do_exit, so I don't have to deal
with the usermode driver code anymore.

Eric


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-03 Thread Tetsuo Handa
On 2020/07/02 22:08, Eric W. Biederman wrote:
>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>> called from both kernel thread and from process context (e.g. init_module()
>> syscall by /sbin/insmod )) ?
> 
> And __fput_sync needs to be even more careful.
> umd_load_blob is called in these changes without any locks held.

But where is the guarantee that a thread which called flush_delayed_fput() 
waits for
the completion of processing _all_ "struct file" linked into delayed_fput_list ?
If some other thread or delayed_fput_work (scheduled by fput_many()) called
flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
needs to be processed before execve(), can't it?

Also, I don't know how convoluted the dependency of all "struct file" linked 
into
delayed_fput_list might be, for there can be "struct file" which will not be a
simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.

On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD 
threads,
there is a guarantee that __fput_sync() waits for the completion of "struct 
file"
which needs to be flushed before execve(), isn't there?

> 
> We fundamentally AKA in any correct version of this code need to flush
> the file descriptor before we call exec or exec can not open it a
> read-only denying all writes from any other opens.
> 
> The use case of flush_delayed_fput is exactly the same as that used
> when loading the initramfs.

When loading the initramfs, the number of threads is quite few (which
means that the possibility of hitting the race window and convoluted
dependency is small).

But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
flush_delayed_fput() might be called after many number of threads already
started running.



On 2020/07/03 1:02, Eric W. Biederman wrote:
 On 2020/06/30 21:29, Eric W. Biederman wrote:
> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
> release_task is called so there is a race.  As it is possible to wake
> up and then go back to sleep before pid_has_task becomes false.

 What is the reason we want to wait until pid_has_task() becomes false?

 - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
 + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, 
 PIDTYPE_TGID), 1));
>>>
>>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>>> performed the wait by having a callback in do_exit.
>>
>> But bpfilter_umh_cleanup() does only
>>
>>  fput(info->pipe_to_umh);
>>  fput(info->pipe_from_umh);
>>  put_pid(info->tgid);
>>  info->tgid = NULL;
>>
>> which is (I think) already safe regardless of the usermode process because
>> bpfilter_umh_cleanup() merely closes one side of two pipes used between
>> two processes and forgets about the usermode process.
> 
> It is not safe.
> 
> Baring bugs there is only one use of shtudown_umh that matters.  The one
> in fini_umh.  The use of the file by the mm must be finished before
> umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

Do we really need to mount upon umd_load_blob() and unmount upon 
umd_unload_blob() ?
LSM modules might prefer only one instance of filesystem for umd blobs.

For pathname based LSMs, since that filesystem is not visible from mount tree, 
only
info->driver_name can be used for distinction. Therefore, one instance of 
filesystem
with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be 
preferable.

For inode based LSMs, reusing one instance of filesystem created upon early 
boot might
be convenient for labeling.

Also, we might want a dedicated filesystem (say, "umdfs") instead of regular 
tmpfs in
order to implement protections without labeling files. Then, we might also be 
able to
implement minimal protections without LSMs.



Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-02 Thread Eric W. Biederman
Tetsuo Handa  writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>> Tetsuo Handa  writes:
>> 
>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
 Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
 release_task is called so there is a race.  As it is possible to wake
 up and then go back to sleep before pid_has_task becomes false.
>>>
>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>
>>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, 
>>> PIDTYPE_TGID), 1));
>> 
>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>> performed the wait by having a callback in do_exit.
>
> But bpfilter_umh_cleanup() does only
>
>   fput(info->pipe_to_umh);
>   fput(info->pipe_from_umh);
>   put_pid(info->tgid);
>   info->tgid = NULL;
>
> which is (I think) already safe regardless of the usermode process because
> bpfilter_umh_cleanup() merely closes one side of two pipes used between
> two processes and forgets about the usermode process.

It is not safe.

Baring bugs there is only one use of shtudown_umh that matters.  The one
in fini_umh.  The use of the file by the mm must be finished before
umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

>> It might be possible to call bpf_umh_cleanup early but I have not done
>> that analysis.
>> 
>> To perform the test correctly what I have right now is:
>
> Waiting for the termination of a SIGKILLed usermode process is not
> such simple.

The waiting is that simple.

You are correct it might not be a quick process.

A good general principle is to start with something simple and correct
for what it does, and then to make it more complicated when real world
cases show up, and it can be understood what the real challenges are.

I am not going to merge known broken code but I am also not going to
overcomplicate it.

Dealing with very rare and pathological cases that are not handled or
considered today is out of scope for my patchset.

Eric


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-02 Thread Tetsuo Handa
On 2020/07/02 22:08, Eric W. Biederman wrote:
> Tetsuo Handa  writes:
> 
>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>>> release_task is called so there is a race.  As it is possible to wake
>>> up and then go back to sleep before pid_has_task becomes false.
>>
>> What is the reason we want to wait until pid_has_task() becomes false?
>>
>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, 
>> PIDTYPE_TGID), 1));
> 
> So that it is safe to call bpfilter_umh_cleanup.  The previous code
> performed the wait by having a callback in do_exit.

But bpfilter_umh_cleanup() does only

fput(info->pipe_to_umh);
fput(info->pipe_from_umh);
put_pid(info->tgid);
info->tgid = NULL;

which is (I think) already safe regardless of the usermode process because
bpfilter_umh_cleanup() merely closes one side of two pipes used between
two processes and forgets about the usermode process.

> 
> It might be possible to call bpf_umh_cleanup early but I have not done
> that analysis.
> 
> To perform the test correctly what I have right now is:

Waiting for the termination of a SIGKILLed usermode process is not
such simple. If a usermode process was killed by the OOM killer, it
might take minutes for the killed process to reach do_exit() due to
invisible memory allocation dependency chain. Since the OOM killer
kicks the OOM reaper, and the OOM reaper forgets about the killed
process after one second if mmap_sem could not be held (in order to
avoid OOM deadlock), the OOM situation will be eventually solved; but
there is no guarantee that the killed process can reach do_exit()
in a short period.



Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-02 Thread Eric W. Biederman
Tetsuo Handa  writes:

> On 2020/06/30 21:29, Eric W. Biederman wrote:
>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>> release_task is called so there is a race.  As it is possible to wake
>> up and then go back to sleep before pid_has_task becomes false.
>
> What is the reason we want to wait until pid_has_task() becomes false?
>
> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, 
> PIDTYPE_TGID), 1));

So that it is safe to call bpfilter_umh_cleanup.  The previous code
performed the wait by having a callback in do_exit.

It might be possible to call bpf_umh_cleanup early but I have not done
that analysis.

To perform the test correctly what I have right now is:

bool thread_group_exited(struct pid *pid)
{
struct task_struct *tsk;
bool exited;

rcu_read_lock();
tsk = pid_task(pid, PIDTYPE_PID);
exited = !tsk || (READ_ONCE(tsk->exit_state) && 
thread_group_empty(tsk));
rcu_read_unlock();

return exited;
}

Which is factored out of pidfd_poll.  Which means that this won't be
something that the bpfilter code has to maintain.  That seems to be a
fundamentally good facility to have regardless of bpfilter.

I will post the whole thing in a bit once I have a chance to dot my i's
and cross my t's.

> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
> called from both kernel thread and from process context (e.g. init_module()
> syscall by /sbin/insmod )) ?

And __fput_sync needs to be even more careful.
umd_load_blob is called in these changes without any locks held.

We fundamentally AKA in any correct version of this code need to flush
the file descriptor before we call exec or exec can not open it a
read-only denying all writes from any other opens.

The use case of flush_delayed_fput is exactly the same as that used
when loading the initramfs.

Eric






Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-07-01 Thread Eric W. Biederman
Alexei Starovoitov  writes:

> On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
>> 
>> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
>> index 91474884ddb7..3e1874030daa 100644
>> --- a/net/bpfilter/bpfilter_kern.c
>> +++ b/net/bpfilter/bpfilter_kern.c
>> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
>> struct pid *tgid = info->tgid;
>>  
>> if (tgid) {
>> -   kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> -   wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +   kill_pid(tgid, SIGKILL, 1);
>> +   wait_event(tgid->wait_pidfd, !pid_has_task(tgid, 
>> PIDTYPE_TGID));
>> bpfilter_umh_cleanup(info);
>> }
>>  }
>> 
>> > And then did:
>> > while true; do iptables -L;rmmod bpfilter; done
>> >  
>> > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>> 
>> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
>> release_task is called so there is a race.  As it is possible to wake
>> up and then go back to sleep before pid_has_task becomes false.
>> 
>> So I think I need a friendly helper that does:
>> 
>> bool task_has_exited(struct pid *tgid)
>> {
>>  bool exited = false;
>> 
>>  rcu_read_lock();
>> tsk = pid_task(tgid, PIDTYPE_TGID);
>> exited = !!tsk;
>> if (tsk) {
>>  exited = !!tsk->exit_state;
>> out:
>>  rcu_unlock();
>>  return exited;
>> }
>
> All makes sense to me.
> If I understood the race condition such helper should indeed solve it.
> Are you going to add such patch to your series?
> I'll proceed with my work on top of your series and will ignore this
> race for now, but I think it should be fixed before we land this set
> into multiple trees.

Yes. I am just finishing it up now.

Eric



Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-30 Thread Alexei Starovoitov
On Tue, Jun 30, 2020 at 07:29:34AM -0500, Eric W. Biederman wrote:
> 
> diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
> index 91474884ddb7..3e1874030daa 100644
> --- a/net/bpfilter/bpfilter_kern.c
> +++ b/net/bpfilter/bpfilter_kern.c
> @@ -19,8 +19,8 @@ static void shutdown_umh(void)
> struct pid *tgid = info->tgid;
>  
> if (tgid) {
> -   kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> -   wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> +   kill_pid(tgid, SIGKILL, 1);
> +   wait_event(tgid->wait_pidfd, !pid_has_task(tgid, 
> PIDTYPE_TGID));
> bpfilter_umh_cleanup(info);
> }
>  }
> 
> > And then did:
> > while true; do iptables -L;rmmod bpfilter; done
> >  
> > Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
> 
> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
> release_task is called so there is a race.  As it is possible to wake
> up and then go back to sleep before pid_has_task becomes false.
> 
> So I think I need a friendly helper that does:
> 
> bool task_has_exited(struct pid *tgid)
> {
>   bool exited = false;
> 
>   rcu_read_lock();
> tsk = pid_task(tgid, PIDTYPE_TGID);
> exited = !!tsk;
> if (tsk) {
>   exited = !!tsk->exit_state;
> out:
>   rcu_unlock();
>   return exited;
> }

All makes sense to me.
If I understood the race condition such helper should indeed solve it.
Are you going to add such patch to your series?
I'll proceed with my work on top of your series and will ignore this
race for now, but I think it should be fixed before we land this set
into multiple trees.


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-30 Thread Tetsuo Handa
On 2020/06/30 21:29, Eric W. Biederman wrote:
> Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
> release_task is called so there is a race.  As it is possible to wake
> up and then go back to sleep before pid_has_task becomes false.

What is the reason we want to wait until pid_has_task() becomes false?

- wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
+ while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, 
PIDTYPE_TGID), 1));




By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
that use of flush_delayed_fput() has to be careful. Al, is it safe to call
flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
called from both kernel thread and from process context (e.g. init_module()
syscall by /sbin/insmod )) ?


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-30 Thread Eric W. Biederman
Alexei Starovoitov  writes:

2> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>> 
>> I have tested thes changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>> 
>> I have compiled tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...

I am rather surprised I thought Tetsuo had already compile tested
modules.


> I suspect patch 13 is somehow responsible:
> + if (tgid) {
> + kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> + wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> + bpfilter_umh_cleanup(info);
> + }
>
> I cannot figure out why it hangs. Some sort of race ?
> Since adding short delay between kill and wait makes it work.

Having had a chance to sleep kill_pid_info was a thinko, as was
!pid_task.  It should have been !pid_has_task as that takes the proper
rcu locking.

I don't know if that is going to be enough to fix the wait_event
but those are obvious bugs that need to be fixed.

diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..3e1874030daa 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,8 @@ static void shutdown_umh(void)
struct pid *tgid = info->tgid;
 
if (tgid) {
-   kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-   wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+   kill_pid(tgid, SIGKILL, 1);
+   wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
bpfilter_umh_cleanup(info);
}
 }

> And then did:
> while true; do iptables -L;rmmod bpfilter; done
>  
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

Hmm.  The wake up happens just of tgid->wait_pidfd happens just before
release_task is called so there is a race.  As it is possible to wake
up and then go back to sleep before pid_has_task becomes false.

So I think I need a friendly helper that does:

bool task_has_exited(struct pid *tgid)
{
bool exited = false;

rcu_read_lock();
tsk = pid_task(tgid, PIDTYPE_TGID);
exited = !!tsk;
if (tsk) {
exited = !!tsk->exit_state;
out:
rcu_unlock();
return exited;
}

There should be a sensible way to do that.

Eric




Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-29 Thread Tetsuo Handa
On 2020/06/30 10:13, Eric W. Biederman wrote:
> Alexei Starovoitov  writes:
> 
>> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>>>
>>> I have tested thes changes by booting with the code compiled in and
>>> by killing "bpfilter_umh" and running iptables -vnL to restart
>>> the userspace driver.
>>>
>>> I have compiled tested each change with and without CONFIG_BPFILTER
>>> enabled.
>>
>> With
>> CONFIG_BPFILTER=y
>> CONFIG_BPFILTER_UMH=m
>> it doesn't build:
>>
>> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>>
>> I've added:
>> +EXPORT_SYMBOL(kill_pid_info);
>> to continue testing...

kill_pid() is already exported.

>>
>> And then did:
>> while true; do iptables -L;rmmod bpfilter; done
>>  
>> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>>
>> I suspect patch 13 is somehow responsible:
>> +if (tgid) {
>> +kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
>> +wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
>> +bpfilter_umh_cleanup(info);
>> +}
>>
>> I cannot figure out why it hangs. Some sort of race ?
>> Since adding short delay between kill and wait makes it work.

Because there is a race window that detach_pid() from __unhash_process() from
__exit_signal() from release_task() from exit_notify() from do_exit() is called
some time after wake_up_all(&pid->wait_pidfd) from do_notify_pidfd() from
do_notify_parent() from exit_notify() from do_exit() was called (in other words,
we can't use pid->wait_pidfd when pid_task() is used at wait_event()) ?

Below are changes I suggest.

diff --git a/kernel/umd.c b/kernel/umd.c
index ff79fb16d738..f688813b8830 100644
--- a/kernel/umd.c
+++ b/kernel/umd.c
@@ -26,7 +26,7 @@ static struct vfsmount *blob_to_mnt(const void *data, size_t 
len, const char *na
if (IS_ERR(mnt))
return mnt;
 
-   file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY, 
0700);
+   file = file_open_root(mnt->mnt_root, mnt, name, O_CREAT | O_WRONLY | 
O_EXCL, 0700);
if (IS_ERR(file)) {
mntput(mnt);
return ERR_CAST(file);
@@ -52,16 +52,18 @@ static struct vfsmount *blob_to_mnt(const void *data, 
size_t len, const char *na
 
 /**
  * umd_load_blob - Remember a blob of bytes for fork_usermode_driver
- * @info: information about usermode driver
- * @data: a blob of bytes that can be executed as a file
- * @len:  The lentgh of the blob
+ * @info: information about usermode driver (shouldn't be NULL)
+ * @data: a blob of bytes that can be executed as a file (shouldn't be NULL)
+ * @len:  The lentgh of the blob (shouldn't be 0)
  *
  */
 int umd_load_blob(struct umd_info *info, const void *data, size_t len)
 {
struct vfsmount *mnt;
 
-   if (WARN_ON_ONCE(info->wd.dentry || info->wd.mnt))
+   if (!info || !info->driver_name || !data || !len)
+   return -EINVAL;
+   if (info->wd.dentry || info->wd.mnt)
return -EBUSY;
 
mnt = blob_to_mnt(data, len, info->driver_name);
@@ -76,15 +78,14 @@ EXPORT_SYMBOL_GPL(umd_load_blob);
 
 /**
  * umd_unload_blob - Disassociate @info from a previously loaded blob
- * @info: information about usermode driver
+ * @info: information about usermode driver (shouldn't be NULL)
  *
  */
 int umd_unload_blob(struct umd_info *info)
 {
-   if (WARN_ON_ONCE(!info->wd.mnt ||
-!info->wd.dentry ||
-info->wd.mnt->mnt_root != info->wd.dentry))
+   if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
return -EINVAL;
+   BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
 
kern_unmount(info->wd.mnt);
info->wd.mnt = NULL;
@@ -138,7 +139,7 @@ static void umd_cleanup(struct subprocess_info *info)
 {
struct umd_info *umd_info = info->data;
 
-   /* cleanup if umh_setup() was successful but exec failed */
+   /* cleanup if umd_setup() was successful but exec failed */
if (info->retval) {
fput(umd_info->pipe_to_umh);
fput(umd_info->pipe_from_umh);
@@ -163,7 +164,10 @@ int fork_usermode_driver(struct umd_info *info)
const char *argv[] = { info->driver_name, NULL };
int err;
 
-   if (WARN_ON_ONCE(info->tgid))
+   if (!info || !info->driver_name || !info->wd.dentry || !info->wd.mnt)
+   return -EINVAL;
+   BUG_ON(info->wd.mnt->mnt_root != info->wd.dentry);
+   if (info->tgid)
return -EBUSY;
 
err = -ENOMEM;
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
index 91474884ddb7..9dd70aacb81a 100644
--- a/net/bpfilter/bpfilter_kern.c
+++ b/net/bpfilter/bpfilter_kern.c
@@ -19,8 +19,13 @@ static void shutdown_umh(void)
struct pid *tgid = info->tgid;
 
if (tgid) {
-   kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
-   wait_event(tgid->wait_pidfd, !pid_task(tgid, P

Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-29 Thread Eric W. Biederman
Alexei Starovoitov  writes:

> On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
>> 
>> I have tested thes changes by booting with the code compiled in and
>> by killing "bpfilter_umh" and running iptables -vnL to restart
>> the userspace driver.
>> 
>> I have compiled tested each change with and without CONFIG_BPFILTER
>> enabled.
>
> With
> CONFIG_BPFILTER=y
> CONFIG_BPFILTER_UMH=m
> it doesn't build:
>
> ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!
>
> I've added:
> +EXPORT_SYMBOL(kill_pid_info);
> to continue testing...
>
> And then did:
> while true; do iptables -L;rmmod bpfilter; done
>  
> Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().
>
> I suspect patch 13 is somehow responsible:
> + if (tgid) {
> + kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
> + wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
> + bpfilter_umh_cleanup(info);
> + }
>
> I cannot figure out why it hangs. Some sort of race ?
> Since adding short delay between kill and wait makes it work.

Thanks.  I will take a look.

Eric


Re: [PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-29 Thread Alexei Starovoitov
On Mon, Jun 29, 2020 at 02:55:05PM -0500, Eric W. Biederman wrote:
> 
> I have tested thes changes by booting with the code compiled in and
> by killing "bpfilter_umh" and running iptables -vnL to restart
> the userspace driver.
> 
> I have compiled tested each change with and without CONFIG_BPFILTER
> enabled.

With
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=m
it doesn't build:

ERROR: modpost: "kill_pid_info" [net/bpfilter/bpfilter.ko] undefined!

I've added:
+EXPORT_SYMBOL(kill_pid_info);
to continue testing...

And then did:
while true; do iptables -L;rmmod bpfilter; done
 
Unfortunately sometimes 'rmmod bpfilter' hangs in wait_event().

I suspect patch 13 is somehow responsible:
+   if (tgid) {
+   kill_pid_info(SIGKILL, SEND_SIG_PRIV, tgid);
+   wait_event(tgid->wait_pidfd, !pid_task(tgid, PIDTYPE_TGID));
+   bpfilter_umh_cleanup(info);
+   }

I cannot figure out why it hangs. Some sort of race ?
Since adding short delay between kill and wait makes it work.


[PATCH v2 00/15] Make the user mode driver code a better citizen

2020-06-29 Thread Eric W. Biederman


This is the second round of my changeset to split the user mode driver
code from the user mode helper code, and to make the code use common
facilities to get things done instead of recreating them just
for the user mode driver code.

I have split the changes into small enough pieces so they should be
easily readable and testable.

The changes lean into the preexisting interfaces in the kernel and
remove special cases for user mode driver code in favor of solutions
that don't need special cases.  This results in smaller code with fewer
bugs.

At a practical level this removes the maintenance burden of the user
mode drivers from the user mode helper code and from exec as the special
cases are removed.

Similarly the LSM interaction bugs are fixed by not having unnecessary
special cases for user mode drivers.

I have tested thes changes by booting with the code compiled in and
by killing "bpfilter_umh" and running iptables -vnL to restart
the userspace driver.

I have compiled tested each change with and without CONFIG_BPFILTER
enabled.

I made a few very small changes from v1 to v2:
- Updated the function name in a comment when the function is renamed
- Moved some more code so that the the !CONFIG_BPFILTER case continues
  to compile when I moved the code into umd.c
- A fix for the module loading case to really flush the file descriptor.
- Removed split_argv entirely from fork_usermode_driver.
  There was nothing to split so it was just confusing.

Please let me know if you see any bugs.  Once the code review is
finished I plan to place the code in a non-rebasing branch
so I can pull it into my tree and so it can also be pulled into
the bpf-next tree.

Eric W. Biederman (15):
  umh: Capture the pid in umh_pipe_setup
  umh: Move setting PF_UMH into umh_pipe_setup
  umh: Rename the user mode driver helpers for clarity
  umh: Remove call_usermodehelper_setup_file.
  umh: Separate the user mode driver and the user mode helper support
  umd: For clarity rename umh_info umd_info
  umd: Rename umd_info.cmdline umd_info.driver_name
  umd: Transform fork_usermode_blob into fork_usermode_driver
  umh: Stop calling do_execve_file
  exec: Remove do_execve_file
  bpfilter: Move bpfilter_umh back into init data
  umd: Track user space drivers with struct pid
  bpfilter: Take advantage of the facilities of struct pid
  umd: Remove exit_umh
  umd: Stop using split_argv

 fs/exec.c|  38 ++--
 include/linux/binfmts.h  |   1 -
 include/linux/bpfilter.h |   7 +-
 include/linux/sched.h|   9 --
 include/linux/umd.h  |  18 
 include/linux/umh.h  |  15 
 kernel/Makefile  |   1 +
 kernel/exit.c|   1 -
 kernel/umd.c | 182 +++
 kernel/umh.c | 171 +---
 net/bpfilter/bpfilter_kern.c |  38 
 net/bpfilter/bpfilter_umh_blob.S |   2 +-
 net/ipv4/bpfilter/sockopt.c  |  20 +++--
 13 files changed, 248 insertions(+), 255 deletions(-)

v1: https://lkml.kernel.org/r/87pn9mgfc2.fsf...@x220.int.ebiederm.org
---
git range-diff master v1 v2

 1:  2b76f9b3158d !  1:  d8fb851fa3d8 umh: Capture the pid in umh_pipe_setup
@@ Commit message
 code that is specific to user mode drivers from the common user path of
 user mode helpers.
 
+Link: https://lkml.kernel.org/r/87h7uygf9i.fsf...@x220.int.ebiederm.org
+Reviewed-by: Greg Kroah-Hartman 
 Signed-off-by: "Eric W. Biederman" 
 
  ## include/linux/umh.h ##
 2:  d853e933ae32 !  2:  b191c5df43ec umh: Move setting PF_UMH into 
umh_pipe_setup
@@ Commit message
 Setting PF_UMH unconditionally is harmless as an action will only
 happen if it is paired with an entry on umh_list.
 
+Link: https://lkml.kernel.org/r/87bll6gf8t.fsf...@x220.int.ebiederm.org
+Reviewed-by: Greg Kroah-Hartman 
 Signed-off-by: "Eric W. Biederman" 
 
  ## kernel/umh.c ##
 3:  92d2550f0d6a !  3:  74e8c0bf3076 umh: Rename the user mode driver helpers 
for clarity
@@ Commit message
 don't make much sense.  Instead name them  umd_setup and umd_cleanup
 for the functional role in setting up user mode drivers.
 
+Link: https://lkml.kernel.org/r/875zbegf82.fsf...@x220.int.ebiederm.org
+Reviewed-by: Greg Kroah-Hartman 
 Signed-off-by: "Eric W. Biederman" 
 
  ## kernel/umh.c ##
@@ kernel/umh.c: static int umh_pipe_setup(struct subprocess_info *info, 
struct cre
  {
struct umh_info *umh_info = info->data;
  
+-  /* cleanup if umh_pipe_setup() was successful but exec failed */
++  /* cleanup if umh_setup() was successful but exec failed */
+   if (info->retval) {
+   fput(umh_info->pipe_to_umh);
+   fput(umh_info->pipe_from_um