Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-08-13 Thread michael . christie
On 8/13/23 2:01 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
>> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
>>> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote:
 On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
>>
>> This patch has us use the vhost_task helpers which will inherit its
>> values/checks from the thread that owns the device similar to if we did
>> a clone in userspace. The vhost threads will now be counted in the nproc
>> rlimits. And we get features like cgroups and mm sharing automatically,
>> so we can remove those calls.
>>
>> Signed-off-by: Mike Christie 
>> Acked-by: Michael S. Tsirkin 
>
>
> Hi Mike,
> So this seems to have caused a measurable regression in networking
> performance (about 30%). Take a look here, and there's a zip file
> with detailed measurements attached:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=603
>
>
> Could you take a look please?
> You can also ask reporter questions there assuming you
> have or can create a (free) account.
>

 Sorry for the late reply. I just got home from vacation.

 The account creation link seems to be down. I keep getting a
 "unable to establish SMTP connection to bz-exim-prod port 25 " error.

 Can you give me Quan's email?

 I think I can replicate the problem. I just need some extra info from Quan:

 1. Just double check that they are using RHEL 9 on the host running the VMs.
 2. The kernel config
 3. Any tuning that was done. Is tuned running in guest and/or host running the VMs and what profile is being used in each.
 4. Number of vCPUs and virtqueues being used.
 5. Can they dump the contents of:

 /sys/kernel/debug/sched

 and

 sysctl  -a

 on the host running the VMs.

 6. With the 6.4 kernel, can they also run a quick test and tell me if they set
 the scheduler to batch:

 ps -T -o comm,pid,tid $QEMU_THREAD

 then for each vhost thread do:

 chrt -b -p 0 $VHOST_THREAD

 Does that end up increasing perf? When I do this I see throughput go up by
 around 50% vs 6.3 when the session count was 16 or more (16 was the number of vCPUs
 and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
 It's just a difference I noticed when running some other tests.
>>>
>>>
>>> Mike I'm unsure what to do at this point. Regressions are not nice
>>> but if the kernel is released with the new userspace api we won't
>>> be able to revert. So what's the plan?
>>>
>>
>> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
>> 6.4 perform the same for me. I've tried your setup and settings and with different
>> combos of using things like tuned and irqbalance.
>>
>> I can sort of force the issue. In 6.4, the vhost thread inherits its settings
>> from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
>> would then reset the sched settings. So in 6.4 if I just tune the parent differently
>> I can cause different performance. If we want the 6.3 behavior we can do the patch
>> below.
>>
>> However, I don't think you guys are hitting this because you are just running
>> qemu from the normal shell and were not doing anything fancy with the sched
>> settings.
>>
>>
>> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
>> index da35e5b7f047..f2c2638d1106 100644
>> --- a/kernel/vhost_task.c
>> +++ b/kernel/vhost_task.c
>> @@ -2,6 +2,7 @@
>>  /*
>>   * Copyright (C) 2021 Oracle Corporation
>>   */
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -22,9 +23,16 @@ struct vhost_task {
>>  
>>  static int vhost_task_fn(void *data)
>>  {
>> +static const struct sched_param param = { .sched_priority = 0 };
>>  struct vhost_task *vtsk = data;
>>  bool dead = false;
>>  
>> +/*
>> + * Don't inherit the parent's sched info, so we maintain compat from
>> + * when we used kthreads and it reset this info.
>> + */
>> +sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>> +
>>  for (;;) {
>>  bool did_work;
>>  
>>
>>
> 
> yes seems unlikely, still, attach this to bugzilla so it can be
> tested?
> 
> and, what will help you debug? any traces to enable?

I added the patch and asked for a perf trace.

> 
> Also wasn't there another issue with a non standard 

Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-07-22 Thread michael . christie
On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>> For vhost workers we use the kthread API which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting we can end up creating more
>> threads than the user wanted.
>>
>> This patch has us use the vhost_task helpers which will inherit its
>> values/checks from the thread that owns the device similar to if we did
>> a clone in userspace. The vhost threads will now be counted in the nproc
>> rlimits. And we get features like cgroups and mm sharing automatically,
>> so we can remove those calls.
>>
>> Signed-off-by: Mike Christie 
>> Acked-by: Michael S. Tsirkin 
> 
> 
> Hi Mike,
> So this seems to have caused a measurable regression in networking
> performance (about 30%). Take a look here, and there's a zip file
> with detailed measurements attached:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=603
> 
> 
> Could you take a look please?
> You can also ask reporter questions there assuming you
> have or can create a (free) account.
> 

Sorry for the late reply. I just got home from vacation.

The account creation link seems to be down. I keep getting a
"unable to establish SMTP connection to bz-exim-prod port 25 " error.

Can you give me Quan's email?

I think I can replicate the problem. I just need some extra info from Quan:

1. Just double check that they are using RHEL 9 on the host running the VMs.
2. The kernel config
3. Any tuning that was done. Is tuned running in guest and/or host running the
VMs and what profile is being used in each.
4. Number of vCPUs and virtqueues being used.
5. Can they dump the contents of:

/sys/kernel/debug/sched

and

sysctl  -a

on the host running the VMs.

6. With the 6.4 kernel, can they also run a quick test and tell me if they set
the scheduler to batch:

ps -T -o comm,pid,tid $QEMU_THREAD

then for each vhost thread do:

chrt -b -p 0 $VHOST_THREAD

Does that end up increasing perf? When I do this I see throughput go up by
around 50% vs 6.3 when the session count was 16 or more (16 was the number of vCPUs
and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
It's just a difference I noticed when running some other tests.
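
For reference, a minimal sketch of what the chrt -b step above amounts to, as a
tiny C program (this is just the userspace sched_setscheduler(2) call with
SCHED_BATCH; pass the vhost thread's tid as argv[1]):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Equivalent of "chrt -b -p 0 $VHOST_THREAD". */
int main(int argc, char **argv)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (argc != 2)
		return 1;
	if (sched_setscheduler(atoi(argv[1]), SCHED_BATCH, &sp))
		perror("sched_setscheduler");
	return 0;
}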


Re: [PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows

2023-07-10 Thread michael . christie
On 7/10/23 12:52 AM, Michael S. Tsirkin wrote:
> On Sun, Jul 09, 2023 at 03:28:57PM -0500, Mike Christie wrote:
>> The following patches were made over Linus's tree and fix an issue
>> where windows guests will send iovecs with offset/lengths that result
>> in IOs that are not aligned to 512. The LIO layer will then send them
>> to Linux's FS/block layer but it requires 512 byte alignment, so
>> depending on the FS/block driver being used we will get IO errors or
>> hung IO.
>>
>> The following patches have vhost-scsi detect when windows sends these
>> IOs and copy them to a bounce buffer. It then does some cleanup in
>> the related code.
> 
> 
> Thanks, tagged!
> Mike, you are the main developer on vhost/scsi recently.
> Do you want to be listed in MAINTAINERS too?
> This implies you will be expected to review patches/bug reports
> though.

That sounds good. Thanks.



Re: Can vhost translate to io_uring?

2023-06-14 Thread michael . christie
On 6/14/23 1:02 AM, Eric W. Biederman wrote:
> 
> I am sad my idea for simplifying things did not work out.
> 
> 
> Let's try an even bigger idea to reduce maintenance and simplify things.
> 
> Could vhost depend on io_uring?
> 
> Could vhost just be a translation layer of existing vhost requests to
> io_uring requests?
> 
> At a quick glance it looks like io_uring already supports the
> functionality that vhost supports (which I think is networking and
> scsi).
> 
> If vhost could become a translation layer that would allow removing
> the vhost worker and PF_USER_WORKER could be removed completely,
> leaving only PF_IO_WORKER.
> 
> 
> I suggest this because a significant vhost change is needed because in

It would be nice if the vhost layer could use the io-wq code as a sort of
generic worker. I can look into what that would take if Jens is ok
with that type of thing.

For vhost, I just submitted a patch to the vhost layer that allows us to
switch out the vhost worker thread when IO is running:

https://lists.linuxfoundation.org/pipermail/virtualization/2023-June/067246.html

After that patch, I just need to modify vhost_worker/vhost_task_fn so
when get_signal returns true we set the worker to NULL and do a synchronize_rcu.
Then I just need to modify vhost-scsi so it detects when the worker is NULL
and modify the flush code so it handles work that didn't get to run.
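
Roughly, the exit side of that would look something like the sketch below. This is
only an illustration of the plan above: it assumes dev->worker becomes an RCU-managed
pointer, and the per-driver stop callback (dev->stop here) is a made-up name.

/* Sketch: runs in the worker function once get_signal() has returned
 * true (SIGKILL or parent exit).
 */
static void vhost_worker_killed(struct vhost_dev *dev)
{
	/* Stop vhost_work_queue() callers from seeing the worker. */
	RCU_INIT_POINTER(dev->worker, NULL);
	synchronize_rcu();

	/*
	 * Hypothetical driver callback: vhost-scsi/net would fail or
	 * complete their outstanding requests here instead of queueing
	 * more work to the exiting vhost_task.
	 */
	if (dev->stop)
		dev->stop(dev);
}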



> the long term the hacks in exec and coredump are not a good idea.  Which
> means something like my failed "[PATCH v3] fork, vhost: Use CLONE_THREAD
> to fix freezer/ps regression".
> 
> If we have to go to all of the trouble of reworking things it why can't
> we just make io_uring do all of the work?
> 
> Eric



Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-03 Thread michael . christie
On 6/2/23 11:15 PM, Eric W. Biederman wrote:
> 
> This fixes the ordering issue in vhost_task_fn so that get_signal
> should not work.
> 
> This patch is a gamble that during process exit or de_thread in exec
> work will not be commonly queued from other threads.
> 
> If this gamble turns out to be false the existing WARN_ON in
> vhost_worker_free will fire.
> 
> Can folks test this and let us know if the WARN_ON fires?

I don't hit the WARN_ONs but probably not for the reason you are thinking
of. We are hung like:

Jun 03 22:25:23 ol4 kernel: Call Trace:
Jun 03 22:25:23 ol4 kernel:  
Jun 03 22:25:23 ol4 kernel:  __schedule+0x334/0xac0
Jun 03 22:25:23 ol4 kernel:  ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel:  schedule+0x5a/0xd0
Jun 03 22:25:23 ol4 kernel:  schedule_timeout+0x240/0x2a0
Jun 03 22:25:23 ol4 kernel:  ? __wake_up_klogd.part.0+0x3c/0x60
Jun 03 22:25:23 ol4 kernel:  ? vprintk_emit+0x104/0x270
Jun 03 22:25:23 ol4 kernel:  ? wait_for_completion+0x86/0x150
Jun 03 22:25:23 ol4 kernel:  wait_for_completion+0xb0/0x150
Jun 03 22:25:23 ol4 kernel:  vhost_scsi_flush+0xc2/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel:  vhost_scsi_clear_endpoint+0x16f/0x240 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel:  vhost_scsi_release+0x7d/0xf0 [vhost_scsi]
Jun 03 22:25:23 ol4 kernel:  __fput+0xa2/0x270
Jun 03 22:25:23 ol4 kernel:  task_work_run+0x56/0xa0
Jun 03 22:25:23 ol4 kernel:  do_exit+0x337/0xb40
Jun 03 22:25:23 ol4 kernel:  ? __remove_hrtimer+0x39/0x70
Jun 03 22:25:23 ol4 kernel:  do_group_exit+0x30/0x90
Jun 03 22:25:23 ol4 kernel:  get_signal+0x9cd/0x9f0
Jun 03 22:25:23 ol4 kernel:  ? kvm_arch_vcpu_put+0x12b/0x170 [kvm]
Jun 03 22:25:23 ol4 kernel:  ? vcpu_put+0x1e/0x50 [kvm]
Jun 03 22:25:23 ol4 kernel:  ? kvm_arch_vcpu_ioctl_run+0x193/0x4e0 [kvm]
Jun 03 22:25:23 ol4 kernel:  arch_do_signal_or_restart+0x2a/0x260
Jun 03 22:25:23 ol4 kernel:  exit_to_user_mode_prepare+0xdd/0x120
Jun 03 22:25:23 ol4 kernel:  syscall_exit_to_user_mode+0x1d/0x40
Jun 03 22:25:23 ol4 kernel:  do_syscall_64+0x48/0x90
Jun 03 22:25:23 ol4 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jun 03 22:25:23 ol4 kernel: RIP: 0033:0x7f2d004df50b


The problem is that as part of the flush the drivers/vhost/scsi.c code
will wait for outstanding commands, because we can't free the device and
its resources before the commands complete or we will hit the accessing
freed memory bug.

We got hung because the patch had us now do:

vhost_dev_flush() -> vhost_task_flush() 

and that saw VHOST_TASK_FLAGS_STOP was set and the exited completion has
completed. However, the scsi code is still waiting on commands in vhost_scsi_flush.
The cmds wanted to use the vhost_task's task to complete and couldn't since the task
has exited.

To handle those types of issues above, it's a lot more code. We would add
some rcu in vhost_work_queue to handle the worker being freed from under us.
Then add a callback similar to what I did on one of the past patchsets that
stops the drivers. Then modify scsi, so in the callback it also sets some
bits so the completion paths just do a fast failing that doesn't try to
queue the completion to the vhost_task.

If we want to go that route, I can get it done in more like a 6.6 time frame.
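
As a very rough illustration of the "fast failing" part, the completion path could
check a flag that the stop callback sets. The flag, where it lives, and the local
release helper below are all invented names for the sketch; the final two lines just
mirror the driver's existing completion queueing:

/* Sketch only. */
static void vhost_scsi_complete_cmd(struct vhost_scsi *vs,
				    struct vhost_scsi_cmd *cmd)
{
	if (test_bit(VHOST_SCSI_WORKER_GONE, &vs->flags)) {	/* invented flag */
		vhost_scsi_release_cmd_locally(cmd);		/* invented helper */
		return;
	}

	llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
	vhost_work_queue(&vs->dev, &vs->vs_completion_work);
}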



Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-30 Thread michael . christie
On 5/30/23 11:17 AM, Stefano Garzarella wrote:
> On Tue, May 30, 2023 at 11:09:09AM -0500, Mike Christie wrote:
>> On 5/30/23 11:00 AM, Stefano Garzarella wrote:
>>> I think it is partially related to commit 6e890c5d5021 ("vhost: use
>>> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move
>>> worker thread fields to new struct"). Maybe that commits just
>>> highlighted the issue and it was already existing.
>>
>> See my mail about the crash. Agree with your analysis about worker->vtsk
>> not being set yet. It's a bug from my commit where I should have not set
>> it so early or I should be checking for
>>
>> if (dev->worker && worker->vtsk)
>>
>> instead of
>>
>> if (dev->worker)
> 
> Yes, though, in my opinion the problem may persist depending on how the
> instructions are reordered.

Ah ok.

> 
> Should we protect dev->worker() with an RCU to be safe?

For the multiple worker patchsets, Jason had asked me about supporting the case
where we don't have a worker while we are swapping workers around. To do
that I had added RCU around dev->worker. I removed it in later patchsets
because I didn't think anyone would use it.

RCU would work for your case and for what Jason had requested.
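
For the record, a rough sketch of what that RCU protection could look like on the
queueing side. The RCU usage is the point here; treat the exact field and wake
details as illustrative rather than copied from the tree:

void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
{
	struct vhost_worker *worker;

	rcu_read_lock();
	worker = rcu_dereference(dev->worker);
	if (worker && !test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
		llist_add(&work->node, &worker->work_list);
		wake_up_process(worker->vtsk->task);	/* wake the vhost_task */
	}
	rcu_read_unlock();
}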


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 2:35 PM, Mike Christie wrote:
>> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap itself,
> Oh wait, are you saying that when we get auto-reaped then we would do the last
> fput and call the file_operations->release function right? We actually set
> task_struct->files = NULL for the vhost_task task_struct, so I think we call
> release a little sooner than you think.
> 
> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struct
> that gets created works like kthreads where it doesn't do a CLONE_FILES and it
> doesn't do a dup_fd.
> 
> So when we do de_thread() -> zap_other_threads(), that will kill all the threads
> in the group right? So when they exit, it will call our release function since
> we don't have refcount on ourself.
> 

Just to make sure I'm on the same page now.

In the past thread, when we were discussing the patch below, you guys were saying
that it doesn't really ignore SIGKILL because we will hit the
SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit. That was
on purpose.

Instead of a signal to tell me when to exit, I was using the parent exiting and doing
the last fput on the vhost device's file, which calls our release function. That then
allowed the vhost code to use the vhost_task to handle the outstanding IOs and then
do vhost_task_should_stop to tell the vhost_task to exit when the outstanding IO
had completed.
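
A stripped-down sketch of that worker loop, to show where vhost_task_should_stop
fits in (the run helper at the bottom is illustrative, not an existing function):

static int vhost_worker(void *data)
{
	struct vhost_worker *worker = data;
	struct llist_node *node;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		/* Only exit once release() has run and IO is drained. */
		if (vhost_task_should_stop(worker->vtsk)) {
			__set_current_state(TASK_RUNNING);
			break;
		}

		node = llist_del_all(&worker->work_list);
		if (!node) {
			schedule();
			continue;
		}

		__set_current_state(TASK_RUNNING);
		vhost_run_work_list(worker, node);	/* illustrative helper */
	}
	return 0;
}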

On the vhost side of things it's really nice, because all the shutdown paths work
the same and we don't need rcu/locking in the main IO path to handle the vhost_task
exiting while we are using it.


diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
-   u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
-   if (args->user_worker)
+   if (args->user_worker) {
+   /*
+* User worker are similar to io_threads but they do not
+* support signals and cleanup is driven via another kernel
+* interface so even SIGKILL is blocked.
+*/
p->flags |= PF_USER_WORKER;
+   siginitsetinv(&p->blocked, 0);
+   }
if (args->io_thread) {
/*
 * Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
-   if (args->ignore_signals)
-   ignore_signals(p);
+   if (args->user_worker)
+   p->flags |= PF_NOFREEZE;
 
stackleak_task_init(p);
 
@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn = fn,
.fn_arg = arg,
.io_thread  = 1,
-   .user_worker= 1,
};
 
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
 }
 
+static void try_set_pending_sigkill(struct task_struct *t)
+{
+   /*
+* User workers don't support signals and their exit is driven through
+* their kernel layer, so by default block even SIGKILL.
+*/
+   if (sigismember(&t->blocked, SIGKILL))
+   return;
+
+   sigaddset(&t->pending.signal, SIGKILL);
+   signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
t = p;
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-   sigaddset(&t->pending.signal, SIGKILL);
-   signal_wake_up(t, 1);
+   try_set_pending_sigkill(t);
} while_each_thread(p, t);
return;
}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't bother with already dead threads */
if (t->exit_state)

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().

Just an FYI, I coded this. It's pre-RFC quality. It's invasive. If we want to go
this route then we have to do a temp hack for 6.4 or revert, then do this route
for 6.5 or 6.6 (vhost devs will need time to review and we need time to fully test).


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> On 05/27, Eric W. Biederman wrote:
>>
>> Looking forward I don't see not asking the worker threads to stop
>> for the coredump right now causing any problems in the future.
>> So I think we can use this to resolve the coredump issue I spotted.
> 
> But we have almost the same problem with exec.
> 
> Execing thread will wait for vhost_worker() while vhost_worker will wait for
> .release -> vhost_task_stop().

For this type of case, what is the goal or correct behavior in the end?

When get_signal returns true we can code things like you mention below and
clean up the task_struct. However, we now have a non-functioning vhost device
open and just sitting around taking up memory and it can't do any IO.

For this type of case, do we expect just not to crash/hang, or was this new
exec'd thread supposed to be able to use the vhost device?

I would normally say it probably wants to use the vhost device still. However,
I don't think this comes up, so just not hanging might be ok. Before 6.4-rc1,
we ignored signals, so it would have worked, in case we are concerned about a
possible regression and whether this was a common thing.



> 
> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
> 
> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

You mean the vhost_task's task/thread doing a function that does a copy_process
right? That type of thing is not needed. I can add a check in vhost_task_create
for this so new code doesn't try to do it. I don't think it will come up that some
code vhost is using will call kernel_thread/copy_process directly since those
calls are so rare and the functions are not exported to modules.


> 
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().
> 
> Oleg.
> 



[PATCH v7 00/14] vhost: multiple worker support

2023-04-28 Thread michael . christie
The following patches were built over Linus's tree. They allow us to
support multiple vhost worker tasks per device. The design is a modified
version of Stefan's original idea where userspace has the kernel create a
worker and we pass back the pid. In this version instead of passing the
pid between user/kernel space we use a worker_id which is just an integer
managed by the vhost driver and we allow userspace to create and free
workers and then attach them to virtqueues at setup time.

All review comments from the past reviews should be handled. If I didn't
reply to a review comment, I agreed with the comment and should have
handled it in this posting. Let me know if I missed one.

Results:


fio jobs        1       2       4       8       12      16
----------------------------------------------------------
1 worker        160k    488k    -       -       -       -
worker per vq   160k    310k    620k    1300k   1836k   2326k


Notes:
0. This used a simple fio command:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=$JOBS_ABOVE

and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with LIO's emulate_pr=0 which drops the
LIO PR lock use. This was a bottleneck at around 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth
was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds,
and 16 used 64.

3. The perf issue above at 2 jobs is because when we only have 1 worker
we execute more cmds per vhost_work due to all vqs funneling to one worker.
I have 2 patches that fix this. One just submits from the worker thread
instead of kicking off to a workqueue like how the vhost block patches do.
I'll post this after the worker support is merged. I'm also working on
patches to add back batching during lio completion and do polling on the
submission side.

We will still want the threading patches, because if we batch at the fio
level plus use the vhost threading patches, we can see a big boost like
below. So hopefully doing it at the kernel will allow apps to just work
without having to be smart like fio.

fio using io_uring and batching with the iodepth_batch* settings:

fio jobs        1       2       4       8       12      16
----------------------------------------------------------
1 worker        494k    520k    -       -       -       -
worker per vq   496k    878k    1542k   2436k   2304k   2590k
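
For anyone reproducing the batching numbers, a hedged example of the kind of fio
invocation meant above (the batch values are illustrative, not the exact ones used
for the table):

fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \
    --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 \
    --iodepth_batch_complete_max=32 --numjobs=$JOBS_ABOVE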

V7:
- Add more comments about assumptions.
- Drop refcounting and just use an attachment_cnt variable, so there
is no confusion about when a worker is freed.
- Do an opt-in model, because vsock has an issue where it can queue works
before it's running and it doesn't even need multiple workers, so there
is no point in changing the driver or core code.
- Add back get worker ioctl.
- Broke up last patches to make it easier to read.

V6:
- Rebase against vhost_task patchset.
- Used xa instead of idr.
V5:
- Rebase against user_worker patchset.
- Rebase against flush patchset.
- Redo vhost-scsi tmf flush handling so it doesn't access vq->worker.
V4:
- fix vhost-sock VSOCK_VQ_RX use.
- name functions called directly by ioctl cmd's to match the ioctl cmd.
- break up VHOST_SET_VRING_WORKER into a new, free and attach cmd.
- document worker lifetime, and cgroup, namespace, mm, rlimit
inheritance, make it clear we currently only support sharing within the
device.
- add support to attach workers while IO is running.
- instead of passing a pid_t of the kernel thread, pass an int allocated
by the vhost layer with an idr.

V3:
- fully convert vhost code to use vq based APIs instead of leaving it
half per dev and half per vq.
- rebase against kernel worker API.
- Drop delayed worker creation. We always create the default worker at
VHOST_SET_OWNER time. Userspace can create and bind workers after that.

V2:
- change loop that we take a refcount to the worker in
- replaced pid == -1 with define.
- fixed tabbing/spacing coding style issue
- use hash instead of list to lookup workers.
- I dropped the patch that added an ioctl cmd to get a vq's worker's
pid. I saw we might do a generic netlink interface instead.


Re: [PATCH v2 0/7] vhost-scsi: Fix crashes and management op hangs

2023-03-20 Thread michael . christie
On 3/20/23 9:06 PM, Mike Christie wrote:
> The following patches were made over Linus tree.

Hi Michael, I see you merged my first version of the patchset in your
vhost branch.

Do you want me to just send a followup patchset?

The major diff between the 2 versions:

1. I added the first 2 patches which fix some bugs in the existing code
I found while doing some code review and testing another LIO patchset
plus v1.

Note: The other day I posted that I thought the 3rd patch in v1 caused
the bugs but they were already in the code.

2. In v2 I made one of the patches not need the vhost device lock when
unmapping/mapping LUNs, so you can add new LUNs even if one LUN on the same
vhost_scsi device was hung.

Since it's not regressions with the existing patches, I can just send those
as a followup patchset if that's preferred.


Re: [PATCH 03/11] kthread: Pass in the thread's name during creation

2023-03-11 Thread michael . christie
On 3/11/23 2:53 AM, Christian Brauner wrote:
> On Fri, Mar 10, 2023 at 04:03:24PM -0600, Mike Christie wrote:
>> This has us pass in the thread's name during creation in kernel_thread.
>>
>> Signed-off-by: Mike Christie 
>> ---
>>  kernel/kthread.c | 35 ++-
>>  1 file changed, 14 insertions(+), 21 deletions(-)
>>
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 63574cee925e..831a55b406d8 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -38,6 +38,7 @@ struct task_struct *kthreadd_task;
>>  struct kthread_create_info
>>  {
>>  /* Information passed to kthread() from kthreadd. */
>> +char *full_name;
>>  int (*threadfn)(void *data);
>>  void *data;
>>  int node;
>> @@ -343,10 +344,15 @@ static int kthread(void *_create)
>>  /* Release the structure when caller killed by a fatal signal. */
>>  done = xchg(&create->done, NULL);
>>  if (!done) {
>> +kfree(create->full_name);
>>  kfree(create);
>>  kthread_exit(-EINTR);
>>  }
>>  
>> +if (strlen(create->full_name) >= TASK_COMM_LEN)
>> +self->full_name = create->full_name;
>> +else
>> +kfree(create->full_name);
> 
> This is minor but fwiw, this looks suspicious when reading it without
> more context. It took me a while to see that kthread->full_name is
> intended to store the untruncated name only if truncation actually needs
> to happen. So either we should always initialize this or we should add a
> comment. You can just send a tiny patch that I can fold into this one so
> you don't have to resend the whole series...
Ok. Thanks. I think always initializing it would make the code a lot easier
to understand and be nicer.

I don't think it would be too much extra memory used, and we don't have
to worry about some random userspace app breaking because it wanted to
configure the thread but the name got truncated.
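
In other words, something roughly like this on top of the quoted hunk (sketch only;
free_kthread_struct() would then kfree the name unconditionally):

-	if (strlen(create->full_name) >= TASK_COMM_LEN)
-		self->full_name = create->full_name;
-	else
-		kfree(create->full_name);
+	/* Always keep the untruncated name. */
+	self->full_name = create->full_name;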


Re: [PATCH V9 0/8] Use copy_process in vhost layer

2022-06-01 Thread michael . christie
On 5/12/22 4:46 PM, Mike Christie wrote:
> The following patches were made over Eric's tree:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git
> 
> and his kthread-cleanups-for-v5.19 branch.

Hi Eric,

Did you have more review comments for the patches or are you ok with them
now? I haven't seen any comments the last couple of postings. I've
been watching your trees, so I know you are doing cleanup/improvements
in the areas I needed help with so I know it's still on your radar, but
now I'm worried I'm missing emails/reviews from you :)

Are you also going to push the kthread-cleanups for 5.19 still or is that
going to be 5.20? I have patches that depend and/or have conflicts with
this set so I'm trying to figure out what I should be rebasing and testing
against.


Re: [PATCH V6 01/10] Use copy_process in vhost layer

2021-12-17 Thread michael . christie
On 12/17/21 1:26 PM, Eric W. Biederman wrote:
> Mike Christie  writes:
> 
>> The following patches, made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://lkml.org/lkml/2021/6/23/1233
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, because I'd need to do something
>> like that patch several times.
> 
> I read through the code and I don't see why you want to make these
> almost threads of a process not actually threads of that process
> (like the io_uring threads are).
> 
> As a separate process there are many things that will continue to be
> disjoint.  If the thread changes cgroups for example your new process
> won't follow.
> 
> If you want them to be actually processes with an lifetime independent
> of the creating process I expect you want to reparent them to the local
> init process.  Just so they don't confuse the process tree.  Plus init
> processes know how to handle unexpected children.
> 
> What are the semantics you are aiming for?
> 

Hi Eric,

Right now, for vhost we need the thread being created:

1. added to the caller's cgroup.
2. to share the mm struct with the caller.
3. to be accounted for under the caller's nproc rlimit value.

For 1 and 2, we have cgroup_attach_task_all and get_task_mm
already.
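
For background, a condensed sketch of how the kthread-based worker covers items 1
and 2 today (heavily simplified; the real code queues a vhost_work to do the cgroup
attach from the worker's context and has error handling):

	/* In the VHOST_SET_OWNER path, running in the owner's context: */
	dev->mm = get_task_mm(current);			/* 2: share the owner's mm */
	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
	cgroup_attach_task_all(current, worker);	/* 1: join the owner's cgroups */
	wake_up_process(worker);

	/* And in vhost_worker() itself: */
	kthread_use_mm(dev->mm);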

This patchset started with me just trying to handle #3 by modifying kthreads
like here:

https://lkml.org/lkml/2021/6/23/1234

So we can use kthreads and the existing helpers and add:

A. a ucounts version of the above patches in the link

or

B. a helper that does something like copy_process's use of
is_ucounts_overlimit and vhost can call that.

instead of this patchset.


Before we even get to the next section below, do you consider items 1 - 3
something we need an API based on copy_process for?

Do you think I should just do A or B above, or do you have another idea? If
so can we get agreement on that from everyone?

I thought my patches in that link were a little hacky in how they passed
around the user/creds info. I thought maybe it shouldn't be passed around like
that, so switched to the copy_process based approach which did everything for
me. And I thought io_uring needed something similar to us, so I made it generic.

I don't have a preference. You and Christian are the experts, so I'll leave it
to you guys.


> I can see sense in generalizing some of the pieces of create_io_thread
> but I think generalizing create_io_thread itself is premature.  The code
> lives in kernel/fork.c because it is a very special thing that we want
> to keep our eyes on.
> 
> Some of your generalization makes it much more difficult to tell what
> your code is going to use because you remove hard codes that are there
> to simplify the analysis of the situation.
> 
> My gut says adding a new create_vhost_worker and putting that in
> kernel/fork.c is a lot safer and will allow much better code analysis.
> 
> If there a really are commonalities between creating a userspace process
> that runs completely in the kernel and creating an additional userspace
> thread we refactor the code and simplify things.
> 
> I am especially nervous about generalizing the io_uring code as it's
> signal handling just barely works, and any generalization will cause it
> to break.  So you are in the process of generalizing code that simply
> can not handle the general case.  That scares me


Re: [PATCH V6 01/10] Use copy_process in vhost layer

2021-12-08 Thread michael . christie
On 12/8/21 2:34 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 29, 2021 at 01:46:57PM -0600, Mike Christie wrote:
>> The following patches, made over Linus's tree, allow the vhost layer to do
>> a copy_process on the thread that does the VHOST_SET_OWNER ioctl like how
>> io_uring does a copy_process against its userspace app. This allows the
>> vhost layer's worker threads to inherit cgroups, namespaces, address
>> space, etc and this worker thread will also be accounted for against that
>> owner/parent process's RLIMIT_NPROC limit.
>>
>> If you are not familiar with qemu and vhost here is more detailed
>> problem description:
>>
>> Qemu will create vhost devices in the kernel which perform network, SCSI,
>> etc IO and management operations from worker threads created by the
>> kthread API. Because the kthread API does a copy_process on the kthreadd
>> thread, the vhost layer has to use kthread_use_mm to access the Qemu
>> thread's memory and cgroup_attach_task_all to add itself to the Qemu
>> thread's cgroups.
>>
>> The problem with this approach is that we then have to add new functions/
>> args/functionality for every thing we want to inherit. I started doing
>> that here:
>>
>> https://lkml.org/lkml/2021/6/23/1233
>>
>> for the RLIMIT_NPROC check, but it seems it might be easier to just
>> inherit everything from the beginning, because I'd need to do something
>> like that patch several times.
> 
> 
> So who's merging this? Me? Did all patches get acked by appropriate
> maintainers?
> 

Not yet.

Jens, The last open review comment was from you for the name change
and additional patch description info.

In this posting, I changed the name from:

kernel_worker/kernel_worker_start

to

user_worker_create/user_worker_start

I didn't do the start/create_user_worker format originally discussed
because a while back Christoph had given me a review comment about trying
to tie everything together into an API. Everything having the user_worker
prefix felt nicer in that it was easy to tell the functions and flags went
together, and so I thought it would handle his comment too.

And patch:

[PATCH V6 07/10] io_uring: switch to user_worker

should better explain the reason for the switch.


Re: [PATCH 5/7] vhost_vsock: simplify vhost_vsock_flush()

2021-12-08 Thread michael . christie
On 12/7/21 9:53 PM, Jason Wang wrote:
> On Tue, Dec 7, 2021 at 10:45 AM Mike Christie
>  wrote:
>>
>> From: Andrey Ryabinin 
>>
>> vhost_vsock_flush() calls vhost_work_dev_flush(vsock->vqs[i].poll.dev)
>> before vhost_work_dev_flush(&vsock->dev). This seems pointless
>> as vsock->vqs[i].poll.dev is the same as &vsock->dev and several flushes
>> in a row doesn't do anything useful, one is just enough.
>>
>> Signed-off-by: Andrey Ryabinin 
>> Reviewed-by: Stefano Garzarella 
>> Signed-off-by: Mike Christie 
>> ---
>>  drivers/vhost/vsock.c | 5 -
>>  1 file changed, 5 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 2339587bcd31..1f38160b249d 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -703,11 +703,6 @@ static int vhost_vsock_dev_open(struct inode *inode, 
>> struct file *file)
>>
>>  static void vhost_vsock_flush(struct vhost_vsock *vsock)
>>  {
>> -   int i;
>> -
>> -   for (i = 0; i < ARRAY_SIZE(vsock->vqs); i++)
>> -   if (vsock->vqs[i].handle_kick)
>> -   vhost_work_dev_flush(vsock->vqs[i].poll.dev);
>> vhost_work_dev_flush(&vsock->dev);
>>  }
> 
> Is this better to be consistent with vhost-net so that we can simply
> remove the vhost_vsock_flush() wrapper here.
> 

I didn't understand that comment.

Did you mean consistent as they both have vhost_vsock/net_flush functions
or as in they prefer to not have one line wrappers?

Before and after this patchset, both net and vsock have a vhost_vsock/net_flush
function, so maybe you didn't mean that.

I think the wrapper is not very useful and could be removed. However,
I thought vsock preferred wrappers because we have vhost_vsock_free
which is just a wrapper around kfree. I also noticed test.c is a
fan of one line wrappers, but I see net and scsi do not do that.


Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

2021-11-22 Thread michael . christie
On 11/22/21 8:20 AM, Jens Axboe wrote:
> On 11/22/21 3:02 AM, Christian Brauner wrote:
>> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
>>> On 11/21/21 10:49 AM, Mike Christie wrote:
 Convert io_uring and io-wq to use kernel_worker.
>>>
>>> I don't like the kernel_worker name, that implies it's always giving you
>>> a kernel thread or kthread. That's not the io_uring use case, it's
>>> really just a thread off the original task that just happens to never
>>> exit to userspace.
>>>
>>> Can we do a better name? At least io_thread doesn't imply that.
>>
>> Yeah, I had thought about that as well and at first had kernel_uworker()
>> locally but wasn't convinced. Maybe we should just make it
>> create_user_worker()?
> 
> That's better, or maybe even create_user_inkernel_thread() or something?
> Pretty long, though... I'd be fine with create_user_worker().
> 

Ok, I'll do:

create_user_worker()
start_user_worker()

since you guys agree. It will also match the PF flag naming.

I'll also add more details to the commit message you requested.


Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-26 Thread michael . christie
On 10/26/21 12:37 AM, Jason Wang wrote:
> 
> On 10/22/2021 1:19 PM, Mike Christie wrote:
>> This patch allows userspace to create workers and bind them to vqs. You
>> can have N workers per dev and also share N workers with M vqs.
>>
>> Signed-off-by: Mike Christie 
> 
> 
> A question, who is the best one to determine the binding? Is it the VMM (Qemu 
> etc) or the management stack? If the latter, it looks to me it's better to 
> expose this via sysfs?

I thought it would be where you have management app settings, then the
management app talks to the qemu control interface like it does when it
adds new devices on the fly.

A problem with the management app doing it is handling the RLIMIT_NPROC
review comment; this patchset:

https://lore.kernel.org/all/20211007214448.6282-1-michael.chris...@oracle.com/

basically has the kernel do a clone() from the caller's context. So adding
a worker is like doing the VHOST_SET_OWNER ioctl, where it still has to be done
from a process you can inherit values from, like the mm, cgroups, and now RLIMITs.


>> diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
>> index f7f6a3a28977..af654e3cef0e 100644
>> --- a/include/uapi/linux/vhost_types.h
>> +++ b/include/uapi/linux/vhost_types.h
>> @@ -47,6 +47,18 @@ struct vhost_vring_addr {
>>   __u64 log_guest_addr;
>>   };
>>   +#define VHOST_VRING_NEW_WORKER -1
> 
> 
> Do we need VHOST_VRING_FREE_WORKER? And I wonder if using dedicated ioctls 
> are better:
> 
> VHOST_VRING_NEW/FREE_WORKER
> VHOST_VRING_ATTACH_WORKER


We didn't need a free worker, because the kernel handles it for userspace. I
tried to make it easy for userspace because in some cases it may not be able
to do syscalls like close on the device. For example if qemu crashes or for
vhost-scsi we don't do an explicit close during VM shutdown.

So we start off with the default worker thread that's used by all vqs like we do
today. Userspace can then override it by creating a new worker. That also
unbinds/detaches the existing worker and does a put on the worker's refcount. We
also do a put on the worker when we stop using it during device
shutdown/closure/release. When the worker's refcount goes to zero the kernel
deletes it.

I think separating the calls could be helpful though.
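
To illustrate the userspace side of the current proposal, here is a hedged sketch.
VHOST_SET_VRING_WORKER and VHOST_VRING_NEW_WORKER come from the patch; the use of
struct vhost_vring_state and the exact calling convention are assumptions made only
for the sketch:

	/* Ask the kernel to create a worker and bind it to vq 0. */
	struct vhost_vring_state state = {
		.index = 0,
		.num = VHOST_VRING_NEW_WORKER,
	};

	if (ioctl(vhost_fd, VHOST_SET_VRING_WORKER, &state) < 0)
		perror("VHOST_SET_VRING_WORKER");

	/* Binding another vq to the same worker would pass that worker's
	 * id in .num instead, which is where separate new/attach/get
	 * calls would make the flow clearer.
	 */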


Re: [PATCH V3 06/11] vhost-sock: convert to vq helpers

2021-10-25 Thread michael . christie
On 10/25/21 4:08 AM, Stefano Garzarella wrote:
> On Fri, Oct 22, 2021 at 12:19:06AM -0500, Mike Christie wrote:
>> Convert from vhost dev based helpers to vq ones.
>>
>> Signed-off-by: Mike Christie 
>> ---
>> drivers/vhost/vsock.c | 8 +---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 938aefbc75ec..c50c60d0955e 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -300,7 +300,7 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>> list_add_tail(&pkt->list, &vsock->send_pkt_list);
>> spin_unlock_bh(&vsock->send_pkt_list_lock);
>>
>> -    vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
>> +    vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_TX], &vsock->send_pkt_work);
> 
> I think we should use VSOCK_VQ_RX. I know, the nomenclature is weird, but 
> it's from the guest's point of view, so the host when sending packets uses 
> the VSOCK_VQ_RX, see vhost_transport_send_pkt_work().
> 
> 
>>
>> rcu_read_unlock();
>> return len;
>> @@ -612,7 +612,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
>> /* Some packets may have been queued before the device was started,
>>  * let's kick the send worker to send them.
>>  */
>> -    vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
>> +    vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_TX], &vsock->send_pkt_work);
> 
> Ditto.
> 

You're right. I'll fix. Thanks.

Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-25 Thread michael . christie
On 10/23/21 3:11 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2021 at 01:17:26PM -0500, michael.chris...@oracle.com wrote:
>> On 10/22/21 11:12 AM, michael.chris...@oracle.com wrote:
>>> On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index c998860d7bbc..e5c0669430e5 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -70,6 +70,17 @@
>  #define VHOST_VRING_BIG_ENDIAN 1
>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct 
> vhost_vring_state)
>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct 
> vhost_vring_state)
> +/* By default, a device gets one vhost_worker created during 
> VHOST_SET_OWNER
> + * that its virtqueues share. This allows userspace to create a 
> vhost_worker
> + * and map a virtqueue to it or map a virtqueue to an existing worker.
> + *
> + * If pid > 0 and it matches an existing vhost_worker thread it will be 
> bound
> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
> + * created and bound to the vq.
> + *
> + * This must be called after VHOST_SET_OWNER and before the vq is active.
> + */

 A couple of things here:
 it's probably a good idea not to make it match pid exactly,
 if for no other reason than I'm not sure we want to
 commit this being a pid. Let's just call it an id?
>>>
>>> Ok.
>>>
 And maybe byteswap it or xor with some value
 just to make sure userspace does not begin abusing it anyway.

 Also, interaction with pid namespace is unclear to me.
 Can you document what happens here?
>>>
>>> This current patchset only allows the vhost_dev owner to
>>> create/bind workers for devices it owns, so namespace don't come
>>
>> I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
>> to the same owner like I wrote. However, it looks like we could have 2
>> threads with the same mm pointer so vhost_dev_check_owner returns true,
>> but they could be in different namespaces.
>>
>> Even though we are not going to pass the pid_t between user/kernel
>> space, should I add a pid namespace check when I repost the patches?
> 
> Um it's part of the ioctl. How you are not going to pass it around?

The "not passing a pid around" was referring to your comment about
obfuscating the pid. I might have misunderstood you and thought you
wanted to do something more like you suggested below, where it's just
some opaque int as far as userspace knows.


> 
> So if we do worry about this, I would just make it a 64 bit integer,
> rename it "id" and increment each time a thread is created.
>  
Yeah, this works for me. I just used an ida to allocate the id. We can
then use its lookup functions too.


Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-22 Thread michael . christie
On 10/22/21 11:12 AM, michael.chris...@oracle.com wrote:
> On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
>>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>>> index c998860d7bbc..e5c0669430e5 100644
>>> --- a/include/uapi/linux/vhost.h
>>> +++ b/include/uapi/linux/vhost.h
>>> @@ -70,6 +70,17 @@
>>>  #define VHOST_VRING_BIG_ENDIAN 1
>>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct 
>>> vhost_vring_state)
>>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct 
>>> vhost_vring_state)
>>> +/* By default, a device gets one vhost_worker created during 
>>> VHOST_SET_OWNER
>>> + * that its virtqueues share. This allows userspace to create a 
>>> vhost_worker
>>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
>>> + *
>>> + * If pid > 0 and it matches an existing vhost_worker thread it will be 
>>> bound
>>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
>>> + * created and bound to the vq.
>>> + *
>>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
>>> + */
>>
>> A couple of things here:
>> it's probably a good idea not to make it match pid exactly,
>> if for no other reason than I'm not sure we want to
>> commit this being a pid. Let's just call it an id?
> 
> Ok.
> 
>> And maybe byteswap it or xor with some value
>> just to make sure userspace does not begin abusing it anyway.
>>
>> Also, interaction with pid namespace is unclear to me.
>> Can you document what happens here?
> 
> This current patchset only allows the vhost_dev owner to
> create/bind workers for devices it owns, so namespace don't come

I made a mistake here. The patches do restrict VHOST_SET_VRING_WORKER
to the same owner like I wrote. However, it looks like we could have 2
threads with the same mm pointer so vhost_dev_check_owner returns true,
but they could be in different namespaces.

Even though we are not going to pass the pid_t between user/kernel
space, should I add a pid namespace check when I repost the patches?



> into play. If a thread from another namespace tried to create/bind
> a worker we would hit the owner checks in vhost_dev_ioctl which is
> done before vhost_vring_ioctl normally (for vdpa we hit the use_worker
> check and fail there).
> 
> However, with the kernel worker API changes the worker threads will
> now be in the vhost dev owner's namespace and not the kthreadd/default
> one, so in the future we are covered if we want to do something more
> advanced. For example, I've seen people working on an API to export the
> worker pids:
> 
> https://lore.kernel.org/netdev/20210507154332.hiblsd6ot5wzwkdj@steredhat/T/
> 
> and in the future for interfaces that export that info we could restrict
> access to root or users from the same namespace or I guess add interfaces
> to allow different namespaces to see the workers and share them.
> 
> 
>> No need to fix funky things like moving the fd between
>> pid namespaces while also creating/destroying workers, but let's
>> document it's not supported.
> 
> Ok. I'll add a comment.
> 



Re: [PATCH V3 11/11] vhost: allow userspace to create workers

2021-10-22 Thread michael . christie
On 10/22/21 5:47 AM, Michael S. Tsirkin wrote:
>> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
>> index c998860d7bbc..e5c0669430e5 100644
>> --- a/include/uapi/linux/vhost.h
>> +++ b/include/uapi/linux/vhost.h
>> @@ -70,6 +70,17 @@
>>  #define VHOST_VRING_BIG_ENDIAN 1
>>  #define VHOST_SET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x13, struct 
>> vhost_vring_state)
>>  #define VHOST_GET_VRING_ENDIAN _IOW(VHOST_VIRTIO, 0x14, struct 
>> vhost_vring_state)
>> +/* By default, a device gets one vhost_worker created during VHOST_SET_OWNER
>> + * that its virtqueues share. This allows userspace to create a vhost_worker
>> + * and map a virtqueue to it or map a virtqueue to an existing worker.
>> + *
>> + * If pid > 0 and it matches an existing vhost_worker thread it will be 
>> bound
>> + * to the vq. If pid is VHOST_VRING_NEW_WORKER, then a new worker will be
>> + * created and bound to the vq.
>> + *
>> + * This must be called after VHOST_SET_OWNER and before the vq is active.
>> + */
> 
> A couple of things here:
> it's probably a good idea not to make it match pid exactly,
> if for no other reason than I'm not sure we want to
> commit this being a pid. Let's just call it an id?

Ok.

> And maybe byteswap it or xor with some value
> just to make sure userspace does not begin abusing it anyway.
> 
> Also, interaction with pid namespace is unclear to me.
> Can you document what happens here?

This current patchset only allows the vhost_dev owner to
create/bind workers for devices it owns, so namespaces don't come
into play. If a thread from another namespace tried to create/bind
a worker we would hit the owner checks in vhost_dev_ioctl which is
done before vhost_vring_ioctl normally (for vdpa we hit the use_worker
check and fail there).

However, with the kernel worker API changes the worker threads will
now be in the vhost dev owner's namespace and not the kthreadd/default
one, so in the future we are covered if we want to do something more
advanced. For example, I've seen people working on an API to export the
worker pids:

https://lore.kernel.org/netdev/20210507154332.hiblsd6ot5wzwkdj@steredhat/T/

and in the future for interfaces that export that info we could restrict
access to root or users from the same namespace or I guess add interfaces
to allow different namespaces to see the workers and share them.


> No need to fix funky things like moving the fd between
> pid namespaces while also creating/destroying workers, but let's
> document it's not supported.

Ok. I'll add a comment.


Re: [PATCH V3 00/11] vhost: multiple worker support

2021-10-22 Thread michael . christie
Ccing Christian for the kernel worker API merging stuff.

On 10/22/21 4:48 AM, Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2021 at 12:18:59AM -0500, Mike Christie wrote:
>> The following patches apply over linus's tree and this patchset
>>
>> https://lore.kernel.org/all/20211007214448.6282-1-michael.chris...@oracle.com/
>>
>> which allows us to check the vhost owner thread's RLIMITs:
> 
> 
> Unfortunately that patchset in turn triggers kbuild warnings.

Yeah, that's the Jens/Paul issue I mentioned. I have to remove the
old create_io_thread code and resolve issues with their trees. Paul's
tree has a conflict with Jens and then my patch has an issue with Paul's
patches.

So Christian and I thought we would re-push the patchset through
Christian after that has settled in 5.16-rc1 and then shoot for 5.17
so it has time to bake in next.


> I was hoping you would address them, I don't think
> merging that patchset before kbuild issues are addressed
> is possible.
> 
> It also doesn't have lots of acks, I'm a bit apprehensive
> of merging core changes like this through the vhost tree.

Ok. Just to make sure we are on the same page. Christian was going to
push the kernel worker API changes.

> Try to CC more widely/ping people?


Re: [PATCH V3 00/11] vhost: multiple worker support

2021-10-22 Thread michael . christie
On 10/22/21 12:18 AM, Mike Christie wrote:
> Results:
> 
> 
> fio jobs        1       2       4       8       12      16
> ----------------------------------------------------------
> 1 worker        84k     492k    510k    -       -       -

That should be 184k above.

> worker per vq   184k   380k   744k   1422k  2256k  2434k



Re: [PATCH 2/3] kernel/fork, cred.c: allow copy_process to take user

2021-07-01 Thread michael . christie
On 6/29/21 11:53 AM, Mike Christie wrote:
> On 6/29/21 8:04 AM, Christian Brauner wrote:
>> On Wed, Jun 23, 2021 at 10:08:03PM -0500, Mike Christie wrote:
>>> This allows kthread to pass copy_process the user we want to check for the
>>> RLIMIT_NPROC limit for and also charge for the new process. It will be used
>>> by vhost where userspace has that driver create threads but the kthreadd
>>> thread is checked/charged.
>>>
>>> Signed-off-by: Mike Christie 
>>> ---
>>>  include/linux/cred.h |  3 ++-
>>>  kernel/cred.c|  7 ---
>>>  kernel/fork.c| 12 +++-
>>>  3 files changed, 13 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/cred.h b/include/linux/cred.h
>>> index 14971322e1a0..9a2c1398cdd4 100644
>>> --- a/include/linux/cred.h
>>> +++ b/include/linux/cred.h
>>> @@ -153,7 +153,8 @@ struct cred {
>>>  
>>>  extern void __put_cred(struct cred *);
>>>  extern void exit_creds(struct task_struct *);
>>> -extern int copy_creds(struct task_struct *, unsigned long);
>>> +extern int copy_creds(struct task_struct *, unsigned long,
>>> + struct user_struct *);
>>>  extern const struct cred *get_task_cred(struct task_struct *);
>>>  extern struct cred *cred_alloc_blank(void);
>>>  extern struct cred *prepare_creds(void);
>>> diff --git a/kernel/cred.c b/kernel/cred.c
>>> index e1d274cd741b..e006aafa8f05 100644
>>> --- a/kernel/cred.c
>>> +++ b/kernel/cred.c
>>> @@ -330,7 +330,8 @@ struct cred *prepare_exec_creds(void)
>>>   * The new process gets the current process's subjective credentials as its
>>>   * objective and subjective credentials
>>>   */
>>> -int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>> +int copy_creds(struct task_struct *p, unsigned long clone_flags,
>>> +  struct user_struct *user)
>>>  {
>>> struct cred *new;
>>> int ret;
>>> @@ -351,7 +352,7 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>> 	kdebug("share_creds(%p{%d,%d})",
>>> 	       p->cred, atomic_read(&p->cred->usage),
>>> 	       read_cred_subscribers(p->cred));
>>> -	atomic_inc(&p->cred->user->processes);
>>> +	atomic_inc(&user->processes);
>>
>> Hey Mike,
>>
>> This won't work anymore since this has moved into ucounts. So in v5.14
>> atomic_inc(&p->cred->user->processes);
>> will have been replaced by
>> inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
>>
> Will do.
> 
>> From what I can see from your code vhost will always create this kthread
>> for current. So you could e.g. add an internal flag/bitfield entry to
>> struct kernel_clone_args that you can use to tell copy_creds() that you
>> want to charge this thread against current's process limit.
> 
> If I understood you, I don't think a flag/bit will work. When vhost does
> a kthread call we do kthread_create -> __kthread_create_on_node. This creates
> a temporary kthread_create_info struct and adds it to kthread_create_list.
> It then wakes up the kthreadd thread. kthreadd will then loop over the list
> and do:
> 
> kernel_thread -> kernel_clone -> copy_process ->  copy_creds
> 
> So copy_creds sees current == kthreadd.
> 
> I think I would have to add a task_struct pointer to kernel_clone_args
> and kthread_create_info. If copy_creds sees kernel_clone_args->user_task
> then it would use that.

One question/clarification. For 5.14, I could pass in the struct task_struct
or struct ucounts (in a previous mail I wrote user_struct).

I could also just have vhost.c do inc_rlimit_ucounts and is_ucounts_overlimit
directly.
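
For the second option, a rough sketch of what vhost.c doing it directly
could look like with the 5.14 ucounts interfaces. This is illustrative
only; the real code would need to handle the check/increment race the
same way copy_creds does, and drop the count with dec_rlimit_ucounts
when the worker exits:

static int vhost_worker_charge_nproc(void)
{
	struct ucounts *ucounts = task_ucounts(current);

	/* Refuse to create a worker if the owner is already at its limit. */
	if (is_ucounts_overlimit(ucounts, UCOUNT_RLIMIT_NPROC,
				 rlimit(RLIMIT_NPROC)))
		return -EAGAIN;

	inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_NPROC, 1);
	return 0;
}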




Re: [PATCH 7/9] vhost: allow userspace to create workers

2021-06-05 Thread michael . christie
On 6/3/21 9:30 AM, Stefan Hajnoczi wrote:
>> +	if (info->pid == VHOST_VRING_NEW_WORKER) {
>> +		worker = vhost_worker_create(dev);
> 
> The maximum number of kthreads created is limited by
> vhost_dev_init(nvqs)? For example VHOST_SCSI_MAX_VQ 128.
> 
> IIUC kthread_create is not limited by or accounted against the current
> task, so I'm a little worried that a process can create a lot of
> kthreads.
> 
> I haven't investigated other kthread_create() users reachable from
> userspace applications to see how they bound the number of threads
> effectively.

Do we want something like io_uring's use of copy_process? It's what fork
uses, so we get checks like RLIMIT_NPROC and max_threads.
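
Roughly what I have in mind is below. It's only a sketch: it assumes
create_io_thread were usable from vhost, and vhost_worker/vhost_worker_fn
are illustrative names (vhost_worker_fn would be the worker's main loop,
not shown):

static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
{
	struct vhost_worker *worker;
	struct task_struct *task;

	worker = kzalloc(sizeof(*worker), GFP_KERNEL);
	if (!worker)
		return NULL;

	/*
	 * Goes through copy_process(), so the thread is charged against
	 * the caller and hits RLIMIT_NPROC/max_threads like a fork would.
	 */
	task = create_io_thread(vhost_worker_fn, worker, NUMA_NO_NODE);
	if (IS_ERR(task)) {
		kfree(worker);
		return NULL;
	}

	worker->task = task;
	wake_up_new_task(task);
	return worker;
}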

I know I didn't look at everything, but it looks like for some software
drivers we just let the user run wild. For example, for nbd, when we
create the device we do alloc_workqueue and use the default max_active
value (256). We then don't have a limit on connections, so we could end
up with 256 workqueue threads per device. And there is no limit on the
number of devices a user can create.
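
The nbd pattern is roughly this (paraphrased from drivers/block/nbd.c;
the exact flags and placement vary by version):

	/*
	 * max_active of 0 means the default (256), and nothing bounds how
	 * many nbd devices a user can create.
	 */
	nbd->recv_workq = alloc_workqueue("knbd%d-recv",
					  WQ_MEM_RECLAIM | WQ_HIGHPRI |
					  WQ_UNBOUND, 0, nbd->index);
	if (!nbd->recv_workq)
		return -ENOMEM;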


> 
> Any thoughts?
>

Is the concern that a bad VM could create N devs, each with 128
vqs/threads, and slow down other VMs? How do we handle the case where
some VM makes enough devices that the total vq/thread count equals
N * 128, so we end up with the same number of threads either way? Is
there a limit on the number of vhost devices a VM can make, and can I
just stick in a similar check for workers?

For vhost-scsi specifically, the 128 limit does not make a lot of sense.
I think we want the max to be the number of vCPUs the VM has so we can
add checks for that. Then we would assume someone making a VM with lots of
CPUs is going to have the resources to support them.

Note: It does make sense from the point of view that we don't know the
number of vCPUs when vhost-scsi calls vhost_dev_init, so I get we had to
select an initial limit.
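
If we do add a check for workers, it could be as simple as something
like this (purely illustrative; VHOST_MAX_WORKERS does not exist, and
picking the right value is the open question above):

	/* Cap how many workers a single device can create. */
	if (dev->num_workers >= VHOST_MAX_WORKERS)
		return -E2BIG;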



>> +		if (!dev->workers) {
>> +			vhost_worker_put(worker);
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>> +	vq->worker = worker;
>> +
>> +	dev->workers[dev->num_workers] = worker;
>> +	dev->num_workers++;
> 
> Hmm...should we really append to workers[] in the vhost_worker_find()
> case?


As it's coded now, yes. Every successful vhost_worker_find call does a
get on the worker's refcount. Later when we delete the device, we loop
over the workers array and for every entry we do a put.

I can add in some code to first check if the worker is already in the
dev's worker list. If so then skip the refcount and skip adding to the
workers array. If not in the dev's worker list then do a vhost_worker_find.

I thought it might be nicer to keep it how it is now, with the single
path. It's less code at least. Later, if we add support for changing a
vq's worker, we also don't have to worry about refcounts as much: we
just always drop the reference taken when the worker was added.
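
To make that concrete, the teardown side would be roughly the below (a
sketch following the scheme described here, not the exact patch code):

static void vhost_workers_free(struct vhost_dev *dev)
{
	int i;

	if (!dev->workers)
		return;

	/*
	 * Drop the reference taken by vhost_worker_create/vhost_worker_find
	 * for each entry, including duplicate entries for shared workers.
	 */
	for (i = 0; i < dev->num_workers; i++)
		vhost_worker_put(dev->workers[i]);

	kfree(dev->workers);
	dev->workers = NULL;
	dev->num_workers = 0;
}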


Re: vhost: multiple worker support

2021-06-05 Thread michael . christie
On 6/3/21 5:16 PM, Mike Christie wrote:
> On 6/3/21 9:37 AM, Stefan Hajnoczi wrote:
>> On Tue, May 25, 2021 at 01:05:51PM -0500, Mike Christie wrote:
>>> The following patches apply over linus's tree or mst's vhost branch
>>> and my cleanup patchset:
>>>
>>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-May/054354.html
>>>
>>> These patches allow us to support multiple vhost workers per device. I
>>> ended up just doing Stefan's original idea where userspace has the
>>> kernel create a worker and we pass back the pid. This has the benefit
>>> over the workqueue and userspace thread approach where we only have
>>> one'ish code path in the kernel during setup to detect old tools. The
>>> main IO paths and device/vq setup/teardown paths all use common code.
>>>
>>> The kernel patches here allow us to then do N workers per device and
>>> also share workers across devices.
>>>
>>> I've also included a patch for qemu so you can get an idea of how it
>>> works. If we are ok with the kernel code then I'll break that up into
>>> a patchset and send to qemu-devel.
>>
>> It seems risky to allow userspace process A to "share" a vhost worker
>> thread with userspace process B based on a matching pid alone. Should
>> they have ptrace_may_access() or similar?
>>
> 
> I'm not sure. I already made it a little restrictive in this posting, but

I made a mistake here. In this posting I did not make it restrictive,
and I was allowing any two processes to share. So we would need
something like ptrace_may_access if we go this route.

If we restrict sharing to workers with the same owner, then I'm not
sure we need anything extra.
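
If we did keep the pid-based sharing, the check could be something like
this (hypothetical; it assumes the worker keeps a task pointer around):

static bool vhost_worker_may_share(struct vhost_worker *worker)
{
	/* Only share if the caller could ptrace the worker's owner task. */
	return ptrace_may_access(worker->task, PTRACE_MODE_ATTACH_REALCREDS);
}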