Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 8/13/23 2:01 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote:
>> On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
>>> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote:
>>>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>>>> For vhost workers we use the kthread API, which inherits its values from
>>>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>>>> being checked, so while tools like libvirt try to control the number of
>>>>>> threads based on the nproc rlimit setting, we can end up creating more
>>>>>> threads than the user wanted.
>>>>>>
>>>>>> This patch has us use the vhost_task helpers, which will inherit their
>>>>>> values/checks from the thread that owns the device, similar to if we did
>>>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>>>> so we can remove those calls.
>>>>>>
>>>>>> Signed-off-by: Mike Christie
>>>>>> Acked-by: Michael S. Tsirkin
>>>>>
>>>>> Hi Mike,
>>>>> So this seems to have caused a measurable regression in networking
>>>>> performance (about 30%). Take a look here, and there's a zip file
>>>>> with detailed measurements attached:
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=603
>>>>>
>>>>> Could you take a look please?
>>>>> You can also ask the reporter questions there, assuming you
>>>>> have or can create a (free) account.
>>>>
>>>> Sorry for the late reply. I just got home from vacation.
>>>>
>>>> The account creation link seems to be down. I keep getting an
>>>> "unable to establish SMTP connection to bz-exim-prod port 25" error.
>>>>
>>>> Can you give me Quan's email?
>>>>
>>>> I think I can replicate the problem. I just need some extra info from Quan:
>>>>
>>>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>>>> 2. The kernel config.
>>>> 3. Any tuning that was done. Is tuned running in the guest and/or the host
>>>> running the VMs, and what profile is being used in each.
>>>> 4. Number of vCPUs and virtqueues being used.
>>>> 5. Can they dump the contents of:
>>>>
>>>> /sys/kernel/debug/sched
>>>>
>>>> and
>>>>
>>>> sysctl -a
>>>>
>>>> on the host running the VMs.
>>>>
>>>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they
>>>> set the scheduler to batch:
>>>>
>>>> ps -T -o comm,pid,tid $QEMU_THREAD
>>>>
>>>> then for each vhost thread do:
>>>>
>>>> chrt -b -p 0 $VHOST_THREAD
>>>>
>>>> Does that end up increasing perf? When I do this I see throughput go up by
>>>> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
>>>> and virtqueues per net device in the VM). Note that I'm not saying that is
>>>> a fix. It's just a difference I noticed when running some other tests.
>>>
>>> Mike, I'm unsure what to do at this point. Regressions are not nice,
>>> but if the kernel is released with the new userspace API we won't
>>> be able to revert. So what's the plan?
>>
>> I'm sort of stumped. I still can't replicate the problem out of the box. 6.3
>> and 6.4 perform the same for me. I've tried your setup and settings and with
>> different combos of using things like tuned and irqbalance.
>>
>> I can sort of force the issue. In 6.4, the vhost thread inherits its settings
>> from the parent thread. In 6.3, the vhost thread inherits from kthreadd, and
>> we would then reset the sched settings. So in 6.4, if I just tune the parent
>> differently, I can cause different performance. If we want the 6.3 behavior,
>> we can do the patch below.
>>
>> However, I don't think you guys are hitting this, because you are just
>> running qemu from the normal shell and were not doing anything fancy with
>> the sched settings.
>>
>> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
>> index da35e5b7f047..f2c2638d1106 100644
>> --- a/kernel/vhost_task.c
>> +++ b/kernel/vhost_task.c
>> @@ -2,6 +2,7 @@
>>  /*
>>   * Copyright (C) 2021 Oracle Corporation
>>   */
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -22,9 +23,16 @@ struct vhost_task {
>>
>>  static int vhost_task_fn(void *data)
>>  {
>> +	static const struct sched_param param = { .sched_priority = 0 };
>>  	struct vhost_task *vtsk = data;
>>  	bool dead = false;
>>
>> +	/*
>> +	 * Don't inherit the parent's sched info, so we maintain compat from
>> +	 * when we used kthreads and it reset this info.
>> +	 */
>> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>> +
>>  	for (;;) {
>>  		bool did_work;
>
> yes seems unlikely, still, attach this to bugzilla so it can be
> tested?
>
> and, what will help you debug? any traces to enable?

I added the patch and asked for a perf trace.

> Also wasn't there another issue with a non standard
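For anyone who wants to reproduce the batch-scheduling experiment referenced
above without chrt, here is a minimal userspace sketch (not from the thread)
of what "chrt -b -p 0 $VHOST_THREAD" does under the covers. The TID argument
is an assumption: you have to look it up yourself, e.g. with
"ps -T -o comm,pid,tid".

	/* set_batch.c: switch one thread to SCHED_BATCH */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		/* SCHED_BATCH priority must be 0 */
		struct sched_param sp = { .sched_priority = 0 };
		pid_t tid;

		if (argc != 2) {
			fprintf(stderr, "usage: %s <tid>\n", argv[0]);
			return 1;
		}
		tid = (pid_t)atoi(argv[1]);

		/*
		 * SCHED_BATCH is like SCHED_OTHER, but hints to the scheduler
		 * that the task is CPU-intensive, so it pays a small
		 * wakeup-preemption penalty. On Linux, passing a TID affects
		 * just that one thread.
		 */
		if (sched_setscheduler(tid, SCHED_BATCH, &sp)) {
			perror("sched_setscheduler");
			return 1;
		}
		return 0;
	}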
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Fri, Aug 11, 2023 at 01:51:36PM -0500, Mike Christie wrote: > On 8/10/23 1:57 PM, Michael S. Tsirkin wrote: > > On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote: > >> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote: > >>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote: > For vhost workers we use the kthread API which inherit's its values from > and checks against the kthreadd thread. This results in the wrong RLIMITs > being checked, so while tools like libvirt try to control the number of > threads based on the nproc rlimit setting we can end up creating more > threads than the user wanted. > > This patch has us use the vhost_task helpers which will inherit its > values/checks from the thread that owns the device similar to if we did > a clone in userspace. The vhost threads will now be counted in the nproc > rlimits. And we get features like cgroups and mm sharing automatically, > so we can remove those calls. > > Signed-off-by: Mike Christie > Acked-by: Michael S. Tsirkin > >>> > >>> > >>> Hi Mike, > >>> So this seems to have caused a measureable regression in networking > >>> performance (about 30%). Take a look here, and there's a zip file > >>> with detailed measuraments attached: > >>> > >>> https://bugzilla.redhat.com/show_bug.cgi?id=603 > >>> > >>> > >>> Could you take a look please? > >>> You can also ask reporter questions there assuming you > >>> have or can create a (free) account. > >>> > >> > >> Sorry for the late reply. I just got home from vacation. > >> > >> The account creation link seems to be down. I keep getting a > >> "unable to establish SMTP connection to bz-exim-prod port 25 " error. > >> > >> Can you give me Quan's email? > >> > >> I think I can replicate the problem. I just need some extra info from Quan: > >> > >> 1. Just double check that they are using RHEL 9 on the host running the > >> VMs. > >> 2. The kernel config > >> 3. Any tuning that was done. Is tuned running in guest and/or host running > >> the > >> VMs and what profile is being used in each. > >> 4. Number of vCPUs and virtqueues being used. > >> 5. Can they dump the contents of: > >> > >> /sys/kernel/debug/sched > >> > >> and > >> > >> sysctl -a > >> > >> on the host running the VMs. > >> > >> 6. With the 6.4 kernel, can they also run a quick test and tell me if they > >> set > >> the scheduler to batch: > >> > >> ps -T -o comm,pid,tid $QEMU_THREAD > >> > >> then for each vhost thread do: > >> > >> chrt -b -p 0 $VHOST_THREAD > >> > >> Does that end up increasing perf? When I do this I see throughput go up by > >> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs > >> and virtqueues per net device in the VM). Note that I'm not saying that is > >> a fix. > >> It's just a difference I noticed when running some other tests. > > > > > > Mike I'm unsure what to do at this point. Regressions are not nice > > but if the kernel is released with the new userspace api we won't > > be able to revert. So what's the plan? > > > > I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 > and > 6.4 perform the same for me. I've tried your setup and settings and with > different > combos of using things like tuned and irqbalance. > > I can sort of force the issue. In 6.4, the vhost thread inherits it's settings > from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we > would then reset the sched settings. So in 6.4 if I just tune the parent > differently > I can cause different performance. 
> If we want the 6.3 behavior we can do the patch below.
>
> However, I don't think you guys are hitting this because you are just running
> qemu from the normal shell and were not doing anything fancy with the sched
> settings.
>
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index da35e5b7f047..f2c2638d1106 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -2,6 +2,7 @@
>  /*
>   * Copyright (C) 2021 Oracle Corporation
>   */
> +#include
>  #include
>  #include
>  #include
> @@ -22,9 +23,16 @@ struct vhost_task {
>
>  static int vhost_task_fn(void *data)
>  {
> +	static const struct sched_param param = { .sched_priority = 0 };
>  	struct vhost_task *vtsk = data;
>  	bool dead = false;
>
> +	/*
> +	 * Don't inherit the parent's sched info, so we maintain compat from
> +	 * when we used kthreads and it reset this info.
> +	 */
> +	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> +
>  	for (;;) {
>  		bool did_work;

yes seems unlikely, still, attach this to bugzilla so it can be tested?

and, what will help you debug? any traces to enable?

Also wasn't there another issue with a non-standard config? Maybe if we fix
that it will by chance fix this one too?
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 8/10/23 1:57 PM, Michael S. Tsirkin wrote: > On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote: >> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote: >>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote: For vhost workers we use the kthread API which inherit's its values from and checks against the kthreadd thread. This results in the wrong RLIMITs being checked, so while tools like libvirt try to control the number of threads based on the nproc rlimit setting we can end up creating more threads than the user wanted. This patch has us use the vhost_task helpers which will inherit its values/checks from the thread that owns the device similar to if we did a clone in userspace. The vhost threads will now be counted in the nproc rlimits. And we get features like cgroups and mm sharing automatically, so we can remove those calls. Signed-off-by: Mike Christie Acked-by: Michael S. Tsirkin >>> >>> >>> Hi Mike, >>> So this seems to have caused a measureable regression in networking >>> performance (about 30%). Take a look here, and there's a zip file >>> with detailed measuraments attached: >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=603 >>> >>> >>> Could you take a look please? >>> You can also ask reporter questions there assuming you >>> have or can create a (free) account. >>> >> >> Sorry for the late reply. I just got home from vacation. >> >> The account creation link seems to be down. I keep getting a >> "unable to establish SMTP connection to bz-exim-prod port 25 " error. >> >> Can you give me Quan's email? >> >> I think I can replicate the problem. I just need some extra info from Quan: >> >> 1. Just double check that they are using RHEL 9 on the host running the VMs. >> 2. The kernel config >> 3. Any tuning that was done. Is tuned running in guest and/or host running >> the >> VMs and what profile is being used in each. >> 4. Number of vCPUs and virtqueues being used. >> 5. Can they dump the contents of: >> >> /sys/kernel/debug/sched >> >> and >> >> sysctl -a >> >> on the host running the VMs. >> >> 6. With the 6.4 kernel, can they also run a quick test and tell me if they >> set >> the scheduler to batch: >> >> ps -T -o comm,pid,tid $QEMU_THREAD >> >> then for each vhost thread do: >> >> chrt -b -p 0 $VHOST_THREAD >> >> Does that end up increasing perf? When I do this I see throughput go up by >> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs >> and virtqueues per net device in the VM). Note that I'm not saying that is a >> fix. >> It's just a difference I noticed when running some other tests. > > > Mike I'm unsure what to do at this point. Regressions are not nice > but if the kernel is released with the new userspace api we won't > be able to revert. So what's the plan? > I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and 6.4 perform the same for me. I've tried your setup and settings and with different combos of using things like tuned and irqbalance. I can sort of force the issue. In 6.4, the vhost thread inherits it's settings from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we would then reset the sched settings. So in 6.4 if I just tune the parent differently I can cause different performance. If we want the 6.3 behavior we can do the patch below. However, I don't think you guys are hitting this because you are just running qemu from the normal shell and were not doing anything fancy with the sched settings. 
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index da35e5b7f047..f2c2638d1106 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2021 Oracle Corporation
  */
+#include
 #include
 #include
 #include
@@ -22,9 +23,16 @@ struct vhost_task {

 static int vhost_task_fn(void *data)
 {
+	static const struct sched_param param = { .sched_priority = 0 };
 	struct vhost_task *vtsk = data;
 	bool dead = false;

+	/*
+	 * Don't inherit the parent's sched info, so we maintain compat from
+	 * when we used kthreads and it reset this info.
+	 */
+	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
+
 	for (;;) {
 		bool did_work;
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote: > On 7/20/23 8:06 AM, Michael S. Tsirkin wrote: > > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote: > >> For vhost workers we use the kthread API which inherit's its values from > >> and checks against the kthreadd thread. This results in the wrong RLIMITs > >> being checked, so while tools like libvirt try to control the number of > >> threads based on the nproc rlimit setting we can end up creating more > >> threads than the user wanted. > >> > >> This patch has us use the vhost_task helpers which will inherit its > >> values/checks from the thread that owns the device similar to if we did > >> a clone in userspace. The vhost threads will now be counted in the nproc > >> rlimits. And we get features like cgroups and mm sharing automatically, > >> so we can remove those calls. > >> > >> Signed-off-by: Mike Christie > >> Acked-by: Michael S. Tsirkin > > > > > > Hi Mike, > > So this seems to have caused a measureable regression in networking > > performance (about 30%). Take a look here, and there's a zip file > > with detailed measuraments attached: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=603 > > > > > > Could you take a look please? > > You can also ask reporter questions there assuming you > > have or can create a (free) account. > > > > Sorry for the late reply. I just got home from vacation. > > The account creation link seems to be down. I keep getting a > "unable to establish SMTP connection to bz-exim-prod port 25 " error. > > Can you give me Quan's email? > > I think I can replicate the problem. I just need some extra info from Quan: > > 1. Just double check that they are using RHEL 9 on the host running the VMs. > 2. The kernel config > 3. Any tuning that was done. Is tuned running in guest and/or host running the > VMs and what profile is being used in each. > 4. Number of vCPUs and virtqueues being used. > 5. Can they dump the contents of: > > /sys/kernel/debug/sched > > and > > sysctl -a > > on the host running the VMs. > > 6. With the 6.4 kernel, can they also run a quick test and tell me if they set > the scheduler to batch: > > ps -T -o comm,pid,tid $QEMU_THREAD > > then for each vhost thread do: > > chrt -b -p 0 $VHOST_THREAD > > Does that end up increasing perf? When I do this I see throughput go up by > around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs > and virtqueues per net device in the VM). Note that I'm not saying that is a > fix. > It's just a difference I noticed when running some other tests. Mike I'm unsure what to do at this point. Regressions are not nice but if the kernel is released with the new userspace api we won't be able to revert. So what's the plan? -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote: > On 7/20/23 8:06 AM, Michael S. Tsirkin wrote: > > On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote: > >> For vhost workers we use the kthread API which inherit's its values from > >> and checks against the kthreadd thread. This results in the wrong RLIMITs > >> being checked, so while tools like libvirt try to control the number of > >> threads based on the nproc rlimit setting we can end up creating more > >> threads than the user wanted. > >> > >> This patch has us use the vhost_task helpers which will inherit its > >> values/checks from the thread that owns the device similar to if we did > >> a clone in userspace. The vhost threads will now be counted in the nproc > >> rlimits. And we get features like cgroups and mm sharing automatically, > >> so we can remove those calls. > >> > >> Signed-off-by: Mike Christie > >> Acked-by: Michael S. Tsirkin > > > > > > Hi Mike, > > So this seems to have caused a measureable regression in networking > > performance (about 30%). Take a look here, and there's a zip file > > with detailed measuraments attached: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=603 > > > > > > Could you take a look please? > > You can also ask reporter questions there assuming you > > have or can create a (free) account. > > > > Sorry for the late reply. I just got home from vacation. > > The account creation link seems to be down. I keep getting a > "unable to establish SMTP connection to bz-exim-prod port 25 " error. > > Can you give me Quan's email? Thanks for getting back! I asked whether it's ok to share the email. For now pasted your request in the bugzilla. > I think I can replicate the problem. I just need some extra info from Quan: > > 1. Just double check that they are using RHEL 9 on the host running the VMs. > 2. The kernel config > 3. Any tuning that was done. Is tuned running in guest and/or host running the > VMs and what profile is being used in each. > 4. Number of vCPUs and virtqueues being used. > 5. Can they dump the contents of: > > /sys/kernel/debug/sched > > and > > sysctl -a > > on the host running the VMs. > > 6. With the 6.4 kernel, can they also run a quick test and tell me if they set > the scheduler to batch: > > ps -T -o comm,pid,tid $QEMU_THREAD > > then for each vhost thread do: > > chrt -b -p 0 $VHOST_THREAD > > Does that end up increasing perf? When I do this I see throughput go up by > around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs > and virtqueues per net device in the VM). Note that I'm not saying that is a > fix. > It's just a difference I noticed when running some other tests. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>> For vhost workers we use the kthread API, which inherits its values from
>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>> being checked, so while tools like libvirt try to control the number of
>> threads based on the nproc rlimit setting, we can end up creating more
>> threads than the user wanted.
>>
>> This patch has us use the vhost_task helpers, which will inherit their
>> values/checks from the thread that owns the device, similar to if we did
>> a clone in userspace. The vhost threads will now be counted in the nproc
>> rlimits. And we get features like cgroups and mm sharing automatically,
>> so we can remove those calls.
>>
>> Signed-off-by: Mike Christie
>> Acked-by: Michael S. Tsirkin
>
> Hi Mike,
> So this seems to have caused a measurable regression in networking
> performance (about 30%). Take a look here, and there's a zip file
> with detailed measurements attached:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=603
>
> Could you take a look please?
> You can also ask the reporter questions there, assuming you
> have or can create a (free) account.

Sorry for the late reply. I just got home from vacation.

The account creation link seems to be down. I keep getting an
"unable to establish SMTP connection to bz-exim-prod port 25" error.

Can you give me Quan's email?

I think I can replicate the problem. I just need some extra info from Quan:

1. Just double check that they are using RHEL 9 on the host running the VMs.
2. The kernel config.
3. Any tuning that was done. Is tuned running in the guest and/or the host
   running the VMs, and what profile is being used in each.
4. Number of vCPUs and virtqueues being used.
5. Can they dump the contents of:

   /sys/kernel/debug/sched

   and

   sysctl -a

   on the host running the VMs.

6. With the 6.4 kernel, can they also run a quick test and tell me if they set
   the scheduler to batch:

   ps -T -o comm,pid,tid $QEMU_THREAD

   then for each vhost thread do:

   chrt -b -p 0 $VHOST_THREAD

   Does that end up increasing perf? When I do this I see throughput go up by
   around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
   and virtqueues per net device in the VM). Note that I'm not saying that is
   a fix. It's just a difference I noticed when running some other tests.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
> For vhost workers we use the kthread API, which inherits its values from
> and checks against the kthreadd thread. This results in the wrong RLIMITs
> being checked, so while tools like libvirt try to control the number of
> threads based on the nproc rlimit setting, we can end up creating more
> threads than the user wanted.
>
> This patch has us use the vhost_task helpers, which will inherit their
> values/checks from the thread that owns the device, similar to if we did
> a clone in userspace. The vhost threads will now be counted in the nproc
> rlimits. And we get features like cgroups and mm sharing automatically,
> so we can remove those calls.
>
> Signed-off-by: Mike Christie
> Acked-by: Michael S. Tsirkin

Hi Mike,
So this seems to have caused a measurable regression in networking
performance (about 30%). Take a look here, and there's a zip file
with detailed measurements attached:

https://bugzilla.redhat.com/show_bug.cgi?id=603

Could you take a look please?
You can also ask the reporter questions there, assuming you
have or can create a (free) account.

> ---
>  drivers/vhost/vhost.c | 58 ---
>  drivers/vhost/vhost.h |  4 +--
>  2 files changed, 13 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 74378d241f8d..d3c7c37b69a7 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -22,11 +22,11 @@
>  #include
>  #include
>  #include
> -#include
>  #include
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
>  	 * test_and_set_bit() implies a memory barrier.
>  	 */
>  	llist_add(&work->node, &dev->worker->work_list);
> -	wake_up_process(dev->worker->task);
> +	wake_up_process(dev->worker->vtsk->task);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_queue);
> @@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  static int vhost_worker(void *data)
>  {
>  	struct vhost_worker *worker = data;
> -	struct vhost_dev *dev = worker->dev;
>  	struct vhost_work *work, *work_next;
>  	struct llist_node *node;
>
> -	kthread_use_mm(dev->mm);
> -
>  	for (;;) {
>  		/* mb paired w/ kthread_stop */
>  		set_current_state(TASK_INTERRUPTIBLE);
>
> -		if (kthread_should_stop()) {
> +		if (vhost_task_should_stop(worker->vtsk)) {
>  			__set_current_state(TASK_RUNNING);
>  			break;
>  		}
> @@ -368,7 +365,7 @@ static int vhost_worker(void *data)
>  			schedule();
>  		}
>  	}
> -	kthread_unuse_mm(dev->mm);
> +
>  	return 0;
>  }
>
> @@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
>
> -struct vhost_attach_cgroups_struct {
> -	struct vhost_work work;
> -	struct task_struct *owner;
> -	int ret;
> -};
> -
> -static void vhost_attach_cgroups_work(struct vhost_work *work)
> -{
> -	struct vhost_attach_cgroups_struct *s;
> -
> -	s = container_of(work, struct vhost_attach_cgroups_struct, work);
> -	s->ret = cgroup_attach_task_all(s->owner, current);
> -}
> -
> -static int vhost_attach_cgroups(struct vhost_dev *dev)
> -{
> -	struct vhost_attach_cgroups_struct attach;
> -
> -	attach.owner = current;
> -	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
> -	vhost_work_queue(dev, &attach.work);
> -	vhost_dev_flush(dev);
> -	return attach.ret;
> -}
> -
>  /* Caller should have device mutex */
>  bool vhost_dev_has_owner(struct vhost_dev *dev)
>  {
> @@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
>
>  	dev->worker = NULL;
>  	WARN_ON(!llist_empty(&worker->work_list));
> -	kthread_stop(worker->task);
> +	vhost_task_stop(worker->vtsk);
>  	kfree(worker);
>  }
>
>  static int vhost_worker_create(struct vhost_dev *dev)
>  {
>  	struct vhost_worker *worker;
> -	struct task_struct *task;
> +	struct vhost_task *vtsk;
>  	int ret;
>
>  	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
> @@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
>  		return -ENOMEM;
>
>  	dev->worker = worker;
> -	worker->dev = dev;
>  	worker->kcov_handle = kcov_common_handle();
>  	init_llist_head(&worker->work_list);
>
> -	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
> -	if (IS_ERR(task)) {
> -		ret = PTR_ERR(task);
> +	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
> +	if (!vtsk) {
> +		ret = -ENOMEM;
>  		goto free_worker;
>  	}
>
> -	worker->task = task;
> -	wake_up_process(task); /*
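To see the RLIMIT_NPROC accounting that the patch description is talking
about from userspace, here is a minimal sketch (not part of the patch).
RLIMIT_NPROC caps the number of processes/threads charged to the calling
user, and clone()/fork() fail with EAGAIN once it is exceeded; workers made
through the old kthread API were accounted to kthreadd instead, which is the
bug being fixed. Run this as an unprivileged user; root is exempt.

	/* nproc_demo.c: show RLIMIT_NPROC being enforced on clone/fork */
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/resource.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		struct rlimit rl = { .rlim_cur = 1, .rlim_max = 1 };

		if (setrlimit(RLIMIT_NPROC, &rl))
			perror("setrlimit");

		/*
		 * The user almost certainly already has more than one task,
		 * so this fork() should fail with EAGAIN, which is the limit
		 * libvirt relies on when it caps guest thread counts.
		 */
		pid_t pid = fork();
		if (pid < 0)
			printf("fork failed as expected: %s\n", strerror(errno));
		else if (pid == 0)
			_exit(0);
		else
			printf("fork unexpectedly succeeded (privileged user?)\n");
		return 0;
	}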
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/17/23 12:09 PM, Oleg Nesterov wrote:
> IIUC, Mike is going to send the next version? So I think we can delay
> the further discussions until then.

Yeah, I'm working on a version that supports signals so it will be easier
to discuss with the vhost devs and you, Christian, and Eric.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov writes:
>
>>> There is this bit in complete_signal when SIGKILL is delivered to any
>>> thread in the process.
>>>
>>> 	t = p;
>>> 	do {
>>> 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>>> 		sigaddset(&t->pending.signal, SIGKILL);
>>> 		signal_wake_up(t, 1);
>>> 	} while_each_thread(p, t);
>>
>> That is why the latest version adds try_set_pending_sigkill(). No, no,
>> it is not that I think this is a good idea.
>
> I see that try_set_pending_sigkill in the patch now.
>
> That try_set_pending_sigkill just keeps the process from reporting
> that it has exited, and extends the process exit indefinitely.
>
> SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
> already delivered and the process is exiting.

Agreed, that is why I said I don't think try_set_pending_sigkill() is a
good idea.

And again, the same is true for the threads created by create_io_thread().
get_signal() from io_uring/ can dequeue a pending SIGKILL and return, but
that is all.

>>> For clarity that sigaddset(&t->pending.signal, SIGKILL); really isn't
>>> setting SIGKILL pending,
>>
>> Hmm. it does? Nevermind.
>
> The point is that what try_set_pending_sigkill in the patch is doing is
> keeping the "you are dead exit now" flag from being set.
>
> That flag is what fatal_signal_pending always tests, because we can only
> know if a fatal signal is pending if we have performed short circuit
> delivery on the signal.
>
> The result is that the effects of the change are mostly what people
> expect. The difference is that the semantics being changed aren't what
> people think they are.
>
> AKA process exit is being ignored for the thread, not that SIGKILL is
> being blocked.

Sorry, I don't understand. I just tried to say that
sigaddset(&t->pending.signal, SIGKILL) really sets SIGKILL pending.
Nevermind.

>> Although I never understood this logic.

I meant I never really liked how io-threads play with signals,

>> I can't even understand the usage
>> of lower_32_bits() in create_io_thread().
>
> As far as I can tell lower_32_bits(flags) is just defensive programming

Cough. but this is ugly. Or I missed something.

> or have just populated .flags directly.

Exactly,

> Then .exit_signal could have been set to 0.

Exactly.

---

OK. It doesn't matter. I tried to read the whole thread and got lost.
IIUC, Mike is going to send the next version? So I think we can delay
the further discussions until then.

Oleg.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
Oleg Nesterov writes:

> On 05/16, Eric W. Biederman wrote:
>>
>> A kernel thread can block SIGKILL and that is supported.
>>
>> For a thread that is part of a process you can't block SIGKILL when the
>> task is part of a user mode process.
>
> Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up()/etc.

Yes, ignoring SIGSTOP leads to the same kind of rendezvous issues as
SIGKILL.

>> There is this bit in complete_signal when SIGKILL is delivered to any
>> thread in the process.
>>
>> 	t = p;
>> 	do {
>> 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
>> 		sigaddset(&t->pending.signal, SIGKILL);
>> 		signal_wake_up(t, 1);
>> 	} while_each_thread(p, t);
>
> That is why the latest version adds try_set_pending_sigkill(). No, no,
> it is not that I think this is a good idea.

I see that try_set_pending_sigkill in the patch now.

That try_set_pending_sigkill just keeps the process from reporting that
it has exited, and extends the process exit indefinitely.

SIGNAL_GROUP_EXIT has already been set, so the KILL signal was already
delivered and the process is exiting.

>> For clarity that sigaddset(&t->pending.signal, SIGKILL); really isn't
>> setting SIGKILL pending,
>
> Hmm. it does? Nevermind.

The point is that what try_set_pending_sigkill in the patch is doing is
keeping the "you are dead exit now" flag from being set.

That flag is what fatal_signal_pending always tests, because we can only
know if a fatal signal is pending if we have performed short circuit
delivery on the signal.

The result is that the effects of the change are mostly what people
expect. The difference is that the semantics being changed aren't what
people think they are.

AKA process exit is being ignored for the thread, not that SIGKILL is
being blocked.

>> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
>> That indicates the entire process is being torn down.
>
> Yes. and the same is true for io-thread even if it calls get_signal()
> and dequeues SIGKILL and clears TIF_SIGPENDING.
>
>> but in that case the vhost logic needs to act like a process, just
>> like io_uring does.
>
> confused... create_io_thread() creates a sub-thread too?

Yes, create_io_thread creates an ordinary user space thread that never
runs any code in user space.

> Although I never understood this logic. I can't even understand the usage
> of lower_32_bits() in create_io_thread().

As far as I can tell lower_32_bits(flags) is just defensive programming
that just copies the code in clone. The code could just as easily have
said u32 flags, or have just populated .flags directly. Then .exit_signal
could have been set to 0.

Later copy_process will set .exit_signal = -1 because CLONE_THREAD is set.

The reason for adding create_io_thread calling copy_process, as I recall,
was so that the new task does not start automatically. This allows
functions like io_init_new_worker to initialize the new task without races
and then call wake_up_new_task.

Eric
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 05/16, Eric W. Biederman wrote:
>
> A kernel thread can block SIGKILL and that is supported.
>
> For a thread that is part of a process you can't block SIGKILL when the
> task is part of a user mode process.

Or SIGSTOP. Another thread can call do_signal_stop()->signal_wake_up()/etc.

> There is this bit in complete_signal when SIGKILL is delivered to any
> thread in the process.
>
> 	t = p;
> 	do {
> 		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> 		sigaddset(&t->pending.signal, SIGKILL);
> 		signal_wake_up(t, 1);
> 	} while_each_thread(p, t);

That is why the latest version adds try_set_pending_sigkill(). No, no,
it is not that I think this is a good idea.

> For clarity that sigaddset(&t->pending.signal, SIGKILL); really isn't
> setting SIGKILL pending,

Hmm. it does? Nevermind.

> The important part of that code is that SIGNAL_GROUP_EXIT gets set.
> That indicates the entire process is being torn down.

Yes. and the same is true for io-thread even if it calls get_signal()
and dequeues SIGKILL and clears TIF_SIGPENDING.

> but in that case the vhost logic needs to act like a process, just
> like io_uring does.

confused... create_io_thread() creates a sub-thread too?

Although I never understood this logic. I can't even understand the usage
of lower_32_bits() in create_io_thread().

Oleg.
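For readers following the lower_32_bits() question, this is roughly what
create_io_thread() looked like in kernel/fork.c around v6.3. It is
reproduced here from memory as context, so verify against your tree before
relying on it. CSIGNAL masks the low bits of the clone flags that would
normally carry the exit signal; since flags only contains CLONE_* bits here,
the lower_32_bits()/CSIGNAL dance is, as Eric says, defensive, and
.exit_signal always ends up 0.

	struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
	{
		unsigned long flags = CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
					CLONE_IO;
		struct kernel_clone_args args = {
			.flags		= ((lower_32_bits(flags) | CLONE_VM |
					    CLONE_UNTRACED) & ~CSIGNAL),
			.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
			.fn		= fn,
			.fn_arg		= arg,
			.io_thread	= 1,
			.user_worker	= 1,
		};

		return copy_process(NULL, 0, node, &args);
	}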
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
Linus Torvalds writes:

> On Mon, May 15, 2023 at 3:23 PM Mike Christie wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like
>> kthreads for that case. To make it really simple, can we do something
>> like this where it separates user and io worker behavior, where the
>> major diff is how they handle signals and exit? I also included a fix
>> for the freeze case:
>
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
>
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means
> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.

A kernel thread can block SIGKILL and that is supported.

For a thread that is part of a process you can't block SIGKILL when the
task is part of a user mode process.

There is this bit in complete_signal when SIGKILL is delivered to any
thread in the process.

	/*
	 * Start a group exit and wake everybody up.
	 * This way we don't have other threads
	 * running and doing things after a slower
	 * thread has the fatal signal pending.
	 */
	signal->flags = SIGNAL_GROUP_EXIT;
	signal->group_exit_code = sig;
	signal->group_stop_count = 0;
	t = p;
	do {
		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
		sigaddset(&t->pending.signal, SIGKILL);
		signal_wake_up(t, 1);
	} while_each_thread(p, t);

For clarity, that sigaddset(&t->pending.signal, SIGKILL); really isn't
setting SIGKILL pending; it is part of the short circuit delivery logic,
and that sigaddset(SIGKILL) is just setting a flag to tell the process it
needs to die.

The important part of that code is that SIGNAL_GROUP_EXIT gets set.
That indicates the entire process is being torn down.

Where this becomes important is that exit_notify and release_task work
together to ensure that the first thread in the process (a user space
thread that can not block SIGKILL) will not send SIGCHLD to its parent
process until every thread in the process has exited. The
delay_group_leader logic in wait_consider_task, part of wait(2), has the
same logic.

Having been through this with io_uring, the threads really need to call
get_signal to handle that case. This is pretty much why I said at the
outset that they needed to decide if they were going to implement a thread
or if they were going to be a process. Changing the decision to be a
thread from a process is fine, but in that case the vhost logic needs to
act like a process, just like io_uring does.

> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on.. Eric Biederman has
> already been on the list and has also been involved
>
> Oleg: see
>
> https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d...@oracle.com/
>
> for the context here.

Eric
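A quick userspace demonstration (not from the thread) of the rule Eric
describes: a user mode process cannot block SIGKILL or SIGSTOP. The kernel
silently strips them from the mask requested via sigprocmask() rather than
failing the call.

	/* sigkill_block.c: show that SIGKILL can never be blocked */
	#include <signal.h>
	#include <stdio.h>

	int main(void)
	{
		sigset_t set, old;

		sigfillset(&set);			/* ask to block everything */
		sigprocmask(SIG_SETMASK, &set, NULL);	/* succeeds... */

		sigprocmask(SIG_SETMASK, NULL, &old);	/* ...but read it back */
		printf("SIGKILL blocked: %d\n", sigismember(&old, SIGKILL)); /* 0 */
		printf("SIGTERM blocked: %d\n", sigismember(&old, SIGTERM)); /* 1 */
		return 0;
	}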
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/16/23 3:39 AM, Christian Brauner wrote:
> On Mon, May 15, 2023 at 05:23:12PM -0500, Mike Christie wrote:
>> On 5/15/23 10:44 AM, Linus Torvalds wrote:
>>> On Mon, May 15, 2023 at 7:23 AM Christian Brauner wrote:
>>>>
>>>> So I think we will be able to address (1) and (2) by making vhost tasks
>>>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>>>> and then having vhost handle get_signal() - as you mentioned - the same
>>>> way io_uring already does. We should also remove the ignore_signals
>>>> thing completely imho. I don't think we ever want to do this with user
>>>> workers.
>>>
>>> Right. That's what IO_URING does:
>>>
>>> 	if (args->io_thread) {
>>> 		/*
>>> 		 * Mark us an IO worker, and block any signal that isn't
>>> 		 * fatal or STOP
>>> 		 */
>>> 		p->flags |= PF_IO_WORKER;
>>> 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>>> 	}
>>>
>>> and I really think that vhost should basically do exactly what io_uring
>>> does.
>>>
>>> Not because io_uring fundamentally got this right - but simply because
>>> io_uring had almost all the same bugs (and then some), and what the
>>> io_uring worker threads ended up doing was to basically zoom in on
>>> "this works".
>>>
>>> And it zoomed in on it largely by just going for "make it look as much
>>> as possible as a real user thread", because every time the kernel
>>> thread did something different, it just caused problems.
>>>
>>> So I think the patch should just look something like the attached.
>>> Mike, can you test this on whatever vhost test-suite?
>>
>> I tried that approach already and it doesn't work, because io_uring and
>> vhost differ in that vhost drivers implement a device where each device
>> has a vhost_task and the drivers have a file_operations for the device.
>> When the vhost_task's parent gets a signal like SIGKILL, then it will
>> exit and call into the vhost driver's file_operations->release function.
>> At this time, we need to do cleanup
>
> But that's no reason why the vhost worker couldn't just be allowed to
> exit on SIGKILL cleanly, similar to io_uring. That's just describing the
> current architecture, which isn't a necessity afaict. And the helper
> thread could e.g., crash.
>
>> like flush the device which uses the vhost_task. There is also the case
>> where if the vhost_task gets a SIGKILL, we can just exit from under the
>> vhost layer.
>
> In a way I really don't like the patch below. Because this should be
> solvable by adapting vhost workers. Right now, vhost is coming from a
> kthread model and we ported it to a user worker model, and the whole
> point of this exercise has been that the workers behave more like
> regular userspace processes. So my tendency is to not massage kernel
> signal handling to now also include a special case for user workers in
> addition to kthreads. That's just the wrong way around, and then vhost
> could've just stuck with kthreads in the first place.

I would have preferred that :)

Maybe let's take a step back and revisit that decision to make sure it was
right. The vhost layer wants:

1. to inherit cgroups.
2. to share the mm.
3. no signals.
4. to not show up as an extra process like in Nicolas's bug.
5. to have its worker threads counted under its parent's nproc limit.

We can do 1 - 4 today with kthreads. Can we do #5 with kthreads? My first
attempt, which passed around the creds to use for kthreads or exported a
helper for the nproc accounting, was not liked, and we eventually ended up
here.

Is this hybrid user/kernel thread/task still the right way to go, or is it
better to use kthreads and add some way to handle #5?

> So I'm fine with skipping over the freezing case for now, but SIGKILL
> should be handled imho. Only init and kthreads should get the luxury of
> ignoring SIGKILL.
>
> So, I'm afraid I'm asking some work here of you, but how feasible would a
> model be where vhost_worker(), similar to io_wq_worker(), gracefully
> handles SIGKILL. Yes, I see there's
>
> net.c:   .release = vhost_net_release
> scsi.c:  .release = vhost_scsi_release
> test.c:  .release = vhost_test_release
> vdpa.c:  .release = vhost_vdpa_release
> vsock.c: .release = virtio_transport_release
> vsock.c: .release = vhost_vsock_dev_release
>
> but that means you have all the basic logic in place, and all of those
> drivers also support the VHOST_RESET_OWNER ioctl, which also stops the
> vhost worker. I'm confident that a lot of this can be leveraged to just
> clean up on SIGKILL.

We can do this, but the issue I'm worried about is that right now, if
there is queued/running IO and userspace escalates to SIGKILL, then the
vhost layer will still complete those IOs. If we now allow SIGKILL on the
vhost thread, then those IOs might fail.

If we get a SIGKILL, I can modify vhost_worker() so that it temporarily
ignores the signal and allows IO/flushes/whatever-operations to complete
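One possible shape for the drain-then-exit behavior Mike describes is
sketched below. This is hypothetical kernel-style code, not a tested patch:
the get_signal() usage is modeled on io_uring's workers, and the names,
fields, and loop structure are assumptions layered over the vhost_worker()
shown earlier in the thread.

	static int vhost_worker(void *data)
	{
		struct vhost_worker *worker = data;
		struct llist_node *node;
		bool dead = false;

		for (;;) {
			/*
			 * On a fatal signal, stop picking up new work, but
			 * keep draining what was already queued so in-flight
			 * IO still completes; final teardown stays in the
			 * driver's file_operations->release() path.
			 */
			if (!dead && signal_pending(current)) {
				struct ksignal ksig;

				if (get_signal(&ksig))
					dead = true;
			}

			node = llist_del_all(&worker->work_list);
			if (!node && dead)
				break;

			/* ... run queued vhost_work items, or schedule() ... */
		}
		return 0;
	}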
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 05/15, Mike Christie wrote:
>
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,13 +75,13 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
> +				  CLONE_THREAD | CLONE_SIGHAND,

I am looking at 6/8 on https://lore.kernel.org/lkml/ ...

with this change kernel_wait4() in vhost_task_stop() won't work?

Oleg.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 05/15, Mike Christie wrote:
>
> Oleg and Christian,
>
> Below is an updated patch that doesn't check for PF_USER_WORKER in the
> signal.c code and instead will check for if we have blocked the signal.

Looks like I need to read the whole series... will try tomorrow.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
>  		p->flags |= PF_KTHREAD;
> -	if (args->user_worker)
> +	if (args->user_worker) {
> +		/*
> +		 * User workers are similar to io_threads but they do not
> +		 * support signals and cleanup is driven via another kernel
> +		 * interface so even SIGKILL is blocked.
> +		 */
>  		p->flags |= PF_USER_WORKER;
> +		siginitsetinv(&p->blocked, 0);

I never liked the fact that io-threads block the signals, this adds
another precedent... OK, this needs another discussion.

> +static void try_set_pending_sigkill(struct task_struct *t)
> +{
> +	/*
> +	 * User workers don't support signals and their exit is driven through
> +	 * their kernel layer, so by default block even SIGKILL.
> +	 */
> +	if (sigismember(&t->blocked, SIGKILL))
> +		return;
> +
> +	sigaddset(&t->pending.signal, SIGKILL);
> +	signal_wake_up(t, 1);
> +}

so why do you need this? to avoid fatal_signal_pending() or
signal_pending()? In the latter case this change is not enough.

Oleg.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 5:54 PM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 3:23 PM Mike Christie wrote:
>>
>> The vhost layer really doesn't want any signals and wants to work like
>> kthreads for that case. To make it really simple, can we do something
>> like this where it separates user and io worker behavior, where the
>> major diff is how they handle signals and exit? I also included a fix
>> for the freeze case:
>
> I don't love the SIGKILL special case, but I also don't find this
> deeply offensive. So if this is what it takes, I'm ok with it.
>
> I wonder if we could make that special case simply check for "is
> SIGKILL blocked" instead? No normal case will cause that, and it means

Yeah, it's doable. Updated below.

> that a PF_USER_WORKER thread could decide per-thread what it wants to
> do wrt SIGKILL.
>
> Christian? And I guess we should Cc: Oleg too, since the signal parts
> are an area he's familiar with and has worked on.. Eric Biederman has
> already been on the list and has also been involved
>
> Oleg: see
>
> https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d...@oracle.com/
>
> for the context here.

Oleg and Christian,

Below is an updated patch that doesn't check for PF_USER_WORKER in the
signal.c code and instead will check for if we have blocked the signal.

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->user_worker)
+	if (args->user_worker) {
+		/*
+		 * User workers are similar to io_threads but they do not
+		 * support signals and cleanup is driven via another kernel
+		 * interface so even SIGKILL is blocked.
+		 */
 		p->flags |= PF_USER_WORKER;
+		siginitsetinv(&p->blocked, 0);
+	}
 	if (args->io_thread) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
 		 */
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;

-	if (args->ignore_signals)
-		ignore_signals(p);
+	if (args->user_worker)
+		p->flags |= PF_NOFREEZE;

 	stackleak_task_init(p);

@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn		= fn,
 		.fn_arg		= arg,
 		.io_thread	= 1,
-		.user_worker	= 1,
 	};

 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }

+static void try_set_pending_sigkill(struct task_struct *t)
+{
+	/*
+	 * User workers don't support signals and their exit is driven through
+	 * their kernel layer, so by default block even SIGKILL.
+	 */
+	if (sigismember(&t->blocked, SIGKILL))
+		return;
+
+	sigaddset(&t->pending.signal, SIGKILL);
+	signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
 	struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 		t = p;
 		do {
 			task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-			sigaddset(&t->pending.signal, SIGKILL);
-			signal_wake_up(t, 1);
+			try_set_pending_sigkill(t);
 		} while_each_thread(p, t);
 		return;
 	}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
 		/* Don't bother with already dead threads */
 		if (t->exit_state)
 			continue;
-		sigaddset(&t->pending.signal, SIGKILL);
-		signal_wake_up(t, 1);
+		try_set_pending_sigkill(t);
 	}

 	return count;
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2d8d3ebaec4d 100644
---
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 3:23 PM Mike Christie wrote:
>
> The vhost layer really doesn't want any signals and wants to work like
> kthreads for that case. To make it really simple can we do something like
> this where it separates user and io worker behavior where the major diff
> is how they handle signals and exit. I also included a fix for the freeze
> case:

I don't love the SIGKILL special case, but I also don't find this
deeply offensive. So if this is what it takes, I'm ok with it.

I wonder if we could make that special case simply check for "is
SIGKILL blocked" instead? No normal case will cause that, and it means
that a PF_USER_WORKER thread could decide per-thread what it wants to
do wrt SIGKILL.

Christian? And I guess we should Cc: Oleg too, since the signal parts
are an area he's familiar with and has worked on.. Eric Biederman has
already been on the list and has also been involved

Oleg: see

https://lore.kernel.org/lkml/122b597e-a5fa-daf7-27bb-6f04fa98d...@oracle.com/

for the context here.

Linus
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 10:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io_uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
> 	if (args->io_thread) {
> 		/*
> 		 * Mark us an IO worker, and block any signal that isn't
> 		 * fatal or STOP
> 		 */
> 		p->flags |= PF_IO_WORKER;
> 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
> 	}
>
> and I really think that vhost should basically do exactly what io_uring
> does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.
>
> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

I tried that approach already and it doesn't work, because io_uring and
vhost differ in that vhost drivers implement a device where each device
has a vhost_task and the drivers have a file_operations for the device.
When the vhost_task's parent gets a signal like SIGKILL, then it will exit
and call into the vhost driver's file_operations->release function. At
this time, we need to do cleanup like flush the device which uses the
vhost_task. There is also the case where if the vhost_task gets a SIGKILL,
we can just exit from under the vhost layer.

io_uring has a similar cleanup issue where the core kernel code can't do
exit for the io thread, but it only has that one point that it has to
worry about, so when it gets SIGKILL it can clean itself up then exit.

So the patch in the other mail hits an issue where vhost_worker() can get
into a tight loop hammering the CPU due to the pending SIGKILL signal.

The vhost layer really doesn't want any signals and wants to work like
kthreads for that case. To make it really simple, can we do something like
this where it separates user and io worker behavior, where the major diff
is how they handle signals and exit? I also included a fix for the freeze
case:

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
-	u32 ignore_signals:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
 		p->flags |= PF_KTHREAD;
-	if (args->user_worker)
+	if (args->user_worker) {
+		/*
+		 * User workers are similar to io_threads but they do not
+		 * support signals and cleanup is driven via another kernel
+		 * interface so even SIGKILL is blocked.
+		 */
 		p->flags |= PF_USER_WORKER;
+		siginitsetinv(&p->blocked, 0);
+	}
 	if (args->io_thread) {
 		/*
 		 * Mark us an IO worker, and block any signal that isn't
 		 * fatal or STOP
 		 */
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;

-	if (args->ignore_signals)
-		ignore_signals(p);
+	if (args->user_worker)
+		p->flags |= PF_NOFREEZE;

 	stackleak_task_init(p);

@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn		= fn,
 		.fn_arg		= arg,
 		.io_thread	= 1,
-		.user_worker	= 1,
 	};

 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..f2f1e5ef44b2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
 	return task_curr(p) || !task_sigpending(p);
 }

+static void try_set_pending_sigkill(struct task_struct *t)
+{
+	/*
+	 * User workers don't support signals and their exit is driven
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 8:54 AM Linus Torvalds wrote:
>
> Blush. I decided to build-test it, and then forgot to attach it. Here.

Btw, if this tests out good and we end up doing this, I think we should
also just rename that '.ignore_signals' bitfield to '.block_signals' to
actually match what it does.

But that's an entirely cosmetic thing just to clarify things. The patch
would look almost identical, apart from the new name (and the small
additional parts to rename the two existing users that weren't touched by
that patch - the header file and the vhost use-case).

Linus
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 8:52 AM Jens Axboe wrote:
>
> Only potential downside is that it does make file references more
> expensive for other syscalls, since you now have a shared file table.
> But probably not something to worry about here?

Would the vhost user worker processes ever be otherwise single-threaded?
I'd *assume* that a vhost user is already doing its own threads. But maybe
that's a completely bogus assumption. I don't actually use any of this,
so...

Because you are obviously 100% right that if you're otherwise
single-threaded, then a CLONE_FILES kernel helper thread will cause the
extra cost for file descriptor lookup/free due to all the race prevention.

Linus
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 8:52 AM Jens Axboe wrote:
>
> On 5/15/23 9:44 AM, Linus Torvalds wrote:
>>
>> So I think the patch should just look something like the attached.
>> Mike, can you test this on whatever vhost test-suite?
>
> Seems like that didn't get attached...

Blush. I decided to build-test it, and then forgot to attach it. Here.

Linus

 kernel/fork.c       | 12 +++-
 kernel/vhost_task.c |  3 ++-
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..cd06b137418f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
 		p->flags |= PF_KTHREAD;
 	if (args->user_worker)
 		p->flags |= PF_USER_WORKER;
-	if (args->io_thread) {
-		/*
-		 * Mark us an IO worker, and block any signal that isn't
-		 * fatal or STOP
-		 */
+	if (args->io_thread)
 		p->flags |= PF_IO_WORKER;
+	if (args->ignore_signals)
 		siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-	}

 	if (args->name)
 		strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;

-	if (args->ignore_signals)
-		ignore_signals(p);
-
 	stackleak_task_init(p);

 	if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 		.fn_arg		= arg,
 		.io_thread	= 1,
 		.user_worker	= 1,
+		.ignore_signals	= 1,
 	};

 	return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..2e334b2d7cc4 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +75,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+				  CLONE_THREAD | CLONE_SIGHAND,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/15/23 9:44 AM, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 7:23 AM Christian Brauner wrote:
>>
>> So I think we will be able to address (1) and (2) by making vhost tasks
>> proper threads and blocking every signal except for SIGKILL and SIGSTOP
>> and then having vhost handle get_signal() - as you mentioned - the same
>> way io uring already does. We should also remove the ignore_signals
>> thing completely imho. I don't think we ever want to do this with user
>> workers.
>
> Right. That's what IO_URING does:
>
>         if (args->io_thread) {
>                 /*
>                  * Mark us an IO worker, and block any signal that isn't
>                  * fatal or STOP
>                  */
>                 p->flags |= PF_IO_WORKER;
>                 siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
>         }
>
> and I really think that vhost should basically do exactly what io_uring does.
>
> Not because io_uring fundamentally got this right - but simply because
> io_uring had almost all the same bugs (and then some), and what the
> io_uring worker threads ended up doing was to basically zoom in on
> "this works".
>
> And it zoomed in on it largely by just going for "make it look as much
> as possible as a real user thread", because every time the kernel
> thread did something different, it just caused problems.

This is exactly what I told Christian in a private chat too - we went
through all of that, and this is what works. KISS.

> So I think the patch should just look something like the attached.
> Mike, can you test this on whatever vhost test-suite?

Seems like that didn't get attached...

> I did consider getting rid of ".ignore_signals" entirely, and instead
> just keying the "block signals" behavior off the ".user_worker" flag.
> But this approach doesn't seem wrong either, and I don't think it's
> wrong to make the create_io_thread() function say that
> ".ignore_signals = 1" thing explicitly, rather than key it off the
> ".io_thread" flag.
>
> Jens/Christian - comments?
>
> Slightly related to this all: I think vhost should also do
> CLONE_FILES, and get rid of the whole ".no_files" thing. Again, if
> vhost doesn't use any files, it shouldn't matter, and looking
> different just to be different is wrong. But if vhost doesn't use any
> files, the current situation shouldn't be a bug either.

Only potential downside is that it does make file references more
expensive for other syscalls, since you now have a shared file table.
But probably not something to worry about here?

-- 
Jens Axboe
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Mon, May 15, 2023 at 7:23 AM Christian Brauner wrote:
>
> So I think we will be able to address (1) and (2) by making vhost tasks
> proper threads and blocking every signal except for SIGKILL and SIGSTOP
> and then having vhost handle get_signal() - as you mentioned - the same
> way io uring already does. We should also remove the ignore_signals
> thing completely imho. I don't think we ever want to do this with user
> workers.

Right. That's what IO_URING does:

        if (args->io_thread) {
                /*
                 * Mark us an IO worker, and block any signal that isn't
                 * fatal or STOP
                 */
                p->flags |= PF_IO_WORKER;
                siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
        }

and I really think that vhost should basically do exactly what io_uring
does.

Not because io_uring fundamentally got this right - but simply because
io_uring had almost all the same bugs (and then some), and what the
io_uring worker threads ended up doing was to basically zoom in on
"this works".

And it zoomed in on it largely by just going for "make it look as much
as possible as a real user thread", because every time the kernel thread
did something different, it just caused problems.

So I think the patch should just look something like the attached.
Mike, can you test this on whatever vhost test-suite?

I did consider getting rid of ".ignore_signals" entirely, and instead
just keying the "block signals" behavior off the ".user_worker" flag.
But this approach doesn't seem wrong either, and I don't think it's
wrong to make the create_io_thread() function say that
".ignore_signals = 1" thing explicitly, rather than key it off the
".io_thread" flag.

Jens/Christian - comments?

Slightly related to this all: I think vhost should also do CLONE_FILES,
and get rid of the whole ".no_files" thing. Again, if vhost doesn't use
any files, it shouldn't matter, and looking different just to be
different is wrong. But if vhost doesn't use any files, the current
situation shouldn't be a bug either.

            Linus
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Sat, May 13, 2023 at 7:39 AM Thorsten Leemhuis wrote:
>
> Jumping in here, as I found another problem with that patch: it broke
> s2idle on my laptop when a qemu-kvm VM is running, as freezing user
> space processes now fails for me:

Hmm. kthreads have PF_NOFREEZE by default, which is probably the reason.

Adding

        current->flags |= PF_NOFREEZE;

to the vhost_task setup might just fix it, but it feels a bit off.

The way io_uring does this is to do

        if (signal_pending(current)) {
                struct ksignal ksig;

                if (!get_signal(&ksig))
                        continue;
                break;
        }

in the main loop, which ends up handling the freezer situation too. But
it should handle things like SIGSTOP etc as well, and also exit on
actual signals.

I get the feeling that the whole "vhost_task_should_stop()" logic should
have the exact logic above, and basically make those threads killable as
well.

Hmm?

            Linus
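[To make that concrete, here is a rough sketch of what the vhost task main
loop would look like with Linus's suggestion folded in. This is hypothetical
code modeled on the io_uring snippet above, not the final fix; the name
vhost_worker_sketch() and the elided work-dispatch step are assumptions.]

```c
#include <linux/sched/signal.h>
#include <linux/sched/vhost_task.h>

/*
 * Hypothetical rework of the vhost task main loop along the lines
 * described above. get_signal() participates in the freezer and
 * handles SIGSTOP, so the task no longer blocks s2idle; for a fatal
 * signal it returns true and we exit, which makes the thread killable.
 */
static int vhost_worker_sketch(void *data)
{
	struct vhost_task *vtsk = data;

	for (;;) {
		if (signal_pending(current)) {
			struct ksignal ksig;

			/* Freezer/SIGSTOP handled inside get_signal(). */
			if (!get_signal(&ksig))
				continue;
			break;	/* fatal signal: exit the worker */
		}

		if (vhost_task_should_stop(vtsk))
			break;

		/* ... dequeue and run pending vhost work items here ... */
	}
	return 0;
}
```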
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
[CCing the regression list]

On 06.05.23 00:37, Mike Christie wrote:
> On 5/5/23 1:22 PM, Linus Torvalds wrote:
>> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel wrote:
>>>
>>> Is this an intended behavior?
>>> This breaks some of our scripts.

Jumping in here, as I found another problem with that patch: it broke
s2idle on my laptop when a qemu-kvm VM is running, as freezing user
space processes now fails for me:

```
[  195.442949] PM: suspend entry (s2idle)
[  195.641271] Filesystems sync: 0.198 seconds
[  195.833828] Freezing user space processes
[  215.841084] Freezing user space processes failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
[  215.841255] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x4006
[  215.841264] Call Trace:
[  215.841266]  <TASK>
[  215.841270]  ? update_rq_clock+0x39/0x270
[  215.841283]  ? _raw_spin_unlock+0x19/0x40
[  215.841290]  ? __schedule+0x3f/0x1510
[  215.841296]  ? sysvec_apic_timer_interrupt+0xaf/0xd0
[  215.841306]  ? schedule+0x61/0xe0
[  215.841313]  ? vhost_worker+0x87/0xb0 [vhost]
[  215.841329]  ? vhost_task_fn+0x1a/0x30
[  215.841336]  ? __pfx_vhost_task_fn+0x10/0x10
[  215.841341]  ? ret_from_fork+0x2c/0x50
[  215.841352]  </TASK>
[  215.841936] OOM killer enabled.
[  215.841938] Restarting tasks ... done.
[  215.844204] random: crng reseeded on system resumption
[  215.957095] PM: suspend exit
[  215.957185] PM: suspend entry (s2idle)
[  215.967646] Filesystems sync: 0.010 seconds
[  215.971326] Freezing user space processes
[  235.974400] Freezing user space processes failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
[  235.974574] task:vhost-3221 state:R stack:0 pid:3250 ppid:3221 flags:0x4806
[  235.974583] Call Trace:
[  235.974586]  <TASK>
[  235.974593]  ? __schedule+0x184/0x1510
[  235.974605]  ? sysvec_apic_timer_interrupt+0xaf/0xd0
[  235.974616]  ? schedule+0x61/0xe0
[  235.974624]  ? vhost_worker+0x87/0xb0 [vhost]
[  235.974648]  ? vhost_task_fn+0x1a/0x30
[  235.974656]  ? __pfx_vhost_task_fn+0x10/0x10
[  235.974662]  ? ret_from_fork+0x2c/0x50
[  235.974673]  </TASK>
[  235.975190] OOM killer enabled.
[  235.975192] Restarting tasks ... done.
[  235.978131] random: crng reseeded on system resumption
[  236.091219] PM: suspend exit
```

After running into the problem I booted 6.3.1-rc1 again and there s2idle
still worked. Didn't do a bisection, just looked at the vhost commits
during the latest merge window; 6e890c5d502 ("vhost: use vhost_tasks for
worker threads") looked suspicious, so I reverted it on top of latest
mainline and then things work again. Through a search on lore I arrived
in this thread and found below patch from Mike. Gave it a try on top of
latest mainline, but it didn't help.

Ciao, Thorsten

> [...]
> If it's ok to change the behavior of "ps -u root", then we can do this
> patch: (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't
> replicate the 'ps' case. If you could test the ps only case or give me
> info on what /usr/bin/example was doing I can replicate and test here):
>
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..eb9ffc58e211 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> +	 *
> +	 * A userworker's parent thread will normally have a signal handler
> +	 * that performs management operations, but the worker will not
> +	 * because the parent will handle the signal then use a worker
> +	 * specific interface to manage the thread and related resources.
>  	 */
> -	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
> +	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
> +	    !args->user_worker && !args->ignore_signals)
>  		return ERR_PTR(-EINVAL);
>
>  	/*
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..3700c21ea39d 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
>  				     const char *name)
>  {
>  	struct kernel_clone_args args = {
> -		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
> +		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
> +				  CLONE_UNTRACED,
>  		.exit_signal	= 0,
>  		.fn		= vhost_task_fn,
>  		.name		= name
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Fri, May 5, 2023 at 3:38 PM Mike Christie wrote:
>
> If it's ok to change the behavior of "ps -u root", then we can do this patch:

I think this is the right thing to do.

Making the user worker threads show up as threads with the vhost process
as the parent really seems like a much better model, and more accurate.

Yes, they used to show up as random kernel threads, and you'd see them
as such (not just for "ps -u root", but simply also with just a normal
"ps ax" kind of thing). But that isn't all that helpful, and it's really
just annoying to see our kernel threads in "ps ax" output, and I've
often wished we didn't do that (think of all the random
"kworker/xyz-kcryptd" etc things that show up).

So I think showing them as the threaded children of the vhost process is
much nicer, and probably the best option.

Because I don't think anything is going to get the *old* behavior of
showing them as the '[vhost-xyz]' system threads (or whatever the old
output ended up being in 'ps ax'), but hopefully nothing wants that
horror anyway. At a minimum, the parenting is fundamentally going to
look different in the new model.

            Linus
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/5/23 1:22 PM, Linus Torvalds wrote:
> On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel wrote:
>>
>> Is this an intended behavior?
>> This breaks some of our scripts.
>
> It doesn't just break your scripts (which counts as a regression), I
> think it's really wrong.
>
> The worker threads should show up as threads of the thing that started
> them, not as processes.
>
> So they should show up in 'ps' only when one of the "show threads"
> flags is set.
>
> But I suspect the fix is trivial: the virtio code should likely use
> CLONE_THREAD for the copy_process() it does.
>
> It should look more like "create_io_thread()" than "copy_process()", I
> think.
>
> For example, do virtio worker threads really want their own signals
> and files? That sounds wrong. create_io_thread() uses all of
>
>     CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO
>
> to share much more of the context with the process it is actually run
> within.

For the vhost tasks and the CLONE flags:

1. I didn't use CLONE_FILES in the vhost task patches because you are
right and we didn't need our own. We needed it to work like kthreads
where there are no files, so I set the kernel_clone_args.no_files bit
to have copy_files not do a dup or clone (task->files is NULL); see the
copy_files() sketch after this mail.

2. vhost tasks didn't use CLONE_SIGHAND, because userspace apps like
qemu use signals for management operations. But, the vhost thread's
worker functions assume signals are ignored like they were with
kthreads. So if they were doing IO and got a signal like a SIGHUP they
might return early and fail from whatever network/block function they
were calling. And currently the parent like qemu handles something like
a SIGSTOP by shutting everything down by calling into the vhost
interface to remove the device.

So similar to files I used the kernel_clone_args.ignore_signals bit so
copy_process has the vhost thread have its own signal handler that just
ignores signals.

3. I didn't use CLONE_THREAD because before my patches you could do
"ps -u root" and see all the vhost threads. If we use CLONE_THREAD,
then we can only see it when we do something like "ps -T -p $parent"
like you mentioned above. I guess I messed up and did the reverse and
thought it would be a regression if "ps -u root" no longer showed the
vhost threads.

If it's ok to change the behavior of "ps -u root", then we can do this
patch: (Nicolas, I confirmed it fixes the 'ps a' case, but couldn't
replicate the 'ps' case. If you could test the ps only case or give me
info on what /usr/bin/example was doing I can replicate and test here):


diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..eb9ffc58e211 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2269,8 +2269,14 @@ __latent_entropy struct task_struct *copy_process(
 	/*
 	 * Thread groups must share signals as well, and detached threads
 	 * can only be started up within the thread group.
+	 *
+	 * A userworker's parent thread will normally have a signal handler
+	 * that performs management operations, but the worker will not
+	 * because the parent will handle the signal then use a worker
+	 * specific interface to manage the thread and related resources.
 	 */
-	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
+	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND) &&
+	    !args->user_worker && !args->ignore_signals)
 		return ERR_PTR(-EINVAL);
 
 	/*
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..3700c21ea39d 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -75,7 +78,8 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg,
 				     const char *name)
 {
 	struct kernel_clone_args args = {
-		.flags		= CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+		.flags		= CLONE_FS | CLONE_THREAD | CLONE_VM |
+				  CLONE_UNTRACED,
 		.exit_signal	= 0,
 		.fn		= vhost_task_fn,
 		.name		= name,
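[For context on point 1 above: a rough, abridged sketch of the copy_files()
path Mike describes, close to what kernel/fork.c did around this series but
with the dup_fd() error handling simplified; treat the exact signature and
details as assumptions, not the literal kernel code.]

```c
/*
 * Abridged sketch of copy_files() with the no_files bit: vhost tasks
 * pass no_files so the new task gets no file table at all
 * (kthread-like), io_uring workers pass CLONE_FILES to share the
 * parent's table, and an ordinary fork falls through to duplicating it.
 */
static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
		      int no_files)
{
	struct files_struct *oldf = current->files;
	int error = 0;

	if (!oldf)
		return 0;

	if (no_files) {
		tsk->files = NULL;		/* kthread-like: no fd table */
		return 0;
	}

	if (clone_flags & CLONE_FILES) {
		atomic_inc(&oldf->count);	/* share the parent's table */
		tsk->files = oldf;
		return 0;
	}

	tsk->files = dup_fd(oldf, NR_OPEN_MAX, &error);	/* private copy */
	return error;
}
```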
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On Fri, May 5, 2023 at 6:40 AM Nicolas Dichtel wrote:
>
> Is this an intended behavior?
> This breaks some of our scripts.

It doesn't just break your scripts (which counts as a regression), I
think it's really wrong.

The worker threads should show up as threads of the thing that started
them, not as processes.

So they should show up in 'ps' only when one of the "show threads" flags
is set.

But I suspect the fix is trivial: the virtio code should likely use
CLONE_THREAD for the copy_process() it does.

It should look more like "create_io_thread()" than "copy_process()", I
think.

For example, do virtio worker threads really want their own signals and
files? That sounds wrong. create_io_thread() uses all of

    CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_IO

to share much more of the context with the process it is actually run
within.

Christian? Mike?

            Linus
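[For reference, create_io_thread() in kernel/fork.c at the time looked
roughly like the following - lightly abridged from the kernel source of that
era, so treat the exact details as approximate.]

```c
/*
 * Roughly the create_io_thread() being referred to: an io_uring worker
 * shares fs, files, signals, the thread group and IO context with its
 * owner, so it behaves like a regular user thread of that process
 * rather than a free-standing kernel thread.
 */
struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
{
	unsigned long flags = CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
			      CLONE_THREAD | CLONE_IO;
	struct kernel_clone_args args = {
		.flags		= ((lower_32_bits(flags) | CLONE_VM |
				    CLONE_UNTRACED) & ~CSIGNAL),
		.exit_signal	= (lower_32_bits(flags) & CSIGNAL),
		.fn		= fn,
		.fn_arg		= arg,
		.io_thread	= 1,
		.user_worker	= 1,
	};

	return copy_process(NULL, 0, node, &args);
}
```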
[PATCH v11 8/8] vhost: use vhost_tasks for worker threads
For vhost workers we use the kthread API which inherits its values from
and checks against the kthreadd thread. This results in the wrong RLIMITs
being checked, so while tools like libvirt try to control the number of
threads based on the nproc rlimit setting we can end up creating more
threads than the user wanted.

This patch has us use the vhost_task helpers which will inherit its
values/checks from the thread that owns the device similar to if we did
a clone in userspace. The vhost threads will now be counted in the nproc
rlimits. And we get features like cgroups and mm sharing automatically,
so we can remove those calls.

Signed-off-by: Mike Christie
Acked-by: Michael S. Tsirkin
---
 drivers/vhost/vhost.c | 58 ++++++++++----------------------------------
 drivers/vhost/vhost.h |  4 ++--
 2 files changed, 13 insertions(+), 49 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 74378d241f8d..d3c7c37b69a7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -22,11 +22,11 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -256,7 +256,7 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 		 * test_and_set_bit() implies a memory barrier.
 		 */
 		llist_add(&work->node, &dev->worker->work_list);
-		wake_up_process(dev->worker->task);
+		wake_up_process(dev->worker->vtsk->task);
 	}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -336,17 +336,14 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
 	struct vhost_worker *worker = data;
-	struct vhost_dev *dev = worker->dev;
 	struct vhost_work *work, *work_next;
 	struct llist_node *node;
 
-	kthread_use_mm(dev->mm);
-
 	for (;;) {
 		/* mb paired w/ kthread_stop */
 		set_current_state(TASK_INTERRUPTIBLE);
 
-		if (kthread_should_stop()) {
+		if (vhost_task_should_stop(worker->vtsk)) {
 			__set_current_state(TASK_RUNNING);
 			break;
 		}
@@ -368,7 +365,7 @@ static int vhost_worker(void *data)
 			schedule();
 		}
 	}
-	kthread_unuse_mm(dev->mm);
+
 	return 0;
 }
 
@@ -509,31 +506,6 @@ long vhost_dev_check_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_check_owner);
 
-struct vhost_attach_cgroups_struct {
-	struct vhost_work work;
-	struct task_struct *owner;
-	int ret;
-};
-
-static void vhost_attach_cgroups_work(struct vhost_work *work)
-{
-	struct vhost_attach_cgroups_struct *s;
-
-	s = container_of(work, struct vhost_attach_cgroups_struct, work);
-	s->ret = cgroup_attach_task_all(s->owner, current);
-}
-
-static int vhost_attach_cgroups(struct vhost_dev *dev)
-{
-	struct vhost_attach_cgroups_struct attach;
-
-	attach.owner = current;
-	vhost_work_init(&attach.work, vhost_attach_cgroups_work);
-	vhost_work_queue(dev, &attach.work);
-	vhost_dev_flush(dev);
-	return attach.ret;
-}
-
 /* Caller should have device mutex */
 bool vhost_dev_has_owner(struct vhost_dev *dev)
 {
@@ -580,14 +552,14 @@ static void vhost_worker_free(struct vhost_dev *dev)
 
 	dev->worker = NULL;
 	WARN_ON(!llist_empty(&worker->work_list));
-	kthread_stop(worker->task);
+	vhost_task_stop(worker->vtsk);
 	kfree(worker);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
 	struct vhost_worker *worker;
-	struct task_struct *task;
+	struct vhost_task *vtsk;
 	int ret;
 
 	worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
@@ -595,27 +567,19 @@ static int vhost_worker_create(struct vhost_dev *dev)
 		return -ENOMEM;
 
 	dev->worker = worker;
-	worker->dev = dev;
 	worker->kcov_handle = kcov_common_handle();
 	init_llist_head(&worker->work_list);
 
-	task = kthread_create(vhost_worker, worker, "vhost-%d", current->pid);
-	if (IS_ERR(task)) {
-		ret = PTR_ERR(task);
+	vtsk = vhost_task_create(vhost_worker, worker, NUMA_NO_NODE);
+	if (!vtsk) {
+		ret = -ENOMEM;
 		goto free_worker;
 	}
 
-	worker->task = task;
-	wake_up_process(task); /* avoid contributing to loadavg */
-
-	ret = vhost_attach_cgroups(dev);
-	if (ret)
-		goto stop_worker;
-
+	worker->vtsk = vtsk;
+	vhost_task_start(vtsk, "vhost-%d", current->pid);
 	return 0;
 
-stop_worker:
-	kthread_stop(worker->task);
 free_worker:
 	kfree(worker);
 	dev->worker = NULL;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 2f6beab93784..3af59c65025e 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -16,6 +16,7 @@
 #include
 
 struct vhost_work;
+struct vhost_task;
 typedef