Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
On Mon, 2018-01-29 at 14:33 +0800, Lai Jiangshan wrote:
> On Mon, Jan 29, 2018 at 12:41 PM, Mike Galbraith wrote:
> > On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
> > > I think adding priority boost to workqueue (flush_work()) is the best
> > > way to fix the problem.
> >
> > I disagree, priority boosting is needlessly invasive, takes control out
> > of the user's hands. The kernel wanting to run a workqueue does not
> > justify perturbing the user's critical task.
>
> The kworkers don't belong to any user; it would be really needlessly
> invasive if we gave any user the ability to control the priority of
> the kworkers.

In a scenario where the box is being saturated by RT, every last bit of the box is likely in the (hopefully capable) hands of a solo box pilot. With a prio-boosting scheme, which user gets to choose the boost priority for the global resource?

> If the user's critical task calls flush_work(), the critical task
> should boost one responsible kworker: the kworker scheduled for the
> work item, or the first idle kworker, or the manager kworker. (In the
> latter two cases the kworker changes, so the boost needs to migrate
> to a new kworker when needed.)
>
> The boosted work items need to be moved to a prio list in the pool
> too, for the boosted kworker to pick them up.

Does userspace know which of its actions are wired up to what kernel mechanism? Are new workers never spawned, stepping on any prioritization userspace does?

I don't want to argue about it, really; I'm just expressing my opinion on the matter. I have a mechanism in place to let users safely do whatever they like, have had it for years, and it's not going anywhere. That mechanism was born from the needs of users, not mine. First came a user with a long-stable product that suddenly ceased to function due to workqueues learning to spawn new threads, then came a few cases where users were absolutely convinced that they really, really did need to be able to safely saturate.

I could have said "tough titty, adapt your product to use a dedicated kthread" to the one, and "no, you just think you need to do that" to the others, but I'm not (quite) that arrogant, and gave them the control they wanted instead.

-Mike
Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
On Mon, Jan 29, 2018 at 12:41 PM, Mike Galbraith wrote:
> On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
> > I think adding priority boost to workqueue (flush_work()) is the best
> > way to fix the problem.
>
> I disagree, priority boosting is needlessly invasive, takes control out
> of the user's hands. The kernel wanting to run a workqueue does not
> justify perturbing the user's critical task.

The kworkers don't belong to any user; it would be really needlessly invasive if we gave any user the ability to control the priority of the kworkers.

If the user's critical task calls flush_work(), the critical task should boost one responsible kworker: the kworker scheduled for the work item, or the first idle kworker, or the manager kworker. (In the latter two cases the kworker changes, so the boost needs to migrate to a new kworker when needed.)

The boosted work items need to be moved to a prio list in the pool too, for the boosted kworker to pick them up.

> I think "give userspace rope" is always the best option, how rope is
> used is none of our business. Giving the user a means to draw a simple
> line in the sand, above which they run only critical stuff, below
> which, they can do whatever they want, sane in our opinions or not,
> lets users do whatever craziness they want/need to do, and puts the
> responsibility for consequences squarely on the right set of shoulders.
>
> -Mike
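[Editor's note: the proposal above (boost one responsible kworker during a flush, and move boosted work items to a prio list) can be illustrated with a toy userspace model. This is purely an illustration of the boosting idea, not kernel code; all names are hypothetical.]

```python
import heapq
import threading

class ToyPool:
    """Toy model of a worker pool with a priority-ordered work list.

    boost() mimics the proposed prio list for boosted work items: a
    flushing task promotes the pending item it is waiting on, so the
    next worker to run picks it up before everything else.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._queue = []   # heap of (priority, seq, fn, done_event)
        self._seq = 0      # tie-breaker so fn/event are never compared

    def queue_work(self, fn, prio=100):
        done = threading.Event()
        with self._lock:
            heapq.heappush(self._queue, (prio, self._seq, fn, done))
            self._seq += 1
        return done

    def boost(self, done):
        # Promote the matching pending item to the highest priority,
        # as a flush_work() caller would before waiting on it.
        with self._lock:
            for i, (prio, seq, fn, d) in enumerate(self._queue):
                if d is done:
                    self._queue[i] = (0, seq, fn, d)
                    heapq.heapify(self._queue)
                    return True
        return False

    def run_one(self):
        # A worker picks the highest-priority (lowest-numbered) item.
        with self._lock:
            prio, _, fn, done = heapq.heappop(self._queue)
        fn()
        done.set()
        return prio
```

Queueing two items and boosting the second makes a worker execute it first, which is the inversion-avoidance property the proposal is after.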
Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
On Mon, 2018-01-29 at 12:15 +0800, Lai Jiangshan wrote:
> I think adding priority boost to workqueue (flush_work()) is the best
> way to fix the problem.

I disagree, priority boosting is needlessly invasive, takes control out of the user's hands. The kernel wanting to run a workqueue does not justify perturbing the user's critical task.

I think "give userspace rope" is always the best option; how the rope is used is none of our business. Giving the user a means to draw a simple line in the sand, above which they run only critical stuff, and below which they can do whatever they want, sane in our opinions or not, lets users do whatever craziness they want/need to do, and puts the responsibility for the consequences squarely on the right set of shoulders.

-Mike
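[Editor's note: the "line in the sand" above is drawn today with the standard scheduler syscalls; a minimal sketch using Python's bindings for them. The priority value 61 echoes the chrt example later in the thread, and the CPU set is illustrative; SCHED_FIFO requires root.]

```python
import os

def draw_line(pid, rt_prio=61, cpus=None):
    """Pin a critical task to dedicated CPUs and raise it to an RT
    priority above everything the user considers non-critical.
    Hypothetical helper: rt_prio and cpus are policy choices the
    user makes, not values this thread prescribes."""
    if cpus:
        os.sched_setaffinity(pid, cpus)   # e.g. cpus={2, 3}
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(rt_prio))

# Reading the current task's settings works unprivileged (Linux only):
policy = os.sched_getscheduler(0)    # 0 == SCHED_OTHER for normal tasks
affinity = os.sched_getaffinity(0)   # set of CPUs we may run on
```

Everything above the chosen RT priority is then the user's critical partition; everything below, including kworkers, competes as usual.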
Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
I think adding priority boost to workqueue (flush_work()) is the best way to fix the problem.

On Sat, Jan 27, 2018 at 1:15 PM, Wen Yang wrote:
> When pinning RT threads to specific cores using CPU affinity, the
> kworkers on the same CPU would starve, which may lead to some kind
> of priority inversion. In that case, the RT threads would also
> suffer a high performance impact.
>
> The priority inversion looks like this.
>
> CPU 0: libvirtd acquired cgroup_mutex, triggered lru_add_drain_per_cpu,
> and is waiting for all the kworkers to complete:
> PID: 44145  TASK: 8807bec7b980  CPU: 0  COMMAND: "libvirtd"
>  #0 [8807f2cbb9d0] __schedule at 816410ed
>  #1 [8807f2cbba38] schedule at 81641789
>  #2 [8807f2cbba48] schedule_timeout at 8163f479
>  #3 [8807f2cbbaf8] wait_for_completion at 81641b56
>  #4 [8807f2cbbb58] flush_work at 8109efdc
>  #5 [8807f2cbbbd0] lru_add_drain_all at 81179002
>  #6 [8807f2cbbc08] migrate_prep at 811c77be
>  #7 [8807f2cbbc18] do_migrate_pages at 811b8010
>  #8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
>  #9 [8807f2cbbd10] cpuset_attach at 810ff91e
> #10 [8807f2cbbd50] cgroup_attach_task at 810f9972
> #11 [8807f2cbbe08] attach_task_by_pid at 810fa520
> #12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
> #13 [8807f2cbbe68] cgroup_file_write at 810f8773
> #14 [8807f2cbbef8] vfs_write at 811dfdfd
> #15 [8807f2cbbf38] sys_write at 811e089f
> #16 [8807f2cbbf80] system_call_fastpath at 8164c809
>
> CPU 43: kworker/43 starved because of the RT threads:
> CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
> RT PRIO_ARRAY: 883fff3f4950
> [ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
> [ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
> [ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
> [ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
> [ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
> [ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
> [ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
> [ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
> [ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
> [ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
> [ 79] PID: 21339  TASK: 88276d48      COMMAND: "handler_5"
> [ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
> CFS RB_ROOT: 883fff3f4868
> [120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"
>
> CPU 28: systemd (victim) was blocked on cgroup_mutex:
> PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
> #0 [881fd317bd60] __schedule at 816410ed
> #1 [881fd317bdc8] schedule_preempt_disabled at 81642869
> #2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
> #3 [881fd317be38] mutex_lock at 8163f9cf
> #4 [881fd317be50] proc_cgroup_show at 810fd256
> #5 [881fd317be98] seq_read at 81203cda
> #6 [881fd317bf08] vfs_read at 811dfc6c
> #7 [881fd317bf38] sys_read at 811e07bf
> #8 [881fd317bf80] system_call_fastpath at 81
>
> The simplest way to fix that is to set the kworkers' scheduler to a
> higher RT priority, e.g.:
> chrt --fifo -p 61
> However, that cannot prevent other WORK_CPU_BOUND worker threads from
> running and starving.
>
> This patch introduces a way to set the scheduler (policy and priority)
> of a percpu worker_pool, so that the user can set a proper scheduler
> policy and priority for the worker_pool as needed, which applies to
> all the WORK_CPU_BOUND workers on the same CPU. On the other hand,
> /sys/devices/virtual/workqueue/cpumask can be used for
> WORK_CPU_UNBOUND workers to prevent them from starving.
>
> Tejun Heo suggested:
> "* Add scheduler type to wq_attrs so that unbound workqueues can be
> configured.
>
> * Rename system_wq's wq->name from "events" to "system_percpu", and
> similarly for the similarly named workqueues.
>
> * Enable wq_attrs (only the applicable part should show up in the
> interface) for system_percpu and system_percpu_highpri, and use that
> to change the attributes of the percpu pools."
>
> This patch implements the basic infrastructure and /sys interface,
> such as:
> # cat /sys/devices/virtual/workqueue/system_percpu/sched_attr
> policy=0 prio=0 nice=0
> # echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
> # cat /sys/devices/virtual/workqueue/system_percpu/sched_attr
> policy=1 prio=1 nice=0
> # cat /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
On Sat, 2018-01-27 at 10:31 +0100, Mike Galbraith wrote:
> On Sat, 2018-01-27 at 13:15 +0800, Wen Yang wrote:
> > When pinning RT threads to specific cores using CPU affinity, the
> > kworkers on the same CPU would starve, which may lead to some kind
> > of priority inversion. In that case, the RT threads would also
> > suffer a high performance impact.
>
> ...
>
> > This patch introduces a way to set the scheduler (policy and priority)
> > of a percpu worker_pool, so that the user can set a proper scheduler
> > policy and priority for the worker_pool as needed, which applies to
> > all the WORK_CPU_BOUND workers on the same CPU.
>
> What happens when a new kworker needs to be spawned? What guarantees
> that kthreadd can run? Not to mention other kthreads that can be
> starved, resulting in severe self-inflicted injury. An interface to
> configure workqueues is very nice, but it's only part of the problem.

P.S. You can also meet inversion precisely because unbound kworkers were excluded. Just yesterday, I was tracing dbench, and both varieties of kworker were involved in the chain. An RT task doing anything at all involving unbound kworkers meets an inversion the instant an unbound kworker doing work on its behalf has to wait for a SCHED_OTHER task, if that wait can in any way affect RT progress.

-Mike
Re: [RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
On Sat, 2018-01-27 at 13:15 +0800, Wen Yang wrote:
> When pinning RT threads to specific cores using CPU affinity, the
> kworkers on the same CPU would starve, which may lead to some kind
> of priority inversion. In that case, the RT threads would also
> suffer a high performance impact.

...

> This patch introduces a way to set the scheduler (policy and priority)
> of a percpu worker_pool, so that the user can set a proper scheduler
> policy and priority for the worker_pool as needed, which applies to
> all the WORK_CPU_BOUND workers on the same CPU.

What happens when a new kworker needs to be spawned? What guarantees that kthreadd can run? Not to mention other kthreads that can be starved, resulting in severe self-inflicted injury. An interface to configure workqueues is very nice, but it's only part of the problem.

-Mike
[RFC PATCH V5 5/5] workqueue: introduce a way to set workqueue's scheduler
When pinning RT threads to specific cores using CPU affinity, the kworkers on the same CPU would starve, which may lead to some kind of priority inversion. In that case, the RT threads would also suffer a high performance impact.

The priority inversion looks like this.

CPU 0: libvirtd acquired cgroup_mutex, triggered lru_add_drain_per_cpu, and is waiting for all the kworkers to complete:

PID: 44145  TASK: 8807bec7b980  CPU: 0  COMMAND: "libvirtd"
 #0 [8807f2cbb9d0] __schedule at 816410ed
 #1 [8807f2cbba38] schedule at 81641789
 #2 [8807f2cbba48] schedule_timeout at 8163f479
 #3 [8807f2cbbaf8] wait_for_completion at 81641b56
 #4 [8807f2cbbb58] flush_work at 8109efdc
 #5 [8807f2cbbbd0] lru_add_drain_all at 81179002
 #6 [8807f2cbbc08] migrate_prep at 811c77be
 #7 [8807f2cbbc18] do_migrate_pages at 811b8010
 #8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
 #9 [8807f2cbbd10] cpuset_attach at 810ff91e
#10 [8807f2cbbd50] cgroup_attach_task at 810f9972
#11 [8807f2cbbe08] attach_task_by_pid at 810fa520
#12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
#13 [8807f2cbbe68] cgroup_file_write at 810f8773
#14 [8807f2cbbef8] vfs_write at 811dfdfd
#15 [8807f2cbbf38] sys_write at 811e089f
#16 [8807f2cbbf80] system_call_fastpath at 8164c809

CPU 43: kworker/43 starved because of the RT threads:

CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
RT PRIO_ARRAY: 883fff3f4950
[ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
[ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
[ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
[ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
[ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
[ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
[ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
[ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
[ 79] PID: 21339  TASK: 88276d48      COMMAND: "handler_5"
[ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
CFS RB_ROOT: 883fff3f4868
[120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"

CPU 28: systemd (victim) was blocked on cgroup_mutex:

PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
#0 [881fd317bd60] __schedule at 816410ed
#1 [881fd317bdc8] schedule_preempt_disabled at 81642869
#2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
#3 [881fd317be38] mutex_lock at 8163f9cf
#4 [881fd317be50] proc_cgroup_show at 810fd256
#5 [881fd317be98] seq_read at 81203cda
#6 [881fd317bf08] vfs_read at 811dfc6c
#7 [881fd317bf38] sys_read at 811e07bf
#8 [881fd317bf80] system_call_fastpath at 81

The simplest way to fix that is to set the kworkers' scheduler to a higher RT priority, e.g.:

chrt --fifo -p 61

However, that cannot prevent other WORK_CPU_BOUND worker threads from running and starving.

This patch introduces a way to set the scheduler (policy and priority) of a percpu worker_pool, so that the user can set a proper scheduler policy and priority for the worker_pool as needed, which applies to all the WORK_CPU_BOUND workers on the same CPU. On the other hand, /sys/devices/virtual/workqueue/cpumask can be used for WORK_CPU_UNBOUND workers to prevent them from starving.

Tejun Heo suggested:

"* Add scheduler type to wq_attrs so that unbound workqueues can be configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the interface) for system_percpu and system_percpu_highpri, and use that to change the attributes of the percpu pools."

This patch implements the basic infrastructure and /sys interface, such as:

# cat /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
# cat /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=1 prio=1 nice=0
# cat /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20
# echo "policy=1 prio=2 nice=0" > /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
# cat /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=1 prio=2 nice=0

Signed-off-by: Wen Yang
Signed-off-by: Jiang Biao
Signed-off-by: Tan Hu
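[Editor's note: the sched_attr strings shown above have a fixed "policy=N prio=N nice=N" shape; a small illustrative round-trip helper (hypothetical, not part of the patch) makes the format explicit. Policy numbers follow the standard Linux values: 0 = SCHED_OTHER/SCHED_NORMAL, 1 = SCHED_FIFO, 2 = SCHED_RR.]

```python
def parse_sched_attr(line):
    """Parse a sched_attr line like 'policy=1 prio=1 nice=0' into ints.

    prio is the RT priority (used by SCHED_FIFO/SCHED_RR pools);
    nice only matters for SCHED_OTHER (policy 0) pools.
    """
    return {k: int(v) for k, v in (f.split("=") for f in line.split())}

def format_sched_attr(policy, prio, nice):
    """Build the string one would echo into a workqueue's sched_attr file."""
    return f"policy={policy} prio={prio} nice={nice}"
```

For example, format_sched_attr(1, 2, 0) reproduces the string echoed into system_percpu_highpri in the transcript above.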