Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 20, 2020, at 3:35 PM, Vincent Guittot wrote:
>
> [...]
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 19, 2020, at 10:55 PM, Vincent Guittot wrote:
>
> [...]
>
>> That could work for increasing the period between 2 runs. But it could not
>> reduce the single runtime of the idle task I guess, which means a normal
>> task could have 1-tick scheduling latency because of the idle task.
>
> Yes. An idle task will preempt an always-running task during 1 tick
> every 680ms. But you should also keep in mind that a waking normal
> task will preempt the idle task immediately, which means that it will
> not add scheduling latency to a normal task but "steal" 0.14% of
> normal task throughput (1/680) at most.

That's true. But in the VM case, when VMs are busy (MWAIT passthrough, or
running CPU-eating workloads), the 1-tick scheduling latency can be detected
by cyclictest running in the VM.

OTOH, we compensate vruntime in place_entity() to boost wakeups without
distinguishing SCHED_IDLE tasks. Do you think it's necessary to do that?
Like,

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4115,7 +4115,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 		vruntime += sched_vslice(cfs_rq, se);
 
 	/* sleeps up to a single latency don't count. */
-	if (!initial) {
+	if (!initial && likely(!task_has_idle_policy(task_of(se)))) {
 		unsigned long thresh = sysctl_sched_latency;

>> OTOH, cgroups (shares) could introduce extra complexity. :)
>>
>> I wonder if there's any possibility to make SCHED_IDLE tasks' priorities
>> absolutely lower than SCHED_NORMAL (SCHED_OTHER), which means no
>> weights/shares for them, and they run only when no other task is runnable.
>> I guess there may be a priority-inversion issue if we do that. But maybe we
>
> Exactly, that's why we must ensure a minimum running time for sched_idle
> tasks.

Still for the VM case, different VMs are already well isolated from each
other, so the priority-inversion issue should be very rare. We're trying to
make offline tasks absolutely harmless to online tasks. :)

Thanks a lot for your time.

Regards,
Jiang

>> could avoid it by load-balancing more aggressively, or it (priority
>> inversion) could be ignored in some special cases.
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 19, 2020, at 7:55 PM, Dietmar Eggemann wrote:
>
> On 19/08/2020 13:05, Vincent Guittot wrote:
>> On Wed, 19 Aug 2020 at 12:46, Dietmar Eggemann wrote:
>>>
>>> [...]
>>>
>>> But here the different sched slices (check_preempt_tick()->
>>> sched_slice()) between normal tasks and the idle task play a role too.
>>>
>>> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>>
>> In fact that depends on the number of CPUs on the system:
>> sysctl_sched_latency = 6ms * (1 + ilog2(ncpus)). On an 8-core system, a
>> normal task will run around 12ms in one go and the idle task still runs
>> for one tick period.
>
> True. This is on a single CPU.

Agree. :)

>> Also, you can increase the period between 2 runs of the idle task even
>> more by using cgroups and the min shares value: 2.
>
> Ah yes, maybe this is what Jiang wants to do then? If his runtime does
> not have other requirements preventing this.

That could work for increasing the period between 2 runs. But it could not
reduce the single runtime of the idle task I guess, which means a normal
task could still have 1-tick scheduling latency because of the idle task.

OTOH, cgroups (shares) could introduce extra complexity. :)

I wonder if there's any possibility to make SCHED_IDLE tasks' priorities
absolutely lower than SCHED_NORMAL (SCHED_OTHER), which means no
weights/shares for them, and they run only when no other task is runnable.
I guess there may be a priority-inversion issue if we do that. But maybe we
could avoid it by load-balancing more aggressively, or it (priority
inversion) could be ignored in some special cases.

Thx.
Regards,
Jiang

> [...]
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 19, 2020, at 6:46 PM, Dietmar Eggemann wrote:
>
> [...]
>
> OK, I see.
>
> But here the different sched slices (check_preempt_tick()->
> sched_slice()) between normal tasks and the idle task play a role too.
>
> Normal tasks get ~3ms whereas the idle task gets <0.01ms.
>
> So the idle task runs every ~680ms but only for 1 tick (1ms) (4 times
> less than the normal tasks). The condition 'if (delta_exec >
> ideal_runtime)' in check_preempt_tick() is only true at the 4th tick
> when a normal task runs, even though the slice is 3ms.
>
> In the 250 Hz example the sched slice diffs are hidden behind the 4ms tick.

Exactly. Tick size decides both the single runtime and the interval
between two runs of the idle task. That makes sense for most of the test
results above.

Thx.
Regards,
Jiang
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 17, 2020, at 4:57 PM, Dietmar Eggemann wrote:
>
> [...]
>
>>>> ** 2normal+1idle: idle preempt normal every 600+ms **
>>>
>>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>>> switched in only once, after ~2.5s.
>>
>> Use your config with loop increased from 200 to 2000, to observe longer,
>>
>> <...>-37620 [002] d... 47950.446191: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=S ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37619 [002] d... 47955.687709: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> // The first preemption interval is 5.2s.
>> <...>-37620 [002] d... 47956.375716: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37619 [002] d... 47957.060722: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47957.747728: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47958.423734: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> <...>-37620 [002] d... 47959.119740: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
>> // After the first preemption, the rest of the preemption intervals are all about 600ms+. :)
>
> Are you sure about this?

Yes. :)

> The math is telling me for the:
>
> idle task:   (3 / (1024 + 1024 + 3))^(-1) * 4ms = 2735ms
>
> normal task: (1024 / (1024 + 1024 + 3))^(-1) * 4ms = 8ms
>
> (4ms tick - 250 Hz)

My tick is 1ms - 1000 HZ, which seems reasonable for 600ms? :)

Thx.
Regards,
Jiang

>>>> ** 3normal+idle: idle preempt normal every 1000+ms **
>>>
>>> Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
>>> in behavior.
>>
>> With 3 SCHED_OTHER tasks only, the SCHED_OTHER task is switched in
>> every 27ms.
>
> normal task: (1024 / (1024 + 1024 + 1024))^(-1) * 4ms = 12ms
>
>>>> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
>>>
>>> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
>>> switched in every ~45ms.
>>
>> That's as what I expected. :)
>
> idle task:   (3 / (15 + 15 + 3))^(-1) * 4ms = 44ms
>
> normal task: (15 / (15 + 15 + 3))^(-1) * 4ms = 9ms
Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)
Hi,

> On Aug 14, 2020, at 1:18 PM, Li, Aubrey wrote:
>
> [...]
Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)
> On Aug 14, 2020, at 9:36 AM, Li, Aubrey wrote:
>
> On 2020/8/14 8:26, benbjiang(蒋彪) wrote:
>>
>>> On Aug 13, 2020, at 12:28 PM, Li, Aubrey wrote:
>>>
>>> On 2020/8/13 7:08, Joel Fernandes wrote:
>>>> On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote:
>>>>> Hi Joel,
>>>>>
>>>>> On 2020/8/10 0:44, Joel Fernandes wrote:
>>>>>> Hi Aubrey,
>>>>>>
>>>>>> Apologies for replying late as I was still looking into the details.
>>>>>>
>>>>>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote:
>>>>>> [...]
>>>>>>> +/*
>>>>>>> + * Core scheduling policy:
>>>>>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
>>>>>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
>>>>>>> + *   on the same core concurrently.
>>>>>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
>>>>>>> + *   thread on the same core concurrently.
>>>>>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
>>>>>>> + *   with idle thread on the same core.
>>>>>>> + */
>>>>>>> +enum coresched_policy {
>>>>>>> +       CORE_SCHED_DISABLED,
>>>>>>> +       CORE_SCHED_COOKIE_MATCH,
>>>>>>> +       CORE_SCHED_COOKIE_TRUST,
>>>>>>> +       CORE_SCHED_COOKIE_LONELY,
>>>>>>> +};
>>>>>>>
>>>>>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind
>>>>>>> of performance regression. Not sure if this sounds attractive?
>>>>>>
>>>>>> Instead of this, I think it can be something simpler IMHO:
>>>>>>
>>>>>> 1. Consider all cookie-0 tasks as trusted. (Even right now, if you apply
>>>>>>    the core-scheduling patchset, such tasks will share a core and sniff on
>>>>>>    each other. So let us not pretend that such tasks are not trusted.)
>>>>>>
>>>>>> 2. All kernel threads and the idle task would have a cookie 0 (so that
>>>>>>    will cover ksoftirqd reported in your original issue).
>>>>>>
>>>>>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED).
>>>>>>    Default enable it. Setting this option would tag all tasks that are
>>>>>>    forked from a cookie-0 task with their own cookie. Later on, such tasks
>>>>>>    can be added to a group. This covers PeterZ's ask about having 'default
>>>>>>    untrusted'. (Users like ChromeOS that don't want userspace system
>>>>>>    processes to be tagged can disable this option so such tasks will be
>>>>>>    cookie-0.)
>>>>>>
>>>>>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override
>>>>>>    the above behaviors.
>>>>>
>>>>> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set
>>>>> uperf's cookie to be cookie-0 via prctl?
>>>>
>>>> Yes, but let me try to understand better. There are 2 problems here I think:
>>>>
>>>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>>>    core with it: This should not be any worse than SMT OFF, because even SMT
>>>>    OFF would also reduce ksoftirqd's CPU time, just as core sched is doing.
>>>>    Sure, core-scheduling adds some overhead with IPIs, but such a huge drop
>>>>    of perf is strange. Peter, any thoughts on that?
>>>>
>>>> 2. Interface: To solve the performance problem, you are saying you want
>>>>    uperf to share a core with ksoftirqd so that it is not forced into idle.
>>>>    Why not just keep uperf out of the cgroup?
>>>
>>> I guess this is unacceptable for those who run their apps in containers and VMs.
>>
>> IMHO, just as Joel proposed,
>> 1. Consider all cookie-0 tasks as trusted.
>> 2. All kernel threads and the idle task would have a cookie 0.
>> In that way, all tasks with cookies (including uperf in a cgroup) could run
>> concurrently with kernel threads.
>> That could be a good solution for the issue. :)
>
> From uperf's point of view, it can trust cookie-0 (I assume we still need
> some modifications to change cookie-match to cookie-compatible to allow
> ZERO and NONZERO to run together).
>
> But from the kernel threads' point of view, they can NOT trust uperf, unless
> we set uperf's cookie to 0.

That's right. :)
Could we set the cookie of the cgroup where uperf lies to 0?

Thx.
Regards,
Jiang

>
> Thanks,
> -Aubrey
Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)
> On Aug 13, 2020, at 12:28 PM, Li, Aubrey wrote:
>
> On 2020/8/13 7:08, Joel Fernandes wrote:
>>
>> [...]
>>
>> Yes, but let me try to understand better. There are 2 problems here I think:
>>
>> 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a
>>    core with it: This should not be any worse than SMT OFF, because even SMT
>>    OFF would also reduce ksoftirqd's CPU time, just as core sched is doing.
>>    Sure, core-scheduling adds some overhead with IPIs, but such a huge drop
>>    of perf is strange. Peter, any thoughts on that?
>>
>> 2. Interface: To solve the performance problem, you are saying you want
>>    uperf to share a core with ksoftirqd so that it is not forced into idle.
>>    Why not just keep uperf out of the cgroup?
>
> I guess this is unacceptable for those who run their apps in containers and VMs.

IMHO, just as Joel proposed,
1. Consider all cookie-0 tasks as trusted.
2. All kernel threads and the idle task would have a cookie 0.
In that way, all tasks with cookies (including uperf in a cgroup) could run
concurrently with kernel threads. That could be a good solution for the
issue. :)

If with CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED enabled, maybe we should
set ksoftirqd's cookie to be cookie-0 to solve the issue.

Thx.
Regards,
Jiang

> Thanks,
> -Aubrey
>
>> Then it will have cookie 0 and be able to share a core with kernel threads.
>> About the user-user isolation that you need: if you tag any "untrusted"
>> threads by adding them to a CGroup, then they will automatically be isolated
>> from uperf while allowing uperf to share CPU with kernel threads.
>>
>> Please let me know your thoughts and thanks,
>>
>> - Joel
>>
>>>>>> 5. Document everything clearly so the semantics are clear both to the
>>>>>>    developers of core scheduling and to system administrators.
>>>>>>
>>>>>> Note that, with the concept of "system trusted cookie", we can also do
>>>>>> optimizations like:
>>>>>> 1. Disable STIBP when switching into trusted tasks.
>>>>>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching
>>>>>>    into trusted tasks.
>>>>>>
>>>>>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one
>>>>>> other engineer requested I do something like #2 already.
>>>>>>
>>>>>> Once we get full-syscall isolation working, threads belonging to a process
>>>>>> can also share a core so those can just share a core with the task-group
>>>>>> leader.
>>
>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT?
>
> This is a good question, from the data we measured by uperf,
> SMT+core-scheduling is 28.2% worse than no-SMT, :(

This is worrying for
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
Hi,

> On Aug 13, 2020, at 2:39 AM, Dietmar Eggemann wrote:
>
> [...]
>
>>> Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
>>> single CPU and trace_printk the conditions 'if (delta < 0)' and
>>> 'if (delta > ideal_runtime)' in check_preempt_tick().
>>>
>>> Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
>>> the niceness of the 2 SCHED_OTHER tasks to 19 to see some differences in
>>> kernelshark's task layout.
>>>
>>> rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
>>> craft those artificial use cases.
>>
>> With the rt-app tool, sched_switch traced by ftrace, the result is as I
>> expected,
>
> I use:
>
> {
>     "tasks" : {
>         "task_other" : {
>             "instance" : 2,
>             "loop" : 200,
>             "policy" : "SCHED_OTHER",
>             "run" : 8000,
>             "timer" : { "ref" : "unique1", "period" : 16000, "mode" : "absolute" },
>             "priority" : 0
>         },
>         "task_idle" : {
>             "instance" : 1,
>             "loop" : 200,
>             "policy" : "SCHED_IDLE",
>             "run" : 8000,
>             "timer" : { "ref" : "unique2", "period" : 16000, "mode" : "absolute" }
>         }
>     },
>     "global" : {
>         "calibration" : 243,  <-- Has to be calibrated against the CPU you run on !!!
>         "default_policy" : "SCHED_OTHER",
>         "duration" : -1
>     }
> }
>
> to have 2 (periodic) SCHED_OTHER and 1 SCHED_IDLE task.
>
>> ** 2normal+1idle: idle preempt normal every 600+ms **
>
> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
> switched in only once, after ~2.5s.

Use your config with loop increased from 200 to 2000, to observe longer,

<...>-37620 [002] d... 47950.446191: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=S ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37619 [002] d... 47955.687709: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
// The first preemption interval is 5.2s.
<...>-37620 [002] d... 47956.375716: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37619 [002] d... 47957.060722: sched_switch: prev_comm=task_other-0 prev_pid=37619 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47957.747728: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47958.423734: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
<...>-37620 [002] d... 47959.119740: sched_switch: prev_comm=task_other-1 prev_pid=37620 prev_prio=120 prev_state=R ==> next_comm=task_idle-2 next_pid=37621 next_prio=120
// After the first preemption, the rest of the preemption intervals are all about 600ms+. :)

>> ** 3normal+idle: idle preempt normal every 1000+ms **
>
> Ah, this was meant to be 3 SCHED_OTHER tasks only! To see the difference
> in behavior.

With 3 SCHED_OTHER tasks only, the SCHED_OTHER task is switched in
every 27ms.

>> ** 2normal(nice 19)+1idle(nice 0): idle preempt normal every 30+ms **
>
> During the 3.2s the 2 SCHED_OTHER tasks run, the SCHED_IDLE task is
> switched in every ~45ms.

That's as I expected. :)

Thx.
Regards,
Jiang

> [...]
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
Hi,

> On Aug 11, 2020, at 11:54 PM, Dietmar Eggemann wrote:
>
> [...]
>
>>> Because of this very small weight (weight=3), compared to a SCHED_NORMAL
>>> nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge
>>> se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task).
>>> This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task.
>>
>> Could you please explain how the huge penalization of vruntime (1024/3)
>> could make sure SCHED_IDLE does not tick preempt a SCHED_NORMAL nice 0
>> task?
>>
>> Thanks a lot.
>
> Trace a run of 2 SCHED_OTHER (nice 0) tasks and 1 SCHED_IDLE task on a
> single CPU and trace_printk the conditions 'if (delta < 0)' and
> 'if (delta > ideal_runtime)' in check_preempt_tick().
>
> Then do the same with 3 SCHED_OTHER (nice 0) tasks. You can also change
> the niceness of the 2 SCHED_OTHER tasks to 19 to see some differences in
> kernelshark's task layout.
>
> rt-app (https://github.com/scheduler-tools/rt-app) is a nice tool to
> craft those artificial use cases.

With the rt-app tool, sched_switch traced by ftrace, the result is as I
expected,

** 1normal+1idle: idle preempt normal every 200ms **

<...>-92016 [002] d... 2398066.902477: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.902527: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.922472: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.922522: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.942292: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120
<...>-91814 [002] d... 2398066.942343: sched_switch: prev_comm=idle0-0 prev_pid=91814 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=92016 next_prio=120
<...>-92016 [002] d... 2398066.962331: sched_switch: prev_comm=normal0-0 prev_pid=92016 prev_prio=120 prev_state=S ==> next_comm=idle0-0 next_pid=91814 next_prio=120

** 2normal+1idle: idle preempt normal every 600+ms **

<...>-49009 [002] d... 2400562.746640: sched_switch: prev_comm=normal0-0 prev_pid=49009 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400562.747502: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=S ==> next_comm=normal1-0 next_pid=198658 next_prio=120
<...>-198658 [002] d... 2400563.335262: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400563.336258: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=120
<...>-198658 [002] d... 2400564.017663: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120
<...>-187466 [002] d... 2400564.018661: sched_switch: prev_comm=idle0-0 prev_pid=187466 prev_prio=120 prev_state=R ==> next_comm=normal0-0 next_pid=49009 next_prio=120
<...>-198658 [002] d... 2400564.701063: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid=187466 next_prio=120

** 3normal+idle: idle preempt normal every 1000+ms **

<...>-198658 [002] d... 2400415.780701: sched_switch: prev_comm=normal1-0 prev_pid=198658 prev_prio=120 prev_state=R ==> next_comm=idle0-0 next_pid
Re: [PATCH] sched/fair: Optimize dequeue_task_fair()(Internet mail)
Hi, > On Aug 12, 2020, at 12:55 AM, Dietmar Eggemann > wrote: > > On 11/08/2020 10:43, Jiang Biao wrote: >> Similar optimization as what has been done in commit >> 7d148be69e3a (sched/fair: Optimize enqueue_task_fair()) >> >> dequeue_task_fair() jumps to the dequeue_throttle label when cfs_rq_of(se) is >> throttled, which means that se can't be NULL there. We can move the label to >> after the if (!se) block and remove the check, as se is always NULL when >> falling through to that point. >> >> Besides, keeping the same pattern as enqueue_task_fair() makes it more >> readable. >> >> Signed-off-by: Jiang Biao >> --- >> kernel/sched/fair.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04fa8dbcfa4d..cbbeafdfa8b7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5618,10 +5618,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  	}
>>
>> -dequeue_throttle:
>> -	if (!se)
>> -		sub_nr_running(rq, 1);
>> +	/* At this point se is NULL and we are at root level */
>> +	sub_nr_running(rq, 1);
>>
>> +dequeue_throttle:
>>  	/* balance early to pull high priority tasks */
>>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>>  		rq->next_balance = jiffies;
> There is already a similar patch in master. > > 423d02e1463b - sched/fair: Optimize dequeue_task_fair() (2020-06-25 Peng Wang) Indeed, my local repo was outdated, sorry for the noise. :) Thx. Regards, Jiang
Re: [PATCH RFC v2] sched/fair: simplify the work when reweighting entity(Internet mail)
Hi, > On Aug 11, 2020, at 5:50 PM, Dietmar Eggemann > wrote: > > On 06/08/2020 18:14, Jiang Biao wrote: >> From: Jiang Biao >> >> If a se is on_rq when reweighting entity, all we need to do is update >> the load of cfs_rq; the other dequeue/enqueue work could be redundant, >> such as, >> * nr_running--/nr_running++ >> >> Even though the following dequeue/enqueue paths would never be reached, >> * account_numa_dequeue/account_numa_enqueue >> * list_del/list_add from/into cfs_tasks >> they could be a little confusing. >> >> Simplifying the logic helps reduce a little overhead on the hot path, >> and makes it cleaner and more readable. > > LGTM. I guess you have to resend it w/o the RFC tag. Will do soon. > > Maybe you can rework the patch header a little? > > Something like this: > > The code in reweight_entity() can be simplified. > > For a sched entity on the rq, the entity accounting can be replaced by > cfs_rq instantaneous load updates currently called from within the > entity accounting. > > Even though an entity on the rq can't represent a task in > reweight_entity() (a task is always dequeued before calling this > function) and so the numa task accounting and the rq->cfs_tasks list > management of the entity accounting are never called, the redundant > cfs_rq->nr_running decrement/increment will be avoided. That's much better. I'll pick up the header if you don't mind. :) Thanks a lot for your kind reply and detailed explanation. Regards, Jiang > >> >> Signed-off-by: Jiang Biao >> --- >> v2<-v1: >> - Amend the commit log. 
>> kernel/sched/fair.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04fa8dbcfa4d..18a8fc7bd0de 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3086,7 +3086,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>  		/* commit outstanding execution time */
>>  		if (cfs_rq->curr == se)
>>  			update_curr(cfs_rq);
>> -		account_entity_dequeue(cfs_rq, se);
>> +		update_load_sub(&cfs_rq->load, se->load.weight);
>>  	}
>>  	dequeue_load_avg(cfs_rq, se);
>>
>> @@ -3102,7 +3102,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>
>>  	enqueue_load_avg(cfs_rq, se);
>>  	if (se->on_rq)
>> -		account_entity_enqueue(cfs_rq, se);
>> +		update_load_add(&cfs_rq->load, se->load.weight);
>>  }
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
Hi, > On Aug 10, 2020, at 9:24 PM, Dietmar Eggemann > wrote: > > On 06/08/2020 17:52, benbjiang(蒋彪) wrote: >> Hi, >> >>> On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann >>> wrote: >>> >>> On 03/08/2020 13:26, benbjiang(蒋彪) wrote: >>>> >>>> >>>>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann >>>>> wrote: >>>>> >>>>> On 01/08/2020 04:32, Jiang Biao wrote: >>>>>> From: Jiang Biao > > [...] > >>> How would you deal with se's representing taskgroups which contain >>> SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that? >> I'm not sure I get the point. :) How about the following patch,
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04fa8dbcfa4d..8715f03ed6d7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  		list_add(&se->group_node, &cfs_rq->cfs_tasks);
>>  	}
>> #endif
>> +	if (task_has_idle_policy(task_of(se)))
>> +		cfs_rq->idle_nr_running++;
>> +
>>  	cfs_rq->nr_running++;
>>  }
>>
>> @@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  		list_del_init(&se->group_node);
>>  	}
>> #endif
>> +	if (task_has_idle_policy(task_of(se)))
>> +		cfs_rq->idle_nr_running--;
>> +
>>  	cfs_rq->nr_running--;
>>  }
>>
>> @@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>  		return;
>> #endif
>>
>> -	if (cfs_rq->nr_running > 1)
>> +	if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
>> +	    cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
>>  		check_preempt_tick(cfs_rq, curr);
>>  }
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 877fb08eb1b0..401090393e09 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -500,6 +500,7 @@ struct cfs_bandwidth { };
>> struct cfs_rq {
>>  	struct load_weight	load;
>>  	unsigned int		nr_running;
>> +	unsigned int		idle_nr_running;
>>  	unsigned int		h_nr_running;	/*
SCHED_{NORMAL,BATCH,IDLE} */
>>  	unsigned int		idle_h_nr_running;	/* SCHED_IDLE */
>
>          /
>        / | \
>       A  n0 i0
>      / \
>     n1  i1
>
> I don't think this will work. E.g. the patch would prevent tick
> preemption between 'A' and 'n0' on '/' as well:
>
> (3 > 1 + 1) && (4 - 2 > 1 + 1)
>
> You also have to make sure that a SCHED_IDLE task can tick preempt
> another SCHED_IDLE task.
That's right. :) > >>>> I'm not sure if it's ok to do that, because the IDLE class seems not to be so >>>> pure that it could tolerate starving. >>> >>> Not sure I understand but idle_sched_class is not the same as SCHED_IDLE >>> (policy)? >> The case is that we need tasks (low priority, called offline tasks) to utilize the >> spare cpu left by CFS SCHED_NORMAL tasks (called online tasks) without >> interfering with the online tasks. >> Offline tasks only run when there are no runnable online tasks, and offline tasks >> never preempt online tasks. >> The SCHED_IDLE policy does not seem qualified for that requirement, >> because it has a weight (3); even though it's small, it can still preempt online >> tasks for the sake of fairness. In that way, offline tasks of SCHED_IDLE policy >> could interfere with the online tasks. > > Because of this very small weight (weight=3), compared to a SCHED_NORMAL > nice 0 task (weight=1024), a SCHED_IDLE task is penalized by a huge > se->vruntime value (1024/3 higher than for a SCHED_NORMAL nice 0 task). > This should make sure it doesn't tick preempt a SCHED_NORMAL nice 0 task. Could you please explain how the huge penalization of vruntime (1024/3) makes sure SCHED_IDLE does not tick preempt a SCHED_NORMAL nice 0 task? Thanks a lot. Regards, Jiang > > It's different when the SCHED_NORMAL task has nice 19 (weight=15) but > that's part of the CFS design. > >> On the other hand, idle_sched_class doesn't seem qualified either. It's too >> simple and only used for the per-cpu idle task currently. > > Yeah, leave this for the rq->idle task (swapper/X). 
Got it. > >>>> We need an absolutely low priority class that could tolerate starving, >>>> which >>>> could be used to co-locate offline tasks. But IDLE class seems to be not >>>> *low* enough, if considering the fairness of CFS, and IDLE class still has >>>> a >>>> weight. > > Understood. But this (tick) preemption should happen extremely rarely, > especially if you have SCHED_NORMAL nice 0 tasks, right?
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
Hi, > On Aug 6, 2020, at 9:29 PM, Dietmar Eggemann wrote: > > On 03/08/2020 13:26, benbjiang(蒋彪) wrote: >> >> >>> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann >>> wrote: >>> >>> On 01/08/2020 04:32, Jiang Biao wrote: >>>> From: Jiang Biao >>>> >>>> No need to preempt when there are only one runable CFS task with >>>> other IDLE tasks on runqueue. The only one CFS task would always >>>> be picked in that case. >>>> >>>> Signed-off-by: Jiang Biao >>>> --- >>>> kernel/sched/fair.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>> index 04fa8dbcfa4d..8fb80636b010 100644 >>>> --- a/kernel/sched/fair.c >>>> +++ b/kernel/sched/fair.c >>>> @@ -4527,7 +4527,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct >>>> sched_entity *curr, int queued) >>>>return; >>>> #endif >>>> >>>> - if (cfs_rq->nr_running > 1) >>>> + if (cfs_rq->nr_running > cfs_rq.idle_h_nr_running + 1) >>> >>> cfs_rq is a pointer. >> It is. Sorry about that. :) >> >>> >>>>check_preempt_tick(cfs_rq, curr); >>>> } >>> >>> You can't compare cfs_rq->nr_running with cfs_rq->idle_h_nr_running! >>> >>> There is a difference between cfs_rq->h_nr_running and >>> cfs_rq->nr_running. The '_h_' stands for hierarchical. >>> >>> The former gives you hierarchical task accounting whereas the latter is >>> the number of sched entities (representing tasks or taskgroups) enqueued >>> in cfs_rq. >>> >>> In entity_tick(), cfs_rq->nr_running has to be used for the condition to >>> call check_preempt_tick(). We want to check if curr has to be preempted >>> by __pick_first_entity(cfs_rq) on this cfs_rq. >>> >>> entity_tick() is called for each sched entity (and so for each >>> cfs_rq_of(se)) of the task group hierarchy (e.g. task p running in >>> taskgroup /A/B : se(p) -> se(A/B) -> se(A)). >> That’s true. I was thinking adding a new cfs_rq->idle_nr_running member to >> track the per cfs_rq's IDLE task number, and reducing preemption here based >> on that. 
> > How would you deal with se's representing taskgroups which contain > SCHED_IDLE and SCHED_NORMAL tasks or other taskgroups doing that? I'm not sure I get the point. :) How about the following patch,

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04fa8dbcfa4d..8715f03ed6d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2994,6 +2994,9 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		list_add(&se->group_node, &cfs_rq->cfs_tasks);
 	}
 #endif
+	if (task_has_idle_policy(task_of(se)))
+		cfs_rq->idle_nr_running++;
+
 	cfs_rq->nr_running++;
 }

@@ -3007,6 +3010,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		list_del_init(&se->group_node);
 	}
 #endif
+	if (task_has_idle_policy(task_of(se)))
+		cfs_rq->idle_nr_running--;
+
 	cfs_rq->nr_running--;
 }

@@ -4527,7 +4533,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 		return;
 #endif

-	if (cfs_rq->nr_running > 1)
+	if (cfs_rq->nr_running > cfs_rq->idle_nr_running + 1 &&
+	    cfs_rq->h_nr_running - cfs_rq->idle_h_nr_running > cfs_rq->idle_nr_running + 1)
 		check_preempt_tick(cfs_rq, curr);
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 877fb08eb1b0..401090393e09 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -500,6 +500,7 @@ struct cfs_bandwidth { };
 struct cfs_rq {
 	struct load_weight	load;
 	unsigned int		nr_running;
+	unsigned int		idle_nr_running;
 	unsigned int		h_nr_running;	/* SCHED_{NORMAL,BATCH,IDLE} */
 	unsigned int		idle_h_nr_running;	/* SCHED_IDLE */

> >> I'm not sure if it's ok to do that, because the IDLE class seems not to be so >> pure that it could tolerate starving. > > Not sure I understand but idle_sched_class is not the same as SCHED_IDLE > (policy)? The case is that we need tasks (low priority, called offline tasks) to utilize the spare cpu left by CFS SCHED_NORMAL tasks (called online tasks) without interfering with the online tasks. Offline tasks only run when there are no runnable online tasks, and offline ta
Re: [PATCH RFC] sched/fair: simplfy the work when reweighting entity
> On Aug 6, 2020, at 12:21 AM, Dietmar Eggemann > wrote: > > On 04/08/2020 09:12, Jiang Biao wrote: >> If a se is on_rq when reweighting entity, all we need to do is update >> the load of cfs_rq; the other dequeue/enqueue work could be redundant, >> such as, >> * account_numa_dequeue/account_numa_enqueue >> * list_del/list_add from/into cfs_tasks >> * nr_running--/nr_running++ > > I think this could make sense. Have you spotted a code path where this > gives you a change? > > I guess only for a task on the rq, so: entity_is_task(se) && se->on_rq Yes, you're right. No other code path I spotted except what you list below. > >> Just simplify the work. Could be helpful for the hot path. > > IMHO hotpath is update_cfs_group() -> reweight_entity() but this is only > called for '!entity_is_task(se)'. > > See > > 3290 if (!gcfs_rq) > 3291 return; > > in update_cfs_group(). Yes, it is. But the *nr_running--/nr_running++* work is still redundant for the '!entity_is_task(se)' hot path. :) Besides, I guess we could simplify the logic and make it cleaner and more readable with this patch. If it makes sense to you, would you mind if I resend the patch with the commit log amended? > > The 'entity_is_task(se)' case is > > set_load_weight(struct task_struct *p, ...) -> reweight_task(p, ...) -> reweight_entity(..., &p->se, ...) > > but here !se->on_rq. Yes, indeed. Thanks a lot for your comments. 
Regards, Jiang > >> Signed-off-by: Jiang Biao >> --- >> kernel/sched/fair.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 04fa8dbcfa4d..18a8fc7bd0de 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3086,7 +3086,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>  		/* commit outstanding execution time */
>>  		if (cfs_rq->curr == se)
>>  			update_curr(cfs_rq);
>> -		account_entity_dequeue(cfs_rq, se);
>> +		update_load_sub(&cfs_rq->load, se->load.weight);
>>  	}
>>  	dequeue_load_avg(cfs_rq, se);
>>
>> @@ -3102,7 +3102,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>
>>  	enqueue_load_avg(cfs_rq, se);
>>  	if (se->on_rq)
>> -		account_entity_enqueue(cfs_rq, se);
>> +		update_load_add(&cfs_rq->load, se->load.weight);
>>  }
>
Re: [RFC PATCH 00/16] Core scheduling v6(Internet mail)
Hi, > On Aug 5, 2020, at 11:57 AM, Li, Aubrey wrote: > > On 2020/8/4 0:53, Joel Fernandes wrote: >> Hi Aubrey, >> >> On Mon, Aug 3, 2020 at 4:23 AM Li, Aubrey wrote: >>> >>> On 2020/7/1 5:32, Vineeth Remanan Pillai wrote: Sixth iteration of the Core-Scheduling feature. Core scheduling is a feature that allows only trusted tasks to run concurrently on cpus sharing compute resources (eg: hyperthreads on a core). The goal is to mitigate the core-level side-channel attacks without requiring to disable SMT (which has a significant impact on performance in some situations). Core scheduling (as of v6) mitigates user-space to user-space attacks and user to kernel attack when one of the siblings enters the kernel via interrupts. It is still possible to have a task attack the sibling thread when it enters the kernel via syscalls. By default, the feature doesn't change any of the current scheduler behavior. The user decides which tasks can run simultaneously on the same core (for now by having them in the same tagged cgroup). When a tag is enabled in a cgroup and a task from that cgroup is running on a hardware thread, the scheduler ensures that only idle or trusted tasks run on the other sibling(s). Besides security concerns, this feature can also be beneficial for RT and performance applications where we want to control how tasks make use of SMT dynamically. This iteration is mostly a cleanup of v5 except for a major feature of pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also introducing documentation and includes minor crash fixes. One major cleanup was removing the hotplug support and related code. The hotplug related crashes were not documented and the fixes piled up over time leading to complex code. We were not able to reproduce the crashes in the limited testing done. But if they are reroducable, we don't want to hide them. We should document them and design better fixes if any. In terms of performance, the results in this release are similar to v5. 
On a x86 system with N hardware threads: - if only N/2 hardware threads are busy, the performance is similar between baseline, corescheduling and nosmt - if N hardware threads are busy with N different corescheduling groups, the impact of corescheduling is similar to nosmt - if N hardware threads are busy and multiple active threads share the same corescheduling cookie, they gain a performance improvement over nosmt. The specific performance impact depends on the workload, but for a really busy database 12-vcpu VM (1 coresched tag) running on a 36 hardware threads NUMA node with 96 mostly idle neighbor VMs (each in their own coresched tag), the performance drops by 54% with corescheduling and drops by 90% with nosmt. >>> >>> We found uperf(in cgroup) throughput drops by ~50% with corescheduling. >>> >>> The problem is, uperf triggered a lot of softirq and offloaded softirq >>> service to *ksoftirqd* thread. >>> >>> - default, ksoftirqd thread can run with uperf on the same core, we saw >>> 100% CPU utilization. >>> - coresched enabled, ksoftirqd's core cookie is different from uperf, so >>> they can't run concurrently on the same core, we saw ~15% forced idle. >>> >>> I guess this kind of performance drop can be replicated by other similar >>> (a lot of softirq activities) workloads. >>> >>> Currently core scheduler picks cookie-match tasks for all SMT siblings, does >>> it make sense we add a policy to allow cookie-compatible task running >>> together? >>> For example, if a task is trusted(set by admin), it can work with kernel >>> thread. >>> The difference from corescheduling disabled is that we still have user to >>> user >>> isolation. >> >> In ChromeOS we are considering all cookie-0 tasks as trusted. >> Basically if you don't trust a task, then that is when you assign the >> task a tag. We do this for the sandboxed processes. 
> > I have a proposal of this, by changing cpu.tag to cpu.coresched_policy, > something like the following:
>
> +/*
> + * Core scheduling policy:
> + * - CORE_SCHED_DISABLED: core scheduling is disabled.
> + * - CORE_COOKIE_MATCH: tasks with same cookie can run
> + *   on the same core concurrently.
> + * - CORE_COOKIE_TRUST: trusted task can run with kernel
> + *   thread on the same core concurrently.
How about other OS tasks (like systemd) besides kernel threads? :) Thx. Regards, Jiang
> + * - CORE_COOKIE_LONELY: tasks with cookie can run only
> + *   with idle thread on the same core.
> + */
> +enum coresched_policy {
> +	CORE_SCHED_DISABLED,
> +	CORE_SCHED_COOKIE_MATCH,
> +	CORE_SCHED_COOKIE_TRUST,
> +	CORE_SCHED_COOKIE_LONELY,
> +};
>
> We can set policy to CORE_COOKIE_TRUST
Re: [PATCH] sched/fair: reduce preemption with IDLE tasks runable(Internet mail)
> On Aug 3, 2020, at 4:16 PM, Dietmar Eggemann wrote: > > On 01/08/2020 04:32, Jiang Biao wrote: >> From: Jiang Biao >> >> No need to preempt when there are only one runable CFS task with >> other IDLE tasks on runqueue. The only one CFS task would always >> be picked in that case. >> >> Signed-off-by: Jiang Biao >> --- >> kernel/sched/fair.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 04fa8dbcfa4d..8fb80636b010 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -4527,7 +4527,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity >> *curr, int queued) >> return; >> #endif >> >> -if (cfs_rq->nr_running > 1) >> +if (cfs_rq->nr_running > cfs_rq.idle_h_nr_running + 1) > > cfs_rq is a pointer. It is. Sorry about that. :) > >> check_preempt_tick(cfs_rq, curr); >> } > > You can't compare cfs_rq->nr_running with cfs_rq->idle_h_nr_running! > > There is a difference between cfs_rq->h_nr_running and > cfs_rq->nr_running. The '_h_' stands for hierarchical. > > The former gives you hierarchical task accounting whereas the latter is > the number of sched entities (representing tasks or taskgroups) enqueued > in cfs_rq. > > In entity_tick(), cfs_rq->nr_running has to be used for the condition to > call check_preempt_tick(). We want to check if curr has to be preempted > by __pick_first_entity(cfs_rq) on this cfs_rq. > > entity_tick() is called for each sched entity (and so for each > cfs_rq_of(se)) of the task group hierarchy (e.g. task p running in > taskgroup /A/B : se(p) -> se(A/B) -> se(A)). That’s true. I was thinking adding a new cfs_rq->idle_nr_running member to track the per cfs_rq's IDLE task number, and reducing preemption here based on that. I’m not sure if it’s ok to do that, because the IDLE class seems not to be so pure that could tolerate starving. 
We need an absolutely low priority class that could tolerate starving, which could be used to co-locate offline tasks. But the IDLE class seems not to be *low* enough, considering the fairness of CFS, and the IDLE class still has a weight. Thanks for your reply. Regards, Jiang
Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)
Hi, > On Jul 24, 2020, at 10:05 AM, Li, Aubrey wrote: > > On 2020/7/24 9:26, benbjiang(蒋彪) wrote: >> Hi, >> >>> On Jul 24, 2020, at 7:43 AM, Aubrey Li wrote: >>> >>> On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) wrote: >>>> >>>> Hi, >>>> >>>>> On Jul 23, 2020, at 4:06 PM, Li, Aubrey wrote: >>>>> >>>>> On 2020/7/23 15:47, benbjiang(蒋彪) wrote: >>>>>> Hi, >>>>>> >>>>>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey >>>>>>> wrote: >>>>>>> >>>>>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote: >>>>>>>> Hi, >>>>>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote: >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote: >>>>>>>>>>>>>> Hi, Aubrey, >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> From: Aubrey Li >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Don't migrate if there is a cookie mismatch >>>>>>>>>>>>>>> Load balance tries to move task from busiest CPU to the >>>>>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the >>>>>>>>>>>>>>> task's cookie does not match with the destination CPU's >>>>>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This >>>>>>>>>>>>>>> mitigates the forced idle time on the destination CPU. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Select cookie matched idle CPU >>>>>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched >>>>>>>>>>>>>>> idle CPU instead of the first idle CPU. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Find cookie matched idlest CPU >>>>>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core >>>>>>>>>>>>>>> cookie matches with task's cookie >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Don't migrate task if cookie not match >>>>>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose >>>>>>>>>>>>>>> core cookie does not match with task's cookie >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Signed-off-by: Aubrey Li >>>>>>>>>>>>>>> Signed-off-by: Tim Chen >>>>>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai >>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>> kernel/sched/fair.c | 64 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> kernel/sched/sched.h | 29 >>>>>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-) >>>>>>&g
Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)
Hi, > On Jul 24, 2020, at 7:43 AM, Aubrey Li wrote: > > On Thu, Jul 23, 2020 at 4:28 PM benbjiang(蒋彪) wrote: >> >> Hi, >> >>> On Jul 23, 2020, at 4:06 PM, Li, Aubrey wrote: >>> >>> On 2020/7/23 15:47, benbjiang(蒋彪) wrote: >>>> Hi, >>>> >>>>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey wrote: >>>>> >>>>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote: >>>>>> Hi, >>>>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey >>>>>>> wrote: >>>>>>> >>>>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote: >>>>>>>>>>>> Hi, Aubrey, >>>>>>>>>>>> >>>>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> From: Aubrey Li >>>>>>>>>>>>> >>>>>>>>>>>>> - Don't migrate if there is a cookie mismatch >>>>>>>>>>>>> Load balance tries to move task from busiest CPU to the >>>>>>>>>>>>> destination CPU. When core scheduling is enabled, if the >>>>>>>>>>>>> task's cookie does not match with the destination CPU's >>>>>>>>>>>>> core cookie, this task will be skipped by this CPU. This >>>>>>>>>>>>> mitigates the forced idle time on the destination CPU. >>>>>>>>>>>>> >>>>>>>>>>>>> - Select cookie matched idle CPU >>>>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched >>>>>>>>>>>>> idle CPU instead of the first idle CPU. 
>>>>>>>>>>>>> >>>>>>>>>>>>> - Find cookie matched idlest CPU >>>>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core >>>>>>>>>>>>> cookie matches with task's cookie >>>>>>>>>>>>> >>>>>>>>>>>>> - Don't migrate task if cookie not match >>>>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose >>>>>>>>>>>>> core cookie does not match with task's cookie >>>>>>>>>>>>> >>>>>>>>>>>>> Signed-off-by: Aubrey Li >>>>>>>>>>>>> Signed-off-by: Tim Chen >>>>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai >>>>>>>>>>>>> --- >>>>>>>>>>>>> kernel/sched/fair.c | 64 >>>>>>>>>>>>> >>>>>>>>>>>>> kernel/sched/sched.h | 29 >>>>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-) >>>>>>>>>>>>> >>>>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>>>>>>>>>> index d16939766361..33dc4bf01817 100644 >>>>>>>>>>>>> --- a/kernel/sched/fair.c >>>>>>>>>>>>> +++ b/kernel/sched/fair.c >>>>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct >>>>>>>>>>>>> task_numa_env *env, >>>>>>>>>>>>>
Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)
Hi, > On Jul 23, 2020, at 4:06 PM, Li, Aubrey wrote: > > On 2020/7/23 15:47, benbjiang(蒋彪) wrote: >> Hi, >> >>> On Jul 23, 2020, at 1:39 PM, Li, Aubrey wrote: >>> >>> On 2020/7/23 12:23, benbjiang(蒋彪) wrote: >>>> Hi, >>>>> On Jul 23, 2020, at 11:35 AM, Li, Aubrey >>>>> wrote: >>>>> >>>>> On 2020/7/23 10:42, benbjiang(蒋彪) wrote: >>>>>> Hi, >>>>>> >>>>>>> On Jul 23, 2020, at 9:57 AM, Li, Aubrey >>>>>>> wrote: >>>>>>> >>>>>>> On 2020/7/22 22:32, benbjiang(蒋彪) wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>>> On Jul 22, 2020, at 8:13 PM, Li, Aubrey >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> On 2020/7/22 16:54, benbjiang(蒋彪) wrote: >>>>>>>>>> Hi, Aubrey, >>>>>>>>>> >>>>>>>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> From: Aubrey Li >>>>>>>>>>> >>>>>>>>>>> - Don't migrate if there is a cookie mismatch >>>>>>>>>>> Load balance tries to move task from busiest CPU to the >>>>>>>>>>> destination CPU. When core scheduling is enabled, if the >>>>>>>>>>> task's cookie does not match with the destination CPU's >>>>>>>>>>> core cookie, this task will be skipped by this CPU. This >>>>>>>>>>> mitigates the forced idle time on the destination CPU. >>>>>>>>>>> >>>>>>>>>>> - Select cookie matched idle CPU >>>>>>>>>>> In the fast path of task wakeup, select the first cookie matched >>>>>>>>>>> idle CPU instead of the first idle CPU. 
>>>>>>>>>>> >>>>>>>>>>> - Find cookie matched idlest CPU >>>>>>>>>>> In the slow path of task wakeup, find the idlest CPU whose core >>>>>>>>>>> cookie matches with task's cookie >>>>>>>>>>> >>>>>>>>>>> - Don't migrate task if cookie not match >>>>>>>>>>> For the NUMA load balance, don't migrate task to the CPU whose >>>>>>>>>>> core cookie does not match with task's cookie >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Aubrey Li >>>>>>>>>>> Signed-off-by: Tim Chen >>>>>>>>>>> Signed-off-by: Vineeth Remanan Pillai >>>>>>>>>>> --- >>>>>>>>>>> kernel/sched/fair.c | 64 >>>>>>>>>>> >>>>>>>>>>> kernel/sched/sched.h | 29 >>>>>>>>>>> 2 files changed, 88 insertions(+), 5 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>>>>>>>>>> index d16939766361..33dc4bf01817 100644 >>>>>>>>>>> --- a/kernel/sched/fair.c >>>>>>>>>>> +++ b/kernel/sched/fair.c >>>>>>>>>>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct >>>>>>>>>>> task_numa_env *env, >>>>>>>>>>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >>>>>>>>>>> continue; >>>>>>>>>>> >>>>>>>>>>> +#ifdef CONFIG_SCHED_CORE >>>>>>>>>>> + /* >>>>>>>>>>> +* Skip this cpu if source task's cookie does not match >>>>>>>>>>> +* with CPU's core cookie. >>>>>>>>>>> +*/ >>>>>>>>>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >>>>>>>>>>> +
Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)
Hi, > On Jul 22, 2020, at 8:13 PM, Li, Aubrey wrote: > > On 2020/7/22 16:54, benbjiang(蒋彪) wrote: >> Hi, Aubrey, >> >>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai >>> wrote: >>> >>> From: Aubrey Li >>> >>> - Don't migrate if there is a cookie mismatch >>>Load balance tries to move task from busiest CPU to the >>>destination CPU. When core scheduling is enabled, if the >>>task's cookie does not match with the destination CPU's >>>core cookie, this task will be skipped by this CPU. This >>>mitigates the forced idle time on the destination CPU. >>> >>> - Select cookie matched idle CPU >>>In the fast path of task wakeup, select the first cookie matched >>>idle CPU instead of the first idle CPU. >>> >>> - Find cookie matched idlest CPU >>>In the slow path of task wakeup, find the idlest CPU whose core >>>cookie matches with task's cookie >>> >>> - Don't migrate task if cookie not match >>>For the NUMA load balance, don't migrate task to the CPU whose >>>core cookie does not match with task's cookie >>> >>> Signed-off-by: Aubrey Li >>> Signed-off-by: Tim Chen >>> Signed-off-by: Vineeth Remanan Pillai >>> --- >>> kernel/sched/fair.c | 64 >>> kernel/sched/sched.h | 29 >>> 2 files changed, 88 insertions(+), 5 deletions(-) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index d16939766361..33dc4bf01817 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env >>> *env, >>> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >>> continue; >>> >>> +#ifdef CONFIG_SCHED_CORE >>> + /* >>> +* Skip this cpu if source task's cookie does not match >>> +* with CPU's core cookie. 
>>> + */
>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
>>> + continue;
>>> +#endif
>>> +
>>> env->dst_cpu = cpu;
>>> if (task_numa_compare(env, taskimp, groupimp, maymove))
>>> break;
>>> @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>>
>>> /* Traverse only the allowed CPUs */
>>> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
>>> + struct rq *rq = cpu_rq(i);
>>> +
>>> +#ifdef CONFIG_SCHED_CORE
>>> + if (!sched_core_cookie_match(rq, p))
>>> + continue;
>>> +#endif
>>> +
>>> if (sched_idle_cpu(i))
>>> return i;
>>>
>>> if (available_idle_cpu(i)) {
>>> - struct rq *rq = cpu_rq(i);
>>> struct cpuidle_state *idle = idle_get_state(rq);
>>> if (idle && idle->exit_latency < min_exit_latency) {
>>> /*
>>> @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>> for_each_cpu_wrap(cpu, cpus, target) {
>>> if (!--nr)
>>> return -1;
>>> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
>>> - break;
>>> +
>>> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
>>> +#ifdef CONFIG_SCHED_CORE
>>> + /*
>>> + * If Core Scheduling is enabled, select this cpu
>>> + * only if the process cookie matches core cookie.
>>> + */
>>> + if (sched_core_enabled(cpu_rq(cpu)) &&
>>> + p->core_cookie == cpu_rq(cpu)->core->core_cookie)
>> Why not also add similar logic in select_idle_smt to reduce forced-idle? :)
> We hit select_idle_smt after we scanned the entire LLC domain for idle cores
> and idle cpus and failed, so IMHO, an idle smt is probably a good choice under
> this scenario.
AFAIC, selecting an idle sibling with an unmatched cookie will cause unnecessary forced idle, unfairness and latency, compared to choosing *target* cp
Re: [RFC PATCH 11/16] sched: migration changes for core scheduling(Internet mail)
Hi, Aubrey, > On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai > wrote: > > From: Aubrey Li > > - Don't migrate if there is a cookie mismatch > Load balance tries to move task from busiest CPU to the > destination CPU. When core scheduling is enabled, if the > task's cookie does not match with the destination CPU's > core cookie, this task will be skipped by this CPU. This > mitigates the forced idle time on the destination CPU. > > - Select cookie matched idle CPU > In the fast path of task wakeup, select the first cookie matched > idle CPU instead of the first idle CPU. > > - Find cookie matched idlest CPU > In the slow path of task wakeup, find the idlest CPU whose core > cookie matches with task's cookie > > - Don't migrate task if cookie not match > For the NUMA load balance, don't migrate task to the CPU whose > core cookie does not match with task's cookie > > Signed-off-by: Aubrey Li > Signed-off-by: Tim Chen > Signed-off-by: Vineeth Remanan Pillai > --- > kernel/sched/fair.c | 64 > kernel/sched/sched.h | 29 > 2 files changed, 88 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d16939766361..33dc4bf01817 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2051,6 +2051,15 @@ static void task_numa_find_cpu(struct task_numa_env > *env, > if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) > continue; > > +#ifdef CONFIG_SCHED_CORE > + /* > + * Skip this cpu if source task's cookie does not match > + * with CPU's core cookie. 
> + */ > + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) > + continue; > +#endif > + > env->dst_cpu = cpu; > if (task_numa_compare(env, taskimp, groupimp, maymove)) > break; > @@ -5963,11 +5972,17 @@ find_idlest_group_cpu(struct sched_group *group, > struct task_struct *p, int this > > /* Traverse only the allowed CPUs */ > for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { > + struct rq *rq = cpu_rq(i); > + > +#ifdef CONFIG_SCHED_CORE > + if (!sched_core_cookie_match(rq, p)) > + continue; > +#endif > + > if (sched_idle_cpu(i)) > return i; > > if (available_idle_cpu(i)) { > - struct rq *rq = cpu_rq(i); > struct cpuidle_state *idle = idle_get_state(rq); > if (idle && idle->exit_latency < min_exit_latency) { > /* > @@ -6224,8 +6239,18 @@ static int select_idle_cpu(struct task_struct *p, > struct sched_domain *sd, int t > for_each_cpu_wrap(cpu, cpus, target) { > if (!--nr) > return -1; > - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) > - break; > + > + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) { > +#ifdef CONFIG_SCHED_CORE > + /* > + * If Core Scheduling is enabled, select this cpu > + * only if the process cookie matches core cookie. > + */ > + if (sched_core_enabled(cpu_rq(cpu)) && > + p->core_cookie == cpu_rq(cpu)->core->core_cookie) Why not also add similar logic in select_idle_smt to reduce forced-idle? :) > +#endif > + break; > + } > } > > time = cpu_clock(this) - time; > @@ -7609,8 +7634,9 @@ int can_migrate_task(struct task_struct *p, struct > lb_env *env) >* We do not migrate tasks that are: >* 1) throttled_lb_pair, or >* 2) cannot be migrated to this CPU due to cpus_ptr, or > - * 3) running (obviously), or > - * 4) are cache-hot on their current CPU. > + * 3) task's cookie does not match with this CPU's core cookie > + * 4) running (obviously), or > + * 5) are cache-hot on their current CPU. 
>*/ > if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) > return 0; > @@ -7645,6 +7671,15 @@ int can_migrate_task(struct task_struct *p, struct > lb_env *env) > return 0; > } > > +#ifdef CONFIG_SCHED_CORE > + /* > + * Don't migrate task if the task's cookie does not match > + * with the destination CPU's core cookie. > + */ > + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) > + return 0; > +#endif > + > /* Record that we found atleast one task that could run on dst_cpu */ > env->flags &= ~LBF_ALL_PINNED; > > @@ -8857,6 +8892,25 @@ find_idlest_group(struct sched_domain *sd, struct > task_struct *p, > p->cpus_ptr)) >
Re: [RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case(Internet mail)
> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>
> From: vpillai
>
> If there is only one long running local task and the sibling is
> forced idle, it might not get a chance to run until a schedule
> event happens on any cpu in the core.
>
> So we check for this condition during a tick to see if a sibling
> is starved and then give it a chance to schedule.

Hi,

There may be other similar starvation cases that this patch cannot cover. For example, if there is one long-running RT task and the sibling is forced idle, then all tasks with different cookies on all siblings could starve forever. The current load balancer does not seem able to pull the starved tasks away. Could load balancing be made more aware of core scheduling to improve this? :)

Thx.
Regards,
Jiang

>
> Signed-off-by: Vineeth Remanan Pillai
> Signed-off-by: Julien Desfossez
> ---
> kernel/sched/fair.c | 39 +++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ae17507533a0..49fb93296e35 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq)
>
> #endif /* CONFIG_SMP */
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline bool
> +__entity_slice_used(struct sched_entity *se)
> +{
> + return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
> + sched_slice(cfs_rq_of(se), se);
> +}
> +
> +/*
> + * If runqueue has only one task which used up its slice and if the sibling
> + * is forced idle, then trigger schedule to give forced idle task a chance.
> + */
> +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se)
> +{
> + int cpu = cpu_of(rq), sibling_cpu;
> +
> + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
> + return;
> +
> + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
> + struct rq *sibling_rq;
> + if (sibling_cpu == cpu)
> + continue;
> + if (cpu_is_offline(sibling_cpu))
> + continue;
> +
> + sibling_rq = cpu_rq(sibling_cpu);
> + if (sibling_rq->core_forceidle) {
> + resched_curr(sibling_rq);
> + }
> + }
> +}
> +#endif
> +
> /*
> * scheduler tick hitting a task of our scheduling class.
> *
> @@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq))
> + resched_forceidle_sibling(rq, &curr->se);
> +#endif
> }
>
> /*
> --
> 2.17.1
>
>
Re: [RFC PATCH 09/16] sched/fair: core wide cfs task priority comparison(Internet mail)
> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>
> From: Aaron Lu
>
> This patch provides a vruntime based way to compare two cfs task's
> priority, be it on the same cpu or different threads of the same core.
>
> When the two tasks are on the same CPU, we just need to find a common
> cfs_rq both sched_entities are on and then do the comparison.
>
> When the two tasks are on different threads of the same core, each thread
> will choose the next task to run the usual way and then the root level
> sched entities which the two tasks belong to will be used to decide
> which task runs next core wide.
>
> An illustration for the cross CPU case:
>
>        cpu0            cpu1
>      /  |  \         /  |  \
>    se1 se2 se3     se4 se5 se6
>       /  \            /  \
>    se21  se22      se61  se62
>    (A)                    /
>                        se621
>                         (B)
>
> Assume CPU0 and CPU1 are smt siblings and cpu0 has decided task A to
> run next and cpu1 has decided task B to run next. To compare priority
> of task A and B, we compare priority of se2 and se6. Whose vruntime is
> smaller, who wins.
>
> To make this work, the root level sched entities' vruntime of the two
> threads must be directly comparable. So one of the hyperthread's root
> cfs_rq's min_vruntime is chosen as the core wide one and all root level
> sched entities' vruntime is normalized against it.
>
> All sub cfs_rqs and sched entities are not interesting in cross cpu
> priority comparison as they will only participate in the usual cpu local
> schedule decision so no need to normalize their vruntimes.
>
> Signed-off-by: Aaron Lu
> ---
> kernel/sched/core.c | 23 +++
> kernel/sched/fair.c | 142 ++-
> kernel/sched/sched.h | 3 +
> 3 files changed, 150 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f51e5c4798c8..4d6d6a678013 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -114,19 +114,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
> if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> return !dl_time_before(a->dl.deadline, b->dl.deadline);
>
> - if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> - u64 vruntime = b->se.vruntime;
> -
> - /*
> - * Normalize the vruntime if tasks are in different cpus.
> - */
> - if (task_cpu(a) != task_cpu(b)) {
> - vruntime -= task_cfs_rq(b)->min_vruntime;
> - vruntime += task_cfs_rq(a)->min_vruntime;
> - }
> -
> - return !((s64)(a->se.vruntime - vruntime) <= 0);
> - }
> + if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
> + return cfs_prio_less(a, b);
>
> return false;
> }
> @@ -229,8 +218,12 @@ static int __sched_core_stopper(void *data)
> bool enabled = !!(unsigned long)data;
> int cpu;
>
> - for_each_possible_cpu(cpu)
> - cpu_rq(cpu)->core_enabled = enabled;
> + for_each_possible_cpu(cpu) {
> + struct rq *rq = cpu_rq(cpu);
> + rq->core_enabled = enabled;
> + if (cpu_online(cpu) && rq->core != rq)
> + sched_core_adjust_sibling_vruntime(cpu, enabled);
> + }
>
> return 0;
> }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 61d19e573443..d16939766361 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -462,11 +462,142 @@ find_matching_se(struct sched_entity **se, struct sched_entity **pse)
>
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> +#ifdef CONFIG_SCHED_CORE
> +static inline struct cfs_rq *root_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + return &rq_of(cfs_rq)->cfs;
> +}
> +
> +static inline bool is_root_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + return cfs_rq == root_cfs_rq(cfs_rq);
> +}
> +
+static inline struct cfs_rq *core_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + return &rq_of(cfs_rq)->core->cfs;
> +}
> +
> +static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
> +{
> + if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
> + return cfs_rq->min_vruntime;
> +
> + return core_cfs_rq(cfs_rq)->min_vruntime;
> +}
> +
> +#ifndef CONFIG_64BIT
> +static inline u64 cfs_rq_min_vruntime_copy(struct cfs_rq *cfs_rq)
> +{
> + if (!sched_core_enabled(rq_of(cfs_rq)) || !is_root_cfs_rq(cfs_rq))
> + return cfs_rq->min_vruntime_copy;
> +
> + return core_cfs_rq(cfs_rq)->min_vruntime_copy;
> +}
> +#endif /* CONFIG_64BIT */
> +
> +bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
> +{
> + struct sched_entity *sea = &a->se;
> + struct sched_entity *seb = &b->se;
> + bool samecpu = task_cpu(a) == task_cpu(b);
> + s64 delta;
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> + if (samecpu) {
> + /* vruntime is per cfs_rq */
> +
Re: [RFC PATCH 05/16] sched: Basic tracking of matching tasks(Internet mail)
Hi, Peter,

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>
> From: Peter Zijlstra
>
> Introduce task_struct::core_cookie as an opaque identifier for core
> scheduling. When enabled; core scheduling will only allow matching
> task to be on the core; where idle matches everything.
>
> When task_struct::core_cookie is set (and core scheduling is enabled)
> these tasks are indexed in a second RB-tree, first on cookie value
> then on scheduling function, such that matching task selection always
> finds the most eligible match.
>
> NOTE: *shudder* at the overhead...
>
> NOTE: *sigh*, a 3rd copy of the scheduling function; the alternative
> is per class tracking of cookies and that just duplicates a lot of
> stuff for no raisin (the 2nd copy lives in the rt-mutex PI code).
>
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Vineeth Remanan Pillai
> Signed-off-by: Julien Desfossez
> ---
> include/linux/sched.h | 8 ++-
> kernel/sched/core.c | 146 ++
> kernel/sched/fair.c | 46 -
> kernel/sched/sched.h | 55
> 4 files changed, 208 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4418f5cb8324..3c8dcc5ff039 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -683,10 +683,16 @@ struct task_struct {
> const struct sched_class *sched_class;
> struct sched_entity se;
> struct sched_rt_entity rt;
> + struct sched_dl_entity dl;
> +
> +#ifdef CONFIG_SCHED_CORE
> + struct rb_node core_node;
> + unsigned long core_cookie;
> +#endif
> +
> #ifdef CONFIG_CGROUP_SCHED
> struct task_group *sched_task_group;
> #endif
> - struct sched_dl_entity dl;
>
> #ifdef CONFIG_UCLAMP_TASK
> /* Clamp values requested for a scheduling entity */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4b81301e3f21..b21bcab20da6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -77,6 +77,141 @@ int sysctl_sched_rt_runtime = 950000;
>
> DEFINE_STATIC_KEY_FALSE(__sched_core_enabled);
>
> +/*
kernel prio, less is more */
> +static inline int __task_prio(struct task_struct *p)
> +{
> + if (p->sched_class == &stop_sched_class) /* trumps deadline */
> + return -2;
> +
> + if (rt_prio(p->prio)) /* includes deadline */
> + return p->prio; /* [-1, 99] */
> +
> + if (p->sched_class == &idle_sched_class)
> + return MAX_RT_PRIO + NICE_WIDTH; /* 140 */

return MAX_PRIO; — maybe simpler?

> +
> + return MAX_RT_PRIO + MAX_NICE; /* 120, squash fair */

MAX_RT_PRIO (100) + MAX_NICE (19) should be 119, not 120? :)

Thx.
Regards,
Jiang

> +}
> +
> +/*
> + * l(a,b)
> + * le(a,b) := !l(b,a)
> + * g(a,b)  := l(b,a)
> + * ge(a,b) := !l(a,b)
> + */
> +
> +/* real prio, less is less */
> +static inline bool prio_less(struct task_struct *a, struct task_struct *b)
> +{
> +
> + int pa = __task_prio(a), pb = __task_prio(b);
> +
> + if (-pa < -pb)
> + return true;
> +
> + if (-pb < -pa)
> + return false;
> +
> + if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
> + return !dl_time_before(a->dl.deadline, b->dl.deadline);
> +
> + if (pa == MAX_RT_PRIO + MAX_NICE) { /* fair */
> + u64 vruntime = b->se.vruntime;
> +
> + /*
> + * Normalize the vruntime if tasks are in different cpus.
> + */
> + if (task_cpu(a) != task_cpu(b)) {
> + vruntime -= task_cfs_rq(b)->min_vruntime;
> + vruntime += task_cfs_rq(a)->min_vruntime;
> + }
> +
> + return !((s64)(a->se.vruntime - vruntime) <= 0);
> + }
> +
> + return false;
> +}
> +
> +static inline bool __sched_core_less(struct task_struct *a, struct task_struct *b)
> +{
> + if (a->core_cookie < b->core_cookie)
> + return true;
> +
> + if (a->core_cookie > b->core_cookie)
> + return false;
> +
> + /* flip prio, so high prio is leftmost */
> + if (prio_less(b, a))
> + return true;
> +
> + return false;
> +}
> +
> +static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
> +{
> + struct rb_node *parent, **node;
> + struct task_struct *node_task;
> +
> + rq->core->core_task_seq++;
> +
> + if (!p->core_cookie)
> + return;
> +
> + node = &rq->core_tree.rb_node;
> + parent = *node;
> +
> + while (*node) {
> + node_task = container_of(*node, struct task_struct, core_node);
> + parent = *node;
> +
> + if (__sched_core_less(p, node_task))
> + node = &parent->rb_left;
> + else
> + node = &parent->rb_right;
> + }
> +
> + rb_link_node(&p->core_node, parent, node);
> +
Re: [RFC PATCH 07/16] sched/fair: Fix forced idle sibling starvation corner case(Internet mail)
Hi, > On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai > wrote: > > From: vpillai > > If there is only one long running local task and the sibling is > forced idle, it might not get a chance to run until a schedule > event happens on any cpu in the core. > > So we check for this condition during a tick to see if a sibling > is starved and then give it a chance to schedule. > > Signed-off-by: Vineeth Remanan Pillai > Signed-off-by: Julien Desfossez > --- > kernel/sched/fair.c | 39 +++ > 1 file changed, 39 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ae17507533a0..49fb93296e35 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -10613,6 +10613,40 @@ static void rq_offline_fair(struct rq *rq) > > #endif /* CONFIG_SMP */ > > +#ifdef CONFIG_SCHED_CORE > +static inline bool > +__entity_slice_used(struct sched_entity *se) > +{ > + return (se->sum_exec_runtime - se->prev_sum_exec_runtime) > > + sched_slice(cfs_rq_of(se), se); > +} > + > +/* > + * If runqueue has only one task which used up its slice and if the sibling > + * is forced idle, then trigger schedule to give forced idle task a chance. > + */ > +static void resched_forceidle_sibling(struct rq *rq, struct sched_entity *se) > +{ > + int cpu = cpu_of(rq), sibling_cpu; > + > + if (rq->cfs.nr_running > 1 || !__entity_slice_used(se)) > + return; > + > + for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) { > + struct rq *sibling_rq; > + if (sibling_cpu == cpu) > + continue; > + if (cpu_is_offline(sibling_cpu)) > + continue; > + > + sibling_rq = cpu_rq(sibling_cpu); > + if (sibling_rq->core_forceidle) { > + resched_curr(sibling_rq); > + } > + } > +} > +#endif > + > /* > * scheduler tick hitting a task of our scheduling class. 
> *
> @@ -10636,6 +10670,11 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
> update_misfit_status(curr, rq);
> update_overutilized_status(task_rq(curr));
> +
> +#ifdef CONFIG_SCHED_CORE
> + if (sched_core_enabled(rq))
> + resched_forceidle_sibling(rq, &curr->se);
> +#endif

Hi,

resched_forceidle_sibling() depends on the tick, but there could be no tick for 1s (scheduler_tick_max_deferment) after entering nohz_full mode. And with nohz_full enabled, a cpu will enter nohz_full mode frequently precisely when *there is only one long running local task*. That means rescheduling of the sibling could be delayed far beyond sched_slice(), which could be unfair and result in large latency.

Should we prevent a cpu with a forced-idle sibling from entering nohz_full mode, by adding a specific flag and checking it before stopping the tick? Or we could do the rescheduling of siblings in task_tick_idle by checking the starvation time. :)

Thx
Regards,
Jiang

> }
>
> /*
> --
> 2.17.1
>
>
Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)
> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>
> From: Peter Zijlstra
>
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
>
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Joel Fernandes (Google)
> Acked-by: Paul E. McKenney
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 131 +-
> kernel/sched/idle.c | 1 +
> kernel/sched/sched.h | 6 ++
> 4 files changed, 138 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3c8dcc5ff039..4f9edf013df3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -688,6 +688,7 @@ struct task_struct {
> #ifdef CONFIG_SCHED_CORE
> struct rb_node core_node;
> unsigned long core_cookie;
> + unsigned int core_occupation;
> #endif
>
> #ifdef CONFIG_CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d6d6a678013..fb9edb09ead7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -201,6 +201,21 @@ static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
> return match;
> }
>
> +static struct task_struct *sched_core_next(struct task_struct *p, unsigned long cookie)
> +{
> + struct rb_node *node = &p->core_node;
> +
> + node = rb_next(node);
> + if (!node)
> + return NULL;
> +
> + p = container_of(node, struct task_struct, core_node);
> + if (p->core_cookie != cookie)
> + return NULL;
> +
> + return p;
> +}
> +
> /*
> * The static-key + stop-machine variable are needed such that:
> *
> @@ -4233,7 +4248,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> struct task_struct *next, *max = NULL;
> const struct sched_class *class;
> const struct cpumask *smt_mask;
> - int i, j, cpu;
> + int i, j, cpu, occ = 0;
> bool need_sync;
>
> if (!sched_core_enabled(rq))
> @@ -4332,6
+4347,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, > struct rq_flags *rf) > goto done; > } > > + if (!is_idle_task(p)) > + occ++; > + > rq_i->core_pick = p; > > /* > @@ -4357,6 +4375,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, > struct rq_flags *rf) > > cpu_rq(j)->core_pick = NULL; > } > + occ = 1; > goto again; > } else { > /* > @@ -4393,6 +4412,8 @@ next_class:; > if (is_idle_task(rq_i->core_pick) && rq_i->nr_running) > rq_i->core_forceidle = true; > > + rq_i->core_pick->core_occupation = occ; > + > if (i == cpu) > continue; > > @@ -4408,6 +4429,114 @@ next_class:; > return next; > } > > +static bool try_steal_cookie(int this, int that) > +{ > + struct rq *dst = cpu_rq(this), *src = cpu_rq(that); > + struct task_struct *p; > + unsigned long cookie; > + bool success = false; > + > + local_irq_disable(); > + double_rq_lock(dst, src); > + > + cookie = dst->core->core_cookie; > + if (!cookie) > + goto unlock; > + > + if (dst->curr != dst->idle) > + goto unlock; > + Could it be ok to add another fast return here? if (src->nr_running == 1) goto unlock; When src cpu has only 1 running task, no need to pull and no need to do sched_core_find. Thx. Regards, Jiang > + p = sched_core_find(src, cookie); > + if (p == src->idle) > + goto unlock; > + > + do { > + if (p == src->core_pick || p == src->curr) > + goto next; > + > + if (!cpumask_test_cpu(this, >cpus_mask)) > + goto next; > + > + if (p->core_occupation > dst->idle->core_occupation) > + goto next; > + > + p->on_rq = TASK_ON_RQ_MIGRATING; > + deactivate_task(src, p, 0); > + set_task_cpu(p, this); > + activate_task(dst, p, 0); > + p->on_rq = TASK_ON_RQ_QUEUED; > + > + resched_curr(dst); > + > + success = true; > + break; > + > +next: > + p = sched_core_next(p, cookie); > + } while (p); > + > +unlock: > + double_rq_unlock(dst, src); > + local_irq_enable(); > + > + return success; > +} > + > +static bool steal_cookie_task(int cpu, struct sched_domain *sd) >
Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)
> On Jul 20, 2020, at 4:03 PM, Li, Aubrey wrote:
>
> On 2020/7/20 15:23, benbjiang(蒋彪) wrote:
>> Hi,
>>
>>> On Jul 20, 2020, at 2:06 PM, Li, Aubrey wrote:
>>>
>>> On 2020/7/20 12:06, benbjiang(蒋彪) wrote:
>>>> Hi,
>>>>
>>>>> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>>>>>
>>>>> From: Peter Zijlstra
>>>>>
>>>>> When a sibling is forced-idle to match the core-cookie; search for
>>>>> matching tasks to fill the core.
>>>>>
>>>>> rcu_read_unlock() can incur an infrequent deadlock in
>>>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>>>>>
>>>>> Signed-off-by: Peter Zijlstra (Intel)
>>>>> Signed-off-by: Joel Fernandes (Google)
>>>>> Acked-by: Paul E. McKenney
>>>>>
>>>>> [...]
Re: [RFC PATCH 10/16] sched: Trivial forced-newidle balancer(Internet mail)
Hi,

> On Jul 1, 2020, at 5:32 AM, Vineeth Remanan Pillai wrote:
>
> From: Peter Zijlstra
>
> When a sibling is forced-idle to match the core-cookie; search for
> matching tasks to fill the core.
>
> rcu_read_unlock() can incur an infrequent deadlock in
> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
>
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Joel Fernandes (Google)
> Acked-by: Paul E. McKenney
> ---
> include/linux/sched.h |   1 +
> kernel/sched/core.c   | 131 +-
> kernel/sched/idle.c   |   1 +
> kernel/sched/sched.h  |   6 ++
> 4 files changed, 138 insertions(+), 1 deletion(-)
>
> [...]
>
> +static bool steal_cookie_task(int cpu, struct sched_domain *sd)
> +{
> +	int i;
> +
> +	for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {

Since (i == cpu) should be skipped, should we start the iteration at cpu+1?
Like,

	for_each_cpu_wrap(i, sched_domain_span(sd), cpu+1) {
Re: [PATCH RFC] sched: Add a per-thread core scheduling interface(Internet mail)
> On May 21, 2020, at 6:26 AM, Joel Fernandes (Google) wrote:
>
> Add a per-thread core scheduling interface which allows a thread to tag
> itself and enable core scheduling. Based on discussion at OSPM with
> maintainers, we propose a prctl(2) interface accepting values of 0 or 1.
>  1 - enable core scheduling for the task.
>  0 - disable core scheduling for the task.
>
> Special cases:
>
> (1)
> The core-scheduling patchset contains a CGroup interface as well. In
> order for us to respect users of that interface, we avoid overriding the
> tag if a task was CGroup-tagged because the task becomes inconsistent
> with the CGroup tag. Instead return -EBUSY.
>
> (2)
> If a task is prctl-tagged, allow the CGroup interface to override
> the task's tag.
>
> ChromeOS will use core-scheduling to securely enable hyperthreading.
> This cuts down the keypress latency in Google docs from 150ms to 50ms
> while improving the camera streaming frame rate by ~3%.

Hi,

Are the performance improvements compared to the hyperthreading-disabled
scenario or not?
Could you help to explain how the keypress latency improvement comes with
core-scheduling?
Thanks a lot.

Regards,
Jiang
Re: [PATCH] rcu/tree_exp: cleanup initialized but not used rdp(Internet mail)
> On Apr 23, 2019, at 4:06 PM, Paul E. McKenney wrote:
>
> On Tue, Apr 23, 2019 at 09:21:55AM +0800, Jiang Biao wrote:
>> rdp is initialized but never used in synchronize_rcu_expedited(),
>> just remove it.
>>
>> Signed-off-by: Jiang Biao
>
> I queued and pushed both patches for testing and review, good eyes,
> thank you! I reworded the subject and commit logs, so could you please
> check to make sure that I didn't mess anything up?
>
> 							Thanx, Paul
>
Already checked, everything's good to me.
Thanks a lot.

Regards,
Jiang

>> ---
>> kernel/rcu/tree_exp.h | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index 4c2a0189e748..5772612379e4 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -733,7 +733,6 @@ static void sync_sched_exp_online_cleanup(int cpu)
>>   */
>>  void synchronize_rcu_expedited(void)
>>  {
>> -	struct rcu_data *rdp;
>>  	struct rcu_exp_work rew;
>>  	struct rcu_node *rnp;
>>  	unsigned long s;
>> @@ -770,7 +769,6 @@ void synchronize_rcu_expedited(void)
>>  	}
>>
>>  	/* Wait for expedited grace period to complete. */
>> -	rdp = per_cpu_ptr(&rcu_data, raw_smp_processor_id());
>>  	rnp = rcu_get_root();
>>  	wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
>>  		   sync_exp_work_done(s));
>> --
>> 2.17.2 (Apple Git-113)