Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote:
> and then
> se->vruntime = max_vruntime(se->vruntime, vruntime)
>

First things first, I was wrong to assume a "boost" in the CFS code. So I
dug a bit deeper and tried to pinpoint what the difference between CFS and
EEVDF actually is. I found the following:

Let's assume we have two tasks taking turns on a single CPU.
Task 1 is always runnable.
Task 2 gets woken up by task 1 and goes back to sleep when it is done.
This means task 1 runs, wakes up task 2, task 2 runs, goes to sleep, and
task 1 runs again, and we repeat.

Most of the time: runtime(task1) > runtime(task2)
Rare occasions:   runtime(task1) < runtime(task2)

So, task 1 usually consumes more of its designated time slice, until it gets
rescheduled by the wakeup of task 2, than task 2 does. But neither ever
consumes its full time slice. Rather the opposite, both run for low 5-digit
ns or less.

So something like this:

task 1 |--|  |-|  |--...
task 2     ||   ||

This creates different behaviors under CFS and EEVDF:

### CFS

In CFS the difference in runtimes means that task 2 cannot catch up with
task 1 vruntime-wise.

With every exchange between task 1 and task 2, task 2 falls further behind
in vruntime. Once a difference in the magnitude of sysctl_sched_latency is
established, the difference remains stable due to the max handling in
place_entity.

Occasionally, task 2 may run longer than task 1. In those cases, it will
catch up slightly. But in the majority of cases, task 2 runs shorter, thereby
increasing the difference in vruntime.

This would explain why task 2 always gets scheduled immediately on wakeup.

### EEVDF

The rare occasions where task 2 runs longer than task 1 seem to cause issues
with EEVDF:

In the regular case, where task 1 runs longer than task 2, task 2 gets a
positive lag and is selected on wakeup --> good.
In the irregular case, where task 2 runs longer than task 1, task 2 gets a
negative lag and is no longer chosen on wakeup --> bad (in some cases).

This would explain why task 2 is occasionally not selected on wakeup.

### Summary

So my wording, that a woken up task gets "boosted", was obviously wrong.
Task 2 is not getting boosted in CFS; it gets "outrun" by task 1, with no
chance of catching up, leaving it with a smaller vruntime value.

EEVDF, on the other hand, does not allow lag to accumulate if an entity, like
task 2 in this case, regularly dequeues itself. So its lag is always bounded
by whatever difference it encountered in comparison to the runtime of task 1.

The patch below allows tasks to accumulate lag over time. This fixes the
original regression that made me stumble into this topic. But this might of
course come with arbitrary side effects.

I'm not suggesting to actually implement this, but would like to confirm
whether my understanding is correct that this is the aspect where CFS and
EEVDF differ: CFS is more aware of the past in this particular case than
EEVDF is.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 lag, limit;
 
 	SCHED_WARN_ON(!se->on_rq);
-	lag = avg_vruntime(cfs_rq) - se->vruntime;
+	lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
 	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
 	se->vlag = clamp(lag, -limit, limit);
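The arithmetic of the one-line change in the diff above can be tried out in user space. The following is only a sketch under stated assumptions: the function names are made up for illustration, avg_vruntime and limit are passed in as plain numbers instead of being derived from a cfs_rq, and the effect that the stored lag has on the next placement is ignored. It mirrors the two variants of the lag assignment, nothing more.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v into [lo, hi]; stands in for the kernel's clamp() macro. */
static int64_t clamp_s64(int64_t v, int64_t lo, int64_t hi)
{
	assert(lo <= hi);
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* Unpatched variant: the lag is recomputed from scratch on every dequeue. */
int64_t lag_fresh(int64_t avg_vruntime, int64_t vruntime, int64_t limit)
{
	return clamp_s64(avg_vruntime - vruntime, -limit, limit);
}

/* Patched variant: the previously stored lag is carried into the new value. */
int64_t lag_accumulated(int64_t prev_vlag, int64_t avg_vruntime,
			int64_t vruntime, int64_t limit)
{
	return clamp_s64(prev_vlag + avg_vruntime - vruntime, -limit, limit);
}
```

With these helpers, a single outlier that would yield a lag of -500 on its own still leaves a positive value if 2000 units of lag were saved up before, which is the "awareness of the past" described above. Both variants remain bounded by the same limit.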
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Thu, 21 Mar 2024 at 13:18, Tobias Huschle wrote:
>
> On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> > On Wed, 20 Mar 2024 at 08:04, Tobias Huschle wrote:
> > > There was no guarantee of course. place_entity was reducing the vruntime
> > > of woken up tasks though, giving it a slight boost, right?. For the
> > > scenario
> >
> > It was rather the opposite, It was ensuring that long sleeping tasks
> > will not get too much bonus because of vruntime too far in the past.
> > This is similar although not exactly the same intent as the lag. The
> > bonus was up to 24ms previously whereas it's not more than a slice now
> >
>
> I might have gotten this quite wrong then. I was looking at place_entity
> and saw that non-initial placements get their vruntime reduced via
>
>     vruntime -= thresh;

and then
se->vruntime = max_vruntime(se->vruntime, vruntime)

>
> which would mean that the placed task would have a vruntime smaller than
> cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:
>
>    af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime
>
> If there was no such benefit for woken up tasks, then the scenario I observed
> is just coincidentally worse with EEVDF, which can happen when exchanging an
> algorithm I suppose. Or EEVDF just exposes a so far hidden problem in that
> scenario.
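The two statements discussed in this message (the subtraction of thresh, followed by the max_vruntime() call) can be put side by side in a small user-space sketch. This is an illustration under assumptions, not the kernel function: place_wakeup_vruntime() is a hypothetical name, thresh is simply passed in, and everything else that place_entity did is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe maximum of two vruntime values (signed delta compare). */
static uint64_t max_vruntime(uint64_t max_v, uint64_t v)
{
	int64_t delta = (int64_t)(v - max_v);

	if (delta > 0)
		max_v = v;
	return max_v;
}

/*
 * Hypothetical helper: vruntime of a woken entity after the pre-EEVDF
 * placement. The entity keeps its own vruntime unless that lies more
 * than thresh behind min_vruntime, in which case it is pulled forward.
 */
uint64_t place_wakeup_vruntime(uint64_t se_vruntime, uint64_t min_vruntime,
			       uint64_t thresh)
{
	uint64_t vruntime = min_vruntime - thresh;

	assert(thresh <= INT64_MAX);
	return max_vruntime(se_vruntime, vruntime);
}
```

A long sleeper with vruntime 100 on a queue at min_vruntime 10000 and thresh 3000 ends up at 7000, so its credit is capped at thresh. An entity at 9500 is left untouched. In other words, the max limits the bonus rather than granting one.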
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> On Wed, 20 Mar 2024 at 08:04, Tobias Huschle wrote:
> > There was no guarantee of course. place_entity was reducing the vruntime of
> > woken up tasks though, giving it a slight boost, right?. For the scenario
>
> It was rather the opposite, It was ensuring that long sleeping tasks
> will not get too much bonus because of vruntime too far in the past.
> This is similar although not exactly the same intent as the lag. The
> bonus was up to 24ms previously whereas it's not more than a slice now
>

I might have gotten this quite wrong then. I was looking at place_entity
and saw that non-initial placements get their vruntime reduced via

    vruntime -= thresh;

which would mean that the placed task would have a vruntime smaller than
cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:

   af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime

If there was no such benefit for woken up tasks, then the scenario I observed
is just coincidentally worse with EEVDF, which can happen when exchanging an
algorithm I suppose. Or EEVDF just exposes a so far hidden problem in that
scenario.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Wed, 20 Mar 2024 at 08:04, Tobias Huschle wrote:
>
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
> > > ...
> > >
> > > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > > eligibility.
> > >
> > > I'm inclined to rather propose a documentation change, which
> > > describes that tasks should not rely on woken up tasks being scheduled
> > > immediately.
> >
> > Where do you see such an assumption ? Even before eevdf, there was
> > nothing that ensures such behavior. When using CFS (legacy or eevdf)
> > tasks, you can't know if the newly woken task will run 1st or not
> >
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario

It was rather the opposite. It was ensuring that long sleeping tasks
will not get too much bonus because of a vruntime too far in the past.
This is similar to, although not exactly the same intent as, the lag. The
bonus was up to 24ms previously whereas it's not more than a slice now.

> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
>
> Dismissing the lag on wakeup also does obviously not guarantee getting
> scheduled, as other tasks might still be involved.
>
> The question would be if it should be explicitly mentioned somewhere that,
> at this point, woken up tasks are not getting any special treatment and
> no one should rely on that boost for woken up tasks.
>
> > >
> > > Changing things in the code to address the specific scenario I'm
> > > seeing seems to mostly create unwanted side effects and/or would require
> > > the definition of some magic cut-off values.
> > >
> > >
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
>>>> <...>
>>>>
>>>> Thanks for providing the above details Tobias. It does seem like EEVDF
>>>> is strict about the eligibility checks and making tasks wait when their
>>>> lags are negative, even if just a little bit as in the case of the
>>>> kworker.
>>>>
>>>> There was a patch to disable the eligibility checks
>>>> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
>>>> which would make EEVDF more like EVDF, though the deadline comparison
>>>> would probably still favor the vhost task instead of the kworker with
>>>> the negative lag.
>>>>
>>>> I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption ? Even before eevdf, there was
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly woken task will run 1st or not
>>
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario
> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.

It seems that way, as EEVDF will do eligibility and deadline checks before
scheduling a task, so a task would have to satisfy both of those checks.

I think we have some special treatment for when a task initially joins the
competition, in which case we halve its slice. But I don't think there is any
special treatment for woken tasks anymore.

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to
reduce the number of wakeup preemptions under some conditions, under the
RUN_TO_PARITY feature.
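The "both checks" point above can be sketched as a linear scan. This is an assumption-laden illustration, not the kernel's tree-based pick: struct entity and pick_next_sketch() are hypothetical, avg_vruntime is handed in as a plain number, and "eligible" is taken to mean non-negative lag, i.e. vruntime not ahead of the average.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for a scheduling entity. */
struct entity {
	int64_t vruntime;
	int64_t deadline;
};

/* Eligible means lag >= 0, i.e. the entity is not ahead of the average. */
static int entity_eligible_sketch(const struct entity *se, int64_t avg_vruntime)
{
	return se->vruntime <= avg_vruntime;
}

/* Among the eligible entities, return the one with the earliest deadline. */
const struct entity *pick_next_sketch(const struct entity *rq, size_t n,
				      int64_t avg_vruntime)
{
	const struct entity *best = NULL;

	assert(rq != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!entity_eligible_sketch(&rq[i], avg_vruntime))
			continue;	/* negative lag: has to wait */
		if (!best || rq[i].deadline < best->deadline)
			best = &rq[i];
	}
	return best;	/* NULL if nothing is eligible */
}
```

In this model an entity with the earlier deadline is still skipped when its vruntime is even slightly ahead of the average, which matches the kworker symptom discussed in this thread.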
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
> >
> > On 2024-03-18 15:45, Luis Machado wrote:
> > > <...>
> > >
> > > Thanks for providing the above details Tobias. It does seem like EEVDF
> > > is strict about the eligibility checks and making tasks wait when their
> > > lags are negative, even if just a little bit as in the case of the
> > > kworker.
> > >
> > > There was a patch to disable the eligibility checks
> > > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > > which would make EEVDF more like EVDF, though the deadline comparison
> > > would probably still favor the vhost task instead of the kworker with
> > > the negative lag.
> > >
> > > I'm not sure if you tried it, but I thought I'd mention it.
> >
> > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > eligibility.
> >
> > I'm inclined to rather propose a documentation change, which
> > describes that tasks should not rely on woken up tasks being scheduled
> > immediately.
>
> Where do you see such an assumption ? Even before eevdf, there was
> nothing that ensures such behavior. When using CFS (legacy or eevdf)
> tasks, you can't know if the newly woken task will run 1st or not
>

There was no guarantee of course. place_entity was reducing the vruntime of
woken up tasks though, giving them a slight boost, right? For the scenario
that I observed, that boost was enough to make sure that the woken up task
gets scheduled consistently. This might still not be true for all scenarios,
but in general EEVDF seems to be stricter with woken up tasks.

Dismissing the lag on wakeup obviously does not guarantee getting scheduled
either, as other tasks might still be involved.

The question would be if it should be explicitly mentioned somewhere that,
at this point, woken up tasks are not getting any special treatment and
no one should rely on that boost for woken up tasks.

> >
> > Changing things in the code to address the specific scenario I'm
> > seeing seems to mostly create unwanted side effects and/or would require
> > the definition of some magic cut-off values.
> >
> >
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Tue, 19 Mar 2024 at 10:08, Tobias Huschle wrote:
>
> On 2024-03-18 15:45, Luis Machado wrote:
> > <...>
> >
> > Thanks for providing the above details Tobias. It does seem like EEVDF
> > is strict about the eligibility checks and making tasks wait when their
> > lags are negative, even if just a little bit as in the case of the
> > kworker.
> >
> > There was a patch to disable the eligibility checks
> > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > which would make EEVDF more like EVDF, though the deadline comparison
> > would probably still favor the vhost task instead of the kworker with
> > the negative lag.
> >
> > I'm not sure if you tried it, but I thought I'd mention it.
>
> Haven't seen that one yet. Unfortunately, it does not help to ignore the
> eligibility.
>
> I'm inclined to rather propose a documentation change, which
> describes that tasks should not rely on woken up tasks being scheduled
> immediately.

Where do you see such an assumption? Even before eevdf, there was
nothing that ensures such behavior. When using CFS (legacy or eevdf)
tasks, you can't know if the newly woken task will run 1st or not.

>
> Changing things in the code to address the specific scenario I'm
> seeing seems to mostly create unwanted side effects and/or would require
> the definition of some magic cut-off values.
>
>
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On 2024-03-18 15:45, Luis Machado wrote:
> On 3/14/24 13:45, Tobias Huschle wrote:
>> <...>
>
> Thanks for providing the above details Tobias. It does seem like EEVDF
> is strict about the eligibility checks and making tasks wait when their
> lags are negative, even if just a little bit as in the case of the
> kworker.
>
> There was a patch to disable the eligibility checks
> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> which would make EEVDF more like EVDF, though the deadline comparison
> would probably still favor the vhost task instead of the kworker with
> the negative lag.
>
> I'm not sure if you tried it, but I thought I'd mention it.

Haven't seen that one yet. Unfortunately, it does not help to ignore the
eligibility.

I'm inclined to rather propose a documentation change, which describes that
tasks should not rely on woken up tasks being scheduled immediately.

Changing things in the code to address the specific scenario I'm seeing
seems to mostly create unwanted side effects and/or would require the
definition of some magic cut-off values.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On 3/14/24 13:45, Tobias Huschle wrote:
> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>
>>> Questions:
>>> 1. The kworker getting its negative lag occurs in the following scenario
>>>    - kworker and a cgroup are supposed to execute on the same CPU
>>>    - one task within the cgroup is executing and wakes up the kworker
>>>    - kworker with 0 lag, gets picked immediately and finishes its
>>>      execution within ~5000ns
>>>    - on dequeue, kworker gets assigned a negative lag
>>>    Is this expected behavior? With this short execution time, I would
>>>    expect the kworker to be fine.
>>
>> That strikes me as a bit odd as well. Have you been able to determine how a
>> negative lag is assigned to the kworker after such a short runtime?
>>
>
> I did some more trace reading though and found something.
>
> What I observed if everything runs regularly:
> - vhost and kworker run alternating on the same CPU
> - if the kworker is done, it leaves the runqueue
> - vhost wakes up the kworker if it needs it
> --> this means:
>   - vhost starts alone on an otherwise empty runqueue
>   - it seems like it never gets dequeued
>     (unless another unrelated task joins or migration hits)
>   - if vhost wakes up the kworker, the kworker gets selected
>   - vhost runtime > kworker runtime
>     --> kworker gets positive lag and gets selected immediately next time
>
> What happens if it does go wrong:
> From what I gather, there seem to be occasions where the vhost either
> executes surprisingly quick, or the kworker surprisingly slow. If these
> outliers reach critical values, it can happen, that
>    vhost runtime < kworker runtime
> which now causes the kworker to get the negative lag.
>
> In this case it seems like, that the vhost is very fast in waking up
> the kworker. And coincidentally, the kworker takes, more time than usual
> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>
> So, for these outliers, the scheduler extrapolates that the kworker
> out-consumes the vhost and should be slowed down, although in the majority
> of other cases this does not happen.

Thanks for providing the above details, Tobias. It does seem like EEVDF is
strict about the eligibility checks and makes tasks wait when their lags are
negative, even if just a little bit as in the case of the kworker.

There was a patch to disable the eligibility checks
(https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
which would make EEVDF more like EVDF, though the deadline comparison would
probably still favor the vhost task instead of the kworker with the negative
lag.

I'm not sure if you tried it, but I thought I'd mention it.
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >    - kworker and a cgroup are supposed to execute on the same CPU
> >    - one task within the cgroup is executing and wakes up the kworker
> >    - kworker with 0 lag, gets picked immediately and finishes its
> >      execution within ~5000ns
> >    - on dequeue, kworker gets assigned a negative lag
> >    Is this expected behavior? With this short execution time, I would
> >    expect the kworker to be fine.
>
> That strikes me as a bit odd as well. Have you been able to determine how a
> negative lag is assigned to the kworker after such a short runtime?
>

I did some more trace reading though and found something.

What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
  - vhost starts alone on an otherwise empty runqueue
  - it seems like it never gets dequeued
    (unless another unrelated task joins or migration hits)
  - if vhost wakes up the kworker, the kworker gets selected
  - vhost runtime > kworker runtime
    --> kworker gets positive lag and gets selected immediately next time

What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
   vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.

In this case it seems that the vhost is very fast in waking up the kworker.
And coincidentally, the kworker takes more time than usual to finish. We
speak of 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.

Therefore this particular use case would profit from being able to ignore
such outliers, or from being able to ignore a certain amount of difference in
the lag values, i.e. introduce some grace value around the average runtime
for which lag is not accounted. But I'm not sure if I like that idea.

So the negative lag can be somewhat justified, but for this particular case
it leads to a problem where one outlier can cause havoc. As mentioned in the
vhost discussion, it could also be argued that the vhost should not rely on
the fact that the kworker always gets scheduled on wakeup, since these timing
issues can always happen.

Hence, the two options:
- offer the alternative strategy which dismisses lag on wakeup for workloads
  where we know that a task usually finishes faster than others but should
  not be punished by rare outliers (if that is predictable, I don't know)
- require vhost to address this issue on their side (if possible without
  creating an armada of side effects)
(plus the third one mentioned above, but that requires a magic cutoff value,
meh)

> I was looking at a different thread
> (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.c...@intel.com/)
> that uncovers a potential overflow in the eligibility calculation. Though I
> don't think that is the case for this particular vhost problem.

Yea, the numbers I see do not look very overflowy.
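The "grace value" idea mentioned in this message could look roughly like the following. It is purely a hypothetical sketch: no such helper or tunable exists in the kernel, lag_with_grace() and its grace parameter are invented here, and picking a sensible value for grace is exactly the "magic cutoff" problem raised above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: treat any lag within +/- grace as zero, so that a small
 * outlier does not turn into a negative lag at the next wakeup.
 */
int64_t lag_with_grace(int64_t lag, int64_t grace)
{
	assert(grace >= 0);
	if (lag >= -grace && lag <= grace)
		return 0;
	return lag;
}
```

With a grace of 10000 ns, an outlier of -4000 ns would be ignored, while a deficit of -50000 ns would still be accounted in full.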
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
Hi Tobias, On 2/28/24 16:10, Tobias Huschle wrote: > The previously used CFS scheduler gave tasks that were woken up an > enhanced chance to see runtime immediately by deducting a certain value > from its vruntime on runqueue placement during wakeup. > > This property was used by some, at least vhost, to ensure, that certain > kworkers are scheduled immediately after being woken up. The EEVDF > scheduler, does not support this so far. Instead, if such a woken up > entitiy carries a negative lag from its previous execution, it will have > to wait for the current time slice to finish, which affects the > performance of the process expecting the immediate execution negatively. > > To address this issue, implement EEVDF strategy #2 for rejoining > entities, which dismisses the lag from previous execution and allows > the woken up task to run immediately (if no other entities are deemed > to be preferred for scheduling by EEVDF). > > The vruntime is decremented by an additional value of 1 to make sure, > that the woken up tasks gets to actually run. This is of course not > following strategy #2 in an exact manner but guarantees the expected > behavior for the scenario described above. Without the additional > decrement, the performance goes south even more. So there are some > side effects I could not get my head around yet. > > Questions: > 1. The kworker getting its negative lag occurs in the following scenario >- kworker and a cgroup are supposed to execute on the same CPU >- one task within the cgroup is executing and wakes up the kworker >- kworker with 0 lag, gets picked immediately and finishes its > execution within ~5000ns >- on dequeue, kworker gets assigned a negative lag >Is this expected behavior? With this short execution time, I would >expect the kworker to be fine. That strikes me as a bit odd as well. Have you been able to determine how a negative lag is assigned to the kworker after such a short runtime? 
I was looking at a different thread (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.c...@intel.com/) that uncovers a potential overflow in the eligibility calculation. Though I don't think that is the case for this particular vhost problem.
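
To make the arithmetic behind question 1 concrete, here is a minimal user-space sketch. It is not kernel code: `toy_entity`, `toy_avg_vruntime()`, `toy_lag()` and `toy_run()` are hypothetical names, the 1024 base weight is an assumption, and the cgroup hierarchy from the reported scenario is ignored entirely. It only assumes that lag is the weighted average vruntime of the queue minus the entity's own vruntime.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a runqueue entity; not the kernel's struct sched_entity. */
struct toy_entity {
	int64_t vruntime; /* virtual runtime, ns */
	int64_t weight;   /* load weight, must be > 0 */
};

/* Weighted average vruntime V over n entities (the "0-lag point"). */
static int64_t toy_avg_vruntime(const struct toy_entity *e, int n)
{
	int64_t sum = 0, load = 0;

	assert(n > 0);
	for (int i = 0; i < n; i++) {
		assert(e[i].weight > 0);
		sum += e[i].vruntime * e[i].weight;
		load += e[i].weight;
	}
	return sum / load;
}

/* lag_i = V - v_i; negative means entity i got more than its share. */
static int64_t toy_lag(const struct toy_entity *e, int n, int i)
{
	return toy_avg_vruntime(e, n) - e[i].vruntime;
}

/* Charge delta_ns of runtime to entity i, scaled by an assumed base weight of 1024. */
static void toy_run(struct toy_entity *e, int i, int64_t delta_ns)
{
	e[i].vruntime += delta_ns * 1024 / e[i].weight;
}
```

Under these assumptions, two equal-weight entities that start at the same vruntime both have a lag of 0. Charging 5000 ns to one of them moves the average by only 2500 ns, so that entity ends up with a lag of -2500 ns and the other with +2500 ns. In this simplified model, a small negative lag after a short run from the 0-lag point is plain arithmetic; whether the same reasoning carries over to the kworker-versus-cgroup case is not established here.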
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
On Thu, Feb 29, 2024 at 09:06:16AM +0530, K Prateek Nayak wrote:
> (+ Xuewen Yan, Ke Wang)
>
> Hello Tobias,
> <...>
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >    - kworker and a cgroup are supposed to execute on the same CPU
> >    - one task within the cgroup is executing and wakes up the kworker
> >    - kworker with 0 lag, gets picked immediately and finishes its
> >      execution within ~5000ns
> >    - on dequeue, kworker gets assigned a negative lag
> >    Is this expected behavior? With this short execution time, I would
> >    expect the kworker to be fine.
> >    For a more detailed discussion on this symptom, please see:
> >    https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
>
> Does the lag clamping path from Xuewen Yan [1] work for the vhost case
> mentioned in the thread? Instead of placing the task just behind the
> 0-lag point, clamping the lag seems to be more principled approach since
> EEVDF already does it in update_entity_lag().
>
> If the lag is still too large, maybe the above coupled with Peter's
> delayed dequeue patch can help [2] (Note: tree is prone to force
> updates)
>
> [1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen@unisoc.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf=e62ef63a888c97188a977daddb72b61548da8417

I tried Peter's patches a while ago. Unfortunately, reducing the lag is
not sufficient in this particular case. The calling entity expects the
woken up kworker to run instantly. In order to have a chance that the
woken up kworker is scheduled right away, the kworker must not have any
negative lag. To guarantee it being scheduled it should even have a
positive lag which allows it to pass all other entities on the queue.
Therefore I proposed to just wipe the negative lag in these cases, which
seems to map to strategy #2 of the underlying paper.

The other way to think about this would be: the assumption that woken-up
tasks have a high probability of running is no longer valid. In that
case, the entity triggering the wakeup has to explicitly give up the CPU.
If there are no other tasks apart from the two involved so far, the
kworker has a good chance of being scheduled. If the runqueue is busy,
other tasks might intervene.

I keep playing around with these options, but potential side effects are
worrying me.

> <...>
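
The distinction drawn above between negative, zero and positive lag can be illustrated with a small user-space sketch. This is a simplification under stated assumptions, not the kernel's pick path: `toy_entity` and `toy_pick()` are hypothetical names, all entities have equal weight, and the current-task handling, RUN_TO_PARITY and wakeup preemption are left out. An entity counts as eligible when its vruntime does not exceed the average, and the earliest virtual deadline among the eligible entities wins.

```c
#include <assert.h>
#include <stdint.h>

/* Toy entity; names are illustrative, not the kernel's struct sched_entity. */
struct toy_entity {
	int64_t vruntime; /* virtual runtime, ns */
	int64_t vslice;   /* virtual length of the next request, ns */
};

/*
 * Simplified pick for equal-weight entities: skip entities with negative
 * lag (vruntime above the average), then take the earliest virtual
 * deadline. Returns the index of the chosen entity.
 */
static int toy_pick(const struct toy_entity *e, int n)
{
	int64_t sum = 0, best_deadline = 0;
	int best = -1;

	assert(n > 0);
	for (int i = 0; i < n; i++)
		sum += e[i].vruntime;

	for (int i = 0; i < n; i++) {
		int64_t deadline = e[i].vruntime + e[i].vslice;

		/* Eligible iff vruntime <= sum / n; multiply instead of dividing. */
		if (e[i].vruntime * n > sum)
			continue;
		if (best < 0 || deadline < best_deadline) {
			best = i;
			best_deadline = deadline;
		}
	}
	return best;
}
```

With index 0 as the waker and index 1 as the woken kworker, both using the same slice, the kworker is skipped as soon as its vruntime is one unit above the waker's, and it is picked as soon as its vruntime is one unit below. With identical vruntimes both are eligible and the result depends on the tie-break, which in this sketch is arbitrary (lowest index first) and says nothing about the kernel's behavior.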
Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup
(+ Xuewen Yan, Ke Wang)

Hello Tobias,

On 2/28/2024 9:40 PM, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
>
> This property was used by some, at least vhost, to ensure, that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler, does not support this so far. Instead, if such a woken up
> entitiy carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
>
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
>
> The vruntime is decremented by an additional value of 1 to make sure,
> that the woken up tasks gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
>
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
>    - kworker and a cgroup are supposed to execute on the same CPU
>    - one task within the cgroup is executing and wakes up the kworker
>    - kworker with 0 lag, gets picked immediately and finishes its
>      execution within ~5000ns
>    - on dequeue, kworker gets assigned a negative lag
>    Is this expected behavior? With this short execution time, I would
>    expect the kworker to be fine.
>    For a more detailed discussion on this symptom, please see:
>    https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/

Does the lag clamping patch from Xuewen Yan [1] work for the vhost case
mentioned in the thread? Instead of placing the task just behind the
0-lag point, clamping the lag seems to be a more principled approach,
since EEVDF already does it in update_entity_lag().

If the lag is still too large, maybe the above coupled with Peter's
delayed dequeue patch can help [2] (Note: tree is prone to force
updates)

[1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen@unisoc.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf=e62ef63a888c97188a977daddb72b61548da8417

> 2. The proposed code change of course only addresses the symptom. Am I
>    assuming correctly that this is in general the exepected behavior and
>    that the task waking up the kworker should rather do an explicit
>    reschedule of itself to grant the kworker time to execute?
>    In the vhost case, this is currently attempted through a cond_resched
>    which is not doing anything because the need_resched flag is not set.
>
> Feedback and opinions would be highly appreciated.
>
> Signed-off-by: Tobias Huschle
> ---
>  kernel/sched/fair.c     | 5 +++++
>  kernel/sched/features.h | 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..c20ae6d62961 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		lag = div_s64(lag, load);
>  	}
>
> +	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
> +		se->vlag = 0;
> +		lag = 1;
> +	}
> +
>  	se->vruntime = vruntime - lag;
>
>  	/*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..d3118e7568b4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>  SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(NOLAG_WAKEUP, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed

--
Thanks and Regards,
Prateek
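
For readers unfamiliar with the clamping referred to in this message, the following user-space sketch shows the general idea. It is an approximation under assumptions, not the kernel's update_entity_lag(): `toy_clamp_lag()` is a hypothetical name, the bound of twice the slice or one tick (whichever is larger) reflects how the mainline code of that period is understood here, and the weight scaling of that bound is omitted.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp a lag value to +/- max(2 * slice, tick), all in ns.
 * Weight scaling of the limit is deliberately left out of this sketch.
 */
static int64_t toy_clamp_lag(int64_t lag, int64_t slice_ns, int64_t tick_ns)
{
	int64_t limit = 2 * slice_ns > tick_ns ? 2 * slice_ns : tick_ns;

	assert(limit > 0);
	if (lag < -limit)
		return -limit;
	if (lag > limit)
		return limit;
	return lag;
}
```

With an assumed 3 ms slice and 4 ms tick, the limit is 6 ms: a lag of -10 ms is cut to -6 ms, while a lag of -2500 ns passes through unchanged. Clamping of this kind bounds how large a lag can become; it does not turn a small negative lag into zero.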
[RFC] sched/eevdf: sched feature to dismiss lag on wakeup
The previously used CFS scheduler gave tasks that were woken up an
enhanced chance to see runtime immediately by deducting a certain value
from their vruntime on runqueue placement during wakeup.

This property was used by some, at least vhost, to ensure that certain
kworkers are scheduled immediately after being woken up. The EEVDF
scheduler does not support this so far. Instead, if such a woken up
entity carries a negative lag from its previous execution, it will have
to wait for the current time slice to finish, which affects the
performance of the process expecting the immediate execution negatively.

To address this issue, implement EEVDF strategy #2 for rejoining
entities, which dismisses the lag from previous execution and allows
the woken up task to run immediately (if no other entities are deemed
to be preferred for scheduling by EEVDF).

The vruntime is decremented by an additional value of 1 to make sure
that the woken up task actually gets to run. This is of course not
following strategy #2 in an exact manner but guarantees the expected
behavior for the scenario described above. Without the additional
decrement, the performance goes south even more. So there are some
side effects I could not get my head around yet.

Questions:
1. The kworker getting its negative lag occurs in the following scenario
   - kworker and a cgroup are supposed to execute on the same CPU
   - one task within the cgroup is executing and wakes up the kworker
   - kworker, with 0 lag, gets picked immediately and finishes its
     execution within ~5000ns
   - on dequeue, kworker gets assigned a negative lag
   Is this expected behavior? With this short execution time, I would
   expect the kworker to be fine.
   For a more detailed discussion on this symptom, please see:
   https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
2. The proposed code change of course only addresses the symptom. Am I
   assuming correctly that this is in general the expected behavior and
   that the task waking up the kworker should rather do an explicit
   reschedule of itself to grant the kworker time to execute?
   In the vhost case, this is currently attempted through a cond_resched
   which is not doing anything because the need_resched flag is not set.

Feedback and opinions would be highly appreciated.

Signed-off-by: Tobias Huschle
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..c20ae6d62961 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
 
+	if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
+		se->vlag = 0;
+		lag = 1;
+	}
+
 	se->vruntime = vruntime - lag;
 
 	/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..d3118e7568b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(NOLAG_WAKEUP, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed
-- 
2.34.1
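
The effect of the hunk in place_entity() can be modelled in user space as follows. This is a sketch under assumptions, not the patched kernel function: `toy_place_vruntime()` and `TOY_ENQUEUE_WAKEUP` are hypothetical stand-ins, the feature flag is passed as a plain boolean, and the lag inflation step before the new check is an assumption about the code preceding the hunk (the diff itself only shows the final div_s64 line).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ENQUEUE_WAKEUP 0x01 /* stand-in for the kernel's ENQUEUE_WAKEUP flag */

/*
 * Return the vruntime an entity would be placed at.
 * avg_vruntime: 0-lag point of the queue; vlag: lag saved at dequeue;
 * load: queue load without the entity; weight: the entity's own weight.
 */
static int64_t toy_place_vruntime(int64_t avg_vruntime, int64_t vlag,
				  int64_t load, int64_t weight,
				  int flags, bool nolag_wakeup)
{
	int64_t lag;

	assert(load > 0);
	/* Assumed pre-existing step: inflate the lag to offset the entity's own weight. */
	lag = vlag * (load + weight) / load;

	/* Modelled hunk: on wakeup, discard the stored lag and use a lag of 1. */
	if (nolag_wakeup && (flags & TOY_ENQUEUE_WAKEUP))
		lag = 1;

	return avg_vruntime - lag;
}
```

For an average of 1000000, a stored lag of -2500 and equal load and weight of 1024, the unpatched path places the entity at 1005000, behind the 0-lag point. With the feature enabled and the wakeup flag set, it is placed at 999999, one unit ahead of the 0-lag point regardless of the stored lag. Enqueues without the wakeup flag are unaffected in this model.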