Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-11 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost-expert) opinion, the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check whether this causes problems
> > > with other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The scheduling of the kworker was not triggered by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

Are we going anywhere with this btw?


> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).
> For patch 2, I would prefer to do some more testing first to see whether
> this might have a negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-22 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.

As I predicted, if you want attention from sched guys you need to
send a patch in their area.

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 12:47:39PM +0100, Tobias Huschle wrote:
> On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > >  Summary 
> > > 
> > > In my (non-vhost-expert) opinion, the way to go would be either
> > > replacing the cond_resched with a hard schedule or setting the
> > > need_resched flag within vhost if a data transfer was successfully
> > > initiated. It will be necessary to check whether this causes problems
> > > with other workloads/benchmarks.
> > 
> > Yes but conceptually I am still in the dark on whether the fact that
> > periodically invoking cond_resched is no longer sufficient to be nice to
> > others is a bug, or intentional.  So you feel it is intentional?
> 
> I would assume that cond_resched is still a valid concept.
> But, in this particular scenario we have the following problem:
> 
> So far (with CFS) we had:
> 1. vhost initiates data transfer
> 2. kworker is woken up
> 3. CFS gives priority to woken up task and schedules it
> 4. kworker runs
> 
> Now (with EEVDF) we have:
> 0. In some cases, kworker has accumulated negative lag 
> 1. vhost initiates data transfer
> 2. kworker is woken up
> -3a. EEVDF does not schedule kworker if it has negative lag
> -4a. vhost continues running, kworker on same CPU starves
> --
> -3b. EEVDF schedules kworker if it has positive or no lag
> -4b. kworker runs
> 
> In the 3a/4a case, the kworker is given no chance to set the
> necessary flag. The flag can only be set by another CPU now.
> The scheduling of the kworker was not triggered by cond_resched, but
> rather by the wakeup path of the scheduler.
> 
> cond_resched works successfully once the load balancer (I suppose) 
> decides to migrate the vhost off to another CPU. In that case, the
> load balancer on another CPU sets that flag and we are good.
> That then eventually allows the scheduler to pick kworker, but very
> late.

I don't really understand what is special about vhost though.
Wouldn't it apply to any kernel code?

> > I propose a two patch series then:
> > 
> > patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> > 
> > If you're doing longer computations: first think userspace. If you
> > **really** want to do it in kernel you should regularly check if you need
> > to give up the CPU (remember there is cooperative multitasking per CPU).
> > Idiom::
> > 
> > cond_resched(); /* Will sleep */
> > 
> > 
> > replace cond_resched -> schedule
> > 
> > 
> > Since apparently cond_resched is no longer sufficient to
> > make the scheduler check whether you need to give up the CPU.
> > 
> > patch 2: make this change for vhost.
> > 
> > WDYT?
> 
> For patch 1, I would like to see some feedback from Peter (or someone else
> from the scheduler maintainers).

I am guessing once you post it you will see feedback.

> For patch 2, I would prefer to do some more testing first to see whether
> this might have a negative effect on other benchmarks.
> 
> I also stumbled upon something in the scheduler code that I want to verify.
> Maybe a cgroup thing, will check that out again.
> 
> I'll do some more testing with the cond_resched->schedule fix, check the
> cgroup thing and wait for Peter then.
> Will get back if any of the above yields some results.
> 
> > 
> > -- 
> > MST
> > 
> > 




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Tobias Huschle
On Thu, Feb 01, 2024 at 03:08:07AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> > On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> >  Summary 
> > 
> > In my (non-vhost-expert) opinion, the way to go would be either
> > replacing the cond_resched with a hard schedule or setting the
> > need_resched flag within vhost if a data transfer was successfully
> > initiated. It will be necessary to check whether this causes problems
> > with other workloads/benchmarks.
> 
> Yes but conceptually I am still in the dark on whether the fact that
> periodically invoking cond_resched is no longer sufficient to be nice to
> others is a bug, or intentional.  So you feel it is intentional?

I would assume that cond_resched is still a valid concept.
But, in this particular scenario we have the following problem:

So far (with CFS) we had:
1. vhost initiates data transfer
2. kworker is woken up
3. CFS gives priority to woken up task and schedules it
4. kworker runs

Now (with EEVDF) we have:
0. In some cases, kworker has accumulated negative lag 
1. vhost initiates data transfer
2. kworker is woken up
-3a. EEVDF does not schedule kworker if it has negative lag
-4a. vhost continues running, kworker on same CPU starves
--
-3b. EEVDF schedules kworker if it has positive or no lag
-4b. kworker runs

In the 3a/4a case, the kworker is given no chance to set the
necessary flag. The flag can only be set by another CPU now.
The scheduling of the kworker was not triggered by cond_resched, but
rather by the wakeup path of the scheduler.

cond_resched works successfully once the load balancer (I suppose) 
decides to migrate the vhost off to another CPU. In that case, the
load balancer on another CPU sets that flag and we are good.
That then eventually allows the scheduler to pick kworker, but very
late.
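The two decision paths above can be condensed into a toy model (a userspace sketch under my own simplifying assumptions, not kernel code; `cfs_picks_woken` and `eevdf_picks_woken` are made-up names):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the wakeup decision described above, not kernel code.
 * CFS-style wakeup preemption favours the freshly woken task; EEVDF
 * only picks it if its lag is not negative (case 3b), otherwise the
 * current task (vhost) keeps running (case 3a). */
struct task_model { long lag; };

static bool cfs_picks_woken(const struct task_model *woken)
{
    (void)woken;              /* woken task preempts regardless of lag */
    return true;
}

static bool eevdf_picks_woken(const struct task_model *woken)
{
    return woken->lag >= 0;   /* negative lag: not eligible yet */
}
```

Under this model a kworker that wakes up with negative lag runs immediately under CFS but is left waiting under EEVDF, which matches the starvation of case 4a.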

> I propose a two patch series then:
> 
> patch 1: in this text in Documentation/kernel-hacking/hacking.rst
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> replace cond_resched -> schedule
> 
> 
> Since apparently cond_resched is no longer sufficient to
> make the scheduler check whether you need to give up the CPU.
> 
> patch 2: make this change for vhost.
> 
> WDYT?

For patch 1, I would like to see some feedback from Peter (or someone else
from the scheduler maintainers).
For patch 2, I would prefer to do some more testing first to see whether
this might have a negative effect on other benchmarks.

I also stumbled upon something in the scheduler code that I want to verify.
Maybe a cgroup thing, will check that out again.

I'll do some more testing with the cond_resched->schedule fix, check the
cgroup thing and wait for Peter then.
Will get back if any of the above yields some results.

> 
> -- 
> MST
> 
> 



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-02-01 Thread Michael S. Tsirkin
On Thu, Feb 01, 2024 at 08:38:43AM +0100, Tobias Huschle wrote:
> On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > - Along with the wakeup of the kworker, need_resched needs to
> > >   be set, such that cond_resched() triggers a reschedule.
> > 
> > Let's try this? Does not look like discussing vhost itself will
> > draw attention from scheduler guys but posting a scheduling
> > patch probably will? Can you post a patch?
> 
> As a baseline, I verified that the following two options fix
> the regression:
> 
> - replacing the cond_resched in the vhost_worker function with a hard
>   schedule 
> - setting the need_resched flag using set_tsk_need_resched(current)
>   right before calling cond_resched
> 
> I then tried to find a better spot to put the set_tsk_need_resched
> call. 
> 
> One approach I found to be working is setting the need_resched flag 
> at the end of handle_tx and handle_rx.
> This would be after data has been actually passed to the socket, so 
> the originally blocked kworker has something to do and will profit
> from the reschedule. 
> It might be possible to go deeper and place the set_tsk_need_resched
> call right after actually passing the data, but this
> might leave us with sprinkling that call in multiple places and
> might be too intrusive.
> Furthermore, it might be possible to check if an error occurred when
> preparing the transmission and then skip the setting of the flag.
> 
> This would require a conceptual decision on the vhost side.
> This solution would not touch the scheduler, only incentivise it to
> do the right thing for this particular regression.
> 
> Another idea could be to find the counterpart that initiates the
> actual data transfer, which I assume wakes up the kworker. From
> what I gather it seems to be an eventfd notification that ends up
> somewhere in the qemu code. Not sure if that context would allow
> to set the need_resched flag, nor whether this would be a good idea.
> 
> > 
> > > - On cond_resched(), verify if the consumed runtime of the caller
> > >   is outweighing the negative lag of another process (e.g. the 
> > >   kworker) and schedule the other process. Introduces overhead
> > >   to cond_resched.
> > 
> > Or this last one.
> 
> On cond_resched itself, this will probably only be possible in a very 
> very hacky way. That is because currently, there is no immediate access
> to the necessary data available, which would make it necessary to 
> bloat up the cond_resched function quite a bit, with a probably 
> non-negligible amount of overhead.
> 
> Changing other aspects in the scheduler might get us in trouble as
> they all would probably resolve back to the question "What is the magic
> value that determines whether a small task not being scheduled justifies
> setting the need_resched flag for a currently running task or adjusting 
> its lag?". As this would then also have to work for all non-vhost related
> cases, this looks like a dangerous path to me on second thought.
> 
> 
>  Summary 
> 
> In my (non-vhost-expert) opinion, the way to go would be either
> replacing the cond_resched with a hard schedule or setting the
> need_resched flag within vhost if a data transfer was successfully
> initiated. It will be necessary to check whether this causes problems
> with other workloads/benchmarks.

Yes but conceptually I am still in the dark on whether the fact that
periodically invoking cond_resched is no longer sufficient to be nice to
others is a bug, or intentional.  So you feel it is intentional?
I propose a two patch series then:

patch 1: in this text in Documentation/kernel-hacking/hacking.rst

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


replace cond_resched -> schedule


Since apparently cond_resched is no longer sufficient to
make the scheduler check whether you need to give up the CPU.

patch 2: make this change for vhost.

WDYT?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-31 Thread Tobias Huschle
On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?

As a baseline, I verified that the following two options fix
the regression:

- replacing the cond_resched in the vhost_worker function with a hard
  schedule 
- setting the need_resched flag using set_tsk_need_resched(current)
  right before calling cond_resched

I then tried to find a better spot to put the set_tsk_need_resched
call. 

One approach I found to be working is setting the need_resched flag 
at the end of handle_tx and handle_rx.
This would be after data has been actually passed to the socket, so 
the originally blocked kworker has something to do and will profit
from the reschedule. 
It might be possible to go deeper and place the set_tsk_need_resched
call right after actually passing the data, but this
might leave us with sprinkling that call in multiple places and
might be too intrusive.
Furthermore, it might be possible to check if an error occurred when
preparing the transmission and then skip the setting of the flag.

This would require a conceptual decision on the vhost side.
This solution would not touch the scheduler, only incentivise it to
do the right thing for this particular regression.
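As a sanity check of the reasoning above, here is a userspace toy model (my own sketch, not vhost code; `need_resched_flag`, `cond_resched_model` and `handle_tx_model` are made-up stand-ins for the flag, cond_resched() and handle_tx): cond_resched() is a no-op while the flag is clear, so setting the flag once a transfer was initiated is what makes the next cond_resched() actually yield.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace toy model of the proposed fix, not the real vhost code. */
static bool need_resched_flag;

/* Models cond_resched(): only yields when the flag is set. */
static bool cond_resched_model(void)
{
    if (!need_resched_flag)
        return false;          /* flag clear: cond_resched() does nothing */
    need_resched_flag = false;
    return true;               /* flag set: we would schedule() here */
}

/* Models setting the flag at the end of handle_tx/handle_rx. */
static void handle_tx_model(bool transfer_initiated)
{
    /* ... data would be passed to the socket here ... */
    if (transfer_initiated)
        need_resched_flag = true;  /* models set_tsk_need_resched(current) */
}
```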

Another idea could be to find the counterpart that initiates the
actual data transfer, which I assume wakes up the kworker. From
what I gather it seems to be an eventfd notification that ends up
somewhere in the qemu code. Not sure if that context would allow
to set the need_resched flag, nor whether this would be a good idea.

> 
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.

On cond_resched itself, this will probably only be possible in a very 
very hacky way. That is because currently, there is no immediate access
to the necessary data available, which would make it necessary to 
bloat up the cond_resched function quite a bit, with a probably 
non-negligible amount of overhead.

Changing other aspects in the scheduler might get us in trouble as
they all would probably resolve back to the question "What is the magic
value that determines whether a small task not being scheduled justifies
setting the need_resched flag for a currently running task or adjusting 
its lag?". As this would then also have to work for all non-vhost related
cases, this looks like a dangerous path to me on second thought.


 Summary 

In my (non-vhost-expert) opinion, the way to go would be either
replacing the cond_resched with a hard schedule or setting the
need_resched flag within vhost if a data transfer was successfully
initiated. It will be necessary to check whether this causes problems
with other workloads/benchmarks.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-22 Thread Tobias Huschle
On Sun, Jan 21, 2024 at 01:44:32PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> > On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > > 
> > > Peter, would appreciate feedback on this. When is cond_resched()
> > > insufficient to give up the CPU? Should 
> > > Documentation/kernel-hacking/hacking.rst
> > > be updated to require schedule() instead?
> > > 
> > 
> > Happy new year everybody!
> > 
> > I'd like to bring this thread back to life. To reiterate:
> > 
> > - The introduction of the EEVDF scheduler revealed a performance
> >   regression in a uperf testcase of ~50%.
> > - Tracing the scheduler showed that it takes decisions which are
> >   in line with its design.
> > - The traces also showed that a vhost instance might run
> >   excessively long on its CPU in some circumstances. This causes
> >   the performance regression by introducing delays of 100+ms
> >   for the kworker which drives the actual network processing.
> > - Before EEVDF, the vhost would always be scheduled off its CPU
> >   in favor of the kworker, as the kworker was being woken up and
> >   the former scheduler was giving more priority to the woken up
> >   task. With EEVDF, the kworker, as a long running process, is
> >   able to accumulate negative lag, which causes EEVDF to not
> >   prefer it on its wake up, leaving the vhost running.
> > - If the kworker is not scheduled when being woken up, the vhost
> >   continues looping until it is migrated off the CPU.
> > - The vhost offers to be scheduled off the CPU by calling
> >   cond_resched(), but the need_resched flag is not set,
> >   therefore cond_resched() does nothing.
> > 
> > To solve this, I see the following options
> >   (this list might be neither complete nor correct)
> > - Along with the wakeup of the kworker, need_resched needs to
> >   be set, such that cond_resched() triggers a reschedule.
> 
> Let's try this? Does not look like discussing vhost itself will
> draw attention from scheduler guys but posting a scheduling
> patch probably will? Can you post a patch?
> 

I'll give it a go.

> > - The vhost calls schedule() instead of cond_resched() to give up
> >   the CPU. This would of course be a significantly stricter
> >   approach and might limit the performance of vhost in other cases.
> > - Preventing the kworker from accumulating negative lag as it is
> >   mostly not runnable and if it runs, it only runs for a very short
> >   time frame. This might clash with the overall concept of EEVDF.
> > - On cond_resched(), verify if the consumed runtime of the caller
> >   is outweighing the negative lag of another process (e.g. the 
> >   kworker) and schedule the other process. Introduces overhead
> >   to cond_resched.
> 
> Or this last one.
> 

This one will probably be more complicated as the necessary information
is not really available at the places where I'd like to see it.
Will have to ponder on that a bit to figure out if there might be an
elegant way to approach this.

> 
> > 
> > I would be curious about feedback on these ideas and interested in
> > alternative approaches.
> 
> 



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-21 Thread Michael S. Tsirkin
On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should 
> > Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces also showed that a vhost instance might run
>   excessively long on its CPU in some circumstances. This causes
>   the performance regression by introducing delays of 100+ms
>   for the kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling
>   cond_resched(), but the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options
>   (this list might be neither complete nor correct)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.

Let's try this? Does not look like discussing vhost itself will
draw attention from scheduler guys but posting a scheduling
patch probably will? Can you post a patch?

> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.
> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.

Or this last one.


> 
> I would be curious about feedback on these ideas and interested in
> alternative approaches.




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-09 Thread Michael S. Tsirkin
On Mon, Jan 08, 2024 at 02:13:25PM +0100, Tobias Huschle wrote:
> On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> > 
> > Peter, would appreciate feedback on this. When is cond_resched()
> > insufficient to give up the CPU? Should 
> > Documentation/kernel-hacking/hacking.rst
> > be updated to require schedule() instead?
> > 
> 
> Happy new year everybody!
> 
> I'd like to bring this thread back to life. To reiterate:
> 
> - The introduction of the EEVDF scheduler revealed a performance
>   regression in a uperf testcase of ~50%.
> - Tracing the scheduler showed that it takes decisions which are
>   in line with its design.
> - The traces also showed that a vhost instance might run
>   excessively long on its CPU in some circumstances. This causes
>   the performance regression by introducing delays of 100+ms
>   for the kworker which drives the actual network processing.
> - Before EEVDF, the vhost would always be scheduled off its CPU
>   in favor of the kworker, as the kworker was being woken up and
>   the former scheduler was giving more priority to the woken up
>   task. With EEVDF, the kworker, as a long running process, is
>   able to accumulate negative lag, which causes EEVDF to not
>   prefer it on its wake up, leaving the vhost running.
> - If the kworker is not scheduled when being woken up, the vhost
>   continues looping until it is migrated off the CPU.
> - The vhost offers to be scheduled off the CPU by calling
>   cond_resched(), but the need_resched flag is not set,
>   therefore cond_resched() does nothing.
> 
> To solve this, I see the following options
>   (this list might be neither complete nor correct)
> - Along with the wakeup of the kworker, need_resched needs to
>   be set, such that cond_resched() triggers a reschedule.
> - The vhost calls schedule() instead of cond_resched() to give up
>   the CPU. This would of course be a significantly stricter
>   approach and might limit the performance of vhost in other cases.

And on these two, I asked:
Would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should 
Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> - Preventing the kworker from accumulating negative lag as it is
>   mostly not runnable and if it runs, it only runs for a very short
>   time frame. This might clash with the overall concept of EEVDF.
> - On cond_resched(), verify if the consumed runtime of the caller
>   is outweighing the negative lag of another process (e.g. the 
>   kworker) and schedule the other process. Introduces overhead
>   to cond_resched.
> 
> I would be curious about feedback on these ideas and interested in
> alternative approaches.





Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-01-08 Thread Tobias Huschle
On Thu, Dec 14, 2023 at 02:14:59AM -0500, Michael S. Tsirkin wrote:
> 
> Peter, would appreciate feedback on this. When is cond_resched()
> insufficient to give up the CPU? Should 
> Documentation/kernel-hacking/hacking.rst
> be updated to require schedule() instead?
> 

Happy new year everybody!

I'd like to bring this thread back to life. To reiterate:

- The introduction of the EEVDF scheduler revealed a performance
  regression in a uperf testcase of ~50%.
- Tracing the scheduler showed that it takes decisions which are
  in line with its design.
- The traces also showed that a vhost instance might run
  excessively long on its CPU in some circumstances. This causes
  the performance regression by introducing delays of 100+ms
  for the kworker which drives the actual network processing.
- Before EEVDF, the vhost would always be scheduled off its CPU
  in favor of the kworker, as the kworker was being woken up and
  the former scheduler was giving more priority to the woken up
  task. With EEVDF, the kworker, as a long running process, is
  able to accumulate negative lag, which causes EEVDF to not
  prefer it on its wake up, leaving the vhost running.
- If the kworker is not scheduled when being woken up, the vhost
  continues looping until it is migrated off the CPU.
- The vhost offers to be scheduled off the CPU by calling
  cond_resched(), but the need_resched flag is not set,
  therefore cond_resched() does nothing.

To solve this, I see the following options
  (this list might be neither complete nor correct)
- Along with the wakeup of the kworker, need_resched needs to
  be set, such that cond_resched() triggers a reschedule.
- The vhost calls schedule() instead of cond_resched() to give up
  the CPU. This would of course be a significantly stricter
  approach and might limit the performance of vhost in other cases.
- Preventing the kworker from accumulating negative lag as it is
  mostly not runnable and if it runs, it only runs for a very short
  time frame. This might clash with the overall concept of EEVDF.
- On cond_resched(), verify if the consumed runtime of the caller
  is outweighing the negative lag of another process (e.g. the 
  kworker) and schedule the other process. Introduces overhead
  to cond_resched.

I would be curious about feedback on these ideas and interested in
alternative approaches.
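A rough illustration of the lag bookkeeping referenced above (my own simplification for this thread, not the kernel's actual EEVDF accounting): lag grows by the service a task is entitled to and shrinks by the service it actually receives, so a task whose short bursts exceed its accrued entitlement ends up with negative lag and is deprioritised at its next wakeup.

```c
#include <assert.h>

/* Simplified lag bookkeeping for illustration only: lag accumulates
 * entitled service and is reduced by received service. This is NOT
 * the kernel's actual EEVDF accounting. */
static long update_lag(long lag, long entitled, long received)
{
    return lag + entitled - received;
}
```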



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 09:55:23AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> > On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin 
> > > > > >  wrote:
> > 
> > [...]
> > > 
> > > Apparently schedule is already called?
> > > 
> > 
> > What about this: 
> > 
> > static int vhost_task_fn(void *data)
> > {
> > <...>
> > did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> > not mistaken
> > if (!did_work)
> > schedule();
> > <...>
> > }
> > 
> > static bool vhost_worker(void *data)
> > {
> > struct vhost_worker *worker = data;
> > struct vhost_work *work, *work_next;
> > struct llist_node *node;
> > 
> > node = llist_del_all(&worker->work_list);
> > if (node) {
> > <...>
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > <...>
> > }
> > }
> > 
> > return !!node;
> > }
> > 
> > The llist_for_each_entry_safe does not actually change the node value,
> > does it?
> > 
> > If it does not change it, !!node would return 1.
> > Thereby skipping the schedule.
> > 
> > This was changed recently with:
> > f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> > 
> > It returned a hardcoded 0 before. The commit message explicitly mentions 
> > this
> > change to make vhost_worker return 1 if it did something.
> > 
> > Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> > the woken up kworker right away.
> 
> 
> So we are actually making an effort to be nice.
> Documentation/kernel-hacking/hacking.rst says:
> 
> If you're doing longer computations: first think userspace. If you
> **really** want to do it in kernel you should regularly check if you need
> to give up the CPU (remember there is cooperative multitasking per CPU).
> Idiom::
> 
> cond_resched(); /* Will sleep */
> 
> 
> and this is what vhost.c does.
> 
> At this point I'm not sure why it's appropriate to call schedule() as
> opposed to cond_resched(). Ideas?
> 

Peter, would appreciate feedback on this. When is cond_resched()
insufficient to give up the CPU? Should Documentation/kernel-hacking/hacking.rst
be updated to require schedule() instead?


> -- 
> MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > > wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
>   <...>
>   did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> not mistaken
>   if (!did_work)
>   schedule();
>   <...>
> }
> 
> static bool vhost_worker(void *data)
> {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> 
> node = llist_del_all(&worker->work_list);
>   if (node) {
>   <...>
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   <...>
>   }
>   }
> 
>   return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value,
> does it?
> 
> If it does not change it, !!node would return 1,
> thereby skipping the schedule().
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.


So we are actually making an effort to be nice.
Documentation/kernel-hacking/hacking.rst says:

If you're doing longer computations: first think userspace. If you
**really** want to do it in kernel you should regularly check if you need
to give up the CPU (remember there is cooperative multitasking per CPU).
Idiom::

cond_resched(); /* Will sleep */


and this is what vhost.c does.

At this point I'm not sure why it's appropriate to call schedule() as opposed to
cond_resched(). Ideas?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 01:45:35PM +0100, Tobias Huschle wrote:
> On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > > wrote:
> 
> [...]
> > 
> > Apparently schedule is already called?
> > 
> 
> What about this: 
> 
> static int vhost_task_fn(void *data)
> {
>   <...>
>   did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
> not mistaken
>   if (!did_work)
>   schedule();
>   <...>
> }
> 
> static bool vhost_worker(void *data)
> {
>   struct vhost_worker *worker = data;
>   struct vhost_work *work, *work_next;
>   struct llist_node *node;
> 
> node = llist_del_all(&worker->work_list);
>   if (node) {
>   <...>
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   <...>
>   }
>   }
> 
>   return !!node;
> }
> 
> The llist_for_each_entry_safe does not actually change the node value,
> does it?
> 
> If it does not change it, !!node would return 1,
> thereby skipping the schedule().
> 
> This was changed recently with:
> f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
> 
> It returned a hardcoded 0 before. The commit message explicitly mentions this
> change to make vhost_worker return 1 if it did something.
> 
> Seems indeed like a nasty little side effect caused by EEVDF not scheduling
> the woken up kworker right away.

Indeed, but previously vhost_worker was looping itself.
And it did:
-   node = llist_del_all(&worker->work_list);
-   if (!node)
-   schedule();

so I don't think this was changed at all.






-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Tobias Huschle
On Wed, Dec 13, 2023 at 07:00:53AM -0500, Michael S. Tsirkin wrote:
> On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> > On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > > wrote:

[...]
> 
> Apparently schedule is already called?
> 

What about this: 

static int vhost_task_fn(void *data)
{
<...>
did_work = vtsk->fn(vtsk->data);  --> this calls vhost_worker if I'm 
not mistaken
if (!did_work)
schedule();
<...>
}

static bool vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

node = llist_del_all(&worker->work_list);
if (node) {
<...>
llist_for_each_entry_safe(work, work_next, node, node) {
<...>
}
}

return !!node;
}

The llist_for_each_entry_safe does not actually change the node value,
does it?

If it does not change it, !!node would return 1,
thereby skipping the schedule().

This was changed recently with:
f9010dbdce91 fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

It returned a hardcoded 0 before. The commit message explicitly mentions this
change to make vhost_worker return 1 if it did something.

Seems indeed like a nasty little side effect caused by EEVDF not scheduling
the woken up kworker right away.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Michael S. Tsirkin
On Wed, Dec 13, 2023 at 11:37:23AM +0100, Tobias Huschle wrote:
> On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  
> > > wrote:
> 
> We played around with the suggestions and some other ideas.
> I would like to share some initial results.
> 
> We tried the following:
> 
> 1. Call an unconditional schedule in the vhost_worker function
> 2. Change the HZ value from 100 to 1000
> 3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
> 4. Adding a cond_resched to translate_desc
> 5. Reducing VHOST_NET_WEIGHT to 25% of its original value
> 
> Please find the diffs below.
> 
> Summary:
> 
> Option 1 is very very hacky but resolved the regression.
> Option 2 reduces the regression by ~20%.
> Options 3-5 do not help unfortunately.
> 
> Potential explanation:
> 
> While the vhost is executing, the need_resched flag is not set (observable
> in the traces). Therefore cond_resched and the like will do nothing. vhost
> will continue executing until the need_resched flag is set by an external
> party, e.g. by a request to migrate the vhost.
> 
> Calling schedule unconditionally forces the scheduler to re-evaluate all 
> tasks and their vruntime/deadline/vlag values. The scheduler comes to the
> correct conclusion that the kworker should be executed, and from there it
> is smooth sailing. I will have to verify that sequence by collecting more
> traces, but this seems rather plausible.
> This hack might of course introduce all kinds of side effects but might
> provide an indicator that this is the actual problem.
> The big question would be how to solve this conceptually, and, first
> things first, whether you think this is a viable hypothesis.
> 
> Increasing the HZ value helps most likely because the other CPUs take 
> scheduling/load balancing decisions more often as well and therefore
> trigger the migration faster.
> 
> Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
> vhost loop. But I have no intuition how low we can/should go here.
> 
> 
> We also changed vq_err to print error messages, but did not encounter any.
> 
> Diffs:
> --
> 
> 1. Call an unconditional schedule in the vhost_worker function
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index e0c181ad17e3..16d73fd28831 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
> }
> }
>  
> +   schedule();
> return !!node;
>  }


So, this helps.
But this is very surprising!


static int vhost_task_fn(void *data)
{
struct vhost_task *vtsk = data;
bool dead = false;

for (;;) {
bool did_work;

if (!dead && signal_pending(current)) {
struct ksignal ksig;
/*
 * Calling get_signal will block in SIGSTOP,
 * or clear fatal_signal_pending, but remember
 * what was set.
 *
 * This thread won't actually exit until all
 * of the file descriptors are closed, and
 * the release function is called.
 */
dead = get_signal();
if (dead)
clear_thread_flag(TIF_SIGPENDING);
}

/* mb paired w/ vhost_task_stop */
set_current_state(TASK_INTERRUPTIBLE);

if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
__set_current_state(TASK_RUNNING);
break;
}

did_work = vtsk->fn(vtsk->data);
if (!did_work)
schedule();
}

complete(&vtsk->exited);
do_exit(0);

}

Apparently schedule is already called?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-13 Thread Tobias Huschle
On Tue, Dec 12, 2023 at 11:15:01AM -0500, Michael S. Tsirkin wrote:
> On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> > On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:

We played around with the suggestions and some other ideas.
I would like to share some initial results.

We tried the following:

1. Call an unconditional schedule in the vhost_worker function
2. Change the HZ value from 100 to 1000
3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()
4. Adding a cond_resched to translate_desc
5. Reducing VHOST_NET_WEIGHT to 25% of its original value

Please find the diffs below.

Summary:

Option 1 is very very hacky but resolved the regression.
Option 2 reduces the regression by ~20%.
Options 3-5 do not help unfortunately.

Potential explanation:

While the vhost is executing, the need_resched flag is not set (observable
in the traces). Therefore cond_resched and the like will do nothing. vhost
will continue executing until the need_resched flag is set by an external
party, e.g. by a request to migrate the vhost.

Calling schedule unconditionally forces the scheduler to re-evaluate all 
tasks and their vruntime/deadline/vlag values. The scheduler comes to the
correct conclusion that the kworker should be executed, and from there it
is smooth sailing. I will have to verify that sequence by collecting more
traces, but this seems rather plausible.
This hack might of course introduce all kinds of side effects but might
provide an indicator that this is the actual problem.
The big question would be how to solve this conceptually, and, first
things first, whether you think this is a viable hypothesis.

Increasing the HZ value helps most likely because the other CPUs take 
scheduling/load balancing decisions more often as well and therefore
trigger the migration faster.

Bringing down VHOST_NET_WEIGHT even more might also help to shorten the
vhost loop. But I have no intuition how low we can/should go here.


We also changed vq_err to print error messages, but did not encounter any.

Diffs:
--

1. Call an unconditional schedule in the vhost_worker function

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..16d73fd28831 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -414,6 +414,7 @@ static bool vhost_worker(void *data)
}
}
 
+   schedule();
return !!node;
 }

--

2. Change the HZ value from 100 to 1000

--> config change 

--

3. Reverting 05bfb338fa8d vhost: Fix livepatch timeouts in vhost_worker()

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..d519d598ebb9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -410,7 +410,8 @@ static bool vhost_worker(void *data)
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
-   cond_resched();
+   if (need_resched())
+   schedule();
}
}

--

4. Adding a cond_resched to translate_desc

I just picked some location.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e0c181ad17e3..f885dd29cbd1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2367,6 +2367,7 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 
addr, u32 len,
s += size;
addr += size;
++ret;
+   cond_resched();
}
 
if (ret == -EAGAIN)

--

5. Reducing VHOST_NET_WEIGHT to 25% of its original value

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f2ed7167c848..2c6966ea6229 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,7 +42,7 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
-#define VHOST_NET_WEIGHT 0x80000
+#define VHOST_NET_WEIGHT 0x20000
 
 /* Max number of packets transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others with small



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-12 Thread Michael S. Tsirkin
On Tue, Dec 12, 2023 at 11:00:12AM +0800, Jason Wang wrote:
> On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:
> >
> > On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things 
> > > > any?
> > >
> > > Or a dirty hack like cond_resched() in translate_desc().
> >
> > what do you mean, exactly?
> 
> Ideally it should not matter, but Tobias said there's an unexpectedly
> long time spent in translate_desc(), which may indicate that
> might_sleep() or similar doesn't work for some reason.
> 
> Thanks

You mean for debugging, add it with a patch to see what this does?

Sure - can you post the debugging patch pls?

> >
> > --
> > MST
> >




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-11 Thread Jason Wang
On Tue, Dec 12, 2023 at 12:54 AM Michael S. Tsirkin  wrote:
>
> On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things 
> > > any?
> >
> > Or a dirty hack like cond_resched() in translate_desc().
>
> what do you mean, exactly?

Ideally it should not matter, but Tobias said there's an unexpectedly
long time spent in translate_desc(), which may indicate that
might_sleep() or similar doesn't work for some reason.

Thanks

>
> --
> MST
>




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-11 Thread Michael S. Tsirkin
On Mon, Dec 11, 2023 at 03:26:46PM +0800, Jason Wang wrote:
> > Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?
> 
> Or a dirty hack like cond_resched() in translate_desc().

what do you mean, exactly?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-10 Thread Jason Wang
On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin  wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper on what the vhost is doing. I'm not an expert 
> > > > > > on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was 
> > > > > > that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. 
> > > > > > Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > > >
> > > >
> > > > It seems to do just that, there are multiple other vhost instances
> > > > involved which might keep filling up those queues.
> > > >
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance stay on
> > > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me like vhost doesn't use a workqueue but its own worker instead.

> > > >
> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost does a bunch of copy_from_user() calls which should trigger
__might_fault(), and thus __might_sleep(), in most cases.

> > > >
> > > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > >
> > > Which loop is that, exactly?
> >
> > The loop continuously passes translate_desc in drivers/vhost/vhost.c
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc   << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why you keep missing the cache and re-translating.
> Is pr_debug enabled for you? If not could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do {  \
> pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> if ((vq)->error_ctx)   \
> eventfd_signal((vq)->error_ctx, 1);\
> } while (0)
>
> to do pr_err if you prefer.
>
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
> } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke scheduler each time before re-executing it:
>
>
> static bool vhost_worker(void *data)
> {
> struct vhost_worker *worker = data;
> struct vhost_work *work, *work_next;
> struct llist_node *node;
>
> node = llist_del_all(&worker->work_list);
> if (node) {
> __set_current_state(TASK_RUNNING);
>
> node = llist_reverse_order(node);
> /* make sure flag is seen after deletion */
> smp_wmb();
> llist_for_each_entry_safe(work, work_next, node, node) {
> clear_bit(VHOST_WORK_QUEUED, &work->flags);
> kcov_remote_start_common(worker->kcov_handle);
> work->fn(work);
> kcov_remote_stop();
> cond_resched();
> }
> }
>
> return !!node;
> }
>
> These are the byte 

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-09 Thread Michael S. Tsirkin
On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > 
> > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > someone can verify/correct. Please bear with me probably messing up 
> > > > > the terminology.
> > > > > 
> > > > > - vhost is looping through available queues.
> > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > - kworker does something with that queue and terminates quickly.
> > > > > 
> > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > it will just keep on looping until it is migrated off the CPU.
> > > > 
> > > > 
> > > > Normally it takes the buffers off the queue and is done with it.
> > > > I am guessing that at the same time guest is running on some other
> > > > CPU and keeps adding available buffers?
> > > > 
> > > 
> > > It seems to do just that, there are multiple other vhost instances
> > > involved which might keep filling up those queues.
> > > 
> > 
> > No vhost is ever only draining queues. Guest is filling them.
> > 
> > > Unfortunately, this makes the problematic vhost instance stay on
> > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > explicitly woken up by vhost, so it wants it to do something.
> > > 
> > > At this point it seems that there is an assumption about the scheduler
> > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > far, it seems like EEVDF does what is intended to do.
> > > 
> > > Shouldn't there be a more explicit mechanism in use that allows the
> > > kworker to be scheduled in favor of the vhost?
> > > 
> > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > scheduler while executing that loop.
> > 
> > 
> > Which loop is that, exactly?
> 
> The loop continuously passes translate_desc in drivers/vhost/vhost.c
> That's where I put the trace statements.
> 
> The overall sequence seems to be (top to bottom):
> 
> handle_rx
> get_rx_bufs
> vhost_get_vq_desc
> vhost_get_avail_head
> vhost_get_avail
> __vhost_get_user_slow
> translate_desc   << trace statement in here
> vhost_iotlb_itree_first

I wonder why you keep missing the cache and re-translating.
Is pr_debug enabled for you? If not could you check if it
outputs anything?
Or you can tweak:

#define vq_err(vq, fmt, ...) do {  \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
if ((vq)->error_ctx)   \
eventfd_signal((vq)->error_ctx, 1);\
} while (0)

to do pr_err if you prefer.

> These functions show up as having increased overhead in perf.
> 
> There are multiple loops going on in there.
> Again the disclaimer though, I'm not familiar with that code at all.


So there's a limit there: vhost_exceeds_weight should requeue work:

} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

then we invoke scheduler each time before re-executing it:


static bool vhost_worker(void *data)
{
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;

node = llist_del_all(&worker->work_list);
if (node) {
__set_current_state(TASK_RUNNING);

node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
cond_resched();
}
}

return !!node;
}   

These are the byte and packet limits:

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others with small
 * pkts.
 */
#define VHOST_NET_PKT_WEIGHT 256


Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Tobias Huschle
On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > 
> > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > someone can verify/correct. Please bear with me probably messing up 
> > > > the terminology.
> > > > 
> > > > - vhost is looping through available queues.
> > > > - vhost wants to wake up a kworker to process a found queue.
> > > > - kworker does something with that queue and terminates quickly.
> > > > 
> > > > What I found by throwing in some very noisy trace statements was that,
> > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > all available queues (and seems to repeat itself). So it essentially
> > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > it will just keep on looping until it is migrated off the CPU.
> > > 
> > > 
> > > Normally it takes the buffers off the queue and is done with it.
> > > I am guessing that at the same time guest is running on some other
> > > CPU and keeps adding available buffers?
> > > 
> > 
> > It seems to do just that, there are multiple other vhost instances
> > involved which might keep filling up those queues.
> > 
> 
> No vhost is ever only draining queues. Guest is filling them.
> 
> > Unfortunately, this makes the problematic vhost instance stay on
> > the CPU and prevents said kworker from getting scheduled. The kworker is
> > explicitly woken up by vhost, so it wants it to do something.
> > 
> > At this point it seems that there is an assumption about the scheduler
> > in place which is no longer fulfilled by EEVDF. From the discussion so
> > far, it seems like EEVDF does what is intended to do.
> > 
> > Shouldn't there be a more explicit mechanism in use that allows the
> > kworker to be scheduled in favor of the vhost?
> > 
> > It is also concerning that the vhost seemingly cannot be preempted by the
> > scheduler while executing that loop.
> 
> 
> Which loop is that, exactly?

The loop continuously passes translate_desc in drivers/vhost/vhost.c
That's where I put the trace statements.

The overall sequence seems to be (top to bottom):

handle_rx
get_rx_bufs
vhost_get_vq_desc
vhost_get_avail_head
vhost_get_avail
__vhost_get_user_slow
translate_desc   << trace statement in here
vhost_iotlb_itree_first

These functions show up as having increased overhead in perf.

There are multiple loops going on in there.
Again the disclaimer though, I'm not familiar with that code at all.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Michael S. Tsirkin
On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > 
> > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > virtio whatsoever, so these are just educated guesses that maybe
> > > someone can verify/correct. Please bear with me probably messing up 
> > > the terminology.
> > > 
> > > - vhost is looping through available queues.
> > > - vhost wants to wake up a kworker to process a found queue.
> > > - kworker does something with that queue and terminates quickly.
> > > 
> > > What I found by throwing in some very noisy trace statements was that,
> > > if the kworker is not woken up, the vhost just keeps looping across
> > > all available queues (and seems to repeat itself). So it essentially
> > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > it will just keep on looping until it is migrated off the CPU.
> > 
> > 
> > Normally it takes the buffers off the queue and is done with it.
> > I am guessing that at the same time guest is running on some other
> > CPU and keeps adding available buffers?
> > 
> 
> It seems to do just that, there are multiple other vhost instances
> involved which might keep filling up those queues.
> 

No vhost is ever only draining queues. Guest is filling them.

> Unfortunately, this makes the problematic vhost instance stay on
> the CPU and prevents said kworker from getting scheduled. The kworker is
> explicitly woken up by vhost, so it wants it to do something.
> 
> At this point it seems that there is an assumption about the scheduler
> in place which is no longer fulfilled by EEVDF. From the discussion so
> far, it seems like EEVDF does what is intended to do.
> 
> Shouldn't there be a more explicit mechanism in use that allows the
> kworker to be scheduled in favor of the vhost?
> 
> It is also concerning that the vhost seemingly cannot be preempted by the
> scheduler while executing that loop.


Which loop is that, exactly?

-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-08 Thread Tobias Huschle
On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > 
> > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > virtio whatsoever, so these are just educated guesses that maybe
> > someone can verify/correct. Please bear with me probably messing up 
> > the terminology.
> > 
> > - vhost is looping through available queues.
> > - vhost wants to wake up a kworker to process a found queue.
> > - kworker does something with that queue and terminates quickly.
> > 
> > What I found by throwing in some very noisy trace statements was that,
> > if the kworker is not woken up, the vhost just keeps looping across
> > all available queues (and seems to repeat itself). So it essentially
> > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > it will just keep on looping until it is migrated off the CPU.
> 
> 
> Normally it takes the buffers off the queue and is done with it.
> I am guessing that at the same time guest is running on some other
> CPU and keeps adding available buffers?
> 

It seems to do just that, there are multiple other vhost instances
involved which might keep filling up those queues. 

Unfortunately, this causes the problematic vhost instance to stay on
the CPU and prevents said kworker from getting scheduled. The kworker is
explicitly woken up by vhost, so vhost wants it to do something.

At this point it seems that there is an assumption about the scheduler
in place which is no longer fulfilled by EEVDF. From the discussion so
far, it seems like EEVDF does what it is intended to do.

Shouldn't there be a more explicit mechanism in use that allows the
kworker to be scheduled in favor of the vhost?

It is also concerning that the vhost seemingly cannot be preempted by the
scheduler while executing that loop.



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Michael S. Tsirkin
On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> 3. vhost looping endlessly, waiting for kworker to be scheduled
> 
> I dug a little deeper on what the vhost is doing. I'm not an expert on
> virtio whatsoever, so these are just educated guesses that maybe
> someone can verify/correct. Please bear with me probably messing up 
> the terminology.
> 
> - vhost is looping through available queues.
> - vhost wants to wake up a kworker to process a found queue.
> - kworker does something with that queue and terminates quickly.
> 
> What I found by throwing in some very noisy trace statements was that,
> > if the kworker is not woken up, the vhost just keeps looping across
> all available queues (and seems to repeat itself). So it essentially
> relies on the scheduler to schedule the kworker fast enough. Otherwise
> it will just keep on looping until it is migrated off the CPU.


Normally it takes the buffers off the queue and is done with it.
I am guessing that at the same time guest is running on some other
CPU and keeps adding available buffers?


-- 
MST




Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-12-06 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
[...]
> 
> What are the weights of the two entities?
> 

Both entities have the same weights (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm=vhost-2961 
pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 
min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
 CPU 6/KVM-2360[011] d  1158.884473: sched_place: comm= pid=0 
sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 
min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 
min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 
min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
 CPU 6/KVM-2360[011] d  1158.884474: sched_place: comm= pid=0 
sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 
cpu=11 nr=0 vru=76600751247 lag=0
<...>
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:
kworker -> 1048576
vhost   -> 1048576
cgroup root ->3876

kworker and vhost weights remain the same. The weights of the nodes in the 
cgroup vary.
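For context on why these weights matter: vruntime advances inversely proportional to weight, so the low-weight cgroup levels accrue vruntime much faster per nanosecond of runtime than the weight-1048576 tasks. A rough, illustrative model (the kernel's __calc_delta() avoids the division via precomputed fixed-point inverse weights; the helper name here is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1048576ULL  /* weight of a nice-0 task, matching the trace */

/* Rough model of vruntime accounting: an entity's vruntime advances by
 * delta_exec * NICE_0_LOAD / weight. This plain division is only an
 * approximation of the kernel's __calc_delta(), for illustration. */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}
```

With the weights from the trace, 1000 ns of runtime advances the weight-1048576 entities' vruntime by 1000 ns, but the weight-3876 cgroup node's by roughly 270x that, which is why the per-level vruntimes in the hierarchy diverge so strongly.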


I also spent some more thought on this and have some more observations:

1. kworker lag after short runtime

vhost-2961-2984[011] d  1158.884486: sched_waking: 
comm=kworker/11:2 pid=202 prio=120 target_cpu=011
vhost-2961-2984[011] d  1158.884487: sched_place: comm=kworker/11:2 
pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 
min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>   
^
vhost-2961-2984[011] d  1158.884490: sched_switch: 
prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> 
next_comm=kworker/11:2 next_pid=202 next_prio=120
   kworker/11:2-202[011] d  1158.884491: sched_waking: comm=CPU 0/KVM 
pid=2988 prio=120 target_cpu=009
   kworker/11:2-202[011] d  1158.884492: sched_stat_runtime: 
comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=7660391 [ns] 
deadline=76606905961 [ns] lag=76606905961

   
   kworker/11:2-202[011] d  1158.884492: sched_update: 
comm=kworker/11:2 pid=202 sev=7660391 sed=76606905961 sel=-1128 sew=1048576 
avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=1000

 ^
   kworker/11:2-202[011] d  1158.884494: sched_stat_wait: 
comm=vhost-2961 pid=2984 delay=5150 [ns]
   kworker/11:2-202[011] d  1158.884494: sched_switch: 
prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> 
next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed on the 
timeline with 0 lag.
The kworker then executes for 5150ns and returns control to the vhost.
Unfortunately, this short runtime earns the kworker a negative lag of -1128.
This in turn, causes the kworker to not be selected by 
check_preempt_wakeup_fair.
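The eligibility test involved can be sketched roughly as follows (the real entity_eligible() in kernel/sched/fair.c works on the load-weighted average without materializing it; this reduced form is only for illustration): an entity is only considered if its vruntime has not passed avg_vruntime, i.e. its lag is non-negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of EEVDF eligibility: eligible iff
 * lag = avg_vruntime - vruntime >= 0. A kworker with lag -1128 fails
 * this check even if its deadline is the earliest on the queue. */
static bool entity_eligible_model(int64_t vruntime, int64_t avg_vruntime)
{
	return vruntime <= avg_vruntime;
}
```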

My naive understanding of lag is that only those entities which consume
more time than they should get negative lag. Why is the kworker being
punished for running only a tiny portion of time?

In the majority of cases, the kworker finishes after a 4-digit number of ns.
There are occasional outliers with 5-digit numbers. I would therefore not 
expect negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was not.
kworker gets put on the queue with no lag, so it essentially has its vruntime
set to avg_vruntime.
After giving up its timeslice, the kworker now has a vruntime which is larger
than the avg_vruntime. Hence the negative lag might make sense here from an
algorithmic standpoint. 
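That sequence can be modeled with a toy sketch, under the simplifying (and only approximately true) assumption that avg_vruntime does not move while the kworker runs: zero-lag placement sets vruntime to avg_vruntime, so any runtime at all pushes vruntime past the average, and the lag computed at dequeue comes out negative. The magnitude in the trace (-1128 vs. 5150 ns of runtime) differs because the average itself advances in the meantime.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t vruntime; } entity_t;

/* Zero-lag placement: sched_place with lag=0 sets vruntime to the
 * current avg_vruntime. */
static void place_zero_lag(entity_t *se, int64_t avg_vruntime)
{
	se->vruntime = avg_vruntime;
}

/* Lag as computed at dequeue: avg_vruntime - vruntime. Negative once
 * the entity's vruntime has moved past the (here: frozen) average. */
static int64_t dequeue_lag(const entity_t *se, int64_t avg_vruntime)
{
	return avg_vruntime - se->vruntime;
}
```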


2a/b. vhost getting increased deadlines over time, no call of pick_eevdf

vhost-2961-2984[011] d.h..  1158.892878: sched_stat_runtime: 
comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] 
deadline=3606948448 [ns] lag=3598548386
vhost-2961-2984  

Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Tobias Huschle
On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

[...]

> > - At depth 4, the cgroup shows the observed vruntime value which is smaller
> >by a factor of 20, but depth 0 seems to be running with values of the
> >correct magnitude.
> 
> A child is running means its parent also being the cfs->curr, but
> not vice versa if there are more than one child.
> 
> > - cgroup at depth 0 has zero lag, with higher depth, there are large lag
> >values (as observed 606.338267 onwards)
> 
> These values of se->vlag means 'run this entity to parity' to avoid
> excess context switch, which is what RUN_TO_PARITY does, or nothing
> when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.
> 

Thanks for clarifying that. This makes things clearer to me.

> > 
> > Now the following occurs, triggered by the vhost:
> > - The kworker gets placed again with:
> >  vruntime  deadline
> > cgroup56117619190   57650477291 -> depth 0, last known value
> > kworker   56117885776   56120885776 -> lag of -725
> > - vhost continues executing and updates its vruntime accordingly, here
> >I would need to enhance the trace to also print the vruntimes of the
> >parent sched_entities to see the progress of their vruntime/deadline/lag
> >values as well
> > - It is a bit irritating that the EEVDF algorithm would not pick the kworker
> >over the cgroup as its deadline is smaller.
> >But, the kworker has negative lag, which might cause EEVDF to not pick
> >the kworker.
> >The cgroup at depth 0 has no lag, all deeper layers have a significantly
> >positive lag (last known values, might have changed in the meantime).
> >At this point I would see the option that the vhost task is stuck
> >somewhere or EEVDF just does not see the kworker as an eligible option.
> 
> IMHO such lag should not introduce that long delay. Can you run the
> test again with NEXT_BUDDY disabled?

I added a trace event to the next buddy path, it does not get triggered, so I'd 
assume that no buddies are selected.

> 
> > 
> > - Once the vhost is migrated off the cpu, the update_entity_lag function
> >works with the following values at 606.467022: sched_update
> >For the cgroup at depth 0
> >- vruntime = 5710415 --> this is in line with the amount of new 
> > timeslices
> > vhost got assigned while the kworker was 
> > waiting
> >- vlag =   -62439022 --> the scheduler knows that the cgroup was
> > overconsuming, but no runtime for the 
> > kworker
> >For the cfs_rq we have
> >- min_vruntime =  56117885776 --> this matches the vruntime of the 
> > kworker
> >- avg_vruntime = 161750065796 --> this is rather large in comparison, 
> > but I
> >  might access this value at a bad time
> 
> Use avg_vruntime() instead.

Fair.
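For anyone following along: the raw cfs_rq->avg_vruntime field is not itself a vruntime but a load-weighted sum of (vruntime - min_vruntime) keys, which is why reading it directly yields implausibly large numbers like 161750065796 above. A rough sketch of what avg_vruntime() computes (sign handling and the curr adjustment are simplified away here):

```c
#include <assert.h>
#include <stdint.h>

struct cfs_rq_model {
	int64_t  avg_vruntime;  /* \Sum (v_i - min_vruntime) * w_i */
	uint64_t avg_load;      /* \Sum w_i */
	uint64_t min_vruntime;
};

/* Simplified avg_vruntime(): divide the weighted key sum by the total
 * load and shift back by min_vruntime to get a comparable vruntime. */
static uint64_t avg_vruntime_model(const struct cfs_rq_model *rq)
{
	int64_t avg = rq->avg_vruntime;

	if (rq->avg_load)
		avg /= (int64_t)rq->avg_load;  /* kernel uses div_s64 */
	return rq->min_vruntime + avg;
}
```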

[...]

> > 
> > # full trace #
> > 
> > sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> > --> during __pick_eevdf, prints values for best and the first node loop 
> > variable, second loop is never executed
> > 
> > sched_place/sched_update: 
> > sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
> 
> It would be better replace cfs_rq->avg_vruntime with avg_vruntime().
> Although we can get real @avg by (vruntime + vlag), I am not sure
> vlag (@lag in trace) is se->vlag or the local variable in the place
> function which is scaled and no longer be the true vlag.
> 

Oh my bad, sel is the vlag value of the sched_entity, lag is the local variable.

[...]

> >  vhost-2931-2953[013] d   606.338313: sched_wakeup: 
> > comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> > --> kworker set to runnable, but vhost keeps on executing
> 
> What are the weights of the two entities?

I'll do another run and look at those values.

[...]



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-28 Thread Abel Wu

On 11/27/23 9:56 PM, Tobias Huschle Wrote:

On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:

On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers



Tried the patch, it does not help unfortunately.

It might also be possible that the long-running vhost is stuck on something
during those phases where it just runs for a while. This might have
been a risk for a while, and EEVDF might have just uncovered an unfortunate
sequence of events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0
kworker   56117624131   56120619190
   This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller
   by a factor of 20, but depth 0 seems to be running with values of the
   correct magnitude.


A child is running means its parent also being the cfs->curr, but
not vice versa if there are more than one child.


- cgroup at depth 0 has zero lag, with higher depth, there are large lag
   values (as observed 606.338267 onwards)


These values of se->vlag means 'run this entity to parity' to avoid
excess context switch, which is what RUN_TO_PARITY does, or nothing
when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.



Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:
 vruntime  deadline
cgroup56117619190   57650477291 -> depth 0, last known value
kworker   56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here
   I would need to enhance the trace to also print the vruntimes of the
   parent sched_entities to see the progress of their vruntime/deadline/lag
   values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker
   over the cgroup as its deadline is smaller.
   But, the kworker has negative lag, which might cause EEVDF to not pick
   the kworker.
   The cgroup at depth 0 has no lag, all deeper layers have a significantly
   positive lag (last known values, might have changed in the meantime).
   At this point I would see the option that the vhost task is stuck
   somewhere or EEVDF just does not see the kworker as an eligible option.


IMHO such lag should not introduce that long delay. Can you run the
test again with NEXT_BUDDY disabled?



- Once the vhost is migrated off the cpu, the update_entity_lag function
   works with the following values at 606.467022: sched_update
   For the cgroup at depth 0
   - vruntime = 5710415 --> this is in line with the amount of new 
timeslices
vhost got assigned while the kworker was waiting
   - vlag =   -62439022 --> the scheduler knows that the cgroup was
overconsuming, but no runtime for the kworker
   For the cfs_rq we have
   - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
   - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
 might access this value at a bad time


Use avg_vruntime() instead.


   - nr_running   =2 --> at this point, both cgroup and kworker are
 still on the queue, with the cgroup being
 in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not
 propagated downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
   onto cpu 13 with a lag of -166821875; it gets scheduled right away as
   there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
   with no lag and gets scheduled almost immediately, with a wait
   time of 1255 ns.

It should be noted that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

# full trace #

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-27 Thread Tobias Huschle
On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
> 
> The below should also work for internal entities, but last time I poked
> around with it I had some regressions elsewhere -- you know, fix one,
> wreck another type of situations on hand.
> 
> But still, could you please give it a go -- it applies cleanly to linus'
> master and -rc2.
> 
> ---
> Subject: sched/eevdf: Revenge of the Sith^WSleepers
> 

Tried the patch, it does not help unfortunately.

It might also be possible that the long-running vhost is stuck on something
during those phases where it just runs for a while. This might have
been a risk for a while, and EEVDF might have just uncovered an unfortunate
sequence of events.
I'll look into this option.

I also added some more trace outputs in order to find the actual vruntimes
of the cgroup parents. The numbers look kind of reasonable, but I struggle
to judge this with certainty.

In one of the scenarios where the kworker sees an absurd wait time, the 
following occurs (full trace below):

- The kworker ends its timeslice after 4941 ns
- __pick_eevdf finds the cgroup holding vhost as the next option to execute
- Last known values are:   
vruntime  deadline
   cgroup56117619190   57650477291 -> depth 0
   kworker   56117624131   56120619190
  This is fair, since the kworker is not runnable here.
- At depth 4, the cgroup shows the observed vruntime value which is smaller 
  by a factor of 20, but depth 0 seems to be running with values of the 
  correct magnitude.
- cgroup at depth 0 has zero lag, with higher depth, there are large lag 
  values (as observed 606.338267 onwards)

Now the following occurs, triggered by the vhost:
- The kworker gets placed again with:   
vruntime  deadline
   cgroup56117619190   57650477291 -> depth 0, last known value
   kworker   56117885776   56120885776 -> lag of -725
- vhost continues executing and updates its vruntime accordingly, here 
  I would need to enhance the trace to also print the vruntimes of the 
  parent sched_entities to see the progress of their vruntime/deadline/lag 
  values as well
- It is a bit irritating that the EEVDF algorithm would not pick the kworker 
  over the cgroup as its deadline is smaller.
  But, the kworker has negative lag, which might cause EEVDF to not pick 
  the kworker.
  The cgroup at depth 0 has no lag, all deeper layers have a significantly 
  positive lag (last known values, might have changed in the meantime).
  At this point I would see the option that the vhost task is stuck
  somewhere or EEVDF just does not see the kworker as an eligible option.

- Once the vhost is migrated off the cpu, the update_entity_lag function
  works with the following values at 606.467022: sched_update
  For the cgroup at depth 0
  - vruntime = 5710415 --> this is in line with the amount of new timeslices
   vhost got assigned while the kworker was waiting
  - vlag =   -62439022 --> the scheduler knows that the cgroup was 
   overconsuming, but no runtime for the kworker
  For the cfs_rq we have
  - min_vruntime =  56117885776 --> this matches the vruntime of the kworker
  - avg_vruntime = 161750065796 --> this is rather large in comparison, but I 
might access this value at a bad time
  - nr_running   =2 --> at this point, both cgroup and kworker are 
still on the queue, with the cgroup being 
in the migration process
--> It seems like the overconsumption accumulates at cgroup depth 0 and is not 
propagated downwards. This might be intended though.

- At 606.479979: sched_place, cgroup hosting the vhost is migrated back
  onto cpu 13 with a lag of -166821875; it gets scheduled right away as 
  there is no other task (nr_running = 0)

- At 606.479996: sched_place, the kworker gets placed again, this time
  with no lag and gets scheduled almost immediately, with a wait 
  time of 1255 ns.

It should be noted that these scenarios also occur when the first placement
of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
when observing this issue.

# full trace #

sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
--> during __pick_eevdf, prints values for best and the first node loop 
variable, second loop is never executed

sched_place/sched_update: 
sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime
--> at the end of place_entity and update_entity_lag

--> the chunks of 5 entries for these new events represent the 5 levels of the 
cgroup which hosts the vhost

vhost-2931-2953[013] d   606.338262: sched_stat_blocked: 
comm=kworker/13:1 pid=168 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-22 Thread Peter Zijlstra
On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

> We applied both suggested patch options and ran the test again, so 
> 
> sched/eevdf: Fix vruntime adjustment on reweight
> sched/fair: Update min_vruntime for reweight_entity() correctly
> 
> and
> 
> sched/eevdf: Delay dequeue
> 
> Unfortunately, both variants do NOT fix the problem.
> The regression remains unchanged.

Thanks for testing.

> I will continue getting myself familiar with how cgroups are scheduled to dig 
> deeper here. If there are any other ideas, I'd be happy to use them as a 
> starting point for further analysis.
> 
> Would additional traces still be of interest? If so, I would be glad to
> provide them.

So, since it got bisected to the placement logic, but is a cgroup
related issue, I was thinking that 'Delay dequeue' might not cut it,
that only works for tasks, not the internal entities.

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers

For tasks that have received excess service (negative lag) allow them to
gain parity (zero lag) by sleeping.

Signed-off-by: Peter Zijlstra (Intel) 
---
 kernel/sched/fair.c     | 36 ++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..b975e4b07a68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,6 +5110,33 @@ static inline void update_misfit_status(struct 
task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
+static inline u64
+entity_vlag_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+   u64 now, vdelta;
+   s64 delta;
+
+   if (!(flags & ENQUEUE_WAKEUP))
+   return se->vlag;
+
+   if (flags & ENQUEUE_MIGRATED)
+   return 0;
+
+   now = rq_clock_task(rq_of(cfs_rq));
+   delta = now - se->exec_start;
+   if (delta < 0)
+   return se->vlag;
+
+   if (sched_feat(GENTLE_SLEEPER))
+   delta /= 2;
+
+   vdelta = __calc_delta(delta, NICE_0_LOAD, &cfs_rq->load);
+   if (vdelta < -se->vlag)
+   return se->vlag + vdelta;
+
+   return 0;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -5133,6 +5160,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 
lag = se->vlag;
 
+   /*
+* Allow tasks that have received too much service (negative
+* lag) to (re)gain parity (zero lag) by sleeping for the
+* equivalent duration. This ensures they will be readily
+* eligible.
+*/
+   if (sched_feat(PLACE_SLEEPER) && lag < 0)
+   lag = entity_vlag_sleeper(cfs_rq, se, flags);
+
/*
 * If we want to place a task and preserve lag, we have to
 * consider the effect of the new entity on the weighted
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..722282d3ed07 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,12 @@
 SCHED_FEAT(PLACE_LAG, true)
 SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
 SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Let sleepers earn back lag, but not more than 0-lag. GENTLE_SLEEPERS earn at
+ * half the speed.
+ */
+SCHED_FEAT(PLACE_SLEEPER, true)
+SCHED_FEAT(GENTLE_SLEEPER, true)
 
 /*
  * Prefer to schedule the task we woke last (assuming it failed



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-21 Thread Tobias Huschle
On Fri, Nov 17, 2023 at 09:07:55PM +0800, Abel Wu wrote:
> On 11/17/23 8:37 PM, Peter Zijlstra Wrote:

[...]

> > Ah, so if this is a cgroup issue, it might be worth trying this patch
> > that we have in tip/sched/urgent.
> 
> And please also apply this fix:
> https://lore.kernel.org/all/20231117080106.12890-1-s921975...@gmail.com/
> 

We applied both suggested patch options and ran the test again, so 

sched/eevdf: Fix vruntime adjustment on reweight
sched/fair: Update min_vruntime for reweight_entity() correctly

and

sched/eevdf: Delay dequeue

Unfortunately, both variants do NOT fix the problem.
The regression remains unchanged.


I will continue getting myself familiar with how cgroups are scheduled to dig 
deeper here. If there are any other ideas, I'd be happy to use them as a 
starting point for further analysis.

Would additional traces still be of interest? If so, I would be glad to
provide them.

[...]



Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Abel Wu

On 11/20/23 6:56 PM, Peter Zijlstra Wrote:

On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:


Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)


(W+w)/W


Ah, right. I misunderstood (again) the comment which says:

vl_i = (W + w_i)*vl'_i / W

So the current implementation is:

v' = V - vl'

and what I was proposing is:

v' = V' - vl

and they are equal in fact.





Does this intended?


The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.


Yes. You did the math in an iterative fashion and mine is facing the
final state:

v' = V' - vlag
V' = (WV + wv') / (W + w)

which gives:

V' = V - w * vlag / W



Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.


And to illustrate my understanding of strategy #1:



@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 * vl_i is given by:
 *
 *   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
-*  = (W*V + w_i*(V - vl_i)) / (W + w_i)
-*  = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
-*  = (V*(W + w_i) - w_i*l) / (W + w_i)
-*  = V - w_i*vl_i / (W + w_i)
-*
-* And the actual lag after adding an entity with vl_i is:
-*
-*   vl'_i = V' - v_i
-* = V - w_i*vl_i / (W + w_i) - (V - vl_i)
-* = vl_i - w_i*vl_i / (W + w_i)
-*
-* Which is strictly less than vl_i. So in order to preserve lag
-* we should inflate the lag before placement such that the
-* effective lag after placement comes out right.
-*
-* As such, invert the above relation for vl'_i to get the vl_i
-* we need to use such that the lag after placement is the lag
-* we computed before dequeue.
+*  = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+*  = V - w_i*vl_i / W
 *
-*   vl'_i = vl_i - w_i*vl_i / (W + w_i)
-* = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
-*
-*   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
-*   = W*vl_i
-*
-*   vl_i = (W + w_i)*vl'_i / W
 */
load = cfs_rq->avg_load;
if (curr && curr->on_rq)
load += scale_load_down(curr->load.weight);
-
-   lag *= load + scale_load_down(se->load.weight);
if (WARN_ON_ONCE(!load))
load = 1;
-   lag = div_s64(lag, load);
+
+   vruntime -= div_s64(lag * scale_load_down(se->load.weight), 
load);
}
se->vruntime = vruntime - lag;



So you're proposing we do:

v = V - (lag * w) / (W + w) - lag


What I 'm proposing is:

V' = V - w * vlag / W

so we have:

v' = V' - vlag
   = V - vlag * w/W - vlag
   = V - vlag * (W + w)/W

which is exactly the same as current implementation.



?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement:

l' = V' - v = (V*W + v*w) / (W+w) - v
 = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
= (V*W + v*w -v*W - v*w) / (W+w)
= (V*W - v*W) / (W+w)
= W*(V-v) / (W+w)
= W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
= W/(W+w) * (V - V + (W+w)/W * l)
= W/(W+w) * (W+w)/W * l
= l

So by scaling, we've preserved lag across placement.

That make sense?


Yes, I think I won't misunderstand again for the 3rd time :)

Thanks!
Abel



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-20 Thread Peter Zijlstra
On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:

> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
> Given that the weight doesn't change, we have:
> 
>   vl' = vl
> 
> But in fact it is scaled on placement:
> 
>   vl' = vl * W/(W + w)

(W+w)/W

> 
> Does this intended? 

The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.

Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.

> And to illustrate my understanding of strategy #1:

> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct 
> sched_entity *se, int flags)
>* vl_i is given by:
>*
>*   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> -  *  = (W*V + w_i*(V - vl_i)) / (W + w_i)
> -  *  = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> -  *  = (V*(W + w_i) - w_i*l) / (W + w_i)
> -  *  = V - w_i*vl_i / (W + w_i)
> -  *
> -  * And the actual lag after adding an entity with vl_i is:
> -  *
> -  *   vl'_i = V' - v_i
> -  * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> -  * = vl_i - w_i*vl_i / (W + w_i)
> -  *
> -  * Which is strictly less than vl_i. So in order to preserve lag
> -  * we should inflate the lag before placement such that the
> -  * effective lag after placement comes out right.
> -  *
> -  * As such, invert the above relation for vl'_i to get the vl_i
> -  * we need to use such that the lag after placement is the lag
> -  * we computed before dequeue.
> +  *  = (W*V + w_i*(V' - vl_i)) / (W + w_i)
> +  *  = V - w_i*vl_i / W
>*
> -  *   vl'_i = vl_i - w_i*vl_i / (W + w_i)
> -  * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> -  *
> -  *   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> -  *   = W*vl_i
> -  *
> -  *   vl_i = (W + w_i)*vl'_i / W
>*/
>   load = cfs_rq->avg_load;
>   if (curr && curr->on_rq)
>   load += scale_load_down(curr->load.weight);
> -
> - lag *= load + scale_load_down(se->load.weight);
>   if (WARN_ON_ONCE(!load))
>   load = 1;
> - lag = div_s64(lag, load);
> +
> + vruntime -= div_s64(lag * scale_load_down(se->load.weight), 
> load);
>   }
>   se->vruntime = vruntime - lag;


So you're proposing we do:

v = V - (lag * w) / (W + w) - lag

?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
  = V - (lag * (W+w) + lag * w) / (W+w)
  = V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.


Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement:

l' = V' - v = (V*W + v*w) / (W+w) - v
   = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
   = (V*W + v*w - v*W - v*w) / (W+w)
   = (V*W - v*W) / (W+w)
   = W*(V-v) / (W+w)
   = W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
   = W/(W+w) * (V - V + (W+w)/W * l)
   = W/(W+w) * (W+w)/W * l
   = l

So by scaling, we've preserved lag across placement.

Does that make sense?



Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-17 Thread Abel Wu

On 11/17/23 5:23 PM, Peter Zijlstra Wrote:


Your email is pretty badly mangled by wrapping, please try and
reconfigure your MUA, esp. the trace and debug output is unreadable.

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:


The base scenario are two KVM guests running on an s390 LPAR. One guest
hosts the uperf server, one the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup see the section TEST SUMMARY at
the bottom.


Well, that's not good :/


Short summary:
The mentioned kworker has been scheduled to CPU 14 before the tracing was
enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs
4242563284 -> factor 20)


So bear with me, I know absolutely nothing about virt stuff. I suspect
there's cgroups involved because shiny or something.

kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?

Also, what are their weights / nice values?


The vhost process wants to wake up the kworker, therefore the kworker is
placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes on
other CPUs.

So far this behavior is not different to what we see on pre-EEVDF kernels.

On timestamp 576.162767, the vhost process triggers the last wake up of
another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Now, the vhost process
ends its time slice.
Then, vhost gets re-assigned new time slices 4 times and is then migrated
off to CPU 15.


So why does this vhost stay on the CPU if it doesn't have anything to
do? (I've not tried to make sense of the trace, that's just too
painful).


This does not occur with older kernels.
The kworker has to wait for the migration to happen in order to be able to
execute again.
This is due to the fact that the vruntime of the kworker is significantly
larger than that of vhost.


That's weird. Can you add a trace_printk() to update_entity_lag() and
have it print out the lag, limit and vlag (post clamping) values? And
also in place_entity() for the reverse process, lag pre and post scaling
or something.

After confirming both tasks are indeed in the same cgroup of course,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.

Also, what HZ and what preemption mode are you running? If kworker is
somehow vastly over-shooting its slice -- keeps running way past the
avg_vruntime, then it will build up a giant lag and you get what you
describe, next time it wakes up it gets placed far to the right (exactly
where it was when it 'finally' went to sleep, relatively speaking).


We found some options which sound plausible but we are not sure if they are
valid or not:

1. The wake up path has a dependency on the vruntime metrics that now delays
the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime),
which updates the way cfs_rq->min_vruntime and cfs_rq->avg_vruntime are
set, might have introduced an issue which is uncovered by the commit
mentioned above.


Suppose you have a few tasks (of equal weight) on your virtual timeline
like so:

-+---+---+---+---+--
 ^       ^
 |       `- avg_vruntime
 `- min_vruntime

Then the above would be more or less the relative placements of these
values. avg_vruntime is the weighted average of the various vruntimes
and is therefore always in the 'middle' of the tasks, and not somewhere
out-there.

min_vruntime is a monotonically increasing 'minimum' that's left-ish on
the tree (there are a few cases where a new task can be placed left of
min_vruntime and it's no longer actually the minimum, but whatever).

These values should be relatively close to one another, depending
of course on the spread of the tasks. So I don't think this is causing
trouble.

Anyway, the big difference with lag based placement is that where
previously tasks (that do not migrate) retain their old vruntime and on
placing they get pulled forward to at least min_vruntime, so a task that
wildly overshoots, but then doesn't run for significant time can still
be overtaken and then when placed again be 'okay'.

Now OTOH, with lag-based placement, we strictly preserve their relative
offset vs avg_vruntime. So if they were *far* to the right when they go
to sleep, they will again be there on placement.


Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)

Is this intended? And to illustrate my understanding of strategy #1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- 

Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-17 Thread Abel Wu

On 11/17/23 8:37 PM, Peter Zijlstra Wrote:

On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:

On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:



kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?


The qemu instances of the two KVM guests are placed into cgroups.
The vhosts run within the context of these qemu instances (4 threads per guest).
So they are also put into those cgroups.

I'll answer the other questions you brought up as well, but I guess that one
is most critical:



After confirming both tasks are indeed in the same cgroup of course,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.


In that case we probably have to go with elsewhere ... which is good to know.


Ah, so if this is a cgroup issue, it might be worth trying this patch
that we have in tip/sched/urgent.


And please also apply this fix:
https://lore.kernel.org/all/20231117080106.12890-1-s921975...@gmail.com/



I'll try and read the rest of the email a little later, gotta run
errands first.

---

commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu 
Date:   Tue Nov 7 17:05:07 2023 +0800

 sched/eevdf: Fix vruntime adjustment on reweight

 vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
 it gets re-weighted, and the calculations can be simplified based
 on the fact that re-weight won't change the w-average of all the
 entities. Please check the proofs in comments.

 But adjusting vruntime can also cause position change in RB-tree
 hence require re-queue to fix up which might be costly. This might
 be avoided by deferring adjustment to the time the entity actually
 leaves tree (dequeue/pick), but that will negatively affect task
 selection and probably not good enough either.

 Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
 Signed-off-by: Abel Wu 
 Signed-off-by: Peter Zijlstra (Intel) 
 Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.a...@bytedance.com

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138ce54b..025d90925bf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
  dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
  #endif
  
+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+  unsigned long weight)
+{
+   unsigned long old_weight = se->load.weight;
+   u64 avruntime = avg_vruntime(cfs_rq);
+   s64 vlag, vslice;
+
+   /*
+* VRUNTIME
+* 
+*
+* COROLLARY #1: The virtual runtime of the entity needs to be
+* adjusted if re-weight at !0-lag point.
+*
+* Proof: For contradiction assume this is not true, so we can
+* re-weight without changing vruntime at !0-lag point.
+*
+* Weight   VRuntime   Avg-VRuntime
+* beforew  vV
+*  afterw' v'   V'
+*
+* Since lag needs to be preserved through re-weight:
+*
+*  lag = (V - v)*w = (V'- v')*w', where v = v'
+*  ==>  V' = (V - v)*w/w' + v   (1)
+*
+* Let W be the total weight of the entities before reweight,
+* since V' is the new weighted average of entities:
+*
+*  V' = (WV + w'v - wv) / (W + w' - w) (2)
+*
+* by using (1) & (2) we obtain:
+*
+*  (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+*  ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+*  ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+*  ==>  (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+*
+* Since we are doing at !0-lag point which means V != v, we
+* can simplify (3):
+*
+*  ==>  W / (W + w' - w) = w / w'
+*  ==>  Ww' = Ww + ww' - ww
+*  ==>  W * (w' - w) = w * (w' - w)
+*  ==>  W = w   (re-weight indicates w' != w)
+*
+* So the cfs_rq contains only one entity, hence vruntime of
+* the entity @v should always equal to the cfs_rq's weighted
+* average vruntime @V, which means we will always re-weight
+* at 0-lag point, thus breach assumption. Proof completed.
+*
+*
+* COROLLARY #2: Re-weight does NOT affect weighted average
+* vruntime of all the entities.
+*
+* Proof: According to corollary #1, Eq. (1) should be:
+*
+*  (V - v)*w = (V' - v')*w'
+*  ==>v' = V' - (V - v)*w/w'(4)
+*
+* According to the weighted average formula, we have:
+*
+*  V' = (WV - wv + w'v') / (W - w + w')
+* =