RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Michel, > On 2021-08-10 10:30 a.m., Daniel Vetter wrote: > > On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote: > >>> On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: > >>> > >>> Hence my gut feeling reaction that first we need to get these two > >>> compositors aligned in their timings, which probably needs > >>> consistent vblank periods/timestamps across them (plus/minus > >>> guest/host clocksource fun ofc). Without this any of the next steps > >>> will simply not work because there's too much jitter by the time the > >>> guest compositor gets the flip completion events. > >> [Kasireddy, Vivek] Timings are not a problem and do not significantly > >> affect the repaint cycles from what I have seen so far. > >> > >>> > >>> Once we have solid events I think we should look into statically > >>> tuning guest/host compositor deadlines (like you've suggested in a > >>> bunch of places) to consistently make that deadline and hit 60 fps. > >>> With that we can then look into tuning this automatically and what to > >>> do when e.g. switching between copying and zero-copy on the host side > >>> (which might be needed in some cases) and how to handle all that. > >> [Kasireddy, Vivek] As I confirm here: > >>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065 > >> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. > >> However, > >> I feel that this zero-copy solution I am trying to create should be > >> independent > >> of compositors' deadlines, delays or other scheduling parameters. > > > > That's not how compositors work nowadays. Your problem is that you don't > > have the guest/host compositor in sync. zero-copy only changes the > > timing, > > so it changes things from "rendering way too many frames" to "rendering > > way too few frames". > > > > We need to fix the timing/sync issue here first, not paper over it with > > hacks. 
> [Kasireddy, Vivek] What I really meant is that the zero-copy solution > should be > independent of the scheduling policies to ensure that it works with all > compositors. > IIUC, Weston for example uses the vblank/pageflip completion timestamp, > the > configurable repaint-window value, refresh-rate, etc to determine when > to start > its next repaint -- if there is any damage: > timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec); > timespec_add_msec(&output->next_repaint, &output->next_repaint, > -compositor->repaint_msec); > > And, in the case of VKMS, since there is no real hardware, the timestamp > is always: > now = ktime_get(); > send_vblank_event(dev, e, seq, now); > >>> > >>> vkms was fixed a while ago to fake high-precision timestamps like > >>> from a real display. > >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest > >> does not need > >> to have the same timestamp as that of the Host -- to work as expected. > >> > >>> > When you say that the Guest/Host compositors need to stay in sync, are you > suggesting that we need to ensure that the vblank timestamp on the Host > needs to be shared and be the same on the Guest and a vblank/pageflip > completion for the Guest needs to be sent at exactly the same time it is > sent > on the Host? If yes, I'd say that we do send the pageflip completion to > Guest > around the same time a vblank is generated on the Host but it does not > help > because the Guest compositor would only have 9 ms to submit a new frame > and if the Host is running Mutter, the Guest would only have 2 ms. > (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341) > >>> > >>> Not at the same time, but the same timestamp. And yes there is some fun > >>> there, which is I think the fundamental issue. Or at least some of the > >>> compositor experts seem to think so, and it makes sense to me. 
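For reference, the repaint-window computation quoted above amounts to simple offset arithmetic. The following is a hedged sketch, not Weston's code (the helper name and signature are illustrative); it only shows why a 60 Hz output with Weston's default 7 ms repaint window produces the roughly 9 ms wait discussed throughout this thread:

```c
#include <stdint.h>

/* Sketch of the scheduling described above:
 *   next_repaint = vblank_timestamp + refresh_period - repaint_window
 * With a 60 Hz refresh (~16.7 ms) and a 7 ms repaint window, the
 * compositor idles ~9.7 ms after each vblank before starting repaint.
 * Helper name/signature are illustrative, not Weston's API. */
static int64_t next_repaint_ns(int64_t vblank_ts_ns, int64_t refresh_ns,
                               int64_t repaint_window_ms)
{
    return vblank_ts_ns + refresh_ns - repaint_window_ms * 1000000LL;
}
```

So a vblank at t=0 with a 16666667 ns period and a 7 ms window schedules the next repaint at t=9666667 ns, i.e. the "9 ms" figure Vivek cites.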
> >> [Kasireddy, Vivek] It is definitely possible that if the timestamp is > >> messed up, then > >> the Guest repaint cycle would be affected. However, I do not believe that > >> is the case > >> here given the debug and instrumentation data we collected and > >> scrutinized. Hopefully, > >> compositor experts could chime in to shed some light on this matter. > >> > >>> > > > > Only, and I really mean only, when that shows that it's simply > > impossible > > to hit 60fps with zero-copy and the guest/host fully aligned should we > > look into making the overall pipeline deeper. > [Kasireddy, Vivek] From all the experiments conducted so far and given > the > discussion associated with > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > I think we have already established that in order for a zero-copy >
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote: > > Hi Daniel, > > > > > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: > > > > Hi Daniel, > > > > > > > > > > > > >>> The solution: > > > > > > > > >>> - To ensure full framerate, the Guest compositor has to > > > > > > > > >>> start its repaint > > > cycle > > > > > > > (including > > > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame > > > > > > > > >>> callback event > to > > > its > > > > > > > clients. > > > > > > > > >>> In order for this to happen, the dma-fence that the Guest > > > > > > > > >>> KMS waits on -- > > > before > > > > > > > sending > > > > > > > > >>> pageflip completion -- cannot be tied to a > > > > > > > > >>> wl_buffer.release event. This > > > means > > > > > that, > > > > > > > the > > > > > > > > >>> Guest compositor has to be forced to use a new buffer for > > > > > > > > >>> its next > repaint > > > cycle > > > > > > > when it > > > > > > > > >>> gets a pageflip completion. > > > > > > > > >> > > > > > > > > >> Is that really the only solution? > > > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > > > > > > But I think none of them are as compelling as this one. > > > > > > > > > > > > > > > > > >> > > > > > > > > >> If we fix the event timestamps so that both guest and host > > > > > > > > >> use the same > > > > > > > > >> timestamp, but then the guest starts 5ms (or something like > > > > > > > > >> that) earlier, > > > > > > > > >> then things should work too? I.e. 
> > > > > > > > >> - host compositor starts at (previous_frametime + 9ms) > > > > > > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > > > > > > >> > > > > > > > > >> Ofc this only works if the frametimes we hand out to both > > > > > > > > >> match > _exactly_ > > > > > > > > >> and are as high-precision as the ones on the host side. > > > > > > > > >> Which for many > gpu > > > > > > > > >> drivers at least is the case, and all the ones you care > > > > > > > > >> about for sure :-) > > > > > > > > >> > > > > > > > > >> But if the frametimes the guest receives are the no_vblank > > > > > > > > >> fake ones, > then > > > > > > > > >> they'll be all over the place and this carefully tuned > > > > > > > > >> low-latency redraw > > > > > > > > >> loop falls apart. Aside from the fact that without tuning > > > > > > > > >> the guests to > > > > > > > > >> be earlier than the hosts, you're guaranteed to miss every > > > > > > > > >> frame (except > > > > > > > > >> when the timing wobbliness in the guest is big enough by > > > > > > > > >> chance to make > > > > > > > > >> the deadline on the oddball frame). > > > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event > > > > > > > > > timestamps as > we > > > don't > > > > > > > > > share these between the Guest and the Host. It does not seem > > > > > > > > > to be causing > any > > > > > other > > > > > > > > > problems so far but we did try the experiment you mentioned > > > > > > > > > (i.e., > adjusting > > > the > > > > > > > delays) > > > > > > > > > and it works. However, this patch series is meant to fix the > > > > > > > > > issue without > > > having to > > > > > > > tweak > > > > > > > > > anything (delays) because we can't do this for every > > > > > > > > > compositor out there. > > > > > > > > > > > > > > > > Maybe there could be a mechanism which allows the compositor in > > > > > > > > the guest > to > > > > > > > automatically adjust its repaint cycle as needed. 
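Daniel's static-tuning proposal quoted above boils down to giving the guest compositor a fixed head start over the host against a shared, high-precision frametime. A minimal sketch using the 9 ms / 4 ms example offsets from the thread (function names are illustrative, not from any real API):

```c
#include <stdint.h>

/* Static deadline tuning from the discussion above: both compositors
 * receive the same frametime, but the guest starts repainting earlier
 * so its frame is ready by the time the host samples client buffers.
 * The 9 ms / 4 ms offsets are the example values from this thread. */
static int64_t host_repaint_start_ms(int64_t prev_frametime_ms)
{
    return prev_frametime_ms + 9;
}

static int64_t guest_repaint_start_ms(int64_t prev_frametime_ms)
{
    return prev_frametime_ms + 4; /* 5 ms head start over the host */
}
```

As the thread notes, this only works if the frametimes handed to the guest are exact; with fake no_vblank timestamps the fixed offset is meaningless.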
> > > > > > > > > > > > > > > > This might even be possible without requiring changes in each > > > > > > > > compositor, > by > > > > > adjusting > > > > > > > the vertical blank periods in the guest to be aligned with the > > > > > > > host compositor > > > repaint > > > > > > > cycles. Not sure about that though. > > > > > > > > > > > > > > > > Even if not, both this series or making it possible to queue > > > > > > > > multiple flips > require > > > > > > > corresponding changes in each compositor as well to have any > > > > > > > effect. > > > > > > > > > > > > > > Yeah from all the discussions and tests done it sounds even with a > > > > > > > deeper queue we have big coordination issues between the guest and > > > > > > > host compositor (like the example that the guest is now rendering > > > > > > > at > > > > > > > 90fps instead of 60fps like the host). > > > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to > > > > > > Gerd. That 90 > FPS vs > > > > > > 60 FPS problem is a completely different issue that is associated > > > > > > with Qemu > GTK UI > > > > > > backend. With the GTK backend -- and also with SDL backend -- we > > > > > > Blit the > Guest > > > > > > scanout FB onto one of the backbuffers managed by EGL. > > > > > > > > > > > > I am trying
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On 2021-08-10 10:30 a.m., Daniel Vetter wrote: > On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote: >>> On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: >>> >>> Hence my gut feeling reaction that first we need to get these two >>> compositors aligned in their timings, which probably needs >>> consistent vblank periods/timestamps across them (plus/minus >>> guest/host clocksource fun ofc). Without this any of the next steps >>> will simply not work because there's too much jitter by the time the >>> guest compositor gets the flip completion events. >> [Kasireddy, Vivek] Timings are not a problem and do not significantly >> affect the repaint cycles from what I have seen so far. >> >>> >>> Once we have solid events I think we should look into statically >>> tuning guest/host compositor deadlines (like you've suggested in a >>> bunch of places) to consistently make that deadline and hit 60 fps. >>> With that we can then look into tuning this automatically and what to >>> do when e.g. switching between copying and zero-copy on the host side >>> (which might be needed in some cases) and how to handle all that. >> [Kasireddy, Vivek] As I confirm here: >>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065 >> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However, >> I feel that this zero-copy solution I am trying to create should be >> independent >> of compositors' deadlines, delays or other scheduling parameters. > > That's not how compositors work nowadays. Your problem is that you don't > have the guest/host compositor in sync. zero-copy only changes the timing, > so it changes things from "rendering way too many frames" to "rendering > way too few frames". > > We need to fix the timing/sync issue here first, not paper over it with > hacks. [Kasireddy, Vivek] What I really meant is that the zero-copy solution should be independent of the scheduling policies to ensure that it works with all compositors. 
IIUC, Weston for example uses the vblank/pageflip completion timestamp, the configurable repaint-window value, refresh-rate, etc to determine when to start its next repaint -- if there is any damage: timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec); timespec_add_msec(&output->next_repaint, &output->next_repaint, -compositor->repaint_msec); And, in the case of VKMS, since there is no real hardware, the timestamp is always: now = ktime_get(); send_vblank_event(dev, e, seq, now); >>> >>> vkms was fixed a while ago to fake high-precision timestamps like >>> from a real display. >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does >> not need >> to have the same timestamp as that of the Host -- to work as expected. >> >>> When you say that the Guest/Host compositors need to stay in sync, are you suggesting that we need to ensure that the vblank timestamp on the Host needs to be shared and be the same on the Guest and a vblank/pageflip completion for the Guest needs to be sent at exactly the same time it is sent on the Host? If yes, I'd say that we do send the pageflip completion to Guest around the same time a vblank is generated on the Host but it does not help because the Guest compositor would only have 9 ms to submit a new frame and if the Host is running Mutter, the Guest would only have 2 ms. (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341) >>> >>> Not at the same time, but the same timestamp. And yes there is some fun >>> there, which is I think the fundamental issue. Or at least some of the >>> compositor experts seem to think so, and it makes sense to me. >> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed >> up, then >> the Guest repaint cycle would be affected. However, I do not believe that is >> the case >> here given the debug and instrumentation data we collected and scrutinized. >> Hopefully, >> compositor experts could chime in to shed some light on this matter. 
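The vkms fix Daniel refers to essentially replaces the raw `ktime_get()` value with a timestamp snapped back to the simulated vblank period, so userspace sees display-like, period-aligned timestamps instead of arbitrary ones. A rough userspace model of that idea (this is an illustration of the concept, not the actual vkms code; all names are mine):

```c
#include <stdint.h>

/* Model of faking a high-precision vblank timestamp: instead of
 * reporting "now", report the most recent period-aligned instant,
 * as if a real display had raised its vblank interrupt exactly on
 * schedule. base_ns is when the simulated display "turned on". */
static uint64_t fake_vblank_ts(uint64_t now_ns, uint64_t period_ns,
                               uint64_t base_ns)
{
    uint64_t elapsed = now_ns - base_ns;
    return base_ns + (elapsed / period_ns) * period_ns;
}
```

With timestamps like these, consecutive vblank events are spaced by exact multiples of the refresh period, which is what compositors' repaint schedulers assume.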
>> >>> > > Only, and I really mean only, when that shows that it's simply impossible > to hit 60fps with zero-copy and the guest/host fully aligned should we > look into making the overall pipeline deeper. [Kasireddy, Vivek] From all the experiments conducted so far and given the discussion associated with https://gitlab.freedesktop.org/wayland/weston/-/issues/514 I think we have already established that in order for a zero-copy solution to work reliably, the Guest compositor needs to start its repaint cycle when the Host compositor sends a frame callback event to its clients. > >>> Only when that all shows that we just can't hit 60fps
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote: > Hi Daniel, > > > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: > > > Hi Daniel, > > > > > > > > > > >>> The solution: > > > > > > > >>> - To ensure full framerate, the Guest compositor has to start > > > > > > > >>> its repaint > > cycle > > > > > > (including > > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame > > > > > > > >>> callback event to > > its > > > > > > clients. > > > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS > > > > > > > >>> waits on -- > > before > > > > > > sending > > > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release > > > > > > > >>> event. This > > means > > > > that, > > > > > > the > > > > > > > >>> Guest compositor has to be forced to use a new buffer for its > > > > > > > >>> next repaint > > cycle > > > > > > when it > > > > > > > >>> gets a pageflip completion. > > > > > > > >> > > > > > > > >> Is that really the only solution? > > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > > > > > But I think none of them are as compelling as this one. > > > > > > > > > > > > > > > >> > > > > > > > >> If we fix the event timestamps so that both guest and host use > > > > > > > >> the same > > > > > > > >> timestamp, but then the guest starts 5ms (or something like > > > > > > > >> that) earlier, > > > > > > > >> then things should work too? I.e. > > > > > > > >> - host compositor starts at (previous_frametime + 9ms) > > > > > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > > > > > >> > > > > > > > >> Ofc this only works if the frametimes we hand out to both > > > > > > > >> match _exactly_ > > > > > > > >> and are as high-precision as the ones on the host side. 
Which > > > > > > > >> for many gpu > > > > > > > >> drivers at least is the case, and all the ones you care about > > > > > > > >> for sure :-) > > > > > > > >> > > > > > > > >> But if the frametimes the guest receives are the no_vblank > > > > > > > >> fake ones, then > > > > > > > >> they'll be all over the place and this carefully tuned > > > > > > > >> low-latency redraw > > > > > > > >> loop falls apart. Aside from the fact that without tuning the > > > > > > > >> guests to > > > > > > > >> be earlier than the hosts, you're guaranteed to miss every > > > > > > > >> frame (except > > > > > > > >> when the timing wobbliness in the guest is big enough by > > > > > > > >> chance to make > > > > > > > >> the deadline on the oddball frame). > > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event > > > > > > > > timestamps as we > > don't > > > > > > > > share these between the Guest and the Host. It does not seem to > > > > > > > > be causing any > > > > other > > > > > > > > problems so far but we did try the experiment you mentioned > > > > > > > > (i.e., adjusting > > the > > > > > > delays) > > > > > > > > and it works. However, this patch series is meant to fix the > > > > > > > > issue without > > having to > > > > > > tweak > > > > > > > > anything (delays) because we can't do this for every compositor > > > > > > > > out there. > > > > > > > > > > > > > > Maybe there could be a mechanism which allows the compositor in > > > > > > > the guest to > > > > > > automatically adjust its repaint cycle as needed. > > > > > > > > > > > > > > This might even be possible without requiring changes in each > > > > > > > compositor, by > > > > adjusting > > > > > > the vertical blank periods in the guest to be aligned with the host > > > > > > compositor > > repaint > > > > > > cycles. Not sure about that though. 
> > > > > > > > > > > > > > Even if not, both this series or making it possible to queue > > > > > > > multiple flips require > > > > > > corresponding changes in each compositor as well to have any effect. > > > > > > > > > > > > Yeah from all the discussions and tests done it sounds even with a > > > > > > deeper queue we have big coordination issues between the guest and > > > > > > host compositor (like the example that the guest is now rendering at > > > > > > 90fps instead of 60fps like the host). > > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. > > > > > That 90 FPS vs > > > > > 60 FPS problem is a completely different issue that is associated > > > > > with Qemu GTK UI > > > > > backend. With the GTK backend -- and also with SDL backend -- we Blit > > > > > the Guest > > > > > scanout FB onto one of the backbuffers managed by EGL. > > > > > > > > > > I am trying to add a new Qemu Wayland UI backend so that we can > > > > > eliminate that > > Blit > > > > > and thereby have a truly zero-copy solution. And, this is where I am > > > > > running into the > > > > > halved frame-rate issue -- the current problem. > > > > > > > > Yes, that's what I
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: > > Hi Daniel, > > > > > > > > >>> The solution: > > > > > > >>> - To ensure full framerate, the Guest compositor has to start > > > > > > >>> its repaint > cycle > > > > > (including > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame > > > > > > >>> callback event to > its > > > > > clients. > > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS > > > > > > >>> waits on -- > before > > > > > sending > > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release > > > > > > >>> event. This > means > > > that, > > > > > the > > > > > > >>> Guest compositor has to be forced to use a new buffer for its > > > > > > >>> next repaint > cycle > > > > > when it > > > > > > >>> gets a pageflip completion. > > > > > > >> > > > > > > >> Is that really the only solution? > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > > > > But I think none of them are as compelling as this one. > > > > > > > > > > > > > >> > > > > > > >> If we fix the event timestamps so that both guest and host use > > > > > > >> the same > > > > > > >> timestamp, but then the guest starts 5ms (or something like > > > > > > >> that) earlier, > > > > > > >> then things should work too? I.e. > > > > > > >> - host compositor starts at (previous_frametime + 9ms) > > > > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > > > > >> > > > > > > >> Ofc this only works if the frametimes we hand out to both match > > > > > > >> _exactly_ > > > > > > >> and are as high-precision as the ones on the host side. 
Which > > > > > > >> for many gpu > > > > > > >> drivers at least is the case, and all the ones you care about > > > > > > >> for sure :-) > > > > > > >> > > > > > > >> But if the frametimes the guest receives are the no_vblank fake > > > > > > >> ones, then > > > > > > >> they'll be all over the place and this carefully tuned > > > > > > >> low-latency redraw > > > > > > >> loop falls apart. Aside from the fact that without tuning the > > > > > > >> guests to > > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame > > > > > > >> (except > > > > > > >> when the timing wobbliness in the guest is big enough by chance > > > > > > >> to make > > > > > > >> the deadline on the oddball frame). > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event > > > > > > > timestamps as we > don't > > > > > > > share these between the Guest and the Host. It does not seem to > > > > > > > be causing any > > > other > > > > > > > problems so far but we did try the experiment you mentioned > > > > > > > (i.e., adjusting > the > > > > > delays) > > > > > > > and it works. However, this patch series is meant to fix the > > > > > > > issue without > having to > > > > > tweak > > > > > > > anything (delays) because we can't do this for every compositor > > > > > > > out there. > > > > > > > > > > > > Maybe there could be a mechanism which allows the compositor in the > > > > > > guest to > > > > > automatically adjust its repaint cycle as needed. > > > > > > > > > > > > This might even be possible without requiring changes in each > > > > > > compositor, by > > > adjusting > > > > > the vertical blank periods in the guest to be aligned with the host > > > > > compositor > repaint > > > > > cycles. Not sure about that though. > > > > > > > > > > > > Even if not, both this series or making it possible to queue > > > > > > multiple flips require > > > > > corresponding changes in each compositor as well to have any effect. 
> > > > > > > > > > Yeah from all the discussions and tests done it sounds even with a > > > > > deeper queue we have big coordination issues between the guest and > > > > > host compositor (like the example that the guest is now rendering at > > > > > 90fps instead of 60fps like the host). > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. > > > > That 90 FPS vs > > > > 60 FPS problem is a completely different issue that is associated with > > > > Qemu GTK UI > > > > backend. With the GTK backend -- and also with SDL backend -- we Blit > > > > the Guest > > > > scanout FB onto one of the backbuffers managed by EGL. > > > > > > > > I am trying to add a new Qemu Wayland UI backend so that we can > > > > eliminate that > Blit > > > > and thereby have a truly zero-copy solution. And, this is where I am > > > > running into the > > > > halved frame-rate issue -- the current problem. > > > > > > Yes, that's what I referenced. But I disagree that it's a different > > > problem. The underlying problem in both cases is that the guest and host > > > compositor free-wheel instead of rendering in sync. It's just that > > > depending upon how exactly the flip completion event on the guest side > > > plays out
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote: > Hi Daniel, > > > > > > >>> The solution: > > > > > >>> - To ensure full framerate, the Guest compositor has to start > > > > > >>> its repaint cycle > > > > (including > > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback > > > > > >>> event to its > > > > clients. > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS > > > > > >>> waits on -- before > > > > sending > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release > > > > > >>> event. This means > > that, > > > > the > > > > > >>> Guest compositor has to be forced to use a new buffer for its > > > > > >>> next repaint cycle > > > > when it > > > > > >>> gets a pageflip completion. > > > > > >> > > > > > >> Is that really the only solution? > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > > > But I think none of them are as compelling as this one. > > > > > > > > > > > >> > > > > > >> If we fix the event timestamps so that both guest and host use the > > > > > >> same > > > > > >> timestamp, but then the guest starts 5ms (or something like that) > > > > > >> earlier, > > > > > >> then things should work too? I.e. > > > > > >> - host compositor starts at (previous_frametime + 9ms) > > > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > > > >> > > > > > >> Ofc this only works if the frametimes we hand out to both match > > > > > >> _exactly_ > > > > > >> and are as high-precision as the ones on the host side. 
Which for > > > > > >> many gpu > > > > > >> drivers at least is the case, and all the ones you care about for > > > > > >> sure :-) > > > > > >> > > > > > >> But if the frametimes the guest receives are the no_vblank fake > > > > > >> ones, then > > > > > >> they'll be all over the place and this carefully tuned low-latency > > > > > >> redraw > > > > > >> loop falls apart. Aside from the fact that without tuning the > > > > > >> guests to > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame > > > > > >> (except > > > > > >> when the timing wobbliness in the guest is big enough by chance to > > > > > >> make > > > > > >> the deadline on the oddball frame). > > > > > > [Kasireddy, Vivek] The Guest and Host use different event > > > > > > timestamps as we don't > > > > > > share these between the Guest and the Host. It does not seem to be > > > > > > causing any > > other > > > > > > problems so far but we did try the experiment you mentioned (i.e., > > > > > > adjusting the > > > > delays) > > > > > > and it works. However, this patch series is meant to fix the issue > > > > > > without having to > > > > tweak > > > > > > anything (delays) because we can't do this for every compositor out > > > > > > there. > > > > > > > > > > Maybe there could be a mechanism which allows the compositor in the > > > > > guest to > > > > automatically adjust its repaint cycle as needed. > > > > > > > > > > This might even be possible without requiring changes in each > > > > > compositor, by > > adjusting > > > > the vertical blank periods in the guest to be aligned with the host > > > > compositor repaint > > > > cycles. Not sure about that though. > > > > > > > > > > Even if not, both this series or making it possible to queue multiple > > > > > flips require > > > > corresponding changes in each compositor as well to have any effect. 
> > > > > > > > Yeah from all the discussions and tests done it sounds even with a > > > > deeper queue we have big coordination issues between the guest and > > > > host compositor (like the example that the guest is now rendering at > > > > 90fps instead of 60fps like the host). > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. > > > That 90 FPS vs > > > 60 FPS problem is a completely different issue that is associated with > > > Qemu GTK UI > > > backend. With the GTK backend -- and also with SDL backend -- we Blit the > > > Guest > > > scanout FB onto one of the backbuffers managed by EGL. > > > > > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate > > > that Blit > > > and thereby have a truly zero-copy solution. And, this is where I am > > > running into the > > > halved frame-rate issue -- the current problem. > > > > Yes, that's what I referenced. But I disagree that it's a different > > problem. The underlying problem in both cases is that the guest and host > > compositor free-wheel instead of rendering in sync. It's just that > > depending upon how exactly the flip completion event on the guest side > > plays out you either get guest rendering that's faster than the host-side > > 60fps, or guest rendering that's much slower than the host-side 60fps. > [Kasireddy, Vivek] That used to be the case before we added a synchronization > mechanism to the GTK
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > > > > >>> The solution: > > > > >>> - To ensure full framerate, the Guest compositor has to start its > > > > >>> repaint cycle > > > (including > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback > > > > >>> event to its > > > clients. > > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits > > > > >>> on -- before > > > sending > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. > > > > >>> This means > that, > > > the > > > > >>> Guest compositor has to be forced to use a new buffer for its next > > > > >>> repaint cycle > > > when it > > > > >>> gets a pageflip completion. > > > > >> > > > > >> Is that really the only solution? > > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > > But I think none of them are as compelling as this one. > > > > > > > > > >> > > > > >> If we fix the event timestamps so that both guest and host use the > > > > >> same > > > > >> timestamp, but then the guest starts 5ms (or something like that) > > > > >> earlier, > > > > >> then things should work too? I.e. > > > > >> - host compositor starts at (previous_frametime + 9ms) > > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > > >> > > > > >> Ofc this only works if the frametimes we hand out to both match > > > > >> _exactly_ > > > > >> and are as high-precision as the ones on the host side. Which for > > > > >> many gpu > > > > >> drivers at least is the case, and all the ones you care about for > > > > >> sure :-) > > > > >> > > > > >> But if the frametimes the guest receives are the no_vblank fake > > > > >> ones, then > > > > >> they'll be all over the place and this carefully tuned low-latency > > > > >> redraw > > > > >> loop falls apart. 
Aside from the fact that without tuning the > > > > >> guests to > > > > >> be earlier than the hosts, you're guaranteed to miss every frame > > > > >> (except > > > > >> when the timing wobbliness in the guest is big enough by chance to > > > > >> make > > > > >> the deadline on the oddball frame). > > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps > > > > > as we don't > > > > > share these between the Guest and the Host. It does not seem to be > > > > > causing any > other > > > > > problems so far but we did try the experiment you mentioned (i.e., > > > > > adjusting the > > > delays) > > > > > and it works. However, this patch series is meant to fix the issue > > > > > without having to > > > tweak > > > > > anything (delays) because we can't do this for every compositor out > > > > > there. > > > > Maybe there could be a mechanism which allows the compositor in the > > > > guest to > > > automatically adjust its repaint cycle as needed. > > > > > > > > This might even be possible without requiring changes in each > > > > compositor, by > adjusting > > > the vertical blank periods in the guest to be aligned with the host > > > compositor repaint > > > cycles. Not sure about that though. > > > > > > > > Even if not, both this series or making it possible to queue multiple > > > > flips require > > > corresponding changes in each compositor as well to have any effect. > > > > > > Yeah from all the discussions and tests done it sounds even with a > > > deeper queue we have big coordination issues between the guest and > > > host compositor (like the example that the guest is now rendering at > > > 90fps instead of 60fps like the host). > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That > > 90 FPS vs > > 60 FPS problem is a completely different issue that is associated with Qemu > > GTK UI > > backend. 
With the GTK backend -- and also with SDL backend -- we Blit the > > Guest > > scanout FB onto one of the backbuffers managed by EGL. > > > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate > > that Blit > > and thereby have a truly zero-copy solution. And, this is where I am > > running into the > > halved frame-rate issue -- the current problem. > > Yes, that's what I referenced. But I disagree that it's a different > problem. The underlying problem in both cases is that the guest and host > compositor free-wheel instead of rendering in sync. It's just that > depending upon how exactly the flip completion event on the guest side > plays out you either get guest rendering that's faster than the host-side > 60fps, or guest rendering that's much slower than the host-side 60fps. [Kasireddy, Vivek] That used to be the case before we added a synchronization mechanism to the GTK UI backend that uses a sync file. After adding this and making the Guest wait until this sync file fd on the Host is signaled, we consistently get 60 FPS because the flip completion event for the Guest is directly tied to the signaling of the sync file in this particular case (GTK
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Thu, Aug 05, 2021 at 04:15:27AM +, Kasireddy, Vivek wrote: > Hi Daniel, > > > > >>> The solution: > > > >>> - To ensure full framerate, the Guest compositor has to start it's > > > >>> repaint cycle > > (including > > > >>> the 9 ms wait) when the Host compositor sends the frame callback > > > >>> event to its > > clients. > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits > > > >>> on -- before > > sending > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. > > > >>> This means that, > > the > > > >>> Guest compositor has to be forced to use a new buffer for its next > > > >>> repaint cycle > > when it > > > >>> gets a pageflip completion. > > > >> > > > >> Is that really the only solution? > > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > > But I think none of them are as compelling as this one. > > > > > > > >> > > > >> If we fix the event timestamps so that both guest and host use the same > > > >> timestamp, but then the guest starts 5ms (or something like that) > > > >> earlier, > > > >> then things should work too? I.e. > > > >> - host compositor starts at (previous_frametime + 9ms) > > > >> - guest compositor starts at (previous_frametime + 4ms) > > > >> > > > >> Ofc this only works if the frametimes we hand out to both match > > > >> _exactly_ > > > >> and are as high-precision as the ones on the host side. Which for many > > > >> gpu > > > >> drivers at least is the case, and all the ones you care about for sure > > > >> :-) > > > >> > > > >> But if the frametimes the guest receives are the no_vblank fake ones, > > > >> then > > > >> they'll be all over the place and this carefully tuned low-latency > > > >> redraw > > > >> loop falls apart. 
Aside from the fact that without tuning the guests > > > >> to > > > >> be earlier than the hosts, you're guaranteed to miss every frame > > > >> (except > > > >> when the timing wobbliness in the guest is big enough by chance to make > > > >> the deadline on the oddball frame). > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as > > > > we don't > > > > share these between the Guest and the Host. It does not seem to be > > > > causing any other > > > > problems so far but we did try the experiment you mentioned (i.e., > > > > adjusting the > > delays) > > > > and it works. However, this patch series is meant to fix the issue > > > > without having to > > tweak > > > > anything (delays) because we can't do this for every compositor out > > > > there. > > > > > > Maybe there could be a mechanism which allows the compositor in the guest > > > to > > automatically adjust its repaint cycle as needed. > > > > > > This might even be possible without requiring changes in each compositor, > > > by adjusting > > the vertical blank periods in the guest to be aligned with the host > > compositor repaint > > cycles. Not sure about that though. > > > > > > Even if not, both this series and making it possible to queue multiple > > > flips require > > corresponding changes in each compositor as well to have any effect. > > > > Yeah from all the discussions and tests done it sounds like even with a > > deeper queue we have big coordination issues between the guest and > > host compositor (like the example that the guest is now rendering at > > 90fps instead of 60fps like the host). > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 > FPS vs > 60 FPS problem is a completely different issue that is associated with Qemu > GTK UI > backend. With the GTK backend -- and also with SDL backend -- we Blit the > Guest > scanout FB onto one of the backbuffers managed by EGL. 
> > I am trying to add a new Qemu Wayland UI backend so that we can eliminate > that Blit > and thereby have a truly zero-copy solution. And, this is where I am running > into the > halved frame-rate issue -- the current problem. Yes, that's what I referenced. But I disagree that it's a different problem. The underlying problem in both cases is that the guest and host compositor free-wheel instead of rendering in sync. It's just that depending upon how exactly the flip completion event on the guest side plays out you either get guest rendering that's faster than the host-side 60fps, or guest rendering that's much slower than the host-side 60fps. The fundamental problem in both cases is that they don't run in lockstep. If you fix that, through fixing the timestamp and event reporting most likely, you should be able to fix both bugs. > > Hence my gut feeling reaction that first we need to get these two > > compositors aligned in their timings, which probably needs > > consistent vblank periods/timestamps across them (plus/minus > > guest/host clocksource fun ofc). Without this any of the next steps > > will simply not work because there's too much jitter by the time the
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > > >>> The solution: > > >>> - To ensure full framerate, the Guest compositor has to start its > > >>> repaint cycle > (including > > >>> the 9 ms wait) when the Host compositor sends the frame callback event > > >>> to its > clients. > > >>> In order for this to happen, the dma-fence that the Guest KMS waits on > > >>> -- before > sending > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. > > >>> This means that, > the > > >>> Guest compositor has to be forced to use a new buffer for its next > > >>> repaint cycle > when it > > >>> gets a pageflip completion. > > >> > > >> Is that really the only solution? > > > [Kasireddy, Vivek] There are a few others I mentioned here: > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > > But I think none of them are as compelling as this one. > > > > > >> > > >> If we fix the event timestamps so that both guest and host use the same > > >> timestamp, but then the guest starts 5ms (or something like that) > > >> earlier, > > >> then things should work too? I.e. > > >> - host compositor starts at (previous_frametime + 9ms) > > >> - guest compositor starts at (previous_frametime + 4ms) > > >> > > >> Ofc this only works if the frametimes we hand out to both match _exactly_ > > >> and are as high-precision as the ones on the host side. Which for many > > >> gpu > > >> drivers at least is the case, and all the ones you care about for sure > > >> :-) > > >> > > >> But if the frametimes the guest receives are the no_vblank fake ones, > > >> then > > >> they'll be all over the place and this carefully tuned low-latency redraw > > >> loop falls apart. Aside from the fact that without tuning the guests to > > >> be earlier than the hosts, you're guaranteed to miss every frame (except > > >> when the timing wobbliness in the guest is big enough by chance to make > > >> the deadline on the oddball frame). 
> > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as > > > we don't > > > share these between the Guest and the Host. It does not seem to be > > > causing any other > > > problems so far but we did try the experiment you mentioned (i.e., > > > adjusting the > delays) > > > and it works. However, this patch series is meant to fix the issue > > > without having to > tweak > > > anything (delays) because we can't do this for every compositor out there. > > > > Maybe there could be a mechanism which allows the compositor in the guest to > automatically adjust its repaint cycle as needed. > > > > This might even be possible without requiring changes in each compositor, > > by adjusting > the vertical blank periods in the guest to be aligned with the host > compositor repaint > cycles. Not sure about that though. > > > > Even if not, both this series and making it possible to queue multiple flips > > require > corresponding changes in each compositor as well to have any effect. > > Yeah from all the discussions and tests done it sounds like even with a > deeper queue we have big coordination issues between the guest and > host compositor (like the example that the guest is now rendering at > 90fps instead of 60fps like the host). [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 FPS vs 60 FPS problem is a completely different issue that is associated with Qemu GTK UI backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest scanout FB onto one of the backbuffers managed by EGL. I am trying to add a new Qemu Wayland UI backend so that we can eliminate that Blit and thereby have a truly zero-copy solution. And, this is where I am running into the halved frame-rate issue -- the current problem. 
> > Hence my gut feeling reaction that first we need to get these two > compositors aligned in their timings, which probably needs > consistent vblank periods/timestamps across them (plus/minus > guest/host clocksource fun ofc). Without this any of the next steps > will simply not work because there's too much jitter by the time the > guest compositor gets the flip completion events. [Kasireddy, Vivek] Timings are not a problem and do not significantly affect the repaint cycles from what I have seen so far. > > Once we have solid events I think we should look into statically > tuning guest/host compositor deadlines (like you've suggested in a > bunch of places) to consistently make that deadline and hit 60 fps. > With that we can then look into tuning this automatically and what to > do when e.g. switching between copying and zero-copy on the host side > (which might be needed in some cases) and how to handle all that. [Kasireddy, Vivek] As I confirm here: https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065 tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However, I feel that this zero-copy solution I am trying to create should be independent of compositors' deadlines, delays or other scheduling
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Tue, Aug 3, 2021 at 9:34 AM Michel Dänzer wrote: > On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote: > >>> The goal: > >>> - Maintain full framerate even when the Guest scanout FB is flipped onto > >>> a hardware > >> plane > >>> on the Host -- regardless of either compositor's scheduling policy -- > >>> without making any > >>> copies and ensuring that both Host and Guest are not accessing the buffer > >>> at the same > >> time. > >>> > >>> The problem: > >>> - If the Host compositor flips the client's buffer (in this case Guest > >>> compositor's buffer) > >>> onto a hardware plane, then it can send a wl_buffer.release event for the > >>> previous buffer > >>> only after it gets a pageflip completion. And, if the Guest compositor > >>> takes 10-12 ms to > >>> submit a new buffer and given the fact that the Host compositor waits > >>> only for 9 ms, the > >>> Guest compositor will miss the Host's repaint cycle resulting in halved > >>> frame-rate. > >>> > >>> The solution: > >>> - To ensure full framerate, the Guest compositor has to start it's > >>> repaint cycle (including > >>> the 9 ms wait) when the Host compositor sends the frame callback event to > >>> its clients. > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- > >>> before sending > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This > >>> means that, the > >>> Guest compositor has to be forced to use a new buffer for its next > >>> repaint cycle when it > >>> gets a pageflip completion. > >> > >> Is that really the only solution? > > [Kasireddy, Vivek] There are a few others I mentioned here: > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > But I think none of them are as compelling as this one. > > > >> > >> If we fix the event timestamps so that both guest and host use the same > >> timestamp, but then the guest starts 5ms (or something like that) earlier, > >> then things should work too? I.e. 
> >> - host compositor starts at (previous_frametime + 9ms) > >> - guest compositor starts at (previous_frametime + 4ms) > >> > >> Ofc this only works if the frametimes we hand out to both match _exactly_ > >> and are as high-precision as the ones on the host side. Which for many gpu > >> drivers at least is the case, and all the ones you care about for sure :-) > >> > >> But if the frametimes the guest receives are the no_vblank fake ones, then > >> they'll be all over the place and this carefully tuned low-latency redraw > >> loop falls apart. Aside from the fact that without tuning the guests to > >> be earlier than the hosts, you're guaranteed to miss every frame (except > >> when the timing wobbliness in the guest is big enough by chance to make > >> the deadline on the oddball frame). > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we > > don't > > share these between the Guest and the Host. It does not seem to be causing > > any other > > problems so far but we did try the experiment you mentioned (i.e., > > adjusting the delays) > > and it works. However, this patch series is meant to fix the issue without > > having to tweak > > anything (delays) because we can't do this for every compositor out there. > > Maybe there could be a mechanism which allows the compositor in the guest to > automatically adjust its repaint cycle as needed. > > This might even be possible without requiring changes in each compositor, by > adjusting the vertical blank periods in the guest to be aligned with the host > compositor repaint cycles. Not sure about that though. > > Even if not, both this series and making it possible to queue multiple flips > require corresponding changes in each compositor as well to have any effect. 
Yeah from all the discussions and tests done it sounds like even with a deeper queue we have big coordination issues between the guest and host compositor (like the example that the guest is now rendering at 90fps instead of 60fps like the host). Hence my gut feeling reaction that first we need to get these two compositors aligned in their timings, which probably needs consistent vblank periods/timestamps across them (plus/minus guest/host clocksource fun ofc). Without this any of the next steps will simply not work because there's too much jitter by the time the guest compositor gets the flip completion events. Once we have solid events I think we should look into statically tuning guest/host compositor deadlines (like you've suggested in a bunch of places) to consistently make that deadline and hit 60 fps. With that we can then look into tuning this automatically and what to do when e.g. switching between copying and zero-copy on the host side (which might be needed in some cases) and how to handle all that. Only when that all shows that we just can't hit 60fps consistently and really need 3 buffers in flight should we look at deeper kms queues. And then we really need to implement them properly and not
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Gerd, > > > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for > > > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I > > > think for the page-flip case the host (aka qemu) doesn't get the > > > "wait until old framebuffer is not in use any more" right yet. > > [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch > > series: > > https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html > > we do create a sync file fd -- after the Blit -- and wait (adding it to > > Qemu's main > > event loop) for it to ensure that the Guest scanout FB is no longer in use on > > the Host. > > This mechanism works in a similar way for both frontbuffer DIRTYFB case > > and > > also the double-buffer case. > > Well, we don't explicitly wait on the old framebuffer. Not fully sure > this is actually needed, maybe the command ordering (SET_SCANOUT goes > first) is enough. [Kasireddy, Vivek] When the sync file fd is signaled, the new FB can be considered done/free on the Host; and, when this new FB becomes the old FB -- after another FB is submitted by the Guest -- we don't need to explicitly wait as we already did that in the previous cycle. Strictly speaking, in the double-buffered Guest case, we should be waiting for the sync file fd of the old FB and not the new one. However, if we do this, we saw that the Guest will render faster (~90 FPS) than what the Host can consume (~60 FPS) resulting in unnecessary GPU cycles. And, in addition, we can't be certain about whether a Guest is using double-buffering or single as we noticed that Windows Guests tend to switch between single and double-buffering at runtime based on the damage, etc. > > > > So we'll need a host-side fix for that and a guest-side fix to switch > > > from a blocking wait on the fence to vblank events. > > [Kasireddy, Vivek] Do you see any concerns with the blocking wait? > > Well, it's sync vs. async for userspace. 
> > With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic > version of it) will return when the host is done. > > Without the blocking wait the userspace ioctl will return right away and > userspace can do something else until the host is done (and the vblank > event is sent to notify userspace). [Kasireddy, Vivek] Right, but upstream Weston -- and I am guessing Mutter as well -- almost always choose DRM_MODE_ATOMIC_NONBLOCK. In this case, the atomic ioctl call would not block and the blocking wait will instead happen in the commit_work/commit_tail workqueue thread. > > > And, are you > > suggesting that we use a vblank timer? > > I think we should send the vblank event when the RESOURCE_FLUSH fence > signals the host is done. [Kasireddy, Vivek] That is how it works now: drm_atomic_helper_commit_planes(dev, old_state, 0); drm_atomic_helper_commit_modeset_enables(dev, old_state); drm_atomic_helper_fake_vblank(old_state); The blocking wait is in the plane_update hook called by drm_atomic_helper_commit_planes() and immediately after that the fake vblank is sent. Thanks, Vivek > > take care, > Gerd
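[Editorial sketch] The NONBLOCK flow described above can be illustrated with a toy model: the atomic ioctl returns immediately, while the blocking wait on the host-side fence happens in a worker (standing in for commit_work/commit_tail), which then delivers the fake vblank event. This is pure illustration in Python -- the names `host_fence`, `commit_work`, and `atomic_ioctl_nonblock` are stand-ins, not the kernel's symbols.

```python
# Toy model of DRM_MODE_ATOMIC_NONBLOCK as described in the thread:
# the ioctl returns right away; the wait on the host fence and the
# subsequent (fake) vblank event delivery happen in a worker.
import threading

host_fence = threading.Event()        # stands in for the virtio-gpu fence
vblank_delivered = threading.Event()  # stands in for the event to userspace

def commit_work():
    host_fence.wait()                 # blocking wait (plane_update hook)
    vblank_delivered.set()            # drm_atomic_helper_fake_vblank() analogue

def atomic_ioctl_nonblock():
    threading.Thread(target=commit_work).start()
    return "ioctl returned"           # userspace regains control at once

print(atomic_ioctl_nonblock())
assert not vblank_delivered.is_set()  # host not done yet, no event yet
host_fence.set()                      # host signals RESOURCE_FLUSH done
vblank_delivered.wait(timeout=1)
assert vblank_delivered.is_set()      # only now does userspace get vblank
```

The point of the model is only the ordering: the compositor's commit never blocks in the ioctl, but its next repaint is still gated on the host's fence via the deferred vblank event.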
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Michel, > > > >>> The goal: > >>> - Maintain full framerate even when the Guest scanout FB is flipped onto > >>> a hardware > >> plane > >>> on the Host -- regardless of either compositor's scheduling policy -- > >>> without making > any > >>> copies and ensuring that both Host and Guest are not accessing the buffer > >>> at the same > >> time. > >>> > >>> The problem: > >>> - If the Host compositor flips the client's buffer (in this case Guest > >>> compositor's > buffer) > >>> onto a hardware plane, then it can send a wl_buffer.release event for the > >>> previous > buffer > >>> only after it gets a pageflip completion. And, if the Guest compositor > >>> takes 10-12 ms > to > >>> submit a new buffer and given the fact that the Host compositor waits > >>> only for 9 ms, > the > >>> Guest compositor will miss the Host's repaint cycle resulting in halved > >>> frame-rate. > >>> > >>> The solution: > >>> - To ensure full framerate, the Guest compositor has to start it's > >>> repaint cycle > (including > >>> the 9 ms wait) when the Host compositor sends the frame callback event to > >>> its clients. > >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- > >>> before > sending > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This > >>> means that, > the > >>> Guest compositor has to be forced to use a new buffer for its next > >>> repaint cycle when > it > >>> gets a pageflip completion. > >> > >> Is that really the only solution? > > [Kasireddy, Vivek] There are a few others I mentioned here: > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > > But I think none of them are as compelling as this one. > > > >> > >> If we fix the event timestamps so that both guest and host use the same > >> timestamp, but then the guest starts 5ms (or something like that) earlier, > >> then things should work too? I.e. 
> >> - host compositor starts at (previous_frametime + 9ms) > >> - guest compositor starts at (previous_frametime + 4ms) > >> > >> Ofc this only works if the frametimes we hand out to both match _exactly_ > >> and are as high-precision as the ones on the host side. Which for many gpu > >> drivers at least is the case, and all the ones you care about for sure :-) > >> > >> But if the frametimes the guest receives are the no_vblank fake ones, then > >> they'll be all over the place and this carefully tuned low-latency redraw > >> loop falls apart. Aside from the fact that without tuning the guests to > >> be earlier than the hosts, you're guaranteed to miss every frame (except > >> when the timing wobbliness in the guest is big enough by chance to make > >> the deadline on the oddball frame). > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we > > don't > > share these between the Guest and the Host. It does not seem to be causing > > any other > > problems so far but we did try the experiment you mentioned (i.e., > > adjusting the delays) > > and it works. However, this patch series is meant to fix the issue without > > having to tweak > > anything (delays) because we can't do this for every compositor out there. > > Maybe there could be a mechanism which allows the compositor in the guest to > automatically adjust its repaint cycle as needed. > > This might even be possible without requiring changes in each compositor, by > adjusting > the vertical blank periods in the guest to be aligned with the host > compositor repaint > cycles. Not sure about that though. [Kasireddy, Vivek] The problem really is that the Guest compositor -- or any other compositor for that matter -- assumes that after a pageflip completion, the old buffer submitted in the previous flip is free and can be reused again. I think this is a guarantee given by KMS. 
If we have to enforce this, we (Guest KMS) have to wait until the Host compositor sends a wl_buffer.release -- which can only happen after Host gets a pageflip completion assuming it uses hardware planes. From this point onwards, the Guest compositor only has 9 ms (in the case of Weston) -- or less based on the Host compositor's scheduling policy -- to submit a new frame. Although we can adjust the repaint-window of the Guest compositor to ensure a submission within 9 ms or increase the delay on the Host, these tweaks are just heuristics. I think a generic solution that'll work in all cases means that the Guest compositor has to start its repaint cycle with a new buffer when the Host sends out the frame callback event. > > Even if not, both this series and making it possible to queue multiple flips > require > corresponding changes in each compositor as well to have any effect. [Kasireddy, Vivek] Yes, unfortunately; but the hope is that the Guest KMS can do most of the heavy lifting and keep the changes for the compositors generic enough and minimal. Thanks, Vivek > > > -- > Earthling Michel Dänzer |
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi, > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for > > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I > > think for the page-flip case the host (aka qemu) doesn't get the > > "wait until old framebuffer is not in use any more" right yet. > [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch > series: > https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html > we do create a sync file fd -- after the Blit -- and wait (adding it to > Qemu's main > event loop) for it to ensure that the Guest scanout FB is no longer in use on > the Host. > This mechanism works in a similar way for both frontbuffer DIRTYFB case and > also the double-buffer case. Well, we don't explicitly wait on the old framebuffer. Not fully sure this is actually needed, maybe the command ordering (SET_SCANOUT goes first) is enough. > > So we'll need a host-side fix for that and a guest-side fix to switch > > from a blocking wait on the fence to vblank events. > [Kasireddy, Vivek] Do you see any concerns with the blocking wait? Well, it's sync vs. async for userspace. With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic version of it) will return when the host is done. Without the blocking wait the userspace ioctl will return right away and userspace can do something else until the host is done (and the vblank event is sent to notify userspace). > And, are you > suggesting that we use a vblank timer? I think we should send the vblank event when the RESOURCE_FLUSH fence signals the host is done. take care, Gerd
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote: > >>> The goal: >>> - Maintain full framerate even when the Guest scanout FB is flipped onto a >>> hardware >> plane >>> on the Host -- regardless of either compositor's scheduling policy -- >>> without making any >>> copies and ensuring that both Host and Guest are not accessing the buffer >>> at the same >> time. >>> >>> The problem: >>> - If the Host compositor flips the client's buffer (in this case Guest >>> compositor's buffer) >>> onto a hardware plane, then it can send a wl_buffer.release event for the >>> previous buffer >>> only after it gets a pageflip completion. And, if the Guest compositor >>> takes 10-12 ms to >>> submit a new buffer and given the fact that the Host compositor waits only >>> for 9 ms, the >>> Guest compositor will miss the Host's repaint cycle resulting in halved >>> frame-rate. >>> >>> The solution: >>> - To ensure full framerate, the Guest compositor has to start it's repaint >>> cycle (including >>> the 9 ms wait) when the Host compositor sends the frame callback event to >>> its clients. >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- >>> before sending >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This >>> means that, the >>> Guest compositor has to be forced to use a new buffer for its next repaint >>> cycle when it >>> gets a pageflip completion. >> >> Is that really the only solution? > [Kasireddy, Vivek] There are a few others I mentioned here: > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 > But I think none of them are as compelling as this one. > >> >> If we fix the event timestamps so that both guest and host use the same >> timestamp, but then the guest starts 5ms (or something like that) earlier, >> then things should work too? I.e. 
>> - host compositor starts at (previous_frametime + 9ms) >> - guest compositor starts at (previous_frametime + 4ms) >> >> Ofc this only works if the frametimes we hand out to both match _exactly_ >> and are as high-precision as the ones on the host side. Which for many gpu >> drivers at least is the case, and all the ones you care about for sure :-) >> >> But if the frametimes the guest receives are the no_vblank fake ones, then >> they'll be all over the place and this carefully tuned low-latency redraw >> loop falls apart. Aside from the fact that without tuning the guests to >> be earlier than the hosts, you're guaranteed to miss every frame (except >> when the timing wobbliness in the guest is big enough by chance to make >> the deadline on the oddball frame). > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we > don't > share these between the Guest and the Host. It does not seem to be causing > any other > problems so far but we did try the experiment you mentioned (i.e., adjusting > the delays) > and it works. However, this patch series is meant to fix the issue without > having to tweak > anything (delays) because we can't do this for every compositor out there. Maybe there could be a mechanism which allows the compositor in the guest to automatically adjust its repaint cycle as needed. This might even be possible without requiring changes in each compositor, by adjusting the vertical blank periods in the guest to be aligned with the host compositor repaint cycles. Not sure about that though. Even if not, both this series and making it possible to queue multiple flips require corresponding changes in each compositor as well to have any effect. -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer
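[Editorial sketch] The offset arithmetic quoted above (host commits at previous_frametime + 9 ms, guest starts at + 4 ms) can be sanity-checked with a toy model. The 9 ms and 4 ms figures come from this thread; `guest_render_ms` is an illustrative number, not a measurement.

```python
# Toy model: at 60 Hz the host commits HOST_DEADLINE_MS after the shared
# frame time; the guest must submit its buffer before that commit to be
# picked up in the same host repaint cycle.

HOST_DEADLINE_MS = 9.0  # host compositor's repaint delay (from the thread)

def guest_makes_deadline(guest_start_ms, guest_render_ms):
    """True if a guest starting guest_start_ms after the shared frame time
    and needing guest_render_ms to produce a buffer beats the host commit."""
    return guest_start_ms + guest_render_ms <= HOST_DEADLINE_MS

# Guest using the same 9 ms offset as the host submits after the host has
# already committed, so it misses every frame (modulo jitter):
assert not guest_makes_deadline(9.0, 3.0)
# Starting 5 ms earlier, i.e. (previous_frametime + 4ms), makes the deadline:
assert guest_makes_deadline(4.0, 3.0)
```

As noted in the quoted text, this only holds if both sides hand out the same high-precision frame times; with the no_vblank fake timestamps the guest's start point wobbles and the margin disappears.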
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Gerd, > > Hi, > > > > That sounds sensible to me. Fence the virtio commands, make sure (on > > > the host side) the command completes only when the work is actually done > > > not only submitted. Has recently been added to qemu for RESOURCE_FLUSH > > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka > > > pageflipping), then send vblank events to userspace on command > > > completion certainly makes sense. > > > > Hm how does this all work? At least drm/virtio uses > > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end > > up in the same driver path for everything. Or do you just combine the > > resource_flush with the flip as needed and let the host side figure it all > > out? From a quick read of virtgpu_plane.c that seems to be the case ... > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I > think for the page-flip case the host (aka qemu) doesn't get the > "wait until old framebuffer is not in use any more" right yet. [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series: https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's main event loop) for it to ensure that the Guest scanout FB is no longer in use on the Host. This mechanism works in a similar way for both frontbuffer DIRTYFB case and also the double-buffer case. The out_fence work is only relevant for the future Wayland UI backend though. > > So we'll need a host-side fix for that and a guest-side fix to switch > from a blocking wait on the fence to vblank events. [Kasireddy, Vivek] Do you see any concerns with the blocking wait? And, are you suggesting that we use a vblank timer? Not sure if that would be needed because it would not align with the render/draw signals used with GTK. 
And, the DRM core does send out an event -- immediately after the blocking wait -- to Guest compositor as no_vblank=true. > > > Also to make this work we don't just need the fence, we need the timestamp > > (in a clock domain the guest can correct for ofc) of the host side kms > > driver flip completion. If you just have the fence then the jitter from > > going through all the layers will most likely make it unusable. > > Well, there are no timestamps in the virtio-gpu protocol ... > > Also I'm not sure they would be that helpful, any timing is *much* less > predictable in a virtual machine, especially in case the host machine is > loaded. [Kasireddy, Vivek] I agree; I think sharing the Host timestamps with the Guest or vice-versa may not be useful. We have not run into any problems without these so far. Thanks, Vivek > > take care, > Gerd
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > > > > Separating the OUT_FENCE signalling from pageflip completion allows > > > > a Guest compositor to start a new repaint cycle with a new buffer > > > > instead of waiting for the old buffer to be free. > > > > > > > > This work is based on the idea/suggestion from Simon and Pekka. > > > > > > > > This capability can be a solution for this issue: > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > > > > > > > Corresponding Weston MR: > > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > > > > > Uh I kinda wanted to discuss this a bit more before we jump into typing > > > code, but well I guess not that much work yet. > > [Kasireddy, Vivek] Right, it wasn't a lot of work :) > > > > > > > > So maybe I'm not understanding the problem, but I think the fundamental > > > underlying issue is that with KMS you can have at most 2 buffers > > > in-flight, due to our queue depth limit of 1 pending flip. > > [Kasireddy, Vivek] Let me summarize the problem again from the perspective > > of > > both the Host (Weston) and Guest (Weston) compositors assuming a > > refresh-rate > > of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms. > > Host compositor: > > - After a pageflip completion event, it starts its next repaint cycle by > > waiting for 9 ms > > and then submits the atomic commit and at the tail end of its cycle sends a > > frame > callback > > event to all its clients (who registered and submitted frames) indicating > > to them to > > start their next redraw -- giving them at least ~16 ms to submit a new > > frame to be > > included in its next repaint.
Why a configurable 9 ms delay is needed is > > explained > > in Pekka's blog post here: > > https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html > > > > - It'll send a wl_buffer.release event for a client submitted previous > > buffer only > > when the client has submitted a new buffer and: > > a) When it hasn't started its repaint cycle yet OR > > b) When it clears its old state after it gets a pageflip completion event > > -- if it had > > flipped the client's buffer onto a hardware plane. > > > > Guest compositor: > > - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms > > for the > > Guest compositor to submit a new atomic commit. This time of 10-12 ms > > includes the > > 9 ms wait -- just like the Host compositor -- for its clients to submit new > > buffers. > > - When it gets a pageflip completion, it assumes that the previously > > submitted buffer > > is free for re-use and uses it again -- resulting in the usage of only 2 > > out of a maximum > > of 4 backbuffers included as part of the Mesa GBM surface implementation. > > > > Guest KMS/Virtio-gpu/Qemu Wayland UI: > > - Because no_vblank=true for Guest KMS and since the vblank event (which > > also serves > > as the pageflip completion event for user-space) is sent right away after > > atomic commit, > > as Gerd said, we use an internal dma-fence to block/wait the Guest KMS > > until we know > for > > sure that the Host is completely done using the buffer. To ensure this, we > > signal the > dma-fence > > only after the Host compositor sends a wl_buffer.release event or an > > equivalent signal. > > > > The goal: > > - Maintain full framerate even when the Guest scanout FB is flipped onto a > > hardware > plane > > on the Host -- regardless of either compositor's scheduling policy -- > > without making any > > copies and ensuring that both Host and Guest are not accessing the buffer > > at the same > time. 
> > > > The problem: > > - If the Host compositor flips the client's buffer (in this case Guest > > compositor's buffer) > > onto a hardware plane, then it can send a wl_buffer.release event for the > > previous buffer > > only after it gets a pageflip completion. And, if the Guest compositor > > takes 10-12 ms to > > submit a new buffer and given the fact that the Host compositor waits only > > for 9 ms, the > > Guest compositor will miss the Host's repaint cycle resulting in halved > > frame-rate. > > > > The solution: > > - To ensure full framerate, the Guest compositor has to start its repaint > > cycle (including > > the 9 ms wait) when the Host compositor sends the frame callback event to > > its clients. > > In order for this to happen, the dma-fence that the Guest KMS waits on -- > > before sending > > pageflip completion -- cannot be tied to a wl_buffer.release event. This > > means that the > > Guest compositor has to be forced to use a new buffer for its next repaint > > cycle when it > > gets a pageflip completion. > > Is that really the only solution? [Kasireddy, Vivek] There are a few others I mentioned here: https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572 But I think none of them are as compelling as this one. > > If we fix the event timestamps so that both guest and host use the same > timestamp, but then
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Mon, Aug 2, 2021 at 2:51 PM Gerd Hoffmann wrote: > > Hi, > > > > That sounds sensible to me. Fence the virtio commands, make sure (on > > > the host side) the command completes only when the work is actually done > > > not only submitted. Has recently been added to qemu for RESOURCE_FLUSH > > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka > > > pageflipping), then send vblank events to userspace on command > > > completion certainly makes sense. > > > > Hm how does this all work? At least drm/virtio uses > > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end > > up in the same driver path for everything. Or do you just combine the > > resource_flush with the flip as needed and let the host side figure it all > > out? From a quick read of virtgpu_plane.c that seems to be the case ... > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I > think for the page-flip case the host (aka qemu) doesn't get the > "wait until old framebuffer is not in use any more" right yet. Hm reading the code I think you simply elide the set_scanout if it's still the same buffer. There's no difference between dirtyfb and an atomic commit that just hands the damage rects to the driver. At least if you use the helpers. > So we'll need a host-side fix for that and a guest-side fix to switch > from a blocking wait on the fence to vblank events. > > > Also to make this work we don't just need the fence, we need the timestamp > > (in a clock domain the guest can correct for ofc) of the host side kms > > driver flip completion. If you just have the fence then the jitter from > > going through all the layers will most likely make it unusable. > > Well, there are no timestamps in the virtio-gpu protocol ...
> > Also I'm not sure they would be that helpful, any timing is *much* less > predictable in a virtual machine, especially in case the host machine is > loaded. Hm yeah if the output is currently not displaying, then the timestamp is very fake. But if you display you should be able to pass it all around in both directions. So target vblank (or whatever it's called) would go from guest to host to host-compositor (over wayland protocol) to host-side kms, and the timestamp could travel all the way back. But yeah making this all work correctly is going to be a pile of work. Also I have no idea how well compositors take it when a kms driver switches between high-precision timestamps and frame scheduling to the entirely virtual/vblank-less approach on the fly. -Daniel > take care, > Gerd > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi, > > That sounds sensible to me. Fence the virtio commands, make sure (on > > the host side) the command completes only when the work is actually done > > not only submitted. Has recently been added to qemu for RESOURCE_FLUSH > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka > > pageflipping), then send vblank events to userspace on command > > completion certainly makes sense. > > Hm how does this all work? At least drm/virtio uses > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end > up in the same driver path for everything. Or do you just combine the > resource_flush with the flip as needed and let the host side figure it all > out? From a quick read of virtgpu_plane.c that seems to be the case ... virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I think for the page-flip case the host (aka qemu) doesn't get the "wait until old framebuffer is not in use any more" right yet. So we'll need a host-side fix for that and a guest-side fix to switch from a blocking wait on the fence to vblank events. > Also to make this work we don't just need the fence, we need the timestamp > (in a clock domain the guest can correct for ofc) of the host side kms > driver flip completion. If you just have the fence then the jitter from > going through all the layers will most likely make it unusable. Well, there are no timestamps in the virtio-gpu protocol ... Also I'm not sure they would be that helpful, any timing is *much* less predictable in a virtual machine, especially in case the host machine is loaded. take care, Gerd
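As a reading aid, the dispatch Gerd describes in virtio_gpu_primary_plane_update() boils down to the following decision. This is a Python paraphrase of the guest driver's behaviour as described in the thread, not the actual C code.

```python
def commands_for_update(fb_changed: bool) -> list:
    """Which virtio-gpu commands a primary-plane update emits, per the
    description above: a page-flip (new FB attached) needs SET_SCANOUT
    followed by RESOURCE_FLUSH, while a DIRTYFB on the same FB needs
    only RESOURCE_FLUSH (the SET_SCANOUT is elided)."""
    if fb_changed:
        return ["SET_SCANOUT", "RESOURCE_FLUSH"]   # page-flip path
    return ["RESOURCE_FLUSH"]                      # dirtyfb path

print(commands_for_update(True))   # page-flip
print(commands_for_update(False))  # frontbuffer dirtyfb
```

The point of the thread is that for the page-flip path the host additionally needs to delay command completion until the old framebuffer is really idle, which this sketch does not model.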
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On 2021-08-02 11:06 a.m., Daniel Vetter wrote: > On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote: >> On 2021-08-02 9:59 a.m., Daniel Vetter wrote: >>> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote: On 2021-07-30 12:25 p.m., Daniel Vetter wrote: > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: >> Separating the OUT_FENCE signalling from pageflip completion allows >> a Guest compositor to start a new repaint cycle with a new buffer >> instead of waiting for the old buffer to be free. >> >> This work is based on the idea/suggestion from Simon and Pekka. >> >> This capability can be a solution for this issue: >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514 >> >> Corresponding Weston MR: >> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > Uh I kinda wanted to discuss this a bit more before we jump into typing > code, but well I guess not that much work yet. > > So maybe I'm not understanding the problem, but I think the fundamental > underlying issue is that with KMS you can have at most 2 buffers > in-flight, due to our queue depth limit of 1 pending flip. > > Unfortunately that means for virtual hw where it takes a few more > steps/vblanks until the framebuffer actually shows up on screen and is > scanned out, we suffer deeply. The usual fix for that is to drop the > latency and increase throughput, and have more buffers in-flight. Which > this patch tries to do. Per https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , IMO the underlying issue is actually that the guest compositor repaint cycle is not aligned with the host compositor one. If they were aligned, the problem would not occur even without allowing multiple page flips in flight, and latency would be lower. >>> >>> Yeah my proposal here is under the premise that we do actually need to fix >>> this with a deeper queue depth.
>>> > Now I think where we go wrong here is that we're trying to hack this up by > defining different semantics for the out-fence and for the drm-event. Imo > that's wrong, they're both meant to show exactly the same thing: > - when is the new frame actually visible to the user (as in, eyeballs in a > human head, preferably, not the time when we've handed the buffer off > to the virtual hw) > - when is the previous buffer no longer being used by the scanout hw > > We do cheat a bit right now in so far that we assume they're both the > same, as in, panel-side latency is currently the compositor's problem to > figure out. > > So for virtual hw I think the timestamp and event completion really need to > happen only when the buffer has been pushed through the entire > virtualization chain, i.e. ideally we get the timestamp from the kms > driver from the host side. Currently that's not done, so this is most > likely quite broken already (virtio relies on the no-vblank auto event > sending, which definitely doesn't wait for anything, or I'm completely > missing something). > > I think instead of hacking up some ill-defined 1.5 queue depth support, > what we should do is support queue depth > 1 properly. So: > > - Change atomic to support queue depth > 1, this needs to be a per-driver > thing due to a bunch of issues in driver code. Essentially drivers must > never look at obj->state pointers, and only ever look up state through > the passed-in drm_atomic_state * update container. > > - Aside: virtio should lose all its empty hooks, there's no point in > that. > > - We fix virtio to send out the completion event at the end of this entire > pipeline, i.e. virtio code needs to take care of sending out the > crtc_state->event correctly. > > - We probably also want some kind of (maybe per-crtc) recommended queue > depth property so compositors know how many buffers to keep in flight. > Not sure about that.
I'd say there would definitely need to be some kind of signal for the display server that it should queue multiple flips, since this is normally not desirable for latency. In other words, this wouldn't really be useful on bare metal (in contrast to the ability to replace a pending flip with a newer one). >>> >>> Hm I was thinking that the compositor can tune this. If the round-trip >>> latency (as measured by events) is too long to get full refresh rate, it >>> can add more buffers to the queue. That's kinda why I think the returned >>> event really must be accurate wrt actual display time (and old buffer >>> release time), so that this computation in the compositor becomes a pretty >>> simple >>> >>> num_buffers = (flip_time - submit_time) / frame_time >>> >>> With maybe some rounding up and
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Fri, Jul 30, 2021 at 03:38:50PM +0200, Gerd Hoffmann wrote: > Hi, > > > - We fix virtio to send out the completion event at the end of this entire > > pipeline, i.e. virtio code needs to take care of sending out the > > crtc_state->event correctly. > > That sounds sensible to me. Fence the virtio commands, make sure (on > the host side) the command completes only when the work is actually done > not only submitted. Has recently been added to qemu for RESOURCE_FLUSH > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka > pageflipping), then send vblank events to userspace on command > completion certainly makes sense. Hm how does this all work? At least drm/virtio uses drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end up in the same driver path for everything. Or do you just combine the resource_flush with the flip as needed and let the host side figure it all out? From a quick read of virtgpu_plane.c that seems to be the case ... Also to make this work we don't just need the fence, we need the timestamp (in a clock domain the guest can correct for ofc) of the host side kms driver flip completion. If you just have the fence then the jitter from going through all the layers will most likely make it unusable. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote: > On 2021-08-02 9:59 a.m., Daniel Vetter wrote: > > On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote: > >> On 2021-07-30 12:25 p.m., Daniel Vetter wrote: > >>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > Separating the OUT_FENCE signalling from pageflip completion allows > a Guest compositor to start a new repaint cycle with a new buffer > instead of waiting for the old buffer to be free. > > This work is based on the idea/suggestion from Simon and Pekka. > > This capability can be a solution for this issue: > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > Corresponding Weston MR: > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > >>> > >>> Uh I kinda wanted to discuss this a bit more before we jump into typing > >>> code, but well I guess not that much work yet. > >>> > >>> So maybe I'm not understanding the problem, but I think the fundamental > >>> underlying issue is that with KMS you can have at most 2 buffers > >>> in-flight, due to our queue depth limit of 1 pending flip. > >>> > >>> Unfortunately that means for virtual hw where it takes a few more > >>> steps/vblanks until the framebuffer actually shows up on screen and is > >>> scanned out, we suffer deeply. The usual fix for that is to drop the > >>> latency and increase throughput, and have more buffers in-flight. Which > >>> this patch tries to do. > >> > >> Per > >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , > >> IMO the underlying issue is actually that the guest compositor repaint > >> cycle is not aligned with the host compositor one. If they were aligned, > >> the problem would not occur even without allowing multiple page flips in > >> flight, and latency would be lower. > > > > Yeah my proposal here is under the premise that we do actually need to fix > > this with a deeper queue depth.
> > > >>> Now I think where we go wrong here is that we're trying to hack this up by > >>> defining different semantics for the out-fence and for the drm-event. Imo > >>> that's wrong, they're both meant to show exactly the same thing: > >>> - when is the new frame actually visible to the user (as in, eyeballs in a > >>> human head, preferably, not the time when we've handed the buffer off > >>> to the virtual hw) > >>> - when is the previous buffer no longer being used by the scanout hw > >>> > >>> We do cheat a bit right now in so far that we assume they're both the > >>> same, as in, panel-side latency is currently the compositor's problem to > >>> figure out. > >>> > >>> So for virtual hw I think the timestamp and event completion really need to > >>> happen only when the buffer has been pushed through the entire > >>> virtualization chain, i.e. ideally we get the timestamp from the kms > >>> driver from the host side. Currently that's not done, so this is most > >>> likely quite broken already (virtio relies on the no-vblank auto event > >>> sending, which definitely doesn't wait for anything, or I'm completely > >>> missing something). > >>> > >>> I think instead of hacking up some ill-defined 1.5 queue depth support, > >>> what we should do is support queue depth > 1 properly. So: > >>> > >>> - Change atomic to support queue depth > 1, this needs to be a per-driver > >>> thing due to a bunch of issues in driver code. Essentially drivers must > >>> never look at obj->state pointers, and only ever look up state through > >>> the passed-in drm_atomic_state * update container. > >>> > >>> - Aside: virtio should lose all its empty hooks, there's no point in > >>> that. > >>> > >>> - We fix virtio to send out the completion event at the end of this entire > >>> pipeline, i.e. virtio code needs to take care of sending out the > >>> crtc_state->event correctly.
> >>> > >>> - We probably also want some kind of (maybe per-crtc) recommended queue > >>> depth property so compositors know how many buffers to keep in flight. > >>> Not sure about that. > >> > >> I'd say there would definitely need to be some kind of signal for the > >> display server that it should queue multiple flips, since this is > >> normally not desirable for latency. In other words, this wouldn't really > >> be useful on bare metal (in contrast to the ability to replace a pending > >> flip with a newer one). > > > > Hm I was thinking that the compositor can tune this. If the round-trip > > latency (as measured by events) is too long to get full refresh rate, it > > can add more buffers to the queue. That's kinda why I think the returned > > event really must be accurate wrt actual display time (and old buffer > > release time), so that this computation in the compositor becomes a pretty > > simple > > > > num_buffers = (flip_time - submit_time) / frame_time > > > > With maybe some rounding up and averaging. You can also hit this when your > >
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On 2021-08-02 9:59 a.m., Daniel Vetter wrote: > On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote: >> On 2021-07-30 12:25 p.m., Daniel Vetter wrote: >>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: Separating the OUT_FENCE signalling from pageflip completion allows a Guest compositor to start a new repaint cycle with a new buffer instead of waiting for the old buffer to be free. This work is based on the idea/suggestion from Simon and Pekka. This capability can be a solution for this issue: https://gitlab.freedesktop.org/wayland/weston/-/issues/514 Corresponding Weston MR: https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 >>> >>> Uh I kinda wanted to discuss this a bit more before we jump into typing >>> code, but well I guess not that much work yet. >>> >>> So maybe I'm not understanding the problem, but I think the fundamental >>> underlying issue is that with KMS you can have at most 2 buffers >>> in-flight, due to our queue depth limit of 1 pending flip. >>> >>> Unfortunately that means for virtual hw where it takes a few more >>> steps/vblanks until the framebuffer actually shows up on screen and is >>> scanned out, we suffer deeply. The usual fix for that is to drop the >>> latency and increase throughput, and have more buffers in-flight. Which >>> this patch tries to do. >> >> Per >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , >> IMO the underlying issue is actually that the guest compositor repaint >> cycle is not aligned with the host compositor one. If they were aligned, >> the problem would not occur even without allowing multiple page flips in >> flight, and latency would be lower. > > Yeah my proposal here is under the premise that we do actually need to fix > this with a deeper queue depth. > >>> Now I think where we go wrong here is that we're trying to hack this up by >>> defining different semantics for the out-fence and for the drm-event.
Imo >>> that's wrong, they're both meant to show exactly the same thing: >>> - when is the new frame actually visible to the user (as in, eyeballs in a >>> human head, preferably, not the time when we've handed the buffer off >>> to the virtual hw) >>> - when is the previous buffer no longer being used by the scanout hw >>> >>> We do cheat a bit right now in so far that we assume they're both the >>> same, as in, panel-side latency is currently the compositor's problem to >>> figure out. >>> >>> So for virtual hw I think the timestamp and event completion really need to >>> happen only when the buffer has been pushed through the entire >>> virtualization chain, i.e. ideally we get the timestamp from the kms >>> driver from the host side. Currently that's not done, so this is most >>> likely quite broken already (virtio relies on the no-vblank auto event >>> sending, which definitely doesn't wait for anything, or I'm completely >>> missing something). >>> >>> I think instead of hacking up some ill-defined 1.5 queue depth support, >>> what we should do is support queue depth > 1 properly. So: >>> >>> - Change atomic to support queue depth > 1, this needs to be a per-driver >>> thing due to a bunch of issues in driver code. Essentially drivers must >>> never look at obj->state pointers, and only ever look up state through >>> the passed-in drm_atomic_state * update container. >>> >>> - Aside: virtio should lose all its empty hooks, there's no point in >>> that. >>> >>> - We fix virtio to send out the completion event at the end of this entire >>> pipeline, i.e. virtio code needs to take care of sending out the >>> crtc_state->event correctly. >>> >>> - We probably also want some kind of (maybe per-crtc) recommended queue >>> depth property so compositors know how many buffers to keep in flight. >>> Not sure about that.
>> >> I'd say there would definitely need to be some kind of signal for the >> display server that it should queue multiple flips, since this is >> normally not desirable for latency. In other words, this wouldn't really >> be useful on bare metal (in contrast to the ability to replace a pending >> flip with a newer one). > > Hm I was thinking that the compositor can tune this. If the round-trip > latency (as measured by events) is too long to get full refresh rate, it > can add more buffers to the queue. That's kinda why I think the returned > event really must be accurate wrt actual display time (and old buffer > release time), so that this computation in the compositor becomes a pretty > simple > > num_buffers = (flip_time - submit_time) / frame_time > > With maybe some rounding up and averaging. You can also hit this when your > 3d engine has an extremely deep pipeline (like some of the tiling > renderers have), where rendering just takes forever, but as long as you keep > 2 frames in the renderer in-flight you can achieve full refresh rate (at a > latency cost). As long as a
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Mon, Aug 02, 2021 at 06:51:33AM +, Kasireddy, Vivek wrote: > Hi Daniel, > > > > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > > > Separating the OUT_FENCE signalling from pageflip completion allows > > > a Guest compositor to start a new repaint cycle with a new buffer > > > instead of waiting for the old buffer to be free. > > > > > > This work is based on the idea/suggestion from Simon and Pekka. > > > > > > This capability can be a solution for this issue: > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > > > > > Corresponding Weston MR: > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > > > Uh I kinda wanted to discuss this a bit more before we jump into typing > > code, but well I guess not that much work yet. > [Kasireddy, Vivek] Right, it wasn't a lot of work :) > > > > > So maybe I'm not understanding the problem, but I think the fundamental > > underlying issue is that with KMS you can have at most 2 buffers > > in-flight, due to our queue depth limit of 1 pending flip. > [Kasireddy, Vivek] Let me summarize the problem again from the perspective of > both the Host (Weston) and Guest (Weston) compositors assuming a refresh-rate > of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms. > Host compositor: > - After a pageflip completion event, it starts its next repaint cycle by > waiting for 9 ms > and then submits the atomic commit and at the tail end of its cycle sends a > frame callback > event to all its clients (who registered and submitted frames) indicating to > them to > start their next redraw -- giving them at least ~16 ms to submit a new frame > to be > included in its next repaint.
Why a configurable 9 ms delay is needed is > explained > in Pekka's blog post here: > https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html > > - It'll send a wl_buffer.release event for a client submitted previous buffer > only > when the client has submitted a new buffer and: > a) When it hasn't started its repaint cycle yet OR > b) When it clears its old state after it gets a pageflip completion event -- > if it had > flipped the client's buffer onto a hardware plane. > > Guest compositor: > - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms > for the > Guest compositor to submit a new atomic commit. This time of 10-12 ms > includes the > 9 ms wait -- just like the Host compositor -- for its clients to submit new > buffers. > - When it gets a pageflip completion, it assumes that the previously > submitted buffer > is free for re-use and uses it again -- resulting in the usage of only 2 out > of a maximum > of 4 backbuffers included as part of the Mesa GBM surface implementation. > > Guest KMS/Virtio-gpu/Qemu Wayland UI: > - Because no_vblank=true for Guest KMS and since the vblank event (which also > serves > as the pageflip completion event for user-space) is sent right away after > atomic commit, > as Gerd said, we use an internal dma-fence to block/wait the Guest KMS until > we know for > sure that the Host is completely done using the buffer. To ensure this, we > signal the dma-fence > only after the Host compositor sends a wl_buffer.release event or an > equivalent signal. > > The goal: > - Maintain full framerate even when the Guest scanout FB is flipped onto a > hardware plane > on the Host -- regardless of either compositor's scheduling policy -- without > making any > copies and ensuring that both Host and Guest are not accessing the buffer at > the same time. 
> > The problem: > - If the Host compositor flips the client's buffer (in this case Guest > compositor's buffer) > onto a hardware plane, then it can send a wl_buffer.release event for the > previous buffer > only after it gets a pageflip completion. And, if the Guest compositor takes > 10-12 ms to > submit a new buffer and given the fact that the Host compositor waits only > for 9 ms, the > Guest compositor will miss the Host's repaint cycle resulting in halved > frame-rate. > > The solution: > - To ensure full framerate, the Guest compositor has to start its repaint > cycle (including > the 9 ms wait) when the Host compositor sends the frame callback event to its > clients. > In order for this to happen, the dma-fence that the Guest KMS waits on -- > before sending > pageflip completion -- cannot be tied to a wl_buffer.release event. This > means that the > Guest compositor has to be forced to use a new buffer for its next repaint > cycle when it > gets a pageflip completion. Is that really the only solution? If we fix the event timestamps so that both guest and host use the same timestamp, but then the guest starts 5ms (or something like that) earlier, then things should work too? I.e. - host compositor starts at (previous_frametime + 9ms) - guest compositor starts at (previous_frametime + 4ms) Ofc this only works if the frametimes we hand out to both match
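The halved-framerate failure mode, and the head-start fix Daniel sketches above, can be made concrete with a little arithmetic. The numbers mirror the ones in the thread (60 Hz, a 9 ms host repaint delay, 10-12 ms guest submission latency); the function name and parameters are invented for illustration only.

```python
FRAME_MS = 1000 / 60  # ~16.66 ms per vblank at 60 Hz

def effective_fps(host_wait_ms: float, guest_latency_ms: float,
                  guest_head_start_ms: float = 0.0) -> int:
    """If the guest's new buffer lands after the host's repaint deadline,
    the flip slips a whole vblank, halving the framerate."""
    arrives_ms = guest_latency_ms - guest_head_start_ms
    return 60 if arrives_ms <= host_wait_ms else 30

# Unaligned: guest needs ~11 ms but the host only waits 9 ms -> 30 FPS.
print(effective_fps(host_wait_ms=9, guest_latency_ms=11))
# Daniel's suggestion: start the guest ~5 ms earlier so it makes the cut.
print(effective_fps(host_wait_ms=9, guest_latency_ms=11,
                    guest_head_start_ms=5))
```

This toy model assumes consistent frame times in both compositors, which is exactly the "only works if the frametimes we hand out to both match" caveat above.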
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote: > On 2021-07-30 12:25 p.m., Daniel Vetter wrote: > > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > >> Separating the OUT_FENCE signalling from pageflip completion allows > >> a Guest compositor to start a new repaint cycle with a new buffer > >> instead of waiting for the old buffer to be free. > >> > >> This work is based on the idea/suggestion from Simon and Pekka. > >> > >> This capability can be a solution for this issue: > >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > >> > >> Corresponding Weston MR: > >> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > > > Uh I kinda wanted to discuss this a bit more before we jump into typing > > code, but well I guess not that much work yet. > > > > So maybe I'm not understanding the problem, but I think the fundamental > > underlying issue is that with KMS you can have at most 2 buffers > > in-flight, due to our queue depth limit of 1 pending flip. > > > > Unfortunately that means for virtual hw where it takes a few more > > steps/vblanks until the framebuffer actually shows up on screen and is > > scanned out, we suffer deeply. The usual fix for that is to drop the > > latency and increase throughput, and have more buffers in-flight. Which > > this patch tries to do. > > Per > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , > IMO the underlying issue is actually that the guest compositor repaint > cycle is not aligned with the host compositor one. If they were aligned, > the problem would not occur even without allowing multiple page flips in > flight, and latency would be lower. Yeah my proposal here is under the premise that we do actually need to fix this with a deeper queue depth. > > Now I think where we go wrong here is that we're trying to hack this up by > > defining different semantics for the out-fence and for the drm-event.
Imo > > that's wrong, they're both meant to show exactly the same thing: > > - when is the new frame actually visible to the user (as in, eyeballs in a > > human head, preferably, not the time when we've handed the buffer off > > to the virtual hw) > > - when is the previous buffer no longer being used by the scanout hw > > > > We do cheat a bit right now in so far that we assume they're both the > > same, as in, panel-side latency is currently the compositor's problem to > > figure out. > > > > So for virtual hw I think the timestamp and event completion really need to > > happen only when the buffer has been pushed through the entire > > virtualization chain, i.e. ideally we get the timestamp from the kms > > driver from the host side. Currently that's not done, so this is most > > likely quite broken already (virtio relies on the no-vblank auto event > > sending, which definitely doesn't wait for anything, or I'm completely > > missing something). > > > > I think instead of hacking up some ill-defined 1.5 queue depth support, > > what we should do is support queue depth > 1 properly. So: > > > > - Change atomic to support queue depth > 1, this needs to be a per-driver > > thing due to a bunch of issues in driver code. Essentially drivers must > > never look at obj->state pointers, and only ever look up state through > > the passed-in drm_atomic_state * update container. > > > > - Aside: virtio should lose all its empty hooks, there's no point in > > that. > > > > - We fix virtio to send out the completion event at the end of this entire > > pipeline, i.e. virtio code needs to take care of sending out the > > crtc_state->event correctly. > > > > - We probably also want some kind of (maybe per-crtc) recommended queue > > depth property so compositors know how many buffers to keep in flight. > > Not sure about that.
> > I'd say there would definitely need to be some kind of signal for the > display server that it should queue multiple flips, since this is > normally not desirable for latency. In other words, this wouldn't really > be useful on bare metal (in contrast to the ability to replace a pending > flip with a newer one). Hm I was thinking that the compositor can tune this. If the round-trip latency (as measured by events) is too long to get full refresh rate, it can add more buffers to the queue. That's kinda why I think the returned event really must be accurate wrt actual display time (and old buffer release time), so that this computation in the compositor becomes pretty simple: num_buffers = (flip_time - submit_time) / frame_time, with maybe some rounding up and averaging. You can also hit this when your 3d engine has an extremely deep pipeline (like some of the tiling renderers have), where rendering just takes forever, but as long as you keep 2 frames in the renderer in-flight you can achieve full refresh rate (at a latency cost). So the kernel can't really tell you in all cases how many buffers you should have. -Daniel
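Daniel's back-of-the-envelope formula can be sketched as follows. This is purely illustrative Python; the function name and sample latency numbers are assumptions, not part of any proposed API:

```python
import math

def recommended_queue_depth(flip_time_ms, submit_time_ms, frame_time_ms):
    """num_buffers = (flip_time - submit_time) / frame_time, rounded up
    so the compositor always has a buffer ready when a slot frees up."""
    latency_ms = flip_time_ms - submit_time_ms
    return max(1, math.ceil(latency_ms / frame_time_ms))

frame_time = 1000.0 / 60.0  # ~16.67 ms at 60 Hz

# A ~33 ms round trip through the virtualization chain needs 2 buffers
# in flight to sustain the full refresh rate.
print(recommended_queue_depth(33.0, 0.0, frame_time))  # 2
```

Averaging the measured flip-to-submit latency over several frames, as Daniel suggests, would smooth out jitter before feeding it into a computation like this.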
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi Daniel, > > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > > By separating the OUT_FENCE signalling from pageflip completion allows > > a Guest compositor to start a new repaint cycle with a new buffer > > instead of waiting for the old buffer to be free. > > > > This work is based on the idea/suggestion from Simon and Pekka. > > > > This capability can be a solution for this issue: > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > > > Corresponding Weston MR: > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > Uh I kinda wanted to discuss this a bit more before we jump into typing > code, but well I guess not that much work yet. [Kasireddy, Vivek] Right, it wasn't a lot of work :) > > So maybe I'm not understanding the problem, but I think the fundamental > underlying issue is that with KMS you can have at most 2 buffers > in-flight, due to our queue depth limit of 1 pending flip. [Kasireddy, Vivek] Let me summarize the problem again from the perspective of both the Host (Weston) and Guest (Weston) compositors assuming a refresh rate of 60 Hz -- which implies the Vblank/Vsync is generated every ~16.66 ms. Host compositor: - After a pageflip completion event, it starts its next repaint cycle by waiting for 9 ms and then submits the atomic commit and at the tail end of its cycle sends a frame callback event to all its clients (who registered and submitted frames) indicating to them to start their next redraw -- giving them at least ~16 ms to submit a new frame to be included in its next repaint.
Why a configurable 9 ms delay is needed is explained in Pekka's blog post here: https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html - It'll send a wl_buffer.release event for a client-submitted previous buffer only when the client has submitted a new buffer and: a) When it hasn't started its repaint cycle yet OR b) When it clears its old state after it gets a pageflip completion event -- if it had flipped the client's buffer onto a hardware plane. Guest compositor: - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms for the Guest compositor to submit a new atomic commit. This time of 10-12 ms includes the 9 ms wait -- just like the Host compositor -- for its clients to submit new buffers. - When it gets a pageflip completion, it assumes that the previously submitted buffer is free for re-use and uses it again -- resulting in the usage of only 2 out of a maximum of 4 backbuffers included as part of the Mesa GBM surface implementation. Guest KMS/Virtio-gpu/Qemu Wayland UI: - Because no_vblank=true for Guest KMS and since the vblank event (which also serves as the pageflip completion event for user-space) is sent right away after atomic commit, as Gerd said, we use an internal dma-fence to block the Guest KMS until we know for sure that the Host is completely done using the buffer. To ensure this, we signal the dma-fence only after the Host compositor sends a wl_buffer.release event or an equivalent signal. The goal: - Maintain full framerate even when the Guest scanout FB is flipped onto a hardware plane on the Host -- regardless of either compositor's scheduling policy -- without making any copies and ensuring that both Host and Guest are not accessing the buffer at the same time. The problem: - If the Host compositor flips the client's buffer (in this case Guest compositor's buffer) onto a hardware plane, then it can send a wl_buffer.release event for the previous buffer only after it gets a pageflip completion.
And if the Guest compositor takes 10-12 ms to submit a new buffer and given the fact that the Host compositor waits only for 9 ms, the Guest compositor will miss the Host's repaint cycle resulting in halved frame-rate. The solution: - To ensure full framerate, the Guest compositor has to start its repaint cycle (including the 9 ms wait) when the Host compositor sends the frame callback event to its clients. In order for this to happen, the dma-fence that the Guest KMS waits on -- before sending pageflip completion -- cannot be tied to a wl_buffer.release event. This means that the Guest compositor has to be forced to use a new buffer for its next repaint cycle when it gets a pageflip completion. - The Weston MR I linked above does this by getting an out_fence fd and taking a reference on all the FBs included in the atomic commit forcing the compositor to use new FBs for its next repaint cycle. It releases the references when the out_fence is signalled later when the Host compositor sends a wl_buffer.release event. > > Unfortunately that means for virtual hw where it takes a few more > steps/vblanks until the framebuffer actually shows up on screen and is > scanned out, we suffer deeply. The usual fix for that is to drop the > latency and increase throughput, and have more buffers in-flight. Which > this patch tries to do. > > Now I think where we go
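The halved frame-rate arithmetic described above can be captured in a toy model. The 9 ms and 10-12 ms figures come straight from the description; everything else (function name, structure) is illustrative only, not real compositor code:

```python
FRAME_TIME_MS = 1000.0 / 60.0  # ~16.67 ms at 60 Hz

def effective_fps(host_wait_ms, guest_response_ms):
    """Toy model: if the guest's new buffer arrives after the host
    compositor's commit deadline, the flip slips to the next repaint
    cycle, so the guest only updates every other vblank."""
    if guest_response_ms > host_wait_ms:
        return 1000.0 / (2 * FRAME_TIME_MS)
    return 1000.0 / FRAME_TIME_MS

# Host commits 9 ms into its cycle; the guest needs 10-12 ms after a
# pageflip completion, so it misses the deadline and runs at ~30 fps.
print(round(effective_fps(9.0, 11.0)))  # 30

# Starting the guest repaint at the host's frame callback instead gives
# it a full period, so an 11 ms response comfortably makes the deadline.
print(round(effective_fps(FRAME_TIME_MS, 11.0)))  # 60
```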
RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
> -Original Message- > From: Daniel Vetter > Sent: Friday, July 30, 2021 6:26 PM > To: Kasireddy, Vivek > Cc: dri-devel@lists.freedesktop.org; Daniel Vetter ; Gerd > Hoffmann ; Pekka Paalanen ; > Simon Ser ; Michel Dänzer ; > Zhang, Tina ; Kim, Dongwon > > Subject: Re: [RFC v1 0/4] drm: Add support for > DRM_CAP_DEFERRED_OUT_FENCE capability > > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > > By separating the OUT_FENCE signalling from pageflip completion allows > > a Guest compositor to start a new repaint cycle with a new buffer > > instead of waiting for the old buffer to be free. > > > > This work is based on the idea/suggestion from Simon and Pekka. > > > > This capability can be a solution for this issue: > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > > > Corresponding Weston MR: > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > Uh I kinda wanted to discuss this a bit more before we jump into typing code, > but well I guess not that much work yet. > > So maybe I'm not understanding the problem, but I think the fundamental > underlying issue is that with KMS you can have at most 2 buffers in-flight, > due to our queue depth limit of 1 pending flip. > > Unfortunately that means for virtual hw where it takes a few more > steps/vblanks until the framebuffer actually shows up on screen and is > scanned out, we suffer deeply. The usual fix for that is to drop the latency > and increase throughput, and have more buffers in-flight. Which this patch > tries to do. > > Now I think where we go wrong here is that we're trying to hack this up by > defining different semantics for the out-fence and for the drm-event. 
Imo > that's wrong, they're both meant to show exactly the same thing: > - when is the new frame actually visible to the user (as in, eyeballs in a > human head, preferably, not the time when we've handed the buffer off > to the virtual hw) > - when is the previous buffer no longer being used by the scanout hw > > We do cheat a bit right now in so far that we assume they're both the same, > as in, panel-side latency is currently the compositor's problem to figure out. > > So for virtual hw I think the timestamp and event completion really need to > happen only when the buffer has been pushed through the entire > virtualization chain, i.e. ideally we get the timestamp from the kms driver > from the host side. Currently that's not done, so this is most likely quite > broken already (virtio relies on the no-vblank auto event sending, which > definitely doesn't wait for anything, or I'm completely missing something). Agree. One lesson we got from previous direct-display related work is that using the host hardware event is kind of a "must". Otherwise, problems like flickering, tearing or frame drops will come up. Besides, as the wayland-ui is working as a weston client, it needs to have more than 2 buffers to support the full-frame redraw. I tried the Weston-simple-dmabuf-egl with 2 buffers and it was bad. BR, Tina > > I think instead of hacking up some ill-defined 1.5 queue depth support, what > we should do is support queue depth > 1 properly. So: > > - Change atomic to support queue depth > 1, this needs to be a per-driver > thing due to a bunch of issues in driver code. Essentially drivers must > never look at obj->state pointers, and only ever look up state through > the passed-in drm_atomic_state * update container. > > - Aside: virtio should lose all its empty hooks, there's no point in > that. > > - We fix virtio to send out the completion event at the end of this entire > pipeline, i.e.
virtio code needs to take care of sending out the > crtc_state->event correctly. > > - We probably also want some kind of (maybe per-crtc) recommended queue > depth property so compositors know how many buffers to keep in flight. > Not sure about that. > > It's a bit more work, but also a lot less hacking around infrastructure in > dubious ways. > > Thoughts? > > Cheers, Daniel > > > > > Cc: Daniel Vetter > > Cc: Gerd Hoffmann > > Cc: Pekka Paalanen > > Cc: Simon Ser > > Cc: Michel Dänzer > > Cc: Tina Zhang > > Cc: Dongwon Kim > > > > Vivek Kasireddy (4): > > drm: Add a capability flag to support deferred out_fence signalling > > virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature > > drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd > > drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature > > > > drivers/gpu/drm/drm_file.c | 11 +++--- > > drivers/gpu/drm
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Hi, > - We fix virtio to send out the completion event at the end of this entire > pipeline, i.e. virtio code needs to take care of sending out the > crtc_state->event correctly. That sounds sensible to me. Fence the virtio commands, make sure (on the host side) the command completes only when the work is actually done, not only submitted. This has recently been added to qemu for RESOURCE_FLUSH (aka frontbuffer rendering); doing the same for SET_SCANOUT (aka pageflipping) and then sending vblank events to userspace on command completion certainly makes sense. take care, Gerd
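Gerd's point -- signal completion only when the host has actually finished the work, and deliver the guest's flip event from that completion rather than at submit time -- can be modelled with a tiny callback sketch. All names here (`Fence`, `submit_set_scanout`) are made up for illustration; this is not the virtio-gpu or qemu API:

```python
class Fence:
    """Minimal stand-in for a fence that runs callbacks when signalled."""
    def __init__(self):
        self.signaled = False
        self._callbacks = []

    def on_signal(self, cb):
        self._callbacks.append(cb)

    def signal(self):
        self.signaled = True
        for cb in self._callbacks:
            cb()

events = []  # what the guest compositor would observe

def submit_set_scanout(fb):
    """Submit a pretend SET_SCANOUT command. The flip event is queued
    for delivery only when the host signals the command's fence,
    not at submission time."""
    fence = Fence()
    fence.on_signal(lambda: events.append(("flip-done", fb)))
    return fence

f = submit_set_scanout("fb0")
assert events == []              # submitted, but host not done yet
f.signal()                       # host actually displayed the buffer
assert events == [("flip-done", "fb0")]
```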
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On 2021-07-30 12:25 p.m., Daniel Vetter wrote: > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: >> By separating the OUT_FENCE signalling from pageflip completion allows >> a Guest compositor to start a new repaint cycle with a new buffer >> instead of waiting for the old buffer to be free. >> >> This work is based on the idea/suggestion from Simon and Pekka. >> >> This capability can be a solution for this issue: >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514 >> >> Corresponding Weston MR: >> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 > > Uh I kinda wanted to discuss this a bit more before we jump into typing > code, but well I guess not that much work yet. > > So maybe I'm not understanding the problem, but I think the fundamental > underlying issue is that with KMS you can have at most 2 buffers > in-flight, due to our queue depth limit of 1 pending flip. > > Unfortunately that means for virtual hw where it takes a few more > steps/vblanks until the framebuffer actually shows up on screen and is > scanned out, we suffer deeply. The usual fix for that is to drop the > latency and increase throughput, and have more buffers in-flight. Which > this patch tries to do. Per https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , IMO the underlying issue is actually that the guest compositor repaint cycle is not aligned with the host compositor one. If they were aligned, the problem would not occur even without allowing multiple page flips in flight, and latency would be lower. > Now I think where we go wrong here is that we're trying to hack this up by > defining different semantics for the out-fence and for the drm-event. 
Imo > that's wrong, they're both meant to show exactly the same thing: > - when is the new frame actually visible to the user (as in, eyeballs in a > human head, preferably, not the time when we've handed the buffer off > to the virtual hw) > - when is the previous buffer no longer being used by the scanout hw > > We do cheat a bit right now in so far that we assume they're both the > same, as in, panel-side latency is currently the compositor's problem to > figure out. > > So for virtual hw I think the timestamp and event completion really need to > happen only when the buffer has been pushed through the entire > virtualization chain, i.e. ideally we get the timestamp from the kms > driver from the host side. Currently that's not done, so this is most > likely quite broken already (virtio relies on the no-vblank auto event > sending, which definitely doesn't wait for anything, or I'm completely > missing something). > > I think instead of hacking up some ill-defined 1.5 queue depth support, > what we should do is support queue depth > 1 properly. So: > > - Change atomic to support queue depth > 1, this needs to be a per-driver > thing due to a bunch of issues in driver code. Essentially drivers must > never look at obj->state pointers, and only ever look up state through > the passed-in drm_atomic_state * update container. > > - Aside: virtio should lose all its empty hooks, there's no point in > that. > > - We fix virtio to send out the completion event at the end of this entire > pipeline, i.e. virtio code needs to take care of sending out the > crtc_state->event correctly. > > - We probably also want some kind of (maybe per-crtc) recommended queue > depth property so compositors know how many buffers to keep in flight. > Not sure about that. I'd say there would definitely need to be some kind of signal for the display server that it should queue multiple flips, since this is normally not desirable for latency.
In other words, this wouldn't really be useful on bare metal (in contrast to the ability to replace a pending flip with a newer one). -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer
Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote: > By separating the OUT_FENCE signalling from pageflip completion allows > a Guest compositor to start a new repaint cycle with a new buffer > instead of waiting for the old buffer to be free. > > This work is based on the idea/suggestion from Simon and Pekka. > > This capability can be a solution for this issue: > https://gitlab.freedesktop.org/wayland/weston/-/issues/514 > > Corresponding Weston MR: > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 Uh I kinda wanted to discuss this a bit more before we jump into typing code, but well I guess not that much work yet. So maybe I'm not understanding the problem, but I think the fundamental underlying issue is that with KMS you can have at most 2 buffers in-flight, due to our queue depth limit of 1 pending flip. Unfortunately that means for virtual hw where it takes a few more steps/vblanks until the framebuffer actually shows up on screen and is scanned out, we suffer deeply. The usual fix for that is to drop the latency and increase throughput, and have more buffers in-flight. Which this patch tries to do. Now I think where we go wrong here is that we're trying to hack this up by defining different semantics for the out-fence and for the drm-event. Imo that's wrong, they're both meant to show exactly the same thing: - when is the new frame actually visible to the user (as in, eyeballs in a human head, preferably, not the time when we've handed the buffer off to the virtual hw) - when is the previous buffer no longer being used by the scanout hw We do cheat a bit right now in so far that we assume they're both the same, as in, panel-side latency is currently the compositor's problem to figure out. So for virtual hw I think the timestamp and event completion really need to happen only when the buffer has been pushed through the entire virtualization chain, i.e. ideally we get the timestamp from the kms driver from the host side.
Currently that's not done, so this is most likely quite broken already (virtio relies on the no-vblank auto event sending, which definitely doesn't wait for anything, or I'm completely missing something). I think instead of hacking up some ill-defined 1.5 queue depth support, what we should do is support queue depth > 1 properly. So: - Change atomic to support queue depth > 1, this needs to be a per-driver thing due to a bunch of issues in driver code. Essentially drivers must never look at obj->state pointers, and only ever look up state through the passed-in drm_atomic_state * update container. - Aside: virtio should lose all its empty hooks, there's no point in that. - We fix virtio to send out the completion event at the end of this entire pipeline, i.e. virtio code needs to take care of sending out the crtc_state->event correctly. - We probably also want some kind of (maybe per-crtc) recommended queue depth property so compositors know how many buffers to keep in flight. Not sure about that. It's a bit more work, but also a lot less hacking around infrastructure in dubious ways. Thoughts? 
Cheers, Daniel > > Cc: Daniel Vetter > Cc: Gerd Hoffmann > Cc: Pekka Paalanen > Cc: Simon Ser > Cc: Michel Dänzer > Cc: Tina Zhang > Cc: Dongwon Kim > > Vivek Kasireddy (4): > drm: Add a capability flag to support deferred out_fence signalling > virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature > drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd > drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature > > drivers/gpu/drm/drm_file.c | 11 +++--- > drivers/gpu/drm/drm_ioctl.c | 3 ++ > drivers/gpu/drm/virtio/virtgpu_debugfs.c | 1 + > drivers/gpu/drm/virtio/virtgpu_drv.c | 1 + > drivers/gpu/drm/virtio/virtgpu_drv.h | 6 > drivers/gpu/drm/virtio/virtgpu_fence.c | 9 + > drivers/gpu/drm/virtio/virtgpu_kms.c | 10 -- > drivers/gpu/drm/virtio/virtgpu_plane.c | 44 +++- > drivers/gpu/drm/virtio/virtgpu_vq.c | 17 + > include/drm/drm_mode_config.h| 9 + > include/uapi/drm/drm.h | 1 + > include/uapi/linux/virtio_gpu.h | 12 +++ > 12 files changed, 117 insertions(+), 7 deletions(-) > > -- > 2.30.2 > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
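Daniel's core trade-off -- a deeper flip queue recovers throughput when the virtual scanout pipeline spans multiple vblanks -- can be checked with a toy simulation. This is a sketch under assumed numbers (a 2-vblank virtualization pipeline), not a model of any real driver:

```python
def simulated_fps(queue_depth, pipeline_vblanks, total_vblanks=600):
    """Count flips submitted when each one takes `pipeline_vblanks`
    vblanks to complete and at most `queue_depth` may be pending,
    then scale to frames per second at 60 Hz."""
    pending = []  # completion time (in vblanks) of each in-flight flip
    submitted = 0
    for t in range(total_vblanks):
        pending = [done for done in pending if done > t]
        if len(pending) < queue_depth:  # room for one more flip
            pending.append(t + pipeline_vblanks)
            submitted += 1
    return submitted * 60 // total_vblanks

# Queue depth 1 plus a 2-vblank pipeline halves the rate; depth 2
# restores the full refresh rate (at a latency cost, as noted above).
print(simulated_fps(queue_depth=1, pipeline_vblanks=2))  # 30
print(simulated_fps(queue_depth=2, pipeline_vblanks=2))  # 60
```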
[RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability
Separating the OUT_FENCE signalling from pageflip completion allows a Guest compositor to start a new repaint cycle with a new buffer instead of waiting for the old buffer to be free. This work is based on the idea/suggestion from Simon and Pekka. This capability can be a solution for this issue: https://gitlab.freedesktop.org/wayland/weston/-/issues/514 Corresponding Weston MR: https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668 Cc: Daniel Vetter Cc: Gerd Hoffmann Cc: Pekka Paalanen Cc: Simon Ser Cc: Michel Dänzer Cc: Tina Zhang Cc: Dongwon Kim Vivek Kasireddy (4): drm: Add a capability flag to support deferred out_fence signalling virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature drivers/gpu/drm/drm_file.c | 11 +++--- drivers/gpu/drm/drm_ioctl.c | 3 ++ drivers/gpu/drm/virtio/virtgpu_debugfs.c | 1 + drivers/gpu/drm/virtio/virtgpu_drv.c | 1 + drivers/gpu/drm/virtio/virtgpu_drv.h | 6 drivers/gpu/drm/virtio/virtgpu_fence.c | 9 + drivers/gpu/drm/virtio/virtgpu_kms.c | 10 -- drivers/gpu/drm/virtio/virtgpu_plane.c | 44 +++- drivers/gpu/drm/virtio/virtgpu_vq.c | 17 + include/drm/drm_mode_config.h | 9 + include/uapi/drm/drm.h | 1 + include/uapi/linux/virtio_gpu.h | 12 +++ 12 files changed, 117 insertions(+), 7 deletions(-) -- 2.30.2