RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-11 Thread Kasireddy, Vivek
Hi Michel,
 
> On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> > On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote:
> >>> On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> >>>
> >>> Hence my gut feeling reaction that first we need to get these two
> >>> compositors aligned in their timings, which probably needs
> >>> consistent vblank periods/timestamps across them (plus/minus
> >>> guest/host clocksource fun ofc). Without this any of the next steps
> >>> will simply not work because there's too much jitter by the time the
> >>> guest compositor gets the flip completion events.
> >> [Kasireddy, Vivek] Timings are not a problem and do not significantly
> >> affect the repaint cycles from what I have seen so far.
> >>
> >>>
> >>> Once we have solid events I think we should look into statically
> >>> tuning guest/host compositor deadlines (like you've suggested in a
> >>> bunch of places) to consistently make that deadline and hit 60 fps.
> >>> With that we can then look into tuning this automatically and what to
> >>> do when e.g. switching between copying and zero-copy on the host side
> >>> (which might be needed in some cases) and how to handle all that.
> >> [Kasireddy, Vivek] As I confirm here:
> >>> https://gitlab.freedesktop.org/wayland/weston/-
> > /issues/514#note_984065
> >> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. 
> >> However,
> >> I feel that this zero-copy solution I am trying to create should be 
> >> independent
> >> of compositors' deadlines, delays or other scheduling parameters.
> >
> > That's not how compositors work nowadays. Your problem is that you don't
> > have the guest/host compositor in sync. zero-copy only changes the 
> > timing,
> > so it changes things from "rendering way too many frames" to "rendering
> > way too few frames".
> >
> > We need to fix the timing/sync issue here first, not paper over it with
> > hacks.
>  [Kasireddy, Vivek] What I really meant is that the zero-copy solution 
>  should be
>  independent of the scheduling policies to ensure that it works with all 
>  compositors.
>   IIUC, Weston for example uses the vblank/pageflip completion timestamp, 
>  the
>  configurable repaint-window value, refresh-rate, etc to determine when 
>  to start
>  its next repaint -- if there is any damage:
>  timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
>  timespec_add_msec(&output->next_repaint, &output->next_repaint,
>  -compositor->repaint_msec);
> 
>  And, in the case of VKMS, since there is no real hardware, the timestamp 
>  is always:
>  now = ktime_get();
>  send_vblank_event(dev, e, seq, now);
> >>>
> >>> vkms has been fixed since a while to fake high-precision timestamps like
> >>> from a real display.
> >> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest 
> >> does not need
> >> to have the same timestamp as that of the Host -- to work as expected.
> >>
> >>>
>  When you say that the Guest/Host compositor need to stay in sync, are you
>  suggesting that we need to ensure that the vblank timestamp on the Host
>  needs to be shared and be the same on the Guest and a vblank/pageflip
>  completion for the Guest needs to be sent at exactly the same time it is 
>  sent
>  on the Host? If yes, I'd say that we do send the pageflip completion to 
>  Guest
>  around the same time a vblank is generated on the Host but it does not 
>  help
>  because the Guest compositor would only have 9 ms to submit a new frame
>  and if the Host is running Mutter, the Guest would only have 2 ms.
>  (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
> >>>
> >>> Not at the same time, but the same timestamp. And yes there is some fun
> >>> there, which is I think the fundamental issue. Or at least some of the
> >>> compositor experts seem to think so, and it makes sense to me.
> >> [Kasireddy, Vivek] It is definitely possible that if the timestamp is 
> >> messed up, then
> >> the Guest repaint cycle would be affected. However, I do not believe that 
> >> is the case
> >> here given the debug and instrumentation data we collected and 
> >> scrutinized. Hopefully,
> >> compositor experts could chime in to shed some light on this matter.
> >>
> >>>
> >
> > Only, and I really mean only, when that shows that it's simply 
> > impossible
> > to hit 60fps with zero-copy and the guest/host fully aligned should we
> > look into making the overall pipeline deeper.
>  [Kasireddy, Vivek] From all the experiments conducted so far and given 
>  the
>  discussion associated with 
>  https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514
>  I think we have already established that in order for a zero-copy solution 
>  to work reliably, the Guest compositor needs to start its repaint cycle 
>  when the Host compositor sends a frame callback event to its clients.

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-11 Thread Kasireddy, Vivek
Hi Daniel,

> On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> > > > Hi Daniel,
> > > >
> > > > > > > > >>> The solution:
> > > > > > > > >>> - To ensure full framerate, the Guest compositor has to 
> > > > > > > > >>> start its repaint
> > > cycle
> > > > > > > (including
> > > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame 
> > > > > > > > >>> callback event
> to
> > > its
> > > > > > > clients.
> > > > > > > > >>> In order for this to happen, the dma-fence that the Guest 
> > > > > > > > >>> KMS waits on -
> -
> > > before
> > > > > > > sending
> > > > > > > > >>> pageflip completion -- cannot be tied to a 
> > > > > > > > >>> wl_buffer.release event. This
> > > means
> > > > > that,
> > > > > > > the
> > > > > > > > >>> Guest compositor has to be forced to use a new buffer for 
> > > > > > > > >>> its next
> repaint
> > > cycle
> > > > > > > when it
> > > > > > > > >>> gets a pageflip completion.
> > > > > > > > >>
> > > > > > > > >> Is that really the only solution?
> > > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > > > But I think none of them are as compelling as this one.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> If we fix the event timestamps so that both guest and host 
> > > > > > > > >> use the same
> > > > > > > > >> timestamp, but then the guest starts 5ms (or something like 
> > > > > > > > >> that) earlier,
> > > > > > > > >> then things should work too? I.e.
> > > > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > > > >>
> > > > > > > > >> Ofc this only works if the frametimes we hand out to both 
> > > > > > > > >> match
> _exactly_
> > > > > > > > >> and are as high-precision as the ones on the host side. 
> > > > > > > > >> Which for many
> gpu
> > > > > > > > >> drivers at least is the case, and all the ones you care 
> > > > > > > > >> about for sure :-)
> > > > > > > > >>
> > > > > > > > >> But if the frametimes the guest receives are the no_vblank 
> > > > > > > > >> fake ones,
> then
> > > > > > > > >> they'll be all over the place and this carefully tuned 
> > > > > > > > >> low-latency redraw
> > > > > > > > >> loop falls apart. Aside from the fact that without tuning 
> > > > > > > > >> the guests to
> > > > > > > > >> be earlier than the hosts, you're guaranteed to miss every 
> > > > > > > > >> frame (except
> > > > > > > > >> when the timing wobbliness in the guest is big enough by 
> > > > > > > > >> chance to make
> > > > > > > > >> the deadline on the oddball frame).
> > > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > > > > timestamps as
> we
> > > don't
> > > > > > > > > share these between the Guest and the Host. It does not seem 
> > > > > > > > > to be causing
> any
> > > > > other
> > > > > > > > > problems so far but we did try the experiment you mentioned 
> > > > > > > > > (i.e.,
> adjusting
> > > the
> > > > > > > delays)
> > > > > > > > > and it works. However, this patch series is meant to fix the 
> > > > > > > > > issue without
> > > having to
> > > > > > > tweak
> > > > > > > > > anything (delays) because we can't do this for every 
> > > > > > > > > compositor out there.
> > > > > > > >
> > > > > > > > Maybe there could be a mechanism which allows the compositor in 
> > > > > > > > the guest
> to
> > > > > > > automatically adjust its repaint cycle as needed.
> > > > > > > >
> > > > > > > > This might even be possible without requiring changes in each 
> > > > > > > > compositor,
> by
> > > > > adjusting
> > > > > > > the vertical blank periods in the guest to be aligned with the 
> > > > > > > host compositor
> > > repaint
> > > > > > > cycles. Not sure about that though.
> > > > > > > >
> > > > > > > > Even if not, both this series or making it possible to queue 
> > > > > > > > multiple flips
> require
> > > > > > > corresponding changes in each compositor as well to have any 
> > > > > > > effect.
> > > > > > >
> > > > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > > > deeper queue we have big coordination issues between the guest and
> > > > > > > host compositor (like the example that the guest is now rendering 
> > > > > > > at
> > > > > > > 90fps instead of 60fps like the host).
> > > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to 
> > > > > > Gerd. That 90
> FPS vs
> > > > > > 60 FPS problem is a completely different issue that is associated 
> > > > > > with Qemu
> GTK UI
> > > > > > backend. With the GTK backend -- and also with SDL backend -- we 
> > > > > > Blit the
> Guest
> > > > > > scanout FB onto one of the backbuffers managed by EGL.
> > > > > >
> > > > > > I am trying to add a new Qemu Wayland UI backend so that we can 
> > > > > > eliminate that Blit and thereby have a truly zero-copy solution. And, 
> > > > > > this is where I am running into the halved frame-rate issue -- the 
> > > > > > current problem.

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-10 Thread Michel Dänzer
On 2021-08-10 10:30 a.m., Daniel Vetter wrote:
> On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote:
>>> On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
>>>
>>> Hence my gut feeling reaction that first we need to get these two
>>> compositors aligned in their timings, which probably needs
>>> consistent vblank periods/timestamps across them (plus/minus
>>> guest/host clocksource fun ofc). Without this any of the next steps
>>> will simply not work because there's too much jitter by the time the
>>> guest compositor gets the flip completion events.
>> [Kasireddy, Vivek] Timings are not a problem and do not significantly
>> affect the repaint cycles from what I have seen so far.
>>
>>>
>>> Once we have solid events I think we should look into statically
>>> tuning guest/host compositor deadlines (like you've suggested in a
>>> bunch of places) to consistently make that deadline and hit 60 fps.
>>> With that we can then look into tuning this automatically and what to
>>> do when e.g. switching between copying and zero-copy on the host side
>>> (which might be needed in some cases) and how to handle all that.
>> [Kasireddy, Vivek] As I confirm here:
>>> https://gitlab.freedesktop.org/wayland/weston/-
> /issues/514#note_984065
>> tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
>> I feel that this zero-copy solution I am trying to create should be 
>> independent
>> of compositors' deadlines, delays or other scheduling parameters.
>
> That's not how compositors work nowadays. Your problem is that you don't
> have the guest/host compositor in sync. zero-copy only changes the timing,
> so it changes things from "rendering way too many frames" to "rendering
> way too few frames".
>
> We need to fix the timing/sync issue here first, not paper over it with
> hacks.
 [Kasireddy, Vivek] What I really meant is that the zero-copy solution 
 should be
 independent of the scheduling policies to ensure that it works with all 
 compositors.
  IIUC, Weston for example uses the vblank/pageflip completion timestamp, 
 the
 configurable repaint-window value, refresh-rate, etc to determine when to 
 start
 its next repaint -- if there is any damage:
 timespec_add_nsec(&output->next_repaint, stamp, refresh_nsec);
 timespec_add_msec(&output->next_repaint, &output->next_repaint,
 -compositor->repaint_msec);

 And, in the case of VKMS, since there is no real hardware, the timestamp 
 is always:
 now = ktime_get();
 send_vblank_event(dev, e, seq, now);
>>>
>>> vkms has been fixed since a while to fake high-precision timestamps like
>>> from a real display.
>> [Kasireddy, Vivek] IIUC, that might be one of the reasons why the Guest does 
>> not need 
>> to have the same timestamp as that of the Host -- to work as expected.
>>
>>>
 When you say that the Guest/Host compositor need to stay in sync, are you
 suggesting that we need to ensure that the vblank timestamp on the Host
 needs to be shared and be the same on the Guest and a vblank/pageflip
 completion for the Guest needs to be sent at exactly the same time it is 
 sent
 on the Host? If yes, I'd say that we do send the pageflip completion to 
 Guest
 around the same time a vblank is generated on the Host but it does not help
 because the Guest compositor would only have 9 ms to submit a new frame
 and if the Host is running Mutter, the Guest would only have 2 ms.
 (https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984341)
>>>
>>> Not at the same time, but the same timestamp. And yes there is some fun
>>> there, which is I think the fundamental issue. Or at least some of the
>>> compositor experts seem to think so, and it makes sense to me.
>> [Kasireddy, Vivek] It is definitely possible that if the timestamp is messed 
>> up, then
>> the Guest repaint cycle would be affected. However, I do not believe that is 
>> the case
>> here given the debug and instrumentation data we collected and scrutinized. 
>> Hopefully,
>> compositor experts could chime in to shed some light on this matter.
>>
>>>
>
> Only, and I really mean only, when that shows that it's simply impossible
> to hit 60fps with zero-copy and the guest/host fully aligned should we
> look into making the overall pipeline deeper.
 [Kasireddy, Vivek] From all the experiments conducted so far and given the
 discussion associated with 
 https://gitlab.freedesktop.org/wayland/weston/-/issues/514
 I think we have already established that in order for a zero-copy solution 
 to work
 reliably, the Guest compositor needs to start its repaint cycle when the 
 Host
 compositor sends a frame callback event to its clients.

>
>>> Only when that all shows that we just can't hit 60fps 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-10 Thread Daniel Vetter
On Tue, Aug 10, 2021 at 08:21:09AM +, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> > > Hi Daniel,
> > >
> > > > > > > >>> The solution:
> > > > > > > >>> - To ensure full framerate, the Guest compositor has to start 
> > > > > > > >>> its repaint
> > cycle
> > > > > > (including
> > > > > > > >>> the 9 ms wait) when the Host compositor sends the frame 
> > > > > > > >>> callback event to
> > its
> > > > > > clients.
> > > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS 
> > > > > > > >>> waits on --
> > before
> > > > > > sending
> > > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release 
> > > > > > > >>> event. This
> > means
> > > > that,
> > > > > > the
> > > > > > > >>> Guest compositor has to be forced to use a new buffer for its 
> > > > > > > >>> next repaint
> > cycle
> > > > > > when it
> > > > > > > >>> gets a pageflip completion.
> > > > > > > >>
> > > > > > > >> Is that really the only solution?
> > > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > > But I think none of them are as compelling as this one.
> > > > > > > >
> > > > > > > >>
> > > > > > > >> If we fix the event timestamps so that both guest and host use 
> > > > > > > >> the same
> > > > > > > >> timestamp, but then the guest starts 5ms (or something like 
> > > > > > > >> that) earlier,
> > > > > > > >> then things should work too? I.e.
> > > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > > >>
> > > > > > > >> Ofc this only works if the frametimes we hand out to both 
> > > > > > > >> match _exactly_
> > > > > > > >> and are as high-precision as the ones on the host side. Which 
> > > > > > > >> for many gpu
> > > > > > > >> drivers at least is the case, and all the ones you care about 
> > > > > > > >> for sure :-)
> > > > > > > >>
> > > > > > > >> But if the frametimes the guest receives are the no_vblank 
> > > > > > > >> fake ones, then
> > > > > > > >> they'll be all over the place and this carefully tuned 
> > > > > > > >> low-latency redraw
> > > > > > > >> loop falls apart. Aside from the fact that without tuning the 
> > > > > > > >> guests to
> > > > > > > >> be earlier than the hosts, you're guaranteed to miss every 
> > > > > > > >> frame (except
> > > > > > > >> when the timing wobbliness in the guest is big enough by 
> > > > > > > >> chance to make
> > > > > > > >> the deadline on the oddball frame).
> > > > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > > > timestamps as we
> > don't
> > > > > > > > share these between the Guest and the Host. It does not seem to 
> > > > > > > > be causing any
> > > > other
> > > > > > > > problems so far but we did try the experiment you mentioned 
> > > > > > > > (i.e., adjusting
> > the
> > > > > > delays)
> > > > > > > > and it works. However, this patch series is meant to fix the 
> > > > > > > > issue without
> > having to
> > > > > > tweak
> > > > > > > > anything (delays) because we can't do this for every compositor 
> > > > > > > > out there.
> > > > > > >
> > > > > > > Maybe there could be a mechanism which allows the compositor in 
> > > > > > > the guest to
> > > > > > automatically adjust its repaint cycle as needed.
> > > > > > >
> > > > > > > This might even be possible without requiring changes in each 
> > > > > > > compositor, by
> > > > adjusting
> > > > > > the vertical blank periods in the guest to be aligned with the host 
> > > > > > compositor
> > repaint
> > > > > > cycles. Not sure about that though.
> > > > > > >
> > > > > > > Even if not, both this series or making it possible to queue 
> > > > > > > multiple flips require
> > > > > > corresponding changes in each compositor as well to have any effect.
> > > > > >
> > > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > > deeper queue we have big coordination issues between the guest and
> > > > > > host compositor (like the example that the guest is now rendering at
> > > > > > 90fps instead of 60fps like the host).
> > > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. 
> > > > > That 90 FPS vs
> > > > > 60 FPS problem is a completely different issue that is associated 
> > > > > with Qemu GTK UI
> > > > > backend. With the GTK backend -- and also with SDL backend -- we Blit 
> > > > > the Guest
> > > > > scanout FB onto one of the backbuffers managed by EGL.
> > > > >
> > > > > I am trying to add a new Qemu Wayland UI backend so that we can 
> > > > > eliminate that
> > Blit
> > > > > and thereby have a truly zero-copy solution. And, this is where I am 
> > > > > running into the
> > > > > halved frame-rate issue -- the current problem.
> > > >
> > > > Yes, that's what I referenced. But I disagree that it's a different 
> > > > problem. The underlying problem in both cases is that the guest and host 
> > > > compositor free-wheel instead of rendering in sync.

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-10 Thread Kasireddy, Vivek
Hi Daniel,

> On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> > Hi Daniel,
> >
> > > > > > >>> The solution:
> > > > > > >>> - To ensure full framerate, the Guest compositor has to start 
> > > > > > >>> its repaint
> cycle
> > > > > (including
> > > > > > >>> the 9 ms wait) when the Host compositor sends the frame 
> > > > > > >>> callback event to
> its
> > > > > clients.
> > > > > > >>> In order for this to happen, the dma-fence that the Guest KMS 
> > > > > > >>> waits on --
> before
> > > > > sending
> > > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release 
> > > > > > >>> event. This
> means
> > > that,
> > > > > the
> > > > > > >>> Guest compositor has to be forced to use a new buffer for its 
> > > > > > >>> next repaint
> cycle
> > > > > when it
> > > > > > >>> gets a pageflip completion.
> > > > > > >>
> > > > > > >> Is that really the only solution?
> > > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > > But I think none of them are as compelling as this one.
> > > > > > >
> > > > > > >>
> > > > > > >> If we fix the event timestamps so that both guest and host use 
> > > > > > >> the same
> > > > > > >> timestamp, but then the guest starts 5ms (or something like 
> > > > > > >> that) earlier,
> > > > > > >> then things should work too? I.e.
> > > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > > >>
> > > > > > >> Ofc this only works if the frametimes we hand out to both match 
> > > > > > >> _exactly_
> > > > > > >> and are as high-precision as the ones on the host side. Which 
> > > > > > >> for many gpu
> > > > > > >> drivers at least is the case, and all the ones you care about 
> > > > > > >> for sure :-)
> > > > > > >>
> > > > > > >> But if the frametimes the guest receives are the no_vblank fake 
> > > > > > >> ones, then
> > > > > > >> they'll be all over the place and this carefully tuned 
> > > > > > >> low-latency redraw
> > > > > > >> loop falls apart. Aside from the fact that without tuning the 
> > > > > > >> guests to
> > > > > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > > > > >> (except
> > > > > > >> when the timing wobbliness in the guest is big enough by chance 
> > > > > > >> to make
> > > > > > >> the deadline on the oddball frame).
> > > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > > timestamps as we
> don't
> > > > > > > share these between the Guest and the Host. It does not seem to 
> > > > > > > be causing any
> > > other
> > > > > > > problems so far but we did try the experiment you mentioned 
> > > > > > > (i.e., adjusting
> the
> > > > > delays)
> > > > > > > and it works. However, this patch series is meant to fix the 
> > > > > > > issue without
> having to
> > > > > tweak
> > > > > > > anything (delays) because we can't do this for every compositor 
> > > > > > > out there.
> > > > > >
> > > > > > Maybe there could be a mechanism which allows the compositor in the 
> > > > > > guest to
> > > > > automatically adjust its repaint cycle as needed.
> > > > > >
> > > > > > This might even be possible without requiring changes in each 
> > > > > > compositor, by
> > > adjusting
> > > > > the vertical blank periods in the guest to be aligned with the host 
> > > > > compositor
> repaint
> > > > > cycles. Not sure about that though.
> > > > > >
> > > > > > Even if not, both this series or making it possible to queue 
> > > > > > multiple flips require
> > > > > corresponding changes in each compositor as well to have any effect.
> > > > >
> > > > > Yeah from all the discussions and tests done it sounds even with a
> > > > > deeper queue we have big coordination issues between the guest and
> > > > > host compositor (like the example that the guest is now rendering at
> > > > > 90fps instead of 60fps like the host).
> > > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. 
> > > > That 90 FPS vs
> > > > 60 FPS problem is a completely different issue that is associated with 
> > > > Qemu GTK UI
> > > > backend. With the GTK backend -- and also with SDL backend -- we Blit 
> > > > the Guest
> > > > scanout FB onto one of the backbuffers managed by EGL.
> > > >
> > > > I am trying to add a new Qemu Wayland UI backend so that we can 
> > > > eliminate that
> Blit
> > > > and thereby have a truly zero-copy solution. And, this is where I am 
> > > > running into the
> > > > halved frame-rate issue -- the current problem.
> > >
> > > Yes, that's what I referenced. But I disagree that it's a different
> > > problem. The underlying problem in both cases is that the guest and host
> > > compositor free-wheel instead of rendering in sync. It's just that
> > > depending upon how exactly the flip completion event on the guest side
> > > plays out you either get guest rendering that's faster than the host-side 
> > > 60fps, or guest rendering that's much slower than the host-side 60fps.

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-09 Thread Daniel Vetter
On Fri, Aug 06, 2021 at 07:27:13AM +, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > > > > >>> The solution:
> > > > > >>> - To ensure full framerate, the Guest compositor has to start 
> > > > > >>> its repaint cycle
> > > > (including
> > > > > >>> the 9 ms wait) when the Host compositor sends the frame callback 
> > > > > >>> event to its
> > > > clients.
> > > > > >>> In order for this to happen, the dma-fence that the Guest KMS 
> > > > > >>> waits on -- before
> > > > sending
> > > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release 
> > > > > >>> event. This means
> > that,
> > > > the
> > > > > >>> Guest compositor has to be forced to use a new buffer for its 
> > > > > >>> next repaint cycle
> > > > when it
> > > > > >>> gets a pageflip completion.
> > > > > >>
> > > > > >> Is that really the only solution?
> > > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > > But I think none of them are as compelling as this one.
> > > > > >
> > > > > >>
> > > > > >> If we fix the event timestamps so that both guest and host use the 
> > > > > >> same
> > > > > >> timestamp, but then the guest starts 5ms (or something like that) 
> > > > > >> earlier,
> > > > > >> then things should work too? I.e.
> > > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > > >>
> > > > > >> Ofc this only works if the frametimes we hand out to both match 
> > > > > >> _exactly_
> > > > > >> and are as high-precision as the ones on the host side. Which for 
> > > > > >> many gpu
> > > > > >> drivers at least is the case, and all the ones you care about for 
> > > > > >> sure :-)
> > > > > >>
> > > > > >> But if the frametimes the guest receives are the no_vblank fake 
> > > > > >> ones, then
> > > > > >> they'll be all over the place and this carefully tuned low-latency 
> > > > > >> redraw
> > > > > >> loop falls apart. Aside from the fact that without tuning the 
> > > > > >> guests to
> > > > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > > > >> (except
> > > > > >> when the timing wobbliness in the guest is big enough by chance to 
> > > > > >> make
> > > > > >> the deadline on the oddball frame).
> > > > > > [Kasireddy, Vivek] The Guest and Host use different event 
> > > > > > timestamps as we don't
> > > > > > share these between the Guest and the Host. It does not seem to be 
> > > > > > causing any
> > other
> > > > > > problems so far but we did try the experiment you mentioned (i.e., 
> > > > > > adjusting the
> > > > delays)
> > > > > > and it works. However, this patch series is meant to fix the issue 
> > > > > > without having to
> > > > tweak
> > > > > > anything (delays) because we can't do this for every compositor out 
> > > > > > there.
> > > > >
> > > > > Maybe there could be a mechanism which allows the compositor in the 
> > > > > guest to
> > > > automatically adjust its repaint cycle as needed.
> > > > >
> > > > > This might even be possible without requiring changes in each 
> > > > > compositor, by
> > adjusting
> > > > the vertical blank periods in the guest to be aligned with the host 
> > > > compositor repaint
> > > > cycles. Not sure about that though.
> > > > >
> > > > > Even if not, both this series or making it possible to queue multiple 
> > > > > flips require
> > > > corresponding changes in each compositor as well to have any effect.
> > > >
> > > > Yeah from all the discussions and tests done it sounds even with a
> > > > deeper queue we have big coordination issues between the guest and
> > > > host compositor (like the example that the guest is now rendering at
> > > > 90fps instead of 60fps like the host).
> > > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. 
> > > That 90 FPS vs
> > > 60 FPS problem is a completely different issue that is associated with 
> > > Qemu GTK UI
> > > backend. With the GTK backend -- and also with SDL backend -- we Blit the 
> > > Guest
> > > scanout FB onto one of the backbuffers managed by EGL.
> > >
> > > I am trying to add a new Qemu Wayland UI backend so that we can eliminate 
> > > that Blit
> > > and thereby have a truly zero-copy solution. And, this is there I am 
> > > running into the
> > > halved frame-rate issue -- the current problem.
> > 
> > Yes, that's what I referenced. But I disagree that it's a different
> > problem. The underlying problem in both cases is that the guest and host
> > compositor free-wheel instead of rendering in sync. It's just that
> > depending upon how exactly the flip completion event on the guest side
> > plays out you either get guest rendering that's faster than the host-side
> > 60fps, or guest rendering that's much slower than the host-side 60fps.
> [Kasireddy, Vivek] That used to be the case before we added a synchronization
> mechanism to the GTK 

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-06 Thread Kasireddy, Vivek
Hi Daniel,

> > > > >>> The solution:
> > > > >>> - To ensure full framerate, the Guest compositor has to start its 
> > > > >>> repaint cycle
> > > (including
> > > > >>> the 9 ms wait) when the Host compositor sends the frame callback 
> > > > >>> event to its
> > > clients.
> > > > >>> In order for this to happen, the dma-fence that the Guest KMS waits 
> > > > >>> on -- before
> > > sending
> > > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. 
> > > > >>> This means
> that,
> > > the
> > > > >>> Guest compositor has to be forced to use a new buffer for its next 
> > > > >>> repaint cycle
> > > when it
> > > > >>> gets a pageflip completion.
> > > > >>
> > > > >> Is that really the only solution?
> > > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > > But I think none of them are as compelling as this one.
> > > > >
> > > > >>
> > > > >> If we fix the event timestamps so that both guest and host use the 
> > > > >> same
> > > > >> timestamp, but then the guest starts 5ms (or something like that) 
> > > > >> earlier,
> > > > >> then things should work too? I.e.
> > > > >> - host compositor starts at (previous_frametime + 9ms)
> > > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > > >>
> > > > >> Ofc this only works if the frametimes we hand out to both match 
> > > > >> _exactly_
> > > > >> and are as high-precision as the ones on the host side. Which for 
> > > > >> many gpu
> > > > >> drivers at least is the case, and all the ones you care about for 
> > > > >> sure :-)
> > > > >>
> > > > >> But if the frametimes the guest receives are the no_vblank fake 
> > > > >> ones, then
> > > > >> they'll be all over the place and this carefully tuned low-latency 
> > > > >> redraw
> > > > >> loop falls apart. Aside from the fact that without tuning the 
> > > > >> guests to
> > > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > > >> (except
> > > > >> when the timing wobbliness in the guest is big enough by chance to 
> > > > >> make
> > > > >> the deadline on the oddball frame).
> > > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps 
> > > > > as we don't
> > > > > share these between the Guest and the Host. It does not seem to be 
> > > > > causing any
> other
> > > > > problems so far but we did try the experiment you mentioned (i.e., 
> > > > > adjusting the
> > > delays)
> > > > > and it works. However, this patch series is meant to fix the issue 
> > > > > without having to
> > > tweak
> > > > > anything (delays) because we can't do this for every compositor out 
> > > > > there.
> > > >
> > > > Maybe there could be a mechanism which allows the compositor in the 
> > > > guest to
> > > automatically adjust its repaint cycle as needed.
> > > >
> > > > This might even be possible without requiring changes in each 
> > > > compositor, by
> adjusting
> > > the vertical blank periods in the guest to be aligned with the host 
> > > compositor repaint
> > > cycles. Not sure about that though.
> > > >
> > > > Even if not, both this series or making it possible to queue multiple 
> > > > flips require
> > > corresponding changes in each compositor as well to have any effect.
> > >
> > > Yeah from all the discussions and tests done it sounds even with a
> > > deeper queue we have big coordination issues between the guest and
> > > host compositor (like the example that the guest is now rendering at
> > > 90fps instead of 60fps like the host).
> > [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 
> > 90 FPS vs
> > 60 FPS problem is a completely different issue that is associated with Qemu 
> > GTK UI
> > backend. With the GTK backend -- and also with SDL backend -- we Blit the 
> > Guest
> > scanout FB onto one of the backbuffers managed by EGL.
> >
> > I am trying to add a new Qemu Wayland UI backend so that we can eliminate 
> > that Blit
> > and thereby have a truly zero-copy solution. And this is where I am 
> > running into the
> > halved frame-rate issue -- the current problem.
> 
> Yes, that's what I referenced. But I disagree that it's a different
> problem. The underlying problem in both cases is that the guest and host
> compositor free-wheel instead of rendering in sync. It's just that
> depending upon how exactly the flip completion event on the guest side
> plays out you either get guest rendering that's faster than the host-side
> 60fps, or guest rendering that's much slower than the host-side 60fps.
[Kasireddy, Vivek] That used to be the case before we added a synchronization
mechanism to the GTK UI backend that uses a sync file. After adding this
and making the Guest wait until this sync file fd on the Host is signaled, we
consistently get 60 FPS because the flip completion event for the Guest is
directly tied to the signaling of the sync file in this particular case (GTK 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-05 Thread Daniel Vetter
On Thu, Aug 05, 2021 at 04:15:27AM +, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > > >>> The solution:
> > > >>> - To ensure full framerate, the Guest compositor has to start its 
> > > >>> repaint cycle
> > (including
> > > >>> the 9 ms wait) when the Host compositor sends the frame callback 
> > > >>> event to its
> > clients.
> > > >>> In order for this to happen, the dma-fence that the Guest KMS waits 
> > > >>> on -- before
> > sending
> > > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. 
> > > >>> This means that,
> > the
> > > >>> Guest compositor has to be forced to use a new buffer for its next 
> > > >>> repaint cycle
> > when it
> > > >>> gets a pageflip completion.
> > > >>
> > > >> Is that really the only solution?
> > > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > > But I think none of them are as compelling as this one.
> > > >
> > > >>
> > > >> If we fix the event timestamps so that both guest and host use the same
> > > >> timestamp, but then the guest starts 5ms (or something like that) 
> > > >> earlier,
> > > >> then things should work too? I.e.
> > > >> - host compositor starts at (previous_frametime + 9ms)
> > > >> - guest compositor starts at (previous_frametime + 4ms)
> > > >>
> > > >> Ofc this only works if the frametimes we hand out to both match 
> > > >> _exactly_
> > > >> and are as high-precision as the ones on the host side. Which for many 
> > > >> gpu
> > > >> drivers at least is the case, and all the ones you care about for sure 
> > > >> :-)
> > > >>
> > > >> But if the frametimes the guest receives are the no_vblank fake ones, 
> > > >> then
> > > >> they'll be all over the place and this carefully tuned low-latency 
> > > >> redraw
> > > >> loop falls apart. Aside from the fact that without tuning the guests 
> > > >> to
> > > >> be earlier than the hosts, you're guaranteed to miss every frame 
> > > >> (except
> > > >> when the timing wobbliness in the guest is big enough by chance to make
> > > >> the deadline on the oddball frame).
> > > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as 
> > > > we don't
> > > > share these between the Guest and the Host. It does not seem to be 
> > > > causing any other
> > > > problems so far but we did try the experiment you mentioned (i.e., 
> > > > adjusting the
> > delays)
> > > > and it works. However, this patch series is meant to fix the issue 
> > > > without having to
> > tweak
> > > > anything (delays) because we can't do this for every compositor out 
> > > > there.
> > >
> > > Maybe there could be a mechanism which allows the compositor in the guest 
> > > to
> > automatically adjust its repaint cycle as needed.
> > >
> > > This might even be possible without requiring changes in each compositor, 
> > > by adjusting
> > the vertical blank periods in the guest to be aligned with the host 
> > compositor repaint
> > cycles. Not sure about that though.
> > >
> > > Even if not, both this series or making it possible to queue multiple 
> > > flips require
> > corresponding changes in each compositor as well to have any effect.
> > 
> > Yeah from all the discussions and tests done it sounds even with a
> > deeper queue we have big coordination issues between the guest and
> > host compositor (like the example that the guest is now rendering at
> > 90fps instead of 60fps like the host).
> [Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 
> FPS vs 
> 60 FPS problem is a completely different issue that is associated with Qemu 
> GTK UI
> backend. With the GTK backend -- and also with SDL backend -- we Blit the 
> Guest
> scanout FB onto one of the backbuffers managed by EGL. 
> 
> I am trying to add a new Qemu Wayland UI backend so that we can eliminate 
> that Blit
> and thereby have a truly zero-copy solution. And this is where I am running 
> into the 
> halved frame-rate issue -- the current problem.

Yes, that's what I referenced. But I disagree that it's a different
problem. The underlying problem in both cases is that the guest and host
compositor free-wheel instead of rendering in sync. It's just that
depending upon how exactly the flip completion event on the guest side
plays out you either get guest rendering that's faster than the host-side
60fps, or guest rendering that's much slower than the host-side 60fps.

The fundamental problem in both cases is that they don't run in lockstep.
If you fix that -- most likely through fixing the timestamps and event
reporting -- you should be able to fix both bugs.
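A purely illustrative sketch of what "fixing the timestamps" could mean in practice: snap every guest flip-completion time onto the host's vblank grid, so that both compositors schedule against the same evenly spaced frame times. The helper name and parameters below are made up for illustration; this is not the actual DRM vblank helper code.

```c
#include <assert.h>
#include <stdint.h>

/* Snap a raw completion timestamp (ns) onto the host's vblank grid so
 * the guest sees the same evenly spaced frame times as the host.
 * base_ns is the timestamp of a known host vblank, period_ns the host
 * refresh period; both names are hypothetical. */
static uint64_t snap_to_vblank_grid(uint64_t raw_ns, uint64_t base_ns,
                                    uint64_t period_ns)
{
    uint64_t elapsed = raw_ns - base_ns;
    /* Round to the nearest whole vblank period. */
    uint64_t ticks = (elapsed + period_ns / 2) / period_ns;

    return base_ns + ticks * period_ns;
}
```

With a shared grid like this, jitter in when the guest's flip completion is delivered no longer shows up in the timestamps the guest compositor schedules against.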

> > Hence my gut feeling reaction that first we need to get these two
> > compositors aligned in their timings, which probably needs
> > consistent vblank periods/timestamps across them (plus/minus
> > guest/host clocksource fun ofc). Without this any of the next steps
> > will simply not work because there's too much jitter by the time the

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Daniel,

> > >>> The solution:
> > >>> - To ensure full framerate, the Guest compositor has to start its 
> > >>> repaint cycle
> (including
> > >>> the 9 ms wait) when the Host compositor sends the frame callback event 
> > >>> to its
> clients.
> > >>> In order for this to happen, the dma-fence that the Guest KMS waits on 
> > >>> -- before
> sending
> > >>> pageflip completion -- cannot be tied to a wl_buffer.release event. 
> > >>> This means that,
> the
> > >>> Guest compositor has to be forced to use a new buffer for its next 
> > >>> repaint cycle
> when it
> > >>> gets a pageflip completion.
> > >>
> > >> Is that really the only solution?
> > > [Kasireddy, Vivek] There are a few others I mentioned here:
> > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > > But I think none of them are as compelling as this one.
> > >
> > >>
> > >> If we fix the event timestamps so that both guest and host use the same
> > >> timestamp, but then the guest starts 5ms (or something like that) 
> > >> earlier,
> > >> then things should work too? I.e.
> > >> - host compositor starts at (previous_frametime + 9ms)
> > >> - guest compositor starts at (previous_frametime + 4ms)
> > >>
> > >> Ofc this only works if the frametimes we hand out to both match _exactly_
> > >> and are as high-precision as the ones on the host side. Which for many 
> > >> gpu
> > >> drivers at least is the case, and all the ones you care about for sure 
> > >> :-)
> > >>
> > >> But if the frametimes the guest receives are the no_vblank fake ones, 
> > >> then
> > >> they'll be all over the place and this carefully tuned low-latency redraw
> > >> loop falls apart. Aside from the fact that without tuning the guests to
> > >> be earlier than the hosts, you're guaranteed to miss every frame (except
> > >> when the timing wobbliness in the guest is big enough by chance to make
> > >> the deadline on the oddball frame).
> > > [Kasireddy, Vivek] The Guest and Host use different event timestamps as 
> > > we don't
> > > share these between the Guest and the Host. It does not seem to be 
> > > causing any other
> > > problems so far but we did try the experiment you mentioned (i.e., 
> > > adjusting the
> delays)
> > > and it works. However, this patch series is meant to fix the issue 
> > > without having to
> tweak
> > > anything (delays) because we can't do this for every compositor out there.
> >
> > Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> >
> > This might even be possible without requiring changes in each compositor, 
> > by adjusting
> the vertical blank periods in the guest to be aligned with the host 
> compositor repaint
> cycles. Not sure about that though.
> >
> > Even if not, both this series or making it possible to queue multiple flips 
> > require
> corresponding changes in each compositor as well to have any effect.
> 
> Yeah from all the discussions and tests done it sounds even with a
> deeper queue we have big coordination issues between the guest and
> host compositor (like the example that the guest is now rendering at
> 90fps instead of 60fps like the host).
[Kasireddy, Vivek] Oh, I think you are referring to my reply to Gerd. That 90 
FPS vs 
60 FPS problem is a completely different issue that is associated with Qemu GTK 
UI
backend. With the GTK backend -- and also with SDL backend -- we Blit the Guest
scanout FB onto one of the backbuffers managed by EGL. 

I am trying to add a new Qemu Wayland UI backend so that we can eliminate that 
Blit
and thereby have a truly zero-copy solution. And this is where I am running 
into the 
halved frame-rate issue -- the current problem.

> 
> Hence my gut feeling reaction that first we need to get these two
> compositors aligned in their timings, which probably needs
> consistent vblank periods/timestamps across them (plus/minus
> guest/host clocksource fun ofc). Without this any of the next steps
> will simply not work because there's too much jitter by the time the
> guest compositor gets the flip completion events.
[Kasireddy, Vivek] Timings are not a problem and do not significantly
affect the repaint cycles from what I have seen so far.

> 
> Once we have solid events I think we should look into statically
> tuning guest/host compositor deadlines (like you've suggested in a
> bunch of places) to consistently make that deadline and hit 60 fps.
> With that we can then look into tuning this automatically and what to
> do when e.g. switching between copying and zero-copy on the host side
> (which might be needed in some cases) and how to handle all that.
[Kasireddy, Vivek] As I confirm here: 
https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_984065
tweaking the deadlines works (i.e., we get 60 FPS) as we expect. However,
I feel that this zero-copy solution I am trying to create should be independent
of compositors' deadlines, delays or other scheduling parameters.

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Daniel Vetter
On Tue, Aug 3, 2021 at 9:34 AM Michel Dänzer  wrote:
> On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote:
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto 
> >>> a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- 
> >>> without making any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer 
> >>> at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest 
> >>> compositor's buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the 
> >>> previous buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor 
> >>> takes 10-12 ms to
> >>> submit a new buffer and given the fact that the Host compositor waits 
> >>> only for 9 ms, the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved 
> >>> frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start its 
> >>> repaint cycle (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to 
> >>> its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- 
> >>> before sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This 
> >>> means that, the
> >>> Guest compositor has to be forced to use a new buffer for its next 
> >>> repaint cycle when it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside from the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we 
> > don't
> > share these between the Guest and the Host. It does not seem to be causing 
> > any other
> > problems so far but we did try the experiment you mentioned (i.e., 
> > adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without 
> > having to tweak
> > anything (delays) because we can't do this for every compositor out there.
>
> Maybe there could be a mechanism which allows the compositor in the guest to 
> automatically adjust its repaint cycle as needed.
>
> This might even be possible without requiring changes in each compositor, by 
> adjusting the vertical blank periods in the guest to be aligned with the host 
> compositor repaint cycles. Not sure about that though.
>
> Even if not, both this series or making it possible to queue multiple flips 
> require corresponding changes in each compositor as well to have any effect.

Yeah from all the discussions and tests done it sounds even with a
deeper queue we have big coordination issues between the guest and
host compositor (like the example that the guest is now rendering at
90fps instead of 60fps like the host).

Hence my gut feeling reaction that first we need to get these two
compositors aligned in their timings, which probably needs
consistent vblank periods/timestamps across them (plus/minus
guest/host clocksource fun ofc). Without this any of the next steps
will simply not work because there's too much jitter by the time the
guest compositor gets the flip completion events.

Once we have solid events I think we should look into statically
tuning guest/host compositor deadlines (like you've suggested in a
bunch of places) to consistently make that deadline and hit 60 fps.
With that we can then look into tuning this automatically and what to
do when e.g. switching between copying and zero-copy on the host side
(which might be needed in some cases) and how to handle all that.
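The static tuning described above (host samples client buffers some milliseconds into each shared frame period; the guest starts its repaint earlier) reduces to simple deadline arithmetic. A toy model with illustrative numbers only, not compositor code:

```c
#include <assert.h>

/* Which host refresh cycle (0, 1, 2, ...) picks up a guest frame that
 * finishes finish_ms after the shared frame time, when the host samples
 * client buffers deadline_ms into each period_ms refresh cycle.
 * Returning 0 means the frame makes the current cycle; 1 means it slips
 * a whole refresh (the halved-frame-rate case). */
static int host_cycle_for_frame(double finish_ms, double deadline_ms,
                                double period_ms)
{
    int cycle = 0;

    while (finish_ms > cycle * period_ms + deadline_ms)
        cycle++;
    return cycle;
}
```

E.g. a guest that starts 5 ms before a 9 ms host deadline and renders in 4 ms finishes at 8 ms and lands in cycle 0; a guest that only starts at the deadline and needs 11 ms finishes at 20 ms and slips into cycle 1.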

Only when that all shows that we just can't hit 60fps consistently and
really need 3 buffers in flight should we look at deeper kms queues.
And then we really need to implement them properly and not 

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Gerd,

> 
> > > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> > > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> > > think for the page-flip case the host (aka qemu) doesn't get the
> > > "wait until old framebuffer is not in use any more" right yet.
> > [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch 
> > series:
> > https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
> > we do create a sync file fd -- after the Blit -- and wait (adding it to 
> > Qemu's main
> > event loop) for it to ensure that the Guest scanout FB is no longer in use on 
> > the Host.
> > This mechanism works in a similar way for both the frontbuffer DIRTYFB case 
> > and
> > also the double-buffer case.
> 
> Well, we don't explicitly wait on the old framebuffer.  Not fully sure
> this is actually needed, maybe the command ordering (SET_SCANOUT goes
> first) is enough.
[Kasireddy, Vivek] When the sync file fd is signaled, the new FB can be 
considered done/free
on the Host; and, when this new FB becomes the old FB -- after another FB is 
submitted
by the Guest -- we don't need to explicitly wait as we already did that in the 
previous
cycle. 

Strictly speaking, in the double-buffered Guest case, we should be waiting for 
the
sync file fd of the old FB and not the new one. However, if we do this, we saw 
that
the Guest will render faster (~90 FPS) than what the Host can consume (~60 FPS)
resulting in unnecessary GPU cycles. And, in addition, we can't be certain about
whether a Guest is using double-buffering or single as we noticed that Windows
Guests tend to switch between single and double-buffering at runtime based on
the damage, etc.
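The trade-off described above -- waiting on the new FB's sync file throttles the Guest to the Host's rate, while waiting only on the old FB lets a double-buffered Guest run render-bound -- can be sketched as back-of-the-envelope steady-state arithmetic. This is a model of the behaviour, not Qemu or guest KMS code:

```c
#include <assert.h>

/* Steady-state guest frame time (ms) when the guest blocks until the
 * frame it just submitted is consumed by the host: it can never outrun
 * the host's refresh period. */
static double frame_time_wait_new_fb(double render_ms, double host_period_ms)
{
    return render_ms > host_period_ms ? render_ms : host_period_ms;
}

/* With two buffers in flight and a wait only on the *old* FB, the older
 * frame is long since consumed, so the guest runs render-bound. */
static double frame_time_wait_old_fb(double render_ms)
{
    return render_ms;
}
```

With an ~11 ms guest render time against a 60 Hz host, the first strategy gives roughly 60 FPS and the second roughly 90 FPS -- matching the numbers observed above.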

> 
> > > So we'll need a host-side fix for that and a guest-side fix to switch
> > > from a blocking wait on the fence to vblank events.
> > [Kasireddy, Vivek] Do you see any concerns with the blocking wait?
> 
> Well, it's sync vs. async for userspace.
> 
> With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic
> version of it) will return when the host is done.
> 
> Without the blocking wait the userspace ioctl will return right away and
> userspace can do something else until the host is done (and the vblank
> event is sent to notify userspace).
[Kasireddy, Vivek] Right, but upstream Weston -- and I am guessing Mutter as 
well -- 
almost always choose DRM_MODE_ATOMIC_NONBLOCK. In this case, the
atomic ioctl call would not block and the blocking wait will instead happen in 
the
commit_work/commit_tail workqueue thread.
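A toy single-process model of what DRM_MODE_ATOMIC_NONBLOCK means for callers: the ioctl path only queues the commit, while the blocking fence wait and the (fake) vblank event move into a worker, analogous to the commit_work/commit_tail workqueue. The pthread here is only a stand-in for the kernel workqueue, and the sleep stands in for the host-side fence wait:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int vblank_delivered;

/* Stand-in for commit_tail: the blocking wait on the host-side fence
 * happens here, after which the fake vblank event is sent. */
static void *commit_tail_worker(void *arg)
{
    (void)arg;
    usleep(20 * 1000);                  /* pretend to wait on the fence */
    atomic_store(&vblank_delivered, 1); /* fake vblank to userspace */
    return NULL;
}

/* Stand-in for the ATOMIC ioctl with DRM_MODE_ATOMIC_NONBLOCK: queue
 * the commit and return immediately.  Returns 1 if the caller got
 * control back before the vblank event was delivered. */
static int nonblocking_commit_returns_early(void)
{
    pthread_t worker;
    int returned_early;

    atomic_store(&vblank_delivered, 0);
    pthread_create(&worker, NULL, commit_tail_worker, NULL);
    returned_early = !atomic_load(&vblank_delivered); /* "ioctl" returns */
    pthread_join(worker, NULL);
    return returned_early && atomic_load(&vblank_delivered);
}
```

This is why the compositor itself does not stall: the wait Vivek describes happens off the ioctl path, and userspace only sees the deferred vblank event.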

> 
> > And, are you
> > suggesting that we use a vblank timer?
> 
> I think we should send the vblank event when the RESOURCE_FLUSH fence
> signals the host is done.
[Kasireddy, Vivek] That is how it works now:

drm_atomic_helper_commit_planes(dev, old_state, 0);
drm_atomic_helper_commit_modeset_enables(dev, old_state);
drm_atomic_helper_fake_vblank(old_state);

The blocking wait is in the plane_update hook called by
drm_atomic_helper_commit_planes() and immediately after that the fake
vblank is sent.

Thanks,
Vivek
> 
> take care,
>   Gerd



RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-04 Thread Kasireddy, Vivek
Hi Michel,

> >
> >>> The goal:
> >>> - Maintain full framerate even when the Guest scanout FB is flipped onto 
> >>> a hardware
> >> plane
> >>> on the Host -- regardless of either compositor's scheduling policy -- 
> >>> without making
> any
> >>> copies and ensuring that both Host and Guest are not accessing the buffer 
> >>> at the same
> >> time.
> >>>
> >>> The problem:
> >>> - If the Host compositor flips the client's buffer (in this case Guest 
> >>> compositor's
> buffer)
> >>> onto a hardware plane, then it can send a wl_buffer.release event for the 
> >>> previous
> buffer
> >>> only after it gets a pageflip completion. And, if the Guest compositor 
> >>> takes 10-12 ms
> to
> >>> submit a new buffer and given the fact that the Host compositor waits 
> >>> only for 9 ms,
> the
> >>> Guest compositor will miss the Host's repaint cycle resulting in halved 
> >>> frame-rate.
> >>>
> >>> The solution:
> >>> - To ensure full framerate, the Guest compositor has to start its 
> >>> repaint cycle
> (including
> >>> the 9 ms wait) when the Host compositor sends the frame callback event to 
> >>> its clients.
> >>> In order for this to happen, the dma-fence that the Guest KMS waits on -- 
> >>> before
> sending
> >>> pageflip completion -- cannot be tied to a wl_buffer.release event. This 
> >>> means that,
> the
> >>> Guest compositor has to be forced to use a new buffer for its next 
> >>> repaint cycle when
> it
> >>> gets a pageflip completion.
> >>
> >> Is that really the only solution?
> > [Kasireddy, Vivek] There are a few others I mentioned here:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> > But I think none of them are as compelling as this one.
> >
> >>
> >> If we fix the event timestamps so that both guest and host use the same
> >> timestamp, but then the guest starts 5ms (or something like that) earlier,
> >> then things should work too? I.e.
> >> - host compositor starts at (previous_frametime + 9ms)
> >> - guest compositor starts at (previous_frametime + 4ms)
> >>
> >> Ofc this only works if the frametimes we hand out to both match _exactly_
> >> and are as high-precision as the ones on the host side. Which for many gpu
> >> drivers at least is the case, and all the ones you care about for sure :-)
> >>
> >> But if the frametimes the guest receives are the no_vblank fake ones, then
> >> they'll be all over the place and this carefully tuned low-latency redraw
> >> loop falls apart. Aside from the fact that without tuning the guests to
> >> be earlier than the hosts, you're guaranteed to miss every frame (except
> >> when the timing wobbliness in the guest is big enough by chance to make
> >> the deadline on the oddball frame).
> > [Kasireddy, Vivek] The Guest and Host use different event timestamps as we 
> > don't
> > share these between the Guest and the Host. It does not seem to be causing 
> > any other
> > problems so far but we did try the experiment you mentioned (i.e., 
> > adjusting the delays)
> > and it works. However, this patch series is meant to fix the issue without 
> > having to tweak
> > anything (delays) because we can't do this for every compositor out there.
> 
> Maybe there could be a mechanism which allows the compositor in the guest to
> automatically adjust its repaint cycle as needed.
> 
> This might even be possible without requiring changes in each compositor, by 
> adjusting
> the vertical blank periods in the guest to be aligned with the host 
> compositor repaint
> cycles. Not sure about that though.
[Kasireddy, Vivek] The problem really is that the Guest compositor -- or any 
other compositor
for that matter -- assumes that after a pageflip completion, the old buffer 
submitted in the
previous flip is free and can be reused again. I think this is a guarantee 
given by KMS. If we have
to enforce this, we (Guest KMS) have to wait until the Host compositor sends a 
wl_buffer.release --
which can only happen after the Host gets a pageflip completion, assuming it uses
hardware planes.
From this point onwards, the Guest compositor only has 9 ms (in the case of 
Weston) -- or less
based on the Host compositor's scheduling policy -- to submit a new frame.

Although we can adjust the repaint window of the Guest compositor to ensure
submission within 9 ms, or increase the delay on the Host, these tweaks are just
heuristics. I think in order
to have a generic solution that'll work in all cases means that the Guest 
compositor has to start
its repaint cycle with a new buffer when the Host sends out the frame callback 
event.
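The 9 ms window above is what produces the halved frame rate: if the window only opens at wl_buffer.release (i.e. at the Host's flip completion) and the Guest needs 10-12 ms to submit, every submission slips one full refresh. A rough model, with illustrative numbers only:

```c
#include <assert.h>

/* Achieved guest frame rate when its repaint window of window_ms opens
 * at wl_buffer.release each host refresh of period_ms: making the
 * window hits every refresh, missing it costs a whole extra refresh. */
static double achieved_fps(double render_ms, double window_ms,
                           double period_ms)
{
    double frame_time = render_ms <= window_ms ? period_ms
                                               : 2.0 * period_ms;
    return 1000.0 / frame_time;
}
```

At 60 Hz, an 11 ms submission against a 9 ms window yields ~30 FPS, while an 8 ms submission keeps the full 60 FPS -- which is why starting the Guest's repaint at the frame callback (gaining the whole period) fixes the problem generically.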

> 
> Even if not, both this series or making it possible to queue multiple flips 
> require
> corresponding changes in each compositor as well to have any effect.
[Kasireddy, Vivek] Yes, unfortunately; but the hope is that the Guest KMS can 
do most of
the heavy lifting and keep the changes for the compositors generic enough and 
minimal.

Thanks,
Vivek
> 
> 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Gerd Hoffmann
  Hi,

> > virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> > DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> > think for the page-flip case the host (aka qemu) doesn't get the
> > "wait until old framebuffer is not in use any more" right yet.
> [Kasireddy, Vivek] As you know, with the GTK UI backend and this patch 
> series: 
> https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
> we do create a sync file fd -- after the Blit -- and wait (adding it to 
> Qemu's main
> event loop) for it to ensure that the Guest scanout FB is no longer in use on 
> the Host.
> This mechanism works in a similar way for both the frontbuffer DIRTYFB case and
> also the double-buffer case. 

Well, we don't explicitly wait on the old framebuffer.  Not fully sure
this is actually needed, maybe the command ordering (SET_SCANOUT goes
first) is enough.

> > So we'll need a host-side fix for that and a guest-side fix to switch
> > from a blocking wait on the fence to vblank events.
> [Kasireddy, Vivek] Do you see any concerns with the blocking wait?

Well, it's sync vs. async for userspace.

With the blocking wait the userspace ioctl (PAGE_FLIP or the atomic
version of it) will return when the host is done.

Without the blocking wait the userspace ioctl will return right away and
userspace can do something else until the host is done (and the vblank
event is sent to notify userspace).
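From userspace's side, Gerd's sync-vs-async distinction is essentially blocking on the fence fd versus polling it from an event loop. A minimal illustration, using a pipe as a stand-in for the sync_file fd so no DRM device is needed:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Async style: check the fence fd without blocking (timeout 0); the
 * compositor keeps doing other work until the fd becomes readable,
 * instead of stalling in the flip ioctl. */
static int fence_fd_ready(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    return poll(&p, 1, 0) == 1;
}
```

An event-driven compositor would register such an fd with its main loop and treat readability as the "host is done" notification, rather than blocking until the host finishes.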

> And, are you
> suggesting that we use a vblank timer?

I think we should send the vblank event when the RESOURCE_FLUSH fence
signals the host is done.

take care,
  Gerd



Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Michel Dänzer
On 2021-08-03 8:11 a.m., Kasireddy, Vivek wrote:
> 
>>> The goal:
>>> - Maintain full framerate even when the Guest scanout FB is flipped onto a 
>>> hardware
>> plane
>>> on the Host -- regardless of either compositor's scheduling policy -- 
>>> without making any
>>> copies and ensuring that both Host and Guest are not accessing the buffer 
>>> at the same
>> time.
>>>
>>> The problem:
>>> - If the Host compositor flips the client's buffer (in this case Guest 
>>> compositor's buffer)
>>> onto a hardware plane, then it can send a wl_buffer.release event for the 
>>> previous buffer
>>> only after it gets a pageflip completion. And, if the Guest compositor 
>>> takes 10-12 ms to
>>> submit a new buffer and given the fact that the Host compositor waits only 
>>> for 9 ms, the
>>> Guest compositor will miss the Host's repaint cycle resulting in halved 
>>> frame-rate.
>>>
>>> The solution:
>>> - To ensure full framerate, the Guest compositor has to start its repaint 
>>> cycle (including
>>> the 9 ms wait) when the Host compositor sends the frame callback event to 
>>> its clients.
>>> In order for this to happen, the dma-fence that the Guest KMS waits on -- 
>>> before sending
>>> pageflip completion -- cannot be tied to a wl_buffer.release event. This 
>>> means that, the
>>> Guest compositor has to be forced to use a new buffer for its next repaint 
>>> cycle when it
>>> gets a pageflip completion.
>>
>> Is that really the only solution?
> [Kasireddy, Vivek] There are a few others I mentioned here:
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
> But I think none of them are as compelling as this one.
> 
>>
>> If we fix the event timestamps so that both guest and host use the same
>> timestamp, but then the guest starts 5ms (or something like that) earlier,
>> then things should work too? I.e.
>> - host compositor starts at (previous_frametime + 9ms)
>> - guest compositor starts at (previous_frametime + 4ms)
>>
>> Ofc this only works if the frametimes we hand out to both match _exactly_
>> and are as high-precision as the ones on the host side. Which for many gpu
>> drivers at least is the case, and all the ones you care about for sure :-)
>>
>> But if the frametimes the guest receives are the no_vblank fake ones, then
>> they'll be all over the place and this carefully tuned low-latency redraw
>> loop falls apart. Aside from the fact that without tuning the guests to
>> be earlier than the hosts, you're guaranteed to miss every frame (except
>> when the timing wobbliness in the guest is big enough by chance to make
>> the deadline on the oddball frame).
> [Kasireddy, Vivek] The Guest and Host use different event timestamps as we 
> don't
> share these between the Guest and the Host. It does not seem to be causing 
> any other
> problems so far but we did try the experiment you mentioned (i.e., adjusting 
> the delays)
> and it works. However, this patch series is meant to fix the issue without 
> having to tweak
> anything (delays) because we can't do this for every compositor out there.

Maybe there could be a mechanism which allows the compositor in the guest to 
automatically adjust its repaint cycle as needed.

This might even be possible without requiring changes in each compositor, by 
adjusting the vertical blank periods in the guest to be aligned with the host 
compositor repaint cycles. Not sure about that though.

Even if not, both this series and making it possible to queue multiple flips 
require corresponding changes in each compositor as well to have any effect.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Kasireddy, Vivek
Hi Gerd,

> 
>   Hi,
> 
> > > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > > the host side) the command completes only when the work is actually done
> > > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > > pageflipping), then send vblank events to userspace on command
> > > completion certainly makes sense.
> >
> > Hm how does this all work? At least drm/virtio uses
> > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> > up in the same driver path for everything. Or do you just combine the
> > resource_flush with the flip as needed and let the host side figure it all
> > out? From a quick read of virtgpu_plane.c that seems to be the case ...
> 
> virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> think for the page-flip case the host (aka qemu) doesn't get the
> "wait until old framebuffer is not in use any more" right yet.
[Kasireddy, Vivek] As you know, with the GTK UI backend and this patch series: 
https://lists.nongnu.org/archive/html/qemu-devel/2021-06/msg06745.html
we do create a sync file fd -- after the Blit -- and wait (adding it to Qemu's
main event loop) for it to ensure that the Guest scanout FB is no longer in use
on the Host. This mechanism works in a similar way for both the frontbuffer
DIRTYFB case and the double-buffer case.

The out_fence work is only relevant for the future Wayland UI backend though.
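The "add the fence fd to the main event loop instead of blocking" mechanism can
be illustrated generically, since sync-file fds are pollable. This is a hedged
sketch using Python's selectors module; the function names are hypothetical and
not Qemu's actual API:

```python
import selectors

def watch_fence(selector, fence_fd, on_signaled):
    """Register a pollable fence fd (e.g. a sync_file) with an event loop;
    the callback runs once the fence signals and the fd becomes readable."""
    selector.register(fence_fd, selectors.EVENT_READ, on_signaled)

def dispatch_once(selector, timeout=None):
    """Dispatch ready callbacks without ever blocking the main loop on a
    single fence (analogous to marking the guest scanout FB free again)."""
    for key, _ in selector.select(timeout):
        selector.unregister(key.fd)
        key.data()
```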

> 
> So we'll need a host-side fix for that and a guest-side fix to switch
> from a blocking wait on the fence to vblank events.
[Kasireddy, Vivek] Do you see any concerns with the blocking wait? And, are you
suggesting that we use a vblank timer? Not sure if that would be needed because
it would not align with the render/draw signals used with GTK. And, the DRM core
does send out an event -- immediately after the blocking wait -- to the Guest
compositor as no_vblank=true.

> 
> > Also to make this work we don't just need the fence, we need the timestamp
> > (in a clock domain the guest can correct for ofc) of the host side kms
> > driver flip completion. If you just have the fence then the jitter from
> > going through all the layers will most likely make it unusable.
> 
> Well, there are no timestamps in the virtio-gpu protocol ...
> 
> Also I'm not sure they would be that helpful, any timing is *much* less
> predictable in a virtual machine, especially in case the host machine is
> loaded.
[Kasireddy, Vivek] I agree; I think sharing the Host timestamps with the Guest
or vice-versa may not be useful. We have not run into any problems without
these so far.

Thanks,
Vivek

> 
> take care,
>   Gerd



RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-03 Thread Kasireddy, Vivek
Hi Daniel,

> > > > By separating the OUT_FENCE signalling from pageflip completion allows
> > > > a Guest compositor to start a new repaint cycle with a new buffer
> > > > instead of waiting for the old buffer to be free.
> > > >
> > > > This work is based on the idea/suggestion from Simon and Pekka.
> > > >
> > > > This capability can be a solution for this issue:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > > >
> > > > Corresponding Weston MR:
> > > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > >
> > > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > > code, but well I guess not that much work yet.
> > [Kasireddy, Vivek] Right, it wasn't a lot of work :)
> >
> > >
> > > So maybe I'm not understanding the problem, but I think the fundamental
> > > underlying issue is that with KMS you can have at most 2 buffers
> > > in-flight, due to our queue depth limit of 1 pending flip.
> > [Kasireddy, Vivek] Let me summarize the problem again from the perspective 
> > of
> > both the Host (Weston) and Guest (Weston) compositors assuming a 
> > refresh-rate
> > of 60 -- which implies the Vblank/Vsync is generated every ~16.66 ms.
> > Host compositor:
> > - After a pageflip completion event, it starts its next repaint cycle by 
> > waiting for 9 ms
> > and then submits the atomic commit and at the tail end of its cycle sends a 
> > frame
> > callback
> > event to all its clients (who registered and submitted frames) indicating 
> > to them to
> > start their next redraw  -- giving them at-least ~16 ms to submit a new 
> > frame to be
> > included in its next repaint. Why a configurable 9 ms delay is needed is 
> > explained
> > in Pekka's blog post here:
> > https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html
> >
> > - It'll send a wl_buffer.release event for a client submitted previous 
> > buffer only
> > when the client has submitted a new buffer and:
> > a) When it hasn't started its repaint cycle yet OR
> > b) When it clears its old state after it gets a pageflip completion event 
> > -- if it had
> > flipped the client's buffer onto a hardware plane.
> >
> > Guest compositor:
> > - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms 
> > for the
> > Guest compositor to submit a new atomic commit. This time of 10-12 ms 
> > includes the
> > 9 ms wait -- just like the Host compositor -- for its clients to submit new 
> > buffers.
> > - When it gets a pageflip completion, it assumes that the previously 
> > submitted buffer
> > is free for re-use and uses it again -- resulting in the usage of only 2 
> > out of a maximum
> > of 4 backbuffers included as part of the Mesa GBM surface implementation.
> >
> > Guest KMS/Virtio-gpu/Qemu Wayland UI:
> > - Because no_vblank=true for Guest KMS and since the vblank event (which 
> > also serves
> > as the pageflip completion event for user-space) is sent right away after 
> > atomic commit,
> > as Gerd said, we use an internal dma-fence to block/wait the Guest KMS 
> > until we know
> > for
> > sure that the Host is completely done using the buffer. To ensure this, we 
> > signal the
> > dma-fence
> > only after the Host compositor sends a wl_buffer.release event or an 
> > equivalent signal.
> >
> > The goal:
> > - Maintain full framerate even when the Guest scanout FB is flipped onto a 
> > hardware
> > plane
> > on the Host -- regardless of either compositor's scheduling policy -- 
> > without making any
> > copies and ensuring that both Host and Guest are not accessing the buffer 
> > at the same
> > time.
> >
> > The problem:
> > - If the Host compositor flips the client's buffer (in this case Guest 
> > compositor's buffer)
> > onto a hardware plane, then it can send a wl_buffer.release event for the 
> > previous buffer
> > only after it gets a pageflip completion. And, if the Guest compositor 
> > takes 10-12 ms to
> > submit a new buffer and given the fact that the Host compositor waits only 
> > for 9 ms, the
> > Guest compositor will miss the Host's repaint cycle resulting in halved 
> > frame-rate.
> >
> > The solution:
> > - To ensure full framerate, the Guest compositor has to start its repaint 
> > cycle (including
> > the 9 ms wait) when the Host compositor sends the frame callback event to 
> > its clients.
> > In order for this to happen, the dma-fence that the Guest KMS waits on -- 
> > before sending
> > pageflip completion -- cannot be tied to a wl_buffer.release event. This 
> > means that, the
> > Guest compositor has to be forced to use a new buffer for its next repaint 
> > cycle when it
> > gets a pageflip completion.
> 
> Is that really the only solution?
[Kasireddy, Vivek] There are a few others I mentioned here:
https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986572
But I think none of them are as compelling as this one.
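The halved-framerate problem summarized above can be sketched as a toy timing
model, using the numbers quoted in this thread (9 ms host wait, 10-12 ms guest
commit latency); the function and variable names are illustrative only:

```python
FRAME_TIME_MS = 1000 / 60   # ~16.66 ms at 60 Hz
HOST_WAIT_MS = 9            # host repaint delay after pageflip completion
GUEST_LATENCY_MS = 11       # guest takes 10-12 ms to submit a new commit

def frames_presented(total_frames):
    """Count host repaints that include a fresh guest buffer when the guest
    can only restart after wl_buffer.release, which itself arrives only
    after the host's next pageflip completes."""
    presented = 0
    guest_ready_at = 0.0
    for n in range(total_frames):
        host_deadline = n * FRAME_TIME_MS + HOST_WAIT_MS
        if guest_ready_at <= host_deadline:
            presented += 1
            # The guest's next buffer is ready only GUEST_LATENCY_MS after
            # the *next* vblank, since release is tied to the host's flip.
            guest_ready_at = (n + 1) * FRAME_TIME_MS + GUEST_LATENCY_MS
    return presented
```

With an 11 ms guest latency against a 9 ms host wait, every other host frame
misses the new guest buffer, i.e. roughly half the refresh rate.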

> 
> If we fix the event timestamps so that both guest and host use the same
> timestamp, but then 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Daniel Vetter
On Mon, Aug 2, 2021 at 2:51 PM Gerd Hoffmann  wrote:
>
>   Hi,
>
> > > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > > the host side) the command completes only when the work is actually done
> > > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > > pageflipping), then send vblank events to userspace on command
> > > completion certainly makes sense.
> >
> > Hm how does this all work? At least drm/virtio uses
> > drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> > up in the same driver path for everything. Or do you just combine the
> > resource_flush with the flip as needed and let the host side figure it all
> > out? From a quick read of virtgpu_plane.c that seems to be the case ...
>
> virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
> DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
> think for the page-flip case the host (aka qemu) doesn't get the
> "wait until old framebuffer is not in use any more" right yet.

Hm reading the code I think you simply elide the set_scanout if it's
still the same buffer. There's no difference between dirtyfb and an
atomic commit that just hands the damage rects to the driver. At least
if you use the helpers.

> So we'll need a host-side fix for that and a guest-side fix to switch
> from a blocking wait on the fence to vblank events.
>
> > Also to make this work we don't just need the fence, we need the timestamp
> > (in a clock domain the guest can correct for ofc) of the host side kms
> > driver flip completion. If you just have the fence then the jitter from
> > going through all the layers will most likely make it unusable.
>
> Well, there are no timestamps in the virtio-gpu protocol ...
>
> Also I'm not sure they would be that helpful, any timing is *much* less
> predictable in a virtual machine, especially in case the host machine is
> loaded.

Hm yeah if the output is currently not displaying, then the timestamp
is very fake. But if you display you should be able to pass it all
around in both direction. So target vblank (or whatever it's called)
would go from guest to host to host-compositor (over wayland protocol)
to host-side kms, and the timestamp could travel all the way back.

But yeah making this all work correctly is going to be a pile of work.
Also I have no idea how well compositors take it when a kms driver
switches between high-precision timestamps and frame scheduling to the
entirely virtual/vblank-less approach on the fly.
-Daniel

> take care,
>   Gerd
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Gerd Hoffmann
  Hi,

> > That sounds sensible to me.  Fence the virtio commands, make sure (on
> > the host side) the command completes only when the work is actually done
> > not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> > (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> > pageflipping), then send vblank events to userspace on command
> > completion certainly makes sense.
> 
> Hm how does this all work? At least drm/virtio uses
> drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
> up in the same driver path for everything. Or do you just combine the
> resource_flush with the flip as needed and let the host side figure it all
> out? From a quick read of virtgpu_plane.c that seems to be the case ...

virtio_gpu_primary_plane_update() will send RESOURCE_FLUSH only for
DIRTYFB and both SET_SCANOUT + RESOURCE_FLUSH for page-flip, and I
think for the page-flip case the host (aka qemu) doesn't get the
"wait until old framebuffer is not in use any more" right yet.

So we'll need a host-side fix for that and a guest-side fix to switch
from a blocking wait on the fence to vblank events.

> Also to make this work we don't just need the fence, we need the timestamp
> (in a clock domain the guest can correct for ofc) of the host side kms
> driver flip completion. If you just have the fence then the jitter from
> going through all the layers will most likely make it unusable.

Well, there are no timestamps in the virtio-gpu protocol ...

Also I'm not sure they would be that helpful, any timing is *much* less
predictable in a virtual machine, especially in case the host machine is
loaded.

take care,
  Gerd



Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Michel Dänzer
On 2021-08-02 11:06 a.m., Daniel Vetter wrote:
> On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
>> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
>>> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
 On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>> By separating the OUT_FENCE signalling from pageflip completion allows
>> a Guest compositor to start a new repaint cycle with a new buffer
>> instead of waiting for the old buffer to be free. 
>>
>> This work is based on the idea/suggestion from Simon and Pekka.
>>
>> This capability can be a solution for this issue:
>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>
>> Corresponding Weston MR:
>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
>
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
>
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.

 Per
 https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
 IMO the underlying issue is actually that the guest compositor repaint
 cycle is not aligned with the host compositor one. If they were aligned,
 the problem would not occur even without allowing multiple page flips in
 flight, and latency would be lower.
>>>
>>> Yeah my proposal here is under the premise that we do actually need to fix
>>> this with a deeper queue depth.
>>>
> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show exactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
>
> We do cheat a bit right now in so far that we assume they're both the
> same, as in, panel-side latency is currently the compositor's problem to
> figure out.
>
> So for virtual hw I think the timestamp and even completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms
> driver from the host side. Currently that's not done, so this is most
> likely quite broken already (virtio relies on the no-vblank auto event
> sending, which definitely doesn't wait for anything, or I'm completely
> missing something).
>
> I think instead of hacking up some ill-defined 1.5 queue depth support,
> what we should do is support queue depth > 1 properly. So:
>
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
>
> - Aside: virtio should lose all its empty hooks, there's no point in
>   that.
>
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
>
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.

 I'd say there would definitely need to be some kind of signal for the
 display server that it should queue multiple flips, since this is
 normally not desirable for latency. In other words, this wouldn't really
 be useful on bare metal (in contrast to the ability to replace a pending
 flip with a newer one).
>>>
>>> Hm I was thinking that the compositor can tune this. If the round-trip
>>> latency (as measured by events) is too long to get full refresh rate, it
>>> can add more buffers to the queue. That's kinda why I think the returned
>>> event really must be accurate wrt actual display time (and old buffer
>>> release time), so that this computation in the compositor becomes a pretty
>>> simple
>>>
>>> num_buffers = (flip_time - submit_time) / frame_time
>>>
>>> With maybe some rounding up and 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Daniel Vetter
On Fri, Jul 30, 2021 at 03:38:50PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > - We fix virtio to send out the completion event at the end of this entire
> >   pipeline, i.e. virtio code needs to take care of sending out the
> >   crtc_state->event correctly.
> 
> That sounds sensible to me.  Fence the virtio commands, make sure (on
> the host side) the command completes only when the work is actually done
> not only submitted.  Has recently been added to qemu for RESOURCE_FLUSH
> (aka frontbuffer rendering) and doing the same for SET_SCANOUT (aka
> pageflipping), then send vblank events to userspace on command
> completion certainly makes sense.

Hm how does this all work? At least drm/virtio uses
drm_atomic_helper_dirtyfb, so both DIRTYFB ioctl and atomic flips all end
up in the same driver path for everything. Or do you just combine the
resource_flush with the flip as needed and let the host side figure it all
out? From a quick read of virtgpu_plane.c that seems to be the case ...

Also to make this work we don't just need the fence, we need the timestamp
(in a clock domain the guest can correct for ofc) of the host side kms
driver flip completion. If you just have the fence then the jitter from
going through all the layers will most likely make it unusable.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Daniel Vetter
On Mon, Aug 02, 2021 at 10:49:37AM +0200, Michel Dänzer wrote:
> On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
> > On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
> >> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> >>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
>  By separating the OUT_FENCE signalling from pageflip completion allows
>  a Guest compositor to start a new repaint cycle with a new buffer
>  instead of waiting for the old buffer to be free. 
> 
>  This work is based on the idea/suggestion from Simon and Pekka.
> 
>  This capability can be a solution for this issue:
>  https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> 
>  Corresponding Weston MR:
>  https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> >>>
> >>> Uh I kinda wanted to discuss this a bit more before we jump into typing
> >>> code, but well I guess not that much work yet.
> >>>
> >>> So maybe I'm not understanding the problem, but I think the fundamental
> >>> underlying issue is that with KMS you can have at most 2 buffers
> >>> in-flight, due to our queue depth limit of 1 pending flip.
> >>>
> >>> Unfortunately that means for virtual hw where it takes a few more
> >>> steps/vblanks until the framebuffer actually shows up on screen and is
> >>> scanned out, we suffer deeply. The usual fix for that is to drop the
> >>> latency and increase throughput, and have more buffers in-flight. Which
> >>> this patch tries to do.
> >>
> >> Per
> >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
> >> IMO the underlying issue is actually that the guest compositor repaint
> >> cycle is not aligned with the host compositor one. If they were aligned,
> >> the problem would not occur even without allowing multiple page flips in
> >> flight, and latency would be lower.
> > 
> > Yeah my proposal here is under the premise that we do actually need to fix
> > this with a deeper queue depth.
> > 
> >>> Now I think where we go wrong here is that we're trying to hack this up by
> >>> defining different semantics for the out-fence and for the drm-event. Imo
> >>> that's wrong, they're both meant to show exactly the same thing:
> >>> - when is the new frame actually visible to the user (as in, eyeballs in a
> >>>   human head, preferably, not the time when we've handed the buffer off
> >>>   to the virtual hw)
> >>> - when is the previous buffer no longer being used by the scanout hw
> >>>
> >>> We do cheat a bit right now in so far that we assume they're both the
> >>> same, as in, panel-side latency is currently the compositor's problem to
> >>> figure out.
> >>>
> >>> So for virtual hw I think the timestamp and even completion really need to
> >>> happen only when the buffer has been pushed through the entire
> >>> virtualization chain, i.e. ideally we get the timestamp from the kms
> >>> driver from the host side. Currently that's not done, so this is most
> >>> likely quite broken already (virtio relies on the no-vblank auto event
> >>> sending, which definitely doesn't wait for anything, or I'm completely
> >>> missing something).
> >>>
> >>> I think instead of hacking up some ill-defined 1.5 queue depth support,
> >>> what we should do is support queue depth > 1 properly. So:
> >>>
> >>> - Change atomic to support queue depth > 1, this needs to be a per-driver
> >>>   thing due to a bunch of issues in driver code. Essentially drivers must
> >>>   never look at obj->state pointers, and only ever look up state through
> >>>   the passed-in drm_atomic_state * update container.
> >>>
> >>> - Aside: virtio should lose all its empty hooks, there's no point in
> >>>   that.
> >>>
> >>> - We fix virtio to send out the completion event at the end of this entire
> >>>   pipeline, i.e. virtio code needs to take care of sending out the
> >>>   crtc_state->event correctly.
> >>>
> >>> - We probably also want some kind of (maybe per-crtc) recommended queue
> >>>   depth property so compositors know how many buffers to keep in flight.
> >>>   Not sure about that.
> >>
> >> I'd say there would definitely need to be some kind of signal for the
> >> display server that it should queue multiple flips, since this is
> >> normally not desirable for latency. In other words, this wouldn't really
> >> be useful on bare metal (in contrast to the ability to replace a pending
> >> flip with a newer one).
> > 
> > Hm I was thinking that the compositor can tune this. If the round-trip
> > latency (as measured by events) is too long to get full refresh rate, it
> > can add more buffers to the queue. That's kinda why I think the returned
> > event really must be accurate wrt actual display time (and old buffer
> > release time), so that this computation in the compositor becomes a pretty
> > simple
> > 
> > num_buffers = (flip_time - submit_time) / frame_time
> > 
> > With maybe some rounding up and averaging. You can also hit this when your
> > 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Michel Dänzer
On 2021-08-02 9:59 a.m., Daniel Vetter wrote:
> On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
>> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
>>> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
 By separating the OUT_FENCE signalling from pageflip completion allows
 a Guest compositor to start a new repaint cycle with a new buffer
 instead of waiting for the old buffer to be free. 

 This work is based on the idea/suggestion from Simon and Pekka.

 This capability can be a solution for this issue:
 https://gitlab.freedesktop.org/wayland/weston/-/issues/514

 Corresponding Weston MR:
 https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
>>>
>>> Uh I kinda wanted to discuss this a bit more before we jump into typing
>>> code, but well I guess not that much work yet.
>>>
>>> So maybe I'm not understanding the problem, but I think the fundamental
>>> underlying issue is that with KMS you can have at most 2 buffers
>>> in-flight, due to our queue depth limit of 1 pending flip.
>>>
>>> Unfortunately that means for virtual hw where it takes a few more
>>> steps/vblanks until the framebuffer actually shows up on screen and is
>>> scanned out, we suffer deeply. The usual fix for that is to drop the
>>> latency and increase throughput, and have more buffers in-flight. Which
>>> this patch tries to do.
>>
>> Per
>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
>> IMO the underlying issue is actually that the guest compositor repaint
>> cycle is not aligned with the host compositor one. If they were aligned,
>> the problem would not occur even without allowing multiple page flips in
>> flight, and latency would be lower.
> 
> Yeah my proposal here is under the premise that we do actually need to fix
> this with a deeper queue depth.
> 
>>> Now I think where we go wrong here is that we're trying to hack this up by
>>> defining different semantics for the out-fence and for the drm-event. Imo
> >>> that's wrong, they're both meant to show exactly the same thing:
>>> - when is the new frame actually visible to the user (as in, eyeballs in a
> >>>   human head, preferably, not the time when we've handed the buffer off
>>>   to the virtual hw)
>>> - when is the previous buffer no longer being used by the scanout hw
>>>
>>> We do cheat a bit right now in so far that we assume they're both the
>>> same, as in, panel-side latency is currently the compositor's problem to
>>> figure out.
>>>
>>> So for virtual hw I think the timestamp and even completion really need to
>>> happen only when the buffer has been pushed through the entire
>>> virtualization chain, i.e. ideally we get the timestamp from the kms
>>> driver from the host side. Currently that's not done, so this is most
>>> likely quite broken already (virtio relies on the no-vblank auto event
>>> sending, which definitely doesn't wait for anything, or I'm completely
>>> missing something).
>>>
>>> I think instead of hacking up some ill-defined 1.5 queue depth support,
>>> what we should do is support queue depth > 1 properly. So:
>>>
>>> - Change atomic to support queue depth > 1, this needs to be a per-driver
>>>   thing due to a bunch of issues in driver code. Essentially drivers must
>>>   never look at obj->state pointers, and only ever look up state through
>>>   the passed-in drm_atomic_state * update container.
>>>
> >>> - Aside: virtio should lose all its empty hooks, there's no point in
>>>   that.
>>>
>>> - We fix virtio to send out the completion event at the end of this entire
>>>   pipeline, i.e. virtio code needs to take care of sending out the
>>>   crtc_state->event correctly.
>>>
>>> - We probably also want some kind of (maybe per-crtc) recommended queue
>>>   depth property so compositors know how many buffers to keep in flight.
>>>   Not sure about that.
>>
>> I'd say there would definitely need to be some kind of signal for the
>> display server that it should queue multiple flips, since this is
>> normally not desirable for latency. In other words, this wouldn't really
>> be useful on bare metal (in contrast to the ability to replace a pending
>> flip with a newer one).
> 
> Hm I was thinking that the compositor can tune this. If the round-trip
> latency (as measured by events) is too long to get full refresh rate, it
> can add more buffers to the queue. That's kinda why I think the returned
> event really must be accurate wrt actual display time (and old buffer
> release time), so that this computation in the compositor becomes a pretty
> simple
> 
> num_buffers = (flip_time - submit_time) / frame_time
> 
> With maybe some rounding up and averaging. You can also hit this when your
> 3d engine has an extremely deep pipeline (like some of the tiling
> renderers have), where rendering just takes forever, but as long as you keep
> 2 frames in the renderer in-flight you can achieve full refresh rate (at a
> latency cost).
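The compositor-side computation Daniel sketches above could look roughly like
this (a hedged illustration of the num_buffers formula with the suggested
rounding up; the function name is made up):

```python
import math

def needed_queue_depth(flip_time_ms, submit_time_ms, frame_time_ms):
    """Buffers to keep in flight so that a long submit-to-flip round trip
    still sustains full refresh rate: ceil((flip - submit) / frame_time),
    never less than one buffer."""
    latency = flip_time_ms - submit_time_ms
    return max(1, math.ceil(latency / frame_time_ms))
```

For example, a 40 ms round trip at 60 Hz (~16.66 ms frames) would call for
three buffers in flight, while anything under one frame time needs only one.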

As long as a 

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Daniel Vetter
On Mon, Aug 02, 2021 at 06:51:33AM +, Kasireddy, Vivek wrote:
> Hi Daniel,
> 
> > 
> > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > > By separating the OUT_FENCE signalling from pageflip completion allows
> > > a Guest compositor to start a new repaint cycle with a new buffer
> > > instead of waiting for the old buffer to be free.
> > >
> > > This work is based on the idea/suggestion from Simon and Pekka.
> > >
> > > This capability can be a solution for this issue:
> > > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> > >
> > > Corresponding Weston MR:
> > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > 
> > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > code, but well I guess not that much work yet.
> [Kasireddy, Vivek] Right, it wasn't a lot of work :)
> 
> > 
> > So maybe I'm not understanding the problem, but I think the fundamental
> > underlying issue is that with KMS you can have at most 2 buffers
> > in-flight, due to our queue depth limit of 1 pending flip.
> [Kasireddy, Vivek] Let me summarize the problem again from the perspective of
> both the Host (Weston) and Guest (Weston) compositors, assuming a refresh
> rate of 60 Hz -- which implies a vblank/vsync is generated every ~16.66 ms.
> 
> Host compositor:
> - After a pageflip completion event, it starts its next repaint cycle by
> waiting for 9 ms, then submits the atomic commit, and at the tail end of its
> cycle sends a frame callback event to all its clients (who registered and
> submitted frames), indicating to them to start their next redraw -- giving
> them at least ~16 ms to submit a new frame to be included in its next
> repaint. Why a configurable 9 ms delay is needed is explained in Pekka's
> blog post here:
> https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html
> 
> - It'll send a wl_buffer.release event for a client's previously submitted
> buffer only when the client has submitted a new buffer and:
> a) when it hasn't started its repaint cycle yet, OR
> b) when it clears its old state after it gets a pageflip completion event --
> if it had flipped the client's buffer onto a hardware plane.
> 
> Guest compositor:
> - After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms
> for the Guest compositor to submit a new atomic commit. This 10-12 ms
> includes the 9 ms wait -- just like the Host compositor -- for its clients
> to submit new buffers.
> - When it gets a pageflip completion, it assumes that the previously
> submitted buffer is free for re-use and uses it again -- resulting in the
> usage of only 2 out of a maximum of 4 backbuffers included as part of the
> Mesa GBM surface implementation.
> 
> Guest KMS/Virtio-gpu/Qemu Wayland UI:
> - Because no_vblank=true for Guest KMS, and since the vblank event (which
> also serves as the pageflip completion event for user-space) is sent right
> away after the atomic commit, as Gerd said, we use an internal dma-fence to
> block/wait the Guest KMS until we know for sure that the Host is completely
> done using the buffer. To ensure this, we signal the dma-fence only after
> the Host compositor sends a wl_buffer.release event or an equivalent signal.
> 
> The goal:
> - Maintain full framerate even when the Guest scanout FB is flipped onto a
> hardware plane on the Host -- regardless of either compositor's scheduling
> policy -- without making any copies and while ensuring that the Host and
> Guest are not accessing the buffer at the same time.
> 
> The problem:
> - If the Host compositor flips the client's buffer (in this case the Guest
> compositor's buffer) onto a hardware plane, then it can send a
> wl_buffer.release event for the previous buffer only after it gets a
> pageflip completion. And if the Guest compositor takes 10-12 ms to submit a
> new buffer while the Host compositor waits only for 9 ms, the Guest
> compositor will miss the Host's repaint cycle, resulting in a halved
> frame-rate.
> 
> The solution:
> - To ensure full framerate, the Guest compositor has to start its repaint
> cycle (including the 9 ms wait) when the Host compositor sends the frame
> callback event to its clients. In order for this to happen, the dma-fence
> that the Guest KMS waits on -- before sending the pageflip completion --
> cannot be tied to a wl_buffer.release event. This means that the Guest
> compositor has to be forced to use a new buffer for its next repaint cycle
> when it gets a pageflip completion.

Is that really the only solution?

If we fix the event timestamps so that both guest and host use the same
timestamp, but then the guest starts 5ms (or something like that) earlier,
then things should work too? I.e.
- host compositor starts at (previous_frametime + 9ms)
- guest compositor starts at (previous_frametime + 4ms)

Ofc this only works if the frametimes we hand out to both match.

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Daniel Vetter
On Fri, Jul 30, 2021 at 02:50:10PM +0200, Michel Dänzer wrote:
> On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> > On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> >> Separating the OUT_FENCE signalling from pageflip completion allows
> >> a Guest compositor to start a new repaint cycle with a new buffer
> >> instead of waiting for the old buffer to be free. 
> >>
> >> This work is based on the idea/suggestion from Simon and Pekka.
> >>
> >> This capability can be a solution for this issue:
> >> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >>
> >> Corresponding Weston MR:
> >> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> > 
> > Uh I kinda wanted to discuss this a bit more before we jump into typing
> > code, but well I guess not that much work yet.
> > 
> > So maybe I'm not understanding the problem, but I think the fundamental
> > underlying issue is that with KMS you can have at most 2 buffers
> > in-flight, due to our queue depth limit of 1 pending flip.
> > 
> > Unfortunately that means for virtual hw where it takes a few more
> > steps/vblanks until the framebuffer actually shows up on screen and is
> > scanned out, we suffer deeply. The usual fix for that is to drop the
> > latency and increase throughput, and have more buffers in-flight. Which
> > this patch tries to do.
> 
> Per
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 ,
> IMO the underlying issue is actually that the guest compositor repaint
> cycle is not aligned with the host compositor one. If they were aligned,
> the problem would not occur even without allowing multiple page flips in
> flight, and latency would be lower.

Yeah my proposal here is under the premise that we do actually need to fix
this with a deeper queue depth.

> > Now I think where we go wrong here is that we're trying to hack this up by
> > defining different semantics for the out-fence and for the drm-event. Imo
> > that's wrong, they're both meant to show exactly the same thing:
> > - when is the new frame actually visible to the user (as in, eyeballs in a
> >   human head, preferably, not the time when we've handed the buffer off
> >   to the virtual hw)
> > - when is the previous buffer no longer being used by the scanout hw
> > 
> > We do cheat a bit right now in so far that we assume they're both the
> > same, as in, panel-side latency is currently the compositor's problem to
> > figure out.
> > 
> > So for virtual hw I think the timestamp and event completion really need to
> > happen only when the buffer has been pushed through the entire
> > virtualization chain, i.e. ideally we get the timestamp from the kms
> > driver from the host side. Currently that's not done, so this is most
> > likely quite broken already (virtio relies on the no-vblank auto event
> > sending, which definitely doesn't wait for anything, or I'm completely
> > missing something).
> > 
> > I think instead of hacking up some ill-defined 1.5 queue depth support,
> > what we should do is support queue depth > 1 properly. So:
> > 
> > - Change atomic to support queue depth > 1, this needs to be a per-driver
> >   thing due to a bunch of issues in driver code. Essentially drivers must
> >   never look at obj->state pointers, and only ever look up state through
> >   the passed-in drm_atomic_state * update container.
> > 
> > - Aside: virtio should lose all its empty hooks, there's no point in
> >   that.
> > 
> > - We fix virtio to send out the completion event at the end of this entire
> >   pipeline, i.e. virtio code needs to take care of sending out the
> >   crtc_state->event correctly.
> > 
> > - We probably also want some kind of (maybe per-crtc) recommended queue
> >   depth property so compositors know how many buffers to keep in flight.
> >   Not sure about that.
> 
> I'd say there would definitely need to be some kind of signal for the
> display server that it should queue multiple flips, since this is
> normally not desirable for latency. In other words, this wouldn't really
> be useful on bare metal (in contrast to the ability to replace a pending
> flip with a newer one).

Hm I was thinking that the compositor can tune this. If the round-trip
latency (as measured by events) is too long to get full refresh rate, it
can add more buffers to the queue. That's kinda why I think the returned
event really must be accurate wrt actual display time (and old buffer
release time), so that this computation in the compositor becomes pretty
simple:

num_buffers = (flip_time - submit_time) / frame_time

With maybe some rounding up and averaging. You can also hit this when your
3d engine has an extremely deep pipeline (like some of the tiling
renderers have), where rendering just takes forever, but as long as you keep
2 frames in the renderer in-flight you can achieve full refresh rate (at a
latency cost).

So kernel can't really tell you in all cases how many buffers you should
have.
-Daniel

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-02 Thread Kasireddy, Vivek
Hi Daniel,

> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > Separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
[Kasireddy, Vivek] Right, it wasn't a lot of work :)

> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
[Kasireddy, Vivek] Let me summarize the problem again from the perspective of
both the Host (Weston) and Guest (Weston) compositors, assuming a refresh rate
of 60 Hz -- which implies a vblank/vsync is generated every ~16.66 ms.

Host compositor:
- After a pageflip completion event, it starts its next repaint cycle by
waiting for 9 ms, then submits the atomic commit, and at the tail end of its
cycle sends a frame callback event to all its clients (who registered and
submitted frames), indicating to them to start their next redraw -- giving
them at least ~16 ms to submit a new frame to be included in its next repaint.
Why a configurable 9 ms delay is needed is explained in Pekka's blog post here:
https://ppaalanen.blogspot.com/2015/02/weston-repaint-scheduling.html

- It'll send a wl_buffer.release event for a client's previously submitted
buffer only when the client has submitted a new buffer and:
a) when it hasn't started its repaint cycle yet, OR
b) when it clears its old state after it gets a pageflip completion event --
if it had flipped the client's buffer onto a hardware plane.

Guest compositor:
- After a pageflip completion is sent by Guest KMS, it takes about 10-12 ms
for the Guest compositor to submit a new atomic commit. This 10-12 ms includes
the 9 ms wait -- just like the Host compositor -- for its clients to submit
new buffers.
- When it gets a pageflip completion, it assumes that the previously submitted
buffer is free for re-use and uses it again -- resulting in the usage of only
2 out of a maximum of 4 backbuffers included as part of the Mesa GBM surface
implementation.

Guest KMS/Virtio-gpu/Qemu Wayland UI:
- Because no_vblank=true for Guest KMS, and since the vblank event (which also
serves as the pageflip completion event for user-space) is sent right away
after the atomic commit, as Gerd said, we use an internal dma-fence to
block/wait the Guest KMS until we know for sure that the Host is completely
done using the buffer. To ensure this, we signal the dma-fence only after the
Host compositor sends a wl_buffer.release event or an equivalent signal.

The goal:
- Maintain full framerate even when the Guest scanout FB is flipped onto a
hardware plane on the Host -- regardless of either compositor's scheduling
policy -- without making any copies and while ensuring that the Host and
Guest are not accessing the buffer at the same time.

The problem:
- If the Host compositor flips the client's buffer (in this case the Guest
compositor's buffer) onto a hardware plane, then it can send a
wl_buffer.release event for the previous buffer only after it gets a pageflip
completion. And if the Guest compositor takes 10-12 ms to submit a new buffer
while the Host compositor waits only for 9 ms, the Guest compositor will miss
the Host's repaint cycle, resulting in a halved frame-rate.

The solution:
- To ensure full framerate, the Guest compositor has to start its repaint
cycle (including the 9 ms wait) when the Host compositor sends the frame
callback event to its clients. In order for this to happen, the dma-fence that
the Guest KMS waits on -- before sending the pageflip completion -- cannot be
tied to a wl_buffer.release event. This means that the Guest compositor has to
be forced to use a new buffer for its next repaint cycle when it gets a
pageflip completion.
- The Weston MR I linked above does this by getting an out_fence fd and taking
a reference on all the FBs included in the atomic commit, forcing the
compositor to use new FBs for its next repaint cycle. It releases the
references when the out_fence is signalled later, when the Host compositor
sends a wl_buffer.release event.

> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.
> 
> Now I think where we go 

RE: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-08-01 Thread Zhang, Tina



> -Original Message-
> From: Daniel Vetter 
> Sent: Friday, July 30, 2021 6:26 PM
> To: Kasireddy, Vivek 
> Cc: dri-devel@lists.freedesktop.org; Daniel Vetter ; Gerd
> Hoffmann ; Pekka Paalanen ;
> Simon Ser ; Michel Dänzer ;
> Zhang, Tina ; Kim, Dongwon
> 
> Subject: Re: [RFC v1 0/4] drm: Add support for
> DRM_CAP_DEFERRED_OUT_FENCE capability
> 
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> > Separating the OUT_FENCE signalling from pageflip completion allows
> > a Guest compositor to start a new repaint cycle with a new buffer
> > instead of waiting for the old buffer to be free.
> >
> > This work is based on the idea/suggestion from Simon and Pekka.
> >
> > This capability can be a solution for this issue:
> > https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> >
> > Corresponding Weston MR:
> > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing code,
> but well I guess not that much work yet.
> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers in-flight,
> due to our queue depth limit of 1 pending flip.
> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the latency
> and increase throughput, and have more buffers in-flight. Which this patch
> tries to do.
> 
> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show exactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
> 
> We do cheat a bit right now in so far that we assume they're both the same,
> as in, panel-side latency is currently the compositor's problem to figure out.
> 
> So for virtual hw I think the timestamp and event completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms driver
> from the host side. Currently that's not done, so this is most likely quite
> broken already (virtio relies on the no-vblank auto event sending, which
> definitely doesn't wait for anything, or I'm completely missing something).

Agree. One lesson we learned from previous direct-display related work is that
using the host hardware event is kind of a "must". Otherwise, problems like
flickering, tearing or frame drops show up. Besides, as the wayland-ui works
as a weston client, it needs more than 2 buffers to support full-frame redraw.
I tried weston-simple-dmabuf-egl with 2 buffers and it was bad.

BR,
Tina

> 
> I think instead of hacking up some ill-defined 1.5 queue depth support, what
> we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should lose all its empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.
> 
> It's a bit more work, but also a lot less hacking around infrastructure in
> dubious ways.
> 
> Thoughts?
> 
> Cheers, Daniel
> 
> >
> > Cc: Daniel Vetter 
> > Cc: Gerd Hoffmann 
> > Cc: Pekka Paalanen 
> > Cc: Simon Ser 
> > Cc: Michel Dänzer 
> > Cc: Tina Zhang 
> > Cc: Dongwon Kim 
> >
> > Vivek Kasireddy (4):
> >   drm: Add a capability flag to support deferred out_fence signalling
> >   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
> >   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
> >   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> >
> >  drivers/gpu/drm/drm_file.c   | 11 +++---
> >  drivers/gpu/drm

Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-07-30 Thread Gerd Hoffmann
  Hi,

> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.

That sounds sensible to me.  Fence the virtio commands, and make sure (on
the host side) the command completes only when the work is actually done,
not merely submitted.  This has recently been added to qemu for
RESOURCE_FLUSH (aka frontbuffer rendering); doing the same for SET_SCANOUT
(aka pageflipping) and then sending vblank events to userspace on command
completion certainly makes sense.

take care,
  Gerd



Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-07-30 Thread Michel Dänzer
On 2021-07-30 12:25 p.m., Daniel Vetter wrote:
> On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> >> Separating the OUT_FENCE signalling from pageflip completion allows
>> a Guest compositor to start a new repaint cycle with a new buffer
>> instead of waiting for the old buffer to be free. 
>>
>> This work is based on the idea/suggestion from Simon and Pekka.
>>
>> This capability can be a solution for this issue:
>> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
>>
>> Corresponding Weston MR:
>> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668
> 
> Uh I kinda wanted to discuss this a bit more before we jump into typing
> code, but well I guess not that much work yet.
> 
> So maybe I'm not understanding the problem, but I think the fundamental
> underlying issue is that with KMS you can have at most 2 buffers
> in-flight, due to our queue depth limit of 1 pending flip.
> 
> Unfortunately that means for virtual hw where it takes a few more
> steps/vblanks until the framebuffer actually shows up on screen and is
> scanned out, we suffer deeply. The usual fix for that is to drop the
> latency and increase throughput, and have more buffers in-flight. Which
> this patch tries to do.

Per https://gitlab.freedesktop.org/wayland/weston/-/issues/514#note_986797 , 
IMO the underlying issue is actually that the guest compositor repaint cycle is 
not aligned with the host compositor one. If they were aligned, the problem 
would not occur even without allowing multiple page flips in flight, and 
latency would be lower.


> Now I think where we go wrong here is that we're trying to hack this up by
> defining different semantics for the out-fence and for the drm-event. Imo
> that's wrong, they're both meant to show exactly the same thing:
> - when is the new frame actually visible to the user (as in, eyeballs in a
>   human head, preferably, not the time when we've handed the buffer off
>   to the virtual hw)
> - when is the previous buffer no longer being used by the scanout hw
> 
> We do cheat a bit right now in so far that we assume they're both the
> same, as in, panel-side latency is currently the compositor's problem to
> figure out.
> 
> So for virtual hw I think the timestamp and event completion really need to
> happen only when the buffer has been pushed through the entire
> virtualization chain, i.e. ideally we get the timestamp from the kms
> driver from the host side. Currently that's not done, so this is most
> likely quite broken already (virtio relies on the no-vblank auto event
> sending, which definitely doesn't wait for anything, or I'm completely
> missing something).
> 
> I think instead of hacking up some ill-defined 1.5 queue depth support,
> what we should do is support queue depth > 1 properly. So:
> 
> - Change atomic to support queue depth > 1, this needs to be a per-driver
>   thing due to a bunch of issues in driver code. Essentially drivers must
>   never look at obj->state pointers, and only ever look up state through
>   the passed-in drm_atomic_state * update container.
> 
> - Aside: virtio should lose all its empty hooks, there's no point in
>   that.
> 
> - We fix virtio to send out the completion event at the end of this entire
>   pipeline, i.e. virtio code needs to take care of sending out the
>   crtc_state->event correctly.
> 
> - We probably also want some kind of (maybe per-crtc) recommended queue
>   depth property so compositors know how many buffers to keep in flight.
>   Not sure about that.

I'd say there would definitely need to be some kind of signal for the display 
server that it should queue multiple flips, since this is normally not 
desirable for latency. In other words, this wouldn't really be useful on bare 
metal (in contrast to the ability to replace a pending flip with a newer one).


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-07-30 Thread Daniel Vetter
On Thu, Jul 29, 2021 at 01:16:55AM -0700, Vivek Kasireddy wrote:
> Separating the OUT_FENCE signalling from pageflip completion allows
> a Guest compositor to start a new repaint cycle with a new buffer
> instead of waiting for the old buffer to be free. 
> 
> This work is based on the idea/suggestion from Simon and Pekka.
> 
> This capability can be a solution for this issue:
> https://gitlab.freedesktop.org/wayland/weston/-/issues/514
> 
> Corresponding Weston MR:
> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668

Uh I kinda wanted to discuss this a bit more before we jump into typing
code, but well I guess not that much work yet.

So maybe I'm not understanding the problem, but I think the fundamental
underlying issue is that with KMS you can have at most 2 buffers
in-flight, due to our queue depth limit of 1 pending flip.

Unfortunately that means for virtual hw where it takes a few more
steps/vblanks until the framebuffer actually shows up on screen and is
scanned out, we suffer deeply. The usual fix for that is to drop the
latency and increase throughput, and have more buffers in-flight. Which
this patch tries to do.

Now I think where we go wrong here is that we're trying to hack this up by
defining different semantics for the out-fence and for the drm-event. Imo
that's wrong, they're both meant to show exactly the same thing:
- when is the new frame actually visible to the user (as in, eyeballs in a
  human head, preferably, not the time when we've handed the buffer off
  to the virtual hw)
- when is the previous buffer no longer being used by the scanout hw

We do cheat a bit right now in so far that we assume they're both the
same, as in, panel-side latency is currently the compositor's problem to
figure out.

So for virtual hw I think the timestamp and event completion really need to
happen only when the buffer has been pushed through the entire
virtualization chain, i.e. ideally we get the timestamp from the kms
driver from the host side. Currently that's not done, so this is most
likely quite broken already (virtio relies on the no-vblank auto event
sending, which definitely doesn't wait for anything, or I'm completely
missing something).

I think instead of hacking up some ill-defined 1.5 queue depth support,
what we should do is support queue depth > 1 properly. So:

- Change atomic to support queue depth > 1, this needs to be a per-driver
  thing due to a bunch of issues in driver code. Essentially drivers must
  never look at obj->state pointers, and only ever look up state through
  the passed-in drm_atomic_state * update container.

- Aside: virtio should lose all its empty hooks, there's no point in
  that.

- We fix virtio to send out the completion event at the end of this entire
  pipeline, i.e. virtio code needs to take care of sending out the
  crtc_state->event correctly.

- We probably also want some kind of (maybe per-crtc) recommended queue
  depth property so compositors know how many buffers to keep in flight.
  Not sure about that.
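The rule in the first bullet above can be illustrated with a fragment (hypothetical driver code, not from any real driver; `drm_atomic_get_new_crtc_state()` is the existing helper for resolving state through the update container):

```c
/* Inside some hypothetical atomic hook: */
static void example_crtc_update(struct drm_atomic_state *state,
                                struct drm_crtc *crtc)
{
        /* Fragile once queue depth > 1: crtc->state may already point
         * at a later, still-pending update. */
        struct drm_crtc_state *stale = crtc->state;

        /* Correct: resolve the state through the drm_atomic_state
         * container passed in for *this* commit. */
        struct drm_crtc_state *cs =
                drm_atomic_get_new_crtc_state(state, crtc);

        /* ... program hardware from cs, never from stale ... */
}
```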

It's a bit more work, but also a lot less hacking around infrastructure in
dubious ways.

Thoughts?

Cheers, Daniel

> 
> Cc: Daniel Vetter 
> Cc: Gerd Hoffmann 
> Cc: Pekka Paalanen 
> Cc: Simon Ser 
> Cc: Michel Dänzer 
> Cc: Tina Zhang 
> Cc: Dongwon Kim 
> 
> Vivek Kasireddy (4):
>   drm: Add a capability flag to support deferred out_fence signalling
>   virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
>   drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
>   drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature
> 
>  drivers/gpu/drm/drm_file.c   | 11 +++---
>  drivers/gpu/drm/drm_ioctl.c  |  3 ++
>  drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
>  drivers/gpu/drm/virtio/virtgpu_drv.c |  1 +
>  drivers/gpu/drm/virtio/virtgpu_drv.h |  6 
>  drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +
>  drivers/gpu/drm/virtio/virtgpu_kms.c | 10 --
>  drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++-
>  drivers/gpu/drm/virtio/virtgpu_vq.c  | 17 +
>  include/drm/drm_mode_config.h|  9 +
>  include/uapi/drm/drm.h   |  1 +
>  include/uapi/linux/virtio_gpu.h  | 12 +++
>  12 files changed, 117 insertions(+), 7 deletions(-)
> 
> -- 
> 2.30.2
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


[RFC v1 0/4] drm: Add support for DRM_CAP_DEFERRED_OUT_FENCE capability

2021-07-29 Thread Vivek Kasireddy
Separating the OUT_FENCE signalling from pageflip completion allows
a Guest compositor to start a new repaint cycle with a new buffer
instead of waiting for the old buffer to be free. 

This work is based on the idea/suggestion from Simon and Pekka.

This capability can be a solution for this issue:
https://gitlab.freedesktop.org/wayland/weston/-/issues/514

Corresponding Weston MR:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/668

Cc: Daniel Vetter 
Cc: Gerd Hoffmann 
Cc: Pekka Paalanen 
Cc: Simon Ser 
Cc: Michel Dänzer 
Cc: Tina Zhang 
Cc: Dongwon Kim 

Vivek Kasireddy (4):
  drm: Add a capability flag to support deferred out_fence signalling
  virtio-gpu uapi: Add VIRTIO_GPU_F_OUT_FENCE feature
  drm/virtio: Add VIRTIO_GPU_CMD_RESOURCE_OUT_FENCE cmd
  drm/virtio: Probe and implement VIRTIO_GPU_F_OUT_FENCE feature

 drivers/gpu/drm/drm_file.c   | 11 +++---
 drivers/gpu/drm/drm_ioctl.c  |  3 ++
 drivers/gpu/drm/virtio/virtgpu_debugfs.c |  1 +
 drivers/gpu/drm/virtio/virtgpu_drv.c |  1 +
 drivers/gpu/drm/virtio/virtgpu_drv.h |  6 
 drivers/gpu/drm/virtio/virtgpu_fence.c   |  9 +
 drivers/gpu/drm/virtio/virtgpu_kms.c | 10 --
 drivers/gpu/drm/virtio/virtgpu_plane.c   | 44 +++-
 drivers/gpu/drm/virtio/virtgpu_vq.c  | 17 +
 include/drm/drm_mode_config.h|  9 +
 include/uapi/drm/drm.h   |  1 +
 include/uapi/linux/virtio_gpu.h  | 12 +++
 12 files changed, 117 insertions(+), 7 deletions(-)

-- 
2.30.2