Re: Support for 2D engines/blitters in V4L2 and DRM

2019-04-24 Thread Michel Dänzer
On 2019-04-19 10:38 a.m., Paul Kocialkowski wrote:
> On Thu, 2019-04-18 at 20:30 -0400, Nicolas Dufresne wrote:
>> Le jeudi 18 avril 2019 à 10:18 +0200, Daniel Vetter a écrit :
 It would be cool if both could be used concurrently and not just return
 -EBUSY when the device is used with the other subsystem.
>>>
>>> We live in this world already :-) I think there's even patches (or merged
>>> already) to add fences to v4l, for Android.
>>
>> This work is currently suspended. It will require some feature on DRM
>> display to really make this useful, but there are also a lot of
>> challenges in V4L2. In GFX space, most of the use cases are about
>> rendering as soon as possible. In multimedia, though, we have two
>> problems: we need to synchronize the frame rendering with the audio,
>> and output buffers may come out of order due to how video CODECs are
>> made.
> 
> Definitely, it feels like the DRM display side is currently a good fit
> for render use cases, but not so much for precise display cases where
> we want to try and display a buffer at a given vblank target instead of
> "as soon as possible".
> 
> I have a userspace project where I've implemented a page flip queue,
> which only schedules the next flip when relevant and keeps ready
> buffers in the queue until then. This requires explicit vblank
> synchronisation (which DRM offers, but which pretty much all other,
> higher-level display APIs don't, so I'm just using a refresh-rate
> timer for them) and flip done notification.
> 
> I haven't looked too much at how to flip with a target vblank with DRM
> directly but maybe the atomic API already has the bits in for that (but
> I haven't heard of such a thing as a buffer queue, so that makes me
> doubt it).

Not directly. What's available is that if userspace waits for vblank n
and then submits a flip, the flip will complete in vblank n+1 (or a
later vblank, depending on when the flip is submitted and when the
fences the flip depends on signal).
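
A minimal sketch of that timing rule, written as a plain Python
simulation rather than actual libdrm calls (the function name and
parameters are made up for illustration):

```python
# Illustrative simulation of the flip timing rule described above:
# a flip submitted after vblank n completes no earlier than vblank n+1,
# and not before any fence it depends on has signalled.
def flip_completion_vblank(submit_vblank, fence_signal_vblank=None):
    """Return the vblank at which a flip submitted right after
    `submit_vblank` completes, given an optional dependency fence
    that signals during `fence_signal_vblank`."""
    earliest = submit_vblank + 1
    if fence_signal_vblank is not None:
        # The flip cannot complete before the fence has signalled.
        earliest = max(earliest, fence_signal_vblank + 1)
    return earliest

# A flip submitted after vblank 10 completes in vblank 11.
print(flip_completion_vblank(10))      # 11
# If its fence only signals during vblank 13, completion slips to 14.
print(flip_completion_vblank(10, 13))  # 14
```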

There is reluctance to allow more than one flip to be queued in the
kernel, as it would considerably increase complexity there. It
would probably only be considered if there was a compelling use-case
which was outright impossible otherwise.


> Well, I need to handle stuff like SDL in my userspace project, so I have
> to have all that queuing stuff in software anyway, but it would be good
> if each project didn't have to implement that. Worst case, it could be
> in libdrm too.

Usually, this kind of queuing will be handled in a display server such
as Xorg or a Wayland compositor, not by the application itself (such as
a video player) or by any library in its address space. I'm not sure
there's much potential for sharing code between display servers for
this.


>> In the first, we'd need a mechanism where we can schedule a render at a
>> specific time or vblank. We can of course already implement this in
>> software, but with fences, the scheduling would need to be done in the
>> driver. Then if the fence is signalled earlier, the driver should hold
>> on until the delay is met. If the fence got signalled late, we also
>> need to think of a workflow. As we can't schedule more than one render
>> in DRM at one time, I don't really see yet how to make that work.
> 
> Indeed, that's also one of the main issues I've spotted. Before using
> an implicit fence, we basically have to make sure the frame is due for
> display at the next vblank. Otherwise, we need to refrain from using
> the fence and schedule the flip later, which is kind of counter-
> productive.

Fences are about signalling that the contents of a frame are "done" and
ready to be presented. They're not about specifying which frame is to be
presented when.


> I feel like specifying a target vblank would be a good unit for that,

The mechanism described above works for that.

> since it's our native granularity after all (while a timestamp is not).

Note that variable refresh rate (Adaptive Sync / FreeSync / G-Sync)
changes things in this regard. It makes the vblank length variable, and
if you wait for multiple vblanks between flips, you get the maximum
vblank length corresponding to the minimum refresh rate / timing
granularity. Thus, it would be useful to allow userspace to specify a
timestamp corresponding to the earliest time when the flip is to
complete. The kernel could then try to hit that as closely as possible.
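
As an illustration of what the scheduling side of such a
timestamp-based interface could look like, here is a hypothetical
model in plain Python. This is not an existing DRM uAPI, and all names
and parameters are made up; it only models the clamping behaviour a
VRR range imposes:

```python
# Illustrative sketch: userspace passes a target timestamp for the
# earliest completion of a flip, and the kernel completes it as close
# to that as the variable-refresh range allows.
def earliest_flip_time(last_flip, target, min_refresh_hz, max_refresh_hz):
    """All times in seconds. The next flip can complete no sooner than
    one minimum frame duration (1/max_refresh) after the last flip, and
    no later than one maximum frame duration (1/min_refresh) after it,
    past which the panel has to refresh anyway at its minimum rate."""
    soonest = last_flip + 1.0 / max_refresh_hz
    latest = last_flip + 1.0 / min_refresh_hz
    # Hit the target if it falls inside the achievable window,
    # otherwise clamp to the window's edges.
    return min(max(target, soonest), latest)

# 40-144 Hz panel, last flip at t=0: a target 10 ms out is achievable,
# since the window is roughly [6.9 ms, 25 ms].
print(earliest_flip_time(0.0, 0.010, 40, 144))  # 0.01
```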


-- 
Earthling Michel Dänzer   |  https://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: Support for 2D engines/blitters in V4L2 and DRM

2019-04-18 Thread Daniel Vetter
On Wed, Apr 17, 2019 at 08:10:15PM +0200, Paul Kocialkowski wrote:
> Hi Nicolas,
> 
> I'm detaching this thread from our V4L2 stateless decoding spec since
> it has drifted off and would certainly be interesting to DRM folks as
> well!
> 
> For context: I was initially talking about writing up support for the
> Allwinner 2D engine as a DRM render driver, where I'd like to be able
> to batch jobs that affect the same destination buffer to only signal
> the out fence once when the batch is done. We have a similar issue in
> v4l2 where we'd like the destination buffer for a set of requests (each
> covering one H264 slice) to be marked as done once the set was decoded.
> 
> Le mercredi 17 avril 2019 à 12:22 -0400, Nicolas Dufresne a écrit :
> > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > 2D graphics blitter that has limited output scaling abilities which
> > > > > imply handling a large scaling operation as multiple clipped smaller
> > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > be submitted to complete a single frame and relying on an indication
> > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > that all the operations were completed, since we get the indication at
> > > > > each step instead of at the end of the batch.
> > > > 
> > > > That looks similar to the i.MX6 IPU m2m driver. It splits the image into
> > > > tiles of 1024x1024 and processes each tile separately. This driver has
> > > > been around for a long time, so I guess they have a solution to that.
> > > > They don't need requests, because there is nothing to be bundled with
> > > > the input image. I know that Renesas folks have started working on a
> > > > de-interlacer. Again, this kind of driver may process and reuse input
> > > > buffers for motion compensation, but I don't think they need special
> > > > userspace API for that.
> > > 
> > > Thanks for the reference! I hope it's not a blitter that was
> > > contributed as a V4L2 driver instead of DRM, as it probably would be
> > > more useful in DRM (but that's way beside the point).
> > 
> > DRM does not offer a generic and discoverable interface for these
> > accelerators. Note that these drivers have most of the time started as
> > DRM drivers and their DRM side was dropped. That was the case for
> > the Exynos drivers at least.
> 
> Heh, sadly I'm aware of how things turn out most of the time. The thing
> is that DRM expects drivers to implement their own interface. That's
> fine for passing BOs with GPU bitstream and textures, but not so much
> for dealing with framebuffer-based operations where the streaming and
> buffer interface that v4l2 has is a good fit.
> 
> There's also the fact that the 2D pipeline is fixed-function and highly
> hardware-specific, so we need driver-specific job descriptions to
> really make the most of it. That's where v4l2 isn't a good fit
> for complex 2D pipelines either. Most 2D engines can take multiple
> inputs and blit them together in various ways, which is too far from
> what v4l2 deals with. So we can have fixed single-buffer pipelines with
> at best CSC and scaling, but not much more with v4l2 really.
> 
> I don't think it would be too much work to bring an interface to DRM in
> order to describe render framebuffers (we only have display
> framebuffers so far), with a simple queuing interface for scheduling
> driver-specific jobs, which could be grouped together to only signal
> the out fences when every buffer of the batch was done being rendered.
> This last point would allow handling cases where userspace needs to
> perform multiple operations to carry out the single operation that it
> needs to do. In the case of my 2D blitter, that would be scaling above
> a 1024x1024 destination, which could be required to scale a video
> buffer up to a 1920x1080 display. With that, we can e.g. page flip the
> 2D engine destination buffer and be certain that scaling will be fully
> done when the fence is signaled.
> 
> There's also the userspace problem: DRM render has Mesa to back it in
> userspace and provide a generic API for other programs. For 2D
> engines, we don't have much to hold on to. Cairo has a DRM render
> interface that supports a few DRM render drivers where there is either
> a 2D pipeline or where pre-built shaders are used to implement a 2D
> pipeline, and that's about it as far as I know.
> 
> There's also the possibility of writing up a drm-render DDX to handle
> these 2D blitters that can make things a lot faster when running a
> desktop environment. As for Wayland, well, I don't really know what to
> think. I was under the impression that it relies on GL for 2D
> operations, but am really not sure how true that actually is.

Just fyi in case you folks aren't aware, I typed up a blog a while ago
about why drm doesn't have a 2d submit api:

https://blog.ffwll.ch/2018/08/no-2d-in-drm.html
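
The batch-completion scheme described above (signal the destination
buffer's out fence only once every job of the batch has completed) can
be modelled with a simple per-batch counter, conceptually similar to a
driver refcounting jobs against a dma_fence. This is a plain-Python
illustration, not kernel code, and the names are made up:

```python
# Illustrative model: one fence per batch, signalled only when the
# last job of the batch completes.
class BatchFence:
    def __init__(self, num_jobs):
        self.pending = num_jobs
        self.signalled = False

    def job_done(self):
        """Called as each sub-job completes (e.g. one clipped scaling
        operation, or one decoded H264 slice)."""
        self.pending -= 1
        if self.pending == 0:
            self.signalled = True  # out fence fires exactly once

# Scaling up to 1920x1080 through a blitter limited to 1024x1024
# outputs takes four clipped jobs (a 2x2 grid of tiles); the fence
# signals only after the fourth.
fence = BatchFence(num_jobs=4)
for _ in range(4):
    fence.job_done()
print(fence.signalled)  # True
```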

> > The thing is that DRM is