How to design a DRM KMS driver exposing 2D compositing?

2014-08-13 Thread Pekka Paalanen
On Tue, 12 Aug 2014 09:10:47 -0700
Eric Anholt  wrote:

> Pekka Paalanen  writes:
> 
> > On Mon, 11 Aug 2014 19:27:45 +0200
> > Daniel Vetter  wrote:
> >
> >> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> >> > Daniel Vetter  writes:
> >> > 
> >> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> > >> Hi,
> >> > >> 
> >> > >> there is some hardware that can do 2D compositing with an arbitrary
> >> > >> number of planes. I'm not sure what the absolute maximum number of
> >> > >> planes is, but for the discussion, let's say it is 100.
> >> > >> 
> >> > >> There are many complicated, dynamic constraints on how many, what size,
> >> > >> etc. planes can be used at once. A driver would be able to check those
> >> > >> before kicking the 2D compositing engine.
> >> > >> 
> >> > >> The 2D compositing engine in the best case (only a few planes used) is
> >> > >> able to composite on the fly in scanout, just like the usual overlay
> >> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
> >> > >> driver can fall back to compositing into a buffer rather than on the
> >> > >> fly in scanout. This fallback needs to be completely transparent to the
> >> > >> user space, implying only additional latency if anything.
> >> > >> 
> >> > >> These 2D compositing features should be exposed to user space through a
> >> > >> standard kernel ABI, hopefully an existing ABI in the very near future
> >> > >> like the KMS atomic.
> >> > >
> >> > > I presume we're talking about the video core from raspi? Or at least
> >> > > something similar?
> >> > 
> >> > Pekka wasn't sure if things were confidential here, but I can say it:
> >> > Yeah, it's the RPi.
> >> > 
> >> > While I haven't written code using the compositor interface (I just did
> >> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
> >> > company can handle the rest for me :) ), my understanding is that the
> >> > way you make use of it is that you've got your previous frame loaded up
> >> > in the HVS (the plane compositor hardware), then when you're asked to
> >> > put up a new frame that's going to be too hard, you take some
> >> > complicated chunk of your scene and ask the HVS to use any spare
> >> > bandwidth it has while it's still scanning out the previous frame in
> >> > order to composite that piece of new scene into memory.  Then, when it's
> >> > done with the offline composite, you ask the HVS to do the next scanout
> >> > frame using the original scene with the pre-composited temporary buffer.
> >> > 
> >> > I'm pretty comfortable with the idea of having some large number of
> >> > planes preallocated, and deciding that "nobody could possibly need more
> >> > than 16" (or whatever).
> >> > 
> >> > My initial reaction to "we should just punt when we run out of bandwidth
> >> > and have a special driver interface for offline composite" was "that's
> >> > awful, when the kernel could just get the job done immediately, and
> >> > easily, and it would know exactly what it needed to composite to get
> >> > things to fit (unlike userspace)".  I'm trying to come up with what
> >> > benefit there would be to having a separate interface for offline
> >> > composite.  I've got 3 things:
> >> > 
> >> > - Avoids having a potentially long, interruptible wait in the modeset
> >> >   path while the offline composite happens.  But I think we have other
> >> >   interruptible waits in that path already.
> >> > 
> >> > - Userspace could potentially do something else besides use the HVS to
> >> >   get the fallback done.  Video would have to use the HVS, to get the
> >> >   same scaling filters applied as the previous frame where things *did*
> >> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
> >> >   which would have more BW available to it than what you're borrowing
> >> >   from the previous frame's HVS capacity.
> >> > 
> >> > - Userspace could potentially use the offline composite interface for
> >> >   things besides just the running-out-of-bandwidth case.  Like, it was
> >> >   doing a nicely-filtered downscale of an overlaid video, then the user
> >> >   hit pause and walked away: you could have a timeout that noticed that
> >> >   the complicated scene hadn't changed in a while, and you'd drop from
> >> >   overlays to a HVS-composited single plane to reduce power.
> >> > 
> >> > The third one is the one I've actually found kind of compelling, and
> >> > might be switching me from wanting no userspace visibility into the
> >> > fallback.  But I don't have a good feel for how much complexity there is
> >> > to our descriptions of planes, and how much poorly-tested interface we'd
> >> > be adding to support this usecase.
> >> 
> >> Compositor should already do a rough bw guesstimate and, if stuff doesn't
> >> change any more, bake the entire scene into a single framebuffer. The exact
> >> same issue happens on more usual hw with video overlays, too.

How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Ville Syrjälä
On Tue, Aug 12, 2014 at 10:03:26AM +0200, Daniel Vetter wrote:
> On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen  
> wrote:
> >> but I tend to think it would be nice for compositors (userspace) to
> >> know explicitly what is going on..  ie. if some layers are blended via
> >> intermediate buffer, couldn't that intermediate buffer be potentially
> >> re-used on next frame if not damaged?
> >
> > Very true, and I think that speaks for exposing the HVS explicitly to
> > user space to be directly used. That way I believe the user space could
> > track damage and composite only the minimum, rather than everything
> > every time which I suppose the KMS API approach would imply.
> >
> > We don't have dirty regions in KMS API/props, do we? But yeah, that is
> > starting to feel like a stretch to push through KMS.
> 
> We have the dirty-ioctl, but imo it's a bit misdesigned: It works at
> the framebuffer level (so the driver always has to figure out which
> crtc/plane this is about), and it only works for frontbuffer
> rendering. It was essentially a single-purpose thing for udl uploads.
> 
> But in general I think it would make tons of sense to supply a
> per-crtc (or maybe per-plane) damage rect with nuclear flips. Both
> mipi dsi and edp have provisions to upload a subrect, so this could be
> useful in general. And decent compositors compute this already anyway.

Agreed, as long as we make it more of a hint so that the driver is
allowed to expand the rect to satisfy hardware specific alignment
requirements and whatnot.

I think a single per-crtc rect should be enough, but in case people
would like to implement a more sophisticated multi-rect update I
suppose we could allow it. And for those that don't want the extra
complexity of trying to deal with multiple rectangles, the driver
could just calculate the bounding rectangle and update that.
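For illustration, a minimal sketch of that bounding-rectangle fallback, with
invented names: collapse a multi-rect damage list into one box, then grow it
outwards to a hardware alignment (the driver may only ever expand the hint).
The struct mirrors drm_clip_rect (exclusive x2/y2); the alignment is assumed.

#include <stdint.h>

struct clip_rect {
	uint16_t x1, y1, x2, y2;	/* exclusive bottom-right corner */
};

static struct clip_rect
bounding_rect(const struct clip_rect *r, unsigned int n,
	      uint16_t fb_w, uint16_t fb_h, uint16_t align)
{
	struct clip_rect b = { fb_w, fb_h, 0, 0 };	/* empty if n == 0 */
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (r[i].x1 < b.x1) b.x1 = r[i].x1;
		if (r[i].y1 < b.y1) b.y1 = r[i].y1;
		if (r[i].x2 > b.x2) b.x2 = r[i].x2;
		if (r[i].y2 > b.y2) b.y2 = r[i].y2;
	}

	/* Expand, never shrink: align the x extents outwards and clamp.
	 * align is assumed to be a power of two. */
	b.x1 &= (uint16_t)~(align - 1);
	b.x2 = (uint16_t)((b.x2 + align - 1) & ~(align - 1));
	if (b.x2 > fb_w)
		b.x2 = fb_w;

	return b;
}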

-- 
Ville Syrjälä
Intel OTC


How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Pekka Paalanen
On Mon, 11 Aug 2014 19:27:45 +0200
Daniel Vetter  wrote:

> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> > Daniel Vetter  writes:
> > 
> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > >> Hi,
> > >> 
> > >> there is some hardware that can do 2D compositing with an arbitrary
> > >> number of planes. I'm not sure what the absolute maximum number of
> > >> planes is, but for the discussion, let's say it is 100.
> > >> 
> > >> There are many complicated, dynamic constraints on how many, what size,
> > >> etc. planes can be used at once. A driver would be able to check those
> > >> before kicking the 2D compositing engine.
> > >> 
> > >> The 2D compositing engine in the best case (only a few planes used) is
> > >> able to composite on the fly in scanout, just like the usual overlay
> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
> > >> driver can fall back to compositing into a buffer rather than on the
> > >> fly in scanout. This fallback needs to be completely transparent to the
> > >> user space, implying only additional latency if anything.
> > >> 
> > >> These 2D compositing features should be exposed to user space through a
> > >> standard kernel ABI, hopefully an existing ABI in the very near future
> > >> like the KMS atomic.
> > >
> > > I presume we're talking about the video core from raspi? Or at least
> > > something similar?
> > 
> > Pekka wasn't sure if things were confidential here, but I can say it:
> > Yeah, it's the RPi.
> > 
> > While I haven't written code using the compositor interface (I just did
> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
> > company can handle the rest for me :) ), my understanding is that the
> > way you make use of it is that you've got your previous frame loaded up
> > in the HVS (the plane compositor hardware), then when you're asked to
> > put up a new frame that's going to be too hard, you take some
> > complicated chunk of your scene and ask the HVS to use any spare
> > bandwidth it has while it's still scanning out the previous frame in
> > order to composite that piece of new scene into memory.  Then, when it's
> > done with the offline composite, you ask the HVS to do the next scanout
> > frame using the original scene with the pre-composited temporary buffer.
> > 
> > I'm pretty comfortable with the idea of having some large number of
> > planes preallocated, and deciding that "nobody could possibly need more
> > than 16" (or whatever).
> > 
> > My initial reaction to "we should just punt when we run out of bandwidth
> > and have a special driver interface for offline composite" was "that's
> > awful, when the kernel could just get the job done immediately, and
> > easily, and it would know exactly what it needed to composite to get
> > things to fit (unlike userspace)".  I'm trying to come up with what
> > benefit there would be to having a separate interface for offline
> > composite.  I've got 3 things:
> > 
> > - Avoids having a potentially long, interruptible wait in the modeset
> >   path while the offline composite happens.  But I think we have other
> >   interruptible waits in that path already.
> > 
> > - Userspace could potentially do something else besides use the HVS to
> >   get the fallback done.  Video would have to use the HVS, to get the
> >   same scaling filters applied as the previous frame where things *did*
> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
> >   which would have more BW available to it than what you're borrowing
> >   from the previous frame's HVS capacity.
> > 
> > - Userspace could potentially use the offline composite interface for
> >   things besides just the running-out-of-bandwidth case.  Like, it was
> >   doing a nicely-filtered downscale of an overlaid video, then the user
> >   hit pause and walked away: you could have a timeout that noticed that
> >   the complicated scene hadn't changed in a while, and you'd drop from
> >   overlays to a HVS-composited single plane to reduce power.
> > 
> > The third one is the one I've actually found kind of compelling, and
> > might be switching me from wanting no userspace visibility into the
> > fallback.  But I don't have a good feel for how much complexity there is
> > to our descriptions of planes, and how much poorly-tested interface we'd
> > be adding to support this usecase.
> 
> Compositor should already do a rough bw guesstimate and, if stuff doesn't
> change any more, bake the entire scene into a single framebuffer. The exact
> same issue happens on more usual hw with video overlays, too.
> 
> Ofc if it turns out that scanning out your yuv planes is less bw, then the
> overlay shouldn't be stopped. But imo there's nothing special here for
> the rpi.
>  
> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
> > understanding of the bandwidth equation is that you're mostly counting
> > the number of pixels that have to be read, and clipped-out pixels
> > because somebody's overlaid on top of you don't count unless they're in
> > the same burst read. ...

How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Pekka Paalanen
On Mon, 11 Aug 2014 07:37:18 -0700
Matt Roper  wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> > 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only a few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> 
> Is your requirement that this needs to be transparent to all userspace
> or just transparent to your display server (e.g., Weston)?  I'm
> wondering whether it might be easier to write a libdrm interposer that
> intercepts any libdrm calls dealing with planes and exposes a bunch of
> additional "virtual" planes to the display server when queried.  When
> you submit an atomic ioctl, your interposer will figure out the best
> strategy to make that happen given the real hardware available on your
> system and will try to blend some of your excess buffers via whatever
> userspace APIs are available (Cairo, GLES, OpenVG, etc.).  This would
> keep kernel complexity down and allow easier debugging and tuning.
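A rough sketch of what such an interposer's entry point might look like, to
make the idea concrete. VIRT_PLANE_BASE and VIRT_PLANE_COUNT are invented,
and a real shim would also have to intercept the atomic commit path and
composite the excess "virtual" planes itself in userspace.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <xf86drmMode.h>

#define VIRT_PLANE_BASE  0x40000000u	/* id range assumed unused */
#define VIRT_PLANE_COUNT 84

/* LD_PRELOAD override: report the kernel's real planes plus some virtual
 * ones the shim will blend itself before committing to real hardware. */
drmModePlaneResPtr drmModeGetPlaneResources(int fd)
{
	static drmModePlaneResPtr (*real)(int);
	drmModePlaneResPtr res;
	uint32_t *planes, i, n;

	if (!real)
		real = dlsym(RTLD_NEXT, "drmModeGetPlaneResources");

	res = real(fd);
	if (!res)
		return NULL;

	/* Append the virtual plane ids after the real ones. */
	n = res->count_planes + VIRT_PLANE_COUNT;
	planes = malloc(n * sizeof(*planes));
	if (!planes) {
		drmModeFreePlaneResources(res);
		return NULL;
	}
	memcpy(planes, res->planes, res->count_planes * sizeof(*planes));
	for (i = 0; i < VIRT_PLANE_COUNT; i++)
		planes[res->count_planes + i] = VIRT_PLANE_BASE + i;

	free(res->planes);
	res->planes = planes;
	res->count_planes = n;
	return res;
}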

That's an inventive proposition. ;-)

I would still need to design the kernel/user ABI for the HVS (the 2D
engine). As I am starting to believe that the "non-real-time" use of
the HVS does not belong behind the KMS API, we might as well just do
things more properly, and expose it with a real user space API
eventually.


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Pekka Paalanen
On Mon, 11 Aug 2014 09:32:32 -0400
Rob Clark  wrote:

> On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter  wrote:
> > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> What if I cannot even pick a maximum number of planes, but wanted to
> >> (as the hardware allows) let the 2D compositing scale up basically
> >> unlimited while becoming just slower and slower?
> >>
> >> I think at that point one would be looking at a rendering API really,
> >> rather than a KMS API, so it's probably out of scope. Where is the line
> >> between KMS 2D compositing with planes vs. 2D composite rendering?
> >
> > I think kms should still be real-time compositing - if you have to
> > internally render to a buffer and then scan that one out due to lack of
> > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > stuff like writeback buffers blurs that a bit. But hw writeback is still
> > real-time.
> 
> not really sure how much of this is exposed to the cpu side, vs hidden
> on coproc..
> 
> but I tend to think it would be nice for compositors (userspace) to
> know explicitly what is going on..  ie. if some layers are blended via
> intermediate buffer, couldn't that intermediate buffer be potentially
> re-used on next frame if not damaged?

Very true, and I think that speaks for exposing the HVS explicitly to
user space to be directly used. That way I believe the user space could
track damage and composite only the minimum, rather than everything
every time which I suppose the KMS API approach would imply.

We don't have dirty regions in KMS API/props, do we? But yeah, that is
starting to feel like a stretch to push through KMS.

> >> Should I really be designing a driver-specific compositing API instead,
> >> similar to what the Mesa OpenGL implementations use? Then have user
> >> space maybe use the user space driver part via OpenWFC perhaps?
> >> And when I mention OpenWFC, you probably notice that I am not aware of
> >> any standard user space API I could be implementing here. ;-)
> >
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double that a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> >
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> I kind of suspect someone should really just design weston2d, an api
> more explicitly for compositing.. model after OpenWFC if that fits
> nicely.  Or not if it doesn't.  Or just use the existing weston
> front-end/back-end split..
> 
> I expect other wayland compositors would want more or less the same
> thing as weston (barring pre-existing layer-cake mess..  cough, cough,
> cogl/clutter/gnome-shell..)
> 
> We could even make a gallium statetracker implementation of weston2d
> to get some usage on desktop..

Yeah. I suppose I should aim for whatever driver-specific
interface we need for the HVS to be used from user space, use that in
Weston, and get a feeling of what might be a nice, driver-agnostic 2D
compositing API.


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Pekka Paalanen
On Mon, 11 Aug 2014 17:35:31 +0200
Daniel Vetter  wrote:

> Well for other drivers/stacks we'd fall back to GL compositing. pixman
> would obviously be terrible. Curious question: Can you provoke the
> hw/firmware to render into arbitrary buffers or does it only work together
> with real display outputs?

Since we have been talking about on-line (direct to output) and
off-line (buffer target) use of the HVS (2D compositing engine), it
should be able to do both I think.

> So I guess the real question is: What kind of interface does videocore
> provide? Note that kms framebuffers are super-flexible and you're free to
> add your own ioctl for special framebuffers which are rendered live by the
> vc. So that might be a possible way to expose this if you can't tell the
> vc which buffers to render into explicitly.

Right. I don't know the HVS details yet, but I'm hoping we can tell
it to render into a custom buffer, like the 3D core can.

This discussion is very helpful btw; I'm starting to see some possible
plans.


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Daniel Vetter
On Tue, Aug 12, 2014 at 9:20 AM, Pekka Paalanen  wrote:
>> but I tend to think it would be nice for compositors (userspace) to
>> know explicitly what is going on..  ie. if some layers are blended via
>> intermediate buffer, couldn't that intermediate buffer be potentially
>> re-used on next frame if not damaged?
>
> Very true, and I think that speaks for exposing the HVS explicitly to
> user space to be directly used. That way I believe the user space could
> track damage and composite only the minimum, rather than everything
> every time which I suppose the KMS API approach would imply.
>
> We don't have dirty regions in KMS API/props, do we? But yeah, that is
> starting to feel like a stretch to push through KMS.

We have the dirty-ioctl, but imo it's a bit misdesigned: It works at
the framebuffer level (so the driver always has to figure out which
crtc/plane this is about), and it only works for frontbuffer
rendering. It was essentially a single-purpose thing for udl uploads.

But in general I think it would make tons of sense to supply a
per-crtc (or maybe per-plane) damage rect with nuclear flips. Both
mipi dsi and edp have provisions to upload a subrect, so this could be
useful in general. And decent compositors compute this already anyway.
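For reference, the dirty-ioctl criticized above is reachable from userspace
as drmModeDirtyFB(); a minimal use looks like the sketch below (fb_id and
the rectangle are example values). Note that the call really does target a
framebuffer, not a crtc or plane, which is exactly the design wart named.

#include <stdint.h>
#include <xf86drmMode.h>

static int flush_damage(int drm_fd, uint32_t fb_id,
			uint16_t x1, uint16_t y1, uint16_t x2, uint16_t y2)
{
	/* One clip rect; x2/y2 are the exclusive bottom-right corner. */
	drmModeClip clip = { .x1 = x1, .y1 = y1, .x2 = x2, .y2 = y2 };

	/* The driver has to work out which crtc/plane scans this fb out. */
	return drmModeDirtyFB(drm_fd, fb_id, &clip, 1);
}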
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


How to design a DRM KMS driver exposing 2D compositing?

2014-08-12 Thread Eric Anholt
Pekka Paalanen  writes:

> On Mon, 11 Aug 2014 19:27:45 +0200
> Daniel Vetter  wrote:
>
>> On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
>> > Daniel Vetter  writes:
>> > 
>> > > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> > >> Hi,
>> > >> 
>> > >> there is some hardware that can do 2D compositing with an arbitrary
>> > >> number of planes. I'm not sure what the absolute maximum number of
>> > >> planes is, but for the discussion, let's say it is 100.
>> > >> 
>> > >> There are many complicated, dynamic constraints on how many, what size,
>> > >> etc. planes can be used at once. A driver would be able to check those
>> > >> before kicking the 2D compositing engine.
>> > >> 
>> > >> The 2D compositing engine in the best case (only a few planes used) is
>> > >> able to composite on the fly in scanout, just like the usual overlay
>> > >> hardware blocks in CRTCs. When the composition complexity goes up, the
>> > >> driver can fall back to compositing into a buffer rather than on the
>> > >> fly in scanout. This fallback needs to be completely transparent to the
>> > >> user space, implying only additional latency if anything.
>> > >> 
>> > >> These 2D compositing features should be exposed to user space through a
>> > >> standard kernel ABI, hopefully an existing ABI in the very near future
>> > >> like the KMS atomic.
>> > >
>> > > I presume we're talking about the video core from raspi? Or at least
>> > > something similar?
>> > 
>> > Pekka wasn't sure if things were confidential here, but I can say it:
>> > Yeah, it's the RPi.
>> > 
>> > While I haven't written code using the compositor interface (I just did
>> > enough to shim in a single plane for bringup, and I'm hoping Pekka and
>> > company can handle the rest for me :) ), my understanding is that the
>> > way you make use of it is that you've got your previous frame loaded up
>> > in the HVS (the plane compositor hardware), then when you're asked to
>> > put up a new frame that's going to be too hard, you take some
>> > complicated chunk of your scene and ask the HVS to use any spare
>> > bandwidth it has while it's still scanning out the previous frame in
>> > order to composite that piece of new scene into memory.  Then, when it's
>> > done with the offline composite, you ask the HVS to do the next scanout
>> > frame using the original scene with the pre-composited temporary buffer.
>> > 
>> > I'm pretty comfortable with the idea of having some large number of
>> > planes preallocated, and deciding that "nobody could possibly need more
>> > than 16" (or whatever).
>> > 
>> > My initial reaction to "we should just punt when we run out of bandwidth
>> > and have a special driver interface for offline composite" was "that's
>> > awful, when the kernel could just get the job done immediately, and
>> > easily, and it would know exactly what it needed to composite to get
>> > things to fit (unlike userspace)".  I'm trying to come up with what
>> > benefit there would be to having a separate interface for offline
>> > composite.  I've got 3 things:
>> > 
>> > - Avoids having a potentially long, interruptible wait in the modeset
>> >   path while the offline composite happens.  But I think we have other
>> >   interruptible waits in that path already.
>> > 
>> > - Userspace could potentially do something else besides use the HVS to
>> >   get the fallback done.  Video would have to use the HVS, to get the
>> >   same scaling filters applied as the previous frame where things *did*
>> >   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>> >   which would have more BW available to it than what you're borrowing
>> >   from the previous frame's HVS capacity.
>> > 
>> > - Userspace could potentially use the offline composite interface for
>> >   things besides just the running-out-of-bandwidth case.  Like, it was
>> >   doing a nicely-filtered downscale of an overlaid video, then the user
>> >   hit pause and walked away: you could have a timeout that noticed that
>> >   the complicated scene hadn't changed in a while, and you'd drop from
>> >   overlays to a HVS-composited single plane to reduce power.
>> > 
>> > The third one is the one I've actually found kind of compelling, and
>> > might be switching me from wanting no userspace visibility into the
>> > fallback.  But I don't have a good feel for how much complexity there is
>> > to our descriptions of planes, and how much poorly-tested interface we'd
>> > be adding to support this usecase.
>> 
>> Compositor should already do a rough bw guesstimate and, if stuff doesn't
>> change any more, bake the entire scene into a single framebuffer. The exact
>> same issue happens on more usual hw with video overlays, too.
>> 
>> Ofc if it turns out that scanning out your yuv planes is less bw, then the
>> overlay shouldn't be stopped. But imo there's nothing special here for
>> the rpi.
>>  
>> > (Because, honestly, I don't expect the fallbacks to be hit much -- my
>> > understanding of the bandwidth equation is that you're mostly counting
>> > the number of pixels that have to be read ...)

How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Daniel Vetter
On Mon, Aug 11, 2014 at 10:16:24AM -0700, Eric Anholt wrote:
> Daniel Vetter  writes:
> 
> > On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> >> Hi,
> >> 
> >> there is some hardware that can do 2D compositing with an arbitrary
> >> number of planes. I'm not sure what the absolute maximum number of
> >> planes is, but for the discussion, let's say it is 100.
> >> 
> >> There are many complicated, dynamic constraints on how many, what size,
> >> etc. planes can be used at once. A driver would be able to check those
> >> before kicking the 2D compositing engine.
> >> 
> >> The 2D compositing engine in the best case (only a few planes used) is
> >> able to composite on the fly in scanout, just like the usual overlay
> >> hardware blocks in CRTCs. When the composition complexity goes up, the
> >> driver can fall back to compositing into a buffer rather than on the
> >> fly in scanout. This fallback needs to be completely transparent to the
> >> user space, implying only additional latency if anything.
> >> 
> >> These 2D compositing features should be exposed to user space through a
> >> standard kernel ABI, hopefully an existing ABI in the very near future
> >> like the KMS atomic.
> >
> > I presume we're talking about the video core from raspi? Or at least
> > something similar?
> 
> Pekka wasn't sure if things were confidential here, but I can say it:
> Yeah, it's the RPi.
> 
> While I haven't written code using the compositor interface (I just did
> enough to shim in a single plane for bringup, and I'm hoping Pekka and
> company can handle the rest for me :) ), my understanding is that the
> way you make use of it is that you've got your previous frame loaded up
> in the HVS (the plane compositor hardware), then when you're asked to
> put up a new frame that's going to be too hard, you take some
> complicated chunk of your scene and ask the HVS to use any spare
> bandwidth it has while it's still scanning out the previous frame in
> order to composite that piece of new scene into memory.  Then, when it's
> done with the offline composite, you ask the HVS to do the next scanout
> frame using the original scene with the pre-composited temporary buffer.
> 
> I'm pretty comfortable with the idea of having some large number of
> planes preallocated, and deciding that "nobody could possibly need more
> than 16" (or whatever).
> 
> My initial reaction to "we should just punt when we run out of bandwidth
> and have a special driver interface for offline composite" was "that's
> awful, when the kernel could just get the job done immediately, and
> easily, and it would know exactly what it needed to composite to get
> things to fit (unlike userspace)".  I'm trying to come up with what
> benefit there would be to having a separate interface for offline
> composite.  I've got 3 things:
> 
> - Avoids having a potentially long, interruptible wait in the modeset
>   path while the offline composite happens.  But I think we have other
>   interruptible waits in that path already.
> 
> - Userspace could potentially do something else besides use the HVS to
>   get the fallback done.  Video would have to use the HVS, to get the
>   same scaling filters applied as the previous frame where things *did*
>   fit, but I guess you could composite some 1:1 RGBA overlays in GL,
>   which would have more BW available to it than what you're borrowing
>   from the previous frame's HVS capacity.
> 
> - Userspace could potentially use the offline composite interface for
>   things besides just the running-out-of-bandwidth case.  Like, it was
>   doing a nicely-filtered downscale of an overlaid video, then the user
>   hit pause and walked away: you could have a timeout that noticed that
>   the complicated scene hadn't changed in a while, and you'd drop from
>   overlays to a HVS-composited single plane to reduce power.
> 
> The third one is the one I've actually found kind of compelling, and
> might be switching me from wanting no userspace visibility into the
> fallback.  But I don't have a good feel for how much complexity there is
> to our descriptions of planes, and how much poorly-tested interface we'd
> be adding to support this usecase.

Compositor should already do a rough bw guesstimate and, if stuff doesn't
change any more, bake the entire scene into a single framebuffer. The exact
same issue happens on more usual hw with video overlays, too.

Ofc if it turns out that scanning out your yuv planes is less bw, then the
overlay shouldn't be stopped. But imo there's nothing special here for
the rpi.
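A compositor-side sketch of that policy, with invented numbers: a crude
bandwidth guesstimate (pixels read per frame) plus an idle counter choosing
between per-plane scanout and baking the scene into a single framebuffer.

#include <stdbool.h>
#include <stdint.h>

enum path { PATH_PLANES, PATH_SINGLE_FB };

struct plane_cfg { uint32_t src_w, src_h; };

static enum path pick_path(const struct plane_cfg *p, unsigned int n,
			   bool scene_changed, uint64_t budget_pixels)
{
	static unsigned int idle_frames;
	uint64_t pixels = 0;
	unsigned int i;

	idle_frames = scene_changed ? 0 : idle_frames + 1;

	/* Rough guesstimate: pixels the engine has to read each frame. */
	for (i = 0; i < n; i++)
		pixels += (uint64_t)p[i].src_w * p[i].src_h;

	/* Bake to one fb when over budget, or when the scene has been
	 * static for ~1 s at 60 Hz and a flattened plane saves power. */
	if (pixels > budget_pixels || idle_frames > 60)
		return PATH_SINGLE_FB;

	return PATH_PLANES;
}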

> (Because, honestly, I don't expect the fallbacks to be hit much -- my
> understanding of the bandwidth equation is that you're mostly counting
> the number of pixels that have to be read, and clipped-out pixels
> because somebody's overlaid on top of you don't count unless they're in
> the same burst read.  So unless people are going nuts with blending in
> overlays, or downscaled video, it's probably not a ...

How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Daniel Vetter
On Mon, Aug 11, 2014 at 07:09:11PM +0300, Ville Syrjälä wrote:
> On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
> > On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > > > What if I cannot even pick a maximum number of planes, but wanted to
> > > > > (as the hardware allows) let the 2D compositing scale up basically
> > > > > unlimited while becoming just slower and slower?
> > > > > 
> > > > > I think at that point one would be looking at a rendering API really,
> > > > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > > > 
> > > > I think kms should still be real-time compositing - if you have to
> > > > internally render to a buffer and then scan that one out due to lack of
> > > > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > > > stuff like writeback buffers blurs that a bit. But hw writeback is still
> > > > real-time.
> > > 
> > > Agreed, that's a good and clear definition, even if it might make my
> > > life harder.
> > > 
> > > I'm still not completely sure, that using an intermediate buffer means
> > > sacrificing real-time (i.e. being able to hit the next vblank the user
> > > space is aiming for) performance, maybe the 2D engine output rate
> > > fluctuates so that the scanout block would have problems but a buffer
> > > can still be completed in time. Anyway, details.
> > > 
> > > Would using an intermediate buffer be ok if we can still maintain
> > > real-time? That is, say, if a compositor kicks the atomic update e.g.
> > > 7 ms before vblank, we would still hit it even with the intermediate
> > > buffer? If that is actually possible, I don't know yet.
> > 
> > I guess you could hide this in the kernel if you want. After all the
> > entire point of kms is to shovel the memory management into the kernel
> > driver's responsibility. But I agree with Rob that if there are
> > intermediate buffers, it would be fairly neat to let userspace know about
> > them.
> > 
> > So I don't think the intermediate buffer thing would be a no-go for kms,
> > but I suspect that will only happen when the videocore can't hit the next
> > frame reliably. And that kind of stutter is imo not good for a kms driver.
> > I guess you could forgo vblank timestamp support and just go with
> > super-variable scanout times, but I guess that will make the video
> > playback people unhappy - they already bitch about the sub 1% inaccuracy
> > we have in our hdmi clocks.
> > 
> > > > > Should I really be designing a driver-specific compositing API instead,
> > > > > similar to what the Mesa OpenGL implementations use? Then have user
> > > > > space maybe use the user space driver part via OpenWFC perhaps?
> > > > > And when I mention OpenWFC, you probably notice that I am not aware of
> > > > > any standard user space API I could be implementing here. ;-)
> > > > 
> > > > Personally I'd expose a bunch of planes with kms (enough so that you can
> > > > reap the usual benefits planes bring wrt video-playback and stuff like
> > > > that). So perhaps something in line with what current hw does in hw and
> > > > then double that a bit or twice - 16 planes or so. Your driver would reject
> > > > any requests that need intermediate buffers to store render results. I.e.
> > > > everything that can't be scanned out directly in real-time at about 60fps.
> > > > The fun with kms planes is also that right now we have 0 standards for
> > > > z-ordering and blending. So would need to define that first.
> > > 
> > > I do not yet know where that real-time limit is, but I'm guessing it
> > > could be pretty low. If it is, we might start hitting software
> > > compositing (like Pixman) very often, which is too slow to be usable.
> > 
> > Well for other drivers/stacks we'd fall back to GL compositing. pixman
> > would obviously be terrible. Curious question: Can you provoke the
> > hw/firmware to render into arbitrary buffers or does it only work together
> > with real display outputs?
> > 
> > So I guess the real question is: What kind of interface does videocore
> > provide? Note that kms framebuffers are super-flexible and you're free to
> > add your own ioctl for special framebuffers which are rendered live by the
> > vc. So that might be a possible way to expose this if you can't tell the
> > vc which buffers to render into explicitly.
> 
> We should maybe think about exposing this display engine writeback
> stuff in some decent way. Maybe a property on the crtc (or plane when
> doing per-plane writeback) where you attach a target framebuffer for
> the write. And some virtual connectors/encoders to satisfy the kms API
> requirements.
> 
> With DSI command mode I suppose it would be possible to even mix display
> and writeback uses of the same hardware pipeline so that the writeback
> doesn't disturb the display. ...

How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Ville Syrjälä
On Mon, Aug 11, 2014 at 05:35:31PM +0200, Daniel Vetter wrote:
> On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > > What if I cannot even pick a maximum number of planes, but wanted to
> > > > (as the hardware allows) let the 2D compositing scale up basically
> > > > unlimited while becoming just slower and slower?
> > > > 
> > > > I think at that point one would be looking at a rendering API really,
> > > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > > 
> > > I think kms should still be real-time compositing - if you have to
> > > internally render to a buffer and then scan that one out due to lack of
> > > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > > stuff like writeback buffers blurs that a bit. But hw writeback is still
> > > real-time.
> > 
> > Agreed, that's a good and clear definition, even if it might make my
> > life harder.
> > 
> > I'm still not completely sure that using an intermediate buffer means
> > sacrificing real-time (i.e. being able to hit the next vblank the user
> > space is aiming for) performance, maybe the 2D engine output rate
> > fluctuates so that the scanout block would have problems but a buffer
> > can still be completed in time. Anyway, details.
> > 
> > Would using an intermediate buffer be ok if we can still maintain
> > real-time? That is, say, if a compositor kicks the atomic update e.g.
> > 7 ms before vblank, we would still hit it even with the intermediate
> > buffer? If that is actually possible, I don't know yet.
> 
> I guess you could hide this in the kernel if you want. After all the
> entire point of kms is to shovel the memory management into the kernel
> driver's responsibility. But I agree with Rob that if there are
> intermediate buffers, it would be fairly neat to let userspace know about
> them.
> 
> So I don't think the intermediate buffer thing would be a no-go for kms,
> but I suspect that will only happen when the videocore can't hit the next
> frame reliably. And that kind of stutter is imo not good for a kms driver.
> I guess you could forgo vblank timestamp support and just go with
> super-variable scanout times, but I guess that will make the video
> playback people unhappy - they already bitch about the sub 1% inaccuracy
> we have in our hdmi clocks.
> 
> > > > Should I really be designing a driver-specific compositing API instead,
> > > > similar to what the Mesa OpenGL implementations use? Then have user
> > > > space maybe use the user space driver part via OpenWFC perhaps?
> > > > And when I mention OpenWFC, you probably notice that I am not aware of
> > > > any standard user space API I could be implementing here. ;-)
> > > 
> > > Personally I'd expose a bunch of planes with kms (enough so that you can
> > > reap the usual benefits planes bring wrt video-playback and stuff like
> > > that). So perhaps something in line with what current hw does in hw and
> > > then double that a bit or twice - 16 planes or so. Your driver would reject
> > > any requests that need intermediate buffers to store render results. I.e.
> > > everything that can't be scanned out directly in real-time at about 60fps.
> > > The fun with kms planes is also that right now we have 0 standards for
> > > z-ordering and blending. So would need to define that first.
> > 
> > I do not yet know where that real-time limit is, but I'm guessing it
> > could be pretty low. If it is, we might start hitting software
> > compositing (like Pixman) very often, which is too slow to be usable.
> 
> Well for other drivers/stacks we'd fall back to GL compositing. pixman
> would obviously be terrible. Curious question: Can you provoke the
> hw/firmware to render into arbitrary buffers or does it only work together
> with real display outputs?
> 
> So I guess the real question is: What kind of interface does videocore
> provide? Note that kms framebuffers are super-flexible and you're free to
> add your own ioctl for special framebuffers which are rendered live by the
> vc. So that might be a possible way to expose this if you can't tell the
> vc which buffers to render into explicitly.

We should maybe think about exposing this display engine writeback
stuff in some decent way. Maybe a property on the crtc (or plane when
doing per-plane writeback) where you attach a target framebuffer for
the write. And some virtual connectors/encoders to satisfy the kms API
requirements.

With DSI command mode I suppose it would be possible to even mix display
and writeback uses of the same hardware pipeline so that the writeback
doesn't disturb the display. But I'm not sure there would be any nice way
to expose that in kms. Maybe just expose two crtcs, one for writeback
and one for display and multiplex in the driver.
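From the userspace side, with the atomic API this thread anticipates, that
could reduce to one extra property in an otherwise normal commit. The
property name and hanging it off the crtc are assumptions of this sketch
(mainline much later settled on a WRITEBACK_FB_ID property on dedicated
writeback connectors).

#include <xf86drm.h>
#include <xf86drmMode.h>

static int queue_writeback(int fd, uint32_t crtc_id,
			   uint32_t wb_fb_prop_id, uint32_t dst_fb_id)
{
	drmModeAtomicReq *req = drmModeAtomicAlloc();
	int ret;

	if (!req)
		return -1;

	/* Attach the destination fb for this frame's writeback. */
	drmModeAtomicAddProperty(req, crtc_id, wb_fb_prop_id, dst_fb_id);

	ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
	drmModeAtomicFree(req);
	return ret;
}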

-- 
Ville Syrjälä
Intel OTC


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Daniel Vetter
On Mon, Aug 11, 2014 at 03:47:22PM +0300, Pekka Paalanen wrote:
> > > What if I cannot even pick a maximum number of planes, but wanted to
> > > (as the hardware allows) let the 2D compositing scale up basically
> > > unlimited while becoming just slower and slower?
> > > 
> > > I think at that point one would be looking at a rendering API really,
> > > rather than a KMS API, so it's probably out of scope. Where is the line
> > > between KMS 2D compositing with planes vs. 2D composite rendering?
> > 
> > I think kms should still be real-time compositing - if you have to
> > internally render to a buffer and then scan that one out due to lack of
> > memory bandwidth or so that very much sounds like a rendering api. Ofc
> > stuff like writeback buffers blurs that a bit. But hw writeback is still
> > real-time.
> 
> Agreed, that's a good and clear definition, even if it might make my
> life harder.
> 
> I'm still not completely sure that using an intermediate buffer means
> sacrificing real-time (i.e. being able to hit the next vblank the user
> space is aiming for) performance, maybe the 2D engine output rate
> fluctuates so that the scanout block would have problems but a buffer
> can still be completed in time. Anyway, details.
> 
> Would using an intermediate buffer be ok if we can still maintain
> real-time? That is, say, if a compositor kicks the atomic update e.g.
> 7 ms before vblank, we would still hit it even with the intermediate
> buffer? If that is actually possible, I don't know yet.

I guess you could hide this in the kernel if you want. After all the
entire point of kms is to shovel the memory management into the kernel
driver's responsibility. But I agree with Rob that if there are
intermediate buffers, it would be fairly neat to let userspace know about
them.

So I don't think the intermediate buffer thing would be a no-go for kms,
but I suspect that will only happen when the videocore can't hit the next
frame reliably. And that kind of stutter is imo not good for a kms driver.
I guess you could forgo vblank timestamp support and just go with
super-variable scanout times, but I guess that will make the video
playback people unhappy - they already bitch about the sub 1% inaccuracy
we have in our hdmi clocks.

> > > Should I really be designing a driver-specific compositing API instead,
> > > similar to what the Mesa OpenGL implementations use? Then have user
> > > space maybe use the user space driver part via OpenWFC perhaps?
> > > And when I mention OpenWFC, you probably notice that I am not aware of
> > > any standard user space API I could be implementing here. ;-)
> > 
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double that a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> 
> I do not yet know where that real-time limit is, but I'm guessing it
> could be pretty low. If it is, we might start hitting software
> compositing (like Pixman) very often, which is too slow to be usable.

Well for other drivers/stacks we'd fall back to GL compositing. pixman
would obviously be terrible. Curious question: Can you provoke the
hw/firmware to render into arbitrary buffers or does it only work together
with real display outputs?

So I guess the real question is: What kind of interface does videocore
provide? Note that kms framebuffers are super-flexible and you're free to
add your own ioctl for special framebuffers which are rendered live by the
vc. So that might be a possible way to expose this if you can't tell the
vc which buffers to render into explicitly.
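Such a driver-private framebuffer ioctl might be shaped like the uapi
sketch below; every name, number and field in it is hypothetical.

#include <drm/drm.h>

/* Wrap a live videocore-rendered surface in a kms framebuffer id. */
struct drm_vc_create_live_fb {
	__u32 vc_surface;	/* in: videocore surface handle */
	__u32 width, height;	/* in: size of the rendered output */
	__u32 fb_id;		/* out: framebuffer usable in flips */
};

#define DRM_VC_CREATE_LIVE_FB 0x00
#define DRM_IOCTL_VC_CREATE_LIVE_FB \
	DRM_IOWR(DRM_COMMAND_BASE + DRM_VC_CREATE_LIVE_FB, \
		 struct drm_vc_create_live_fb)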

> Defining z-order and blending sounds like peanuts compared to below.
> 
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> Yeah, that is kind of the worst case, which also seems unavoidable.

Yeah, there's no universal 2d accel standard at all. Which sucks for hw
that can't do full gl.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Daniel Vetter
On Mon, Aug 11, 2014 at 09:32:32AM -0400, Rob Clark wrote:
> On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter  wrote:
> > Personally I'd expose a bunch of planes with kms (enough so that you can
> > reap the usual benefits planes bring wrt video-playback and stuff like
> > that). So perhaps something in line with what current hw does in hw and
> > then double that a bit or twice - 16 planes or so. Your driver would reject
> > any requests that need intermediate buffers to store render results. I.e.
> > everything that can't be scanned out directly in real-time at about 60fps.
> > The fun with kms planes is also that right now we have 0 standards for
> > z-ordering and blending. So would need to define that first.
> >
> > Then expose everything else with a separate api. I guess you'll just end
> > up with per-compositor userspace drivers due to the lack of a widespread
> > 2d api. OpenVG is kinda dead, and cairo might not fit.
> 
> I kind of suspect someone should really just design weston2d, an api
> more explicitly for compositing.. model after OpenWFC if that fits
> nicely.  Or not if it doesn't.  Or just use the existing weston
> front-end/back-end split..
> 
> I expect other wayland compositors would want more or less the same
> thing as weston (barring pre-existing layer-cake mess..  cough, cough,
> cogl/clutter/gnome-shell..)
> 
> We could even make a gallium statetracker implementation of weston2d
> to get some usage on desktop..

There's vega already in mesa. It just looks terribly unused.
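Purely for illustration, the rough shape such a "weston2d" compositing API
might take; every name in this sketch is invented.

#include <stdint.h>

struct w2d_buffer;			/* dmabuf/GEM-backed image */

struct w2d_layer {
	struct w2d_buffer *src;
	int32_t dst_x, dst_y;		/* placement on the target */
	uint32_t dst_w, dst_h;		/* scaled destination size */
	float alpha;			/* z-order by array position */
};

/* Composite n layers into dst; the returned fd is a fence to wait on
 * before using dst in a pageflip. */
int w2d_composite(struct w2d_buffer *dst,
		  const struct w2d_layer *layers, unsigned int n);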
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Pekka Paalanen
On Mon, 11 Aug 2014 14:14:56 +0100
Damien Lespiau  wrote:

> On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
> > > there is some hardware that can do 2D compositing with an arbitrary
> > > > number of planes. I'm not sure what the absolute maximum number of
> > > > planes is, but for the discussion, let's say it is 100.
> > > > 
> > > > There are many complicated, dynamic constraints on how many, what size,
> > > > etc. planes can be used at once. A driver would be able to check those
> > > > before kicking the 2D compositing engine.
> > > > 
> > > The 2D compositing engine in the best case (only a few planes used) is
> > > > able to composite on the fly in scanout, just like the usual overlay
> > > > hardware blocks in CRTCs. When the composition complexity goes up, the
> > > > driver can fall back to compositing into a buffer rather than on the
> > > > fly in scanout. This fallback needs to be completely transparent to the
> > > > user space, implying only additional latency if anything.
> > > 
> > > This looks like a fallback that would use GL to compose the intermediate
> > > buffer. Any reason why that fallback can't be kicked from userspace?
> > 
> > It is not GL, and GL might not be available or desirable. It is still
> > the same 2D compositing engine in hardware, but now running with an
> > off-screen target buffer, because it can no longer keep up with the
> > continuous pixel rate that the direct scanout would need.
> 
> I didn't mean this was GL, but just making the parallel, ie. we wouldn't
> put a GL fallback into the kernel.
> 
> > If we were to use the 2D compositing engine from user space, we would
> > be on the road to OpenWFC. IOW, there is no standard API for the
> > user space to use yet, as far as I'm aware. ;-)
> > 
> > I'm just trying to avoid having to design a kernel driver ABI for a
> > user space driver, then design/implement some standard user space
> > API on top, and then go fix all compositors to actually use it instead
> > of / with KMS.
> 
> It's no easy trade-off. For instance, if the compositor doesn't know
> about some of the hw constraints you are talking about, it may ask the
> kernel for a configuration that suddenly will only allow 20 fps updates
> (because of the bw limitation you're mentioning). And the compositor
> just wouldn't know.

Sure, but it would still be much better than the actual fallback in the
compositor in user space, if we cannot drive the 2D engine from user
space.

KMS works the same way already: if you have GL rendering that just
runs for too long, your final pageflip using it will implicitly get
delayed that much. Does it not?

> I can only speak for the hw I know, if you want to squeeze everything
> you can from that simple (compared to the one you're talking about)
> display hw, there's no choice, the compositor needs to know about the
> constraints to make clever decisions (that's what we do on Android). But
> then the appeal of a common interface is understandable.
> 
> (An answer that doesn't actually say anything interesting, oh well),

Yeah... so it comes down to deciding at what point the kernel driver
will say "this won't fly, do something else". And danvet has a pretty
solid answer to that, I think.


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Pekka Paalanen
Hi Daniel,

you make perfect sense as usual. :-)
Comments below.

On Mon, 11 Aug 2014 14:06:36 +0200
Daniel Vetter  wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> > 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only a few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> > 
> > These 2D compositing features should be exposed to user space through a
> > standard kernel ABI, hopefully an existing ABI in the very near future
> > like the KMS atomic.
> 
> I presume we're talking about the video core from raspi? Or at least
> something similar?

Yes.

> > Assuming the DRM universal planes and atomic mode setting / page flip
> > infrastructure is in place, could the 2D compositing capabilities be
> > exposed through universal planes? We can assume that plane properties
> > are enough to describe all the compositing parameters.
> > 
> > Atomic updates are needed so that the complicated constraints can be
> > checked, and user space can try to reduce the composition complexity if
> > the kernel driver sees that it won't work.
> > 
> > Would it be feasible to generate a hundred identical non-primary planes
> > to be exposed to user space via DRM?
> > 
> > If that could be done, the kernel driver could just use the existing
> > kernel/user ABIs without having to invent something new, and programs
> > like a Wayland compositor would not need to be coded specifically for
> > this hardware.
> > 
> > What problems do you see with this plan?
> > Are any of those problems unfixable or simply prohibitive?
> > 
> > I have some concerns, which I am not sure will actually be a problem:
> > - Does allocating 100 planes eat too much kernel memory?
> >   I mean just the bookkeeping, properties, etc.
> > - Would such an amount of planes make some in-kernel algorithms slow
> >   (particularly in DRM common code)?
> > - Considering how user space discovers all DRM resources, would this
> >   make a compositor "slow" to start?
> 
> I don't see any problem with that. We have a few plane-loops, but iirc
> those can be easily fixed to use indices and similar stuff. The atomic
> ioctl itself should scale nicely.

Very nice.
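To put a number on the bookkeeping worry: per plane the cost is roughly one
plane struct plus its properties. A driver-side sketch of exposing them all,
using the 3.15-era universal-planes helper; hvs_plane_funcs, hvs_formats and
MAX_HVS_PLANES are placeholders, not real vc4 code.

#include <drm/drmP.h>
#include <drm/drm_crtc.h>
#include <drm/drm_fourcc.h>

#define MAX_HVS_PLANES 100

static const uint32_t hvs_formats[] = {
	DRM_FORMAT_XRGB8888,
	DRM_FORMAT_YUV420,
};

static const struct drm_plane_funcs hvs_plane_funcs;	/* placeholder */

static int hvs_init_planes(struct drm_device *dev, unsigned long crtc_mask)
{
	int i, ret;

	for (i = 0; i < MAX_HVS_PLANES; i++) {
		struct drm_plane *plane;

		plane = devm_kzalloc(dev->dev, sizeof(*plane), GFP_KERNEL);
		if (!plane)
			return -ENOMEM;

		/* All planes are identical; the real constraints are
		 * enforced later, at atomic check time, not at init. */
		ret = drm_universal_plane_init(dev, plane, crtc_mask,
					       &hvs_plane_funcs,
					       hvs_formats,
					       ARRAY_SIZE(hvs_formats),
					       DRM_PLANE_TYPE_OVERLAY);
		if (ret)
			return ret;
	}

	return 0;
}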

> > I suppose whether these turn out to be prohibitive or not, one just has
> > to implement it and see. It should be usable on a slowish CPU with
> > unimpressive amounts of RAM, because that is where a separate 2D
> > compositing engine gives the most kick.
> > 
> > FWIW, dynamically created/destroyed planes would probably not be the
> > answer. The kernel driver cannot decide before-hand how many planes it
> > can expose. How many planes can be used depends completely on how user
> > space decides to use them. Therefore I believe it should expose the
> > maximum number always, whether there is any real use case that could
> > actually get them all running or not.
> 
> Yeah, dynamic planes don't sound like a nice solution, not least because
> you'll get to audit piles of code. Currently really only framebuffers (and
> to some extent connectors) can come and go freely in kms-land.

Yup, thought so.

> > What if I cannot even pick a maximum number of planes, but wanted to
> > (as the hardware allows) let the 2D compositing scale up basically
> > unlimited while becoming just slower and slower?
> > 
> > I think at that point one would be looking at a rendering API really,
> > rather than a KMS API, so it's probably out of scope. Where is the line
> > between KMS 2D compositing with planes vs. 2D composite rendering?
> 
> I think kms should still be real-time compositing - if you have to
> internally render to a buffer and then scan that one out due to lack of
> memory bandwidth or so that very much sounds like a rendering api. Ofc
> stuff like writeback buffers blurs that a bit. But hw writeback is still
> real-time.

Agreed, that's a good and clear definition, even if it might make my
life harder.

I'm still not completely sure that using an intermediate buffer means
sacrificing real-time (i.e. being able to hit the next vblank the user
space is aiming for) performance, maybe the 2D engine output rate
fluctuates so that the scanout block would have problems but a buffer
can still be completed in time. Anyway, details.

Would using an intermediate buffer be ok if we can still maintain
real-time? ...
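Wherever that real-time limit ends up, the constraint check itself would
live in the driver's atomic check hook, roughly as below. Helper names are
from somewhat later kernels than this thread, and hvs_plane_cost() plus
HVS_REALTIME_BUDGET are an invented cost model.

#include <drm/drm_atomic.h>
#include <drm/drm_crtc.h>

#define HVS_REALTIME_BUDGET 1000000ULL	/* invented per-frame budget */

static u64 hvs_plane_cost(const struct drm_plane_state *ps)
{
	/* Crude model: cost ~ source pixels fetched (src_* is 16.16 fixed
	 * point). Real hw adds scaling, format and burst-read terms. */
	return ((u64)ps->src_w >> 16) * (ps->src_h >> 16);
}

static int hvs_crtc_atomic_check(struct drm_crtc *crtc,
				 struct drm_crtc_state *state)
{
	struct drm_atomic_state *s = state->state;
	struct drm_plane *plane;
	u64 cost = 0;

	drm_for_each_plane_mask(plane, crtc->dev, state->plane_mask) {
		struct drm_plane_state *ps =
			drm_atomic_get_existing_plane_state(s, plane);

		if (ps)
			cost += hvs_plane_cost(ps);
	}

	/* Over budget: the atomic ioctl fails and user space can retry
	 * with a simpler scene, as proposed above. */
	if (cost > HVS_REALTIME_BUDGET)
		return -EINVAL;

	return 0;
}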

How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Pekka Paalanen
On Mon, 11 Aug 2014 11:57:10 +0100
Damien Lespiau  wrote:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> > Hi,
> 
> Hi,
> 
> > there is some hardware that can do 2D compositing with an arbitrary
> > number of planes. I'm not sure what the absolute maximum number of
> > planes is, but for the discussion, let's say it is 100.
> > 
> > There are many complicated, dynamic constraints on how many, what size,
> > etc. planes can be used at once. A driver would be able to check those
> > before kicking the 2D compositing engine.
> > 
> > The 2D compositing engine in the best case (only a few planes used) is
> > able to composite on the fly in scanout, just like the usual overlay
> > hardware blocks in CRTCs. When the composition complexity goes up, the
> > driver can fall back to compositing into a buffer rather than on the
> > fly in scanout. This fallback needs to be completely transparent to the
> > user space, implying only additional latency if anything.
> 
> This looks like a fallback that would use GL to compose the intermediate
> buffer. Any reason why that fallback can't be kicked from userspace?

It is not GL, and GL might not be available or desirable. It is still
the same 2D compositing engine in hardware, but now running with an
off-screen target buffer, because it can no longer keep up with the
continuous pixel rate that the direct scanout would need.

If we were to use the 2D compositing engine from user space, we would
be on the road to OpenWFC. IOW, there is no standard API for the
user space to use yet, as far as I'm aware. ;-)

I'm just trying to avoid having to design a kernel driver ABI for a
user space driver, then design/implement some standard user space
API on top, and then go fix all compositors to actually use it instead
of / with KMS.


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Damien Lespiau
On Mon, Aug 11, 2014 at 03:07:33PM +0300, Pekka Paalanen wrote:
> > > there is some hardware that can do 2D compositing with an arbitrary
> > > number of planes. I'm not sure what the absolute maximum number of
> > > planes is, but for the discussion, let's say it is 100.
> > > 
> > > There are many complicated, dynamic constraints on how many, what size,
> > > etc. planes can be used at once. A driver would be able to check those
> > > before kicking the 2D compositing engine.
> > > 
> > > The 2D compositing engine in the best case (only few planes used) is
> > > able to composite on the fly in scanout, just like the usual overlay
> > > hardware blocks in CRTCs. When the composition complexity goes up, the
> > > driver can fall back to compositing into a buffer rather than on the
> > > fly in scanout. This fallback needs to be completely transparent to the
> > > user space, implying only additional latency if anything.
> > 
> > This looks like a fallback that would use GL to compose the intermediate
> > buffer. Any reason why that fallback can't be kicked from userspace?
> 
> It is not GL, and GL might not be available or desirable. It is still
> the same 2D compositing engine in hardware, but now running with an
> off-screen target buffer, because it can no longer keep up with the
> continuous pixel rate that the direct scanout would need.

I didn't mean this was GL, but just making the parallel, ie. we wouldn't
put a GL fallback into the kernel.

> If we were to use the 2D compositing engine from user space, we would
> be on the road to OpenWFC. IOW, there is no standard API for the
> user space to use yet, as far as I'm aware. ;-)
> 
> I'm just trying to avoid having to design a kernel driver ABI for a
> user space driver, then design/implement some standard user space
> API on top, and then go fix all compositors to actually use it instead
> of / with KMS.

It's no easy trade-off. For instance, if the compositor doesn't know
about some of the hw constraints you are talking about, it may ask the
kernel for a configuration that suddenly will only allow 20 fps updates
(because of the bw limitation you're mentioning). And the compositor
just wouldn't know.

I can only speak for the hw I know: if you want to squeeze everything
you can from that simple (compared to the one you're talking about)
display hw, there's no choice, the compositor needs to know about the
constraints to make clever decisions (that's what we do on Android). But
then the appeal of a common interface is understandable.

(An answer that doesn't actually say anything interesting, oh well),

-- 
Damien


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Daniel Vetter
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,
> 
> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.
> 
> These 2D compositing features should be exposed to user space through a
> standard kernel ABI, hopefully an existing ABI in the very near future
> like the KMS atomic.

I presume we're talking about the video core from raspi? Or at least
something similar?

> Assuming the DRM universal planes and atomic mode setting / page flip
> infrastructure is in place, could the 2D compositing capabilities be
> exposed through universal planes? We can assume that plane properties
> are enough to describe all the compositing parameters.
> 
> Atomic updates are needed so that the complicated constraints can be
> checked, and user space can try to reduce the composition complexity if
> the kernel driver sees that it won't work.
> 
> Would it be feasible to generate a hundred identical non-primary planes
> to be exposed to user space via DRM?
> 
> If that could be done, the kernel driver could just use the existing
> kernel/user ABIs without having to invent something new, and programs
> like a Wayland compositor would not need to be coded specifically for
> this hardware.
> 
> What problems do you see with this plan?
> Are any of those problems unfixable or simply prohibitive?
> 
> I have some concerns, which I am not sure will actually be a problem:
> - Does allocating 100 planes eat too much kernel memory?
>   I mean just the bookkeeping, properties, etc.
> - Would such an amount of planes make some in-kernel algorithms slow
>   (particularly in DRM common code)?
> - Considering how user space discovers all DRM resources, would this
>   make a compositor "slow" to start?

I don't see any problem with that. We have a few plane-loops, but iirc
those can be easily fixed to use indices and similar stuff. The atomic
ioctl itself should scale nicely.

> I suppose whether these turn out to be prohibitive or not, one just has
> to implement it and see. It should be usable on a slowish CPU with
> unimpressive amounts of RAM, because that is where a separate 2D
> compositing engine gives the most kick.
> 
> FWIW, dynamically created/destroyed planes would probably not be the
> answer. The kernel driver cannot decide before-hand how many planes it
> can expose. How many planes can be used depends completely on how user
> space decides to use them. Therefore I believe it should expose the
> maximum number always, whether there is any real use case that could
> actually get them all running or not.

Yeah, dynamic planes don't sound like a nice solution, not least because
you'll get to audit piles of code. Currently really only framebuffers (and
to some extent connectors) can come and go freely in kms-land.

> What if I cannot even pick a maximum number of planes, but wanted to
> (as the hardware allows) let the 2D compositing scale up basically
> unlimited while becoming just slower and slower?
> 
> I think at that point one would be looking at a rendering API really,
> rather than a KMS API, so it's probably out of scope. Where is the line
> between KMS 2D compositing with planes vs. 2D composite rendering?

I think kms should still be real-time compositing - if you have to
internally render to a buffer and then scan that one out due to lack of
memory bandwidth or so that very much sounds like a rendering api. Ofc
stuff like writeback buffers blurs that a bit. But hw writeback is still
real-time.

> Should I really be designing a driver-specific compositing API instead,
> similar to what the Mesa OpenGL implementations use? Then have user
> space maybe use the user space driver part via OpenWFC perhaps?
> And when I mention OpenWFC, you probably notice, that I am not aware of
> any standard user space API I could be implementing here. ;-)

Personally I'd expose a bunch of planes with kms (enough so that you can
reap the usual benefits planes bring wrt video-playback and stuff like
that). So perhaps something in line with what current hw does, then
doubled once or twice - 16 planes or so. Your driver would reject any
requests that need intermediate buffers to store render results, i.e.
everything that can't be scanned out directly in real-time 

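A driver-side sketch of the rejection Daniel describes, assuming atomic
check callbacks in the shape they later took in the kernel;
estimate_scanout_bandwidth() and MY_HW_REALTIME_BW_LIMIT are made up:

#include <drm/drmP.h>
#include <drm/drm_crtc.h>

static int my_crtc_atomic_check(struct drm_crtc *crtc,
                                struct drm_crtc_state *state)
{
    u64 bw = estimate_scanout_bandwidth(state);

    /* Refuse anything the 2D engine cannot composite live during
     * scanout, instead of silently falling back to a buffer. */
    if (bw > MY_HW_REALTIME_BW_LIMIT)
        return -EINVAL;

    return 0;
}
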
How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Pekka Paalanen
Hi,

there is some hardware that can do 2D compositing with an arbitrary
number of planes. I'm not sure what the absolute maximum number of
planes is, but for the discussion, let's say it is 100.

There are many complicated, dynamic constraints on how many, what size,
etc. planes can be used at once. A driver would be able to check those
before kicking the 2D compositing engine.

The 2D compositing engine in the best case (only few planes used) is
able to composite on the fly in scanout, just like the usual overlay
hardware blocks in CRTCs. When the composition complexity goes up, the
driver can fall back to compositing into a buffer rather than on the
fly in scanout. This fallback needs to be completely transparent to the
user space, implying only additional latency if anything.

These 2D compositing features should be exposed to user space through a
standard kernel ABI, hopefully an existing ABI in the very near future
like the KMS atomic.

Assuming the DRM universal planes and atomic mode setting / page flip
infrastructure is in place, could the 2D compositing capabilities be
exposed through universal planes? We can assume that plane properties
are enough to describe all the compositing parameters.

Atomic updates are needed so that the complicated constraints can be
checked, and user space can try to reduce the composition complexity if
the kernel driver sees that it won't work.

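For illustration, a minimal sketch of that check-and-reduce cycle in
user space, assuming the atomic API in the shape it later took in
libdrm; plane_id and prop_fb_id stand in for object/property ids the
compositor would have looked up:

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static int try_scanout_config(int fd, uint32_t plane_id,
                              uint32_t prop_fb_id, uint32_t fb_id)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    if (!req)
        return -ENOMEM;

    /* One property shown here; a real compositor would add the full
     * state for every plane it wants the 2D engine to composite. */
    drmModeAtomicAddProperty(req, plane_id, prop_fb_id, fb_id);

    /* TEST_ONLY asks the driver to validate the whole configuration,
     * including its dynamic constraints, without touching the hw. */
    ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL);
    if (ret == 0)
        ret = drmModeAtomicCommit(fd, req, DRM_MODE_PAGE_FLIP_EVENT, NULL);
    /* On failure, drop or pre-composite some planes and try again. */

    drmModeAtomicFree(req);
    return ret;
}
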
Would it be feasible to generate a hundred identical non-primary planes
to be exposed to user space via DRM?

If that could be done, the kernel driver could just use the existing
kernel/user ABIs without having to invent something new, and programs
like a Wayland compositor would not need to be coded specifically for
this hardware.

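For what it's worth, a rough kernel-side sketch of what registering
such a set of identical overlay planes could look like with the
universal plane API; the plane count, format list and my_plane_funcs
are made up for illustration:

#include <drm/drmP.h>
#include <drm/drm_crtc.h>
#include <drm/drm_fourcc.h>

#define N_2D_PLANES 100

static const uint32_t plane_formats[] = {
    DRM_FORMAT_XRGB8888,
    DRM_FORMAT_ARGB8888,
};

static int register_2d_planes(struct drm_device *dev,
                              unsigned long possible_crtcs)
{
    int i, ret;

    for (i = 0; i < N_2D_PLANES; i++) {
        struct drm_plane *plane = kzalloc(sizeof(*plane), GFP_KERNEL);

        if (!plane)
            return -ENOMEM;

        /* All planes are identical overlays; the compositing
         * parameters (position, zpos, alpha, ...) would be
         * expressed as plane properties. */
        ret = drm_universal_plane_init(dev, plane, possible_crtcs,
                                       &my_plane_funcs, plane_formats,
                                       ARRAY_SIZE(plane_formats),
                                       DRM_PLANE_TYPE_OVERLAY);
        if (ret) {
            kfree(plane);
            return ret;
        }
    }
    return 0;
}
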
What problems do you see with this plan?
Are any of those problems unfixable or simply prohibitive?

I have some concerns, which I am not sure will actually be a problem:
- Does allocating 100 planes eat too much kernel memory?
  I mean just the bookkeeping, properties, etc.
- Would such an amount of planes make some in-kernel algorithms slow
  (particularly in DRM common code)?
- Considering how user space discovers all DRM resources, would this
  make a compositor "slow" to start? (See the discovery sketch below.)

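For reference, startup discovery would cost roughly one plane-resources
ioctl plus one getplane ioctl per plane; a minimal sketch of the
compositor side:

#include <xf86drm.h>
#include <xf86drmMode.h>

static void enumerate_planes(int fd)
{
    drmModePlaneRes *res;
    uint32_t i;

    /* Expose primary/cursor planes too, not just legacy overlays. */
    drmSetClientCap(fd, DRM_CLIENT_CAP_UNIVERSAL_PLANES, 1);

    res = drmModeGetPlaneResources(fd);
    if (!res)
        return;

    for (i = 0; i < res->count_planes; i++) {
        drmModePlane *plane = drmModeGetPlane(fd, res->planes[i]);

        if (!plane)
            continue;
        /* Inspect plane->possible_crtcs, plane->formats, and the
         * plane properties here. */
        drmModeFreePlane(plane);
    }
    drmModeFreePlaneResources(res);
}
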
I suppose whether these turn out to be prohibitive or not, one just has
to implement it and see. It should be usable on a slowish CPU with
unimpressive amounts of RAM, because that is where a separate 2D
compositing engine gives the most kick.

FWIW, dynamically created/destroyed planes would probably not be the
answer. The kernel driver cannot decide before-hand how many planes it
can expose. How many planes can be used depends completely on how user
space decides to use them. Therefore I believe it should expose the
maximum number always, whether there is any real use case that could
actually get them all running or not.

What if I cannot even pick a maximum number of planes, but wanted to
(as the hardware allows) let the 2D compositing scale up basically
unlimited while becoming just slower and slower?

I think at that point one would be looking at a rendering API really,
rather than a KMS API, so it's probably out of scope. Where is the line
between KMS 2D compositing with planes vs. 2D composite rendering?

Should I really be designing a driver-specific compositing API instead,
similar to what the Mesa OpenGL implementations use? Then have user
space maybe use the user space driver part via OpenWFC perhaps?
And when I mention OpenWFC, you probably notice, that I am not aware of
any standard user space API I could be implementing here. ;-)


Thanks,
pq


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Damien Lespiau
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,

Hi,

> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.

This looks like a fallback that would use GL to compose the intermediate
buffer. Any reason why that fallback can't be kicked from userspace?

-- 
Damien


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Eric Anholt
Daniel Vetter  writes:

> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> Hi,
>> 
>> there is some hardware that can do 2D compositing with an arbitrary
>> number of planes. I'm not sure what the absolute maximum number of
>> planes is, but for the discussion, let's say it is 100.
>> 
>> There are many complicated, dynamic constraints on how many, what size,
>> etc. planes can be used at once. A driver would be able to check those
>> before kicking the 2D compositing engine.
>> 
>> The 2D compositing engine in the best case (only few planes used) is
>> able to composite on the fly in scanout, just like the usual overlay
>> hardware blocks in CRTCs. When the composition complexity goes up, the
>> driver can fall back to compositing into a buffer rather than on the
>> fly in scanout. This fallback needs to be completely transparent to the
>> user space, implying only additional latency if anything.
>> 
>> These 2D compositing features should be exposed to user space through a
>> standard kernel ABI, hopefully an existing ABI in the very near future
>> like the KMS atomic.
>
> I presume we're talking about the video core from raspi? Or at least
> something similar?

Pekka wasn't sure if things were confidential here, but I can say it:
Yeah, it's the RPi.

While I haven't written code using the compositor interface (I just did
enough to shim in a single plane for bringup, and I'm hoping Pekka and
company can handle the rest for me :) ), my understanding is that the
way you make use of it is that you've got your previous frame loaded up
in the HVS (the plane compositor hardware), then when you're asked to
put up a new frame that's going to be too hard, you take some
complicated chunk of your scene and ask the HVS to use any spare
bandwidth it has while it's still scanning out the previous frame in
order to composite that piece of new scene into memory.  Then, when it's
done with the offline composite, you ask the HVS to do the next scanout
frame using the original scene with the pre-composited temporary buffer.

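A pseudo-C sketch of that flow; every hvs_* type and helper here is
hypothetical, purely to label the steps Eric describes:

#include <stdbool.h>

struct hvs_display_list;   /* a scene description fed to the HVS */
struct hvs_buffer;         /* a memory target for offline work */

extern bool hvs_fits_in_scanout(struct hvs_display_list *scene);
extern struct hvs_display_list *
hvs_split_out_complex_part(struct hvs_display_list *scene);
extern struct hvs_buffer *
hvs_composite_offline(struct hvs_display_list *part);
extern void hvs_wait_offline_done(void);
extern void hvs_substitute_buffer(struct hvs_display_list *scene,
                                  struct hvs_display_list *part,
                                  struct hvs_buffer *buf);
extern void hvs_queue_scanout(struct hvs_display_list *scene);

static void flip_to(struct hvs_display_list *next)
{
    if (!hvs_fits_in_scanout(next)) {
        /* 1. Carve out the complicated chunk of the new scene. */
        struct hvs_display_list *part = hvs_split_out_complex_part(next);

        /* 2. Composite it to memory using spare bandwidth, while
         *    the previous frame is still scanning out. */
        struct hvs_buffer *buf = hvs_composite_offline(part);
        hvs_wait_offline_done();

        /* 3. Replace the chunk with one plane showing the
         *    pre-composited buffer. */
        hvs_substitute_buffer(next, part, buf);
    }
    hvs_queue_scanout(next);
}
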
I'm pretty comfortable with the idea of having some large number of
planes preallocated, and deciding that "nobody could possibly need more
than 16" (or whatever).

My initial reaction to "we should just punt when we run out of bandwidth
and have a special driver interface for offline composite" was "that's
awful, when the kernel could just get the job done immediately, and
easily, and it would know exactly what it needed to composite to get
things to fit (unlike userspace)".  I'm trying to come up with what
benefit there would be to having a separate interface for offline
composite.  I've got 3 things:

- Avoids having a potentially long, interruptible wait in the modeset
  path while the offline composite happens.  But I think we have other
  interruptible waits in that path already.

- Userspace could potentially do something else besides use the HVS to
  get the fallback done.  Video would have to use the HVS, to get the
  same scaling filters applied as the previous frame where things *did*
  fit, but I guess you could composite some 1:1 RGBA overlays in GL,
  which would have more BW available to it than what you're borrowing
  from the previous frame's HVS capacity.

- Userspace could potentially use the offline composite interface for
  things besides just the running-out-of-bandwidth case.  Like, it was
  doing a nicely-filtered downscale of an overlaid video, then the user
  hit pause and walked away: you could have a timeout that noticed that
  the complicated scene hadn't changed in a while, and you'd drop from
  overlays to a HVS-composited single plane to reduce power.

The third one is the one I've actually found kind of compelling, and
might be switching me from wanting no userspace visibility into the
fallback.  But I don't have a good feel for how much complexity there is
to our descriptions of planes, and how much poorly-tested interface we'd
be adding to support this usecase.

(Because, honestly, I don't expect the fallbacks to be hit much -- my
understanding of the bandwidth equation is that you're mostly counting
the number of pixels that have to be read, and clipped-out pixels
(because somebody's overlaid on top of you) don't count unless they're in
the same burst read.  So unless people are going nuts with blending in
overlays, or downscaled video, it's probably not a problem, and
something that gets your pixels on the screen at all is sufficient)


How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Rob Clark
On Mon, Aug 11, 2014 at 8:06 AM, Daniel Vetter  wrote:
> On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
>> Hi,
>>
>> there is some hardware that can do 2D compositing with an arbitrary
>> number of planes. I'm not sure what the absolute maximum number of
>> planes is, but for the discussion, let's say it is 100.
>>
>> There are many complicated, dynamic constraints on how many, what size,
>> etc. planes can be used at once. A driver would be able to check those
>> before kicking the 2D compositing engine.
>>
>> The 2D compositing engine in the best case (only few planes used) is
>> able to composite on the fly in scanout, just like the usual overlay
>> hardware blocks in CRTCs. When the composition complexity goes up, the
>> driver can fall back to compositing into a buffer rather than on the
>> fly in scanout. This fallback needs to be completely transparent to the
>> user space, implying only additional latency if anything.
>>
>> These 2D compositing features should be exposed to user space through a
>> standard kernel ABI, hopefully an existing ABI in the very near future
>> like the KMS atomic.
>
> I presume we're talking about the video core from raspi? Or at least
> something similar?
>
>> Assuming the DRM universal planes and atomic mode setting / page flip
>> infrastructure is in place, could the 2D compositing capabilities be
>> exposed through universal planes? We can assume that plane properties
>> are enough to describe all the compositing parameters.
>>
>> Atomic updates are needed so that the complicated constraints can be
>> checked, and user space can try to reduce the composition complexity if
>> the kernel driver sees that it won't work.
>>
>> Would it be feasible to generate a hundred identical non-primary planes
>> to be exposed to user space via DRM?
>>
>> If that could be done, the kernel driver could just use the existing
>> kernel/user ABIs without having to invent something new, and programs
>> like a Wayland compositor would not need to be coded specifically for
>> this hardware.
>>
>> What problems do you see with this plan?
>> Are any of those problems unfixable or simply prohibitive?
>>
>> I have some concerns, which I am not sure will actually be a problem:
>> - Does allocating 100 planes eat too much kernel memory?
>>   I mean just the bookkeeping, properties, etc.
>> - Would such an amount of planes make some in-kernel algorithms slow
>>   (particularly in DRM common code)?
>> - Considering how user space discovers all DRM resources, would this
>>   make a compositor "slow" to start?
>
> I don't see any problem with that. We have a few plane-loops, but iirc
> those can be easily fixed to use indices and similar stuff. The atomic
> ioctl itself should scale nicely.
>
>> I suppose whether these turn out to be prohibitive or not, one just has
>> to implement it and see. It should be usable on a slowish CPU with
>> unimpressive amounts of RAM, because that is where a separate 2D
>> compositing engine gives the most kick.
>>
>> FWIW, dynamically created/destroyed planes would probably not be the
>> answer. The kernel driver cannot decide before-hand how many planes it
>> can expose. How many planes can be used depends completely on how user
>> space decides to use them. Therefore I believe it should expose the
>> maximum number always, whether there is any real use case that could
>> actually get them all running or not.
>
> Yeah, dynamic planes don't sound like a nice solution, not least because
> you'll get to audit piles of code. Currently really only framebuffers (and
> to some extent connectors) can come and go freely in kms-land.
>
>> What if I cannot even pick a maximum number of planes, but wanted to
>> (as the hardware allows) let the 2D compositing scale up basically
>> unlimited while becoming just slower and slower?
>>
>> I think at that point one would be looking at a rendering API really,
>> rather than a KMS API, so it's probably out of scope. Where is the line
>> between KMS 2D compositing with planes vs. 2D composite rendering?
>
> I think kms should still be real-time compositing - if you have to
> internally render to a buffer and then scan that one out due to lack of
> memory bandwidth or so that very much sounds like a rendering api. Ofc
> stuff like writeback buffers blurs that a bit. But hw writeback is still
> real-time.

not really sure how much of this is exposed to the cpu side, vs hidden
on coproc..

but I tend to think it would be nice for compositors (userspace) to
know explicitly what is going on... ie. if some layers are blended via
an intermediate buffer, couldn't that intermediate buffer potentially
be re-used on the next frame if not damaged?

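A compositor-side sketch of that reuse test; the types and helpers are
hypothetical, only the decision logic is meant literally:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct buffer;                   /* a pre-composited intermediate */
struct region;                   /* a screen-space area */

extern bool region_intersects(const struct region *a,
                              const struct region *b);

struct precomposed_cache {
    struct buffer *buf;          /* last offline composite result */
    const struct region *bounds; /* scene area the buffer covers */
    uint32_t scene_serial;       /* scene version it was built from */
};

/* Return the cached intermediate buffer if the part of the scene it
 * covers is unchanged and undamaged this frame, NULL otherwise. */
static struct buffer *
reuse_intermediate(const struct precomposed_cache *c,
                   uint32_t scene_serial, const struct region *damage)
{
    if (c->buf && c->scene_serial == scene_serial &&
        !region_intersects(damage, c->bounds))
        return c->buf;
    return NULL;
}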

>> Should I really be designing a driver-specific compositing API instead,
>> similar to what the Mesa OpenGL implementations use? Then have user
>> space maybe use the user space driver part via OpenWFC perhaps?
>> And when I mention OpenWFC, you probably notice, that I am not aware of
>> any 

How to design a DRM KMS driver exposing 2D compositing?

2014-08-11 Thread Matt Roper
On Mon, Aug 11, 2014 at 01:38:55PM +0300, Pekka Paalanen wrote:
> Hi,
> 
> there is some hardware that can do 2D compositing with an arbitrary
> number of planes. I'm not sure what the absolute maximum number of
> planes is, but for the discussion, let's say it is 100.
> 
> There are many complicated, dynamic constraints on how many, what size,
> etc. planes can be used at once. A driver would be able to check those
> before kicking the 2D compositing engine.
> 
> The 2D compositing engine in the best case (only few planes used) is
> able to composite on the fly in scanout, just like the usual overlay
> hardware blocks in CRTCs. When the composition complexity goes up, the
> driver can fall back to compositing into a buffer rather than on the
> fly in scanout. This fallback needs to be completely transparent to the
> user space, implying only additional latency if anything.

Is your requirement that this needs to be transparent to all userspace
or just transparent to your display server (e.g., Weston)?  I'm
wondering whether it might be easier to write a libdrm interposer that
intercepts any libdrm calls dealing with planes and exposes a bunch of
additional "virtual" planes to the display server when queried.  When
you submit an atomic ioctl, your interposer will figure out the best
strategy to make that happen given the real hardware available on your
system and will try to blend some of your excess buffers via whatever
userspace APIs are available (Cairo, GLES, OpenVG, etc.).  This would
keep kernel complexity down and allow easier debugging and tuning.

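For illustration, a minimal LD_PRELOAD sketch of such an interposer,
wrapping one libdrm entry point; add_virtual_planes() is made up:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <xf86drmMode.h>

/* Hypothetical: appends the interposer's own "virtual" plane ids. */
extern drmModePlaneRes *add_virtual_planes(drmModePlaneRes *res);

drmModePlaneRes *drmModeGetPlaneResources(int fd)
{
    static drmModePlaneRes *(*real)(int);
    drmModePlaneRes *res;

    if (!real)
        real = (drmModePlaneRes *(*)(int))
                   dlsym(RTLD_NEXT, "drmModeGetPlaneResources");

    res = real(fd);
    if (res)
        /* Excess planes get composited in userspace (GL, Cairo, ...)
         * before the request reaches the real kernel planes. */
        res = add_virtual_planes(res);
    return res;
}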

Matt

> These 2D compositing features should be exposed to user space through a
> standard kernel ABI, hopefully an existing ABI in the very near future
> like the KMS atomic.
> 
> Assuming the DRM universal planes and atomic mode setting / page flip
> infrastructure is in place, could the 2D compositing capabilities be
> exposed through universal planes? We can assume that plane properties
> are enough to describe all the compositing parameters.
> 
> Atomic updates are needed so that the complicated constraints can be
> checked, and user space can try to reduce the composition complexity if
> the kernel driver sees that it won't work.
> 
> Would it be feasible to generate a hundred identical non-primary planes
> to be exposed to user space via DRM?
> 
> If that could be done, the kernel driver could just use the existing
> kernel/user ABIs without having to invent something new, and programs
> like a Wayland compositor would not need to be coded specifically for
> this hardware.
> 
> What problems do you see with this plan?
> Are any of those problems unfixable or simply prohibitive?
> 
> I have some concerns, which I am not sure will actually be a problem:
> - Does allocating 100 planes eat too much kernel memory?
>   I mean just the bookkeeping, properties, etc.
> - Would such an amount of planes make some in-kernel algorithms slow
>   (particularly in DRM common code)?
> - Considering how user space discovers all DRM resources, would this
>   make a compositor "slow" to start?
> 
> I suppose whether these turn out to be prohibitive or not, one just has
> to implement it and see. It should be usable on a slowish CPU with
> unimpressive amounts of RAM, because that is where a separate 2D
> compositing engine gives the most kick.
> 
> FWIW, dynamically created/destroyed planes would probably not be the
> answer. The kernel driver cannot decide before-hand how many planes it
> can expose. How many planes can be used depends completely on how user
> space decides to use them. Therefore I believe it should expose the
> maximum number always, whether there is any real use case that could
> actually get them all running or not.
> 
> What if I cannot even pick a maximum number of planes, but wanted to
> (as the hardware allows) let the 2D compositing scale up basically
> unlimited while becoming just slower and slower?
> 
> I think at that point one would be looking at a rendering API really,
> rather than a KMS API, so it's probably out of scope. Where is the line
> between KMS 2D compositing with planes vs. 2D composite rendering?
> 
> Should I really be designing a driver-specific compositing API instead,
> similar to what the Mesa OpenGL implementations use? Then have user
> space maybe use the user space driver part via OpenWFC perhaps?
> And when I mention OpenWFC, you probably notice, that I am not aware of
> any standard user space API I could be implementing here. ;-)
> 
> 
> Thanks,
> pq

-- 
Matt Roper
Graphics Software Engineer
IoTG Platform Enabling & Development
Intel Corporation
(916) 356-2795