Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Thomas Hellstrom

On 10/08/2011 12:03 AM, Marek Olšák wrote:

On Fri, Oct 7, 2011 at 10:00 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

OK. First I think we need to make a distinction: bo sync objects vs driver
fences. The bo sync obj api is there to strictly provide functionality that
the ttm bo subsystem is using, and that follows a simple set of rules:

1) the bo subsystem never assumes sync objects are ordered. That means
the bo subsystem needs to wait on a sync object before removing it from a
buffer. Any other assumption is buggy and must be fixed. BUT, if that
assumption is made in the driver without the ttm bo subsystem knowing about
it (which is usually the case), it's OK.

2) When the sync object(s) attached to the bo are signaled the ttm bo
subsystem is free to copy the bo contents and to unbind the bo.

3) The ttm bo system allows sync objects to be signaled in different ways
opaque to the subsystem using sync_obj_arg. The driver is responsible for
setting up that argument.

4) Driver fences may be used for, or expose, other functionality or adaptations
to APIs as long as the sync obj api exported to the bo subsystem follows the
above rules.

This means the following w.r.t. the patch.

A) It violates 1). This is a bug that must be fixed. The assumption that if one
sync object is signaled, another sync object is also signaled must be made
in the driver and not in the bo subsystem. Hence we need to explicitly wait
on a fence before removing it from the bo.

B) the sync_obj_arg carries *per-sync-obj* information on how it should be
signaled. If we need to attach multiple sync objects to a buffer object, we
also need multiple sync_obj_args. This is a bug and needs to be fixed.

C) There is really only one reason that the ttm bo subsystem should care
about multiple sync objects, and that is because the driver can't order them
efficiently. One such example would be hardware with multiple pipes reading
simultaneously from the same texture buffer. Currently we don't support this,
so only the *last* sync object needs to be known by the bo subsystem. Keeping
track of multiple fences generates a lot of completely unnecessary code in
the ttm_bo_util and ttm_bo_vm files, and will be a nightmare if / when
we truly support pipelined moves.
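To make point (A) concrete, here is a minimal sketch of the explicit wait before a sync object is dropped from a bo. The helper and its name are hypothetical; only the sync_obj_* callbacks and the sync_obj/sync_obj_arg fields correspond to the TTM driver interface of this era, and locking is omitted.

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_bo_driver.h>

static int example_bo_replace_sync_obj(struct ttm_buffer_object *bo,
					void *new_sync_obj,
					void *new_sync_obj_arg)
{
	struct ttm_bo_driver *driver = bo->bdev->driver;
	int ret;

	if (bo->sync_obj != NULL) {
		/* Explicit wait: sync objects are not assumed to be ordered. */
		ret = driver->sync_obj_wait(bo->sync_obj, bo->sync_obj_arg,
					    false, false);
		if (ret != 0)
			return ret;
		driver->sync_obj_unref(&bo->sync_obj);
	}

	bo->sync_obj = driver->sync_obj_ref(new_sync_obj);
	bo->sync_obj_arg = new_sync_obj_arg;
	return 0;
}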

As I understand it from your patches, you want to keep multiple fences
around only to track rendering history. If we want to do that generically, I
suggest doing it in the execbuf util code in the following way:

struct ttm_eu_rendering_history {
    void *last_read_sync_obj;
    void *last_read_sync_obj_arg;
    void *last_write_sync_obj;
    void *last_write_sync_obj_arg;
};

Embed this structure in the radeon_bo, and build a small api around it,
including *optionally* passing it to the existing execbuf utilities, and you
should be done. The bo_util code and bo_vm code don't care about the
rendering history, only that the bo is completely idle.

Note also that when an accelerated bo move is scheduled, the driver needs to
update this struct.
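A minimal sketch of what that small api could look like follows. The radeon_bo embedding and the helper name are hypothetical, and refcounting of the sync objects is left out; only struct ttm_eu_rendering_history itself comes from the suggestion above.

struct radeon_bo_example {
	struct ttm_buffer_object tbo;
	struct ttm_eu_rendering_history history;
};

/* Called after fencing a submission that used the bo, and after an
 * accelerated bo move. */
static void example_rendering_history_update(struct ttm_eu_rendering_history *h,
					      void *sync_obj, void *sync_obj_arg,
					      bool is_write)
{
	if (is_write) {
		h->last_write_sync_obj = sync_obj;
		h->last_write_sync_obj_arg = sync_obj_arg;
	} else {
		h->last_read_sync_obj = sync_obj;
		h->last_read_sync_obj_arg = sync_obj_arg;
	}
}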
 

OK, sounds good. I'll fix what should be fixed and send a patch when
it's ready. I am now not so sure whether doing this generically is a
good idea. :)

Marek
   


Marek,
Any progress on this? The merge window is about to open soon, I guess, and
we need a fix by then.


/Thomas





Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Marek Olšák
On Sat, Oct 8, 2011 at 1:32 PM, Thomas Hellstrom tho...@shipmail.org wrote:
 On 10/08/2011 01:27 PM, Ville Syrjälä wrote:

 On Sat, Oct 08, 2011 at 01:10:13PM +0200, Thomas Hellstrom wrote:


 On 10/08/2011 12:26 PM, Ville Syrjälä wrote:


 On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:



 Oh, and one more style comment below:

 On 08/07/2011 10:39 PM, Marek Olšák wrote:



 +enum ttm_buffer_usage {
 +    TTM_USAGE_READ = 1,
 +    TTM_USAGE_WRITE = 2,
 +    TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
 +};





 Please don't use enums for bit operations.



 Now I'm curious. Why not?




 Because it's inconsistent with how flags are defined in the rest of the
 TTM module.


 Ah OK. I was wondering if there's some subtle technical issue involved.
 I've recently gotten to the habit of using enums for pretty much all
 constants. Just easier on the eye IMHO, and avoids cpp output from
 looking like number soup.



 Yes, there are a number of advantages, including symbolic debugger output.
 If we had flag enums that enumerated 1, 2, 4, 8 etc. I'd feel motivated to
 move
 all TTM definitions over.

I don't think that how it is enumerated matters in any way. What I
like about enums, besides what has already been mentioned, is that they
add self-documentation to the code. Compare:

void ttm_set_bo_flags(unsigned flags);

And:

void ttm_set_bo_flags(enum ttm_bo_flags flags);

The latter is way easier to understand for somebody who doesn't know
the code and wants to implement their first patch: it's clear at first
glance what the valid values for flags are, because you can just search
for enum ttm_bo_flags.
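For illustration only (this is not actual TTM code), a bit-flag enum of the kind being discussed, the equivalent #define style used elsewhere in TTM, and the self-documenting prototype:

enum example_bo_flags {
	EXAMPLE_BO_FLAG_READ  = 1 << 0,
	EXAMPLE_BO_FLAG_WRITE = 1 << 1,
};

/* #define EXAMPLE_BO_FLAG_READ  (1 << 0) */
/* #define EXAMPLE_BO_FLAG_WRITE (1 << 1) */

/* With the enum, the prototype documents the valid values by itself: */
void example_set_bo_flags(enum example_bo_flags flags);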

I will change the enum to defines for the sake of following your code
style convention, but it's an unreasonable convention to say the
least.

Marek


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Thomas Hellstrom

On 10/24/2011 06:42 PM, Marek Olšák wrote:

On Sat, Oct 8, 2011 at 1:32 PM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

On 10/08/2011 01:27 PM, Ville Syrjälä wrote:
 

On Sat, Oct 08, 2011 at 01:10:13PM +0200, Thomas Hellstrom wrote:

   

On 10/08/2011 12:26 PM, Ville Syrjälä wrote:

 

On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:


   

Oh, and one more style comment below:

On 08/07/2011 10:39 PM, Marek Olšák wrote:


 

+enum ttm_buffer_usage {
+TTM_USAGE_READ = 1,
+TTM_USAGE_WRITE = 2,
+TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
+};




   

Please don't use enums for bit operations.


 

Now I'm curious. Why not?



   

Because it's inconsistent with how flags are defined in the rest of the
TTM module.

 

Ah OK. I was wondering if there's some subtle technical issue involved.
I've recently gotten to the habit of using enums for pretty much all
constants. Just easier on the eye IMHO, and avoids cpp output from
looking like number soup.


   

Yes, there are a number of advantages, including symbolic debugger output.
If we had flag enums that enumerated 1, 2, 4, 8 etc. I'd feel motivated to
move
all TTM definitions over.
 

I don't think that how it is enumerated matters in any way. What I
like about enums, besides what has already been mentioned, is that it
adds a self-documentation in the code. Compare:

void ttm_set_bo_flags(unsigned flags);

And:

void ttm_set_bo_flags(enum ttm_bo_flags flags);

The latter is way easier to understand for somebody who doesn't know
the code and wants to implement his first patch. With the latter, it's
clear at first glance what are the valid values for flags, because
you can just search for enum ttm_bo_flags.

I will change the enum to defines for the sake of following your code
style convention, but it's an unreasonable convention to say the
least.

Marek
   


I'm not going to argue against this, because you're probably right.

The important thing is that we get the fix in with or without enums.

Thanks,
Thomas





Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Marek Olšák
Hi Thomas,

I have made no progress so far due to lack of time.

Would you mind if I fixed the most important things first, which are:
- sync objects are not ordered, (A)
- every sync object must have its corresponding sync_obj_arg, (B)
and fixed (C) some time later?

I planned on moving the two sync objects from ttm into radeon and not
using ttm_bo_wait from radeon (i.e. pretty much reimplementing what it
does), but it looks more complicated to me than I had originally
thought.

Marek

On Mon, Oct 24, 2011 at 4:48 PM, Thomas Hellstrom tho...@shipmail.org wrote:

 Marek,
 Any progress on this. The merge window is about to open soon I guess and we
 need a fix by then.

 /Thomas


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Thomas Hellstrom

Marek,

The problem is that the patch adds a lot of complicated code where it's 
not needed, and I don't want to end up reverting that code and 
re-implementing the new Radeon gem ioctl by myself.


Having a list of two fence objects and waiting for either of them 
shouldn't be that complicated to implement, in particular when it's done 
in a driver-specific way and you have the benefit of assuming that they 
are ordered.


Since the new functionality is a performance improvement, if time is an
issue I suggest we back this change out and go for the next merge window.


/Thomas


On 10/24/2011 07:10 PM, Marek Olšák wrote:

Hi Thomas,

I have made no progress so far due to lack of time.

Would you mind if I fixed the most important things first, which are:
- sync objects are not ordered, (A)
- every sync object must have its corresponding sync_obj_arg, (B)
and if I fixed (C) some time later.

I planned on moving the two sync objects from ttm into radeon and not
using ttm_bo_wait from radeon (i.e. pretty much reimplementing what it
does), but it looks more complicated to me than I had originally
thought.

Marek

On Mon, Oct 24, 2011 at 4:48 PM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

Marek,
Any progress on this. The merge window is about to open soon I guess and we
need a fix by then.

/Thomas
 






Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-24 Thread Marek Olšák
Alright then.

Dave, if you are reading this, feel free not to include the two
patches I sent you in the next pull request.

Marek

On Mon, Oct 24, 2011 at 7:28 PM, Thomas Hellstrom tho...@shipmail.org wrote:
 Marek,

 The problem is that the patch adds a lot of complicated code where it's not
 needed, and I don't want to end up reverting that code and re-implementing
 the new Radeon gem ioctl by myself.

 Having a list of two fence objects and waiting for either of them shouldn't
 be that complicated to implement, in particular when it's done in a
 driver-specific way and you have the benefit of assuming that they are
 ordered.

 Since the new functionality is a performance improvement, If time is an
 issue, I suggest we back this change out and go for the next merge window.

 /Thomas


 On 10/24/2011 07:10 PM, Marek Olšák wrote:

 Hi Thomas,

 I have made no progress so far due to lack of time.

 Would you mind if I fixed the most important things first, which are:
 - sync objects are not ordered, (A)
 - every sync object must have its corresponding sync_obj_arg, (B)
 and if I fixed (C) some time later.

 I planned on moving the two sync objects from ttm into radeon and not
 using ttm_bo_wait from radeon (i.e. pretty much reimplementing what it
 does), but it looks more complicated to me than I had originally
 thought.

 Marek

 On Mon, Oct 24, 2011 at 4:48 PM, Thomas Hellstrom <tho...@shipmail.org> wrote:


 Marek,
 Any progress on this. The merge window is about to open soon I guess and
 we
 need a fix by then.

 /Thomas







Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-08 Thread Thomas Hellstrom

On 10/07/2011 11:30 PM, Marek Olšák wrote:

On Fri, Oct 7, 2011 at 3:38 PM, Jerome Glisse <j.gli...@gmail.com> wrote:
   

I should have looked at the patch long ago ... anyway I think a better
approach would be to expose the fence id as an unsigned 64-bit value to each
userspace client. I was thinking of mapping a page read-only (the same page
the GPU writes back to) at some specific offset in the drm file (a bit
like the sarea, but read-only so no lock). Each time userspace submits a
command stream it would get the fence id associated with that command
stream. It would then be up to userspace to track reads vs writes
and do the appropriate thing. The kernel code would be simple (the biggest
issue is finding an offset we can use for that), and it would be fast, as
there is no round trip to the kernel to know the last fence.

Each fence seq would be valid only for a specific ring (only future
GPUs are affected here, maybe Cayman).

So no change to ttm, just a change to radeon to return the fence seq and to
move to an unsigned 64-bit type. The issue would be when GPU write-back is
disabled; then we would either need userspace to call something like
bo wait or another ioctl to get the kernel to update the copy. The copy
would be updated in the irq handler too, so at least it gets updated
any time something enables irqs.
 

I am having a hard time understanding what you are saying.

Anyway, I had some read and write usage tracking in the radeon winsys.
That worked well for driver-private resources, but it was a total fail
for the ones shared with the DDX. I needed this bo_wait optimization
to work with multiple processes. That's the whole point why I am doing
this in the kernel.

Marek
   

At one XDS meeting in Cambridge an IMHO questionable decision was taken to
try to keep synchronization operations like this in user-space, communicating
necessary info among involved components. In this case you'd then need to
send fence sequences down the DRI2 protocol to the winsys.

However, if you at one point want to do user-space suballocation of kernel
buffers, that's something you need to do anyway, because the kernel is not
aware that user-space fences the suballocations separately.

/Thomas
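For reference, Jerome's proposal quoted above (a read-only page exposing the last signaled fence sequence number) would let userspace test buffer idleness roughly as sketched below. Everything here (struct layout, names) is hypothetical, not existing radeon code, and the sketch ignores wrap-around and torn reads.

#include <stdbool.h>
#include <stdint.h>

struct example_fence_page {
	volatile uint64_t last_signaled_seq; /* updated by GPU write-back / irq handler */
};

/* True if the buffer's last recorded use has completed.  Valid only because
 * fence sequence numbers on a single ring are assumed to be monotonically
 * increasing. */
static bool example_bo_idle(const struct example_fence_page *fp,
			    uint64_t bo_last_use_seq)
{
	return fp->last_signaled_seq >= bo_last_use_seq;
}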





Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-08 Thread Ville Syrjälä
On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:
 Oh, and one more style comment below:
 
 On 08/07/2011 10:39 PM, Marek Olšák wrote:
 
  +enum ttm_buffer_usage {
  +TTM_USAGE_READ = 1,
  +TTM_USAGE_WRITE = 2,
  +TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
  +};
 
 
 
 Please don't use enums for bit operations.

Now I'm curious. Why not?

-- 
Ville Syrjälä
syrj...@sci.fi
http://www.sci.fi/~syrjala/


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-08 Thread Thomas Hellstrom

On 10/08/2011 12:26 PM, Ville Syrjälä wrote:

On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:
   

Oh, and one more style comment below:

On 08/07/2011 10:39 PM, Marek Olšák wrote:
 

+enum ttm_buffer_usage {
+TTM_USAGE_READ = 1,
+TTM_USAGE_WRITE = 2,
+TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
+};


   

Please don't use enums for bit operations.
 

Now I'm curious. Why not?

   
Because it's inconsistent with how flags are defined in the rest of the 
TTM module.


/Thomas




Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-08 Thread Ville Syrjälä
On Sat, Oct 08, 2011 at 01:10:13PM +0200, Thomas Hellstrom wrote:
 On 10/08/2011 12:26 PM, Ville Syrjälä wrote:
  On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:
 
  Oh, and one more style comment below:
 
  On 08/07/2011 10:39 PM, Marek Olšák wrote:
   
  +enum ttm_buffer_usage {
  +TTM_USAGE_READ = 1,
  +TTM_USAGE_WRITE = 2,
  +TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
  +};
 
 
 
  Please don't use enums for bit operations.
   
  Now I'm curious. Why not?
 
 
 Because it's inconsistent with how flags are defined in the rest of the 
 TTM module.

Ah OK. I was wondering if there's some subtle technical issue involved.
I've recently gotten to the habit of using enums for pretty much all
constants. Just easier on the eye IMHO, and avoids cpp output from
looking like number soup.

-- 
Ville Syrjälä
syrj...@sci.fi
http://www.sci.fi/~syrjala/


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-08 Thread Thomas Hellstrom

On 10/08/2011 01:27 PM, Ville Syrjälä wrote:

On Sat, Oct 08, 2011 at 01:10:13PM +0200, Thomas Hellstrom wrote:
   

On 10/08/2011 12:26 PM, Ville Syrjälä wrote:
 

On Fri, Oct 07, 2011 at 10:58:13AM +0200, Thomas Hellstrom wrote:

   

Oh, and one more style comment below:

On 08/07/2011 10:39 PM, Marek Olšák wrote:

 

+enum ttm_buffer_usage {
+TTM_USAGE_READ = 1,
+TTM_USAGE_WRITE = 2,
+TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
+};



   

Please don't use enums for bit operations.

 

Now I'm curious. Why not?


   

Because it's inconsistent with how flags are defined in the rest of the
TTM module.
 

Ah OK. I was wondering if there's some subtle technical issue involved.
I've recently gotten to the habit of using enums for pretty much all
constants. Just easier on the eye IMHO, and avoids cpp output from
looking like number soup.

   


Yes, there are a number of advantages, including symbolic debugger output.
If we had flag enums that enumerated 1, 2, 4, 8 etc. I'd feel motivated to
move all TTM definitions over.

/Thomas





Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Thomas Hellstrom

On 10/07/2011 12:42 AM, Marek Olšák wrote:

On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

In any case, I'm not saying fences is the best way to flush but since the bo
code assumes that signaling a sync object means make the buffer contents
available for CPU read / write, it's usually a good way to do it; there's
even a sync_obj_flush() method that gets called when a potential flush needs
to happen.
 

I don't think we use it like that. To my knowledge, the purpose of the
sync obj (to Radeon Gallium drivers at least) is to be able to wait
for the last use of a buffer. Whether the contents can or cannot be
available to the CPU is totally irrelevant.

Currently (and it's a very important performance optimization),
buffers stay mapped and available for CPU read/write after their
first map_buffer call. Unmap_buffer is a no-op; the unmapping happens
on buffer destruction. We only call bo_wait when we want to wait until
the GPU is done with the buffer (we don't always want that,
sometimes we want to use the unsynchronized flag). Otherwise the
contents of buffers are available at *any time*.

We could probably implement bo_wait privately in the kernel driver and
not use ttm_bo_wait. I preferred code sharing though.

Textures (especially the tiled ones) are never mapped directly and a
temporary staging resource is used instead, so we don't actually
pollute address space that much. (in case you would have such a
remark) We will use staging resources for buffers too, but it's really
the last resort to avoid waiting when direct access can't be used.
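As an aside, the mapping strategy described above amounts to something like the following in the winsys. All names and helpers are invented for illustration; this is not the actual radeon winsys code.

#include <stdbool.h>
#include <stddef.h>

struct example_winsys_bo {
	void *cpu_ptr;	/* non-NULL once the first map has happened */
	/* handle, size, last-use fences, ... */
};

void *example_kernel_mmap(struct example_winsys_bo *bo);	/* hypothetical */
void example_bo_wait(struct example_winsys_bo *bo);		/* hypothetical */

static void *example_bo_map(struct example_winsys_bo *bo, bool unsynchronized)
{
	if (bo->cpu_ptr == NULL)
		bo->cpu_ptr = example_kernel_mmap(bo);

	if (!unsynchronized)
		example_bo_wait(bo);	/* wait for the GPU's last use */

	return bo->cpu_ptr;
}

static void example_bo_unmap(struct example_winsys_bo *bo)
{
	/* Intentionally a no-op; the real unmap happens at bo destruction. */
	(void)bo;
}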


   

2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
difference between those two? I think we should remove the write_sync_obj
bo
member.

 

Okay, but I think we should remove sync_obj instead, and keep write
and read sync objs. In the case of READWRITE usage, read_sync_obj
would be equal to write_sync_obj.


   

Sure, I'm fine with that.

One other thing, though, that makes me a little puzzled:

Let's assume you don't allow readers and writers at the same time, which is
my perception of how read- and write fences should work; you either have a
list of read fences or a single write fence (in the same way a read-write
lock works).

Now, if you only allow a single read fence, like in this patch, that implies
that you can only have either a single read fence or a single write fence at
any one time. We'd then only need a single fence pointer on the bo, and
sync_obj_arg would tell us whether to signal the fence for read or for write
(assuming that sync_obj_arg was set up to indicate read / write at
validation time). In that case the patch really isn't necessary at all, as it
only allows a single read fence?

Or is it that you want to allow read- and write fences co-existing? In that
case, what's the use case?
 

There are lots of read-write use cases which don't need any barriers
or flushing. The obvious ones are color blending and depth-stencil
buffering. The OpenGL application is also allowed to use a subrange of
a buffer as a vertex buffer (read-only) and another disjoint subrange
of the same buffer for transform feedback (write-only), which kinda
makes me think about whether we should track subranges instead of
treating a whole buffer as busy. It gets even more funky with
ARB_shader_image_load_store, which supports atomic read-modify-write
operations on textures, not to mention atomic memory operations in
compute shaders (wait, isn't that also exposed in GL as
GL_ARB_shader_atomic_counters?).

I was thinking about whether the two sync objs should be "read" and
"readwrite", or "read" and "write". I chose the latter, because it's
more fine-grained and we have to keep at least two of them around
anyway.
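As an illustration of why separate read and write sync objects help, the sketch below shows the waiting rules they imply: a caller that only wants to read the buffer from the CPU only has to wait for the last writer, not for the last reader. This mirrors the idea of the patch under discussion but is not its actual code; all names are invented.

enum example_usage {
	EXAMPLE_USAGE_READ  = 1 << 0,
	EXAMPLE_USAGE_WRITE = 1 << 1,
};

struct example_bo_fences {
	void *last_read_sync_obj;	/* last command stream that read the bo */
	void *last_write_sync_obj;	/* last command stream that wrote the bo */
};

/* Wait until the bo can be used for the given CPU access. */
static int example_bo_wait(struct example_bo_fences *f, enum example_usage usage,
			   int (*wait)(void *sync_obj))
{
	int ret = 0;

	/* Any CPU access must wait for outstanding GPU writes. */
	if (f->last_write_sync_obj)
		ret = wait(f->last_write_sync_obj);

	/* Only a CPU write additionally waits for outstanding GPU reads. */
	if (ret == 0 && (usage & EXAMPLE_USAGE_WRITE) && f->last_read_sync_obj)
		ret = wait(f->last_read_sync_obj);

	return ret;
}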

So now that you know what we use sync objs for, what are your ideas on
re-implementing that patch in a way that is okay with you? Besides
removing the third sync obj, of course.

Marek
   
OK. First I think we need to make a distinction: bo sync objects vs 
driver fences. The bo sync obj api is there to strictly provide 
functionality that the ttm bo subsystem is using, and that follows a 
simple set of rules:


1) the bo subsystem never assumes sync objects are ordered. That
means the bo subsystem needs to wait on a sync object before removing it
from a buffer. Any other assumption is buggy and must be fixed. BUT, if
that assumption is made in the driver without the ttm bo subsystem knowing
about it (which is usually the case), it's OK.


2) When the sync object(s) attached to the bo are signaled the ttm bo 
subsystem is free to copy the bo contents and to unbind the bo.


3) The ttm bo system allows sync objects to be signaled in different 
ways opaque to the subsystem using sync_obj_arg. The driver is 
responsible for setting up that argument.


4) Driver fences may be used for, or expose, other functionality or
adaptations to APIs as long as the sync obj api exported to the bo
subsystem follows the above rules.

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Thomas Hellstrom

Oh, and one more style comment below:

On 08/07/2011 10:39 PM, Marek Olšák wrote:


+enum ttm_buffer_usage {
+TTM_USAGE_READ = 1,
+TTM_USAGE_WRITE = 2,
+TTM_USAGE_READWRITE = TTM_USAGE_READ | TTM_USAGE_WRITE
+};

   


Please don't use enums for bit operations.

#define TTM_USAGE_FLAG_READ   (1 << 0)
#define TTM_USAGE_FLAG_WRITE  (1 << 1)

/Thomas





Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Alex Deucher
On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom tho...@shipmail.org wrote:
 On 10/07/2011 12:42 AM, Marek Olšák wrote:

 On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:


 In any case, I'm not saying fences is the best way to flush but since the
 bo
 code assumes that signaling a sync object means make the buffer contents
 available for CPU read / write, it's usually a good way to do it;
 there's
 even a sync_obj_flush() method that gets called when a potential flush
 needs
 to happen.


 I don't think we use it like that. To my knowledge, the purpose of the
 sync obj (to Radeon Gallium drivers at least) is to be able to wait
 for the last use of a buffer. Whether the contents can or cannot be
 available to the CPU is totally irrelevant.

 Currently (and it's a very important performance optimization),
 buffers stay mapped and available for CPU read/write during their
 first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
 on buffer destruction. We only call bo_wait when we want to wait for
 the GPU until it's done with the buffer (we don't always want that,
 sometimes we want to use the unsychronized flag). Otherwise the
 contents of buffers are available at *any time*.

 We could probably implement bo_wait privately in the kernel driver and
 not use ttm_bo_wait. I preferred code sharing though.

 Textures (especially the tiled ones) are never mapped directly and a
 temporary staging resource is used instead, so we don't actually
 pollute address space that much. (in case you would have such a
 remark) We will use staging resources for buffers too, but it's really
 the last resort to avoid waiting when direct access can't be used.




 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
 difference between those two? I think we should remove the
 write_sync_obj
 bo
 member.



 Okay, but I think we should remove sync_obj instead, and keep write
 and read sync objs. In the case of READWRITE usage, read_sync_obj
 would be equal to write_sync_obj.




 Sure, I'm fine with that.

 One other thing, though, that makes me a little puzzled:

 Let's assume you don't allow readers and writers at the same time, which
 is
 my perception of how read- and write fences should work; you either have
 a
 list of read fences or a single write fence (in the same way a read-write
 lock works).

 Now, if you only allow a single read fence, like in this patch. That
 implies
 that you can only have either a single read fence or a single write fence
 at
 any one time. We'd only need a single fence pointer on the bo, and
 sync_obj_arg would tell us whether to signal the fence for read or for
 write
 (assuming that sync_obj_arg was set up to indicate read / write at
 validation time), then the patch really isn't necessary at all, as it
 only
 allows a single read fence?

 Or is it that you want to allow read- and write fences co-existing? In
 that
 case, what's the use case?


 There are lots of read-write use cases which don't need any barriers
 or flushing. The obvious ones are color blending and depth-stencil
 buffering. The OpenGL application is also allowed to use a subrange of
 a buffer as a vertex buffer (read-only) and another disjoint subrange
 of the same buffer for transform feedback (write-only), which kinda
 makes me think about whether we should track subranges instead of
 treating a whole buffer as busy. It gets even more funky with
 ARB_shader_image_load_store, which supports atomic read-modify-write
 operations on textures, not to mention atomic memory operations in
 compute shaders (wait, isn't that also exposed in GL as
 GL_ARB_shader_atomic_counters?).

 I was thinking whether the two sync objs should be read and
 readwrite, or read and write. I chose the latter, because it's
 more fine-grained and we have to keep at least two of them around
 anyway.

 So now that you know what we use sync objs for, what are your ideas on
 re-implementing that patch in a way that is okay with you? Besides
 removing the third sync objs of course.

 Marek


 OK. First I think we need to make a distinction: bo sync objects vs driver
 fences. The bo sync obj api is there to strictly provide functionality that
 the ttm bo subsystem is using, and that follows a simple set of rules:

 1) the bo subsystem does never assume sync objects are ordered. That means
 the bo subsystem needs to wait on a sync object before removing it from a
 buffer. Any other assumption is buggy and must be fixed. BUT, if that
 assumption takes place in the driver unknowingly from the ttm bo subsystem
 (which is usually the case), it's OK.

 2) When the sync object(s) attached to the bo are signaled the ttm bo
 subsystem is free to copy the bo contents and to unbind the bo.

 3) The ttm bo system allows sync objects to be signaled in different ways
 opaque to the subsystem using sync_obj_arg. The driver is responsible for
 setting up that argument.

 4) Driver fences may be used for or expose 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Jerome Glisse
On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom tho...@shipmail.org wrote:
 On 10/07/2011 12:42 AM, Marek Olšák wrote:

 On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:


 In any case, I'm not saying fences is the best way to flush but since the
 bo
 code assumes that signaling a sync object means make the buffer contents
 available for CPU read / write, it's usually a good way to do it;
 there's
 even a sync_obj_flush() method that gets called when a potential flush
 needs
 to happen.


 I don't think we use it like that. To my knowledge, the purpose of the
 sync obj (to Radeon Gallium drivers at least) is to be able to wait
 for the last use of a buffer. Whether the contents can or cannot be
 available to the CPU is totally irrelevant.

 Currently (and it's a very important performance optimization),
 buffers stay mapped and available for CPU read/write during their
 first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
 on buffer destruction. We only call bo_wait when we want to wait for
 the GPU until it's done with the buffer (we don't always want that,
 sometimes we want to use the unsychronized flag). Otherwise the
 contents of buffers are available at *any time*.

 We could probably implement bo_wait privately in the kernel driver and
 not use ttm_bo_wait. I preferred code sharing though.

 Textures (especially the tiled ones) are never mapped directly and a
 temporary staging resource is used instead, so we don't actually
 pollute address space that much. (in case you would have such a
 remark) We will use staging resources for buffers too, but it's really
 the last resort to avoid waiting when direct access can't be used.




 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
 difference between those two? I think we should remove the
 write_sync_obj
 bo
 member.



 Okay, but I think we should remove sync_obj instead, and keep write
 and read sync objs. In the case of READWRITE usage, read_sync_obj
 would be equal to write_sync_obj.




 Sure, I'm fine with that.

 One other thing, though, that makes me a little puzzled:

 Let's assume you don't allow readers and writers at the same time, which
 is
 my perception of how read- and write fences should work; you either have
 a
 list of read fences or a single write fence (in the same way a read-write
 lock works).

 Now, if you only allow a single read fence, like in this patch. That
 implies
 that you can only have either a single read fence or a single write fence
 at
 any one time. We'd only need a single fence pointer on the bo, and
 sync_obj_arg would tell us whether to signal the fence for read or for
 write
 (assuming that sync_obj_arg was set up to indicate read / write at
 validation time), then the patch really isn't necessary at all, as it
 only
 allows a single read fence?

 Or is it that you want to allow read- and write fences co-existing? In
 that
 case, what's the use case?


 There are lots of read-write use cases which don't need any barriers
 or flushing. The obvious ones are color blending and depth-stencil
 buffering. The OpenGL application is also allowed to use a subrange of
 a buffer as a vertex buffer (read-only) and another disjoint subrange
 of the same buffer for transform feedback (write-only), which kinda
 makes me think about whether we should track subranges instead of
 treating a whole buffer as busy. It gets even more funky with
 ARB_shader_image_load_store, which supports atomic read-modify-write
 operations on textures, not to mention atomic memory operations in
 compute shaders (wait, isn't that also exposed in GL as
 GL_ARB_shader_atomic_counters?).

 I was thinking whether the two sync objs should be read and
 readwrite, or read and write. I chose the latter, because it's
 more fine-grained and we have to keep at least two of them around
 anyway.

 So now that you know what we use sync objs for, what are your ideas on
 re-implementing that patch in a way that is okay with you? Besides
 removing the third sync objs of course.

 Marek


 OK. First I think we need to make a distinction: bo sync objects vs driver
 fences. The bo sync obj api is there to strictly provide functionality that
 the ttm bo subsystem is using, and that follows a simple set of rules:

 1) the bo subsystem does never assume sync objects are ordered. That means
 the bo subsystem needs to wait on a sync object before removing it from a
 buffer. Any other assumption is buggy and must be fixed. BUT, if that
 assumption takes place in the driver unknowingly from the ttm bo subsystem
 (which is usually the case), it's OK.

 2) When the sync object(s) attached to the bo are signaled the ttm bo
 subsystem is free to copy the bo contents and to unbind the bo.

 3) The ttm bo system allows sync objects to be signaled in different ways
 opaque to the subsystem using sync_obj_arg. The driver is responsible for
 setting up that argument.

 4) Driver fences may be used for or expose 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Jerome Glisse
On Fri, Oct 7, 2011 at 9:38 AM, Jerome Glisse j.gli...@gmail.com wrote:
 On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom tho...@shipmail.org wrote:
 On 10/07/2011 12:42 AM, Marek Olšák wrote:

  On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:


 In any case, I'm not saying fences is the best way to flush but since the
 bo
 code assumes that signaling a sync object means make the buffer contents
 available for CPU read / write, it's usually a good way to do it;
 there's
 even a sync_obj_flush() method that gets called when a potential flush
 needs
 to happen.


 I don't think we use it like that. To my knowledge, the purpose of the
 sync obj (to Radeon Gallium drivers at least) is to be able to wait
 for the last use of a buffer. Whether the contents can or cannot be
 available to the CPU is totally irrelevant.

 Currently (and it's a very important performance optimization),
 buffers stay mapped and available for CPU read/write during their
 first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
 on buffer destruction. We only call bo_wait when we want to wait for
 the GPU until it's done with the buffer (we don't always want that,
 sometimes we want to use the unsychronized flag). Otherwise the
 contents of buffers are available at *any time*.

 We could probably implement bo_wait privately in the kernel driver and
 not use ttm_bo_wait. I preferred code sharing though.

 Textures (especially the tiled ones) are never mapped directly and a
 temporary staging resource is used instead, so we don't actually
 pollute address space that much. (in case you would have such a
 remark) We will use staging resources for buffers too, but it's really
 the last resort to avoid waiting when direct access can't be used.




 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
 difference between those two? I think we should remove the
 write_sync_obj
 bo
 member.



 Okay, but I think we should remove sync_obj instead, and keep write
 and read sync objs. In the case of READWRITE usage, read_sync_obj
 would be equal to write_sync_obj.




 Sure, I'm fine with that.

 One other thing, though, that makes me a little puzzled:

 Let's assume you don't allow readers and writers at the same time, which
 is
 my perception of how read- and write fences should work; you either have
 a
 list of read fences or a single write fence (in the same way a read-write
 lock works).

 Now, if you only allow a single read fence, like in this patch. That
 implies
 that you can only have either a single read fence or a single write fence
 at
 any one time. We'd only need a single fence pointer on the bo, and
 sync_obj_arg would tell us whether to signal the fence for read or for
 write
 (assuming that sync_obj_arg was set up to indicate read / write at
 validation time), then the patch really isn't necessary at all, as it
 only
 allows a single read fence?

 Or is it that you want to allow read- and write fences co-existing? In
 that
 case, what's the use case?


 There are lots of read-write use cases which don't need any barriers
 or flushing. The obvious ones are color blending and depth-stencil
 buffering. The OpenGL application is also allowed to use a subrange of
 a buffer as a vertex buffer (read-only) and another disjoint subrange
 of the same buffer for transform feedback (write-only), which kinda
 makes me think about whether we should track subranges instead of
 treating a whole buffer as busy. It gets even more funky with
 ARB_shader_image_load_store, which supports atomic read-modify-write
 operations on textures, not to mention atomic memory operations in
 compute shaders (wait, isn't that also exposed in GL as
 GL_ARB_shader_atomic_counters?).

 I was thinking whether the two sync objs should be read and
 readwrite, or read and write. I chose the latter, because it's
 more fine-grained and we have to keep at least two of them around
 anyway.

 So now that you know what we use sync objs for, what are your ideas on
 re-implementing that patch in a way that is okay with you? Besides
 removing the third sync objs of course.

 Marek


 OK. First I think we need to make a distinction: bo sync objects vs driver
 fences. The bo sync obj api is there to strictly provide functionality that
 the ttm bo subsystem is using, and that follows a simple set of rules:

 1) the bo subsystem does never assume sync objects are ordered. That means
 the bo subsystem needs to wait on a sync object before removing it from a
 buffer. Any other assumption is buggy and must be fixed. BUT, if that
 assumption takes place in the driver unknowingly from the ttm bo subsystem
 (which is usually the case), it's OK.

 2) When the sync object(s) attached to the bo are signaled the ttm bo
 subsystem is free to copy the bo contents and to unbind the bo.

 3) The ttm bo system allows sync objects to be signaled in different ways
 opaque to the subsystem using sync_obj_arg. The driver is responsible for
 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Thomas Hellstrom

On 10/07/2011 03:24 PM, Alex Deucher wrote:

On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

On 10/07/2011 12:42 AM, Marek Olšák wrote:
 

On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:

   

In any case, I'm not saying fences is the best way to flush but since the
bo
code assumes that signaling a sync object means make the buffer contents
available for CPU read / write, it's usually a good way to do it;
there's
even a sync_obj_flush() method that gets called when a potential flush
needs
to happen.

 

I don't think we use it like that. To my knowledge, the purpose of the
sync obj (to Radeon Gallium drivers at least) is to be able to wait
for the last use of a buffer. Whether the contents can or cannot be
available to the CPU is totally irrelevant.

Currently (and it's a very important performance optimization),
buffers stay mapped and available for CPU read/write during their
first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
on buffer destruction. We only call bo_wait when we want to wait for
the GPU until it's done with the buffer (we don't always want that,
sometimes we want to use the unsychronized flag). Otherwise the
contents of buffers are available at *any time*.

We could probably implement bo_wait privately in the kernel driver and
not use ttm_bo_wait. I preferred code sharing though.

Textures (especially the tiled ones) are never mapped directly and a
temporary staging resource is used instead, so we don't actually
pollute address space that much. (in case you would have such a
remark) We will use staging resources for buffers too, but it's really
the last resort to avoid waiting when direct access can't be used.



   

2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
difference between those two? I think we should remove the
write_sync_obj
bo
member.


 

Okay, but I think we should remove sync_obj instead, and keep write
and read sync objs. In the case of READWRITE usage, read_sync_obj
would be equal to write_sync_obj.



   

Sure, I'm fine with that.

One other thing, though, that makes me a little puzzled:

Let's assume you don't allow readers and writers at the same time, which
is
my perception of how read- and write fences should work; you either have
a
list of read fences or a single write fence (in the same way a read-write
lock works).

Now, if you only allow a single read fence, like in this patch. That
implies
that you can only have either a single read fence or a single write fence
at
any one time. We'd only need a single fence pointer on the bo, and
sync_obj_arg would tell us whether to signal the fence for read or for
write
(assuming that sync_obj_arg was set up to indicate read / write at
validation time), then the patch really isn't necessary at all, as it
only
allows a single read fence?

Or is it that you want to allow read- and write fences co-existing? In
that
case, what's the use case?

 

There are lots of read-write use cases which don't need any barriers
or flushing. The obvious ones are color blending and depth-stencil
buffering. The OpenGL application is also allowed to use a subrange of
a buffer as a vertex buffer (read-only) and another disjoint subrange
of the same buffer for transform feedback (write-only), which kinda
makes me think about whether we should track subranges instead of
treating a whole buffer as busy. It gets even more funky with
ARB_shader_image_load_store, which supports atomic read-modify-write
operations on textures, not to mention atomic memory operations in
compute shaders (wait, isn't that also exposed in GL as
GL_ARB_shader_atomic_counters?).

I was thinking whether the two sync objs should be read and
readwrite, or read and write. I chose the latter, because it's
more fine-grained and we have to keep at least two of them around
anyway.

So now that you know what we use sync objs for, what are your ideas on
re-implementing that patch in a way that is okay with you? Besides
removing the third sync objs of course.

Marek

   

OK. First I think we need to make a distinction: bo sync objects vs driver
fences. The bo sync obj api is there to strictly provide functionality that
the ttm bo subsystem is using, and that follows a simple set of rules:

1) the bo subsystem does never assume sync objects are ordered. That means
the bo subsystem needs to wait on a sync object before removing it from a
buffer. Any other assumption is buggy and must be fixed. BUT, if that
assumption takes place in the driver unknowingly from the ttm bo subsystem
(which is usually the case), it's OK.

2) When the sync object(s) attached to the bo are signaled the ttm bo
subsystem is free to copy the bo contents and to unbind the bo.

3) The ttm bo system allows sync objects to be signaled in different ways
opaque to the subsystem using sync_obj_arg. The driver is responsible for
setting up that argument.

4) Driver 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Thomas Hellstrom

On 10/07/2011 03:38 PM, Jerome Glisse wrote:

On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:
   

On 10/07/2011 12:42 AM, Marek Olšák wrote:
 

On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:

   

In any case, I'm not saying fences is the best way to flush but since the
bo
code assumes that signaling a sync object means make the buffer contents
available for CPU read / write, it's usually a good way to do it;
there's
even a sync_obj_flush() method that gets called when a potential flush
needs
to happen.

 

I don't think we use it like that. To my knowledge, the purpose of the
sync obj (to Radeon Gallium drivers at least) is to be able to wait
for the last use of a buffer. Whether the contents can or cannot be
available to the CPU is totally irrelevant.

Currently (and it's a very important performance optimization),
buffers stay mapped and available for CPU read/write during their
first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
on buffer destruction. We only call bo_wait when we want to wait for
the GPU until it's done with the buffer (we don't always want that,
sometimes we want to use the unsychronized flag). Otherwise the
contents of buffers are available at *any time*.

We could probably implement bo_wait privately in the kernel driver and
not use ttm_bo_wait. I preferred code sharing though.

Textures (especially the tiled ones) are never mapped directly and a
temporary staging resource is used instead, so we don't actually
pollute address space that much. (in case you would have such a
remark) We will use staging resources for buffers too, but it's really
the last resort to avoid waiting when direct access can't be used.



   

2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
difference between those two? I think we should remove the
write_sync_obj
bo
member.


 

Okay, but I think we should remove sync_obj instead, and keep write
and read sync objs. In the case of READWRITE usage, read_sync_obj
would be equal to write_sync_obj.



   

Sure, I'm fine with that.

One other thing, though, that makes me a little puzzled:

Let's assume you don't allow readers and writers at the same time, which
is
my perception of how read- and write fences should work; you either have
a
list of read fences or a single write fence (in the same way a read-write
lock works).

Now, if you only allow a single read fence, like in this patch. That
implies
that you can only have either a single read fence or a single write fence
at
any one time. We'd only need a single fence pointer on the bo, and
sync_obj_arg would tell us whether to signal the fence for read or for
write
(assuming that sync_obj_arg was set up to indicate read / write at
validation time), then the patch really isn't necessary at all, as it
only
allows a single read fence?

Or is it that you want to allow read- and write fences co-existing? In
that
case, what's the use case?

 

There are lots of read-write use cases which don't need any barriers
or flushing. The obvious ones are color blending and depth-stencil
buffering. The OpenGL application is also allowed to use a subrange of
a buffer as a vertex buffer (read-only) and another disjoint subrange
of the same buffer for transform feedback (write-only), which kinda
makes me think about whether we should track subranges instead of
treating a whole buffer as busy. It gets even more funky with
ARB_shader_image_load_store, which supports atomic read-modify-write
operations on textures, not to mention atomic memory operations in
compute shaders (wait, isn't that also exposed in GL as
GL_ARB_shader_atomic_counters?).

I was thinking whether the two sync objs should be read and
readwrite, or read and write. I chose the latter, because it's
more fine-grained and we have to keep at least two of them around
anyway.

So now that you know what we use sync objs for, what are your ideas on
re-implementing that patch in a way that is okay with you? Besides
removing the third sync objs of course.

Marek

   

OK. First I think we need to make a distinction: bo sync objects vs driver
fences. The bo sync obj api is there to strictly provide functionality that
the ttm bo subsystem is using, and that follows a simple set of rules:

1) the bo subsystem does never assume sync objects are ordered. That means
the bo subsystem needs to wait on a sync object before removing it from a
buffer. Any other assumption is buggy and must be fixed. BUT, if that
assumption takes place in the driver unknowingly from the ttm bo subsystem
(which is usually the case), it's OK.

2) When the sync object(s) attached to the bo are signaled the ttm bo
subsystem is free to copy the bo contents and to unbind the bo.

3) The ttm bo system allows sync objects to be signaled in different ways
opaque to the subsystem using sync_obj_arg. The driver is responsible for
setting up that argument.

4) Driver 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Alex Deucher
On Fri, Oct 7, 2011 at 10:05 AM, Thomas Hellstrom tho...@shipmail.org wrote:
 On 10/07/2011 03:24 PM, Alex Deucher wrote:

 On Fri, Oct 7, 2011 at 4:00 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:


 On 10/07/2011 12:42 AM, Marek Olšák wrote:


 On Wed, Oct 5, 2011 at 7:54 AM, Thomas Hellstrom <tho...@shipmail.org> wrote:



 In any case, I'm not saying fences is the best way to flush but since
 the
 bo
 code assumes that signaling a sync object means make the buffer
 contents
 available for CPU read / write, it's usually a good way to do it;
 there's
 even a sync_obj_flush() method that gets called when a potential flush
 needs
 to happen.



 I don't think we use it like that. To my knowledge, the purpose of the
 sync obj (to Radeon Gallium drivers at least) is to be able to wait
 for the last use of a buffer. Whether the contents can or cannot be
 available to the CPU is totally irrelevant.

 Currently (and it's a very important performance optimization),
 buffers stay mapped and available for CPU read/write during their
 first map_buffer call. Unmap_buffer is a no-op. The unmapping happens
 on buffer destruction. We only call bo_wait when we want to wait for
 the GPU until it's done with the buffer (we don't always want that,
 sometimes we want to use the unsychronized flag). Otherwise the
 contents of buffers are available at *any time*.

 We could probably implement bo_wait privately in the kernel driver and
 not use ttm_bo_wait. I preferred code sharing though.

 Textures (especially the tiled ones) are never mapped directly and a
 temporary staging resource is used instead, so we don't actually
 pollute address space that much. (in case you would have such a
 remark) We will use staging resources for buffers too, but it's really
 the last resort to avoid waiting when direct access can't be used.





 2) Can't we say that a write_sync_obj is simply a sync_obj? What's
 the
 difference between those two? I think we should remove the
 write_sync_obj
 bo
 member.




 Okay, but I think we should remove sync_obj instead, and keep write
 and read sync objs. In the case of READWRITE usage, read_sync_obj
 would be equal to write_sync_obj.





 Sure, I'm fine with that.

 One other thing, though, that makes me a little puzzled:

 Let's assume you don't allow readers and writers at the same time,
 which
 is
 my perception of how read- and write fences should work; you either
 have
 a
 list of read fences or a single write fence (in the same way a
 read-write
 lock works).

 Now, if you only allow a single read fence, like in this patch. That
 implies
 that you can only have either a single read fence or a single write
 fence
 at
 any one time. We'd only need a single fence pointer on the bo, and
 sync_obj_arg would tell us whether to signal the fence for read or for
 write
 (assuming that sync_obj_arg was set up to indicate read / write at
 validation time), then the patch really isn't necessary at all, as it
 only
 allows a single read fence?

 Or is it that you want to allow read- and write fences co-existing? In
 that
 case, what's the use case?



 There are lots of read-write use cases which don't need any barriers
 or flushing. The obvious ones are color blending and depth-stencil
 buffering. The OpenGL application is also allowed to use a subrange of
 a buffer as a vertex buffer (read-only) and another disjoint subrange
 of the same buffer for transform feedback (write-only), which kinda
 makes me think about whether we should track subranges instead of
 treating a whole buffer as busy. It gets even more funky with
 ARB_shader_image_load_store, which supports atomic read-modify-write
 operations on textures, not to mention atomic memory operations in
 compute shaders (wait, isn't that also exposed in GL as
 GL_ARB_shader_atomic_counters?).

 I was thinking whether the two sync objs should be read and
 readwrite, or read and write. I chose the latter, because it's
 more fine-grained and we have to keep at least two of them around
 anyway.

 So now that you know what we use sync objs for, what are your ideas on
 re-implementing that patch in a way that is okay with you? Besides
 removing the third sync objs of course.

 Marek



 OK. First I think we need to make a distinction: bo sync objects vs
 driver
 fences. The bo sync obj api is there to strictly provide functionality
 that
 the ttm bo subsystem is using, and that follows a simple set of rules:

 1) the bo subsystem does never assume sync objects are ordered. That
 means
 the bo subsystem needs to wait on a sync object before removing it from a
 buffer. Any other assumption is buggy and must be fixed. BUT, if that
 assumption takes place in the driver unknowingly from the ttm bo
 subsystem
 (which is usually the case), it's OK.

 2) When the sync object(s) attached to the bo are signaled the ttm bo
 subsystem is free to copy the bo contents and to unbind the bo.

 3) The ttm bo system allows sync objects to be signaled in different ways
 

Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Marek Olšák
On Fri, Oct 7, 2011 at 3:38 PM, Jerome Glisse j.gli...@gmail.com wrote:
 I should have look at the patch long ago ... anyway i think a better
 approach would be to expose fence id as 64bits unsigned to each
 userspace client. I was thinking of mapping a page readonly (same page
 as the one gpu write back) at somespecific offset in drm file (bit
 like sarea but readonly so no lock). Each time userspace submit a
 command stream it would get the fence id associated with the command
 stream. It would then be up to userspace to track btw read or write
 and do appropriate things. The kernel code would be simple (biggest
 issue is finding an offset we can use for that), it would be fast as
 no round trip to kernel to know the last fence.

 Each fence seq would be valid only for a specific ring (only future
 gpu impacted here, maybe cayman).

 So no change to ttm, just change to radeon to return fence seq and to
 move to an unsigned 64. Issue would be when gpu write back is
 disabled, then we would either need userspace to call somethings like
 bo wait or to other ioctl to get the kernel to update the copy, copy
 would be updated in the irq handler too so at least it get updated
 anytime something enable irq.

I am having a hard time understanding what you are saying.

Anyway, I had some read and write usage tracking in the radeon winsys.
That worked well for driver-private resources, but it was a total fail
for the ones shared with the DDX. I needed this bo_wait optimization
to work with multiple processes. That's the whole point why I am doing
this in the kernel.

Marek


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-07 Thread Marek Olšák
On Fri, Oct 7, 2011 at 10:00 AM, Thomas Hellstrom tho...@shipmail.org wrote:
 OK. First I think we need to make a distinction: bo sync objects vs driver
 fences. The bo sync obj api is there to strictly provide functionality that
 the ttm bo subsystem is using, and that follows a simple set of rules:

 1) the bo subsystem does never assume sync objects are ordered. That means
 the bo subsystem needs to wait on a sync object before removing it from a
 buffer. Any other assumption is buggy and must be fixed. BUT, if that
 assumption takes place in the driver unknowingly from the ttm bo subsystem
 (which is usually the case), it's OK.

 2) When the sync object(s) attached to the bo are signaled the ttm bo
 subsystem is free to copy the bo contents and to unbind the bo.

 3) The ttm bo system allows sync objects to be signaled in different ways
 opaque to the subsystem using sync_obj_arg. The driver is responsible for
 setting up that argument.

 4) Driver fences may be used for or expose other functionality or adaptations
 to APIs as long as the sync obj api exported to the bo subsystem follows the
 above rules.

 This means the following w.r.t. the patch.

 A) it violates 1). This is a bug that must be fixed. Assumptions that if one
 sync object is signaled, another sync object is also signaled must be done
 in the driver and not in the bo subsystem. Hence we need to explicitly wait
 for a fence to remove it from the bo.

 B) the sync_obj_arg carries *per-sync-obj* information on how it should be
 signaled. If we need to attach multiple sync objects to a buffer object, we
 also need multiple sync_obj_args. This is a bug and needs to be fixed.

 C) There is really only one reason that the ttm bo subsystem should care
 about multiple sync objects, and that is because the driver can't order them
 efficiently. One such example would be hardware with multiple pipes reading
 simultaneously from the same texture buffer. Currently we don't support this
 so only the *last* sync object needs to be known by the bo subsystem. Keeping
 track of multiple fences generates a lot of completely unnecessary code in
 the ttm_bo_util file, the ttm_bo_vm file, and will be a nightmare if / when
 we truly support pipelined moves.

 As I understand it from your patches, you want to keep multiple fences
 around only to track rendering history. If we want to do that generically, I
 suggest doing it in the execbuf util code in the following way:

 struct ttm_eu_rendering_history {
    void *last_read_sync_obj;
    void *last_read_sync_obj_arg;
    void *last_write_sync_obj;
    void *last_write_sync_obj_arg;
 };

 Embed this structure in the radeon_bo, and build a small api around it,
 including *optionally* passing it to the existing execbuf utilities, and you
 should be done. The bo_util code and bo_vm code doesn't care about the
 rendering history. Only that the bo is completely idle.

 Note also that when an accelerated bo move is scheduled, the driver needs to
 update this struct
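
For concreteness, a minimal sketch of such a small api; the helper name below is invented, and sync-obj reference counting is left out:

    #include <stdbool.h>

    /* The history struct as sketched above, plus one hypothetical helper that
     * the driver would call after command submission (and after scheduling an
     * accelerated bo move). */
    struct ttm_eu_rendering_history {
            void *last_read_sync_obj;
            void *last_read_sync_obj_arg;
            void *last_write_sync_obj;
            void *last_write_sync_obj_arg;
    };

    static void rendering_history_add_fence(struct ttm_eu_rendering_history *h,
                                            void *sync_obj, void *sync_obj_arg,
                                            bool reads, bool writes)
    {
            if (reads) {
                    h->last_read_sync_obj = sync_obj;
                    h->last_read_sync_obj_arg = sync_obj_arg;
            }
            if (writes) {
                    h->last_write_sync_obj = sync_obj;
                    h->last_write_sync_obj_arg = sync_obj_arg;
            }
    }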

OK, sounds good. I'll fix what should be fixed and send a patch when
it's ready. I am now not so sure whether doing this generically is a
good idea. :)

Marek


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-04 Thread Marek Olšák
On Tue, Oct 4, 2011 at 1:48 PM, Thomas Hellstrom tho...@shipmail.org wrote:
 Bah, I totally missed this patch and thus never reviewed it :( Probably on
 vacation.

 There are a couple of things I'd like to point out.

 1) The bo subsystem may never assume that fence objects are ordered, so that
 when we unref bo::sync_obj, we may never assume that previously attached
 fence objects are signaled and can be unref'd. Think, for example, of fence
 objects submitted to different command streams. This is a bug and must be
 fixed.

If what you say is true, then even the original sync_obj can't be
trusted. What if I overwrite sync_obj with a new one and the new one
is signalled sooner than the old one?


 We can detach fence objects from buffers in the driver validation code,
 because that code knows whether fences are implicitly ordered, or can order
 them either by inserting a barrier (semaphore in NV languange) or waiting

I am not sure I follow you here. ttm_bo_wait needs the fences...
unless we want to move the fences out of TTM into drivers.


 for the fence to expire. (For example if the new validation is READ and the
 fence currently attached is WRITE, we might need to schedule a gpu cache
 flush before detaching the write fence).

I am not sure what fences have to do with flushing. Write caches
should be flushed automatically when resources are unbound. When a
resource is used for write and read at the same time, it's not our
problem: the user is responsible for flushing (e.g. through memory and
texture barriers in OpenGL), not the driver.



 2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
 difference between those two? I think we should remove the write_sync_obj bo
 member.

Okay, but I think we should remove sync_obj instead, and keep write
and read sync objs. In the case of READWRITE usage, read_sync_obj
would be equal to write_sync_obj.
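
A minimal sketch of what that bookkeeping buys on the wait side (names are illustrative only, not the patch's code); with READWRITE usage both pointers simply hold the same fence:

    /* sync_obj_wait() stands in for the driver's fence-wait hook. */
    extern int sync_obj_wait(void *sync_obj);

    struct bo_fences {
            void *read_sync_obj;    /* last CS that read the bo  */
            void *write_sync_obj;   /* last CS that wrote the bo */
    };

    /* CPU read: readers in flight don't matter, only the last write does. */
    static int bo_wait_for_cpu_read(struct bo_fences *f)
    {
            return f->write_sync_obj ? sync_obj_wait(f->write_sync_obj) : 0;
    }

    /* CPU write: wait for both the last read and the last write. */
    static int bo_wait_for_cpu_write(struct bo_fences *f)
    {
            int ret = f->read_sync_obj ? sync_obj_wait(f->read_sync_obj) : 0;
            if (ret)
                    return ret;
            return f->write_sync_obj ? sync_obj_wait(f->write_sync_obj) : 0;
    }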

Marek


Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-10-04 Thread Thomas Hellstrom

On 10/05/2011 04:08 AM, Marek Olšák wrote:

On Tue, Oct 4, 2011 at 1:48 PM, Thomas Hellstrom tho...@shipmail.org wrote:
   

Bah, I totally missed this patch and thus never reviewed it :( Probably on
vacation.

There are a couple of things I'd like to point out.

1) The bo subsystem may never assume that fence objects are ordered, so that
when we unref bo::sync_obj, we may never assume that previously attached
fence objects are signaled and can be unref'd. Think, for example, of fence
objects submitted to different command streams. This is a bug and must be
fixed.
 

If what you say is true, then even the original sync_obj can't be
trusted. What if I overwrite sync_obj with a new one and the new one
is signalled sooner than the old one?
   


The driver validation code will in effect overwrite the old with a new 
one, because the driver validation code knows what sync objects are 
ordered. If, during validation of a buffer object, the driver validation 
code detects that the buffer is already fenced with a sync object that 
will signal out-of-order, the driver validation code needs to *wait* for 
that sync object to signal before proceeding, or insert a sync object 
barrier in the command stream.


The TTM bo code doesn't know how to order fences, and never assumes that 
they are ordered.
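
Roughly, the rule that implies at validation time, as a sketch; driver_fences_ordered() and driver_sync_obj_wait() are stand-ins for driver-specific hooks, not real API, and locking / reference counting are omitted:

    #include <drm/ttm/ttm_bo_api.h>

    /* Hypothetical driver hooks: "are these two fences known to signal in
     * order?" and "block until this fence signals". */
    extern bool driver_fences_ordered(void *old_sync_obj, void *new_sync_obj);
    extern int driver_sync_obj_wait(void *sync_obj);

    /* The driver, not TTM, decides whether the new fence may replace the one
     * already attached to the buffer. */
    static int driver_validate_fence(struct ttm_buffer_object *bo,
                                     void *new_sync_obj)
    {
            void *old = bo->sync_obj;

            if (old && !driver_fences_ordered(old, new_sync_obj)) {
                    /* Either stall on the old fence here, or (not shown) emit
                     * a semaphore/barrier so the new CS cannot pass it. */
                    int ret = driver_sync_obj_wait(old);
                    if (ret)
                            return ret;
            }
            bo->sync_obj = new_sync_obj;
            return 0;
    }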




   

We can detach fence objects from buffers in the driver validation code,
because that code knows whether fences are implicitly ordered, or can order
them either by inserting a barrier (semaphore in NV languange) or waiting
 

I am not sure I follow you here. ttm_bo_wait needs the fences...
unless we want to move the fences out of TTM into drivers.
   


Please see the above explanation.



   

for the fence to expire. (For example if the new validation is READ and the
fence currently attached is WRITE, we might need to schedule a gpu cache
flush before detaching the write fence).
 

I am not sure what fences have to do with flushing. Write caches
should be flushed automatically when resources are unbound. When a
resource is used for write and read at the same time, it's not our
problem: the user is responsible for flushing (e.g. through memory and
texture barriers in OpenGL), not the driver.
   


How flushing is done is up to the driver writer, (fences is an excellent 
tool to do it in an efficient way), but barriers like the write-read 
barrier example above may need to be inserted for various reasons. Let's 
say you use render-to-texture, unbind the texture from the fbo, and then 
want to texture from it. At some point you *need* to flush if you have a 
write cache, and that flush needs to happen when you remove the write 
fence from the buffer, in order to replace it with a read fence, since 
after that the information that the buffer has been written to is gone. 
IIRC nouveau uses barriers like this to order fences from different 
command streams, Unichrome used it to order fences from different 
hardware engines.


In any case, I'm not saying fences are the best way to flush, but since
the bo code assumes that signaling a sync object means making the buffer
contents available for CPU read / write, it's usually a good way to do
it; there's even a sync_obj_flush() method that gets called when a
potential flush needs to happen.
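
As a sketch of that write-to-read transition (struct driver_bo and the driver_* helpers are hypothetical, not any real driver's API):

    /* Replace the write fence with a read fence; the cache flush has to be
     * scheduled before the write fence is dropped, because afterwards the
     * knowledge that the buffer was written is gone. */
    struct driver_bo {
            void *write_fence;
            void *read_fence;
    };

    extern void driver_emit_write_cache_flush(void *after_write_fence);
    extern void driver_fence_ref(void *fence);
    extern void driver_fence_unref(void **fence);

    static void driver_attach_read_fence(struct driver_bo *bo, void *read_fence)
    {
            if (bo->write_fence) {
                    driver_emit_write_cache_flush(bo->write_fence);
                    driver_fence_unref(&bo->write_fence);
            }
            if (bo->read_fence)
                    driver_fence_unref(&bo->read_fence);
            driver_fence_ref(read_fence);
            bo->read_fence = read_fence;
    }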




   

2) Can't we say that a write_sync_obj is simply a sync_obj? What's the
difference between those two? I think we should remove the write_sync_obj bo
member.
 

Okay, but I think we should remove sync_obj instead, and keep write
and read sync objs. In the case of READWRITE usage, read_sync_obj
would be equal to write_sync_obj.

   


Sure, I'm fine with that.

One other thing, though, that makes me a little puzzled:

Let's assume you don't allow readers and writers at the same time, which 
is my perception of how read- and write fences should work; you either 
have a list of read fences or a single write fence (in the same way a 
read-write lock works).


Now, if you only allow a single read fence, like in this patch, that
implies that you can only have either a single read fence or a single
write fence at any one time. We'd then only need a single fence pointer on
the bo, and sync_obj_arg would tell us whether to signal the fence for
read or for write (assuming that sync_obj_arg was set up to indicate
read / write at validation time); in that case the patch really isn't
necessary at all, as it only allows a single read fence?


Or is it that you want to allow read- and write fences co-existing? In 
that case, what's the use case?



Thanks,
Thomas




Marek
   






Re: [PATCH 1/2] drm/ttm: add a way to bo_wait for either the last read or last write

2011-08-12 Thread Jerome Glisse
On Sun, Aug 7, 2011 at 4:39 PM, Marek Olšák mar...@gmail.com wrote:
 Sometimes we want to know whether a buffer is busy and wait for it (bo_wait).
 However, sometimes it would be more useful to be able to query whether
 a buffer is busy and being either read or written, and wait until it's stopped
 being either read or written. The point of this is to be able to avoid
 unnecessary waiting, e.g. if a GPU has written something to a buffer and is
 now reading that buffer, and a CPU wants to map that buffer for read, it needs
 to only wait for the last write. If there were no write, there wouldn't be any
 waiting needed.

 This, of course, requires user space drivers to send read/write flags
 with each relocation (like we have read/write domains in radeon, so we can
 actually use those for something useful now).

 Now how this patch works:

 The read/write flags should be passed to ttm_validate_buffer. TTM maintains
 separate sync objects of the last read and write for each buffer, in addition
 to the sync object of the last use of a buffer. ttm_bo_wait then operates
 with one of the sync objects.
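
For concreteness, a sketch of the intended win on a kernel-side CPU-map path, written against the extended ttm_bo_wait() shown in the diff below; the helper itself is illustrative, and TTM_USAGE_WRITE is assumed to be defined by the patch alongside the TTM_USAGE_READWRITE value visible in the hunks:

    #include <linux/spinlock.h>
    #include <drm/ttm/ttm_bo_api.h>

    /* Illustrative only: before a CPU read mapping, wait on writers alone. */
    static int bo_wait_before_cpu_read(struct ttm_buffer_object *bo, bool no_wait)
    {
            int ret;

            spin_lock(&bo->bdev->fence_lock);
            /* GPU readers still in flight do not block a CPU read. */
            ret = ttm_bo_wait(bo, true, true, no_wait, TTM_USAGE_WRITE);
            spin_unlock(&bo->bdev->fence_lock);
            return ret;
    }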

Just a minor comment for extra safety, see below; otherwise:
Reviewed-by: Jerome Glisse jgli...@redhat.com

 Signed-off-by: Marek Olšák mar...@gmail.com
 ---
  drivers/gpu/drm/nouveau/nouveau_bo.c    |    3 +-
  drivers/gpu/drm/nouveau/nouveau_gem.c   |    5 +-
  drivers/gpu/drm/radeon/radeon_cs.c      |    1 +
  drivers/gpu/drm/radeon/radeon_object.h  |    2 +-
  drivers/gpu/drm/ttm/ttm_bo.c            |   97 ++
  drivers/gpu/drm/ttm/ttm_bo_util.c       |   26 +++--
  drivers/gpu/drm/ttm/ttm_bo_vm.c         |    2 +-
  drivers/gpu/drm/ttm/ttm_execbuf_util.c  |   17 +-
  drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c |    1 +
  include/drm/ttm/ttm_bo_api.h            |   16 +-
  include/drm/ttm/ttm_execbuf_util.h      |    6 ++
  11 files changed, 137 insertions(+), 39 deletions(-)

 diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
 index 890d50e..e87e24b 100644
 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
 +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
 @@ -1104,7 +1104,8 @@ nouveau_bo_vma_del(struct nouveau_bo *nvbo, struct nouveau_vma *vma)
         if (vma->node) {
                 if (nvbo->bo.mem.mem_type != TTM_PL_SYSTEM) {
                         spin_lock(&nvbo->bo.bdev->fence_lock);
 -                       ttm_bo_wait(&nvbo->bo, false, false, false);
 +                       ttm_bo_wait(&nvbo->bo, false, false, false,
 +                                   TTM_USAGE_READWRITE);
                         spin_unlock(&nvbo->bo.bdev->fence_lock);
                         nouveau_vm_unmap(vma);
                 }
 diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
 index 5f0bc57..322bf62 100644
 --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
 +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
 @@ -589,7 +589,8 @@ nouveau_gem_pushbuf_reloc_apply(struct drm_device *dev,
                 }
 
                 spin_lock(&nvbo->bo.bdev->fence_lock);
 -               ret = ttm_bo_wait(&nvbo->bo, false, false, false);
 +               ret = ttm_bo_wait(&nvbo->bo, false, false, false,
 +                                 TTM_USAGE_READWRITE);
                 spin_unlock(&nvbo->bo.bdev->fence_lock);
                 if (ret) {
                         NV_ERROR(dev, "reloc wait_idle failed: %d\n", ret);
 @@ -825,7 +826,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
         nvbo = nouveau_gem_object(gem);
 
         spin_lock(&nvbo->bo.bdev->fence_lock);
 -       ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
 +       ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait, TTM_USAGE_READWRITE);
         spin_unlock(&nvbo->bo.bdev->fence_lock);
         drm_gem_object_unreference_unlocked(gem);
         return ret;
 diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
 index fae00c0..14e8531 100644
 --- a/drivers/gpu/drm/radeon/radeon_cs.c
 +++ b/drivers/gpu/drm/radeon/radeon_cs.c
 @@ -80,6 +80,7 @@ int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
                         p->relocs[i].lobj.wdomain = r->write_domain;
                         p->relocs[i].lobj.rdomain = r->read_domains;
                         p->relocs[i].lobj.tv.bo = &p->relocs[i].robj->tbo;
 +                       p->relocs[i].lobj.tv.usage = TTM_USAGE_READWRITE;
                         p->relocs[i].handle = r->handle;
                         p->relocs[i].flags = r->flags;
                         radeon_bo_list_add_object(&p->relocs[i].lobj,
 diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
 index ede6c13..e9dc8b2 100644
 --- a/drivers/gpu/drm/radeon/radeon_object.h
 +++ b/drivers/gpu/drm/radeon/radeon_object.h
 @@ -130,7 +130,7 @@ static inline int radeon_bo_wait(struct radeon_bo *bo, u32 *mem_type,
         if (mem_type)
                 *mem_type = bo->tbo.mem.mem_type;
         if (bo->tbo.sync_obj)
 -               r