Re: GEM discussion questions

2008-05-22 Thread Thomas Hellström
Ian Romanick wrote:

 Thomas Hellström wrote:
 | Ian Romanick wrote:
 | Jerome Glisse wrote:
 | | On Mon, 19 May 2008 12:04:16 -0700
 | | Ian Romanick [EMAIL PROTECTED] wrote:
 | |
 | | The GLX spec says, basically, that the results of changes to a shared
 | | object in context A are guaranteed to be visible to context B when
 | | context B binds the object.  It leaves a lot of slack for changes to
 | | show up earlier.  This is part of the reason that app developers want
 | | NV_fence-like functionality.
 | |
 | | I quickly browsed the GLX spec and failed to spot where this topic
 | | appears.  And what does "B binds" mean in this context?  I am thinking
 | | of this use:
 | | A & B share obj
 | | A map
 | | B map
 | | A do some drawing
 |
 | Hmm,
 | According to the GL spec, section 2.9, a GL buffer object cannot be
 | mapped again while in the mapped state.
 | Given that, am I wrong in assuming that it's legal for B's map to fail
 | with an INVALID_OPERATION error, even if there's no intermediate
 | context B bind?

 My recollection is that it was supposed to be possible, but the language
 seems to indicate otherwise.  It would be interesting to see what other
 implementations do in this case.

 | Also, what should be the correct action by the driver when an attempt
 | is made to dereference a mapped buffer object with a GL command?

 Generate INVALID_OPERATION.  This is in the GL_ARB_vertex_buffer_object
 spec:

     "INVALID_OPERATION is generated if a buffer object that is currently
      mapped is used as a source of GL render data, or as a destination of
      GL query data."

OK, thanks Ian.

/Thomas





-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-21 Thread Jerome Glisse
On Mon, 19 May 2008 12:04:16 -0700
Ian Romanick [EMAIL PROTECTED] wrote:

 
 The GLX spec says, basically, that the results of changes to a shared
 object in context A are guaranteed to be visible to context B when
 context B binds the object.  It leaves a lot of slack for changes to
 show up earlier.  This is part of the reason that app developers want
 NV_fence-like functionality.

I quickly browsed the GLX spec and failed to spot where this topic appears.
And what does "B binds" mean in this context?  I am thinking of this use:
A & B share obj
A map
B map
A do some drawing
A unmap
A submit draw cmd which change obj
B want to draw at mapped obj

Here, should B see the old content of obj from before A modified it, or
should it map the actual object (even if there is drawing going on)?
Note that I explicitly didn't include any synchronization between A & B,
where I believe these 2 applications should sync together.

Cheers,
Jerome Glisse



Re: GEM discussion questions

2008-05-21 Thread Ian Romanick

Jerome Glisse wrote:
| On Mon, 19 May 2008 12:04:16 -0700
| Ian Romanick [EMAIL PROTECTED] wrote:
|
| The GLX spec says, basically, that the results of changes to a shared
| object in context A are guaranteed to be visible to context B when
| context B binds the object.  It leaves a lot of slack for changes to
| show up earlier.  This is part of the reason that app developers want
| NV_fence-like functionality.
|
| I quickly browsed the GLX spec and failed to spot where this topic appears.
| And what does "B binds" mean in this context?  I am thinking of this use:
| A & B share obj
| A map
| B map
| A do some drawing

Which will fail because A has it mapped.  Or do you mean some drawing
operation not involving the mapped object?

| A unmap
| A submit draw cmd which change obj
| B want to draw at mapped obj

Which will fail because B has it mapped.

| Here, should B see the old content of obj from before A modified it, or
| should it map the actual object (even if there is drawing going on)?
| Note that I explicitly didn't include any synchronization between A & B,
| where I believe these 2 applications should sync together.

The result is undefined.  I think part of the confusion here, and this is
my fault, is the difference between the data contents of the object and
the object itself.  There are no guarantees about the data contents (i.e.,
the texels of a texture) of an object.  Bind is the synchronization point
for things that affect the object itself (i.e., using glTexImage2D to
change the size of a texture).

We're going to update the GLX spec after OpenGL 3.0 ships, and we're going
to make a lot of this more explicit.  Right now it's mostly implied by
language spread through sections 2.3, 2.4, 2.5, and 2.7.



Re: GEM discussion questions

2008-05-21 Thread Thomas Hellström
Ian Romanick wrote:

 Jerome Glisse wrote:
 | On Mon, 19 May 2008 12:04:16 -0700
 | Ian Romanick [EMAIL PROTECTED] wrote:
 |
 | The GLX spec says, basically, that the results of changes to a shared
 | object in context A are guaranteed to be visible to context B when
 | context B binds the object.  It leaves a lot of slack for changes to
 | show up earlier.  This is part of the reason that app developers want
 | NV_fence-like functionality.
 |
 | I quickly browsed the GLX spec and failed to spot where this topic appears.
 | And what does "B binds" mean in this context?  I am thinking of this use:
 | A & B share obj
 | A map
 | B map
 | A do some drawing

   
Hmm,
According to the GL spec, section 2.9, a GL buffer object cannot be
mapped again while in the mapped state.
Given that, am I wrong in assuming that it's legal for B's map to fail
with an INVALID_OPERATION error, even if there's no intermediate context
B bind?

Also, what should be the correct action by the driver when an attempt is
made to dereference a mapped buffer object with a GL command?

/Thomas







Re: GEM discussion questions

2008-05-21 Thread Ian Romanick

Thomas Hellström wrote:
| Ian Romanick wrote:
| Jerome Glisse wrote:
| | On Mon, 19 May 2008 12:04:16 -0700
| | Ian Romanick [EMAIL PROTECTED] wrote:
| |
| | The GLX spec says, basically, that the results of changes to a shared
| | object in context A are guaranteed to be visible to context B when
| | context B binds the object.  It leaves a lot of slack for changes to
| | show up earlier.  This is part of the reason that app developers want
| | NV_fence-like functionality.
| |
| | I quickly browsed the GLX spec and failed to spot where this topic appears.
| | And what does "B binds" mean in this context?  I am thinking of this use:
| | A & B share obj
| | A map
| | B map
| | A do some drawing
|
| Hmm,
| According to the GL spec, section 2.9, a GL buffer object cannot be
| mapped again while in the mapped state.
| Given that, am I wrong in assuming that it's legal for B's map to fail
| with an INVALID_OPERATION error, even if there's no intermediate context
| B bind?

My recollection is that it was supposed to be possible, but the language
seems to indicate otherwise.  It would be interesting to see what other
implementations do in this case.

| Also, what should be the correct action by the driver when an attempt is
| made to dereference a mapped buffer object with a GL command?

Generate INVALID_OPERATION.  This is in the GL_ARB_vertex_buffer_object
spec:

    "INVALID_OPERATION is generated if a buffer object that is currently
     mapped is used as a source of GL render data, or as a destination of
     GL query data."
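For illustration, the rule Ian quotes can be reduced to a tiny driver-side check. This is a hypothetical model in plain C, not code from any real driver: a draw call that sources a currently mapped buffer object must report INVALID_OPERATION rather than produce undefined behavior.

```c
#include <stdbool.h>

/* Illustrative model of the ARB_vertex_buffer_object rule quoted above;
 * the names are invented for this sketch. */
enum gl_error { NO_ERROR = 0, INVALID_OPERATION = 0x0502 };

struct buffer_object {
    bool mapped;   /* set by MapBuffer, cleared by UnmapBuffer */
};

/* Check a draw call performs before dereferencing the buffer: a mapped
 * buffer may not be used as a source of GL render data. */
enum gl_error check_draw_source(const struct buffer_object *buf)
{
    return buf->mapped ? INVALID_OPERATION : NO_ERROR;
}
```

The point is that the error is raised per draw call, so the same buffer becomes usable again as soon as it is unmapped.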



Re: GEM discussion questions

2008-05-20 Thread Thomas Hellström
Keith Packard wrote:
 On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:
   
  The obvious overhead I was referring to is the extra malloc / free.
  That's why I went on to say "So, now I have to go back and spend time
  caching the buffer allocations and doing other things to make it fast."
  In that context, "I" is idr as an app developer. :)
 

 You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
 than the cost of map/write/unmap.

   
 One problem that we have here is that none of the benchmarks currently
 being used hit any of these paths.  OpenArena, Enemy Territory (I assume
 this is the older Quake 3 engine game), and gears don't use MapBuffer at
 all.  Unfortunately, any apps that would hit these paths are so
 fill-rate bound on i965 that they're useless for measuring CPU overhead.
 

 The only place we see significant map/write/unmap vs
 malloc/write/copy/free is with batch buffers, and so far the
 measurements that I've taken which appear to show a benefit haven't been
 reproduced by others...
   

We could certainly use texdown to test this out, if the GEM i915 
driver implemented a pwrite-enabled
struct dd_function_table::TextureMemCpy()

/Thomas


   
 







Re: GEM discussion questions

2008-05-20 Thread Keith Whitwell
On Tue, May 20, 2008 at 1:29 PM, Thomas Hellström
[EMAIL PROTECTED] wrote:
 Keith Packard wrote:
 On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:

  The obvious overhead I was referring to is the extra malloc / free.
  That's why I went on to say "So, now I have to go back and spend time
  caching the buffer allocations and doing other things to make it fast."
  In that context, "I" is idr as an app developer. :)


 You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
 than the cost of map/write/unmap.


 One problem that we have here is that none of the benchmarks currently
 being used hit any of these paths.  OpenArena, Enemy Territory (I assume
 this is the older Quake 3 engine game), and gears don't use MapBuffer at
 all.  Unfortunately, any apps that would hit these paths are so
 fill-rate bound on i965 that they're useless for measuring CPU overhead.


 The only place we see significant map/write/unmap vs
 malloc/write/copy/free is with batch buffers, and so far the
 measurements that I've taken which appear to show a benefit haven't been
 reproduced by others...


 We could certainly use texdown to test this out, if the GEM i915
 driver implemented a pwrite-enabled
 struct dd_function_table::TextureMemCpy()

Double-copy texture uploads have been 'tested' in the past -- and
their poor performance was one of the motivating factors for creating
a single-copy scheme.

The double-copy upload path isn't *that* bad, as long as the entire
texture fits into cache...  As soon as it exceeds the cache
dimensions, it falls off a cliff.

FWIW, Intel is making some CPUs with pretty small caches these days,
and teaming them up with i945 GPUs, so this isn't completely
theoretical.

Keith



Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 00:14 -0700, Ian Romanick wrote:

 - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
 needs to be handled.  I know of at least one piece of hardware with a
 kooky command buffer that wants to be used that way.

Oh, so mapping the same command buffer for both activities.

For Intel, we use batch buffers written with the CPU and queued to the
GPU by the kernel, using suitable flushing to get data written to memory
before the GPU is asked to read it.

It could be that this 'command domain' just needs to be separate, and
mapped coherent between GPU and CPU so that this works.

However, instead of messing with the API on some theoretical hardware,
I'm really only interested in seeing how the API fits to actual
hardware. Having someone look at how a GEM-like API would work on Radeon
or nVidia hardware would go a long way toward exploring which pieces are
general purpose and which are UMA- (or even Intel-) specific.

 - I suspect that in the (near) future we may want multiple read_domains.

That's why the argument is called 'read_domains' and not 'read_domain'.
We already have operations that read objects to both the sampler and
render caches.

 - I think drm_i915_gem_relocation_entry should have a size field.
 There are a lot of cases in the current GL API (and more to come) where
 the entire object will trivially not be used.  Clamped LOD on textures
 is a trivial example, but others exist as well.

There are a couple of places where this might be useful (presumably both
offset and length); the 'set_domain' operation seems like one of them,
and if we place it there, then other places where domain information is
passed across the API might be good places to include that as well.

The most obvious benefit here is reducing clflush action as we flip
buffers from GPU to CPU for fallbacks; however, flipping objects back
and forth should be avoided anyway, eliminating this kind of fallback
would provide more performance benefit than making the fallback a bit
faster.

-- 
[EMAIL PROTECTED]




Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Keith Packard wrote:
| On Mon, 2008-05-19 at 00:14 -0700, Ian Romanick wrote:
|
| - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
| needs to be handled.  I know of at least one piece of hardware with a
| kooky command buffer that wants to be used that way.
|
| Oh, so mapping the same command buffer for both activities.
|
| For Intel, we use batch buffers written with the CPU and queued to the
| GPU by the kernel, using suitable flushing to get data written to memory
| before the GPU is asked to read it.
|
| It could be that this 'command domain' just needs to be separate, and
| mapped coherent between GPU and CPU so that this works.
|
| However, instead of messing with the API on some theoretical hardware,
| I'm really only interested in seeing how the API fits to actual
| hardware. Having someone look at how a gem-like API would work on Radeon
| or nVidia hardware would go a long ways to exploring what pieces are
| general purpose and which are UMA- (or even Intel-) specific.

Sorry for being subtle.  It isn't theoretical hardware.  It's XP10.  It
uses a weird linked-list mechanism for commands.  Each command has a
header that contains a pointer to the next command and a flag.  The flag
says whether the command is valid or an end-of-list sentinel.  The
driver can then keep linking in new commands and changing sentinels to
commands while the hardware is going.

I'd have to go back and look, but I think MGA would work well with a
similar domain setting.

| - I suspect that in the (near) future we may want multiple
| read_domains.
|
| That's why the argument is called 'read_domains' and not 'read_domain'.
| We already have operations that read objects to both the sampler and
| render caches.

Ah, so it is.  It wasn't clear in the document that the domain values
were bits.  It looks more like they're enums.

| - I think drm_i915_gem_relocation_entry should have a size field.
| There are a lot of cases in the current GL API (and more to come) where
| the entire object will trivially not be used.  Clamped LOD on textures
| is a trivial example, but others exist as well.

The specific situation I was thinking of above is where a 2048x2048
mipmapped texture has been evicted from texturable memory.  The LOD
range of the card is clamped so that only the 512x512 mipmap will be
used (imagine doing render-to-texture to generate the 256x256 mipmap
from the 512x512).  Having both an offset and a size allows the memory
manager to only bring back in the required subset of the texture.

| There are a couple of places where this might be useful (presumably both
| offset and length); the 'set_domain' operation seems like one of them,
| and if we place it there, then other places where domain information is
| passed across the API might be good places to include that as well.
|
| The most obvious benefit here is reducing clflush action as we flip
| buffers from GPU to CPU for fallbacks; however, flipping objects back
| and forth should be avoided anyway, eliminating this kind of fallback
| would provide more performance benefit than making the fallback a bit
| faster.

There is also a bunch of up-and-coming GL functionality that you may not
be aware of that changes this picture a *LOT*.

- GL_EXT_texture_buffer_object allows a portion of a buffer object to be
used to back a texture.

- GL_EXT_bindable_uniform allows a portion of a buffer object to be used
to back a block of uniforms.

- GL_EXT_transform_feedback allows the output of the vertex shader or
geometry shader to be stored to buffer objects.

- Long's Peak has functionality where a buffer object can be mapped
*without* waiting for all previous GL commands to complete.
GL_APPLE_flush_buffer_range has similar functionality.

- Long's Peak has NV_fence-like synchronization objects.

The usage scenario that ISVs want (and that other vendors are going to
make fast) is one where transform feedback is used to render a bunch of
objects to a single buffer object.  There is a fair amount of overhead
in changing all the output buffer object bindings, so ISVs don't want to
be forced to take that performance hit.  If a fence is set after each
object is sent down the pipe, the app can wait for one object to finish
rendering, map the buffer object, and operate on the data.

Similar situations can occur even without transform feedback.  Imagine
an app that is streaming data into a buffer object.  It streams in one
object (via MapBuffer), does a draw command, sets a fence, streams in
the next, etc.  When the buffer is full, it waits on the first fence,
and starts back at the beginning.  Apparently, a lot of ISVs are wanting
to do this.  I'm not a big fan of this usage.  It seems that nobody ever
got fire-and-forget buffer objects (repeated cycle of allocate, fill,
use, delete) to be fast, so this is what ISVs are wanting instead.
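The streaming pattern described above can be sketched without any GL at all. In the sketch below, draw calls and NV_fence objects are modeled as plain sequence numbers (everything here is an invented stand-in, not the real API): the app cycles through regions of one buffer, "sets a fence" after drawing from each region, and waits on a region's fence only when it is about to overwrite that region.

```c
#include <stddef.h>
#include <string.h>

#define REGIONS     4
#define REGION_SIZE 256

static char     ring[REGIONS][REGION_SIZE];   /* one shared buffer object   */
static unsigned region_fence[REGIONS];        /* fence set after each draw  */
static unsigned gpu_retired;                  /* highest retired fence      */
static unsigned next_fence = 1;

/* Stand-in for waiting on a fence; this model's "GPU" retires work
 * immediately, so waiting just advances the retired counter. */
static void wait_fence(unsigned fence)
{
    if (gpu_retired < fence)
        gpu_retired = fence;
}

/* Stream object number i into the ring: wait only for the fence guarding
 * this region, write the data (the MapBuffer step), then "draw" and set a
 * new fence on the region.  Returns the fence protecting the new data. */
unsigned stream_object(unsigned i, const void *data, size_t len)
{
    unsigned slot = i % REGIONS;
    wait_fence(region_fence[slot]);
    memcpy(ring[slot], data, len < REGION_SIZE ? len : REGION_SIZE);
    region_fence[slot] = next_fence++;
    return region_fence[slot];
}
```

The key property is that the app blocks on at most one object's worth of outstanding work, instead of waiting for the whole buffer to become idle.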

In other news, app developers *hate* BufferSubData.  They much prefer to
just map the buffer and write to it or read from it.

All of this points to mapping buffers to the CPU in, on, and around GPU
usage being a very common operation.  It's also an operation that app
developers expect to be fast.

Re: GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 02:22:00 -0700
Ian Romanick [EMAIL PROTECTED] wrote:
 
 There is also a bunch of up-and-coming GL functionality that you may not
 be aware of that changes this picture a *LOT*.
 
 - GL_EXT_texture_buffer_object allows a portion of a buffer object to be
 used to back a texture.
 
 - GL_EXT_bindable_uniform allows a portion of a buffer object to be used
 to back a block of uniforms.
 
 - GL_EXT_transform_feedback allows the output of the vertex shader or
 geometry shader to be stored to buffer objects.
 
 - Long's Peak has functionality where a buffer object can be mapped
 *without* waiting for all previous GL commands to complete.
 GL_APPLE_flush_buffer_range has similar functionality.
 
 - Long's Peak has NV_fence-like synchronization objects.
 
 The usage scenario that ISVs want (and that other vendors are going to
 make fast) is one where transform feedback is used to render a bunch of
 objects to a single buffer object.  There is a fair amount of overhead
 in changing all the output buffer object bindings, so ISVs don't want to
 be forced to take that performance hit.  If a fence is set after each
 object is sent down the pipe, the app can wait for one object to finish
 rendering, map the buffer object, and operate on the data.
 
 Similar situations can occur even without transform feedback.  Imagine
 an app that is streaming data into a buffer object.  It streams in one
 object (via MapBuffer), does a draw command, sets a fence, streams in
 the next, etc.  When the buffer is full, it waits on the first fence,
 and starts back at the beginning.  Apparently, a lot of ISVs are wanting
 to do this.  I'm not a big fan of this usage.  It seems that nobody ever
 got fire-and-forget buffer objects (repeated cycle of allocate, fill,
 use, delete) to be fast, so this is what ISVs are wanting instead.
 
 In other news, app developers *hate* BufferSubData.  They much prefer to
 just map the buffer and write to it or read from it.
 
 All of this points to mapping buffers to the CPU in, on, and around GPU
 usage being a very common operation.  It's also an operation that app
 developers expect to be fast.
 

Thanks, Ian, for stressing current and future usage. I was really hoping
that with GL3 buffer object mapping would have vanished, but I guess, as
you said, the fire-and-forget path never got optimized.

In GL3, must an object be unmapped before being used?  IIRC this is what
is required in current GL 1.x and GL 2.x.  If so, I think I can still use
VRAM as a cache, i.e. I put there the objects which are almost never
mapped (like a constant texture or a constant vertex table).  This saves
me from designing a complex solution for cleanly handling unmappable VRAM.

A side question: is there any data on TLB flushes?  I.e., how much does a
map/unmap cycle on a client vma cost?

In the meantime I think we can promote the use of pread/pwrite or
BufferSubData to take advantage of offset & size information in the
software we maintain (Mesa, EXA, ...).
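GEM's proposed pread/pwrite take an explicit offset and size, so a partial update only moves the bytes involved. The same access pattern can be illustrated with the POSIX pread/pwrite calls on an ordinary file (a sketch of the idea only; the real GEM entry points are ioctls on a DRM file descriptor, which are not exercised here).

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Overwrite bytes [4, 8) of a 16-byte "object" without transferring the
 * other 12 bytes, then read the whole object back for verification.
 * Returns 0 on success, -1 on any I/O failure or content mismatch. */
int partial_update_demo(const char *path)
{
    char buf[16];
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (pwrite(fd, "aaaaaaaaaaaaaaaa", 16, 0) != 16 ||  /* initial contents */
        pwrite(fd, "BBBB", 4, 4) != 4 ||                /* partial update   */
        pread(fd, buf, 16, 0) != 16) {                  /* read back whole  */
        close(fd);
        return -1;
    }
    close(fd);
    return memcmp(buf, "aaaaBBBBaaaaaaaa", 16) == 0 ? 0 : -1;
}
```

Because the offset and size are explicit, the kernel side can clflush or copy only the touched range rather than the whole object.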

Ian, do you know why devs hate BufferSubData?  Is there any place I can
read about it?  I have been focusing on driver dev and I am a little bit
out of date on today's typical GL usage; I assumed that hw manufacturers
promoted the use of BufferSubData to software devs.

Cheers,
Jerome Glisse



Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Jerome Glisse wrote:

| Thanks, Ian, for stressing current and future usage. I was really hoping
| that with GL3 buffer object mapping would have vanished, but I guess, as
| you said, the fire-and-forget path never got optimized.

I think various drivers have tried to optimize it.  I think it's just a
case where an application-managed suballocator will just always be faster.

| In GL3, must an object be unmapped before being used?  IIRC this is what
| is required in current GL 1.x and GL 2.x.  If so, I think I can still use
| VRAM as a cache, i.e. I put there the objects which are almost never
| mapped (like a constant texture or a constant vertex table).  This saves
| me from designing a complex solution for cleanly handling unmappable VRAM.

Be careful here.  An object must be unmapped in the context where it is
used for drawing.  However, buffer objects can be shared between
contexts.  This means that even today in OpenGL 1.5 context A can be
drawing with a buffer object while context B has it mapped.  Of course,
context A doesn't have to see the changes caused by context B until the
next time it binds the buffer.  This means that copying data for the map
will just work.

But to actually answer the original question, a buffer that will be used
as a source or destination by the GL must be unmapped at Begin time.

| A side question: is there any data on TLB flushes?  I.e., how much does a
| map/unmap cycle on a client vma cost?
|
| In the meantime I think we can promote the use of pread/pwrite or
| BufferSubData to take advantage of offset & size information in the
| software we maintain (Mesa, EXA, ...).
|
| Ian, do you know why devs hate BufferSubData?  Is there any place I can
| read about it?  I have been focusing on driver dev and I am a little bit
| out of date on today's typical GL usage; I assumed that hw manufacturers
| promoted the use of BufferSubData to software devs.

Because it forces them to make extra copies of their data and do extra
copy operations.  As an app developer, I *much* prefer:

glBindBuffer(GL_ARRAY_BUFFER, my_buf);
GLfloat *data = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_WRITE);
if (data == NULL) {
    /* fail */
}

/* Fill in buffer data */

glUnmapBuffer(GL_ARRAY_BUFFER);

Over:

GLfloat *data = malloc(buffer_size);
if (data == NULL) {
    /* fail */
}

/* Fill in buffer data */

glBindBuffer(GL_ARRAY_BUFFER, my_buf);
glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
free(data);

The second version obviously has extra overhead and takes a performance
hit.  So, now I have to go back and spend time caching the buffer
allocations and doing other things to make it fast.  In the MapBuffer
version, I can leverage the work done by the smart guys that write drivers.



Re: GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 10:25:16 -0700
Ian Romanick [EMAIL PROTECTED] wrote:

 
 | In GL3, must an object be unmapped before being used?  IIRC this is what
 | is required in current GL 1.x and GL 2.x.  If so, I think I can still use
 | VRAM as a cache, i.e. I put there the objects which are almost never
 | mapped (like a constant texture or a constant vertex table).  This saves
 | me from designing a complex solution for cleanly handling unmappable VRAM.
 
 Be careful here.  An object must be unmapped in the context where it is
 used for drawing.  However, buffer objects can be shared between
 contexts.  This means that even today in OpenGL 1.5 context A can be
 drawing with a buffer object while context B has it mapped.  Of course,
 context A doesn't have to see the changes caused by context B until the
 next time it binds the buffer.  This means that copying data for the map
 will just work.
 

Is the result defined by the GL specification?  I.e., does B need to see
an old copy of the object, or, if A is rendering to this object, can we
let B see the ongoing rendering?

In the latter case this likely leads to broken rendering if there is no
synchronization between A & B.

Cheers,
Jerome Glisse



Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:

   glBindBuffer(GL_ARRAY_BUFFER, my_buf);
   GLfloat *data = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_WRITE);
   if (data == NULL) {
   /* fail */
   }
 
   /* Fill in buffer data */
 
   glUnmapBuffer(GL_ARRAY_BUFFER);
 
 Over:
 
   GLfloat *data = malloc(buffer_size);
   if (data == NULL) {
   /* fail */
   }
 
   /* Fill in buffer data */
 
   glBindBuffer(GL_ARRAY_BUFFER, my_buf);
   glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
   free(data);

In terms of system performance, that 'extra copy' is not a problem
though; the only cost is the traffic to the graphics chip, and these
both do precisely the same amount of work. The benefit to the latter
approach is that we get to use cache-aware copy code. The former can't
do this as easily when updating only a portion of the data.

 The second version obviously has extra overhead and takes a performance
 hit. 

My measurements show that doing a cache-aware copy is a net performance
win over using cache-ignorant word-at-a-time writes.
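The malloc/write/copy/free cycle Keith is defending can be shown in miniature (an illustrative sketch; a real driver would copy into GPU-visible pages rather than another caller-provided block): the app fills an ordinary cached staging buffer at leisure, and the final transfer is one sequential bulk copy that cache-aware code can optimize.

```c
#include <stdlib.h>
#include <string.h>

/* Fill a cached staging buffer, then hand it off as a single bulk copy
 * (memcpy stands in for cache-aware copy code).  Returns 0 on success,
 * -1 if the staging allocation fails. */
int staged_upload(float *dst, size_t n)
{
    float *staging = malloc(n * sizeof *staging);
    if (staging == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)   /* app writes to ordinary cached memory */
        staging[i] = (float)i;

    memcpy(dst, staging, n * sizeof *dst);  /* one sequential, bulk copy */
    free(staging);  /* freed pages stay cache-warm for the next allocation */
    return 0;
}
```

This is Keith's point about the allocator: the cache lines the staging buffer warmed up are reused by the next malloc, so the "extra" copy costs less than it looks.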

-- 
[EMAIL PROTECTED]




Re: GEM discussion questions

2008-05-19 Thread Thomas Hellström
Keith Packard wrote:
 On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:

   
  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
  GLfloat *data = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_WRITE);
  if (data == NULL) {
  /* fail */
  }

  /* Fill in buffer data */

  glUnmapBuffer(GL_ARRAY_BUFFER);

 Over:

  GLfloat *data = malloc(buffer_size);
  if (data == NULL) {
  /* fail */
  }

  /* Fill in buffer data */

  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
  glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
  free(data);
 

 In terms of system performance, that 'extra copy' is not a problem
 though; the only cost is the traffic to the graphics chip, and these
 both do precisely the same amount of work. The benefit to the latter
 approach is that we get to use cache-aware copy code. The former can't
 do this as easily when updating only a portion of the data.

   
 The second version obviously has extra overhead and takes a performance
 hit. 
 

 My measurements show that doing a cache-aware copy is a net performance
 win over using cache-ignorant word-at-a-time writes.
   
I think the point here is that when the buffer in 1) is mapped
write-combined, which IMHO is the obvious approach, the caches aren't
affected at all.

In 2) you have two opportunities to completely fill the cache with data 
that shouldn't need to be reused. With cache-aware copy code you can 
reduce the impact of one of those opportunities.

/Thomas

   
 







Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 20:22 +0200, Thomas Hellström wrote:

 I think the point here is that when the buffer in 1) is mapped
 write-combined (which IMHO is the obvious approach), the caches aren't
 affected at all.

write-combining only wins if you manage to get writes to the same cache
line to line up appropriately. Doing significant computation between
writes to the WC region means failing to meet the necessary conditions,
so the WC writes end up trickling out slowly.

 In 2) you have two opportunities to completely fill the cache with data 
 that shouldn't need to be reused. With cache-aware copy code you can 
 reduce the impact of one of those opportunities.

The allocator should be re-using recently freed pages for other
activity, so your cache-loaded pages will not go to waste, even if all
you did was fill them with data and copy them to the graphics object.

So, it turns out the 'malloc, fill, copy, free' cycle is actually fairly
good from a cache perspective. And, the gem benchmarks bear this out
with better-than-classic bandwidth from CPU to GPU for raw vertices. We
might do better by using WC pages on the backend, rather than using
clflush, but not a lot.

-- 
[EMAIL PROTECTED]




Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Jerome Glisse wrote:
| On Mon, 19 May 2008 10:25:16 -0700
| Ian Romanick [EMAIL PROTECTED] wrote:
|
| | In GL3, must an object be unmapped before being used?  IIRC this is
| | what is required in current GL 1.x and GL 2.x.  If so, I think I can
| | still use VRAM as a cache, i.e. I put there objects which are almost
| | never mapped (like a constant texture or a constant vertex table).
| | This saves me from devising a complex solution for cleanly handling
| | unmappable VRAM.
|
| Be careful here.  An object must be unmapped in the context where it is
| used for drawing.  However, buffer objects can be shared between
| contexts.  This means that even today in OpenGL 1.5 context A can be
| drawing with a buffer object while context B has it mapped.  Of course,
| context A doesn't have to see the changes caused by context B until the
| next time it binds the buffer.  This means that copying data for the map
| will just work.
|
| Is the result defined by the GL specification?  I.e., does B need to
| see an old copy of the object, or, if A is rendering to this object,
| can we let B see the ongoing rendering?
|
| In the latter case this likely leads to broken rendering if there is
| no synchronization between A and B.

The GLX spec says, basically, that the results of changes to a shared
object in context A are guaranteed to be visible to context B when
context B binds the object.  It leaves a lot of slack for changes to
show up earlier.  This is part of the reason that app developers want
NV_fence-like functionality.



Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Keith Packard wrote:
| On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:
|
|  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
|  GLfloat *data = glMapBuffer(GL_ARRAY_BUFFER, GL_READ_WRITE);
|  if (data == NULL) {
|  /* fail */
|  }
|
|  /* Fill in buffer data */
|
|  glUnmapBuffer(GL_ARRAY_BUFFER);
|
| Over:
|
|  GLfloat *data = malloc(buffer_size);
|  if (data == NULL) {
|  /* fail */
|  }
|
|  /* Fill in buffer data */
|
|  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
|  glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
|  free(data);
|
| In terms of system performance, that 'extra copy' is not a problem
| though; the only cost is the traffic to the graphics chip, and these
| both do precisely the same amount of work. The benefit to the latter
| approach is that we get to use cache-aware copy code. The former can't
| do this as easily when updating only a portion of the data.

It depends on the hardware.  In the second approach the driver has no
opportunity to do something smart if the copy path isn't the fast
path.  Applications are being tuned more for the hardware where the copy
path isn't the fast path.

| The second version obviously has extra overhead and takes a performance
| hit.
|
| My measurements show that doing a cache-aware copy is a net performance
| win over using cache-ignorant word-at-a-time writes.

The obvious overhead I was referring to is the extra malloc / free.
That's why I went on to say, "So, now I have to go back and spend time
caching the buffer allocations and doing other things to make it fast."
In that context, "I" is idr as an app developer. :)

One problem that we have here is that none of the benchmarks currently
being used hit any of these paths.  OpenArena, Enemy Territory (I assume
this is the older Quake 3 engine game), and gears don't use MapBuffer at
all.  Unfortunately, any apps that would hit these paths are so
fill-rate bound on i965 that they're useless for measuring CPU overhead.



Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:

 It depends on the hardware.  In the second approach the driver has no
 opportunity to do something smart if the copy path isn't the fast
 path.  Applications are being tuned more for the hardware where the copy
 path isn't the fast path.

It really only depends on the CPU and bus architecture; the GPU
architecture is not relevant here. The cost is getting data from the CPU
into the GPU cache coherence domain; currently that involves actually
writing the data from the CPU over some kind of bus to physical memory.

 The obvious overhead I was referring to is the extra malloc / free.
 That's why I went on to say, "So, now I have to go back and spend time
 caching the buffer allocations and doing other things to make it fast."
 In that context, "I" is idr as an app developer. :)

You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
than the cost of map/write/unmap.

 One problem that we have here is that none of the benchmarks currently
 being used hit any of these paths.  OpenArena, Enemy Territory (I assume
 this is the older Quake 3 engine game), and gears don't use MapBuffer at
 all.  Unfortunately, any apps that would hit these paths are so
 fill-rate bound on i965 that they're useless for measuring CPU overhead.

The only place we see significant map/write/unmap vs
malloc/write/copy/free is with batch buffers, and so far the
measurements that I've taken which appear to show a benefit haven't been
reproduced by others...

-- 
[EMAIL PROTECTED]




Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Keith Packard wrote:
| On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:
|
| It depends on the hardware.  In the second approach the driver has no
| opportunity to do something smart if the copy path isn't the fast
| path.  Applications are being tuned more for the hardware where the copy
| path isn't the fast path.
|
| It really only depends on the CPU and bus architecture; the GPU
| architecture is not relevant here. The cost is getting data from the CPU
| into the GPU cache coherence domain, currently that involves actually
| writing the data from the CPU over some kind of bus to physical memory.
|
| The obvious overhead I was referring to is the extra malloc / free.
| That's why I went on to say, "So, now I have to go back and spend time
| caching the buffer allocations and doing other things to make it fast."
| In that context, "I" is idr as an app developer. :)
|
| You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
| than the cost of map/write/unmap.

Using glMapBuffer does not necessarily mean that the driver is doing
map/write/unmap.  In fact, based on measurements I took back in 2006,
fglrx doesn't (or didn't at the time, anyway).  See section 4.3 of
http://web.cecs.pdx.edu/~idr/publications/ddc2006-opengl_immediate_mode.pdf

It means that the driver *CAN DO THAT IF IT WANTS TO.*  Some drivers on
some platforms are clearly doing that, and they're running really fast.
Using glBufferSubData *FORCES* the driver to do the copy and *FORCES*
the app to do extra buffer management.

Apps are using and will increasingly use the glMapBuffer path.  With the
information currently at hand, doing the alloc/copy/upload/free in the
driver might be the win.  Great.  It's way too soon to box ourselves
into that route.  If we're going to be stuck with an unchangeable
interface for another 5 years, it had better be flexible enough to
support more than one way to do things under the sheets.




Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 17:27 -0700, Ian Romanick wrote:

 Apps are using and will increasingly use the glMapBuffer path.  With the
 information currently at hand, doing the alloc/copy/upload/free in the
 driver might be the win.  Great.  It's way too soon to box ourselves
 into that route.  If we're going to be stuck with an unchangeable
 interface for another 5 years, it had better be flexible enough to
 support more than one way to do things under the sheets.

No-one is forcing anyone to do anything -- certainly gem supports mmap
as well as pwrite, so you can do whichever you prefer. The reason pwrite
was added was to support the SubData path directly in the kernel,
instead of providing only a map/copy/unmap path. That way the kernel
gets to choose how it implements both paths, and so does the
application.

-- 
[EMAIL PROTECTED]




Re: GEM discussion questions

2008-05-19 Thread Ian Romanick

Keith Packard wrote:
| On Mon, 2008-05-19 at 17:27 -0700, Ian Romanick wrote:
|
| Apps are using and will increasingly use the glMapBuffer path.  With the
| information currently at hand, doing the alloc/copy/upload/free in the
| driver might be the win.  Great.  It's way too soon to box ourselves
| into that route.  If we're going to be stuck with an unchangeable
| interface for another 5 years, it had better be flexible enough to
| support more than one way to do things under the sheets.
|
| No-one is forcing anyone to do anything -- certainly gem supports mmap
| as well as pwrite, so you can do whichever you prefer. The reason pwrite
| was added was to support the SubData path directly in the kernel,
| instead of providing only a map/copy/unmap path. That way the kernel
| gets to choose how it implements both paths, and so does the
| application.

I know...that's one of the things I liked about it. :)
