Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-11 Thread Daniel Vetter
On Sat, Aug 10, 2013 at 05:36:07PM +0200, Marek Olšák wrote:
 On Sat, Aug 10, 2013 at 5:09 PM, Christian König
 deathsim...@vodafone.de wrote:
  Am 10.08.2013 15:53, schrieb Marek Olšák:
 
  The RCU approach sounds good, but you can never know if 16 is enough.
  We should release the buffer once it is full and allocate a new one.
  The cache bufmgr in the winsys will assure there won't be any buffer
  allocation overhead - it would work kinda a like a ring of buffers.
 
 
  Are you sure of that? The overhead of allocating a new buffer was what
  always looked so unfriendly to me with this approach.
 
 Absolutely. We've had this optimization since 2010 or so. As long as
 you use winsys-buffer_create(use_reusable_pool=TRUE), you're fine. A
 deleted buffer is added to a list of deleted buffers and if nobody
 reclaims it, it will be released after 1 second. During that 1 second,
 the buffer must first become idle and then anybody can reclaim it with
 the same buffer_create call. If the buffer had been mapped before,
 it's still mapped, so there is even no map-buffer overhead. This is a
 crucial part of our driver stack that makes vertex uploading so
 efficient. Don't underestimate the Radeon winsys. :)

Just an fyi how we do that for libdrm_intel: We have an madv ioctl, and
every buffer in the allocation cache is marked purgeable. When memory gets
tight the kernel can drop the backing storage on the floor right away for
such buffers (and we preferrentially reap such objects from our shrinker).
When reusing such a buffer libdrm calls madv(WILLNEED) to make sure the
buffer is still around. If it got reaped libdrm cleans up the entire cache
and checks for other reaped buffer objects.

With this you don't need tunables like the 1s delay you're using and
instead cache size is directly driven by vm pressure.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-11 Thread Alex Deucher
On Sat, Aug 10, 2013 at 11:09 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 10.08.2013 15:53, schrieb Marek Olšák:

 The RCU approach sounds good, but you can never know if 16 is enough.
 We should release the buffer once it is full and allocate a new one.
 The cache bufmgr in the winsys will assure there won't be any buffer
 allocation overhead - it would work kinda a like a ring of buffers.


 Are you sure of that? The overhead of allocating a new buffer was what
 always looked so unfriendly to me with this approach.

 On the other hand the CP definitely can't handle more than 8 contexts at the
 same time (and one of them is always the clear context), so I strongly think
 we should be on the save side with 16 slots here. I'm just not sure if the
 SQ could add some more depth to our pipeline, maybe Alex knows more on this.


IIRC, it's 8 contexts on larger cards and 4 on smaller cards.  You can
look up the details in the kernel driver.  The contexts only affect
the CONTEXT registers from the 3D pipeline.

Alex

 Christian.


 Marek

 On Sat, Aug 10, 2013 at 10:45 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 09.08.2013 20:06, schrieb Marek Olšák:

 [SNIP]

 What if I kept the current emission code, and only allocated a new
 buffer at the end of the emit function, copied all descriptors to it
 using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The
 buffer where the descriptors are updated using WRITE_DATA would act as
 a staging buffer only and shaders would always read from the fresh new
 copy. Does that sound good to you?


 That sounds like the solution with multiple buffers I already suggest,
 but I
 would rather use some RCU approach to it. Basically we just have to
 handle
 it like the context based resources on earlier asics. So to me a proper
 solution should look something like this:

 We allocate a ring of (let's say) 16 slots for descriptor arrays, fill
 the
 first slot with WRITE_DATA packets and then use it in a draw command.

 As soon as any of the descriptors is about to change we copy it's content
 to
 the next slot, let the SPI_SHADER_USER_DATA point to it, make the
 necessary
 updates using WRITE_DATA and then use it in a draw command.

 This repeats over and over again, all we need to make sure is that we
 have
 enough slots in the ring to be sure that we never override descriptors
 when
 they are still in use, but I'm pretty sure that we should be on the save
 side with 16 or so.

 We can even prepare the commands for the switch from one slot to the next
 only once and then use it for the whole lifetime of the driver.

 Christian.


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-10 Thread Christian König

Am 09.08.2013 20:06, schrieb Marek Olšák:

[SNIP]
What if I kept the current emission code, and only allocated a new
buffer at the end of the emit function, copied all descriptors to it
using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The
buffer where the descriptors are updated using WRITE_DATA would act as
a staging buffer only and shaders would always read from the fresh new
copy. Does that sound good to you?


That sounds like the solution with multiple buffers I already suggest, 
but I would rather use some RCU approach to it. Basically we just have 
to handle it like the context based resources on earlier asics. So to me 
a proper solution should look something like this:


We allocate a ring of (let's say) 16 slots for descriptor arrays, fill 
the first slot with WRITE_DATA packets and then use it in a draw command.


As soon as any of the descriptors is about to change we copy it's 
content to the next slot, let the SPI_SHADER_USER_DATA point to it, make 
the necessary updates using WRITE_DATA and then use it in a draw command.


This repeats over and over again, all we need to make sure is that we 
have enough slots in the ring to be sure that we never override 
descriptors when they are still in use, but I'm pretty sure that we 
should be on the save side with 16 or so.


We can even prepare the commands for the switch from one slot to the 
next only once and then use it for the whole lifetime of the driver.


Christian.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-10 Thread Marek Olšák
The RCU approach sounds good, but you can never know if 16 is enough.
We should release the buffer once it is full and allocate a new one.
The cache bufmgr in the winsys will assure there won't be any buffer
allocation overhead - it would work kinda a like a ring of buffers.

Marek

On Sat, Aug 10, 2013 at 10:45 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 09.08.2013 20:06, schrieb Marek Olšák:

 [SNIP]

 What if I kept the current emission code, and only allocated a new
 buffer at the end of the emit function, copied all descriptors to it
 using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The
 buffer where the descriptors are updated using WRITE_DATA would act as
 a staging buffer only and shaders would always read from the fresh new
 copy. Does that sound good to you?


 That sounds like the solution with multiple buffers I already suggest, but I
 would rather use some RCU approach to it. Basically we just have to handle
 it like the context based resources on earlier asics. So to me a proper
 solution should look something like this:

 We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the
 first slot with WRITE_DATA packets and then use it in a draw command.

 As soon as any of the descriptors is about to change we copy it's content to
 the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary
 updates using WRITE_DATA and then use it in a draw command.

 This repeats over and over again, all we need to make sure is that we have
 enough slots in the ring to be sure that we never override descriptors when
 they are still in use, but I'm pretty sure that we should be on the save
 side with 16 or so.

 We can even prepare the commands for the switch from one slot to the next
 only once and then use it for the whole lifetime of the driver.

 Christian.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-10 Thread Christian König

Am 10.08.2013 15:53, schrieb Marek Olšák:

The RCU approach sounds good, but you can never know if 16 is enough.
We should release the buffer once it is full and allocate a new one.
The cache bufmgr in the winsys will assure there won't be any buffer
allocation overhead - it would work kinda a like a ring of buffers.


Are you sure of that? The overhead of allocating a new buffer was what 
always looked so unfriendly to me with this approach.


On the other hand the CP definitely can't handle more than 8 contexts at 
the same time (and one of them is always the clear context), so I 
strongly think we should be on the save side with 16 slots here. I'm 
just not sure if the SQ could add some more depth to our pipeline, maybe 
Alex knows more on this.


Christian.


Marek

On Sat, Aug 10, 2013 at 10:45 AM, Christian König
deathsim...@vodafone.de wrote:

Am 09.08.2013 20:06, schrieb Marek Olšák:

[SNIP]

What if I kept the current emission code, and only allocated a new
buffer at the end of the emit function, copied all descriptors to it
using CP_DMA or COPY_DATA, and pointed SPI_SHADER_USER_DATA to it. The
buffer where the descriptors are updated using WRITE_DATA would act as
a staging buffer only and shaders would always read from the fresh new
copy. Does that sound good to you?


That sounds like the solution with multiple buffers I already suggest, but I
would rather use some RCU approach to it. Basically we just have to handle
it like the context based resources on earlier asics. So to me a proper
solution should look something like this:

We allocate a ring of (let's say) 16 slots for descriptor arrays, fill the
first slot with WRITE_DATA packets and then use it in a draw command.

As soon as any of the descriptors is about to change we copy it's content to
the next slot, let the SPI_SHADER_USER_DATA point to it, make the necessary
updates using WRITE_DATA and then use it in a draw command.

This repeats over and over again, all we need to make sure is that we have
enough slots in the ring to be sure that we never override descriptors when
they are still in use, but I'm pretty sure that we should be on the save
side with 16 or so.

We can even prepare the commands for the switch from one slot to the next
only once and then use it for the whole lifetime of the driver.

Christian.


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-10 Thread Marek Olšák
On Sat, Aug 10, 2013 at 5:09 PM, Christian König
deathsim...@vodafone.de wrote:
 Am 10.08.2013 15:53, schrieb Marek Olšák:

 The RCU approach sounds good, but you can never know if 16 is enough.
 We should release the buffer once it is full and allocate a new one.
 The cache bufmgr in the winsys will assure there won't be any buffer
 allocation overhead - it would work kinda a like a ring of buffers.


 Are you sure of that? The overhead of allocating a new buffer was what
 always looked so unfriendly to me with this approach.

Absolutely. We've had this optimization since 2010 or so. As long as
you use winsys-buffer_create(use_reusable_pool=TRUE), you're fine. A
deleted buffer is added to a list of deleted buffers and if nobody
reclaims it, it will be released after 1 second. During that 1 second,
the buffer must first become idle and then anybody can reclaim it with
the same buffer_create call. If the buffer had been mapped before,
it's still mapped, so there is even no map-buffer overhead. This is a
crucial part of our driver stack that makes vertex uploading so
efficient. Don't underestimate the Radeon winsys. :)



 On the other hand the CP definitely can't handle more than 8 contexts at the
 same time (and one of them is always the clear context), so I strongly think
 we should be on the save side with 16 slots here. I'm just not sure if the
 SQ could add some more depth to our pipeline, maybe Alex knows more on this.

If we needed only 16 slots, that would be even better.

Marek
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-09 Thread Christian König

Am 08.08.2013 21:38, schrieb Alex Deucher:

On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote:

On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote:

Am 08.08.2013 16:33, schrieb Marek Olšák:

On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de
wrote:

Am 08.08.2013 14:38, schrieb Marek Olšák:


.On Thu, Aug 8, 2013 at 9:47 AM, Christian König
deathsim...@vodafone.de wrote:

Am 08.08.2013 02:20, schrieb Marek Olšák:


FMASK is bound as a separate texture. For every texture, there can be
an FMASK. Therefore a separate array of resource slots has to be
added.

This adds a new mechanism for emitting resource descriptors, its
features
are:
- resource descriptors are stored in an ordinary buffer (not in a CS)


Having resource descriptors outside of the CS has two problems that we
need
to solve first:

1. Fine grained descriptor updates doesn't work, I already tried that.
The
problem is that unlike previous asics descriptors are now a memory
block,
so
no longer part of the CP context. So when we (for example) have a draw
command executing and the next draw command is using new resources for
a
specific slot we would either block until the first draw command is
finished
(which is bad for performance) or change the descriptors while they are
still in use (which results in VM faults).

So what would the proper solution be here? Do I need to flush some
caches or would moving the descriptor updates to the constant IB fix
that?


Actually the current implementation worked better than anything else I
tried.

When you really need the resource descriptors in a separate buffer you
need
to use one buffer for each draw call and always write the full buffer
contents (no partial updates). Flushing anything won't really help
either..

The only solution I see using one buffer is to block until the last draw
call is finished with WAIT_REG_MEM, but that would be quite disastrous
for
performance.



2. If my understand is correct when they are embedded the descriptors
are
preloaded into the caches while executing the IB, so to archive the
same
speed with descriptors outside of the IB you need to add additional
commands
to the constant IB which is new to SI and we currently doesn't support
in
the CS interface.

There seems to be support for the constant IB. The CS ioctl chunk ID
is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
si_vm_packet3_ce_check. Is there anything missing?


The userspace side seems to be missing and except for throwing NOP
packets
into it we never tested it. I know from the closed source side that it
actually was quite tricky for them to get working.

Additional to that please note that I'm not 100% sure that just putting
the
descriptors into the IB is really helping here. It was just the most
simplest solution to avoid allocating a new buffer on each draw call.

I understand. I don't really need to have resource descriptors in a
separate buffer, all I need is these 3 basic features a gallium driver
should support:
- fine-grained resource updates (mainly for performance, see below)
- ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
- no GPU crash if a shader is using SAMPLER[15] but there are no samplers
bound

FYI, partial sampler view and sampler state updates are coming to
gallium, Brian Paul already has some patches, it's just a matter of
time now. Vertex and constant buffer states already support partial
updates.


That shouldn't be to much off a problem.

Just allocate a state at startup and initialize it with the proper pm4
commands for 16 samplers, then update the resource descriptors in that state
when we change the bound textures/samplers/views/constants/whatever. All we
need to do then is setting the emitted state to NULL so that it gets
re-emitted in the next draw command.

That would re-emit all 16 shader resources even if just one of them
needs to be changed. I was trying to avoid this inefficiency. Is it
really impossible to emit just one resource descriptor and keep the
others unchanged? This is a basic D3D10/11 feature, for example:

void ID3D11DeviceContext::VSSetShaderResources(
   [in]  UINT StartSlot,
   [in]  UINT NumViews,
   [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
);

If the constant engine is required to implement this interface
efficiently, then I'd like to work on constant IB support.

You'll need to either store them in memory or re-emit them if you
store them in the IB.  The CE is mainly there so that it can prime the
TC in parallel with the command stream processing.


Yeah indeed. The CE is just for prefetching everything into caches and 
doesn't really help here.


The only two options I see is either fully emitting it into the command 
stream whenever anything changes or allocating a new buffer for the 
resources on each new draw call, copying over the old state and then 
setting just the things that changed. Both options have their pro and 
cons, no idea what 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-09 Thread Marek Olšák
On Fri, Aug 9, 2013 at 10:34 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 08.08.2013 21:38, schrieb Alex Deucher:

 On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote:

 On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 16:33, schrieb Marek Olšák:

 On Thu, Aug 8, 2013 at 3:09 PM, Christian König
 deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 14:38, schrieb Marek Olšák:

 .On Thu, Aug 8, 2013 at 9:47 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there can
 be
 an FMASK. Therefore a separate array of resource slots has to be
 added.

 This adds a new mechanism for emitting resource descriptors, its
 features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a
 CS)


 Having resource descriptors outside of the CS has two problems that
 we
 need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried
 that.
 The
 problem is that unlike previous asics descriptors are now a memory
 block,
 so
 no longer part of the CP context. So when we (for example) have a
 draw
 command executing and the next draw command is using new resources
 for
 a
 specific slot we would either block until the first draw command is
 finished
 (which is bad for performance) or change the descriptors while they
 are
 still in use (which results in VM faults).

 So what would the proper solution be here? Do I need to flush some
 caches or would moving the descriptor updates to the constant IB fix
 that?


 Actually the current implementation worked better than anything else I
 tried.

 When you really need the resource descriptors in a separate buffer you
 need
 to use one buffer for each draw call and always write the full buffer
 contents (no partial updates). Flushing anything won't really help
 either..

 The only solution I see using one buffer is to block until the last
 draw
 call is finished with WAIT_REG_MEM, but that would be quite disastrous
 for
 performance.


 2. If my understand is correct when they are embedded the
 descriptors
 are
 preloaded into the caches while executing the IB, so to archive the
 same
 speed with descriptors outside of the IB you need to add additional
 commands
 to the constant IB which is new to SI and we currently doesn't
 support
 in
 the CS interface.

 There seems to be support for the constant IB. The CS ioctl chunk ID
 is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
 si_vm_packet3_ce_check. Is there anything missing?


 The userspace side seems to be missing and except for throwing NOP
 packets
 into it we never tested it. I know from the closed source side that it
 actually was quite tricky for them to get working.

 Additional to that please note that I'm not 100% sure that just
 putting
 the
 descriptors into the IB is really helping here. It was just the most
 simplest solution to avoid allocating a new buffer on each draw call.

 I understand. I don't really need to have resource descriptors in a
 separate buffer, all I need is these 3 basic features a gallium driver
 should support:
 - fine-grained resource updates (mainly for performance, see below)
 - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
 - no GPU crash if a shader is using SAMPLER[15] but there are no
 samplers
 bound

 FYI, partial sampler view and sampler state updates are coming to
 gallium, Brian Paul already has some patches, it's just a matter of
 time now. Vertex and constant buffer states already support partial
 updates.


 That shouldn't be to much off a problem.

 Just allocate a state at startup and initialize it with the proper pm4
 commands for 16 samplers, then update the resource descriptors in that
 state
 when we change the bound textures/samplers/views/constants/whatever. All
 we
 need to do then is setting the emitted state to NULL so that it gets
 re-emitted in the next draw command.

 That would re-emit all 16 shader resources even if just one of them
 needs to be changed. I was trying to avoid this inefficiency. Is it
 really impossible to emit just one resource descriptor and keep the
 others unchanged? This is a basic D3D10/11 feature, for example:

 void ID3D11DeviceContext::VSSetShaderResources(
[in]  UINT StartSlot,
[in]  UINT NumViews,
[in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
 );

 If the constant engine is required to implement this interface
 efficiently, then I'd like to work on constant IB support.

 You'll need to either store them in memory or re-emit them if you
 store them in the IB.  The CE is mainly there so that it can prime the
 TC in parallel with the command stream processing.


 Yeah indeed. The CE is just for prefetching everything into caches and
 doesn't really help here.

 The only two options I see is either fully emitting it into the command
 stream whenever anything changes or 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-09 Thread Christian König

Am 09.08.2013 15:29, schrieb Marek Olšák:

On Fri, Aug 9, 2013 at 10:34 AM, Christian König
deathsim...@vodafone.de wrote:

Am 08.08.2013 21:38, schrieb Alex Deucher:


On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote:

On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de
wrote:

Am 08.08.2013 16:33, schrieb Marek Olšák:

On Thu, Aug 8, 2013 at 3:09 PM, Christian König
deathsim...@vodafone.de
wrote:

Am 08.08.2013 14:38, schrieb Marek Olšák:


.On Thu, Aug 8, 2013 at 9:47 AM, Christian König
deathsim...@vodafone.de wrote:

Am 08.08.2013 02:20, schrieb Marek Olšák:


FMASK is bound as a separate texture. For every texture, there can
be
an FMASK. Therefore a separate array of resource slots has to be
added.

This adds a new mechanism for emitting resource descriptors, its
features
are:
- resource descriptors are stored in an ordinary buffer (not in a
CS)


Having resource descriptors outside of the CS has two problems that
we
need
to solve first:

1. Fine grained descriptor updates doesn't work, I already tried
that.
The
problem is that unlike previous asics descriptors are now a memory
block,
so
no longer part of the CP context. So when we (for example) have a
draw
command executing and the next draw command is using new resources
for
a
specific slot we would either block until the first draw command is
finished
(which is bad for performance) or change the descriptors while they
are
still in use (which results in VM faults).

So what would the proper solution be here? Do I need to flush some
caches or would moving the descriptor updates to the constant IB fix
that?


Actually the current implementation worked better than anything else I
tried.

When you really need the resource descriptors in a separate buffer you
need
to use one buffer for each draw call and always write the full buffer
contents (no partial updates). Flushing anything won't really help
either..

The only solution I see using one buffer is to block until the last
draw
call is finished with WAIT_REG_MEM, but that would be quite disastrous
for
performance.



2. If my understand is correct when they are embedded the
descriptors
are
preloaded into the caches while executing the IB, so to archive the
same
speed with descriptors outside of the IB you need to add additional
commands
to the constant IB which is new to SI and we currently doesn't
support
in
the CS interface.

There seems to be support for the constant IB. The CS ioctl chunk ID
is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
si_vm_packet3_ce_check. Is there anything missing?


The userspace side seems to be missing and except for throwing NOP
packets
into it we never tested it. I know from the closed source side that it
actually was quite tricky for them to get working.

Additional to that please note that I'm not 100% sure that just
putting
the
descriptors into the IB is really helping here. It was just the most
simplest solution to avoid allocating a new buffer on each draw call..

I understand. I don't really need to have resource descriptors in a
separate buffer, all I need is these 3 basic features a gallium driver
should support:
- fine-grained resource updates (mainly for performance, see below)
- ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
- no GPU crash if a shader is using SAMPLER[15] but there are no
samplers
bound

FYI, partial sampler view and sampler state updates are coming to
gallium, Brian Paul already has some patches, it's just a matter of
time now. Vertex and constant buffer states already support partial
updates.


That shouldn't be to much off a problem.

Just allocate a state at startup and initialize it with the proper pm4
commands for 16 samplers, then update the resource descriptors in that
state
when we change the bound textures/samplers/views/constants/whatever. All
we
need to do then is setting the emitted state to NULL so that it gets
re-emitted in the next draw command.

That would re-emit all 16 shader resources even if just one of them
needs to be changed. I was trying to avoid this inefficiency. Is it
really impossible to emit just one resource descriptor and keep the
others unchanged? This is a basic D3D10/11 feature, for example:

void ID3D11DeviceContext::VSSetShaderResources(
[in]  UINT StartSlot,
[in]  UINT NumViews,
[in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
);

If the constant engine is required to implement this interface
efficiently, then I'd like to work on constant IB support.

You'll need to either store them in memory or re-emit them if you
store them in the IB.  The CE is mainly there so that it can prime the
TC in parallel with the command stream processing.


Yeah indeed. The CE is just for prefetching everything into caches and
doesn't really help here.

The only two options I see is either fully emitting it into the command
stream whenever anything changes or allocating a new buffer for the
resources on each new draw call, 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-09 Thread Marek Olšák
On Fri, Aug 9, 2013 at 3:48 PM, Christian König deathsim...@vodafone.de wrote:
 Am 09.08.2013 15:29, schrieb Marek Olšák:

 On Fri, Aug 9, 2013 at 10:34 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 21:38, schrieb Alex Deucher:

 On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote:

 On Thu, Aug 8, 2013 at 6:57 PM, Christian König
 deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 16:33, schrieb Marek Olšák:

 On Thu, Aug 8, 2013 at 3:09 PM, Christian König
 deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 14:38, schrieb Marek Olšák:

 .On Thu, Aug 8, 2013 at 9:47 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there
 can
 be
 an FMASK. Therefore a separate array of resource slots has to be
 added.

 This adds a new mechanism for emitting resource descriptors, its
 features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a
 CS)


 Having resource descriptors outside of the CS has two problems
 that
 we
 need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried
 that.
 The
 problem is that unlike previous asics descriptors are now a memory
 block,
 so
 no longer part of the CP context. So when we (for example) have a
 draw
 command executing and the next draw command is using new resources
 for
 a
 specific slot we would either block until the first draw command
 is
 finished
 (which is bad for performance) or change the descriptors while
 they
 are
 still in use (which results in VM faults).

 So what would the proper solution be here? Do I need to flush some
 caches or would moving the descriptor updates to the constant IB
 fix
 that?


 Actually the current implementation worked better than anything else
 I
 tried.

 When you really need the resource descriptors in a separate buffer
 you
 need
 to use one buffer for each draw call and always write the full
 buffer
 contents (no partial updates). Flushing anything won't really help
 either..

 The only solution I see using one buffer is to block until the last
 draw
 call is finished with WAIT_REG_MEM, but that would be quite
 disastrous
 for
 performance.


 2. If my understand is correct when they are embedded the
 descriptors
 are
 preloaded into the caches while executing the IB, so to archive
 the
 same
 speed with descriptors outside of the IB you need to add
 additional
 commands
 to the constant IB which is new to SI and we currently doesn't
 support
 in
 the CS interface.

 There seems to be support for the constant IB. The CS ioctl chunk
 ID
 is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
 si_vm_packet3_ce_check. Is there anything missing?


 The userspace side seems to be missing and except for throwing NOP
 packets
 into it we never tested it. I know from the closed source side that
 it
 actually was quite tricky for them to get working.

 Additional to that please note that I'm not 100% sure that just
 putting
 the
 descriptors into the IB is really helping here. It was just the most
 simplest solution to avoid allocating a new buffer on each draw
 call..

 I understand. I don't really need to have resource descriptors in a
 separate buffer, all I need is these 3 basic features a gallium
 driver
 should support:
 - fine-grained resource updates (mainly for performance, see below)
 - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
 - no GPU crash if a shader is using SAMPLER[15] but there are no
 samplers
 bound

 FYI, partial sampler view and sampler state updates are coming to
 gallium, Brian Paul already has some patches, it's just a matter of
 time now. Vertex and constant buffer states already support partial
 updates.


 That shouldn't be to much off a problem.

 Just allocate a state at startup and initialize it with the proper pm4
 commands for 16 samplers, then update the resource descriptors in that
 state
 when we change the bound textures/samplers/views/constants/whatever.
 All
 we
 need to do then is setting the emitted state to NULL so that it gets
 re-emitted in the next draw command.

 That would re-emit all 16 shader resources even if just one of them
 needs to be changed. I was trying to avoid this inefficiency. Is it
 really impossible to emit just one resource descriptor and keep the
 others unchanged? This is a basic D3D10/11 feature, for example:

 void ID3D11DeviceContext::VSSetShaderResources(
 [in]  UINT StartSlot,
 [in]  UINT NumViews,
 [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
 );

 If the constant engine is required to implement this interface
 efficiently, then I'd like to work on constant IB support.

 You'll need to either store them in memory or re-emit them if you
 store them in the IB.  The CE is mainly there so that it can prime the
 TC in parallel with the command stream processing.


 Yeah indeed. The CE is just for prefetching everything into 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Christian König

Am 08.08.2013 02:20, schrieb Marek Olšák:

FMASK is bound as a separate texture. For every texture, there can be
an FMASK. Therefore a separate array of resource slots has to be added.

This adds a new mechanism for emitting resource descriptors, its features are:
- resource descriptors are stored in an ordinary buffer (not in a CS)


Having resource descriptors outside of the CS has two problems that we 
need to solve first:


1. Fine grained descriptor updates doesn't work, I already tried that. 
The problem is that unlike previous asics descriptors are now a memory 
block, so no longer part of the CP context. So when we (for example) 
have a draw command executing and the next draw command is using new 
resources for a specific slot we would either block until the first draw 
command is finished (which is bad for performance) or change the 
descriptors while they are still in use (which results in VM faults).


2. If my understand is correct when they are embedded the descriptors 
are preloaded into the caches while executing the IB, so to archive the 
same speed with descriptors outside of the IB you need to add additional 
commands to the constant IB which is new to SI and we currently doesn't 
support in the CS interface.


Regards,
Christian.


- descriptors of disabled resources are set to zeros
- fine-grained resource updates (it can update one resource slot while not
   touching the other slots)
- updates are done with the WRITE_DATA packet
- it implements the si_atom interface for packet emission
- only used for FMASK textures right now

The primary motivation for this is that FMASK textures naturally need
fine-grained resource updates and I also need to query in the shader
if a resource is NULL.
---
  src/gallium/drivers/radeonsi/Makefile.sources  |   1 +
  src/gallium/drivers/radeonsi/r600_hw_context.c |   3 +
  src/gallium/drivers/radeonsi/r600_resource.h   |   1 +
  src/gallium/drivers/radeonsi/r600_texture.c|   1 +
  src/gallium/drivers/radeonsi/radeonsi_pipe.c   |   9 +-
  src/gallium/drivers/radeonsi/radeonsi_pipe.h   |   6 +-
  src/gallium/drivers/radeonsi/radeonsi_pm4.c|   7 +
  src/gallium/drivers/radeonsi/radeonsi_pm4.h|   2 +
  src/gallium/drivers/radeonsi/si_descriptors.c  | 188 +
  src/gallium/drivers/radeonsi/si_state.c|  58 +++-
  src/gallium/drivers/radeonsi/si_state.h|  36 +
  11 files changed, 305 insertions(+), 7 deletions(-)
  create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c

diff --git a/src/gallium/drivers/radeonsi/Makefile.sources 
b/src/gallium/drivers/radeonsi/Makefile.sources
index b3ffa72..68c8282 100644
--- a/src/gallium/drivers/radeonsi/Makefile.sources
+++ b/src/gallium/drivers/radeonsi/Makefile.sources
@@ -10,6 +10,7 @@ C_SOURCES := \
r600_translate.c \
radeonsi_pm4.c \
radeonsi_compute.c \
+   si_descriptors.c \
si_state.c \
si_state_streamout.c \
si_state_draw.c \
diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c 
b/src/gallium/drivers/radeonsi/r600_hw_context.c
index 7ed7496..b595477 100644
--- a/src/gallium/drivers/radeonsi/r600_hw_context.c
+++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
@@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx, unsigned 
flags)
 * next draw command
 */
si_pm4_reset_emitted(ctx);
+
+   si_sampler_views_begin_new_cs(ctx, 
ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]);
+   si_sampler_views_begin_new_cs(ctx, 
ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]);
  }
  
  void si_context_emit_fence(struct r600_context *ctx, struct si_resource *fence_bo, unsigned offset, unsigned value)

diff --git a/src/gallium/drivers/radeonsi/r600_resource.h 
b/src/gallium/drivers/radeonsi/r600_resource.h
index e5dd36a..ab5c7b7 100644
--- a/src/gallium/drivers/radeonsi/r600_resource.h
+++ b/src/gallium/drivers/radeonsi/r600_resource.h
@@ -44,6 +44,7 @@ struct r600_fmask_info {
unsigned offset;
unsigned size;
unsigned alignment;
+   unsigned pitch;
unsigned bank_height;
unsigned slice_tile_max;
unsigned tile_mode_index;
diff --git a/src/gallium/drivers/radeonsi/r600_texture.c 
b/src/gallium/drivers/radeonsi/r600_texture.c
index cd3d1aa..b613564 100644
--- a/src/gallium/drivers/radeonsi/r600_texture.c
+++ b/src/gallium/drivers/radeonsi/r600_texture.c
@@ -463,6 +463,7 @@ static void r600_texture_get_fmask_info(struct r600_screen 
*rscreen,
out-slice_tile_max -= 1;
  
  	out-tile_mode_index = fmask.tiling_index[0];

+   out-pitch = fmask.level[0].nblk_x;
out-bank_height = fmask.bankh;
out-alignment = MAX2(256, fmask.bo_alignment);
out-size = fmask.bo_size;
diff --git a/src/gallium/drivers/radeonsi/radeonsi_pipe.c 
b/src/gallium/drivers/radeonsi/radeonsi_pipe.c
index ad955e3..3112124 100644
--- a/src/gallium/drivers/radeonsi/radeonsi_pipe.c
+++ 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Marek Olšák
.On Thu, Aug 8, 2013 at 9:47 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there can be
 an FMASK. Therefore a separate array of resource slots has to be added.

 This adds a new mechanism for emitting resource descriptors, its features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a CS)


 Having resource descriptors outside of the CS has two problems that we need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried that. The
 problem is that unlike previous asics descriptors are now a memory block, so
 no longer part of the CP context. So when we (for example) have a draw
 command executing and the next draw command is using new resources for a
 specific slot we would either block until the first draw command is finished
 (which is bad for performance) or change the descriptors while they are
 still in use (which results in VM faults).

So what would the proper solution be here? Do I need to flush some
caches or would moving the descriptor updates to the constant IB fix
that?


 2. If my understand is correct when they are embedded the descriptors are
 preloaded into the caches while executing the IB, so to archive the same
 speed with descriptors outside of the IB you need to add additional commands
 to the constant IB which is new to SI and we currently doesn't support in
 the CS interface.

There seems to be support for the constant IB. The CS ioctl chunk ID
is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
si_vm_packet3_ce_check. Is there anything missing?

Marek


 Regards,
 Christian.


 - descriptors of disabled resources are set to zeros
 - fine-grained resource updates (it can update one resource slot while not
touching the other slots)
 - updates are done with the WRITE_DATA packet
 - it implements the si_atom interface for packet emission
 - only used for FMASK textures right now

 The primary motivation for this is that FMASK textures naturally need
 fine-grained resource updates and I also need to query in the shader
 if a resource is NULL.
 ---
   src/gallium/drivers/radeonsi/Makefile.sources  |   1 +
   src/gallium/drivers/radeonsi/r600_hw_context.c |   3 +
   src/gallium/drivers/radeonsi/r600_resource.h   |   1 +
   src/gallium/drivers/radeonsi/r600_texture.c|   1 +
   src/gallium/drivers/radeonsi/radeonsi_pipe.c   |   9 +-
   src/gallium/drivers/radeonsi/radeonsi_pipe.h   |   6 +-
   src/gallium/drivers/radeonsi/radeonsi_pm4.c|   7 +
   src/gallium/drivers/radeonsi/radeonsi_pm4.h|   2 +
   src/gallium/drivers/radeonsi/si_descriptors.c  | 188
 +
   src/gallium/drivers/radeonsi/si_state.c|  58 +++-
   src/gallium/drivers/radeonsi/si_state.h|  36 +
   11 files changed, 305 insertions(+), 7 deletions(-)
   create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c

 diff --git a/src/gallium/drivers/radeonsi/Makefile.sources
 b/src/gallium/drivers/radeonsi/Makefile.sources
 index b3ffa72..68c8282 100644
 --- a/src/gallium/drivers/radeonsi/Makefile.sources
 +++ b/src/gallium/drivers/radeonsi/Makefile.sources
 @@ -10,6 +10,7 @@ C_SOURCES := \
 r600_translate.c \
 radeonsi_pm4.c \
 radeonsi_compute.c \
 +   si_descriptors.c \
 si_state.c \
 si_state_streamout.c \
 si_state_draw.c \
 diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c
 b/src/gallium/drivers/radeonsi/r600_hw_context.c
 index 7ed7496..b595477 100644
 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c
 +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
 @@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx,
 unsigned flags)
  * next draw command
  */
 si_pm4_reset_emitted(ctx);
 +
 +   si_sampler_views_begin_new_cs(ctx,
 ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]);
 +   si_sampler_views_begin_new_cs(ctx,
 ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]);
   }
 void si_context_emit_fence(struct r600_context *ctx, struct
 si_resource *fence_bo, unsigned offset, unsigned value)
 diff --git a/src/gallium/drivers/radeonsi/r600_resource.h
 b/src/gallium/drivers/radeonsi/r600_resource.h
 index e5dd36a..ab5c7b7 100644
 --- a/src/gallium/drivers/radeonsi/r600_resource.h
 +++ b/src/gallium/drivers/radeonsi/r600_resource.h
 @@ -44,6 +44,7 @@ struct r600_fmask_info {
 unsigned offset;
 unsigned size;
 unsigned alignment;
 +   unsigned pitch;
 unsigned bank_height;
 unsigned slice_tile_max;
 unsigned tile_mode_index;
 diff --git a/src/gallium/drivers/radeonsi/r600_texture.c
 b/src/gallium/drivers/radeonsi/r600_texture.c
 index cd3d1aa..b613564 100644
 --- a/src/gallium/drivers/radeonsi/r600_texture.c
 +++ b/src/gallium/drivers/radeonsi/r600_texture.c
 @@ -463,6 +463,7 @@ static void r600_texture_get_fmask_info(struct
 r600_screen *rscreen,

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Christian König

Am 08.08.2013 14:38, schrieb Marek Olšák:

.On Thu, Aug 8, 2013 at 9:47 AM, Christian König
deathsim...@vodafone.de wrote:

Am 08.08.2013 02:20, schrieb Marek Olšák:


FMASK is bound as a separate texture. For every texture, there can be
an FMASK. Therefore a separate array of resource slots has to be added.

This adds a new mechanism for emitting resource descriptors, its features
are:
- resource descriptors are stored in an ordinary buffer (not in a CS)


Having resource descriptors outside of the CS has two problems that we need
to solve first:

1. Fine grained descriptor updates doesn't work, I already tried that. The
problem is that unlike previous asics descriptors are now a memory block, so
no longer part of the CP context. So when we (for example) have a draw
command executing and the next draw command is using new resources for a
specific slot we would either block until the first draw command is finished
(which is bad for performance) or change the descriptors while they are
still in use (which results in VM faults).

So what would the proper solution be here? Do I need to flush some
caches or would moving the descriptor updates to the constant IB fix
that?


Actually the current implementation worked better than anything else I 
tried.


When you really need the resource descriptors in a separate buffer you 
need to use one buffer for each draw call and always write the full 
buffer contents (no partial updates). Flushing anything won't really 
help either. The only solution I see using one buffer is to block until 
the last draw call is finished with WAIT_REG_MEM, but that would be 
quite disastrous for performance.



2. If my understand is correct when they are embedded the descriptors are
preloaded into the caches while executing the IB, so to archive the same
speed with descriptors outside of the IB you need to add additional commands
to the constant IB which is new to SI and we currently doesn't support in
the CS interface.

There seems to be support for the constant IB. The CS ioctl chunk ID
is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
si_vm_packet3_ce_check. Is there anything missing?


The userspace side seems to be missing and except for throwing NOP 
packets into it we never tested it. I know from the closed source side 
that it actually was quite tricky for them to get working.


Additional to that please note that I'm not 100% sure that just putting 
the descriptors into the IB is really helping here. It was just the most 
simplest solution to avoid allocating a new buffer on each draw call.


Cheers,
Christian.


Marek


Regards,
Christian.



- descriptors of disabled resources are set to zeros
- fine-grained resource updates (it can update one resource slot while not
touching the other slots)
- updates are done with the WRITE_DATA packet
- it implements the si_atom interface for packet emission
- only used for FMASK textures right now

The primary motivation for this is that FMASK textures naturally need
fine-grained resource updates and I also need to query in the shader
if a resource is NULL.
---
   src/gallium/drivers/radeonsi/Makefile.sources  |   1 +
   src/gallium/drivers/radeonsi/r600_hw_context.c |   3 +
   src/gallium/drivers/radeonsi/r600_resource.h   |   1 +
   src/gallium/drivers/radeonsi/r600_texture.c|   1 +
   src/gallium/drivers/radeonsi/radeonsi_pipe.c   |   9 +-
   src/gallium/drivers/radeonsi/radeonsi_pipe.h   |   6 +-
   src/gallium/drivers/radeonsi/radeonsi_pm4.c|   7 +
   src/gallium/drivers/radeonsi/radeonsi_pm4.h|   2 +
   src/gallium/drivers/radeonsi/si_descriptors.c  | 188
+
   src/gallium/drivers/radeonsi/si_state.c|  58 +++-
   src/gallium/drivers/radeonsi/si_state.h|  36 +
   11 files changed, 305 insertions(+), 7 deletions(-)
   create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c

diff --git a/src/gallium/drivers/radeonsi/Makefile.sources
b/src/gallium/drivers/radeonsi/Makefile.sources
index b3ffa72..68c8282 100644
--- a/src/gallium/drivers/radeonsi/Makefile.sources
+++ b/src/gallium/drivers/radeonsi/Makefile.sources
@@ -10,6 +10,7 @@ C_SOURCES := \
 r600_translate.c \
 radeonsi_pm4.c \
 radeonsi_compute.c \
+   si_descriptors.c \
 si_state.c \
 si_state_streamout.c \
 si_state_draw.c \
diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c
b/src/gallium/drivers/radeonsi/r600_hw_context.c
index 7ed7496..b595477 100644
--- a/src/gallium/drivers/radeonsi/r600_hw_context.c
+++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
@@ -289,6 +289,9 @@ void si_context_flush(struct r600_context *ctx,
unsigned flags)
  * next draw command
  */
 si_pm4_reset_emitted(ctx);
+
+   si_sampler_views_begin_new_cs(ctx,
ctx-fmask_sampler_views[PIPE_SHADER_VERTEX]);
+   si_sampler_views_begin_new_cs(ctx,
ctx-fmask_sampler_views[PIPE_SHADER_FRAGMENT]);
   }
 void 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Marek Olšák
On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote:
 Am 08.08.2013 14:38, schrieb Marek Olšák:

 .On Thu, Aug 8, 2013 at 9:47 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there can be
 an FMASK. Therefore a separate array of resource slots has to be added.

 This adds a new mechanism for emitting resource descriptors, its
 features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a CS)


 Having resource descriptors outside of the CS has two problems that we
 need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried that.
 The
 problem is that unlike previous asics descriptors are now a memory block,
 so
 no longer part of the CP context. So when we (for example) have a draw
 command executing and the next draw command is using new resources for a
 specific slot we would either block until the first draw command is
 finished
 (which is bad for performance) or change the descriptors while they are
 still in use (which results in VM faults).

 So what would the proper solution be here? Do I need to flush some
 caches or would moving the descriptor updates to the constant IB fix
 that?


 Actually the current implementation worked better than anything else I
 tried.

 When you really need the resource descriptors in a separate buffer you need
 to use one buffer for each draw call and always write the full buffer
 contents (no partial updates). Flushing anything won't really help either.
 The only solution I see using one buffer is to block until the last draw
 call is finished with WAIT_REG_MEM, but that would be quite disastrous for
 performance.


 2. If my understand is correct when they are embedded the descriptors are
 preloaded into the caches while executing the IB, so to archive the same
 speed with descriptors outside of the IB you need to add additional
 commands
 to the constant IB which is new to SI and we currently doesn't support in
 the CS interface.

 There seems to be support for the constant IB. The CS ioctl chunk ID
 is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
 si_vm_packet3_ce_check. Is there anything missing?


 The userspace side seems to be missing and except for throwing NOP packets
 into it we never tested it. I know from the closed source side that it
 actually was quite tricky for them to get working.

 Additional to that please note that I'm not 100% sure that just putting the
 descriptors into the IB is really helping here. It was just the most
 simplest solution to avoid allocating a new buffer on each draw call.

I understand. I don't really need to have resource descriptors in a
separate buffer, all I need is these 3 basic features a gallium driver
should support:
- fine-grained resource updates (mainly for performance, see below)
- ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
- no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound

FYI, partial sampler view and sampler state updates are coming to
gallium, Brian Paul already has some patches, it's just a matter of
time now. Vertex and constant buffer states already support partial
updates.

Marek


 Cheers,
 Christian.

 Marek

 Regards,
 Christian.


 - descriptors of disabled resources are set to zeros
 - fine-grained resource updates (it can update one resource slot while
 not
 touching the other slots)
 - updates are done with the WRITE_DATA packet
 - it implements the si_atom interface for packet emission
 - only used for FMASK textures right now

 The primary motivation for this is that FMASK textures naturally need
 fine-grained resource updates and I also need to query in the shader
 if a resource is NULL.
 ---
src/gallium/drivers/radeonsi/Makefile.sources  |   1 +
src/gallium/drivers/radeonsi/r600_hw_context.c |   3 +
src/gallium/drivers/radeonsi/r600_resource.h   |   1 +
src/gallium/drivers/radeonsi/r600_texture.c|   1 +
src/gallium/drivers/radeonsi/radeonsi_pipe.c   |   9 +-
src/gallium/drivers/radeonsi/radeonsi_pipe.h   |   6 +-
src/gallium/drivers/radeonsi/radeonsi_pm4.c|   7 +
src/gallium/drivers/radeonsi/radeonsi_pm4.h|   2 +
src/gallium/drivers/radeonsi/si_descriptors.c  | 188
 +
src/gallium/drivers/radeonsi/si_state.c|  58 +++-
src/gallium/drivers/radeonsi/si_state.h|  36 +
11 files changed, 305 insertions(+), 7 deletions(-)
create mode 100644 src/gallium/drivers/radeonsi/si_descriptors.c

 diff --git a/src/gallium/drivers/radeonsi/Makefile.sources
 b/src/gallium/drivers/radeonsi/Makefile.sources
 index b3ffa72..68c8282 100644
 --- a/src/gallium/drivers/radeonsi/Makefile.sources
 +++ b/src/gallium/drivers/radeonsi/Makefile.sources
 @@ -10,6 +10,7 @@ C_SOURCES := \
  r600_translate.c \
  radeonsi_pm4.c \
  

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Christian König

Am 08.08.2013 16:33, schrieb Marek Olšák:

On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de wrote:

Am 08.08.2013 14:38, schrieb Marek Olšák:


.On Thu, Aug 8, 2013 at 9:47 AM, Christian König
deathsim...@vodafone.de wrote:

Am 08.08.2013 02:20, schrieb Marek Olšák:


FMASK is bound as a separate texture. For every texture, there can be
an FMASK. Therefore a separate array of resource slots has to be added.

This adds a new mechanism for emitting resource descriptors, its
features
are:
- resource descriptors are stored in an ordinary buffer (not in a CS)


Having resource descriptors outside of the CS has two problems that we
need
to solve first:

1. Fine grained descriptor updates doesn't work, I already tried that.
The
problem is that unlike previous asics descriptors are now a memory block,
so
no longer part of the CP context. So when we (for example) have a draw
command executing and the next draw command is using new resources for a
specific slot we would either block until the first draw command is
finished
(which is bad for performance) or change the descriptors while they are
still in use (which results in VM faults).

So what would the proper solution be here? Do I need to flush some
caches or would moving the descriptor updates to the constant IB fix
that?


Actually the current implementation worked better than anything else I
tried.

When you really need the resource descriptors in a separate buffer you need
to use one buffer for each draw call and always write the full buffer
contents (no partial updates). Flushing anything won't really help either..
The only solution I see using one buffer is to block until the last draw
call is finished with WAIT_REG_MEM, but that would be quite disastrous for
performance.



2. If my understand is correct when they are embedded the descriptors are
preloaded into the caches while executing the IB, so to archive the same
speed with descriptors outside of the IB you need to add additional
commands
to the constant IB which is new to SI and we currently doesn't support in
the CS interface.

There seems to be support for the constant IB. The CS ioctl chunk ID
is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
si_vm_packet3_ce_check. Is there anything missing?


The userspace side seems to be missing and except for throwing NOP packets
into it we never tested it. I know from the closed source side that it
actually was quite tricky for them to get working.

Additional to that please note that I'm not 100% sure that just putting the
descriptors into the IB is really helping here. It was just the most
simplest solution to avoid allocating a new buffer on each draw call.

I understand. I don't really need to have resource descriptors in a
separate buffer, all I need is these 3 basic features a gallium driver
should support:
- fine-grained resource updates (mainly for performance, see below)
- ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
- no GPU crash if a shader is using SAMPLER[15] but there are no samplers bound

FYI, partial sampler view and sampler state updates are coming to
gallium, Brian Paul already has some patches, it's just a matter of
time now. Vertex and constant buffer states already support partial
updates.


That shouldn't be to much off a problem.

Just allocate a state at startup and initialize it with the proper pm4 
commands for 16 samplers, then update the resource descriptors in that 
state when we change the bound 
textures/samplers/views/constants/whatever. All we need to do then is 
setting the emitted state to NULL so that it gets re-emitted in the next 
draw command.


Christian.


Marek


Cheers,
Christian.


Marek


Regards,
Christian.



- descriptors of disabled resources are set to zeros
- fine-grained resource updates (it can update one resource slot while
not
 touching the other slots)
- updates are done with the WRITE_DATA packet
- it implements the si_atom interface for packet emission
- only used for FMASK textures right now

The primary motivation for this is that FMASK textures naturally need
fine-grained resource updates and I also need to query in the shader
if a resource is NULL.
---
src/gallium/drivers/radeonsi/Makefile.sources  |   1 +
src/gallium/drivers/radeonsi/r600_hw_context.c |   3 +
src/gallium/drivers/radeonsi/r600_resource.h   |   1 +
src/gallium/drivers/radeonsi/r600_texture.c|   1 +
src/gallium/drivers/radeonsi/radeonsi_pipe.c   |   9 +-
src/gallium/drivers/radeonsi/radeonsi_pipe.h   |   6 +-
src/gallium/drivers/radeonsi/radeonsi_pm4.c|   7 +
src/gallium/drivers/radeonsi/radeonsi_pm4.h|   2 +
src/gallium/drivers/radeonsi/si_descriptors.c  | 188
+
src/gallium/drivers/radeonsi/si_state.c|  58 +++-
src/gallium/drivers/radeonsi/si_state.h|  36 +
11 files changed, 305 insertions(+), 7 deletions(-)
create mode 100644 

Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Marek Olšák
On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de wrote:
 Am 08.08.2013 16:33, schrieb Marek Olšák:

 On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 14:38, schrieb Marek Olšák:

 .On Thu, Aug 8, 2013 at 9:47 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there can be
 an FMASK. Therefore a separate array of resource slots has to be
 added.

 This adds a new mechanism for emitting resource descriptors, its
 features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a CS)


 Having resource descriptors outside of the CS has two problems that we
 need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried that.
 The
 problem is that unlike previous asics descriptors are now a memory
 block,
 so
 no longer part of the CP context. So when we (for example) have a draw
 command executing and the next draw command is using new resources for
 a
 specific slot we would either block until the first draw command is
 finished
 (which is bad for performance) or change the descriptors while they are
 still in use (which results in VM faults).

 So what would the proper solution be here? Do I need to flush some
 caches or would moving the descriptor updates to the constant IB fix
 that?


 Actually the current implementation worked better than anything else I
 tried.

 When you really need the resource descriptors in a separate buffer you
 need
 to use one buffer for each draw call and always write the full buffer
 contents (no partial updates). Flushing anything won't really help
 either..

 The only solution I see using one buffer is to block until the last draw
 call is finished with WAIT_REG_MEM, but that would be quite disastrous
 for
 performance.


 2. If my understand is correct when they are embedded the descriptors
 are
 preloaded into the caches while executing the IB, so to archive the
 same
 speed with descriptors outside of the IB you need to add additional
 commands
 to the constant IB which is new to SI and we currently doesn't support
 in
 the CS interface.

 There seems to be support for the constant IB. The CS ioctl chunk ID
 is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
 si_vm_packet3_ce_check. Is there anything missing?


 The userspace side seems to be missing and except for throwing NOP
 packets
 into it we never tested it. I know from the closed source side that it
 actually was quite tricky for them to get working.

 Additional to that please note that I'm not 100% sure that just putting
 the
 descriptors into the IB is really helping here. It was just the most
 simplest solution to avoid allocating a new buffer on each draw call.

 I understand. I don't really need to have resource descriptors in a
 separate buffer, all I need is these 3 basic features a gallium driver
 should support:
 - fine-grained resource updates (mainly for performance, see below)
 - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
 - no GPU crash if a shader is using SAMPLER[15] but there are no samplers
 bound

 FYI, partial sampler view and sampler state updates are coming to
 gallium, Brian Paul already has some patches, it's just a matter of
 time now. Vertex and constant buffer states already support partial
 updates.


 That shouldn't be to much off a problem.

 Just allocate a state at startup and initialize it with the proper pm4
 commands for 16 samplers, then update the resource descriptors in that state
 when we change the bound textures/samplers/views/constants/whatever. All we
 need to do then is setting the emitted state to NULL so that it gets
 re-emitted in the next draw command.

That would re-emit all 16 shader resources even if just one of them
needs to be changed. I was trying to avoid this inefficiency. Is it
really impossible to emit just one resource descriptor and keep the
others unchanged? This is a basic D3D10/11 feature, for example:

void ID3D11DeviceContext::VSSetShaderResources(
  [in]  UINT StartSlot,
  [in]  UINT NumViews,
  [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
);

If the constant engine is required to implement this interface
efficiently, then I'd like to work on constant IB support.

Marek
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 16/20] radeonsi: add FMASK texture binding slots and resource setup

2013-08-08 Thread Alex Deucher
On Thu, Aug 8, 2013 at 1:34 PM, Marek Olšák mar...@gmail.com wrote:
 On Thu, Aug 8, 2013 at 6:57 PM, Christian König deathsim...@vodafone.de 
 wrote:
 Am 08.08.2013 16:33, schrieb Marek Olšák:

 On Thu, Aug 8, 2013 at 3:09 PM, Christian König deathsim...@vodafone.de
 wrote:

 Am 08.08.2013 14:38, schrieb Marek Olšák:

 .On Thu, Aug 8, 2013 at 9:47 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 08.08.2013 02:20, schrieb Marek Olšák:

 FMASK is bound as a separate texture. For every texture, there can be
 an FMASK. Therefore a separate array of resource slots has to be
 added.

 This adds a new mechanism for emitting resource descriptors, its
 features
 are:
 - resource descriptors are stored in an ordinary buffer (not in a CS)


 Having resource descriptors outside of the CS has two problems that we
 need
 to solve first:

 1. Fine grained descriptor updates doesn't work, I already tried that.
 The
 problem is that unlike previous asics descriptors are now a memory
 block,
 so
 no longer part of the CP context. So when we (for example) have a draw
 command executing and the next draw command is using new resources for
 a
 specific slot we would either block until the first draw command is
 finished
 (which is bad for performance) or change the descriptors while they are
 still in use (which results in VM faults).

 So what would the proper solution be here? Do I need to flush some
 caches or would moving the descriptor updates to the constant IB fix
 that?


 Actually the current implementation worked better than anything else I
 tried.

 When you really need the resource descriptors in a separate buffer you
 need
 to use one buffer for each draw call and always write the full buffer
 contents (no partial updates). Flushing anything won't really help
 either..

 The only solution I see using one buffer is to block until the last draw
 call is finished with WAIT_REG_MEM, but that would be quite disastrous
 for
 performance.


 2. If my understand is correct when they are embedded the descriptors
 are
 preloaded into the caches while executing the IB, so to archive the
 same
 speed with descriptors outside of the IB you need to add additional
 commands
 to the constant IB which is new to SI and we currently doesn't support
 in
 the CS interface.

 There seems to be support for the constant IB. The CS ioctl chunk ID
 is RADEON_CHUNK_ID_CONST_IB and the allowed packets are listed in
 si_vm_packet3_ce_check. Is there anything missing?


 The userspace side seems to be missing and except for throwing NOP
 packets
 into it we never tested it. I know from the closed source side that it
 actually was quite tricky for them to get working.

 Additional to that please note that I'm not 100% sure that just putting
 the
 descriptors into the IB is really helping here. It was just the most
 simplest solution to avoid allocating a new buffer on each draw call.

 I understand. I don't really need to have resource descriptors in a
 separate buffer, all I need is these 3 basic features a gallium driver
 should support:
 - fine-grained resource updates (mainly for performance, see below)
 - ability to unbind resources (e.g. by setting IMG_RSRC_WORD1 to 0)
 - no GPU crash if a shader is using SAMPLER[15] but there are no samplers
 bound

 FYI, partial sampler view and sampler state updates are coming to
 gallium, Brian Paul already has some patches, it's just a matter of
 time now. Vertex and constant buffer states already support partial
 updates.


 That shouldn't be to much off a problem.

 Just allocate a state at startup and initialize it with the proper pm4
 commands for 16 samplers, then update the resource descriptors in that state
 when we change the bound textures/samplers/views/constants/whatever. All we
 need to do then is setting the emitted state to NULL so that it gets
 re-emitted in the next draw command.

 That would re-emit all 16 shader resources even if just one of them
 needs to be changed. I was trying to avoid this inefficiency. Is it
 really impossible to emit just one resource descriptor and keep the
 others unchanged? This is a basic D3D10/11 feature, for example:

 void ID3D11DeviceContext::VSSetShaderResources(
   [in]  UINT StartSlot,
   [in]  UINT NumViews,
   [in]  ID3D11ShaderResourceView *const *ppShaderResourceViews
 );

 If the constant engine is required to implement this interface
 efficiently, then I'd like to work on constant IB support.

You'll need to either store them in memory or re-emit them if you
store them in the IB.  The CE is mainly there so that it can prime the
TC in parallel with the command stream processing.

Alex
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev